AWS AI/ML Production Troubleshooting Reference
Critical Production Failures and Root Causes
SageMaker Training Job Mysterious Deaths
Failure Pattern: Jobs die at 85% completion with "UnexpectedStatusException"
Real Impact: Wastes GPU credits, no clear error indication from AWS
Root Causes:
- Out of memory during checkpointing (model grew larger than expected)
- S3 permissions changed mid-training (IAM policy modifications)
- Spot instance termination (2-minute warning, capacity reclaimed)
- Docker container timeout (infinite loops/deadlocks in training script)
Debug Strategy:
- Check CloudWatch Logs: /aws/sagemaker/TrainingJobs/[job-name]
- Search for: "ERROR", "Exception", "killed", "terminated"
- Monitor memory usage via nvidia-smi output in logs
- Verify S3 access permissions haven't changed
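A minimal CLI version of that log search, assuming the default SageMaker training log group (log streams are prefixed with the training job name):
JOB_NAME=your-training-job-name
aws logs filter-log-events \
  --log-group-name /aws/sagemaker/TrainingJobs \
  --log-stream-name-prefix "$JOB_NAME" \
  --filter-pattern "?ERROR ?Exception ?killed ?terminated"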
Production Monitoring Code:
import logging
import subprocess
import psutil

def debug_system_health():
    # Log host memory and disk pressure before a checkpoint can push the job over the edge
    memory = psutil.virtual_memory()
    disk = psutil.disk_usage('/')
    logging.info(f"Memory: {memory.percent}% used, {memory.available / (1024**3):.1f}GB available")
    logging.info(f"Disk: {disk.percent}% used, {disk.free / (1024**3):.1f}GB free")
    try:
        result = subprocess.run(
            ['nvidia-smi', '--query-gpu=memory.used,memory.total',
             '--format=csv,noheader,nounits'],
            capture_output=True, text=True)
        logging.info(f"GPU Memory: {result.stdout.strip()}")
    except (FileNotFoundError, OSError):
        # nvidia-smi is missing on CPU-only instances
        logging.info("GPU Memory: nvidia-smi not available")

# Call every 100 training steps inside your training loop:
if step % 100 == 0:
    debug_system_health()
Bedrock Rate Limiting During Critical Operations
Failure Pattern: "ThrottlingException: Rate exceeded" during demos/peak usage
Real Impact: 30-second user-facing failures, business presentation disasters
Regional Quota Variations:
- Claude 3.5 Sonnet: 400,000 tokens/min (us-west-2) vs 100,000 tokens/min (some regions)
- Nova Pro: 500,000 tokens/min (most regions)
- Document analysis with context: 5,000-10,000 tokens per request
Emergency Fix with Exponential Backoff:
import time
import random
from botocore.exceptions import ClientError

def bedrock_with_backoff(bedrock_client, **kwargs):
    max_retries = 5
    base_delay = 1
    for attempt in range(max_retries):
        try:
            return bedrock_client.invoke_model(**kwargs)
        except ClientError as e:
            if e.response['Error']['Code'] == 'ThrottlingException':
                if attempt == max_retries - 1:
                    raise
                # Exponential backoff plus jitter so retries don't re-synchronize
                delay = (base_delay * 2 ** attempt) + random.uniform(0, 1)
                print(f"Rate limited, waiting {delay:.1f}s before retry {attempt + 1}")
                time.sleep(delay)
            else:
                raise
Long-term Solution: Request quota increases 48-72 hours before critical events (AWS takes 1-3 business days)
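If you want to script the request, the Service Quotas CLI can file it. A hedged sketch; look up the real quota code first, the one below is a placeholder:
# Find the quota code for the model/region you care about
aws service-quotas list-service-quotas --service-code bedrock --region us-west-2

# File the increase request (QUOTA_CODE is a placeholder)
aws service-quotas request-service-quota-increase \
  --service-code bedrock \
  --quota-code QUOTA_CODE \
  --desired-value 400000 \
  --region us-west-2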
Inference Endpoint Random Timeouts
Failure Pattern: 30% of requests timeout after weeks of normal operation
User Impact: Service appears broken despite healthy metrics
Root Causes:
- Memory leaks in inference containers (gradual memory consumption)
- Auto-scaling cold starts (5+ minute initialization for new instances)
- Container health check failures (load balancer marking healthy instances unhealthy)
- Input validation hanging on malformed requests
Diagnostic Approach:
CloudWatch Metrics to Check:
- ModelLatency: increasing response times
- Invocation4XXErrors: client-side issues
- Invocation5XXErrors: server-side problems
- CPUUtilization and MemoryUtilization: resource exhaustion
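To pull these from the CLI, a sketch assuming the standard AWS/SageMaker namespace and a variant named AllTraffic (CPUUtilization and MemoryUtilization live in the /aws/sagemaker/Endpoints namespace instead):
aws cloudwatch get-metric-statistics \
  --namespace AWS/SageMaker \
  --metric-name ModelLatency \
  --dimensions Name=EndpointName,Value=your-endpoint-name Name=VariantName,Value=AllTraffic \
  --statistics Average Maximum \
  --period 300 \
  --start-time "$(date -u -d '3 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"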
Container Log Analysis:
aws logs filter-log-events \
--log-group-name /aws/sagemaker/Endpoints/your-endpoint-name \
--start-time 1693900800000 \
--filter-pattern "ERROR"
Quick Recovery Actions:
- Restart endpoint (containers sometimes become corrupted)
- Scale up instance count (better fault tolerance; see the scaling sketch after this list)
- Switch to larger instance type (may be hitting resource limits)
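For the scale-up step, a hedged one-liner, assuming a single production variant named AllTraffic:
aws sagemaker update-endpoint-weights-and-capacities \
  --endpoint-name your-endpoint-name \
  --desired-weights-and-capacities VariantName=AllTraffic,DesiredInstanceCount=3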
Nova Model Cold Start Performance Issues
User Experience Impact: 8-15 second delays perceived as application failure
User Abandonment: 3-second patience threshold before users think system is broken
Cold Start Triggers: 15-20 minute idle periods
Keep-Alive Strategy:
import time
import boto3
import schedule

bedrock = boto3.client('bedrock-runtime')

def keep_warm():
    try:
        # The Converse API keeps the ping independent of any model-specific body format
        # (the original anthropic_version payload doesn't apply to Nova models)
        bedrock.converse(
            modelId='amazon.nova-pro-v1:0',
            messages=[{"role": "user", "content": [{"text": "ping"}]}],
            inferenceConfig={"maxTokens": 10}
        )
        print("Keep-alive successful")
    except Exception as e:
        print(f"Keep-alive failed: {e}")

# Ping every 5 minutes during business hours
schedule.every(5).minutes.do(keep_warm)

while True:
    schedule.run_pending()
    time.sleep(1)
Async Processing Alternative:
import threading
import uuid

def process_document_async(document_id):
    # Return immediately; the caller polls for status using the job_id
    job_id = str(uuid.uuid4())
    threading.Thread(
        target=analyze_document_background,  # your long-running analysis worker
        args=(job_id, document_id),
        daemon=True
    ).start()
    return {"job_id": job_id, "status": "processing"}
Cross-Region Deployment Failure Patterns
Common Breaking Points:
- Model availability varies by region
- IAM role ARNs are global, but their trust and permission policies often reference region-specific resources
- S3 cross-region access denied errors
- VPC resources don't exist in target region
- KMS keys are region-specific
Multi-Region Validation Script:
# Verify model availability
aws bedrock list-foundation-models --region eu-central-1 \
--query 'modelSummaries[?contains(modelId, `claude`)]'
# Check IAM role exists (IAM is global, so no --region flag is needed)
aws iam get-role --role-name YourSageMakerRole
# Test S3 access
aws s3 ls s3://your-model-bucket --region eu-central-1
# Verify VPC resources
aws ec2 describe-subnets --region eu-central-1 --filters "Name=tag:Name,Values=ml-*"
Advanced Debugging Techniques
X-Ray Distributed Tracing Setup
Use Case: Multi-service AI workflows (Lambda → Bedrock → SageMaker → DynamoDB)
Cost: $5 per 1M traces
Value: Exact latency breakdown and error source identification
from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.core import patch_all
import boto3

patch_all()

@xray_recorder.capture('process_document')
def process_document(document_id):
    with xray_recorder.in_subsegment('bedrock_analysis'):
        bedrock = boto3.client('bedrock-runtime')
        response = bedrock.invoke_model(...)
    with xray_recorder.in_subsegment('sagemaker_inference'):
        sagemaker = boto3.client('sagemaker-runtime')
        result = sagemaker.invoke_endpoint(...)
    return result
CloudWatch Insights Production Queries
Memory Issues Detection:
fields @timestamp, @message
| filter @message like /memory/
| stats count(*) as memory_events by bin(5m), @message
| sort memory_events desc
Token Usage Cost Tracking:
fields @timestamp, @message
| parse @message "tokens: *" as token_count
| stats avg(token_count) as avg_tokens, max(token_count) as max_tokens, sum(token_count) as total_tokens by bin(1h)
| sort total_tokens desc
Rate Limiting Pattern Analysis:
fields @timestamp, @message
| filter @message like /ThrottlingException/ or @message like /429/
| stats count(*) as throttle_count by bin(1m)
| sort throttle_count desc
Cost-Optimized Monitoring
Daily Logging Budget Control:
import logging
import random

class CostOptimizedLogger:
    def __init__(self, max_daily_cost=10.00, sample_rate=0.1):
        self.max_daily_cost = max_daily_cost
        self.current_cost = 0.0
        self.sample_rate = sample_rate  # fraction of non-forced messages actually logged

    def smart_log(self, level, message, force=False):
        # Sample routine messages to stay inside the daily ingestion budget
        if not force and random.random() > self.sample_rate:
            return
        message_size = len(message.encode('utf-8'))
        estimated_cost = (message_size / (1024**3)) * 0.50  # $0.50/GB ingested
        if self.current_cost + estimated_cost > self.max_daily_cost:
            if level >= logging.ERROR:  # Always log errors, even over budget
                logging.log(level, f"[COST_LIMITED] {message}")
            return
        self.current_cost += estimated_cost
        logging.log(level, message)
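Typical usage with the defaults above:
logger = CostOptimizedLogger(max_daily_cost=10.00, sample_rate=0.1)
logger.smart_log(logging.INFO, "routine inference completed")                # sampled at 10%
logger.smart_log(logging.ERROR, "Bedrock ThrottlingException", force=True)   # never sampled out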
Production Incident Response Playbook
Immediate Triage (0-5 minutes)
- Check AWS Service Health - Don't debug if AWS has known issues
- Test basic connectivity to services
- Review recent deployments in CloudTrail (see the lookup sketch after this list)
- Verify service quotas aren't blocking requests
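A hedged CloudTrail sketch; UpdateEndpoint is just one example event name, swap in whatever your deployment pipeline actually calls:
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=UpdateEndpoint \
  --start-time $(date -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --max-results 20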
Emergency Debug Script:
#!/bin/bash
echo "=== SageMaker Status ==="
aws sagemaker list-training-jobs --status-equals InProgress --max-results 10
aws sagemaker list-endpoints --status-equals InService --max-results 10
echo "=== Recent Bedrock Errors ==="
aws logs filter-log-events \
--log-group-name "/aws/lambda/your-function" \
--start-time $(date -d '1 hour ago' +%s)000 \
--filter-pattern "ThrottlingException"
echo "=== CloudWatch Alarms ==="
aws cloudwatch describe-alarms --state-value ALARM --max-records 10
Service-Specific Debugging (5-15 minutes)
SageMaker Training Job Issues:
- Check /aws/sagemaker/TrainingJobs/[job-name] logs
- Search for "OutOfMemoryError", "killed", "terminated"
- Verify S3 bucket permissions unchanged
- Check spot instance termination notices
Bedrock Rate Limiting:
- Implement exponential backoff immediately
- Check regional quota differences
- Split large requests into smaller chunks (see the chunking sketch after this list)
- Enable keep-alive for frequently used models
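A minimal chunking sketch, assuming a rough 4-characters-per-token heuristic and the bedrock_with_backoff helper from earlier; build_request is a placeholder for whatever wraps a chunk into your model's request body:
def chunk_text(text, max_tokens=4000, chars_per_token=4):
    # Rough heuristic: ~4 characters per token keeps each request well under quota
    max_chars = max_tokens * chars_per_token
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def analyze_in_chunks(bedrock_client, document_text, build_request):
    results = []
    for chunk in chunk_text(document_text):
        results.append(bedrock_with_backoff(bedrock_client, **build_request(chunk)))
    return results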
Inference Endpoint Problems:
- Monitor ModelLatency and MemoryUtilization metrics
- Check container logs for memory leaks
- Test with minimal payload to isolate issue
- Scale up instances for immediate relief
Critical Warning Thresholds
Cost Alerts (Mandatory)
- Daily spend > $500: Catches runaway training jobs (budget sketch after this list)
- Hourly token usage > 2x baseline: Detects retry loops
- Real-time endpoint costs: $37/hour for GPU instances
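One way to wire the daily $500 alert is an AWS Budgets daily cost budget. A hedged sketch; the account ID and email are placeholders, and the JSON field names follow the Budgets API, so double-check them against your CLI version:
aws budgets create-budget \
  --account-id 123456789012 \
  --budget '{
    "BudgetName": "ml-daily-spend",
    "BudgetLimit": {"Amount": "500", "Unit": "USD"},
    "TimeUnit": "DAILY",
    "BudgetType": "COST"
  }' \
  --notifications-with-subscribers '[{
    "Notification": {
      "NotificationType": "ACTUAL",
      "ComparisonOperator": "GREATER_THAN",
      "Threshold": 100,
      "ThresholdType": "PERCENTAGE"
    },
    "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "oncall@example.com"}]
  }]'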
Performance Degradation Indicators
- Response time > 10 seconds: Users perceive as broken
- Error rate > 5%: Service reliability threshold
- Memory utilization > 90%: Imminent failure risk
- Token usage spike > 2x normal: Rate limiting incoming
Automated Root Cause Analysis
Error Pattern Detection:
import logging

class ErrorPatternDetector:
    def __init__(self):
        # Each entry: {'success': bool, 'timestamp': datetime, 'context': dict}
        self.error_history = []

    def analyze_error_patterns(self):
        recent_errors = [e for e in self.error_history[-100:] if not e['success']]
        if len(recent_errors) > 10:  # 10% error rate threshold
            patterns = self.find_patterns(recent_errors)
            if patterns:
                self.alert_ops_team(f"Error pattern detected: {patterns}")

    def find_patterns(self, errors):
        patterns = []
        # Time clustering: errors bunched into a few hours of the day
        error_hours = [e['timestamp'].hour for e in errors]
        if len(set(error_hours)) < 3:
            patterns.append(f"Errors concentrated in hours: {set(error_hours)}")
        # User-specific issues: one user generating a disproportionate share of errors
        error_users = [e['context'].get('user_id') for e in errors if e.get('context')]
        user_counts = {user: error_users.count(user) for user in set(error_users)}
        if user_counts and max(user_counts.values()) > 5:
            patterns.append(f"High error user: {max(user_counts, key=user_counts.get)}")
        return patterns

    def alert_ops_team(self, message):
        # Wire this to SNS, Slack, or your paging system
        logging.error(message)
Resource Requirements and Implementation Costs
Time Investment for Debugging Setup
- Basic CloudWatch monitoring: 2-4 hours initial setup
- X-Ray distributed tracing: 8-16 hours for full implementation
- Custom metrics and alerting: 16-24 hours for comprehensive coverage
- Automated root cause analysis: 40+ hours for production-ready system
Monthly Operational Costs
- CloudWatch Logs: $0.50/GB ingested + $0.03/GB stored
- CloudWatch Insights: $0.005 per query
- X-Ray tracing: $5.00 per 1M traces
- Custom metrics: $0.30 per metric per month
- Real-time endpoints: $37/hour for GPU instances (running 24/7)
Expertise Requirements
- Junior Engineers: Can implement basic CloudWatch logging and alerts
- Senior Engineers: Required for X-Ray setup, custom metrics, cost optimization
- Principal/Staff Engineers: Needed for automated root cause analysis, pattern detection
Breaking Points and Failure Modes
When CloudWatch Logs Fail You:
- Container never starts (no logs generated)
- IAM permissions block log delivery
- Log retention settings cause data loss
- Cost limits trigger log sampling
When X-Ray Becomes Unreliable:
- Code instrumentation incomplete
- Third-party services not supported
- Adds 1-5ms latency per traced request
- Complex async workflows break tracing
When Custom Monitoring Becomes Overhead:
- Metric cardinality explosion increases costs
- Alert fatigue from false positives
- Maintenance burden grows with system complexity
- Performance impact from excessive monitoring
Model Selection for Cost Optimization
Token-Efficient Model Selection:
def select_optimal_model(self, document):
    word_count = len(document.split())
    if word_count < 1000:
        return "amazon.nova-micro-v1:0"  # Cheapest for simple tasks
    elif word_count < 5000:
        return "amazon.nova-lite-v1:0"   # Balance of cost/capability
    else:
        return "amazon.nova-pro-v1:0"    # Premium for complex analysis
Parallel Processing for Performance:
import asyncio

async def process_documents_parallel(self, documents):
    tasks = []
    for doc in documents:
        task = asyncio.create_task(self.process_single_document(doc))
        tasks.append(task)
    # return_exceptions=True keeps one failed document from killing the whole batch
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return results
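From synchronous code, assuming processor is an instance of whatever class these methods live on:
results = asyncio.run(processor.process_documents_parallel(documents))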
Emergency Response Decision Matrix
Symptom | Immediate Action | Root Cause Investigation | Prevention |
---|---|---|---|
Training job dies with UnexpectedStatusException | Check CloudWatch logs for OOM/killed messages | Analyze memory usage patterns, S3 permissions | Implement system health monitoring every 100 steps |
Bedrock ThrottlingException during demo | Implement exponential backoff, switch regions | Check quota differences across regions | Request quota increases 72 hours before critical events |
Inference endpoint timeouts (30% requests) | Restart endpoint, scale up instances | Monitor memory leaks, health check failures | Set up proactive memory/CPU alerting |
Nova model 15+ second cold starts | Implement keep-alive pinging | Analyze usage patterns, idle timeout triggers | Async processing with status polling |
Cross-region deployment failures | Use CloudFormation StackSets | Validate IAM roles, VPC resources, model availability | Infrastructure as code for consistent deployments |
Cost spike overnight | Check Cost Explorer, stop runaway resources | Identify training job loops, billing alert gaps | Mandatory $500/day cost alerts |
Real-World Failure Examples
$30,000 Weekend Training Loop
What Happened: Hyperparameter tuning jobs failed and restarted in loop over long weekend
Cost Impact: $30,000+ in GPU credits before Monday discovery
Root Cause: No daily spend alerts configured
Prevention: Billing alerts for $500+ daily spend, weekend monitoring
Demo Disaster from Regional Routing
What Happened: Board presentation hit rate limits despite successful testing
Technical Cause: Testing in us-west-2 (400K tokens/min), demo routed to eu-central-1 (100K tokens/min)
Business Impact: Executive embarrassment, questions about technical competence
Fix: Always verify actual region in production, not just configured region
Silent Model Performance Degradation
What Happened: Production model accuracy dropped 15% over two weeks
Detection Method: Daily validation test suite
Root Cause: Undocumented AWS model update changed behavior
Recovery: Implemented daily validation tests, baseline comparison alerts
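A minimal sketch of that kind of daily check; the baseline, tolerance, scoring rule, and helper names are all assumptions you would replace with your own harness:
BASELINE_ACCURACY = 0.92   # measured while the model was known-good
TOLERANCE = 0.05           # alert if accuracy drops more than 5 points

def run_daily_validation(validation_cases, invoke_model, alert):
    # validation_cases: list of {"prompt": ..., "expected": ...}; invoke_model and alert are app-specific
    correct = 0
    for case in validation_cases:
        prediction = invoke_model(case["prompt"])
        if prediction.strip() == case["expected"].strip():
            correct += 1
    accuracy = correct / len(validation_cases)
    if accuracy < BASELINE_ACCURACY - TOLERANCE:
        alert(f"Model accuracy dropped to {accuracy:.2%} (baseline {BASELINE_ACCURACY:.2%})")
    return accuracy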
These incidents demonstrate that debugging AWS AI services requires both technical monitoring and business impact awareness. The difference between teams that survive production incidents and those that burn out is having systematic debugging approaches ready before everything catches fire.
Useful Links for Further Investigation
Official AWS Resources (The Ones That Don't Suck)
Link | Description |
---|---|
AWS Service Health Dashboard | Check this FIRST when everything breaks. Don't debug your code if AWS is having issues. |
SageMaker Troubleshooting Guide | Actually useful AWS documentation. The CloudWatch logs section will save hours. |
CloudWatch Logs Insights Query Examples | Copy-paste queries for common scenarios instead of writing from scratch. |
Stack Overflow - amazon-sagemaker tag | Real engineers solving real problems. Search here before filing support tickets. |
AWS ML Community Slack | Active community. The #troubleshooting channel has engineers who've debugged your exact problem. |
AWS Cost Explorer | Find what's burning money during outages. Filter by service and time range. |
AWS Personal Health Dashboard | Personalized service health notifications for resources you actually use. |