AWS AI/ML Production Debugging: AI-Optimized Reference
Critical Failure Patterns
SageMaker Training Job Failures
UnexpectedStatusException Pattern
- Primary Cause (90%): IAM role lacks S3 access permissions
- Failure Impact: Jobs fail with a generic UnexpectedStatusException whose message rarely points to the real cause
- Detection Time: Can waste hours before proper logs are found
- Fix Complexity: Low (10 minutes) if IAM issue, High (2+ hours) if VPC/networking
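Before digging through CloudWatch, pull the job's FailureReason directly; it often names the denied action outright. A minimal boto3 sketch (the job name is a placeholder):
import boto3

sm = boto3.client('sagemaker')
desc = sm.describe_training_job(TrainingJobName='your-job-name')  # placeholder job name
print(desc['TrainingJobStatus'])
print(desc.get('FailureReason', 'No FailureReason recorded, check CloudWatch logs'))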
Critical IAM Policy Requirements:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::your-bucket", "arn:aws:s3:::your-bucket/*"]
    },
    {
      "Effect": "Allow",
      "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
      "Resource": "*"
    }
  ]
}
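To confirm the execution role actually has these permissions before launching another job, the IAM policy simulator can be driven from boto3; a minimal sketch, assuming a placeholder role ARN and object key:
import boto3

iam = boto3.client('iam')
result = iam.simulate_principal_policy(
    PolicySourceArn='arn:aws:iam::123456789012:role/SageMakerExecutionRole',  # placeholder role
    ActionNames=['s3:GetObject', 's3:PutObject'],
    ResourceArns=['arn:aws:s3:::your-bucket/training/data.csv'],  # placeholder object
)
for evaluation in result['EvaluationResults']:
    print(evaluation['EvalActionName'], evaluation['EvalDecision'])  # expect 'allowed'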
Training Jobs Stuck "InProgress"
- Root Causes: Spot instance termination (60%), S3 cross-region access (25%), Docker container failure (15%)
- Cost Impact: Can burn hundreds of dollars before detection
- Emergency Fix: Add StoppingCondition with MaxRuntimeInSeconds: 3600 to prevent infinite billing (the sketch below stops jobs that are already stuck)
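MaxRuntimeInSeconds only protects future jobs; for jobs already running, a sweep like the following (a sketch, assuming a one-hour cutoff) stops anything still in flight:
import boto3
from datetime import datetime, timedelta, timezone

sm = boto3.client('sagemaker')
cutoff = datetime.now(timezone.utc) - timedelta(hours=1)

# Stop any InProgress training job older than the cutoff
for job in sm.list_training_jobs(StatusEquals='InProgress')['TrainingJobSummaries']:
    if job['CreationTime'] < cutoff:
        sm.stop_training_job(TrainingJobName=job['TrainingJobName'])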
Bedrock Service Failures
ThrottlingException During Peak Hours
- Default Quotas (Pathetically Low):
- Claude 3.5 Sonnet: ~8k tokens/min
- Nova Pro: ~10k tokens/min
- Other hosted models (Llama, Titan, Mistral): similarly low defaults that vary by account and region
- Business Impact: User-facing features fail during demos/high traffic
- Quota Increase Timeline: 2-5 business days via AWS support
- Emergency Workaround: Multi-region failover + exponential backoff
Essential Retry Logic:
import random
import time

from botocore.exceptions import ClientError

def bedrock_with_retry(bedrock_call, max_retries=5):
    """Retry a Bedrock call with exponential backoff plus jitter on throttling."""
    for attempt in range(max_retries):
        try:
            return bedrock_call()
        except ClientError as e:
            if e.response['Error']['Code'] == 'ThrottlingException':
                wait_time = (2 ** attempt) + random.uniform(0, 1)  # 1-2s, 2-3s, 4-5s, ...
                time.sleep(wait_time)
                continue
            raise
    raise Exception("Max retries exceeded")
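The helper above covers the backoff half of the emergency workaround; the multi-region half can be layered on top of it. A minimal sketch, assuming the same model is enabled in every listed region (the model ID and region list are illustrative):
import json

import boto3

FAILOVER_REGIONS = ['us-east-1', 'us-west-2']  # regions where the model is enabled for your account

def invoke_with_failover(request_body, model_id='anthropic.claude-3-5-sonnet-20240620-v1:0'):
    last_error = None
    for region in FAILOVER_REGIONS:
        client = boto3.client('bedrock-runtime', region_name=region)
        try:
            # Each region still gets exponential backoff via the retry helper
            return bedrock_with_retry(lambda: client.invoke_model(modelId=model_id,
                                                                  body=json.dumps(request_body)))
        except Exception as e:
            last_error = e  # fall through and try the next region
    raise last_error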
ModelNotReadyException - Cold Start Hell
- Latency Impact: 10-30 seconds for first request after idle
- User Experience: Appears as broken application
- Workaround Cost: ~$5/month to keep models warm
- Implementation: Ping every 5 minutes with minimal request
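A minimal keep-warm sketch for that ping, assuming an Anthropic model on Bedrock (the model ID is illustrative; run it every 5 minutes, for example from a Lambda on an EventBridge schedule):
import json

import boto3

bedrock_runtime = boto3.client('bedrock-runtime')

def keep_model_warm(model_id='anthropic.claude-3-5-sonnet-20240620-v1:0'):
    # Smallest possible request: one short user message, one output token
    body = json.dumps({
        'anthropic_version': 'bedrock-2023-05-31',
        'max_tokens': 1,
        'messages': [{'role': 'user', 'content': 'ping'}],
    })
    bedrock_runtime.invoke_model(modelId=model_id, body=body)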
SageMaker Endpoint Deployment Failures
EndpointCreationFailed with Useless Errors
- Debug Priority: Always test on smallest instance (ml.t2.medium) first
- Common Root Causes:
- Model artifact corruption (30%)
- Docker memory issues (25%)
- IAM permissions (20%)
- VPC blocking S3 access (15%)
- Python dependency conflicts (10%)
Endpoint Returns ModelError Despite "InService" Status
- Failure Indicator: Endpoint deployed successfully but all requests return 500
- Primary Cause: Inference script bugs (90% of cases)
- Debug Command: Check CloudWatch logs immediately, not endpoint status
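A quick way to pull the last hour of errors from the endpoint's own log group, which is where the real stack trace lives (a sketch; the endpoint name is a placeholder):
import time

import boto3

logs = boto3.client('logs')
response = logs.filter_log_events(
    logGroupName='/aws/sagemaker/Endpoints/your-endpoint-name',  # placeholder endpoint name
    startTime=int((time.time() - 3600) * 1000),  # last hour, in milliseconds
    filterPattern='ERROR',
)
for event in response['events']:
    print(event['message'])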
Resource Requirements and Costs
Training Job Resource Planning
- Spot Instance Risk: 40% chance of termination for jobs >30 minutes
- Memory Requirements: Add 50% buffer to model size estimates
- GPU Instance Quotas: Default limits prevent most real workloads
- Cost Spike Risk: Failed jobs continue billing until manually stopped
Production Endpoint Sizing
- Auto-scaling Latency: 3-5 minutes to spin up new instances
- Memory Overhead: Multi-model endpoints require 2x model size in RAM
- Minimum Viable Setup: 2 instances for any production workload
- Cost vs Performance: ml.p3 instances 10x cost but 3x performance vs ml.m5
Critical Configuration Settings
SageMaker Training Configuration
# Production-safe training job configuration
# (required arguments such as AlgorithmSpecification, RoleArn, ResourceConfig
# and OutputDataConfig are omitted here for brevity)
import boto3

sagemaker = boto3.client('sagemaker')
sagemaker.create_training_job(
    TrainingJobName='job-name',
    StoppingCondition={'MaxRuntimeInSeconds': 3600},  # Prevent infinite billing
    EnableNetworkIsolation=False,     # Unless VPC is properly configured
    EnableManagedSpotTraining=False,  # Disable spot for mission-critical training
    # ...
)
Multi-Model Endpoint Memory Management
# Part of the Container definition passed to create_model for a multi-model endpoint.
# Loaded models are cached in instance memory and evicted under memory pressure, so
# size the instance for the set of models you expect to be hot at the same time.
'MultiModelConfig': {
    'ModelCacheSetting': 'Enabled'
}
Auto-scaling Configuration
# Scale aggressively, cost is secondary to uptime. Endpoint auto-scaling goes
# through the Application Auto Scaling API, not the SageMaker client.
autoscaling = boto3.client('application-autoscaling')
autoscaling.put_scaling_policy(
    PolicyName='endpoint-scale-on-load',
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/your-endpoint/variant/AllTraffic',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'PredefinedMetricSpecification': {'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'},
        'TargetValue': 50.0,     # Target load per instance; scale well before saturation
        'ScaleOutCooldown': 60,  # Scale out fast
        'ScaleInCooldown': 900,  # Scale in slow
    }
)
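One prerequisite the snippet above assumes: the endpoint variant must already be registered as a scalable target, or the policy call fails. A sketch with illustrative capacity bounds:
autoscaling.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/your-endpoint/variant/AllTraffic',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=2,   # matches the two-instance production minimum above
    MaxCapacity=10,  # illustrative ceiling; tune to quota and budget
)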
Regional Availability Matrix
Model/Service | us-east-1 | us-west-2 | eu-west-1 | ap-southeast-1 |
---|---|---|---|---|
Claude 3.5 Sonnet | ✓ | ✓ | ✓ | ✗ |
Nova Pro | ✓ | ✓ | ✗ | ✗ |
ml.p3.8xlarge | ✓ | ✓ | Limited | Limited |
Emergency Debugging Commands
Immediate Status Check (30 seconds)
# AWS service health
curl -s https://status.aws.amazon.com/data.json | jq '.current_events'
# Running expensive resources
aws sagemaker list-training-jobs --status-equals InProgress
aws sagemaker list-endpoints --status-equals InService
Log Analysis (2 minutes)
# Recent SageMaker errors
aws logs filter-log-events \
  --log-group-name /aws/sagemaker/TrainingJobs \
  --start-time $(date -d '1 hour ago' +%s)000 \
  --filter-pattern "ERROR"
# Bedrock throttling patterns (requires model invocation logging to be enabled;
# substitute the log group you configured for it)
aws logs filter-log-events \
  --log-group-name /aws/bedrock \
  --filter-pattern "ThrottlingException"
Quota Verification (1 minute)
# Critical quotas that cause production failures
aws service-quotas get-service-quota \
  --service-code sagemaker \
  --quota-code L-1194D53C   # ml.p3.2xlarge instances

aws service-quotas get-service-quota \
  --service-code bedrock \
  --quota-code L-22C574D0   # Claude requests per minute
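If a quota comes back too low, file the increase request immediately, since approval takes days (see the recovery timelines below). A sketch using boto3; the quota code matches the check above and the desired value is illustrative:
import boto3

service_quotas = boto3.client('service-quotas')
service_quotas.request_service_quota_increase(
    ServiceCode='sagemaker',
    QuotaCode='L-1194D53C',  # ml.p3.2xlarge instances, as checked above
    DesiredValue=4.0,        # illustrative target
)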
VPC and Networking Requirements
VPC Endpoint Requirements for SageMaker
- S3 VPC Endpoint: Required for training data access
- SageMaker API Endpoint: Required for service communication
- Alternative: NAT Gateway (more expensive but simpler)
- Common Failure: Training jobs timeout after 30 minutes without proper endpoints
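A sketch of creating both endpoints with boto3, assuming placeholder VPC, subnet, route table and security group IDs (adjust the region embedded in the service names):
import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

# Gateway endpoint so training jobs can reach S3 without a NAT Gateway
ec2.create_vpc_endpoint(
    VpcId='vpc-0123456789abcdef0',
    ServiceName='com.amazonaws.us-east-1.s3',
    VpcEndpointType='Gateway',
    RouteTableIds=['rtb-0123456789abcdef0'],
)

# Interface endpoint for the SageMaker API (add sagemaker.runtime for endpoint invocation)
ec2.create_vpc_endpoint(
    VpcId='vpc-0123456789abcdef0',
    ServiceName='com.amazonaws.us-east-1.sagemaker.api',
    VpcEndpointType='Interface',
    SubnetIds=['subnet-0123456789abcdef0'],
    SecurityGroupIds=['sg-0123456789abcdef0'],
)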
Security Group Rules for AI/ML
- Outbound HTTPS (443): Required for API calls
- Outbound HTTP (80): Required for some model downloads
- Emergency Rule: Allow all outbound traffic initially, then restrict
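A sketch of the two outbound rules, assuming a custom security group whose default allow-all egress rule has been removed (the group ID is a placeholder):
import boto3

ec2 = boto3.client('ec2')
ec2.authorize_security_group_egress(
    GroupId='sg-0123456789abcdef0',
    IpPermissions=[
        {'IpProtocol': 'tcp', 'FromPort': 443, 'ToPort': 443,
         'IpRanges': [{'CidrIp': '0.0.0.0/0', 'Description': 'HTTPS for AWS API calls'}]},
        {'IpProtocol': 'tcp', 'FromPort': 80, 'ToPort': 80,
         'IpRanges': [{'CidrIp': '0.0.0.0/0', 'Description': 'HTTP for some model downloads'}]},
    ],
)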
Performance Thresholds and Breaking Points
SageMaker Limits
- Training Job Timeout: Default no timeout leads to infinite billing
- Multi-Model Endpoints: loading too many models at once exhausts instance memory (OOM)
- Endpoint Auto-scaling: 3-5 minute delay causes user-visible failures
- Batch Transform: Files >100MB cause random failures
Bedrock Performance Characteristics
- Cold Start: 10-30 seconds for idle models
- Token Limits: Vary by region and change without notice
- Regional Failover: Essential for production reliability
Common Misconceptions
"SageMaker Handles Everything Automatically"
- Reality: Requires extensive IAM configuration, VPC setup, and monitoring
- Hidden Costs: Auto-scaling, data transfer, CloudWatch logging
- Failure Modes: Silent failures due to permission issues
"AWS Error Messages Are Helpful"
- Reality: 90% of errors require CloudWatch log analysis
- UnexpectedStatusException: Means "something failed, figure it out yourself"
- AccessDenied: Could be 15 different permission issues
"Default Settings Work in Production"
- Reality: Default quotas prevent any serious workload
- Auto-scaling: Default thresholds cause user-visible latency
- Timeout Settings: Will cause infinite billing without limits
Emergency Recovery Procedures
Nuclear Options (Last Resort)
- Delete and Recreate Endpoints: When configuration is corrupted
- Reset IAM Roles: When permissions are completely broken
- Multi-Region Failover: When primary region has issues
Recovery Timeline Expectations
- IAM Permission Fixes: 10-15 minutes
- Quota Increase Requests: 2-5 business days
- Endpoint Recreation: 5-10 minutes
- Training Job Restarts: 15-30 minutes depending on data size
Cost Impact During Outages
- Running Training Jobs: Continue billing until manually stopped
- Idle Endpoints: $50-500/day depending on instance type
- Failed Batch Jobs: May process partial data and still charge
Decision Criteria for Implementation
When to Use SageMaker vs Bedrock
- SageMaker: Custom models, fine-tuning, batch processing
- Bedrock: Quick LLM integration, managed scaling, multiple model access
- Cost Comparison: Bedrock 3-5x more expensive per token but simpler ops
Instance Type Selection
- Development: ml.t2.medium for debugging (cheapest)
- CPU Inference: ml.m5.large for simple models
- GPU Training: ml.p3.2xlarge minimum for deep learning
- Production Inference: ml.c5.xlarge for latency-sensitive applications
Multi-Region Strategy
- Essential for Production: Single region will fail
- Cost Impact: 2x infrastructure costs but prevents business disruption
- Implementation Complexity: High, requires sophisticated load balancing
This reference prioritizes operational intelligence over theoretical knowledge, focusing on the failures that actually occur in production environments and the proven solutions that resolve them quickly.
Useful Links for Further Investigation
Emergency Resources When Everything's Broken
Link | Description |
---|---|
AWS Status Page | First place to check when everything's broken. Bookmark this. AWS won't tell you about outages via error messages. |
SageMaker Service Quotas Documentation | Check your limits before they kill your training jobs. Default quotas are pathetically small. |
Bedrock Service Quotas Documentation | Bedrock quotas are even worse. Request increases immediately. |
Stack Overflow - amazon-sagemaker tag | Real engineers solving real problems. Search here before filing support tickets. |
AWS ML Community Slack | Active community. Post emergencies in #troubleshooting channel. |
AWS Cost Explorer | Find what's burning money during outages. Filter by service and time range. |
IAM Policy Simulator | Test IAM permissions without breaking production. Essential for AccessDenied errors. |