
AWS AI/ML Production Debugging: AI-Optimized Reference

Critical Failure Patterns

SageMaker Training Job Failures

UnexpectedStatusException Pattern

  • Primary Cause (90%): IAM role lacks S3 access permissions
  • Failure Impact: Jobs fail with a generic status and cryptic error messages; the real cause is buried in CloudWatch logs
  • Detection Time: Can waste hours before proper logs are found
  • Fix Complexity: Low (10 minutes) if IAM issue, High (2+ hours) if VPC/networking

Critical IAM Policy Requirements:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"
            ],
            "Resource": ["arn:aws:s3:::your-bucket", "arn:aws:s3:::your-bucket/*"]
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"
            ],
            "Resource": "*"
        }
    ]
}
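
Before relaunching the job, it's worth confirming the execution role can actually reach the bucket. A minimal sketch using the IAM policy simulator; the role ARN and bucket name are placeholders you would swap for your own:

import boto3

iam = boto3.client('iam')

# Placeholder role and bucket names; replace with your execution role and data bucket
response = iam.simulate_principal_policy(
    PolicySourceArn='arn:aws:iam::123456789012:role/SageMakerExecutionRole',
    ActionNames=['s3:GetObject', 's3:PutObject', 's3:ListBucket'],
    ResourceArns=['arn:aws:s3:::your-bucket', 'arn:aws:s3:::your-bucket/*'],
)

for result in response['EvaluationResults']:
    print(result['EvalActionName'], result['EvalDecision'])  # 'allowed' or a deny reason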

Training Jobs Stuck "InProgress"

  • Root Causes: Spot instance termination (60%), S3 cross-region access (25%), Docker container failure (15%)
  • Cost Impact: Can burn hundreds of dollars before detection
  • Emergency Fix: Add MaxRuntimeInSeconds: 3600 to prevent infinite billing
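
When a job has been stuck InProgress for longer than it could plausibly need, stop it rather than waiting for logs. A minimal sketch; the two-hour threshold is an arbitrary example:

import boto3
from datetime import datetime, timedelta, timezone

sm = boto3.client('sagemaker')
cutoff = datetime.now(timezone.utc) - timedelta(hours=2)  # example threshold

# Find long-running InProgress jobs and stop them to cut off billing
jobs = sm.list_training_jobs(StatusEquals='InProgress')['TrainingJobSummaries']
for job in jobs:
    if job['CreationTime'] < cutoff:
        print(f"Stopping runaway job: {job['TrainingJobName']}")
        sm.stop_training_job(TrainingJobName=job['TrainingJobName'])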

Bedrock Service Failures

ThrottlingException During Peak Hours

  • Default Quotas (Pathetically Low):
    • Claude 3.5 Sonnet: ~8k tokens/min
    • Nova Pro: ~10k tokens/min
    • Other models: defaults vary by model and region; check Service Quotas before assuming you have headroom
  • Business Impact: User-facing features fail during demos/high traffic
  • Quota Increase Timeline: 2-5 business days via AWS support
  • Emergency Workaround: Multi-region failover + exponential backoff

Essential Retry Logic:

import time
import random

from botocore.exceptions import ClientError

def bedrock_with_retry(bedrock_call, max_retries=5):
    """Retry a Bedrock call with exponential backoff plus jitter on throttling."""
    for attempt in range(max_retries):
        try:
            return bedrock_call()
        except ClientError as e:
            if e.response['Error']['Code'] == 'ThrottlingException':
                wait_time = (2 ** attempt) + random.uniform(0, 1)  # backoff + jitter
                time.sleep(wait_time)
                continue
            raise
    raise RuntimeError("Max retries exceeded")
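
For the multi-region half of the workaround, the same idea extends to failing over between regional bedrock-runtime clients. A rough sketch, assuming the model is enabled in both regions and your payload matches the model in use:

import json
import boto3
from botocore.exceptions import ClientError

REGIONS = ['us-east-1', 'us-west-2']  # regions where the model is enabled

def invoke_with_failover(model_id, body):
    last_error = None
    for region in REGIONS:
        client = boto3.client('bedrock-runtime', region_name=region)
        try:
            response = client.invoke_model(modelId=model_id, body=json.dumps(body))
            return json.loads(response['body'].read())
        except ClientError as e:
            if e.response['Error']['Code'] in ('ThrottlingException', 'ServiceUnavailableException'):
                last_error = e
                continue  # try the next region
            raise
    raise last_error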

ModelNotReadyException - Cold Start Hell

  • Latency Impact: 10-30 seconds for first request after idle
  • User Experience: Appears as broken application
  • Workaround Cost: ~$5/month to keep models warm
  • Implementation: Ping every 5 minutes with minimal request
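
The keep-warm workaround is just a scheduled minimal invocation, typically from a Lambda on a 5-minute EventBridge rule. A rough sketch of the ping itself; the Anthropic-style body is only an example, so match the request format to whatever model goes cold on you:

import json
import boto3

bedrock = boto3.client('bedrock-runtime')

def keep_warm(model_id):
    # Smallest request that still loads the model; body format depends on the model family
    body = {
        'anthropic_version': 'bedrock-2023-05-31',
        'max_tokens': 1,
        'messages': [{'role': 'user', 'content': 'ping'}],
    }
    bedrock.invoke_model(modelId=model_id, body=json.dumps(body))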

SageMaker Endpoint Deployment Failures

EndpointCreationFailed with Useless Errors

  • Debug Priority: Always test on smallest instance (ml.t2.medium) first
  • Common Root Causes:
    1. Model artifact corruption (30%)
    2. Docker memory issues (25%)
    3. IAM permissions (20%)
    4. VPC blocking S3 access (15%)
    5. Python dependency conflicts (10%)
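
To separate "my model or container is broken" from "my instance choice is broken", deploy the same model to the cheapest instance first. A sketch, assuming the SageMaker Model already exists; all names here are placeholders:

import boto3

sm = boto3.client('sagemaker')

# Placeholder names; the SageMaker Model ('debug-model') must already exist
sm.create_endpoint_config(
    EndpointConfigName='debug-config',
    ProductionVariants=[{
        'VariantName': 'AllTraffic',
        'ModelName': 'debug-model',
        'InstanceType': 'ml.t2.medium',  # smallest/cheapest instance for debugging
        'InitialInstanceCount': 1,
    }],
)
sm.create_endpoint(EndpointName='debug-endpoint', EndpointConfigName='debug-config')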

Endpoint Returns ModelError Despite "InService" Status

  • Failure Indicator: Endpoint deployed successfully but all requests return 500
  • Primary Cause: Inference script bugs (90% of cases)
  • Debug Command: Check CloudWatch logs immediately, not endpoint status
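
A quick way to pull the endpoint's recent container logs without clicking through the console; the log group follows the /aws/sagemaker/Endpoints/<endpoint-name> convention, and the endpoint name below is a placeholder:

import time
import boto3

logs = boto3.client('logs')

events = logs.filter_log_events(
    logGroupName='/aws/sagemaker/Endpoints/your-endpoint',  # placeholder endpoint name
    startTime=int((time.time() - 3600) * 1000),             # last hour, in milliseconds
    filterPattern='ERROR',
)
for event in events['events']:
    print(event['message'])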

Resource Requirements and Costs

Training Job Resource Planning

  • Spot Instance Risk: 40% chance of termination for jobs >30 minutes
  • Memory Requirements: Add 50% buffer to model size estimates
  • GPU Instance Quotas: Default limits prevent most real workloads
  • Cost Spike Risk: Failed jobs continue billing until manually stopped

Production Endpoint Sizing

  • Auto-scaling Latency: 3-5 minutes to spin up new instances
  • Memory Overhead: Multi-model endpoints require 2x model size in RAM
  • Minimum Viable Setup: 2 instances for any production workload
  • Cost vs Performance: ml.p3 instances 10x cost but 3x performance vs ml.m5

Critical Configuration Settings

SageMaker Training Configuration

# Production-safe training job configuration (boto3 SageMaker client)
sagemaker.create_training_job(
    TrainingJobName='job-name',
    # RoleArn, AlgorithmSpecification, InputDataConfig, OutputDataConfig,
    # and ResourceConfig omitted for brevity
    StoppingCondition={'MaxRuntimeInSeconds': 3600},  # Prevent infinite billing
    EnableNetworkIsolation=False,    # Unless VPC is properly configured
    EnableManagedSpotTraining=False  # For mission-critical training
)

Multi-Model Endpoint Memory Management

# Fragment of the ContainerDefinition for a multi-model endpoint
'MultiModelConfig': {
    'ModelCacheSetting': 'Enabled'  # Cache loaded models; size the instance so concurrently
                                    # loaded models fit in RAM, or eviction thrashing and OOM follow
}

Auto-scaling Configuration

# Scale aggressively, cost is secondary to uptime
# (endpoint scaling goes through Application Auto Scaling, not the SageMaker client)
autoscaling = boto3.client('application-autoscaling')
autoscaling.put_scaling_policy(
    PolicyName='endpoint-target-tracking',
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/your-endpoint/variant/AllTraffic',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'PredefinedMetricSpecification': {'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'},
        'TargetValue': 50.0,      # Target invocations per instance; set low so scaling triggers early
        'ScaleOutCooldown': 60,   # Scale out fast
        'ScaleInCooldown': 900,   # Scale in slow
    }
)

Regional Availability Matrix

Model and instance availability differs across us-east-1, us-west-2, eu-west-1, and ap-southeast-1: Claude 3.5 Sonnet and Nova Pro are not enabled in every region, and ml.p3.8xlarge capacity is limited in several of them. Verify model access and instance quotas in your target region before committing to an architecture.

Emergency Debugging Commands

Immediate Status Check (30 seconds)

# AWS service health
curl -s https://status.aws.amazon.com/data.json | jq '.current_events'

# Running expensive resources
aws sagemaker list-training-jobs --status-equals InProgress
aws sagemaker list-endpoints --status-equals InService

Log Analysis (2 minutes)

# Recent SageMaker errors
aws logs filter-log-events \
    --log-group-name /aws/sagemaker/TrainingJobs \
    --start-time $(date -d '1 hour ago' +%s)000 \
    --filter-pattern "ERROR"

# Bedrock throttling patterns (requires model invocation logging to be enabled;
# use the log group you configured for it)
aws logs filter-log-events \
    --log-group-name /aws/bedrock \
    --filter-pattern "ThrottlingException"

Quota Verification (1 minute)

# Critical quotas that cause production failures
aws service-quotas get-service-quota \
    --service-code sagemaker \
    --quota-code L-1194D53C  # ml.p3.2xlarge instances

aws service-quotas get-service-quota \
    --service-code bedrock \
    --quota-code L-22C574D0  # Claude requests per minute

VPC and Networking Requirements

VPC Endpoint Requirements for SageMaker

  • S3 VPC Endpoint: Required for training data access
  • SageMaker API Endpoint: Required for service communication
  • Alternative: NAT Gateway (more expensive but simpler)
  • Common Failure: Training jobs timeout after 30 minutes without proper endpoints
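
A sketch of creating the two endpoints with boto3; the VPC, route table, subnet, and security group IDs are placeholders, and you will usually also want the sagemaker.runtime endpoint if you invoke endpoints from inside the VPC:

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

# Gateway endpoint for S3 (training data access) - placeholder VPC/route table IDs
ec2.create_vpc_endpoint(
    VpcEndpointType='Gateway',
    VpcId='vpc-0123456789abcdef0',
    ServiceName='com.amazonaws.us-east-1.s3',
    RouteTableIds=['rtb-0123456789abcdef0'],
)

# Interface endpoint for the SageMaker API - placeholder subnet/security group IDs
ec2.create_vpc_endpoint(
    VpcEndpointType='Interface',
    VpcId='vpc-0123456789abcdef0',
    ServiceName='com.amazonaws.us-east-1.sagemaker.api',
    SubnetIds=['subnet-0123456789abcdef0'],
    SecurityGroupIds=['sg-0123456789abcdef0'],
    PrivateDnsEnabled=True,
)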

Security Group Rules for AI/ML

  • Outbound HTTPS (443): Required for API calls
  • Outbound HTTP (80): Required for some model downloads
  • Emergency Rule: Allow all outbound traffic initially, then restrict
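
If the security group has been locked down and egress stripped, re-adding the two outbound rules looks roughly like this; the group ID is a placeholder, and the 0.0.0.0/0 ranges are the "get it working first" emergency rule to tighten later:

import boto3

ec2 = boto3.client('ec2')

ec2.authorize_security_group_egress(
    GroupId='sg-0123456789abcdef0',  # placeholder security group ID
    IpPermissions=[
        {'IpProtocol': 'tcp', 'FromPort': 443, 'ToPort': 443,
         'IpRanges': [{'CidrIp': '0.0.0.0/0', 'Description': 'HTTPS for AWS API calls'}]},
        {'IpProtocol': 'tcp', 'FromPort': 80, 'ToPort': 80,
         'IpRanges': [{'CidrIp': '0.0.0.0/0', 'Description': 'HTTP for model downloads'}]},
    ],
)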

Performance Thresholds and Breaking Points

SageMaker Limits

  • Training Job Timeout: Without an explicit MaxRuntimeInSeconds, runaway jobs keep billing far longer than intended
  • Multi-Model Endpoints: Loading more models than fit in instance RAM triggers constant cache eviction and OOM errors
  • Endpoint Auto-scaling: 3-5 minute delay causes user-visible failures
  • Batch Transform: Files >100MB cause random failures

Bedrock Performance Characteristics

  • Cold Start: 10-30 seconds for idle models
  • Token Limits: Vary by region and change without notice
  • Regional Failover: Essential for production reliability

Common Misconceptions

"SageMaker Handles Everything Automatically"

  • Reality: Requires extensive IAM configuration, VPC setup, and monitoring
  • Hidden Costs: Auto-scaling, data transfer, CloudWatch logging
  • Failure Modes: Silent failures due to permission issues

"AWS Error Messages Are Helpful"

  • Reality: 90% of errors require CloudWatch log analysis
  • UnexpectedStatusException: Means "something failed, figure it out yourself"
  • AccessDenied: Could be 15 different permission issues

"Default Settings Work in Production"

  • Reality: Default quotas prevent any serious workload
  • Auto-scaling: Default thresholds cause user-visible latency
  • Timeout Settings: Will cause infinite billing without limits

Emergency Recovery Procedures

Nuclear Options (Last Resort)

  1. Delete and Recreate Endpoints: When configuration is corrupted
  2. Reset IAM Roles: When permissions are completely broken
  3. Multi-Region Failover: When primary region has issues

Recovery Timeline Expectations

  • IAM Permission Fixes: 10-15 minutes
  • Quota Increase Requests: 2-5 business days
  • Endpoint Recreation: 5-10 minutes
  • Training Job Restarts: 15-30 minutes depending on data size

Cost Impact During Outages

  • Running Training Jobs: Continue billing until manually stopped
  • Idle Endpoints: $50-500/day depending on instance type
  • Failed Batch Jobs: May process partial data and still charge
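
For cost triage during an incident, a quick inventory of what's running and on which instance types; anything idle on a GPU instance is the first candidate for deletion:

import boto3

sm = boto3.client('sagemaker')

for ep in sm.list_endpoints(StatusEquals='InService')['Endpoints']:
    config_name = sm.describe_endpoint(EndpointName=ep['EndpointName'])['EndpointConfigName']
    config = sm.describe_endpoint_config(EndpointConfigName=config_name)
    for variant in config['ProductionVariants']:
        # Serverless variants have no InstanceType, hence the .get()
        print(ep['EndpointName'], variant.get('InstanceType', 'serverless'), variant.get('InitialInstanceCount'))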

Decision Criteria for Implementation

When to Use SageMaker vs Bedrock

  • SageMaker: Custom models, fine-tuning, batch processing
  • Bedrock: Quick LLM integration, managed scaling, multiple model access
  • Cost Comparison: Bedrock 3-5x more expensive per token but simpler ops

Instance Type Selection

  • Development: ml.t2.medium for debugging (cheapest)
  • CPU Inference: ml.m5.large for simple models
  • GPU Training: ml.p3.2xlarge minimum for deep learning
  • Production Inference: ml.c5.xlarge for latency-sensitive applications

Multi-Region Strategy

  • Essential for Production: Single region will fail
  • Cost Impact: 2x infrastructure costs but prevents business disruption
  • Implementation Complexity: High, requires sophisticated load balancing

This reference prioritizes operational intelligence over theoretical knowledge, focusing on the failures that actually occur in production environments and the proven solutions that resolve them quickly.

Useful Links for Further Investigation

Emergency Resources When Everything's Broken

  • AWS Status Page: First place to check when everything's broken. Bookmark this. AWS won't tell you about outages via error messages.
  • SageMaker Service Quotas Documentation: Check your limits before they kill your training jobs. Default quotas are pathetically small.
  • Bedrock Service Quotas Documentation: Bedrock quotas are even worse. Request increases immediately.
  • Stack Overflow - amazon-sagemaker tag: Real engineers solving real problems. Search here before filing support tickets.
  • AWS ML Community Slack: Active community. Post emergencies in #troubleshooting channel.
  • AWS Cost Explorer: Find what's burning money during outages. Filter by service and time range.
  • IAM Policy Simulator: Test IAM permissions without breaking production. Essential for AccessDenied errors.

Related Tools & Recommendations

tool
Recommended

MLflow - Stop Losing Track of Your Fucking Model Runs

MLflow: Open-source platform for machine learning lifecycle management

Databricks MLflow
/tool/databricks-mlflow/overview
100%
integration
Recommended

GitOps Integration Hell: Docker + Kubernetes + ArgoCD + Prometheus

How to Wire Together the Modern DevOps Stack Without Losing Your Sanity

kubernetes
/integration/docker-kubernetes-argocd-prometheus/gitops-workflow-integration
99%
integration
Recommended

PyTorch ↔ TensorFlow Model Conversion: The Real Story

How to actually move models between frameworks without losing your sanity

PyTorch
/integration/pytorch-tensorflow/model-interoperability-guide
99%
tool
Recommended

Google Vertex AI - Google's Answer to AWS SageMaker

Google's ML platform that combines their scattered AI services into one place. Expect higher bills than advertised but decent Gemini model access if you're alre

Google Vertex AI
/tool/google-vertex-ai/overview
63%
tool
Recommended

Azure ML - For When Your Boss Says "Just Use Microsoft Everything"

The ML platform that actually works with Active Directory without requiring a PhD in IAM policies

Azure Machine Learning
/tool/azure-machine-learning/overview
63%
news
Recommended

Databricks Raises $1B While Actually Making Money (Imagine That)

Company hits $100B valuation with real revenue and positive cash flow - what a concept

OpenAI GPT
/news/2025-09-08/databricks-billion-funding
58%
pricing
Recommended

Databricks vs Snowflake vs BigQuery Pricing: Which Platform Will Bankrupt You Slowest

We burned through about $47k in cloud bills figuring this out so you don't have to

Databricks
/pricing/databricks-snowflake-bigquery-comparison/comprehensive-pricing-breakdown
58%
howto
Recommended

Stop MLflow from Murdering Your Database Every Time Someone Logs an Experiment

Deploy MLflow tracking that survives more than one data scientist

MLflow
/howto/setup-mlops-pipeline-mlflow-kubernetes/complete-setup-guide
57%
integration
Recommended

MLOps Production Pipeline: Kubeflow + MLflow + Feast Integration

How to Connect These Three Tools Without Losing Your Sanity

Kubeflow
/integration/kubeflow-mlflow-feast/complete-mlops-pipeline
57%
integration
Recommended

RAG on Kubernetes: Why You Probably Don't Need It (But If You Do, Here's How)

Running RAG Systems on K8s Will Make You Hate Your Life, But Sometimes You Don't Have a Choice

Vector Databases
/integration/vector-database-rag-production-deployment/kubernetes-orchestration
57%
integration
Recommended

Kafka + MongoDB + Kubernetes + Prometheus Integration - When Event Streams Break

When your event-driven services die and you're staring at green dashboards while everything burns, you need real observability - not the vendor promises that go

Apache Kafka
/integration/kafka-mongodb-kubernetes-prometheus-event-driven/complete-observability-architecture
57%
alternatives
Recommended

Docker Alternatives That Won't Break Your Budget

Docker got expensive as hell. Here's how to escape without breaking everything.

Docker
/alternatives/docker/budget-friendly-alternatives
57%
compare
Recommended

I Tested 5 Container Security Scanners in CI/CD - Here's What Actually Works

Trivy, Docker Scout, Snyk Container, Grype, and Clair - which one won't make you want to quit DevOps

docker
/compare/docker-security/cicd-integration/docker-security-cicd-integration
57%
tool
Recommended

JupyterLab Debugging Guide - Fix the Shit That Always Breaks

When your kernels die and your notebooks won't cooperate, here's what actually works

JupyterLab
/tool/jupyter-lab/debugging-guide
57%
tool
Recommended

JupyterLab Team Collaboration: Why It Breaks and How to Actually Fix It

integrates with JupyterLab

JupyterLab
/tool/jupyter-lab/team-collaboration-deployment
57%
tool
Recommended

JupyterLab Extension Development - Build Extensions That Don't Suck

Stop wrestling with broken tools and build something that actually works for your workflow

JupyterLab
/tool/jupyter-lab/extension-development-guide
57%
tool
Recommended

TensorFlow Serving Production Deployment - The Shit Nobody Tells You About

Until everything's on fire during your anniversary dinner and you're debugging memory leaks at 11 PM

TensorFlow Serving
/tool/tensorflow-serving/production-deployment-guide
57%
tool
Recommended

TensorFlow - End-to-End Machine Learning Platform

Google's ML framework that actually works in production (most of the time)

TensorFlow
/tool/tensorflow/overview
57%
tool
Recommended

PyTorch Debugging - When Your Models Decide to Die

integrates with PyTorch

PyTorch
/tool/pytorch/debugging-troubleshooting-guide
57%
tool
Recommended

PyTorch - The Deep Learning Framework That Doesn't Suck

I've been using PyTorch since 2019. It's popular because the API makes sense and debugging actually works.

PyTorch
/tool/pytorch/overview
57%

Recommendations combine user behavior, content similarity, research intelligence, and SEO optimization