Today is Friday, September 05, 2025. I'm writing this after spending 18 hours yesterday debugging why our Bedrock endpoints were randomly throwing 500 errors during a board demo. Turned out to be a quota limit we hit because someone forgot to request an increase for our new model. The CEO was not amused.
Here's the shit that actually breaks in production and how to fix it when you're under pressure.
SageMaker Training Jobs That Die Mysteriously
"UnexpectedStatusException" - The Error Message From Hell
This is AWS's way of saying "something went wrong but we're not gonna tell you what." I've seen this error more times than I care to count. The official SageMaker troubleshooting documentation barely scratches the surface.
What usually causes it:
- IAM permissions are fucked (most common) - check IAM troubleshooting guide
- Your training script has a bug that only shows up on AWS - see training job errors guide
- Instance doesn't have enough memory - review instance types documentation
- VPC config blocking S3 access - check VPC endpoint configuration
Quick debugging steps:
## 1. Check CloudWatch logs first (this saves hours)
aws logs describe-log-groups --log-group-name-prefix /aws/sagemaker/TrainingJobs
## 2. Look at the actual error in CloudWatch
aws logs get-log-events --log-group-name /aws/sagemaker/TrainingJobs/your-job-name --log-stream-name your-job-name/algo-1-1234567890
## 3. Confirm the execution role is even assumable (then use the temporary credentials it returns to test S3 access)
aws sts assume-role --role-arn your-sagemaker-execution-role-arn --role-session-name test
For comprehensive log analysis, check the CloudWatch logs documentation for SageMaker. When jobs fail without logs appearing, it's usually a pre-training configuration issue.
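Before digging through log streams, it's often faster to pull the failure reason straight off the job description. A quick check, with your-job-name as the placeholder:
## FailureReason usually names the real problem before you ever open CloudWatch
aws sagemaker describe-training-job --training-job-name your-job-name \
--query '{Status: TrainingJobStatus, Reason: FailureReason}'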
The fix that works 90% of the time:
Your SageMaker execution role is missing permissions. Add this policy and try again:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket/*",
        "arn:aws:s3:::your-bucket"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    }
  ]
}
Training Jobs Stuck in "InProgress" Forever
Your job shows as running but hasn't done anything for 3 hours. This usually means:
- Spot instance got terminated (check CloudWatch events)
- Data loading is stuck (your S3 bucket is in a different region)
- Docker container won't start (base image is corrupted or wrong region)
Quick fix (following SageMaker best practices):
## Add a timeout to your training jobs so they don't run forever
import boto3

sagemaker = boto3.client('sagemaker')
sagemaker.create_training_job(
    TrainingJobName='my-job',
    StoppingCondition={
        'MaxRuntimeInSeconds': 3600  # Kill it after 1 hour
    },
    # ... other params
)
For more advanced timeout and training configurations, check the SageMaker training compiler best practices and training job setup guide.
"ResourceLimitExceeded" During Training
You hit an instance quota or your account can't spin up the GPU instances you requested.
## Check your service quotas
aws service-quotas get-service-quota \
--service-code sagemaker \
--quota-code L-1194D53C # ml.p3.2xlarge instances
## Request quota increase (takes 2-5 business days)
aws service-quotas request-service-quota-increase \
--service-code sagemaker \
--quota-code L-1194D53C \
--desired-value 10
For faster quota increases, use the Service Quotas console instead of the CLI. The AWS Support console method also works but takes longer.
Emergency workaround: Use a smaller instance type or switch to on-demand if you were using spot instances.
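If spot capacity is what's biting you, the fastest escape is to rerun the job on on-demand. A minimal sketch with boto3 (job name, instance type, and sizes are placeholders, and it assumes the SageMaker client from above):
## Rerun without managed spot so a capacity reclaim can't kill the job
sagemaker.create_training_job(
    TrainingJobName='my-job-ondemand',
    EnableManagedSpotTraining=False,  # Fall back to on-demand capacity
    ResourceConfig={
        'InstanceType': 'ml.g5.xlarge',  # Or whatever instance family you actually have quota for
        'InstanceCount': 1,
        'VolumeSizeInGB': 50
    },
    # ... same AlgorithmSpecification, RoleArn, and data channels as before
)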
Bedrock Models Randomly Failing
ThrottlingException During Peak Hours
Your app works fine during testing but starts throwing 429 errors when real users hit it. The official Bedrock API error codes documentation recommends exponential backoff with jitter.
Default quotas are pathetic:
- Claude 3.5 Sonnet: something like 8k tokens/min, maybe less in some regions
- Nova Pro: around 10k tokens/min if you're lucky
- Llama 3 (through Bedrock): I think around 12k tokens/min, but these change randomly
Check the current Bedrock quotas documentation for your specific region since these limits vary wildly.
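You can also pull the applied values straight out of Service Quotas instead of eyeballing the docs. A quick sketch (the --query projection is just one way to slice it):
## List the Bedrock quotas applied to your account in the current region
aws service-quotas list-service-quotas --service-code bedrock \
--query 'Quotas[].{Name: QuotaName, Value: Value}' --output table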
Quick fixes:
First, implement exponential backoff (should have done this from the start):
import time
import random
from botocore.exceptions import ClientError

def bedrock_with_retry(bedrock_call, max_retries=5):
    for attempt in range(max_retries):
        try:
            return bedrock_call()
        except ClientError as e:
            if e.response['Error']['Code'] == 'ThrottlingException':
                # Exponential backoff with jitter
                wait_time = (2 ** attempt) + random.uniform(0, 1)
                time.sleep(wait_time)
                continue
            else:
                raise
    raise Exception("Max retries exceeded")
Then request quota increases using the Bedrock quota increase process and use multiple regions to spread load. The Boto3 retry documentation has more advanced retry strategies. Also ensure you have proper model access permissions and follow the getting started guide for API setup. For troubleshooting access issues, check the model access modification guide.
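If you'd rather not hand-roll the retry loop, botocore can do the backoff for you. A minimal sketch using the built-in adaptive retry mode:
import boto3
from botocore.config import Config

## Adaptive mode retries throttling errors and also rate-limits the client to back off under pressure
retry_config = Config(retries={'max_attempts': 10, 'mode': 'adaptive'})
bedrock_runtime = boto3.client('bedrock-runtime', config=retry_config)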
"ModelNotReadyException" - Cold Start Hell
First request to a Bedrock model after it's been idle takes 10-30 seconds. Your users think the app is broken. The official ModelNotReadyException troubleshooting guide recommends implementing heartbeat strategies.
Hacky but necessary fix - keep models warm:
import json
import threading
import time

import boto3

bedrock = boto3.client('bedrock-runtime')

def keep_bedrock_warm():
    while True:
        try:
            # Cheapest possible request to keep the model loaded
            bedrock.invoke_model(
                modelId='anthropic.claude-3-haiku-20240307-v1:0',
                body=json.dumps({
                    "anthropic_version": "bedrock-2023-05-31",
                    "max_tokens": 1,
                    "messages": [{"role": "user", "content": "hi"}]
                })
            )
            time.sleep(300)  # Every 5 minutes
        except Exception:
            time.sleep(60)  # Retry in 1 minute

## Start as a daemon thread
threading.Thread(target=keep_bedrock_warm, daemon=True).start()
Costs about $5/month but saves your user experience.
Regional Availability Hell
Your app worked fine in us-east-1 during testing. Now you're deploying to eu-west-1 and nothing works because the model isn't available there.
Check model availability by region:
import boto3

def check_model_availability(model_id, region):
    try:
        bedrock = boto3.client('bedrock', region_name=region)
        response = bedrock.list_foundation_models()
        available_models = [model['modelId'] for model in response['modelSummaries']]
        return model_id in available_models
    except Exception:
        return False

## Check before deployment
regions = ['us-east-1', 'us-west-2', 'eu-west-1']
model = 'anthropic.claude-3-5-sonnet-20240620-v1:0'
for region in regions:
    available = check_model_availability(model, region)
    print(f"{model} in {region}: {'✓' if available else '✗'}")
SageMaker Endpoints That Won't Deploy
"EndpointCreationFailed" with No Useful Error
Your model trained fine but won't deploy to an endpoint. The error message is useless.
Most common causes:
- Model artifact is corrupted - repackage and upload again (see the artifact check after this list)
- Docker container can't load the model - memory issues or wrong Python version
- IAM role can't access the model artifact - permissions again, always permissions
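For the corrupted-artifact case, check that the model.tar.gz actually unpacks before blaming SageMaker. A quick sanity check (bucket and key are placeholders):
## Pull the artifact and make sure it's a valid tarball with your model files in it
aws s3 cp s3://your-bucket/model/model.tar.gz .
tar -tzf model.tar.gz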
Debug with a test endpoint:
## Deploy to the smallest possible instance first
import boto3

sagemaker = boto3.client('sagemaker')

test_config = {
    'EndpointConfigName': 'debug-config',
    'ProductionVariants': [{
        'VariantName': 'debug-variant',
        'ModelName': 'your-model',
        'InitialInstanceCount': 1,
        'InstanceType': 'ml.t2.medium',  # Cheapest option for debugging
        'InitialVariantWeight': 1
    }]
}
sagemaker.create_endpoint_config(**test_config)
sagemaker.create_endpoint(EndpointName='debug-endpoint', EndpointConfigName='debug-config')
If it fails on ml.t2.medium, the problem is your model code. If it works, you have a resource issue.
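Also worth knowing: the endpoint description usually carries a slightly less useless error than the console. For example:
## FailureReason often names the actual problem (image pull, model download, failed health check...)
aws sagemaker describe-endpoint --endpoint-name your-endpoint \
--query '{Status: EndpointStatus, Reason: FailureReason}'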
Endpoint Deployed But Returns 500 Errors
The endpoint shows as "InService" but every request fails with ModelError.
This is usually your fault:
- Your inference script has bugs
- Model expects different input format than you're sending
- Python dependencies are missing from the container
Quick debug:
## Test with the simplest possible input
import json

import boto3

sagemaker_runtime = boto3.client('sagemaker-runtime')

test_payload = {"instances": [{"data": "test"}]}
response = sagemaker_runtime.invoke_endpoint(
    EndpointName='your-endpoint',
    ContentType='application/json',
    Body=json.dumps(test_payload)
)
print(response['Body'].read())
Check CloudWatch logs for the actual Python error. It's usually a missing import or wrong data type.
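Endpoint container logs land under /aws/sagemaker/Endpoints/<endpoint-name>; a quick way to follow them (aws logs tail needs CLI v2):
## Tail the endpoint's container logs for the real Python traceback
aws logs tail /aws/sagemaker/Endpoints/your-endpoint --since 1h --follow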
Multi-Model Endpoints from Hell
Models Fighting for Memory
You deployed 8 models to one endpoint to save money. Now random models fail with OutOfMemoryError and you can't predict which ones.
Quick fix - limit concurrent models:
## In the container definition you pass to create_model for the multi-model endpoint
'MultiModelConfig': {
    'ModelCacheSetting': 'Enabled',
    'MaxModels': 3  # Only keep 3 models in memory at once
}
Better fix - profile your models:
import psutil

def get_model_memory_usage(model_name):
    # Measure how much extra RSS the process uses after loading this model
    before = psutil.Process().memory_info().rss
    model = load_your_model(model_name)
    after = psutil.Process().memory_info().rss
    return (after - before) / 1024 / 1024  # MB

## Size your endpoint based on actual usage
total_memory = 0
for model in your_models:
    memory = get_model_memory_usage(model)
    total_memory += memory
    print(f"{model}: {memory:.1f} MB")
print(f"Total memory needed: {total_memory:.1f} MB")
Model Loading Timeouts
New models take 60+ seconds to load on first request. Users abandon your app.
Preload important models:
## In your inference.py
def model_fn(model_dir):
    # Load all critical models at startup
    critical_models = ['model1', 'model2', 'model3']
    loaded_models = {}
    for model_name in critical_models:
        loaded_models[model_name] = load_model(f"{model_dir}/{model_name}")
    return loaded_models

def predict_fn(input_data, models):
    model_name = input_data.get('model_name', 'default')
    if model_name in models:
        return models[model_name].predict(input_data)
    else:
        # Load on demand for non-critical models
        model = load_model_on_demand(model_name)
        return model.predict(input_data)
Real-Time Inference Performance Hell
Response Times Randomly Spike to 30+ Seconds
Your endpoint usually responds in 2 seconds but randomly takes 30+ seconds for the same request.
Usually it's garbage collection in Python - your model loads too much into memory. Could also be CPU-bound instances, auto-scaling kicking in, or the model loading from disk every damn time.
Quick profiling:
import time
import cProfile

def profile_inference(input_data):
    profiler = cProfile.Profile()
    profiler.enable()
    start_time = time.time()
    result = your_model.predict(input_data)
    end_time = time.time()
    profiler.disable()
    profiler.print_stats(sort='cumulative')
    print(f"Total inference time: {end_time - start_time:.2f} seconds")
    return result
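If the profile points at garbage collection rather than the model itself, one option is to stop the collector from running mid-request. A sketch, assuming you control the inference handler:
import gc

## Freeze objects that survived startup so the GC stops rescanning them (Python 3.7+),
## then collect explicitly at a controlled point instead of whenever the allocator decides
gc.collect()
gc.freeze()
gc.disable()

def predict_with_manual_gc(model, input_data):
    result = model.predict(input_data)
    gc.collect()  # Run collection at a predictable moment, not in the middle of inference
    return result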
Auto-Scaling Triggers Too Late
Your endpoint gets overwhelmed before auto-scaling kicks in. Users see 503 errors.
Fix the scaling policy:
import boto3

## Scaling policies go through Application Auto Scaling, not the SageMaker client
autoscaling = boto3.client('application-autoscaling')
autoscaling.put_scaling_policy(
    PolicyName='target-tracking-scaling',
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/your-endpoint/variant/your-variant',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 70.0,  # Scale out at ~70 invocations per instance per minute
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
        },
        'ScaleOutCooldown': 60,   # Scale out faster
        'ScaleInCooldown': 300    # Scale in slower
    }
)
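One gotcha: put_scaling_policy assumes the variant is already registered as a scalable target. If it isn't, register it first; a sketch with made-up min/max capacity:
autoscaling.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/your-endpoint/variant/your-variant',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=2,   # Keep enough headroom to absorb a spike while scaling out
    MaxCapacity=10
)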
IAM Permission Debugging (The Eternal Struggle)
"AccessDenied" with Zero Context
AWS error message: "AccessDenied: User is not authorized to perform X on resource Y"
Gee, thanks AWS. Super helpful.
Debug IAM step by step:
## 1. Who am I and what permissions do I have?
aws sts get-caller-identity
## 2. What policies are attached to my role?
aws iam list-attached-role-policies --role-name YourRoleName
## 3. What's in those policies? (grab the DefaultVersionId from aws iam get-policy - it isn't always v1)
aws iam get-policy-version --policy-arn your-policy-arn --version-id v1
## 4. Test specific actions
aws iam simulate-principal-policy \
--policy-source-arn your-role-arn \
--action-names sagemaker:InvokeEndpoint \
--resource-arns your-endpoint-arn
The nuclear option (for debugging only):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "*",
      "Resource": "*"
    }
  ]
}
If it works with this policy, gradually restrict permissions until you find the missing one.
VPC and Networking Nightmares
SageMaker Can't Access S3 in VPC Mode
Your training job works fine without VPC but fails when you add VPC configuration for security.
You need VPC endpoints:
## Create S3 VPC endpoint
aws ec2 create-vpc-endpoint \
--vpc-id vpc-12345678 \
--service-name com.amazonaws.region.s3 \
--route-table-ids rtb-12345678
## Create SageMaker API endpoint (this one is an interface endpoint, so say so explicitly)
aws ec2 create-vpc-endpoint \
--vpc-id vpc-12345678 \
--vpc-endpoint-type Interface \
--service-name com.amazonaws.region.sagemaker.api \
--subnet-ids subnet-12345678
Or use NAT Gateway (more expensive but simpler):
## Create NAT gateway for private subnets
aws ec2 create-nat-gateway \
--subnet-id subnet-12345678 \
--allocation-id eipalloc-12345678
Security Groups Blocking Everything
Your instances can't talk to each other or access external APIs.
Debug security group rules:
## Check what's allowed
aws ec2 describe-security-groups --group-ids sg-12345678
## Test network connectivity (this launches a throwaway instance - check its console output for the curl results)
aws ec2 run-instances \
--image-id ami-12345678 \
--instance-type t2.micro \
--security-group-ids sg-12345678 \
--user-data "#!/bin/bash
curl -I https://aws.amazon.com
curl -I https://api.openai.com"
Emergency fix (lock down later):
{
  "IpPermissions": [
    {
      "IpProtocol": "-1",
      "IpRanges": [{"CidrIp": "0.0.0.0/0"}]
    }
  ]
}
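To actually apply that rule, feed the JSON to authorize-security-group-ingress (assuming you saved it as allow-all.json; the group ID can live in the file or on the command line):
## Allow everything inbound - debugging only, lock it back down after
aws ec2 authorize-security-group-ingress --group-id sg-12345678 \
--cli-input-json file://allow-all.json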
Data Pipeline Failures
Batch Transform Jobs Failing Randomly
Your batch inference works for 100 files then fails on file 101 with no clear error.
Common causes:
- File format inconsistency - one file has different columns
- Memory issues - one file is much larger than others
- Encoding problems - file has weird characters
- Timeout issues - one file takes too long to process
Quick debugging:
## Process files individually to find the problem
import boto3

s3 = boto3.client('s3')

def debug_batch_files(s3_bucket, file_list):
    failed_files = []
    for file_key in file_list:
        try:
            # Download and do basic validation
            obj = s3.get_object(Bucket=s3_bucket, Key=file_key)
            data = obj['Body'].read()
            # Check file size
            size_mb = len(data) / 1024 / 1024
            if size_mb > 100:  # Files over 100MB often cause issues
                print(f"Large file: {file_key} ({size_mb:.1f} MB)")
            # Check encoding
            try:
                text = data.decode('utf-8')
            except UnicodeDecodeError:
                print(f"Encoding issue: {file_key}")
                failed_files.append(file_key)
                continue
            # Basic format validation
            if not validate_file_format(text):
                print(f"Format issue: {file_key}")
                failed_files.append(file_key)
        except Exception as e:
            print(f"Error with {file_key}: {e}")
            failed_files.append(file_key)
    return failed_files
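A hypothetical way to feed it the same file list your transform job sees, using an S3 paginator (bucket and prefix are placeholders):
## Build file_list from the transform job's input prefix
paginator = s3.get_paginator('list_objects_v2')
file_list = []
for page in paginator.paginate(Bucket='your-bucket', Prefix='batch-input/'):
    file_list.extend(obj['Key'] for obj in page.get('Contents', []))

bad_files = debug_batch_files('your-bucket', file_list)
print(f"{len(bad_files)} problem files: {bad_files}")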
Emergency Debugging Checklist
When everything's on fire and you need answers fast:
1. Check AWS Status (30 seconds)
curl -s https://status.aws.amazon.com/data.json | jq '.current_events'
2. Check Your Recent Changes (2 minutes)
## What did we deploy recently?
aws logs filter-log-events \
--log-group-name /aws/lambda/your-function \
--start-time $(date -d '2 hours ago' +%s)000
## Any IAM changes?
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=EventName,AttributeValue=AttachRolePolicy \
--start-time $(date -d '24 hours ago' +%s)
3. Check Quotas and Limits (1 minute)
## Quick quota check for common limits (note the quota codes belong to different services)
import boto3

service_quotas = boto3.client('service-quotas')
quotas_to_check = [
    ('sagemaker', 'L-1194D53C'),  # ml.p3.2xlarge instances
    ('bedrock', 'L-22C574D0'),    # Bedrock Claude requests per minute
    ('sagemaker', 'L-888C8DB6'),  # Real-time endpoints
]
for service_code, quota_code in quotas_to_check:
    response = service_quotas.get_service_quota(
        ServiceCode=service_code,
        QuotaCode=quota_code
    )
    print(f"{service_code}/{quota_code}: {response['Quota']['Value']}")
4. Check CloudWatch Errors (2 minutes)
## Recent errors across all services
aws logs filter-log-events \
--log-group-name /aws/sagemaker/TrainingJobs \
--start-time $(date -d '1 hour ago' +%s)000 \
--filter-pattern "ERROR"
aws logs filter-log-events \
--log-group-name /aws/lambda/your-function \
--start-time $(date -d '1 hour ago' +%s)000 \
--filter-pattern "ERROR"
5. Check Billing Spikes (1 minute)
## Did something start burning money?
from datetime import datetime, timedelta

import boto3

## Billing metrics only live in us-east-1, and only if billing alerts are enabled
cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')
response = cloudwatch.get_metric_statistics(
    Namespace='AWS/Billing',
    MetricName='EstimatedCharges',
    Dimensions=[{'Name': 'Currency', 'Value': 'USD'}],
    StartTime=datetime.utcnow() - timedelta(hours=24),
    EndTime=datetime.utcnow(),
    Period=3600,  # 1 hour
    Statistics=['Maximum']
)
datapoints = sorted(response['Datapoints'], key=lambda d: d['Timestamp'])
current_cost = datapoints[-1]['Maximum']
print(f"Current estimated charges: ${current_cost:.2f}")
The Reality of AWS AI/ML Production Support
After debugging dozens of production issues, here's what I've learned:
AWS support is slow unless you pay for Business/Enterprise support. The community forums and Stack Overflow often have faster answers for common issues.
Error messages are designed to be unhelpful. Always check CloudWatch logs first. The actual error is usually buried in there somewhere.
IAM permissions cause 60% of all production issues. When in doubt, it's probably IAM. The other 40% is usually quotas or regional availability.
Spot instance interruptions will ruin your weekend if you don't handle them properly. Always use checkpointing for training jobs longer than 30 minutes.
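Checkpointing is one extra block on the training job. A minimal sketch (paths are placeholders, it assumes the boto3 SageMaker client from earlier, and your training script still has to write and restore checkpoints from that local path):
sagemaker.create_training_job(
    TrainingJobName='my-spot-job',
    EnableManagedSpotTraining=True,
    CheckpointConfig={
        'S3Uri': 's3://your-bucket/checkpoints/my-spot-job/',
        'LocalPath': '/opt/ml/checkpoints'  # Default location SageMaker syncs to S3
    },
    StoppingCondition={
        'MaxRuntimeInSeconds': 86400,
        'MaxWaitTimeInSeconds': 90000  # Must be >= MaxRuntimeInSeconds when using spot
    },
    # ... AlgorithmSpecification, RoleArn, ResourceConfig, data channels
)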
Multi-region deployments add 10x complexity but are necessary for production reliability. Plan for things to break differently in each region.
The key to surviving AWS AI/ML in production is having good monitoring, automated rollback procedures, and the phone numbers of colleagues who've been through this shit before.
When your Bedrock endpoints are returning 500 errors during a board presentation, you don't want to be reading documentation. You want copy-paste solutions that work. That's what this guide provides - the debugging commands and fixes that actually work when production is melting down and everyone's staring at you waiting for an ETA.