The AWS AI/ML Disasters That Actually Happen (And How to Fix Them Fast)

Today is Friday, September 05, 2025. I'm writing this after spending 18 hours yesterday debugging why our Bedrock endpoints were randomly throwing 500 errors during a board demo. Turned out to be a quota limit we hit because someone forgot to request an increase for our new model. The CEO was not amused.

Here's the shit that actually breaks in production and how to fix it when you're under pressure.

SageMaker Training Jobs That Die Mysteriously

"UnexpectedStatusException" - The Error Message From Hell

This is AWS's way of saying "something went wrong but we're not gonna tell you what." I've seen this error more times than I care to count. The official SageMaker troubleshooting documentation barely scratches the surface.

What usually causes it:

  • Your training script crashed before SageMaker could surface a real error
  • Missing Python dependencies in the training container
  • The entry point script path is wrong
  • The execution role can't read your S3 training data (see the fix below)

Quick debugging steps:

## 1. Check CloudWatch logs first (this saves hours)
aws logs describe-log-groups --log-group-name-prefix /aws/sagemaker/TrainingJobs

## 2. Look at the actual error in CloudWatch
aws logs get-log-events --log-group-name /aws/sagemaker/TrainingJobs/your-job-name --log-stream-name your-job-name/algo-1-1234567890

## 3. Check if your IAM role can actually access S3
aws sts assume-role --role-arn your-sagemaker-execution-role-arn --role-session-name test

For comprehensive log analysis, check the CloudWatch logs documentation for SageMaker. When jobs fail without logs appearing, it's usually a pre-training configuration issue.
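If the log streams are empty, the job's FailureReason field sometimes carries more detail than the exception ever does. A minimal boto3 sketch (the job name is a placeholder):

import boto3

sm = boto3.client("sagemaker")

## DescribeTrainingJob often has the real error in FailureReason
job = sm.describe_training_job(TrainingJobName="your-job-name")
print(job["TrainingJobStatus"], job.get("SecondaryStatus"))
print(job.get("FailureReason", "No failure reason recorded"))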

The fix that works 90% of the time:
Your SageMaker execution role is missing permissions. Add this policy and try again:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::your-bucket/*",
                "arn:aws:s3:::your-bucket"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "*"
        }
    ]
}
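If you'd rather script it than click through the IAM console, here's a minimal sketch that attaches the JSON above as an inline policy; the role name and policy name are placeholders:

import json
import boto3

iam = boto3.client("iam")

## policy.json is the document above, saved locally
with open("policy.json") as f:
    policy_document = f.read()

iam.put_role_policy(
    RoleName="YourSageMakerExecutionRole",      # placeholder role name
    PolicyName="sagemaker-s3-and-logs-access",  # arbitrary inline policy name
    PolicyDocument=policy_document
)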

Training Jobs Stuck in "InProgress" Forever

Your job shows as running but hasn't done anything for 3 hours. This usually means:

  1. Spot instance got terminated (check CloudWatch events)
  2. Data loading is stuck (your S3 bucket is in a different region)
  3. Docker container won't start (base image is corrupted or wrong region)

Quick fix (following SageMaker best practices):

## Add timeout to your training jobs so they don't run forever
sagemaker.create_training_job(
    TrainingJobName='my-job',
    StoppingCondition={
        'MaxRuntimeInSeconds': 3600  # Kill it after 1 hour
    },
    # ... other params
)

For more advanced timeout and training configurations, check the SageMaker training compiler best practices and training job setup guide.
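Since spot interruptions are cause number 1 in the list above, it's worth pairing the timeout with managed spot training plus checkpointing so an interrupted job can resume instead of restarting from scratch. A hedged sketch of the relevant create_training_job fields, using the same client as above (bucket path is a placeholder):

sagemaker.create_training_job(
    TrainingJobName='my-job',
    EnableManagedSpotTraining=True,          # spot capacity, usually much cheaper
    CheckpointConfig={
        'S3Uri': 's3://your-bucket/checkpoints/my-job/',  # placeholder bucket
        'LocalPath': '/opt/ml/checkpoints'
    },
    StoppingCondition={
        'MaxRuntimeInSeconds': 3600,     # actual compute time
        'MaxWaitTimeInSeconds': 7200     # compute time plus time spent waiting for spot capacity
    },
    # ... other params (AlgorithmSpecification, RoleArn, ResourceConfig, etc.)
)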

"ResourceLimitExceeded" During Training

You hit an instance quota or your account can't spin up the GPU instances you requested.

## Check your service quotas
aws service-quotas get-service-quota \
    --service-code sagemaker \
    --quota-code L-1194D53C  # ml.p3.2xlarge instances

## Request quota increase (takes 2-5 business days)
aws service-quotas request-service-quota-increase \
    --service-code sagemaker \
    --quota-code L-1194D53C \
    --desired-value 10

For faster quota increases, use the Service Quotas console instead of the CLI. The AWS Support console method also works but takes longer.

Emergency workaround: Use a smaller instance type or switch to on-demand if you were using spot instances.
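One way to automate that workaround: try the instance types you can live with, in order, and fall back when the quota error comes back. A rough sketch, assuming your usual create_training_job arguments live in a base_params dict (the name is hypothetical):

import boto3
from botocore.exceptions import ClientError

sm = boto3.client('sagemaker')

def start_with_fallback(base_params, instance_types=('ml.p3.2xlarge', 'ml.g4dn.xlarge', 'ml.m5.2xlarge')):
    for instance_type in instance_types:
        params = dict(base_params)
        params['ResourceConfig'] = {
            'InstanceType': instance_type,
            'InstanceCount': 1,
            'VolumeSizeInGB': 50
        }
        try:
            sm.create_training_job(**params)
            print(f"Started on {instance_type}")
            return instance_type
        except ClientError as e:
            if e.response['Error']['Code'] == 'ResourceLimitExceeded':
                print(f"No quota for {instance_type}, trying the next one")
                continue
            raise
    raise RuntimeError("No instance type had available quota")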

Bedrock Models Randomly Failing

ThrottlingException During Peak Hours

Your app works fine during testing but starts throwing 429 errors when real users hit it. The official Bedrock API error codes documentation recommends exponential backoff with jitter.

Default quotas are pathetic:

  • Claude 3.5 Sonnet: something like 8k tokens/min, maybe less in some regions
  • Nova Pro: around 10k tokens/min if you're lucky
  • Other models (Llama, Mistral, Titan): similarly low, and the numbers change without warning

Check the current Bedrock quotas documentation for your specific region since these limits vary wildly.
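If you want the numbers for your account rather than the docs, the Service Quotas API can dump everything Bedrock-related. A quick sketch; the name filter is just an assumption about how the quota names are worded:

import boto3

sq = boto3.client('service-quotas')

## Page through all Bedrock quotas and print anything token/request related
paginator = sq.get_paginator('list_service_quotas')
for page in paginator.paginate(ServiceCode='bedrock'):
    for quota in page['Quotas']:
        name = quota['QuotaName']
        if 'tokens per minute' in name.lower() or 'requests per minute' in name.lower():
            print(f"{name}: {quota['Value']}")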

Quick fixes:

First, implement exponential backoff (should have done this from the start):

import time
import random
from botocore.exceptions import ClientError

def bedrock_with_retry(bedrock_call, max_retries=5):
    for attempt in range(max_retries):
        try:
            return bedrock_call()
        except ClientError as e:
            if e.response['Error']['Code'] == 'ThrottlingException':
                wait_time = (2 ** attempt) + random.uniform(0, 1)
                time.sleep(wait_time)
                continue
            else:
                raise e
    raise Exception("Max retries exceeded")

Then request quota increases using the Bedrock quota increase process and use multiple regions to spread load. The Boto3 retry documentation has more advanced retry strategies. Also ensure you have proper model access permissions and follow the getting started guide for API setup. For troubleshooting access issues, check the model access modification guide.

"ModelNotReadyException" - Cold Start Hell

First request to a Bedrock model after it's been idle takes 10-30 seconds. Your users think the app is broken. The official ModelNotReadyException troubleshooting guide recommends implementing heartbeat strategies.

Hacky but necessary fix - keep models warm:

import json
import threading
import time

import boto3

bedrock = boto3.client('bedrock-runtime')  # invoke_model lives on the runtime client

def keep_bedrock_warm():
    while True:
        try:
            # Cheapest possible request to keep model loaded
            bedrock.invoke_model(
                modelId='anthropic.claude-3-haiku-20240307-v1:0',
                body=json.dumps({
                    "anthropic_version": "bedrock-2023-05-31",
                    "max_tokens": 1,
                    "messages": [{"role": "user", "content": "hi"}]
                })
            )
            time.sleep(300)  # Every 5 minutes
        except Exception:
            time.sleep(60)  # Retry in 1 minute

## Start as daemon thread
threading.Thread(target=keep_bedrock_warm, daemon=True).start()

Costs about $5/month but saves your user experience.

Regional Availability Hell

Your app worked fine in us-east-1 during testing. Now you're deploying to eu-west-1 and nothing works because the model isn't available there.

Check model availability by region:

import boto3

def check_model_availability(model_id, region):
    try:
        bedrock = boto3.client('bedrock', region_name=region)
        response = bedrock.list_foundation_models()
        available_models = [model['modelId'] for model in response['modelSummaries']]
        return model_id in available_models
    except Exception as e:
        return False

## Check before deployment
regions = ['us-east-1', 'us-west-2', 'eu-west-1']
model = 'anthropic.claude-3-5-sonnet-20240620-v1:0'

for region in regions:
    available = check_model_availability(model, region)
    print(f"{model} in {region}: {'✓' if available else '✗'}")

SageMaker Endpoints That Won't Deploy

"EndpointCreationFailed" with No Useful Error

Your model trained fine but won't deploy to an endpoint. The error message is useless.

Most common causes:

  • Model artifact is corrupted - repackage and upload again
  • Docker container can't load the model - memory issues or wrong Python version
  • IAM role can't access the model artifact - permissions again, always permissions

Debug with a test endpoint:

## Deploy to smallest possible instance first
test_config = {
    'EndpointConfigName': 'debug-config',
    'ProductionVariants': [{
        'VariantName': 'debug-variant',
        'ModelName': 'your-model',
        'InitialInstanceCount': 1,
        'InstanceType': 'ml.t2.medium',  # Cheapest option for debugging
        'InitialVariantWeight': 1
    }]
}

sagemaker.create_endpoint_config(**test_config)

If it fails on ml.t2.medium, the problem is your model code. If it works, you have a resource issue.
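To actually spin it up and watch it, create the endpoint from that config and lean on the built-in waiter; a sketch with placeholder names:

sagemaker.create_endpoint(
    EndpointName='debug-endpoint',
    EndpointConfigName='debug-config'
)

## Blocks until the endpoint is InService (raises if it ends up Failed)
waiter = sagemaker.get_waiter('endpoint_in_service')
waiter.wait(EndpointName='debug-endpoint')

status = sagemaker.describe_endpoint(EndpointName='debug-endpoint')
print(status['EndpointStatus'], status.get('FailureReason', ''))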

Endpoint Deployed But Returns 500 Errors

The endpoint shows as "InService" but every request fails with ModelError.

This is usually your fault:

  • Your inference script has bugs
  • Model expects different input format than you're sending
  • Python dependencies are missing from the container

Quick debug:

## Test with the simplest possible input
import json

test_payload = {"instances": [{"data": "test"}]}

response = sagemaker_runtime.invoke_endpoint(
    EndpointName='your-endpoint',
    ContentType='application/json',
    Body=json.dumps(test_payload)
)

print(response['Body'].read())

Check CloudWatch logs for the actual Python error. It's usually a missing import or wrong data type.
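The same check from Python, if you'd rather script it: filter the endpoint's log group for errors in the last hour (the log group name pattern is assumed from the endpoint name):

import time
import boto3

logs = boto3.client('logs')

resp = logs.filter_log_events(
    logGroupName='/aws/sagemaker/Endpoints/your-endpoint',
    filterPattern='ERROR',
    startTime=int((time.time() - 3600) * 1000)  # last hour, in milliseconds
)
for event in resp['events']:
    print(event['message'])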

Multi-Model Endpoints from Hell

Models Fighting for Memory

You deployed 8 models to one endpoint to save money. Now random models fail with OutOfMemoryError and you can't predict which ones.

Quick fix - limit concurrent models:

## In your endpoint configuration
'MultiModelConfig': {
    'ModelCacheSetting': 'Enabled',
    'MaxModels': 3  # Only keep 3 models in memory at once
}

Better fix - profile your models:

import psutil

def get_model_memory_usage(model_name):
    # Measure the delta from loading, not the whole process footprint
    before = psutil.Process().memory_info().rss
    model = load_your_model(model_name)
    after = psutil.Process().memory_info().rss
    return (after - before) / 1024 / 1024  # MB

## Size your endpoint based on actual usage
total_memory = 0
for model in your_models:
    memory = get_model_memory_usage(model)
    total_memory += memory
    print(f"{model}: {memory:.1f} MB")

print(f"Total memory needed: {total_memory:.1f} MB")

Model Loading Timeouts

New models take 60+ seconds to load on first request. Users abandon your app.

Preload important models:

## In your inference.py
def model_fn(model_dir):
    # Load all critical models at startup
    critical_models = ['model1', 'model2', 'model3']
    
    loaded_models = {}
    for model_name in critical_models:
        loaded_models[model_name] = load_model(f"{model_dir}/{model_name}")
    
    return loaded_models

def predict_fn(input_data, models):
    model_name = input_data.get('model_name', 'default')
    if model_name in models:
        return models[model_name].predict(input_data)
    else:
        # Load on demand for non-critical models
        model = load_model_on_demand(model_name)
        return model.predict(input_data)

Real-Time Inference Performance Hell

Response Times Randomly Spike to 30+ Seconds

Your endpoint usually responds in 2 seconds but randomly takes 30+ seconds for the same request.

Usually it's garbage collection in Python - your model loads too much into memory. Could also be CPU-bound instances, auto-scaling kicking in, or the model loading from disk every damn time.

Quick profiling:

import time
import cProfile

def profile_inference(input_data):
    profiler = cProfile.Profile()
    profiler.enable()
    
    start_time = time.time()
    result = your_model.predict(input_data)
    end_time = time.time()
    
    profiler.disable()
    profiler.print_stats(sort='cumulative')
    
    print(f"Total inference time: {end_time - start_time:.2f} seconds")
    return result

Auto-Scaling Triggers Too Late

Your endpoint gets overwhelmed before auto-scaling kicks in. Users see 503 errors.

Fix the scaling policy:

autoscaling = boto3.client('application-autoscaling')  # scaling policies live here, not on the SageMaker client

autoscaling.put_scaling_policy(
    PolicyName='target-tracking-scaling',
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/your-endpoint/variant/your-variant',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 70.0,  # Scale at 70 invocations per instance (was probably much higher)
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
        },
        'ScaleOutCooldown': 60,  # Scale out faster
        'ScaleInCooldown': 300   # Scale in slower
    }
)
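The variant also has to be registered as a scalable target before any policy takes effect; a short sketch using the same client and placeholder names:

## Must be registered before the scaling policy applies
autoscaling.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/your-endpoint/variant/your-variant',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=2,    # keep a floor so spikes don't start from a single instance
    MaxCapacity=10
)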

IAM Permission Debugging (The Eternal Struggle)

"AccessDenied" with Zero Context

AWS error message: "AccessDenied: User is not authorized to perform X on resource Y"

Gee, thanks AWS. Super helpful.

Debug IAM step by step:

## 1. Who am I and what permissions do I have?
aws sts get-caller-identity

## 2. What policies are attached to my role?
aws iam list-attached-role-policies --role-name YourRoleName

## 3. What's in those policies?
aws iam get-policy-version --policy-arn your-policy-arn --version-id v1

## 4. Test specific actions
aws iam simulate-principal-policy \
    --policy-source-arn your-role-arn \
    --action-names sagemaker:InvokeEndpoint \
    --resource-arns your-endpoint-arn

The nuclear option (for debugging only):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "*",
            "Resource": "*"
        }
    ]
}

If it works with this policy, gradually restrict permissions until you find the missing one.
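The same simulation can be scripted with boto3 if you need to test a batch of actions at once; a rough sketch with placeholder ARNs and an arbitrary action list:

import boto3

iam = boto3.client('iam')

actions = ['sagemaker:InvokeEndpoint', 's3:GetObject', 'bedrock:InvokeModel']

resp = iam.simulate_principal_policy(
    PolicySourceArn='arn:aws:iam::123456789012:role/YourRole',  # placeholder
    ActionNames=actions
)
for result in resp['EvaluationResults']:
    print(result['EvalActionName'], result['EvalDecision'])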

VPC and Networking Nightmares

SageMaker Can't Access S3 in VPC Mode

Your training job works fine without VPC but fails when you add VPC configuration for security.

You need VPC endpoints:

## Create S3 VPC endpoint
aws ec2 create-vpc-endpoint \
    --vpc-id vpc-12345678 \
    --service-name com.amazonaws.region.s3 \
    --route-table-ids rtb-12345678

## Create SageMaker API endpoint
aws ec2 create-vpc-endpoint \
    --vpc-id vpc-12345678 \
    --service-name com.amazonaws.region.sagemaker.api \
    --subnet-ids subnet-12345678

Or use NAT Gateway (more expensive but simpler):

## Create NAT gateway for private subnets
aws ec2 create-nat-gateway \
    --subnet-id subnet-12345678 \
    --allocation-id eipalloc-12345678

Security Groups Blocking Everything

Your instances can't talk to each other or access external APIs.

Debug security group rules:

## Check what's allowed
aws ec2 describe-security-groups --group-ids sg-12345678

## Test network connectivity
aws ec2 run-instances \
    --image-id ami-12345678 \
    --instance-type t2.micro \
    --security-group-ids sg-12345678 \
    --user-data "#!/bin/bash
curl -I https://aws.amazon.com
curl -I https://api.openai.com"

Emergency fix (lock down later):

{
    "IpPermissions": [
        {
            "IpProtocol": "-1",
            "IpRanges": [{"CidrIp": "0.0.0.0/0"}]
        }
    ]
}

Data Pipeline Failures

Batch Transform Jobs Failing Randomly

Your batch inference works for 100 files then fails on file 101 with no clear error.

Common causes:

  1. File format inconsistency - one file has different columns
  2. Memory issues - one file is much larger than others
  3. Encoding problems - file has weird characters
  4. Timeout issues - one file takes too long to process

Quick debugging:

## Process files individually to find the problem
import boto3

s3 = boto3.client('s3')

def debug_batch_files(s3_bucket, file_list):
    failed_files = []
    
    for file_key in file_list:
        try:
            # Download and basic validation
            obj = s3.get_object(Bucket=s3_bucket, Key=file_key)
            data = obj['Body'].read()
            
            # Check file size
            size_mb = len(data) / 1024 / 1024
            if size_mb > 100:  # Files over 100MB often cause issues
                print(f"Large file: {file_key} ({size_mb:.1f} MB)")
            
            # Check encoding
            try:
                text = data.decode('utf-8')
            except UnicodeDecodeError:
                print(f"Encoding issue: {file_key}")
                failed_files.append(file_key)
                continue
                
            # Basic format validation
            if not validate_file_format(text):
                print(f"Format issue: {file_key}")
                failed_files.append(file_key)
                
        except Exception as e:
            print(f"Error with {file_key}: {e}")
            failed_files.append(file_key)
    
    return failed_files

Emergency Debugging Checklist

When everything's on fire and you need answers fast:

1. Check AWS Status (30 seconds)

curl -s https://status.aws.amazon.com/data.json | jq '.current_events'

2. Check Your Recent Changes (2 minutes)

## What did we deploy recently?
aws logs filter-log-events \
    --log-group-name /aws/lambda/your-function \
    --start-time $(date -d '2 hours ago' +%s)000

## Any IAM changes?
aws cloudtrail lookup-events \
    --lookup-attributes AttributeKey=EventName,AttributeValue=AttachRolePolicy \
    --start-time $(date -d '24 hours ago' +%s)

3. Check Quotas and Limits (1 minute)

## Quick quota check for common limits
import boto3

service_quotas = boto3.client('service-quotas')

quotas_to_check = [
    ('sagemaker', 'L-1194D53C'),  # ml.p3.2xlarge instances
    ('bedrock', 'L-22C574D0'),    # Bedrock Claude requests per minute
    ('sagemaker', 'L-888C8DB6'),  # Real-time endpoints
]

for service_code, quota_code in quotas_to_check:
    response = service_quotas.get_service_quota(
        ServiceCode=service_code,
        QuotaCode=quota_code
    )
    print(f"{quota_code}: {response['Quota']['Value']}")

4. Check CloudWatch Errors (2 minutes)

## Recent errors across all services
aws logs filter-log-events \
    --log-group-name /aws/sagemaker/TrainingJobs \
    --start-time $(date -d '1 hour ago' +%s)000 \
    --filter-pattern "ERROR"

aws logs filter-log-events \
    --log-group-name /aws/lambda/your-function \
    --start-time $(date -d '1 hour ago' +%s)000 \
    --filter-pattern "ERROR"

5. Check Billing Spikes (1 minute)

## Did something start burning money?
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')

response = cloudwatch.get_metric_statistics(
    Namespace='AWS/Billing',
    MetricName='EstimatedCharges',
    Dimensions=[{'Name': 'Currency', 'Value': 'USD'}],
    StartTime=datetime.utcnow() - timedelta(hours=24),
    EndTime=datetime.utcnow(),
    Period=3600,  # 1 hour
    Statistics=['Maximum']
)

## Datapoints aren't guaranteed to come back in order
datapoints = sorted(response['Datapoints'], key=lambda d: d['Timestamp'])
current_cost = datapoints[-1]['Maximum']
print(f"Current estimated charges: ${current_cost:.2f}")

The Reality of AWS AI/ML Production Support

After debugging dozens of production issues, here's what I've learned:

AWS support is slow unless you pay for Business/Enterprise support. The community forums and Stack Overflow often have faster answers for common issues.

Error messages are designed to be unhelpful. Always check CloudWatch logs first. The actual error is usually buried in there somewhere.

IAM permissions cause 60% of all production issues. When in doubt, it's probably IAM. The other 40% is usually quotas or regional availability.

Spot instance interruptions will ruin your weekend if you don't handle them properly. Always use checkpointing for training jobs longer than 30 minutes.

Multi-region deployments add 10x complexity but are necessary for production reliability. Plan for things to break differently in each region.

The key to surviving AWS AI/ML in production is having good monitoring, automated rollback procedures, and the phone numbers of colleagues who've been through this shit before.

When your Bedrock endpoints are returning 500 errors during a board presentation, you don't want to be reading documentation. You want copy-paste solutions that work. That's what this guide provides - the debugging commands and fixes that actually work when production is melting down and everyone's staring at you waiting for an ETA.

Emergency Debugging: Questions You Ask at 3am

Q

Why is my SageMaker training job stuck in "InProgress" for 3 hours with no logs?

A

Usually means your training script crashed immediately but SageMaker doesn't know it yet. Classic AWS - lying to you about what's actually happening.

Check CloudWatch logs:

aws logs describe-log-groups --log-group-name-prefix /aws/sagemaker/TrainingJobs
aws logs get-log-events --log-group-name /aws/sagemaker/TrainingJobs/your-job --log-stream-name your-stream

90% of the time it's some bullshit like:

  • Missing Python dependencies in your container
  • Entry point script path is wrong
  • IAM role can't access S3 (always the IAM)

Quick fix: Add MaxRuntimeInSeconds to kill stuck jobs:

StoppingCondition={'MaxRuntimeInSeconds': 3600}  # 1 hour max

Q

Bedrock is throwing 500 errors but worked fine yesterday. What changed?

A

First, check if AWS is having issues: https://status.aws.amazon.com/

If AWS status is green, you probably hit a quota limit. Could also be regional model availability changed, prompts too long, or IAM permissions expired. Always check quotas first.

Emergency debug:

## Test with minimal request
response = bedrock.invoke_model(
    modelId='anthropic.claude-3-haiku-20240307-v1:0',  # Cheapest model
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 10,
        "messages": [{"role": "user", "content": "hi"}]
    })
)

If this fails, it's infrastructure. If it works, your original request is the problem.

Q

My endpoint deployed successfully but returns ModelError on every request. How do I debug this?

A

Your inference code is broken.

The endpoint deployed because the container started, but your model loading/prediction code has bugs.

Check CloudWatch logs for the endpoint:

aws logs filter-log-events \
    --log-group-name /aws/sagemaker/Endpoints/your-endpoint \
    --start-time $(date -d '1 hour ago' +%s)000 \
    --filter-pattern "ERROR"

Common issues:

  • Missing model files in /opt/ml/model/
  • Wrong Python version or missing dependencies
  • Model expects different input format
  • Memory issues loading large models

Test with minimal payload:

import json

test_payload = {"data": "test"}

response = sagemaker_runtime.invoke_endpoint(
    EndpointName='your-endpoint',
    ContentType='application/json',
    Body=json.dumps(test_payload)
)

Q

How do I check if I'm hitting AWS service quotas?

A

import boto3

service_quotas = boto3.client('service-quotas')

## Common AI/ML quotas that cause problems
quotas_to_check = [
    ('sagemaker', 'L-1194D53C'),  # ml.p3.2xlarge instances
    ('sagemaker', 'L-888C8DB6'),  # Real-time endpoints
    ('bedrock', 'L-22C574D0'),    # Claude requests per minute
]

for service, quota_code in quotas_to_check:
    try:
        response = service_quotas.get_service_quota(
            ServiceCode=service,
            QuotaCode=quota_code
        )
        quota = response['Quota']
        print(f"{quota['QuotaName']}: {quota['Value']} {quota.get('Unit', '')}")
    except Exception as e:
        print(f"Error checking {service}/{quota_code}: {e}")

Q

My multi-model endpoint randomly fails with OutOfMemoryError. Which model is causing it?

A

Multi-model endpoints share memory across all models.

When one model uses too much RAM, others get killed.

Debug by checking model sizes:

## In your inference.py, add memory profiling
import psutil
import os

def model_fn(model_dir):
    model = load_your_model(model_dir)
    
    # Log memory usage after loading
    process = psutil.Process(os.getpid())
    memory_mb = process.memory_info().rss / 1024 / 1024
    print(f"Model loaded. Memory usage: {memory_mb:.1f} MB")
    return model

Quick fixes:

  • Reduce MaxModels in MultiModelConfig (limit concurrent models)
  • Use larger instance types (more RAM)
  • Move biggest models to separate endpoints
  • Add model unloading logic for unused models
Q

IAM permissions are fucked and I don't know which policy is wrong. Help.

A

Use the IAM policy simulator to test specific actions:

aws iam simulate-principal-policy \
    --policy-source-arn arn:aws:iam::123456789012:role/YourRole \
    --action-names sagemaker:InvokeEndpoint \
    --resource-arns arn:aws:sagemaker:us-east-1:123456789012:endpoint/your-endpoint

For emergency debugging, temporarily attach this overly permissive policy:

{
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "*",
        "Resource": "*"
    }]
}

If it works with this policy, the issue is permissions. Gradually restrict until you find the missing permission. Never leave the * policy in production.

Q

My Bedrock requests work in us-east-1 but fail in eu-west-1. Same exact code.

A

Not all models are available in all regions. Check model availability:

def check_model_availability(model_id, region):
    bedrock = boto3.client('bedrock', region_name=region)
    try:
        response = bedrock.list_foundation_models()
        available = [m['modelId'] for m in response['modelSummaries']]
        return model_id in available
    except Exception:
        return False

model = 'anthropic.claude-3-5-sonnet-20240620-v1:0'
regions = ['us-east-1', 'us-west-2', 'eu-west-1']

for region in regions:
    available = check_model_availability(model, region)
    print(f"{model} in {region}: {'Available' if available else 'NOT AVAILABLE'}")

Fallback strategy:

def bedrock_with_fallback(prompt, preferred_regions=['us-east-1', 'us-west-2']):
    for region in preferred_regions:
        try:
            bedrock = boto3.client('bedrock-runtime', region_name=region)
            response = bedrock.invoke_model(...)
            return response
        except Exception as e:
            print(f"Failed in {region}: {e}")
            continue
    raise Exception("All regions failed")

Q

Training job failed with "ResourceLimitExceeded". I thought I had quota for this instance type?

A

You might have quota for the instance type but not for the number of instances you requested, or you're hitting account limits.

Check current usage:

## See what's currently running
aws sagemaker list-training-jobs --status-equals InProgress

## Check specific quota
aws service-quotas get-service-quota \
    --service-code sagemaker \
    --quota-code L-1194D53C  # ml.p3.2xlarge instances

Emergency workarounds:

  • Use smaller instance types (ml.m5.xlarge instead of ml.p3.8xlarge)
  • Reduce InstanceCount in your training job
  • Try spot instances (they have separate quotas)
  • Split your training data and run multiple smaller jobs
Q

My endpoint auto-scaling isn't working. Traffic spikes and users get 503 errors.

A

Auto-scaling takes 3-5 minutes to spin up new instances. By then, users have already left.

Fix the scaling policy (these calls go through the Application Auto Scaling client):

## Scale out faster, scale in slower
autoscaling = boto3.client('application-autoscaling')

autoscaling.put_scaling_policy(
    PolicyName='faster-scale-out',
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/your-endpoint/variant/your-variant',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 50.0,      # Scale at 50 invocations per instance instead of 80
        'ScaleOutCooldown': 60,   # Scale out after 1 minute
        'ScaleInCooldown': 900,   # Scale in after 15 minutes
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
        }
    }
)

Or maintain minimum capacity:

autoscaling.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/your-endpoint/variant/your-variant',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=2,   # Always keep 2 instances running
    MaxCapacity=10
)

Q

Batch transform job processes first 1000 files fine then starts failing randomly. Why?

A

Usually one of these:

  1. Inconsistent file formats - file #1001 has different columns than files 1-1000
  2. Memory leak - your inference code accumulates memory over time
  3. Large files - some files are much bigger and cause timeouts
  4. Encoding issues - a random file has weird characters

Debug by processing files individually:

## Find the problem file
def debug_failed_files(input_s3_path, output_s3_path):
    # List all input files
    input_files = list_s3_files(input_s3_path)
    
    # List successful outputs
    output_files = list_s3_files(output_s3_path)
    successful_files = [f.replace('.out', '') for f in output_files]
    
    # Find missing files
    failed_files = [f for f in input_files if f not in successful_files]
    print(f"Failed files: {failed_files}")
    
    # Test each failed file individually
    for failed_file in failed_files[:5]:  # Test first 5
        try:
            # Download and inspect
            obj = s3.get_object(Bucket=bucket, Key=failed_file)
            content = obj['Body'].read()
            print(f"{failed_file}: {len(content)} bytes")
            
            # Check if it's valid JSON/CSV/whatever format you expect
            validate_file_format(content)
        except Exception as e:
            print(f"Problem with {failed_file}: {e}")
Q

My VPC-enabled SageMaker training job can't access S3 and keeps timing out.

A

VPC mode blocks internet access by default. You need VPC endpoints or a NAT gateway.

Quick check:

## Do you have an S3 VPC endpoint?
aws ec2 describe-vpc-endpoints \
    --filters "Name=service-name,Values=com.amazonaws.us-east-1.s3"

## Do you have a SageMaker API endpoint?
aws ec2 describe-vpc-endpoints \
    --filters "Name=service-name,Values=com.amazonaws.us-east-1.sagemaker.api"

Emergency fix (create S3 VPC endpoint):

aws ec2 create-vpc-endpoint \
    --vpc-id vpc-12345678 \
    --service-name com.amazonaws.us-east-1.s3 \
    --route-table-ids rtb-12345678

Or use a NAT gateway (costs more but simpler):

## Your private subnets need routes to the NAT gateway
aws ec2 create-route \
    --route-table-id rtb-private \
    --destination-cidr-block 0.0.0.0/0 \
    --nat-gateway-id nat-12345678

Q

How do I know if my AWS bill spike is from a runaway job vs normal usage?

A

Check CloudWatch billing metrics by service:

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')

## Get costs by service for the last 24 hours
services = ['SageMaker', 'Bedrock', 'EC2-Instance', 'S3']

for service in services:
    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/Billing',
        MetricName='EstimatedCharges',
        Dimensions=[
            {'Name': 'ServiceName', 'Value': service},
            {'Name': 'Currency', 'Value': 'USD'}
        ],
        StartTime=datetime.utcnow() - timedelta(days=1),
        EndTime=datetime.utcnow(),
        Period=3600,  # 1 hour intervals
        Statistics=['Maximum']
    )
    if response['Datapoints']:
        latest_cost = response['Datapoints'][-1]['Maximum']
        print(f"{service}: ${latest_cost:.2f}")

Look for sudden spikes. If SageMaker cost jumped from $10 to $500 overnight, you probably left a training job or endpoint running.

Find expensive resources:

## Running training jobs
aws sagemaker list-training-jobs --status-equals InProgress

## Active endpoints
aws sagemaker list-endpoints --status-equals InService

## EC2 instances (check for accidentally created ones)
aws ec2 describe-instances --filters "Name=instance-state-name,Values=running"

Q

Everything was working fine until I updated my IAM policy. Now nothing works. How do I rollback IAM changes?

A

Check CloudTrail for recent IAM changes:

aws cloudtrail lookup-events \
    --lookup-attributes AttributeKey=EventName,AttributeValue=AttachRolePolicy \
    --start-time $(date -d '24 hours ago' +%s) \
    --end-time $(date +%s)

See what policies were modified:

aws cloudtrail lookup-events \
    --lookup-attributes AttributeKey=EventName,AttributeValue=PutRolePolicy \
    --start-time $(date -d '24 hours ago' +%s)

Rollback process:

  1. Find the previous policy version in the CloudTrail events
  2. Revert to the working policy (save the broken one first)
  3. Test immediately with a simple API call

For managed policies, check version history:

aws iam list-policy-versions --policy-arn your-policy-arn
aws iam set-default-policy-version --policy-arn your-policy-arn --version-id v1

Q

My model accuracy suddenly dropped in production but works fine in development. What's different?

A

This is usually data distribution shift - your production data looks different than your training data.

Quick checks:

  1. Input preprocessing - are you scaling/normalizing the same way?
  2. Data types - int vs float can break models silently
  3. Missing features - production might be missing columns your model expects
  4. Encoding differences - UTF-8 vs ASCII issues
  5. Time zones - datetime features can shift between environments

Debug with data profiling:

## Compare production inputs to training data
def profile_input_data(production_sample):
    print("Production data profile:")
    print(f"Shape: {production_sample.shape}")
    print(f"Data types: {production_sample.dtypes}")
    print(f"Missing values: {production_sample.isnull().sum()}")
    print(f"Numeric ranges: {production_sample.describe()}")
    # Compare to your training data stats
    # Flag significant differences
Q

My Lambda function calling Bedrock works locally but times out in AWS. Why?

A

Lambda cold starts + Bedrock cold starts = disaster.

Your function is timing out waiting for the model to warm up.

Solutions:

  1. Increase the Lambda timeout to 5-10 minutes (the hard limit is 15)
  2. Use provisioned concurrency for Lambda (costs more)
  3. Keep Bedrock models warm with scheduled pings
  4. Add retry logic with exponential backoff

import time

def lambda_handler(event, context):
    max_retries = 3
    for attempt in range(max_retries):
        try:
            response = bedrock.invoke_model(...)
            return response
        except Exception as e:
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt  # 1, 2, 4 seconds
                time.sleep(wait_time)
                continue
            raise e

Also check Lambda memory allocation - the Bedrock SDK needs decent memory (512MB+).
Q

How do I get AWS support to actually help with AI/ML issues instead of sending me documentation links?

A
  1. Upgrade to Business or Enterprise support - Basic support is useless for production issues
  2. Provide specific error messages and AWS request IDs from CloudTrail
  3. Include exact reproduction steps with code samples
  4. Mention business impact ("affecting 50,000 users" gets faster response)
  5. Use severity levels correctly - don't cry wolf with "urgent" for everything

For faster help:

  • AWS ML community Slack
  • Stack Overflow with aws-sagemaker or amazon-bedrock tags
  • GitHub issues on aws-samples repositories

AWS support is good for account limits and billing issues. For technical problems, the community often has better answers.

Comparison Table

| Problem Type | Symptoms | Most Likely Cause (80% of cases) | Quick Debug Command | Emergency Fix | Time to Fix |
|---|---|---|---|---|---|
| SageMaker Training Stuck | InProgress for hours, no logs | IAM role can't access S3 training data | aws logs get-log-events --log-group-name /aws/sagemaker/TrainingJobs/job-name | Add S3 permissions to execution role | 10 minutes |
| Bedrock 500 Errors | Worked yesterday, failing today | Hit service quota limits | aws service-quotas get-service-quota --service-code bedrock --quota-code L-22C574D0 | Request quota increase or retry with backoff | 2-5 business days |
| Endpoint ModelError | Deployed successfully but 500s on requests | Inference code has bugs | aws logs filter-log-events --log-group-name /aws/sagemaker/Endpoints/endpoint-name --filter-pattern ERROR | Test with minimal payload, check Python dependencies | 30 minutes |
| Training ResourceLimitExceeded | Can't start training job | Account quota hit for GPU instances | aws sagemaker list-training-jobs --status-equals InProgress | Use smaller instance type or request quota increase | Immediate or 2-5 days |
| Multi-Model OOM | Random models fail with OutOfMemoryError | Too many models loaded simultaneously | Check CloudWatch memory metrics | Reduce MaxModels in config or use larger instances | 15 minutes |
| VPC Training Timeouts | Training job hangs trying to access S3 | No VPC endpoints configured | aws ec2 describe-vpc-endpoints --filters "Name=service-name,Values=com.amazonaws.region.s3" | Create S3 VPC endpoint or NAT gateway | 20 minutes |
| Bedrock Cold Starts | First request takes 15+ seconds | Model goes idle after 5 minutes | Test with keep-alive ping script | Implement model warming with cron job | 30 minutes |
| Auto-scaling Too Slow | Users get 503 during traffic spikes | Scaling triggers too late | Check CloudWatch auto-scaling events | Lower scaling threshold to 50% | 10 minutes |
| Regional Failures | Works in us-east-1, fails in eu-west-1 | Model not available in that region | aws bedrock list-foundation-models --region eu-west-1 | Use different region or different model | 5 minutes |
| IAM AccessDenied | Vague permission errors | Missing specific IAM action | aws iam simulate-principal-policy --policy-source-arn role-arn --action-names specific-action | Temporarily add wildcard permissions, then narrow down | 45 minutes |
| Batch Transform Partial Failure | Processes 1000 files, fails randomly after | Inconsistent file formats or sizes | Process failed files individually | Add file validation, handle exceptions gracefully | 2 hours |
| Endpoint 503 Errors | Intermittent service unavailable | Insufficient capacity or scaling issues | Check endpoint CloudWatch metrics for CPU/memory | Increase instance count or enable auto-scaling | 15 minutes |
| Lambda Bedrock Timeout | Works locally, times out in AWS | Cold start + model loading time | Increase Lambda timeout and memory | Use provisioned concurrency or async processing | 20 minutes |
| Cost Spike Overnight | AWS bill jumped 10x unexpectedly | Left expensive resource running | aws sagemaker list-training-jobs --status-equals InProgress | Kill running jobs, delete idle endpoints | 10 minutes |

Advanced Debugging Techniques: When Standard Troubleshooting Fails

CloudWatch Deep Diving for AI/ML Services

When basic debugging doesn't reveal the issue, you need to dig deeper into CloudWatch metrics and logs. Here's how to extract useful information when AWS gives you cryptic errors. The CloudWatch Insights documentation covers query syntax, but here are the practical queries that actually work.

Custom CloudWatch Queries for AI/ML Debugging

Most engineers just check basic metrics, but the real debugging happens with custom queries. For comprehensive profiling, check the SageMaker Debugger profiling guide.

## Find all SageMaker errors in the last hour with context
aws logs start-query \
    --log-group-name \"/aws/sagemaker/TrainingJobs\" \
    --start-time $(date -d '1 hour ago' +%s) \
    --end-time $(date +%s) \
    --query-string '
        fields @timestamp, @message
        | filter @message like /ERROR/
        | sort @timestamp desc
        | limit 100'

## Get the query ID from the response, then:
aws logs get-query-results --query-id your-query-id

Advanced Bedrock debugging query:

## Track token usage patterns that might reveal quota issues
aws logs start-query \
    --log-group-name \"/aws/bedrock\" \
    --start-time $(date -d '24 hours ago' +%s) \
    --end-time $(date +%s) \
    --query-string '
        fields @timestamp, requestId, inputTokens, outputTokens
        | filter ispresent(inputTokens)
        | stats avg(inputTokens), avg(outputTokens), count() by bin(5m)
        | sort @timestamp'
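If you want to run these queries from a script instead of copying query IDs around, a small polling wrapper over the Logs Insights API does the job. A sketch; the log group and one-hour window are assumptions:

import time
import boto3

logs = boto3.client('logs')

def run_insights_query(log_group, query, hours=1):
    end = int(time.time())
    start = end - hours * 3600
    query_id = logs.start_query(
        logGroupName=log_group,
        startTime=start,
        endTime=end,
        queryString=query
    )['queryId']

    # Poll until the query reaches a terminal state
    while True:
        result = logs.get_query_results(queryId=query_id)
        if result['status'] in ('Complete', 'Failed', 'Cancelled'):
            return result
        time.sleep(2)

results = run_insights_query(
    '/aws/sagemaker/TrainingJobs',
    'fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20'
)
for row in results['results']:
    print({field['field']: field['value'] for field in row})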

Performance Profiling in Production

When your AI models are running slow but you can't reproduce locally, you need production profiling.

## Add this to your SageMaker inference code
import time
import json
import psutil
import threading
from collections import defaultdict

class ProductionProfiler:
    def __init__(self):
        self.metrics = defaultdict(list)
        self.lock = threading.Lock()
    
    def profile_function(self, func_name):
        def decorator(func):
            def wrapper(*args, **kwargs):
                start_time = time.time()
                start_memory = psutil.Process().memory_info().rss / 1024 / 1024
                
                try:
                    result = func(*args, **kwargs)
                    success = True
                except Exception as e:
                    result = None
                    success = False
                    print(f\"Function {func_name} failed: {e}\")
                
                end_time = time.time()
                end_memory = psutil.Process().memory_info().rss / 1024 / 1024
                
                with self.lock:
                    self.metrics[func_name].append({
                        'duration': end_time - start_time,
                        'memory_delta': end_memory - start_memory,
                        'success': success,
                        'timestamp': time.time()
                    })
                
                # Log slow operations
                duration = end_time - start_time
                if duration > 5.0:  # Slower than 5 seconds
                    print(f\"SLOW: {func_name} took {duration:.2f}s\")
                
                return result
            return wrapper
        return decorator
    
    def get_stats(self):
        stats = {}
        with self.lock:
            for func_name, measurements in self.metrics.items():
                if not measurements:
                    continue
                    
                durations = [m['duration'] for m in measurements if m['success']]
                if durations:
                    stats[func_name] = {
                        'avg_duration': sum(durations) / len(durations),
                        'max_duration': max(durations),
                        'call_count': len(measurements),
                        'success_rate': sum(m['success'] for m in measurements) / len(measurements)
                    }
        return stats

## Use it in your inference code
profiler = ProductionProfiler()

@profiler.profile_function(\"model_loading\")
def load_model():
    # Your model loading code
    pass

@profiler.profile_function(\"preprocessing\") 
def preprocess_data(data):
    # Your preprocessing code
    pass

@profiler.profile_function(\"prediction\")
def predict(model, data):
    # Your prediction code  
    pass

## In your handler, log stats periodically
def model_fn(model_dir):
    model = load_model()
    
    # Log stats every 100 requests
    def log_stats():
        while True:
            time.sleep(300)  # Every 5 minutes
            stats = profiler.get_stats()
            print(f\"PROFILER_STATS: {json.dumps(stats)}\")
    
    threading.Thread(target=log_stats, daemon=True).start()
    return model

Network Debugging for VPC Issues

VPC networking problems are the hardest to debug because error messages are useless. Here's how to systematically diagnose network issues.

## Add this test to your SageMaker container or Lambda function
import socket
import urllib.request
import json

def network_diagnostics():
    \"\"\"Run comprehensive network tests from inside AWS infrastructure\"\"\"
    results = {}
    
    # Test DNS resolution
    try:
        socket.gethostbyname('s3.amazonaws.com')
        results['dns_s3'] = 'OK'
    except Exception as e:
        results['dns_s3'] = f'FAILED: {e}'
    
    # Test S3 connectivity
    try:
        response = urllib.request.urlopen('https://s3.amazonaws.com', timeout=10)
        results['s3_https'] = f'OK: {response.getcode()}'
    except Exception as e:
        results['s3_https'] = f'FAILED: {e}'
    
    # Test Bedrock connectivity
    try:
        # This will fail with auth error but proves connectivity
        urllib.request.urlopen('https://bedrock-runtime.us-east-1.amazonaws.com', timeout=10)
    except urllib.error.HTTPError as e:
        if e.code == 403:
            results['bedrock_connectivity'] = 'OK: Reachable (403 expected)'
        else:
            results['bedrock_connectivity'] = f'FAILED: HTTP {e.code}'
    except Exception as e:
        results['bedrock_connectivity'] = f'FAILED: {e}'
    
    # Test metadata service (should work from EC2/SageMaker)
    try:
        response = urllib.request.urlopen('http://169.254.169.254/latest/meta-data/', timeout=5)
        results['metadata_service'] = 'OK'
    except Exception as e:
        results['metadata_service'] = f'FAILED: {e}'
    
    print(f\"NETWORK_DIAGNOSTICS: {json.dumps(results, indent=2)}\")
    return results

## Run this from your failing container/function
network_diagnostics()

Memory Leak Detection in Long-Running Services

AI models can have subtle memory leaks that only appear under production load. Here's how to catch them.

import psutil
import gc
import threading
import time
from collections import deque

class MemoryTracker:
    def __init__(self, alert_threshold_mb=1000):
        self.measurements = deque(maxlen=100)  # Keep last 100 measurements
        self.alert_threshold = alert_threshold_mb
        self.baseline = None
        self.running = True
        
    def start_monitoring(self):
        def monitor():
            while self.running:
                process = psutil.Process()
                memory_mb = process.memory_info().rss / 1024 / 1024
                
                self.measurements.append({
                    'timestamp': time.time(),
                    'memory_mb': memory_mb,
                    'gc_count': sum(gc.get_count())
                })
                
                if self.baseline is None:
                    self.baseline = memory_mb
                
                # Alert if memory grew significantly
                if memory_mb > self.baseline + self.alert_threshold:
                    print(f\"MEMORY_LEAK_ALERT: {memory_mb:.1f}MB (baseline: {self.baseline:.1f}MB)\")
                    
                    # Force garbage collection
                    collected = gc.collect()
                    new_memory = psutil.Process().memory_info().rss / 1024 / 1024
                    print(f\"GC collected {collected} objects, memory now: {new_memory:.1f}MB\")
                
                time.sleep(60)  # Check every minute
        
        threading.Thread(target=monitor, daemon=True).start()
    
    def get_trend(self):
        if len(self.measurements) < 10:
            return \"insufficient_data\"
        
        recent = list(self.measurements)[-10:]
        old = list(self.measurements)[:10]
        
        recent_avg = sum(m['memory_mb'] for m in recent) / len(recent)
        old_avg = sum(m['memory_mb'] for m in old) / len(old)
        
        growth_rate = (recent_avg - old_avg) / old_avg * 100
        
        if growth_rate > 20:
            return \"likely_leak\"
        elif growth_rate > 10:
            return \"possible_leak\"  
        else:
            return \"stable\"

## Add to your model server startup
memory_tracker = MemoryTracker(alert_threshold_mb=500)
memory_tracker.start_monitoring()

## Check trend periodically
def check_memory_health():
    trend = memory_tracker.get_trend()
    if trend != \"stable\":
        print(f\"MEMORY_TREND_ALERT: {trend}\")

Distributed Tracing for Multi-Service AI Pipelines

When your AI pipeline spans multiple services (Lambda → SageMaker → Bedrock), you need distributed tracing to see where things break.

import boto3
import json
import time
import uuid
from datetime import datetime

class AITracer:
    def __init__(self, service_name):
        self.service_name = service_name
        self.cloudwatch = boto3.client('cloudwatch')
        self.logs = boto3.client('logs')  # put_log_events lives on the Logs client
        
    def start_trace(self, operation_name, trace_id=None):
        if trace_id is None:
            trace_id = str(uuid.uuid4())
        
        span = {
            'trace_id': trace_id,
            'span_id': str(uuid.uuid4()),
            'service': self.service_name,
            'operation': operation_name,
            'start_time': time.time(),
            'end_time': None,
            'success': None,
            'error': None,
            'metadata': {}
        }
        
        return AISpan(span, self.cloudwatch, self.logs)

class AISpan:
    def __init__(self, span_data, cloudwatch, logs):
        self.span = span_data
        self.cloudwatch = cloudwatch
        self.logs = logs
        
    def add_metadata(self, key, value):
        self.span['metadata'][key] = value
        
    def finish(self, success=True, error=None):
        self.span['end_time'] = time.time()
        self.span['success'] = success
        self.span['error'] = str(error) if error else None
        self.span['duration'] = self.span['end_time'] - self.span['start_time']
        
        # Log to CloudWatch Logs (the log group and stream must already exist)
        self.logs.put_log_events(
            logGroupName='/aws/ai-tracing',
            logStreamName=f"{self.span['service']}-{datetime.now().strftime('%Y-%m-%d')}",
            logEvents=[{
                'timestamp': int(self.span['end_time'] * 1000),
                'message': json.dumps(self.span)
            }]
        )
        
        # Also send metrics
        self.cloudwatch.put_metric_data(
            Namespace='AI/Tracing',
            MetricData=[
                {
                    'MetricName': 'Duration',
                    'Dimensions': [
                        {'Name': 'Service', 'Value': self.span['service']},
                        {'Name': 'Operation', 'Value': self.span['operation']}
                    ],
                    'Value': self.span['duration'],
                    'Unit': 'Seconds'
                },
                {
                    'MetricName': 'Success',
                    'Dimensions': [
                        {'Name': 'Service', 'Value': self.span['service']},
                        {'Name': 'Operation', 'Value': self.span['operation']}
                    ],
                    'Value': 1 if success else 0,
                    'Unit': 'Count'
                }
            ]
        )

## Usage across your pipeline
tracer = AITracer('bedrock-service')

def call_bedrock_with_tracing(prompt, trace_id=None):
    span = tracer.start_trace('bedrock_inference', trace_id)
    span.add_metadata('prompt_length', len(prompt))
    
    try:
        response = bedrock.invoke_model(...)
        span.add_metadata('response_tokens', count_tokens(response))
        span.finish(success=True)
        return response
    except Exception as e:
        span.finish(success=False, error=e)
        raise

## Pass trace_id between services
def lambda_handler(event, context):
    trace_id = event.get('trace_id', str(uuid.uuid4()))
    
    # Add trace_id to all downstream calls
    sagemaker_response = call_sagemaker_endpoint(data, trace_id=trace_id)
    bedrock_response = call_bedrock_with_tracing(prompt, trace_id=trace_id)
    
    return {'trace_id': trace_id, 'result': result}

Database Query for Pattern Analysis

When you have thousands of AI requests, patterns emerge that aren't visible in individual errors.

-- CloudWatch Insights queries for pattern analysis

-- Find requests that consistently fail at the same step
fields @timestamp, @message
| filter @message like /ERROR/
| parse @message /ERROR in (?<component>\w+):/
| stats count() by component
| sort count desc

-- Find slow requests by model type
fields @timestamp, model_id, duration
| filter ispresent(duration)
| stats avg(duration), max(duration), count() by model_id
| sort avg desc

-- Identify quota exhaustion patterns  
fields @timestamp, @message
| filter @message like /ThrottlingException/
| parse @message /model.*?(?<model_name>[a-zA-Z0-9.-]+)/
| stats count() by bin(5m), model_name
| sort @timestamp

-- Find memory-related failures
fields @timestamp, @message
| filter @message like /OutOfMemory/ or @message like /MemoryError/
| parse @message /instance.*?(?<instance_type>ml\.[a-z0-9.]+)/
| stats count() by instance_type
| sort count desc

Automated Alerting for Complex Failure Patterns

Set up alerts that catch problems before they become disasters.

import boto3
import json

def create_ai_failure_alerts():
    cloudwatch = boto3.client('cloudwatch')
    
    # Alert on increasing error rates
    cloudwatch.put_metric_alarm(
        AlarmName='AI-Error-Rate-Spike',
        ComparisonOperator='GreaterThanThreshold',
        EvaluationPeriods=2,
        MetricName='Invocation5XXErrors',  # per-endpoint alarms also need EndpointName/VariantName dimensions
        Namespace='AWS/SageMaker',
        Period=300,
        Statistic='Sum',
        Threshold=10,
        ActionsEnabled=True,
        AlarmActions=['arn:aws:sns:us-east-1:123456789012:ai-alerts'],
        AlarmDescription='AI service error rate spike detected'
    )
    
    # Alert on slow inference
    cloudwatch.put_metric_alarm(
        AlarmName='AI-Response-Time-High',
        ComparisonOperator='GreaterThanThreshold', 
        EvaluationPeriods=3,
        MetricName='ModelLatency',
        Namespace='AWS/SageMaker',
        Period=300,
        Statistic='Average',
        Threshold=10000,  # 10 seconds
        ActionsEnabled=True,
        AlarmActions=['arn:aws:sns:us-east-1:123456789012:ai-alerts'],
        AlarmDescription='AI inference taking too long'
    )
    
    # Alert on quota exhaustion
    cloudwatch.put_metric_alarm(
        AlarmName='Bedrock-Quota-Exhaustion',
        ComparisonOperator='GreaterThanThreshold',
        EvaluationPeriods=1,
        MetricName='InvocationThrottles',
        Namespace='AWS/Bedrock',
        Period=300,
        Statistic='Sum', 
        Threshold=5,
        ActionsEnabled=True,
        AlarmActions=['arn:aws:sns:us-east-1:123456789012:ai-alerts'],
        AlarmDescription='Bedrock requests being throttled'
    )

The Nuclear Options: When Everything Else Fails

Complete Service Recreation

Sometimes the fastest fix is burning it down and rebuilding:

## Save current configuration before destroying
def backup_endpoint_config(endpoint_name):
    sagemaker = boto3.client('sagemaker')
    
    endpoint = sagemaker.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint['EndpointConfigName']
    
    config = sagemaker.describe_endpoint_config(EndpointConfigName=config_name)
    
    # Save to S3 for recreation
    backup_data = {
        'endpoint_config': config,
        'endpoint': endpoint,
        'timestamp': datetime.now().isoformat()
    }
    
    s3 = boto3.client('s3')
    s3.put_object(
        Bucket='your-backup-bucket',
        Key=f'endpoint-backups/{endpoint_name}-{int(time.time())}.json',
        Body=json.dumps(backup_data, default=str)
    )
    
    return backup_data

def nuclear_endpoint_recreation(endpoint_name):
    \"\"\"Last resort: delete and recreate the endpoint\"\"\"
    sagemaker = boto3.client('sagemaker')
    
    # 1. Backup configuration
    backup_data = backup_endpoint_config(endpoint_name)
    
    # 2. Delete endpoint
    sagemaker.delete_endpoint(EndpointName=endpoint_name)
    
    # 3. Wait for deletion
    waiter = sagemaker.get_waiter('endpoint_deleted')
    waiter.wait(EndpointName=endpoint_name)
    
    # 4. Recreate with same configuration
    config = backup_data['endpoint_config']
    
    sagemaker.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=config['EndpointConfigName']
    )
    
    print(f\"Recreating endpoint {endpoint_name}. This will take 5-10 minutes.\")

Account-Level Resource Reset

When permissions are completely fucked and you can't figure out what's wrong:

#!/bin/bash
## Nuclear option: reset all AI/ML related IAM roles

echo \"WARNING: This will delete and recreate all SageMaker execution roles\"
read -p \"Are you sure? (type 'yes'): \" confirm

if [ \"$confirm\" != \"yes\" ]; then
    echo \"Aborted\"
    exit 1
fi

## Backup existing roles
mkdir -p iam-backup
aws iam list-roles --query 'Roles[?contains(RoleName, `SageMaker`) || contains(RoleName, `Bedrock`)].RoleName' --output text | \
while read role; do
    echo \"Backing up role: $role\"
    aws iam get-role --role-name \"$role\" > \"iam-backup/${role}.json\"
    aws iam list-attached-role-policies --role-name \"$role\" > \"iam-backup/${role}-policies.json\"
done

## Delete and recreate basic SageMaker role
aws iam delete-role --role-name SageMakerExecutionRole-Emergency
aws iam create-role \
    --role-name SageMakerExecutionRole-Emergency \
    --assume-role-policy-document '{
        \"Version\": \"2012-10-17\",
        \"Statement\": [{
            \"Effect\": \"Allow\",
            \"Principal\": {\"Service\": \"sagemaker.amazonaws.com\"},
            \"Action\": \"sts:AssumeRole\"
        }]
    }'

## Attach necessary policies
aws iam attach-role-policy \
    --role-name SageMakerExecutionRole-Emergency \
    --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess

echo \"Emergency role created. Test your services and then restrict permissions.\"

Recovery Procedures: Getting Back Online Fast

When production is down and the CEO is asking questions, here's your recovery playbook:

5-Minute Recovery Checklist

#!/bin/bash
## Run this when everything is on fire

echo \"=== AWS AI/ML Emergency Recovery ===\"
echo \"Time started: $(date)\"

## 1. Check AWS status
echo \"1. Checking AWS service health...\"
curl -s https://status.aws.amazon.com/data.json | jq -r '.current_events[] | select(.status_key != \"resolved\") | \"\(.service_name): \(.status_key)\"' 

## 2. Check service quotas
echo \"2. Checking service quotas...\"
aws service-quotas get-service-quota --service-code sagemaker --quota-code L-1194D53C | jq -r '.Quota | \"SageMaker GPU instances: \(.Value)\"' 
aws service-quotas get-service-quota --service-code bedrock --quota-code L-22C574D0 | jq -r '.Quota | \"Bedrock requests/min: \(.Value)\"' 

## 3. Check running resources
echo \"3. Checking running resources...\"
aws sagemaker list-training-jobs --status-equals InProgress --query 'TrainingJobSummaries[].TrainingJobName' --output table
aws sagemaker list-endpoints --status-equals InService --query 'Endpoints[].EndpointName' --output table

## 4. Check recent CloudWatch errors
echo \"4. Recent errors...\"
aws logs filter-log-events \
    --log-group-name /aws/sagemaker/Endpoints \
    --start-time $(date -d '1 hour ago' +%s)000 \
    --filter-pattern ERROR | \
    jq -r '.events[] | \"\(.timestamp | strftime(\"%H:%M:%S\")): \(.message)\"' | head -5

## 5. Check costs
echo \"5. Current spending...\"
aws cloudwatch get-metric-statistics \
    --namespace AWS/Billing \
    --metric-name EstimatedCharges \
    --dimensions Name=Currency,Value=USD \
    --start-time $(date -d '1 day ago' +%s) \
    --end-time $(date +%s) \
    --period 86400 \
    --statistics Maximum \
    --query 'Datapoints[0].Maximum' \
    --output text | \
    xargs printf \"Current charges: $%.2f
\"

echo \"=== Recovery check complete at $(date) ===\"

The key to surviving AWS AI/ML disasters isn't preventing them (impossible) - it's detecting them fast and having systematic approaches to fix them. Build monitoring that alerts on the things that actually matter, keep runbooks for the common failures, and always have a rollback plan.

When you're debugging at 3am with executives breathing down your neck, you don't want to be reading documentation. You want proven commands and procedures that work. That's what this debugging approach provides - real solutions for real problems that happen in production.

Related Tools & Recommendations

tool
Similar content

AWS AI/ML Cost Optimization: Cut Bills 60-90% | Expert Guide

Stop AWS from bleeding you dry - optimization strategies to cut AI/ML costs 60-90% without breaking production

Amazon Web Services AI/ML Services
/tool/aws-ai-ml-services/cost-optimization-guide
100%
tool
Similar content

Integrating AWS AI/ML Services: Enterprise Patterns & MLOps

Explore the reality of integrating AWS AI/ML services, from common challenges to MLOps pipelines. Learn about Bedrock vs. SageMaker and security best practices.

Amazon Web Services AI/ML Services
/tool/aws-ai-ml-services/enterprise-integration-patterns
87%
tool
Similar content

Amazon EC2 Overview: Elastic Cloud Compute Explained

Rent Linux or Windows boxes by the hour, resize them on the fly, and description only pay for what you use

Amazon EC2
/tool/amazon-ec2/overview
83%
tool
Similar content

AWS AI/ML Troubleshooting: Debugging SageMaker & Bedrock in Production

Real debugging strategies for SageMaker, Bedrock, and the rest of AWS's AI mess

Amazon Web Services AI/ML Services
/tool/aws-ai-ml-services/production-troubleshooting-guide
83%
tool
Similar content

Amazon Nova Models: AWS's Own AI - Guide & Production Tips

Nova Pro costs about a third of what we were paying OpenAI

Amazon Web Services AI/ML Services
/tool/aws-ai-ml-services/amazon-nova-models-guide
73%
tool
Similar content

AWS AI/ML Security Hardening Guide: Protect Your Models from Exploits

Your AI Models Are One IAM Fuckup Away From Being the Next Breach Headline

Amazon Web Services AI/ML Services
/tool/aws-ai-ml-services/security-hardening-guide
69%
tool
Similar content

AWS AI/ML Migration: OpenAI & Azure to Bedrock Guide

Real migration timeline, actual costs, and why your first attempt will probably fail

Amazon Web Services AI/ML Services
/tool/aws-ai-ml-services/migration-implementation-guide
68%
pricing
Recommended

Databricks vs Snowflake vs BigQuery Pricing: Which Platform Will Bankrupt You Slowest

We burned through about $47k in cloud bills figuring this out so you don't have to

Databricks
/pricing/databricks-snowflake-bigquery-comparison/comprehensive-pricing-breakdown
67%
tool
Similar content

AWS AI/ML 2025 Updates: The New Features That Actually Matter

SageMaker Unified Studio, Bedrock Multi-Agent Collaboration, and other updates that changed the game

Amazon Web Services AI/ML Services
/tool/aws-ai-ml-services/aws-2025-updates
66%
tool
Similar content

AWS AI/ML Services: Practical Guide to Costs, Deployment & What Works

AWS AI: works great until the bill shows up and you realize SageMaker training costs $768/day

Amazon Web Services AI/ML Services
/tool/aws-ai-ml-services/overview
66%
tool
Similar content

LangChain Production Deployment Guide: What Actually Breaks

Learn how to deploy LangChain applications to production, covering common pitfalls, infrastructure, monitoring, security, API key management, and troubleshootin

LangChain
/tool/langchain/production-deployment-guide
62%
tool
Similar content

CUDA Production Debugging: Fix GPU Crashes & Memory Errors

The real-world guide to fixing CUDA crashes, memory errors, and performance disasters before your boss finds out

CUDA Development Toolkit
/tool/cuda/debugging-production-issues
60%
tool
Similar content

AWS CodeBuild Overview: Managed Builds, Real-World Issues

Finally, a build service that doesn't require you to babysit Jenkins servers

AWS CodeBuild
/tool/aws-codebuild/overview
58%
tool
Similar content

Claude API Production Debugging: Real-World Troubleshooting Guide

The real troubleshooting guide for when Claude API decides to ruin your weekend

Claude API
/tool/claude-api/production-debugging
58%
tool
Similar content

AWS Database Migration Service: Real-World Migrations & Costs

Explore AWS Database Migration Service (DMS): understand its true costs, functionality, and what actually happens during production migrations. Get practical, r

AWS Database Migration Service
/tool/aws-database-migration-service/overview
54%
tool
Similar content

AWS MGN: Server Migration to AWS - What to Expect & Costs

MGN replicates your physical or virtual servers to AWS. It works, but expect some networking headaches and licensing surprises along the way.

AWS Application Migration Service
/tool/aws-application-migration-service/overview
52%
tool
Similar content

Cassandra Vector Search for RAG: Simplify AI Apps with 5.0

Learn how Apache Cassandra 5.0's integrated vector search simplifies RAG applications. Build AI apps efficiently, overcome common issues like timeouts and slow

Apache Cassandra
/tool/apache-cassandra/vector-search-ai-guide
52%
tool
Similar content

Grok Code Fast 1: Emergency Production Debugging Guide

Learn how to use Grok Code Fast 1 for emergency production debugging. This guide covers strategies, playbooks, and advanced patterns to resolve critical issues

XAI Coding Agent
/tool/xai-coding-agent/production-debugging-guide
50%
integration
Similar content

AWS Lambda DynamoDB: Serverless Data Processing in Production

The good, the bad, and the shit AWS doesn't tell you about serverless data processing

AWS Lambda
/integration/aws-lambda-dynamodb/serverless-architecture-guide
46%
troubleshoot
Similar content

Kubernetes CrashLoopBackOff: Debug & Fix Pod Restart Issues

Your pod is fucked and everyone knows it - time to fix this shit

Kubernetes
/troubleshoot/kubernetes-pod-crashloopbackoff/crashloopbackoff-debugging
46%

Recommendations combine user behavior, content similarity, research intelligence, and SEO optimization