Today is Friday, September 05, 2025. I'm writing this after spending 18 hours yesterday debugging why our Bedrock endpoints were randomly throwing 500 errors during a board demo. Turned out to be a quota limit we hit because someone forgot to request an increase for our new model. The CEO was not amused.
Here's the shit that actually breaks in production and how to fix it when you're under pressure.
SageMaker Training Jobs That Die Mysteriously
"UnexpectedStatusException" - The Error Message From Hell
This is AWS's way of saying "something went wrong but we're not gonna tell you what." I've seen this error more times than I care to count. The official SageMaker troubleshooting documentation barely scratches the surface.
What usually causes it:
- IAM permissions are fucked (most common) - check IAM troubleshooting guide
- Your training script has a bug that only shows up on AWS - see training job errors guide
- Instance doesn't have enough memory - review instance types documentation
- VPC config blocking S3 access - check VPC endpoint configuration
Quick debugging steps:
## 1. Check CloudWatch logs first (this saves hours)
aws logs describe-log-groups --log-group-name-prefix /aws/sagemaker/TrainingJobs
## 2. Look at the actual error in CloudWatch
aws logs get-log-events --log-group-name /aws/sagemaker/TrainingJobs/your-job-name --log-stream-name your-job-name/algo-1-1234567890
## 3. Confirm the execution role is even assumable (then use the temporary credentials it returns to test S3 access)
aws sts assume-role --role-arn your-sagemaker-execution-role-arn --role-session-name test
For comprehensive log analysis, check the CloudWatch logs documentation for SageMaker. When jobs fail without logs appearing, it's usually a pre-training configuration issue.
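Before digging through log streams, it's often faster to pull the failure reason straight off the job description. A quick check, with your-job-name as the placeholder:
## FailureReason usually names the real problem before you ever open CloudWatch
aws sagemaker describe-training-job --training-job-name your-job-name \
--query '{Status: TrainingJobStatus, Reason: FailureReason}'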
The fix that works 90% of the time:
Your SageMaker execution role is missing permissions. Add this policy and try again:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket/*",
        "arn:aws:s3:::your-bucket"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    }
  ]
}
Training Jobs Stuck in "InProgress" Forever
Your job shows as running but hasn't done anything for 3 hours. This usually means:
- Spot instance got terminated (check CloudWatch events)
- Data loading is stuck (your S3 bucket is in a different region)
- Docker container won't start (base image is corrupted or wrong region)
Quick fix (following SageMaker best practices):
## Add a timeout to your training jobs so they don't run forever
import boto3

sagemaker = boto3.client('sagemaker')
sagemaker.create_training_job(
    TrainingJobName='my-job',
    StoppingCondition={
        'MaxRuntimeInSeconds': 3600  # Kill it after 1 hour
    },
    # ... other params
)
For more advanced timeout and training configurations, check the SageMaker training compiler best practices and training job setup guide.
"ResourceLimitExceeded" During Training
You hit an instance quota or your account can't spin up the GPU instances you requested.
## Check your service quotas
aws service-quotas get-service-quota \
--service-code sagemaker \
--quota-code L-1194D53C # ml.p3.2xlarge instances
## Request quota increase (takes 2-5 business days)
aws service-quotas request-service-quota-increase \
--service-code sagemaker \
--quota-code L-1194D53C \
--desired-value 10
For faster quota increases, use the Service Quotas console instead of the CLI. The AWS Support console method also works but takes longer.
Emergency workaround: Use a smaller instance type or switch to on-demand if you were using spot instances.
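If spot capacity is what's biting you, the fastest escape is to rerun the job on on-demand. A minimal sketch with boto3 (job name, instance type, and sizes are placeholders, and it assumes the SageMaker client from above):
## Rerun without managed spot so a capacity reclaim can't kill the job
sagemaker.create_training_job(
    TrainingJobName='my-job-ondemand',
    EnableManagedSpotTraining=False,  # Fall back to on-demand capacity
    ResourceConfig={
        'InstanceType': 'ml.g5.xlarge',  # Or whatever instance family you actually have quota for
        'InstanceCount': 1,
        'VolumeSizeInGB': 50
    },
    # ... same AlgorithmSpecification, RoleArn, and data channels as before
)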
Bedrock Models Randomly Failing
ThrottlingException During Peak Hours
Your app works fine during testing but starts throwing 429 errors when real users hit it. The official Bedrock API error codes documentation recommends exponential backoff with jitter.
Default quotas are pathetic:
- Claude 3.5 Sonnet: something like 8k tokens/min, maybe less in some regions
- Nova Pro: around 10k tokens/min if you're lucky
- Llama 3 (through Bedrock): I think around 12k tokens/min, but these change randomly
Check the current Bedrock quotas documentation for your specific region since these limits vary wildly.
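You can also pull the applied values straight out of Service Quotas instead of eyeballing the docs. A quick sketch (the --query projection is just one way to slice it):
## List the Bedrock quotas applied to your account in the current region
aws service-quotas list-service-quotas --service-code bedrock \
--query 'Quotas[].{Name: QuotaName, Value: Value}' --output table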
Quick fixes:
First, implement exponential backoff (should have done this from the start):
import time
import random
from botocore.exceptions import ClientError

def bedrock_with_retry(bedrock_call, max_retries=5):
    for attempt in range(max_retries):
        try:
            return bedrock_call()
        except ClientError as e:
            if e.response['Error']['Code'] == 'ThrottlingException':
                # Exponential backoff with jitter
                wait_time = (2 ** attempt) + random.uniform(0, 1)
                time.sleep(wait_time)
                continue
            else:
                raise
    raise Exception("Max retries exceeded")
Then request quota increases using the Bedrock quota increase process and use multiple regions to spread load. The Boto3 retry documentation has more advanced retry strategies. Also ensure you have proper model access permissions and follow the getting started guide for API setup. For troubleshooting access issues, check the model access modification guide.
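If you'd rather not hand-roll the retry loop, botocore can do the backoff for you. A minimal sketch using the built-in adaptive retry mode:
import boto3
from botocore.config import Config

## Adaptive mode retries throttling errors and also rate-limits the client to back off under pressure
retry_config = Config(retries={'max_attempts': 10, 'mode': 'adaptive'})
bedrock_runtime = boto3.client('bedrock-runtime', config=retry_config)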
"ModelNotReadyException" - Cold Start Hell
First request to a Bedrock model after it's been idle takes 10-30 seconds. Your users think the app is broken. The official ModelNotReadyException troubleshooting guide recommends implementing heartbeat strategies.
Hacky but necessary fix - keep models warm:
import json
import threading
import time

import boto3

bedrock = boto3.client('bedrock-runtime')

def keep_bedrock_warm():
    while True:
        try:
            # Cheapest possible request to keep the model loaded
            bedrock.invoke_model(
                modelId='anthropic.claude-3-haiku-20240307-v1:0',
                body=json.dumps({
                    "anthropic_version": "bedrock-2023-05-31",
                    "max_tokens": 1,
                    "messages": [{"role": "user", "content": "hi"}]
                })
            )
            time.sleep(300)  # Every 5 minutes
        except Exception:
            time.sleep(60)  # Retry in 1 minute

## Start as a daemon thread
threading.Thread(target=keep_bedrock_warm, daemon=True).start()
Costs about $5/month but saves your user experience.
Regional Availability Hell
Your app worked fine in us-east-1 during testing. Now you're deploying to eu-west-1 and nothing works because the model isn't available there.
Check model availability by region:
import boto3

def check_model_availability(model_id, region):
    try:
        bedrock = boto3.client('bedrock', region_name=region)
        response = bedrock.list_foundation_models()
        available_models = [model['modelId'] for model in response['modelSummaries']]
        return model_id in available_models
    except Exception:
        return False

## Check before deployment
regions = ['us-east-1', 'us-west-2', 'eu-west-1']
model = 'anthropic.claude-3-5-sonnet-20240620-v1:0'
for region in regions:
    available = check_model_availability(model, region)
    print(f"{model} in {region}: {'✓' if available else '✗'}")
SageMaker Endpoints That Won't Deploy
"EndpointCreationFailed" with No Useful Error
Your model trained fine but won't deploy to an endpoint. The error message is useless.
Most common causes:
- Model artifact is corrupted - repackage and upload again (see the artifact check after this list)
- Docker container can't load the model - memory issues or wrong Python version
- IAM role can't access the model artifact - permissions again, always permissions
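For the corrupted-artifact case, check that the model.tar.gz actually unpacks before blaming SageMaker. A quick sanity check (bucket and key are placeholders):
## Pull the artifact and make sure it's a valid tarball with your model files in it
aws s3 cp s3://your-bucket/model/model.tar.gz .
tar -tzf model.tar.gz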
Debug with a test endpoint:
## Deploy to the smallest possible instance first
import boto3

sagemaker = boto3.client('sagemaker')

test_config = {
    'EndpointConfigName': 'debug-config',
    'ProductionVariants': [{
        'VariantName': 'debug-variant',
        'ModelName': 'your-model',
        'InitialInstanceCount': 1,
        'InstanceType': 'ml.t2.medium',  # Cheapest option for debugging
        'InitialVariantWeight': 1
    }]
}
sagemaker.create_endpoint_config(**test_config)
sagemaker.create_endpoint(EndpointName='debug-endpoint', EndpointConfigName='debug-config')
If it fails on ml.t2.medium, the problem is your model code. If it works, you have a resource issue.
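Also worth knowing: the endpoint description usually carries a slightly less useless error than the console. For example:
## FailureReason often names the actual problem (image pull, model download, failed health check...)
aws sagemaker describe-endpoint --endpoint-name your-endpoint \
--query '{Status: EndpointStatus, Reason: FailureReason}'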
Endpoint Deployed But Returns 500 Errors
The endpoint shows as "InService" but every request fails with ModelError.
This is usually your fault:
- Your inference script has bugs
- Model expects different input format than you're sending
- Python dependencies are missing from the container
Quick debug:
## Test with the simplest possible input
import json

import boto3

sagemaker_runtime = boto3.client('sagemaker-runtime')

test_payload = {"instances": [{"data": "test"}]}
response = sagemaker_runtime.invoke_endpoint(
    EndpointName='your-endpoint',
    ContentType='application/json',
    Body=json.dumps(test_payload)
)
print(response['Body'].read())
Check CloudWatch logs for the actual Python error. It's usually a missing import or wrong data type.
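Endpoint container logs land under /aws/sagemaker/Endpoints/<endpoint-name>; a quick way to follow them (aws logs tail needs CLI v2):
## Tail the endpoint's container logs for the real Python traceback
aws logs tail /aws/sagemaker/Endpoints/your-endpoint --since 1h --follow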
Multi-Model Endpoints from Hell
Models Fighting for Memory
You deployed 8 models to one endpoint to save money. Now random models fail with OutOfMemoryError and you can't predict which ones.
Quick fix - limit concurrent models:
## In the container definition you pass to create_model for the multi-model endpoint
'MultiModelConfig': {
    'ModelCacheSetting': 'Enabled',
    'MaxModels': 3  # Only keep 3 models in memory at once
}
Better fix - profile your models:
import psutil

def get_model_memory_usage(model_name):
    # Measure how much extra RSS the process uses after loading this model
    before = psutil.Process().memory_info().rss
    model = load_your_model(model_name)
    after = psutil.Process().memory_info().rss
    return (after - before) / 1024 / 1024  # MB

## Size your endpoint based on actual usage
total_memory = 0
for model in your_models:
    memory = get_model_memory_usage(model)
    total_memory += memory
    print(f"{model}: {memory:.1f} MB")
print(f"Total memory needed: {total_memory:.1f} MB")
Model Loading Timeouts
New models take 60+ seconds to load on first request. Users abandon your app.
Preload important models:
## In your inference.py
def model_fn(model_dir):
    # Load all critical models at startup
    critical_models = ['model1', 'model2', 'model3']
    loaded_models = {}
    for model_name in critical_models:
        loaded_models[model_name] = load_model(f"{model_dir}/{model_name}")
    return loaded_models

def predict_fn(input_data, models):
    model_name = input_data.get('model_name', 'default')
    if model_name in models:
        return models[model_name].predict(input_data)
    else:
        # Load on demand for non-critical models
        model = load_model_on_demand(model_name)
        return model.predict(input_data)
Real-Time Inference Performance Hell
Response Times Randomly Spike to 30+ Seconds
Your endpoint usually responds in 2 seconds but randomly takes 30+ seconds for the same request.
Usually it's garbage collection in Python - your model loads too much into memory. Could also be CPU-bound instances, auto-scaling kicking in, or the model loading from disk every damn time.
Quick profiling:
import time
import cProfile

def profile_inference(input_data):
    profiler = cProfile.Profile()
    profiler.enable()
    start_time = time.time()
    result = your_model.predict(input_data)
    end_time = time.time()
    profiler.disable()
    profiler.print_stats(sort='cumulative')
    print(f"Total inference time: {end_time - start_time:.2f} seconds")
    return result
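If the profile points at garbage collection rather than the model itself, one option is to stop the collector from running mid-request. A sketch, assuming you control the inference handler:
import gc

## Freeze objects that survived startup so the GC stops rescanning them (Python 3.7+),
## then collect explicitly at a controlled point instead of whenever the allocator decides
gc.collect()
gc.freeze()
gc.disable()

def predict_with_manual_gc(model, input_data):
    result = model.predict(input_data)
    gc.collect()  # Run collection at a predictable moment, not in the middle of inference
    return result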
Auto-Scaling Triggers Too Late
Your endpoint gets overwhelmed before auto-scaling kicks in. Users see 503 errors.
Fix the scaling policy:
import boto3

## Scaling policies go through Application Auto Scaling, not the SageMaker client
autoscaling = boto3.client('application-autoscaling')
autoscaling.put_scaling_policy(
    PolicyName='target-tracking-scaling',
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/your-endpoint/variant/your-variant',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 70.0,  # Scale out at ~70 invocations per instance per minute
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
        },
        'ScaleOutCooldown': 60,   # Scale out faster
        'ScaleInCooldown': 300    # Scale in slower
    }
)
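One gotcha: put_scaling_policy assumes the variant is already registered as a scalable target. If it isn't, register it first; a sketch with made-up min/max capacity:
autoscaling.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/your-endpoint/variant/your-variant',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=2,   # Keep enough headroom to absorb a spike while scaling out
    MaxCapacity=10
)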
IAM Permission Debugging (The Eternal Struggle)
"AccessDenied" with Zero Context
AWS error message: "AccessDenied: User is not authorized to perform X on resource Y"
Gee, thanks AWS. Super helpful.
Debug IAM step by step:
## 1. Who am I and what permissions do I have?
aws sts get-caller-identity
## 2. What policies are attached to my role?
aws iam list-attached-role-policies --role-name YourRoleName
## 3. What's in those policies? (grab the DefaultVersionId from aws iam get-policy - it isn't always v1)
aws iam get-policy-version --policy-arn your-policy-arn --version-id v1
## 4. Test specific actions
aws iam simulate-principal-policy \
--policy-source-arn your-role-arn \
--action-names sagemaker:InvokeEndpoint \
--resource-arns your-endpoint-arn
The nuclear option (for debugging only):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "*",
      "Resource": "*"
    }
  ]
}
If it works with this policy, gradually restrict permissions until you find the missing one.
VPC and Networking Nightmares
SageMaker Can't Access S3 in VPC Mode
Your training job works fine without VPC but fails when you add VPC configuration for security.
You need VPC endpoints:
## Create S3 VPC endpoint
aws ec2 create-vpc-endpoint \
--vpc-id vpc-12345678 \
--service-name com.amazonaws.region.s3 \
--route-table-ids rtb-12345678
## Create SageMaker API endpoint (this one is an interface endpoint, so say so explicitly)
aws ec2 create-vpc-endpoint \
--vpc-id vpc-12345678 \
--vpc-endpoint-type Interface \
--service-name com.amazonaws.region.sagemaker.api \
--subnet-ids subnet-12345678
Or use NAT Gateway (more expensive but simpler):
## Create NAT gateway for private subnets
aws ec2 create-nat-gateway \
--subnet-id subnet-12345678 \
--allocation-id eipalloc-12345678
Security Groups Blocking Everything
Your instances can't talk to each other or access external APIs.
Debug security group rules:
## Check what's allowed
aws ec2 describe-security-groups --group-ids sg-12345678
## Test network connectivity (this launches a throwaway instance - check its console output for the curl results)
aws ec2 run-instances \
--image-id ami-12345678 \
--instance-type t2.micro \
--security-group-ids sg-12345678 \
--user-data "#!/bin/bash
curl -I https://aws.amazon.com
curl -I https://api.openai.com"
Emergency fix (lock down later):
{
  "IpPermissions": [
    {
      "IpProtocol": "-1",
      "IpRanges": [{"CidrIp": "0.0.0.0/0"}]
    }
  ]
}
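To actually apply that rule, feed the JSON to authorize-security-group-ingress (assuming you saved it as allow-all.json; the group ID can live in the file or on the command line):
## Allow everything inbound - debugging only, lock it back down after
aws ec2 authorize-security-group-ingress --group-id sg-12345678 \
--cli-input-json file://allow-all.json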
Data Pipeline Failures
Batch Transform Jobs Failing Randomly
Your batch inference works for 100 files then fails on file 101 with no clear error.
Common causes:
- File format inconsistency - one file has different columns
- Memory issues - one file is much larger than others
- Encoding problems - file has weird characters
- Timeout issues - one file takes too long to process
Quick debugging:
## Process files individually to find the problem
import boto3

s3 = boto3.client('s3')

def debug_batch_files(s3_bucket, file_list):
    failed_files = []
    for file_key in file_list:
        try:
            # Download and do basic validation
            obj = s3.get_object(Bucket=s3_bucket, Key=file_key)
            data = obj['Body'].read()
            # Check file size
            size_mb = len(data) / 1024 / 1024
            if size_mb > 100:  # Files over 100MB often cause issues
                print(f"Large file: {file_key} ({size_mb:.1f} MB)")
            # Check encoding
            try:
                text = data.decode('utf-8')
            except UnicodeDecodeError:
                print(f"Encoding issue: {file_key}")
                failed_files.append(file_key)
                continue
            # Basic format validation
            if not validate_file_format(text):
                print(f"Format issue: {file_key}")
                failed_files.append(file_key)
        except Exception as e:
            print(f"Error with {file_key}: {e}")
            failed_files.append(file_key)
    return failed_files
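A hypothetical way to feed it the same file list your transform job sees, using an S3 paginator (bucket and prefix are placeholders):
## Build file_list from the transform job's input prefix
paginator = s3.get_paginator('list_objects_v2')
file_list = []
for page in paginator.paginate(Bucket='your-bucket', Prefix='batch-input/'):
    file_list.extend(obj['Key'] for obj in page.get('Contents', []))

bad_files = debug_batch_files('your-bucket', file_list)
print(f"{len(bad_files)} problem files: {bad_files}")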
Emergency Debugging Checklist
When everything's on fire and you need answers fast:
1. Check AWS Status (30 seconds)
curl -s https://status.aws.amazon.com/data.json | jq '.current_events'
2. Check Your Recent Changes (2 minutes)
## What did we deploy recently?
aws logs filter-log-events \
--log-group-name /aws/lambda/your-function \
--start-time $(date -d '2 hours ago' +%s)000
## Any IAM changes?
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=EventName,AttributeValue=AttachRolePolicy \
--start-time $(date -d '24 hours ago' +%s)
3. Check Quotas and Limits (1 minute)
## Quick quota check for common limits (note the quota codes belong to different services)
import boto3

service_quotas = boto3.client('service-quotas')
quotas_to_check = [
    ('sagemaker', 'L-1194D53C'),  # ml.p3.2xlarge instances
    ('bedrock', 'L-22C574D0'),    # Bedrock Claude requests per minute
    ('sagemaker', 'L-888C8DB6'),  # Real-time endpoints
]
for service_code, quota_code in quotas_to_check:
    response = service_quotas.get_service_quota(
        ServiceCode=service_code,
        QuotaCode=quota_code
    )
    print(f"{service_code}/{quota_code}: {response['Quota']['Value']}")
4. Check CloudWatch Errors (2 minutes)
## Recent errors across all services
aws logs filter-log-events \
--log-group-name /aws/sagemaker/TrainingJobs \
--start-time $(date -d '1 hour ago' +%s)000 \
--filter-pattern "ERROR"
aws logs filter-log-events \
--log-group-name /aws/lambda/your-function \
--start-time $(date -d '1 hour ago' +%s)000 \
--filter-pattern "ERROR"
5. Check Billing Spikes (1 minute)
## Did something start burning money?
from datetime import datetime, timedelta

import boto3

## Billing metrics only live in us-east-1, and only if billing alerts are enabled
cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')
response = cloudwatch.get_metric_statistics(
    Namespace='AWS/Billing',
    MetricName='EstimatedCharges',
    Dimensions=[{'Name': 'Currency', 'Value': 'USD'}],
    StartTime=datetime.utcnow() - timedelta(hours=24),
    EndTime=datetime.utcnow(),
    Period=3600,  # 1 hour
    Statistics=['Maximum']
)
datapoints = sorted(response['Datapoints'], key=lambda d: d['Timestamp'])
current_cost = datapoints[-1]['Maximum']
print(f"Current estimated charges: ${current_cost:.2f}")
The Reality of AWS AI/ML Production Support
After debugging dozens of production issues, here's what I've learned:
AWS support is slow unless you pay for Business/Enterprise support. The community forums and Stack Overflow often have faster answers for common issues.
Error messages are designed to be unhelpful. Always check CloudWatch logs first. The actual error is usually buried in there somewhere.
IAM permissions cause 60% of all production issues. When in doubt, it's probably IAM. The other 40% is usually quotas or regional availability.
Spot instance interruptions will ruin your weekend if you don't handle them properly. Always use checkpointing for training jobs longer than 30 minutes.
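Checkpointing is one extra block on the training job. A minimal sketch (paths are placeholders, it assumes the boto3 SageMaker client from earlier, and your training script still has to write and restore checkpoints from that local path):
sagemaker.create_training_job(
    TrainingJobName='my-spot-job',
    EnableManagedSpotTraining=True,
    CheckpointConfig={
        'S3Uri': 's3://your-bucket/checkpoints/my-spot-job/',
        'LocalPath': '/opt/ml/checkpoints'  # Default location SageMaker syncs to S3
    },
    StoppingCondition={
        'MaxRuntimeInSeconds': 86400,
        'MaxWaitTimeInSeconds': 90000  # Must be >= MaxRuntimeInSeconds when using spot
    },
    # ... AlgorithmSpecification, RoleArn, ResourceConfig, data channels
)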
Multi-region deployments add 10x complexity but are necessary for production reliability. Plan for things to break differently in each region.
The key to surviving AWS AI/ML in production is having good monitoring, automated rollback procedures, and the phone numbers of colleagues who've been through this shit before.
When your Bedrock endpoints are returning 500 errors during a board presentation, you don't want to be reading documentation. You want copy-paste solutions that work. That's what this guide provides - the debugging commands and fixes that actually work when production is melting down and everyone's staring at you waiting for an ETA.