The "Oh Shit" Errors: What Failed and How to Fix It Fast

Q

Container keeps restarting with exit code 137 - what the hell is happening?

A

Your container died with exit code 137? That's 128 + 9 (SIGKILL) - the kernel's OOM killer terminated it for eating too much memory. Been there, it sucks. ACI enforces memory limits like a bouncer - no warnings, just instant death when you cross the line.

Quick fix:

az container show --resource-group mygroup --name mycontainer --query "containers[0].resources"

If your container is requesting 1GB but actually needs 1.5GB, bump it to 2GB. You can't resize a running container group - you delete it and recreate it with a bigger --memory. Don't try to be clever with exact limits - Azure bills per second, so the extra memory costs pennies compared to downtime.
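
A minimal sketch of the recreate, reusing the names from the command above (the 2GB value is illustrative):

## Recreate the container group with more memory
az container delete --resource-group mygroup --name mycontainer --yes
az container create --resource-group mygroup --name mycontainer \
  --image myimage --cpu 1 --memory 2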

Why this happens: Your Node.js app that "only uses 500MB in dev" suddenly needs 2GB in production because you forgot about the 15 npm packages that cache everything in memory. Or that Python script that seemed fine locally but loads the entire dataset into a pandas DataFrame on Azure.

Q

"Failed to pull image" but the image exists and worked yesterday

A

This is the most frustrating error because it's usually not your fault. ACI's image pulling is inconsistent as hell, especially for private registries.

First thing to try: Delete and recreate the container group. Seriously. ACI gets stuck in weird states.

az container delete --resource-group mygroup --name mycontainer --yes
az container create --resource-group mygroup --name mycontainer --image myimage

If that doesn't work: Check whether your registry credentials expired. Service principal secrets have expiration dates that nobody remembers to renew - and az acr credential show only covers the admin user, so check the service principal separately:

az acr credential show --name myregistry
az ad sp credential list --id <service-principal-app-id>

Last resort: Use managed identity instead of service principals. At least then Azure handles the credential rotation (the identity needs the AcrPull role on the registry):

az container create --resource-group mygroup --name mycontainer \
  --image myregistry.azurecr.io/myapp:latest \
  --assign-identity <identity-resource-id> \
  --acr-identity <identity-resource-id>
Q

Container starts then immediately dies - no useful logs

A

This usually means your app is crashing before it can write anything meaningful. The container starts, your process exits, ACI restarts it, repeat forever.

Debug with a sleep command: Override the container command to keep it running so you can exec into it:

az container create --resource-group mygroup --name debug-container \
  --image myapp:latest --command-line "sleep 3600"

Then exec in and run your app manually:

az container exec --resource-group mygroup --name debug-container --exec-command "/bin/bash"

Common causes:

  • Missing environment variables your app needs (see the example after this list)
  • Database connection strings that don't work from Azure
  • File permissions fucked up in your Docker image
  • Your app expects files that don't exist in the container
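
If the missing piece is environment variables, pass them at create time. A hedged sketch - the variable names are made up, and --secure-environment-variables keeps secret values out of az container show output:

az container create --resource-group mygroup --name mycontainer \
  --image myapp:latest \
  --environment-variables NODE_ENV=production \
  --secure-environment-variables DB_CONNECTION_STRING="<connection-string>"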
Q

Container group stuck in "Pending" state forever

A

This means Azure can't find resources to run your container. Usually happens in smaller regions or when you're asking for too much CPU/memory.

Quick fix: Try a different region:

az container create --resource-group mygroup --name mycontainer \
  --image myapp:latest --location eastus2

Resource requirements check: ACI has quotas per region. If you're asking for 4 vCPUs in West Europe during peak hours, you might wait forever. Drop to 2 vCPUs or try a different region.

Emergency workaround: Use spot containers if your workload can handle interruptions:

az container create --resource-group mygroup --name mycontainer \
  --image myapp:latest --priority Spot
Q

Networking issues - can't connect to container ports

A

ACI networking is simple until it isn't. Most connection issues come down to port mismatches or security group bullshit.

Port mapping reality check: Unlike Docker, ACI doesn't do port mapping. If your app listens on port 3000 inside the container, you expose port 3000 in ACI. Period.

az container create --resource-group mygroup --name mycontainer \
  --image myapp:latest --ports 3000 --ip-address Public

VNet deployment gotcha: If you deploy into a VNet, you need a NAT gateway for outbound connections. Microsoft's docs mention this in small print after you've already hit the issue:

## Your container can't reach the internet without this
az network nat gateway create --resource-group mygroup --name mynatgateway
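
That single command isn't the whole story - the NAT gateway needs a public IP and has to be attached to the container subnet before outbound traffic flows. Roughly (the VNet and subnet names here are placeholders):

az network public-ip create --resource-group mygroup --name mynatip --sku Standard
az network nat gateway create --resource-group mygroup --name mynatgateway \
  --public-ip-addresses mynatip
az network vnet subnet update --resource-group mygroup --vnet-name myvnet \
  --name aci-subnet --nat-gateway mynatgateway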

The 3AM Debugging Methodology: How to Actually Fix ACI Issues

When your ACI containers are down and you're getting paged at 3am, you don't have time for Microsoft's 47-step troubleshooting flowchart. Here's the methodology that actually works when everything's on fire.

Step 1: Get the Real Error Message (Not the Bullshit Summary)

The Azure portal shows you friendly error messages that are completely useless. Get the actual error details:

az container show --resource-group mygroup --name mycontainer --query "containers[0].instanceView"

This gives you the real container events, not the sanitized portal version. Look for the previousState and currentState sections - that's where the actual errors hide.

What you're looking for: Exit codes, specific error messages, and the restartCount. If restartCount keeps climbing, you have a restart loop. If it's stuck at 0 with state: Waiting, your container never started properly.
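
The container's own stdout/stderr is usually more telling than the event list. Pull it with one command (add --container-name only for multi-container groups):

az container logs --resource-group mygroup --name mycontainer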

Step 2: The Resource Check (Because Azure Lies About Availability)

ACI will tell you resources are available in a region, then fail to provision them. Check what you're actually asking for:

## See what you requested
az container show --resource-group mygroup --name mycontainer --query "containers[0].resources.requests"

## See the limits you configured (if any)
az container show --resource-group mygroup --name mycontainer --query "containers[0].resources.limits"

If you requested 4 vCPUs and got nothing, try 2 vCPUs. If you requested 8GB RAM and got nothing, try 4GB. ACI availability is unpredictable, especially in smaller regions.

Regional reality check: Some Azure regions just don't have capacity for larger container instances. If you're stuck, try these regions that usually have availability:

  • East US 2
  • West Europe
  • Southeast Asia

Step 3: The Image Pull Investigation

Image pull failures are the most common production issue because there are 15 different ways they can break:

Registry connectivity test:

az acr login --name myregistry
docker pull myregistry.azurecr.io/myapp:latest

If this fails, your image doesn't exist or your credentials are wrong. If it works, ACI is having registry authentication issues.

Registry authentication debugging:

## Check ACR credentials
az acr credential show --name myregistry

## Check managed identity assignment
az container show --resource-group mygroup --name mycontainer --query "identity"

The nuclear option: If nothing else works, copy your image to Docker Hub temporarily. It's not secure for production, but it'll get your service back up while you fix ACR authentication:

docker tag myregistry.azurecr.io/myapp:latest mydockerhubuser/myapp:emergency
docker push mydockerhubuser/myapp:emergency
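
Then point the broken container group at the emergency tag while you fix ACR auth (names are illustrative):

az container create --resource-group mygroup --name mycontainer \
  --image mydockerhubuser/myapp:emergency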

Step 4: Container Startup Debugging

If your container pulls successfully but crashes immediately, you need to debug the startup process:

Method 1: Override the entrypoint

az container create --resource-group mygroup --name debug-myapp \
  --image myregistry.azurecr.io/myapp:latest \
  --command-line "tail -f /dev/null"

This keeps the container running so you can exec in and debug manually.

Method 2: Check environment variables

az container exec --resource-group mygroup --name debug-myapp --exec-command "env"

Your app might be looking for environment variables that aren't set. Production containers often have different ENV requirements than development.

Method 3: Run your app manually

az container exec --resource-group mygroup --name debug-myapp --exec-command "/bin/bash"
## Then inside the container
/path/to/your/app

This will show you the actual error your app throws, which is usually more helpful than ACI's generic error messages.

Step 5: The Memory and CPU Reality Check

ACI enforces resource limits strictly. If your app works fine in development but crashes in ACI, you're probably hitting resource limits:

Memory debugging:

## az container exec can't pass a command with arguments - exec into a shell first
az container exec --resource-group mygroup --name mycontainer --exec-command "/bin/sh"

## Then, inside the container:
free -h                 ## overall memory usage
ps aux --sort=-%mem     ## which processes are eating it

The 25% rule: Always request 25% more CPU and memory than your app needs. ACI doesn't handle resource contention gracefully - it just kills containers that exceed limits.

CPU performance gotcha: Fractional vCPUs (0.5, 1.5) perform poorly when Azure is busy. If your app is CPU-intensive, use whole vCPU numbers (1, 2, 4).
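
Putting both rules together, a sketch for an app that peaks around 1.6GB and is CPU-heavy (the numbers are illustrative):

## ~1.6GB peak + 25% headroom -> 2GB; CPU-heavy -> whole vCPUs
az container create --resource-group mygroup --name mycontainer \
  --image myapp:latest --cpu 2 --memory 2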

This methodology has saved me countless hours of frustration. Skip the Microsoft docs flowcharts and go straight to what actually identifies the problem.

The Advanced Fuckups: When Basic Troubleshooting Isn't Enough

Q

Multi-container groups: one container kills the others

A

Container groups in ACI are like a house of cards. When one container in the group crashes, it can take down the whole group depending on your restart policy.

The problem: You have a web app + Redis sidecar. Redis runs out of memory and crashes. ACI restarts the entire container group, killing your web app sessions.

Quick fix: Separate your containers into different container groups:

## Create Redis separately
az container create --resource-group mygroup --name redis-instance \
  --image redis:alpine --memory 1 --cpu 0.5

## Create web app that connects to Redis via its IP
az container create --resource-group mygroup --name webapp-instance \
  --image myapp:latest --memory 2 --cpu 1 \
  --environment-variables REDIS_HOST=<redis-instance-ip>
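
To fill in <redis-instance-ip>, pull it from the Redis group after creation. This assumes the Redis group actually has an IP to hand out - e.g. it was created with --ip-address Public, or both groups sit in the same VNet:

az container show --resource-group mygroup --name redis-instance \
  --query "ipAddress.ip" --output tsv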

When you actually need container groups: Only when containers absolutely must share localhost networking. Otherwise, separate container groups are more reliable.

Q

Storage volumes randomly unmount

A

ACI's Azure Files integration looks great on paper but fails in creative ways in production.

Symptoms: Your app writes to /data/logs/app.log successfully, then later crashes because the file doesn't exist. The Azure Files share appears to be unmounted.

Root cause: Network hiccups between ACI and Azure Files. When the network connection drops, the mount fails silently.
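
For context, this is the kind of mount that can silently drop - a typical Azure Files volume wired in at create time (storage account, key, and share names are placeholders):

az container create --resource-group mygroup --name mycontainer \
  --image myapp:latest \
  --azure-file-volume-account-name mystorageaccount \
  --azure-file-volume-account-key "<storage-account-key>" \
  --azure-file-volume-share-name myshare \
  --azure-file-volume-mount-path /data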

Workaround: Mount Azure Files with error handling in your application:

## Check if mount exists before writing
if [ -d "/data" ]; then
    echo "Writing to /data/logs/app.log"
else
    echo "Mount failed, writing to /tmp/app.log instead"
fi

Better solution: Use blob storage with the Azure CLI instead of file mounts:

az storage blob upload --account-name mystorage --container-name logs \
  --name "app-$(date +%Y%m%d).log" --file /tmp/app.log
Q

Confidential containers cost a fortune but don't work with your app

A

Microsoft's confidential containers sound cool until you realize they break half your applications and cost 3x normal pricing.

What breaks:

  • Apps that need privileged operations
  • Containers that access hardware directly
  • Some Node.js modules that use native binaries
  • Any app that expects specific CPU features

Cost reality: Confidential containers can cost $0.15/vCPU-hour vs $0.045/vCPU-hour for regular containers. That's $108/month vs $32/month for a single 1-vCPU container running 24/7.

When to actually use them: Processing truly sensitive data where the extra cost is worth it. For most applications, standard ACI with network security groups is sufficient.

Q

Spot containers get evicted during critical operations

A

Spot containers save 70% on costs but can disappear with 30 seconds notice when Azure needs the capacity back.

The problem: You're running a 4-hour batch job on spot containers. After 3 hours and 45 minutes, Azure evicts your container and you lose all progress.

Solution pattern: Checkpoint your progress regularly:

## Save state every 15 minutes (process_batch_chunk and save_state_to_blob are your own functions)
while true; do
  process_batch_chunk
  save_state_to_blob
  sleep 900
done
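
A minimal sketch of what save_state_to_blob could look like - the checkpoint file, storage account, and container names are all assumptions:

save_state_to_blob() {
  ## Overwrite the checkpoint so a restarted job can resume where it left off
  az storage blob upload --account-name mystorage --container-name checkpoints \
    --name batch-checkpoint.json --file /tmp/checkpoint.json --overwrite
}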

Don't use spot for:

  • Web applications (users will get 500 errors when evicted)
  • Database containers
  • Long-running tasks without checkpointing

Good for:

  • CI/CD pipelines (can restart from failed step)
  • Log processing (can reprocess failed batch)
  • Development/testing environments
Q

Regional capacity issues during Azure outages

A

When Azure has regional issues, ACI often fails silently. Your container group just sits in "Pending" state forever.

Detection: If your container has been pending for more than 10 minutes, it's probably a capacity issue:

az container show --resource-group mygroup --name mycontainer --query "provisioningState"

Multi-region deployment: Deploy identical container groups in multiple regions:

## Primary region
az container create --resource-group mygroup-east --name myapp-east \
  --image myapp:latest --location eastus2

## Backup region  
az container create --resource-group mygroup-west --name myapp-west \
  --image myapp:latest --location westus2

Use Azure Traffic Manager to route traffic between healthy instances. It's more expensive but prevents total outages during regional issues.
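
The Traffic Manager wiring is only a few commands. A rough sketch using priority (failover) routing - the profile name, DNS label, and endpoint FQDNs are placeholders, and the FQDNs assume each container group was created with a --dns-name-label:

## Profile with priority routing: east is primary, west is the backup
az network traffic-manager profile create --resource-group mygroup \
  --name myapp-tm --routing-method Priority --unique-dns-name myapp-tm-demo

az network traffic-manager endpoint create --resource-group mygroup \
  --profile-name myapp-tm --name east --type externalEndpoints \
  --target myapp-east.eastus2.azurecontainer.io --priority 1

az network traffic-manager endpoint create --resource-group mygroup \
  --profile-name myapp-tm --name west --type externalEndpoints \
  --target myapp-west.westus2.azurecontainer.io --priority 2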

Production-Ready ACI Patterns That Actually Work

Most ACI tutorials show you hello-world demos that work perfectly in development but fall apart the moment real users hit them. Here are the patterns that survive production deployment.

The "Always-On" Container Pattern

ACI containers can randomly restart for platform maintenance. Your application needs to handle this gracefully or you'll have mysterious 30-second outages.

What doesn't work: Storing application state in memory or local files. When ACI restarts your container, everything in memory disappears.

What works: Store state externally and design for quick restart:

## Bad: state stored in container
webapp_container:
  image: myapp:latest
  # App stores user sessions in memory
  # File uploads saved to /tmp
  # Database connections cached in global variables

## Good: stateless design
webapp_container:
  image: myapp:latest
  environment:
    REDIS_URL: myredis.redis.cache.windows.net
    BLOB_STORAGE: mystorageaccount.blob.core.windows.net
    DB_CONNECTION_POOL_SIZE: 10

Startup time reality: ACI cold starts can take 15-60 seconds (or forever on a bad day). Speed up your container startup:

  • Use Alpine Linux base images (5MB vs 200MB for Ubuntu)
  • Pre-compile languages like .NET with ReadyToRun
  • Cache dependencies in the Docker image, not downloaded at startup
  • Use multi-stage builds to minimize final image size

The "Scale-to-Zero" Anti-Pattern

ACI markets itself as "scale to zero" but this only works for batch jobs. Web applications that scale to zero have terrible user experience.

The problem: First request after scale-to-zero takes 30-90 seconds to respond while ACI provisions a new container and your app starts up.

Better approach: Use 1 minimum instance with Azure Container Apps instead:

az containerapp create --resource-group mygroup --name myapp \
  --image myapp:latest \
  --min-replicas 1 --max-replicas 10

Container Apps gives you the serverless benefits without the cold start penalty for the first user.

When ACI scale-to-zero makes sense:

  • Scheduled batch jobs (run at 2am, complete in 10 minutes, shut down)
  • CI/CD build agents (spin up per build, shut down after completion)
  • Development environments (nobody cares about 30-second startup times)

The "Monitoring That Actually Helps" Pattern

Azure's default ACI monitoring shows you pretty graphs that don't help during outages. Set up monitoring that tells you what's broken and how to fix it.

Essential metrics to track:

## Container restart count (alert when it keeps climbing - it lives in instanceView, not Azure Monitor metrics)
az container show --resource-group mygroup --name mycontainer --query "containers[0].instanceView.restartCount"

## Memory utilization (alert above ~80% to prevent OOM kills)
az monitor metrics list --resource <container-group-resource-id> --metric MemoryUsage

## CPU utilization over time (identifies performance degradation)
az monitor metrics list --resource <container-group-resource-id> --metric CpuUsage

Log aggregation that doesn't suck: ACI's default logging only keeps logs for 7 days. Send logs somewhere useful:

az container create --resource-group mygroup --name mycontainer \
  --image myapp:latest \
  --log-analytics-workspace /subscriptions/xxx/resourceGroups/xxx/providers/Microsoft.OperationalInsights/workspaces/myworkspace \
  --log-analytics-workspace-key <workspace-key>
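
Once logs land in the workspace, you can query them from the CLI too. A sketch - the workspace GUID is a placeholder, and ContainerInstanceLog_CL is the table ACI writes to:

az monitor log-analytics query --workspace <workspace-guid> \
  --analytics-query "ContainerInstanceLog_CL | where TimeGenerated > ago(1h) | take 50"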

Error alerting: Set up alerts for actual error conditions, not just "container restarted":

## Alert when the image fails to pull 3+ times in 15 minutes
az monitor scheduled-query create --resource-group mygroup \
  --name "ACI-StartupFailures" \
  --scopes <log-analytics-workspace-resource-id> \
  --condition "count 'PullFailures' >= 3" \
  --condition-query PullFailures="ContainerInstanceLog_CL | where LogEntry_s contains 'Failed to pull image'" \
  --window-size 15m --evaluation-frequency 5m

The "Cost Control That Prevents Bill Shock" Pattern

ACI's per-second billing sounds cheap until you realize a container stuck in a restart loop can cost hundreds of dollars before anyone notices.

Resource limits that prevent runaway costs:

az container create --resource-group mygroup --name mycontainer \
  --image myapp:latest \
  --cpu 1 --memory 2 \
  --restart-policy OnFailure   ## don't restart containers that exit cleanly

## There's no max-restart-count option in ACI - pair OnFailure with an alert on
## instanceView.restartCount so a crash loop gets a human instead of a surprise bill

Regional cost optimization: ACI pricing varies significantly by region. For batch workloads, use cheaper regions:

  • East US: $0.045/vCPU-hour
  • South Central US: $0.043/vCPU-hour
  • West Europe: $0.054/vCPU-hour
  • Japan East: $0.061/vCPU-hour

Spot container risk management: Spot instances save money but can disappear. Use them for fault-tolerant workloads:

## Spot container for the cheap path (no automatic failover - you create the backup yourself)
az container create --resource-group mygroup --name myapp-spot \
  --image myapp:latest --priority Spot

## If spot fails, create regular container as backup
az container create --resource-group mygroup --name myapp-regular \
  --image myapp:latest --priority Regular

These patterns come from learning ACI's limitations the hard way. Use them to avoid the common production failures that turn simple container deployments into 3am emergency calls.
