The "Oh Shit" Errors: What Failed and How to Fix It Fast

Q

Container keeps restarting with exit code 137 - what the hell is happening?

A

Your container died with exit code 137? That's 128 + 9 (SIGKILL) - the kernel's OOM killer terminated it for eating too much memory. Been there, it sucks. ACI enforces memory limits like a bouncer - no warnings, just instant death when you cross the line.

Quick fix:

az container show --resource-group mygroup --name mycontainer --query "containers[0].resources"

If your container is requesting 1GB but actually needs 1.5GB, bump it to 2GB. You can't resize a running container group - you delete it and recreate it with a bigger --memory. Don't try to be clever with exact limits - Azure bills per second, so the extra memory costs pennies compared to downtime.
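
A minimal sketch of the recreate, reusing the names from the command above (the 2GB value is illustrative):

## Recreate the container group with more memory
az container delete --resource-group mygroup --name mycontainer --yes
az container create --resource-group mygroup --name mycontainer \
  --image myimage --cpu 1 --memory 2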

Why this happens: Your Node.js app that "only uses 500MB in dev" suddenly needs 2GB in production because you forgot about the 15 npm packages that cache everything in memory. Or that Python script that seemed fine locally but loads the entire dataset into a pandas DataFrame on Azure.

Q

"Failed to pull image" but the image exists and worked yesterday

A

This is the most frustrating error because it's usually not your fault. ACI's image pulling is inconsistent as hell, especially for private registries.

First thing to try: Delete and recreate the container group. Seriously. ACI gets stuck in weird states.

az container delete --resource-group mygroup --name mycontainer --yes
az container create --resource-group mygroup --name mycontainer --image myimage

If that doesn't work: Check whether your registry credentials expired. Service principal secrets have expiration dates that nobody remembers to renew - and az acr credential show only covers the admin user, so check the service principal separately:

az acr credential show --name myregistry
az ad sp credential list --id <service-principal-app-id>

Last resort: Use managed identity instead of service principals. At least then Azure handles the credential rotation (the identity needs the AcrPull role on the registry):

az container create --resource-group mygroup --name mycontainer \
  --image myregistry.azurecr.io/myapp:latest \
  --assign-identity <identity-resource-id> \
  --acr-identity <identity-resource-id>
Q

Container starts then immediately dies - no useful logs

A

This usually means your app is crashing before it can write anything meaningful. The container starts, your process exits, ACI restarts it, repeat forever.

Debug with a sleep command: Override the container command to keep it running so you can exec into it:

az container create --resource-group mygroup --name debug-container \
  --image myapp:latest --command-line "sleep 3600"

Then exec in and run your app manually:

az container exec --resource-group mygroup --name debug-container --exec-command "/bin/bash"

Common causes:

  • Missing environment variables your app needs (see the example after this list)
  • Database connection strings that don't work from Azure
  • File permissions fucked up in your Docker image
  • Your app expects files that don't exist in the container
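
If the missing piece is environment variables, pass them at create time. A hedged sketch - the variable names are made up, and --secure-environment-variables keeps secret values out of az container show output:

az container create --resource-group mygroup --name mycontainer \
  --image myapp:latest \
  --environment-variables NODE_ENV=production \
  --secure-environment-variables DB_CONNECTION_STRING="<connection-string>"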
Q

Container group stuck in "Pending" state forever

A

This means Azure can't find resources to run your container. Usually happens in smaller regions or when you're asking for too much CPU/memory.

Quick fix: Try a different region:

az container create --resource-group mygroup --name mycontainer \
  --image myapp:latest --location eastus2

Resource requirements check: ACI has quotas per region. If you're asking for 4 vCPUs in West Europe during peak hours, you might wait forever. Drop to 2 vCPUs or try a different region.

Emergency workaround: Use spot containers if your workload can handle interruptions:

az container create --resource-group mygroup --name mycontainer \
  --image myapp:latest --priority Spot
Q

Networking issues - can't connect to container ports

A

ACI networking is simple until it isn't. Most connection issues come down to port mismatches or security group bullshit.

Port mapping reality check: Unlike Docker, ACI doesn't do port mapping. If your app listens on port 3000 inside the container, you expose port 3000 in ACI. Period.

az container create --resource-group mygroup --name mycontainer \
  --image myapp:latest --ports 3000 --ip-address Public

VNet deployment gotcha: If you deploy into a VNet, you need a NAT gateway for outbound connections. Microsoft's docs mention this in small print after you've already hit the issue:

## Your container can't reach the internet without this
az network nat gateway create --resource-group mygroup --name mynatgateway
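
That single command isn't the whole story - the NAT gateway needs a public IP and has to be attached to the container subnet before outbound traffic flows. Roughly (the VNet and subnet names here are placeholders):

az network public-ip create --resource-group mygroup --name mynatip --sku Standard
az network nat gateway create --resource-group mygroup --name mynatgateway \
  --public-ip-addresses mynatip
az network vnet subnet update --resource-group mygroup --vnet-name myvnet \
  --name aci-subnet --nat-gateway mynatgateway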

The 3AM Debugging Methodology: How to Actually Fix ACI Issues

When your ACI containers are down and you're getting paged at 3am, you don't have time for Microsoft's 47-step troubleshooting flowchart. Here's the methodology that actually works when everything's on fire.

Step 1: Get the Real Error Message (Not the Bullshit Summary)

The Azure portal shows you friendly error messages that are completely useless. Get the actual error details:

az container show --resource-group mygroup --name mycontainer --query "containers[0].instanceView"

This gives you the real container events, not the sanitized portal version. Look for the previousState and currentState sections - that's where the actual errors hide.

What you're looking for: Exit codes, specific error messages, and the restartCount. If restartCount keeps climbing, you have a restart loop. If it's stuck at 0 with state: Waiting, your container never started properly.
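
The container's own stdout/stderr is usually more telling than the event list. Pull it with one command (add --container-name only for multi-container groups):

az container logs --resource-group mygroup --name mycontainer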

Step 2: The Resource Check (Because Azure Lies About Availability)

ACI will tell you resources are available in a region, then fail to provision them. Check what you're actually asking for:

## See what you requested
az container show --resource-group mygroup --name mycontainer --query "containers[0].resources.requests"

## See the limits you configured (if any)
az container show --resource-group mygroup --name mycontainer --query "containers[0].resources.limits"

If you requested 4 vCPUs and got nothing, try 2 vCPUs. If you requested 8GB RAM and got nothing, try 4GB. ACI availability is unpredictable, especially in smaller regions.

Regional reality check: Some Azure regions just don't have capacity for larger container instances. If you're stuck, try these regions that usually have availability:

  • East US 2
  • West Europe
  • Southeast Asia

Step 3: The Image Pull Investigation

Image pull failures are the most common production issue because there are 15 different ways they can break:

Registry connectivity test:

az acr login --name myregistry
docker pull myregistry.azurecr.io/myapp:latest

If this fails, your image doesn't exist or your credentials are wrong. If it works, ACI is having registry authentication issues.

Registry authentication debugging:

## Check ACR credentials
az acr credential show --name myregistry

## Check managed identity assignment
az container show --resource-group mygroup --name mycontainer --query "identity"

The nuclear option: If nothing else works, copy your image to Docker Hub temporarily. It's not secure for production, but it'll get your service back up while you fix ACR authentication:

docker tag myregistry.azurecr.io/myapp:latest mydockerhubuser/myapp:emergency
docker push mydockerhubuser/myapp:emergency
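
Then point the broken container group at the emergency tag while you fix ACR auth (names are illustrative):

az container create --resource-group mygroup --name mycontainer \
  --image mydockerhubuser/myapp:emergency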

Step 4: Container Startup Debugging

If your container pulls successfully but crashes immediately, you need to debug the startup process:

Method 1: Override the entrypoint

az container create --resource-group mygroup --name debug-myapp \
  --image myregistry.azurecr.io/myapp:latest \
  --command-line "tail -f /dev/null"

This keeps the container running so you can exec in and debug manually.

Method 2: Check environment variables

az container exec --resource-group mygroup --name debug-myapp --exec-command "env"

Your app might be looking for environment variables that aren't set. Production containers often have different ENV requirements than development.

Method 3: Run your app manually

az container exec --resource-group mygroup --name debug-myapp --exec-command "/bin/bash"
## Then inside the container
/path/to/your/app

This will show you the actual error your app throws, which is usually more helpful than ACI's generic error messages.

Step 5: The Memory and CPU Reality Check

ACI enforces resource limits strictly. If your app works fine in development but crashes in ACI, you're probably hitting resource limits:

Memory debugging:

## az container exec can't pass a command with arguments - exec into a shell first
az container exec --resource-group mygroup --name mycontainer --exec-command "/bin/sh"

## Then, inside the container:
free -h                 ## overall memory usage
ps aux --sort=-%mem     ## which processes are eating it

The 25% rule: Always request 25% more CPU and memory than your app needs. ACI doesn't handle resource contention gracefully - it just kills containers that exceed limits.

CPU performance gotcha: Fractional vCPUs (0.5, 1.5) perform poorly when Azure is busy. If your app is CPU-intensive, use whole vCPU numbers (1, 2, 4).
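
Putting both rules together, a sketch for an app that peaks around 1.6GB and is CPU-heavy (the numbers are illustrative):

## ~1.6GB peak + 25% headroom -> 2GB; CPU-heavy -> whole vCPUs
az container create --resource-group mygroup --name mycontainer \
  --image myapp:latest --cpu 2 --memory 2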

This methodology has saved me countless hours of frustration. Skip the Microsoft docs flowcharts and go straight to what actually identifies the problem.

The Advanced Fuckups: When Basic Troubleshooting Isn't Enough

Q

Multi-container groups: one container kills the others

A

Container groups in ACI are like a house of cards. When one container in the group crashes, it can take down the whole group depending on your restart policy.

The problem: You have a web app + Redis sidecar. Redis runs out of memory and crashes. ACI restarts the entire container group, killing your web app sessions.

Quick fix: Separate your containers into different container groups:

## Create Redis separately
az container create --resource-group mygroup --name redis-instance \
  --image redis:alpine --memory 1 --cpu 0.5

## Create web app that connects to Redis via its IP
az container create --resource-group mygroup --name webapp-instance \
  --image myapp:latest --memory 2 --cpu 1 \
  --environment-variables REDIS_HOST=<redis-instance-ip>
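
To fill in <redis-instance-ip>, pull it from the Redis group after creation. This assumes the Redis group actually has an IP to hand out - e.g. it was created with --ip-address Public, or both groups sit in the same VNet:

az container show --resource-group mygroup --name redis-instance \
  --query "ipAddress.ip" --output tsv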

When you actually need container groups: Only when containers absolutely must share localhost networking. Otherwise, separate container groups are more reliable.

Q

Storage volumes randomly unmount

A

ACI's Azure Files integration looks great on paper but fails in creative ways in production.

Symptoms: Your app writes to /data/logs/app.log successfully, then later crashes because the file doesn't exist. The Azure Files share appears to be unmounted.

Root cause: Network hiccups between ACI and Azure Files. When the network connection drops, the mount fails silently.
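
For context, this is the kind of mount that can silently drop - a typical Azure Files volume wired in at create time (storage account, key, and share names are placeholders):

az container create --resource-group mygroup --name mycontainer \
  --image myapp:latest \
  --azure-file-volume-account-name mystorageaccount \
  --azure-file-volume-account-key "<storage-account-key>" \
  --azure-file-volume-share-name myshare \
  --azure-file-volume-mount-path /data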

Workaround: Mount Azure Files with error handling in your application:

## Check if mount exists before writing
if [ -d "/data" ]; then
    echo "Writing to /data/logs/app.log"
else
    echo "Mount failed, writing to /tmp/app.log instead"
fi

Better solution: Use blob storage with the Azure CLI instead of file mounts:

az storage blob upload --account-name mystorage --container-name logs \
  --name "app-$(date +%Y%m%d).log" --file /tmp/app.log
Q

Confidential containers cost a fortune but don't work with your app

A

Microsoft's confidential containers sound cool until you realize they break half your applications and cost 3x normal pricing.

What breaks:

  • Apps that need privileged operations
  • Containers that access hardware directly
  • Some Node.js modules that use native binaries
  • Any app that expects specific CPU features

Cost reality: Confidential containers can cost $0.15/vCPU-hour vs $0.045/vCPU-hour for regular containers. That's $108/month vs $32/month for a single 1-vCPU container running 24/7.

When to actually use them: Processing truly sensitive data where the extra cost is worth it. For most applications, standard ACI with network security groups is sufficient.

Q

Spot containers get evicted during critical operations

A

Spot containers save 70% on costs but can disappear with 30 seconds notice when Azure needs the capacity back.

The problem: You're running a 4-hour batch job on spot containers. After 3 hours and 45 minutes, Azure evicts your container and you lose all progress.

Solution pattern: Checkpoint your progress regularly:

## Save state every 15 minutes (process_batch_chunk and save_state_to_blob are your own functions)
while true; do
  process_batch_chunk
  save_state_to_blob
  sleep 900
done
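
A minimal sketch of what save_state_to_blob could look like - the checkpoint file, storage account, and container names are all assumptions:

save_state_to_blob() {
  ## Overwrite the checkpoint so a restarted job can resume where it left off
  az storage blob upload --account-name mystorage --container-name checkpoints \
    --name batch-checkpoint.json --file /tmp/checkpoint.json --overwrite
}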

Don't use spot for:

  • Web applications (users will get 500 errors when evicted)
  • Database containers
  • Long-running tasks without checkpointing

Good for:

  • CI/CD pipelines (can restart from failed step)
  • Log processing (can reprocess failed batch)
  • Development/testing environments
Q

Regional capacity issues during Azure outages

A

When Azure has regional issues, ACI often fails silently. Your container group just sits in "Pending" state forever.

Detection: If your container has been pending for more than 10 minutes, it's probably a capacity issue:

az container show --resource-group mygroup --name mycontainer --query "provisioningState"

Multi-region deployment: Deploy identical container groups in multiple regions:

## Primary region
az container create --resource-group mygroup-east --name myapp-east \
  --image myapp:latest --location eastus2

## Backup region  
az container create --resource-group mygroup-west --name myapp-west \
  --image myapp:latest --location westus2

Use Azure Traffic Manager to route traffic between healthy instances. It's more expensive but prevents total outages during regional issues.
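
The Traffic Manager wiring is only a few commands. A rough sketch using priority (failover) routing - the profile name, DNS label, and endpoint FQDNs are placeholders, and the FQDNs assume each container group was created with a --dns-name-label:

## Profile with priority routing: east is primary, west is the backup
az network traffic-manager profile create --resource-group mygroup \
  --name myapp-tm --routing-method Priority --unique-dns-name myapp-tm-demo

az network traffic-manager endpoint create --resource-group mygroup \
  --profile-name myapp-tm --name east --type externalEndpoints \
  --target myapp-east.eastus2.azurecontainer.io --priority 1

az network traffic-manager endpoint create --resource-group mygroup \
  --profile-name myapp-tm --name west --type externalEndpoints \
  --target myapp-west.westus2.azurecontainer.io --priority 2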

Production-Ready ACI Patterns That Actually Work

Most ACI tutorials show you hello-world demos that work perfectly in development but fall apart the moment real users hit them. Here are the patterns that survive production deployment.

The "Always-On" Container Pattern

ACI containers can randomly restart for platform maintenance. Your application needs to handle this gracefully or you'll have mysterious 30-second outages.

What doesn't work: Storing application state in memory or local files. When ACI restarts your container, everything in memory disappears.

What works: Store state externally and design for quick restart:

## Bad: state stored in container
webapp_container:
  image: myapp:latest
  # App stores user sessions in memory
  # File uploads saved to /tmp
  # Database connections cached in global variables

## Good: stateless design
webapp_container:
  image: myapp:latest
  environment:
    REDIS_URL: myredis.redis.cache.windows.net
    BLOB_STORAGE: mystorageaccount.blob.core.windows.net
    DB_CONNECTION_POOL_SIZE: 10

Startup time reality: ACI cold starts can take 15-60 seconds (or forever on a bad day). Speed up your container startup:

  • Use Alpine Linux base images (5MB vs 200MB for Ubuntu)
  • Pre-compile languages like .NET with ReadyToRun
  • Cache dependencies in the Docker image, not downloaded at startup
  • Use multi-stage builds to minimize final image size

The "Scale-to-Zero" Anti-Pattern

ACI markets itself as "scale to zero" but this only works for batch jobs. Web applications that scale to zero have terrible user experience.

The problem: First request after scale-to-zero takes 30-90 seconds to respond while ACI provisions a new container and your app starts up.

Better approach: Use 1 minimum instance with Azure Container Apps instead:

az containerapp create --resource-group mygroup --name myapp \
  --image myapp:latest \
  --min-replicas 1 --max-replicas 10

Container Apps gives you the serverless benefits without the cold start penalty for the first user.

When ACI scale-to-zero makes sense:

  • Scheduled batch jobs (run at 2am, complete in 10 minutes, shut down)
  • CI/CD build agents (spin up per build, shut down after completion)
  • Development environments (nobody cares about 30-second startup times)

The "Monitoring That Actually Helps" Pattern

Azure's default ACI monitoring shows you pretty graphs that don't help during outages. Set up monitoring that tells you what's broken and how to fix it.

Essential metrics to track:

## Container restart count (alert when it keeps climbing - it lives in instanceView, not Azure Monitor metrics)
az container show --resource-group mygroup --name mycontainer --query "containers[0].instanceView.restartCount"

## Memory utilization (alert above ~80% to prevent OOM kills)
az monitor metrics list --resource <container-group-resource-id> --metric MemoryUsage

## CPU utilization over time (identifies performance degradation)
az monitor metrics list --resource <container-group-resource-id> --metric CpuUsage

Log aggregation that doesn't suck: ACI's default logging only keeps logs for 7 days. Send logs somewhere useful:

az container create --resource-group mygroup --name mycontainer \
  --image myapp:latest \
  --log-analytics-workspace /subscriptions/xxx/resourceGroups/xxx/providers/Microsoft.OperationalInsights/workspaces/myworkspace \
  --log-analytics-workspace-key <workspace-key>
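
Once logs land in the workspace, you can query them from the CLI too. A sketch - the workspace GUID is a placeholder, and ContainerInstanceLog_CL is the table ACI writes to:

az monitor log-analytics query --workspace <workspace-guid> \
  --analytics-query "ContainerInstanceLog_CL | where TimeGenerated > ago(1h) | take 50"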

Error alerting: Set up alerts for actual error conditions, not just "container restarted":

## Alert when the image fails to pull 3+ times in 15 minutes
az monitor scheduled-query create --resource-group mygroup \
  --name "ACI-StartupFailures" \
  --scopes <log-analytics-workspace-resource-id> \
  --condition "count 'PullFailures' >= 3" \
  --condition-query PullFailures="ContainerInstanceLog_CL | where LogEntry_s contains 'Failed to pull image'" \
  --window-size 15m --evaluation-frequency 5m

The "Cost Control That Prevents Bill Shock" Pattern

ACI's per-second billing sounds cheap until you realize a container stuck in a restart loop can cost hundreds of dollars before anyone notices.

Resource limits that prevent runaway costs:

az container create --resource-group mygroup --name mycontainer \
  --image myapp:latest \
  --cpu 1 --memory 2 \
  --restart-policy OnFailure   ## don't restart containers that exit cleanly

## There's no max-restart-count option in ACI - pair OnFailure with an alert on
## instanceView.restartCount so a crash loop gets a human instead of a surprise bill

Regional cost optimization: ACI pricing varies significantly by region. For batch workloads, use cheaper regions:

  • East US: $0.045/vCPU-hour
  • South Central US: $0.043/vCPU-hour
  • West Europe: $0.054/vCPU-hour
  • Japan East: $0.061/vCPU-hour

Spot container risk management: Spot instances save money but can disappear. Use them for fault-tolerant workloads:

## Spot container for the cheap path (no automatic failover - you create the backup yourself)
az container create --resource-group mygroup --name myapp-spot \
  --image myapp:latest --priority Spot

## If spot fails, create regular container as backup
az container create --resource-group mygroup --name myapp-regular \
  --image myapp:latest --priority Regular

These patterns come from learning ACI's limitations the hard way. Use them to avoid the common production failures that turn simple container deployments into 3am emergency calls.
