Azure Container Instances: Production Troubleshooting Reference
Critical Failure Scenarios and Fixes
Container Exit Code 137 (Memory Kill)
Cause: The kernel OOM killer sends SIGKILL when the container exceeds its memory limit (exit code 137 = 128 + signal 9). There is no warning and no graceful shutdown
Impact: Service downtime, lost user sessions, data loss
Fix: Raise the memory request to at least 25% above observed peak usage. Check the current allocation first:
az container show --resource-group mygroup --name mycontainer --query "containers[0].resources"
Reality Check: Apps that use 500MB in dev often need 2GB in production due to caching and dependencies
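The 25% rule can be turned into a small calculation. This is a minimal sketch, assuming you have measured peak usage in MB yourself; `headroom_gb` is a hypothetical helper name, and it rounds up to one decimal place because `az container create --memory` takes GB at that precision.

```shell
# Hypothetical helper: convert an observed peak (in MB) into a memory request
# in GB, adding 25% headroom and rounding up to one decimal place.
headroom_gb() {
  local peak_mb=$1
  # tenths of a GB = ceil(peak_mb * 1.25 / 102.4), kept in integer arithmetic
  local tenths=$(( (peak_mb * 1250 + 102399) / 102400 ))
  printf '%d.%d\n' $(( tenths / 10 )) $(( tenths % 10 ))
}

headroom_gb 500    # a peak observed in dev
headroom_gb 2048   # a peak observed under production load
```

The result can be passed straight to the CLI, for example `--memory "$(headroom_gb 2048)"`.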
Image Pull Failures ("Failed to pull image")
Frequency: Most common production failure
Root Causes:
- ACI stuck in weird states (60% of cases)
- Expired service principal credentials (30% of cases)
- Registry connectivity issues (10% of cases)
Fix Priority Order:
- Delete and recreate container group (fixes 60% of issues)
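- Check whether registry credentials are still valid. For the ACR admin user:
az acr credential show --name myregistry
For a service principal, list its secrets and expiry dates (<app-id> is a placeholder for the principal's application ID):
az ad sp credential list --id <app-id> --query "[].endDateTime"
- Switch to managed identity so there are no credentials to expire or rotate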
Emergency Workaround: Copy image to Docker Hub temporarily
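Expired credentials are easier to catch before they break a pull. The sketch below computes the days left until an expiry timestamp; `days_until` is a hypothetical helper, the timestamps are made-up examples, and it assumes GNU `date` (the `-d` option), which is not available on BSD or macOS.

```shell
# Hypothetical helper: whole days between a reference time and an ISO-8601
# expiry timestamp. Assumes GNU date; the second argument defaults to "now".
days_until() {
  local expiry=$1 now=${2:-now}
  local exp_s now_s
  exp_s=$(date -u -d "$expiry" +%s) || return 1
  now_s=$(date -u -d "$now" +%s) || return 1
  echo $(( (exp_s - now_s) / 86400 ))
}

# Example with fixed, made-up timestamps so the result is reproducible
days_until "2025-03-01T00:00:00Z" "2025-02-01T00:00:00Z"
```

Feeding it the expiry value of a service principal secret and alerting when the result drops below a threshold of your choosing is one possible use.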
Container Restart Loops (Immediate Death)
Symptoms: Container starts, process exits, ACI restarts, repeat
Debug Method: Override the entrypoint with sleep 3600 and exec in
Common Causes:
- Missing environment variables (70% of cases)
- Database connection failures from Azure (20% of cases)
- File permission issues in Docker image (10% of cases)
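Since missing environment variables are listed as the most common cause, a preflight check at the top of the entrypoint can make the failure obvious in the logs. This is a minimal sketch; `require_env` is a hypothetical helper and the variable names in the usage comment are placeholders.

```shell
# Hypothetical preflight check for a container entrypoint.
# Prints every missing variable, then returns non-zero if any is unset or empty.
require_env() {
  local missing=0 name value
  for name in "$@"; do
    # Look up the variable whose name is stored in $name (portable across sh and bash)
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage at the top of an entrypoint script (names are placeholders):
# require_env DATABASE_URL REDIS_HOST || exit 1
```

Reporting all missing names at once avoids fixing one variable per restart cycle.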
Container Group Stuck in "Pending"
Cause: Azure resource unavailability
Timeout Threshold: >10 minutes indicates capacity issue
Solutions:
- Try different region (East US 2, West Europe, Southeast Asia have best availability)
- Reduce resource requirements (4→2 vCPUs, 8→4GB RAM)
- Use spot containers with interruption tolerance
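The 10-minute threshold can be enforced with a polling loop instead of watching the portal. This is a sketch under assumptions: `wait_until_running` is a hypothetical helper that takes the state command as arguments, and the `instanceView.state` query path in the usage comment should be checked against your own `az container show` output.

```shell
# Hypothetical poller: runs a caller-supplied state command until it prints
# "Running" or the timeout elapses. Returns 0 on success, 1 on timeout.
wait_until_running() {
  local timeout_s=$1 interval_s=$2
  shift 2
  local waited=0 state=""
  while [ "$waited" -lt "$timeout_s" ]; do
    state=$("$@" 2>/dev/null)
    [ "$state" = "Running" ] && return 0
    sleep "$interval_s"
    waited=$(( waited + interval_s ))
  done
  echo "still '$state' after ${timeout_s}s; consider another region or a smaller size" >&2
  return 1
}

# Example state command (resource names are placeholders):
# wait_until_running 600 15 az container show --resource-group mygroup \
#   --name mycontainer --query "instanceView.state" --output tsv
```

Passing the state command in as arguments keeps the loop testable with a stub and independent of any one CLI query.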
Resource Requirements and Limits
Memory Allocation Strategy
- Rule: Always request 25% more than app needs
- Why: ACI kills containers that exceed limits without graceful handling
- Cost Impact: Extra memory costs pennies vs downtime costs
CPU Performance Characteristics
- Fractional vCPUs (0.5, 1.5): Poor performance during Azure busy periods
- Whole vCPUs (1, 2, 4): Consistent performance for CPU-intensive apps
- Spot Containers: 70% cost savings but 30-second eviction notice
Regional Capacity Reality
- Small Regions: Limited capacity for large instances (>2 vCPU)
- Peak Hours: Even large regions such as West Europe can run out of capacity
- Generally Reliable Regions: East US 2, West Europe (outside peak hours), Southeast Asia
Configuration That Works in Production
Container Startup Optimization
- Cold Start Time: 15-60 seconds (can be infinite on bad days)
- Image Size Impact: Smaller images pull faster; an Alpine base is roughly 5MB, while Ubuntu-based application images often reach 200MB or more
- Dependency Strategy: Cache in image, don't download at startup
Storage Volume Reality
- Azure Files Mounts: Silently fail during network hiccups
- Workaround: Check mount exists before writing
- Better Solution: Write to Blob Storage directly (for example with az storage blob upload) instead of relying on a mounted share
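The "check mount exists before writing" workaround might look like the following. It is a sketch, assuming the util-linux `mountpoint` tool is present in the image; `write_if_mounted` is a hypothetical helper, and a share that is mounted but hung could still pass this check.

```shell
# Hypothetical guard: only write when the target directory is a live mount point.
# Assumes the util-linux `mountpoint` tool is available in the image.
write_if_mounted() {
  local mount_dir=$1 file_name=$2 content=$3
  if ! mountpoint -q "$mount_dir"; then
    echo "refusing to write: $mount_dir is not mounted" >&2
    return 1
  fi
  printf '%s\n' "$content" > "$mount_dir/$file_name"
}

# Usage (path is a placeholder for wherever the share is mounted):
# write_if_mounted /mnt/azfiles report.txt "job finished" || exit 1
```

Failing loudly here prevents data from landing on the container's local filesystem, where it disappears on restart.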
Multi-Container Group Risks
- Problem: One container crash kills entire group
- Solution: Separate containers into different groups unless localhost networking required
- Exception: Only group containers that absolutely must share localhost
Cost Control and Billing
Pricing by Region (per vCPU-hour)
- East US: $0.045
- South Central US: $0.043
- West Europe: $0.054
- Japan East: $0.061
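To compare regions for an always-on group, the per-hour rates can be scaled to a month. This is a rough sketch: `monthly_vcpu_cost` is a hypothetical helper, 730 hours per month is an assumed average, and memory charges are billed separately and not included.

```shell
# Hypothetical estimate: vCPU charge only, for a group that runs all month.
# Assumes 730 hours per month; memory is billed separately and is not included.
monthly_vcpu_cost() {
  local vcpus=$1 rate_per_vcpu_hour=$2
  awk -v n="$vcpus" -v r="$rate_per_vcpu_hour" 'BEGIN { printf "%.2f\n", n * r * 730 }'
}

monthly_vcpu_cost 2 0.045   # East US rate from the list above
monthly_vcpu_cost 2 0.061   # Japan East rate from the list above
```

The same arithmetic shows why a group stuck in a restart loop for days becomes expensive before anyone notices.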
Cost Protection Settings
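--restart-policy OnFailure # Restart only after a non-zero exit, not after a clean one
# az container create has no option to cap the number of restarts; enforce a limit externally by watching restartCount and stopping the group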
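One way to enforce a restart cap from outside the container group is a small watchdog run on a schedule. This is a sketch, not a built-in ACI feature; `restart_cap_exceeded` is a hypothetical helper and the resource names in the comments are placeholders.

```shell
# Hypothetical watchdog step: succeed when the restart count has passed the cap.
restart_cap_exceeded() {
  local restart_count=$1 max_restarts=$2
  [ "$restart_count" -gt "$max_restarts" ]
}

# Sketch of how it could be wired up (resource names are placeholders):
# count=$(az container show --resource-group mygroup --name mycontainer \
#   --query "containers[0].instanceView.restartCount" --output tsv)
# if restart_cap_exceeded "$count" 3; then
#   az container stop --resource-group mygroup --name mycontainer
# fi
```

Stopping the group ends the billing for a container that would otherwise keep crashing and restarting.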
Confidential Containers
- Cost: Roughly 3x normal pricing ($0.15 vs $0.045 per vCPU-hour)
- Limitations: Breaks privileged operations, hardware access, some Node.js modules
- Use Case: Only for truly sensitive data processing
Emergency Debugging Methodology
Step 1: Get Real Error Messages
az container show --resource-group mygroup --name mycontainer --query "containers[0].instanceView"
Look for previousState, currentState, and restartCount
Step 2: Resource Verification
# Resources you requested
az container show --resource-group mygroup --name mycontainer --query "containers[0].resources.requests"
# Resource limits, if any were set (empty when no limits are configured)
az container show --resource-group mygroup --name mycontainer --query "containers[0].resources.limits"
Step 3: Registry Authentication Test
az acr login --name myregistry
docker pull myregistry.azurecr.io/myapp:latest
Step 4: Container Startup Debug
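# Start a debug copy with the entrypoint overridden so the container stays up
az container create --resource-group mygroup --name mycontainer-debug --image myregistry.azurecr.io/myapp:latest --command-line "sleep 3600"
# Exec in and run the app manually (use /bin/sh if the image has no bash)
az container exec --resource-group mygroup --name mycontainer-debug --exec-command "/bin/bash"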
Monitoring and Alerting
Essential Metrics
- Restart Count: Alert when >3 in 10 minutes
- Memory Usage: Alert when >80% to prevent OOM kills
- CPU Usage: Track performance degradation over time
Log Aggregation
- Default Retention: 7 days only
- Solution: Send to Log Analytics Workspace
- Query: Set up alerts for startup failures, not just restarts
Production Patterns That Survive
Always-On Container Design
- State Storage: External only (Redis, blob storage)
- Database Connections: Connection pooling, not global variables
- File Storage: Blob storage, not local filesystem
Scale-to-Zero Anti-Pattern
- Problem: 30-90 second cold start kills user experience
- Solution: Use Azure Container Apps with --min-replicas 1 so one instance stays warm
- Valid Use Cases: Batch jobs, CI/CD agents, development environments
Multi-Region Deployment
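# Primary region
az container create --resource-group mygroup --name myapp-eastus2 --image myregistry.azurecr.io/myapp:latest --location eastus2
# Backup region
az container create --resource-group mygroup --name myapp-westus2 --image myregistry.azurecr.io/myapp:latest --location westus2
Put Traffic Manager in front of both groups for automatic failover during regional issues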
Breaking Points and Failure Modes
Memory Limits
- Enforcement: Strict, no warnings before kill
- Exit Code 137: Kernel memory kill
- Prevention: 25% overhead rule
Network Dependencies
- VNet Deployment: Requires NAT gateway for outbound connections
- Port Mapping: No Docker-style port mapping (container port = exposed port)
- Azure Files: Unreliable during network issues
Spot Container Evictions
- Warning Time: 30 seconds
- Don't Use For: Web apps, databases, long tasks without checkpointing
- Good For: CI/CD, log processing, development environments
Critical Warnings
What Microsoft Docs Don't Tell You
- ACI availability is unpredictable in smaller regions
- Image pull failures are common and often not your fault
- Multi-container groups are fragile and should be avoided
- Scale-to-zero has terrible user experience for web apps
- Azure Files mounts fail silently during network issues
Common Misconceptions
- "ACI is reliable for production web apps" - False, use Container Apps instead
- "Fractional vCPUs save money" - False, they perform poorly
- "Container groups are better than separate containers" - False, unless localhost networking required
- "Default monitoring is sufficient" - False, set up proper alerting
Resource Exhaustion Scenarios
- Container stuck in restart loop can cost hundreds before detection
- Pending containers consume quota without providing service
- Memory leaks cause cascading failures in container groups
- Regional outages can last hours without alternative regions configured
Useful Links for Further Investigation
Emergency Resources When Everything's Broken
Link | Description |
---|---|
Microsoft's ACI Troubleshooting Guide | The error codes are accurate even if the solutions are generic |
ACI Resource Limits and Quotas | Know these limits before you hit them at 3am |
Container Group Events and Logs | How to get the real error messages, not the sanitized portal versions |
Azure CLI Container Commands Reference | All the az container commands you need for emergency debugging |
Docker Best Practices for ACI | Make your images start faster in ACI |
Exit Code Reference Guide | What those cryptic exit codes actually mean |
ACR Authentication Methods | Service principals, managed identities, and which one to use when |
ACR Troubleshooting Guide | Fix "failed to pull image" errors |
Azure Managed Identity Setup | Eliminate service principal credential expiration issues |
Azure Monitor for Containers | Set up monitoring that catches failures before customers do |
Log Analytics Workspace Setup | Store container logs somewhere you can actually query them |
Azure Alert Rules for ACI | Alert on restart loops and resource exhaustion |
Azure Container Apps vs ACI | When to switch to Container Apps for better reliability |
AKS Serverless Options | Virtual nodes give you ACI simplicity with AKS reliability |
AWS Fargate Comparison | Sometimes the nuclear option is switching cloud providers |