
Azure Container Instances: Production Troubleshooting Reference

Critical Failure Scenarios and Fixes

Container Exit Code 137 (Memory Kill)

Cause: The kernel OOM killer sends SIGKILL (137 = 128 + 9) when the container exceeds its memory limit; there is no warning and no graceful shutdown
Impact: Service downtime, lost user sessions, data loss
Fix: Increase memory allocation by at least 25%

az container show --resource-group mygroup --name mycontainer --query "containers[0].resources"

Reality Check: Apps that use 500MB in dev often need 2GB in production due to caching and dependencies
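Exit codes above 128 conventionally mean "killed by signal (code minus 128)", which is how 137 maps to SIGKILL. A minimal sketch of that decoding follows; the function name describe_exit_code is hypothetical and the messages are illustrative, not output from any Azure tool.

```shell
# Decode a container exit code. By shell convention, values above 128
# mean the process was terminated by signal (code - 128).
describe_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      9)  echo "signal 9 (SIGKILL): often the kernel OOM killer" ;;
      15) echo "signal 15 (SIGTERM): graceful stop requested" ;;
      *)  echo "signal $sig" ;;
    esac
  elif [ "$code" -eq 0 ]; then
    echo "clean exit"
  else
    echo "application error $code"
  fi
}
```

Feed it the exitCode reported under previousState in the instance view. A SIGKILL is not proof of an OOM kill on its own, so check memory usage as well.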

Image Pull Failures ("Failed to pull image")

Frequency: Most common production failure
Root Causes:

  • Container group stuck in a stale or failed provisioning state (60% of cases)
  • Expired service principal credentials (30% of cases)
  • Registry connectivity issues (10% of cases)

Fix Priority Order:

  1. Delete and recreate container group (fixes 60% of issues)
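  2. Check registry credentials: az acr credential show --name myregistry (this shows the admin user only; a service principal's expiry date appears in az ad sp credential list --id <app-id>)
  3. Switch to managed identity so there are no credentials to expire or rotate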

Emergency Workaround: Copy image to Docker Hub temporarily
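The delete-and-recreate step can be sketched as an export, delete, and create sequence. The function name recreate_group and the default file name are hypothetical. The exported YAML may omit secret values such as registry passwords, so in practice inspect the file and re-add credentials before deleting anything.

```shell
# Hypothetical helper: save the group definition, delete the group,
# then recreate it from the saved YAML. Stops at the first failure so
# the group is never deleted without a saved definition.
recreate_group() {
  rg=$1
  name=$2
  file=${3:-"$name.yaml"}
  az container export --resource-group "$rg" --name "$name" --file "$file" || return 1
  az container delete --resource-group "$rg" --name "$name" --yes || return 1
  az container create --resource-group "$rg" --file "$file"
}
```

Example call: recreate_group mygroup mycontainer. Running the three commands by hand with a review of the YAML in between is the safer variant.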

Container Restart Loops (Immediate Death)

Symptoms: Container starts, process exits, ACI restarts, repeat
Debug Method: Override entrypoint with sleep 3600 and exec in
Common Causes:

  • Missing environment variables (70% of cases)
  • Database connection failures from Azure (20% of cases)
  • File permission issues in Docker image (10% of cases)
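Since missing environment variables lead the list above, an entrypoint can fail fast with a readable message instead of dying silently. This is a minimal sketch: require_env is a hypothetical helper, and it relies on printenv being present in the image.

```shell
# Fail fast when required environment variables are unset or empty.
# Pass the names your app needs; nothing here is specific to ACI.
require_env() {
  missing=""
  for var in "$@"; do
    if [ -z "$(printenv "$var")" ]; then
      missing="$missing $var"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing required environment variables:$missing" >&2
    return 1
  fi
}
```

In an entrypoint script, a line such as require_env DATABASE_URL API_KEY || exit 1 (both names are placeholders) before starting the app puts the real cause into the container logs.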

Container Group Stuck in "Pending"

Cause: Azure resource unavailability
Timeout Threshold: >10 minutes indicates capacity issue
Solutions:

  • Try different region (East US 2, West Europe, Southeast Asia have best availability)
  • Reduce resource requirements (4→2 vCPUs, 8→4GB RAM)
  • Use spot containers with interruption tolerance
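The region and size suggestions above can be combined into a simple fallback loop. This is a sketch under assumptions: create_with_fallback is a hypothetical name, the 2 vCPU / 4 GB size follows the reduced figures above, and it treats any create failure as a reason to move on, which may hide errors unrelated to capacity.

```shell
# Try each region in order until one accepts the deployment.
create_with_fallback() {
  rg=$1
  name=$2
  image=$3
  shift 3
  for region in "$@"; do
    if az container create --resource-group "$rg" --name "$name" \
        --image "$image" --os-type Linux --cpu 2 --memory 4 \
        --location "$region"; then
      echo "deployed to $region"
      return 0
    fi
    echo "create failed in $region, trying next region" >&2
    # Remove any half-created group so the name is free for the next region.
    az container delete --resource-group "$rg" --name "$name" --yes \
      >/dev/null 2>&1 || true
  done
  return 1
}
```

Example call: create_with_fallback mygroup mycontainer myregistry.azurecr.io/myapp:latest eastus2 westeurope southeastasia.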

Resource Requirements and Limits

Memory Allocation Strategy

  • Rule: Always request 25% more than app needs
  • Why: ACI kills containers that exceed limits without graceful handling
  • Cost Impact: Extra memory costs pennies vs downtime costs
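The 25% rule is simple arithmetic; a worked sketch is below. It assumes the peak is measured in MB, that 1 GB is 1024 MB, and that the value passed to --memory is rounded up to one decimal place. The function name memory_request_gb is hypothetical.

```shell
# Print a --memory value in GB (one decimal) for a measured peak in MB:
# add 25% headroom, then round up to the next 0.1 GB.
memory_request_gb() {
  peak_mb=$1
  with_headroom_mb=$(( (peak_mb * 125 + 99) / 100 ))
  tenths_gb=$(( (with_headroom_mb * 10 + 1023) / 1024 ))
  echo "$((tenths_gb / 10)).$((tenths_gb % 10))"
}
```

For a 1600 MB peak, memory_request_gb 1600 prints 2.0, since 1600 × 1.25 = 2000 MB is about 1.95 GB and rounds up to 2.0.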

CPU Performance Characteristics

  • Fractional vCPUs (0.5, 1.5): Poor performance during Azure busy periods
  • Whole vCPUs (1, 2, 4): Consistent performance for CPU-intensive apps
  • Spot Containers: 70% cost savings but 30-second eviction notice

Regional Capacity Reality

  • Small Regions: Limited capacity for large instances (>2 vCPU)
  • Peak Hours: West Europe can run short of capacity at peak times
  • Reliable Regions: East US 2, West Europe (outside peak hours), Southeast Asia

Configuration That Works in Production

Container Startup Optimization

  • Cold Start Time: 15-60 seconds (can be infinite on bad days)
  • Image Size Impact: Alpine Linux (5MB) vs Ubuntu (200MB)
  • Dependency Strategy: Cache in image, don't download at startup

Storage Volume Reality

  • Azure Files Mounts: Silently fail during network hiccups
  • Workaround: Check mount exists before writing
  • Better Solution: Write to Blob Storage through the Azure CLI or an SDK instead of a mounted share
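The "check mount exists before writing" workaround can be sketched with a sentinel file. It assumes a marker file (here .mounted, a hypothetical name) was created once on the share itself, so it is only visible while the mount is live. write_if_mounted is likewise a hypothetical helper.

```shell
# Write stdin to a file on the share only if the sentinel file is visible.
# Without the sentinel, the directory is probably the empty local mount
# point, and a write would land on the container's own filesystem.
write_if_mounted() {
  mount_dir=$1
  target=$2
  if [ ! -f "$mount_dir/.mounted" ]; then
    echo "share not mounted at $mount_dir, refusing to write" >&2
    return 1
  fi
  cat > "$mount_dir/$target"
}
```

Example: echo "report" | write_if_mounted /mnt/share report.txt, where /mnt/share is a placeholder mount path.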

Multi-Container Group Risks

  • Problem: One container crash kills entire group
  • Solution: Separate containers into different groups unless localhost networking required
  • Exception: Only group containers that absolutely must share localhost

Cost Control and Billing

Pricing by Region (per vCPU-hour)

  • East US: $0.045
  • South Central US: $0.043
  • West Europe: $0.054
  • Japan East: $0.061
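To turn the per-vCPU-hour prices above into a budget figure, multiply vCPUs by price by hours. The sketch below covers the vCPU portion only; memory is billed separately and is not included. The function name vcpu_cost and the 730-hour month are assumptions.

```shell
# Estimate the vCPU portion of a bill: vcpus * price per vCPU-hour * hours.
vcpu_cost() {
  vcpus=$1
  price_per_vcpu_hour=$2
  hours=$3
  awk -v c="$vcpus" -v p="$price_per_vcpu_hour" -v h="$hours" \
    'BEGIN { printf "%.2f\n", c * p * h }'
}
```

For example, vcpu_cost 2 0.045 730 prints 65.70 for a 2 vCPU container running a full month in East US at the price listed above.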

Cost Protection Settings

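--restart-policy OnFailure  # Restart only after a non-zero exit, not after a clean one
# az container create has no restart-count cap; to give up after 3 failures, watch restartCount and run az container stop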

Confidential Containers

  • Cost: Roughly 3x normal pricing ($0.15 vs $0.045 per vCPU-hour)
  • Limitations: Breaks privileged operations, hardware access, some Node.js modules
  • Use Case: Only for truly sensitive data processing

Emergency Debugging Methodology

Step 1: Get Real Error Messages

az container show --resource-group mygroup --name mycontainer --query "containers[0].instanceView"

Look for previousState, currentState, and restartCount
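The three fields named above can be pulled in one query instead of being read out of the full JSON. A minimal sketch, assuming the first container in the group is the one of interest; container_status is a hypothetical wrapper name.

```shell
# Summarize state, last exit code, last detail message, and restart count
# for the first container in a group.
container_status() {
  az container show --resource-group "$1" --name "$2" \
    --query "containers[0].instanceView.{state: currentState.state, lastExitCode: previousState.exitCode, lastDetail: previousState.detailStatus, restarts: restartCount}" \
    --output table
}
```

Example call: container_status mygroup mycontainer. For multi-container groups, change the index or query by container name.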

Step 2: Resource Verification

# What you requested (this is what ACI allocates)
az container show --resource-group mygroup --name mycontainer --query "containers[0].resources.requests"
# Optional hard limits (null unless you set them explicitly)
az container show --resource-group mygroup --name mycontainer --query "containers[0].resources.limits"

Step 3: Registry Authentication Test

az acr login --name myregistry
docker pull myregistry.azurecr.io/myapp:latest

Step 4: Container Startup Debug

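# Override the container command to keep it running (separate debug group, same image as Step 3)
az container create --resource-group mygroup --name mycontainer-debug --image myregistry.azurecr.io/myapp:latest --command-line "sleep 3600"
# Exec in and run app manually (use /bin/sh if the image has no bash)
az container exec --resource-group mygroup --name mycontainer-debug --exec-command "/bin/bash"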

Monitoring and Alerting

Essential Metrics

  • Restart Count: Alert when >3 in 10 minutes
  • Memory Usage: Alert when >80% to prevent OOM kills
  • CPU Usage: Track performance degradation over time
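Until a proper alert exists, a restart threshold can also be enforced from outside the group with a small watchdog. This sketch looks at the total restartCount of the first container rather than a 10-minute window, and the name stop_if_restart_looping is hypothetical.

```shell
# Stop the group once the first container's restart count passes a limit.
stop_if_restart_looping() {
  rg=$1
  name=$2
  limit=${3:-3}
  restarts=$(az container show --resource-group "$rg" --name "$name" \
    --query "containers[0].instanceView.restartCount" --output tsv) || return 1
  # Treat empty or non-numeric output (for example a null field) as zero.
  case "$restarts" in
    ''|*[!0-9]*) restarts=0 ;;
  esac
  if [ "$restarts" -gt "$limit" ]; then
    echo "restartCount=$restarts exceeds $limit, stopping $name" >&2
    az container stop --resource-group "$rg" --name "$name"
  fi
}
```

Run from a scheduler every few minutes, for example stop_if_restart_looping mygroup mycontainer 3, this bounds what a crash loop can cost.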

Log Aggregation

  • Default Retention: 7 days only
  • Solution: Send to Log Analytics Workspace
  • Query: Set up alerts for startup failures, not just restarts
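Sending logs to a Log Analytics workspace is configured when the group is created. A minimal sketch follows; the wrapper name create_with_log_analytics and the 1 vCPU / 1.5 GB size are placeholders, and the workspace ID and key come from your own workspace.

```shell
# Create a container group that ships its logs to a Log Analytics workspace.
create_with_log_analytics() {
  rg=$1
  name=$2
  image=$3
  workspace_id=$4
  workspace_key=$5
  az container create --resource-group "$rg" --name "$name" --image "$image" \
    --os-type Linux --cpu 1 --memory 1.5 \
    --log-analytics-workspace "$workspace_id" \
    --log-analytics-workspace-key "$workspace_key"
}
```

Pass the key from a secret store or environment variable rather than typing it on the command line, where it ends up in shell history.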

Production Patterns That Survive

Always-On Container Design

  • State Storage: External only (Redis, blob storage)
  • Database Connections: Connection pooling, not global variables
  • File Storage: Blob storage, not local filesystem

Scale-to-Zero Anti-Pattern

  • Problem: 30-90 second cold start kills user experience
  • Solution: Use Azure Container Apps with min-replicas 1
  • Valid Use Cases: Batch jobs, CI/CD agents, development environments
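The Container Apps alternative keeps one replica warm at all times. A minimal sketch, assuming the containerapp CLI extension is installed and an environment already exists; the wrapper name create_always_on_app, port 8080, and the upper bound of 3 replicas are placeholders.

```shell
# Create a Container App that never scales below one replica.
create_always_on_app() {
  az containerapp create --resource-group "$1" --name "$2" \
    --environment "$3" --image "$4" \
    --min-replicas 1 --max-replicas 3 \
    --ingress external --target-port 8080
}
```

Example call: create_always_on_app mygroup myapp myenv myregistry.azurecr.io/myapp:latest, where myenv is a hypothetical environment name.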

Multi-Region Deployment

# Primary region
az container create --location eastus2
# Backup region
az container create --location westus2

Use Traffic Manager for automatic failover during regional issues
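A priority-routed Traffic Manager profile with two external endpoints is one way to wire up that failover. This is a sketch under assumptions: setup_failover is a hypothetical name, both container groups expose an HTTP health path at / on port 80, and each has a public FQDN (for example from a DNS name label).

```shell
# Create a Traffic Manager profile that prefers the primary endpoint and
# fails over to the backup when health probes on the primary fail.
setup_failover() {
  rg=$1
  profile=$2
  dns_name=$3
  primary_fqdn=$4
  backup_fqdn=$5
  az network traffic-manager profile create --resource-group "$rg" \
    --name "$profile" --routing-method Priority \
    --unique-dns-name "$dns_name" --protocol HTTP --port 80 --path "/" || return 1
  az network traffic-manager endpoint create --resource-group "$rg" \
    --profile-name "$profile" --name primary --type externalEndpoints \
    --target "$primary_fqdn" --priority 1 || return 1
  az network traffic-manager endpoint create --resource-group "$rg" \
    --profile-name "$profile" --name backup --type externalEndpoints \
    --target "$backup_fqdn" --priority 2
}
```

Failover happens at the DNS level, so clients switch only after probes mark the primary unhealthy and cached DNS records expire.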

Breaking Points and Failure Modes

Memory Limits

  • Enforcement: Strict, no warnings before kill
  • Exit Code 137: Kernel memory kill
  • Prevention: 25% overhead rule

Network Dependencies

  • VNet Deployment: Requires NAT gateway for outbound connections
  • Port Mapping: No Docker-style port mapping (container port = exposed port)
  • Azure Files: Unreliable during network issues

Spot Container Evictions

  • Warning Time: 30 seconds
  • Don't Use For: Web apps, databases, long tasks without checkpointing
  • Good For: CI/CD, log processing, development environments
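Checkpointing for evictable work can be sketched with a signal trap. This assumes the platform delivers SIGTERM to the container's main process before stopping it, which should be verified for spot containers. The file path, function names, and the "processing item" placeholder work are all hypothetical.

```shell
# Resume from the last checkpoint and save progress when asked to stop.
# Assumption: SIGTERM arrives before the container is killed.
CHECKPOINT_FILE=${CHECKPOINT_FILE:-/tmp/checkpoint}
last_done=0

save_checkpoint() {
  echo "$last_done" > "$CHECKPOINT_FILE"
}

process_items() {
  total=$1
  # Resume after the last checkpointed item, if a checkpoint exists.
  if [ -f "$CHECKPOINT_FILE" ]; then
    last_done=$(cat "$CHECKPOINT_FILE")
  fi
  trap 'save_checkpoint; exit 143' TERM
  i=$((last_done + 1))
  while [ "$i" -le "$total" ]; do
    echo "processing item $i"   # placeholder for the real work
    last_done=$i
    i=$((i + 1))
  done
  save_checkpoint
}
```

The shell runs the trap only after the current command returns, so each unit of work needs to be short relative to the eviction window. For real jobs, write the checkpoint to external storage rather than the local filesystem.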

Critical Warnings

What Microsoft Docs Don't Tell You

  • ACI availability is unpredictable in smaller regions
  • Image pull failures are common and often not your fault
  • Multi-container groups are fragile and should be avoided
  • Scale-to-zero has terrible user experience for web apps
  • Azure Files mounts fail silently during network issues

Common Misconceptions

  • "ACI is reliable for production web apps" - False, use Container Apps instead
  • "Fractional vCPUs save money" - False, they perform poorly
  • "Container groups are better than separate containers" - False, unless localhost networking required
  • "Default monitoring is sufficient" - False, set up proper alerting

Resource Exhaustion Scenarios

  • Container stuck in restart loop can cost hundreds before detection
  • Pending containers consume quota without providing service
  • Memory leaks cause cascading failures in container groups
  • Regional outages can last hours without alternative regions configured

Useful Links for Further Investigation

Emergency Resources When Everything's Broken

  • Microsoft's ACI Troubleshooting Guide: The error codes are accurate even if the solutions are generic
  • ACI Resource Limits and Quotas: Know these limits before you hit them at 3am
  • Container Group Events and Logs: How to get the real error messages, not the sanitized portal versions
  • Azure CLI Container Commands Reference: All the az container commands you need for emergency debugging
  • Docker Best Practices for ACI: Make your images start faster in ACI
  • Exit Code Reference Guide: What those cryptic exit codes actually mean
  • ACR Authentication Methods: Service principals, managed identities, and which one to use when
  • ACR Troubleshooting Guide: Fix "failed to pull image" errors
  • Azure Managed Identity Setup: Eliminate service principal credential expiration issues
  • Azure Monitor for Containers: Set up monitoring that catches failures before customers do
  • Log Analytics Workspace Setup: Store container logs somewhere you can actually query them
  • Azure Alert Rules for ACI: Alert on restart loops and resource exhaustion
  • Azure Container Apps vs ACI: When to switch to Container Apps for better reliability
  • AKS Serverless Options: Virtual nodes give you ACI simplicity with AKS reliability
  • AWS Fargate Comparison: Sometimes the nuclear option is switching cloud providers
