When your ACI containers are down and you're getting paged at 3am, you don't have time for Microsoft's 47-step troubleshooting flowchart. Here's the methodology that actually works when everything's on fire.
Step 1: Get the Real Error Message (Not the Bullshit Summary)
The Azure portal shows you friendly error messages that are completely useless. Get the actual error details:
az container show --resource-group mygroup --name mycontainer --query "containers[0].instanceView"
This gives you the real container events, not the sanitized portal version. Look for the previousState and currentState sections; that's where the actual errors hide.
What you're looking for: exit codes, specific error messages, and the restartCount. If restartCount keeps climbing, you have a restart loop. If it's stuck at 0 with state: Waiting, your container never started properly.
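If you'd rather not eyeball the JSON at 3am, the two rules above can be sketched as a small shell function. This is a rough sketch, not an official tool: classify_instance and its three labels are names made up for illustration, and it assumes restartCount comes back as a number.

```shell
# classify_instance PREV_COUNT CURR_COUNT STATE
# Compares two restartCount samples taken some time apart, plus the current
# state, and prints one of: restart-loop, never-started, stable.
# (Hypothetical helper; assumes both counts are plain integers.)
classify_instance() {
  prev=$1; curr=$2; state=$3
  if [ "$curr" -gt "$prev" ]; then
    echo "restart-loop"
  elif [ "$curr" -eq 0 ] && [ "$state" = "Waiting" ]; then
    echo "never-started"
  else
    echo "stable"
  fi
}

# Possible usage (needs a logged-in Azure CLI):
#   q='containers[0].instanceView.[restartCount, currentState.state]'
#   set -- $(az container show -g mygroup -n mycontainer --query "$q" -o tsv)
#   first=$1; sleep 60
#   set -- $(az container show -g mygroup -n mycontainer --query "$q" -o tsv)
#   classify_instance "$first" "$1" "$2"
```

Sampling twice is what tells you the count is climbing; a single snapshot only tells you it restarted at some point.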
Step 2: The Resource Check (Because Azure Lies About Availability)
ACI will tell you resources are available in a region, then fail to provision them. Check what you're actually asking for:
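## See what you requested
az container show --resource-group mygroup --name mycontainer --query "containers[0].resources.requests"
## See the limits you set (an upper bound, not proof of what was allocated; often empty)
az container show --resource-group mygroup --name mycontainer --query "containers[0].resources.limits"
## See whether the container group was actually provisioned
az container show --resource-group mygroup --name mycontainer --query "provisioningState"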
If you requested 4 vCPUs and got nothing, try 2 vCPUs. If you requested 8GB RAM and got nothing, try 4GB. ACI availability is unpredictable, especially in smaller regions.
Regional reality check: Some Azure regions just don't have capacity for larger container instances. If you're stuck, try these regions that usually have availability:
- East US 2
- West Europe
- Southeast Asia
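The "ask for less, then try somewhere else" routine can be scripted so you aren't retyping it by hand. Here's a minimal sketch under a few assumptions: try_deploy is a made-up helper, the deploy command is something you supply, sizes are whole numbers, and the region names are the Azure location names for the three regions listed above.

```shell
# try_deploy DEPLOY_FN CPU MEMORY_GB
# Calls DEPLOY_FN REGION CPU MEMORY_GB for each region, halving the size
# after every full pass, and prints the first combination that succeeds.
# (Hypothetical helper; integer sizes only.)
try_deploy() {
  deploy_fn=$1; cpu=$2; mem=$3
  while [ "$cpu" -ge 1 ] && [ "$mem" -ge 1 ]; do
    for region in eastus2 westeurope southeastasia; do
      if "$deploy_fn" "$region" "$cpu" "$mem"; then
        echo "$region $cpu $mem"
        return 0
      fi
    done
    cpu=$((cpu / 2)); mem=$((mem / 2))
  done
  return 1
}

# One way DEPLOY_FN might look (your CLI version may require extra flags):
#   aci_deploy() {
#     az container create --resource-group mygroup --name mycontainer \
#       --image myregistry.azurecr.io/myapp:latest \
#       --location "$1" --cpu "$2" --memory "$3" >/dev/null
#   }
#   try_deploy aci_deploy 4 8
```

This version tries every region at the current size before shrinking. If latency to a specific region matters more than size, swap the two loops.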
Step 3: The Image Pull Investigation
Image pull failures are the most common production issue because there are 15 different ways they can break:
Registry connectivity test:
az acr login --name myregistry
docker pull myregistry.azurecr.io/myapp:latest
If this fails, the image or tag doesn't exist or your own credentials are wrong. If it works, the image is fine and the problem is most likely in how ACI authenticates to the registry.
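That either/or reading of the pull test can be wrapped in a function so the verdict is printed for you. It's only a sketch: interpret_pull and its messages are invented here, and it assumes docker is on your PATH and you've already run az acr login.

```shell
# interpret_pull IMAGE
# Pulls IMAGE locally and prints which side is the likelier culprit.
# (Hypothetical helper; a passing local pull is a hint, not proof.)
interpret_pull() {
  if docker pull "$1" >/dev/null 2>&1; then
    echo "local-pull-ok: suspect ACI-side registry auth"
  else
    echo "local-pull-failed: check image name, tag, and your credentials"
  fi
}

# Possible usage:
#   interpret_pull myregistry.azurecr.io/myapp:latest
```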
Registry authentication debugging:
## Check ACR credentials (only works if the admin user is enabled on the registry)
az acr credential show --name myregistry
## Check managed identity assignment
az container show --resource-group mygroup --name mycontainer --query "identity"
The nuclear option: If nothing else works, copy your image to Docker Hub temporarily. It's not secure for production, but it'll get your service back up while you fix ACR authentication:
docker tag myregistry.azurecr.io/myapp:latest mydockerhubuser/myapp:emergency
docker push mydockerhubuser/myapp:emergency
Step 4: Container Startup Debugging
If your container pulls successfully but crashes immediately, you need to debug the startup process:
Method 1: Override the entrypoint
az container create --resource-group mygroup --name debug-myapp \
  --image myregistry.azurecr.io/myapp:latest \
  --command-line "tail -f /dev/null"
This keeps the container running so you can exec in and debug manually.
Method 2: Check environment variables
az container exec --resource-group mygroup --name debug-myapp --exec-command "env"
Your app might be looking for environment variables that aren't set. Production containers often have different ENV requirements than development.
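To spot the missing variables quickly, you can diff the env output against the names your app expects. A minimal sketch, assuming you've saved the env output to a file; missing_vars is a made-up helper, and DATABASE_URL and API_KEY are placeholder names, not anything ACI defines.

```shell
# missing_vars NAME...
# Reads `env` output (NAME=value lines) on stdin and prints every
# required NAME that is absent. (Hypothetical helper; multi-line
# values in the env output are not handled.)
missing_vars() {
  present=$(cut -d= -f1)
  for name in "$@"; do
    printf '%s\n' "$present" | grep -qxF "$name" || echo "$name"
  done
}

# Possible usage, with the env output pasted into env.txt:
#   missing_vars DATABASE_URL API_KEY < env.txt
```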
Method 3: Run your app manually
az container exec --resource-group mygroup --name debug-myapp --exec-command "/bin/bash"
## Then inside the container
/path/to/your/app
This will show you the actual error your app throws, which is usually more helpful than ACI's generic error messages.
Step 5: The Memory and CPU Reality Check
ACI enforces resource limits strictly. If your app works fine in development but crashes in ACI, you're probably hitting resource limits:
Memory debugging:
## az container exec launches a single process and does not pass arguments, so open a shell first
az container exec --resource-group mygroup --name mycontainer --exec-command "/bin/sh"
## Then inside the container
free -h
## Check what processes are using memory
ps aux --sort=-%mem
The 25% rule: Always request 25% more CPU and memory than your app needs. ACI doesn't handle resource contention gracefully - it just kills containers that exceed limits.
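The arithmetic for the 25% rule is trivial, but it's easy to fumble under pressure. Here's a small sketch that adds the headroom and rounds up to one decimal place, which is the precision the --cpu and --memory flags accept as far as I know. with_headroom is a hypothetical helper name.

```shell
# with_headroom VALUE
# Prints VALUE plus 25%, rounded up to one decimal place.
# (Hypothetical helper; VALUE is vCPUs or GB as a plain number.)
with_headroom() {
  awk -v x="$1" 'BEGIN {
    v = x * 12.5             # x * 1.25, scaled by 10
    c = int(v)
    if (c < v - 1e-9) c++    # round up, tolerating float noise
    printf "%.1f\n", c / 10
  }'
}

# Possible usage: the app peaks at 2 GB and 1 vCPU
#   with_headroom 2    # memory to request, in GB
#   with_headroom 1    # vCPUs to request
```

Rounding up rather than to nearest keeps you on the safe side of the limit, at the cost of paying for slightly more than 25%.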
CPU performance gotcha: Fractional vCPUs (0.5, 1.5) perform poorly when Azure is busy. If your app is CPU-intensive, use whole vCPU numbers (1, 2, 4).
This methodology has saved me countless hours of frustration. Skip the Microsoft docs flowcharts and go straight to what actually identifies the problem.