Docker Exit Code 137: OOM Kill Prevention and Debugging
Critical Context
What Exit Code 137 Means:
- Signal: 128 + 9 (SIGKILL - uncatchable termination)
- Cause: Linux kernel OOM killer terminated container for exceeding memory limits
- Timing: Often occurs during traffic spikes or at 3AM when monitoring is minimal
- Impact: No graceful shutdown, no cleanup, immediate service disruption
Common Production Scenario:
Container runs stably for weeks with a 512MB limit → traffic spike pushes usage to 600MB → instant death with no warning
Configuration That Actually Works
Memory Limit Sizing Formula
Memory limit = (Peak RSS × 1.5) + Runtime overhead
Runtime Overhead Requirements:
- JVM applications: +200-400MB for non-heap memory
- Node.js applications: +50-100MB for V8 overhead
- Go applications: +20-50MB for garbage collection
- Python applications: +30-100MB for interpreter overhead
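Worked example, using hypothetical numbers for a Node.js service: a peak RSS of 300MB gives (300 × 1.5) + 100MB V8 overhead = 550MB, rounded up to 600MB:

```bash
# Apply the computed limit; setting memory-swap equal to memory disables swap,
# which makes OOM behavior predictable instead of slow-then-dead
# (image name is illustrative)
docker run --memory=600m --memory-swap=600m my-node-app
```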
JVM Container Configuration
Critical Problem: older JVMs size the default heap from host memory (e.g., 32GB), not the container limit (e.g., 512MB)
Required Flags:
```bash
# Modern JVMs (Java 11+; container awareness is on by default since Java 10 / 8u191)
-XX:+UseContainerSupport
# Legacy JVMs: hardcode the heap
-Xmx400m  # Leave ~112MB for non-heap in a 512MB container
```
Failure Mode: JVM tries to allocate 8GB heap in 512MB container → immediate OOM kill
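A minimal launch sketch for the modern path, assuming a 512MB container limit (flag values are illustrative, not canonical):

```bash
# Size the heap from the cgroup limit instead of host RAM.
# MaxRAMPercentage=75 caps heap at ~384MB of the 512MB limit,
# leaving headroom for metaspace, threads, and direct buffers.
java -XX:+UseContainerSupport -XX:MaxRAMPercentage=75.0 -jar app.jar
```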
Kubernetes Memory Settings
2025 Best Practice: Set requests = limits to prevent overcommitment
Why This Matters:
- The scheduler uses `requests` for node placement
- The OOM killer respects `limits`
- A gap between them causes unpredictable failures when nodes are overcommitted
```yaml
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "512Mi"  # Same value prevents overcommit
```
Overcommitment Failure: Multiple pods scheduled based on low requests, all hit high limits simultaneously → node memory exhaustion → random pod kills
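In Kubernetes, a past OOM kill shows up in the container's last state; for example (pod name is hypothetical):

```bash
# A restarted container reports Reason: OOMKilled and Exit Code: 137
kubectl describe pod my-pod | grep -A 5 "Last State"
```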
Diagnostic Commands
Confirm OOM Kill
```bash
# Check OOM kill flag (can be misleading for child process kills)
docker inspect --format '{{.State.OOMKilled}}' container_name

# Verify exit code
docker inspect --format '{{.State.ExitCode}}' container_name

# Check system logs for actual OOM events
dmesg | grep -i "killed process"
```
Memory Monitoring
```bash
# Real-time monitoring (includes cache/buffers - can be misleading)
docker stats container_name

# Container-internal memory view
docker exec container_name cat /proc/meminfo
```
Critical Warning: `docker stats` shows cache memory that the kernel can reclaim, while the OOM killer looks at RSS (resident set size) and other committed memory. A container can report 200MB of usage and still die at a 512MB limit.
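To see what the OOM killer actually counts, read the cgroup accounting directly; a sketch assuming cgroup v2 (cgroup v1 exposes rss/cache under /sys/fs/cgroup/memory/memory.stat instead):

```bash
# anon ≈ RSS the kernel cannot reclaim; file = page cache it can drop first
docker exec container_name sh -c 'grep -E "^(anon|file) " /sys/fs/cgroup/memory.stat'
```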
Failure Scenarios and Solutions
Memory Leak vs Memory Spike Detection
Memory Leak Pattern:
- Gradual memory growth over time
- No periodic drops from garbage collection
- Consistent upward trend in monitoring
Memory Spike Pattern:
- Stable baseline usage
- Sudden jumps during traffic/processing
- Returns to baseline after load decreases
Diagnostic Difference: Leaks require application profiling, spikes require better resource planning or rate limiting.
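With the Prometheus stack described later, the distinction is queryable; a sketch assuming cAdvisor's standard metric name and a 6-hour window:

```promql
# A sustained positive slope over hours suggests a leak, not a spike
deriv(container_memory_working_set_bytes{name="container_name"}[6h]) > 0
```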
Child Process OOM Confusion
Scenario: Container shows `OOMKilled: false` but exits with code 137
Cause: A child process was OOM killed; the main process (PID 1) survived the kill but then exited itself, so Docker never set the flag
Detection: Check kernel logs for actual OOM events, not just Docker flags
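For example, on a systemd host (the time window is illustrative):

```bash
# The kernel's OOM report names the exact victim PID and its cgroup
journalctl -k --since "1 hour ago" | grep -iE "out of memory|oom-kill"
```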
False OOM Signals
- Windows containers: OOM behavior differs significantly from Linux
- Child process kills: Docker only flags OOMKilled when PID 1 dies directly
- Memory accounting: docker stats and the kernel OOM killer measure memory differently
Production Prevention Strategies
Application-Level Defense
Backpressure Implementation:
- Reject non-essential operations at 80% memory usage (see the sketch after this list)
- Implement circuit breakers for memory-intensive operations
- Stream data processing instead of loading entire datasets
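A minimal sketch of the 80% check, assuming a Linux container with cgroup v2 at the standard mount (cgroup v1 uses memory.usage_in_bytes / memory.limit_in_bytes under /sys/fs/cgroup/memory):

```python
# Shed non-essential work when container memory crosses a threshold
def memory_pressure() -> float:
    with open("/sys/fs/cgroup/memory.current") as f:
        current = int(f.read())
    with open("/sys/fs/cgroup/memory.max") as f:
        limit = f.read().strip()
    if limit == "max":  # no limit configured
        return 0.0
    return current / int(limit)

def handle_request(essential: bool) -> None:
    if not essential and memory_pressure() > 0.8:
        raise RuntimeError("shedding load: memory above 80% of limit")
    # ... normal processing
```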
Connection Pool Limits:
```python
# Memory-aware pool sizing: cap pooled connections per host
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
session.mount('http://', HTTPAdapter(pool_maxsize=50))
```
Monitoring Setup Requirements
Essential Stack:
- cAdvisor for accurate container metrics collection
- Prometheus for storage and alerting
- Alert at 75% memory usage, panic at 90%
- Monitor memory growth rate, not just absolute usage
Critical Metric: track RSS (working set), not docker stats' cache-inclusive numbers
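As an illustration, a Prometheus alert rule along these lines (metric names are cAdvisor's standard ones; the threshold matches the guidance above):

```yaml
# prometheus-rules.yaml (sketch)
groups:
  - name: container-memory
    rules:
      - alert: ContainerMemoryHigh
        expr: |
          container_memory_working_set_bytes{name!=""}
            / container_spec_memory_limit_bytes{name!=""} > 0.75
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} above 75% of its memory limit"
```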
Container Sizing Methodology
- Baseline Measurement: Run the production workload without limits for 48-72 hours (see the sampling sketch after this list)
- Peak Calculation: Monitor actual RSS usage under realistic load
- Safety Margin: Apply 1.5x multiplier plus runtime overhead
- Load Testing: Verify limits under stress conditions that mimic production spikes
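One way to capture the baseline, assuming docker stats is acceptable for a first pass (cgroup RSS is more precise, as noted above; the log path is illustrative):

```bash
# Sample memory usage every 60s for later peak analysis
while true; do
  docker stats --no-stream --format '{{.Name}} {{.MemUsage}}' >> /var/log/mem-baseline.log
  sleep 60
done
```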
Exit Code Reference
| Code | Meaning | Required Action |
|---|---|---|
| 137 | SIGKILL (often OOM) | Check OOMKilled flag; increase memory or fix the leak |
| 125 | Docker daemon error | Verify Dockerfile syntax and image availability |
| 126 | Command not executable | Fix permissions or the command path |
| 127 | Command not found | Check that the binary exists and PATH is configured |
| 1 | Application error | Review application logs for the specific error |
| 0 | Clean exit | Normal termination (investigate if unexpected) |
Resource Requirements
Time Investment
- Initial Sizing: 2-3 days of monitoring plus load testing
- Production Debugging: 30 minutes to 4 hours depending on complexity
- Monitoring Setup: 1-2 days for complete observability stack
Expertise Requirements
- Basic: Understanding of container memory limits and Docker commands
- Advanced: Knowledge of kernel memory management, cgroups, and runtime-specific behavior
- Expert: Application profiling, custom metrics, and distributed system debugging
Breaking Points
- 1000+ concurrent containers: Standard monitoring tools may become inadequate
- Multi-GB containers: Require careful node sizing and network considerations
- High-frequency allocations: May need custom memory management strategies
Critical Warnings
What Documentation Doesn't Tell You
- Docker stats memory reporting includes reclaimable cache
- OOMKilled flag only set when PID 1 dies directly
- Older JVMs' default heap sizing ignores container limits (container awareness is on by default only since Java 10 / 8u191)
- Kubernetes scheduler and OOM killer use different memory values
- Child process OOM kills don't trigger container restart policies
Hidden Costs
- Memory Overcommitment: Appears to save resources but causes unpredictable failures
- Insufficient Monitoring: Delayed detection leads to extended outages
- Undersized Containers: Create cascade failures during traffic spikes
- Legacy Runtime Defaults: Most runtimes ignore container memory limits without explicit configuration
Migration Pain Points
- Kubernetes 1.20+: Changes in memory accounting affect existing deployments
- Docker Desktop vs Production: Different memory management behavior
- Cloud Platform Differences: AWS ECS, Azure Container Apps, GCP Cloud Run have platform-specific OOM handling
This knowledge enables automated detection of memory pressure, proper container sizing, and prevention of production OOM kills through systematic monitoring and application-level defensive programming.
Useful Links for Further Investigation
Essential Resources for Docker Memory Debugging
| Link | Description |
|---|---|
| Docker Resource Constraints | Complete guide to memory, CPU, and other resource limits. Essential reading for understanding how Docker implements cgroups and memory accounting. |
| Docker Runtime Metrics | Official documentation for docker stats and container monitoring. Explains what each metric actually measures and when it's useful. |
| Docker Container Run Reference | Complete reference for docker run, including exit codes. The section on exit status codes explains what 137, 125, and other codes mean. |
| GitHub Issue: Process OOM within Docker Container | Detailed issue about OOM behavior differences between Windows and Linux containers. Shows how the OOMKilled flag can be misleading. |
| Kubernetes Issue: Container marked OOMKilled when non-init process dies | Explains why containers sometimes show OOMKilled: false even with exit code 137. Important for understanding child process kills. |
| Stack Overflow: How to detect Docker memory limit reached | Practical commands for checking whether a container was OOM killed. Community answers with working examples. |
| Kubernetes Issue: pods getting terminated with Exit Code 137 | Recent issue showing how memory pressure affects Kubernetes pods. Good example of a production troubleshooting process. |
| cAdvisor Container Monitoring | Google's container advisor for collecting runtime metrics. Essential for production monitoring; provides accurate memory usage data. |
| Prometheus Container Metrics | How to set up Prometheus monitoring for Docker containers using cAdvisor. Includes example queries for memory alerts. |
| Advanced Container Monitoring Guide | Comprehensive guide to using docker stats effectively. Covers real-time monitoring and automated alerting strategies. |
| Kubernetes OOMKilled Troubleshooting | Complete guide to handling OOM kills in Kubernetes. Covers requests vs limits, proper sizing, and monitoring setup. |
| Memory Resource Management Best Practices | 2025 best practices for Kubernetes memory management. Recommends setting requests = limits to prevent overcommitment. |
| Azure Container Apps Exit Code 137 | Platform-specific troubleshooting for Azure Container Apps. Shows how cloud platforms handle container OOM kills. |
| Tracking Down Invisible OOM Kills | Advanced debugging for when child processes get OOM killed but Kubernetes doesn't detect it. Essential for complex applications. |
| Docker Container Memory Leak Detection | Comprehensive guide to detecting and fixing memory leaks in containerized applications. Includes monitoring setup and debugging techniques. |
| OOM Killer Deep Dive | Technical explanation of how the Linux OOM killer works and how it interacts with container orchestration platforms. |
| JVM Container Support | Oracle's documentation on JVM container awareness. Critical for Java applications that need to respect container memory limits. |
| Node.js Memory Management in Containers | Official Node.js documentation for memory-related command-line options. Essential for sizing Node.js containers correctly. |
| Go Memory Management | Go's garbage collector guide. Helps in understanding memory patterns in Go applications running in containers. |
| Docker System Commands | When everything is broken, these commands can help. docker system prune and related cleanup commands for desperate times. |
| Kubernetes Exit Codes Reference | Official Kubernetes guide to debugging pod failures. Includes exit code meanings and troubleshooting steps. |
| Container Exit Codes Complete Guide | Comprehensive reference for all container exit codes. Bookmark this for 3AM debugging sessions. |