Exit code 137 is Docker's way of telling you the Linux kernel just killed your container, most often because it tried to eat more memory than you allocated. The number comes from 128 + 9, where 9 is the signal number for SIGKILL - the nuclear option that can't be caught or ignored.
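You can reproduce the mechanics in a throwaway container - a minimal sketch, assuming you can pull python:3.12-slim (any image that can allocate a chunk of memory works), with an arbitrary 256MB allocation against a 64MB limit:
# Give the container 64MB and no swap headroom, then allocate ~256MB inside it
docker run --rm --memory=64m --memory-swap=64m python:3.12-slim \
  python -c "x = b'a' * (256 * 1024 * 1024)"
# The kernel sends SIGKILL; docker run reports the container's exit code
echo $?
The echo prints 137: 128 plus signal 9.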
The Real-World Scenario
Picture this: You're running a Node.js app in production with --memory=512m because that seemed reasonable. Everything works fine for weeks. Then at 3:17 AM on a Tuesday, your monitoring starts screaming. Your container died with exit code 137.
What happened? Your app hit a traffic spike, loaded more data into memory, and suddenly needed 600MB. The Linux kernel's OOM killer said "nope" and killed it instantly. No graceful shutdown, no cleanup, just dead. This is a common production scenario that catches teams off guard.
How to Confirm It's Actually an OOM Kill
Don't guess. Check if Docker flagged it as an OOM kill:
# Check if container was OOM killed
docker inspect --format '{{.State.OOMKilled}}' container_name
# See the actual exit code
docker inspect --format '{{.State.ExitCode}}' container_name
# Get container logs to see what happened before death
docker logs --tail=50 container_name
If OOMKilled is true and exit code is 137, you found your culprit. Be aware that the OOMKilled flag can be misleading or stay false even when memory was the cause, especially on Windows containers or when a child process gets killed instead of the main process.
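When the flag can't be trusted, the host's kernel log is the source of truth - assuming you have shell access to the Docker host (this won't work from inside the container):
# Look for OOM killer activity in the kernel log on the host
sudo dmesg -T | grep -i "out of memory"
# Or, on systemd hosts, search recent kernel messages
sudo journalctl -k --since "1 hour ago" | grep -i -E "oom|killed process"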
Memory Usage vs Memory Limits: The Gotcha
Here's what trips up most people: Docker's memory reporting can include cache and buffers, while the OOM killer cares about memory the kernel can't reclaim - primarily RSS (resident set size), the memory your process actually claims. This accounting difference causes plenty of confusion during debugging.
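While the container is still alive, you can read the kernel's own accounting for its cgroup - a sketch assuming a cgroup v2 host (on cgroup v1 the file is /sys/fs/cgroup/memory/memory.stat and the fields are named rss and cache):
# Anonymous memory (roughly RSS) vs reclaimable page cache for this container's cgroup
docker exec container_name cat /sys/fs/cgroup/memory.stat | grep -E "^(anon|file) "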
Use docker stats to see real-time usage:
# Live memory monitoring
docker stats container_name
# One-time snapshot
docker stats --no-stream
The memory column shows current usage vs limit. If you're consistently hitting 80%+ of your limit, you're playing Russian roulette with the OOM killer. For more detailed monitoring, consider using cAdvisor or other monitoring tools.
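For a quick check of how close each container sits to its limit, docker stats can print just those columns:
# Name, usage vs limit, and percentage of the limit for every running container
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"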
The JVM Memory Trap
Java applications are notorious for this. The JVM allocates heap space based on available system memory, not container limits. If your container has 512MB but the host has 32GB, the JVM might try to allocate 8GB of heap and instantly die. This is a well-documented issue in containerized environments.
Fix it by setting JVM flags to respect container limits:
# Container detection is on by default in modern JVMs (Java 10+ / 8u191+); this flag makes it explicit
-XX:+UseContainerSupport
# Or cap the heap explicitly
-Xmx400m # Leave room for non-heap memory (metaspace, threads, native buffers)
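One way to wire this in is through the environment at container start - a sketch with a placeholder image name; JAVA_TOOL_OPTIONS is read automatically by the JVM, and MaxRAMPercentage sizes the heap relative to the container limit instead of host RAM:
# Heap capped at ~75% of the 512MB limit, leaving room for metaspace, threads, and native buffers
docker run --memory=512m \
  -e JAVA_TOOL_OPTIONS="-XX:MaxRAMPercentage=75.0" \
  my-java-app   # placeholder image name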
Same problem exists with other runtimes. Node.js with --max-old-space-size, Python with memory pools, Go with garbage collection - they all need to know about your container's memory constraints. .NET applications have similar considerations.
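A couple of hedged equivalents for other runtimes (image and file names are placeholders; GOMEMLIMIT needs Go 1.19+):
# Node.js: cap V8's old-space heap below the container limit
docker run --memory=512m node:20-slim node --max-old-space-size=384 app.js
# Go: set a soft memory limit so the GC works harder before the kernel steps in
docker run --memory=512m -e GOMEMLIMIT=400MiB my-go-app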
Kubernetes Makes It More Complicated
In Kubernetes, you set both requests and limits. The OOM killer respects limits, but Kubernetes scheduling uses requests. This creates a dangerous gap that leads to unpredictable OOM kills.
If you set requests: 256Mi and limits: 512Mi, Kubernetes might schedule your pod on a node assuming it needs 256Mi. But if it actually uses 512Mi and the node is overcommitted, multiple pods can hit OOM simultaneously. This is explained in detail in the official Kubernetes documentation.
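To see whether a restarted pod actually died this way, the reason is recorded on the container status - a sketch assuming a pod named my-pod with a single container:
# Prints OOMKilled if the previous container instance was killed by the OOM killer
kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# Or eyeball it in the human-readable output
kubectl describe pod my-pod | grep -i oomkilled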
Current best practice is setting requests = limits for memory to avoid this surprise. The Kubernetes community increasingly recommends this approach for production workloads.
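One way to apply that without hand-editing YAML is kubectl set resources - a sketch assuming a Deployment named my-app:
# Pin memory requests and limits to the same value for every container in the Deployment
kubectl set resources deployment my-app --requests=memory=512Mi --limits=memory=512Mi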
Memory Leaks vs Memory Spikes
Exit code 137 from a memory leak looks different from exit code 137 after a traffic spike. Leaks show gradual memory growth in monitoring until sudden death. Spikes show stable usage followed by an immediate jump. Proper monitoring helps distinguish between these patterns.
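If you don't have metrics wired up yet, even a crude sampling loop makes the pattern obvious - a throwaway sketch, assuming a container named container_name:
# Log usage once a minute: a leak climbs steadily, a spike jumps from a flat baseline
while true; do
  echo "$(date +%s) $(docker stats --no-stream --format '{{.MemUsage}}' container_name)" >> mem.log
  sleep 60
done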
Real GitHub issue: Astro builds were failing with exit code 137 because too many concurrent image optimizations pushed memory usage over limits. Not a leak - just bad resource planning. Similar issues appear across different platforms and applications.
The fix was either increasing memory limits or throttling concurrent operations. Sometimes the answer isn't "give it more memory" but "make it use memory more efficiently." Production debugging techniques help identify the root cause.
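For the "give it more memory" route the change is just a bigger limit on the build container; throttling concurrency is tool-specific, so that part lives in your build tool's own settings. A sketch with placeholder image and command names, not the actual Astro fix:
# Raise the limit for the memory-hungry build step
docker run --rm --memory=4g my-build-image npm run build
# Optionally cap the Node heap below the limit so the build fails with a stack trace instead of a silent SIGKILL
docker run --rm --memory=4g -e NODE_OPTIONS=--max-old-space-size=3584 my-build-image npm run build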