Your pod just died with exit code 137 again. kubectl describe shows Reason: OOMKilled, and you're about to double the memory limit and hope for the best. Don't. I've been down this road - it just delays the inevitable crash until the next traffic spike hits.
OOMKilled happens when a process in your container blows past the cgroup memory limit (or the node itself runs out of memory) and the Linux kernel's OOM killer terminates it. Period. No negotiation, no warnings, just dead. The kernel doesn't care about your business logic or graceful shutdown.
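If you want to see the number the kernel actually enforces, it's sitting in the container's cgroup. Paths differ between cgroup v1 and v2, so treat this as a sketch:
## Memory limit the OOM killer enforces (cgroup v2 path, then the v1 fallback)
kubectl exec -it <pod-name> -- cat /sys/fs/cgroup/memory.max 2>/dev/null || \
  kubectl exec -it <pod-name> -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes
## "max" means no limit is set; otherwise this is the byte count your container dies at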
Personal Experience: I once spent 6 hours debugging "OOMKilled" pods that turned out to be hitting the PID limit, not memory. The error message lies sometimes, so don't trust it blindly.
The Two Types of OOM Deaths That Will Fuck Up Your Sleep
Type 1: The Obvious Kill - Pod Dies Screaming
This is the OOMKilled error everyone recognizes. Your pod status shows OOMKilled, the container's last state shows exit code 137, and your restart count is climbing faster than your blood pressure. Check the pod status and container states for confirmation.
## Check for obvious OOMKilled errors (you'll run this 50 times)
kubectl get pods | grep -E "(OOMKilled|Error|137)"
kubectl describe pod <pod-name> | grep -A 5 -B 5 "OOMKilled"
kubectl logs <pod-name> --previous | tail -50
## Pro tip: --previous shows logs from the dead container, not the new one
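If you'd rather pull the exact reason and exit code instead of grepping describe output, jsonpath against the container status works too (index 0 assumes a single-container pod):
## Reason and exit code straight from the container's last state
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'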
Type 2: The Invisible Kill - Your App Dies But Kubernetes Doesn't Give a Shit
This is the invisible OOM kill nightmare that makes debugging absolute hell. A child process inside your container gets murdered, but PID 1 stays alive, so Kubernetes thinks everything's peachy. The container runtime has no visibility into this.
I learned this the hard way: Spent 8 hours debugging "healthy" pods that were actually dead inside. Child processes were getting OOM killed while the main process kept running like nothing happened.
Symptoms of invisible kills:
- Application becomes unresponsive randomly
- Error rates spike without pod restarts
- Performance degrades over time
- Memory usage spikes, then drops back and stays flat (the killed child's memory got freed)
How to detect invisible kills:
## Check kernel logs on the node (requires node access)
## See: https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-usage-monitoring/
journalctl --utc -k | grep -i "killed process"
journalctl --utc -k | grep -i "out of memory"
## Alternative: dmesg | grep -i "killed process"
## Check for memory pressure on nodes
## Reference: https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/
kubectl describe nodes | grep -A 10 -B 10 "MemoryPressure"
## Also check: kubectl get nodes -o wide
## Monitor memory usage patterns
kubectl top pod <pod-name> --containers
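If you can't SSH into the node, kubectl debug can drop you into a throwaway pod on it instead (kubectl 1.20+ and suitable RBAC assumed; the /host mount is how the node debug pod exposes the host filesystem by default):
## Get a shell on the node without SSH
kubectl debug node/<node-name> -it --image=busybox
## The host filesystem is mounted at /host inside the debug pod
chroot /host journalctl -k | grep -i "killed process"
## Remember to delete the node-debugger pod it leaves behind when you're done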
The Memory Debugging Process That Actually Works (Instead of Guessing)
Most teams throw darts at memory limits. Here's how to not be one of those teams:
Step 1: Figure Out If It's Actually OOM (It Usually Isn't What You Think)
## Get the full picture of what actually happened
## Reference: https://kubernetes.io/docs/tasks/debug/debug-application/debug-pods/
kubectl describe pod <pod-name> | grep -A 20 "Last State"
kubectl get events --sort-by='.lastTimestamp' | grep <pod-name>
## Events expire after an hour by default (kube-apiserver --event-ttl), so if you're late to the party, tough shit
## See: https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/
## Check actual vs requested memory usage (spoiler: they're never the same)
## Resource docs: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
kubectl describe pod <pod-name> | grep -A 5 -B 5 "Limits"
kubectl top pod <pod-name> # Shows current usage, not the spike that killed it
## Requires metrics-server: https://github.com/kubernetes-sigs/metrics-server
What you're actually looking for:
- Last State: Terminated with Reason: OOMKilled (the smoking gun)
- Memory usage near the limits (but remember, spikes don't show up in kubectl top)
- Recent events about memory pressure (if they haven't rotated away yet)
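Because kubectl top is a point-in-time sample, a dumb polling loop is often enough to catch the spike when the kill happens again (the interval and log path here are arbitrary):
## Leave this running and check the trail after the next OOMKill
while true; do
  echo "$(date)  $(kubectl top pod <pod-name> --containers --no-headers)" >> /tmp/mem-trail.log
  sleep 15
done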
Step 2: Figure Out What's Actually Eating Your Memory
## Get memory breakdown inside the container (if it's still alive)
## Linux memory docs: https://www.kernel.org/doc/Documentation/filesystems/proc.txt
kubectl exec -it <pod-name> -- cat /proc/meminfo | head -10  # shows the node's memory, not your container's limit
kubectl exec -it <pod-name> -- ps aux --sort=-%mem | head -20
## Alternative: kubectl exec -it <pod-name> -- top -o %MEM
## Check for memory-mapped files eating space (you'd be surprised)
## Memory mapping info: https://man7.org/linux/man-pages/man5/proc.5.html
kubectl exec -it <pod-name> -- sh -c 'find /proc/*/maps -exec grep -l "rw-" {} \; 2>/dev/null | wc -l'
## See also: kubectl exec -it <pod-name> -- sh -c 'cat /proc/*/smaps | grep -i Rss'
## Note the sh -c: without it the /proc/* glob expands on your laptop, not in the container
## Monitor memory over time (you'll get bored after 5 minutes and ctrl+c)
kubectl exec -it <pod-name> -- free -h -s 5
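The kernel's own accounting for the container lives in the cgroup files, and unlike ps it includes page cache and kernel memory charged to the cgroup. Paths below are cgroup v2; adjust for v1 as noted:
## Current charged usage and a rough breakdown, as the kernel sees it
kubectl exec -it <pod-name> -- cat /sys/fs/cgroup/memory.current
kubectl exec -it <pod-name> -- sh -c 'grep -E "^(anon|file|kernel|slab) " /sys/fs/cgroup/memory.stat'
## cgroup v1 equivalents: memory/memory.usage_in_bytes and memory/memory.stat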
Step 3: Profile Memory Usage Patterns
Different applications have different memory patterns:
Java Applications (The Usual Suspects):
For comprehensive Java memory debugging, check the official JVM documentation and Kubernetes Java memory best practices.
## Check heap usage and GC (prepare for disappointment)
kubectl exec -it <pod-name> -- sh -c 'jstat -gc $(pgrep java) 5s'
## Get a heap dump for analysis (warning: this will freeze your app for 30+ seconds)
kubectl exec -it <pod-name> -- sh -c 'jcmd $(pgrep java) GC.heap_dump /tmp/heap.hprof'
kubectl exec -it <pod-name> -- sh -c 'jcmd $(pgrep java) GC.heap_info'
## The sh -c wrapper matters - otherwise $(pgrep java) runs on your machine, not in the pod
## Don't run this during peak traffic unless you enjoy angry customers
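It's also worth confirming the JVM sized its heap from the container limit rather than from the node's RAM; this sketch assumes a container-aware JDK (8u191+ or 11+):
## What max heap did the JVM actually settle on?
kubectl exec -it <pod-name> -- sh -c 'jcmd $(pgrep java) VM.flags' | tr ' ' '\n' | grep -E 'MaxHeapSize|MaxRAMPercentage'
## If MaxHeapSize looks like a quarter of the node instead of a quarter of your limit, set -XX:MaxRAMPercentage explicitly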
Node.js Applications (Memory Leak Central):
Node.js memory debugging requires understanding V8 memory management and Node.js performance debugging.
## Enable memory profiling (requires app restart, because of course it does)
kubectl exec -it <pod-name> -- node --inspect=0.0.0.0:9229 /app/index.js &
## Use Chrome DevTools to connect and profile - assuming your network config isn't fucked
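To actually reach that inspector from your laptop, port-forward it (9229 matching the --inspect flag above):
## Forward the inspector port, then attach via chrome://inspect or your IDE
kubectl port-forward <pod-name> 9229:9229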
Python Applications (Surprise Memory Hogs):
Python memory issues are well-documented in the memory profiling guide and Python memory management docs.
## Install memory profiler (assuming pip still works in your locked-down container)
kubectl exec -it <pod-name> -- python -m memory_profiler your_script.py
## Check for too many objects in memory
## Caveat: python -c starts a fresh interpreter, so run this snippet from inside your
## app (a debug endpoint or attached shell) to count the app's objects, not the one-liner's
kubectl exec -it <pod-name> -- python -c "import gc; print(len(gc.get_objects()))"
## A count that keeps climbing between checks is the real leak signal, not any magic threshold
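For a quick read on what the running Python process itself is holding, /proc is always there even when pip isn't (matching on the process name "python" and having procps in the image are assumptions):
## RSS of the longest-running python process, straight from the kernel
kubectl exec -it <pod-name> -- sh -c 'grep VmRSS /proc/$(pgrep -o python)/status'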
Real War Stories From The OOMKilled Trenches
War Story #1: The Database Connection Pool From Hell
Our Spring Boot API kept getting murdered every few hours. Memory limit was 2GB, JVM heap was only 1.5GB, but something else was eating 500MB+. Spent 6 hours looking at heap dumps before I realized the problem wasn't on the heap.
Turns out HikariCP was configured for 50 connections, and each connection was caching 10MB of result sets. 50 × 10MB = 500MB of off-heap memory that didn't show up in any JVM monitoring.
Fix: Cut connection pool to 20 and capped result set cache size. Problem solved in 10 minutes once I found the actual issue.
War Story #2: When Winston Tried to Log Everything to Memory
Node.js app went from 200MB to 2GB+ over 48 hours. No obvious leaks in our code - we checked everything twice. Spent two days profiling before finding the real culprit.
Winston was configured to buffer 50,000+ log entries before flushing. Each log entry was ~1KB. Do the math: 50MB+ just sitting in a buffer, growing forever.
Fix: Set buffer size to 1,000 entries max. Memory usage dropped back to 200MB instantly. Sometimes the simplest fixes are the ones you overlook.
War Story #3: The Invisible ArgoCD Massacre
ArgoCD pods looked healthy but apps kept showing "Out of Sync" randomly. No crashes, no restarts, no obvious issues. Took me a week to figure out what was happening.
Turns out helm processes were getting OOM killed during large deployments, but the main ArgoCD process stayed alive. The repo-server had a 128Mi limit, but helm needed 192Mi for complex charts.
Fix: Bumped memory to 256Mi. Moral of the story: invisible OOM kills are the worst kind of debugging nightmare.
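If you want the one-liner version of that fix, kubectl set resources does it in place (standard Argo CD install names assumed; in a real GitOps setup you'd change the manifest and let ArgoCD roll it out):
## Bump the repo-server memory limit to 256Mi
kubectl -n argocd set resources deployment argocd-repo-server --limits=memory=256Mi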
This stuff actually works. I've used these techniques to stop getting paged at 3AM about dead pods. The key is understanding that OOMKilled isn't always what it seems - sometimes you need to dig deeper to find what's really killing your containers.