Four different systems, four different numbers, one dead pod
Here's what happened to me last month: production API showing 400MB usage in Datadog, 500MB in kubectl top, 650MB in Prometheus, but got OOMKilled at a 512MB limit. Which number do you trust? None of them - they're all measuring different things, and none of them shows the number that actually kills your pod.
The OOM killer doesn't give a damn about your pretty dashboards. It counts everything that gets charged to the cgroup - every byte of memory-mapped files, every network buffer, every piece of kernel memory allocated on behalf of your container. Your monitoring? It's mostly showing RSS memory and calling it a day.
I spent two solid days debugging this before I realized `kubectl top` was basically useless for OOM debugging. It samples every 15-30 seconds and only shows physical memory that's currently in RAM. Miss that 5-second memory spike during garbage collection? Too bad, your pod's dead and you'll never see what killed it in the metrics.
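If you want to confirm how coarse that sampling is on your own cluster, check the metrics-server scrape interval. This assumes the standard metrics-server Deployment in kube-system; managed clusters sometimes run it differently:
## kubectl top data comes from metrics-server, which scrapes kubelet
## once per --metric-resolution (15s by default) - anything shorter
## than that can spike and vanish without ever being sampled
kubectl -n kube-system get deployment metrics-server -o jsonpath='{.spec.template.spec.containers[0].args}'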
What Actually Counts Toward Your Memory Limit
Your app thinks it's using 400MB. `kubectl top` agrees. But the OOM killer counted 900MB. Here's what it saw that you didn't:
Memory-mapped files: Your app loads some huge JSON config file using mmap() - maybe 180MB, could be more. It doesn't show up in heap monitoring, but it counts toward your limit. The first time the app actually touches those pages, they get charged to your cgroup - and suddenly you're way over.
Kernel socket buffers: Had a service holding thousands of open connections - a few dozen KB of kernel buffer per connection doesn't sound like much, but it adds up to hundreds of MB that never shows up in any application monitoring.
Java non-heap memory: JVM metaspace, code cache, direct memory, compressed class space. Your 1GB heap might have 500MB of additional JVM memory that you've never monitored.
Node.js hidden memory: V8 external memory, Buffer pools, native addon memory. `process.memoryUsage()` shows heap stats and a rough `external` figure, and still misses a big chunk of what the kernel charges to your container (the sketch below shows how to check a few of these).
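None of these show up in one dashboard, but you can poke at most of them from inside the pod. A rough sketch, assuming the image ships a shell and grep, the node is on cgroup v2 (on v1, read /sys/fs/cgroup/memory/memory.stat and the mapped_file field instead), and PID 1 is your app:
## Memory-mapped file pages charged to the container
kubectl exec pod -- grep "^file_mapped " /sys/fs/cgroup/memory.stat
## Kernel socket buffer memory charged to the container
kubectl exec pod -- grep "^sock " /sys/fs/cgroup/memory.stat
## JVM non-heap breakdown - only works if the app was started with
## -XX:NativeMemoryTracking=summary and the image has a JDK, not just a JRE
kubectl exec pod -- jcmd 1 VM.native_memory summary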
The only way to see what the OOM killer sees:
## cgroup v2 paths; on cgroup v1 nodes read
## /sys/fs/cgroup/memory/memory.usage_in_bytes and /sys/fs/cgroup/memory/memory.limit_in_bytes
kubectl exec pod -- cat /sys/fs/cgroup/memory.current
kubectl exec pod -- cat /sys/fs/cgroup/memory.max
Everything else is just guessing. This is documented in the cgroup memory documentation, but most Kubernetes troubleshooting guides skip this crucial detail. The Red Hat memory management guide explains these differences in detail, and the Linux memory statistics documentation covers the underlying mechanisms.
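One wrinkle most of those guides skip: which path exists depends on whether the node runs cgroup v1 or v2, so a tiny wrapper saves guessing. A minimal sketch - the pod name is a placeholder and the container needs a shell:
POD="your-pod"
kubectl exec "$POD" -- sh -c '
  if [ -f /sys/fs/cgroup/memory.current ]; then
    echo "cgroup v2"
    echo "usage: $(cat /sys/fs/cgroup/memory.current)"
    echo "limit: $(cat /sys/fs/cgroup/memory.max)"
  else
    echo "cgroup v1"
    echo "usage: $(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)"
    echo "limit: $(cat /sys/fs/cgroup/memory/memory.limit_in_bytes)"
  fi
'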
All the different ways your memory gets counted
Linux memory is a clusterfuck of different layers that all eat into the same limit: anonymous memory (your actual heaps and stacks), page cache and memory-mapped files, and kernel memory like socket buffers, slab, and kernel stacks - plus whatever your runtime allocates off-heap on top of all that.
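The closest thing to one view of all those layers is the cgroup's own accounting. A quick dump of the big categories (cgroup v2 field names; on v1 the file is /sys/fs/cgroup/memory/memory.stat and some names differ):
kubectl exec pod -- grep -E "^(anon|file|file_mapped|kernel_stack|slab|sock|shmem) " /sys/fs/cgroup/memory.stat
## anon - process heaps and stacks, roughly what RSS-based monitoring shows
## file - page cache for files the container touched
## file_mapped - mmap()ed files, like the JSON catalog in the next section
## kernel_stack, slab, sock - kernel memory charged to the container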
Real Example: The JSON File That Killed Production
Had an e-commerce API that kept dying every few hours. Monitoring showed 700MB usage with 1GB limits - should be fine, right? Wrong.
The app was memory-mapping these massive JSON catalog files during product updates. Monitoring only saw RSS memory (the 700MB). The kernel saw RSS + memory-mapped files = way over the limit.
Took me three fucking days and probably 50 Stack Overflow tabs to figure out I needed to check what the kernel actually sees:
kubectl exec pod -- cat /proc/1/smaps | grep -A 5 catalog
## Size: lines adding up to ~320MB of mapped catalog files, not showing up anywhere in monitoring
kubectl exec pod -- cat /sys/fs/cgroup/memory/memory.stat
## rss 734003200 ← monitoring showed this (700MB)
## mapped_file 335544320 ← monitoring completely ignored this (320MB)
## Total: ~1020MB against a 1GB limit - the next allocation during a catalog update tipped it over
Fixed it by streaming the JSON instead of memory-mapping the whole thing. Three outages to figure out that literally all our monitoring was lying to us.
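Worth verifying a fix like that against the same counter that caught the problem - watch the mapped-file stat while a product update runs and make sure it stays flat. A minimal sketch (cgroup v1 field name, matching the node above; on v2 it's file_mapped in /sys/fs/cgroup/memory.stat):
## Sample the mapped-file counter every 5 seconds during a catalog update
while true; do
  kubectl exec pod -- grep "^mapped_file " /sys/fs/cgroup/memory/memory.stat
  sleep 5
done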
cgroup v2 Broke My Stable Workloads
Upgraded to Kubernetes 1.31 and suddenly pods that ran fine for months started getting OOMKilled. Same code, same limits, same everything - except the new node images defaulted to cgroup v2, which counts memory differently.
cgroup v2 is "more accurate" which is a polite way of saying it counts a bunch of shit that v1 ignored. Your 800MB pod that was totally stable before? Now it's using 850MB+ because kernel stack memory and network buffers suddenly count toward your limit.
## Check if you're on cgroup v2
kubectl exec pod -- cat /sys/fs/cgroup/cgroup.controllers
## If this file exists, you're on v2
## See the extra memory v2 counts
kubectl exec pod -- cat /sys/fs/cgroup/memory.stat | grep -E "kernel|sock|slab"
## kernel_stack 4194304 ← 4MB now counted (was free in v1)
## sock 8192000 ← 8MB network buffers now counted
## slab 125829120 ← 120MB kernel memory now counted
Translation: your pods need 10-15% higher memory limits once an upgrade moves your nodes to cgroup v2. I found this out during a cluster upgrade that took down half our goddamn services at once.
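If you'd rather bump the headroom before the upgrade bites, you can do it per workload. A sketch with a hypothetical deployment name, adding roughly 15% on top of an existing 800Mi limit:
## 800Mi + ~15% ≈ 920Mi of headroom for the kernel_stack/sock/slab memory now being charged
kubectl set resources deployment my-api --limits=memory=920Mi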
The 15-Second Gap That Kills Your Pods
Your monitoring samples every 15-30 seconds. Memory spikes last 5 seconds. Guess what happens?
10:15:00 - 512MB (monitoring sample: looks fine)
10:15:05 - 1.2GB spike during garbage collection (no sample)
10:15:06 - OOMKilled
10:15:15 - Pod restarting (next sample shows restart, not spike)
You'll never see the spike in your dashboards. The pod just "randomly" dies.
Here's how to catch the spikes that kill your pods:
POD_NAME="your-dying-pod"
## cgroup v2 paths; on v1 nodes use memory/memory.usage_in_bytes and memory/memory.limit_in_bytes
## assumes the container actually has a limit (memory.max is a number, not "max")
LIMIT=$(kubectl exec $POD_NAME -- cat /sys/fs/cgroup/memory.max)
while kubectl get pod $POD_NAME > /dev/null 2>&1; do
  MEM=$(kubectl exec $POD_NAME -- cat /sys/fs/cgroup/memory.current)
  PERCENT=$(( MEM * 100 / LIMIT ))
  echo "$(date '+%H:%M:%S'): ${PERCENT}%"
  [ "$PERCENT" -gt 95 ] && echo "SPIKE DETECTED!"
  sleep 1  ## each kubectl exec already takes a few hundred ms, so this gives ~1-2s resolution
done
Run this before your pod dies and you'll actually see what kills it.
Cloud Provider Memory Overhead (aka Hidden Taxes)
AWS EKS taxes you 50-200MB per pod for VPC CNI networking, instance metadata service, and CloudWatch agents. They don't mention this shit in the pricing calculator, of course.
Google GKE Autopilot forces memory limits based on your requests. Request 256MB, get 512MB limit automatically. But then Stackdriver monitoring eats 40-120MB of that without asking.
Azure AKS has similar CNI overhead plus whatever the hell the Azure Monitor agent feels like consuming that week.
Your 512MB pod suddenly needs 700MB limits on managed Kubernetes. Factor this in or watch your pods die mysterious deaths.
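You can see part of that tax per node before a single pod schedules - compare Capacity with Allocatable (the provider's DaemonSet agents then eat into what's left):
## Capacity = what the VM has, Allocatable = what's actually left for your pods
kubectl describe node <node-name> | grep -A 6 -E "^(Capacity|Allocatable):"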
Quick Diagnostic Commands That Actually Work
Skip the fancy monitoring. Use these when shit hits the fan:
## See what the OOM killer sees (cgroup v2; on v1 read memory/memory.usage_in_bytes and memory/memory.limit_in_bytes)
kubectl exec pod -- cat /sys/fs/cgroup/memory.current
kubectl exec pod -- cat /sys/fs/cgroup/memory.max
## Find big memory-mapped files (common hidden memory) - smaps sizes are in kB, so 6+ digits means 100MB+
kubectl exec pod -- cat /proc/1/smaps | grep -E -B 1 -A 3 "^Size: +[0-9]{6,} kB"
## Check for memory pressure before OOM (cgroup v2 only)
kubectl exec pod -- cat /sys/fs/cgroup/memory.pressure
## Get the real memory breakdown
kubectl exec pod -- cat /proc/1/status | grep -E "VmRSS|VmSize"
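And once a pod has already died, confirm it really was the OOM killer before you start digging (pod name is a placeholder):
## Reason: OOMKilled with Exit Code 137 = the kernel killed it at the cgroup limit
kubectl describe pod <pod> | grep -A 5 "Last State"
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'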
Stop guessing. Start measuring what actually matters.