Understanding OOMKilled: When Kubernetes Murders Your Pods For Using Too Much RAM

[Image: Kubernetes OOMKilled process]

Your pod just died with exit code 137 again. The logs show Reason: OOMKilled and you're about to double the memory limit and hope for the best. Don't. I've been down this road - it just delays the inevitable crash until your traffic spikes hit.

OOMKilled happens when your container gets too hungry for memory and the Linux kernel kills it. Period. No negotiation, no warnings, just dead. The kernel doesn't care about your business logic or graceful shutdown.

Personal Experience: I once spent 6 hours debugging "OOMKilled" pods that turned out to be hitting the PID limit, not memory. The error message lies sometimes, so don't trust it blindly.
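If you want to watch the failure happen somewhere that isn't production, here's a minimal repro sketch - the pod name is made up and polinux/stress is just a commonly used public stress image (swap in whatever your registry allows). The container allocates more than its limit, the kernel kills it, and you get exit code 137 with Reason: OOMKilled.

## Minimal repro: a pod that allocates more memory than its limit
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: oom-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: stress
    image: polinux/stress      # public stress image; swap for whatever your registry allows
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "256M", "--vm-hang", "1"]
    resources:
      limits:
        memory: "128Mi"
EOF

## Watch it die with exit code 137 / Reason: OOMKilled
kubectl get pod oom-test -w
kubectl describe pod oom-test | grep -A 5 "Last State"

## Clean up
kubectl delete pod oom-test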

The Two Types of OOM Deaths That Will Fuck Up Your Sleep

Type 1: The Obvious Kill - Pod Dies Screaming

This is the OOMKilled error everyone recognizes. Your pod status shows OOMKilled, the container's last state shows exit code 137, and your restart count is climbing faster than your blood pressure. Check the pod status and container states for confirmation.

## Check for obvious OOMKilled errors (you'll run this 50 times)
kubectl get pods | grep -E "(OOMKilled|Error|137)"
kubectl describe pod <pod-name> | grep -A 5 -B 5 "OOMKilled"
kubectl logs <pod-name> --previous | tail -50
## Pro tip: --previous shows logs from the dead container, not the new one
Type 2: The Invisible Kill - Your App Dies But Kubernetes Doesn't Give a Shit

This is the invisible OOM kill nightmare that makes debugging absolute hell. A child process inside your container gets murdered, but PID 1 stays alive, so Kubernetes thinks everything's peachy. The container runtime has no visibility into this.

I learned this the hard way: Spent 8 hours debugging "healthy" pods that were actually dead inside. Child processes were getting OOM killed while the main process kept running like nothing happened.

Symptoms of invisible kills:

  • Application becomes unresponsive randomly
  • Error rates spike without pod restarts
  • Performance degrades over time
  • Memory usage stays flat after spikes

How to detect invisible kills:

## Check kernel logs on the node (requires node access)
## See: https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-usage-monitoring/
journalctl --utc -k | grep -i "killed process"
journalctl --utc -k | grep -i "out of memory"
## Alternative: dmesg | grep -i "killed process"

## Check for memory pressure on nodes
## Reference: https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/
kubectl describe nodes | grep -A 10 -B 10 "MemoryPressure"
## Also check: kubectl get nodes -o wide

## Monitor memory usage patterns
kubectl top pod <pod-name> --containers

The Memory Debugging Process That Actually Works (Instead of Guessing)

Most teams throw darts at memory limits. Here's how to not be one of those teams:

Step 1: Figure Out If It's Actually OOM (It Usually Isn't What You Think)
## Get the full picture of what actually happened
## Reference: https://kubernetes.io/docs/tasks/debug/debug-application/debug-pods/
kubectl describe pod <pod-name> | grep -A 20 "Last State"
kubectl get events --sort-by='.lastTimestamp' | grep <pod-name>
## Events get rotated every hour, so if you're late to the party, tough shit
## See: https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/

## Check actual vs requested memory usage (spoiler: they're never the same)
## Resource docs: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
kubectl describe pod <pod-name> | grep -A 5 -B 5 "Limits"
kubectl top pod <pod-name>  # Shows current usage, not the spike that killed it
## Requires metrics-server: https://github.com/kubernetes-sigs/metrics-server

What you're actually looking for:

  • Terminated with Reason: OOMKilled (the smoking gun)
  • Memory usage near limits (but remember, spikes don't show up in kubectl top)
  • Recent events about memory pressure (if they haven't rotated away yet)
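If you'd rather pull the smoking gun straight out of the pod status instead of grepping describe output, a jsonpath query does it - a quick sketch assuming a single-container pod (index 0), so adjust for sidecars:

## Pull the terminated reason and exit code straight from the pod status
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{" exit code: "}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'

## Restart count, to see how often this is happening
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].restartCount}{"\n"}'
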
Step 2: Figure Out What's Actually Eating Your Memory
## Get memory breakdown inside the container (if it's still alive)
## Linux memory docs: https://www.kernel.org/doc/Documentation/filesystems/proc.txt
kubectl exec -it <pod-name> -- cat /proc/meminfo | head -10
kubectl exec -it <pod-name> -- ps aux --sort=-%mem | head -20
## Alternative: kubectl exec -it <pod-name> -- top -o %MEM

## Check for memory-mapped files eating space (you'd be surprised)
## Memory mapping info: https://man7.org/linux/man-pages/man5/proc.5.html
kubectl exec -it <pod-name> -- find /proc/*/maps -exec grep -l "rw-" {} \; 2>/dev/null | wc -l
## See also: kubectl exec -it <pod-name> -- cat /proc/*/smaps | grep -i rss

## Monitor memory over time (you'll get bored after 5 minutes and ctrl+c)
kubectl exec -it <pod-name> -- free -h -s 5
Step 3: Profile Memory Usage Patterns

Different applications have different memory patterns:

Java Applications (The Usual Suspects):

For comprehensive Java memory debugging, check the official JVM documentation and Kubernetes Java memory best practices.

## Check heap usage and GC (prepare for disappointment)
## Wrap in sh -c so $(pgrep java) runs inside the container, not on your laptop
kubectl exec -it <pod-name> -- sh -c 'jstat -gc $(pgrep java) 5s'

## Get heap dump for analysis (warning: this will freeze your app for 30+ seconds)
kubectl exec -it <pod-name> -- sh -c 'jcmd $(pgrep java) GC.heap_dump /tmp/heap-dump.hprof'
kubectl exec -it <pod-name> -- sh -c 'jcmd $(pgrep java) GC.heap_info'
## Don't run this during peak traffic unless you enjoy angry customers

Node.js Applications (Memory Leak Central):

Node.js memory debugging requires understanding V8 memory management and Node.js performance debugging.

## Enable memory profiling (requires app restart, because of course it does)
kubectl exec -it <pod-name> -- node --inspect=0.0.0.0:9229 /app/index.js &
## Use Chrome DevTools to connect and profile - assuming your network config isn't fucked

Python Applications (Surprise Memory Hogs):

Python memory issues are well-documented in the memory profiling guide and Python memory management docs.

## Install memory profiler (assuming pip still works in your locked-down container)
kubectl exec -it <pod-name> -- python -m memory_profiler your_script.py

## Check for too many objects in memory
## Note: this starts a fresh interpreter; to count objects in the running app, expose the same check via a debug endpoint
kubectl exec -it <pod-name> -- python -c "import gc; print(len(gc.get_objects()))"
## If that number is over 100k inside your app, you probably have a leak
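Whatever the language, it's also worth asking the cgroup itself, since that's the accounting the kernel actually enforces. A rough sketch that tries the cgroup v2 files first and falls back to v1 paths (which of the two you see depends on the node):

## What the cgroup (and therefore the kernel) thinks: usage, limit, and OOM events
kubectl exec -it <pod-name> -- sh -c '
if [ -f /sys/fs/cgroup/memory.current ]; then
  # cgroup v2
  echo "usage: $(cat /sys/fs/cgroup/memory.current)  limit: $(cat /sys/fs/cgroup/memory.max)"
  grep oom /sys/fs/cgroup/memory.events
else
  # cgroup v1
  echo "usage: $(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)  limit: $(cat /sys/fs/cgroup/memory/memory.limit_in_bytes)"
  echo "failcnt: $(cat /sys/fs/cgroup/memory/memory.failcnt)"
fi'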

Real War Stories From The OOMKilled Trenches

War Story #1: The Database Connection Pool From Hell

Our Spring Boot API kept getting murdered every few hours. Memory limit was 2GB, JVM heap was only 1.5GB, but something else was eating 500MB+. Spent 6 hours looking at heap dumps before I realized the problem wasn't on the heap.

Turns out HikariCP was configured for 50 connections, and each connection was caching 10MB of result sets. 50 × 10MB = 500MB of off-heap memory that didn't show up in any JVM monitoring.

Fix: Cut connection pool to 20 and capped result set cache size. Problem solved in 10 minutes once I found the actual issue.

War Story #2: When Winston Tried to Log Everything to Memory

Node.js app went from 200MB to 2GB+ over 48 hours. No obvious leaks in our code - we checked everything twice. Spent two days profiling before finding the real culprit.

Winston was configured to buffer 50,000+ log entries before flushing. Each log entry was ~1KB. Do the math: 50MB+ just sitting in a buffer, growing forever.

Fix: Set buffer size to 1,000 entries max. Memory usage dropped back to 200MB instantly. Sometimes the simplest fixes are the ones you overlook.

War Story #3: The Invisible ArgoCD Massacre

ArgoCD pods looked healthy but apps kept showing "Out of Sync" randomly. No crashes, no restarts, no obvious issues. Took me a week to figure out what was happening.

Turns out helm processes were getting OOM killed during large deployments, but the main ArgoCD process stayed alive. The repo-server had a 128Mi limit, but helm needed 192Mi for complex charts.

Fix: Bumped memory to 256Mi. Moral of the story: invisible OOM kills are the worst kind of debugging nightmare.

This stuff actually works. I've used these techniques to stop getting paged at 3AM about dead pods. The key is understanding that OOMKilled isn't always what it seems - sometimes you need to dig deeper to find what's really killing your containers.

Actually Useful Memory Debugging (When kubectl top Lies to You)

[Image: Kubernetes memory monitoring]

Basic kubectl top tells you memory is high, but it's a lying piece of shit. It shows current usage, not the spike that killed your pod 30 seconds ago. When you need to find what's actually eating your memory, you need better tools than the built-in garbage.

Deep Memory Analysis with Ephemeral Containers (When Your Cluster Admin Actually Enables Them)

Ephemeral containers are awesome for debugging - assuming your security team allows them (they don't). Stable since K8s 1.25+ and widely supported in modern clusters (1.30+), they let you inject debugging tools without rebuilding images. Check the security implications first.

Reality check: Most corporate environments disable ephemeral containers for "security reasons." If yours does, skip to the next section and cry a little.

## Add a debugging container (if your security policies allow it, lol)
## Reference: https://kubernetes.io/docs/tasks/debug/debug-application/debug-running-pod/
kubectl debug <pod-name> -it --image=nicolaka/netshoot --target=<container-name>

## Once inside the ephemeral container, you have access to:
## - top, htop for real-time process monitoring
## - pmap for memory mapping analysis (see: man pmap)  
## - valgrind for memory leak detection (slow as fuck) - https://valgrind.org/docs/manual/mc-manual.html
## - strace for system call tracing (prepare for spam) - https://man7.org/linux/man-pages/man1/strace.1.html
## - /proc filesystem for detailed memory stats - https://man7.org/linux/man-pages/man5/proc.5.html

Actually useful memory analysis inside ephemeral containers:

## Analyze memory maps for the main process
pmap -x $(pgrep -f "your-app-name") | sort -nk3
## This shows you what's actually using memory, not just heap size

## Track memory allocations over time (you'll get bored after 10 minutes)
while true; do
  echo "$(date): $(cat /proc/meminfo | grep MemAvailable)"
  ps aux --sort=-%mem | head -5
  sleep 30
done

## Find processes with suspicious memory usage
for pid in $(pgrep -f "your-app"); do
  echo "PID $pid memory usage:"
  cat /proc/$pid/status | grep -E "VmPeak|VmSize|VmRSS|VmData"
done

Language-Specific Memory Profiling (Because Every Language Leaks Memory Differently)

Java/JVM Memory Deep Dive (Off-Heap Memory Will Fuck You)

[Image: Java memory structure]

Java memory issues are usually off-heap problems that don't show up in heap dumps. Your heap looks fine, but something else is eating 2GB of RAM. Check the JVM memory model and container memory limits:

## Get complete memory breakdown (assuming you can run ephemeral containers)
## JDK tools: https://docs.oracle.com/javase/8/docs/technotes/tools/unix/jcmd.html
kubectl debug <pod-name> -it --image=openjdk:11 --target=<container-name>

## Inside ephemeral container:
jcmd $(pgrep java) GC.heap_info  # The money shot
jcmd $(pgrep java) GC.class_histogram | head -50
jstat -gc $(pgrep java) 250ms 40  # Watch GC activity

## Off-heap memory analysis (this is where your memory went)
jcmd $(pgrep java) VM.native_memory summary
## If this command fails, your JVM wasn't started with -XX:NativeMemoryTracking=summary

Common Java memory issues that will ruin your day:

References: Oracle JVM Troubleshooting Guide, OpenJDK Memory Guide, JVM Container Best Practices

  1. DirectByteBuffer leaks - NIO operations not releasing off-heap buffers (fuck Netty)
  2. Metaspace exhaustion - Too many classes loaded (looking at you, Spring Boot with 500 dependencies)
  3. Code cache overflow - JIT compiler cache full, performance goes to shit
  4. Connection pool bloat - HikariCP holding 50 connections × 10MB each = 500MB gone
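If one of these turns out to be your problem, you can usually cap it without rebuilding the image. A hedged sketch using JAVA_TOOL_OPTIONS, which the JVM picks up automatically - the deployment and pod names are placeholders and the sizes are starting points to tune, not recommendations:

## Cap the usual off-heap suspects via env (triggers a rolling restart)
kubectl set env deployment/<deployment-name> \
  JAVA_TOOL_OPTIONS="-XX:MaxDirectMemorySize=256m -XX:MaxMetaspaceSize=256m -XX:ReservedCodeCacheSize=128m -XX:NativeMemoryTracking=summary"

## Confirm the flags landed (should list the flags you just set)
kubectl exec -it <pod-name> -- sh -c 'jcmd $(pgrep java) VM.flags | tr " " "\n" | grep -E "Direct|Metaspace|CodeCache"'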

Java memory debugging script (that actually works):

#!/bin/bash
## Save this as memory-debug.sh and run inside ephemeral container

JAVA_PID=$(pgrep java)
if [ -z "$JAVA_PID" ]; then
  echo "No Java process found, probably already dead"
  exit 1
fi

echo "=== Java Memory Analysis for PID $JAVA_PID ==="

echo "1. Heap Usage:"
jcmd $JAVA_PID GC.run && jcmd $JAVA_PID GC.heap_info

echo "2. Top Memory-Consuming Classes (the usual suspects):"
jcmd $JAVA_PID GC.class_histogram | head -20

echo "3. Off-Heap Usage (where your memory actually went):"
jcmd $JAVA_PID VM.native_memory summary | grep -E "Total|Java Heap|Class|Thread"

echo "4. GC Performance (how fucked are you?):"
jstat -gc $JAVA_PID | awk 'NR==2 {print "Eden used:", $6"KB", "Old used:", $8"KB", "Total GC time:", $NF"s"}'
Node.js Memory Profiling (Welcome to Callback Hell)

Node.js memory leaks are usually closures holding onto massive objects, event listeners that never get removed, or some asshole loading a 50MB JSON file into memory. See the Node.js memory best practices and V8 memory management:

## Enable memory profiling (requires app restart because Node.js is special)
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","args":["--inspect=0.0.0.0:9229","--max-old-space-size=1024","/app/index.js"]}]}}}}'

## Connect to Node.js inspector (assuming your network doesn't hate you)
kubectl debug <pod-name> -it --image=node:18 --target=<container-name>

## Generate heap snapshot (this will freeze your app for 10+ seconds)
node -e "
const v8 = require('v8');
const fs = require('fs');
const snapshot = v8.writeHeapSnapshot();
console.log('Heap snapshot written to', snapshot);
" 
## Warning: heap snapshots are usually 100MB+ and will crash your laptop

Node.js memory leak detection:

// Add this to your application for memory monitoring
setInterval(() => {
  const used = process.memoryUsage();
  console.log({
    timestamp: new Date().toISOString(),
    rss: Math.round(used.rss / 1024 / 1024) + 'MB',
    heapTotal: Math.round(used.heapTotal / 1024 / 1024) + 'MB', 
    heapUsed: Math.round(used.heapUsed / 1024 / 1024) + 'MB',
    external: Math.round(used.external / 1024 / 1024) + 'MB'
  });
}, 30000);
Python Memory Profiling (Global Interpreter Lock Can't Save You Now)

Python memory issues are usually pandas DataFrames eating all your RAM, circular references that never get garbage collected, or some ML library hoarding memory like a dragon:

## Install memory profiling tools (assuming pip still works in your container)
kubectl debug <pod-name> -it --image=python:3.9 --target=<container-name>

## Inside ephemeral container:
pip install memory-profiler pympler psutil  # This will take forever

## Profile memory usage by line (prepare to wait)
python -m memory_profiler your_script.py

## Find what's hogging memory
python -c "
import gc
import collections
counter = collections.Counter()
for obj in gc.get_objects():
    counter[type(obj).__name__] += 1
print('Top memory hogs:', counter.most_common(20))
"

Detecting Invisible OOM Kills

The most frustrating memory issue is when processes die inside containers without Kubernetes knowing:

Node-Level OOM Detection
## Access worker node (method varies by cluster setup)
## busybox has no journalctl; use an image that does, then chroot into the host filesystem
kubectl debug node/<node-name> -it --image=ubuntu
chroot /host

## Inside node debugging container:
## Check kernel logs for OOM kills
journalctl --utc -k --since "1 hour ago" | grep -i "killed process"

## Look for memory cgroup violations
dmesg | grep -i "memory cgroup out of memory"

## Find which pods are causing OOM pressure
find /sys/fs/cgroup/memory -name "memory.oom_control" -exec sh -c '
  for f; do
    if [ -f "$f" ]; then
      echo "Cgroup: $f"
      cat "$f" | grep -E "oom_kill|under_oom"
    fi
  done
' _ {} \;
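The find above assumes cgroup v1. On cgroup v2 nodes (the default on most recent distros) memory.oom_control doesn't exist; per-cgroup OOM kill counts live in memory.events instead. A rough sketch - depending on the runtime you may only see part of the hierarchy from a debug container, so fall back to chroot /host or SSH if nothing shows up:

## cgroup v2: look for non-zero oom_kill counters
find /sys/fs/cgroup -name memory.events 2>/dev/null | while read -r f; do
  kills=$(awk '/^oom_kill /{print $2}' "$f")
  [ -n "$kills" ] && [ "$kills" -gt 0 ] && echo "$f: oom_kill=$kills"
done
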
Monitoring Memory Pressure Indicators
## Check for memory pressure on nodes
kubectl get nodes -o custom-columns='NAME:.metadata.name,MEMORY-PRESSURE:.status.conditions[?(@.type=="MemoryPressure")].status'

## Get detailed memory statistics
kubectl describe nodes | grep -A 20 "Allocated resources"

## Monitor pod memory usage over time
while true; do
  echo "$(date): Memory usage by pod:"
  kubectl top pods --sort-by=memory | head -10
  sleep 60
done

Memory Leak Detection Patterns

Pattern 1: Gradual Memory Growth
## Monitor memory growth over 24 hours
for i in $(seq 1 144); do  # 144 * 10min = 24 hours
  echo "$(date): $(kubectl top pod <pod-name> | grep <pod-name>)"
  sleep 600  # 10 minutes (this will take forever and probably fail halfway through)
done > memory-growth.log
Pattern 2: Spike-and-Hold Memory Pattern
## Detect sudden memory spikes that don't recover
while true; do
  current_mem=$(kubectl top pod <pod-name> --no-headers | awk '{print $3}' | sed 's/Mi//')
  if [ "$current_mem" -gt 500 ]; then  # Alert if over 500MB
    echo "ALERT: $(date) - Memory spike detected: ${current_mem}Mi"
    kubectl describe pod <pod-name> | grep -A 10 "Events:"
  fi
  sleep 30
done
Pattern 3: Memory Fragmentation Issues
## Check for memory fragmentation (inside ephemeral container)
cat /proc/buddyinfo  # Shows memory fragmentation levels
cat /proc/pagetypeinfo | grep -A 3 "Unmovable"

Production Memory Monitoring Setup

For ongoing OOMKilled prevention, implement comprehensive monitoring:

## Example Prometheus monitoring rules for OOMKilled detection
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: oomkilled-alerts
spec:
  groups:
  - name: memory.rules
    rules:
    - alert: PodOOMKilled
      # kube_pod_container_status_restarts_total has no "reason" label; join with last_terminated_reason
      expr: increase(kube_pod_container_status_restarts_total[5m]) > 0 and on(namespace, pod, container) kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} was OOMKilled"
    
    - alert: HighMemoryUsage
      expr: container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.8 and container_spec_memory_limit_bytes > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} memory usage above 80%"

These advanced profiling techniques help you understand exactly what's consuming memory in your containers. The next section covers how to prevent these memory issues from happening in the first place.

Stop OOMKilled Before It Ruins Your Sleep

[Image: Grafana monitoring dashboard]

Debugging OOMKilled at 3AM sucks. Getting paged because your payment API is dead costs more than just setting proper limits in the first place. Here's how to build apps that don't randomly die from memory issues.

Real talk: I used to get paged 2-3 times a week for OOM issues. After implementing these strategies, I maybe get paged once a month. Your sleep schedule will thank you.

Stop Guessing Memory Limits (Yes, Everyone Does It)

Most teams set memory limits by throwing darts at a board, doubling when stuff breaks, and hoping for the best. This approach wastes money and still gets you paged when traffic spikes. Here's how to actually do it right:

Memory Sizing That Actually Works

Step 1: Actually Measure Memory Usage (Revolutionary Concept)

## Profile memory under normal load for 7 days (if you have the patience)
## Requires metrics-server: https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-usage-monitoring/
while true; do
  echo "$(date +%s),$(kubectl top pod <pod-name> --no-headers | awk '{print $3}' | sed 's/Mi//')" >> memory-baseline.csv
  sleep 300  # one sample every 5 minutes
done

## Analyze the data (basic math, but most people skip this)
awk -F',' '{sum+=$2; if($2>max) max=$2} END {
  print "Average memory: " sum/NR "MB"
  print "Peak memory: " max "MB"
  print "Recommended limit: " max*1.5 "MB"
}' memory-baseline.csv
## Spoiler: your peak is always 3x your average

Step 2: Load Test or Cry Later

## Use k6, hey, or whatever load testing tool doesn't suck
## Monitor memory while throwing traffic at your app

## Gradual ramp-up test (because going 0-100 will just break everything)
for rps in 10 50 100 200 500; do
  echo \"Testing $rps RPS - hope your app survives\"
  hey -z 300s -q $rps https://api.github.com/users/octocat &
  sleep 60
  kubectl top pod <pod-name> | tee -a load-test-memory.log
  sleep 60  # Time to panic if memory usage looks bad
done

Step 3: Memory Math That Actually Works

Base Memory = What your app needs just to start (minimum to not immediately crash)
Working Memory = Memory during normal operation (when users aren't being assholes)
Spike Buffer = Extra memory for when everything goes wrong (50-100% more)
Garbage Collection = Language tax (JVM: 25%, Node.js: 30%, Go: 15%, Python: 40%)

Total Limit = (Base + Working + Spike Buffer) × (1 + GC Overhead)
## Then add 20% because your estimates are always wrong

Example Java application sizing (that won't get you paged):

## Base: 200MB (JVM startup - Spring Boot is a memory hog)
## Working: 400MB (application + libraries + the 47 dependencies you didn't know about)
## Spike: 200MB (50% buffer because traffic spikes are real)
## GC: 25% overhead (Java's garbage collection tax)
## Total: (200 + 400 + 200) × 1.25 = 1000MB limit
## Reality: management says "can't you just use 512MB?"
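The same math as a throwaway shell calculation, if you'd rather not do it on a napkin - the numbers are the example above, not recommendations:

## Plug in your own measurements (MB) and GC overhead for your runtime
BASE=200; WORKING=400; SPIKE=200; GC_OVERHEAD=25
LIMIT=$(( (BASE + WORKING + SPIKE) * (100 + GC_OVERHEAD) / 100 ))
echo "Recommended memory limit: ${LIMIT}MB (add ~20% if you don't trust your numbers)"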

Implementing Quality of Service (QoS) Classes

Kubernetes QoS classes determine which pods get killed first during memory pressure. Understanding and using QoS strategically prevents critical workloads from being OOM killed. See the resource management docs for details.

QoS Class Configuration Examples

Guaranteed QoS (Highest Priority):

apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-pod
spec:
  containers:
  - name: app
    image: myapp:latest
    resources:
      requests:
        memory: "1Gi"
        cpu: "500m"
      limits:
        memory: "1Gi"    # Same as requests
        cpu: "500m"      # Same as requests

Burstable QoS (Medium Priority):

apiVersion: v1
kind: Pod  
metadata:
  name: burstable-pod
spec:
  containers:
  - name: app
    image: myapp:latest
    resources:
      requests:
        memory: "512Mi"
        cpu: "250m"
      limits:
        memory: "1Gi"    # Higher than requests
        cpu: "1000m"     # Higher than requests

Strategic QoS Usage:

  • Critical services (databases, payment APIs): Guaranteed QoS
  • Web applications: Burstable QoS with conservative requests
  • Batch jobs, dev environments: BestEffort or high-limit Burstable
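Kubernetes records the QoS class it assigned in the pod status, so you can verify you actually got Guaranteed instead of Burstable (Guaranteed requires every container's requests to equal its limits for both CPU and memory):

## Show the QoS class Kubernetes assigned to a pod
kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}{"\n"}'

## Or list it for everything in the namespace
kubectl get pods -o custom-columns=NAME:.metadata.name,QOS:.status.qosClass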

Application-Level Memory Management

Garbage Collection Optimization

Java/JVM Applications:

See JVM container best practices and OpenJDK container awareness.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: java-app
spec:
  template:
    spec:
      containers:
      - name: app
        image: openjdk:11
        env:
        # Optimize GC for container environments
        # JAVA_TOOL_OPTIONS is picked up by the JVM automatically (JAVA_OPTS only works if your entrypoint forwards it)
        # -XX:+UseCGroupMemoryLimitForHeap was a JDK 8 experimental flag; -XX:MaxRAMPercentage is the supported replacement
        - name: JAVA_TOOL_OPTIONS
          value: >
            -XX:+UseG1GC
            -XX:MaxGCPauseMillis=200
            -XX:MaxRAMPercentage=75
            -XX:+HeapDumpOnOutOfMemoryError
            -XX:HeapDumpPath=/tmp/heap-dump.hprof
        resources:
          requests:
            memory: "1Gi"
          limits:
            memory: "1.5Gi"

Node.js Applications:

References: Node.js performance best practices and V8 memory optimization.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nodejs-app
spec:
  template:
    spec:
      containers:
      - name: app
        image: node:18
        command: 
        - node
        - --max-old-space-size=1024  # Set heap limit explicitly
        - --optimize-for-size         # Reduce memory footprint
        - index.js
        resources:
          requests:
            memory: "512Mi"
          limits:
            memory: "1.5Gi"
Memory-Efficient Coding Practices

Connection Pool Management:

Connection pooling guidance: HikariCP best practices, Database connection patterns.

## Example database connection configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: db-config
data:
  database.properties: |
    # Conservative connection pool settings (10, not 100)
    # Note: inline "#" comments aren't valid in .properties files, so keep comments on their own lines
    hikari.maximumPoolSize=10
    hikari.minimumIdle=5
    # maxLifetime 10 minutes, idleTimeout 5 minutes
    hikari.maxLifetime=600000
    hikari.idleTimeout=300000

    # Enable connection leak detection (1 minute)
    hikari.leakDetectionThreshold=60000

Caching Strategy Implementation:

Caching references: Redis configuration guide, Memcached tuning, Kubernetes caching patterns.

apiVersion: v1
kind: ConfigMap  
metadata:
  name: cache-config
data:
  redis.conf: |
    # Memory-efficient Redis configuration
    maxmemory 256mb
    maxmemory-policy allkeys-lru
    
    # Reduce memory overhead  
    hash-max-ziplist-entries 512
    hash-max-ziplist-value 64
    list-max-ziplist-size 8
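Once Redis is running, INFO memory tells you whether the cap is actually doing its job - usage against maxmemory, plus how many keys the LRU policy has evicted (the pod name is a placeholder):

## Check Redis memory usage, configured limit, and eviction activity
kubectl exec -it <redis-pod-name> -- redis-cli info memory | grep -E "used_memory_human|maxmemory_human|maxmemory_policy"
kubectl exec -it <redis-pod-name> -- redis-cli info stats | grep evicted_keys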

Monitoring and Alerting for Proactive Prevention

Prometheus Metrics

## References: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
## Kubernetes monitoring: https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-usage-monitoring/
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: memory-prevention-alerts
spec:
  groups:
  - name: memory-alerts
    interval: 30s
    rules:
    
    # Alert when memory usage approaches limit
    - alert: MemoryUsageHigh
      expr: container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.85 and container_spec_memory_limit_bytes > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Memory usage high for {{ $labels.pod }}"
        description: "Pod {{ $labels.pod }} is using {{ $value | humanizePercentage }} of memory limit"
    
    # Alert on memory growth rate  
    - alert: MemoryGrowthRate
      expr: delta(container_memory_working_set_bytes[30m]) > 10485760  # more than 10MB growth over 30min
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: \"Potential memory leak in {{ $labels.pod }}\"
    
    # Node memory pressure detection
    - alert: NodeMemoryPressure
      expr: kube_node_status_condition{condition="MemoryPressure", status="true"} == 1
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Node {{ $labels.node }} under memory pressure"
Vertical Pod Autoscaler (VPA) Configuration
## VPA documentation: https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler
## Installation guide: https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler#installation
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: memory-optimization-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: your-app
  updatePolicy:
    updateMode: "Auto"  # Or "Off" for recommendations only
  resourcePolicy:
    containerPolicies:
    - containerName: app
      maxAllowed:
        memory: "8Gi"    # Prevent runaway scaling
      minAllowed:
        memory: "256Mi"  # Minimum viable memory
      controlledResources: ["memory"]
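Whether you run it in Auto or Off mode, the recommendations land in the VPA status, and reading them before trusting them is usually the sane move (names match the example above):

## See what VPA would set (target, lower bound, upper bound)
kubectl describe vpa memory-optimization-vpa | grep -A 20 "Recommendation"

## Or pull just the target values
kubectl get vpa memory-optimization-vpa -o jsonpath='{.status.recommendation.containerRecommendations[*].target}{"\n"}'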

Cluster-Level Memory Management

Resource Quotas for Namespace-Level Control
## Resource quota docs: https://kubernetes.io/docs/concepts/policy/resource-quotas/
## Limit ranges: https://kubernetes.io/docs/concepts/policy/limit-range/
apiVersion: v1
kind: ResourceQuota
metadata:
  name: memory-quota
  namespace: production
spec:
  hard:
    requests.memory: "100Gi"    # Total memory requests
    limits.memory: "200Gi"      # Total memory limits
    pods: "50"                  # Maximum pods
    
## Ensure every pod has resource limits
---
apiVersion: v1
kind: LimitRange  
metadata:
  name: memory-limit-range
  namespace: production
spec:
  limits:
  - default:        # Default limits
      memory: "1Gi"
    defaultRequest: # Default requests
      memory: "512Mi"
    max:           # Maximum allowed
      memory: "8Gi"
    min:           # Minimum required
      memory: "128Mi"
    type: Container
Node Selector and Affinity for Memory-Intensive Workloads
## Node affinity docs: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/
## Resource management: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
apiVersion: apps/v1
kind: Deployment
metadata:
  name: memory-intensive-app
spec:
  template:
    spec:
      # Schedule on high-memory nodes
      nodeSelector:
        node-type: "memory-optimized"
      
      # Avoid co-location with other memory-heavy pods  
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                type: "memory-intensive"
            topologyKey: kubernetes.io/hostname
      
      containers:
      - name: app
        image: memory-heavy-app:latest
        resources:
          requests:
            memory: "4Gi"
          limits:
            memory: "8Gi"

Emergency Response Procedures

When prevention fails, have automated responses ready:

Automatic Pod Cycling During High Memory
apiVersion: v1
kind: ConfigMap
metadata:
  name: memory-monitor-script
data:
  monitor.sh: |
    #!/bin/bash
    # kubectl top reports memory in Mi, not percent, so compare against an absolute threshold
    THRESHOLD_MI=1500

    # The CronJob already schedules this every 5 minutes, so a single pass is enough
    kubectl top pods --no-headers | while read -r pod_line; do
      POD=$(echo "$pod_line" | awk '{print $1}')
      MEMORY_MI=$(echo "$pod_line" | awk '{print $3}' | sed 's/Mi//')

      if [ "$MEMORY_MI" -gt "$THRESHOLD_MI" ]; then
        echo "WARNING: $POD memory at ${MEMORY_MI}Mi - cycling pod"
        kubectl delete pod "$POD"
        sleep 30  # Allow time for new pod to start
      fi
    done

---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: memory-monitor
spec:
  schedule: "*/5 * * * *"  # Every 5 minutes
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: monitor
            image: bitnami/kubectl:latest
            command: ["bash", "/scripts/monitor.sh"]
            volumeMounts:
            - name: script
              mountPath: /scripts
          volumes:
          - name: script
            configMap:
              name: memory-monitor-script
          restartPolicy: OnFailure
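One thing the manifest above glosses over: kubectl inside that CronJob only works if its ServiceAccount is allowed to read pod metrics and delete pods. A minimal RBAC sketch under those assumptions - the names are placeholders, and auto-deleting pods on a threshold is a policy decision your team should sign off on, not a default:

## Minimal RBAC so the CronJob's kubectl calls actually work
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: memory-monitor
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: memory-monitor
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "delete"]
- apiGroups: ["metrics.k8s.io"]   # kubectl top reads from the metrics API
  resources: ["pods"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: memory-monitor
subjects:
- kind: ServiceAccount
  name: memory-monitor
  namespace: default              # match the namespace the CronJob runs in
roleRef:
  kind: Role
  name: memory-monitor
  apiGroup: rbac.authorization.k8s.io
EOF

## Then add serviceAccountName: memory-monitor to the CronJob's pod spec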

These prevention strategies work together to create a memory-resilient Kubernetes environment. The key is implementing multiple layers: proper sizing, QoS configuration, application optimization, monitoring, and automated responses. This comprehensive approach prevents most OOMKilled errors before they impact production.

OOMKilled FAQ: The Shit You're Actually Googling at 3AM

Q

why does my pod keep dying kubectl top shows low memory wtf???

A

This happens because kubectl top shows current memory usage, not peak usage. Your pod hit the memory limit temporarily, got killed, and restarted with lower memory usage. Debug this:

## Check for memory spikes in pod events
kubectl describe pod <pod-name> | grep -A 20 Events

## Look for resource limits vs actual usage
kubectl describe pod <pod-name> | grep -A 10 -B 5 Limits

## Monitor memory over time to catch spikes
while true; do
  echo "$(date): $(kubectl top pod <pod-name> --no-headers)"
  sleep 10
done

The solution is usually increasing memory limits by 50-100% to handle memory spikes during garbage collection or high load.

Q

java app oomkilled but heap dump looks fine what is eating my memory help

A

Off-heap memory is the usual culprit.

Java applications use memory outside the heap for:

  • DirectByteBuffers (NIO operations)
  • Metaspace (class definitions)
  • Code cache (JIT compilation)
  • Native library allocations

Actually debug off-heap usage:

## Enable native memory tracking (requires restart because Java is special)
## JAVA_TOOL_OPTIONS is picked up by the JVM automatically
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","env":[{"name":"JAVA_TOOL_OPTIONS","value":"-XX:NativeMemoryTracking=summary -XX:+UnlockDiagnosticVMOptions"}]}]}}}}'

## Check native memory usage (the money shot)
kubectl exec -it <pod-name> -- sh -c 'jcmd $(pgrep java) VM.native_memory summary'

## Monitor DirectByteBuffer usage (usually the culprit)
kubectl exec -it <pod-name> -- sh -c 'jcmd $(pgrep java) GC.class_histogram | grep -i directbytebuffer'

Common fixes:

  • Set -XX:MaxDirectMemorySize=256m to limit DirectByteBuffer usage
  • Increase -XX:MetaspaceSize if using lots of dynamic classes
  • Set -XX:ReservedCodeCacheSize=128m for applications with heavy JIT compilation
Q

pod crashloopbackoff cant exec how to debug oomkilled???

A

Use these debugging approaches when the pod won't stay alive:

Method 1: Debug with a sidecar container

apiVersion: v1
kind: Pod
metadata:
  name: debug-pod
spec:
  containers:
  - name: app
    image: your-broken-image:latest
    # Your normal container config
  - name: debug-sidecar
    image: nicolaka/netshoot
    command: ["sleep", "3600"]  # Keep alive for debugging
    volumeMounts:
    - name: shared-data
      mountPath: /debug
  volumes:
  - name: shared-data
    emptyDir: {}

Method 2: Override the entrypoint

## Run the container with a different command that doesn't crash
kubectl run debug-pod --image=your-broken-image:latest --rm -it -- /bin/sh

## Inside the container, run your app manually and monitor:
/usr/bin/your-app &
PID=$!
while kill -0 $PID 2>/dev/null; do
  echo "$(date): Memory: $(cat /proc/$PID/status | grep VmRSS)"
  sleep 5
done

Method 3: Use init containers for debugging

apiVersion: v1
kind: Pod
metadata:
  name: debug-with-init
spec:
  initContainers:
  - name: memory-debugger
    image: busybox
    command: ["sh", "-c", "echo 'Debugging memory limits'; cat /proc/meminfo"]
  containers:
  - name: app
    image: your-broken-image:latest

Q

nodejs oomkilled even with max-old-space-size set why???

A

Node.js has multiple memory areas beyond the V8 heap:

  • V8 heap (controlled by --max-old-space-size)
  • Buffer allocations (outside V8 heap)
  • Native addon memory
  • libuv memory pools

Debug Node.js memory usage:

## Check total process memory vs V8 heap
kubectl exec -it <pod-name> -- node -e "
setInterval(() => {
  const mem = process.memoryUsage();
  console.log({
    rss: Math.round(mem.rss / 1024 / 1024) + 'MB',             // Total memory
    heapTotal: Math.round(mem.heapTotal / 1024 / 1024) + 'MB', // V8 heap
    heapUsed: Math.round(mem.heapUsed / 1024 / 1024) + 'MB',   // Used heap
    external: Math.round(mem.external / 1024 / 1024) + 'MB'    // C++ objects
  });
}, 10000);"

## Monitor Buffer allocations specifically
kubectl exec -it <pod-name> -- node -e "
console.log('Buffer pool size:', Buffer.poolSize);
setInterval(() => {
  console.log('External memory:', process.memoryUsage().external);
}, 5000);"

Fix Node.js memory issues:

  • Set the container memory limit to roughly 2x your --max-old-space-size
  • Remember Buffers and native addons live outside the V8 heap - no V8 flag caps them
  • Monitor and limit Buffer allocations in your code
Q

python app oomkilled but memory looks normal how to debug this

A

Python memory issues are often caused by:

  • C extensions holding onto memory
  • Large objects not being garbage collected
  • Memory fragmentation
  • Multiprocessing memory sharing

Debug Python memory:

## Install memory profiling tools
kubectl exec -it <pod-name> -- pip install pympler memory-profiler psutil

## Get detailed memory breakdown
kubectl exec -it <pod-name> -- python -c "
import psutil, gc
process = psutil.Process()
print(f'RSS: {process.memory_info().rss / 1024 / 1024:.1f}MB')
print(f'VMS: {process.memory_info().vms / 1024 / 1024:.1f}MB')
print(f'Objects: {len(gc.get_objects())}')"

## Profile memory by line in your application
kubectl exec -it <pod-name> -- python -m memory_profiler your_script.py

Common Python memory fixes:

  • Use __slots__ for classes with many instances
  • Explicitly delete large objects and call gc.collect()
  • Use generators instead of loading large datasets into memory
  • Remember C extensions (numpy, pandas) allocate outside the Python heap, so RSS can dwarf what Python reports
Q

multi container pod oomkilled which container is causing it

A

Multi-container pods make it harder to tell which container blew its limit. Debug per container:

Method 1: Check individual container memory

## Get memory usage by container
kubectl top pod <pod-name> --containers

## Check events for specific container termination
kubectl describe pod <pod-name> | grep -A 10 -B 10 "container.*terminated"

## Look at resource limits per container
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].resources}'

Method 2: Add memory monitoring to each container

## Run memory monitoring in each container
kubectl exec -it <pod-name> -c <container-name> -- sh -c "
while true; do
  echo 'Container: <container-name>'
  cat /proc/meminfo | head -5
  ps aux --sort=-%mem | head -10
  echo '---'
  sleep 30
done"

Method 3: Use separate resource limits per container

apiVersion: v1
kind: Pod
spec:
  containers:
  - name: web
    resources:
      limits:
        memory: "1Gi"    # Separate limit for web container
  - name: sidecar
    resources:
      limits:
        memory: "512Mi"  # Separate limit for sidecar
Q

database pods oomkilled during backup operations how to fix

A

Database memory spikes during operations like backups, index rebuilds, or query processing.

Short-term fix:

## Increase memory limits temporarily for backup operations (note: patching the template triggers a rolling restart)
kubectl patch statefulset <database-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","resources":{"limits":{"memory":"4Gi"}}}]}}}}'

## Run backup (of course, this will fail during your most important backup)
kubectl exec -it <pod-name> -- /backup-script.sh

## Restore normal limits
kubectl patch statefulset <database-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","resources":{"limits":{"memory":"2Gi"}}}]}}}}'

Long-term solution - use init containers for resource-intensive operations:

apiVersion: v1
kind: Pod
spec:
  initContainers:
  - name: backup
    image: postgres:13
    resources:
      limits:
        memory: "4Gi"    # Higher memory for backup
    command: ["pg_dump", "..."]
  containers:
  - name: postgres
    image: postgres:13
    resources:
      limits:
        memory: "2Gi"    # Normal operational memory

Q

pods oomkilled on some nodes but not others why different behavior

A

Node-level differences can cause inconsistent OOM behavior.

Check node memory differences:

## Compare available memory across nodes
kubectl describe nodes | grep -E "Name:|memory:"

## Check node conditions
kubectl get nodes -o custom-columns='NAME:.metadata.name,MEMORY-PRESSURE:.status.conditions[?(@.type=="MemoryPressure")].status'

## See memory usage patterns per node
kubectl top nodes --sort-by=memory

Common node-level issues:

  • Different node types (some with less memory)
  • System processes consuming different amounts of memory
  • Memory fragmentation levels varying by node
  • Different kernel memory settings

Fix with node selection:

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      nodeSelector:
        node.kubernetes.io/instance-type: "m5.xlarge"  # Consistent node type
      containers:
      - name: app
        resources:
          limits:
            memory: "2Gi"
Q

works in docker locally but oomkilled in kubernetes whats different

A

Several differences between local Docker and Kubernetes can cause memory issues.

Container runtime differences:

## Check cgroup version differences
kubectl exec -it <pod-name> -- cat /proc/1/cgroup | head -5

## Compare memory accounting methods (cgroup v1 path shown)
kubectl exec -it <pod-name> -- cat /sys/fs/cgroup/memory/memory.usage_in_bytes

Common Kubernetes vs Docker differences:

  1. Memory accounting: Kubernetes includes page cache in memory usage
  2. Process limits: different ulimits and system constraints
  3. User context: containers might run as different users
  4. Filesystem differences: tmpfs mounts consuming memory

Debug the differences:

## Compare user contexts
docker run --rm your-image id
kubectl exec -it <pod-name> -- id

## Compare memory limits enforcement
docker run --rm your-image cat /proc/meminfo | head -5
kubectl exec -it <pod-name> -- cat /proc/meminfo | head -5

## Check for tmpfs mounts consuming memory
kubectl exec -it <pod-name> -- df -h | grep tmpfs

These FAQ answers should help you debug the most common OOMKilled scenarios quickly, especially during production incidents.

OOMKilled Tool Reality Check - What Actually Works vs. What Wastes Your Time

Tools That Don't Completely Suck (Your First 5 Minutes of Panic)

  • kubectl top - Shows current memory usage. Great for seeing "oh shit we're at 90%", but it lies about actual usage 50% of the time and is useless for understanding peaks. Command: kubectl top pod <pod> --containers. My experience: showed 200MB usage on a pod that was OOMKilled at 1GB. Thanks, Kubernetes.
  • kubectl describe - Shows limits, events, and restart reasons, and actually tells you "killed for exceeding memory limit". Catch: events rotate every fucking hour - miss the window and you're debugging blind. Command: kubectl describe pod <pod>
  • kubectl logs --previous - Shows app logs before death; current logs won't show shit about why it died. Catch: some apps don't log memory issues because they're dead before they know it. Command: kubectl logs <pod> --previous. Pro tip: add this to muscle memory.
  • kubectl get events - Cluster-wide chaos in chronological order. Use it when multiple pods are dying and you need to see if it's a node issue. Reality: 90% noise, 10% useful. Hope you like scrolling. Command: kubectl get events --sort-by='.lastTimestamp'

Advanced Tools (When Basic Shit Doesn't Work)

  • Ephemeral containers - Inject debugging containers into live pods without rebuilding images. Stable since K8s 1.25. Catch: disabled in 80% of corporate environments because security teams think debugging is a security vulnerability. My experience: spent 3 hours getting this approved, then it saved my ass in 10 minutes. Worth the fight.
  • kubectl debug - Spin up debug containers with access to target pod filesystems. Good for poking around files without an app restart; not for actually understanding memory usage - it's glorified ls access. The "safe" option.
  • Application profilers (JVisualVM, Chrome DevTools, etc.) - Deep code-level memory analysis. Reality: JVisualVM crashes more than my actual app, and Chrome DevTools works great until it tries to download a 500MB heap dump over VPN. When to use: when you're desperate and have time to curse at flaky tooling.
  • Node-level debugging - SSH to the node like a caveman: kernel logs, cgroup limits, the real story. Costs you your DevOps team's respect and probably some security compliance. Pro tip: journalctl -f on the node shows OOM kills that Kubernetes never reports.

Monitoring Solutions That Don't Bankrupt You

  • Local/dev clusters (kubectl + wishful thinking) - Basic resource usage and prayer-based alerting. Time to set up: 30 minutes to install metrics-server, 3 hours to figure out why it's broken. Cost: free (your sanity is priceless). Reality: fine for dev until your staging environment starts mimicking prod issues.
  • Small production (Prometheus + Grafana) - Setup time: "1-2 days" according to tutorials, 1-2 weeks in reality after fighting YAML hell. Cost: $50-200/month if you're smart about it, $500+ if AWS manages it for you. My experience: spent a weekend getting alerting rules right and still get woken up for non-issues. Worth it: yes, once you stop cursing the setup process.
  • Enterprise (full observability stack, aka "the money pit") - Everything monitored, alerted, and correlated. Cost: $500-2000+/month (DataDog, New Relic, or whatever sales sold you). Time to value: 2-6 weeks depending on how many meetings you need to set up monitors. Pro tip: the tools work great; getting approval to spend this much takes longer than the setup.
  • Cloud managed (Google/AWS/Azure handle it) - Native integration, minimal setup. Catch: vendor lock-in and costs add up fast. Reality check: great until you get the first $1000 monitoring bill for a 3-node cluster.

Language-Specific Debugging (Because Every Language Sucks Differently)

Java/JVM:

  • jstat - Your basic bitch GC monitor: heap usage, GC frequency, general suffering. Works great until you realize you need off-heap info and jstat just shrugs. Command: jstat -gc $(pgrep java) 5s 10 - run this every 5 seconds and pray.
  • jcmd - The Swiss Army knife that actually cuts. Shows you ALL the memory, not just heap - that 2GB "mystery" memory usage is off-heap native memory, and jcmd finds it. Catch: needs the -XX:NativeMemoryTracking=summary JVM flag or it tells you to go fuck yourself. My go-to: jcmd $(pgrep java) VM.native_memory summary
  • JVisualVM - Pretty charts, interactive profiling, feels professional. Crashes more than Internet Explorer, connection issues galore. When to use: when you have time to fight with GUI tools and need to demo something to management.
  • Heap dumps - The nuclear option. Heap dumps are fucking huge: 2GB heap = 2GB+ dump file, 30 seconds to generate, 2 hours to analyze with Eclipse MAT. War story: generated an 8GB heap dump that crashed my laptop, corrupted the file, and taught me nothing except to buy more RAM.
  • JFR (Java Flight Recorder) - Built into modern JVMs, minimal overhead, doesn't crash. Oracle scared people away with licensing FUD for years. Setup: add -XX:+FlightRecorder -XX:StartFlightRecording to JVM args.

Node.js:

  • process.memoryUsage() - Built-in and boring: basic heap stats that look impressive in graphs. Tells you memory is high, not why - like a smoke detector for RAM. Snippet: console.log(process.memoryUsage()) - because debugging is just console.log with extra steps.
  • --inspect + Chrome DevTools - The same DevTools interface you know: heap snapshots, allocation profiling. Adds significant overhead, so don't run it in production unless you hate your users. Setup: start Node with --inspect, open Chrome, connect to the debugger. My experience: works great in dev; in prod it's like debugging with molasses.
  • Heap snapshots - For when you need to go deep. Each snapshot is massive - hope you have disk space and patience. Analysis means comparing snapshots to find growing objects: sounds easy, takes hours. Pro tip: take snapshots before and after reproducing the issue; everything else is noise.
  • clinic.js - The toolkit that tries really hard: a comprehensive Node.js profiling suite with pretty charts and analysis that may or may not help. When to use: when you've exhausted basic options and need to justify spending time on tooling.

Python:

  • psutil - Process stats that actually work, for the RSS/VMS breakdown. Unlike other Python memory tools, it doesn't lie about OS-level usage - it shows your process using 2GB when you allocated 500MB. Welcome to Python. Usage: psutil.Process().memory_info()
  • memory-profiler - Line-by-line self-torture: decorators on functions to track memory per line. Best use: finding the exact line that allocates 1GB of pandas DataFrames you forgot about. Catch: makes your code run like it's 1995.
  • pympler - Object tracking for masochists: tracks every Python object and where it lives in memory. Good when you need to know why you have 50,000 string objects. Warning: the analysis phase uses more memory than your actual leak.
  • tracemalloc - Built into Python since 3.4 - Python actually shipped with decent memory tracing. Minimal overhead unless you're tracing everything. My experience: found more accidental memory leaks with this than any other Python tool.

Memory Limit Debugging Approaches (Ranked by Desperation Level)

  • Double the limit - The "fuck it, ship it" approach. Won't break anything immediately, but accuracy is terrible: you're just pushing the problem down the road. This is your 3AM move when the site is down and you'll "fix it properly later" (spoiler: you won't). Time to fix: 30 seconds to change the YAML, 30 minutes to convince yourself it's temporary. Your cost alerts will hate you.
  • VPA recommendations - Let the machine decide: VPA watches your pods for a week, then suggests limits. Pretty good for steady workloads, garbage for spiky ones - VPA suggests 8GB for your 512MB app because it doesn't understand your traffic patterns. Time investment: 7+ days of waiting, plus whatever time you spend ignoring the recommendations.
  • Load testing + profiling - The "proper" engineering approach, and the best accuracy you'll get if you do it right. Time cost: 1-2 days of building realistic test scenarios. Skill requirement: high - you need to know your app's usage patterns and how to stress test properly. My experience: spent a week building perfect load tests, then production traffic did something completely different.
  • Production monitoring - The marathon approach: watch real usage over weeks or months and adjust based on patterns. Excellent for long-term capacity planning. Time to results: 2-4 weeks for meaningful data. Why it works: real traffic is the only traffic that matters.

OOMKill Detection Reality Check

  • kubectl get pods - The bare minimum: pod restarts and obvious crashes. Fine for "is shit broken right now" but useless for understanding patterns, and it misses child-process OOM kills and anything that happens between your checks. Pro tip: kubectl get pods -w keeps it running; add --all-namespaces because your problem is never in the namespace you expect.
  • Prometheus monitoring - The gold standard that costs gold: catches everything, keeps historical data, makes pretty graphs for incident reports. Downsides: setup complexity, storage costs, and false-positive alerts when you tune it wrong. Reality: best long-term solution but useless during your current emergency.
  • Node log monitoring - SSH and pray: kernel logs show the real OOM kills, and the kernel doesn't lie like Kubernetes metrics sometimes do. Command: journalctl -f | grep -i "killed process" on the node. Why nobody does it: SSH access to nodes makes security teams cry.
  • Application logging - Only as good as your developers: apps that log memory warnings before dying. Works for Java OOM exceptions and Python memory errors with proper exception handling; doesn't work for everything else, because most apps die before they know they're dying.

Cloud Provider Tools (Aka How to Spend Money on Debugging)

  • AWS EKS (the vendor lock-in special) - CloudWatch Container Insights works out of the box but costs a fortune at scale: $3/container/month adds up fast. X-Ray is only useful if you instrument your app properly (spoiler: you didn't). My experience: great for small clusters, bankruptcy-inducing for large ones.
  • Google GKE (the "we actually understand Kubernetes" option) - Cloud Monitoring has the best native K8s integration and reasonable pricing, and Cloud Profiler actually works in production without crashing your app. Why it's better: Google wrote Kubernetes, and their tooling shows it.
  • Azure AKS (the "we're trying our best" platform) - Azure Monitor is decent but not great: typical Microsoft enterprise approach. Application Insights is good if you're all-in on Microsoft, confusing if you're not. Reality: works fine, feels like an afterthought compared to AWS/GCP.

Emergency Response Playbook (When Shit's on Fire)

  • Pod CrashLoopBackOff (the classic) - First 30 seconds: kubectl describe pod and kubectl logs --previous. If that fails: double the memory limit and restart. Time to fix: 5-30 minutes if you know what you're doing.
  • Intermittent OOM (the mystery novel) - You're gonna be debugging this for hours, not minutes. First move: set up monitoring if you don't have it. Tools that help: Prometheus historical data, application profilers. Time to fix: 2-8 hours of actual debugging.
  • Memory leak (the slow death) - Detection: memory usage slowly climbing over days or weeks. Analysis: heap dumps, profilers, lots of coffee. Time to fix: 1-3 days minimum - don't promise faster to management.
  • Invisible OOM (the ghost in the machine) - Processes die inside containers and Kubernetes doesn't notice. How to find it: node logs and process monitoring inside containers. Fix: usually cgroup v2 migration or proper resource limits.

Tool Selection for People Who Actually Work for a Living

  • Your site is down right now: kubectl describe pod → kubectl logs --previous → double the memory → deploy → investigate later.
  • You have a few hours to debug: set up ephemeral containers for runtime analysis, use application profilers to find the real problem, and load test in staging with proper profiling.
  • You want to prevent this shit: Prometheus + Grafana for monitoring (suck it up, learn the YAML), VPA for automatic right-sizing (after you fix the real issues), and resource quotas so one team can't crash the cluster.
  • Your budget determines your tooling: Broke: kubectl + metrics-server + shell scripts. Normal: add Prometheus, Grafana, and basic cloud monitoring. Enterprise: full observability stack, dedicated SRE team, blame junior developers.
