Kubernetes OOMKilled Debugging & Prevention Guide
Critical Context
Operational Reality: OOMKilled errors cause 90% of production incidents at 3AM. Basic troubleshooting fails 80% of the time because kubectl top shows current usage, not the memory spike that killed the pod 30 seconds ago.
Severity Indicators:
- Critical: Payment API or database pods dying (immediate revenue impact)
- High: Invisible OOM kills (child processes die, main process stays alive, nearly impossible to detect without node-level logs)
- Medium: Obvious crashes with restart loops (visible but disruptive)
Hidden Costs: Doubling memory limits doubles your infrastructure spend for that workload and only delays the problem. Proper debugging saves $500-2000/month in wasted resources.
Root Cause Categories
Type 1: Obvious OOM Kills
Symptoms: Pod shows OOMKilled, exit code 137, restart count climbing
Detection Time: Immediate (if you catch the events before they rotate)
Fix Complexity: Low (increase limits) to High (application optimization)
Type 2: Invisible OOM Kills
Symptoms: Application becomes unresponsive, error rates spike, no pod restarts
Detection Time: Hours to days (application appears healthy to Kubernetes)
Fix Complexity: High (requires node-level debugging and cgroup analysis)
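Because Kubernetes never restarts the pod in this scenario, the most reliable signal is the kernel's own OOM-kill counter rather than anything pod-level. A minimal detection sketch, assuming node-exporter is running with its default vmstat collector (the metric and alert names below are illustrative, not part of this guide's stack):

```yaml
# Hypothetical rule: fires when the kernel OOM-kills any process on a node,
# even when no pod restart is visible to Kubernetes
- alert: KernelOOMKillDetected
  expr: increase(node_vmstat_oom_kill[15m]) > 0
  labels:
    severity: warning
  annotations:
    summary: "Kernel OOM killer fired on {{ $labels.instance }} - check for invisible child-process kills"
```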
Memory Usage Patterns by Technology
Java/JVM Applications
Base Memory: 200MB (Spring Boot startup overhead)
Working Memory: 400MB (application + dependencies)
Off-Heap Tax: 25% additional (DirectByteBuffers, Metaspace, Code Cache)
Common Failure: Off-heap memory consumption invisible to heap monitoring
Critical Commands:
# Enable native memory tracking (requires restart)
-XX:NativeMemoryTracking=summary
# Show ALL memory usage, not just heap
jcmd $(pgrep java) VM.native_memory summary
# Monitor DirectByteBuffer leaks (usual culprit) - count live direct buffer instances
jcmd $(pgrep java) GC.class_histogram | grep -i directbytebuffer
Node.js Applications
Base Memory: 50MB (V8 startup)
Working Memory: 200-500MB (depending on dependencies)
Memory Tax: 30% additional (Buffer allocations, libuv pools)
Common Failure: Buffer allocations outside V8 heap don't show in heap snapshots
Critical Configuration:
# Explicit heap limit (prevents 1.5GB default on containers)
--max-old-space-size=1024
# Monitor total process memory vs V8 heap - a growing gap means off-heap leaks
# (compare process.memoryUsage().rss against process.memoryUsage().heapUsed)
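A runnable version of that comparison, assuming you can ship a tiny preload script with the app (memwatch.js and server.js are placeholder names):

```bash
# Sketch: log rss vs V8 heap vs external (Buffer) memory once a minute.
# A steadily growing rss with a flat heapUsed points at off-heap/Buffer leaks
# that heap snapshots will never show.
cat <<'EOF' > memwatch.js
setInterval(() => {
  const m = process.memoryUsage();
  console.log(JSON.stringify({
    rss_mb: Math.round(m.rss / 1048576),
    heap_used_mb: Math.round(m.heapUsed / 1048576),
    external_mb: Math.round(m.external / 1048576),
  }));
}, 60000).unref();
EOF

# Preload it without touching application code
NODE_OPTIONS="--require ./memwatch.js" node server.js
```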
Python Applications
Base Memory: 30MB (interpreter)
Working Memory: Highly variable (pandas DataFrames are memory bombs)
Memory Tax: 40% additional (garbage collection overhead, object references)
Common Failure: Circular references preventing garbage collection
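There is no single JVM-style flag here; a rough starting point for attributing Python memory, assuming you can install packages in a staging image (script names are placeholders):

```bash
# Line-by-line memory attribution - decorate suspect functions with @profile first
# (slow; use in staging, not production)
pip install memory-profiler
python -m memory_profiler your_script.py

# Built-in allocation tracking from startup (25 = stack frames kept per allocation);
# pair with tracemalloc.take_snapshot() inside the app to dump top allocation sites
PYTHONTRACEMALLOC=25 python your_app.py

# Quick cycle check: gc.collect() returns how many unreachable objects it found
python -c "import gc; print(gc.collect(), 'unreachable objects collected')"
```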
Debugging Decision Matrix
Emergency Response (Site Down)
Time Budget: 5-30 minutes
Tools: kubectl describe, kubectl logs --previous, memory limit doubling
Success Rate: 80% for obvious OOMs, 20% for complex issues
Investigation Phase (Hours Available)
Time Budget: 2-8 hours
Tools: Ephemeral containers, application profilers, load testing
Success Rate: 90% if you have proper access and tooling
Prevention Phase (Long-term)
Time Budget: 1-2 weeks setup, ongoing maintenance
Tools: Prometheus monitoring, VPA recommendations, resource quotas
Success Rate: 95% prevention of repeat incidents
Tool Effectiveness by Scenario
Basic Kubernetes Tools
Tool | Good For | Fails When | Reality Check |
---|---|---|---|
kubectl top | Current usage snapshot | Memory spikes (shows usage after restart) | Lies about actual usage 50% of the time |
kubectl describe | Termination reasons, resource limits | Events expire after 1 hour | Essential first step but time-sensitive |
kubectl logs --previous | Application logs before death | App dies before logging | Add to muscle memory |
Advanced Debugging Tools
Tool | Setup Time | Effectiveness | Corporate Approval |
---|---|---|---|
Ephemeral containers | 5 minutes | High | Disabled in 80% of environments |
Application profilers | 1-4 hours | Very High | Usually allowed in dev only |
Node-level access | Immediate | Highest | Requires infrastructure team |
Monitoring Solutions
Solution | Setup Cost | Monthly Cost | Time to Value |
---|---|---|---|
kubectl + metrics-server | Free | Free | 30 minutes |
Prometheus + Grafana | 1-2 weeks | $50-200 | 2-4 weeks |
Enterprise observability | 2-6 weeks | $500-2000+ | 1-3 months |
Memory Sizing Formulas
Production-Ready Calculation
Total Limit = (Base + Working + Spike Buffer) × (1 + Language Tax) × 1.2
Where:
- Base = Minimum to start (language-specific)
- Working = Normal operation memory
- Spike Buffer = 50-100% for traffic/GC spikes
- Language Tax = GC overhead (Java: 25%, Node: 30%, Python: 40%)
- 1.2 = 20% safety margin (estimates are always wrong)
Example: Java Spring Boot API
Base: 200MB (Spring Boot startup)
Working: 400MB (application logic)
Spike: 300MB (50% buffer for GC)
Tax: 25% (JVM overhead)
Total: (200 + 400 + 300) × 1.25 × 1.2 = 1350MB
Recommended limit: 1.5GB
Quality of Service (QoS) Strategy
QoS Class Priorities (Kubernetes Kill Order)
- BestEffort - No requests/limits (dies first)
- Burstable - Requests < Limits (dies second)
- Guaranteed - Requests = Limits (dies last)
Strategic QoS Usage
- Critical services (databases, payment APIs): Guaranteed QoS
- Web applications: Burstable QoS with conservative requests
- Batch jobs: BestEffort or high-limit Burstable
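The QoS class isn't a field you set; Kubernetes derives it from how requests and limits compare, so the strategy above is purely a resources-block pattern. A minimal sketch (sizes are placeholders):

```yaml
# Guaranteed QoS: requests == limits for every container in the pod - evicted last
resources:
  requests:
    memory: "1Gi"
    cpu: "500m"
  limits:
    memory: "1Gi"
    cpu: "500m"
# Burstable QoS for a web app would keep requests below limits,
# e.g. requests.memory "512Mi" with limits.memory "1Gi"
```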
Prevention Configuration
Resource Quotas (Namespace Level)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: memory-quota            # example name
  namespace: <target-namespace>
spec:
  hard:
    requests.memory: "100Gi"    # Total memory requests
    limits.memory: "200Gi"      # Total memory limits
    pods: "50"                  # Prevent pod sprawl
Vertical Pod Autoscaler (Automatic Right-Sizing)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: <app-name>-vpa
spec:
  targetRef:                    # required: the workload to right-size
    apiVersion: apps/v1
    kind: Deployment
    name: <deployment-name>
  updatePolicy:
    updateMode: "Auto"          # Or "Off" for recommendations only
  resourcePolicy:
    containerPolicies:
      - containerName: "*"      # apply to every container in the pod
        maxAllowed:
          memory: "8Gi"         # Prevent runaway scaling
        minAllowed:
          memory: "256Mi"       # Minimum viable memory
Monitoring Alerts (Prometheus)
# Alert when approaching the memory limit (working set is what counts toward the
# cgroup limit; the container!="" filter drops the pod-level aggregate series)
- alert: MemoryUsageHigh
  expr: container_memory_working_set_bytes{container!=""} / container_spec_memory_limit_bytes{container!=""} > 0.85
  for: 5m
# Alert on memory growth rate (leak detection): more than 10MB growth over 30 minutes
- alert: MemoryGrowthRate
  expr: delta(container_memory_working_set_bytes{container!=""}[30m]) > 10485760
  for: 15m
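If kube-state-metrics is installed (it's recommended later in this guide), you can also alert on the kill itself instead of inferring it from usage; a hedged sketch using its standard container-status metric:

```yaml
# Fires while a container's most recent termination reason is OOMKilled
- alert: ContainerOOMKilled
  expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
  for: 0m
  labels:
    severity: warning
```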
Common Failure Scenarios & Solutions
Database Connection Pool Bloat
Problem: HikariCP configured for 50 connections × 10MB cache = 500MB off-heap
Detection: jcmd native memory tracking shows high "Other" usage
Solution: Reduce pool size to 20, cap result set cache
Prevention: Monitor connection pool metrics, set leak detection
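For Spring Boot services both the fix and the prevention are one-line configuration changes; a sketch assuming the standard spring.datasource.hikari properties:

```yaml
# application.yml - cap the pool and flag connections held longer than 60 seconds
spring:
  datasource:
    hikari:
      maximum-pool-size: 20
      leak-detection-threshold: 60000   # milliseconds
```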
Winston Log Buffer Accumulation
Problem: Winston buffering 50,000+ log entries × 1KB = 50MB+ growing forever
Detection: Node.js external memory growing without heap growth
Solution: Set buffer size to 1,000 entries maximum
Prevention: Monitor external memory vs heap usage
Invisible ArgoCD Helm Kills
Problem: Helm processes OOM killed during deployments, main ArgoCD process unaware
Detection: Apps show "Out of Sync" without obvious errors
Solution: Increase repo-server memory from 128Mi to 256Mi
Prevention: Monitor node kernel logs for process kills
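Applying the fix is a one-liner once you know the repo-server is the component being starved; a sketch assuming a standard install in the argocd namespace with the default deployment name:

```bash
# Bump repo-server memory so Helm template rendering stops getting OOM killed
kubectl -n argocd set resources deployment argocd-repo-server \
  --requests=memory=256Mi --limits=memory=256Mi

# Confirm the pods roll and stay up
kubectl -n argocd get pods -l app.kubernetes.io/name=argocd-repo-server -w
```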
Emergency Debugging Commands
Immediate Triage (0-5 minutes)
# Check pod status and recent events
kubectl describe pod <pod-name> | grep -A 20 "Last State"
kubectl get events --sort-by='.lastTimestamp' | grep <pod-name>
# Check previous container logs
kubectl logs <pod-name> --previous | tail -50
# Quick memory usage check
kubectl top pod <pod-name> --containers
Deep Analysis (5-60 minutes)
# Add debugging container (if allowed)
kubectl debug <pod-name> -it --image=nicolaka/netshoot --target=<container-name>
# Check node-level OOM kills (the node's root filesystem is mounted at /host;
# busybox has no journalctl, so chroot into the host or use an image that ships it)
kubectl debug node/<node-name> -it --image=busybox
chroot /host journalctl --utc -k --since "1 hour ago" | grep -i "killed process"
# Fallback if journalctl is unavailable on the host: dmesg | grep -i "out of memory"
# Language-specific memory analysis
# Java: jcmd $(pgrep java) VM.native_memory summary
# Node: node --inspect and heap snapshots
# Python: memory-profiler and pympler analysis
Load Testing Memory Patterns
# Gradual load ramp to find the memory ceiling
# hey's -q flag is requests/sec per worker, so use -c workers at 1 req/s each for a predictable total RPS
for rps in 10 50 100 200 500; do
  echo "Testing $rps RPS"
  hey -z 300s -c $rps -q 1 <api-endpoint> &
  # Sample pod memory once a minute while the load runs, then wait for hey to finish
  for i in 1 2 3 4 5; do
    kubectl top pod <pod-name> | tee -a load-test-memory.log
    sleep 60
  done
  wait
done
Resource Requirements by Use Case
Development Environment
- Time Investment: 1-2 hours initial setup
- Tools: kubectl + metrics-server + shell scripts
- Cost: Free
- Effectiveness: Good for learning, poor for production patterns
Small Production (< 50 pods)
- Time Investment: 1-2 weeks setup
- Tools: Prometheus + Grafana + basic cloud monitoring
- Cost: $50-200/month
- Effectiveness: Excellent for preventive monitoring
Enterprise Production (100+ pods)
- Time Investment: 2-6 weeks setup + dedicated SRE
- Tools: Full observability stack (DataDog, New Relic, etc.)
- Cost: $500-2000+/month
- Effectiveness: Best in class, if budget allows
Breaking Points & Failure Modes
Memory Limit Thresholds
- Below 256MB: Most applications won't start
- 256MB-512MB: Suitable for simple services only
- 512MB-1GB: Standard web applications
- 1GB-4GB: Database applications, ML workloads
- Above 4GB: Requires memory-optimized nodes
Node Memory Pressure
- 85%+ utilization: Kubernetes starts evicting BestEffort pods
- 90%+ utilization: System becomes unstable
- 95%+ utilization: Node becomes unschedulable
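You can check whether a node has already crossed the kubelet's eviction thresholds without leaving kubectl (exact output formatting varies by version):

```bash
# MemoryPressure condition plus what the scheduler thinks is still allocatable
kubectl describe node <node-name> | grep -i -A 2 "MemoryPressure"
kubectl describe node <node-name> | grep -A 8 "Allocated resources"
```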
Container Runtime Limits
- cgroup v1: Memory accounting includes page cache (higher usage)
- cgroup v2: More accurate memory tracking, different behavior
- Docker vs containerd: Slight differences in memory reporting
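If you're unsure which cgroup version a node runs, a filesystem-type check settles it; run it from a node shell or a node debug pod (where the host is mounted at /host):

```bash
# cgroup2fs means cgroup v2; tmpfs means cgroup v1
stat -fc %T /sys/fs/cgroup/

# From inside a kubectl debug node/<node-name> session
chroot /host stat -fc %T /sys/fs/cgroup/
```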
Decision Criteria for Tool Selection
Budget-Based Selection
- $0 budget: kubectl + metrics-server + patience
- $500/month budget: Add Prometheus + Grafana + cloud basics
- $2000+/month budget: Enterprise observability + dedicated SRE
Skill-Based Selection
- Junior teams: Stick to kubectl basics, use VPA recommendations
- Senior teams: Custom Prometheus rules, application profiling
- Expert teams: Node-level debugging, custom tooling
Time-Based Selection
- Under pressure: Double limits, debug later (technical debt)
- Investigation time: Proper profiling and load testing
- Long-term: Comprehensive monitoring and prevention
Success Metrics
Incident Reduction
- Baseline: 2-3 OOM pages per week (typical before optimization)
- Target: 1 OOM page per month (achievable with proper limits + monitoring)
- Best case: Zero OOM incidents (requires comprehensive prevention)
Cost Optimization
- Waste reduction: 30-50% memory over-provisioning eliminated
- Infrastructure savings: $500-2000/month for medium-scale deployments
- Engineering time: 80% reduction in debugging time per incident
Operational Maturity
- Reactive: Fix issues after they occur (most teams)
- Proactive: Prevent issues with monitoring (good teams)
- Predictive: Capacity planning and automated remediation (best teams)
Useful Links for Further Investigation
Resources That Don't Suck - Links I Actually Used During Real Outages
Link | Description |
---|---|
Kubernetes Resource Management | The only K8s docs page that's actually useful. Explains limits vs requests without being completely useless. |
Debug Pods and ReplicationControllers | Basic debugging steps. Skip the theory, go straight to the kubectl commands. Half of these don't work in real clusters. |
Troubleshooting Applications | Comprehensive but mostly outdated. The real world is messier than these examples assume. |
Ephemeral Containers | Cool feature if your security team allows it (spoiler: they don't). |
Netshoot Debugging Container | Holy grail of debugging containers. Has every tool you need when your pod is dying. kubectl run tmp --rm -i --tty --image nicolaka/netshoot -- /bin/bash |
kubectl-debug Plugin | Enhances kubectl debug with more features. Installation is a pain, but worth it if you debug pods regularly. |
VPA (Vertical Pod Autoscaler) | Actually recommends useful memory limits. Setup is a nightmare but the recommendations are solid once it learns your apps. |
Goldilocks by Fairwinds | Pretty UI for VPA data. Nice for showing charts to management, less useful for actual debugging. |
Prometheus Memory Monitoring Setup | Takes 2 weeks to get working properly but then it's bulletproof. The alerting rules in this guide actually work. |
Grafana Kubernetes Memory Dashboards | Save yourself hours of YAML hell. Import dashboard 7249 and 6879. They're ugly but functional. |
kube-state-metrics | Essential for getting OOMKilled events into Prometheus. Installation is straightforward, just follow their manifests. |
Metrics Server | Required for kubectl top to work. Breaks constantly on self-managed clusters due to certificate issues. |
Java Memory Troubleshooting in Kubernetes | Mostly useless K8s docs, but the heap dump commands work. Better to just run jcmd directly. |
Eclipse Memory Analyzer (MAT) | Ugly as sin but actually finds memory leaks. Download takes forever, analysis takes longer, but results are solid. |
Node.js Memory Best Practices | Official docs that are surprisingly not garbage. The profiling examples actually work in containers. |
Python Memory Profiler | Slows your app to a crawl but shows you exactly which lines eat memory. Use sparingly. |
Lumigo Kubernetes OOMKilled Guide | Actually comprehensive and based on real debugging experience. Bookmark this one. |
Fairwinds 5 Ways to Diagnose OOMKilled | Good practical approaches. Skip the marketing fluff, focus on the debugging steps. |
Medium: Tracking Invisible OOM Kills | The best guide I've found for invisible OOMs. This saved my ass during a particularly nasty incident. |
Komodor OOMKilled Troubleshooting | Step-by-step approach that actually works. No bullshit, just solutions. |
AWS EKS Troubleshooting Guide | Typical AWS docs - comprehensive but assumes you love paying for CloudWatch. Container Insights costs add up fast. |
Google GKE Memory Monitoring | Best cloud provider docs for K8s. Google wrote Kubernetes so their monitoring actually makes sense. |
Azure AKS Troubleshooting | Hit or miss. Some sections are great, others feel like they were translated from marketing speak. |
DigitalOcean Kubernetes Basic Monitoring | Simple and straightforward. No enterprise bullshit, just basic monitoring that works. |
Awesome Prometheus Alerts Collection | Copy-paste ready alerting rules. Saved me weeks of writing custom alerts from scratch. |
Jaeger Memory Tracing | Overkill for simple memory issues but amazing for distributed memory leaks. Setup is a pain. |
k9s Terminal UI | Best K8s terminal UI. Period. Makes debugging so much faster than raw kubectl commands. |
stern Multi-Pod Logs | stern pod-prefix to tail logs from all matching pods. Essential when you don't know which pod is dying. |
kube-monkey Chaos Testing | Randomly kills pods to test resilience. Great idea until it crashes your prod database and you get fired. |
Pumba Network and Resource Chaos | Simulate memory pressure and resource constraints. Use in staging unless you enjoy resume writing. |
k6 Load Testing | Modern load testing that doesn't suck. JavaScript-based, actually scales, and shows memory patterns under load. |
Artillery.io Performance Testing | Good for memory profiling under load. Setup is easier than k6 but less powerful. |
cAdvisor Container Metrics | Shows the real memory usage. More accurate than kubectl top because it doesn't lie about spikes. |
runc Memory Debugging | Deep cgroup debugging when you need to understand exactly how memory limits work. Very technical. |
Kubernetes Emergency Debugging Cheat Sheet | Basic kubectl commands for memory debugging. Print this and keep it handy. |
kubectl Quick Reference | Quick reference for when your brain stops working at 3AM. |
Stack Overflow Kubernetes Memory Tag | 90% of your OOM problems have been solved here. Search before struggling. |
Kubernetes Up & Running (Memory Chapter) | Chapter 5 covers resource management well. Skip the rest unless you're new to K8s. |
Troubleshooting Kubernetes (O'Reilly) | Actually focused on troubleshooting. Less theory, more practical debugging steps. |
Red Hat OpenShift Memory Troubleshooting | OpenShift adds complexity but their docs are comprehensive. Memory debugging works the same. |
Rancher Kubernetes Troubleshooting | Rancher-specific debugging. Most generic K8s approaches still work. |