Your Kubernetes cluster isn't failing because of some exotic edge case described in a 200-page white paper. It's failing because someone forgot to set memory limits, your ingress controller shit the bed, or AWS decided to randomly restart your nodes. Here's what actually breaks in the real world.
The Five Horsemen of Kubernetes Apocalypse
1. Pod Death Spiral (CrashLoopBackOff Hell)
What it looks like: Your pod keeps restarting every 30 seconds and kubectl describe shows CrashLoopBackOff status.
Why it happens:
- Your app crashes on startup, the container exits, and Kubernetes keeps restarting it
- Database connection strings are wrong (check your secrets)
- Missing environment variables that your app needs to start
- File permissions are fucked in your container (security context issues)
- Your health check is failing because the app takes 45 seconds to start but your readinessProbe times out after 10
The 5-minute fix:
## See what's actually happening
kubectl logs your-broken-pod --previous
kubectl describe pod your-broken-pod
## Check if it's a startup timing issue
kubectl get pod your-broken-pod -o yaml | grep -A 5 -B 5 probe
## Nuclear option: disable health checks temporarily
kubectl patch deployment your-app -p '{"spec":{"template":{"spec":{"containers":[{"name":"your-container","readinessProbe":null,"livenessProbe":null}]}}}}'
Real failure story: Our Spring Boot app took 90 seconds to start because it was downloading Maven dependencies on startup (brilliant architecture choice). The readiness probe was configured for 30 seconds. Pods kept getting killed just as they were about to be ready. Solution: increased initialDelaySeconds to 120 and fixed the Docker image to include dependencies. Took down prod for 2 hours because we "didn't want to change the health check configuration." Learn from our debugging mistakes.
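If you hit the same timing trap, bumping the delay is a one-line patch. A minimal sketch, assuming your container is named your-container and already has a readinessProbe defined:
## Give the app two minutes before the first readiness check
kubectl patch deployment your-app -p '{"spec":{"template":{"spec":{"containers":[{"name":"your-container","readinessProbe":{"initialDelaySeconds":120}}]}}}}'
A startupProbe is the cleaner long-term answer for slow-starting apps, but the delay bump stops the bleeding.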
2. OOMKilled - The Memory Massacre
What it looks like: kubectl get pods shows your pod restarted with exit code 137, and kubectl describe pod reveals the dreaded Reason: OOMKilled.
Why it happens:
- You allocated 128MB but your Java app needs 2GB (classic resource management failure)
- Memory leak in your application that accumulates over days
- No memory limits set, so your pod consumed everything until the node kernel murdered it
- Your container is running multiple processes and you only accounted for one
The actual debugging process:
## Check current memory usage
kubectl top pod your-pod-name
kubectl top node
## See memory limits vs requests
kubectl describe pod your-pod | grep -A 3 -B 3 Limits
## Check what's actually using memory in the container
kubectl exec -it your-pod -- ps aux --sort=-%mem | head
## Current usage straight from the metrics API (requires metrics-server)
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods/your-pod"
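That raw call returns a PodMetrics JSON object; if you have jq handy, pull out just the per-container usage:
## Extract per-container CPU and memory usage from the metrics API response
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods/your-pod" | jq '.containers[] | {name, usage}'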
What actually worked: Doubled the memory limit from 256Mi to 512Mi, added memory monitoring, and discovered our logging library was buffering 300MB of logs in memory before writing to disk. The "temporary" fix became permanent because nobody had time to optimize the logger during Black Friday prep season.
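For the record, the "double it" change is a one-liner; a sketch assuming a deployment named your-app with a container named your-container:
## Bump the memory request/limit (the 512Mi mirrors the story above, size yours from metrics)
kubectl set resources deployment your-app -c=your-container --requests=memory=256Mi --limits=memory=512Mi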
3. ImagePullBackOff - The Registry Nightmare
What happens: Your pod sits in Pending state forever with an ImagePullBackOff error, usually at the worst possible time.
Why it always happens:
- Registry is down (or Docker Hub rate limits kicked in)
- Wrong image tag (someone deployed v2.1.4 but it doesn't exist)
- Authentication failed (registry credentials expired)
- Image is too big for the node (8GB image on a 10GB node)
- Private registry URL is wrong (happens during migrations)
Debug it properly:
## Check the actual error message
kubectl describe pod your-pod | grep -A 10 Events
## Test image pull manually from the node (crictl lives on the host, so chroot into /host from the debug pod)
kubectl debug node/your-node -it --image=busybox
chroot /host crictl pull your-registry/your-image:tag
## Check if it's an auth issue
kubectl get secret your-registry-secret -o yaml
kubectl get secret your-registry-secret -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | jq .
Production war story: During a midnight deployment, our image pull started failing with "unauthorized" errors. Turns out the CI/CD system was using an API token that expired after 90 days. The image built fine, pushed fine, but pulling failed because the cluster used different credentials. Spent 3 hours debugging container registries while our app was down. The fix was rotating a single API key that should have been automated 6 months earlier.
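If expired credentials are the culprit, rotating the pull secret is most of the fix. A minimal sketch, assuming the secret is named your-registry-secret; the server, username, and token below are placeholders:
## Recreate the pull secret with fresh credentials (registry details are placeholders)
kubectl create secret docker-registry your-registry-secret \
  --docker-server=your-registry.example.com \
  --docker-username=ci-bot \
  --docker-password="$NEW_REGISTRY_TOKEN" \
  --dry-run=client -o yaml | kubectl apply -f -
## Restart the deployment so new pods pull with the new credentials
kubectl rollout restart deployment your-app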
The Networking Black Hole
Services That Pretend to Work
The symptoms: kubectl get svc shows your service exists, but curl returns connection refused, and your ingress returns 503 errors.
What's actually broken:
## Check if service has endpoints
kubectl get endpoints your-service
## No endpoints? Your selector is wrong
kubectl get pods --show-labels
kubectl describe svc your-service | grep Selector
## Endpoints exist but connection fails? Check network policies
kubectl get networkpolicies
kubectl describe netpol your-policy
## Still broken? Test from inside the cluster
kubectl run debug-pod --image=busybox --rm -it -- sh
## Inside the pod:
nslookup your-service
telnet your-service 80
Reality check: Network policies are the reason your microservices can't talk to each other. Someone enabled "secure by default" policies that block everything, then quit without documenting which services need to communicate. You'll spend hours adding network policy exceptions one by one while your staging environment is completely broken.
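When you finally map out who needs to talk to whom, each exception looks roughly like this; a minimal sketch, assuming pods labeled app: frontend need to reach pods labeled app: backend on port 8080:
## Allow frontend pods to reach backend pods on port 8080 (labels and port are assumptions)
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
EOF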
Ingress Controllers Having Existential Crises
What you see: External users get 502 Bad Gateway, 503 Service Unavailable, or timeouts.
What's actually happening:
## Check ingress controller logs (it's probably nginx-ingress)
kubectl logs -n ingress-nginx deployment/ingress-nginx-controller
## Check if ingress has an IP
kubectl get ingress your-ingress
## Verify backend services are healthy
kubectl get pods -l app=your-app
kubectl describe endpoints your-service
## Test internal service connectivity
kubectl port-forward svc/your-service 8080:80
curl localhost:8080
War story: Our ingress worked fine for months, then started returning 502 errors during high traffic. The nginx-ingress logs showed upstream sent too big header while reading response header from upstream. Our API was returning 32KB JSON responses, but the nginx proxy buffer was set to 4KB. The solution was adding this annotation: nginx.ingress.kubernetes.io/proxy-buffer-size: 32k. Took down our API for 4 hours because nobody knew ingress controllers had buffer limits.
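If you hit the same wall, the annotation lives on the Ingress object; a sketch assuming the ingress-nginx controller and an ingress named your-ingress:
## Bump the proxy buffer size for one ingress (ingress-nginx only)
kubectl annotate ingress your-ingress nginx.ingress.kubernetes.io/proxy-buffer-size=32k --overwrite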
Resource Starvation - When Everything Fights for CPU
The death spiral: Pods get CPU throttled, become slow, health checks fail, pods restart, rinse and repeat.
Debug resource issues:
## Check current resource usage
kubectl top nodes
kubectl top pods --sort-by=cpu
kubectl top pods --sort-by=memory
## Find pods without resource limits
kubectl get pods -o json | jq -r '.items[] | select(any(.spec.containers[]; .resources.limits == null)) | .metadata.name'
## Check for CPU throttling from inside the container (cgroup v2 path; use /sys/fs/cgroup/cpu/cpu.stat on cgroup v1)
kubectl exec your-pod -- cat /sys/fs/cgroup/cpu.stat | grep throttled
The fix that actually worked: Set proper CPU limits based on actual usage, not guesses. Monitor for a week first. Yes, it takes time. No, you can't wing it. CPU limits of 100m work for static websites, not for your machine learning inference API that needs 2000m during peak load.
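Once you have a week of real numbers, applying them is the easy part; a sketch with placeholder values sized for the inference-API example above, not a recommendation for your workload:
## Set requests/limits from observed usage (values are placeholders, measure before copying)
kubectl set resources deployment your-ml-api -c=your-container \
  --requests=cpu=1000m,memory=2Gi --limits=cpu=2000m,memory=4Gi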
Your cluster is probably broken right now for one of these reasons. The next section covers the specific commands and debugging workflows that actually fix these problems instead of just identifying them.