Your Kubernetes cluster isn't failing because of some exotic edge case described in a 200-page white paper. It's failing because someone forgot to set memory limits, your ingress controller shit the bed, or AWS decided to randomly restart your nodes. Here's what actually breaks in the real world.
The Five Horsemen of Kubernetes Apocalypse
1. Pod Death Spiral (CrashLoopBackOff Hell)
What it looks like: Your pod keeps restarting every 30 seconds and kubectl describe shows CrashLoopBackOff status.
Why it happens:
- Your app crashes on startup, the container exits, and Kubernetes keeps restarting it
- Database connection strings are wrong (check your secrets)
- Missing environment variables that your app needs to start
- File permissions are fucked in your container (security context issues)
- Your health check is failing because the app takes 45 seconds to start but your readinessProbe times out after 10
The 5-minute fix:
## See what's actually happening
kubectl logs your-broken-pod --previous
kubectl describe pod your-broken-pod
## Check if it's a startup timing issue
kubectl get pod your-broken-pod -o yaml | grep -A 5 -B 5 probe
## Nuclear option: disable health checks temporarily
kubectl patch deployment your-app -p '{"spec":{"template":{"spec":{"containers":[{"name":"your-container","readinessProbe":null,"livenessProbe":null}]}}}}'
Real failure story: Our Spring Boot app took 90 seconds to start because it was downloading Maven dependencies on startup (brilliant architecture choice). The readiness probe was configured for 30 seconds. Pods kept getting killed just as they were about to be ready. Solution: increased initialDelaySeconds to 120 and fixed the Docker image to include dependencies. Took down prod for 2 hours because we "didn't want to change the health check configuration." Learn from our debugging mistakes.
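If you hit the same timing trap, bumping the delay is a one-line patch. A minimal sketch, assuming your container is named your-container and already has a readinessProbe defined:
## Give the app two minutes before the first readiness check
kubectl patch deployment your-app -p '{"spec":{"template":{"spec":{"containers":[{"name":"your-container","readinessProbe":{"initialDelaySeconds":120}}]}}}}'
A startupProbe is the cleaner long-term answer for slow-starting apps, but the delay bump stops the bleeding.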
2. OOMKilled - The Memory Massacre
What it looks like: kubectl get pods shows your pod restarted with exit code 137, and kubectl describe pod reveals the dreaded Reason: OOMKilled.
Why it happens:
- You allocated 128MB but your Java app needs 2GB (classic resource management failure)
- Memory leak in your application that accumulates over days
- No memory limits set, so your pod consumed everything until the node kernel murdered it
- Your container is running multiple processes and you only accounted for one
The actual debugging process:
## Check current memory usage
kubectl top pod your-pod-name
kubectl top node
## See memory limits vs requests
kubectl describe pod your-pod | grep -A 3 -B 3 Limits
## Check what's actually using memory in the container
kubectl exec -it your-pod -- ps aux --sort=-%mem | head
## Current usage straight from the metrics API (requires metrics-server)
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods/your-pod"
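That raw call returns a PodMetrics JSON object; if you have jq handy, pull out just the per-container usage:
## Extract per-container CPU and memory usage from the metrics API response
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods/your-pod" | jq '.containers[] | {name, usage}'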
What actually worked: Doubled the memory limit from 256Mi to 512Mi, added memory monitoring, and discovered our logging library was buffering 300MB of logs in memory before writing to disk. The "temporary" fix became permanent because nobody had time to optimize the logger during Black Friday prep season.
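For the record, the "double it" change is a one-liner; a sketch assuming a deployment named your-app with a container named your-container:
## Bump the memory request/limit (the 512Mi mirrors the story above, size yours from metrics)
kubectl set resources deployment your-app -c=your-container --requests=memory=256Mi --limits=memory=512Mi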
3. ImagePullBackOff - The Registry Nightmare
What happens: Your pod sits in Pending state forever with an ImagePullBackOff error, usually at the worst possible time.
Why it always happens:
- Registry is down (or Docker Hub rate limits kicked in)
- Wrong image tag (someone deployed v2.1.4 but it doesn't exist)
- Authentication failed (registry credentials expired)
- Image is too big for the node (8GB image on a 10GB node)
- Private registry URL is wrong (happens during migrations)
Debug it properly:
## Check the actual error message
kubectl describe pod your-pod | grep -A 10 Events
## Test image pull manually from the node (crictl lives on the host, so chroot into /host from the debug pod)
kubectl debug node/your-node -it --image=busybox
chroot /host crictl pull your-registry/your-image:tag
## Check if it's an auth issue
kubectl get secret your-registry-secret -o yaml
kubectl get secret your-registry-secret -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | jq .
Production war story: During a midnight deployment, our image pull started failing with "unauthorized" errors. Turns out the CI/CD system was using an API token that expired after 90 days. The image built fine, pushed fine, but pulling failed because the cluster used different credentials. Spent 3 hours debugging container registries while our app was down. The fix was rotating a single API key that should have been automated 6 months earlier.
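If expired credentials are the culprit, rotating the pull secret is most of the fix. A minimal sketch, assuming the secret is named your-registry-secret; the server, username, and token below are placeholders:
## Recreate the pull secret with fresh credentials (registry details are placeholders)
kubectl create secret docker-registry your-registry-secret \
  --docker-server=your-registry.example.com \
  --docker-username=ci-bot \
  --docker-password="$NEW_REGISTRY_TOKEN" \
  --dry-run=client -o yaml | kubectl apply -f -
## Restart the deployment so new pods pull with the new credentials
kubectl rollout restart deployment your-app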
The Networking Black Hole
Services That Pretend to Work
The symptoms: kubectl get svc shows your service exists, but curl returns connection refused, and your ingress returns 503 errors.
What's actually broken:
## Check if service has endpoints
kubectl get endpoints your-service
## No endpoints? Your selector is wrong
kubectl get pods --show-labels
kubectl describe svc your-service | grep Selector
## Endpoints exist but connection fails? Check network policies
kubectl get networkpolicies
kubectl describe netpol your-policy
## Still broken? Test from inside the cluster
kubectl run debug-pod --image=busybox --rm -it -- sh
## Inside the pod:
nslookup your-service
telnet your-service 80
Reality check: Network policies are the reason your microservices can't talk to each other. Someone enabled "secure by default" policies that block everything, then quit without documenting which services need to communicate. You'll spend hours adding network policy exceptions one by one while your staging environment is completely broken.
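When you finally map out who needs to talk to whom, each exception looks roughly like this; a minimal sketch, assuming pods labeled app: frontend need to reach pods labeled app: backend on port 8080:
## Allow frontend pods to reach backend pods on port 8080 (labels and port are assumptions)
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
EOF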
Ingress Controllers Having Existential Crises
What you see: External users get 502 Bad Gateway, 503 Service Unavailable, or timeouts.
What's actually happening:
## Check ingress controller logs (it's probably nginx-ingress)
kubectl logs -n ingress-nginx deployment/ingress-nginx-controller
## Check if ingress has an IP
kubectl get ingress your-ingress
## Verify backend services are healthy
kubectl get pods -l app=your-app
kubectl describe endpoints your-service
## Test internal service connectivity
kubectl port-forward svc/your-service 8080:80
curl localhost:8080
War story: Our ingress worked fine for months, then started returning 502 errors during high traffic. The nginx-ingress logs showed upstream sent too big header while reading response header from upstream. Our API was returning 32KB JSON responses, but the nginx proxy buffer was set to 4KB. The solution was adding this annotation: nginx.ingress.kubernetes.io/proxy-buffer-size: 32k. Took down our API for 4 hours because nobody knew ingress controllers had buffer limits.
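If you hit the same wall, the annotation lives on the Ingress object; a sketch assuming the ingress-nginx controller and an ingress named your-ingress:
## Bump the proxy buffer size for one ingress (ingress-nginx only)
kubectl annotate ingress your-ingress nginx.ingress.kubernetes.io/proxy-buffer-size=32k --overwrite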
Resource Starvation - When Everything Fights for CPU
The death spiral: Pods get CPU throttled, become slow, health checks fail, pods restart, rinse and repeat.
Debug resource issues:
## Check current resource usage
kubectl top nodes
kubectl top pods --sort-by=cpu
kubectl top pods --sort-by=memory
## Find pods without resource limits
kubectl get pods -o json | jq -r '.items[] | select(any(.spec.containers[]; .resources.limits == null)) | .metadata.name'
## Check for CPU throttling from inside the container (cgroup v2 path; use /sys/fs/cgroup/cpu/cpu.stat on cgroup v1)
kubectl exec your-pod -- cat /sys/fs/cgroup/cpu.stat | grep throttled
The fix that actually worked: Set proper CPU limits based on actual usage, not guesses. Monitor for a week first. Yes, it takes time. No, you can't wing it. CPU limits of 100m work for static websites, not for your machine learning inference API that needs 2000m during peak load.
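Once you have a week of real numbers, applying them is the easy part; a sketch with placeholder values sized for the inference-API example above, not a recommendation for your workload:
## Set requests/limits from observed usage (values are placeholders, measure before copying)
kubectl set resources deployment your-ml-api -c=your-container \
  --requests=cpu=1000m,memory=2Gi --limits=cpu=2000m,memory=4Gi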
Your cluster is probably broken right now for one of these reasons. The next section covers the specific commands and debugging workflows that actually fix these problems instead of just identifying them.