Your pod is still dying and you've tried everything obvious. Memory limits? Set. Environment variables? Checked twice. Health checks? Perfect. Yet here you are at 2am watching "CrashLoopBackOff" and "Back-off 5m0s restarting failed container" like some twisted Groundhog Day.
Here's what actually happens: your code's fine, your manifest's fine, but some invisible cluster bullshit is murdering your pods. Not the obvious stuff like memory limits - the weird node constraints, storage backend timeouts, or runtime security policies that don't show up anywhere.
Been debugging this shit for years and it's always the same story: works on laptop, passes CI, manifest looks good, pod dies every 30 seconds anyway.
Node Scheduling Conflicts That Kill Pods Silently
Taints and tolerations - Kubernetes's passive-aggressive way of fucking with you. The nodes your pod actually needs carry taints you didn't know about, your pod has no matching toleration, so the scheduler dumps it on whatever node will take it - then it starts fine and crashes the moment it tries to do real work on hardware it doesn't have.
## Check node taints that might affect scheduling
kubectl describe nodes | grep -A 5 -B 5 "Taints"
## Examine specific node taints and their effects
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, taints: .spec.taints}'
## Check if your pods have appropriate tolerations
kubectl describe pod <pod-name> | grep -A 10 "Tolerations"
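If the tainted pool is where the pod actually belongs, the fix is usually a toleration on the workload. A minimal sketch, assuming a made-up dedicated=gpu:NoSchedule taint - swap in whatever key, value, and effect the describe output actually showed:
## Sketch: tolerate a hypothetical "dedicated=gpu:NoSchedule" taint
## (note: this merge patch replaces any existing tolerations list)
kubectl patch deployment <deployment-name> --type merge -p '{"spec":{"template":{"spec":{"tolerations":[{"key":"dedicated","operator":"Equal","value":"gpu","effect":"NoSchedule"}]}}}}'
## Or, if the taint turns out to be stale garbage, drop it (trailing "-" deletes)
kubectl taint nodes <node-name> dedicated=gpu:NoSchedule-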
Node affinity rules can create scenarios where pods start but fail during runtime due to resource constraints or environment mismatches. A pod scheduled on an inappropriate node might lack GPU access, specific storage types, or network configurations required for operation.
## Check node labels and affinity rules
kubectl get nodes --show-labels
## Examine pod affinity constraints
kubectl describe pod <pod-name> | grep -A 15 "Node-Selectors\|Affinity"
## Verify resource availability on assigned nodes
kubectl describe node <node-name> | grep -A 10 "Allocated resources"
Spent way too long on one where ML training kept dying with "CUDA error: no device found". Made no sense - we had GPU nodes, memory was fine, everything looked right.
Turned out our YAML was missing some GPU selector bullshit so Kubernetes kept scheduling GPU pods on regular nodes. I don't even remember the exact fix now, just that it was something stupid with node selectors.
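For what it's worth, the usual shape of that kind of fix is below - the node-type label is made up and nvidia.com/gpu is the standard NVIDIA device-plugin resource name, so adjust both for whatever your GPU nodes actually expose:
## Sketch: pin the workload to GPU nodes and actually request a GPU
cat <<'EOF' > gpu-patch.yaml
spec:
  template:
    spec:
      nodeSelector:
        node-type: gpu
      containers:
      - name: <container-name>
        resources:
          limits:
            nvidia.com/gpu: 1
EOF
kubectl patch deployment <deployment-name> --patch-file gpu-patch.yaml
## Confirm the pod landed on a node that actually advertises GPUs
kubectl get pod <pod-name> -o wide
kubectl describe node <node-name> | grep -i "nvidia.com/gpu"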
Container Runtime and Node-Level Storage Issues
Container runtime problems are the worst kind of debugging hell because the error messages are useless and Google searches return nothing helpful. You'll spend hours checking containerd configuration, CRI-O settings, or Docker daemon issues while your pod dies with cryptic exit codes.
Security contexts conflict with what the runtime will actually allow, or filesystem permissions are broken in a way that only surfaces when your app tries to actually write something.
## Check container runtime logs on the node
kubectl get pod <pod-name> -o wide # Get node name
kubectl describe node <node-name> | grep "Container Runtime"
## Examine kubelet logs on the problematic node (requires node access)
sudo journalctl -u kubelet -f --since "1 hour ago"
## Check for runtime-specific errors
sudo crictl ps -a | grep <container-name>
sudo crictl logs <container-id>
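Before blaming the runtime itself, dump what Kubernetes actually asked it to enforce - these field paths are standard, though your pod may not set all of them:
## Pod- and container-level security contexts the runtime has to honor
kubectl get pod <pod-name> -o jsonpath='{.spec.securityContext}{"\n"}'
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].securityContext}{"\n"}'
## The last exit code often says more than the logs (126/127 usually means
## permissions or a missing binary, not your code)
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}{"\n"}'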
Persistent volume mounting failures happen when your pod looks fine but crashes the second it tries to read or write files. The volume mounts look perfect in kubectl but the storage backend is having a meltdown.
## Check PV and PVC status for mounting issues
kubectl get pv,pvc
## Examine volume mount details in the pod
kubectl describe pod <pod-name> | grep -A 20 "Volumes\|Mounts"
## Check for volume-related events
kubectl get events --field-selector involvedObject.name=<pod-name> | grep -i volume
Storage backend issues cause the most infuriating failures - your pod runs fine for a few minutes then dies when the network storage decides to time out or the cloud storage provisioner shits the bed.
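When the backend is the suspect, trace the PVC back to its StorageClass and provisioner - the PVC's own events usually name the real provisioning or attach failure:
## Find which StorageClass and provisioner are behind the volume
kubectl get pvc <pvc-name> -o jsonpath='{.spec.storageClassName}{"\n"}'
kubectl describe storageclass <storageclass-name>
## PVC events usually spell out the actual provisioning/attach error
kubectl describe pvc <pvc-name> | grep -A 10 "Events"
## On CSI drivers, attach/detach problems show up as VolumeAttachment objects
kubectl get volumeattachments | grep <pv-name>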
Network Policy and Service Mesh Configuration Conflicts
Network policies are silent killers. Your pod starts perfectly, passes health checks, then crashes the moment it tries to call another service because some network policy is quietly blocking the connection.
I've seen this kill entire microservice deployments where developers spend days debugging "connection refused" errors, not realizing that Calico or Cilium policies are dropping their traffic.
## Check network policies that might affect your pod
kubectl get networkpolicies -A
## Examine specific policy rules
kubectl describe networkpolicy <policy-name> -n <namespace>
## Test network connectivity from the failing pod
kubectl exec <pod-name> -- nc -zv <target-service> <port>
kubectl exec <pod-name> -- nslookup <service-name>
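If you suspect a policy but can't prove it, one blunt test (in a namespace where you're allowed to do this) is a temporary allow-all policy - if the crashes stop, you've found your killer. The policy name here is made up; delete it when you're done:
## Temporary allow-all policy for debugging only
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: debug-allow-all
  namespace: <namespace>
spec:
  podSelector: {}
  policyTypes: ["Ingress", "Egress"]
  ingress:
  - {}
  egress:
  - {}
EOF
## Clean up once you've confirmed (or cleared) the policies
kubectl delete networkpolicy debug-allow-all -n <namespace>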
Service mesh bullshit (Istio, Linkerd, Consul Connect) loves to crash your app when the sidecar proxy decides to intercept traffic in stupid ways. Connection timeouts, TLS cert failures, routing fuckups that break your startup sequence.
## Check for service mesh sidecar injection
kubectl describe pod <pod-name> | grep -A 5 -B 5 "istio\|linkerd\|consul"
## Examine sidecar proxy logs
kubectl logs <pod-name> -c istio-proxy
kubectl logs <pod-name> -c linkerd-proxy
## Verify service mesh configuration
istioctl proxy-config cluster <pod-name>
linkerd stat pod <pod-name>
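If Istio or Linkerd specifically is in play, both ship built-in checks that catch a lot of this - assuming the CLI you have installed matches your mesh version:
## Istio: flag misconfigurations in the namespace, then inspect what the sidecar sees
istioctl analyze -n <namespace>
istioctl proxy-config listeners <pod-name> -n <namespace>
## Linkerd: verify the data-plane proxies in the namespace are healthy
linkerd check --proxy -n <namespace>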
Resource Quota and Limit Range Enforcement Issues
Resource quotas and the defaults that ride along with them are silent killers. The quota itself rejects pods at admission - the "exceeded quota" error lands in the controller's events, not your pod's - while the default memory limit a LimitRange injects will happily let the pod run for a few minutes before it gets OOMKilled. The error messages won't tell you any of this - you have to dig through cluster events like a detective.
## Check resource quotas affecting your namespace
kubectl describe resourcequota -n <namespace>
## Examine namespace-level resource usage
kubectl describe namespace <namespace>
## Check for resource-related events
kubectl get events -A | grep -i "quota\|limit"
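Two more places worth checking: the controller that keeps failing to create replacement pods, and the quota's own hard-versus-used numbers:
## The "exceeded quota" rejection lands on the ReplicaSet, not the pod
kubectl describe replicaset -n <namespace> | grep -B 5 -i "exceeded quota"
## Compare hard limits against current usage in one shot
kubectl get resourcequota -n <namespace> -o custom-columns=NAME:.metadata.name,HARD:.status.hard,USED:.status.used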
Limit ranges are sneaky fuckers that inject default requests and limits into any pod that doesn't set its own. Your app crashes when it hits these secret constraints because they never show up in the manifest you wrote - only in the live pod spec.
## Check limit ranges in your namespace
kubectl describe limitrange -n <namespace>
## Compare pod resource requests with limit ranges
kubectl describe pod <pod-name> | grep -A 10 "Limits\|Requests"
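The giveaway is comparing the manifest you applied against what's actually running - LimitRange defaults only show up in the latter:
## What the API server actually stored for each container's resources
kubectl get pod <pod-name> -o jsonpath='{range .spec.containers[*]}{.name}{": "}{.resources}{"\n"}{end}'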
Advanced Node Health and Kernel-Level Issues
Node-level resource exhaustion gets weird fast. Your pod dies but kubectl top node shows plenty of CPU and memory available. Turns out the node ran out of inodes, or hit some obscure kernel limit on network connections, or the disk I/O subsystem is having a breakdown.
## Check detailed node resource usage
kubectl top node <node-name>
## Examine node conditions for health issues
kubectl describe node <node-name> | grep -A 10 \"Conditions\"
## Check for node pressure conditions
kubectl get nodes -o wide
kubectl describe node <node-name> | grep -i "pressure\|full"
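The checks that actually catch the inode and disk weirdness have to run on the node itself (SSH or a node debug session required; these are plain Linux commands, nothing Kubernetes-specific):
## Inode exhaustion looks like "disk full" even with free space everywhere
df -i
df -h
## The kernel ring buffer is where OOM kills and I/O errors actually get logged
sudo dmesg -T | grep -i -E "oom|i/o error|blocked" | tail -20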
Kernel parameter bullshit kills apps that try to open too many connections, file handles, or spawn too many processes. Your code is fine, the kernel just decided to be a dick about resource limits.
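To see whether it really is the kernel saying no, check the usual ceilings - again on the node, and the exact values vary by distro:
## File handles: currently allocated vs system-wide max
cat /proc/sys/fs/file-nr
## Process/thread ID ceiling and the per-process limits workloads inherit
cat /proc/sys/kernel/pid_max
ulimit -a
## Conntrack exhaustion silently drops new connections (needs nf_conntrack loaded)
cat /proc/sys/net/netfilter/nf_conntrack_count /proc/sys/net/netfilter/nf_conntrack_max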
When your pod keeps dying despite perfect config, it's always invisible infrastructure bullshit. Node taints, storage backends timing out, network policies written by someone who quit three years ago.
Accept that your app is probably fine and the cluster is lying. Check node health, storage status, runtime logs - don't waste more time staring at your code.
But when cluster-level debugging still doesn't show shit? When kubectl says everything's healthy but your pod dies every few minutes anyway? Time for the nuclear options.