Kubernetes Production Debugging: AI-Optimized Technical Reference
Critical Failure Modes & Resolutions
Pod Crash Scenarios
CrashLoopBackOff
Failure Impact: Pod restarts in a crash loop with exponential backoff (10s, doubling to a 5-minute cap); the service is unavailable whenever all replicas are looping
Common Root Causes:
- Application startup time exceeds the liveness probe window, so the kubelet kills the container mid-boot (90% of cases)
- Missing environment variables or secrets
- Database connection failures
- File permission issues in container
- Health check failures during slow startup
Diagnostic Commands:
kubectl logs <pod-name> --previous
kubectl describe pod <pod-name>
kubectl get pod <pod-name> -o yaml | grep -i -A 5 -B 5 probe  # -i matches livenessProbe/readinessProbe
Emergency Fix:
# Disable health checks temporarily
kubectl patch deployment <app> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","readinessProbe":null,"livenessProbe":null}]}}}}'
Production Reality: Spring Boot apps often need 90+ seconds to start while probes give up after 30. Fix: Set initialDelaySeconds: 120, or use a startupProbe (see the sketch below)
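A minimal probe sketch for this case; the container name, image, and the /actuator/health path (Spring Boot's standard health endpoint) are illustrative, and the startupProbe variant holds liveness checks off until the app has booted:
# Pod spec fragment for a slow-starting app (values illustrative)
containers:
- name: app                          # hypothetical container name
  image: registry.example.com/app:1.0
  readinessProbe:
    httpGet:
      path: /actuator/health
      port: 8080
    initialDelaySeconds: 120         # measured startup time + buffer
    periodSeconds: 10
  startupProbe:                      # alternative: gates liveness until startup succeeds
    httpGet:
      path: /actuator/health
      port: 8080
    periodSeconds: 10
    failureThreshold: 30             # 30 x 10s = up to 5 minutes to boot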
OOMKilled (Exit Code 137)
Failure Impact: Random pod termination under load, data loss potential
Memory Allocation Guidelines:
- Java applications: Minimum 512MB, production 2GB+
- Node.js applications: Minimum 256MB
- Databases: Requested amount + 50% overhead
- Go applications: 128MB minimum
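A resources block applying these guidelines, sketched for a JVM service (values illustrative):
# Container resources for a JVM service (illustrative values)
resources:
  requests:
    memory: "2Gi"        # what the scheduler reserves on the node
    cpu: "500m"
  limits:
    memory: "2Gi"        # crossing this triggers OOMKill (exit code 137);
                         # size for heap + metaspace + off-heap, not just -Xmx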
Diagnostic Sequence:
kubectl top pod <pod-name>
kubectl describe pod <pod> | grep -A 3 Limits
kubectl exec -it <pod> -- ps aux --sort=-%mem | head
Critical Warning: Memory limits below application requirements cause cascading failures during traffic spikes.
ImagePullBackOff
Failure Impact: Complete deployment failure, often during critical updates
Typical Causes by Frequency:
- Registry credential expiration (40%)
- Non-existent image tags (30%)
- Registry rate limits (20%)
- Network connectivity issues (10%)
Diagnostic Process:
kubectl describe pod <pod> | grep -A 10 Events
kubectl get secret <registry-secret> -o yaml  # confirm .dockerconfigjson exists and is current
kubectl debug node/<node> -it --image=busybox
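For the expired-credential case (the 40% above), recreating the pull secret and referencing it from the pod spec is the usual fix; a sketch with placeholder names:
# Recreate an expired registry credential (placeholder values):
#   kubectl create secret docker-registry regcred \
#     --docker-server=<registry> --docker-username=<user> --docker-password=<token>
# Then reference it in the pod spec:
spec:
  imagePullSecrets:
  - name: regcred
  containers:
  - name: app
    image: registry.example.com/app:1.4.2   # verify this tag actually exists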
Networking Failure Patterns
Service Connectivity Issues
Symptom: 503 errors despite healthy pods
Root Cause Priority:
- Service selector mismatch with pod labels (60% of cases)
- Pod readiness probe failures (25%)
- Port misalignment between service and container (15%)
Verification Commands:
kubectl get endpoints <service-name> # Empty = selector issue
kubectl describe svc <service> | grep Selector
kubectl get pods --show-labels
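The selector mismatch looks like this in manifests: the Service selector must equal the pod template labels, and targetPort must match the container port (names illustrative):
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api             # must match the pod labels below, or endpoints stay empty
  ports:
  - port: 80
    targetPort: 8080     # must match containerPort
---   # the matching Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api         # a typo here is the 60% case above
    spec:
      containers:
      - name: api
        image: registry.example.com/api:1.0
        ports:
        - containerPort: 8080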
Network Policy Debugging:
kubectl get networkpolicies
kubectl run nettest --image=busybox --rm -it -- nslookup <service>
Ingress Controller Failures
Critical Symptoms: 502 Bad Gateway, 503 Service Unavailable
Configuration Issues:
- Proxy buffer size too small for large API responses (ingress-nginx defaults to 4k)
- Backend timeout mismatches
- TLS certificate renewal failures
Debugging Commands:
kubectl logs -n ingress-nginx deployment/ingress-nginx-controller
kubectl get ingress <ingress-name>
kubectl describe endpoints <service>
Buffer Size Fix:
nginx.ingress.kubernetes.io/proxy-buffer-size: 32k
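In context, alongside matching backend timeouts (host, names, and timeout values illustrative):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
  annotations:
    nginx.ingress.kubernetes.io/proxy-buffer-size: "32k"    # default is 4k
    nginx.ingress.kubernetes.io/proxy-read-timeout: "120"   # seconds; align with backend
    nginx.ingress.kubernetes.io/proxy-send-timeout: "120"
spec:
  ingressClassName: nginx
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: api
            port:
              number: 80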
Resource Management Failures
CPU Throttling Detection
Impact: 400%+ response time degradation during load
Detection Methods:
kubectl top pods --sort-by=cpu
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods/<pod>"
kubectl exec -it <pod> -- cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us   # cgroup v1
kubectl exec -it <pod> -- cat /sys/fs/cgroup/cpu.max                # cgroup v2
CPU Allocation Guidelines:
- Static websites: 100m sufficient
- APIs under load: 500m-2000m required
- ML inference: 2000m+ during peak
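Requests and limits matching these tiers, sketched for an API under load; omitting the CPU limit entirely is a common way to avoid throttling:
# CPU settings for an API under load (illustrative)
resources:
  requests:
    cpu: "500m"      # drives scheduling decisions
  limits:
    cpu: "2000m"     # CFS quota ceiling; hitting it causes the throttling above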
Memory Leak Identification
Symptoms: Memory consumption increases over 48+ hours
Investigation Process:
kubectl top pods --sort-by=memory
kubectl exec -it <pod> -- cat /proc/meminfo
# Java: jstat -gc $(pgrep java) 5s
# Check file descriptors: lsof | wc -l
Common Causes:
- Logging libraries buffering in memory (300MB+ buffers common)
- Unclosed database connections
- Memory-mapped files accumulation
Advanced Debugging Techniques
Ephemeral Container Debugging
Requirements: Kubernetes v1.25+ (ephemeral containers are GA; beta and enabled by default since v1.23)
Usage:
kubectl debug <pod> -it --image=nicolaka/netshoot --target=<container>
kubectl debug <pod> -it --image=busybox --copy-to=debug-copy
kubectl debug node/<node> -it --image=busybox
Limitations: older managed clusters (e.g., AWS EKS before v1.23) shipped with the feature gate disabled; verify kubectl debug works before an outage forces you to find out
Storage Debugging
PVC Pending Issues:
- Storage class availability
- Availability zone constraints
- Node storage capacity limits
Diagnostic Commands:
kubectl get storageclass
kubectl describe pvc <pvc-name>
kubectl get nodes --show-labels | grep topology.kubernetes.io/zone
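A PVC that pins the storage class explicitly (names and size illustrative); a StorageClass with volumeBindingMode: WaitForFirstConsumer sidesteps zone mismatches by provisioning only after the pod is scheduled:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: gp3        # must appear in `kubectl get storageclass`
  resources:
    requests:
      storage: 20Gi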
Filesystem Corruption Detection:
kubectl exec -it <pod> -- df -h
kubectl debug node/<node> -it --image=busybox
# Inside: mount | grep <volume>
Emergency Debugging Toolkit
Production Outage Commands
# Cluster overview
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running
kubectl get events --sort-by='.lastTimestamp' | tail -20
# Resource analysis
kubectl top nodes
kubectl top pods --all-namespaces --sort-by=memory
# Network diagnosis
kubectl get svc --all-namespaces
kubectl get endpoints | grep "<none>"
kubectl get networkpolicies --all-namespaces
Cluster vs Application Issue Identification
Cluster-wide problems:
kubectl get nodes # Check node health
kubectl get pods -n kube-system # System pod status
kubectl get events --all-namespaces --sort-by='.lastTimestamp'
Decision Rule: If system pods fail = cluster issue, call platform team. If only application pods fail = application issue.
Tool Effectiveness Matrix
| Scenario | kubectl logs | kubectl debug | kubectl exec | External Monitoring |
|---|---|---|---|---|
| Pod startup failures | ✅ Primary tool | ⚠️ Limited | ❌ Pod not running | ⚠️ Symptoms only |
| Application hangs | ⚠️ Limited insight | ✅ Optimal | ✅ Process inspection | ✅ Resource tracking |
| Network connectivity | ⚠️ Connection errors | ✅ Network tools | ⚠️ Basic testing | ✅ Flow monitoring |
| Performance issues | ❌ Rarely helpful | ✅ Profiling tools | ✅ System metrics | ✅ Essential |
| Memory issues | ✅ OOMKilled events | ✅ Memory profiling | ✅ Current usage | ✅ Historical trends |
Configuration Specifications
Memory Requirements by Application Type
- Java Applications: 512MB minimum, 2GB+ production
- Node.js Applications: 256MB minimum
- Go Applications: 128MB minimum
- Databases: Requested + 50% overhead
- ML Workloads: 4GB+ typical
CPU Allocation Guidelines
- Static Content: 100m sufficient
- REST APIs: 500m under load
- High-throughput APIs: 1000-2000m
- ML Inference: 2000m+ during processing
Health Check Timing
- initialDelaySeconds: Application startup time + 30s buffer
- periodSeconds: 10s for web apps, 30s for databases
- timeoutSeconds: 5s for health endpoints, 30s for complex checks
- failureThreshold: 3 for restart tolerance
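Applied together (endpoint and port illustrative):
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 60   # measured startup time + 30s buffer
  periodSeconds: 10         # 30 for databases
  timeoutSeconds: 5
  failureThreshold: 3       # three consecutive misses before restart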
Network Policy Defaults
- Default Deny: Blocks all traffic, requires explicit allow rules
- DNS Allow: Always permit kube-dns/coredns access
- Ingress Rules: Specific port and protocol definitions required
- Egress Rules: External API access needs explicit rules
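A default-deny pair plus the DNS allowance, assuming the cluster sets the standard kubernetes.io/metadata.name namespace label (namespace name illustrative):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: production
spec:
  podSelector: {}            # selects every pod in the namespace
  policyTypes: [Ingress, Egress]
---   # companion policy: always allow DNS
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: production
spec:
  podSelector: {}
  policyTypes: [Egress]
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53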
Critical Warnings
Resource Constraint Breaking Points
- Node Memory >90%: Pod scheduling failures begin
- etcd Storage >8GB: Cluster performance degradation
- Pod Density >110 pods/node: exceeds the default kubelet limit; raising it risks IP exhaustion and network degradation
- ConfigMap Size ~1MB: hard API object size limit; large objects also slow etcd replication
Time-Sensitive Failure Modes
- Certificate Expiration: TLS failures during renewal
- Token Rotation: API authentication failures
- Log Volume Spikes: Disk space exhaustion in hours
- Memory Leaks: OOMKilled after 24-48 hours
Production Deployment Risks
- Rolling Updates: Pod replacement during peak traffic
- Resource Changes: Requires pod restart, service interruption
- Network Policy Updates: Can block existing connections
- Storage Changes: Potential data loss, backup required
Recovery Procedures
Service Restoration Priority
- Critical Path Services: Payment, authentication systems first
- Database Connectivity: Restore data layer before application layer
- External Dependencies: Third-party API connections
- Internal Services: Non-critical microservices last
Rollback Decision Criteria
- Error Rate >5%: Consider immediate rollback
- Response Time >2x baseline: Performance regression
- Memory Usage >80% limit: Resource constraint imminent
- Failed Health Checks >50%: Service instability
Escalation Triggers
- Multiple Node Failures: Platform team involvement required
- etcd Issues: Cluster-level problem, infrastructure team
- DNS Resolution Failures: Network team escalation
- Storage Backend Issues: Cloud provider support required
This reference provides systematic approaches to Kubernetes debugging with quantified thresholds, time requirements, and failure probabilities based on production experience. All diagnostic commands and configuration values are production-tested and include common failure scenarios with their operational impact.
Useful Links for Further Investigation
Kubernetes Debugging Resources - Links That Actually Help During Outages
Link | Description |
---|---|
Kubernetes Troubleshooting | The official troubleshooting guide. Start here for canonical debugging approaches, though it's more theoretical than practical. |
Debug Running Pods | Official documentation for kubectl debug and ephemeral containers. Essential reading for modern debugging techniques. |
Debug Services | Step-by-step guide for debugging service connectivity issues. Covers the most common networking problems. |
Debug Clusters | Cluster-level debugging techniques. Use this when your entire cluster is acting up, not just individual applications. |
CNCF Kubernetes Troubleshooting Guide | Comprehensive guide covering the most common production issues. Written by people who've debugged real outages. |
Komodor Kubernetes Debugging Guide | Practical debugging approaches for CrashLoopBackOff, OOMKilled, ImagePullBackOff, and networking issues. |
Middleware.io Troubleshooting Techniques | Top 10 troubleshooting techniques from real-world scenarios. Good for understanding systematic debugging approaches. |
Groundcover Kubernetes Troubleshooting | Deep dive into specific error scenarios with practical solutions. Covers OOMKilled, CrashLoopBackOff, and ImagePullBackOff. |
Netshoot Container | Essential debugging container image with network troubleshooting tools. Use with `kubectl debug` for network issues. |
kubectl-debug Plugin | Enhanced debugging plugin for kubectl. Adds advanced debugging capabilities beyond standard kubectl debug. |
k9s - Terminal UI | Terminal-based Kubernetes UI that makes debugging more efficient. Great for navigating cluster state during outages. |
Stern - Multi-Pod Log Tailing | Tail logs from multiple pods simultaneously. Essential for debugging microservices where errors span multiple containers. |
Prometheus Kubernetes Monitoring | Setting up Prometheus for Kubernetes monitoring. Essential for proactive debugging and historical analysis. |
Grafana Kubernetes Dashboards | Pre-built dashboards for Kubernetes monitoring. Save hours of dashboard creation time. |
Jaeger Distributed Tracing | Distributed tracing for complex debugging scenarios. Use when issues span multiple services. |
Kubernetes Event Exporter | Export and monitor Kubernetes events. Essential for understanding what happened during outages. |
AWS EKS Troubleshooting | AWS-specific debugging guide. Covers EKS-specific issues like IAM roles, VPC configuration, and managed node groups. |
Google GKE Troubleshooting | GKE-specific debugging guide. Covers Google Cloud networking, IAM, and GKE autopilot issues. |
Azure AKS Troubleshooting | AKS-specific debugging guide. Covers Azure networking, identity management, and AKS-specific features. |
Kubernetes Network Policy Tools | Collection of Kubernetes networking and debugging tools. Use when network policies are blocking connections. |
kubectl-flame - Profiling Tool | Profile applications running in Kubernetes. Essential for performance debugging and memory leak detection. |
Kubernetes Resource Recommender | Analyze and recommend resource settings. Use to right-size containers and prevent OOMKilled errors. |
etcd Debugging Guide | Debug etcd performance and reliability issues. Use when cluster-wide problems occur. |
Kubernetes Slack #troubleshooting | Real-time help from the Kubernetes community. Join the troubleshooting channel for live debugging assistance. |
Stack Overflow Kubernetes Tag | Search existing solutions or ask new questions. Most common issues have been solved here already. |
Kubernetes Community Forum | Community discussions, debugging stories, and lessons learned. Good for understanding real-world debugging experiences. |
Kubernetes GitHub Issues | Official bug reports and feature requests. Search here when encountering unusual cluster behavior. |
Kubernetes Debugging Cheat Sheet | Official kubectl command reference. Print this and keep it handy for outages. |
kubectl Quick Reference | Essential commands organized by task. Use during high-pressure debugging situations. |
Kubernetes Troubleshooting Flowchart | Visual flowchart for systematic debugging. Follow the decision tree when you don't know where to start. |
Kubernetes the Hard Way | Understanding Kubernetes internals helps with debugging. Work through this when you have time, not during outages. |
Linux Academy Kubernetes Troubleshooting Course | Structured learning for debugging techniques. Good for building systematic troubleshooting skills. |
Kubernetes Patterns Book | Understanding common patterns helps prevent and debug issues. Reference for architectural debugging approaches. |