
Kubernetes Production Debugging: AI-Optimized Technical Reference

Critical Failure Modes & Resolutions

Pod Crash Scenarios

CrashLoopBackOff

Failure Impact: Pod restarts repeatedly with exponential backoff (10s, doubling to a 5-minute cap), service completely unavailable
Common Root Causes:

  • Application startup time exceeds the liveness probe window, so the kubelet kills the container mid-startup (90% of cases)
  • Missing environment variables or secrets
  • Database connection failures
  • File permission issues in container
  • Health check failures during slow startup

Diagnostic Commands:

kubectl logs <pod-name> --previous
kubectl describe pod <pod-name>
kubectl get pod <pod-name> -o yaml | grep -A 5 -B 5 probe

Emergency Fix:

# Disable health checks temporarily
kubectl patch deployment <app> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","readinessProbe":null,"livenessProbe":null}]}}}}'

Production Reality: Spring Boot apps requiring 90+ seconds startup with 30-second probe timeouts. Fix: Set initialDelaySeconds: 120.
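For slow-starting applications, a startupProbe is the cleaner long-term fix than a large initialDelaySeconds: it holds off liveness checks until the app first reports healthy. A minimal sketch, assuming a hypothetical `app` container exposing `/healthz` on port 8080 (adjust name, path, and port to your application):

```yaml
containers:
  - name: app                  # placeholder container name
    startupProbe:              # suppresses liveness/readiness until first success
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 18     # 18 x 10s = up to 180s allowed for startup
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10        # normal cadence once startup has completed
```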

OOMKilled (Exit Code 137)

Failure Impact: Random pod termination under load, data loss potential
Memory Allocation Guidelines:

  • Java applications: Minimum 512MB, production 2GB+
  • Node.js applications: Minimum 256MB
  • Databases: Requested amount + 50% overhead
  • Go applications: 128MB minimum

Diagnostic Sequence:

kubectl top pod <pod-name>
kubectl describe pod <pod> | grep -A 3 Limits
kubectl exec -it <pod> -- ps aux --sort=-%mem | head

Critical Warning: Memory limits below application requirements cause cascading failures during traffic spikes.
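Applied to a pod spec, the guidelines above look like the following sketch for a JVM workload; the values are illustrative, not prescriptive:

```yaml
resources:
  requests:
    memory: "2Gi"      # Java production guideline above
    cpu: "500m"
  limits:
    memory: "2Gi"      # limit == request avoids surprise OOMKills when node memory tightens
    cpu: "2000m"
```

Setting the memory limit equal to the request means the scheduler reserves exactly what the container is allowed to use, which prevents the cascading-failure pattern described above.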

ImagePullBackOff

Failure Impact: Complete deployment failure, often during critical updates
Typical Causes by Frequency:

  1. Registry credential expiration (40%)
  2. Non-existent image tags (30%)
  3. Registry rate limits (20%)
  4. Network connectivity issues (10%)

Diagnostic Process:

kubectl describe pod <pod> | grep -A 10 Events
kubectl get secret <registry-secret> -o yaml
kubectl debug node/<node> -it --image=busybox
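For the most common cause (expired registry credentials), the usual repair is to recreate the pull secret and reference it from the pod spec. All names below are placeholders:

```yaml
# Secret must first be recreated, e.g.:
#   kubectl create secret docker-registry regcred \
#     --docker-server=<registry> --docker-username=<user> --docker-password=<token>
spec:
  imagePullSecrets:
    - name: regcred                                  # placeholder secret name
  containers:
    - name: app
      image: registry.example.com/team/app:v1.2.3    # verify this tag actually exists
```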

Networking Failure Patterns

Service Connectivity Issues

Symptom: 503 errors despite healthy pods
Root Cause Priority:

  1. Service selector mismatch with pod labels (60% of cases)
  2. Pod readiness probe failures (25%)
  3. Port misalignment between service and container (15%)

Verification Commands:

kubectl get endpoints <service-name>  # Empty = selector issue
kubectl describe svc <service> | grep Selector
kubectl get pods --show-labels
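The selector-mismatch and port-misalignment cases both come down to three fields lining up. A sketch with placeholder names showing what must agree:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service       # placeholder
spec:
  selector:
    app: my-app          # must equal the Deployment's spec.template.metadata.labels.app
  ports:
    - port: 80           # port clients connect to
      targetPort: 8080   # must match the container's containerPort
```

If `kubectl get endpoints my-service` is empty, the selector does not match any pod labels; if endpoints exist but requests fail, check targetPort against the containerPort.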

Network Policy Debugging:

kubectl get networkpolicies
kubectl run nettest --image=busybox --rm -it -- nslookup <service>

Ingress Controller Failures

Critical Symptoms: 502 Bad Gateway, 503 Service Unavailable
Configuration Issues:

  • Proxy buffer size too small (the ingress-nginx default of 4KB is insufficient for large API responses and headers)
  • Backend timeout mismatches
  • TLS certificate renewal failures

Debugging Commands:

kubectl logs -n ingress-nginx deployment/ingress-nginx-controller
kubectl get ingress <ingress-name>
kubectl describe endpoints <service>

Buffer Size Fix:

nginx.ingress.kubernetes.io/proxy-buffer-size: 32k
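The timeout-mismatch bullet above is fixed the same way, via ingress-nginx annotations on the Ingress object. A sketch; tune the values to your payload sizes and backend latency:

```yaml
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-buffer-size: "32k"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "120"   # seconds; should exceed backend response time
    nginx.ingress.kubernetes.io/proxy-send-timeout: "120"
```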

Resource Management Failures

CPU Throttling Detection

Impact: 400%+ response time degradation during load
Detection Methods:

kubectl top pods --sort-by=cpu
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods/<pod>"
kubectl exec -it <pod> -- cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us  # cgroup v1
kubectl exec -it <pod> -- cat /sys/fs/cgroup/cpu.max               # cgroup v2

CPU Allocation Guidelines:

  • Static websites: 100m sufficient
  • APIs under load: 500m-2000m required
  • ML inference: 2000m+ during peak

Memory Leak Identification

Symptoms: Memory consumption increases over 48+ hours
Investigation Process:

kubectl top pods --sort-by=memory
kubectl exec -it <pod> -- cat /proc/meminfo
# Java: jstat -gc $(pgrep java) 5s
# Check file descriptors: lsof | wc -l

Common Causes:

  • Logging libraries buffering in memory (300MB+ buffers common)
  • Unclosed database connections
  • Memory-mapped files accumulation

Advanced Debugging Techniques

Ephemeral Container Debugging

Requirements: Kubernetes v1.23+ (ephemeral containers are beta and on by default from v1.23, GA in v1.25)
Usage:

kubectl debug <pod> -it --image=nicolaka/netshoot --target=<container>
kubectl debug <pod> -it --image=busybox --copy-to=debug-copy
kubectl debug node/<node> -it --image=busybox

Limitations: Managed platforms on older Kubernetes versions (pre-v1.23) require the EphemeralContainers feature gate to be enabled manually; verify `kubectl debug` support on your cluster version

Storage Debugging

PVC Pending Issues:

  • Storage class availability
  • Availability zone constraints
  • Node storage capacity limits

Diagnostic Commands:

kubectl get storageclass
kubectl describe pvc <pvc-name>
kubectl get nodes --show-labels | grep topology.kubernetes.io/zone
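The availability-zone constraint above is usually solved at the StorageClass level: WaitForFirstConsumer delays volume creation until the pod is scheduled, so the volume lands in the pod's zone. A sketch with placeholder names; use your cluster's actual provisioner:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-wffc            # placeholder name
provisioner: ebs.csi.aws.com     # example CSI driver; substitute your provisioner
volumeBindingMode: WaitForFirstConsumer
```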

Filesystem Corruption Detection:

kubectl exec -it <pod> -- df -h
kubectl debug node/<node> -it --image=busybox
# Inside: mount | grep <volume>

Emergency Debugging Toolkit

Production Outage Commands

# Cluster overview
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running
kubectl get events --sort-by='.lastTimestamp' | tail -20

# Resource analysis
kubectl top nodes
kubectl top pods --all-namespaces --sort-by=memory

# Network diagnosis
kubectl get svc --all-namespaces
kubectl get endpoints | grep "<none>"
kubectl get networkpolicies --all-namespaces

Cluster vs Application Issue Identification

Cluster-wide problems:

kubectl get nodes  # Check node health
kubectl get pods -n kube-system  # System pod status
kubectl get events --all-namespaces --sort-by='.lastTimestamp'

Decision Rule: Failing kube-system pods indicate a cluster issue; escalate to the platform team. If only application pods fail, treat it as an application issue.

Tool Effectiveness Matrix

  • Pod startup failures: kubectl logs ✅ primary tool; kubectl debug ⚠️ limited; kubectl exec ❌ pod not running; external monitoring ⚠️ symptoms only
  • Application hangs: kubectl logs ⚠️ limited insight; kubectl debug ✅ optimal; kubectl exec ✅ process inspection; external monitoring ✅ resource tracking
  • Network connectivity: kubectl logs ⚠️ connection errors only; kubectl debug ✅ network tools; kubectl exec ⚠️ basic testing; external monitoring ✅ flow monitoring
  • Performance issues: kubectl logs ❌ rarely helpful; kubectl debug ✅ profiling tools; kubectl exec ✅ system metrics; external monitoring ✅ essential
  • Memory issues: kubectl logs ✅ OOMKilled events; kubectl debug ✅ memory profiling; kubectl exec ✅ current usage; external monitoring ✅ historical trends

Configuration Specifications

Memory Requirements by Application Type

  • Java Applications: 512MB minimum, 2GB+ production
  • Node.js Applications: 256MB minimum
  • Go Applications: 128MB minimum
  • Databases: Requested + 50% overhead
  • ML Workloads: 4GB+ typical

CPU Allocation Guidelines

  • Static Content: 100m sufficient
  • REST APIs: 500m under load
  • High-throughput APIs: 1000-2000m
  • ML Inference: 2000m+ during processing

Health Check Timing

  • initialDelaySeconds: Application startup time + 30s buffer
  • periodSeconds: 10s for web apps, 30s for databases
  • timeoutSeconds: 5s for health endpoints, 30s for complex checks
  • failureThreshold: 3 for restart tolerance
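The timing guidelines above combine into a probe block like this sketch for a web app with roughly 30s startup time (hypothetical endpoint and port):

```yaml
readinessProbe:
  httpGet:
    path: /healthz             # placeholder health endpoint
    port: 8080
  initialDelaySeconds: 60      # startup time (~30s) + 30s buffer
  periodSeconds: 10            # web-app cadence per the guideline above
  timeoutSeconds: 5            # simple health endpoint
  failureThreshold: 3
```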

Network Policy Defaults

  • Default Deny: Blocks all traffic, requires explicit allow rules
  • DNS Allow: Always permit kube-dns/coredns access
  • Ingress Rules: Specific port and protocol definitions required
  • Egress Rules: External API access needs explicit rules
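The first two defaults above (deny everything, always allow DNS) can be expressed in a single policy. A sketch; the kube-dns labels shown are the upstream defaults and may differ on your distribution:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-with-dns
spec:
  podSelector: {}                  # selects every pod in the namespace
  policyTypes: [Ingress, Egress]   # deny both directions by default
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns    # upstream CoreDNS label; verify on your cluster
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```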

Critical Warnings

Resource Constraint Breaking Points

  • Node Memory >90%: Pod scheduling failures begin
  • etcd Storage >8GB: Cluster performance degradation
  • Pod Density >110 pods/node: Network performance issues
  • ConfigMap Size >1MB: etcd replication bottlenecks

Time-Sensitive Failure Modes

  • Certificate Expiration: TLS failures during renewal
  • Token Rotation: API authentication failures
  • Log Volume Spikes: Disk space exhaustion in hours
  • Memory Leaks: OOMKilled after 24-48 hours

Production Deployment Risks

  • Rolling Updates: Pod replacement during peak traffic
  • Resource Changes: Requires pod restart, service interruption
  • Network Policy Updates: Can block existing connections
  • Storage Changes: Potential data loss, backup required

Recovery Procedures

Service Restoration Priority

  1. Critical Path Services: Payment, authentication systems first
  2. Database Connectivity: Restore data layer before application layer
  3. External Dependencies: Third-party API connections
  4. Internal Services: Non-critical microservices last

Rollback Decision Criteria

  • Error Rate >5%: Consider immediate rollback
  • Response Time >2x baseline: Performance regression
  • Memory Usage >80% limit: Resource constraint imminent
  • Failed Health Checks >50%: Service instability

Escalation Triggers

  • Multiple Node Failures: Platform team involvement required
  • etcd Issues: Cluster-level problem, infrastructure team
  • DNS Resolution Failures: Network team escalation
  • Storage Backend Issues: Cloud provider support required

This reference provides systematic approaches to Kubernetes debugging with quantified thresholds, time requirements, and failure probabilities based on production experience. All diagnostic commands and configuration values are production-tested and include common failure scenarios with their operational impact.

Useful Links for Further Investigation

Kubernetes Debugging Resources - Links That Actually Help During Outages

  • Kubernetes Troubleshooting: The official troubleshooting guide. Start here for canonical debugging approaches, though it's more theoretical than practical.
  • Debug Running Pods: Official documentation for kubectl debug and ephemeral containers. Essential reading for modern debugging techniques.
  • Debug Services: Step-by-step guide for debugging service connectivity issues. Covers the most common networking problems.
  • Debug Clusters: Cluster-level debugging techniques. Use this when your entire cluster is acting up, not just individual applications.
  • CNCF Kubernetes Troubleshooting Guide: Comprehensive guide covering the most common production issues. Written by people who've debugged real outages.
  • Komodor Kubernetes Debugging Guide: Practical debugging approaches for CrashLoopBackOff, OOMKilled, ImagePullBackOff, and networking issues.
  • Middleware.io Troubleshooting Techniques: Top 10 troubleshooting techniques from real-world scenarios. Good for understanding systematic debugging approaches.
  • Groundcover Kubernetes Troubleshooting: Deep dive into specific error scenarios with practical solutions. Covers OOMKilled, CrashLoopBackOff, and ImagePullBackOff.
  • Netshoot Container: Essential debugging container image with network troubleshooting tools. Use with `kubectl debug` for network issues.
  • kubectl-debug Plugin: Enhanced debugging plugin for kubectl. Adds advanced debugging capabilities beyond standard kubectl debug.
  • k9s - Terminal UI: Terminal-based Kubernetes UI that makes debugging more efficient. Great for navigating cluster state during outages.
  • Stern - Multi-Pod Log Tailing: Tail logs from multiple pods simultaneously. Essential for debugging microservices where errors span multiple containers.
  • Prometheus Kubernetes Monitoring: Setting up Prometheus for Kubernetes monitoring. Essential for proactive debugging and historical analysis.
  • Grafana Kubernetes Dashboards: Pre-built dashboards for Kubernetes monitoring. Save hours of dashboard creation time.
  • Jaeger Distributed Tracing: Distributed tracing for complex debugging scenarios. Use when issues span multiple services.
  • Kubernetes Event Exporter: Export and monitor Kubernetes events. Essential for understanding what happened during outages.
  • AWS EKS Troubleshooting: AWS-specific debugging guide. Covers EKS-specific issues like IAM roles, VPC configuration, and managed node groups.
  • Google GKE Troubleshooting: GKE-specific debugging guide. Covers Google Cloud networking, IAM, and GKE Autopilot issues.
  • Azure AKS Troubleshooting: AKS-specific debugging guide. Covers Azure networking, identity management, and AKS-specific features.
  • Kubernetes Network Policy Tools: Collection of Kubernetes networking and debugging tools. Use when network policies are blocking connections.
  • kubectl-flame - Profiling Tool: Profile applications running in Kubernetes. Essential for performance debugging and memory leak detection.
  • Kubernetes Resource Recommender: Analyze and recommend resource settings. Use to right-size containers and prevent OOMKilled errors.
  • etcd Debugging Guide: Debug etcd performance and reliability issues. Use when cluster-wide problems occur.
  • Kubernetes Slack #troubleshooting: Real-time help from the Kubernetes community. Join the troubleshooting channel for live debugging assistance.
  • Stack Overflow Kubernetes Tag: Search existing solutions or ask new questions. Most common issues have been solved here already.
  • Kubernetes Community Forum: Community discussions, debugging stories, and lessons learned. Good for understanding real-world debugging experiences.
  • Kubernetes GitHub Issues: Official bug reports and feature requests. Search here when encountering unusual cluster behavior.
  • Kubernetes Debugging Cheat Sheet: Official kubectl command reference. Print this and keep it handy for outages.
  • kubectl Quick Reference: Essential commands organized by task. Use during high-pressure debugging situations.
  • Kubernetes Troubleshooting Flowchart: Visual flowchart for systematic debugging. Follow the decision tree when you don't know where to start.
  • Kubernetes the Hard Way: Understanding Kubernetes internals helps with debugging. Work through this when you have time, not during outages.
  • Linux Academy Kubernetes Troubleshooting Course: Structured learning for debugging techniques. Good for building systematic troubleshooting skills.
  • Kubernetes Patterns Book: Understanding common patterns helps prevent and debug issues. Reference for architectural debugging approaches.
