Kubernetes Production Outage Recovery - AI-Optimized Guide
Critical Failure Classifications
Complete Control Plane Death
- Symptoms: API server won't start, etcd corrupted, kubectl connection errors
- Causes: Failed cluster upgrades, disk space exhaustion on master nodes, network issues between etcd members, power outages without UPS
- Recovery Time: 3-8 hours (first attempt: 8 hours due to permission issues; experienced recovery: 3 hours)
- Prerequisites: Valid etcd backups, direct SSH access to nodes
Resource Cascade Failures
- Pattern: Database pod OOMKilled → App connection failures → Restart storm → Node memory exhaustion → kubelet dies → API server isolation → Complete failure
- Detection: Random pod evictions, slow kubectl responses, deployment timeouts
- Recovery Method: Immediate non-essential pod termination using forced deletion
Slow Death Spiral
- Characteristics: Hardest to detect, progressive node failures with pod migration
- Timeline: Can take hours to days before complete failure
- Early Warnings: Random pod evictions, slow kubectl responses, deployment timeouts
Emergency Recovery Procedures
When kubectl Is Dead (5-15 minute assessment)
Direct Node Assessment:
# Use when kubectl fails
sudo systemctl status kubelet
sudo crictl ps | grep apiserver
sudo etcdctl endpoint health # WARNING: Hangs indefinitely if etcd is corrupted - kill after 30 seconds
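To avoid babysitting that hanging health check, wrap it in a timeout. A minimal sketch, assuming etcd listens on the default local endpoint and the standard kubeadm certificate paths (adjust both for your cluster):
# Bound the health check so a corrupted etcd can't hang your session
sudo timeout 30 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health || echo "etcd health check failed or timed out"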
etcd Recovery (Make-or-Break Step)
Single etcd Failure:
# CRITICAL: Stop services first to prevent corruption
sudo systemctl stop kubelet
# Restore creates NEW directory, doesn't overwrite
sudo etcdctl snapshot restore /path/to/backup.db \
--data-dir /var/lib/etcd-from-backup
# CRITICAL: Wait 60 seconds before restarting kubelet
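Because the restore lands in a new directory, etcd has to be pointed at it before kubelet comes back. A minimal sketch, assuming a kubeadm-style cluster where etcd runs as a static pod defined in /etc/kubernetes/manifests/etcd.yaml (paths differ on other distributions):
# Point the etcd static pod at the restored data directory (rewrites both --data-dir and the hostPath volume)
sudo sed -i 's#/var/lib/etcd#/var/lib/etcd-from-backup#g' /etc/kubernetes/manifests/etcd.yaml
# Wait the 60 seconds, then bring kubelet back so it recreates the static pods
sleep 60
sudo systemctl start kubelet
sudo crictl ps | grep -E 'etcd|kube-apiserver'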
Multiple etcd Failures (Complete Rebuild):
# Stop all services on all nodes
sudo systemctl stop kubelet
sudo systemctl stop docker
# On single master node
sudo rm -rf /var/lib/etcd  # snapshot restore refuses to run if the data directory already exists
# --name must match the member name used in --initial-cluster or the restore aborts
sudo etcdctl snapshot restore /backup/etcd-snapshot.db \
  --name master1 \
  --data-dir /var/lib/etcd \
  --initial-cluster master1=https://<master-ip>:2380 \
  --initial-advertise-peer-urls https://<master-ip>:2380
# Start services with 5-minute intervals
sudo systemctl start docker
# Wait 5 minutes
sudo systemctl start kubelet
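Once both services are back, verify each layer before declaring the restore successful; a quick sketch, assuming the kubeadm admin kubeconfig sits at its default path:
# Control plane containers actually running?
sudo crictl ps | grep -E 'etcd|kube-apiserver'
# API server answering and nodes re-registering?
kubectl --kubeconfig /etc/kubernetes/admin.conf get nodes
kubectl --kubeconfig /etc/kubernetes/admin.conf get pods -n kube-system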
Resource Death Spiral Recovery
Nuclear Option (Immediate):
# Force delete all non-essential pods
kubectl delete pods --all -n non-essential-namespace --grace-period=0 --force
kubectl scale deployment --all --replicas=0 -n non-essential-namespace
Surgical Option:
kubectl drain <worst-node> --ignore-daemonsets --force --delete-emptydir-data
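To pick <worst-node> (and to undo the drain once the cascade stops), something like the following helps; kubectl top requires metrics-server, and --sort-by assumes a reasonably recent kubectl:
# Find the node under the most memory pressure
kubectl top nodes --sort-by=memory
# Or read the pressure conditions straight from node status
kubectl describe nodes | grep -E 'Name:|MemoryPressure'
# After recovery, let workloads schedule back onto the drained node
kubectl uncordon <worst-node>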
OOMKilled Massacre Recovery
# Identify worst offenders
kubectl top pods --all-namespaces --sort-by=memory
# Emergency memory doubling
kubectl patch deployment problem-app -p '{
  "spec": {
    "template": {
      "spec": {
        "containers": [{
          "name": "app",
          "resources": {
            "limits": {"memory": "4Gi"},
            "requests": {"memory": "2Gi"}
          }
        }]
      }
    }
  }
}'
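After the patch, confirm the rollout actually finishes and the restarts stop climbing; a quick check, where the app=problem-app label is an assumption about how the deployment is labeled:
# Watch the new ReplicaSet roll out with the raised limits
kubectl rollout status deployment/problem-app
# Confirm the previous restarts were OOMKills and that the count has stopped climbing
kubectl describe pods -l app=problem-app | grep -E 'Last State|Reason|Restart Count'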
Production-Tested Prevention Strategies
etcd Backup Script (Failure-Resistant)
#!/bin/bash
# Assumes etcdctl can reach etcd without extra flags; on kubeadm clusters, set
# ETCDCTL_ENDPOINTS, ETCDCTL_CACERT, ETCDCTL_CERT, and ETCDCTL_KEY before running.
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/var/backups/etcd"
mkdir -p "${BACKUP_DIR}"
# Verify etcd health before backup
if ! etcdctl endpoint health &>/dev/null; then
  echo "etcd is dead, backup will fail"
  exit 1
fi
# Create backup
etcdctl snapshot save "${BACKUP_DIR}/etcd-${DATE}.db"
# Verify backup integrity
if ! etcdctl snapshot status "${BACKUP_DIR}/etcd-${DATE}.db" &>/dev/null; then
  echo "Backup is corrupted, trying again"
  rm -f "${BACKUP_DIR}/etcd-${DATE}.db"
  exit 1
fi
Common Backup Failures:
- etcd under load (backup timeouts)
- Insufficient disk space (creates 0-byte files)
- Network issues (partial backups)
- Permission problems (backup succeeds but restore fails)
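Two cheap guards against the failure modes above, meant to be appended to the backup script; the 7-day retention window is just an example:
# Insufficient disk space produces 0-byte files - refuse to keep them
[ -s "${BACKUP_DIR}/etcd-${DATE}.db" ] || { echo "empty backup"; rm -f "${BACKUP_DIR}/etcd-${DATE}.db"; exit 1; }
# Prune old snapshots so the backup disk itself doesn't fill up (example retention)
find "${BACKUP_DIR}" -name 'etcd-*.db' -mtime +7 -delete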
Resource Limits (Production-Tested)
# Prevents OOM cascades while maintaining density
limits:
- type: Container
  default:
    memory: "512Mi"  # Prevents most OOMKills
    cpu: "200m"      # Prevents throttling cascades
Why These Values:
- 512Mi: Tested threshold preventing OOMKills while allowing pod density
- 200m CPU: Prevents throttling-induced death spirals
- Lower values cause mystery performance issues
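As a complete object, the fragment above corresponds to a LimitRange; a minimal sketch, with the namespace and object name as placeholders:
# Apply the tested defaults to one namespace (name and namespace are placeholders)
kubectl apply -n <your-namespace> -f - <<'EOF'
apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-limits
spec:
  limits:
  - type: Container
    default:
      memory: "512Mi"
      cpu: "200m"
EOF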
Critical Monitoring Thresholds
etcd Performance Alert:
expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.1
# Threshold: >100ms for 2 minutes
# Rationale: etcd gets slow before complete failure
API Server Error Rate:
expr: (rate(apiserver_request_total{code=~"5.."}[5m]) / rate(apiserver_request_total[5m])) > 0.01
# Threshold: >1% error rate for 1 minute
# Rationale: Don't wait for complete failure
Node Memory Pressure:
expr: (node_memory_Active_bytes / node_memory_MemTotal_bytes) > 0.9
# Threshold: >90% for 5 minutes
# Rationale: At 90%, pod termination begins
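Wired into a Prometheus rules file, the first expression looks roughly like this; the file path, rule group, and alert name are placeholders, and the for: duration carries the 2-minute threshold:
# Example rule wrapping the etcd fsync alert (placeholder names and path)
cat <<'EOF' > /etc/prometheus/rules/etcd-latency.yml
groups:
- name: etcd
  rules:
  - alert: EtcdSlowFsync
    expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.1
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "etcd p99 fsync latency above 100ms - complete failure usually follows"
EOF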
Real-World Failure Causes (By Frequency)
Most Common: Disk Space Exhaustion
- etcd logs grow unchecked (logrotate misconfiguration)
- Docker image accumulation on nodes
- Application log files filling /var/log
Resource Limits Too Low
- OOMKills under production load
- CPU throttling causing cascades
- Staging environment doesn't match production resource patterns
Upgrade Disasters
- Kubernetes 1.24 dockershim removal (deprecated well in advance, but it still blindsided clusters that hadn't migrated to containerd)
- API deprecations breaking deployments (networking.k8s.io/v1beta1)
- etcd 3.4→3.5 requiring data migration
Certificate Expiration
- kubelet certificates (90-day default)
- etcd peer certificates
- Ingress TLS certificates
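A quick way to audit the expirations listed above; kubeadm certs check-expiration assumes a kubeadm-managed control plane, and the openssl line works on any PEM certificate (the path shown is just an example):
# Control plane certificate expiry at a glance (kubeadm clusters)
sudo kubeadm certs check-expiration
# Any single certificate, e.g. an etcd peer cert
sudo openssl x509 -noout -enddate -in /etc/kubernetes/pki/etcd/peer.crt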
Recovery Decision Tree
When to Rebuild vs Repair
Rebuild Indicators:
- etcd restore fails multiple times
- API server won't start with good etcd
- Random node failures continue
- 6+ hours of unsuccessful repair attempts
Rebuild Timeline:
- Fight corrupted cluster: 12 hours
- New cluster deployment: 2 hours
- Lesson: Know when to cut losses
Recovery Time Estimates
- etcd corruption with backups: 2-6 hours
- etcd corruption without backups: 24+ hours (complete rebuild)
- Resource exhaustion: 30 minutes (aggressive) to 3 hours (surgical)
- Complete cluster death: 4 hours to 2 days (highly variable)
Management Communication Rule: Double your estimate, then double again
Tools That Work When Standard Tools Fail
When kubectl is dead:
- crictl (works when docker doesn't)
- Direct SSH to nodes
- systemctl commands for service management
- Raw etcdctl for etcd access
Jump Box Requirements:
Install these tools on infrastructure NOT part of the cluster - you'll need them when everything else fails.
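A rough provisioning sketch for such a jump box, assuming a Linux amd64 host; the version numbers are placeholders - pin them to whatever your cluster actually runs:
# Versions are placeholders - match your cluster
CRICTL_VERSION="v1.28.0"
ETCD_VERSION="v3.5.9"
K8S_VERSION="v1.28.0"
# crictl - talks to containerd/CRI-O directly, no API server needed
curl -fsSL "https://github.com/kubernetes-sigs/cri-tools/releases/download/${CRICTL_VERSION}/crictl-${CRICTL_VERSION}-linux-amd64.tar.gz" \
  | sudo tar -xz -C /usr/local/bin
# etcdctl - raw etcd access
curl -fsSL "https://github.com/etcd-io/etcd/releases/download/${ETCD_VERSION}/etcd-${ETCD_VERSION}-linux-amd64.tar.gz" \
  | tar -xz --strip-components=1 -C /tmp
sudo mv /tmp/etcdctl /usr/local/bin/
# kubectl - for when the API server does come back
curl -fsSLo /tmp/kubectl "https://dl.k8s.io/release/${K8S_VERSION}/bin/linux/amd64/kubectl"
sudo install -m 0755 /tmp/kubectl /usr/local/bin/kubectl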
Critical Operational Intelligence
Testing Requirements
- Monthly: Pod chaos testing, basic recovery procedures
- Quarterly: etcd restore testing, single master failure simulation
- Yearly: Complete cluster rebuild drill
- Most Critical: Backup integrity testing (corrupted backups discovered after 6 months are common)
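For the backup integrity item, a restore test that never touches the live data directory; the backup path is a placeholder:
# Restore into a throwaway directory - if this fails, the backup is useless
rm -rf /tmp/etcd-restore-test
etcdctl snapshot restore /var/backups/etcd/<latest-snapshot>.db --data-dir /tmp/etcd-restore-test \
  && echo "backup restores cleanly" \
  || echo "backup is corrupted - fix it now, not during an outage"
rm -rf /tmp/etcd-restore-test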
Configuration Drift Prevention
# Good - tracked in Git
kubectl apply -f deployment.yaml
# Bad - lost forever
kubectl edit deployment my-app
Use GitOps tools (ArgoCD, Flux) for automatic drift detection and correction.
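Even without a GitOps controller, drift against what's in Git can be spotted with kubectl diff:
# Non-zero exit (1) means the live object has drifted from the file in Git
kubectl diff -f deployment.yaml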
Outage Communication Strategy
During Outage:
- Provide padded time estimates
- Focus on recovery actions, not technical details
Post-Outage:
- Management cares about prevention, not etcd internals
- Boring post-mortems with action items stop questioning
- Assign clear ownership for monitoring improvements
Recovery Completion Criteria
NOT recovered when kubectl works - that's just the beginning.
Actually recovered when:
- All applications healthy for 30+ minutes
- No anomalous log entries
- Performance metrics normal
- New deployments succeed consistently
- Resource utilization patterns normal
It's common for a cluster that looks "recovered" to fail again within 20 minutes when the underlying issue was never actually resolved.
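A crude way to enforce the 30-minutes-healthy rule instead of eyeballing dashboards; the sketch below only checks pod status, so anything Running-but-NotReady still needs a human look:
# Bail out the moment any pod leaves Running/Completed during the observation window
for i in $(seq 1 30); do
  BAD=$(kubectl get pods -A --no-headers | grep -vE 'Running|Completed' | wc -l)
  [ "$BAD" -gt 0 ] && { echo "still $BAD unhealthy pods - not recovered"; exit 1; }
  sleep 60
done
echo "all pods healthy for 30 minutes"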
Useful Links for Further Investigation
Resources That Actually Help
| Link | Description |
|---|---|
| Kubernetes Troubleshooting Guide | Official docs. They assume your cluster is healthy enough to run diagnostics, but it's a starting point. |
| etcd Disaster Recovery | Follow this exactly. Saved my ass when etcd died during a power outage. |
| Kubernetes GitHub Issues | Search for your exact error message. Someone else has probably hit the same bug. |
| k9s | Terminal UI for Kubernetes. Sometimes shows things kubectl misses, but won't help if the cluster is completely dead. |
| GitLab Database Incident | Not Kubernetes, but shows how to handle catastrophic data loss. |