Kubernetes Production Outage Recovery - AI-Optimized Guide
Critical Failure Classifications
Complete Control Plane Death
- Symptoms: API server won't start, etcd corrupted, kubectl connection errors
- Causes: Failed cluster upgrades, disk space exhaustion on master nodes, network issues between etcd members, power outages without UPS
- Recovery Time: 3-8 hours (first attempt: 8 hours due to permission issues; experienced recovery: 3 hours)
- Prerequisites: Valid etcd backups, direct SSH access to nodes
Resource Cascade Failures
- Pattern: Database pod OOMKilled → App connection failures → Restart storm → Node memory exhaustion → kubelet dies → API server isolation → Complete failure
- Detection: Random pod evictions, slow kubectl responses, deployment timeouts
- Recovery Method: Immediate non-essential pod termination using forced deletion
Slow Death Spiral
- Characteristics: Hardest to detect, progressive node failures with pod migration
- Timeline: Can take hours to days before complete failure
- Early Warnings: Random pod evictions, slow kubectl responses, deployment timeouts
Emergency Recovery Procedures
When kubectl Is Dead (5-15 minute assessment)
Direct Node Assessment:
# Use when kubectl fails
sudo systemctl status kubelet
sudo crictl ps | grep apiserver
sudo etcdctl endpoint health # WARNING: Hangs indefinitely if etcd is corrupted - kill after 30 seconds
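To avoid babysitting that hanging health check, wrap it in a timeout. A minimal sketch, assuming etcd listens on the default local endpoint and the standard kubeadm certificate paths (adjust both for your cluster):
# Bound the health check so a corrupted etcd can't hang your session
sudo timeout 30 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health || echo "etcd health check failed or timed out"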
etcd Recovery (Make-or-Break Step)
Single etcd Failure:
# CRITICAL: Stop services first to prevent corruption
sudo systemctl stop kubelet
# Restore creates NEW directory, doesn't overwrite
sudo etcdctl snapshot restore /path/to/backup.db \
--data-dir /var/lib/etcd-from-backup
# CRITICAL: Wait 60 seconds before restarting kubelet
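Because the restore lands in a new directory, etcd has to be pointed at it before kubelet comes back. A minimal sketch, assuming a kubeadm-style cluster where etcd runs as a static pod defined in /etc/kubernetes/manifests/etcd.yaml (paths differ on other distributions):
# Point the etcd static pod at the restored data directory (rewrites both --data-dir and the hostPath volume)
sudo sed -i 's#/var/lib/etcd#/var/lib/etcd-from-backup#g' /etc/kubernetes/manifests/etcd.yaml
# Wait the 60 seconds, then bring kubelet back so it recreates the static pods
sleep 60
sudo systemctl start kubelet
sudo crictl ps | grep -E 'etcd|kube-apiserver'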
Multiple etcd Failures (Complete Rebuild):
# Stop all services on all nodes
sudo systemctl stop kubelet
sudo systemctl stop docker
# On single master node
sudo rm -rf /var/lib/etcd  # snapshot restore refuses to run if the data directory already exists
# --name must match the member name used in --initial-cluster or the restore aborts
sudo etcdctl snapshot restore /backup/etcd-snapshot.db \
  --name master1 \
  --data-dir /var/lib/etcd \
  --initial-cluster master1=https://<master-ip>:2380 \
  --initial-advertise-peer-urls https://<master-ip>:2380
# Start services with 5-minute intervals
sudo systemctl start docker
# Wait 5 minutes
sudo systemctl start kubelet
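Once both services are back, verify each layer before declaring the restore successful; a quick sketch, assuming the kubeadm admin kubeconfig sits at its default path:
# Control plane containers actually running?
sudo crictl ps | grep -E 'etcd|kube-apiserver'
# API server answering and nodes re-registering?
kubectl --kubeconfig /etc/kubernetes/admin.conf get nodes
kubectl --kubeconfig /etc/kubernetes/admin.conf get pods -n kube-system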
Resource Death Spiral Recovery
Nuclear Option (Immediate):
# Force delete all non-essential pods
kubectl delete pods --all -n non-essential-namespace --grace-period=0 --force
kubectl scale deployment --all --replicas=0 -n non-essential-namespace
Surgical Option:
kubectl drain <worst-node> --ignore-daemonsets --force --delete-emptydir-data
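To pick <worst-node> (and to undo the drain once the cascade stops), something like the following helps; kubectl top requires metrics-server, and --sort-by assumes a reasonably recent kubectl:
# Find the node under the most memory pressure
kubectl top nodes --sort-by=memory
# Or read the pressure conditions straight from node status
kubectl describe nodes | grep -E 'Name:|MemoryPressure'
# After recovery, let workloads schedule back onto the drained node
kubectl uncordon <worst-node>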
OOMKilled Massacre Recovery
# Identify worst offenders
kubectl top pods --all-namespaces --sort-by=memory
# Emergency memory doubling
kubectl patch deployment problem-app -p '{
  "spec": {
    "template": {
      "spec": {
        "containers": [{
          "name": "app",
          "resources": {
            "limits": {"memory": "4Gi"},
            "requests": {"memory": "2Gi"}
          }
        }]
      }
    }
  }
}'
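After the patch, confirm the rollout actually finishes and the restarts stop climbing; a quick check, where the app=problem-app label is an assumption about how the deployment is labeled:
# Watch the new ReplicaSet roll out with the raised limits
kubectl rollout status deployment/problem-app
# Confirm the previous restarts were OOMKills and that the count has stopped climbing
kubectl describe pods -l app=problem-app | grep -E 'Last State|Reason|Restart Count'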
Production-Tested Prevention Strategies
etcd Backup Script (Failure-Resistant)
#!/bin/bash
# Assumes etcdctl can reach etcd without extra flags; on kubeadm clusters, set
# ETCDCTL_ENDPOINTS, ETCDCTL_CACERT, ETCDCTL_CERT, and ETCDCTL_KEY before running.
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/var/backups/etcd"
mkdir -p "${BACKUP_DIR}"
# Verify etcd health before backup
if ! etcdctl endpoint health &>/dev/null; then
  echo "etcd is dead, backup will fail"
  exit 1
fi
# Create backup
etcdctl snapshot save "${BACKUP_DIR}/etcd-${DATE}.db"
# Verify backup integrity
if ! etcdctl snapshot status "${BACKUP_DIR}/etcd-${DATE}.db" &>/dev/null; then
  echo "Backup is corrupted, trying again"
  rm -f "${BACKUP_DIR}/etcd-${DATE}.db"
  exit 1
fi
Common Backup Failures:
- etcd under load (backup timeouts)
- Insufficient disk space (creates 0-byte files)
- Network issues (partial backups)
- Permission problems (backup succeeds but restore fails)
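Two cheap guards against the failure modes above, meant to be appended to the backup script; the 7-day retention window is just an example:
# Insufficient disk space produces 0-byte files - refuse to keep them
[ -s "${BACKUP_DIR}/etcd-${DATE}.db" ] || { echo "empty backup"; rm -f "${BACKUP_DIR}/etcd-${DATE}.db"; exit 1; }
# Prune old snapshots so the backup disk itself doesn't fill up (example retention)
find "${BACKUP_DIR}" -name 'etcd-*.db' -mtime +7 -delete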
Resource Limits (Production-Tested)
# Prevents OOM cascades while maintaining density
limits:
- type: Container
  default:
    memory: "512Mi"  # Prevents most OOMKills
    cpu: "200m"      # Prevents throttling cascades
Why These Values:
- 512Mi: Tested threshold preventing OOMKills while allowing pod density
- 200m CPU: Prevents throttling-induced death spirals
- Lower values cause mystery performance issues
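As a complete object, the fragment above corresponds to a LimitRange; a minimal sketch, with the namespace and object name as placeholders:
# Apply the tested defaults to one namespace (name and namespace are placeholders)
kubectl apply -n <your-namespace> -f - <<'EOF'
apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-limits
spec:
  limits:
  - type: Container
    default:
      memory: "512Mi"
      cpu: "200m"
EOF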
Critical Monitoring Thresholds
etcd Performance Alert:
expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.1
# Threshold: >100ms for 2 minutes
# Rationale: etcd gets slow before complete failure
API Server Error Rate:
expr: (rate(apiserver_request_total{code=~"5.."}[5m]) / rate(apiserver_request_total[5m])) > 0.01
# Threshold: >1% error rate for 1 minute
# Rationale: Don't wait for complete failure
Node Memory Pressure:
expr: (node_memory_Active_bytes / node_memory_MemTotal_bytes) > 0.9
# Threshold: >90% for 5 minutes
# Rationale: At 90%, pod termination begins
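Wired into a Prometheus rules file, the first expression looks roughly like this; the file path, rule group, and alert name are placeholders, and the for: duration carries the 2-minute threshold:
# Example rule wrapping the etcd fsync alert (placeholder names and path)
cat <<'EOF' > /etc/prometheus/rules/etcd-latency.yml
groups:
- name: etcd
  rules:
  - alert: EtcdSlowFsync
    expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.1
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "etcd p99 fsync latency above 100ms - complete failure usually follows"
EOF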
Real-World Failure Causes (By Frequency)
Most Common: Disk Space Exhaustion
- etcd logs grow unchecked (logrotate misconfiguration)
- Docker image accumulation on nodes
- Application log files filling /var/log
Resource Limits Too Low
- OOMKills under production load
- CPU throttling causing cascades
- Staging environment doesn't match production resource patterns
Upgrade Disasters
- Kubernetes 1.24 dockershim removal (deprecated well in advance, but it still blindsided clusters that hadn't migrated to containerd)
- API deprecations breaking deployments (networking.k8s.io/v1beta1)
- etcd 3.4→3.5 requiring data migration
Certificate Expiration
- kubelet certificates (90-day default)
- etcd peer certificates
- Ingress TLS certificates
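A quick way to audit the expirations listed above; kubeadm certs check-expiration assumes a kubeadm-managed control plane, and the openssl line works on any PEM certificate (the path shown is just an example):
# Control plane certificate expiry at a glance (kubeadm clusters)
sudo kubeadm certs check-expiration
# Any single certificate, e.g. an etcd peer cert
sudo openssl x509 -noout -enddate -in /etc/kubernetes/pki/etcd/peer.crt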
Recovery Decision Tree
When to Rebuild vs Repair
Rebuild Indicators:
- etcd restore fails multiple times
- API server won't start with good etcd
- Random node failures continue
- 6+ hours of unsuccessful repair attempts
Rebuild Timeline:
- Fight corrupted cluster: 12 hours
- New cluster deployment: 2 hours
- Lesson: Know when to cut losses
Recovery Time Estimates
- etcd corruption with backups: 2-6 hours
- etcd corruption without backups: 24+ hours (complete rebuild)
- Resource exhaustion: 30 minutes (aggressive) to 3 hours (surgical)
- Complete cluster death: 4 hours to 2 days (highly variable)
Management Communication Rule: Double your estimate, then double again
Tools That Work When Standard Tools Fail
When kubectl is dead:
- crictl (works when docker doesn't)
- Direct SSH to nodes
- systemctl commands for service management
- Raw etcdctl for etcd access
Jump Box Requirements:
Install these tools on infrastructure NOT part of the cluster - you'll need them when everything else fails.
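A rough provisioning sketch for such a jump box, assuming a Linux amd64 host; the version numbers are placeholders - pin them to whatever your cluster actually runs:
# Versions are placeholders - match your cluster
CRICTL_VERSION="v1.28.0"
ETCD_VERSION="v3.5.9"
K8S_VERSION="v1.28.0"
# crictl - talks to containerd/CRI-O directly, no API server needed
curl -fsSL "https://github.com/kubernetes-sigs/cri-tools/releases/download/${CRICTL_VERSION}/crictl-${CRICTL_VERSION}-linux-amd64.tar.gz" \
  | sudo tar -xz -C /usr/local/bin
# etcdctl - raw etcd access
curl -fsSL "https://github.com/etcd-io/etcd/releases/download/${ETCD_VERSION}/etcd-${ETCD_VERSION}-linux-amd64.tar.gz" \
  | tar -xz --strip-components=1 -C /tmp
sudo mv /tmp/etcdctl /usr/local/bin/
# kubectl - for when the API server does come back
curl -fsSLo /tmp/kubectl "https://dl.k8s.io/release/${K8S_VERSION}/bin/linux/amd64/kubectl"
sudo install -m 0755 /tmp/kubectl /usr/local/bin/kubectl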
Critical Operational Intelligence
Testing Requirements
- Monthly: Pod chaos testing, basic recovery procedures
- Quarterly: etcd restore testing, single master failure simulation
- Yearly: Complete cluster rebuild drill
- Most Critical: Backup integrity testing (corrupted backups discovered after 6 months are common)
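For the backup integrity item, a restore test that never touches the live data directory; the backup path is a placeholder:
# Restore into a throwaway directory - if this fails, the backup is useless
rm -rf /tmp/etcd-restore-test
etcdctl snapshot restore /var/backups/etcd/<latest-snapshot>.db --data-dir /tmp/etcd-restore-test \
  && echo "backup restores cleanly" \
  || echo "backup is corrupted - fix it now, not during an outage"
rm -rf /tmp/etcd-restore-test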
Configuration Drift Prevention
# Good - tracked in Git
kubectl apply -f deployment.yaml
# Bad - lost forever
kubectl edit deployment my-app
Use GitOps tools (ArgoCD, Flux) for automatic drift detection and correction.
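Even without a GitOps controller, drift against what's in Git can be spotted with kubectl diff:
# Non-zero exit (1) means the live object has drifted from the file in Git
kubectl diff -f deployment.yaml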
Outage Communication Strategy
During Outage:
- Provide padded time estimates
- Focus on recovery actions, not technical details
Post-Outage:
- Management cares about prevention, not etcd internals
- Boring post-mortems with action items stop questioning
- Assign clear ownership for monitoring improvements
Recovery Completion Criteria
NOT recovered when kubectl works - that's just the beginning.
Actually recovered when:
- All applications healthy for 30+ minutes
- No anomalous log entries
- Performance metrics normal
- New deployments succeed consistently
- Resource utilization patterns normal
It's common for a cluster that looks "recovered" to fail again within 20 minutes when the underlying issue was never actually resolved.
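A crude way to enforce the 30-minutes-healthy rule instead of eyeballing dashboards; the sketch below only checks pod status, so anything Running-but-NotReady still needs a human look:
# Bail out the moment any pod leaves Running/Completed during the observation window
for i in $(seq 1 30); do
  BAD=$(kubectl get pods -A --no-headers | grep -vE 'Running|Completed' | wc -l)
  [ "$BAD" -gt 0 ] && { echo "still $BAD unhealthy pods - not recovered"; exit 1; }
  sleep 60
done
echo "all pods healthy for 30 minutes"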
Useful Links for Further Investigation
Resources That Actually Help
| Link | Description |
|---|---|
| Kubernetes Troubleshooting Guide | Official docs. They assume your cluster is healthy enough to run diagnostics, but it's a starting point. |
| etcd Disaster Recovery | Follow this exactly. Saved my ass when etcd died during a power outage. |
| Kubernetes GitHub Issues | Search for your exact error message. Someone else has probably hit the same bug. |
| k9s | Terminal UI for Kubernetes. Sometimes shows things kubectl misses, but won't help if the cluster is completely dead. |
| GitLab Database Incident | Not Kubernetes, but shows how to handle catastrophic data loss. |