When Kubernetes Dies (And Why Standard Docs Won't Help)

Kubernetes Disaster Recovery

etcd dies, everything dies. When etcd corrupts, your kubectl returns "unable to connect to server" and you're fucked.

Most official troubleshooting guides assume your cluster is healthy enough to run diagnostics. Reality: when clusters fail, the tools you need to debug them stop working.

Here's what actually happens when clusters fail:

etcd Dies: kubectl returns ECONNREFUSED. Pods keep running but you can't manage anything. Had this happen when someone ran apt upgrade on a master node and etcd went from 3.4.22 to 3.5.1 without the data migration.

Multiple Masters Fail: Lost quorum means rebuilding from scratch unless your backup script actually worked. Spoiler: it probably didn't write to the right directory.

Power Failures: UPS died at 3am, corrupted etcd and the filesystem. Lost 3 months of config because backups were going to the same failed NFS mount.

Skip the theory - here's what works when nothing else does.

The Three Ways Kubernetes Breaks

Complete Control Plane Death

This is the nightmare scenario. API server won't start, etcd is corrupted, kubectl throws connection errors. Usually happens because:

  • Someone fucked up a cluster upgrade
  • Disk space ran out on master nodes (happens more than you'd think)
  • Network issues between etcd members
  • Power outage without proper UPS

Fixed this twice. First time took 8 hours because the etcd user needed read access to the backup directory. Second time took 3 hours because the restore worked but I forgot to update the systemd service file with the new data directory path.

Resource Cascade Failures

Starts with one OOMKilled pod, ends with the entire cluster grinding to a halt. Memory pressure spreads like cancer until nothing can schedule.

How it cascades: Database pod gets OOMKilled → Apps can't connect → Restart storm → Node runs out of memory → kubelet dies → API server can't reach nodes → Everything fails.

The fix is brutal: kill everything non-essential immediately using resource quotas.
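
A minimal sketch of the quota half of that, assuming a hypothetical low-priority namespace called batch-jobs. A hard pod quota of zero won't evict what's already running, but it stops the restart storm from rescheduling anything new while you clean up:

## Freeze scheduling in a low-priority namespace (name is illustrative)
kubectl create quota emergency-freeze --hard=pods=0 -n batch-jobs

## Then actually free the memory by scaling its workloads to zero
kubectl scale deployment --all --replicas=0 -n batch-jobs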

The Slow Death Spiral

Worst because it's hard to catch. Node starts having issues, pods slowly migrate, other nodes get overloaded, more nodes fail. Eventually you're running production on one overloaded node.

Signs: Random pod evictions, slow kubectl responses, deployments timing out.
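
While kubectl still answers, a couple of quick checks confirm you're in the spiral (kubectl top assumes metrics-server is installed):

## Evicted pods pile up long before nodes start dying
kubectl get pods --all-namespaces | grep -c Evicted

## Which nodes are already overloaded
kubectl top nodes

## Recent pressure and eviction events
kubectl get events --all-namespaces --sort-by=.lastTimestamp | tail -20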

Why kubectl Becomes Useless

Kubernetes Architecture Components

When clusters are dying, the tools you need to debug them stop working. kubectl needs the API server. The API server needs etcd. etcd needs healthy nodes and disk space.

So when things go sideways, your primary diagnostic tool becomes a paperweight.

Instead you need SSH access to the nodes and the node-level tools: systemctl, crictl, and etcdctl.

Learned this during a 6-hour outage where kubectl was dead the entire time and I kept trying kubectl get nodes like an idiot instead of just SSHing to the boxes.

The Real Recovery Process

Forget the clean step-by-step guides. Real recovery is messy:

  1. Panic for 5 minutes while you figure out how bad it is
  2. SSH directly to nodes because kubectl is dead
  3. Check if etcd is alive - if not, you're in deep shit
  4. Kill everything non-essential to free up resources
  5. Restore from backup or rebuild

The hardest part isn't technical - it's staying calm while everyone asks for ETAs you can't give.

Emergency Recovery (When kubectl Is Dead)

EKS Backup Levels Diagram

Step 1: Figure Out What's Actually Broken (5-15 minutes)

Don't trust kubectl - it lies when the cluster is dying. SSH directly to nodes and check:

## This works when kubectl doesn't
sudo systemctl status kubelet
sudo crictl ps | grep apiserver

## etcd health - this command will timeout if etcd is dead
sudo etcdctl endpoint health

Reference: Troubleshooting kubelet

WARNING: That etcd command hangs forever if etcd is corrupted. Give it 30 seconds max before killing it.
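
On a kubeadm-built cluster, etcd usually wants client certs too - the bare command just times out against TLS. A version with a hard 30-second cap, assuming the default kubeadm cert paths:

## Same health check, but with certs and a timeout
sudo ETCDCTL_API=3 timeout 30 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key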

etcd Recovery (The Make-or-Break Step)

etcd Cluster Recovery Process

If etcd is dead, everything else is pointless. No shortcuts here.

Single etcd failure is recoverable. Multiple etcd failures means you're rebuilding unless you have recent backups that actually work.

## Stop everything first or you'll corrupt the restore
sudo systemctl stop kubelet

## This restore command fails silently if paths are wrong
sudo etcdctl snapshot restore /path/to/backup.db \
  --data-dir /var/lib/etcd-from-backup

GOTCHA: That restore command creates a NEW data directory, doesn't overwrite the existing one. I spent 2 hours wondering why the restore "worked" but nothing changed.

GOTCHA #2: Restart kubelet too fast and it corrupts the restored data. Wait 60 seconds.
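
And the step I burned hours on: nothing reads the new directory until you tell etcd about it. On a kubeadm cluster etcd runs as a static pod, so the path lives in its manifest rather than a systemd unit - a rough sketch assuming the default layout:

## Back up the manifest, then point both the hostPath and --data-dir
## at the restored directory. kubelet restarts the static pod on its own.
sudo cp /etc/kubernetes/manifests/etcd.yaml /root/etcd.yaml.bak
sudo sed -i 's|/var/lib/etcd|/var/lib/etcd-from-backup|g' \
  /etc/kubernetes/manifests/etcd.yaml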

I've seen this fail because:

  • Wrong backup path (most common) - see etcd backup guide
  • Permissions screwed up (etcd user can't read the backup)
  • Not enough disk space in target directory
  • kubelet restarted before restore finished

Complete Cluster Death Recovery

When everything's fucked and you're rebuilding from scratch:

Step 1: Accept you're going to be here for hours.

Step 2: Stop all kubelets on all nodes

## On every single node
sudo systemctl stop kubelet
sudo systemctl stop docker

Step 3: Pick one master node to be your lifeline

## Clear the data directory
sudo rm -rf /var/lib/etcd/*

## Restore backup (replace <master-ip> with your actual master node IP)
sudo etcdctl snapshot restore /backup/etcd-snapshot.db \
  --name master1 \
  --data-dir /var/lib/etcd \
  --initial-cluster master1=https://<master-ip>:2380 \
  --initial-advertise-peer-urls https://<master-ip>:2380

Step 4: Start services ONE AT A TIME

sudo systemctl start docker
sudo systemctl start kubelet

Wait 5 minutes between each step. I know it's painful but rushing breaks things.
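
Before touching the other nodes, make sure the lifeline master actually came back. A quick sanity check, assuming the kubeadm path for the admin kubeconfig:

## etcd and the API server should reappear as containers
sudo crictl ps | grep -E 'etcd|kube-apiserver'

## Then see if the API answers at all
sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get nodes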

Resource Death Spiral Recovery

When memory pressure kills everything:

Nuclear option - kill all non-essential pods immediately using forced deletion:

## This will hurt but saves the cluster
kubectl delete pods --all -n non-essential-namespace --grace-period=0 --force

## Scale everything down
kubectl scale deployment --all --replicas=0 -n non-essential-namespace

Less nuclear - drain the most fucked node:

kubectl drain <worst-node> --ignore-daemonsets --force --delete-emptydir-data
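
Once that node is healthy again (or replaced), don't forget to let it take pods back:

kubectl uncordon <worst-node>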

OOMKilled Massacre Recovery

When pods are getting OOMKilled faster than they can start:

## Find the worst offenders
kubectl top pods --all-namespaces --sort-by=memory

## Emergency memory increases (double everything)
kubectl patch deployment problem-app -p '{
  "spec": {
    "template": {
      "spec": {
        "containers": [{
          "name": "app",
          "resources": {
            "limits": {"memory": "4Gi"},
            "requests": {"memory": "2Gi"}
          }
        }]
      }
    }
  }
}'

More on resource management.

Pro tip: Don't try to be surgical during an outage. Double the memory limits and fix it properly later.
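
If hand-writing JSON patches at 3am isn't your thing, kubectl set resources does the same doubling in one line (the container name "app" is the illustrative one from the patch above):

## Same effect as the patch, less JSON
kubectl set resources deployment problem-app -c app \
  --limits=memory=4Gi --requests=memory=2Gi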

When Nothing Works

Sometimes you're just fucked. I've been there.

Signs you need to rebuild from scratch:

  • etcd restore fails multiple times
  • API server won't start even with good etcd
  • Nodes keep dying randomly
  • You've been at this for 6+ hours

At that point, save what you can, spin up a new cluster, and restore applications from application backups.

Had one outage where we fought a corrupted cluster for 12 hours before giving up and rebuilding. New cluster was up in 2 hours.

The hardest part of outage recovery isn't the technical stuff - it's knowing when to cut your losses.

Prevention (Because 3am Outages Suck)

Cluster Backup Components

etcd Backups That Actually Work

The official docs show perfect backup scripts that fail in production. Here's what works:

#!/bin/bash
## Real backup script - learned the hard way
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/var/backups/etcd"
mkdir -p "${BACKUP_DIR}"

## Check if etcd is even responding first
if ! etcdctl endpoint health &>/dev/null; then
    echo "etcd is dead, backup will fail"
    exit 1
fi

## Make the backup
etcdctl snapshot save "${BACKUP_DIR}/etcd-${DATE}.db"

## Verify it's not corrupted (this fails more than you'd think)
if ! etcdctl snapshot status "${BACKUP_DIR}/etcd-${DATE}.db" &>/dev/null; then
    echo "Backup is corrupted, trying again"
    rm "${BACKUP_DIR}/etcd-${DATE}.db"
    exit 1
fi
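
Two cheap guards worth bolting onto that script, aimed squarely at the failure modes listed below (the thresholds are arbitrary - adjust to your cluster):

## Before the save: refuse to write onto a nearly-full disk (needs ~1GB free)
if [ "$(df --output=avail "${BACKUP_DIR}" | tail -1)" -lt 1048576 ]; then
    echo "Less than 1GB free in ${BACKUP_DIR}, aborting"
    exit 1
fi

## After the save: a real snapshot is never tiny - catch 0-byte and partial files
if [ "$(stat -c %s "${BACKUP_DIR}/etcd-${DATE}.db")" -lt 1048576 ]; then
    echo "Snapshot is suspiciously small, treating it as failed"
    exit 1
fi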

Things that break this script:

  • etcd under load (backup times out)
  • No disk space (silently creates 0-byte files)
  • Network issues (creates partial backups)
  • Permissions (backup succeeds but can't restore)

Resource Limits (Stop the OOM Massacre)

Generic resource quotas are useless. Here's what actually prevents cascading failures:

## This works - learned from production OOMKilled storms
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits   # name is up to you
spec:
  limits:
  - type: Container
    default:
      memory: "512Mi"  # Not 1Gi - containers lie about memory usage
      cpu: "200m"      # Not 100m - too low causes throttling cascades

Why these numbers:

  • 512Mi prevents most OOMKills while allowing density
  • 200m CPU prevents throttling death spirals
  • Anything lower and you get mystery performance issues
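
A LimitRange only applies in the namespaces it's created in, so it has to go everywhere you want defaults enforced. A sketch, assuming the manifest above is saved as limits.yaml:

## Apply the defaults to every namespace except the system ones
for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
  case "$ns" in kube-system|kube-public|kube-node-lease) continue ;; esac
  kubectl apply -f limits.yaml -n "$ns"
done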

Monitoring That Doesn't Cry Wolf

Kubernetes Monitoring Dashboard

Most Kubernetes alerts are garbage. Here's what actually matters:

Alert if etcd latency > 100ms for 2 minutes

expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.1

This saved my ass twice. etcd gets slow before it dies completely.

Alert if API server error rate > 1% for 1 minute

expr: (rate(apiserver_request_total{code=~"5.."}[5m]) / rate(apiserver_request_total[5m])) > 0.01

Don't wait for it to completely fail.

Alert if node memory > 90% for 5 minutes

expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.9

At 90%, you're about to start killing pods.

Disaster Testing (Or How I Learned to Stop Worrying)

Test your recovery procedures or they won't work when you need them.

Monthly chaos tests:

  • Kill random pods during business hours
  • Stop etcd on one master node
  • Fill up disk space on a worker node
  • Disconnect nodes from network
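
Rough shell versions of those monthly tests, assuming a kubeadm control plane and a throwaway namespace (chaos-test here is illustrative):

## Kill a random pod in the throwaway namespace
kubectl delete -n chaos-test \
  "$(kubectl get pods -n chaos-test -o name | shuf -n 1)"

## Stop etcd on one master (kubeadm runs it as a static pod)
sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/etcd.yaml.parked
## ...watch what breaks, then put it back:
sudo mv /tmp/etcd.yaml.parked /etc/kubernetes/manifests/etcd.yaml

## Fill disk on a worker (delete the file when the test is done)
fallocate -l 10G /var/tmp/chaos-fill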

Quarterly disaster drills:

  • Restore an etcd snapshot somewhere and verify it actually works
  • Take down a single master and recover it

I've seen teams with perfect backup scripts that never tested restores. Don't be that team.

Configuration That Doesn't Drift

Store everything in Git. I don't care if it's "just a small change" - if it's not in Git, it didn't happen.

## Good - change tracked in Git
kubectl apply -f deployment.yaml

## Bad - change lost forever
kubectl edit deployment my-app
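
Even without a GitOps tool, you can catch drift cheaply before applying - assuming deployment.yaml is the file you think is live:

## A non-zero exit and a diff means someone changed the live object by hand
kubectl diff -f deployment.yaml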

Use GitOps tools like ArgoCD or Flux. They'll catch configuration drift and fix it automatically.

The Hard Truth About Prevention

Most outages happen because:

  1. Someone tried to "quickly fix" something in production
  2. Disk space ran out because nobody monitored it
  3. Certificates expired because nobody tracked them
  4. Resource limits were too low because "it worked in staging"

Prevention isn't glamorous. It's boring scripts that check disk space and certificate expiration dates. It's saying "no" to emergency production changes.

But boring scripts don't wake you up at 3am.
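
For what it's worth, two of those boring checks fit in a cron job - paths assume a kubeadm-style cluster:

## Is the disk under etcd filling up?
df -h /var/lib/etcd

## When does the API server certificate expire?
sudo openssl x509 -enddate -noout -in /etc/kubernetes/pki/apiserver.crt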

Data Protection Strategy

What Actually Causes Outages

Kubernetes Failure Patterns

After fixing production Kubernetes clusters, here's what actually breaks them:

Most of the time it's disk space

  • etcd logs grow until disk is full (happened twice because logrotate wasn't configured)
  • Docker images pile up on nodes
  • Log files from chatty applications fill /var/log

Resource limits set too low

  • Apps get OOMKilled under load
  • CPU throttling causes cascading failures
  • Nobody tested with realistic production data

Upgrade disasters

Certificate expiration

The boring stuff kills you more than exotic edge cases.

FAQ: When Kubernetes Recovery Goes Wrong

Q

kubectl stopped working, now what?

A

When kubectl dies, the API server is probably dead.

Don't waste time troubleshooting kubectl. SSH to the master nodes and use:

  • crictl for containers (works when docker doesn't)
  • systemctl status kubelet (shows if kubelet crashed)
  • Direct etcdctl commands (bypasses API server)

Pro tip: Install these tools on a jump box that's NOT part of the cluster. You'll need them when everything else is down.
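
The crictl basics cover most of it (the container ID comes from the ps output):

## Every container, including dead ones
sudo crictl ps -a

## Works even when the API server is down
sudo crictl logs <container-id>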
Q

Can I recover if all masters die?

A

Maybe. Depends on your backup strategy and how they died.

With etcd backups: Maybe. Depends if your backups are corrupted. Last time took 6 hours because I had to try 3 different backup files before finding one that worked.

Without backups: You're fucked. Start spinning up a new cluster and hope you committed your YAML files to git.

Power failure killed everything: 50/50 chance your etcd data is corrupted. Had one where fsck fixed it, another where we had to rebuild everything.

Q

How do I know if it's resource exhaustion or hardware failure?

A

Resource exhaustion? Pods die slowly, you'll see OOMKilled everywhere, and nodes start bitching about memory pressure. The system gets slow before it dies completely.

Hardware failure? Entire nodes vanish instantly, network times out to specific boxes, disk I/O errors in logs. Everything was fine, then it wasn't.

Mixed failures (the worst kind): Start with hardware, then resource exhaustion cascades. Check both.

Q

Should I restart pods showing CrashLoopBackOff?

A

Hell no.

CrashLoopBackOff means the pod is already restarting itself every few minutes. Restarting it manually just resets the backoff timer.

Instead:

  1. Check logs: kubectl logs <pod> --previous (shows logs from before the crash)
  2. Check dependencies (database, external APIs)
  3. Check resource limits (common cause)
  4. Fix the actual problem

Don't just restart and hope. I've seen teams restart the same failing pod 50 times instead of reading the logs.
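
The two commands that answer most CrashLoopBackOff mysteries:

## Output from the container that just crashed
kubectl logs <pod> --previous

## Events: OOMKilled, failed probes, image pull errors
kubectl describe pod <pod>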

Q

Cordon vs drain - what's the difference?

A

Cordon: Stops new pods from scheduling on the node. Existing pods stay put.

Drain: Kicks all pods off the node AND prevents new ones.

Use cordon when the node is sick but not dying. Use drain when it's fucked and needs to be taken out of service.

Q

How long will this recovery take?

A

Depends how fucked things are:

  • etcd corruption with backups: 2-6 hours if you're lucky
  • etcd corruption without backups: A day or more rebuilding everything
  • Resource exhaustion: 30 minutes if you kill stuff fast, 3 hours if you try to be surgical
  • Complete cluster death: No idea. Could be 4 hours, could be 2 days

Rule of thumb: Tell management double your estimate. Then double that again.

Q

When do I call for help?

A

Call cloud support for:

  • Managed control plane issues (EKS/GKE/AKS)
  • Infrastructure problems (networking, storage)
  • When you've been stuck for 2+ hours

Don't call support for:

  • App-level failures
  • Config mistakes you made
  • Resource limit issues
  • Basic etcd problems

Most cloud support can't help with cluster internals anyway.
Q

What alerts actually matter during outages?

A

Critical alerts:

  • API server down or >2s response time
  • etcd cluster unhealthy or >200ms latency
  • Node memory >95% (not 85%, that's too early)
  • Multiple pods OOMKilled in 5 minutes

Ignore during outages:

  • Individual pod failures (you have bigger problems)
  • CPU utilization (memory kills you first)
  • Slow applications (fix the cluster first)
Q

Can I just throw more resources at the problem?

A

Usually not.

Most outages aren't about needing more resources - they're about:

  • Disk space filling up
  • Certificates expiring
  • Config mistakes
  • Things breaking during upgrades

Adding more CPU/memory won't fix etcd corruption or expired certificates.

Q

How often should I test disaster recovery?

A

Monthly: Kill random pods, test basic procedures

Quarterly: etcd restore testing, single master failure

Yearly: Complete cluster rebuild drill

Most important: Test your backups. I've seen teams with perfect backup scripts that created corrupted files for 6 months.

Q

How do I explain this outage to management?

A

During the outage:

Give padded estimates. Say "4-6 hours" when you think it'll take 2. You'll always be wrong anyway.

After the outage: They don't care about etcd internals. They care about "how do we prevent this" and "who's responsible for monitoring." Write a post-mortem with action items and owners. Make it boring so they stop asking questions.

Q

When is the cluster actually recovered?

A

Not when kubectl works again. That's just the beginning.

Actually recovered when:

  • All applications are healthy for 30+ minutes
  • No weird errors in logs
  • Performance is back to normal
  • You can deploy new stuff without issues
  • Resource usage looks normal

I've seen "recovered" clusters die again 20 minutes later because the underlying problem wasn't fixed.
