When Kubernetes Dies (And Why Standard Docs Won't Help)

Kubernetes Disaster Recovery

etcd dies, everything dies. When etcd corrupts, your kubectl returns "unable to connect to server" and you're fucked.

Most official troubleshooting guides assume your cluster is healthy enough to run diagnostics. Reality: when clusters fail, the tools you need to debug them stop working.

Here's what actually happens when clusters fail:

etcd Dies: kubectl returns ECONNREFUSED. Pods keep running but you can't manage anything. Had this happen when someone ran apt upgrade on a master node and etcd went from 3.4.22 to 3.5.1 without the data migration.

Multiple Masters Fail: Lost quorum means rebuilding from scratch unless your backup script actually worked. Spoiler: it probably didn't write to the right directory.

Power Failures: UPS died at 3am, corrupted etcd and the filesystem. Lost 3 months of config because backups were going to the same failed NFS mount.

Skip the theory - here's what works when nothing else does.

The Three Ways Kubernetes Breaks

Complete Control Plane Death

This is the nightmare scenario. API server won't start, etcd is corrupted, kubectl throws connection errors. Usually happens because:

  • Someone fucked up a cluster upgrade
  • Disk space ran out on master nodes (happens more than you'd think)
  • Network issues between etcd members
  • Power outage without proper UPS

Fixed this twice. First time took 8 hours because the etcd user needed read access to the backup directory. Second time took 3 hours because the restore worked but I forgot to update the systemd service file with the new data directory path.

Resource Cascade Failures

Starts with one OOMKilled pod, ends with the entire cluster grinding to a halt. Memory pressure spreads like cancer until nothing can schedule.

How it cascades: Database pod gets OOMKilled → Apps can't connect → Restart storm → Node runs out of memory → kubelet dies → API server can't reach nodes → Everything fails.

The fix is brutal: kill everything non-essential immediately using resource quotas.
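
A minimal sketch of the quota half of that, assuming a hypothetical low-priority namespace called batch-jobs. A hard pod quota of zero won't evict what's already running, but it stops the restart storm from rescheduling anything new while you clean up:

## Freeze scheduling in a low-priority namespace (name is illustrative)
kubectl create quota emergency-freeze --hard=pods=0 -n batch-jobs

## Then actually free the memory by scaling its workloads to zero
kubectl scale deployment --all --replicas=0 -n batch-jobs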

The Slow Death Spiral

Worst because it's hard to catch. Node starts having issues, pods slowly migrate, other nodes get overloaded, more nodes fail. Eventually you're running production on one overloaded node.

Signs: Random pod evictions, slow kubectl responses, deployments timing out.
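
While kubectl still answers, a couple of quick checks confirm you're in the spiral (kubectl top assumes metrics-server is installed):

## Evicted pods pile up long before nodes start dying
kubectl get pods --all-namespaces | grep -c Evicted

## Which nodes are already overloaded
kubectl top nodes

## Recent pressure and eviction events
kubectl get events --all-namespaces --sort-by=.lastTimestamp | tail -20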

Why kubectl Becomes Useless

Kubernetes Architecture Components

When clusters are dying, the tools you need to debug them stop working. kubectl needs the API server. The API server needs etcd. etcd needs healthy nodes and disk space.

So when things go sideways, your primary diagnostic tool becomes a paperweight.

Instead you need SSH access to the nodes and the node-level tools: systemctl, crictl, and etcdctl.

Learned this during a 6-hour outage where kubectl was dead the entire time and I kept trying kubectl get nodes like an idiot instead of just SSHing to the boxes.

The Real Recovery Process

Forget the clean step-by-step guides. Real recovery is messy:

  1. Panic for 5 minutes while you figure out how bad it is
  2. SSH directly to nodes because kubectl is dead
  3. Check if etcd is alive - if not, you're in deep shit
  4. Kill everything non-essential to free up resources
  5. Restore from backup or rebuild

The hardest part isn't technical - it's staying calm while everyone asks for ETAs you can't give.

Emergency Recovery (When kubectl Is Dead)

EKS Backup Levels Diagram

Step 1: Figure Out What's Actually Broken (5-15 minutes)

Don't trust kubectl - it lies when the cluster is dying. SSH directly to nodes and check:

## This works when kubectl doesn't
sudo systemctl status kubelet
sudo crictl ps | grep apiserver

## etcd health - this command will timeout if etcd is dead
sudo etcdctl endpoint health

Reference: Troubleshooting kubelet

WARNING: That etcd command hangs forever if etcd is corrupted. Give it 30 seconds max before killing it.
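
On a kubeadm-built cluster, etcd usually wants client certs too - the bare command just times out against TLS. A version with a hard 30-second cap, assuming the default kubeadm cert paths:

## Same health check, but with certs and a timeout
sudo ETCDCTL_API=3 timeout 30 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key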

etcd Recovery (The Make-or-Break Step)

etcd Cluster Recovery Process

If etcd is dead, everything else is pointless. No shortcuts here.

Single etcd failure is recoverable. Multiple etcd failures means you're rebuilding unless you have recent backups that actually work.

## Stop everything first or you'll corrupt the restore
sudo systemctl stop kubelet

## This restore command fails silently if paths are wrong
sudo etcdctl snapshot restore /path/to/backup.db \
  --data-dir /var/lib/etcd-from-backup

GOTCHA: That restore command creates a NEW data directory, doesn't overwrite the existing one. I spent 2 hours wondering why the restore "worked" but nothing changed.

GOTCHA #2: Restart kubelet too fast and it corrupts the restored data. Wait 60 seconds.
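
And the step I burned hours on: nothing reads the new directory until you tell etcd about it. On a kubeadm cluster etcd runs as a static pod, so the path lives in its manifest rather than a systemd unit - a rough sketch assuming the default layout:

## Back up the manifest, then point both the hostPath and --data-dir
## at the restored directory. kubelet restarts the static pod on its own.
sudo cp /etc/kubernetes/manifests/etcd.yaml /root/etcd.yaml.bak
sudo sed -i 's|/var/lib/etcd|/var/lib/etcd-from-backup|g' \
  /etc/kubernetes/manifests/etcd.yaml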

I've seen this fail because:

  • Wrong backup path (most common) - see etcd backup guide
  • Permissions screwed up (etcd user can't read the backup)
  • Not enough disk space in target directory
  • kubelet restarted before restore finished

Complete Cluster Death Recovery

When everything's fucked and you're rebuilding from scratch:

Step 1: Accept you're going to be here for hours.

Step 2: Stop all kubelets on all nodes

## On every single node
sudo systemctl stop kubelet
sudo systemctl stop docker

Step 3: Pick one master node to be your lifeline

## Clear the data directory
sudo rm -rf /var/lib/etcd/*

## Restore backup (replace <master-ip> with your actual master node IP)
sudo etcdctl snapshot restore /backup/etcd-snapshot.db \
  --name master1 \
  --data-dir /var/lib/etcd \
  --initial-cluster master1=https://<master-ip>:2380 \
  --initial-advertise-peer-urls https://<master-ip>:2380

Step 4: Start services ONE AT A TIME

sudo systemctl start docker
sudo systemctl start kubelet

Wait 5 minutes between each step. I know it's painful but rushing breaks things.
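
Before touching the other nodes, make sure the lifeline master actually came back. A quick sanity check, assuming the kubeadm path for the admin kubeconfig:

## etcd and the API server should reappear as containers
sudo crictl ps | grep -E 'etcd|kube-apiserver'

## Then see if the API answers at all
sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get nodes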

Resource Death Spiral Recovery

When memory pressure kills everything:

Nuclear option - kill all non-essential pods immediately using forced deletion:

## This will hurt but saves the cluster
kubectl delete pods --all -n non-essential-namespace --grace-period=0 --force

## Scale everything down
kubectl scale deployment --all --replicas=0 -n non-essential-namespace

Less nuclear - drain the most fucked node:

kubectl drain <worst-node> --ignore-daemonsets --force --delete-emptydir-data
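
Once that node is healthy again (or replaced), don't forget to let it take pods back:

kubectl uncordon <worst-node>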

OOMKilled Massacre Recovery

When pods are getting OOMKilled faster than they can start:

## Find the worst offenders
kubectl top pods --all-namespaces --sort-by=memory

## Emergency memory increases (double everything)
kubectl patch deployment problem-app -p '{
  "spec": {
    "template": {
      "spec": {
        "containers": [{
          "name": "app",
          "resources": {
            "limits": {"memory": "4Gi"},
            "requests": {"memory": "2Gi"}
          }
        }]
      }
    }
  }
}'

More on resource management.

Pro tip: Don't try to be surgical during an outage. Double the memory limits and fix it properly later.
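
If hand-writing JSON patches at 3am isn't your thing, kubectl set resources does the same doubling in one line (the container name "app" is the illustrative one from the patch above):

## Same effect as the patch, less JSON
kubectl set resources deployment problem-app -c app \
  --limits=memory=4Gi --requests=memory=2Gi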

When Nothing Works

Sometimes you're just fucked. I've been there.

Signs you need to rebuild from scratch:

  • etcd restore fails multiple times
  • API server won't start even with good etcd
  • Nodes keep dying randomly
  • You've been at this for 6+ hours

At that point, save what you can, spin up a new cluster, and restore applications from application backups.

Had one outage where we fought a corrupted cluster for 12 hours before giving up and rebuilding. New cluster was up in 2 hours.

The hardest part of outage recovery isn't the technical stuff - it's knowing when to cut your losses.

Prevention (Because 3am Outages Suck)

Cluster Backup Components

etcd Backups That Actually Work

The official docs show perfect backup scripts that fail in production. Here's what works:

#!/bin/bash
## Real backup script - learned the hard way
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/var/backups/etcd"
mkdir -p "${BACKUP_DIR}"

## Check if etcd is even responding first
if ! etcdctl endpoint health &>/dev/null; then
    echo "etcd is dead, backup will fail"
    exit 1
fi

## Make the backup
etcdctl snapshot save "${BACKUP_DIR}/etcd-${DATE}.db"

## Verify it's not corrupted (this fails more than you'd think)
if ! etcdctl snapshot status "${BACKUP_DIR}/etcd-${DATE}.db" &>/dev/null; then
    echo "Backup is corrupted, trying again"
    rm "${BACKUP_DIR}/etcd-${DATE}.db"
    exit 1
fi
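
Two cheap guards worth bolting onto that script, aimed squarely at the failure modes listed below (the thresholds are arbitrary - adjust to your cluster):

## Before the save: refuse to write onto a nearly-full disk (needs ~1GB free)
if [ "$(df --output=avail "${BACKUP_DIR}" | tail -1)" -lt 1048576 ]; then
    echo "Less than 1GB free in ${BACKUP_DIR}, aborting"
    exit 1
fi

## After the save: a real snapshot is never tiny - catch 0-byte and partial files
if [ "$(stat -c %s "${BACKUP_DIR}/etcd-${DATE}.db")" -lt 1048576 ]; then
    echo "Snapshot is suspiciously small, treating it as failed"
    exit 1
fi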

Things that break this script:

  • etcd under load (backup times out)
  • No disk space (silently creates 0-byte files)
  • Network issues (creates partial backups)
  • Permissions (backup succeeds but can't restore)

Resource Limits (Stop the OOM Massacre)

Generic resource quotas are useless. Here's what actually prevents cascading failures:

## This works - learned from production OOMKilled storms
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits   # name is up to you
spec:
  limits:
  - type: Container
    default:
      memory: "512Mi"  # Not 1Gi - containers lie about memory usage
      cpu: "200m"      # Not 100m - too low causes throttling cascades

Why these numbers:

  • 512Mi prevents most OOMKills while allowing density
  • 200m CPU prevents throttling death spirals
  • Anything lower and you get mystery performance issues
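
A LimitRange only applies in the namespaces it's created in, so it has to go everywhere you want defaults enforced. A sketch, assuming the manifest above is saved as limits.yaml:

## Apply the defaults to every namespace except the system ones
for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
  case "$ns" in kube-system|kube-public|kube-node-lease) continue ;; esac
  kubectl apply -f limits.yaml -n "$ns"
done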

Monitoring That Doesn't Cry Wolf

Kubernetes Monitoring Dashboard

Most Kubernetes alerts are garbage. Here's what actually matters:

Alert if etcd latency > 100ms for 2 minutes

expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.1

This saved my ass twice. etcd gets slow before it dies completely.

Alert if API server error rate > 1% for 1 minute

expr: (rate(apiserver_request_total{code=~"5.."}[5m]) / rate(apiserver_request_total[5m])) > 0.01

Don't wait for it to completely fail.

Alert if node memory > 90% for 5 minutes

expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.9

At 90%, you're about to start killing pods.

Disaster Testing (Or How I Learned to Stop Worrying)

Test your recovery procedures or they won't work when you need them.

Monthly chaos tests:

  • Kill random pods during business hours
  • Stop etcd on one master node
  • Fill up disk space on a worker node
  • Disconnect nodes from network
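
Rough shell versions of those monthly tests, assuming a kubeadm control plane and a throwaway namespace (chaos-test here is illustrative):

## Kill a random pod in the throwaway namespace
kubectl delete -n chaos-test \
  "$(kubectl get pods -n chaos-test -o name | shuf -n 1)"

## Stop etcd on one master (kubeadm runs it as a static pod)
sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/etcd.yaml.parked
## ...watch what breaks, then put it back:
sudo mv /tmp/etcd.yaml.parked /etc/kubernetes/manifests/etcd.yaml

## Fill disk on a worker (delete the file when the test is done)
fallocate -l 10G /var/tmp/chaos-fill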

Quarterly disaster drills:

  • Restore an etcd snapshot somewhere and verify it actually works
  • Take down a single master and recover it

I've seen teams with perfect backup scripts that never tested restores. Don't be that team.

Configuration That Doesn't Drift

Store everything in Git. I don't care if it's "just a small change" - if it's not in Git, it didn't happen.

## Good - change tracked in Git
kubectl apply -f deployment.yaml

## Bad - change lost forever
kubectl edit deployment my-app
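
Even without a GitOps tool, you can catch drift cheaply before applying - assuming deployment.yaml is the file you think is live:

## A non-zero exit and a diff means someone changed the live object by hand
kubectl diff -f deployment.yaml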

Use GitOps tools like ArgoCD or Flux. They'll catch configuration drift and fix it automatically.

The Hard Truth About Prevention

Most outages happen because:

  1. Someone tried to "quickly fix" something in production
  2. Disk space ran out because nobody monitored it
  3. Certificates expired because nobody tracked them
  4. Resource limits were too low because "it worked in staging"

Prevention isn't glamorous. It's boring scripts that check disk space and certificate expiration dates. It's saying "no" to emergency production changes.

But boring scripts don't wake you up at 3am.
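
For what it's worth, two of those boring checks fit in a cron job - paths assume a kubeadm-style cluster:

## Is the disk under etcd filling up?
df -h /var/lib/etcd

## When does the API server certificate expire?
sudo openssl x509 -enddate -noout -in /etc/kubernetes/pki/apiserver.crt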

Data Protection Strategy

What Actually Causes Outages

Kubernetes Failure Patterns

After fixing production Kubernetes clusters, here's what actually breaks them:

Most of the time it's disk space

  • etcd logs grow until disk is full (happened twice because logrotate wasn't configured)
  • Docker images pile up on nodes
  • Log files from chatty applications fill /var/log

Resource limits set too low

  • Apps get OOMKilled under load
  • CPU throttling causes cascading failures
  • Nobody tested with realistic production data

Upgrade disasters

Certificate expiration

The boring stuff kills you more than exotic edge cases.

FAQ: When Kubernetes Recovery Goes Wrong

Q

kubectl stopped working, now what?

A

When kubectl dies, the API server is probably dead.

Don't waste time troubleshooting kubectl. SSH to the master nodes and use:

  • crictl for containers (works when docker doesn't)
  • systemctl status kubelet (shows if kubelet crashed)
  • Direct etcdctl commands (bypasses API server)

Pro tip: Install these tools on a jump box that's NOT part of the cluster. You'll need them when everything else is down.
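
The crictl basics cover most of it (the container ID comes from the ps output):

## Every container, including dead ones
sudo crictl ps -a

## Works even when the API server is down
sudo crictl logs <container-id>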
Q

Can I recover if all masters die?

A

Maybe. Depends on your backup strategy and how they died.

With etcd backups: Maybe. Depends if your backups are corrupted. Last time took 6 hours because I had to try 3 different backup files before finding one that worked.

Without backups: You're fucked. Start spinning up a new cluster and hope you committed your YAML files to git.

Power failure killed everything: 50/50 chance your etcd data is corrupted. Had one where fsck fixed it, another where we had to rebuild everything.

Q

How do I know if it's resource exhaustion or hardware failure?

A

Resource exhaustion? Pods die slowly, you'll see OOMKilled everywhere, and nodes start bitching about memory pressure. The system gets slow before it dies completely.

Hardware failure? Entire nodes vanish instantly, network times out to specific boxes, disk I/O errors in logs. Everything was fine, then it wasn't.

Mixed failures (the worst kind): Start with hardware, then resource exhaustion cascades. Check both.

Q

Should I restart pods showing CrashLoopBackOff?

A

Hell no.

CrashLoopBackOff means the pod is already restarting itself every few minutes. Restarting it manually just resets the backoff timer.

Instead:

  1. Check logs: kubectl logs <pod> --previous (shows logs from before the crash)
  2. Check dependencies (database, external APIs)
  3. Check resource limits (common cause)
  4. Fix the actual problem

Don't just restart and hope. I've seen teams restart the same failing pod 50 times instead of reading the logs.
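
The two commands that answer most CrashLoopBackOff mysteries:

## Output from the container that just crashed
kubectl logs <pod> --previous

## Events: OOMKilled, failed probes, image pull errors
kubectl describe pod <pod>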

Q

Cordon vs drain - what's the difference?

A

Cordon: Stops new pods from scheduling on the node. Existing pods stay put.

Drain: Kicks all pods off the node AND prevents new ones.

Use cordon when the node is sick but not dying. Use drain when it's fucked and needs to be taken out of service.

Q

How long will this recovery take?

A

Depends how fucked things are:

  • etcd corruption with backups: 2-6 hours if you're lucky
  • etcd corruption without backups: A day or more rebuilding everything
  • Resource exhaustion: 30 minutes if you kill stuff fast, 3 hours if you try to be surgical
  • Complete cluster death: No idea. Could be 4 hours, could be 2 days

Rule of thumb: Tell management double your estimate. Then double that again.

Q

When do I call for help?

A

Call cloud support for:

  • Managed control plane issues (EKS/GKE/AKS)
  • Infrastructure problems (networking, storage)
  • When you've been stuck for 2+ hours

Don't call support for:

  • App-level failures
  • Config mistakes you made
  • Resource limit issues
  • Basic etcd problems

Most cloud support can't help with cluster internals anyway.
Q

What alerts actually matter during outages?

A

Critical alerts:

  • API server down or >2s response time
  • etcd cluster unhealthy or >200ms latency
  • Node memory >95% (not 85%, that's too early)
  • Multiple pods OOMKilled in 5 minutes

Ignore during outages:

  • Individual pod failures (you have bigger problems)
  • CPU utilization (memory kills you first)
  • Slow applications (fix the cluster first)
Q

Can I just throw more resources at the problem?

A

Usually not.

Most outages aren't about needing more resources - they're about:

  • Disk space filling up
  • Certificates expiring
  • Config mistakes
  • Things breaking during upgrades

Adding more CPU/memory won't fix etcd corruption or expired certificates.

Q

How often should I test disaster recovery?

A

Monthly: Kill random pods, test basic procedures

Quarterly: etcd restore testing, single master failure

Yearly: Complete cluster rebuild drill

Most important: Test your backups. I've seen teams with perfect backup scripts that created corrupted files for 6 months.

Q

How do I explain this outage to management?

A

During the outage:

Give padded estimates. Say "4-6 hours" when you think it'll take 2. You'll always be wrong anyway.

After the outage: They don't care about etcd internals. They care about "how do we prevent this" and "who's responsible for monitoring." Write a post-mortem with action items and owners. Make it boring so they stop asking questions.

Q

When is the cluster actually recovered?

A

Not when kubectl works again. That's just the beginning.

Actually recovered when:

  • All applications are healthy for 30+ minutes
  • No weird errors in logs
  • Performance is back to normal
  • You can deploy new stuff without issues
  • Resource usage looks normal

I've seen "recovered" clusters die again 20 minutes later because the underlying problem wasn't fixed.
