What etcd Actually Does (And Why It'll Piss You Off Sometimes)

[Figure: etcd Client Architecture]

etcd stores the stuff that keeps distributed systems working - think of it as a shared brain for your cluster that occasionally gets amnesia. Originally built by CoreOS before Red Hat ate them, it's now a CNCF project that everyone uses and no one wants to debug.

The name comes from Unix's /etc directory plus "d" for distributed - because someone thought "hey, let's take the folder where all the important config lives and make it work across multiple machines with consensus algorithms." What could go wrong?

Why etcd Exists and Why You Can't Avoid It

etcd uses the Raft consensus algorithm to make sure all your nodes agree on what's true. This sounds great until you realize that "strong consistency" means your writes can completely stop during network hiccups while the cluster has an identity crisis about who's in charge.

The official etcd docs give the polite rationale for why the world needed yet another distributed key-value store, but the real reason is that ZooKeeper's complexity was driving people insane with JVM heap tuning and Jute protocol debugging.

During leader elections, your entire cluster becomes temporarily unavailable for writes while nodes vote on who gets to be in charge next. It's like a distributed democracy, except when there's an election, nobody can get anything done.

Unlike Redis or other eventually-consistent stores, etcd won't lie to you about data freshness. Every read gets you the latest committed write, which is fantastic for cluster state but will bite you in the ass when availability matters more than consistency.

The etcd Guarantee: Your data is either perfectly consistent across all nodes, or your writes are completely fucked until the cluster sorts itself out. No middle ground.

Features That Actually Matter in Production

Multi-Version Concurrency Control (MVCC): etcd keeps old versions of your data around so you can see what changed when. This is brilliant for debugging disasters and implementing atomic transactions that don't leave your system in a half-broken state. Unlike MySQL's MVCC that gets confused with long-running transactions, etcd's revision-based approach is actually sane.
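
If you've never poked at revisions directly, here's a minimal etcdctl sketch (assuming a reachable cluster on the default endpoint; the key name and the revision number 42 are made up - use a mod_revision from your own output):

```bash
# Each write bumps the cluster-wide revision counter
etcdctl put /demo/config "v1"
etcdctl put /demo/config "v2"

# Current value plus its metadata: create_revision, mod_revision, version
etcdctl get /demo/config -w json

# Read the key as it looked at an older revision (must not be compacted yet)
etcdctl get /demo/config --rev=42
```

The --rev flag is what turns "what did this look like before the outage" into a five-second question instead of an archaeology project.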

Watch Events: Instead of polling etcd every few seconds like a maniac, you can watch for changes and get real-time notifications. Works great until you have a network partition and miss half the events. The watch implementation uses gRPC streaming, which is infinitely better than WebSocket-based solutions that fall apart under load.
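
A rough sketch of the pattern (the prefix and revision number are placeholders); the --rev flag is how you paper over the missed-events problem after a reconnect, as long as compaction hasn't already eaten that history:

```bash
# Stream every change under a prefix instead of polling
etcdctl watch /services/ --prefix

# After a disconnect, resume from the last revision you processed so you
# don't silently lose events (fails if that revision has been compacted)
etcdctl watch /services/ --prefix --rev=1234
```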

Lease System: Built-in TTL support for distributed locks and leader election. Your services can grab a lock, do their thing, and automatically release it if they crash. Much better than Redis-based locking that leaves you with orphaned locks from dead processes, or DynamoDB's conditional writes that cost you money every time they fail.
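
The basic lease dance with etcdctl looks roughly like this (the lease ID and key are illustrative - yours will differ):

```bash
# Grant a 30-second lease; etcdctl prints the lease ID (the hex ID below is made up)
etcdctl lease grant 30

# Attach a key to the lease - when the lease dies, the key dies with it
etcdctl put /locks/batch-job worker-1 --lease=694d77aa9e6b8a01

# Keep the lease alive while the process is healthy; kill this and the key
# disappears about 30 seconds later, lock released, no orphans
etcdctl lease keep-alive 694d77aa9e6b8a01
```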

etcd v3.6 - Actually Fixed Some Shit

[Figure: etcd Deployment Diagram]

etcd v3.6 fixed some annoying shit that made earlier versions painful:

  • Memory leaks mostly gone: v3.5 would slowly eat 2GB+ RAM over weeks. v3.6 doesn't do this as much, though you still need to watch memory usage.
  • Better disk handling: Fewer random timeouts when storage gets slow. Still needs fast SSDs though.
  • Can actually downgrade: If v3.6 breaks your production, you can roll back to v3.5 without rebuilding the entire cluster.

Real talk: The robustness testing they added uncovered a bunch of edge cases that could cause data inconsistency. These Jepsen-style tests found issues that traditional unit tests completely missed. etcd v3.6 is probably the most stable version they've ever released, which isn't saying much if you've dealt with earlier versions.

The project now has 5,500+ contributors from 873 companies, which means either everyone loves etcd or everyone needs it to work and is contributing fixes for their own survival. The GitHub activity shows this isn't some abandoned project - there are real people fixing real problems.

Anyway, here's where it gets painful. Production etcd will teach you why distributed systems suck.

Production Reality - Where etcd Shines and Where It'll Ruin Your Weekend

[Figure: etcd Cluster Architecture]

etcd runs the infrastructure behind some massive operations, but the adopter list doesn't mention the 3am pages when clusters split-brained during routine network maintenance. Companies like Google, Amazon, and Microsoft all run managed Kubernetes services built on etcd - they just hide the operational nightmare behind their SLAs.

The Kubernetes Marriage (For Better and Worse)

[Figure: Kubernetes Architecture with etcd]

etcd is Kubernetes' only supported datastore, and it holds everything that keeps the cluster running:

  • Every pod definition and its current state
  • All service endpoints and load balancer configs
  • ConfigMaps, Secrets, and RBAC rules
  • Resource quotas and network policies

This tight coupling means when etcd has a bad day, your entire Kubernetes cluster becomes an expensive collection of confused servers. I've watched a 200-node cluster become completely unmanageable because etcd ran out of disk space - the API server couldn't write new state, so kubectl basically stopped working. The Kubernetes documentation tries to warn you about this, but nobody reads warnings until production breaks.
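
When etcd hits its space quota it raises a NOSPACE alarm and refuses writes, which looks a lot like that disk-full scenario from the API server's point of view. The usual recovery is compact, defrag, disarm - a sketch against the default local endpoint (a genuinely full disk also needs actual space freed first):

```bash
# Grab the current revision, then throw away the history before it
rev=$(etcdctl endpoint status -w json | grep -o '"revision":[0-9]*' | grep -o '[0-9]\+' | head -1)
etcdctl compaction "$rev"   # some docs spell this "compact"; check etcdctl --help

# Compaction frees logical space; defrag hands it back to the filesystem (run per member)
etcdctl defrag

# Clear the NOSPACE alarm so the API server can write again
etcdctl alarm disarm
```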

The real performance story: etcd handles Kubernetes API patterns well, but those patterns are mostly small writes with occasional bulk reads. Scale your cluster to 1000+ nodes and watch etcd struggle with the constant churn of pod updates and status changes. The Kubernetes scalability documentation mentions these limits, but doesn't tell you how painful they are to hit. Red Hat's testing shows the real bottlenecks.

Financial Services - Where Split-Brain Actually Costs Money

[Figure: etcd High Availability Setup]

Companies like Fidelity and Ant Financial use etcd because they literally cannot afford data inconsistency. In trading systems, a split-brain scenario where different parts of your cluster see different states can cost millions in minutes. JP Morgan's trading infrastructure and Goldman Sachs' risk systems rely on this kind of consistency.

Why banks tolerate etcd's quirks: the same strict consistency that stalls writes during a partition also guarantees that no two nodes ever disagree about state - and in trading, a stalled write is far cheaper than a wrong one.

The hidden costs: etcd's quorum math (3, 5, or 7 nodes) and its sensitivity to cross-datacenter latency make disaster recovery expensive. You need dedicated fiber connections between sites, not just internet connections. The network requirements are stricter than most financial firms expect.

Real Performance Numbers (Not Marketing Bullshit)

[Figure: etcd Performance Monitoring - Grafana "Etcd Cluster Overview" dashboard]

Recent benchmarks show what etcd actually delivers:

  • Write performance: Maybe 10K writes/sec on good hardware with NVMe SSDs. Spinning disks or slow network = constant leader elections.
  • Read latency: Can be sub-millisecond if everything is perfect. Usually isn't.
  • Failover time: Around 10 seconds during leader elections. No writes during that time.
  • Storage limits: Starts getting slow around 2GB. The default quota exists for good reasons.
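
If you want to see what your hardware actually does instead of trusting someone else's blog post, etcd ships its own load generator in tools/benchmark. A sketch loosely following the upstream performance docs - the endpoint address and key are placeholders, and flag names can drift between releases:

```bash
# Small sequential writes against the leader - roughly the Kubernetes churn pattern
benchmark --endpoints=http://10.0.0.1:2379 --target-leader \
  --conns=100 --clients=100 \
  put --key-size=8 --val-size=256 --sequential-keys --total=100000

# Linearizable (quorum) reads, which is what the API server mostly issues
benchmark --endpoints=http://10.0.0.1:2379 \
  --conns=100 --clients=100 \
  range /registry --consistency=l --total=100000
```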

v3.6 improvements are real but incremental. Memory usage is better and compaction doesn't fail as much.

How People Actually Use etcd (Beyond the Happy Path)

Service Discovery: Works great until you have flapping network connections. Services register, die, and re-register faster than your monitoring can keep up. etcd's lease system helps but doesn't solve the fundamental problem of network instability.

Configuration Management: Real-time config updates are dangerous. Push one bad config and every service breaks at once. Seen teams take down prod by pushing invalid JSON that etcd accepted but applications couldn't parse.
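
The cheapest insurance is refusing to push anything your services can't parse. A minimal sketch - the file name and key path are made up:

```bash
# jq exits non-zero on broken JSON - bail before etcd ever sees it
jq empty app-config.json || { echo "refusing to push broken config" >&2; exit 1; }

# Push it, then note the mod_revision so you can fetch the previous version
# with --rev if the rollout goes sideways
etcdctl put /config/payments-service "$(cat app-config.json)"
etcdctl get /config/payments-service -w json | grep -o '"mod_revision":[0-9]*'
```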

Distributed Locking: etcd's locking primitives are solid, but debugging distributed lock contention is a nightmare. When your critical job won't start because some dead process is holding a lock, you'll be digging through etcd logs at 3am trying to figure out which lease expired but didn't clean up properly.
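
When that happens, the fastest path is mapping lock keys back to their leases - a sketch where the prefix and lease ID are illustrative:

```bash
# Which lock keys exist, and which lease is each one pinned to?
# Note: -w json prints the lease as a decimal number; the lease commands
# want it in hex (printf '%x\n' <decimal> converts it)
etcdctl get /locks/ --prefix -w json

# How long does that lease have left, and which keys does it own?
etcdctl lease timetolive 694d77aa9e6b8a01 --keys

# Nuclear option: revoke the lease and every key attached to it disappears
etcdctl lease revoke 694d77aa9e6b8a01
```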

Leader Election: etcd handles this well, but leader changes during network partitions can cause brief outages. Your services need to be designed for leadership changes, not just assume the leader is stable.
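
etcdctl even ships election primitives if you want to kick the tires before wiring this into code. A sketch - the election name is arbitrary, and the -l observe flag may vary by release, so check etcdctl elect --help:

```bash
# Campaign for leadership under an election name; blocks for as long as you hold it
etcdctl elect bg-worker "$(hostname)"

# In another terminal, watch who currently holds the election
etcdctl elect -l bg-worker
```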

The 168,800+ contributions mostly come from people fixing edge cases they hit in production - which tells you something about how many edge cases exist in distributed consensus systems.

Running etcd in production is where the theory falls apart.

Operational Reality - What Nobody Tells You About Running etcd in Production

[Figure: etcd Cluster Backup Process]

Running etcd in production is like babysitting a particularly demanding toddler that occasionally throws tantrums and breaks all your stuff. Here's what you'll learn the hard way. The etcd operational guide glosses over most of these realities.

etcd's Disk Obsession Will Ruin Your Life

etcd has a pathological obsession with disk write latency. Seriously - it syncs every single write to disk before acknowledging success, which means:

  • Use spinning drives and die: Anything over 10ms write latency will cause constant leader elections and write timeouts. I've seen teams spend weeks debugging "mysterious" cluster issues only to discover they were using shitty AWS EBS gp2 volumes. Switch to gp3 with provisioned IOPS or prepare for pain.
  • NVMe SSDs or go home: You want sub-1ms write latency consistently. That nice network-attached storage solution your storage team loves? Forget about it. Local NVMe or Azure Premium SSD are your only real options.
  • Dedicated volumes are mandatory: Don't share storage with anything else. That backup job running at 3am will cause write spikes that make etcd lose its mind. Storage performance isolation isn't optional.

Had a cluster shit itself every night at 2am. Took forever to figure out someone put database backups on the same volume as etcd. Disk latency would spike and etcd would lose its mind electing new leaders. Worked fine for months, then suddenly nothing worked during backup windows.
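
Before blaming etcd, measure the volume the way etcd uses it. The standard check (referenced from the etcd docs) is an fio run that fdatasyncs every write like the WAL does - point --directory at a directory you've created on the actual volume you plan to give etcd:

```bash
# 2300-byte writes with an fdatasync after each one, roughly etcd's WAL pattern.
# Look at the fsync/fdatasync latency percentiles: p99 should stay under ~10ms.
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd-disk-test --size=22m --bs=2300 --name=etcd-io-check
```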

The Memory Leak Chronicles (Before v3.6)

[Figure: etcd Memory Usage Patterns]

etcd v3.5 and earlier had a delightful habit of eating memory like a starving teenager:

  • Base usage: 512MB minimum, but it would slowly creep up to 2GB+ over weeks. The Go garbage collector couldn't keep up with etcd's allocation patterns.
  • Watch subscriptions: Each client watching for changes consumed memory that wasn't released properly. Kubernetes controllers making lots of watch connections would slowly kill etcd.
  • Compaction delays: When automatic compaction failed, memory usage would explode. The boltdb freelist would grow without bounds.

The v3.6 memory fixes cut usage by 50%, but you still need to monitor this shit religiously. Set alerts for memory usage above 1GB per node or you'll get surprise OOM kills during peak load. Prometheus monitoring makes this easier, but the memory metrics are confusing as hell.
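
The raw numbers are all sitting on the /metrics endpoint anyway. A quick sketch, assuming metrics are served on the default client port without TLS - yours may live on a separate --listen-metrics-urls listener:

```bash
# Resident memory, Go heap in use, and the bbolt file size - the three numbers
# that tell you whether you're heading for an OOM kill or a quota alarm
curl -s http://127.0.0.1:2379/metrics | \
  grep -E '^(process_resident_memory_bytes|go_memstats_heap_inuse_bytes|etcd_mvcc_db_total_size_in_bytes)'
```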

Network Partitions - Where Consistency Becomes Your Enemy

etcd's Raft implementation prioritizes consistency over availability, which sounds great until your network hiccups and suddenly nothing works:

Split-brain prevention: Only the partition with majority nodes (2 out of 3, or 3 out of 5) can process writes. The minority partition goes read-only and sulks until the network heals.

Leader election chaos: During network instability, etcd constantly re-elects leaders. Each election means 5-10 seconds of no writes. I've watched clusters flip leaders 20+ times during a "brief" network maintenance window.

Cross-datacenter pain: Want geo-distributed etcd? Hope you have dedicated fiber between sites. Any latency over 50ms round-trip and you'll get constant election timeouts.
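
If you're stuck with a stretched cluster, the knobs that matter are the heartbeat interval and the election timeout. A sketch with illustrative values for a ~50ms RTT - the defaults are 100ms and 1000ms, and etcd's tuning guide keeps the election timeout around 10x the heartbeat:

```bash
# Stretch the Raft timers so routine cross-site jitter doesn't look like a dead leader
etcd --name infra-a --data-dir /var/lib/etcd \
     --heartbeat-interval=250 --election-timeout=2500
     # ...plus your usual peer/client URL and --initial-cluster flags
```

The trade-off is slower failure detection: a genuinely dead leader now takes a couple of seconds longer to replace.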

Certificate Hell and Security Theater

etcd's TLS support works great until certificates expire:

  • Certificate rotation nightmare: etcd supports hot-reloading certs, but getting the timing right without downtime is an art form
  • CA validation: Screw up your certificate authority setup and the entire cluster stops talking to itself
  • Client cert validation: Every application needs valid certs, and debugging "certificate not trusted" errors at 3am is not fun

Pro tip: Set calendar reminders 30 days before certificate expiry. I've seen entire Kubernetes clusters go down because etcd certificates expired over the weekend.
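
Two checks worth scripting into your monitoring, with paths as placeholders (kubeadm clusters keep theirs under /etc/kubernetes/pki/etcd/):

```bash
# How long until this certificate takes the cluster down with it?
openssl x509 -noout -enddate -in /etc/etcd/pki/server.crt

# Prove client auth actually works end to end, not just that the files exist
etcdctl --endpoints=https://10.0.0.1:2379 \
  --cacert=/etc/etcd/pki/ca.crt \
  --cert=/etc/etcd/pki/client.crt \
  --key=/etc/etcd/pki/client.key \
  endpoint health
```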

The Monitoring Shitshow

etcd's health endpoints in v3.6 are better but still not great. You'll want to set up proper Grafana dashboards because etcd's built-in metrics are useless when your cluster is melting down at 3am.

Critical metrics to monitor or die:

  • Raft proposal commit rate: If this drops to zero, your cluster is fucked
  • Backend commit duration: Spikes above 100ms mean storage problems
  • Database size: etcd slows down as data grows - alert when you hit 1GB
  • Network round-trip time: Between cluster members - spikes indicate network issues

The metrics that actually matter: Forget the official recommendations. Watch for etcd_server_is_leader flapping, etcd_disk_wal_fsync_duration_seconds spiking, and etcd_mvcc_db_total_size_in_bytes growing without bound.
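
A quick sweep you can run before the dashboards even load (assumes etcdctl can reach one member; --cluster makes it query everyone in the member list):

```bash
# Leader flag, raft term, db size, and errors for every member, in one table
etcdctl endpoint status --cluster -w table
etcdctl endpoint health --cluster

# Any standing NOSPACE or CORRUPT alarms?
etcdctl alarm list
```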

Backup and Recovery - Hope You Never Need It

etcd's snapshot functionality works until you actually need to restore:

```bash
# This works great in testing
etcdctl snapshot save backup.db
etcdutl snapshot restore backup.db
# In production at 3am with the CEO breathing down your neck? Good luck.
```

Restore gotchas:

  • Restored clusters get new cluster IDs, breaking existing client connections
  • You lose all data since the snapshot was taken (obviously, but people forget)
  • Restoration requires downtime for the entire cluster - no rolling restores

Test your backups: Not "make sure they exist" but actually restore them to a test cluster monthly. I've seen teams discover their 6-month-old backup strategy was broken when they needed it most.
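
Here's roughly what a restore drill looks like, sketched with made-up endpoints, cert paths, and member names - every member of the restored cluster needs its own restore run with matching --initial-cluster flags:

```bash
# Take the snapshot from a healthy member over TLS, then sanity-check it
snap=/backups/etcd-$(date +%F).db
etcdctl --endpoints=https://10.0.0.1:2379 \
  --cacert=/etc/etcd/pki/ca.crt --cert=/etc/etcd/pki/client.crt --key=/etc/etcd/pki/client.key \
  snapshot save "$snap"
etcdutl snapshot status "$snap"

# Rebuild a data dir for one member of the new (test) cluster - note that this
# creates a new cluster ID, which is why existing clients have to reconnect
etcdutl snapshot restore "$snap" \
  --name infra-a \
  --data-dir /var/lib/etcd-restored \
  --initial-cluster infra-a=https://10.0.0.1:2380 \
  --initial-advertise-peer-urls https://10.0.0.1:2380
```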

v3.6 Fixes Some Shit (Finally)

The robustness testing they added actually uncovered several nasty bugs where etcd would silently corrupt data or hang indefinitely during specific failure scenarios.

What's actually better in v3.6:

  • Memory leaks mostly fixed (so far)
  • Better handling of disk full scenarios
  • Downgrade support for when upgrades go wrong
  • Improved compaction that doesn't randomly fail

etcd v3.6 is probably the first version I'd trust in production without months of testing. Earlier versions were educational.

Frequently Asked Questions (And the Answers That Actually Help)

Q: Why does my etcd cluster keep electing new leaders?

A: Your disk is too slow. etcd syncs every write to disk, and if that takes too long (over 10ms) the rest of the cluster assumes the leader is dead and starts a new election. Check etcd_disk_wal_fsync_duration_seconds - anything over 100ms means your storage sucks. Use NVMe SSDs or suffer: AWS EBS GP2 volumes are garbage for etcd, so get GP3 with provisioned IOPS or local NVMe.

Q: What happens when etcd goes down?

A: kubectl stops working. Your cluster can't schedule new pods, update services, or make any changes to cluster state. Everything just freezes until etcd comes back.

Q: How do I know if my etcd cluster is about to die?

A: Watch these metrics like your career depends on it:

  • etcd_server_is_leader flapping between 0 and 1 = leader election chaos
  • etcd_disk_wal_fsync_duration_seconds over 0.1 = disk problems incoming
  • etcd_mvcc_db_total_size_in_bytes approaching 2GB = time to increase quotas or clean up
  • etcd_network_peer_round_trip_time_seconds spiking = network issues

Q: Can I run etcd on spinning disks?

A: Technically yes, realistically no. You'll spend more time debugging random leader elections and timeouts than actually running services. etcd syncs every write to disk - slow disk equals sad etcd equals angry ops team.

Q: Why is my etcd cluster eating all my RAM?

A: Older etcd versions had memory leaks. v3.6 fixed most of them, but watch out for:

  • Compaction failing and revisions building up
  • Applications making too many watch connections
  • Storing large values (keep it under a few KB per key)

Q: What's this "leader election timeout" bullshit?

A: etcd uses Raft consensus, which means one node is the leader and handles all writes. When nodes can't talk to each other fast enough (network latency, disk latency, or just bad luck), they assume the leader is dead and start a new election. During elections, no writes happen - your cluster is essentially frozen.

Q: How do I fix split-brain scenarios?

A: You don't "fix" them, you prevent them by using odd numbers of nodes (3, 5, 7). During network partitions, only the side with majority nodes stays active. The minority nodes go read-only and wait. This is by design - better to have no writes than inconsistent writes.

Q: My backups are failing - what's wrong?

A: Probably one of these fun issues:

  • etcd ran out of disk space (it needs space for snapshots)
  • The backup script doesn't have proper permissions on the etcd data directory
  • You're trying to back up while compaction is running (timing is everything)
  • Network issues preventing snapshot transfer

Use etcdctl snapshot save and actually test restores. Teams always discover their backup scripts are broken during disasters.

Q: Can I use etcd for my application database?

A: No. etcd is for cluster metadata, not app data. Use PostgreSQL for your application and etcd for:

  • Service discovery registration
  • Distributed lock coordination
  • Configuration that must be consistent across services
  • Leader election for background jobs

Q: How do I debug etcd performance issues?

A: Start with the obvious shit first:

  1. Check disk latency with iostat -x 1 - anything over 10ms write latency is bad news
  2. Look at etcd logs for leader election messages
  3. Check network latency between cluster members
  4. Monitor compaction - if it's failing, memory usage explodes
  5. Verify you're not hitting the 2GB default quota

Most performance issues are disk or network problems, not etcd itself being slow.

Q: What's the deal with etcd certificates expiring?

A: etcd uses TLS for everything in production, and certificates expire. When they do, the cluster members stop talking to each other and your Kubernetes cluster dies with them. Set calendar reminders for 30 days before expiry and practice certificate rotation in staging first.

Q: Should I use etcd v3.6?

A: Yes, unless you enjoy debugging memory leaks and random data corruption. v3.6 has actual robustness testing that found and fixed several nasty bugs. It's the first etcd version I'd trust in production without months of testing.

Q: How many nodes do I really need?

A: 3 nodes for most use cases. 5 nodes if you're paranoid or have truly critical workloads. 7 nodes only if you're running a global financial system and can tolerate the increased consensus overhead. More nodes = more things that can break and slower writes due to consensus requirements.

etcd vs Alternatives - The Honest Comparison

| Feature | etcd | ZooKeeper | Consul | Redis Cluster | DynamoDB |
|---|---|---|---|---|---|
| Consensus Algorithm | Raft | ZAB | Raft | Gossip protocol | AWS magic |
| Consistency | Strong (writes stop during partitions) | Strong (same problem) | Strong | Eventual (lies to you) | Eventual (lies to you) |
| When It Breaks | Network partitions = no writes | Network partitions = no writes | Network partitions = no writes | Data inconsistency | AWS is down |
| API | gRPC (good) | Custom Java shit | HTTP REST | Redis protocol | HTTP REST |
| Operational Pain | Disk latency obsession | JVM tuning nightmare | Complex service mesh | Memory management hell | AWS billing surprises |
| Data Limits | ~2GB (gets slow after that) | ~1GB (JVM heap limits) | ~1GB | Hundreds of GB | Unlimited (for a price) |
| Disaster Recovery | Manual snapshots | Manual snapshots | Manual snapshots | Redis persistence | AWS handles it |
| Security | TLS + RBAC | ACLs (if you configure them) | ACL + TLS | Password (lol) | IAM |
| Real Latency | RTT + disk fsync | RTT + JVM GC pauses | RTT + gossip delays | Sub-ms (until split-brain) | 1-10ms + network to AWS |
| Memory Usage | 512MB-2GB (v3.6 fixed leaks) | JVM heap (good luck) | 1-4GB depending on features | RAM = your data size | Not your problem |
