etcd: AI-Optimized Technical Reference
Core Function and Critical Constraints
etcd is a distributed key-value store built on Raft consensus, and it is the only supported datastore for Kubernetes. Critical failure mode: when etcd breaks, the API server can no longer persist changes - kubectl fails, running workloads keep running, but nothing can be created, updated, or rescheduled.
Data Storage Scope:
- Pod locations and state
- Service endpoints and load balancer configs
- Secrets, ConfigMaps, RBAC rules
- Resource quotas and network policies
Production Configuration Requirements
Storage Requirements (Non-Negotiable)
- Disk type: NVMe SSDs mandatory - spinning disks cause constant leader elections
- Write latency threshold: <10ms fsync latency (anything over 10ms = leader election chaos; verify with the fio check after this list)
- Storage isolation: Dedicated volumes required - shared storage causes write spikes during other operations
- AWS specifics: GP3 with provisioned IOPS or local NVMe only - GP2 burst credits run dry under sustained load and latency blows past the 10ms threshold
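A quick way to check whether a volume can actually hold the sub-10ms fsync target is an fio run that approximates etcd's WAL write pattern (a sketch based on the commonly cited etcd disk check; the test directory and job name are arbitrary):

```bash
# Small sequential writes with an fdatasync after every write,
# roughly matching how etcd persists its write-ahead log.
mkdir -p /var/lib/etcd-disk-test
fio --name=etcd-wal-check \
    --directory=/var/lib/etcd-disk-test \
    --rw=write --ioengine=sync --fdatasync=1 \
    --size=22m --bs=2300

# Read the fsync/fdatasync latency percentiles in the output:
# the 99th percentile should stay under ~10ms (10000 usec).
```

Run this against a scratch path on the candidate volume, not against a live member's data directory, and delete the test directory afterwards.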
Hardware Specifications
- Memory: 512MB minimum, monitor for growth to 2GB+ (v3.5 and earlier had memory leaks)
- Network latency: <50ms round-trip between nodes for stable operation
- Cluster size: 3, 5, or 7 nodes only (odd numbers for majority consensus)
Performance Thresholds and Breaking Points
- Write performance: ~10K writes/sec maximum on optimal hardware
- Storage limits: Performance degrades significantly after 2GB
- Failover time: 5-10 seconds during leader elections (no writes during this period)
- Scale limits: Struggles at 1000+ Kubernetes nodes due to constant pod update churn
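To see where a given cluster actually lands against these numbers, etcdctl ships a built-in benchmark (a sketch; the endpoint and kubeadm-style cert paths are assumptions, and the run generates real write load, so keep it off-peak):

```bash
# Runs a short synthetic workload against the cluster and reports
# pass/fail on throughput and latency for the chosen load size (s/m/l/xl).
ETCDCTL_API=3 etcdctl check perf --load="m" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```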
Critical Failure Scenarios and Recovery
Network Partition Behavior
- Only majority partition remains writable (2/3 or 3/5 nodes)
- Minority partition becomes read-only
- Design tradeoff: complete write failure is preferred over serving an inconsistent state
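During a suspected partition, the fastest way to see which side still has quorum is to ask every member for its health and leader view (a sketch; the endpoint and kubeadm-style cert paths are assumptions):

```bash
# Shared connection flags (adjust for your deployment).
conn=(--endpoints=https://10.0.0.1:2379
      --cacert=/etc/kubernetes/pki/etcd/ca.crt
      --cert=/etc/kubernetes/pki/etcd/server.crt
      --key=/etc/kubernetes/pki/etcd/server.key)

# Members on the minority side report errors or time out;
# members that still have quorum report healthy.
etcdctl "${conn[@]}" endpoint health --cluster

# Leader view and raft terms: a climbing RAFT TERM with no stable
# IS LEADER row means the cluster is stuck re-electing.
etcdctl "${conn[@]}" endpoint status --cluster -w table
```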
Common Production Failures
- Certificate expiration: Entire cluster communication stops
- Disk space exhaustion: Cluster becomes read-only, kubectl fails
- Memory leaks (pre-v3.6): Gradual memory growth to 2GB+ over weeks
- Storage latency spikes: Automatic leader re-elections, write timeouts
Disaster Recovery Requirements
- Backup method: `etcdctl snapshot save backup.db`
- Restoration gotcha: New cluster IDs break existing client connections
- Downtime requirement: Full cluster downtime required for restoration
- Testing mandate: Monthly restore testing to verify backup integrity
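A minimal sketch of the backup/restore cycle (paths, member names, and URLs are illustrative; TLS flags follow kubeadm defaults):

```bash
# --- Backup: safe to run against a live member ---
ETCDCTL_API=3 etcdctl snapshot save /backups/etcd-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# --- Restore: requires full cluster downtime; run on every member ---
# etcdutl seeds a fresh data directory from the snapshot. The rebuilt
# cluster gets a new cluster ID, which is the gotcha above: old peers
# and clients must be pointed at the restored cluster explicitly.
etcdutl snapshot restore /backups/etcd-2025-01-01.db \
  --name etcd-0 \
  --data-dir /var/lib/etcd-restored \
  --initial-cluster etcd-0=https://10.0.0.1:2380 \
  --initial-advertise-peer-urls https://10.0.0.1:2380
```

For a 3-node cluster, list all three peers in `--initial-cluster`, run the restore on each node with its own `--name` and peer URL, then point every member's systemd unit or static pod at the restored data directory and start them together.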
Monitoring and Health Indicators
Critical Metrics (Set Alerts)
- `etcd_server_is_leader` - flapping = leader election chaos
- `etcd_disk_wal_fsync_duration_seconds` - >100ms = storage problems
- `etcd_mvcc_db_total_size_in_bytes` - approaching 2GB = quota/cleanup needed
- `etcd_network_peer_round_trip_time_seconds` - spiking = network issues
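The same metrics can be pulled ad hoc from a member's /metrics endpoint, which helps when Prometheus itself is the thing that's down (kubeadm-style cert paths are an assumption):

```bash
# Quick-and-dirty read of the four critical metrics. The histogram
# sum/count pairs give a running average; use histogram_quantile()
# in Prometheus for the real p99.
curl -s https://127.0.0.1:2379/metrics \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  | grep -E 'etcd_server_is_leader|etcd_disk_wal_fsync_duration_seconds_(sum|count)|etcd_mvcc_db_total_size_in_bytes|etcd_network_peer_round_trip_time_seconds_(sum|count)'
```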
Memory Usage Patterns
- v3.6: Fixed most memory leaks, 50% reduction in usage
- Watch for: Compaction failures, excessive watch connections, large value storage
Version-Specific Intelligence
etcd v3.6 Improvements
- Memory leak fixes: 50% reduction in memory usage
- Robustness testing: Jepsen-style testing uncovered and fixed data inconsistency bugs
- Downgrade support: Can roll back to v3.5 without cluster rebuild
- Verdict: First version suitable for production without extensive testing
Pre-v3.6 Issues
- Memory leaks requiring regular monitoring and restarts
- Data corruption scenarios in specific failure modes
- Compaction failures causing memory explosion
Security Configuration Reality
TLS Requirements
- All cluster communication requires TLS in production
- Failure mode: Certificate expiration = complete cluster failure
- Operational requirement: 30-day expiration warnings mandatory
- Certificate rotation requires careful timing to avoid downtime
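A cron-friendly sketch for catching certificates before the 30-day window closes (kubeadm-style paths are an assumption; point it at wherever your etcd certs actually live):

```bash
# Warn if any etcd certificate expires within the next 30 days.
for cert in /etc/kubernetes/pki/etcd/server.crt \
            /etc/kubernetes/pki/etcd/peer.crt \
            /etc/kubernetes/pki/etcd/ca.crt; do
  if ! openssl x509 -noout -checkend $((30*24*3600)) -in "$cert"; then
    echo "WARNING: $cert expires within 30 days"
    openssl x509 -noout -enddate -in "$cert"
  fi
done
```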
Use Case Suitability Analysis
Appropriate Uses
- Kubernetes cluster state (mandatory)
- Service discovery with strong consistency requirements
- Distributed locking for critical operations (see the sketch after this list)
- Configuration requiring immediate consistency
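Both the service discovery and locking patterns above map directly onto etcdctl primitives; a minimal sketch (key names, TTLs, and the guarded script are illustrative; connection/TLS flags omitted):

```bash
# Service registration with a lease: the key disappears automatically
# when the service stops refreshing the lease (crash = deregistration).
LEASE_ID=$(etcdctl lease grant 30 | awk '{print $2}')
etcdctl put /services/payments/10.0.0.5:8443 '{"zone":"us-east-1a"}' --lease="$LEASE_ID"
etcdctl lease keep-alive "$LEASE_ID" &   # keep running alongside the service

# Consumers read the current set of live endpoints.
etcdctl get /services/payments/ --prefix

# Distributed lock: only one holder at a time; the command runs while
# the lock is held and the lock is released when it exits.
etcdctl lock /locks/nightly-settlement ./run-settlement.sh
```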
Inappropriate Uses
- Application databases (use PostgreSQL instead)
- High-volume data storage (2GB practical limit)
- Eventually consistent workloads (etcd's strong consistency adds latency and write unavailability you don't need)
Comparative Analysis with Alternatives
vs ZooKeeper
- Advantage: No JVM heap tuning nightmare
- Disadvantage: Same network partition write unavailability
- Migration factor: Simpler operational model
vs Redis Cluster
- Advantage: Strong consistency prevents phantom states
- Disadvantage: Write unavailability during partitions vs Redis eventual consistency
- Performance: Redis faster but lies about data freshness
vs DynamoDB
- Advantage: On-premises deployment control
- Disadvantage: Manual operational overhead vs AWS managed service
- Cost: Predictable vs AWS billing surprises
Financial Services Specific Requirements
Why Banks Use etcd
- Strong consistency prevents phantom trades and double-execution
- ACID compliance for regulatory requirements
- MVCC provides bulletproof audit trails
- Fail-fast behavior preferred over inconsistent success
Hidden Infrastructure Costs
- Odd-numbered cluster requirements increase hardware needs
- Cross-datacenter latency requires dedicated fiber connections
- Disaster recovery setup more expensive than anticipated
Resource Investment Requirements
Time Investments
- Initial setup: 1-2 weeks for proper production configuration
- Ongoing monitoring: Daily metric review mandatory
- Disaster recovery testing: Monthly restore validation required
- Certificate management: Quarterly rotation planning
Expertise Requirements
- Distributed systems understanding for troubleshooting
- Storage performance analysis skills
- Network latency debugging capabilities
- Kubernetes integration knowledge for production use
Infrastructure Costs
- Premium storage required (NVMe SSDs, high IOPS)
- Dedicated infrastructure for each cluster member
- Network quality requirements increase connectivity costs
- Monitoring and alerting system integration overhead
Troubleshooting Decision Trees
Performance Issues
- Check disk latency first (`iostat -x 1`)
- Verify network latency between members
- Monitor compaction success rate
- Check for quota limits approaching
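For the quota check in particular, the status table and alarm list give an immediate answer (connection/TLS flags omitted; a sketch):

```bash
# DB SIZE shows how close each member is to the 2GB default quota;
# RAFT TERM and IS LEADER reveal election churn at a glance.
etcdctl endpoint status --cluster -w table

# A NOSPACE alarm means the quota is already exhausted and the cluster
# is read-only until you compact, defrag, and disarm the alarm.
etcdctl alarm list
```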
High Availability Issues
- Verify odd-numbered cluster configuration
- Confirm partition behavior has actually been exercised (simulate a network partition in staging)
- Validate certificate expiration schedules
- Test failover procedures under load
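Failover can be rehearsed deliberately instead of waiting for an incident: leadership transfer is a built-in operation (a sketch; connection/TLS flags omitted, and the member ID shown is illustrative - use whatever `etcdctl member list` reports):

```bash
# List members and their hex IDs; note which one is currently leader.
etcdctl member list -w table

# Gracefully hand leadership to another member (send this to the
# current leader's endpoint) and watch how writes behave during the
# handover. Unlike a crashed leader, this transfer is near-instant,
# so it is a low-risk rehearsal rather than a full failure drill.
etcdctl move-leader 8e9e05c52164694d
```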
Memory/Resource Issues
- Identify version (v3.6+ preferred)
- Monitor watch connection counts
- Check compaction frequency and success
- Analyze stored value sizes
Breaking Points and Scalability Limits
Hard Limits
- 2GB storage before significant performance degradation
- 1000+ Kubernetes nodes = constant leadership churn
- 10ms+ disk latency = unusable leader election behavior
- 50ms+ network RTT = cross-datacenter deployment failure
Soft Limits
- 10K writes/sec theoretical maximum on optimal hardware
- Real-world performance significantly lower under Kubernetes load
- Memory usage growth over time requires periodic monitoring
Operational Red Flags
Immediate Action Required
- Leader election messages in logs
- Certificate expiration within 30 days
- Database size approaching 1.8GB (see the compaction recipe after this list)
- Disk write latency spikes above 50ms
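When the database size alert fires, the standard remediation is history compaction, then defragmentation, then clearing any NOSPACE alarm (a sketch adapted from the upstream quota-maintenance recipe; connection/TLS flags omitted):

```bash
# 1. Grab the current revision from the status output.
rev=$(etcdctl endpoint status --write-out=json \
      | grep -o '"revision":[0-9]*' | grep -o '[0-9].*' | head -1)

# 2. Compact away key history older than that revision.
etcdctl compact "$rev"

# 3. Defragment to hand the freed space back to the filesystem.
#    Defrag blocks the member while it runs - go one node at a time.
etcdctl defrag

# 4. If the quota was already exceeded, clear the NOSPACE alarm
#    so the cluster accepts writes again.
etcdctl alarm disarm
```

Setting etcd's `--auto-compaction-retention` flag automates the compaction half of this, leaving only periodic defrags as a manual chore.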
Planning Required
- Memory usage trend toward 1.5GB
- Kubernetes cluster growth past 800 nodes
- Network maintenance affecting inter-node communication
- Storage performance degradation trends
This reference provides decision-support information for etcd deployment, scaling, and maintenance while preserving all operational intelligence from real-world production experience.