I've been woken up at 3am by etcd failures because someone cut corners on the infrastructure planning. Here's what you actually need to run Kubernetes in production without hating your life.
Hardware Requirements That Won't Screw You
Control Plane Nodes: The Brains That Better Not Die
Minimum specs for each control plane node:
- CPU: 4 vCPUs minimum, 8+ recommended
- RAM: 16GB minimum, 32GB if you want to sleep at night
- Storage: 100GB+ SSD for etcd (never use spinning disks)
- Network: Dedicated NICs with 10Gbps+ for cluster communication
Production reality check: I had a client run production on t2.micro instances. It lasted about 45 minutes before it shit the bed during the first real traffic spike. etcd needs consistent CPU and fast disk I/O, so don't cheap out on storage.
You need 3 control plane nodes minimum, spread across availability zones. 2 nodes gives you zero fault tolerance (lose either one and you've lost etcd quorum), and 4 buys you nothing over 3 (same single-failure tolerance). etcd needs an odd number of members for consensus, so stick with 3 or 5.
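If you're bootstrapping with kubeadm, the piece people skip is a stable API endpoint in front of all three nodes. A minimal sketch, assuming a load balancer already answers at k8s-api.example.internal (hypothetical name, swap in yours):

```yaml
# kubeadm-config.yaml -- feed to: kubeadm init --config kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.29.0                 # pin your version; don't float
# Every kubelet and kubectl talks to the LB, never to one node directly --
# otherwise losing that node takes your "HA" cluster with it.
controlPlaneEndpoint: "k8s-api.example.internal:6443"
etcd:
  local:
    dataDir: /var/lib/etcd                 # mount this on the dedicated SSD
```

Join the second and third nodes with `kubeadm join --control-plane` and you get stacked etcd across all three.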
Worker Nodes: Where Your Apps Actually Live
Per node specifications:
- CPU: 8-16+ vCPUs depending on workload density
- RAM: 32-64GB+ (leave 25% for Kubernetes overhead)
- Storage: Local SSD for container images and logs
- Pod density: Plan for 20-30 pods per node maximum (the kubelet's default cap is 110, but you don't want to get anywhere near it)
The gotcha: Kubernetes reserves about 1GB RAM and 100m CPU per node for system components. Your 32GB node only gives you ~30GB for actual workloads. Plan accordingly or learn this during your first resource crunch.
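You can make those reservations explicit instead of discovering them at eviction time. A kubelet config sketch; the exact numbers are illustrative, tune them per node size:

```yaml
# kubelet-config.yaml (drop-in for the kubelet's --config flag)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 110                  # the default cap; see pod density above
kubeReserved:                 # held back for kubelet + container runtime
  cpu: "100m"
  memory: "1Gi"
systemReserved:               # held back for systemd, sshd, the OS itself
  cpu: "100m"
  memory: "1Gi"
evictionHard:                 # kubelet evicts pods before the node OOMs
  memory.available: "500Mi"
```

Allocatable = capacity minus kubeReserved minus systemReserved minus the eviction threshold. That's the number the scheduler actually sees, not what's on the invoice.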
Network Planning: Where Everything Goes Wrong
IP Address Planning (Get This Right or Cry Later):
- Pod CIDR: A /16 network gives you 65k pod IPs
- Service CIDR: A /16 for internal service IPs
- Node network: Separate subnet from pod/service networks
Don't overlap with existing networks. We spent 3 days debugging "no route to host" errors until we found the office VPN used the same damn range.
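For kubeadm clusters, these ranges get pinned in the networking block of the cluster config, and they're painful to change later. A sketch with common non-overlapping private ranges (verify against your VPC and VPN allocations first):

```yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
networking:
  podSubnet: "10.244.0.0/16"       # 65k pod IPs; must match your CNI's config
  serviceSubnet: "10.96.0.0/16"    # virtual IPs, only routable inside the cluster
# The node subnet lives in your VPC/network config, not here.
# Keep all three ranges disjoint from each other AND from the office VPN.
```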
Bandwidth requirements:
- Control plane: 1Gbps minimum between nodes
- Worker nodes: 10Gbps recommended for pod-to-pod traffic
- Internet: Plan for 10x your expected traffic for image pulls
Storage Strategy: Because Persistent Means Persistent
etcd storage requirements:
- Dedicated SSD volumes with 3000+ IOPS
- 100GB minimum, monitor growth closely
- Back up to a different region/zone every hour (see the CronJob sketch after this list)
- Network latency under 10ms between etcd members
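Here's roughly what the hourly backup looks like as an in-cluster CronJob. The image tag, cert paths, and the upload step are assumptions for a kubeadm-style cluster with stacked etcd; adapt them to your layout:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 * * * *"              # every hour, on the hour
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true          # reach etcd on 127.0.0.1:2379
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              effect: NoSchedule
          containers:
            - name: snapshot
              image: registry.k8s.io/etcd:3.5.12-0   # any image with etcdctl works
              command:
                - /bin/sh
                - -c
                - |
                  etcdctl --endpoints=https://127.0.0.1:2379 \
                    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                    --cert=/etc/kubernetes/pki/etcd/server.crt \
                    --key=/etc/kubernetes/pki/etcd/server.key \
                    snapshot save /backup/etcd-$(date +%Y%m%d-%H%M).db
                  # ship /backup to another region here (aws s3 sync, gsutil, ...)
              volumeMounts:
                - name: etcd-certs
                  mountPath: /etc/kubernetes/pki/etcd
                  readOnly: true
                - name: backup
                  mountPath: /backup
          restartPolicy: OnFailure
          volumes:
            - name: etcd-certs
              hostPath:
                path: /etc/kubernetes/pki/etcd
            - name: backup
              hostPath:
                path: /var/backups/etcd    # local staging dir on the node
```

The snapshot lands on the control plane node; the sync-to-another-region step is the part that actually saves you, so don't skip that comment.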
Application storage classes:
- Fast SSD: Databases, caches (gp3 with provisioned IOPS; StorageClass sketch after this list)
- Standard SSD: General application data (gp3 default)
- Cheap storage: Logs, backups (sc1 or cold storage)
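On AWS with the EBS CSI driver, the fast tier looks something like this; the IOPS and throughput numbers are illustrative, size them to your database:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "6000"              # provisioned above the 3000 IOPS gp3 baseline
  throughput: "250"         # MiB/s
allowVolumeExpansion: true  # you will want this at 3am
volumeBindingMode: WaitForFirstConsumer   # bind in the pod's zone, not a random one
reclaimPolicy: Delete
```

Clone it with default gp3 parameters for the standard tier and type sc1 for the cheap tier.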
Certificate Management and PKI Hell
Never use self-signed certificates in production. Set up cert-manager with Let's Encrypt for automatic certificate rotation, or integrate with your enterprise PKI.
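With cert-manager installed, the Let's Encrypt side is about a dozen lines. A sketch; the email and ingress class are placeholders:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com                 # placeholder; expiry warnings land here
    privateKeySecretRef:
      name: letsencrypt-prod-account-key   # cert-manager creates this Secret
    solvers:
      - http01:
          ingress:
            class: nginx                   # assumes an nginx ingress controller
```

Annotate an Ingress with cert-manager.io/cluster-issuer: letsencrypt-prod and certificate rotation stops being your problem.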
Certificate lifecycle to plan for:
- CA certificates: Root trust, rotate every 3-5 years
- API server certificates: Automatic rotation via kubeadm
- etcd certificates: Separate CA, manual rotation required
- Application certificates: Automated via cert-manager
The pain: Certificate rotation during business hours will cause brief API unavailability. Plan maintenance windows or configure certificate rotation overlap.
Security Hardening Checklist
Enable these before you go live:
- Pod Security Standards (restricted mode)
- Network policies (deny-all by default, explicit allow; sketch after this list)
- RBAC with least privilege access
- Runtime security scanning (Falco or similar)
- Secrets encryption at rest in etcd (config sketch below)
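Two items on that list are pure YAML and take five minutes. A per-namespace sketch; the namespace name is a placeholder:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments                                    # placeholder namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted  # Pod Security Standards
    pod-security.kubernetes.io/enforce-version: latest
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: payments
spec:
  podSelector: {}          # empty selector = every pod in the namespace
  policyTypes:
    - Ingress
    - Egress               # nothing in or out until an explicit allow exists
```

Heads up: denying egress also blocks DNS. Your first allow rule is almost always port 53 to kube-dns, or nothing in the namespace resolves anything.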
The security reality: Default Kubernetes is like leaving your front door wide open. Enable Pod Security Standards immediately or prepare for security audit failures.
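Secrets encryption at rest is one file plus one API server flag. A sketch; the key below is a placeholder, generate your own (head -c 32 /dev/urandom | base64):

```yaml
# /etc/kubernetes/encryption-config.yaml
# Enable with: kube-apiserver --encryption-provider-config=<path to this file>
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: REPLACE-WITH-BASE64-32-BYTE-KEY   # placeholder, never commit real keys
      - identity: {}     # fallback so pre-existing plaintext secrets still read
```

After enabling it, rewrite existing secrets so they actually get encrypted: kubectl get secrets -A -o json | kubectl replace -f -.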
Resource Sizing Reality Check
Control plane sizing by cluster size:
- Small cluster (< 100 nodes): 4 CPU, 16GB RAM
- Medium cluster (< 500 nodes): 8 CPU, 32GB RAM
- Large cluster (500+ nodes): 16+ CPU, 64GB+ RAM
Worker node sizing patterns:
- Microservices: Many small nodes (8 CPU, 32GB)
- Monoliths: Fewer large nodes (32+ CPU, 128GB+)
- Batch workloads: Large nodes with local SSD storage
The Hidden Costs Nobody Mentions
Cloud provider markup reality:
- AWS EKS: $73/month per cluster + EC2 costs
- GKE: $73/month standard tier (autopilot costs 3x more)
- AKS: $73/month standard tier + Azure tax on everything
Load balancer costs kill budgets:
- $20-50/month per LoadBalancer service
- You'll need at least 5-10 of these
- Ingress controllers reduce this but add complexity
Storage costs accumulate:
- EBS gp3: $0.08/GB/month (adds up fast)
- Snapshot storage: $0.05/GB/month for backups
- Cross-zone data transfer: $0.02/GB (death by 1000 cuts)
Regional and Multi-Zone Strategy
High availability basics:
- Control plane across 3 zones minimum
- Worker nodes distributed evenly
- etcd members in different zones
- Application replicas spread via pod anti-affinity (sketch after this list)
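Pod anti-affinity works for the spreading, but topologySpreadConstraints is the cleaner way to say "one replica per zone" these days. A trimmed Deployment sketch, scheduling bits only; the app name and image are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels: { app: api }
  template:
    metadata:
      labels: { app: api }
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule    # hard rule: spread or don't schedule
          labelSelector:
            matchLabels: { app: api }
      containers:
        - name: api
          image: example/api:1.0              # placeholder image
```

With three replicas across three zones, losing a zone costs you one replica, not all of them.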
The zone failure reality: When a zone goes down, you lose 33% capacity instantly. Size your remaining zones to handle the full load, or accept degraded performance during outages.
Spinning disks = 30-second API calls. Don't do it. Local NVMe storage for etcd, EBS gp3 with provisioned IOPS for everything else. The extra $200/month in storage costs is cheap insurance against the $20k in lost revenue when etcd falls over.
Got the infrastructure sorted? Now you need to choose how you're actually going to deploy this clusterfuck. Spoiler: they all suck in different ways.