Kubernetes Production Deployment - AI-Optimized Reference
Infrastructure Prerequisites
Hardware Requirements
Control Plane Nodes (Critical Foundation)
- Minimum: 4 vCPUs, 16GB RAM, 100GB+ SSD
- Production Reality: 8+ vCPUs, 32GB RAM prevents 3am outages
- Storage Critical: etcd requires 3000+ IOPS SSD, never spinning disks
- Network: 10Gbps+ dedicated NICs for cluster communication
- High Availability: 3 control plane nodes minimum (odd numbers for etcd consensus)
- Failure Impact: 2 nodes = no fault tolerance (etcd needs a majority to elect a leader); 4 nodes still tolerate only 1 failure, same as 3, so even counts add complexity without resilience
Worker Nodes (Application Runtime)
- CPU: 8-16+ vCPUs based on workload density
- Memory: 32-64GB+ (reserve 25% for Kubernetes overhead)
- Pod Density: practical maximum 20-30 pods per node (the kubelet default cap is 110, but density that high strains networking, logging, and kube-proxy)
- Resource Reality: a 32GB node typically leaves only ~24-28GB allocatable once kubelet and system reservations are set (see the KubeletConfiguration sketch below)
- Storage: Local SSD for container images and logs
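The 25% overhead above is enforced through kubelet reservations; without them the scheduler happily packs workloads into memory the OS needs. A minimal KubeletConfiguration sketch (the reservation sizes are illustrative assumptions, tune them to your node size):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Reserve capacity for OS daemons and Kubernetes components so
# workloads cannot starve the node itself (values are examples)
systemReserved:
  cpu: "500m"
  memory: "2Gi"
kubeReserved:
  cpu: "500m"
  memory: "2Gi"
# Evict pods before the node is completely out of memory
evictionHard:
  memory.available: "500Mi"
Allocatable = capacity minus systemReserved, kubeReserved, and the eviction threshold, which is where the ~24-28GB figure above comes from.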
Network Planning (High Failure Risk)
IP Address Allocation
- Pod CIDR: /16 network (65k pod IPs)
- Service CIDR: /16 for internal services
- Critical Warning: Don't overlap with existing networks
- Common Failure: Office VPN using same IP range causes "no route to host" errors
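Before committing to CIDRs, dump the routing table from a representative node and from a VPN-connected laptop; anything already routed inside your planned ranges will bite later. A quick sketch, assuming the 10.244.0.0/16 pod and 10.96.0.0/16 service subnets used in the kubeadm config later in this guide:
# List every route the host already knows about
ip route show
# Flag routes that fall inside the planned pod/service ranges
ip route show | grep -E '10\.(244|96)\.' && echo "CONFLICT: overlapping route found"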
Bandwidth Requirements
- Control Plane: 1Gbps minimum between nodes
- Workers: 10Gbps recommended for pod-to-pod traffic
- Internet: Plan 10x expected traffic for image pulls
Storage Strategy (Data Persistence)
etcd Requirements (Mission Critical)
- Performance: Dedicated SSD with 3000+ IOPS
- Capacity: 100GB disk minimum; note the etcd backend quota defaults to 2GB (8GB recommended maximum), so most of that disk is headroom for WAL files, snapshots, and compaction. Monitor growth closely
- Latency: Under 10ms between etcd members
- Backup: Hourly backups to different region/zone
- Failure Point: Spinning disks = 30-second API calls
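The etcd project publishes an fio-based check for whether a disk can sustain etcd's fsync-per-write pattern; a sketch of it is below (the test directory is a placeholder). The p99 fdatasync latency in the output should stay under 10ms:
# Approximate etcd's WAL pattern: small sequential writes with an
# fdatasync after every write (parameters from the etcd hardware docs)
sudo mkdir -p /var/lib/etcd-test
sudo fio --rw=write --ioengine=sync --fdatasync=1 \
  --directory=/var/lib/etcd-test --size=22m --bs=2300 \
  --name=etcd-disk-check
# Check the fsync/fdatasync percentiles in the output; p99 over 10ms
# means this disk will produce slow-apply warnings and API latency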
Application Storage Classes
- Fast SSD: Databases, caches (gp3 with provisioned IOPS; sample StorageClass below)
- Standard SSD: General application data (gp3 default)
- Cheap Storage: Logs, backups (sc1 or cold storage)
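On AWS with the EBS CSI driver, the fast tier might look like the sketch below; the IOPS and throughput figures are illustrative assumptions, not recommendations:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "6000"       # provisioned above the gp3 baseline of 3000
  throughput: "250"  # MiB/s
# Delay volume creation until a pod is scheduled, so the volume
# lands in the same zone as its consumer
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true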
Cost Analysis (Hidden Expenses)
Cloud Provider Pricing
Provider | Monthly Base Cost | Reality
---|---|---
AWS EKS | $73 + EC2 costs | First bill 3x expected |
GKE | $73 standard, Autopilot 3x more | Vendor lock-in risk |
AKS | $73 + Azure tax | Azure networking complexity |
Additional Costs That Kill Budgets
- Load Balancers: $20-50/month each (a typical production setup ends up with 5-10)
- Storage: EBS gp3 $0.08/GB/month accumulates fast
- Data Transfer: Cross-zone $0.02/GB (death by 1000 cuts)
- Monitoring Stack: ~$500/month for comprehensive coverage
Deployment Method Comparison
Method | Setup Time | Operational Overhead | Best For | Critical Pain Points
---|---|---|---|---
AWS EKS | 2-4 hours | Low-Medium | AWS environments | Unexpected billing |
Google GKE | 1-2 hours | Low | Google Cloud | Autopilot vendor lock |
Azure AKS | 2-6 hours | Medium | Microsoft shops | Networking nightmare |
kubeadm | 8-16 hours | High | Control requirements | Weekend certificate issues |
kops | 4-8 hours | High | AWS power users | AWS-only, complex upgrades |
kubeadm Production Deployment
Pre-Installation Critical Steps
System Preparation (All Nodes)
# Disable swap (Kubernetes requirement)
sudo swapoff -a
sudo sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab
# Install containerd (avoid Docker; note containerd.io is published
# from Docker's apt repository, which must be configured first)
sudo apt install -y containerd.io
sudo containerd config default | sudo tee /etc/containerd/config.toml
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
sudo systemctl restart containerd
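The commands above assume two prerequisites from the official install docs are already in place: the overlay and br_netfilter kernel modules, and the sysctls that let iptables see bridged traffic:
# Load required kernel modules now and on every boot
cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF
sudo modprobe overlay
sudo modprobe br_netfilter
# Enable bridged-traffic filtering and IP forwarding
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF
sudo sysctl --system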
Version Pinning (Prevent Update Disasters)
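The pinned install below assumes the community-owned pkgs.k8s.io repository for the v1.33 minor is already configured; setting it up looks roughly like this:
# Add the Kubernetes apt repository for the v1.33 minor release
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.33/deb/Release.key | \
  sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.33/deb/ /' | \
  sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt update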
# Pin specific Kubernetes versions
sudo apt install -y kubelet=1.33.4-1.1 kubeadm=1.33.4-1.1 kubectl=1.33.4-1.1
sudo apt-mark hold kubelet kubeadm kubectl
Control Plane Initialization
Configuration File (Don't Guess Parameters)
apiVersion: kubeadm.k8s.io/v1beta4   # v1beta3 is deprecated as of kubeadm 1.31
kind: ClusterConfiguration
kubernetesVersion: v1.33.4
controlPlaneEndpoint: "k8s-api.yourdomain.com:6443"
networking:
  podSubnet: "10.244.0.0/16"
  serviceSubnet: "10.96.0.0/16"
etcd:
  local:
    dataDir: "/var/lib/etcd"
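Assuming the configuration above is saved as kubeadm-config.yaml (a filename chosen here for illustration), initialization and kubectl access follow the standard pattern:
# Initialize the first control plane node; --upload-certs lets the
# other control plane nodes join without manual certificate copying
sudo kubeadm init --config=kubeadm-config.yaml --upload-certs
# Give your non-root user kubectl access
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config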
Common Initialization Failures
- Port 6443 in use: Kill zombie kube-apiserver processes
- Container runtime issues: Verify containerd configuration
- Swap enabled: Double-check swap disable commands
Post-Deployment Essentials
CNI Network Plugin Selection
- Calico: Battle-tested, complex debugging
- Flannel: Simple, reliable, limited features (one-line install below)
- Cilium: eBPF advanced features, harder troubleshooting
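If Flannel is the pick, installation is a single manifest; note it defaults to the 10.244.0.0/16 pod subnet used in the kubeadm config above (adjust one or the other if they differ):
# Install Flannel from its published release manifest
kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml
# Nodes flip from NotReady to Ready once the CNI is up
kubectl get nodes -w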
Essential Production Components
- Ingress Controller: nginx-ingress (most documented)
- Certificate Management: cert-manager with Let's Encrypt
- Storage Classes: Configure for cloud provider
- Security Policies: Pod Security Standards (restricted mode)
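Pod Security Standards are enforced per namespace via labels; for example, to require the restricted profile (the namespace name is illustrative):
# Enforce restricted; warn and audit surface violations in API
# responses and audit logs
kubectl label namespace production \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/warn=restricted \
  pod-security.kubernetes.io/audit=restricted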
Monitoring and Observability
Prometheus Stack Resource Requirements
Memory Consumption Reality
- Prometheus: 4-8GB RAM for medium cluster
- Grafana: 512MB-1GB RAM
- Elasticsearch: 4GB+ per node (grows aggressively)
- Total Impact: 15% of cluster resources consumed
Critical Alerts Configuration
# Essential alerts only
- NodeDown (5min threshold)
- PodCrashLooping (10min threshold)
- APIServerDown (1min threshold)
- EtcdDown (3min threshold)
- DiskSpaceRunningLow (85% threshold)
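As a Prometheus rule, the first alert in that list might look like this sketch; the job label is an assumption about how the node-exporter scrape is named in your setup:
groups:
- name: cluster-critical
  rules:
  - alert: NodeDown
    # Scrape target unreachable for the full 5-minute threshold
    expr: up{job="node-exporter"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Node exporter on {{ $labels.instance }} down for 5 minutes"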
Log Aggregation Strategy
ELK Stack Deployment
- Elasticsearch: 3 master-eligible replicas (quorum of 2 prevents split-brain)
- Fluentd: DaemonSet for log collection
- Kibana: Search interface (slow but useful during outages)
- Storage Requirements: 100GB+ for log retention
Security Hardening Checklist
Immediate Security Requirements
- Pod Security Standards: Restricted mode default
- Network Policies: Deny-all default, explicit allow (manifest sketch after this list)
- RBAC: Least privilege access patterns
- Secrets Encryption: At rest in etcd
- Certificate Management: Automated rotation via cert-manager
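The deny-all default mentioned above is one manifest per namespace; everything else then needs an explicit allow rule (the namespace name is illustrative):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  # An empty podSelector matches every pod in the namespace
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress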
Runtime Security
- Falco: Runtime security monitoring
- kube-bench: CIS benchmark validation
- OPA Gatekeeper: Policy enforcement
- Image Scanning: Security vulnerability detection
Operational Procedures
Backup and Recovery
etcd Backup (Critical Process)
# Hourly automated backup (matches the cadence above; schedule via cron or a CronJob)
ETCDCTL_API=3 etcdctl snapshot save /var/lib/etcd/backup-$(date +%Y%m%d-%H%M%S).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key
Backup Testing: Test restore procedures in staging environment
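Testing means actually running a restore, not checking that the file exists. A sketch using etcdctl (the snapshot filename and restore path are placeholders):
# Verify the snapshot is readable and reports sane revision/size
ETCDCTL_API=3 etcdctl snapshot status /var/lib/etcd/backup-20250101-000000.db --write-out=table
# Restore into a fresh data directory, then point a test etcd at it
ETCDCTL_API=3 etcdctl snapshot restore /var/lib/etcd/backup-20250101-000000.db \
  --data-dir=/var/lib/etcd-restored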
Upgrade Strategy
Phased Upgrade Process
- Test in staging (production-like environment)
- Drain and upgrade control plane nodes individually (kubeadm commands after this list)
- Upgrade worker nodes in 20% capacity batches
- Verify cluster health after each phase
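With kubeadm, the phases above map to commands like the following; the target version is a hypothetical next patch release and node names are placeholders:
# First control plane node: upgrade the kubeadm package itself
# (unhold, install, re-hold), then preview and apply
sudo kubeadm upgrade plan
sudo kubeadm upgrade apply v1.33.5
# Remaining control plane nodes and workers use 'upgrade node'
sudo kubeadm upgrade node
# Per worker: drain, upgrade kubelet, restart, uncordon
kubectl drain worker-1 --ignore-daemonsets --delete-emptydir-data
sudo apt-mark unhold kubelet kubectl
sudo apt install -y kubelet=1.33.5-1.1 kubectl=1.33.5-1.1
sudo apt-mark hold kubelet kubectl
sudo systemctl daemon-reload && sudo systemctl restart kubelet
kubectl uncordon worker-1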
Rollback Preparation
- Document current versions
- Prepare rollback procedures
- Maintain cluster access during upgrades
Troubleshooting Common Issues
Certificate Problems
Symptoms: API server won't start after certificate rotation
Solution: Renew certificates manually, restart control plane pods
Prevention: Monitor certificate expiration dates
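kubeadm ships the commands for the manual renewal described above:
# See what expires and when
sudo kubeadm certs check-expiration
# Renew all kubeadm-managed certificates
sudo kubeadm certs renew all
# Control plane components only pick up new certs on restart:
# briefly move the static pod manifests out of
# /etc/kubernetes/manifests/ and back so kubelet recreates them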
etcd Storage Issues
Symptoms: Cluster unresponsive, disk space full
Solution: Compact etcd database, increase storage
Prevention: Automated compaction, storage monitoring
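Compaction plus defragmentation is what actually returns disk space; a sketch against a local member, reusing the same TLS flags as the backup command above:
# Shared connection flags for the local etcd member
FLAGS="--endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key"
# Current revision -> compact history up to it -> defragment
rev=$(ETCDCTL_API=3 etcdctl $FLAGS endpoint status --write-out=json \
  | grep -o '"revision":[0-9]*' | grep -o '[0-9]\+' | head -1)
ETCDCTL_API=3 etcdctl $FLAGS compaction "$rev"
ETCDCTL_API=3 etcdctl $FLAGS defrag
# Clear the NOSPACE alarm if it fired
ETCDCTL_API=3 etcdctl $FLAGS alarm disarm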
Network Connectivity
Symptoms: Pod-to-pod communication failures
Solution: Verify CNI plugin status, check network policies
Prevention: Network policy testing, CNI monitoring
Resource Exhaustion
Symptoms: Pods pending, node pressure conditions
Solution: Add resources or optimize resource requests
Prevention: Resource monitoring, capacity planning
Production Readiness Validation
Cluster Testing Procedures
- Node Failure Test: Drain node during active traffic
- Control Plane Failure: Kill control plane node
- Storage Pressure: Fill etcd disk space
- Application Restart: Delete all pods in namespace
- Network Partition: Simulate zone failures
Performance Benchmarks
- API Response Time: Under 100ms for basic operations
- Pod Startup Time: Under 30 seconds for standard images
- Network Throughput: Baseline inter-pod communication
- Storage IOPS: Verify etcd performance requirements
Capacity Planning
Scaling Thresholds
- CPU: Scale at 70% sustained usage
- Memory: Scale at 80% usage (account for spikes)
- Storage: Scale at 75% usage
- Network: Monitor bandwidth utilization
Growth Planning
- Pod Density: Plan for 20-30 pods per node maximum
- Node Count: Maintain 20% capacity buffer
- Storage Growth: Monitor etcd size growth patterns
- Network Bandwidth: Plan for 10x traffic growth
Cost Optimization
Resource Efficiency
- Right-sizing: Monitor actual vs requested resources
- Spot Instances: Use for fault-tolerant workloads
- Storage Optimization: Match storage class to use case
- Network Costs: Minimize cross-zone traffic
Monitoring Costs
- Cloud Provider Bills: Track Kubernetes-related charges
- Resource Utilization: Identify over-provisioned resources
- Efficiency Metrics: Cost per application, per team
- Optimization Opportunities: Regular cost review cycles
Maintenance Windows
Planned Maintenance
- Certificate Rotation: Business hours with brief API downtime
- Kubernetes Upgrades: Extended maintenance windows
- Node Patching: Rolling updates with capacity planning
- Storage Expansion: Minimal downtime operations
Emergency Procedures
- Incident Response: Escalation procedures
- Rollback Plans: Rapid recovery procedures
- Communication: Status page updates
- Post-Incident: Root cause analysis and improvements
Useful Links for Further Investigation
Essential Kubernetes Production Resources
Link | Description
---|---
Kubernetes Production Environment Setup | Official production checklist from the Kubernetes project |
kubeadm Production Best Practices | Step-by-step cluster creation with kubeadm |
Kubernetes Security Best Practices | Security hardening guide for production clusters |
etcd Administration Guide | etcd backup, recovery, and maintenance procedures |
Kubernetes Backup and Disaster Recovery | Official backup strategies and procedures |
AWS EKS | Managed Kubernetes on AWS ($73/month + infrastructure costs) |
Google GKE | Google's managed Kubernetes with autopilot options |
Azure AKS | Microsoft's managed Kubernetes service |
kops | Production Grade K8s installation, upgrades, and management on AWS |
Rancher | Multi-cluster management platform with decent UI, questionable CLI |
Red Hat OpenShift | Enterprise Kubernetes with enterprise pricing |
containerd | Default container runtime, works everywhere, battle-tested |
CRI-O | Lightweight runtime, minimal attack surface, security-focused |
Calico | Works until networking dies with cryptic errors |
Cilium | Fancy eBPF stuff that's hard to debug when it breaks |
Flannel | Boring and reliable VXLAN overlay networking |
kubectl | The CLI you'll use 300 times a day |
k9s | Makes kubectl bearable. Just install it. |
Helm | YAML templating that barely works but beats raw manifests |
Stern | Tail logs from multiple pods without losing your mind |
kubectl-debug | Enhanced debugging for ephemeral containers |
Popeye | Cluster validator that makes you feel bad about your configs |
Prometheus | Eats disk space but tells you why shit broke |
Grafana | Pretty dashboards that spike during outages |
Jaeger | Distributed tracing when you need to know where requests die |
Elastic Stack (ELK) | Log aggregation and search (prepare for memory consumption) |
Fluentd | Log collection that usually works |
Falco | Runtime security monitoring for detecting weird behavior |
cert-manager | Automatic certificate management and renewal with Let's Encrypt |
Let's Encrypt | Free TLS certificates that actually work |
Open Policy Agent (OPA) | Policy engine with a learning curve steeper than Kubernetes itself |
Gatekeeper | OPA for Kubernetes admission control |
kube-bench | CIS benchmark checks that make you realize how insecure your cluster is |
Longhorn | Distributed storage that works in small clusters |
Ceph Rook | Distributed storage for when you have storage expertise and unlimited time |
Velero | Cluster backup and disaster recovery tool |
MinIO | S3-compatible object storage for on-premises deployments |
nginx-ingress | Most popular ingress controller, well-documented |
Traefik | Auto-discovery ingress with better defaults |
Ambassador/Emissary | API gateway built on Envoy |
MetalLB | Load balancer for bare metal clusters |
ArgoCD | GitOps continuous delivery, works when Git doesn't break everything |
Flux | GitOps toolkit for automated deployments |
Tekton | Kubernetes-native CI/CD pipelines |
AWS Load Balancer Controller | Provisions ALBs and NLBs for services |
EBS CSI Driver | Persistent volumes backed by EBS |
AWS VPC CNI | Native VPC networking for pods |
GKE Autopilot Documentation | Serverless Kubernetes that costs 3x more |
GCP Load Balancer Integration | Native load balancer provisioning |
AKS Best Practices | Microsoft's recommendations for production AKS |
Azure CNI | Advanced networking with VNET integration |
Kubernetes Slack | Real-time help from people who actually know what they're doing |
CNCF Landscape | Map of the cloud native ecosystem (prepare for choice paralysis) |
KodeKloud Kubernetes Courses | Hands-on learning that teaches practical skills |
Kubernetes the Hard Way | Learn by building everything from scratch (masochist edition) |
Kubernetes Community Forums | Community discussions and war stories |
Kubernetes Troubleshooting Flowchart | Visual guide for debugging common issues |
Post-Mortem Templates | Learn from other people's outages |
SRE Book | Google's approach to running production systems |
Kubernetes Failure Stories | Real outage stories and lessons learned |
Pod Security Standards | Essential security policies for production |
Network Policies | Firewall rules that will break everything until configured correctly |
Kubernetes Security Checklist | Comprehensive security hardening guide |
KubeCost | Track spending and resource usage across clusters |
Spot Instances/Preemptible VMs | Save money using instances that disappear randomly |
Cluster Autoscaler | Automatic node scaling (5 minutes after you needed it) |
Kubernetes in Action | Comprehensive guide to Kubernetes concepts and patterns |
Production Kubernetes | Practical guide to running K8s in production |
Kubernetes Operators | Building and managing custom controllers |