Kubernetes Production Deployment - AI-Optimized Reference

Infrastructure Prerequisites

Hardware Requirements

Control Plane Nodes (Critical Foundation)

  • Minimum: 4 vCPUs, 16GB RAM, 100GB+ SSD
  • Production Reality: 8+ vCPUs, 32GB RAM prevents 3am outages
  • Storage Critical: etcd requires 3000+ IOPS SSD, never spinning disks
  • Network: 10Gbps+ dedicated NICs for cluster communication
  • High Availability: 3 control plane nodes minimum (odd numbers for etcd consensus)
  • Failure Impact: 2 nodes = no fault tolerance; 4 nodes tolerate only one failure, the same as 3; go to 5 only if you must survive two simultaneous failures
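
The odd-number rule follows from majority quorum arithmetic. A minimal sketch of that arithmetic (the function names are illustrative, not part of any tool):

```shell
# quorum SIZE: members that must agree for etcd to accept a write (strict majority).
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# fault_tolerance SIZE: members that can fail while a majority still exists.
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 4 5; do
  echo "members=$n quorum=$(quorum "$n") tolerates=$(fault_tolerance "$n")"
done
```

Going from 3 to 4 members raises the quorum from 2 to 3 without raising the number of tolerated failures, which is why even member counts add cost but no resilience.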

Worker Nodes (Application Runtime)

  • CPU: 8-16+ vCPUs based on workload density
  • Memory: 32-64GB+ (reserve up to 25% for Kubernetes and system overhead)
  • Pod Density: Plan for 20-30 pods per node (the kubelet default limit is 110)
  • Resource Reality: 32GB node provides roughly 24-30GB for workloads, depending on how much you reserve
  • Storage: Local SSD for container images and logs

Network Planning (High Failure Risk)

IP Address Allocation

  • Pod CIDR: /16 network (65k pod IPs)
  • Service CIDR: /16 for internal services
  • Critical Warning: Don't overlap with existing networks
  • Common Failure: Office VPN using same IP range causes "no route to host" errors
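
Overlaps are easy to check before the cluster exists. A rough sketch in plain shell arithmetic, assuming IPv4 only; the pod and service ranges are the ones used later in this guide, and the "existing" ranges are made-up examples standing in for your office or VPN networks:

```shell
# ip_to_int A.B.C.D: convert a dotted IPv4 address to a 32-bit integer.
ip_to_int() {
  local IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# cidr_range ADDR/BITS: print the first and last address of the block as integers.
cidr_range() {
  local ip bits base size start
  ip=${1%/*}
  bits=${1#*/}
  base=$(ip_to_int "$ip")
  size=$(( 1 << (32 - bits) ))
  start=$(( base / size * size ))   # align down to the network address
  echo "$start $(( start + size - 1 ))"
}

# cidrs_overlap A B: succeed if the two blocks share at least one address.
cidrs_overlap() {
  set -- $(cidr_range "$1") $(cidr_range "$2")
  [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
}

pod_cidr="10.244.0.0/16"
service_cidr="10.96.0.0/16"
# Hypothetical existing networks; replace with your real ranges.
for existing in "10.0.0.0/8" "192.168.1.0/24"; do
  for planned in "$pod_cidr" "$service_cidr"; do
    if cidrs_overlap "$planned" "$existing"; then
      echo "OVERLAP: $planned collides with $existing"
    fi
  done
done
```

With these sample inputs both planned ranges collide with 10.0.0.0/8, which is exactly the office-VPN failure described above.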

Bandwidth Requirements

  • Control Plane: 1Gbps minimum between nodes
  • Workers: 10Gbps recommended for pod-to-pod traffic
  • Internet: Plan 10x expected traffic for image pulls

Storage Strategy (Data Persistence)

etcd Requirements (Mission Critical)

  • Performance: Dedicated SSD with 3000+ IOPS
  • Capacity: 100GB minimum, monitor growth closely
  • Latency: Under 10ms between etcd members
  • Backup: Hourly backups to different region/zone
  • Failure Point: Spinning disks = 30-second API calls

Application Storage Classes

  • Fast SSD: Databases, caches (gp3 with provisioned IOPS)
  • Standard SSD: General application data (gp3 default)
  • Cheap Storage: Logs, backups (sc1 or cold storage)

Cost Analysis (Hidden Expenses)

Cloud Provider Pricing

Provider | Monthly Base Cost               | Reality
AWS EKS  | $73 + EC2 costs                 | First bill 3x expected
GKE      | $73 standard, Autopilot 3x more | Vendor lock-in risk
AKS      | $73 + Azure tax                 | Azure networking complexity

Additional Costs That Kill Budgets

  • Load Balancers: $20-50/month each (need 5-10 minimum)
  • Storage: EBS gp3 $0.08/GB/month accumulates fast
  • Data Transfer: Cross-zone $0.02/GB (death by 1000 cuts)
  • Monitoring Stack: ~$500/month for comprehensive coverage
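
A back-of-the-envelope estimate makes these extras visible before the first bill. A sketch using only the unit prices quoted in this section; compute (EC2 or equivalent) is deliberately left out, and the input quantities are invented for illustration:

```shell
# monthly_estimate LB_COUNT LB_PRICE STORAGE_GB CROSS_ZONE_GB
# Prints the monthly total in dollars for the non-compute line items.
monthly_estimate() {
  awk -v lbs="$1" -v lb_price="$2" -v gb="$3" -v xfer="$4" 'BEGIN {
    control_plane = 73     # managed control plane fee from the pricing table
    monitoring    = 500    # monitoring stack figure quoted above
    storage       = gb * 0.08      # EBS gp3 per GB-month
    transfer      = xfer * 0.02    # cross-zone traffic per GB
    printf "%.2f\n", control_plane + lbs * lb_price + storage + transfer + monitoring
  }'
}

# Example inputs (assumptions): 5 load balancers at $20, 1000 GB of gp3, 5000 GB cross-zone.
monthly_estimate 5 20 1000 5000
```

The example prints 853.00, more than ten times the $73 base fee, before a single worker node is counted.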

Deployment Method Comparison

Method     | Setup Time | Operational Overhead | Best For             | Critical Pain Points
AWS EKS    | 2-4 hours  | Low-Medium           | AWS environments     | Unexpected billing
Google GKE | 1-2 hours  | Low                  | Google Cloud         | Autopilot vendor lock
Azure AKS  | 2-6 hours  | Medium               | Microsoft shops      | Networking nightmare
kubeadm    | 8-16 hours | High                 | Control requirements | Weekend certificate issues
kops       | 4-8 hours  | High                 | AWS power users      | AWS-only, complex upgrades

kubeadm Production Deployment

Pre-Installation Critical Steps

System Preparation (All Nodes)

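# Disable swap (Kubernetes requirement)
sudo swapoff -a
# Comment out swap entries so swap stays off after a reboot
sudo sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab

# Install containerd (avoid Docker)
# Note: the containerd.io package ships in Docker's apt repository, which must be configured first
sudo apt install -y containerd.io
sudo containerd config default | sudo tee /etc/containerd/config.toml
# Use the systemd cgroup driver so containerd matches the kubelet
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
# Restart so the new configuration takes effect
sudo systemctl restart containerd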

Version Pinning (Prevent Update Disasters)

# Pin specific Kubernetes versions
sudo apt install -y kubelet=1.33.4-1.1 kubeadm=1.33.4-1.1 kubectl=1.33.4-1.1
sudo apt-mark hold kubelet kubeadm kubectl

Control Plane Initialization

Configuration File (Don't Guess Parameters)

apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
kubernetesVersion: v1.33.4
controlPlaneEndpoint: "k8s-api.yourdomain.com:6443"
networking:
  podSubnet: "10.244.0.0/16"
  serviceSubnet: "10.96.0.0/16"
etcd:
  local:
    dataDir: "/var/lib/etcd"
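
One way to apply the configuration on the first control plane node is sketched below. The file name kubeadm-config.yaml and the function name are assumptions; the kubeconfig steps mirror what kubeadm prints after a successful init:

```shell
# init_control_plane [CONFIG_FILE]: bootstrap the first control plane node.
# CONFIG_FILE defaults to kubeadm-config.yaml (hypothetical name for the file above).
init_control_plane() {
  config_file=${1:-kubeadm-config.yaml}
  # --upload-certs stores the control plane certificates in the cluster
  # so additional control plane nodes can join without manual copying.
  sudo kubeadm init --config "$config_file" --upload-certs &&
    mkdir -p "$HOME/.kube" &&
    sudo cp -i /etc/kubernetes/admin.conf "$HOME/.kube/config" &&
    sudo chown "$(id -u):$(id -g)" "$HOME/.kube/config"
}
```

Run init_control_plane once, on one node only, and save the join commands that kubeadm prints; the remaining control plane and worker nodes use those instead of running init again.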

Common Initialization Failures

  • Port 6443 in use: Kill zombie kube-apiserver processes
  • Container runtime issues: Verify containerd configuration
  • Swap enabled: Double-check swap disable commands
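
These three failures can be checked before running kubeadm init. A rough sketch with illustrative function names; kubeadm runs its own preflight checks as well, so treat this as an early warning rather than a replacement:

```shell
# port_in_use PORT: succeed if something is already listening on PORT.
port_in_use() {
  ss -ltn 2>/dev/null | grep -q ":$1 "
}

# swap_active [FILE]: succeed if any swap device is listed.
# FILE defaults to /proc/swaps, whose first line is a header.
swap_active() {
  tail -n +2 "${1:-/proc/swaps}" | grep -q .
}

# containerd_uses_systemd_cgroup [FILE]: succeed if the config sets SystemdCgroup = true.
containerd_uses_systemd_cgroup() {
  grep -Eq '^[[:space:]]*SystemdCgroup[[:space:]]*=[[:space:]]*true' \
    "${1:-/etc/containerd/config.toml}"
}

# preflight: report every problem found, succeed only if there were none.
preflight() {
  failures=0
  if port_in_use 6443; then
    echo "FAIL: port 6443 is already in use"; failures=$((failures + 1))
  fi
  if swap_active; then
    echo "FAIL: swap is still enabled"; failures=$((failures + 1))
  fi
  if ! containerd_uses_systemd_cgroup; then
    echo "FAIL: containerd is not using the systemd cgroup driver"; failures=$((failures + 1))
  fi
  [ "$failures" -eq 0 ]
}
```

Call preflight on each node; a non-zero exit status means at least one of the listed conditions needs fixing first.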

Post-Deployment Essentials

CNI Network Plugin Selection

  • Calico: Battle-tested, complex debugging
  • Flannel: Simple, reliable, limited features
  • Cilium: eBPF advanced features, harder troubleshooting

Essential Production Components

  1. Ingress Controller: nginx-ingress (most documented)
  2. Certificate Management: cert-manager with Let's Encrypt
  3. Storage Classes: Configure for cloud provider
  4. Security Policies: Pod Security Standards (restricted mode)

Monitoring and Observability

Prometheus Stack Resource Requirements

Memory Consumption Reality

  • Prometheus: 4-8GB RAM for medium cluster
  • Grafana: 512MB-1GB RAM
  • Elasticsearch: 4GB+ per node (grows aggressively)
  • Total Impact: 15% of cluster resources consumed

Critical Alerts Configuration

# Essential alerts only
- NodeDown (5min threshold)
- PodCrashLooping (10min threshold)
- APIServerDown (1min threshold)
- EtcdDown (3min threshold)
- DiskSpaceRunningLow (85% threshold)
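
The list above could be expressed as a Prometheus rules file along these lines. This is a sketch, not a tested rule set: the job label values (node-exporter, apiserver, etcd) are assumptions that must match your scrape configuration, and the expressions assume node_exporter and kube-state-metrics are installed. The output file name is also illustrative.

```shell
# write_alert_rules [FILE]: write a starter Prometheus rules file.
write_alert_rules() {
  cat > "${1:-k8s-alerts.yml}" <<'EOF'
groups:
  - name: cluster-essentials
    rules:
      # Job label values below are assumptions; match them to your scrape config.
      - alert: NodeDown
        expr: up{job="node-exporter"} == 0
        for: 5m
      - alert: PodCrashLooping
        expr: max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}[5m]) >= 1
        for: 10m
      - alert: APIServerDown
        expr: absent(up{job="apiserver"} == 1)
        for: 1m
      - alert: EtcdDown
        expr: absent(up{job="etcd"} == 1)
        for: 3m
      # No duration is given for this alert in the list, so it fires immediately.
      - alert: DiskSpaceRunningLow
        expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) > 0.85
EOF
}

write_alert_rules k8s-alerts.yml
```

Validate the result with promtool check rules k8s-alerts.yml before loading it, and adjust the expressions once you see which labels your exporters actually emit.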

Log Aggregation Strategy

EFK Stack Deployment (Elasticsearch, Fluentd, Kibana)

  • Elasticsearch: 3 replicas, 2 minimum masters
  • Fluentd: DaemonSet for log collection
  • Kibana: Search interface (slow but useful during outages)
  • Storage Requirements: 100GB+ for log retention

Security Hardening Checklist

Immediate Security Requirements

  • Pod Security Standards: Restricted mode default
  • Network Policies: Deny-all default, explicit allow
  • RBAC: Least privilege access patterns
  • Secrets Encryption: At rest in etcd
  • Certificate Management: Automated rotation via cert-manager
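
The first two items can be applied per namespace roughly as follows. A sketch: the function names and the payments namespace are illustrative; the NetworkPolicy and the pod-security.kubernetes.io/enforce label are standard Kubernetes objects. The policy only has an effect if the CNI plugin enforces network policies.

```shell
# default_deny_manifest NAMESPACE: print a NetworkPolicy that blocks all
# ingress and egress for every pod in the namespace.
default_deny_manifest() {
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: $1
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
EOF
}

# harden_namespace NAMESPACE: enforce the restricted Pod Security Standard
# and apply the default-deny policy.
harden_namespace() {
  kubectl label namespace "$1" pod-security.kubernetes.io/enforce=restricted --overwrite &&
    default_deny_manifest "$1" | kubectl apply -f -
}

# Preview the manifest for a hypothetical namespace without touching the cluster.
default_deny_manifest payments
```

After harden_namespace runs, every allowed flow (including DNS egress) needs its own explicit allow policy, so roll this out one namespace at a time.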

Runtime Security

  • Falco: Runtime security monitoring
  • kube-bench: CIS benchmark validation
  • OPA Gatekeeper: Policy enforcement
  • Image Scanning: Security vulnerability detection

Operational Procedures

Backup and Recovery

etcd Backup (Critical Process)

# Automated backup (schedule hourly per the etcd requirements; copy each snapshot off the node)
ETCDCTL_API=3 etcdctl snapshot save /var/lib/etcd/backup-$(date +%Y%m%d-%H%M%S).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key

Backup Testing: Test restore procedures in staging environment
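
A cheap first check is to confirm that a snapshot is readable and restorable into a scratch directory. A sketch assuming etcdutl (shipped with etcd 3.5 and later) is installed; the function name is illustrative. It proves the file unpacks, not that a cluster boots from it, so the staging restore is still needed:

```shell
# verify_snapshot FILE: print snapshot metadata, then restore it into a
# throwaway directory to prove the file is usable. Cleans up afterwards.
verify_snapshot() {
  snapshot=$1
  scratch_dir=$(mktemp -d) || return 1
  etcdutl snapshot status "$snapshot" --write-out=table &&
    etcdutl snapshot restore "$snapshot" --data-dir "$scratch_dir/restored"
  status=$?
  rm -rf "$scratch_dir"
  return "$status"
}
```

Usage: verify_snapshot /var/lib/etcd/backup-20250101-000000.db (an example file name following the pattern of the backup command above). A non-zero exit status means the snapshot should not be trusted.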

Upgrade Strategy

Phased Upgrade Process

  1. Test in staging (production-like environment)
  2. Drain and upgrade control plane nodes individually
  3. Upgrade worker nodes in 20% capacity batches
  4. Verify cluster health after each phase
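
The 20% batching in step 3 can be sketched as below. The function names are illustrative; rounding down with a floor of one node is a choice made here so a batch never exceeds 20% of capacity:

```shell
# worker_batch_size TOTAL [PERCENT]: nodes per batch, rounded down, never less than 1.
worker_batch_size() {
  size=$(( $1 * ${2:-20} / 100 ))
  if [ "$size" -lt 1 ]; then size=1; fi
  echo "$size"
}

# drain_node NODE: evict workloads before upgrading the node.
drain_node() {
  kubectl drain "$1" --ignore-daemonsets --delete-emptydir-data
}

# return_node NODE: make the node schedulable again after the upgrade.
return_node() {
  kubectl uncordon "$1"
}

echo "batch size for 10 workers: $(worker_batch_size 10)"
```

For each batch: drain_node every node in it, upgrade the packages and run kubeadm upgrade node on each, then return_node and verify health before starting the next batch.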

Rollback Preparation

  • Document current versions
  • Prepare rollback procedures
  • Maintain cluster access during upgrades

Troubleshooting Common Issues

Certificate Problems

Symptoms: API server won't start after certificate rotation
Solution: Renew certificates manually, restart control plane pods
Prevention: Monitor certificate expiration dates
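
Expiry monitoring can start as a small cron-able check. A sketch assuming kubeadm's default certificate paths under /etc/kubernetes/pki and an installed openssl; the function names and the 30-day default are illustrative:

```shell
# cert_expires_within_days FILE DAYS: succeed if the certificate expires
# within DAYS days. An unreadable file also counts as expiring.
cert_expires_within_days() {
  ! openssl x509 -checkend "$(( $2 * 86400 ))" -noout -in "$1" >/dev/null 2>&1
}

# report_expiring_certs [DAYS]: list control plane certificates close to expiry.
report_expiring_certs() {
  days=${1:-30}
  for cert in /etc/kubernetes/pki/*.crt /etc/kubernetes/pki/etcd/*.crt; do
    [ -e "$cert" ] || continue
    if cert_expires_within_days "$cert" "$days"; then
      echo "EXPIRING within $days days: $cert"
    fi
  done
}

# kubeadm-native equivalents, run on a control plane node:
#   sudo kubeadm certs check-expiration
#   sudo kubeadm certs renew all
```

After a manual renewal the control plane components must be restarted to pick up the new certificates, as noted in the solution above.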

etcd Storage Issues

Symptoms: Cluster unresponsive, disk space full
Solution: Compact and defragment the etcd database, clear any space alarm, then increase storage
Prevention: Automated compaction, storage monitoring
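
The recovery sequence looks roughly like this. A sketch that reuses the endpoint and certificate paths from the backup command in this guide; the function names are illustrative. Defragmentation blocks the member while it runs, so do it one member at a time on a multi-node cluster:

```shell
# etcdctl_k8s ARGS...: etcdctl with the TLS flags for a kubeadm-managed etcd.
etcdctl_k8s() {
  ETCDCTL_API=3 etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
    --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
    "$@"
}

# extract_revision: read `endpoint status --write-out=json` on stdin
# and print the current store revision.
extract_revision() {
  grep -Eo '"revision":[0-9]+' | grep -Eo '[0-9]+' | head -n 1
}

# reclaim_etcd_space: compact history up to the current revision,
# defragment to release the space, then clear any raised alarm.
reclaim_etcd_space() {
  rev=$(etcdctl_k8s endpoint status --write-out=json | extract_revision)
  [ -n "$rev" ] || { echo "could not read etcd revision" >&2; return 1; }
  etcdctl_k8s compaction "$rev" &&
    etcdctl_k8s defrag &&
    etcdctl_k8s alarm disarm
}
```

Check etcdctl_k8s endpoint status --write-out=table before and after to confirm the database size actually dropped.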

Network Connectivity

Symptoms: Pod-to-pod communication failures
Solution: Verify CNI plugin status, check network policies
Prevention: Network policy testing, CNI monitoring

Resource Exhaustion

Symptoms: Pods pending, node pressure conditions
Solution: Add resources or optimize resource requests
Prevention: Resource monitoring, capacity planning

Production Readiness Validation

Cluster Testing Procedures

  1. Node Failure Test: Drain node during active traffic
  2. Control Plane Failure: Kill control plane node
  3. Storage Pressure: Fill etcd disk space
  4. Application Restart: Delete all pods in namespace
  5. Network Partition: Simulate zone failures

Performance Benchmarks

  • API Response Time: Under 100ms for basic operations
  • Pod Startup Time: Under 30 seconds for standard images
  • Network Throughput: Baseline inter-pod communication
  • Storage IOPS: Verify etcd performance requirements

Capacity Planning

Scaling Thresholds

  • CPU: Scale at 70% sustained usage
  • Memory: Scale at 80% usage (account for spikes)
  • Storage: Scale at 75% usage
  • Network: Monitor bandwidth utilization
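
These thresholds are simple enough to encode directly. A sketch with illustrative function names; check_nodes assumes metrics-server is installed so that kubectl top works, and it reads a single point in time, so "sustained" still has to come from your monitoring stack:

```shell
# needs_scaling RESOURCE PERCENT: succeed if usage has reached the threshold
# listed above (cpu 70, memory 80, storage 75). Returns 2 for unknown resources.
needs_scaling() {
  case $1 in
    cpu)     limit=70 ;;
    memory)  limit=80 ;;
    storage) limit=75 ;;
    *) echo "unknown resource: $1" >&2; return 2 ;;
  esac
  [ "$2" -ge "$limit" ]
}

# check_nodes: flag nodes whose current CPU or memory usage is over threshold.
# Expects the kubectl top columns: NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%.
check_nodes() {
  kubectl top nodes --no-headers | while read -r name cores cpu_pct mem_bytes mem_pct; do
    if needs_scaling cpu "${cpu_pct%\%}"; then echo "$name: CPU at $cpu_pct"; fi
    if needs_scaling memory "${mem_pct%\%}"; then echo "$name: memory at $mem_pct"; fi
  done
}
```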

Growth Planning

  • Pod Density: Plan for 20-30 pods per node maximum
  • Node Count: Maintain 20% capacity buffer
  • Storage Growth: Monitor etcd size growth patterns
  • Network Bandwidth: Plan for 10x traffic growth

Cost Optimization

Resource Efficiency

  • Right-sizing: Monitor actual vs requested resources
  • Spot Instances: Use for fault-tolerant workloads
  • Storage Optimization: Match storage class to use case
  • Network Costs: Minimize cross-zone traffic

Monitoring Costs

  • Cloud Provider Bills: Track Kubernetes-related charges
  • Resource Utilization: Identify over-provisioned resources
  • Efficiency Metrics: Cost per application, per team
  • Optimization Opportunities: Regular cost review cycles

Maintenance Windows

Planned Maintenance

  • Certificate Rotation: Business hours with brief API downtime
  • Kubernetes Upgrades: Extended maintenance windows
  • Node Patching: Rolling updates with capacity planning
  • Storage Expansion: Minimal downtime operations

Emergency Procedures

  • Incident Response: Escalation procedures
  • Rollback Plans: Rapid recovery procedures
  • Communication: Status page updates
  • Post-Incident: Root cause analysis and improvements

Useful Links for Further Investigation

Essential Kubernetes Production Resources

Link | Description
Kubernetes Production Environment Setup | Official production checklist from the Kubernetes project
kubeadm Production Best Practices | Step-by-step cluster creation with kubeadm
Kubernetes Security Best Practices | Security hardening guide for production clusters
etcd Administration Guide | etcd backup, recovery, and maintenance procedures
Kubernetes Backup and Disaster Recovery | Official backup strategies and procedures
AWS EKS | Managed Kubernetes on AWS ($73/month + infrastructure costs)
Google GKE | Google's managed Kubernetes with Autopilot options
Azure AKS | Microsoft's managed Kubernetes service
kops | Production Grade K8s installation, upgrades, and management on AWS
Rancher | Multi-cluster management platform with decent UI, questionable CLI
Red Hat OpenShift | Enterprise Kubernetes with enterprise pricing
containerd | Default container runtime, works everywhere, battle-tested
CRI-O | Lightweight runtime, minimal attack surface, security-focused
Calico | Works until networking dies with cryptic errors
Cilium | Fancy eBPF stuff that's hard to debug when it breaks
Flannel | Boring and reliable VXLAN overlay networking
kubectl | The CLI you'll use 300 times a day
k9s | Makes kubectl bearable. Just install it.
Helm | YAML templating that barely works but beats raw manifests
Stern | Tail logs from multiple pods without losing your mind
kubectl-debug | Enhanced debugging for ephemeral containers
Popeye | Cluster validator that makes you feel bad about your configs
Prometheus | Eats disk space but tells you why shit broke
Grafana | Pretty dashboards that spike during outages
Jaeger | Distributed tracing when you need to know where requests die
Elastic Stack (ELK) | Log aggregation and search (prepare for memory consumption)
Fluentd | Log collection that usually works
Falco | Runtime security monitoring for detecting weird behavior
cert-manager | Automatic certificate management and renewal with Let's Encrypt
Let's Encrypt | Free TLS certificates that actually work
Open Policy Agent (OPA) | Policy engine with a learning curve steeper than Kubernetes itself
Gatekeeper | OPA for Kubernetes admission control
kube-bench | CIS benchmark checks that make you realize how insecure your cluster is
Longhorn | Distributed storage that works in small clusters
Ceph Rook | Distributed storage for when you have storage expertise and unlimited time
Velero | Cluster backup and disaster recovery tool
MinIO | S3-compatible object storage for on-premises deployments
nginx-ingress | Most popular ingress controller, well-documented
Traefik | Auto-discovery ingress with better defaults
Ambassador/Emissary | API gateway built on Envoy
MetalLB | Load balancer for bare metal clusters
ArgoCD | GitOps continuous delivery, works when Git doesn't break everything
Flux | GitOps toolkit for automated deployments
Tekton | Kubernetes-native CI/CD pipelines
AWS Load Balancer Controller | Provisions ALBs and NLBs for services
EBS CSI Driver | Persistent volumes backed by EBS
AWS VPC CNI | Native VPC networking for pods
GKE Autopilot Documentation | Serverless Kubernetes that costs 3x more
GCP Load Balancer Integration | Native load balancer provisioning
AKS Best Practices | Microsoft's recommendations for production AKS
Azure CNI | Advanced networking with VNET integration
Kubernetes Slack | Real-time help from people who actually know what they're doing
CNCF Landscape | Map of the cloud native ecosystem (prepare for choice paralysis)
KodeKloud Kubernetes Courses | Hands-on learning that teaches practical skills
Kubernetes the Hard Way | Learn by building everything from scratch (masochist edition)
Kubernetes Community Forums | Community discussions and war stories
Kubernetes Troubleshooting Flowchart | Visual guide for debugging common issues
Post-Mortem Templates | Learn from other people's outages
SRE Book | Google's approach to running production systems
Kubernetes Failure Stories | Real outage stories and lessons learned
Pod Security Standards | Essential security policies for production
Network Policies | Firewall rules that will break everything until configured correctly
Kubernetes Security Checklist | Comprehensive security hardening guide
KubeCost | Track spending and resource usage across clusters
Spot Instances/Preemptible VMs | Save money using instances that disappear randomly
Cluster Autoscaler | Automatic node scaling (5 minutes after you needed it)
Kubernetes in Action | Comprehensive guide to Kubernetes concepts and patterns
Production Kubernetes | Practical guide to running K8s in production
Kubernetes Operators | Building and managing custom controllers