Kubernetes Production Deployment - AI-Optimized Reference
Infrastructure Prerequisites
Hardware Requirements
Control Plane Nodes (Critical Foundation)
- Minimum: 4 vCPUs, 16GB RAM, 100GB+ SSD
- Production Reality: 8+ vCPUs, 32GB RAM prevents 3am outages
- Storage Critical: etcd requires 3000+ IOPS SSD, never spinning disks
- Network: 10Gbps+ dedicated NICs for cluster communication
- High Availability: 3 control plane nodes minimum (odd numbers for etcd consensus)
- Failure Impact: 2 nodes = no fault tolerance (etcd needs a majority to elect a leader); 4 nodes still tolerate only 1 failure, same as 3, so even counts add complexity without resilience
Worker Nodes (Application Runtime)
- CPU: 8-16+ vCPUs based on workload density
- Memory: 32-64GB+ (reserve 25% for Kubernetes overhead)
- Pod Density: practical maximum 20-30 pods per node (the kubelet default cap is 110, but density that high strains networking, logging, and kube-proxy)
- Resource Reality: a 32GB node typically leaves only ~24-28GB allocatable once kubelet and system reservations are set (see the KubeletConfiguration sketch below)
- Storage: Local SSD for container images and logs
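The 25% overhead above is enforced through kubelet reservations; without them the scheduler happily packs workloads into memory the OS needs. A minimal KubeletConfiguration sketch (the reservation sizes are illustrative assumptions, tune them to your node size):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Reserve capacity for OS daemons and Kubernetes components so
# workloads cannot starve the node itself (values are examples)
systemReserved:
  cpu: "500m"
  memory: "2Gi"
kubeReserved:
  cpu: "500m"
  memory: "2Gi"
# Evict pods before the node is completely out of memory
evictionHard:
  memory.available: "500Mi"
Allocatable = capacity minus systemReserved, kubeReserved, and the eviction threshold, which is where the ~24-28GB figure above comes from.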
Network Planning (High Failure Risk)
IP Address Allocation
- Pod CIDR: /16 network (65k pod IPs)
- Service CIDR: /16 for internal services
- Critical Warning: Don't overlap with existing networks
- Common Failure: Office VPN using same IP range causes "no route to host" errors
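Before committing to CIDRs, dump the routing table from a representative node and from a VPN-connected laptop; anything already routed inside your planned ranges will bite later. A quick sketch, assuming the 10.244.0.0/16 pod and 10.96.0.0/16 service subnets used in the kubeadm config later in this guide:
# List every route the host already knows about
ip route show
# Flag routes that fall inside the planned pod/service ranges
ip route show | grep -E '10\.(244|96)\.' && echo "CONFLICT: overlapping route found"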
Bandwidth Requirements
- Control Plane: 1Gbps minimum between nodes
- Workers: 10Gbps recommended for pod-to-pod traffic
- Internet: Plan 10x expected traffic for image pulls
Storage Strategy (Data Persistence)
etcd Requirements (Mission Critical)
- Performance: Dedicated SSD with 3000+ IOPS
- Capacity: 100GB disk minimum; note the etcd backend quota defaults to 2GB (8GB recommended maximum), so most of that disk is headroom for WAL files, snapshots, and compaction. Monitor growth closely
- Latency: Under 10ms between etcd members
- Backup: Hourly backups to different region/zone
- Failure Point: Spinning disks = 30-second API calls
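The etcd project publishes an fio-based check for whether a disk can sustain etcd's fsync-per-write pattern; a sketch of it is below (the test directory is a placeholder). The p99 fdatasync latency in the output should stay under 10ms:
# Approximate etcd's WAL pattern: small sequential writes with an
# fdatasync after every write (parameters from the etcd hardware docs)
sudo mkdir -p /var/lib/etcd-test
sudo fio --rw=write --ioengine=sync --fdatasync=1 \
  --directory=/var/lib/etcd-test --size=22m --bs=2300 \
  --name=etcd-disk-check
# Check the fsync/fdatasync percentiles in the output; p99 over 10ms
# means this disk will produce slow-apply warnings and API latency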
Application Storage Classes
- Fast SSD: Databases, caches (gp3 with provisioned IOPS; sample StorageClass below)
- Standard SSD: General application data (gp3 default)
- Cheap Storage: Logs, backups (sc1 or cold storage)
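On AWS with the EBS CSI driver, the fast tier might look like the sketch below; the IOPS and throughput figures are illustrative assumptions, not recommendations:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "6000"       # provisioned above the gp3 baseline of 3000
  throughput: "250"  # MiB/s
# Delay volume creation until a pod is scheduled, so the volume
# lands in the same zone as its consumer
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true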
Cost Analysis (Hidden Expenses)
Cloud Provider Pricing
Provider | Monthly Base Cost | Reality
---|---|---
AWS EKS | $73 + EC2 costs | First bill 3x expected |
GKE | $73 standard, Autopilot 3x more | Vendor lock-in risk |
AKS | $73 + Azure tax | Azure networking complexity |
Additional Costs That Kill Budgets
- Load Balancers: $20-50/month each (a typical production setup ends up with 5-10)
- Storage: EBS gp3 $0.08/GB/month accumulates fast
- Data Transfer: Cross-zone $0.02/GB (death by 1000 cuts)
- Monitoring Stack: ~$500/month for comprehensive coverage
Deployment Method Comparison
Method | Setup Time | Operational Overhead | Best For | Critical Pain Points
---|---|---|---|---
AWS EKS | 2-4 hours | Low-Medium | AWS environments | Unexpected billing |
Google GKE | 1-2 hours | Low | Google Cloud | Autopilot vendor lock |
Azure AKS | 2-6 hours | Medium | Microsoft shops | Networking nightmare |
kubeadm | 8-16 hours | High | Control requirements | Weekend certificate issues |
kops | 4-8 hours | High | AWS power users | AWS-only, complex upgrades |
kubeadm Production Deployment
Pre-Installation Critical Steps
System Preparation (All Nodes)
# Disable swap (Kubernetes requirement)
sudo swapoff -a
sudo sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab
# Install containerd (avoid Docker; note containerd.io is published
# from Docker's apt repository, which must be configured first)
sudo apt install -y containerd.io
sudo containerd config default | sudo tee /etc/containerd/config.toml
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
sudo systemctl restart containerd
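The commands above assume two prerequisites from the official install docs are already in place: the overlay and br_netfilter kernel modules, and the sysctls that let iptables see bridged traffic:
# Load required kernel modules now and on every boot
cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF
sudo modprobe overlay
sudo modprobe br_netfilter
# Enable bridged-traffic filtering and IP forwarding
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF
sudo sysctl --system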
Version Pinning (Prevent Update Disasters)
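The pinned install below assumes the community-owned pkgs.k8s.io repository for the v1.33 minor is already configured; setting it up looks roughly like this:
# Add the Kubernetes apt repository for the v1.33 minor release
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.33/deb/Release.key | \
  sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.33/deb/ /' | \
  sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt update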
# Pin specific Kubernetes versions
sudo apt install -y kubelet=1.33.4-1.1 kubeadm=1.33.4-1.1 kubectl=1.33.4-1.1
sudo apt-mark hold kubelet kubeadm kubectl
Control Plane Initialization
Configuration File (Don't Guess Parameters)
apiVersion: kubeadm.k8s.io/v1beta4   # v1beta3 is deprecated as of kubeadm 1.31
kind: ClusterConfiguration
kubernetesVersion: v1.33.4
controlPlaneEndpoint: "k8s-api.yourdomain.com:6443"
networking:
  podSubnet: "10.244.0.0/16"
  serviceSubnet: "10.96.0.0/16"
etcd:
  local:
    dataDir: "/var/lib/etcd"
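Assuming the configuration above is saved as kubeadm-config.yaml (a filename chosen here for illustration), initialization and kubectl access follow the standard pattern:
# Initialize the first control plane node; --upload-certs lets the
# other control plane nodes join without manual certificate copying
sudo kubeadm init --config=kubeadm-config.yaml --upload-certs
# Give your non-root user kubectl access
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config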
Common Initialization Failures
- Port 6443 in use: Kill zombie kube-apiserver processes
- Container runtime issues: Verify containerd configuration
- Swap enabled: Double-check swap disable commands
Post-Deployment Essentials
CNI Network Plugin Selection
- Calico: Battle-tested, complex debugging
- Flannel: Simple, reliable, limited features (one-line install below)
- Cilium: eBPF advanced features, harder troubleshooting
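If Flannel is the pick, installation is a single manifest; note it defaults to the 10.244.0.0/16 pod subnet used in the kubeadm config above (adjust one or the other if they differ):
# Install Flannel from its published release manifest
kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml
# Nodes flip from NotReady to Ready once the CNI is up
kubectl get nodes -w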
Essential Production Components
- Ingress Controller: nginx-ingress (most documented)
- Certificate Management: cert-manager with Let's Encrypt
- Storage Classes: Configure for cloud provider
- Security Policies: Pod Security Standards (restricted mode)
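Pod Security Standards are enforced per namespace via labels; for example, to require the restricted profile (the namespace name is illustrative):
# Enforce restricted; warn and audit surface violations in API
# responses and audit logs
kubectl label namespace production \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/warn=restricted \
  pod-security.kubernetes.io/audit=restricted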
Monitoring and Observability
Prometheus Stack Resource Requirements
Memory Consumption Reality
- Prometheus: 4-8GB RAM for medium cluster
- Grafana: 512MB-1GB RAM
- Elasticsearch: 4GB+ per node (grows aggressively)
- Total Impact: 15% of cluster resources consumed
Critical Alerts Configuration
# Essential alerts only
- NodeDown (5min threshold)
- PodCrashLooping (10min threshold)
- APIServerDown (1min threshold)
- EtcdDown (3min threshold)
- DiskSpaceRunningLow (85% threshold)
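As a Prometheus rule, the first alert in that list might look like this sketch; the job label is an assumption about how the node-exporter scrape is named in your setup:
groups:
- name: cluster-critical
  rules:
  - alert: NodeDown
    # Scrape target unreachable for the full 5-minute threshold
    expr: up{job="node-exporter"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Node exporter on {{ $labels.instance }} down for 5 minutes"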
Log Aggregation Strategy
ELK Stack Deployment
- Elasticsearch: 3 master-eligible replicas (quorum of 2 prevents split-brain)
- Fluentd: DaemonSet for log collection
- Kibana: Search interface (slow but useful during outages)
- Storage Requirements: 100GB+ for log retention
Security Hardening Checklist
Immediate Security Requirements
- Pod Security Standards: Restricted mode default
- Network Policies: Deny-all default, explicit allow (manifest sketch after this list)
- RBAC: Least privilege access patterns
- Secrets Encryption: At rest in etcd
- Certificate Management: Automated rotation via cert-manager
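The deny-all default mentioned above is one manifest per namespace; everything else then needs an explicit allow rule (the namespace name is illustrative):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  # An empty podSelector matches every pod in the namespace
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress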
Runtime Security
- Falco: Runtime security monitoring
- kube-bench: CIS benchmark validation
- OPA Gatekeeper: Policy enforcement
- Image Scanning: Security vulnerability detection
Operational Procedures
Backup and Recovery
etcd Backup (Critical Process)
# Hourly automated backup (matches the cadence above; schedule via cron or a CronJob)
ETCDCTL_API=3 etcdctl snapshot save /var/lib/etcd/backup-$(date +%Y%m%d-%H%M%S).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key
Backup Testing: Test restore procedures in staging environment
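Testing means actually running a restore, not checking that the file exists. A sketch using etcdctl (the snapshot filename and restore path are placeholders):
# Verify the snapshot is readable and reports sane revision/size
ETCDCTL_API=3 etcdctl snapshot status /var/lib/etcd/backup-20250101-000000.db --write-out=table
# Restore into a fresh data directory, then point a test etcd at it
ETCDCTL_API=3 etcdctl snapshot restore /var/lib/etcd/backup-20250101-000000.db \
  --data-dir=/var/lib/etcd-restored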
Upgrade Strategy
Phased Upgrade Process
- Test in staging (production-like environment)
- Drain and upgrade control plane nodes individually (kubeadm commands after this list)
- Upgrade worker nodes in 20% capacity batches
- Verify cluster health after each phase
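With kubeadm, the phases above map to commands like the following; the target version is a hypothetical next patch release and node names are placeholders:
# First control plane node: upgrade the kubeadm package itself
# (unhold, install, re-hold), then preview and apply
sudo kubeadm upgrade plan
sudo kubeadm upgrade apply v1.33.5
# Remaining control plane nodes and workers use 'upgrade node'
sudo kubeadm upgrade node
# Per worker: drain, upgrade kubelet, restart, uncordon
kubectl drain worker-1 --ignore-daemonsets --delete-emptydir-data
sudo apt-mark unhold kubelet kubectl
sudo apt install -y kubelet=1.33.5-1.1 kubectl=1.33.5-1.1
sudo apt-mark hold kubelet kubectl
sudo systemctl daemon-reload && sudo systemctl restart kubelet
kubectl uncordon worker-1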
Rollback Preparation
- Document current versions
- Prepare rollback procedures
- Maintain cluster access during upgrades
Troubleshooting Common Issues
Certificate Problems
Symptoms: API server won't start after certificate rotation
Solution: Renew certificates manually, restart control plane pods
Prevention: Monitor certificate expiration dates
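kubeadm ships the commands for the manual renewal described above:
# See what expires and when
sudo kubeadm certs check-expiration
# Renew all kubeadm-managed certificates
sudo kubeadm certs renew all
# Control plane components only pick up new certs on restart:
# briefly move the static pod manifests out of
# /etc/kubernetes/manifests/ and back so kubelet recreates them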
etcd Storage Issues
Symptoms: Cluster unresponsive, disk space full
Solution: Compact etcd database, increase storage
Prevention: Automated compaction, storage monitoring
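Compaction plus defragmentation is what actually returns disk space; a sketch against a local member, reusing the same TLS flags as the backup command above:
# Shared connection flags for the local etcd member
FLAGS="--endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key"
# Current revision -> compact history up to it -> defragment
rev=$(ETCDCTL_API=3 etcdctl $FLAGS endpoint status --write-out=json \
  | grep -o '"revision":[0-9]*' | grep -o '[0-9]\+' | head -1)
ETCDCTL_API=3 etcdctl $FLAGS compaction "$rev"
ETCDCTL_API=3 etcdctl $FLAGS defrag
# Clear the NOSPACE alarm if it fired
ETCDCTL_API=3 etcdctl $FLAGS alarm disarm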
Network Connectivity
Symptoms: Pod-to-pod communication failures
Solution: Verify CNI plugin status, check network policies
Prevention: Network policy testing, CNI monitoring
Resource Exhaustion
Symptoms: Pods pending, node pressure conditions
Solution: Add resources or optimize resource requests
Prevention: Resource monitoring, capacity planning
Production Readiness Validation
Cluster Testing Procedures
- Node Failure Test: Drain node during active traffic
- Control Plane Failure: Kill control plane node
- Storage Pressure: Fill etcd disk space
- Application Restart: Delete all pods in namespace
- Network Partition: Simulate zone failures
Performance Benchmarks
- API Response Time: Under 100ms for basic operations
- Pod Startup Time: Under 30 seconds for standard images
- Network Throughput: Baseline inter-pod communication
- Storage IOPS: Verify etcd performance requirements
Capacity Planning
Scaling Thresholds
- CPU: Scale at 70% sustained usage
- Memory: Scale at 80% usage (account for spikes)
- Storage: Scale at 75% usage
- Network: Monitor bandwidth utilization
Growth Planning
- Pod Density: Plan for 20-30 pods per node maximum
- Node Count: Maintain 20% capacity buffer
- Storage Growth: Monitor etcd size growth patterns
- Network Bandwidth: Plan for 10x traffic growth
Cost Optimization
Resource Efficiency
- Right-sizing: Monitor actual vs requested resources
- Spot Instances: Use for fault-tolerant workloads
- Storage Optimization: Match storage class to use case
- Network Costs: Minimize cross-zone traffic
Monitoring Costs
- Cloud Provider Bills: Track Kubernetes-related charges
- Resource Utilization: Identify over-provisioned resources
- Efficiency Metrics: Cost per application, per team
- Optimization Opportunities: Regular cost review cycles
Maintenance Windows
Planned Maintenance
- Certificate Rotation: Business hours with brief API downtime
- Kubernetes Upgrades: Extended maintenance windows
- Node Patching: Rolling updates with capacity planning
- Storage Expansion: Minimal downtime operations
Emergency Procedures
- Incident Response: Escalation procedures
- Rollback Plans: Rapid recovery procedures
- Communication: Status page updates
- Post-Incident: Root cause analysis and improvements
Useful Links for Further Investigation
Essential Kubernetes Production Resources
Link | Description
---|---
Kubernetes Production Environment Setup | Official production checklist from the Kubernetes project |
kubeadm Production Best Practices | Step-by-step cluster creation with kubeadm |
Kubernetes Security Best Practices | Security hardening guide for production clusters |
etcd Administration Guide | etcd backup, recovery, and maintenance procedures |
Kubernetes Backup and Disaster Recovery | Official backup strategies and procedures |
AWS EKS | Managed Kubernetes on AWS ($73/month + infrastructure costs) |
Google GKE | Google's managed Kubernetes with autopilot options |
Azure AKS | Microsoft's managed Kubernetes service |
kops | Production Grade K8s installation, upgrades, and management on AWS |
Rancher | Multi-cluster management platform with decent UI, questionable CLI |
Red Hat OpenShift | Enterprise Kubernetes with enterprise pricing |
containerd | Default container runtime, works everywhere, battle-tested |
CRI-O | Lightweight runtime, minimal attack surface, security-focused |
Calico | Works until networking dies with cryptic errors |
Cilium | Fancy eBPF stuff that's hard to debug when it breaks |
Flannel | Boring and reliable VXLAN overlay networking |
kubectl | The CLI you'll use 300 times a day |
k9s | Makes kubectl bearable. Just install it. |
Helm | YAML templating that barely works but beats raw manifests |
Stern | Tail logs from multiple pods without losing your mind |
kubectl-debug | Enhanced debugging for ephemeral containers |
Popeye | Cluster validator that makes you feel bad about your configs |
Prometheus | Eats disk space but tells you why shit broke |
Grafana | Pretty dashboards that spike during outages |
Jaeger | Distributed tracing when you need to know where requests die |
Elastic Stack (ELK) | Log aggregation and search (prepare for memory consumption) |
Fluentd | Log collection that usually works |
Falco | Runtime security monitoring for detecting weird behavior |
cert-manager | Automatic certificate management and renewal with Let's Encrypt |
Let's Encrypt | Free TLS certificates that actually work |
Open Policy Agent (OPA) | Policy engine with a learning curve steeper than Kubernetes itself |
Gatekeeper | OPA for Kubernetes admission control |
kube-bench | CIS benchmark checks that make you realize how insecure your cluster is |
Longhorn | Distributed storage that works in small clusters |
Ceph Rook | Distributed storage for when you have storage expertise and unlimited time |
Velero | Cluster backup and disaster recovery tool |
MinIO | S3-compatible object storage for on-premises deployments |
nginx-ingress | Most popular ingress controller, well-documented |
Traefik | Auto-discovery ingress with better defaults |
Ambassador/Emissary | API gateway built on Envoy |
MetalLB | Load balancer for bare metal clusters |
ArgoCD | GitOps continuous delivery, works when Git doesn't break everything |
Flux | GitOps toolkit for automated deployments |
Tekton | Kubernetes-native CI/CD pipelines |
AWS Load Balancer Controller | Provisions ALBs and NLBs for services |
EBS CSI Driver | Persistent volumes backed by EBS |
AWS VPC CNI | Native VPC networking for pods |
GKE Autopilot Documentation | Serverless Kubernetes that costs 3x more |
GCP Load Balancer Integration | Native load balancer provisioning |
AKS Best Practices | Microsoft's recommendations for production AKS |
Azure CNI | Advanced networking with VNET integration |
Kubernetes Slack | Real-time help from people who actually know what they're doing |
CNCF Landscape | Map of the cloud native ecosystem (prepare for choice paralysis) |
KodeKloud Kubernetes Courses | Hands-on learning that teaches practical skills |
Kubernetes the Hard Way | Learn by building everything from scratch (masochist edition) |
Kubernetes Community Forums | Community discussions and war stories |
Kubernetes Troubleshooting Flowchart | Visual guide for debugging common issues |
Post-Mortem Templates | Learn from other people's outages |
SRE Book | Google's approach to running production systems |
Kubernetes Failure Stories | Real outage stories and lessons learned |
Pod Security Standards | Essential security policies for production |
Network Policies | Firewall rules that will break everything until configured correctly |
Kubernetes Security Checklist | Comprehensive security hardening guide |
KubeCost | Track spending and resource usage across clusters |
Spot Instances/Preemptible VMs | Save money using instances that disappear randomly |
Cluster Autoscaler | Automatic node scaling (5 minutes after you needed it) |
Kubernetes in Action | Comprehensive guide to Kubernetes concepts and patterns |
Production Kubernetes | Practical guide to running K8s in production |
Kubernetes Operators | Building and managing custom controllers |