Infrastructure Prerequisites and Planning (The Shit Nobody Warns You About)

I've been woken up at 3am by etcd failures because someone cut corners on the infrastructure planning. Here's what you actually need to run Kubernetes in production without hating your life.

Hardware Requirements That Won't Screw You

Control Plane Nodes: The Brains That Better Not Die

Minimum specs for each control plane node:

  • CPU: 4 vCPUs minimum, 8+ recommended
  • RAM: 16GB minimum, 32GB if you want to sleep at night
  • Storage: 100GB+ SSD for etcd (never use spinning disks)
  • Network: Dedicated NICs with 10Gbps+ for cluster communication

Production reality check: Had a client run production on t2.micro. Lasted like 45 minutes before it shit the bed during the first real traffic spike. etcd needs consistent CPU and fast disk I/O - don't cheap out on storage.

You need 3 control plane nodes minimum spread across availability zones. 2 nodes means you have no fault tolerance, and 4+ nodes means you're overthinking it. etcd needs odd numbers for consensus, so stick with 3 or 5.

Worker Nodes: Where Your Apps Actually Live

Per node specifications:

  • CPU: 8-16+ vCPUs depending on workload density
  • RAM: 32-64GB+ (leave 25% for Kubernetes overhead)
  • Storage: Local SSD for container images and logs
  • Pod density: Plan for 20-30 pods per node maximum

The gotcha: the kubelet carves out CPU and memory for system components (kube-reserved, system-reserved) plus an eviction threshold - figure roughly 1-2GB RAM and a few hundred millicores on a typical node. Your 32GB node only gives you ~30GB of allocatable memory for actual workloads. Plan accordingly or learn this during your first resource crunch.
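Once nodes are registered you can see exactly what got carved out. A quick check, assuming kubectl access and a node name to hand:

## Compare raw capacity vs what the scheduler can actually hand out
kubectl describe node <node-name> | grep -A 6 -E "Capacity:|Allocatable:"

## Or just the memory numbers
kubectl get node <node-name> -o jsonpath='Capacity: {.status.capacity.memory}{"\n"}Allocatable: {.status.allocatable.memory}{"\n"}'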

Network Planning: Where Everything Goes Wrong

IP Address Planning (Get This Right or Cry Later):

  • Pod CIDR: /16 network gives you 65k pod IPs
  • Service CIDR: /16 for internal service IPs
  • Node network: Separate subnet from pod/service networks

Don't overlap with existing networks. Spent 3 days debugging "no route to host" until we found the office VPN used the same damn range.

Bandwidth requirements:

  • Control plane: 1Gbps minimum between nodes
  • Worker nodes: 10Gbps recommended for pod-to-pod traffic
  • Internet: Plan for 10x your expected traffic for image pulls

Storage Strategy: Because Persistent Means Persistent

etcd storage requirements:

  • Dedicated SSD volumes with 3000+ IOPS
  • 100GB minimum, monitor growth closely
  • Backup to different region/zone every hour
  • Network latency under 10ms between etcd members
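Before you commit to hardware, benchmark the disk the way the etcd docs suggest. A sketch assuming fio is installed, using a throwaway directory on the volume you plan to give etcd:

## Write test approximating etcd's WAL pattern - watch the fdatasync percentiles
mkdir -p /var/lib/etcd-bench
fio --rw=write --ioengine=sync --fdatasync=1 --directory=/var/lib/etcd-bench \
  --size=22m --bs=2300 --name=etcd-disk-check

## You want the 99th percentile of fdatasync under ~10ms. Clean up afterwards.
rm -rf /var/lib/etcd-bench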

Application storage classes:

  • Fast SSD: Databases, caches (gp3 with provisioned IOPS)
  • Standard SSD: General application data (gp3 default)
  • Cheap storage: Logs, backups (sc1 or cold storage)

Certificate Management and PKI Hell

Never use self-signed certificates in production. Set up cert-manager with Let's Encrypt for automatic certificate rotation, or integrate with your enterprise PKI.

Certificate lifecycle to plan for:

  • CA certificates: Root trust, rotate every 3-5 years
  • API server certificates: Automatic rotation via kubeadm
  • etcd certificates: Separate CA, manual rotation required
  • Application certificates: Automated via cert-manager

The pain: Certificate rotation during business hours will cause brief API unavailability. Plan maintenance windows or configure certificate rotation overlap.

Security Hardening Checklist

Enable these before you go live:

  • Pod Security Standards (restricted mode)
  • Network policies (deny-all by default, explicit allow)
  • RBAC with least privilege access
  • Runtime security scanning (Falco or similar)
  • Secrets encryption at rest in etcd (example config below)

The security reality: Default Kubernetes is like leaving your front door wide open. Enable Pod Security Standards immediately or prepare for security audit failures.
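That last checklist item - secrets encryption at rest - needs a config file on every control plane node that most people never create. A minimal sketch (the /etc/kubernetes/enc path is just an example; wire it into kube-apiserver with --encryption-provider-config via extraArgs plus a matching extraVolume in your kubeadm config):

## Generate a 32-byte key for the aescbc provider
head -c 32 /dev/urandom | base64

sudo mkdir -p /etc/kubernetes/enc
cat <<EOF | sudo tee /etc/kubernetes/enc/encryption-config.yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-key-from-above>
      - identity: {}   # fallback so existing plaintext secrets stay readable
EOF

Existing secrets only get encrypted when they're rewritten - after the API server restarts, kubectl get secrets --all-namespaces -o json | kubectl replace -f - forces that.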

Resource Sizing Reality Check

Control plane sizing by cluster size:

  • Small cluster (< 100 nodes): 4 CPU, 16GB RAM
  • Medium cluster (< 500 nodes): 8 CPU, 32GB RAM
  • Large cluster (1000+ nodes): 16+ CPU, 64GB+ RAM

Worker node sizing patterns:

  • Microservices: Many small nodes (8 CPU, 32GB)
  • Monoliths: Fewer large nodes (32+ CPU, 128GB+)
  • Batch workloads: Large nodes with local SSD storage

The Hidden Costs Nobody Mentions

Cloud provider markup reality:

  • AWS EKS: $73/month per cluster + EC2 costs
  • GKE: $73/month standard tier (autopilot costs 3x more)
  • AKS: $73/month standard tier + Azure tax on everything

Load balancer costs kill budgets:

  • $20-50/month per LoadBalancer service
  • You'll need 5-10 of these minimum
  • Ingress controllers reduce this but add complexity

Storage costs accumulate:

  • EBS gp3: $0.08/GB/month (adds up fast)
  • Snapshot storage: $0.05/GB/month for backups
  • Cross-zone data transfer: $0.02/GB (death by 1000 cuts)

Regional and Multi-Zone Strategy

High availability basics:

  • Control plane across 3 zones minimum
  • Worker nodes distributed evenly
  • etcd members in different zones
  • Application replicas spread via pod anti-affinity or topology spread constraints (example below)

The zone failure reality: When a zone goes down, you lose 33% capacity instantly. Size your remaining zones to handle the full load, or accept degraded performance during outages.
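A minimal sketch of what that zone spreading looks like. The names (my-app, my-app:latest) are placeholders, and topology.kubernetes.io/zone assumes your nodes carry the standard zone labels, which cloud providers set automatically:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      # Keep replicas balanced across zones; tolerate imbalance rather than blocking scheduling
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: my-app
      containers:
      - name: my-app
        image: my-app:latest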

Spinning disks = 30-second API calls. Don't do it. Local NVMe storage for etcd, EBS gp3 with provisioned IOPS for everything else. The extra $200/month in storage costs is nothing next to the $20k in lost revenue you eat when slow etcd takes the API server down.

Got the infrastructure sorted? Now you need to choose how you're actually going to deploy this clusterfuck. Spoiler: they all suck in different ways.

Kubernetes Deployment Method Comparison

| Method | Setup Time | Monthly Cost | Operational Overhead | Best For | Pain Points |
|--------|------------|--------------|----------------------|----------|-------------|
| AWS EKS | 2-4 hours | $73 + $200-2000 | Low-Medium | AWS shops | First EKS bill was 3x expected |
| Google GKE | 1-2 hours | $73 + $150-1500 | Low | Google Cloud users | Autopilot vendor lock-in hell |
| Azure AKS | 2-6 hours | $73 + $180-1800 | Medium | Microsoft prisoners | Azure networking is a nightmare |
| kubeadm | 8-16 hours | $100-800 | High | Control freaks | Certificate rotation will ruin weekends |
| kops | 4-8 hours | $120-900 | High | AWS power users | AWS-only, complex upgrades |
| Rancher | 4-6 hours | $0 + infra costs | Medium-High | Multi-cloud | UI looks pretty, CLI is garbage |
| OpenShift | 1-2 days | $5000-20000 | Low-Medium | Enterprises | Red Hat support costs more than your salary |

Step-by-Step Production Deployment (kubeadm Method)

This walkthrough covers kubeadm deployment because it teaches you what's actually happening under the hood. Once you understand this pain, managed Kubernetes will feel like magic.

Pre-Flight Checklist: Don't Skip This Shit

On ALL nodes (control plane + workers):

## Update system packages (this will fail if you don't have the right repo)
sudo apt update && sudo apt upgrade -y

## Disable swap (Kubernetes hates swap)
sudo swapoff -a
sudo sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab

## Install container runtime (containerd)
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

sudo apt update
sudo apt install -y containerd.io

## Configure containerd for Kubernetes
sudo mkdir -p /etc/containerd
sudo containerd config default | sudo tee /etc/containerd/config.toml
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
sudo systemctl restart containerd
sudo systemctl enable containerd

If that shits the bed: the Docker repo add above probably didn't take - re-run the GPG key and repo commands, apt update again, then retry the containerd install.
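Also before kubeadm touches anything: the preflight checks want bridge netfilter and IP forwarding enabled, and the list above skips them. These are the settings from the official container runtime prerequisites:

## Load the required kernel modules now and on every boot
cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF

sudo modprobe overlay
sudo modprobe br_netfilter

## Let iptables see bridged traffic and allow forwarding
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF

sudo sysctl --system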

Install Kubernetes Components

## Add Kubernetes repo (the legacy apt.kubernetes.io repo is dead - use the version-scoped pkgs.k8s.io repo)
sudo apt update
sudo apt install -y apt-transport-https ca-certificates curl gpg
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.33/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.33/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list

sudo apt update

## Install specific versions (pin the version or updates will fuck you)
sudo apt install -y kubelet=1.33.4-1.1 kubeadm=1.33.4-1.1 kubectl=1.33.4-1.1
sudo apt-mark hold kubelet kubeadm kubectl

Version reality check: Use Kubernetes 1.33.x for production as of September 2025. 1.34 just dropped but let it bake for 2-3 months before using in prod.

Initialize Control Plane (The Scary Part)

On the first control plane node:

## Create kubeadm config file (don't guess at this)
cat <<EOF | sudo tee /etc/kubernetes/kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 10.0.1.10  # Your control plane node IP
  bindPort: 6443
nodeRegistration:
  criSocket: unix:///var/run/containerd/containerd.sock
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.33.4
controlPlaneEndpoint: "k8s-api.yourdomain.com:6443"  # Use load balancer here
networking:
  podSubnet: "10.244.0.0/16"
  serviceSubnet: "10.96.0.0/16"
etcd:
  local:
    dataDir: "/var/lib/etcd"
apiServer:
  certSANs:
  - "k8s-api.yourdomain.com"
  - "10.0.1.10"
  - "10.0.1.11"
  - "10.0.1.12"
  # Audit logging is wired up via extraArgs + extraVolumes (there is no auditPolicy field)
  extraArgs:
    audit-policy-file: /etc/kubernetes/audit-policy.yaml
    audit-log-path: /var/log/kubernetes/audit.log
  extraVolumes:
  - name: audit-policy
    hostPath: /etc/kubernetes/audit-policy.yaml
    mountPath: /etc/kubernetes/audit-policy.yaml
    readOnly: true
    pathType: File
  - name: audit-logs
    hostPath: /var/log/kubernetes
    mountPath: /var/log/kubernetes
    pathType: DirectoryOrCreate
EOF
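The extraArgs above point the API server at an audit policy that doesn't exist yet - if the file is missing, kube-apiserver won't start. A minimal policy to get going (metadata-only logging; tune the rules later):

sudo mkdir -p /var/log/kubernetes
cat <<EOF | sudo tee /etc/kubernetes/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
# Skip read-only noise from a couple of chatty core resources
- level: None
  verbs: ["get", "list", "watch"]
  resources:
  - group: ""
    resources: ["events", "endpoints"]
# Everything else: log who did what, but not request/response bodies
- level: Metadata
EOF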

Run the initialization (this takes forever to download):

sudo kubeadm init --config=/etc/kubernetes/kubeadm-config.yaml --upload-certs

## Save the output! You need the join commands
## Copy the kubeconfig for kubectl access
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

If initialization fails: Check systemd logs with journalctl -xeu kubelet and fix whatever broke. Common issues:

  • Port 6443 already in use
  • Container runtime not configured properly
  • Swap still enabled despite earlier commands

Add Control Plane Nodes (For HA)

On additional control plane nodes:

## Use the join command from kubeadm init output
sudo kubeadm join k8s-api.yourdomain.com:6443 \
  --token <token-from-init> \
  --discovery-token-ca-cert-hash sha256:<hash-from-init> \
  --control-plane \
  --certificate-key <cert-key-from-init>

## Set up kubectl access
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
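If you're doing this hours or days after the init, the join credentials have probably expired - the bootstrap token lasts 24 hours and the uploaded certificate key only 2. Regenerate both on the first control plane node:

## Fresh worker join command (new token)
kubeadm token create --print-join-command

## Fresh certificate key for control plane joins
sudo kubeadm init phase upload-certs --upload-certs

## Append these to the printed join command for control plane nodes:
##   --control-plane --certificate-key <new-key-from-above>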

Install Pod Network (CNI Plugin)

Choose your networking hell:

## Calico (most popular, most complex)
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml

## OR Flannel (simple but limited)
kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml

## OR Cilium (eBPF magic that sometimes works)
helm repo add cilium https://helm.cilium.io/
helm install cilium cilium/cilium --version 1.15.8 \
  --namespace kube-system \
  --set kubeProxyReplacement=strict

Network reality: Calico is battle-tested but complex. Flannel just works until you need network policies. Cilium is the future but harder to debug when it breaks.

Join Worker Nodes

On each worker node:

## Use the worker join command from kubeadm init
sudo kubeadm join k8s-api.yourdomain.com:6443 \
  --token <token-from-init> \
  --discovery-token-ca-cert-hash sha256:<hash-from-init>

Verify everything works:

## Check node status (may take 5-10 minutes to show Ready)
kubectl get nodes -o wide

## Verify system pods are running
kubectl get pods -n kube-system

## Test pod scheduling
kubectl run nginx --image=nginx --port=80
kubectl expose pod nginx --port=80 --type=NodePort
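Quick sanity check that the test pod is actually reachable, then clean it up - <any-node-ip> is whichever node IP you can hit:

## Grab the assigned NodePort and curl it
NODE_PORT=$(kubectl get svc nginx -o jsonpath='{.spec.ports[0].nodePort}')
curl -I http://<any-node-ip>:$NODE_PORT

## Clean up the smoke test
kubectl delete svc nginx
kubectl delete pod nginx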

Install Essential Production Components

Ingress Controller (nginx-ingress)

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update

helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace \
  --set controller.replicaCount=3 \
  --set controller.nodeSelector."kubernetes\.io/os"=linux \
  --set defaultBackend.nodeSelector."kubernetes\.io/os"=linux \
  --set controller.admissionWebhooks.patch.nodeSelector."kubernetes\.io/os"=linux

Certificate Management (cert-manager)

helm repo add jetstack https://charts.jetstack.io
helm repo update

helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.15.3 \
  --set installCRDs=true

## Create Let's Encrypt issuer
cat <<EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@yourdomain.com
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          class: nginx
EOF

Storage Class Configuration

## For cloud deployments - AWS EBS example
cat <<EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF
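Worth a quick check that dynamic provisioning actually works before an app depends on it - a throwaway claim (the name is arbitrary):

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: storage-smoke-test
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 5Gi
EOF

## WaitForFirstConsumer means this sits Pending until a pod mounts it - that's normal
kubectl get pvc storage-smoke-test
kubectl delete pvc storage-smoke-test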

Cluster Validation and Smoke Tests

## Comprehensive cluster validation
kubectl get nodes -o wide
kubectl get pods --all-namespaces | grep -v Running

## Test DNS resolution
kubectl run busybox --rm -it --image=busybox -- nslookup kubernetes.default

## Test ingress controller
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-world
spec:
  replicas: 2
  selector:
    matchLabels:
      app: hello-world
  template:
    metadata:
      labels:
        app: hello-world
    spec:
      containers:
      - name: hello-world
        image: gcr.io/google-samples/hello-app:1.0
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: hello-world
spec:
  selector:
    app: hello-world
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hello-world
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - hello.yourdomain.com
    secretName: hello-tls
  rules:
  - host: hello.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: hello-world
            port:
              number: 80
EOF

## Test the application
curl -H "Host: hello.yourdomain.com" http://<ingress-ip>/

Backup Configuration

etcd backup script (run this daily):

#!/bin/bash
## Capture one timestamp so the snapshot name and the copied filename actually match
TS=$(date +%Y%m%d-%H%M%S)
ETCD_POD=$(kubectl get pod -n kube-system -l component=etcd -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n kube-system "$ETCD_POD" -- etcdctl snapshot save /var/lib/etcd/backup-$TS.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key

## Copy backup to external storage (kubectl cp wants namespace/pod-name, no "pod/" prefix)
kubectl cp kube-system/"$ETCD_POD":/var/lib/etcd/backup-$TS.db ./etcd-backup-$TS.db
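Verify the snapshot looks sane before trusting it - this assumes etcdctl is installed wherever the copy landed:

## Should print hash, revision, total keys, and size - zeros mean a bad snapshot
ETCDCTL_API=3 etcdctl snapshot status ./etcd-backup-<timestamp>.db --write-out=table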

Post-Deployment Security Hardening

## Enable Pod Security Standards
kubectl label --overwrite ns kube-system pod-security.kubernetes.io/enforce=privileged
kubectl label --overwrite ns kube-public pod-security.kubernetes.io/enforce=baseline  
kubectl label --overwrite ns default pod-security.kubernetes.io/enforce=restricted

## Create network policy deny-all default
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
  namespace: default
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
EOF
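Deny-all with Egress also blocks DNS, which breaks basically everything in confusing ways. Pair it with an allow rule for cluster DNS - this sketch assumes CoreDNS lives in kube-system, the default:

cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: default
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
EOF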

Reality check: This deployment process takes 4-8 hours the first time you do it. Everything will break at least twice. Keep the troubleshooting commands handy and don't try this on a Friday.

The cluster is deployed but not production-ready. You still need monitoring, alerting, backup verification, and about 50 other things that will break during your first real traffic spike.

Kubernetes Production Deployment FAQ

Q: Why does kubeadm init keep failing with "port 6443 already in use"?

A: Something's already running on that port. Nine times out of ten, it's a zombie kube-apiserver process from a previous failed installation attempt.

## Kill whatever's on port 6443
sudo ss -tulpn | grep :6443
sudo kill -9 <pid-from-above>

## Clean up previous kubeadm attempts  
sudo kubeadm reset -f
sudo rm -rf /etc/kubernetes /var/lib/etcd /var/lib/kubelet

## Start fresh
sudo kubeadm init --config=/etc/kubernetes/kubeadm-config.yaml

Don't just change the port - fix the underlying issue. Changing default ports breaks half the tooling that assumes 6443.

Q: My worker nodes won't join - they're stuck with "couldn't validate the identity of the API Server"

A: Your token expired. kubeadm tokens last 24 hours by default, which is just long enough to forget about them.

## On control plane node - create new token
kubeadm token create --print-join-command

## This gives you a fresh join command with new token
## Copy and run on worker nodes

Pro tip: Don't wait 3 weeks to get around to joining your worker nodes. I learned this after breaking production twice and spending a weekend fixing it.

Q: The ingress controller shows "ready" but returns 503 errors

A: Your backend service doesn't have any healthy endpoints. The ingress controller is working fine - your application pods are fucked.

## Check if service has endpoints
kubectl get endpoints your-service-name

## No endpoints? Your pod labels don't match service selector
kubectl get pods --show-labels
kubectl describe service your-service-name | grep Selector

## Fix the label mismatch or wait for pods to become ready
kubectl get pods -o wide
kubectl describe pod <pending-pod-name>

Q: Certificate renewal broke everything and now the API server won't start

A: kubeadm certificates expire after 1 year. When they rotate, sometimes the API server gets confused about which certificate to use.

## Check certificate expiry
sudo kubeadm certs check-expiration

## Renew all certificates manually
sudo kubeadm certs renew all

## Restart control plane pods (they'll pick up new certs)
sudo crictl ps -a | grep kube
sudo crictl stop <container-ids-for-api-server-controller-scheduler>

## Restart kubelet to pick up new kubeconfig
sudo systemctl restart kubelet

This will cause brief API downtime. Don't do it during business hours unless you enjoy career-limiting conversations.

Q: etcd is consuming all disk space and the cluster is dying

A: etcd doesn't automatically compact old revisions. Your database is full of every object that's ever existed.

## Check etcd database size
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status --write-out=table

## Compact etcd (this is scary but necessary)
REVISION=$(ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status --write-out=json | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*')

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  compact $REVISION

Set up automatic compaction in production or this will happen again in 6 months.
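Compaction only marks old revisions as reusable - the database file doesn't shrink until you defragment. Do members one at a time, since defrag briefly blocks the member it runs against:

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  defrag

## For the "automatic" part, etcd has periodic compaction flags
## (in kubeadm: etcd.local.extraArgs in ClusterConfiguration):
##   --auto-compaction-mode=periodic
##   --auto-compaction-retention=8h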

Q: LoadBalancer services are "pending" forever

A: Your cluster doesn't have a cloud controller manager or MetalLB configured. Kubernetes can't provision load balancers without something to actually create them.

For cloud providers:

## AWS - install AWS Load Balancer Controller
helm repo add eks https://aws.github.io/eks-charts
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName=your-cluster-name \
  --set serviceAccount.create=false \
  --set serviceAccount.name=aws-load-balancer-controller

For bare metal, use NodePort or set up MetalLB:

helm repo add metallb https://metallb.github.io/metallb
helm install metallb metallb/metallb --namespace metallb-system --create-namespace
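MetalLB won't hand out anything until you give it an address pool and an advertisement - the range below is a placeholder, use spare IPs from your node network:

cat <<EOF | kubectl apply -f -
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool
  namespace: metallb-system
spec:
  addresses:
  - 10.0.50.100-10.0.50.120
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default-l2
  namespace: metallb-system
spec:
  ipAddressPools:
  - default-pool
EOF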

Q: How do I test if my cluster is actually production-ready?

A: Run these tests and if any fail, you're not ready:

## 1. Drain a node during active traffic
kubectl drain <node-name> --ignore-daemonsets
## Apps should stay available

## 2. Kill a control plane node
## Cluster should keep working

## 3. Fill up etcd disk space
## API server should handle it gracefully

## 4. Restart all pods in a namespace
kubectl delete pods --all -n your-app-namespace
## Apps should restart without data loss

## 5. Network partition test
## Simulate network failures between zones

Don't skip this - I've been that guy who learned about split-brain scenarios during Black Friday. Expect something to break during upgrades. It always does. Have your resume updated and a rollback plan ready.
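Test #1 only passes if your apps tell Kubernetes how much disruption they can tolerate. A PodDisruptionBudget for the hello-world deployment from earlier - same pattern for anything you actually care about:

cat <<EOF | kubectl apply -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: hello-world-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: hello-world
EOF

## Drains now refuse to evict the last healthy replica instead of taking the app down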

Q: The cluster survives load testing but dies under real production traffic

A: Load testing tools don't simulate the chaos of real user behavior. Production traffic patterns include:

  • Random spikes that last 30 seconds
  • Slow database connections that consume connection pools
  • Memory leaks that only show up after 48+ hours of runtime
  • DNS timeouts during high connection volume
  • Certificate validation failures during traffic spikes

What actually helps: Start with 10% production traffic and gradually increase over weeks, not hours.

Q: How do I upgrade without destroying everything?

A: Step 1: Test the upgrade in staging (minikube doesn't count!)

Step 2: Drain and upgrade one control plane node at a time:

kubectl drain control-plane-1 --ignore-daemonsets
## SSH to node, upgrade kubeadm/kubelet, reboot
kubectl uncordon control-plane-1

Step 3: Upgrade worker nodes in batches of 20% capacity:

kubectl drain worker-node-1 --ignore-daemonsets
## Upgrade node, verify it rejoins healthy
kubectl uncordon worker-node-1

Step 4: Pray nothing breaks. It probably will.

Q: Why does my Prometheus server keep crashing with OOMKilled?

A: Prometheus memory usage grows with the number of metrics and retention period. Default settings assume you're monitoring a toy cluster.

## Check current memory usage
kubectl top pods -n monitoring | grep prometheus

## Increase memory limits in values.yaml
prometheus:
  prometheusSpec:
    resources:
      requests:
        memory: 4Gi
      limits:
        memory: 8Gi
    retention: 15d  # Reduce retention if needed

If you can't throw more memory at it, cut retention and stop scraping high-cardinality metrics you never look at.

Q: My Helm deployments work in dev but fail in production

A: Different storage classes, different node selectors, different resource quotas. Dev uses default everything, prod has constraints.

## Check what's different
kubectl describe pod <failed-pod> | grep -A 10 Events
kubectl get nodes --show-labels | grep node-type
kubectl describe resourcequota --all-namespaces

## Override Helm values for production
helm upgrade myapp ./chart \
  --values values.yaml \
  --values values-prod.yaml \
  --set resources.requests.memory=1Gi

Q: How do I know if I actually need Kubernetes?

A: You don't need Kubernetes if:

  • You have fewer than 10 services
  • Your traffic is predictable and doesn't spike
  • You have fewer than 5 engineers total
  • Downtime isn't business-critical
  • You're not planning to scale significantly

Use it anyway because your next job will require Kubernetes experience and explaining why you chose simpler technology makes you sound old.

Q: What's the most important thing for Kubernetes in production?

A: Monitoring and observability. You will debug more issues than you prevent. When (not if) your cluster breaks at 3am, you need to know:

  • Which nodes are healthy
  • What pods are crashing and why
  • Where network requests are failing
  • How much resources are being consumed
  • What changed in the last 24 hours

The second most important thing: backups you've actually tested. Not backup scripts that run without errors - backups you've restored successfully in a test environment.

Essential Production Monitoring and Observability

Kubernetes without monitoring = debugging outages blind at 2am. Here's the monitoring stack that actually helps during production incidents, not the pretty dashboards that look good in demos.

The Monitoring Reality Check

What you need to know during outages:

  1. Which pods are dying and why
  2. What nodes are running out of resources
  3. Which services aren't responding to health checks
  4. Where network requests are timing out
  5. What the fuck changed in the last 24 hours

What you don't need: 47 Grafana dashboards showing CPU usage trending over 6 months. You need actionable alerts that wake you up when things are actually broken.

Prometheus + Grafana: The Standard Stack That Works

Prometheus Installation (Prepare for Resource Consumption)

## Add Prometheus helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

## Custom values for production (don't use defaults)
cat <<EOF > prometheus-values.yaml
prometheus:
  prometheusSpec:
    resources:
      requests:
        memory: 4Gi
        cpu: 2
      limits:
        memory: 8Gi
        cpu: 4
    retention: 15d
    retentionSize: 50GB
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi
    
    # Pick up ServiceMonitors/PodMonitors/rules outside the chart's own release
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false
  
grafana:
  adminPassword: "change-this-password-now"
  resources:
    requests:
      memory: 512Mi
      cpu: 500m
    limits:
      memory: 1Gi
      cpu: 1
  
  persistence:
    enabled: true
    size: 10Gi
    storageClassName: fast-ssd

alertmanager:
  alertmanagerSpec:
    resources:
      requests:
        memory: 256Mi
      limits:
        memory: 512Mi
EOF

## Install with custom values
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  -f prometheus-values.yaml

Resource consumption warning: This setup will eat 15% of your cluster and generate stupid amounts of logs. Plan for it.

Prometheus ate 12GB RAM on a 3-node cluster with 5 pods. Monitoring took down production before the actual apps did.

Essential Grafana Dashboards (Not the Fancy Ones)

Import these dashboard IDs:

  • 1860: Node Exporter Full (node-level metrics)
  • 6417: Kubernetes Cluster Monitoring (cluster overview)
  • 7249: Kubernetes Deployment Statefulset Daemonset metrics
  • 13332: Kubernetes Cluster Monitoring (via Prometheus)

Skip the fancy dashboards: You need metrics that load in under 3 seconds during outages. Complex queries that take 30 seconds to render are useless when your site is down.

Log Aggregation: ELK Stack or Suffer

Elasticsearch, Fluentd, Kibana (ELK) Setup

## Add Elastic helm repo
helm repo add elastic https://helm.elastic.co
helm repo update

## Elasticsearch with proper resource limits
cat <<EOF > elasticsearch-values.yaml
replicas: 3
minimumMasterNodes: 2
esConfig:
  elasticsearch.yml: |
    cluster.name: "kubernetes-logs"
    network.host: 0.0.0.0
    xpack.security.enabled: false
    xpack.monitoring.enabled: false
    
resources:
  requests:
    cpu: 1
    memory: 2Gi
  limits:
    cpu: 2
    memory: 4Gi
    
volumeClaimTemplate:
  accessModes: [ "ReadWriteOnce" ]
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi
EOF

helm install elasticsearch elastic/elasticsearch \
  --namespace logging \
  --create-namespace \
  -f elasticsearch-values.yaml

Elasticsearch reality check: It will consume as much memory as you give it, then ask for more. Start with 4GB per node and monitor usage. Don't be surprised when it uses 8GB+ for a medium cluster.

Fluentd for Log Collection

cat <<EOF > fluentd-values.yaml
image:
  repository: fluent/fluentd-kubernetes-daemonset
  tag: v1.16.2-debian-elasticsearch7-1.1

resources:
  limits:
    memory: 512Mi
  requests:
    cpu: 100m
    memory: 200Mi

elasticsearch:
  hosts: ["elasticsearch-master.logging.svc.cluster.local:9200"]
  
## Exclude noisy logs
env:
  - name: FLUENTD_SYSTEMD_CONF
    value: disable
  - name: FLUENTD_PROMETHEUS_CONF
    value: disable
EOF

## The fluent charts live in their own repo
helm repo add fluent https://fluent.github.io/helm-charts
helm repo update

helm install fluentd fluent/fluentd \
  --namespace logging \
  -f fluentd-values.yaml

Kibana for Log Search (When You're Desperate)

helm install kibana elastic/kibana \
  --namespace logging \
  --set elasticsearchHosts="http://elasticsearch-master.logging.svc.cluster.local:9200" \
  --set resources.requests.memory=1Gi \
  --set resources.limits.memory=2Gi

Kibana truth: You'll only use it during outages to search for error messages. It's slow, complex, and occasionally useful. The search syntax is cryptic but powerful.

Application Performance Monitoring (APM)

Jaeger for Distributed Tracing

## Install Jaeger operator
kubectl apply -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.59.0/jaeger-operator.yaml

## Create Jaeger instance
cat <<EOF | kubectl apply -f -
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger-prod
  namespace: monitoring
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch-master.logging.svc.cluster.local:9200
        index-prefix: jaeger
  collector:
    resources:
      requests:
        memory: 512Mi
        cpu: 500m
      limits:
        memory: 1Gi
        cpu: 1
EOF

Tracing reality: You need to instrument your applications to send traces. This means code changes, not just YAML deployment. Worth it for debugging complex request flows.

Alerting Rules That Actually Matter

Critical Alerts Only

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-critical-alerts
  namespace: monitoring
spec:
  groups:
  - name: kubernetes.critical
    rules:
    
    # Node down
    - alert: NodeDown
      expr: up{job="node-exporter"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Node {{ $labels.instance }} is down"
        
    # Pod crash loop
    - alert: PodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total[5m]) > 0.1
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} is crash looping"
        
    # API server down
    - alert: KubernetesAPIServerDown
      expr: up{job="apiserver"} == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Kubernetes API Server is down"
        
    # etcd down
    - alert: EtcdDown
      expr: up{job="etcd"} == 0
      for: 3m
      labels:
        severity: critical
      annotations:
        summary: "etcd is down - cluster fucked"
        
    # Out of disk space
    - alert: NodeDiskSpaceRunningLow
      expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 85
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Node disk space is {{ $value }}% full"

Slack Integration for Alerts

global:
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'slack-notifications'

receivers:
- name: 'slack-notifications'
  slack_configs:
  - channel: '#alerts'
    title: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
    text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    send_resolved: true
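This route/receiver block doesn't apply itself. With kube-prometheus-stack, the usual place for it is under alertmanager.config in the same values file, then a helm upgrade (the key name here is taken from the chart's values layout):

## Add the route/receivers above under alertmanager.config in prometheus-values.yaml, then:
helm upgrade monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  -f prometheus-values.yaml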

Cluster Resource Monitoring Commands

Real-Time Debugging Commands

## Node resource usage
kubectl top nodes

## Pod resource usage sorted by memory
kubectl top pods --all-namespaces --sort-by=memory

## Find pods without resource limits (they'll kill your nodes)
kubectl get pods --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.containers[*].resources}{"\n"}{end}' | grep -E '({}|"requests":{})'

## Check for pods in CrashLoopBackOff
kubectl get pods --all-namespaces | grep -v Running | grep -v Completed

## Events from the last hour (useful during incidents)
kubectl get events --sort-by='.lastTimestamp' --all-namespaces | tail -20

## Network policies that might be blocking traffic
kubectl get networkpolicies --all-namespaces

## Persistent volume claims that are stuck
kubectl get pvc --all-namespaces | grep -v Bound

Health Check Dashboard (Simple but Effective)

#!/bin/bash
echo "=== Kubernetes Cluster Health Check ==="
echo
echo "Nodes:"
kubectl get nodes -o wide
echo
echo "System Pods:"
kubectl get pods -n kube-system | grep -v Running
echo
echo "Failed Pods (All Namespaces):"
kubectl get pods --all-namespaces | grep -v Running | grep -v Completed | grep -v ContainerCreating
echo
echo "Resource Usage:"
kubectl top nodes 2>/dev/null || echo "metrics-server not available"
echo
echo "Recent Events:"
kubectl get events --sort-by='.lastTimestamp' --all-namespaces | tail -5

Save this as cluster-health.sh and run it every morning. It shows 90% of the problems you need to know about.

Cost Monitoring (Because Infrastructure Isn't Free)

Monthly monitoring stack costs:

  • Prometheus storage: ~$100/month for 15 days retention
  • Grafana persistent volumes: ~$10/month
  • ELK stack: ~$300/month for log storage and processing
  • Jaeger traces: ~$50/month for storage
  • Alertmanager: ~$5/month

Total: ~$500/month for comprehensive monitoring of a medium cluster. This is on top of your actual infrastructure costs.

The tradeoff: Skip monitoring to save money, spend 10x more on debugging outages and lost revenue during downtime.

Monitoring Best Practices That Actually Work

  1. Start with basic metrics: CPU, memory, disk, network
  2. Add application metrics gradually: Don't try to monitor everything on day one
  3. Set up alerting for business impact: Revenue-affecting alerts first, everything else second
  4. Test your monitoring: Intentionally break things and verify alerts fire
  5. Keep dashboards simple: Complex queries are useless during 3am outages

The monitoring paradox: The more comprehensive your monitoring becomes, the more time you spend maintaining the monitoring infrastructure instead of your actual applications.

Set up basic monitoring first, expand gradually based on actual production pain points. Your monitoring stack will probably be more complex than your application stack, and that's normal in the Kubernetes world.
