Microservices with Docker & Kubernetes: AI-Optimized Technical Guide

Configuration

Development Environment Requirements

  • Docker Desktop: Latest version with Kubernetes enabled
  • System Resources: 16GB RAM minimum, 100GB free space, 8GB RAM allocated to Docker
  • Alternative Local Clusters: Kind (stable; see the example below), Minikube (feature-rich but breaks more often), k3s/k3d (production-like)
  • Tools: kubectl, Node.js 20+, Git
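
If you go the Kind route, a single-node local cluster is one command away. A minimal sketch; the cluster name is illustrative:

# Create a local Kubernetes cluster with Kind (requires Docker running)
kind create cluster --name microservices-dev

# Point kubectl at the new cluster and confirm the node is Ready
kubectl cluster-info --context kind-microservices-dev
kubectl get nodes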

Production-Ready Dockerfile Pattern

# Build stage: install production dependencies only
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
# --omit=dev is the current replacement for the deprecated --only=production flag
RUN npm ci --omit=dev && npm cache clean --force

# Runtime stage: minimal image that runs as a non-root user
FROM node:20-alpine AS runtime
RUN addgroup -g 1001 -S nodejs && adduser -S nodejs -u 1001 -G nodejs
WORKDIR /app
COPY --from=builder --chown=nodejs:nodejs /app/node_modules ./node_modules
COPY --chown=nodejs:nodejs . .
USER nodejs

EXPOSE 3000
# Docker-level health check; Kubernetes ignores HEALTHCHECK and relies on the probes defined in the Deployment below
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
  CMD node -e "require('http').get('http://localhost:3000/health', res => process.exit(res.statusCode === 200 ? 0 : 1)).on('error', () => process.exit(1))"

CMD ["node", "server.js"]

Kubernetes Deployment Configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-service-deployment
  labels:
    app: user-service
    version: v1.0.0
    component: backend
spec:
  replicas: 3  # Minimum viable replicas for reliability
  selector:
    matchLabels:
      app: user-service
  template:
    metadata:
      labels:
        app: user-service
        version: v1.0.0
    spec:
      containers:
      - name: user-service
        image: user-service:v1.0.0
        ports:
        - containerPort: 3000
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "256Mi"
            cpu: "200m"
        livenessProbe:
          httpGet:
            path: /health
            port: 3000
          initialDelaySeconds: 60  # Increased from 30 to prevent false failures
          periodSeconds: 10
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /health
            port: 3000
          initialDelaySeconds: 5
          periodSeconds: 5
          failureThreshold: 3
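
The Deployment by itself gives the pods no stable address; other services reach them through a Service, which also provides the DNS name that the troubleshooting commands later resolve. A minimal ClusterIP sketch (the external port of 80 is an assumption):

apiVersion: v1
kind: Service
metadata:
  name: user-service
  labels:
    app: user-service
spec:
  type: ClusterIP
  selector:
    app: user-service
  ports:
  - name: http
    port: 80          # what other services call: http://user-service
    targetPort: 3000  # the containerPort exposed by the pods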

Auto-Scaling Configuration

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: user-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: user-service-deployment
  minReplicas: 3
  maxReplicas: 50
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
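
Resource-type HPA metrics only work when metrics-server is installed; without it the HPA shows unknown targets and never scales. A quick verification sequence (the manifest filename is an assumption):

# metrics-server is usually a Deployment in kube-system
kubectl get deployment metrics-server -n kube-system

# Apply the autoscaler and watch current vs. target CPU utilization
kubectl apply -f user-service-hpa.yaml
kubectl get hpa user-service-hpa --watch

# Spot-check actual pod usage against requests and limits
kubectl top pods -l app=user-service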

Resource Requirements

Time Investment

  • Basic Setup: 2 weeks minimum
  • Production-Ready: 2-6 months
  • Full CI/CD Pipeline: 6-18 months
  • Team Expertise: Requires dedicated DevOps engineers or a significant learning curve for the existing team

Infrastructure Costs

  • Local Development: 16GB RAM, 100GB storage per developer
  • Production Scaling: 2x infrastructure cost with blue-green deployments
  • Hidden Costs:
    • LoadBalancer services: $20/month each
    • Persistent volumes: $10/month per 100GB
    • Monitoring stack: 4-6x resource usage of main application

Team Requirements

  • Kubernetes expertise (6+ months learning curve)
  • Docker containerization knowledge
  • Distributed systems debugging skills
  • 24/7 on-call capability for production issues

Critical Warnings

Architecture Complexity

  • Service Communication: Network calls introduce latency (50ms → 450ms with service mesh)
  • Debugging Difficulty: Request tracing across multiple services exponentially increases complexity
  • Data Consistency: Distributed transactions are unreliable
  • Log Aggregation: Scattered logs across services make root cause analysis challenging

Production Failure Modes

  • Single Points of Failure: API Gateway, service mesh proxies
  • Resource Limits: Pods get OOMKilled at memory limits (set limits 2x expected usage)
  • Health Check Lies: Health endpoints return 200 while application fails
  • Auto-Scaling Chaos: HPA scales incorrectly during load tests or traffic spikes
  • Image Pull Failures: 60% are typos in image names, 30% are registry authentication issues

Common Breaking Points

  • 1000+ Spans: The tracing UI becomes unusable when debugging large distributed transactions
  • Memory Usage: Applications typically use 3x their requested resources
  • Disk Space: Prometheus consumes more storage and compute than the applications it monitors
  • Network Timeouts: Random service-to-service timeouts during peak traffic

Implementation Reality

Deployment Strategy Comparison

| Strategy   | Downtime                      | Failure Risk             | Rollback Speed         | Cost Impact            |
|------------|-------------------------------|--------------------------|------------------------|------------------------|
| Rolling    | "Zero" (actually brief blips) | Always breaks eventually | Fast                   | Moderate               |
| Blue-Green | Actually zero                 | Expensive mistakes       | Instant                | 2x infrastructure cost |
| Canary     | Zero                          | Finds issues gradually   | Fast if detected early | Gradual cost increase  |
| Recreate   | 30-60 seconds                 | Honest about downtime    | Slow                   | Cheapest               |
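
Only Rolling and Recreate are built-in Deployment strategies (blue-green and canary need extra tooling or duplicate Deployments). A sketch of how the built-in ones are declared, with illustrative values:

# Rolling update (the default): replace pods gradually
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1         # at most one extra pod during the rollout
      maxUnavailable: 0   # never drop below the desired replica count

# Recreate: stop everything, then start the new version (accepts the 30-60 second downtime above)
spec:
  strategy:
    type: Recreate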

Service Mesh Impact

  • Latency: 8-9x increase (50ms → 450ms measured)
  • Resource Overhead: Additional CPU/memory for sidecar proxies
  • Debugging Complexity: Application + proxy debugging required
  • Learning Curve: mTLS, service policies, mesh-specific troubleshooting

Monitoring Requirements

  • Critical Metrics: Error rate spikes, response time >5s, memory usage >80% of limits, free disk <10% (a sample alert rule follows this list)
  • Tracing: OpenTelemetry for request correlation across services
  • Log Structure: JSON with correlation IDs, timestamps, and service identification
  • Cost: Monitoring infrastructure often costs more than the applications it monitors
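
As one concrete example of the memory threshold above, a Prometheus alert rule can compare working-set memory against the container limit. This sketch assumes the Prometheus Operator (kube-prometheus-stack) with kube-state-metrics installed; all names and labels are illustrative:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: user-service-alerts
  labels:
    release: prometheus   # must match the operator's ruleSelector in your install
spec:
  groups:
  - name: user-service.rules
    rules:
    - alert: UserServiceMemoryHigh
      # Working-set memory above 80% of the container's limit for 5 minutes
      expr: |
        max by (pod) (container_memory_working_set_bytes{container="user-service"})
          /
        max by (pod) (kube_pod_container_resource_limits{container="user-service", resource="memory"})
          > 0.80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "user-service memory is above 80% of its limit"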

Decision Criteria

When NOT to Use Microservices

  • Teams smaller than 8-10 engineers
  • Applications with <1000 daily active users
  • Tight coupling between business domains
  • Limited DevOps/infrastructure expertise
  • Budget constraints (10x cost increase typical)

Prerequisites for Success

  • Automated testing pipeline that catches real issues
  • 24/7 monitoring and alerting
  • Container registry and image management
  • Service discovery and configuration management
  • Distributed tracing and log aggregation
  • Database per service strategy (12 different backup failure modes)

Breaking Points to Monitor

  • Service-to-service call chains >5 hops
  • Memory usage approaching container limits
  • Auto-scaling triggering during normal operations
  • Failed deployments requiring manual intervention
  • Configuration drift between environments

Troubleshooting Guide

ImagePullBackOff (60% of early failures)

  1. Verify image name spelling in deployment YAML
  2. Check registry authentication: kubectl create secret docker-registry (full example below)
  3. Confirm image exists: docker pull your-image:tag
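
For private registries, the pull secret has to exist in the same namespace as the Deployment and be referenced by the pod spec. A sketch with placeholder credentials (registry URL, username, and secret name are assumptions):

# Create the registry credential in the namespace where the Deployment runs
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=<username> \
  --docker-password=<password>

# Reference it from the pod template (spec.template.spec in the Deployment):
#   imagePullSecrets:
#   - name: regcred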

Service Communication Failures (90% DNS-related)

# Test DNS resolution from inside a pod
kubectl exec -it <pod-name> -- nslookup service-name
# Confirm the Service exists and that it has endpoints;
# an empty endpoints list means no ready pods match the Service selector
kubectl get services
kubectl get endpoints service-name

CrashLoopBackOff Debugging

# Logs from the previous (crashed) container instance
kubectl logs <pod-name> --previous
# Events, last state, and exit code of the failing container
kubectl describe pod <pod-name>
# Exit 137 = OOMKilled (increase memory limits)
# Exit 1 = Application error (check logs)

Resource Consumption Issues

  • Prometheus retention: Set to 1 day maximum (see the example after this list)
  • Pod resource limits: Set 2x expected usage
  • Persistent volume cleanup: Manual deletion required
  • LoadBalancer cost optimization: Use an Ingress controller instead of one LoadBalancer per service
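
How the retention cap gets set depends on how Prometheus is deployed; a sketch for the two common cases (the Helm values path assumes the kube-prometheus-stack chart):

# Plain Prometheus: cap time-series retention via a startup flag
prometheus --storage.tsdb.retention.time=1d

# kube-prometheus-stack Helm chart: set retention in values.yaml
#   prometheus:
#     prometheusSpec:
#       retention: 1d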

Success Patterns

Health Check Implementation

app.get('/health', (req, res) => {
  // Include actual dependency checks, not just HTTP 200
  res.json({
    status: 'healthy',
    service: 'user-service',
    timestamp: new Date().toISOString(),
    uptime: process.uptime(),
    dependencies: {
      database: 'connected',  // Real database check
      redis: 'connected'      // Real Redis check
    }
  });
});
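
To make the dependency fields report reality instead of hardcoded strings, the same handler can ping its backends and fail when they are down. A sketch assuming a node-postgres pool and an ioredis client are already created elsewhere (pool and redis are assumptions):

app.get('/health', async (req, res) => {
  const dependencies = {};

  // Real database check: trivial query against the assumed pg pool
  try {
    await pool.query('SELECT 1');
    dependencies.database = 'connected';
  } catch (err) {
    dependencies.database = 'unreachable';
  }

  // Real Redis check: PING against the assumed ioredis client
  try {
    await redis.ping();
    dependencies.redis = 'connected';
  } catch (err) {
    dependencies.redis = 'unreachable';
  }

  const healthy = Object.values(dependencies).every(v => v === 'connected');
  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'healthy' : 'degraded',
    service: 'user-service',
    timestamp: new Date().toISOString(),
    uptime: process.uptime(),
    dependencies
  });
});

Returning 503 when a dependency is down lets the readinessProbe pull the pod out of rotation instead of lying with a 200.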

Structured Logging Pattern

console.log(JSON.stringify({
  timestamp: new Date().toISOString(),
  service: 'user-service',
  level: 'info',
  message: 'User registered',
  userId: user.id,
  correlationId: req.headers['x-correlation-id'],
  duration: Date.now() - startTime
}));
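
The correlationId above has to be assigned somewhere upstream. A minimal Express middleware sketch that reuses an incoming x-correlation-id header or generates one (purely illustrative):

const crypto = require('crypto');

// Attach a correlation ID to every request so log lines and downstream calls can carry it
app.use((req, res, next) => {
  const correlationId = req.headers['x-correlation-id'] || crypto.randomUUID();
  req.headers['x-correlation-id'] = correlationId;  // visible to later handlers and outgoing requests
  res.setHeader('x-correlation-id', correlationId); // echoed back so callers can correlate too
  next();
});

Register it before the route handlers so the logging pattern above always has an ID to include.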

Resource Limit Configuration

  • Requests: Conservative estimates for scheduling
  • Limits: 2-3x requests to prevent pods from being OOMKilled
  • CPU: Test with realistic load, not synthetic benchmarks
  • Memory: Monitor actual usage patterns over time

This guide represents 18 months of production experience with the associated costs, failures, and lessons learned in a realistic enterprise environment.

Useful Links for Further Investigation

Resources That Might Actually Help (Or Waste More of Your Time)

  • kubectl Cheat Sheet: Actually useful - bookmark this shit, you'll need it at 3AM
  • Docker Official Docs: Surprisingly good until you need to fix networking, then you're on your own
  • Microsoft's Microservices Guide: Microsoft actually knows their shit here, surprisingly solid advice
  • DataDog: Costs more than your entire infrastructure budget but actually fucking works
  • Stack Overflow K8s: Has copy-paste solutions that work 60% of the time
