
FastAPI Kubernetes Production Deployment Guide

Critical Context

When Kubernetes is Actually Needed

  • Single server died: Complete loss of the application and its database connections, with a 6-hour recovery from scratch
  • Traffic spikes kill the application: A single FastAPI process cannot absorb unexpected load and starts returning 500 errors
  • Deployments become a nightmare: A Tuesday afternoon deploy breaks user authentication for 3 hours
  • Configuration hell: Production secrets scattered across files, hardcoded values, database URLs copied around by hand

When NOT to Use Kubernetes

  • Kubernetes complexity is overkill for roughly 90% of projects
  • Expect to burn around 3 months of development time learning why pods are stuck in Pending
  • Misconfigured health checks can kill healthy pods during database outages

Technical Specifications

Performance Reality Check

  • Synthetic benchmarks: 21k+ req/sec (TechEmpower - meaningless "Hello World" tests)
  • Real-world performance: 3-8k req/sec per pod with actual business logic, database queries, authentication
  • Optimal performance: 15-20k req/sec per pod on decent hardware with proper async database connections
  • Memory requirements: Minimum 256MB, realistic 512MB for production FastAPI applications

Resource Requirements

Time Investment

  • Minimum setup: 2-3 weeks for basic production deployment
  • Actually working: 2-3 months to understand random failures
  • Learning curve: 6 months of suffering to become proficient

Financial Investment

  • Production cluster: $400-800/month AWS (varies with negotiation skills)
  • Optimized setup: $300/month minimum with aggressive optimization

Team Requirements

  • Dedicated DevOps person required for production operations
  • Mental health cost: Significant time debugging pods stuck in Pending status

Configuration

Essential FastAPI Application Setup

from fastapi import FastAPI
import os

app = FastAPI(
    title="Your App",
    version=os.getenv("APP_VERSION", "1.0.0"),
    docs_url=None,     # Don't expose Swagger UI in production
    redoc_url=None,    # ...and don't leave ReDoc exposed either
    openapi_url=None,  # ...or the raw OpenAPI schema
)

@app.get("/health")
async def health():
    # CRITICAL: Don't make this depend on database unless you want pain
    return {"status": "ok"}

Docker Configuration That Works

FROM python:3.12-slim

WORKDIR /app

# Install dependencies first (for better caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy your app
COPY app/ ./app/

# Don't run as root (security best practice)
RUN useradd --create-home --shell /bin/bash app
USER app

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

Production Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: fastapi-deployment
  namespace: fastapi-prod
spec:
  replicas: 3
  selector:
    matchLabels:
      app: fastapi
  template:
    metadata:
      labels:
        app: fastapi
    spec:
      containers:
      - name: fastapi
        image: yourusername/your-app:v1.0.0
        ports:
        - containerPort: 8000
        resources:
          requests:
            memory: "256Mi"
            cpu: "200m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
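
The troubleshooting commands later in this guide assume a Service named fastapi-service that maps port 80 to the container's port 8000; the "port mismatch" failure mode below is exactly this mapping going wrong. A minimal sketch, assuming the app: fastapi labels used in the Deployment above:

apiVersion: v1
kind: Service
metadata:
  name: fastapi-service
  namespace: fastapi-prod
spec:
  selector:
    app: fastapi        # must match the pod labels in the Deployment
  ports:
  - port: 80            # port exposed inside the cluster
    targetPort: 8000    # port uvicorn listens on in the container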

Critical Warnings

Database Connection Disasters

  • Problem: Each pod creates independent database connections
  • Failure point: 10 pods = 100+ database connections, exceeds PostgreSQL default limits
  • Consequence: Connection refused errors around 6-12 pods
  • Solution: Use connection pooling with 10-20 connections per pod maximum

import os

from databases import Database

DATABASE_URL = os.environ["DATABASE_URL"]  # injected via the Secret shown earlier

database = Database(
    DATABASE_URL,
    min_size=5,    # Keep some connections warm
    max_size=15,   # Don't go crazy
)

Health Check Failures

  • Default settings will fail: Health checks that depend on external services will kill pods during database maintenance (see the probe split sketched below)
  • Real example: A 5-minute maintenance window became a 3-hour outage when Kubernetes killed healthy pods that couldn't ping the database
  • SSL connection issues: RDS connections need ?sslmode=require in the connection string; otherwise connection timeouts look like network issues
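
One way to avoid this failure mode is to keep the liveness probe on the dependency-free /health endpoint and put any database check behind a separate readiness probe: a failing readiness probe only removes the pod from the Service, while a failing liveness probe restarts it. A sketch of the split, assuming a hypothetical /ready endpoint that pings the database with a short timeout:

        livenessProbe:              # "is the process alive?" - never touches the database
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:             # "can this pod take traffic?" - may check the database
          httpGet:
            path: /ready            # hypothetical endpoint you would implement yourself
            port: 8000
          periodSeconds: 5
          failureThreshold: 3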

Resource Limit Tuning

  • Memory limits too low: 128Mi results in OOM kills
  • Memory limits too high: 2Gi wastes money
  • No middle ground: Trial and error required for optimization
  • CPU throttling: Pods hitting 100% CPU get throttled, response times degrade

Common Failure Modes

Pods Stuck in CrashLoopBackOff

  1. Database connection failing: Wrong DATABASE_URL, database down, networking issues
  2. Memory limits too low: OOM kills during garbage collection spikes
  3. Health check lies: /health endpoint returns 200 while database connections dead
  4. Port mismatch: Service exposes port 8000 but container listens on 3000
  5. Image pull failures: Registry credentials expired, image doesn't exist, Docker Hub rate limiting
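
For the expired-credentials variant of image pull failures, the usual fix is an imagePullSecrets reference in the pod spec. A minimal sketch, assuming a docker-registry Secret named regcred already exists in the namespace (for example, created with kubectl create secret docker-registry):

    spec:
      imagePullSecrets:
      - name: regcred          # assumed Secret name - must exist in the same namespace
      containers:
      - name: fastapi
        image: yourusername/your-app:v1.0.0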

Deployment Failures

  • Image doesn't exist: Typo in tag
  • Config map missing: Forgot to apply configuration
  • Health checks failing: New version broke /health endpoint
  • Resource limits too low: New version uses more memory

Network Infrastructure Issues

  • IP address exhaustion: AWS EKS defaults to /24 subnets (256 IPs), with 30+ reserved for system pods
  • Cryptic error: failed to allocate a node-local DNSRecord indicates a networking problem, not a resource issue
  • DNS propagation: Takes 10-60 minutes, always longer than expected
  • Security groups: AWS security groups silently block ports, and debugging them means digging through the console

Implementation Reality

Zero-Downtime Deployment Myth

  • New pod startup time: 30-60 seconds, not the 5 seconds you expect
  • Readiness check failures: Common during the first minute while FastAPI imports its modules
  • Load balancer lag: Expect 30 seconds of 503 errors while old pods terminate (the rollout sketch below helps shorten this)
  • Stuck termination: Pods can sit in Terminating status for 5+ minutes
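
You can shorten (not eliminate) that window by pinning the rollout strategy and giving pods a graceful shutdown path. A sketch of the Deployment fields involved; the grace period and preStop sleep are assumed values to tune for your traffic:

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                # start one extra pod before stopping an old one
      maxUnavailable: 0          # never drop below the desired replica count
  template:
    spec:
      terminationGracePeriodSeconds: 60   # assumed value; long enough to drain in-flight requests
      containers:
      - name: fastapi
        lifecycle:
          preStop:
            exec:
              command: ["sleep", "10"]    # give the load balancer time to stop routing here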

Database Migration Strategy

apiVersion: batch/v1
kind: Job
metadata:
  name: fastapi-migrate-v1.0.0
  namespace: fastapi-prod
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: migrate
        image: your-registry/fastapi-app:v1.0.0
        command: ["python", "-m", "alembic", "upgrade", "head"]

Critical workflow:

  1. Run migration job first
  2. Wait for completion with a timeout (see the Job fields sketched below)
  3. Debug failures before application deployment
  4. Use versioned job names for history tracking
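
Steps 2 and 3 map onto Job fields: a deadline gives you the timeout, and disabling retries keeps a failed migration visible instead of silently re-running it. A sketch of additions to the Job spec above (the values are assumptions):

spec:
  backoffLimit: 0              # do not retry a failed migration automatically - debug it first
  activeDeadlineSeconds: 600   # assumed 10-minute timeout; the Job fails if the migration hangs
  template:
    spec:
      restartPolicy: Never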

Auto-scaling Database Impact

  • HPA scaling: Conservative limits are required to prevent database overload (see the HPA sketch below)
  • Connection math: 50 pods × 5-25 connections = 250-1250 database connections
  • PostgreSQL limits: Usually 100 max_connections (200 if configured)
  • Failure point: Connection refused errors at 6-12 pods
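
A conservative HorizontalPodAutoscaler is where that limit gets encoded: cap maxReplicas at whatever your database's connection budget can absorb. A sketch, assuming the autoscaling/v2 API and an illustrative cap of 8 replicas (8 pods x 15 pool connections = 120 connections, within a 200-connection PostgreSQL limit):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fastapi-hpa
  namespace: fastapi-prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fastapi-deployment
  minReplicas: 3
  maxReplicas: 8               # illustrative cap sized against the database connection limit
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70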

Troubleshooting Commands

Essential Debugging

# Check pod status and events
kubectl get pods -n fastapi-prod
kubectl get events -n fastapi-prod --sort-by=.metadata.creationTimestamp

# Debug specific pod failures
kubectl logs <pod-name> -n fastapi-prod
kubectl describe pod <pod-name> -n fastapi-prod

# Check resource usage
kubectl top pods -n fastapi-prod
kubectl describe pod <pod-name> -n fastapi-prod | grep -A 10 "Limits\|Requests"

# Test service connectivity
kubectl port-forward svc/fastapi-service 8080:80 -n fastapi-prod

Rollback Procedure

# Immediate rollback
kubectl rollout undo deployment/fastapi-deployment -n fastapi-prod

# Debug previous deployment
kubectl logs deployment/fastapi-deployment -n fastapi-prod --previous
kubectl get events -n fastapi-prod --sort-by=.lastTimestamp | tail -20

Performance Optimization

Docker Image Optimization

  • Problem: The base python:3.12 image is 1.2GB
  • Multi-stage build: Shrinks the image, which means faster pulls and pod startup
  • Remove build artifacts: Don't ship .git folders, tests, or documentation

Database Connection Pooling

# This will murder your database
import os

import psycopg2

DATABASE_URL = os.environ["DATABASE_URL"]

def get_user(user_id):
    conn = psycopg2.connect(DATABASE_URL)
    # Every request = new connection = database death
    ...

# Proper connection pooling
from databases import Database

database = Database(DATABASE_URL, min_size=5, max_size=15)

Monitoring That Actually Works

Essential Metrics

  • Response times over 500ms: Users notice performance degradation
  • Error rates over 1%: Indicates system problems
  • Pod restarts: Memory leaks, crashes, or Kubernetes issues
  • Database connection pool exhaustion: Scaling problems

Basic Request Timing Middleware

import time
from fastapi import Request

@app.middleware("http")
async def log_requests(request: Request, call_next):
    start_time = time.time()
    response = await call_next(request)
    process_time = time.time() - start_time

    # Log slow requests for performance debugging
    if process_time > 1.0:
        print(f"SLOW REQUEST: {request.url} took {process_time:.2f}s")

    return response

Alternative Solutions

When to Choose Alternatives

Solution            | Time to Deploy | Monthly Cost | Team Size / Expertise | Use Case
Railway/Render      | 5 minutes      | $20-100      | Solo developer        | Simple applications
Docker Swarm        | 2-3 days       | $150-300     | 1-2 people            | Docker knowledge, simpler than K8s
Traditional VPS     | 1-2 hours      | $50-150      | SSH knowledge         | Predictable traffic
Serverless (Lambda) | 1 hour         | $50-500      | Serverless expertise  | Variable workloads
Kubernetes          | 2-3 weeks      | $400-800     | Dedicated DevOps      | High traffic, enterprise needs

Decision Criteria

  • Traffic predictability: VPS for stable loads, Kubernetes for spikes
  • Team expertise: Start simple, add complexity when needed
  • Downtime tolerance: Kubernetes for mission-critical applications
  • Budget constraints: Managed platforms for cost-effective starting point

Breaking Points and Failure Modes

Resource Exhaustion

  • Memory: OOM kills at garbage collection spikes during high traffic
  • CPU: Throttling causes response time degradation
  • Network: IP exhaustion in EKS subnets causes pod scheduling failures

Scaling Limits

  • Database connections: Hard limit around 6-12 pods without connection pooling
  • Load balancer: Requires proper ingress controller configuration (see the Ingress sketch below)
  • Storage: StatefulSets required for persistent data
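
The load balancer piece usually means an Ingress in front of the Service. A minimal sketch, assuming the NGINX Ingress Controller recommended in the links below and a hypothetical hostname:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: fastapi-ingress
  namespace: fastapi-prod
spec:
  ingressClassName: nginx      # assumes the NGINX Ingress Controller is installed
  rules:
  - host: api.example.com      # hypothetical hostname
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: fastapi-service
            port:
              number: 80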

Operational Complexity

  • YAML hell: 12+ configuration files for simple applications
  • Debugging difficulty: Distributed system troubleshooting requires specialized knowledge
  • Version conflicts: Image tags, configuration versions, database migrations coordination

Success Criteria

An AI implementing this guide should understand:

  • WHAT: FastAPI deployment to Kubernetes with production reliability
  • HOW: Step-by-step configuration avoiding common pitfalls
  • WHAT WILL GO WRONG: Database connections, resource limits, networking issues
  • WHETHER IT'S WORTH IT: Cost-benefit analysis for team size and traffic requirements

Useful Links for Further Investigation

Tools That Actually Work (And Don't Waste Your Time)

  • FastAPI Docs: Actually good documentation. Read the deployment section, not just the tutorial.
  • Kubernetes Docs: Dry as hell but accurate. Focus on the workloads and services sections.
  • kubectl Cheat Sheet: Bookmark this. You'll reference it daily for the first 6 months.
  • k9s: Terminal dashboard for Kubernetes. Way better than running kubectl commands constantly.
  • Helm: Package manager for Kubernetes. Be aware of debugging template errors with nested YAML. Use Helm v3.12+ to avoid security issues.
  • NGINX Ingress Controller: Rock solid. Works. Boring is good for ingress controllers. I've tried others and always come back to this one.
  • cert-manager: Automatic SSL certificates. Way better than managing Let's Encrypt manually.
  • Sentry: Error tracking that actually helps. Integrates with FastAPI easily.
  • UptimeRobot: Dead simple uptime monitoring. Sends you a text when shit breaks.
  • Kubernetes the Hard Way: Learn how Kubernetes actually works instead of just copy-pasting YAML.
  • Kubernetes Community Discussions: Official community forum with real problems and solutions from Kubernetes users.
  • Awesome FastAPI: Collection of FastAPI resources. Skip the enterprise stuff, focus on deployment examples.
  • Railway: A platform for deploying FastAPI applications without the complexities of Kubernetes.
  • Render: A platform similar to Railway, offering deployment solutions suitable for small teams.
  • Digital Ocean App Platform: A managed platform built on Kubernetes, simplifying application deployment and management.
