Docker Container Performance Optimization - AI Technical Reference
Critical Failure Scenarios
OOMKilled Containers (Exit Code 137)
- Cause: Container exceeds memory limits, kernel kills process
- Detection:
kubectl get events --all-namespaces | grep "OOMKilling"
- Impact: Production service downtime, cascading failures
- Root Cause: Docker Desktop doesn't enforce memory limits properly - production Kubernetes does
- Solution: Test locally with
docker run --memory=512m your-app
Image Size and Startup Performance
- Problem Scale: Images 2GB+ common, startup times 30-60 seconds
- Real Impact: Network transfer delays, security scan overhead, deployment bottlenecks
- Cost Impact: AWS bills can increase from $1,200 to $8,000/month due to oversized instances
- Solution Impact: Multi-stage builds reduce images from 1.8GB to ~200MB, startup from 60s to 8-10s
Container Networking Latency
- Overhead: Service mesh adds 100-200ms per request
- Common Failure: Istio health checks configured at 500ms intervals with 200ms timeouts fail against database queries that take 300ms
- Fix Impact: Relaxing health checks from a 100ms interval / 200ms timeout to 5s intervals reduced P95 latency by 80%
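A sketch of what relaxed probe timings look like in a pod spec - the image, path, port, and thresholds are illustrative, tune them to your actual dependency latency:
apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  containers:
    - name: app
      image: your-app:latest        # placeholder image
      readinessProbe:
        httpGet:
          path: /healthz            # hypothetical health endpoint
          port: 3000
        periodSeconds: 5            # relaxed from sub-second checks
        timeoutSeconds: 2           # leaves room for a 300ms database call
        failureThreshold: 3         # three misses before marking unready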
Resource Requirements and Limits
Memory Allocation Strategy
Application Type | Normal Usage | Spike Usage | Recommended Limit | Reasoning |
---|---|---|---|---|
Node.js API | 150MB | 300MB | 400MB | 2x normal for traffic spikes |
Java Spring Boot | 800MB | 1.2GB | 1.5GB | JVM heap + GC overhead |
Python Flask | 80MB | 120MB | 200MB | Conservative for memory leaks |
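Translated into a pod spec, the Node.js row from the table looks roughly like this (the CPU figures are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: node-api
spec:
  containers:
    - name: app
      image: your-app:latest       # placeholder image
      resources:
        requests:
          memory: "150Mi"          # normal usage from the table
          cpu: "250m"              # illustrative: average usage under load
        limits:
          memory: "400Mi"          # 2x normal to absorb the 300MB spike
          cpu: "500m"              # 2x the request for burst capacity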
CPU Resource Guidelines
- Requests: Set to average usage under load
- Limits: 2-3x requests for burst capacity
- Critical Warning: CPU throttling is silent - pods slow down without obvious errors
- Java Special Case: Use
-XX:+UseContainerSupport -XX:MaxRAMPercentage=75
to respect container limits
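A sketch of wiring those flags in via the JAVA_TOOL_OPTIONS environment variable, which the JVM picks up automatically - image name and values are illustrative:
apiVersion: v1
kind: Pod
metadata:
  name: spring-api
spec:
  containers:
    - name: app
      image: your-spring-app:latest  # placeholder image
      env:
        - name: JAVA_TOOL_OPTIONS    # read by the JVM at startup
          value: "-XX:+UseContainerSupport -XX:MaxRAMPercentage=75"
      resources:
        limits:
          memory: "1536Mi"           # the 1.5GB Spring Boot limit from the table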
Image Optimization Strategies
Multi-Stage Build Template
# Build stage - installs production dependencies only
FROM node:18-alpine AS build
WORKDIR /app
COPY package*.json ./
# --omit=dev replaces the deprecated --only=production; a separate npm prune is then unnecessary
RUN npm ci --omit=dev --no-audit --no-fund
# Runtime stage - minimal production image
FROM node:18-alpine AS runtime
RUN addgroup -g 1001 -S nodejs && adduser -S nodejs -u 1001 -G nodejs
WORKDIR /app
COPY --from=build --chown=nodejs:nodejs /app/node_modules ./node_modules
COPY --chown=nodejs:nodejs . .
USER nodejs
EXPOSE 3000
CMD ["node", "server.js"]
Results: 180MB final images, 8-10s startup times on AWS ECS
Base Image Selection
Image Type | Size | Compatibility | Use Case | Failure Rate |
---|---|---|---|---|
Alpine Linux | 5MB | DNS/glibc issues common | High maintenance tolerance | High |
Debian Slim | 50MB | Full compatibility | Production recommended | Low |
Distroless | Variable | No debug capability | Security-critical apps | Medium |
Ubuntu | 200MB+ | Full compatibility | Legacy apps only | Low |
Alpine Linux Warning: Uses musl libc instead of glibc - causes DNS race conditions, PostgreSQL connection failures, Python package breakage
Network Performance Optimization
Service Mesh Impact
- Latency Overhead: 100-200ms per request minimum
- Health Check Failures: Default timeouts too aggressive for real applications
- Alternative: Use Kubernetes native services for internal communication
- When to Use: Only if traffic splitting or mutual TLS required
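For internal traffic, a plain ClusterIP Service is usually all you need - a minimal sketch with illustrative names and ports:
apiVersion: v1
kind: Service
metadata:
  name: api-internal
spec:
  selector:
    app: api              # matches the pod labels (illustrative)
  ports:
    - port: 80            # cluster-internal port
      targetPort: 3000    # container port
# type defaults to ClusterIP: plain kube-proxy routing, no sidecar hop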
DNS Performance Issues
- Container DNS Overhead: 50-100ms per lookup
- Solution: Cache DNS for 30+ seconds
- Node.js Fix:
dns.setDefaultResultOrder('ipv4first')
- Impact: Response times dropped from 500ms to 50ms with DNS caching
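On Kubernetes, a chunk of that lookup overhead often comes from search-domain expansion; a hedged sketch of tuning ndots in the pod spec - verify against your cluster's DNS setup before rolling out:
apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "1"          # stops search-domain expansion on external lookups
  containers:
    - name: app
      image: your-app:latest   # placeholder image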
Storage and Logging Anti-Patterns
Critical Storage Rules
- Never: Log to files inside containers - fills filesystem, causes write failures
- Always: Log to stdout/stderr only
- Database Storage: Use persistent volumes, not bind mounts (15-20% faster)
- Temporary Files: Use emptyDir volumes
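A minimal emptyDir sketch - names are illustrative, and the sizeLimit keeps scratch data from filling the node disk:
apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  containers:
    - name: app
      image: your-app:latest     # placeholder image
      volumeMounts:
        - name: scratch
          mountPath: /tmp        # temp files land here, not in the container layer
  volumes:
    - name: scratch
      emptyDir:
        sizeLimit: 1Gi           # illustrative cap on scratch space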
Real Disaster Example
- PostgreSQL container filled 20GB volume with logs in 3 days
- Database went read-only
- Application crashed
- Required Saturday restoration from backup
- Fix: Ship logs to CloudWatch, rotate daily
Monitoring and Alerting Configuration
Essential Metrics
- Pod Restart Rate:
rate(kube_pod_container_status_restarts_total[10m]) > 0
- Memory Usage:
container_memory_usage_bytes / container_spec_memory_limit_bytes * 100 > 85
- Request Latency P95:
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
Alert Thresholds
- Pod Restarts: >3 restarts in 10 minutes (indicates OOMKills, crashes)
- Response Time: P95 >1 second (performance degradation)
- Error Rate: >1% (application issues)
- Memory Usage: >85% (approaching OOMKill)
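Wired into a Prometheus rules file, those thresholds look roughly like this - group and alert names are illustrative:
groups:
  - name: container-alerts
    rules:
      - alert: PodRestartingFrequently
        expr: increase(kube_pod_container_status_restarts_total[10m]) > 3
        labels:
          severity: critical      # restarts usually mean OOMKills or crashes
      - alert: ContainerMemoryNearLimit
        expr: container_memory_usage_bytes / container_spec_memory_limit_bytes * 100 > 85
        for: 5m                   # sustained usage, not a momentary spike
      - alert: HighRequestLatencyP95
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m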
Platform-Specific Considerations
AWS Graviton (ARM64)
- Cost Savings: 40% cheaper than x86
- Compatibility Risk: Many Docker images lack ARM64 support
- Requirement: Multi-arch image builds
- Performance: Significant improvements for most workloads
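One way to produce multi-arch images, assuming GitHub Actions and the Docker buildx actions - registry and tag are placeholders:
name: multi-arch-build
on: push
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-qemu-action@v3       # arm64 emulation on x86 runners
      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v5
        with:
          platforms: linux/amd64,linux/arm64    # one tag, both architectures
          push: true
          tags: registry.example.com/your-app:latest  # placeholder registry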
BuildKit Cache Issues
- Problem: Cache corruption occurs weekly in CI environments
- Symptom: 5-minute builds become 20-minute full rebuilds
- Workaround: Run
docker builder prune -f
in a daily CI pipeline (see the sketch below)
- Alternative: External cache storage (S3), though Docker caching remains unreliable
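A hedged sketch of the daily prune as a scheduled GitHub Actions workflow - only worthwhile on self-hosted runners, where the Docker cache actually persists between runs:
name: buildkit-cache-prune
on:
  schedule:
    - cron: "0 5 * * *"      # daily, before the first builds of the day
jobs:
  prune:
    runs-on: self-hosted      # assumes a persistent runner holding the cache
    steps:
      - run: docker builder prune -f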
Autoscaling Configuration
HPA Settings That Prevent Flapping
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef: {apiVersion: apps/v1, kind: Deployment, name: app}
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80   # autoscaling/v2 syntax for the 80% CPU target
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Prevents constant scaling
Rationale: 80% CPU threshold scales before user impact, 5-minute scale-down prevents oscillation
Security and Compliance
Container Security Requirements
- User Permissions: Never run as root - use the
USER nodejs
directive
- Base Images: Keep Docker updated for container escape patches
- Image Scanning: Automated scanning catches known vulnerabilities
- Network Policies: Implement without service mesh overhead
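A minimal NetworkPolicy sketch that restricts ingress without a mesh - labels and port are illustrative:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-ingress-only
spec:
  podSelector:
    matchLabels:
      app: api              # applies to the API pods (illustrative labels)
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend  # only the frontend may connect
      ports:
        - protocol: TCP
          port: 3000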
Cost Optimization Strategies
Immediate Cost Reduction
- Multi-stage builds: 5x smaller images = lower storage/transfer costs
- Resource right-sizing: Monitor for 1 week, set limits to 2x observed usage
- Spot instances: 70% cost reduction (with availability risk)
- ARM64 instances: 40% cost reduction (with compatibility requirements)
Cost Monitoring
- Kubernetes resource allocation:
kubectl describe node | grep -A 5 "Allocated resources"
- Pod resource usage:
kubectl top pods --all-namespaces --sort-by=memory
- Budget alerts: Set up before bills reach CFO attention
Implementation Priority
Phase 1 - Immediate Fixes (2-4 hours)
- Identify OOMKilled pods:
kubectl get events | grep OOMKilling
- Monitor resource usage:
kubectl top pods
- Fix largest images with multi-stage builds
Phase 2 - Resource Optimization (1 week)
- Set proper resource limits based on monitoring
- Implement health checks with realistic timeouts
- Configure autoscaling with anti-flapping measures
Phase 3 - Advanced Optimization (2-3 weeks)
- Implement comprehensive monitoring with Prometheus/Grafana
- Optimize networking (remove unnecessary service mesh)
- Platform-specific optimizations (ARM64, spot instances)
Troubleshooting Commands
Diagnostic Commands
# Memory usage by pod
kubectl top pods --all-namespaces --sort-by=memory
# Find restarting pods (OOMKills)
kubectl get pods --all-namespaces --field-selector=status.phase=Running | grep -v " 0 "
# Check node resource allocation
kubectl describe node | grep -A 5 "Allocated resources"
# Recent OOMKill events
kubectl get events --all-namespaces | grep "OOMKilling"
# CPU throttling detection
kubectl top pods --containers --all-namespaces
Performance Analysis
- Image layer analysis: Use the
dive
tool to identify bloated layers
- Resource monitoring: Deploy Prometheus + Grafana (200-500MB RAM overhead per node)
- Network latency: Check service mesh configuration and health check timings
Critical Warnings
What Official Documentation Doesn't Mention
- Docker Desktop: Doesn't enforce memory limits - production will
- Alpine Linux: DNS race conditions affect production reliability
- BuildKit: Cache corruption requires weekly manual intervention
- Java Containers: JVM ignores container limits without specific flags
- Service Mesh: Adds significant latency overhead for marginal security benefits
Decision Criteria
- Use Alpine: Only if debugging DNS failures at 3 AM sounds manageable
- Use Service Mesh: Only if traffic splitting or mutual TLS absolutely required
- Use Spot Instances: If application can handle random termination
- Use ARM64: If all Docker images support multi-arch builds
Success Metrics
- Image Size: Target <200MB for typical applications
- Startup Time: Target <15 seconds in production
- Memory Efficiency: 2x headroom over observed usage
- Cost Reduction: 40-60% savings possible with proper optimization
- Reliability: <3 pod restarts per 10 minutes indicates stable configuration
Useful Links for Further Investigation
Essential Container Performance Resources
Link | Description |
---|---|
Docker Production Best Practices | Docker's official guide - actually useful unlike most vendor docs. Covers the multi-stage build shit that'll save your storage budget and the layer caching tricks that'll stop your builds from taking forever. |
Kubernetes Resource Management | Required reading unless you enjoy OOMKilled pods and mystery performance issues. Actually explains CPU limits vs requests and why your app gets throttled. |
Docker Build Optimization | Explains BuildKit caching and why your builds randomly take 20 minutes. Spoiler: the cache corrupts and there's no good fix except `docker builder prune -f`. |
cAdvisor - Container Resource Monitoring | Google's tool that actually tells you what your containers are doing. Better than guessing why your CPU usage looks like a seizure. |
OpenTelemetry Documentation | The observability framework that's supposed to fix everything. In reality adds 20-50ms latency to every request and is perpetually in beta. |
Prometheus Container Monitoring | How to wire up Prometheus to cAdvisor. Industry standard monitoring that actually works and won't randomly break at 3am. |
AWS Container Optimization Best Practices | AWS marketing blog that occasionally has useful posts. Good for Graviton ARM64 optimization and figuring out why your Fargate bill is insane. |
Azure Container Apps Performance Guide | Microsoft's attempt at serverless containers. Cheaper than ACI but slower than molasses and randomly fails deployments. |
Google Kubernetes Engine Optimization | GKE Autopilot docs - Google manages the cluster for you which is nice until you need to do something custom and hit a wall. |
Kubernetes Autoscaling Best Practices | HPA and VPA documentation that assumes your metrics work and your app scales linearly. Spoiler: neither is true. |
Container Security and Performance | K8s security docs explaining why every security policy breaks something in production and how to find the least-broken compromise. |
Service Mesh Performance Considerations | Istio's guide to why your service mesh adds 150ms latency to every request and how to make it slightly less terrible. |
CNCF Annual Survey - Container Adoption | CNCF's yearly reality check on which container technologies people actually use vs what they claim in meetings. |
Container Performance Benchmarking | SPEC benchmarks that work great in lab conditions and completely fail to predict real-world performance. |
Dive - Docker Image Layer Analysis | Actually useful tool that shows you exactly where your Docker image got fat. Will make you angry at how much space npm install wastes. |
Distroless Base Images | Google's stripped-down images with no shell or package manager. Great for security, terrible when you need to debug inside the container. |
Multi-Architecture Build Tools | How to build images that work on both x86 and ARM64. Required for Graviton instances unless you enjoy paying 40% more for compute. |
Cloud Cost Optimization Automation | CNCF post about automating cost optimization before your Kubernetes bill makes the CFO cry. Mostly common sense presented as revelations. |
FinOps Foundation - Container Cost Management | FinOps resources for when your cloud bill becomes a spreadsheet nightmare. Mostly consultants selling "best practices" for problems you already know about. |
Docker Community Forum | Where you go when Docker breaks in a way Stack Overflow has never seen. Hit or miss but sometimes has answers from actual Docker maintainers. |
Kubernetes Slack Community | K8s Slack where you can ask why your pods are crashing and get 12 different opinions from people who've never seen your setup. |
Container Performance Newsletter | Container Journal - industry publication that's 80% vendor marketing and 20% actually useful performance articles. |
Kubernetes Release Blog | K8s release announcements and feature updates. Good for staying current with what breaks between versions and what new footguns they've added. |
Docker Security Announcements | Docker security docs and CVE notifications. Check regularly unless you enjoy explaining container escapes to your security team. |
Related Tools & Recommendations
Deploy Django with Docker Compose - Complete Production Guide
End the deployment nightmare: From broken containers to bulletproof production deployments that actually work
Set Up Microservices Monitoring That Actually Works
Stop flying blind - get real visibility into what's breaking your distributed services
Stop Breaking FastAPI in Production - Kubernetes Reality Check
What happens when your single Docker container can't handle real traffic and you need actual uptime
Temporal + Kubernetes + Redis: The Only Microservices Stack That Doesn't Hate You
Stop debugging distributed transactions at 3am like some kind of digital masochist
Prometheus + Grafana: Performance Monitoring That Actually Works
integrates with Prometheus
Prometheus + Grafana + Jaeger: Stop Debugging Microservices Like It's 2015
When your API shits the bed right before the big demo, this stack tells you exactly why
Your Kubernetes Cluster is Probably Fucked
Zero Trust implementation for when you get tired of being owned
Jenkins + Docker + Kubernetes: How to Deploy Without Breaking Production (Usually)
The Real Guide to CI/CD That Actually Works
Making Pulumi, Kubernetes, Helm, and GitOps Actually Work Together
Stop fighting with YAML hell and infrastructure drift - here's how to manage everything through Git without losing your sanity
Docker Daemon Won't Start on Windows 11? Here's the Fix
Docker Desktop keeps hanging, crashing, or showing "daemon not running" errors
How Not to Get Owned When Deploying Docker to Production
One bad config and hackers walk away with your entire server
GitHub Actions is Fucking Slow: Alternatives That Actually Work
integrates with GitHub Actions
GitHub Actions Security Hardening - Prevent Supply Chain Attacks
integrates with GitHub Actions
GitHub Actions Cost Optimization - When Your CI Bill Is Higher Than Your Rent
integrates with GitHub Actions
Docker Desktop vs Podman Desktop vs Rancher Desktop vs OrbStack: What Actually Happens
alternative to Docker Desktop
containerd - The Container Runtime That Actually Just Works
The boring container runtime that Kubernetes uses instead of Docker (and you probably don't need to care about it)
Podman Desktop - Free Docker Desktop Alternative
alternative to Docker Desktop
EFK Stack Integration - Stop Your Logs From Disappearing Into the Void
Elasticsearch + Fluentd + Kibana: Because searching through 50 different log files at 3am while the site is down fucking sucks