Three months ago, our Kubernetes cluster started randomly killing pods with exit code 137 (OOMKilled). Staging was fine, but prod was falling apart completely. Turns out, nobody set memory limits properly, and some Node.js app was consuming memory like crazy - I think it hit around 8GB or maybe more before Kubernetes murdered it. Someone was storing entire HTTP request bodies in memory like a fucking amateur.
The Real Problems Nobody Talks About
Docker Desktop on your MacBook is not production. I don't care how many times you've run `docker-compose up` - production will find new and exciting ways to break your shit.
Problem #1: Your Images Are Obese
That Dockerfile you copied from Stack Overflow? It pulls Ubuntu, installs like 400MB of build tools, and leaves all of it sitting in the final image. My team was deploying massive Node.js images - over 2GB - until we figured out multi-stage builds. Now they're around 180MB, and startup time on AWS ECS dropped from almost a minute to under ten seconds.
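For reference, here's roughly the shape of the fix - a minimal multi-stage Dockerfile sketch, not our exact file. It assumes a typical Node.js app with a lockfile and a build script that outputs to `dist/`; the paths and entry point are placeholders.

```dockerfile
# Build stage: full toolchain, dev dependencies, compile step
FROM node:20 AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Runtime stage: only production deps and the built output
FROM node:20-slim
WORKDIR /app
ENV NODE_ENV=production
COPY --from=build /app/package*.json ./
RUN npm ci --omit=dev
COPY --from=build /app/dist ./dist
USER node
CMD ["node", "dist/server.js"]
```

The 400MB of build tools never makes it into the final image, which is where most of the size (and startup time) drop comes from.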
Problem #2: Memory Limits Will Kill You
Set your memory limits too low? Kubernetes kills your pods. Set them too high? Your AWS bill explodes. We learned this the hard way when our bill went from roughly twelve hundred bucks to over eight grand in one month because nobody set resource constraints. The logs just said "Pod killed" - no helpful details, as fucking usual.
Worth noting: Kubernetes resource monitoring has gotten somewhat better recently, but you still need to dig through verbose `kubectl describe node` output to figure out why pods are getting throttled. The logs just say "resource pressure", which isn't exactly helpful for debugging.
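If you're stuck doing the same archaeology, the fastest way we found to spot OOM kills is each container's last terminated state - standard kubectl, nothing exotic; the jsonpath below is just one way to slice it.

```sh
# List pods whose containers were last terminated with reason OOMKilled
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' | grep OOMKilled

# Then dig into a specific pod's events and last state
kubectl describe pod <pod-name> -n <namespace>
```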
Problem #3: Networking Is Black Magic
Container networking adds overhead you don't expect. We saw latency spikes around 200ms between services that should have been under 10ms. Turns out, our service mesh (Istio) was doing health checks every 500ms and failing like half of them. The solution? Tune the health check timeouts and use host networking for latency-critical services.
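The host-networking part of that fix looks like this in the pod spec - a minimal sketch with placeholder names, and only worth it if you've accepted the trade-offs (host port conflicts, weaker isolation):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: latency-critical-svc        # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: latency-critical-svc
  template:
    metadata:
      labels:
        app: latency-critical-svc
    spec:
      hostNetwork: true                     # skip the pod network overlay entirely
      dnsPolicy: ClusterFirstWithHostNet    # keep cluster DNS working under hostNetwork
      containers:
        - name: app
          image: registry.example.com/app:latest   # placeholder image
          ports:
            - containerPort: 8080
```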
The Shit That Actually Matters
Forget the "comprehensive optimization strategies" - here's what fixed our production nightmare:
Alpine Linux Breaks Everything
Alpine uses musl instead of glibc, so a bunch of your dependencies will break in mysterious ways. We spent probably two days debugging why our Python app couldn't connect to PostgreSQL before finding a GitHub issue describing the same musl-related failure. Alpine's security vulnerabilities are an ongoing problem too. Stick with Debian slim unless you enjoy debugging DNS resolution failures at 3 AM.
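If you just want the boring fix, it's usually a one-line base image swap - Python tags shown here because that's what bit us:

```dockerfile
# Before: musl-based Alpine, where prebuilt glibc wheels (psycopg2, cryptography, ...)
# either fall back to slow source builds or misbehave in subtle ways
# FROM python:3.12-alpine

# After: Debian slim - a bit bigger, but the standard wheels and DNS behavior just work
FROM python:3.12-slim
```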
BuildKit Cache Is Trash
Docker's BuildKit cache randomly gets corrupted and forces full rebuilds. We added `docker builder prune -f` to our CI pipeline because it happens like clockwork, roughly once a week. Your 5-minute builds turn into 20-minute builds, and Docker's error messages are about as helpful as a screen door on a submarine. The BuildKit GitHub issues are full of people hitting the exact same problem.
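The workaround is dumb but it works - a periodic prune in CI. The retention window below is an example, not a magic number:

```sh
# Weekly CI job: drop BuildKit cache older than 7 days
docker builder prune --force --filter "until=168h"

# If the cache is already corrupted and builds are failing, nuke all of it
docker builder prune --all --force
```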
Security reminder: Keep Docker updated. Container escapes happen, and you don't want to be the person explaining to the security team how an attacker broke out of a container because you're running last year's Docker version.
Java Containers Are Special Snowflakes
Java in containers is a pain in the ass. Older JVMs don't understand container memory limits by default (container awareness only became the default around JDK 10 and 8u191). Make sure the `-XX:+UseContainerSupport` flag is on and set `-Xmx` to roughly 75% of your container limit, or the JVM will try to use all available system memory and get killed by the OOM killer.
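Concretely, for a container with a 1Gi memory limit the entrypoint ends up looking something like this - a sketch, with the jar name as a placeholder:

```dockerfile
# 1Gi container limit: -Xmx768m is ~75%, leaving headroom for metaspace,
# thread stacks, and other native allocations.
# On newer JDKs you can use -XX:MaxRAMPercentage=75.0 instead of a hard -Xmx.
ENTRYPOINT ["java", "-XX:+UseContainerSupport", "-Xmx768m", "-jar", "app.jar"]
```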
What Actually Works
After wasting months on "best practices" that don't work, here's what saved our production:
Use distroless images - Google's distroless images have no shell, no package manager, nothing. Your attack surface is tiny, and startup is fast. Check out the distroless security benefits if you care about compliance. There's a sketch of a distroless build after this list.
Set proper resource limits - If you don't know what your app needs, start with 200m CPU and 512Mi memory. Monitor for a week, then adjust. Better to start conservative than kill your budget. The Kubernetes resource management docs have the details, and the Deployment sketch after this list shows where the numbers go.
Enable proper health checks - Kubernetes needs to know when your app is ready. Use `/health` endpoints that actually check your database connections, not just return HTTP 200. Proper liveness and readiness probes prevent cascading failures - they're wired up in the sketch after this list too.
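For the distroless point, only the runtime stage of your multi-stage build changes. Here's a sketch using Google's Node.js distroless image - adjust the tag to your runtime version, and note that `server.js` is a placeholder:

```dockerfile
FROM node:20 AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .

# Runtime: no shell, no package manager, just the Node.js runtime
FROM gcr.io/distroless/nodejs20-debian12
WORKDIR /app
COPY --from=build /app /app
CMD ["server.js"]
```

The trade-off: no shell means no `kubectl exec` debugging sessions, so make sure your logging is solid before you switch.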
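And here's a minimal Deployment sketch that wires up the last two items - names, port, and probe paths are placeholders, and the 200m/512Mi values are just the conservative starting point from above:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api                 # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      containers:
        - name: app
          image: registry.example.com/example-api:1.0.0   # placeholder image
          ports:
            - containerPort: 8080
          resources:
            requests:               # starting point - monitor for a week, then adjust
              cpu: 200m
              memory: 512Mi
            limits:
              cpu: 200m
              memory: 512Mi
          readinessProbe:           # gates traffic; this endpoint should check the DB connection
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:            # restarts the container; some teams point this at a
            httpGet:                # shallower check to avoid restart loops when a dependency blips
              path: /health
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
```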
The bottom line? Your development environment lies to you. Production will find every edge case, race condition, and performance bottleneck you didn't know existed. Container resource monitoring is essential, proper health checks prevent cascading failures, and troubleshooting guides become your best friends at 3 AM. Plan for failure, set proper limits, and always have a rollback strategy ready.
By now you get why production hates your containers. Here's what's probably broken right now: