What Legacy-to-Container Migration Actually Looks Like

Let me be blunt: if you think you're going to move your 10-year-old Java monolith to Kubernetes without any downtime, you're delusional. I've been through this 50 times now, and "zero downtime" is what CTOs promise to the board while engineering deals with reality.

The Real Cost of Fucking Up

Our first major migration attempt took down checkout for 6 hours on Black Friday 2023. That cost us $2.3 million in lost sales. The vendor promised "seamless migration" - turns out their demo environment had 3 users, not 50,000 concurrent shoppers hitting the database.

Here's what actually breaks: your legacy app probably has 20 hardcoded configuration files, connects to 5 databases you forgot about, writes to /tmp, and depends on some cron job that runs every 3 weeks. None of this shows up in your "application inventory."

Things That Will Go Wrong (Not If, When)

Your load balancer will work perfectly in staging and shit the bed in production. We discovered our F5 had a 30-second timeout that only triggered under real load. Six months of testing, missed it completely.

Database connections are the worst. Your connection pooling that worked for years suddenly becomes a bottleneck when containers start spinning up and down. Plan on rewriting half your database interaction code.

That "stateless" application? It's not. It's writing session data to local files, caching user preferences in memory, and probably storing uploaded files on the local disk. I guarantee it.

The Migration Reality Check

Week 1: "This looks straightforward, should take 2-3 weeks"
Month 3: Still debugging why the containerized app uses 4x more memory
Month 6: Finally figured out the Java garbage collector settings that work in containers
Month 8: Production deployment, everything breaks, emergency rollback
Month 12: Successfully running in production, but costs 40% more than predicted

What Actually Works

Start with your newest, simplest applications first. Not because they're more important, but because you need wins to justify the budget when everything else takes 3x longer.

Never migrate databases and applications simultaneously. Pick one, get it stable, then tackle the other. We tried to be clever and do both - spent 4 months debugging synchronization issues that didn't exist when we did them separately.

Learn from others' fuckups: the Kubernetes failure stories site is basically a support group for engineers who've been burned. Browse r/kubernetes for the war stories vendors don't want you to hear. The CNCF case studies are sanitized marketing fluff, but sometimes contain useful technical breadcrumbs.

Real incident reports tell you what actually breaks: Monzo's autoscaling challenges shows how Kubernetes resource management fails under load, Spotify's migration strategies reveals the platform team pain, and Shopify's container adoption covers the database connection pool disasters everyone encounters.

Blue-green deployments are great in theory. In practice, you need double the infrastructure, which means double the costs. Most companies do it once for the demo then switch to rolling updates because nobody wants to pay for idle servers.

Your monitoring will lie to you. Kubernetes says everything's healthy while your users are getting 500 errors. Build real synthetic transactions that actually test your business logic, not just HTTP 200 responses.

The truth? Most successful migrations take 6-18 months and cost 2-3x the initial estimate. But when it works, your ops team stops getting paged at 3am, deployments become boring, and you can actually scale without buying more hardware.

Resources that actually help:

Kubernetes the Hard Way - Skip the managed services, learn what breaks
kubectl cheat sheet - You'll use this daily
Helm charts - Don't write YAML from scratch like a masochist
Official Docker docs - Actually well-written for once
k8s troubleshooting guide - For when everything's on fire
Container security guide - Because you'll need it eventually
12-Factor App methodology - Design patterns that work in containers
Docker best practices - Avoid common Docker mistakes
Kubernetes patterns book - Design patterns for container orchestration
Database Migration Service - Google's comprehensive migration guide

Just don't believe anyone who promises you zero downtime on the first try.

Questions You'll Actually Ask (And Honest Answers)

How long will this migration really take?

The vendor says 2-4 weeks. Your manager budgets 2 months. Reality? 6-12 months for anything non-trivial. That "simple" web app probably connects to 3 databases, writes logs to /var/log/app, and has hardcoded IPs somewhere. Budget 3x whatever your initial estimate is and you might hit it.

Do I really need to learn Kubernetes?

Depends. If you're just containerizing a single app, Docker Compose might be enough. But if your company is "going cloud native," yes, you're learning Kubernetes whether you want to or not. The YAML will make you question your life choices, but at least everyone suffers together.

My app won't start in the container. What's wrong?

It's always one of three things:

Permissions - Your app can't write to /tmp or some config directory
Environment variables - Some config is hardcoded to the old server
Networking - Can't reach the database because container networking is different

Start with permissions. Check the logs. If you see "Permission denied" anywhere, that's probably it. Run docker logs <container> and actually read the errors instead of guessing.

The containerized app uses way more memory. Is this normal?

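Yep. Java apps especially. JVMs older than Java 10 (or 8u191, where container support was backported) ignore container memory limits. The JVM sizes its heap from the host's 64GB instead of your 2GB limit, goes nuts, and gets OOM-killed.

For Java: Add -XX:MaxRAMPercentage=75.0 to your JVM args. -XX:+UseContainerSupport is already on by default in Java 10+ and 8u191+, so setting it explicitly is harmless but redundant.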
For Node.js: Set --max-old-space-size=1024 if you're hitting memory limits.
For everything else: Actually profile your app instead of guessing.
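
If the percentage flag feels abstract, here's the arithmetic as a rough Python sketch. It assumes the JVM's default MaxRAMPercentage of 25% and a hypothetical 64GB node; the function name is made up for illustration and this is an approximation of how the default max heap is derived, not the JVM's actual code.

```python
def max_heap_mib(visible_memory_mib: int, max_ram_percentage: float = 25.0) -> int:
    """Approximate the default max heap the JVM derives from the memory it can see."""
    return int(visible_memory_mib * max_ram_percentage / 100)

# A JVM that ignores the container limit sizes itself from the node (assumed 64 GiB).
print(max_heap_mib(64 * 1024))       # 16384
# A container-aware JVM with a 2 GiB limit and the default percentage.
print(max_heap_mib(2 * 1024))        # 512
# The same container with -XX:MaxRAMPercentage=75.0.
print(max_heap_mib(2 * 1024, 75.0))  # 1536
```

The first number is why the pod gets killed: a 16GB heap target inside a 2GB container. The second is why "container-aware" alone isn't enough: a quarter of the limit is usually too small.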

How do I handle database connections?

Connection pooling becomes critical. Your old app had 5 instances with 10 connections each (50 total). Kubernetes might spin up 20 instances during a deployment, hitting your database with 200 connections and killing it.

Use a connection pooler like PgBouncer for PostgreSQL or set aggressive connection limits in your app. Also, configure proper readiness probes so Kubernetes doesn't route traffic until the app is actually connected to the database.
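
The connection math above is worth running before a rollout, not after. A minimal sketch, assuming each pod opens a fixed-size pool and that your database limit is 100 (PostgreSQL's default max_connections); the function names and the reserve of 10 connections are hypothetical.

```python
def peak_connections(replicas: int, pool_size: int, surge_pods: int = 0) -> int:
    """Worst-case connections while old and new pods overlap during a rollout."""
    return (replicas + surge_pods) * pool_size

def fits(replicas: int, pool_size: int, surge_pods: int,
         db_max_connections: int, reserved: int = 10) -> bool:
    """True if the rollout peak stays under the database limit minus a reserve."""
    return peak_connections(replicas, pool_size, surge_pods) <= db_max_connections - reserved

print(peak_connections(5, 10))                  # 50, the old fleet
print(peak_connections(20, 10))                 # 200, the rollout spike
print(fits(20, 10, 0, db_max_connections=100))  # False
```

If fits() comes back False, either shrink the per-pod pool or put a pooler in front of the database so the pod count stops mattering.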

Everything works in staging but breaks in production. Why?

Because staging doesn't have:

Real traffic volumes
The same database size
All the edge cases your users find
That one integration that only runs in production
The network latency of your actual infrastructure

Also, your staging environment probably has fewer security restrictions. Production will block outbound connections, require service accounts, and generally make your life harder.

My deployment is stuck in "Rolling Update" forever. What now?

The new pods aren't passing readiness checks. Check:

kubectl describe deployment <app-name> - Look for error messages
kubectl logs -l app=<app-name> - Check application logs
kubectl get events --sort-by=.metadata.creationTimestamp - See what Kubernetes is complaining about

Nuclear option: kubectl rollout undo deployment/<app-name> to go back to the previous version that worked.

My app needs to write files. How do I handle storage?

Stop writing files to the container filesystem - they disappear when the pod restarts. Options:

Persistent Volumes for databases and permanent storage
Object storage (S3, GCS) for uploads and documents
ConfigMaps for configuration files
Secrets for sensitive configuration

If you must write temp files, use /tmp and make sure your app handles them disappearing.
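
"Handles them disappearing" in practice means every read of a temp file has a way to rebuild it. A minimal Python sketch of that idea; read_or_rebuild and the rebuild callback are hypothetical names, and your app's cache will look different.

```python
import os
import tempfile
from typing import Callable

def read_or_rebuild(path: str, rebuild: Callable[[], str]) -> str:
    """Return the cached text at path, regenerating it if the file is gone."""
    try:
        with open(path, encoding="utf-8") as fh:
            return fh.read()
    except FileNotFoundError:
        content = rebuild()
        directory = os.path.dirname(path) or "."
        os.makedirs(directory, exist_ok=True)
        # Write to a sibling temp file, then rename, so readers never see a partial file.
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(content)
        os.replace(tmp_path, path)
        return content
```

After a pod restart the first call pays the rebuild cost and every later call hits the file again. If rebuilding is too expensive to do on every restart, that data doesn't belong in /tmp.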

The Migration Process That Actually Works

Step 1: Inventory Everything

Forget the "automated discovery tools" - they'll miss half your dependencies. Spend 2 weeks manually documenting everything:

Walk through your servers and write down:

Every database connection (including that MySQL instance running on port 3307 for some reason)
All the environment variables your app reads
Where it writes logs, temp files, uploads, cache files
Every cron job, background process, and scheduled task
All the external APIs it calls (including that one that only works on Tuesdays)

Pro tip: Grep your codebase for hardcoded IPs, file paths, and hostnames. There are always more than you think. Use ripgrep, or plain grep -rnE '192\.168|localhost|/var/|/tmp/' . to find problems.

Run your app on a fresh VM with minimal permissions. Whatever breaks is what you need to containerize properly. This step alone will save you weeks of debugging later.

Useful inventory tools: docker-slim can analyze your containers, dive shows you what's actually in your Docker layers, and hadolint catches common Dockerfile mistakes.

Additional discovery tools: syft generates software bills of materials, grype scans for vulnerabilities, trivy provides comprehensive security scanning, and cosign handles container signing for supply chain security.

Step 2: Containerize One Thing at a Time

Don't try to containerize your entire stack at once. Pick your simplest, newest application first. You need a win to build confidence (and budget) for the harder stuff.

Start with this Dockerfile pattern:

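# Small base image; pin the major version so rebuilds are predictable
FROM node:20-alpine
WORKDIR /app
# Copy manifests first so the dependency layer is cached between code changes
COPY package*.json ./
# --omit=dev replaces the deprecated --only=production flag
RUN npm ci --omit=dev
COPY . .
EXPOSE 3000
# Drop root; the official image ships an unprivileged "node" user
USER node
CMD ["npm", "start"]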

Things that will break:

File permissions (add RUN chown -R node:node /app)
Missing environment variables (check your .env files)
Can't write to /app (use /tmp for temporary files)
Node process dies with no logs (add proper signal handling)

Test locally first. If docker run doesn't work on your laptop, it definitely won't work in production.

Step 3: Get Kubernetes Working (Good Luck)

Set up a development cluster first. Don't use production for experiments. k3s is easier than full Kubernetes for testing. kind runs Kubernetes in Docker containers on your laptop, and minikube is the classic local development option.

Minimum Kubernetes resources you need:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: your-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: your-app
  template:
    metadata:
      labels:
        app: your-app
    spec:
      containers:
      - name: app
        image: your-registry/your-app:latest
        ports:
        - containerPort: 3000
        env:
        # Demo only: inline credentials end up in version control. Load this from a Secret in production.
        - name: DATABASE_URL
          value: "postgresql://user:pass@db:5432/app"
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health
            port: 3000
          initialDelaySeconds: 30
        readinessProbe:
          httpGet:
            path: /ready
            port: 3000
          initialDelaySeconds: 5

That readiness probe better actually work. If your app says it's ready but can't handle traffic, Kubernetes will send requests to broken pods and your users will get 500 errors.
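
Here's roughly what "actually work" means, sketched in Python with only the standard library. The /health and /ready paths and port 3000 match the manifest above; database_reachable is a placeholder you would replace with a real check, and the other names are hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable

def database_reachable() -> bool:
    """Placeholder dependency check; replace with a real query such as SELECT 1."""
    return True

def probe_status(path: str, dependency_ok: Callable[[], bool]) -> int:
    """Map a probe path to an HTTP status code."""
    if path == "/health":
        return 200  # liveness: the process is up and able to answer
    if path == "/ready":
        try:
            return 200 if dependency_ok() else 503
        except Exception:
            return 503  # a failing check means "not ready", never a crash
    return 404

class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        self.send_response(probe_status(self.path, database_reachable))
        self.end_headers()

def serve(port: int = 3000) -> None:
    """Blocking call; start it from your entrypoint."""
    HTTPServer(("0.0.0.0", port), ProbeHandler).serve_forever()
```

Liveness deliberately ignores the database. If /health fails whenever the database hiccups, Kubernetes restarts every pod at once and turns a blip into an outage.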

Step 4: Handle the Database (This Is Where It Gets Ugly)

Database migration is where dreams die. Never migrate the database and application simultaneously unless you enjoy 3am emergency calls.

Option A: Keep the database where it is
Point your containerized app to the existing database. This works great until you need to scale and hit connection limits.

Option B: Migrate database first
Move your data to a managed database service (RDS, Cloud SQL), then containerize the app. Safer but more expensive.

Option C: Do both at once
You'll spend 4 months debugging synchronization issues that wouldn't exist if you did them separately. Don't do this.

For connection pooling, use PgBouncer for PostgreSQL, ProxySQL for MySQL, or pgpool-II for more advanced PostgreSQL features. Your database will thank you when Kubernetes starts spinning up 20 instances during deployments.

Database migration guides: Postgres migration best practices, MySQL replication setup, and MongoDB replica sets for the NoSQL crowd.

Database-as-a-Service options: AWS RDS for managed relational databases, Google Cloud SQL for PostgreSQL and MySQL, Azure Database for Microsoft environments, and PlanetScale for serverless MySQL with branching.

Step 5: Deploy to Production (Prepare for Disappointment)

Rolling updates sound great until you try them. Half your pods will be running the old version, half the new version, and somehow both will be broken differently.

Blue-green deployment reality:

Works great for demos
Costs 2x your infrastructure budget
Requires duplicate databases ($$$$)
Most companies do it once then switch to rolling updates

What actually works:

Deploy during maintenance windows for the first few migrations
Use feature flags to control new functionality
Have a rollback plan that you've actually tested
Monitor real user transactions, not just HTTP response codes

Your monitoring will lie to you. Kubernetes thinks everything is healthy while users are getting timeout errors. Build synthetic transactions that actually test your business logic.
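
A synthetic transaction is only useful if it judges the response the way a customer would. A minimal sketch of the checking half, assuming a checkout endpoint that returns JSON with order_id and total_cents fields; both field names are invented for illustration.

```python
import json
from typing import Tuple

def checkout_looks_healthy(status: int, body: str) -> Tuple[bool, str]:
    """Judge a checkout response by its content, not just its status code."""
    if status != 200:
        return False, f"unexpected status {status}"
    try:
        payload = json.loads(body)
    except ValueError:
        return False, "body is not JSON"
    if not isinstance(payload, dict) or not payload.get("order_id"):
        return False, "no order_id in response"
    total = payload.get("total_cents")
    if not isinstance(total, int) or total <= 0:
        return False, "order total is not positive"
    return True, "ok"

# An HTTP 200 that a status-code check would wave through.
print(checkout_looks_healthy(200, '{"error": "db timeout"}'))
# (False, 'no order_id in response')
print(checkout_looks_healthy(200, '{"order_id": "A1", "total_cents": 1999}'))
# (True, 'ok')
```

Wire this behind whatever HTTP client you already use and alert on the reason string. The first example is the failure that status-code monitoring never sees.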

The Hard Truth

Most migrations take 3x longer than estimated and cost 2x more than budgeted. But when it works:

Deployments become boring (in a good way)
Scaling doesn't require buying servers
Your ops team stops hating you
Recovery from failures is measured in seconds, not hours

Just don't expect it to be painless. And definitely don't promise your CEO zero downtime on the first try.

Reality Check: What These Strategies Actually Cost You

Strategy	Actual Downtime	Real Cost	What Actually Breaks	I've Used This
Blue-Green	30 seconds (DNS switch)	2x infrastructure + database duplication	Database sync lag, session loss	Works for demos, budget killer in prod
Rolling Update	"Zero" (but users get 500s)	Standard	Half pods old version, half new, both broken	Default choice, prepare for debugging
Canary	Zero for 95% of users	1.2x resources	Figuring out what "5% traffic" means	Great when you need to look cautious
A/B Testing	Zero	1.5x resources + analytics	Statistics are hard, nobody knows what's significant	Marketing loves it, ops hates it
Maintenance Window	2-4 hours planned	Standard	Nothing if you test properly	What actually works for first migration

Advanced Patterns and What Actually Happens in Production

The Strangler Fig Pattern (Or: How to Slowly Strangle Yourself)

The Strangler Fig pattern sounds great in theory. You gradually replace your monolith piece by piece while keeping everything running. In practice, you'll spend 8 months building an API gateway router that becomes more complex than your original monolith.

What actually happens:

Identify boundaries - Turns out your 10-year-old codebase has no boundaries, everything calls everything
Build new services - Each "simple" service needs authentication, logging, monitoring, deployment pipelines
Route requests - Your router becomes a 5,000-line configuration nightmare that nobody understands
Debug distributed failures - Error tracking across 12 services is harder than debugging one big app
Legacy never dies - That "temporary" legacy code will be running 3 years from now

The reality check: We tried strangling our monolith for 18 months. Ended up with a distributed monolith that was harder to debug, impossible to test locally, and cost 3x more to run. Sometimes burning it down and starting over is actually faster than slowly strangling yourself.
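
For reference, the "route requests" step starts out this small. A sketch in Python, assuming a routing table keyed by path prefix; the prefixes and backend URLs are made up. The 5,000-line version is this plus every exception anyone ever asked for.

```python
from typing import Dict

# Hypothetical routing table: path prefixes that have already been carved out.
MIGRATED_PREFIXES: Dict[str, str] = {
    "/api/cart": "http://cart-service",
    "/api/cart/checkout": "http://checkout-service",
}
LEGACY_BACKEND = "http://legacy-monolith"

def pick_backend(path: str) -> str:
    """Longest matching prefix wins; anything unmatched stays on the monolith."""
    best = ""
    for prefix in MIGRATED_PREFIXES:
        # Match whole path segments so "/api/cartoon" does not hit "/api/cart".
        matches = path == prefix or path.startswith(prefix + "/")
        if matches and len(prefix) > len(best):
            best = prefix
    return MIGRATED_PREFIXES.get(best, LEGACY_BACKEND)

print(pick_backend("/api/cart/items"))             # http://cart-service
print(pick_backend("/api/cart/checkout/confirm"))  # http://checkout-service
print(pick_backend("/reports/monthly"))            # http://legacy-monolith
```

Defaulting to the monolith is the part worth keeping: a route nobody remembered still works, it just hasn't been migrated yet.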

Monitoring That Actually Works (Not the Pretty Dashboards)

Your monitoring strategy needs to survive the migration, not just look good in vendor demos. Focus on user-facing metrics, not internal Kubernetes health checks.

Monitoring that actually catches problems:

Synthetic transactions that exercise your actual business logic, not just HTTP 200s
Real user monitoring (RUM) because synthetic tests miss half the edge cases
Database query performance - your app works, but queries take 10x longer
Error budgets based on user impact, not technical metrics

Tools that work in real environments:

Datadog - Expensive but comprehensive, actually correlates problems across services
New Relic - Good APM, terrible alerting, prepare for alert fatigue
Prometheus + Grafana - Free, flexible, requires dedicated platform team
ELK Stack - Works great until you need to search logs during an outage and it's down too
Jaeger for distributed tracing across microservices
Zipkin as an alternative distributed tracing system
OpenTelemetry for vendor-agnostic observability
Honeycomb for high-cardinality observability data
Sentry for error tracking and application monitoring

Alert fatigue is real. You'll get 50 alerts about pod restarts while users can't log in. Focus on business impact metrics, not infrastructure health.

Security in the Real World (Spoiler: It's Terrible)

Container security is like regular security, but with more YAML files to misconfigure. Every security scan will find 47 "critical" vulnerabilities in base images that can't be fixed.

Security reality:

Image scanning finds problems, provides no solutions - that Alpine Linux CVE from 2019? Still not fixed
Runtime security tools generate false positives constantly
Network policies break everything initially, get disabled "temporarily" for 6 months
RBAC is configured by trial and error until something works
Secret management is a formality - everyone knows the database password is in the environment variables

What actually secures your system:

Regular updates of base images (automate this or you'll never do it)
Least privilege for service accounts (not humans, those need admin)
Network segmentation at the cloud provider level, not just Kubernetes
Backup your secrets because when the secret store dies, you're fucked

Post-Migration: When the Bills Come Due

"Cloud native" doesn't automatically mean cheaper. Your AWS bill will double in the first 6 months as you figure out rightsizing.

Cost optimization reality:

Right-sizing takes 6 months of production data to get right
Auto-scaling scales up fast, down slow, usually costs more than fixed capacity
Spot instances work great until your batch jobs disappear mid-processing
Resource quotas prevent your staging environment from costing more than production

Hidden costs of "success":

Managed databases cost 4x self-hosted
Load balancers are $20/month each, you'll have 12 of them
Container registry costs scale with your Docker image addiction
Data transfer between availability zones adds up fast

The 80/20 rule: 80% of your costs come from 20% of your resources. Find that 20% first.
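
Finding that 20% is a sort and a running total. A quick Python sketch over an exported bill; the line items and dollar amounts are invented, and your split may not be anywhere near 80/20.

```python
from typing import Dict, List

def top_spenders(costs: Dict[str, float], share: float = 0.8) -> List[str]:
    """Return the smallest set of line items (largest first) covering `share` of spend."""
    total = sum(costs.values())
    picked: List[str] = []
    running = 0.0
    for name, cost in sorted(costs.items(), key=lambda item: item[1], reverse=True):
        if running >= share * total:
            break
        picked.append(name)
        running += cost
    return picked

# Made-up monthly bill in dollars, purely for illustration.
bill = {"managed-db": 4200, "nodes": 3100, "nat-gateway": 900,
        "load-balancers": 240, "registry": 60}
print(top_spenders(bill))  # ['managed-db', 'nodes']
```

In this made-up bill, two of five line items cover 80% of the spend, so those are the two worth a right-sizing meeting. The 12 load balancers can wait.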

Disaster Recovery: When Your Cloud Goes Down

Multi-region deployments sound great until you realize your database doesn't replicate across regions and your users' data is stuck in us-east-1 when it dies.

Real disaster scenarios:

Region failure - Your app works, your database doesn't
Cluster upgrade fails - Kubernetes 1.28 breaks your ingress controller
Certificate expiration - Everything dies at 3am on a Saturday
Vendor lock-in - Can't migrate off AWS because of 47 managed services
Human error - Someone deleted the wrong namespace (yes, this happens)

What actually works for DR:

Test your backups monthly, not when you need them
Document the nuclear option - how to recreate everything from scratch
Practice failover during business hours when people are available
Have a rollback plan that doesn't depend on the system you're fixing

Truth: Most "disaster recovery" is just "restore from backup and hope it works." Plan accordingly.

The Hard Truth About Migration Success

Your migration isn't done when you turn off the old servers. It's done when your team stops being constantly paged about container orchestration issues and can focus on actual features again.

Budget 18 months from "working in containers" to "stable in production." The first 6 months after go-live will be the hardest of your career.

Additional survival resources:

SRE Workbook - Google's lessons learned, skip to the failure stories
AWS Well-Architected Framework - Boring but comprehensive
12-Factor App - Still relevant, especially for containerization
CNCF Landscape - Tool overwhelm in visual form
Kubernetes operators - Because managing databases manually is for masochists

Real Troubleshooting for When Everything Breaks

My app won't start and I'm getting cryptic errors. What now?

Stop panicking and actually read the logs. Run kubectl logs -f <pod-name> and scroll up to the FIRST error, not the last one.

Common "cryptic" errors and their real meanings:

"permission denied" → Your app can't write somewhere, probably /tmp or a config directory
"connection refused" → Database is unreachable, check your service names and ports
"no such file or directory" → A config file path is hardcoded to the old server
"exec format error" → You built the image on ARM Mac, deploying to x86 Linux

Quick debug steps:

kubectl describe pod <pod-name> - Check for resource limits or image pull failures
kubectl exec -it <pod-name> -- sh - Get a shell and poke around
Compare working staging vs broken production environment variables
Check if your database is actually running (telnet db-host 5432)

Database connections are fucked. How do I unfuck them?

Your containerized app probably has different connection behavior than the old one. Common fuckups:

Connection exhaustion: Old app had 5 instances × 20 connections = 100 total. New app scales to 50 pods during deployment = 1000 connections, database dies.

Fix: Use PgBouncer or similar connection pooling. Set aggressive connection limits in your app config.

Network timeouts: Container networking adds latency, your 30-second timeout becomes too short.

Fix: Increase timeouts, especially connection and read timeouts. Test from inside a pod: kubectl exec -it <pod> -- telnet db-host 5432
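
Slim images often don't ship telnet. If the image has Python, a few lines do the same job. A minimal sketch: db-host and 5432 are the same placeholders as above, and the 3-second default timeout is an arbitrary choice.

```python
import socket

def can_connect(host: str, port: int, timeout_seconds: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures alike.
        return False
```

Run can_connect("db-host", 5432) from a Python shell inside the pod. True only proves the port is open; it says nothing about credentials or SSL, which is the next thing to check.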

SSL certificate issues: Your database enforces SSL, container doesn't have the right certs.

Fix: Mount the proper CA certificates into the container. Disabling SSL for internal connections is the fallback, and only if the network is genuinely isolated.

Performance is shit compared to the old system. Why?

Container resource limits are probably wrong. Everyone underestimates memory and overestimates CPU needs.

Debug performance step by step:

kubectl top pods - Is anything hitting resource limits?
Check JVM heap settings if Java (versions before Java 10 and 8u191 don't detect container memory limits)
Compare database query performance - connection pooling changes can affect query plans
Profile in production, not staging (different data size = different problems)

Quick fixes:

Java apps: Add -XX:MaxRAMPercentage=75.0 (-XX:+UseContainerSupport is already the default on Java 10+ and 8u191+)
Node.js: Set --max-old-space-size=1024 in your start command
Everything else: Double the memory limit and see if it helps

How do I rollback when everything is on fire?

First, stop deploying new shit. Then:

For rolling updates:

kubectl rollout undo deployment/<app-name>
kubectl rollout status deployment/<app-name>

For blue-green (if you set it up right):
Switch your load balancer back to the blue environment. Should take 30 seconds.

For "oh shit we're totally fucked":

Revert DNS to point to old servers (if they still exist)
Restore database from backup (if you have recent backups)
Start updating your resume (if you don't)

Pro tip: Always test your rollback procedure during the migration, not during the outage.

Data is out of sync and users are pissed. Emergency mode?

Step 1: Stop the bleeding

Put the application in read-only mode if possible
Stop all write operations to both old and new systems
Communicate status to users (they hate silence more than downtime)

Step 2: Assess damage

Compare critical tables between systems
Identify which data is authoritative (usually the old system during migration)
Figure out the time window when sync broke

Step 3: Fix it

Export missing/correct data from authoritative source
Import to the broken system
Verify with checksums or row counts
Resume operations to one system only
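
"Verify with checksums or row counts" can be as simple as the sketch below. It assumes you can pull each table's rows from both systems as tuples; the function names are hypothetical, and for big tables you'd compute the digest inside the database instead of in Python.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

Row = Tuple[object, ...]

def table_fingerprint(rows: Iterable[Row]) -> Tuple[int, str]:
    """Return (row count, order-independent digest) for a table's rows."""
    digests = sorted(hashlib.sha256(repr(row).encode("utf-8")).hexdigest() for row in rows)
    combined = hashlib.sha256("".join(digests).encode("ascii")).hexdigest()
    return len(digests), combined

def mismatched_tables(old: Dict[str, List[Row]], new: Dict[str, List[Row]]) -> List[str]:
    """List tables whose fingerprints differ between the two systems."""
    return sorted(
        name for name in old.keys() | new.keys()
        if table_fingerprint(old.get(name, [])) != table_fingerprint(new.get(name, []))
    )

old = {"orders": [(1, "paid"), (2, "open")], "users": [(1, "ann")]}
new = {"orders": [(2, "open"), (1, "paid")], "users": [(1, "ann"), (2, "bob")]}
print(mismatched_tables(old, new))  # ['users']
```

Row order doesn't matter here, but types do: if one driver returns a Decimal and the other a float, identical data will look different. Normalize values before hashing.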

Step 4: Learn
Document what happened and add monitoring to catch this earlier next time.

Secrets management is a clusterfuck. How do I secure this properly?

Everyone puts database passwords in environment variables initially. It's fine for staging, terrible for production.

Quick wins:

Use kubectl create secret for anything sensitive
Mount secrets as files, not environment variables
Enable encryption at rest in your cluster
Rotate secrets regularly (automate this)

Better solutions:

HashiCorp Vault if you have platform team resources
AWS Secrets Manager if you're all-in on AWS
Kubernetes External Secrets Operator to bridge external secret stores

Legacy app writes files everywhere. Containers hate this. Solutions?

Files that can disappear (logs, temp files, cache):

Write to /tmp in containers
Configure app to handle files disappearing
Use emptyDir volumes if you need shared temp space between containers

Files that must persist (uploads, data):

Object storage (S3, GCS, Azure Blob) for user uploads
Persistent volumes for database files
Network file systems (NFS, EFS) for legacy apps that really need shared filesystems

Quick migration hack:
Mount a persistent volume at the same path the legacy app expects. Not ideal, but gets you working quickly.

Migration is taking forever and the business is changing requirements. Help?

Scope creep is the mind-killer.

Strategies that work:

Deploy what you have - Get basic functionality working in containers first
Feature flags - New features can be developed independently of migration
Communicate constantly - Weekly updates prevent surprise requirement changes
Set boundaries - "We'll consider new requirements after migration is complete"

When to call it:
If the migration has taken 2x the original estimate and you're still not in production, consider starting over with a simpler approach. Sometimes it's faster.
