Let me tell you about my first production migration disaster. It was 2019, and we had a 2TB PostgreSQL 9.6 database that needed to move to AWS RDS. "How hard could it be?" - famous last words. Took the site down for 6 hours on a Tuesday morning.
The CEO called me at 3am asking why our biggest customer couldn't place orders. That's when I learned that "just a quick database migration" can cost you $50K in lost revenue and almost your job.
What Actually Breaks During Migrations
Here's what nobody tells you about database migrations - they fail in the stupidest ways:
Connection Pool Exhaustion
Your application has 100 database connections configured, but during migration you're running dual-writes to both databases. Suddenly you need 200 connections. PostgreSQL 12.8 defaults to 100 max_connections, so half your writes start failing with "FATAL: remaining connection slots are reserved for non-replication superuser connections".
I learned this the hard way - PgBouncer saved my ass when I had to handle the connection math during our user table migration. The PostgreSQL documentation explains connection limits in detail, but they don't warn you about the dual-write connection doubling problem.
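Here's roughly what the dual-write plumbing looks like, as a minimal Python/psycopg2 sketch - the DSNs, pool sizes, and error handling are stand-ins, not our production code. The point is that every write now holds a connection on both databases, so the two pools together have to fit under max_connections on each side (or you put PgBouncer in front).

```python
# Minimal dual-write sketch with psycopg2. DSNs and pool sizes are hypothetical.
from psycopg2.pool import ThreadedConnectionPool

OLD_DSN = "host=old-db dbname=app user=app"   # hypothetical
NEW_DSN = "host=new-db dbname=app user=app"   # hypothetical

# 100 app connections before the migration -> cap each pool at 50 during
# dual-writes, or raise max_connections / front it with PgBouncer.
old_pool = ThreadedConnectionPool(minconn=5, maxconn=50, dsn=OLD_DSN)
new_pool = ThreadedConnectionPool(minconn=5, maxconn=50, dsn=NEW_DSN)

def dual_write(sql, params):
    """Write to the old (source of truth) DB first, then best-effort to the new one."""
    old_conn, new_conn = old_pool.getconn(), new_pool.getconn()
    try:
        with old_conn, old_conn.cursor() as cur:      # commits on success
            cur.execute(sql, params)
        try:
            with new_conn, new_conn.cursor() as cur:
                cur.execute(sql, params)
        except Exception:
            # Don't fail the user request if the new DB hiccups; log and reconcile later.
            pass
    finally:
        old_pool.putconn(old_conn)
        new_pool.putconn(new_conn)
```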
Replication Lag Hell
PostgreSQL logical replication looks great in the docs. In practice, it falls behind during high load. I've seen 10+ minute replication lag during peak hours, which means your new database is serving stale data while users are adding new orders.
During our 2TB migration, lag hit 15 minutes during our morning traffic spike. Had to throttle the bulk migration and add monitoring alerts to catch this shit before customers noticed. PostgreSQL logical replication documentation covers tuning parameters, and Uber's blog post shows how they handled similar lag issues at scale.
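If you want the throttle in code form, here's a rough sketch of the lag check bolted onto the bulk copy loop. It assumes PostgreSQL 10+ function names (on 9.6 the equivalents are pg_current_xlog_location / pg_xlog_location_diff) and a made-up slot name; tune the threshold to your own pain tolerance.

```python
# Hedged sketch: pause the bulk copy whenever the logical slot falls too far behind.
import time
import psycopg2

LAG_LIMIT_BYTES = 256 * 1024 * 1024   # back off when the slot is ~256MB behind

LAG_SQL = """
    SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
    FROM pg_replication_slots
    WHERE slot_name = %s
"""

def wait_for_replication(conn, slot_name="migration_slot"):   # slot name is hypothetical
    """Block the bulk copy until the logical slot catches up below the limit."""
    while True:
        with conn.cursor() as cur:
            cur.execute(LAG_SQL, (slot_name,))
            row = cur.fetchone()
        lag = row[0] if row and row[0] is not None else 0
        if lag < LAG_LIMIT_BYTES:
            return
        print(f"replication lag ~{lag / 1024 / 1024:.0f}MB, throttling bulk copy")
        time.sleep(10)

# In the migration loop: copy a batch, then call wait_for_replication(src_conn)
# before the next batch, so the subscriber never falls hopelessly behind.
```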
Timezone Fuckery
You think your timestamps are stored correctly? Think again. PostgreSQL TIMESTAMP WITHOUT TIME ZONE columns become TIMESTAMP WITH TIME ZONE on the target. All your stored times are suddenly off by your server's timezone offset. I spent 8 hours debugging why all our scheduled jobs were running at the wrong times. PostgreSQL's timezone documentation is comprehensive but this Stack Overflow thread explains the practical differences better. The Postgres Wiki on timezone handling has saved me multiple times.
Foreign Key Cascade Nightmares
That innocent foreign key constraint with ON DELETE CASCADE? During migration, it decided to delete 50K related records when I was just trying to clean up test data. No warning, no rollback - just gone.
Two hours of explaining to the VP of Engineering why half our user profiles disappeared. Pro tip: disable foreign key constraints during migration, re-enable after you're done shitting your pants. MySQL's constraint documentation covers constraint handling, and PostgreSQL's approach to deferrable constraints is worth reading too. This comprehensive migration guide covers foreign key handling strategies.
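Here's the shape of that "constraints off, bulk load, constraints back on" dance - a sketch, not a copy of our script. It leans on session_replication_role = replica, which tells Postgres to skip FK triggers for that session (you need superuser or a suitably privileged role), so a stray DELETE can't fan out through ON DELETE CASCADE mid-migration.

```python
# Sketch: bulk load with FK triggers disabled for this session only.
import psycopg2

def bulk_load(dsn, statements):
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute("SET session_replication_role = replica")   # FK triggers off
            for stmt in statements:
                cur.execute(stmt)
            cur.execute("SET session_replication_role = DEFAULT")   # back to normal
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()

# Afterwards, re-check what the triggers would have enforced, e.g. a
# child LEFT JOIN parent query looking for orphaned rows, before you
# let normal traffic back in.
```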
The Three Things That Actually Matter
After breaking production multiple times, here's what I learned:
1. Test Your Rollback First
Don't just test your migration strategy - test your rollback strategy. When shit hits the fan at 2am, you need to be able to switch back to the old database in under 60 seconds.
I use pg_basebackup plus continuous WAL archiving for point-in-time recovery that actually works. Practiced the rollback procedure 12 times before our production migration. Good thing, because I had to use it twice during the actual migration when replication lag spiked to 45 minutes. This comprehensive guide on PostgreSQL backup and recovery is essential reading, and Percona's backup guide covers large database strategies.
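A rollback drill is just a script with a stopwatch. Something like this sketch - the DSN, the orders table, and update_app_dsn() are all placeholders for whatever your stack actually uses to repoint the app:

```python
# Sketch of a rollback drill: repoint the app, prove the old DB answers, time it.
import time
import psycopg2

OLD_DSN = "host=old-db dbname=app user=app"   # hypothetical

def update_app_dsn(dsn):
    # Hypothetical: flip whatever your app reads its DSN from (flag, Consul key, env var).
    print(f"repointing app to {dsn}")

def rollback_drill():
    start = time.monotonic()
    update_app_dsn(OLD_DSN)                        # 1. repoint the application
    with psycopg2.connect(OLD_DSN) as conn:        # 2. prove the old DB serves real queries
        with conn.cursor() as cur:
            # 'orders' is a hypothetical table standing in for your hot path
            cur.execute("SELECT count(*) FROM orders WHERE created_at > now() - interval '5 minutes'")
            recent = cur.fetchone()[0]
    elapsed = time.monotonic() - start
    print(f"rollback took {elapsed:.1f}s, {recent} recent orders visible")
    assert elapsed < 60, "rollback did not finish inside the 60-second budget"

# Run it against staging a dozen times before the real migration; the number
# you get is the number you quote at 2am.
```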
2. Monitor Everything, Trust Nothing
Set up alerts on replication lag, connection counts, and query performance. I use Prometheus with custom metrics to track dual-write success rates. If replication lag hits 30 seconds, the migration stops automatically. This Prometheus PostgreSQL exporter provides essential migration metrics, and Grafana's PostgreSQL dashboard visualizes them perfectly.
Learned this after our order processing started serving stale inventory data. Customers were buying products we didn't have in stock. That was a fun Monday morning. DataDog's database monitoring guide and New Relic's best practices helped me set up proper alerting.
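The dual-write counters and the auto-stop aren't much code. Here's a sketch with the prometheus_client library; the metric names and the 30-second cutoff mirror what I described above, everything else (how lag is fetched, how the migration is paused) is stand-in plumbing.

```python
# Sketch: custom migration metrics plus a lag watchdog that halts the bulk copy.
import time
from prometheus_client import Counter, Gauge, start_http_server

DUAL_WRITE_OK = Counter("migration_dual_write_success_total", "Dual writes that landed on both DBs")
DUAL_WRITE_FAIL = Counter("migration_dual_write_failure_total", "Dual writes that missed the new DB")
REPLICATION_LAG = Gauge("migration_replication_lag_seconds", "Seconds the subscriber is behind")

LAG_STOP_SECONDS = 30

def record_dual_write(ok: bool):
    (DUAL_WRITE_OK if ok else DUAL_WRITE_FAIL).inc()

def watchdog(get_lag_seconds, pause_migration):
    """Poll lag; if it crosses the threshold, stop the bulk copy automatically."""
    while True:
        lag = get_lag_seconds()          # however you measure it (see the lag query earlier)
        REPLICATION_LAG.set(lag)
        if lag > LAG_STOP_SECONDS:
            pause_migration()
        time.sleep(5)

if __name__ == "__main__":
    start_http_server(9200)   # scrape target for Prometheus; the port is arbitrary
    # watchdog(get_lag_seconds=..., pause_migration=...) runs inside the migration process
```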
3. Staging Environment Must Be Identical
Your staging database with 1GB of data will not reveal the same issues as production with 500GB. I learned this when a migration that took 10 minutes in staging took 4 hours in production because of checkpoint frequency differences.
Spent all night with the CEO texting me "status updates" every 15 minutes. Now staging has the same hardware specs, PostgreSQL config, and data volume as production. Painful lesson but worth it. This production parity guide explains why dev/staging/prod consistency matters, and PostgreSQL's configuration tuning guide helps match performance characteristics.
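One cheap way to keep the configs honest is to diff pg_settings between the two environments on a schedule. A sketch (DSNs are placeholders):

```python
# Sketch: flag every PostgreSQL setting that differs between prod and staging.
import psycopg2

SETTINGS_SQL = "SELECT name, setting FROM pg_settings"

def settings(dsn):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(SETTINGS_SQL)
        return dict(cur.fetchall())

prod = settings("host=prod-db dbname=app")         # hypothetical DSNs
staging = settings("host=staging-db dbname=app")

for name in sorted(prod):
    if staging.get(name) != prod[name]:
        print(f"{name}: prod={prod[name]} staging={staging.get(name)}")
# Watch checkpoint_timeout, max_wal_size, shared_buffers, and work_mem in
# particular: those are the ones that turn a 10-minute staging run into a
# 4-hour production run.
```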
The truth is, every database migration is different, and most of them go wrong in ways you didn't expect. But if you test the rollback procedure, monitor the shit out of everything, and have staging that actually matches production, you might not break your site. This comprehensive migration checklist and GitHub's migration best practices provide additional safety nets that have saved my ass multiple times.