Zero Downtime Database Migration: AI-Optimized Technical Reference
Executive Summary
Zero downtime database migration means users don't notice the database relocation. Reality: even the "best" migrations have hiccups. GitHub's 2018 MySQL migration caused roughly 24 hours of degraded service after a split-brain scenario. Expect a 4-month timeline minimum despite 2-week marketing promises.
Critical Definitions
Zero Downtime Migration: Users don't notice database movement - no maintenance pages, no service interruptions, no customer support calls.
Actual Success Rate: Based on 12 production migrations over 8 years - a 33% spectacular-failure rate, including a 6-hour payment system outage and a data corruption incident that took 3 months to untangle.
Resource Requirements
Time Investment
- Marketing Promise: 2 weeks
- Engineering Estimate: 6 weeks
- Production Reality: 4 months
- Root Cause: Undocumented stored procedures, legacy dependencies, infrastructure complexity
Financial Costs
Strategy | Setup Promise | Reality | AWS Monthly Cost | Weekend Risk |
---|---|---|---|---|
Blue-Green | 2-4 hours | 2-3 weeks to 2 months | $5,000-15,000 (double infrastructure) | Low (budget impact) |
Dual-Write | Application-level | Silent write failures, data divergence | $2,000-5,000 + therapy | High (3am debugging) |
AWS DMS | Seamless | 47 undocumented edge cases | $1,500-8,000/month | High (slow support) |
Oracle GoldenGate | Real-time replication | Stops on unparseable transactions | $50,000/year minimum | Career-ending |
Rolling Updates | Minor schema changes | 3-hour table locks on 500GB tables | $0 (3 hours downtime) | Guaranteed user impact |
Hidden Costs
- Double Infrastructure: Blue-green requires 2x storage and compute - a doubled AWS bill for up to 6 months
- Professional Services: "Free" migration tools require $50,000 in consulting
- Network Configuration: 2 weeks fixing VPC/subnet/security group issues from 2018 consultant setup
Critical Failure Modes
Backward Compatibility Failures
- Severity: Service-breaking
- Frequency: Every schema change
- Example: Adding JSON column breaks legacy JDBC drivers (3-hour login outage)
- Impact: Legacy APIs, microservices crashes, silent data type conversion errors
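One defensive pattern for schema changes like the JSON column above is expand-and-contract: add the new column as nullable with no default so old code keeps working, backfill separately in batches, and only enforce constraints after every consumer has been upgraded. A minimal sketch, assuming a hypothetical users table with an id column (this only helps if legacy code avoids SELECT * and never sees the new column until it can parse it):
-- Expand: a nullable column with no default means no table rewrite, and old writers keep working
ALTER TABLE users ADD COLUMN preferences jsonb;
-- Backfill in small batches to avoid long-running locks (repeat for each id range)
UPDATE users SET preferences = '{}'::jsonb
WHERE preferences IS NULL AND id BETWEEN 1 AND 10000;
-- Contract: enforce the constraint only after the backfill completes and every consumer is upgraded
ALTER TABLE users ALTER COLUMN preferences SET NOT NULL;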
Replication Lag Disasters
- AWS RDS Documentation Claim: "Minutes behind"
- Production Reality: 15+ minutes during peak traffic
- Consequence: 15-minute stale data window during "instant" cutover
- Mitigation: None - inherent to replication architecture
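There is no fix, but you can measure the lag honestly before declaring cutover safe. A minimal check for PostgreSQL streaming replication (column names per pg_stat_replication on PostgreSQL 10+):
-- On the primary: bytes of WAL each replica still has to replay
SELECT application_name, state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
-- On a replica: wall-clock staleness of the data being served
SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;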
Tool-Specific Failures
AWS DMS
- LOB Data Performance: Catastrophic degradation on objects larger than a few MB due to memory allocation issues
- Edge Cases: 47 undocumented scenarios that break migration
- Data Loss: Silent dropping of large binary objects
Oracle GoldenGate
- Composite Primary Keys: Cannot handle custom implementations
- Transaction Parsing: Stops permanently on complex transactions
- Support Cost: Enterprise rates for consultants who understand configuration
PostgreSQL Logical Replication
- Replication Slot Disk Usage: Fills disk space, crashes production
- Custom Extensions: Don't replicate between versions
- Initial Sync: 72-hour table locks on large datasets
- Custom Types: Enum types don't transfer
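A pre-flight inventory catches two of these before they bite: custom enum types and installed extensions both have to be created on the target by hand before logical replication will apply rows cleanly. A sketch against the PostgreSQL catalogs:
-- Enum types that must be recreated manually on the target
SELECT n.nspname AS schema, t.typname AS enum_type
FROM pg_type t
JOIN pg_namespace n ON n.oid = t.typnamespace
WHERE t.typtype = 'e';
-- Extensions (and versions) the target needs before the initial sync
SELECT extname, extversion FROM pg_extension;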
Production Configuration
PostgreSQL Settings That Actually Work
# postgresql.conf -- memory allocation (critical for performance)
# Example values below assume a 16 GB instance; scale the percentages to your hardware
shared_buffers = '4GB'              # ~25% of total RAM
effective_cache_size = '12GB'       # ~75% of total RAM
work_mem = '16MB'                   # the 4MB default is usually too small
max_connections = 100               # keep this modest and front it with pgbouncer
-- Replication monitoring
SELECT slot_name, database, active, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots;
AWS RDS Hidden Costs
- Data transfer between AZs: $0.01/GB
- Cross-region replication: $0.02/GB
- Enhanced monitoring: $15/instance/month
- Performance Insights: $0.009/vCPU-hour
- Multi-region backups: $0.095/GB/month
- Real Cost Formula: AWS calculator estimate × 2.5
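- Worked Example (illustrative): a $2,000/month calculator estimate lands near $5,000/month once cross-AZ transfer, Enhanced Monitoring, Performance Insights, and backup storage are added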
Connection Pooling Requirements
- pgbouncer: Essential for PostgreSQL migrations
- Default Misconfiguration: 100 connections planned, 25 actual (pgbouncer default)
- SSL Overhead: 15% CPU increase not factored in planning
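A minimal pgbouncer.ini sketch that sets the pool size explicitly instead of trusting defaults (host names and sizes are illustrative, not a recommendation for your workload):
[databases]
; illustrative target; repoint this at the new database during cutover
appdb = host=new-db.internal port=5432 dbname=appdb
[pgbouncer]
listen_port = 6432
; transaction pooling suits most web workloads but breaks session-level features
pool_mode = transaction
; set the pool size explicitly -- the out-of-the-box default is much lower than most capacity plans assume
default_pool_size = 100
max_client_conn = 1000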
Migration Timeline Reality
Week 1-2: Discovery Phase
Find Hidden Databases: The "simple" app turns out to have 7+ unknown databases, including a microservice PostgreSQL instance running on a developer's laptop
Schema Archaeology Issues (see the catalog queries after this list):
- Tables without foreign keys "for performance"
- 47 stored procedures written by "TempContractor2019"
- Columns named 'data' containing JSON, XML, and base64-encoded Excel files
- Triggers that email the CEO on every user table update
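The stored procedures and surprise triggers above can at least be inventoried up front. A sketch against the PostgreSQL catalogs:
-- User-defined functions and procedures (the TempContractor2019 specials)
SELECT n.nspname AS schema, p.proname AS name
FROM pg_proc p
JOIN pg_namespace n ON n.oid = p.pronamespace
WHERE n.nspname NOT IN ('pg_catalog', 'information_schema');
-- Non-internal triggers (the "email the CEO on every user update" kind)
SELECT tgname, tgrelid::regclass AS table_name
FROM pg_trigger
WHERE NOT tgisinternal;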
Backup Validation: Last successful restore was 3 years ago; the backup script had been silently failing for 8 months (disk full)
Week 3-6: Environment Setup
Real Provisioning Costs: $500/month calculator estimate becomes $2,300/month with Multi-AZ, Performance Insights, Enhanced monitoring, compliance backups
Network Issues: 2 weeks fixing 2018 consultant VPC setup - wrong subnets, wrong NAT gateway AZ, blocking security groups
Week 7-10: Application Changes
Feature Flag Costs: LaunchDarkly $20/month per developer (30 developers = $600/month + $200 professional services)
Database Abstraction Layer: "Simple" layer becomes 3,000-line complexity monster with 47 configuration parameters
Week 11-16: Migration Execution
Canary Deployment Results: Routing 1% of traffic causes:
- 300% response time increase
- Multiple 500 errors
- A red monitoring dashboard
- The CTO asking questions
Performance Regression Root Causes:
- Query planner cost estimation changes in new PostgreSQL version
- Connection pooling: configured 100, actual 25 (pgbouncer default)
- SSL overhead: 15% CPU increase
- Storage type differences: GP2 vs GP3 IOPS burst behavior
- Read replica instance type "optimization"
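To find which queries actually regressed rather than arguing from dashboards, pg_stat_statements gives per-query timings, assuming the extension is installed and preloaded on the new instance. A sketch using the column names from PostgreSQL 13+:
-- Top queries by mean execution time after cutover
SELECT left(query, 60) AS query, calls,
       round(mean_exec_time::numeric, 2) AS mean_ms,
       round(total_exec_time::numeric, 0) AS total_ms
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;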
Data Consistency Failure Scenarios
Dual-Write System Issues
- Duplicate users (same email, different IDs)
- Orders pointing to non-existent users
- Financial records imbalance
- Contradictory audit logs
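Each of these failure modes can be caught with a blunt SQL sweep; a sketch assuming hypothetical users and orders tables:
-- Duplicate users: same email, different IDs
SELECT email, COUNT(*) AS copies
FROM users GROUP BY email HAVING COUNT(*) > 1;
-- Orphaned orders pointing at users that don't exist
SELECT o.id AS orphan_order
FROM orders o LEFT JOIN users u ON u.id = o.user_id
WHERE u.id IS NULL;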
3AM Crisis Pattern
- Primary database replica lag: 6 minutes
- Customer support angry calls (duplicate orders)
- Database locks during routine statistics update
- 6 hours debugging while CEO sends passive-aggressive revenue impact emails
Emergency Procedures
Replication Failure Response
-- Check how much WAL each replication slot is pinning on disk
SELECT slot_name, database, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
Options When a Slot Fills the Disk (drop-slot sketch follows this list):
- Increase disk space (immediate cost)
- Drop the replication slot and restart replication (hours of delay)
- Manually sync data while you fix it (nightmare mode)
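If you choose option two, the slot has to be dropped by name; a sketch (the slot name is illustrative):
-- Drop the stalled slot so WAL can be recycled
-- (fails while the slot is still active; the subscriber must be re-initialized afterwards)
SELECT pg_drop_replication_slot('target_migration_slot');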
Data Integrity Validation
-- Find LOB columns (AWS DMS problem areas), excluding system catalogs
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE data_type IN ('text', 'bytea', 'json', 'jsonb')
  AND table_schema NOT IN ('pg_catalog', 'information_schema');
-- Row count comparison: only works as one query if the target is attached via postgres_fdw/dblink;
-- otherwise run the count on each cluster separately and diff the numbers
SELECT 'source' AS db, COUNT(*) FROM source_table
UNION ALL
SELECT 'target' AS db, COUNT(*) FROM target_table;
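Row counts can match while the rows themselves differ. For smaller tables, a per-table checksum run on both sides catches that; a sketch assuming a users table with an id column (expensive on large tables, so sample or chunk, and beware that timestamp/float text formatting differences between versions can cause false mismatches):
-- Run on source and target, then compare the two hashes
SELECT md5(string_agg(u::text, '|' ORDER BY u.id)) AS table_fingerprint
FROM users u;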
Performance Regression Diagnosis
-- Check critical PostgreSQL settings
SHOW shared_buffers; -- Should be 25% of RAM
SHOW effective_cache_size; -- Should be 75% of RAM
SHOW work_mem; -- Usually too small
SHOW max_connections; -- Usually too high
-- Query plan analysis: run EXPLAIN against the actual regressed query
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM problem_table;
Success Patterns
Stripe's Boring Success Model
- Timeline: 18 months careful planning
- Method: Feature flags for gradual traffic shifting (reads first, then writes)
- Synchronization: Weeks of database sync before final cutover
- Result: No war room, no 3am calls, no exciting ceremonies
Critical Success Factors
- Extensive Monitoring: Essential for debugging performance regressions
- Gradual Traffic Shifting: Avoid "quick and easy" approaches
- Extended Sync Period: Allow weeks for database synchronization
- Rollback Capabilities: Feature flags provide instant rollback
Tool-Specific Operational Intelligence
AWS DMS
Works Well For: Standard relational data without LOBs
Fails On: LOB data larger than a few MB, plus 47 undocumented edge cases
Performance: Memory allocation issues cause catastrophic slowdown
Monitoring: Buried troubleshooting guide mentions LOB dropping bug (page 47)
Oracle GoldenGate
Works Well For: Standard Oracle-to-Oracle replication
Fails On: Custom composite primary keys, complex transactions
Support Model: $50,000/year minimum + expensive consultants
Failure Mode: Silent stops requiring deep log analysis
PostgreSQL Logical Replication
Works Well For: Simple schemas without custom types
Fails On: Custom extensions, enum types, large initial syncs
Disk Management: Replication slots retain WAL without bound when a subscriber stalls, eventually filling the disk
Version Compatibility: Query planner changes affect performance
Blue-Green Deployment
Works Well For: Applications with good health checks
Cost Impact: 2x infrastructure for migration duration
AWS Implementation: Actually reliable but expensive
Timeline: Requires 2-3 weeks minimum setup
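For RDS specifically, the blue-green lifecycle is driven from the CLI or console. A hedged sketch - the ARN and identifiers are placeholders, and flag names should be verified against your CLI version with `aws rds create-blue-green-deployment help`:
# Create the green environment from the existing (blue) instance
aws rds create-blue-green-deployment \
  --blue-green-deployment-name orders-db-migration \
  --source arn:aws:rds:us-east-1:123456789012:db:orders-db \
  --target-engine-version 15.4
# Later, after replication has caught up, switch over
aws rds switchover-blue-green-deployment \
  --blue-green-deployment-identifier <id-from-create-output> \
  --switchover-timeout 300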
Disaster Recovery Patterns
Common Post-Migration Failures
-- Sequence synchronization check (last_value should match MAX(id) on the table it feeds)
SELECT last_value FROM user_id_seq;
-- If out of sync: SELECT setval('user_id_seq', (SELECT MAX(id) FROM users));
-- Foreign key constraints that never finished validating
SELECT conname, conrelid::regclass FROM pg_constraint WHERE contype = 'f' AND NOT convalidated;
-- Tables with no indexes at all (excluding system schemas)
SELECT schemaname, tablename FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
  AND NOT EXISTS (
    SELECT 1 FROM pg_indexes
    WHERE pg_indexes.schemaname = pg_tables.schemaname
      AND pg_indexes.tablename = pg_tables.tablename);
Oracle GoldenGate Failure Diagnosis
# Check GoldenGate logs
tail -f ggserr.log
# Common error codes:
# OGG-00868: transaction too large
# OGG-01028: checkpoint issue
Economic Impact Analysis
Knight Capital Reference Case
- Disaster Type: Deployment error (not migration but similar risk profile)
- Financial Impact: $440 million loss
- Timeline: Single deployment window
- Lesson: Why rollback plans are essential
GitLab Database Incident
- Disaster Type: Routine migration data destruction
- Data Loss: 6 hours of production data
- Root Cause: Backup validation failure
- Recovery: 18-hour service outage
Stakeholder Communication
Management Explanation Template
Current Status: "Extended maintenance due to unexpected data integrity validation requirements"
Timeline: "ETA: when it's actually done"
Risk Context: Reference GitHub 24-hour outage, GitLab data loss examples
Cost Justification: Compare to Knight Capital $440M loss
Technical Team Communication
- Slack Channel Management: Prepare for alert explosion during canary deployment
- Escalation Path: Define when to abandon gradual migration for maintenance window
- Rollback Criteria: Clear decision points for reverting changes
Resource Quality Assessment
High-Value Resources
- GitHub Database Outage Post-Mortem: Real split-brain scenario analysis
- Stripe Document Database Migration: Actual success story with timeline
- PostgreSQL Discord #help Channel: Real-time 3am debugging support
- Jepsen Database Reports: Consistency testing reality checks
Marketing vs Reality
- AWS DMS Documentation: Useful until edge case #47
- Oracle GoldenGate Docs: 400 pages of troubleshooting (complexity indicator)
- Vendor Success Stories: Filter for actual timelines and costs
Tool Recommendations
- pgbadger: PostgreSQL log analyzer for performance debugging
- LaunchDarkly: $20/month per developer for rollback capability
- AWS Performance Insights: $15/month per instance for regression diagnosis
- DataDog Database Monitoring: Expensive but essential for critical systems
Career Impact Assessment
Survival Rate: Database administration has one of the highest burnout rates in tech
Experience Value: Surviving production migration worth significant career advancement
Alternative Careers: Goat farming, underwater basket weaving (no on-call)
Professional Development: 3am debugging experience creates migration expertise
Conclusion
Zero downtime database migration requires a 4-month timeline, a 2.5x budget multiplier, and an expectation of significant technical debt. Success depends on boring, gradual approaches with extensive monitoring and rollback capabilities. Vendor promises are marketing; plan for the reality of edge cases, performance regressions, and data consistency challenges.
Useful Links for Further Investigation
Actually Useful Database Migration Resources (Not Vendor Marketing)
Link | Description |
---|---|
GitHub's Database Outage Post-Mortem | How GitHub's "zero downtime" migration took down the entire service for 24 hours. MySQL replication can fail catastrophically in ways you never imagined. |
GitLab Database Migration Disaster | The horrifying story of how a routine migration destroyed 6 hours of production data. Required reading for anyone who thinks "backups are just insurance." |
Stripe's Document Database Migration | The rare success story. 18 months of careful planning, feature flags, and gradual traffic shifting. Shows what "boring" successful migrations look like. |
Knight Capital's $440M Bug | What happens when a deployment goes wrong at financial scale. Not directly migration-related but shows why you need rollback plans. |
AWS DMS Documentation | Useful until you hit edge case #47 with LOB data. The troubleshooting guide on page 47 mentions the LOB data dropping bug that will ruin your day. |
PostgreSQL Logical Replication | Official docs that make it sound easy. Doesn't mention that replication slots will fill your disk and crash production. |
AWS RDS Blue-Green Deployments | Actually works well, but costs 2x your infrastructure budget during migration. |
Oracle GoldenGate Documentation | 400 pages of troubleshooting documentation. That should tell you something about complexity. |
pgbouncer | Connection pooler that might save your ass during PostgreSQL migrations. Configuration is black magic but essential for performance. |
LaunchDarkly Feature Flags | Costs $20/month per developer but gives you instant rollback capabilities. Worth every penny when things go sideways. |
Liquibase | Schema migration tool that works great for simple schemas. Chokes on anything complex but beats manual SQL scripts. |
PgRoll | Open-source PostgreSQL migration tool using shadow columns. New but promising approach to the problem. |
Database Administrators Stack Exchange | Community discussions about database migration challenges and solutions. Real DBAs sharing their experiences with migration strategies and troubleshooting. |
PostgreSQL Discord Community | Real-time help from people who've debugged replication at 3am. Join the #help channel. |
Hacker News Database Migration Discussions | Search for "database migration" on HN to find ongoing discussions, war stories, and engineer experiences. Comment threads often contain more valuable insights than the articles. |
pgbadger | PostgreSQL log analyzer that will show you why your queries are slow on the new database. |
AWS Performance Insights | Costs $15/month per instance but essential for debugging performance regressions. |
DataDog Database Monitoring | Expensive but shows you what's actually happening during migrations. Worth it for critical systems. |
Jepsen | Distributed systems testing. Their database consistency reports will terrify you but show what can actually go wrong. |
Designing Data-Intensive Applications | Chapter 5 covers replication patterns and why they all suck in different ways. |
Database Reliability Engineering | How to not get fired when databases break. Migration chapters are gold. |
Site Reliability Engineering | Google's free book. Chapter 26 covers data processing pipelines and migration patterns at scale. |