Database Migration Tools That Don't Want to Kill You

I've spent the last couple years migrating everything from 50GB MySQL shitshows to 2TB PostgreSQL clusters that developers swore would "never get that big." Most tools are designed by people who think production environments are just staging with more RAM. But I found a few that won't make you hate your career choices.

pgroll - Finally, PostgreSQL Migrations That Don't Lock Everything

pgroll is the first PostgreSQL migration tool that doesn't make me want to throw my laptop out the window. Released by Xata in 2023, it uses shadow columns and versioned views to avoid table locks entirely.

Here's the difference:

-- Old way (pray your app can handle downtime)
ALTER TABLE users ADD COLUMN email VARCHAR(255);
-- Table locked, users can't login, phone starts ringing

-- pgroll way  
pgroll start add_email_column.json
-- Table never locked, old and new schema work simultaneously

I migrated a 40GB users table during peak traffic and nobody noticed. The GitHub repo has decent documentation and actual working examples, which is rare.

Real gotchas I found:

  • Shadow columns eat extra disk space - maybe 20% more during migration
  • Connection pool exhaustion will kill you if max_connections is too low
  • Version 0.6.1 had a memory leak, stick with 0.6.0 or 0.7.0+
  • Foreign key constraints get messy - test thoroughly first
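
Before kicking one off, check how much headroom you actually have. A quick pre-flight sketch in psql (names and thresholds are yours to adjust):

-- How big is the database you're about to grow by ~20%?
SELECT pg_size_pretty(pg_database_size(current_database())) AS db_size;

-- How close are you to the connection ceiling before dual schema versions pile on?
SHOW max_connections;
SELECT count(*) AS connections_in_use FROM pg_stat_activity;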

AWS DMS Still Exists (And Still Sucks for CDC)

AWS Database Migration Service works for simple one-shot migrations, but using it for ongoing replication is like using a chainsaw for brain surgery. I've burned three weekends straight debugging why DMS randomly decided to lag 4 hours behind during Black Friday traffic.

The r/aws community agrees - DMS is fine for "lift and shift" but terrible for Change Data Capture. Connection timeouts randomly kill long-running replications, and the error messages are useless.

When DMS doesn't suck:

  • Simple MySQL → PostgreSQL one-time migrations
  • Small datasets (under 100GB)
  • You have enterprise support for when shit breaks at 3am
  • Non-critical data where some lag is acceptable

When DMS will ruin your weekend:

  • Real-time CDC requirements
  • Complex schema transformations
  • Mixed workload patterns (heavy writes + reads)
  • Any mission-critical data pipeline

Debezium 3.0 - CDC That Actually Works

Debezium 3.0 released in October 2024 and fixed most of the stuff that made me hate CDC. The autonomous error recovery alone saved my ass multiple times during Black Friday traffic.

Key improvements from someone who's used it in production:

  • Smart batching reduced our replication lag from ~30 seconds to under 5 seconds
  • Auto-recovery handles about 80% of the stupid connection errors that used to wake me up
  • Schema evolution doesn't break when someone adds a column anymore
  • Cross-cloud sync between AWS RDS and GCP CloudSQL actually works now

Production reality check: CDC lag turns into complete shit during peak traffic. During our Black Friday sale, lag jumped from 200ms to 3+ minutes and stayed there until I figured out the batch sizes were configured by someone who'd never seen real traffic. The Confluent CDC guide has the settings that actually work - not the defaults that look good in demos.
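
For what it's worth, the batch knobs live in the connector config itself. Here's a sketch of the standard Debezium connector properties we ended up touching - the values are illustrative, not recommendations, so load-test your own numbers (they get merged into the "config" block when you register the connector, shown later in this guide):

{
  "max.batch.size": "8192",
  "max.queue.size": "32768",
  "max.queue.size.in.bytes": "1073741824",
  "poll.interval.ms": "100"
}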

Container-Native Migrations (Finally Useful)

Liquibase in Kubernetes and Atlas Kubernetes Operator graduated from "toy demo" to "actually useful in production" status.

The big win: database schema changes get versioned with application code. No more "did someone run the migration?" confusion during deployments.

## This actually works in production now
apiVersion: liquibase.io/v1beta1
kind: LiquibaseOperator  
metadata:
  name: user-schema-migration
spec:
  changeLog: migrations/add-email-column.xml
  database:
    url: postgresql://prod-db:5432/myapp

Kubernetes migration reality check:

  • Works great in staging until you discover your production DB has 47 tables that staging doesn't
  • RBAC permissions will eat your entire weekend - whoever designed Kubernetes security hates humanity
  • Failed migrations leave your cluster in this zombie state where half your pods think they're migrated
  • Rollbacks technically work, but you'll be faster just fixing the shit manually at 2am

The Atlas Kubernetes docs are actually readable, which puts them ahead of most database tools.
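
For reference, this is roughly what declarative schema-as-code looks like with the Atlas operator - a sketch based on their docs, where the secret name and the SQL are placeholders for your own setup:

apiVersion: db.atlasgo.io/v1alpha1
kind: AtlasSchema
metadata:
  name: users-schema
spec:
  urlFrom:
    secretKeyRef:
      name: postgres-credentials
      key: url
  schema:
    sql: |
      CREATE TABLE users (
        id BIGINT PRIMARY KEY,
        name TEXT NOT NULL,
        email VARCHAR(255)
      );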

Database Migration Tools - What Actually Works

| Tool | What It's Good For | Reality Check | Pain Points | Cost |
|------|--------------------|---------------|-------------|------|
| pgroll | PostgreSQL schema changes | Actually works as advertised | PostgreSQL only, shadow columns eat disk space like a hungry teenager | Free |
| AWS DMS | Simple one-time migrations | Works for basic stuff unless it's Tuesday | Error messages written by sadists, dies randomly during large migrations | ~$200-1000/month (more when it breaks) |
| Google DMS | Cross-cloud migrations | Better than AWS DMS (low bar) | Still expensive, supports like 6 databases total | Depends on usage (prepare to be surprised) |
| Debezium | Real-time CDC | Solid for event streaming | Setup so complex you need a Kafka PhD, will consume your CPU for breakfast | Free + infrastructure costs that add up fast |
| Atlas | Schema-as-code | Good Kubernetes integration | Learning curve steeper than K2, nobody on your team knows it | Free tier available (paid tier inevitable) |
| Liquibase | CI/CD schema changes | Enterprise-friendly | XML configs designed by people who hate developers | Free + paid tiers (good luck with free limits) |

How to Actually Do Zero Downtime Database Migrations

After fucking up my first few migration attempts (including the memorable incident where I accidentally dropped the production users table at 2pm on a Tuesday), here's what actually works. Skip the theoretical bullshit - this is what you copy-paste when your boss needs the migration done yesterday and you can't afford another career-ending fuckup.

Pre-Migration: Don't Wing It (I Learned This the Hard Way)

Before touching production, you need to know exactly what clusterfuck you're dealing with. I spent 8 hours debugging a migration that failed because there was a trigger from 2017 that nobody documented and the original developer had long since quit. Don't be me.

1. Actually Check What You're Migrating

-- Check table sizes (this will save your ass)
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(quote_ident(schemaname) || '.' || quote_ident(tablename))) AS total_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(quote_ident(schemaname) || '.' || quote_ident(tablename)) DESC;

-- Find all the triggers that will break your migration
SELECT trigger_name, event_object_table, action_timing, event_manipulation
FROM information_schema.triggers
WHERE trigger_schema = 'public';

-- Check for foreign key constraints (pain in the ass)
SELECT conname, conrelid::regclass, confrelid::regclass
FROM pg_constraint
WHERE contype = 'f';

Skip the AI assessment nonsense - those tools will tell you everything's fine right before your migration destroys production. Manual checking sucks but at least you understand what's actually happening.

2. Data Quality Check (Actually Important)

-- Quick dirty check for null hell (example: the users table; swap in your own columns)
SELECT COUNT(*)                     AS total_rows,
       COUNT(email)                 AS non_null_emails,
       COUNT(*) - COUNT(email)      AS null_emails,
       COUNT(*) - COUNT(created_at) AS null_created_at
FROM users;

Great Expectations is solid for this if you have 2 days to fight with their configuration YAML. Apache Griffin is complete overkill unless you're Netflix and have a team of data engineers who get off on complexity.

Migration Execution: What Actually Works

Option 1: pgroll (If You're Using PostgreSQL)

pgroll is the only tool that doesn't make me want to quit programming:

## Install (obvious but whatever)
brew install xataio/pgroll/pgroll

## Initialize - this sets up the shadow column bullshit  
pgroll init --postgres-url postgresql://prod-db:5432/myapp

## Create your migration - JSON because everything is JSON now
cat > add_email_column.json << 'EOF'
{
  "name": "add_email_column",
  "operations": [
    {
      "add_column": {
        "table": "users",
        "column": {
          "name": "email",
          "type": "varchar(255)",
          "nullable": true
        }
      }
    }
  ]
}
EOF

## Start migration - this is where the magic happens
pgroll start add_email_column.json --postgres-url postgresql://prod-db:5432/myapp

## Your app can now use either schema version
## Old code works: SELECT id, name FROM users;
## New code works: SELECT id, name, email FROM users;

## Complete when you're confident it's not broken
pgroll complete --postgres-url postgresql://prod-db:5432/myapp

Why this doesn't suck: Shadow columns and triggers handle the sync automatically. No table locks, no downtime, no prayer circles.
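
The version selection happens through the search_path - pgroll publishes each schema version as its own set of views, named after the migration. A sketch (the exact schema names depend on your migration history):

-- New app code that knows about the email column
SET search_path TO 'public_add_email_column';

-- Old app code keeps talking to the previous version's views until you complete the migration
SET search_path TO 'public_your_previous_migration';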

Option 2: AWS DMS (When You Have No Choice)

Sometimes you're stuck with AWS DMS because that's what management bought. It works for simple one-time migrations:

## Create replication instance (this takes forever)
aws dms create-replication-instance \
    --replication-instance-identifier my-migration \
    --replication-instance-class dms.t3.medium \
    --allocated-storage 100

## Create source endpoint
aws dms create-endpoint \
    --endpoint-identifier source-db \
    --endpoint-type source \
    --engine-name postgres \
    --server-name old-prod.company.com \
    --port 5432 \
    --database-name myapp \
    --username dbuser \
    --password 'your-password-here'
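
## The example above skips two steps you'll also need before anything replicates:
## a target endpoint (same command with --endpoint-type target) and the task itself.
## Sketch only - the ARNs are placeholders for whatever the previous commands returned:
aws dms create-replication-task \
    --replication-task-identifier my-migration-task \
    --source-endpoint-arn arn:aws:dms:us-east-1:123456:endpoint/source-db \
    --target-endpoint-arn arn:aws:dms:us-east-1:123456:endpoint/target-db \
    --replication-instance-arn arn:aws:dms:us-east-1:123456:rep/my-migration \
    --migration-type full-load \
    --table-mappings '{"rules":[{"rule-type":"selection","rule-id":"1","rule-name":"1","object-locator":{"schema-name":"public","table-name":"%"},"rule-action":"include"}]}'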

## Start replication (pray it doesn't break)
aws dms start-replication-task \
    --replication-task-arn arn:aws:dms:us-east-1:123456:task/abcd1234 \
    --start-replication-task-type start-replication

DMS gotchas that will ruin your entire week:

  • Connection timeouts kill long-running replications at exactly the worst possible moment (usually 3am Saturday)
  • Large tables (100GB+) cause memory issues on small instances, but AWS support will insist it's your fault
  • Error messages are cryptic garbage like "ERROR: 1020 (HY000)" - might as well roll dice to debug
  • CDC lag spikes during high write volumes, turning your real-time system into a historical archive

Option 3: CDC with Debezium (For Real-time Stuff)

If you need actual real-time sync, Debezium is your best bet:

## Set up a Kafka broker (because everything needs Kafka now - Debezium also needs Kafka Connect running for the next step)
docker run -d --name kafka \
  -p 9092:9092 \
  confluentinc/cp-kafka:latest

## Start Debezium connector (replace YOUR_KAFKA_CONNECT_HOST with your Kafka Connect host)
curl -X POST YOUR_KAFKA_CONNECT_HOST:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "postgres-connector",
    "config": {
      "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
      "database.hostname": "prod-db.company.com",
      "database.port": "5432",
      "database.user": "replication_user",
      "database.password": "password",
      "database.dbname": "myapp",
      "topic.prefix": "prod",
      "table.include.list": "public.users,public.orders",
      "plugin.name": "pgoutput"
    }
  }'

Real production war stories:

  • CDC lag gets nasty during peak traffic - we saw 2+ minute delays that made our "real-time" dashboard a comedy show
  • PostgreSQL replication slots will fill up your disk like a digital black hole if consumers lag behind (see the query after this list)
  • Kafka topics need proper partitioning or you'll bottleneck so hard that single-threaded batch jobs look fast
  • Schema changes during migration break absolutely everything unless you plan for them (which nobody does the first time)
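
That replication-slot bullet deserves proactive monitoring. A quick check for how much WAL a lagging slot is pinning to disk (assumes PostgreSQL 10+):

SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;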

Post-Migration: Make Sure You Didn't Break Everything

Monitoring is crucial because migrations fail in subtle, career-ending ways that won't show up until 3am when the CEO calls you personally because the entire product is broken.

Basic Sanity Checks That Actually Matter

## Count rows in source vs target (obvious but people skip this)
echo "Source rows:"
psql -h old-db -c "SELECT COUNT(*) FROM users;"
echo "Target rows:"
psql -h new-db -c "SELECT COUNT(*) FROM users;"

## Check for null data that shouldn't be null
psql -h new-db -c "
SELECT 'email_nulls' as check_name, COUNT(*) as null_count
FROM users WHERE email IS NULL
UNION ALL
SELECT 'created_at_nulls', COUNT(*)
FROM users WHERE created_at IS NULL;"

## Performance regression check (this will save your ass)
psql -h new-db -c "EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'test@example.com';"

Monitoring That Doesn't Suck

## Prometheus alerts that actually fire when things break
groups:
- name: migration.rules
  rules:
  - alert: DatabaseConnectionsHigh
    expr: pg_stat_database_numbackends > 80
    for: 2m
    annotations:
      summary: \"Too many DB connections - migration might be leaking\"

  - alert: ReplicationLagHigh  
    expr: pg_replication_lag_seconds > 60
    for: 1m
    annotations:
      summary: \"CDC lag is getting bad - check Debezium\"

Set up a Grafana dashboard for PostgreSQL monitoring - the defaults are actually useful.

Troubleshooting: Shit That Will Go Wrong

Here's what breaks and how to fix it, based on actual production failures:

pgroll Gotchas

## Check if pgroll fucked up your triggers
pgroll status --postgres-url postgresql://db:5432/myapp

## Kill a stuck migration (sometimes needed)
pgroll rollback --postgres-url postgresql://db:5432/myapp

## Fix connection pool exhaustion
## In your app config: increase max_connections temporarily
## In PostgreSQL: ALTER SYSTEM SET max_connections = 500; (needs a restart to take effect)

Common pgroll failures:

  • Existing triggers with same names conflict with pgroll triggers
  • Foreign key constraints cause shadow column sync to fail
  • Large JSONB columns slow down trigger performance
  • Connection pools get exhausted with dual schema versions

DMS Nightmare Debugging

## Check DMS task status (will probably show \"Failed\")
aws dms describe-replication-tasks \
  --filters Name=replication-task-id,Values=your-task-id

## Get actual error messages (usually cryptic garbage)
aws logs get-log-events \
  --log-group-name dms-tasks-your-task-id \
  --log-stream-name your-stream-id

## Nuclear option: delete and recreate the task
aws dms delete-replication-task --replication-task-arn your-arn

Debezium CDC Issues

## Check connector status (replace with your Kafka Connect host)
curl YOUR_KAFKA_CONNECT_HOST:8083/connectors/postgres-connector/status

## Restart stuck connector (happens more than you'd like)
curl -X POST YOUR_KAFKA_CONNECT_HOST:8083/connectors/postgres-connector/restart

## Check PostgreSQL replication slots (these fill up disk)
SELECT slot_name, active, restart_lsn 
FROM pg_replication_slots;

Questions Engineers Actually Ask About Database Migration

Q: Does pgroll work with massive databases?

A: I've used it on an 800GB PostgreSQL database and it didn't explode. The shadow columns eat maybe 20% extra disk space during migration, so make sure you have room.

Gotchas for large DBs:

  • Connection pools get exhausted fast with dual schemas running
  • JSONB columns slow everything down if you don't index them right
  • Memory usage spikes with high concurrent transactions
  • Big tables (100GB+) take forever to backfill - plan accordingly

Don't trust vendor claims about "10TB+ support" until you test it yourself. Vendor benchmarks are about as reliable as weather forecasts - technically possible but rarely accurate when you need them to be.

Q: Should I bother with AWS DMS or just use pgroll?

A: If you're doing PostgreSQL → PostgreSQL, use pgroll. If you're doing MySQL → PostgreSQL or need to cross cloud providers, you're stuck with DMS.

DMS works for one-time migrations but turns into a special kind of hell for ongoing replication. I've spent too many weekends debugging mysterious connection timeouts that AWS support can't explain and Stack Overflow pretends don't exist.

Q: What happens when pgroll shits the bed mid-migration?

A: Usually you can recover, but it's not fun. The shadow columns stick around eating disk space until you clean them up.

## Check what the hell happened  
pgroll status --postgres-url postgresql://db:5432/myapp

## Try to rollback (sometimes works)
pgroll rollback --postgres-url postgresql://db:5432/myapp

## Nuclear option - if rollback leaves shadow columns or versioned schemas behind, clean them up manually
## (flags for this vary between pgroll releases - check pgroll rollback --help for your version)

Real failure scenarios that ruined my weekends:

  • Connection died during backfill at 87% complete - had to restart the entire 8-hour migration
  • Disk space ran out because shadow columns filled up everything and the monitoring alerts were misconfigured
  • Foreign key constraint conflicts turned into a 3-day debugging session involving legacy code from 2016
  • Trigger name conflicts where existing triggers blocked pgroll's triggers, and nobody documented what the old triggers actually did

Q: Why does CDC lag get so fucking bad during peak traffic?

A: Because everything is trying to replicate at once and your network/CPU can't handle it. Debezium gets overwhelmed and starts batching everything.

What actually works:

  • Increase batch sizes during peak hours (counterintuitive but true)
  • Use read replicas for CDC source - don't hammer your primary DB
  • Pre-provision more resources than you think you need
  • Set up circuit breakers to pause during traffic spikes (see the sketch after this list)
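
If you go the pause-and-catch-up-later route, Kafka Connect's REST API is the lever - these endpoints are standard, the host and connector name are whatever yours are called:

## Pause the connector, let the spike pass, then resume and let it catch up
curl -X PUT YOUR_KAFKA_CONNECT_HOST:8083/connectors/postgres-connector/pause
curl -X PUT YOUR_KAFKA_CONNECT_HOST:8083/connectors/postgres-connector/resume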

Our lag jumped to 5+ minutes during a flash sale, making our "real-time" inventory system about as current as yesterday's newspaper, until I figured out the batch configurations were set by someone who'd clearly never seen actual traffic.

Q: Can I migrate to Kubernetes databases without losing my sanity?

A: CockroachDB and YugabyteDB work in Kubernetes, but the migration process is still a pain in the ass.

Approaches that actually work:

  1. Migrate to managed cloud DB first, then to K8s later
  2. Use CDC to gradually shift traffic (safer but more complex)
  3. Do a blue-green deployment with feature flags

## Simple approach - dump and restore
pg_dump postgresql://old-db:5432/myapp | \
  psql postgresql://cockroachdb:26257/myapp

## CDC approach with Debezium (more complex but safer)
curl -X POST YOUR_KAFKA_CONNECT_HOST:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "postgres-to-cockroach",
    "config": {
      "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
      "tasks.max": "1",
      "database.hostname": "old-postgres",
      "database.port": "5432",
      "database.user": "replicator",
      "database.password": "password",
      "database.server.name": "migration"
    }
  }'

Reality check: Kubernetes databases are great for new projects where you can build everything from scratch. Migrating existing production systems to K8s databases is expensive, risky, and will probably take 3x longer than you think. Make sure you have a really good reason beyond "Kubernetes is cool."

Q: What's the dumbest mistake people make with database migrations?

A: Not testing the rollback procedure. I've seen teams spend weeks planning the migration forward but have no idea how to undo it when shit hits the fan.

Other ways to destroy your career:

  • Trusting "zero downtime" tools without actually monitoring downtime (spoiler: there's always some)
  • Not checking application code compatibility with new schema versions - enjoy debugging why the app crashes after "successful" migration
  • Running migrations during peak traffic hours because the tool says "zero downtime" (peak traffic finds every edge case)
  • Skipping connection pool tuning and wondering why everything crawls like dial-up internet
  • Not having a rollback plan beyond "restore from backup and pray"

Pro tip: Always migrate your staging environment first and let it run for a few days. You'll find weird edge cases that only show up under real load.

Q: How much does this actually cost?

A: For a typical 500GB PostgreSQL database migration:

pgroll approach:

  • Tool cost: $0 (open source)
  • Cloud compute: maybe $500-2000 depending on how long it takes
  • Engineering time: 1-2 weeks for experienced team
  • Downtime cost: $0 if it works, $lots if it doesn't

AWS DMS approach:

  • DMS instance costs: $200-1000/month while running
  • Data transfer costs: can get expensive with cross-region
  • Engineering time: 2-4 weeks because DMS is finicky
  • Downtime: usually some, despite vendor claims

Skip the enterprise consulting unless your database is truly massive or you have weird compliance requirements.

Q: Should I switch from my current migration tools?

A: If your current process works and doesn't cause outages, stick with it. Don't fix what ain't broken.

Consider switching if:

  • You're doing migrations more than once a quarter
  • Current tools require significant downtime windows
  • You're spending more than a week per migration
  • Your existing tools don't work with your target infrastructure

Don't switch if:

  • Your team is already overloaded with other projects
  • You have strict compliance requirements for pre-approved tools
  • The current process works fine and doesn't cause business impact

Change is risky. Make sure the benefits are worth the learning curve and potential issues.
