
Phase 1: Why Your Database Will Hate You (And How to Make It Stop)

The Logical Replication Reality Check

[Image: PostgreSQL logical replication architecture]

Look, I learned this the hard way: logical replication looks easy on paper, but PostgreSQL will find creative ways to make you suffer. The first time I tried this, I thought "how hard can it be?" Famous last words. Took down production for 4 hours.

The thing is, logical replication actually works great when everything goes right. When it doesn't... well, I've had 4-hour debugging sessions that started with "this should only take 3 seconds." You'll want to understand WAL architecture before you start troubleshooting why everything's fucked. The PostgreSQL streaming replication docs also helped me understand what happens when things break.

Your Infrastructure is Probably Wrong

The Config Hell You're About to Enter

[Image: PostgreSQL system architecture]

First things first - check if your PostgreSQL 16 source even has logical replication enabled. Spoiler alert: it probably doesn't.

-- Check if you're screwed
SHOW wal_level;
-- If this says 'replica' instead of 'logical', grab some coffee

-- These better not be zeros
SHOW max_replication_slots;
SHOW max_wal_senders;
SHOW max_logical_replication_workers;

If wal_level isn't logical, you're about to restart PostgreSQL. Yes, that means downtime. I forgot this step once and had to explain to the CEO why our "zero-downtime" upgrade started with 20 minutes of downtime. Fun times. The PostgreSQL configuration guide explains all these settings in excruciating detail.

## postgresql.conf - the settings that will save your ass
wal_level = logical
max_replication_slots = 10
max_wal_senders = 10
max_logical_replication_workers = 10
max_worker_processes = 20

Heads up: Changing wal_level means restarting PostgreSQL. Plan accordingly or you'll be explaining to management why the site went down.
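
If you'd rather not hand-edit postgresql.conf, ALTER SYSTEM gets you the same result - a minimal sketch, assuming a superuser connection and a systemd-managed install (adjust the host and service name for your setup):

## Stage the settings (written to postgresql.auto.conf)
psql -h source-host -U postgres -c "ALTER SYSTEM SET wal_level = 'logical';"
psql -h source-host -U postgres -c "ALTER SYSTEM SET max_replication_slots = 10;"
psql -h source-host -U postgres -c "ALTER SYSTEM SET max_wal_senders = 10;"

## wal_level only takes effect after a restart - this is the downtime window
## (run this on the source box itself)
sudo systemctl restart postgresql

## Confirm it stuck
psql -h source-host -U postgres -c "SHOW wal_level;"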

Hardware Reality Check (You Need More Than You Think)

Here's what I learned when our migration brought down the entire cluster because I was cheap with resources:

  • CPU: Double what you normally use. Logical replication chews through CPU like crazy during initial sync. The PostgreSQL performance tuning guide has good advice on CPU planning.
  • Memory: Same deal - double it. WAL processing will eat all your RAM if you let it. Check the shared_buffers documentation for memory planning.
  • Storage: At least 150% of your source database - see the quick disk check after this list. Learned this when we ran out of disk space mid-migration at 2am - logs filled up to like 800GB or something crazy. The PostgreSQL storage management docs explain why this happens.
  • Network: Keep it under 5ms latency or you'll hate your life. Replication lag becomes unbearable otherwise. Network tuning for PostgreSQL covers the connection settings.
  • Disk I/O: Get the fastest storage you can afford. Trust me on this one. PostgreSQL I/O optimization explains the WAL settings that matter.
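
To put numbers on the storage bullet, a quick back-of-the-envelope check - a sketch assuming the source database is named production and the target keeps its data under /var/lib/postgresql:

## How big is the thing you're about to copy?
psql -h source-host -U postgres -d production -t -A \
    -c "SELECT pg_size_pretty(pg_database_size('production'));"

## On the target host: do you actually have ~150% of that, plus WAL headroom?
df -h /var/lib/postgresql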

Setting Up Your PostgreSQL 17 Target (The Fun Part)

Spin up a new PostgreSQL 17 server. Make it beefier than your source - you're not being wasteful, you're being smart. I went with identical specs once and watched the migration crawl along like a dying snail. On my 2019 MacBook this migration took fucking forever.

## If you're on AWS (and who isn't these days) - this works on Ubuntu 22.04, your mileage may vary on CentOS
aws rds create-db-instance \
    --db-instance-identifier postgres17-target \
    --db-instance-class db.r5.xlarge \
    --engine postgres \
    --engine-version 17.0 \
    --allocated-storage 500 \
    --master-username postgres \
    --manage-master-user-password \
    --vpc-security-group-ids sg-xxxxxxxxx \
    --db-subnet-group-name my-subnet-group \
    --backup-retention-period 7
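
Assuming the create call goes through, something like this waits for the instance to come up and grabs the endpoint you'll use as target-host later:

## Block until the instance is ready, then print its endpoint
aws rds wait db-instance-available --db-instance-identifier postgres17-target

aws rds describe-db-instances \
    --db-instance-identifier postgres17-target \
    --query 'DBInstances[0].Endpoint.Address' \
    --output text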

Schema Setup (The Part Where Things Break)

Getting Your Schema Over to the Target

Time to copy your schema. This is usually straightforward, but PostgreSQL loves to surprise you with permission errors and weird edge cases. Oh, and don't get me started on PostgreSQL's error messages - they're about as helpful as a chocolate teapot. The pg_dump documentation covers all the flags that might save you hours of debugging.

## Dump your schema (pray it works first try)
pg_dump --host=source-host \
        --username=postgres \
        --schema-only \
        --no-privileges \
        --no-owner \
        source_database > schema.sql

## Apply to target (fingers crossed)
psql --host=target-host \
     --username=postgres \
     --dbname=target_database \
     --file=schema.sql

The Primary Key Problem That Will Ruin Your Day

Here's the thing nobody tells you: logical replication freaks out if your tables don't have primary keys. It just sits there, silently hating you. The logical replication restrictions explain this limitation in detail.

-- Find the tables that will break your replication
SELECT t.schemaname, t.tablename
FROM pg_tables t
LEFT JOIN pg_constraint c
       ON c.conrelid = format('%I.%I', t.schemaname, t.tablename)::regclass
      AND c.contype = 'p'
WHERE c.conname IS NULL
AND t.schemaname NOT IN ('information_schema', 'pg_catalog');

-- Fix them (this will hurt performance but it'll work)
ALTER TABLE table_without_pk REPLICA IDENTITY FULL;

Reality check: REPLICA IDENTITY FULL means PostgreSQL replicates entire rows for every change. Your network will hate you, but at least replication won't silently fail.

PgBouncer: Your Best Friend and Worst Enemy

[Image: PgBouncer architecture]

PgBouncer is what makes the magic happen. When it works, it's brilliant. When it doesn't... well, I've spent many late nights debugging why PgBouncer decided to stop working for no reason. The PgBouncer documentation and PostgreSQL connection pooling best practices are essential reading.

Getting PgBouncer Not to Hate You

Here's a config that actually works in production:

## /etc/pgbouncer/pgbouncer.ini
[databases]
myapp = host=source-host port=5432 dbname=production

[pgbouncer]
listen_port = 6432
listen_addr = 0.0.0.0
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20
reserve_pool_size = 5
admin_users = pgbouncer_admin

Authentication Setup (The Part That Always Breaks)

## /etc/pgbouncer/userlist.txt
"postgres" "md5d41d8cd98f00b204e9800998ecf8427e"
"app_user" "md5c93d3bf7a7c4afe94b64e30c2ce39f4f"

Generate those MD5 hashes (because plain text passwords are for suckers):

echo -n "passwordusername" | md5sum
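
Or, to get the whole userlist.txt line in one shot - a sketch with placeholder credentials. PgBouncer's md5 format is the literal string md5 followed by the md5 of the password concatenated with the username:

## Spit out a ready-to-paste userlist.txt line
USERNAME="app_user"
PASSWORD="supersecret"   # placeholder - use the real password
HASH=$(echo -n "${PASSWORD}${USERNAME}" | md5sum | cut -d' ' -f1)
echo "\"${USERNAME}\" \"md5${HASH}\""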

Setting Up Logical Replication (Where the Real Fun Begins)

Step 1: Create the Publication (The Easy Part)

-- On your source database
CREATE PUBLICATION upgrade_publication FOR ALL TABLES;

-- Double-check it worked (because trust but verify)
SELECT pubname, puballtables FROM pg_publication WHERE pubname = 'upgrade_publication';
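
FOR ALL TABLES sounds airtight, but it's worth eyeballing exactly which tables got picked up - a quick check from the shell, using the same source host and database names as above:

## Every table you expect to migrate should show up in this list
psql -h source-host -U postgres -d production -c \
    "SELECT schemaname, tablename
     FROM pg_publication_tables
     WHERE pubname = 'upgrade_publication'
     ORDER BY 1, 2;"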

Step 2: Create the Subscription (Where Things Get Interesting)

First, make sure your target database has the schema but no data:

-- Nuke all the data but keep the schema
DO $$
DECLARE
    r RECORD;
BEGIN
    FOR r IN (SELECT tablename FROM pg_tables WHERE schemaname = 'public') LOOP
        EXECUTE 'TRUNCATE TABLE ' || quote_ident(r.tablename) || ' CASCADE';
    END LOOP;
END $$;

Now create the subscription and pray to the PostgreSQL gods:

-- This is where things either work beautifully or explode spectacularly
CREATE SUBSCRIPTION upgrade_subscription
CONNECTION 'host=source-host port=5432 dbname=production user=replication_user password=secure_password'
PUBLICATION upgrade_publication;
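
That connection string assumes a replication_user role already exists on the source. If it doesn't, here's a minimal sketch - the role needs the REPLICATION attribute plus SELECT on the published tables for the initial copy (the password is whatever you put in the subscription):

## Run on the source as a superuser
psql -h source-host -U postgres -d production <<'SQL'
CREATE ROLE replication_user WITH REPLICATION LOGIN PASSWORD 'secure_password';
GRANT USAGE ON SCHEMA public TO replication_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO replication_user;
SQL

## You'll also need a pg_hba.conf entry on the source that lets the target
## connect to the production database as replication_user, then reload the config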

Step 3: Watching the Initial Sync (And Slowly Going Insane)

The initial sync takes forever. On a 100GB database, it took like 6 hours, but could've been 8 - I wasn't exactly watching the clock. Here's how to monitor it without losing your mind:

-- Is it even working?
SELECT subname, pid, received_lsn, latest_end_lsn, last_msg_send_time, last_msg_receipt_time
FROM pg_stat_subscription;

-- How far behind are we? Run this on the SOURCE (this number will haunt your dreams).
-- Note: pg_last_xact_replay_timestamp() only works on physical standbys, so it's useless here.
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS replication_lag
FROM pg_replication_slots
WHERE slot_type = 'logical';

-- Are the tables actually getting data? (run on the target)
SELECT schemaname, relname AS tablename,
       n_live_tup AS approx_rows,
       n_tup_ins  AS rows_inserted
FROM pg_stat_user_tables
WHERE schemaname = 'public'
ORDER BY n_live_tup DESC;
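
Those counts are rough; the subscriber also tracks per-table sync state, which tells you exactly which tables are still doing their initial copy. A quick check, using the database name from earlier:

## Per-table sync state on the target: i = initializing, d = copying data,
## f = finished copy, s = synchronized, r = ready (streaming changes)
psql -h target-host -U postgres -d target_database -c \
    "SELECT srrelid::regclass AS table_name, srsubstate
     FROM pg_subscription_rel
     ORDER BY srsubstate, 1;"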

Making Sure Your Data Didn't Get Corrupted

The Paranoid Data Validation Step

Trust but verify. I learned this lesson when 20,000-something user records went missing during my first migration:

-- Row count comparison (the basic sanity check - source and target are
-- separate databases, so run the same query on each side and compare)
SELECT count(*) AS row_count FROM your_table;

-- Data checksum comparison (the paranoid check - again, run on both sides)
SELECT md5(string_agg(md5(t::text), '' ORDER BY id)) AS table_checksum
FROM your_table t;
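
If you'd rather not do that table by table, a blunt shell loop that compares every public table's row count across both sides - a sketch, with the host and database names used earlier in this guide:

## Compare row counts for every public table on source vs target
## (expect transient mismatches while writes are still flowing)
for t in $(psql -h source-host -U postgres -d production -t -A -c "SELECT tablename FROM pg_tables WHERE schemaname = 'public' ORDER BY 1"); do
    SRC=$(psql -h source-host -U postgres -d production -t -A -c "SELECT count(*) FROM public.\"$t\"")
    TGT=$(psql -h target-host -U postgres -d target_database -t -A -c "SELECT count(*) FROM public.\"$t\"")
    [ "$SRC" != "$TGT" ] && echo "MISMATCH: $t source=$SRC target=$TGT"
done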

Performance Warning: Logical replication will slow down your source database during initial sync. I've seen 30-40% performance degradation on write-heavy workloads. Plan accordingly - maybe don't do this during Black Friday. Learn from my pain. Check the PostgreSQL monitoring guide for tracking performance during migration.

[Image: HAProxy load balancing with PgBouncer]

Once logical replication is humming along and your data looks sane, you're ready for the switchover. That's where things get really exciting - and by exciting, I mean terrifying. But we'll tackle that in the next phase, armed with monitoring queries that actually work and emergency procedures that don't assume everything goes perfectly.

Phase 2: The Switchover (Where Everything Goes Wrong)

Pre-Switchover Reality Check

[Image: AWS blue-green deployment process]

When everything goes right, the switchover is stupid fast - like 3 seconds. When it doesn't... well, I've had 4-hour debugging sessions that started with "this should only take 3 seconds." Pro tip: learn how PostgreSQL failover works before you need it. The high availability documentation is also crucial reading.

The first time I did this, I skipped half these checks because I was cocky. Paid for it with a 2-hour outage and some very uncomfortable calls with management.

The Pre-Flight Checks That Will Save Your Ass

Don't skip these. I repeat: DO NOT SKIP THESE. The PostgreSQL administration documentation covers all the monitoring queries you'll need.

#!/bin/bash
## The script that prevents you from becoming the reason for a post-mortem

echo "=== Checking if we're about to break everything ==="

## 1. Is replication actually caught up? (byte lag on the source - the replay
##    timestamp trick only works on physical standbys, not logical subscribers)
LAG=$(psql -h source-host -U postgres -d production -t -A -c "SELECT COALESCE(MAX(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)), 0) FROM pg_replication_slots WHERE slot_type = 'logical'")
echo "Current replication lag: ${LAG} bytes"
if (( $(echo "$LAG > 10485760" | bc -l) )); then
    echo "NOPE: Replication lag over 10MB. Go grab coffee and wait."
    exit 1
fi

## 2. Can we actually talk to PgBouncer?
pgbouncer_status=$(psql -h pgbouncer-host -p 6432 -U pgbouncer_admin -d pgbouncer -c "SHOW pools;" 2>/dev/null | grep -c "myapp")
if [ "$pgbouncer_status" -eq 0 ]; then
    echo "NOPE: PgBouncer isn't talking to us"
    exit 1
fi

## 3. How busy is the database right now?
ACTIVE_CONN=$(psql -h source-host -U postgres -t -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active' AND query NOT LIKE '%pg_stat_activity%'")
echo "Active connections: ${ACTIVE_CONN}"

## 4. Do we have the sequence sync script ready?
if [ ! -f "sync_sequences.sql" ]; then
    echo "NOPE: You forgot the sequence script. This will end badly."
    exit 1
fi

echo "=== Looks like we might actually survive this ==="

The Sequence Problem That Everyone Forgets

[Image: Load balanced read replicas]

Here's the gotcha that bit me in the ass: logical replication doesn't replicate sequence values. So when you switch over, your auto-increment IDs start from 1 again and you get duplicate key errors. Fun times. The sequence functions documentation explains how PostgreSQL sequences work and why this happens.

-- Run this on your source to generate the fix
SELECT 'SELECT setval(' || quote_literal(quote_ident(ps.schemaname) || '.' || quote_ident(ps.sequencename)) ||
       ', GREATEST((SELECT COALESCE(MAX(' || quote_ident(c.column_name) || '), 1) FROM ' ||
       quote_ident(c.table_name) || '), ' ||
       'nextval(' || quote_literal(quote_ident(ps.schemaname) || '.' || quote_ident(ps.sequencename)) || ')));'
FROM pg_sequences ps
JOIN information_schema.columns c
  ON c.table_schema = ps.schemaname
 AND c.column_default LIKE '%' || ps.sequencename || '%'
WHERE ps.schemaname = 'public';

This spits out commands that look like:

SELECT setval('public.users_id_seq', GREATEST((SELECT COALESCE(MAX(id), 1) FROM users), nextval('public.users_id_seq')));
SELECT setval('public.orders_id_seq', GREATEST((SELECT COALESCE(MAX(id), 1) FROM orders), nextval('public.orders_id_seq')));

The Actual Switchover (Hold Your Breath)

Alright, this is it. The moment of truth. When it works, it's fucking magic. When it doesn't... well, let's focus on making it work.

Step 1: Stop the World (Temporarily)

This is where you pause all new database connections. Existing ones keep running, but new ones get queued. It's like putting a "be right back" sign on your database. The PgBouncer admin commands documentation explains all the PAUSE and RESUME commands.

## Connect to PgBouncer (pray it's working)
psql -h pgbouncer-host -p 6432 -U pgbouncer_admin pgbouncer

## Pause everything - this is the point of no return
PAUSE myapp;

## Check that it actually worked
SHOW pools;

You want to see something like:

database | user | cl_active | cl_waiting | sv_active | sv_idle | maxwait
myapp    | all  |        15 |         23 |         8 |       2 |       0

That cl_waiting number? Those are users trying to get in while you're doing surgery on the database.

Step 2: Wait for the Lag to Disappear (And Sweat Profusely)

Now you wait for replication to catch up. This is the part where you stare at numbers and question all your life choices. The replication monitoring guide has all the queries you need to track lag.

#!/bin/bash
## The longest 60 seconds of your life
TIMEOUT=60
ELAPSED=0

## Capture where the source WAL is right now (post-PAUSE). Once the slot has
## confirmed past this point, every pre-pause write has reached the subscriber.
## (pg_last_xact_replay_timestamp() only works on physical standbys, so don't use it here.)
TARGET_LSN=$(psql -h source-host -U postgres -d production -t -A -c "SELECT pg_current_wal_lsn()")

while [ $ELAPSED -lt $TIMEOUT ]; do
    CAUGHT_UP=$(psql -h source-host -U postgres -d production -t -A -c "SELECT bool_and(confirmed_flush_lsn >= '$TARGET_LSN') FROM pg_replication_slots WHERE slot_type = 'logical'")

    if [ "$CAUGHT_UP" = "t" ]; then
        echo "We're good! Subscriber has confirmed everything up to ${TARGET_LSN}"
        break
    fi

    echo "Still waiting... subscriber hasn't confirmed ${TARGET_LSN} yet"
    sleep 2
    ELAPSED=$((ELAPSED + 2))
done

if [ $ELAPSED -ge $TIMEOUT ]; then
    echo "Fuck. Timeout. Rolling back now before everything burns down."
    # Emergency brake - resume PgBouncer and abort
    psql -h pgbouncer-host -p 6432 -U pgbouncer_admin pgbouncer -c "RESUME myapp;"
    exit 1
fi

Step 3: Fix the Sequences (Don't Forget This)

## Run the sequence sync script or hate yourself later
psql -h target-host -U postgres -d target_db -f sync_sequences.sql

Step 4: Point PgBouncer at the New Database

This is the actual switchover moment. This is the scary part.

## Make a new config pointing to PostgreSQL 17
cat > /tmp/pgbouncer_new.ini << EOF
[databases]
myapp = host=target-host port=5432 dbname=target_database

[pgbouncer]
listen_port = 6432
listen_addr = 0.0.0.0
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20
reserve_pool_size = 5
admin_users = pgbouncer_admin
EOF

## Backup the old config (you'll thank me later)
cp /etc/pgbouncer/pgbouncer.ini /etc/pgbouncer/pgbouncer.ini.backup
cp /tmp/pgbouncer_new.ini /etc/pgbouncer/pgbouncer.ini

## Tell PgBouncer to read the new config
psql -h pgbouncer-host -p 6432 -U pgbouncer_admin pgbouncer -c "RELOAD;"

Step 5: Let the Users Back In

## Unleash the flood
psql -h pgbouncer-host -p 6432 -U pgbouncer_admin pgbouncer -c "RESUME myapp;"

## Check that we're actually talking to the new database
psql -h pgbouncer-host -p 6432 -U pgbouncer_admin pgbouncer -c "SHOW databases;"

The Complete Switchover Script (Copy This)

Here's the script that actually works in production. I use this exact one. Based on the PostgreSQL backup and recovery procedures and disaster recovery best practices.

#!/bin/bash
## PostgreSQL Zero-Downtime Switchover Script
## The one that saved my ass twice

set -e

SOURCE_HOST="source-host"
TARGET_HOST="target-host"
PGBOUNCER_HOST="pgbouncer-host"
DATABASE="production"   # the actual database name
PGB_DB="myapp"          # the pool name from pgbouncer.ini's [databases] section

echo "Starting PostgreSQL switchover at $(date)"

## Pre-flight checks
echo "Running pre-flight checks..."
./preflight_checks.sh

## Step 1: Pause PgBouncer
echo "Pausing PgBouncer..."
psql -h $PGBOUNCER_HOST -p 6432 -U pgbouncer_admin pgbouncer -c "PAUSE $PGB_DB;" || {
    echo "Failed to pause PgBouncer"
    exit 1
}

## Step 2: Wait for the subscriber to catch up
echo "Waiting for the subscriber to confirm everything written before the pause..."
TIMEOUT=120
ELAPSED=0
TARGET_LSN=$(psql -h $SOURCE_HOST -U postgres -d $DATABASE -t -A -c "SELECT pg_current_wal_lsn()")

while [ $ELAPSED -lt $TIMEOUT ]; do
    CAUGHT_UP=$(psql -h $SOURCE_HOST -U postgres -d $DATABASE -t -A -c "SELECT bool_and(confirmed_flush_lsn >= '$TARGET_LSN') FROM pg_replication_slots WHERE slot_type = 'logical'")

    if [ "$CAUGHT_UP" = "t" ]; then
        echo "Subscriber has confirmed everything up to ${TARGET_LSN}"
        break
    fi

    echo "Still waiting for the subscriber to reach ${TARGET_LSN}..."
    sleep 1
    ELAPSED=$((ELAPSED + 1))
done

if [ $ELAPSED -ge $TIMEOUT ]; then
    echo "Timeout waiting for the subscriber to catch up. Rolling back..."
    psql -h $PGBOUNCER_HOST -p 6432 -U pgbouncer_admin pgbouncer -c "RESUME $PGB_DB;"
    exit 1
fi

## Step 3: Synchronize sequences
echo "Synchronizing sequences..."
psql -h $TARGET_HOST -U postgres -d $DATABASE -f sync_sequences.sql || {
    echo "Failed to sync sequences. Rolling back..."
    psql -h $PGBOUNCER_HOST -p 6432 -U pgbouncer_admin pgbouncer -c "RESUME $PGB_DB;"
    exit 1
}

## Step 4: Update PgBouncer configuration
echo "Updating PgBouncer configuration..."
sed -i.backup "s/host=$SOURCE_HOST/host=$TARGET_HOST/g" /etc/pgbouncer/pgbouncer.ini
psql -h $PGBOUNCER_HOST -p 6432 -U pgbouncer_admin pgbouncer -c "RELOAD;" || {
    echo "Failed to reload PgBouncer. Rolling back..."
    mv /etc/pgbouncer/pgbouncer.ini.backup /etc/pgbouncer/pgbouncer.ini
    psql -h $PGBOUNCER_HOST -p 6432 -U pgbouncer_admin pgbouncer -c "RELOAD;"
    psql -h $PGBOUNCER_HOST -p 6432 -U pgbouncer_admin pgbouncer -c "RESUME $PGB_DB;"
    exit 1
}

## Step 5: Resume traffic
echo "Resuming application traffic..."
psql -h $PGBOUNCER_HOST -p 6432 -U pgbouncer_admin pgbouncer -c "RESUME $PGB_DB;"

echo "Switchover completed successfully at $(date)"

## Verification
echo "Verifying switchover..."
psql -h $PGBOUNCER_HOST -p 6432 -U postgres -d $PGB_DB -c "SELECT version();"

Did It Actually Work? (Validation Time)

Now comes the fun part - proving to yourself (and your users) that you didn't break everything. The PostgreSQL testing and validation guide covers proper validation procedures, though we'll use simpler checks here.

-- Are we actually on PostgreSQL 17?
SELECT version();

-- Are people actually connecting?
SELECT count(*) as active_connections FROM pg_stat_activity WHERE state = 'active';

-- Can we write data?
INSERT INTO test_table (name, created_at) VALUES ('switchover_test', now());
SELECT * FROM test_table WHERE name = 'switchover_test';

-- Are the sequences working? (this is where I got burned before)
INSERT INTO users (name) VALUES ('test_user');
SELECT currval('users_id_seq');

The Real Test: Does Your App Still Work?

## Hit the critical endpoints
curl -f "https://your-app.com/health" || echo "Health check failed"
curl -f "https://your-app.com/api/users" || echo "User API failed"

## Watch for the inevitable error messages
tail -f /var/log/application.log | grep -i error

If everything looks good, congratulations! You just pulled off a zero-downtime PostgreSQL upgrade. If not... well, that's what the rollback section is for. Check the PostgreSQL troubleshooting guide if things go sideways.

When Things Go Wrong (And They Will)

Q

Replication lag is stuck at 30 minutes and I'm panicking

A

This happened to me during my first migration. Turned out someone was running a massive data export that was hogging all the WAL.

Find the culprit:

-- What's eating all your WAL?
SELECT pid, usename, application_name, state, query_start,
       now() - query_start as duration, query
FROM pg_stat_activity
WHERE state IN ('active', 'idle in transaction')
ORDER BY query_start;

-- How screwed are we? (byte lag behind each logical slot - run on the source)
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS replication_lag
FROM pg_replication_slots
WHERE slot_type = 'logical';

Fix it:

  • Kill the bastard query that's blocking everything: SELECT pg_terminate_backend(pid);
  • Bump up workers: ALTER SYSTEM SET max_logical_replication_workers = 20; (needs a restart to take effect)
  • Tell whoever's running reports to fuck off until after the migration
  • Throw more hardware at it if you're desperate - and check how much WAL the slot is retaining (see below)
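
To see how much WAL the slot is actually holding back (the number that eventually fills your disk), check the source:

## If retained_wal keeps growing, the subscriber isn't keeping up
psql -h source-host -U postgres -d production -c \
    "SELECT slot_name, active,
            pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
     FROM pg_replication_slots
     WHERE slot_type = 'logical';"
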
Q

The switchover script just died and now I'm sweating

A

Don't panic. I've been here. The good news is you can roll back pretty quickly if you didn't get too far.

## Emergency "oh shit" rollback
psql -h pgbouncer-host -p 6432 -U pgbouncer_admin pgbouncer -c "PAUSE myapp;"

## Put the old config back
mv /etc/pgbouncer/pgbouncer.ini.backup /etc/pgbouncer/pgbouncer.ini
psql -h pgbouncer-host -p 6432 -U pgbouncer_admin pgbouncer -c "RELOAD;"

## Let everyone back in
psql -h pgbouncer-host -p 6432 -U pgbouncer_admin pgbouncer -c "RESUME myapp;"

How long until I stop getting angry Slack messages: Maybe 60 seconds if you're lucky and PgBouncer cooperates.

Q

Everything's broken with "relation does not exist" and users are pissed

A

This is usually because you forgot to install some extension or custom function on the target. I learned this lesson when our entire auth system died because I forgot pgcrypto.

What you probably forgot:

  • Extensions: Check if pg_stat_statements, pgcrypto, uuid-ossp exist on target
  • Custom functions: Did you copy over all your stored procedures?
  • Search paths: PostgreSQL 17 might have different defaults
  • Case sensitivity: Something got lowercased that shouldn't have

Emergency fix:

-- See what's installed on the target (run the same query on the source and diff the lists)
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE installed_version IS NOT NULL;

-- Install the obvious ones you forgot
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
CREATE EXTENSION IF NOT EXISTS pgcrypto;
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
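
Rather than installing "the obvious ones" from memory, diff what's actually installed on each side - a quick sketch with the host and database names used earlier:

## Lines prefixed with < exist only on the source, i.e. missing on the target
diff <(psql -h source-host -U postgres -d production -t -A -c "SELECT extname FROM pg_extension ORDER BY 1") \
     <(psql -h target-host -U postgres -d target_database -t -A -c "SELECT extname FROM pg_extension ORDER BY 1")
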
Q

Duplicate key errors are killing my app and I'm losing my shit

A

This is the sequence problem I was telling you about. Logical replication doesn't sync sequences, so they start over from 1 and you get duplicate keys. This ruined my first migration attempt.

The error that ruins your day:

ERROR: duplicate key value violates unique constraint "users_pkey"
DETAIL: Key (id)=(12345) already exists.

Emergency sequence reset:

-- Set sequences way higher than they need to be (better safe than sorry)
SELECT setval('users_id_seq', (SELECT MAX(id) FROM users) + 1000);
SELECT setval('orders_id_seq', (SELECT MAX(id) FROM orders) + 1000);
SELECT setval('payments_id_seq', (SELECT MAX(id) FROM payments) + 1000);

-- Check that it worked
SELECT currval('users_id_seq'), nextval('users_id_seq');
Q

PgBouncer just went to shit and nothing works

A

This is your worst nightmare scenario. PgBouncer is dead, users can't connect, and everyone's looking at you. Time for emergency measures.

Nuclear options (in order of desperation):

  1. Bypass PgBouncer completely - point apps directly at the database
  2. Restart PgBouncer and pray it comes back
  3. Failover to another PgBouncer instance if you have one
## Emergency bypass (if you're using Kubernetes)
kubectl set env deployment/app DATABASE_URL="postgresql://user:pass@target-host:5432/db"

## Or restart the bastard
sudo systemctl restart pgbouncer
Q

How do I know if I fucked up the data?

A

Paranoia is good here. I check everything twice because I've seen data corruption that wasn't obvious until days later.

-- Quick sanity checks
SELECT schemaname, tablename, n_tup_ins + n_tup_upd + n_tup_del as total_changes
FROM pg_stat_user_tables
ORDER BY total_changes DESC;

-- The "oh shit did I lose customer data" checks
SELECT count(*) as user_count FROM users WHERE created_at > now() - interval '1 hour';
SELECT count(*) as order_count FROM orders WHERE status = 'completed';
SELECT sum(amount) as revenue_total FROM payments WHERE created_at::date = current_date;

The paranoid check:

## This will tell you if your disk is lying to you - note the cluster has to be
## shut down cleanly first, and it only works if data checksums are enabled
pg_checksums --check --pgdata=/var/lib/postgresql/data --verbose
Q

Can I roll back if everything's fucked?

A

Yeah, but the longer you wait, the more it's going to hurt.

If you catch it quickly (within an hour):

  • Flip PgBouncer back to the old database
  • You'll lose maybe a few minutes of data
  • Not the end of the world

If it's been running for hours:

  • You're looking at a full data dump and restore
  • Potential data loss
  • Lots of explaining to do

Pro tip: Keep PostgreSQL 16 running for 24-48 hours before you kill it. You'll thank me later.

Q

Everything's slow and users are complaining

A

PostgreSQL 17's optimizer is different. Sometimes that means your queries run like shit now.

-- See what's taking forever
ALTER SYSTEM SET log_min_duration_statement = 1000;
SELECT pg_reload_conf();

-- Check if the query plan changed
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM slow_table WHERE problem_column = 'value';

-- Nuclear option: update all statistics
ANALYZE VERBOSE;

Usually fixes it:

  • Run ANALYZE on everything
  • Bump up statistics targets: ALTER TABLE table_name ALTER COLUMN column_name SET STATISTICS 1000;
  • Restart PostgreSQL (yes, really)
Q

Subscription keeps failing with "logical replication worker crashed"

A

This is the error that made me question my career choices. Usually happens when PostgreSQL 17 tries to apply changes from a format it doesn't understand.

The log message that ruins your day:

FATAL: logical replication worker for subscription "upgrade_subscription" has crashed
DETAIL: Worker process exited with exit code 1

What probably went wrong:

  • Data type incompatibility between PG 16 and 17
  • Character encoding differences
  • Some asshole changed the schema during migration

Emergency fix:

-- Drop and recreate the subscription (nuclear option)
DROP SUBSCRIPTION upgrade_subscription;

-- Wait 30 seconds for cleanup
-- Recreate with copy_data = false since we already have the data
-- (heads up: anything committed on the source between the DROP and the CREATE
--  won't be replayed, so re-run your row count checks afterwards)
CREATE SUBSCRIPTION upgrade_subscription
CONNECTION 'host=source-host port=5432 dbname=production user=replication_user password=secure_password'
PUBLICATION upgrade_publication
WITH (copy_data = false);

Prevention: Test schema changes on staging. Do NOT modify anything during the migration window. I learned this when some genius deployed a schema change mid-migration and broke everything. Almost got fired over that one.

Phase 3: Cleanup (Don't Skip This Part)

[Image: CloudWatch database connections monitoring]

Clean Up All the Test Crap You Created

Turn Off Logical Replication

Once you're confident everything's working, shut down the replication stuff. Leaving it running just wastes resources. The logical replication management guide explains proper cleanup procedures.

-- On target database (PostgreSQL 17)
-- Drop the subscription to stop replication
DROP SUBSCRIPTION upgrade_subscription;

-- Verify subscription is removed
SELECT subname FROM pg_subscription;
-- On source database (PostgreSQL 16)
-- Drop the publication
DROP PUBLICATION upgrade_publication;

-- Check for remaining replication slots
SELECT slot_name, plugin, slot_type, database, active, restart_lsn
FROM pg_replication_slots;

-- Drop any remaining slots
SELECT pg_drop_replication_slot('slot_name');

Restore Normal Replica Identity Settings

If you modified replica identity settings for tables without primary keys, restore them to default:

-- Find tables with FULL replica identity
SELECT n.nspname AS schemaname, c.relname AS tablename
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relreplident = 'f'  -- 'f' = FULL
  AND c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema');

-- Reset to default replica identity
ALTER TABLE table_name REPLICA IDENTITY DEFAULT;

Database Optimization for PostgreSQL 17

Update Table Statistics

PostgreSQL 17 includes improvements to the query planner. Update statistics to take advantage of new optimizations. The PostgreSQL 17 release notes detail all the performance improvements.

-- Analyze all tables with verbose output
ANALYZE VERBOSE;

-- For critical large tables, increase statistics collection
ALTER TABLE large_table ALTER COLUMN indexed_column SET STATISTICS 1000;
ANALYZE large_table;

Configure PostgreSQL 17 Specific Settings

PostgreSQL 17 introduces new configuration options worth playing with (just don't blame me if they break something). Check the server configuration documentation for full details on these new settings.

## postgresql.conf optimizations for PostgreSQL 17

## Enhanced vacuum performance
vacuum_buffer_usage_limit = '2GB'

## Improved I/O efficiency
io_combine_limit = '128kB'

## Better parallel query performance
parallel_leader_participation = on

## JIT cost threshold (100000 is the default; raise it if JIT compilation
## overhead shows up on short queries)
jit_above_cost = 100000
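
If you prefer ALTER SYSTEM over editing the file, a sketch that applies the two new PostgreSQL 17 knobs, reloads, and checks whether anything still wants a restart:

psql -h target-host -U postgres -d target_database <<'SQL'
ALTER SYSTEM SET vacuum_buffer_usage_limit = '2GB';
ALTER SYSTEM SET io_combine_limit = '128kB';
SELECT pg_reload_conf();
-- pending_restart = true means the value won't apply until the next restart
SELECT name, setting, pending_restart
FROM pg_settings
WHERE name IN ('vacuum_buffer_usage_limit', 'io_combine_limit');
SQL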

Index Maintenance and Optimization

Check for any indexes that might benefit from PostgreSQL 17's improvements. The index optimization guide and performance tuning documentation cover index management in detail.

-- Identify unused indexes (never scanned since stats were last reset)
SELECT schemaname, relname AS tablename, indexrelname AS indexname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE idx_scan = 0;

-- Find indexes that might benefit from new features
SELECT indexname, indexdef
FROM pg_indexes
WHERE indexdef LIKE '%btree%'
AND tablename IN (SELECT tablename FROM pg_tables WHERE schemaname = 'public');

Performance Monitoring and Validation

Establish New Baselines

Create performance baselines for your PostgreSQL 17 instance using pg_stat_statements. The monitoring and statistics documentation explains all the available performance metrics.

-- Enable pg_stat_statements if not already enabled
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Reset statistics to get clean baseline
SELECT pg_stat_statements_reset();

-- After 24 hours, check top queries
SELECT query, calls, total_exec_time, mean_exec_time, rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

Application Performance Testing

Run comprehensive application tests to verify performance improvements:

#!/bin/bash
## Performance validation script

echo "Running application performance tests..."

## API endpoint response times
for endpoint in "/users" "/orders" "/reports"; do
    echo "Testing $endpoint..."
    curl -w "Response time: %{time_total}s\n" -s -o /dev/null "https://your-app.com$endpoint"
done

## Database query performance
psql -h target-host -U postgres -d mydb -c "
SELECT
    query,
    mean_exec_time,
    calls
FROM pg_stat_statements
WHERE query LIKE 'SELECT%'
ORDER BY mean_exec_time DESC
LIMIT 10;"

Infrastructure Cleanup

Remove PgBouncer (Optional)

If PgBouncer was only used for the upgrade (though honestly, you should probably keep it):

## Stop PgBouncer service
sudo systemctl stop pgbouncer

## Update application to connect directly to PostgreSQL 17
## Update DATABASE_URL in application configuration
export DATABASE_URL="postgresql://user:password@target-host:5432/database"

## Remove PgBouncer package
sudo apt remove pgbouncer

Decommission PostgreSQL 16 Instance

⚠️ Wait 48-72 hours before decommissioning to ensure stability. Follow the backup and recovery best practices for proper data preservation.

## Create final backup of PostgreSQL 16
pg_dump --host=source-host \
        --username=postgres \
        --format=custom \
        --verbose \
        --file=final_pg16_backup.dump \
        production_database

## Verify backup integrity
pg_restore --list final_pg16_backup.dump | head -20

## Document decommission date and backup location
echo "PostgreSQL 16 backup created: $(date)" >> upgrade_log.txt
echo "Backup location: $(pwd)/final_pg16_backup.dump" >> upgrade_log.txt

AWS RDS Specific Cleanup

For AWS RDS instances:

## Create final snapshot
aws rds create-db-snapshot \
    --db-instance-identifier postgres16-source \
    --db-snapshot-identifier postgres16-final-snapshot-$(date +%Y%m%d)

## Wait for snapshot completion
aws rds wait db-snapshot-completed \
    --db-snapshot-identifier postgres16-final-snapshot-$(date +%Y%m%d)

## Delete the instance (after validation period)
aws rds delete-db-instance \
    --db-instance-identifier postgres16-source \
    --skip-final-snapshot \
    --delete-automated-backups

Security Updates and Hardening

Update Connection Security

PostgreSQL 17 includes enhanced security features. Review the security documentation and authentication methods guide for complete security hardening.

-- Enable enhanced authentication logging
ALTER SYSTEM SET log_connections = on;
ALTER SYSTEM SET log_disconnections = on;
ALTER SYSTEM SET log_line_prefix = '%m [%p] %q%u@%d ';

-- Configure row-level security if not already enabled
ALTER TABLE sensitive_table ENABLE ROW LEVEL SECURITY;

-- Update password policies for PostgreSQL 17
ALTER SYSTEM SET password_encryption = 'scram-sha-256';

SSL/TLS Configuration Updates

Verify SSL configuration for PostgreSQL 17:

-- Check SSL status
SELECT name, setting FROM pg_settings WHERE name LIKE 'ssl%';

-- Verify SSL is being used by connections
SELECT a.datname, a.usename, a.client_addr, s.ssl, s.version AS ssl_version, s.cipher AS ssl_cipher
FROM pg_stat_ssl s
JOIN pg_stat_activity a USING (pid);

Documentation and Runbook Updates

Create Upgrade Documentation

Document the upgrade process for future reference:

## PostgreSQL 16 to 17 Upgrade Completed

**Upgrade Date:** $(date)
**Duration:** [Record total time]
**Downtime:** ~3 seconds during switchover

### Configuration Changes Applied:
- Logical replication enabled
- PgBouncer configuration updated
- Sequence synchronization performed
- Table statistics updated

### Performance Improvements Observed:
- Query execution time: [measure before/after]
- Replication performance: [document improvements]
- Storage efficiency: [note any changes]

### Rollback Procedures Tested:
- Emergency rollback via PgBouncer: ✓
- Data consistency validation: ✓
- Application failover procedures: ✓

Update Monitoring and Alerting

Adjust monitoring thresholds for PostgreSQL 17:

## Example Prometheus alerting rules
groups:
  - name: postgresql17
    rules:
    - alert: PostgreSQLDown
      expr: pg_up == 0
      for: 1m
      labels:
        severity: critical

    - alert: PostgreSQLReplicationLag
      expr: pg_replication_lag_seconds > 30
      for: 2m
      labels:
        severity: warning

    - alert: PostgreSQLHighTupleFetchRate
      expr: rate(pg_stat_database_tup_fetched[5m]) > 10000
      for: 5m
      labels:
        severity: info

Validation Checklist

Before declaring the upgrade complete, verify:

  • All application functionality working normally
  • Database performance metrics within expected ranges
  • No increase in error rates or timeouts
  • Backup and recovery procedures tested
  • Monitoring and alerting updated for PostgreSQL 17
  • Team trained on PostgreSQL 17 specific features
  • Rollback procedures documented and tested
  • Old infrastructure properly decommissioned

That's it - you're done! Your PostgreSQL 17 database is running and hopefully not on fire. You made it through without nuking production. Time for a beer. For ongoing maintenance, check the routine maintenance guide and vacuum best practices.

PostgreSQL Upgrade Methods Comparison

| Method | Downtime | Complexity | Data Loss Risk | Resource Requirements | Best For |
|--------|----------|------------|----------------|-----------------------|----------|
| Logical Replication (This Guide) | 3-10 seconds | High | None | 2x resources during upgrade | Production systems requiring zero downtime |
| pg_upgrade (In-Place) | 5-30 minutes | Medium | Low | 1.5x storage space | Small to medium databases with maintenance windows |
| AWS RDS Blue/Green | 30-60 seconds | Low | None | 2x resources during upgrade | AWS RDS users wanting managed solution |
| Dump and Restore | 2-12+ hours | Low | Low | 2x storage space | Small databases or major restructuring |
| Physical Replication + Failover | 1-5 minutes | Medium | Low | 2x resources | Traditional HA setups |
