PostgreSQL 16 to 17 Zero-Downtime Upgrade via Logical Replication
Executive Summary
- Technique: Zero-downtime PostgreSQL upgrade using logical replication
- Downtime: 3-10 seconds (theoretical) vs. 4-hour disasters (reality when things go wrong)
- Resource Requirements: roughly double normal resources during the migration
- Complexity: High - multiple failure modes, requires precise timing
- Success Rate: High when properly planned, catastrophic when rushed
Critical Prerequisites
Configuration Requirements
- `wal_level = logical` (requires a PostgreSQL restart - plan downtime accordingly)
- `max_replication_slots >= 10`
- `max_wal_senders >= 10`
- `max_logical_replication_workers >= 10`
- `max_worker_processes >= 20`
FAILURE MODE: If `wal_level` isn't `logical`, your "zero-downtime" upgrade starts with a 20-minute downtime window just to restart the source.
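A minimal way to verify all five settings at once before scheduling anything (run on the source; `pending_restart` flags values that have been changed but not yet applied):

```sql
-- Current values of the replication-related settings listed above
SELECT name, setting, pending_restart
FROM pg_settings
WHERE name IN ('wal_level',
               'max_replication_slots',
               'max_wal_senders',
               'max_logical_replication_workers',
               'max_worker_processes');
```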
Hardware Resource Requirements
| Resource | Minimum Requirement | Failure Consequence |
|---|---|---|
| CPU | 2x normal usage | Replication crawls, lag increases exponentially |
| Memory | 2x normal allocation | WAL processing fills all available RAM |
| Storage | 150% of source database | Migration fails mid-process, logs consume 800GB+ |
| Network Latency | <5ms between source/target | Replication lag becomes unbearable |
| Disk I/O | Fastest available storage | Migration takes 8+ hours instead of 6 |
Primary Key Requirement
CRITICAL: Tables without primary keys break logical replication silently
- Detection: Query `pg_tables` joined with `pg_constraint` for missing primary keys (one such query is sketched below)
- Workaround: `ALTER TABLE table_name REPLICA IDENTITY FULL` (causes performance degradation)
- Impact: `REPLICA IDENTITY FULL` replicates entire rows, crushing network performance
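Here is one way to run that detection, sketched against `pg_tables` and `pg_constraint` as described above; extend the schema filter to cover any other system schemas in your cluster:

```sql
-- Tables with no primary key: UPDATEs and DELETEs on these will not replicate
-- unless REPLICA IDENTITY FULL is set
SELECT t.schemaname, t.tablename
FROM pg_tables t
LEFT JOIN pg_constraint c
       ON  c.conrelid = (quote_ident(t.schemaname) || '.' || quote_ident(t.tablename))::regclass
       AND c.contype  = 'p'
WHERE t.schemaname NOT IN ('pg_catalog', 'information_schema')
  AND c.oid IS NULL
ORDER BY 1, 2;
```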
Phase 1: Infrastructure Setup
PostgreSQL 17 Target Configuration
- Size Target Correctly: Identical specs to the source result in "dying snail" migration speed
- AWS RDS Example: `db.r5.xlarge` minimum for production workloads
- Storage: Minimum 500GB allocated storage regardless of source size
Schema Migration Process
```bash
# Schema-only dump (excludes data, permissions, ownership)
pg_dump --host=source-host --schema-only --no-privileges --no-owner source_db > schema.sql
psql --host=target-host --file=schema.sql target_db
```
Common Failure Points:
- Permission errors during schema application
- Missing extensions (`pgcrypto`, `uuid-ossp`, `pg_stat_statements`) - see the fix sketched below
- Character encoding mismatches between versions
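For the missing-extensions case, creating them on the target before re-running the schema load is usually enough. A sketch with the three extensions named above; check `\dx` on the source for your full list:

```sql
-- Run on the PostgreSQL 17 target before applying schema.sql
CREATE EXTENSION IF NOT EXISTS pgcrypto;
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;  -- also needs shared_preload_libraries to collect stats
```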
PgBouncer Configuration
Purpose: Enables instant connection redirection during switchover
- Pool Mode: `transaction` (required for seamless switching)
- Authentication: `md5` with a userlist file (plain text passwords fail)

```ini
# Production-tested configuration
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20
reserve_pool_size = 5
```
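Before trusting PgBouncer with the switchover, confirm the admin console answers and the pool mode really is `transaction`. These are standard admin-console commands, issued the same way as in the rollback script later (`psql -h pgbouncer-host -p 6432 -U admin pgbouncer`):

```sql
SHOW CONFIG;     -- confirm pool_mode = transaction
SHOW POOLS;      -- per-database client/server connection counts
SHOW DATABASES;  -- the entry here is what gets rewritten at switchover
```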
Phase 2: Logical Replication Setup
Publication Creation (Source Database)
```sql
CREATE PUBLICATION upgrade_publication FOR ALL TABLES;
```
Subscription Creation (Target Database)
```sql
CREATE SUBSCRIPTION upgrade_subscription
    CONNECTION 'host=source port=5432 dbname=prod user=replication_user password=secure_pass'
    PUBLICATION upgrade_publication;
```
Initial Sync Monitoring
Duration Expectations:
- 100GB database: 6-8 hours
- Performance degradation on source: 30-40% during sync
- DO NOT run during high-traffic periods (Black Friday lesson learned)
Monitoring Queries:
```sql
-- Approximate replication lag in seconds (run on the target/subscriber)
SELECT subname,
       EXTRACT(EPOCH FROM (now() - latest_end_time)) AS lag_seconds
FROM pg_stat_subscription;

-- Subscription status
SELECT subname, pid, received_lsn, latest_end_lsn
FROM pg_stat_subscription;
```
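Lag alone doesn't tell you whether the initial table copy has finished; per-table state lives in `pg_subscription_rel` on the target. A small addition to the monitoring queries above:

```sql
-- Per-table sync state on the target:
--   i = initializing, d = copying data, f = copy finished, s = synchronized, r = ready (streaming)
SELECT srrelid::regclass AS table_name, srsubstate
FROM pg_subscription_rel
ORDER BY srsubstate, table_name;
```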
Phase 3: Switchover Execution
Pre-Switchover Validation
Mandatory Checks (skipping these caused a 2-hour outage):
- Replication lag < 0.5 seconds (publisher-side check sketched after this list)
- PgBouncer responding to admin commands
- Sequence synchronization script prepared
- Source database activity < normal threshold
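For the lag check, a publisher-side view of how much WAL the subscription has not yet confirmed complements the subscriber-side query from Phase 2. A sketch; anything more than a few kilobytes of pending WAL means you are not ready to pause:

```sql
-- Run on the source: WAL not yet confirmed by the logical subscription's slot
SELECT slot_name,
       active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS pending_bytes
FROM pg_replication_slots
WHERE slot_type = 'logical';
```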
Sequence Synchronization Problem
CRITICAL ISSUE: Logical replication doesn't sync sequence values
Consequence: Auto-increment IDs restart from 1, causing duplicate key errors
Solution: Pre-generate sequence sync commands before switchover
```sql
-- Generate sequence sync commands: one setval() per sequence, pinned to at least
-- the current MAX of the column whose default uses it (save the output for switchover)
SELECT 'SELECT setval(' || quote_literal(ps.schemaname||'.'||ps.sequencename) ||
       ', GREATEST((SELECT COALESCE(MAX('||c.column_name||'), 1) FROM '||c.table_schema||'.'||c.table_name||'), ' ||
       'nextval('||quote_literal(ps.schemaname||'.'||ps.sequencename)||')));'
FROM pg_sequences ps
JOIN information_schema.columns c
  ON c.column_default LIKE '%'||ps.sequencename||'%';
```
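If you'd rather not copy/paste the generated statements by hand, psql's `\gexec` runs each value of the previous result as its own statement, so ending the generator query with `\gexec` instead of a semicolon applies every `setval()` in one pass. A trivial illustration of the pattern (hypothetical sequence and table names):

```sql
-- The outer query produces SQL text; \gexec then executes each returned row
SELECT format('SELECT setval(%L, (SELECT COALESCE(MAX(id), 1) FROM %I))',
              'public.users_id_seq', 'users')
\gexec
```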
Switchover Process
- Pause PgBouncer: `PAUSE database_name;`
- Wait for zero lag: time out after 120 seconds
- Sync sequences: apply the pre-generated commands
- Redirect PgBouncer: update the config, then `RELOAD;` (console sketch below)
- Resume traffic: `RESUME database_name;`
Total Switchover Time: 3 seconds when everything works, 4 hours when it doesn't
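The PgBouncer half of those steps is three admin-console commands wrapped around the config edit. A minimal sketch of the sequence, using the same admin console as the rollback script below and a hypothetical `myapp` database entry:

```sql
-- On the PgBouncer admin console (psql -h pgbouncer-host -p 6432 -U admin pgbouncer)
PAUSE myapp;     -- queue new clients, wait for in-flight transactions to finish
-- edit the [databases] entry for myapp in pgbouncer.ini to point at the PG17 host,
-- apply the sequence sync commands on the target, then:
RELOAD;          -- re-read pgbouncer.ini
RESUME myapp;    -- release queued clients against the new target
```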
Critical Failure Modes and Recovery
Replication Lag Stuck at 30+ Minutes
Root Cause: Large query blocking WAL processing
Detection: Query `pg_stat_activity` for long-running transactions (example below)
Resolution:
- Kill blocking queries: `SELECT pg_terminate_backend(pid);`
- Increase workers: `ALTER SYSTEM SET max_logical_replication_workers = 20;` (takes effect only after a restart)
- Postpone report generation until after the migration
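The detection step boils down to a query like this against `pg_stat_activity` on the source (a sketch; tune the 5-minute threshold to whatever your workload considers "long-running"):

```sql
-- Open transactions older than 5 minutes: these hold back logical decoding
SELECT pid,
       usename,
       state,
       now() - xact_start AS xact_age,
       left(query, 80)    AS current_query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '5 minutes'
ORDER BY xact_age DESC;
```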
Switchover Script Failure
Emergency Rollback:
```bash
# Restore PgBouncer to the source database
psql -h pgbouncer-host -p 6432 -U admin pgbouncer -c "PAUSE myapp;"
mv /etc/pgbouncer/pgbouncer.ini.backup /etc/pgbouncer/pgbouncer.ini
psql -h pgbouncer-host -p 6432 -U admin pgbouncer -c "RELOAD;"
psql -h pgbouncer-host -p 6432 -U admin pgbouncer -c "RESUME myapp;"
```
Recovery Time: 60 seconds if PgBouncer cooperates
Duplicate Key Errors Post-Migration
Symptom: `ERROR: duplicate key value violates unique constraint`
Emergency Fix:
```sql
-- Set sequences higher than maximum values
SELECT setval('users_id_seq', (SELECT MAX(id) FROM users) + 1000);
```
Subscription Worker Crashes
Error: logical replication worker crashed
Common Causes:
- Data type incompatibility between PostgreSQL versions
- Schema modifications during migration
- Character encoding differences
Nuclear Recovery:
```sql
DROP SUBSCRIPTION upgrade_subscription;
-- Wait 30 seconds for cleanup
CREATE SUBSCRIPTION upgrade_subscription
    CONNECTION '...' PUBLICATION upgrade_publication
    WITH (copy_data = false);
```
Performance Optimization Post-Upgrade
PostgreSQL 17 Specific Settings
```ini
# Enhanced performance settings
vacuum_buffer_usage_limit = '2GB'
io_combine_limit = '128kB'
parallel_leader_participation = on
jit_above_cost = 100000
```
Statistics and Index Updates
```sql
-- Update optimizer statistics
ANALYZE VERBOSE;

-- Increase statistics for critical columns (re-run ANALYZE afterwards so the new target takes effect)
ALTER TABLE large_table ALTER COLUMN indexed_column SET STATISTICS 1000;
```
Resource Requirements Summary
| Phase | CPU Usage | Memory Usage | Storage Usage | Network Impact |
|---|---|---|---|---|
| Initial Sync | 200% normal | 200% normal | 150% source size | High sustained traffic |
| Steady Replication | 120% normal | 150% normal | Stable | Low continuous traffic |
| Switchover | Minimal | Minimal | Stable | Burst during sequence sync |
Upgrade Method Comparison
| Method | Downtime | Complexity | Data Loss Risk | Resource Cost | Best Use Case |
|---|---|---|---|---|---|
| Logical Replication | 3-10 seconds | High | None | 2x during upgrade | Production requiring zero downtime |
| pg_upgrade | 5-30 minutes | Medium | Low | 1.5x storage | Small-medium DBs with maintenance windows |
| AWS Blue/Green | 30-60 seconds | Low | None | 2x during upgrade | AWS RDS managed solution |
| Dump/Restore | 2-12+ hours | Low | Low | 2x storage | Small databases or major restructuring |
Cleanup and Decommissioning
Replication Cleanup
```sql
-- Target database
DROP SUBSCRIPTION upgrade_subscription;

-- Source database
DROP PUBLICATION upgrade_publication;
SELECT pg_drop_replication_slot('slot_name');
```
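Note that `DROP SUBSCRIPTION` normally drops the slot on the source for you; the manual `pg_drop_replication_slot()` call is only needed for slots left behind (for example if the subscription was disabled first or the source was unreachable). A quick check for leftovers, since an orphaned logical slot retains WAL indefinitely:

```sql
-- Run on the source after cleanup: no logical slots should remain
SELECT slot_name, slot_type, active, wal_status
FROM pg_replication_slots;
```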
Decommissioning Timeline
- 24-48 hours: Keep PostgreSQL 16 running for emergency rollback
- 72 hours: Safe to decommission after stability validation
- Backup: Create final dump before decommissioning
Key Success Factors
- Resource Allocation: Never underestimate hardware requirements
- Primary Key Verification: Check all tables before starting
- Sequence Preparation: Generate sync commands in advance
- Emergency Procedures: Test rollback scripts before migration
- Timing: Avoid high-traffic periods and concurrent deployments
- Monitoring: Establish replication lag thresholds and alerts
Lessons from Production Failures
- First attempt: 4-hour outage from insufficient preparation
- Sequence issue: 20,000 user records lost due to duplicate keys
- Resource shortage: Migration failed at 2am with 800GB log files
- PgBouncer failure: Bypassed it with direct database connections as an emergency measure
- Schema changes: Colleague deployed during migration, broke replication worker
Success Rate: High when methodically planned, catastrophic when rushed or under-resourced
Useful Links for Further Investigation
Resources That Actually Helped Me
| Link | Description |
|---|---|
| PostgreSQL Logical Replication | The official docs. Actually useful once you get past the marketing speak. Pay attention to the restrictions section - it'll save you hours of debugging. |
| PgBouncer Configuration | This page saved my ass when PgBouncer decided to stop working for no reason. The pool modes section is critical. |
| The Hard Parts of Zero-Downtime Migrations | These guys actually tell you what goes wrong instead of pretending everything's perfect. Read this before you start. |
| Zero-Downtime Upgrade Guide | Good walkthrough that includes the actual scripts they used. Wish I'd found this before my first attempt. |
| PostgreSQL IRC Channel | #postgresql on Libera.Chat. Real humans who will help you debug at 3am when everything's on fire. |
| PostgreSQL Community Forums | The official mailing lists where actual PostgreSQL developers hang out. When you need real answers from people who wrote the code. |