PostgreSQL Logical Replication: AI-Optimized Technical Reference
Critical Warnings: What Will Kill Your Database
WAL Bloat Emergency Scenarios
- Single inactive slot can consume 500GB of WAL overnight
- Common cause: Jenkins/CI systems with stuck subscriptions
- Primary database death: slots without max_slot_wal_keep_size will retain WAL until the disk fills and the server crashes
- Recovery time: 4+ hours for a full replication rebuild on large databases (no recovery shortcuts exist)
- Production impact example: Analytics server went from 50GB to 487GB in 6 hours due to sleeping Jupyter notebook with active subscription
Failure Mode: Disk Full Cascade
- Replication slot gets stuck (subscriber crashes/network issue)
- WAL accumulates without cleanup
- Disk fills up (typical: 847GB overnight)
- Primary database crashes with "disk full" error
- Customer-facing APIs return 500 errors for hours
- Manual slot deletion required (breaks replication permanently)
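The cheapest early warning for this cascade is a check for inactive slots that are still holding WAL; a minimal sketch:
-- Inactive slots are the usual precursor to the disk-full cascade
SELECT slot_name, slot_type, database, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE NOT active;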
Configuration: Production-Ready Settings
Essential WAL Protection
# postgresql.conf on the publisher
# MANDATORY: set this or suffer database death
max_slot_wal_keep_size = 50GB

# Performance tuning that actually matters
wal_buffers = 64MB                  # default (1/32 of shared_buffers, capped at 16MB) is inadequate; restart required
wal_compression = on                # enable if paying for bandwidth
max_replication_slots = 20          # restart required
max_wal_senders = 20                # restart required
logical_decoding_work_mem = 256MB   # default 64MB causes disk spills
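If these are managed with ALTER SYSTEM instead of hand-editing postgresql.conf, a minimal sketch using the same values as above (the last three settings are still only applied at the next restart):
-- Reloadable settings take effect after pg_reload_conf()
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
ALTER SYSTEM SET wal_compression = on;
ALTER SYSTEM SET logical_decoding_work_mem = '256MB';
SELECT pg_reload_conf();

-- Recorded by ALTER SYSTEM but only applied at the next restart
ALTER SYSTEM SET wal_buffers = '64MB';
ALTER SYSTEM SET max_replication_slots = 20;
ALTER SYSTEM SET max_wal_senders = 20;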
Subscriber Performance Configuration
-- PostgreSQL 16+: parallel apply of large streamed transactions
-- (set via the streaming subscription option; there is no parallel_apply option)
ALTER SUBSCRIPTION my_sub SET (streaming = parallel);

# postgresql.conf on the subscriber
max_logical_replication_workers = 16
max_worker_processes = 16
work_mem = 64MB                     # used by each apply worker

# Subscriber-only optimization (NEVER on the publisher)
synchronous_commit = off
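For a brand-new subscriber, a hedged setup sketch: my_sub, smart_pub, and the connection details are placeholders, streaming = parallel needs PostgreSQL 16+, and binary = true requires compatible column types on both sides:
CREATE SUBSCRIPTION my_sub
    CONNECTION 'host=publisher port=5432 dbname=source_db user=repl_user sslmode=require'
    PUBLICATION smart_pub
    WITH (streaming = parallel,  -- parallel apply of large streamed transactions
          binary = true);        -- skip text conversion during transfer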
Network Optimization
-- Connection string for subscribers
'host=publisher port=5432 dbname=source_db user=repl_user
 sslmode=require keepalives_idle=600'
-- sslcompression is omitted: modern OpenSSL builds disable it and PostgreSQL 14+
-- no longer supports SSL compression server-side; use wal_compression instead
Resource Requirements and Performance Impact
CPU and Memory Impact
- Logical replication uses 2x CPU vs streaming replication due to WAL decoding overhead
- Memory per apply worker: 64MB minimum (work_mem)
- Parallel apply workers (PostgreSQL 16+): 40-60% improvement with mixed workloads, diminishing returns past 8 workers
- Decoding CPU overhead: Can reach 45% on busy systems without optimization
Network Bandwidth Impact
| Configuration | Bandwidth Multiplier | Use Case |
|---|---|---|
| Default | 1x | Baseline |
| REPLICA IDENTITY FULL | 2-4x | Analytics; expensive cross-region |
| WITH filtering (rows/columns) | 0.3-0.8x | Recommended for production |
| ALL TABLES publication | 2-10x | Never use in production |
Real-World Performance Numbers
- Production example: trimming a publication from 127 tables to 10 specific tables
  - WAL decoding CPU: 45% → 12%
  - Network traffic: 2.1GB/hour → 380MB/hour
- Apply lag expectations: 10-60 seconds is normal, with spikes during large transactions
- Disk spill threshold: a spill rate above 10% means logical_decoding_work_mem needs to be raised
Critical Monitoring Queries
WAL Retention Monitoring (Run Every 5 Minutes)
-- Emergency detection query
SELECT
slot_name,
database,
active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag_size,
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) / 1024 / 1024 AS lag_mb
FROM pg_replication_slots
WHERE pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 1073741824; -- 1GB
Performance Bottleneck Detection
-- Disk spill monitoring
SELECT slot_name, spill_txns, total_txns,
round(100.0 * spill_txns / total_txns, 1) as spill_pct
FROM pg_stat_replication_slots
WHERE total_txns > 0;
-- Apply worker status
SELECT subname, pid, latest_end_time
FROM pg_stat_subscription;
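When the spill percentage from the first query stays above roughly 10% (the threshold used elsewhere in this document), the usual fix is raising logical_decoding_work_mem, which is reloadable without a restart:
ALTER SYSTEM SET logical_decoding_work_mem = '256MB';
SELECT pg_reload_conf();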
Alert Thresholds (Based on Production Experience)
WAL Retention Alerts
- 10GB per slot: Wake someone up (hours until disaster)
- 20GB per slot: Emergency mode
- 50GB per slot: Database about to die
- Any inactive slot >1GB: Network/subscriber problem
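These thresholds can be folded into one alerting query; a minimal sketch using the values above (the severity labels are just suggestions):
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal,
       CASE
         WHEN pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 50::bigint * 1024 * 1024 * 1024 THEN 'database about to die'
         WHEN pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 20::bigint * 1024 * 1024 * 1024 THEN 'emergency mode'
         WHEN pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 10::bigint * 1024 * 1024 * 1024 THEN 'wake someone up'
         WHEN NOT active
          AND pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 1024 * 1024 * 1024 THEN 'inactive slot: investigate'
       END AS severity
FROM pg_replication_slots
WHERE pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 1024 * 1024 * 1024
   OR NOT active;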
Disk Space Alerts
- <30% free: Start monitoring closely
- <15% free: Cancel weekend plans
- <5% free: Execute nuclear option (drop slots)
Apply Performance Alerts
- <30 seconds lag: Normal operation
- 1-5 minutes lag: Monitor closely
- >10 minutes lag: Something is broken
PostgreSQL Version-Specific Capabilities
PostgreSQL 13
- First version with max_slot_wal_keep_size - upgrade mandatory for production safety
- No parallel apply workers (single-threaded bottleneck)
PostgreSQL 14
- Streaming of large in-progress transactions (streaming = on): publishers spill less to disk, subscribers start applying sooner
- Binary transfer mode for subscriptions (binary = true)
- Heartbeats via pg_logical_emit_message for idle databases (the function itself has existed since 9.6)
PostgreSQL 15
- Row filtering and column lists: major bandwidth reduction capability
- Production-ready selective replication
PostgreSQL 16
- Parallel apply workers (streaming = parallel): the 40-60% improvement cited above, plus better resource management
- Logical decoding on standbys
- Failover still needs manual preparation (2-5 minute recovery with prior setup)
PostgreSQL 17
- Automatic failover slots: 30-60 second automated recovery
- synchronized_standby_slots: Proper slot coordination
- pg_createsubscriber utility: Easier initial setup
- Improved batch commits: Better small transaction performance
Emergency Procedures
Nuclear Option: Slot Deletion
-- When disk is 90%+ full and slots are the cause
SELECT slot_name,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as lag
FROM pg_replication_slots
WHERE pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 10737418240; -- 10GB
-- Kill the slot (breaks replication permanently)
SELECT pg_drop_replication_slot('the_slot_killing_your_server');
-- Force cleanup
SELECT pg_switch_wal();
CHECKPOINT;
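pg_drop_replication_slot() refuses to drop a slot that is still marked active; if a zombie walsender is attached, terminate it first (slot name as in the example above):
SELECT pg_terminate_backend(active_pid)
FROM pg_replication_slots
WHERE slot_name = 'the_slot_killing_your_server'
  AND active;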
Heartbeat for Idle Databases
-- Prevent slot lag on quiet systems
GRANT EXECUTE ON FUNCTION pg_logical_emit_message(boolean, text, text) TO replication_user;
-- Cron this every few minutes
SELECT pg_logical_emit_message(false, 'heartbeat', now()::varchar);
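If the pg_cron extension happens to be installed (an assumption, not a requirement), the heartbeat can be scheduled from inside the database instead of system cron:
-- Run the heartbeat every 5 minutes via pg_cron
SELECT cron.schedule('logical-heartbeat', '*/5 * * * *',
       $$SELECT pg_logical_emit_message(false, 'heartbeat', now()::text)$$);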
Table Structure Requirements
Primary Key Mandate
-- Find tables that will destroy apply performance
SELECT schemaname, tablename
FROM pg_tables t
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
  AND NOT EXISTS (
      SELECT 1 FROM pg_constraint c
      WHERE c.conrelid = format('%I.%I', t.schemaname, t.tablename)::regclass
        AND c.contype = 'p'
  );
Impact: Without a primary key (or another replica identity), the publisher rejects UPDATE/DELETE on published tables outright; with REPLICA IDENTITY FULL, every replicated UPDATE/DELETE forces a full table scan on the subscriber
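Two hypothetical fixes for a table the query flags; events, its id column, and the index name are made-up examples, and the index route requires a unique, non-partial index on NOT NULL columns:
-- Preferred: add a real primary key
ALTER TABLE events ADD PRIMARY KEY (id);

-- Alternative: point the replica identity at an existing unique index
ALTER TABLE events REPLICA IDENTITY USING INDEX events_id_unique_idx;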
REPLICA IDENTITY Trade-offs
-- Bandwidth vs CPU trade-off
ALTER TABLE problematic_table REPLICA IDENTITY FULL;
- Pros: complete old-row images in the change stream (what analytics/CDC consumers want); UPDATE/DELETE keep working on tables without a primary key
- Cons: 2-4x network bandwidth usage
- Use when: Network isn't the bottleneck, need complete row images
Publication Strategy
Selective Publication (Mandatory for Production)
-- Never do this in production
CREATE PUBLICATION my_pub FOR ALL TABLES;

-- Do this instead (row filters need PostgreSQL 15+;
-- each WHERE clause binds only to the table it follows)
CREATE PUBLICATION smart_pub FOR TABLE
    users  WHERE (status != 'deleted'),
    orders WHERE (status != 'deleted'),
    inventory;
Real impact: ALL TABLES vs selective can mean 45% CPU → 12% CPU usage
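Column lists (also PostgreSQL 15+) cut bandwidth further; a small sketch with the table and columns as placeholders (the list must include the table's replica identity columns or UPDATE/DELETE publishing will fail):
-- Replicate only the columns the subscriber needs
ALTER PUBLICATION smart_pub ADD TABLE payments (id, order_id, amount, status);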
Cross-Region Deployment Considerations
Network Cost Optimization
- Same datacenter: Network is cheap, focus on CPU/memory
- Cross-AZ: Enable WAL compression (AWS charges for cross-AZ traffic)
- Cross-region: Aggressive filtering mandatory, bandwidth costs exceed server costs
- Hybrid cloud: VPN is usually the bottleneck
Common Failure Scenarios and Root Causes
Bulk Import Operations
- Problem: Large transactions overwhelm logical replication
- Impact: Single transaction can invalidate slots, hours of WAL accumulation
- Solution: Batch imports into 10K-100K row transactions (see the sketch after this list); pause replication during truly massive imports
- Example: VACUUM FULL on 200GB table generated 340GB WAL in 2 hours
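A hedged sketch of that batching approach, using a procedure so each batch commits (and replicates) as its own small transaction; target_table, staging_table, the id column, and the 50,000-row batch size are all placeholders:
CREATE PROCEDURE copy_in_batches()
LANGUAGE plpgsql AS $$
DECLARE
    moved bigint;
BEGIN
    LOOP
        -- Move up to 50K not-yet-copied rows per iteration
        INSERT INTO target_table
        SELECT s.*
        FROM staging_table s
        WHERE NOT EXISTS (SELECT 1 FROM target_table t WHERE t.id = s.id)
        LIMIT 50000;

        GET DIAGNOSTICS moved = ROW_COUNT;
        EXIT WHEN moved = 0;

        COMMIT;  -- each batch becomes its own small replicated transaction
    END LOOP;
END;
$$;

CALL copy_in_batches();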
Apply Worker Failures
- Lock contention: Long-running queries block apply workers
- Resource limits: Out of connections/memory
- Large transactions: Single huge transaction blocks everything
DDL Operation Impact
- VACUUM FULL: Rewrites entire table, massive WAL generation
- ALTER TABLE: Doesn't replicate automatically, requires manual coordination
- Solution: Plan DDL during maintenance windows, temporarily increase max_slot_wal_keep_size to 100GB+
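One way to buy headroom for a planned rewrite, using the 50GB/100GB values suggested in this document (max_slot_wal_keep_size is reloadable, so no restart is needed):
-- Before the maintenance window
ALTER SYSTEM SET max_slot_wal_keep_size = '100GB';
SELECT pg_reload_conf();

-- ... run the VACUUM FULL / ALTER TABLE ...

-- After replication catches back up
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
SELECT pg_reload_conf();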
Implementation Priority Order
Phase 1: Critical Safety (Do Immediately)
- Set max_slot_wal_keep_size = 50GB
- Implement disk space monitoring (every 5 minutes)
- Use selective publications (don't replicate everything)
Phase 2: Performance Optimization
- Increase wal_buffers to 64MB
- Enable WAL compression if network costs money
- Configure heartbeats for idle databases
Phase 3: Advanced Tuning
- Configure parallel apply workers (PostgreSQL 16+)
- Add row/column filtering
- Test failover procedures monthly
Testing Requirements
Load Testing Reality
- Use production-scale data: Dev's 10MB vs production's 2TB tables
- Concurrent load testing: pgbench with realistic transaction patterns
- Monitor during testing: WAL retention, apply lag, CPU usage
- Failure testing: Subscriber crashes, network interruptions, large transactions
Monitoring Integration
- postgres_exporter: For Prometheus/Grafana monitoring
- Alert automation: Email/PagerDuty integration for critical thresholds
- Documentation: Emergency procedures in team wiki with copy-pasteable commands
Failover Capabilities by Version
PostgreSQL 16 and Earlier
- Manual setup required: 2-5 minute recovery with preparation
- No preparation: 15 minutes to 4+ hours full rebuild
PostgreSQL 17
- Automatic failover: 30-60 seconds with synchronized_standby_slots
- Slot synchronization: Prevents invalidation during promotion
- Still requires testing: Complex but production-ready
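A minimal PostgreSQL 17 sketch of the moving parts; the names are placeholders, and the standby additionally needs hot_standby_feedback = on, a primary_slot_name, and a dbname in primary_conninfo for slot synchronization to work:
-- On the subscriber: mark the subscription's slot as a failover slot
CREATE SUBSCRIPTION my_sub
    CONNECTION 'host=publisher dbname=source_db user=repl_user sslmode=require'
    PUBLICATION smart_pub
    WITH (failover = true);

-- On the primary: make logical walsenders wait for the named physical standby slot
ALTER SYSTEM SET synchronized_standby_slots = 'standby1_slot';
SELECT pg_reload_conf();

-- On the standby: keep logical slots synchronized so they survive promotion
ALTER SYSTEM SET sync_replication_slots = on;
SELECT pg_reload_conf();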
Recovery Reality: Without preparation, you're explaining extended downtime while rebuilding from scratch.
Useful Links for Further Investigation
Resources That Don't Suck
| Link | Description |
|---|---|
| Debezium PostgreSQL Connector | CDC built on logical replication that actually works in production |