PostgreSQL Logical Replication: AI-Optimized Technical Reference
Critical Warnings: What Will Kill Your Database
WAL Bloat Emergency Scenarios
- Single inactive slot can consume 500GB of WAL overnight
- Common cause: Jenkins/CI systems with stuck subscriptions
- Primary database death: slots without max_slot_wal_keep_size will retain WAL until the disk fills and the server crashes
- Recovery time: 4+ hours for a full replication rebuild on large databases (no recovery shortcuts exist)
- Production impact example: Analytics server went from 50GB to 487GB in 6 hours due to sleeping Jupyter notebook with active subscription
Failure Mode: Disk Full Cascade
- Replication slot gets stuck (subscriber crashes/network issue)
- WAL accumulates without cleanup
- Disk fills up (typical: 847GB overnight)
- Primary database crashes with "disk full" error
- Customer-facing APIs return 500 errors for hours
- Manual slot deletion required (breaks replication permanently)
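The cheapest early warning for this cascade is a check for inactive slots that are still holding WAL; a minimal sketch:
-- Inactive slots are the usual precursor to the disk-full cascade
SELECT slot_name, slot_type, database, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE NOT active;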
Configuration: Production-Ready Settings
Essential WAL Protection
# postgresql.conf on the publisher
# MANDATORY: set this or suffer database death
max_slot_wal_keep_size = 50GB

# Performance tuning that actually matters
wal_buffers = 64MB                  # default (1/32 of shared_buffers, capped at 16MB) is inadequate; restart required
wal_compression = on                # enable if paying for bandwidth
max_replication_slots = 20          # restart required
max_wal_senders = 20                # restart required
logical_decoding_work_mem = 256MB   # default 64MB causes disk spills
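If these are managed with ALTER SYSTEM instead of hand-editing postgresql.conf, a minimal sketch using the same values as above (the last three settings are still only applied at the next restart):
-- Reloadable settings take effect after pg_reload_conf()
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
ALTER SYSTEM SET wal_compression = on;
ALTER SYSTEM SET logical_decoding_work_mem = '256MB';
SELECT pg_reload_conf();

-- Recorded by ALTER SYSTEM but only applied at the next restart
ALTER SYSTEM SET wal_buffers = '64MB';
ALTER SYSTEM SET max_replication_slots = 20;
ALTER SYSTEM SET max_wal_senders = 20;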
Subscriber Performance Configuration
-- PostgreSQL 16+: parallel apply of large streamed transactions
-- (set via the streaming subscription option; there is no parallel_apply option)
ALTER SUBSCRIPTION my_sub SET (streaming = parallel);

# postgresql.conf on the subscriber
max_logical_replication_workers = 16
max_worker_processes = 16
work_mem = 64MB                     # used by each apply worker

# Subscriber-only optimization (NEVER on the publisher)
synchronous_commit = off
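For a brand-new subscriber, a hedged setup sketch: my_sub, smart_pub, and the connection details are placeholders, streaming = parallel needs PostgreSQL 16+, and binary = true requires compatible column types on both sides:
CREATE SUBSCRIPTION my_sub
    CONNECTION 'host=publisher port=5432 dbname=source_db user=repl_user sslmode=require'
    PUBLICATION smart_pub
    WITH (streaming = parallel,  -- parallel apply of large streamed transactions
          binary = true);        -- skip text conversion during transfer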
Network Optimization
-- Connection string for subscribers
'host=publisher port=5432 dbname=source_db user=repl_user
 sslmode=require keepalives_idle=600'
-- sslcompression is omitted: modern OpenSSL builds disable it and PostgreSQL 14+
-- no longer supports SSL compression server-side; use wal_compression instead
Resource Requirements and Performance Impact
CPU and Memory Impact
- Logical replication uses 2x CPU vs streaming replication due to WAL decoding overhead
- Memory per apply worker: 64MB minimum (work_mem)
- Parallel apply workers (PostgreSQL 16+): 40-60% improvement with mixed workloads, diminishing returns past 8 workers
- Decoding CPU overhead: Can reach 45% on busy systems without optimization
Network Bandwidth Impact
| Configuration | Bandwidth Multiplier | Use Case |
|---|---|---|
| Default | 1x | Baseline |
| REPLICA IDENTITY FULL | 2-4x | Analytics; expensive cross-region |
| WITH filtering (rows/columns) | 0.3-0.8x | Recommended for production |
| ALL TABLES publication | 2-10x | Never use in production |
Real-World Performance Numbers
- Production example: trimming a publication from 127 tables to 10 specific tables
  - WAL decoding CPU: 45% → 12%
  - Network traffic: 2.1GB/hour → 380MB/hour
- Apply lag expectations: 10-60 seconds is normal, with spikes during large transactions
- Disk spill threshold: a spill rate above 10% means logical_decoding_work_mem needs to be raised
Critical Monitoring Queries
WAL Retention Monitoring (Run Every 5 Minutes)
-- Emergency detection query
SELECT
slot_name,
database,
active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag_size,
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) / 1024 / 1024 AS lag_mb
FROM pg_replication_slots
WHERE pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 1073741824; -- 1GB
Performance Bottleneck Detection
-- Disk spill monitoring
SELECT slot_name, spill_txns, total_txns,
round(100.0 * spill_txns / total_txns, 1) as spill_pct
FROM pg_stat_replication_slots
WHERE total_txns > 0;
-- Apply worker status
SELECT subname, pid, latest_end_time
FROM pg_stat_subscription;
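When the spill percentage from the first query stays above roughly 10% (the threshold used elsewhere in this document), the usual fix is raising logical_decoding_work_mem, which is reloadable without a restart:
ALTER SYSTEM SET logical_decoding_work_mem = '256MB';
SELECT pg_reload_conf();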
Alert Thresholds (Based on Production Experience)
WAL Retention Alerts
- 10GB per slot: Wake someone up (hours until disaster)
- 20GB per slot: Emergency mode
- 50GB per slot: Database about to die
- Any inactive slot >1GB: Network/subscriber problem
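These thresholds can be folded into one alerting query; a minimal sketch using the values above (the severity labels are just suggestions):
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal,
       CASE
         WHEN pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 50::bigint * 1024 * 1024 * 1024 THEN 'database about to die'
         WHEN pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 20::bigint * 1024 * 1024 * 1024 THEN 'emergency mode'
         WHEN pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 10::bigint * 1024 * 1024 * 1024 THEN 'wake someone up'
         WHEN NOT active
          AND pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 1024 * 1024 * 1024 THEN 'inactive slot: investigate'
       END AS severity
FROM pg_replication_slots
WHERE pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 1024 * 1024 * 1024
   OR NOT active;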
Disk Space Alerts
- <30% free: Start monitoring closely
- <15% free: Cancel weekend plans
- <5% free: Execute nuclear option (drop slots)
Apply Performance Alerts
- <30 seconds lag: Normal operation
- 1-5 minutes lag: Monitor closely
- >10 minutes lag: Something is broken
PostgreSQL Version-Specific Capabilities
PostgreSQL 13
- First version with max_slot_wal_keep_size - upgrade mandatory for production safety
- No parallel apply workers (single-threaded bottleneck)
PostgreSQL 14
- Streaming of large in-progress transactions (streaming = on): publishers spill less to disk, subscribers start applying sooner
- Binary transfer mode for subscriptions (binary = true)
- Heartbeats via pg_logical_emit_message for idle databases (the function itself has existed since 9.6)
PostgreSQL 15
- Row filtering and column lists: major bandwidth reduction capability
- Production-ready selective replication
PostgreSQL 16
- Parallel apply workers (streaming = parallel): the 40-60% improvement cited above, plus better resource management
- Logical decoding on standbys
- Failover still needs manual preparation (2-5 minute recovery with prior setup)
PostgreSQL 17
- Automatic failover slots: 30-60 second automated recovery
- synchronized_standby_slots: Proper slot coordination
- pg_createsubscriber utility: Easier initial setup
- Improved batch commits: Better small transaction performance
Emergency Procedures
Nuclear Option: Slot Deletion
-- When disk is 90%+ full and slots are the cause
SELECT slot_name,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as lag
FROM pg_replication_slots
WHERE pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 10737418240; -- 10GB
-- Kill the slot (breaks replication permanently)
SELECT pg_drop_replication_slot('the_slot_killing_your_server');
-- Force cleanup
SELECT pg_switch_wal();
CHECKPOINT;
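pg_drop_replication_slot() refuses to drop a slot that is still marked active; if a zombie walsender is attached, terminate it first (slot name as in the example above):
SELECT pg_terminate_backend(active_pid)
FROM pg_replication_slots
WHERE slot_name = 'the_slot_killing_your_server'
  AND active;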
Heartbeat for Idle Databases
-- Prevent slot lag on quiet systems
GRANT EXECUTE ON FUNCTION pg_logical_emit_message(boolean, text, text) TO replication_user;
-- Cron this every few minutes
SELECT pg_logical_emit_message(false, 'heartbeat', now()::varchar);
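If the pg_cron extension happens to be installed (an assumption, not a requirement), the heartbeat can be scheduled from inside the database instead of system cron:
-- Run the heartbeat every 5 minutes via pg_cron
SELECT cron.schedule('logical-heartbeat', '*/5 * * * *',
       $$SELECT pg_logical_emit_message(false, 'heartbeat', now()::text)$$);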
Table Structure Requirements
Primary Key Mandate
-- Find tables that will destroy apply performance
SELECT schemaname, tablename
FROM pg_tables t
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
  AND NOT EXISTS (
      SELECT 1 FROM pg_constraint c
      WHERE c.conrelid = format('%I.%I', t.schemaname, t.tablename)::regclass
        AND c.contype = 'p'
  );
Impact: Without a primary key (or another replica identity), the publisher rejects UPDATE/DELETE on published tables outright; with REPLICA IDENTITY FULL, every replicated UPDATE/DELETE forces a full table scan on the subscriber
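Two hypothetical fixes for a table the query flags; events, its id column, and the index name are made-up examples, and the index route requires a unique, non-partial index on NOT NULL columns:
-- Preferred: add a real primary key
ALTER TABLE events ADD PRIMARY KEY (id);

-- Alternative: point the replica identity at an existing unique index
ALTER TABLE events REPLICA IDENTITY USING INDEX events_id_unique_idx;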
REPLICA IDENTITY Trade-offs
-- Bandwidth vs CPU trade-off
ALTER TABLE problematic_table REPLICA IDENTITY FULL;
- Pros: complete old-row images in the change stream (what analytics/CDC consumers want); UPDATE/DELETE keep working on tables without a primary key
- Cons: 2-4x network bandwidth usage
- Use when: Network isn't the bottleneck, need complete row images
Publication Strategy
Selective Publication (Mandatory for Production)
-- Never do this in production
CREATE PUBLICATION my_pub FOR ALL TABLES;

-- Do this instead (row filters need PostgreSQL 15+;
-- each WHERE clause binds only to the table it follows)
CREATE PUBLICATION smart_pub FOR TABLE
    users  WHERE (status != 'deleted'),
    orders WHERE (status != 'deleted'),
    inventory;
Real impact: ALL TABLES vs selective can mean 45% CPU → 12% CPU usage
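Column lists (also PostgreSQL 15+) cut bandwidth further; a small sketch with the table and columns as placeholders (the list must include the table's replica identity columns or UPDATE/DELETE publishing will fail):
-- Replicate only the columns the subscriber needs
ALTER PUBLICATION smart_pub ADD TABLE payments (id, order_id, amount, status);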
Cross-Region Deployment Considerations
Network Cost Optimization
- Same datacenter: Network is cheap, focus on CPU/memory
- Cross-AZ: Enable WAL compression (AWS charges for cross-AZ traffic)
- Cross-region: Aggressive filtering mandatory, bandwidth costs exceed server costs
- Hybrid cloud: VPN is usually the bottleneck
Common Failure Scenarios and Root Causes
Bulk Import Operations
- Problem: Large transactions overwhelm logical replication
- Impact: Single transaction can invalidate slots, hours of WAL accumulation
- Solution: Batch imports into 10K-100K row transactions (see the sketch after this list); pause replication during truly massive imports
- Example: VACUUM FULL on 200GB table generated 340GB WAL in 2 hours
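A hedged sketch of that batching approach, using a procedure so each batch commits (and replicates) as its own small transaction; target_table, staging_table, the id column, and the 50,000-row batch size are all placeholders:
CREATE PROCEDURE copy_in_batches()
LANGUAGE plpgsql AS $$
DECLARE
    moved bigint;
BEGIN
    LOOP
        -- Move up to 50K not-yet-copied rows per iteration
        INSERT INTO target_table
        SELECT s.*
        FROM staging_table s
        WHERE NOT EXISTS (SELECT 1 FROM target_table t WHERE t.id = s.id)
        LIMIT 50000;

        GET DIAGNOSTICS moved = ROW_COUNT;
        EXIT WHEN moved = 0;

        COMMIT;  -- each batch becomes its own small replicated transaction
    END LOOP;
END;
$$;

CALL copy_in_batches();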
Apply Worker Failures
- Lock contention: Long-running queries block apply workers
- Resource limits: Out of connections/memory
- Large transactions: Single huge transaction blocks everything
DDL Operation Impact
- VACUUM FULL: Rewrites entire table, massive WAL generation
- ALTER TABLE: Doesn't replicate automatically, requires manual coordination
- Solution: Plan DDL during maintenance windows, temporarily increase max_slot_wal_keep_size to 100GB+
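One way to buy headroom for a planned rewrite, using the 50GB/100GB values suggested in this document (max_slot_wal_keep_size is reloadable, so no restart is needed):
-- Before the maintenance window
ALTER SYSTEM SET max_slot_wal_keep_size = '100GB';
SELECT pg_reload_conf();

-- ... run the VACUUM FULL / ALTER TABLE ...

-- After replication catches back up
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
SELECT pg_reload_conf();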
Implementation Priority Order
Phase 1: Critical Safety (Do Immediately)
- Set max_slot_wal_keep_size = 50GB
- Implement disk space monitoring (every 5 minutes)
- Use selective publications (don't replicate everything)
Phase 2: Performance Optimization
- Increase wal_buffers to 64MB
- Enable WAL compression if network costs money
- Configure heartbeats for idle databases
Phase 3: Advanced Tuning
- Configure parallel apply workers (PostgreSQL 16+)
- Add row/column filtering
- Test failover procedures monthly
Testing Requirements
Load Testing Reality
- Use production-scale data: Dev's 10MB vs production's 2TB tables
- Concurrent load testing: pgbench with realistic transaction patterns
- Monitor during testing: WAL retention, apply lag, CPU usage
- Failure testing: Subscriber crashes, network interruptions, large transactions
Monitoring Integration
- postgres_exporter: For Prometheus/Grafana monitoring
- Alert automation: Email/PagerDuty integration for critical thresholds
- Documentation: Emergency procedures in team wiki with copy-pasteable commands
Failover Capabilities by Version
PostgreSQL 16 and Earlier
- Manual setup required: 2-5 minute recovery with preparation
- No preparation: 15 minutes to 4+ hours full rebuild
PostgreSQL 17
- Automatic failover: 30-60 seconds with synchronized_standby_slots
- Slot synchronization: Prevents invalidation during promotion
- Still requires testing: Complex but production-ready
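A minimal PostgreSQL 17 sketch of the moving parts; the names are placeholders, and the standby additionally needs hot_standby_feedback = on, a primary_slot_name, and a dbname in primary_conninfo for slot synchronization to work:
-- On the subscriber: mark the subscription's slot as a failover slot
CREATE SUBSCRIPTION my_sub
    CONNECTION 'host=publisher dbname=source_db user=repl_user sslmode=require'
    PUBLICATION smart_pub
    WITH (failover = true);

-- On the primary: make logical walsenders wait for the named physical standby slot
ALTER SYSTEM SET synchronized_standby_slots = 'standby1_slot';
SELECT pg_reload_conf();

-- On the standby: keep logical slots synchronized so they survive promotion
ALTER SYSTEM SET sync_replication_slots = on;
SELECT pg_reload_conf();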
Recovery Reality: Without preparation, you're explaining extended downtime while rebuilding from scratch.
Useful Links for Further Investigation
Resources That Don't Suck
| Link | Description |
|---|---|
| Debezium PostgreSQL Connector | CDC built on logical replication that actually works in production |