
PostgreSQL Logical Replication: AI-Optimized Technical Reference

Critical Warnings: What Will Kill Your Database

WAL Bloat Emergency Scenarios

  • Single inactive slot can consume 500GB overnight - Common cause: Jenkins/CI systems with stuck subscriptions
  • Primary database death: Slots without max_slot_wal_keep_size will fill the disk until the server crashes (the status check below shows how close each slot is)
  • Recovery time: 4+ hours for full replication rebuild on large databases (no recovery options exist)
  • Production impact example: Analytics server went from 50GB to 487GB in 6 hours due to sleeping Jupyter notebook with active subscription
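
On PostgreSQL 13+ a quick status check shows how close each slot is to being invalidated; a minimal sketch (safe_wal_size is NULL when max_slot_wal_keep_size is unset):

-- wal_status: 'reserved' is healthy, 'extended'/'unreserved' means WAL is piling up,
-- 'lost' means the slot has already been invalidated
SELECT slot_name,
       active,
       wal_status,
       pg_size_pretty(safe_wal_size) AS headroom_before_invalidation
FROM pg_replication_slots
ORDER BY safe_wal_size ASC NULLS FIRST;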

Failure Mode: Disk Full Cascade

  1. Replication slot gets stuck (subscriber crashes/network issue)
  2. WAL accumulates without cleanup
  3. Disk fills up (typical: 847GB overnight)
  4. Primary database crashes with "disk full" error
  5. Customer-facing APIs return 500 errors for hours
  6. Manual slot deletion required (breaks replication permanently)

Configuration: Production-Ready Settings

Essential WAL Protection

-- MANDATORY: Set this or suffer database death
max_slot_wal_keep_size = 50GB

-- Performance tuning that actually matters
wal_buffers = 64MB              -- Default (1/32 of shared_buffers, capped at 16MB) is often too small
wal_compression = on            -- Enable if paying for bandwidth
max_replication_slots = 20
max_wal_senders = 20
logical_decoding_work_mem = 256MB  -- Default 64MB causes disk spills
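
A sketch of applying these from psql (ALTER SYSTEM writes postgresql.auto.conf; wal_buffers, max_replication_slots and max_wal_senders only take effect after a restart, the others after a reload):

ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
ALTER SYSTEM SET wal_compression = on;
ALTER SYSTEM SET logical_decoding_work_mem = '256MB';
SELECT pg_reload_conf();                      -- reload-able settings take effect now

ALTER SYSTEM SET wal_buffers = '64MB';        -- these three need a restart
ALTER SYSTEM SET max_replication_slots = 20;
ALTER SYSTEM SET max_wal_senders = 20;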

Subscriber Performance Configuration

-- Parallel apply of large in-progress transactions (PostgreSQL 16+)
ALTER SUBSCRIPTION my_sub SET (streaming = parallel);
max_logical_replication_workers = 16
work_mem = 64MB                 -- Per apply worker
max_worker_processes = 16

-- Subscriber-only optimization (NEVER on publisher)
synchronous_commit = off
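
synchronous_commit can also be relaxed per subscription instead of instance-wide, and on PostgreSQL 16+ parallel apply is capped by max_parallel_apply_workers_per_subscription (default 2); a sketch, assuming a subscription named my_sub:

-- Per-subscription alternative to a global synchronous_commit = off
ALTER SUBSCRIPTION my_sub SET (synchronous_commit = 'off');

-- Allow more parallel apply workers for streamed transactions (PostgreSQL 16+)
ALTER SYSTEM SET max_parallel_apply_workers_per_subscription = 4;
SELECT pg_reload_conf();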

Network Optimization

-- Connection string for subscribers (sslcompression is not worth setting: modern
-- OpenSSL builds ship without SSL compression; use wal_compression instead)
'host=publisher port=5432 dbname=source_db user=repl_user
 sslmode=require keepalives_idle=600'
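
A minimal sketch of using that connection string (my_sub, my_pub and the slot name are placeholders):

-- Run on the subscriber; the slot is created on the publisher
CREATE SUBSCRIPTION my_sub
  CONNECTION 'host=publisher port=5432 dbname=source_db user=repl_user
              sslmode=require keepalives_idle=600'
  PUBLICATION my_pub
  WITH (slot_name = 'my_sub_slot', copy_data = true);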

Resource Requirements and Performance Impact

CPU and Memory Impact

  • Logical replication uses 2x CPU vs streaming replication due to WAL decoding overhead
  • Memory per apply worker: 64MB minimum (work_mem)
  • Parallel apply workers: 40-60% improvement with mixed workloads, diminishing returns past 8 workers
  • Decoding CPU overhead: Can reach 45% on busy systems without optimization

Network Bandwidth Impact

Configuration             | Bandwidth Multiplier | Use Case
--------------------------|----------------------|-----------------------------------
Default                   | 1x                   | Baseline
REPLICA IDENTITY FULL     | 2-4x                 | Analytics; expensive cross-region
WITH filtering (row/col)  | 0.3-0.8x             | Recommended for production
ALL TABLES publication    | 2-10x                | Never use in production

Real-World Performance Numbers

  • Production example: 127 tables → 10 specific tables
    • WAL decoding CPU: 45% → 12%
    • Network traffic: 2.1GB/hour → 380MB/hour
  • Apply lag expectations: 10-60 seconds normal, spikes during large transactions
  • Disk spill threshold: >10% spill rate requires more logical_decoding_work_mem

Critical Monitoring Queries

WAL Retention Monitoring (Run Every 5 Minutes)

-- Emergency detection query
SELECT
  slot_name,
  database,
  active,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag_size,
  pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) / 1024 / 1024 AS lag_mb
FROM pg_replication_slots
WHERE pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 1073741824; -- 1GB

Performance Bottleneck Detection

-- Disk spill monitoring
SELECT slot_name, spill_txns, total_txns,
       round(100.0 * spill_txns / total_txns, 1) as spill_pct
FROM pg_stat_replication_slots
WHERE total_txns > 0;

-- Apply worker status
SELECT subname, pid, latest_end_time
FROM pg_stat_subscription;
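
Lag can also be read from the publisher side via pg_stat_replication (each subscription's walsender reports there, with application_name defaulting to the subscription name); a minimal sketch:

-- Publisher-side lag per walsender / subscription
SELECT application_name,
       state,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_size,
       replay_lag            -- time-based lag; NULL while the sender is idle
FROM pg_stat_replication;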

Alert Thresholds (Based on Production Experience)

WAL Retention Alerts

  • 10GB per slot: Wake someone up (hours until disaster)
  • 20GB per slot: Emergency mode
  • 50GB per slot: Database about to die
  • Any inactive slot >1GB: Network/subscriber problem (see the alert query below)
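
A sketch that folds these thresholds into one alert query (thresholds match the list above; tune to your disk size):

-- Classify every slot by retained WAL, matching the thresholds above
SELECT slot_name, active,
       pg_size_pretty(retained) AS retained_wal,
       CASE
         WHEN retained > pg_size_bytes('50GB') THEN 'database about to die'
         WHEN retained > pg_size_bytes('20GB') THEN 'emergency mode'
         WHEN retained > pg_size_bytes('10GB') THEN 'wake someone up'
         WHEN NOT active AND retained > pg_size_bytes('1GB') THEN 'inactive slot - investigate'
         ELSE 'ok'
       END AS severity
FROM (SELECT slot_name, active,
             pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained
      FROM pg_replication_slots) s
ORDER BY retained DESC;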

Disk Space Alerts

  • <30% free: Start monitoring closely
  • <15% free: Cancel weekend plans
  • <5% free: Execute nuclear option (drop slots)

Apply Performance Alerts

  • <30 seconds lag: Normal operation
  • 1-5 minutes lag: Monitor closely
  • >10 minutes lag: Something is broken

PostgreSQL Version-Specific Capabilities

PostgreSQL 13

  • First version with max_slot_wal_keep_size - upgrade mandatory for production safety
  • No parallel apply workers (single-threaded bottleneck)

PostgreSQL 14

  • Streaming of in-progress transactions (streaming = on): large transactions reach the subscriber before commit, reducing apply-lag spikes; apply is still single-threaded
  • pg_stat_replication_slots view: disk spill statistics (spill_txns, spill_bytes) used by the monitoring queries above

PostgreSQL 15

  • Row filtering and column lists: Major bandwidth reduction capability
  • Production-ready selective replication

PostgreSQL 16

  • Parallel apply of large in-progress transactions (streaming = parallel): 40-60% improvement on mixed workloads
  • Logical decoding on standbys: makes manual failover preparation practical (2-5 minute recovery with setup)

PostgreSQL 17

  • Automatic failover slots: 30-60 second automated recovery
  • synchronized_standby_slots: keeps failover slots from advancing past their physical standbys (sketch below)
  • pg_createsubscriber utility: Easier initial setup
  • Improved batch commits: Better small transaction performance
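
A hedged sketch of wiring these PostgreSQL 17 pieces together (subscription, publication, and slot names are illustrative; the standby must already stream through a physical slot):

-- Subscriber: mark the subscription's slot as a failover slot
CREATE SUBSCRIPTION my_sub
  CONNECTION 'host=publisher dbname=source_db user=repl_user'
  PUBLICATION my_pub
  WITH (failover = true);

-- Primary: logical slots may not advance past this physical standby's slot
ALTER SYSTEM SET synchronized_standby_slots = 'standby1_phys_slot';
SELECT pg_reload_conf();

-- Standby: enable the slot synchronization worker
ALTER SYSTEM SET sync_replication_slots = on;
SELECT pg_reload_conf();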

Emergency Procedures

Nuclear Option: Slot Deletion

-- When disk is 90%+ full and slots are the cause
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as lag
FROM pg_replication_slots
WHERE pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 10737418240; -- 10GB

-- Kill the slot (breaks replication permanently)
SELECT pg_drop_replication_slot('the_slot_killing_your_server');

-- Force cleanup
SELECT pg_switch_wal();
CHECKPOINT;
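
Dropping the slot strands the matching subscription on the subscriber (a plain DROP SUBSCRIPTION tries to drop the now-missing remote slot and fails); the usual cleanup, assuming the subscription is called dead_sub:

-- On the subscriber: detach the subscription from the dropped slot, then remove it
ALTER SUBSCRIPTION dead_sub DISABLE;
ALTER SUBSCRIPTION dead_sub SET (slot_name = NONE);
DROP SUBSCRIPTION dead_sub;
-- Replication now has to be rebuilt: new slot, new subscription, full initial copy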

Heartbeat for Idle Databases

-- Prevent slot lag on quiet systems
GRANT EXECUTE ON FUNCTION pg_logical_emit_message(boolean, text, text) TO replication_user;

-- Cron this every few minutes
SELECT pg_logical_emit_message(false, 'heartbeat', now()::varchar);

Table Structure Requirements

Primary Key Mandate

-- Find tables that will destroy apply performance
SELECT schemaname, tablename
FROM pg_tables t
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
  AND NOT EXISTS (
    SELECT 1 FROM pg_constraint c
    WHERE c.conrelid = format('%I.%I', t.schemaname, t.tablename)::regclass
      AND c.contype = 'p'
  );

Impact: Tables without primary keys force full table scans for every UPDATE/DELETE
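
Two ways to fix a table the query flags, sketched with hypothetical names (REPLICA IDENTITY USING INDEX requires a unique, non-partial index on NOT NULL columns):

-- Preferred: add a real primary key
ALTER TABLE events ADD PRIMARY KEY (event_id);

-- Alternative: use an existing unique index as the replica identity
CREATE UNIQUE INDEX events_natural_key ON events (source_id, occurred_at);
ALTER TABLE events REPLICA IDENTITY USING INDEX events_natural_key;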

REPLICA IDENTITY Trade-offs

-- Bandwidth vs CPU trade-off
ALTER TABLE problematic_table REPLICA IDENTITY FULL;

  • Pros: every change carries the complete old-row image, so CDC/analytics consumers get consistent data
  • Cons: 2-4x network bandwidth usage
  • Use when: network isn't the bottleneck and you need complete row images

Publication Strategy

Selective Publication (Mandatory for Production)

-- Never do this in production
CREATE PUBLICATION my_pub FOR ALL TABLES;

-- Do this instead (row filters require PostgreSQL 15+; a WHERE clause applies
-- only to the table it immediately follows)
CREATE PUBLICATION smart_pub FOR TABLE
  users,
  inventory,
  orders WHERE (status != 'deleted');

Real impact: ALL TABLES vs selective can mean 45% CPU → 12% CPU usage
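
On PostgreSQL 15+ the same idea extends to column lists, and any table added to a publication later needs a refresh on each subscriber; a sketch (orders_slim, payments and smart_sub are placeholders):

-- Replicate only the columns the subscriber needs (PostgreSQL 15+)
-- Note: if UPDATE/DELETE are published, filter columns must be covered by the replica identity
CREATE PUBLICATION orders_slim FOR TABLE
  orders (order_id, customer_id, status, total) WHERE (status != 'deleted');

-- Adding a table to an existing publication requires a refresh on the subscriber
ALTER PUBLICATION smart_pub ADD TABLE payments;
ALTER SUBSCRIPTION smart_sub REFRESH PUBLICATION;   -- run on the subscriber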

Cross-Region Deployment Considerations

Network Cost Optimization

  • Same datacenter: Network is cheap, focus on CPU/memory
  • Cross-AZ: Enable WAL compression (AWS charges for cross-AZ traffic)
  • Cross-region: Aggressive filtering mandatory, bandwidth costs exceed server costs
  • Hybrid cloud: VPN is usually the bottleneck

Common Failure Scenarios and Root Causes

Bulk Import Operations

  • Problem: Large transactions overwhelm logical replication
  • Impact: Single transaction can invalidate slots, hours of WAL accumulation
  • Solution: Batch imports to 10K-100K rows; pause replication during massive imports (see the pause/resume sketch below)
  • Example: VACUUM FULL on 200GB table generated 340GB WAL in 2 hours
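
A sketch of the pause/resume mechanics around a bulk load (subscription name and sizes are illustrative; note that a disabled subscription's slot keeps retaining WAL on the publisher, so the raised limit is what actually buys headroom):

-- Publisher: temporarily raise the retention ceiling
ALTER SYSTEM SET max_slot_wal_keep_size = '150GB';
SELECT pg_reload_conf();

-- Subscriber: stop applying during the import
ALTER SUBSCRIPTION my_sub DISABLE;

-- ... run the batched import ...

-- Subscriber: resume, let it catch up, then restore the normal ceiling
ALTER SUBSCRIPTION my_sub ENABLE;
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';   -- publisher, after lag clears
SELECT pg_reload_conf();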

Apply Worker Failures

  1. Lock contention: Long-running queries block apply workers
  2. Resource limits: Out of connections/memory
  3. Large transactions: Single huge transaction blocks everything

DDL Operation Impact

  • VACUUM FULL: Rewrites entire table, massive WAL generation
  • ALTER TABLE: Doesn't replicate automatically and requires manual coordination on both sides (example below)
  • Solution: Plan DDL during maintenance windows, temporarily increase max_slot_wal_keep_size to 100GB+
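
A hedged sketch of coordinating a column addition (table and column names are illustrative): additive DDL goes to the subscriber first so incoming rows always have somewhere to land, while destructive DDL goes to the publisher first.

-- 1. Subscriber first: the new column must exist before changed rows arrive
ALTER TABLE orders ADD COLUMN discount numeric;

-- 2. Then the publisher: start producing rows that include the new column
ALTER TABLE orders ADD COLUMN discount numeric;

-- For DROP COLUMN, reverse the order: publisher first, subscriber second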

Implementation Priority Order

Phase 1: Critical Safety (Do Immediately)

  1. Set max_slot_wal_keep_size = 50GB
  2. Implement disk space monitoring (every 5 minutes)
  3. Use selective publications (don't replicate everything)

Phase 2: Performance Optimization

  1. Increase wal_buffers to 64MB
  2. Enable WAL compression if network costs money
  3. Configure heartbeats for idle databases

Phase 3: Advanced Tuning

  1. Configure parallel apply workers (PostgreSQL 16+)
  2. Add row/column filtering
  3. Test failover procedures monthly

Testing Requirements

Load Testing Reality

  • Use production-scale data: Dev's 10MB vs production's 2TB tables
  • Concurrent load testing: pgbench with realistic transaction patterns
  • Monitor during testing: WAL retention, apply lag, CPU usage
  • Failure testing: Subscriber crashes, network interruptions, large transactions

Monitoring Integration

  • postgres_exporter: For Prometheus/Grafana monitoring
  • Alert automation: Email/PagerDuty integration for critical thresholds
  • Documentation: Emergency procedures in team wiki with copy-pasteable commands

Failover Capabilities by Version

PostgreSQL 16 and Earlier

  • Manual setup required: 2-5 minute recovery with preparation
  • No preparation: 15 minutes to 4+ hours full rebuild

PostgreSQL 17

  • Automatic failover: 30-60 seconds with synchronized_standby_slots
  • Slot synchronization: Prevents invalidation during promotion
  • Still requires testing: Complex but production-ready

Recovery Reality: Without preparation, you're explaining extended downtime while rebuilding from scratch.

Useful Links for Further Investigation

Resources That Don't Suck

Link                          | Description
------------------------------|--------------------------------------------------------------------
Debezium PostgreSQL Connector | CDC built on logical replication that actually works in production
