PostgreSQL logical replication actually works reliably once you figure out all the WAL settings that the official docs barely mention. I learned this the hard way when our WAL directory consumed 500GB during a weekend deployment because nobody told me about max_slot_wal_keep_size.
PostgreSQL streams changes through the WAL (Write-Ahead Log) using replication slots. The slot keeps track of where your connector is in the stream, so restarts don't lose data. Simple concept, but the configuration will bite you if you're not careful.
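To see the mechanism for yourself, you can create a throwaway slot with the built-in test_decoding plugin, peek at pending changes, and drop it. A minimal sketch, assuming a superuser or REPLICATION-capable role; the slot name scratch_slot is just for illustration:
-- Create a throwaway logical slot (test_decoding ships with PostgreSQL)
SELECT pg_create_logical_replication_slot('scratch_slot', 'test_decoding');
-- Peek at pending changes without consuming them (NULLs = no LSN or row limit)
SELECT * FROM pg_logical_slot_peek_changes('scratch_slot', NULL, NULL);
-- Drop it when done - a forgotten slot retains WAL forever
SELECT pg_drop_replication_slot('scratch_slot');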
Essential PostgreSQL Configuration
postgresql.conf settings that prevent common disasters:
## WAL configuration for CDC
wal_level = logical
max_wal_senders = 10
max_replication_slots = 10
wal_keep_size = 1GB
max_slot_wal_keep_size = 5GB
## Connection management
max_connections = 300
shared_preload_libraries = 'wal2json'
## Performance tuning for CDC
wal_buffers = 64MB
checkpoint_completion_target = 0.9
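Note that changing wal_level requires a full server restart, not just a reload. After restarting, I verify the values actually took effect - a quick check against pg_settings:
SELECT name, setting, pending_restart
FROM pg_settings
WHERE name IN ('wal_level', 'max_wal_senders', 'max_replication_slots',
               'wal_keep_size', 'max_slot_wal_keep_size');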
Why these settings matter (learned from production failures):
- max_slot_wal_keep_size = 5GB prevents WAL from eating your entire disk when CDC falls behind - without this, I've seen WAL consume 200GB+ during a 4-hour connector outage.
- max_replication_slots = 10 allows multiple CDC connectors without hitting the default limit of 5 - discovered this at 2am when the sixth connector failed to start.
- The wal2json plugin performs 30% better than the default pgoutput for high-volume changes, but crashes on PostgreSQL 13.2 specifically - test your version (see the sanity check after this list).
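A minimal sanity check for the plugin, assuming the wal2json library is actually installed on the database host (the slot name wal2json_check is just for illustration) - creating a slot against it fails immediately if the library isn't available for your server version:
-- Errors out if the wal2json output plugin isn't installed for this server
SELECT pg_create_logical_replication_slot('wal2json_check', 'wal2json');
-- Clean up right away so the test slot doesn't retain WAL
SELECT pg_drop_replication_slot('wal2json_check');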
Debezium PostgreSQL Connector Configuration
Working production connector config:
{
  "name": "postgres-cdc-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "database.hostname": "postgres.internal",
    "database.port": "5432",
    "database.user": "debezium_user",
    "database.password": "${file:/opt/kafka/secrets/db-password.txt:password}",
    "database.dbname": "production_db",
    "database.server.name": "postgres-server",
    "slot.name": "debezium_slot",
    "publication.name": "dbz_publication",
    "plugin.name": "wal2json",
    "table.include.list": "public.orders,public.payments,public.inventory",
    "schema.history.internal.kafka.topic": "schema-changes.postgres",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "database.connectionTimeoutInMs": 30000,
    "heartbeat.interval.ms": 60000,
    "max.queue.size": 16000,
    "max.batch.size": 4096,
    "poll.interval.ms": 1000
  }
}
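Once the connector is registered and running, I confirm on the database side that it actually created its slot and is streaming - a quick sketch using the slot.name and publication.name from the config above:
SELECT slot_name, plugin, active, restart_lsn
FROM pg_replication_slots
WHERE slot_name = 'debezium_slot';

SELECT * FROM pg_publication_tables
WHERE pubname = 'dbz_publication';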
Database user setup:
-- Create CDC user with proper permissions
CREATE USER debezium_user WITH REPLICATION LOGIN PASSWORD 'secure_password';
GRANT CONNECT ON DATABASE production_db TO debezium_user;
GRANT USAGE ON SCHEMA public TO debezium_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO debezium_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO debezium_user;
-- Create publication for tables you want to monitor
CREATE PUBLICATION dbz_publication FOR TABLE public.orders, public.payments, public.inventory;
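If the connector fails on startup with a permission error, the first thing I check is whether the role really has the REPLICATION and LOGIN attributes - a small sketch:
SELECT rolname, rolreplication, rolcanlogin
FROM pg_roles
WHERE rolname = 'debezium_user';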
Critical Monitoring Queries
WAL retention monitoring (set up alerts when lag_size > 1GB):
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag_size,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS flush_lag
FROM pg_replication_slots
WHERE slot_name = 'debezium_slot';
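If the lag keeps growing and the connector is gone for good, the only way to release the retained WAL is to drop the slot - only do this when you're certain nothing will ever resume from it. A sketch, assuming superuser or pg_monitor access for the disk check:
-- Total WAL currently sitting on disk (the pg_wal directory)
SELECT pg_size_pretty(sum(size)) AS wal_on_disk FROM pg_ls_waldir();
-- Irreversible: drops the slot so PostgreSQL can recycle its retained WAL
SELECT pg_drop_replication_slot('debezium_slot');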
Connection usage monitoring:
SELECT count(*), state, usename
FROM pg_stat_activity
WHERE usename = 'debezium_user'
GROUP BY state, usename;
Performance Optimization Patterns
For high-volume tables, use REPLICA IDENTITY FULL to capture complete row data:
ALTER TABLE high_volume_orders REPLICA IDENTITY FULL;
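Keep in mind that REPLICA IDENTITY FULL makes PostgreSQL log the entire old row on every update and delete, so apply it selectively. To see what each table is currently set to, a sketch assuming everything lives in the public schema:
SELECT relname,
       CASE relreplident
            WHEN 'd' THEN 'default (primary key)'
            WHEN 'f' THEN 'full'
            WHEN 'i' THEN 'index'
            WHEN 'n' THEN 'nothing'
       END AS replica_identity
FROM pg_class
WHERE relkind = 'r'
  AND relnamespace = 'public'::regnamespace;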
For tables with large JSONB/TEXT columns, exclude unnecessary fields:
{
  "column.exclude.list": "public.user_profile.large_metadata,public.logs.full_stacktrace"
}
Common PostgreSQL CDC Pitfalls
WAL accumulation during maintenance: Always pause connectors before long-running operations or face disk space hell:
## Before maintenance (NEVER skip this)
curl -X PUT localhost:8083/connectors/postgres-cdc-connector/pause
## After maintenance
curl -X PUT localhost:8083/connectors/postgres-cdc-connector/resume
I once forgot to pause connectors before a 6-hour database migration. WAL files grew to 180GB and filled the disk, taking down the entire database. Learn from my stupidity.
Schema evolution nightmare: Test every schema change in staging with CDC running - no exceptions. PostgreSQL handles adding columns fine, but renaming or dropping columns destroys connectors with cryptic error messages like "slot has been invalidated because it exceeded the maximum reserved size".
TOAST field disasters: Large JSONB or TEXT fields stored in TOAST tables crash connectors with OOM errors that won't show up until you hit production load. I spent 3 days debugging why our connector died every Tuesday at 2pm - turns out that's when the weekly report export triggered, and large JSON blobs killed the connector.
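To find out ahead of time which tables are likely to trigger this, I look at TOAST sizes - a rough sketch; the tables it surfaces will of course be from your own schema:
-- Tables with the largest TOAST (out-of-line) storage
SELECT c.relname AS table_name,
       pg_size_pretty(pg_total_relation_size(c.reltoastrelid)) AS toast_size
FROM pg_class c
WHERE c.reltoastrelid <> 0
  AND c.relkind = 'r'
ORDER BY pg_total_relation_size(c.reltoastrelid) DESC
LIMIT 10;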
Connection exhaustion reality: CDC connectors hold their database connections permanently. Forget to account for this and your application will start throwing "FATAL: remaining connection slots are reserved for non-replication superuser connections" errors. Use pgbouncer for app connections and dedicate separate connection pools for CDC.
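A quick way to see how close you are to the ceiling - counting client sessions against max_connections (the backend_type filter assumes PostgreSQL 10+):
SELECT count(*) AS connections_in_use,
       current_setting('max_connections')::int AS max_connections
FROM pg_stat_activity
WHERE backend_type = 'client backend';  -- ignore background workers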