PostgreSQL Streaming Replication: AI-Optimized Production Guide
Configuration Requirements
Infrastructure Specifications
- Matching PostgreSQL versions required: major versions must be identical, and keeping minor versions in sync (14.8 alongside 14.9) spares you hard-to-diagnose replication failures
- Network connectivity: Dedicated NICs recommended for replication traffic
- Disk space: Minimum 3x primary database size for WAL accumulation during outages
- Critical failure point: 50GB database can generate 200GB WAL during weekend network outage
- Production reality: Size for disaster scenarios, not normal operation
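To know whether the 3x sizing is paranoid enough, watch what pg_wal is actually doing. A quick check sketch, assuming PostgreSQL 10+ (pg_ls_waldir() is restricted to superusers and pg_monitor members):

-- Run on the primary: current size of the pg_wal directory
SELECT pg_size_pretty(sum(size)) AS wal_dir_size,
       count(*)                  AS wal_segments
FROM pg_ls_waldir();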
Version Compatibility
- PostgreSQL 15+ recommended for monitoring improvements
- Streaming replication: Same major versions only (14↔15 incompatible)
- Cross-version needs: Use logical replication with additional complexity
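A minimal sketch of the logical-replication route for cross-version moves, assuming wal_level = logical on the source, the table definitions already created on the target, and illustrative names (orders, appdb); the connection role also needs SELECT on the published tables for the initial copy:

-- On the source cluster (older major version)
CREATE PUBLICATION upgrade_pub FOR TABLE orders;

-- On the target cluster (newer major version); initial data copy starts automatically
CREATE SUBSCRIPTION upgrade_sub
    CONNECTION 'host=10.0.1.99 port=5432 dbname=appdb user=replication_user password=secure_password'
    PUBLICATION upgrade_pub;

Sequences, DDL, and large objects don't replicate logically, which is most of the "additional complexity."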
Primary Server Configuration
Essential postgresql.conf Settings
wal_level = replica # Required for standby WAL data
max_wal_senders = 5 # Each standby + backup tools consume one
wal_keep_size = 2GB # Prevents WAL deletion before standby processing
archive_mode = on # Fallback when streaming fails; pair with a working archive_command
listen_addresses = 'specific_ip' # Never use '*' in production
max_connections = 200 # Account for replication connections
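After a reload (SELECT pg_reload_conf();) or restart, confirm what the server actually applied rather than what you think you edited. A verification sketch:

-- pending_restart = true means the value won't take effect until a restart
SELECT name, setting, unit, pending_restart
FROM pg_settings
WHERE name IN ('wal_level', 'max_wal_senders', 'wal_keep_size',
               'archive_mode', 'listen_addresses', 'max_connections');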
Critical Failure Modes
- pg_basebackup fails an average of three times before it works, due to:
- Firewall/IP address errors
- pg_hba.conf authentication failures
- Permission denied on destination
- Network timeouts during large copies
- Primary WAL sender slot exhaustion
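The slot-exhaustion case is cheap to rule out before retrying the backup. A check sketch for the primary (pg_stat_replication has one row per WAL sender, including pg_basebackup connections):

-- WAL senders in use vs. the configured ceiling
SELECT count(*)                           AS wal_senders_in_use,
       current_setting('max_wal_senders') AS max_wal_senders
FROM pg_stat_replication;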
Security Configuration
CREATE ROLE replication_user WITH REPLICATION LOGIN PASSWORD 'secure_password';
pg_hba.conf entry:
host replication replication_user 10.0.1.100/32 scram-sha-256
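None of this applies until the configuration is reloaded, and a broken pg_hba.conf line is easy to miss until the standby tries to connect. A sketch for applying and sanity-checking the change (pg_hba_file_rules exists since PostgreSQL 10 and is superuser-only by default):

-- Apply pg_hba.conf and postgresql.conf changes without a restart
SELECT pg_reload_conf();

-- Any parse problem shows up in the error column
SELECT line_number, type, database, user_name, address, auth_method, error
FROM pg_hba_file_rules
WHERE 'replication' = ANY(database) OR error IS NOT NULL;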
Standby Server Setup
Base Backup Process
sudo -u postgres pg_basebackup \
-h 10.0.1.99 \
-p 5432 \
-U replication_user \
-D /var/lib/postgresql/15/main \
-Fp -Xs -P -R -W
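A physical replication slot closes the race where the primary recycles WAL before the standby finishes syncing; pg_basebackup can handle this itself with -C --slot=..., or the slot can be created up front (slot name below is illustrative):

-- On the primary: reserve WAL for the standby until it connects
SELECT pg_create_physical_replication_slot('standby1_slot');

-- If the standby is ever decommissioned, drop the slot or it will
-- retain WAL indefinitely and fill the disk:
-- SELECT pg_drop_replication_slot('standby1_slot');

If the slot is created manually, the standby needs primary_slot_name = 'standby1_slot'; passing --slot to pg_basebackup together with -R writes that setting for you.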
Time Requirements
- 100GB database: 2-6 hours depending on network
- 1TB database: Cancel weekend plans
- Production planning: 2-4 hours minimum (not "brief maintenance window")
Standby-Specific Settings
hot_standby = on # Enable read-only queries
hot_standby_feedback = off # Prevents primary bloat; trade-off is more query cancellations on the standby (see Long Query Conflicts below)
max_connections = 100 # Lower for read-only server
wal_receiver_timeout = 60s # Failure detection vs network tolerance
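From the standby itself, lag can be read directly. A monitoring sketch (the functions return NULL on a freshly started standby, and an idle primary makes the delay look worse than it is):

-- Run on the standby
SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay,
       pg_last_wal_receive_lsn()               AS received_lsn,
       pg_last_wal_replay_lsn()                AS replayed_lsn;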
Performance Impact and Resource Requirements
Network Bandwidth
- WAL generation: 1GB/hour becomes 2-3GB/hour network traffic
- Catchup scenarios: Bandwidth spikes during standby recovery
- Network failures: the backlog transfer during catchup can saturate even "enterprise-grade" links
Storage Requirements
- WAL accumulation: 100GB database can fill 500GB partition during outages
- Monitoring threshold: Alert when WAL directory >20% of disk space
- Disaster sizing: Plan for 3x database size minimum
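If replication slots are in play, an inactive slot is the classic way a modest database fills a large partition: the primary keeps every WAL segment the slot still claims to need. A check sketch (wal_status and safe_wal_size need PostgreSQL 13+):

-- Run on the primary: slots that are disconnected but still pinning WAL
SELECT slot_name, slot_type, active, wal_status, safe_wal_size
FROM pg_replication_slots
WHERE NOT active;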
Primary Server Impact
- Normal operation: Minimal performance impact
- Network issues: unreplicated WAL piles up on the primary while the standby is unreachable
- Disk space exhaustion: if pg_wal fills, the primary PANICs and the entire database goes down
Operational Troubleshooting
Replication Status Verification
-- Primary server check
SELECT * FROM pg_stat_replication; -- Should show state='streaming'
-- Standby server check
SELECT pg_is_in_recovery(); -- Should return true
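Beyond state = 'streaming', byte-level lag per standby tells you which side of the connection is the problem. A sketch for the primary:

-- write/flush lag points at the network, replay lag at the standby itself
SELECT application_name, client_addr, state, sync_state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       write_lag, flush_lag, replay_lag
FROM pg_stat_replication;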
Common Failure Scenarios
Connection Failures
- 90% cause: pg_hba.conf misconfiguration
- 9% cause: Firewall blocking port 5432
- 1% cause: Obscure network issues
Replication Lag Growth
- High flush_lag: Network bottleneck
- High replay_lag: Underpowered standby server
- Solution trade-offs: Better hardware vs accepting lag
Long Query Conflicts
- Symptom: Standby queries cancelled by replication
- Root cause: Long-running reports conflict with primary updates
- Impact: 2-hour reports terminated by simple UPDATEs
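Conflict cancellations are counted per database in pg_stat_database_conflicts, which separates "replication killed it" from ordinary timeouts; the usual levers are hot_standby_feedback = on (bloat on the primary) or a higher max_standby_streaming_delay (more lag). A sketch for the standby:

-- Only populated on standbys; snapshot conflicts are the long-report killer
SELECT datname, confl_snapshot, confl_lock, confl_tablespace,
       confl_bufferpin, confl_deadlock
FROM pg_stat_database_conflicts
ORDER BY confl_snapshot DESC;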
Failover Procedures
Emergency Promotion
# CRITICAL: Ensure old primary is completely down first
pg_ctl promote -D /var/lib/postgresql/15/main
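On PostgreSQL 12+ the same promotion can be done from SQL, which avoids guessing the right pg_ctl wrapper per distro (Debian/Ubuntu use pg_ctlcluster):

-- Run on the standby; waits up to 60 seconds for promotion by default
SELECT pg_promote();

-- Should now return false
SELECT pg_is_in_recovery();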
Post-Failover Requirements
- Update all application connection strings
- Reconfigure monitoring systems
- Plan standby replacement strategy
Synchronous vs Asynchronous Trade-offs
| Aspect | Synchronous | Asynchronous |
|---|---|---|
| Data Loss Risk | Zero (if network stable) | Some data lost on primary failure |
| Commit Performance | Slower, network-dependent | Minimal impact |
| Network Requirements | High reliability required | Tolerates occasional hiccups |
| Use Cases | Financial/medical data | Most web applications |
| Complexity | High ongoing tuning | Low until failures occur |
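Flipping a standby to synchronous is a primary-side setting keyed on the standby's application_name from its primary_conninfo. A sketch, assuming the standby connects as standby1:

-- On the primary: require one acknowledged standby before commits return
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (standby1)';
SELECT pg_reload_conf();

-- sync_state should show 'sync' for standby1
SELECT application_name, sync_state FROM pg_stat_replication;

If that lone synchronous standby drops off, commits on the primary hang until it returns or the setting is cleared, which is most of the "high ongoing tuning" in the table above.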
Production Monitoring Requirements
Critical Alerts
- Replication lag > 30 seconds
- Free space on the WAL volume < 20%
- Missing replication processes
- Standby connection failures
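The 30-second threshold maps directly onto pg_stat_replication and can back a postgres_exporter custom query or a plain cron check; a sketch:

-- Run on the primary: standbys breaching the replay-lag alert threshold
SELECT application_name, client_addr, replay_lag
FROM pg_stat_replication
WHERE replay_lag > interval '30 seconds';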
Tools Integration
- Prometheus + postgres_exporter: Production monitoring
- pg_stat_replication: Built-in status monitoring
- Log monitoring: PostgreSQL error logs for failure detection
Version Upgrade Constraints
Major Version Limitations
- Streaming replication: Cannot cross major versions
- Upgrade options:
- Logical replication to new version (complex)
- Downtime for primary upgrade + standby rebuild
- pg_upgrade + standby resync
- Reality: All upgrade paths have significant complexity
Resource Investment Requirements
Time Investments
- Initial setup: 4-8 hours including troubleshooting
- Large database sync: Hours to days depending on size
- Failover testing: Plan monthly testing windows
- Troubleshooting: Network issues can consume entire weekends
Expertise Requirements
- PostgreSQL administration: Advanced level required
- Network troubleshooting: Essential for replication issues
- Monitoring setup: Critical for production stability
- Disaster recovery: Must be tested and documented
Infrastructure Costs
- Standby hardware: Size equally to primary (don't cheap out)
- Network capacity: Plan for 2-3x normal WAL traffic
- Storage overhead: 3x primary database size minimum
- Monitoring tools: Budget for proper alerting systems
Critical Warnings
Documentation Gaps
- "Brief maintenance window": Actually 2-4 hours minimum
- Network requirements: Underspecified in official docs
- Disk space planning: WAL accumulation severely underestimated
- Failure scenarios: Real-world complexity not covered
Breaking Points
- WAL disk exhaustion: Crashes entire primary database
- Network instability: Can make replication unusable
- Standby query workload: Long queries will be cancelled
- Split-brain scenarios: Requires careful primary shutdown verification
Production Gotchas
- pg_basebackup timeouts: Test with actual database sizes
- SSL certificate management: Easy to overlook, hard to fix later
- Connection pooling: Replication consumes application connections
- Backup tool conflicts: pg_basebackup competes for WAL sender slots