PostgreSQL Streaming Replication: AI-Optimized Production Guide
Configuration Requirements
Infrastructure Specifications
- Matching PostgreSQL versions required: major versions must be identical, and keeping minor versions in sync (14.8 alongside 14.9) spares you hard-to-diagnose replication failures
- Network connectivity: Dedicated NICs recommended for replication traffic
- Disk space: Minimum 3x primary database size for WAL accumulation during outages
- Critical failure point: 50GB database can generate 200GB WAL during weekend network outage
- Production reality: Size for disaster scenarios, not normal operation
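To know whether the 3x sizing is paranoid enough, watch what pg_wal is actually doing. A quick check sketch, assuming PostgreSQL 10+ (pg_ls_waldir() is restricted to superusers and pg_monitor members):

-- Run on the primary: current size of the pg_wal directory
SELECT pg_size_pretty(sum(size)) AS wal_dir_size,
       count(*)                  AS wal_segments
FROM pg_ls_waldir();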
Version Compatibility
- PostgreSQL 15+ recommended for monitoring improvements
- Streaming replication: Same major versions only (14↔15 incompatible)
- Cross-version needs: Use logical replication with additional complexity
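A minimal sketch of the logical-replication route for cross-version moves, assuming wal_level = logical on the source, the table definitions already created on the target, and illustrative names (orders, appdb); the connection role also needs SELECT on the published tables for the initial copy:

-- On the source cluster (older major version)
CREATE PUBLICATION upgrade_pub FOR TABLE orders;

-- On the target cluster (newer major version); initial data copy starts automatically
CREATE SUBSCRIPTION upgrade_sub
    CONNECTION 'host=10.0.1.99 port=5432 dbname=appdb user=replication_user password=secure_password'
    PUBLICATION upgrade_pub;

Sequences, DDL, and large objects don't replicate logically, which is most of the "additional complexity."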
Primary Server Configuration
Essential postgresql.conf Settings
wal_level = replica # Required for standby WAL data
max_wal_senders = 5 # Each standby + backup tools consume one
wal_keep_size = 2GB # Prevents WAL deletion before standby processing
archive_mode = on # Fallback when streaming fails; pair with a working archive_command
listen_addresses = 'specific_ip' # Never use '*' in production
max_connections = 200 # Account for replication connections
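After a reload (SELECT pg_reload_conf();) or restart, confirm what the server actually applied rather than what you think you edited. A verification sketch:

-- pending_restart = true means the value won't take effect until a restart
SELECT name, setting, unit, pending_restart
FROM pg_settings
WHERE name IN ('wal_level', 'max_wal_senders', 'wal_keep_size',
               'archive_mode', 'listen_addresses', 'max_connections');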
Critical Failure Modes
- pg_basebackup fails an average of three times before it works, due to:
- Firewall/IP address errors
- pg_hba.conf authentication failures
- Permission denied on destination
- Network timeouts during large copies
- Primary WAL sender slot exhaustion
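The slot-exhaustion case is cheap to rule out before retrying the backup. A check sketch for the primary (pg_stat_replication has one row per WAL sender, including pg_basebackup connections):

-- WAL senders in use vs. the configured ceiling
SELECT count(*)                           AS wal_senders_in_use,
       current_setting('max_wal_senders') AS max_wal_senders
FROM pg_stat_replication;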
Security Configuration
CREATE ROLE replication_user WITH REPLICATION LOGIN PASSWORD 'secure_password';
pg_hba.conf entry:
host replication replication_user 10.0.1.100/32 scram-sha-256
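None of this applies until the configuration is reloaded, and a broken pg_hba.conf line is easy to miss until the standby tries to connect. A sketch for applying and sanity-checking the change (pg_hba_file_rules exists since PostgreSQL 10 and is superuser-only by default):

-- Apply pg_hba.conf and postgresql.conf changes without a restart
SELECT pg_reload_conf();

-- Any parse problem shows up in the error column
SELECT line_number, type, database, user_name, address, auth_method, error
FROM pg_hba_file_rules
WHERE 'replication' = ANY(database) OR error IS NOT NULL;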
Standby Server Setup
Base Backup Process
sudo -u postgres pg_basebackup \
-h 10.0.1.99 \
-p 5432 \
-U replication_user \
-D /var/lib/postgresql/15/main \
-Fp -Xs -P -R -W
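A physical replication slot closes the race where the primary recycles WAL before the standby finishes syncing; pg_basebackup can handle this itself with -C --slot=..., or the slot can be created up front (slot name below is illustrative):

-- On the primary: reserve WAL for the standby until it connects
SELECT pg_create_physical_replication_slot('standby1_slot');

-- If the standby is ever decommissioned, drop the slot or it will
-- retain WAL indefinitely and fill the disk:
-- SELECT pg_drop_replication_slot('standby1_slot');

If the slot is created manually, the standby needs primary_slot_name = 'standby1_slot'; passing --slot to pg_basebackup together with -R writes that setting for you.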
Time Requirements
- 100GB database: 2-6 hours depending on network
- 1TB database: Cancel weekend plans
- Production planning: 2-4 hours minimum (not "brief maintenance window")
Standby-Specific Settings
hot_standby = on # Enable read-only queries
hot_standby_feedback = off # Prevents primary bloat; trade-off is more query cancellations on the standby (see Long Query Conflicts below)
max_connections = 100 # Lower for read-only server
wal_receiver_timeout = 60s # Failure detection vs network tolerance
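From the standby itself, lag can be read directly. A monitoring sketch (the functions return NULL on a freshly started standby, and an idle primary makes the delay look worse than it is):

-- Run on the standby
SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay,
       pg_last_wal_receive_lsn()               AS received_lsn,
       pg_last_wal_replay_lsn()                AS replayed_lsn;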
Performance Impact and Resource Requirements
Network Bandwidth
- WAL generation: 1GB/hour becomes 2-3GB/hour network traffic
- Catchup scenarios: Bandwidth spikes during standby recovery
- Network failures: the backlog transfer during catchup can saturate even "enterprise-grade" links
Storage Requirements
- WAL accumulation: 100GB database can fill 500GB partition during outages
- Monitoring threshold: Alert when WAL directory >20% of disk space
- Disaster sizing: Plan for 3x database size minimum
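If replication slots are in play, an inactive slot is the classic way a modest database fills a large partition: the primary keeps every WAL segment the slot still claims to need. A check sketch (wal_status and safe_wal_size need PostgreSQL 13+):

-- Run on the primary: slots that are disconnected but still pinning WAL
SELECT slot_name, slot_type, active, wal_status, safe_wal_size
FROM pg_replication_slots
WHERE NOT active;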
Primary Server Impact
- Normal operation: Minimal performance impact
- Network issues: unreplicated WAL piles up on the primary while the standby is unreachable
- Disk space exhaustion: if pg_wal fills, the primary PANICs and the entire database goes down
Operational Troubleshooting
Replication Status Verification
-- Primary server check
SELECT * FROM pg_stat_replication; -- Should show state='streaming'
-- Standby server check
SELECT pg_is_in_recovery(); -- Should return true
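Beyond state = 'streaming', byte-level lag per standby tells you which side of the connection is the problem. A sketch for the primary:

-- write/flush lag points at the network, replay lag at the standby itself
SELECT application_name, client_addr, state, sync_state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       write_lag, flush_lag, replay_lag
FROM pg_stat_replication;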
Common Failure Scenarios
Connection Failures
- 90% cause: pg_hba.conf misconfiguration
- 9% cause: Firewall blocking port 5432
- 1% cause: Obscure network issues
Replication Lag Growth
- High flush_lag: Network bottleneck
- High replay_lag: Underpowered standby server
- Solution trade-offs: Better hardware vs accepting lag
Long Query Conflicts
- Symptom: Standby queries cancelled by replication
- Root cause: Long-running reports conflict with primary updates
- Impact: 2-hour reports terminated by simple UPDATEs
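Conflict cancellations are counted per database in pg_stat_database_conflicts, which separates "replication killed it" from ordinary timeouts; the usual levers are hot_standby_feedback = on (bloat on the primary) or a higher max_standby_streaming_delay (more lag). A sketch for the standby:

-- Only populated on standbys; snapshot conflicts are the long-report killer
SELECT datname, confl_snapshot, confl_lock, confl_tablespace,
       confl_bufferpin, confl_deadlock
FROM pg_stat_database_conflicts
ORDER BY confl_snapshot DESC;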
Failover Procedures
Emergency Promotion
# CRITICAL: Ensure old primary is completely down first
pg_ctl promote -D /var/lib/postgresql/15/main
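On PostgreSQL 12+ the same promotion can be done from SQL, which avoids guessing the right pg_ctl wrapper per distro (Debian/Ubuntu use pg_ctlcluster):

-- Run on the standby; waits up to 60 seconds for promotion by default
SELECT pg_promote();

-- Should now return false
SELECT pg_is_in_recovery();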
Post-Failover Requirements
- Update all application connection strings
- Reconfigure monitoring systems
- Plan standby replacement strategy
Synchronous vs Asynchronous Trade-offs
| Aspect | Synchronous | Asynchronous |
|---|---|---|
| Data Loss Risk | Zero (if network stable) | Some data lost on primary failure |
| Commit Performance | Slower, network-dependent | Minimal impact |
| Network Requirements | High reliability required | Tolerates occasional hiccups |
| Use Cases | Financial/medical data | Most web applications |
| Complexity | High ongoing tuning | Low until failures occur |
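Flipping a standby to synchronous is a primary-side setting keyed on the standby's application_name from its primary_conninfo. A sketch, assuming the standby connects as standby1:

-- On the primary: require one acknowledged standby before commits return
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (standby1)';
SELECT pg_reload_conf();

-- sync_state should show 'sync' for standby1
SELECT application_name, sync_state FROM pg_stat_replication;

If that lone synchronous standby drops off, commits on the primary hang until it returns or the setting is cleared, which is most of the "high ongoing tuning" in the table above.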
Production Monitoring Requirements
Critical Alerts
- Replication lag > 30 seconds
- Free space on the WAL volume < 20%
- Missing replication processes
- Standby connection failures
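The 30-second threshold maps directly onto pg_stat_replication and can back a postgres_exporter custom query or a plain cron check; a sketch:

-- Run on the primary: standbys breaching the replay-lag alert threshold
SELECT application_name, client_addr, replay_lag
FROM pg_stat_replication
WHERE replay_lag > interval '30 seconds';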
Tools Integration
- Prometheus + postgres_exporter: Production monitoring
- pg_stat_replication: Built-in status monitoring
- Log monitoring: PostgreSQL error logs for failure detection
Version Upgrade Constraints
Major Version Limitations
- Streaming replication: Cannot cross major versions
- Upgrade options:
- Logical replication to new version (complex)
- Downtime for primary upgrade + standby rebuild
- pg_upgrade + standby resync
- Reality: All upgrade paths have significant complexity
Resource Investment Requirements
Time Investments
- Initial setup: 4-8 hours including troubleshooting
- Large database sync: Hours to days depending on size
- Failover testing: Plan monthly testing windows
- Troubleshooting: Network issues can consume entire weekends
Expertise Requirements
- PostgreSQL administration: Advanced level required
- Network troubleshooting: Essential for replication issues
- Monitoring setup: Critical for production stability
- Disaster recovery: Must be tested and documented
Infrastructure Costs
- Standby hardware: Size equally to primary (don't cheap out)
- Network capacity: Plan for 2-3x normal WAL traffic
- Storage overhead: 3x primary database size minimum
- Monitoring tools: Budget for proper alerting systems
Critical Warnings
Documentation Gaps
- "Brief maintenance window": Actually 2-4 hours minimum
- Network requirements: Underspecified in official docs
- Disk space planning: WAL accumulation severely underestimated
- Failure scenarios: Real-world complexity not covered
Breaking Points
- WAL disk exhaustion: Crashes entire primary database
- Network instability: Can make replication unusable
- Standby query workload: Long queries will be cancelled
- Split-brain scenarios: Requires careful primary shutdown verification
Production Gotchas
- pg_basebackup timeouts: Test with actual database sizes
- SSL certificate management: Easy to overlook, hard to fix later
- Connection pooling: Replication consumes application connections
- Backup tool conflicts: pg_basebackup competes for WAL sender slots