Prerequisites and System Requirements

PostgreSQL High Availability Architecture

Before diving into PostgreSQL replication setup, let's talk about what you actually need and what will go wrong if you skip these steps. I've been paged at 3am enough times to know which corners you can't cut.

Infrastructure Requirements

Two or more servers running identical PostgreSQL versions. Streaming replication only strictly requires the same major version, but mixing minor releases - 14.8 on one box, 14.9 on the other - is how you end up debugging weird behavior mid-failover. Keep them IDENTICAL: same packages, same repository. Use PostgreSQL 15+ if you can - the monitoring improvements and enhanced replication features alone are worth the upgrade.

Network connectivity that doesn't suck. Seriously, this is where most replication setups die. One network hiccup and your standby falls behind. I've seen WAL files pile up to 200GB during a weekend outage because someone's "enterprise-grade" network couldn't stay up for 48 hours. Use dedicated NICs for replication traffic if you have them - your database will thank you.

Sufficient disk space on both servers. Here's the thing nobody tells you: WAL files will eat your disk alive when the network shits the bed. I've seen 50GB databases generate 200GB of WAL over a weekend when the standby couldn't connect. Size your disk like you're planning for a disaster, because you are. Plan for at least 3x your current database size if you want to sleep well at night.

Security Considerations

PostgreSQL Security Best Practices

Dedicated replication user with minimal privileges. Don't be an idiot and use postgres superuser for replication. Create a dedicated replication user with only REPLICATION privileges. I've seen entire databases compromised because someone thought "it's just internal traffic." Follow the principle of least privilege religiously.

SSL/TLS encryption for replication traffic. If you're sending unencrypted database traffic across any network that's not a directly connected cable, you're asking for trouble. Use SSL even on "private" networks - network segmentation is rarely as private as you think.

Firewall configuration restricting replication ports. Lock down port 5432 to specific IPs only. I've seen DBAs who opened it to 0.0.0.0/0 "temporarily" and forgot about it for months. Configure iptables properly or use cloud security groups.

Version Compatibility and Timing

PostgreSQL Version Management

Same major PostgreSQL versions required for streaming replication. PostgreSQL 14 and 15 won't play nice together with streaming replication. If you need cross-version, use logical replication instead - though that has its own set of gotchas.

Minimal downtime planning - this is a lie. The docs say "brief restart," but plan for 15-30 minutes because something will go wrong. The first time I did this, a typo in postgresql.conf took me two hours to find and turned a quick restart into a Saturday afternoon debugging clusterfuck. Test in staging first - production is not for learning.

These aren't just best practices - they're lessons learned from production failures. Skip them at your own risk. Next, we'll configure the primary server and watch things break in creative ways.

Pre-Setup Questions and Requirements

Q

Do I need identical hardware for primary and standby servers?

A

Not strictly identical, but don't cheap out on your standby hardware. I've seen people try to replicate from a beefy primary to some leftover server from 2015. The standby falls behind every time a big query or report runs, and then you're scrambling to rebuild replication from scratch. Size the standby properly - remember that after a failover, it IS your primary.

Q

Can I set up replication on a running production database?

A

Yes, but "brief maintenance window" is marketing speak. Plan for 2-4 hours because something will go wrong. Last time I did this, the pg_basebackup for a 1TB database ate the entire weekend. The primary restart might be quick, but the initial sync will take forever on any real database.

Q

How much network bandwidth does streaming replication consume?

A

More than you think. Your 1GB/hour WAL generation becomes 2-3GB/hour of network traffic because PostgreSQL wasn't designed by network engineers. And when the standby falls behind? Prepare for a bandwidth spike that brings down your "enterprise" network connection.

Q

What PostgreSQL versions support streaming replication?

A

Anything newer than PostgreSQL 9.1, but don't be a masochist. Use PostgreSQL 15+ - the replication monitoring improvements alone will save you debugging time. Older versions have fewer tools to figure out why your replication is broken.

Q

Can I replicate specific databases or tables only?

A

Nope. Streaming replication copies everything or nothing. Want selective replication? Use logical replication instead, but prepare for a whole new set of gotchas. Streaming replication works at the WAL level, which doesn't give a shit about your database boundaries.

Q

Why does my standby keep falling behind during large queries?

A

Because your standby server is underpowered or your network sucks. That "optimized" query that takes 30 seconds on your primary? It's generating WAL faster than your standby can replay it. Size your standby properly or accept that it will lag behind during peak usage.

Q

How much disk space do I actually need for the standby?

A

More than you think. Start with 3x your primary database size and add more when (not if) you run out. WAL files pile up faster than you can delete them during network issues. I've seen 100GB databases fill 500GB partitions with accumulated WAL files.

Q

Will replication slow down my primary server?

A

It depends. Under normal conditions, minimal impact. But when your network hiccups or your standby falls behind, WAL segments accumulate on the primary. Run out of disk space and your entire primary goes down. Monitor your WAL directory size religiously.
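
If you want a concrete starting point for that monitoring, here's a rough sketch - the data directory path is the Debian/Ubuntu default used later in this guide, so adjust it for your layout:

## Quick checks for WAL accumulation on the primary
sudo -u postgres du -sh /var/lib/postgresql/15/main/pg_wal

## Or ask the server itself (PostgreSQL 10+)
sudo -u postgres psql -c "SELECT count(*) AS segments, pg_size_pretty(sum(size)) AS total FROM pg_ls_waldir();"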

Primary Server Configuration

Streaming Replication Architecture Overview:

PostgreSQL Streaming Replication Flow

[Primary Server] ---> WAL Records ---> [Standby Server]
      |
   Generates WAL data
   Accepts replication connections

The primary server needs to be configured to generate enough WAL data and accept replication connections. Don't expect this to work on the first try - PostgreSQL config files are designed by people who hate DBAs.

Essential postgresql.conf Settings

Configure WAL level for replication:

wal_level = replica

Set this to replica or your standby won't get enough WAL data to stay in sync. Don't use minimal unless you enjoy broken replication and 3am phone calls.

Set maximum WAL sender processes:

max_wal_senders = 5

Each standby eats one WAL sender process. Set this higher than you think you need because backup tools like pg_basebackup also consume senders. I found out the hard way at 2am when pg_basebackup bounced off the primary because every sender slot was already taken, and I had to explain to my boss why the backup window failed.
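
A quick way to sanity-check the headroom on a running primary - a small sketch, run as a superuser:

SELECT current_setting('max_wal_senders') AS configured,
       count(*) AS senders_in_use
FROM pg_stat_replication;

pg_basebackup shows up in this view too while it's running, which is exactly why the headroom matters.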

Configure WAL retention:

wal_keep_size = 2GB

This prevents WAL segments from being recycled before standbys can process them. 2GB sounds like a lot until your standby falls behind on a Friday night and you come back Monday to find all the WAL it needs has been deleted. Size this based on your write volume and how long you can afford to have broken replication. I've seen systems need 20GB+ during busy periods.
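
To size it from data instead of guesswork, sample the WAL write position twice and let the server do the byte math. A rough sketch - the LSN values in the comments are made up and you'd substitute whatever your server actually prints:

sudo -u postgres psql -Atc "SELECT pg_current_wal_lsn();"   ## note the value, e.g. 0/7D000060
sleep 3600                                                  ## wait an hour (or a representative busy period)
sudo -u postgres psql -Atc "SELECT pg_current_wal_lsn();"   ## note it again, e.g. 0/9A2C8F10

## Plug the two values in to get WAL generated per hour
sudo -u postgres psql -c "SELECT pg_size_pretty(pg_wal_lsn_diff('0/9A2C8F10', '0/7D000060'));"

Multiply that hourly number by the longest outage you expect to survive and you have a defensible wal_keep_size.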

Enable archive mode (recommended):

archive_mode = on
archive_command = 'test ! -f /var/lib/postgresql/archive/%f && cp %p /var/lib/postgresql/archive/%f'

WAL archiving is your backup plan when streaming replication shits the bed. Make sure the archive directory actually exists and has proper permissions, or you'll get cryptic archive command errors that tell you nothing useful. Test your archive command manually before you need it.
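
A minimal way to test it by hand before PostgreSQL does, assuming the archive path from the archive_command above (fake_wal_segment is just a throwaway scratch file):

## Make sure the directory exists and postgres can write to it
sudo mkdir -p /var/lib/postgresql/archive
sudo chown postgres:postgres /var/lib/postgresql/archive
sudo chmod 700 /var/lib/postgresql/archive

## Dry-run the archive command as the postgres user with a scratch file
sudo -u postgres touch /tmp/fake_wal_segment
sudo -u postgres bash -c 'test ! -f /var/lib/postgresql/archive/fake_wal_segment && cp /tmp/fake_wal_segment /var/lib/postgresql/archive/fake_wal_segment' \
    && echo "archive command works" || echo "fix it now, not at 3am"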

Network and Connection Configuration

Configure listening addresses:

listen_addresses = '*'  # DON'T DO THIS IN PRODUCTION

Don't be lazy and use '*' in production. Specify exact IP addresses to limit your attack surface. Your security team will thank you, and you won't accidentally expose PostgreSQL to the internet like those MongoDB incidents and Elasticsearch breaches you read about.
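
A safer sketch, assuming the 10.0.1.99 primary address used elsewhere in this guide - substitute your own interface IPs:

listen_addresses = 'localhost,10.0.1.99'   ## loopback for local admin, plus the replication-facing NIC only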

Optimize connection limits:

max_connections = 200

Each replication connection eats one from your connection pool. Plan accordingly or watch your applications get connection errors when replication kicks in. I've seen production apps fail because nobody counted replication connections in their capacity planning.

Create Dedicated Replication User

Create a user specifically for replication - don't be an idiot and use the postgres superuser:

CREATE ROLE replication_user WITH REPLICATION LOGIN PASSWORD 'secure_random_password';

Store that password somewhere secure, not in a Slack channel named #passwords or Dave's desktop folder called 'important_stuff'. This user needs only REPLICATION privileges - don't give it more permissions because you're feeling generous.
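
Worth a ten-second check that the role looks the way you think it does:

SELECT rolname, rolreplication, rolsuper, rolcanlogin
FROM pg_roles
WHERE rolname = 'replication_user';
-- Expect rolreplication = t, rolsuper = f, rolcanlogin = t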

Configure Host-Based Authentication

PostgreSQL Security Configuration

Edit pg_hba.conf to allow replication connections from standby servers. This is where most setups break:

## Allow replication from standby server
host    replication    replication_user    10.0.1.100/32    scram-sha-256

Replace 10.0.1.100 with your actual standby IP. Use scram-sha-256 authentication in PostgreSQL 13+ because md5 is deprecated and trust is for development only.
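
On PostgreSQL 10+ you can ask the server how it parsed pg_hba.conf instead of squinting at the file - handy for catching the typo before the standby starts failing to connect:

SELECT line_number, type, database, user_name, address, auth_method, error
FROM pg_hba_file_rules
WHERE 'replication' = ANY(database) OR error IS NOT NULL;

Any non-null error here means that line is being ignored entirely, and the view reflects the file on disk - exactly what you want to know before the restart.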

The moment of truth - restart PostgreSQL:

sudo systemctl restart postgresql

If PostgreSQL doesn't start, you probably have a typo in postgresql.conf. Check the logs and fix your mistake. Chasing a config typo can easily eat an hour of your Saturday.
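
Before you pull the trigger on the restart, you can have the still-running server re-parse the config files and point at the typo for you:

sudo -u postgres psql -c "SELECT sourcefile, sourceline, name, error FROM pg_file_settings WHERE error IS NOT NULL;"

Zero rows means the files at least parse. It won't catch a wrong IP address, but it does catch the missing quote that would keep PostgreSQL from starting.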

The primary server is theoretically ready to accept replication connections. Next, we'll set up the standby and watch everything break in creative ways.

Replication Configuration Options Comparison

| Configuration Aspect | Synchronous Replication | Asynchronous Replication | Hybrid Approach |
|---|---|---|---|
| Data Durability | Zero data loss (if it works) | You'll lose some data when shit hits the fan | Depends on how well you tune it |
| Primary Performance | Commits get slower (how much depends on your network) | Barely noticeable until it breaks | Slower than async, faster than sync |
| Network Requirements | Your network better not suck | Can handle occasional hiccups | Needs decent but not perfect network |
| Failover Time | Depends on how fast you notice and react | Hope you have good monitoring | Somewhere in between |
| Complexity | High - you'll be tweaking settings for months | Low until something breaks | Moderate - good luck finding the right balance |
| Best Use Cases | Banks and places that can't lose transactions | Most websites and apps | Enterprise apps where you can't decide |

Standby Server Setup and Data Synchronization

Time to set up the standby server and watch pg_basebackup fail in creative ways. This is where you'll learn why PostgreSQL replication documentation is optimistically vague about timelines.

Prepare the Standby Environment

Install the identical PostgreSQL version on the standby server. Streaming replication only strictly requires the same major version, but mixing minor releases - 15.3 on one box, 15.4 on the other - is how you end up chasing mysterious failures all weekend. Use the exact same packages from the exact same repository.

Stop PostgreSQL service on the standby before beginning:

sudo systemctl stop postgresql

Clear existing data directory (this is your point of no return):

sudo -u postgres rm -rf /var/lib/postgresql/15/main/*

Double-check your data directory path before running this. I've seen DBAs accidentally wipe their primary server because they copy-pasted the wrong command. Make sure you're on the RIGHT SERVER.
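
Thirty seconds of paranoia before the rm -rf is cheap insurance - 10.0.1.100 here is the standby IP used in pg_hba.conf earlier, so swap in your own:

hostname                          ## is this really the standby?
ip addr show | grep 10.0.1.100    ## does this box actually have the standby's IP?
systemctl is-active postgresql    ## should NOT say "active" - you just stopped it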

Ensure directory exists with correct permissions:

sudo -u postgres mkdir -p /var/lib/postgresql/15/main
sudo chmod 700 /var/lib/postgresql/15/main

Wrong permissions here will cause PostgreSQL to refuse to start with cryptic error messages.

Execute Base Backup with pg_basebackup


Here's where the fun begins. pg_basebackup will copy your entire database over the network. For a 1TB database, cancel your weekend plans:

sudo -u postgres pg_basebackup \
    -h 10.0.1.99 \
    -p 5432 \
    -U replication_user \
    -D /var/lib/postgresql/15/main \
    -Fp \
    -Xs \
    -P \
    -R \
    -W

Here's what these flags actually do:

  • -h: Primary server IP (make sure it's reachable)
  • -U: Replication user (that you actually created)
  • -D: Where to dump the data (double-check this path)
  • -Fp: Plain format (because tar would be too easy)
  • -Xs: Stream WAL during backup (or your backup will be inconsistent)
  • -P: Show progress (so you can watch it fail slowly)
  • -R: Auto-create standby config (saves you manual work)
  • -W: Force password prompt (because authentication always breaks)

Reality check: This command will fail the first 2-3 times because:

  1. Wrong IP address or firewall blocking connection
  2. Authentication failure in pg_hba.conf
  3. Permission denied on destination directory
  4. Network timeout during large database copy
  5. Primary server out of WAL sender slots

For a 100GB database, expect 2-6 hours depending on your network. Watch the progress and pray your network doesn't hiccup.
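
A quick pre-flight sketch from the standby that exercises the same network path, pg_hba rule, and WAL sender slot pg_basebackup will need - cheaper than finding out three hours into the copy:

pg_isready -h 10.0.1.99 -p 5432

psql "host=10.0.1.99 port=5432 user=replication_user replication=yes" -c "IDENTIFY_SYSTEM;"

If the second command prompts for the password and then prints a system identifier and timeline, the boring failure modes are already ruled out.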

Automatic Standby Configuration

The -R option creates the files you need (when it works):

standby.signal - Tells PostgreSQL this is a standby (don't delete this file)
postgresql.auto.conf - Contains connection info to the primary

Check what pg_basebackup actually generated:

sudo -u postgres cat /var/lib/postgresql/15/main/postgresql.auto.conf

Should look like:

primary_conninfo = 'user=replication_user host=10.0.1.99 port=5432 sslmode=prefer'

If it's wrong, edit it manually. The -R option sometimes generates incorrect connection strings.

Configure Standby-Specific Settings

Edit the standby's postgresql.conf for production:

## Enable read-only queries on standby
hot_standby = on

## Feedback to primary about long-running standby queries (on = fewer cancelled queries, more primary bloat)
hot_standby_feedback = off

## Lower connections since this is read-only
max_connections = 100

## WAL receiver timeout settings
wal_receiver_timeout = 60s
wal_retrieve_retry_interval = 5s

Reality check:

  • hot_standby_feedback = on prevents query cancellations but causes bloat on your primary
  • Set max_connections lower on standby since it's read-only
  • Timeout settings: shorter timeouts detect failures faster but are less tolerant of network hiccups

SSL Configuration for Secure Replication

PostgreSQL SSL Setup

For production, copy SSL certificates to the standby (don't skip this):

## Copy certificates from primary (or use your own CA certs)
sudo cp server.crt /var/lib/postgresql/15/main/
sudo cp server.key /var/lib/postgresql/15/main/
sudo chown postgres:postgres /var/lib/postgresql/15/main/server.*
sudo chmod 600 /var/lib/postgresql/15/main/server.key

Update postgresql.auto.conf to require SSL:

primary_conninfo = 'user=replication_user host=10.0.1.99 port=5432 sslmode=require'

Test SSL connectivity before starting replication or you'll waste time debugging SSL connection errors.
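
Once the standby is streaming, you can confirm on the primary that the walsender connection really negotiated TLS rather than silently falling back - a sketch using the built-in stats views:

SELECT r.application_name, r.client_addr, s.ssl, s.version AS tls_version, s.cipher
FROM pg_stat_replication r
JOIN pg_stat_ssl s USING (pid);

If ssl comes back f, something is still connecting with sslmode=prefer fallback instead of require.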

Start Standby Server (Moment of Truth)

Cross your fingers and start PostgreSQL:

sudo systemctl start postgresql

If it doesn't start, check the logs:

sudo tail -f /var/log/postgresql/postgresql-15-main.log

Look for these messages (or error messages explaining why everything's broken):

  • "database system is ready to accept read only connections"
  • "started streaming WAL from primary"
  • "consistent recovery state reached"

If you see "FATAL: could not connect to the primary server", welcome to debugging replication connection problems. Check your network, pg_hba.conf, and authentication settings.

If everything worked (unlikely on first try), your standby is receiving changes from the primary. Test it by creating a table on primary and seeing if it appears on standby.
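
The smoke test looks something like this - the table name is just a throwaway:

## On the primary
sudo -u postgres psql -c "CREATE TABLE replication_smoke_test (id int, created_at timestamptz DEFAULT now());"
sudo -u postgres psql -c "INSERT INTO replication_smoke_test (id) VALUES (1);"

## On the standby, a second or two later
sudo -u postgres psql -c "SELECT * FROM replication_smoke_test;"

## Clean up on the primary when you're satisfied
sudo -u postgres psql -c "DROP TABLE replication_smoke_test;"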

Production Replication Troubleshooting and Monitoring

Q

How do I actually know if replication is working?

A

Run SELECT * FROM pg_stat_replication; on the primary. If you see your standby with state = 'streaming', you're good. If you see nothing, replication is broken. On the standby, SELECT pg_is_in_recovery(); should return true. The real test: create a table on primary and see if it shows up on standby. If it doesn't, welcome to debugging hell.
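
For reference, those checks look like this:

-- On the primary: who's streaming and how far behind
SELECT application_name, client_addr, state, sent_lsn, replay_lsn,
       write_lag, flush_lag, replay_lag
FROM pg_stat_replication;

-- On the standby: confirm recovery mode and when something was last replayed
SELECT pg_is_in_recovery(), pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();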

Q

Why does my standby keep saying "could not connect to server"?

A

Because pg_hba.conf is where hope goes to die. Check:

  1. Can you telnet primary_ip 5432?
  2. Does pg_hba.conf actually allow your standby's IP?
  3. Does the replication user exist and have REPLICATION privileges?

90% of the time it's pg_hba.conf, 9% it's firewall, 1% it's something truly weird.

Q

My replication lag keeps growing - what the hell?

A

Your standby is falling behind because either your network sucks, your standby server is underpowered, or you're running too many queries on it. Check pg_stat_replication for lag numbers. If flush_lag is high, it's network. If replay_lag is high, your standby can't keep up. Solution: better hardware or accept that your standby will lag.

Q

What happens when my standby dies during maintenance?

A

Your primary keeps running but starts accumulating WAL files. If your standby is down too long, the primary will run out of disk space and crash. Size your wal_keep_size for your longest possible outage, then double it. When the standby comes back, it'll try to catch up automatically. For long outages (days), you'll need a fresh pg_basebackup.

Q

How do I promote a standby to primary when everything's on fire?

A

First, make absolutely sure your old primary is DEAD or you'll have split-brain syndrome. Run pg_ctl promote -D /var/lib/postgresql/15/main on the standby. It exits recovery mode and becomes writable. Now update all your application connection strings to point to the new primary. Hope you documented where all those connections are configured.
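
For the record, the promotion itself is two commands (pg_promote() needs PostgreSQL 12+):

## On the standby being promoted
sudo -u postgres pg_ctl promote -D /var/lib/postgresql/15/main
## or, equivalently, from psql:
sudo -u postgres psql -c "SELECT pg_promote();"

## Confirm it actually left recovery - should now return f
sudo -u postgres psql -c "SELECT pg_is_in_recovery();"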

Q

Can I actually run queries on the standby without breaking everything?

A

Yes, if hot_standby = on (it's on by default). The standby accepts read-only queries while replaying WAL. Great for reports and read replicas. But long-running queries will get killed when they conflict with replication. Your users will love getting their 2-hour report cancelled by a simple UPDATE on the primary.
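
You can see how often that's actually happening on the standby before your users tell you about it:

-- On the standby: queries cancelled by recovery conflicts, per database
SELECT datname, confl_snapshot, confl_lock, confl_bufferpin, confl_deadlock
FROM pg_stat_database_conflicts;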

Q

My replication broke after a network outage - now what?

A

Check SELECT * FROM pg_replication_slots; on the primary to see if your WAL segments are still available. If the standby fell too far behind, you'll see huge gaps in LSN positions. If WAL segments got recycled (they will), you're rebuilding from scratch with pg_basebackup. Use replication slots to prevent this in the future.
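
A minimal slot setup looks like this - the slot name is just an example, and remember that slots keep WAL around indefinitely while the standby is gone, so watch the retained size:

-- On the primary: create a physical slot for the standby
SELECT pg_create_physical_replication_slot('standby1_slot');

-- On the standby, in postgresql.auto.conf, alongside primary_conninfo:
-- primary_slot_name = 'standby1_slot'

-- Back on the primary: how much WAL each slot is pinning
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;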

Q

How do I monitor this mess in production?

A

Monitor replication lag (pg_stat_replication view), disk space on both servers, and whether your replication processes are actually running. Set alerts for lag > 30 seconds, missing processes, or disk space < 20%. Tools like Prometheus with postgres_exporter work well if you're into that sort of thing.
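
The lag numbers worth alerting on come out of one query - a sketch you can wire into whatever scrapes your database:

-- Per-standby lag in bytes and in wall-clock time
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       replay_lag
FROM pg_stat_replication;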

Q

Should I use synchronous or asynchronous replication?

A

Asynchronous (default): Fast commits, risk of data loss when primary dies. Synchronous: Slow commits, no data loss. Synchronous replication will make your commits slower and your users angry. Use it for financial data or medical records where data loss equals lawsuits.

Q

Can I have multiple standby servers without losing my mind?

A

Yes - run pg_basebackup on each standby, pointing it at the primary. Each standby operates independently until one of them breaks. For synchronous replication with multiple standbys, use synchronous_standby_names = 'FIRST 1 (standby1,standby2)'. When the first synchronous standby dies, the next one in the list takes over the synchronous role automatically.
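
On the primary that looks like the following, and the names have to match the application_name each standby sets in its primary_conninfo - a sketch using the standby names from the answer above:

## postgresql.conf on the primary
synchronous_standby_names = 'FIRST 1 (standby1, standby2)'

## postgresql.auto.conf on each standby - application_name is what gets matched
## primary_conninfo = 'user=replication_user host=10.0.1.99 port=5432 application_name=standby1'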

Q

How do I upgrade PostgreSQL major versions with replication?

A

You can't use streaming replication between major versions. Your options:

  1. Use logical replication to a new version cluster
  2. Take downtime to upgrade primary then rebuild standbys
  3. Use pg_upgrade on primary then resync standbys

All options suck in different ways.
