WAL (Write-Ahead Logging) - AI-Optimized Technical Reference
Core Functionality
Purpose: Prevents data loss during database crashes by logging changes before applying them to data files.
Mechanism: Sequential write to log file → fsync() to disk → asynchronous application to data files
Recovery Process: Replay log entries from last checkpoint to restore consistent state
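A rough way to see how much WAL a crash would have to replay is to compare the current WAL insert position with the redo pointer of the last checkpoint. A minimal PostgreSQL sketch (functions available in 10+; treat the result as an approximation of crash-replay work):
-- WAL written since the last checkpoint's redo point (what crash recovery would replay)
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), redo_lsn)) AS wal_since_last_checkpoint
FROM pg_control_checkpoint();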
Performance Specifications
Write Performance
- Sequential WAL writes: 2-5x faster than random data file writes
- Batch commits: Single fsync() can handle multiple transactions
- Production throughput: 8,000-50,000+ transactions/second on NVMe SSD with 32GB RAM
- Transaction latency: 1-5ms typical on decent hardware
Storage Requirements
- Typical overhead: 20-50% additional storage for WAL files
- High-write scenarios: Can reach 80% during bulk imports
- Formula:
WAL overhead = write_rate × checkpoint_timeout
- Example: 100MB/sec writes + 5-minute checkpoints = ~30GB WAL minimum
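To plug a measured write_rate into that formula, average WAL generation since the last stats reset is good enough for sizing. A minimal sketch against pg_stat_wal (PostgreSQL 14+); the averaging window is whatever has elapsed since stats_reset, so treat it as a ballpark:
-- Average WAL bytes generated per second since statistics were last reset
SELECT pg_size_pretty(wal_bytes) AS wal_since_reset,
       pg_size_pretty((wal_bytes / GREATEST(EXTRACT(EPOCH FROM now() - stats_reset), 1))::bigint) AS avg_wal_per_second
FROM pg_stat_wal;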
Critical Configuration Settings
PostgreSQL WAL Settings
-- Essential parameters
wal_level = replica -- Enable replication
fsync = on -- NEVER disable in production
synchronous_commit = on -- Set to 'off' only for analytics workloads
checkpoint_timeout = 300 -- 5 minutes default, tune based on workload
max_wal_size = 1GB -- Increase for write-heavy systems
wal_compression = on -- saves 20-40% space; lz4/zstd compression methods in PostgreSQL 15+
Performance Tuning Parameters
-- Advanced tuning
checkpoint_completion_target = 0.9 -- Spread checkpoint I/O over 90% of interval
wal_buffers = 16MB -- Usually auto-tuned correctly
commit_delay = 0 -- Let PostgreSQL handle batching automatically
recovery_prefetch = try -- PostgreSQL 15+, prefetches referenced blocks to speed up recovery
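To verify what the running server actually uses (values can be overridden per database, per role, or at runtime), a quick check against pg_settings:
-- Effective WAL/checkpoint settings and where each value came from
SELECT name, setting, unit, source
FROM pg_settings
WHERE name IN ('wal_level', 'fsync', 'synchronous_commit', 'checkpoint_timeout',
               'max_wal_size', 'wal_compression', 'checkpoint_completion_target', 'wal_buffers');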
Failure Modes and Solutions
WAL Disk Full
Symptoms: "could not write to WAL file: No space left on device"
Impact: Database stops accepting writes immediately
Prevention: Alert at 70% WAL disk usage, emergency at 90%
Recovery: Fix archive command or increase disk space
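Two quick checks when the WAL volume is filling: how much is actually sitting in pg_wal, and whether something is preventing cleanup. A sketch assuming PostgreSQL 10+ and a role allowed to call pg_ls_waldir() (superuser or pg_monitor); if the total keeps growing, check pg_replication_slots and pg_stat_archiver as covered below:
-- Count and total size of WAL segments currently on disk
SELECT count(*) AS wal_files, pg_size_pretty(sum(size)) AS wal_size
FROM pg_ls_waldir();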
Checkpoint Configuration Issues
Problem: Misconfigured checkpoints cause I/O spikes or excessive WAL buildup
Symptoms:
- Too frequent: Constant I/O spikes killing performance
- Too infrequent: WAL buildup (30GB+ per hour), slow recovery times
Solution: Balance checkpoint_timeout and max_wal_size based on workload
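One way to tell whether checkpoints are being forced by WAL volume rather than the timer is to compare requested vs. timed checkpoints. A sketch against pg_stat_bgwriter (PostgreSQL 16 and earlier; in 17+ the counters moved to pg_stat_checkpointer as num_timed/num_requested):
-- A high requested fraction suggests max_wal_size is too small for the write rate
SELECT checkpoints_timed, checkpoints_req,
       round(checkpoints_req::numeric / GREATEST(checkpoints_timed + checkpoints_req, 1), 2) AS requested_fraction
FROM pg_stat_bgwriter;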
Replication Lag
Cause: WAL generation exceeds network bandwidth or replica apply speed
Impact: Replicas fall behind, potential data loss during failover
Monitoring: SELECT * FROM pg_stat_replication;
Solutions:
- Increase network bandwidth
- Upgrade replica hardware
- Switch from logical to physical replication
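To turn the pg_stat_replication output into something alertable, express lag in bytes and time per standby (PostgreSQL 10+ exposes replay_lsn and replay_lag):
-- Byte and time lag for each connected standby
SELECT application_name, client_addr,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_bytes,
       replay_lag
FROM pg_stat_replication;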
Archive Command Failures
Problem: Archive command fails silently, old WAL files accumulate
Detection: SELECT * FROM pg_stat_archiver;
Impact: WAL disk fills up, no point-in-time recovery capability
Prevention: Monitor archiver stats, test archive/restore regularly
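The archiver view exposes the last failure directly, which makes the silent-failure case straightforward to alert on:
-- Any nonzero failed_count or a recent last_failed_time deserves investigation
SELECT archived_count, last_archived_wal, last_archived_time,
       failed_count, last_failed_wal, last_failed_time
FROM pg_stat_archiver;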
Recovery Time Expectations
Recovery Performance
- Standard hardware: 1-3 minutes per GB of WAL to replay
- High-end NVMe: Sub-minute per GB
- Spinning disks: 3-5 minutes per GB
- Real example: 170GB WAL = 4+ hours recovery time with poor checkpoint config
Recovery Acceleration
- Enable WAL prefetching during recovery (recovery_prefetch, PostgreSQL 15+)
- Use faster storage for WAL and data files
- Optimize checkpoint frequency to reduce WAL volume
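While a standby (or a server doing archive/PITR recovery with hot_standby enabled) is replaying WAL, progress can be watched from the replay position; this does not help during crash recovery, when the server refuses connections:
-- Replay progress on a server that is in recovery
SELECT pg_is_in_recovery() AS in_recovery,
       pg_last_wal_receive_lsn() AS received,
       pg_last_wal_replay_lsn() AS replayed,
       pg_last_xact_replay_timestamp() AS last_replayed_commit;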
Database-Specific Implementations
PostgreSQL
- WAL location: pg_wal/ directory
- File size: 16MB segments
- Compression: wal_compression for full-page images (lz4/zstd from PostgreSQL 15)
- Strengths: Single WAL system, excellent tooling, reliable
- Best for: Most OLTP production systems
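To see which segment the server is writing right now and confirm the segment size (changeable only at initdb time):
-- Current WAL segment file and configured segment size
SELECT pg_walfile_name(pg_current_wal_lsn()) AS current_wal_file;
SHOW wal_segment_size;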
MySQL InnoDB
- Dual-log system: Redo log (crash recovery) + Binary log (replication)
- Complexity: Requires both logs for complete recovery
- Performance: 1-3ms write latency
- Storage overhead: 25-45%
- Limitation: Dual-log management complexity
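On the MySQL side, the durability knobs are split across the two logs, which is where the management complexity comes from. A quick check of the usual settings (standard InnoDB/binlog variables; innodb_redo_log_capacity supersedes innodb_log_file_size from 8.0.30):
-- innodb_flush_log_at_trx_commit=1 plus sync_binlog=1 is the fully durable combination
SHOW VARIABLES WHERE Variable_name IN
  ('innodb_flush_log_at_trx_commit', 'sync_binlog',
   'innodb_log_file_size', 'innodb_redo_log_capacity');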
SQLite WAL Mode
- File: companion -wal file alongside the main database file
- Limitation: Single writer only
- Performance: Sub-millisecond writes
- Best for: Mobile apps, embedded systems
- Recovery: Seconds, not minutes
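Enabling WAL mode in SQLite is a single pragma, and checkpointing is tunable the same way:
-- Switch to WAL journaling (setting persists in the database file)
PRAGMA journal_mode=WAL;
-- Auto-checkpoint after ~1000 pages (the default); raise for write-heavy workloads
PRAGMA wal_autocheckpoint=1000;
-- Force a checkpoint and truncate the -wal file, e.g. before a file-level backup
PRAGMA wal_checkpoint(TRUNCATE);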
MongoDB Oplog
- Type: Operations log (not true WAL)
- Storage: Capped collection
- Complexity: Oplog sizing is critical - too small and lagging replicas fall off the oplog and need a full resync
- Write latency: 5-15ms range
Production Monitoring Requirements
Critical Metrics
-- WAL generation rate
SELECT * FROM pg_stat_wal;
-- Replication status
SELECT * FROM pg_stat_replication;
-- Archive status
SELECT * FROM pg_stat_archiver;
-- Replication slots (can hold WAL files)
SELECT slot_name, restart_lsn FROM pg_replication_slots;
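The bare restart_lsn is hard to act on; converting it to retained bytes per slot makes a stuck slot obvious:
-- WAL kept on disk because of each replication slot
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;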
Alert Thresholds
- WAL disk usage: Alert at 70%, emergency at 90%
- Replication lag: Alert if behind by >100MB or 5 minutes
- Archive failures: Alert on any failed archive attempts
- Checkpoint duration: Alert if checkpoints take >60% of checkpoint_timeout
Resource Requirements
Hardware Considerations
- WAL storage: Fast sequential write performance more important than random I/O
- Network: Replication requires sustained bandwidth equal to WAL generation rate
- CPU: WAL compression (wal_compression) trades CPU for I/O
- RAM: Larger shared_buffers reduces dirty-page evictions between checkpoints (checkpoint frequency itself is driven by checkpoint_timeout and max_wal_size)
Cloud Provider Costs
- Aurora: 3-4x RDS pricing for distributed WAL benefits
- AlloyDB: Similar premium pricing, faster analytical queries
- Standard RDS: Most cost-effective for typical workloads
Common Misconceptions
Dangerous Settings
- fsync = off: Disabling loses crash recovery entirely
- synchronous_commit = off: Can lose the last second of transactions
- Manual WAL file deletion: Breaks replication, never do this
- WAL on NFS: Reliability issues, avoid in production
Performance Myths
- WAL doesn't slow down writes - it makes them safer and often faster through batching
- More WAL files doesn't mean worse performance - it means more write activity
- Checkpoint storms are configuration problems, not WAL problems
Essential Tools
Debugging and Analysis
- pg_waldump: Analyze WAL file contents during troubleshooting
- pgBadger: WAL performance analysis and checkpoint timing
- pg_stat_io: PostgreSQL 16+ I/O pattern visibility
Backup and Archiving
- pgBackRest: Reliable WAL archiving and point-in-time recovery
- WAL-G: Modern alternative to WAL-E with better error handling
Monitoring
- Prometheus postgres_exporter: Comprehensive WAL metrics
- Grafana dashboards: Visualize WAL generation, checkpoint timing, replication lag
Decision Criteria
When WAL Works Best
- OLTP workloads with frequent small transactions
- Systems requiring point-in-time recovery
- Read replica configurations
- Applications needing crash consistency guarantees
When to Consider Alternatives
- Pure read-only workloads (minimal WAL benefit)
- Systems with extreme storage constraints
- Applications that can tolerate data loss for performance gains
- Single-user embedded applications (SQLite WAL mode sufficient)
Useful Links for Further Investigation
Essential WAL Resources (Actually Useful Stuff)
Link | Description |
---|---|
PostgreSQL Write-Ahead Logging | The official docs. Actually explains how WAL works without the marketing bullshit. I reference this constantly when debugging WAL issues. |
WAL Configuration | All the knobs you can turn. Most people fuck up checkpoint tuning - read this first. |
Monitoring Database Activity | `pg_stat_wal`, `pg_stat_replication`, `pg_stat_archiver` - monitor these or get fired. |
SQLite Write-Ahead Logging | Perfect for single-writer scenarios. Dead simple and works. |
The Internals of PostgreSQL - WAL | Best deep dive into PostgreSQL WAL internals. Saved my ass when debugging WAL corruption - spent 6 hours until I found this explanation of LSN handling. |
Database System Concepts - Recovery | Academic but useful. Chapter 16 explains WAL theory without the marketing bullshit. |
pg_waldump | Debug WAL files when shit hits the fan |
pgBadger | Ugly as hell but actually works when you're trying to figure out why checkpoints are slow |
pgBackRest | Actually works for WAL archiving and recovery |
WAL-G | Modern replacement for WAL-E, less buggy |
Prometheus PostgreSQL Exporter | Set alerts on WAL disk at 70% or you'll be getting called at 3am |
PostgreSQL WAL tagged questions | "WAL file could not be archived" solutions |
DBA StackExchange | Better for PostgreSQL issues than SO |