Why is my pg_wal directory eating all my disk space?

This is the #1 WAL emergency. When WAL fills the disk, PostgreSQL shuts down with a PANIC, so fix this immediately. There are three main causes:

**Stuck replication slots**: Check for abandoned replication slots that prevent WAL cleanup (the `wal_status` column requires PostgreSQL 13 or later):

```sql
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS bytes_behind,
       active,
       wal_status
FROM pg_replication_slots
WHERE wal_status <> 'lost'
ORDER BY restart_lsn;
```

If you see slots with massive `bytes_behind` values and `active = false`, confirm that no replica or consumer still needs them, then drop them: `SELECT pg_drop_replication_slot('stuck_slot_name');`

**Failed WAL archiving**: If you have archiving enabled, check archiver status:

```sql
SELECT last_failed_wal, last_failed_time
FROM pg_stat_archiver
WHERE last_failed_time > coalesce(last_archived_time, '-infinity');
```

A row here means the most recent archive attempt failed. PostgreSQL keeps every segment until it has been archived successfully, so failed archiving prevents WAL cleanup. Check your `archive_command` and fix network/storage issues.

**Excessive `wal_keep_size`**: Check if this parameter is set too high: `SHOW wal_keep_size;`. Reduce it if it's consuming too much space.
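To make `bytes_behind` concrete: an LSN prints as two hexadecimal numbers separated by a slash, the high and low 32 bits of a 64-bit byte position in the WAL stream, and `pg_wal_lsn_diff` subtracts two such positions. The following is a minimal Python sketch of that arithmetic. The helper names `lsn_to_int` and `lsn_diff` are made up for illustration and are not part of any PostgreSQL client library.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lsn_diff(current: str, restart: str) -> int:
    """Bytes of WAL between two LSNs (the subtraction pg_wal_lsn_diff does server-side)."""
    return lsn_to_int(current) - lsn_to_int(restart)


if __name__ == "__main__":
    # Hypothetical LSNs that are exactly one default-size (16MB) segment apart.
    print(lsn_diff("0/3000000", "0/2000000"))
```

This can be handy when a monitoring script only has the LSN strings (for example from a log line) and no database connection to run the SQL function.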

My database keeps crashing with "PANIC: could not write to file"

This usually means you've run out of WAL disk space. PostgreSQL cannot function without WAL, so it crashes rather than risk data corruption.

**Immediate fix**: Add more disk space to the WAL partition. You might need to move `pg_wal` to a larger disk:

1. Stop PostgreSQL
2. Move the `pg_wal` directory to the new location: `mv /var/lib/postgresql/data/pg_wal /larger-disk/pg_wal`
3. Create a symlink: `ln -s /larger-disk/pg_wal /var/lib/postgresql/data/pg_wal`
4. Start PostgreSQL

**Prevention**: Monitor WAL disk usage and set up alerts when it reaches 80% full. Use the disk space fixes from the previous question.
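The 80% alert can be prototyped in a few lines of Python using only the standard library. This is a rough sketch, not a replacement for a real monitoring agent: the function names are illustrative, and the path passed in is an assumption you would replace with the mount point that holds `pg_wal`.

```python
import shutil


def wal_disk_fraction(path: str) -> float:
    """Fraction (0.0-1.0) of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def should_alert(fraction_used: float, threshold: float = 0.80) -> bool:
    """True once usage reaches the alert threshold (80% by default)."""
    return fraction_used >= threshold


if __name__ == "__main__":
    # Assumption: "." stands in for the pg_wal mount point.
    frac = wal_disk_fraction(".")
    print(f"{frac:.1%} used, alert={should_alert(frac)}")
```

Run it from cron or a monitoring hook and page someone when `should_alert` returns `True`.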

What's the difference between checkpoints_timed and checkpoints_req?

Monitor these in `pg_stat_bgwriter` (PostgreSQL 16 and earlier; PostgreSQL 17 moved the counters to `pg_stat_checkpointer` as `num_timed` and `num_requested`):

```sql
SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter;
```

**checkpoints_timed**: Checkpoints triggered by `checkpoint_timeout` (good - predictable)

**checkpoints_req**: Checkpoints requested for any other reason, most often because WAL volume reached `max_wal_size`, but also manual `CHECKPOINT` commands and operations such as base backups (bad when frequent - unpredictable load)

You want mostly timed checkpoints. If you see many requested checkpoints, increase `max_wal_size`. [EDB research shows](https://www.enterprisedb.com/blog/tuning-maxwalsize-postgresql) this can provide massive performance improvements on write-heavy workloads.
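Both counters are cumulative since the last statistics reset, so the ratio is most useful when computed over the difference between two samples. A small Python sketch of that calculation follows; the function names and the sample numbers are hypothetical.

```python
from typing import Optional, Tuple


def pct_requested(timed: int, requested: int) -> Optional[float]:
    """Percentage of checkpoints that were requested; None if none have run."""
    total = timed + requested
    if total == 0:
        return None
    return round(100.0 * requested / total, 1)


def pct_requested_between(before: Tuple[int, int],
                          after: Tuple[int, int]) -> Optional[float]:
    """Same ratio, computed over the interval between two (timed, requested) samples."""
    return pct_requested(after[0] - before[0], after[1] - before[1])


if __name__ == "__main__":
    # Hypothetical samples taken an hour apart.
    print(pct_requested_between((100, 5), (103, 6)))
```

Comparing samples this way shows whether a recent `max_wal_size` change helped, instead of averaging it away against weeks of older history.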

Should I put WAL on a separate disk?

**Yes, absolutely.** WAL writes are sequential while data file access is random. Putting them on the same disk creates I/O contention that kills performance.

**Best practice**: Place WAL on a fast SSD separate from your data files. Even a modest SSD dedicated to WAL can dramatically improve write performance. Create a symlink from `pg_wal` to the separate disk location.

**If you can't**: At least ensure your storage has good write performance. Cloud providers often limit IOPS, so you might need provisioned IOPS storage for busy databases.

Why are my WAL files so huge after enabling logical replication?

Logical replication (`wal_level = logical`) logs additional information needed to decode row changes. This typically increases WAL volume by 20-50%, but can be much higher on workloads with many UPDATEs.

**Check your actual usage**:

```sql
SELECT name, setting FROM pg_settings WHERE name = 'wal_level';
```

**Only use `wal_level = logical` if you actually need logical replication.** Most replication scenarios use streaming replication, which only needs `wal_level = replica`.

How do I tune wal_buffers for better performance?

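The default `wal_buffers = -1` auto-sizes the buffer to 1/32 of `shared_buffers`, capped at one WAL segment (16MB), which can be too small for busy systems. Monitor WAL buffer usage (the `pg_stat_wal` view requires PostgreSQL 14 or later):

```sql
SELECT * FROM pg_stat_wal;
```

If `wal_buffers_full` is increasing rapidly, you need more WAL buffers.

**Tuning guidelines**:

- **Low write volume**: 16-64MB is fine
- **Medium write volume**: 64-256MB
- **High write volume**: 256MB-1GB

Don't go over 1GB - diminishing returns and memory waste. Because the automatic setting already applies `shared_buffers / 32` up to the 16MB cap, going higher means setting `wal_buffers` explicitly, which requires a restart. Raise it in steps, then monitor and adjust.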
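For reference, the PostgreSQL documentation describes the automatic setting (`wal_buffers = -1`) as 1/32 of `shared_buffers`, but not less than 64kB and not more than one WAL segment (16MB by default). A sketch of that rule in Python, with an illustrative function name, shows where the cap kicks in:

```python
KB = 1024
MB = 1024 * KB


def auto_wal_buffers(shared_buffers_bytes: int,
                     wal_segment_bytes: int = 16 * MB) -> int:
    """Size implied by wal_buffers = -1: shared_buffers/32, clamped to [64kB, one segment]."""
    return max(64 * KB, min(shared_buffers_bytes // 32, wal_segment_bytes))


if __name__ == "__main__":
    # Hypothetical shared_buffers values.
    for sb in (128 * MB, 512 * MB, 8 * 1024 * MB):
        print(sb // MB, "MB shared_buffers ->", auto_wal_buffers(sb) // MB, "MB wal_buffers")
```

Any `shared_buffers` of 512MB or more already hits the 16MB ceiling, so larger values have to be configured by hand.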

Can I disable fsync to make writes faster?

**No, never in production.** Disabling `fsync` means WAL writes aren't guaranteed to reach disk, eliminating crash recovery protection. Your database will run faster until it loses data in a crash.

**For development/testing only**: `fsync = off` can speed up bulk data loads, but you accept total data loss risk.

**Better alternatives for performance**:

- Use `synchronous_commit = off` for specific transactions that can tolerate loss
- Tune `wal_buffers`, `max_wal_size`, and checkpoint parameters
- Use faster storage (SSDs) instead of compromising durability

Why does crash recovery take so long?

Recovery time depends on how much WAL needs to be replayed since the last checkpoint. Long recovery usually means:

**Infrequent checkpoints**: Check `checkpoint_timeout` and `max_wal_size`. Very large `max_wal_size` values reduce checkpoint frequency but increase recovery time.

**Large transactions**: Massive bulk operations create huge amounts of WAL. Break large operations into smaller transactions.

**Slow storage**: Recovery involves random I/O to data files. Faster storage (SSDs) dramatically reduces recovery time.

**Tuning for faster recovery**: Reduce `checkpoint_timeout` to 5-15 minutes and set reasonable `max_wal_size` based on your workload and available disk space.

How do I monitor WAL performance?

Enable `track_wal_io_timing` and monitor `pg_stat_wal` (PostgreSQL 14 or later):

```sql
SELECT wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write_time, wal_sync_time
FROM pg_stat_wal;
```

**Key metrics**:

- `wal_buffers_full`: High values mean you need bigger `wal_buffers`
- `wal_write_time`/`wal_sync_time`: Cumulative milliseconds, and zero unless `track_wal_io_timing` is on. A high average per operation (divide by `wal_write` or `wal_sync`) indicates a storage bottleneck
- `wal_fpi`: Full page image count - high values after checkpoints are normal

Set up monitoring alerts for WAL disk usage, checkpoint frequency, and replication slot lag to catch issues before they crash your database.

What happens if I accidentally delete files from pg_wal?

**Don't panic, but this is serious.** PostgreSQL needs WAL files for crash recovery. Deleted WAL files can prevent database startup or cause data loss.

**If PostgreSQL is still running**: Stop it immediately and restore the deleted WAL files from backup if possible.

**If PostgreSQL won't start**: You might need `pg_resetwal` to reset the WAL, but this can cause data loss. This is a last resort - contact a PostgreSQL expert if you're not sure.

**Prevention**: Never manually delete files from `pg_wal`. Always use PostgreSQL's built-in WAL management or proper archiving commands.


PostgreSQL WAL Tuning: AI-Optimized Knowledge Base

Executive Summary

PostgreSQL Write-Ahead Logging (WAL) configuration has a large effect on write performance: the gap between a poorly tuned and a well tuned system can reach 10x. The defaults (wal_buffers auto-sized to at most 16MB, max_wal_size of 1GB) are undersized for many write-heavy production workloads and can trigger frequent requested checkpoints.

Critical Failure Scenarios

Database Crashes from WAL Disk Space Exhaustion

Consequence: Complete database shutdown, no recovery until disk space added
Frequency: Common in production when replication slots get stuck or archiving fails
Detection: Monitor WAL directory size, alert at 80% full
Prevention: Separate WAL storage, monitor replication slots, proper archiving configuration

Checkpoint Storms Destroying Performance

Symptom: "checkpoints are occurring too frequently" messages in the server log, with a hint to increase max_wal_size
Impact: Write operations become 3-10x slower during checkpoint events
Root Cause: max_wal_size too small for workload (default 1GB inadequate for most production)
Solution: Increase max_wal_size to 4-32GB based on workload

I/O Contention Between WAL and Data Files

Impact: 50-90% performance degradation on write operations
Cause: WAL (sequential) and data files (random) on same storage
Solution: Dedicated storage for WAL provides immediate 2-5x improvement

WAL Architecture and Operational Intelligence

WAL Internal Mechanics

16MB segment files in pg_wal directory (pg_xlog in PostgreSQL <10)
Sequential write pattern vs random data file access creates I/O conflict
Full page writes can increase WAL size 2-5x after checkpoints (prevents corruption)
LSN (Log Sequence Number) tracks position for replication and recovery

Performance Impact Quantification

WAL overhead: 10-20% when properly configured, 100-1000% when misconfigured
Recovery time: Proportional to WAL volume since last checkpoint
Replication lag: Directly correlated to WAL generation rate and network capacity

Configuration Specifications by Workload

OLTP (Online Transaction Processing)

max_wal_size = 4-8GB              # Reduces checkpoint frequency
checkpoint_timeout = 15min        # Predictable checkpoint intervals
wal_buffers = 256MB-1GB          # Buffers frequent small writes
wal_level = replica              # Standard replication support
synchronous_commit = on          # Durability guarantee

Performance Impact: 1.5-3x faster writes vs defaults
Recovery Time: 5-15 minutes
Resource Cost: 256MB-1GB additional memory

Batch Processing/Data Warehouses

max_wal_size = 16-32GB           # Handle massive transaction bursts
checkpoint_timeout = 30min       # Reduce checkpoint overhead
wal_buffers = 1GB               # Buffer large writes
synchronous_commit = off        # For bulk loads only (recoverable operations)

Performance Impact: 3-10x faster bulk operations
Recovery Time: 15-30 minutes
Risk: Last few seconds of data loss on crash (acceptable for ETL)

Mixed Workloads

max_wal_size = 8GB              # Balance steady and burst loads
checkpoint_timeout = 10min      # Compromise configuration
wal_buffers = 512MB             # Adequate for mixed patterns

Critical Storage Requirements

WAL Storage Separation (Mandatory)

Requirement: Dedicated storage device for pg_wal directory
Implementation: Symlink pg_wal to separate fast storage
Performance Impact: 2-10x improvement on write-heavy workloads
Storage Type: SSD required, NVMe preferred for high-throughput systems

Storage Performance Specifications

WAL writes: Sequential, synchronous (blocks client until written)
Data writes: Random, often asynchronous
Minimum WAL storage: 3x max_wal_size setting
IOPS requirement: Consistent write performance more important than peak reads

Monitoring and Alerting Thresholds

Critical Alerts (Immediate Response Required)

WAL partition >90% full: Database will crash
Any replication slot >10GB behind: WAL accumulation risk
Archiving failures >1 hour: WAL cleanup blocked
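
The three critical thresholds above can be encoded as a single check. The Python sketch below is only an illustration: the function name, its inputs, and the alert wording are assumptions, and the caller is expected to gather the numbers from its own monitoring queries.

```python
from typing import Dict, List

GB = 1024 ** 3


def critical_alerts(wal_fraction_full: float,
                    slot_lag_bytes: Dict[str, int],
                    archive_failing_seconds: float) -> List[str]:
    """Return one message per critical threshold that is currently exceeded."""
    alerts = []
    if wal_fraction_full > 0.90:
        alerts.append("WAL partition over 90% full")
    for name in sorted(slot_lag_bytes):
        if slot_lag_bytes[name] > 10 * GB:
            alerts.append(f"replication slot {name} more than 10GB behind")
    if archive_failing_seconds > 3600:
        alerts.append("WAL archiving failing for over 1 hour")
    return alerts


if __name__ == "__main__":
    # Hypothetical readings: nearly full partition, one lagging slot, healthy archiver.
    print(critical_alerts(0.95, {"replica_a": 11 * GB, "replica_b": 1 * GB}, 0))
```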

Performance Warnings

Requested checkpoints >10% of total: Increase max_wal_size
Average WAL write time >5ms per write (wal_write_time / wal_write) consistently: Storage bottleneck
Average WAL sync time >10ms per sync (wal_sync_time / wal_sync) consistently: Storage or network issue
wal_buffers_full increasing rapidly: Increase wal_buffers
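
Because the timing columns in pg_stat_wal are cumulative totals, the 5ms and 10ms thresholds only make sense as per-operation averages over an interval. The sketch below assumes two samples of the wal_write, wal_sync, wal_write_time and wal_sync_time columns (as exposed by pg_stat_wal in PostgreSQL 14, with track_wal_io_timing enabled); the `WalStats` class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class WalStats:
    wal_write: int         # number of WAL write calls so far
    wal_sync: int          # number of WAL sync calls so far
    wal_write_time: float  # cumulative milliseconds spent writing
    wal_sync_time: float   # cumulative milliseconds spent syncing


def avg_ms(delta_time: float, delta_count: int) -> Optional[float]:
    """Average milliseconds per operation, or None if nothing happened."""
    return delta_time / delta_count if delta_count > 0 else None


def wal_latency(before: WalStats, after: WalStats) -> Tuple[Optional[float], Optional[float]]:
    """Average write and sync latency (ms) between two samples."""
    write = avg_ms(after.wal_write_time - before.wal_write_time,
                   after.wal_write - before.wal_write)
    sync = avg_ms(after.wal_sync_time - before.wal_sync_time,
                  after.wal_sync - before.wal_sync)
    return write, sync


if __name__ == "__main__":
    # Hypothetical samples taken a minute apart.
    print(wal_latency(WalStats(100, 50, 200.0, 400.0), WalStats(200, 100, 900.0, 1000.0)))
```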

Essential Monitoring Queries

-- WAL disk usage
SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir();

-- Checkpoint balance (want <10% requested); NULLIF avoids division by zero right after a stats reset
SELECT round(100.0 * checkpoints_req / NULLIF(checkpoints_timed + checkpoints_req, 0), 1) AS pct_requested
FROM pg_stat_bgwriter;

-- Replication slot lag
SELECT slot_name, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))
FROM pg_replication_slots;

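-- WAL performance metrics (time columns are cumulative milliseconds; divide by the matching count for a per-operation average)
SELECT wal_buffers_full, wal_write, wal_sync, wal_write_time, wal_sync_time FROM pg_stat_wal;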

Common Production Failures and Solutions

WAL Directory Growing Uncontrolled

Causes:

Stuck replication slots (abandoned replicas)
Failed WAL archiving (network/storage issues)
Excessive wal_keep_size setting

Emergency Fix:

Drop abandoned replication slots: SELECT pg_drop_replication_slot('slot_name');
Fix archiving issues (check network, storage, permissions)
Add disk space temporarily while resolving root cause

Checkpoint Performance Disasters

Symptom: Periodic 5-30 second response time spikes
Root Cause: max_wal_size too small causing checkpoint storms
Solution: Increase max_wal_size by 2-4x, monitor checkpoint frequency
Validation: Requested checkpoints should be <10% of total

Recovery Time Unacceptable

Trade-off: Large max_wal_size improves performance but increases recovery time
Solution: Balance based on RTO requirements

Fast recovery needed: max_wal_size 2-4GB, checkpoint_timeout 5-10min
Performance priority: max_wal_size 8-32GB, checkpoint_timeout 15-30min

Advanced Tuning Techniques

Asynchronous Commit for Specific Operations

-- For bulk operations that can be replayed if lost
SET synchronous_commit = off;
-- Perform bulk operations
SET synchronous_commit = on;  -- Restore safety

Risk: Potential loss of last few seconds of data
Benefit: 5-10x faster bulk operations
Use Case: ETL processes, log aggregation, analytics

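WAL Compression (LZ4 and Zstandard added in PostgreSQL 15)

wal_compression = on            # pglz, available since 9.5; PostgreSQL 15+ also accepts lz4 or zstd

Benefit: 20-50% WAL size reduction (only full page images are compressed)
Cost: Additional CPU usage for compression
Best For: High WAL volume workloads with repetitive data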

Commit Grouping for High Concurrency

commit_delay = 100              # 100 microseconds
commit_siblings = 10            # Minimum concurrent transactions

Benefit: Groups multiple commits into single WAL flush
Cost: Slight latency increase per transaction
Requirement: High concurrency (>10 active transactions)

Resource Requirements and Scaling

Memory Requirements

wal_buffers: Start with shared_buffers/32, max 1GB
Typical production: 256MB-1GB depending on write volume
Warning: >1GB provides diminishing returns and wastes memory

Disk Space Planning

Minimum: 3x max_wal_size for safe operation
Recommended: 5x max_wal_size for growth headroom
Archiving enabled: Additional space for archive_timeout period
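
The headroom matters because max_wal_size is a soft limit: WAL can grow past it under heavy load, when archiving fails, or when replication slots hold segments back. A tiny Python sketch of the 3x and 5x rules of thumb above, with an illustrative function name:

```python
from typing import Dict


def wal_volume_sizes(max_wal_size_gb: float) -> Dict[str, float]:
    """Apply the 3x (minimum) and 5x (recommended) sizing rules of thumb."""
    return {
        "minimum_gb": 3 * max_wal_size_gb,
        "recommended_gb": 5 * max_wal_size_gb,
    }


if __name__ == "__main__":
    # Hypothetical setting: max_wal_size = 8GB.
    print(wal_volume_sizes(8))
```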

Network Requirements for Replication

Bandwidth: Must exceed peak WAL generation rate
Latency: Affects replication lag and failover time
Monitoring: Track replication slot lag and network utilization
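
One way to check the bandwidth requirement is to sample pg_current_wal_lsn() twice, convert the difference to bytes per second, and compare it with the link speed. The Python sketch below works on the LSN strings alone; the function names, the sample LSNs, and the link speeds are assumptions for illustration.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN ('high/low' in hex) to a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_rate_bytes_per_sec(lsn_start: str, lsn_end: str, seconds: float) -> float:
    """Average WAL generation rate between two LSN samples."""
    return (lsn_to_int(lsn_end) - lsn_to_int(lsn_start)) / seconds


def link_keeps_up(rate_bytes_per_sec: float, link_mbit_per_sec: float) -> bool:
    """True if the link's raw capacity exceeds the WAL rate (ignores protocol overhead)."""
    return rate_bytes_per_sec * 8 < link_mbit_per_sec * 1_000_000


if __name__ == "__main__":
    # Hypothetical: 16MB of WAL generated in 16 seconds, 10 Mbit/s link.
    rate = wal_rate_bytes_per_sec("0/0", "0/1000000", 16)
    print(rate, link_keeps_up(rate, 10))
```

Sample during the busiest write period, since the requirement is about peak rate rather than the daily average.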

Emergency Procedures

Database Won't Start Due to WAL Issues

Check WAL directory disk space
Verify WAL file permissions and ownership
Last resort: pg_resetwal (causes data loss, expert consultation required)

WAL Partition Full During Production

Immediate: Add disk space or move pg_wal to larger partition
Temporary: Drop unused replication slots
Fix root cause: Resolve archiving failures or stuck replication

Performance Emergency During High Traffic

Quick win: Move pg_wal to dedicated storage if not already separated
Immediate: Increase max_wal_size to reduce checkpoint frequency
Monitor: Track checkpoint balance and adjust further if needed

Version-Specific Considerations

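PostgreSQL 13+ Features

WAL usage tracking in pg_stat_statements (wal_records, wal_fpi, wal_bytes)
Enhanced replication slot monitoring (wal_status and safe_wal_size in pg_replication_slots, max_slot_wal_keep_size)
wal_keep_size replaces wal_keep_segments

PostgreSQL 14+ Features

pg_stat_wal view with cumulative WAL statistics
WAL I/O timing information via track_wal_io_timing

PostgreSQL 15+ Features

LZ4 and Zstandard options for wal_compression
pg_walinspect extension for inspecting WAL records from SQL
Improved checkpoint performance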

Critical Success Factors

Separate WAL storage: Single most important optimization
Workload-appropriate configuration: Don't use defaults for production
Comprehensive monitoring: Prevent issues before they cause outages
Regular testing: Validate recovery procedures and performance under load
Capacity planning: Plan for WAL growth and storage requirements

Decision Matrix: Performance vs Recovery Time

| max_wal_size | Checkpoint Frequency | Write Performance | Recovery Time | Use Case |
|---|---|---|---|---|
| 1-2GB | Every 5-10 min | Baseline | 30sec-2min | Small databases, fast recovery required |
| 4-8GB | Every 15-30 min | 1.5-3x faster | 2-10 min | OLTP production systems |
| 16-32GB | Every 60+ min | 3-10x faster | 10-30 min | Batch processing, data warehouses |

Operational Intelligence Summary

Default PostgreSQL WAL settings are often undersized for production - wal_buffers auto-sizes to at most 16MB and max_wal_size defaults to 1GB
Separate WAL storage is mandatory - Not optional for any production system doing writes
Monitor checkpoint balance - >10% requested checkpoints indicates configuration problem
WAL disk space monitoring is critical - Database crashes when WAL partition fills
Recovery time vs performance is the key trade-off - Larger WAL means better performance but longer recovery
Version matters - PostgreSQL 13+ has significantly better WAL monitoring capabilities

Useful Links for Further Investigation

Essential PostgreSQL WAL Resources

Link	Description
PostgreSQL WAL Configuration	The definitive reference for all WAL parameters. Dense but comprehensive coverage of checkpoint tuning, WAL buffers, and recovery configuration. Essential reading for understanding how PostgreSQL manages WAL internally.
PostgreSQL WAL Internals	Technical deep-dive into WAL file format, LSN (Log Sequence Numbers), and recovery process internals. Critical for understanding how crash recovery actually works and why WAL configuration matters.
PostgreSQL Write-Ahead Logging Introduction	High-level overview of WAL concepts and benefits. Good starting point for understanding why WAL exists and how it enables PostgreSQL's ACID guarantees and replication features.
PostgreSQL Runtime Configuration - WAL	Complete parameter reference for all WAL-related settings including max_wal_size, wal_buffers, checkpoint_timeout, and archiving configuration. Bookmark this for parameter tuning.
PostgreSQL Continuous Archiving and PITR	Comprehensive guide to WAL archiving for point-in-time recovery. Covers archive_command setup, restoration procedures, and WAL shipping for backup strategies.
Tuning max_wal_size in PostgreSQL - EDB	Excellent practical guide showing how proper max_wal_size tuning can provide 1.5-10x performance improvements. Includes real benchmark data and monitoring techniques for checkpoint optimization.
PostgreSQL Performance Tuning - PGEdge	Comprehensive performance guide with WAL-specific tuning recommendations. Covers memory allocation, checkpoint configuration, and monitoring best practices for production systems.
Introduction to PostgreSQL Performance Tuning - EDB	Enterprise-focused tuning guide covering WAL optimization alongside query tuning, memory configuration, and storage considerations. Good resource for holistic performance optimization.
PostgreSQL Performance Tuning Guide - Percona	Practical tuning guide covering WAL buffers, checkpoint tuning, and performance monitoring. Includes specific parameter recommendations for different workload types.
Why does my pg_wal keep growing? - CYBERTEC	Essential troubleshooting guide for WAL disk space issues. Covers replication slot problems, archiving failures, and emergency recovery procedures. Keep this bookmarked for production emergencies.
Monitoring WAL Files - pgDash	Comprehensive monitoring guide covering WAL metrics, alerting thresholds, and dashboard setup. Excellent resource for setting up proactive WAL monitoring in production.
PostgreSQL Checkpoints, Buffers, and WAL Usage - Percona	Detailed monitoring setup using Percona Monitoring and Management. Covers checkpoint tracking, buffer hit ratios, and WAL performance metrics visualization.
Monitoring PostgreSQL WAL Files - RockData	Step-by-step tutorial for WAL monitoring setup. Covers basic monitoring queries and alerting configuration for WAL-related issues.
pg_stat_wal View Documentation	Complete reference for the pg_stat_wal system view introduced in PostgreSQL 14. Essential for monitoring WAL write performance, buffer usage, and I/O timing metrics.
pg_walinspect Extension	Official PostgreSQL extension for low-level WAL inspection and analysis. Useful for forensic analysis and understanding WAL record patterns in your workload.
pg_stat_statements for WAL Usage	PostgreSQL 13+ includes WAL usage statistics in pg_stat_statements. Critical for identifying queries that generate excessive WAL and optimizing write-heavy workloads.
Postgres Exporter for Prometheus	Popular monitoring solution that includes comprehensive WAL metrics collection. Essential for Prometheus/Grafana-based PostgreSQL monitoring stacks.
PostgreSQL WAL Archiving - OpsDash	Comprehensive guide to WAL archiving setup, monitoring, and troubleshooting. Covers local archiving, cloud storage integration, and recovery procedures.
Monitoring PostgreSQL Replication - CYBERTEC	Essential guide for monitoring WAL-based replication. Covers replication lag monitoring, slot management, and troubleshooting replication issues.
PostgreSQL WAL Activities - DEV Community	Developer-focused explanation of WAL operations and their impact on application design. Good resource for understanding WAL from an application development perspective.
How WAL Archiving Monitoring Improved in PostgreSQL 9.4 - EDB	Historical perspective on WAL archiving improvements. Covers the pg_stat_archiver view and its role in production monitoring.
Measuring PostgreSQL WAL Throughput - Estuary	Practical guide to measuring and analyzing WAL throughput using SQL queries. Includes scripts for tracking WAL generation patterns and identifying performance bottlenecks.
Key Metrics for PostgreSQL Monitoring - Datadog	Enterprise monitoring guide covering WAL metrics alongside other PostgreSQL performance indicators. Good resource for comprehensive database monitoring strategy.
Monitoring Transaction Logs in PostgreSQL - Pythian	Detailed technical guide to WAL monitoring and troubleshooting from a database consultant perspective. Covers real-world production scenarios and solutions.
PostgreSQL Monitoring Wiki	Community-maintained monitoring resource covering WAL monitoring alongside other PostgreSQL metrics. Good starting point for monitoring strategy development.
PostgreSQL Statistics Documentation	Complete reference for PostgreSQL's statistics system including WAL-related views and counters. Essential for understanding what metrics are available for monitoring.
WAL Monitoring in PostgreSQL 13+ - rjuju's blog	Technical blog post explaining WAL monitoring improvements in PostgreSQL 13 and later versions. Covers new metrics and monitoring capabilities.
