Memory Configuration That Actually Works (Not Documentation Bullshit)

The PostgreSQL documentation lies to you about memory settings. I've tuned hundreds of production instances, and the "recommended" settings will fuck your performance more often than help it. Here's what actually works based on real production experience and PostgreSQL performance tuning research.

Shared Buffers: Ignore the 25% Rule

The shared_buffers documentation says to set shared_buffers to 25% of RAM. That's garbage advice from 2005, as discussed in PostgreSQL performance tuning guides. On modern systems with 32GB+ of RAM, you're wasting memory and creating competition with the OS filesystem cache.

What actually works according to production tuning experience:

  • 8-16GB RAM: 25% is fine (2-4GB)
  • 32GB+ RAM: 15-20% maximum (4-6GB on 32GB system)
  • High-write workloads: Even lower, around 10-15%

I've seen systems with 64GB RAM where dropping shared_buffers from 16GB to 8GB improved performance by 40%. The OS filesystem cache is more efficient than PostgreSQL's buffer pool for most read workloads, as explained in buffer management internals.

## Good starting point for 32GB system
shared_buffers = 4GB
effective_cache_size = 24GB  # 75% of total RAM

Work Memory: The Setting That Breaks Everything

`work_mem` controls memory for sorting, hashing, and bitmap operations per operation per connection. Set this too high and you'll run out of memory. Too low and PostgreSQL hits disk for every complex query. The PostgreSQL query execution internals explain why this setting is so critical.

The math that matters:

  • Each connection can use work_mem multiple times for complex queries
  • A 5-table join might use 4x work_mem per connection
  • 100 connections × 4MB × 4 operations = 1.6GB just for query operations

Real production settings:

  • OLTP workloads: 2-4MB (lots of simple queries)
  • Analytics/reporting: 16-64MB (complex queries, fewer connections)
  • Mixed workloads: 8-16MB with connection pooling

The danger zone is setting work_mem above 64MB without understanding your query patterns. I've seen systems OOM because someone set it to 256MB and a reporting query spawned 10 sort operations.
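
If a handful of reporting queries genuinely need big sorts, scope the bigger value instead of raising it globally. A minimal sketch, assuming a dedicated reporting role (the role name is an example):

-- Only the reporting role gets the larger default (role name is hypothetical)
ALTER ROLE reporting_role SET work_mem = '64MB';

-- Or raise it for a single transaction around one known-heavy query
BEGIN;
SET LOCAL work_mem = '256MB';   -- reverts automatically at COMMIT/ROLLBACK
-- run the heavy reporting query here
COMMIT;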

Connection Memory Overhead Reality Check

Every PostgreSQL connection is a separate process that consumes:

  • Base overhead: 2-4MB per connection
  • work_mem: Multiplied by active operations
  • temp_buffers: 8MB default per connection
  • Backend memory: 1-2MB for connection state

The calculation everyone gets wrong:

Real memory per connection = 4MB + (work_mem × avg_operations) + temp_buffers

With work_mem=16MB and average 2 operations per query:

  • Per connection: 4MB + (16MB × 2) + 8MB = 44MB
  • 100 connections ≈ 4.4GB just for connection overhead
  • 500 connections ≈ 22GB before you store any actual data

This is why connection pooling isn't optional above 100 connections.
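
A back-of-envelope check against your live settings; this sketch bakes in the same assumptions as the math above (4MB base overhead, two work_mem-sized operations per query):

-- Rough estimate only: the 4MB base and 2 operations per query are assumptions
SELECT count(*) AS client_backends,
       pg_size_pretty(count(*) * (
           4 * 1024 * 1024                                                                -- assumed base overhead
         + 2 * (SELECT setting::bigint * 1024 FROM pg_settings WHERE name = 'work_mem')   -- work_mem is reported in kB
         + (SELECT setting::bigint * 8192 FROM pg_settings WHERE name = 'temp_buffers')   -- temp_buffers is reported in 8kB pages
       )) AS estimated_memory
FROM pg_stat_activity
WHERE backend_type = 'client backend';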

Maintenance Work Memory: Don't Be Cheap

maintenance_work_mem affects VACUUM, CREATE INDEX, and REINDEX operations. The default 64MB is pathetic for production systems.

Set it to 5-10% of total RAM or 1-2GB, whichever is smaller. Large maintenance_work_mem dramatically speeds up:

  • Index creation (can be 3x faster with 2GB vs 64MB)
  • VACUUM operations on large tables
  • Restoring dumps (index rebuilds during pg_restore, not the dump itself)

## For 32GB system
maintenance_work_mem = 2GB
autovacuum_work_mem = 1GB  # Separate setting for autovacuum
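
For a one-off index build you can also bump it for just that session instead of globally; the table and index names below are placeholders:

-- Session-only bump around a big index build
SET maintenance_work_mem = '4GB';
CREATE INDEX CONCURRENTLY idx_big_table_created_at ON big_table (created_at);
RESET maintenance_work_mem;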

WAL Buffer Size: Small Setting, Big Impact

wal_buffers determines WAL (Write-Ahead Log) buffer size. Too small and PostgreSQL flushes to disk constantly. Too large and you waste memory.

Formula that works: shared_buffers / 32 but between 16MB and 64MB.

For most systems, 16-32MB is perfect. Going above 64MB rarely helps and wastes memory that could be used for OS cache.

Checkpoint Configuration: Control the Pain

Checkpoints flush dirty pages to disk and can cause massive I/O spikes. Bad checkpoint tuning creates the classic "why does my database pause for 10 seconds every few minutes" problem.

## Spread checkpoints over time instead of I/O spikes
checkpoint_completion_target = 0.9
checkpoint_timeout = 15min

## Use more WAL segments to avoid frequent checkpoints
max_wal_size = 4GB      # Increase if you have write-heavy workloads
min_wal_size = 1GB

checkpoint_completion_target = 0.9 spreads the checkpoint I/O over 90% of the checkpoint interval instead of doing it all at once. This prevents the I/O storms that make your application timeout.
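
To confirm whether checkpoints are being forced by WAL volume rather than fired by the timer (forced ones are the painful kind), compare the counters:

-- Requested (forced) checkpoints should be rare compared to timed ones
SELECT checkpoints_timed, checkpoints_req
FROM pg_stat_bgwriter;
-- On PostgreSQL 17+ these counters moved: SELECT num_timed, num_requested FROM pg_stat_checkpointer;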

Real Configuration Example

Here's a production configuration for a 32GB RAM system running mixed OLTP/analytics:

## Memory settings
shared_buffers = 6GB
work_mem = 8MB
maintenance_work_mem = 2GB
effective_cache_size = 24GB
temp_buffers = 32MB

## WAL settings
wal_buffers = 32MB
checkpoint_completion_target = 0.9
max_wal_size = 4GB
min_wal_size = 1GB

## Connection and background processes
max_connections = 200  # With connection pooling
autovacuum_work_mem = 1GB

This configuration assumes you're using connection pooling (mandatory) and have monitoring in place. Adjust work_mem down for higher connection counts, up for fewer concurrent users with complex queries.
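
After a reload or restart, verify the values PostgreSQL actually picked up instead of trusting the file:

-- The unit column shows how each setting is expressed (kB, 8kB pages, etc.)
SELECT name, setting, unit, source
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'maintenance_work_mem',
               'effective_cache_size', 'wal_buffers', 'max_wal_size');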

Monitoring Memory Usage

Don't guess if your memory configuration works. Monitor these metrics using tools like pg_stat_statements, pgbadger, or Grafana dashboards:

-- Check buffer hit ratios (want >95%)
SELECT 
    schemaname,
    relname,
    heap_blks_read,
    heap_blks_hit,
    round((heap_blks_hit::numeric / (heap_blks_hit + heap_blks_read)) * 100, 2) as hit_ratio
FROM pg_statio_user_tables 
WHERE heap_blks_read > 0
ORDER BY hit_ratio ASC;

-- Monitor connection memory usage
-- pg_stat_activity has no per-backend memory columns; check per-process RSS at
-- the OS level (ps, top), or on PostgreSQL 14+ dump a backend's memory contexts
-- to the server log (superuser by default, grantable on newer releases):
SELECT pid,
       usename,
       application_name,
       state,
       query_start,
       pg_log_backend_memory_contexts(pid) AS contexts_logged
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid();

If your buffer hit ratio is below 95%, you need more shared_buffers or effective_cache_size. If connections are using more memory than expected, reduce work_mem or implement connection pooling.

Performance Issues You'll Actually Face (And How to Fix Them)

Q

My query was fast yesterday, now it takes 30 seconds. WTF?

A

The query planner statistics are stale or PostgreSQL chose a shit execution plan. First, check if autovacuum ran:

SELECT schemaname, relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze 
FROM pg_stat_user_tables 
WHERE schemaname = 'public' AND relname = 'your_slow_table';

If last_analyze is more than a day old on a frequently updated table, manually update statistics:

ANALYZE your_table_name;

This usually fixes 80% of "why did my query suddenly get slow" problems. PostgreSQL's planner makes decisions based on table statistics, and stale stats lead to terrible query plans.

Q

PostgreSQL is using 100% CPU but queries are still slow

A

You're probably doing sequential scans on large tables because indexes are missing or not being used. Check what's actually running:

SELECT pid, now() - query_start as duration, state, query 
FROM pg_stat_activity 
WHERE state = 'active' AND query NOT LIKE '%pg_stat_activity%'
ORDER BY duration DESC;

Run EXPLAIN (ANALYZE, BUFFERS) on the slow queries. Look for:

  • "Seq Scan" on tables with >10k rows
  • "Nested Loop" with large tables
  • High "Buffers: shared read" numbers

Quick fix: Create the missing indexes. Better fix: Review your query patterns and indexing strategy.
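
A quick way to find the tables getting hammered by sequential scans, and therefore the most likely homes for those missing indexes:

-- Tables read mostly via sequential scans; a high avg_rows_per_scan is an index candidate
SELECT relname,
       seq_scan,
       idx_scan,
       seq_tup_read,
       seq_tup_read / NULLIF(seq_scan, 0) AS avg_rows_per_scan
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 20;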

Q

Connection pooling with PgBouncer seems slower than direct connections

A

You're probably using session pooling instead of transaction pooling. Session pooling maintains session state but kills performance because connections aren't reused efficiently.

PgBouncer config that actually works:

[databases]
your_db = host=localhost port=5432 dbname=production

[pgbouncer]
pool_mode = transaction
max_client_conn = 300
default_pool_size = 25
reserve_pool_size = 5

Transaction pooling breaks applications that rely on session state (temp tables, prepared statements, SET variables). If your app needs these features, fix the application code instead of using session pooling.

Q

My writes are fast but reads are slow as shit

A

Your shared_buffers is too small or you're not using the OS page cache effectively. Check buffer hit ratios:

SELECT 
    sum(heap_blks_read) as heap_read,
    sum(heap_blks_hit) as heap_hit,
    (sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read))) * 100 as ratio
FROM pg_statio_user_tables;

If the ratio is below 95%, increase shared_buffers or effective_cache_size. But don't set shared_buffers above 8GB unless you have >64GB RAM.

Q

VACUUM is running constantly and my database is still bloated

A

Your autovacuum_vacuum_scale_factor is too high for busy tables. The default waits until 20% of the table is dead tuples - that's insane for large, frequently updated tables.

Check table bloat:

SELECT schemaname, relname,
       n_dead_tup, 
       n_live_tup,
       round((n_dead_tup::numeric / (n_dead_tup + n_live_tup)) * 100, 2) as dead_ratio
FROM pg_stat_user_tables 
WHERE n_dead_tup > 0
ORDER BY dead_ratio DESC;

For tables with >1 million rows and frequent updates, tune autovacuum per-table:

ALTER TABLE busy_table SET (
    autovacuum_vacuum_scale_factor = 0.05,  -- 5% instead of 20%
    autovacuum_analyze_scale_factor = 0.02  -- 2% instead of 10%
);

Q

My database ran out of disk space because of WAL files

A

Your max_wal_size is too small for your write workload, causing frequent checkpoints. Or you have a stuck replication slot preventing WAL cleanup.

Check WAL usage:

SELECT slot_name, active, restart_lsn, 
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as retained_wal
FROM pg_replication_slots
WHERE restart_lsn IS NOT NULL;

If retained_wal is multiple GBs, you have a stuck replication slot. Drop it or fix the subscriber:

SELECT pg_drop_replication_slot('stuck_slot_name');

For high write workloads, increase max_wal_size:

## In postgresql.conf
max_wal_size = 8GB        # or higher for very busy systems
checkpoint_timeout = 15min

Q

Random queries timeout with "canceling statement due to statement timeout"

A

Your statement_timeout is too aggressive, or you have queries that occasionally need more time due to lock waits or plan changes.

Check for lock contention:

SELECT blocked_locks.pid AS blocked_pid,
       blocking_locks.pid AS blocking_pid,
       blocked_activity.query AS blocked_statement,
       blocking_activity.query AS blocking_statement
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
    AND blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE
    AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.GRANTED;

If you see lock waits, either optimize the blocking queries or increase statement_timeout for specific operations that legitimately need more time.

Query Optimization and Index Design That Actually Fixes Performance

Query optimization isn't about memorizing PostgreSQL internals - it's about understanding what makes queries slow and fixing the underlying problems. I've debugged thousands of slow queries, and 90% fall into predictable patterns that proper indexing and query structure can solve.

EXPLAIN ANALYZE: Your Best Friend for Performance Debugging

Every slow query investigation starts with `EXPLAIN (ANALYZE, BUFFERS)`. This shows you exactly what PostgreSQL's query planner is doing and where time is being spent according to PostgreSQL performance documentation:

EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) 
SELECT u.username, p.title, p.created_at 
FROM users u 
JOIN posts p ON u.id = p.user_id 
WHERE u.created_at > '2024-01-01' 
ORDER BY p.created_at DESC 
LIMIT 20;

What to look for in EXPLAIN output:

  • Sequential Scans on large tables (>10k rows) - usually means missing indexes
  • Nested Loop joins on large datasets - often indicates missing join indexes
  • High "Buffers: shared read" numbers - indicates disk I/O bottlenecks
  • Actual time vs Estimated rows mismatches - stale statistics problem

The most common mistake is focusing on execution time instead of buffer reads. A query that reads 100,000 buffers but runs in 50ms is a disaster waiting to happen when cache pressure increases.

Index Types: When to Use What (Beyond Basic B-Tree)

B-tree indexes (default) work for 80% of cases - equality, range queries, ORDER BY. They're your go-to for most columns as explained in the PostgreSQL index types documentation.

GIN indexes for JSONB, arrays, and full-text search:

-- For JSONB containment queries
CREATE INDEX CONCURRENTLY idx_metadata_gin ON products USING GIN (metadata);

-- Query: WHERE metadata @> '{"category": "electronics"}'
-- This will use the GIN index efficiently

GIN indexes are massive (3-5x larger than B-tree) and slow down writes significantly. Only create them when you actually need JSONB/array searches.

GiST indexes for geometric data, full-text search with ranking, and nearest-neighbor queries:

-- For PostGIS location queries  
CREATE INDEX CONCURRENTLY idx_location_gist ON stores USING GiST (location);

-- Query: ORDER BY location <-> ST_Point(-122.4194, 37.7749) LIMIT 10
-- Finds nearest 10 stores to San Francisco

BRIN indexes for time-series data with natural clustering:

-- For timestamp columns on large tables
CREATE INDEX CONCURRENTLY idx_created_brin ON events USING BRIN (created_at);

BRIN indexes are tiny (1MB for billion-row tables) but only work when data is physically ordered. Perfect for append-only tables with timestamp columns.
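
Before creating one, check pg_stats.correlation for the column; values near +/-1.0 mean the rows really are stored in column order and BRIN will pay off:

-- correlation close to +1.0 or -1.0 means the table is physically ordered on this column
SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename = 'events' AND attname = 'created_at';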

Partial Indexes: The Optimization Nobody Uses

Partial indexes only include rows matching a WHERE condition. They're smaller, faster, and perfect for queries that always filter on the same conditions:

-- Instead of indexing all orders
CREATE INDEX idx_orders_status ON orders (status);

-- Create separate partial indexes for common queries
CREATE INDEX CONCURRENTLY idx_active_orders ON orders (user_id, created_at) 
WHERE status IN ('pending', 'processing');

CREATE INDEX CONCURRENTLY idx_completed_orders ON orders (created_at) 
WHERE status = 'completed';

The partial index is 5-10x smaller than a full index and much faster for queries that match the WHERE condition. Use them for:

  • Soft-deleted records (WHERE deleted_at IS NULL)
  • Active/inactive status filtering
  • Date range filtering on large historical tables
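
The soft-delete case from that list, sketched with an assumed deleted_at column:

-- Only live rows are indexed; queries must repeat the WHERE clause to use it
CREATE INDEX CONCURRENTLY idx_users_live_email ON users (email)
WHERE deleted_at IS NULL;

-- This can use the partial index
SELECT * FROM users WHERE email = 'john@example.com' AND deleted_at IS NULL;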

Multi-Column Index Order: Getting It Right

Column order in multi-column indexes matters enormously. PostgreSQL can use prefixes of multi-column indexes, so put the most selective columns first:

-- Wrong: Low selectivity first
CREATE INDEX idx_orders_bad ON orders (status, user_id, created_at);

-- Right: High selectivity first  
CREATE INDEX idx_orders_good ON orders (user_id, created_at, status);

The user_id column is much more selective than status (thousands of users vs 3-4 status values). This index supports queries filtering on:

  • user_id alone
  • user_id and created_at
  • user_id, created_at, and status

But it won't help queries that only filter on status or created_at without user_id.

Query Patterns That Kill Performance

N+1 Query Problem:

-- Bad: Runs one query per user
FOR each user:
    SELECT * FROM posts WHERE user_id = user.id LIMIT 5;

-- Good: Single query, top 5 posts per user via a window function
SELECT username, title, created_at
FROM (
    SELECT u.username, p.title, p.created_at,
           ROW_NUMBER() OVER (PARTITION BY u.id ORDER BY p.created_at DESC) as rn
    FROM users u
    JOIN posts p ON u.id = p.user_id
) ranked
WHERE rn <= 5;

OR Conditions That Break Indexes:

-- Bad: Can't use indexes efficiently
SELECT * FROM users WHERE username = 'john' OR email = 'john@example.com';

-- Good: Use UNION ALL for different columns
SELECT * FROM users WHERE username = 'john'
UNION ALL
SELECT * FROM users WHERE email = 'john@example.com' AND username != 'john';

Large OFFSET Values:

-- Bad: Scans and discards 100,000 rows
SELECT * FROM posts ORDER BY created_at DESC OFFSET 100000 LIMIT 20;

-- Good: Cursor-based pagination
SELECT * FROM posts 
WHERE created_at < '2023-06-15 10:30:00'  -- Last timestamp from previous page
ORDER BY created_at DESC 
LIMIT 20;
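
If created_at isn't unique, add a tie-breaker so pages never skip or repeat rows; a sketch assuming an id primary key:

-- Row-value comparison keeps ordering stable when timestamps collide;
-- the literal values come from the last row of the previous page
SELECT * FROM posts
WHERE (created_at, id) < ('2023-06-15 10:30:00', 987654)
ORDER BY created_at DESC, id DESC
LIMIT 20;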

Join Optimization: Making Multi-Table Queries Fast

PostgreSQL has three join algorithms: Nested Loop, Hash Join, and Merge Join. Understanding when each is used helps optimize complex queries.

Nested Loop - Good for small datasets or when one side has very few rows:

-- Efficient when filtering to small result set
SELECT u.username, p.title 
FROM users u 
JOIN posts p ON u.id = p.user_id 
WHERE u.id = 12345;  -- Single user lookup

Hash Join - Good for medium-sized datasets where one side fits in work_mem:

-- Efficient for moderate-sized joins
SELECT u.username, COUNT(p.id) 
FROM users u 
JOIN posts p ON u.id = p.user_id 
WHERE u.created_at > '2024-01-01'
GROUP BY u.username;

Merge Join - Good for large datasets that are pre-sorted:

-- Efficient when both sides are indexed on join columns
SELECT u.username, p.title 
FROM users u 
JOIN posts p ON u.id = p.user_id 
ORDER BY u.id, p.created_at;

Force specific join algorithms for testing:

SET enable_hashjoin = false;  -- Forces Nested Loop or Merge Join
SET enable_mergejoin = false; -- Forces Nested Loop or Hash Join
SET enable_nestloop = false;  -- Forces Hash Join or Merge Join

Monitoring Query Performance in Production

Don't wait for users to complain about slow queries. Monitor performance proactively with pg_stat_statements, pgbadger for log analysis, and pg_stat_activity for real-time monitoring:

-- Enable pg_stat_statements in postgresql.conf
shared_preload_libraries = 'pg_stat_statements'

-- Find slowest queries by total time
SELECT query, calls, total_exec_time, mean_exec_time, 
       rows/calls as avg_rows
FROM pg_stat_statements 
ORDER BY total_exec_time DESC 
LIMIT 10;

-- Find queries with most I/O
SELECT query, calls, shared_blks_read, shared_blks_hit,
       shared_blks_read / calls as avg_read_per_call
FROM pg_stat_statements 
WHERE shared_blks_read > 0
ORDER BY shared_blks_read DESC 
LIMIT 10;

Set up alerting when:

  • Average query time exceeds 500ms
  • Buffer read ratio drops below 95%
  • Any single query uses >10GB of work_mem
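
pg_stat_statements also tracks temp-file spills, which is the practical signal that a query outgrew work_mem:

-- Queries whose sorts or hashes spilled past work_mem onto disk
SELECT query, calls,
       pg_size_pretty(temp_blks_written * 8192) AS temp_written
FROM pg_stat_statements
WHERE temp_blks_written > 0
ORDER BY temp_blks_written DESC
LIMIT 10;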

Real Production Example: Optimizing a Reporting Query

Before optimization:

-- 45 second execution, 2.3M buffer reads
SELECT u.username, u.email, COUNT(o.id) as order_count,
       SUM(o.total_amount) as total_spent
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE u.created_at >= '2023-01-01'
AND o.status = 'completed'
GROUP BY u.id, u.username, u.email
HAVING COUNT(o.id) > 5
ORDER BY total_spent DESC;

Issues identified:

  • Sequential scan on 10M user table
  • Nested loop join on 50M order table
  • GROUP BY on unindexed columns

After optimization:

-- 1.2 second execution, 45k buffer reads
WITH high_value_users AS (
    SELECT user_id, COUNT(*) as order_count, SUM(total_amount) as total_spent
    FROM orders 
    WHERE status = 'completed' 
    AND created_at >= '2023-01-01'  -- Assuming orders.created_at correlates with user activity
    GROUP BY user_id
    HAVING COUNT(*) > 5
)
SELECT u.username, u.email, h.order_count, h.total_spent
FROM high_value_users h
JOIN users u ON h.user_id = u.id  
WHERE u.created_at >= '2023-01-01'
ORDER BY h.total_spent DESC;

Indexes added:

CREATE INDEX CONCURRENTLY idx_orders_status_created ON orders (status, created_at);
CREATE INDEX CONCURRENTLY idx_orders_user_id_status ON orders (user_id) WHERE status = 'completed';
CREATE INDEX CONCURRENTLY idx_users_created ON users (created_at);

The key insight: filter on the large table (orders) first to reduce the join size, rather than starting with the users table and joining to all their orders.

Performance Optimization Strategies: What Actually Works vs What Wastes Time

| Optimization Approach | Effort Required | Impact on Performance | When It Backfires | Real Production Result |
|---|---|---|---|---|
| Proper Indexing Strategy | Medium | High (5-100x faster queries) | Over-indexing kills write performance | Cut query time from 30s to 200ms |
| Connection Pooling | Low | High (eliminates connection overhead) | Session state breaks, wrong pool size | Reduced CPU usage 60%, fixed OOM crashes |
| Memory Configuration | Low | Medium-High (eliminates swapping) | Too much shared_buffers hurts OS cache | 40% performance gain from tuning shared_buffers |
| Query Rewriting | High | High (eliminates bad execution plans) | Complex queries become unmaintainable | Reporting queries: 45s → 1.2s |
| Autovacuum Tuning | Medium | Medium (prevents table bloat) | Too aggressive vacuuming locks tables | Reduced bloat from 40% to 5% |
| Hardware Upgrades | Low (if you have budget) | Medium | Doesn't fix bad queries or indexes | SSD upgrade: 2x performance boost |
| Read Replicas | Medium | High (for read-heavy workloads) | Replication lag breaks real-time queries | Offloaded 70% of queries from primary |
| Partitioning | High | Medium (only for very large tables) | Adds complexity, constraint exclusion issues | 10TB table: 30s queries → 3s |
| Sharding | Very High | High (but at huge cost) | Application complexity nightmare | Works, but requires 3x engineering effort |

Production Monitoring and Alerting That Actually Prevents Outages

Setting up monitoring isn't about collecting metrics for the sake of it - it's about catching performance problems before they kill your application. I've seen too many PostgreSQL databases die from preventable issues that proper monitoring could have caught hours or days in advance using pg_stat_statements, pg_stat_activity, and pgbadger log analysis.

Essential Metrics That Actually Matter

Buffer Hit Ratio (Most Important):

-- Want this above 95% consistently
SELECT 
    round((sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read))) * 100, 2) as buffer_hit_ratio
FROM pg_statio_user_tables;

When this drops below 95%, your database is hitting disk too much according to PostgreSQL performance tuning guides. Usually means you need more `shared_buffers`, better indexes, or more RAM. A drop from 99% to 92% hit ratio often means 5x slower queries as explained in PostgreSQL buffer management.

Connection Usage:

-- Alert when > 80% of max_connections
SELECT count(*) as total_connections, 
       (SELECT setting::int FROM pg_settings WHERE name = 'max_connections') as max_connections,
       round((count(*)::numeric / (SELECT setting::int FROM pg_settings WHERE name = 'max_connections')) * 100, 1) as usage_percent
FROM pg_stat_activity;  -- count all sessions: idle connections hold slots too

Connection exhaustion causes "FATAL: remaining connection slots are reserved" errors as documented in PostgreSQL client authentication. Set alerts at 80% usage and implement connection pooling with PgBouncer immediately, following PostgreSQL connection pooling best practices.

Database Size Growth:

-- Monitor rapid size increases
SELECT schemaname, tablename, 
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename) - pg_relation_size(schemaname||'.'||tablename)) as index_size
FROM pg_tables 
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;

Sudden database growth often indicates runaway processes, missing cleanup jobs, or vacuum issues. Alert on >20% growth per day for established databases.

Autovacuum Monitoring (Critical for Write-Heavy Systems)

Autovacuum failures are the #1 cause of gradual PostgreSQL performance degradation. Monitor vacuum lag religiously:

-- Check vacuum lag by table
SELECT schemaname, relname,
       n_dead_tup,
       n_live_tup,
       round((n_dead_tup::numeric / (n_dead_tup + n_live_tup)) * 100, 1) as dead_percentage,
       last_vacuum,
       last_autovacuum
FROM pg_stat_user_tables 
WHERE n_dead_tup > 1000
ORDER BY dead_percentage DESC;

Alert thresholds:

  • Warning: >15% dead tuples on tables with >10k rows
  • Critical: >30% dead tuples or no vacuum in 48+ hours on active tables

When autovacuum can't keep up, tables get bloated and queries slow down exponentially. I've seen 100ms queries become 10-second queries due to table bloat.

Manual vacuum for emergency cleanup:

-- For tables with excessive bloat
VACUUM (ANALYZE, VERBOSE) table_name;

-- Nuclear option (takes exclusive lock, use carefully)
VACUUM FULL table_name;  -- Only during maintenance window

Query Performance Monitoring with pg_stat_statements

Enable pg_stat_statements to track query performance automatically:

-- In postgresql.conf
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.max = 10000
pg_stat_statements.track = all

-- Find queries using most total time
SELECT query,
       calls,
       round(total_exec_time::numeric, 2) as total_time_ms,
       round(mean_exec_time::numeric, 2) as avg_time_ms,
       round((100 * total_exec_time / sum(total_exec_time) OVER())::numeric, 2) as percentage
FROM pg_stat_statements 
ORDER BY total_exec_time DESC 
LIMIT 10;

Alert on:

  • Any query with >1000ms average execution time
  • Queries using >50% of total database time
  • New queries with >10,000 buffer reads per execution

Connection and Lock Monitoring

Long-running queries and lock conflicts destroy application performance:

-- Find long-running queries
SELECT pid, 
       now() - query_start as duration,
       state,
       query
FROM pg_stat_activity 
WHERE state = 'active' 
AND now() - query_start > interval '5 minutes'
ORDER BY duration DESC;

-- Monitor lock waits
SELECT blocked.pid AS blocked_pid,
       blocking.pid AS blocking_pid,
       blocked_activity.query AS blocked_query,
       blocking_activity.query AS blocking_query,
       blocked_activity.application_name AS blocked_app
FROM pg_catalog.pg_locks blocked
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked.pid
JOIN pg_catalog.pg_locks blocking ON blocking.locktype = blocked.locktype
    AND blocking.database IS NOT DISTINCT FROM blocked.database
    AND blocking.relation IS NOT DISTINCT FROM blocked.relation
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking.pid
WHERE NOT blocked.granted;

Kill runaway queries:

-- Gentle cancellation (recommended)
SELECT pg_cancel_backend(12345);  -- Replace with actual PID

-- Nuclear option (kills connection)
SELECT pg_terminate_backend(12345);  -- Use only when cancel fails
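
To keep runaways from piling up in the first place, set conservative timeouts at the role level; the role name and values here are examples, not recommendations for every workload:

-- Applies to new sessions for this role; individual sessions can still override
ALTER ROLE app_user SET statement_timeout = '30s';
ALTER ROLE app_user SET idle_in_transaction_session_timeout = '60s';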

Disk Space and WAL Monitoring

Running out of disk space kills PostgreSQL instantly. Monitor disk usage and WAL growth:

-- Check database sizes
SELECT datname, 
       pg_size_pretty(pg_database_size(datname)) as database_size
FROM pg_database 
WHERE datistemplate = false
ORDER BY pg_database_size(datname) DESC;

-- Monitor WAL directory size (PostgreSQL 10+; requires superuser or pg_monitor)
SELECT pg_size_pretty(sum(size)) as wal_size
FROM pg_ls_waldir();

Alert thresholds:

  • Warning: <20% free disk space
  • Critical: <10% free disk space
  • Emergency: WAL directory >10GB (indicates stuck replication or checkpoint issues)

Real Production Monitoring Stack

Here's what actually works in production environments:

Infrastructure Monitoring:

  • Prometheus + Grafana with postgres_exporter
  • pganalyze (paid service, excellent for query analysis)
  • Datadog or New Relic (if already using for application monitoring)

Key dashboards to create:

  1. Connection usage over time (spot connection leaks)
  2. Buffer hit ratio trends (catch memory pressure)
  3. Slowest queries by total time (identify optimization targets)
  4. Table bloat percentages (autovacuum health)
  5. Replication lag (if using read replicas)
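
For the replication-lag dashboard, the primary already exposes what you need:

-- Lag per standby as seen from the primary (PostgreSQL 10+)
SELECT client_addr, state,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_bytes,
       replay_lag
FROM pg_stat_replication;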

Alert rules that prevent outages:

## Connection exhaustion warning
- alert: PostgreSQLHighConnectionUsage
  expr: pg_stat_activity_count / pg_settings_max_connections > 0.8
  for: 5m

## Buffer hit ratio degradation  
- alert: PostgreSQLLowBufferHitRatio
  expr: pg_stat_database_blks_hit_ratio < 0.95
  for: 10m

## Autovacuum falling behind
- alert: PostgreSQLHighTableBloat
  expr: pg_stat_user_tables_n_dead_tup / (pg_stat_user_tables_n_live_tup + pg_stat_user_tables_n_dead_tup) > 0.2
  for: 30m

## Long-running queries
- alert: PostgreSQLLongRunningQuery  
  expr: pg_stat_activity_max_tx_duration > 300  # 5 minutes
  for: 2m

Performance Testing and Capacity Planning

Don't wait for production to find performance limits. Load test your database before you hit scaling problems:

pgbench for basic load testing:

## Initialize test data
pgbench -i -s 100 test_database  # Scale factor 100 = ~10M rows

## Run read/write workload
pgbench -c 20 -j 4 -T 300 test_database  # 20 connections, 5 minutes

## Custom transaction script
pgbench -c 10 -T 60 -f custom_queries.sql production_database
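
custom_queries.sql is whatever workload you want to replay; a minimal hypothetical script looks like this (a pgbench \set meta-command plus plain SQL, names are placeholders):

-- hypothetical custom_queries.sql: point lookups driven by a random id
\set uid random(1, 1000000)
SELECT * FROM users WHERE id = :uid;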

Realistic load testing approach:

  1. Capture production queries with pg_stat_statements
  2. Replay queries against staging with realistic data volumes
  3. Gradually increase concurrent connections until performance degrades
  4. Identify bottlenecks before they hit production

Capacity planning metrics to track:

  • Max sustainable connections before response time degrades
  • Storage growth rate for disk planning
  • Memory usage under peak load
  • I/O patterns to size storage appropriately

When to Scale Beyond Single-Server PostgreSQL

Vertical scaling works until:

  • 64-128GB RAM (diminishing returns on larger instances)
  • 50,000+ queries per second
  • Storage I/O becomes the bottleneck despite SSDs

Read replica indicators:

  • 70% of queries are reads
  • Reporting/analytics queries impact OLTP performance
  • Geographic distribution needs for read performance

Sharding/clustering indicators:

  • Single table >10TB with active queries across all data
  • Write throughput >25,000 inserts/updates per second
  • Query patterns allow horizontal partitioning

Most companies never hit these limits. Focus on optimization before architecture complexity.

The Monitoring Mindset

Performance monitoring isn't about having perfect metrics - it's about early warning systems that give you time to fix problems before they become outages.

Set up the basics (connection usage, buffer hit ratio, vacuum lag) and expand monitoring based on actual production issues you encounter. Don't over-engineer monitoring for theoretical problems you might never face.
