My query was fast yesterday, now it takes 30 seconds. WTF?

The planner's statistics are probably stale, or PostgreSQL picked a bad execution plan. First, check when the table was last vacuumed and analyzed:

```sql
SELECT schemaname, relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE schemaname = 'public' AND relname = 'your_slow_table';
```

If both `last_analyze` and `last_autoanalyze` are more than a day old on a frequently updated table, update the statistics manually:

```sql
ANALYZE your_slow_table;
```

This usually fixes 80% of "why did my query suddenly get slow" problems. PostgreSQL's planner makes decisions based on table statistics, and stale stats lead to terrible query plans.

PostgreSQL is using 100% CPU but queries are still slow

You're probably doing sequential scans on large tables because indexes are missing or not being used. Check what's actually running:

```sql
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND query NOT LIKE '%pg_stat_activity%'
ORDER BY duration DESC;
```

Run `EXPLAIN (ANALYZE, BUFFERS)` on the slow queries. Look for:

- "Seq Scan" on tables with >10k rows
- "Nested Loop" with large tables
- High "Buffers: shared read" numbers

**Quick fix:** Create the missing indexes.

**Better fix:** Review your query patterns and indexing strategy.
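
A minimal sketch of that workflow, assuming a hypothetical `orders_demo` table filtered by `customer_id` (both names are illustrative, not from any real schema). It checks the plan, builds the index without blocking writes, then refreshes statistics:

```sql
-- Hypothetical table and data, for illustration only.
CREATE TABLE IF NOT EXISTS orders_demo (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id integer NOT NULL,
    total       numeric(10, 2) NOT NULL
);
INSERT INTO orders_demo (customer_id, total)
SELECT (g % 1000) + 1, 10.00
FROM generate_series(1, 100000) AS g;

-- EXPLAIN ANALYZE really executes the statement, so be careful with
-- INSERT/UPDATE/DELETE (wrap those in a transaction and roll back).
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders_demo WHERE customer_id = 42;

-- CONCURRENTLY avoids blocking writes during the build,
-- but it cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_demo_customer_id
    ON orders_demo (customer_id);

-- Refresh statistics, then re-run the EXPLAIN above to compare plans.
ANALYZE orders_demo;
```

If the second plan still shows a sequential scan, the filter is probably not selective enough for the index to pay off.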

Connection pooling with PgBouncer seems slower than direct connections

You're probably using session pooling instead of transaction pooling. Session pooling preserves session state, but each client holds a server connection for its entire session, so connections aren't reused efficiently.

**PgBouncer config that actually works:**

```ini
[databases]
your_db = host=localhost port=5432 dbname=production

[pgbouncer]
pool_mode = transaction
max_client_conn = 300
default_pool_size = 25
reserve_pool_size = 5
```

Transaction pooling breaks applications that rely on session state (temp tables, prepared statements, SET variables). If your app needs these features, fix the application code instead of falling back to session pooling.

VACUUM is running constantly and my database is still bloated

Your `autovacuum_vacuum_scale_factor` is too high for busy tables. By default, autovacuum waits until roughly 20% of a table's rows are dead tuples, which is far too late for large, frequently updated tables. Check the dead-tuple ratio per table:

```sql
SELECT schemaname, relname, n_dead_tup, n_live_tup,
       round((n_dead_tup::numeric / (n_dead_tup + n_live_tup)) * 100, 2) AS dead_ratio
FROM pg_stat_user_tables
WHERE n_dead_tup > 0
ORDER BY dead_ratio DESC;
```

For tables with >1 million rows and frequent updates, tune autovacuum per-table:

```sql
ALTER TABLE busy_table SET (
    autovacuum_vacuum_scale_factor = 0.05,   -- 5% instead of 20%
    autovacuum_analyze_scale_factor = 0.02   -- 2% instead of 10%
);
```

My database ran out of disk space because of WAL files

The usual culprit is a stuck replication slot preventing WAL cleanup. Separately, a `max_wal_size` that is too small for your write workload causes frequent checkpoints. Check how much WAL each slot is retaining:

```sql
SELECT slot_name, active, restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE restart_lsn IS NOT NULL;
```

If `retained_wal` is multiple GBs, especially on a slot with `active = false`, you have a stuck replication slot. Drop it or fix the subscriber:

```sql
SELECT pg_drop_replication_slot('stuck_slot_name');
```

For high write workloads, increase `max_wal_size` to space out checkpoints. It is a soft limit, and raising it does not free disk space by itself:

```ini
# In postgresql.conf
max_wal_size = 8GB          # or higher for very busy systems
checkpoint_timeout = 15min
```

Random queries timeout with "canceling statement due to statement timeout"

Your `statement_timeout` is too aggressive, or you have queries that occasionally need more time due to lock waits or plan changes. Check for lock contention:

```sql
-- pg_blocking_pids() (PostgreSQL 9.6+) resolves the blocking backends directly,
-- which avoids false matches from a hand-written pg_locks self-join.
SELECT blocked_activity.pid    AS blocked_pid,
       blocking_activity.pid   AS blocking_pid,
       blocked_activity.query  AS blocked_statement,
       blocking_activity.query AS blocking_statement
FROM pg_catalog.pg_stat_activity blocked_activity
JOIN pg_catalog.pg_stat_activity blocking_activity
  ON blocking_activity.pid = ANY (pg_blocking_pids(blocked_activity.pid));
```

If you see lock waits, either optimize the blocking queries or increase `statement_timeout` for specific operations that legitimately need more time.
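
One way to raise the timeout for a single operation without loosening it everywhere is `SET LOCAL` inside a transaction. A minimal sketch; the `30s` and `5min` values are assumptions, not recommendations:

```sql
-- Strict default for the session.
SET statement_timeout = '30s';

BEGIN;
-- SET LOCAL only lasts until the end of this transaction.
SET LOCAL statement_timeout = '5min';
SELECT current_setting('statement_timeout') AS inside_txn;
-- ... run the legitimately slow statement here ...
COMMIT;

-- Back to the session default once the transaction ends.
SELECT current_setting('statement_timeout') AS after_txn;

-- Per-role alternative (hypothetical role name, left commented out):
-- ALTER ROLE reporting_user SET statement_timeout = '10min';
```

The per-role form applies to new sessions of that role, which suits a dedicated reporting user.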

PostgreSQL Performance Optimization - AI Knowledge Base

Memory Configuration

Shared Buffers Configuration

Standard recommendation: 25% of RAM (the starting value suggested in the official documentation)
Production reality: Larger values compete with the OS page cache on modern systems
Actual working configuration:
- 8-16GB RAM: 25% acceptable (2-4GB)
- 32GB+ RAM: 15-20% maximum (roughly 5-6GB on a 32GB system)
- High-write workloads: 10-15%
Critical failure point: Above 8GB shared_buffers unless the system has >64GB RAM
Real-world impact: Dropping from 16GB to 8GB shared_buffers improved performance 40% on a 64GB system

Work Memory Settings

Per-operation multiplier: work_mem is a limit per sort or hash operation, not per connection, so one connection can use work_mem × the number of such operations in its query
Danger zone: Above 64MB without understanding query patterns
Production settings:
- OLTP workloads: 2-4MB
- Analytics/reporting: 16-64MB
- Mixed workloads: 8-16MB
Memory calculation: Per-connection memory ≈ 4MB base + (work_mem × avg_operations) + temp_buffers
Failure scenario: 256MB work_mem with concurrent reporting queries = OOM crashes

Connection Memory Overhead

Base cost per connection: 2-4MB
Additional costs: work_mem multipliers, temp_buffers (8MB default)
Breaking point: Above 100 connections without pooling
Real calculation example:
- 100 connections × 40MB each = 4GB before storing data
- 500 connections = 20GB overhead
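
The per-connection formula above can be evaluated against a running server. This is a rough worst-case sketch, not an exact accounting of backend memory; the 4MB base comes from the text, while the average of 3 sort/hash operations per query is an assumed figure to adjust for your workload:

```sql
-- Rough worst-case memory estimate from the current settings.
WITH settings AS (
    SELECT pg_size_bytes(current_setting('work_mem'))     AS work_mem_bytes,
           pg_size_bytes(current_setting('temp_buffers')) AS temp_buffers_bytes,
           current_setting('max_connections')::int        AS max_conn,
           4 * 1024 * 1024                                AS base_bytes,      -- 4MB base (from the text)
           3                                              AS avg_operations   -- assumption
)
SELECT pg_size_pretty(base_bytes + work_mem_bytes * avg_operations + temp_buffers_bytes)
           AS per_connection,
       pg_size_pretty((base_bytes + work_mem_bytes * avg_operations + temp_buffers_bytes) * max_conn)
           AS all_connections
FROM settings;
```

If `all_connections` approaches physical RAM minus shared_buffers, lower work_mem or put a pooler in front before raising max_connections.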

Maintenance Work Memory

Default inadequacy: 64MB too small for production
Production setting: 5-10% of RAM or 1-2GB maximum
Performance impact: 3x faster index creation with 2GB vs 64MB
Separate autovacuum setting: autovacuum_work_mem = 1GB
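
maintenance_work_mem can also be raised for a single session just before a large index build, which leaves the server-wide setting alone. A minimal sketch with a hypothetical `mwm_demo` table; the 2GB value mirrors the figure above:

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE IF NOT EXISTS mwm_demo (
    id         bigint PRIMARY KEY,
    created_at timestamptz NOT NULL
);

SET maintenance_work_mem = '2GB';   -- session-only; postgresql.conf is untouched
CREATE INDEX IF NOT EXISTS idx_mwm_demo_created_at ON mwm_demo (created_at);
RESET maintenance_work_mem;         -- back to the configured default
```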

Query Performance and Indexing

Index Type Selection Matrix

| Index Type | Use Case | Size Overhead | Write Impact | Query Types |
|---|---|---|---|---|
| B-tree | General purpose (80% of cases) | 1x | Low | Equality, ranges, ORDER BY |
| GIN | JSONB, arrays, full-text | 3-5x | High | Containment (@>), array ops |
| GiST | Geometric, nearest-neighbor | 2-3x | Medium | PostGIS, similarity |
| BRIN | Time-series (ordered data) | 0.001x | Very low | Range queries on ordered data |

Critical Query Patterns That Cause Failures

N+1 Query Problem:

Symptom: One query per user/record in loops
Impact: Linear performance degradation with data growth
Solution: One set-based query (a JOIN or window functions) instead of one query per row
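
A minimal sketch of the rewrite, assuming hypothetical `app_users` and `app_logins` tables where the application wants each user's most recent login. The per-user query is shown as a comment; the window-function query returns the same data in one round trip:

```sql
-- Hypothetical schema, for illustration only.
CREATE TABLE IF NOT EXISTS app_users (
    id   integer PRIMARY KEY,
    name text NOT NULL
);
CREATE TABLE IF NOT EXISTS app_logins (
    user_id      integer NOT NULL REFERENCES app_users (id),
    logged_in_at timestamptz NOT NULL
);

-- N+1 pattern: the application loop runs this once per user.
--   SELECT max(logged_in_at) FROM app_logins WHERE user_id = $1;

-- Single query: rank each user's logins and keep only the newest one.
SELECT u.id, u.name, l.logged_in_at AS last_login
FROM app_users AS u
LEFT JOIN (
    SELECT user_id,
           logged_in_at,
           row_number() OVER (PARTITION BY user_id ORDER BY logged_in_at DESC) AS rn
    FROM app_logins
) AS l ON l.user_id = u.id AND l.rn = 1
ORDER BY u.id;
```

The LEFT JOIN keeps users who have never logged in, with `last_login` as NULL.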

Large OFFSET Pagination:

Failure point: OFFSET >10,000 becomes unusable
Impact: Scans and discards all offset rows
Solution: Cursor-based pagination with WHERE conditions
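
A minimal sketch of cursor-based (keyset) pagination, assuming a hypothetical `feed_events` table sorted newest first. The literal timestamp and id stand in for the last row of the previous page:

```sql
-- Hypothetical table; created_at is not unique, so id acts as a tie-breaker.
CREATE TABLE IF NOT EXISTS feed_events (
    id         bigint PRIMARY KEY,
    created_at timestamptz NOT NULL,
    payload    text
);
CREATE INDEX IF NOT EXISTS idx_feed_events_created_id
    ON feed_events (created_at DESC, id DESC);

-- OFFSET form: reads and discards 100000 rows before returning 20.
SELECT id, created_at, payload
FROM feed_events
ORDER BY created_at DESC, id DESC
LIMIT 20 OFFSET 100000;

-- Keyset form: start right after the last row the client already has.
SELECT id, created_at, payload
FROM feed_events
WHERE (created_at, id) < (timestamptz '2024-01-15 12:00:00+00', 123456)
ORDER BY created_at DESC, id DESC
LIMIT 20;
```

The trade-off is that clients can only move to the next or previous page; jumping straight to page 500 still needs OFFSET or a different design.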

OR Conditions Breaking Indexes:

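Problem: WHERE col1 = ? OR col2 = ? often cannot use indexes efficiently (the planner has to combine two index scans or fall back to a sequential scan)
Solution: UNION ALL with separate indexed queries, excluding rows that match both conditions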
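
A minimal sketch of the rewrite on a hypothetical `contacts` table. A plain UNION ALL would return a row twice when it matches both conditions, so the second branch filters those out:

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE IF NOT EXISTS contacts (
    id    bigint PRIMARY KEY,
    email text,
    phone text
);
CREATE INDEX IF NOT EXISTS idx_contacts_email ON contacts (email);
CREATE INDEX IF NOT EXISTS idx_contacts_phone ON contacts (phone);

-- Original form.
SELECT id, email, phone
FROM contacts
WHERE email = 'a@example.com' OR phone = '555-0100';

-- Rewrite: each branch can use its own index. IS DISTINCT FROM also keeps
-- rows whose email is NULL, which a plain <> comparison would drop.
SELECT id, email, phone
FROM contacts
WHERE email = 'a@example.com'
UNION ALL
SELECT id, email, phone
FROM contacts
WHERE phone = '555-0100'
  AND email IS DISTINCT FROM 'a@example.com';
```

Compare both forms with `EXPLAIN (ANALYZE, BUFFERS)` before committing to the rewrite; on a single table the planner's bitmap OR is sometimes already good enough.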

Index Design Principles

Multi-column order: Columns filtered by equality first (most selective leading), range-filtered columns last
Partial indexes: 5-10x smaller for filtered queries
Index prefixes: PostgreSQL can use left prefixes of multi-column indexes
Over-indexing threshold: >6 indexes per table kills write performance
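
A minimal sketch of the multi-column and partial index principles, assuming a hypothetical `job_queue` table where most rows are already finished:

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE IF NOT EXISTS job_queue (
    id         bigint PRIMARY KEY,
    tenant_id  integer NOT NULL,
    status     text NOT NULL,
    created_at timestamptz NOT NULL
);

-- Multi-column index: serves filters on tenant_id alone or on
-- (tenant_id, created_at), but is a poor fit for created_at alone.
CREATE INDEX IF NOT EXISTS idx_job_queue_tenant_created
    ON job_queue (tenant_id, created_at);

-- Partial index: only rows matching the predicate are stored,
-- so it stays small when few rows are pending.
CREATE INDEX IF NOT EXISTS idx_job_queue_pending
    ON job_queue (created_at)
    WHERE status = 'pending';

-- The query has to repeat the predicate for the planner to consider
-- the partial index.
SELECT id
FROM job_queue
WHERE status = 'pending'
ORDER BY created_at
LIMIT 10;
```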

Performance Monitoring and Alerting

Critical Metrics and Thresholds

| Metric | Warning | Critical | Impact if Exceeded |
|---|---|---|---|
| Buffer Hit Ratio | <97% | <95% | 5x slower queries from disk I/O |
| Connection Usage | >80% | >90% | Connection exhaustion errors |
| Dead Tuple Ratio | >15% | >30% | Exponential query slowdown |
| Query Duration | >500ms avg | >1000ms avg | Application timeouts |
| Disk Space | <20% free | <10% free | Database shutdown |
| Lock Wait Time | >30s | >60s | Application deadlocks |

Buffer Hit Ratio Monitoring

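```sql
-- Target: >97% consistently; below 95% is critical
-- nullif() avoids a division-by-zero error on a freshly reset or idle database
SELECT round((sum(heap_blks_hit) / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0)) * 100, 2) AS buffer_hit_ratio
FROM pg_statio_user_tables;
```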

Performance impact: Drop from 99% to 92% = 5x slower queries
Root causes: Insufficient shared_buffers, missing indexes, inadequate RAM

Autovacuum Failure Detection

```sql
-- Alert on >15% dead tuples
SELECT schemaname, relname,
       round((n_dead_tup::numeric / (n_dead_tup + n_live_tup)) * 100, 1) AS dead_percentage
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000;
```

Failure progression: 100ms queries become 10s queries due to bloat
Emergency fix: Manual VACUUM (does not block reads or writes); VACUUM FULL takes an exclusive lock, so plan a maintenance window for it
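
A minimal sketch of the manual run, using a hypothetical `vacuum_demo` table that is updated once to produce dead tuples:

```sql
-- Hypothetical table used only to generate some dead tuples.
CREATE TABLE IF NOT EXISTS vacuum_demo (
    id  integer PRIMARY KEY,
    val integer NOT NULL
);
INSERT INTO vacuum_demo
SELECT g, 0
FROM generate_series(1, 10000) AS g
ON CONFLICT (id) DO NOTHING;
UPDATE vacuum_demo SET val = val + 1;   -- every updated row leaves a dead tuple behind

-- Plain VACUUM takes only a SHARE UPDATE EXCLUSIVE lock, so normal reads and
-- writes continue. It cannot run inside a transaction block.
VACUUM (VERBOSE, ANALYZE) vacuum_demo;

-- Check the result; the statistics views update with a short delay.
SELECT relname, n_live_tup, n_dead_tup, last_vacuum
FROM pg_stat_user_tables
WHERE relname = 'vacuum_demo';
```

Plain VACUUM makes the space reusable inside the table but rarely shrinks the file on disk; that is the job of VACUUM FULL and the reason it needs the exclusive lock.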

Production Configuration Templates

32GB RAM System (Mixed Workload)

```ini
shared_buffers = 6GB
work_mem = 8MB
maintenance_work_mem = 2GB
effective_cache_size = 24GB
max_connections = 200  # Requires connection pooling
wal_buffers = 32MB
checkpoint_completion_target = 0.9
max_wal_size = 4GB
```

Connection Pooling (PgBouncer)

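```ini
[pgbouncer]
; Use transaction pooling, not session pooling.
; Comments must be on their own line in pgbouncer.ini.
pool_mode = transaction
max_client_conn = 300
default_pool_size = 25
reserve_pool_size = 5
```

Critical: Use transaction pooling, not session pooling
Breaking point: Session pooling pins one server connection per client session, so connections are not reused between transactions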

Scaling Decision Matrix

| Current State | Next Action | Resource Cost | Complexity | Expected Gain |
|---|---|---|---|---|
| <64GB RAM, CPU bound | Vertical scaling | Low (if budget) | Low | 2-3x performance |
| >70% reads | Read replicas | Medium | Medium | Offload 70% of queries |
| Single table >10TB | Partitioning | High | High | 10x query performance |
| >25k writes/sec | Sharding | Very High | Very High | Horizontal scale (3x engineering cost) |
| Storage I/O bound | SSD upgrade | Low | Low | 2x performance boost |

Failure Scenarios and Recovery

Connection Exhaustion

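Symptom: "FATAL: sorry, too many clients already" or "FATAL: remaining connection slots are reserved for non-replication superuser connections" (exact wording varies by version)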
Immediate fix: Kill idle connections, implement pooling
Prevention: Alert at 80% connection usage
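
A minimal sketch of both steps: measuring connection usage against the 80% alert threshold, then terminating idle sessions. The 10-minute idle cutoff is an assumption; pick a value your application tolerates:

```sql
-- Connection usage as a percentage of max_connections.
SELECT count(*) AS connections,
       current_setting('max_connections')::int AS max_connections,
       round(100.0 * count(*) / current_setting('max_connections')::int, 1) AS pct_used
FROM pg_stat_activity
WHERE backend_type = 'client backend';

-- Terminate sessions that have been idle for more than 10 minutes,
-- skipping the session that runs this statement.
SELECT pid, usename, state_change, pg_terminate_backend(pid) AS terminated
FROM pg_stat_activity
WHERE backend_type = 'client backend'
  AND state = 'idle'
  AND state_change < now() - interval '10 minutes'
  AND pid <> pg_backend_pid();
```

Terminating idle sessions only buys time. Without a pooler, the application will open the connections again.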

Memory Exhaustion (OOM)

Common cause: work_mem too high × concurrent operations
Calculation: connections × operations × work_mem = total memory
Recovery: Reduce work_mem, kill heavy queries, add RAM

Disk Space Exhaustion

Critical threshold: <10% free space
WAL accumulation: Stuck replication slots prevent cleanup
Emergency cleanup: Drop stuck replication slots to release WAL; raising max_wal_size afterwards spaces out checkpoints but does not free space

Query Performance Degradation

Primary cause: Stale statistics (80% of cases)
Quick fix: ANALYZE table_name
Prevention: Monitor last_analyze timestamps

Resource Investment Analysis

| Optimization Type | Time Investment | Skill Level | Performance Impact | Maintenance Overhead |
|---|---|---|---|---|
| Memory tuning | 2-4 hours | Intermediate | High (40% gains) | Low |
| Index optimization | 4-8 hours | Intermediate | Very High (5-100x) | Medium |
| Connection pooling | 1-2 hours | Beginner | High (eliminates OOM) | Low |
| Query rewriting | 8-16 hours | Advanced | Very High (45s→1.2s) | High |
| Monitoring setup | 4-6 hours | Intermediate | Preventive | Low |

Common Misconceptions

"More shared_buffers is always better" - False: Competes with OS cache on modern systems
"Default settings work for production" - False: Defaults optimized for 2005 hardware
"Connection pooling is optional" - False: Mandatory above 100 connections
"Hardware fixes query problems" - False: Bad queries don't improve with better hardware
"VACUUM FULL fixes performance" - False: Takes exclusive locks, often makes problems worse

Emergency Response Procedures

High CPU Usage

Identify running queries: pg_stat_activity
Check for missing indexes: EXPLAIN ANALYZE
Kill runaway queries: pg_cancel_backend(pid)
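
A minimal sketch of the last step, cancelling client statements that have run longer than a cutoff. The 5-minute threshold is an assumption:

```sql
-- pg_cancel_backend() stops the current statement but keeps the connection open.
-- pg_terminate_backend() closes the whole session and is the heavier fallback.
SELECT pid,
       now() - query_start AS runtime,
       left(query, 80)     AS query_preview,
       pg_cancel_backend(pid) AS cancel_sent
FROM pg_stat_activity
WHERE backend_type = 'client backend'
  AND state = 'active'
  AND query_start < now() - interval '5 minutes'
  AND pid <> pg_backend_pid();
```

Run it first without the `pg_cancel_backend(pid)` column to review what would be cancelled.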

Memory Pressure

Check connection count vs limits
Reduce work_mem immediately
Implement connection pooling

Storage Issues

Monitor disk space: <10% = critical
Check WAL directory size
Drop stuck replication slots if necessary
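
The WAL checks can be run from SQL. A minimal sketch, assuming PostgreSQL 10 or later and a role with superuser or `pg_monitor` rights; free disk space itself still has to come from the operating system:

```sql
-- Total size of the WAL directory.
SELECT count(*) AS wal_files, pg_size_pretty(sum(size)) AS wal_size
FROM pg_ls_waldir();

-- Inactive slots that still pin WAL: the candidates for dropping.
SELECT slot_name, slot_type, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE NOT active
  AND restart_lsn IS NOT NULL;
```

Dropping a slot means its consumer has to be re-synced from scratch, so confirm the subscriber is really gone first.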

This knowledge base prioritizes actionable intelligence and real-world failure scenarios over theoretical optimization.
