DuckDB Performance Optimization: Technical Reference
Critical Configuration Settings
Memory Limit (Primary Performance Driver)
SET memory_limit = '90%';
- Default Problem: DuckDB conservatively uses only 80% of available RAM
- Impact: Causes premature spilling to disk, dramatically slowing queries
- Critical Warning: Never exceed 95% - causes system crashes when OS needs memory
- Real Failure: Setting it to 98% locked up an entire server and cost a 1-hour debugging session
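Worth verifying the setting actually stuck; current_setting() reads back the live value:
SELECT current_setting('memory_limit');  -- returns the resolved limit (an absolute size, not the '90%' string)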
Thread Count (CPU Optimization)
SET threads = <physical_cores_only>;
- Default Problem: DuckDB defaults to all logical cores, including hyperthreaded ones
- Impact: Hyperthreading usually degrades DuckDB performance
- Exception: For S3/HTTP queries, use 2x physical cores due to network I/O wait time
- Implementation: Check actual physical cores, not logical cores reported by OS
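DuckDB won't distinguish physical from logical cores for you, so set the value explicitly. A minimal sketch, assuming an 8-core/16-thread machine:
SET threads = 8;  -- physical cores only; ignore the 16 logical cores the OS reports
SELECT current_setting('threads');  -- verify the new value took effect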
Temp Directory (Disk Spill Performance)
SET temp_directory = '/fast-storage/duckdb-temp';
- Critical Impact: Spinning drives make spills extremely slow
- Performance Hierarchy: NVMe > SSD > Spinning disk
- Real Failure: A network-mounted temp directory caused a 4-hour debugging session chasing mysteriously slow queries
- Warning: Cannot disable spilling completely
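A sensible pairing is a fast temp path plus a cap on how much it can grow; in recent DuckDB versions max_temp_directory_size is the knob for that. The path below is illustrative:
SET temp_directory = '/nvme/duckdb-temp';  -- hypothetical NVMe mount
SET max_temp_directory_size = '100GB';     -- cap spill growth so it can't fill the disk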
Performance Monitoring Commands
Memory Analysis
FROM duckdb_memory(); -- Shows memory breakdown
FROM duckdb_temporary_files(); -- Lists active temp files
- Action Trigger: If temp files appear, either increase memory or optimize query
- Memory Threshold: Consistently >90% memory usage indicates need for more RAM
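A quick spill check you can run mid-workload; any nonzero count means the query didn't fit in memory:
SELECT count(*) AS spill_files
FROM duckdb_temporary_files();  -- 0 rows = everything stayed in RAM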
Advanced Configuration
Order Preservation Override
SET preserve_insertion_order = false;
- Use Case: ETL jobs where row order is irrelevant
- Benefit: Reduces memory usage on large imports
- Trade-off: Loses data ordering for memory savings
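Typical use is wrapping a bulk import or export where row order genuinely doesn't matter; the filenames here are hypothetical:
SET preserve_insertion_order = false;
COPY (SELECT * FROM read_csv('raw_events.csv'))
    TO 'events.parquet' (FORMAT parquet);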
S3/Remote File Optimization
SET enable_external_file_cache = true; -- DuckDB 1.3+ only
SET parquet_metadata_cache = true;
SET threads = 32; -- 2x CPU cores for network I/O
- Version Dependency: enable_external_file_cache is broken in some DuckDB 1.3.0 builds
- Network I/O Rule: Use significantly more threads than CPU cores for remote data
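Putting it together for a remote scan; the bucket and region are placeholders, and the cache setting assumes DuckDB 1.3+:
INSTALL httpfs;
LOAD httpfs;
SET s3_region = 'us-east-1';            -- placeholder region
SET threads = 32;                       -- ~2x physical cores to hide network wait
SET enable_external_file_cache = true;  -- 1.3+ only, see version caveat above
SELECT count(*) FROM read_parquet('s3://my-bucket/data/*.parquet');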
File Format Performance Impact
Format | Performance | Memory Usage | Network Efficiency
---|---|---|---
DuckDB native | Fastest | Most compressed | Best
Parquet | Fast | Good compression | Good
CSV | Slow | High memory usage | Poor
JSON | Slowest | Highest usage | Worst
Real Performance Data
- CSV to Parquet conversion: 45-minute query reduced to 8 minutes (5.6x improvement)
- File-based vs in-memory: File-based uses 40% less memory due to compression
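The conversion itself is a single COPY; a sketch with hypothetical filenames (ZSTD is a solid default codec):
COPY (SELECT * FROM read_csv('big_export.csv'))
    TO 'big_export.parquet' (FORMAT parquet, COMPRESSION zstd);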
Common Failure Scenarios
Memory Exhaustion Patterns
- String Aggregations: Functions like string_agg() don't spill to disk efficiently
- Large GROUP BY: May require query decomposition with LIMIT/OFFSET or key ranges (one chunking sketch follows this list)
- Solution Hierarchy: More memory > query optimization > chunking
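One way to decompose an oversized aggregation: chunk by key range, so no group gets split across two chunks. Table names and ranges are hypothetical:
-- Process one key range at a time instead of one giant aggregation
CREATE TABLE IF NOT EXISTS agg_results (customer_id BIGINT, notes VARCHAR);
INSERT INTO agg_results
SELECT customer_id, string_agg(note, ', ') AS notes
FROM orders
WHERE customer_id BETWEEN 1 AND 1000000
GROUP BY customer_id;
-- repeat with BETWEEN 1000001 AND 2000000, and so on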
Connection Performance Anti-Pattern
# WRONG: opens a new connection for every query
import duckdb

for query in queries:
    conn = duckdb.connect("db.duckdb")
    result = conn.execute(query)
    conn.close()

# CORRECT: reuse one connection for the whole batch
conn = duckdb.connect("db.duckdb")
for query in queries:
    result = conn.execute(query)
conn.close()
- Real Impact: ETL job time reduced from 2 hours to 15 minutes (8x improvement)
Query Optimization Intelligence
Partition Elimination (Critical for S3)
-- EFFICIENT: Skips entire files
SELECT * FROM 's3://bucket/year=2024/month=09/*.parquet'
WHERE year = 2024 AND month = 9;
-- INEFFICIENT: Scans all files
SELECT * FROM 's3://bucket/*/*.parquet'
WHERE some_column = 'value';
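If the files follow Hive-style paths, read_parquet can expose the path segments as filterable columns, so pruning happens before any file is opened. The bucket here is a placeholder:
SELECT *
FROM read_parquet('s3://bucket/*/*/*.parquet', hive_partitioning = true)
WHERE year = 2024 AND month = 9;  -- non-matching files are skipped entirely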
Window Function Performance (DuckDB 1.1+)
-- OPTIMIZED: Streams efficiently
SELECT customer_id,
SUM(amount) OVER (
PARTITION BY customer_id
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)
FROM orders;
- Performance Rule: Use ROWS BETWEEN instead of RANGE BETWEEN when possible (see the contrast below)
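For contrast, the RANGE form of the same frame groups peers by the ORDER BY value instead of row position and typically streams worse:
-- SLOWER: RANGE frames must evaluate peer groups by value
SELECT customer_id,
       SUM(amount) OVER (
           PARTITION BY customer_id
           ORDER BY order_date
           RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       )
FROM orders;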
Query Analysis
EXPLAIN ANALYZE SELECT ...;
Critical Indicators:
- Cardinality Mismatch: Estimates vs actual rows significantly different
- Nested Loop Joins: Usually indicates poor join conditions
- No Filter Pushdown: Filters applied late in execution plan
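A typical run against a join; table names are hypothetical. The printed plan shows per-operator timings plus actual row counts to compare against the estimates:
EXPLAIN ANALYZE
SELECT c.region, count(*)
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.region;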
Resource Requirements
Memory Scaling
- Effective Baseline: Set memory_limit to 90% of available system RAM
- Risk Threshold: 95% maximum to prevent system instability
- Monitoring: Temp file creation indicates insufficient memory allocation
Storage Requirements
- Temp Space: NVMe strongly recommended for spill operations
- Network I/O: Parquet over CSV provides 5-6x performance improvement
- Compression: File-based databases use 40% less memory than in-memory
Breaking Points and Failure Modes
Hard Limits
- Memory Limit >95%: System crashes requiring manual recovery
- Hyperthreading: Usually degrades performance despite increased logical cores
- CSV over Network: Extremely poor performance, avoid when possible
Version-Specific Issues
- DuckDB 1.3.0: External file cache may cause S3 errors in some builds
- CTE Optimization: Automatic caching available in DuckDB 1.1+, significant performance improvement
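If you want the CTE caching behavior explicitly rather than relying on the optimizer, recent DuckDB versions accept the MATERIALIZED keyword; table names here are hypothetical:
WITH daily AS MATERIALIZED (
    SELECT order_date, sum(amount) AS total
    FROM orders
    GROUP BY order_date
)
SELECT * FROM daily WHERE total > 10000
UNION ALL
SELECT * FROM daily WHERE total < 100;  -- daily is computed once, read twice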
Implementation Decision Tree
- Memory Issues → Increase memory_limit to 90%
- Still Slow → Optimize thread count (physical cores only)
- Spilling to Disk → Move temp_directory to fastest storage
- S3 Performance → Increase threads, use Parquet, enable caching
- Complex Queries → Use EXPLAIN ANALYZE, optimize joins and filters
Success Rate: These three primary settings (memory, threads, temp directory) resolve the majority of DuckDB performance issues without advanced tuning.
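As a starting point, the whole checklist condenses to a few SET statements; values are illustrative, tune per machine:
SET memory_limit = '90%';
SET threads = 8;                           -- physical cores on this hypothetical box
SET temp_directory = '/nvme/duckdb-temp';  -- fastest local storage available
SET preserve_insertion_order = false;      -- only if row order doesn't matter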
Useful Links for Further Investigation
DuckDB Resources That Don't Suck
Link | Description
---|---
Official DuckDB Documentation | In-depth guides and reference material, including the performance overview.
DuckDB Discord Server | Community support: ask questions and discuss DuckDB with other users and the developers.
DuckDB Release Notes | New features, improvements, and bug fixes for each release.