DuckDB Performance Optimization: Technical Reference

Critical Configuration Settings

Memory Limit (Primary Performance Driver)

SET memory_limit = '90%';
  • Default Problem: DuckDB conservatively uses only 80% of available RAM
  • Impact: Causes premature spilling to disk, dramatically slowing queries
  • Critical Warning: Never exceed 95% - causes system crashes when OS needs memory
  • Real Failure: Setting it to 98% locked up the entire server and required a one-hour debugging session
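
A minimal sketch of applying and verifying the setting from Python with the duckdb package (the absolute '24GB' alternative in the comment is illustrative):

import duckdb

con = duckdb.connect()                      # in-memory database; pass a path for a file-backed one
con.execute("SET memory_limit = '90%'")     # or an absolute value such as '24GB'
print(con.execute("SELECT current_setting('memory_limit')").fetchone()[0])
con.close()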

Thread Count (CPU Optimization)

SET threads = <physical_cores_only>;
  • Default Problem: DuckDB uses all logical cores including hyperthreading
  • Impact: Hyperthreading usually degrades DuckDB performance
  • Exception: For S3/HTTP queries, use 2x physical cores due to network I/O wait time
  • Implementation: Check actual physical cores, not logical cores reported by OS
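
The core count reported by the OS usually includes hyperthreads, so a sketch like the following, assuming the third-party psutil package is installed, pins the thread count to physical cores:

import duckdb
import psutil  # assumed installed (pip install psutil); used only to count physical cores

physical_cores = psutil.cpu_count(logical=False) or 1
con = duckdb.connect()
con.execute(f"SET threads = {physical_cores}")   # physical cores, not hyperthreads
# For S3/HTTP-heavy workloads, roughly 2x this value tends to work better (see below)
con.close()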

Temp Directory (Disk Spill Performance)

SET temp_directory = '/fast-storage/duckdb-temp';
  • Critical Impact: Spinning drives make spills extremely slow
  • Performance Hierarchy: NVMe > SSD > Spinning disk
  • Real Failure: A network-mounted temp directory caused slowdowns that took a 4-hour debugging session to track down
  • Warning: Cannot disable spilling completely
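
A small sketch that points spills at fast local storage before running anything heavy; the path is illustrative and should live on local NVMe, never a network mount:

import duckdb
from pathlib import Path

spill_dir = Path("/fast-storage/duckdb-temp")   # illustrative path on local NVMe
spill_dir.mkdir(parents=True, exist_ok=True)

con = duckdb.connect()
con.execute(f"SET temp_directory = '{spill_dir}'")
con.close()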

Performance Monitoring Commands

Memory Analysis

FROM duckdb_memory();          -- Shows memory breakdown
FROM duckdb_temporary_files(); -- Lists active temp files
  • Action Trigger: If temp files appear, either increase memory_limit or optimize the query
  • Memory Threshold: Consistently running above 90% memory usage indicates the workload needs more RAM
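
A rough monitoring sketch built on these two functions; the idea is simply to surface spilling before it turns into a mystery slowdown:

import duckdb

con = duckdb.connect()
# Per-component memory breakdown
for row in con.execute("FROM duckdb_memory()").fetchall():
    print(row)
# Any rows here mean active spilling: raise memory_limit or rework the query
temp_files = con.execute("FROM duckdb_temporary_files()").fetchall()
if temp_files:
    print(f"WARNING: {len(temp_files)} temp file(s) in use - queries are spilling to disk")
con.close()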

Advanced Configuration

Order Preservation Override

SET preserve_insertion_order = false;
  • Use Case: ETL jobs where row order is irrelevant
  • Benefit: Reduces memory usage on large imports
  • Trade-off: Loses data ordering for memory savings
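
A sketch of a bulk load with ordering disabled; the database file, table name, and CSV glob are hypothetical:

import duckdb

con = duckdb.connect("warehouse.duckdb")                 # hypothetical database file
con.execute("SET preserve_insertion_order = false")      # row order does not matter for this load
con.execute("""
    CREATE OR REPLACE TABLE events AS
    SELECT * FROM read_csv_auto('raw_events_*.csv')      -- hypothetical input files
""")
con.close()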

S3/Remote File Optimization

SET enable_external_file_cache = true;  -- DuckDB 1.3+ only
SET parquet_metadata_cache = true;
SET threads = 32;  -- 2x CPU cores for network I/O
  • Version Dependency: enable_external_file_cache is broken in some DuckDB 1.3.0 builds
  • Network I/O Rule: Use significantly more threads than CPU cores for remote data
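
Putting the remote-read settings together, roughly. This assumes the httpfs extension and that S3 credentials are already configured (credential setup is omitted); the bucket path is hypothetical:

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("SET enable_external_file_cache = true")   # DuckDB 1.3+ only
con.execute("SET parquet_metadata_cache = true")
con.execute("SET threads = 32")                        # ~2x physical cores for network-bound reads
row_count = con.execute(
    "SELECT count(*) FROM read_parquet('s3://my-bucket/data/*.parquet')"  # hypothetical bucket
).fetchone()[0]
con.close()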

File Format Performance Impact

Format          Performance   Memory Usage        Network Efficiency
DuckDB native   Fastest       Most compressed     Best
Parquet         Fast          Good compression    Good
CSV             Slow          High memory usage   Poor
JSON            Slowest       Highest usage       Worst

Real Performance Data

  • CSV to Parquet conversion: 45-minute query reduced to 8 minutes (5.6x improvement)
  • File-based vs in-memory: File-based uses 40% less memory due to compression
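
The conversion behind those numbers is a one-off COPY; a sketch with hypothetical file names:

import duckdb

con = duckdb.connect()
# Convert once, then point all downstream queries at the Parquet file
con.execute("""
    COPY (SELECT * FROM read_csv_auto('events.csv'))   -- hypothetical source file
    TO 'events.parquet' (FORMAT PARQUET)
""")
con.close()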

Common Failure Scenarios

Memory Exhaustion Patterns

  • String Aggregations: Functions like string_agg() don't spill efficiently
  • Large GROUP BY: May require query decomposition with LIMIT/OFFSET
  • Solution Hierarchy: More memory > query optimization > chunking
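
One alternative to LIMIT/OFFSET paging is to split the key space into hash buckets and aggregate one bucket at a time, keeping each pass's hash table small; a sketch (the orders table and its columns are hypothetical):

import duckdb

con = duckdb.connect("warehouse.duckdb")                 # hypothetical database file
con.execute("CREATE OR REPLACE TABLE customer_totals (customer_id BIGINT, total DOUBLE)")
buckets = 8                                              # each pass aggregates 1/8th of the keys
for b in range(buckets):
    con.execute(f"""
        INSERT INTO customer_totals
        SELECT customer_id, sum(amount)
        FROM orders                                      -- hypothetical source table
        WHERE hash(customer_id) % {buckets} = {b}
        GROUP BY customer_id
    """)
con.close()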

Connection Performance Anti-Pattern

import duckdb

# WRONG: Creates connection overhead on every iteration
for query in queries:
    conn = duckdb.connect("db.duckdb")
    result = conn.execute(query)
    conn.close()

# CORRECT: Reuse a single connection
conn = duckdb.connect("db.duckdb")
for query in queries:
    result = conn.execute(query)
conn.close()
  • Real Impact: ETL job time reduced from 2 hours to 15 minutes (8x improvement)

Query Optimization Intelligence

Partition Elimination (Critical for S3)

-- EFFICIENT: Skips entire files
SELECT * FROM 's3://bucket/year=2024/month=09/*.parquet'
WHERE year = 2024 AND month = 09;

-- INEFFICIENT: Scans all files
SELECT * FROM 's3://bucket/*/*.parquet'
WHERE some_column = 'value';
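
When the partition values are not baked into the path, read_parquet's hive_partitioning flag exposes the directory keys as columns so the filter can still prune whole partitions; a sketch against a hypothetical bucket layout:

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
# year/month come from the directory names, so the WHERE clause skips entire partitions
row_count = con.execute("""
    SELECT count(*)
    FROM read_parquet('s3://bucket/*/*/*.parquet', hive_partitioning = true)
    WHERE year = 2024 AND month = 9
""").fetchone()[0]
con.close()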

Window Function Performance (DuckDB 1.1+)

-- OPTIMIZED: Streams efficiently
SELECT customer_id,
       SUM(amount) OVER (
           PARTITION BY customer_id
           ORDER BY order_date
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       )
FROM orders;
  • Performance Rule: Use ROWS BETWEEN instead of RANGE BETWEEN when possible

Query Analysis

EXPLAIN ANALYZE SELECT ...;

Critical Indicators:

  • Cardinality Mismatch: Estimates vs actual rows significantly different
  • Nested Loop Joins: Usually indicates poor join conditions
  • No Filter Pushdown: Filters applied late in execution plan
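
A quick way to eyeball these indicators from a script is to capture the rendered plan text; a self-contained sketch against a throwaway table:

import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE t AS SELECT range AS x FROM range(1000000)")
# EXPLAIN ANALYZE returns rows of (key, rendered plan text)
for _, plan_text in con.execute("EXPLAIN ANALYZE SELECT x, count(*) FROM t GROUP BY x").fetchall():
    print(plan_text)
con.close()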

Resource Requirements

Memory Scaling

  • Minimum Effective: 90% of available system RAM for memory_limit
  • Risk Threshold: 95% maximum to prevent system instability
  • Monitoring: Temp file creation indicates insufficient memory allocation

Storage Requirements

  • Temp Space: NVMe strongly recommended for spill operations
  • Network I/O: Parquet over CSV provides 5-6x performance improvement
  • Compression: File-based databases use 40% less memory than in-memory

Breaking Points and Failure Modes

Hard Limits

  • Memory Limit >95%: System crashes requiring manual recovery
  • Hyperthreading: Usually degrades performance despite increased logical cores
  • CSV over Network: Extremely poor performance, avoid when possible

Version-Specific Issues

  • DuckDB 1.3.0: External file cache may cause S3 errors in some builds
  • CTE Optimization: Automatic CTE caching in DuckDB 1.1+ provides a significant performance improvement

Implementation Decision Tree

  1. Memory Issues → Increase memory_limit to 90%
  2. Still Slow → Optimize thread count (physical cores only)
  3. Spilling to Disk → Move temp_directory to fastest storage
  4. S3 Performance → Increase threads, use Parquet, enable caching
  5. Complex Queries → Use EXPLAIN ANALYZE, optimize joins and filters

Success Rate: These three primary settings (memory, threads, temp directory) resolve the majority of DuckDB performance issues without advanced tuning.
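
As a starting point, all three can be applied together at connection time; a sketch (database path illustrative, psutil assumed for the physical core count, temp directory assumed to exist on fast local storage):

import duckdb
import psutil  # assumed installed; only used to count physical cores

con = duckdb.connect("analytics.duckdb")                              # illustrative path
con.execute("SET memory_limit = '90%'")                               # 1. memory
con.execute(f"SET threads = {psutil.cpu_count(logical=False) or 1}")  # 2. physical cores only
con.execute("SET temp_directory = '/fast-storage/duckdb-temp'")       # 3. fast local spill location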

Useful Links for Further Investigation

DuckDB Resources That Don't Suck

  • Official DuckDB Documentation: The official DuckDB documentation provides in-depth guides and reference material, including performance overviews, to help users understand and optimize their DuckDB usage.
  • DuckDB Discord Server: Join the official DuckDB Discord server to connect with the community, ask questions, get support, and discuss various aspects of DuckDB with other users and developers.
  • DuckDB Release Notes: Explore the DuckDB release notes, which provide valuable insights into new features, improvements, and bug fixes for each version, helping users stay updated on the latest developments.
