DuckDB Performance Optimization: Technical Reference
Critical Configuration Settings
Memory Limit (Primary Performance Driver)
SET memory_limit = '90%';
- Default Problem: DuckDB conservatively uses only 80% of available RAM
- Impact: Causes premature spilling to disk, dramatically slowing queries
- Critical Warning: Never exceed 95% - causes system crashes when OS needs memory
- Real Failure: Setting it to 98% locked up an entire server and cost a 1-hour debugging session
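Worth verifying the setting actually stuck; current_setting() reads back the live value:
SELECT current_setting('memory_limit');  -- returns the resolved limit (an absolute size, not the '90%' string)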
Thread Count (CPU Optimization)
SET threads = <physical_cores_only>;
- Default Problem: DuckDB defaults to all logical cores, including hyperthreaded ones
- Impact: Hyperthreading usually degrades DuckDB performance
- Exception: For S3/HTTP queries, use 2x physical cores due to network I/O wait time
- Implementation: Check actual physical cores, not logical cores reported by OS
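DuckDB won't distinguish physical from logical cores for you, so set the value explicitly. A minimal sketch, assuming an 8-core/16-thread machine:
SET threads = 8;  -- physical cores only; ignore the 16 logical cores the OS reports
SELECT current_setting('threads');  -- verify the new value took effect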
Temp Directory (Disk Spill Performance)
SET temp_directory = '/fast-storage/duckdb-temp';
- Critical Impact: Spinning drives make spills extremely slow
- Performance Hierarchy: NVMe > SSD > Spinning disk
- Real Failure: A network-mounted temp directory caused a 4-hour debugging session chasing mysteriously slow queries
- Warning: Cannot disable spilling completely
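A sensible pairing is a fast temp path plus a cap on how much it can grow; in recent DuckDB versions max_temp_directory_size is the knob for that. The path below is illustrative:
SET temp_directory = '/nvme/duckdb-temp';  -- hypothetical NVMe mount
SET max_temp_directory_size = '100GB';     -- cap spill growth so it can't fill the disk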
Performance Monitoring Commands
Memory Analysis
FROM duckdb_memory(); -- Shows memory breakdown
FROM duckdb_temporary_files(); -- Lists active temp files
- Action Trigger: If temp files appear, either increase memory or optimize query
- Memory Threshold: Consistently >90% memory usage indicates need for more RAM
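A quick spill check you can run mid-workload; any nonzero count means the query didn't fit in memory:
SELECT count(*) AS spill_files
FROM duckdb_temporary_files();  -- 0 rows = everything stayed in RAM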
Advanced Configuration
Order Preservation Override
SET preserve_insertion_order = false;
- Use Case: ETL jobs where row order is irrelevant
- Benefit: Reduces memory usage on large imports
- Trade-off: Loses data ordering for memory savings
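Typical use is wrapping a bulk import or export where row order genuinely doesn't matter; the filenames here are hypothetical:
SET preserve_insertion_order = false;
COPY (SELECT * FROM read_csv('raw_events.csv'))
    TO 'events.parquet' (FORMAT parquet);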
S3/Remote File Optimization
SET enable_external_file_cache = true; -- DuckDB 1.3+ only
SET parquet_metadata_cache = true;
SET threads = 32; -- 2x CPU cores for network I/O
- Version Dependency: enable_external_file_cache is broken in some DuckDB 1.3.0 builds
- Network I/O Rule: Use significantly more threads than CPU cores for remote data
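Putting it together for a remote scan; the bucket and region are placeholders, and the cache setting assumes DuckDB 1.3+:
INSTALL httpfs;
LOAD httpfs;
SET s3_region = 'us-east-1';            -- placeholder region
SET threads = 32;                       -- ~2x physical cores to hide network wait
SET enable_external_file_cache = true;  -- 1.3+ only, see version caveat above
SELECT count(*) FROM read_parquet('s3://my-bucket/data/*.parquet');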
File Format Performance Impact
Format | Performance | Memory Usage | Network Efficiency
---|---|---|---
DuckDB native | Fastest | Most compressed | Best
Parquet | Fast | Good compression | Good
CSV | Slow | High memory usage | Poor
JSON | Slowest | Highest usage | Worst
Real Performance Data
- CSV to Parquet conversion: 45-minute query reduced to 8 minutes (5.6x improvement)
- File-based vs in-memory: File-based uses 40% less memory due to compression
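The conversion itself is a single COPY; a sketch with hypothetical filenames (ZSTD is a solid default codec):
COPY (SELECT * FROM read_csv('big_export.csv'))
    TO 'big_export.parquet' (FORMAT parquet, COMPRESSION zstd);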
Common Failure Scenarios
Memory Exhaustion Patterns
- String Aggregations: Functions like string_agg() don't spill to disk efficiently
- Large GROUP BY: May require query decomposition with LIMIT/OFFSET or key ranges (one chunking sketch follows this list)
- Solution Hierarchy: More memory > query optimization > chunking
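One way to decompose an oversized aggregation: chunk by key range, so no group gets split across two chunks. Table names and ranges are hypothetical:
-- Process one key range at a time instead of one giant aggregation
CREATE TABLE IF NOT EXISTS agg_results (customer_id BIGINT, notes VARCHAR);
INSERT INTO agg_results
SELECT customer_id, string_agg(note, ', ') AS notes
FROM orders
WHERE customer_id BETWEEN 1 AND 1000000
GROUP BY customer_id;
-- repeat with BETWEEN 1000001 AND 2000000, and so on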
Connection Performance Anti-Pattern
# WRONG: opens a new connection for every query
import duckdb

for query in queries:
    conn = duckdb.connect("db.duckdb")
    result = conn.execute(query)
    conn.close()

# CORRECT: reuse one connection for the whole batch
conn = duckdb.connect("db.duckdb")
for query in queries:
    result = conn.execute(query)
conn.close()
- Real Impact: ETL job time reduced from 2 hours to 15 minutes (8x improvement)
Query Optimization Intelligence
Partition Elimination (Critical for S3)
-- EFFICIENT: Skips entire files
SELECT * FROM 's3://bucket/year=2024/month=09/*.parquet'
WHERE year = 2024 AND month = 9;
-- INEFFICIENT: Scans all files
SELECT * FROM 's3://bucket/*/*.parquet'
WHERE some_column = 'value';
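If the files follow Hive-style paths, read_parquet can expose the path segments as filterable columns, so pruning happens before any file is opened. The bucket here is a placeholder:
SELECT *
FROM read_parquet('s3://bucket/*/*/*.parquet', hive_partitioning = true)
WHERE year = 2024 AND month = 9;  -- non-matching files are skipped entirely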
Window Function Performance (DuckDB 1.1+)
-- OPTIMIZED: Streams efficiently
SELECT customer_id,
SUM(amount) OVER (
PARTITION BY customer_id
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)
FROM orders;
- Performance Rule: Use ROWS BETWEEN instead of RANGE BETWEEN when possible (see the contrast below)
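For contrast, the RANGE form of the same frame groups peers by the ORDER BY value instead of row position and typically streams worse:
-- SLOWER: RANGE frames must evaluate peer groups by value
SELECT customer_id,
       SUM(amount) OVER (
           PARTITION BY customer_id
           ORDER BY order_date
           RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       )
FROM orders;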
Query Analysis
EXPLAIN ANALYZE SELECT ...;
Critical Indicators:
- Cardinality Mismatch: Estimates vs actual rows significantly different
- Nested Loop Joins: Usually indicates poor join conditions
- No Filter Pushdown: Filters applied late in execution plan
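A typical run against a join; table names are hypothetical. The printed plan shows per-operator timings plus actual row counts to compare against the estimates:
EXPLAIN ANALYZE
SELECT c.region, count(*)
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.region;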
Resource Requirements
Memory Scaling
- Effective Baseline: Set memory_limit to 90% of available system RAM
- Risk Threshold: 95% maximum to prevent system instability
- Monitoring: Temp file creation indicates insufficient memory allocation
Storage Requirements
- Temp Space: NVMe strongly recommended for spill operations
- Network I/O: Parquet over CSV provides 5-6x performance improvement
- Compression: File-based databases use 40% less memory than in-memory
Breaking Points and Failure Modes
Hard Limits
- Memory Limit >95%: System crashes requiring manual recovery
- Hyperthreading: Usually degrades performance despite increased logical cores
- CSV over Network: Extremely poor performance, avoid when possible
Version-Specific Issues
- DuckDB 1.3.0: External file cache may cause S3 errors in some builds
- CTE Optimization: Automatic caching available in DuckDB 1.1+, significant performance improvement
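If you want the CTE caching behavior explicitly rather than relying on the optimizer, recent DuckDB versions accept the MATERIALIZED keyword; table names here are hypothetical:
WITH daily AS MATERIALIZED (
    SELECT order_date, sum(amount) AS total
    FROM orders
    GROUP BY order_date
)
SELECT * FROM daily WHERE total > 10000
UNION ALL
SELECT * FROM daily WHERE total < 100;  -- daily is computed once, read twice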
Implementation Decision Tree
- Memory Issues → Increase memory_limit to 90%
- Still Slow → Optimize thread count (physical cores only)
- Spilling to Disk → Move temp_directory to fastest storage
- S3 Performance → Increase threads, use Parquet, enable caching
- Complex Queries → Use EXPLAIN ANALYZE, optimize joins and filters
Success Rate: These three primary settings (memory, threads, temp directory) resolve the majority of DuckDB performance issues without advanced tuning.
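As a starting point, the whole checklist condenses to a few SET statements; values are illustrative, tune per machine:
SET memory_limit = '90%';
SET threads = 8;                           -- physical cores on this hypothetical box
SET temp_directory = '/nvme/duckdb-temp';  -- fastest local storage available
SET preserve_insertion_order = false;      -- only if row order doesn't matter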
Useful Links for Further Investigation
DuckDB Resources That Don't Suck
Link | Description
---|---
Official DuckDB Documentation | In-depth guides and reference material, including the performance overview.
DuckDB Discord Server | Community support: ask questions and discuss DuckDB with other users and the developers.
DuckDB Release Notes | New features, improvements, and bug fixes for each release.