The 3 Settings That Actually Matter

I've been using DuckDB with big datasets for a while now. Most optimization guides focus on stuff that doesn't help. Three settings actually fix slow queries.

[Image: DuckDB Performance Over Time]

Memory Limit: Just Give It More RAM

DuckDB is conservative with your RAM. The default memory_limit is 80% of system RAM, which is stupid when you have 64GB sitting there doing nothing:

SET memory_limit = '90%';

This lets DuckDB use more memory before spilling to disk. Makes queries way faster in most cases.

Watch out: Don't set it higher than 95%. I crashed a server once setting it to 98% - everything locked up when the OS needed memory for something else. Spent an hour figuring out what happened.

Thread Count: Your CPU Probably Sucks at Hyperthreading

DuckDB detects your CPU cores automatically and uses all logical cores by default. If your CPU has hyperthreading, those extra logical cores usually hurt performance for CPU-bound queries, so cap it at your physical core count:

SET threads = 32;  -- Physical cores only

But if you're reading from S3 or remote files, use more threads since you're waiting on network anyway.
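
If you'd rather detect the core count than hard-code it, here's a minimal Python sketch. It assumes the psutil package is installed (os.cpu_count() only reports logical cores) and uses a placeholder database path:

import duckdb
import psutil

# Physical core count, ignoring hyperthreading (e.g. 16 on a 16-core/32-thread box)
physical_cores = psutil.cpu_count(logical=False)

conn = duckdb.connect("mydb.duckdb")  # placeholder path
conn.execute(f"SET threads = {physical_cores};")
print(conn.execute("SELECT current_setting('threads');").fetchone())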

Temp Directory: Put It on Your Fastest Drive

When DuckDB runs out of memory, it spills to disk. Put the temp files on your fastest drive:

SET temp_directory = '/fast-storage/duckdb-temp';

Spinning drives make spills really slow. SSD helps. NVMe is best but still slower than keeping everything in memory.
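
Here's a rough sketch of wiring this up from Python - the /fast-storage path comes from the example above, so adjust it to wherever your fastest drive is mounted, and the database filename is a placeholder:

import os
import duckdb

temp_dir = "/fast-storage/duckdb-temp"  # put this on NVMe if you have it
os.makedirs(temp_dir, exist_ok=True)

conn = duckdb.connect("mydb.duckdb")    # placeholder path
conn.execute(f"SET temp_directory = '{temp_dir}';")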

[Image: DuckDB Memory Spill Process]

Here's What Nobody Tells You About Memory Monitoring

Use these commands to see what's actually happening:

FROM duckdb_memory();  -- Shows memory breakdown
FROM duckdb_temporary_files();  -- Lists temp files being created

If you see temp files being created, either give DuckDB more memory or fix your query. Usually the query is doing something inefficient.
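
If you'd rather check this from Python than run the queries by hand, a minimal sketch (database path is a placeholder):

import duckdb

conn = duckdb.connect("mydb.duckdb")  # placeholder path

# Per-subsystem memory breakdown
for row in conn.execute("FROM duckdb_memory();").fetchall():
    print(row)

# Any rows here mean queries are spilling to disk right now
temp_files = conn.execute("FROM duckdb_temporary_files();").fetchall()
if temp_files:
    print(f"{len(temp_files)} temp file(s) on disk - raise memory_limit or rework the query")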

Why preserve_insertion_order = false Sometimes Helps

This one's weird. By default DuckDB keeps rows in the order they arrived, which costs extra memory and can limit parallelism on big scans and imports:

SET preserve_insertion_order = false;

Use this for ETL jobs where order doesn't matter. Saves memory on big imports.
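
A bulk import where nobody cares about row order might look like this - a sketch with made-up table and file names:

import duckdb

conn = duckdb.connect("warehouse.duckdb")  # placeholder path
conn.execute("SET preserve_insertion_order = false;")

# Bulk load where the final row order is irrelevant
conn.execute("""
    CREATE TABLE events AS
    SELECT * FROM read_parquet('raw_events/*.parquet');
""")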

Common DuckDB Performance Problems

Q: DuckDB isn't using all my RAM and my queries are slow as hell

A: Bump the memory limit. DuckDB defaults to 80% of system RAM, so there's headroom to claim:

SET memory_limit = '90%';

Don't go higher than 95% or you'll crash the server when the OS needs memory. Been there, spent 2 hours debugging what turned out to be a simple memory issue.

Q: More threads made my queries slower

A: Hyperthreading usually hurts DuckDB performance. Use your physical core count:

SET threads = 16;  -- Whatever your actual cores are, not logical

But if you're hitting S3 or HTTP endpoints, use more threads because you're just waiting for network:

SET threads = 32;  -- 2x cores for S3/HTTP stuff

Q: DuckDB keeps creating temp files even though I have tons of memory

A: Check your memory limit setting:

SELECT current_setting('memory_limit');

If it's lower than expected, DuckDB will spill to disk early. Also make sure your temp directory isn't on a slow drive - that'll kill performance. Had one setup where the temp directory was on a network mount. Took 4 hours to figure out why everything was so slow.
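
Quick way to check both settings at once from Python (placeholder path):

import duckdb

conn = duckdb.connect("mydb.duckdb")  # placeholder path
print(conn.execute(
    "SELECT current_setting('memory_limit'), current_setting('temp_directory');"
).fetchone())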

Q: How do I stop DuckDB from writing temp files everywhere?

A: You can't disable spilling completely, but you can minimize it:

  • Crank up memory_limit to 90%
  • Put temp directory on NVMe if you have it
  • Break monster queries into smaller pieces
  • Use persistent databases instead of in-memory (they compress better)

Q: S3 queries are dog slow compared to local files

A: Network latency is killing you. Try this:

SET threads = 24;  -- Way more threads for network I/O
SET enable_external_file_cache = true;  -- If you have DuckDB 1.3+

Use Parquet files instead of CSV when possible. CSV over network is painfully slow. Had a 10GB CSV query that took 45 minutes. Same data as Parquet took 8 minutes - fucking huge difference.

Q: My big GROUP BY query ran out of memory and crashed

A: Some aggregation functions don't spill to disk well. Functions like string_agg() can cause memory issues. Try:

  • More memory (obviously)
  • Break the query into chunks with LIMIT/OFFSET or by key ranges (see the sketch after this list)
  • Use window functions instead of aggregates when possible
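
Here's what the chunking can look like - a rough sketch that assumes orders has an integer customer_id and a customer_totals table already exists (all the names are hypothetical):

import duckdb

conn = duckdb.connect("warehouse.duckdb")  # persistent DB, placeholder path

# Aggregate one customer_id range at a time instead of the whole table in one pass
lo, hi = conn.execute("SELECT min(customer_id), max(customer_id) FROM orders;").fetchone()
step = 1_000_000
for start in range(lo, hi + 1, step):
    conn.execute(
        """
        INSERT INTO customer_totals
        SELECT customer_id, SUM(amount)
        FROM orders
        WHERE customer_id >= ? AND customer_id < ?
        GROUP BY customer_id
        """,
        [start, start + step],
    )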

Q: Should I use the file-based DB or keep everything in memory?

A: File-based is usually faster because DuckDB compresses the data on disk. In-memory doesn't compress, so you use more RAM and spill more often. Counter-intuitive but true.

Found this out when our in-memory setup kept OOMing with a dataset that fit fine in a file-based DB. Same machine, same settings - file version used 40% less memory.
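
In Python terms the difference is just the connect call (file name is a placeholder):

import duckdb

# In-memory: nothing is compressed, everything lives in RAM
mem_conn = duckdb.connect()                     # same as duckdb.connect(":memory:")

# File-backed: blocks are compressed on disk, only hot data sits in RAM
file_conn = duckdb.connect("analytics.duckdb")  # placeholder path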

Settings That Actually Matter

Setting                  | When It Helps                     | Will It Break Shit?
memory_limit = 90%       | When you're spilling to disk      | Might crash if OS needs RAM
threads = physical cores | If hyperthreading is fucking you  | Nope, totally safe
temp_directory on NVMe   | When you can't avoid spills       | No, just costs money

Advanced DuckDB Performance Stuff

The basic memory and thread settings fix most performance issues. If you're still having problems with complex queries or huge datasets, here's what else helps.

Using EXPLAIN ANALYZE to Find What's Actually Slow

[Image: DuckDB Query Plan Visualizer]

Stop guessing why your query is slow. Use EXPLAIN ANALYZE to see what's actually happening:

EXPLAIN ANALYZE SELECT region, COUNT(*), AVG(sales)
FROM huge_sales_table 
WHERE date_col >= '2024-01-01'
GROUP BY region;

Look for:

  • Cardinality way off: If it estimates 1000 rows but gets 10M, the query planner has no clue what it's doing
  • Nested loop joins: Usually means your join conditions suck
  • No filter pushdown: Filters should happen early, not at the end
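
If you're running this from Python, you can grab the profiled plan as text and read it there - a minimal sketch reusing the query above, with the table name standing in as a placeholder:

import duckdb

conn = duckdb.connect("mydb.duckdb")  # placeholder path

plan = conn.execute("""
    EXPLAIN ANALYZE
    SELECT region, COUNT(*), AVG(sales)
    FROM huge_sales_table
    WHERE date_col >= '2024-01-01'
    GROUP BY region;
""").fetchall()

for row in plan:
    print(row[-1])  # last column holds the rendered plan text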

DuckDB 1.1 Features That Actually Help

Automatic CTE Caching

[Image: DuckDB Performance Benchmarks]

DuckDB 1.1 got smart about repeated CTEs. This query used to run the expensive part twice:

WITH expensive_stuff AS (
  SELECT customer_id, SUM(complex_calculation) as total
  FROM massive_table
  GROUP BY customer_id
)
SELECT e1.total, e2.total  
FROM expensive_stuff e1
JOIN expensive_stuff e2 ON e1.customer_id = e2.customer_id;

Now it automatically caches the CTE. Makes some analytical queries much faster. Would have saved me 3 hours of manual optimization bullshit if I'd known about this earlier.

S3 and Remote Files

Caching Settings That Work

If you're stuck reading from S3, these help:

SET enable_external_file_cache = true;  -- DuckDB 1.3+ only
SET parquet_metadata_cache = true;
SET threads = 32;  -- Way more than your CPU cores

Note: enable_external_file_cache breaks on some versions of 1.3.0 - if you get weird S3 errors, comment that line out.

Partition Your Queries Properly

This matters way more than I thought for partition elimination:

-- Good: Skips whole files
SELECT * FROM 's3://bucket/year=2024/month=09/*.parquet'
WHERE year = 2024 AND month = 09;

-- Bad: Scans everything
SELECT * FROM 's3://bucket/*/*.parquet'
WHERE some_random_column = 'value';

The first query skips files it doesn't need. The second reads everything.

Connection Reuse

Common mistake:

import duckdb

# Don't do this - creates connection overhead every time
for query in my_queries:
    conn = duckdb.connect("mydb.duckdb")
    result = conn.execute(query)
    conn.close()

Just reuse the damn connection:

# Much faster - open once, reuse for every query
conn = duckdb.connect("mydb.duckdb")
for query in my_queries:
    result = conn.execute(query)
conn.close()

Connection overhead adds up when you're running lots of queries. Our ETL job went from 2 hours to 15 minutes just by not being stupid about connections.

File Formats

Format       | Performance              | When to Use
Parquet      | Much faster than CSV     | Best for most use cases
CSV          | Slower, uses more memory | When you need text format
JSON         | Slow                     | Only when necessary
DuckDB files | Fastest of all           | Internal storage

If you must use CSV, compress it. gzip helps a lot over network.
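
For example, DuckDB will read a gzip-compressed CSV straight over HTTP - a sketch with a made-up URL, assuming the httpfs extension is available:

import duckdb

conn = duckdb.connect()
conn.execute("INSTALL httpfs;")
conn.execute("LOAD httpfs;")

# .gz is detected from the file extension and decompressed on the fly
rows = conn.execute(
    "SELECT * FROM read_csv_auto('https://example.com/exports/data.csv.gz') LIMIT 10;"
).fetchall()
print(rows)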

[Image: DuckDB File Format Performance]

Window Functions: The Right Way

DuckDB 1.1 made window functions way faster if you use the right syntax:

-- This streams efficiently
SELECT customer_id,
       SUM(amount) OVER (
           PARTITION BY customer_id 
           ORDER BY order_date
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) as running_total
FROM orders;

Use ROWS BETWEEN when possible instead of RANGE BETWEEN.

Monitoring That Actually Matters

Skip the fancy monitoring. Just check these:

-- Memory pressure
FROM duckdb_memory();

-- Temp file usage  
FROM duckdb_temporary_files();

If memory is consistently >90% or you see big temp files, you need more RAM or better queries.

What Doesn't Actually Help

These settings usually don't help much:

  • streaming_buffer_size: Only matters in specific network cases
  • Perfect thread calculations: Use your core count, adjust if needed
  • Micro-optimizing everything: Focus on memory, threads, storage first

Memory limit and thread count solve most performance problems. Everything else is edge cases unless your setup is completely fucked.
