The 3 Settings That Actually Matter

I've been using DuckDB with big datasets for a while now. Most optimization guides focus on stuff that doesn't help. Three settings actually fix slow queries.

[Image: DuckDB Performance Over Time]

Memory Limit: Just Give It More RAM

DuckDB is conservative with your RAM. The default memory_limit is 80% of system RAM, which is stupid when you have 64GB sitting there doing nothing:

SET memory_limit = '90%';

This lets DuckDB use more memory before spilling to disk. Makes queries way faster in most cases.

Watch out: Don't set it higher than 95%. I crashed a server once setting it to 98% - everything locked up when the OS needed memory for something else. Spent an hour figuring out what happened.

Thread Count: Your CPU Probably Sucks at Hyperthreading

DuckDB detects your CPU cores automatically and uses all logical cores by default. If your CPU has hyperthreading, those extra logical cores usually hurt performance for CPU-bound queries, so cap it at your physical core count:

SET threads = 32;  -- Physical cores only

But if you're reading from S3 or remote files, use more threads since you're waiting on network anyway.
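
If you'd rather detect the core count than hard-code it, here's a minimal Python sketch. It assumes the psutil package is installed (os.cpu_count() only reports logical cores) and uses a placeholder database path:

import duckdb
import psutil

# Physical core count, ignoring hyperthreading (e.g. 16 on a 16-core/32-thread box)
physical_cores = psutil.cpu_count(logical=False)

conn = duckdb.connect("mydb.duckdb")  # placeholder path
conn.execute(f"SET threads = {physical_cores};")
print(conn.execute("SELECT current_setting('threads');").fetchone())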

Temp Directory: Put It on Your Fastest Drive

When DuckDB runs out of memory, it spills to disk. Put the temp files on your fastest drive:

SET temp_directory = '/fast-storage/duckdb-temp';

Spinning drives make spills really slow. SSD helps. NVMe is best but still slower than keeping everything in memory.
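
Here's a rough sketch of wiring this up from Python - the /fast-storage path comes from the example above, so adjust it to wherever your fastest drive is mounted, and the database filename is a placeholder:

import os
import duckdb

temp_dir = "/fast-storage/duckdb-temp"  # put this on NVMe if you have it
os.makedirs(temp_dir, exist_ok=True)

conn = duckdb.connect("mydb.duckdb")    # placeholder path
conn.execute(f"SET temp_directory = '{temp_dir}';")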

[Image: DuckDB Memory Spill Process]

Here's What Nobody Tells You About Memory Monitoring

Use these commands to see what's actually happening:

FROM duckdb_memory();  -- Shows memory breakdown
FROM duckdb_temporary_files();  -- Lists temp files being created

If you see temp files being created, either give DuckDB more memory or fix your query. Usually the query is doing something inefficient.
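
If you'd rather check this from Python than run the queries by hand, a minimal sketch (database path is a placeholder):

import duckdb

conn = duckdb.connect("mydb.duckdb")  # placeholder path

# Per-subsystem memory breakdown
for row in conn.execute("FROM duckdb_memory();").fetchall():
    print(row)

# Any rows here mean queries are spilling to disk right now
temp_files = conn.execute("FROM duckdb_temporary_files();").fetchall()
if temp_files:
    print(f"{len(temp_files)} temp file(s) on disk - raise memory_limit or rework the query")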

Why preserve_insertion_order = false Sometimes Helps

This one's weird. By default DuckDB keeps rows in the order they arrived, which costs extra memory and can limit parallelism on big scans and imports:

SET preserve_insertion_order = false;

Use this for ETL jobs where order doesn't matter. Saves memory on big imports.
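
A bulk import where nobody cares about row order might look like this - a sketch with made-up table and file names:

import duckdb

conn = duckdb.connect("warehouse.duckdb")  # placeholder path
conn.execute("SET preserve_insertion_order = false;")

# Bulk load where the final row order is irrelevant
conn.execute("""
    CREATE TABLE events AS
    SELECT * FROM read_parquet('raw_events/*.parquet');
""")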

Common DuckDB Performance Problems

Q: DuckDB isn't using all my RAM and my queries are slow as hell

A: Bump the memory limit. DuckDB defaults to 80% of system RAM, so there's headroom to claim:

SET memory_limit = '90%';

Don't go higher than 95% or you'll crash the server when the OS needs memory. Been there, spent 2 hours debugging what turned out to be a simple memory issue.

Q: More threads made my queries slower

A: Hyperthreading usually hurts DuckDB performance. Use your physical core count:

SET threads = 16;  -- Whatever your actual cores are, not logical

But if you're hitting S3 or HTTP endpoints, use more threads because you're just waiting for network:

SET threads = 32;  -- 2x cores for S3/HTTP stuff

Q: DuckDB keeps creating temp files even though I have tons of memory

A: Check your memory limit setting:

SELECT current_setting('memory_limit');

If it's lower than expected, DuckDB will spill to disk early. Also make sure your temp directory isn't on a slow drive - that'll kill performance. Had one setup where the temp directory was on a network mount. Took 4 hours to figure out why everything was so slow.
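
Quick way to check both settings at once from Python (placeholder path):

import duckdb

conn = duckdb.connect("mydb.duckdb")  # placeholder path
print(conn.execute(
    "SELECT current_setting('memory_limit'), current_setting('temp_directory');"
).fetchone())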

Q: How do I stop DuckDB from writing temp files everywhere?

A: You can't disable spilling completely, but you can minimize it:

  • Crank up memory_limit to 90%
  • Put temp directory on NVMe if you have it
  • Break monster queries into smaller pieces
  • Use persistent databases instead of in-memory (they compress better)

Q: S3 queries are dog slow compared to local files

A: Network latency is killing you. Try this:

SET threads = 24;  -- Way more threads for network I/O
SET enable_external_file_cache = true;  -- If you have DuckDB 1.3+

Use Parquet files instead of CSV when possible. CSV over network is painfully slow. Had a 10GB CSV query that took 45 minutes. Same data as Parquet took 8 minutes - fucking huge difference.

Q: My big GROUP BY query ran out of memory and crashed

A: Some aggregation functions don't spill to disk well. Functions like string_agg() can cause memory issues. Try:

  • More memory (obviously)
  • Break the query into chunks with LIMIT/OFFSET or by key ranges (see the sketch after this list)
  • Use window functions instead of aggregates when possible
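
Here's what the chunking can look like - a rough sketch that assumes orders has an integer customer_id and a customer_totals table already exists (all the names are hypothetical):

import duckdb

conn = duckdb.connect("warehouse.duckdb")  # persistent DB, placeholder path

# Aggregate one customer_id range at a time instead of the whole table in one pass
lo, hi = conn.execute("SELECT min(customer_id), max(customer_id) FROM orders;").fetchone()
step = 1_000_000
for start in range(lo, hi + 1, step):
    conn.execute(
        """
        INSERT INTO customer_totals
        SELECT customer_id, SUM(amount)
        FROM orders
        WHERE customer_id >= ? AND customer_id < ?
        GROUP BY customer_id
        """,
        [start, start + step],
    )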

Q: Should I use the file-based DB or keep everything in memory?

A: File-based is usually faster because DuckDB compresses the data on disk. In-memory doesn't compress, so you use more RAM and spill more often. Counter-intuitive but true.

Found this out when our in-memory setup kept OOMing with a dataset that fit fine in a file-based DB. Same machine, same settings - file version used 40% less memory.
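
In Python terms the difference is just the connect call (file name is a placeholder):

import duckdb

# In-memory: nothing is compressed, everything lives in RAM
mem_conn = duckdb.connect()                     # same as duckdb.connect(":memory:")

# File-backed: blocks are compressed on disk, only hot data sits in RAM
file_conn = duckdb.connect("analytics.duckdb")  # placeholder path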

Settings That Actually Matter

Setting                  | When It Helps                     | Will It Break Shit?
memory_limit = 90%       | When you're spilling to disk      | Might crash if OS needs RAM
threads = physical cores | If hyperthreading is fucking you  | Nope, totally safe
temp_directory on NVMe   | When you can't avoid spills       | No, just costs money

Advanced DuckDB Performance Stuff

The basic memory and thread settings fix most performance issues. If you're still having problems with complex queries or huge datasets, here's what else helps.

Using EXPLAIN ANALYZE to Find What's Actually Slow

[Image: DuckDB Query Plan Visualizer]

Stop guessing why your query is slow. Use EXPLAIN ANALYZE to see what's actually happening:

EXPLAIN ANALYZE SELECT region, COUNT(*), AVG(sales)
FROM huge_sales_table 
WHERE date_col >= '2024-01-01'
GROUP BY region;

Look for:

  • Cardinality way off: If it estimates 1000 rows but gets 10M, the query planner has no clue what it's doing
  • Nested loop joins: Usually means your join conditions suck
  • No filter pushdown: Filters should happen early, not at the end
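
If you're running this from Python, you can grab the profiled plan as text and read it there - a minimal sketch reusing the query above, with the table name standing in as a placeholder:

import duckdb

conn = duckdb.connect("mydb.duckdb")  # placeholder path

plan = conn.execute("""
    EXPLAIN ANALYZE
    SELECT region, COUNT(*), AVG(sales)
    FROM huge_sales_table
    WHERE date_col >= '2024-01-01'
    GROUP BY region;
""").fetchall()

for row in plan:
    print(row[-1])  # last column holds the rendered plan text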

DuckDB 1.1 Features That Actually Help

Automatic CTE Caching

[Image: DuckDB Performance Benchmarks]

DuckDB 1.1 got smart about repeated CTEs. This query used to run the expensive part twice:

WITH expensive_stuff AS (
  SELECT customer_id, SUM(complex_calculation) as total
  FROM massive_table
  GROUP BY customer_id
)
SELECT e1.total, e2.total  
FROM expensive_stuff e1
JOIN expensive_stuff e2 ON e1.customer_id = e2.customer_id;

Now it automatically caches the CTE. Makes some analytical queries much faster. Would have saved me 3 hours of manual optimization bullshit if I'd known about this earlier.

S3 and Remote Files

Caching Settings That Work

If you're stuck reading from S3, these help:

SET enable_external_file_cache = true;  -- DuckDB 1.3+ only
SET parquet_metadata_cache = true;
SET threads = 32;  -- Way more than your CPU cores

Note: enable_external_file_cache breaks on some versions of 1.3.0 - if you get weird S3 errors, comment that line out.

Partition Your Queries Properly

This matters way more than I thought for partition elimination:

-- Good: Skips whole files
SELECT * FROM 's3://bucket/year=2024/month=09/*.parquet'
WHERE year = 2024 AND month = 09;

-- Bad: Scans everything
SELECT * FROM 's3://bucket/*/*.parquet'
WHERE some_random_column = 'value';

The first query skips files it doesn't need. The second reads everything.

Connection Reuse

Common mistake:

import duckdb

# Don't do this - creates connection overhead every time
for query in my_queries:
    conn = duckdb.connect("mydb.duckdb")
    result = conn.execute(query)
    conn.close()

Just reuse the damn connection:

# Much faster - open once, reuse for every query
conn = duckdb.connect("mydb.duckdb")
for query in my_queries:
    result = conn.execute(query)
conn.close()

Connection overhead adds up when you're running lots of queries. Our ETL job went from 2 hours to 15 minutes just by not being stupid about connections.

File Formats

Format       | Performance              | When to Use
Parquet      | Much faster than CSV     | Best for most use cases
CSV          | Slower, uses more memory | When you need text format
JSON         | Slow                     | Only when necessary
DuckDB files | Fastest of all           | Internal storage

If you must use CSV, compress it. gzip helps a lot over network.
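
For example, DuckDB will read a gzip-compressed CSV straight over HTTP - a sketch with a made-up URL, assuming the httpfs extension is available:

import duckdb

conn = duckdb.connect()
conn.execute("INSTALL httpfs;")
conn.execute("LOAD httpfs;")

# .gz is detected from the file extension and decompressed on the fly
rows = conn.execute(
    "SELECT * FROM read_csv_auto('https://example.com/exports/data.csv.gz') LIMIT 10;"
).fetchall()
print(rows)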

[Image: DuckDB File Format Performance]

Window Functions: The Right Way

DuckDB 1.1 made window functions way faster if you use the right syntax:

-- This streams efficiently
SELECT customer_id,
       SUM(amount) OVER (
           PARTITION BY customer_id 
           ORDER BY order_date
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) as running_total
FROM orders;

Use ROWS BETWEEN when possible instead of RANGE BETWEEN.

Monitoring That Actually Matters

Skip the fancy monitoring. Just check these:

-- Memory pressure
FROM duckdb_memory();

-- Temp file usage  
FROM duckdb_temporary_files();

If memory is consistently >90% or you see big temp files, you need more RAM or better queries.

What Doesn't Actually Help

These settings usually don't help much:

  • streaming_buffer_size: Only matters in specific network cases
  • Perfect thread calculations: Use your core count, adjust if needed
  • Micro-optimizing everything: Focus on memory, threads, storage first

Memory limit and thread count solve most performance problems. Everything else is edge cases unless your setup is completely fucked.
