ANN-Benchmarks is fine for research papers but completely useless when you're trying to pick a database that won't shit the bed in production. I've learned this the hard way - multiple times. VDBBench 1.0 came out in July 2025 and finally started testing scenarios that actually happen in the real world.
The Problem with Academic Benchmarks
Academic benchmarks test perfect conditions that don't exist in production. ANN-Benchmarks tests 128-dimension SIFT vectors from 2009 while we're dealing with 1,536-dimension OpenAI embeddings and need to filter by user permissions at query time. The memory access patterns are completely different, and the computational requirements are nothing alike.
I've seen this pattern too many times: choose the benchmark winner, deploy to production, watch everything catch fire. Elasticsearch benchmarks showed sub-100ms queries, but in production it needed 18+ hours to optimize indexes every time we updated data - during which our system was basically fucked and throwing "connection timeout" errors. ChromaDB worked great in development but fell apart the moment we had more than one user hitting it simultaneously.
What Actually Matters for Production Benchmarks
After debugging vector databases at 3am more times than I care to count, here's what actually matters:
Does it test concurrent writes while serving queries? Your users don't politely wait for data updates to finish. VDBBench actually tests this - 500 vectors/second ingestion while multiple clients are searching. Guess what? Most "fast" databases become unusable under this load.
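If you want to check this yourself, here's roughly what that kind of test looks like - a minimal sketch, nothing more. `db.insert_batch` and `db.search` are placeholder names for whatever your database's client actually calls its insert and search methods; swap in the real thing.

```python
import threading
import time
import random

# Minimal sketch of a concurrent read/write load test.
# `db.insert_batch` and `db.search` are hypothetical stand-ins for your
# vector database client's actual insert and search calls.

def ingest_loop(db, dim, rate_per_sec, stop):
    """Insert random vectors at roughly `rate_per_sec` vectors/second."""
    batch_size = 100
    interval = batch_size / rate_per_sec
    while not stop.is_set():
        batch = [[random.random() for _ in range(dim)] for _ in range(batch_size)]
        db.insert_batch(batch)
        time.sleep(interval)

def query_loop(db, dim, latencies, stop):
    """Run searches in a tight loop and record per-query latency in seconds."""
    while not stop.is_set():
        vec = [random.random() for _ in range(dim)]
        t0 = time.perf_counter()
        db.search(vec, top_k=10)
        latencies.append(time.perf_counter() - t0)

def run_load_test(db, dim=1536, duration_sec=3600, query_threads=8):
    """Ingest ~500 vectors/sec while several query threads hammer the database."""
    stop = threading.Event()
    latencies = []
    threads = [threading.Thread(target=ingest_loop, args=(db, dim, 500, stop))]
    threads += [threading.Thread(target=query_loop, args=(db, dim, latencies, stop))
                for _ in range(query_threads)]
    for t in threads:
        t.start()
    time.sleep(duration_sec)
    stop.set()
    for t in threads:
        t.join()
    return latencies
```

Run it against a database that's already loaded with your real dataset, not an empty collection - that's where the ugly behavior shows up.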
Does it use realistic vector dimensions? ANN-Benchmarks still tests 128D vectors from 2009. We're dealing with 1,536D OpenAI embeddings or 3,072D from newer models. The memory access patterns are completely different - what works for 128D often becomes garbage at 1,500D+.
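The back-of-envelope math alone should tell you why. Rough numbers, assuming float32 and counting raw vectors only - index overhead comes on top of this:

```python
# Back-of-envelope memory math: float32 vectors, no index overhead included.
def raw_vector_gib(num_vectors, dim, bytes_per_float=4):
    return num_vectors * dim * bytes_per_float / 1024**3

print(raw_vector_gib(10_000_000, 128))    # ~4.8 GiB  - the ANN-Benchmarks world
print(raw_vector_gib(10_000_000, 1536))   # ~57 GiB   - OpenAI-sized embeddings
print(raw_vector_gib(10_000_000, 3072))   # ~114 GiB  - newer models
```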
Does it test filtered search? Production queries aren't just "find similar vectors". They're "find similar vectors but only from this user's data and within this price range". Qdrant's benchmarks showed that highly selective filters (99.9% exclusion) can cause 10x latency spikes. Most benchmarks completely ignore this.
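For reference, here's what a production query actually looks like, sketched with the qdrant-client Python package. The collection name and the payload fields (`user_id`, `price`) are made up, and the exact API details can differ between client versions:

```python
# Sketch of a "real" production query: vector similarity plus metadata filters.
# Collection name and payload fields (user_id, price) are hypothetical.
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")
embedding = [0.0] * 1536  # stand-in for a real 1,536-dim query embedding

hits = client.search(
    collection_name="products",
    query_vector=embedding,
    query_filter=models.Filter(
        must=[
            models.FieldCondition(key="user_id", match=models.MatchValue(value="user_42")),
            models.FieldCondition(key="price", range=models.Range(gte=10.0, lte=50.0)),
        ]
    ),
    limit=10,
)
```

Whatever database you're evaluating, benchmark this query at your real filter selectivity - not the unfiltered version the vendor shows off.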
The Reality Check: Which Tools Are Worth Your Time
ANN-Benchmarks: Great for research papers, useless for production decisions. If you're implementing a new algorithm, use it. If you're picking a database for production, skip it.
VDBBench: Actually tests scenarios that happen in production. Setup is a pain in the ass, but the results are realistic. Worth the time investment.
Qdrant's benchmarks: Obviously biased toward Qdrant, but they test filtered search scenarios that matter. Good for understanding performance cliffs.
Vendor Marketing Bullshit: Most vendor benchmarks are complete horseshit designed to make their product look good. Red flags: synthetic datasets, cherry-picked metrics, missing configuration details. If they won't share their exact setup, the results are garbage.
Metrics That Actually Matter
Average latency is bullshit. Here's what you should actually measure:
P95/P99 latency: If your average is 10ms but P99 is 2 seconds, your users are going to hate you. VDBBench focuses on tail latency because that's what breaks user experience.
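Computing this is trivial, so there's no excuse for skipping it. A quick sketch - `latencies` would be the per-query samples you collected (for example from the load-test sketch earlier); the random data here is just so the snippet runs on its own:

```python
import numpy as np

# `latencies` = per-query latencies in seconds, e.g. collected during a load test.
# Synthetic skewed data stands in here so the snippet is self-contained.
latencies = np.random.lognormal(mean=-4.5, sigma=1.0, size=100_000)

latencies_ms = latencies * 1000
print(f"avg: {latencies_ms.mean():.1f} ms")   # looks great
print(f"p95: {np.percentile(latencies_ms, 95):.1f} ms")
print(f"p99: {np.percentile(latencies_ms, 99):.1f} ms")  # what users actually feel
```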
Sustained performance over hours, not minutes: Peak throughput for 30 seconds means nothing. Can it handle your load for 8 hours straight without degrading? Most can't.
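One way to check: bucket your samples into time windows and watch whether p99 drifts upward over the run. A rough sketch, assuming you recorded (timestamp, latency) pairs during the soak test:

```python
from collections import defaultdict
import numpy as np

# `samples` is assumed to be a list of (unix_timestamp, latency_seconds) tuples
# collected over a long run. Buckets them into 10-minute windows and reports
# p99 per window so degradation over time is visible.
def p99_by_window(samples, window_sec=600):
    buckets = defaultdict(list)
    t_start = min(t for t, _ in samples)
    for t, lat in samples:
        buckets[int((t - t_start) // window_sec)].append(lat)
    return {w: float(np.percentile(lats, 99)) for w, lats in sorted(buckets.items())}

# If the last hour's p99 is 3x the first hour's, the 30-second benchmark lied to you.
```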
Cost per query, not just speed: The fastest system often costs 10x more to run. Factor in memory requirements, instance types, and operational overhead. Speed is useless if it breaks your budget.
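The math fits on a napkin. The instance prices and QPS numbers below are made up - plug in your own sustained numbers, not the peak ones:

```python
# Rough cost-per-query math: hourly infrastructure cost divided by sustained throughput.
# The prices and QPS figures below are hypothetical examples.
def cost_per_million_queries(hourly_cost_usd, sustained_qps):
    queries_per_hour = sustained_qps * 3600
    return hourly_cost_usd / queries_per_hour * 1_000_000

print(cost_per_million_queries(4.10, 800))   # "fast" box:  ~$1.42 per million queries
print(cost_per_million_queries(0.50, 300))   # cheaper box: ~$0.46 per million queries
```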
Traditional benchmarks have been academic circle jerks. The tools that matter now test production scenarios. This shift means better technology decisions, but it also means you actually need to do your homework instead of just picking the benchmark winner and praying it works.