Look, if you're still using basic keyword search in 2025, you're making your users hate you. They search for "laptop repair" and never see the results filed under "notebook fixes" because your system is too dumb to understand these mean the same thing.
OpenAI's embeddings API fixes this shit. It turns text into arrays of numbers (vectors) that capture meaning, not just word matches. So when someone searches for "car insurance," it'll find results about "auto coverage" because the vectors are mathematically similar.
I've been using this in production for 18 months. Here's what actually works and what doesn't.
Embeddings turn words into vectors - arrays of numbers that cluster similar concepts together in high-dimensional space. "Car" and "automobile" end up closer to each other than "car" and "banana."
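Here's what that looks like in practice - a rough sketch using the openai Python client (v1.x) and numpy, with OPENAI_API_KEY set in your environment. The exact scores will vary, but the ordering holds.

```python
# Minimal sketch of the "car vs. automobile vs. banana" idea.
# Assumes the openai>=1.x client and an OPENAI_API_KEY in the environment.
from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(text: str) -> np.ndarray:
    # One embedding per input string; text-embedding-3-small returns 1,536 dims
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

car, automobile, banana = embed("car"), embed("automobile"), embed("banana")
print(cosine(car, automobile))  # noticeably higher...
print(cosine(car, banana))      # ...than this
```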
The Three Models That Matter (Stop Using ada-002)
text-embedding-3-small: $0.02 per million tokens. This is your workhorse. 1,536 dimensions, handles most use cases fine. I use this for 90% of everything because it's fast and cheap.
text-embedding-3-large: $0.13 per million tokens (6.5x more expensive). 3,072 dimensions, scores 64.6% on MTEB benchmarks. Only worth it if you need the absolute best accuracy and have budget to burn.
text-embedding-ada-002: Stop. Just stop using this. It's old, expensive at $0.10 per million tokens, and performs worse than the small model. If you're still on ada-002, migrate. Now.
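One detail worth knowing when you pick a model: the v3 models take an optional dimensions parameter that shrinks the returned vector, which cuts storage and similarity-compute costs. Rough sketch below - the 512 is just an example value, not a recommendation.

```python
# Hedged sketch: v3 models accept an optional `dimensions` parameter that
# shortens the vector, trading a little accuracy for cheaper storage.
from openai import OpenAI

client = OpenAI()

resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["car insurance quote", "auto coverage pricing"],
    dimensions=512,  # optional; defaults to the model's full size (1,536 here)
)
for item in resp.data:
    print(len(item.embedding))  # 512
```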
Real-World Performance (From Someone Who's Actually Used This)
The v3 models are genuinely good. I tested text-embedding-3-large against our old keyword system on 50K product descriptions and it surfaced roughly 80% more relevant items than string matching did. MTEB benchmarks show similar gains in controlled tests.
But here's what the docs don't tell you:
- Rate limits will bite you: 3,000 requests per minute sounds like a lot until you're batch processing. Plan accordingly - a retry-with-backoff sketch follows this list. The API fails randomly during peak hours and you'll get useless "request failed" errors. OpenAI's status page tracks outages but doesn't warn you about degraded performance.
- 8,192 token limit: About 6K words max per request. Long documents need chunking, which is annoying. Use tiktoken to count tokens - it's OpenAI's own tokenizer, so its counts match what the API actually sees. There's a chunking sketch after this list.
- Costs scale fast: Went from $50/month to $800/month in 3 months because usage grew faster than expected. Set billing alerts or learn the hard way like I did. Whatever you estimate up front, real-world costs always come in higher.
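For the rate-limit problem, something like this retry-with-backoff wrapper has saved me during batch jobs. The backoff numbers are arbitrary - tune them for your own tier and volume.

```python
# Rough retry/backoff pattern for rate limits and flaky "request failed" errors.
# Assumes the openai>=1.x client; the max_attempts and sleep values are arbitrary.
import time
from openai import OpenAI, RateLimitError, APIError

client = OpenAI()

def embed_with_retry(texts, model="text-embedding-3-small", max_attempts=5):
    for attempt in range(max_attempts):
        try:
            resp = client.embeddings.create(model=model, input=texts)
            return [d.embedding for d in resp.data]
        except (RateLimitError, APIError):
            if attempt == max_attempts - 1:
                raise
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s...
```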
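And for the token limit, here's one way to count and chunk with tiktoken. The 500-token chunk size is my assumption, not an OpenAI recommendation - pick whatever suits your documents.

```python
# Token counting and naive fixed-size chunking with tiktoken.
import tiktoken

try:
    enc = tiktoken.encoding_for_model("text-embedding-3-small")
except KeyError:
    # Older tiktoken versions may not know the model name; this is the encoding it uses
    enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

def chunk_text(text: str, max_tokens: int = 500) -> list[str]:
    # Split on token boundaries so every chunk stays under the request limit
    tokens = enc.encode(text)
    return [
        enc.decode(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]
```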
RAG (Retrieval Augmented Generation) systems use embeddings to find relevant documents, then feed them to language models for answers. The embedding step is what makes semantic search possible - without it you're stuck with dumb keyword matching.
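A stripped-down version of that generation step, assuming you've already retrieved the top chunks (the retrieval side is sketched further down). The model name and prompt wording here are mine, not a prescribed setup.

```python
# Minimal "feed retrieved context to the LLM" step of a RAG pipeline.
# Assumes the openai>=1.x client; swap in whatever chat model you actually use.
from openai import OpenAI

client = OpenAI()

def answer_with_context(question: str, retrieved_chunks: list[str]) -> str:
    context = "\n\n".join(retrieved_chunks)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```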
Language Support Reality Check
English works great. Spanish, French, German are solid. Everything else is hit-or-miss.
I tested with Japanese customer reviews and it worked okay for basic similarity but missed cultural context. For non-English production use, test thoroughly with your actual data, not sample text. Consider Cohere's multilingual models, Voyage AI's language support, or Google's Universal Sentence Encoder if you need better international support.
The semantic search workflow: user query → embedding → vector similarity search → ranked results. This process understands intent rather than matching exact keywords, finding relevant content even when different terminology is used.
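Here's that workflow end to end, kept in memory with numpy. Fine for a few thousand documents; past that you want a real vector database. The example documents are obviously made up.

```python
# query -> embedding -> cosine similarity -> ranked results, all in memory.
import numpy as np
from openai import OpenAI

client = OpenAI()
MODEL = "text-embedding-3-small"

def embed_batch(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model=MODEL, input=texts)
    return np.array([d.embedding for d in resp.data])

docs = ["auto coverage pricing guide", "notebook screen repair", "banana bread recipe"]
doc_vecs = embed_batch(docs)

def search(query: str, top_k: int = 2) -> list[str]:
    q = embed_batch([query])[0]
    # Cosine similarity of the query against every document vector
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return [docs[i] for i in np.argsort(sims)[::-1][:top_k]]

print(search("car insurance"))  # "auto coverage pricing guide" should rank first
```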
The Gotchas That Will Screw You
Model updates break everything: When OpenAI ships a new model, its embeddings live in a different vector space, so you can't compare them against vectors from the old one - learned this the hard way when v3 launched and had to re-embed 2TB of data over a weekend. OpenAI's changelog tracks model updates but doesn't warn about breaking changes.
Vector storage gets expensive: Storing 3,072-dimensional vectors for millions of documents eats storage. Pinecone costs add up quick. Budget for vector database expenses, or use pgvector if you're already on PostgreSQL - there's a quick sketch of that below.
Vector databases like Pinecone, Weaviate, and Qdrant each have different strengths - Pinecone offers managed simplicity, Weaviate provides GraphQL flexibility, and Qdrant delivers open-source performance.
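If you go the pgvector route, the basic pattern looks roughly like this - table name, DSN, and column size are placeholders, and it assumes you've already run CREATE EXTENSION vector in the database.

```python
# Hedged pgvector sketch via psycopg2; all names and the DSN are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder connection string
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id bigserial PRIMARY KEY,
        content text,
        embedding vector(1536)  -- matches text-embedding-3-small
    )
""")
conn.commit()

def to_pgvector(vec: list[float]) -> str:
    # pgvector accepts a '[v1,v2,...]' literal cast to ::vector
    return "[" + ",".join(str(v) for v in vec) + "]"

def insert_doc(content: str, vec: list[float]) -> None:
    cur.execute(
        "INSERT INTO documents (content, embedding) VALUES (%s, %s::vector)",
        (content, to_pgvector(vec)),
    )
    conn.commit()

def nearest(vec: list[float], k: int = 5) -> list[str]:
    # <=> is pgvector's cosine-distance operator (lower = more similar)
    cur.execute(
        "SELECT content FROM documents ORDER BY embedding <=> %s::vector LIMIT %s",
        (to_pgvector(vec), k),
    )
    return [row[0] for row in cur.fetchall()]
```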
Cold start performance: First API call after idle time can take 2-3 seconds. Keep a warming script running if you need consistent response times. During incidents the API can add 3+ seconds with zero warning, so monitor their infrastructure status and your own latency numbers.
The math actually works. Semantically similar text ends up grouped together in vector space, which is why semantic search isn't just marketing bullshit - similarity calculations are reliable enough to drive search and recommendations, and measurably better at finding relevant content.
But which model should you actually use? And how do they stack up against the competition? Let me break down the real performance numbers.