I started testing Cohere's embed-v4.0 when it came out - I think it was like May? Anyway, after embedding way too many documents across various RAG projects, here's what you need to know.
The Context Window Changes Everything
The massive 128k token capacity isn't just a bigger number - it fundamentally changes how you build RAG systems. Before v4.0, I was spending half my time writing document chunking logic:
```python
# The old nightmare - sliding-window chunking that still loses
# important relationships across chunk boundaries
def chunk_document(doc, chunk_size=500, overlap=50):
    words = doc.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), step)]
```
Now? Embed the whole damn document:
```python
# The new reality - embed entire documents in one call
response = co.embed(
    texts=[entire_research_paper],  # 40k tokens? No problem
    model="embed-v4.0",
    input_type="search_document",
)
```
I tried this with some massive legal contract - I think it was like 40 pages? Maybe more, it was huge. With OpenAI embeddings, I had to chop it up into what, 15-20 chunks? And of course the important clauses got separated. With Cohere, one API call and done.
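To see concretely why chunking hurts, here's a toy illustration (hypothetical contract text, tiny chunk sizes so the split is easy to see - not the real document):

```python
# Toy illustration of chunk-boundary context loss
def chunk_document(doc, chunk_size=8, overlap=2):
    words = doc.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), step)]

contract = ("The licensee shall pay royalties quarterly except that "
            "no royalties are due if annual revenue stays below one million dollars")
chunks = chunk_document(contract)

# The obligation ("pay royalties quarterly") and its exception
# ("no royalties are due if...") land in different chunks, so neither
# chunk's embedding captures the full rule on its own.
print(chunks[0])
print(chunks[1])
```

Scale the window up to 500 words and the same failure mode just gets harder to spot.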
Multimodal: Not Just Marketing Fluff
The image + text embedding actually works. I threw a financial report with charts and tables at it, and it correctly understood relationships between the narrative sections and the visual data.
When someone searched for "revenue growth trends" in our document corpus, Cohere v4.0 returned the text discussing Q3 performance AND the chart showing the actual numbers - because it understood they were semantically related.
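Under the hood that retrieval step is just nearest-neighbor search over vectors. A minimal sketch with cosine similarity - the 4-d vectors below are stand-ins for illustration, not real Cohere output (real ones would be 1536-d floats from co.embed, with the query embedded as input_type="search_query"):

```python
import numpy as np

def top_k(query_vec, doc_vectors, k=2):
    # Cosine similarity: normalize both sides, then dot product
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(scores)[::-1][:k]

# Stand-in vectors: two revenue-related entries, one unrelated
docs = np.array([
    [0.9, 0.1, 0.0, 0.1],   # Q3 revenue narrative (text)
    [0.8, 0.2, 0.1, 0.0],   # revenue growth chart (image)
    [0.0, 0.1, 0.9, 0.2],   # HR policy section
])
query = np.array([0.85, 0.15, 0.05, 0.05])  # "revenue growth trends"
print(top_k(query, docs))  # the two revenue-related entries rank first
```

The point of the multimodal model is that the chart and the narrative land near each other in that vector space without you doing anything special.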
Performance Reality Check
What works great:
- Long documents (research papers, manuals, reports)
- Mixed content (PDFs with images, presentations)
- Multilingual content (tested with English/Spanish/French docs)
What's still painful:
- Cost: Text is roughly on par with OpenAI - around 12 cents per million tokens, so a 50k-token document runs maybe 0.6 cents vs 0.65 cents with OpenAI's large model. The real pain is multimodal, at something like 47 cents per million image tokens
- Speed: Multimodal embeddings crawl compared to text-only
- Rate limits: Hit them fast when processing large doc batches
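For budgeting, a back-of-envelope cost helper - the rates below are the ballpark figures quoted above, not official pricing, so check Cohere's pricing page before relying on this:

```python
# Ballpark per-token rates (dollars), from the figures above - not official
TEXT_RATE = 0.12 / 1_000_000
IMAGE_RATE = 0.47 / 1_000_000

def embed_cost(text_tokens, image_tokens=0):
    """Rough embedding cost in dollars for one document."""
    return text_tokens * TEXT_RATE + image_tokens * IMAGE_RATE

# A 50k-token contract: ~0.6 cents
print(round(embed_cost(50_000) * 100, 2))  # in cents
```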
The Gotchas You'll Hit
Dimension confusion that'll waste your day: The default 1536 dimensions work for most cases, but if you're migrating from another model, your similarity thresholds are completely fucked. Plan for total recalibration or you'll be debugging "why is search broken" for hours.
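One way to avoid eyeballing the new threshold: recalibrate it from a small labeled sample of known-similar and known-dissimilar pairs, re-scored with the new model. A crude sketch - the scores below are hypothetical, and midpoint-splitting is my shortcut, not anyone's official method:

```python
def recalibrate_threshold(pos_scores, neg_scores):
    # Midpoint between the worst matching pair and the best
    # non-matching pair - crude, but beats porting the old
    # model's threshold blindly
    return (min(pos_scores) + max(neg_scores)) / 2

# Hypothetical cosine scores from a labeled sample under the new model
pos = [0.62, 0.71, 0.58]   # pairs that should match
neg = [0.31, 0.44, 0.28]   # pairs that should not
print(recalibrate_threshold(pos, neg))  # 0.51
```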
Batch API weirdness: The batch API is great for large batches but has some timeout behavior that's barely documented. Start with small batches (100 docs) to test or you'll be staring at hanging requests wondering what the hell happened.
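A defensive pattern that helps here: a client-side batcher that takes the embed call as a function, so you can cap batch size (and later bolt on retries) without touching the rest of the pipeline. The embed_fn signature is my own convention, not part of Cohere's API:

```python
def embed_in_batches(texts, embed_fn, batch_size=100):
    # Send small fixed-size batches so one hanging request doesn't
    # take a 10k-document job down with it
    vectors = []
    for i in range(0, len(texts), batch_size):
        vectors.extend(embed_fn(texts[i:i + batch_size]))
    return vectors

# With a real client it might look something like:
#   embed_in_batches(docs, lambda batch: co.embed(
#       texts=batch, model="embed-v4.0",
#       input_type="search_document").embeddings)
```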
Token counting: Multimodal tokens are counted differently than text tokens. A PDF with images might consume 2x more tokens than you expect.
When It's Worth the Premium
I use Cohere v4.0 for:
- Legal document search (context preservation is crucial)
- Research paper analysis (need full paper context)
- Technical documentation with diagrams (multimodal helps)
I stick with OpenAI/Mistral for:
- FAQ systems (short docs, cost matters)
- Product catalogs (simple text, high volume)
- Chat applications (speed matters more than context)