I've been running Vertex AI embeddings in production for about 8 months. It's Google's text-to-vector API: you send it text, it sends back a dense vector of numbers that machines can compare, which is the foundation for semantic search, RAG, and most other "AI-powered" search features.
Three models exist as of September 2025: text-embedding-005 (the newest, released November 2024), the older text-embedding-004 (still available, with no retirement date announced), and the newer Gemini Embedding, which costs roughly 6x more but handles multilingual content better.
The Reality of Using This Thing
Authentication is a pain in the ass. Setting up service accounts and IAM permissions isn't straightforward if you're new to Google Cloud; it took me 3 hours the first time because the docs assume you already know GCP. You need the `aiplatform.user` role at minimum, and you have to set the GOOGLE_APPLICATION_CREDENTIALS environment variable or everything breaks with cryptic auth errors.
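For reference, here's roughly what the minimal working setup looks like once the service account exists. The key path, project ID, and region are placeholders for your own values; this assumes the `google-cloud-aiplatform` package is installed.

```python
# Minimal auth + client setup sketch (assumes `pip install google-cloud-aiplatform`
# and a service account key granted the aiplatform.user role).
import os

# Point the Google auth libraries at your service account key.
# The path is a placeholder -- use wherever you actually store the key.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account.json"

import vertexai
from vertexai.language_models import TextEmbeddingModel

# Project ID and region are placeholders for your own values.
vertexai.init(project="my-project-id", location="us-central1")

model = TextEmbeddingModel.from_pretrained("text-embedding-005")
embeddings = model.get_embeddings(["Does semantic search beat keywords?"])
print(len(embeddings[0].values))  # 768-dimensional vector by default
```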
Costs surprised me. The regular embedding models charge per character (about $0.025 per million characters), but Gemini Embedding charges per token ($0.15 per million tokens). Our chatbot bill jumped from $150 to $800 per month when I switched to Gemini without doing the math first. Run your workload through their count-tokens API before switching, or you'll get burned like I did.
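A back-of-envelope comparison would have saved me. Here's a sketch of that math using the prices above. The token counting borrows a Gemini model's count_tokens endpoint as an approximation -- the model name there is a placeholder, and the embedding model's actual tokenizer may differ slightly.

```python
# Back-of-envelope cost comparison sketch, using the prices quoted above.
from vertexai.generative_models import GenerativeModel

TEXT_EMBEDDING_PER_MILLION_CHARS = 0.025    # text-embedding-004/005
GEMINI_EMBEDDING_PER_MILLION_TOKENS = 0.15  # Gemini Embedding

def estimate_costs(texts: list[str]) -> tuple[float, float]:
    chars = sum(len(t) for t in texts)
    # Model name is a placeholder -- any Gemini model available in your
    # region works for approximate token counting.
    counter = GenerativeModel("gemini-2.0-flash")
    tokens = sum(counter.count_tokens(t).total_tokens for t in texts)
    char_cost = chars / 1_000_000 * TEXT_EMBEDDING_PER_MILLION_CHARS
    token_cost = tokens / 1_000_000 * GEMINI_EMBEDDING_PER_MILLION_TOKENS
    return char_cost, token_cost
```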
Both text-embedding models are still supported - no retirement dates announced. But I'd still recommend text-embedding-005 over 004 for new projects. I spent 2 weeks testing both against our existing corpus, and 005 is slightly better at understanding code and technical docs.
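If you want to run the same comparison, the sketch below embeds a query/document pair with each model and compares cosine similarity. It's a toy version of what I did - a real evaluation needs a labeled set of queries and their expected documents.

```python
# Toy side-by-side comparison of 004 vs 005 on one query/document pair.
import numpy as np
from vertexai.language_models import TextEmbeddingModel

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = "how do I paginate the list endpoint?"
doc = "Use the pageToken parameter to fetch the next page of results."

for name in ("text-embedding-004", "text-embedding-005"):
    model = TextEmbeddingModel.from_pretrained(name)
    q, d = model.get_embeddings([query, doc])
    print(name, cosine(q.values, d.values))
```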
When This Actually Works Well
RAG systems: We use it for document search in our support system. Works way better than keyword search - customers find answers in 30 seconds instead of 5 minutes. The semantic understanding actually gets what they're asking about.
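The core of that document search is embarrassingly simple: embed the docs once, embed the query at request time, rank by cosine similarity. A stripped-down sketch (in-memory numpy index here; our production version sits behind a real vector store):

```python
# Stripped-down semantic search over support docs. Production uses a real
# vector database; this in-memory version just shows the shape of the idea.
import numpy as np
from vertexai.language_models import TextEmbeddingModel

model = TextEmbeddingModel.from_pretrained("text-embedding-005")

docs = [
    "How to reset your password",
    "Troubleshooting failed payments",
    "Exporting your data as CSV",
]
# Embed the corpus once and L2-normalize so dot product == cosine similarity.
doc_vecs = np.array([e.values for e in model.get_embeddings(docs)])
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)

def search(query: str, top_k: int = 2) -> list[str]:
    q = np.array(model.get_embeddings([query])[0].values)
    q /= np.linalg.norm(q)
    scores = doc_vecs @ q
    return [docs[i] for i in np.argsort(scores)[::-1][:top_k]]

print(search("I can't log in"))  # password-reset doc should rank first
```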
Code documentation: text-embedding-005 understands programming concepts better than the old model. Our internal wiki search for API docs went from "mostly useless" to "actually helpful."
Multilingual stuff: If you have content in multiple languages, Gemini Embedding handles that without translating everything first. Costs more but saves the translation step.
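The appeal there is that a query in one language retrieves documents in another with no translation pass. A sketch, assuming the Gemini embedding model (gemini-embedding-001 here - check the exact model ID available in your region) is exposed through the same TextEmbeddingModel interface:

```python
# Cross-lingual retrieval sketch with Gemini Embedding. The model ID and
# one-text-at-a-time usage are assumptions -- verify against current docs.
import numpy as np
from vertexai.language_models import TextEmbeddingModel

model = TextEmbeddingModel.from_pretrained("gemini-embedding-001")

def embed(text: str) -> np.ndarray:
    v = np.array(model.get_embeddings([text])[0].values)
    return v / np.linalg.norm(v)

# English query vs. a German document -- no translation step needed.
query = embed("refund policy for cancelled orders")
doc = embed("Rückerstattungsrichtlinie für stornierte Bestellungen")
print(float(query @ doc))  # high similarity despite the language gap
```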
What Doesn't Work
Rate limits hit hard. The default is 600 requests per minute. I hit that constantly during our data migration and had to implement exponential backoff. It takes 2-3 minutes to process 10K documents if you're not batching.
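The backoff wrapper I ended up with looks roughly like this. The 250-texts-per-request batch size is what worked for us; treat it as an assumption and check your model's per-request input limit before copying it.

```python
# Exponential-backoff batching sketch for bulk embedding jobs.
import time
from google.api_core.exceptions import ResourceExhausted
from vertexai.language_models import TextEmbeddingModel

BATCH_SIZE = 250   # worked for us; verify your model's per-request limit
MAX_RETRIES = 6

def embed_all(texts: list[str], model_name: str = "text-embedding-005"):
    model = TextEmbeddingModel.from_pretrained(model_name)
    vectors = []
    for start in range(0, len(texts), BATCH_SIZE):
        batch = texts[start:start + BATCH_SIZE]
        for attempt in range(MAX_RETRIES):
            try:
                vectors.extend(e.values for e in model.get_embeddings(batch))
                break
            except ResourceExhausted:      # 429: over the per-minute quota
                time.sleep(2 ** attempt)   # back off 1s, 2s, 4s, ...
        else:
            raise RuntimeError(f"batch at offset {start} kept hitting the rate limit")
    return vectors
```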
The 2048-token input limit bites. Long documents get silently truncated - no warning, no error. You have to chunk everything yourself; I use 1024-token chunks with 20% overlap. Pain in the ass, but necessary.
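My chunker is nothing fancy. The sketch below approximates tokens with whitespace-split words, which is close enough for sizing chunks; for exact sizing you'd count real tokens instead.

```python
# Word-based chunking sketch: approximates the 1024-token / 20%-overlap
# scheme above using whitespace tokens instead of real tokenizer output.
def chunk(text: str, size: int = 1024, overlap: float = 0.2) -> list[str]:
    words = text.split()
    step = int(size * (1 - overlap))  # advance 80% of a chunk each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks
```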
Regional availability varies. Some models aren't available everywhere. Check before you commit to a region because migrating later sucks.
With that overview of what actually works and what doesn't, let's compare the different embedding models side-by-side so you can make an informed choice.