What It Actually Is and Why You'd Use It

RAG Architecture Diagram

I've been running Vertex AI embeddings in production for about 8 months. It's Google's text-to-vector API: it converts your text into dense numeric vectors that machines can use for semantic search, RAG, and similar retrieval work.

Three models exist as of September 2025: text-embedding-005 (the newest one, released November 2024), the older text-embedding-004 (still available with no retirement date announced), and the new Gemini Embedding that costs about 6x more but handles multilingual content better.

The Reality of Using This Thing

Authentication is a pain in the ass. Setting up service accounts and IAM permissions isn't straightforward if you're new to Google Cloud. Took me 3 hours the first time because the docs assume you already know GCP. You need `aiplatform.user` role at minimum and have to set that GOOGLE_APPLICATION_CREDENTIALS environment variable or everything breaks with cryptic auth errors.

Costs surprised me. The regular embedding models charge per character (about $0.025 per million characters), but Gemini Embedding charges per token ($0.15 per million tokens). Our chatbot bill jumped from $150 to $800 per month when I switched to Gemini without doing the math. Use their count-tokens API for Gemini or you'll get fucked.
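To avoid the same surprise, sketch the math before switching. A minimal back-of-envelope comparison, assuming the prices above and roughly 4 characters per token for English (the function and constants here are my own illustration, not anything from Google's SDK):

```python
# Rough monthly cost comparison: per-character pricing (text-embedding-005)
# vs per-token pricing (Gemini Embedding). ~4 chars/token is an English-only
# rule of thumb; other languages and emoji skew much higher.
CHAR_PRICE = 0.025 / 1_000_000   # text-embedding-005: dollars per character
TOKEN_PRICE = 0.15 / 1_000_000   # Gemini Embedding: dollars per token

def monthly_cost(total_chars: int, chars_per_token: float = 4.0):
    """Return (text-embedding-005 cost, Gemini Embedding cost) in dollars."""
    embedding_005 = total_chars * CHAR_PRICE
    gemini = (total_chars / chars_per_token) * TOKEN_PRICE
    return embedding_005, gemini

# Example: 100M characters of English per month.
e5_cost, gemini_cost = monthly_cost(100_000_000)
```

Note the effective ratio depends on your characters-per-token: token-dense content (emoji, CJK text, code) pushes the Gemini multiplier up fast, which is why the count-tokens API matters.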

Both text-embedding models are still supported - no retirement dates announced. But I'd still recommend text-embedding-005 over 004 for new projects. I spent 2 weeks testing both against our existing vectors and 005 is slightly better at understanding code and technical docs.

When This Actually Works Well

RAG systems: We use it for document search in our support system. Works way better than keyword search - customers find answers in 30 seconds instead of 5 minutes. The semantic understanding actually gets what they're asking about.

Code documentation: text-embedding-005 understands programming concepts better than the old model. Our internal wiki search for API docs went from "mostly useless" to "actually helpful."

Multilingual stuff: If you have content in multiple languages, Gemini Embedding handles that without translating everything first. Costs more but saves the translation step.

What Doesn't Work

Rate limits hit hard. Default is 600 requests per minute. Hit that constantly during our data migration and had to implement exponential backoff. Takes 2-3 minutes to process 10K documents if you're not batching.

2048 token limit. Long documents get truncated without warning. You have to chunk everything yourself - I use 1024 tokens with 20% overlap. Pain in the ass but necessary.
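A minimal chunking sketch, assuming you already have a token list from whatever tokenizer you use (`chunk_tokens` is my own helper, not an SDK function):

```python
def chunk_tokens(tokens, chunk_size=1024, overlap_ratio=0.2):
    """Split a token list into overlapping chunks.

    With the defaults, each chunk is 1024 tokens and consecutive chunks
    share ~20% of their tokens, so context at chunk boundaries isn't lost.
    """
    step = int(chunk_size * (1 - overlap_ratio))  # advance 819 tokens per chunk
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last chunk already covers the tail
    return chunks
```

Embed each chunk separately and keep the chunk boundaries in metadata so search hits can point back to the right part of the document.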

Regional availability varies. Some models aren't available everywhere. Check before you commit to a region because migrating later sucks.

With that overview of what actually works and what doesn't, let's compare the different embedding models side-by-side so you can make an informed choice.

Model Comparison (The Real Differences)

| Model | Status | Price | What It's Actually Good For |
|---|---|---|---|
| text-embedding-005 | Newest (Nov 2024) | $0.025 per 1M characters | English content, code docs, general use |
| text-embedding-004 | Stable, no retirement date | $0.025 per 1M characters | Still works fine, but use 005 for new projects |
| Gemini Embedding | GA July 2025 | $0.15 per 1M tokens | Multilingual content; ~6x more expensive |

The Shit That Breaks in Production

After 8 months running this in production, here's what actually goes wrong and how to fix it.

Authentication Hell

Google's service account system is needlessly complex. You'll spend half a day just getting the credentials right.

What actually works:

  1. Enable the Vertex AI API (this takes 2-3 minutes, don't refresh the page)
  2. Create a service account, download the JSON key
  3. Set `export GOOGLE_APPLICATION_CREDENTIALS="/path/to/key.json"`
  4. Give it `aiplatform.user` role - don't use custom roles, they're broken
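The same steps as gcloud commands - the project name, service account name, and key path below are placeholders for your own:

```shell
# 1. Enable the Vertex AI API (takes 2-3 minutes to propagate)
gcloud services enable aiplatform.googleapis.com --project=my-project

# 2. Create a service account and download a JSON key
gcloud iam service-accounts create embeddings-sa --project=my-project
gcloud iam service-accounts keys create key.json \
    --iam-account=embeddings-sa@my-project.iam.gserviceaccount.com

# 3. Point the client libraries at the key
export GOOGLE_APPLICATION_CREDENTIALS="$PWD/key.json"

# 4. Grant the predefined role (not a custom one)
gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:embeddings-sa@my-project.iam.gserviceaccount.com" \
    --role="roles/aiplatform.user"
```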

Cost Management (Because It Gets Expensive Fast)

Our embedding costs went from $200 to $2,000 per month when usage spiked. Token counting is weird - it's not 1:1 with characters.

Things that cost way more than expected:

  • Repeated API calls for the same text (cache everything - reduced my API calls by 60%)
  • Processing without chunking (you pay for characters even if they get truncated at 2048 tokens)
  • Using Gemini Embedding for English-only content (6x price increase for marginal quality improvements)
  • Processing large PDFs without checking character count first (some research papers hit 200K+ characters)

Batch processing saves 20% but requires uploading to Cloud Storage and waiting 30-60 minutes. Good for one-time migrations, useless for real-time apps.

Use `pip install google-cloud-aiplatform` and call `count_tokens()` first. Don't guess - I learned this the hard way when our bill hit $3,500 one month.

Rate Limiting That Actually Matters

600 requests/minute sounds like a lot until you try to process 50K documents. Hit rate limits constantly during data migrations.

Exponential backoff or you're fucked:

import time
import random

def embed_with_retry(text, max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.predict(text)  # your embedding call
        except Exception as e:
            if "quota" in str(e).lower():
                # exponential backoff with jitter: 1s, 2s, 4s, 8s, 16s (+/- random)
                wait = (2 ** attempt) + random.uniform(0, 1)
                time.sleep(wait)
            else:
                raise
    raise RuntimeError(f"still hitting quota after {max_retries} retries")

Vector Storage Reality

Vertex AI Vector Search costs $0.32/hour per node minimum, even if you're not using it. For small applications, that's $230/month just to keep it running. Pinecone starts at $70/month and scales better.

BigQuery ML works if you're already using BigQuery, but query performance is shit for real-time search. Good for batch analytics, terrible for user-facing apps.

I use Weaviate deployed on GKE. More setup work but better control over costs and performance.

Migration Pain Points

text-embedding-004 to text-embedding-005 migration took 3 weeks (even though it's optional now):

  1. Week 1: Testing similarity scores between models (they're different enough to matter)
  2. Week 2: Reprocessing 2 million embeddings - took about 40 hours of API calls
  3. Week 3: Fixing search relevance that broke because of vector changes

Save the old model version in metadata or you'll have no idea what's broken when similarity scores change. I learned this the hard way when users started complaining that search results were "worse" after the migration.
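A minimal sketch of what "save the model version in metadata" looks like in practice - `doc_store` stands in for whatever database you keep vectors in:

```python
# Store the producing model next to every vector so that, after a migration,
# you can tell which embeddings are stale instead of guessing why similarity
# scores shifted.
EMBED_MODEL = "text-embedding-005"

def save_embedding(doc_store, doc_id, vector):
    """Persist a vector together with the model that produced it."""
    doc_store[doc_id] = {
        "vector": vector,
        "model": EMBED_MODEL,
    }

def needs_reembedding(doc_store, doc_id):
    """True if this document's vector came from a different model version."""
    return doc_store[doc_id]["model"] != EMBED_MODEL
```

During a migration you can then filter on `needs_reembedding` instead of reprocessing (and paying for) everything blindly.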

Debugging Production Issues

Most common failures:

  • `413 Request Entity Too Large` - text over 2,048 tokens; chunk it first
  • `503 Service Unavailable: The service is currently unavailable` - regional quota exhausted; retry in a different region or wait 5-10 minutes
  • `403 The caller does not have permission` - service account key expired (they expire after 10 years by default, but admins can set shorter expiry)
  • `400 Invalid argument: Location us-east1 is not supported for model text-embedding-005` - some models only work in specific regions
  • `429 Rate limit exceeded` followed by `Retry after 60 seconds` - you're hitting the 600 req/min limit; implement backoff

Set up billing alerts at $500, $1000, $2000. This API can get expensive fast and Google won't stop you from burning money.

The API works well once you get past the setup bullshit, but plan for 2-3 weeks of debugging and optimization before it's production-ready.

Speaking of common problems, here are the questions I get asked most often by other engineers who are trying to implement this stuff.

Questions People Actually Ask

Q: Why did my embedding bill jump from $50 to $500?

A: Token counting isn't intuitive. It's roughly 4 characters per token for English, but varies wildly for other languages and special characters. I got burned processing emoji-heavy social media posts - they consume way more tokens than expected.

Check your token usage first: Call the count-tokens API on a sample before processing your entire dataset. Our chatbot processes 100K customer messages monthly and uses about 8 million tokens ($800-1200/month).

Batch processing saves 20% if you can wait 30-60 minutes for results. Good for one-time data processing, useless for real-time apps.
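One way to do that sampling before committing to a full run - `count_tokens_fn` is a stand-in for your wrapper around the count-tokens API:

```python
import random

def estimate_total_tokens(docs, count_tokens_fn, sample_size=100):
    """Estimate dataset-wide token usage from a random sample.

    Counting tokens for a 100-doc sample is cheap; extrapolating the
    average to the whole corpus catches billing surprises before they
    happen. Accuracy depends on how uniform your documents are.
    """
    sample = random.sample(docs, min(sample_size, len(docs)))
    avg_tokens = sum(count_tokens_fn(d) for d in sample) / len(sample)
    return int(avg_tokens * len(docs))
```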

Q: Do I have to migrate from text-embedding-004?

A: No, both models are still supported with no retirement date announced. But I'd recommend text-embedding-005 for new projects.

What I learned: The new model (text-embedding-005) gives slightly different vectors for the same text. If you do migrate existing systems, search relevance can change enough that you need to retune similarity thresholds. Budget 2-3 weeks to test thoroughly because "close enough" vectors can break user experience.

Q: Is Gemini Embedding worth 6x more cost?

A: Only if you need multilingual support or the accuracy improvement matters for your use case.

For English-only applications, text-embedding-005 works fine and costs way less. I tested both on our support docs - Gemini was maybe 3% better at finding relevant articles, not worth $1,500/month extra for us.

Q: Can I use this with Pinecone instead of Google's vector database?

A: Yeah, the embeddings are standard 768-dimension vectors that work with any vector database.

I use Weaviate because Vertex AI Vector Search costs $230/month minimum even if you're not using it. Pinecone starts at $70/month and scales better for small applications.

Q: What happens when I hit the 2,048 token limit?

A: Your text gets silently truncated. No error, no warning. You just lose the end of your document.

Chunk everything longer than 1,500 tokens. I use 1024-token chunks with 20% overlap. Yeah, it's extra work, but better than losing context. Process chunks separately then average the embeddings or pick the most relevant one.
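Averaging chunk embeddings is a one-liner worth getting right - this sketch assumes every chunk comes back with the same dimensionality (768 for these models):

```python
def average_embedding(chunk_vectors):
    """Combine per-chunk embeddings into one document-level vector.

    Simple mean across chunks; cheap and works fine for coarse document
    retrieval. For precise search, index chunks individually instead.
    """
    dims = len(chunk_vectors[0])
    n = len(chunk_vectors)
    return [sum(vec[i] for vec in chunk_vectors) / n for i in range(dims)]
```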

Q: How do I handle "quota exceeded" errors?

A: Actual error you'll see: `429 Resource has been exhausted (e.g. check quota)`. This can happen even when you're nowhere near 600 requests/minute because there are separate quotas for tokens per minute and characters per minute.

Implement exponential backoff with jitter or you'll keep hitting the same limits. Default is 600 requests/minute, which sounds like a lot until you try processing 10K documents.

import time
import random

def retry_with_backoff(func, max_retries=5):
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            if "quota" in str(e).lower():
                # exponential backoff with jitter, capped at 5 minutes
                wait = min(300, (2 ** attempt) + random.uniform(0, 1))
                time.sleep(wait)
            else:
                raise
    raise RuntimeError(f"quota errors persisted after {max_retries} retries")

Q: Should I just use OpenAI embeddings instead?

A: Depends on your existing stack. If you're already on Google Cloud, Vertex AI integrates better with BigQuery and other GCP services.

OpenAI's text-embedding-3-small costs $0.02 per million tokens - roughly 5x cheaper than text-embedding-005 once you convert its per-character price to tokens - with decent quality. I'd go with OpenAI unless you need the tight GCP integration or enterprise compliance features.

Q: Can I cache embeddings to save money?

A: Absolutely. Store embeddings in Redis or your database. I cache frequent search queries and common document embeddings - reduced API calls by 60%.

Just remember to invalidate cached embeddings when you change models or your search results will be inconsistent.
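A minimal cache sketch - in production you'd back this with Redis, and `embed_fn` stands in for your actual API call. Keying on model + text means switching models invalidates old entries automatically:

```python
import hashlib

_cache = {}  # swap for Redis (or your DB) in production

def cached_embed(text, embed_fn, model="text-embedding-005"):
    """Return a cached embedding; only call the API (embed_fn) on a miss.

    The cache key includes the model name, so changing models never serves
    stale vectors from the old model.
    """
    key = hashlib.sha256(f"{model}:{text}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = embed_fn(text)
    return _cache[key]
```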

Based on all these gotchas and trade-offs, here are my practical recommendations for different use cases and migration scenarios.

What Actually Works for Common Use Cases

| Use Case | What I'd Use | Why |
|---|---|---|
| English docs/search | text-embedding-005 | Good quality, cheaper than Gemini |
| Multilingual app | Gemini Embedding | Only option that handles multiple languages well |
| High-volume chatbot | OpenAI text-embedding-3-small | 5x cheaper, quality difference doesn't matter |
| Code documentation | text-embedding-005 | Better at understanding technical terms |
