AutoML - Good Until It Isn't
Vertex AI AutoML works great for the happy path. Upload clean data, get a model, deploy it, done. The problems start when your data is messy (it always is) or when you need to understand why your model decided that a cat is a dog.
AutoML handles tabular data, images, video, and text. For tabular stuff, it'll automatically try different algorithms and hyperparameters. The 100GB dataset limit sounds generous until you realize that's post-processing - your original data needs to fit in memory first, which often means it doesn't.
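To be fair, the happy path really is short. Here's roughly what an AutoML tabular run looks like with the Python SDK - a minimal sketch where the project, bucket, CSV, and target column are all placeholders you'd swap for your own:

```python
from google.cloud import aiplatform

# Placeholders: your project, region, staging bucket, and data.
aiplatform.init(project="my-project", location="us-central1",
                staging_bucket="gs://my-bucket")

# Managed tabular dataset backed by a CSV in Cloud Storage.
dataset = aiplatform.TabularDataset.create(
    display_name="churn-data",
    gcs_source=["gs://my-bucket/churn.csv"],
)

# AutoML picks the architectures and hyperparameters; you pick the budget.
job = aiplatform.AutoMLTabularTrainingJob(
    display_name="churn-automl",
    optimization_prediction_type="classification",
)
model = job.run(
    dataset=dataset,
    target_column="churned",
    budget_milli_node_hours=1000,  # 1 node hour - this is where the bill starts
)
```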
Reality check: AutoML works for about 70% of use cases. The other 30% require digging into custom training because AutoML made decisions you can't understand or fix. When AutoML fails, the error messages are gems like:
- INVALID_ARGUMENT: Dataset contains invalid data - no details about which data or why
- FAILED_PRECONDITION: Training could not start - after waiting 45 minutes for it to begin
- Resource exhausted - because your 50MB dataset somehow needs 16GB RAM to process
- Model export failed - after 6 hours of training completed successfully
The known issues page exists because these problems are so common. Pro tip: when AutoML says "Try again" for the third time, start looking at custom training options. Check Stack Overflow for solutions to the cryptic error messages - community answers are better than Google's official troubleshooting.
Custom Training - Where the Real Work Happens
Custom training is where you'll spend most of your time if you're doing anything non-trivial. Vertex AI supports TensorFlow, PyTorch, scikit-learn, and XGBoost in pre-built containers, or you can bring your own Docker container and pray it works.
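Submitting a job is the easy part. A hedged sketch with the Python SDK, assuming you've already pushed an image to Artifact Registry (the URIs and machine shape below are made up):

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1",
                staging_bucket="gs://my-bucket")

# Your own training image, built and pushed beforehand.
job = aiplatform.CustomContainerTrainingJob(
    display_name="pytorch-trainer",
    container_uri="us-docker.pkg.dev/my-project/trainers/pytorch:latest",
)

# Runs the managed job and blocks until it finishes (or fails four hours in).
job.run(
    args=["--epochs", "10"],
    replica_count=1,
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
)
```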
Distributed Training Pain: Google promises automatic distributed training, but configuring multi-node jobs properly takes expertise. TPUs are blazing fast for certain workloads but debugging TPU code is like debugging with one eye closed. When a job fails 4 hours into a 6-hour training run, you'll question your career choices. The interactive debugging shell helps, but only after you've already wasted time and money.
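For a sense of where the configuration effort goes, here's a rough sketch of a two-pool CustomJob (one chief, three workers); the machine types, image, and counts are illustrative, and your training code still has to handle TF_CONFIG / worker ranks on its own:

```python
from google.cloud import aiplatform

TRAIN_IMAGE = "us-docker.pkg.dev/my-project/trainers/tf:latest"  # placeholder

# Pool 0 is the chief, pool 1 holds the workers; both run the same image.
worker_pool_specs = [
    {
        "machine_spec": {"machine_type": "n1-standard-8",
                         "accelerator_type": "NVIDIA_TESLA_V100",
                         "accelerator_count": 1},
        "replica_count": 1,
        "container_spec": {"image_uri": TRAIN_IMAGE},
    },
    {
        "machine_spec": {"machine_type": "n1-standard-8",
                         "accelerator_type": "NVIDIA_TESLA_V100",
                         "accelerator_count": 1},
        "replica_count": 3,
        "container_spec": {"image_uri": TRAIN_IMAGE},
    },
]

job = aiplatform.CustomJob(
    display_name="distributed-training",
    worker_pool_specs=worker_pool_specs,
    staging_bucket="gs://my-bucket",
)
job.run()
```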
Hyperparameter Tuning Waste: Vertex AI Vizier runs Bayesian optimization to find optimal hyperparameters. In practice, you'll burn through hundreds of dollars in compute before finding parameters that are maybe 2% better than your initial guess. The service is smart, but hyperparameter tuning is inherently expensive. Check pricing examples before starting large tuning jobs.
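If you run Vizier anyway, at least keep the cost lever in view: max_trial_count times the per-trial machine cost is the bill. A sketch, with stand-in metric names and parameter ranges:

```python
from google.cloud import aiplatform
from google.cloud.aiplatform import hyperparameter_tuning as hpt

# Each trial runs this job; the container must report "val_auc"
# (e.g. via the cloudml-hypertune package).
trial_job = aiplatform.CustomJob(
    display_name="tuning-trial",
    worker_pool_specs=[{
        "machine_spec": {"machine_type": "n1-standard-8"},
        "replica_count": 1,
        "container_spec": {"image_uri": "us-docker.pkg.dev/my-project/trainers/xgb:latest"},
    }],
    staging_bucket="gs://my-bucket",
)

tuning_job = aiplatform.HyperparameterTuningJob(
    display_name="vizier-tuning",
    custom_job=trial_job,
    metric_spec={"val_auc": "maximize"},
    parameter_spec={
        "learning_rate": hpt.DoubleParameterSpec(min=1e-4, max=1e-1, scale="log"),
        "max_depth": hpt.IntegerParameterSpec(min=3, max=10, scale="linear"),
    },
    max_trial_count=20,       # 20 full training runs, all billed
    parallel_trial_count=4,
)
tuning_job.run()
```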
Container Hell: Custom containers work great when they work. When they don't, you get cryptic errors like OSError: /lib/x86_64-linux-gnu/libz.so.1: version ZLIB_1.2.9 not found, because Vertex AI's base images ship different system library versions (glibc, zlib, and friends) than your Docker Desktop setup.
Common container failures that will ruin your day (a preflight sketch follows the list):
- ImportError: /usr/local/lib/python3.9/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so: undefined symbol: _ZN10tensorflow8OpKernel11TraceStringEPNS_15OpKernelContextEb - CUDA/cuDNN version mismatch
- ModuleNotFoundError: No module named 'google.cloud' - forgot to install the Vertex AI SDK in your container
- Permission denied: '/tmp/model' - container user permissions are fucked; add USER root or fix ownership
- CUDA out of memory at training start - your container works locally with CPU but explodes on GPU instances
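A cheap way to catch the import, permission, and GPU classes of failure before you pay for a training run is a preflight script executed inside the exact image you're about to submit. This isn't anything Vertex AI gives you - just a sketch of the idea:

```python
# preflight.py - run inside your training image before submitting:
#   docker run --rm --gpus all my-trainer:latest python preflight.py
import importlib
import os
import sys

# Catches the ModuleNotFoundError class of failure.
for module in ("torch", "google.cloud.aiplatform"):
    try:
        importlib.import_module(module)
    except ImportError as exc:
        sys.exit(f"missing dependency: {module} ({exc})")

# Only partially catches the "works on CPU locally, explodes on GPU instances"
# class of failure - but at least tells you whether CUDA is even visible.
import torch
if not torch.cuda.is_available():
    print("WARNING: no visible GPU; CUDA problems won't show up until Vertex AI")

# Catches the permission-denied class of failure before the job does.
model_dir = os.environ.get("AIP_MODEL_DIR", "/tmp/model")
try:
    os.makedirs(model_dir, exist_ok=True)
except PermissionError:
    sys.exit(f"cannot write to {model_dir}; fix the container user or ownership")

print("preflight OK")
```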
Pro tip: Docker containers that work on your laptop will mysteriously fail in Vertex AI because of glibc version conflicts, CUDA driver mismatches, or because Google's container runtime hates your Dockerfile. Test everything in Cloud Build first, not your local machine. The container requirements docs are essential reading, but they won't save you from spending a weekend debugging why pip install torch works locally but times out in Vertex AI.
Model Garden - Marketing vs Reality
The Model Garden is Google's AI model marketplace. It has Gemini variants, some open-source models, and a handful of third-party options. The selection is decent but not as comprehensive as promised.
Current Model Reality: As of September 2025, you get the Gemini 2.5 Pro and Flash models, plus the older Gemini 2.0 Flash. Pricing starts at $0.15 per million input tokens for 2.0 Flash, going up to $1.25-$10+ for Pro models depending on context length.
Fine-tuning Reality: Model customization through LoRA and prompt tuning sounds simple. In practice, getting good results requires domain expertise, quality training data, and multiple iterations. The "100 examples" marketing claim works for toy problems, not production use cases.
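The API surface itself is small - the pain lives entirely in the JSONL behind train_dataset. A hedged sketch of a supervised tuning run with the Python SDK (check vertexai.tuning.sft and the base-model name against your SDK version and the currently supported models):

```python
import time

import vertexai
from vertexai.tuning import sft

vertexai.init(project="my-project", location="us-central1")

# train_dataset is a JSONL of prompt/response pairs in Cloud Storage;
# its quality matters far more than any of the knobs below.
tuning_job = sft.train(
    source_model="gemini-2.0-flash-001",            # assumed; verify availability
    train_dataset="gs://my-bucket/tuning/train.jsonl",
    validation_dataset="gs://my-bucket/tuning/val.jsonl",
    epochs=3,
)

# Tuning runs asynchronously; you're billed while you wait.
while not tuning_job.has_ended:
    time.sleep(60)
    tuning_job.refresh()

print(tuning_job.tuned_model_endpoint_name)
```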
Production Deployment - When Reality Hits
Deploying models to production endpoints mostly works, but cost management will keep you awake at night. Google's auto-scaling is aggressive - it'll spin up resources "just in case" and bill you for them.
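One concrete mitigation is to set explicit replica bounds at deploy time so auto-scaling has a ceiling. A sketch with the Python SDK; the model ID, machine type, and limits are placeholders:

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model(
    "projects/my-project/locations/us-central1/models/1234567890"
)

# min_replica_count keeps cold starts rare, max_replica_count caps the bill.
endpoint = model.deploy(
    deployed_model_display_name="churn-v1",
    machine_type="n1-standard-4",
    min_replica_count=1,
    max_replica_count=3,
)
print(endpoint.resource_name)
```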
Serverless Inference Gotchas: Serverless sounds great until you realize cold starts can take 30+ seconds for large models. Your first request after downtime will time out, guaranteed. The claim that response times "typically range from 100-500ms" assumes your model is small and your data is simple.
Multi-Model Madness: A/B testing with multiple model versions on one endpoint works until you need to debug which version is causing errors. Traffic routing is black-box magic - when it goes wrong, good luck figuring out why 15% of your requests are hitting the wrong model.
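The endpoint will at least tell you the current split and which deployed model served a given request - a sketch, with a made-up endpoint ID and instance payload:

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

endpoint = aiplatform.Endpoint(
    "projects/my-project/locations/us-central1/endpoints/987654321"
)

# traffic_split maps deployed_model_id -> percentage of requests.
print(endpoint.traffic_split)

# Map those IDs back to something human-readable.
for deployed in endpoint.list_models():
    print(deployed.id, deployed.display_name, deployed.model)

# Each prediction response reports which deployed model actually served it.
resp = endpoint.predict(instances=[{"feature": 1.0}])
print(resp.deployed_model_id, resp.predictions)
```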
MLOps - Promise vs Pain
Vertex Pipelines is built on Kubeflow, which means you're essentially managing Kubernetes workflows. If you love YAML and enjoy reading Kubernetes logs, you'll be at home. If not, prepare for suffering.
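For orientation, the smallest possible pipeline with the KFP v2 SDK plus a Vertex submission looks roughly like this; the component body, bucket, and project are placeholders:

```python
from kfp import compiler, dsl
from google.cloud import aiplatform


@dsl.component(base_image="python:3.10")
def validate_data(rows: int) -> int:
    # Placeholder step: fail loudly here instead of at 3 AM downstream.
    if rows <= 0:
        raise ValueError("empty dataset")
    return rows


@dsl.pipeline(name="retraining-pipeline")
def retraining_pipeline(rows: int = 1000):
    validate_data(rows=rows)


# Compile to a job spec, then hand it to Vertex Pipelines to run.
compiler.Compiler().compile(
    pipeline_func=retraining_pipeline, package_path="pipeline.json"
)

aiplatform.init(project="my-project", location="us-central1")
job = aiplatform.PipelineJob(
    display_name="retraining-pipeline",
    template_path="pipeline.json",
    pipeline_root="gs://my-bucket/pipeline-root",
)
job.run()
```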
Pipeline Failures: Automated retraining sounds amazing until your pipeline fails at 3 AM because your data source changed format. The monitoring alerts work, but debugging why step 15 of 20 failed requires clicking through a web UI that feels like navigating a maze.
Experiment Tracking: ML Metadata captures everything automatically, which is great until you need to find that one experiment from three weeks ago. The search functionality is terrible, and the UI makes you miss spreadsheets.
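Logging is the easy half. A minimal sketch of the tracking calls, with made-up experiment and run names; pulling everything into a DataFrame and filtering it yourself tends to beat the console search:

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1",
                experiment="churn-experiments")

aiplatform.start_run("baseline-2025-09-15")
aiplatform.log_params({"learning_rate": 0.01, "max_depth": 6})
aiplatform.log_metrics({"val_auc": 0.87, "val_loss": 0.31})
aiplatform.end_run()

# Dump every run in the experiment to a pandas DataFrame and grep it locally.
df = aiplatform.get_experiment_df("churn-experiments")
print(df.head())
```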
Monitoring Reality: Data drift detection is useful when it works. False positives are common - your monitoring will alert you that your model is degrading when actually your data pipeline started including weekend data. Expect to tune alerts for weeks before they're useful.
Cost Horror Stories
Here's what nobody tells you: Vertex AI bills can spiral quickly. That $200 training job for a medium-sized model? Add storage costs ($50/month for datasets), network egress ($75 for moving training data), and the 3 failed attempts before it worked ($600 more). Real cost: $925 instead of the planned $200.
Real production examples that hurt:
- Image classification AutoML: Google's calculator said $150. Final bill: $890 after data preprocessing, failed runs, and storage
- Multi-node training job that failed at 90%: $1,200 down the drain with zero usable output - failed jobs still cost full compute time
- Hyperparameter tuning for 3 days: $2,400 to find parameters 2% better than the first guess
- Model serving with auto-scaling enabled: $3,800/month when a Reddit post drove traffic spike, scaled to 50 instances for 2 hours
- Fine-tuning Gemini Pro for a week: $4,200 because they bill for both base model inference AND fine-tuning compute
The pricing calculator lies because it doesn't account for:
- Storage accumulating at $0.023/GB/month (datasets + model artifacts + logs = $200+/month quickly)
- Network egress at $0.12/GB (moving 1TB of training data = $120 you didn't expect)
- Failed job costs (billed for full compute time even when job crashes)
- Auto-scaling overshoot (scales up fast, scales down slow, you pay for the whole time)
- Debugging time (every CUDA out of memory error costs $50+ while you troubleshoot)
Pro tip: Set up billing alerts at 50% and 80% of your budget, not just 100%. Teams regularly get hit with $5,000+ surprise bills when a training pipeline goes haywire overnight. That $300 free credit? Gone in a week if you're actually using the platform for real work.
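Budgets and their alert thresholds can be scripted instead of clicked together. A sketch using the google-cloud-billing-budgets client; the billing account, project number, and amount are placeholders, and the client surface is worth checking against the current library docs:

```python
from google.cloud.billing import budgets_v1
from google.type import money_pb2

client = budgets_v1.BudgetServiceClient()

budget = budgets_v1.Budget(
    display_name="vertex-ai-monthly",
    budget_filter=budgets_v1.Filter(projects=["projects/123456789"]),
    amount=budgets_v1.BudgetAmount(
        specified_amount=money_pb2.Money(currency_code="USD", units=1000)
    ),
    # Alert at 50% and 80%, not just when the money is already gone.
    threshold_rules=[
        budgets_v1.ThresholdRule(threshold_percent=0.5),
        budgets_v1.ThresholdRule(threshold_percent=0.8),
        budgets_v1.ThresholdRule(threshold_percent=1.0),
    ],
)

client.create_budget(
    parent="billingAccounts/000000-AAAAAA-BBBBBB", budget=budget
)
```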