Setting up Kubeflow and Feast in production isn't like following a fucking cookbook. It's like trying to assemble IKEA furniture while the instructions are on fire and your manager is asking when it'll be ready every 10 minutes.
Why This Guide Won't Bullshit You
I spent the better part of 2024 getting this stack to work properly. The official docs assume you're a Kubernetes wizard who never makes mistakes. Real talk: you're going to break things, and that's fine.
Here's what actually happens when you try to run ML in production:
- Your notebook that "totally works" will fail spectacularly when it tries to load 50GB of training data
- Kubeflow will eat all your memory and ask for more
- Feature serving will randomly return stale data and you won't notice until a model starts predicting that every customer wants to buy pet insurance
- The pipeline that worked fine for 3 weeks will suddenly decide to crash at 2 AM on a Sunday
What You're Actually Building
A system that can handle your ML team's chaos without requiring a full-time babysitter:
Infrastructure That Doesn't Suck:
- A recent Kubeflow release that won't randomly break (pin whatever the current stable version is when you install, don't just grab latest)
- Feast feature store that actually keeps your features consistent
- A Kubernetes cluster that can survive your data scientists' massive training jobs
- Storage that won't randomly delete your models (this has happened to me twice)
Pipeline Magic:
- Model serving that doesn't time out when someone hits refresh
- Feature engineering that handles time zones correctly (seriously, fuck time zones)
- Model versioning that lets you roll back when the new model decides cats are vegetables
- Monitoring that actually tells you useful shit when things break
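On the time zone point, the least painful rule I know of is: convert every timestamp to timezone-aware UTC before it touches a feature pipeline. Here's a minimal sketch of that, not a definitive implementation. The function name `to_utc` and the sample event are made up for illustration, and the demo uses a fixed UTC-5 offset so it runs anywhere; in real code you'd more likely pass `zoneinfo.ZoneInfo("America/New_York")` so daylight saving is handled for you.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def to_utc(ts: datetime, source_tz: tzinfo) -> datetime:
    """Return ts as a timezone-aware UTC datetime.

    Naive timestamps are interpreted as wall-clock time in source_tz.
    That is an assumption you have to make explicitly per data source.
    """
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=source_tz)
    return ts.astimezone(timezone.utc)


# Hypothetical naive event time logged by a service running at UTC-5.
utc_minus_5 = timezone(timedelta(hours=-5))
event = datetime(2024, 3, 10, 1, 30)
utc_event = to_utc(event, utc_minus_5)
print(utc_event.isoformat())  # 2024-03-10T06:30:00+00:00
```

The point of raising the "which zone is this naive timestamp in?" question at the function boundary is that nobody gets to guess later.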
Production Reality:
- Authentication that doesn't make everyone an admin by default
- Resource limits so one person can't crash the entire cluster
- Backups that you'll pray you never need but will save your ass
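For the resource limits bullet, the standard Kubernetes tools are a `LimitRange` (per-container defaults and ceilings) and a `ResourceQuota` (a cap on the whole namespace). Below is a rough sketch that builds both as plain dicts and prints them as JSON, which `kubectl apply -f -` accepts. The namespace name, object names, and every number are placeholders I picked for illustration; size them for your own cluster.

```python
import json

NAMESPACE = "ml-team-a"  # hypothetical namespace name

# Per-container defaults and ceilings, so a pod with no limits still gets some.
limit_range = {
    "apiVersion": "v1",
    "kind": "LimitRange",
    "metadata": {"name": "container-defaults", "namespace": NAMESPACE},
    "spec": {
        "limits": [
            {
                "type": "Container",
                "defaultRequest": {"cpu": "500m", "memory": "1Gi"},
                "default": {"cpu": "2", "memory": "4Gi"},
                "max": {"cpu": "8", "memory": "32Gi"},
            }
        ]
    },
}

# Cap on the total the namespace can request, so one team can't take it all.
resource_quota = {
    "apiVersion": "v1",
    "kind": "ResourceQuota",
    "metadata": {"name": "team-quota", "namespace": NAMESPACE},
    "spec": {
        "hard": {
            "requests.cpu": "16",
            "requests.memory": "64Gi",
            "limits.cpu": "32",
            "limits.memory": "128Gi",
        }
    },
}

manifest = {"apiVersion": "v1", "kind": "List", "items": [limit_range, resource_quota]}
print(json.dumps(manifest, indent=2))
```

If you save this as a script (say `quotas.py`, also a made-up name), `python quotas.py | kubectl apply -f -` would apply both objects in one go.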
Time Expectations (AKA The Truth)
- Initial setup: Plan for a full weekend. The "quick start" guides are lying.
- Actually working system: Add another week for all the edge cases the docs don't mention
- Production ready: At least a month before you'd trust this with real business data
- Team onboarding: Your data scientists will need hand-holding for at least 2 weeks
The Infrastructure Tax
You'll need more resources than you think:
Minimum viable cluster:
- 3 nodes with decent CPUs and lots of RAM (think 16 cores, 64GB-ish if you can afford it)
- Fast storage - like 500GB+ of NVMe if you don't want to wait forever
- Decent network between nodes (don't cheap out here)
Reality check:
- The cluster itself (Kubernetes plus the Kubeflow components) will eat a big chunk of resources before you run a single job
- ML training jobs are memory hogs that will get OOM-killed, or take their neighbors down with them, if you don't set limits
- Feature stores need fast storage or your response times go to shit
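To make the idle-overhead problem concrete, here's a back-of-the-envelope sketch. It reads the numbers above as per-node figures (my assumption), and the 25% overhead is a placeholder, not a measurement. Get your real number from `kubectl top nodes` or the Allocatable section of `kubectl describe node` once the platform is installed.

```python
def usable_capacity(nodes, cores_per_node, ram_gb_per_node, overhead_fraction):
    """Return (cores, ram_gb) left for workloads after platform overhead."""
    if not 0 <= overhead_fraction < 1:
        raise ValueError("overhead_fraction must be in [0, 1)")
    keep = 1 - overhead_fraction
    return nodes * cores_per_node * keep, nodes * ram_gb_per_node * keep


# 3 nodes, 16 cores and 64 GB each; 0.25 is an assumed placeholder overhead.
cores, ram_gb = usable_capacity(3, 16, 64, overhead_fraction=0.25)
print(f"{cores:.0f} cores, {ram_gb:.0f} GB for actual work")
```

Run the same arithmetic against one large training job's memory request and you'll see quickly whether the "minimum viable" cluster is actually viable for your team.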
What Actually Breaks in Production
"Container won't start"
- Docker images that work on your laptop but fail in production
- Memory limits set too low (learned this the hard way)
- Missing environment variables that worked fine in development
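The missing-variable failure is cheap to catch: check for everything you need at startup and die loudly with a full list, instead of crashing three steps in with a `KeyError`. A minimal sketch, assuming the variable names below (they're hypothetical, swap in your own):

```python
import os

REQUIRED_VARS = ["FEAST_REPO_PATH", "MODEL_BUCKET"]  # hypothetical names


def missing_env(required, environ=None):
    """Return the required variable names that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


def check_env(required, environ=None):
    """Raise one error listing every missing variable, not just the first."""
    missing = missing_env(required, environ)
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )


# Demo against a fake environment so the sketch runs anywhere.
print(missing_env(["A", "B"], {"A": "1"}))  # ['B']
```

Calling `check_env(REQUIRED_VARS)` as the first line of the container's entrypoint means the failure shows up in the pod logs immediately, with the names spelled out.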
"Features are inconsistent"
- Clock drift between systems causing feature freshness issues
- Race conditions during feature materialization
- Different Python versions computing features slightly differently
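Clock drift tends to show up as feature timestamps that are either too old or, more tellingly, in the future. Here's a rough sketch of a freshness check that treats the two cases differently. The function name, the tolerance parameter, and the sample values are all mine for illustration; this isn't a Feast API, just a check you could run on whatever timestamps your store hands back.

```python
from datetime import datetime, timedelta, timezone


def is_stale(event_ts, now, max_age, skew_tolerance=timedelta(0)):
    """Return True if the feature is older than max_age.

    A timestamp further in the future than skew_tolerance raises, because
    that usually means clocks disagree rather than that the data is fresh.
    """
    if event_ts.tzinfo is None or now.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    age = now - event_ts
    if age < -skew_tolerance:
        raise ValueError(f"feature timestamp is {-age} in the future")
    return age > max_age


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
two_hours_old = now - timedelta(hours=2)
print(is_stale(two_hours_old, now, max_age=timedelta(hours=1)))  # True
```

Wiring the future-timestamp error into alerting is the part that would have caught the pet insurance problem early.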
"Everything is slow"
- Network latency you didn't account for
- Database connections not properly pooled
- Images being pulled from slow registries every time
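On connection pooling: the idea is just "open N connections once, hand them out, take them back". The sketch below shows that idea with the standard library and an in-memory SQLite database as a stand-in. It's deliberately bare (no health checks, no reconnects), and the class is my own illustration; in a real service you'd normally lean on your driver's or SQLAlchemy's built-in pooling instead of rolling this yourself.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool: connections are created once and reused."""

    def __init__(self, factory, size):
        self._idle = queue.LifoQueue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def connection(self, timeout=5.0):
        # Blocks until a connection is free; raises queue.Empty on timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always hand the connection back


# SQLite in memory stands in for a real database here.
pool = ConnectionPool(lambda: sqlite3.connect(":memory:"), size=2)
for _ in range(10):
    with pool.connection() as conn:
        row = conn.execute("SELECT 1").fetchone()
print(row)  # (1,)
```

Ten queries, two connections ever opened. Without the pool that loop would pay connection setup ten times, which is exactly the latency that piles up under load.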
"It worked yesterday"
- Kubernetes node ran out of disk space
- Certificate expired (always happens at night)
- Someone changed a config and didn't tell anyone
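The first two of these are boring enough to check automatically. A minimal sketch using only the standard library: one function for how full a disk is, one for how many days a certificate has left. The function names and the idea of feeding them into an alert are my assumptions; the date string format is the one Python's `ssl` module uses for a certificate's `notAfter` field.

```python
import shutil
import ssl
from datetime import datetime, timezone


def disk_used_fraction(path="/"):
    """Fraction of the filesystem at path that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def days_until_expiry(not_after, now):
    """Days between now and a certificate's notAfter string.

    not_after looks like 'Jan  5 09:34:43 2018 GMT', the format found in
    the dict returned by ssl.SSLSocket.getpeercert().
    """
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - now.timestamp()) / 86400


print(f"disk used: {disk_used_fraction('/'):.0%}")
sample_now = datetime(2017, 12, 26, 9, 34, 43, tzinfo=timezone.utc)
print(days_until_expiry("Jan  5 09:34:43 2018 GMT", sample_now))  # 10.0
```

Run something like this on a schedule and alert well before the numbers get scary, so the certificate expires on your calendar instead of at night.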
This guide will walk you through the actual solutions to these problems, not just the happy path that works in demos.
The Essentials (stuff I actually used):
- Kubeflow docs - where you'll spend most of your debugging time
- This Stack Overflow thread - more useful than the official docs
- Feast deployment guide - only one that actually worked for me