Before you blow your AWS budget on this thing, here's what MLflow does in practice - not the marketing bullshit, but what you actually get after 7 years of people beating on it in production.
MLflow is an open-source platform that Databricks released in June 2018 to stop data scientists from losing their fucking minds trying to reproduce model results. After 7 years of development, it's become the most popular MLOps tracking tool, with 20k+ GitHub stars and millions of downloads, adopted by thousands of organizations who got tired of spreadsheet experiment tracking.
Core Components
MLflow has four main pieces that work together to streamline ML workflows:
MLflow Tracking
logs your experiment parameters, metrics, and artifacts automatically so you don't have to remember what you tried. The web UI works great for small teams until you have 10,000 experiments and it becomes slow as hell. Supports MySQL and PostgreSQL backends but you'll waste half a day debugging connection strings that look right but somehow aren't. We spent 3 hours on artifact storage paths before realizing the damn thing needed trailing slashes.
MLflow Models
packages models in a standard format that theoretically works anywhere. In practice, dependency hell is real - Python 3.8 models won't run on Python 3.11 servers without some Docker gymnastics. Supports the usual suspects - scikit-learn, TensorFlow, PyTorch, plus some obscure frameworks only you use.
MLflow Model Registry
is where models go to die in "staging" limbo. Version control works fine, but the approval workflow is basic - you'll end up writing your own CI/CD integration anyway. The stage transition API is straightforward if you enjoy writing custom deployment scripts.
MLflow Projects
nobody fucking uses because YAML config for reproducibility is about as reliable as weather forecasting. Most teams just use Docker containers and call it a day. Projects are supposed to solve environment consistency, but good luck getting the same CUDA version across different machines.
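For the record, here's what the YAML in question looks like - a hypothetical MLproject file (the image name and entry point are invented). Ironically, the least painful way to use Projects is to point it at a Docker image anyway, which at least pins CUDA along with everything else:

```yaml
# MLproject - illustrative example, not a real project
name: churn-training

docker_env:
  image: ghcr.io/example/churn-train:1.4.2

entry_points:
  main:
    parameters:
      learning_rate: {type: float, default: 0.01}
    command: "python train.py --lr {learning_rate}"
```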
MLflow 3.0 and the GenAI Gold Rush
MLflow 3.0 launched June 11, 2025, finally acknowledging that traditional ML tracking doesn't work for LLM chaos. Seven years after the original 2018 release, they admitted that prompt engineering isn't just "hyperparameter tuning with words." As of August 2025, MLflow 3.3.0 includes Model Registry Webhooks and enhanced GenAI capabilities.
Finally, they're fixing actual problems instead of adding more YAML bullshit:
Enhanced Tracing
because debugging why your LangChain agent made 47 API calls to generate "Hello" is a nightmare
LLM Evaluation
with metrics for hallucination detection - finally someone admits models lie constantly
Prompt Management
for versioning because "make it more creative" isn't a reproducible instruction
LoggedModel Entity
that groups everything together since GenAI workflows are inherently messier
Traditional metrics like accuracy mean jackshit when your model generates different outputs on every run - even at temperature=0, thanks to nondeterministic GPU kernels and batched inference. That's just how LLMs work, and it took MLflow 7 years to build tooling that accepts it.
The "Free" vs "Just Pay Databricks" Decision
Open Source MLflow
is "free" until you factor in:
- Database hosting costs (PostgreSQL isn't free on AWS RDS)
- Artifact storage bills (S3 costs add up fast with large models)
- DevOps time setting up servers, monitoring, backups
- The inevitable weekend outage when your tracking server dies
Budget at least $500/month in AWS costs plus 20 hours of engineering time for a proper production setup. And that's if you're lucky - when our RDS instance crashed because we didn't set up connection pooling properly, it took us 2 days to get tracking back online.
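For reference, "proper production setup" means roughly this - a hedged sketch of the server launch, with placeholder URIs you'd swap for your own (and yes, mind the trailing slash on the artifact destination):

```shell
# Hypothetical production-ish tracking server: Postgres for metadata,
# S3 for artifacts, with the server proxying artifact access.
mlflow server \
  --host 0.0.0.0 --port 5000 \
  --backend-store-uri postgresql://mlflow:PASSWORD@db.internal:5432/mlflow \
  --artifacts-destination s3://my-mlflow-artifacts/ \
  --serve-artifacts
```

Add a connection pooler in front of Postgres, disk alerts, and backups, and you've spent the 20 hours.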
Managed MLflow on Databricks
costs money upfront but includes:
- Automatic scaling (no more crashed servers during experiment storms)
- Enterprise security that actually works
- Unity Catalog integration for proper permissions
- Someone else's problem when shit breaks during dinner
The managed version makes sense if your team's time is worth more than fighting infrastructure. When you're getting paged at midnight because the tracking server's out of disk space again, that $1000/month suddenly looks reasonable.