MLflow exists because in 2018, some Databricks engineers got tired of the same problems we all deal with: you train a model that actually works, then two weeks later you're staring at your terminal trying to remember which random hyperparameter combo made the magic happen. Sound familiar?
I've been using MLflow for 3 years and here's the reality: it solves the "where the fuck did I put that model" problem better than anything else I've tried. It's not perfect, but it beats keeping track of experiments in spreadsheets or naming directories things like "attempt_47_this_time_for_real".
What MLflow Actually Does
Experiment Tracking: Logs your parameters, metrics, and artifacts automatically. Autologging works great with scikit-learn and okay with most other frameworks. Don't expect it to read your mind though - custom metrics still need manual logging.
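Here's roughly what that looks like in practice; a minimal sketch with scikit-learn, where the extra holdout metric is exactly the kind of thing autologging won't do for you:

```python
import mlflow
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Autologging captures params, training metrics, and the fitted model for scikit-learn.
mlflow.sklearn.autolog()

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
    # Custom metrics still need manual logging; autologging won't invent them.
    mlflow.log_metric("holdout_accuracy", model.score(X_test, y_test))
```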
Model Registry: Version control for models that doesn't make you want to die. Promote models from "Staging" to "Production" without copying files around like a caveman. The UI crawls with thousands of models but it's functional.
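Something like this, assuming you've already logged a model in a run (the model name and run ID below are placeholders); note that recent releases push you toward aliases instead of stages:

```python
import mlflow
from mlflow import MlflowClient

# Register a model that was logged in an earlier run. "<run_id>" is a placeholder.
mv = mlflow.register_model(model_uri="runs:/<run_id>/model", name="churn-model")

client = MlflowClient()

# Classic stage-based promotion (deprecated in newer releases, still widely used).
client.transition_model_version_stage(
    name="churn-model", version=mv.version, stage="Production"
)

# The alias-based replacement: point "production" at whichever version you trust.
client.set_registered_model_alias(
    name="churn-model", alias="production", version=mv.version
)
```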
Model Deployment: This is where things get sketchy. MLflow can serve models but authentication and scaling are your problems. The deployment docs show you how but assume you already know infrastructure.
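A rough sketch of the happy path, with a made-up model name, version, and input batch; everything around it (auth, TLS, autoscaling) is still on you:

```python
import mlflow.pyfunc
import pandas as pd

# The same model URI works with the CLI server:
#   mlflow models serve -m "models:/churn-model/3" -p 5001
# Authentication and scaling are not included; put it behind your own gateway.
model = mlflow.pyfunc.load_model("models:/churn-model/3")

# Placeholder batch; column names must match whatever the model was trained on.
batch = pd.DataFrame({"tenure": [12], "monthly_charges": [79.5]})
print(model.predict(batch))
```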
Data Tracking: Dataset tracking exists but it's not great. Fine for small datasets, useless for anything over a few GB. They're working on it.
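For what it's worth, the dataset API looks roughly like this (the frame, source path, and names are placeholders); it mostly records a schema, a source pointer, and a digest, not the data itself:

```python
import mlflow
import pandas as pd

# Placeholder frame; in real life this would be your training table.
df = pd.DataFrame({"tenure": [1, 12, 24], "churned": [1, 0, 0]})

with mlflow.start_run():
    # mlflow.data logs dataset metadata (name, source, schema, digest),
    # so huge tables only get a pointer and a hash, not a copy.
    dataset = mlflow.data.from_pandas(
        df, source="data/churn_training.parquet", name="churn-training"
    )
    mlflow.log_input(dataset, context="training")
```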
The MLflow 3.0+ Reality Check (Current: 3.3.2)
MLflow 3.0 dropped in June 2025 with GenAI features because apparently regular ML wasn't trendy enough. The LLM tracking actually works if you're doing prompt engineering or RAG systems. Version 3.3.2 came out August 27 with bug fixes and server improvements.
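If you're curious, the tracing side looks roughly like this; the functions are made up, but decorating your retrieval and generation steps so each call shows up as a span is the general idea:

```python
import mlflow

# Hypothetical RAG pipeline pieces; traces land in the active experiment.
@mlflow.trace
def retrieve(query: str) -> list[str]:
    return ["doc snippet 1", "doc snippet 2"]

@mlflow.trace
def answer(query: str) -> str:
    context = retrieve(query)
    return f"Answer to {query!r} based on {len(context)} snippets"

answer("What does MLflow 3.x add for GenAI?")
```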
The bad news? They broke a metric ton of APIs without documenting half of it. Upgrading from 2.x means a full day of fighting import errors like "ImportError: cannot import name 'MlflowClient' from 'mlflow.tracking'" even though it's supposed to be there. The migration guide exists but ignores the stuff that actually breaks in real codebases. Test everything twice before upgrading production.
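One pragmatic stopgap while you sort out the migration is a compatibility shim like this; treat it as a band-aid, not a fix:

```python
# Both import paths have existed historically; fall back rather than crashing,
# then pin and test instead of trusting either blindly.
try:
    from mlflow.tracking import MlflowClient
except ImportError:
    from mlflow import MlflowClient

import mlflow
print(mlflow.__version__)  # confirm which version your code actually imported
```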
The Components That Actually Matter
Tracking Server: This is the heart of MLflow. It stores all your experiment data and serves the UI. Works fine when you're the only one using it locally, but the moment you have 2+ data scientists hitting it simultaneously, SQLite will choke and die with "database is locked" errors. You'll need PostgreSQL and proper storage or you'll be debugging database locks at 2am. Check the tracking server deployment guide for production setup.
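A typical production-ish setup looks something like this; hostnames, credentials, bucket names, and the experiment name are all placeholders:

```python
# Server side (shell): Postgres for metadata, S3 for artifacts.
#   mlflow server \
#     --backend-store-uri postgresql://mlflow:password@db-host:5432/mlflow \
#     --default-artifact-root s3://my-mlflow-artifacts/ \
#     --host 0.0.0.0 --port 5000

# Client side: point every run at the shared server instead of a local ./mlruns.
import mlflow

mlflow.set_tracking_uri("http://tracking-server:5000")
mlflow.set_experiment("churn-experiments")
```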
Model Registry: Where your models live after training. The versioning works well and the model lineage is useful for compliance. The web UI gets sluggish with thousands of experiments but it's better than nothing. The REST API handles programmatic access.
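When the UI gets sluggish, the programmatic route is usually faster anyway; a quick sketch with a placeholder model name:

```python
from mlflow import MlflowClient

client = MlflowClient()

# Skip the slow UI: pull versions and their source runs for lineage or audits.
for mv in client.search_model_versions("name = 'churn-model'"):
    print(mv.version, mv.current_stage, mv.run_id, mv.source)
```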
Artifacts: MLflow stores your model files, plots, and other outputs. Local storage works for testing but you'll want S3 or equivalent for production. Pro tip: artifact storage gets expensive fast if you're not careful about cleanup. The artifact URI format is consistent across storage backends.
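The round trip looks roughly like this; the file, path, and metric value are just placeholders:

```python
import json
import mlflow

with mlflow.start_run() as run:
    # Write any local file (plot, config, report) and log it as an artifact.
    with open("eval_report.json", "w") as f:
        json.dump({"holdout_accuracy": 0.93}, f)
    mlflow.log_artifact("eval_report.json", artifact_path="reports")
    # Same URI format whether the backing store is local disk or S3.
    print(mlflow.get_artifact_uri("reports"))

# Pull it back later by run ID, regardless of where the artifacts actually live.
local_path = mlflow.artifacts.download_artifacts(
    run_id=run.info.run_id, artifact_path="reports/eval_report.json"
)
```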
Deployments: Model serving that works but isn't magic. You can deploy to local servers, cloud platforms, or Kubernetes. The REST API is standard but the performance depends entirely on your infrastructure choices.
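Once something is being served, scoring it is a plain HTTP call; the port, model name, and feature columns below are made up:

```python
import requests

# Assuming a model is already being served locally, e.g.:
#   mlflow models serve -m "models:/churn-model/3" -p 5001
# The payload follows the MLflow scoring server's JSON protocol.
payload = {
    "dataframe_split": {
        "columns": ["tenure", "monthly_charges"],
        "data": [[12, 79.5]],
    }
}
resp = requests.post("http://127.0.0.1:5001/invocations", json=payload, timeout=10)
print(resp.json())
```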
The architecture is modular so you can start with just experiment tracking and add pieces as needed. Most people begin with tracking because that's where the immediate pain relief happens. The MLflow architecture docs explain common deployment patterns.