What MLflow Actually Does (And What It Doesn't)

Before you blow your AWS budget on this thing, here's what MLflow actually does in practice - not the marketing bullshit, but what you'll actually get after 7 years of people beating on it in production.

MLflow is an open-source platform that Databricks released in June 2018 to stop data scientists from losing their fucking minds trying to reproduce model results. Since then it's become the most popular MLOps tracking tool, with 20k+ GitHub stars and millions of downloads, adopted by thousands of organizations that got tired of spreadsheet experiment tracking.

Core Components

MLflow has four main pieces that work together to streamline ML workflows:

MLflow Tracking logs your experiment parameters, metrics, and artifacts automatically so you don't have to remember what you tried. The web UI works great for small teams until you have 10,000 experiments and it becomes slow as hell. It supports MySQL and PostgreSQL backends, but you'll waste half a day debugging connection strings that look right but somehow aren't. We spent 3 hours on artifact storage paths before realizing the damn thing needed trailing slashes.
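Day-to-day, the logging itself is about five lines. A minimal sketch - the tracking URI, experiment name, parameter and metric names, and the artifact file are all placeholders you'd swap for your own:

```python
import mlflow

# Hypothetical remote tracking server backed by PostgreSQL; swap in your own URI
mlflow.set_tracking_uri("http://mlflow.internal.example.com:5000")
mlflow.set_experiment("churn-model")

with mlflow.start_run(run_name="baseline-rf"):
    # Parameters: anything you'd otherwise scribble in a spreadsheet
    mlflow.log_param("n_estimators", 200)
    mlflow.log_param("max_depth", 8)

    # Metrics: logged per run, queryable later in the UI or via the API
    mlflow.log_metric("val_auc", 0.87)

    # Artifacts: plots, configs, whatever - this is what runs up your S3 bill
    # (assumes the file exists locally)
    mlflow.log_artifact("confusion_matrix.png")
```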

MLflow Models packages models in a standard format that theoretically works anywhere. In practice, dependency hell is real - Python 3.8 models won't run on Python 3.11 servers without some Docker gymnastics. It supports the usual suspects - scikit-learn, TensorFlow, PyTorch - plus some obscure frameworks only you use.
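You can dodge at least some of that pain by pinning requirements and logging a signature when you save the model. A hedged sketch using a throwaway scikit-learn model - the pinned versions are examples, not a recommendation:

```python
import mlflow
from mlflow.models import infer_signature
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
model = RandomForestClassifier(n_estimators=100).fit(X, y)

with mlflow.start_run():
    # The signature records the input/output schema so serving can validate requests
    signature = infer_signature(X, model.predict(X))

    # Explicitly pinned pip requirements beat whatever MLflow auto-detects
    mlflow.sklearn.log_model(
        model,
        "model",
        signature=signature,
        pip_requirements=["scikit-learn==1.3.1", "numpy>=1.24,<2"],
    )
```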

MLflow Model Registry is where models go to die in "staging" limbo. Version control works fine, but the approval workflow is basic - you'll end up writing your own CI/CD integration anyway. The stage transition API is straightforward if you enjoy writing custom deployment scripts.
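The registry calls themselves are trivial; everything around them (approvals, CI/CD, rollback) is what you end up building. A sketch with placeholder model and run names - note that newer MLflow releases push aliases instead of the old stage transitions:

```python
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register a model version from a finished run (the run ID is a placeholder)
version = mlflow.register_model("runs:/abc123def456/model", "churn-model")

# Old-style stage transition - simple, but deprecated in recent releases
client.transition_model_version_stage(
    name="churn-model", version=version.version, stage="Staging"
)

# Newer alias-based flow, if your MLflow version supports it
client.set_registered_model_alias("churn-model", "challenger", version.version)
```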

MLflow Projects is the piece nobody fucking uses, because YAML config for reproducibility is about as reliable as weather forecasting. Most teams just use Docker containers and call it a day. Projects are supposed to solve environment consistency, but good luck getting the same CUDA version across different machines.
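For reference, this is roughly what running a Project looks like if you ever bother - the repo URI and parameters are hypothetical:

```python
import mlflow

# Runs the "main" entry point declared in the project's MLproject file.
# env_manager="local" skips conda/virtualenv recreation (usually the part
# that breaks) and just uses whatever environment you're already in.
submitted = mlflow.projects.run(
    uri="https://github.com/your-org/your-ml-project",  # hypothetical repo
    entry_point="main",
    parameters={"alpha": 0.5},
    env_manager="local",
)
print(submitted.run_id)
```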

MLflow 3.0 and the GenAI Gold Rush

MLflow 3.0 launched June 11, 2025, finally acknowledging that traditional ML tracking doesn't work for LLM chaos. Seven years after the original 2018 release, they admitted that prompt engineering isn't just "hyperparameter tuning with words." As of August 2025, MLflow 3.3.0 includes Model Registry Webhooks and enhanced GenAI capabilities.

Finally, they're fixing actual problems instead of adding more YAML bullshit:

  • Enhanced Tracing, because debugging why your LangChain agent made 47 API calls to generate "Hello" is a nightmare
  • LLM Evaluation with metrics for hallucination detection - finally someone admits models lie constantly
  • Prompt Management for versioning, because "make it more creative" isn't a reproducible instruction
  • A LoggedModel Entity that groups everything together, since GenAI workflows are inherently messier

Traditional metrics like accuracy mean jackshit when your model generates different outputs every time - yes, even with temperature=0, because deterministic output just isn't how LLMs behave in practice. It only took MLflow 7 years to figure this out.

The "Free" vs "Just Pay Databricks" Decision

Open Source MLflow is "free" until you factor in:

  • Database hosting costs (PostgreSQL isn't free on AWS RDS)
  • Artifact storage bills (S3 costs add up fast with large models)
  • DevOps time setting up servers, monitoring, backups
  • The inevitable weekend outage when your tracking server dies

Budget at least $500/month in AWS costs plus 20 hours of engineering time for a proper production setup. And that's if you're lucky - when our RDS instance crashed because we didn't set up connection pooling properly, it took us 2 days to get tracking back online.

Managed MLflow on Databricks costs money upfront but includes:

  • Automatic scaling (no more crashed servers during experiment storms)
  • Enterprise security that actually works
  • Unity Catalog integration for proper permissions
  • Someone else's problem when shit breaks during dinner

The managed version makes sense if your team's time is worth more than fighting infrastructure. When you're getting paged at midnight because the tracking server's out of disk space again, that $1000/month suddenly looks reasonable.

What Actually Works (And What Breaks)

Reading about MLflow's components is one thing - knowing which ones actually work when you're dealing with real data and real deadlines is completely different. Here's what I've learned after running MLflow in production for 2 years.

Experiment Tracking That Actually Saves Your Ass

MLflow's tracking is the one feature that consistently works. It auto-logs environment details like Python versions and Git hashes, which sounds boring until you spend 3 hours debugging why last month's model won't run. The metadata logging saved my team's sanity when we discovered our "breakthrough" results were due to scikit-learn 1.2.0 vs 1.3.1.
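Most of that environment metadata shows up for free if you flip on autologging before training. A minimal sketch:

```python
import mlflow
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Autologging captures params, metrics, the fitted model, and environment
# details (library versions, git commit when available) without any explicit
# log_* calls in the training code.
mlflow.sklearn.autolog()

X, y = load_iris(return_X_y=True)

with mlflow.start_run():
    LogisticRegression(max_iter=200).fit(X, y)
```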

Real-time Monitoring works if you remember to call mlflow.log_metric() in your training loop. The web UI refreshes every few seconds, which is great for 10-minute training runs but painful for 12-hour deep learning jobs. Learn from my pain: log every 100 steps, not every batch, or you'll DDoS your own tracking server. I learned this the hard way when my BERT training logged 50,000 metrics and crashed the UI for everyone.
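The fix is boring: throttle your logging and pass an explicit step. A sketch with a stubbed-out training loop standing in for your real one:

```python
import random

import mlflow

LOG_EVERY_N_STEPS = 100  # log every 100 steps, not every batch


def train_one_batch(step: int) -> float:
    """Stand-in for your real training step; returns a fake loss."""
    return 1.0 / (step + 1) + random.random() * 0.01


with mlflow.start_run():
    for step in range(10_000):
        loss = train_one_batch(step)
        if step % LOG_EVERY_N_STEPS == 0:
            # `step` gives the UI a proper x-axis and keeps metric volume sane
            mlflow.log_metric("train_loss", loss, step=step)
```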

Artifact Management is where your AWS bill explodes. MLflow stores everything in S3, Azure Blob, or GCS, which sounds great until you realize each model checkpoint is 2GB and you're running 50 experiments. I got hit with some insane S3 bill - I think it was like $750 or $850, somewhere around there - because we forgot to set lifecycle policies and were storing every checkpoint from a 6-month hyperparameter sweep. Budget $200/month minimum for artifact storage if you're doing serious deep learning. The UI lets you download artifacts, assuming you enjoy waiting 10 minutes for large files.
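Lifecycle policies are the cheap fix. A hedged boto3 sketch that expires old checkpoints after 30 days - the bucket name, prefix, and retention window are placeholders you'd tune for your own setup:

```python
import boto3

s3 = boto3.client("s3")

# Expire everything under the checkpoints prefix after 30 days.
# Bucket and prefix are placeholders - match them to your artifact root.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-mlflow-artifacts",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-checkpoints",
                "Filter": {"Prefix": "mlflow-artifacts/checkpoints/"},
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)
```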

Model Deployment (Your Infrastructure Team Will Hate You)

MLflow can serve models as REST APIs, batch jobs, or streaming apps, but don't expect magic. The mlflow models serve command works great for demos, breaks spectacularly in production. I spent way too long trying to get it working - probably 10 days of pain - with proper health checks and load balancing before giving up and writing our own Flask wrapper. You'll end up writing your own Docker containers and Kubernetes manifests anyway.
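Our wrapper wasn't fancy - roughly this shape, loading the packaged model through pyfunc and bolting on the health check endpoint we needed. The model URI is a placeholder:

```python
import mlflow.pyfunc
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)

# Placeholder URI - point at a registry model or a runs:/ path
MODEL_URI = "models:/churn-model/1"
model = mlflow.pyfunc.load_model(MODEL_URI)


@app.route("/health")
def health():
    # Something for the load balancer to poke
    return jsonify(status="ok")


@app.route("/predict", methods=["POST"])
def predict():
    df = pd.DataFrame(request.get_json())
    preds = model.predict(df)
    return jsonify(predictions=preds.tolist())


if __name__ == "__main__":
    # Put gunicorn (or similar) in front of this for anything resembling production
    app.run(host="0.0.0.0", port=8080)
```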

Multi-cloud Flexibility means MLflow generates deployment artifacts for AWS SageMaker, Azure ML, and Google Vertex AI. In practice, each platform has its own gotchas and you'll spend days debugging environment differences. I wasted a full week trying to get the same model working on both SageMaker and Vertex AI - turns out they expected different conda environment formats. The "standardized" format is more of a suggestion.

A/B Testing Support is basically non-existent. MLflow can version models, but traffic splitting and rollback logic? That's your job. Most teams end up building their own feature flags and monitoring on top of whatever MLflow gives them.

Why Databricks Wants Your Money (And It Might Be Worth It)

The managed version costs real money but solves real problems:

Enterprise Security that actually works includes RBAC that doesn't require a PhD in Kubernetes. Corporate SSO integration means no more shared service accounts. Unity Catalog provides actual access controls instead of "everyone can see everything" like open source MLflow.

Automatic Scaling means your tracking server doesn't crash when the entire data science team runs hyperparameter sweeps simultaneously. I learned this lesson when 500 concurrent experiments brought down our single EC2 instance for 6 hours, and suddenly our VP was asking why no one could work. That's when I started looking at the managed option real seriously.

Databricks Integration is the real value prop if you're already in their ecosystem. Notebooks can log to MLflow without configuration hell, and Unity Catalog tracks lineage from raw data to deployed model. If you're not on Databricks already, this integration doesn't help you.

GenAI Features (Because Traditional ML Metrics Are Useless Here)

MLflow 3.0 finally acknowledges that LLM development is chaos:

LLM Gateway unifies access to OpenAI, Anthropic, and Hugging Face APIs with cost tracking. Switching between providers is easy until you realize each has different rate limits, prompt formatting, and random failure modes. The cost tracking is useful when your bill hits $1000/month in API calls.

Prompt Engineering tools version your prompts, which is great until you have 47 variations of "be more helpful" and can't remember which one actually worked. The A/B testing requires you to build your own evaluation harness - MLflow just stores the results.

Agent Evaluation includes hallucination detection and factual accuracy scoring. These metrics are better than nothing but still struggle with subjective outputs. When your agent responds "I don't know" vs making shit up, which is better? Depends on your use case.

The tracing capabilities help debug why your LangChain agent took 47 steps to answer a simple question. Essential for complex workflows, assuming you can interpret the trace visualization without going insane.
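If you're on a recent MLflow release with tracing support (and have LangChain installed - otherwise drop the autolog line), instrumentation is mostly a one-liner plus a decorator for your own glue code. A hedged sketch:

```python
import mlflow

# Auto-instrument LangChain so each chain/agent step shows up as a trace span
mlflow.langchain.autolog()


@mlflow.trace
def preprocess(question: str) -> str:
    # Your own glue code gets its own span in the trace tree
    return question.strip().lower()


@mlflow.trace
def answer(question: str) -> str:
    cleaned = preprocess(question)
    # call your chain/agent here; stubbed out so the sketch stands alone
    return f"echo: {cleaned}"


print(answer("  Why did the agent make 47 API calls?  "))
```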

Who Actually Uses What (And Why They Regret It)

| Reality Check | MLflow OSS | Managed MLflow | W&B | Neptune | Kubeflow |
|---|---|---|---|---|---|
| Time to Setup | 2 days if lucky | 30 minutes | 5 minutes | 1 hour | 2 weeks minimum |
| Monthly Bill | $500+ in AWS | $1,000+ | $200+ per user | $176+ per user | $2,000+ in compute |
| When It Breaks | Your problem | Their problem | Their problem | Their problem | Definitely your problem |
| Learning Curve | Moderate | Easy | Easy | Steep | Vertical cliff |
| Vendor Lock-in | None | High | High | Medium | None |

Real Questions Engineers Ask (With Honest Answers)

Q: Why does the MLflow UI become unusably slow with large experiments?

A: The SQLite backend that everyone starts with can't handle thousands of runs. You need PostgreSQL or MySQL for anything serious. Even then, the web UI struggles with complex queries on 10,000+ experiments. Solution: use a proper database and learn to filter experiments in the UI.
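When the UI chokes, querying through the API with a filter is usually faster anyway. A sketch with placeholder experiment, metric, and parameter names:

```python
import mlflow

# Point at the tracking server backed by PostgreSQL (placeholder URI)
mlflow.set_tracking_uri("http://mlflow.internal.example.com:5000")

# Pull only the runs you care about instead of scrolling through thousands in the UI
runs = mlflow.search_runs(
    experiment_names=["churn-model"],
    filter_string="metrics.val_auc > 0.85 and params.model_type = 'rf'",
    order_by=["metrics.val_auc DESC"],
    max_results=20,
)
print(runs[["run_id", "metrics.val_auc"]])
```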

Q: How much will MLflow actually cost me in production?

A: Budget anywhere from $500-2,000/month, maybe more if you're unlucky, for AWS/GCP hosting: the database runs $50-200/month, artifact storage explodes with large models ($200-1,000/month), and the tracking server itself needs $100-500/month in compute. Managed MLflow starts around $1,000/month but includes scaling and support.

Q: Does MLflow work with Python 3.11/3.12 yet?

A: MLflow supports Python 3.8-3.11 as of late 2025. Python 3.12 support is still experimental and has known issues with the databricks-cli dependency. If you deployed models with 3.8, they won't run on 3.11 servers without Docker containers. Version compatibility is a constant pain point.

Q: What happens when MLflow deployment breaks in production?

A: Model serving with mlflow models serve is great for demos, terrible for production. No health checks, no graceful shutdowns, no proper logging. You'll need to wrap it in Docker with proper monitoring. Most teams end up building custom Flask/FastAPI servers and just use MLflow for packaging.

Q: Can I migrate experiments from Weights & Biases to MLflow?

A: There's no official migration tool. You'll need to write custom scripts using both APIs. Expect to lose visualization configs and some metadata. The W&B export API helps, but plan for 2-3 days of data wrangling.

Q: Why does MLflow Model Registry feel so basic compared to alternatives?

A: Because it is basic. No automated model validation, no deployment pipelines, no proper approval workflows. It's a version-controlled file store with a UI. Neptune and W&B have better registry features if you need enterprise workflows.

Q: Should I use MLflow or just stick with Weights & Biases?

A: MLflow is free and doesn't lock you into a platform, but W&B has better visualizations and team features. If you're a solo developer or small team, MLflow is fine. If you need rich dashboards and collaboration, W&B is worth the $200/month. Neptune falls somewhere between.

Q: Does MLflow actually work for LLM projects?

A: MLflow 3.0 added LLM support in June 2025, with MLflow 3.3.0 adding Model Registry Webhooks in August 2025, but it's still catching up to specialized tools. Prompt versioning works, but evaluation is basic. For serious LLM development, consider Weights & Biases, LangSmith, or Helicone for better observability.

Q: What's the biggest MLflow gotcha that will bite me?

A: Artifact storage costs. Your first S3 bill will be a wake-up call when you realize MLflow stores every model checkpoint, dataset, and plot by default. I got hit with a bill north of $1100 one month because our hyperparameter sweep was saving 200MB checkpoints every epoch. Set lifecycle policies on your S3 buckets and be selective about what you log. That 2GB model saved 100 times adds up fast.

Q: When should I NOT use MLflow?

A: If you're doing simple batch ML with minimal collaboration, a spreadsheet might be sufficient. If you need advanced workflow orchestration, Kubeflow or Metaflow are better choices. If you're primarily doing LLM work, specialized LLM tools often work better.

Q: How do I avoid making MLflow a single point of failure?

A: Don't run MLflow on a single EC2 instance in production like we did. Use managed databases (RDS), load balancers, and proper monitoring. Or just pay for Managed MLflow and let Databricks handle the operational headaches. A 6-hour outage when your tracking server dies teaches you this lesson quickly - especially when your CEO asks why no one can work.
