What MLflow Actually Does (And What It Doesn't)

Before you blow your AWS budget on this thing, here's what MLflow actually does in practice - not the marketing bullshit, but what you'll actually get after 7 years of people beating on it in production.

MLflow is an open-source platform that Databricks released in June 2018 to stop data scientists from losing their fucking minds trying to reproduce model results. Since then it's become the most popular MLOps tracking tool, with 20k+ GitHub stars and millions of downloads, adopted by thousands of organizations that got tired of spreadsheet experiment tracking.

Core Components

MLflow has four main pieces that work together to streamline ML workflows:

MLflow Tracking logs your experiment parameters, metrics, and artifacts automatically so you don't have to remember what you tried. The web UI works great for small teams until you have 10,000 experiments and it becomes slow as hell. It supports MySQL and PostgreSQL backends, but you'll waste half a day debugging connection strings that look right but somehow aren't. We spent 3 hours on artifact storage paths before realizing the damn thing needed trailing slashes.
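Day-to-day, the logging itself is about five lines. A minimal sketch - the tracking URI, experiment name, parameter and metric names, and the artifact file are all placeholders you'd swap for your own:

```python
import mlflow

# Hypothetical remote tracking server backed by PostgreSQL; swap in your own URI
mlflow.set_tracking_uri("http://mlflow.internal.example.com:5000")
mlflow.set_experiment("churn-model")

with mlflow.start_run(run_name="baseline-rf"):
    # Parameters: anything you'd otherwise scribble in a spreadsheet
    mlflow.log_param("n_estimators", 200)
    mlflow.log_param("max_depth", 8)

    # Metrics: logged per run, queryable later in the UI or via the API
    mlflow.log_metric("val_auc", 0.87)

    # Artifacts: plots, configs, whatever - this is what runs up your S3 bill
    # (assumes the file exists locally)
    mlflow.log_artifact("confusion_matrix.png")
```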

MLflow Models packages models in a standard format that theoretically works anywhere. In practice, dependency hell is real - Python 3.8 models won't run on Python 3.11 servers without some Docker gymnastics. It supports the usual suspects - scikit-learn, TensorFlow, PyTorch - plus some obscure frameworks only you use.
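You can dodge at least some of that pain by pinning requirements and logging a signature when you save the model. A hedged sketch using a throwaway scikit-learn model - the pinned versions are examples, not a recommendation:

```python
import mlflow
from mlflow.models import infer_signature
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
model = RandomForestClassifier(n_estimators=100).fit(X, y)

with mlflow.start_run():
    # The signature records the input/output schema so serving can validate requests
    signature = infer_signature(X, model.predict(X))

    # Explicitly pinned pip requirements beat whatever MLflow auto-detects
    mlflow.sklearn.log_model(
        model,
        "model",
        signature=signature,
        pip_requirements=["scikit-learn==1.3.1", "numpy>=1.24,<2"],
    )
```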

MLflow Model Registry is where models go to die in "staging" limbo. Version control works fine, but the approval workflow is basic - you'll end up writing your own CI/CD integration anyway. The stage transition API is straightforward if you enjoy writing custom deployment scripts.
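The registry calls themselves are trivial; everything around them (approvals, CI/CD, rollback) is what you end up building. A sketch with placeholder model and run names - note that newer MLflow releases push aliases instead of the old stage transitions:

```python
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register a model version from a finished run (the run ID is a placeholder)
version = mlflow.register_model("runs:/abc123def456/model", "churn-model")

# Old-style stage transition - simple, but deprecated in recent releases
client.transition_model_version_stage(
    name="churn-model", version=version.version, stage="Staging"
)

# Newer alias-based flow, if your MLflow version supports it
client.set_registered_model_alias("churn-model", "challenger", version.version)
```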

MLflow Projects is the piece nobody fucking uses, because YAML config for reproducibility is about as reliable as weather forecasting. Most teams just use Docker containers and call it a day. Projects are supposed to solve environment consistency, but good luck getting the same CUDA version across different machines.
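For reference, this is roughly what running a Project looks like if you ever bother - the repo URI and parameters are hypothetical:

```python
import mlflow

# Runs the "main" entry point declared in the project's MLproject file.
# env_manager="local" skips conda/virtualenv recreation (usually the part
# that breaks) and just uses whatever environment you're already in.
submitted = mlflow.projects.run(
    uri="https://github.com/your-org/your-ml-project",  # hypothetical repo
    entry_point="main",
    parameters={"alpha": 0.5},
    env_manager="local",
)
print(submitted.run_id)
```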

MLflow 3.0 and the GenAI Gold Rush

MLflow 3.0 launched June 11, 2025, finally acknowledging that traditional ML tracking doesn't work for LLM chaos. Seven years after the original 2018 release, they admitted that prompt engineering isn't just "hyperparameter tuning with words." As of August 2025, MLflow 3.3.0 includes Model Registry Webhooks and enhanced GenAI capabilities.

Finally, they're fixing actual problems instead of adding more YAML bullshit:

  • Enhanced Tracing, because debugging why your LangChain agent made 47 API calls to generate "Hello" is a nightmare
  • LLM Evaluation with metrics for hallucination detection - finally someone admits models lie constantly
  • Prompt Management for versioning, because "make it more creative" isn't a reproducible instruction
  • A LoggedModel Entity that groups everything together, since GenAI workflows are inherently messier

Traditional metrics like accuracy mean jackshit when your model generates different outputs every time - yes, even with temperature=0, because deterministic output just isn't how LLMs behave in practice. It only took MLflow 7 years to figure this out.

The "Free" vs "Just Pay Databricks" Decision

Open Source MLflow is "free" until you factor in:

  • Database hosting costs (PostgreSQL isn't free on AWS RDS)
  • Artifact storage bills (S3 costs add up fast with large models)
  • DevOps time setting up servers, monitoring, backups
  • The inevitable weekend outage when your tracking server dies

Budget at least $500/month in AWS costs plus 20 hours of engineering time for a proper production setup. And that's if you're lucky - when our RDS instance crashed because we didn't set up connection pooling properly, it took us 2 days to get tracking back online.

Managed MLflow on Databricks costs money upfront but includes:

  • Automatic scaling (no more crashed servers during experiment storms)
  • Enterprise security that actually works
  • Unity Catalog integration for proper permissions
  • Someone else's problem when shit breaks during dinner

The managed version makes sense if your team's time is worth more than fighting infrastructure. When you're getting paged at midnight because the tracking server's out of disk space again, that $1000/month suddenly looks reasonable.

What Actually Works (And What Breaks)

Reading about MLflow's components is one thing - knowing which ones actually work when you're dealing with real data and real deadlines is completely different. Here's what I've learned after running MLflow in production for 2 years.

Experiment Tracking That Actually Saves Your Ass

MLflow's tracking is the one feature that consistently works. It auto-logs environment details like Python versions and Git hashes, which sounds boring until you spend 3 hours debugging why last month's model won't run. The metadata logging saved my team's sanity when we discovered our "breakthrough" results were due to scikit-learn 1.2.0 vs 1.3.1.
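Most of that environment metadata shows up for free if you flip on autologging before training. A minimal sketch:

```python
import mlflow
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Autologging captures params, metrics, the fitted model, and environment
# details (library versions, git commit when available) without any explicit
# log_* calls in the training code.
mlflow.sklearn.autolog()

X, y = load_iris(return_X_y=True)

with mlflow.start_run():
    LogisticRegression(max_iter=200).fit(X, y)
```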

Real-time Monitoring works if you remember to call mlflow.log_metric() in your training loop. The web UI refreshes every few seconds, which is great for 10-minute training runs but painful for 12-hour deep learning jobs. Learn from my pain: log every 100 steps, not every batch, or you'll DDoS your own tracking server. I learned this the hard way when my BERT training logged 50,000 metrics and crashed the UI for everyone.
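The fix is boring: throttle your logging and pass an explicit step. A sketch with a stubbed-out training loop standing in for your real one:

```python
import random

import mlflow

LOG_EVERY_N_STEPS = 100  # log every 100 steps, not every batch


def train_one_batch(step: int) -> float:
    """Stand-in for your real training step; returns a fake loss."""
    return 1.0 / (step + 1) + random.random() * 0.01


with mlflow.start_run():
    for step in range(10_000):
        loss = train_one_batch(step)
        if step % LOG_EVERY_N_STEPS == 0:
            # `step` gives the UI a proper x-axis and keeps metric volume sane
            mlflow.log_metric("train_loss", loss, step=step)
```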

Artifact Management is where your AWS bill explodes. MLflow stores everything in S3, Azure Blob, or GCS, which sounds great until you realize each model checkpoint is 2GB and you're running 50 experiments. I got hit with some insane S3 bill - I think it was like $750 or $850, somewhere around there - because we forgot to set lifecycle policies and were storing every checkpoint from a 6-month hyperparameter sweep. Budget $200/month minimum for artifact storage if you're doing serious deep learning. The UI lets you download artifacts, assuming you enjoy waiting 10 minutes for large files.
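Lifecycle policies are the cheap fix. A hedged boto3 sketch that expires old checkpoints after 30 days - the bucket name, prefix, and retention window are placeholders you'd tune for your own setup:

```python
import boto3

s3 = boto3.client("s3")

# Expire everything under the checkpoints prefix after 30 days.
# Bucket and prefix are placeholders - match them to your artifact root.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-mlflow-artifacts",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-checkpoints",
                "Filter": {"Prefix": "mlflow-artifacts/checkpoints/"},
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)
```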

Model Deployment (Your Infrastructure Team Will Hate You)

MLflow can serve models as REST APIs, batch jobs, or streaming apps, but don't expect magic. The mlflow models serve command works great for demos, breaks spectacularly in production. I spent way too long trying to get it working - probably 10 days of pain - with proper health checks and load balancing before giving up and writing our own Flask wrapper. You'll end up writing your own Docker containers and Kubernetes manifests anyway.
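Our wrapper wasn't fancy - roughly this shape, loading the packaged model through pyfunc and bolting on the health check endpoint we needed. The model URI is a placeholder:

```python
import mlflow.pyfunc
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)

# Placeholder URI - point at a registry model or a runs:/ path
MODEL_URI = "models:/churn-model/1"
model = mlflow.pyfunc.load_model(MODEL_URI)


@app.route("/health")
def health():
    # Something for the load balancer to poke
    return jsonify(status="ok")


@app.route("/predict", methods=["POST"])
def predict():
    df = pd.DataFrame(request.get_json())
    preds = model.predict(df)
    return jsonify(predictions=preds.tolist())


if __name__ == "__main__":
    # Put gunicorn (or similar) in front of this for anything resembling production
    app.run(host="0.0.0.0", port=8080)
```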

Multi-cloud Flexibility means MLflow generates deployment artifacts for AWS SageMaker, Azure ML, and Google Vertex AI. In practice, each platform has its own gotchas and you'll spend days debugging environment differences. I wasted a full week trying to get the same model working on both SageMaker and Vertex AI - turns out they expected different conda environment formats. The "standardized" format is more of a suggestion.

A/B Testing Support is basically non-existent. MLflow can version models, but traffic splitting and rollback logic? That's your job. Most teams end up building their own feature flags and monitoring on top of whatever MLflow gives them.

Why Databricks Wants Your Money (And It Might Be Worth It)

The managed version costs real money but solves real problems:

Enterprise Security that actually works includes RBAC that doesn't require a PhD in Kubernetes. Corporate SSO integration means no more shared service accounts. Unity Catalog provides actual access controls instead of "everyone can see everything" like open source MLflow.

Automatic Scaling means your tracking server doesn't crash when the entire data science team runs hyperparameter sweeps simultaneously. I learned this lesson when 500 concurrent experiments brought down our single EC2 instance for 6 hours, and suddenly our VP was asking why no one could work. That's when I started looking at the managed option real seriously.

Databricks Integration is the real value prop if you're already in their ecosystem. Notebooks can log to MLflow without configuration hell, and Unity Catalog tracks lineage from raw data to deployed model. If you're not on Databricks already, this integration doesn't help you.

GenAI Features (Because Traditional ML Metrics Are Useless Here)

MLflow 3.0 finally acknowledges that LLM development is chaos:

LLM Gateway unifies access to OpenAI, Anthropic, and Hugging Face APIs with cost tracking. Switching between providers is easy until you realize each has different rate limits, prompt formatting, and random failure modes. The cost tracking is useful when your bill hits $1000/month in API calls.

Prompt Engineering tools version your prompts, which is great until you have 47 variations of "be more helpful" and can't remember which one actually worked. The A/B testing requires you to build your own evaluation harness - MLflow just stores the results.

Agent Evaluation includes hallucination detection and factual accuracy scoring. These metrics are better than nothing but still struggle with subjective outputs. When your agent responds "I don't know" vs making shit up, which is better? Depends on your use case.

The tracing capabilities help debug why your LangChain agent took 47 steps to answer a simple question. Essential for complex workflows, assuming you can interpret the trace visualization without going insane.
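If you're on a recent MLflow release with tracing support (and have LangChain installed - otherwise drop the autolog line), instrumentation is mostly a one-liner plus a decorator for your own glue code. A hedged sketch:

```python
import mlflow

# Auto-instrument LangChain so each chain/agent step shows up as a trace span
mlflow.langchain.autolog()


@mlflow.trace
def preprocess(question: str) -> str:
    # Your own glue code gets its own span in the trace tree
    return question.strip().lower()


@mlflow.trace
def answer(question: str) -> str:
    cleaned = preprocess(question)
    # call your chain/agent here; stubbed out so the sketch stands alone
    return f"echo: {cleaned}"


print(answer("  Why did the agent make 47 API calls?  "))
```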

Who Actually Uses What (And Why They Regret It)

| Reality Check | MLflow OSS | Managed MLflow | W&B | Neptune | Kubeflow |
|---|---|---|---|---|---|
| Time to Setup | 2 days if lucky | 30 minutes | 5 minutes | 1 hour | 2 weeks minimum |
| Monthly Bill | $500+ in AWS | $1,000+ | $200+ per user | $176+ per user | $2,000+ in compute |
| When It Breaks | Your problem | Their problem | Their problem | Their problem | Definitely your problem |
| Learning Curve | Moderate | Easy | Easy | Steep | Vertical cliff |
| Vendor Lock-in | None | High | High | Medium | None |

Real Questions Engineers Ask (With Honest Answers)

Q: Why does the MLflow UI become unusably slow with large experiments?

A: The SQLite backend that everyone starts with can't handle thousands of runs. You need PostgreSQL or MySQL for anything serious. Even then, the web UI struggles with complex queries on 10,000+ experiments. Solution: use a proper database and learn to filter experiments in the UI.
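When the UI chokes, querying through the API with a filter is usually faster anyway. A sketch with placeholder experiment, metric, and parameter names:

```python
import mlflow

# Point at the tracking server backed by PostgreSQL (placeholder URI)
mlflow.set_tracking_uri("http://mlflow.internal.example.com:5000")

# Pull only the runs you care about instead of scrolling through thousands in the UI
runs = mlflow.search_runs(
    experiment_names=["churn-model"],
    filter_string="metrics.val_auc > 0.85 and params.model_type = 'rf'",
    order_by=["metrics.val_auc DESC"],
    max_results=20,
)
print(runs[["run_id", "metrics.val_auc"]])
```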

Q: How much will MLflow actually cost me in production?

A: Budget anywhere from $500-2,000/month, maybe more if you're unlucky, for AWS/GCP hosting: the database runs $50-200/month, artifact storage explodes with large models ($200-1,000/month), and the tracking server itself needs $100-500/month in compute. Managed MLflow starts around $1,000/month but includes scaling and support.

Q: Does MLflow work with Python 3.11/3.12 yet?

A: MLflow supports Python 3.8-3.11 as of late 2025. Python 3.12 support is still experimental and has known issues with the databricks-cli dependency. If you deployed models with 3.8, they won't run on 3.11 servers without Docker containers. Version compatibility is a constant pain point.

Q: What happens when MLflow deployment breaks in production?

A: Model serving with mlflow models serve is great for demos, terrible for production. No health checks, no graceful shutdowns, no proper logging. You'll need to wrap it in Docker with proper monitoring. Most teams end up building custom Flask/FastAPI servers and just use MLflow for packaging.

Q: Can I migrate experiments from Weights & Biases to MLflow?

A: There's no official migration tool. You'll need to write custom scripts using both APIs. Expect to lose visualization configs and some metadata. The W&B export API helps, but plan for 2-3 days of data wrangling.

Q: Why does MLflow Model Registry feel so basic compared to alternatives?

A: Because it is basic. No automated model validation, no deployment pipelines, no proper approval workflows. It's a version-controlled file store with a UI. Neptune and W&B have better registry features if you need enterprise workflows.

Q: Should I use MLflow or just stick with Weights & Biases?

A: MLflow is free and doesn't lock you into a platform, but W&B has better visualizations and team features. If you're a solo developer or small team, MLflow is fine. If you need rich dashboards and collaboration, W&B is worth the $200/month. Neptune falls somewhere between.

Q: Does MLflow actually work for LLM projects?

A: MLflow 3.0 added LLM support in June 2025, with MLflow 3.3.0 adding Model Registry Webhooks in August 2025, but it's still catching up to specialized tools. Prompt versioning works, but evaluation is basic. For serious LLM development, consider Weights & Biases, LangSmith, or Helicone for better observability.

Q: What's the biggest MLflow gotcha that will bite me?

A: Artifact storage costs. Your first S3 bill will be a wake-up call when you realize MLflow stores every model checkpoint, dataset, and plot by default. I got hit with a bill north of $1100 one month because our hyperparameter sweep was saving 200MB checkpoints every epoch. Set lifecycle policies on your S3 buckets and be selective about what you log. That 2GB model saved 100 times adds up fast.

Q: When should I NOT use MLflow?

A: If you're doing simple batch ML with minimal collaboration, a spreadsheet might be sufficient. If you need advanced workflow orchestration, Kubeflow or Metaflow are better choices. If you're primarily doing LLM work, specialized LLM tools often work better.

Q: How do I avoid making MLflow a single point of failure?

A: Don't run MLflow on a single EC2 instance in production like we did. Use managed databases (RDS), load balancers, and proper monitoring. Or just pay for Managed MLflow and let Databricks handle the operational headaches. A 6-hour outage when your tracking server dies teaches you this lesson quickly - especially when your CEO asks why no one can work.
