The MLflow tracking server works fine until you hit real production scale. Then the database becomes your bottleneck and the UI turns into a slideshow. The deployment docs mention scaling considerations but don't cover the painful realities. Here's how to fix the performance disasters before they kill your productivity.
The SQLite Trap That Kills Teams
Every team starts with SQLite because it's the default and "just works" until it fucking doesn't. We ran SQLite for six months until Black Friday, when our A/B testing models decided to train simultaneously and SQLite completely shit the bed. Three hours of `sqlite3.OperationalError: database is locked` errors while our website served completely random product recommendations to customers. Fun times explaining to the CEO why our ML system was recommending dog food to people buying laptops.
The Nuclear Option: Migrate to PostgreSQL immediately. Don't suffer through months of database locks. The database migration guide explains the process, and PostgreSQL performance tuning becomes essential at scale.
## Back up mlruns.db first, then bring the SQLite schema up to date
mlflow db upgrade sqlite:///mlruns.db
## Set up PostgreSQL properly
mlflow server \
--backend-store-uri postgresql://mlflow_user:password@postgres:5432/mlflow \
--default-artifact-root s3://your-bucket/mlflow-artifacts \
--host 0.0.0.0
The database backend documentation covers the migration, but doesn't mention how painful it is with existing data. Check the PostgreSQL documentation for proper database administration practices.
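That pain comes from the fact that `mlflow db upgrade` only touches the schema; MLflow's CLI doesn't copy existing runs from one backend into another (the community mlflow-export-import project is the usual answer). If you just need a rough one-off copy of your history, a sketch like the following works, assuming an MLflow 2.x client and network access to both stores. Params, tags, and metric history come over; artifacts stay at the existing artifact root, and runs get new IDs in the new store.

```python
# Hypothetical one-off copy script: old SQLite store -> new PostgreSQL store.
from mlflow.tracking import MlflowClient

src = MlflowClient(tracking_uri="sqlite:///mlruns.db")
dst = MlflowClient(tracking_uri="postgresql://mlflow_user:password@postgres:5432/mlflow")

# Reuse experiments that already exist in the destination (e.g. "Default")
existing = {e.name: e.experiment_id for e in dst.search_experiments()}

for exp in src.search_experiments():
    dst_exp_id = existing.get(exp.name) or dst.create_experiment(exp.name)
    # Grabs at most 1000 runs per experiment; add page_token handling for more.
    for run in src.search_runs(experiment_ids=[exp.experiment_id], max_results=1000):
        copy = dst.create_run(experiment_id=dst_exp_id,
                              start_time=run.info.start_time,
                              tags=run.data.tags)
        for key, value in run.data.params.items():
            dst.log_param(copy.info.run_id, key, value)
        for key in run.data.metrics:
            # Copy the full per-step history, not just the latest value
            for m in src.get_metric_history(run.info.run_id, key):
                dst.log_metric(copy.info.run_id, key, m.value,
                               timestamp=m.timestamp, step=m.step)
        dst.set_terminated(copy.info.run_id, status=run.info.status,
                           end_time=run.info.end_time)
```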
PostgreSQL Tuning That Actually Matters
Once you're on PostgreSQL, you need to tune it for MLflow's access patterns. The default PostgreSQL config assumes you're running a web app, not logging thousands of machine learning experiments.
Essential Indexes: MLflow's schema is missing key indexes that become critical at scale:
-- These should exist but don't by default
CREATE INDEX CONCURRENTLY idx_runs_experiment_id_start_time ON runs(experiment_id, start_time DESC);
CREATE INDEX CONCURRENTLY idx_metrics_run_uuid_key ON metrics(run_uuid, key);
CREATE INDEX CONCURRENTLY idx_params_run_uuid_key ON params(run_uuid, key);
CREATE INDEX CONCURRENTLY idx_tags_run_uuid_key ON tags(run_uuid, key);
-- For the UI queries that become slow
CREATE INDEX CONCURRENTLY idx_runs_status ON runs(status);
CREATE INDEX CONCURRENTLY idx_runs_name ON runs(name);
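One gotcha: `CREATE INDEX CONCURRENTLY` refuses to run inside a transaction block, so a migration tool that wraps statements in a transaction (or psycopg2 with its default implicit transaction) will error out. Here's a sketch of applying the statements above from Python with autocommit; psycopg2 and the connection string are assumptions, not anything MLflow requires:

```python
# Apply the indexes outside a transaction; IF NOT EXISTS makes the script re-runnable.
import psycopg2

INDEXES = [
    "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_runs_experiment_id_start_time "
    "ON runs(experiment_id, start_time DESC)",
    "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_metrics_run_uuid_key ON metrics(run_uuid, key)",
    "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_params_run_uuid_key ON params(run_uuid, key)",
    "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_tags_run_uuid_key ON tags(run_uuid, key)",
    "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_runs_status ON runs(status)",
    "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_runs_name ON runs(name)",
]

conn = psycopg2.connect("postgresql://mlflow_user:password@postgres:5432/mlflow")
conn.autocommit = True  # CONCURRENTLY cannot run inside a transaction block
with conn.cursor() as cur:
    for stmt in INDEXES:
        cur.execute(stmt)
conn.close()
```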
Connection Pooling: MLflow doesn't handle database connections well under load. Use pgbouncer to prevent connection exhaustion. The connection pooling guide explains PostgreSQL connection management, and the pgbouncer documentation covers configuration options.
## pgbouncer config
[databases]
mlflow = host=postgres port=5432 dbname=mlflow user=mlflow_user
[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 50
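Then point `--backend-store-uri` at PgBouncer instead of PostgreSQL directly (its default listen port is 6432), and do the same for any training jobs that use a database URI as their tracking URI. A quick client-side smoke test, with the `pgbouncer` hostname as a placeholder:

```python
# Log through PgBouncer instead of hitting PostgreSQL directly.
import mlflow

mlflow.set_tracking_uri("postgresql://mlflow_user:password@pgbouncer:6432/mlflow")

with mlflow.start_run():
    mlflow.log_metric("pgbouncer_smoke_test", 1.0)
```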
Memory and Storage Issues That Blindside You
Artifact Storage Explosion: Teams forget that MLflow keeps every artifact forever unless you tell it otherwise. Our S3 bill exploded from $200 to $2,800 in one month because some genius was logging full 50GB training datasets with every fucking experiment. 50GB × 200 experiments = 10TB of S3 and one finance team ready to murder whoever set up MLflow. The artifact storage documentation covers different backends, while AWS cost optimization helps manage the financial bleeding.
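The cheaper fix is to stop logging datasets as artifacts in the first place: log a pointer and a fingerprint so the run stays traceable without another 50GB copy. A sketch follows; the S3 path and the MD5 convention are just illustrations, not anything MLflow enforces:

```python
# Record where the training data lives and what it contained, not the data itself.
import hashlib
import mlflow

def file_md5(path, chunk_size=8 * 1024 * 1024):
    """Hash the local copy of the dataset so the run is traceable to exact data."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

with mlflow.start_run():
    mlflow.log_param("dataset_uri", "s3://your-bucket/datasets/train-2024-05-01.parquet")
    mlflow.log_param("dataset_md5", file_md5("/data/train-2024-05-01.parquet"))
```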
For the artifacts you do keep, set up S3 lifecycle policies immediately. The rule below is scoped to the artifact prefix from the server command and expires objects after 2,555 days (roughly seven years, for compliance):
{
  "Rules": [{
    "ID": "mlflow-artifact-lifecycle",
    "Status": "Enabled",
    "Filter": { "Prefix": "mlflow-artifacts/" },
    "Transitions": [{
      "Days": 30,
      "StorageClass": "STANDARD_IA"
    }, {
      "Days": 365,
      "StorageClass": "GLACIER"
    }],
    "Expiration": {
      "Days": 2555
    }
  }]
}
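The policy does nothing until it's attached to the bucket. A sketch with boto3, assuming the bucket name from the server command and IAM permission to change the lifecycle configuration:

```python
# Attach the lifecycle policy to the artifact bucket.
import boto3

lifecycle = {
    "Rules": [{
        "ID": "mlflow-artifact-lifecycle",
        "Status": "Enabled",
        "Filter": {"Prefix": "mlflow-artifacts/"},
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 365, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": 2555},  # ~7 years for compliance
    }]
}

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="your-bucket",
    LifecycleConfiguration=lifecycle,
)
```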
Database Growth: The metrics table balloons because every logged value is its own row. If you're logging metrics every epoch for 1,000 experiments, you're looking at millions of rows fast.
Regular cleanup is essential:
-- Delete old metric rows for throwaway experiments (be careful!)
DELETE FROM metrics WHERE run_uuid IN (
  SELECT run_uuid FROM runs
  -- start_time is stored as epoch milliseconds, not a timestamp
  WHERE start_time < EXTRACT(EPOCH FROM NOW() - INTERVAL '1 year') * 1000
  AND experiment_id IN (SELECT experiment_id FROM experiments WHERE name LIKE '%test%')
);
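Raw DELETEs also bypass MLflow entirely and leave params, tags, and artifacts orphaned. The safer route is to soft-delete old runs through the API and then let `mlflow gc` purge them (artifacts included, for supported artifact stores). A sketch using the same one-year / `%test%` cutoff, assuming an MLflow version with `search_experiments` and the `gc` command:

```python
# Soft-delete old runs in test experiments; `mlflow gc` then removes them for good.
import time
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="postgresql://mlflow_user:password@postgres:5432/mlflow")
cutoff_ms = int((time.time() - 365 * 24 * 3600) * 1000)  # one year ago, epoch millis

for exp in client.search_experiments():
    if "test" not in exp.name.lower():
        continue
    old_runs = client.search_runs(
        experiment_ids=[exp.experiment_id],
        filter_string=f"attributes.start_time < {cutoff_ms}",
        max_results=1000,
    )
    for run in old_runs:
        client.delete_run(run.info.run_id)

# Then, from a shell:
#   mlflow gc --backend-store-uri postgresql://mlflow_user:password@postgres:5432/mlflow
```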
UI Performance Fixes That Actually Work
The MLflow UI becomes unusable around 10,000 experiments because it loads everything at once. The UI performance issues have been known for years but aren't really fixed. Check the MLflow GitHub discussions for community workarounds and performance optimization strategies.
Pagination Workaround: Force pagination by limiting experiment queries:
import mlflow
client = mlflow.tracking.MlflowClient()
experiment_id = "your-experiment-id"
## Don't load all runs at once
runs = client.search_runs(
    experiment_ids=[experiment_id],
    max_results=100,  # one page at a time
    order_by=["start_time DESC"]
)
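For anything bigger than one page, loop on the page token that `search_runs` hands back instead of cranking up `max_results`. A sketch, with the experiment ID as a placeholder:

```python
# Iterate runs page by page using the token returned with each batch.
import mlflow

client = mlflow.tracking.MlflowClient()
experiment_id = "your-experiment-id"  # placeholder

page_token = None
while True:
    page = client.search_runs(
        experiment_ids=[experiment_id],
        max_results=100,
        order_by=["start_time DESC"],
        page_token=page_token,
    )
    for run in page:
        print(run.info.run_id, run.info.status)
    page_token = page.token
    if not page_token:
        break
```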
Search Performance: The search functionality is garbage with large datasets. Use database queries directly:
-- Faster than UI search
SELECT run_uuid, name FROM runs
WHERE experiment_id = 42  -- the numeric experiment ID, not the experiment name
AND name ILIKE '%model-name%'
ORDER BY start_time DESC
LIMIT 50;
Infrastructure Patterns That Prevent Disasters
Separate Read/Write Instances: Back the UI with a PostgreSQL read replica so heavy browsing queries don't compete with experiment logging on the primary:
## Write instance for experiment tracking
mlflow server \
--backend-store-uri postgresql://user:pass@postgres-master:5432/mlflow \
--host 0.0.0.0 --port 5000
## Read-only instance for UI browsing
mlflow server \
--backend-store-uri postgresql://user:pass@postgres-replica:5432/mlflow \
--host 0.0.0.0 --port 5001
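Then split traffic deliberately: training jobs log to the write instance, while dashboards and ad-hoc analysis query the replica-backed one. A sketch with made-up internal hostnames:

```python
# Training code logs to the write instance...
import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("http://mlflow-write.internal:5000")  # hypothetical hostname
with mlflow.start_run(run_name="nightly-retrain"):
    mlflow.log_metric("val_auc", 0.91)

# ...while reporting code reads from the replica-backed instance.
reader = MlflowClient(tracking_uri="http://mlflow-read.internal:5001")  # hypothetical hostname
recent = reader.search_runs(experiment_ids=["0"], max_results=100)
```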
Connection Pooling in Training Code: Jobs that use a database URI as their tracking URI open their own connections, and the database hits its connection limit under load. Don't build a separate SQLAlchemy engine — MLflow never uses it; it constructs its own engine internally. Recent MLflow releases size that internal pool from environment variables instead:
## In your training code
import os
from mlflow.tracking import MlflowClient
## Pool settings read by MLflow's SQLAlchemy store (set before the first tracking call)
os.environ["MLFLOW_SQLALCHEMYSTORE_POOL_SIZE"] = "10"
os.environ["MLFLOW_SQLALCHEMYSTORE_MAX_OVERFLOW"] = "20"
os.environ["MLFLOW_SQLALCHEMYSTORE_POOL_RECYCLE"] = "3600"
client = MlflowClient(tracking_uri="postgresql://mlflow_user:password@postgres:5432/mlflow")
The reality is MLflow wasn't designed for the scale most teams need in production. These fixes help, but if you're hitting tens of thousands of experiments, consider whether the engineering overhead is worth it compared to managed alternatives like Weights & Biases or Neptune.ai.