SQLite: The Friendly Neighborhood Database That Hates Your Team
MLflow ships with SQLite because it's simple. Simple like a hand grenade with the pin pulled.
First day with our new ML hire, Jenny. Mike asks her: "Hey, can you run this experiment while I finish my hyperparameter sweep?"
Five minutes later: sqlite3.OperationalError: database is locked.
MLflow just sat there, smugly refusing to accept any new experiments until whatever was holding the lock finished. Turned out Mike's sweep was going to run for 6 hours. Jenny went home.
PostgreSQL fixes this because it actually knows how to handle multiple writers. Copy this and never look back:
pip install psycopg2-binary
export MLFLOW_BACKEND_STORE_URI="postgresql://mlflow_user:password@localhost/mlflow"
The PostgreSQL docs are boring but thorough. Unlike MLflow's documentation, which assumes you're running everything on localhost forever.
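If you want proof the lock problem is actually gone before the whole team piles back on, here's a minimal concurrency smoke test - a sketch, not gospel; the URI is the placeholder one from above and the run names are made up:

import multiprocessing

import mlflow

def log_one(i):
    ## Each worker talks to the backend directly. Point this at SQLite first
    ## if you want to watch "database is locked" happen, then at PostgreSQL.
    mlflow.set_tracking_uri("postgresql://mlflow_user:password@localhost/mlflow")
    with mlflow.start_run(run_name=f"concurrency-check-{i}"):
        mlflow.log_param("worker", i)
        mlflow.log_metric("dummy_metric", float(i))

if __name__ == "__main__":
    ## Four concurrent writers: enough to trip SQLite, a non-event for PostgreSQL
    with multiprocessing.Pool(4) as pool:
        pool.map(log_one, range(8))

If every run lands without an OperationalError, Mike's sweep and Jenny's experiment can finally coexist.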
Storage Costs: The Bill That Made Finance Call a Meeting
Someone on our team thought logging the entire 50GB training dataset as an artifact was a good idea. "For reproducibility," they said.
AWS disagreed. So did our CFO when the bill jumped from $200 to $4,100 in one month.
MLflow logs everything by default: model checkpoints, datasets, failed experiments, temporary files you forgot about. It's like having a packrat in your cloud storage - everything seems important until you see the invoice.
Delete old experiment artifacts or go bankrupt. Your choice:
from datetime import datetime, timedelta

import mlflow

## Nuclear option: delete experiments older than 90 days
cutoff_ms = (datetime.now() - timedelta(days=90)).timestamp() * 1000

client = mlflow.tracking.MlflowClient()
## search_experiments() replaces list_experiments(), which is gone in MLflow 2.x
for exp in client.search_experiments():
    if exp.creation_time and exp.creation_time < cutoff_ms:
        ## Soft delete: the experiment moves to the "deleted" lifecycle stage,
        ## but its artifacts stay in the artifact store until cleaned up separately
        client.delete_experiment(exp.experiment_id)
Set up S3 lifecycle policies before you deploy, not after the damage is done.
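If you'd rather have AWS do the janitorial work, a lifecycle rule on the artifact bucket expires old objects automatically. A minimal boto3 sketch - the bucket name and the mlflow/ prefix are assumptions, match them to your --default-artifact-root:

import boto3

s3 = boto3.client("s3")

## Expire artifacts after 90 days; abandon half-finished multipart uploads after 7
s3.put_bucket_lifecycle_configuration(
    Bucket="my-mlflow-artifacts",  ## hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-mlflow-artifacts",
                "Status": "Enabled",
                "Filter": {"Prefix": "mlflow/"},  ## assumed artifact prefix
                "Expiration": {"Days": 90},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)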
Kubernetes Networking: Welcome to YAML Hell
Pod running fine? Check. Service created? Check. Ingress configured? Check. MLflow still returning connection refused? Welcome to Kubernetes.
My personal record for debugging a networking issue: 6 hours to discover that our NetworkPolicy was blocking traffic between the MLflow pod and PostgreSQL. The error message? "Connection timed out." Helpful.
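Before blaming MLflow, prove whether packets can reach the database at all. kubectl exec into the MLflow pod and run something like this - the service hostname is an assumption, use whatever your PostgreSQL Service is actually called:

import socket

## Hypothetical in-cluster hostname for the PostgreSQL Service
host, port = "postgres.mlflow.svc.cluster.local", 5432

try:
    ## If a NetworkPolicy is eating the traffic, this hangs and then times out,
    ## which is exactly the useless "Connection timed out" MLflow shows you
    with socket.create_connection((host, port), timeout=5):
        print(f"TCP connection to {host}:{port} succeeded")
except OSError as exc:
    print(f"Cannot reach {host}:{port}: {exc}")

If that times out, stop staring at MLflow logs and go read your NetworkPolicies.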
This YAML will save you from the worst of it:
## Don't put MLflow in default namespace unless you hate yourself
apiVersion: v1
kind: Namespace
metadata:
  name: mlflow
  labels:
    name: mlflow
Persistent volumes are not optional. I learned this when our MLflow pod restarted and took 3 months of experiment history with it. The team was not pleased.
## Your data WILL disappear without this
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mlflow-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
Fun discovery: MLflow breaks if your PostgreSQL username contains spaces. Spent 2 hours debugging this during a client demo because we had a user "ml flow admin" (with a space). The error message just says "authentication failed" - no mention of the space character issue.
Another gotcha: The MLflow docs say you can use any PostgreSQL port, but if you use anything other than 5432, the health checks break in Kubernetes deployments. The health check endpoint hardcodes the port and doesn't respect your configuration. This cost me a weekend of debugging why pods kept restarting.
Security: Your MLflow Server is Wide Open Right Now
MLflow's default security model: "What's security?"
First week after deploying our tracking server, I get a Slack message from security: "Hey, why can I see your ML experiments from the coffee shop WiFi?"
Turns out MLflow has no authentication. None. Your experiment data, model artifacts, hyperparameters, everything - publicly accessible to anyone with the URL.
Basic auth fixes this in 5 minutes:
## Create password file
htpasswd -c /etc/nginx/htpasswd username

## Nginx config
location / {
    auth_basic "MLflow";
    auth_basic_user_file /etc/nginx/htpasswd;
    proxy_pass http://mlflow-server:5000;
}
OAuth2 proxy works better when your company uses real SSO. But basic auth beats no auth every single time.
Kubernetes secrets keep passwords out of Git. This seems obvious until you see a Docker image with hardcoded AWS keys in production:
apiVersion: v1
kind: Secret
metadata:
  name: mlflow-secrets
type: Opaque
stringData:
  db-password: "your-actual-password-here"
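If you launch the server from a small wrapper script, you can assemble the backend URI from environment variables that Kubernetes injects from that Secret. A sketch under assumed env var names (MLFLOW_DB_USER, MLFLOW_DB_PASSWORD, MLFLOW_DB_HOST - wire them up with valueFrom/secretKeyRef in your Deployment):

import os
from urllib.parse import quote

## Env var names are assumptions; Kubernetes fills them from the Secret above,
## so the password never lands in Git or in the image
user = quote(os.environ["MLFLOW_DB_USER"], safe="")
password = quote(os.environ["MLFLOW_DB_PASSWORD"], safe="")
host = os.environ.get("MLFLOW_DB_HOST", "postgres")

## Percent-encoding also defuses the "ml flow admin" space gotcha from earlier
backend_store_uri = f"postgresql://{user}:{password}@{host}/mlflow"
print(backend_store_uri)  ## hand this to mlflow server --backend-store-uri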
PostgreSQL's default settings will ruin your day. Default max_connections is 100, which sounds like a lot until 15 data scientists run hyperparameter sweeps simultaneously. Your database will start rejecting connections with cryptic error messages.
The PostgreSQL performance tuning guide in the references below covers the settings that actually matter for MLflow workloads. The defaults were written for 1990s hardware.
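A quick way to see how close you are to that ceiling before the next sweep starts bouncing off it - the connection details are placeholders:

import psycopg2

## Placeholder DSN; point it at the MLflow backend database
conn = psycopg2.connect("dbname=mlflow user=mlflow_user password=changeme host=localhost")
with conn, conn.cursor() as cur:
    cur.execute("SHOW max_connections;")
    max_conn = int(cur.fetchone()[0])
    cur.execute("SELECT count(*) FROM pg_stat_activity;")
    in_use = cur.fetchone()[0]
    print(f"{in_use}/{max_conn} connections in use")
conn.close()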
Essential References for Real Deployments:
- PostgreSQL performance tuning - the settings that actually matter for MLflow workloads
- Kubernetes resource management - avoid the OOMKilled disasters
- Azure Blob lifecycle management - prevent storage cost surprises
- AWS S3 lifecycle policies - same thing for AWS shops
- MLflow database backends - why SQLite fails and what works
- Kubernetes networking concepts - debug the inevitable networking failures
- Helm chart best practices - avoid config management hell
- OAuth2 proxy setup - basic authentication that doesn't suck
- MLflow scaling patterns - architecture that works beyond toy examples
- Kubernetes secrets management - don't commit credentials to Git like an amateur