Your BentoML model runs fine in development.
It handles your test data perfectly. Then you deploy to production and everything goes to shit.
Here's what the tutorials don't tell you: memory leaks from model batching will slowly consume RAM until everything crashes.
Your beautiful batching logic that processes 32 samples at once? It's leaking 50MB per batch. Set memory limits and restart containers nightly, or accept that you'll get paged every weekend.
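A cheap way to catch this before the pager does: track RSS after every batch and complain loudly once it climbs past a threshold. Here's a minimal sketch using psutil; `predict_batch`, the 6GB threshold, and `model.predict` are placeholders for your own service, not BentoML APIs.

```python
import logging

import psutil

RSS_ALARM_MB = 6000  # alarm well below an 8GB container limit

def predict_batch(model, batch):
    """Run one batch and log RSS so a leak shows up as a steady climb, not a crash."""
    result = model.predict(batch)  # stand-in for your real inference call
    rss_mb = psutil.Process().memory_info().rss / 1024 / 1024
    if rss_mb > RSS_ALARM_MB:
        logging.warning("RSS at %.0f MB after batch - probable leak", rss_mb)
    return result
```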
GPU out of memory errors hit differently in production. Models that work fine with batch size 1 will OOM with batch size 32. Your A100 instance costs $32/hour and your model takes 60+ seconds to load. Users will hate the cold starts. Use warm pools or accept the pain.
The BentoML docs are comprehensive but the examples are toy scenarios.
Real production deployment means debugging why your model randomly crashes at 2am (spoiler: it's always memory limits).
The production deployment guide covers the basics, but check GitHub issues for real-world problems. The observability docs show how to set up monitoring, and the GPU inference guide covers CUDA issues.
The $5000 AWS Bill That Taught Us Everything
Auto-scaling kicked in during a load test, spun up 20 A100 instances, and ran them for the whole weekend. Always set resource limits.
Here's the shit that actually breaks:
- Memory leaks: your model slowly consumes RAM. Set limits or restart nightly.
- Cold starts: 60+ second model loading times. Warm pools cost $200/month but prevent user rage.
- Batch size disasters: works fine with 1 sample, OOMs with 32. Test with production batch sizes.
- Monitoring noise: log every prediction and Prometheus storage grows to 500GB. Log samples, not everything (see the sketch after this list).
- Weekend crashes: batch jobs max out memory at 2am Saturday. Classic.
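For the monitoring-noise item: sampled logging is a few lines, not a project. A rough sketch of "log samples, not everything"; the 1% rate and logger name are arbitrary choices, not BentoML defaults.

```python
import logging
import random

logger = logging.getLogger("predictions")
SAMPLE_RATE = 0.01  # keep roughly 1% of predictions

def log_prediction(inputs, outputs, latency_ms):
    """Log a small random sample of requests instead of all of them."""
    if random.random() < SAMPLE_RATE:
        logger.info(
            "sampled prediction: n_inputs=%d latency_ms=%.1f outputs=%r",
            len(inputs), latency_ms, outputs,
        )
```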
The BentoML Slack community actually answers these questions, unlike most developer communities.
Also check Stack Overflow's BentoML questions, MLOps community discussions, and the BentoML blog for case studies.
The examples repository shows production LLM deployments.
What You Actually Need (The Honest List)
- Someone who can debug Kubernetes networking at 3am - because it will break on a Friday night. The K8s docs won't help when your pods can't reach each other.
- GPU budget reality check - A100 instances are $32/hour. Run the math: 24/7 is roughly $23k/month for one instance. BentoCloud pricing starts looking reasonable.
- Secrets management that isn't .env files - HashiCorp Vault if you hate yourself, AWS Secrets Manager if you want it to just work (see the sketch after this list). Kubernetes secrets are fine for small deployments.
- Monitoring that doesn't wake you up for bullshit - set alerts for model accuracy drops below 85%, response times over 200ms, and error rates above 1%. Everything else is noise.
- A CI/CD pipeline that actually works - GitHub Actions is fine, Jenkins is a nightmare to maintain, GitLab CI works if you're already on GitLab, and Azure DevOps is corporate garbage. Check the [GitHub Actions examples](https://github.com/bentoml/BentoML/tree/main/.github/workflows), the MLflow integration guide, and model registry patterns for automated deployments.
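For the secrets point: a minimal sketch of pulling credentials from AWS Secrets Manager with boto3 instead of a .env file. The secret name and key are made up for illustration; cache the lookup so you don't hit AWS on every request.

```python
import json
from functools import lru_cache

import boto3

@lru_cache(maxsize=None)
def get_secret(name: str) -> dict:
    """Fetch a secret once and cache it for the life of the process."""
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=name)
    return json.loads(response["SecretString"])

# Hypothetical secret: a JSON blob stored under "prod/sentiment-model"
api_key = get_secret("prod/sentiment-model")["api_key"]
```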
Production Configuration That Won't Bite You
Resource limits or die - this config prevents the weekend disaster:
```yaml
# bentofile.yaml - prevents your model from eating all memory
service: 'service:SentimentModel'
resources:
  memory: "8Gi"        # Hard limit - process dies at 8GB
  cpu: "4000m"         # 4 cores max
  gpu: 1               # T4 is $0.35/hour vs A100 $32/hour
  gpu_type: "nvidia-tesla-t4"
traffic:
  timeout: 30          # Don't wait 5 minutes for broken requests
  concurrency: 8       # Start low, scale up based on actual usage
python:
  requirements_txt: './requirements.txt'
  lock_packages: true  # Pin versions or upgrades will break everything
envs:
  - MAX_BATCH_SIZE=4   # Learned this the hard way
  - PROMETHEUS_METRICS=true
```
The official examples use toy resource allocations.
This config is based on what actually works in production.
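The MAX_BATCH_SIZE env var only matters if the service actually honors it. Here's a sketch of wiring it into adaptive batching with BentoML's 1.2-style service API; the `max_batch_size`/`max_latency_ms` parameters and the `load_model()` helper are assumptions to verify against your BentoML version.

```python
import os

import bentoml

MAX_BATCH_SIZE = int(os.getenv("MAX_BATCH_SIZE", "4"))

@bentoml.service(resources={"memory": "8Gi", "gpu": 1}, traffic={"timeout": 30})
class SentimentModel:
    def __init__(self) -> None:
        self.model = load_model()  # stand-in for however you actually load weights

    @bentoml.api(batchable=True, max_batch_size=MAX_BATCH_SIZE, max_latency_ms=100)
    def predict(self, texts: list[str]) -> list[float]:
        # Adaptive batching never builds a batch larger than MAX_BATCH_SIZE,
        # so a traffic spike can't push you back into OOM territory.
        return self.model.predict(texts)
```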
Health checks that actually detect problems:
```python
import bentoml

@bentoml.service
class SentimentModel:
    @bentoml.api
    def health(self) -> dict:
        """Actually test if the model works"""
        try:
            # Real test with actual model inference
            result = self.model.predict(["this is a test sentence"])
            return {"status": "ok", "model_loaded": True}
        except Exception as e:
            # Return 503 so the load balancer removes this instance
            raise bentoml.HTTPException(503, f"Model broken: {str(e)}")

    def on_shutdown(self):
        """Clean shutdown - finish current requests"""
        # Don't just kill the process, finish what you started
        pass
```
Most health checks are useless - they return 200 even when the model is broken.
This one actually tests inference.
Environment config based on painful experience:
```python
import os

ENV = os.getenv("ENVIRONMENT", "dev")

if ENV == "production":
    # Learned these limits from outages
    BATCH_TIMEOUT = 30         # Don't wait forever for batches
    LOG_LEVEL = "WARNING"      # INFO logs will fill your disk
    VALIDATE_INPUTS = True     # Users send garbage data
    MAX_REQUEST_SIZE = "10MB"  # Prevent abuse
else:
    # Dev can be messy
    BATCH_TIMEOUT = 300
    LOG_LEVEL = "DEBUG"
    VALIDATE_INPUTS = False
```
Production is paranoid for good reasons.
Users will try to send 100MB requests if you let them.
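VALIDATE_INPUTS and MAX_REQUEST_SIZE don't enforce themselves. A rough sketch of the input side using pydantic v2; the field names and limits are examples, not the real schema.

```python
from pydantic import BaseModel, Field, field_validator

MAX_TEXT_CHARS = 10_000  # example limit, tune for your payloads

class SentimentRequest(BaseModel):
    texts: list[str] = Field(min_length=1, max_length=32)  # cap the batch a client can send

    @field_validator("texts")
    @classmethod
    def reject_huge_inputs(cls, texts: list[str]) -> list[str]:
        for text in texts:
            if len(text) > MAX_TEXT_CHARS:
                raise ValueError(f"text longer than {MAX_TEXT_CHARS} chars")
        return texts
```

If your BentoML version supports pydantic-typed API inputs, annotate the endpoint with this model; otherwise call `SentimentRequest.model_validate(payload)` at the top of the handler when VALIDATE_INPUTS is on.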
CI/CD That Won't Break Your Deployment
GitHub Actions that actually work - this pipeline caught 3 broken deployments last month:
```yaml
# .github/workflows/deploy.yml
name: Deploy Model

on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install and test
        run: |
          pip install bentoml pytest
          pip install -r requirements.txt
          # Test model accuracy - don't deploy shit models
          pytest tests/test_accuracy.py -v
          # Test the API works
          bentoml serve service:SentimentModel --port 3001 &
          sleep 10
          curl -f http://localhost:3001/health

  deploy:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Deploy to BentoCloud
        run: |
          pip install bentoml
          bentoml cloud login --api-token ${{ secrets.BENTOML_TOKEN }}
          bentoml deploy . --name prod-sentiment
```
This pipeline prevents deploying broken models. The BentoML CI/CD guide has more examples.
Tests that prevent production disasters:
```python
# tests/test_no_broken_deployments.py
def test_accuracy_gate():
    """Don't deploy models worse than the current one"""
    accuracy = evaluate_model_on_test_set()
    assert accuracy > 0.85, f"Accuracy {accuracy} sucks, don't deploy"


def test_latency_sla():
    """Users complain when responses take forever"""
    import time

    start = time.time()
    model.predict("test input")
    latency = time.time() - start
    assert latency < 0.200, f"Latency {latency}s too slow for production"


def test_memory_limit():
    """Prevent OOM crashes"""
    import psutil

    memory_mb = psutil.Process().memory_info().rss / 1024 / 1024
    assert memory_mb < 7000, f"Using {memory_mb}MB, will OOM at 8GB limit"
```
These tests caught a model that was 50% worse than the previous version. Quality gates save your ass.
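The accuracy gate assumes an `evaluate_model_on_test_set()` helper. Here's roughly what that can look like, assuming a frozen JSONL holdout set and scikit-learn; the path and the shared `model` object are project-specific.

```python
import json

from sklearn.metrics import accuracy_score

def evaluate_model_on_test_set(path: str = "tests/data/holdout.jsonl") -> float:
    """Score the candidate model on a holdout set that never changes between releases."""
    texts, labels = [], []
    with open(path) as f:
        for line in f:
            row = json.loads(line)
            texts.append(row["text"])
            labels.append(row["label"])
    predictions = model.predict(texts)  # the same `model` object the other tests use
    return accuracy_score(labels, predictions)
```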