Why we deployed version 47 of our recommendation model on Friday afternoon and spent the weekend debugging memory leaks
So there I was, three months into running TensorFlow Serving in production, thinking I had this shit figured out. Spoiler alert: I fucking didn't.
We deployed version 47 of our recommendation model on Friday afternoon (mistake #1). The deployment looked clean - health checks passed, latency was good, accuracy metrics were solid. My wife had made anniversary dinner reservations for 7 PM.
At 6:47 PM, our alerting system went ballistic. The serving containers started eating memory like a teenager eats pizza. We went from 8GB baseline to 32GB per container in about 10 minutes. Then the really fun part: we got 'ResourceExhaustedError: OOM when allocating tensor' and our entire model serving cluster died.
AWS bill for the weekend: Over 3 grand. Maybe 3.2K? I was too pissed to look at the exact number.
Wife's reaction: "You're debugging machine learning models during our anniversary dinner?" Not amused. Understandably.
Root cause: Model version 47 had a memory leak in the preprocessing pipeline. Some genius (me) had created a TensorFlow operation that kept references to input tensors without properly cleaning them up. Each prediction request leaked about 50MB. With 1000 requests per second, you do the math.
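To make that concrete, here's roughly the kind of pattern that does it. This is an illustrative sketch with made-up names, not our actual pipeline code, but the failure mode is the same: a long-lived Python object hangs onto every request's tensors, so nothing ever gets freed.
import tensorflow as tf

# Illustrative names only - this is not our real pipeline, just the shape of the bug.
_debug_cache = []  # module-level list that outlives every request

def preprocess_leaky(raw_features):
    """Converts request features to a tensor and (accidentally) keeps a reference."""
    x = tf.convert_to_tensor(raw_features, dtype=tf.float32)
    _debug_cache.append(x)  # the leak: every request's tensors stay pinned forever
    return tf.nn.l2_normalize(x, axis=-1)

def preprocess_fixed(raw_features):
    """Same transform, no lingering references, so tensors are freed after the request."""
    x = tf.convert_to_tensor(raw_features, dtype=tf.float32)
    return tf.nn.l2_normalize(x, axis=-1)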
What TensorFlow Serving Actually Does (When It Works)
TensorFlow Serving is Google's production system for serving ML models. It handles model loading, versioning, and inference at scale. The core idea is solid: you export your trained model as a SavedModel format, point TensorFlow Serving at it, and it handles the HTTP/gRPC serving infrastructure.
The architecture consists of servables (your models), loaders (manage model lifecycle), managers (coordinate loading/unloading), and sources (discover new models). It's actually well-designed - when it works.
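For reference, here's a minimal export sketch. The model itself is a throwaway placeholder; the parts that matter are the numbered version directory (/models/<name>/<version>/) and the serving_default signature, because that's what TensorFlow Serving discovers and loads.
import tensorflow as tf

# Stand-in model; your real model obviously does more than average its inputs.
class TinyRecommender(tf.Module):
    @tf.function(input_signature=[tf.TensorSpec([None, 32], tf.float32, name="input")])
    def serve(self, x):
        return {"scores": tf.reduce_mean(x, axis=-1)}

model = TinyRecommender()

# TensorFlow Serving expects numbered version directories under the base path,
# e.g. /models/recommendation_model/47/ containing saved_model.pb + variables/.
export_path = "/models/recommendation_model/47"
tf.saved_model.save(model, export_path, signatures={"serving_default": model.serve})

# Then the stock image can serve it, e.g.:
#   docker run -p 8501:8501 -p 8500:8500 \
#     -v /models:/models -e MODEL_NAME=recommendation_model tensorflow/serving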
Key features that matter in production:
- Model versioning: Deploy new model versions without downtime
- Batching: Automatically batch requests to improve throughput
- Multi-model serving: Run different models in the same serving cluster
- Resource management: Control memory and CPU allocation per model
What Google doesn't tell you: Each of these features comes with gotchas that'll bite you when you scale beyond the tutorial examples. The official troubleshooting guide covers basic issues but misses the weird production edge cases.
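As a taste of the versioning feature, the REST API lets a client pin requests to a specific version instead of whatever the server currently considers the latest. A rough sketch - the host, model name, and payload shape are placeholders for your own setup:
import requests  # assumes the requests package; any HTTP client works

HOST = "http://your-tf-serving-host:8501"  # placeholder

def predict(instances, version=None):
    # Pin to a specific version with /versions/<n>; otherwise the server
    # routes to whatever its version policy considers current.
    path = "/v1/models/recommendation_model"
    if version is not None:
        path += f"/versions/{version}"
    resp = requests.post(f"{HOST}{path}:predict",
                         json={"instances": instances}, timeout=5)
    resp.raise_for_status()
    return resp.json()["predictions"]

# predict([{"input": "test"}], version=47)  # explicitly hit version 47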
The Memory Management Reality Check
TensorFlow Serving's memory usage is... unpredictable. In our production deployment, we saw memory patterns that made no goddamn sense:
- Cold start: 2-3GB per model (reasonable)
- After 1 hour: 4-5GB (suspicious but manageable)
- After 24 hours: 8-12GB (what the fuck is happening?)
- Under load: 15-30GB (time to panic and call the oncall engineer)
The real kicker: This wasn't even during high traffic. Our baseline was maybe 100 predictions per second, nothing crazy.
Memory debugging tools that actually help:
- `docker stats` - Shows real container memory usage
- TensorFlow Serving's monitoring endpoints - Gives internal memory metrics
- `nvidia-smi` if you're using GPUs (spoiler: GPU memory is even worse)
- TensorFlow Profiler - For deep memory analysis when shit really hits the fan
- Kubernetes memory metrics - When running in k8s clusters
Production memory limits we actually use:
- Container limit: 32GB (or 30GB, hard to remember when everything's on fire)
- Model cache: 16GB max
- Request timeout: 30 seconds (before models eat all your memory)
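If I were setting this up again, I'd run a dumb watcher that compares memory samples over time before trusting any fancier tooling. A rough sketch, assuming the server runs as a Docker container named tf-serving; the interval and threshold are guesses you'd tune:
import subprocess
import time

# Container name, sample interval, and growth threshold are placeholders.
CONTAINER = "tf-serving"
SAMPLE_EVERY_S = 60
GROWTH_ALERT_MB = 500  # sustained growth per sample that smells like a leak

def memory_mb(container):
    # `docker stats --no-stream` prints one sample, e.g. "8.12GiB / 32GiB".
    out = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{.MemUsage}}", container],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    used = out.split("/")[0].strip()
    value, unit = float(used[:-3]), used[-3:]
    return value * 1024 if unit == "GiB" else value  # assume MiB otherwise

last = memory_mb(CONTAINER)
while True:
    time.sleep(SAMPLE_EVERY_S)
    current = memory_mb(CONTAINER)
    if current - last > GROWTH_ALERT_MB:
        print(f"memory grew {current - last:.0f}MB in {SAMPLE_EVERY_S}s - possible leak")
    last = current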
Docker Deployment: The Only Sane Way
Don't compile TensorFlow Serving from source. Just don't. I wasted a week fighting bazel build errors and dependency conflicts. Use the official Docker images. Check the Docker best practices for ML guide for container optimization.
Our production Docker setup:
FROM tensorflow/serving:2.19.1
# Copy your model(s)
COPY models/ /models/
# This config took 3 hours of debugging to get right
ENV MODEL_NAME=recommendation_model
ENV MODEL_BASE_PATH=/models
ENV TF_CPP_MIN_LOG_LEVEL=1
# Memory limits that actually work
ENV TF_SERVING_MEMORY_FRACTION=0.8
ENV TF_SERVING_BATCH_SIZE=64
Port configuration that won't make you cry:
- REST API: 8501 (for HTTP requests)
- gRPC: 8500 (for high-performance clients)
- Monitoring: 8502 (for Prometheus metrics)
Volume mounts for model updates:
volumes:
- /host/models:/models:ro
- /host/config:/config:ro
The :ro (read-only) flag prevents the container from accidentally corrupting your models. Learned this the hard way when a container bug overwrote our production model with zeros.
Configuration Files: Simple Until They're Not
TensorFlow Serving uses model config files (protobuf text format, passed via --model_config_file) to define which models to serve. The basic format looks innocent:
model_config_list {
  config {
    name: 'my_model'
    base_path: '/models/my_model'
    model_platform: 'tensorflow'
  }
}
What they don't tell you about configuration:
Version management: If you don't specify versions, it loads ALL versions it finds. Great way to eat all your memory.
Batching config: The default batching is conservative. You'll want to tune max_batch_size, batch_timeout_micros, and max_enqueued_batches.
Resource allocation: Without a proper model_version_policy, your server will try to keep every model version loaded.
Our production config (after many painful lessons):
model_config_list {
  config {
    name: 'recommendation_model'
    base_path: '/models/recommendation_model'
    model_platform: 'tensorflow'
    model_version_policy {
      specific {
        versions: 47
      }
    }
  }
}
Batching is configured separately: start the server with --enable_batching and point --batching_parameters_file at its own text-proto file:
max_batch_size { value: 128 }
batch_timeout_micros { value: 50000 }
max_enqueued_batches { value: 100 }
Pro tip: Start with conservative batch sizes. Our first production deployment used a max_batch_size of 512 and killed the server under load; 128 works reliably. Also remember that batch_timeout_micros: 50000 means a request can wait up to 50ms for its batch to fill, so that delay comes straight out of your latency budget.
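A quick way to confirm the version policy actually took: the model status endpoint lists every version the server has loaded, so after a deploy we check that only the version we expect shows up as AVAILABLE. Sketch below uses the requests package and a placeholder host:
import requests  # placeholder host below

# The model status endpoint lists every loaded version and its state.
status = requests.get(
    "http://your-tf-serving-host:8501/v1/models/recommendation_model", timeout=5
).json()

loaded = [v["version"] for v in status["model_version_status"]
          if v["state"] == "AVAILABLE"]
print(loaded)  # with the specific-version policy above, expect exactly ['47']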
Health Checks That Actually Matter
The default health check (/v1/models/your_model) just tells you if the model loaded. It doesn't tell you if the model is working correctly or if memory usage is spiraling out of control.
Health checks we actually use:
Model prediction test:
# Test endpoint format - replace with your actual server and model name
curl -X POST <your-tf-serving-host>:8501/v1/models/recommendation_model:predict \
  -H "Content-Type: application/json" \
  -d '{"instances": [{"input": "test"}]}'
Memory usage check:
# Monitor memory metrics from Prometheus endpoint
curl <your-tf-serving-host>:8502/metrics | grep memory
Response time check:
# Should complete in under 100ms for simple models
time curl -X POST ... (prediction request)
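Stitched together, the checks above fit in a small script you can run from cron, a sidecar, or an exec probe. This is a hedged sketch: the host, the 100ms budget, and the canned input are placeholders for whatever your model actually expects.
import time
import requests  # host, model name, and the canned input are placeholders

HOST = "http://your-tf-serving-host:8501"
MODEL = "recommendation_model"
LATENCY_BUDGET_S = 0.1  # the "under 100ms" target from above

def health_check():
    # 1. Is at least one version loaded and AVAILABLE?
    status = requests.get(f"{HOST}/v1/models/{MODEL}", timeout=5).json()
    states = {v["state"] for v in status["model_version_status"]}
    if "AVAILABLE" not in states:
        return False, f"no AVAILABLE version (saw {states})"

    # 2. Does a canned prediction come back, and fast enough?
    start = time.monotonic()
    resp = requests.post(f"{HOST}/v1/models/{MODEL}:predict",
                         json={"instances": [{"input": "test"}]}, timeout=5)
    elapsed = time.monotonic() - start
    if resp.status_code != 200:
        return False, f"predict returned HTTP {resp.status_code}"
    if elapsed > LATENCY_BUDGET_S:
        return False, f"predict took {elapsed * 1000:.0f}ms"
    return True, "ok"

if __name__ == "__main__":
    ok, detail = health_check()
    print(detail)
    raise SystemExit(0 if ok else 1)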
Kubernetes liveness probe that saved our asses:
livenessProbe:
httpGet:
path: /v1/models/recommendation_model
port: 8501
initialDelaySeconds: 60
periodSeconds: 30
timeoutSeconds: 10
failureThreshold: 3
The initialDelaySeconds: 60 is critical. TensorFlow Serving takes time to load models, and impatient health checks will kill your containers before they're ready.
When Things Go Wrong (Spoiler: They Will)
Memory leak debugging: Use docker exec -it <container> bash and check /proc/meminfo. If memory keeps growing linearly, you've got a leak.
Model loading failures: Check the container logs with docker logs <container>. TensorFlow Serving logs are verbose but helpful.
Performance degradation: Monitor request latency and batch processing metrics. If latency starts climbing, you're probably hitting resource limits.
The nuclear option: When all else fails, restart the container. TensorFlow Serving doesn't always clean up memory properly, and sometimes a restart is the only fix.
This took me 3 months and multiple production incidents to figure out. Hopefully this saves you from debugging model serving issues during your own anniversary dinner.