PyTorch Production Deployment: AI-Optimized Knowledge Base
Critical Production Failures
Memory Leaks - Guaranteed Occurrence
- Problem: PyTorch has confirmed memory leaks in production environments
- Impact: GPU memory leaks accumulate over days until OOM crashes occur
- Specific Issue: memory growth has been reported even around basic ops such as torch.nn.Linear in long-running services
- Solution: Mandatory worker restarts every 1000 requests or 3600 seconds
- Configuration:
MAX_REQUESTS_PER_WORKER=1000
WORKER_RESTART_INTERVAL=3600  # 1 hour
gunicorn --max-requests ${MAX_REQUESTS_PER_WORKER} \
         --max-requests-jitter 100 \
         --timeout 120 \
         app:application
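The same limits can also live in a gunicorn.conf.py so they ship with the code - a sketch of the equivalent config-file form:

# gunicorn.conf.py - same limits as the CLI flags above, kept in version control
max_requests = 1000        # recycle each worker after this many requests
max_requests_jitter = 100  # stagger restarts so all workers don't recycle at once
timeout = 120              # allow slow inferences before the worker is killed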
First Inference Latency - Cold Start Penalty
- Problem: 2-5 second delay on first inference due to CUDA initialization and torch.compile overhead
- Solution: Mandatory warmup during startup
- Implementation:
import torch

@torch.no_grad()
def warmup_model(model, input_shape, device, warmup_steps=5):
    dummy_input = torch.randn(input_shape).to(device)
    for _ in range(warmup_steps):
        _ = model(dummy_input)
    torch.cuda.synchronize()  # make sure all queued CUDA work has actually run
Serving Architecture Decision Matrix
Deployment Method | Latency (P95) | Throughput | Memory Usage | Learning Curve | Production Use Case |
---|---|---|---|---|---|
FastAPI + Gunicorn | 120-200ms | 50-200 req/sec | 2-4GB per worker | Easy | Single model, custom preprocessing |
TorchServe | 150-250ms | 100-500 req/sec | 3-6GB per model | Moderate | Multi-model, A/B testing, version management |
ONNX Runtime | 80-150ms | 100-300 req/sec | 1-3GB | Hard | Maximum performance requirement |
TensorRT | 60-120ms | 200-800 req/sec | 2-5GB | Very Hard | NVIDIA GPUs only, maximum optimization |
Ray Serve | 200-300ms | 500-2000 req/sec | 4-8GB | Moderate | Auto-scaling, distributed inference |
Decision Criteria
- FastAPI: Use for prototypes and simple models (50-100ms faster than TorchServe); a minimal serving sketch follows this list
- TorchServe: Only use when you need multi-model serving, version management, or built-in batching
- Complexity Warning: Don't use TorchServe unless you specifically need its advanced features
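To make the FastAPI option concrete, here is a minimal serving sketch. The model path, endpoint name, input schema, and warmup shape are assumptions, not a drop-in implementation; warmup_model is the helper defined in the warmup section above.

import torch
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = torch.jit.load("model.pt").eval().to("cuda")  # assumes a TorchScript artifact

class PredictRequest(BaseModel):
    inputs: list[list[float]]  # assumed input schema

@app.on_event("startup")
def warmup():
    # Warm up CUDA and lazy initialization before traffic arrives
    warmup_model(model, (1, 128), "cuda")  # input shape is an assumption

@app.post("/predict")
def predict(req: PredictRequest):
    batch = torch.tensor(req.inputs, device="cuda")
    with torch.no_grad():
        output = model(batch)
    return {"predictions": output.cpu().tolist()}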
torch.compile Production Guidelines
Performance Impact
- Speedup: up to 1.5-2x for transformer models in benchmarks; expect closer to a 30-40% improvement in practice
- Memory: Can reduce or increase usage depending on input shapes
- Cold Start: Adds 10-30 seconds compilation time on first inference
Production Constraints
- Never use with dynamic input shapes: every new shape triggers recompilation, which destroys performance; pad inputs to a few fixed shapes instead (see the bucketing sketch after the code below)
- Debugging Impact: compiled models are far harder to step through with a debugger, so keep an eager-mode escape hatch
- Implementation Pattern:
# Compile after warmup, not before
if not DEBUG_MODE:
    model = torch.compile(model, mode="max-autotune")
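One way to keep shapes static for torch.compile is to pad variable-length inputs to a small set of bucket sizes - a sketch assuming a (batch, seq_len) token or feature tensor; the bucket lengths are illustrative and should come from your traffic distribution.

import torch
import torch.nn.functional as F

BUCKET_LENGTHS = (32, 64, 128, 256)  # assumed buckets; pick from observed request sizes

def pad_to_bucket(batch: torch.Tensor) -> torch.Tensor:
    """Pad the sequence dimension up to the nearest bucket so the compiler sees only a few shapes."""
    seq_len = batch.shape[1]
    target = min((b for b in BUCKET_LENGTHS if b >= seq_len), default=seq_len)
    return F.pad(batch, (0, target - seq_len))  # pads the last (sequence) dimension

Serve padded batches to the compiled model; each bucket compiles once and is then reused.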
When to Use
- Use: Batch processing where warmup time is acceptable
- Avoid: Real-time serving APIs where cold start matters
Memory Optimization Hierarchy
Tier 1 - Essential (30-50% memory reduction)
Gradient Checkpointing:
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedModel(nn.Module):
    def forward(self, x):
        # Recompute heavy_layer's activations during backward instead of storing them
        # (assumes self.heavy_layer is defined in __init__)
        return checkpoint(self.heavy_layer, x)
Mixed Precision (roughly halves activation memory):
from torch.cuda.amp import autocast

with autocast():
    output = model(input)  # Automatically uses FP16 where safe
Manual Memory Cleanup:
def cleanup_gpu_memory():
    if torch.cuda.is_available():
        torch.cuda.empty_cache()   # return cached blocks to the driver
        torch.cuda.ipc_collect()   # release memory held by dead IPC handles
Tier 2 - Avoid Unless Necessary
- Activation offloading: Adds complexity without meaningful inference benefits
- Parameter partitioning: Only beneficial for massive training models, not inference
Auto-Scaling Configuration
Standard Web Service Scaling Breaks ML Workloads
- Problem: Model serving requires 30-60 second warmup time
- Consequence: Aggressive scaling causes cold start penalties and memory fragmentation (a readiness-gating sketch follows)
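To keep the autoscaler from routing traffic to cold pods, expose a readiness endpoint that only passes after warmup. A sketch, assuming a FastAPI app and the warmup_model helper from earlier; point the Kubernetes readinessProbe at it.

from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()
model_ready = False  # flipped to True only once warmup has finished

@app.on_event("startup")
def load_and_warm():
    global model_ready
    # load the model, then run warmup_model(...) as shown earlier
    model_ready = True

@app.get("/ready")
def ready():
    if not model_ready:
        return JSONResponse(status_code=503, content={"status": "warming up"})
    return {"status": "ready"}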
Production-Tested Configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  minReplicas: 2   # Always maintain minimum availability
  maxReplicas: 10
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 300   # 5 minutes - longer than web services
      policies:
        - type: Percent
          value: 50                     # Scale by 50% maximum
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 600   # 10 minutes - prevent thrashing
      policies:
        - type: Percent
          value: 25                     # Scale down slowly
          periodSeconds: 120
Critical Monitoring Metrics
ML-Specific Metrics (Standard System Metrics Are Insufficient)
- Inference latency distribution (not averages - P95/P99 matter)
- GPU memory usage over time (detect memory leaks before crashes)
- Model loading time (detect corruption/network issues)
- Batch utilization (optimization opportunity indicator)
- CUDA kernel launch overhead (batching efficiency measure)
Production Monitoring Implementation
import torch
from prometheus_client import Counter, Histogram, Gauge

inference_duration = Histogram('pytorch_inference_seconds', 'Time spent in model inference')
gpu_memory_used = Gauge('pytorch_gpu_memory_bytes', 'GPU memory usage')
batch_size_used = Histogram('pytorch_batch_size', 'Actual batch sizes processed')
model_errors = Counter('pytorch_model_errors_total', 'Model inference errors', ['error_type'])

@inference_duration.time()
def monitored_inference(model, batch):
    try:
        torch.cuda.synchronize()  # ensure queued GPU ops complete before measuring
        result = model(batch)
        gpu_memory_used.set(torch.cuda.memory_allocated())
        batch_size_used.observe(len(batch))
        return result
    except RuntimeError as e:
        if "out of memory" in str(e):
            model_errors.labels(error_type="oom").inc()
        else:
            model_errors.labels(error_type="runtime").inc()
        raise
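To actually scrape these metrics, the process also needs to expose them. A minimal sketch using prometheus_client's built-in HTTP server; the port is an assumption.

from prometheus_client import start_http_server

start_http_server(8001)  # expose /metrics on a side port for Prometheus to scrape

# Route every inference through the instrumented wrapper so metrics are recorded as a side effect
result = monitored_inference(model, batch)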
Performance Regression Detection
Hidden Causes of Production Degradation
- CUDA driver updates: Change kernel scheduling and affect PyTorch performance
- System library updates: Affect memory allocation patterns
- Hardware degradation: Causes subtle slowdowns missed by traditional monitoring
- Data drift: New input distributions make models work harder
Automated Regression Detection
import random
import time

import numpy as np
import torch

class PerformanceBaseline:
    def __init__(self, model, sample_inputs):
        self.model = model
        self.sample_inputs = sample_inputs
        self.baseline_times = self._establish_baseline()

    def _establish_baseline(self):
        times = []
        for _ in range(100):  # 100 runs for statistical significance
            start = time.time()
            with torch.no_grad():
                _ = self.model(random.choice(self.sample_inputs))
            torch.cuda.synchronize()  # wait for the GPU before stopping the clock
            times.append(time.time() - start)
        return np.percentile(times, [50, 95])  # median, P95

    def check_regression(self, current_times):
        current_p95 = np.percentile(current_times, 95)
        baseline_p95 = self.baseline_times[1]
        if current_p95 > baseline_p95 * 1.2:  # 20% regression threshold
            alert_performance_regression(current_p95, baseline_p95)  # alerting hook, defined elsewhere
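Intended usage, roughly: collect per-request latencies in production and compare them against the stored baseline on a schedule. The window size below is an assumption.

baseline = PerformanceBaseline(model, sample_inputs)
recent_latencies = []  # append each request's measured inference time here

def maybe_check_regression():
    # Call periodically (e.g. from a background thread); 500 samples is an assumed window
    if len(recent_latencies) >= 500:
        baseline.check_regression(recent_latencies)
        recent_latencies.clear()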
Batch Size Optimization
Production Reality
- Dynamic batching is harder than it looks: most production deployments settle on fixed batch sizes
- Real-time serving: Batch size 1 often gives better latency than request batching
- Memory constraint calculation:
@torch.no_grad()
def find_max_batch_size(model, input_shape, device="cuda"):
    # Double the batch size until the GPU runs out of memory, then back off one step
    batch_size = 1
    while True:
        try:
            dummy_input = torch.randn(batch_size, *input_shape[1:], device=device)
            output = model(dummy_input)
            del dummy_input, output
            torch.cuda.empty_cache()
            batch_size *= 2
        except RuntimeError:  # out of memory
            torch.cuda.empty_cache()
            return max(batch_size // 2, 1)
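Example call, assuming an image model that takes (batch, 3, 224, 224) tensors - the shape is illustrative:

max_batch = find_max_batch_size(model, input_shape=(1, 3, 224, 224))
print(f"Largest batch that fits: {max_batch}")  # leave ~20% headroom below this in production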
Disaster Recovery for ML Services
Traditional Backup Strategies Don't Work
- Problem: Model files are large, loading is slow, version consistency required
- Solution Components:
- Model Versioning: Keep last 3 working versions
- Gradual Rollouts: Never deploy to all replicas simultaneously
- Sophisticated Health Checks: Test actual inference, not just HTTP responses (see the sketch after this list)
- Circuit Breaker Pattern: Fail gracefully when models error
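A health check that exercises the model rather than just the web server - a sketch; the dummy input shape is an assumption and the model is assumed to return a single tensor. Wire the return value into whatever /health endpoint your server exposes.

import torch

def deep_health_check(model, device="cuda"):
    """Run one tiny real inference; return True only if the model produces sane output."""
    try:
        with torch.no_grad():
            out = model(torch.randn(1, 3, 224, 224, device=device))  # assumed input shape
        return not torch.isnan(out).any().item()
    except Exception:
        return False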
Circuit Breaker Implementation
import time

class ModelCircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.last_failure_time = None
        self.recovery_timeout = recovery_timeout
        self.state = "CLOSED"  # CLOSED, OPEN, HALF_OPEN

    def call_model(self, model, inputs):
        if self.state == "OPEN":
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = "HALF_OPEN"
            else:
                raise Exception("Circuit breaker OPEN - model unavailable")
        try:
            result = model(inputs)
            if self.state == "HALF_OPEN":
                self.reset()  # recovery successful
            return result
        except Exception:
            self.record_failure()
            raise

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = "OPEN"

    def reset(self):
        # Close the breaker again after a successful trial call in HALF_OPEN
        self.failure_count = 0
        self.state = "CLOSED"
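Usage sketch - route every inference call through the breaker so failures trip it instead of cascading; the fallback behavior here is an assumption:

breaker = ModelCircuitBreaker(failure_threshold=5, recovery_timeout=60)

def predict(batch):
    try:
        return breaker.call_model(model, batch)
    except Exception:
        # Serve a fallback (cached result, default prediction, or 503) while the breaker is open
        return None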
Quantization in Production
Performance vs Accuracy Trade-off
- Speedup: 8-bit quantization typically provides 1.5-2x speedup
- Accuracy Loss: Minimal for most models, but model-dependent
- Critical Requirement: Test extensively on validation set before deployment
Implementation
import torch.quantization
# Post-training quantization - easiest approach
quantized_model = torch.quantization.quantize_dynamic(
model, {torch.nn.Linear}, dtype=torch.qint8
)
# Compare accuracy on validation set before deploying
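A quick before/after comparison sketch, assuming a classification model and a hypothetical val_loader that yields (inputs, labels) batches:

import torch

def accuracy(m, val_loader):
    correct = total = 0
    with torch.no_grad():
        for inputs, labels in val_loader:
            preds = m(inputs).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total

fp32_acc = accuracy(model, val_loader)
int8_acc = accuracy(quantized_model, val_loader)
print(f"FP32: {fp32_acc:.4f}  INT8: {int8_acc:.4f}  drop: {fp32_acc - int8_acc:.4f}")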
When NOT to Use
- Custom operations: Quantization breaks with non-standard operations
- Accuracy degradation: Don't sacrifice prediction quality for speed
- Rule: Speedup isn't worth broken predictions
Kubernetes Production Configuration
Resource Requirements Based on Profiling
# Model loading optimization with init containers
initContainers:
  - name: model-loader
    image: your-model-image
    command: ['python', '-c', 'import torch; torch.jit.load("model.pt")']  # Validate the model loads
containers:
  - name: pytorch-server
    resources:
      limits:
        nvidia.com/gpu: 1
        memory: "16Gi"
      requests:
        memory: "8Gi"
Critical Configuration Rules
- Memory limits: Set based on actual profiling, not guesswork (see the measurement sketch below)
- Under-provisioning consequence: OOM kills
- Init containers: Use for model loading validation
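To get real numbers for the limits, measure peak usage under representative load. A sketch - representative_batches is a hypothetical sample of production-sized inputs, and the headroom multiplier is an assumption:

import resource
import torch

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    for batch in representative_batches:  # hypothetical production-sized inputs
        _ = model(batch)
torch.cuda.synchronize()

gpu_peak_gb = torch.cuda.max_memory_allocated() / 1e9
host_peak_gb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6  # ru_maxrss is KB on Linux
print(f"GPU peak: {gpu_peak_gb:.2f} GB, host RSS peak: {host_peak_gb:.2f} GB")
# Size the container memory request/limit from host RSS with ~30-50% headroom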
Memory Profiling and Debugging
Production Memory Issue Debugging
import torch.profiler

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
    profile_memory=True,
    record_shapes=True
) as prof:
    output = model(input)

print(prof.key_averages().table(sort_by="cuda_memory_usage", row_limit=10))
Scheduled Profiling for Regression Detection
# Run the profiler every 1000 requests to catch regressions
def conditional_profiling(step, model_fn, inputs):
    if step % 1000 == 0:
        with torch.profiler.profile(
            activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
            profile_memory=True
        ) as prof:
            result = model_fn(inputs)
        prof.export_chrome_trace(f"profile_step_{step}.json")
    else:
        result = model_fn(inputs)
    return result
Resource Requirements and Expertise Costs
Implementation Time Investment
- FastAPI Setup: 1-2 days for basic implementation
- TorchServe Setup: 1-2 weeks including learning curve and configuration optimization
- ONNX Runtime: 2-4 weeks including conversion and optimization
- TensorRT: 4-8 weeks requiring NVIDIA-specific expertise
Expertise Requirements
- Basic Deployment: Python web development knowledge sufficient
- Production Optimization: Requires GPU programming and memory management expertise
- Advanced Serving: Kubernetes and distributed systems knowledge essential
Hidden Costs
- Monitoring Setup: 1-2 weeks for proper ML-specific monitoring
- Debugging Tools: Time investment in profiling and memory analysis tools
- Maintenance: Regular worker restarts and memory leak management overhead
Common Failure Scenarios and Solutions
Scenario 1: Random OOM Crashes After Days of Stable Operation
- Root Cause: Memory leaks in PyTorch operations
- Solution: Configure max-requests worker restart
- Prevention: Monitor GPU memory usage trends
Scenario 2: Slow First Inference in Production
- Root Cause: CUDA lazy initialization and torch.compile overhead
- Solution: Implement model warmup during startup
- Impact: 2-5 second delay without warmup
Scenario 3: Auto-scaling Causes Performance Degradation
- Root Cause: Cold start penalties and memory fragmentation
- Solution: Use longer stabilization windows (5-10 minutes)
- Configuration: Gradual scaling with 25-50% increments
Scenario 4: Performance Regression Without Code Changes
- Root Cause: CUDA driver updates, hardware degradation, or data drift
- Solution: Automated performance baseline monitoring
- Detection: 20% P95 latency increase threshold
Scenario 5: Debugging Breaks After torch.compile
- Root Cause: torch.compile disables debugging capabilities
- Solution: Conditional compilation (disable in debug mode)
- Trade-off: 30-40% performance gain vs debugging capability
Critical Production Warnings
What Official Documentation Doesn't Tell You
- Memory leaks are inevitable: Plan for them, don't try to prevent them
- torch.compile with dynamic shapes: Will destroy performance through constant recompilation
- TorchServe defaults: Are configured for development, not production
- Auto-scaling ML workloads: Requires different parameters than web services
- Generic monitoring tools: Are insufficient for ML deployment debugging
Breaking Points and Failure Modes
- 1000+ spans in a tracing UI: Makes debugging large distributed inference transactions nearly impossible
- GPU memory fragmentation: Occurs during rapid scaling events
- Model corruption: Detectable through loading time monitoring
- Batch optimization loss: Freshly scaled pods run with untuned batch parameters until they warm up
Investment vs Benefit Analysis
- torch.compile: 40% speedup but 10-30 second cold start penalty
- Quantization: 1.5-2x speedup but potential accuracy degradation
- TorchServe: Better for multi-model serving but 50-100ms latency overhead
- Circuit breakers: Essential for graceful degradation but add complexity
Useful Links for Further Investigation
Production PyTorch Resources That Actually Help
Link | Description |
---|---|
PyTorch Performance Tuning Guide | The only official guide that covers real production optimizations. Skip the theory, focus on the memory management and batching sections. |
TorchServe Performance Guide | Specific tuning parameters for TorchServe deployments. Essential if you're using TorchServe - the defaults are terrible for production. |
PyTorch CUDA Memory Documentation | Deep dive into how PyTorch manages GPU memory. Read this before you debug your first OOM error. |
TorchServe GitHub Repository | Official model serving solution. Check the issues section for known problems before deploying. |
Optimizing PyTorch Model Serving at Scale | Real performance comparison between TorchServe and FastAPI. Actual benchmarks, not marketing. |
Ray Serve Documentation | For distributed serving and auto-scaling. Complex but powerful when you need it. |
PyTorch Profiler Tutorial | How to find performance bottlenecks in your models. Essential for production optimization. |
NVIDIA System Management Interface | `nvidia-smi` documentation. Learn the flags that actually matter for monitoring GPU utilization. |
Understanding GPU Memory Blog Series | PyTorch official blog series on debugging GPU memory issues and visualizing memory usage patterns. |
TorchServe Performance Tuning Case Study | Meta's own experience optimizing TorchServe for production workloads. Real numbers and configurations. |
Deploying LLMs with TorchServe + vLLM | Recent case study on serving large language models efficiently. Good for understanding modern serving patterns. |
Production PyTorch Memory Optimization | Practical memory optimization techniques that work in production, not just toy examples. |
torch.compile Tutorial | Official guide to PyTorch 2.x compilation. Pay attention to the production considerations section. |
PyTorch Quantization | How to reduce model size and increase inference speed without destroying accuracy. |
ONNX Runtime for PyTorch | Converting PyTorch models to ONNX for faster inference. Not always worth it, but when it works, it works well. |
Kubernetes GPU Sharing | How to efficiently share GPUs between multiple model serving pods. Critical for cost optimization. |
Docker Best Practices for ML | Container optimization for machine learning workloads. Smaller images, faster startup times. |
NVIDIA Container Toolkit | Required for GPU access in containerized deployments. Setup guide and troubleshooting. |
PyTorch GitHub Issues | Search here before assuming your problem is unique. Filter by "memory leak" and "production" for relevant issues. |
PyTorch Discuss Forum | Community forum where core developers actually respond. Better than Stack Overflow for PyTorch-specific issues. |
CUDA Memory Management Best Practices | NVIDIA's official guide to GPU memory optimization. Applies to PyTorch CUDA operations. |