I spent way too long battling Ollama in production after it worked perfectly on my MacBook. Here's the brutal truth about what actually breaks and how I eventually fixed it, based on real production deployments and way too much time staring at monitoring dashboards.
The "Works on My Machine" Trap
Your laptop is a controlled environment. Production is chaos. I learned this when our internal chatbot went from handling 3 developers to supporting 200+ employees. Everything that could go wrong did.
The biggest lie: "If it works locally, it'll work in production." Bullshit. Your laptop has 32GB unified memory and no competing processes. Production has limited RAM, CPU contention, network timeouts, and users who do unexpected shit that breaks everything.
Memory Management Hell
The official docs say an 8B model like Llama 3.1 needs "8GB minimum." That's technically correct but practically useless.
In production, you need way more RAM than you think. The OS eats 3-4GB, your other services probably use another 6-8GB, plus overhead for model loading and context windows that grow over time.
So that "8GB" model actually needs 24-32GB of system RAM to run reliably. I learned this the hard way after three days of mysterious OOMKills and angry Slack messages about the chatbot being down.
What actually works:
## Monitor real memory usage
watch -n1 'free -h && echo "---" && ollama ps'
The memory usage creeps up over time. Long conversations consume more context. Multiple concurrent users multiply everything. Plan for 3x the theoretical minimum.
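Two environment variables at least make the footprint predictable: OLLAMA_KEEP_ALIVE controls how long an idle model stays resident, and OLLAMA_MAX_LOADED_MODELS caps how many models can be in memory at once. A minimal sketch; the values here are what worked for my workload, not universal defaults:
## Keep a model loaded for 10 minutes after its last request (default is 5m)
export OLLAMA_KEEP_ALIVE=10m
## Only allow one model resident at a time on this box
export OLLAMA_MAX_LOADED_MODELS=1
ollama serve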
Concurrency: Where Dreams Go to Die
Ollama's default behavior is designed for single users, not production workloads. The OLLAMA_NUM_PARALLEL=1 default means requests queue up like customers at a single checkout line.
I tried setting OLLAMA_NUM_PARALLEL=8 and promptly killed our server. The weights are only loaded once, but every parallel slot gets its own context allocation, so KV-cache memory scales with the parallel count. With a 40GB model, long contexts, and 8 slots, we blew straight past our VRAM. Most of us don't have A100 clusters.
The solution that actually works: Multiple Ollama instances.
## Run multiple instances on different ports
OLLAMA_HOST=127.0.0.1:11434 ollama serve &
OLLAMA_HOST=127.0.0.1:11435 ollama serve &
OLLAMA_HOST=127.0.0.1:11436 ollama serve &
## Load balance between them
Each instance handles its requests sequentially, but together they give you horizontal scaling without the VRAM explosion. The load balancer config comes up in the network section below.
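Before pointing any traffic at them, I make sure each instance actually answers on its own port. A quick check, assuming the three ports from the example above:
## Poll each instance's /api/ps endpoint
for port in 11434 11435 11436; do
  curl -sf "http://127.0.0.1:${port}/api/ps" > /dev/null \
    && echo "port ${port}: up" \
    || echo "port ${port}: DOWN"
done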
GPU Driver Nightmares
CUDA drivers are finicky as hell. What works in development breaks in production for mysterious reasons:
- Driver version mismatches: CUDA 11.8 vs 12.1 can cause silent failures
- Multiple CUDA versions: Development tools install different versions
- Container runtime issues: Docker vs Podman vs native behave differently
- GPU memory fragmentation: Long-running processes fragment VRAM
Debug GPU issues:
## Check CUDA is actually working
nvidia-smi
nvcc --version
## Ollama GPU detection
ollama run llama3.1:8b "test"
## Watch the output - should show GPU layers loaded
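In practice the quickest tell is the PROCESSOR column in ollama ps. If it says CPU when you expected GPU, the server log usually names the culprit. This assumes the standard Linux install, where Ollama runs as a systemd service called ollama:
## A GPU-resident model shows something like "100% GPU" in the PROCESSOR column
ollama ps
## If it shows CPU, grep the server log for the fallback reason
journalctl -u ollama --no-pager | grep -iE "cuda|gpu" | tail -n 20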
I spent a week debugging "slow inference" only to discover Ollama fell back to CPU because of a CUDA library mismatch and suspend/resume cycle issues.
Network and Load Balancer Gotchas
Load balancers expect web applications, not AI inference servers. Default timeouts (30 seconds) are far too short for model responses, and health checks can fail while a model is still loading, so the balancer marks a perfectly healthy instance as down.
HAProxy config that works:
backend ollama_backend
    timeout server 300s
    option httpchk GET /api/ps
    server ollama1 10.0.0.10:11434 check
    server ollama2 10.0.0.11:11434 check
NGINX timeout config:
proxy_connect_timeout 10s;
proxy_send_timeout 300s;
proxy_read_timeout 300s;
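For completeness, here's roughly how those directives fit together with the multi-instance setup from earlier. A sketch only: it assumes NGINX on the same host, a config dropped into /etc/nginx/conf.d/, the three local ports from the earlier example, and an upstream name (ollama_pool) I made up:
## Minimal NGINX front for the three local instances
cat > /etc/nginx/conf.d/ollama.conf <<'EOF'
upstream ollama_pool {
    least_conn;              # send new requests to the least-busy instance
    server 127.0.0.1:11434;
    server 127.0.0.1:11435;
    server 127.0.0.1:11436;
}

server {
    listen 8080;
    location / {
        proxy_pass http://ollama_pool;
        proxy_connect_timeout 10s;
        proxy_send_timeout 300s;
        proxy_read_timeout 300s;
        proxy_buffering off; # don't buffer streamed tokens
    }
}
EOF
nginx -t && nginx -s reload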
Storage: The Hidden Bottleneck
Models are massive files. Llama 3.3 70B is 40GB. Loading from slow storage kills performance:
- Network storage: Adds 30+ seconds to cold starts
- Spinning disks: Even local HDD is too slow
- Container ephemeral storage: Gets wiped on restarts
What I learned: Put models on local NVMe SSDs. Period. Network attached storage and container volumes are too slow for production.
## Check your storage speed
dd if=/dev/zero of=/tmp/test bs=1G count=10 oflag=dsync
## Should be 1GB/s+ for good performance
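The dd write test is a decent proxy, but model loading is all reads, so I also time a read of an actual model blob. This assumes the default per-user models path; adjust if OLLAMA_MODELS points somewhere else (the Linux service install uses /usr/share/ollama/.ollama/models):
## Pick the largest blob (usually the model weights) and time reading it
blob=$(ls -S ~/.ollama/models/blobs/sha256-* 2>/dev/null | head -n 1)
## Drop the page cache first so you measure the disk, not RAM (needs root)
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
dd if="$blob" of=/dev/null bs=1M status=progress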
Monitoring What Actually Matters
Standard monitoring misses the important stuff. CPU and RAM usage look fine until everything explodes.
Monitor these metrics:
- Model load/unload frequency (high churn = memory pressure)
- Response queue length (requests backing up)
- Context window sizes (memory leaks show up here)
- GPU memory fragmentation (nvidia-smi vs ollama ps differences)
- Storage I/O during model loading (bottleneck detection)
## Simple monitoring script
while true; do
  echo "=== $(date) ==="
  echo "Models loaded:"
  ollama ps
  echo "Memory:"
  free -h | grep Mem
  echo "GPU:"
  nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
  echo "---"
  sleep 30
done
The Migration Path That Works
Don't go from laptop to production in one jump. Scale gradually:
- Single server, single model: Get basic deployment working
- Resource monitoring: Understand real usage patterns
- Multiple models: Test switching and memory management
- Multiple instances: Scale horizontally before vertically
- Load balancing: Add redundancy and request distribution
Each step reveals different failure modes. Better to fail small than catastrophically.
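For step one, I don't hand-roll anything: the official Linux installer already creates an ollama systemd service, and a drop-in override is enough to set the environment. A sketch, assuming that standard install; the values are illustrative:
## Put env overrides in a systemd drop-in instead of editing the unit file
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=10m"
EOF
sudo systemctl daemon-reload && sudo systemctl restart ollama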
When to Give Up on Ollama
Sometimes Ollama isn't the right choice. If you're hitting these walls, consider alternatives:
- >100 concurrent users: vLLM handles concurrency better with 793 TPS vs Ollama's 41 TPS
- Multiple models simultaneously: TGI has better resource management and memory handling
- High availability requirements: Managed services might be worth it for production stability
- Complex deployment requirements: Kubernetes-native solutions exist with better orchestration
Ollama is fantastic for 10-50 users with reasonable response time expectations. Beyond that, the complexity explodes.
Real Production Architecture
Here's what actually works for 100+ users:
Load Balancer (HAProxy/NGINX)
├── Ollama Instance 1 (Port 11434) - Model A
├── Ollama Instance 2 (Port 11435) - Model A
├── Ollama Instance 3 (Port 11436) - Model B
└── Ollama Instance 4 (Port 11437) - Model B
Shared NVMe storage for models
Prometheus/Grafana for monitoring
Automated restart scripts for memory leaks
Each instance runs on dedicated hardware or in isolated containers with guaranteed resources. No fancy orchestration needed - just multiple instances with proven Unix tools.
This setup has handled 200+ daily active users for six months. Not elegant, but it works.
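The "automated restart scripts" in that diagram are nothing clever: a cron job that restarts an instance when its resident memory drifts past a threshold. A rough sketch; the script name, unit argument, and threshold are mine, so adapt them to however you run your instances:
#!/usr/bin/env bash
## Hypothetical watchdog: restart a systemd unit whose RSS exceeds a limit
## Usage (from cron every few minutes): ollama-watchdog.sh <unit> <max-rss-kb>
unit="$1"; max_kb="$2"
pid=$(systemctl show -p MainPID --value "$unit")
rss_kb=$(awk '/VmRSS/ {print $2}' "/proc/${pid}/status" 2>/dev/null)
if [ -n "$rss_kb" ] && [ "$rss_kb" -gt "$max_kb" ]; then
  echo "$(date): ${unit} RSS ${rss_kb} kB > ${max_kb} kB, restarting" >> /var/log/ollama-watchdog.log
  systemctl restart "$unit"
fi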