Production Disasters and How to Fix Them

Q

Why does Ollama crash with "SIGKILL: 9" after 10 minutes in production?

A

The OOM killer is murdering your processes. Your model is eating more RAM than the system can handle.

The fix: Set proper container memory limits and use smaller models. If you're running Llama 3.3 70B on a 32GB server with other services, you're fucked. Switch to the 7B model or get more RAM.

## Docker with proper limits
docker run -d --memory=16g --oom-kill-disable=false ollama/ollama

Q

My response times went from 2 seconds to 45 seconds in production. WTF?

A

You hit the concurrent request limit. Ollama defaults to processing requests sequentially, which is fine for 1-2 users but absolute garbage for production load.

The nuclear option:

## Enable parallel processing (will use more VRAM)
OLLAMA_NUM_PARALLEL=4 ollama serve

The smart option: Set up multiple Ollama instances behind a load balancer.

Q

Models keep unloading every 5 minutes, making the first request super slow

A

This is Ollama's default behavior to "save memory." It's annoying as hell in production.

## Keep models loaded forever
OLLAMA_KEEP_ALIVE=-1 ollama serve
## Or keep for 2 hours
OLLAMA_KEEP_ALIVE=2h ollama serve

Warning: Your RAM usage will stay high, but response times will be consistent.
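
If you can't change the server's environment, the API also accepts a keep_alive field per request. A quick sketch (host and model tag are whatever you actually run):

## Load the model and pin it in memory via the API (negative keep_alive = don't unload)
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.3:7b", "prompt": "warmup", "keep_alive": -1}'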

Q

GPU utilization is shit - showing 20% usage but responses are slow

A

Classic GPU memory fragmentation. Your model loaded but can't efficiently use the available VRAM.

The fix that actually works:

## Force the model to use a specific number of GPU layers
OLLAMA_NUM_GPU_LAYERS=35 ollama run llama3.3:7b

Check what's actually loaded:

nvidia-smi
## vs
ollama ps

Q

Getting "model not found" errors randomly in production

A

Race condition in model loading. Multiple requests are trying to load the same model simultaneously.

Quick fix: Pre-load models at startup:

## In your deployment script
ollama pull llama3.3:7b
ollama run llama3.3:7b "test"  ## Load into memory

Q

Container restarts kill all loaded models

A

No persistent model storage configured properly.

Docker fix:

docker run -d \
  -v ollama_models:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama

Kubernetes fix: Use persistent volumes, not ephemeral storage.

Q

API requests timeout after 30 seconds

A

Your reverse proxy or load balancer has shitty timeout settings.

Nginx fix:

location /api/ {
    proxy_pass http://ollama:11434;
    proxy_read_timeout 300s;
    proxy_connect_timeout 10s;
}

Q

Memory usage keeps growing until crash (memory leak)

A

Known issue with certain models and long conversations. Context windows fill up and never get cleaned.

Workaround:

## Restart Ollama daily (crontab entry)
0 2 * * * systemctl restart ollama

Better fix: Clear conversation context periodically in your app code (see the sketch below).
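
Here's one way to do that, as a minimal sketch rather than anything Ollama ships: keep a bounded message list per conversation and only send the recent turns to the /api/chat endpoint. The helper name, MAX_TURNS value, and model tag are illustrative.

import requests

MAX_TURNS = 20  # keep only the most recent 20 messages (plus any system prompt)

def chat(history, user_msg, model="llama3.3:7b", host="http://localhost:11434"):
    history.append({"role": "user", "content": user_msg})
    # Preserve a leading system prompt, drop the oldest turns beyond MAX_TURNS
    system = history[:1] if history and history[0]["role"] == "system" else []
    trimmed = system + history[len(system):][-MAX_TURNS:]
    resp = requests.post(
        f"{host}/api/chat",
        json={"model": model, "messages": trimmed, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    reply = resp.json()["message"]  # {"role": "assistant", "content": "..."}
    history.append(reply)
    return reply["content"]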

Q

Multiple users get the same response (response bleeding)

A

Context contamination between concurrent requests. This is fucking scary in production.

Immediate fix: Disable parallel processing until you can debug:

OLLAMA_NUM_PARALLEL=1 ollama serve

Proper fix: Update to Ollama 0.11.5+, which has better memory management.

Q

Ollama works fine, then suddenly "connection refused" errors

A

Process died silently.

Check your process manager logs.

## Check what actually happened
journalctl -u ollama -f
## Or for Docker
docker logs -f ollama_container

Common causes (quick triage commands below):

  • OOM killer struck again
  • GPU driver crashed
  • Disk full (models are huge)
  • ulimit hit (file descriptor limit)
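
A few quick triage commands for those causes (the paths and service names assume the setup above):

## Disk full? Models live under the Ollama data dir
df -h /root/.ollama

## Did the OOM killer strike? Look for "Killed process" in the kernel log
dmesg | grep -i "killed process"

## Current file descriptor limit (compare with what the service actually gets)
ulimit -n
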
Q

Model loading takes 3+ minutes on cold start

A

Storage I/O bottleneck.

Your models are probably on slow disks.

Quick wins:

  • Move models to SSD (not spinning rust)
  • Use local storage, not network mounts
  • Pre-warm models at container startup

Check your storage:

## Time model loading
time ollama run llama3.3:7b "test"

## Check disk I/O
iostat -x 1

Production Reality Check: What Nobody Tells You

I spent way too long battling Ollama in production after it worked perfectly on my MacBook. Here's the brutal truth about what actually breaks and how I eventually fixed it, based on real production deployments and way too much time staring at monitoring dashboards.

The "Works on My Machine" Trap

Your laptop is a controlled environment. Production is chaos. I learned this when our internal chatbot went from handling 3 developers to supporting 200+ employees. Everything that could go wrong did.

The biggest lie: "If it works locally, it'll work in production." Bullshit. Your laptop has 32GB unified memory and no competing processes. Production has limited RAM, CPU contention, network timeouts, and users who do unexpected shit that breaks everything.

Memory Management Hell

The official docs say Llama 3.3 7B needs "8GB minimum." That's technically correct but practically useless.

In production, you need way more RAM than you think. The OS eats 3-4GB, your other services probably use another 6-8GB, plus overhead for model loading and context windows that grow over time.

So that "8GB" model actually needs 24-32GB of system RAM to run reliably. I learned this the hard way after three days of mysterious OOMKills and angry Slack messages about the chatbot being down.

What actually works:

## Monitor real memory usage
watch -n1 'free -h && echo "---" && ollama ps'

The memory usage creeps up over time. Long conversations consume more context. Multiple concurrent users multiply everything. Plan for 3x the theoretical minimum.

Concurrency: Where Dreams Go to Die

Ollama's default behavior is designed for single users, not production workloads. The OLLAMA_NUM_PARALLEL=1 default means requests queue up like customers at a single checkout line.

I tried setting OLLAMA_NUM_PARALLEL=8 and promptly killed our server. The model weights are shared, but each parallel slot allocates its own context (KV cache), so VRAM usage climbs fast; with a 40GB model and 8 parallel slots the card ran out of memory almost immediately. Most of us don't have A100 clusters.

The solution that actually works: Multiple Ollama instances.

## Run multiple instances on different ports
OLLAMA_HOST=127.0.0.1:11434 ollama serve &
OLLAMA_HOST=127.0.0.1:11435 ollama serve &
OLLAMA_HOST=127.0.0.1:11436 ollama serve &

## Load balance between them

Each instance handles sequential requests but you get horizontal scaling without the VRAM explosion.
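
One way to front those instances, sketched for NGINX (the ports match the example above; the listen port and least_conn choice are just illustrative):

upstream ollama_pool {
    least_conn;
    server 127.0.0.1:11434;
    server 127.0.0.1:11435;
    server 127.0.0.1:11436;
}

server {
    listen 8080;
    location / {
        proxy_pass http://ollama_pool;
        proxy_read_timeout 300s;   # long generations; same idea as the timeout configs below
    }
}

Pair it with health checks so a dead instance gets pulled from rotation instead of eating requests.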

GPU Driver Nightmares

CUDA drivers are finicky as hell. What works in development breaks in production for mysterious reasons:

  • Driver version mismatches: CUDA 11.8 vs 12.1 can cause silent failures
  • Multiple CUDA versions: Development tools install different versions
  • Container runtime issues: Docker vs Podman vs native behave differently
  • GPU memory fragmentation: Long-running processes fragment VRAM

Debug GPU issues:

## Check CUDA is actually working
nvidia-smi
nvcc --version

## Ollama GPU detection
ollama run llama3.3:7b "test"
## Watch the output - should show GPU layers loaded

I spent a week debugging "slow inference" only to discover Ollama fell back to CPU because of a CUDA library mismatch and suspend/resume cycle issues.

Network and Load Balancer Gotchas

Load balancers expect web applications, not AI inference servers. Default timeouts (30 seconds) are too short for model responses, and health checks can fail while a model is still loading, so your monitoring needs to account for that.

HAProxy config that works:

backend ollama_backend
    timeout server 300s
    option httpchk GET /api/ps
    server ollama1 10.0.0.10:11434 check
    server ollama2 10.0.0.11:11434 check

NGINX timeout config:

proxy_connect_timeout 10s;
proxy_send_timeout 300s;
proxy_read_timeout 300s;

Storage: The Hidden Bottleneck

Models are massive files. Llama 3.3 70B is 40GB. Loading from slow storage kills performance:

  • Network storage: Adds 30+ seconds to cold starts
  • Spinning disks: Even local HDD is too slow
  • Container ephemeral storage: Gets wiped on restarts

What I learned: Put models on local NVMe SSDs. Period. Network attached storage and container volumes are too slow for production.

## Check your storage speed
dd if=/dev/zero of=/tmp/test bs=1G count=10 oflag=dsync
## Should be 1GB/s+ for good performance

Monitoring What Actually Matters

Standard monitoring misses the important stuff. CPU and RAM usage look fine until everything explodes.

Monitor these metrics:

  • Model load/unload frequency (high churn = memory pressure)
  • Response queue length (requests backing up)
  • Context window sizes (memory leaks show up here)
  • GPU memory fragmentation (nvidia-smi vs ollama ps differences)
  • Storage I/O during model loading (bottleneck detection)

## Simple monitoring script
while true; do
    echo "=== $(date) ==="
    echo "Models loaded:"
    ollama ps
    echo "Memory:"
    free -h | grep Mem
    echo "GPU:"
    nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
    echo "---"
    sleep 30
done

The Migration Path That Works

Don't go from laptop to production in one jump. Scale gradually:

  1. Single server, single model: Get basic deployment working
  2. Resource monitoring: Understand real usage patterns
  3. Multiple models: Test switching and memory management
  4. Multiple instances: Scale horizontally before vertically
  5. Load balancing: Add redundancy and request distribution

Each step reveals different failure modes. Better to fail small than catastrophically.

When to Give Up on Ollama

Sometimes Ollama isn't the right choice. It's fantastic for 10-50 users with reasonable response time expectations. Beyond that, the complexity explodes - and if you're hitting that wall, the specific symptoms and migration targets are covered in "The Migration Escape Hatch" below.

Real Production Architecture

Here's what actually works for 100+ users:

Load Balancer (HAProxy/NGINX)
├── Ollama Instance 1 (Port 11434) - Model A
├── Ollama Instance 2 (Port 11435) - Model A  
├── Ollama Instance 3 (Port 11436) - Model B
└── Ollama Instance 4 (Port 11437) - Model B

Shared NVMe storage for models
Prometheus/Grafana for monitoring  
Automated restart scripts for memory leaks

Each instance runs on dedicated hardware or in isolated containers with guaranteed resources. No fancy orchestration needed - just multiple instances with proven Unix tools.

This setup has handled 200+ daily active users for six months. Not elegant, but it works.

Production Deployment Options: What Actually Works

| Approach | Max Concurrent Users | Memory Overhead | Setup Complexity | When It Breaks |
|----------|----------------------|-----------------|------------------|----------------|
| Single Ollama Instance | 5-10 | Low | Easy | Under any real load |
| Ollama + OLLAMA_NUM_PARALLEL | 10-20 | High (multiplies VRAM) | Easy | GPU memory exhaustion |
| Multiple Ollama Instances | 50-100 | Medium | Medium | Need proper load balancing |
| Ollama + Redis Queue | 20-50 | Medium | Complex | Queue management overhead |
| Migration to vLLM | 100+ | Lower | High | Different API, rewrite needed |

The Production Deployment Checklist That Actually Prevents Disasters

After watching dozens of Ollama deployments fail spectacularly, I've learned there are specific things you can check before going live that will save your ass. This isn't theory - this is a checklist I use for every production deployment.

Pre-Production Testing (Do This or Cry Later)

Load test with realistic concurrency:

## Simple concurrent test - replace YOUR_OLLAMA_HOST with your server
for i in {1..10}; do
  curl -X POST YOUR_OLLAMA_HOST:11434/api/generate \
    -H "Content-Type: application/json" \
    -d '{"model": "llama3.3:7b", "prompt": "Test message"}' &
done
wait

If your server doesn't handle 10 concurrent requests smoothly, it won't handle production. I've seen too many deployments that worked perfectly with 1-2 users completely collapse under real production load.

Memory pressure testing:

## Fill up RAM and see what happens
stress --vm 1 --vm-bytes 80% --timeout 60s &
## Try to use Ollama while memory is under pressure
ollama run llama3.3:7b "memory test"

GPU memory testing:

## Check GPU memory fragmentation
nvidia-smi --query-gpu=memory.used,memory.free --format=csv --loop=1 &
## Load and unload models repeatedly
for i in {1..10}; do
  ollama run llama3.3:7b "test $i"
  sleep 5
done

If GPU memory doesn't get released properly, you have a memory leak that will kill your deployment within hours. Track it with the nvidia-smi loop above.

Environment Variables That Actually Matter

Most guides skip these, but they're critical. Here's what actually matters in production:

## Essential production config
export OLLAMA_NUM_PARALLEL=1   # Start conservative
export OLLAMA_KEEP_ALIVE=30m   # Balance memory vs response time
export OLLAMA_HOST=0.0.0.0     # Accept external connections
export OLLAMA_ORIGINS="*"      # CORS - tighten this for real deployments
export CUDA_VISIBLE_DEVICES=0  # Lock to specific GPU
export OLLAMA_DEBUG=1          # Enable detailed logging

The ones that will bite you (example overrides follow the list):

  • OLLAMA_TMPDIR: Default /tmp fills up on small systems
  • OLLAMA_MODELS: Put models on fast, persistent storage
  • OLLAMA_FLASH_ATTENTION: Can improve performance but might be unstable
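
Example overrides, assuming /data is your fast persistent disk (the paths are illustrative):

## Keep temp files and model blobs off small root/tmp filesystems
export OLLAMA_TMPDIR=/data/ollama-tmp
export OLLAMA_MODELS=/data/ollama/models
## Optional; benchmark on your hardware before trusting it in production
export OLLAMA_FLASH_ATTENTION=1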

Logging Setup That Saves Hours of Debugging

Default Ollama logging is garbage for production troubleshooting. Set up proper logging:

## Structured logging with rotation
ollama serve 2>&1 | tee -a /var/log/ollama/ollama.log | logger -t ollama

## Log GPU metrics alongside
nvidia-smi --query-gpu=timestamp,memory.used,memory.total,utilization.gpu \
  --format=csv --loop=10 >> /var/log/ollama/gpu.log &

What to log:

  • Every model load/unload event
  • Memory usage before/after requests
  • Response times over 10 seconds
  • Any CUDA errors (these are silent killers)
  • Client IP addresses for rate limiting

Health Checks That Actually Work

Standard HTTP health checks miss most Ollama problems:

#!/bin/bash
## /opt/ollama/healthcheck.sh

## Check if Ollama is responding
if ! curl -s YOUR_OLLAMA_HOST:11434/api/ps >/dev/null; then
  echo "CRITICAL: Ollama not responding"
  exit 2
fi

## Check if models are loaded (if expected)
LOADED=$(curl -s YOUR_OLLAMA_HOST:11434/api/ps | jq -r '.models | length')
if [ "$LOADED" -eq 0 ] && [ "$REQUIRE_LOADED_MODEL" = "true" ]; then
  echo "WARNING: No models loaded"
  exit 1
fi

## Check GPU memory usage as a percentage (raw MiB from nvidia-smi isn't comparable to 90)
GPU_USED=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -1)
GPU_TOTAL=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -1)
GPU_MEM=$((100 * GPU_USED / GPU_TOTAL))
if [ "$GPU_MEM" -gt 90 ]; then
  echo "CRITICAL: GPU memory >90%"
  exit 2
fi

## Check system memory
SYS_MEM=$(free | grep Mem | awk '{printf "%.0f", $3/$2 * 100}')
if [ "$SYS_MEM" -gt 95 ]; then
  echo "CRITICAL: System memory >95%"
  exit 2
fi

echo "OK: All checks passed"
exit 0

Run this every 30 seconds. Anything longer and you'll miss the rapid failures that kill Ollama.
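
One way to schedule it on a systemd host is a timer pair like this sketch (the unit names are illustrative; cron can't go below one minute, and AccuracySec is needed or systemd coalesces sub-minute timers):

# /etc/systemd/system/ollama-healthcheck.service
[Unit]
Description=Ollama health check

[Service]
Type=oneshot
ExecStart=/opt/ollama/healthcheck.sh

# /etc/systemd/system/ollama-healthcheck.timer
[Unit]
Description=Run the Ollama health check every 30 seconds

[Timer]
OnBootSec=60
OnUnitActiveSec=30
AccuracySec=1s

[Install]
WantedBy=timers.target

Enable it with systemctl enable --now ollama-healthcheck.timer and wire the exit codes into whatever alerting you already have.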

Container Configuration That Doesn't Suck

Most Docker examples online are development configs that break in production:

## docker-compose.yml that works
version: '3.8'
services:
  ollama:
    image: ollama/ollama:latest
    restart: unless-stopped

    # Resource limits (critical!)
    deploy:
      resources:
        limits:
          memory: 24G
        reservations:
          memory: 16G

    # GPU access
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0
      - OLLAMA_KEEP_ALIVE=30m
      - OLLAMA_NUM_PARALLEL=1

    # Storage (persistent and fast)
    volumes:
      - /data/ollama:/root/.ollama:Z
      - /tmp/ollama:/tmp:Z

    # Network
    ports:
      - "11434:11434"

    # Health checks
    healthcheck:
      test: ["CMD", "/opt/ollama/healthcheck.sh"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s

Key differences from development:

  • Memory limits prevent OOMKills from taking down the host
  • GPU runtime properly configured
  • Persistent storage for models
  • Health checks that actually test functionality

Automated Recovery Scripts

Ollama will crash. Plan for it:

#!/bin/bash
## /opt/ollama/restart-on-failure.sh

while true; do
  # Wait between checks
  sleep 60

  # Check if Ollama is healthy
  if ! /opt/ollama/healthcheck.sh; then
    echo "$(date): Ollama unhealthy, restarting..."

    # Kill any hung processes
    pkill -f ollama
    sleep 10

    # Clear GPU memory
    nvidia-smi --gpu-reset

    # Restart service
    systemctl restart ollama

    # Wait for startup
    sleep 30

    # Pre-load critical model
    ollama pull llama3.3:7b
    ollama run llama3.3:7b "warmup" >/dev/null 2>&1 &

    echo "$(date): Ollama restarted and warmed up"
  fi
done

This has saved my ass dozens of times. Manual restarts take too long in production.
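
If you run it, run it under systemd rather than a stray tmux session; a sketch (the unit name and path are illustrative):

# /etc/systemd/system/ollama-watchdog.service
[Unit]
Description=Ollama restart-on-failure watchdog
After=ollama.service

[Service]
ExecStart=/opt/ollama/restart-on-failure.sh
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target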

Monitoring Dashboards for Ollama

Standard monitoring misses the important stuff. Here's what a Grafana dashboard for Ollama should actually track:

Key metrics to track:

  • Model loading frequency (high frequency = memory pressure)
  • Queue depth over time (sustained >3 = need more capacity)
  • Response time percentiles (not just averages)
  • GPU memory usage vs allocation
  • Context window size distribution (memory leaks show up here)

Alert thresholds that work:

  • Response time >10s for >5 minutes
  • Queue depth >5 for >2 minutes
  • GPU memory >90% for >30 seconds
  • No successful responses for >5 minutes

The Migration Escape Hatch

Sometimes Ollama isn't the right tool. Have a migration path ready:

When to give up on Ollama:

  • Can't keep response times under 10 seconds
  • Need >50 concurrent users
  • Memory leaks require restarts more than daily
  • GPU utilization consistently <30% (wasting money)

Migration targets:

  • vLLM: Better concurrency handling, more complex setup
  • TGI: Hugging Face's solution, good K8s integration
  • OpenAI API: Sometimes just paying for it is cheaper

Migration checklist:

  • API compatibility (most are OpenAI-compatible)
  • Model format conversion (GGUF → other formats)
  • Resource requirements (often lower)
  • Operational complexity (usually higher)

The Nuclear Option: Kubernetes

If you need enterprise features, here's a minimal K8s deployment that works:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        resources:
          limits:
            memory: "24Gi"
            nvidia.com/gpu: 1
          requests:
            memory: "16Gi"
            cpu: "4"
        env:
        - name: OLLAMA_KEEP_ALIVE
          value: "30m"
        volumeMounts:
        - name: models
          mountPath: /root/.ollama
        livenessProbe:
          httpGet:
            path: /api/ps
            port: 11434
          initialDelaySeconds: 60
          periodSeconds: 30
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: ollama-models
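
The Deployment references a PVC named ollama-models that you still have to create. A minimal sketch - the size and access mode are assumptions; with replicas: 2 sharing one volume you need an RWX-capable storage class, or switch to a StatefulSet with a volume per pod:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models
spec:
  accessModes:
    - ReadWriteMany        # both replicas mount the same models volume
  resources:
    requests:
      storage: 200Gi       # Llama 3.3 70B alone is ~40GB; leave room for more models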

Warning: Kubernetes adds operational complexity. Only use if you need the enterprise features (auto-scaling, rolling updates, multi-cluster deployment).

This checklist has prevented more production disasters than I can count. Use it, customize it for your environment, but don't skip it. The 2 hours you spend on proper preparation and health checks saves 20 hours of 3AM debugging.
