Ollama Production Deployment: Critical Intelligence Summary
Critical Failure Modes and Solutions
Memory Management Disasters
SIGKILL after 10 minutes: OOMKiller terminating processes due to insufficient RAM
- Root cause: Model consumes more RAM than system can handle
- Real requirements: 24-32GB system RAM for "8GB" models (3-4x theoretical minimum)
- Breaking point: Multiple services + OS overhead + context window growth
- Solution: Set proper container memory limits, use smaller models
docker run -d --name ollama --gpus=all --memory=24g --oom-kill-disable=false \
  -p 11434:11434 -v ollama:/root/.ollama ollama/ollama
Memory leak pattern: Usage grows until crash, requires daily restarts
- Frequency: Affects certain models with long conversations
- Workaround: Daily restart automation
- Better fix: Clear conversation context in application code
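If the application-level fix is not an option, the daily restart workaround can be automated with a single cron entry. A minimal sketch, assuming a systemd-managed service named "ollama"; adjust the time to your low-traffic window:
# Restart Ollama every day at 04:00 to reclaim leaked memory (add via `crontab -e` as root)
0 4 * * * systemctl restart ollama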
Concurrency Bottlenecks
Response time degradation: 2 seconds → 45 seconds under load
- Root cause: Sequential request processing (OLLAMA_NUM_PARALLEL=1 default)
- Breaking point: >2 concurrent users
- Nuclear option: OLLAMA_NUM_PARALLEL=4 (multiplies VRAM usage)
- Production solution: Multiple Ollama instances with load balancer
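A minimal sketch of that multi-instance layout, assuming two GPUs with one Ollama process pinned to each; the ports are illustrative:
# One Ollama instance per GPU, each bound to its own port
CUDA_VISIBLE_DEVICES=0 OLLAMA_HOST=127.0.0.1:11434 ollama serve &
CUDA_VISIBLE_DEVICES=1 OLLAMA_HOST=127.0.0.1:11435 ollama serve &
# Point the load balancer (see "Load Balancer Timeouts" below) at both ports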
Context contamination: Users receive other users' responses
- Severity: Critical security issue in production
- Immediate fix: Disable parallel processing
- Proper fix: Update to Ollama 0.11.5+ for better memory management
GPU Utilization Problems
Low GPU usage (20%): Memory fragmentation prevents efficient VRAM use
- Solution: Force a specific number of offloaded layers via the num_gpu option (in a Modelfile or per request)
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "prompt": "warmup", "options": {"num_gpu": 35}}'
Silent CPU fallback: GPU driver mismatches cause undetected failures
- Detection: Monitor nvidia-smi vs ollama ps differences
- Common causes: CUDA 11.8 vs 12.1 conflicts, suspend/resume cycles
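A quick manual check for silent fallback; recent Ollama builds show the processor split in ollama ps, so a CPU entry there while nvidia-smi reports idle VRAM is the giveaway:
# GPU view: utilization and VRAM actually in use
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader
# Ollama view: the PROCESSOR column should read "100% GPU" for loaded models
ollama ps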
Resource Requirements (Production Reality)
Memory Allocation Table
Model Size | Theoretical RAM | Production RAM Required | Max Concurrent Users |
---|---|---|---|
7B | 8GB | 24-32GB | 10-20 |
13B | 16GB | 48-64GB | 5-10 |
70B | 40GB | 120-160GB | 1-3 |
Storage Performance Requirements
- Model loading time: 3+ minutes indicates storage bottleneck
- Minimum: Local NVMe SSD (1GB/s+ throughput)
- Avoid: Network storage (adds 30+ seconds), spinning disks
- Test command (run it in the directory that holds your models; /tmp is often RAM-backed tmpfs and will flatter the numbers):
dd if=/dev/zero of=./ollama-disk-test bs=1G count=10 oflag=dsync && rm ./ollama-disk-test
GPU Memory Multipliers
- Single instance: 1x model size
- OLLAMA_NUM_PARALLEL=8: 8x model size (320GB+ for 40GB model)
- Multiple instances: 1x per instance (horizontal scaling)
Production Configuration
Essential Environment Variables
export OLLAMA_NUM_PARALLEL=1 # Conservative start
export OLLAMA_KEEP_ALIVE=30m # Balance memory vs response time
export OLLAMA_HOST=0.0.0.0 # External connections
export OLLAMA_ORIGINS="*" # CORS: allows all origins - tighten for production
export CUDA_VISIBLE_DEVICES=0 # Lock to specific GPU
export OLLAMA_DEBUG=1 # Verbose debug logging while stabilizing
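When Ollama runs as a systemd service (the default for the Linux installer), exports in your shell never reach it; a drop-in override is one way to persist the settings above (a sketch, assuming the service is named "ollama"):
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_KEEP_ALIVE=30m"
Environment="OLLAMA_HOST=0.0.0.0"
EOF
sudo systemctl daemon-reload && sudo systemctl restart ollama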
Load Balancer Timeouts
Default proxy timeouts (around 30s) abort long generation requests before they complete
- NGINX: proxy_read_timeout 300s
- HAProxy: timeout server 300s
- Health check: GET /api/ps with 300s timeout
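A minimal NGINX sketch tying the pieces together: the two instance ports from the concurrency section as backends, with timeouts raised to 300s (the listen port is illustrative):
upstream ollama_backend {
    least_conn;
    server 127.0.0.1:11434;
    server 127.0.0.1:11435;
}
server {
    listen 8080;
    location / {
        proxy_pass http://ollama_backend;
        proxy_connect_timeout 10s;
        proxy_send_timeout 300s;
        proxy_read_timeout 300s;
    }
}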
Container Resource Limits
deploy:
  resources:
    limits:
      memory: 24G
    reservations:
      memory: 16G
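For context, this is roughly where that fragment sits in a full Compose file (a sketch; the volume name and port mapping are the conventional defaults):
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    deploy:
      resources:
        limits:
          memory: 24G
        reservations:
          memory: 16G
volumes:
  ollama: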
Deployment Architecture Comparison
Architecture | Max Users | Memory Overhead | Complexity | Failure Mode |
---|---|---|---|---|
Single Instance | 5-10 | Low | Easy | Any real load |
Parallel Processing | 10-20 | High (VRAM×N) | Easy | GPU exhaustion |
Multiple Instances | 50-100 | Medium | Medium | Load balancing |
vLLM Migration | 100+ | Lower | High | API rewrite needed |
Critical Monitoring Metrics
Real-time Failure Indicators
- Model load/unload frequency: High churn = memory pressure
- Response queue length: Sustained >3 = capacity needed
- GPU memory fragmentation: nvidia-smi vs ollama ps differences
- Context window sizes: Growth indicates memory leaks
Alert Thresholds
- Response time >10s for >5 minutes
- Queue depth >5 for >2 minutes
- GPU memory >90% for >30 seconds
- No successful responses for >5 minutes
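The response-time threshold is easy to probe from cron or any scheduler; a rough sketch (the local endpoint and 8B model are assumptions, the 10s threshold matches the figure above):
# Time one non-streaming generation and flag it if it exceeds 10 seconds
START=$(date +%s)
curl -s -o /dev/null -X POST http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1:8b", "prompt": "ping", "stream": false}'
ELAPSED=$(( $(date +%s) - START ))
[ "$ELAPSED" -gt 10 ] && echo "ALERT: generate latency ${ELAPSED}s"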
Production Health Checks
Functional Health Check Script
#!/bin/bash
# Functional health check: exit 2 (unhealthy) on any failure, 0 otherwise
# Ollama API responding?
curl -sf http://localhost:11434/api/ps >/dev/null || exit 2
# GPU memory utilization as a percentage of total VRAM (first GPU)
GPU_PCT=$(nvidia-smi --query-gpu=memory.used,memory.total \
  --format=csv,noheader,nounits | head -n1 | awk -F', ' '{printf "%.0f", $1/$2*100}')
[ "$GPU_PCT" -gt 90 ] && exit 2
# System memory utilization as a percentage
SYS_MEM=$(free | awk '/^Mem/ {printf "%.0f", $3/$2*100}')
[ "$SYS_MEM" -gt 95 ] && exit 2
exit 0
Migration Decision Matrix
When to Abandon Ollama
- >100 concurrent users: vLLM handles 793 TPS vs Ollama's 41 TPS
- Multiple simultaneous models: TGI has better resource management
- High availability requirements: Managed services for stability
- GPU utilization <30%: Resource waste indicates wrong tool
Migration Targets
- vLLM: Better concurrency, complex setup, API compatible
- TGI: Kubernetes-native, Hugging Face ecosystem
- Managed services: Higher cost, lower operational burden
Automated Recovery Implementation
Restart Script for Memory Leaks
#!/bin/bash
# Watchdog: if the health check fails, hard-restart Ollama and re-warm the model
while true; do
  sleep 60
  if ! /opt/ollama/healthcheck.sh; then
    pkill -f ollama
    sleep 5                      # give processes time to release the GPU
    nvidia-smi --gpu-reset -i 0  # needs root; fails if the GPU is still busy
    systemctl restart ollama
    ollama run llama3.1:8b "warmup" >/dev/null 2>&1 &
  fi
done
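Keeping the loop alive is left to whatever supervisor you already run; the simplest option is to save it as a script and launch it in the background (the /opt/ollama/watchdog.sh path and log location are illustrative):
# Install and launch the watchdog
chmod +x /opt/ollama/watchdog.sh
nohup /opt/ollama/watchdog.sh >/var/log/ollama-watchdog.log 2>&1 &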
Pre-production Load Testing
# Concurrent request test
for i in {1..10}; do
  curl -s -X POST http://localhost:11434/api/generate \
    -H "Content-Type: application/json" \
    -d '{"model": "llama3.1:8b", "prompt": "Test"}' &
done
wait
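The same test with per-request timing makes the 2s to 45s degradation visible directly (curl's %{time_total} is wall-clock seconds):
# Concurrent request test with latency reporting
for i in {1..10}; do
  curl -s -o /dev/null -w "request $i: %{time_total}s\n" \
    -X POST http://localhost:11434/api/generate \
    -H "Content-Type: application/json" \
    -d '{"model": "llama3.1:8b", "prompt": "Test", "stream": false}' &
done
wait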
Cost-Benefit Analysis
Resource Investment Reality
- Development: Works on MacBook (32GB unified memory)
- Production minimum: 3x theoretical requirements
- Operational overhead: Daily restarts, monitoring, debugging
- Break-even point: self-hosting starts beating managed APIs at roughly 50+ daily users
Hidden Costs
- GPU driver maintenance: Silent failures require expertise
- Storage requirements: Local NVMe SSD mandatory
- Network configuration: Load balancer timeout tuning
- Monitoring setup: Custom metrics beyond standard monitoring
Critical Warnings
- Never use network storage for models in production
- Memory leaks require automated restarts - not optional
- GPU memory fragmentation causes performance degradation over time
- Default configurations fail under any realistic load
- Silent CPU fallback from GPU driver issues causes 10x slowdown
- Context contamination between users is a security vulnerability
- OOMKiller strikes without warning when memory planning inadequate
Useful Links for Further Investigation
Production Resources That Actually Help
Link | Description |
---|---|
Ollama GitHub Issues | Search here first; someone has usually hit (and sometimes fixed) the same problem.
Production deployment troubleshooting | Official troubleshooting guide covering common production deployment issues.
Stack Overflow ollama troubleshooting | Community Q&A with practical fixes for common Ollama problems.
Ollama Docker issues thread | GitHub thread on container-specific problems in production.
Multi-instance Ollama setup guide | Walkthrough of a real multi-instance production architecture.
Ollama load balancing examples | Working HAProxy and NGINX configurations for distributing requests.
Kubernetes Ollama manifests | Deployment manifests for running Ollama in a K8s cluster.
Open WebUI Docker Compose | Production-ready Compose files for a web front end to Ollama.
Prometheus metrics for Ollama | System metrics worth scraping for operational awareness.
Grafana dashboard examples | Search "ollama" for community-contributed monitoring dashboards.
NVIDIA monitoring tools | GPU-specific monitoring utilities and documentation.
Linux memory debugging | Memory management and OOMKiller debugging techniques.
vLLM migration guide | When and how to migrate from Ollama to vLLM.
TGI (Text Generation Inference) | Hugging Face's production serving stack for LLMs.
Ray Serve for LLMs | Distributed serving framework for large-scale LLM deployments.
Triton Inference Server | NVIDIA's enterprise inference server with dynamic batching and multi-framework support.
GPU memory calculator | Estimate real VRAM requirements before provisioning hardware.
AMD ROCm compatibility | Getting AMD GPUs working with Ollama.
Cloud GPU comparison | Cost and spec comparison across cloud GPU providers.
Cost analysis for local vs cloud LLMs | Detailed self-hosting vs cloud API cost comparison.
Load testing tools for LLMs | Locust and similar tools for realistic load tests.
API compatibility testing | Ollama's OpenAI API compatibility documentation.
Model conversion tools | llama.cpp tooling for converting models to GGUF.
Performance benchmarking tools | Examples for speed and memory benchmarking.
Ollama security best practices | Official security guidelines for Ollama deployments.
Docker best practices | Container optimization and hardening guidance.
Network security for AI services | OWASP AI Security and Privacy Guide.
Data privacy considerations | What data stays local in an Ollama deployment.
Ollama Discord | Real-time help from the community.
Ollama deployment best practices | Docker Scout documentation on container security and deployment.
Hugging Face Forums | Technical discussion and troubleshooting for LLM deployments.
Ollama GitHub Issues | Issues filtered for "production" to find deployment-specific reports and fixes.