Ollama Context Length Errors: AI-Optimized Technical Reference
Executive Summary
Ollama's default 2048-token context limit causes silent failures in production applications. The system discards older tokens without warnings when limits are exceeded, leading to conversation amnesia, incomplete document analysis, and personality drift. Solutions range from simple configuration changes to advanced hierarchical context management.
Critical Failure Modes
Silent Truncation (Primary Issue)
- Behavior: Oldest tokens are dropped first (FIFO) with no error or warning
- Consequences: System prompt deletion, conversation amnesia, incomplete reasoning
- Detection: Model forgets initial instructions, references only final portions of documents
- Frequency: Occurs after ~1500-2000 words of conversation
- Production Impact: Customer service bots become hostile, document analysis covers only conclusions
Performance Degradation Patterns
- 2K-4K tokens: Baseline performance
- 8K tokens: 1.5x slower response time
- 16K tokens: 3x slower, requires 8GB+ VRAM
- 32K+ tokens: 5-10x slower, requires 16GB+ VRAM
- Scaling: Quadratic complexity O(n²) for attention calculations
Configuration Solutions
Environment Variable Method (Primary)
export OLLAMA_NUM_CTX=8192
ollama serve
Critical Failure Points:
- Systemd service isolation prevents environment variable inheritance
- Docker containers require explicit environment passing
- Version-specific bugs in Ollama 0.1.29-0.1.33 range
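When the variable never reaches the server process (the systemd and Docker cases above), the context size can also be set per request through the API options, which sidesteps environment inheritance entirely. A minimal sketch using the official ollama Python client; the model name and prompt are placeholders.

import ollama

# Per-request override: num_ctx travels with the request, so the effective
# window no longer depends on how the server process was launched.
response = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "Summarize the attached report."}],
    options={"num_ctx": 8192},  # request an 8K window for this call only
)
print(response["message"]["content"])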
Systemd Service Configuration (Production Required)
sudo systemctl edit ollama.service
# In the override file that opens, add:
#   [Service]
#   Environment="OLLAMA_NUM_CTX=8192"
sudo systemctl daemon-reload && sudo systemctl restart ollama.service
Modelfile Method (Most Reliable)
echo "FROM llama3.1:70b" > Modelfile
echo "PARAMETER num_ctx 8192" >> Modelfile
ollama create custom-model -f Modelfile
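To confirm the parameter actually stuck to the new model, a quick check with the Python client's show() call works; the exact shape of the returned metadata varies between client versions, so treat this as a best-effort sketch.

import ollama

# Inspect the freshly created model and confirm num_ctx 8192 appears
# among its reported parameters (output format differs across client versions).
info = ollama.show("custom-model")
print(info)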
Context Length Recommendations by Use Case
Conservative (4K-8K tokens)
- Applications: General chatbots, code assistance, short documents
- Hardware: Works on consumer GPUs
- Performance: Minimal impact
- Failure Risk: Low
Moderate (16K-32K tokens)
- Applications: Long conversations, multi-page analysis
- Hardware: 8GB+ VRAM required
- Performance: 3-5x slower responses
- Failure Risk: Memory exhaustion on limited hardware
Aggressive (64K+ tokens)
- Applications: Research papers, entire codebase review
- Hardware: 16GB+ VRAM mandatory
- Performance: 10x+ slower, may timeout
- Failure Risk: High - requires specialized hardware
Advanced Context Management Strategies
Hierarchical Context Allocation
Tier 1: System Instructions (never truncated)
Tier 2: Key Facts (preserved when possible)
Tier 3: Recent Conversation (fills remaining space)
Tier 4: Old Messages (compressed or discarded)
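A minimal sketch of this tiering, using the rough 4-characters-per-token estimate cited later in this reference; every name here is illustrative application code, not part of any Ollama API.

# Tiered prompt assembly under a fixed token budget (illustrative sketch).
# Swap in a real tokenizer if one is available; the //4 estimate is rough.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def build_context(system_prompt, key_facts, messages, budget=8192, reserve_for_reply=1024):
    remaining = budget - reserve_for_reply
    # Tier 1: system instructions are never truncated
    remaining -= estimate_tokens(system_prompt)
    # Tier 2: key facts, kept while they still fit
    kept_facts = []
    for fact in key_facts:
        cost = estimate_tokens(fact)
        if cost <= remaining:
            kept_facts.append(fact)
            remaining -= cost
    # Tier 3: most recent messages fill whatever space is left
    kept_messages = []
    for msg in reversed(messages):              # walk newest to oldest
        cost = estimate_tokens(msg["content"])
        if cost > remaining:
            break                               # Tier 4: older messages dropped (or compressed, below)
        kept_messages.insert(0, msg)
        remaining -= cost
    return system_prompt, kept_facts, kept_messages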
Context Compression Techniques
- Summarization: Compress old conversation blocks to ~30% of their original tokens (see the sketch after this list)
- Template-based: Use patterns to compress repetitive exchanges
- Retrieval-augmented: Store context externally, retrieve relevant portions
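One way to implement the summarization technique is to have the model compress its own oldest turns before they fall out of the window, then carry the summary forward as a Tier 2 key fact. A hedged sketch with the ollama Python client; the 30% target matches the ratio above, while the prompt wording and model name are assumptions.

import ollama

# Compress old turns into a short summary that survives later truncation.
def compress_history(old_messages, target_ratio=0.3):
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old_messages)
    target_tokens = int((len(transcript) // 4) * target_ratio)   # rough 4-chars-per-token estimate
    prompt = (
        f"Summarize this conversation in at most {target_tokens} tokens, "
        f"keeping names, numbers, and decisions:\n\n{transcript}"
    )
    result = ollama.generate(model="llama3.1", prompt=prompt, options={"num_ctx": 8192})
    return result["response"]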
Dynamic Allocation Patterns
- Technical support: 10 recent turns, 30% system ratio
- Creative writing: 15 recent turns, 20% system ratio
- Document analysis: 5 recent turns, 10% system ratio
- Code review: 8 recent turns, 20% system ratio
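These presets drop neatly into a lookup table that the assembly code above can consult per workload; the numbers are simply the ones listed here.

# Per-use-case allocation presets: recent turns kept, plus the share of the
# window reserved for the system prompt.
ALLOCATION_PRESETS = {
    "technical_support": {"recent_turns": 10, "system_ratio": 0.30},
    "creative_writing":  {"recent_turns": 15, "system_ratio": 0.20},
    "document_analysis": {"recent_turns": 5,  "system_ratio": 0.10},
    "code_review":       {"recent_turns": 8,  "system_ratio": 0.20},
}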
Debugging and Detection Methods
Context Limit Detection
# Test with a known large input that is well past the configured window
yes "Large test content line for context probing." | head -n 2000 > test.txt
ollama run llama3.1:70b "Summarize: $(cat test.txt)"
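Because truncation is silent, a recall probe is more reliable than waiting for an error: plant a marker at the very start of a long prompt and check whether the model can still repeat it back. If the FIFO truncation described above applies, the marker is the first thing to go. Sketch only; the marker text, padding, and model name are arbitrary.

import ollama

MARKER = "ZEBRA-7741"
filler = "Lorem ipsum dolor sit amet. " * 800   # roughly 5-6K tokens of padding

prompt = (
    f"Remember this code: {MARKER}.\n\n{filler}\n\n"
    "What code were you asked to remember? Reply with the code only."
)
# With num_ctx left at 2048, the front of the prompt (and the marker) should
# be discarded; raise num_ctx and the probe should start passing.
reply = ollama.generate(model="llama3.1", prompt=prompt, options={"num_ctx": 2048})
print("context truncated:", MARKER not in reply["response"])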
Memory vs Context Differentiation
- Context errors: Silent truncation, no error messages, gradual quality degradation
- Memory errors: CUDA OOM crashes, explicit error messages, immediate failure
Production Monitoring
- Reference tests for conversation recall
- Token counting estimation (4 characters ≈ 1 token for English; see the guardrail sketch after this list)
- Response quality degradation patterns
- GPU memory usage correlation
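A lightweight guardrail built on that 4-characters-per-token estimate can flag conversations before they overflow, rather than after quality drops. Purely illustrative application code; the warning threshold is a judgment call.

# Rough production guardrail: warn before a conversation silently overflows
# the configured window. The //4 ratio is an approximation for English text,
# so keep a safety margin.
def context_usage(messages, num_ctx=8192, warn_at=0.8):
    used = sum(len(m["content"]) // 4 for m in messages)
    ratio = used / num_ctx
    if ratio >= warn_at:
        print(f"WARNING: ~{used} of {num_ctx} tokens used ({ratio:.0%}) - truncation imminent")
    return ratio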
Model-Specific Capabilities
Llama 3.1 Series
- Maximum: 128K tokens supported
- Optimal: 8K-32K for production balance
- Performance: Good up to 32K, degrades beyond
Code-Focused Models
- Typical requirement: 16K+ tokens for full file analysis
- Consideration: Comments and documentation consume significant budget
Older Model Families
- Hard limits: 4K-8K maximum, no graceful degradation
- Behavior: Explicit failures rather than silent truncation
Critical Warnings
Production Deployment Issues
- Default 2K limit: Inadequate for any real application
- Silent failures: No error logging when limits exceeded
- Environment inheritance: Broken in systemd and Docker without explicit configuration
- Version instability: Context handling varies between Ollama releases
Hardware Resource Planning
- VRAM consumption: Climbs sharply with context length (the KV cache alone grows linearly with every token in the window, on top of model weights)
- Response latency: Quadratic scaling affects user experience
- Memory fragmentation: Large contexts may cause allocation failures
Common Misconceptions
- Word ≠ Token: "Optimization" tokenizes into roughly 3 tokens, not 1; token counts always run higher than word counts
- Model capability: Ollama defaults don't reflect actual model limits
- Error visibility: No built-in alerts for context truncation
Implementation Checklist
Basic Setup
- Set OLLAMA_NUM_CTX to minimum 4096
- Configure systemd service environment if using systemd
- Test with large inputs to verify settings
- Monitor GPU memory usage patterns
Production Hardening
- Implement context length monitoring
- Set up hierarchical context management
- Configure compression for long conversations
- Plan failover strategies for memory exhaustion
Performance Optimization
- Benchmark response times at different context lengths (see the sweep sketch after this list)
- Implement context caching for repeated patterns
- Monitor VRAM usage with nvidia-smi
- Set up alerts for performance degradation
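A simple latency sweep across num_ctx values makes the scaling visible on your own hardware before you commit to a window size; run it next to nvidia-smi to correlate response time with VRAM. Sketch only; the model name, prompt, and num_predict cap are placeholders.

import time
import ollama

for num_ctx in (2048, 4096, 8192, 16384):
    prompt = "Explain what a KV cache is. " * (num_ctx // 16)   # scale the input with the window
    start = time.time()
    ollama.generate(
        model="llama3.1",
        prompt=prompt,
        options={"num_ctx": num_ctx, "num_predict": 64},   # cap generation so timing reflects prompt processing
    )
    print(f"num_ctx={num_ctx:>6}: {time.time() - start:.1f}s")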
Resource Requirements
Minimum Viable Context (4K-8K)
- VRAM: 4-6GB
- Response time: < 5 seconds
- Use cases: Basic conversations, short documents
Production Context (16K-32K)
- VRAM: 8-16GB
- Response time: 10-30 seconds
- Use cases: Complex analysis, long conversations
Enterprise Context (64K+)
- VRAM: 16GB+ mandatory
- Response time: 60+ seconds
- Use cases: Research analysis, entire document processing
- Infrastructure: Specialized GPU hardware required
Breaking Points and Failure Scenarios
Context Window Exhaustion
- Symptom: Model forgets system instructions mid-conversation
- Cause: 2048 token default exceeded
- Solution: Increase OLLAMA_NUM_CTX to 8192+
Memory Allocation Failure
- Symptom: CUDA out of memory errors
- Cause: Context size exceeds GPU capacity
- Solution: Reduce context length or upgrade hardware
Performance Degradation
- Symptom: Response times > 30 seconds
- Cause: Quadratic attention scaling with large contexts
- Solution: Implement context compression or reduce window size
Configuration Override Failure
- Symptom: Settings appear correct but 2K limit persists
- Cause: Environment variable inheritance issues
- Solution: Fall back to the Modelfile method, per-request API options (as sketched earlier), or setting the parameter directly in the CLI session
This technical reference provides the operational intelligence needed for successful Ollama context management in production environments, with specific attention to common failure modes and their resolutions.
Useful Links for Further Investigation
Essential Resources for Ollama Context Length Troubleshooting
Link | Description |
---|---|
Ollama Context Size GitHub Discussion #2204 | Where everyone figured out Ollama's 2K context default is trash. Essential reading, especially workarounds in comments. |
Ollama FAQ - Memory and Context | Official docs that barely cover context problems. Read the GitHub issues instead. |
Cannot Increase num_ctx Beyond 2048 Issue #9519 | The main issue where everyone discovers OLLAMA_NUM_CTX doesn't work. Comments have better solutions than docs. |
Context Length Limitations in Open WebUI #4246 | Detailed discussion of how context limits affect web interface applications. Good insights into user-facing implications. |
Conversation Context Issues #2595 | Troubleshooting conversation memory problems and context window shrinkage in API usage. |
OLLAMA_NUM_CTX Environment Variable Documentation | Sparse official documentation on environment variables. Community examples are more useful than official docs. |
Intel IPEX-LLM Context Size Discussion | Advanced discussion of context size configuration in specialized deployment scenarios. |
Ollama Modelfile Reference Documentation | Official documentation on configuring context size through modelfiles. |
Context Window Size Increase Tutorial | Step-by-step guide for increasing Ollama's context window size. |
What Happens When You Exceed Token Context Limit | Comprehensive explanation of silent truncation behavior and its implications for applications. |
LLM Memory Management Guide | Advanced strategies for managing long conversations and implementing context hierarchies. |
Finetuning LLMs for Longer Context | Technical deep-dive into performance implications of longer context windows. Essential for understanding scaling challenges. |
Context Length vs Max Token vs Maximum Length | Clear explanation of terminology differences. Helps avoid confusion between context length and other token-related concepts. |
How to Handle Context Length Errors in LLMs | Practical strategies for handling context length errors in application development. |
Goose Documentation: Context Length Troubleshooting | Real-world examples of context length error handling in production applications. |
LLM Token Management Best Practices | LangChain documentation on managing memory and context in conversational applications. |
Aider Chat: Token Limits Troubleshooting | Practical solutions for token limit issues in code assistance applications. |
Llama 3.1 Context Documentation | Real-world explanation of Llama 3.1's context capabilities and limitations. |
Context Window Sizes Reference | Comprehensive list of context window sizes for different model families. |
Model-Specific Context Configuration | Guide for optimizing context windows for different model types. |
Ollama Model Library Overview | Main Ollama homepage with available models and basic specifications. |
ChromaDB Getting Started | Vector database for implementing retrieval-augmented context systems. |
Sentence Transformers for Context Embeddings | Embedding models for implementing semantic context retrieval. |
LangChain Ollama Integration | Framework for building applications with advanced context management on top of Ollama. |
Vector Database Comparison for RAG | Comprehensive comparison of vector databases suitable for context storage and retrieval. |
NVIDIA System Management Interface | Essential tool for monitoring GPU memory usage when debugging context-related performance issues. |
Prometheus Node Exporter | System monitoring for tracking memory usage patterns related to context length. |
Grafana Dashboards for LLM Monitoring | Pre-built monitoring dashboards for tracking Ollama performance and resource usage. |
vLLM vs Ollama Performance Comparison | When context length management becomes problematic, comparison with more efficient alternatives. |
LM Studio as Ollama Alternative | GUI-based local LLM management with different context handling approaches. |
Open WebUI Context Management | Web interface with built-in context management features for long conversations. |
Ollama Discord Community | Active community for real-time troubleshooting and sharing context management strategies. |
Ollama GitHub Issues | Community discussions about local LLM deployment including context management best practices. |
Stack Overflow Ollama Questions | Technical Q&A focused on implementation details and troubleshooting. |
Ollama Python Library | Official Python client with context configuration examples. |
Ollama JavaScript Library | Official JavaScript client for web applications requiring context management. |
Docker Ollama Configuration | Container deployment with proper context length configuration. |
Kubernetes Ollama Deployment | Helm charts for production deployment with context management considerations. |