Text-Generation-WebUI: AI-Optimized Technical Reference
Overview
Text-generation-webui is a local LLM hosting solution that eliminates API costs and keeps proprietary code on-premises. It was created by oobabooga and has active GitHub community support.
Critical Configuration Requirements
Hardware Specifications
- 8GB VRAM: 7B models at Q4 quantization, 5-10 tokens/second
- 12GB VRAM: 7B models at higher quality, some 13B at Q4 (close other applications)
- 24GB VRAM: Sweet spot for serious work, handles up to 30B parameters
- 48GB+ VRAM: Enables GPT-4 competitive models
- CPU-only: 1-2 tokens/second maximum (barely usable for interactive work)
Model Format Support & Performance Impact
Format | Performance | Compatibility | Best Use Case |
---|---|---|---|
GGUF | Good, consistent | Excellent | Production deployments |
HuggingFace | Excellent | Requires 48GB+ VRAM | Research/development |
ExLlama | Fastest | Model-specific compatibility | Performance-critical apps |
AutoGPTQ | Variable | Hit-or-miss on older cards | Legacy support |
AutoAWQ | Better than GPTQ | Limited model selection | Newer hardware |
Quantization Trade-offs
- Q2: Fast but significantly degrades intelligence
- Q4: Sweet spot for most use cases (rough size math is sketched after this list)
- Q8: Best quality, high VRAM requirement
- Q3: Avoid - no significant advantages
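For planning purposes, model footprint is easy to approximate from parameter count and quantization level. A minimal sketch in Python, assuming typical bits-per-weight figures for common GGUF k-quants (roughly 2.6 for Q2_K, 4.5 for Q4_K_M, 8.5 for Q8_0; exact values vary by quant variant and architecture) and a flat overhead allowance for the KV cache and runtime buffers:

```python
# Back-of-envelope VRAM estimate for a quantized GGUF model.
# Bits-per-weight values are approximations for common k-quants;
# actual sizes vary by quant variant and architecture.
BITS_PER_WEIGHT = {"Q2_K": 2.6, "Q4_K_M": 4.5, "Q8_0": 8.5}

def estimate_vram_gb(params_billions: float, quant: str,
                     overhead_gb: float = 1.5) -> float:
    """Weights plus a flat allowance for KV cache and runtime buffers."""
    weights_gb = params_billions * BITS_PER_WEIGHT[quant] / 8
    return weights_gb + overhead_gb

for quant in BITS_PER_WEIGHT:
    print(f"7B at {quant}: ~{estimate_vram_gb(7, quant):.1f} GB")
```

At these figures a 7B model lands around 3.8 GB (Q2), 5.4 GB (Q4), and 8.9 GB (Q8), which is consistent with the 8GB-VRAM tier above.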
Critical Failure Modes
Memory Management Failures
- OutOfMemoryError: VRAM calculator estimates are optimistic; add a 20% buffer (a pre-flight check is sketched after this list)
- System crashes with large models: Windows 11 reserves 1-2GB of VRAM for the OS
- Performance degradation: Chrome can consume significant VRAM during operation
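Given how optimistic the calculators run, a quick pre-flight check before loading is cheap insurance. A sketch assuming an NVIDIA GPU with `nvidia-smi` on PATH; the 20% buffer matches the guidance above:

```python
import subprocess

def free_vram_gb(gpu_index: int = 0) -> float:
    """Free VRAM in GiB, queried via nvidia-smi (NVIDIA GPUs only)."""
    out = subprocess.check_output(
        ["nvidia-smi", f"--id={gpu_index}",
         "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        text=True,
    )
    return float(out.strip()) / 1024  # nvidia-smi reports MiB

def model_fits(model_gb: float, buffer: float = 0.20) -> bool:
    """Apply the 20% safety buffer before attempting a load."""
    return model_gb * (1 + buffer) <= free_vram_gb()

print(model_fits(5.4))  # e.g. a 7B Q4_K_M file
```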
Platform-Specific Issues
- Windows installation: Regular failures requiring portable version or WSL
- Extension breakage: Major updates frequently break community extensions
- Driver conflicts: Windows users experience more CUDA dependency issues
- Linux setup: Generally smoother, but assumes comfort with compiling from source
Installation Reality Check
Success Probability by Platform
- Linux: High success rate, minimal configuration issues
- Windows native: Frequent installer failures, driver conflicts
- Windows WSL: More reliable than native Windows installation
- macOS: Limited CUDA support affects performance
Time Investment Required
- Successful installation: 30 minutes to 2 hours
- Failed installation debugging: Full weekend possible
- Extension setup and configuration: Additional 2-4 hours
- Model testing and optimization: 4-8 hours for production setup
Operational Intelligence
Model Selection Criteria
For coding tasks:
- CodeLlama, WizardCoder, DeepSeek-Coder
- Avoid uncensored models (worse instruction following)
For general chat:
- Llama-2-Chat, Mistral-7B-Instruct, OpenHermes
- bartowski and TheBloke (legacy) quantizations are the most reliable (a download sketch follows this list)
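Fetching a quant programmatically is a one-liner with the `huggingface_hub` package. The repo and file names below are illustrative examples, not fixed recommendations; check the model card for the exact quant files available:

```python
from huggingface_hub import hf_hub_download

# Repo and filename are illustrative; browse the model card for the
# exact quant files on offer (Q4_K_M is the usual starting point).
path = hf_hub_download(
    repo_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
    filename="Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
    local_dir="text-generation-webui/models",
)
print(f"Saved to {path}")
```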
Performance Optimization
- GPU utilization target: 90%+ for optimal performance
- Layer allocation: Start with 50% VRAM allocation and adjust upward (starting-point sketch after this list)
- Buffer management: API streaming requires adequate chunk sizes
- Memory clearing: Restart required for different model sizes
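The 50%-then-adjust rule can be turned into a rough starting point for `n_gpu_layers`. A crude sketch that assumes layers are roughly uniform in size (they are not exactly; embedding and output layers differ):

```python
def initial_gpu_layers(model_file_gb: float, total_layers: int,
                       free_vram: float, fraction: float = 0.5) -> int:
    """Offload layers until ~half of free VRAM is used, then tune upward."""
    per_layer_gb = model_file_gb / total_layers  # crude uniform-layer assumption
    return min(total_layers, int(free_vram * fraction / per_layer_gb))

# Example: 5.4 GB Q4 file, 32 layers, 8 GB free -> start around 23 layers
print(initial_gpu_layers(5.4, 32, 8.0))
```

Raise the value until you hit OOM, then back off; the interface will happily accept impossible values (see Critical Warnings below).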
Integration Capabilities
API Compatibility
- OpenAI-compatible endpoints: Available at `http://localhost:5000` when launched with the `--api` flag (a usage sketch follows this list)
- Supported integrations: Continue.dev, CodeGPT, other OpenAI-format tools
- Streaming limitations: Buffer size issues can cause mid-token cutoffs
- Reliability: Good when properly configured
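Because the endpoint speaks the OpenAI wire format, the official `openai` Python client works unmodified. A minimal streaming sketch, assuming the webui was launched with `--api` on the default port 5000; the model name is largely ignored since the server uses whatever model is currently loaded:

```python
from openai import OpenAI

# The webui ignores the API key, but the client requires a value.
client = OpenAI(base_url="http://localhost:5000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="loaded-model",  # served model is whatever is currently loaded
    messages=[{"role": "user",
               "content": "Explain GGUF quantization in one paragraph."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

If streamed output cuts off mid-token, revisit the buffer-size caveat above before blaming the model.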
Interface Modes
- Chat mode: 90% of usage, ChatGPT-like experience
- Instruct mode: Better for structured outputs and coding tasks
- Notebook mode: Long-form generation, rarely used in practice
Cost-Benefit Analysis
Financial Considerations
- Break-even point: $20+ monthly OpenAI spend (worked math after this list)
- Hardware investment: $1000-4000 for adequate GPU
- Electricity costs: 200-400W additional power consumption
- Time investment: 10-20 hours initial setup and learning
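A worked break-even sketch ties these numbers together. The electricity rate ($0.15/kWh), duty cycle (8 hours/day), and power draw (300W) are assumptions; substitute your own:

```python
def breakeven_months(hardware_cost: float, monthly_api_spend: float,
                     watts: float = 300, hours_per_day: float = 8,
                     kwh_rate: float = 0.15) -> float:
    """Months until local hardware pays for itself versus API spend."""
    monthly_electricity = watts / 1000 * hours_per_day * 30 * kwh_rate
    monthly_savings = monthly_api_spend - monthly_electricity
    if monthly_savings <= 0:
        return float("inf")  # the API stays cheaper at this usage level
    return hardware_cost / monthly_savings

# Example: $1,600 GPU vs. $100/month API spend -> roughly 18 months
print(f"{breakeven_months(1600, 100):.1f}")
```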
Versus Alternatives Comparison
Solution | Setup Difficulty | Performance | Stability | Best For |
---|---|---|---|---|
text-generation-webui | High on Windows | Configurable | Update-dependent | Power users, tinkerers |
Ollama | Low | Consistent | High | Developers, servers |
LM Studio | Medium | Good | High | Non-technical users |
OpenWebUI | Medium (Docker) | Via Ollama | Good | Teams, multi-user |
Critical Warnings
What Documentation Doesn't Mention
- Extension ecosystem fragility: Half of community extensions are abandoned
- Update breaking changes: Major releases regularly require complete reconfiguration
- Windows-specific pain: Driver conflicts and path issues are chronic
- Model loading reality: Interface allows impossible configurations that will crash
Production Deployment Considerations
- Offline operation: Complete after model download
- Conversation privacy: All data stays local
- Multi-model limitations: Simultaneous loading requires datacenter resources
- Community support: Active but fragmented across multiple platforms
Troubleshooting Decision Tree
OOM Errors
- Verify model size vs available VRAM with 20% buffer
- Close memory-intensive applications (Chrome, Discord)
- Reduce the `n_gpu_layers` parameter
- Switch to a smaller quantization (Q4 → Q2)
Performance Issues
- Check GPU utilization (should be 90%+; a polling sketch follows this list)
- Verify GPU layers allocation
- Confirm proper quantization selection
- Monitor thermal throttling
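The first and last checks can be scripted. A small polling sketch, again assuming an NVIDIA GPU with `nvidia-smi` available:

```python
import subprocess
import time

def monitor(seconds: int = 10) -> None:
    """Poll GPU utilization and temperature to spot idle GPUs or throttling."""
    for _ in range(seconds):
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=utilization.gpu,temperature.gpu",
             "--format=csv,noheader,nounits"],
            text=True,
        )
        util, temp = (v.strip() for v in out.strip().split(","))
        print(f"GPU {util}% @ {temp}C")  # sustained <90% during generation hints at misconfiguration
        time.sleep(1)

monitor()
```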
Installation Failures
- Try portable version (Windows)
- Switch to WSL (Windows users)
- Check CUDA driver compatibility
- Use one-click installers over manual setup
Resource Requirements Summary
Minimum Viable Setup
- 8GB VRAM graphics card
- 16GB system RAM
- 50GB storage for models
- Stable internet for initial downloads
Recommended Production Setup
- 24GB VRAM (RTX 4090 or similar)
- 32GB system RAM
- 500GB NVMe storage
- Linux or WSL environment
Enterprise Considerations
- Plan for 48GB+ VRAM for competitive model performance
- Budget 2-4x initial time estimates for deployment
- Maintain rollback capability for failed updates
- Monitor community Discord for breaking change announcements
Useful Links for Further Investigation
Resources That Actually Help
Link | Description |
---|---|
GitHub Repository | Main project repository. Read the README first; it covers the essentials.
Installation Wiki | Detailed installation instructions and troubleshooting tips; reading it up front prevents most common debugging sessions.
One-Click Installers | Direct downloads for the one-click installers, the recommended setup path; manual installation is slower and more error-prone for most users.
VRAM Calculator | Calculates the VRAM required for GGUF models; run it before downloading large models to confirm hardware compatibility.
HuggingFace GGUF Models | GGUF models on HuggingFace sorted by downloads; popularity is a reasonable proxy for well-supported, widely used models.
TheBloke's Legacy Models | TheBloke's large back-catalog of quantized models, no longer updated but still widely used.
bartowski's Models | Currently the primary source for new, up-to-date GGUF quantizations of the latest models.
Microsoft's Phi Models | Microsoft's Phi models, including Phi-3-mini-4k-instruct, notably strong for their small size and resource requirements.
OpenAI API Docs | How to connect existing tools and applications through the OpenAI-compatible API.
Continue.dev Integration | Instructions for wiring text-generation-webui into Continue.dev for local LLM coding assistance inside VS Code.
SillyTavern GitHub | Repository for SillyTavern, a popular frontend for character-based chat with local LLMs.
LocalLLaMA Community | Community hub of projects, tools, and shared knowledge for local LLM development.
GitHub Issues | Official issue tracker; search existing issues before filing a new one, since solutions often already exist.
PyImageSearch Tutorial | Detailed walkthrough of the webui covering installation, features, and LoRA fine-tuning, with screenshots.
Extension Repository | Official repository of community-contributed extensions for the web UI.
Built-in Extensions | Core extensions shipped with the main repository, available out of the box.
Extension Wiki | How to install, configure, and use extensions, with best practices.
Community Discord | Official Discord server for real-time help with extensions, setup, and troubleshooting.