VRAM Is Everything - Don't Make My Expensive Mistakes

Local LLM Setup

Look, I've tried running LLMs on everything from a GTX 1060 to an RTX 4090. Here's what actually matters and what's just marketing bullshit.

VRAM: The One Thing That Actually Matters

VRAM is everything. Run out of VRAM and your model either won't load or will crawl slower than a dying browser tab. I learned this the hard way trying to run Llama 70B on my RTX 3080's 12GB - it just laughed and fell back to system RAM at 0.5 tokens per second.

Real-world VRAM needs I've actually tested:

  • 7B models: Need 4-6GB minimum. My RTX 3060 with 12GB runs Llama 3.1 8B at about 45 tokens/second
  • 13B models: 8-12GB if you want decent speed. Anything less and you're swapping to system RAM
  • 34B+ models: Forget it unless you have 24GB+. My friend's RTX 4090 barely handles Code Llama 34B at 15 tokens/second

System RAM matters when VRAM runs out. I run 32GB because models spill over constantly. 16GB works if you're only doing one thing at a time, but who actually does that?
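
If you want to sanity-check whether a model will fit before downloading 40GB of weights, the rough math is parameter count times bytes per weight, plus a couple of GB for the KV cache and runtime. Here's a quick sketch - the bytes-per-weight and overhead numbers are ballpark assumptions, not specs:

```bash
# Rough VRAM estimate: params (billions) x bytes per weight + overhead
# ~0.55 bytes/weight for 4-bit quants, ~1.1 for 8-bit, 2.0 for fp16
awk 'BEGIN {
  params   = 8      # model size in billions (e.g. Llama 3.1 8B)
  bpw      = 0.55   # 4-bit quantization
  overhead = 1.5    # KV cache + runtime, grows with context length
  printf "~%.1f GB VRAM needed\n", params * bpw + overhead
}'
# Prints ~5.9 GB, which lines up with the 4-6GB figure above
```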

GPU Reality Check

NVIDIA just works. Every LLM framework supports CUDA out of the box. No setup hell, no driver conflicts, no mysterious crashes. RTX 4080 ($800) and 4090 ($1600) are the sweet spots if you can afford them - but factor in the 320W and 450W power draw. My electricity bill jumped $40/month running inference workloads.

AMD ROCm is... complicated. Spent 6 hours getting ROCm working on Ubuntu 22.04 with my RX 7900 XTX. Performance is decent once it's running - about 80% of equivalent NVIDIA speeds - but the setup process is a nightmare of conflicting documentation and kernel module hell.

Apple Silicon works better than expected. My M2 Mac Studio with 64GB unified memory runs 13B models at 25 tokens/second. Not blazing fast, but the fact that it uses system RAM as VRAM means you can actually run larger models than most gaming rigs. Plus it's dead silent and sips power.

Storage: Don't Use Hard Drives

Get an NVMe SSD or suffer. Learned this loading Llama 70B from a mechanical drive - took 8 minutes every time. Same model loads in 20 seconds from my Samsung 980 Pro. These models are massive:

  • Llama 3.1 8B: ~4.7GB
  • Llama 3.1 70B: ~40GB for 4-bit, 140GB unquantized
  • Code Llama 34B: ~20GB

Plan for 500GB minimum if you want to try different models. I filled a 1TB drive in two weeks downloading every interesting model I found on Hugging Face. "Oh, I'll just try this one 30B model" turns into a model hoarding addiction fast.

Network bandwidth: You'll download a lot of models. Each one is several GB. Get decent internet or you'll be waiting hours for each download. Ollama's resume feature works sometimes - when it doesn't, just ctrl+c and restart the damn thing.

CPU Performance: Don't Count It Out Completely

Modern CPUs aren't hopeless. While GPU inference dominates performance, AMD Zen 4 chips with AVX-512 and recent Intel Core processors can push 3-8 tokens per second on quantized 7B models. Not fast, but usable for testing and development when your GPU is busy mining Bitcoin or whatever.

ARM64 is getting interesting. Apple's M3 processors and AWS Graviton4 instances show decent performance per watt. My M3 MacBook Pro runs Llama 3.1 8B at 12 tokens/second using only system RAM - slower than dedicated GPU, but I can run inference for 8 hours on battery without the laptop turning into a space heater.

Local LLM Framework Comparison

| Feature | Ollama | llama.cpp | vLLM |
|---|---|---|---|
| Ease of Setup | Excellent (one-click install) | Moderate (compilation required) | Complex (server configuration) |
| Model Management | Built-in model registry | Manual file management | Manual model loading |
| Performance | Good (41 TPS peak) | Excellent (direct inference) | Outstanding (793 TPS peak) |
| Memory Efficiency | Good (automatic optimization) | Excellent (fine-grained control) | Good (optimized for throughput) |
| Concurrent Users | Limited (single user focused) | Moderate (via HTTP server) | Excellent (designed for scale) |
| Quantization Support | GGML/GGUF formats | Native GGML/GGUF | Multiple formats |
| API Compatibility | OpenAI-compatible REST API | HTTP server mode | OpenAI-compatible API |
| Hardware Support | CUDA, ROCm, Metal, CPU | CUDA, ROCm, Metal, CPU | CUDA, ROCm |
| Development Focus | Consumer/developer friendly | Performance optimization | Enterprise/production |
| Resource Overhead | Low (minimal system impact) | Minimal (direct execution) | Moderate (server infrastructure) |
| Multi-GPU Support | Limited | Basic | Advanced (up to 8 GPUs) |
| Best Use Case | Rapid prototyping, local development | Resource-constrained environments | Production inference servers |
| Learning Curve | Minimal | Moderate | Steep |
| Community Support | Large, growing rapidly | Established, technical | Smaller, enterprise-focused |

The Installations That Actually Work (And The Ones That Don't)

Ollama Installation Guide

I've installed these tools on more machines than I care to remember. Here's what actually works and what will waste your weekend.

Ollama: Actually Easy (Mostly)

Windows: Just download the exe from ollama.com and run it. Seriously, it's one of the few installers that actually works. Windows Defender might freak out and quarantine it - click "allow" and move on.

macOS: brew install ollama works perfectly. Don't download the .pkg unless you hate yourself - Homebrew handles everything cleanly. If you get "command not found" after installing, restart your terminal like a normal person.

Linux: curl -fsSL https://ollama.com/install.sh | sh works on Ubuntu and most Debian-based distros. On Arch, just use the AUR package. On CentOS/RHEL, you might need to manually add the binary to /usr/local/bin/ because their install script has issues with systemd service files.

First model: ollama pull llama3.1:8b downloads about 5GB. Don't panic when it seems stuck at 90% - it's verifying the download. Takes 2-10 minutes depending on your internet.
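
For reference, the whole first-run workflow is a handful of commands. These are the standard Ollama subcommands - run `ollama --help` if your version has shuffled anything around:

```bash
ollama pull llama3.1:8b       # downloads ~5GB of weights
ollama run llama3.1:8b        # interactive chat right in the terminal
ollama list                   # shows which models are eating your disk
# Ollama also serves an API on port 11434 by default:
curl http://localhost:11434/api/tags    # lists installed models over HTTP
```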

llama.cpp: Prepare for Compilation Hell

Building from source: git clone https://github.com/ggerganov/llama.cpp.git then make clean && make -j$(nproc) on Linux. On Windows, good luck with Visual Studio - I spent 3 hours fighting CUDA path issues. For CUDA support, use make LLAMA_CUBLAS=1 but make sure your CUDA toolkit version matches exactly what their GitHub README says or you'll get cryptic linking errors like "nvcc fatal: Unsupported gpu architecture 'compute_89'".
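
Putting that together, a Linux build session looks roughly like this. Treat it as a sketch - newer llama.cpp trees have moved to CMake, so if `make` complains, check the README for the current build flags:

```bash
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make clean && make -j$(nproc)                   # CPU-only build
make clean && make -j$(nproc) LLAMA_CUBLAS=1    # CUDA build (older trees; needs matching toolkit)
nvcc --version                                  # confirm your CUDA toolkit before blaming the Makefile
```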

Models: Download GGUF files from Hugging Face. Don't use GGML files - that format is dead since 2023. I keep models in ~/llm-models/ because scattered files are chaos. Popular ones: mistral-7b-instruct-v0.3.Q4_K_M.gguf runs well on 8GB cards.
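
If you'd rather script the download than click around Hugging Face, huggingface-cli can grab a single GGUF file. The repo and filename below are placeholders - copy the real ones from the model card:

```bash
pip install -U huggingface_hub
mkdir -p ~/llm-models
# Repo and filename are examples - grab the exact names from the model page
huggingface-cli download SomeOrg/Mistral-7B-Instruct-v0.3-GGUF \
  mistral-7b-instruct-v0.3.Q4_K_M.gguf --local-dir ~/llm-models
```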

Running it: ./main -m model.gguf -p "Your prompt" -n 100 -t 8 where -t is your CPU cores. Without CUDA, it's CPU-only and will take forever - like watching paint dry.

API server: ./server -m model.gguf --host 0.0.0.0 --port 8080 creates OpenAI-compatible endpoints. Works with most tools that expect GPT-style APIs.
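
Quick smoke test once the server is up. Recent builds expose both a native /completion endpoint and OpenAI-style routes - the exact paths can shift between versions, so check the server README if these 404:

```bash
# Native llama.cpp endpoint
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain VRAM in one sentence:", "n_predict": 64}'

# OpenAI-style route that most editor plugins expect
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local", "messages": [{"role": "user", "content": "hello"}]}'
```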

vLLM: Installation Nightmare Mode

vLLM installation is a pain in the ass. Expect to spend 2 hours dealing with CUDA version conflicts, PyTorch compatibility hell, and mysteriously missing dependencies.

Set up Python environment: python -m venv vllm-env && source vllm-env/bin/activate then pip install --upgrade pip setuptools wheel. Don't skip this - mixing system Python with vLLM is asking for dependency conflicts.

Install vLLM: pip install vllm for CUDA. On AMD, use pip install vllm[rocm] and pray to the ROCm gods. Half the time this fails with CUDA version mismatches. If you get "RuntimeError: CUDA error: no kernel image is available for execution", your PyTorch CUDA version doesn't match your driver CUDA version. Delete everything and start over.

Actually running it:

python -m vllm.entrypoints.openai.api_server \
  --model /path/to/model \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 2

This assumes you have 2 GPUs. vLLM will crash if it can't find enough VRAM across your GPUs.
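
Once it's up, the server speaks the OpenAI API, so a plain curl works as a smoke test - swap in whatever model path you actually loaded:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/path/to/model", "prompt": "Write a haiku about VRAM", "max_tokens": 50}'

# If the model name 404s, ask the server what it thinks it loaded
curl http://localhost:8000/v1/models
```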

Making It Actually Work

Memory stuff: Set vm.swappiness=10 on Linux so it doesn't swap to death. Monitor with nvidia-smi and htop. When VRAM fills up, performance dies and your fans sound like jet engines.
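
The actual commands, for reference - the sysctl change below only lasts until reboot, so drop it into /etc/sysctl.conf (or a sysctl.d file) if you want it permanent:

```bash
sudo sysctl vm.swappiness=10    # stop Linux from eagerly swapping model weights to disk
watch -n 1 nvidia-smi           # live VRAM usage, temps, and which process is hogging the card
htop                            # system RAM and CPU while the model spills over
```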

IDE integration: Most code editors support OpenAI APIs. Point them at http://localhost:8080 for llama.cpp or http://localhost:8000 for vLLM. Works with VSCode extensions, Cursor, and most Python notebooks. Some need fake API keys even for local models - just put "sk-local" or whatever.

Production Deployment Reality Check

Docker makes this easier. Skip the dependency hell and use containers. Ollama's official Docker image handles GPU passthrough automatically with --gpus all. For vLLM, the official containers save you hours of CUDA setup pain.
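
For reference, the Ollama container setup boils down to something like this - it matches Ollama's published Docker instructions at the time of writing, and assumes the NVIDIA Container Toolkit is already installed so `--gpus` actually works:

```bash
docker run -d --gpus=all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama ollama/ollama

# Pull and chat with a model inside the container
docker exec -it ollama ollama run llama3.1:8b
```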

Load balancing multiple instances: If you're serving real traffic, run multiple vLLM instances behind NGINX or HAProxy. Each instance can handle 10-50 concurrent requests depending on your hardware. Don't expect Ollama to scale beyond 4 simultaneous users.
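
A minimal sketch of what "multiple instances" means in practice - two vLLM servers pinned to separate GPUs on separate ports, with NGINX or HAProxy round-robining between them. The ports and GPU IDs here are arbitrary examples:

```bash
# Instance 1 on GPU 0
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
  --model /path/to/model --port 8000 &

# Instance 2 on GPU 1
CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.openai.api_server \
  --model /path/to/model --port 8001 &

# Point your NGINX/HAProxy upstream at localhost:8000 and localhost:8001
```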

Monitoring is critical. Set up Prometheus to track GPU utilization, model loading times, and request queues. When VRAM fills up, everything becomes a crawling disaster. I use Grafana dashboards to catch memory leaks before they kill performance and wake me up with alerts at 3am.

Questions You'll Actually Ask (And Real Answers)

Q: Why does it take 5 minutes to load a 7B model?

A: You're probably using a hard drive. Get an SSD or suffer. I loaded Llama 13B from a 5400rpm drive once - took 12 minutes every time I wanted to switch models. Same model loads in 30 seconds from my NVMe drive.

Q: Can I run this stuff on CPU only?

A: Technically yes, practically no. I tried running Llama 3.1 8B on CPU only (32-core Threadripper) and got about 2 tokens per second. My GPU does 45+ tokens per second. CPU is fine for testing if you have patience, but you'll want to upgrade to at least a 1660 Ti for real work.

Q: My model won't load - says "CUDA out of memory"

A: You ran out of VRAM. Check with nvidia-smi to see what's using your GPU memory. Chrome with hardware acceleration eats 2GB easily. Close everything GPU-related and try again. If it still fails, use a smaller model or a more aggressive quantization (4-bit instead of 8-bit).
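
If you want the short version of "what is eating my VRAM", these standard nvidia-smi queries print just the numbers:

```bash
nvidia-smi --query-gpu=name,memory.used,memory.total --format=csv
# Per-process breakdown - Chrome, your compositor, and stray Python processes all show up here
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```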

Q: Should I use 4-bit or 8-bit quantization?

A: 4-bit for most stuff. The quality loss is barely noticeable unless you're doing complex reasoning tasks. I use 4-bit Llama models and can't tell the difference for coding or writing. Go 8-bit if you have the VRAM to spare, but 4-bit gets you twice the model for the same memory.

Q: Why is Ollama slower than the benchmarks claim?

A: Because benchmarks lie. Ollama adds overhead for model management, the API layer, and general user-friendliness. Raw llama.cpp is 10-20% faster, but good luck managing multiple models manually. For most development work, Ollama's convenience beats the performance hit.

Q: How do I get AMD GPUs working with ROCm?

A: ROCm on Linux is doable, but prepare for pain. Install the ROCm drivers first, then fight with environment variables until something works. On Ubuntu 22.04, I had to install rocm-dev-* packages and set HIP_VISIBLE_DEVICES. Performance is about 80% of equivalent NVIDIA once it's running. On Windows, just buy NVIDIA.

Q: Can I use multiple GPUs?

A: vLLM does multi-GPU well with tensor parallelism. llama.cpp has basic multi-GPU support that works okay. Ollama barely supports multi-GPU - it's designed for single-card setups. Multi-GPU is complex to configure but can double your performance if you have matching cards.

Q: GGML vs GGUF - which one?

A: GGUF. GGML is dead, replaced in 2023. Don't download GGML files anymore - they're slower to load and missing features. All the new models use GGUF anyway.

Q: My download keeps failing at 90%

A: Check disk space first - these models are huge. Ollama's download resume works sometimes. When it doesn't, delete the partial download and start over. Don't waste hours trying to fix corrupted downloads like I did.

Q: Why is my GPU at 0% but inference is slow?

A: You're probably running on CPU. Check nvidia-smi - if you see 0% GPU usage, your CUDA installation is broken or the model fell back to CPU. Make sure you compiled with CUDA support and your drivers are current.

Q: How do I connect this to VSCode?

A: Install a code assistant extension that supports OpenAI APIs and point it at http://localhost:11434/v1 for Ollama. Most extensions work - Continue.dev, Codeium, and others. Some need an API key even for local models - just put "sk-local" or whatever.

Q: What about the new RTX 50 series GPUs?

A: The RTX 5090 with 32GB VRAM is a beast for LLM inference and can run quantized 70B models on a single card. The RTX 5080 with 16GB is the new sweet spot for most users - it handles 13B models easily and some 30B models with tight quantization. Both cards cost more than my first car, but they're fast as hell.

Q: Should I trust cloud GPU services?

A: Runpod, Vast.ai, and Lambda Labs are solid for testing without buying hardware. Expect $0.30-2.00/hour depending on the GPU. Good for evaluating models before committing to local hardware. Just don't put sensitive data through them - you never know who's logging your prompts.

Q: What's this about quantization formats?

A: Stick with GGUF Q4_K_M for most stuff - it's the best balance of quality and size. Use Q8_0 if you have VRAM to spare and want maximum quality. Q2_K is garbage - only use it if desperate. The newer Q4_0_4_8 format gives a 2-3x speedup on ARM but only works with recent llama.cpp builds.

How to Run an LLM Locally on Your Computer (Ollama Tutorial) by Gen AI Cafe

Actually Useful Ollama Tutorial

This video actually shows you what the terminal looks like instead of just telling you to "run the installer." Worth watching if you're new to this stuff.

What you'll see:
- 0:00 - Why local LLMs don't suck anymore
- 2:30 - Installing Ollama without breaking things
- 5:45 - Downloading your first model (and why it takes forever)
- 8:20 - Making it work with your code editor
- 11:15 - Making it run faster than molasses

Watch: How to Run an LLM Locally on Your Computer (Ollama Tutorial)

Why watch this: The presenter actually shows error messages and fixes them instead of pretending everything works perfectly. Shows real terminal output and troubleshoots the stuff that always breaks.


Resources That Don't Suck

Related Tools & Recommendations

  • Ollama vs LM Studio vs Jan: 6-Month Local AI Showdown (/compare/ollama/lm-studio/jan/local-ai-showdown)
  • Ollama: Run Local AI Models & Get Started Easily | No Cloud (/tool/ollama/overview)
  • Ollama Production Troubleshooting: Fix Deployment Nightmares & Performance (/tool/ollama/production-troubleshooting)
  • LM Studio Performance: Fix Crashes & Speed Up Local AI (/tool/lm-studio/performance-optimization)
  • LM Studio: Run AI Models Locally & Ditch ChatGPT Bills (/tool/lm-studio/overview)
  • GPT4All - ChatGPT That Actually Respects Your Privacy (/tool/gpt4all/overview)
  • Text-generation-webui: Run LLMs Locally Without API Bills (/tool/text-generation-webui/overview)
  • Setting Up Jan's MCP Automation That Actually Works (/tool/jan/mcp-automation-setup)
  • LM Studio MCP Integration - Connect Your Local AI to Real Tools (/tool/lm-studio/mcp-integration)
  • Llama.cpp - Run AI Models Locally Without Losing Your Mind (/tool/llama-cpp/overview)
  • Django - The Web Framework for Perfectionists with Deadlines (/tool/django/overview)
  • Django Troubleshooting Guide - Fixing Production Disasters at 3 AM (/tool/django/troubleshooting-guide)
  • LangChain Production Deployment - What Actually Breaks (/tool/langchain/production-deployment-guide)
  • LangChain + Hugging Face Production Deployment Architecture (/integration/langchain-huggingface-production-deployment/production-deployment-architecture)
  • LangChain - Python Library for Building AI Apps (/tool/langchain/overview)
  • Docker Won't Start on Windows 11? Here's How to Fix That Garbage (/troubleshoot/docker-daemon-not-running-windows-11/daemon-startup-issues)
  • Stop Docker from Killing Your Containers at Random (Exit Code 137 Is Not Your Friend) (/howto/setup-docker-development-environment/complete-development-setup)
  • Docker Desktop's Stupidly Simple Container Escape Just Owned Everyone (/news/2025-08-26/docker-cve-security)
  • Hugging Face Transformers - The ML Library That Actually Works (/tool/huggingface-transformers/overview)
  • CUDA Development Toolkit 13.0 - Still Breaking Builds Since 2007 (/tool/cuda/overview)
