VRAM Is Everything - Don't Make My Expensive Mistakes

Local LLM Setup

Look, I've tried running LLMs on everything from a GTX 1060 to an RTX 4090. Here's what actually matters and what's just marketing bullshit.

VRAM: The One Thing That Actually Matters

VRAM is everything. Run out of VRAM and your model either won't load or will crawl slower than a dying browser tab. I learned this the hard way trying to run Llama 70B on my RTX 3080's 12GB - it just laughed and fell back to system RAM at 0.5 tokens per second.

Real-world VRAM needs I've actually tested:

  • 7B models: Need 4-6GB minimum. My RTX 3060 with 12GB runs Llama 3.1 8B at about 45 tokens/second
  • 13B models: 8-12GB if you want decent speed. Anything less and you're swapping to system RAM
  • 34B+ models: Forget it unless you have 24GB+. My friend's RTX 4090 barely handles Code Llama 34B at 15 tokens/second

System RAM matters when VRAM runs out. I run 32GB because models spill over constantly. 16GB works if you're only doing one thing at a time, but who actually does that?
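
If you want to sanity-check whether a model will fit before downloading 40GB of weights, the rough math is parameter count times bytes per weight, plus a couple of GB for the KV cache and runtime. Here's a quick sketch - the bytes-per-weight and overhead numbers are ballpark assumptions, not specs:

```bash
# Rough VRAM estimate: params (billions) x bytes per weight + overhead
# ~0.55 bytes/weight for 4-bit quants, ~1.1 for 8-bit, 2.0 for fp16
awk 'BEGIN {
  params   = 8      # model size in billions (e.g. Llama 3.1 8B)
  bpw      = 0.55   # 4-bit quantization
  overhead = 1.5    # KV cache + runtime, grows with context length
  printf "~%.1f GB VRAM needed\n", params * bpw + overhead
}'
# Prints ~5.9 GB, which lines up with the 4-6GB figure above
```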

GPU Reality Check

NVIDIA just works. Every LLM framework supports CUDA out of the box. No setup hell, no driver conflicts, no mysterious crashes. RTX 4080 ($800) and 4090 ($1600) are the sweet spots if you can afford them - but factor in the 320W and 450W power draw. My electricity bill jumped $40/month running inference workloads.

AMD ROCm is... complicated. Spent 6 hours getting ROCm working on Ubuntu 22.04 with my RX 7900 XTX. Performance is decent once it's running - about 80% of equivalent NVIDIA speeds - but the setup process is a nightmare of conflicting documentation and kernel module hell.

Apple Silicon works better than expected. My M2 Mac Studio with 64GB unified memory runs 13B models at 25 tokens/second. Not blazing fast, but the fact that it uses system RAM as VRAM means you can actually run larger models than most gaming rigs. Plus it's dead silent and sips power.

Storage: Don't Use Hard Drives

Get an NVMe SSD or suffer. Learned this loading Llama 70B from a mechanical drive - took 8 minutes every time. Same model loads in 20 seconds from my Samsung 980 Pro. These models are massive:

  • Llama 3.1 8B: ~4.7GB
  • Llama 3.1 70B: ~40GB for 4-bit, 140GB unquantized
  • Code Llama 34B: ~20GB

Plan for 500GB minimum if you want to try different models. I filled a 1TB drive in two weeks downloading every interesting model I found on Hugging Face. "Oh, I'll just try this one 30B model" turns into a model hoarding addiction fast.

Network bandwidth: You'll download a lot of models. Each one is several GB. Get decent internet or you'll be waiting hours for each download. Ollama's resume feature works sometimes - when it doesn't, just ctrl+c and restart the damn thing.

CPU Performance: Don't Count It Out Completely

Modern CPUs aren't hopeless. While GPU inference dominates performance, AMD Zen 4 chips with AVX-512 and recent Intel Core processors can push 3-8 tokens per second on quantized 7B models. Not fast, but usable for testing and development when your GPU is busy mining Bitcoin or whatever.

ARM64 is getting interesting. Apple's M3 processors and AWS Graviton4 instances show decent performance per watt. My M3 MacBook Pro runs Llama 3.1 8B at 12 tokens/second using only system RAM - slower than dedicated GPU, but I can run inference for 8 hours on battery without the laptop turning into a space heater.

Local LLM Framework Comparison

| Feature | Ollama | llama.cpp | vLLM |
|---|---|---|---|
| Ease of Setup | Excellent (one-click install) | Moderate (compilation required) | Complex (server configuration) |
| Model Management | Built-in model registry | Manual file management | Manual model loading |
| Performance | Good (41 TPS peak) | Excellent (direct inference) | Outstanding (793 TPS peak) |
| Memory Efficiency | Good (automatic optimization) | Excellent (fine-grained control) | Good (optimized for throughput) |
| Concurrent Users | Limited (single user focused) | Moderate (via HTTP server) | Excellent (designed for scale) |
| Quantization Support | GGML/GGUF formats | Native GGML/GGUF | Multiple formats |
| API Compatibility | OpenAI-compatible REST API | HTTP server mode | OpenAI-compatible API |
| Hardware Support | CUDA, ROCm, Metal, CPU | CUDA, ROCm, Metal, CPU | CUDA, ROCm |
| Development Focus | Consumer/developer friendly | Performance optimization | Enterprise/production |
| Resource Overhead | Low (minimal system impact) | Minimal (direct execution) | Moderate (server infrastructure) |
| Multi-GPU Support | Limited | Basic | Advanced (up to 8 GPUs) |
| Best Use Case | Rapid prototyping, local development | Resource-constrained environments | Production inference servers |
| Learning Curve | Minimal | Moderate | Steep |
| Community Support | Large, growing rapidly | Established, technical | Smaller, enterprise-focused |

The Installations That Actually Work (And The Ones That Don't)

Ollama Installation Guide

I've installed these tools on more machines than I care to remember. Here's what actually works and what will waste your weekend.

Ollama: Actually Easy (Mostly)

Windows: Just download the exe from ollama.com and run it. Seriously, it's one of the few installers that actually works. Windows Defender might freak out and quarantine it - click "allow" and move on.

macOS: brew install ollama works perfectly. Don't download the .pkg unless you hate yourself - Homebrew handles everything cleanly. If you get "command not found" after installing, restart your terminal like a normal person.

Linux: curl -fsSL https://ollama.com/install.sh | sh works on Ubuntu and most Debian-based distros. On Arch, just use the AUR package. On CentOS/RHEL, you might need to manually add the binary to /usr/local/bin/ because their install script has issues with systemd service files.

First model: ollama pull llama3.1:8b downloads about 5GB. Don't panic when it seems stuck at 90% - it's verifying the download. Takes 2-10 minutes depending on your internet.
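
For reference, the whole first-run workflow is a handful of commands. These are the standard Ollama subcommands - run `ollama --help` if your version has shuffled anything around:

```bash
ollama pull llama3.1:8b       # downloads ~5GB of weights
ollama run llama3.1:8b        # interactive chat right in the terminal
ollama list                   # shows which models are eating your disk
# Ollama also serves an API on port 11434 by default:
curl http://localhost:11434/api/tags    # lists installed models over HTTP
```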

llama.cpp: Prepare for Compilation Hell

Building from source: git clone https://github.com/ggerganov/llama.cpp.git then make clean && make -j$(nproc) on Linux. On Windows, good luck with Visual Studio - I spent 3 hours fighting CUDA path issues. For CUDA support, use make LLAMA_CUBLAS=1 but make sure your CUDA toolkit version matches exactly what their GitHub README says or you'll get cryptic linking errors like "nvcc fatal: Unsupported gpu architecture 'compute_89'".
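
Putting that together, a Linux build session looks roughly like this. Treat it as a sketch - newer llama.cpp trees have moved to CMake, so if `make` complains, check the README for the current build flags:

```bash
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make clean && make -j$(nproc)                   # CPU-only build
make clean && make -j$(nproc) LLAMA_CUBLAS=1    # CUDA build (older trees; needs matching toolkit)
nvcc --version                                  # confirm your CUDA toolkit before blaming the Makefile
```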

Models: Download GGUF files from Hugging Face. Don't use GGML files - that format is dead since 2023. I keep models in ~/llm-models/ because scattered files are chaos. Popular ones: mistral-7b-instruct-v0.3.Q4_K_M.gguf runs well on 8GB cards.
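
If you'd rather script the download than click around Hugging Face, huggingface-cli can grab a single GGUF file. The repo and filename below are placeholders - copy the real ones from the model card:

```bash
pip install -U huggingface_hub
mkdir -p ~/llm-models
# Repo and filename are examples - grab the exact names from the model page
huggingface-cli download SomeOrg/Mistral-7B-Instruct-v0.3-GGUF \
  mistral-7b-instruct-v0.3.Q4_K_M.gguf --local-dir ~/llm-models
```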

Running it: ./main -m model.gguf -p "Your prompt" -n 100 -t 8 where -t is your CPU cores. Without CUDA, it's CPU-only and will take forever - like watching paint dry.

API server: ./server -m model.gguf --host 0.0.0.0 --port 8080 creates OpenAI-compatible endpoints. Works with most tools that expect GPT-style APIs.
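
Quick smoke test once the server is up. Recent builds expose both a native /completion endpoint and OpenAI-style routes - the exact paths can shift between versions, so check the server README if these 404:

```bash
# Native llama.cpp endpoint
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain VRAM in one sentence:", "n_predict": 64}'

# OpenAI-style route that most editor plugins expect
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local", "messages": [{"role": "user", "content": "hello"}]}'
```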

vLLM: Installation Nightmare Mode

vLLM installation is a pain in the ass. Expect to spend 2 hours dealing with CUDA version conflicts, PyTorch compatibility hell, and mysteriously missing dependencies.

Set up Python environment: python -m venv vllm-env && source vllm-env/bin/activate then pip install --upgrade pip setuptools wheel. Don't skip this - mixing system Python with vLLM is asking for dependency conflicts.

Install vLLM: pip install vllm for CUDA. On AMD, use pip install vllm[rocm] and pray to the ROCm gods. Half the time this fails with CUDA version mismatches. If you get "RuntimeError: CUDA error: no kernel image is available for execution", your PyTorch CUDA version doesn't match your driver CUDA version. Delete everything and start over.

Actually running it:

python -m vllm.entrypoints.openai.api_server \
  --model /path/to/model \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 2

This assumes you have 2 GPUs. vLLM will crash if it can't find enough VRAM across your GPUs.
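
Once it's up, the server speaks the OpenAI API, so a plain curl works as a smoke test - swap in whatever model path you actually loaded:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/path/to/model", "prompt": "Write a haiku about VRAM", "max_tokens": 50}'

# If the model name 404s, ask the server what it thinks it loaded
curl http://localhost:8000/v1/models
```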

Making It Actually Work

Memory stuff: Set vm.swappiness=10 on Linux so it doesn't swap to death. Monitor with nvidia-smi and htop. When VRAM fills up, performance dies and your fans sound like jet engines.
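
The actual commands, for reference - the sysctl change below only lasts until reboot, so drop it into /etc/sysctl.conf (or a sysctl.d file) if you want it permanent:

```bash
sudo sysctl vm.swappiness=10    # stop Linux from eagerly swapping model weights to disk
watch -n 1 nvidia-smi           # live VRAM usage, temps, and which process is hogging the card
htop                            # system RAM and CPU while the model spills over
```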

IDE integration: Most code editors support OpenAI APIs. Point them at http://localhost:8080 for llama.cpp or http://localhost:8000 for vLLM. Works with VSCode extensions, Cursor, and most Python notebooks. Some need fake API keys even for local models - just put "sk-local" or whatever.

Production Deployment Reality Check

Docker makes this easier. Skip the dependency hell and use containers. Ollama's official Docker image handles GPU passthrough automatically with --gpus all. For vLLM, the official containers save you hours of CUDA setup pain.
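
For reference, the Ollama container setup boils down to something like this - it matches Ollama's published Docker instructions at the time of writing, and assumes the NVIDIA Container Toolkit is already installed so `--gpus` actually works:

```bash
docker run -d --gpus=all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama ollama/ollama

# Pull and chat with a model inside the container
docker exec -it ollama ollama run llama3.1:8b
```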

Load balancing multiple instances: If you're serving real traffic, run multiple vLLM instances behind NGINX or HAProxy. Each instance can handle 10-50 concurrent requests depending on your hardware. Don't expect Ollama to scale beyond 4 simultaneous users.
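
A minimal sketch of what "multiple instances" means in practice - two vLLM servers pinned to separate GPUs on separate ports, with NGINX or HAProxy round-robining between them. The ports and GPU IDs here are arbitrary examples:

```bash
# Instance 1 on GPU 0
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
  --model /path/to/model --port 8000 &

# Instance 2 on GPU 1
CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.openai.api_server \
  --model /path/to/model --port 8001 &

# Point your NGINX/HAProxy upstream at localhost:8000 and localhost:8001
```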

Monitoring is critical. Set up Prometheus to track GPU utilization, model loading times, and request queues. When VRAM fills up, everything becomes a crawling disaster. I use Grafana dashboards to catch memory leaks before they kill performance and wake me up with alerts at 3am.

Questions You'll Actually Ask (And Real Answers)

Q: Why does it take 5 minutes to load a 7B model?

A: You're probably using a hard drive. Get an SSD or suffer. I loaded Llama 13B from a 5400rpm drive once - took 12 minutes every time I wanted to switch models. Same model loads in 30 seconds from my NVMe drive.

Q: Can I run this stuff on CPU only?

A: Technically yes, practically no. I tried running Llama 3.1 8B on CPU only (32-core Threadripper) and got about 2 tokens per second. My GPU does 45+ tokens per second. CPU is fine for testing if you have patience, but you'll want to upgrade to at least a 1660 Ti for real work.

Q: My model won't load - says "CUDA out of memory"

A: You ran out of VRAM. Check with nvidia-smi to see what's using your GPU memory. Chrome with hardware acceleration eats 2GB easily. Close everything GPU-related and try again. If it still fails, use a smaller model or a more aggressive quantization (4-bit instead of 8-bit).
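
If you want the short version of "what is eating my VRAM", these standard nvidia-smi queries print just the numbers:

```bash
nvidia-smi --query-gpu=name,memory.used,memory.total --format=csv
# Per-process breakdown - Chrome, your compositor, and stray Python processes all show up here
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```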

Q: Should I use 4-bit or 8-bit quantization?

A: 4-bit for most stuff. The quality loss is barely noticeable unless you're doing complex reasoning tasks. I use 4-bit Llama models and can't tell the difference for coding or writing. Go 8-bit if you have the VRAM to spare, but 4-bit gets you twice the model for the same memory.

Q: Why is Ollama slower than the benchmarks claim?

A: Because benchmarks lie. Ollama adds overhead for model management, the API layer, and general user-friendliness. Raw llama.cpp is 10-20% faster, but good luck managing multiple models manually. For most development work, Ollama's convenience beats the performance hit.

Q: How do I get AMD GPUs working with ROCm?

A: ROCm on Linux is doable, but prepare for pain. Install the ROCm drivers first, then fight with environment variables until something works. On Ubuntu 22.04, I had to install rocm-dev-* packages and set HIP_VISIBLE_DEVICES. Performance is about 80% of equivalent NVIDIA once it's running. On Windows, just buy NVIDIA.

Q: Can I use multiple GPUs?

A: vLLM does multi-GPU well with tensor parallelism. llama.cpp has basic multi-GPU support that works okay. Ollama barely supports multi-GPU - it's designed for single-card setups. Multi-GPU is complex to configure but can double your performance if you have matching cards.

Q: GGML vs GGUF - which one?

A: GGUF. GGML is dead, replaced in 2023. Don't download GGML files anymore - they're slower to load and missing features. All the new models use GGUF anyway.

Q: My download keeps failing at 90%

A: Check disk space first - these models are huge. Ollama's download resume works sometimes. When it doesn't, delete the partial download and start over. Don't waste hours trying to fix corrupted downloads like I did.

Q: Why is my GPU at 0% but inference is slow?

A: You're probably running on CPU. Check nvidia-smi - if you see 0% GPU usage, your CUDA installation is broken or the model fell back to CPU. Make sure you compiled with CUDA support and your drivers are current.

Q: How do I connect this to VSCode?

A: Install a code assistant extension that supports OpenAI APIs and point it at http://localhost:11434/v1 for Ollama. Most extensions work - Continue.dev, Codeium, and others. Some need an API key even for local models - just put "sk-local" or whatever.

Q: What about the new RTX 50 series GPUs?

A: The RTX 5090 with 32GB VRAM is a beast for LLM inference and can run quantized 70B models on a single card. The RTX 5080 with 16GB is the new sweet spot for most users - it handles 13B models easily and some 30B models with tight quantization. Both cards cost more than my first car, but they're fast as hell.

Q: Should I trust cloud GPU services?

A: Runpod, Vast.ai, and Lambda Labs are solid for testing without buying hardware. Expect $0.30-2.00/hour depending on the GPU. Good for evaluating models before committing to local hardware. Just don't put sensitive data through them - you never know who's logging your prompts.

Q: What's this about quantization formats?

A: Stick with GGUF Q4_K_M for most stuff - it's the best balance of quality and size. Use Q8_0 if you have VRAM to spare and want maximum quality. Q2_K is garbage - only use it if desperate. The newer Q4_0_4_8 format gives a 2-3x speedup on ARM but only works with recent llama.cpp builds.

How to Run an LLM Locally on Your Computer (Ollama Tutorial) by Gen AI Cafe

Actually Useful Ollama Tutorial

This video actually shows you what the terminal looks like instead of just telling you to "run the installer." Worth watching if you're new to this stuff.

What you'll see:
- 0:00 - Why local LLMs don't suck anymore
- 2:30 - Installing Ollama without breaking things
- 5:45 - Downloading your first model (and why it takes forever)
- 8:20 - Making it work with your code editor
- 11:15 - Making it run faster than molasses

Watch: How to Run an LLM Locally on Your Computer (Ollama Tutorial)

Why watch this: The presenter actually shows error messages and fixes them instead of pretending everything works perfectly. Shows real terminal output and troubleshoots the stuff that always breaks.


Resources That Don't Suck

Related Tools & Recommendations

  • Ollama vs LM Studio vs Jan: 6-Month Local AI Showdown (/compare/ollama/lm-studio/jan/local-ai-showdown)
  • Ollama: Run Local AI Models & Get Started Easily | No Cloud (/tool/ollama/overview)
  • Ollama Production Troubleshooting: Fix Deployment Nightmares & Performance (/tool/ollama/production-troubleshooting)
  • LM Studio Performance: Fix Crashes & Speed Up Local AI (/tool/lm-studio/performance-optimization)
  • LM Studio: Run AI Models Locally & Ditch ChatGPT Bills (/tool/lm-studio/overview)
  • GPT4All - ChatGPT That Actually Respects Your Privacy (/tool/gpt4all/overview)
  • Text-generation-webui: Run LLMs Locally Without API Bills (/tool/text-generation-webui/overview)
  • Setting Up Jan's MCP Automation That Actually Works (/tool/jan/mcp-automation-setup)
  • LM Studio MCP Integration - Connect Your Local AI to Real Tools (/tool/lm-studio/mcp-integration)
  • Llama.cpp - Run AI Models Locally Without Losing Your Mind (/tool/llama-cpp/overview)
  • Django - The Web Framework for Perfectionists with Deadlines (/tool/django/overview)
  • Django Troubleshooting Guide - Fixing Production Disasters at 3 AM (/tool/django/troubleshooting-guide)
  • LangChain Production Deployment - What Actually Breaks (/tool/langchain/production-deployment-guide)
  • LangChain + Hugging Face Production Deployment Architecture (/integration/langchain-huggingface-production-deployment/production-deployment-architecture)
  • LangChain - Python Library for Building AI Apps (/tool/langchain/overview)
  • Docker Won't Start on Windows 11? Here's How to Fix That Garbage (/troubleshoot/docker-daemon-not-running-windows-11/daemon-startup-issues)
  • Stop Docker from Killing Your Containers at Random (Exit Code 137 Is Not Your Friend) (/howto/setup-docker-development-environment/complete-development-setup)
  • Docker Desktop's Stupidly Simple Container Escape Just Owned Everyone (/news/2025-08-26/docker-cve-security)
  • Hugging Face Transformers - The ML Library That Actually Works (/tool/huggingface-transformers/overview)
  • CUDA Development Toolkit 13.0 - Still Breaking Builds Since 2007 (/tool/cuda/overview)
