Look, I've tried running LLMs on everything from a GTX 1060 to an RTX 4090. Here's what actually matters and what's just marketing bullshit.
VRAM: The One Thing That Actually Matters
VRAM is everything. Run out of VRAM and your model either won't load or will crawl slower than a dying browser tab. I learned this the hard way trying to run Llama 70B on my RTX 3080's 12GB - it just laughed and fell back to system RAM at 0.5 tokens per second.
Real-world VRAM needs I've actually tested - there's a back-of-envelope formula after this list if you want to estimate your own:
- 7B models: Need 4-6GB minimum. My RTX 3060 with 12GB runs Llama 3.1 8B at about 45 tokens/second
- 13B models: 8-12GB if you want decent speed. Anything less and you're swapping to system RAM
- 34B+ models: Forget it unless you have 24GB+. My friend's RTX 4090 barely handles Code Llama 34B at 15 tokens/second
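If you want to sanity-check a model before downloading it, the back-of-envelope math is just parameter count times bytes per parameter, plus headroom for the KV cache and activations. Here's a rough sketch in Python - the 20% overhead figure is my own assumption, and real usage shifts with context length and whichever runtime you use:

```python
# Ballpark model size and VRAM needs. The 20% runtime overhead for the
# KV cache and activations is an assumption - actual usage depends on
# context length, batch size, and the inference runtime.
BYTES_PER_PARAM = {"fp16": 2.0, "8bit": 1.0, "4bit": 0.5}

def estimate_gb(params_billion: float, quant: str = "4bit",
                overhead: float = 0.20) -> tuple[float, float]:
    """Return (weights on disk, VRAM with headroom) in GB."""
    weights = params_billion * BYTES_PER_PARAM[quant]
    return weights, weights * (1 + overhead)

if __name__ == "__main__":
    for name, size in [("8B", 8), ("13B", 13), ("34B", 34), ("70B", 70)]:
        disk, vram = estimate_gb(size)
        print(f"{name} at 4-bit: ~{disk:.0f} GB on disk, ~{vram:.0f} GB of VRAM")
```

The numbers line up with what I measured: an 8B model at 4-bit lands around 5GB, and a 4-bit 70B wants roughly 42GB, which is why it laughs at anything short of multiple cards.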
System RAM matters when VRAM runs out. I run 32GB because models spill over constantly. 16GB works if you're only doing one thing at a time, but who actually does that?
GPU Reality Check
NVIDIA just works. Every LLM framework supports CUDA out of the box. No setup hell, no driver conflicts, no mysterious crashes. RTX 4080 ($800) and 4090 ($1600) are the sweet spots if you can afford them - but factor in the 320W and 450W power draw. My electricity bill jumped $40/month running inference workloads.
AMD ROCm is... complicated. Spent 6 hours getting ROCm working on Ubuntu 22.04 with my RX 7900 XTX. Performance is decent once it's running - about 80% of equivalent NVIDIA speeds - but the setup process is a nightmare of conflicting documentation and kernel module hell.
Apple Silicon works better than expected. My M2 Mac Studio with 64GB unified memory runs 13B models at 25 tokens/second. Not blazing fast, but the fact that it uses system RAM as VRAM means you can actually run larger models than most gaming rigs. Plus it's dead silent and sips power.
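Whichever hardware you end up with, the first thing I do on a new box is ask PyTorch what it can actually see. A minimal check, assuming you installed the matching build (the CUDA wheel for NVIDIA, the ROCm wheel for AMD, the stock macOS build for Apple Silicon) - note that ROCm builds piggyback on the torch.cuda API, which is why one branch covers both:

```python
import torch

def describe_backend() -> str:
    """Report which accelerator this PyTorch build can actually use."""
    if torch.cuda.is_available():
        # ROCm builds of PyTorch reuse the torch.cuda API; torch.version.hip
        # is set on those builds and None on real CUDA builds.
        props = torch.cuda.get_device_properties(0)
        flavor = "ROCm" if torch.version.hip else "CUDA"
        return f"{flavor}: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM"
    if torch.backends.mps.is_available():
        return "Apple MPS: unified memory shared with the system"
    return "CPU only - expect single-digit tokens/second"

print(describe_backend())
```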
Storage: Don't Use Hard Drives
Get an NVMe SSD or suffer. Learned this loading Llama 70B from a mechanical drive - took 8 minutes every time. Same model loads in 20 seconds from my Samsung 980 Pro. These models are massive:
- Llama 3.1 8B: ~4.7GB
- Llama 3.1 70B: ~40GB for 4-bit, 140GB unquantized
- Code Llama 34B: ~20GB
Plan for 500GB minimum if you want to try different models. I filled a 1TB drive in two weeks downloading every interesting model I found on Hugging Face. "Oh, I'll just try this one 30B model" turns into a model hoarding addiction fast.
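If you want to know where that 1TB went, a few lines of Python will total it up. The path below is Ollama's default store on my Linux and macOS machines - an assumption, so point it at ~/.cache/huggingface or wherever your GGUF files actually live:

```python
from pathlib import Path

# Total up the disk space your downloaded models are eating.
# MODEL_DIR is an assumed default (Ollama's store) - change it to match
# wherever you actually keep models.
MODEL_DIR = Path.home() / ".ollama" / "models"

def hoard_size_gb(root: Path) -> float:
    return sum(f.stat().st_size for f in root.rglob("*") if f.is_file()) / 1024**3

if MODEL_DIR.exists():
    print(f"{MODEL_DIR}: {hoard_size_gb(MODEL_DIR):.1f} GB of models")
else:
    print(f"{MODEL_DIR} not found - point MODEL_DIR at your model folder")
```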
Network bandwidth: You'll download a lot of models. Each one is several GB. Get decent internet or you'll be waiting hours for each download. Ollama's resume feature works sometimes - when it doesn't, just ctrl+c and restart the damn thing.
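The math on download times is worth doing before you queue up a 70B model. A quick sketch - the sizes come from the storage list above, the link speeds are just examples:

```python
# Rough download times: GB -> gigabits -> seconds -> minutes.
# Model sizes match the storage list above; link speeds are examples.
MODELS_GB = {"Llama 3.1 8B": 4.7, "Code Llama 34B": 20, "Llama 3.1 70B (4-bit)": 40}

def download_minutes(size_gb: float, mbps: float) -> float:
    return size_gb * 8 * 1000 / mbps / 60

for name, gb in MODELS_GB.items():
    print(f"{name}: {download_minutes(gb, 100):.0f} min at 100 Mbps, "
          f"{download_minutes(gb, 1000):.1f} min at 1 Gbps")
```

At 100 Mbps that 4-bit 70B is almost an hour of waiting, which is exactly when a dropped connection hurts the most.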
CPU Performance: Don't Count It Out Completely
Modern CPUs aren't hopeless. While GPU inference dominates on raw speed, an AMD Zen 4 chip (which supports AVX-512) or a recent Intel 13th gen part (stuck with AVX2 on the consumer dies) can push 3-8 tokens per second on quantized 7B models. Not fast, but usable for testing and development when your GPU is busy mining Bitcoin or whatever.
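For a feel of what CPU-only inference looks like in practice, here's a sketch using llama-cpp-python. The GGUF path is a placeholder for whatever quantized model you've actually downloaded, and n_threads should match your physical core count - n_gpu_layers=0 keeps the whole thing on the CPU:

```python
import time
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="models/llama-3.1-8b-instruct.Q4_K_M.gguf",  # placeholder path
    n_ctx=2048,       # context window
    n_threads=8,      # match your physical core count
    n_gpu_layers=0,   # CPU only
)

start = time.time()
out = llm("Explain VRAM in one sentence.", max_tokens=128)
elapsed = time.time() - start

tokens = out["usage"]["completion_tokens"]
print(out["choices"][0]["text"])
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tokens/sec")
```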
ARM64 is getting interesting. Apple's M3 processors and AWS Graviton4 instances show decent performance per watt. My M3 MacBook Pro runs Llama 3.1 8B at 12 tokens/second using only system RAM - slower than a dedicated GPU, but I can run inference for 8 hours on battery without the laptop turning into a space heater.