LM Studio Performance Optimization - AI Technical Reference
Critical Configuration Requirements
Memory Management
The 32GB Rule: Total system RAM should be 4x the model file size for reliable operation.
Memory Overhead Calculation:
- Model file size: Base requirement
- Loading overhead: +2-3GB during initialization
- Context buffer: +1-2GB (varies by context length)
- System overhead: +2-4GB (OS, UI, other programs)
- GPU memory copying: +1-2GB during layer offloading
- Total overhead: roughly 2.5-3x the model file size on top of the file itself, which is where the 4x rule comes from
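This arithmetic folds into a quick pre-flight check before loading a model. A minimal Python sketch using the overhead figures above (the function and its constants are illustrative, not an LM Studio API):

```python
def ram_plan_gb(model_file_gb: float, long_context: bool = False) -> dict:
    """Rough RAM planning numbers for a GGUF model, per the figures above."""
    loading = 3.0                            # +2-3GB during initialization (upper bound)
    context = 2.0 if long_context else 1.0   # +1-2GB context buffer
    system = 4.0                             # +2-4GB OS, UI, other programs
    gpu_copy = 2.0                           # +1-2GB during GPU layer offloading
    peak_usage = model_file_gb + loading + context + system + gpu_copy
    return {
        "estimated_peak_gb": peak_usage,          # what loading actually touches
        "minimum_ram_gb": model_file_gb * 4,      # the 4x rule
        "recommended_ram_gb": model_file_gb * 8,  # roughly matches the table below
    }

# 13B-class model, ~8GB file -> peak ~18GB, 32GB minimum, 64GB recommended
print(ram_plan_gb(8.0))
```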
Production-Ready Examples:
| Model Size | File Size | Minimum RAM | Recommended RAM |
|---|---|---|---|
| 7B model | 4GB | 16GB | 32GB |
| 14B model | 8GB | 32GB | 64GB |
| 30B model | 18GB | 64GB | 128GB |
Context Length Impact on Memory
Memory Usage Multipliers:
- 2048 context: Baseline memory usage
- 4096 context: +50% memory usage
- 8192 context: +200-300% memory usage
Use Case Optimization:
- Quick questions: 2048 context (saves 1-2GB RAM)
- Coding sessions: 4096 context (balance of capability/memory)
- Document analysis: 8192+ context (requires +3-4GB RAM)
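The multipliers translate directly into a quick estimate. A sketch assuming you first measure your model's 2048-context baseline (the 1.5GB example baseline is illustrative, not a fixed value):

```python
# Multipliers from the list above; 8192 uses the midpoint of +200-300%.
CONTEXT_MULTIPLIER = {2048: 1.0, 4096: 1.5, 8192: 3.5}

def context_buffer_gb(baseline_gb: float, context_len: int) -> float:
    """Scale a measured 2048-context buffer to other context lengths."""
    return baseline_gb * CONTEXT_MULTIPLIER[context_len]

# Example: a 13B model measured at ~1.5GB of context buffer at 2048
for ctx in (2048, 4096, 8192):
    print(f"{ctx:>5} context: ~{context_buffer_gb(1.5, ctx):.1f} GB")
```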
GPU Configuration Specifications
VRAM Allocation Rules
Safe VRAM Usage: Set manual GPU memory limit to 85% of total VRAM
- RTX 4070 (12GB): Set 10GB limit
- RTX 4080 (16GB): Set 13GB limit
- RTX 4090 (24GB): Set 20GB limit
Layer Offloading Configurations:
| GPU Model | VRAM | 7B Models | 13B Models | 20B+ Models |
|---|---|---|---|---|
| RTX 4070 | 12GB | 32/32 layers | 28-29 layers | CPU only |
| RTX 4080 | 16GB | Full GPU | Full GPU | 35-40 layers |
| RTX 4090 | 24GB | Full GPU | Full GPU | Full GPU |
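Both rules, the 85% cap and the per-model layer counts, fall out of one calculation. A hedged sketch: the 4.5GB fixed reservation (CUDA context, KV cache, compute buffers, display output) is an assumption tuned to roughly reproduce the table above, so verify against GPU-Z on your own card:

```python
import math

def plan_offload(vram_gb: float, model_file_gb: float, total_layers: int) -> dict:
    """Estimate GPU layer offloading under the 85% VRAM rule.

    Assumes per-layer VRAM cost scales linearly with model file size and
    reserves ~4.5GB for CUDA context, KV cache, and compute buffers.
    """
    vram_limit = vram_gb * 0.85          # the 85% safety cap
    budget = vram_limit - 4.5            # fixed reservation (assumption)
    per_layer = model_file_gb / total_layers
    layers = min(total_layers, math.floor(budget / per_layer))
    return {"vram_limit_gb": round(vram_limit, 1), "gpu_layers": max(0, layers)}

# RTX 4070 (12GB), 13B model (~8GB file, 40 layers) -> ~28 layers, as in the table
print(plan_offload(12, 8.0, 40))
# RTX 4070 (12GB), 7B model (~4GB file, 32 layers) -> all 32 layers
print(plan_offload(12, 4.0, 32))
```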
Performance Benchmarks
Real-World Token Generation Rates (13B models):
| GPU Model | VRAM | Performance | Max Model Size | Power Draw |
|---|---|---|---|---|
| RTX 4090 | 24GB | 22-25 tok/s | 30B+ (full) | 350-400W |
| RTX 4080 Super | 16GB | 18-21 tok/s | 20B (full) | 280-320W |
| RTX 4070 Ti Super | 16GB | 16-19 tok/s | 20B (full) | 220-250W |
| RTX 4070 | 12GB | 13-16 tok/s | 13B (full) | 180-200W |
| RTX 3070 | 8GB | 11-13 tok/s | 7B (full) | 200-240W |
| RTX 3060 12GB | 12GB | 8-10 tok/s | 13B (full) | 170W |
Critical Failure Modes
Exit Code 137 (OOMKilled)
Cause: System ran out of memory, OS killed process
Prevention: Follow 32GB rule, monitor RAM usage during loading
Emergency Fix: Increase Windows pagefile to 32GB (performance will be terrible)
CUDA Out of Memory
Cause: VRAM allocation exceeds available memory, or fragmentation from repeated model loads
Prevention: Leave 15% VRAM buffer, restart LM Studio between model changes
Fix: Reduce GPU layer offloading by 5-10 layers
Thermal Throttling
Symptom: Performance drops from 15 tok/s to 8 tok/s after 30 minutes
Cause: GPU temperature hits 83°C thermal limit
Prevention:
- Set aggressive fan curves starting at 60°C
- Monitor temperatures with GPU-Z
- Reduce batch size from 512 to 256 or 128
- Cap GPU utilization to 90%
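A temperature watcher catches throttling before the tok/s drop does. A minimal sketch using the nvidia-ml-py bindings (`pip install nvidia-ml-py`); the 80°C warning threshold follows the monitoring targets later in this doc:

```python
import time
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

try:
    while True:
        temp = pynvml.nvmlDeviceGetTemperature(gpu, pynvml.NVML_TEMPERATURE_GPU)
        clock = pynvml.nvmlDeviceGetClockInfo(gpu, pynvml.NVML_CLOCK_GRAPHICS)
        flag = "  <-- approaching the 83C throttle point" if temp >= 80 else ""
        print(f"{temp}C @ {clock}MHz{flag}")
        time.sleep(5)
finally:
    pynvml.nvmlShutdown()
```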
Driver Crashes
Cause: GPU driver timeout from overcommitted VRAM
Prevention: Manual VRAM limits, avoid auto-detect settings
Fix: Reduce layer offloading, restart LM Studio completely
Platform-Specific Optimizations
Windows Configuration
Required Settings:
- Disable Windows memory compression
- Set pagefile to 32GB minimum
- Add LM Studio to high priority processes
- Use 75% of CPU threads (not 100%)
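The thread count and priority settings can be scripted with psutil (`pip install psutil`). A hedged sketch: the "LM Studio" process name match is an assumption about how it appears in Task Manager, and `HIGH_PRIORITY_CLASS` is Windows-only:

```python
import os
import psutil  # pip install psutil

# 75% of logical threads, leaving headroom for the OS and UI
threads = max(1, int(os.cpu_count() * 0.75))
print(f"Set LM Studio's CPU thread count to {threads}")

# Bump LM Studio's process priority (may need an elevated prompt)
for proc in psutil.process_iter(["name"]):
    if proc.info["name"] and "LM Studio" in proc.info["name"]:
        proc.nice(psutil.HIGH_PRIORITY_CLASS)  # Windows-only constant
        print(f"Raised priority for PID {proc.pid}")
```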
Multi-GPU Reality
Performance: roughly 1.4x throughput gain for 2x the complexity and cost
Issues: Memory synchronization overhead, driver conflicts, doubled heat/power
Recommendation: Single high-end GPU preferred over dual mid-range
AMD GPU Status
Current State: 60% slower than equivalent NVIDIA, frequent crashes
Stability: Crashes every 30-45 minutes with no clear pattern
Recommendation: Use NVIDIA for production workloads
Apple Silicon Performance
M2 MacBook Pro (32GB):
- Qwen-14B: 12.4 tok/s (competitive with RTX 4070)
- Power efficiency: 25W vs 200W+ for RTX cards
- Limitation: No quantization flexibility
Monitoring Requirements
Essential Metrics:
- RAM usage: Keep below 75% of total system RAM
- GPU utilization: Target 95-99% during inference
- GPU temperature: Keep under 80°C sustained
- VRAM usage: Target 85-90% with 10-15% buffer
Recommended Tools:
- Temperature monitoring: MSI Afterburner, GPU-Z
- Memory usage: Task Manager, HWiNFO64
- Performance tracking: Built-in LM Studio metrics
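All four metrics are scriptable if you would rather poll than watch dashboards. A snapshot sketch with psutil and nvidia-ml-py, checked against the targets above:

```python
import psutil  # pip install psutil
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

ram = psutil.virtual_memory()
vram = pynvml.nvmlDeviceGetMemoryInfo(gpu)
util = pynvml.nvmlDeviceGetUtilizationRates(gpu)
temp = pynvml.nvmlDeviceGetTemperature(gpu, pynvml.NVML_TEMPERATURE_GPU)

print(f"RAM:  {ram.percent:.0f}% used (keep below 75%)")
print(f"VRAM: {vram.used / vram.total * 100:.0f}% used (target 85-90%)")
print(f"GPU:  {util.gpu}% utilization (target 95-99% during inference)")
print(f"Temp: {temp}C (keep under 80C sustained)")

pynvml.nvmlShutdown()
```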
Version-Specific Issues
Known Problems:
- Version 0.3.24: Memory regression, more crashes than 0.3.20
- Version 4.1.3: Memory leak in model switching
- Solution: Use version 0.3.20 for stability or wait for fixes
Hardware Recommendations
Budget Build: RTX 3060 12GB + 32GB RAM
- Handles 13B models adequately
- Best price/performance ratio
- 170W power draw
Performance Build: RTX 4070 Ti Super + 64GB RAM
- Handles 20B models at full speed
- 16-19 tok/s performance
- Good balance of cost/capability
Enthusiast Build: RTX 4090 + 128GB RAM
- Handles 30B+ models
- 22-25 tok/s performance
- Future-proof for larger models
Troubleshooting Decision Tree
- Exit Code 137: Insufficient RAM → Add memory or use smaller model
- CUDA Out of Memory: VRAM exceeded → Reduce layer offloading
- Slow Performance: Check thermal throttling → Improve cooling
- Driver Crashes: Overcommitted VRAM → Set manual limits
- Random Crashes: Memory fragmentation → Restart LM Studio
- Inconsistent Performance: NUMA issues → Set CPU affinity
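The same tree as a lookup table, handy for wiring into an alerting script (symptom keys are illustrative):

```python
REMEDIES = {
    "exit_code_137":     "Insufficient RAM -> add memory or use a smaller model",
    "cuda_oom":          "VRAM exceeded -> reduce GPU layer offloading by 5-10 layers",
    "slow_after_warmup": "Likely thermal throttling -> improve cooling, check fan curves",
    "driver_crash":      "Overcommitted VRAM -> set manual VRAM limits",
    "random_crash":      "Memory fragmentation -> restart LM Studio completely",
    "inconsistent_perf": "Possible NUMA issues -> set CPU affinity",
}

print(REMEDIES.get("cuda_oom", "Unknown symptom: check logs and temperatures"))
```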
Performance Formula
Rough Performance Estimation: tok/s ≈ (VRAM_GB × 2.5) / (Model_Size_GB × 1.2)
Reality Check: Actual performance is 50-70% of theoretical due to overhead and thermal limitations.
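The same heuristic in code, with the 50-70% reality check applied (this is the document's rule of thumb, not a benchmark; expect wide variance by quantization and cooling):

```python
def estimated_tok_s(vram_gb: float, model_size_gb: float) -> tuple[float, float]:
    """tok/s range from the rough formula above, scaled by the 50-70% reality check."""
    theoretical = (vram_gb * 2.5) / (model_size_gb * 1.2)
    return theoretical * 0.5, theoretical * 0.7

lo, hi = estimated_tok_s(12, 4.0)  # e.g. a 12GB card running a ~4GB model file
print(f"Expect roughly {lo:.1f}-{hi:.1f} tok/s")
```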