PyTorch Debugging & Troubleshooting: AI-Optimized Technical Reference
Critical Error Patterns & Solutions
RuntimeError: mat1 and mat2 shapes cannot be multiplied
Root Cause: Tensor shape mismatch in matrix operations
Failure Impact: Immediate training crash with misleading error location
Common Scenarios:
- Convolutional output fed directly to linear layer without flattening
- Mismatched batch sizes between tensors
Solution Pattern:
```python
# Essential shape debugging wrapper
def debug_shapes(tensor, name="tensor"):
    print(f"{name}: {tensor.shape}")
    return tensor

# Typical fix for conv-to-linear transition
x = x.view(x.size(0), -1)  # Flatten: [batch, channels, h, w] → [batch, features]
```
CUDA error: device-side assert triggered
Root Cause: GPU-side operation failure with hidden error details
Failure Impact: Complete training halt with no actionable error message
Debug Strategy: Force CPU execution to reveal actual error
```python
# Alternative: set the env var CUDA_LAUNCH_BLOCKING=1 so CUDA errors surface synchronously
if torch.cuda.is_available():
    try:
        result = model(batch.cuda())
    except RuntimeError as e:
        if "device-side assert" in str(e):
            # CUDA reports asserts asynchronously; re-running on CPU points at the real failing op
            model_cpu = model.cpu()
            batch_cpu = batch.cpu()
            result = model_cpu(batch_cpu)  # Real error revealed
```
Common Causes (90% of cases):
- Invalid class indices in loss functions (e.g. out-of-range targets for `CrossEntropyLoss`; see the validation sketch below)
- NaN values in loss computation
- Negative tensor indices
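A minimal pre-loss sanity check for the first cause, run before moving targets to the GPU so the error stays readable; `validate_targets`, `labels`, and `num_classes` are illustrative names, not part of the original text:

```python
import torch

def validate_targets(targets: torch.Tensor, num_classes: int) -> None:
    """Catch out-of-range class indices before they trigger a device-side assert."""
    lo, hi = int(targets.min()), int(targets.max())
    if lo < 0 or hi >= num_classes:
        raise ValueError(f"Targets must lie in [0, {num_classes - 1}], got range [{lo}, {hi}]")

# Example: validate_targets(labels, num_classes=10) before calling nn.CrossEntropyLoss()
```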
Expected all tensors to be on the same device
Root Cause: CPU/GPU tensor mixing
Prevention Pattern:
```python
def forward(self, x):
    # Move inputs to whatever device this module's parameters live on
    device = next(self.parameters()).device
    x = x.to(device)
    return self.layers(x)
```
Memory Management & Leak Detection
Memory Leak Patterns
Critical Failure Mode: Gradual memory growth causing OOM after hours of training
Detection Threshold: >100MB growth over 1000 batches indicates leak
Implementation:
```python
import gc
import torch

class MemoryTracker:
    def __init__(self):
        self.start_memory = torch.cuda.memory_allocated()

    def check_memory_leak(self, tolerance_mb=100):
        gc.collect()
        torch.cuda.empty_cache()
        current_memory = torch.cuda.memory_allocated()
        leak_mb = (current_memory - self.start_memory) / 1024**2
        return leak_mb > tolerance_mb
```
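A usage sketch matching the 1000-batch threshold above; the training loop and the `train_step` helper are illustrative, not part of the original text:

```python
tracker = MemoryTracker()
for step, batch in enumerate(dataloader):
    train_step(batch)  # hypothetical per-batch training function
    if step > 0 and step % 1000 == 0 and tracker.check_memory_leak(tolerance_mb=100):
        print(f"Possible memory leak detected at step {step}")
```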
Common Memory Leak Sources
- Storing full loss tensors: `losses.append(loss)` → `losses.append(loss.item())`, since keeping the loss tensor keeps its entire computation graph alive (all three fixes appear in the loop sketch below)
- Missing gradient clearing: always call `optimizer.zero_grad()` at the start of each step
- Validation-time computation graphs: wrap evaluation in a `torch.no_grad()` context
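A minimal training/validation loop sketch applying all three fixes; `model`, `criterion`, `optimizer`, and the two loaders are assumed to already exist:

```python
import torch

losses = []
for batch, targets in train_loader:
    optimizer.zero_grad()               # fix 2: clear stale gradients every step
    loss = criterion(model(batch), targets)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())          # fix 1: store a float, not the graph

model.eval()
with torch.no_grad():                   # fix 3: no computation graph during evaluation
    for batch, targets in val_loader:
        val_loss = criterion(model(batch), targets)
```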
Memory Profiling Configuration
Working Setup (PyTorch 1.13+):
```python
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CUDA],
    profile_memory=True,
    # AVOID: record_shapes=True (adds memory overhead)
    # AVOID: with_stack=True (crashes with custom datasets)
) as prof:
    ...  # training steps go here
```
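After the context manager exits, the per-operator memory summary can be printed; the sort key below is assumed to match the profiler's column naming in recent PyTorch versions:

```python
print(prof.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=10))
```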
Gradient Flow Debugging
Gradient Monitoring System
Implementation for vanishing/exploding detection:
```python
def register_gradient_hooks(model):
    def hook_fn(module, grad_input, grad_output):
        if grad_output[0] is not None:
            grad_norm = grad_output[0].norm().item()
            if grad_norm > 10 or grad_norm != grad_norm:  # grad_norm != grad_norm catches NaN
                print(f"Gradient issue in {module.__class__.__name__}: norm={grad_norm}")

    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # Leaf modules only
            # register_full_backward_hook supersedes the deprecated register_backward_hook
            module.register_full_backward_hook(hook_fn)
```
Gradient Explosion Mitigation
Critical Threshold: Gradient norms >100 indicate explosion
Nuclear Option: `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)`
Learning Rate Adjustment: Halve the learning rate until the explosion stops
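A placement sketch for the clipping call, which belongs between `backward()` and `step()`; the loop variables (`loss`, `model`, `optimizer`) are illustrative fragments of an existing training step:

```python
loss.backward()
# clip_grad_norm_ returns the pre-clip total norm, which is useful for logging explosions
total_norm = float(torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0))
if total_norm > 100:  # explosion threshold from above
    print(f"Exploding gradients: total norm {total_norm:.1f} before clipping")
optimizer.step()
```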
Deterministic Debugging Configuration
Complete Determinism Setup
Performance Impact: roughly a 20-50% training slowdown, depending on the model and operators used
Use Case: Bug reproduction and consistent debugging sessions
```python
import os
import torch

def set_deterministic_mode(seed=42):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Some CUDA ops (e.g. cuBLAS matmuls) also require this env var for determinism
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.use_deterministic_algorithms(True)  # available since PyTorch 1.8
```
Tensor Shape Validation System
Strategic Assertion Placement
Philosophy: Catch shape errors at source, not 20 layers downstream
```python
def assert_shape(tensor, expected_shape, name="tensor"):
    if tensor.shape != torch.Size(expected_shape):
        raise ValueError(f"{name} has shape {tensor.shape}, expected {torch.Size(expected_shape)}")

# Usage pattern
def forward(self, x):
    batch_size = x.size(0)
    assert_shape(x, (batch_size, 3, 224, 224), "input")
    features = self.backbone(x)
    assert_shape(features, (batch_size, 512), "features")
```
Performance Debugging Specifications
DataLoader Optimization Configuration
High-performance setup:
```python
dataloader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=8,            # More workers = faster (usually); tune to CPU core count
    pin_memory=True,          # Faster host-to-GPU transfer
    persistent_workers=True,  # Keeps workers alive between epochs (requires num_workers > 0)
)
```
Model Benchmarking Protocol
Performance Threshold: >100ms per forward pass indicates bottleneck
```python
import time
import torch

def benchmark_model(model, batch, warmup=10, runs=100):
    # Warmup phase (fills CUDA caches, triggers lazy initialization)
    for _ in range(warmup):
        _ = model(batch)
    torch.cuda.synchronize()

    # Timing measurement
    start = time.time()
    for _ in range(runs):
        output = model(batch)
    torch.cuda.synchronize()

    avg_time = (time.time() - start) / runs
    return avg_time * 1000  # Return milliseconds per forward pass
```
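A usage sketch against the 100 ms threshold above; `model` and `batch` are assumed to already live on the GPU:

```python
latency_ms = benchmark_model(model, batch)
if latency_ms > 100:
    print(f"Forward pass is a bottleneck: {latency_ms:.1f} ms per batch")
```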
Distributed Training Debugging
Communication Verification
Critical Test: All-reduce operation verification across ranks
```python
import torch
import torch.distributed as dist

def check_distributed_setup():
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    tensor = torch.ones(1).cuda() * rank
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    expected = sum(range(world_size))  # 0 + 1 + ... + (world_size - 1)
    if tensor.item() != expected:
        raise RuntimeError(f"All-reduce broken. Got {tensor.item()}, expected {expected}")
```
Common Distributed Failures
- NCCL timeout: Network issues or rank desynchronization (see the environment-variable sketch below)
- Hanging on all_reduce: One rank died without notification
- Different loss values: Data loading inconsistency across ranks
- Rank 0 OOM: Uneven batch distribution or extra work on master rank
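When chasing the NCCL and hang failures above, these standard debugging environment variables (set before process-group initialization) make the failure visible; exact variable names and behavior vary somewhat across PyTorch/NCCL versions:

```python
import os

# Set before torch.distributed.init_process_group()
os.environ["NCCL_DEBUG"] = "INFO"                 # print NCCL setup and error details per rank
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra collective-mismatch diagnostics
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"     # fail fast instead of hanging on a dead rank
```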
Advanced Memory Analysis
Memory Snapshot Configuration (PyTorch 1.12+)
Correct API usage:
```python
torch.cuda.memory._record_memory_history()  # Start recording (private API, default settings)
# ... run training ...
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
torch.cuda.memory._record_memory_history(None)  # Disable recording
```
The resulting pickle can be inspected with the interactive viewer at https://pytorch.org/memory_viz.
Hook-Based Debugging System
Dead neuron detection threshold: >90% zero activations
NaN propagation monitoring:
```python
def register_debug_hooks(model):
    def forward_hook(name):
        def hook(module, input, output):
            if not isinstance(output, torch.Tensor):
                return  # skip modules that return tuples or other containers
            # NaN propagation monitoring
            if torch.isnan(output).any():
                print(f"NaN detected in {name}")
            # Dead ReLU detection
            if 'relu' in name.lower():
                dead_pct = (output == 0).float().mean().item()
                if dead_pct > 0.9:
                    print(f"WARNING: {name} has {dead_pct*100:.1f}% dead neurons")
        return hook

    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # Leaf modules only
            module.register_forward_hook(forward_hook(name))
```
Debugging Tool Comparison Matrix
Method | Setup Time | Performance Impact | Info Depth | Best Use Case |
---|---|---|---|---|
Print Statements | Immediate | Minimal | Low | Shape/value debugging |
torch.profiler | 5-10 minutes | Medium (20-30%) | Very High | Memory/performance analysis |
Tensor Hooks | 2-5 minutes | Low-Medium | Medium | Gradient flow analysis |
Memory Snapshots | 10-15 minutes | Low | Very High | Persistent leak investigation |
Deterministic Mode | 1 minute | High (30-50%) | N/A | Bug reproduction |
Custom Assertions | 1-2 minutes | Minimal | Low | Runtime validation |
Critical Resource Requirements
Hardware Specifications for Debugging
- Memory overhead: Profiling adds 10-20% memory usage
- Compute overhead: Deterministic mode reduces throughput by 30-50%
- Storage requirements: Memory snapshots can be 1-10GB
- Network bandwidth: Distributed debugging requires stable high-bandwidth connections
Version Compatibility Matrix
- PyTorch 1.11.x: Profiler crashes with custom dataloaders
- PyTorch 1.12.0: Memory profiler gives incorrect readings
- PyTorch 1.12+: Memory snapshots available, deterministic algorithms supported
- PyTorch 1.13.1: Improved memory leak detection, stable profiling
Production Deployment Considerations
Debug Mode Performance Impact
- Development: Use full debugging suite
- Staging: Enable memory tracking only
- Production: Disable all debugging except critical error logging
Monitoring Thresholds for Production
- Memory growth: >5% per hour indicates potential leak
- Gradient norms: >50 suggests instability (see the check sketch below)
- Training speed: >2x slowdown indicates bottleneck
- Loss divergence: NaN or >10x increase requires immediate intervention
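A minimal in-loop check covering the gradient-norm and loss-divergence thresholds; the helper name and the `best_loss` baseline are illustrative, not part of the original text:

```python
import math

def check_training_health(model, loss, best_loss, grad_norm_limit=50.0):
    """Return warnings for the instability thresholds listed above."""
    warnings = []
    loss_value = loss.item()
    if math.isnan(loss_value) or (best_loss > 0 and loss_value > 10 * best_loss):
        warnings.append(f"Loss divergence: {loss_value}")
    grad_sq = sum(float(p.grad.norm()) ** 2 for p in model.parameters() if p.grad is not None)
    total_norm = math.sqrt(grad_sq)
    if total_norm > grad_norm_limit:
        warnings.append(f"Gradient norm {total_norm:.1f} exceeds {grad_norm_limit}")
    return warnings
```

Call it after `loss.backward()` and before `optimizer.step()` so gradients are populated.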
Emergency Debugging Protocols
When Training Fails Catastrophically
- Immediate: Switch to CPU mode to get real error messages
- Within 5 minutes: Enable memory tracking and gradient monitoring
- Within 15 minutes: Set deterministic mode for reproduction
- Within 30 minutes: Deploy full profiling suite if issue persists
Data Quality Verification
```python
# Essential dataset sanity check before training
dataset = MyDataset()
for i in range(min(5, len(dataset))):
    try:
        sample = dataset[i]
        shapes = [x.shape for x in sample if hasattr(x, "shape")]  # skip non-tensor fields (e.g. int labels)
        print(f"Sample {i} shapes: {shapes}")
    except Exception as e:
        raise RuntimeError(f"Dataset broken at index {i}: {e}") from e
```
This technical reference provides actionable debugging protocols with specific thresholds, configurations, and decision trees for systematic PyTorch troubleshooting.
Useful Links for Further Investigation
Essential PyTorch Debugging Resources
Link | Description |
---|---|
PyTorch Profiler Documentation | The official guide to PyTorch's built-in profiling tools. Essential for understanding memory usage, kernel execution, and performance bottlenecks. Skip the theory - focus on the practical examples. |
CUDA Memory Management Guide | PyTorch's official documentation on GPU memory handling. Explains `torch.cuda.empty_cache()`, memory allocation patterns, and the difference between reserved and allocated memory. |
Understanding GPU Memory Blog Series | Meta's comprehensive blog series on GPU memory visualization and debugging. Includes the memory snapshot tool and detailed explanations of memory fragmentation patterns. |
Autograd Mechanics | Deep dive into how PyTorch's automatic differentiation works. Read this when you need to understand gradient computation bugs or create custom backward functions. |
PyTorch Discuss Forum | The most reliable place for PyTorch-specific debugging help. The core development team actually responds here. Search before posting - most debugging questions have been asked before. |
Memory Leak Debugging Thread | Community-maintained thread with practical memory leak debugging techniques. Contains solutions for the most common memory leak patterns in PyTorch training loops. |
Stack Overflow PyTorch Tag | Good for general debugging questions, but the PyTorch forum is usually more helpful for framework-specific issues. Use for quick tensor manipulation questions. |
pytorch_memlab | Third-party library for profiling and inspecting memory usage in PyTorch. Provides line-by-line memory profiling and memory leak detection tools. |
TorchViz | Visualizes PyTorch computation graphs. Useful for understanding model architecture and debugging gradient flow issues. Simple installation: `pip install torchviz`. |
PyTorch Lightning | While primarily a training framework, Lightning includes excellent debugging utilities like gradient clipping, learning rate monitoring, and automatic mixed precision debugging. |
NVIDIA Nsight Systems | Professional GPU profiling tool that works with PyTorch. More detailed than PyTorch's built-in profiler for CUDA kernel analysis. Free download from NVIDIA. |
TensorBoard with PyTorch | Official tutorial for integrating TensorBoard with PyTorch. Essential for visualizing training metrics, model graphs, and debugging convergence issues. |
Intel VTune Profiler | For CPU-bound PyTorch operations and Intel GPU debugging. Particularly useful for analyzing data loading bottlenecks and CPU-side performance issues. |
PyTorch Distributed Debugging Guide | Official guide to debugging multi-GPU and distributed training issues. Covers DDP, NCCL errors, and synchronization debugging techniques. |
CUDA Debugging Best Practices | NVIDIA's official guide to CUDA debugging. Essential when dealing with "device-side assert triggered" errors and other low-level CUDA issues. |
PyTorch Hook Documentation | Official documentation for forward and backward hooks. Critical for non-invasive debugging of model internals without modifying code. |
DataLoader Debugging Tips | Community thread focused on debugging data loading issues. Covers multiprocessing errors, slow data loading, and memory leaks in custom datasets. |
Mixed Precision Debugging | Official guide to debugging automatic mixed precision training. Covers NaN detection, loss scaling issues, and gradient overflow problems. |
Quantization Debugging | Official documentation for debugging quantized models. Essential when dealing with accuracy degradation after quantization. |
PyTorch GitHub Issues | When you encounter a genuine PyTorch bug, search here first. Use labels like "memory leak", "CUDA", or "autograd" to filter. File issues only after thorough debugging. |
NVIDIA Developer Forums | For CUDA-specific errors that resist PyTorch-level debugging. The NVIDIA team is responsive to GPU driver and CUDA toolkit issues. |
PyTorch Community Forums | Community discussions about debugging challenges. Good for getting perspective on whether your issue is common or unique. |
PyTorch Debugging Cheat Sheet | Condensed debugging reference from the PyTorch team. Keep this bookmarked for quick access to common debugging patterns. |
CUDA Error Code Reference | NVIDIA's official CUDA error code documentation. Use this to decode cryptic CUDA error numbers into actionable information. |
PyTorch Common Errors Guide | University tutorial covering the most common PyTorch errors with practical solutions. Excellent for beginners and intermediate users. |