PyTorch debugging has gotten better with the latest profiling tools, but the fundamental challenges remain the same: cryptic error messages, dynamic computation graphs that make stack traces useless, and memory management that fails in ways that make you question your understanding of computers. Knowing which debugging approach to take when your model decides to break is half the battle.
Rule #1: Learn to Read PyTorch's Terrible Error Messages
PyTorch error messages are designed to confuse you. Here's how to decode the most common ones:
"RuntimeError: mat1 and mat2 shapes cannot be multiplied"
This means the inner dimensions of a matrix multiplication don't line up - typically a linear layer receiving input of the wrong width. The error reports the offending shapes, but not where in your code the multiplication happens. Add shape debugging everywhere:
def debug_shapes(tensor, name="tensor"):
    print(f"{name}: {tensor.shape}")
    return tensor

## Wrap your tensors to see what's happening
x = debug_shapes(x, "input")
hidden = debug_shapes(self.linear1(x), "after_linear1")
output = debug_shapes(self.linear2(hidden), "final_output")
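If sprinkling debug_shapes calls by hand gets tedious, forward hooks can log every module's output shape automatically. A minimal sketch - the attach_shape_hooks helper and the toy model are illustrations, not part of any PyTorch API:

import torch
import torch.nn as nn

def attach_shape_hooks(model):
    """Print every leaf module's output shape during the forward pass."""
    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor):
            print(f"{module.__class__.__name__}: {tuple(output.shape)}")
    handles = []
    for module in model.modules():
        if len(list(module.children())) == 0:  # leaf modules only
            handles.append(module.register_forward_hook(hook))
    return handles

## Throwaway model just to show the output; remove the hooks when done
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
handles = attach_shape_hooks(model)
model(torch.randn(8, 16))
for h in handles:
    h.remove()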
"CUDA error: device-side assert triggered"
Something went wrong inside a GPU kernel, but because CUDA ops execute asynchronously, the Python stack trace rarely points at the real culprit. The usual cause is an out-of-bounds index hitting a loss function or embedding layer. Run the same code on CPU to get an actual Python exception:
## This debugging pattern has saved me countless hours
if torch.cuda.is_available():
    try:
        result = model(batch.cuda())
    except RuntimeError as e:
        if "device-side assert" in str(e):
            print("CUDA error detected, switching to CPU for debugging...")
            model_cpu = model.cpu()
            batch_cpu = batch.cpu()
            result = model_cpu(batch_cpu)  # This will give you the real error
        else:
            raise
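Since the usual trigger is an out-of-range index, it's also worth validating labels and token ids on the CPU before they ever reach the GPU. A small sketch; check_index_ranges and its arguments are illustrative names, not a fixed API:

import torch
import torch.nn as nn

def check_index_ranges(targets, num_classes, token_ids=None, embedding=None):
    """Catch the usual device-side assert culprits before the GPU does."""
    # Class labels used by CrossEntropyLoss must lie in [0, num_classes)
    assert int(targets.min()) >= 0, f"negative label: {int(targets.min())}"
    assert int(targets.max()) < num_classes, \
        f"label {int(targets.max())} >= num_classes ({num_classes})"
    # Token ids must fit inside the embedding table
    if token_ids is not None and embedding is not None:
        assert int(token_ids.max()) < embedding.num_embeddings, \
            f"token id {int(token_ids.max())} >= vocab size ({embedding.num_embeddings})"

## Example: labels for a 10-class problem and ids for a 100-entry embedding
labels = torch.tensor([0, 3, 9])
emb = nn.Embedding(num_embeddings=100, embedding_dim=8)
ids = torch.tensor([[1, 2, 99]])
check_index_ranges(labels, num_classes=10, token_ids=ids, embedding=emb)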
[Figure: PyTorch memory profiler output showing GPU memory allocation patterns - essential for debugging OOM errors]
[Figure: TensorBoard visualization of loss curves and debugging metrics during PyTorch training]
"RuntimeError: Expected all tensors to be on the same device"
You mixed CPU and GPU tensors somewhere. The stack trace usually points to the wrong line. Add device checking to your forward pass:
def check_device_consistency(self, x):
    """Add this to your model's forward method during debugging"""
    model_device = next(self.parameters()).device
    if x.device != model_device:
        raise ValueError(f"Input on {x.device}, model on {model_device}")
    return x
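Wired into a model, it looks something like this; the single linear layer is just a stand-in for a real architecture:

import torch
import torch.nn as nn

class DeviceCheckedModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(128, 10)  # placeholder layer

    def check_device_consistency(self, x):
        model_device = next(self.parameters()).device
        if x.device != model_device:
            raise ValueError(f"Input on {x.device}, model on {model_device}")
        return x

    def forward(self, x):
        x = self.check_device_consistency(x)  # fail at the entry point, not deep inside a layer
        return self.linear(x)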
[Figure: PyTorch memory allocation timeline showing allocation patterns and potential leak-detection points]
Memory Leak Detection That Actually Works
Gradual GPU memory growth in PyTorch training loops is a well-documented problem - usually caused by your own code holding references that keep tensors and their computation graphs alive, rather than by the framework itself. The official CUDA memory management guide explains the theory, but here's what actually helps in practice when creeping memory growth kills your training runs.
import torch
import gc

class MemoryTracker:
    def __init__(self):
        self.start_memory = torch.cuda.memory_allocated()

    def check_memory_leak(self, tolerance_mb=100):
        gc.collect()  # Force garbage collection
        torch.cuda.empty_cache()  # Clear PyTorch cache
        current_memory = torch.cuda.memory_allocated()
        leak_mb = (current_memory - self.start_memory) / 1024**2
        if leak_mb > tolerance_mb:
            print(f"Potential memory leak: {leak_mb:.2f}MB increase")
            return True
        return False
## Use it in your training loop
tracker = MemoryTracker()
for epoch in range(num_epochs):
    for batch_idx, batch in enumerate(dataloader):  # enumerate so batch_idx exists
        loss = training_step(batch)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        # Check for leaks every 100 batches
        if batch_idx % 100 == 0:
            tracker.check_memory_leak()
The PyTorch profiler provides detailed memory tracking, but it's overkill for simple leak detection. The above approach catches 90% of memory issues without the complexity.
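When you do want the detailed view, a short pass with the profiler's memory tracking enabled attributes allocations to individual ops. A minimal sketch with a throwaway model, assuming a CUDA device is available:

import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).cuda()
inputs = torch.randn(64, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             profile_memory=True) as prof:
    model(inputs)

## Largest CUDA allocations, attributed to the ops that made them
print(prof.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=10))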
Gradient Debugging: When Backprop Goes Wrong
Gradient problems are the worst to debug because they fail silently. Your model trains but learns nothing, or worse, explodes into NaN values after 50 epochs of seemingly normal training. Here are the practical tools that actually work when gradient flow goes sideways.
Essential gradient debugging tools:
def register_gradient_hooks(model):
    """Add hooks to monitor gradient flow"""
    def hook_fn(module, grad_input, grad_output):
        if grad_output[0] is not None:
            grad_norm = grad_output[0].norm().item()
            if grad_norm > 10 or grad_norm != grad_norm:  # NaN check: NaN != NaN
                print(f"Gradient issue in {module.__class__.__name__}: norm={grad_norm}")

    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # Leaf modules only
            module.register_full_backward_hook(hook_fn)  # register_backward_hook is deprecated
## Use during training
register_gradient_hooks(model)
## Also check for dead neurons
def check_gradient_flow(named_parameters):
    ave_grads = []
    layers = []
    for n, p in named_parameters:
        if p.requires_grad and p.grad is not None:
            layers.append(n)
            ave_grads.append(p.grad.abs().mean().cpu().item())
    # Visualize gradient magnitudes
    import matplotlib.pyplot as plt
    plt.plot(ave_grads, alpha=0.3, color="b")
    plt.hlines(0, 0, len(ave_grads) + 1, linewidth=1, color="k")
    plt.xticks(range(0, len(ave_grads), 1), layers, rotation="vertical")
    plt.xlim(0, len(ave_grads))
    plt.ylabel("average gradient")
    plt.title("Gradient flow")
    plt.grid(True)
    plt.show()

## Call after loss.backward()
check_gradient_flow(model.named_parameters())
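When the problem is NaNs rather than gradient magnitude, autograd's anomaly detection will name the backward op that produced them. It adds real overhead, so treat it as a debugging-only switch; training_step and batch below refer to the training-loop sketch from earlier:

import torch

## Wrap only the suspect step - anomaly detection slows backward down considerably
with torch.autograd.detect_anomaly():
    loss = training_step(batch)  # from the training loop above
    loss.backward()              # raises and points at the op that produced NaN/Inf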
[Figure: gradient-flow plot used to spot vanishing and exploding gradients]
[Figure: PyTorch computational graph showing how operations and tensors connect during the forward pass]
The Nuclear Option: Deterministic Debugging
When your model behaves differently between runs, even with the same random seed, you need deterministic mode. This is essential for reproducing bugs:
import torch
import numpy as np
import random

def set_deterministic_mode(seed=42):
    """Make PyTorch completely deterministic - slow but necessary for bug hunting"""
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    # The nuclear option - makes everything deterministic but slow
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # For even more determinism (PyTorch 1.8+); some CUDA ops also require the
    # CUBLAS_WORKSPACE_CONFIG=:4096:8 environment variable to be set
    torch.use_deterministic_algorithms(True)
## Call at the start of your debugging session
set_deterministic_mode()
Warning: This will slow down training significantly, but it's the only way to guarantee reproducible debugging sessions. Use only when hunting specific bugs.
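One source of run-to-run drift the function above doesn't touch is DataLoader workers. Seeding them follows the recipe from PyTorch's reproducibility notes; the TensorDataset below is only a stand-in for your real data:

import random
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def seed_worker(worker_id):
    # Each worker process derives its own seed from the base seed
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(42)

dataset = TensorDataset(torch.randn(128, 8))  # stand-in dataset
loader = DataLoader(dataset, batch_size=32, num_workers=4,
                    worker_init_fn=seed_worker, generator=g)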
Tensor Shape Debugging with Assertions
The most underused debugging technique in PyTorch is strategic assertions. They catch shape errors at the source instead of 20 lines later in some random linear layer:
import torch
import torch.nn as nn

def assert_shape(tensor, expected_shape, name="tensor"):
    """Assert tensor has expected shape with helpful error message"""
    if tensor.shape != torch.Size(expected_shape):
        raise ValueError(
            f"{name} has shape {tensor.shape}, expected {torch.Size(expected_shape)}"
        )

class DebuggableModel(nn.Module):
    # Assumes __init__ defines self.backbone, self.classifier, and self.num_classes
    def forward(self, x):
        batch_size = x.size(0)
        # Assert input shape
        assert_shape(x, (batch_size, 3, 224, 224), "input")
        features = self.backbone(x)
        assert_shape(features, (batch_size, 512), "features")
        logits = self.classifier(features)
        assert_shape(logits, (batch_size, self.num_classes), "logits")
        return logits
This approach catches 80% of tensor shape bugs immediately at their source. Remove the assertions once your model is stable.
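If deleting the checks outright feels risky, you can gate them behind an environment variable instead and flip them back on for the next bug hunt. A small sketch; the DEBUG_SHAPES variable name is arbitrary:

import os
import torch

SHAPE_CHECKS = os.environ.get("DEBUG_SHAPES", "0") == "1"

def assert_shape(tensor, expected_shape, name="tensor"):
    """Same check as above, but a no-op unless DEBUG_SHAPES=1 is set."""
    if SHAPE_CHECKS and tensor.shape != torch.Size(expected_shape):
        raise ValueError(
            f"{name} has shape {tensor.shape}, expected {torch.Size(expected_shape)}"
        )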
The key insight: PyTorch debugging is about building visibility into the black box of tensor operations. The dynamic graph is powerful but opaque - you need to explicitly add debugging instrumentation to understand what's happening during training.