Your PyTorch DataLoader is the reason your training crawls. I've debugged enough hanging DataLoaders at 3am to know the symptoms: GPU utilization sitting at 15%, workers mysteriously dying, and that sinking feeling when you realize your "optimized" training loop is slower than running on CPU.
When you set `num_workers > 0`, PyTorch spawns separate worker processes to load your data in parallel. Sounds great until it isn't.
The Multiprocessing Nightmare (And Why It's Worth It)
Here's what happens when multiprocessing goes sideways:
- Workers hanging forever because OpenCV and multiprocessing hate each other
- Memory usage exploding because every worker gets its own copy of everything
- Mysterious `signal: Killed` errors when the OOM killer decides your workers need to die
The `num_workers = 4 * num_GPUs` rule is a starting point, not gospel. I usually start there and tune based on what breaks first - your RAM, your CPU, or your sanity.
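As a concrete starting point, here's a minimal sketch - the synthetic TensorDataset, batch size, and worker math are placeholders for whatever your real pipeline uses:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; swap in your real Dataset.
dataset = TensorDataset(torch.randn(2048, 3, 64, 64), torch.randint(0, 10, (2048,)))

num_gpus = max(torch.cuda.device_count(), 1)
loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4 * num_gpus,  # the 4-per-GPU heuristic; tune from here
)
```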
Memory Pinning: The 20% Speedup Nobody Talks About
Set `pin_memory=True` if you're using a GPU. It allocates batches in page-locked (pinned) host memory, so transfers to the GPU are faster. The speedup is real - usually around 20% - but it eats more RAM. Skip it if you're training on CPU or already memory-constrained.
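Roughly what that looks like in the training loop, sketched against the placeholder dataset above; pinned memory pairs with `non_blocking=True` on the `.to()` call, which is what lets the host-to-GPU copy overlap with compute:

```python
loader = DataLoader(dataset, batch_size=64, num_workers=4, pin_memory=True)

device = torch.device("cuda")
for images, labels in loader:
    # non_blocking only overlaps the copy when the source tensor is pinned
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    # forward / backward / optimizer step go here
```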
Persistent Workers: The PyTorch 1.7+ Secret Weapon
`persistent_workers=True` is the single best addition to DataLoader since forever. It keeps worker processes alive between epochs instead of spawning new ones every time.
Before this existed, I watched training scripts waste 30 seconds between epochs just spawning processes. Now it's instant. Use it unless you're debugging and need fresh workers.
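It's a single constructor flag, and it only applies when `num_workers > 0`. A sketch (worker count and epoch count are arbitrary here):

```python
loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=8,
    pin_memory=True,
    persistent_workers=True,  # PyTorch 1.7+; keeps workers alive across epochs
)

for epoch in range(10):
    for images, labels in loader:  # no worker respawn between these epochs
        ...
```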
The Prefetch Factor Reality Check
`prefetch_factor=2` is the default - each worker loads 2 batches ahead. Compared with a prefetch of 1, that roughly doubles the memory tied up in queued batches. Set it to 1 if you're memory-bound. You'll lose maybe 5-10% throughput but won't run out of RAM.
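A hedged sketch of the memory-bound configuration; note that `prefetch_factor` is only accepted when `num_workers > 0`:

```python
loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=8,
    prefetch_factor=1,  # one batch queued per worker instead of two
    persistent_workers=True,
)
```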
Memory Consumption: The Math That'll Scare You
Your memory usage roughly equals `num_workers × whatever_your_dataset_loads × prefetch_factor`.
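To make that concrete, a back-of-the-envelope example with assumed numbers (batch of 64 RGB images at 224×224, float32, 8 workers, default prefetch):

```python
bytes_per_image = 3 * 224 * 224 * 4     # one float32 RGB image ≈ 0.6 MB
bytes_per_batch = 64 * bytes_per_image  # ≈ 38.5 MB per batch
workers, prefetch_factor = 8, 2

held = workers * prefetch_factor * bytes_per_batch
print(f"~{held / 2**20:.0f} MiB sitting in prefetched batches")  # ~588 MiB
```

That's over half a gigabyte of host RAM before the model, gradients, or CUDA context claim a single byte.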
I once saw a setup using 40x more memory than expected because every worker was loading the entire dataset into memory. The training worked great until it didn't, and then it crashed spectacularly.
Pro tip: use `htop` or `nvidia-smi` to watch memory consumption during the first few batches. If it's climbing fast, reduce `num_workers` or `prefetch_factor` before your system dies.
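If you'd rather watch it from inside the script, here's a rough sketch assuming `psutil` is installed; the DataLoader workers show up as child processes of the training process:

```python
import psutil

def loader_rss_mib():
    """Resident memory of this process plus its worker children, in MiB."""
    main = psutil.Process()
    procs = [main] + main.children(recursive=True)
    return sum(p.memory_info().rss for p in procs) / 2**20

for step, (images, labels) in enumerate(loader):
    if step < 5:  # the first few batches are where the climb shows up
        print(f"batch {step}: ~{loader_rss_mib():.0f} MiB resident")
```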
When DataLoader Hangs: The OpenCV Debugging Saga
If your DataLoader hangs with `num_workers > 0`, it's probably OpenCV. The fix is stupid but works:
```python
import cv2
cv2.setNumThreads(0)  # Add this before DataLoader
```
I learned this after spending 6 hours debugging a hanging DataLoader that worked fine with `num_workers=0`. OpenCV's internal threading conflicts with PyTorch's multiprocessing, and the combination just... stops.
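A variation on the same trick, if you'd rather not rely on a module-level call: put it in a `worker_init_fn` so it runs inside every worker process. This is my own packaging of the fix, not a different one:

```python
import cv2
from torch.utils.data import DataLoader

def disable_cv2_threads(worker_id):
    cv2.setNumThreads(0)  # keep OpenCV single-threaded inside each worker

loader = DataLoader(dataset, batch_size=64, num_workers=8,
                    worker_init_fn=disable_cv2_threads)
```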
The GPU Utilization Test
A properly configured DataLoader shows consistent 80-90% GPU utilization during training. Poor configurations show utilization bouncing between 0% and 100%, which means the GPU is starved for data.
Run `nvidia-smi -l 1` during training. If GPU utilization is below 80% and memory isn't maxed out, your DataLoader is the bottleneck. Increase `num_workers` until utilization hits 85-90%, then stop. More workers past that point usually just waste resources.
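One way to narrow down that sweet spot without staring at `nvidia-smi` all afternoon is a quick data-only sweep - a rough sketch that times one pass over the loader with no model attached:

```python
import time
from torch.utils.data import DataLoader

for workers in (2, 4, 8, 12, 16):
    loader = DataLoader(dataset, batch_size=64, num_workers=workers,
                        pin_memory=True, persistent_workers=True)
    start = time.perf_counter()
    for batch in loader:  # pure data throughput, no GPU work
        pass
    print(f"num_workers={workers}: {time.perf_counter() - start:.1f}s per pass")
```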