Computer vision without GPU acceleration is like driving a Ferrari in first gear. Technically possible, practically useless. But getting Roboflow to actually use your expensive GPU? That's where things go sideways.
Modern NVIDIA GPUs are complicated: streaming multiprocessors, CUDA cores, Tensor cores, layered memory hierarchies, and none of it works with inference frameworks unless the driver version, CUDA toolkit, and cuDNN library all line up.
The problem isn't Roboflow - it's the insane dependency matrix between CUDA versions, cuDNN versions, ONNX Runtime builds, and your specific GPU generation. One mismatch and you're running inference on CPU while your $500 GPU sits there doing nothing.
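The fastest way to catch that silent CPU fallback is to ask ONNX Runtime directly which execution providers it actually loaded. Here's a minimal sketch assuming the `onnxruntime-gpu` wheel is installed; `model.onnx` is a placeholder path for whatever model you're actually serving:

```python
import onnxruntime as ort

# If "CUDAExecutionProvider" is missing here, you're on the CPU-only wheel
# or the CUDA/cuDNN libraries failed to load.
print(ort.get_available_providers())
print(ort.get_device())  # "GPU" for a working onnxruntime-gpu install, "CPU" otherwise

# Even when CUDA shows up as available, a session can still fall back to CPU
# at load time. "model.onnx" is a placeholder for your actual model file.
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # CUDAExecutionProvider should be listed first
```

If `CUDAExecutionProvider` never shows up, no amount of Roboflow configuration will get you onto the GPU; the problem is one layer down.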
The CUDA Version Dumpster Fire
ONNX Runtime is picky as hell about CUDA versions. As of September 2025, you need:
- CUDA 12.x with cuDNN 9.x for modern GPUs (RTX 30/40 series)
- CUDA 11.8 with cuDNN 8.x for older cards (GTX 1660, RTX 20 series)
The NVIDIA compatibility matrix tells you what your card supports, but ONNX Runtime's requirements override everything. If they say CUDA 12.x only, that's what you get.
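Before trusting any of that, check what you actually have installed. This sketch just shells out to `nvidia-smi` and `nvcc` and prints the ONNX Runtime version; you still have to compare the numbers against the CUDA/cuDNN pairing listed in the ONNX Runtime release notes for that version.

```python
import shutil
import subprocess

import onnxruntime as ort

def run(cmd):
    # Returns the tool's stdout, or a note if it isn't on PATH.
    if shutil.which(cmd[0]) is None:
        return f"{cmd[0]} not found on PATH"
    return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()

# Driver version: the ceiling on which CUDA runtimes the driver can serve.
print(run(["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"]))
# Toolkit version, if nvcc is installed at all.
print(run(["nvcc", "--version"]))
# The wheel you actually imported.
print("onnxruntime", ort.__version__)
```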
Windows users get extra pain: you need matching Visual C++ runtimes, correct PATH entries, and sometimes specific ONNX Runtime builds. The error `LoadLibrary failed with error 126` means your DLLs are fucked.
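Error 126 is Windows telling you a dependent DLL couldn't be found. A rough way to narrow down which one, using only the standard library; the DLL names below are what CUDA 12.x and cuDNN 9.x typically install on Windows, so adjust them to whatever your toolkit version actually ships:

```python
import ctypes

# Try to load the core CUDA/cuDNN DLLs the same way ONNX Runtime would.
for dll in ("cudart64_12.dll", "cublas64_12.dll", "cudnn64_9.dll"):
    try:
        ctypes.WinDLL(dll)
        print(f"{dll}: OK")
    except OSError as exc:
        print(f"{dll}: FAILED ({exc})")
```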
I spent an entire Saturday reinstalling CUDA drivers in different orders until I found the magic sequence: CUDA toolkit first, then cuDNN, then Visual C++ redistributable, then Python packages. Do it backwards and you get to start over.
The Docker GPU Passthrough Catastrophe
Docker GPU support requires nvidia-container-runtime, which half the time isn't properly installed. You'll think everything's working until you try to access the GPU from inside the container.
Test GPU access inside your container:
docker exec -it container_name nvidia-smi
If that fails, your Docker daemon isn't configured for GPU passthrough. On Ubuntu: `sudo apt install nvidia-container-runtime`, then restart Docker. On Windows with WSL2, you need the NVIDIA driver with WSL support installed on the Windows side plus the CUDA toolkit inside the WSL2 distribution.
The really fun part? Some Docker base images come with incompatible CUDA versions baked in. You'll install everything correctly on the host, then the container loads its own broken CUDA libraries.
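One way to tell those failure modes apart is a small check run inside the container: if `nvidia-smi` can see the GPU but ONNX Runtime can't load its CUDA provider, suspect the libraries baked into the image rather than the host driver or the runtime passthrough. A sketch, assuming Python and `onnxruntime-gpu` are installed in the image:

```python
import ctypes.util
import subprocess

import onnxruntime as ort

# Run this *inside* the container.
# Driver passthrough check: does the container see the GPU at all?
print(subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout.strip())

# Library check: can the loader find the CUDA/cuDNN shared objects?
# (None means the dynamic loader can't locate them.)
print("libcudart:", ctypes.util.find_library("cudart"))
print("libcudnn:", ctypes.util.find_library("cudnn"))

# Framework check: did ONNX Runtime actually pick up the CUDA provider?
print("providers:", ort.get_available_providers())
```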
Memory Problems Nobody Talks About
Large models like SAM eat 4-8GB of GPU memory. Your RTX 3060 with 12GB sounds fine until you realize Windows/background processes already claimed 2GB, leaving you with barely enough to load one model.
Solution: Monitor GPU memory during startup with `nvidia-smi -l 1`. If you're hitting limits, either get more VRAM or switch to quantized models. The YOLOv8 nano models use way less memory than SAM for basic detection tasks.
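If you'd rather fail fast than watch `nvidia-smi` scroll by, NVML can tell you how much VRAM is actually free before you try to load anything. A sketch using the `pynvml` bindings (`pip install nvidia-ml-py`); `MODEL_BUDGET_GB` is a made-up threshold for illustration, not something Roboflow exposes:

```python
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU
info = pynvml.nvmlDeviceGetMemoryInfo(handle)

free_gb = info.free / 1024**3
total_gb = info.total / 1024**3
print(f"free VRAM: {free_gb:.1f} GB of {total_gb:.1f} GB")

# MODEL_BUDGET_GB is a hypothetical threshold; set it to what your model needs.
MODEL_BUDGET_GB = 8
if free_gb < MODEL_BUDGET_GB:
    raise SystemExit("Not enough free VRAM; close other GPU apps or pick a smaller model.")

pynvml.nvmlShutdown()
```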
Edge devices are worse. A Jetson Nano with 4GB shared memory will choke on anything beyond the smallest models. Plan your memory budget before picking models, not after deployment fails.