Got an NVIDIA GPU and want to use it in Docker? Without this toolkit, Docker completely ignores your GPU. Your containers crawl along on CPU while your expensive graphics card sits there doing nothing.
The NVIDIA Container Toolkit fixes this disaster. Version 1.17.8 dropped on May 30, 2025, and honestly, it's the only way that actually works. Before this existed, you'd spend days manually mounting `/dev/nvidia*` devices and copying driver files around like an animal.
Here's what actually happens: when you run a container that needs GPU access, the toolkit automatically mounts the right driver files and sets up all the CUDA libraries your container needs. No more hand-mounting `/dev/nvidia0` like some caveman.
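Assuming the toolkit and a reasonably current driver are already installed, the smoke test is a one-liner. The CUDA image tag below is just an example; pick whatever tag matches your driver:

```bash
# --gpus all exposes every GPU on the host to the container.
# nvidia-smi comes from the injected driver files, not from the image itself.
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```

If that prints the same GPU table you'd see on the host, the plumbing works.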
How It Actually Works
The toolkit has four main pieces that do the heavy lifting:
nvidia-container-runtime - A thin wrapper around the low-level OCI runtime (runc) that tells it "this container wants GPU access." Works with Docker, containerd, whatever.
nvidia-container-toolkit - Provides the hook that runs just before a container starts. Figures out which GPU files to mount and does the setup. This replaced the old nvidia-docker2 mess that nobody misses.
libnvidia-container - The low-level library doing the heavy lifting. Mounts devices, injects CUDA libraries, discovers GPUs. This is where the actual work happens.
nvidia-ctk - Command line tool for configuration. You'll use this to set up Docker daemon configs and generate CDI specs.
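For instance, the two nvidia-ctk invocations you'll run most often look roughly like this; the flags come from NVIDIA's docs, but check `nvidia-ctk --help` against your installed version:

```bash
# Register the "nvidia" runtime in /etc/docker/daemon.json, then restart Docker to pick it up.
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Generate a CDI spec describing this host's GPUs (used by Podman, CRI-O, and newer setups).
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
```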
The flow: Docker sees `--gpus all` → toolkit hook runs → mounts driver files → CUDA libraries appear → your container can finally see the GPU. It's automated device passthrough that doesn't suck.
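You can actually watch the injection happen: ask a container to list the device nodes and driver libraries the hook mounted in, none of which ship with the image (the tag is an example again):

```bash
# /dev/nvidia* and libcuda come from the host driver, injected at container start.
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 \
  sh -c 'ls /dev/nvidia* && ldconfig -p | grep libcuda'
```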
What Actually Works (And What Doesn't)
Docker Engine - This is where it all started and works best. If you're running Docker on Ubuntu or RHEL, you'll probably have a good time. The installation guide is actually decent.
containerd - Kubernetes uses this by default. Works fine once you get past the initial setup headaches. Configuration is more involved than Docker's, and you'll need to understand CRI plugins (one-line config sketch below).
Podman - Great for rootless containers, but the GPU support is still a bit janky. Expect to spend extra time troubleshooting cgroup issues; the CDI route in the sketch below is the documented path.
CRI-O - OpenShift's container runtime. Works but you'll be reading a lot of Red Hat docs.
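A rough setup sketch for containerd and Podman; the commands mirror NVIDIA's install docs, but paths and service names vary by distro:

```bash
# containerd (what Kubernetes talks to): register the nvidia runtime in /etc/containerd/config.toml.
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd

# Podman: skip the runtime wrapper entirely and use CDI.
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
podman run --rm --security-opt=label=disable --device nvidia.com/gpu=all ubuntu nvidia-smi
```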
For orchestration:
- Kubernetes: By far the most popular way to run GPU containers at scale. Use the NVIDIA GPU Operator; it handles most of the complexity, but debugging GPU scheduling issues will make you question your life choices. (Install sketch below.)
- Docker Swarm: Technically supported, but GPU scheduling is primitive.
- Everything else: You're on your own; check the supported platforms list.
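For the Kubernetes route, the usual sequence is a Helm install of the GPU Operator, then pods that request the nvidia.com/gpu resource. The chart name and repo URL are from NVIDIA's docs; the CUDA image tag is a placeholder:

```bash
# Install the GPU Operator (deploys the toolkit, device plugin, and optionally the driver itself).
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator

# Schedule a pod onto a GPU node by asking for one GPU in its resource limits.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```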
What You'll Actually Use This For
Machine Learning: Training PyTorch models or running TensorFlow inference without your containers falling back to CPU. Companies like Uber use this for their ML pipelines because it actually works at scale. Popular frameworks include RAPIDS, Hugging Face, and JAX.
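A quick sanity check that a framework container actually sees the GPU; pytorch/pytorch:latest is Docker Hub's stock CUDA-enabled PyTorch image, so swap in whatever tag you actually deploy:

```bash
# Prints True and the device name when CUDA is reachable from inside the container
# (and errors loudly when it isn't).
docker run --rm --gpus all pytorch/pytorch:latest \
  python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"
```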
CUDA Applications: Any scientific computing or data processing that needs serious GPU power. Molecular dynamics, weather simulations, crypto mining (yes, people containerize mining). Check out NVIDIA HPC containers for pre-built images.
Graphics Workloads: OpenGL/Vulkan apps in containers. Useful for remote rendering or running CAD software in the cloud. You'll need X11 forwarding or VNC setups.
Check out NVIDIA's official CUDA container images for pre-built containers.
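For the graphics case, you also have to opt in to the driver's graphics libraries via NVIDIA_DRIVER_CAPABILITIES and hand the container a display. A hypothetical sketch (my-opengl-app is a placeholder image, and you may need to loosen X access control with xhost on the host):

```bash
# Forward the host's X11 socket and enable the graphics + utility driver capabilities.
# glxinfo -B just confirms OpenGL is rendering on the NVIDIA GPU.
docker run --rm --gpus all \
  -e NVIDIA_DRIVER_CAPABILITIES=graphics,utility \
  -e DISPLAY=$DISPLAY \
  -v /tmp/.X11-unix:/tmp/.X11-unix \
  my-opengl-app glxinfo -B
```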
Jetson Edge Devices: GPU containers on NVIDIA Jetson hardware. Works but the ARM ecosystem can be painful. Check the Jetson containers repo for pre-built images.
The real benefit is that once you get this working, your containers behave the same way whether they're running on your dev laptop, a beefy DGX server, or in AWS EC2 P4 instances. No more "works on my machine but not in production" GPU disasters.
Just remember: the toolkit handles mounting drivers and CUDA libraries automatically, but you still need to actually install the NVIDIA drivers on your host. The containers don't magically create GPUs out of thin air.
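When something doesn't work, check the host first:

```bash
# On the host, not in a container: if this fails, the driver install is the problem, not the toolkit.
nvidia-smi

# Confirms the toolkit is installed and shows which version you're running.
nvidia-ctk --version
```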