Kubernetes was designed for web apps that need 2 CPU cores and 4GB RAM. Your Llama-70B model wants 8 A100s and 140GB of memory. These assumptions are fundamentally incompatible.
When GPU scheduling fails, you get error messages like "0/5 nodes are available: 5 Insufficient nvidia.com/gpu." Thanks for nothing, scheduler.
The Stupid Problems That Will Ruin Your Day
Your GPUs Are Scattered Like Confetti
Here's what pisses me off most: you've got 8 A100s across 4 nodes. Your training job wants 4 GPUs that can actually talk to each other. The default scheduler sees 8 available GPUs and thinks "perfect!" then scatters them across different fucking racks.
Of course the default scheduler doesn't understand that distributed training needs GPUs on the same interconnect. So NCCL fails with some cryptic bullshit about topology.
First, figure out where your GPUs actually live:
kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu"
If you see GPUs spread across different nodes and your job needs them together, you're screwed unless you fix the scheduling.
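While you're at it, check how the GPUs inside a single node are wired together, because not every 8-GPU node is equal. A quick sketch, assuming you can exec into the driver daemonset pod (any GPU-enabled container with nvidia-smi works too):
## Interconnect matrix for one node: NV# means NVLink, PHB/SYS means traffic
## crosses the PCIe host bridge or CPU socket -- much slower for NCCL
kubectl exec -n gpu-operator nvidia-driver-daemonset-xxxxx -- nvidia-smi topo -m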
Took me two days of banging my head against this before I realized the default scheduler is completely clueless about GPU topology. Finally fixed it with Volcano scheduler and gang scheduling - either all training pods get proper GPU placement or none do.
Gang scheduling means your 8-pod training job either gets all the GPUs it needs or waits. No partial deployments that burn money while half the workers can't talk to each other.
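Under the hood the primitive is a Volcano PodGroup; the full Job example later in this section creates one for you automatically, but if you're wiring up raw pods, a minimal sketch (names are placeholders) looks like this:
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: llm-training-pg          # hypothetical name
spec:
  minMember: 8                   # all 8 member pods start together, or none do
  minResources:
    nvidia.com/gpu: "8"          # don't even try until 8 GPUs are schedulable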
CUDA Version Hell
Then there's CUDA version hell. Pod loads your 7B model and dies with "CUDA out of memory" while nvidia-smi shows only 3GB in use. Or my favorite: "no CUDA-capable device found" when nvidia-smi clearly shows 8 GPUs sitting there.
This usually means one of these delightful scenarios:
- Node has CUDA 11.8, your container wants 12.1
- NVIDIA container runtime is completely fucked
- Some zombie process from a crashed pod is hogging the GPU (see the check after this list)
- Device plugin lost track of what's actually available
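For the zombie-process case, ask the driver directly who is holding GPU memory. A sketch that goes through the driver daemonset pod, though any container with nvidia-smi and GPU access works:
## Every process currently holding GPU memory on that node
kubectl exec -n gpu-operator nvidia-driver-daemonset-xxxxx -- \
  nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv
## A PID whose pod no longer exists is your zombie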
The golden rule that nobody tells you: the CUDA version in your container must be ≤ the CUDA version your node driver supports. So your shiny CUDA 12.1 containers won't run on drivers that only speak 11.8. Because why would they make this obvious?
The dumb thing to check first:
kubectl run cuda-test --image=nvidia/cuda:12.1.0-runtime-ubuntu20.04 --rm -it --restart=Never \
  --overrides='{"spec":{"containers":[{"name":"cuda-test","image":"nvidia/cuda:12.1.0-runtime-ubuntu20.04","resources":{"limits":{"nvidia.com/gpu":"1"}},"command":["nvidia-smi"]}]}}'
If that fails, your NVIDIA container runtime is broken. If it works but your actual workload fails, it's a version mismatch.
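To confirm which versions are actually in play, compare what the node driver supports with what the container ships. A rough sketch; the version file path varies by CUDA base image, and your-workload-pod is a placeholder:
## Highest CUDA version the node driver supports (top-right corner of nvidia-smi output)
kubectl exec -n gpu-operator nvidia-driver-daemonset-xxxxx -- nvidia-smi | grep "CUDA Version"
## CUDA toolkit baked into your workload image (older images use version.txt instead)
kubectl exec -it your-workload-pod -- cat /usr/local/cuda/version.json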
I spent 4 hours debugging this shit once. Base image was using CUDA 12.1, nodes had 11.7 drivers. The error message? Completely useless: "RuntimeError: CUDA initialization failed." Zero mention of version conflicts. Because helpful error messages are apparently too much to ask for.
Nuclear option:
docker system prune -a && kubectl delete pods --all
Then rebuild everything (on containerd nodes, swap the prune for crictl rmi --prune). Sometimes the CUDA state gets corrupted badly enough that nothing short of nuking everything works.
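Before you go that far, a gentler reset is worth one try: bounce the device plugin so the GPUs get re-advertised, and delete only the wedged pod. A sketch assuming GPU Operator defaults; your daemonset name may differ:
## Force the device plugin to re-enumerate GPUs on every node
kubectl rollout restart -n gpu-operator daemonset/nvidia-device-plugin-daemonset
## Then kill just the stuck pod instead of everything
kubectl delete pod <stuck-pod> --grace-period=0 --force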
GPU Resource Hogging
And don't get me started on GPU hogging. Your critical training job can't start because some inference pod is sitting on an entire A100 at 5% utilization. Thanks, Kubernetes, for treating GPUs as all-or-nothing resources.
You can't request 0.5 GPUs like you can request 500m CPU. Every pod gets a full GPU even if it's running a tiny model that could easily share. Brilliant design choice.
GPU time-slicing can help here - it splits each physical GPU into virtual ones, giving each a time slice of the real hardware. But setting it up is another adventure in YAML hell.
## Time-slicing ConfigMap that actually works
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  a100: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4    # Split each A100 into 4 virtual GPUs
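The ConfigMap by itself does nothing; you have to point the operator's device plugin at it. A sketch, assuming the default ClusterPolicy name (cluster-policy) from a standard GPU Operator install:
## Tell the GPU Operator to use the time-slicing config, defaulting to the "a100" profile
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  --type merge -p '{"spec":{"devicePlugin":{"config":{"name":"time-slicing-config","default":"a100"}}}}'
## Once the device plugin restarts, an 8-GPU node advertises nvidia.com/gpu: 32
kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu"
Keep in mind time-slicing gives you zero memory isolation: one greedy pod can still OOM every other slice on that GPU.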
GPU Operator Issues: When the Foundation is Broken
The NVIDIA GPU Operator automates GPU management, but it's also a single point of failure. When it breaks, new GPU pods can't schedule and existing ones can go down with it.
The GPU Operator has five components that can fail independently: the driver daemonset, the device plugin, the container toolkit, the MIG manager, and the DCGM exporter. Each failure has different symptoms.
Common operator failures:
## GPU operator pod stuck in ImagePullBackOff
kubectl get pods -n gpu-operator
## NAME                           READY   STATUS             RESTARTS   AGE
## nvidia-cuda-validator-12345    0/1     ImagePullBackOff   0          10m
## Device plugin not creating GPU resources
kubectl describe nodes gpu-node-1 | grep nvidia.com/gpu
## Shows: nvidia.com/gpu: 0 (should show 8)
## Driver installation failing silently
kubectl logs -n gpu-operator nvidia-driver-daemonset-xxxxx
## Error: failed to install NVIDIA driver: kernel version mismatch
The debugging sequence that actually works:
## 1. Check if nodes have GPUs detected at hardware level
kubectl debug node/gpu-node-1 -it --image=ubuntu
## In the debug container (busybox won't do -- it has no lspci):
## apt-get update && apt-get install -y pciutils && lspci | grep -i nvidia
## 2. Verify GPU operator components health
kubectl get pods -n gpu-operator -o wide
## 3. Check driver installation logs
kubectl logs -n gpu-operator daemonset/nvidia-driver-daemonset --tail=50
## 4. Test GPU access from a basic container
kubectl run gpu-test --image=nvidia/cuda:12.1.0-base-ubuntu20.04 --rm -it \
  --overrides='{"spec":{"containers":[{"name":"gpu-test","image":"nvidia/cuda:12.1.0-base-ubuntu20.04","resources":{"limits":{"nvidia.com/gpu":"1"}},"command":["nvidia-smi"]}]}}' \
  --restart=Never
Advanced GPU Scheduling: Beyond Basic Resource Requests
Modern AI workloads need more sophisticated scheduling than "give me 2 GPUs." They need topology awareness, memory locality, and workload-specific optimizations.
Example: Distributed Training with Gang Scheduling
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-llm-training
spec:
  schedulerName: volcano
  minAvailable: 4                  # All 4 pods must be scheduled together
  plugins:
    env: []
    svc: []
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 4
      name: trainer
      template:
        metadata:
          labels:
            job-name: distributed-llm-training   # label the anti-affinity below matches on
        spec:
          restartPolicy: OnFailure
          affinity:
            podAntiAffinity:       # Spread trainers across nodes, one per host
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchLabels:
                      job-name: distributed-llm-training
                  topologyKey: kubernetes.io/hostname
          containers:
            - name: trainer
              image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel
              resources:
                limits:
                  nvidia.com/gpu: 2    # 2 GPUs per pod, one pod per node
                  memory: "64Gi"       # Host RAM for data loading, checkpoints, and CPU-side state
              env:
                - name: NCCL_DEBUG
                  value: "INFO"
                - name: NCCL_TREE_THRESHOLD
                  value: "0"
What makes this work:
- Volcano scheduler with gang scheduling: all pods start together or none start
- Pod anti-affinity puts one trainer per node, so each pod's 2 GPUs talk over the local interconnect and only gradient sync crosses the network
- NCCL environment variables optimize GPU-to-GPU communication
- The host memory request leaves room for data loading, checkpoint staging, and CPU-side optimizer state on top of what lives in GPU memory
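Submitting and watching it is plain kubectl. A sketch, assuming Volcano's CRDs are installed (vcjob is the short name Volcano registers for its Job type, and the filename is whatever you saved the manifest as):
kubectl apply -f distributed-llm-training.yaml
## The PodGroup stays Pending until all 4 trainers can be placed at once
kubectl get vcjob distributed-llm-training
kubectl get podgroups
## If it's stuck, the PodGroup events usually name the resource that's short
kubectl describe podgroups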
Kueue can help with job queuing and resource sharing if you want another layer of complexity. Apache YuniKorn is an alternative to Volcano, though honestly Volcano just works better for most AI workloads in my experience.
The reality: AI workloads aren't web apps that happen to use GPUs. They have completely different resource patterns, scheduling needs, and failure modes. Standard Kubernetes scheduling breaks under the demands of model training, inference scaling, and multi-tenant GPU sharing. Next up: memory and performance issues that show up once you fix the scheduling nightmare.