Kubernetes AI GPU Failure Debug Guide
Critical Context Overview
Primary Issue: Kubernetes was designed for web applications requiring 2 CPU cores and 4GB RAM. AI workloads requiring 8 A100s and 140GB memory create fundamental incompatibilities causing systematic failures.
Failure Severity: Production incidents with 20+ minute pending pods, complete training job failures, and resource waste of 80-87% due to poor GPU scheduling.
Time Investment: Individual debugging sessions range from 2-4 hours for basic issues to multiple days for distributed training setup.
GPU Scheduling Failures
Critical Failure Modes
GPU Scattering Problem
- Issue: The default scheduler spreads a job's 8 A100s across 4 different nodes instead of packing them together
- Consequence: Distributed training fails with NCCL topology errors, making multi-GPU training impossible
- Root Cause: The default scheduler is unaware of GPU interconnect requirements for collective operations
- Solution Difficulty: Moderate - requires Volcano scheduler implementation
Configuration That Works:
```yaml
apiVersion: batch.volcano.sh/v1alpha1   # Volcano Jobs live in batch.volcano.sh, not scheduling.volcano.sh
kind: Job
spec:
  schedulerName: volcano
  minAvailable: 4      # Gang scheduling - all pods or none
  plugins:
    env: []
    svc: []
```
CUDA Version Incompatibility
- Critical Rule: The container's CUDA toolkit version must be ≤ the CUDA version supported by the node's driver
- Failure Example: CUDA 12.1 containers on nodes whose driver only supports CUDA 11.8 fail with "no CUDA-capable device found"
- Debug Time: 4+ hours due to misleading error messages
- Diagnosis Command:
```bash
kubectl run cuda-test --image=nvidia/cuda:12.1.0-runtime-ubuntu20.04 --rm -it --restart=Never --limits=nvidia.com/gpu=1 -- nvidia-smi
```
(Recent kubectl releases dropped the --limits flag from kubectl run; if yours has, request the GPU via --overrides or a one-off pod manifest instead.)
GPU Resource Hogging
- Issue: Single inference pod consumes entire A100 at 5% utilization
- Impact: Blocks critical training jobs requiring multiple GPUs
- Workaround: GPU time-slicing to split physical GPUs into virtual ones (see the config sketch after this list)
- Performance Impact: 4x resource efficiency improvement possible
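A minimal sketch of applying that workaround, assuming the NVIDIA device plugin's time-slicing config format and the official Kubernetes Python client; the namespace, ConfigMap name, data key, and replica count are illustrative, and how the GPU Operator consumes the ConfigMap (ClusterPolicy devicePlugin.config) depends on your operator version.

```python
# Hedged sketch: publish a time-slicing config so one physical GPU is advertised
# as several schedulable nvidia.com/gpu resources. Names and replicas=4 are assumptions.
from kubernetes import client, config

TIME_SLICING_CONFIG = """\
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4            # one A100 shows up as 4 allocatable GPUs
"""

config.load_kube_config()    # use config.load_incluster_config() when running inside a pod
client.CoreV1Api().create_namespaced_config_map(
    namespace="gpu-operator",
    body=client.V1ConfigMap(
        metadata=client.V1ObjectMeta(name="time-slicing-config"),
        data={"any": TIME_SLICING_CONFIG},   # key the device plugin / ClusterPolicy is pointed at
    ),
)
```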
GPU Operator Failure Modes
Components That Fail Independently:
- Driver daemonset
- Device plugin
- Container runtime
- MIG manager
- DCGM exporter
Debugging Sequence:
```bash
# 1. Hardware detection
kubectl debug node/gpu-node-1 -it --image=busybox
lspci | grep -i nvidia
# 2. Component health
kubectl get pods -n gpu-operator -o wide
# 3. Driver logs
kubectl logs -n gpu-operator daemonset/nvidia-driver-daemonset --tail=50
```
Nuclear Option: kubectl delete pods -n gpu-operator -l app=nvidia-device-plugin-daemonset
Memory Management Failures
GPU Memory vs System Memory Confusion
Critical Distinction: nvidia-smi reports physical GPU memory usage, while CUDA allocators reserve memory in chunks that fragment over time. Available ≠ allocatable.
Memory Failure Triggers:
- FP16 in development → FP32 in production (2x memory increase)
- Context length: 2K → 8K tokens (4x memory usage)
- Dynamic batching spikes to maximum
- Memory fragmentation from crashed models
- Multiple models loaded simultaneously
Resource Allocation Formula: Set memory limits to 2x GPU memory (GPU = 16GB → memory limit = 32Gi)
DeepSpeed ZeRO Implementation
Memory Reduction Results: 160GB → 40GB per GPU for 70B parameter models
ZeRO Stages:
- ZeRO-1: Shards optimizer states
- ZeRO-2: Adds gradient sharding
- ZeRO-3: Shards model parameters
```python
ds_config = {
    "fp16": {"enabled": True},                    # halve parameter/activation memory
    "zero_optimization": {
        "stage": 3,                               # shard optimizer state, gradients, and parameters
        "offload_optimizer": {"device": "cpu"},   # push optimizer state to host RAM
    },
    "activation_checkpointing": {"partition_activations": True},
}
```
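A minimal sketch of handing that config to DeepSpeed, assuming the standard deepspeed.initialize API; the toy model is a placeholder, and a real job would also configure an optimizer (in ds_config or passed to initialize) and run under the deepspeed or torchrun launcher.

```python
import torch
import deepspeed

# Toy stand-in for the real network; ds_config is the dict shown above.
model = torch.nn.Sequential(torch.nn.Linear(4096, 4096), torch.nn.ReLU())

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
# Training then runs through the engine: loss = model_engine(batch),
# model_engine.backward(loss), model_engine.step() instead of the usual
# loss.backward() / optimizer.step(), so ZeRO can manage the sharded state.
```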
Dynamic Resource Allocation Problems
Traditional Kubernetes Waste: static requests sized for worst-case peaks leave roughly 87% of reserved resources idle when average utilization is only 13%
Configuration Example (requests/limits burst pattern: request the guaranteed minimum, allow bursting to the limit):
```yaml
resources:
  requests:
    memory: "8Gi"    # Minimum guarantee used for scheduling
  limits:
    memory: "40Gi"   # Maximum burst capacity
```
Batch Size Optimization
Performance Impact: GPU utilization stuck at 15% with batch size = 1
Optimal Configuration: Batch sizes in multiples of 8 for Tensor Core efficiency
Solution: vLLM with continuous batching provides 3x throughput improvement
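A minimal sketch of the vLLM approach using its offline LLM API; the model name, memory fraction, and sampling settings are illustrative assumptions, and production deployments would typically run vLLM's OpenAI-compatible server instead.

```python
from vllm import LLM, SamplingParams

# vLLM's continuous batching packs concurrent requests onto the GPU instead of
# serving them one at a time, which is where the throughput gain comes from.
llm = LLM(model="facebook/opt-1.3b", gpu_memory_utilization=0.90)
params = SamplingParams(max_tokens=128, temperature=0.7)

prompts = [
    "Explain gang scheduling in one sentence.",
    "What does NCCL do in distributed training?",
]
outputs = llm.generate(prompts, params)   # batched internally by the engine
for out in outputs:
    print(out.outputs[0].text.strip())
```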
Framework Integration Failures
PyTorch Distributed Training
Fundamental Problem: PyTorch expects direct process-to-process communication; Kubernetes inserts network abstraction layers (Services, overlay networks) that break those assumptions and cause systematic failures
NCCL Port Access Issues: The ports NCCL needs for rendezvous and collectives are blocked or unreachable through pod networking
Service Discovery Mismatch: PyTorch's rendezvous expects stable, resolvable peer hostnames, which Kubernetes DNS only provides for StatefulSet pods behind a headless Service
Working Configuration:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: pytorch-distributed-training
spec:
  serviceName: "training-service"
  template:
    spec:
      containers:
      - env:
        - name: MASTER_ADDR
          value: "pytorch-distributed-training-0.training-service.default.svc.cluster.local"
        - name: NCCL_SOCKET_IFNAME
          value: "eth0"
```
TensorFlow Serving Failures
Common Failure Modes:
- Model signature mismatches between training and serving
- SavedModel format incompatibilities
- GPU memory configuration conflicts
- Missing CUDA graph support in containers
Model Precision Issue: Models trained with mixed precision but TF Serving defaults to FP32
Solution Time: Multiple hours rebuilding SavedModel with explicit precision
```python
@tf.function
def serve_function(inputs):
    predictions = model(inputs)
    return {
        'predictions': tf.cast(predictions, tf.float32),  # Always return FP32
        'scores': tf.nn.softmax(predictions, axis=-1)
    }
```
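A hedged sketch of exporting that function as the serving signature so TF Serving sees explicit dtypes; the input shape/dtype and export path are illustrative assumptions.

```python
import tensorflow as tf

# Bind an explicit input signature so the exported SavedModel is unambiguous
# about shapes and precision (shape [None, 512] float32 is an assumption here).
concrete_fn = serve_function.get_concrete_function(
    tf.TensorSpec(shape=[None, 512], dtype=tf.float32, name="inputs")
)

tf.saved_model.save(
    model,
    "/models/my-model/1",                        # TF Serving expects a numeric version directory
    signatures={"serving_default": concrete_fn},
)
```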
Ray Cluster Resource Conflicts
Coordination Problem: Ray and Kubernetes resource allocation must be precisely aligned
Breaking Point: Ray workers request GPUs that Kubernetes hasn't allocated to pods
Alignment Requirement:
```yaml
rayStartParams:
  num-gpus: "1"   # Must exactly match nvidia.com/gpu: 1
```
Hugging Face Model Loading
Production vs Development Failures:
- Internet access restrictions in locked-down pods
- HF Hub authentication token storage errors
- Model cache filesystem conflicts
- Storage performance bottlenecks (60% GPU idle time)
Reliable Deployment Pattern:
```yaml
initContainers:
- name: model-preloader
  env:
  - name: HF_HUB_OFFLINE
    value: "0"
containers:
- name: model-server
  env:
  - name: HF_HUB_OFFLINE
    value: "1"   # Use cached models only
```
Storage Performance Bottlenecks
Critical Performance Impacts
Model Loading Times: 70B models = 140GB files requiring complete loading before inference
GPU Idle Time: 60% idle due to storage bottleneck rather than compute limitations
Checkpoint Overhead: 10GB+ checkpoints every few minutes during training
Solution Results: NVMe SSD with high IOPS + ReadWriteMany volumes + init container preloading eliminated 60% idle time
Storage Configuration Requirements
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: ssd-high-iops
  resources:
    requests:
      storage: 500Gi
```
Emergency Troubleshooting Procedures
Pod Stuck Pending - "Insufficient nvidia.com/gpu"
Root Cause: Zombie pods claiming GPUs without proper cleanup
Diagnostic: kubectl get pods --all-namespaces | grep -E "(Error|Failed|Unknown)"
Nuclear Option: kubectl delete pods -n gpu-operator -l app=nvidia-device-plugin-daemonset
Recovery Time: 30 seconds for device plugin restart
CUDA Out of Memory in Production
Memory Diagnostic Sequence:
```bash
kubectl exec -it model-pod -- nvidia-smi --query-gpu=memory.used,memory.total --format=csv
kubectl exec -it model-pod -- fuser -v /dev/nvidia*
```
Causes in roughly 90% of incidents:
- Production batch sizes exceed test configuration
- Previous model remains loaded from crash
- FP32 production vs FP16 development mismatch
Recovery (escalation order): torch.cuda.empty_cache() → pod restart → node reboot
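A minimal sketch of the in-process check behind that escalation: compare what the device reports with what PyTorch's allocator is holding before deciding whether empty_cache() is enough or the pod needs a restart.

```python
import torch

free_b, total_b = torch.cuda.mem_get_info()   # device-level view, roughly what nvidia-smi shows
reserved = torch.cuda.memory_reserved()       # held by PyTorch's caching allocator
allocated = torch.cuda.memory_allocated()     # actually used by live tensors

print(f"device free: {free_b / 1e9:.1f} / {total_b / 1e9:.1f} GB")
print(f"allocator reserved: {reserved / 1e9:.1f} GB, allocated: {allocated / 1e9:.1f} GB")

# Releases cached-but-unused blocks back to the driver; it cannot reclaim memory
# still held by live tensors or by another (possibly crashed) process on the GPU.
torch.cuda.empty_cache()
```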
PyTorch Distributed Training Hangs
Network Diagnosis Priority:
kubectl exec -it trainer-0 -- nc -zv trainer-1.training-service.default.svc.cluster.local 29500
80% Success Rate Fix: export NCCL_SOCKET_IFNAME=eth0
Container Exit Code 137 (OOMKilled)
Confusion Factor: System memory (RAM) vs GPU memory are separate resources
Resource Rule: Memory limits = 2x GPU memory for model loading + inference buffers
Memory Leak Check: Monitor preprocessing code for unbounded growth
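A hedged sketch of that leak check: sample the process RSS around the preprocessing loop and flag unbounded growth. psutil, the placeholder preprocess step, and the 2 GB threshold are assumptions, not part of the original guide.

```python
import psutil

proc = psutil.Process()

def rss_gb() -> float:
    return proc.memory_info().rss / 1e9

def preprocess(batch):
    # Placeholder for real tokenization / feature preparation.
    return [text.lower() for text in batch]

batches = (["example request text"] * 32 for _ in range(10_000))  # stand-in for the data source

baseline = rss_gb()
for i, batch in enumerate(batches):
    preprocess(batch)
    if i % 1000 == 0 and rss_gb() - baseline > 2.0:
        print(f"RSS grew {rss_gb() - baseline:.1f} GB since start -- possible preprocessing leak")
```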
Critical Resource Allocation Guidelines
CPU-GPU Balance Requirements
AI Workload CPU Needs:
- Data preprocessing and tokenization
- Model weight loading and caching
- Post-processing and response formatting
- Monitoring and logging overhead
Allocation Rule: 4-8 CPU cores per GPU (more GPUs = proportionally more CPU)
Balanced Configuration:
```yaml
resources:
  limits:
    nvidia.com/gpu: 1
    memory: "48Gi"              # 1.2x GPU memory
    cpu: "8000m"                # 8 cores per GPU
    ephemeral-storage: "100Gi"
```
Performance Optimization Patterns
Batch Processing Optimization
GPU Utilization Problem: Batch size 1 wastes 90% of parallel compute capacity
Memory Overhead: Fixed per batch, not per item
Tensor Core Requirement: Batch sizes in multiples of 8
Gradient Accumulation Alternative:
```python
accumulation_steps = 4
optimizer.zero_grad()
for step, batch in enumerate(dataloader, start=1):   # batch_size=8 per micro-batch
    loss = model(batch)                              # assumes the model returns the loss
    (loss / accumulation_steps).backward()           # scale so accumulated gradients average out
    if step % accumulation_steps == 0:               # effective batch size = 8 * 4 = 32
        optimizer.step()
        optimizer.zero_grad()
```
Network Performance Requirements
NCCL Timeout Configuration:
```bash
export NCCL_TIMEOUT=3600       # 1 hour timeout
export NCCL_IB_RETRY_CNT=10    # More retries
export NCCL_DEBUG=INFO         # Verbose logging
```
Decision Matrix for Common Problems
Problem Category | Start Diagnostic | Escalation Path | Success Rate |
---|---|---|---|
Pod Scheduling | kubectl describe pod | Device plugin restart → Node reboot | 95% |
Memory Issues | nvidia-smi in pod | Cache clear → Pod restart → Node reboot | 90% |
Model Loading | kubectl logs | Auth check → Storage optimization | 85% |
Distributed Training | Framework debug + logs | NCCL tuning → Network team | 70% |
Performance | nvidia-smi dmon + profiling | Resource rebalancing | 80% |
Critical Warnings
Breaking Points:
- GPU memory fragmentation cannot be resolved without restart
- NCCL failures often require network infrastructure changes
- Multi-tenant GPU sharing without proper isolation causes cascading failures
- Default Kubernetes scheduling is fundamentally incompatible with AI workload requirements
Time-to-Resolution Expectations:
- Simple configuration issues: 30 minutes - 2 hours
- Framework integration problems: 4 hours - 2 days
- Network/infrastructure issues: 1-3 days
- Storage performance optimization: 2-5 days
Investment Requirements:
- Technical expertise: Senior-level Kubernetes + AI framework knowledge required
- Infrastructure: High-performance storage, proper GPU interconnects, network bandwidth
- Time: 20-40% of initial deployment time should be allocated for debugging and optimization
Useful Links for Further Investigation
Essential Resources for GPU Debugging
Link | Description |
---|---|
NVIDIA GPU Operator Troubleshooting | Start here when the GPU operator breaks. Actually useful documentation. |
NVIDIA Container Runtime Guide | For when containers can't access GPUs. |
PyTorch Distributed Troubleshooting | NCCL errors and networking failures |
Ray on Kubernetes Guide | Resource conflicts between Ray and K8s |
NVIDIA DCGM | GPU monitoring for production. Set this up before you have problems. |
Volcano Scheduler | Gang scheduling for distributed training. Actually works. |
NVIDIA Developer Forums | NVIDIA engineers actually respond here |
k9s | Terminal UI for Kubernetes that doesn't suck |