Kubernetes AI GPU Failure Debug Guide
Critical Context Overview
Primary Issue: Kubernetes was designed for web applications requiring 2 CPU cores and 4GB RAM. AI workloads requiring 8 A100s and 140GB memory create fundamental incompatibilities causing systematic failures.
Failure Severity: Production incidents with 20+ minute pending pods, complete training job failures, and resource waste of 80-87% due to poor GPU scheduling.
Time Investment: Individual debugging sessions range from 2-4 hours for basic issues to multiple days for distributed training setup.
GPU Scheduling Failures
Critical Failure Modes
GPU Scattering Problem
- Issue: The default scheduler spreads a job's 8 A100s across 4 different nodes instead of packing them together
- Consequence: Distributed training fails with NCCL topology errors, making multi-GPU training impossible
- Root Cause: The default scheduler is unaware of GPU interconnect requirements for collective operations
- Solution Difficulty: Moderate - requires Volcano scheduler implementation
Configuration That Works:
```yaml
apiVersion: batch.volcano.sh/v1alpha1   # Volcano Jobs live in batch.volcano.sh, not scheduling.volcano.sh
kind: Job
spec:
  schedulerName: volcano
  minAvailable: 4      # Gang scheduling - all pods or none
  plugins:
    env: []
    svc: []
```
CUDA Version Incompatibility
- Critical Rule: The container's CUDA toolkit version must be ≤ the CUDA version supported by the node's driver
- Failure Example: CUDA 12.1 containers on nodes whose driver only supports CUDA 11.8 fail with "no CUDA-capable device found"
- Debug Time: 4+ hours due to misleading error messages
- Diagnosis Command:
```bash
kubectl run cuda-test --image=nvidia/cuda:12.1.0-runtime-ubuntu20.04 --rm -it --restart=Never --limits=nvidia.com/gpu=1 -- nvidia-smi
```
(Recent kubectl releases dropped the --limits flag from kubectl run; if yours has, request the GPU via --overrides or a one-off pod manifest instead.)
GPU Resource Hogging
- Issue: Single inference pod consumes entire A100 at 5% utilization
- Impact: Blocks critical training jobs requiring multiple GPUs
- Workaround: GPU time-slicing to split physical GPUs into virtual ones (see the config sketch after this list)
- Performance Impact: 4x resource efficiency improvement possible
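A minimal sketch of applying that workaround, assuming the NVIDIA device plugin's time-slicing config format and the official Kubernetes Python client; the namespace, ConfigMap name, data key, and replica count are illustrative, and how the GPU Operator consumes the ConfigMap (ClusterPolicy devicePlugin.config) depends on your operator version.

```python
# Hedged sketch: publish a time-slicing config so one physical GPU is advertised
# as several schedulable nvidia.com/gpu resources. Names and replicas=4 are assumptions.
from kubernetes import client, config

TIME_SLICING_CONFIG = """\
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4            # one A100 shows up as 4 allocatable GPUs
"""

config.load_kube_config()    # use config.load_incluster_config() when running inside a pod
client.CoreV1Api().create_namespaced_config_map(
    namespace="gpu-operator",
    body=client.V1ConfigMap(
        metadata=client.V1ObjectMeta(name="time-slicing-config"),
        data={"any": TIME_SLICING_CONFIG},   # key the device plugin / ClusterPolicy is pointed at
    ),
)
```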
GPU Operator Failure Modes
Components That Fail Independently:
- Driver daemonset
- Device plugin
- Container runtime
- MIG manager
- DCGM exporter
Debugging Sequence:
```bash
# 1. Hardware detection
kubectl debug node/gpu-node-1 -it --image=busybox
lspci | grep -i nvidia
# 2. Component health
kubectl get pods -n gpu-operator -o wide
# 3. Driver logs
kubectl logs -n gpu-operator daemonset/nvidia-driver-daemonset --tail=50
```
Nuclear Option: kubectl delete pods -n gpu-operator -l app=nvidia-device-plugin-daemonset
Memory Management Failures
GPU Memory vs System Memory Confusion
Critical Distinction: nvidia-smi reports physical GPU memory usage, while CUDA allocators reserve memory in chunks that fragment over time. Available ≠ allocatable.
Memory Failure Triggers:
- FP16 in development → FP32 in production (2x memory increase)
- Context length: 2K → 8K tokens (4x memory usage)
- Dynamic batching spikes to maximum
- Memory fragmentation from crashed models
- Multiple models loaded simultaneously
Resource Allocation Formula: Set memory limits to 2x GPU memory (GPU = 16GB → memory limit = 32Gi)
DeepSpeed ZeRO Implementation
Memory Reduction Results: 160GB → 40GB per GPU for 70B parameter models
ZeRO Stages:
- ZeRO-1: Shards optimizer states
- ZeRO-2: Adds gradient sharding
- ZeRO-3: Shards model parameters
```python
ds_config = {
    "fp16": {"enabled": True},                    # halve parameter/activation memory
    "zero_optimization": {
        "stage": 3,                               # shard optimizer state, gradients, and parameters
        "offload_optimizer": {"device": "cpu"},   # push optimizer state to host RAM
    },
    "activation_checkpointing": {"partition_activations": True},
}
```
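A minimal sketch of handing that config to DeepSpeed, assuming the standard deepspeed.initialize API; the toy model is a placeholder, and a real job would also configure an optimizer (in ds_config or passed to initialize) and run under the deepspeed or torchrun launcher.

```python
import torch
import deepspeed

# Toy stand-in for the real network; ds_config is the dict shown above.
model = torch.nn.Sequential(torch.nn.Linear(4096, 4096), torch.nn.ReLU())

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
# Training then runs through the engine: loss = model_engine(batch),
# model_engine.backward(loss), model_engine.step() instead of the usual
# loss.backward() / optimizer.step(), so ZeRO can manage the sharded state.
```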
Dynamic Resource Allocation Problems
Traditional Kubernetes Waste: static requests sized for worst-case peaks leave roughly 87% of reserved resources idle when average utilization is only 13%
Configuration Example (requests/limits burst pattern: request the guaranteed minimum, allow bursting to the limit):
```yaml
resources:
  requests:
    memory: "8Gi"    # Minimum guarantee used for scheduling
  limits:
    memory: "40Gi"   # Maximum burst capacity
```
Batch Size Optimization
Performance Impact: GPU utilization stuck at 15% with batch size = 1
Optimal Configuration: Batch sizes in multiples of 8 for Tensor Core efficiency
Solution: vLLM with continuous batching provides 3x throughput improvement
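A minimal sketch of the vLLM approach using its offline LLM API; the model name, memory fraction, and sampling settings are illustrative assumptions, and production deployments would typically run vLLM's OpenAI-compatible server instead.

```python
from vllm import LLM, SamplingParams

# vLLM's continuous batching packs concurrent requests onto the GPU instead of
# serving them one at a time, which is where the throughput gain comes from.
llm = LLM(model="facebook/opt-1.3b", gpu_memory_utilization=0.90)
params = SamplingParams(max_tokens=128, temperature=0.7)

prompts = [
    "Explain gang scheduling in one sentence.",
    "What does NCCL do in distributed training?",
]
outputs = llm.generate(prompts, params)   # batched internally by the engine
for out in outputs:
    print(out.outputs[0].text.strip())
```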
Framework Integration Failures
PyTorch Distributed Training
Fundamental Problem: PyTorch expects direct process-to-process communication; Kubernetes inserts network abstraction layers (Services, overlay networks) that break those assumptions and cause systematic failures
NCCL Port Access Issues: The ports NCCL needs for rendezvous and collectives are blocked or unreachable through pod networking
Service Discovery Mismatch: PyTorch's rendezvous expects stable, resolvable peer hostnames, which Kubernetes DNS only provides for StatefulSet pods behind a headless Service
Working Configuration:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: pytorch-distributed-training
spec:
  serviceName: "training-service"
  template:
    spec:
      containers:
      - env:
        - name: MASTER_ADDR
          value: "pytorch-distributed-training-0.training-service.default.svc.cluster.local"
        - name: NCCL_SOCKET_IFNAME
          value: "eth0"
```
TensorFlow Serving Failures
Common Failure Modes:
- Model signature mismatches between training and serving
- SavedModel format incompatibilities
- GPU memory configuration conflicts
- Missing CUDA graph support in containers
Model Precision Issue: Models trained with mixed precision but TF Serving defaults to FP32
Solution Time: Multiple hours rebuilding SavedModel with explicit precision
```python
@tf.function
def serve_function(inputs):
    predictions = model(inputs)
    return {
        'predictions': tf.cast(predictions, tf.float32),  # Always return FP32
        'scores': tf.nn.softmax(predictions, axis=-1)
    }
```
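A hedged sketch of exporting that function as the serving signature so TF Serving sees explicit dtypes; the input shape/dtype and export path are illustrative assumptions.

```python
import tensorflow as tf

# Bind an explicit input signature so the exported SavedModel is unambiguous
# about shapes and precision (shape [None, 512] float32 is an assumption here).
concrete_fn = serve_function.get_concrete_function(
    tf.TensorSpec(shape=[None, 512], dtype=tf.float32, name="inputs")
)

tf.saved_model.save(
    model,
    "/models/my-model/1",                        # TF Serving expects a numeric version directory
    signatures={"serving_default": concrete_fn},
)
```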
Ray Cluster Resource Conflicts
Coordination Problem: Ray and Kubernetes resource allocation must be precisely aligned
Breaking Point: Ray workers request GPUs that Kubernetes hasn't allocated to pods
Alignment Requirement:
```yaml
rayStartParams:
  num-gpus: "1"   # Must exactly match nvidia.com/gpu: 1
```
Hugging Face Model Loading
Production vs Development Failures:
- Internet access restrictions in locked-down pods
- HF Hub authentication token storage errors
- Model cache filesystem conflicts
- Storage performance bottlenecks (60% GPU idle time)
Reliable Deployment Pattern:
```yaml
initContainers:
- name: model-preloader
  env:
  - name: HF_HUB_OFFLINE
    value: "0"
containers:
- name: model-server
  env:
  - name: HF_HUB_OFFLINE
    value: "1"   # Use cached models only
```
Storage Performance Bottlenecks
Critical Performance Impacts
Model Loading Times: 70B models = 140GB files requiring complete loading before inference
GPU Idle Time: 60% idle due to storage bottleneck rather than compute limitations
Checkpoint Overhead: 10GB+ checkpoints every few minutes during training
Solution Results: NVMe SSD with high IOPS + ReadWriteMany volumes + init container preloading eliminated 60% idle time
Storage Configuration Requirements
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: ssd-high-iops
  resources:
    requests:
      storage: 500Gi
```
Emergency Troubleshooting Procedures
Pod Stuck Pending - "Insufficient nvidia.com/gpu"
Root Cause: Zombie pods claiming GPUs without proper cleanup
Diagnostic: kubectl get pods --all-namespaces | grep -E "(Error|Failed|Unknown)"
Nuclear Option: kubectl delete pods -n gpu-operator -l app=nvidia-device-plugin-daemonset
Recovery Time: 30 seconds for device plugin restart
CUDA Out of Memory in Production
Memory Diagnostic Sequence:
```bash
kubectl exec -it model-pod -- nvidia-smi --query-gpu=memory.used,memory.total --format=csv
kubectl exec -it model-pod -- fuser -v /dev/nvidia*
```
Causes in roughly 90% of incidents:
- Production batch sizes exceed test configuration
- Previous model remains loaded from crash
- FP32 production vs FP16 development mismatch
Recovery (escalation order): torch.cuda.empty_cache() → pod restart → node reboot
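A minimal sketch of the in-process check behind that escalation: compare what the device reports with what PyTorch's allocator is holding before deciding whether empty_cache() is enough or the pod needs a restart.

```python
import torch

free_b, total_b = torch.cuda.mem_get_info()   # device-level view, roughly what nvidia-smi shows
reserved = torch.cuda.memory_reserved()       # held by PyTorch's caching allocator
allocated = torch.cuda.memory_allocated()     # actually used by live tensors

print(f"device free: {free_b / 1e9:.1f} / {total_b / 1e9:.1f} GB")
print(f"allocator reserved: {reserved / 1e9:.1f} GB, allocated: {allocated / 1e9:.1f} GB")

# Releases cached-but-unused blocks back to the driver; it cannot reclaim memory
# still held by live tensors or by another (possibly crashed) process on the GPU.
torch.cuda.empty_cache()
```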
PyTorch Distributed Training Hangs
Network Diagnosis Priority:
kubectl exec -it trainer-0 -- nc -zv trainer-1.training-service.default.svc.cluster.local 29500
80% Success Rate Fix: export NCCL_SOCKET_IFNAME=eth0
Container Exit Code 137 (OOMKilled)
Confusion Factor: System memory (RAM) vs GPU memory are separate resources
Resource Rule: Memory limits = 2x GPU memory for model loading + inference buffers
Memory Leak Check: Monitor preprocessing code for unbounded growth
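A hedged sketch of that leak check: sample the process RSS around the preprocessing loop and flag unbounded growth. psutil, the placeholder preprocess step, and the 2 GB threshold are assumptions, not part of the original guide.

```python
import psutil

proc = psutil.Process()

def rss_gb() -> float:
    return proc.memory_info().rss / 1e9

def preprocess(batch):
    # Placeholder for real tokenization / feature preparation.
    return [text.lower() for text in batch]

batches = (["example request text"] * 32 for _ in range(10_000))  # stand-in for the data source

baseline = rss_gb()
for i, batch in enumerate(batches):
    preprocess(batch)
    if i % 1000 == 0 and rss_gb() - baseline > 2.0:
        print(f"RSS grew {rss_gb() - baseline:.1f} GB since start -- possible preprocessing leak")
```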
Critical Resource Allocation Guidelines
CPU-GPU Balance Requirements
AI Workload CPU Needs:
- Data preprocessing and tokenization
- Model weight loading and caching
- Post-processing and response formatting
- Monitoring and logging overhead
Allocation Rule: 4-8 CPU cores per GPU (more GPUs = proportionally more CPU)
Balanced Configuration:
```yaml
resources:
  limits:
    nvidia.com/gpu: 1
    memory: "48Gi"              # 1.2x GPU memory
    cpu: "8000m"                # 8 cores per GPU
    ephemeral-storage: "100Gi"
```
Performance Optimization Patterns
Batch Processing Optimization
GPU Utilization Problem: Batch size 1 wastes 90% of parallel compute capacity
Memory Overhead: Fixed per batch, not per item
Tensor Core Requirement: Batch sizes in multiples of 8
Gradient Accumulation Alternative:
```python
accumulation_steps = 4
optimizer.zero_grad()
for step, batch in enumerate(dataloader, start=1):   # batch_size=8 per micro-batch
    loss = model(batch)                              # assumes the model returns the loss
    (loss / accumulation_steps).backward()           # scale so accumulated gradients average out
    if step % accumulation_steps == 0:               # effective batch size = 8 * 4 = 32
        optimizer.step()
        optimizer.zero_grad()
```
Network Performance Requirements
NCCL Timeout Configuration:
```bash
export NCCL_TIMEOUT=3600       # 1 hour timeout
export NCCL_IB_RETRY_CNT=10    # More retries
export NCCL_DEBUG=INFO         # Verbose logging
```
Decision Matrix for Common Problems
Problem Category | Start Diagnostic | Escalation Path | Success Rate |
---|---|---|---|
Pod Scheduling | kubectl describe pod | Device plugin restart → Node reboot | 95% |
Memory Issues | nvidia-smi in pod | Cache clear → Pod restart → Node reboot | 90% |
Model Loading | kubectl logs | Auth check → Storage optimization | 85% |
Distributed Training | Framework debug + logs | NCCL tuning → Network team | 70% |
Performance | nvidia-smi dmon + profiling | Resource rebalancing | 80% |
Critical Warnings
Breaking Points:
- GPU memory fragmentation cannot be resolved without restart
- NCCL failures often require network infrastructure changes
- Multi-tenant GPU sharing without proper isolation causes cascading failures
- Default Kubernetes scheduling is fundamentally incompatible with AI workload requirements
Time-to-Resolution Expectations:
- Simple configuration issues: 30 minutes - 2 hours
- Framework integration problems: 4 hours - 2 days
- Network/infrastructure issues: 1-3 days
- Storage performance optimization: 2-5 days
Investment Requirements:
- Technical expertise: Senior-level Kubernetes + AI framework knowledge required
- Infrastructure: High-performance storage, proper GPU interconnects, network bandwidth
- Time: 20-40% of initial deployment time should be allocated for debugging and optimization
Useful Links for Further Investigation
Essential Resources for GPU Debugging
Link | Description |
---|---|
NVIDIA GPU Operator Troubleshooting | Start here when the GPU operator breaks. Actually useful documentation. |
NVIDIA Container Runtime Guide | For when containers can't access GPUs. |
PyTorch Distributed Troubleshooting | NCCL errors and networking failures |
Ray on Kubernetes Guide | Resource conflicts between Ray and K8s |
NVIDIA DCGM | GPU monitoring for production. Set this up before you have problems. |
Volcano Scheduler | Gang scheduling for distributed training. Actually works. |
NVIDIA Developer Forums | NVIDIA engineers actually respond here |
k9s | Terminal UI for Kubernetes that doesn't suck |