
Kubernetes AI GPU Failure Debug Guide

Critical Context Overview

Primary Issue: Kubernetes was designed around web applications that need 2 CPU cores and 4GB of RAM. AI workloads that need 8 A100s and 140GB of memory break those assumptions, and the mismatch produces the systematic failures covered in this guide.

Failure Severity: Production incidents with 20+ minute pending pods, complete training job failures, and resource waste of 80-87% due to poor GPU scheduling.

Time Investment: Individual debugging sessions range from 2-4 hours for basic issues to multiple days for distributed training setup.

GPU Scheduling Failures

Critical Failure Modes

GPU Scattering Problem

  • Issue: Default scheduler distributes 8 available A100s across 4 different nodes
  • Consequence: Distributed training fails with NCCL topology errors making multi-GPU training impossible
  • Root Cause: Default scheduler ignores GPU interconnect requirements for collective operations
  • Solution Difficulty: Moderate - requires Volcano scheduler implementation

Configuration That Works:

apiVersion: batch.volcano.sh/v1alpha1   # Volcano Jobs live under batch.volcano.sh, not scheduling.volcano.sh
kind: Job
spec:
  schedulerName: volcano
  minAvailable: 4  # Gang scheduling - all pods or none
  plugins:
    env: []        # injects task index/host env vars into each pod
    svc: []        # creates a headless service so workers can resolve each other
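
With minAvailable matching the worker replica count, Volcano only binds pods once all of them fit, which is what prevents half-scheduled jobs. A fuller sketch with the task section filled in (image and GPU count are illustrative assumptions, not values from this guide):

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-training
spec:
  schedulerName: volcano
  minAvailable: 4
  plugins:
    env: []
    svc: []
  tasks:
  - name: worker
    replicas: 4
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: trainer
          image: your-registry/trainer:latest   # illustrative image
          resources:
            limits:
              nvidia.com/gpu: 2                 # GPUs per worker pod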

CUDA Version Incompatibility

  • Critical Rule: Container CUDA version ≤ node driver version
  • Failure Example: CUDA 12.1 containers fail on 11.8 drivers with "no CUDA-capable device found"
  • Debug Time: 4+ hours due to misleading error messages
  • Diagnosis Command: kubectl run cuda-test --image=nvidia/cuda:12.1.0-runtime-ubuntu20.04 --rm -it --restart=Never --limits=nvidia.com/gpu=1 -- nvidia-smi
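
Newer kubectl releases dropped the --limits flag from kubectl run; a hedged workaround is to pass the GPU request through --overrides, and to read the driver version from the GPU operator's driver daemonset (image tag and names are illustrative):

kubectl run cuda-test --rm -it --restart=Never \
  --image=nvidia/cuda:12.1.0-runtime-ubuntu20.04 \
  --overrides='{"apiVersion":"v1","spec":{"containers":[{"name":"cuda-test","image":"nvidia/cuda:12.1.0-runtime-ubuntu20.04","command":["nvidia-smi"],"resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}'

# Driver version actually running on the node
kubectl exec -n gpu-operator -it daemonset/nvidia-driver-daemonset -- nvidia-smi --query-gpu=driver_version --format=csv,noheader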

GPU Resource Hogging

  • Issue: Single inference pod consumes entire A100 at 5% utilization
  • Impact: Blocks critical training jobs requiring multiple GPUs
  • Workaround: GPU time-slicing to split physical GPUs into virtual ones (config sketch below)
  • Performance Impact: 4x resource efficiency improvement possible
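
A minimal time-slicing config for the NVIDIA device plugin looks roughly like this; the ConfigMap name and replica count are illustrative, and with the GPU Operator it is wired in through the ClusterPolicy's devicePlugin.config field:

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4    # one physical GPU is advertised as 4 schedulable GPUs

Pods still request nvidia.com/gpu: 1, but four of them can now share one card. Time-slicing provides no memory isolation between them, so it fits low-utilization inference, not training.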

GPU Operator Failure Modes

Components That Fail Independently:

  • Driver daemonset
  • Device plugin
  • Container runtime
  • MIG manager
  • DCGM exporter

Debugging Sequence:

# 1. Hardware detection (use an image with full pciutils; busybox's lspci is minimal)
kubectl debug node/gpu-node-1 -it --image=ubuntu
apt-get update && apt-get install -y pciutils    # or: chroot /host lspci | grep -i nvidia
lspci | grep -i nvidia

# 2. Component health
kubectl get pods -n gpu-operator -o wide

# 3. Driver logs
kubectl logs -n gpu-operator daemonset/nvidia-driver-daemonset --tail=50
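
A fourth check worth adding (node name is illustrative): confirm the device plugin has registered GPUs as an allocatable resource, because the scheduler only sees what shows up here, not what lspci reports:

# 4. Scheduler-visible GPU capacity
kubectl describe node gpu-node-1 | grep -i -A 2 "nvidia.com/gpu"
kubectl get node gpu-node-1 -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'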

Nuclear Option: kubectl delete pods -n gpu-operator -l app=nvidia-device-plugin-daemonset

Memory Management Failures

GPU Memory vs System Memory Confusion

Critical Distinction: nvidia-smi reports physical memory usage, but CUDA allocates memory in chunks that fragment over time, so memory that looks available is not necessarily allocatable.

Memory Failure Triggers:

  • FP16 in development → FP32 in production (2x memory increase)
  • Context length: 2K → 8K tokens (4x memory usage)
  • Dynamic batching spikes to maximum
  • Memory fragmentation from crashed models
  • Multiple models loaded simultaneously

Resource Allocation Formula: Set memory limits to 2x GPU memory (GPU = 16GB → memory limit = 32Gi)
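
Applied to a 16GB GPU, the rule looks like this in the container spec:

resources:
  limits:
    nvidia.com/gpu: 1
    memory: "32Gi"    # 2x the card's 16GB, per the rule above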

DeepSpeed ZeRO Implementation

Memory Reduction Results: 160GB → 40GB per GPU for 70B parameter models

ZeRO Stages:

  • ZeRO-1: Shards optimizer states
  • ZeRO-2: Adds gradient sharding
  • ZeRO-3: Shards model parameters

DeepSpeed config sketch (key names follow the DeepSpeed JSON config schema):

ds_config = {
    "fp16": {"enabled": True},                    # halve parameter/activation memory
    "zero_optimization": {
        "stage": 3,                               # ZeRO-3: shard parameters, gradients, optimizer states
        "offload_optimizer": {"device": "cpu"},   # push optimizer states to host RAM
    },
    "activation_checkpointing": {"partition_activations": True},
}
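
This config only takes effect once it is handed to DeepSpeed; a minimal sketch, assuming the model object is already constructed and a recent DeepSpeed release that accepts a dict for config:

import deepspeed

# Wraps the model in a ZeRO engine that shards state according to ds_config
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)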

Dynamic Resource Allocation Problems

Traditional Kubernetes Waste: provisioning for worst-case peaks wastes 87% of allocated resources when average utilization sits around 13%

Configuration Example (requests as the guaranteed floor, limits as the burst ceiling):

resources:
  requests:
    memory: "8Gi"     # Minimum guarantee
  limits:
    memory: "40Gi"    # Maximum burst capacity

Batch Size Optimization

Performance Impact: GPU utilization stuck at 15% with batch size = 1
Optimal Configuration: Batch sizes in multiples of 8 for Tensor Core efficiency
Solution: vLLM with continuous batching provides 3x throughput improvement
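
A minimal vLLM sketch (model name and sampling settings are illustrative); continuous batching happens inside the engine, so callers only submit prompts:

from vllm import LLM, SamplingParams

# The engine merges concurrent requests into GPU batches on the fly
llm = LLM(model="gpt2", gpu_memory_utilization=0.90)
params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["Summarize Kubernetes GPU scheduling in one sentence."], params)
print(outputs[0].outputs[0].text)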

Framework Integration Failures

PyTorch Distributed Training

Fundamental Problem: PyTorch expects processes to reach each other directly; Kubernetes inserts network abstraction (Services, overlay networks) in between, which breaks that assumption and causes systematic failures

NCCL Port Access Issues: Specific ports required by NCCL are blocked by pod networking
Service Discovery Mismatch: PyTorch discovery doesn't work with Kubernetes DNS

Working Configuration:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: pytorch-distributed-training
spec:
  serviceName: "training-service"
  template:
    spec:
      containers:
      - env:
        - name: MASTER_ADDR
          value: "pytorch-distributed-training-0.training-service.default.svc.cluster.local"
        - name: NCCL_SOCKET_IFNAME
          value: "eth0"

TensorFlow Serving Failures

Common Failure Modes:

  • Model signature mismatches between training and serving
  • SavedModel format incompatibilities
  • GPU memory configuration conflicts
  • Missing CUDA graph support in containers

Model Precision Issue: Models trained with mixed precision but TF Serving defaults to FP32
Solution Time: Multiple hours rebuilding SavedModel with explicit precision

@tf.function
def serve_function(inputs):
    predictions = model(inputs)
    return {
        'predictions': tf.cast(predictions, tf.float32),  # Always return FP32
        'scores': tf.nn.softmax(predictions, axis=-1)
    }
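
The wrapper only matters if it is exported as the serving signature; a sketch, assuming integer token-ID inputs and the model_name/version directory layout TF Serving expects:

# Export the FP32-casting wrapper as the default serving signature
concrete_fn = serve_function.get_concrete_function(
    tf.TensorSpec(shape=[None, None], dtype=tf.int32, name="inputs"))
tf.saved_model.save(model, "/models/my_model/1",
                    signatures={"serving_default": concrete_fn})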

Ray Cluster Resource Conflicts

Coordination Problem: Ray and Kubernetes resource allocation must be precisely aligned
Breaking Point: Ray workers request GPUs that Kubernetes hasn't allocated to pods

Alignment Requirement:

rayStartParams:
  num-gpus: "1"  # Must exactly match nvidia.com/gpu: 1
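
In a KubeRay RayCluster, that means the worker group's rayStartParams and its pod template must agree; a sketch (group name, image tag, and replica count are illustrative):

workerGroupSpecs:
- groupName: gpu-workers
  replicas: 2
  rayStartParams:
    num-gpus: "1"                       # what Ray advertises to its own scheduler
  template:
    spec:
      containers:
      - name: ray-worker
        image: rayproject/ray:2.9.0-gpu   # illustrative tag
        resources:
          limits:
            nvidia.com/gpu: 1           # what Kubernetes actually grants the pod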

Hugging Face Model Loading

Production vs Development Failures:

  • Internet access restrictions in locked-down pods
  • HF Hub authentication token storage errors
  • Model cache filesystem conflicts
  • Storage performance bottlenecks (60% GPU idle time)

Reliable Deployment Pattern:

initContainers:
- name: model-preloader
  env:
  - name: HF_HUB_OFFLINE
    value: "0"
containers:
- name: model-server
  env:
  - name: HF_HUB_OFFLINE
    value: "1"  # Use cached models only

Storage Performance Bottlenecks

Critical Performance Impacts

Model Loading Times: 70B models = 140GB files requiring complete loading before inference
GPU Idle Time: 60% idle due to storage bottleneck rather than compute limitations
Checkpoint Overhead: 10GB+ checkpoints every few minutes during training

Solution Results: NVMe SSD with high IOPS + ReadWriteMany volumes + init container preloading eliminated 60% idle time

Storage Configuration Requirements

apiVersion: v1
kind: PersistentVolumeClaim
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: ssd-high-iops
  resources:
    requests:
      storage: 500Gi

Emergency Troubleshooting Procedures

Pod Stuck Pending - "Insufficient nvidia.com/gpu"

Root Cause: Zombie pods claiming GPUs without proper cleanup
Diagnostic: kubectl get pods --all-namespaces | grep -E "(Error|Failed|Unknown)"
Nuclear Option: kubectl delete pods -n gpu-operator -l app=nvidia-device-plugin-daemonset
Recovery Time: 30 seconds for device plugin restart

CUDA Out of Memory in Production

Memory Diagnostic Sequence:

kubectl exec -it model-pod -- nvidia-smi --query-gpu=memory.used,memory.total --format=csv
kubectl exec -it model-pod -- fuser -v /dev/nvidia*
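
To see every process holding GPU memory on the node, not just the ones visible inside the pod's PID namespace, query from the driver daemonset instead:

kubectl exec -n gpu-operator -it daemonset/nvidia-driver-daemonset -- nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv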

Causes in roughly 90% of cases:

  • Production batch sizes exceed test configuration
  • Previous model remains loaded from crash
  • FP32 production vs FP16 development mismatch

Recovery: torch.cuda.empty_cache() → pod restart → node reboot (escalation order)
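
The first escalation step in code (a sketch; empty_cache only returns PyTorch's cached-but-unused blocks to the driver and cannot defragment memory held by live tensors):

import gc
import torch

gc.collect()                          # drop Python references first
torch.cuda.empty_cache()              # release cached, unused blocks back to the driver
print(torch.cuda.memory_summary())    # confirm what is still allocated vs. merely cached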

PyTorch Distributed Training Hangs

Network Diagnosis Priority:

kubectl exec -it trainer-0 -- nc -zv trainer-1.training-service.default.svc.cluster.local 29500

80% Success Rate Fix: export NCCL_SOCKET_IFNAME=eth0

Container Exit Code 137 (OOMKilled)

Confusion Factor: System memory (RAM) vs GPU memory are separate resources
Resource Rule: Memory limits = 2x GPU memory for model loading + inference buffers
Memory Leak Check: Monitor preprocessing code for unbounded growth
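
A quick way to watch for that growth (requires metrics-server; the pod name is illustrative):

# RSS that climbs steadily across requests usually means a preprocessing leak, not a GPU problem
kubectl top pod model-pod --containers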

Critical Resource Allocation Guidelines

CPU-GPU Balance Requirements

AI Workload CPU Needs:

  • Data preprocessing and tokenization
  • Model weight loading and caching
  • Post-processing and response formatting
  • Monitoring and logging overhead

Allocation Rule: 4-8 CPU cores per GPU (more GPUs = proportionally more CPU)

Balanced Configuration:

resources:
  limits:
    nvidia.com/gpu: 1
    memory: "48Gi"    # ~1.2x a 40GB A100; use the 2x rule above if RAM allows
    cpu: "8000m"      # 8 cores per GPU
    ephemeral-storage: "100Gi"

Performance Optimization Patterns

Batch Processing Optimization

GPU Utilization Problem: Batch size 1 wastes 90% of parallel compute capacity
Memory Overhead: Fixed per batch, not per item
Tensor Core Requirement: Batch sizes in multiples of 8

Gradient Accumulation Alternative:

accumulation_steps = 4
optimizer.zero_grad()
for step, batch in enumerate(dataloader):          # batch_size=8 per micro-batch
    loss = model(batch)                            # assumes the model returns the loss
    (loss / accumulation_steps).backward()         # gradients match an effective batch of 32
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

Network Performance Requirements

NCCL Timeout Configuration:

export NCCL_TIMEOUT=3600        # 1 hour timeout
export NCCL_IB_RETRY_CNT=10     # More retries
export NCCL_DEBUG=INFO          # Verbose logging

Decision Matrix for Common Problems

Problem Category       Start Diagnostic              Escalation Path                           Success Rate
Pod Scheduling         kubectl describe pod          Device plugin restart → Node reboot       95%
Memory Issues          nvidia-smi in pod             Cache clear → Pod restart → Node reboot   90%
Model Loading          kubectl logs                  Auth check → Storage optimization         85%
Distributed Training   Framework debug + logs        NCCL tuning → Network team                70%
Performance            nvidia-smi dmon + profiling   Resource rebalancing                      80%

Critical Warnings

Breaking Points:

  • GPU memory fragmentation cannot be resolved without restart
  • NCCL failures often require network infrastructure changes
  • Multi-tenant GPU sharing without proper isolation causes cascading failures
  • Default Kubernetes scheduling is fundamentally incompatible with AI workload requirements

Time-to-Resolution Expectations:

  • Simple configuration issues: 30 minutes - 2 hours
  • Framework integration problems: 4 hours - 2 days
  • Network/infrastructure issues: 1-3 days
  • Storage performance optimization: 2-5 days

Investment Requirements:

  • Technical expertise: Senior-level Kubernetes + AI framework knowledge required
  • Infrastructure: High-performance storage, proper GPU interconnects, network bandwidth
  • Time: 20-40% of initial deployment time should be allocated for debugging and optimization

Useful Links for Further Investigation

Essential Resources for GPU Debugging

  • NVIDIA GPU Operator Troubleshooting: Start here when the GPU operator breaks. Actually useful documentation.
  • NVIDIA Container Runtime Guide: For when containers can't access GPUs.
  • PyTorch Distributed Troubleshooting: NCCL errors and networking failures.
  • Ray on Kubernetes Guide: Resource conflicts between Ray and K8s.
  • NVIDIA DCGM: GPU monitoring for production. Set this up before you have problems.
  • Volcano Scheduler: Gang scheduling for distributed training. Actually works.
  • NVIDIA Developer Forums: NVIDIA engineers actually respond here.
  • k9s: Terminal UI for Kubernetes that doesn't suck.
