Why Kubernetes Hates Your AI Workloads

Kubernetes was designed for web apps that need 2 CPU cores and 4GB RAM. Your Llama-70B model wants 8 A100s and 140GB of GPU memory. Those default assumptions and your model's requirements are fundamentally incompatible.

When GPU scheduling fails, you get error messages like "0/5 nodes available: insufficient nvidia.com/gpu." Thanks for nothing, scheduler.

The Stupid Problems That Will Ruin Your Day

Your GPUs Are Scattered Like Confetti

Here's what pisses me off most: you've got 8 A100s across 4 nodes. Your training job wants 4 GPUs that can actually talk to each other. The default scheduler sees 8 available GPUs and thinks "perfect!" then scatters them across different fucking racks.

Of course the default scheduler doesn't understand that distributed training needs GPUs on the same interconnect. So NCCL fails with some cryptic bullshit about topology.

First, figure out where your GPUs actually live:

kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu"

If you see GPUs spread across different nodes and your job needs them together, you're screwed unless you fix the scheduling.

Took me two days of banging my head against this before I realized the default scheduler is completely clueless about GPU topology. Finally fixed it with Volcano scheduler and gang scheduling - either all training pods get proper GPU placement or none do.

Gang scheduling means your 8-pod training job either gets all the GPUs it needs or waits. No partial deployments that burn money while half the workers can't talk to each other.
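
If you don't want the full Volcano Job API yet (a complete example shows up later in this section), the raw building block is a PodGroup. Rough sketch below: names and counts are placeholders, and the pod-to-group annotation may differ slightly between Volcano versions.

## Minimal gang-scheduling sketch: a Volcano PodGroup plus one pod that joins it
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: llm-training-group
spec:
  minMember: 8            # All 8 training pods must be schedulable before any start
  minResources:
    nvidia.com/gpu: "8"   # Reserve the full GPU budget up front
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer-0
  annotations:
    scheduling.k8s.io/group-name: llm-training-group  # Ties the pod to the PodGroup
spec:
  schedulerName: volcano
  containers:
  - name: trainer
    image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel
    resources:
      limits:
        nvidia.com/gpu: 1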

CUDA Version Hell

Then there's CUDA version hell. Pod loads your 7B model, dies with "CUDA out of memory" using only 3GB. Or my favorite: "no CUDA-capable device found" when nvidia-smi clearly shows 8 GPUs sitting there.

This usually means one of these delightful scenarios:

  • Node has CUDA 11.8, your container wants 12.1
  • NVIDIA container runtime is completely fucked
  • Some zombie process is hogging the GPU from a crashed pod
  • Device plugin lost track of what's actually available

The golden rule that nobody tells you: the CUDA toolkit inside the container can't be newer than what the node's driver supports. So your shiny CUDA 12.1 containers won't run on nodes whose driver only supports 11.8. Because why would they make this obvious?

The dumb thing to check first:

## Note: newer kubectl releases dropped --limits; if it's rejected, use the --overrides form shown in the GPU Operator section below
kubectl run cuda-test --image=nvidia/cuda:12.1.0-runtime-ubuntu20.04 --rm -it --restart=Never --limits=nvidia.com/gpu=1 -- nvidia-smi

If that fails, your NVIDIA container runtime is broken. If it works but your actual workload fails, it's a version mismatch.
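
To confirm a mismatch, compare what the node driver supports against what the container ships. A quick check (the pod name is a placeholder; nvcc only exists in -devel images):

## Driver version and the max CUDA version it supports (top-right of nvidia-smi output)
kubectl exec -it <your-gpu-pod> -- nvidia-smi

## Driver version alone, machine-readable
kubectl exec -it <your-gpu-pod> -- nvidia-smi --query-gpu=driver_version --format=csv,noheader

## CUDA toolkit version baked into the container
kubectl exec -it <your-gpu-pod> -- nvcc --version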

I spent 4 hours debugging this shit once. Base image was built on CUDA 12.1, the node drivers only supported 11.7. The error message? Completely useless: "RuntimeError: CUDA initialization failed." Zero mention of version conflicts. Because helpful error messages are apparently too much to ask for.

Nuclear option: docker system prune -a && kubectl delete pods --all then rebuild everything. Sometimes CUDA context gets corrupted and nothing short of nuking everything works.

GPU Resource Hogging

And don't get me started on GPU hogging. Your critical training job can't start because some inference pod is sitting on an entire A100 at 5% utilization. Thanks, Kubernetes, for treating GPUs as all-or-nothing resources.

You can't request 0.5 GPUs like you can request 500m CPU. Every pod gets a full GPU even if it's running a tiny model that could easily share. Brilliant design choice.

GPU time-slicing can help here - it splits each physical GPU into virtual ones, giving each a time slice of the real hardware. But setting it up is another adventure in YAML hell.

## Time-slicing ConfigMap that actually works
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  a100: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4  # Split each A100 into 4 virtual GPUs
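
The ConfigMap alone does nothing until the device plugin is told to use it. With the NVIDIA GPU Operator that's a patch to the ClusterPolicy; the resource and key names below assume the operator's defaults, so adjust if yours differ.

## Point the device plugin at the time-slicing config and make the "a100" profile the default
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  -n gpu-operator --type merge \
  -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "a100"}}}}'

## Verify: nvidia.com/gpu on each node should now be 4x the physical GPU count
kubectl describe node <gpu-node> | grep nvidia.com/gpu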

GPU Operator Issues: When the Foundation is Broken

The NVIDIA GPU Operator automates GPU management but it's also a single point of failure. When it fails, every GPU pod goes down.

GPU Operator has 5 components that can fail independently: driver daemonset, device plugin, container runtime, MIG manager, DCGM exporter. Each failure has different symptoms.

Common operator failures:

## GPU operator pod stuck in ImagePullBackOff
kubectl get pods -n gpu-operator
## NAME                                      READY   STATUS             RESTARTS   AGE
## nvidia-cuda-validator-12345               0/1     ImagePullBackOff   0          10m

## Device plugin not creating GPU resources
kubectl describe nodes gpu-node-1 | grep nvidia.com/gpu
## Shows: nvidia.com/gpu: 0 (should show 8)

## Driver installation failing silently
kubectl logs -n gpu-operator nvidia-driver-daemonset-xxxxx
## Error: failed to install NVIDIA driver: kernel version mismatch

The debugging sequence that actually works:

## 1. Check if nodes have GPUs detected at hardware level
kubectl debug node/gpu-node-1 -it --image=busybox
## In debug container: lspci | grep -i nvidia

## 2. Verify GPU operator components health
kubectl get pods -n gpu-operator -o wide

## 3. Check driver installation logs
kubectl logs -n gpu-operator daemonset/nvidia-driver-daemonset --tail=50

## 4. Test GPU access from a basic container
kubectl run gpu-test --image=nvidia/cuda:12.1.0-base-ubuntu20.04 --rm -it \
  --overrides='{"spec":{"containers":[{"name":"gpu-test","image":"nvidia/cuda:12.1.0-base-ubuntu20.04","resources":{"limits":{"nvidia.com/gpu":"1"}},"command":["nvidia-smi"]}]}}' \
  --restart=Never

Advanced GPU Scheduling: Beyond Basic Resource Requests

Modern AI workloads need more sophisticated scheduling than "give me 2 GPUs." They need topology awareness, memory locality, and workload-specific optimizations.

Example: Distributed Training with Gang Scheduling

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-llm-training
spec:
  schedulerName: volcano
  minAvailable: 4  # All 4 pods must be scheduled together
  plugins:
    env: []
    svc: []
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
  - replicas: 4
    name: trainer
    template:
      spec:
        affinity:
          podAntiAffinity:  # Spread across nodes for fault tolerance
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  job-name: distributed-llm-training
              topologyKey: kubernetes.io/hostname
        containers:
        - name: trainer
          image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel
          resources:
            limits:
              nvidia.com/gpu: 2  # 2 GPUs per node
              memory: "64Gi"     # Host RAM headroom for data loading and gradient buffers
          env:
          - name: NCCL_DEBUG
            value: "INFO"
          - name: NCCL_TREE_THRESHOLD
            value: "0"

What makes this work:

  • Volcano scheduler with gang scheduling: all pods start together or none start
  • Pod anti-affinity spreads training across nodes for better interconnect usage
  • NCCL environment variables optimize GPU-to-GPU communication
  • Memory allocation accounts for model weights, gradients, and optimizer states

Kueue can help with job queuing and resource sharing if you want another layer of complexity. Yunikorn is an alternative to Volcano, though honestly Volcano just works better for most cases in my experience.

The reality: AI workloads aren't web apps that happen to use GPUs. They have completely different resource patterns, scheduling needs, and failure modes. Standard Kubernetes scheduling breaks under the demands of model training, inference scaling, and multi-tenant GPU sharing. Next up: memory and performance issues that show up once you fix the scheduling nightmare.

GPU Memory Problems That Make No Sense

Pods finally scheduled. You think you're done. Then GPU memory problems hit that make absolutely no fucking sense. Your 70B model loaded fine on your laptop but throws "CUDA out of memory" even though nvidia-smi shows plenty of free memory.

Here's the thing that'll drive you insane: nvidia-smi shows physical memory. But CUDA allocates in chunks and fragments over time. Available != allocatable. Whoever designed this can go fuck themselves.

Why is GPU memory such a nightmare? Usually it's one of these gems:

  • You used FP16 in dev, production defaulted to FP32 (doubled memory requirements)
  • Context length jumped from 2K to 8K tokens (quadrupled memory usage)
  • Dynamic batching spiked to max batch size right when you hit OOM
  • Memory fragmentation from crashed models that didn't clean up properly
  • Someone "temporarily" loaded multiple models on the same GPU and forgot

Quick memory check:

kubectl exec -it llm-pod -- nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv

If nvidia-smi shows free memory but you're still getting OOM, your memory is fragmented to absolute hell. Nuclear option time:

kubectl exec -it llm-pod -- python3 -c "import torch; torch.cuda.empty_cache()"
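
If the allocator itself is the problem, PyTorch's caching allocator can be tuned through PYTORCH_CUDA_ALLOC_CONF; setting it in the pod spec is usually enough to stop fragmentation-driven OOMs. The value below is a starting point, not gospel.

## Pod spec fragment: tame the PyTorch caching allocator
env:
- name: PYTORCH_CUDA_ALLOC_CONF
  value: "max_split_size_mb:128"   # Cap block splits so large allocations can still find room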

I ended up saying fuck it and using DeepSpeed ZeRO to split models across GPUs. Because trying to cram 70B parameters into a single GPU is completely insane. Memory usage dropped from 160GB to 40GB per GPU.

DeepSpeed ZeRO has three stages: ZeRO-1 shards optimizer states, ZeRO-2 adds gradient sharding, ZeRO-3 shards model parameters. Each stage cuts your per-GPU memory requirements.
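
For scale: mixed-precision Adam needs roughly 16 bytes per parameter (FP16 weights and gradients plus FP32 master weights and optimizer moments), so a 70B model drags around ~1.1TB of training state before you even count activations; that's exactly what ZeRO-3 shards away. A minimal config sketch follows, with batch sizes and offload targets as placeholders to tune for your cluster:

## DeepSpeed ZeRO-3 config sketch (pass as a dict to deepspeed.initialize, or dump to ds_config.json)
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},
    "gradient_clipping": 1.0,
    "zero_optimization": {
        "stage": 3,                              # Shard params, grads, and optimizer states
        "offload_optimizer": {"device": "cpu"},  # Push optimizer states to host RAM
        "offload_param": {"device": "cpu"},      # Stream params from host RAM when needed
    },
}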

## Memory-efficient model loading configuration
from transformers import AutoModelForCausalLM
import torch

## Memory-saving knobs to enable in your DeepSpeed/Trainer config (illustrative summary, not a literal config file)
model_config = {
    "fp16": True,                    # Half precision
    "gradient_checkpointing": True,  # Recompute vs store activations
    "offload_optimizer": True,       # Store optimizer states in CPU
    "partition_activations": True,   # Shard activations across GPUs
}

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    torch_dtype=torch.float16,
    device_map="auto",      # Automatic GPU distribution
    low_cpu_mem_usage=True, # Stream loading to reduce CPU usage
    load_in_8bit=True       # Quantized loading via bitsandbytes
)

Dynamic Resource Allocation: When Static Limits Break

Here's what makes this even more fucked up: AI workloads have wildly different resource needs depending on the input. A simple text summarization might need 2GB GPU memory, while generating code could spike to 20GB. Static resource limits either waste 80% of your resources or randomly OOM kill your pods.

Traditional Kubernetes approach (doesn't work):

resources:
  limits:
    nvidia.com/gpu: 1
    memory: "32Gi"      # Must account for worst-case scenario
  requests:
    nvidia.com/gpu: 1
    memory: "32Gi"      # Same as limits for GPU workloads

The result? You allocate for the worst case (32GB) but use only 4GB most of the time. That's roughly 87% resource waste, and you're paying for about 8x the capacity you typically use. Brilliant.

Dynamic Resource Allocation (DRA) can help with workload-aware autoscaling, but good luck getting that working reliably.

DRA lets pods request flexible GPU resources that adjust at runtime based on actual needs.

## DRA ResourceClaim for flexible GPU memory (alpha API; group/version and fields shift between releases)
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: flexible-gpu-memory
spec:
  resourceClassName: nvidia-a100
  parametersRef:
    apiVersion: gpu.nvidia.com/v1alpha1
    kind: GpuClaimParameters
    name: inference-optimized
---
apiVersion: gpu.nvidia.com/v1alpha1
kind: GpuClaimParameters
metadata:
  name: inference-optimized
spec:
  requests:
    memory: "8Gi"     # Minimum guarantee
  limits:
    memory: "40Gi"    # Maximum burst capacity
  sharing:
    strategy: "time-slicing"
    maxClients: 4

The Batch Size Optimization Nightmare

Here's a trap that'll waste weeks of your life: you optimize your model for batch size = 32 during training, then deploy with batch size = 1 for inference. Then you spend days wondering why GPU utilization is stuck at 15%.

Batch size matters because:

  • GPU cores are designed for parallel processing - batch size 1 wastes 90% of your expensive compute
  • Memory overhead is fixed per batch, not per item, so you're burning money
  • Tensor Cores need specific batch sizes (multiples of 8) to not suck
  • Dynamic batching causes unpredictable memory spikes that'll randomly kill your pods

The debugging approach:

## Monitor GPU utilization vs batch size
kubectl exec -it inference-pod -- nvidia-smi dmon -s um -c 100

## Profile memory usage during batch processing
kubectl exec -it inference-pod -- python3 -c "
import torch
from transformers import pipeline

pipe = pipeline('text-generation', model='gpt2', device=0)

## Test different batch sizes
for batch_size in [1, 4, 8, 16, 32]:
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    
    inputs = ['Hello world'] * batch_size
    outputs = pipe(inputs, max_length=100)
    
    peak_mem = torch.cuda.max_memory_allocated() / 1e9
    print(f'Batch size {batch_size}: {peak_mem:.2f}GB peak memory')
"

What actually worked and tripled my throughput: vLLM with continuous batching. Instead of waiting for fixed batches like an idiot, the system continuously combines requests up to memory limits.

Traditional batching waits for N requests like it's 2019. Continuous batching processes requests as they come in, dynamically filling GPU memory.

## vLLM config that actually works
from vllm import AsyncLLMEngine, AsyncEngineArgs

engine_args = AsyncEngineArgs(
    model="meta-llama/Llama-2-7b-chat-hf",
    tensor_parallel_size=1,
    max_model_len=4096,
    block_size=16,
    max_num_batched_tokens=8192, 
    max_num_seqs=256,
    gpu_memory_utilization=0.85, # Leave some headroom
    enforce_eager=False,
)

engine = AsyncLLMEngine.from_engine_args(engine_args)
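
For a quick sanity check before wiring up the async engine, the synchronous vLLM API is easier to poke at. A minimal sketch; the model name and sampling values are just examples:

## Offline smoke test with the synchronous vLLM API
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", gpu_memory_utilization=0.85)
sampling = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain gang scheduling in one sentence."], sampling)
print(outputs[0].outputs[0].text)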

Container Resource Allocation: The CPU-GPU Balance

Here's a hidden problem that'll make you want to throw your laptop: your GPU shows 98% utilization, but your pods are CPU-throttled, creating a bottleneck that wastes your expensive GPU time.

AI workloads need serious CPU power for:

  • Data preprocessing and tokenization (more than you think)
  • Loading and caching massive model weights
  • Post-processing and response formatting
  • All the monitoring and logging nobody wants to think about

Example failure pattern:

## GPU utilization looks great
nvidia-smi
## GPU-Util: 98%  Memory: 38000MiB / 40960MiB

## But CPU is the bottleneck
kubectl top pods
## NAME           CPU(cores)   MEMORY(bytes)
## llm-pod        2000m/2000m  12Gi/16Gi  # CPU maxed out, throttling entire workload

The fix: actually right-size your CPU and memory based on AI workload patterns, not some web app bullshit from 2015.

## Properly balanced AI workload resource allocation
apiVersion: v1
kind: Pod
metadata:
  name: llm-inference
spec:
  containers:
  - name: model-server
    image: vllm/vllm-openai:latest
    resources:
      limits:
        nvidia.com/gpu: 1
        memory: "48Gi"    # 1.2x GPU memory for caching and buffers
        cpu: "8000m"      # 8 CPU cores per GPU for preprocessing
        ephemeral-storage: "100Gi"  # Model weights and cache
      requests:
        nvidia.com/gpu: 1
        memory: "32Gi"    # Guaranteed memory
        cpu: "4000m"      # Guaranteed CPU
        ephemeral-storage: "50Gi"
    env:
    - name: VLLM_CPU_KVCACHE_SPACE
      value: "4"          # Use 4GB CPU memory for KV cache offload
    - name: CUDA_VISIBLE_DEVICES
      value: "0"
    volumeMounts:
    - name: model-cache
      mountPath: /root/.cache
  volumes:
  - name: model-cache
    emptyDir:
      sizeLimit: "200Gi"

Storage Performance: The Overlooked Bottleneck

Here's a surprise that'll ruin your day: your GPU sits idle 60% of the time. Not because of CPU bottlenecks, but because your model is loading from storage that's slower than molasses.

Storage matters way more than anyone wants to admit:

  • Model weights: 70B models are 140GB files that must load completely before inference starts
  • Dataset loading: training jobs stream terabytes of data continuously
  • Checkpoint saving: model checkpoints are 10GB+ every few minutes during training
  • Cache warming: tokenizers, embeddings, preprocessed data all need fast access or your users wait

The debugging approach:

## Monitor I/O wait time
kubectl exec -it training-pod -- iostat -x 1 10

## Check storage throughput during model loading
kubectl exec -it training-pod -- time python3 -c "
from transformers import AutoModelForCausalLM
import time

start = time.time()
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-70b-hf')
end = time.time()
print(f'Model loading took {end-start:.2f} seconds')
"

## Check PVC performance characteristics
kubectl describe pvc model-storage-pvc | grep -A 5 -B 5 "StorageClass\|Access\|Capacity"

What finally solved my 60% idle GPU problem: moved to NVMe SSD storage with high IOPS, preloaded model weights in init containers, and used ReadWriteMany volumes to share models across pods. Should've done this from day one.

## High-performance storage configuration for AI workloads
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: high-iops-model-storage
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: ssd-high-iops  # Cloud provider specific
  resources:
    requests:
      storage: 500Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: llm-with-fast-storage
spec:
  initContainers:
  - name: model-preloader
    image: huggingface/transformers:4.35-pytorch2.1-gpu
    command: ["/bin/sh", "-c"]
    args:
    - |
      python3 -c "
      from transformers import AutoModelForCausalLM, AutoTokenizer
      model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf', cache_dir='/shared-cache')
      tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf', cache_dir='/shared-cache')
      print('Model and tokenizer preloaded')
      "
    volumeMounts:
    - name: shared-model-cache
      mountPath: /shared-cache
    resources:
      limits:
        memory: "32Gi"
        cpu: "4000m"
  containers:
  - name: model-server
    image: vllm/vllm-openai:latest
    env:
    - name: HF_HOME
      value: "/shared-cache"
    volumeMounts:
    - name: shared-model-cache
      mountPath: /shared-cache
    resources:
      limits:
        nvidia.com/gpu: 1
        memory: "16Gi"
  volumes:
  - name: shared-model-cache
    persistentVolumeClaim:
      claimName: high-iops-model-storage

Understanding these memory and performance patterns is crucial because AI workloads stress Kubernetes in ways that traditional applications don't. The next section covers how different AI frameworks (PyTorch, TensorFlow, JAX) create their own specific integration challenges that require framework-aware debugging approaches.

Framework Integration Hell

Single-GPU training works fine on your laptop. Try distributed training across pods and everything falls apart with networking errors that make absolutely no sense.

PyTorch distributed training is broken by default in Kubernetes. Why? Because PyTorch expects training processes to talk directly to each other, but Kubernetes networking adds abstraction layers that fuck everything up:

  • NCCL needs specific ports that pods can't access
  • Pod networking latency breaks collective operations
  • Service discovery works nothing like PyTorch expects
  • Your expensive high-speed interconnects get routed through slow software bridges

The debugging approach that works

## Check if pods can actually talk to each other
kubectl exec -it trainer-0 -- ping trainer-1.training-service.default.svc.cluster.local

## Test NCCL communication (this is where it usually breaks)
kubectl exec -it trainer-0 -- python3 -c "
import torch
import torch.distributed as dist
import os

dist.init_process_group(
    backend='nccl',
    init_method='env://',
    rank=int(os.environ['RANK']),
    world_size=int(os.environ['WORLD_SIZE'])
)
print('NCCL works')
print(f'Rank: {dist.get_rank()}, World size: {dist.get_world_size()}')
"

## Check bandwidth between nodes
kubectl exec -it trainer-0 -- iperf3 -s -p 5001 &
kubectl exec -it trainer-1 -- iperf3 -c trainer-0.training-service.default.svc.cluster.local -p 5001

What finally worked after days of suffering: ditched torchrun for PyTorch Elastic with Kubernetes-native service discovery. Used StatefulSet with headless services so pods get predictable names that don't change every restart.

## PyTorch distributed training that actually works in K8s
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: pytorch-distributed-training
spec:
  serviceName: "training-service"
  replicas: 4
  selector:
    matchLabels:
      app: pytorch-training
  template:
    metadata:
      labels:
        app: pytorch-training
    spec:
      containers:
      - name: trainer
        image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel
        ports:
        - containerPort: 29500  # Default PyTorch distributed port
          name: dist-port
        env:
        - name: MASTER_ADDR
          value: "pytorch-distributed-training-0.training-service.default.svc.cluster.local"
        - name: MASTER_PORT
          value: "29500"
        - name: WORLD_SIZE
          value: "4"
        - name: RANK
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['apps.kubernetes.io/pod-index']  # StatefulSet pod index label (K8s 1.28+)
        - name: NCCL_DEBUG
          value: "INFO"
        - name: NCCL_SOCKET_IFNAME
          value: "eth0"  # Force NCCL to use pod network interface
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "32Gi"
        command: ["/bin/bash", "-c"]
        args:
        - |
          python3 -m torch.distributed.run \
            --nproc_per_node=1 \
            --nnodes=4 \
            --node_rank=$RANK \
            --master_addr=$MASTER_ADDR \
            --master_port=$MASTER_PORT \
            train_distributed.py
---
apiVersion: v1
kind: Service
metadata:
  name: training-service
spec:
  clusterIP: None  # Headless service for StatefulSet
  selector:
    app: pytorch-training
  ports:
  - port: 29500
    targetPort: 29500
    name: dist-port

TensorFlow Serving will ruin your week. Your model serves perfectly locally but returns empty responses or just crashes when deployed through TF Serving on Kubernetes.

Why TF Serving loves to break:

  • Model signature mismatches between training and serving versions (because consistency is optional)
  • SavedModel format incompatibilities that nobody documents
  • GPU memory config conflicts that make no sense
  • CUDA graphs aren't supported in containerized TF Serving (why would they be?)
  • Missing input preprocessing pipelines that worked fine in notebooks

The diagnostic nightmare

## Check what TF thinks your model signature is
kubectl exec -it tf-serving-pod -- saved_model_cli show --dir /models/my_model/1 --tag_set serve --signature_def serving_default

## Try to load the model directly (usually fails)
kubectl exec -it tf-serving-pod -- python3 -c "
import tensorflow as tf
model = tf.saved_model.load('/models/my_model/1')
print('Model loaded')
print('Signatures:', list(model.signatures.keys()))
"

## Look for actual error messages in the logs
kubectl logs tf-serving-pod | grep -A 5 -B 5 "Error\|Failed\|Exception"

## Check if TF can even see the GPU
kubectl exec -it tf-serving-pod -- python3 -c "
import tensorflow as tf
print('TF version:', tf.__version__)
print('GPUs:', tf.config.list_physical_devices('GPU'))
"

Took me forever to discover the model was trained with mixed precision but TF Serving defaulted to FP32, causing memory failures. Had to rebuild the entire SavedModel with explicit precision config. Hours of my life I'll never get back.

## Proper TensorFlow model saving for Kubernetes deployment
import tensorflow as tf

## Configure mixed precision policy
policy = tf.keras.mixed_precision.Policy('mixed_float16')
tf.keras.mixed_precision.set_global_policy(policy)

## Save model with serving-compatible signature
@tf.function
def serve_function(inputs):
    predictions = model(inputs)
    return {
        'predictions': tf.cast(predictions, tf.float32),  # Always return FP32
        'scores': tf.nn.softmax(predictions, axis=-1)
    }

model.save(
    '/tmp/saved_model',
    signatures={'serving_default': serve_function.get_concrete_function(
        tf.TensorSpec(shape=[None, 224, 224, 3], dtype=tf.float32)
    )}
)

## TF Serving deployment configuration
tf_serving_config = {
    "model_config_list": [
        {
            "name": "my_model",
            "base_path": "/models/my_model",
            "model_platform": "tensorflow",
            "model_version_policy": {"latest": {"num_versions": 2}}
        }
    ]
}
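
Note that TF Serving itself reads this as a models.config file in protobuf text format, not a Python dict; the equivalent file passed via --model_config_file looks roughly like this:

## models.config handed to TF Serving with --model_config_file
model_config_list {
  config {
    name: "my_model"
    base_path: "/models/my_model"
    model_platform: "tensorflow"
    model_version_policy { latest { num_versions: 2 } }
  }
}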

Ray Serve and Ray Train promise "easy distributed AI" but they add their own networking and resource management layer that conflicts with Kubernetes in spectacular ways.

Where Ray breaks (and it will break)

  • Head node discovery and worker registration just stops working randomly
  • GPU allocation conflicts between Ray and Kubernetes that make both systems confused
  • Port conflicts between Ray services and Kubernetes services
  • Ray's fault tolerance fighting with Kubernetes pod restarts
  • Object store memory conflicts with container limits that cause mysterious crashes

Ray debugging is honestly a special kind of hell

## Check if Ray cluster even knows about workers
kubectl exec -it ray-head-pod -- ray status

## See what resources Ray thinks it has
kubectl exec -it ray-head-pod -- python3 -c "
import ray
ray.init(address='ray://127.0.0.1:10001')
print('Ray sees:', ray.cluster_resources())
print('Actually available:', ray.available_resources())
"

## Compare with what Kubernetes allocated  
kubectl top pods | grep ray
kubectl describe pod ray-worker-xxx | grep -A 10 -B 10 resources

## Test if Ray tasks can actually use GPUs
kubectl exec -it ray-head-pod -- python3 -c "
import ray
import torch

@ray.remote(num_gpus=1)
def gpu_test():
    return torch.cuda.is_available(), torch.cuda.device_count()

ray.init(address='ray://127.0.0.1:10001')  
result = ray.get(gpu_test.remote())
print('GPU works in Ray task:', result)
"

What eventually made Ray work: I had to carefully coordinate resource allocation between Kubernetes and Ray. Ray workers can only request GPUs that Kubernetes already allocated to their pods. Obvious in retrospect, nightmare to debug.

## Ray cluster configuration that works with K8s GPU allocation
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-config
data:
  ray_config.yaml: |
    ray_head_pod_template:
      apiVersion: v1
      kind: Pod
      metadata:
        labels:
          component: ray-head
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.8.0-gpu
          ports:
          - containerPort: 6379  # Redis
          - containerPort: 8265  # Dashboard
          - containerPort: 10001 # Client server
          resources:
            requests:
              cpu: "2000m"
              memory: "8Gi"
            limits:
              cpu: "2000m"
              memory: "8Gi"
          env:
          - name: RAY_DISABLE_IMPORT_WARNING
            value: "1"
    
    ray_worker_pod_template:
      apiVersion: v1
      kind: Pod
      metadata:
        labels:
          component: ray-worker
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.8.0-gpu
          resources:
            requests:
              nvidia.com/gpu: 1
              cpu: "4000m"
              memory: "16Gi"
            limits:
              nvidia.com/gpu: 1
              cpu: "4000m"
              memory: "16Gi"
          env:
          - name: CUDA_VISIBLE_DEVICES
            value: "0"
          - name: RAY_DISABLE_IMPORT_WARNING
            value: "1"
---
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: gpu-ray-cluster
spec:
  rayVersion: '2.8.0'
  headGroupSpec:
    rayStartParams:
      dashboard-host: '0.0.0.0'
      num-gpus: '0'  # Head node gets no GPUs
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.8.0-gpu
          resources:
            requests:
              cpu: "2000m"
              memory: "8Gi"
  workerGroupSpecs:
  - replicas: 3
    minReplicas: 1
    maxReplicas: 10
    groupName: gpu-workers
    rayStartParams:
      num-gpus: '1'  # Each worker gets exactly 1 GPU
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.8.0-gpu
          resources:
            requests:
              nvidia.com/gpu: 1  # K8s allocates 1 GPU
              cpu: "4000m"
              memory: "16Gi"
            limits:
              nvidia.com/gpu: 1
              cpu: "4000m"
              memory: "16Gi"

Hugging Face models work perfectly in your Jupyter notebook but fail in production with auth errors, painfully slow loading, and missing dependencies.

Why Transformers break in production

  • Model downloads need internet access that your locked-down pods don't have
  • HF Hub auth tokens are stored wrong or not at all
  • Model cache conflicts with your container filesystem setup
  • Tokenizer loading is slow as shit due to garbage storage performance
  • Version mismatches between model requirements and whatever packages you actually have installed

What finally made model loading reliable

## Reliable Hugging Face model deployment
apiVersion: v1
kind: Secret
metadata:
  name: huggingface-token
type: Opaque
data:
  token: <base64-encoded-hf-token>
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hf-cache-pvc
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  storageClassName: ssd-fast
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hf-model-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: hf-server
  template:
    metadata:
      labels:
        app: hf-server
    spec:
      initContainers:
      - name: model-downloader
        image: huggingface/transformers:4.35-pytorch2.1-gpu
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: huggingface-token
              key: token
        - name: HF_HOME
          value: "/shared-cache"
        - name: HF_HUB_OFFLINE
          value: "0"
        volumeMounts:
        - name: hf-cache
          mountPath: /shared-cache
        command: ["/bin/bash", "-c"]
        args:
        - |
          python3 -c "
          from transformers import AutoModelForCausalLM, AutoTokenizer
          import os
          
          model_name = 'meta-llama/Llama-2-7b-chat-hf'
          print(f'Downloading {model_name}...')
          
          # Pre-download model and tokenizer
          tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir='/shared-cache')
          model = AutoModelForCausalLM.from_pretrained(
              model_name,
              cache_dir='/shared-cache',
              torch_dtype='auto',
              device_map='cpu'  # Download to CPU first
          )
          print('Model download complete')
          "
        resources:
          requests:
            cpu: "2000m"
            memory: "16Gi"
      containers:
      - name: model-server
        image: huggingface/transformers:4.35-pytorch2.1-gpu
        env:
        - name: HF_HOME
          value: "/shared-cache"
        - name: HF_HUB_OFFLINE
          value: "1"  # Use cached models only
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
        volumeMounts:
        - name: hf-cache
          mountPath: /shared-cache
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "24Gi"
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
      volumes:
      - name: hf-cache
        persistentVolumeClaim:
          claimName: hf-cache-pvc

Here's the brutal truth: AI frameworks were never designed for containerized, multi-tenant environments. Every single one makes assumptions about networking, storage, and resource access that directly conflict with how Kubernetes works.

Success means understanding what each framework expects and what Kubernetes actually provides, then building hacky bridges between them until something works. It's not elegant, but it's reality.

When Everything Goes Wrong: Emergency Fixes

Q: Pod stuck pending with "Insufficient nvidia.com/gpu" but nvidia-smi shows GPUs available?

A: Your GPUs are probably claimed by zombie pods that didn't clean up properly. This happens constantly:

kubectl describe nodes | grep -A 5 -B 5 "nvidia.com/gpu"
kubectl get pods --all-namespaces | grep -E "(Error|Failed|Unknown)"

Usually it's zombie pods hogging GPUs or the device plugin crashed. Nuclear option (this actually works):

kubectl delete pods -n gpu-operator -l app=nvidia-device-plugin-daemonset

Wait 30 seconds for it to restart, then try scheduling again. If you're desperate, reboot the nodes.

Q: My model works fine locally but gives "CUDA out of memory" in production?

A: Production memory patterns are never the same as your laptop. First, figure out what's actually using the GPU:

kubectl exec -it model-pod -- nvidia-smi --query-gpu=memory.used,memory.total --format=csv
kubectl exec -it model-pod -- fuser -v /dev/nvidia*

90% of the time it's one of these stupid things:

  • Production batch sizes are way larger than you tested locally
  • Another model is still loaded from a previous run that crashed
  • GPU memory wasn't cleared after the last crash
  • You're running FP32 in prod but FP16 on your laptop (classic mistake)

Try clearing GPU memory first:

import torch
torch.cuda.empty_cache()
torch.cuda.ipc_collect()

If that doesn't work, restart the pod. If that doesn't work, reboot the node.

Q: PyTorch distributed training hangs during initialization - what's wrong?

A: Nine times out of ten, it's networking bullshit. Try these in order:

## Can pods actually reach each other on the right port?
kubectl exec -it trainer-0 -- nc -zv trainer-1.training-service.default.svc.cluster.local 29500

## Are NCCL settings completely fucked?
kubectl exec -it trainer-0 -- env | grep NCCL

## Any useful error messages? (probably not)
kubectl logs trainer-0 | grep -i "nccl\|process group\|rank"

Common ways to screw this up:

  • MASTER_ADDR points to wrong pod (check your StatefulSet naming)
  • NCCL picks the wrong network interface (it will pick wrong)
  • Some firewall is blocking ports you didn't know existed
  • Process ranks don't match world size (count your pods again)

Quick fix that works 80% of the time: export NCCL_SOCKET_IFNAME=eth0
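
To make that stick across restarts, bake it into the pod spec instead of exporting it by hand. A sketch; the interface name assumes the default pod network device:

## Pod spec fragment: pin NCCL to the pod network interface
env:
- name: NCCL_SOCKET_IFNAME
  value: "eth0"
- name: NCCL_DEBUG
  value: "INFO"   # Keep verbose logging on until training is stable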

Q: My inference pods are getting killed with exit code 137 (OOMKilled) but GPU memory looks fine?

A: You're confusing system memory (RAM) with GPU memory. They're completely different:

## Check container memory limits vs actual usage
kubectl describe pod inference-pod | grep -A 5 -B 5 "Limits\|Requests"

## Monitor memory usage over time
kubectl top pod inference-pod --containers

## Check for memory leaks in your shitty preprocessing code
kubectl exec -it inference-pod -- ps aux --sort=-%mem | head -10

Usually caused by:

  • Insufficient RAM for model loading plus inference buffers (duh)
  • Memory leak in your data preprocessing (check your code)
  • Large batch processing overwhelming system memory
  • Model weights cached in both GPU and CPU memory (wasteful but happens)

Quick rule: set memory limits to 2x GPU memory. GPU = 16GB? Set memory limit = 32Gi. Don't ask why, just do it.
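
As a sketch, that rule of thumb for a single 16GB GPU looks like this (the numbers are the rule of thumb, not a guarantee):

## Container resources for one 16GB GPU, following the 2x rule of thumb
resources:
  requests:
    nvidia.com/gpu: 1
    memory: "24Gi"   # Guaranteed floor
  limits:
    nvidia.com/gpu: 1
    memory: "32Gi"   # 2x the 16GB GPU so the OOM killer stays away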

Q: Why do my GPU pods stay in "ContainerCreating" for 5+ minutes?

A: Slow as hell model loading from storage or pulling massive container images:

## Check if it's slow image pulling
kubectl describe pod slow-pod | grep -A 10 Events

## Monitor storage I/O during startup (spoiler: it's probably terrible)
kubectl exec -it slow-pod -- iostat -x 1

## Check model download status
kubectl logs slow-pod --previous

Performance killers:

  • Downloading 50GB+ model weights on every single startup (why?)
  • Using the slowest possible storage class
  • Cold pulling container images from registry over slow internet
  • No model cache or completely broken caching

Actual fix: use init containers for model preloading:

initContainers:
- name: model-preloader
  image: your-model-image
  command: ["python", "-c", "from transformers import AutoModel; AutoModel.from_pretrained('model-name', cache_dir='/cache')"]
  volumeMounts:
  - name: model-cache
    mountPath: /cache

Q: My AI workload shows 100% GPU utilization but inference is still slow?

A: GPU utilization percentage is a lie. Check for actual bottlenecks:

## Monitor GPU memory bandwidth and compute separately
kubectl exec -it ai-pod -- nvidia-smi dmon -s um

## Check if your CPU is being throttled
kubectl top pod ai-pod

## Profile your inference pipeline (prepare for bad news)
kubectl exec -it ai-pod -- python -m torch.utils.bottleneck your_script.py

Hidden bottlenecks that'll ruin your day:

  • CPU preprocessing can't keep up with the GPU (classic)
  • I/O bound by slow tokenization or data loading
  • Inefficient tensor operations causing GPU to wait around
  • Memory bandwidth limitations (your GPU is starving)

Rule of thumb: use 4-8 CPU cores per GPU. More GPUs = more CPU cores needed.

Q: Why does my model work fine for small requests but fail on large batches?

A: Memory allocation scales badly with batch size. Test this yourself:

## Test memory usage with increasing batch sizes
kubectl exec -it model-pod -- python3 -c "
import torch
for batch_size in [1, 4, 8, 16, 32]:
    try:
        x = torch.randn(batch_size, 1024, 1024).cuda()
        print(f'Batch {batch_size}: {torch.cuda.memory_allocated()/1e9:.2f}GB')
        del x
        torch.cuda.empty_cache()
    except Exception as e:
        print(f'Batch {batch_size}: FAILED - {e}')
"

Scaling failures:

  • Memory growth is quadratic with sequence length (oops)
  • You hit fixed GPU memory limits at larger batches
  • Your model architecture doesn't support dynamic batching
  • KV cache memory grows linearly with context length

Actual fix: gradient accumulation instead of huge batches:

## Instead of batch_size=32 (which OOMs), use batch_size=8 and accumulate 4 steps
for step, batch in enumerate(dataloader):  # batch_size=8
    loss = model(batch)
    (loss / 4).backward()  # Scale loss so 4 accumulated steps match one big batch
    if (step + 1) % 4 == 0:
        optimizer.step()
        optimizer.zero_grad()

Q: My Hugging Face model download keeps failing in the pod?

A: Network restrictions or auth problems. Debug in order:

## Can your pod even reach the internet?
kubectl exec -it model-pod -- curl -I https://huggingface.co

## Is HF authentication working?
kubectl exec -it model-pod -- python3 -c "
from huggingface_hub import whoami
try:
    print('Logged in as:', whoami())
except Exception as e:
    print('Auth failed:', e)
"

## Check environment variables
kubectl exec -it model-pod -- env | grep HF_

Common download failures:

  • Network policies blocking external access (ask your platform team)
  • Missing or expired HF tokens (check your secrets)
  • Corporate firewall blocking model downloads (good luck)
  • Insufficient disk space for 50GB+ model weights (increase storage)

Actual fix: pre-download models to persistent volume:

kubectl run model-downloader --image=huggingface/transformers \
  --env="HUGGING_FACE_HUB_TOKEN=your_token" \
  --command -- python3 -c "from transformers import AutoModel; AutoModel.from_pretrained('model-name', cache_dir='/cache')"

Q: Why does my Ray cluster show GPU resources but tasks can't access them?

A: Resource allocation mismatch between Kubernetes and Ray. They're fighting:

## Check what K8s actually allocated
kubectl describe pod ray-worker-xxx | grep nvidia.com/gpu

## Check what Ray thinks it has
kubectl exec -it ray-head -- ray status

## Test if Ray tasks can actually use GPUs
kubectl exec -it ray-head -- python3 -c "
import ray
ray.init(address='auto')  # Attach to the running cluster instead of starting a local one
@ray.remote(num_gpus=1)
def test_gpu():
    import torch
    return torch.cuda.is_available()
result = ray.get(test_gpu.remote())
print('GPU accessible:', result)
"

Common mismatches:

  • Ray requesting more GPUs than Kubernetes allocated (Ray is greedy)
  • CUDA_VISIBLE_DEVICES not set correctly
  • Ray worker can't see the GPUs that Kubernetes allocated
  • Resource detection happening before GPU operator is ready

Fix: align Ray and Kubernetes resource requests exactly:

## If Kubernetes allocated 1 GPU, Ray must request exactly 1
rayStartParams:
  num-gpus: "1"  # Must match nvidia.com/gpu: 1

Q: My multi-node training job fails with "RuntimeError: NCCL operation failed"?

A: NCCL communication is breaking down across nodes. Debug step by step:

## Test if NCCL can even initialize
kubectl exec -it trainer-0 -- python3 -c "
import torch.distributed as dist
import torch
dist.init_process_group(backend='nccl', init_method='env://')
print('NCCL backend initialized successfully')
"

## Check network bandwidth between nodes
kubectl exec -it trainer-0 -- iperf3 -s &
kubectl exec -it trainer-1 -- iperf3 -c trainer-0.training-service.default.svc.cluster.local

NCCL failure modes (all painful):

  • High network latency between nodes (blame the network team)
  • Insufficient bandwidth for gradient synchronization
  • NCCL timeouts during large tensor operations
  • Network interface selection problems (NCCL picks wrong interface)

Tune NCCL settings (this helps sometimes):

export NCCL_IB_TIMEOUT=22       # Bump the InfiniBand timeout (default is 18)
export NCCL_IB_RETRY_CNT=10     # More retries before giving up
export NCCL_DEBUG=INFO          # Verbose logging (prepare for spam)
## For long collectives, also raise the PyTorch side: init_process_group(..., timeout=timedelta(hours=1))

What Actually Works for GPU Debugging

  • Pod won't schedule: start with kubectl describe pod (shows what the scheduler is actually thinking)
  • CUDA out of memory: start with nvidia-smi inside the pod (shows real memory usage vs what you think)
  • Model won't load: start with kubectl logs (usually auth/download/storage errors)
  • Distributed training hangs: start with framework debug tools plus logs (network issues need both perspectives)
  • Terrible performance: start with nvidia-smi dmon plus profiling (you need real-time metrics, not snapshots)
  • Resource fights: start with kubectl describe pod plus nvidia-smi (compare what K8s allocated vs what's actually used)
