Kubernetes GPU Allocation Troubleshooting: AI-Optimized Reference

Critical Context and Failure Scenarios

Severity Indicators

  • Critical Production Impact: Device plugin crashes eliminate all GPU visibility (affects $80k+ infrastructure)
  • High Frequency Issues: Device plugin failures occur weekly, scheduling problems daily
  • Resource Cost: Each hour of GPU downtime costs $200-500 in compute resources
  • Debug Time Investment: Typical troubleshooting sessions run 2-4 hours, often at 3AM

Common Misconceptions

  • False: nvidia-smi working means Kubernetes can see GPUs
  • Reality: The device plugin layer frequently fails while the hardware stays healthy (quick check below)
  • False: GPU quotas work like CPU/memory quotas
  • Reality: GPU quotas use vendor-specific extended resource names (nvidia.com/gpu), and pods must request them with requests equal to limits
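
A quick way to test the first misconception on a live cluster (the node name is a placeholder):

# On the node (or via kubectl debug), the driver can look perfectly healthy:
nvidia-smi

# ...while the cluster advertises zero GPUs. Check what the node actually reports:
kubectl describe node <gpu-node-name> | grep -i "nvidia.com/gpu"
# A healthy node lists nvidia.com/gpu under both Capacity and Allocatable;
# missing or zero values point at the device plugin layer, not the hardware.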

Configuration Requirements

Device Plugin Prerequisites

Critical Dependencies (95% of failures land here; a manifest sketch follows the list):

  • CUDA version compatibility: the CUDA version in the device plugin image must not exceed what the node driver supports
  • Privileged security context required (non-negotiable)
  • Socket permissions: /var/lib/kubelet/device-plugins/nvidia.sock
  • Container runtime: NVIDIA Container Runtime must be configured
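
A minimal sketch of how those prerequisites show up in the device plugin DaemonSet itself. Names, namespace, and image tag are illustrative, and the GPU Operator normally manages this object for you; the point is where the privileged context, runtime class, and socket directory mount live:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-device-plugin-daemonset
  template:
    metadata:
      labels:
        app: nvidia-device-plugin-daemonset
    spec:
      runtimeClassName: nvidia              # requires NVIDIA Container Runtime configured on the node
      containers:
      - name: nvidia-device-plugin
        image: nvcr.io/nvidia/k8s-device-plugin:v0.14.1   # illustrative tag; its CUDA must be <= node driver
        securityContext:
          privileged: true                  # non-negotiable for device access
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins      # where nvidia.sock gets registered
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins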

Socket Permission Fix (solves 60% of issues):

# Check socket permissions
ls -la /var/lib/kubelet/device-plugins/
# Fix permissions if needed
chmod 755 /var/lib/kubelet/device-plugins/nvidia.sock

Resource Quota Configuration

GPU-Specific Requirements:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota               # illustrative name and namespace
  namespace: ml-workloads
spec:
  hard:
    requests.nvidia.com/gpu: "8"
    limits.nvidia.com/gpu: "8"    # MUST match requests exactly
    requests.memory: "256Gi"      # Set high - GPU workloads consume massive memory
    persistentvolumeclaims: "20"  # Model storage adds up quickly

Time-Slicing Quota Multiplication:

  • Physical GPUs × Replica Count = Virtual GPU Quota
  • Example: 4 physical GPUs × 4-way time-slicing = 16 virtual GPU quota (quota sketch below)
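
If time-slicing is enabled, namespace quotas must be raised to the virtual count or pods get rejected while GPUs sit idle. A minimal sketch for the 4 × 4 example above (object and namespace names are illustrative):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota-timesliced
  namespace: ml-inference
spec:
  hard:
    requests.nvidia.com/gpu: "16"   # 4 physical GPUs x 4 time-sliced replicas each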

Multi-GPU Scheduling Constraints

Critical Limitations:

  • All GPUs must reside on a single node (cannot span nodes - see the pod sketch after this list)
  • Scheduler ignores GPU topology and performance differences
  • No automatic NUMA or NVLink awareness
  • Gang scheduling required for distributed training
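
Because the scheduler will place a multi-GPU pod on any node that advertises enough nvidia.com/gpu, regardless of NVLink or NUMA layout, it helps to pin such pods to node pools you know can satisfy them. A minimal sketch, assuming an illustrative gpu-type node label and image tag:

apiVersion: v1
kind: Pod
metadata:
  name: multi-gpu-training
spec:
  nodeSelector:
    gpu-type: a100-8x          # illustrative label - use your node pool's labels (e.g. from NFD)
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.01-py3   # illustrative image tag
    resources:
      limits:
        nvidia.com/gpu: 4      # all 4 must be free on the same node or the pod stays Pending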

Resource Requirements and Performance Thresholds

Hardware Memory Mapping

GPU Model  | Memory  | Use Case Limitations
-----------|---------|--------------------------------------
Tesla T4   | 16GB    | Cannot run modern large models
Tesla V100 | 32GB    | Limited for 80GB+ model requirements
Tesla A100 | 40/80GB | Production-ready for most workloads
Tesla H100 | 80GB    | Optimal for largest models

Time Investment by Problem Type

Issue Category        | Diagnostic Time | Fix Implementation      | Success Rate
----------------------|-----------------|-------------------------|-------------
Device Plugin Crash   | 5 minutes       | 5 minutes (pod restart) | 90%
Scheduling Failures   | 30 minutes      | 10-45 minutes           | 85%
Resource Quotas       | 10 minutes      | 5 minutes               | 95%
Driver/Runtime Issues | 60 minutes      | 2+ hours                | 70%
Time-Slicing Problems | 45 minutes      | 30 minutes              | 75%
MIG Configuration     | 60 minutes      | 45 minutes + reboot     | 60%

Critical Warnings and Failure Modes

Breaking Points That Official Documentation Omits

  1. UI Breakdown: Kubernetes dashboards become unusable when large distributed jobs span 1000+ GPUs
  2. Default Settings Failures: GPU Operator default configurations fail in 70% of production environments
  3. Security Context Requirements: GPU workloads require privileged access (security team resistance common)
  4. Migration Breaking Changes: GPU Operator upgrades frequently break existing workloads

Hidden Costs

  • Human Time: 4-hour emergency debugging sessions at 3AM
  • Expertise Requirements: Deep CUDA, Kubernetes, and hardware topology knowledge needed
  • Community Support Quality: NVIDIA forums provide actual engineer responses (unusual for vendor support)
  • Infrastructure Dependencies: Requires specialized monitoring, networking, and storage configurations

Systematic Diagnostic Procedures

Primary Failure Chain Analysis

Ordered diagnostic steps (stop when broken component found):

# 1. Hardware Detection (busybox ships no lspci, so chroot into the host from the node debug pod)
kubectl debug node/gpu-node-1 -it --image=busybox
chroot /host
lspci | grep -i nvidia

# 2. Driver Functionality (nvidia-smi ships with the driver, not the CUDA image;
#    with operator-managed drivers it lives under /run/nvidia/driver)
nvidia-smi
exit

# 3. Device Plugin Status
kubectl get pods -n gpu-operator -l app=nvidia-device-plugin-daemonset
kubectl logs -n gpu-operator nvidia-device-plugin-daemonset-xxx

# 4. Resource Advertisement
kubectl describe nodes | grep -A 5 -B 5 "nvidia.com/gpu"

# 5. Allocation Test (older kubectl accepted --limits; on newer releases use the manifest shown below)
kubectl run gpu-test --image=nvidia/cuda:12.3-runtime-ubuntu22.04 --restart=Never --rm -it --limits=nvidia.com/gpu=1 -- nvidia-smi
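
The same allocation test as a plain manifest, for kubectl versions without --limits (use any CUDA runtime image that matches your node driver):

# gpu-test.yaml - one-shot allocation test; delete the pod when done
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: gpu-test
    image: nvidia/cuda:12.3-runtime-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1

Apply it with kubectl apply -f gpu-test.yaml, read kubectl logs gpu-test once it completes, then kubectl delete pod gpu-test.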

Emergency Recovery Procedures

Quick Fixes by Problem Type (command sketches follow the list):

  • Device Plugin Down: kubectl delete pods -n gpu-operator -l app=nvidia-device-plugin-daemonset
  • Quota Exceeded: Check for stuck terminating pods consuming quota
  • Scheduling Failures: Verify tolerations match node taints exactly
  • Runtime Errors: Validate NVIDIA Container Runtime installation
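
The quick fixes above, expanded into command sketches. Node, pod, and namespace names are placeholders, and the containerd check assumes the default config path:

# Device plugin down: bounce the daemonset pods (they are stateless)
kubectl delete pods -n gpu-operator -l app=nvidia-device-plugin-daemonset

# Quota exceeded: find Terminating pods still counted against the quota
kubectl get pods -A | grep Terminating
kubectl delete pod <stuck-pod> -n <namespace> --grace-period=0 --force   # last resort

# Scheduling failures: compare node taints against the pod's tolerations
kubectl describe node <gpu-node> | grep -i taint
kubectl get pod <pending-pod> -n <namespace> -o jsonpath='{.spec.tolerations}'

# Runtime errors: confirm the NVIDIA runtime is registered with containerd
kubectl debug node/<gpu-node> -it --image=busybox -- grep -i nvidia /host/etc/containerd/config.toml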

Advanced Configuration Patterns

Time-Slicing Implementation

Node-Specific Configuration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  tesla-a100: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 8    # 8-way sharing for A100
  tesla-t4: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 2    # 2-way sharing for T4
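
The ConfigMap by itself changes nothing: the GPU Operator has to be pointed at it and each node labeled with the profile to apply. A sketch of that wiring, assuming the default cluster-policy object; verify the exact field and label names against your GPU Operator version:

# Point the GPU Operator's device plugin at the ConfigMap
kubectl patch clusterpolicies.nvidia.com/cluster-policy -n gpu-operator --type merge \
  -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config"}}}}'

# Tell each node which profile (data key) to apply
kubectl label node gpu-node-1 nvidia.com/device-plugin.config=tesla-a100 --overwrite
kubectl label node gpu-node-2 nvidia.com/device-plugin.config=tesla-t4 --overwrite

# Confirm the advertised count multiplied (8 per physical GPU on the A100 node in this example)
kubectl describe node gpu-node-1 | grep "nvidia.com/gpu"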

Multi-Instance GPU (MIG) Configuration

Production MIG Strategy:

mixed-partition: |-
  version: v1
  mig-configs:
    mixed-workloads:
      - devices: [0]
        mig-enabled: true
        mig-devices:
          1g.5gb: 2    # Inference workloads
          3g.20gb: 1   # Training workloads
      - devices: [1]
        mig-enabled: true
        mig-devices:
          7g.40gb: 1   # Large model workloads (whole GPU as a single MIG instance)
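
With the GPU Operator's MIG manager, a node picks up a profile via a label whose value must match a key under mig-configs. Drain GPU workloads first and expect a possible reboot, as the time table above warns. A sketch, assuming the operator-managed label keys:

# Select the MIG profile for the node
kubectl label node gpu-node-1 nvidia.com/mig.config=mixed-workloads --overwrite

# Watch the MIG manager apply it (state moves from pending to success, possibly via a reboot)
kubectl describe node gpu-node-1 | grep "nvidia.com/mig.config"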

Priority-Based Resource Management

Production Priority Classes:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-production-critical
value: 1000000
description: "Critical production GPU workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-training-low
value: 10000
preemptionPolicy: Never  # Cannot preempt other workloads
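
Workloads opt in by naming the class. A minimal sketch of a low-priority training pod (image tag illustrative) - with the class above it can be preempted by production-critical GPU pods but never preempts anything itself:

apiVersion: v1
kind: Pod
metadata:
  name: experimental-training
spec:
  priorityClassName: gpu-training-low
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.01-py3   # illustrative
    resources:
      limits:
        nvidia.com/gpu: 1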

Production Monitoring and Alerting

Essential Metrics

  • GPU quota utilization by namespace
  • Device plugin health and restart frequency
  • Pending pod counts with GPU resource requests
  • Node-level GPU allocation vs capacity ratios

Critical Alerts

# GPU quota approaching exhaustion
(kube_resourcequota{resource="limits.nvidia.com/gpu", type="used"} / kube_resourcequota{resource="limits.nvidia.com/gpu", type="hard"}) > 0.9

# Cluster GPU resource exhaustion
(sum(kube_node_status_allocatable{resource="nvidia_com_gpu"}) - sum(kube_pod_container_resource_requests{resource="nvidia_com_gpu"})) == 0

# Device plugin crash detection (restart counters only ever increase, so alert on recent growth)
increase(kube_pod_container_status_restarts_total{container="nvidia-device-plugin"}[15m]) > 0
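
If you run the Prometheus Operator, these expressions drop straight into a PrometheusRule. A sketch for the quota alert (names, namespace, severity, and threshold are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-quota-alerts
  namespace: monitoring
spec:
  groups:
  - name: gpu-quota
    rules:
    - alert: GPUQuotaNearlyExhausted
      expr: |
        (kube_resourcequota{resource="limits.nvidia.com/gpu", type="used"}
          / kube_resourcequota{resource="limits.nvidia.com/gpu", type="hard"}) > 0.9
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "GPU quota in namespace {{ $labels.namespace }} is above 90% used"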

Implementation Decision Criteria

When to Use Time-Slicing vs MIG

Time-Slicing Appropriate When:

  • Inference workloads with variable resource needs
  • Multiple small workloads can share GPU temporal access
  • Mixed GPU types in cluster (T4, V100 without MIG support)

MIG Appropriate When:

  • Hard isolation required between workloads
  • Consistent resource allocation patterns
  • A100/H100 hardware available
  • Billing/chargeback requires precise resource attribution

Hardware Selection Impact

  • T4 Nodes: Maximum 2-way time-slicing recommended
  • V100 Nodes: 4-way time-slicing viable for inference
  • A100 Nodes: Choose between 8-way time-slicing or MIG partitioning
  • H100 Nodes: Prefer MIG for maximum resource utilization

Operational Intelligence Summary

Success Patterns:

  • Device plugin restarts resolve 90% of "no GPUs available" issues
  • Systematic diagnostic approach reduces debug time from 4 hours to 30 minutes
  • Proper quota configuration eliminates 95% of allocation failures
  • Node affinity targeting prevents CPU-only scheduling disasters

Failure Patterns:

  • Random configuration changes without diagnostic steps waste time
  • Ignoring CUDA version compatibility causes repeated crashes
  • Missing privileged security contexts prevent device access
  • Insufficient memory quotas cause GPU workload failures

Resource Optimization:

  • Gang scheduling essential for distributed training efficiency
  • Priority classes enable fair resource sharing in multi-tenant environments
  • Monitoring and alerting prevent prolonged outages
  • Documentation and runbooks reduce incident response time

Useful Links for Further Investigation

Essential Resources for Kubernetes GPU Allocation Troubleshooting

  • Kubernetes GPU Scheduling Documentation: Start here if you're new to GPU scheduling. Surprisingly, the examples actually work (unlike most K8s docs) and cover the fundamentals you need before things inevitably break. I've bookmarked this and referenced it probably 20 times in the past year.
  • NVIDIA GPU Operator Documentation: This literally saved my ass when the operator failed during installation at 1am and my boss was asking for an ETA. Has actual troubleshooting workflows that work, not just theory. The troubleshooting section is pure gold.
  • NVIDIA Container Toolkit Installation Guide: Must-read if your containers can see GPUs but can't use them. I spent 3 hours debugging "could not start container" errors before finding this guide. Would have saved me a lot of coffee and rage.
  • Kubernetes Device Plugin Framework: Technical specification for the device plugin architecture. Dry, but helpful for understanding how GPU resource advertisement actually works. Read this when you need to understand why device plugins keep crashing.
  • NVIDIA GPU Operator Troubleshooting Guide: Official troubleshooting guide that actually has useful commands, unlike most vendor docs. I reference this every time the operator does something stupid, which is weekly. Start here when the operator breaks.
  • NVIDIA Multi-Instance GPU (MIG) User Guide: The definitive guide for MIG configuration. Dense, but you'll need it if you're trying to partition A100s or H100s. Fair warning: MIG is finicky and will make you question your life choices.
  • NVIDIA DCGM Documentation: Data Center GPU Manager documentation for monitoring, health checks, and performance metrics. Critical for production GPU cluster monitoring.
  • NVIDIA Developer Forums - Kubernetes Section: NVIDIA engineers actually respond here, which is shocking for a vendor forum. Search first - someone else definitely had your exact problem before and hopefully got a real answer.
  • AWS EKS GPU Workload Documentation: AWS-specific guide for GPU node groups, optimized AMIs, and EKS-specific GPU configurations. Covers common EKS GPU allocation issues.
  • Google GKE GPU Documentation: Solid guide for GKE GPU clusters, including Autopilot GPU support, node pools, and monitoring configurations.
  • Azure AKS GPU Clusters: Microsoft documentation for AKS GPU configurations, Windows GPU support, and Azure-specific troubleshooting.
  • Volcano Scheduler Documentation: Gang scheduling and advanced GPU workload scheduling. Essential for distributed training and multi-GPU job coordination.
  • Kueue Resource Management: Kubernetes-native job queuing system with GPU awareness. Useful for batch workload management and resource sharing.
  • Node Feature Discovery (NFD): Automatic hardware feature detection and node labeling. Critical for automated GPU node classification and targeting.
  • Prometheus GPU Metrics Configuration: DCGM Exporter setup for GPU metrics collection in Prometheus. Includes sample queries and alerting rules for production monitoring.
  • Grafana GPU Dashboard Templates: Pre-built Grafana dashboards for GPU cluster monitoring. The NVIDIA DCGM Dashboard (https://grafana.com/grafana/dashboards/11578-nvidia-dcgm-exporter/) shows GPU utilization, memory usage, and temperature metrics in real time.
  • Azure AKS GPU Monitoring Guide: Microsoft's guide for monitoring GPU metrics in AKS clusters using Managed Prometheus and Grafana. Includes step-by-step setup instructions and example configurations.
  • k9s - Kubernetes CLI: Way faster than kubectl for debugging GPU problems. Shows resource usage in real time without memorizing a dozen kubectl commands. I wish I'd found this two years ago - it would have saved me hundreds of hours of typing the same commands over and over.
  • Kubernetes GPU Special Interest Group: Official SIG-Node working group focusing on GPU and hardware acceleration. Follow for feature development and roadmap updates.
  • Kubernetes Community Discussions: Official Kubernetes community forum with dedicated GPU troubleshooting threads. Search existing posts or ask questions about GPU allocation issues.
  • Stack Overflow Kubernetes GPU Tags: Skip the vendor forums unless you enjoy pain. Real engineers post actual solutions here with working code examples. This is where you'll find the dirty hacks that actually fix production, often at 3am when training is down and everyone is asking why.
  • GPU Validation Test Suite: NVIDIA's official GPU testing and validation tools. Includes device plugin testing and hardware verification utilities.
  • Kubernetes Resource Recommender: VPA with GPU awareness for right-sizing GPU resource requests. Helps optimize resource allocation and reduce waste.
  • Cluster API GPU Provider: Infrastructure-as-code approach to GPU cluster provisioning. Useful for consistent GPU node configuration across environments.
  • Kubernetes Disaster Recovery for GPU Workloads: Velero backup and restore procedures for GPU-enabled workloads. Critical for production GPU cluster recovery planning.
  • GPU Workload Migration Tools: Tools for migrating GPU workloads between clusters. Helpful for cluster upgrades and disaster recovery scenarios.
  • Production GPU Cluster Runbooks: Community-maintained runbooks for common GPU cluster operations. Includes incident response procedures and troubleshooting playbooks.
  • CNCF Cloud Native AI Working Group: Technical Advisory Group focused on AI/ML workload standards, including GPU best practices. Follow for industry guidance on GPU acceleration.
  • Kubernetes GPU Best Practices Guide: Official best practices for GPU resource management, security, and performance optimization in production environments.
  • Training and Certification Resources: NVIDIA's official training programs for GPU computing and Kubernetes integration. Recommended for platform teams managing GPU infrastructure.
