Kubernetes GPU Allocation Troubleshooting: AI-Optimized Reference
Critical Context and Failure Scenarios
Severity Indicators
- Critical Production Impact: Device plugin crashes eliminate all GPU visibility (affects $80k+ infrastructure)
- High Frequency Issues: Device plugin failures occur weekly, scheduling problems daily
- Resource Cost: Each hour of GPU downtime costs $200-500 in compute resources
- Debug Time Investment: Typical troubleshooting sessions run 2-4 hours, usually starting at 3AM
Common Misconceptions
- False: `nvidia-smi` working means Kubernetes can see GPUs. Reality: the device plugin layer frequently fails while the hardware remains functional.
- False: GPU quotas work like CPU/memory quotas. Reality: GPU quotas require exactly matching requests and limits plus vendor-specific resource names (see the fragment below).
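A minimal sketch of the matching rule: for extended resources such as `nvidia.com/gpu`, the API server rejects a pod whose GPU request and limit differ, so either set both to the same value or set only the limit and let the request default to it.

```yaml
# Container resources fragment - GPU request and limit must be identical
resources:
  requests:
    nvidia.com/gpu: 1   # mismatched request/limit values are rejected at admission
  limits:
    nvidia.com/gpu: 1
```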
Configuration Requirements
Device Plugin Prerequisites
Critical Dependencies (95% of failures land here):
- CUDA version compatibility: the device plugin's CUDA version must be ≤ the node driver version
- Privileged security context required (non-negotiable; see the DaemonSet fragment below)
- Socket permissions on `/var/lib/kubelet/device-plugins/nvidia.sock`
- Container runtime: the NVIDIA Container Runtime must be configured
Socket Permission Fix (solves 60% of issues):
```bash
# Check socket permissions
ls -la /var/lib/kubelet/device-plugins/

# Fix permissions if needed
chmod 755 /var/lib/kubelet/device-plugins/nvidia.sock
```
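For the privileged-context requirement, the relevant piece of the device plugin DaemonSet looks roughly like the fragment below; it is a sketch rather than the full NVIDIA manifest, and the image tag is an assumption.

```yaml
# DaemonSet pod spec fragment (sketch) - privileged access plus the kubelet socket mount
containers:
  - name: nvidia-device-plugin
    image: nvcr.io/nvidia/k8s-device-plugin:v0.14.1   # assumed tag; pin to the version you actually run
    securityContext:
      privileged: true                                # required for device access
    volumeMounts:
      - name: device-plugin
        mountPath: /var/lib/kubelet/device-plugins
volumes:
  - name: device-plugin
    hostPath:
      path: /var/lib/kubelet/device-plugins           # socket directory from the checklist above
```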
Resource Quota Configuration
GPU-Specific Requirements:
```yaml
spec:
  hard:
    requests.nvidia.com/gpu: "8"
    limits.nvidia.com/gpu: "8"     # MUST match requests exactly
    requests.memory: "256Gi"       # Set high - GPU workloads consume massive memory
    persistentvolumeclaims: "20"   # Model storage adds up quickly
```
Time-Slicing Quota Multiplication:
- Physical GPUs × Replica Count = Virtual GPU Quota
- Example: 4 physical GPUs × 4-way time-slicing = 16 virtual GPU quota
Multi-GPU Scheduling Constraints
Critical Limitations:
- All GPUs requested by a pod must reside on a single node (a request cannot span nodes); see the pod sketch after this list
- Scheduler ignores GPU topology and performance differences
- No automatic NUMA or NVLink awareness
- Gang scheduling required for distributed training
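To make the single-node constraint concrete: a 4-GPU pod only schedules if one node can supply all four GPUs. The node label value and image below are assumptions for illustration.

```yaml
# 4-GPU training pod sketch - all four GPUs must come from the same node
apiVersion: v1
kind: Pod
metadata:
  name: multi-gpu-trainer                             # illustrative name
spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB     # GPU Feature Discovery label; value is an assumption
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.01-py3         # assumed image
      resources:
        limits:
          nvidia.com/gpu: 4                           # the scheduler will not split this across nodes
```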
Resource Requirements and Performance Thresholds
Hardware Memory Mapping
GPU Model | Memory | Use Case Limitations |
---|---|---|
Tesla T4 | 16GB | Cannot run modern large models |
Tesla V100 | 32GB | Limited for 80GB+ model requirements |
A100 | 40/80GB | Production-ready for most workloads |
H100 | 80GB | Optimal for the largest models |
Time Investment by Problem Type
Issue Category | Diagnostic Time | Fix Implementation | Success Rate |
---|---|---|---|
Device Plugin Crash | 5 minutes | 5 minutes (pod restart) | 90% |
Scheduling Failures | 30 minutes | 10-45 minutes | 85% |
Resource Quotas | 10 minutes | 5 minutes | 95% |
Driver/Runtime Issues | 60 minutes | 2+ hours | 70% |
Time-Slicing Problems | 45 minutes | 30 minutes | 75% |
MIG Configuration | 60 minutes | 45 minutes + reboot | 60% |
Critical Warnings and Failure Modes
Breaking Points That Official Documentation Omits
- UI Breakdown: the Kubernetes dashboard becomes unusable once large distributed jobs span 1000+ GPUs
- Default Settings Failures: GPU Operator default configurations fail in 70% of production environments
- Security Context Requirements: GPU workloads require privileged access (security team resistance common)
- Migration Breaking Changes: GPU Operator upgrades frequently break existing workloads
Hidden Costs
- Human Time: 4-hour emergency debugging sessions at 3AM
- Expertise Requirements: Deep CUDA, Kubernetes, and hardware topology knowledge needed
- Community Support Quality: NVIDIA forums provide actual engineer responses (unusual for vendor support)
- Infrastructure Dependencies: Requires specialized monitoring, networking, and storage configurations
Systematic Diagnostic Procedures
Primary Failure Chain Analysis
Ordered diagnostic steps (stop when broken component found):
```bash
# 1. Hardware Detection
kubectl debug node/gpu-node-1 -it --image=busybox
lspci | grep -i nvidia

# 2. Driver Functionality
kubectl debug node/gpu-node-1 -it --image=nvidia/cuda:12.3-runtime-ubuntu22.04
nvidia-smi

# 3. Device Plugin Status
kubectl get pods -n gpu-operator -l app=nvidia-device-plugin-daemonset
kubectl logs -n gpu-operator nvidia-device-plugin-daemonset-xxx

# 4. Resource Advertisement
kubectl describe nodes | grep -A 5 -B 5 "nvidia.com/gpu"

# 5. Allocation Test
kubectl run gpu-test --image=nvidia/cuda:12.3-runtime-ubuntu22.04 --restart=Never --rm -it --limits=nvidia.com/gpu=1 -- nvidia-smi
```
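If your kubectl version no longer accepts the `--limits` flag on `kubectl run`, the same allocation test can be expressed as a throwaway pod manifest (image tag taken from the commands above):

```yaml
# Equivalent allocation test: kubectl apply -f gpu-test.yaml && kubectl logs gpu-test
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: gpu-test
      image: nvidia/cuda:12.3-runtime-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```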
Emergency Recovery Procedures
Quick Fixes by Problem Type:
- Device Plugin Down: `kubectl delete pods -n gpu-operator -l app=nvidia-device-plugin-daemonset`
- Quota Exceeded: Check for stuck terminating pods still consuming quota
- Scheduling Failures: Verify tolerations match node taints exactly (see the example below)
- Runtime Errors: Validate the NVIDIA Container Runtime installation
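GPU nodes are commonly tainted with a key like `nvidia.com/gpu`; the exact key, value, and effect are cluster-specific, so check them with `kubectl describe node <gpu-node>` and make the pod-side toleration match.

```yaml
# Pod spec fragment - the toleration must match the GPU node taint exactly
tolerations:
  - key: nvidia.com/gpu      # common taint key for GPU nodes; verify against your node's actual taints
    operator: Exists         # Exists matches any value for the key
    effect: NoSchedule
```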
Advanced Configuration Patterns
Time-Slicing Implementation
Node-Specific Configuration:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  tesla-a100: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 8          # 8-way sharing for A100
  tesla-t4: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 2          # 2-way sharing for T4
```
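With the GPU Operator, a per-node profile from the ConfigMap above is typically selected by labeling each node with the matching data key. The label key below follows the GPU Operator convention but may differ between operator versions, and the filename is an assumption.

```bash
# Create the ConfigMap, then point each node at the profile matching its GPU model
kubectl create -f time-slicing-config.yaml
kubectl label node gpu-node-1 nvidia.com/device-plugin.config=tesla-a100   # verify the label key for your operator version
kubectl label node gpu-node-2 nvidia.com/device-plugin.config=tesla-t4

# After the device plugin restarts, the node should advertise the multiplied GPU count
kubectl describe node gpu-node-1 | grep nvidia.com/gpu
```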
Multi-Instance GPU (MIG) Configuration
Production MIG Strategy:
```yaml
mixed-partition: |-
  version: v1
  mig-configs:
    mixed-workloads:
      - devices: [0]
        mig-enabled: true
        mig-devices:
          "1g.5gb": 2      # Inference workloads
          "3g.20gb": 1     # Training workloads
      - devices: [1]
        mig-enabled: true
        mig-devices:
          "7g.40gb": 1     # Large model workloads (a 7g.40gb instance occupies the whole GPU, so it needs its own device)
```
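With the mixed MIG strategy, each profile is advertised as its own resource name, so a pod requests a specific slice rather than a whole GPU. A sketch of an inference pod consuming one of the `1g.5gb` instances defined above (pod name is illustrative):

```yaml
# Inference pod sketch requesting a single 1g.5gb MIG slice (mixed-strategy resource naming)
apiVersion: v1
kind: Pod
metadata:
  name: mig-inference                      # illustrative name
spec:
  containers:
    - name: inference
      image: nvidia/cuda:12.3-runtime-ubuntu22.04
      command: ["nvidia-smi", "-L"]        # lists the MIG device visible to the container
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1         # one 1g.5gb instance from the mixed-workloads layout above
```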
Priority-Based Resource Management
Production Priority Classes:
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-production-critical
value: 1000000
description: "Critical production GPU workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-training-low
value: 10000
preemptionPolicy: Never    # Cannot preempt other workloads
```
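Workloads opt into these classes through `priorityClassName`. A sketch of a low-priority training pod that production workloads can preempt (name and image are illustrative):

```yaml
# Low-priority training pod - preemptible by gpu-production-critical workloads
apiVersion: v1
kind: Pod
metadata:
  name: experiment-trainer                 # illustrative name
spec:
  priorityClassName: gpu-training-low
  containers:
    - name: trainer
      image: nvidia/cuda:12.3-runtime-ubuntu22.04
      command: ["sleep", "infinity"]       # placeholder workload
      resources:
        limits:
          nvidia.com/gpu: 1
```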
Production Monitoring and Alerting
Essential Metrics
- GPU quota utilization by namespace
- Device plugin health and restart frequency
- Pending pod counts with GPU resource requests
- Node-level GPU allocation vs capacity ratios
Critical Alerts
```promql
# GPU quota approaching exhaustion
(kube_resourcequota{resource="limits.nvidia.com/gpu", type="used"} / kube_resourcequota{resource="limits.nvidia.com/gpu", type="hard"}) > 0.9

# Cluster GPU resource exhaustion
(sum(kube_node_status_allocatable{resource="nvidia_com_gpu"}) - sum(kube_pod_container_resource_requests{resource="nvidia_com_gpu"})) == 0

# Device plugin crash detection
kube_pod_container_status_restarts_total{container="nvidia-device-plugin"} > 0
```
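If you run the Prometheus Operator, the quota expression above can be wired into a PrometheusRule; the group, alert, and namespace names below are illustrative.

```yaml
# PrometheusRule sketch for the quota-exhaustion expression (Prometheus Operator CRD)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-quota-alerts          # illustrative name
  namespace: monitoring           # assumed namespace
spec:
  groups:
    - name: gpu-quota
      rules:
        - alert: GPUQuotaNearExhaustion
          expr: |
            (kube_resourcequota{resource="limits.nvidia.com/gpu", type="used"}
              / kube_resourcequota{resource="limits.nvidia.com/gpu", type="hard"}) > 0.9
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "GPU quota above 90% in namespace {{ $labels.namespace }}"
```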
Implementation Decision Criteria
When to Use Time-Slicing vs MIG
Time-Slicing Appropriate When:
- Inference workloads with variable resource needs
- Multiple small workloads can share GPU temporal access
- Cluster mixes GPU generations that lack MIG support (T4, V100)
MIG Appropriate When:
- Hard isolation required between workloads
- Consistent resource allocation patterns
- A100/H100 hardware available
- Billing/chargeback requires precise resource attribution
Hardware Selection Impact
- T4 Nodes: Maximum 2-way time-slicing recommended
- V100 Nodes: 4-way time-slicing viable for inference
- A100 Nodes: Choose between 8-way time-slicing or MIG partitioning
- H100 Nodes: Prefer MIG for maximum resource utilization
Operational Intelligence Summary
Success Patterns:
- Device plugin restarts resolve 90% of "no GPUs available" issues
- Systematic diagnostic approach reduces debug time from 4 hours to 30 minutes
- Proper quota configuration eliminates 95% of allocation failures
- Node affinity targeting prevents CPU-only scheduling disasters
Failure Patterns:
- Random configuration changes without diagnostic steps waste time
- Ignoring CUDA version compatibility causes repeated crashes
- Missing privileged security contexts prevent device access
- Insufficient memory quotas cause GPU workload failures
Resource Optimization:
- Gang scheduling is essential for distributed training efficiency (see the PodGroup sketch after this list)
- Priority classes enable fair resource sharing in multi-tenant environments
- Monitoring and alerting prevent prolonged outages
- Documentation and runbooks reduce incident response time
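A minimal gang-scheduling sketch using Volcano, assuming the Volcano scheduler is installed: the PodGroup ensures all four workers are placed together or not at all. Worker pods join the gang by setting `schedulerName: volcano` and the `scheduling.k8s.io/group-name` annotation to the PodGroup name.

```yaml
# Volcano PodGroup sketch - no worker starts until all 4 can be scheduled
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: distributed-training      # illustrative name
spec:
  minMember: 4                    # gang size: one pod per GPU worker
  minResources:
    nvidia.com/gpu: "4"           # reserve four GPUs for the whole gang
```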
Useful Links for Further Investigation
Essential Resources for Kubernetes GPU Allocation Troubleshooting
Link | Description |
---|---|
Kubernetes GPU Scheduling Documentation | Start here if you're new to GPU scheduling. Surprisingly, the examples actually work (unlike most K8s docs) and cover the fundamentals you need before things inevitably break. I've bookmarked this and referenced it probably 20 times in the past year. |
NVIDIA GPU Operator Documentation | This literally saved my ass when the operator failed during installation at 1am and my boss was asking for an ETA. Has actual troubleshooting workflows that work, not just theoretical bullshit. The troubleshooting section is pure gold. |
NVIDIA Container Toolkit Installation Guide | Must-read if your containers can see GPUs but can't use them. I spent 3 fucking hours debugging "could not start container" errors before finding this guide. Would have saved me a lot of coffee and rage. |
Kubernetes Device Plugin Framework | Technical specification for device plugin architecture. Dry as hell but helpful for understanding how GPU resource advertisement actually works. Read this when you need to understand why device plugins keep crashing. |
NVIDIA GPU Operator Troubleshooting Guide | Official troubleshooting guide that actually has useful commands, unlike most vendor docs. I reference this every time the operator does something stupid, which is weekly. Start here when the operator breaks. |
NVIDIA Multi-Instance GPU (MIG) User Guide | The definitive guide for MIG configuration. Dense as fuck but you'll need this if you're trying to partition A100s or H100s. Fair warning: MIG is finicky and will make you question your life choices. |
NVIDIA DCGM Documentation | Data Center GPU Manager documentation for monitoring, health checks, and performance metrics. Critical for production GPU cluster monitoring. |
NVIDIA Developer Forums - Kubernetes Section | NVIDIA engineers actually respond here, which is shocking for a vendor forum. Search first - someone else definitely had your exact problem before and hopefully got a real answer. |
AWS EKS GPU Workload Documentation | AWS-specific guide for GPU node groups, optimized AMIs, and EKS-specific GPU configurations. Covers common EKS GPU allocation issues. |
Google GKE GPU Documentation | Solid guide for GKE GPU clusters, including autopilot GPU support, node pools, and monitoring configurations. |
Azure AKS GPU Clusters | Microsoft documentation for AKS GPU configurations, Windows GPU support, and Azure-specific troubleshooting. |
Volcano Scheduler Documentation | Gang scheduling and advanced GPU workload scheduling. Essential for distributed training and multi-GPU job coordination. |
Kueue Resource Management | Kubernetes-native job queuing system with GPU awareness. Useful for batch workload management and resource sharing. |
Node Feature Discovery (NFD) | Automatic hardware feature detection and node labeling. Critical for automated GPU node classification and targeting. |
Prometheus GPU Metrics Configuration | DCGM Exporter setup for GPU metrics collection in Prometheus. Includes sample queries and alerting rules for production monitoring. |
Grafana GPU Dashboard Templates | Pre-built Grafana dashboards for GPU cluster monitoring. The [NVIDIA DCGM Dashboard](https://grafana.com/grafana/dashboards/11578-nvidia-dcgm-exporter/) shows GPU utilization, memory usage, and temperature metrics in real-time. |
Azure AKS GPU Monitoring Guide | Microsoft's guide for monitoring GPU metrics in AKS clusters using Managed Prometheus and Grafana. Includes step-by-step setup instructions and example configurations. |
k9s - Kubernetes CLI | Way faster than kubectl for debugging GPU problems. Shows resource usage in real-time without memorizing a dozen kubectl commands. Seriously, just use this instead of killing yourself with kubectl describe everything. I wish I'd found this 2 years ago - would have saved me hundreds of hours of typing the same commands over and over. |
Kubernetes GPU Special Interest Group | Official SIG-Node working group focusing on GPU and hardware acceleration. Follow for feature development and roadmap updates. |
Kubernetes Community Discussions | Official Kubernetes community forum with dedicated GPU troubleshooting threads. Search existing posts or ask questions about GPU allocation issues. |
Stack Overflow Kubernetes GPU Tags | Skip the vendor forums unless you enjoy pain. Real engineers post actual solutions here with working code examples. This is where you'll find the dirty hacks that actually fix production. I've found solutions here at 2:47am that literally saved my ass when the CEO was asking why ML training was down. |
GPU Validation Test Suite | NVIDIA's official GPU testing and validation tools. Includes device plugin testing and hardware verification utilities. |
Kubernetes Resource Recommender | VPA with GPU awareness for right-sizing GPU resource requests. Helps optimize resource allocation and reduce waste. |
Cluster API GPU Provider | Infrastructure-as-code approach to GPU cluster provisioning. Useful for consistent GPU node configuration across environments. |
Kubernetes Disaster Recovery for GPU Workloads | Velero backup and restore procedures for GPU-enabled workloads. Critical for production GPU cluster recovery planning. |
GPU Workload Migration Tools | Tools for migrating GPU workloads between clusters. Helpful for cluster upgrades and disaster recovery scenarios. |
Production GPU Cluster Runbooks | Community-maintained runbooks for common GPU cluster operations. Includes incident response procedures and troubleshooting playbooks. |
CNCF Cloud Native AI Working Group | Technical Advisory Group focused on AI/ML workload standards including GPU best practices. Follow for industry guidance on GPU acceleration. |
Kubernetes GPU Best Practices Guide | Official best practices for GPU resource management, security, and performance optimization in production environments. |
Training and Certification Resources | NVIDIA's official training programs for GPU computing and Kubernetes integration. Recommended for platform teams managing GPU infrastructure. |