CUDA GPU Performance Optimization - AI Reference Guide
Executive Summary
CUDA performance optimization follows a hierarchy: algorithm-level changes can yield on the order of 100x, memory optimization around 10x, execution configuration 2-5x, and instruction-level tuning 20-50%. Most CUDA kernels are memory-bandwidth bound rather than compute bound, which makes memory optimization the primary focus.
Critical Performance Reality
Memory Bandwidth Dominance: An RTX 4090 delivers roughly 83 TFLOPS of FP32 compute but only about 1,000 GB/s of memory bandwidth. That is roughly 250 billion float reads/writes per second against 83 trillion float operations per second, so a kernel needs on the order of 300 FP32 operations per 4-byte value moved to be compute bound - most kernels don't come close to that ratio.
Occupancy Misconception: High occupancy does not equal high performance. Memory-bound kernels often perform identically from 25% to 100% occupancy. Focus on memory throughput, not thread count.
Performance Optimization Hierarchy
Level 1: Algorithm-Level Optimization (100x gains possible)
- Critical Decision: Some algorithms don't parallelize - accept CPU implementation for inherently sequential work
- Failure Mode: Optimizing inappropriate algorithms (e.g., bubble sort on GPU)
- Success Pattern: Implementing parallel-friendly algorithms (merge sort vs bubble sort)
Level 2: Memory Optimization (10x gains typical)
- Primary Bottleneck: Memory bandwidth, not compute capacity
- Coalescing Requirement: Threads in warp must access consecutive addresses
- Shared Memory Benefits: Cache frequently accessed data on-chip
- Bank Conflict Cost: 5x slowdown when multiple threads hit same shared memory bank
Level 3: Execution Configuration (2-5x gains)
- Thread Block Sizing: Balance occupancy with resource usage
- Grid Configuration: Ensure sufficient work to saturate all SMs
- Stream Utilization: Overlap computation with memory transfers
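A minimal sketch of the stream-overlap item above, assuming pinned host memory (cudaMallocHost) and an evenly divisible chunk count; the process kernel and chunk math are illustrative placeholders, not from the original text.

#include <cuda_runtime.h>

__global__ void process(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Copies chunk c while the previous chunk's kernel runs in the other stream.
void run_chunked(const float* h_src, float* d_dst, int n_total, int n_chunks) {
    cudaStream_t streams[2];
    for (int s = 0; s < 2; ++s) cudaStreamCreate(&streams[s]);

    int chunk = n_total / n_chunks;  // assumes n_chunks divides n_total
    for (int c = 0; c < n_chunks; ++c) {
        cudaStream_t st = streams[c % 2];
        float* dst = d_dst + c * chunk;
        // h_src must be pinned, or these copies fall back to synchronous behavior.
        cudaMemcpyAsync(dst, h_src + c * chunk, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        process<<<(chunk + 255) / 256, 256, 0, st>>>(dst, chunk);
    }
    for (int s = 0; s < 2; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}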
Level 4: Instruction-Level Optimization (20-50% gains)
- Loop Unrolling: Effective for small, known iteration counts
- Math Intrinsics: Use fast math when precision allows (both are illustrated in the sketch after this list)
- Register Optimization: Minimize pressure to increase occupancy
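A minimal sketch of the loop-unrolling and fast-math items; the attenuate kernel, its trip count, and the use of __expf are illustrative assumptions rather than anything from the original text.

__global__ void attenuate(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = in[i];
    // Small, compile-time trip count: #pragma unroll removes loop overhead.
    #pragma unroll
    for (int k = 0; k < 4; ++k) {
        // __expf is the fast-math intrinsic; use it only where the reduced
        // precision (vs. expf) is acceptable.
        acc *= __expf(-0.5f * acc);
    }
    out[i] = acc;
}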
Memory Coalescing - Critical Implementation Details
Coalescing Failure Impact
- Non-coalesced access can cut effective memory bandwidth by up to 8x (each scattered 4-byte load still pulls in a full 32-byte sector)
- Nsight Compute "Global Load Efficiency" below 80% indicates bandwidth waste
Structure Layout Impact
// WRONG: Array of Structures (AoS) - strided access, poor coalescing
struct RGB { unsigned char r, g, b; };
RGB* image;
// RIGHT: Structure of Arrays (SoA) - consecutive threads read consecutive addresses
float* r_channel, *g_channel, *b_channel;
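A minimal sketch of why the SoA layout coalesces, assuming float channels and a hypothetical grayscale kernel (the kernel name and weights are illustrative):

__global__ void to_gray_soa(float* gray, const float* r, const float* g,
                            const float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Thread i reads r[i], g[i], b[i]: consecutive threads touch consecutive
    // addresses, so each of the three warp loads is fully coalesced.
    gray[i] = 0.299f * r[i] + 0.587f * g[i] + 0.114f * b[i];
}

With the AoS layout, the same per-field access becomes a strided walk through RGB structs, and each warp load spans many partially used sectors.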
Detection Method
- Nsight Compute "Bytes per Request": 32+ bytes indicates coalescing, 4-8 bytes indicates failure
- Below 128 bytes per request indicates poor coalescing or cache thrashing
Shared Memory Optimization
Bank Conflict Rules
- Shared memory is split into 32 banks, each 32 bits (4 bytes) wide
- Concurrent accesses from a warp to different addresses in the same bank serialize (the 5x slowdown cited above)
- Capacity is architecture-dependent: roughly 64KB-228KB per SM, carved out of the unified L1/shared storage on recent GPUs
Bank Conflict Elimination
// WRONG: Bank conflicts on transpose
__shared__ float tile[32][32];
// RIGHT: Padding eliminates stride conflicts
__shared__ float tile[32][33]; // +1 padding
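A minimal sketch of a tiled transpose using the padded tile above, assuming a 32x8 thread block and matrix dimensions that are multiples of 32; these choices are illustrative, not from the original text.

#define TILE 32
#define ROWS 8

// Launch with dim3 block(TILE, ROWS) and dim3 grid(width/TILE, height/TILE).
__global__ void transpose(float* out, const float* in, int width, int height) {
    __shared__ float tile[TILE][TILE + 1];  // +1 column removes bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    for (int j = 0; j < TILE; j += ROWS)
        tile[threadIdx.y + j][threadIdx.x] = in[(y + j) * width + x];

    __syncthreads();

    // Swap block indices so the writes stay coalesced; reads from the padded
    // tile hit 32 different banks instead of serializing on one.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    for (int j = 0; j < TILE; j += ROWS)
        out[(y + j) * height + x] = tile[threadIdx.x][threadIdx.y + j];
}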
Resource Limits
- 48KB per thread block by default; larger dynamic allocations require an explicit opt-in (cudaFuncAttributeMaxDynamicSharedMemorySize) on recent architectures
- Don't use shared memory just to use it - poorly utilized shared memory performs worse than cached global memory
Occupancy vs Performance Reality
Memory-Bound Kernels
- Often perform identically from 25% to 100% occupancy
- Memory bandwidth ceiling means more threads don't help
- Focus on memory efficiency over thread count
Compute-Bound Kernels
- Need higher occupancy to hide instruction latency
- Rare in practice due to memory bandwidth limits
Register-Heavy Kernels
- May perform better with lower occupancy and more registers per thread
- Register spilling to local memory kills performance
CUDA 13.0 Performance Features
Green Contexts
- Purpose: Lightweight resource isolation between workloads
- Performance Cost: 5-15% overhead for isolation
- Benefit: Eliminates interference between workloads
- Use Cases: Multi-tenant inference, training job isolation
ZStandard Compression
- Improvement: 17% reduction in kernel binary size
- Benefits: Faster startup, reduced memory footprint, better instruction cache utilization
- Compatibility Risk: Older drivers cannot load ZStd-compressed kernels
CUDA Graphs
- Purpose: Eliminate kernel launch overhead
- Performance Gain: 50% CPU overhead reduction for repetitive patterns
- Effective For: Inference pipelines, training loops, multi-kernel sequences
- Ineffective For: One-off launches, dynamic parameters, conditional execution
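A minimal sketch of building a graph via stream capture, assuming the CUDA 12+ three-argument cudaGraphInstantiate signature; the step kernel and launch parameters are placeholders.

#include <cuda_runtime.h>

__global__ void step(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}

void run_with_graph(cudaStream_t stream, float* d_buf, int n, int iterations) {
    cudaGraph_t graph;
    cudaGraphExec_t instance;

    // Record the repetitive kernel sequence once; launches are captured, not run.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    step<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
    step<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
    step<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&instance, graph, 0);

    // Replay the whole sequence with a single launch call per iteration.
    for (int i = 0; i < iterations; ++i)
        cudaGraphLaunch(instance, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(instance);
    cudaGraphDestroy(graph);
}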
Critical Profiling Metrics
Memory Performance Indicators
- Global Load Efficiency: >80% good, <50% critical problem
- L1/TEX Hit Rate: >70% good, <30% indicates cache thrashing
- DRAM Utilization: >85% indicates memory bandwidth saturation
Compute Performance Indicators
- SM Utilization: Percentage of time SMs are busy
- Warp State Distribution: Time warps spend stalled vs computing
- Achieved Occupancy: Only matters for compute-bound kernels
Bank Conflict Detection
- Shared Memory Conflicts: Check for serialized access patterns
- Symptoms: High shared memory latency with low throughput
Production Failure Scenarios
Memory Coalescing Disaster
- Symptom: 12 GB/s on 900 GB/s hardware
- Cause: Array-of-Structures layout causing strided, non-coalesced access
- Solution: Structure-of-Arrays conversion
- Result: 680 GB/s achievement, 45ms to 6ms execution time
Bank Conflict Hell
- Symptom: 28% performance regression after "optimization"
- Cause: __shared__ float tile[32][32] causing stride conflicts
- Solution: Padding to tile[32][33]
- Result: 40% improvement over original
Occupancy Obsession
- Symptom: 96% occupancy, only 2% performance improvement
- Cause: Memory bandwidth saturation, not compute limitation
- Solution: Reduced thread blocks, improved cache hit rates
- Result: Lower occupancy (62%), 35% better performance
Register Spilling Mystery
- Symptom: Same code randomly 3x slower
- Cause: Non-deterministic NVCC register allocation
- Detection: Different register counts between compilations
- Solution: Explicit register limiting with -maxrregcount=32 (or per-kernel __launch_bounds__, sketched below)
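A per-kernel alternative worth knowing: __launch_bounds__ constrains the register budget for a single kernel instead of the whole compilation unit. The stencil kernel and the bounds below are illustrative assumptions.

// 256 threads per block, at least 4 resident blocks per SM: the compiler
// allocates registers so this kernel can meet that occupancy target.
__global__ void __launch_bounds__(256, 4)
stencil(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
}
// Compile with -Xptxas -v to confirm the register count the compiler settled on.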
Multi-GPU Scaling Wall
- Symptom: 40% efficiency on 4 GPUs, 15% on 8 GPUs
- Expected Cause: NCCL communication overhead
- Actual Cause: Single-threaded CPU preprocessing bottleneck
- Solution: Parallel data loading with prefetch
- Result: 75% efficiency on 8 GPUs
Profiling Tool Effectiveness
Tool | Best For | Accuracy | Production Ready | Key Limitation |
---|---|---|---|---|
nvidia-smi | Quick health check | Basic metrics only | Yes | Surface-level only |
Nsight Compute | Kernel optimization | Best-in-class | Yes | Steep learning curve |
Nsight Systems | Timeline analysis | Excellent for CPU-GPU | Yes | Not kernel-deep |
nvprof | Legacy profiling | Good for compute-bound | Deprecated CUDA 12+ | No longer supported |
Critical Configuration Settings
Memory Management
- cudaMalloc vs cudaMallocManaged: Explicit management wins in production (10-30% overhead for Unified Memory)
- Prefetch Strategy: Essential for multi-GPU scaling
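A minimal sketch contrasting the two paths, assuming the classic int-device cudaMemPrefetchAsync signature; the scale kernel, sizes, and device id are placeholders.

#include <cuda_runtime.h>

__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 0.5f;
}

// Explicit management: allocation, copies, and placement stay under program
// control, which is why it tends to win in production.
void explicit_path(const float* h_in, float* h_out, int n, cudaStream_t stream) {
    float* d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));
    cudaMemcpyAsync(d_buf, h_in, n * sizeof(float), cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
    cudaMemcpyAsync(h_out, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
    cudaFree(d_buf);
}

// Unified Memory: simpler, but prefetching is what keeps the overhead down
// by migrating pages before the kernel (or the CPU) touches them.
void managed_path(int n, int device, cudaStream_t stream) {
    float* buf;
    cudaMallocManaged(&buf, n * sizeof(float));
    cudaMemPrefetchAsync(buf, n * sizeof(float), device, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(buf, n);
    cudaMemPrefetchAsync(buf, n * sizeof(float), cudaCpuDeviceId, stream);
    cudaStreamSynchronize(stream);
    cudaFree(buf);
}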
Compilation Flags
- Register Control: -maxrregcount=N for consistent performance
- Architecture Targeting: -arch=compute_89 -code=sm_89 for RTX 4090
- Verbose Output: -Xptxas -v to check register usage
Runtime Detection
- Register Spilling: Watch for unexpected local memory traffic during execution (spill counts also show up in -Xptxas -v output at build time)
- Bank Conflicts: Profile shared memory access patterns
- Coalescing Issues: Check bytes per memory request
Hardware-Specific Considerations
Architecture Differences
- RTX 4090: far more cores than an RTX 3080, but only modestly higher total memory bandwidth, so bandwidth per core actually drops
- Memory-bound kernels perform similarly across generations
- Compute capability affects available features and optimizations
Memory Bandwidth Reality
- Consumer GPUs: Limited memory bandwidth vs compute capability
- Tesla/Datacenter: Better balanced memory bandwidth
- Memory controller contention increases with thread count
Decision Support Framework
When to Optimize
- Memory bandwidth utilization <70%: Focus on coalescing and cache optimization
- High occupancy, low SM efficiency: Address warp divergence and memory stalls
- CPU overhead >5%: Consider CUDA Graphs for launch optimization
- Multi-GPU scaling <70%: Check data pipeline bottlenecks first
When to Accept Current Performance
- Memory bandwidth >85% utilized: Near hardware limits
- Algorithm inherently sequential: GPU may not be appropriate
- Development time exceeds performance benefit: Optimization has diminishing returns
Architecture Migration Risks
- Code optimized for one GPU generation may perform poorly on another
- Always profile on target production hardware
- Consumer GPU optimizations may not transfer to datacenter hardware
Common Misconceptions That Cause Failures
- "Higher occupancy always improves performance" - Memory-bound kernels see no benefit
- "More thread blocks always help" - Can cause memory bandwidth saturation
- "Shared memory is always faster" - Poorly used shared memory underperforms cached global memory
- "Tensor Cores accelerate everything" - Only benefits specific mixed-precision matrix operations
- "CUDA optimization is deterministic" - Compiler behavior varies, requiring explicit controls
Useful Links for Further Investigation
CUDA Performance Resources That Don't Suck
Link | Description |
---|---|
CUDA C++ Best Practices Guide | The only NVIDIA doc that's practical instead of theoretical. Skip the intro chapters, jump to "Memory Optimization" and "Execution Configuration Optimizations". |
Nsight Compute Documentation | Dense but comprehensive. Focus on the "Profiling Guide" section - it explains what all those cryptic metrics actually mean. |
CUDA Programming Guide - Performance Guidelines | Dry but accurate. The memory coalescing examples are worth understanding even if you never write C++ CUDA. |
CUDA MODE YouTube Series | Real engineers explaining real optimization problems. Skip the intro lectures, watch the kernel optimization episodes. |
GPU MODE Lecture Series | Community-driven CUDA optimization lectures and resources. Focus on practical optimization techniques over theory. |
Nsight Compute CLI Reference | Command-line profiling without the GUI frustration. Essential for batch profiling and CI integration. |
Simon Boehm's CUDA Matrix Multiplication Tutorial | Step-by-step optimization of a real kernel. Shows the profiling workflow and optimization thought process. |
CUDA Performance Guidelines from Purdue | University course materials with practical examples. PDFs are dense but cover memory coalescing and shared memory well. |
cuBLAS Source Code Analysis | NVIDIA's own matrix multiplication optimizations. Warning: extremely complex, but shows production-level optimization techniques. |
PyTorch CUDA Kernels | Real-world kernels handling irregular data sizes and edge cases. Good examples of robust optimization. |
Thrust Library Implementation | Shows how to write generic, high-performance CUDA code. The reduction and scan implementations are educational. |
GPU Memory Coalescing Visualization | Interactive examples showing coalesced vs non-coalesced access patterns. Finally makes the concept click. |
CUDA Memory Hierarchy Guide | Explains the hardware reality behind memory optimization advice. |
Compute Sanitizer Documentation | Essential CUDA memory debugging tool that replaced cuda-memcheck. Critical for finding memory errors. |
Nsight Systems Timeline Analysis | Official tutorials for timeline profiling. The multi-GPU analysis section is particularly useful. |
NVIDIA GPU Architecture Whitepapers | Hardware specifications for compute capability, memory bandwidth, cache sizes. Essential for understanding architecture limits. |
CUDA Toolkit Release Notes | New features and breaking changes in recent CUDA versions. CUDA 13.0 has significant performance-related changes. |
CUDA Stack Overflow | High-quality voted answers to optimization questions. Filter by votes to find the most reliable solutions. |
NVIDIA Developer Forums - CUDA Programming | NVIDIA engineers sometimes answer questions. Search before posting - they've answered most performance questions already. |
Related Tools & Recommendations
Anthropic Raises $13B at $183B Valuation: AI Bubble Peak or Actual Revenue?
Another AI funding round that makes no sense - $183 billion for a chatbot company that burns through investor money faster than AWS bills in a misconfigured k8s
Docker Desktop Hit by Critical Container Escape Vulnerability
CVE-2025-9074 exposes host systems to complete compromise through API misconfiguration
Yarn Package Manager - npm's Faster Cousin
Explore Yarn Package Manager's origins, its advantages over npm, and the practical realities of using features like Plug'n'Play. Understand common issues and be
PostgreSQL Alternatives: Escape Your Production Nightmare
When the "World's Most Advanced Open Source Database" Becomes Your Worst Enemy
AWS RDS Blue/Green Deployments - Zero-Downtime Database Updates
Explore Amazon RDS Blue/Green Deployments for zero-downtime database updates. Learn how it works, deployment steps, and answers to common FAQs about switchover
Three Stories That Pissed Me Off Today
Explore the latest tech news: You.com's funding surge, Tesla's robotaxi advancements, and the surprising quiet launch of Instagram's iPad app. Get your daily te
Aider - Terminal AI That Actually Works
Explore Aider, the terminal-based AI coding assistant. Learn what it does, how to install it, and get answers to common questions about API keys and costs.
jQuery - The Library That Won't Die
Explore jQuery's enduring legacy, its impact on web development, and the key changes in jQuery 4.0. Understand its relevance for new projects in 2025.
vtenext CRM Allows Unauthenticated Remote Code Execution
Three critical vulnerabilities enable complete system compromise in enterprise CRM platform
Django Production Deployment - Enterprise-Ready Guide for 2025
From development server to bulletproof production: Docker, Kubernetes, security hardening, and monitoring that doesn't suck
HeidiSQL - Database Tool That Actually Works
Discover HeidiSQL, the efficient database management tool. Learn what it does, its benefits over DBeaver & phpMyAdmin, supported databases, and if it's free to
Fix Redis "ERR max number of clients reached" - Solutions That Actually Work
When Redis starts rejecting connections, you need fixes that work in minutes, not hours
QuickNode - Blockchain Nodes So You Don't Have To
Runs 70+ blockchain nodes so you can focus on building instead of debugging why your Ethereum node crashed again
Get Alpaca Market Data Without the Connection Constantly Dying on You
WebSocket Streaming That Actually Works: Stop Polling APIs Like It's 2005
OpenAI Alternatives That Won't Bankrupt You
Bills getting expensive? Yeah, ours too. Here's what we ended up switching to and what broke along the way.
Migrate JavaScript to TypeScript Without Losing Your Mind
A battle-tested guide for teams migrating production JavaScript codebases to TypeScript
Docker Compose 2.39.2 and Buildx 0.27.0 Released with Major Updates
Latest versions bring improved multi-platform builds and security fixes for containerized applications
Google Vertex AI - Google's Answer to AWS SageMaker
Google's ML platform that combines their scattered AI services into one place. Expect higher bills than advertised but decent Gemini model access if you're alre
Google NotebookLM Goes Global: Video Overviews in 80+ Languages
Google's AI research tool just became usable for non-English speakers who've been waiting months for basic multilingual support
Figma Gets Lukewarm Wall Street Reception Despite AI Potential - August 25, 2025
Major investment banks issue neutral ratings citing $37.6B valuation concerns while acknowledging design platform's AI integration opportunities
Recommendations combine user behavior, content similarity, research intelligence, and SEO optimization