CUDA GPU Performance Optimization - AI Reference Guide
Executive Summary
CUDA performance optimization follows a hierarchy: algorithm-level changes can yield on the order of 100x, memory optimization around 10x, execution configuration 2-5x, and instruction-level tuning 20-50%. Most CUDA kernels are memory-bandwidth bound rather than compute bound, which makes memory optimization the primary focus.
Critical Performance Reality
Memory Bandwidth Dominance: An RTX 4090 delivers roughly 83 TFLOPS of FP32 compute but only about 1,000 GB/s of memory bandwidth. That is roughly 250 billion float reads/writes per second against 83 trillion float operations per second, so a kernel needs on the order of 300 FP32 operations per 4-byte value moved to be compute bound - most kernels don't come close to that ratio.
Occupancy Misconception: High occupancy does not equal high performance. Memory-bound kernels often perform identically from 25% to 100% occupancy. Focus on memory throughput, not thread count.
Performance Optimization Hierarchy
Level 1: Algorithm-Level Optimization (100x gains possible)
- Critical Decision: Some algorithms don't parallelize - accept CPU implementation for inherently sequential work
- Failure Mode: Optimizing inappropriate algorithms (e.g., bubble sort on GPU)
- Success Pattern: Implementing parallel-friendly algorithms (merge sort vs bubble sort)
Level 2: Memory Optimization (10x gains typical)
- Primary Bottleneck: Memory bandwidth, not compute capacity
- Coalescing Requirement: Threads in warp must access consecutive addresses
- Shared Memory Benefits: Cache frequently accessed data on-chip
- Bank Conflict Cost: 5x slowdown when multiple threads hit same shared memory bank
Level 3: Execution Configuration (2-5x gains)
- Thread Block Sizing: Balance occupancy with resource usage
- Grid Configuration: Ensure sufficient work to saturate all SMs
- Stream Utilization: Overlap computation with memory transfers
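A minimal sketch of the stream-overlap item above, assuming pinned host memory (cudaMallocHost) and an evenly divisible chunk count; the process kernel and chunk math are illustrative placeholders, not from the original text.

#include <cuda_runtime.h>

__global__ void process(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Copies chunk c while the previous chunk's kernel runs in the other stream.
void run_chunked(const float* h_src, float* d_dst, int n_total, int n_chunks) {
    cudaStream_t streams[2];
    for (int s = 0; s < 2; ++s) cudaStreamCreate(&streams[s]);

    int chunk = n_total / n_chunks;  // assumes n_chunks divides n_total
    for (int c = 0; c < n_chunks; ++c) {
        cudaStream_t st = streams[c % 2];
        float* dst = d_dst + c * chunk;
        // h_src must be pinned, or these copies fall back to synchronous behavior.
        cudaMemcpyAsync(dst, h_src + c * chunk, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        process<<<(chunk + 255) / 256, 256, 0, st>>>(dst, chunk);
    }
    for (int s = 0; s < 2; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}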
Level 4: Instruction-Level Optimization (20-50% gains)
- Loop Unrolling: Effective for small, known iteration counts
- Math Intrinsics: Use fast math when precision allows (both are illustrated in the sketch after this list)
- Register Optimization: Minimize pressure to increase occupancy
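A minimal sketch of the loop-unrolling and fast-math items; the attenuate kernel, its trip count, and the use of __expf are illustrative assumptions rather than anything from the original text.

__global__ void attenuate(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = in[i];
    // Small, compile-time trip count: #pragma unroll removes loop overhead.
    #pragma unroll
    for (int k = 0; k < 4; ++k) {
        // __expf is the fast-math intrinsic; use it only where the reduced
        // precision (vs. expf) is acceptable.
        acc *= __expf(-0.5f * acc);
    }
    out[i] = acc;
}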
Memory Coalescing - Critical Implementation Details
Coalescing Failure Impact
- Non-coalesced access can cut effective memory bandwidth by up to 8x (each scattered 4-byte load still pulls in a full 32-byte sector)
- Nsight Compute "Global Load Efficiency" below 80% indicates bandwidth waste
Structure Layout Impact
// WRONG: Array of Structures (AoS) - strided access, poor coalescing
struct RGB { unsigned char r, g, b; };
RGB* image;
// RIGHT: Structure of Arrays (SoA) - consecutive threads read consecutive addresses
float* r_channel, *g_channel, *b_channel;
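A minimal sketch of why the SoA layout coalesces, assuming float channels and a hypothetical grayscale kernel (the kernel name and weights are illustrative):

__global__ void to_gray_soa(float* gray, const float* r, const float* g,
                            const float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Thread i reads r[i], g[i], b[i]: consecutive threads touch consecutive
    // addresses, so each of the three warp loads is fully coalesced.
    gray[i] = 0.299f * r[i] + 0.587f * g[i] + 0.114f * b[i];
}

With the AoS layout, the same per-field access becomes a strided walk through RGB structs, and each warp load spans many partially used sectors.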
Detection Method
- Nsight Compute "Bytes per Request": 32+ bytes indicates coalescing, 4-8 bytes indicates failure
- Below 128 bytes per request indicates poor coalescing or cache thrashing
Shared Memory Optimization
Bank Conflict Rules
- Shared memory is split into 32 banks, each 32 bits (4 bytes) wide
- Concurrent accesses from a warp to different addresses in the same bank serialize (the 5x slowdown cited above)
- Capacity is architecture-dependent: roughly 64KB-228KB per SM, carved out of the unified L1/shared storage on recent GPUs
Bank Conflict Elimination
// WRONG: Bank conflicts on transpose
__shared__ float tile[32][32];
// RIGHT: Padding eliminates stride conflicts
__shared__ float tile[32][33]; // +1 padding
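A minimal sketch of a tiled transpose using the padded tile above, assuming a 32x8 thread block and matrix dimensions that are multiples of 32; these choices are illustrative, not from the original text.

#define TILE 32
#define ROWS 8

// Launch with dim3 block(TILE, ROWS) and dim3 grid(width/TILE, height/TILE).
__global__ void transpose(float* out, const float* in, int width, int height) {
    __shared__ float tile[TILE][TILE + 1];  // +1 column removes bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    for (int j = 0; j < TILE; j += ROWS)
        tile[threadIdx.y + j][threadIdx.x] = in[(y + j) * width + x];

    __syncthreads();

    // Swap block indices so the writes stay coalesced; reads from the padded
    // tile hit 32 different banks instead of serializing on one.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    for (int j = 0; j < TILE; j += ROWS)
        out[(y + j) * height + x] = tile[threadIdx.x][threadIdx.y + j];
}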
Resource Limits
- 48KB per thread block by default; larger dynamic allocations require an explicit opt-in (cudaFuncAttributeMaxDynamicSharedMemorySize) on recent architectures
- Don't use shared memory just to use it - poorly utilized shared memory performs worse than cached global memory
Occupancy vs Performance Reality
Memory-Bound Kernels
- Often perform identically from 25% to 100% occupancy
- Memory bandwidth ceiling means more threads don't help
- Focus on memory efficiency over thread count
Compute-Bound Kernels
- Need higher occupancy to hide instruction latency
- Rare in practice due to memory bandwidth limits
Register-Heavy Kernels
- May perform better with lower occupancy and more registers per thread
- Register spilling to local memory kills performance
CUDA 13.0 Performance Features
Green Contexts
- Purpose: Lightweight resource isolation between workloads
- Performance Cost: 5-15% overhead for isolation
- Benefit: Eliminates interference between workloads
- Use Cases: Multi-tenant inference, training job isolation
ZStandard Compression
- Improvement: 17% reduction in kernel binary size
- Benefits: Faster startup, reduced memory footprint, better instruction cache utilization
- Compatibility Risk: Older drivers cannot load ZStd-compressed kernels
CUDA Graphs
- Purpose: Eliminate kernel launch overhead
- Performance Gain: 50% CPU overhead reduction for repetitive patterns
- Effective For: Inference pipelines, training loops, multi-kernel sequences
- Ineffective For: One-off launches, dynamic parameters, conditional execution
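A minimal sketch of building a graph via stream capture, assuming the CUDA 12+ three-argument cudaGraphInstantiate signature; the step kernel and launch parameters are placeholders.

#include <cuda_runtime.h>

__global__ void step(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}

void run_with_graph(cudaStream_t stream, float* d_buf, int n, int iterations) {
    cudaGraph_t graph;
    cudaGraphExec_t instance;

    // Record the repetitive kernel sequence once; launches are captured, not run.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    step<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
    step<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
    step<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&instance, graph, 0);

    // Replay the whole sequence with a single launch call per iteration.
    for (int i = 0; i < iterations; ++i)
        cudaGraphLaunch(instance, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(instance);
    cudaGraphDestroy(graph);
}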
Critical Profiling Metrics
Memory Performance Indicators
- Global Load Efficiency: >80% good, <50% critical problem
- L1/TEX Hit Rate: >70% good, <30% indicates cache thrashing
- DRAM Utilization: >85% indicates memory bandwidth saturation
Compute Performance Indicators
- SM Utilization: Percentage of time SMs are busy
- Warp State Distribution: Time warps spend stalled vs computing
- Achieved Occupancy: Only matters for compute-bound kernels
Bank Conflict Detection
- Shared Memory Conflicts: Check for serialized access patterns
- Symptoms: High shared memory latency with low throughput
Production Failure Scenarios
Memory Coalescing Disaster
- Symptom: 12 GB/s on 900 GB/s hardware
- Cause: Array-of-Structures layout causing strided, non-coalesced access
- Solution: Structure-of-Arrays conversion
- Result: 680 GB/s achievement, 45ms to 6ms execution time
Bank Conflict Hell
- Symptom: 28% performance regression after "optimization"
- Cause: __shared__ float tile[32][32] causing stride conflicts
- Solution: Padding to tile[32][33]
- Result: 40% improvement over original
Occupancy Obsession
- Symptom: 96% occupancy, only 2% performance improvement
- Cause: Memory bandwidth saturation, not compute limitation
- Solution: Reduced thread blocks, improved cache hit rates
- Result: Lower occupancy (62%), 35% better performance
Register Spilling Mystery
- Symptom: Same code randomly 3x slower
- Cause: Non-deterministic NVCC register allocation
- Detection: Different register counts between compilations
- Solution: Explicit register limiting with -maxrregcount=32 (or per-kernel __launch_bounds__, sketched below)
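A per-kernel alternative worth knowing: __launch_bounds__ constrains the register budget for a single kernel instead of the whole compilation unit. The stencil kernel and the bounds below are illustrative assumptions.

// 256 threads per block, at least 4 resident blocks per SM: the compiler
// allocates registers so this kernel can meet that occupancy target.
__global__ void __launch_bounds__(256, 4)
stencil(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
}
// Compile with -Xptxas -v to confirm the register count the compiler settled on.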
Multi-GPU Scaling Wall
- Symptom: 40% efficiency on 4 GPUs, 15% on 8 GPUs
- Expected Cause: NCCL communication overhead
- Actual Cause: Single-threaded CPU preprocessing bottleneck
- Solution: Parallel data loading with prefetch
- Result: 75% efficiency on 8 GPUs
Profiling Tool Effectiveness
Tool | Best For | Accuracy | Production Ready | Key Limitation |
---|---|---|---|---|
nvidia-smi | Quick health check | Basic metrics only | Yes | Surface-level only |
Nsight Compute | Kernel optimization | Best-in-class | Yes | Steep learning curve |
Nsight Systems | Timeline analysis | Excellent for CPU-GPU | Yes | Not kernel-deep |
nvprof | Legacy profiling | Good for compute-bound | Deprecated CUDA 12+ | No longer supported |
Critical Configuration Settings
Memory Management
- cudaMalloc vs cudaMallocManaged: Explicit management wins in production (10-30% overhead for Unified Memory)
- Prefetch Strategy: Essential for multi-GPU scaling
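A minimal sketch contrasting the two paths, assuming the classic int-device cudaMemPrefetchAsync signature; the scale kernel, sizes, and device id are placeholders.

#include <cuda_runtime.h>

__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 0.5f;
}

// Explicit management: allocation, copies, and placement stay under program
// control, which is why it tends to win in production.
void explicit_path(const float* h_in, float* h_out, int n, cudaStream_t stream) {
    float* d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));
    cudaMemcpyAsync(d_buf, h_in, n * sizeof(float), cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
    cudaMemcpyAsync(h_out, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
    cudaFree(d_buf);
}

// Unified Memory: simpler, but prefetching is what keeps the overhead down
// by migrating pages before the kernel (or the CPU) touches them.
void managed_path(int n, int device, cudaStream_t stream) {
    float* buf;
    cudaMallocManaged(&buf, n * sizeof(float));
    cudaMemPrefetchAsync(buf, n * sizeof(float), device, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(buf, n);
    cudaMemPrefetchAsync(buf, n * sizeof(float), cudaCpuDeviceId, stream);
    cudaStreamSynchronize(stream);
    cudaFree(buf);
}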
Compilation Flags
- Register Control: -maxrregcount=N for consistent performance
- Architecture Targeting: -arch=compute_89 -code=sm_89 for RTX 4090
- Verbose Output: -Xptxas -v to check register usage
Runtime Detection
- Register Spilling: Watch for unexpected local memory traffic during execution (spill counts also show up in -Xptxas -v output at build time)
- Bank Conflicts: Profile shared memory access patterns
- Coalescing Issues: Check bytes per memory request
Hardware-Specific Considerations
Architecture Differences
- RTX 4090: far more cores than an RTX 3080, but only modestly higher total memory bandwidth, so bandwidth per core actually drops
- Memory-bound kernels perform similarly across generations
- Compute capability affects available features and optimizations
Memory Bandwidth Reality
- Consumer GPUs: Limited memory bandwidth vs compute capability
- Tesla/Datacenter: Better balanced memory bandwidth
- Memory controller contention increases with thread count
Decision Support Framework
When to Optimize
- Memory bandwidth utilization <70%: Focus on coalescing and cache optimization
- High occupancy, low SM efficiency: Address warp divergence and memory stalls
- CPU overhead >5%: Consider CUDA Graphs for launch optimization
- Multi-GPU scaling <70%: Check data pipeline bottlenecks first
When to Accept Current Performance
- Memory bandwidth >85% utilized: Near hardware limits
- Algorithm inherently sequential: GPU may not be appropriate
- Development time exceeds performance benefit: Optimization has diminishing returns
Architecture Migration Risks
- Code optimized for one GPU generation may perform poorly on another
- Always profile on target production hardware
- Consumer GPU optimizations may not transfer to datacenter hardware
Common Misconceptions That Cause Failures
- "Higher occupancy always improves performance" - Memory-bound kernels see no benefit
- "More thread blocks always help" - Can cause memory bandwidth saturation
- "Shared memory is always faster" - Poorly used shared memory underperforms cached global memory
- "Tensor Cores accelerate everything" - Only benefits specific mixed-precision matrix operations
- "CUDA optimization is deterministic" - Compiler behavior varies, requiring explicit controls
Useful Links for Further Investigation
CUDA Performance Resources That Don't Suck
Link | Description |
---|---|
CUDA C++ Best Practices Guide | The only NVIDIA doc that's practical instead of theoretical. Skip the intro chapters, jump to "Memory Optimization" and "Execution Configuration Optimizations". |
Nsight Compute Documentation | Dense but comprehensive. Focus on the "Profiling Guide" section - it explains what all those cryptic metrics actually mean. |
CUDA Programming Guide - Performance Guidelines | Dry but accurate. The memory coalescing examples are worth understanding even if you never write C++ CUDA. |
CUDA MODE YouTube Series | Real engineers explaining real optimization problems. Skip the intro lectures, watch the kernel optimization episodes. |
GPU MODE Lecture Series | Community-driven CUDA optimization lectures and resources. Focus on practical optimization techniques over theory. |
Nsight Compute CLI Reference | Command-line profiling without the GUI frustration. Essential for batch profiling and CI integration. |
Simon Boehm's CUDA Matrix Multiplication Tutorial | Step-by-step optimization of a real kernel. Shows the profiling workflow and optimization thought process. |
CUDA Performance Guidelines from Purdue | University course materials with practical examples. PDFs are dense but cover memory coalescing and shared memory well. |
cuBLAS Source Code Analysis | NVIDIA's own matrix multiplication optimizations. Warning: extremely complex, but shows production-level optimization techniques. |
PyTorch CUDA Kernels | Real-world kernels handling irregular data sizes and edge cases. Good examples of robust optimization. |
Thrust Library Implementation | Shows how to write generic, high-performance CUDA code. The reduction and scan implementations are educational. |
GPU Memory Coalescing Visualization | Interactive examples showing coalesced vs non-coalesced access patterns. Finally makes the concept click. |
CUDA Memory Hierarchy Guide | Explains the hardware reality behind memory optimization advice. |
Compute Sanitizer Documentation | Essential CUDA memory debugging tool that replaced cuda-memcheck. Critical for finding memory errors. |
Nsight Systems Timeline Analysis | Official tutorials for timeline profiling. The multi-GPU analysis section is particularly useful. |
NVIDIA GPU Architecture Whitepapers | Hardware specifications for compute capability, memory bandwidth, cache sizes. Essential for understanding architecture limits. |
CUDA Toolkit Release Notes | New features and breaking changes in recent CUDA versions. CUDA 13.0 has significant performance-related changes. |
CUDA Stack Overflow | High-quality voted answers to optimization questions. Filter by votes to find the most reliable solutions. |
NVIDIA Developer Forums - CUDA Programming | NVIDIA engineers sometimes answer questions. Search before posting - they've answered most performance questions already. |
Related Tools & Recommendations
Anthropic Raises $13B at $183B Valuation: AI Bubble Peak or Actual Revenue?
Another AI funding round that makes no sense - $183 billion for a chatbot company that burns through investor money faster than AWS bills in a misconfigured k8s
Docker Desktop Hit by Critical Container Escape Vulnerability
CVE-2025-9074 exposes host systems to complete compromise through API misconfiguration
Yarn Package Manager - npm's Faster Cousin
Explore Yarn Package Manager's origins, its advantages over npm, and the practical realities of using features like Plug'n'Play. Understand common issues and be
PostgreSQL Alternatives: Escape Your Production Nightmare
When the "World's Most Advanced Open Source Database" Becomes Your Worst Enemy
AWS RDS Blue/Green Deployments - Zero-Downtime Database Updates
Explore Amazon RDS Blue/Green Deployments for zero-downtime database updates. Learn how it works, deployment steps, and answers to common FAQs about switchover
Three Stories That Pissed Me Off Today
Explore the latest tech news: You.com's funding surge, Tesla's robotaxi advancements, and the surprising quiet launch of Instagram's iPad app. Get your daily te
Aider - Terminal AI That Actually Works
Explore Aider, the terminal-based AI coding assistant. Learn what it does, how to install it, and get answers to common questions about API keys and costs.
jQuery - The Library That Won't Die
Explore jQuery's enduring legacy, its impact on web development, and the key changes in jQuery 4.0. Understand its relevance for new projects in 2025.
vtenext CRM Allows Unauthenticated Remote Code Execution
Three critical vulnerabilities enable complete system compromise in enterprise CRM platform
Django Production Deployment - Enterprise-Ready Guide for 2025
From development server to bulletproof production: Docker, Kubernetes, security hardening, and monitoring that doesn't suck
HeidiSQL - Database Tool That Actually Works
Discover HeidiSQL, the efficient database management tool. Learn what it does, its benefits over DBeaver & phpMyAdmin, supported databases, and if it's free to
Fix Redis "ERR max number of clients reached" - Solutions That Actually Work
When Redis starts rejecting connections, you need fixes that work in minutes, not hours
QuickNode - Blockchain Nodes So You Don't Have To
Runs 70+ blockchain nodes so you can focus on building instead of debugging why your Ethereum node crashed again
Get Alpaca Market Data Without the Connection Constantly Dying on You
WebSocket Streaming That Actually Works: Stop Polling APIs Like It's 2005
OpenAI Alternatives That Won't Bankrupt You
Bills getting expensive? Yeah, ours too. Here's what we ended up switching to and what broke along the way.
Migrate JavaScript to TypeScript Without Losing Your Mind
A battle-tested guide for teams migrating production JavaScript codebases to TypeScript
Docker Compose 2.39.2 and Buildx 0.27.0 Released with Major Updates
Latest versions bring improved multi-platform builds and security fixes for containerized applications
Google Vertex AI - Google's Answer to AWS SageMaker
Google's ML platform that combines their scattered AI services into one place. Expect higher bills than advertised but decent Gemini model access if you're alre
Google NotebookLM Goes Global: Video Overviews in 80+ Languages
Google's AI research tool just became usable for non-English speakers who've been waiting months for basic multilingual support
Figma Gets Lukewarm Wall Street Reception Despite AI Potential - August 25, 2025
Major investment banks issue neutral ratings citing $37.6B valuation concerns while acknowledging design platform's AI integration opportunities
Recommendations combine user behavior, content similarity, research intelligence, and SEO optimization