Your CUDA kernel works. It produces correct results. Your boss is happy. But deep down, you know it's running at 5% of theoretical peak performance, and the profiler is mocking you with single-digit occupancy numbers.
Welcome to CUDA performance optimization - where "it works" is just the beginning of the pain. After five years of optimizing CUDA kernels for production workloads, I've learned that performance optimization isn't about applying random tricks from Stack Overflow. It's a systematic process of identifying bottlenecks, fixing them methodically, and accepting that your GPU will never reach theoretical peak.
The Performance Optimization Pyramid
CUDA performance optimization follows a hierarchy. You can't skip steps - fixing occupancy won't help if you're memory bandwidth bound, and optimizing memory coalescing is pointless if your algorithm is fundamentally broken.
Level 1: Algorithm-Level Optimization (100x gains possible)
- Wrong approach: Optimizing a bubble sort on GPU
- Right approach: Implementing a parallel merge sort
- Reality check: Some algorithms simply don't parallelize. Accept a CPU implementation for inherently sequential work
Level 2: Memory Optimization (10x gains typical)
- Global memory coalescing: Threads in a warp access consecutive addresses
- Shared memory utilization: Cache frequently accessed data on-chip
- Memory bank conflict elimination: Avoid concurrent access to same shared memory bank
Level 3: Execution Configuration (2-5x gains)
- Thread block sizing: Balance occupancy with resource usage
- Grid configuration: Ensure enough work to saturate all SMs
- Stream utilization: Overlap computation with memory transfers
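That last item is the easiest to sketch. Here is a minimal, hedged example of compute/transfer overlap, assuming pinned (cudaMallocHost) host buffers h_in/h_out, device buffers d_in/d_out, a per-stream element count chunk, and a hypothetical kernel named process that works on independent chunks:

const int kStreams = 4;
cudaStream_t streams[kStreams];
for (int i = 0; i < kStreams; ++i)
    cudaStreamCreate(&streams[i]);

for (int i = 0; i < kStreams; ++i) {
    size_t offset = (size_t)i * chunk;
    // Copy-in, compute, and copy-out for chunk i all go to stream i,
    // so chunk i's kernel can overlap with chunk i+1's host-to-device copy
    cudaMemcpyAsync(d_in + offset, h_in + offset, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, streams[i]);
    process<<<(chunk + 255) / 256, 256, 0, streams[i]>>>(d_in + offset, d_out + offset, chunk);
    cudaMemcpyAsync(h_out + offset, d_out + offset, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[i]);
}
cudaDeviceSynchronize();

Overlap only happens with pinned host memory; pageable buffers force the copies to synchronize.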
Level 4: Instruction-Level Optimization (20-50% gains)
- Loop unrolling: Reduce loop overhead for small, known iteration counts
- Math function optimization: Use fast math intrinsics when precision allows
- Register optimization: Minimize register pressure to increase occupancy
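To make the instruction-level ideas concrete, here is a small hypothetical kernel sketching loop unrolling and a fast-math intrinsic. The eight-term polynomial and the choice of __expf are illustrative assumptions, not a recipe:

__global__ void poly_exp(const float* __restrict__ x, float* __restrict__ y,
                         const float* __restrict__ coeffs, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    #pragma unroll                          // fixed trip count, so the compiler unrolls fully
    for (int k = 0; k < 8; ++k)
        acc = fmaf(acc, x[i], coeffs[k]);   // Horner step as a fused multiply-add
    y[i] = __expf(acc);                     // fast intrinsic: quicker, slightly less precise
}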
The Memory Bandwidth Wall
Most CUDA kernels are memory bandwidth bound, not compute bound. Your RTX 4090 can execute roughly 83 TFLOPS of FP32 work but has only about 1000 GB/s of memory bandwidth. That means it can read or write about 250 billion floats per second while executing tens of trillions of operations per second - roughly 80 FP32 operations for every byte it moves.
The brutal math: if your kernel reads one float and writes one float (8 bytes total), it needs on the order of 650 FP32 operations per element to be compute bound. Most kernels don't come close.
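SAXPY is the standard illustration (not from this article's workloads, just the textbook example): each element moves 12 bytes (read x[i], read y[i], write y[i]) for only 2 floating-point operations, about 0.17 FLOP per byte - orders of magnitude below the ~80 FLOP per byte the hardware would need to be compute bound.

// SAXPY: 2 FLOPs per 12 bytes of traffic -> firmly memory bandwidth bound
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}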
Memory Coalescing - Your First Battle
// Terrible: with a large stride, each thread in a warp touches a different cache line
__global__ void strided_access(float* data, int stride) {
    int idx = (blockIdx.x * blockDim.x + threadIdx.x) * stride; // WRONG
    data[idx] = threadIdx.x;
}

// Good: adjacent threads in a warp access adjacent addresses
__global__ void coalesced_access(float* data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x; // RIGHT
    data[idx] = threadIdx.x;
}
Reality check: Non-coalesced access can cut effective memory bandwidth by as much as 8x. Nsight Compute's Memory Workload Analysis section shows how many bytes you fetch versus how many you actually use - a global load efficiency below 80% means you're wasting bandwidth.
Shared Memory - The On-Chip Cache That Actually Matters
// Matrix transpose with shared memory tiling
// (assumes a square n x n matrix, n a multiple of 32, and a 32x32 thread block)
__global__ void transpose_shared(float* out, const float* in, int n) {
    __shared__ float tile[32][33]; // +1 column to avoid bank conflicts
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    // Load tile cooperatively (coalesced reads)
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();
    // Write the transposed tile: swap the block offsets so the writes stay coalesced
    x = blockIdx.y * 32 + threadIdx.x;
    y = blockIdx.x * 32 + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}
The shared memory rules:
- 32 banks, 32-bit wide: Avoid having threads in the same warp hit different addresses in the same bank (same-address accesses are broadcast and free)
- Roughly 64-228 KB per SM depending on architecture, carved out of the L1 cache: Don't waste it, but don't assume it's unlimited
- Bank conflict cost: An n-way conflict serializes into n transactions, so a worst-case 32-way conflict cuts shared memory throughput by 32x
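A contrived sketch of that last rule. The demo kernel below is hypothetical and assumes a single 32-thread block, but it shows how a strided shared memory index funnels an entire warp into one bank:

__global__ void bank_conflict_demo(const float* in, float* out) {
    __shared__ float buf[32 * 32];
    int tid = threadIdx.x;                  // assumes blockDim.x == 32 (one warp)
    for (int i = tid; i < 32 * 32; i += 32)
        buf[i] = in[i];
    __syncwarp();
    // buf[tid * 32]: every address maps to bank 0 -> 32-way conflict, 32 serialized accesses
    float conflicted = buf[tid * 32];
    // buf[tid]: thread t maps to bank t -> one transaction serves the whole warp
    float clean = buf[tid];
    out[tid] = conflicted + clean;
}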
Occupancy - The Most Misunderstood Metric
Everyone obsesses over occupancy. "My kernel only shows 25% occupancy, it must be slow!" Wrong. Occupancy is the ratio of active warps to the maximum each SM can hold - it measures how much work can be resident at once, not how fast your kernel runs.
The occupancy myths:
- ❌ "Higher occupancy always means better performance"
- ❌ "100% occupancy is the goal"
- ❌ "Low occupancy means the kernel is broken"
The occupancy reality:
- Memory-bound kernels: Often perform identically from 25% to 100% occupancy
- Compute-bound kernels: Need higher occupancy to hide instruction latency
- Register-heavy kernels: May perform better with lower occupancy and more registers per thread
The Occupancy Calculator Lies
The CUDA Occupancy Calculator tells you maximum theoretical occupancy. It doesn't know about:
- Memory access patterns
- Branch divergence
- Actual workload characteristics
- Cache behavior
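If you still want the theoretical number programmatically, the runtime occupancy API gives the same estimate as the calculator. A minimal sketch, assuming a kernel named my_kernel and no dynamic shared memory:

int minGridSize = 0, blockSize = 0;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, my_kernel, 0, 0);

int blocksPerSM = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, my_kernel, blockSize, 0);

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
// Theoretical occupancy only: ignores memory behavior, divergence, and cache effects
float occ = (float)(blocksPerSM * blockSize) / prop.maxThreadsPerMultiProcessor;
printf("block size %d -> %.0f%% theoretical occupancy\n", blockSize, occ * 100.0f);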
Better approach: Profile with Nsight Compute and look at:
- SM Utilization: Percentage of time SMs are busy
- Memory Throughput: Actual achieved bandwidth vs theoretical
- Warp State Distribution: How much time warps spend stalled
The CUDA 13.0 Performance Features You Should Know
Green Contexts - GPU Virtualization That Actually Works
Green Contexts in CUDA 13.0 allow lightweight resource isolation between different workloads on the same GPU. Unlike MPS (Multi-Process Service), where everything contends for the same SMs, Green Contexts carve out a dedicated slice of the GPU's SMs for each workload.
Use cases where Green Contexts matter:
- Multi-tenant inference serving
- Training job isolation on shared hardware
- Background vs foreground workload prioritization
Performance impact: Typically 5-15% overhead for resource isolation, but eliminates interference between workloads.
ZStandard Compression - Smaller Binaries, Faster Loading
CUDA 13.0 switched from LZ4 to ZStandard compression for kernel binaries, reducing size by up to 17%. This matters more than you'd think:
- Faster application startup
- Reduced memory footprint for JIT compilation
- Smaller distribution artifacts (installers, container images)
Backward compatibility: Older drivers can't load ZStd-compressed kernels. Pin your driver versions in production.
CUDA Graphs - Reduce Launch Overhead
CUDA Graphs eliminate kernel launch overhead by pre-recording sequences of operations. For workloads with repetitive kernel patterns, graphs can reduce CPU overhead by 50%.
// Graph creation and execution
cudaGraph_t graph;
cudaGraphExec_t graphExec;

// Record operations into the graph via stream capture
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
kernel1<<<blocks, threads, 0, stream>>>();
kernel2<<<blocks, threads, 0, stream>>>();
cudaStreamEndCapture(stream, &graph);

// Instantiate once, then launch as many times as needed
cudaGraphInstantiate(&graphExec, graph, 0); // CUDA 12+ signature
cudaGraphLaunch(graphExec, stream);
Graphs work best for:
- Inference pipelines with fixed topology
- Training loops with consistent patterns
- Multi-kernel sequences executed repeatedly
Graphs don't help:
- One-off kernel launches
- Dynamic kernel parameters (unless you re-capture or use the graph update APIs)
- Conditional execution patterns
The harsh truth about CUDA optimization: there are no silver bullets. Memory coalescing, shared memory optimization, and occupancy tuning are table stakes. Real performance gains come from algorithmic improvements and understanding your specific workload's bottlenecks.