
CUDA GPU Performance Optimization - AI Reference Guide

Executive Summary

CUDA performance optimization follows a strict hierarchy where algorithm-level changes provide 100x gains, memory optimization provides 10x gains, execution configuration provides 2-5x gains, and instruction-level optimization provides 20-50% gains. Most CUDA kernels are memory bandwidth bound, not compute bound, making memory optimization the primary focus.

Critical Performance Reality

Memory Bandwidth Dominance: An RTX 4090 peaks at roughly 83 TFLOPS of FP32 compute but only about 1000 GB/s of memory bandwidth. That is roughly 250 billion float reads/writes per second against 83 trillion float operations per second, so a kernel needs on the order of 300 FP32 operations per float read from DRAM to become compute bound - most kernels come nowhere near that ratio.
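
A quick back-of-the-envelope roofline check makes the ratio concrete. The sketch below is plain host-side arithmetic using the approximate RTX 4090 figures quoted above (assumptions, not measurements):

#include <cstdio>

int main() {
    // Approximate RTX 4090 figures from the text above (assumptions, not measurements).
    const double peak_flops = 83e12;   // ~83 TFLOPS FP32
    const double peak_bw    = 1.0e12;  // ~1000 GB/s DRAM bandwidth

    // Roofline ridge point: FLOPs a kernel must perform per byte moved
    // to/from DRAM before compute (not bandwidth) becomes the limiter.
    double flop_per_byte  = peak_flops / peak_bw;          // ~83 FLOP per byte
    double flop_per_float = flop_per_byte * sizeof(float); // ~330 FLOP per 4-byte load

    printf("Compute bound only above ~%.0f FLOP/byte (~%.0f FLOP per float accessed)\n",
           flop_per_byte, flop_per_float);
    return 0;
}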

Occupancy Misconception: High occupancy does not equal high performance. Memory-bound kernels often perform identically from 25% to 100% occupancy. Focus on memory throughput, not thread count.

Performance Optimization Hierarchy

Level 1: Algorithm-Level Optimization (100x gains possible)

  • Critical Decision: Some algorithms don't parallelize - accept CPU implementation for inherently sequential work
  • Failure Mode: Optimizing inappropriate algorithms (e.g., bubble sort on GPU)
  • Success Pattern: Implementing parallel-friendly algorithms (merge sort vs bubble sort)

Level 2: Memory Optimization (10x gains typical)

  • Primary Bottleneck: Memory bandwidth, not compute capacity
  • Coalescing Requirement: Threads in a warp must access consecutive addresses
  • Shared Memory Benefits: Cache frequently accessed data on-chip
  • Bank Conflict Cost: Roughly 5x slowdown when multiple threads in a warp hit the same shared memory bank

Level 3: Execution Configuration (2-5x gains)

  • Thread Block Sizing: Balance occupancy with resource usage
  • Grid Configuration: Ensure sufficient work to saturate all SMs
  • Stream Utilization: Overlap computation with memory transfers (see the overlap sketch below)
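
A minimal overlap sketch, assuming pinned host buffers (cudaMallocHost) and a hypothetical process kernel; the input is split into chunks issued round-robin across two streams so one chunk's copies overlap another chunk's compute:

__global__ void process(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;   // placeholder work
}

// h_in/h_out must be pinned (cudaMallocHost) or the async copies serialize.
void run_overlapped(const float* h_in, float* h_out, int n) {
    const int chunks = 4;
    const int chunk  = n / chunks;       // assume n divides evenly, for brevity
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

    for (int c = 0; c < chunks; ++c) {
        cudaStream_t st = s[c % 2];
        size_t off = (size_t)c * chunk;
        // H2D copy, kernel, and D2H copy for chunk c are queued on one stream,
        // so they overlap with the neighboring chunk running on the other stream.
        cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        process<<<(chunk + 255) / 256, 256, 0, st>>>(d_in + off, d_out + off, chunk);
        cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);
    cudaFree(d_in);
    cudaFree(d_out);
}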

Level 4: Instruction-Level Optimization (20-50% gains)

  • Loop Unrolling: Effective for small, known iteration counts
  • Math Intrinsics: Use fast math intrinsics when reduced precision is acceptable (see the sketch after this list)
  • Register Optimization: Minimize pressure to increase occupancy
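
A small illustrative kernel (not from the source) showing two Level 4 techniques together: a compile-time unrolled loop and a fast-math intrinsic.

__global__ void scale_and_exp(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;
    #pragma unroll                 // trip count is a small compile-time constant
    for (int k = 0; k < 4; ++k)
        acc += x[i] * 0.25f;       // placeholder arithmetic for the unroll to act on

    // __expf is the fast intrinsic: a few ULPs less accurate than expf(), noticeably faster.
    y[i] = __expf(acc);
}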

Memory Coalescing - Critical Implementation Details

Coalescing Failure Impact

  • Non-coalesced access can cut effective memory bandwidth by 8x or more
  • Nsight Compute "Global Load Efficiency" below 80% indicates bandwidth waste

Structure Layout Impact

// WRONG: Array of Structures (AoS) - per-field access is strided, poor coalescing
struct RGB { unsigned char r, g, b; };
RGB* image;

// RIGHT: Structure of Arrays (SoA) - adjacent threads read adjacent addresses
float *r_channel, *g_channel, *b_channel;
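
A minimal sketch (hypothetical brightness kernel, one thread per pixel) of why the layouts behave differently: with SoA, the 32 threads of a warp read 32 adjacent floats from each array, so each load instruction becomes one coalesced request; with AoS, each field read is strided by the struct size, spreading the warp's request across several memory transactions.

struct PixelF { float r, g, b; };   // hypothetical AoS layout used for the comparison

// SoA: adjacent threads read adjacent addresses in each channel array (coalesced).
__global__ void brightness_soa(const float* r, const float* g, const float* b,
                               float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 0.299f * r[i] + 0.587f * g[i] + 0.114f * b[i];
}

// AoS: each field access strides by sizeof(PixelF) = 12 bytes, so a warp's 32 reads
// of .r span 384 bytes instead of 128 - more transactions for the same useful data.
__global__ void brightness_aos(const PixelF* pix, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 0.299f * pix[i].r + 0.587f * pix[i].g + 0.114f * pix[i].b;
}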

Detection Method

  • Nsight Compute "Bytes per Request": 32+ bytes indicates coalescing, 4-8 bytes indicates failure
  • Below 128 bytes per request indicates poor coalescing or cache thrashing

Shared Memory Optimization

Bank Conflict Rules

  • 32 banks, each 4 bytes (32 bits) wide
  • An N-way conflict on one bank serializes into N transactions; the ~5x figure above is a typical measured cost, and the worst case is 32-way
  • Typically 64-128 KB per SM on recent GPUs, carved from a configurable split with L1

Bank Conflict Elimination

// WRONG: Bank conflicts on transpose
__shared__ float tile[32][32];

// RIGHT: Padding eliminates stride conflicts  
__shared__ float tile[32][33];  // +1 padding
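
A minimal padded-tile transpose sketch showing where the padding pays off: rows are written during the load phase, but columns are read during write-out, and without the +1 those column reads would hit the same bank 32 ways. Launch with dim3 block(32, 32) and a grid covering the matrix; the kernel is illustrative, not a drop-in library routine.

#define TILE 32

__global__ void transpose(const float* in, float* out, int width, int height) {
    __shared__ float tile[TILE][TILE + 1];   // +1 column shifts each row to a new bank

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced row load

    __syncthreads();

    // Swap block indices for the transposed output; the tile is now read column-wise,
    // which is conflict-free only because of the padding above.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}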

Resource Limits

  • 48 KB per thread block by default; larger dynamic allocations require an explicit opt-in (cudaFuncAttributeMaxDynamicSharedMemorySize) on newer architectures
  • Don't use shared memory just to use it - poorly utilized shared memory performs worse than cached global memory

Occupancy vs Performance Reality

Memory-Bound Kernels

  • Often perform identically from 25% to 100% occupancy
  • Memory bandwidth ceiling means more threads don't help
  • Focus on memory efficiency over thread count

Compute-Bound Kernels

  • Need higher occupancy to hide instruction latency
  • Rare in practice due to memory bandwidth limits

Register-Heavy Kernels

  • May perform better with lower occupancy and more registers per thread
  • Register spilling to local memory kills performance

CUDA 13.0 Performance Features

Green Contexts

  • Purpose: Lightweight resource isolation between workloads
  • Performance Cost: 5-15% overhead for isolation
  • Benefit: Eliminates interference between workloads
  • Use Cases: Multi-tenant inference, training job isolation

ZStandard Compression

  • Improvement: 17% reduction in kernel binary size
  • Benefits: Faster startup, reduced memory footprint, better instruction cache utilization
  • Compatibility Risk: Older drivers cannot load ZStd-compressed kernels

CUDA Graphs

  • Purpose: Eliminate kernel launch overhead
  • Performance Gain: 50% CPU overhead reduction for repetitive patterns
  • Effective For: Inference pipelines, training loops, multi-kernel sequences (see the capture sketch below)
  • Ineffective For: One-off launches, dynamic parameters, conditional execution
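
A minimal stream-capture sketch of the pattern, assuming step_a and step_b are existing kernels relaunched every iteration with fixed parameters (all names here are illustrative): record the launches once, then replay them with a single cudaGraphLaunch per iteration.

__global__ void step_a(const float* in, float* tmp, int n);   // assumed existing kernels
__global__ void step_b(const float* tmp, float* out, int n);

void run_with_graph(const float* d_in, float* d_tmp, float* d_out,
                    int n, int num_iters) {
    dim3 block(256), grid((n + 255) / 256);
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture: the launches below are recorded into a graph, not executed.
    cudaGraph_t graph;
    cudaGraphExec_t graph_exec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    step_a<<<grid, block, 0, stream>>>(d_in, d_tmp, n);
    step_b<<<grid, block, 0, stream>>>(d_tmp, d_out, n);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&graph_exec, graph, nullptr, nullptr, 0);

    // Replay: one launch per iteration instead of one per kernel.
    for (int iter = 0; iter < num_iters; ++iter)
        cudaGraphLaunch(graph_exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graph_exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
}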

Critical Profiling Metrics

Memory Performance Indicators

  • Global Load Efficiency: >80% good, <50% critical problem
  • L1/TEX Hit Rate: >70% good, <30% indicates cache thrashing
  • DRAM Utilization: >85% indicates memory bandwidth saturation

Compute Performance Indicators

  • SM Utilization: Percentage of time SMs are busy
  • Warp State Distribution: Time warps spend stalled vs computing
  • Achieved Occupancy: Only matters for compute-bound kernels

Bank Conflict Detection

  • Shared Memory Conflicts: Check for serialized access patterns
  • Symptoms: High shared memory latency with low throughput

Production Failure Scenarios

Memory Coalescing Disaster

  • Symptom: 12 GB/s effective bandwidth on 900 GB/s hardware
  • Cause: Array-of-Structures layout causing strided, non-coalesced access
  • Solution: Conversion to a Structure-of-Arrays layout
  • Result: 680 GB/s achieved; execution time dropped from 45 ms to 6 ms

Bank Conflict Hell

  • Symptom: 28% performance regression after "optimization"
  • Cause: __shared__ float tile[32][32] causing stride conflicts
  • Solution: Padding to tile[32][33]
  • Result: 40% improvement over original

Occupancy Obsession

  • Symptom: Pushing occupancy to 96% yielded only a 2% performance improvement
  • Cause: Memory bandwidth saturation, not compute limitation
  • Solution: Fewer thread blocks with better cache hit rates
  • Result: Lower occupancy (62%) but 35% better performance

Register Spilling Mystery

  • Symptom: The same code intermittently runs 3x slower after a rebuild
  • Cause: NVCC register allocation shifting between builds (toolkit version, inlining heuristics), pushing some compilations into spilling
  • Detection: Different register counts between compilations (check with -Xptxas -v)
  • Solution: Explicit register limiting with -maxrregcount=32, or a per-kernel __launch_bounds__ (sketched below)
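
Besides the file-wide -maxrregcount flag, a per-kernel __launch_bounds__ attribute caps the register budget for just the kernel that needs it. A minimal sketch with an illustrative kernel; verify the effect with -Xptxas -v.

// __launch_bounds__(maxThreadsPerBlock, minBlocksPerSM) tells the compiler what
// occupancy to plan for, which bounds registers per thread for this kernel only.
__global__ void __launch_bounds__(256, 4)
scale_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;   // placeholder body; the attribute is the point
}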

Multi-GPU Scaling Wall

  • Symptom: 40% efficiency on 4 GPUs, 15% on 8 GPUs
  • Expected Cause: NCCL communication overhead
  • Actual Cause: Single-threaded CPU preprocessing bottleneck
  • Solution: Parallel data loading with prefetch
  • Result: 75% efficiency on 8 GPUs

Profiling Tool Effectiveness

Tool             Best For              Accuracy                 Production Ready        Key Limitation
nvidia-smi       Quick health check    Basic metrics only       Yes                     Surface-level only
Nsight Compute   Kernel optimization   Best-in-class            Yes                     Steep learning curve
Nsight Systems   Timeline analysis     Excellent for CPU-GPU    Yes                     Not kernel-deep
nvprof           Legacy profiling      Good for compute-bound   Deprecated (CUDA 12+)   No longer supported

Critical Configuration Settings

Memory Management

  • cudaMalloc vs cudaMallocManaged: Explicit management wins in production (10-30% overhead for Unified Memory)
  • Prefetch Strategy: Essential for multi-GPU scaling and for Unified Memory performance (see the prefetch sketch below)
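
A minimal Unified Memory prefetch sketch, assuming an existing my_kernel and a single target device (both illustrative): cudaMemPrefetchAsync migrates the managed pages before the kernel touches them, avoiding on-demand page-fault migration.

__global__ void my_kernel(float* data, int n);   // assumed existing kernel

void run_prefetched(int n, int device) {
    size_t bytes = (size_t)n * sizeof(float);
    float* data;
    cudaMallocManaged(&data, bytes);
    // ... fill `data` on the host ...

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemPrefetchAsync(data, bytes, device, stream);           // migrate pages to the GPU up front
    my_kernel<<<(n + 255) / 256, 256, 0, stream>>>(data, n);
    cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId, stream);  // pull results back to host memory
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(data);
}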

Compilation Flags

  • Register Control: -maxrregcount=N for consistent performance
  • Architecture Targeting: -arch=compute_89 -code=sm_89 for RTX 4090
  • Verbose Output: -Xptxas -v to check register usage

Runtime Detection

  • Register Spilling: Watch for local memory traffic and memory utilization spikes during execution (a runtime query sketch follows this list)
  • Bank Conflicts: Profile shared memory access patterns
  • Coalescing Issues: Check bytes per memory request
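
Register spilling can also be confirmed programmatically: cudaFuncGetAttributes reports the compiled register and local-memory footprint of a kernel, and non-zero local memory per thread is the fingerprint of spills. A minimal sketch, assuming an existing my_kernel:

#include <cstdio>

__global__ void my_kernel(float* data, int n);   // assumed existing kernel

void report_kernel_resources() {
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, my_kernel);
    printf("registers per thread : %d\n", attr.numRegs);
    printf("static shared memory : %zu bytes\n", attr.sharedSizeBytes);
    // Non-zero local memory per thread means registers spilled to local memory.
    printf("local (spill) memory : %zu bytes\n", attr.localSizeBytes);
}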

Hardware-Specific Considerations

Architecture Differences

  • RTX 4090: Far more compute than an RTX 3080 (~2.8x FP32 throughput) but only ~33% more memory bandwidth, so bandwidth per core actually drops
  • Memory-bound kernels perform similarly across generations
  • Compute capability affects available features and optimizations

Memory Bandwidth Reality

  • Consumer GPUs: Limited memory bandwidth vs compute capability
  • Tesla/Datacenter: Better balanced memory bandwidth
  • Memory controller contention increases with thread count

Decision Support Framework

When to Optimize

  1. Memory bandwidth utilization <70%: Focus on coalescing and cache optimization
  2. High occupancy, low SM efficiency: Address warp divergence and memory stalls
  3. CPU overhead >5%: Consider CUDA Graphs for launch optimization
  4. Multi-GPU scaling <70%: Check data pipeline bottlenecks first

When to Accept Current Performance

  1. Memory bandwidth >85% utilized: Near hardware limits
  2. Algorithm inherently sequential: GPU may not be appropriate
  3. Development time exceeds performance benefit: Optimization has diminishing returns

Architecture Migration Risks

  • Code optimized for one GPU generation may perform poorly on another
  • Always profile on target production hardware
  • Consumer GPU optimizations may not transfer to datacenter hardware

Common Misconceptions That Cause Failures

  1. "Higher occupancy always improves performance" - Memory-bound kernels see no benefit
  2. "More thread blocks always help" - Can cause memory bandwidth saturation
  3. "Shared memory is always faster" - Poorly used shared memory underperforms cached global memory
  4. "Tensor Cores accelerate everything" - Only benefits specific mixed-precision matrix operations
  5. "CUDA optimization is deterministic" - Compiler behavior varies, requiring explicit controls

Useful Links for Further Investigation

CUDA Performance Resources That Don't Suck

  • CUDA C++ Best Practices Guide: The only NVIDIA doc that's practical instead of theoretical. Skip the intro chapters, jump to "Memory Optimization" and "Execution Configuration Optimizations".
  • Nsight Compute Documentation: Dense but comprehensive. Focus on the "Profiling Guide" section - it explains what all those cryptic metrics actually mean.
  • CUDA Programming Guide - Performance Guidelines: Dry but accurate. The memory coalescing examples are worth understanding even if you never write C++ CUDA.
  • CUDA MODE YouTube Series: Real engineers explaining real optimization problems. Skip the intro lectures, watch the kernel optimization episodes.
  • GPU MODE Lecture Series: Community-driven CUDA optimization lectures and resources. Focus on practical optimization techniques over theory.
  • Nsight Compute CLI Reference: Command-line profiling without the GUI frustration. Essential for batch profiling and CI integration.
  • Simon Boehm's CUDA Matrix Multiplication Tutorial: Step-by-step optimization of a real kernel. Shows the profiling workflow and optimization thought process.
  • CUDA Performance Guidelines from Purdue: University course materials with practical examples. PDFs are dense but cover memory coalescing and shared memory well.
  • cuBLAS Source Code Analysis: NVIDIA's own matrix multiplication optimizations. Warning: extremely complex, but shows production-level optimization techniques.
  • PyTorch CUDA Kernels: Real-world kernels handling irregular data sizes and edge cases. Good examples of robust optimization.
  • Thrust Library Implementation: Shows how to write generic, high-performance CUDA code. The reduction and scan implementations are educational.
  • GPU Memory Coalescing Visualization: Interactive examples showing coalesced vs non-coalesced access patterns. Finally makes the concept click.
  • CUDA Memory Hierarchy Guide: Explains the hardware reality behind memory optimization advice.
  • Compute Sanitizer Documentation: Essential CUDA memory debugging tool that replaced cuda-memcheck. Critical for finding memory errors.
  • Nsight Systems Timeline Analysis: Official tutorials for timeline profiling. The multi-GPU analysis section is particularly useful.
  • NVIDIA GPU Architecture Whitepapers: Hardware specifications for compute capability, memory bandwidth, cache sizes. Essential for understanding architecture limits.
  • CUDA Toolkit Release Notes: New features and breaking changes in recent CUDA versions. CUDA 13.0 has significant performance-related changes.
  • CUDA Stack Overflow: High-quality voted answers to optimization questions. Filter by votes to find the most reliable solutions.
  • NVIDIA Developer Forums - CUDA Programming: NVIDIA engineers sometimes answer questions. Search before posting - they've answered most performance questions already.
