The Performance Optimization Reality Check

Your CUDA kernel works. It produces correct results. Your boss is happy. But deep down, you know it's running at 5% of theoretical peak performance, and the profiler is mocking you with single-digit occupancy numbers.

Welcome to CUDA performance optimization - where "it works" is just the beginning of the pain. After five years of optimizing CUDA kernels for production workloads, I've learned that performance optimization isn't about applying random tricks from Stack Overflow. It's a systematic process of identifying bottlenecks, fixing them methodically, and accepting that your GPU will never reach theoretical peak.

The Performance Optimization Pyramid

CUDA performance optimization follows a hierarchy. You can't skip steps - fixing occupancy won't help if you're memory bandwidth bound, and optimizing memory coalescing is pointless if your algorithm is fundamentally broken.

Level 1: Algorithm-Level Optimization (100x gains possible)

  • Wrong approach: Optimizing a bubble sort on GPU
  • Right approach: Implementing a parallel merge sort
  • Reality check: Some algorithms don't parallelize. Accept CPU implementation for inherently sequential work

Level 2: Memory Optimization (10x gains typical)

  • Global memory coalescing: Threads in a warp access consecutive addresses
  • Shared memory utilization: Cache frequently accessed data on-chip
  • Memory bank conflict elimination: Avoid concurrent access to same shared memory bank

Level 3: Execution Configuration (2-5x gains)

  • Thread block sizing: Balance occupancy with resource usage
  • Grid configuration: Ensure enough work to saturate all SMs
  • Stream utilization: Overlap computation with memory transfers (see the sketch after this pyramid)

Level 4: Instruction-Level Optimization (20-50% gains)

  • Loop unrolling: Reduce loop overhead for small, known iteration counts
  • Math function optimization: Use fast math intrinsics when precision allows
  • Register optimization: Minimize register pressure to increase occupancy
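
Of the pyramid levels, Level 3's copy/compute overlap is the most mechanical to sketch. A minimal example - my_kernel, CHUNK, NSTREAMS, and pipeline are illustrative names, not from anything above; pinned host memory is what lets cudaMemcpyAsync actually overlap:

#define NSTREAMS 4
#define CHUNK (1 << 20)  // floats per chunk

__global__ void my_kernel(float* x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    x[i] *= 2.0f;  // stand-in for real work
}

void pipeline() {
    float *h_buf, *d_buf;
    cudaMallocHost(&h_buf, (size_t)NSTREAMS * CHUNK * sizeof(float));  // pinned
    cudaMalloc(&d_buf, (size_t)NSTREAMS * CHUNK * sizeof(float));

    cudaStream_t streams[NSTREAMS];
    for (int i = 0; i < NSTREAMS; i++) cudaStreamCreate(&streams[i]);

    for (int i = 0; i < NSTREAMS; i++) {
        size_t off = (size_t)i * CHUNK;
        // Chunk i's copies overlap with chunk i-1's kernel on another stream
        cudaMemcpyAsync(d_buf + off, h_buf + off, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, streams[i]);
        my_kernel<<<CHUNK / 256, 256, 0, streams[i]>>>(d_buf + off);
        cudaMemcpyAsync(h_buf + off, d_buf + off, CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[i]);
    }
    cudaDeviceSynchronize();  // drain all streams
}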

The Memory Bandwidth Wall

Most CUDA kernels are memory bandwidth bound, not compute bound. Your RTX 4090 can execute 83 TFLOPS of FP32 operations but has only about 1000 GB/s of memory bandwidth. That means it can read or write 250 billion floats per second while computing 83 trillion operations per second - over 300 operations for every float that moves.

The brutal math: if your kernel reads one float and writes one float (8 bytes total), you need roughly 660 FP32 operations per element - about 83 ops per byte of traffic - to be compute bound on that hardware. Most kernels don't come close.
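
The same arithmetic works for any GPU. A back-of-envelope sketch using the peak figures quoted above (spec-sheet numbers, not measurements):

#include <cstdio>

int main() {
    double flops = 83e12;         // peak FP32 throughput, FLOP/s (RTX 4090-class)
    double bw    = 1000e9;        // DRAM bandwidth, bytes/s
    double bytes_per_elem = 8.0;  // one float read + one float written

    double flops_per_byte = flops / bw;                       // ~83
    double flops_per_elem = flops_per_byte * bytes_per_elem;  // ~664
    printf("Need ~%.0f FLOPs per element to be compute bound\n", flops_per_elem);
    return 0;
}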

Memory Coalescing - Your First Battle

// Terrible: consecutive threads land `stride` floats apart,
// so each thread touches a different cache line
__global__ void strided_access(float* data, int n, int stride) {
    int idx = (blockIdx.x * blockDim.x + threadIdx.x) * stride;  // WRONG
    if (idx < n) data[idx] = threadIdx.x;
}

// Good: adjacent threads access adjacent memory
__global__ void coalesced_access(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // RIGHT
    if (idx < n) data[idx] = threadIdx.x;
}

Reality check: Non-coalesced access can reduce memory bandwidth by 8x. Nsight Compute will show you "Global Load Efficiency" - anything below 80% means you're wasting bandwidth.

Shared Memory - The On-Chip Cache That Actually Matters

// Matrix transpose with shared memory tiling (assumes n is a multiple of 32)
__global__ void transpose_shared(float* out, const float* in, int n) {
    __shared__ float tile[32][33];  // +1 column to avoid bank conflicts
    
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    
    // Load tile cooperatively (coalesced read)
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();
    
    // Swap the block offsets so the write is also coalesced
    x = blockIdx.y * 32 + threadIdx.x;
    y = blockIdx.x * 32 + threadIdx.y;
    
    // Write transposed tile (the column read is conflict-free thanks to the padding)
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}

The shared memory rules:

  • 32 banks, 32-bit wide: Avoid multiple threads in a warp hitting the same bank
  • Capacity varies by architecture (tens to low hundreds of KB per SM): Don't waste it, but don't assume unlimited
  • Bank conflict cost: An n-way conflict serializes into n transactions - up to 32x slower in the worst case

Occupancy - The Most Misunderstood Metric

Everyone obsesses over occupancy. "My kernel only shows 25% occupancy, it must be slow!" Wrong. Occupancy measures how many threads can run simultaneously, not performance.

The occupancy myths:

  • ❌ "Higher occupancy always means better performance"
  • ❌ "100% occupancy is the goal"
  • ❌ "Low occupancy means the kernel is broken"

The occupancy reality:

  • Memory-bound kernels: Often perform identically from 25% to 100% occupancy
  • Compute-bound kernels: Need higher occupancy to hide instruction latency
  • Register-heavy kernels: May perform better with lower occupancy and more registers per thread

The Occupancy Calculator Lies

The CUDA Occupancy Calculator tells you maximum theoretical occupancy. It doesn't know about:

  • Memory access patterns
  • Branch divergence
  • Actual workload characteristics
  • Cache behavior

Better approach: Profile with Nsight Compute and look at:

  • SM Utilization: Percentage of time SMs are busy
  • Memory Throughput: Actual achieved bandwidth vs theoretical
  • Warp State Distribution: How much time warps spend stalled

The CUDA 13.0 Performance Features You Should Know

Green Contexts - GPU Virtualization That Actually Works

Green Contexts in CUDA 13.0 allow lightweight resource isolation between different workloads on the same GPU. Unlike MPS (Multi-Process Service), which shares everything, Green Contexts provide dedicated compute and memory resources.

Use cases where Green Contexts matter:

  • Multi-tenant inference serving
  • Training job isolation on shared hardware
  • Background vs foreground workload prioritization

Performance impact: Typically 5-15% overhead for resource isolation, but eliminates interference between workloads.
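
The API lives in the driver API rather than the runtime. A hedged sketch of carving out a fixed share of SMs, based on the green-context entry points NVIDIA introduced around CUDA 12.4 - verify exact signatures and flags against your toolkit's cuda.h; error checking omitted:

#include <cuda.h>

void make_green_ctx() {
    CUdevice dev;
    CUdevResource smPool, carved, remaining;
    CUdevResourceDesc desc;
    CUgreenCtx greenCtx;
    unsigned int nbGroups = 1;

    cuInit(0);
    cuDeviceGet(&dev, 0);

    // Take the device's pool of SMs and split off a fixed share
    cuDeviceGetDevResource(dev, &smPool, CU_DEV_RESOURCE_TYPE_SM);
    cuDevSmResourceSplitByCount(&carved, &nbGroups, &smPool, &remaining,
                                0 /*useFlags*/, 16 /*min SMs in the carve-out*/);

    // Describe the carve-out and build a green context around it
    cuDevResourceGenerateDesc(&desc, &carved, 1);
    cuGreenCtxCreate(&greenCtx, desc, dev, CU_GREEN_CTX_DEFAULT_STREAM);
    // Work launched through this context only sees the carved-out SMs
}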

ZStandard Compression - Smaller Binaries, Faster Loading

CUDA 13.0 switched from LZ4 to ZStandard compression for kernel binaries, reducing size by up to 17%. This matters more than you'd think:

  • Faster application startup
  • Reduced memory footprint for JIT compilation
  • Better instruction cache utilization

Backward compatibility: Older drivers can't load ZStd-compressed kernels. Pin your driver versions in production.

CUDA Graphs - Reduce Launch Overhead

CUDA Graphs eliminate kernel launch overhead by pre-recording sequences of operations. For workloads with repetitive kernel patterns, graphs can reduce CPU overhead by 50%.

// Graph creation and execution
cudaGraph_t graph;
cudaGraphExec_t graphExec;

// Record operations into graph
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
kernel1<<<blocks, threads, 0, stream>>>();
kernel2<<<blocks, threads, 0, stream>>>();
cudaStreamEndCapture(stream, &graph);

// Instantiate once, launch many times (CUDA 12+ three-argument signature)
cudaGraphInstantiate(&graphExec, graph, 0);
cudaGraphLaunch(graphExec, stream);

Graphs work best for:

  • Inference pipelines with fixed topology
  • Training loops with consistent patterns
  • Multi-kernel sequences executed repeatedly

Graphs don't help:

  • One-off kernel launches
  • Dynamic kernel parameters (though see the update sketch below)
  • Conditional execution patterns
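
One nuance on dynamic parameters: if only the arguments change and the topology stays fixed, re-capturing and calling cudaGraphExecUpdate can refresh an instantiated graph without a full rebuild. A sketch reusing the variables above, assuming the CUDA 12+ signatures:

cudaGraph_t newGraph;
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
kernel1<<<blocks, threads, 0, stream>>>();  // same topology, new parameters
kernel2<<<blocks, threads, 0, stream>>>();
cudaStreamEndCapture(stream, &newGraph);

cudaGraphExecUpdateResultInfo info;
if (cudaGraphExecUpdate(graphExec, newGraph, &info) != cudaSuccess) {
    // Topology changed: fall back to a full re-instantiation
    cudaGraphExecDestroy(graphExec);
    cudaGraphInstantiate(&graphExec, newGraph, 0);
}
cudaGraphDestroy(newGraph);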

The harsh truth about CUDA optimization: there are no silver bullets. Memory coalescing, shared memory optimization, and occupancy tuning are table stakes. Real performance gains come from algorithmic improvements and understanding your specific workload's bottlenecks.

CUDA Performance FAQ - The Questions Nsight Can't Answer

Q: Why is my kernel slower on RTX 4090 than RTX 3080?

A: Architecture differences. The RTX 4090 has nearly double the SMs of the RTX 3080 but only about a third more memory bandwidth, so bandwidth per SM actually drops. If your kernel is memory-bound (most are), more cores don't help. Use nvidia-smi -q -d MEMORY to check bandwidth utilization. Memory-bound kernels often perform similarly across GPU generations.

Q: My occupancy is 100% but performance is still terrible. What's wrong?

A: You're probably memory-bound, not compute-bound. High occupancy only helps compute-bound kernels that need to hide instruction latency. Run Nsight Compute and check "Memory Workload Analysis" - if "L1/TEX Hit Rate" is low and "DRAM Utilization" is high, you have a memory problem, not an occupancy problem.

Q: Should I optimize for theoretical peak FLOPS or memory bandwidth?

A: Always memory bandwidth first. Most kernels never come close to theoretical FLOPS. A typical streaming kernel like SAXPY does 2 FP32 ops per element but reads/writes 3 FP32 values (12 bytes). That's 0.17 ops per byte - nowhere near the tens of FLOPs per byte needed to be compute-bound on modern GPUs.

Q: How do I know if my memory access is coalesced?

A: Nsight Compute's memory workload analysis reports bytes per memory request. A fully coalesced warp of 4-byte loads moves 128 bytes per request (one full cache line); scattered access drags that down, since each thread pulls in a mostly wasted 32-byte sector. Anything well below 128 bytes per request for 4-byte accesses indicates poor coalescing or cache thrashing.

Q: What's the difference between shared memory bank conflicts and cache misses?

A: Bank conflicts happen when multiple threads in the same warp access different addresses that map to the same shared memory bank. The hardware serializes those accesses, so an n-way conflict costs roughly n conflict-free accesses (up to 32x). Cache misses happen when data isn't in L1/L2 cache, forcing a DRAM access that costs hundreds of cycles instead of tens. Both hurt; DRAM misses hurt far more.

Q: Why does adding more thread blocks make my kernel slower?

A: You've hit the memory bandwidth ceiling. More thread blocks mean more concurrent memory requests, causing cache evictions and memory controller contention. This is especially common on consumer GPUs with limited memory bandwidth. Profile with nvidia-smi dmon during execution to see memory utilization.

Q: My kernel runs fine on Tesla V100 but crashes on RTX 4090. Why?

A: Compute capability differences. The RTX 4090 is compute capability 8.9; the V100 is 7.0. Check whether you're relying on:

  • Different shared memory configurations
  • Hardware-specific intrinsics
  • Assumptions about warp scheduler behavior

Use nvcc -arch=compute_89 -code=sm_89 to target the RTX 4090 specifically.
Q: When should I use CUDA Streams vs CUDA Graphs?

A: Streams for overlapping independent operations (compute + memory transfers); graphs for reducing launch overhead of repetitive kernel sequences. Streams help with latency hiding, graphs help with CPU overhead. If your CPU spends >5% of its time in CUDA runtime calls, consider graphs.

Q: Is it worth optimizing for Tensor Cores on non-ML workloads?

A: Only if your problem naturally maps to mixed-precision matrix operations. Tensor Cores require specific data layouts (half-precision or other reduced formats, specific tile dimensions) and only accelerate GEMM-style operations. For general compute, the restrictions usually aren't worth the 2-4x speedup.

Q: How do I optimize kernels that process variable-length data?

A: Dynamic parallelism if the variance is high and you can't predict work distribution. Thread coarsening if most sequences are long. Persistent kernels if you want to eliminate kernel launch overhead. Avoid warp divergence by padding short sequences or grouping similar lengths together, as in the sketch below.
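
A minimal sketch of the length-grouping idea using Thrust - sequence_len and order are illustrative names, not from anything above:

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

void group_by_length(thrust::device_vector<int>& sequence_len,
                     thrust::device_vector<int>& order) {
    // order[i] = i, then sort the indices by length so each warp
    // processes sequences of similar size and diverges less
    thrust::sequence(order.begin(), order.end());
    thrust::sort_by_key(sequence_len.begin(), sequence_len.end(),
                        order.begin());
}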

Q: My kernel shows high "Achieved Occupancy" but low "SM Efficiency". What gives?

A: Warp divergence or memory stalls. High occupancy means threads are scheduled; SM Efficiency measures how much they're actually computing versus waiting. Check "Warp State Distribution" - if warps spend >50% of their time stalled on memory or barriers, you have a latency problem, not a capacity problem.

Q: Should I use cudaMalloc or cudaMallocManaged for performance?

A: cudaMalloc for predictable data access patterns where you control transfers; cudaMallocManaged for prototyping or unpredictable access patterns. Unified Memory carries 10-30% overhead from page migration but eliminates explicit transfers. In production, explicit memory management usually wins.
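
A side-by-side sketch of the two styles - N, h_data, and stream are placeholders, and error checking is omitted:

float *d_explicit = nullptr, *d_managed = nullptr;
size_t bytes = N * sizeof(float);

// Explicit: you schedule every transfer yourself
cudaMalloc(&d_explicit, bytes);
cudaMemcpy(d_explicit, h_data, bytes, cudaMemcpyHostToDevice);

// Managed: pages migrate on demand; prefetch to hide migration cost
cudaMallocManaged(&d_managed, bytes);
cudaMemPrefetchAsync(d_managed, bytes, /*dstDevice=*/0, stream);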

Q: Why do CUDA math intrinsics sometimes make kernels slower?

A: Reduced numerical precision can cause convergence issues in iterative algorithms, requiring more iterations. __sinf() is faster than sinf() but less accurate. Profile end-to-end runtime, not just kernel execution time - sometimes the "slower" accurate function gives better algorithmic convergence.
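
A toy kernel showing the trade - names are illustrative; the point is that sinf() and __sinf() are different functions with different error bounds:

__global__ void angles(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float accurate = sinf(in[i]);    // full-precision library call
        float fast     = __sinf(in[i]);  // hardware intrinsic: faster, larger error
        out[i] = fast;                   // fine for graphics; risky for solvers
        (void)accurate;
    }
}
// Compiling with -use_fast_math swaps sinf() for __sinf() globally - audit first.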

Q: How much shared memory should I use per thread block?

A: Know the limits first: thread blocks get 48KB by default, opting in via cudaFuncSetAttribute unlocks more on recent architectures, and per-SM capacity varies by generation. Use what you need for algorithm correctness, then optimize. Don't use shared memory just to use it - poorly utilized shared memory performs worse than cached global memory. Profile L1/TEX hit rates before and after shared memory optimization.

Q: My multi-GPU scaling is terrible. Is NCCL the problem?

A: Usually not. Check if you're CPU-bound in data preprocessing, have imbalanced workloads across GPUs, or are limited by PCIe transfer bandwidth. NCCL itself rarely bottlenecks unless you're doing frequent small allreduce operations. Use Nsight Systems (nsys profile --stats=true) to see where the time actually goes - nvprof is deprecated.

The Performance Optimization War Stories

Three years ago, we had a computer vision pipeline that was supposed to process 1000 images per second. It was processing 50. Management was asking questions. The GPU was at 15% utilization. I was updating my resume.

Here's what I learned from six months of brutal optimization work.

War Story #1: The Memory Coalescing Disaster

The Problem

Image preprocessing kernel running at 12 GB/s on hardware with 900 GB/s theoretical bandwidth.

The Obvious Culprit

Must be a memory access pattern issue.

The Investigation

Nsight Compute showed "Global Load Efficiency: 12.5%" and "L1/TEX Hit Rate: 8%". Classic symptoms of terrible memory coalescing.

The Original Code

// Processing RGB image as separate channels
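// (process_red/green/blue: per-channel __device__ helpers defined elsewhere)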
__global__ void process_rgb(unsigned char* r_channel, 
                           unsigned char* g_channel,
                           unsigned char* b_channel, int width, int height) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < width * height) {
        // Each thread accesses three different arrays
        r_channel[idx] = process_red(r_channel[idx]);
        g_channel[idx] = process_green(g_channel[idx]);  
        b_channel[idx] = process_blue(b_channel[idx]);
    }
}

The Reality

Three separate read-modify-write round trips per thread to three widely separated arrays tripled the memory transactions in flight and scattered each warp's footprint across many cache lines.

The Fix

Structure-of-Arrays to Array-of-Structures conversion.

struct RGB {
    unsigned char r, g, b;
};

__global__ void process_rgb_interleaved(RGB* image, int width, int height) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < width * height) {
        RGB pixel = image[idx];  // Single coalesced load
        pixel.r = process_red(pixel.r);
        pixel.g = process_green(pixel.g);
        pixel.b = process_blue(pixel.b);
        image[idx] = pixel;      // Single coalesced store
    }
}

The Result

Memory bandwidth jumped to 680 GB/s. Kernel went from 45ms to 6ms execution time.

The Lesson

Memory layout matters more than algorithm cleverness. Sometimes the "obvious" data structure is performance poison.

War Story #2: The Shared Memory Bank Conflict Hell

The Problem

Matrix transpose kernel showing 28% performance regression after "optimization".

The Debug Process

Added shared memory caching to improve global memory access patterns. Performance got worse, not better.

The Discovery

__shared__ float tile[32][32]; was causing a 32-way bank conflict on every transposed read.

// Bank conflict nightmare
__shared__ float tile[32][32];
int tx = threadIdx.x, ty = threadIdx.y;
tile[ty][tx] = input[...];      // row access: banks 0-31, conflict-free
// ...later, the transposed read:
// output[...] = tile[tx][ty];  // column access: all 32 threads hit bank ty

The Problem

With a 32x32 tile, element [row][col] lands in bank (row * 32 + col) % 32 = col. Writing a row is conflict-free, but the transpose reads a column: all 32 threads in the warp hit the same bank, serializing into a 32-way conflict.

The Solution

Padding to eliminate stride conflicts.

// Pad each row to 33 elements so column accesses spread across banks
__shared__ float tile[32][33];  // +1 padding: element [r][c] -> bank (r + c) % 32
int tx = threadIdx.x, ty = threadIdx.y;
tile[ty][tx] = input[...];  // rows still conflict-free - and now columns are too

The Result

40% performance improvement over the original naive implementation.

The Lesson

Shared memory bank conflicts are invisible until you profile. The GPU doesn't warn you - it just runs slower.

War Story #3: The Occupancy Obsession

The Problem

Optimization team spent two weeks increasing occupancy from 31% to 96%. Performance improved by... 2%.

The Misunderstanding

Higher occupancy must mean better performance, right?

The Reality Check

Nsight Compute showed the kernel was memory bandwidth bound. More threads meant more memory contention, not more useful work.

Key Metrics That Actually Mattered

  • DRAM Utilization: 87% (saturated)
  • L1/TEX Hit Rate: 23% (cache thrashing)
  • SM Utilization: 92% (SMs occupied, but warps mostly stalled waiting on memory)

The Real Optimization

Reduced thread block size from 1024 to 256 threads. This improved cache hit rates and reduced memory contention.

The Result

Lower occupancy (62%), but 35% better performance.

The Lesson

Occupancy is a tool, not a goal. Memory-bound kernels often perform better with fewer, more efficient threads.

War Story #4: The Register Spilling Mystery

The Problem

Machine learning kernel randomly taking 3x longer on identical hardware.

The Symptoms

Same code, same data, same GPU. Sometimes fast, sometimes slow. Heisenbug from hell.

The Investigation

nvcc -Xptxas -v showed different register usage between compilations:

Fast version: 32 registers per thread
Slow version: 48 registers per thread  

The Root Cause

NVCC register allocation isn't stable across environments: different toolkit versions, driver JIT compilation, and seemingly unrelated code changes can shift register assignments for the same source. The 48-register version was spilling to local memory (slow device memory), killing performance.

The Detection

nvidia-smi dmon -s u during execution showed memory utilization spikes during the slow runs.

The Fix

Explicit register limiting to force consistent behavior.

nvcc -maxrregcount=32 kernel.cu

The Result

Consistent performance, but had to rewrite parts of the algorithm to fit in 32 registers.
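
If rewriting the algorithm isn't an option, a per-kernel __launch_bounds__ qualifier is a less blunt instrument than the global flag: it tells the compiler the maximum block size (and optionally the minimum resident blocks per SM) so it budgets registers for that kernel alone. A minimal sketch (kernel body illustrative):

// Max 256 threads per block, at least 4 resident blocks per SM
__global__ void __launch_bounds__(256, 4) bounded_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}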

The Lesson

Non-deterministic compilers are evil. Always check register usage with -Xptxas -v and test multiple compilations.

War Story #5: The Multi-GPU Scaling Wall

The Problem

Perfect single-GPU performance, 40% efficiency on 4 GPUs, 15% efficiency on 8 GPUs.

The Obvious Suspect

NCCL communication overhead.

The Real Culprit

CPU preprocessing bottleneck. Data loading and augmentation was single-threaded while 8 GPUs were starving for work.

The Discovery

htop showed one CPU core at 100%, others idle. nvidia-smi dmon showed GPUs at 20% utilization.

The Solution

# Wrong: single-threaded data loading starves the GPUs
for batch in dataloader:
    process_on_gpu(batch)

# Right: parallel data loading with prefetch
from torch.utils.data import DataLoader
dataloader = DataLoader(dataset, num_workers=16, prefetch_factor=4)

The Result

8-GPU scaling improved to 75% efficiency. The GPUs weren't the bottleneck - data movement was.

The Lesson

Profile the entire pipeline, not just the GPU kernels. The fastest kernel is useless if it's waiting for data.

The Production Optimization Playbook

After optimizing hundreds of CUDA kernels in production:

Step 1: Profile First, Optimize Second

  • Use Nsight Systems (nsys profile --stats=true) to see where time actually goes - nvprof is deprecated
  • Most performance problems aren't where you think they are
  • "Obvious" bottlenecks are usually wrong

Step 2: Fix the Biggest Bottleneck First

  • Memory bandwidth: 80% of kernels are memory-bound
  • CPU-GPU data transfer: Often overlooked, frequently limiting
  • Launch overhead: Matters for small, frequent kernels

Step 3: Verify Every Optimization

  • Performance can get worse with "optimization"
  • Profile before and after every change
  • Some optimizations help one architecture but hurt another

Step 4: Test on Target Hardware

  • Consumer GPUs have different characteristics than Tesla cards
  • Memory bandwidth, cache sizes, and scheduler behavior vary
  • What works on RTX 3080 may fail on A100

Step 5: Optimize for Real Workloads

  • Synthetic benchmarks lie
  • Production data has different characteristics than test data
  • Cache behavior changes with dataset size

The most important lesson: CUDA optimization is empirical, not theoretical. The GPU architecture is too complex for intuition. Profile everything, trust nothing, and always measure performance improvements on real workloads.

CUDA Profiling Tools Compared - What Actually Helps

| Tool | Best For | Learning Curve | Cost | Accuracy | Production Ready |
|------|----------|----------------|------|----------|------------------|
| nvidia-smi | Quick health check | None | Free | Basic metrics only | Yes |
| nvprof | Legacy GPU profiling | Moderate | Free | Good for compute-bound | Deprecated in CUDA 12+ |
| Nsight Systems | Timeline analysis | Moderate | Free | Excellent for CPU-GPU interaction | Yes |
| Nsight Compute | Kernel-level optimization | Steep | Free | Best-in-class for GPU kernels | Yes |
| Visual Profiler | GUI-based profiling | Easy | Free | Limited depth | Deprecated |
| Intel VTune | CPU-GPU hybrid workloads | Steep | Paid | Good for Intel GPUs only | Limited CUDA support |
| AMD ROCProfiler | AMD GPU comparison | Moderate | Free | N/A for NVIDIA | AMD hardware only |
