
SOA Made Me a Believer (After It Nearly Killed My Frame Rate)

Look, I've been beating my head against performance optimization for years, and SOA in Odin is the first time a language feature actually delivered on its promises without destroying my codebase. But holy shit, the journey to get there was painful.

I learned about SOA the hard way when my particle system was crawling at 15 FPS. The JangaFX team's experience mirrors my own: SOA can save your ass, but only if you understand when it's actually helping.

Why Memory Layout Matters More Than Ever

Modern CPUs are memory-starved beasts. Your processor can execute billions of operations per second, but accessing main memory takes hundreds of cycles. The difference between cache-friendly and cache-hostile code can mean 2-3x performance differences in real applications.

[Figure: memory hierarchy performance gap]

Structure of Arrays vs Array of Structures Memory Layout

Traditional Array of Structures (AOS) Layout:

Memory: [x1,y1,z1,mass1][x2,y2,z2,mass2][x3,y3,z3,mass3]...

When you iterate through positions for physics calculations, you're also dragging mass data into every cache line. Flip it around and it's worse: if you only needed the mass field, 75% of your memory bandwidth would be wasted hauling position data you never touch. I spent three days profiling before I figured this out - the cache misses were murdering my performance.

Odin's Structure of Arrays (SOA) Layout:

Memory: [x1,x2,x3...][y1,y2,y3...][z1,z2,z3...][mass1,mass2,mass3...]

Now when you process a single field, every byte loaded is useful - in the mass-only case above, that's 4x more relevant data per cache line. The first time I added #soa to my particle array, I literally thought my timer was broken - frame time dropped from 16ms to 9ms instantly.
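You can verify the effect yourself with a few lines of timing code. A minimal sketch (mine, not the JangaFX benchmark) - the padding field exists purely to fatten the struct so the AOS loop drags a full cache line per particle; build with -o:speed or the numbers mean nothing:

package main

import "core:fmt"
import "core:time"

Particle :: struct {
    position: [3]f32,
    velocity: [3]f32,
    mass:     f32,
    _pad:     [9]f32,  // fattens the struct to 64 bytes (one cache line)
}

N :: 1_000_000

aos: [N]Particle
soa: #soa[N]Particle

main :: proc() {
    start := time.now()
    sum_aos: f32
    for i in 0..<N do sum_aos += aos[i].position.x
    fmt.println("AOS:", time.since(start), sum_aos)

    start = time.now()
    sum_soa: f32
    for i in 0..<N do sum_soa += soa[i].position.x
    fmt.println("SOA:", time.since(start), sum_soa)
}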

Real Performance Numbers from Production Code

Karl Zylinski's benchmarks show SOA consistently outperforming AOS across different data sizes:

  • Small structures (16 bytes): SOA is 1.07x faster than AOS
  • Medium structures (128 bytes): SOA is 1.99x faster than AOS
  • Large structures (3000 bytes): SOA is 3.18x faster than AOS

But here's where it gets interesting - at JangaFX, they've seen even more dramatic improvements. Dale Weiler's brutally honest review, written after 50,000 lines of production Odin code, tells the real story:

"I tried this on a particle system with 100k particles. Just adding #soa to the array declaration cut frame time by 40%. No code changes, no manual data layout optimization, just better defaults."

This isn't some benchmark bullshit—this is production code powering EmberGen, used by AAA game studios. The benefits align with academic research on array layouts and established SOA performance patterns. But let me tell you, SOA isn't magic. I've seen it make UI code slower when you're constantly accessing complete objects. Understanding cache behavior is crucial for knowing when to use it.

The Odin Advantage: SOA Without the Pain

Every other language makes you choose between developer productivity and performance. You can manually reorganize your data structures for cache efficiency, but it's painful:

// C/C++ manual SOA - verbose and error-prone
struct ParticleSystem {
    float* positions_x;
    float* positions_y;
    float* positions_z;
    float* velocities_x;
    float* velocities_y;
    float* velocities_z;
    float* masses;
    size_t count;
};

Odin gives you both productivity and performance:

Particle :: struct {
    position: [3]f32,
    velocity: [3]f32,
    mass: f32,
}

particles: #soa[100000]Particle  // Cache-optimized automatically

The #soa attribute transforms your straightforward struct definition into a cache-efficient memory layout. You write natural code, the compiler handles the optimization.
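Access syntax doesn't change either - you index and touch fields exactly as with a normal array. A minimal sketch (dt and the update logic are mine):

update_positions :: proc(particles: #soa[]Particle, dt: f32) {
    for i in 0..<len(particles) {
        // Only the position and velocity backing arrays are touched;
        // the mass array never enters cache
        particles[i].position += particles[i].velocity * dt
    }
}

// usage: update_positions(particles[:], dt) - slicing an #soa array
// yields an #soa slice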

When SOA Stabbed Me in the Back

Here's the thing nobody tells you about SOA: it can absolutely destroy your performance if you use it wrong. I learned this the hard way when I converted my entire entity system to SOA and watched my frame rate tank.

Algorithmica's analysis is spot on, but let me give you the real-world gotchas:

SOA will fuck you over when:

  • You're doing object-oriented operations (accessing entire entities frequently)
  • Random access patterns dominate your workload
  • Your code processes complete records more than individual fields
  • You have lots of small arrays (SOA overhead isn't worth it)

SOA saves your ass when:

  • Bulk operations on specific fields (physics updates, transformations)
  • SIMD vectorization opportunities
  • GPU compute workloads (which love coalesced memory access)
  • Processing thousands of entities in tight loops

The painful lesson from benchmarking: SOA is a scalpel, not a hammer. Use it for hot paths where you're processing arrays of data, not everywhere. NVIDIA's guidance on SOA vs AOS and performance research confirm this approach.
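The failure mode is easy to see in code. A sketch reusing the Particle struct from above: pulling a whole record out of an SOA array has to gather from every field array - one potential cache miss each - while the AOS version is a single contiguous load.

// AOS: one contiguous 28-byte load
read_aos :: proc(src: []Particle, i: int) -> Particle {
    return src[i]
}

// SOA: the same expression gathers from the position, velocity,
// and mass arrays - three scattered memory accesses
read_soa :: proc(src: #soa[]Particle, i: int) -> Particle {
    return src[i]
}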

Beyond Basic SOA: Advanced Memory Patterns

Hot/Cold Data Separation

// Frequently accessed data
HotParticleData :: struct {
    position: [3]f32,
    velocity: [3]f32,
}

// Rarely accessed data
ColdParticleData :: struct {
    creation_time: f64,
    debug_info: string,
}

hot_data: #soa[100000]HotParticleData    // Cache-optimized
cold_data: [100000]ColdParticleData      // Normal layout

Cache Line Alignment

Odin's SOA implementation automatically handles alignment, but understanding cache line boundaries helps you design better data structures. Modern x86_64 processors use 64-byte cache lines—design your hot data structures to fit within these boundaries. Memory hierarchy design principles and cache optimization guides provide deeper insights into effective cache utilization.
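One cheap trick is to make the compiler enforce your cache-line budget. A sketch (the field mix is invented; sizes assume f32/u32):

// Hot per-entity data, budgeted to one 64-byte cache line
HotEntity :: struct {
    position: [3]f32,  // 12 bytes
    velocity: [3]f32,  // 12 bytes
    rotation: [4]f32,  // 16 bytes (quaternion)
    scale:    [3]f32,  // 12 bytes
    flags:    u32,     //  4 bytes
    layer:    u32,     //  4 bytes -> 60 bytes total
}

// Compilation fails if a new field pushes past one cache line
#assert(size_of(HotEntity) <= 64)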

The Performance Reality Check (And Where Odin Kicks You)

Odin runs at around 90-95% of C performance most of the time, but that missing 5-10% will haunt your dreams when you're trying to hit 60 FPS. The overhead comes from:

  • Bounds checking (disable with #no_bounds_check, but good luck debugging when it breaks)
  • Context parameter passing (burns a register that could be doing real work)
  • Conservative optimizations (no undefined behavior means no aggressive optimizations)

Here's the weird part: with SOA optimizations, I've actually beaten hand-tuned C code because C programmers are lazy about manual cache optimization. But don't let that go to your head.

Version-specific gotchas that bit me:

  • Odin 0.13.0 changed context passing and broke our hot path - cost us 15% performance until we figured it out
  • The SOA implementation had bugs until 0.14.2 that made certain access patterns slower than AOS
  • Don't use 0.12.x for anything performance-critical - the bounds checking implementation was fucked

The bottom line: SOA in Odin gives you C++ template-level optimization without the template compilation hell. Just don't expect it to be magic - you still need to understand what you're doing.

But SOA is just one tool in the performance toolbox. Let's break down all the optimization techniques and their real-world trade-offs - because understanding when each technique helps (or hurts) is what separates actual performance optimization from cargo cult programming.

Performance Optimization Techniques Comparison (With Real Gotchas)

#soa Arrays
  • Performance gain: 1.5x-3.5x faster (when it works)
  • Difficulty: Very easy (the #soa attribute)
  • Use cases: Bulk operations, physics, graphics
  • Reality check: Can make object access slower. Profile first.

#no_bounds_check
  • Performance gain: 5-15% improvement
  • Difficulty: Trivial until it crashes
  • Use cases: Hot loops, verified safe code
  • Reality check: Will corrupt memory silently. Only in verified loops.

Contextless Procedures
  • Performance gain: 2-5% improvement
  • Difficulty: Easy but breaks error handling
  • Use cases: Performance-critical math functions
  • Reality check: Loses context access. Don't overuse.

Manual Memory Layout
  • Performance gain: 2x-4x faster
  • Difficulty: High, plus debugging nightmares
  • Use cases: Custom allocators, specific patterns
  • Reality check: Massive complexity. Rarely worth it.

Array Programming
  • Performance gain: 1.2x-2x faster (if LLVM cooperates)
  • Difficulty: Medium (auto-vectorizer is picky)
  • Use cases: Mathematical computations
  • Reality check: Sometimes doesn't vectorize. Check assembly.

Compile-time Generics
  • Performance gain: 10-50% improvement
  • Difficulty: Medium (limited inlining)
  • Use cases: Inlined operations, avoiding vtables
  • Reality check: Generic inlining sucks. Less flexible than C++.

Custom Allocators
  • Performance gain: 1.5x-10x faster
  • Difficulty: High (easy to leak memory)
  • Use cases: Memory-intensive applications
  • Reality check: Arena leaks will kill you. Remember defer.

Understanding Odin's Compiler Limitations and Optimization Strategies

When you're squeezing every drop of performance from your code, understanding what the compiler can and cannot do becomes critical. Based on comprehensive analysis from a JangaFX engineer who wrote 50,000 lines of production Odin code, here's what you need to know about Odin's performance characteristics.

The 90-95% Performance Reality

Odin consistently delivers 90-95% of C performance, but that missing 5-10% comes from specific architectural decisions that prioritize safety and simplicity over maximum speed.

Compiler Optimization Process

Compiler Performance Optimization Levels

What Odin Lacks (And Why)

No Strict Aliasing Optimizations: Odin can't assume that pointers to different types never alias the same memory. This is similar to MSVC's approach and the Linux kernel's `-fno-strict-aliasing` flag. In practice, this rarely affects real-world code performance significantly.

No Undefined Behavior Exploitation: C and C++ compilers optimize aggressively by exploiting undefined behavior. Odin doesn't have undefined behavior, so it can't make these assumptions. The trade-off is more predictable, debuggable code at the cost of 2-3% performance in edge cases. LLVM auto-vectorization documentation shows how undefined behavior enables aggressive optimizations.

Context Parameter Overhead: Every Odin procedure implicitly receives a context parameter containing allocator and error handling information. This consumes one register that could otherwise be used for computation. Use "contextless" procedures for performance-critical functions:

fast_math :: proc "contextless" (a, b: f32) -> f32 {
    return a * b + a  // No context overhead
}

The Optimization Strategies That Actually Work

Structure of Arrays (SOA) - The Big Win: This is Odin's secret weapon. While C and C++ require manual memory layout optimization, Odin automates it:

// Automatic cache optimization
particles: #soa[100000]Particle

// Manual optimization would require this in C:
// float* pos_x, *pos_y, *pos_z;  // Error-prone and verbose

Real-world results show 40% frame time reductions just from adding the #soa attribute.

Bounds Check Elimination: Odin defaults to safe array access, but you can disable bounds checking for verified hot code:

process_particles :: proc(particles: []Particle) {
    #no_bounds_check {
        for i in 0..<len(particles) {
            // Compiler trusts you not to access out of bounds
            particles[i].position += particles[i].velocity
        }
    }
}

Array Programming and SIMD: Odin's built-in vector operations automatically vectorize with proper compiler flags, leveraging LLVM's auto-vectorization capabilities:

// This vectorizes automatically with -o:speed
positions := [1000][3]f32{}
velocities := [1000][3]f32{}
positions += velocities  // SIMD-optimized component-wise addition

Understanding SIMD fundamentals and vectorization techniques helps optimize array operations. The ARM auto-vectorization guide provides insights applicable to Odin's LLVM backend.

Compilation Strategy for Different Workloads

Based on recent forum discussions, here's how to optimize your build process:

Development Builds

odin build . -o:none -use-separate-modules
  • Fast compilation (5-10 seconds for large projects)
  • Full debug information for development
  • Reasonable performance for testing

Release Builds

odin build . -o:speed -no-bounds-check
  • Maximum runtime performance (80-95% of C speed)
  • Longer compilation (30+ seconds for large projects)
  • SIMD optimizations enabled

Size-Optimized Builds

odin build . -o:size -no-crt -default-to-nil-allocator
  • Minimal binary size (down to 9.9KB for simple programs)
  • Good performance (60-80% optimization level)
  • Embedded/WebAssembly targets

The Generic Inlining Problem

One significant limitation is Odin's inability to inline generic procedures in certain contexts. From the production experience report:

// This can't inline the comparator
sort_by :: proc(slice: []$T, cmp: proc(lhs, rhs: T) -> bool) {
    // Runtime indirect calls hurt performance
}

// This could inline but limits flexibility
sort_by_static :: proc(slice: []$T, $cmp: proc(lhs, rhs: T) -> bool) {
    // Compile-time procedure, can inline
}

For sorting and other performance-critical generic algorithms, this means choosing between flexibility and maximum performance.
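In practice, that means passing the comparator as a compile-time $ parameter when you can accept one instantiation per comparator. A sketch using the two stub signatures above (sort_by_static being the $cmp variant):

less_int :: proc(lhs, rhs: int) -> bool { return lhs < rhs }

demo :: proc() {
    data := [5]int{4, 1, 5, 2, 3}
    sort_by(data[:], less_int)         // runtime pointer: indirect calls
    sort_by_static(data[:], less_int)  // $cmp constant: specialized, inlinable
}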

Working Around Compiler Limitations

Manual Devirtualization: When the compiler can't inline function pointers, help it out:

// Instead of runtime dispatch
dispatch_by_type :: proc(data: []$T, op: proc(T) -> T) {
    for &item in data do item = op(item)
}

// Use compile-time specialization
process_floats :: proc(data: []f32) {
    for &item in data do item = specific_float_operation(item)
}
process_ints :: proc(data: []int) {
    for &item in data do item = specific_int_operation(item)
}

Memory Layout Optimization: Take advantage of Odin's SOA and hot/cold data separation:

// Hot data: accessed every frame
HotEntityData :: struct {
    position: [3]f32,
    velocity: [3]f32,
}

// Cold data: accessed occasionally
ColdEntityData :: struct {
    name: string,
    creation_time: time.Time,
}

hot_entities: #soa[10000]HotEntityData    // Cache-optimized
cold_entities: [10000]ColdEntityData      // Normal layout

Custom Allocators for Specific Patterns: Odin's context system allows sophisticated memory management:

arena: mem.Arena
mem.arena_init(&arena, make([]u8, mem.Megabyte))
context.allocator = mem.arena_allocator(&arena)

// All allocations now use the arena - faster and more predictable
temp_data := make([]f32, 1000)  // No malloc overhead

The Binary Size Trade-off

Recent compiler discussions reveal that Odin binaries start around 180KB due to:

  • Static linking of runtime components
  • RTTI (Runtime Type Information) for reflection and formatting
  • Built-in allocator and context systems

For minimal binaries, strip out unnecessary features:

odin build . -o:size -no-crt -default-to-nil-allocator -no-bounds-check

This can reduce binaries to under 10KB for simple programs, but you lose safety features and runtime conveniences.

The Performance Philosophy

Odin's performance approach prioritizes predictable, debuggable performance over maximum theoretical speed. You get:

  • No surprise allocations (unlike C++)
  • Explicit memory management (no garbage collection pauses)
  • Automatic optimizations where they don't hurt predictability (SOA)
  • Manual control where you need it (bounds checking, context management)

The result is code that performs consistently well across different inputs and maintains its performance characteristics as your codebase grows. For most applications, that predictability is more valuable than squeezing out the last 5% of theoretical performance.

This philosophical approach raises a lot of practical questions when you're actually trying to ship performant code. Let me answer the questions I get asked most often - and the ones I wish someone had answered when I was figuring this shit out the hard way.

Performance Optimization FAQ (AKA "Shit I Wish I Knew Before Debugging at 3AM")

Q: When should I use #soa vs regular arrays?

A: Real talk: use #soa when you're iterating through arrays and only touching specific fields. Don't use it everywhere like I did when I first discovered it - I converted my entire codebase and made everything slower. It's perfect for physics simulations, graphics processing, or any bulk operations. Avoid SOA when you're doing object-oriented stuff or random access; I learned that the hard way after spending a week wondering why my UI was crawling. The benchmark data shows SOA is 1.5x-3.5x faster for bulk operations, but it'll bite you in the ass for other patterns.
Q: Is disabling bounds checking safe?

A: Only disable bounds checking (#no_bounds_check) in hot loops where you've manually verified the bounds. Use it scope-by-scope, not globally - I made that mistake once and spent two days tracking down a corrupted stack. Production experience shows 5-15% performance gains, but the moment you get a buffer overrun without bounds checking, you're fucked. Your program will crash in the weirdest ways.
Q: Why are my Odin programs slower than expected?

A: Been there. Check these rookie mistakes first:

  • Benchmarking with -o:none - I did this for a month before realizing I was an idiot. Always use -o:speed for performance testing
  • Not using SOA for bulk operations - if you're iterating arrays, add #soa. It's free performance
  • Leaving bounds checking on everywhere - scope #no_bounds_check in your hot loops after you've verified safety
  • Context overhead in tight loops - mark math functions as "contextless" or they'll burn registers for no reason
Q: How much does the context parameter actually cost?

A: About 1-3% overhead in typical code due to register pressure. The context parameter consumes one register that could otherwise be used for computation. For functions called millions of times per frame, use "contextless" procedures.

Q: Can Odin match C performance exactly?

A: Nope. Odin runs at 90-95% of C performance, and that missing 5-10% is the price of sanity. The gap comes from bounds checking (you can disable it and pray), no undefined-behavior optimization (can't disable - this is intentional), and context parameter overhead (disable per-function with "contextless"). Honestly? I'll take 95% performance and code that doesn't randomly crash over C's "fast until it segfaults" approach.
Q: Should I use compile-time generics for performance?

A: Yes, compile-time generics ($T parameters) can inline and avoid vtable overhead. But here's the catch: Odin's generic inlining has limitations that'll make you miss C++ templates (and that's saying something). Use $ parameters for hot code where you need specialization, but don't expect miracles. The generic inlining limitation bit me hard when implementing sorting algorithms.

Q: What's the fastest way to sort arrays in Odin?

A: Use the core library's `slice.sort` with compile-time comparators when possible - see the sketch below. For maximum performance, consider implementing specialized sorts for your data types since generic inlining is limited.
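For reference, a minimal sketch of the core:slice calls (a runtime comparator shown here; check your Odin version's signatures):

import "core:slice"

sort_example :: proc() {
    nums := [5]int{3, 1, 4, 1, 5}
    slice.sort(nums[:])  // works for ordered element types

    // Comparator-based sort - subject to the generic inlining limits above
    slice.sort_by(nums[:], proc(a, b: int) -> bool { return a > b })
}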

Q: How do I optimize memory allocations?

A: Four rules, with a sketch of the first one after this list:

  • Use arena allocators for temporary allocations
  • Implement object pools for frequently allocated/deallocated objects
  • Minimize allocations in hot loops - prefer stack allocation or pre-allocated buffers
  • Use SOA to reduce cache misses, which are more expensive than allocation overhead
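A minimal sketch of the arena rule using the built-in temp allocator (an arena the runtime already provides; the 4096 element count is arbitrary):

simulate_frame :: proc() {
    // Everything allocated from temp_allocator this frame dies in one
    // shot - no per-object frees
    defer free_all(context.temp_allocator)

    scratch := make([]f32, 4096, context.temp_allocator)
    scratch[0] = 1.0  // ... use the scratch buffer ...
}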
Q: Is SIMD automatic in Odin?

A: Array programming operations (a + b on arrays) auto-vectorize with -o:speed, but not all operations vectorize. Use intrinsics for explicit SIMD when automatic vectorization isn't sufficient.

Q: What about debugging optimized code?

A: Optimized builds (-o:speed) have limited debug information. Use -o:minimal for development to keep decent performance with debugging capability. The debugging experience varies by platform - Windows with Visual Studio works best.
Q: How do I profile Odin code?

A: This is where Odin shows its age. Profiling support is a fucking mess:

  • Linux: Use perf but prepare for frustration. Debug info is broken half the time, and you'll end up with useless stack traces
  • Windows: Visual Studio debugger actually works, which makes it the best platform for Odin development (never thought I'd say that)
  • Cross-platform: Manual timing with time.now() around hot sections. It's 2025 and we're back to printf debugging
Q: Should I worry about binary size?

A: Odin binaries start around 180KB due to static linking. For minimal size, use -o:size -no-crt -default-to-nil-allocator. Most applications shouldn't worry about binary size unless targeting embedded systems.

Q: Can I optimize compilation speed?

A: Good luck. Use -o:none -use-separate-modules for development builds, but Odin rebuilds everything every fucking time since it lacks incremental compilation. Recent discussions suggest -use-separate-modules helps on some platforms, but "helps" is generous. My 50K-line codebase takes 30+ seconds to compile on release builds. Plan your coffee breaks accordingly.

Q: What performance tools work with Odin?

A: Depends on how masochistic you're feeling:

  • Windows: Visual Studio debugger and profiler work great, Intel VTune if you're fancy
  • Linux: perf exists but good luck getting useful stack traces. Mostly manual timing
  • Cross-platform: Manual instrumentation with the time package. Welcome to 1995
Q: Is there a performance difference between platforms?

A: Performance is similar, but the debugging experience is night and day. Linux debugging is a nightmare - you'll spend more time fighting the tooling than optimizing code. Windows actually works, which feels wrong, but here we are.
Q: How do custom allocators help performance?

A: Context-based allocators can provide 1.5x-10x improvements for allocation-heavy code (a hand-rolled pool sketch follows this list):

  • Arena allocators: fast linear allocation, bulk deallocation
  • Pool allocators: fixed-size object reuse, eliminates fragmentation
  • Stack allocators: LIFO allocation pattern, cache-friendly
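The core library doesn't hand you a one-line pool type, so here's a hand-rolled sketch of the pool idea (my own toy version - indices instead of pointers to keep it short):

// Fixed-capacity pool: a free-index stack over a flat array
Pool :: struct($T: typeid, $N: int) {
    items:     [N]T,
    free_list: [N]int,
    top:       int,
}

pool_init :: proc(p: ^Pool($T, $N)) {
    for i in 0..<N do p.free_list[i] = i
    p.top = N
}

// O(1) acquire: pop a free slot index
pool_acquire :: proc(p: ^Pool($T, $N)) -> (idx: int, ok: bool) {
    if p.top == 0 do return -1, false
    p.top -= 1
    return p.free_list[p.top], true
}

// O(1) release: push the index back - no fragmentation, no free()
pool_release :: proc(p: ^Pool($T, $N), idx: int) {
    p.free_list[p.top] = idx
    p.top += 1
}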
Q: Should I avoid dynamic arrays and maps?

A: Odin's built-in [dynamic] arrays and map types are well-optimized, with Robin Hood hashing and SOA layouts internally. They're usually faster than rolling your own unless you have very specific requirements.

Those are the most common questions, but knowing the answers isn't enough. You need to understand the specific patterns that actually work in practice - and more importantly, the ones that'll bite you in the ass when you least expect it.

Performance Patterns That Actually Work (And the Ones That Bit Me)

After getting my ass kicked by Odin performance optimization for months, I've learned which patterns actually help and which ones are just academic bullshit. This comes from brutal real-world experience at JangaFX and my own debugging nightmares building production graphics software.

The Hot/Cold Data Separation Pattern (When It Doesn't Backfire)

Hot/cold data separation sounds great in theory, but I learned the hard way that "hot" and "cold" depend on your actual usage patterns, not what you think they should be. I spent two weeks optimizing the wrong data structures before profiling showed me I was an idiot.

// Hot data: accessed every frame
RenderObject :: struct {
    transform: matrix[4, 4]f32,
    visible: bool,
    layer: u8,
}

// Cold data: accessed occasionally
RenderObjectMetadata :: struct {
    name: string,
    creation_time: time.Time,
    debug_info: map[string]any,
}

// Cache-optimized for hot path
render_objects: #soa[10000]RenderObject

// Normal layout for cold data
metadata: [10000]RenderObjectMetadata

This pattern works great when your hot/cold assumptions are actually correct. When they're wrong (like mine were), you just add complexity without performance gains.

Cache Line Optimization

[Figures: memory access pattern performance; SIMD vector processing]

Arena-Based Memory Management (AKA "Fast Until It Leaks")

Arena allocators are Odin's secret weapon for temporary data, but they can also be a memory leak nightmare if you're not careful. I've seen 50GB memory usage from a "temporary" arena that never got cleaned up.

// Temporary arena for per-frame allocations
temp_arena: mem.Arena
mem.arena_init(&temp_arena, make([]u8, 8 * mem.Megabyte))
defer mem.arena_free_all(&temp_arena)

// Switch to arena context for frame processing
old_allocator := context.allocator
context.allocator = mem.arena_allocator(&temp_arena)
defer context.allocator = old_allocator

// All allocations now use fast arena - no malloc overhead
temp_vertices := make([]Vertex, vertex_count)
temp_indices := make([]u32, index_count)
// Data automatically cleaned up at frame end

This pattern provides 1.5x-10x performance improvements for allocation-heavy code, assuming you remember to actually clean up the arena. I forgot the defer once and watched my program eat 32GB of RAM. Memory pool allocation patterns and region-based memory management provide the theoretical foundation for arena allocators, while arena allocator implementations and custom allocator patterns in systems programming offer practical guidance for building efficient memory management strategies.

Cache-Aware Loop Organization (Or How I Learned to Stop Worrying and Love Data Layout)

Understanding cache behavior helps you structure loops for maximum efficiency. Modern processors load 64-byte cache lines, so organizing data access matters way more than the algorithm complexity - a lesson I learned after optimizing the wrong thing for weeks.

[Figure: memory latency across the cache hierarchy]

// Poor cache utilization - scattered memory access
poor_update :: proc(entities: []Entity) {
    for &entity in entities {
        update_position(&entity)      // Accesses position fields
        update_physics(&entity)       // Accesses mass, force fields
        update_rendering(&entity)     // Accesses mesh, material fields
    }
}

// Better cache utilization - grouped by data access
efficient_update :: proc(entities: #soa[]Entity) {
    // Process all positions together - cache-friendly
    for i in 0..<len(entities) {
        entities[i].position += entities[i].velocity * dt
    }

    // Process all physics data together
    for i in 0..<len(entities) {
        entities[i].velocity += calculate_force(entities[i].mass) * dt
    }

    // Process rendering data together
    for i in 0..<len(entities) {
        update_transform_matrix(&entities[i])
    }
}

This data-oriented approach can provide 2-3x performance improvements on large datasets, but only if your data actually fits the access pattern. I spent a month reorganizing code before realizing my "hot path" wasn't that hot.

SIMD-Friendly Array Programming (When the Auto-Vectorizer Doesn't Hate You)

Odin's array programming syntax automatically vectorizes with -o:speed, but the auto-vectorizer is picky as hell. Sometimes it works perfectly, sometimes it ignores obvious vectorization opportunities for reasons known only to LLVM. Intel's vectorization guidelines help understand when auto-vectorization works. SIMD programming best practices and vectorization techniques provide deeper insights into optimizing array operations.

// This auto-vectorizes with -o:speed
positions: [1000][3]f32
velocities: [1000][3]f32
forces: [1000][3]f32

// Component-wise operations vectorize automatically
positions += velocities * dt
velocities += forces * dt

// Manual SIMD when automatic vectorization isn't enough.
// core:simd is portable (import it at file scope); with -o:speed,
// #simd[8]f32 maps to 256-bit vector registers on amd64.
import "core:simd"

vectorized_multiply :: proc(a, b: []f32) {
    assert(len(a) == len(b))
    i := 0
    // Process 8 lanes at a time
    for ; i + 8 <= len(a); i += 8 {
        va := simd.from_slice(#simd[8]f32, a[i:i+8])
        vb := simd.from_slice(#simd[8]f32, b[i:i+8])
        prod := simd.to_array(va * vb)
        copy(a[i:i+8], prod[:])
    }
    // Handle remaining elements scalar-wise
    for ; i < len(a); i += 1 {
        a[i] *= b[i]
    }
}

The Contextless Performance Pattern (For When Every Register Counts)

For functions called millions of times per frame, the context parameter overhead becomes measurable. Use contextless procedures strategically, but don't go crazy - I made everything contextless once and broke error handling for a week.

// Performance-critical math operations
dot_product :: proc "contextless" (a, b: [3]f32) -> f32 {
    return a.x * b.x + a.y * b.y + a.z * b.z
}

cross_product :: proc "contextless" (a, b: [3]f32) -> [3]f32 {
    return {
        a.y * b.z - a.z * b.y,
        a.z * b.x - a.x * b.z,
        a.x * b.y - a.y * b.x,
    }
}

// Use in hot loops without context overhead
physics_update :: proc(particles: #soa[]Particle) {
    for i in 0..<len(particles) {
        // No context overhead in tight loop
        force := cross_product(particles[i].velocity, particles[i].magnetic_field)
        acceleration := force / particles[i].mass
        particles[i].velocity += acceleration * dt
    }
}

Compile-Time Optimization Patterns

Leverage Odin's compile-time features to eliminate runtime overhead:

// Compile-time configuration
PHYSICS_INTEGRATION :: #config(PHYSICS_INTEGRATION, "rk4")  // "euler", "verlet", "rk4"

physics_step :: proc(particles: #soa[]Particle, dt: f32) {
    when PHYSICS_INTEGRATION == "euler" {
        // Simple Euler integration - fast but less accurate
        for i in 0..<len(particles) {
            particles[i].velocity += particles[i].acceleration * dt
            particles[i].position += particles[i].velocity * dt
        }
    } else when PHYSICS_INTEGRATION == "rk4" {
        // Runge-Kutta 4th order - more accurate but slower
        for i in 0..<len(particles) {
            rk4_integrate(&particles[i], dt)
        }
    }
}

This pattern eliminates runtime branching and allows the compiler to optimize each integration method independently.
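You select the integrator at build time with -define, which is how #config values get set (quoting of the string value depends on your shell):

odin build . -o:speed -define:PHYSICS_INTEGRATION="euler"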

Error Handling in Performance-Critical Code

Odin's error handling can add overhead in tight loops. For performance-critical sections, consider pre-validation:

// Validate inputs once outside the loop
validate_particle_data :: proc(particles: #soa[]Particle) -> bool {
    for i in 0..<len(particles) {
        if particles[i].mass <= 0 do return false
        if math.is_nan(particles[i].position.x) do return false
        // ... other validations
    }
    return true
}

// Process without error checking in hot loop
process_particles_unsafe :: proc(particles: #soa[]Particle) {
    #no_bounds_check {
        for i in 0..<len(particles) {
            // No error checking - maximum performance
            particles[i].position += particles[i].velocity * dt
        }
    }
}

// Combined safe interface
process_particles :: proc(particles: #soa[]Particle) -> bool {
    if !validate_particle_data(particles) do return false
    process_particles_unsafe(particles)
    return true
}

The Production Reality Check (Why I Wasted Months on the Wrong Optimizations)

These optimization patterns come with trade-offs, and production experience shows that premature optimization is still the root of all evil. I learned this by optimizing everything and making my codebase unmaintainable.

The approach that actually works:

  1. Write clear, simple code first (I skipped this step and regretted it)
  2. Profile to find actual bottlenecks (not where you think they are)
  3. Apply targeted optimizations to hot paths only (not fucking everywhere)
  4. Measure the impact of each optimization (some made things worse)

At JangaFX, the most impactful optimizations were:

  • SOA data layout for particle systems (40% frame time reduction - this one was actually magic)
  • Custom arena allocators for temporary data (3x allocation performance when used correctly)
  • Contextless procedures for mathematical operations (5% overall improvement - less than expected)

The key insight: Odin's performance features work best when applied selectively to code that actually needs optimization. Applying them everywhere just makes your code harder to debug without meaningful performance gains. Profiling-driven optimization and performance measurement best practices provide guidance for identifying genuine performance bottlenecks before applying these optimization techniques.
