Memory problems kill Zig apps in production. I've seen these same failure patterns during traffic spikes, demos, and compliance audits. The timing is always terrible.
Why DebugAllocator Won't Save You
DebugAllocator is perfect for development - catches every leak, shows you the exact problem line, makes debugging feel straightforward. Try it in production and API calls slow from 100ms to 500ms. Everything times out. DebugAllocator trades speed for safety with extensive tracking overhead.
The rename from GeneralPurposeAllocator was telling. The original name suggested production readiness when it's really a development tool. DebugAllocator makes the purpose clear.
So you're stuck with production allocators that prioritize speed over helpful error messages:
- SmpAllocator - Fast for multi-threaded workloads, provides minimal debugging info when failures occur
- ArenaAllocator - Great for bulk cleanup, makes individual leak detection difficult within the arena
- page_allocator - Direct OS interface, delegates error handling to kernel-level mechanisms
Production allocators prioritize performance over debugging assistance. Speed comes at the cost of diagnostic information.
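One workable compromise is to pick the allocator by build mode: keep DebugAllocator's leak reports while you develop, and hand release builds the fast path. Here's a minimal sketch, assuming the Zig 0.14+ names std.heap.DebugAllocator and std.heap.smp_allocator:

const std = @import("std");
const builtin = @import("builtin");

pub fn main() !void {
    // Leak-tracking allocator for development builds
    var debug_state: std.heap.DebugAllocator(.{}) = .init;
    defer _ = debug_state.deinit(); // reports leaks on exit; result discarded here

    // Comptime-known branch: debug builds get tracking, release builds get speed
    const allocator = if (builtin.mode == .Debug)
        debug_state.allocator()
    else
        std.heap.smp_allocator;

    const buf = try allocator.alloc(u8, 256);
    defer allocator.free(buf);
}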
Death By OOMKiller
Your app dies with OutOfMemory but free -h shows 20GB unused. Container limits are the culprit - your process thinks it can access all system RAM, but the OOMKiller terminates it when it crosses a 512MB limit set deployments ago.
This failure mode is particularly frustrating. Process appears healthy with reasonable memory usage, then dies instantly. Container hits its limit while the host has gigabytes available. OOMKiller enforces container limits regardless of host capacity.
Debugging Container Memory Issues
Check actual usage versus limits with basic monitoring:
## See if your process is actually using too much memory
watch ps -o rss= -p $(pgrep your-app)
## Check if your container has secret memory limits (cgroup v1)
cat /sys/fs/cgroup/memory/memory.limit_in_bytes
## On cgroup v2 hosts, the limit lives in memory.max instead
cat /sys/fs/cgroup/memory.max
## If either shows something like 536870912 (512MB), you found your problem
## Check OOMKiller logs for terminated processes
dmesg | grep -i "killed process"
OOMKiller terminates processes without warning or graceful shutdown when container limits are exceeded.
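It also helps if the service logs its own budget at startup instead of assuming it owns the host's RAM. A minimal sketch, assuming Linux with cgroup v2 (the v1 file from the commands above works the same way); containerMemoryLimit is a hypothetical helper name:

const std = @import("std");

// Returns the cgroup v2 memory limit in bytes, or null if no limit is set
// or the file isn't there (non-Linux, cgroup v1, etc.).
fn containerMemoryLimit() ?u64 {
    const file = std.fs.openFileAbsolute("/sys/fs/cgroup/memory.max", .{}) catch return null;
    defer file.close();

    var buf: [64]u8 = undefined;
    const len = file.readAll(&buf) catch return null;
    const trimmed = std.mem.trim(u8, buf[0..len], " \n");
    if (std.mem.eql(u8, trimmed, "max")) return null; // no limit configured
    return std.fmt.parseInt(u64, trimmed, 10) catch null;
}

pub fn main() void {
    if (containerMemoryLimit()) |limit| {
        std.log.info("container memory limit: {d} bytes", .{limit});
    } else {
        std.log.info("no container memory limit detected", .{});
    }
}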
The Nuclear Option: Just Give It More Memory
First, try the obvious fix - bump your container memory limits by 50% and see if the problem goes away. If your app was happy with 512MB but dies under load, try 1GB. It's not elegant, but it works while you figure out the real problem.
## Docker: Update your memory limits
docker run -m 1g your-app
## Kubernetes: Update your resource limits
## resources:
## limits:
## memory: "1Gi"
Actually Fix the Problem
For large file processing, avoid loading everything into memory simultaneously. Loading large files works fine on development machines with abundant RAM but fails in production containers with memory limits. Process files in chunks instead of reading entirely into memory.
// Don't do this - you'll run out of memory with large files
const file_content = try std.fs.cwd().readFileAlloc(allocator, path, std.math.maxInt(usize));
// Do this instead - process chunks
const file = try std.fs.cwd().openFile(path, .{});
defer file.close();
const reader = file.reader();
const chunk_size = 64 * 1024; // 64KB chunks
var buffer: [chunk_size]u8 = undefined;
while (true) {
    const bytes_read = try reader.read(buffer[0..]);
    if (bytes_read == 0) break; // end of file
    // Process this chunk
    try processChunk(buffer[0..bytes_read]);
}
The Slow Death: Memory Leaks
Memory leaks are particularly frustrating. Your app starts at 100MB, runs fine for hours, then slowly grows to 2GB before OOMKiller terminates it. These failures typically occur during off-hours or weekends.
ArenaAllocator in request handlers is a common mistake. Memory gets allocated for each request but the arena never resets, so every request's allocations pile on top of the previous ones. After thousands of requests, the arena holds far more memory than any single request ever needed, and none of it is released until shutdown.
Early Leak Detection
Set up monitoring to detect memory growth before service failure:
## Simple memory watcher - runs every 30 seconds
while true; do
RSS=$(ps -o rss= -p $(pgrep your-service) 2>/dev/null || echo "0")
echo "$(date '+%H:%M:%S') RSS: ${RSS}KB" | tee -a memory.log
# Alert if memory grows more than 50MB in 10 minutes
# (implement your own alerting logic here)
sleep 30
done
Set your alerts to trigger at 85% of container memory. That gives you maybe 30 minutes to debug before the OOMKiller shows up. Any consistent growth over 1MB per hour in a stable service means you have a leak.
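If you would rather have the service report on itself, it can read its own resident set size from /proc and log it alongside request metrics. A minimal sketch, Linux-only; currentRssBytes is a hypothetical helper, and the 4 KiB page size is an assumption (check getconf PAGESIZE on your hosts):

const std = @import("std");

// Reads the resident set size from /proc/self/statm (second field, in pages).
fn currentRssBytes() !u64 {
    const file = try std.fs.openFileAbsolute("/proc/self/statm", .{});
    defer file.close();

    var buf: [128]u8 = undefined;
    const len = try file.readAll(&buf);

    var it = std.mem.tokenizeScalar(u8, buf[0..len], ' ');
    _ = it.next(); // skip total program size
    const rss_pages = try std.fmt.parseInt(u64, it.next() orelse return error.UnexpectedFormat, 10);
    return rss_pages * 4096; // assumes 4 KiB pages
}

pub fn main() !void {
    std.log.info("RSS: {d} KB", .{(try currentRssBytes()) / 1024});
}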
The Arena Allocator Trap
ArenaAllocator appears simple - allocate freely and call deinit() when finished. In web servers, "finished" never happens. This pattern causes repeated memory issues:
// This code looks innocent but will eat all your memory
var arena = std.heap.ArenaAllocator.init(allocator);
defer arena.deinit(); // This only runs when the server shuts down
while (handleRequest()) { // This loop runs forever
const request = getNextRequest();
// Each request adds to the arena, reset never happens
const response = try arena.allocator().alloc(u8, request.size);
processRequest(response);
// Memory just accumulates here forever
}
Add arena reset to prevent accumulation:
var arena = std.heap.ArenaAllocator.init(allocator);
defer arena.deinit();
while (handleRequest()) {
defer _ = arena.reset(.retain_capacity); // This one line saves your ass; reset returns a bool we can ignore
const request = getNextRequest();
const response = try arena.allocator().alloc(u8, request.size);
processRequest(response);
}
Error handling paths frequently skip cleanup when using explicit deallocation:
// Ensure cleanup happens in all code paths
const data = try allocator.alloc(DataType, count);
defer allocator.free(data); // Executes even if processData fails
try processData(data);
Use-After-Free Debugging
Use-after-free bugs hide in development and surface in production. DebugAllocator never reuses memory addresses, so buggy code continues working in development. Production allocators aggressively recycle memory - freed pointers reference new data, causing segfaults.
Detection Strategies
When stack traces are unreliable, use systematic code analysis to identify lifetime issues:
// Problematic pattern - unclear lifetime
var data: ?[]u8 = null;
if (condition) {
data = try allocator.alloc(u8, size);
}
// data might be freed elsewhere before use
if (data) |d| {
// Potential use-after-free
processData(d);
}
// Better pattern - clear scope and lifetime
if (condition) {
const data = try allocator.alloc(u8, size);
defer allocator.free(data);
processData(data); // Clear lifetime within scope
}
Enable core dumps for crash analysis when stack traces are insufficient:
## Enable core dumps
ulimit -c unlimited
## Analyze core dump after crash
gdb your-zig-binary core
Prevention Techniques
Use consistent patterns for memory lifetime management:
// Add debugging assertions in development builds
// (assumes a top-level `const builtin = @import("builtin");`)
const data = try allocator.alloc(u8, size);
defer {
    if (builtin.mode == .Debug) {
        // Fill freed memory with a recognizable pattern to catch use-after-free
        @memset(data, 0xDD);
    }
    allocator.free(data);
}
Implement null checks for defensive programming:
pub fn safeProcessData(data: ?[]const u8) !void {
    const valid_data = data orelse {
        std.log.err("Attempted to process null data pointer", .{});
        return error.InvalidData;
    };
    // Process valid_data
    _ = valid_data; // placeholder so the example compiles
}
When Production is On Fire
Memory outage? Here's what you do when everyone's screaming:
Step 1: Don't restart yet, grab evidence:
## Save the crime scene before you destroy it
ps aux > memory-snapshot.txt
cat /proc/meminfo > system-memory.txt
dmesg | grep -i "killed process" >> oom-killer.log
Step 2: Band-aid the bleeding:
- Double your container memory (temporary fix, not permanent)
- Rate limit requests so you're not drowning
- Route traffic away from the broken instances
- Start rejecting large requests (see the sketch below)
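For that last point, the cheapest guard is to refuse oversized bodies before allocating for them. A minimal sketch, assuming your handler knows the body size up front; readBody and max_body_bytes are hypothetical names, not part of any framework:

const std = @import("std");

// Hypothetical cap - tune it to what your endpoints actually need
const max_body_bytes: usize = 1 * 1024 * 1024; // 1 MiB

fn readBody(allocator: std.mem.Allocator, reader: anytype, content_length: usize) ![]u8 {
    // Fail fast instead of allocating something the container can't hold
    if (content_length > max_body_bytes) return error.RequestTooLarge;

    const body = try allocator.alloc(u8, content_length);
    errdefer allocator.free(body);
    try reader.readNoEof(body);
    return body;
}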
Step 3: Watch for the next explosion:
## Is it going to happen again?
watch -n 10 'ps -o rss= -p $(pgrep your-service)'
Root Cause Analysis
After stabilization, analyze monitoring data. Check what occurred before the failure - large requests, unusual traffic patterns, or unexpected data uploads.
Prevent recurrence:
- Add alerts for memory growth rate
- Monitor container memory percentage, not just usage
- Track which endpoints consume excessive memory
- Set up automatic circuit breakers
Test fixes in staging with production memory limits. Many fixes work with unlimited RAM but fail in production containers. Use realistic data sizes for testing.
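Closer to the code, Zig's testing helpers can shake out allocation-failure handling before a fix ever reaches a container. This is a minimal sketch using std.testing.checkAllAllocationFailures; buildReport is a hypothetical function standing in for whatever your fix touches:

const std = @import("std");

// Hypothetical function under test - it must clean up correctly even when an
// allocation fails partway through.
fn buildReport(allocator: std.mem.Allocator, n: usize) !void {
    const head = try allocator.alloc(u8, n);
    defer allocator.free(head);
    const body = try allocator.alloc(u8, n * 2);
    defer allocator.free(body);
}

test "buildReport survives allocation failure at every point" {
    // Re-runs buildReport, failing a different allocation each time, and
    // checks that nothing leaks on any path.
    try std.testing.checkAllAllocationFailures(std.testing.allocator, buildReport, .{@as(usize, 64)});
}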