Last month our user service started dying every 6 hours. Memory usage kept climbing: around 2GB at first, then 4GB, then somewhere north of 8GB before it crashed. It could have hit 10GB; I never caught the exact peak because by then we were in full panic mode. The alerts went off at 3am, and management was asking questions by 9am.
The smoking gun wasn't obvious. Heap dumps showed millions of protobuf objects, but they all looked valid. No circular references, no obvious leaks. Just a steady accumulation of UserProfile messages that should have been garbage collected.
The Real Problem: Object Builders That Never Die
The issue was in our caching layer. We were using protobuf builders to construct messages, then storing references to those builders "for performance." Every cache hit would reuse the builder, add new data, and build a message. Except we never called .clear() on the builders.
In protobuf, builders accumulate state. Even after you call .build(), the builder keeps all the intermediate objects in memory. With thousands of cache operations per second, we were leaking megabytes per minute.
// This slowly kills your heap
private static final UserProfile.Builder reusedBuilder = UserProfile.newBuilder();

// Instead of this (which leaks)
UserProfile profile = reusedBuilder
        .setId(userId)
        .setName(userName)
        .build();

// Do this (which doesn't)
UserProfile profile = UserProfile.newBuilder()
        .setId(userId)
        .setName(userName)
        .build();
The "optimization" of reusing builders was actually causing the leak. Each build operation left data in the builder, and the GC couldn't clean up because we held a static reference.
How We Actually Debug Protobuf Memory Issues
Standard heap analysis tools don't help much with protobuf because everything looks legitimate. Here's what actually works:
1. Count objects, not just memory. If you have 10 million RepeatedFieldBuilder instances, that's your problem even if they're small.
2. Track message lifecycle. Add logging to your message creation and destruction. If creation vastly outpaces destruction, you've got a leak (a sketch of one way to do this follows the list).
3. Use protobuf-specific profiling. Most profilers can break down protobuf operations. Look for parseFrom() calls whose resulting messages never get cleared or collected.
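For the lifecycle tracking in point 2, one low-tech option is to route message creation through a small factory and compare creations against collections. This is a hypothetical sketch, not production code: ProfileLifecycleTracker is a made-up name, the id and name fields are assumed to be strings, and registering a Cleaner per message has its own cost, so on a hot path you would probably only sample a fraction of instances.

import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical factory that counts how many UserProfile messages have been
// created versus how many the GC has actually reclaimed.
final class ProfileLifecycleTracker {
    private static final Cleaner CLEANER = Cleaner.create();
    private static final AtomicLong created = new AtomicLong();
    private static final AtomicLong collected = new AtomicLong();

    static UserProfile create(String userId, String userName) {
        UserProfile profile = UserProfile.newBuilder()
                .setId(userId)
                .setName(userName)
                .build();
        created.incrementAndGet();
        // The cleanup action runs after the message becomes unreachable and is collected.
        CLEANER.register(profile, collected::incrementAndGet);
        return profile;
    }

    static void logCounts() {
        long c = created.get();
        long g = collected.get();
        // A steadily widening gap means something is pinning messages.
        System.out.printf("profiles created=%d collected=%d live~%d%n", c, g, c - g);
    }
}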
The fix took 20 minutes once we understood the problem. The diagnosis took 4 hours of late-night debugging while our service kept crashing.
Performance Problems You Can Actually Control
Most protobuf performance advice focuses on schema design and field ordering. That stuff matters, but it's not what kills services in production. Here's what actually moves the needle:
Memory churn is worse than memory usage. Creating millions of small objects stresses the GC more than a few large objects. Reuse message instances when possible, but do it right.
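Built messages are immutable, so the safe version of reuse is to share the built message itself, not a mutable builder. A sketch with a deliberately simple, illustrative cache (in practice you'd bound it and evict entries, otherwise you've traded one leak for another):

// Built protobuf messages are immutable, so one instance can be shared across
// threads freely; this hypothetical cache builds each profile once per key.
private final java.util.concurrent.ConcurrentHashMap<String, UserProfile> profileCache =
        new java.util.concurrent.ConcurrentHashMap<>();

UserProfile profileFor(String userId, String userName) {
    return profileCache.computeIfAbsent(userId,
            id -> UserProfile.newBuilder()
                    .setId(id)
                    .setName(userName)
                    .build());
}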
Parsing large, deeply nested messages destroys performance. Every level of nesting adds another layer of recursive parsing and object allocation, and past roughly 10 levels the cost really starts to show. Flatten your schema or split large messages into smaller ones.
Reflection-based parsing is roughly 10x slower. Make sure you're using generated classes, not generic message handling. This can happen accidentally when you upgrade protobuf libraries.
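One way the slow path sneaks in is code that handles everything through descriptors and com.google.protobuf.DynamicMessage instead of the generated class. A quick sketch of the two paths, assuming the same UserProfile message and a bytes array of serialized data:

// Fast path: the generated parser, no reflection
UserProfile profile = UserProfile.parseFrom(bytes);

// Slow path: reflection-based parsing through descriptors; easy to fall into
// when everything is routed through generic Message handling
DynamicMessage generic = DynamicMessage.parseFrom(UserProfile.getDescriptor(), bytes);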
The problems that actually matter in production aren't the ones the documentation warns you about. They're the ones that emerge from how you use protobuf in your specific system architecture.