Consumer lag monitoring is a nightmare. I've wasted weeks tuning thresholds that either fire constantly during traffic spikes or miss dead consumers for hours.
The first time I set up Kafka monitoring, I used JMX metrics and Prometheus. Set the lag threshold to 10,000 messages because that's what some tutorial said. Black Friday hits, traffic doubles, every single alert fires at 2AM. Ops team is fucking pissed. So I raise the threshold to 50,000. A few weeks later, a consumer dies on Thursday afternoon and sits dead all weekend because 45,000 messages of lag was "fine" according to my brilliant threshold.
So Burrow does this clever shit - it reads the __consumer_offsets topic that Kafka uses internally to track where consumers are. No per-app JMX bullshit setup, no config for each consumer group, no "we forgot to monitor the payments service" disasters. Point Burrow at your cluster and it sees everything automatically.
The sliding window thing is what actually makes it not suck. Instead of "lag > threshold = bad," it looks at the pattern over time. High lag but dropping? Consumer's probably catching up from a restart or batch job. High lag and climbing? Falling behind. High lag and flatlined while new messages pile up? Consumer's fucking dead, go fix it.
The Consumer Monitoring Shitshow I Lived Through
Threshold Tuning Hell
I've spent three different weeks at three different jobs tuning consumer lag thresholds. Every time, same story. Start with 1,000 messages like every tutorial recommends - alerts fire on every deploy when consumers restart and rebalance. Raise to 10,000 based on capacity planning docs - Black Friday traffic spike wakes the whole team up. Raise to 100,000 out of desperation - miss actually dead consumers for days.
There's no sweet spot. Traffic patterns aren't consistent, batch processing creates natural spikes, and consumers have different processing speeds depending on message complexity. One threshold can't cover all scenarios.
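To make the problem concrete, here's a toy sketch (all numbers and group names made up) of what a static threshold actually checks - it treats three completely different situations the same way:

```python
# Sketch: a static lag threshold can't tell these scenarios apart.
# All numbers and group names are hypothetical.

LAG_THRESHOLD = 50_000

scenarios = {
    "catching-up-after-restart": 60_000,   # lag is high but dropping fast: fine
    "black-friday-spike": 55_000,          # keeping pace with a traffic burst: fine
    "dead-since-thursday": 45_000,         # offsets haven't moved in days: broken
}

for name, lag in scenarios.items():
    alert = lag > LAG_THRESHOLD
    print(f"{name}: lag={lag}, alert={alert}")

# Result: the two healthy consumers page someone, the dead one doesn't.
```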
JMX Disappearing Act
JMX consumer metrics vanish the moment your consumer process dies. Which is exactly when you need them most. Why the fuck does the monitoring disappear right when something breaks? I've spent more time debugging "why aren't we getting lag metrics" than fixing actual broken consumers. The monitoring system can't tell you about problems it can't see.
The "Forgot to Monitor" Problem
New service deploys, team "forgets" to add JMX endpoint configuration. Happened to us last year - a user activity processor sat dead for three weeks and nobody noticed until the monthly report came up empty. "Why didn't we know this stopped?" Because nobody remembered to set up monitoring for the new service. Happens every fucking time in large organizations.
Numbers Without Context Are Useless
Consumer lag of 50,000 messages. Should you panic? Maybe the consumer's processing a daily batch job and working through backlog normally. Or maybe it's been dead for the last two hours. The number by itself tells you jack shit.
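For what it's worth, that raw number is just a point-in-time delta between committed offsets and the head of the log. Here's a quick sketch of how you'd compute it with the kafka-python client (my choice for illustration, nothing to do with Burrow); the broker address and group name are made up:

```python
# Sketch: compute a point-in-time lag number the way a dashboard would.
# Assumes the kafka-python client; broker address and group name are hypothetical.
from kafka import KafkaAdminClient, KafkaConsumer

BOOTSTRAP = "localhost:9092"
GROUP = "payments-service"

admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP)
committed = admin.list_consumer_group_offsets(GROUP)      # TopicPartition -> OffsetAndMetadata

consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP)
log_end = consumer.end_offsets(list(committed.keys()))    # TopicPartition -> latest offset

total_lag = sum(log_end[tp] - meta.offset for tp, meta in committed.items())
print(f"{GROUP} total lag: {total_lag}")
# 50,000 here could mean "working through a batch" or "dead for two hours" -
# the snapshot alone can't tell you which.
```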
How Burrow Fixes This Mess
The key insight: stop asking consumers how they're doing (they lie or disappear). Read Kafka's internal bookkeeping instead.
Every time a consumer commits an offset, Kafka writes it to the __consumer_offsets topic. Burrow reads this topic directly, so it sees every consumer group whether they're running, dead, or somewhere in between. No JMX setup, no per-app configuration, no gaps.
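A quick way to see the payoff: ask Burrow's HTTP API what it's already tracking. The /v3 endpoint paths below follow Burrow's documented API, but the host, port, and JSON field names are my assumptions - check them against your Burrow version:

```python
# Sketch: list every consumer group Burrow has discovered, with zero per-app config.
# Endpoint paths follow Burrow's v3 HTTP API; host/port and response field names
# are assumptions - verify against your deployment.
import requests

BURROW = "http://localhost:8000"

clusters = requests.get(f"{BURROW}/v3/kafka").json().get("clusters", [])
for cluster in clusters:
    groups = requests.get(f"{BURROW}/v3/kafka/{cluster}/consumer").json().get("consumers", [])
    print(f"{cluster}: {len(groups)} consumer groups discovered")
    for group in groups:
        print(f"  {group}")
```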
The Sliding Window Thing
Here's what makes it not suck. Instead of "lag > 50,000 = alert," Burrow looks at the last 10 offset commits over about 10 minutes. If lag is high but the commits are moving forward and getting closer to the current offset, the consumer is catching up. Maybe it restarted, maybe it's working through a batch job. Either way, it's not dead.
But if the lag is high and the commits stopped advancing while new messages keep arriving? Consumer's dead. The sliding window catches this pattern reliably.
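Here's the idea sketched in Python. This is my simplification, not Burrow's actual evaluator (which works per partition and has more states than this), and the sample windows are invented:

```python
# Sketch of the sliding-window idea: classify a consumer from its recent offset
# commits instead of a single lag number. Simplified compared to Burrow's real
# evaluator; sample data is made up.
from typing import List, Tuple

# Each sample: (committed_offset, log_end_offset) at commit time, oldest first.
Window = List[Tuple[int, int]]

def evaluate(window: Window) -> str:
    lags = [end - committed for committed, end in window]
    commits_advancing = window[-1][0] > window[0][0]
    lag_growing = lags[-1] > lags[0]

    if lags[-1] == 0 or (commits_advancing and not lag_growing):
        return "OK"        # keeping up, or catching up after a restart/batch
    if commits_advancing and lag_growing:
        return "WARNING"   # still committing, but falling further behind
    return "ERROR"         # commits stopped while new messages keep arriving

catching_up = [(100, 5100), (2100, 5200), (4100, 5300)]
falling_behind = [(100, 5100), (150, 6100), (200, 7100)]
dead = [(100, 5100), (100, 6100), (100, 7100)]

for name, window in [("catching_up", catching_up),
                     ("falling_behind", falling_behind),
                     ("dead", dead)]:
    print(name, "->", evaluate(window))
```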
Three States I Actually Understand
- OK: Consumer is keeping up or catching up normally
- WARNING: Falling behind but still making progress
- ERROR: Consumer is stalled, dead, or completely broken
I query the HTTP API from my existing Prometheus setup. Same alerting infrastructure, just better consumer status evaluation. Works with any Kafka since 0.8.2, when __consumer_offsets became a thing.
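My polling glue is basically this: hit Burrow's consumer status endpoint and collapse the result into the three buckets above. The endpoint path follows Burrow's v3 API, but the host, cluster, group name, and exact status strings are assumptions you should verify against your own deployment:

```python
# Sketch: poll Burrow's status endpoint and map its richer status set onto
# OK / WARNING / ERROR. Host, cluster, group, and status strings are assumptions.
import requests

BURROW = "http://localhost:8000"
CLUSTER = "production"
GROUP = "payments-service"

BUCKETS = {
    "OK": "OK",
    "WARN": "WARNING",
    "ERR": "ERROR",
    "STOP": "ERROR",    # commits stopped while messages keep arriving
    "STALL": "ERROR",   # committing, but the offset never moves
}

resp = requests.get(f"{BURROW}/v3/kafka/{CLUSTER}/consumer/{GROUP}/status").json()
status = resp.get("status", {})
bucket = BUCKETS.get(status.get("status", ""), "ERROR")  # unknown -> treat as ERROR
print(f"{GROUP}: burrow={status.get('status')} bucket={bucket} totallag={status.get('totallag')}")
```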
This is the monitoring setup that finally let me sleep through the night without lag alerts going off every time traffic spiked or a batch job started.