Why I Started Using Burrow

Consumer lag monitoring is a nightmare. I've wasted weeks tuning thresholds that either fire constantly during traffic spikes or miss dead consumers for hours.

First time I set up Kafka monitoring, I used JMX metrics and Prometheus. Set the lag threshold to like 10,000 messages because that's what some tutorial said. Black Friday hits, traffic doubles, every single alert fires at 2AM. Ops team is fucking pissed. So I raise the threshold to 50,000 or something. Few weeks later, a consumer dies on Thursday afternoon and sits dead all weekend because something like 45,000 messages of lag was "fine" according to my brilliant threshold.

So Burrow does this clever shit - it reads the __consumer_offsets topic that Kafka uses internally to track where consumers are. No per-app JMX bullshit setup, no config for each consumer group, no "we forgot to monitor the payments service" disasters. Point Burrow at your cluster and it sees everything automatically.

The sliding window thing is what actually makes it not suck. Instead of "lag > threshold = bad," it looks at the pattern over time. High lag but dropping? Consumer's probably catching up from a restart or batch job. High lag and climbing? Falling behind. High lag and flatlined while new messages pile up? Consumer's fucking dead, go fix it.

The Consumer Monitoring Shitshow I Lived Through

Threshold Tuning Hell

I've spent three different weeks at three different jobs tuning consumer lag thresholds. Every time, same story. Start with 1,000 messages like every tutorial recommends - alerts fire on every deploy when consumers restart and rebalance. Raise to 10,000 based on capacity planning docs - Black Friday traffic spike wakes the whole team up. Raise to 100,000 out of desperation - miss actually dead consumers for days.

There's no sweet spot. Traffic patterns aren't consistent, batch processing creates natural spikes, and consumers have different processing speeds depending on message complexity. One threshold can't cover all scenarios.

JMX Disappearing Act

JMX consumer metrics vanish the moment your consumer process dies. Which is exactly when you need them most. Why the fuck does the monitoring disappear right when something breaks? I've spent more time debugging "why aren't we getting lag metrics" than fixing actual broken consumers. The monitoring system can't tell you about problems it can't see.

The "Forgot to Monitor" Problem

New service deploys, team "forgets" to add JMX endpoint configuration. Happened to us last year - some user activity processor died for like three weeks and nobody noticed until the monthly report was empty. "Why didn't we know this stopped?" Because nobody remembered to set up monitoring for the new service. Happens every fucking time in large organizations.

Numbers Without Context Are Useless

Consumer lag of like 50,000 messages. Should you panic? Maybe the consumer's processing a daily batch job and working through backlog normally. Or maybe it's been dead for the last 2 hours. The number by itself tells you jack shit.

How Burrow Fixes This Mess

The key insight: stop asking consumers how they're doing (they lie or disappear). Read Kafka's internal bookkeeping instead.

Every time a consumer commits an offset, Kafka writes it to the __consumer_offsets topic. Burrow reads this topic directly, so it sees every consumer group whether they're running, dead, or somewhere in between. No JMX setup, no per-app configuration, no gaps.

The Sliding Window Thing

Here's what makes it not suck. Instead of "lag > 50,000 = alert," Burrow looks at the last 10 offset commits over about 10 minutes. If lag is high but the commits are moving forward and getting closer to the current offset, the consumer is catching up. Maybe it restarted, maybe it's working through a batch job. Either way, it's not dead.

But if the lag is high and the commits stopped advancing while new messages keep arriving? Consumer's dead. The sliding window catches this pattern reliably.

Three States I Actually Understand

  • OK: Consumer is keeping up or catching up normally
  • WARNING: Falling behind but still making progress
  • ERROR: Consumer is stalled, dead, or completely broken

I query the HTTP API from my existing Prometheus setup. Same alerting infrastructure, just better consumer status evaluation. Works with any Kafka since 0.8.2 when __consumer_offsets became a thing.
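
Here's roughly what that polling looks like. A minimal sketch in Python, assuming the requests and prometheus_client libraries; the cluster name, the gauge name, port 9101, and the exact JSON field names ("consumers", nested "status") are my assumptions about a typical Burrow v3 response, so check what your instance actually returns before trusting it:

import time

import requests
from prometheus_client import Gauge, start_http_server

BURROW = "http://localhost:8000"   # Burrow's default HTTP port
CLUSTER = "production"             # hypothetical cluster name from your Burrow config

# Map Burrow's status strings to numbers Prometheus can alert on; anything unknown becomes -1
STATUS_VALUE = {"OK": 0, "WARN": 1, "WARNING": 1, "ERR": 2, "ERROR": 2, "STOP": 2, "STALL": 2}

# Hypothetical metric name: 0 = OK, 1 = WARNING, 2 = ERROR, -1 = unknown
consumer_status = Gauge("burrow_consumer_status", "Burrow consumer group status", ["group"])

def poll_once():
    groups = requests.get(f"{BURROW}/v3/kafka/{CLUSTER}/consumer", timeout=5).json()
    for group in groups.get("consumers", []):
        detail = requests.get(f"{BURROW}/v3/kafka/{CLUSTER}/consumer/{group}/status", timeout=5).json()
        status = detail.get("status", {}).get("status", "UNKNOWN")
        consumer_status.labels(group=group).set(STATUS_VALUE.get(status, -1))

if __name__ == "__main__":
    start_http_server(9101)   # exposes /metrics for Prometheus to scrape
    while True:
        poll_once()
        time.sleep(30)

From there, one Prometheus alert rule on the gauge value replaces every per-topic lag threshold I used to maintain.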

This is the monitoring setup that finally let me sleep through the night without lag alerts going off every time traffic spiked or a batch job started.

My Consumer Monitoring Journey (A Tale of Pain)

Tool              | Setup Pain                 | False Positives     | Missed Failures                | Monthly Cost | Verdict
----------------- | -------------------------- | ------------------- | ------------------------------ | ------------ | -------------------------
Burrow            | One config file            | None lately         | Haven't missed one since setup | $0           | Finally works
Prometheus + JMX  | Every service needs config | Constant            | Major (invisible when dead)    | ~$50 infra   | Gave up
DataDog           | Easy setup                 | Daily during spikes | Several hours each time        | $200+        | Overpriced threshold hell
Custom Prometheus | Like 3 weeks of suffering  | Weekly              | Who the fuck knows             | Dev time     | Stupidest idea ever

How Burrow Actually Works (The Technical Bits)

The Sliding Window Evaluation (Actually Clever)

The sliding window thing is what makes Burrow actually useful compared to threshold hell. Here's what it watches:

Commit Progression: Is the consumer regularly committing offsets that are moving forward? If offsets are stuck but messages keep arriving, the consumer is probably dead or stuck processing one bad message. Poison pill messages can kill consumers for hours while offsets don't advance.

Lag Trend Over Time: This is the key insight. High lag + decreasing = consumer is catching up from a batch job or restart. High lag + increasing = consumer is falling behind processing speed. High lag + flatlined = consumer is dead and you need to go fix it.

Pattern Recognition: During normal traffic spikes, healthy consumers show consistent sawtooth patterns. Random erratic lag jumps usually mean something's fucked - JVM GC pauses (especially with Java 8's G1GC before tuning), network partition issues, thread pool exhaustion in Spring Boot 2.x apps, or code bugs that cause intermittent failures.

The default window is 10 offset commits over ~10 minutes, which works well for most setups. You can tune it, but honestly the defaults are pretty solid.
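
To make that concrete, here's a stripped-down sketch of the evaluation in Python. This is not Burrow's actual implementation (Burrow is written in Go and its real rules handle more edge cases); the CommitSample type, the oldest-first ordering assumption, and the exact cutoffs are mine, purely to show how a window of commits turns into OK/WARNING/ERROR without any fixed lag threshold:

from dataclasses import dataclass

@dataclass
class CommitSample:
    committed_offset: int   # consumer's committed offset at commit time
    log_end_offset: int     # broker's head offset for the partition at that moment
    timestamp: float        # when the commit happened (epoch seconds)

    @property
    def lag(self) -> int:
        return max(self.log_end_offset - self.committed_offset, 0)

def evaluate(window: list[CommitSample]) -> str:
    """Classify one partition from its most recent offset commits, oldest first."""
    if len(window) < 2:
        return "OK"                  # not enough history to judge yet
    lags = [s.lag for s in window]
    offsets = [s.committed_offset for s in window]

    if lags[-1] == 0:
        return "OK"                  # fully caught up right now
    if offsets[0] == offsets[-1]:
        return "ERROR"               # commits stopped advancing while lag remains: stalled or dead
    if all(b >= a for a, b in zip(lags, lags[1:])) and lags[-1] > lags[0]:
        return "WARNING"             # lag grew across the whole window: falling behind
    return "OK"                      # lag exists but is shrinking or bouncing: catching up

Feed it the last ten samples for a partition and you get the same three answers described above, driven by the trend instead of a magic lag number.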

Multi-Cluster Support (Because You Have Dev/Stage/Prod)

One Burrow instance can monitor multiple Kafka clusters, which is actually useful instead of just marketing bullshit:

Separate Cluster Monitoring: Each cluster gets its own consumer tracking and evaluation. Your dev cluster's fucked-up consumers won't affect production monitoring. Burrow keeps the data separate so cross-cluster noise isn't an issue.

Same API for Everything: /v3/kafka/{cluster} endpoints work the same across all clusters. Your Prometheus monitoring scripts or Grafana dashboards don't need to know which cluster they're hitting - just change the cluster name in the URL.

Scales Pretty Well: LinkedIn runs this on thousands of consumer groups across multiple data centers. A single Burrow instance handles way more than you probably need, unless you're Netflix-scale with petabytes of data or Uber-level real-time processing. Our instance monitors like 200+ consumer groups across dev/stage/prod with zero performance issues since we set it up.

HTTP API (Simple and Actually Works)

The API is straightforward - no GraphQL bullshit, no complex authentication schemes, just HTTP GET requests:

GET /v3/kafka                                    # List clusters
GET /v3/kafka/{cluster}/consumer                 # List consumer groups
GET /v3/kafka/{cluster}/consumer/{group}/status  # The money shot - consumer status
GET /v3/kafka/{cluster}/topic                   # List topics

The /status endpoint returns JSON with the important bits: overall status (OK/WARNING/ERROR), per-partition details, and the lag trends that drove the decision. Perfect for integrating with Prometheus, Grafana, or whatever monitoring stack you already have.

Alerting (Email and HTTP Webhooks)

Burrow can push alerts instead of just providing an API:

Email: Basic SMTP alerts. Honestly, most people skip this and just poll the API from their existing monitoring system instead of dealing with email routing.

HTTP Notifications: POSTs JSON to whatever endpoint you want. Useful for sending alerts to PagerDuty, Slack webhooks, or your custom alerting system. Includes configurable templates and retry logic.
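
If you do go the HTTP notifier route, the receiving end can be tiny. A minimal sketch of a webhook receiver using Python's standard library that forwards a one-line summary to a Slack incoming webhook; the Slack URL and port are placeholders, and the "group"/"status" fields depend entirely on the notification template you configure in Burrow, so treat them as assumptions:

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"   # placeholder URL

class BurrowAlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        # Field names here depend on the template configured in Burrow's HTTP notifier.
        text = f"Burrow alert: {payload.get('group', 'unknown group')} is {payload.get('status', 'unknown')}"
        requests.post(SLACK_WEBHOOK, json={"text": text}, timeout=5)
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 9000), BurrowAlertHandler).serve_forever()

Point the HTTP notifier config at this address and it POSTs whenever a group trips whatever threshold you've configured.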

Most deployments just use the HTTP API and skip the built-in notifications - easier to integrate with whatever alerting system you already have than to configure another notification pathway.

Deployment Reality Check

Resources: It's Go, so it's not a memory hog like JVM apps. Typical production instance uses like 500MB RAM and minimal CPU. Scales pretty well with number of consumer groups you're monitoring.

High Availability: No built-in clustering, so you run multiple instances behind a load balancer. Since it's stateless (just reads from Kafka), this works fine. Most people run 2-3 instances across different AZs.

Configuration: TOML files for config. You'll need to restart when changing config, but the state is just in-memory sliding windows, so restart is fast.

Docker: There's an official Docker image and docker-compose setup that actually works. The compose includes Kafka and Zookeeper for testing, which is handy.

Real Gotcha: Burrow needs read access to __consumer_offsets. If your Kafka cluster has ACLs, make sure Burrow can read that topic, or you'll get the cryptic error TOPIC_AUTHORIZATION_FAILED and mysteriously see "no consumer groups found" in the API. I spent like 4 hours debugging this once before realizing the security team had locked down internal topics. That was a fun afternoon.

The __consumer_offsets Deep Dive: This internal topic is where Kafka stores consumer group metadata and offset commits. By default it has 50 partitions (configurable with offsets.topic.num.partitions) and uses compacted cleanup policy to keep the latest offset for each consumer group + partition combination. Burrow essentially becomes another consumer of this topic, which is why it can see all consumer activity without any per-app configuration.

The Questions I Actually Get Asked (In Order of Frequency)

Q: "Burrow says ERROR but my consumer is processing messages fine"

A: This is the #1 confusion. Yeah, I've seen this shit constantly. It usually happens during consumer rebalancing - when you add/remove consumers or restart, Burrow gets confused for like 5-10 minutes because the offset commit patterns change suddenly. Just fucking wait it out. If it's still ERROR after 15 minutes or so, then you've got a real problem. I learned this the hard way after spending like 2 hours debugging a "broken" consumer that was actually fine. Now I set a timer before I panic.

Q: "Getting ECONNREFUSED when hitting Burrow API"

A: Second most common. Check the obvious shit first:

  • Is Burrow actually running? ps aux | grep burrow
  • Right port? Default is 8000, check your config
  • Firewall blocking it?
  • Wrong hostname in your script?

I've wasted entire afternoons troubleshooting connection issues that were just typos in port numbers. Recently spent like 3 hours debugging this only to find I was hitting port 8080 instead of 8000. Felt like an idiot.

Q: "High lag but Burrow says OK - is something broken?"

A: Nope, and this is exactly why I switched from threshold monitoring. Our ETL job runs daily and processes like 500,000 messages. Lag spikes to 400,000+ when it starts, then drops to 0 over a couple hours. My old Prometheus setup would fire alerts for the entire time. Burrow sees the lag decreasing and says "consumer's working fine, chill out."

Q: "Burrow shows no consumer groups but consumers are definitely running"

A: I debug this one monthly. Two things:

  1. Your consumers aren't committing offsets (auto-commit disabled maybe?)
  2. Burrow can't read __consumer_offsets due to ACLs

Hit #2 in production once - took down our monitoring for a weekend because someone "tightened security" and blocked Burrow's access. Fun times.

Q: "Can I use this with consumers that don't commit offsets?"

A: No, and why the fuck would you want to? Burrow reads __consumer_offsets, so no commits = invisible consumers. Fix your offset management instead of working around this. I've seen teams disable auto-commit thinking they're being clever, then wonder why monitoring doesn't work. Just use Kafka's offset management - it's battle-tested and you're not smarter than the Kafka devs.

Q: "Docker Compose crashes on startup every time"

A: Burrow tries to connect before Kafka is ready. I've hit this on every single Docker setup. Add a health check dependency to the Burrow service:

depends_on:
  kafka:
    condition: service_healthy

Note that condition: service_healthy only works if the kafka service defines a healthcheck. Without this, Burrow crashes with connection refused and Docker Compose just gives up. Took me way too long to figure this out.

Q: "What happens when I restart Burrow?"

A: Sliding window history disappears (it's all in memory), so status goes back to unknown for about 10 minutes. Your consumers keep running - Burrow just reads data, doesn't control anything. I restart our Burrow instance weekly during maintenance windows. Never affects actual processing.

Q: "How fast does it detect dead consumers?"

A: About 10-15 minutes with defaults. I tried tuning it faster once, set the window to 5 commits. Got false positives every damn time consumers restarted. The LinkedIn devs know what they're doing - don't fuck with the defaults.

These are the questions I get asked most often when teams are evaluating or setting up Burrow. If you're ready to dive deeper or need help with specific configurations, the resources section has everything I wish I'd found when I first started using this tool.

Related Tools & Recommendations

pricing
Recommended

Low-Code Platform Costs: What These Vendors Actually Charge

What low-code vendors don't want you to know about their pricing

Mendix
/pricing/low-code-platforms-tco-mendix-outsystems-appian/total-cost-ownership-analysis
100%
review
Recommended

I've Built 6 Apps With Bubble and I Have Regrets

Here's what actually happens when you use no-code for real projects

Bubble.io
/review/bubble-io/honest-evaluation
46%
news
Recommended

AI Stocks Finally Getting Reality-Checked - September 2, 2025

Turns out spending billions on AI magic pixie dust doesn't automatically print money

bubble
/news/2025-09-02/ai-stocks-bubble-concerns
46%
news
Recommended

OpenAI Will Burn Through $115 Billion by 2029 and Still Might Not Turn a Profit

Company just revised spending up by $80 billion while 95% of AI projects deliver zero ROI, raising serious bubble questions

Redis
/news/2025-09-11/openai-cash-burn-115b-ai-bubble
46%
compare
Recommended

Framer vs Webflow vs Figma Sites - Design to Development Workflow Comparison

Transform Your Design Process: From Prototype to Production Website

Framer
/compare/framer/webflow/figma/design-to-development-workflow
41%
tool
Recommended

Webflow Production Deployment - The Real Engineering Experience

Debug production issues, handle downtime, and deploy websites that actually work at scale

Webflow
/tool/webflow/production-deployment
41%
review
Recommended

Webflow Review - I Used This Overpriced Website Builder for 2 Years

The Truth About This Beautiful, Expensive, Complicated Platform That Everyone's Talking About

Webflow
/review/webflow-developer-handoff/user-experience-review
41%
tool
Recommended

OutSystems: Expensive Low-Code Platform That Actually Works

competes with OutSystems

OutSystems
/tool/outsystems/overview
41%
tool
Recommended

Mendix DevOps Deployment Automation Guide

Stop clicking through 47 deployment steps every Friday at 5 PM before your weekend gets destroyed

Mendix
/tool/mendix/devops-deployment-automation
41%
tool
Recommended

Mendix - Siemens' Low-Code Platform

Build apps fast (if you've got enterprise money)

Mendix
/tool/mendix/overview
41%
tool
Popular choice

jQuery - The Library That Won't Die

Explore jQuery's enduring legacy, its impact on web development, and the key changes in jQuery 4.0. Understand its relevance for new projects in 2025.

jQuery
/tool/jquery/overview
41%
troubleshoot
Similar content

Fix Your Broken Kafka Consumers

Stop pretending your "real-time" system isn't a disaster

Apache Kafka
/troubleshoot/kafka-consumer-lag-performance/consumer-lag-performance-troubleshooting
40%
tool
Popular choice

Hoppscotch - Open Source API Development Ecosystem

Fast API testing that won't crash every 20 minutes or eat half your RAM sending a GET request.

Hoppscotch
/tool/hoppscotch/overview
39%
tool
Popular choice

Stop Jira from Sucking: Performance Troubleshooting That Works

Frustrated with slow Jira Software? Learn step-by-step performance troubleshooting techniques to identify and fix common issues, optimize your instance, and boo

Jira Software
/tool/jira-software/performance-troubleshooting
38%
tool
Recommended

Appian - Enterprise Workflow Software That Actually Works (For a Price)

alternative to Appian

Appian
/tool/appian/overview
37%
tool
Popular choice

Northflank - Deploy Stuff Without Kubernetes Nightmares

Discover Northflank, the deployment platform designed to simplify app hosting and development. Learn how it streamlines deployments, avoids Kubernetes complexit

Northflank
/tool/northflank/overview
36%
tool
Popular choice

LM Studio MCP Integration - Connect Your Local AI to Real Tools

Turn your offline model into an actual assistant that can do shit

LM Studio
/tool/lm-studio/mcp-integration
34%
tool
Popular choice

CUDA Development Toolkit 13.0 - Still Breaking Builds Since 2007

NVIDIA's parallel programming platform that makes GPU computing possible but not painless

CUDA Development Toolkit
/tool/cuda/overview
32%
news
Popular choice

Taco Bell's AI Drive-Through Crashes on Day One

CTO: "AI Cannot Work Everywhere" (No Shit, Sherlock)

Samsung Galaxy Devices
/news/2025-08-31/taco-bell-ai-failures
31%
integration
Similar content

Rust, WebAssembly, JavaScript, and Python Polyglot Microservices

When you need Rust's speed, Python's ML stuff, JavaScript's async magic, and WebAssembly's universal deployment promises - and you hate yourself enough to run a

Rust
/integration/rust-webassembly-javascript-python/polyglot-microservices-architecture
30%

Recommendations combine user behavior, content similarity, research intelligence, and SEO optimization