
When Consumers Fall Behind

Consumer lag is that moment when you realize your streaming platform is actually broken as hell.

The worst part? The symptoms lie to you. I spent 6 hours one night chasing org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed errors that turned out to be database connection pool exhaustion. The real problem was three layers deep from what Kafka was telling me.

What Is Consumer Lag (Skip the Theory)

[Diagram: Kafka consumer groups]

Consumer lag = how far behind your consumer is. It's the difference between the latest message offset and where your consumer actually is. If producers are at offset 50,000 and your consumer just processed offset 47,000, you have 3,000 messages of lag.
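
If you want to see that number from inside the consumer itself, compare each partition's end offset against your current position. A minimal sketch with the standard Java client (assumes an already-subscribed KafkaConsumer named consumer, and at least one poll() so partitions are assigned):

// After at least one poll() so the group has assigned partitions to us
Map<TopicPartition, Long> endOffsets = consumer.endOffsets(consumer.assignment());
endOffsets.forEach((tp, end) -> {
    long position = consumer.position(tp);             // next offset this consumer will read
    System.out.printf("%s lag = %d%n", tp, end - position);
});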

Anything over a few hundred messages that keeps climbing usually means shit's going sideways.

Why Lag Happens (The Real Reasons)

Your Code Is Slow: 90% of the time it's your consumer logic being garbage. I've seen 50ms processing time turn into 2+ seconds because someone added a synchronous HTTP call to validate every message. Don't do that.

Producers Going Nuts: Black Friday hits and suddenly your producers are vomiting 10x normal traffic while your consumers are still configured for Tuesday afternoon load. Your consumers can't keep up.

Partition Assignment Problems: One consumer gets all the hot partitions while others sit idle. I've seen one partition drowning while others barely had any lag. That's messed up partition distribution.

Infrastructure Issues:

  • JVM garbage collection decides to pause for 30 seconds
  • Network hiccups make consumers think brokers died
  • Some genius put Kafka on spinning disks
  • Kubernetes kills your pod right as it's catching up

When Lag Becomes a Problem

Had a client where even a couple seconds of lag meant fraudulent transactions slipped through. The costs get brutal when fraud detection can't keep up.

Lag creates vicious cycles. Messages pile up faster than you can process them. Your consumers get overwhelmed, process even slower, lag gets worse. I've watched systems take down entire platforms this way.

KRaft (early access in Kafka 2.8, production-ready in 3.3) got rid of ZooKeeper and helped with controller stability, but if your consumer code sucks, you're still screwed. I've seen teams upgrade from Kafka 2.4 to 3.6 expecting miracles, only to discover their 5-second database calls were still killing performance. No amount of Kafka improvements will fix slow database calls or blocking I/O.

How to Actually Debug This

When lag hits production, your first instinct is to panic and start changing shit randomly. Don't. I've seen more outages caused by panicked "fixes" than by the original lag problem.

Here's the debugging workflow that's saved my ass multiple times.

Step 1: Figure Out What Normal Looks Like

First thing I do when troubleshooting lag is check what the metrics looked like before everything went to hell.

Key metrics that actually matter:

  • records-lag per partition (not just the total)
  • records-consumed-rate (is processing getting slower?)
  • fetch-rate (are consumers even polling?)

Pro tip: If you don't have baseline metrics, you're flying blind. Set up monitoring before you have problems, not after.
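
If nothing is wired up yet, you can at least dump those numbers from the client itself: the Java consumer already tracks them and exposes them through metrics(). A rough sketch (the metric names come from the consumer fetch manager group and can vary slightly by client version):

// Print the lag/throughput metrics the consumer client already tracks
consumer.metrics().forEach((name, metric) -> {
    String n = name.name();
    if (n.equals("records-lag-max") || n.equals("records-consumed-rate") || n.equals("fetch-rate")) {
        System.out.printf("%s / %s = %s%n", name.group(), n, metric.metricValue());
    }
});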

Step 2: Check Per-Partition Lag

Consumer lag totals lie. Always check per-partition metrics.

kafka-consumer-groups --bootstrap-server localhost:9092 \
  --describe --group your-consumer-group

I've seen one partition with massive lag while the rest are fine. That's not a consumer problem, that's a partition assignment problem.

If one partition is drowning, your key distribution is broken. All the hot data is landing on one partition while others sit idle.
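
If you'd rather script this check than eyeball CLI output, the Java AdminClient can pull the same per-partition numbers. A sketch assuming Kafka clients 2.5+ (the group name and bootstrap address are placeholders):

Properties props = new Properties();
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

try (Admin admin = Admin.create(props)) {
    // Where the group has committed, per partition
    Map<TopicPartition, OffsetAndMetadata> committed =
        admin.listConsumerGroupOffsets("your-consumer-group")
             .partitionsToOffsetAndMetadata().get();

    // Latest offset for each of those partitions
    Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
        .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
    Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
        admin.listOffsets(latestSpec).all().get();

    committed.forEach((tp, meta) ->
        System.out.printf("%s lag = %d%n", tp, latest.get(tp).offset() - meta.offset()));
}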

Step 3: Is It Your Code or Traffic?

Simple test: Look at producer rate vs consumer rate.

If producer rate is normal but consumer rate dropped off a cliff, your code is slow. If producer rate spiked and consumer rate is steady, you're just overwhelmed.

Time each message in your consumer logic. I guarantee you'll find something stupid:

  • Database calls that block for 2 seconds (usually Connection timed out after 30000ms or HikariPool - Connection is not available)
  • HTTP requests without timeouts throwing java.net.SocketTimeoutException: Read timed out
  • JSON parsing that's somehow taking 500ms (looking at you, Jackson with 50MB payloads)
  • Logging that's writing to disk synchronously because someone set immediateFlush=true
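
A crude way to find the slow step: wrap the processing in a timer and log anything over a threshold. A sketch; processRecord, the logger, and the 200ms cutoff are placeholders for your own code:

ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
for (ConsumerRecord<String, String> record : records) {
    long start = System.nanoTime();
    processRecord(record);                                 // your actual logic
    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
    if (elapsedMs > 200) {                                 // arbitrary "this is too slow" threshold
        log.warn("slow record: partition={} offset={} took {}ms",
                 record.partition(), record.offset(), elapsedMs);
    }
}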

Step 4: Infrastructure Reality Check

[Diagram: Kafka cluster architecture]

Half the time "consumer lag" is actually infrastructure being garbage.

JVM GC is a lag killer. If you're seeing 30-second GC pauses, your consumers will time out and rebalance. Use G1GC and monitor the GC logs:

-XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35
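
To actually see the pauses, turn GC logging on as well. Unified logging syntax for Java 11+ (adjust the file path for your environment):

-Xlog:gc*:file=/var/log/app/gc.log:time,uptime,level,tags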

Check your resources. Kubernetes loves to CPU throttle containers at the worst possible moment. Make sure you actually have the CPU and memory you think you do.

Network issues. I spent 4 hours debugging "slow consumers" that were actually packet loss between availability zones. Check your network latency, especially if you're running cross-region.

The number of times I've seen "Kafka is slow" turn into "we put Kafka on spinning disks" is embarrassing. Use SSDs or your I/O will be garbage.

Fixes That Actually Work

Once you know what's broken, here's what actually works in production. I'm not going to waste your time with theoretical bullshit that falls apart under real load.

Config Changes That Don't Suck

Before you throw hardware at the problem, try these config tweaks. They fixed 70% of our lag issues:

## Pull more data per request
fetch.max.bytes=52428800
max.poll.records=500

## Don't wait forever for batches
fetch.max.wait.ms=100

## Give yourself breathing room for processing
max.poll.interval.ms=300000

Know what the defaults actually are before you start turning knobs. On recent clients fetch.max.bytes already defaults to 50MB and max.poll.records to 500, so the settings that usually move the needle are fetch.max.wait.ms (stop the broker sitting on a fetch waiting for more data) and max.poll.interval.ms (give slow batches room to finish before the group kicks you out).

Pro tip: Don't copy my config blindly. Test it. I've seen these same settings destroy performance on different workloads. Every setup is different and what works for one team can break another.
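
If your consumer config lives in code rather than a properties file, the same knobs look like this with the standard ConsumerConfig constants (values mirror the example above, not a recommendation):

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "your-consumer-group");
props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 52428800);        // 50MB per fetch request
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 500);            // records returned per poll()
props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 100);           // don't wait long for full batches
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300000);     // 5 minutes to process a batch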

Make Your Consumer Code Not Suck

Stop processing messages one by one. Batch that stuff:

// This will kill you
consumer.poll(Duration.ofMillis(100)).forEach(record -> {
    callDatabase(record);  // 50ms per message = RIP
});

// This won't
List<ConsumerRecord<String, String>> batch = new ArrayList<>();
consumer.poll(Duration.ofMillis(100)).forEach(batch::add);
processBatchInOneDbCall(batch);  // 200ms total
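
If you batch like this, commit only after the batch succeeds so a crash mid-batch re-delivers instead of silently dropping messages. A sketch assuming enable.auto.commit=false (processBatchInOneDbCall is the same placeholder as above):

ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
List<ConsumerRecord<String, String>> batch = new ArrayList<>();
records.forEach(batch::add);

if (!batch.isEmpty()) {
    processBatchInOneDbCall(batch);   // if this throws, nothing below runs
    consumer.commitSync();            // commit offsets only after the batch is safely stored
}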

I learned this the hard way when someone added database logging to our consumer. Went from 10ms per message to 2 seconds. Took down our entire pipeline for 4 hours because every consumer in the group started blowing past max.poll.interval.ms and getting kicked out of the group.

Parallelize I/O operations:

ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));

// ConsumerRecords is Iterable but has no stream(), so go through its spliterator
CompletableFuture<?>[] futures = StreamSupport.stream(records.spliterator(), false)
    .map(record -> CompletableFuture.runAsync(() ->
        processRecord(record), threadPool))
    .toArray(CompletableFuture[]::new);

CompletableFuture.allOf(futures).join();

This turns sequential death into parallel processing. Just don't go crazy with thread pools or you'll exhaust your database connections.
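
"Don't go crazy" in practice means a bounded pool sized to what your downstream can handle. A sketch; the pool size of 8 is a placeholder, match it to your database connection pool:

// Bounded pool: a burst of records can't exhaust downstream connections
ExecutorService threadPool = Executors.newFixedThreadPool(8);   // <= your DB pool size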

When Everything Is on Fire

Quick fixes for production disasters:

  1. Add more consumers - Scale horizontally first, optimize later
  2. Temporarily skip expensive processing - Comment out non-critical logic until lag normalizes
  3. Increase container resources - Double CPU and memory, figure out efficiency tomorrow
  4. Nuclear option - Reset offsets and lose data:
kafka-consumer-groups --bootstrap-server localhost:9092 \
  --group your-group \
  --reset-offsets --to-latest \
  --topic your-topic --execute

Only use offset reset if losing data is better than staying down, and remember the group has to be stopped first; the tool refuses to reset offsets while consumers are still connected. I've had to make this choice at 3am more times than I'd like. Sometimes the business would rather lose some data than have systems offline for hours.

Infrastructure That Actually Works

Don't run Kafka on garbage hardware.

  • Use SSDs or your disk I/O will be garbage
  • Give consumers actual CPU and memory, not Kubernetes default limits
  • Keep consumers close to brokers - cross-region adds 200ms+ per poll

Monitor GC pauses. If your JVM is pausing for 30+ seconds, consumers will time out and trigger rebalances. Java 8 defaults are especially brutal:

## These flags work on Java 11+
-XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35

## If you're stuck on Java 8, you're fucked but try:
-XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled

This saved us from weekly rebalancing disasters. The default JVM settings are optimized for 2003, and if you're still running Java 8 in production, you have bigger problems than consumer lag.

Bottom line: Most consumer lag isn't a Kafka problem—it's an infrastructure or code problem that shows up in Kafka metrics. Fix your foundation first, then optimize the streaming layer.

Questions I Get Asked All The Damn Time

Q

How much lag is too much lag?

A

Depends on your use case, but here's reality:

  • Fraud detection? Anything over 100ms and money walks out the door
  • Analytics? Minutes are fine, nobody cares if dashboards are a bit stale
  • Real-time recommendations? Sub-second or customers bounce

Don't obsess over absolute numbers. Consistent lag is fine, growing lag means you're screwed.

Q

Why does lag randomly spike when traffic is normal?

A

Usually infrastructure being garbage:

  • JVM decides to pause for 30 seconds (classic)
  • Network hiccups that last just long enough to screw you
  • Kubernetes kills your pod at the worst possible moment
  • Some background process saturates your disk

Check your infra metrics when lag spikes. It's rarely actually your application code.

Q

Will newer Kafka versions fix my problems?

A

KRaft (2.8+, production-ready in 3.3) and the new consumer rebalance protocol (KIP-848, previewed in 3.7) help with controller stability and rebalancing overhead, but if your code sucks, you're still screwed. I've seen teams spend weeks upgrading from 2.4 to 3.6 expecting performance miracles, only to find their 10-second database calls were still destroying throughput. New Kafka versions won't magically make your synchronous HTTP calls faster.

Q

Should I add more partitions when lag increases?

A

Only if you need more parallelism. More partitions = more complexity.

Start with 6 partitions per topic. If you have 10 consumers fighting over 2 partitions, add partitions. If your processing is just slow, more partitions won't help.

Don't go over 1000 partitions per topic unless you hate yourself.

Q

How do I debug lag when multiple consumer groups use the same topic?

A

Check lag per consumer group, not per topic:

kafka-consumer-groups --bootstrap-server localhost:9092 --describe --all-groups

If all groups are lagging equally, your producers are flooding the topic. If one group is way behind, that group has problems (slow processing, resource starvation, etc.).

Q

My lag exploded to crazy numbers. What now?

A

Triage mode:

  1. Scale out - add more consumer instances immediately
  2. Cut the fat - remove expensive processing temporarily
  3. Throw hardware at it - double CPU/memory limits
  4. Last resort - reset offsets and lose data

I've had to reset offsets at 3am more times than I want to admit. Sometimes losing data is better than staying down.

Q

Why does GC cause lag spikes?

A

If your JVM pauses longer than session.timeout.ms (10 seconds by default on older clients, 45 seconds since Kafka 3.0), the broker decides your consumer died and triggers a rebalance. Every consumer stops until rebalancing finishes.

Fix: Use G1GC and tune it properly (Java 11+ required for decent G1):

## Java 11+
-XX:+UseG1GC -XX:MaxGCPauseMillis=20

## Java 8 (if you're stuck in hell)
-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70

Q

How do I deploy new consumer versions without breaking everything?

A

Blue-green deployment. Run old and new consumers side by side, then gradually kill the old ones. Or just deploy during low traffic and hope for the best.

Pro tip: Use static group membership so rebalancing doesn't happen every time you restart a consumer.
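
Static membership is one config on the consumer: give each instance a stable group.instance.id and raise session.timeout.ms so a quick restart finishes before the broker declares the member dead. Available since Kafka 2.3; the id scheme here is a placeholder:

props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, "orders-consumer-" + System.getenv("POD_NAME"));
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 60000);   // restart must complete within this window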

Q

Can one slow consumer kill the entire group?

A

Absolutely. A slow consumer drags down every partition it owns, and if it keeps missing its poll deadline it triggers rebalances that stall the whole group. I've seen one consumer with a fucked database connection slow down 20 others.

Monitor per-consumer metrics and kill slow instances. Use health checks. Don't let one bad consumer ruin everyone's day.

Q

What happens if lag exceeds retention time?

A

You lose data permanently. Messages get deleted before consumers process them. Consumers that ask for the missing offsets either throw OffsetOutOfRangeException (with auto.offset.reset=none) or silently jump to the earliest or latest available offset, depending on auto.offset.reset.

Set retention way higher than your worst-case recovery time. I learned this the hard way when a weekend outage ate 3 days of transaction data because retention was set to 168 hours and we took 4 days to fix the issue.

Q

The real lesson about consumer lag?

A

Most lag problems aren't actually Kafka problems. They're symptoms of deeper issues: slow databases, network problems, garbage collection pauses, or just bad code. Kafka is usually the messenger getting shot for delivering bad news about your infrastructure.

Debugging Kafka Consumer Issues: How to Ensure Your Consumer Receives Messages by vlogize

This 15-minute video actually shows debugging consumer issues in a realistic scenario, not just theory. Key topics covered:

  • 0:00 - Setting up monitoring for consumer groups
  • 3:30 - Identifying when consumers stop receiving messages
  • 7:15 - Common configuration issues that cause lag
  • 11:45 - Real troubleshooting workflow for production incidents

Why this video helps: Shows actual debugging techniques instead of just explaining concepts. The presenter walks through real error scenarios and demonstrates the diagnostic commands you'll actually use when things break. Most Kafka tutorials are garbage that skip the real-world pain like partition rebalancing during deployments or debugging why lag suddenly spikes. This one actually shows you how to read consumer group output and what those JMX metrics mean when you're troubleshooting.

📺 Watch on YouTube
