What Kafka Actually Is (And Why It'll Probably Break Your Production)

Kafka Architecture Overview

Basically, Kafka is a distributed log that LinkedIn built to handle their massive data firehose. They open-sourced it in 2011 because even they couldn't handle maintaining it alone. Kafka 4.0 dropped in March 2025 and finally killed ZooKeeper - thank fucking god, because ZooKeeper was a nightmare to debug.

The new KRaft mode (Kafka Raft) eliminates the ZooKeeper dependency that's been causing split-brain scenarios for over a decade. Plus there's a next-gen consumer rebalance protocol (KIP-848) that supposedly fixes the "stop-the-world" rebalances that have ruined our weekends.

Here's the thing about Kafka: it's incredibly fast and can handle ridiculous amounts of data, but the operational complexity will make you question your life choices. I've seen senior engineers with 10+ years of experience spend weeks trying to figure out why consumer groups are rebalancing randomly.

How This Thing Actually Works

Brokers are just servers that store your data. You need at least 3 in production (learned this the hard way when our 2-broker cluster ate shit and lost a day's worth of events). Each broker can theoretically handle thousands of partition reads/writes per second, but good luck achieving that with your network setup. Read the broker configuration guide to understand the dozens of settings you'll need to tune.
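If you want to bake that lesson into your topics, here's a minimal sketch using the Java AdminClient - the broker addresses and topic name are made up, so adjust for your cluster:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Hypothetical broker list - use your own bootstrap servers.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
                "broker1:9092,broker2:9092,broker3:9092");

        try (Admin admin = Admin.create(props)) {
            // 6 partitions, replication factor 3: survives one broker dying.
            NewTopic orders = new NewTopic("orders", 6, (short) 3)
                // With min.insync.replicas=2 and a producer using acks=all,
                // a write isn't acknowledged until 2 replicas have it.
                .configs(Map.of("min.insync.replicas", "2"));
            admin.createTopics(List.of(orders)).all().get();
        }
    }
}
```

That min.insync.replicas=2 line is what would have saved our 2-broker cluster: losing one broker degrades writes instead of silently eating them.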

Topics are where you dump your data. Think of them as really big logs that never get deleted (until retention kicks in). The catch? You can't just throw data at a topic - you need to think about partitioning strategy or you'll hate yourself later.

Kafka Topic Partitions

Partitions are how Kafka scales, and they're also how it'll fuck you over. More partitions = more parallelism, but also more complexity. I've seen clusters with thousands of partitions become unmanageable during rebalancing. One team I worked with had 500+ partitions per topic and spent 3 days debugging why consumers were taking 10 minutes to rebalance. Check out this partition sizing guide to avoid making the same mistakes.
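Partition assignment is driven by the record key: the default partitioner hashes the key, so records with the same key always land on the same partition and stay ordered. A minimal sketch - the topic, userId, and payload are hypothetical:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedSendExample {
    // Same key -> same partition -> per-user ordering is preserved.
    static void sendOrder(KafkaProducer<String, String> producer,
                          String userId, String orderJson) {
        producer.send(new ProducerRecord<>("orders", userId, orderJson));
    }

    // A null key gets sticky/round-robin assignment: better balance across
    // partitions, but no ordering guarantee between related events.
    static void sendUnkeyed(KafkaProducer<String, String> producer, String orderJson) {
        producer.send(new ProducerRecord<>("orders", null, orderJson));
    }
}
```

Pick your key badly (say, a boolean flag) and you get two hot partitions and a hundred idle ones. That's the "hate yourself later" part.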

Producers send data to Kafka. Sounds simple until you realize you need to configure acks, retries, idempotency, compression, batching, and a dozen other settings. The old defaults (acks=1, idempotence off) prioritized throughput over reliability; Kafka 3.0 flipped them to acks=all with idempotence enabled, but you still have to tune batching, timeouts, and compression for your workload. Here's a producer tuning guide that will save you weeks of debugging.
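For illustration, here's roughly what a reliability-leaning producer setup looks like with the Java client. The numbers are starting points to tune, not recommendations:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class ReliableProducerConfig {
    static KafkaProducer<String, String> build() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // hypothetical
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        props.put(ProducerConfig.ACKS_CONFIG, "all");                   // wait for all in-sync replicas
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");    // retries won't duplicate messages
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "120000"); // total budget, retries included
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");       // cheap CPU, big network savings
        props.put(ProducerConfig.LINGER_MS_CONFIG, "20");               // trade a little latency for batching
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, "65536");           // 64 KB batches

        return new KafkaProducer<>(props);
    }
}
```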

Consumers read data from Kafka. This is where the fun begins. Consumer groups, offset management, rebalancing, lag monitoring - it's a full-time job. Our monitoring alerts go crazy every time a consumer restarts because the rebalancing triggers a cascade of false alarms.
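Here's the shape of a basic consumer with manual offset commits - the group name, topic, and broker address are hypothetical, but this is the loop you'll be staring at during incidents:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class OrdersConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // hypothetical
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "billing-service");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit only after processing

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    handle(record); // keep this fast, or max.poll.interval.ms will evict you
                }
                consumer.commitSync(); // at-least-once: offsets move only after processing succeeds
            }
        }
    }

    static void handle(ConsumerRecord<String, String> record) {
        System.out.printf("%s-%d@%d: %s%n",
                record.topic(), record.partition(), record.offset(), record.value());
    }
}
```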

The Reality Check

Yeah, benchmarks show Kafka hitting 605 MB/s peak throughput with 5ms p99 latency at 200 MB/s load, but those tests run on perfect lab setups with infinite money and no network issues. In the real world, with shitty networks and misconfigured brokers, expect a fraction of those numbers. Still faster than everything else, but not magic.

The "sub-millisecond latency" marketing bullshit? That performance requires perfect network conditions and unlimited budget. In practice, expect 5-50ms latency and be happy if you get it consistently.

Message Broker Comparison

Real Talk: Unless you're processing terabytes per day, Kafka is probably overkill. I've seen too many teams adopt Kafka for a simple pub/sub use case and then spend 6 months learning how to operate it. Redis Streams or even RabbitMQ might be what you actually need. Check out this comprehensive comparison to understand the architectural differences.

Kafka vs Everything Else (Spoiler: You Probably Want Something Simpler)

| Feature | Apache Kafka | Apache Pulsar | RabbitMQ | Amazon Kinesis | Redis Streams |
|---|---|---|---|---|---|
| Operational Complexity | Nuclear physics level | PhD required | Works when you screw up | AWS handles it | Actually simple |
| Throughput | 15x faster than RabbitMQ (lab conditions) | Decent | 4K-10K msgs/sec | 1 MB/sec per shard | Fast enough |
| Real-world Latency | 5-50ms (not the marketing BS) | 10-100ms | <10ms | ~200ms | <5ms |
| Learning Curve | 6+ months to competency | 3-6 months | 2 weeks | 1 day | 1 hour |
| Team Size Needed | 3+ dedicated engineers | 2+ engineers | 1 part-time | 0 (AWS problem) | 0.5 engineer |
| When it breaks | Good luck debugging | At least has docs | Clear error messages | Call AWS | Restart Redis |
| Monthly Cost (medium scale) | $5K+ (self-hosted), $1,150+ managed | $3K+ | $500 | $2K+ MSK, $385+ Confluent | $200 |
| Honest Use Cases | TB/day data streams | Multi-tenant SaaS | Normal messaging | AWS-locked apps | Simple pub/sub |
| Should you use it? | Only if you absolutely need it | If Kafka is too complex | For most use cases | If you're all-in on AWS | Try this first |

The Production Reality: Advanced Features That'll Break Your Weekend

Kafka 4.0: Finally Fixed ZooKeeper (About Time)

Kafka 4.0 finally killed ZooKeeper in March 2025. Thank fucking god. If you've ever tried debugging a ZooKeeper split-brain situation at 2 AM, you know why this matters. KRaft handles cluster metadata now, and while it's not perfect, at least you don't need to become a distributed systems expert just to understand your cluster state.

The new consumer group protocol (KIP-848) supposedly fixes rebalancing performance and is now GA in 4.0. I'm cautiously optimistic because rebalancing has been the bane of every Kafka operator's existence. Our team spent 2 weeks debugging why a simple consumer restart was taking 10 minutes to rebalance - turns out we had too many partitions and the old protocol was garbage at handling it. Read more about consumer group rebalancing issues that plagued earlier versions.
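If you want to try it, opting a consumer into the new protocol is a single client config. This is a sketch assuming a 4.0+ client and a cluster with the feature enabled broker-side:

```java
// Assumption: Kafka 4.0+ client and brokers with KIP-848 enabled.
// "classic" is the old eager/cooperative path; "consumer" opts this group
// into broker-coordinated, incremental rebalances.
// Added to the consumer Properties from the earlier sketch.
props.put("group.protocol", "consumer");
```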

Queues for Kafka (KIP-932) adds point-to-point messaging as "share groups" - early access in 4.0. Cool feature, but honestly, if you need queues, just use RabbitMQ. Adding queue semantics to a distributed log feels like feature creep.

Kafka 4.0 also requires Java 11+ for clients and Java 17+ for brokers/tools. If you're still on Java 8, this upgrade will be a nightmare.

Stream Processing: Where Good Engineers Go to Die

Streaming Architecture

Kafka Streams is powerful but will consume your life. I've seen teams spend months trying to get exactly-once semantics working correctly. The library is great in theory - just JAR files you can deploy anywhere. In practice, you'll be deep-diving into state stores, changelog topics, and reprocessing strategies.
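For a sense of scale, here's about the smallest useful Streams topology - a five-minute windowed count per key. Topic names and brokers are hypothetical, and note that even this little thing silently creates a state store plus a changelog topic behind your back:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.TimeWindows;

import java.time.Duration;
import java.util.Properties;

public class PageViewCounts {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "page-view-counts"); // also names internal topics
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");  // hypothetical

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("page-views", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey() // requires keyed input; the count lives in a local state store
               .windowedBy(TimeWindows.ofSizeAndGrace(Duration.ofMinutes(5), Duration.ofMinutes(1)))
               .count()
               .toStream()
               .foreach((windowedKey, count) -> System.out.printf("%s @ %s -> %d%n",
                       windowedKey.key(), windowedKey.window().startTime(), count));

        new KafkaStreams(builder.build(), props).start();
    }
}
```

That changelog topic is what makes restarts survivable - and what quietly fills your disks if you forget it exists.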

Windowing and joins sound simple until you realize that late-arriving events can fuck up your aggregations. One team I worked with had to implement custom timestamp extractors because their upstream service occasionally sent events with clock skew.
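That clock-skew workaround looks roughly like this: a custom TimestampExtractor that refuses timestamps from the future. The OrderEvent type is a hypothetical stand-in for whatever your payloads deserialize to:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class SkewGuardTimestampExtractor implements TimestampExtractor {

    // Hypothetical payload type carrying an embedded event time.
    public interface OrderEvent {
        long eventTimeMs();
    }

    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        if (record.value() instanceof OrderEvent) {
            long eventTime = ((OrderEvent) record.value()).eventTimeMs();
            // Reject timestamps from the future (upstream clock skew) so one
            // skewed producer can't drag your window boundaries around.
            if (eventTime > 0 && eventTime <= System.currentTimeMillis()) {
                return eventTime;
            }
        }
        return partitionTime; // fall back to the highest timestamp seen on this partition
    }
}
```

Wire it in with props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, SkewGuardTimestampExtractor.class) and test it against your actual skew, not the skew you hope you have.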

ksqlDB lets you write SQL for stream processing, which sounds amazing until you hit its limitations. Complex joins become a nightmare, and debugging failed queries requires understanding the underlying Kafka Streams topology. It's better than writing raw Kafka Streams code, but don't expect SQL magic to solve distributed stream processing complexity.

Enterprise Deployments: The Scale You'll Never Reach

Enterprise Scale

Yeah, Netflix processes trillions of events and Uber tracks millions of rides in real-time. Know what they also have? Teams of 50+ engineers dedicated to Kafka operations, millions in infrastructure budget, and custom tooling built over years.

LinkedIn handles 7 trillion messages per day because they literally invented Kafka and have been operating it for over a decade. They also have specialized teams for Kafka development, operations, and tooling. Check out how Netflix built their real-time recommendations using Kafka at massive scale.

The lesson? These companies didn't start with Kafka at this scale - they grew into it. If you're processing 1GB per day and thinking about Kafka because "it scales like Netflix," you're solving the wrong problem. Read about Uber's event-driven architecture to understand how they actually use Kafka in production.

Performance Tuning: Welcome to JVM Hell

Partition Strategy will make or break your deployment. Too few partitions = throughput bottleneck. Too many partitions = rebalancing nightmare and increased memory overhead. I've seen clusters become unusable because someone thought "more partitions = more better" and created 1000 partitions for a topic that needed 10.
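Before you create anything, run the usual sizing heuristic (this is Confluent's rule of thumb, not gospel): enough partitions to hit your target throughput on the slower of the produce and consume paths, and no more. A toy version with made-up numbers:

```java
public class PartitionSizing {
    // Heuristic: partitions = max(target / per-partition produce rate,
    //                             target / per-partition consume rate).
    // Measure the per-partition rates on YOUR hardware; these arguments are illustrative.
    static int suggestedPartitions(double targetMBps,
                                   double producerMBpsPerPartition,
                                   double consumerMBpsPerPartition) {
        int forProduce = (int) Math.ceil(targetMBps / producerMBpsPerPartition);
        int forConsume = (int) Math.ceil(targetMBps / consumerMBpsPerPartition);
        return Math.max(forProduce, forConsume);
    }

    public static void main(String[] args) {
        // 100 MB/s target, 10 MB/s produce and 20 MB/s consume per partition:
        System.out.println(suggestedPartitions(100, 10, 20)); // prints 10 - not 1000
    }
}
```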

JVM Tuning is mandatory for production Kafka. Default heap settings will cause GC pauses that trigger false broker failures. You'll spend weeks learning G1GC settings, heap sizing, and off-heap memory management. Budget at least a month for someone to become competent at Kafka JVM tuning. Here's a comprehensive JVM tuning guide that will save you weeks of trial and error.

Hardware Requirements are not optional. That old spinning disk server? It'll become a bottleneck immediately. NVMe SSDs are mandatory, 32GB+ RAM is standard, and network bandwidth needs to handle replication traffic on top of client traffic. Check the official hardware recommendations before you buy anything.

How Kafka Will Ruin Your Life

Monitoring Complexity

Rebalancing: Our monitoring system triggers 50+ alerts every time a consumer restarts because rebalancing cascades through the entire consumer group. Normal operations look like disasters in the monitoring dashboard.

Consumer Lag: Became our most-watched metric because it's the only reliable indicator that something's wrong. But lag can spike for dozens of reasons: slow downstream services, garbage collection pauses, network hiccups, or cosmic rays.
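The lag number itself is cheap to compute - committed offset versus log end offset, per partition. A sketch with the Java Admin API, using a hypothetical group name and broker address:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class LagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // hypothetical

        try (Admin admin = Admin.create(props)) {
            // Where the group last committed, per partition.
            Map<TopicPartition, OffsetAndMetadata> committed = admin
                .listConsumerGroupOffsets("billing-service")
                .partitionsToOffsetAndMetadata().get();

            // Current log end offsets for the same partitions.
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest = admin
                .listOffsets(committed.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest())))
                .all().get();

            // Lag = log end offset - committed offset.
            committed.forEach((tp, meta) -> System.out.printf("%s lag=%d%n",
                tp, latest.get(tp).offset() - meta.offset()));
        }
    }
}
```

Knowing the number is easy; knowing which of the dozen causes produced it is the part that eats your afternoon.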

Operational Complexity: We have dedicated Kafka runbooks, on-call rotations, and specialized monitoring dashboards. It's not software - it's infrastructure that requires constant attention.

The brutal truth? Kafka is incredible at massive scale, but most companies would be better served by managed services or simpler solutions. If you can't dedicate at least 2 full-time engineers to Kafka operations, you'll spend more time debugging it than building features.

Questions Nobody Wants to Answer (But You'll Ask Anyway)

Q: Why is Kafka so fucking hard to operate?

A: Because it's a distributed system designed for massive scale, not your 10 GB/day use case. Every component (brokers, ZooKeeper/KRaft, producers, consumers) can fail independently, and the interactions between them are complex. LinkedIn built it for their scale and open-sourced it, but didn't make it easy for mere mortals to operate.

Q: My consumer group is stuck in rebalancing hell. Help?

A: Been there.

  1. Check if you have too many partitions - we had 500+ partitions per topic and rebalancing took 10+ minutes.
  2. Look at your session.timeout.ms and heartbeat.interval.ms settings (see the sketch after this list).
  3. One slow consumer can fuck up the entire group. Find the slow one and fix it or remove it from the group.
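A hedged starting point for those settings - the values below are illustrative guesses to tune against your actual processing times, not recommendations:

```java
// Added to your consumer Properties (see the consumer sketch earlier).
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "45000");    // how long a silent consumer stays "alive"
props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "15000"); // keep at ~1/3 of the session timeout
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "300000"); // max gap between poll() calls before eviction
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "200");        // smaller batches -> faster poll loops
```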
Q: How many engineers do I actually need to run Kafka in production?

A: At least 2 full-time if you want to sleep at night. One for primary operations, one for backup/vacation coverage. Netflix has 50+ people working on Kafka. LinkedIn probably has 100+. If you have 1 part-time person managing Kafka, expect outages and burned weekends.

Q: Should I use exactly-once semantics?

A: Probably not. It's complex, slower, and most use cases can handle at-least-once with idempotent consumers. I've seen teams spend months debugging exactly-once issues. Unless you're processing financial transactions, design your consumers to be idempotent and save yourself the headache.
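"Idempotent consumer" in practice means dedupe on a stable ID before applying side effects. Everything named here (alreadyProcessed, markProcessed, applyBusinessLogic) is hypothetical glue you'd back with your own database:

```java
// Sketch only: at-least-once delivery means duplicates WILL arrive.
// topic-partition-offset is a dedupe key that's stable across redeliveries.
for (ConsumerRecord<String, String> record : records) {
    String eventId = record.topic() + "-" + record.partition() + "-" + record.offset();
    if (alreadyProcessed(eventId)) {
        continue; // duplicate delivery - skip it, don't double-charge anyone
    }
    applyBusinessLogic(record);
    markProcessed(eventId); // ideally in the same DB transaction as the side effect
}
```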

Q: My Kafka cluster randomly becomes unavailable. What's wrong?

A: Could be anything: JVM garbage collection pauses, network partitions, disk I/O spikes, under-replicated partitions, or some consumer group causing a cascade failure. Start with monitoring JVM metrics, broker resource usage, and under-replicated partition counts. Budget weeks for root cause analysis.

Q: How many partitions should I actually create?

A: Start with 6 partitions per topic, not 100. More partitions = more operational complexity. We had a topic with 1000 partitions that became unmaintainable during rebalancing. Increase partition count when you actually hit throughput limits, not preemptively.

Q: Can I just restart Kafka when things go wrong?

A: Restarting Kafka brokers is like performing surgery: possible, but it requires planning. A restart can trigger rebalancing across all consumer groups, fail over leadership for thousands of partitions, and potentially cause data loss if not done correctly. Have runbooks and test your restart procedures.
Q: Why does consumer lag spike randomly?

A: Because everything can cause consumer lag: slow downstream databases, garbage collection pauses, network hiccups, consumer group rebalancing, broker failovers, or just cosmic rays. We monitor lag obsessively because it's the canary in the coal mine for system health.

Q: Is managed Kafka worth the money?

A: Yes. Confluent Cloud ($385/month for Standard, $1,150/month for Enterprise) and AWS MSK ($0.21/hour per broker plus storage and data transfer) cost 3-5x more than self-hosting, but they handle the operational nightmare for you. Unless you have dedicated Kafka engineers, managed services are cheaper than the opportunity cost of your team fighting Kafka instead of building features.

Q: Can I use Kafka for my small microservices project?

A: No. Use Redis Streams or RabbitMQ. Kafka is overkill for 99% of use cases. If you're processing less than 1TB per day, you don't need Kafka's complexity.

Q: What happens when I need to scale up quickly?

A: Adding brokers requires partition reassignment, which can take hours and impacts cluster performance. Scaling consumer groups requires careful partition rebalancing. Unlike stateless services that scale in minutes, Kafka scaling is measured in hours or days.

Q: How do I debug Kafka performance issues?

A:
  1. Start with JVM metrics (GC pauses, heap usage).
  2. Then broker metrics (CPU, disk I/O, network).
  3. Then application metrics (producer/consumer latency, batch sizes).

Kafka performance debugging is like database performance tuning - it requires deep system knowledge and takes weeks to master.

Q: Should I upgrade to Kafka 4.0 immediately?

A: Fuck no. Let other companies be the guinea pigs. Major Kafka upgrades break things in unexpected ways. Wait 6 months, read the war stories on Reddit and Stack Overflow, then plan your upgrade with extensive testing. We're still running 3.x because it works and upgrading isn't worth the risk.

Q: What about the new Java requirements in Kafka 4.0?

A: Java 11+ for clients and Java 17+ for brokers is mandatory in 4.0. If you're still on Java 8, this upgrade becomes a massive project involving your entire JVM ecosystem. Budget months for testing application compatibility and performance regressions.
