What Actually Is Apache Pulsar?

Pulsar is what happens when you decide Kafka's architecture is fundamentally broken and rebuild everything from scratch. Yahoo created it around 2013 because they needed something that could scale beyond Kafka's limitations. The current version is 4.1.0 (released September 8, 2025), and it actually works - if you can handle the operational complexity.

The Architecture That Makes Your Life Complicated

Pulsar's layered architecture separates compute from storage, which is brilliant in theory. In practice, you're now running two distributed systems instead of one.

Pulsar Architecture Diagram

You've got four moving parts to worry about:

  • Pulsar Brokers: These handle routing but don't store data. When they crash, your topics just migrate to other brokers. Sounds nice until you realize that the migration can take 30+ seconds and your clients start timing out.
  • Apache BookKeeper: This is where your data actually lives. When BookKeeper has issues, you're in for a long night. And trust me, you'll see errors like BKNotEnoughBookiesException at 3am and wonder why you didn't just use Kafka.
  • Apache ZooKeeper: Because every distributed system needs ZooKeeper to make your life miserable. It'll work fine until you hit ~500K topics, then ZK becomes your bottleneck and you'll be optimizing JVM heap sizes.
  • Pulsar Functions: Serverless stream processing that works great in demos, terrible for debugging in production. When a function fails, good luck figuring out which K8s pod it was running in.

BookKeeper and ZooKeeper Role in Pulsar

The Multi-Tenancy Promise

Pulsar's multi-tenancy is actually pretty good. You get tenant isolation without running separate clusters, which saves on ops overhead. But good luck debugging cross-tenant issues when they happen.
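
If you want a feel for what that isolation looks like in practice, here's a rough sketch using the Java admin client. The tenant, namespace, cluster names, and admin URL are all made up, and your auth setup will differ:

```java
import java.util.Set;
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.TenantInfo;

public class TenantSetup {
    public static void main(String[] args) throws Exception {
        // Hypothetical admin endpoint - point this at your broker's HTTP port
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://pulsar-broker.example.com:8080")
                .build();

        // One tenant per customer, restricted to specific clusters and admin roles
        admin.tenants().createTenant("acme-corp", TenantInfo.builder()
                .adminRoles(Set.of("acme-admin"))
                .allowedClusters(Set.of("us-east"))
                .build());

        // Namespaces under the tenant carry the actual policies (retention, quotas, ACLs)
        admin.namespaces().createNamespace("acme-corp/orders");

        admin.close();
    }
}
```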

Scale Reality Check

Can Pulsar handle millions of topics? Probably. Should you do that? Probably not unless you have a dedicated platform team. The theoretical limits are impressive, but the operational reality is that most people run into ZooKeeper bottlenecks long before they hit Pulsar's actual limits.

Yahoo runs this thing at massive scale, and it works for them. Whether it'll work for your use case depends entirely on whether you're willing to invest in understanding BookKeeper and ZooKeeper operational patterns.

The Features That Actually Matter (And Their Hidden Costs)

Geo-Replication: Great Until It Breaks

Pulsar's geo-replication works, but it's not magic. You're replicating entire message streams across regions, which means your network bills are going to hurt. When replication falls behind, you'll spend hours trying to figure out if it's a network issue, BookKeeper problem, or just the eventual consistency model working as designed.

I once had geo-replication silently fail for 6 hours because of a misconfigured security group rule. The cluster logs showed everything was "healthy", but messages just weren't making it across regions. Took down our disaster recovery capabilities without a single alert firing.
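
For reference, enabling replication is just a namespace policy. Here's a sketch with the Java admin client - the namespace and cluster names are placeholders, and both clusters have to be defined in the Pulsar instance before this works:

```java
import java.util.Set;
import org.apache.pulsar.client.admin.PulsarAdmin;

public class EnableGeoReplication {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://pulsar-broker.example.com:8080")
                .build();

        // Every topic in this namespace gets replicated to both clusters.
        // This is also where your cross-region bandwidth bill starts.
        admin.namespaces().setNamespaceReplicationClusters(
                "acme-corp/orders", Set.of("us-east", "eu-west"));

        admin.close();
    }
}
```

Setting the policy is the easy part. The hard part is the monitoring that tells you when the replication backlog is growing, which is exactly what was missing in the story above.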

Load Balancing That Actually Works

Pulsar Segmented Storage - Data Distributed Across Bookies

Unlike Kafka's manual partition rebalancing nightmare, Pulsar's load balancing is genuinely good. Brokers automatically split topic bundles when they get hot, and adding new nodes actually distributes load without requiring manual intervention. This is one of the few areas where Pulsar delivers on its promises.
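
If you want to sanity-check how that distribution actually looks, the Java admin client can show you bundle counts and which broker currently owns a topic. A rough sketch with placeholder names:

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.BundlesData;

public class BundleCheck {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://pulsar-broker.example.com:8080")
                .build();

        // How many bundles the namespace is currently split into
        BundlesData bundles = admin.namespaces().getBundles("acme-corp/orders");
        System.out.println("Bundle count: " + bundles.getNumBundles());

        // Which broker owns a given topic right now (this can change after a split)
        String owner = admin.lookups().lookupTopic(
                "persistent://acme-corp/orders/order-events");
        System.out.println("Owned by: " + owner);

        admin.close();
    }
}
```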

Tiered Storage: The Feature You'll Wish You'd Set Up Earlier

BookKeeper IO Isolation - Separate Journal and Ledger Storage

Tiered storage to S3/GCS is brilliant for cost control. Old messages automatically move to cheap storage while recent data stays fast. Setting it up requires understanding BookKeeper's ledger lifecycle, but once it's working, it saves serious money on long-term retention.

Word of warning: make sure you get the S3 bucket policies right the first time. I've seen production clusters lock up trying to offload data to S3 with misconfigured IAM roles, throwing OffloadException: Access Denied errors that took hours to debug.
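
The offload policy itself is a namespace setting. Here's a hedged sketch with the Java admin client - the threshold and names are illustrative, and the S3 driver, bucket, and credentials have to be configured on the brokers separately, which is where the IAM pain lives:

```java
import org.apache.pulsar.client.admin.PulsarAdmin;

public class TieredStorageSetup {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://pulsar-broker.example.com:8080")
                .build();

        // Once a topic's ledgers exceed ~10 GB, older segments get offloaded
        // to the long-term store configured in broker.conf (S3, GCS, etc.)
        admin.namespaces().setOffloadThreshold(
                "acme-corp/orders", 10L * 1024 * 1024 * 1024);

        admin.close();
    }
}
```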

Client Libraries: Java Works, Others Are Hit-or-Miss

The official clients are:

  • Java: Rock solid, all features work
  • Go: Pretty good, occasionally missing newer features
  • Python: Adequate for most use cases, asyncio support exists
  • C++: Fast but documentation is sparse
  • Node.js: Works but you'll hit edge cases. The connection pooling breaks in weird ways under high load.
  • C#: Basic functionality only. Don't expect schema registry support anytime soon.

If you're not using Java, test thoroughly before committing to production. I've had Python clients randomly start consuming duplicate messages in production due to some obscure consumer acknowledgment bug that only shows up under load.
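
For comparison, this is the kind of baseline the Java client gives you: a minimal produce/consume loop with explicit acknowledgment. Topic and subscription names are made up:

```java
import org.apache.pulsar.client.api.*;

public class JavaClientSmokeTest {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        Producer<byte[]> producer = client.newProducer()
                .topic("persistent://public/default/smoke-test")
                .create();
        producer.send("hello pulsar".getBytes());

        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://public/default/smoke-test")
                .subscriptionName("smoke-sub")
                .subscriptionType(SubscriptionType.Shared)
                .subscribe();

        Message<byte[]> msg = consumer.receive();
        System.out.println("Got: " + new String(msg.getValue()));
        // Unacked messages get redelivered - acknowledgment handling is exactly
        // the area where the non-Java clients have bitten people.
        consumer.acknowledge(msg);

        producer.close();
        consumer.close();
        client.close();
    }
}
```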

Pulsar Functions: Cool Demo, Production Hell

Pulsar Functions let you deploy stream processing code directly into the Pulsar cluster. It's serverless in theory, but debugging a function that's misbehaving in production involves digging through Kubernetes logs and understanding Pulsar's function runtime. Most people end up running separate stream processing anyway.
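
For context, this is roughly what a function looks like in Java. The Function interface below is the real Pulsar Functions API; the class itself is a toy example:

```java
import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;

// Deployed into the cluster (e.g. via pulsar-admin functions create);
// reads from an input topic, writes the transformed value to an output topic.
public class UppercaseFunction implements Function<String, String> {
    @Override
    public String process(String input, Context context) {
        // Context gives you logging, metrics, and state - and it's also
        // your main window into what the function is doing at runtime.
        context.getLogger().info("processing message");
        return input == null ? null : input.toUpperCase();
    }
}
```

The code is the trivial part. Packaging it, getting it scheduled, and finding its logs when it misbehaves is where the production pain comes from.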

Schema Registry: Built-In But Basic

The schema registry supports Avro, JSON, and Protobuf. Schema evolution works for simple cases, but complex schema migrations require careful planning. It's not as sophisticated as Confluent's Schema Registry, but it's included and mostly works.
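
Using it from the Java client looks roughly like this - a sketch with a made-up record type, where Pulsar derives the Avro schema from the class and registers it on first use:

```java
import org.apache.pulsar.client.api.*;

public class SchemaExample {
    // Plain POJO; Pulsar generates the Avro schema from its fields
    public static class SensorReading {
        public String deviceId;
        public double temperature;
        public long timestamp;
    }

    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // The schema is uploaded to the registry on first use; later producers
        // and consumers are checked against it for compatibility.
        Producer<SensorReading> producer = client
                .newProducer(Schema.AVRO(SensorReading.class))
                .topic("persistent://public/default/sensor-readings")
                .create();

        SensorReading reading = new SensorReading();
        reading.deviceId = "device-42";
        reading.temperature = 21.5;
        reading.timestamp = System.currentTimeMillis();
        producer.send(reading);

        producer.close();
        client.close();
    }
}
```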

Security: Comprehensive If You Can Configure It

Pulsar's security model is actually quite good:

  • TLS everywhere (required for production)
  • JWT/OAuth integration that works
  • Fine-grained ACLs down to the topic level
  • End-to-end encryption for sensitive data

The challenge is getting all the moving parts configured correctly. BookKeeper, ZooKeeper, and Pulsar brokers all need coordinated security config, and getting it wrong means either no security or complete lockout.
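
On the client side, a correctly secured connection looks roughly like this. URLs, cert path, and token are placeholders; the broker, BookKeeper, and ZooKeeper sides each need their own matching config:

```java
import org.apache.pulsar.client.api.AuthenticationFactory;
import org.apache.pulsar.client.api.PulsarClient;

public class SecureClient {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                // TLS listener (6651 by convention) instead of plaintext 6650
                .serviceUrl("pulsar+ssl://pulsar-broker.example.com:6651")
                .tlsTrustCertsFilePath("/etc/pulsar/certs/ca.cert.pem")
                .enableTlsHostnameVerification(true)
                // JWT auth; the token's subject is what your ACLs key off
                .authentication(AuthenticationFactory.token(System.getenv("PULSAR_JWT")))
                .build();

        // ... producers/consumers as usual ...
        client.close();
    }
}
```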

Pulsar vs Kafka vs RabbitMQ: The Honest Comparison

Apache Pulsar

  • Architecture: Brokers + BookKeeper + ZooKeeper. Three distributed systems to operate (each with their own config files to fuck up).
  • Throughput: Depends entirely on your BookKeeper setup. I've seen 20K-200K msg/sec depending on configuration. On a good day.
  • Latency: Usually 5-50ms. The "sub-10ms" claims assume perfect conditions you'll never have in production, like dedicated 10Gb networks and NVMe storage.
  • Operational complexity: High. You need to understand three different systems, their failure modes, and how they interact when shit goes sideways.
  • Multi-tenancy: Actually works without separate clusters. This is the one thing Pulsar genuinely does better.
  • When it breaks: BookKeeper storage issues are painful to debug. Hope you like analyzing JVM heap dumps at 2am.

Apache Kafka

  • Architecture: Brokers with local storage. Simple to understand.
  • Throughput: Consistently high. 100K+ msg/sec is achievable.
  • Latency: 10-100ms typically. Predictable performance.
  • Operational complexity: Medium. Partition management is the main pain point.
  • Multi-tenancy: Manual topic naming conventions and ACLs.
  • When it breaks: Usually network or disk issues. Easier to diagnose.

RabbitMQ

  • Architecture: Single broker with clustering. Traditional message broker.
  • Throughput: 10K-50K msg/sec before you start hitting limits.
  • Latency: Can be very low (<5ms) for simple use cases.
  • Operational complexity: Low for basic setups, medium for clustering.
  • Multi-tenancy: Virtual hosts work but aren't as sophisticated.
  • When it breaks: Memory issues usually. Well-understood failure modes.

Real Production Experiences: The Good and The Ugly

The Companies Actually Using It

I've talked to engineers at several companies running Pulsar in production. Here's what they actually told me:

Yahoo: The Success Story That Started It All

Yahoo built Pulsar because Kafka couldn't handle their scale. They've been running it in production since 2015, and it works for them - but they also employ the people who built it.

Tencent: Scale But With Pain Points

Tencent runs Pulsar at massive scale, but their optimization blog posts tell the real story: keeping it healthy at that level takes serious, ongoing tuning. They handle "tens of billions of transactions during peak time" according to verified metrics, which is impressive but not the inflated numbers you sometimes see in marketing materials.

Flipkart: Topic-as-a-Service Reality

Flipkart built their topic-as-a-service platform on Pulsar's multi-tenancy, but it took them 18 months to get it production-ready.

The IoT Use Case: Where Pulsar Actually Shines

Cisco's IoT platform is one of the success stories that makes sense. IoT workloads need huge topic counts - think one topic per device - with routing and isolation that doesn't require manual babysitting. This is exactly what Pulsar was designed for. But notice they replaced "legacy message queue services" - they didn't migrate from Kafka.

What Companies Don't Tell You

The Migration Stories That Aren't
Most Pulsar success stories involve greenfield deployments or replacing older systems. Kafka-to-Pulsar migrations are rare: if Kafka already works for you, there's rarely enough upside to justify taking on a second set of operational problems.

The Platform Team Requirement
Every successful Pulsar deployment I've seen has a dedicated platform team. If you don't have 2-3 engineers who can become Pulsar experts, you're going to struggle.

Common Use Case Patterns

Multi-Tenant SaaS Platforms
This is where Pulsar really wins. If you're building a platform that serves multiple customers and need topic isolation, Pulsar's multi-tenancy is genuinely better than Kafka's manual approaches.

IoT Data Ingestion
Millions of topics? Pulsar handles this better than alternatives. Each device gets its own topic, and the routing/load balancing just works.

Event-Driven Microservices
Pulsar works fine for this, but so does Kafka. Choose based on operational capacity, not technical features.

The Reality Check

Pulsar works well for the companies using it, but they all have something in common: significant investment in platform engineering. If you're evaluating messaging systems, ask yourself:

  • Do you actually need multi-tenancy? (Hint: you probably don't)
  • Can you invest in the operational expertise? (Budget for 2-3 senior engineers who'll become Pulsar experts)
  • Are you building something that will justify the complexity? (Most aren't)

For most companies, Kafka is boring and reliable. Pulsar is powerful but complex. Choose accordingly. And if your boss is pushing Pulsar because they read a blog post about Yahoo's scale, remind them that Yahoo also had the team that built the fucking thing.

Questions People Actually Ask About Pulsar

Q: Should I use Pulsar instead of Kafka?

A: Probably not, unless you specifically need multi-tenancy or are building an IoT platform with millions of topics. Kafka is more mature, has better tooling, and most teams can operate it successfully. Pulsar is more complex but has better architecture for specific use cases.

Q: How hard is it to operate Pulsar in production?

A: Harder than Kafka, easier than running your own distributed database. You're managing three systems: Pulsar brokers, BookKeeper bookies, and ZooKeeper. Each has different failure modes and scaling characteristics. You need at least a couple engineers who really understand this stuff.

Q: What happens when BookKeeper breaks?

A: Your day gets ruined. BookKeeper stores all the actual data, so when it has issues, message production and consumption can stop. The most common problems are:

  • Disk space exhaustion on bookies (you'll see NoWritableLedgerException)
  • Network partitions between bookies (hello BKNotEnoughBookiesException)
  • Ledger replication falling behind (causes LedgerRecoveryException)
  • Journal disk corruption (prepare to restore from backups)

I've spent entire nights debugging BookieException: Error writing entry only to find out it was a fucking disk permission issue. Learning to diagnose BookKeeper issues is critical for Pulsar operations.

Q: Can I run Pulsar on a single machine for development?

A: Yes, using the standalone deployment. It runs all components in one process, which is fine for testing but obviously not production-ready. Takes about 2-3 minutes to start up on a decent laptop (compared to ~15 seconds for Kafka).

Docker Compose setups exist but are still complex compared to running a single Kafka broker. Expect to allocate at least 4GB RAM for the standalone setup, or you'll get OutOfMemoryError exceptions during startup.

Q: How much does Pulsar cost compared to Kafka?

A: The software is free, but operational costs are higher because:

  • More complex infrastructure (3x as many moving parts)
  • Higher expertise requirements (more expensive engineers)
  • More monitoring and alerting infrastructure needed

Managed services like StreamNative Cloud or DataStax Astra exist but cost more than managed Kafka.

Q: Does the Java client actually work well?

A: Yes, the Java client is solid and has all the features. The other clients are... adequate. If you're not using Java, test thoroughly:

  • Python client works for basic use cases
  • Go client is decent but sometimes missing newer features
  • Node.js client exists but has edge cases
  • C++ client is fast but documentation is sparse

Q: What's this about ZooKeeper being a problem?

A: ZooKeeper stores metadata about topics, partitions, and cluster state. It becomes a bottleneck around 500K-1M topics. Pulsar is moving to eliminate the ZooKeeper dependency, but it's not there yet. For now, ZooKeeper is your scaling ceiling.

Q: How do I debug Pulsar when things go wrong?

A: Hope you like logs. Seriously, debugging Pulsar issues requires understanding:

  • Pulsar broker logs (for routing and client issues) - usually in /var/log/pulsar/
  • BookKeeper bookie logs (for storage issues) - look for bookie.log and gc.log
  • ZooKeeper logs (for metadata issues) - check zookeeper.out when shit hits the fan
  • Client-side logs (for application issues) - enable DEBUG logging for org.apache.pulsar.client

Pro tip: when you see TimeoutException errors, it's probably not a timeout issue - it's usually BookKeeper being overloaded. The monitoring documentation is actually pretty good once you're set up, but expect to spend days understanding what all the metrics mean.

Q: Can I migrate from Kafka without downtime?

A: Theoretically yes, using the Kafka-on-Pulsar (KoP) compatibility layer. In practice, you'll want to migrate gradually and test extensively. Most migrations take months of planning and execution.

Q: Is Pulsar Functions worth using?

A: For simple transformations, maybe. For anything complex, you're better off with proper stream processing frameworks like Flink or Spark. Debugging functions that fail in production is painful because they're running inside the Pulsar cluster.

Q: What's the real latency like?

A: Depends entirely on your setup. I've seen:

  • Best case: 5-15ms for local clusters with SSD storage
  • Typical: 20-50ms for networked deployments
  • Worst case: 100ms+ when BookKeeper is under pressure

The "sub-10ms" marketing claims assume perfect conditions you won't have in production.

Q: Should small teams use Pulsar?

A: No. Unless you have specific requirements that only Pulsar meets (like true multi-tenancy), stick with Kafka or RabbitMQ. The operational overhead isn't worth it for most teams.

Q: The bottom line?

A: Pulsar is impressive technology with real advantages in specific use cases. But it's complex, operationally demanding, and overkill for most messaging needs. If you're evaluating it, make sure you understand not just what it can do, but what it'll cost you to run it properly.

And remember: there's no shame in choosing the boring, reliable option. Sometimes "good enough" is actually good enough.
