Apache Pulsar: Production Reality & Decision Framework
Executive Summary
Apache Pulsar is a distributed messaging and streaming platform built at Yahoo in 2013 (open-sourced in 2016) to handle 100 billion messages/day when Kafka scaling failed. Key differentiator: it separates compute (brokers) from storage (BookKeeper), so each layer scales independently without cluster rebalancing. Production-ready as of 4.0 LTS (October 2024), with 4.1.0 released September 8, 2025.
Critical Decision Point: Pulsar carries roughly 50% higher infrastructure costs than Kafka, but eliminates weekend rebalancing disasters and enables true multi-tenancy.
Architecture & Core Capabilities
Storage-Compute Separation
- Brokers: Route messages only, no data storage
- BookKeeper: Handles all persistence and replication
- Scaling Impact: Add storage without rebalancing (vs 18-hour Kafka partition migrations)
- Failure Behavior: Storage node loss = 12-second latency spike vs 4+ minute Kafka outage
Multi-Tenancy Implementation
- Structure: persistent://tenant/namespace/topic
- Production Reality: Single cluster supports dev/staging/prod with isolated auth, quotas, policies
- Kafka Alternative: Requires 3 separate clusters, which multiplies operational complexity
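The tenant/namespace/topic hierarchy is what makes per-tenant isolation possible, so it helps to treat the topic name as structured data rather than an opaque string. A minimal sketch, assuming only the naming scheme shown above; `TopicName`, `parse_topic`, and `same_tenant` are illustrative helpers, not part of any Pulsar client library.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    domain: str      # "persistent" or "non-persistent"
    tenant: str
    namespace: str
    topic: str


def parse_topic(name: str) -> TopicName:
    """Split a fully qualified topic name into its four parts."""
    domain, sep, rest = name.partition("://")
    if not sep or domain not in ("persistent", "non-persistent"):
        raise ValueError(f"missing or unknown domain in {name!r}")
    # Split at most twice so the topic part may itself contain slashes.
    parts = rest.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected tenant/namespace/topic in {name!r}")
    return TopicName(domain, *parts)


def same_tenant(a: str, b: str) -> bool:
    """True when both topics belong to the same tenant."""
    return parse_topic(a).tenant == parse_topic(b).tenant


if __name__ == "__main__":
    print(parse_topic("persistent://acme/prod/orders"))
```

A check like `same_tenant` is the kind of guard a platform team might put in front of cross-environment subscriptions; the tenant and namespace names used here are made up.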
Geo-Replication
- Setup Time: 20 minutes of configuration vs 2 weeks of Kafka MirrorMaker debugging
- Replication Lag: 80-120ms typical vs 200-800ms with Kafka MirrorMaker
- Reliability: Built into the broker vs MirrorMaker's random failures
Production Performance Metrics
Real-World Throughput Comparison
Metric | Pulsar (Production) | Kafka (Production) |
---|---|---|
Messages/sec (normal) | 65k | 200k |
Messages/sec (peak) | 180k | 200k |
P99 Latency (normal) | 15ms | 10-15ms |
P99 Latency (peak) | 45ms | Variable |
Storage failure recovery | 12 seconds | 4+ minutes |
Infrastructure Costs (500GB/day workload)
- Pulsar: $3,420/month (4 brokers + 6 BookKeeper + 3 ZK)
- Kafka: $2,100/month (6 brokers)
- Retention Cost Advantage: 2-year retention = $60/month (S3) vs $4,000/month (Kafka brokers)
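The cost gap can be checked directly from the two monthly figures above. A minimal sketch; the dollar amounts and node counts are the ones quoted in the list, and the function names are illustrative.

```python
# Monthly figures and node counts are copied from the list above.
PULSAR_MONTHLY = 3420        # USD: 4 brokers + 6 BookKeeper + 3 ZK
KAFKA_MONTHLY = 2100         # USD: 6 brokers
PULSAR_NODES = 4 + 6 + 3
KAFKA_NODES = 6


def premium(pulsar: float, kafka: float) -> float:
    """Fractional extra cost of Pulsar relative to Kafka."""
    return pulsar / kafka - 1


def per_node(monthly: float, nodes: int) -> float:
    """Average monthly cost of one node."""
    return monthly / nodes


if __name__ == "__main__":
    print(f"premium: {premium(PULSAR_MONTHLY, KAFKA_MONTHLY):.0%}")
    print(f"Pulsar per node: ${per_node(PULSAR_MONTHLY, PULSAR_NODES):.0f}")
    print(f"Kafka per node:  ${per_node(KAFKA_MONTHLY, KAFKA_NODES):.0f}")
```

On these numbers the premium works out to about 63%, somewhat above the rounded 50% quoted elsewhere in this document. The per-node cost is lower for Pulsar; the difference comes from running 13 nodes instead of 6.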
Critical Failure Modes & Solutions
Most Common Production Issues
ServiceUnitNotReady Errors
- Cause: Broker unloading topics during scaling
- Impact: Producer failures during autoscaling
- Debug Time: 6+ hours typical
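Because topic unloading is transient, one common mitigation is to retry the send with backoff instead of failing the producer outright. A minimal sketch of that pattern, not Pulsar client code: `TransientBrokerError` is a hypothetical stand-in for whatever retryable exception your client surfaces, and `send` is any callable that performs the publish.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientBrokerError(Exception):
    """Hypothetical stand-in for a retryable error such as ServiceUnitNotReady."""


def send_with_backoff(
    send: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.2,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call send() until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        try:
            return send()
        except TransientBrokerError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            # Exponential backoff with full jitter so producers do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter exists only so the delay can be stubbed out in tests. The defaults are illustrative; tune them against how long your brokers actually take to reassign a bundle.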
Connection Refused
- Cause: Port confusion (6650 binary vs 8080 admin)
- Frequency: Every new deployment team
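The port mix-up is easy to catch before deployment by linting the service URLs. A minimal sketch, assuming a stock install where the binary protocol listens on 6650 (6651 with TLS) and the admin REST API on 8080 (8443 with TLS); `check_url` is an illustrative helper and the host names in the usage lines are made up.

```python
from urllib.parse import urlsplit

# Default ports in a stock Pulsar install (an assumption; deployments can change them).
DEFAULT_PORTS = {
    "pulsar": 6650,       # binary protocol, used by producers and consumers
    "pulsar+ssl": 6651,   # binary protocol over TLS
    "http": 8080,         # admin REST API
    "https": 8443,        # admin REST API over TLS
}
CLIENT_SCHEMES = {"pulsar", "pulsar+ssl"}
ADMIN_SCHEMES = {"http", "https"}


def check_url(url: str, role: str) -> list[str]:
    """Return warnings for a service URL used as a "client" or "admin" endpoint."""
    parts = urlsplit(url)
    expected = CLIENT_SCHEMES if role == "client" else ADMIN_SCHEMES
    warnings = []
    if parts.scheme not in expected:
        warnings.append(
            f"{role} URL should use one of {sorted(expected)}, got {parts.scheme!r}"
        )
    default = DEFAULT_PORTS.get(parts.scheme)
    if default is not None and parts.port not in (None, default):
        warnings.append(
            f"port {parts.port} is not the default {default} for {parts.scheme}://"
        )
    return warnings


if __name__ == "__main__":
    for w in check_url("pulsar://broker.example:8080", "client"):
        print("warning:", w)
```

A non-default port is only a warning, since custom ports are legitimate; the wrong scheme for the role is the mistake that produces the connection refusals described above.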
Memory Thrashing
- Cause: BookKeeper write cache, broker caching, ZooKeeper competing for memory
- Resolution: 4+ tuning iterations typically required
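Much of the tuning comes down to making sure the memory limits of co-located processes fit on the host. A minimal budgeting sketch under stated assumptions: each process is a JVM with a heap limit and a direct-memory limit, the caches named above count against one of those two limits, and every number in the usage lines is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JvmProcess:
    name: str
    heap_gb: float     # JVM heap limit
    direct_gb: float   # JVM direct (off-heap) memory limit


def headroom_gb(
    host_ram_gb: float,
    processes: list[JvmProcess],
    os_reserve_gb: float = 2.0,
) -> float:
    """RAM left after all JVM limits and an OS reserve; negative means overcommitted."""
    committed = sum(p.heap_gb + p.direct_gb for p in processes)
    return host_ram_gb - os_reserve_gb - committed


if __name__ == "__main__":
    # Hypothetical sizing for one host running all three components.
    stack = [
        JvmProcess("broker", heap_gb=4, direct_gb=8),
        JvmProcess("bookie", heap_gb=4, direct_gb=8),
        JvmProcess("zookeeper", heap_gb=2, direct_gb=0),
    ]
    for ram in (16, 32):
        print(f"{ram} GB host: {headroom_gb(ram, stack):+.0f} GB headroom")
```

A negative result means the processes can collectively claim more than the host has, which is the precondition for the thrashing described above. The check says nothing about page cache needs, so treat a small positive number as tight rather than safe.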
Debugging Complexity
- Components to Debug: 5 (brokers, bookies, ZK, proxy, functions) vs 3 for Kafka
- Stack Overflow Support: 200 questions vs 5,000+ for Kafka
- 3AM Debug Sessions: "Debugging chain is brutal" - expect to need distributed storage expertise
Decision Matrix: When to Choose Pulsar
Strong Use Cases
- Multi-cluster Kafka Operations: Replace 6 Kafka clusters with 1 Pulsar deployment
- Geo-replication Requirements: MirrorMaker causing operational pain
- Long-term Retention: Years of data without datacenter costs
- Platform Team Available: Dedicated distributed systems expertise
Avoid Pulsar If
- Startup/Small Team: <100k msgs/sec workload
- Cost Sensitivity: 50% infrastructure cost increase unacceptable
- Simple Pub/Sub: Basic producer→consumer patterns
- Limited Ops Expertise: Struggling with current Kafka operations
Operational Requirements
Staffing Requirements
- Minimum: Senior engineer with distributed systems experience
- Reality Check: "Don't throw at junior devs and hope it works"
- Learning Curve: 6+ months for operational competency
Monitoring Complexity
Essential Metrics (vs Kafka's 3 components):
- BookKeeper bookie health + disk I/O
- Individual ledger write rates
- Broker message rates + backlog
- ZooKeeper ensemble health
- Network connectivity matrix (5x5 vs 3x3)
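The jump from a 3x3 to a 5x5 connectivity matrix is larger than it sounds once the links are counted. A minimal sketch, reading the matrix as every unordered pair among the five component types listed under Debugging Complexity; in a real deployment not every pair talks directly, so this is an upper bound.

```python
from itertools import combinations

# The five component types named earlier in this document.
PULSAR_COMPONENTS = ["broker", "bookie", "zookeeper", "proxy", "functions"]


def links(components: list[str]) -> list[tuple[str, str]]:
    """Every unordered pair of distinct components, one per link that may need a check."""
    return list(combinations(components, 2))


if __name__ == "__main__":
    pulsar_links = links(PULSAR_COMPONENTS)
    three_component_links = links(["a", "b", "c"])  # placeholder names for a 3-part system
    print(len(pulsar_links), "links vs", len(three_component_links))
    for a, b in pulsar_links:
        print(f"{a} <-> {b}")
```

Five components give 10 possible links against 3 for a three-component system, so the monitoring surface grows faster than the component count.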
Deployment Risk Assessment
- Migration Timeline: 6 months minimum with rollback plan
- Production Readiness: Requires 4+ memory tuning iterations
- Weekend Risk: Higher than Kafka due to component interdependencies
Version Status & Stability
Current Release (September 2025)
- Pulsar 4.1.0: Latest stable (September 8, 2025)
- Foundation: 4.0 LTS (October 2024)
- Key Fixes: Key_Shared ordering, connection leaks, Java 21 support
Production Readiness Indicators
- Data Corruption Risk: Resolved in 4.0+ (BookKeeper stability improvements)
- Connection Management: Fixed connection leaks that caused instability
- Schema Registry: Built-in, eliminates Confluent licensing
Migration Strategy
Gradual Transition Approach
- Parallel Operation: Run both systems during transition
- Kafka Proxy: Use Pulsar's Kafka compatibility for gradual cutover
- New Topics First: Start with new workloads, migrate existing last
- Rollback Plan: Critical due to ecosystem size limitations
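The "new topics first" rule with a rollback path can be expressed as a small routing decision made wherever producers are configured. A minimal sketch under stated assumptions: the topic names and the two sets are hypothetical bookkeeping your team would maintain, and nothing here calls Kafka or Pulsar.

```python
def choose_backend(topic: str, legacy_topics: set[str], migrated: set[str]) -> str:
    """Pick the system that should serve a topic during the parallel-run phase."""
    if topic not in legacy_topics:
        return "pulsar"   # new workloads start on Pulsar
    if topic in migrated:
        return "pulsar"   # existing topic that has already been cut over
    return "kafka"        # existing topic that has not been migrated yet


def roll_back(topic: str, migrated: set[str]) -> set[str]:
    """Return the migrated set with one topic sent back to Kafka."""
    return migrated - {topic}


if __name__ == "__main__":
    legacy = {"orders", "payments"}
    migrated = {"orders"}
    for name in ("orders", "payments", "audit-events"):
        print(name, "->", choose_backend(name, legacy, migrated))
```

Keeping the migrated set as explicit state is what makes rollback a one-line change rather than a redeploy.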
Risk Mitigation
- Expertise Gap: BookKeeper knowledge essential before production
- Vendor Lock-in: Small ecosystem limits alternatives
- Support Availability: Limited compared to Kafka community
Bottom Line Assessment
Choose Pulsar When: Multi-tenancy, geo-replication, or retention requirements justify the 50% cost increase and the added operational complexity. Requires dedicated platform engineering capability.
Avoid Pulsar When: You have simple messaging needs, cost constraints, or limited operational expertise. Managed Kafka solutions provide better ROI for most use cases.
Migration Decision: Only migrate existing Kafka if specific pain points (multi-cluster management, geo-replication, retention costs) justify a 6-month migration project and the increase in operational complexity.
Useful Links for Further Investigation
Resources That Actually Help (Not Just Marketing)
Link | Description |
---|---|
Official Pulsar Docs | The docs assume you know distributed storage. Start with the quickstart, then cry when you hit BookKeeper configuration. |
Pulsar 4.1.0 Release Notes | Just released September 8, 2025. Check what's new in the latest stable version. |
Pulsar 4.0 Getting Started | Docker quickstart that actually works. Use this before trying production deployment. |
Apache Pulsar GitHub | Where you'll spend hours reading issue comments to understand why things break. Pro tip: search closed issues first. |
Interspirit's Production Experience | Honest review from a team that actually deployed Pulsar. Spoiler: the connector didn't work as expected. |
Zendesk's Pulsar Evaluation | Deep technical evaluation with real performance numbers and gotchas. One of the few honest technical reviews. |
Apache Pulsar Discussions | Community discussions and Q&A about Pulsar usage patterns and production experiences. |
Stack Overflow Pulsar Questions | All 200 questions about Pulsar errors. Start here when debugging at 2AM. |
Common Pulsar Issues (GitHub) | Sorted by reactions. The most upvoted issues are probably what you'll hit too. |
BookKeeper Documentation | Essential when your storage layer starts corrupting data. Learn BookKeeper internals or suffer. |
StreamNative Cloud | Let them deal with BookKeeper operations. Starts at $73/month, worth every penny. |
StreamNative Console | Web-based management console for monitoring Pulsar clusters with metrics and operational dashboards. |
DataStax Astra Streaming | DataStax's managed Pulsar service with built-in analytics and change data capture capabilities. |
Pulsar vs Kafka Performance Analysis | Kai Waehner's detailed comparison. Less marketing BS, more technical reality. |
2025 Pulsar vs Kafka Benchmarks | Recent benchmarks with actual performance numbers under different loads. |
Apache Pulsar Case Studies | Official case studies. Take with grain of salt, but useful to understand use cases. |
Pulsar Slack Community | Small but helpful community. Maintainers actually respond, unlike some projects. |
StreamNative Blog | Best source for Pulsar technical content. They employ half the core contributors. |
Deep Dive: Message Chunking | When you need to send messages larger than 5MB and everything breaks. |
Confluent Cloud | Managed Kafka that just works. Consider this before Pulsar unless you specifically need Pulsar features. |
Amazon MSK | AWS-managed Kafka. Simpler than self-managed, fewer features than Confluent. |
Redis Streams | For simple use cases. Way easier to operate than Pulsar. |