Event-Driven Observability Stack: Kafka + MongoDB + Kubernetes + Prometheus
Executive Summary
A production-grade event-driven observability stack built on Kafka, MongoDB, Kubernetes, and Prometheus. It addresses the critical monitoring gaps in asynchronous architectures where traditional APM tools cannot trace events across microservices.
Critical Context
Primary Challenge: Event-driven architectures make debugging nearly impossible when events fail across distributed services. Traditional monitoring shows green dashboards while business transactions fail silently.
Success Criteria:
- End-to-end event tracing across all services
- Sub-10-second detection of critical failures
- Cost under $5000/month for medium-scale deployment
Technical Specifications
Component Requirements
Kafka Configuration
- Version: 3.5.x (avoid 3.6.0 - consumer group stability issues crash brokers)
- Minimum Resources: 3 brokers, 4-8GB RAM each, fast SSDs required
- Critical Settings:
  num.network.threads: 8
  num.io.threads: 16
  min.insync.replicas: 2
  log.retention.hours: 168
- Breaking Point: UI becomes unusable at 1000+ message lag
- Cost: $500-1500/month for 3-broker production cluster
MongoDB Setup
- Version: 7.0.x stable
- Minimum Resources: 3-node replica set, 8-16GB RAM each
- Critical Settings (server-side vs driver-side split shown below):
  net.maxIncomingConnections: 65536
  maxPoolSize: 200
  minPoolSize: 50
- Breaking Point: Default 10-connection pool causes timeouts under any real load
- Cost: $800-2000/month for production replica set
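These settings live in two different places: net.maxIncomingConnections is a server-side mongod setting, while maxPoolSize and minPoolSize are driver-side options set per client (the connection-string form appears later under Connection Pool Exhaustion). A minimal sketch of the server side in mongod.conf:

```yaml
# mongod.conf (YAML format): raise the server-side connection ceiling.
# Driver pool sizes are configured separately in each application's connection string.
net:
  maxIncomingConnections: 65536
```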
Kubernetes Infrastructure
- Minimum Requirements: 3 nodes, 16GB RAM, 8 vCPUs each
- Storage: 500GB fast SSDs per node (NVMe preferred)
- Network: 10Gbps, or Kafka becomes the bottleneck for the entire system
- Resource Planning: set requests at 2x expected usage and limits at 1.5x the request (sketch below)
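As a concrete illustration of that planning rule, here is a sketch of requests and limits for a hypothetical consumer service that typically uses about 1 GiB of memory and half a CPU:

```yaml
# Sketch only: requests at ~2x expected usage, limits at ~1.5x the request.
resources:
  requests:
    cpu: "1"        # expected ~500m
    memory: 2Gi     # expected ~1Gi
  limits:
    cpu: 1500m
    memory: 3Gi
```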
Prometheus Monitoring
- Memory Requirements: 16GB minimum, 32-64GB for serious workloads
- Storage: 1-2GB per day per 1000 active series
- Breaking Point: High cardinality metrics cause OOM kills every 6 hours
- Cost: Storage grows roughly linearly with retention, so long retention windows get expensive fast
Version Compatibility Matrix
Component | Stable Version | Avoid | Compatibility Notes |
---|---|---|---|
Kafka | 3.5.x | 3.6.0 | Consumer group stability issues |
MongoDB | 7.0.x | - | Works with Percona exporter 0.39.0 |
Kubernetes | 1.28+ | - | Required for Prometheus operator features |
Prometheus | 2.47+ | - | Native histogram support needed |
Percona MongoDB Exporter | 0.39.0 | 0.40.0 | Auth issues in newer version |
Implementation Roadmap
Phase 1: Foundation (Week 1-2)
- Deploy Kubernetes cluster with proper resource allocation
- Install Prometheus Operator with RBAC configuration (a chart values sketch follows this list)
- Set up basic alerting (5 critical alerts maximum)
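A hedged starting point, assuming the Prometheus Operator is installed via the kube-prometheus-stack Helm chart (key names follow that chart; adjust retention and memory to the sizing guidance in the specifications above):

```yaml
# Hypothetical values.yaml excerpt for kube-prometheus-stack
prometheus:
  prometheusSpec:
    retention: 15d
    resources:
      requests:
        memory: 16Gi
      limits:
        memory: 32Gi
alertmanager:
  enabled: true
```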
Phase 2: Core Services (Week 3-4)
- Deploy Kafka using the Strimzi operator (a minimal Kafka resource sketch follows this list)
- Configure MongoDB with proper connection pooling
- Implement JMX exporters for metrics collection
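A minimal sketch of the Strimzi Kafka resource carrying the broker settings from the Technical Specifications section. The cluster name matches the production-kafka StatefulSets referenced later; the version is pinned to the 3.5.x line per the compatibility matrix, and the storage class name is an assumption:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: production-kafka
spec:
  kafka:
    version: 3.5.1
    replicas: 3
    listeners:
      - name: tls
        port: 9093
        type: internal
        tls: true
    config:
      num.network.threads: 8
      num.io.threads: 16
      min.insync.replicas: 2
      log.retention.hours: 168
      default.replication.factor: 3
    storage:
      type: persistent-claim
      size: 500Gi
      class: fast-ssd          # assumed StorageClass name
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 100Gi
  entityOperator:
    topicOperator: {}
    userOperator: {}
```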
Phase 3: Observability (Week 5-6)
- Configure distributed tracing with correlation IDs
- Set up Grafana dashboards for key metrics
- Implement synthetic transaction monitoring (one lightweight option is sketched below)
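One lightweight option for the synthetic checks, assuming the Prometheus Operator and a blackbox exporter are already running (the service name and target URL below are hypothetical). Note this only probes an HTTP endpoint; a true business-level synthetic transaction still needs its own test producer/consumer:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: checkout-synthetic
spec:
  prober:
    url: blackbox-exporter.monitoring.svc:9115   # assumed blackbox-exporter service
  module: http_2xx
  targets:
    staticConfig:
      static:
        - https://api.example.com/health/checkout   # hypothetical synthetic endpoint
```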
Critical Failure Modes
High-Impact Failures
Kafka Consumer Lag Spikes
- Cause: JVM garbage collection pauses (2+ seconds)
- Detection: kafka.consumer.fetch-manager.records-lag > 10000
- Solution: Use G1GC with -XX:MaxGCPauseMillis=200 (a Strimzi jvmOptions sketch follows this list)
- Impact: Orders disappear, users get charged without confirmation
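If the brokers run under Strimzi (as in Phase 2), the GC settings can be applied through the Kafka resource's jvmOptions; the heap sizes below are assumptions chosen to match the 4-8GB-per-broker sizing:

```yaml
# Excerpt of the Strimzi Kafka resource: spec.kafka.jvmOptions
spec:
  kafka:
    jvmOptions:
      "-Xms": 4g
      "-Xmx": 4g
      "-XX":
        UseG1GC: "true"
        MaxGCPauseMillis: "200"
```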
MongoDB Primary Elections
- Cause: Network instability, high CPU pressure, replication lag
- Detection: Frequent primary changes in replica set status
- Solution: Increase the replica set's electionTimeoutMillis from the 10-second default to 30 seconds
- Impact: Write failures during election periods
Prometheus Memory Exhaustion
- Cause: Unbounded metric labels (user IDs, transaction IDs)
- Detection: OOM kills every few hours
- Solution: Use recording rules and drop high-cardinality labels at scrape time (see the sketch below)
- Impact: Complete monitoring blindness
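A sketch of dropping known-bad labels at scrape time with the Prometheus Operator; the service name and the label names (user_id, transaction_id) are assumptions about what your applications expose:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: order-service            # hypothetical application
spec:
  selector:
    matchLabels:
      app: order-service
  endpoints:
    - port: metrics
      metricRelabelings:
        - action: labeldrop
          regex: user_id|transaction_id   # drop unbounded labels before ingestion
```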
Medium-Impact Issues
Kafka Log Retention Overflow
- Cause: Default 7-day retention with high-throughput topics
- Impact: 500GB+ daily storage costs
- Solution: Topic-specific retention (24-48 hours for high-volume topics); see the sketch below
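With Strimzi, per-topic retention can be declared on the KafkaTopic resource; the topic name and partition count here are placeholders:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: order-events                 # hypothetical high-volume topic
  labels:
    strimzi.io/cluster: production-kafka
spec:
  partitions: 12
  replicas: 3
  config:
    retention.ms: 86400000           # 24 hours instead of the 7-day broker default
```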
Connection Pool Exhaustion
- Cause: MongoDB default pool size (10 connections)
- Impact: Service timeouts under load
- Solution: Set maxPoolSize=200 and minPoolSize=50 in the client connection string (see the sketch below)
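The pool sizes are driver-side options, typically passed through the connection string. A sketch for an application Deployment; host names, database, and credentials are placeholders:

```yaml
# Excerpt from an application Deployment spec
env:
  - name: MONGODB_URI
    value: "mongodb://app_user:CHANGE_ME@mongodb-0.mongodb-svc:27017,mongodb-1.mongodb-svc:27017,mongodb-2.mongodb-svc:27017/orders?replicaSet=rs0&maxPoolSize=200&minPoolSize=50"
```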
Resource Planning
Production Deployment Costs (Monthly)
Component | Basic Setup | Enterprise Scale |
---|---|---|
EKS Cluster | $73 | $73 |
Worker Nodes (3x m5.2xlarge) | $840 | $2520 (9 nodes) |
EBS Storage (3TB) | $240 | $800 |
Load Balancers | $64 | $128 |
Data Transfer | $50 | $200 |
Total | $1267 | $3721 |
Performance Thresholds
Metric | Warning | Critical | Impact |
---|---|---|---|
Kafka Consumer Lag | 5000 messages | 10000 messages | Event processing delays |
MongoDB Connections | 150 active | 190 active | Service timeouts |
Prometheus Memory | 24GB | 30GB | Query timeouts |
Kubernetes Node Memory | 80% | 90% | Pod evictions |
Operational Procedures
Troubleshooting Sequence (3AM Failures)
Service Health Check
kubectl get pods -A
kubectl get nodes
Event Flow Verification
- Check Kafka consumer lag: kafka.consumer.fetch-manager.records-lag
- Verify MongoDB connectivity: mongodb_connections_current
- Validate end-to-end with synthetic transactions
Infrastructure Status
- Network connectivity between services
- Storage capacity and IOPS
- DNS resolution for service discovery
Emergency Recovery Procedures
Complete System Recovery Order
1. Stop all applications
2. Scale down Kafka Connect workers
3. Scale down Kafka brokers to 0, then back to 3
4. Restart MongoDB replica set if needed
5. Restart applications with gradual traffic increase
Kafka Cluster Restart
kubectl scale statefulset production-kafka-kafka --replicas=0
kubectl scale statefulset production-kafka-kafka --replicas=3
MongoDB Replica Set Reset
// Connect to any healthy member, pull the current config, and force-apply it
config = rs.conf()
rs.reconfig(config, {force: true})
Prometheus Storage Cleanup
# Nuclear option: wipes all stored metrics; Prometheus restarts with an empty TSDB
kubectl exec prometheus-0 -- rm -rf /prometheus/data/*
kubectl delete pod prometheus-0
Monitoring Strategy
Essential Alerts (Maximum 5)
Kafka Consumer Lag > 10,000 messages
- Indicates event processing backup
- Usually caused by GC pauses or service failures (an example rule is sketched after this list)
MongoDB Replica Set Degraded
- Primary election in progress or secondary failure
- Risk of data consistency issues
Service Error Rate > 5%
- User-facing failures requiring immediate attention
- Typically network or configuration issues
Kubernetes Node NotReady
- Infrastructure failure affecting pod scheduling
- Can cascade to service availability issues
Prometheus Out of Disk Space
- Monitoring system failure imminent
- Blindness to all other issues
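As an example of how the first alert might be expressed, assuming consumer lag is exposed by kafka_exporter as kafka_consumergroup_lag (adjust the metric name to whatever your exporter actually emits):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-critical-alerts
spec:
  groups:
    - name: kafka.critical
      rules:
        - alert: KafkaConsumerLagHigh
          expr: sum by (consumergroup, topic) (kafka_consumergroup_lag) > 10000
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Consumer group {{ $labels.consumergroup }} lag above 10k on {{ $labels.topic }}"
```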
Key Performance Indicators
Event Processing Health
# End-to-end latency
kafka_event_processing_duration_seconds_p99
# Message throughput
rate(kafka_server_brokertopicmetrics_messagesin_total[5m])
# Error rate across services
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
Resource Utilization
# Memory pressure
container_memory_usage_bytes / container_spec_memory_limit_bytes
# Disk usage growth
increase(prometheus_tsdb_symbol_table_size_bytes[1h])
# Network saturation
rate(container_network_receive_bytes_total[5m])
Security Considerations
Network Policies
- Allow Kafka inter-broker communication (ports 9092, 9093)
- Permit Prometheus scraping from all exporters
- Restrict MongoDB access to application services only (see the NetworkPolicy sketch after this list)
- Enable TLS for all inter-service communication
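A sketch of the MongoDB restriction as a NetworkPolicy; the pod labels here are assumptions about how the workloads are labeled:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: mongodb-allow-app-only
spec:
  podSelector:
    matchLabels:
      app: mongodb                     # assumed MongoDB pod label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              mongodb-client: "true"   # assumed label on application pods
      ports:
        - protocol: TCP
          port: 27017
```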
Authentication Configuration
- Use SCRAM-SHA-256 for MongoDB authentication
- Implement RBAC for Kubernetes service accounts
- Configure TLS certificates for Kafka client connections
- Enable authentication for Prometheus scraping endpoints
Migration Strategy
From Monolithic Monitoring
- Phase 1: Parallel deployment with existing monitoring
- Phase 2: Migrate non-critical services first
- Phase 3: Gradual migration of business-critical events
- Phase 4: Decommission legacy monitoring after validation
Rollback Planning
- Maintain legacy monitoring for 30 days post-migration
- Document service dependencies and rollback procedures
- Test rollback scenarios in staging environment
- Prepare automated rollback scripts for critical failures
Success Metrics
Technical Metrics
- Mean Time to Detection (MTTD) < 30 seconds for critical issues
- Mean Time to Recovery (MTTR) < 15 minutes for service failures
- Event processing latency p99 < 5 seconds
- System availability > 99.9% excluding planned maintenance
Business Metrics
- Reduced debugging time from hours to minutes
- Proactive issue detection before customer impact
- Decreased on-call escalations by 70%
- Improved customer satisfaction scores
Common Pitfalls
Configuration Errors
- High Cardinality Metrics: Avoid user IDs or transaction IDs as Prometheus labels
- Insufficient Resource Limits: Set memory limits to 1.5-2x typical usage
- Default Connection Pools: MongoDB defaults are inadequate for production
- Log Retention: Kafka's 7-day default fills disks rapidly
Operational Mistakes
- Ignoring GC Tuning: Java services need G1GC configuration for stable performance
- Missing Correlation IDs: Essential for tracing events across service boundaries
- Inadequate Testing: Synthetic transactions required for end-to-end validation
- Alert Fatigue: Start with 5 critical alerts, add gradually
Scaling Challenges
- Prometheus Federation: Complex setup, breaks easily with high cardinality
- Kafka Partition Rebalancing: Manual intervention required for optimal distribution
- MongoDB Sharding: Introduces significant operational complexity
- Cost Explosion: Storage and compute costs scale non-linearly with usage
Recommended Resources
Essential Documentation
- Kafka Documentation - Configuration section critical for production
- MongoDB Manual - Focus on indexing and replica sets
- Prometheus Best Practices - Prevents cardinality disasters
- Kubernetes Networking - Service discovery configuration
Operational Tools
- Strimzi Kafka Operator - Best Kafka deployment method for Kubernetes
- MongoDB Kubernetes Operator - Automated MongoDB management
- KEDA - Event-driven autoscaling for Kubernetes workloads
- Jaeger - Distributed tracing for event flows
This implementation provides production-grade observability for event-driven architectures while avoiding common configuration pitfalls that cause monitoring failures and operational nightmares.
Useful Links for Further Investigation
Resources That Actually Help (And Some That Will Waste Your Time)
Link | Description |
---|---|
**Kafka Documentation** | Actually readable, which is rare for Apache docs. The configuration section will save your ass. |
**Confluent Platform** | Great docs, but they hook you with free stuff then charge enterprise prices. |
**Strimzi Kafka Operator** | Best way to run Kafka on K8s. Wish I'd found this before trying to do it manually. |
**MongoDB Manual** | Decent docs, but overly academic. The indexing section is essential reading. |
**MongoDB Kubernetes Operator** | Better than rolling your own, but RBAC setup is a pain. |
**MongoDB Change Streams** | Great feature but can flood your Kafka cluster if you don't filter properly. |
**Kubernetes Documentation** | Complete but overwhelming. The networking section is particularly brutal. |
**Prometheus Operator** | Works great when configured right, but debugging issues is a nightmare. |
**KEDA Documentation** | Event-driven autoscaling that actually works with Kafka consumer lag. |
**Prometheus Documentation** | Good docs but they assume you have unlimited resources. |
**Grafana Documentation** | Solid documentation for building dashboards. |
**AlertManager** | Alert routing system that's surprisingly complex to configure right. |
**JMX Exporter** | Turns JMX metrics into Prometheus format, but can create cardinality issues. |
**Kafka Exporter** | Often works better than JMX exporter for basic Kafka metrics. |
**Kafka Connect Documentation** | For streaming data between systems. |
**Percona MongoDB Exporter** | Solid MongoDB exporter, though some versions have auth issues. |
**MongoDB Grafana Dashboards** | Pre-built dashboards that are decent starting points. |
**MongoDB Atlas Monitoring** | Built-in monitoring if you're using Atlas. |
**kube-state-metrics** | Essential for K8s monitoring, but will generate thousands of high-cardinality metrics and murder your Prometheus server. |
**Node Exporter** | Actually works as advertised, which is shocking in this ecosystem. |
**cAdvisor** | Container metrics collector that Google built and uses, so at least they have skin in the game. |
**Event-Driven Architecture Best Practices** | Comprehensive guide to EDA patterns and implementation |
**Microservices Observability** | Observability strategies for distributed systems |
**Event Sourcing Pattern** | Using events as the source of truth in system design |
**Prometheus Best Practices** | Naming conventions and operational best practices |
**Observability Engineering** | O'Reilly book on modern observability practices |
**SRE Book - Monitoring** | Google's approach to monitoring distributed systems |
**Kubernetes Security Best Practices** | Security hardening for production clusters |
**Kubernetes Networking** | Network configuration and service discovery |
**Pod Security Standards** | Security policies for pod specifications |
**Prometheus Community Helm Charts** | Production-ready Prometheus stack deployments |
**Bitnami Kafka Helm Chart** | Kafka deployment with monitoring enabled |
**MongoDB Community Operator** | Kubernetes operator for MongoDB deployment |
**Kafka JMX Metrics Configuration** | Complete JMX exporter configuration for Kafka |
**MongoDB Monitoring Stack** | Complete monitoring stack for MongoDB |
**Event-Driven Microservices Example** | Reference implementation of event-driven architecture |
**KubeCon + CloudNativeCon** | Premier Kubernetes and cloud-native conference |
**Kafka Summit** | Apache Kafka community conference |
**MongoDB World** | MongoDB community and developer conference |
**Confluent Training** | Kafka developer and administrator certification |
**MongoDB University** | Free MongoDB courses and certification |
**Linux Foundation Kubernetes Training** | Expensive but worth it if your company pays, otherwise just break things in production until you learn |
**Prometheus Training** | Solid training that actually teaches you how to avoid the cardinality disasters that will ruin your life |
**OpenTelemetry** | Observability framework for cloud-native software |
**Jaeger** | Distributed tracing system for microservices |
**Fluentd** | Data collector for unified logging layer |
**Chaos Mesh** | Cloud-native chaos engineering platform |
**Confluent Platform** | Enterprise Kafka that costs more than your car but actually works in production |
**MongoDB Atlas** | Expensive but saves you from managing MongoDB replica set elections at 3am |
**Google Kubernetes Engine (GKE)** | Most reliable managed K8s service, but Google will shut it down in 5 years anyway |
**Amazon Managed Service for Prometheus** | AWS trying to monetize Prometheus with their usual proprietary lock-in bullshit |
**Datadog** | Expensive but polished observability platform that will bankrupt your startup but save your sanity |
**New Relic** | APM platform that's great until you realize you're paying $500/month for metrics you could get for free |
**Grafana Cloud** | Managed Grafana that costs less than building your own monitoring infrastructure (barely) |
**Splunk** | Enterprise logging platform that costs more than a house but can search through petabytes of logs in seconds |