Event-Driven Observability Stack: Kafka + MongoDB + Kubernetes + Prometheus
Executive Summary
A production-grade event-driven observability stack built on Kafka, MongoDB, Kubernetes, and Prometheus. It addresses the critical monitoring gaps in asynchronous architectures where traditional APM tools cannot trace events across microservices.
Critical Context
Primary Challenge: Event-driven architectures make debugging nearly impossible when events fail across distributed services. Traditional monitoring shows green dashboards while business transactions fail silently.
Success Criteria:
- End-to-end event tracing across all services
- Sub-10-second detection of critical failures
- Cost under $5000/month for medium-scale deployment
Technical Specifications
Component Requirements
Kafka Configuration
- Version: 3.5.x (avoid 3.6.0 - consumer group stability issues crash brokers)
- Minimum Resources: 3 brokers, 4-8GB RAM each, fast SSDs required
- Critical Settings:
  num.network.threads: 8
  num.io.threads: 16
  min.insync.replicas: 2
  log.retention.hours: 168
- Breaking Point: UI becomes unusable at 1000+ message lag
- Cost: $500-1500/month for 3-broker production cluster
MongoDB Setup
- Version: 7.0.x stable
- Minimum Resources: 3-node replica set, 8-16GB RAM each
- Critical Settings (server-side vs driver-side split shown below):
  net.maxIncomingConnections: 65536
  maxPoolSize: 200
  minPoolSize: 50
- Breaking Point: Default 10-connection pool causes timeouts under any real load
- Cost: $800-2000/month for production replica set
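These settings live in two different places: net.maxIncomingConnections is a server-side mongod setting, while maxPoolSize and minPoolSize are driver-side options set per client (the connection-string form appears later under Connection Pool Exhaustion). A minimal sketch of the server side in mongod.conf:

```yaml
# mongod.conf (YAML format): raise the server-side connection ceiling.
# Driver pool sizes are configured separately in each application's connection string.
net:
  maxIncomingConnections: 65536
```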
Kubernetes Infrastructure
- Minimum Requirements: 3 nodes, 16GB RAM, 8 vCPUs each
- Storage: 500GB fast SSDs per node (NVMe preferred)
- Network: 10Gbps, or Kafka becomes the bottleneck for the entire system
- Resource Planning: set requests at 2x expected usage and limits at 1.5x the request (sketch below)
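As a concrete illustration of that planning rule, here is a sketch of requests and limits for a hypothetical consumer service that typically uses about 1 GiB of memory and half a CPU:

```yaml
# Sketch only: requests at ~2x expected usage, limits at ~1.5x the request.
resources:
  requests:
    cpu: "1"        # expected ~500m
    memory: 2Gi     # expected ~1Gi
  limits:
    cpu: 1500m
    memory: 3Gi
```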
Prometheus Monitoring
- Memory Requirements: 16GB minimum, 32-64GB for serious workloads
- Storage: 1-2GB per day per 1000 active series
- Breaking Point: High cardinality metrics cause OOM kills every 6 hours
- Cost: Storage grows roughly linearly with retention, so long retention windows get expensive fast
Version Compatibility Matrix
Component | Stable Version | Avoid | Compatibility Notes |
---|---|---|---|
Kafka | 3.5.x | 3.6.0 | Consumer group stability issues |
MongoDB | 7.0.x | - | Works with Percona exporter 0.39.0 |
Kubernetes | 1.28+ | - | Required for Prometheus operator features |
Prometheus | 2.47+ | - | Native histogram support needed |
Percona MongoDB Exporter | 0.39.0 | 0.40.0 | Auth issues in newer version |
Implementation Roadmap
Phase 1: Foundation (Week 1-2)
- Deploy Kubernetes cluster with proper resource allocation
- Install Prometheus Operator with RBAC configuration (a chart values sketch follows this list)
- Set up basic alerting (5 critical alerts maximum)
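A hedged starting point, assuming the Prometheus Operator is installed via the kube-prometheus-stack Helm chart (key names follow that chart; adjust retention and memory to the sizing guidance in the specifications above):

```yaml
# Hypothetical values.yaml excerpt for kube-prometheus-stack
prometheus:
  prometheusSpec:
    retention: 15d
    resources:
      requests:
        memory: 16Gi
      limits:
        memory: 32Gi
alertmanager:
  enabled: true
```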
Phase 2: Core Services (Week 3-4)
- Deploy Kafka using the Strimzi operator (a minimal Kafka resource sketch follows this list)
- Configure MongoDB with proper connection pooling
- Implement JMX exporters for metrics collection
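A minimal sketch of the Strimzi Kafka resource carrying the broker settings from the Technical Specifications section. The cluster name matches the production-kafka StatefulSets referenced later; the version is pinned to the 3.5.x line per the compatibility matrix, and the storage class name is an assumption:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: production-kafka
spec:
  kafka:
    version: 3.5.1
    replicas: 3
    listeners:
      - name: tls
        port: 9093
        type: internal
        tls: true
    config:
      num.network.threads: 8
      num.io.threads: 16
      min.insync.replicas: 2
      log.retention.hours: 168
      default.replication.factor: 3
    storage:
      type: persistent-claim
      size: 500Gi
      class: fast-ssd          # assumed StorageClass name
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 100Gi
  entityOperator:
    topicOperator: {}
    userOperator: {}
```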
Phase 3: Observability (Week 5-6)
- Configure distributed tracing with correlation IDs
- Set up Grafana dashboards for key metrics
- Implement synthetic transaction monitoring (one lightweight option is sketched below)
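One lightweight option for the synthetic checks, assuming the Prometheus Operator and a blackbox exporter are already running (the service name and target URL below are hypothetical). Note this only probes an HTTP endpoint; a true business-level synthetic transaction still needs its own test producer/consumer:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: checkout-synthetic
spec:
  prober:
    url: blackbox-exporter.monitoring.svc:9115   # assumed blackbox-exporter service
  module: http_2xx
  targets:
    staticConfig:
      static:
        - https://api.example.com/health/checkout   # hypothetical synthetic endpoint
```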
Critical Failure Modes
High-Impact Failures
Kafka Consumer Lag Spikes
- Cause: JVM garbage collection pauses (2+ seconds)
- Detection: kafka.consumer.fetch-manager.records-lag > 10000
- Solution: Use G1GC with -XX:MaxGCPauseMillis=200 (a Strimzi jvmOptions sketch follows this list)
- Impact: Orders disappear, users get charged without confirmation
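If the brokers run under Strimzi (as in Phase 2), the GC settings can be applied through the Kafka resource's jvmOptions; the heap sizes below are assumptions chosen to match the 4-8GB-per-broker sizing:

```yaml
# Excerpt of the Strimzi Kafka resource: spec.kafka.jvmOptions
spec:
  kafka:
    jvmOptions:
      "-Xms": 4g
      "-Xmx": 4g
      "-XX":
        UseG1GC: "true"
        MaxGCPauseMillis: "200"
```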
MongoDB Primary Elections
- Cause: Network instability, high CPU pressure, replication lag
- Detection: Frequent primary changes in replica set status
- Solution: Increase the replica set's electionTimeoutMillis from the 10-second default to 30 seconds
- Impact: Write failures during election periods
Prometheus Memory Exhaustion
- Cause: Unbounded metric labels (user IDs, transaction IDs)
- Detection: OOM kills every few hours
- Solution: Use recording rules and drop high-cardinality labels at scrape time (see the sketch below)
- Impact: Complete monitoring blindness
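A sketch of dropping known-bad labels at scrape time with the Prometheus Operator; the service name and the label names (user_id, transaction_id) are assumptions about what your applications expose:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: order-service            # hypothetical application
spec:
  selector:
    matchLabels:
      app: order-service
  endpoints:
    - port: metrics
      metricRelabelings:
        - action: labeldrop
          regex: user_id|transaction_id   # drop unbounded labels before ingestion
```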
Medium-Impact Issues
Kafka Log Retention Overflow
- Cause: Default 7-day retention with high-throughput topics
- Impact: 500GB+ daily storage costs
- Solution: Topic-specific retention (24-48 hours for high-volume topics); see the sketch below
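With Strimzi, per-topic retention can be declared on the KafkaTopic resource; the topic name and partition count here are placeholders:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: order-events                 # hypothetical high-volume topic
  labels:
    strimzi.io/cluster: production-kafka
spec:
  partitions: 12
  replicas: 3
  config:
    retention.ms: 86400000           # 24 hours instead of the 7-day broker default
```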
Connection Pool Exhaustion
- Cause: MongoDB default pool size (10 connections)
- Impact: Service timeouts under load
- Solution: Set maxPoolSize=200 and minPoolSize=50 in the client connection string (see the sketch below)
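The pool sizes are driver-side options, typically passed through the connection string. A sketch for an application Deployment; host names, database, and credentials are placeholders:

```yaml
# Excerpt from an application Deployment spec
env:
  - name: MONGODB_URI
    value: "mongodb://app_user:CHANGE_ME@mongodb-0.mongodb-svc:27017,mongodb-1.mongodb-svc:27017,mongodb-2.mongodb-svc:27017/orders?replicaSet=rs0&maxPoolSize=200&minPoolSize=50"
```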
Resource Planning
Production Deployment Costs (Monthly)
Component | Basic Setup | Enterprise Scale |
---|---|---|
EKS Cluster | $73 | $73 |
Worker Nodes (3x m5.2xlarge) | $840 | $2520 (9 nodes) |
EBS Storage (3TB) | $240 | $800 |
Load Balancers | $64 | $128 |
Data Transfer | $50 | $200 |
Total | $1267 | $3721 |
Performance Thresholds
Metric | Warning | Critical | Impact |
---|---|---|---|
Kafka Consumer Lag | 5000 messages | 10000 messages | Event processing delays |
MongoDB Connections | 150 active | 190 active | Service timeouts |
Prometheus Memory | 24GB | 30GB | Query timeouts |
Kubernetes Node Memory | 80% | 90% | Pod evictions |
Operational Procedures
Troubleshooting Sequence (3AM Failures)
Service Health Check
kubectl get pods -A
kubectl get nodes
Event Flow Verification
- Check Kafka consumer lag: kafka.consumer.fetch-manager.records-lag
- Verify MongoDB connectivity: mongodb_connections_current
- Validate end-to-end with synthetic transactions
Infrastructure Status
- Network connectivity between services
- Storage capacity and IOPS
- DNS resolution for service discovery
Emergency Recovery Procedures
Complete System Recovery Order
1. Stop all applications
2. Scale down Kafka Connect workers
3. Scale down Kafka brokers to 0, then back to 3
4. Restart MongoDB replica set if needed
5. Restart applications with gradual traffic increase
Kafka Cluster Restart
kubectl scale statefulset production-kafka-kafka --replicas=0
kubectl scale statefulset production-kafka-kafka --replicas=3
MongoDB Replica Set Reset
// Connect to any healthy member, pull the current config, and force-apply it
config = rs.conf()
rs.reconfig(config, {force: true})
Prometheus Storage Cleanup
# Nuclear option: wipes all stored metrics; Prometheus restarts with an empty TSDB
kubectl exec prometheus-0 -- rm -rf /prometheus/data/*
kubectl delete pod prometheus-0
Monitoring Strategy
Essential Alerts (Maximum 5)
Kafka Consumer Lag > 10,000 messages
- Indicates event processing backup
- Usually caused by GC pauses or service failures (an example rule is sketched after this list)
MongoDB Replica Set Degraded
- Primary election in progress or secondary failure
- Risk of data consistency issues
Service Error Rate > 5%
- User-facing failures requiring immediate attention
- Typically network or configuration issues
Kubernetes Node NotReady
- Infrastructure failure affecting pod scheduling
- Can cascade to service availability issues
Prometheus Out of Disk Space
- Monitoring system failure imminent
- Blindness to all other issues
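As an example of how the first alert might be expressed, assuming consumer lag is exposed by kafka_exporter as kafka_consumergroup_lag (adjust the metric name to whatever your exporter actually emits):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-critical-alerts
spec:
  groups:
    - name: kafka.critical
      rules:
        - alert: KafkaConsumerLagHigh
          expr: sum by (consumergroup, topic) (kafka_consumergroup_lag) > 10000
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Consumer group {{ $labels.consumergroup }} lag above 10k on {{ $labels.topic }}"
```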
Key Performance Indicators
Event Processing Health
# End-to-end latency
kafka_event_processing_duration_seconds_p99
# Message throughput
rate(kafka_server_brokertopicmetrics_messagesin_total[5m])
# Error rate across services
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
Resource Utilization
# Memory pressure
container_memory_usage_bytes / container_spec_memory_limit_bytes
# Disk usage growth
increase(prometheus_tsdb_symbol_table_size_bytes[1h])
# Network saturation
rate(container_network_receive_bytes_total[5m])
Security Considerations
Network Policies
- Allow Kafka inter-broker communication (ports 9092, 9093)
- Permit Prometheus scraping from all exporters
- Restrict MongoDB access to application services only (see the NetworkPolicy sketch after this list)
- Enable TLS for all inter-service communication
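A sketch of the MongoDB restriction as a NetworkPolicy; the pod labels here are assumptions about how the workloads are labeled:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: mongodb-allow-app-only
spec:
  podSelector:
    matchLabels:
      app: mongodb                     # assumed MongoDB pod label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              mongodb-client: "true"   # assumed label on application pods
      ports:
        - protocol: TCP
          port: 27017
```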
Authentication Configuration
- Use SCRAM-SHA-256 for MongoDB authentication
- Implement RBAC for Kubernetes service accounts
- Configure TLS certificates for Kafka client connections
- Enable authentication for Prometheus scraping endpoints
Migration Strategy
From Monolithic Monitoring
- Phase 1: Parallel deployment with existing monitoring
- Phase 2: Migrate non-critical services first
- Phase 3: Gradual migration of business-critical events
- Phase 4: Decommission legacy monitoring after validation
Rollback Planning
- Maintain legacy monitoring for 30 days post-migration
- Document service dependencies and rollback procedures
- Test rollback scenarios in staging environment
- Prepare automated rollback scripts for critical failures
Success Metrics
Technical Metrics
- Mean Time to Detection (MTTD) < 30 seconds for critical issues
- Mean Time to Recovery (MTTR) < 15 minutes for service failures
- Event processing latency p99 < 5 seconds
- System availability > 99.9% excluding planned maintenance
Business Metrics
- Reduced debugging time from hours to minutes
- Proactive issue detection before customer impact
- Decreased on-call escalations by 70%
- Improved customer satisfaction scores
Common Pitfalls
Configuration Errors
- High Cardinality Metrics: Avoid user IDs or transaction IDs as Prometheus labels
- Insufficient Resource Limits: Set memory limits to 1.5-2x typical usage
- Default Connection Pools: MongoDB defaults are inadequate for production
- Log Retention: Kafka's 7-day default fills disks rapidly
Operational Mistakes
- Ignoring GC Tuning: Java services need G1GC configuration for stable performance
- Missing Correlation IDs: Essential for tracing events across service boundaries
- Inadequate Testing: Synthetic transactions required for end-to-end validation
- Alert Fatigue: Start with 5 critical alerts, add gradually
Scaling Challenges
- Prometheus Federation: Complex setup, breaks easily with high cardinality
- Kafka Partition Rebalancing: Manual intervention required for optimal distribution
- MongoDB Sharding: Introduces significant operational complexity
- Cost Explosion: Storage and compute costs scale non-linearly with usage
Recommended Resources
Essential Documentation
- Kafka Documentation - Configuration section critical for production
- MongoDB Manual - Focus on indexing and replica sets
- Prometheus Best Practices - Prevents cardinality disasters
- Kubernetes Networking - Service discovery configuration
Operational Tools
- Strimzi Kafka Operator - Best Kafka deployment method for Kubernetes
- MongoDB Kubernetes Operator - Automated MongoDB management
- KEDA - Event-driven autoscaling for Kubernetes workloads
- Jaeger - Distributed tracing for event flows
This implementation provides production-grade observability for event-driven architectures while avoiding common configuration pitfalls that cause monitoring failures and operational nightmares.
Useful Links for Further Investigation
Resources That Actually Help (And Some That Will Waste Your Time)
Link | Description |
---|---|
**Kafka Documentation** | Actually readable, which is rare for Apache docs. The configuration section will save your ass. |
**Confluent Platform** | Great docs, but they hook you with free stuff then charge enterprise prices. |
**Strimzi Kafka Operator** | Best way to run Kafka on K8s. Wish I'd found this before trying to do it manually. |
**MongoDB Manual** | Decent docs, but overly academic. The indexing section is essential reading. |
**MongoDB Kubernetes Operator** | Better than rolling your own, but RBAC setup is a pain. |
**MongoDB Change Streams** | Great feature but can flood your Kafka cluster if you don't filter properly. |
**Kubernetes Documentation** | Complete but overwhelming. The networking section is particularly brutal. |
**Prometheus Operator** | Works great when configured right, but debugging issues is a nightmare. |
**KEDA Documentation** | Event-driven autoscaling that actually works with Kafka consumer lag. |
**Prometheus Documentation** | Good docs but they assume you have unlimited resources. |
**Grafana Documentation** | Solid documentation for building dashboards. |
**AlertManager** | Alert routing system that's surprisingly complex to configure right. |
**JMX Exporter** | Turns JMX metrics into Prometheus format, but can create cardinality issues. |
**Kafka Exporter** | Often works better than JMX exporter for basic Kafka metrics. |
**Kafka Connect Documentation** | For streaming data between systems. |
**Percona MongoDB Exporter** | Solid MongoDB exporter, though some versions have auth issues. |
**MongoDB Grafana Dashboards** | Pre-built dashboards that are decent starting points. |
**MongoDB Atlas Monitoring** | Built-in monitoring if you're using Atlas. |
**kube-state-metrics** | Essential for K8s monitoring, but will generate thousands of high-cardinality metrics and murder your Prometheus server. |
**Node Exporter** | Actually works as advertised, which is shocking in this ecosystem. |
**cAdvisor** | Container metrics collector that Google built and uses, so at least they have skin in the game. |
**Event-Driven Architecture Best Practices** | Comprehensive guide to EDA patterns and implementation |
**Microservices Observability** | Observability strategies for distributed systems |
**Event Sourcing Pattern** | Using events as the source of truth in system design |
**Prometheus Best Practices** | Naming conventions and operational best practices |
**Observability Engineering** | O'Reilly book on modern observability practices |
**SRE Book - Monitoring** | Google's approach to monitoring distributed systems |
**Kubernetes Security Best Practices** | Security hardening for production clusters |
**Kubernetes Networking** | Network configuration and service discovery |
**Pod Security Standards** | Security policies for pod specifications |
**Prometheus Community Helm Charts** | Production-ready Prometheus stack deployments |
**Bitnami Kafka Helm Chart** | Kafka deployment with monitoring enabled |
**MongoDB Community Operator** | Kubernetes operator for MongoDB deployment |
**Kafka JMX Metrics Configuration** | Complete JMX exporter configuration for Kafka |
**MongoDB Monitoring Stack** | Complete monitoring stack for MongoDB |
**Event-Driven Microservices Example** | Reference implementation of event-driven architecture |
**KubeCon + CloudNativeCon** | Premier Kubernetes and cloud-native conference |
**Kafka Summit** | Apache Kafka community conference |
**MongoDB World** | MongoDB community and developer conference |
**Confluent Training** | Kafka developer and administrator certification |
**MongoDB University** | Free MongoDB courses and certification |
**Linux Foundation Kubernetes Training** | Expensive but worth it if your company pays, otherwise just break things in production until you learn |
**Prometheus Training** | Solid training that actually teaches you how to avoid the cardinality disasters that will ruin your life |
**OpenTelemetry** | Observability framework for cloud-native software |
**Jaeger** | Distributed tracing system for microservices |
**Fluentd** | Data collector for unified logging layer |
**Chaos Mesh** | Cloud-native chaos engineering platform |
**Confluent Platform** | Enterprise Kafka that costs more than your car but actually works in production |
**MongoDB Atlas** | Expensive but saves you from managing MongoDB replica set elections at 3am |
**Google Kubernetes Engine (GKE)** | Most reliable managed K8s service, but Google will shut it down in 5 years anyway |
**Amazon Managed Service for Prometheus** | AWS trying to monetize Prometheus with their usual proprietary lock-in bullshit |
**Datadog** | Expensive but polished observability platform that will bankrupt your startup but save your sanity |
**New Relic** | APM platform that's great until you realize you're paying $500/month for metrics you could get for free |
**Grafana Cloud** | Managed Grafana that costs less than building your own monitoring infrastructure (barely) |
**Splunk** | Enterprise logging platform that costs more than a house but can search through petabytes of logs in seconds |