Prometheus + Grafana + Jaeger: Production Observability Stack
Executive Summary
Purpose: Integrated observability stack for microservices debugging and monitoring
Implementation Time: 2-4 weeks for proper production deployment
Total Cost: $10K-50K annually (infrastructure only, no vendor surprises)
Operational Impact: Reduces debugging time from hours to minutes during outages
Critical Production Requirements
Resource Requirements (Actual Usage)
Memory Requirements (Critical - Plan Accordingly):
- Prometheus: 2GB RAM per 1 million active series
- Jaeger: 500MB RAM per collector instance
- Grafana: 512MB basic, 2GB+ for complex dashboards
Storage Reality Check:
- Metrics: 1-3 bytes per sample (manageable)
- Traces: 50KB average per trace (10-100x more storage than metrics)
- Budget 10x more storage for traces vs metrics
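As a rough, illustrative sizing sketch using the figures above: 1 million active series needs about 2GB of Prometheus RAM and, scraped every 15s, produces roughly 5.8 billion samples per day, which is on the order of 6-17GB/day of metric storage at 1-3 bytes per sample; 100,000 traces per day at ~50KB each adds another ~5GB/day before sampling.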
Performance Overhead:
- Prometheus: 1-3% CPU overhead for collection
- Jaeger: 2-5% CPU with proper sampling, 15%+ without sampling
- Application instrumentation: 1-2ms per operation
Scaling Thresholds and Breaking Points
Small Setup (< 10 services):
- 4GB RAM, 2 CPU cores, 100GB SSD
- Single instance deployment acceptable
Medium Setup (10-50 services):
- 16GB RAM, 4 CPU cores, 500GB SSD
- Requires clustering for availability
Large Setup (50+ services):
- 32GB+ RAM, 8+ CPU cores, 2TB+ SSD
- Mandatory federation/clustering, external storage
Critical Breaking Point: the Jaeger UI becomes unusable on traces with 1000+ spans, making debugging large distributed transactions impossible
Implementation-Critical Configuration
Prometheus Production Settings
Essential Settings That Prevent Failures:
```yaml
# prometheus.yml
global:
  scrape_interval: 15s        # Don't go lower - kills disk performance
  evaluation_interval: 15s
```

```
# Critical command-line flags for production (flags, not prometheus.yml keys)
--storage.tsdb.retention.time=30d
--storage.tsdb.retention.size=50GB
--query.max-concurrency=10    # Prevents query-of-death scenarios
```
Common Configuration Failures:
- Port 9090 conflicts (use `--web.listen-address=:9091`)
- File descriptor limits cause random scrape failures (raise them with `ulimit -n 65536`; a container sketch follows this list)
- NTP drift breaks time-series queries with cryptic error messages
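If Prometheus runs under Docker Compose, one way to raise the file-descriptor limit is the service's ulimits block; this is a minimal sketch, and the service name, version tag, and ports are placeholders for your environment:

```yaml
services:
  prometheus:
    image: prom/prometheus:v2.53.0     # pin whatever version you actually run
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --web.listen-address=:9091     # sidestep the common 9090 conflict
      - --storage.tsdb.retention.time=30d
    ulimits:
      nofile:
        soft: 65536
        hard: 65536
```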
Jaeger V2 Critical Configuration
V2 vs V1 Migration Reality:
- V2 (November 2024): Built on OpenTelemetry, better performance
- Migration complexity: Moderate (configuration format changed)
- V1 compatibility issues with other tools resolved in V2
Sampling Strategy (Storage Cost Control):
```yaml
processors:
  # Note: pick one strategy per pipeline - head-based probabilistic sampling
  # drops traces before tail_sampling can inspect them, so these are usually
  # alternatives (tail_sampling can carry its own probabilistic policy).
  probabilistic_sampler:
    sampling_percentage: 5.0                   # 5% base sampling
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}   # Always keep errors
      - name: slow_requests
        type: latency
        latency: {threshold_ms: 2000}          # Always keep slow requests
```
Storage Backend Selection (Choose Carefully):
- Elasticsearch: Fast but unpredictable memory usage, expensive scaling
- Cassandra: Handles massive writes but requires specialized expertise
- ClickHouse: Best performance for new deployments, limited documentation
Critical Failure Modes and Solutions
Trace Collection Failures
Symptom: Traces not appearing in Grafana
Root Causes:
- Jaeger collector connectivity: `curl http://jaeger-query:16686/api/services` returns connection refused
- Wrong OTLP endpoint configuration in applications
- Sampling set too aggressively (traces being dropped)
Solution Path:
- Verify collector health endpoints
- Check application OTLP exporter configuration
- Temporarily increase sampling rate for debugging
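For the first two steps, if you run a standalone OpenTelemetry Collector in front of Jaeger, a minimal config sketch like this makes the health endpoint and the OTLP path explicit (the endpoints and the Jaeger address are assumptions for your environment):

```yaml
extensions:
  health_check:
    endpoint: 0.0.0.0:13133        # curl this to confirm the collector is alive

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317     # applications export OTLP traces here

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317   # Jaeger v2 ingests OTLP natively
    tls:
      insecure: true                  # placeholder; use TLS in production

service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger]
```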
Metrics Collection Failures
Symptom: Missing metrics in Prometheus
Root Causes:
- Scrape targets failing with `context deadline exceeded`
- `/metrics` endpoint returning HTML errors instead of Prometheus exposition format
- Port/path configuration mismatches
Solution Path:
- Check Prometheus UI → Status → Targets for red entries
- Validate the `/metrics` endpoint format manually
- Review scrape configuration for correct ports and paths (see the example below)
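A minimal scrape-job sketch showing the fields that are usually misconfigured; the job name, port, and path are hypothetical:

```yaml
scrape_configs:
  - job_name: payments-service          # hypothetical service
    metrics_path: /metrics              # must match what the app actually exposes
    scrape_interval: 15s
    scrape_timeout: 10s                 # raise if targets fail with "context deadline exceeded"
    static_configs:
      - targets: ["payments-service:8080"]   # host:port of the metrics endpoint
```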
Query Performance Degradation
Breaking Points:
- Wide time ranges on high-cardinality metrics
- Regex operations on trace attributes
- Complex metric-trace correlation queries
Performance Optimization:
- Use recording rules for expensive PromQL aggregations (see the sketch after this list)
- Implement proper indexing in Jaeger storage backend
- Configure Grafana query caching with appropriate TTL
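A recording-rule sketch that pre-computes a per-service p95 latency so dashboards query the cheap recorded series instead of re-aggregating raw histograms; the metric and label names are assumptions:

```yaml
groups:
  - name: latency-recording-rules
    interval: 1m
    rules:
      - record: job:http_request_duration_seconds:p95_5m
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
```

The rules file must be listed under `rule_files` in prometheus.yml; dashboards then query the recorded series directly.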
Operational Intelligence
Real-World Implementation Lessons
Time Investment Reality:
- Simple setup: 1 week (basic monitoring)
- Production-ready: 3-4 weeks (proper scaling, HA, integration)
- Team training: Additional 2-4 weeks for effective usage
Hidden Complexity Factors:
- Trace context propagation between services (often broken by default; see the header example after this list)
- Correlation ID management across service boundaries
- Alert tuning to avoid noise while catching real issues
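Context propagation ultimately rides on request headers; if any hop drops them, traces silently fragment into disconnected pieces. The W3C Trace Context header that OpenTelemetry propagates by default looks like this (values are illustrative):

```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```

Proxies, message queues, and legacy frameworks that fail to forward this header (or that use a different scheme such as B3) are the usual culprits.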
Vendor Cost Comparison (Annual)
| Solution | Implementation | Annual Cost | Vendor Lock-in | Best For |
|---|---|---|---|---|
| Prometheus/Grafana/Jaeger | 2-4 weeks | $10K-50K | None | Teams wanting data ownership |
| DataDog APM | Days | $60K-400K+ | High | Unlimited budget teams |
| New Relic | Days | $35K-250K+ | High | Simple billing preference |
| Elastic Stack | 4-8 weeks | $25K-180K | Medium | Elasticsearch expertise |
Team Structure Requirements
Small Teams (< 20 engineers): Part-time person can manage
Medium Teams (20-100 engineers): Dedicated SRE + team champions
Large Teams (100+ engineers): Full observability team required
Critical Skills Needed:
- PromQL query optimization
- Distributed systems debugging
- Time-series database concepts
- Kubernetes/container orchestration
Production Deployment Patterns
High Availability Architecture
Prometheus HA:
- Requires federation or Thanos for true HA
- Single point of failure otherwise
Jaeger HA:
- Deploy collectors as DaemonSet in Kubernetes
- Use shared storage backend (Elasticsearch, Cassandra, ClickHouse)
Grafana HA:
- Stateless deployment behind load balancer
- External database for dashboard storage
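One way to wire the external database, sketched as grafana.ini settings in the nested form the Grafana Helm chart accepts; the hostname and credentials are placeholders:

```yaml
grafana.ini:
  database:
    type: postgres
    host: grafana-db.example.internal:5432
    name: grafana
    user: grafana
    # password should come from a secret, not plain values
  server:
    root_url: https://grafana.example.internal
```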
Kubernetes Deployment Tools
Recommended Helm Charts:
- `kube-prometheus-stack`: Integrated Prometheus + Grafana + Alertmanager (a minimal values override is sketched below)
- `jaeger-operator`: Kubernetes-native Jaeger management
- OpenTelemetry Collector charts for trace collection
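A minimal values override for kube-prometheus-stack that applies the retention and sizing guidance above; this is a sketch assuming the chart's prometheusSpec passthrough, and the sizes are placeholders for a medium-tier setup:

```yaml
prometheus:
  prometheusSpec:
    retention: 30d
    retentionSize: 50GB
    resources:
      requests:
        cpu: "2"
        memory: 8Gi
      limits:
        memory: 16Gi
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 500Gi
grafana:
  resources:
    requests:
      memory: 512Mi
```

Assuming the prometheus-community Helm repo is added, this installs with `helm install monitoring prometheus-community/kube-prometheus-stack -f values.yaml`.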
Troubleshooting Decision Trees
Trace-to-Metrics Correlation Issues
Problem: Cannot link metrics spikes to trace samples
Solution: Configure exemplars so metric samples carry trace IDs. Three pieces have to line up: the application's metrics library must attach the current trace_id (and optionally span_id) as an exemplar when it records a histogram observation and expose it via the OpenMetrics format; Prometheus must run with `--enable-feature=exemplar-storage`; and the Grafana Prometheus data source must map the exemplar's trace_id to your tracing data source (see the provisioning sketch below).
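A Grafana data source provisioning sketch for the last step; the UIDs and URLs are placeholders for your environment:

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    url: http://prometheus:9090
    jsonData:
      exemplarTraceIdDestinations:
        - name: trace_id          # exemplar label holding the trace ID
          datasourceUid: jaeger   # must match the Jaeger data source UID below
  - name: Jaeger
    type: jaeger
    uid: jaeger
    url: http://jaeger-query:16686
```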
Resource Exhaustion Patterns
Memory Exhaustion Indicators:
- Prometheus: Query timeouts, OOM kills
- Jaeger: Collector restarts, trace drops
- Grafana: Dashboard loading failures
Immediate Actions:
- Implement resource limits in Kubernetes (a sketch follows this list)
- Configure retention policies
- Enable trace sampling if not already active
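A container-resources sketch for Prometheus sized for the medium tier above; the values are illustrative and should be tuned against observed usage:

```yaml
apiVersion: v1
kind: Pod              # in practice this lives in a StatefulSet/Deployment spec
metadata:
  name: prometheus
spec:
  containers:
    - name: prometheus
      image: prom/prometheus:v2.53.0
      resources:
        requests:
          cpu: "2"
          memory: 8Gi
        limits:
          memory: 16Gi   # hard ceiling so one bad query can't OOM the node
```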
Compliance and Security Considerations
Data Privacy (GDPR/PII)
Risk Areas:
- Traces can contain PII in span attributes
- Request payloads in trace data
- User identifiers in metric labels
Mitigation Strategies:
```yaml
# Use the OpenTelemetry Collector redaction processor
processors:
  redaction:
    allow_all_keys: true          # keep attribute keys, mask matching values
    # blocked_values entries are regular expressions matched against attribute
    # values; the literals below are placeholders - use real patterns in production
    blocked_values: ["email", "ssn", "phone"]
```
Retention Policies:
- Implement automatic data deletion after compliance periods
- Use geographic data residency controls
- Implement role-based access to sensitive trace data
Success Metrics and KPIs
Implementation Success Indicators
Technical Metrics:
- Mean Time to Detection (MTTD): < 5 minutes for critical issues
- Mean Time to Resolution (MTTR): 50% reduction from baseline
- False positive alert rate: < 10%
Operational Metrics:
- Developer adoption: > 80% of services instrumented
- Dashboard usage: Active usage by all teams
- Query performance: 95th percentile < 5 seconds
Cost Optimization Strategies
Storage Optimization:
- Implement tiered retention (7 days hot, 30 days warm, 6+ months cold)
- Use sampling strategies to control trace volume
- Configure metric retention based on business requirements
Performance Optimization:
- Pre-compute expensive queries with recording rules
- Use read replicas for popular dashboards
- Implement caching for frequently accessed data
Critical Resources and Documentation
Essential References
- Prometheus: Official docs at prometheus.io/docs (configuration, querying, best practices)
- Jaeger: jaegertracing.io/docs (architecture, deployment, v2 migration)
- Grafana: grafana.com/docs (data sources, dashboards, alerting, exemplars)
- OpenTelemetry: opentelemetry.io/docs (SDKs, collector, instrumentation)
Production Examples
- Netflix: Processes 2+ trillion spans daily using this stack
- Uber: Billions of requests traced for ride debugging
- Shopify: Performance optimization through metric-trace correlation
- GitHub: Platform monitoring with Prometheus at scale
Community Support
- Prometheus: Google Groups forum (prometheus-users)
- Grafana: Official community forums (community.grafana.com)
- Jaeger: GitHub Discussions for troubleshooting
- OpenTelemetry: CNCF Slack (#opentelemetry channel)
Implementation Warnings
Will Break If:
- Trace sampling disabled (storage and performance death)
- File descriptor limits not increased (random connection failures)
- Time synchronization issues (cryptic query failures)
- Port conflicts ignored (Docker Desktop commonly conflicts)
- Memory limits not set (OOM kills during peak traffic)
Common Misconceptions:
- "Jaeger v1 and v2 are compatible" (configuration format completely changed)
- "100% trace sampling is fine for production" (will bankrupt storage budget)
- "Metrics and traces automatically correlate" (requires explicit configuration)
- "Default configurations work in production" (almost always need tuning)
This stack delivers complete observability when implemented correctly, but requires understanding these operational realities to avoid common pitfalls that waste weeks of debugging time.
Useful Links for Further Investigation
Resources That Don't Suck for Production Implementation
Link | Description |
---|---|
Prometheus Documentation | The official documentation for Prometheus, providing useful resources for getting started, configuring, querying, and understanding best practices to avoid common implementation mistakes. |
getting started guide | A comprehensive guide to help users quickly set up and begin using Prometheus, covering initial setup and fundamental concepts. |
configuration | Detailed documentation on configuring Prometheus, including how to set up targets, rules, and various operational parameters for effective monitoring. |
querying | An introduction to Prometheus Query Language (PromQL) basics, essential for extracting meaningful insights and data from your metrics. |
best practices | A collection of recommended practices for using Prometheus effectively in production, designed to help users avoid common pitfalls and optimize their monitoring setup. |
Jaeger Documentation | The official documentation for Jaeger, providing essential information on its architecture, deployment, client libraries, and critical migration guides for upgrading versions. |
architecture overview | An essential overview of Jaeger's distributed tracing architecture, explaining its components and how they interact to provide end-to-end visibility. |
deployment guide | A comprehensive guide for deploying Jaeger in various environments, covering different deployment strategies and configuration options for production use. |
client libraries | Documentation for Jaeger's client libraries, detailing how to instrument applications in different programming languages to send trace data to Jaeger. |
v2 migration guide | A critical guide for users upgrading from Jaeger v1 to v2, outlining necessary changes and considerations to ensure a smooth and successful migration. |
Grafana Documentation | The official documentation for Grafana, covering essential topics like data source configuration, dashboard creation, alerting, and the concept of exemplars for trace-to-metrics correlation. |
data sources configuration | Detailed instructions on how to configure various data sources in Grafana, enabling connection to different monitoring systems and databases for visualization. |
dashboard creation | A guide to creating and customizing dashboards in Grafana, allowing users to visualize their metrics and logs effectively with various panel types. |
alerting | Documentation on setting up and managing alerts in Grafana, enabling users to be notified of critical events and anomalies in their systems. |
exemplars documentation | Explanation of Grafana's exemplars feature, which facilitates the correlation between metrics and traces, providing deeper insights into system performance issues. |
OpenTelemetry Documentation | The official documentation for OpenTelemetry, the modern instrumentation standard, covering core concepts, language-specific SDKs, and collector configuration, representing the future of observability. |
concepts | Fundamental concepts of OpenTelemetry, explaining the core principles behind distributed tracing, metrics, and logs, essential for understanding the standard. |
language-specific SDKs | Documentation for OpenTelemetry's Software Development Kits (SDKs) tailored for various programming languages, enabling easy instrumentation of applications. |
collector configuration | Detailed guide on configuring the OpenTelemetry Collector, a powerful component for processing, aggregating, and exporting telemetry data from various sources. |
Prometheus Community Helm Charts | A collection of production-ready Helm charts for deploying Prometheus and related components on Kubernetes, serving as an excellent starting point for robust monitoring setups. |
kube-prometheus-stack | A comprehensive Helm chart that bundles Prometheus, Grafana, and AlertManager, providing sensible defaults for a complete and integrated Kubernetes monitoring solution. |
Jaeger Operator | The Kubernetes-native operator for deploying and managing Jaeger, offering streamlined installation, configuration, and operational management of distributed tracing infrastructure. |
examples directory | A collection of practical examples demonstrating common deployment patterns and configurations for the Jaeger Operator, useful for quick setup and learning. |
storage backends | Documentation detailing the various storage backends supported by the Jaeger Operator, including configuration options for different persistent storage solutions. |
scaling strategies | Information on scaling strategies for Jaeger deployments managed by the Operator, providing guidance on how to handle increased tracing loads efficiently. |
Grafana Helm Charts | Official Helm charts for deploying Grafana on Kubernetes, providing robust solutions for persistence, data source provisioning, and automated dashboard management using the sidecar pattern. |
grafana chart | The official Helm chart for Grafana, designed to manage persistence, provision data sources, and import dashboards, supporting automated management via the sidecar pattern. |
OpenTelemetry Helm Charts | Official Helm charts for deploying the OpenTelemetry Collector in Kubernetes, supporting both agent and gateway modes with practical examples for various configurations. |
agent and gateway modes | Documentation and examples for deploying the OpenTelemetry Collector in Kubernetes, illustrating configurations for both agent and gateway modes to suit different telemetry collection needs. |
PromQL Tutorial by Robust Perception | Brian Brazil's blog, a premier resource for Prometheus insights and PromQL tutorials, offering real expertise directly from one of Prometheus's creators. |
Understanding machine roles | An insightful blog post explaining the concept of machine roles in Prometheus monitoring, crucial for organizing and querying metrics effectively in complex environments. |
When to use the Pushgateway | A detailed article by Brian Brazil on the appropriate use cases for Prometheus Pushgateway, clarifying when and why this component should be employed. |
Common query patterns | Exploration of common and effective PromQL query patterns, providing practical examples and guidance for extracting valuable insights from Prometheus metrics. |
Grafana Tutorials and Training | Official Grafana training materials and tutorials, offering high-quality, genuinely useful content for learning Grafana fundamentals and advanced features, a rarity in vendor education. |
Grafana fundamentals course | A genuinely useful official course covering the fundamental concepts and operations of Grafana, ideal for beginners looking to master dashboard creation and data visualization. |
Grafana University | Grafana University offers free, high-quality courses that provide in-depth knowledge and practical skills for using Grafana effectively, standing out among vendor education platforms. |
Distributed Tracing Workshop | A hands-on distributed tracing workshop using real microservices examples, providing practical experience with Jaeger dashboard usage and trace analysis, excellent for team training. |
OpenTelemetry Community | The official OpenTelemetry community hub, known for its helpful Slack workspace where users can find real answers to complex problems from maintainers and experienced users. |
Netflix's Observability at Scale | A detailed case study on how Netflix achieved observability at massive scale, offering key insights into sampling strategies, storage optimization, and organizational patterns for processing trillions of spans daily. |
Uber's Jaeger: Evolution of Distributed Tracing | The foundational case study detailing Uber's journey in building and scaling Jaeger for distributed tracing, processing over 40 billion requests daily, essential for understanding tracing at scale. |
Shopify's Performance Monitoring | A real-world case study demonstrating Shopify's application of metrics and tracing for performance optimization, showcasing correlation techniques between Prometheus and trace data to pinpoint bottlenecks. |
GitHub's Prometheus Usage | An article detailing GitHub's implementation of Prometheus for infrastructure monitoring, addressing scaling challenges, query optimization, and effective operational patterns in a large-scale environment. |
Prometheus Community Docker Images | Official Docker images and example configurations provided by the Prometheus community, offering convenient docker-compose setups ideal for local development and testing environments. |
Jaeger Demo Applications | A collection of Jaeger demo applications like HotROD, illustrating realistic instrumentation patterns to help users understand span relationships, context propagation, and effective error tracking. |
Grafana Dashboard Repository | A repository of community-contributed Grafana dashboards, offering a wide range of options from comprehensive Node Exporter to specific JVM and Traefik overviews, requiring customization for optimal use. |
Node Exporter Full | A comprehensive Grafana dashboard designed for the Node Exporter, providing a detailed overview of system metrics and performance, highly recommended for infrastructure monitoring. |
JVM Overview | A useful Grafana dashboard for monitoring Java Virtual Machine (JVM) metrics, providing insights into memory, garbage collection, and thread activity for Java applications. |
Traefik | A clean and useful Grafana dashboard specifically designed for Traefik proxy, offering clear visualizations of request rates, latency, and other critical load balancer metrics. |
OpenTelemetry Registry | The official OpenTelemetry Registry, a reliable resource for discovering instrumentation libraries and exporters, with crucial maturity status indicators to guide selection and avoid experimental components. |
instrumentation libraries | A curated list of OpenTelemetry instrumentation libraries, categorized by language and component, providing reliable options for integrating telemetry into various applications. |
exporters list | A comprehensive list of OpenTelemetry exporters, detailing various destinations for telemetry data, including popular monitoring systems and storage solutions, highly useful for configuration. |
Prometheus Community Forum | An active, old-school Google Groups forum for the Prometheus community, where maintainers frequently respond to questions, making it a reliable resource for troubleshooting and support. |
Grafana Community Forums | The official Grafana support forum, offering dedicated categories for dashboards, data sources, and alerting, providing better response times for Grafana-specific issues compared to generic platforms. |
dashboards | A dedicated section within the Grafana Community Forums for discussions and support related to creating, customizing, and troubleshooting Grafana dashboards, fostering community knowledge sharing. |
data sources | The Grafana Community Forum category focused on data sources, providing a platform for users to ask questions, share solutions, and get support for connecting Grafana to various data backends. |
alerting | A specific forum category for Grafana alerting, where users can discuss best practices, troubleshoot issues, and seek assistance with configuring and managing alerts effectively. |
Jaeger GitHub Discussions | The official GitHub Discussions for Jaeger, providing a Q&A format for user questions and a valuable resource for finding solutions to common problems by reviewing closed issues. |
closed issues | A searchable archive of closed issues on the Jaeger GitHub repository, offering a rich source of previously encountered problems and their resolved solutions for troubleshooting. |
Cloud Native Computing Foundation Slack | The official CNCF Slack workspace, offering access to project-specific channels like #prometheus, #grafana, #jaeger, and #opentelemetry for real-time help and direct interaction with maintainers and users. |
Prometheus: Up & Running by Brian Brazil | The definitive book on Prometheus by its creator, Brian Brazil, covering architecture, configuration, query optimization, and scaling patterns, essential reading for serious Prometheus users. |
Distributed Tracing in Practice by Austin Parker | A comprehensive guide to distributed tracing concepts and practical implementation strategies by Austin Parker, covering OpenTracing, OpenTelemetry, and effective instrumentation techniques for modern systems. |
Site Reliability Engineering (Google) | Google's free online book on Site Reliability Engineering, providing foundational monitoring principles that are crucial for understanding and implementing modern observability stacks effectively. |
Monitoring Distributed Systems | Chapter 6 from Google's SRE book, offering deep insights into the principles and practices of monitoring distributed systems, highly relevant for modern observability architectures. |
Observability Engineering by Charity Majors | A seminal work by Charity Majors on modern observability practices, extending beyond traditional monitoring to cover cultural and organizational aspects crucial for implementing observability at scale. |
Awesome Prometheus | A curated and regularly updated list of Prometheus resources, including exporters, dashboards, and tools, serving as an excellent starting point for discovering community contributions and enhancing monitoring setups. |
exporters | A section within the Awesome Prometheus list dedicated to various Prometheus exporters, providing a comprehensive collection of tools for exposing metrics from different services and systems. |
dashboards | A curated list of community-contributed Grafana dashboards compatible with Prometheus, offering diverse visualization options for various applications and infrastructure components. |
tools | A collection of useful tools related to Prometheus, including client libraries, alert managers, and other utilities that enhance the Prometheus monitoring ecosystem. |
Grafana Dashboard as Code Examples | A Jsonnet library for programmatically generating Grafana dashboards, enabling standardization of dashboard creation and ensuring consistency across development and operations teams. |
OpenTelemetry Configuration Examples | A repository of real-world OpenTelemetry Collector configurations for common deployment scenarios, including Docker, Kubernetes, and service mesh integrations, providing practical setup guidance. |
Docker | Example configurations for deploying the OpenTelemetry Collector in Docker environments, demonstrating how to set up telemetry collection for containerized applications effectively. |
Kubernetes | The official documentation for the OpenTelemetry Collector, providing comprehensive guidance on its configuration and deployment, including considerations for Kubernetes environments. |
service mesh | Configuration examples for integrating the OpenTelemetry Collector with service mesh environments, showcasing how to capture and process telemetry data from mesh-enabled applications. |