OpenTelemetry + Jaeger + Grafana on Kubernetes: Production Observability Stack
Stack Overview
Core Components:
- OpenTelemetry: Vendor-neutral instrumentation (CNCF project, v1.0+ stable since 2021)
- Jaeger v2: Distributed tracing storage and search (November 2024 release, built on OpenTelemetry Collector)
- Grafana: Visualization and dashboards (trace-to-metrics since 9.1, improved TraceQL in 11.0)
- Kubernetes: Container orchestration with service discovery
Key Value Proposition: Zero vendor lock-in, enterprise-scale performance (Netflix processes 2+ trillion spans daily), and a complete observability stack with no license fees - you still pay in infrastructure and engineering time.
Critical Performance Specifications
Resource Requirements (Production Reality)
Component | Minimum RAM | Typical RAM | CPU | Storage Impact |
---|---|---|---|---|
OpenTelemetry Agent (DaemonSet) | 150MB | 300MB (spikes to 400MB) | 0.1-0.2 cores | N/A |
OpenTelemetry Gateway | 500MB | 1-4GB (can balloon to 8GB) | 1-2 cores | N/A |
Jaeger v2 | 500MB | 1-3GB (storage dependent) | 1-2 cores | Varies by backend |
Grafana | 250MB | 400MB-2GB (dashboard complexity) | 0.5-1 cores | Minimal |
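If you run the collector yourself rather than through the Operator or Helm, the sketch below shows agent resource settings that roughly match the table above. The image tag, namespace, and labels are assumptions - adjust them to whatever you actually deploy.

```yaml
# Sketch: agent-mode collector DaemonSet with explicit resource limits.
# Names, namespace, and image tag are illustrative.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-agent
  namespace: observability
spec:
  selector:
    matchLabels:
      app: otel-agent
  template:
    metadata:
      labels:
        app: otel-agent
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.116.0  # pin the version you actually run
          resources:
            requests:
              memory: 256Mi   # table says 150MB minimum; leave headroom
              cpu: 100m
            limits:
              memory: 512Mi   # typical 300MB, spikes to ~400MB
              cpu: 200m
```

The gateway deployment needs the same treatment with larger numbers (1-4GB memory, 1-2 cores), or it becomes the 8GB balloon described above.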
Performance Overhead
- Normal Operation: 2-5% CPU overhead
- Misconfigured: Up to 50% CPU overhead
- Memory: 50-200MB per collector (can leak to gigabytes)
- Network: Scales with trace volume
Critical Failure Modes
Silent Failures
- Collectors die silently when resource limits are exceeded
- Traces disappear without alerts when collectors fail
- Memory limits kill collectors without visible errors
- Default Helm chart configurations fail under production load
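The fix is to make failure visible: cap memory inside the collector before the kernel OOM-kills it, and expose a health endpoint you can wire to liveness/readiness probes. A minimal sketch, assuming an OTLP pipeline shipping to a Jaeger service called jaeger-collector.observability.svc (an illustrative name):

```yaml
# Sketch: memory_limiter sheds load before the pod gets OOM-killed;
# health_check gives Kubernetes probes something to hit (port 13133).
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
receivers:
  otlp:
    protocols:
      grpc: {}
      http: {}
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 800        # keep below the container memory limit
    spike_limit_mib: 200
  batch: {}
exporters:
  otlp:
    endpoint: jaeger-collector.observability.svc:4317   # illustrative service name
    tls:
      insecure: true      # placeholder; see the security section for mTLS
service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]   # memory_limiter must be first in the chain
      exporters: [otlp]
```

Point the container's liveness and readiness probes at port 13133 so Kubernetes restarts a wedged collector instead of letting traces silently vanish.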
Configuration Hell
- Helm charts assume defaults work (they don't)
- Service mesh configs conflict with OpenTelemetry configs
- Storage backends timeout under production load
- Version 1.2.3 of the OpenTelemetry Operator has a known memory leak in its webhook
Production Breaking Points
- The tracing UI breaks on traces with 1000+ spans, making debugging impossible
- 100% sampling kills production - use 1-10% probabilistic sampling for high-traffic services (see the sampling sketch after this list)
- Poorly configured collectors eat entire CPU cores
- Complex Grafana dashboards can consume 32GB RAM
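For the sampling point above, the quickest relief is head-based sampling in the collector. This fragment extends the pipeline sketched earlier; 5% is an illustrative starting value, not a universal recommendation.

```yaml
# Sketch: head-based probabilistic sampling - drop 95% of traces at the collector.
processors:
  probabilistic_sampler:
    sampling_percentage: 5   # 1-10% is realistic for high-traffic services
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, probabilistic_sampler, batch]
      exporters: [otlp]
```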
Deployment Reality
Time Investment
- Optimistic: Few hours if nothing breaks
- Realistic: 2-3 weeks for production-ready deployment
- Disaster: Several weeks when everything breaks
- Expertise Required: Deep knowledge of Kubernetes, YAML configuration, and distributed systems
What Actually Breaks During Deployment
- Resource limits too low (default charts)
- Storage backend timeouts under load
- Service mesh integration conflicts
- Auto-instrumentation breaks authentication headers
- Version compatibility issues between components
- Network policies blocking component communication
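For the last item, a NetworkPolicy that admits OTLP traffic to the collector looks roughly like this. The labels, namespace, and "allow from any namespace" rule are assumptions - tighten them for multi-tenant clusters.

```yaml
# Sketch: allow OTLP ingress (4317 gRPC, 4318 HTTP) to collector pods.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-otlp-ingress
  namespace: observability
spec:
  podSelector:
    matchLabels:
      app: otel-collector
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector: {}   # any namespace; restrict this in production
      ports:
        - protocol: TCP
          port: 4317
        - protocol: TCP
          port: 4318
```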
Deployment Approaches Comparison
Method | Setup Time | Customization | Production Readiness | Maintenance |
---|---|---|---|---|
OpenTelemetry Operator | Fast | Limited (CRD hell for custom configs) | High | Medium |
Helm Charts | Medium | YAML configuration nightmare | High | High |
Manual Deployment | Slow | Complete control | Highest | Highest |
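If you pick the Operator route, a gateway is a single custom resource. The sketch below assumes the v1beta1 CRD schema (older operator releases use v1alpha1, where config is a string) and illustrative names and limits.

```yaml
# Sketch: Operator-managed gateway collector (v1beta1 schema assumed).
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-gateway
  namespace: observability
spec:
  mode: deployment
  replicas: 2
  resources:
    limits:
      memory: 2Gi
      cpu: "1"
  config:
    receivers:
      otlp:
        protocols:
          grpc: {}
          http: {}
    processors:
      memory_limiter:
        check_interval: 1s
        limit_mib: 1600
      batch: {}
    exporters:
      otlp:
        endpoint: jaeger-collector.observability.svc:4317   # illustrative
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp]
```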
Storage and Cost Reality
Storage Costs (Monthly)
- Budget Range: $500-5000/month depending on scale
- Retention Strategy: Detailed traces 7 days, aggregated metrics 6 months, trends forever
- Storage Backend Costs: ClickHouse (cheapest) < Cassandra < Elasticsearch (most expensive)
- Object Storage: Unlimited but slowest query performance
Sampling Strategy Requirements
- High-traffic services: 1-10% probabilistic sampling
- Error traces: Always retain via tail-based sampling
- Slow requests: Always retain via adaptive sampling
- Volume management: Essential to prevent collector death
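Tail-based sampling is what keeps every error and slow request while dropping the boring bulk. A sketch using the contrib collector's tail_sampling processor, with illustrative thresholds:

```yaml
# Sketch: keep all errors, keep anything slower than 2s, sample 5% of the rest.
processors:
  tail_sampling:
    decision_wait: 10s        # buffer spans this long before deciding per trace
    num_traces: 50000
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow-requests
        type: latency
        latency:
          threshold_ms: 2000
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```

Tail sampling only works if every span of a trace reaches the same collector instance, so it belongs on a gateway fronted by trace-aware load balancing (for example the load-balancing exporter), not on per-node agents.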
Security Implementation
Required Security Measures
- mTLS for OTLP communications (security team requirement)
- Kubernetes network policies (default allow-all is dangerous)
- Grafana RBAC (prevent developers from accessing billing dashboards)
- Data sanitization processors to remove sensitive information
- Cross-cluster service discovery configuration for multi-tenant deployments
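A sketch of the mTLS and sanitization items - TLS with required client certificates on the OTLP receiver, plus an attributes processor that strips or hashes sensitive fields. Certificate paths and attribute keys are placeholders for whatever your security team actually mandates.

```yaml
# Sketch: mTLS on the OTLP receiver and attribute sanitization.
receivers:
  otlp:
    protocols:
      grpc:
        tls:
          cert_file: /certs/collector.crt
          key_file: /certs/collector.key
          client_ca_file: /certs/ca.crt   # requiring client certs = mTLS
processors:
  attributes/sanitize:
    actions:
      - key: http.request.header.authorization
        action: delete                     # never store auth headers
      - key: user.email
        action: hash                       # keep cardinality, drop the PII
```

Remember to add attributes/sanitize to the trace pipeline's processor list, or it does nothing.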
Integration Complexity Matrix
Integration Type | Complexity | Performance Impact | Vendor Lock-in | Use Case |
---|---|---|---|---|
OpenTelemetry + Jaeger + Grafana | Moderate (3 weeks) | Low (2-5%) | None | Complete observability |
Proprietary APM | Low (but expensive) | Medium (5-15%) | High | Budget > time |
ELK Stack + APM | High (YAML nightmare) | Medium (3-10%) | Medium | Elasticsearch masochists |
Cloud Provider Solutions | Low (until customization) | Variable (black box) | High | Cloud-native convenience |
Critical Warnings and Operational Intelligence
What Documentation Doesn't Tell You
- Default configurations will fail under production load
- Collectors require health monitoring or failures go unnoticed
- Memory leaks are common in misconfigured deployments
- Service mesh integration requires understanding both Istio and OpenTelemetry configs
- Storage backends have different reliability characteristics under load
Migration Considerations
- Dual deployment strategy: Run alongside existing APM during transition
- Automatic instrumentation: Minimizes code changes but can break authentication
- Dashboard migration: Gradual transition while maintaining existing tooling
- Multi-cluster deployments: Require gateway aggregation points and centralized storage
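Dual deployment is mostly an exporter problem: fan the same trace pipeline out to Jaeger and your existing APM until you trust the new stack. The APM endpoint and header below are placeholders for your vendor's OTLP intake.

```yaml
# Sketch: ship identical traces to Jaeger and a legacy APM during migration.
exporters:
  otlp/jaeger:
    endpoint: jaeger-collector.observability.svc:4317   # illustrative
    tls:
      insecure: true
  otlphttp/apm:
    endpoint: https://apm.example.com:4318               # placeholder vendor endpoint
    headers:
      api-key: ${env:APM_API_KEY}                        # injected via Secret
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger, otlphttp/apm]
```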
Troubleshooting Hierarchy
1. Check collector logs first (usually the problem)
2. Verify OTLP endpoint reachability
3. Enable debug logging temporarily (fills disk quickly)
4. Confirm collectors are actually running (resource limits kill silently)
5. Monitor collector health religiously
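When you do turn on debug output, this is roughly what it looks like in the collector config. Both knobs are extremely chatty - revert them once you have your answer.

```yaml
# Sketch: verbose self-logging plus a debug exporter alongside the real one.
exporters:
  debug:
    verbosity: detailed
service:
  telemetry:
    logs:
      level: debug            # temporary; this fills disks fast
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp, debug]   # keep the real exporter, add debug next to it
```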
Recommended Implementation Path
Phase 1: Foundation (Week 1)
- Deploy minimal OpenTelemetry Collector with basic configuration
- Set up Jaeger v2 with ClickHouse backend
- Configure basic Grafana dashboards
- Implement health monitoring for all components
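To keep Grafana reproducible from day one, provision the Jaeger data source from a file instead of clicking it together in the UI. The URL assumes a jaeger-query service in an observability namespace - adjust it to your deployment.

```yaml
# Sketch: Grafana data source provisioning file
# (mounted under /etc/grafana/provisioning/datasources/).
apiVersion: 1
datasources:
  - name: Jaeger
    type: jaeger
    access: proxy
    url: http://jaeger-query.observability.svc:16686   # illustrative service name
    isDefault: false
```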
Phase 2: Production Hardening (Weeks 2-3)
- Configure proper resource limits based on traffic patterns
- Implement sampling strategies (start with 5% probabilistic)
- Set up alerting for collector health and pipeline failures (see the sketch after this list)
- Configure security (mTLS, RBAC, network policies)
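For the alerting item above, a sketch using the prometheus-operator PrometheusRule CRD. It assumes you already scrape the collector's own metrics, and exact metric names vary between collector versions, so verify what your scrape actually exposes.

```yaml
# Sketch: alert when the collector refuses spans or fails to export them.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: otel-collector-health
  namespace: observability
spec:
  groups:
    - name: otel-collector
      rules:
        - alert: CollectorExportFailures
          expr: rate(otelcol_exporter_send_failed_spans[5m]) > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "OpenTelemetry Collector is failing to export spans"
        - alert: CollectorRefusingSpans
          expr: rate(otelcol_receiver_refused_spans[5m]) > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Collector refusing spans (often memory_limiter back-pressure)"
```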
Phase 3: Scale and Optimize (Ongoing)
- Tune sampling rates based on production data
- Optimize storage retention policies
- Implement custom business metrics
- Monitor and adjust resource allocations
Success Criteria and Validation
Deployment Success Indicators
- Trace completeness: >95% of requests produce complete traces
- Collector uptime: >99.9% availability with automatic restart
- Query performance: Dashboard loads <10 seconds
- Resource stability: No OOM kills or CPU throttling
- Storage performance: Query response times <5 seconds
Common Failure Patterns to Monitor
- Trace volume spikes killing Elasticsearch
- Auto-instrumentation breaking application authentication
- Grafana memory consumption during complex dashboard rendering
- Collector resource exhaustion during traffic surges
- Storage backend timeouts during high query loads
Useful Links for Further Investigation
Resources That Don't Suck
Link | Description |
---|---|
OpenTelemetry Documentation | The official docs - they're actually decent, which is rare. Skip the conceptual bullshit and go straight to the [language-specific SDKs](https://opentelemetry.io/docs/languages/). The [collector configuration](https://opentelemetry.io/docs/collector/configuration/) section will save you hours of trial and error. |
Jaeger v2 Documentation | Finally updated for v2. The [migration guide from v1](https://www.jaegertracing.io/docs/2.10/deployment/) doesn't lie about the complexity. Start with the [getting started](https://www.jaegertracing.io/docs/2.10/getting-started/) if you're new, skip the theory. |
Grafana Observability Documentation | Their docs used to suck, but they're better now. The [data source configuration](https://grafana.com/docs/grafana/latest/datasources/) section is where you'll spend most of your time. The [alerting docs](https://grafana.com/docs/grafana/latest/alerting/) are actually readable. |
Kubernetes Observability Guide | Official K8s docs for logging architecture. Dry as hell but accurate. The [resource management](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/) section will prevent your pods from getting OOMKilled. |
OpenTelemetry Operator | The operator works great until you need custom configs. Then you're deep in CRD hell. But for basic deployments, it's solid. Check the [releases page](https://github.com/open-telemetry/opentelemetry-operator/releases) before upgrading - some versions have broken our deployments. |
OpenTelemetry Helm Charts | I've used these in production, they work. Don't trust the default values though - you'll need to customize [resource limits](https://github.com/open-telemetry/opentelemetry-helm-charts/tree/main/charts/opentelemetry-collector) or your collectors will die under load. |
Jaeger Operator | Works for basic deployments. The [storage backend configuration](https://github.com/jaegertracing/jaeger-operator#storage-backends) is where most people fuck up. Read the docs twice before going to production. |
Grafana Helm Charts | Community charts that don't suck. The [grafana/grafana](https://github.com/grafana/helm-charts/tree/main/charts/grafana) chart is solid for production. Just don't forget persistence or you'll lose all your dashboards. |
OpenTelemetry Demo Application | This actually works. Full microservices setup with real instrumentation. Clone it, run it, see how the pieces fit together. Way better than trying to figure it out from documentation. |
Kubernetes OTLP Example | Real configs that work in production. The [DaemonSet config](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/examples/kubernetes/otel-collector-daemonset.yaml) is what you want. Don't use the default resource limits - they're too low. |
Grafana Observability Dashboards | Community dashboards are hit or miss but [these ones don't suck](https://grafana.com/grafana/dashboards/15983-opentelemetry-collector/). Import them as a starting point, then customize. Don't trust the default queries - half of them are wrong. |
OpenTelemetry Community | Join the [Slack workspace](https://cloud-native.slack.com/) - #opentelemetry channel has people who actually know what they're talking about. The [SIG meetings](https://opentelemetry.io/community/meetings/) are boring but useful if you're doing complex integrations. |
CNCF Jaeger Project | The [roadmap](https://www.jaegertracing.io/roadmap/) tells you what's coming. The [GitHub issues](https://github.com/jaegertracing/jaeger/issues) are where you'll find solutions to the problems you're about to hit. |
Grafana Community Forums | Better than Stack Overflow for Grafana problems. The [observability section](https://community.grafana.com/c/grafana/observability/35) has people who've solved the same problems you're facing. |
OpenTelemetry Specification | Dry technical specs that you'll reference when building [custom instrumentation](https://opentelemetry.io/docs/specs/otel/trace/api/). The [semantic conventions](https://opentelemetry.io/docs/specs/semconv/) are crucial if you want consistent attributes across your stack. |
Jaeger Deployment Guide | The [production deployment section](https://www.jaegertracing.io/docs/latest/deployment/#production-deployment) is gold. Follow it or you'll be troubleshooting storage issues at 3am. The [scaling strategies](https://www.jaegertracing.io/docs/latest/deployment/#scaling) will save your ass when traffic spikes. |
Grafana Academy | Actually useful tutorials. The [dashboard creation](https://grafana.com/tutorials/grafana-fundamentals/) course teaches you the right way instead of clicking randomly until something works. |
OpenTelemetry Registry | Find [instrumentation libraries](https://opentelemetry.io/ecosystem/registry/?component=instrumentation&language=all) that actually work. The [vendor integrations](https://opentelemetry.io/ecosystem/registry/?component=exporter&language=all) list shows what's supported and what's experimental (avoid the experimental ones). |
Jaeger Performance Testing | Load testing tools that show you where your deployment will break. Run these before production or you'll find out the hard way during Black Friday. The [capacity planning scripts](https://github.com/jaegertracing/jaeger-performance/tree/master/scripts) are worth their weight in gold. |
OpenTelemetry Collector Builder | Build custom collector images with only the [components you need](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main). Reduces image size and attack surface. The [build configs](https://github.com/open-telemetry/opentelemetry-collector/tree/main/cmd/builder/test) show you how it's done. |