
OpenTelemetry + Jaeger + Grafana on Kubernetes: Production Observability Stack

Stack Overview

Core Components:

  • OpenTelemetry: Vendor-neutral instrumentation (CNCF project, v1.0+ stable since 2021)
  • Jaeger v2: Distributed tracing storage and search (November 2024 release, built on OpenTelemetry Collector)
  • Grafana: Visualization and dashboards (trace-to-metrics since 9.1, improved TraceQL in 11.0)
  • Kubernetes: Container orchestration with service discovery

Key Value Proposition: Zero vendor lock-in, enterprise-scale performance (Netflix processes 2+ trillion spans daily), complete observability stack for free.

Critical Performance Specifications

Resource Requirements (Production Reality)

| Component | Minimum RAM | Typical RAM | CPU | Storage Impact |
|---|---|---|---|---|
| OpenTelemetry Agent (DaemonSet) | 150MB | 300MB (spikes to 400MB) | 0.1-0.2 cores | N/A |
| OpenTelemetry Gateway | 500MB | 1-4GB (can balloon to 8GB) | 1-2 cores | N/A |
| Jaeger v2 | 500MB | 1-3GB (storage dependent) | 1-2 cores | Varies by backend |
| Grafana | 250MB | 400MB-2GB (dashboard complexity) | 0.5-1 cores | Minimal |
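
These numbers translate directly into Kubernetes resource requests and limits. A minimal sketch for the agent DaemonSet, assuming the opentelemetry-collector Helm chart's values layout; the figures mirror the table above and are starting points, not recommendations:

```yaml
# Sketch: resource settings for the agent DaemonSet, assuming the
# opentelemetry-collector Helm chart's values.yaml layout (excerpt).
# Numbers mirror the table above - tune them against real traffic.
mode: daemonset
resources:
  requests:
    cpu: 100m        # ~0.1 cores per node at baseline
    memory: 256Mi
  limits:
    cpu: 200m
    memory: 512Mi    # headroom above the typical 300-400MB usage
```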

Performance Overhead

  • Normal Operation: 2-5% CPU overhead
  • Misconfigured: Up to 50% CPU overhead
  • Memory: 50-200MB per collector (can leak into the gigabytes; see the sketch after this list)
  • Network: Scales with trace volume
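
Most of the memory horror stories trace back to collectors running without the memory_limiter and batch processors. A minimal collector-config sketch (an excerpt, with illustrative limits; size limit_mib to sit below the container's memory limit):

```yaml
# Sketch: keep collector memory bounded and exports batched.
# Excerpt - receivers and exporters are defined elsewhere in the config.
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 800        # start refusing data before the kernel OOM-kills the pod
    spike_limit_mib: 200
  batch:
    send_batch_size: 1024
    timeout: 5s

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]   # memory_limiter should run first
      exporters: [otlp]
```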

Critical Failure Modes

Silent Failures

  • Collectors die silently when resource limits are exceeded
  • Traces disappear without alerts when collectors fail
  • Memory limits kill collectors without visible errors
  • Default Helm chart configurations fail under production load
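
The antidote to silent collector death is the health_check extension plus Kubernetes probes, so a wedged or OOM-killed collector gets restarted and shows up in pod events. A sketch (port 13133 and the root path are the extension's defaults; verify against your collector version):

```yaml
# Sketch: make collector failures visible to Kubernetes.
# Collector config side - expose a health endpoint:
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
service:
  extensions: [health_check]
---
# Pod spec side (excerpt) - probe that endpoint so dead collectors
# are restarted instead of silently dropping traces:
livenessProbe:
  httpGet:
    path: /
    port: 13133
readinessProbe:
  httpGet:
    path: /
    port: 13133
```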

Configuration Hell

  • Helm charts assume defaults work (they don't)
  • Service mesh configs conflict with OpenTelemetry configs
  • Storage backends timeout under production load
  • Version 1.2.3 of the OpenTelemetry Operator has a known memory leak in its webhook

Production Breaking Points

  • The trace UI breaks at 1,000+ spans per trace, making debugging impossible
  • 100% sampling kills production; use 1-10% for high-traffic services (see the sampler sketch after this list)
  • Poorly configured collectors eat entire CPU cores
  • Complex Grafana dashboards can consume 32GB RAM
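
For the sampling point above, the head-based fix is the probabilistic_sampler processor. A sketch at 5%; pick the rate from your own traffic, not from this example:

```yaml
# Sketch: head-based sampling so high-traffic services don't drown the
# pipeline. 5% is an example rate, not a recommendation.
processors:
  probabilistic_sampler:
    sampling_percentage: 5

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, probabilistic_sampler, batch]
      exporters: [otlp]
```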

Deployment Reality

Time Investment

  • Optimistic: Few hours if nothing breaks
  • Realistic: 2-3 weeks for production-ready deployment
  • Disaster: Several weeks when everything breaks
  • Expertise Required: Deep knowledge of Kubernetes, YAML configuration, and distributed systems

What Actually Breaks During Deployment

  1. Resource limits too low (default charts)
  2. Storage backend timeouts under load
  3. Service mesh integration conflicts
  4. Auto-instrumentation breaks authentication headers
  5. Version compatibility issues between components
  6. Network policies blocking component communication
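
Item 6 is usually the quickest to rule out: in a cluster with default-deny policies, the OTLP ports have to be opened explicitly. A sketch of a NetworkPolicy (namespace and labels are placeholders for your environment):

```yaml
# Sketch: allow application pods to reach the collector on the OTLP ports.
# The namespace and label selectors are placeholders.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-otlp-to-collector
  namespace: observability
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: opentelemetry-collector
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector: {}   # any namespace; tighten this in multi-tenant clusters
      ports:
        - protocol: TCP
          port: 4317              # OTLP gRPC
        - protocol: TCP
          port: 4318              # OTLP HTTP
```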

Deployment Approaches Comparison

| Method | Setup Time | Customization | Production Readiness | Maintenance |
|---|---|---|---|---|
| OpenTelemetry Operator | Fast | Limited (CRD hell for custom configs) | High | Medium |
| Helm Charts | Medium | YAML configuration nightmare | High | High |
| Manual Deployment | Slow | Complete control | Highest | Highest |
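
For reference, the operator route boils down to an OpenTelemetryCollector resource like the sketch below. The API version and config schema vary between operator releases (older releases take the config as a raw string, newer ones take structured YAML), so check yours before copying anything:

```yaml
# Sketch: operator-managed agent. Exact apiVersion and config schema
# depend on the operator release - verify before use.
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-agent
spec:
  mode: daemonset
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:
      memory_limiter:
        check_interval: 1s
        limit_mib: 400
      batch: {}
    exporters:
      otlp:
        endpoint: jaeger-collector.observability:4317   # placeholder endpoint
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp]
```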

Storage and Cost Reality

Storage Costs (Monthly)

  • Budget Range: $500-5000/month depending on scale
  • Retention Strategy: Detailed traces 7 days, aggregated metrics 6 months, trends forever
  • Storage Backend Costs: ClickHouse (cheapest) < Cassandra < Elasticsearch (most expensive)
  • Object Storage: Unlimited but slowest query performance

Sampling Strategy Requirements

  • High-traffic services: 1-10% probabilistic sampling
  • Error traces: Always retain via tail-based sampling
  • Slow requests: Always retain via adaptive sampling
  • Volume management: Essential to prevent collector death
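
Keeping every error and slow request while dropping most healthy traffic is what the tail_sampling processor does; it belongs on the gateway tier, where complete traces are visible. A sketch of the policy mix above, with illustrative thresholds:

```yaml
# Sketch: tail-based sampling on the gateway collector - keep all errors,
# keep all slow traces, sample the rest. Thresholds are illustrative.
processors:
  tail_sampling:
    decision_wait: 10s            # how long to buffer spans before deciding
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 500
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```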

Security Implementation

Required Security Measures

  • mTLS for OTLP communications (security team requirement; see the exporter sketch after this list)
  • Kubernetes network policies (default allow-all is dangerous)
  • Grafana RBAC (prevent developers accessing billing dashboards)
  • Data sanitization processors to remove sensitive information
  • Cross-cluster service discovery configuration for multi-tenant deployments
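
For the mTLS item, the collector's OTLP exporter takes certificates directly. A sketch, assuming the certs are mounted from a Kubernetes secret at /certs; the paths and endpoint are placeholders:

```yaml
# Sketch: mTLS on the collector-to-backend OTLP hop. The endpoint and
# certificate paths are placeholders for your environment.
exporters:
  otlp:
    endpoint: jaeger-collector.observability:4317
    tls:
      ca_file: /certs/ca.crt
      cert_file: /certs/tls.crt
      key_file: /certs/tls.key
```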

Integration Complexity Matrix

| Integration Type | Complexity | Performance Impact | Vendor Lock-in | Use Case |
|---|---|---|---|---|
| OpenTelemetry + Jaeger + Grafana | Moderate (3 weeks) | Low (2-5%) | None | Complete observability |
| Proprietary APM | Low (but expensive) | Medium (5-15%) | High | Budget > time |
| ELK Stack + APM | High (YAML nightmare) | Medium (3-10%) | Medium | Elasticsearch masochists |
| Cloud Provider Solutions | Low (until customization) | Variable (black box) | High | Cloud-native convenience |

Critical Warnings and Operational Intelligence

What Documentation Doesn't Tell You

  • Default configurations will fail under production load
  • Collectors require health monitoring or failures go unnoticed (see the telemetry sketch after this list)
  • Memory leaks are common in misconfigured deployments
  • Service mesh integration requires understanding both Istio and OpenTelemetry configs
  • Storage backends have different reliability characteristics under load
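
On the health-monitoring point, the collector exposes its own metrics; turn them on and scrape them like any other workload. A sketch using the older address-style telemetry config (recent collector releases configure this through metric readers instead):

```yaml
# Sketch: expose the collector's internal metrics so dropped data is visible.
# Older config style shown; newer releases use metric readers.
service:
  telemetry:
    metrics:
      level: detailed
      address: 0.0.0.0:8888   # scrape this endpoint with Prometheus
    logs:
      level: info
```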

Migration Considerations

  • Dual deployment strategy: Run alongside existing APM during transition
  • Automatic instrumentation: Minimizes code changes but can break authentication
  • Dashboard migration: Gradual transition while maintaining existing tooling
  • Multi-cluster deployments: Require gateway aggregation points and centralized storage

Troubleshooting Hierarchy

  1. Check collector logs first (usually the problem)
  2. Verify OTLP endpoint reachability
  3. Enable debug logging temporarily - it fills the disk quickly (see the sketch after this list)
  4. Confirm collectors are actually running (resource limits kill silently)
  5. Monitor collector health religiously
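
For step 3, there are two switches: the collector's own log level and a debug exporter teed into the pipeline. A sketch (the debug exporter replaced the older logging exporter in recent releases; use whichever your version ships, and remove both once you've found the problem):

```yaml
# Sketch: temporary debug visibility. Remove after troubleshooting or the
# log volume will fill the node's disk.
exporters:
  debug:                # called 'logging' on older collector versions
    verbosity: detailed

service:
  telemetry:
    logs:
      level: debug      # the collector's own logs
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp, debug]   # tee spans to stdout alongside the real backend
```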

Recommended Implementation Path

Phase 1: Foundation (Week 1)

  • Deploy a minimal OpenTelemetry Collector with basic configuration (see the sketch after this list)
  • Set up Jaeger v2 with ClickHouse backend
  • Configure basic Grafana dashboards
  • Implement health monitoring for all components
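
A minimal Phase 1 collector configuration looks roughly like the sketch below; the Jaeger endpoint name is a placeholder for wherever your Jaeger v2 deployment accepts OTLP:

```yaml
# Sketch: minimal Phase 1 pipeline - receive OTLP, protect memory, batch,
# forward to Jaeger v2. The exporter endpoint is a placeholder.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 800
    spike_limit_mib: 200
  batch: {}

exporters:
  otlp:
    endpoint: jaeger-collector.observability:4317
    tls:
      insecure: true    # acceptable for Phase 1; mTLS arrives in Phase 2

extensions:
  health_check: {}

service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```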

Phase 2: Production Hardening (Week 2-3)

  • Configure proper resource limits based on traffic patterns
  • Implement sampling strategies (start with 5% probabilistic)
  • Set up alerting for collector health and pipeline failures (see the rule sketch after this list)
  • Configure security (mTLS, RBAC, network policies)
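
For the alerting item, start by alerting on the collector's own export failures. The sketch below assumes the Prometheus Operator's PrometheusRule CRD and the collector's standard otelcol_* metrics; metric names shift between versions, so confirm against your /metrics output first:

```yaml
# Sketch: page when the collector starts failing to export spans.
# Assumes Prometheus Operator CRDs; verify the metric name against your
# collector version's /metrics endpoint before relying on it.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: otel-collector-health
  namespace: observability
spec:
  groups:
    - name: otel-collector
      rules:
        - alert: OtelCollectorExportFailures
          expr: rate(otelcol_exporter_send_failed_spans[5m]) > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: OpenTelemetry Collector is failing to export spans
```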

Phase 3: Scale and Optimize (Ongoing)

  • Tune sampling rates based on production data
  • Optimize storage retention policies
  • Implement custom business metrics
  • Monitor and adjust resource allocations

Success Criteria and Validation

Deployment Success Indicators

  • Trace completeness: >95% of requests produce complete traces
  • Collector uptime: >99.9% availability with automatic restart
  • Query performance: Dashboards load in <10 seconds
  • Resource stability: No OOM kills or CPU throttling
  • Storage performance: Query response times <5 seconds

Common Failure Patterns to Monitor

  • Trace volume spikes killing Elasticsearch
  • Auto-instrumentation breaking application authentication
  • Grafana memory consumption during complex dashboard rendering
  • Collector resource exhaustion during traffic surges
  • Storage backend timeouts during high query loads

Useful Links for Further Investigation

Resources That Don't Suck

  • OpenTelemetry Documentation: The official docs - they're actually decent, which is rare. Skip the conceptual bullshit and go straight to the [language-specific SDKs](https://opentelemetry.io/docs/languages/). The [collector configuration](https://opentelemetry.io/docs/collector/configuration/) section will save you hours of trial and error.
  • Jaeger v2 Documentation: Finally updated for v2. The [migration guide from v1](https://www.jaegertracing.io/docs/2.10/deployment/) doesn't lie about the complexity. Start with the [getting started](https://www.jaegertracing.io/docs/2.10/getting-started/) if you're new, skip the theory.
  • Grafana Observability Documentation: Their docs used to suck, but they're better now. The [data source configuration](https://grafana.com/docs/grafana/latest/datasources/) section is where you'll spend most of your time. The [alerting docs](https://grafana.com/docs/grafana/latest/alerting/) are actually readable.
  • Kubernetes Observability Guide: Official K8s docs for logging architecture. Dry as hell but accurate. The [resource management](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/) section will prevent your pods from getting OOMKilled.
  • OpenTelemetry Operator: The operator works great until you need custom configs. Then you're deep in CRD hell. But for basic deployments, it's solid. Check the [releases page](https://github.com/open-telemetry/opentelemetry-operator/releases) before upgrading - some versions have broken our deployments.
  • OpenTelemetry Helm Charts: I've used these in production, they work. Don't trust the default values though - you'll need to customize [resource limits](https://github.com/open-telemetry/opentelemetry-helm-charts/tree/main/charts/opentelemetry-collector) or your collectors will die under load.
  • Jaeger Operator: Works for basic deployments. The [storage backend configuration](https://github.com/jaegertracing/jaeger-operator#storage-backends) is where most people fuck up. Read the docs twice before going to production.
  • Grafana Helm Charts: Community charts that don't suck. The [grafana/grafana](https://github.com/grafana/helm-charts/tree/main/charts/grafana) chart is solid for production. Just don't forget persistence or you'll lose all your dashboards.
  • OpenTelemetry Demo Application: This actually works. Full microservices setup with real instrumentation. Clone it, run it, see how the pieces fit together. Way better than trying to figure it out from documentation.
  • Kubernetes OTLP Example: Real configs that work in production. The [DaemonSet config](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/examples/kubernetes/otel-collector-daemonset.yaml) is what you want. Don't use the default resource limits - they're too low.
  • Grafana Observability Dashboards: Community dashboards are hit or miss but [these ones don't suck](https://grafana.com/grafana/dashboards/15983-opentelemetry-collector/). Import them as a starting point, then customize. Don't trust the default queries - half of them are wrong.
  • OpenTelemetry Community: Join the [Slack workspace](https://cloud-native.slack.com/) - #opentelemetry channel has people who actually know what they're talking about. The [SIG meetings](https://opentelemetry.io/community/meetings/) are boring but useful if you're doing complex integrations.
  • CNCF Jaeger Project: The [roadmap](https://www.jaegertracing.io/roadmap/) tells you what's coming. The [GitHub issues](https://github.com/jaegertracing/jaeger/issues) are where you'll find solutions to the problems you're about to hit.
  • Grafana Community Forums: Better than Stack Overflow for Grafana problems. The [observability section](https://community.grafana.com/c/grafana/observability/35) has people who've solved the same problems you're facing.
  • OpenTelemetry Specification: Dry technical specs that you'll reference when building [custom instrumentation](https://opentelemetry.io/docs/specs/otel/trace/api/). The [semantic conventions](https://opentelemetry.io/docs/specs/semconv/) are crucial if you want consistent attributes across your stack.
  • Jaeger Deployment Guide: The [production deployment section](https://www.jaegertracing.io/docs/latest/deployment/#production-deployment) is gold. Follow it or you'll be troubleshooting storage issues at 3am. The [scaling strategies](https://www.jaegertracing.io/docs/latest/deployment/#scaling) will save your ass when traffic spikes.
  • Grafana Academy: Actually useful tutorials. The [dashboard creation](https://grafana.com/tutorials/grafana-fundamentals/) course teaches you the right way instead of clicking randomly until something works.
  • OpenTelemetry Registry: Find [instrumentation libraries](https://opentelemetry.io/ecosystem/registry/?component=instrumentation&language=all) that actually work. The [vendor integrations](https://opentelemetry.io/ecosystem/registry/?component=exporter&language=all) list shows what's supported and what's experimental (avoid the experimental ones).
  • Jaeger Performance Testing: Load testing tools that show you where your deployment will break. Run these before production or you'll find out the hard way during Black Friday. The [capacity planning scripts](https://github.com/jaegertracing/jaeger-performance/tree/master/scripts) are worth their weight in gold.
  • OpenTelemetry Collector Builder: Build custom collector images with only the [components you need](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main). Reduces image size and attack surface. The [build configs](https://github.com/open-telemetry/opentelemetry-collector/tree/main/cmd/builder/test) show you how it's done.
