Why Microservices Observability Is A Nightmare

Microservices turned debugging into archaeology. One request touches 15 services, fails in the 12th one, and good luck figuring out why. Traditional monitoring tools shit the bed with this complexity.

[Figure: Microservices complexity diagram]

I've spent years building observability stacks that work in production. Here's what I learned: you need OpenTelemetry for instrumentation (because vendor SDKs are trash), Jaeger for tracing (because following requests across services manually is hell), and Grafana for visualization (because readable dashboards matter).

What This Stack Actually Does

OpenTelemetry collects traces, metrics, and logs from your apps without tying you to any vendor. It's the CNCF's way of saying "fuck proprietary instrumentation." Version 1.0 dropped in 2021 and it's been solid since.
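
Here's roughly what instrumentation looks like with the Python SDK - a minimal sketch, assuming a collector listening at otel-collector:4317 (the service name, endpoint, and span/attribute names are placeholders):

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# One vendor-neutral pipeline: SDK -> OTLP -> whatever backend you point it at.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("charge-card") as span:
    span.set_attribute("order.id", "12345")  # attributes show up in Jaeger and Grafana
```

Swap the exporter endpoint and nothing else changes - that's the whole point of OTLP.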

Jaeger stores and searches your traces. V2 just came out (November 2024) and it's built on OpenTelemetry Collector, which means native OTLP support and way less configuration hell. V2 supports ClickHouse, Elasticsearch, and Cassandra as storage backends, with gRPC and HTTP ingestion endpoints that actually work under load.

Grafana turns your data into dashboards people can actually read. Grafana's had trace-to-metrics since 9.1, but 11.0 improved the TraceQL integration, which is crucial when you're debugging cascading failures.

Kubernetes runs it all and provides service discovery that actually works (most of the time).

Why This Integration Doesn't Suck

Unlike most monitoring stacks that make you choose between vendor lock-in or configuration hell, this combination actually works:

OpenTelemetry standardizes everything - traces, metrics, logs all use the same format. No more vendor-specific data formats or proprietary SDKs that break every update.

Request tracing that actually works - W3C trace context flows through your entire stack. A request failing in service 12 gets tracked back to the originating API call, with timing for every hop.
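
Under the hood that's just the W3C traceparent header being injected and extracted at every hop. A rough sketch with the Python SDK (the service boundaries and URLs here are made up):

```python
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)

# Caller side: inject the W3C traceparent header into the outgoing request.
def call_downstream(url: str) -> requests.Response:
    headers: dict[str, str] = {}
    with tracer.start_as_current_span("call-downstream"):
        inject(headers)  # adds traceparent (and tracestate if present)
        return requests.get(url, headers=headers)

# Callee side (inside your HTTP handler): continue the caller's trace.
def handle_request(incoming_headers: dict[str, str]) -> None:
    ctx = extract(incoming_headers)
    with tracer.start_as_current_span("handle-request", context=ctx):
        ...  # this span becomes a child of the caller's span, timing included
```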

[Figure: Distributed tracing flow]

Zero-code Kubernetes integration - The OpenTelemetry Operator injects sidecar collectors automatically. Your pods get instrumented without touching application code.

Netflix-scale for free - They process 2+ trillion spans daily using this architecture. That's enterprise scale without the enterprise price tag that makes your CFO cry.

[Figure: Kubernetes architecture]


The best part? It's all open source. No surprise licensing fees, no vendor account managers calling you every week, no "contact sales for pricing" bullshit.

Deployment Reality - What Actually Works

Now that you understand why this stack makes sense, let's talk about actually deploying it. This is where good intentions meet production reality. Here's how to do it without losing your sanity.

Component Architecture That Won't Break

OpenTelemetry Collector runs two ways: agents on every node (DaemonSets) and central gateways (Deployments). Agents forward app data; gateways do the heavy lifting. Resource requirements: agents usually eat around 150MB, sometimes 300MB if you're unlucky; gateways can be anywhere from 500MB to "holy shit it's using 4GB again" (the docs say 50MB, they're wrong).

[Figure: OpenTelemetry Collector architecture]

Jaeger v2 simplified everything by building on OpenTelemetry Collector. One binary, native OTLP, and storage backends that don't crash under load. Use ClickHouse for storage unless you enjoy paying Elasticsearch licensing fees.

Grafana needs maybe 250MB RAM for basic setups, maybe 2GB if you have data sources that time out constantly. The unified alerting actually works now, unlike the old system that failed silently.

Kubernetes Integration Horror Stories

Operators vs. Helm vs. Manual Deployment:

  • Operators: Work great until you need custom configs, then you're editing CRDs
  • Helm: Fast setup but customization is YAML hell
  • Manual: Takes forever but you understand what breaks

[Figure: Kubernetes deployment options]

[Figure: Jaeger architecture diagram]

Resource Requirements (Real Numbers):

  • OpenTelemetry Collector agents: usually around 150MB, sometimes spikes to 400MB when processing gets weird
  • Gateway collectors: anywhere from 1GB to "why is this using 8GB" depending on traffic spikes
  • Jaeger: 500MB minimum, but I've seen it balloon to 3GB (storage backend dependent)
  • Grafana: starts at 400MB, grows to 1.5GB+ if your dashboards are complex

Security That Doesn't Break Everything:

  • Mutual TLS for OTLP traffic between apps, collectors, and Jaeger
  • RBAC for Grafana access
  • Kubernetes network policies to restrict component communication
  • Data sanitization processors to strip sensitive fields from telemetry (more in the FAQ below)

What Actually Breaks During Deployment

Everything. The Helm charts assume default configs work (they don't). Resource limits kill collectors silently. Service mesh configs conflict with OpenTelemetry configs. Storage backends time out under load.

I've deployed this 12 times and here's what actually breaks: version 1.2.3 of the operator broke our entire deployment because of a memory leak in the webhook, and I once spent 6 hours debugging why collectors were dying - turned out to be memory limits set too low.

Plan for however long it takes - maybe a few hours if nothing breaks, maybe a week if everything does. And when collectors die, they die silently and your traces disappear. Monitor collector health religiously.

Integration Approaches Comparison

| Integration Method | Complexity | Performance Impact | Vendor Lock-in | Enterprise Readiness | Use Case |
|---|---|---|---|---|---|
| OpenTelemetry + Jaeger + Grafana | Moderate (took us 3 weeks, but we're slow) | Low (2-5% overhead) | None | High | Complete observability stack |
| Proprietary APM Solutions | Low (but $$$) | Medium (5-15% overhead) | High (good luck leaving) | High | When budget > time |
| ELK Stack + APM | High (YAML nightmare) | Medium (3-10% overhead) | Medium | Medium | If you love Elasticsearch pain |
| Prometheus + Custom Tracing | High (DIY everything) | Low (1-3% overhead) | Low | Medium | Metrics-first masochists |
| Cloud Provider Solutions | Low (until you customize) | Variable (black box) | High (vendor hugs) | High | Cloud-native convenience |

FAQ - The Shit Nobody Tells You

Q: What's the real performance overhead?

A: OpenTelemetry adds 2-5% CPU overhead in normal cases. Configure it wrong and it's 50% - I've seen poorly configured collectors eat entire CPU cores. Memory overhead is usually 50-200MB per collector, but with memory leaks you're looking at gigabytes. The overhead scales with trace volume: 100% sampling works for toy apps but kills production, so use 1-10% sampling for high-traffic services or your collectors will die. For context, we process 10M+ spans/day with 5% sampling and it runs on 2 CPU cores and 4GB RAM per gateway collector.

Q: Does Jaeger v2 actually fix anything?

A: Yes, but it's not magic. V2 (November 2024) rebuilt everything on OpenTelemetry Collector, which means native OTLP and less configuration hell. Migration from v1 is mostly config changes, but you get better performance and fewer moving parts.

Q: Does this work with service mesh?

A: Istio + OpenTelemetry works but the configuration is a pain in the ass. You get automatic network-level tracing, which is great, but troubleshooting requires understanding both Istio and OpenTelemetry configs. When it works, it's magic. When it breaks, you're fucked.

Q: What's the recommended approach for handling high-cardinality metrics?

A: OpenTelemetry provides metric cardinality limits and sampling strategies. Use metric aggregation at collection time, implement attribute filtering, and leverage exemplars to link high-cardinality events to specific traces.
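
For the attribute-filtering part, the Python SDK's metric Views can drop high-cardinality attributes at aggregation time - a sketch, assuming an instrument named http.server.duration and a collector OTLP endpoint:

```python
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.metrics.view import View
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# Keep only low-cardinality attributes on the request-duration histogram;
# user IDs, request IDs, etc. get dropped before the data ever leaves the process.
duration_view = View(
    instrument_name="http.server.duration",
    attribute_keys={"http.method", "http.status_code"},
)

reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://otel-collector:4317", insecure=True)
)
provider = MeterProvider(metric_readers=[reader], views=[duration_view])
```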

Q: How much does storage actually cost?

A: Traces are expensive to store. Keep detailed traces for 7 days, aggregated metrics for 6 months, and long-term trends forever. Budget $500-5000/month for storage depending on scale. ClickHouse is cheapest, Elasticsearch is most expensive, and object storage is slowest but unlimited.

Q: What security considerations are important for production deployment?

A: Enable mutual TLS for OTLP communications, implement RBAC for Grafana access, and use Kubernetes network policies to restrict component communication. Consider data sanitization processors to remove sensitive information from telemetry data.
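
On the OTLP side, the gRPC exporters accept standard gRPC channel credentials, so mutual TLS is a few lines - a sketch with hypothetical cert paths:

```python
import grpc
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Mutual TLS: present a client cert and verify the collector's cert against our CA.
with open("certs/ca.crt", "rb") as ca, \
     open("certs/client.key", "rb") as key, \
     open("certs/client.crt", "rb") as crt:
    credentials = grpc.ssl_channel_credentials(
        root_certificates=ca.read(),
        private_key=key.read(),
        certificate_chain=crt.read(),
    )

exporter = OTLPSpanExporter(endpoint="otel-collector:4317", credentials=credentials)
```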

Q: How do you monitor the monitoring?

A: You instrument the observability stack itself or you're flying blind: OpenTelemetry Collector metrics, Jaeger health checks, Grafana operational dashboards. Set alerts for pipeline health because you won't notice until everything's broken. When collectors die, they die silently and your traces disappear, so monitor collector health religiously.
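
A dumb-but-effective liveness check, assuming the Collector's health_check extension is enabled on its default port 13133:

```python
import requests

def collector_is_healthy(host: str = "otel-collector", port: int = 13133) -> bool:
    """Return True if the Collector's health_check extension answers 200."""
    try:
        return requests.get(f"http://{host}:{port}/", timeout=2).status_code == 200
    except requests.RequestException:
        return False
```

Wire something like this into the alerting you already trust - a dead collector should page someone, not fail silently.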

Q: What's the migration path from proprietary APM solutions?

A: Start with dual deployment: run OpenTelemetry alongside existing solutions. Use OpenTelemetry's automatic instrumentation to minimize code changes. Gradually migrate dashboards and alerts to Grafana while maintaining existing tooling until full confidence is achieved.
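
The "minimize code changes" part in practice: the instrumentation libraries hook existing frameworks, so the dual-deployment phase barely touches application code. A sketch assuming a Flask app with the flask and requests instrumentation packages installed:

```python
from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)  # traces every inbound request
RequestsInstrumentor().instrument()      # traces every outbound HTTP call

@app.route("/health")
def health():
    return "ok"
```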

Q: How do you handle multi-cluster or multi-cloud deployments?

A: Deploy OpenTelemetry Collector gateways as aggregation points for each cluster. Use remote write for metrics federation and centralized Jaeger deployment with multi-tenant configuration. Configure cross-cluster service discovery for complete request tracing.

Q: What are the resource requirements for production deployment?

A: For medium-scale deployments (1000+ pods): OpenTelemetry Collectors require 2-4 CPU cores and 4-8GB RAM for gateway mode, Jaeger components need 1-2 CPU cores and 2-4GB RAM, and Grafana requires 0.5-1 CPU core and 1-2GB RAM. Storage requirements depend on retention policies but typically start at 100GB for trace storage.

Q: How do you implement custom business metrics alongside infrastructure metrics?

A: Use OpenTelemetry's custom metrics API to create business-specific instruments. Implement semantic conventions for consistent attribute naming. Create Grafana dashboards that combine business metrics with infrastructure data for comprehensive observability.
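
A sketch of the custom metrics API - the meter, instrument, and attribute names here are made up, but the dotted lowercase style follows semantic-convention naming:

```python
from opentelemetry import metrics

meter = metrics.get_meter("checkout-service")

orders_completed = meter.create_counter(
    "orders.completed", unit="1", description="Completed orders"
)
checkout_duration = meter.create_histogram(
    "checkout.duration", unit="ms", description="End-to-end checkout latency"
)

def record_order(payment_method: str, elapsed_ms: float) -> None:
    # Same attribute on both instruments so they correlate cleanly in Grafana.
    attrs = {"payment.method": payment_method}
    orders_completed.add(1, attrs)
    checkout_duration.record(elapsed_ms, attrs)
```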

Q: What's the recommended sampling strategy for high-traffic applications?

A: Implement probabilistic head-based sampling (1-10%) for normal traffic and tail-based sampling to always retain error traces and slow requests. Use adaptive sampling to maintain consistent trace volume while capturing important events.
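
The head-sampling half is one line in the SDK; tail-based sampling lives in the Collector (its tail_sampling processor), not in the app. A sketch of the SDK side at 5%:

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 5% of new traces. ParentBased makes downstream services follow the
# caller's decision, so sampled traces stay complete instead of getting holes.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.05)))
```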

Q: What breaks during deployment?

A: Everything. The Helm charts assume default configs work (they don't). Resource limits kill collectors silently. Service mesh configs conflict with OpenTelemetry configs. Storage backends time out under load. Plan for 2-3x longer than the optimistic timeline. Specific failures I've seen: trace volume killed Elasticsearch in production, OpenTelemetry auto-instrumentation broke authentication headers, and Grafana ate 32GB RAM rendering one dashboard.

Q: How do you troubleshoot when shit breaks?

A: Check collector logs first (they're usually the problem). Verify OTLP endpoints are actually reachable. Enable debug logging, but turn it off fast or you'll fill your disk. The dumb thing to check first: are your collectors actually running? Resource limits kill them silently and you won't notice until traces disappear.
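
One cheap sanity check before blaming the collector: wire a console exporter in locally. A throwaway sketch:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Spans go straight to stdout - no collector, no network, no excuses.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

with trace.get_tracer("smoke-test").start_as_current_span("hello"):
    pass
```

If spans print here but never show up in Jaeger, the problem is the OTLP path or the collector, not your instrumentation.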
