OpenTelemetry + Jaeger + Grafana on Kubernetes: Production Observability Stack
Stack Overview
Core Components:
- OpenTelemetry: Vendor-neutral instrumentation (CNCF project, v1.0+ stable since 2021)
- Jaeger v2: Distributed tracing storage and search (November 2024 release, built on OpenTelemetry Collector)
- Grafana: Visualization and dashboards (trace-to-metrics since 9.1, improved TraceQL in 11.0)
- Kubernetes: Container orchestration with service discovery
Key Value Proposition: Zero vendor lock-in, enterprise-scale performance (Netflix processes 2+ trillion spans daily), and a complete observability stack with no license fees - you still pay in infrastructure and engineering time.
Critical Performance Specifications
Resource Requirements (Production Reality)
Component | Minimum RAM | Typical RAM | CPU | Storage Impact |
---|---|---|---|---|
OpenTelemetry Agent (DaemonSet) | 150MB | 300MB (spikes to 400MB) | 0.1-0.2 cores | N/A |
OpenTelemetry Gateway | 500MB | 1-4GB (can balloon to 8GB) | 1-2 cores | N/A |
Jaeger v2 | 500MB | 1-3GB (storage dependent) | 1-2 cores | Varies by backend |
Grafana | 250MB | 400MB-2GB (dashboard complexity) | 0.5-1 cores | Minimal |
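If you run the collector yourself rather than through the Operator or Helm, the sketch below shows agent resource settings that roughly match the table above. The image tag, namespace, and labels are assumptions - adjust them to whatever you actually deploy.

```yaml
# Sketch: agent-mode collector DaemonSet with explicit resource limits.
# Names, namespace, and image tag are illustrative.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-agent
  namespace: observability
spec:
  selector:
    matchLabels:
      app: otel-agent
  template:
    metadata:
      labels:
        app: otel-agent
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.116.0  # pin the version you actually run
          resources:
            requests:
              memory: 256Mi   # table says 150MB minimum; leave headroom
              cpu: 100m
            limits:
              memory: 512Mi   # typical 300MB, spikes to ~400MB
              cpu: 200m
```

The gateway deployment needs the same treatment with larger numbers (1-4GB memory, 1-2 cores), or it becomes the 8GB balloon described above.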
Performance Overhead
- Normal Operation: 2-5% CPU overhead
- Misconfigured: Up to 50% CPU overhead
- Memory: 50-200MB per collector (can leak to gigabytes)
- Network: Scales with trace volume
Critical Failure Modes
Silent Failures
- Collectors die silently when resource limits are exceeded
- Traces disappear without alerts when collectors fail
- Memory limits kill collectors without visible errors
- Default Helm chart configurations fail under production load
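The fix is to make failure visible: cap memory inside the collector before the kernel OOM-kills it, and expose a health endpoint you can wire to liveness/readiness probes. A minimal sketch, assuming an OTLP pipeline shipping to a Jaeger service called jaeger-collector.observability.svc (an illustrative name):

```yaml
# Sketch: memory_limiter sheds load before the pod gets OOM-killed;
# health_check gives Kubernetes probes something to hit (port 13133).
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
receivers:
  otlp:
    protocols:
      grpc: {}
      http: {}
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 800        # keep below the container memory limit
    spike_limit_mib: 200
  batch: {}
exporters:
  otlp:
    endpoint: jaeger-collector.observability.svc:4317   # illustrative service name
    tls:
      insecure: true      # placeholder; see the security section for mTLS
service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]   # memory_limiter must be first in the chain
      exporters: [otlp]
```

Point the container's liveness and readiness probes at port 13133 so Kubernetes restarts a wedged collector instead of letting traces silently vanish.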
Configuration Hell
- Helm charts assume defaults work (they don't)
- Service mesh configs conflict with OpenTelemetry configs
- Storage backends timeout under production load
- Version 1.2.3 of the OpenTelemetry Operator has a known memory leak in its webhook
Production Breaking Points
- The tracing UI breaks on traces with 1000+ spans, making debugging impossible
- 100% sampling kills production - use 1-10% probabilistic sampling for high-traffic services (see the sampling sketch after this list)
- Poorly configured collectors eat entire CPU cores
- Complex Grafana dashboards can consume 32GB RAM
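For the sampling point above, the quickest relief is head-based sampling in the collector. This fragment extends the pipeline sketched earlier; 5% is an illustrative starting value, not a universal recommendation.

```yaml
# Sketch: head-based probabilistic sampling - drop 95% of traces at the collector.
processors:
  probabilistic_sampler:
    sampling_percentage: 5   # 1-10% is realistic for high-traffic services
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, probabilistic_sampler, batch]
      exporters: [otlp]
```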
Deployment Reality
Time Investment
- Optimistic: Few hours if nothing breaks
- Realistic: 2-3 weeks for production-ready deployment
- Disaster: Several weeks when everything breaks
- Expertise Required: Deep knowledge of Kubernetes, YAML configuration, and distributed systems
What Actually Breaks During Deployment
- Resource limits too low (default charts)
- Storage backend timeouts under load
- Service mesh integration conflicts
- Auto-instrumentation breaks authentication headers
- Version compatibility issues between components
- Network policies blocking component communication
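For the last item, a NetworkPolicy that admits OTLP traffic to the collector looks roughly like this. The labels, namespace, and "allow from any namespace" rule are assumptions - tighten them for multi-tenant clusters.

```yaml
# Sketch: allow OTLP ingress (4317 gRPC, 4318 HTTP) to collector pods.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-otlp-ingress
  namespace: observability
spec:
  podSelector:
    matchLabels:
      app: otel-collector
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector: {}   # any namespace; restrict this in production
      ports:
        - protocol: TCP
          port: 4317
        - protocol: TCP
          port: 4318
```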
Deployment Approaches Comparison
Method | Setup Time | Customization | Production Readiness | Maintenance |
---|---|---|---|---|
OpenTelemetry Operator | Fast | Limited (CRD hell for custom configs) | High | Medium |
Helm Charts | Medium | YAML configuration nightmare | High | High |
Manual Deployment | Slow | Complete control | Highest | Highest |
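If you pick the Operator route, a gateway is a single custom resource. The sketch below assumes the v1beta1 CRD schema (older operator releases use v1alpha1, where config is a string) and illustrative names and limits.

```yaml
# Sketch: Operator-managed gateway collector (v1beta1 schema assumed).
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-gateway
  namespace: observability
spec:
  mode: deployment
  replicas: 2
  resources:
    limits:
      memory: 2Gi
      cpu: "1"
  config:
    receivers:
      otlp:
        protocols:
          grpc: {}
          http: {}
    processors:
      memory_limiter:
        check_interval: 1s
        limit_mib: 1600
      batch: {}
    exporters:
      otlp:
        endpoint: jaeger-collector.observability.svc:4317   # illustrative
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp]
```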
Storage and Cost Reality
Storage Costs (Monthly)
- Budget Range: $500-5000/month depending on scale
- Retention Strategy: Detailed traces 7 days, aggregated metrics 6 months, trends forever
- Storage Backend Costs: ClickHouse (cheapest) < Cassandra < Elasticsearch (most expensive)
- Object Storage: Unlimited but slowest query performance
Sampling Strategy Requirements
- High-traffic services: 1-10% probabilistic sampling
- Error traces: Always retain via tail-based sampling
- Slow requests: Always retain via adaptive sampling
- Volume management: Essential to prevent collector death
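Tail-based sampling is what keeps every error and slow request while dropping the boring bulk. A sketch using the contrib collector's tail_sampling processor, with illustrative thresholds:

```yaml
# Sketch: keep all errors, keep anything slower than 2s, sample 5% of the rest.
processors:
  tail_sampling:
    decision_wait: 10s        # buffer spans this long before deciding per trace
    num_traces: 50000
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow-requests
        type: latency
        latency:
          threshold_ms: 2000
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```

Tail sampling only works if every span of a trace reaches the same collector instance, so it belongs on a gateway fronted by trace-aware load balancing (for example the load-balancing exporter), not on per-node agents.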
Security Implementation
Required Security Measures
- mTLS for OTLP communications (security team requirement)
- Kubernetes network policies (default allow-all is dangerous)
- Grafana RBAC (prevent developers from accessing billing dashboards)
- Data sanitization processors to remove sensitive information
- Cross-cluster service discovery configuration for multi-tenant deployments
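A sketch of the mTLS and sanitization items - TLS with required client certificates on the OTLP receiver, plus an attributes processor that strips or hashes sensitive fields. Certificate paths and attribute keys are placeholders for whatever your security team actually mandates.

```yaml
# Sketch: mTLS on the OTLP receiver and attribute sanitization.
receivers:
  otlp:
    protocols:
      grpc:
        tls:
          cert_file: /certs/collector.crt
          key_file: /certs/collector.key
          client_ca_file: /certs/ca.crt   # requiring client certs = mTLS
processors:
  attributes/sanitize:
    actions:
      - key: http.request.header.authorization
        action: delete                     # never store auth headers
      - key: user.email
        action: hash                       # keep cardinality, drop the PII
```

Remember to add attributes/sanitize to the trace pipeline's processor list, or it does nothing.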
Integration Complexity Matrix
Integration Type | Complexity | Performance Impact | Vendor Lock-in | Use Case |
---|---|---|---|---|
OpenTelemetry + Jaeger + Grafana | Moderate (3 weeks) | Low (2-5%) | None | Complete observability |
Proprietary APM | Low (but expensive) | Medium (5-15%) | High | Budget > time |
ELK Stack + APM | High (YAML nightmare) | Medium (3-10%) | Medium | Elasticsearch masochists |
Cloud Provider Solutions | Low (until customization) | Variable (black box) | High | Cloud-native convenience |
Critical Warnings and Operational Intelligence
What Documentation Doesn't Tell You
- Default configurations will fail under production load
- Collectors require health monitoring or failures go unnoticed
- Memory leaks are common in misconfigured deployments
- Service mesh integration requires understanding both Istio and OpenTelemetry configs
- Storage backends have different reliability characteristics under load
Migration Considerations
- Dual deployment strategy: Run alongside existing APM during transition
- Automatic instrumentation: Minimizes code changes but can break authentication
- Dashboard migration: Gradual transition while maintaining existing tooling
- Multi-cluster deployments: Require gateway aggregation points and centralized storage
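Dual deployment is mostly an exporter problem: fan the same trace pipeline out to Jaeger and your existing APM until you trust the new stack. The APM endpoint and header below are placeholders for your vendor's OTLP intake.

```yaml
# Sketch: ship identical traces to Jaeger and a legacy APM during migration.
exporters:
  otlp/jaeger:
    endpoint: jaeger-collector.observability.svc:4317   # illustrative
    tls:
      insecure: true
  otlphttp/apm:
    endpoint: https://apm.example.com:4318               # placeholder vendor endpoint
    headers:
      api-key: ${env:APM_API_KEY}                        # injected via Secret
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger, otlphttp/apm]
```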
Troubleshooting Hierarchy
1. Check collector logs first (usually the problem)
2. Verify OTLP endpoint reachability
3. Enable debug logging temporarily (fills disk quickly)
4. Confirm collectors are actually running (resource limits kill silently)
5. Monitor collector health religiously
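When you do turn on debug output, this is roughly what it looks like in the collector config. Both knobs are extremely chatty - revert them once you have your answer.

```yaml
# Sketch: verbose self-logging plus a debug exporter alongside the real one.
exporters:
  debug:
    verbosity: detailed
service:
  telemetry:
    logs:
      level: debug            # temporary; this fills disks fast
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp, debug]   # keep the real exporter, add debug next to it
```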
Recommended Implementation Path
Phase 1: Foundation (Week 1)
- Deploy minimal OpenTelemetry Collector with basic configuration
- Set up Jaeger v2 with ClickHouse backend
- Configure basic Grafana dashboards
- Implement health monitoring for all components
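To keep Grafana reproducible from day one, provision the Jaeger data source from a file instead of clicking it together in the UI. The URL assumes a jaeger-query service in an observability namespace - adjust it to your deployment.

```yaml
# Sketch: Grafana data source provisioning file
# (mounted under /etc/grafana/provisioning/datasources/).
apiVersion: 1
datasources:
  - name: Jaeger
    type: jaeger
    access: proxy
    url: http://jaeger-query.observability.svc:16686   # illustrative service name
    isDefault: false
```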
Phase 2: Production Hardening (Weeks 2-3)
- Configure proper resource limits based on traffic patterns
- Implement sampling strategies (start with 5% probabilistic)
- Set up alerting for collector health and pipeline failures (see the sketch after this list)
- Configure security (mTLS, RBAC, network policies)
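For the alerting item above, a sketch using the prometheus-operator PrometheusRule CRD. It assumes you already scrape the collector's own metrics, and exact metric names vary between collector versions, so verify what your scrape actually exposes.

```yaml
# Sketch: alert when the collector refuses spans or fails to export them.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: otel-collector-health
  namespace: observability
spec:
  groups:
    - name: otel-collector
      rules:
        - alert: CollectorExportFailures
          expr: rate(otelcol_exporter_send_failed_spans[5m]) > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "OpenTelemetry Collector is failing to export spans"
        - alert: CollectorRefusingSpans
          expr: rate(otelcol_receiver_refused_spans[5m]) > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Collector refusing spans (often memory_limiter back-pressure)"
```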
Phase 3: Scale and Optimize (Ongoing)
- Tune sampling rates based on production data
- Optimize storage retention policies
- Implement custom business metrics
- Monitor and adjust resource allocations
Success Criteria and Validation
Deployment Success Indicators
- Trace completeness: >95% of requests produce complete traces
- Collector uptime: >99.9% availability with automatic restart
- Query performance: Dashboard loads <10 seconds
- Resource stability: No OOM kills or CPU throttling
- Storage performance: Query response times <5 seconds
Common Failure Patterns to Monitor
- Trace volume spikes killing Elasticsearch
- Auto-instrumentation breaking application authentication
- Grafana memory consumption during complex dashboard rendering
- Collector resource exhaustion during traffic surges
- Storage backend timeouts during high query loads
Useful Links for Further Investigation
Resources That Don't Suck
Link | Description |
---|---|
OpenTelemetry Documentation | The official docs - they're actually decent, which is rare. Skip the conceptual bullshit and go straight to the [language-specific SDKs](https://opentelemetry.io/docs/languages/). The [collector configuration](https://opentelemetry.io/docs/collector/configuration/) section will save you hours of trial and error. |
Jaeger v2 Documentation | Finally updated for v2. The [migration guide from v1](https://www.jaegertracing.io/docs/2.10/deployment/) doesn't lie about the complexity. Start with the [getting started](https://www.jaegertracing.io/docs/2.10/getting-started/) if you're new, skip the theory. |
Grafana Observability Documentation | Their docs used to suck, but they're better now. The [data source configuration](https://grafana.com/docs/grafana/latest/datasources/) section is where you'll spend most of your time. The [alerting docs](https://grafana.com/docs/grafana/latest/alerting/) are actually readable. |
Kubernetes Observability Guide | Official K8s docs for logging architecture. Dry as hell but accurate. The [resource management](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/) section will prevent your pods from getting OOMKilled. |
OpenTelemetry Operator | The operator works great until you need custom configs. Then you're deep in CRD hell. But for basic deployments, it's solid. Check the [releases page](https://github.com/open-telemetry/opentelemetry-operator/releases) before upgrading - some versions have broken our deployments. |
OpenTelemetry Helm Charts | I've used these in production, they work. Don't trust the default values though - you'll need to customize [resource limits](https://github.com/open-telemetry/opentelemetry-helm-charts/tree/main/charts/opentelemetry-collector) or your collectors will die under load. |
Jaeger Operator | Works for basic deployments. The [storage backend configuration](https://github.com/jaegertracing/jaeger-operator#storage-backends) is where most people fuck up. Read the docs twice before going to production. |
Grafana Helm Charts | Community charts that don't suck. The [grafana/grafana](https://github.com/grafana/helm-charts/tree/main/charts/grafana) chart is solid for production. Just don't forget persistence or you'll lose all your dashboards. |
OpenTelemetry Demo Application | This actually works. Full microservices setup with real instrumentation. Clone it, run it, see how the pieces fit together. Way better than trying to figure it out from documentation. |
Kubernetes OTLP Example | Real configs that work in production. The [DaemonSet config](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/examples/kubernetes/otel-collector-daemonset.yaml) is what you want. Don't use the default resource limits - they're too low. |
Grafana Observability Dashboards | Community dashboards are hit or miss but [these ones don't suck](https://grafana.com/grafana/dashboards/15983-opentelemetry-collector/). Import them as a starting point, then customize. Don't trust the default queries - half of them are wrong. |
OpenTelemetry Community | Join the [Slack workspace](https://cloud-native.slack.com/) - #opentelemetry channel has people who actually know what they're talking about. The [SIG meetings](https://opentelemetry.io/community/meetings/) are boring but useful if you're doing complex integrations. |
CNCF Jaeger Project | The [roadmap](https://www.jaegertracing.io/roadmap/) tells you what's coming. The [GitHub issues](https://github.com/jaegertracing/jaeger/issues) are where you'll find solutions to the problems you're about to hit. |
Grafana Community Forums | Better than Stack Overflow for Grafana problems. The [observability section](https://community.grafana.com/c/grafana/observability/35) has people who've solved the same problems you're facing. |
OpenTelemetry Specification | Dry technical specs that you'll reference when building [custom instrumentation](https://opentelemetry.io/docs/specs/otel/trace/api/). The [semantic conventions](https://opentelemetry.io/docs/specs/semconv/) are crucial if you want consistent attributes across your stack. |
Jaeger Deployment Guide | The [production deployment section](https://www.jaegertracing.io/docs/latest/deployment/#production-deployment) is gold. Follow it or you'll be troubleshooting storage issues at 3am. The [scaling strategies](https://www.jaegertracing.io/docs/latest/deployment/#scaling) will save your ass when traffic spikes. |
Grafana Academy | Actually useful tutorials. The [dashboard creation](https://grafana.com/tutorials/grafana-fundamentals/) course teaches you the right way instead of clicking randomly until something works. |
OpenTelemetry Registry | Find [instrumentation libraries](https://opentelemetry.io/ecosystem/registry/?component=instrumentation&language=all) that actually work. The [vendor integrations](https://opentelemetry.io/ecosystem/registry/?component=exporter&language=all) list shows what's supported and what's experimental (avoid the experimental ones). |
Jaeger Performance Testing | Load testing tools that show you where your deployment will break. Run these before production or you'll find out the hard way during Black Friday. The [capacity planning scripts](https://github.com/jaegertracing/jaeger-performance/tree/master/scripts) are worth their weight in gold. |
OpenTelemetry Collector Builder | Build custom collector images with only the [components you need](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main). Reduces image size and attack surface. The [build configs](https://github.com/open-telemetry/opentelemetry-collector/tree/main/cmd/builder/test) show you how it's done. |