OpenTelemetry: AI-Optimized Technical Reference
Configuration That Actually Works in Production
SDK Setup by Language
Java
- Command:
java -javaagent:opentelemetry-javaagent.jar -jar your-app.jar
- Performance Impact: 3-8% CPU overhead on a 10k req/sec API
- Memory Overhead: ~50MB baseline
- Critical Failure: Breaks with custom classloaders and Spring Boot 3.2.0 actuator endpoints
- Solution: Upgrade to Spring Boot 3.2.1+ or use manual instrumentation
Python
- Command:
opentelemetry-bootstrap -a install && opentelemetry-instrument python app.py
- Performance Impact: +23ms latency (15ms baseline → 38ms instrumented)
- Memory Overhead: ~30MB
- Critical Failure: Auto-instrumentation conflicts with gevent
- Solution: Use manual instrumentation for gevent applications (minimal sketch below)
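A minimal manual-instrumentation sketch for a gevent worker, assuming the opentelemetry-sdk and opentelemetry-exporter-otlp-proto-grpc packages are installed; the service name, endpoint, and handler are placeholders, not part of any framework integration.

```python
# Build the SDK pipeline yourself instead of relying on opentelemetry-instrument.
# If you use gevent monkey-patching, run it before these imports, as usual.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "payments-api"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def handle_request():
    # Wrap each unit of work in a span explicitly.
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("http.request.method", "GET")
        ...  # application logic
```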
Node.js
- Status: Unreliable auto-instrumentation
- Critical Failure: ESM modules break auto-instrumentation completely
- Solution: Use CommonJS or manual instrumentation
- Working Components: Express instrumentation is stable
Go
- Approach: Manual instrumentation only
- Advantage: Clean, predictable API
- Trade-off: More development overhead
Collector Deployment Patterns
Mode | RAM Usage | Failure Mode | Use Case |
---|---|---|---|
Sidecar | 200MB per pod | Pod resource exhaustion | Low-latency requirements |
Gateway | Shared resources | Single point of failure | Cost optimization |
Agent | Per-node baseline | CNI networking issues | Balanced approach |
Critical Configuration
processors:
  memory_limiter:
    check_interval: 1s   # required; the collector rejects a memory_limiter without it
    limit_mib: 512       # hard cap before the collector starts refusing data
  batch:
    timeout: 1s
    send_batch_size: 1024
# Order matters: put memory_limiter first in each pipeline's processors list.
Sampling Configuration
Production Settings
- Start: 1% sampling (trace_id_ratio_based: 0.01); see the sampler sketch after this list
- Storage Cost: 2TB of traces/month for a 50-service architecture ≈ $500/month on S3
- Critical Warning: Head-based sampling may miss important errors
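A minimal SDK-level sketch of that 1% ratio using the Python SDK's ParentBased and TraceIdRatioBased samplers; the same effect comes from OTEL_TRACES_SAMPLER=parentbased_traceidratio with OTEL_TRACES_SAMPLER_ARG=0.01.

```python
# 1% head-based sampling: sample 1 in 100 new traces, but always follow the
# parent's decision for downstream spans so traces stay complete.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

sampler = ParentBased(root=TraceIdRatioBased(0.01))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```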
Resource Requirements
Time Investment
- Setup Time Reality: 1-2 weeks (not "2-3 days" as marketed)
- Learning Curve: Steep; plan for 3-4 weeks of team ramp-up
- Operational Overhead: Significant for self-hosted solutions
Cost Structure
Component | Monthly Cost | Scale Factor |
---|---|---|
Self-hosted (Jaeger + Prometheus) | $200-1k | Infrastructure + engineer time |
Grafana Cloud | ~$300/month | 10GB/day ingestion |
Commercial APM | $15k-50k+ | High traffic penalty |
OpenTelemetry framework | $0 | Storage costs apply |
Performance Impact Thresholds
- Acceptable: 1-5% CPU overhead
- High-frequency operations: 15% performance hit with 1,000+ instrumented calls per request
- Memory baseline: 200MB collector + 50-100MB per 1k spans/sec
Critical Warnings
Collector Memory Leaks
- Affected Version: 0.89.0 has a memory leak in the tail sampling processor
- Fix: Upgrade to 0.90.0+ immediately
- Monitoring: Always configure the memory_limiter processor
High Cardinality Metrics
- Failure Scenario: User IDs as labels → 2M unique time series → 500GB Prometheus storage
- Prevention: Remove user IDs, request IDs, timestamps from metric labels
- Solution: Design for aggregation, not individual tracking (see the sketch below)
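A short sketch of the difference using the Python metrics API; the meter name, metric name, and attribute keys are illustrative, not prescribed.

```python
# Keep metric attributes bounded: route templates and status classes aggregate,
# per-user or per-request IDs explode into millions of time series.
from opentelemetry import metrics

meter = metrics.get_meter("checkout")
requests_counter = meter.create_counter("http.server.requests")

def record_request(user_id: str, route_template: str, status_code: int):
    # Good: a handful of possible values per attribute
    requests_counter.add(
        1,
        {"http.route": route_template, "status_class": f"{status_code // 100}xx"},
    )
    # Bad: one series per user -- don't do this
    # requests_counter.add(1, {"user.id": user_id})
```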
Network Timeout Issues
- Default Timeout: 10 seconds (insufficient for production)
- Fix: Set 30+ second timeouts
- Java: otel.exporter.otlp.timeout=30000
- Python: OTEL_EXPORTER_OTLP_TIMEOUT=30000 (a code-level alternative is sketched below)
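If the exporter is built in code rather than via environment variables, the Python OTLP gRPC exporter also accepts a timeout argument (in seconds); a small sketch with a placeholder endpoint:

```python
# Raise the export timeout when configuring the exporter programmatically.
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

exporter = OTLPSpanExporter(
    endpoint="otel-collector.internal:4317",  # placeholder collector address
    insecure=True,
    timeout=30,  # seconds per export call, up from the 10s default
)
```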
Context Propagation Failures
- Root Cause: Missing spans due to broken trace context (a manual propagation sketch follows this list)
- Debug: Enable collector debug logging (service.telemetry.logs.level: debug)
- Warning: Debug logging generates excessive output
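When auto-instrumentation isn't carrying context across a hop, it can be propagated by hand with the configured global propagator; a sketch using the requests library for the outbound call (the URL and handlers are placeholders):

```python
# Manual W3C trace-context propagation across an HTTP hop.
import requests

from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)

def call_downstream(url: str):
    with tracer.start_as_current_span("call_downstream"):
        headers = {}
        inject(headers)  # writes traceparent/tracestate into the dict
        return requests.get(url, headers=headers)

def handle_incoming(request_headers: dict):
    ctx = extract(request_headers)  # rebuild the remote context
    with tracer.start_as_current_span("handle_incoming", context=ctx):
        ...  # this span joins the caller's trace
```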
Decision Criteria
Choose OpenTelemetry When
- Vendor lock-in is unacceptable
- Multi-language environment (SDKs for 11+ languages)
- Team can handle operational complexity
- Long-term cost control is priority
Avoid OpenTelemetry When
- Team lacks distributed systems expertise
- Need immediate production deployment
- Simple monolith application
- Budget allows commercial APM without vendor concerns
Self-hosted vs Commercial Backends
Self-hosted (Jaeger + Prometheus)
- Pros: Total control, predictable costs
- Cons: Operational burden, capacity planning required
- Expertise Required: Kubernetes, storage optimization, performance tuning
Commercial Backends
- Pros: Managed infrastructure, support
- Cons: Vendor lock-in risk, cost scaling issues
- Best Options: Grafana Cloud (reasonable pricing), AWS X-Ray (native AWS integration)
Breaking Points and Failure Modes
Known Version Issues
- Spring Boot 3.2.0: Breaks custom actuator endpoints
- Collector 0.89.0: Memory leak in tail sampling processor
- Node.js ESM: Auto-instrumentation completely broken
Production Gotchas
- Kubernetes CNI: Host networking breaks in some configurations
- Resource Limits: Configs that work in development fail under production resource constraints
- Service Mesh: Sidecars can interfere with collector networking
Debugging Missing Spans
- Check sampling: 99% of missing spans are due to sampling configuration (a quick span-output check is sketched after this list)
- Verify timeouts: Network issues between app and collector
- Examine context: Trace context broken in service chain
- Monitor collector: Overwhelmed collectors drop data silently
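To confirm the app is producing spans at all before blaming sampling or the collector, a console exporter can be bolted onto the existing provider; a sketch assuming the Python SDK is already configured:

```python
# Print spans locally so you can tell "app never emitted it" apart from
# "collector dropped it".
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = trace.get_tracer_provider()
if isinstance(provider, TracerProvider):
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))

with trace.get_tracer("smoke-test").start_as_current_span("smoke-test-span"):
    pass  # a JSON span should appear on stdout if the SDK is wired up
```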
Implementation Reality
Semantic Conventions Status (September 2025)
- HTTP spans: Stable and adopted
- Database operations: Stabilized in 2025
- RPC calls: Still unstable despite roadmaps
- Legacy compatibility: Expect http.method and http.request.method to coexist (stopgap sketch below)
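For manually created spans, one stopgap while dashboards migrate is to emit both attribute names; a sketch (the span name is a placeholder):

```python
# Emit both the current and legacy HTTP method attributes on a manual span.
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("GET /users") as span:
    span.set_attribute("http.request.method", "GET")  # current semantic convention
    span.set_attribute("http.method", "GET")  # legacy name, kept for old queries
```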
Community Support Quality
- GitHub Issues: Well-documented, active maintainer response
- Slack Community: Active support, maintainers respond
- Documentation: Improving but has gaps, Stack Overflow often required
Update Risk Management
- Rule: Never update on Friday/Monday
- Reality: Something will break with updates
- Strategy: Staged rollouts, quick rollback capability
Vendor Ecosystem (90+ Options)
Reliable Backends
- Jaeger + Prometheus: Self-hosted standard
- Grafana Cloud: Managed, reasonable pricing until you hit ingestion limits
- AWS X-Ray: Native AWS support, confusing sampling rules
- Elastic APM: Good for log-heavy workloads
Integration Quality
- Data ingestion: Most vendors support OTLP
- Feature parity: Varies significantly between vendors
- Migration: OpenTelemetry enables backend switching without code changes
Success Metrics
Technical KPIs
- Trace capture rate: >95% of critical transactions
- Query performance: <2 second trace lookup
- Resource overhead: <5% application performance impact
- Storage efficiency: Controlled cardinality metrics
Business Value
- MTTR reduction: Faster incident resolution
- Vendor flexibility: Backend switching capability
- Cost predictability: Controlled observability spend
- Engineering efficiency: Standardized instrumentation across services
Useful Links for Further Investigation
Stuff That Actually Helps When You're Debugging at 3am
Link | Description |
---|---|
OpenTelemetry Documentation | Getting better but still has gaps. The getting started guides won't break your setup, but you'll spend more time on Stack Overflow anyway. |
Language SDKs | Quality is all over the place. Java docs are readable, Python covers basics, Node.js docs are basically "figure it out yourself." |
OpenTelemetry Demo | Multi-language microservices demo that actually works. Good for understanding how everything connects. Takes 10 minutes to deploy and shows traces/metrics across 11 services. |
OpenTelemetry Specification | The actual technical spec. Dense as hell but necessary if you're building integrations or trying to understand why something behaves weirdly. |
OpenTelemetry Collector Issues | Every production problem you'll encounter is documented here. Search before filing tickets - someone else hit your exact memory leak. |
Java Instrumentation Issues | Framework compatibility issues and configuration gotchas. Check here when Spring Boot breaks auto-instrumentation. |
Python Contrib Issues | Library-specific instrumentation problems. Useful when Django/Flask middleware conflicts with OTel. |
JavaScript Issues | Node.js compatibility nightmares and ESM module problems documented in excruciating detail. |
OpenTelemetry YouTube Channel | Marketing-heavy but has some technical gems. The "OTel in Practice" series is actually useful. |
Jaeger Tracing | Essential if you're self-hosting traces. The performance tuning guide will save you from storage disasters. |
Prometheus Documentation | Must-read for metrics. The storage documentation explains why your disk filled up overnight. |
OpenTelemetry Slack | Active community where maintainers actually respond. Better than GitHub issues for quick questions. |
CNCF OpenTelemetry | Governance and roadmap info. Useful for understanding project direction and which features will actually get built. |
Vendor Support List | 90+ vendors claim support but quality varies. Grafana Cloud and AWS X-Ray work well. Many others are "technically compatible." |
Adopter Case Studies | Real companies using OTel in production. Some have published case studies with scaling insights. |
OpenTelemetry Operator | Kubernetes operator that works but has quirks. Auto-instrumentation injection is convenient when it doesn't break your pods. |
Helm Charts | Community-maintained charts. The collector chart is solid, but you'll customize the config anyway. |
Grafana Stack | Self-hosted alternative to commercial APM. Grafana + Prometheus + Jaeger + Loki stack works well if you can handle the operational complexity. |
Elastic APM | OpenTelemetry-compatible and cheaper than Datadog for log-heavy workloads. Good choice if you're already using Elasticsearch. |