Jaeger: Distributed Tracing for Microservices - AI-Optimized Reference
Core Problem and Solution
Problem: Debugging a distributed system without visibility into how requests flow across services means hours or days of blind troubleshooting
Solution: Jaeger provides distributed tracing showing complete request paths with timing and dependency information
Critical Production Realities
Storage Costs and Configuration
- Budget Impact: Roughly 100GB of storage per million spans; expect $500-2,000/month minimum for a medium-sized system
- Default Configuration Failure: Elasticsearch defaults consume 64GB+ RAM and will OOM nodes
- Production Settings Required:
  - Set retention policies immediately (7-day minimum)
  - Configure index lifecycle management for Elasticsearch
  - Implement adaptive sampling, starting at 1% in production (see the sketch below)
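Adaptive sampling needs collector-side setup, but the simplest way to start at roughly 1% is static head sampling configured in the instrumented services. A minimal sketch using the standard OpenTelemetry SDK environment variables; the collector hostname is a placeholder for your deployment:

```sh
# Head-sample ~1% of new traces at the SDK; downstream services follow the parent's decision.
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.01
# Placeholder endpoint - point this at an OTLP-enabled Jaeger collector.
export OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger-collector:4317
```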
Operational Breaking Points
- UI Performance: Becomes unusable with 1000+ spans per trace, crashes browser tabs at 10k+ spans
- Collector Failure Modes:
  - Crashes on malformed spans or 50MB+ payloads
  - Default queue size insufficient for production load
  - Required settings: --collector.queue-size=100000 and --collector.num-workers=100 (full invocation under Configuration That Actually Works in Production)
- Service Discovery Issues: Inconsistent service naming creates unusable dependency graphs
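On the naming point: dependency graphs stay usable only if every instance of a service reports exactly the same service name. One low-effort way to enforce that is to set it explicitly through the standard OpenTelemetry resource environment variables instead of trusting SDK defaults; the values below are placeholders:

```sh
# Stable, explicit service identity - avoids SDK fallback names like "unknown_service".
export OTEL_SERVICE_NAME=checkout-service
# Optional extra resource attributes; keep the naming scheme identical across every environment.
export OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production,service.version=1.4.2
```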
Architecture Decisions and Trade-offs
Jaeger v2 vs v1 (Critical Migration)
- v1 End-of-Life: December 31, 2025
- v2 Benefits: Eliminates agent deployment complexity, native OpenTelemetry support, single binary deployment
- Migration Impact: Agent-based deployments require complete reconfiguration
Storage Backend Selection
Backend | Pros | Cons | When to Use |
---|---|---|---|
Elasticsearch | Easier to operate, familiar | Expensive, memory-intensive | < 50 services |
Cassandra | Massive scale capable | Requires specialized expertise | Large scale, dedicated team |
ClickHouse | Faster, cheaper than ES | Harder to operate | Cost optimization priority |
Failure Scenarios and Consequences
High-Impact Failures
- Collector OOM: Silently drops traces, creates debugging blind spots
- Storage Quota Exceeded: Complete tracing shutdown, no request visibility
- UI Timeout on Large Traces: Makes debugging complex distributed transactions impossible
- Sampling Misconfiguration: Either bankrupts storage budget or loses critical debugging data
Common Misconceptions
- "Tracing is safe to deploy": OpenTelemetry overhead can kill performance, requires canary deployments
- "Default settings work in production": All defaults are development-focused and will fail under load
- "Logs are equivalent to traces": Logs show service events, traces show request causality across services
Implementation Requirements
Prerequisites Not in Documentation
- DevOps expertise for Elasticsearch tuning
- Understanding of sampling strategies and storage mathematics
- Traditional monitoring for Jaeger components themselves (scrape config sketched after this list)
- Network partition and cluster split-brain recovery procedures
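For the monitoring prerequisite above: Jaeger v1 components expose Prometheus metrics on their admin ports. A minimal scrape sketch, assuming the collector and query service are reachable at these hostnames and default admin ports:

```yaml
# prometheus.yml fragment - watch Jaeger's own components, not just the services they trace.
scrape_configs:
  - job_name: jaeger-collector
    static_configs:
      - targets: ['jaeger-collector:14269']   # collector admin port, serves /metrics
  - job_name: jaeger-query
    static_configs:
      - targets: ['jaeger-query:16687']       # query service admin port
```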
Resource Requirements
- Time Investment: Weekend initial setup, 0.5 FTE ongoing maintenance
- Technical Expertise: Elasticsearch operations, Kubernetes deployment, OpenTelemetry instrumentation
- Infrastructure: Separate cluster recommended for observability isolation
Performance Thresholds with Real-World Impact
- 1000+ spans per trace: UI becomes unusable for debugging
- 1M spans/day: Requires ~50GB storage/month minimum
- High-traffic services: Require 0.1% sampling or lower to control costs
- Elasticsearch heap: Must be tuned per production load or causes node failures
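On the heap point: standard Elasticsearch guidance is to pin min and max heap to the same value and keep it at or below roughly half of node RAM (and under ~31GB). A hedged sketch for a container-based node with about 16GB of RAM; single-node discovery is used only to keep the example self-contained:

```sh
# Pin the Elasticsearch JVM heap explicitly instead of trusting defaults.
docker run -d --name elasticsearch \
  -e ES_JAVA_OPTS="-Xms8g -Xmx8g" \
  -e discovery.type=single-node \
  docker.elastic.co/elasticsearch/elasticsearch:7.17.10
```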
Decision Criteria for Alternatives
Choose Jaeger When:
- Running 10+ microservices with complex interactions
- Need cost control and infrastructure ownership
- Team has DevOps capacity for operational complexity
- OpenTelemetry standardization is priority
Choose Commercial APM When:
- < 50 services total
- Limited DevOps resources
- Budget allows $25+/host monthly costs
- Superior UX and alerting features required
Critical Warnings from Production Experience
What Official Documentation Omits:
- Storage costs climb steeply with trace volume and quickly dominate the observability budget
- UI performance degrades significantly with complex traces
- Sampling configuration directly impacts both costs and debugging capability
- Collector resource limits must be set or system will OOM under load
- Service naming inconsistencies make dependency graphs unusable
Breaking Points and Thresholds:
- Traces > 1000 spans: UI performance issues
- Services > 100: Dependency graph becomes unreadable
- Sampling < 0.1%: May miss critical debugging traces
- Retention > 30 days: Storage costs become prohibitive for most teams
Configuration That Actually Works in Production
Collector Production Settings:
--collector.queue-size=100000
--collector.num-workers=100
--memory.max-traces=1000000
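A hedged sketch of those flags on a v1 collector container backed by Elasticsearch; the image tag, hostnames, and Elasticsearch URL are placeholders for your environment. Note that --memory.max-traces only applies to the in-memory storage backend, so it is omitted here:

```sh
# Jaeger v1 collector with production-leaning queue and worker settings.
docker run -d --name jaeger-collector \
  -e SPAN_STORAGE_TYPE=elasticsearch \
  -e ES_SERVER_URLS=http://elasticsearch:9200 \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 4317:4317 -p 4318:4318 \
  jaegertracing/jaeger-collector:1.57.0 \
  --collector.queue-size=100000 \
  --collector.num-workers=100
```

Give the container an explicit memory limit as well; a 100,000-span queue is only safe if the collector has headroom to hold it.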
Sampling Strategy:
- Start: 1% production sampling
- High-traffic services: 0.1% or lower (see the strategies-file sketch below)
- Critical paths: Force 100% sampling with custom logic
- Monitor: Adjust based on storage costs vs debugging needs
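If clients pull their sampling configuration from the collector (Jaeger remote sampling), the tiers above map onto a strategies file passed with --sampling.strategies-file. A sketch with placeholder service names; forcing 100% sampling on critical paths still needs application-level logic, because this file only sets probabilities:

```json
{
  "default_strategy": { "type": "probabilistic", "param": 0.01 },
  "service_strategies": [
    { "service": "high-traffic-api", "type": "probabilistic", "param": 0.001 },
    { "service": "payments", "type": "probabilistic", "param": 0.05 }
  ]
}
```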
Storage Retention:
- Standard: 7-day retention minimum (cleanup job sketched below)
- High-volume: 3-day retention with intelligent sampling
- Critical services: Extended retention for specific service subsets
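With the Elasticsearch backend, retention is not a single setting: old daily indices have to be deleted explicitly. A sketch of a daily cleanup using the project's es-index-cleaner image, assuming Elasticsearch is reachable at the URL shown; run it from cron or a Kubernetes CronJob:

```sh
# Delete Jaeger indices older than 7 days.
# Set ROLLOVER=true instead if you use the rollover index naming scheme.
docker run --rm \
  -e ROLLOVER=false \
  jaegertracing/jaeger-es-index-cleaner:latest \
  7 http://elasticsearch:9200
```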
Operational Intelligence Summary
Jaeger solves the distributed systems debugging problem effectively but requires significant operational investment. Success depends on proper sampling configuration, storage planning, and understanding performance limitations. The transition to v2 eliminates major deployment complexity but requires migration planning. Commercial alternatives may be more cost-effective for smaller deployments, but Jaeger provides superior cost control and flexibility at scale with appropriate DevOps investment.
Useful Links for Further Investigation
Essential Jaeger Resources (And Which Ones Actually Matter)
Link | Description |
---|---|
Jaeger Documentation | Actually decent technical docs, though they assume you know what you're doing. The architecture section is solid if you want to understand how the pieces fit together. |
GitHub Repository | Where the real action is. Issues section tells you what's actually broken vs. what the docs claim works. |
Jaeger v2 Release Blog | Essential reading if you're not running legacy v1. Explains why they gutted the agent architecture. |
Getting Started Guide | Basic as hell, assumes you want to run everything in development mode forever |
Official Homepage | Marketing fluff, go straight to the docs |
BetterStack's Practical Guide | One of the few guides that covers production deployment without hand-waving the hard parts. Shows you real Elasticsearch configuration. |
Last9's Monitoring Guide | Critical if you want to know when your Jaeger setup is fucked. Includes the storage calculation formulas the official docs never mention. |
Adaptive Sampling Deep Dive | Required reading unless you enjoy bankrupt storage budgets from tracing everything. |
Spring Boot Tutorial | Solid if you're stuck in Java land. Actually shows working code instead of theoretical examples. |
OpenObserve Beginner Guide | Fine for absolute beginners, but you'll outgrow it fast. |
Native OTLP Support Announcement | Explains why Jaeger v2 doesn't suck as much as v1. The agent removal alone makes this worth upgrading for. |
OpenTelemetry Migration Guide | Shows you how to escape the OpenTracing nightmare. Go examples but concepts apply everywhere. |
OpenTelemetry Documentation | Comprehensive but reads like a specification. Good reference, terrible tutorial. |
Language SDKs | Hit or miss depending on your language. Python and Go are solid, others vary. |
Jaeger Kubernetes Operator | Saves you from YAML hell but you still need to understand Elasticsearch tuning. Read the issues before deploying. |
Istio Integration | Works out of the box if you're already running Istio. One of the few service mesh features that actually delivers value. |
ClickHouse Backend Guide | For when Elasticsearch costs are bankrupting you. ClickHouse is faster but harder to operate. |
Complete Observability Stack | Shows Prometheus + Grafana + Jaeger integration that actually works in production |
Microsoft .NET Example | Surprisingly good for a Microsoft tutorial. Shows real configuration, not toy examples. |
CNCF Slack #jaeger | Most active community. Engineers who actually run this in production hang out here. Way better than Stack Overflow for Jaeger questions. |
GitHub Issues | Where bugs and feature requests live. Search before posting - good chance your problem already exists. |
Mailing List | Old school, low traffic. Slack is more active. |
GitHub Discussions | Newer, trying to compete with Slack. Not there yet. |
Security Audits | Third-party security reviews. Actually pretty thorough for an open source project. |
Security Best Practices | CNCF compliance badges. Box-checking exercise but covers the basics. |
Jaeger at 10 Years | History lesson from August 2025. Shows how far the project has come from Uber's internal tool. |
v1 End-of-Life Announcement | v1 dies December 31, 2025. Start planning your v2 migration if you haven't already. |