
What the fuck is this thing and why do I need it?

The OpenTelemetry Collector is basically a proxy that sits between your apps and whatever monitoring backend you're paying too much for this month. Instead of instrumenting your code to send data directly to Datadog/New Relic/whatever, you send it to the Collector and let it figure out where to route it.

I've been running this in production for two years after getting an $18k monthly Datadog bill that made our VP cry. The Collector saved our ass by letting us ship 90% of our data to cheaper backends while keeping critical alerts in Datadog.

How this piece of shit actually works


The Collector has three types of components that form a pipeline:


  • Receivers: Accept data in various formats (OTLP, Jaeger, Zipkin, Prometheus). OTLP works, the others... sometimes.
  • Processors: Mess with your data (sampling, filtering, enrichment). This is where you save money.
  • Exporters: Ship processed data to backends. Half of them work, the other half need weird config tweaks.
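
To make those three roles concrete, here's roughly the smallest config that wires one of each together. It's a sketch for local testing: the debug exporter just prints telemetry to the Collector's own log, so nothing leaves the box.

receivers:
  otlp:                # receiver: accepts OTLP over gRPC
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}            # processor: groups data before it's exported

exporters:
  debug: {}            # exporter: dumps whatever arrives to the collector log

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]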

Current version bullshit

As of September 2025, we're at v0.135.0 for the core. Avoid v0.89.0 - it had memory issues that caused frequent crashes. Learned that the hard way during a Black Friday deployment.

The "stability" ratings are marketing. "Alpha" means it'll randomly break. "Beta" means it breaks predictably. "Stable" means it only breaks when you update.

Two ways to deploy this nightmare

Agent Pattern (sidecar or daemon):

  • Deploy next to each application
  • Lower latency since data doesn't travel far
  • Uses more resources but fails independently
  • We use this for critical services

Gateway Pattern (centralized):

  • One big collector serving multiple apps
  • Cheaper on resources, harder to debug when it shits itself
  • Single point of failure that'll take down your entire observability stack
  • Great for non-critical services
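
If you're on Kubernetes, the two patterns map roughly to a DaemonSet (agent) and a Deployment behind a Service (gateway). A bare-bones sketch - the image tag, resource limits, and names are placeholders, not recommendations:

# Agent pattern: one collector per node
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-agent
spec:
  selector:
    matchLabels: {app: otel-agent}
  template:
    metadata:
      labels: {app: otel-agent}
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector:0.135.0
          resources:
            limits: {memory: 512Mi}
---
# Gateway pattern: a small pool of centralized collectors
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-gateway
spec:
  replicas: 2
  selector:
    matchLabels: {app: otel-gateway}
  template:
    metadata:
      labels: {app: otel-gateway}
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector:0.135.0
          resources:
            limits: {memory: 2Gi}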

Why we actually use this thing

Escaping vendor lock-in: We were stuck paying Datadog $18k/month because switching would mean rewriting instrumentation in 47 microservices. With the Collector, we switched backends in 2 hours by changing one config file.
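
For what it's worth, that 2-hour switch was really just an exporter swap. Something like this, with placeholder endpoints and header names - check what your backend actually expects:

exporters:
  # Before: everything went to vendor A
  otlp/vendor_a:
    endpoint: https://otlp.vendor-a.example:4317
    headers:
      api-key: ${env:VENDOR_A_API_KEY}
  # After: same pipeline, cheaper destination
  otlphttp/vendor_b:
    endpoint: https://otlp.vendor-b.example
    headers:
      api-key: ${env:VENDOR_B_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/vendor_b]  # the only line that changes; app code is untouched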


Cost savings: We now send 90% of our data to Grafana Cloud (much cheaper) and only keep high-value data in Datadog. Cut our observability costs by 73%.

Data processing: The Collector can sample traces, filter out noisy metrics, and redact PII before it leaves your network. Direct exporters can't do this shit.

Reliability: The Collector has built-in retries and buffering. When your backend goes down (and it will), you don't lose data. Direct exports just fail silently and you're fucked.

Shit that actually helped me:

OpenTelemetry Collector vs Your Other Shitty Options

| Feature | OpenTelemetry Collector | Direct Export (SDK) | Vendor Agents |
|---|---|---|---|
| Vendor Lock-in | Escape hatch when prices jump | Change code when switching | You're fucked |
| Setup Complexity | YAML hell but well-documented | Simple until it breaks | Works until vendor changes something |
| Resource Usage | ~500MB RAM realistically | <50MB per service | 100-300MB + mystery overhead |
| Data Processing | Actually works (sampling, filtering) | Batching and prayers | Whatever vendor allows |
| Reliability | Retries, queues, doesn't lose data | App crashes = data loss | Usually works but no control |
| Multi-Backend Support | ✅ Send same data to 3+ backends | ❌ Pick one and stick with it | ❌ Vendor prison |
| Production Features | ✅ Tail sampling saves money | ❌ Sample everything or nothing | ⚠️ Pay for what vendor gives you |
| Network Security | ✅ One hole in firewall | ❌ Every service talks to internet | ⚠️ Vendor-specific bullshit |
| Configuration | ✅ Git-controlled YAML | ❌ Code changes for config | ⚠️ UI changes you can't version |
| Cost Control | ✅ Filter before you pay | ❌ Pay for everything you generate | ⚠️ Pay whatever vendor decides |

Getting this thing running (and keeping it running)

Installation: The easy part that tricks you

Download the core distribution (don't use contrib unless you hate yourself):

## Download latest release from GitHub releases page
## Visit: https://github.com/open-telemetry/opentelemetry-collector-releases/releases/latest
curl -Lo otelcol-core [DOWNLOAD_URL_FOR_YOUR_PLATFORM]
chmod +x otelcol-core

## Test your config BEFORE running it
./otelcol-core --config=config.yaml --dry-run

## If dry-run passes, actually run it
./otelcol-core --config=config.yaml

Linux gotcha: On Ubuntu/Debian, you might need to install ca-certificates first or TLS connections will fail with cryptic errors.
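
On a fresh Ubuntu/Debian box that usually just means:

sudo apt-get update && sudo apt-get install -y ca-certificates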

macOS gotcha: Apple's security will block unsigned binaries. Run xattr -d com.apple.quarantine otelcol-core to fix it.

Config that actually works in production


This config works because I spent 3 weeks debugging why the "simple" examples don't:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  # ALWAYS put memory_limiter FIRST or your collector will eat all RAM
  memory_limiter:
    check_interval: 1s       # required - the limiter won't start without it
    limit_mib: 1024
    spike_limit_mib: 256
  batch:
    send_batch_size: 1024    # Start conservative
    timeout: 1s              # Don't wait too long
  resourcedetection:
    detectors: [env, system] # Skip docker/k8s unless you're actually using them
    timeout: 5s              # raise this if detection times out at startup

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true  # Use TLS in prod, but this works for testing
  prometheus:
    endpoint: "0.0.0.0:8889"
    # Prometheus exporter is rock solid, unlike some others

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resourcedetection, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]  
      processors: [memory_limiter, resourcedetection, batch]
      exporters: [prometheus]

  # Enable internal metrics so you can debug when it breaks
  telemetry:
    metrics:
      address: 0.0.0.0:8888

Deployment reality check


Memory planning: The docs say 200MB baseline. That's bullshit. Plan for:

  • 512MB minimum for any real workload
  • +100MB per 1K spans/second is optimistic
  • +200MB per 1K spans/second with tail sampling
  • Double it if you're using contrib components

High availability means complexity: Multiple collectors behind a load balancer sounds great until you need to debug which one is fucking up. Start with one collector and scale when you actually need it.

Production-hardened config additions

Persistent queues (or lose data when things crash):

extensions:
  file_storage:
    directory: /var/lib/otel-data  # Make sure this directory exists

exporters:
  otlp:
    endpoint: https://backend.example.com
    sending_queue:
      enabled: true
      storage: file_storage
      queue_size: 5000
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_interval: 30s

service:
  extensions: [file_storage]  # the extension does nothing unless it's listed here

Resource limits (Docker/systemd):

## Docker
docker run --memory=2g --cpus=1 otel-collector

## systemd service
[Service]
MemoryMax=2G       # MemoryLimit= is the older cgroup v1 spelling
CPUQuota=100%

Advanced stuff that'll save you money

Tail sampling - sample complete traces, not random spans:

processors:
  tail_sampling:
    decision_wait: 10s  # Wait for complete traces
    policies:
      - name: errors_always
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow_requests  
        type: latency
        latency: {threshold_ms: 1000}
      - name: sample_normal
        type: probabilistic
        probabilistic: {sampling_percentage: 5}  # 5% of normal traces
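
One catch the example above doesn't show: tail sampling only works if every span of a trace lands on the same Collector instance. If you run more than one, the usual pattern is a first tier of collectors routing by trace ID with the contrib loadbalancing exporter - roughly like this (hostnames are placeholders):

exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true  # testing only, same caveat as before
    resolver:
      static:
        hostnames:
          - sampling-collector-1:4317
          - sampling-collector-2:4317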

Drop PII before it leaves your network:

processors:
  attributes:
    actions:
      - key: user.email       # Drop email addresses
        action: delete
      - key: user.phone       # Drop phone numbers  
        action: delete
      - pattern: credit_card.* # regex - drops any attribute whose key starts with credit_card
        action: delete
      - key: http.request.body # Usually contains sensitive data
        action: delete
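
If you need a field for correlation but can't ship the raw value, the attributes processor can hash it instead of deleting it. A small sketch - user.id stands in for whatever identifier you actually carry:

processors:
  attributes/hash_ids:
    actions:
      - key: user.id
        action: hash  # replaces the value with its hash, keeps the key around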

Real production deployment tip: Start with minimal config, get it working, then add complexity. Every processor you add is another thing that can break at 2am.

Shit you'll need when this breaks:

Questions that'll save your ass at 3am

Q: Why does my Collector randomly crash with "signal: killed"?

A: You hit the OOM killer. The Collector will eat all your memory if you don't set limits. I learned this when our collector consumed 32GB of RAM and took down the entire node.

processors:
  memory_limiter:
    limit_mib: 1024  # Set this or die
    spike_limit_mib: 256

Pro tip: Set this as your first processor or you're fucked. Also, check dmesg | grep -i "killed process" to confirm it was the OOM killer.

Q: Can I send data to multiple backends without losing my mind?

A: Yeah, but it's not as easy as the docs claim. You can configure multiple exporters:

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger, otlp/datadog, zipkin]

Reality check: Each backend has different format requirements. Datadog wants specific tags, Jaeger chokes on certain attributes. You'll spend hours debugging why traces show up in one backend but not the other.

Q: What happens when this piece of shit crashes?

A: If you didn't configure persistent queues, your data is gone. I lost 4 hours of Black Friday telemetry learning this lesson.

extensions:
  file_storage:
    directory: /tmp/otel-data

exporters:
  otlp:
    endpoint: https://backend.example.com
    sending_queue:
      enabled: true
      storage: file_storage  # This saves your ass

service:
  extensions: [file_storage]  # and this makes the extension actually load

Without persistent queues, crashes = data loss. Period.

Q: How do I debug when the Collector uses 8GB of RAM?

A: First, check if you're using v0.89.0 - it had memory issues. If not:

## Check internal metrics (run these ON the collector host)
curl localhost:8888/metrics | grep memory

## Profile the collector (needs the pprof extension enabled; its default endpoint is localhost:1777)
curl -o cpu.prof http://<COLLECTOR_HOST>:1777/debug/pprof/profile?seconds=30
go tool pprof cpu.prof

## For local debugging, replace <COLLECTOR_HOST> with localhost
## Profiling guide: https://opentelemetry.io/docs/collector/troubleshooting/#performance-profiling

Common causes:

  • Batch processor misconfigured: Set reasonable send_batch_size (1024-8192)
  • No memory limiter: Set it or the collector will eat everything
  • High cardinality metrics: Filter them out or prepare for pain

Q: Should I use core or contrib? (Spoiler: core)

A: Use core. The contrib distribution is 200+ components that mostly don't work properly. Core has ~40 components that actually function.

I deployed contrib once and spent a week debugging why the sqlquery receiver randomly stopped working. Turned out it was marked "alpha" for a reason.

Q: How do I stop high-cardinality metrics from destroying my backend?

A: High-cardinality metrics will bankrupt you. User IDs, session IDs, request IDs - all of these will generate millions of unique metric series.

processors:
  filter/kill_cardinality:
    error_mode: ignore
    metrics:
      datapoint:
        # Drop datapoints carrying high-cardinality attributes entirely
        # (if you'd rather keep the metric and strip the label, that's the transform processor's job)
        - 'attributes["user_id"] != nil'
        - 'attributes["session_id"] != nil'
        - 'attributes["request.uuid"] != nil'

Story time: We had a developer accidentally add user IDs to metrics. Our Prometheus storage grew from 100GB to 2TB overnight and the queries became unusable.

Q: Kubernetes deployment is a nightmare, right?

A: The OpenTelemetry Operator works... sometimes. When it doesn't, you get no error messages and your collector just doesn't start.

## Debug operator issues
kubectl logs -n opentelemetry-operator-system deployment/opentelemetry-operator-controller-manager

## Check if your collector actually started  
kubectl get pods -l app.kubernetes.io/name=otelcol

## Get the real error messages
kubectl describe pod your-failing-collector-pod

Common gotcha: The operator ignores YAML syntax errors silently. Validate your config first:

./otelcol-core --config=config.yaml --dry-run

Q: What versions will ruin your day?

A: DO NOT USE:

  • v0.89.0: Memory issues that cause frequent crashes
  • v0.82.x: Performance problems with high throughput
  • v0.78.2: Reported issues with batch processor reliability
  • Any version ending in .0: Wait for .1 or .2, first releases always have bugs

Currently safe: v0.135.1+ as of September 2025, but check the release notes for latest gotchas.

Q: Why does my collector randomly stop receiving data?

A: Check the error you're probably ignoring:

## Check collector logs for this specific error
grep "connection refused" /var/log/otel-collector.log

Common causes:

  • Backend is down: Your exporter fails, collector stops processing
  • Wrong endpoint: OTLP vs OTLP/HTTP confusion (port 4317 vs 4318)
  • TLS issues: Certificates expired or misconfigured
  • Network policies: Kubernetes blocking traffic you thought was allowed

Debug commands that actually help (run these on the collector host):

## Test if collector is accepting data on the OTLP/HTTP endpoint
curl -X POST http://<COLLECTOR_HOST>:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{"resourceSpans":[]}'

## For local testing, replace <COLLECTOR_HOST> with localhost
## OTLP specs: https://opentelemetry.io/docs/specs/otlp/#otlphttp-request

## Check if exporters are working (requires telemetry enabled)
curl localhost:8888/metrics | grep exporter_sent

Production Operations: When shit hits the fan

Performance tuning that actually works

Batching: Different backends need different batch sizes. Trial and error is your friend:

processors:
  batch:
    send_batch_size: 1024      # Start here, tune based on errors
    send_batch_max_size: 4096  # Hard stop before OOM
    timeout: 1s                # Don't wait forever

Reality check: Datadog pukes if batches are too big. Prometheus doesn't care. Jaeger sometimes just drops data silently. You'll figure out the sweet spot by watching error rates.

Memory limits (set these or regret it):

processors:
  memory_limiter:
    limit_mib: 1024
    spike_limit_mib: 256
    check_interval: 1s  # Check frequently under load

I set check_interval to 1s after our collector ate 8GB during a traffic spike and crashed the entire node.

Monitoring: The metrics that matter


Enable internal metrics because you'll need them when debugging at 3am:

service:
  telemetry:
    metrics:
      address: 0.0.0.0:8888
      level: detailed
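
To actually collect those internal metrics, point Prometheus at port 8888. A minimal scrape job - the target hostname is a placeholder:

scrape_configs:
  - job_name: otel-collector
    scrape_interval: 30s
    static_configs:
      - targets: ['otel-collector:8888']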

Critical alerts to set up:

## Memory usage climbing (1e9 bytes ~= 1GB; PromQL has no unit literals)
otelcol_process_memory_rss_bytes > 1e9

## Data not flowing  
rate(otelcol_receiver_accepted_spans_total[5m]) == 0

## Export failures
rate(otelcol_exporter_send_failed_spans_total[5m]) > 0

## Queue backing up (death spiral incoming)
otelcol_exporter_queue_size > 1000
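
If you keep alerts in version control, those translate straight into Prometheus alerting rules. The export-failure one, for example:

groups:
  - name: otel-collector
    rules:
      - alert: OtelCollectorExportFailures
        expr: rate(otelcol_exporter_send_failed_spans_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "OpenTelemetry Collector is failing to export spans"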

When the collector stops working (troubleshooting guide)

Problem: Collector stops receiving data
Error: rpc error: code = Unavailable desc = connection error
Fix:

## Check if collector is actually listening
netstat -tulpn | grep :4317

## Test OTLP gRPC receiver directly (on collector host)
grpcurl -plaintext localhost:4317 list

Problem: Collector runs out of memory and gets killed
Error: signal: killed in logs, nothing in collector output
Fix: Check system logs for OOM killer:

dmesg | grep -i "killed process.*otelcol"
journalctl -u otel-collector.service | grep -i "memory"

Problem: Exports randomly fail
Error: context deadline exceeded or connection refused
Fix: Your backend is overloaded or unreachable. Add retries:

exporters:
  otlp:
    endpoint: https://backend.example.com
    timeout: 30s
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s  # Give up after 5 minutes

Problem: Data loss during restarts
Error: No error, data just vanishes
Fix: Enable persistent queues BEFORE you need them:

extensions:
  file_storage:
    directory: /var/lib/otel-data

exporters:
  otlp:
    sending_queue:
      enabled: true
      storage: file_storage
      queue_size: 5000

Performance debugging like a pro

When your collector is using 100% CPU:

## Get a CPU profile (30 seconds) - run on the collector host with the pprof extension enabled (default endpoint localhost:1777)
curl -o cpu.prof http://<COLLECTOR_HOST>:1777/debug/pprof/profile?seconds=30

## For local debugging, replace <COLLECTOR_HOST> with localhost
## Profiling guide: https://opentelemetry.io/docs/collector/troubleshooting/#performance-profiling

## Analyze it
go tool pprof cpu.prof
(pprof) top10

Common CPU hogs:

  • Resource detection with too many detectors enabled
  • Attributes processor with complex regex rules
  • Tail sampling with short decision_wait times
  • Batch processor with tiny batches (high overhead)

When memory keeps growing:

## Get a memory profile (on the collector host, with the pprof extension enabled; default endpoint localhost:1777)
curl -o mem.prof http://<COLLECTOR_HOST>:1777/debug/pprof/heap

## For local debugging, replace <COLLECTOR_HOST> with localhost
## Profiling guide: https://opentelemetry.io/docs/collector/troubleshooting/#performance-profiling

## Check for leaks - the biggest live allocations show up first
go tool pprof mem.prof
(pprof) top10

Pro tip: If you're running v0.89.0 and seeing memory issues, that version had problems. Upgrade immediately.

Security in production (because your data is valuable)


TLS setup (non-optional in prod):

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /etc/ssl/certs/otel.crt
          key_file: /etc/ssl/private/otel.key
          min_version: "1.3"  # TLS 1.3 only

Basic auth (better than nothing):

extensions:
  basicauth/server:
    htpasswd:
      file: /etc/otel/auth.htpasswd

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        auth:
          authenticator: basicauth/server

service:
  extensions: [basicauth/server]  # the authenticator has to be registered here too

Generate the htpasswd file:

htpasswd -c /etc/otel/auth.htpasswd oteluser
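
Whatever sends to this collector needs the matching client half. If the sender is another collector, it looks roughly like this - the endpoint and the env var holding the password are placeholders:

extensions:
  basicauth/client:
    client_auth:
      username: oteluser
      password: ${env:OTEL_BASIC_AUTH_PASSWORD}

exporters:
  otlp:
    endpoint: collector.example.com:4317
    auth:
      authenticator: basicauth/client

service:
  extensions: [basicauth/client]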

High availability (or how to sleep at night)

Don't do this unless you absolutely need it. HA adds complexity and new failure modes. Start with one collector and scale when you have actual problems.

If you must do HA:

  • Use a load balancer (HAProxy/nginx)
  • Enable persistent queues on all collectors
  • Monitor each collector independently
  • Have a runbook for when one fails

The collector can handle millions of spans per minute on a single instance. You probably don't need HA yet.

The bottom line on production operations

After running this in production for two years: start simple, monitor everything, and prepare for the stuff that breaks at 2am. The collector will save you money and give you vendor independence, but only if you configure it properly from day one.

