
What the fuck is this thing and why do I need it?

The OpenTelemetry Collector is basically a proxy that sits between your apps and whatever monitoring backend you're paying too much for this month. Instead of instrumenting your code to send data directly to Datadog/New Relic/whatever, you send it to the Collector and let it figure out where to route it.

I've been running this in production for two years after getting an $18k monthly Datadog bill that made our VP cry. The Collector saved our ass by letting us ship 90% of our data to cheaper backends while keeping critical alerts in Datadog.

How this piece of shit actually works


The Collector has three types of components that form a pipeline:


  • Receivers: Accept data in various formats (OTLP, Jaeger, Zipkin, Prometheus). OTLP works, the others... sometimes.
  • Processors: Mess with your data (sampling, filtering, enrichment). This is where you save money.
  • Exporters: Ship processed data to backends. Half of them work, the other half need weird config tweaks.
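
To make those three roles concrete, here's roughly the smallest config that wires one of each together. It's a sketch for local testing: the debug exporter just prints telemetry to the Collector's own log, so nothing leaves the box.

receivers:
  otlp:                # receiver: accepts OTLP over gRPC
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}            # processor: groups data before it's exported

exporters:
  debug: {}            # exporter: dumps whatever arrives to the collector log

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]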

Current version bullshit

As of September 2025, we're at v0.135.0 for the core. Avoid v0.89.0 - it had memory issues that caused frequent crashes. Learned that the hard way during a Black Friday deployment.

The "stability" ratings are marketing. "Alpha" means it'll randomly break. "Beta" means it breaks predictably. "Stable" means it only breaks when you update.

Two ways to deploy this nightmare

Agent Pattern (sidecar or daemon):

  • Deploy next to each application
  • Lower latency since data doesn't travel far
  • Uses more resources but fails independently
  • We use this for critical services

Gateway Pattern (centralized):

  • One big collector serving multiple apps
  • Cheaper on resources, harder to debug when it shits itself
  • Single point of failure that'll take down your entire observability stack
  • Great for non-critical services
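
If you're on Kubernetes, the two patterns map roughly to a DaemonSet (agent) and a Deployment behind a Service (gateway). A bare-bones sketch - the image tag, resource limits, and names are placeholders, not recommendations:

# Agent pattern: one collector per node
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-agent
spec:
  selector:
    matchLabels: {app: otel-agent}
  template:
    metadata:
      labels: {app: otel-agent}
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector:0.135.0
          resources:
            limits: {memory: 512Mi}
---
# Gateway pattern: a small pool of centralized collectors
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-gateway
spec:
  replicas: 2
  selector:
    matchLabels: {app: otel-gateway}
  template:
    metadata:
      labels: {app: otel-gateway}
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector:0.135.0
          resources:
            limits: {memory: 2Gi}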

Why we actually use this thing

Escaping vendor lock-in: We were stuck paying Datadog $18k/month because switching would mean rewriting instrumentation in 47 microservices. With the Collector, we switched backends in 2 hours by changing one config file.
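
For what it's worth, that 2-hour switch was really just an exporter swap. Something like this, with placeholder endpoints and header names - check what your backend actually expects:

exporters:
  # Before: everything went to vendor A
  otlp/vendor_a:
    endpoint: https://otlp.vendor-a.example:4317
    headers:
      api-key: ${env:VENDOR_A_API_KEY}
  # After: same pipeline, cheaper destination
  otlphttp/vendor_b:
    endpoint: https://otlp.vendor-b.example
    headers:
      api-key: ${env:VENDOR_B_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/vendor_b]  # the only line that changes; app code is untouched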


Cost savings: We now send 90% of our data to Grafana Cloud (much cheaper) and only keep high-value data in Datadog. Cut our observability costs by 73%.

Data processing: The Collector can sample traces, filter out noisy metrics, and redact PII before it leaves your network. Direct exporters can't do this shit.

Reliability: The Collector has built-in retries and buffering. When your backend goes down (and it will), you don't lose data. Direct exports just fail silently and you're fucked.

Shit that actually helped me:

OpenTelemetry Collector vs Your Other Shitty Options

| Feature | OpenTelemetry Collector | Direct Export (SDK) | Vendor Agents |
|---|---|---|---|
| Vendor Lock-in | Escape hatch when prices jump | Change code when switching | You're fucked |
| Setup Complexity | YAML hell but well-documented | Simple until it breaks | Works until vendor changes something |
| Resource Usage | ~500MB RAM realistically | <50MB per service | 100-300MB + mystery overhead |
| Data Processing | Actually works (sampling, filtering) | Batching and prayers | Whatever vendor allows |
| Reliability | Retries, queues, doesn't lose data | App crashes = data loss | Usually works but no control |
| Multi-Backend Support | ✅ Send same data to 3+ backends | ❌ Pick one and stick with it | ❌ Vendor prison |
| Production Features | ✅ Tail sampling saves money | ❌ Sample everything or nothing | ⚠️ Pay for what vendor gives you |
| Network Security | ✅ One hole in firewall | ❌ Every service talks to internet | ⚠️ Vendor-specific bullshit |
| Configuration | ✅ Git-controlled YAML | ❌ Code changes for config | ⚠️ UI changes you can't version |
| Cost Control | ✅ Filter before you pay | ❌ Pay for everything you generate | ⚠️ Pay whatever vendor decides |

Getting this thing running (and keeping it running)

Installation: The easy part that tricks you

Download the core distribution (don't use contrib unless you hate yourself):

## Download latest release from GitHub releases page
## Visit: https://github.com/open-telemetry/opentelemetry-collector-releases/releases/latest
curl -Lo otelcol-core [DOWNLOAD_URL_FOR_YOUR_PLATFORM]
chmod +x otelcol-core

## Test your config BEFORE running it
./otelcol-core --config=config.yaml --dry-run

## If dry-run passes, actually run it
./otelcol-core --config=config.yaml

Linux gotcha: On Ubuntu/Debian, you might need to install ca-certificates first or TLS connections will fail with cryptic errors.
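
On a fresh Ubuntu/Debian box that usually just means:

sudo apt-get update && sudo apt-get install -y ca-certificates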

macOS gotcha: Apple's security will block unsigned binaries. Run xattr -d com.apple.quarantine otelcol-core to fix it.

Config that actually works in production


This config works because I spent 3 weeks debugging why the "simple" examples don't:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  # ALWAYS put memory_limiter FIRST or your collector will eat all RAM
  memory_limiter:
    check_interval: 1s       # required - the limiter won't start without it
    limit_mib: 1024
    spike_limit_mib: 256
  batch:
    send_batch_size: 1024    # Start conservative
    timeout: 1s              # Don't wait too long
  resourcedetection:
    detectors: [env, system] # Skip docker/k8s unless you're actually using them
    timeout: 5s              # raise this if detection times out at startup

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true  # Use TLS in prod, but this works for testing
  prometheus:
    endpoint: "0.0.0.0:8889"
    # Prometheus exporter is rock solid, unlike some others

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resourcedetection, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]  
      processors: [memory_limiter, resourcedetection, batch]
      exporters: [prometheus]

  # Enable internal metrics so you can debug when it breaks
  telemetry:
    metrics:
      address: 0.0.0.0:8888

Deployment reality check


Memory planning: The docs say 200MB baseline. That's bullshit. Plan for:

  • 512MB minimum for any real workload
  • +100MB per 1K spans/second is optimistic
  • +200MB per 1K spans/second with tail sampling
  • Double it if you're using contrib components

High availability means complexity: Multiple collectors behind a load balancer sounds great until you need to debug which one is fucking up. Start with one collector and scale when you actually need it.

Production-hardened config additions

Persistent queues (or lose data when things crash):

extensions:
  file_storage:
    directory: /var/lib/otel-data  # Make sure this directory exists

exporters:
  otlp:
    endpoint: https://backend.example.com
    sending_queue:
      enabled: true
      storage: file_storage
      queue_size: 5000
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_interval: 30s

service:
  extensions: [file_storage]  # the extension does nothing unless it's listed here

Resource limits (Docker/systemd):

## Docker
docker run --memory=2g --cpus=1 otel-collector

## systemd service
[Service]
MemoryMax=2G       # MemoryLimit= is the older cgroup v1 spelling
CPUQuota=100%

Advanced stuff that'll save you money

Tail sampling - sample complete traces, not random spans:

processors:
  tail_sampling:
    decision_wait: 10s  # Wait for complete traces
    policies:
      - name: errors_always
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow_requests  
        type: latency
        latency: {threshold_ms: 1000}
      - name: sample_normal
        type: probabilistic
        probabilistic: {sampling_percentage: 5}  # 5% of normal traces
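
One catch the example above doesn't show: tail sampling only works if every span of a trace lands on the same Collector instance. If you run more than one, the usual pattern is a first tier of collectors routing by trace ID with the contrib loadbalancing exporter - roughly like this (hostnames are placeholders):

exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true  # testing only, same caveat as before
    resolver:
      static:
        hostnames:
          - sampling-collector-1:4317
          - sampling-collector-2:4317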

Drop PII before it leaves your network:

processors:
  attributes:
    actions:
      - key: user.email       # Drop email addresses
        action: delete
      - key: user.phone       # Drop phone numbers  
        action: delete
      - pattern: credit_card.* # regex - drops any attribute whose key starts with credit_card
        action: delete
      - key: http.request.body # Usually contains sensitive data
        action: delete
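
If you need a field for correlation but can't ship the raw value, the attributes processor can hash it instead of deleting it. A small sketch - user.id stands in for whatever identifier you actually carry:

processors:
  attributes/hash_ids:
    actions:
      - key: user.id
        action: hash  # replaces the value with its hash, keeps the key around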

Real production deployment tip: Start with minimal config, get it working, then add complexity. Every processor you add is another thing that can break at 2am.

Shit you'll need when this breaks:

Questions that'll save your ass at 3am

Q: Why does my Collector randomly crash with "signal: killed"?

A: You hit the OOM killer. The Collector will eat all your memory if you don't set limits. I learned this when our collector consumed 32GB of RAM and took down the entire node.

processors:
  memory_limiter:
    limit_mib: 1024  # Set this or die
    spike_limit_mib: 256

Pro tip: Set this as your first processor or you're fucked. Also, check dmesg | grep -i "killed process" to confirm it was the OOM killer.

Q: Can I send data to multiple backends without losing my mind?

A: Yeah, but it's not as easy as the docs claim. You can configure multiple exporters:

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger, otlp/datadog, zipkin]

Reality check: Each backend has different format requirements. Datadog wants specific tags, Jaeger chokes on certain attributes. You'll spend hours debugging why traces show up in one backend but not the other.

Q: What happens when this piece of shit crashes?

A: If you didn't configure persistent queues, your data is gone. I lost 4 hours of Black Friday telemetry learning this lesson.

extensions:
  file_storage:
    directory: /tmp/otel-data

exporters:
  otlp:
    endpoint: https://backend.example.com
    sending_queue:
      enabled: true
      storage: file_storage  # This saves your ass

service:
  extensions: [file_storage]  # and this makes the extension actually load

Without persistent queues, crashes = data loss. Period.

Q: How do I debug when the Collector uses 8GB of RAM?

A: First, check if you're using v0.89.0 - it had memory issues. If not:

## Check internal metrics (run these ON the collector host)
curl localhost:8888/metrics | grep memory

## Profile the collector (needs the pprof extension enabled; its default endpoint is localhost:1777)
curl -o cpu.prof http://<COLLECTOR_HOST>:1777/debug/pprof/profile?seconds=30
go tool pprof cpu.prof

## For local debugging, replace <COLLECTOR_HOST> with localhost
## Profiling guide: https://opentelemetry.io/docs/collector/troubleshooting/#performance-profiling

Common causes:

  • Batch processor misconfigured: Set reasonable send_batch_size (1024-8192)
  • No memory limiter: Set it or the collector will eat everything
  • High cardinality metrics: Filter them out or prepare for pain

Q: Should I use core or contrib? (Spoiler: core)

A: Use core. The contrib distribution is 200+ components that mostly don't work properly. Core has ~40 components that actually function.

I deployed contrib once and spent a week debugging why the sqlquery receiver randomly stopped working. Turned out it was marked "alpha" for a reason.

Q: How do I stop high-cardinality metrics from destroying my backend?

A: High-cardinality metrics will bankrupt you. User IDs, session IDs, request IDs - all of these will generate millions of unique metric series.

processors:
  filter/kill_cardinality:
    error_mode: ignore
    metrics:
      datapoint:
        # Drop datapoints carrying high-cardinality attributes entirely
        # (if you'd rather keep the metric and strip the label, that's the transform processor's job)
        - 'attributes["user_id"] != nil'
        - 'attributes["session_id"] != nil'
        - 'attributes["request.uuid"] != nil'

Story time: We had a developer accidentally add user IDs to metrics. Our Prometheus storage grew from 100GB to 2TB overnight and the queries became unusable.

Q: Kubernetes deployment is a nightmare, right?

A: The OpenTelemetry Operator works... sometimes. When it doesn't, you get no error messages and your collector just doesn't start.

## Debug operator issues
kubectl logs -n opentelemetry-operator-system deployment/opentelemetry-operator-controller-manager

## Check if your collector actually started  
kubectl get pods -l app.kubernetes.io/name=otelcol

## Get the real error messages
kubectl describe pod your-failing-collector-pod

Common gotcha: The operator ignores YAML syntax errors silently. Validate your config first:

./otelcol-core --config=config.yaml --dry-run

Q: What versions will ruin your day?

A: DO NOT USE:

  • v0.89.0: Memory issues that cause frequent crashes
  • v0.82.x: Performance problems with high throughput
  • v0.78.2: Reported issues with batch processor reliability
  • Any version ending in .0: Wait for .1 or .2, first releases always have bugs

Currently safe: v0.135.1+ as of September 2025, but check the release notes for latest gotchas.

Q: Why does my collector randomly stop receiving data?

A: Check the error you're probably ignoring:

## Check collector logs for this specific error
grep "connection refused" /var/log/otel-collector.log

Common causes:

  • Backend is down: Your exporter fails, collector stops processing
  • Wrong endpoint: OTLP vs OTLP/HTTP confusion (port 4317 vs 4318)
  • TLS issues: Certificates expired or misconfigured
  • Network policies: Kubernetes blocking traffic you thought was allowed

Debug commands that actually help (run these on the collector host):

## Test if collector is accepting data on the OTLP/HTTP endpoint
curl -X POST http://<COLLECTOR_HOST>:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{"resourceSpans":[]}'

## For local testing, replace <COLLECTOR_HOST> with localhost
## OTLP specs: https://opentelemetry.io/docs/specs/otlp/#otlphttp-request

## Check if exporters are working (requires telemetry enabled)
curl localhost:8888/metrics | grep exporter_sent

Production Operations: When shit hits the fan

Performance tuning that actually works

Batching: Different backends need different batch sizes. Trial and error is your friend:

processors:
  batch:
    send_batch_size: 1024      # Start here, tune based on errors
    send_batch_max_size: 4096  # Hard stop before OOM
    timeout: 1s                # Don't wait forever

Reality check: Datadog pukes if batches are too big. Prometheus doesn't care. Jaeger sometimes just drops data silently. You'll figure out the sweet spot by watching error rates.

Memory limits (set these or regret it):

processors:
  memory_limiter:
    limit_mib: 1024
    spike_limit_mib: 256
    check_interval: 1s  # Check frequently under load

I set check_interval to 1s after our collector ate 8GB during a traffic spike and crashed the entire node.

Monitoring: The metrics that matter


Enable internal metrics because you'll need them when debugging at 3am:

service:
  telemetry:
    metrics:
      address: 0.0.0.0:8888
      level: detailed
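
To actually collect those internal metrics, point Prometheus at port 8888. A minimal scrape job - the target hostname is a placeholder:

scrape_configs:
  - job_name: otel-collector
    scrape_interval: 30s
    static_configs:
      - targets: ['otel-collector:8888']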

Critical alerts to set up:

## Memory usage climbing (1e9 bytes ~= 1GB; PromQL has no unit literals)
otelcol_process_memory_rss_bytes > 1e9

## Data not flowing  
rate(otelcol_receiver_accepted_spans_total[5m]) == 0

## Export failures
rate(otelcol_exporter_send_failed_spans_total[5m]) > 0

## Queue backing up (death spiral incoming)
otelcol_exporter_queue_size > 1000
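
If you keep alerts in version control, those translate straight into Prometheus alerting rules. The export-failure one, for example:

groups:
  - name: otel-collector
    rules:
      - alert: OtelCollectorExportFailures
        expr: rate(otelcol_exporter_send_failed_spans_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "OpenTelemetry Collector is failing to export spans"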

When the collector stops working (troubleshooting guide)

Problem: Collector stops receiving data
Error: rpc error: code = Unavailable desc = connection error
Fix:

## Check if collector is actually listening
netstat -tulpn | grep :4317

## Test OTLP gRPC receiver directly (on collector host)
grpcurl -plaintext localhost:4317 list

Problem: Collector runs out of memory and gets killed
Error: signal: killed in logs, nothing in collector output
Fix: Check system logs for OOM killer:

dmesg | grep -i "killed process.*otelcol"
journalctl -u otel-collector.service | grep -i "memory"

Problem: Exports randomly fail
Error: context deadline exceeded or connection refused
Fix: Your backend is overloaded or unreachable. Add retries:

exporters:
  otlp:
    endpoint: https://backend.example.com
    timeout: 30s
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s  # Give up after 5 minutes

Problem: Data loss during restarts
Error: No error, data just vanishes
Fix: Enable persistent queues BEFORE you need them:

extensions:
  file_storage:
    directory: /var/lib/otel-data

exporters:
  otlp:
    sending_queue:
      enabled: true
      storage: file_storage
      queue_size: 5000

Performance debugging like a pro

When your collector is using 100% CPU:

## Get a CPU profile (30 seconds) - run on the collector host with the pprof extension enabled (default endpoint localhost:1777)
curl -o cpu.prof http://<COLLECTOR_HOST>:1777/debug/pprof/profile?seconds=30

## For local debugging, replace <COLLECTOR_HOST> with localhost
## Profiling guide: https://opentelemetry.io/docs/collector/troubleshooting/#performance-profiling

## Analyze it
go tool pprof cpu.prof
(pprof) top10

Common CPU hogs:

  • Resource detection with too many detectors enabled
  • Attributes processor with complex regex rules
  • Tail sampling with short decision_wait times
  • Batch processor with tiny batches (high overhead)

When memory keeps growing:

## Get a memory profile (on the collector host, with the pprof extension enabled; default endpoint localhost:1777)
curl -o mem.prof http://<COLLECTOR_HOST>:1777/debug/pprof/heap

## For local debugging, replace <COLLECTOR_HOST> with localhost
## Profiling guide: https://opentelemetry.io/docs/collector/troubleshooting/#performance-profiling

## Check for leaks - the biggest live allocations show up first
go tool pprof mem.prof
(pprof) top10

Pro tip: If you're running v0.89.0 and seeing memory issues, that version had problems. Upgrade immediately.

Security in production (because your data is valuable)


TLS setup (non-optional in prod):

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /etc/ssl/certs/otel.crt
          key_file: /etc/ssl/private/otel.key
          min_version: "1.3"  # TLS 1.3 only

Basic auth (better than nothing):

extensions:
  basicauth/server:
    htpasswd:
      file: /etc/otel/auth.htpasswd

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        auth:
          authenticator: basicauth/server

service:
  extensions: [basicauth/server]  # the authenticator has to be registered here too

Generate the htpasswd file:

htpasswd -c /etc/otel/auth.htpasswd oteluser
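
Whatever sends to this collector needs the matching client half. If the sender is another collector, it looks roughly like this - the endpoint and the env var holding the password are placeholders:

extensions:
  basicauth/client:
    client_auth:
      username: oteluser
      password: ${env:OTEL_BASIC_AUTH_PASSWORD}

exporters:
  otlp:
    endpoint: collector.example.com:4317
    auth:
      authenticator: basicauth/client

service:
  extensions: [basicauth/client]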

High availability (or how to sleep at night)

Don't do this unless you absolutely need it. HA adds complexity and new failure modes. Start with one collector and scale when you have actual problems.

If you must do HA:

  • Use a load balancer (HAProxy/nginx)
  • Enable persistent queues on all collectors
  • Monitor each collector independently
  • Have a runbook for when one fails

The collector can handle millions of spans per minute on a single instance. You probably don't need HA yet.

The bottom line on production operations

After running this in production for two years: start simple, monitor everything, and prepare for the stuff that breaks at 2am. The collector will save you money and give you vendor independence, but only if you configure it properly from day one.

