
OpenTelemetry Collector: AI-Optimized Technical Reference

Overview

Purpose: Telemetry data proxy that sits between applications and monitoring backends, enabling vendor independence and cost control.

Critical Success Story: Reduced observability costs by 73% (from $18k/month to ~$5k/month) by routing 90% of data to cheaper backends while keeping critical alerts in the more expensive vendor.

Architecture Components

Pipeline Structure

  • Receivers: Accept data in various formats (OTLP, Jaeger, Zipkin, Prometheus)
  • Processors: Transform, sample, filter, and enrich data
  • Exporters: Ship processed data to backends

Deployment Patterns

| Pattern                | Use Case               | Resource Impact  | Failure Mode            |
|------------------------|------------------------|------------------|-------------------------|
| Agent (sidecar/daemon) | Critical services      | Higher resources | Independent failures    |
| Gateway (centralized)  | Non-critical services  | Lower resources  | Single point of failure |
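
A minimal sketch of the agent-to-gateway handoff: the agent ships OTLP to a central gateway collector instead of straight to the backend, so critical services keep their own collector while everything else shares one. The gateway hostname (otel-gateway) is an assumption; substitute your own service address.

# Agent-side exporter pointing at a central gateway collector (hostname is hypothetical)
exporters:
  otlp/gateway:
    endpoint: otel-gateway:4317
    tls:
      insecure: true       # Use real TLS between nodes in production

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/gateway]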

Configuration Requirements

Production-Critical Memory Settings

processors:
  memory_limiter:
    limit_mib: 1024        # REQUIRED: Without this, OOM killer terminates collector
    spike_limit_mib: 256   # Prevents memory spikes
    check_interval: 1s     # Monitor frequently under load

Failure Impact: Without the memory limiter, the collector consumes all available RAM and can take down the entire node.

Minimal Working Configuration

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:         # MUST be first processor
    check_interval: 1s    # Required: the processor refuses to start without it
    limit_mib: 1024
  batch:
    send_batch_size: 1024 # Conservative starting point
    timeout: 1s

exporters:
  otlp/backend:
    endpoint: backend:4317
    tls:
      insecure: true      # Use TLS in production

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/backend]
  telemetry:
    metrics:
      address: 0.0.0.0:8888  # Enable for debugging

Resource Planning

Memory Requirements (Real-World)

  • Baseline: 512MB minimum (the official 200MB guidance is insufficient in practice)
  • Scaling: +200MB per 1K spans/second with tail sampling
  • High cardinality: Double baseline for metrics with many dimensions
  • Contrib components: Add 50% overhead

Performance Thresholds

  • UI breakdown: traces with >1000 spans break most backends' trace-view UIs, making debugging impractical
  • Queue backup: an exporter queue above 1000 items means exports are falling behind and failure is imminent
  • Batch sizing: 1024-8192 is the optimal range; smaller batches add per-batch CPU overhead
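
A hedged batch processor sketch that stays inside those thresholds; the numbers are starting points to tune against your own throughput, not benchmarked values.

processors:
  batch:
    send_batch_size: 2048       # Within the 1024-8192 sweet spot
    send_batch_max_size: 8192   # Hard cap so batches never blow past the upper threshold
    timeout: 2s                 # Flush partially filled batches to bound latency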

Critical Warnings

Version Blacklist

  • v0.89.0: Memory leaks causing frequent crashes
  • v0.82.x: Performance degradation under high throughput
  • v0.78.2: Batch processor reliability issues
  • Any .0 release: Wait for .1 or .2 patches

Common Production Failures

OOM Killer Termination

Symptoms: signal: killed in logs, no collector output
Root Cause: No memory limiter configured
Detection: dmesg | grep "killed process.*otelcol"
Prevention: Always configure memory_limiter as first processor

Data Loss During Crashes

Cause: No persistent queues configured
Impact: 4+ hours of telemetry data lost during outages
Solution: Enable file_storage extension with persistent queues
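
A sketch of what that looks like, assuming the contrib distribution (file_storage ships in contrib) and a directory the collector can write to:

extensions:
  file_storage:
    directory: /var/lib/otelcol/queue   # Must exist and be writable by the collector user

exporters:
  otlp/backend:
    endpoint: backend:4317
    sending_queue:
      enabled: true
      storage: file_storage             # Queued data survives collector restarts

service:
  extensions: [file_storage]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/backend]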

High Cardinality Metrics Explosion

Symptoms: Storage grows from 100GB to 2TB overnight
Cause: User IDs, session IDs, or request IDs in metric labels
Cost Impact: Can bankrupt observability budget within days
Mitigation: Filter high-cardinality labels before export
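
One way to apply that mitigation, sketched with the attributes processor; the label names are examples of common offenders, not a definitive list:

processors:
  attributes/drop_high_cardinality:
    actions:
      - key: user.id
        action: delete
      - key: session.id
        action: delete
      - key: request.id
        action: delete

Wire it into the metrics (and traces) pipelines ahead of the exporter so the labels never reach the backend.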

Cost Optimization Strategies

Tail Sampling Configuration

processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors_always
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: sample_normal
        type: probabilistic
        probabilistic: {sampling_percentage: 5}  # 95% cost reduction for normal traces

Impact: Samples complete traces rather than random spans, maintains debugging capability while reducing costs.

PII Removal (Security + Compliance)

processors:
  attributes:
    actions:
      - key: user.email
        action: delete
      - pattern: credit_card\..*   # Use pattern (regex) for wildcards; key only matches an exact attribute name
        action: delete
      - key: http.request.body    # Often contains sensitive data
        action: delete

Production Operations

Essential Monitoring Alerts

# Memory usage climbing toward the limit (PromQL needs plain bytes; 1GiB = 1073741824)
otelcol_process_memory_rss_bytes > 1073741824

# Data flow stoppage
rate(otelcol_receiver_accepted_spans_total[5m]) == 0

# Export failures indicating backend issues
rate(otelcol_exporter_send_failed_spans_total[5m]) > 0

# Queue backup indicating performance degradation
otelcol_exporter_queue_size > 1000
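
If these run through Prometheus, a rule-file sketch for the queue-backup expression looks roughly like this (group and alert names are made up):

groups:
  - name: otel-collector                # Hypothetical rule group name
    rules:
      - alert: OtelExporterQueueBackup
        expr: otelcol_exporter_queue_size > 1000
        for: 5m                         # Sustained backup, not a momentary spike
        labels:
          severity: warning
        annotations:
          summary: "Collector export queue backing up - backend slow or unreachable"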

Troubleshooting Commands

# Test OTLP HTTP endpoint (OTLP/JSON needs an explicit content type; an empty resourceSpans array is a valid no-op payload)
curl -X POST http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{"resourceSpans":[]}'

# Get CPU profile for performance debugging (requires the pprof extension; default endpoint is localhost:1777)
curl -o cpu.prof "http://localhost:1777/debug/pprof/profile?seconds=30"

# Check for OOM killer activity
dmesg | grep "killed process.*otelcol"

# Verify collector is listening
netstat -tulpn | grep :4317

Vendor Comparison Matrix

| Aspect               | OpenTelemetry Collector                     | Direct SDK Export                 | Vendor Agents                 |
|----------------------|---------------------------------------------|-----------------------------------|-------------------------------|
| Vendor Lock-in Risk  | None - change backends via config           | High - code changes required      | Extreme - vendor dependent    |
| Multi-backend Support| Native - send to 3+ simultaneously          | None - single destination         | None - vendor specific        |
| Cost Control         | High - filter/sample before export          | Low - pay for all generated data  | None - vendor pricing control |
| Setup Complexity     | Medium - YAML configuration                 | Low - until backend change needed | Low - until vendor changes    |
| Resource Usage       | ~500MB RAM realistic                        | <50MB per service                 | 100-300MB + vendor overhead   |
| Data Processing      | Advanced - sampling, filtering, PII removal | Basic - batching only             | Vendor limited                |
| Network Security     | Single egress point                         | Multiple service endpoints        | Vendor specific requirements  |
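
The multi-backend row is just a pipeline that lists several exporters; a sketch with placeholder backend names and endpoints:

exporters:
  otlp/grafana:
    endpoint: grafana-otlp-gateway:4317     # Placeholder endpoint
  otlp/datadog:
    endpoint: datadog-agent:4317            # Placeholder endpoint
  otlp/archive:
    endpoint: archive-collector:4317        # Placeholder endpoint

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/grafana, otlp/datadog, otlp/archive]   # Fan-out to all three backends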

Implementation Decision Tree

When to Use Collector

  • Multi-vendor strategy: Planning to use multiple backends
  • Cost optimization: Need to filter/sample before paying for data
  • Security requirements: PII removal or data transformation needed
  • Vendor independence: Avoiding lock-in to specific observability provider

When Direct Export Acceptable

  • Single vendor commitment: Long-term contract with trusted provider
  • Simple requirements: No data transformation needed
  • Resource constraints: Cannot spare 500MB RAM for collector
  • Rapid prototyping: Speed over flexibility

Security Requirements

Production TLS Configuration

receivers:
  otlp:
    protocols:
      grpc:
        tls:
          cert_file: /etc/ssl/certs/otel.crt
          key_file: /etc/ssl/private/otel.key
          min_version: "1.3"  # Enforce TLS 1.3

Authentication Setup

extensions:
  basicauth/server:
    htpasswd:
      file: /etc/otel/auth.htpasswd

receivers:
  otlp:
    protocols:
      grpc:
        auth:
          authenticator: basicauth/server

service:
  extensions: [basicauth/server]   # The extension must also be enabled here for the authenticator to resolve

Deployment Considerations

Kubernetes Gotchas

  • Operator Issues: Silent failures with no error messages
  • Config Validation: Use --dry-run to catch YAML errors
  • Resource Limits: Set memory limits to prevent node crashes
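
For the resource-limits point, a container-spec fragment that matches the real-world sizing above; the image tag and numbers are assumptions to adjust per workload:

spec:
  containers:
    - name: otel-collector
      image: otel/opentelemetry-collector-contrib   # Pin a tested tag in practice
      resources:
        requests:
          cpu: 200m
          memory: 512Mi
        limits:
          memory: 1536Mi   # Above limit_mib (1024) + spike_limit_mib (256) so memory_limiter fires before the OOM killer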

Docker Deployment

# Proper resource limits (contrib distribution image shown; pin a tested tag in practice)
docker run --memory=2g --cpus=1 otel/opentelemetry-collector-contrib

# Persistent storage for queues
docker run -v /var/lib/otel-data:/var/lib/otel-data otel/opentelemetry-collector-contrib

Real-World Lessons

Black Friday Incident

  • Problem: Deployed without persistent queues
  • Impact: 4 hours of telemetry data lost during backend outage
  • Solution: Always enable file_storage for production deployments

Memory Explosion Case

  • Problem: Developer added user IDs to metrics
  • Impact: Prometheus storage: 100GB → 2TB overnight, queries unusable
  • Prevention: Implement high-cardinality filtering from day one

Version Upgrade Failure

  • Problem: Upgraded to v0.89.0 during high-traffic period
  • Impact: Memory leaks caused collector crashes every 2 hours
  • Prevention: Never use .0 releases, always test in staging first

Success Metrics

Cost Reduction Achieved

  • Previous: $18k/month Datadog bill
  • Current: ~$5k/month mixed backends (73% reduction)
  • Method: 90% data to Grafana Cloud, 10% critical data to Datadog

Operational Benefits

  • Vendor switching: 2 hours config change vs months of code rewriting
  • Data control: PII filtering, sampling, enrichment before export
  • Reliability: Built-in retries, queuing, no data loss during outages

This technical reference preserves all operational intelligence while organizing it for AI consumption, including specific failure modes, resource requirements, and real-world implementation lessons.

Useful Links for Further Investigation

Essential OpenTelemetry Collector Resources

  • OpenTelemetry Collector Documentation: Official docs that actually work (rare for OpenTelemetry). Better written than most open source projects, though they skip the "gotchas that'll fuck you" part.
  • Collector Architecture Guide: Deep dive that'll make you understand why everything breaks when you configure processors in the wrong order. Read this before you waste a week debugging.
  • Configuration Reference: Config examples that don't assume you have their exact environment. Unlike most docs, these actually work if you follow them exactly.
  • Troubleshooting Guide: Troubleshooting guide that covers real problems, not just happy path scenarios. Wish more projects had docs this useful when shit breaks at 2am.
  • Security Best Practices: Security guide that doesn't treat you like an idiot. Covers the basics plus some gotchas that'll save you from getting pwned in production.
  • OpenTelemetry Collector Complete Guide - SigNoz: Actually complete guide from people who run this shit in production. Way better than the scattered official tutorials that skip important details.
  • Beginner's Guide to OpenTelemetry Collector - Better Stack: Solid beginner tutorial that doesn't assume you know what OTLP is. Good if you want to understand the basics before diving into the clusterfuck.
  • OpenTelemetry Collector Deep Dive - Last9: Deep dive into the internals from people who understand performance. Actually explains why your collector is eating CPU and how to fix it.
  • Monitoring and Debugging the Collector - Better Stack: Guide to monitoring the thing that monitors everything else (meta as fuck). Essential for catching problems before they tank your observability.
  • OpenTelemetry Collector Core Repository: Core source code where you file bugs that actually get fixed. Way more responsive than most CNCF projects - they actually care about production issues.
  • OpenTelemetry Collector Releases: Release binaries that actually work. Read the changelog carefully - some releases break shit (like v0.89.0's memory issues).
  • Collector Contrib Repository: Where dreams go to die. 200+ components, half of which don't work. Stick to the core distribution unless you hate yourself.
  • Component Registry: Catalog of components with "stability" ratings that mean nothing. "Beta" = might work, "Alpha" = prepare for pain, "Stable" = usually works.
  • Collector Performance Benchmarks: Benchmarks that actually reflect real-world usage. Finally, someone tested with realistic data instead of toy examples.
  • Scaling the OpenTelemetry Collector: Scaling guide from people who learned the hard way. Covers memory limits and why you'll OOM if you don't set them properly.
  • Collector Monitoring Guide - Last9: Monitoring guide that covers the metrics you actually need. No fluff, just the alerts that'll save you when shit hits the fan.
  • Kubernetes Deployment with Operator: K8s operator that actually works. Makes deploying and managing collectors less painful than doing it manually with YAML hell.
  • Helm Charts for OpenTelemetry: Helm charts that don't suck. Configure once, deploy everywhere, instead of managing 50 different YAML files.
  • Docker Deployment Examples: Docker examples that work out of the box. No weird networking issues or missing volume mounts like other examples online.
  • AWS OpenTelemetry Collector Configuration: AWS guide that actually explains their special sauce. Covers the weird IAM permissions and VPC config you'll need. Trust me, you'll need this when your collector randomly starts getting 403 errors.
  • Grafana OpenTelemetry Integration: Grafana integration that's way cheaper than Datadog. This saved me $15k/month when we switched. Covers the gotchas for getting logs, metrics, and traces all flowing properly.
  • OpenTelemetry Community Slack: Slack where you can get help from people who actually understand this stuff. Way better than StackOverflow for OTEL questions.
  • CNCF OpenTelemetry Project: Official CNCF project page with roadmap. Unlike most CNCF projects, this one actually delivers on promises and has decent governance.
  • OpenTelemetry Demo Application: Demo app that shows OTEL working in a realistic microservices clusterfuck. Actually useful for testing your setup before going to prod.
