The official docs say "minimal resources," which is complete bullshit. Here's what you actually need.
Resource Requirements That Don't Lie
ClickHouse is the resource hog here. Plan accordingly:
- Memory: 32GB minimum, 64GB for >5M traces/day
- CPU: 8 cores minimum, 16 for heavy workloads
- Storage: 500GB SSD minimum, grows 10GB per million traces
- Network: 1Gbps between OTLP collector and ClickHouse
We crashed production twice before learning ClickHouse needs room to breathe. Memory usage spikes 5x during aggregations, especially when ingesting burst traffic from LLM workloads. The ClickHouse performance tuning docs cover the relevant memory settings, and the OpenTelemetry Collector scaling docs cover sizing the ingest side; read both before you order hardware.
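A couple of ClickHouse-side guardrails help with those spikes. This is a sketch, not a recommendation: the 20GB cap is an arbitrary example number, and the server-wide ceiling (max_server_memory_usage) has to go in the server config file, not SQL.

-- Example only: cap a single query at ~20GB so one big aggregation
-- can't take the whole node down (session-level; persist it via a
-- settings profile if it works for your workload)
SET max_memory_usage = 20000000000;

-- Watch the server-wide memory tracker during ingest bursts
SELECT metric, formatReadableSize(value) AS memory
FROM system.metrics
WHERE metric = 'MemoryTracking';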
Network Architecture for Scale
OTLP endpoints are latency-sensitive. Keep collectors geographically close to your apps. Cross-region OTLP calls add 200-500ms to every export, and if your instrumentation flushes spans synchronously, your users will notice.
App Servers (US-East) → OTLP Collector (US-East) → ClickHouse (US-East)
App Servers (EU-West) → OTLP Collector (EU-West) → ClickHouse Replica
Port conflicts are real. The default OTLP/HTTP port 4318 conflicts with:
- Jaeger collectors
- Other OpenTelemetry setups
- Local development proxies
Pick custom ports and document them. We use 4320 for OpenLIT to avoid the mess.
Storage Strategy (Don't Fill Your Disks)
Trace retention grows faster than you think:
- 1M traces = ~10GB storage
- 100M traces/month = 1TB storage
- Indexes and aggregations add 30% overhead
Set up retention policies from day one. Skipping this killed our staging environment when traces filled a 500GB disk in 3 days.
-- ClickHouse retention policy example
ALTER TABLE otel_traces
MODIFY TTL toDate(Timestamp) + INTERVAL 30 DAY;
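To check that the TTL is actually keeping up, and to see where your disk is going, system.parts has what you need. A sketch; the table LIKE 'otel%' filter assumes the default table names from the OpenTelemetry ClickHouse exporter, so adjust it to your schema:

-- On-disk size and oldest data per trace table (active parts only)
SELECT
    table,
    formatReadableSize(sum(bytes_on_disk)) AS disk_used,
    min(min_date) AS oldest_partition
FROM system.parts
WHERE active AND table LIKE 'otel%'
GROUP BY table
ORDER BY sum(bytes_on_disk) DESC;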
High Availability Setup
Single points of failure that will bite you:
- ClickHouse failure = complete observability loss
- OTLP collector failure = trace ingestion stops
- OpenLIT UI failure = dashboards go dark
Deploy everything in HA mode from the start. The official Helm chart supports replicas but doesn't configure persistent volumes properly. Before you settle on a topology, get clear on ClickHouse replication and on load balancing multiple OpenTelemetry Collector instances; those two decisions shape everything else.
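Once ClickHouse replication exists (see the clustering notes below), alert on replica health so a read-only or lagging replica doesn't quietly turn into that "complete observability loss". A sketch, assuming ReplicatedMergeTree tables; the 300-second and 100-entry thresholds are arbitrary examples:

-- Replicas that are read-only, lagging, or piling up a replication queue
SELECT database, table, is_readonly, absolute_delay, queue_size
FROM system.replicas
WHERE is_readonly OR absolute_delay > 300 OR queue_size > 100;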
ClickHouse clustering is painful but necessary:
- 3+ nodes minimum for fault tolerance
- Shared storage or replication required
- ZooKeeper (or ClickHouse Keeper) dependency adds complexity
Budget 2-3 days for proper ClickHouse clustering setup. The ClickHouse operator helps but brings its own operational overhead. The moving parts are replicated local tables, a Distributed table on top, the Keeper/ZooKeeper ensemble, and (on Kubernetes) StatefulSets with persistent volumes; a rough sketch of the table DDL follows.
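This is the general shape of the DDL once a cluster exists: replicated local tables per shard, plus a Distributed table that fans reads and writes across them. Everything here is a placeholder sketch (the cluster name otel_cluster, the trimmed-down column list, the sharding key); the real otel_traces schema comes from whichever exporter creates it, so adapt rather than copy.

-- Replicated storage on each shard ({shard}/{replica} come from macros in config)
CREATE TABLE otel_traces_local ON CLUSTER otel_cluster
(
    Timestamp DateTime64(9),
    TraceId   String,
    SpanId    String,
    SpanName  String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/otel_traces', '{replica}')
PARTITION BY toDate(Timestamp)
ORDER BY (SpanName, toUnixTimestamp(Timestamp));

-- One logical table that fans queries out across the shards
CREATE TABLE otel_traces_dist ON CLUSTER otel_cluster
AS otel_traces_local
ENGINE = Distributed(otel_cluster, currentDatabase(), otel_traces_local, rand());

rand() spreads writes evenly across shards; sharding on something like cityHash64(TraceId) instead keeps each trace on a single shard, which can make per-trace lookups cheaper. Pick one before you load data.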