
Production Planning: Size Your Shit Right

The official docs say "minimal resources" which is complete bullshit. Here's what you actually need.

Resource Requirements That Don't Lie

ClickHouse is the resource hog. Plan accordingly:

  • Memory: 32GB minimum, 64GB for >5M traces/day
  • CPU: 8 cores minimum, 16 for heavy workloads
  • Storage: 500GB SSD minimum, grows 10GB per million traces
  • Network: 1Gbps between OTLP collector and ClickHouse

OpenLIT Resource Architecture

We crashed production twice before learning ClickHouse needs room to breathe. Memory usage spikes 5x during aggregations, especially when ingesting burst traffic from LLM workloads. The ClickHouse performance tuning guide covers memory optimization, while the OpenTelemetry collector scaling documentation explains resource planning. For production sizing, review the observability infrastructure requirements and ClickHouse cluster deployment guide. The Kubernetes resource management patterns show how to configure limits properly.
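
As a rough illustration of the headroom we leave now, here's a sketch against the Helm values shown later in this guide; whether your chart passes the settings: block through to ClickHouse verbatim is something to verify against its values schema:

clickhouse:
  resources:
    requests:
      memory: 32Gi            ## steady-state working set
    limits:
      memory: 64Gi            ## roughly 2x headroom for aggregation spikes
  settings:
    max_memory_usage: 20000000000                     ## 20GB per-query ceiling
    max_bytes_before_external_group_by: 10000000000   ## spill big GROUP BYs to disk instead of OOMing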

Network Architecture for Scale

OTLP endpoints are latency-sensitive. Keep collectors geographically close to your apps. Cross-region OTLP calls add 200-500ms to every request - your users will notice.

App Servers (US-East) → OTLP Collector (US-East) → ClickHouse (US-East)
App Servers (EU-West) → OTLP Collector (EU-West) → ClickHouse Replica

Port conflicts are real. The default OTLP/HTTP port 4318 conflicts with:

  • Jaeger collectors
  • Other OpenTelemetry setups
  • Local development proxies

Pick custom ports and document them. We use 4320 for OpenLIT to avoid the mess.
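
If you change the port, it happens in the collector's receiver config. A minimal sketch (4320 is just our convention; the gRPC line is only there to show it stays on its own default):

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4320   ## custom OTLP/HTTP port instead of the default 4318
      grpc:
        endpoint: 0.0.0.0:4317   ## gRPC default, unchanged

Whatever you pick, update OTEL_EXPORTER_OTLP_ENDPOINT (or the equivalent SDK setting) in every app shipping traces, or they'll keep hammering 4318.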

Storage Strategy (Don't Fill Your Disks)

Trace retention grows faster than you think:

  • 1M traces = ~10GB storage
  • 100M traces/month = 1TB storage
  • Indexes and aggregations add 30% overhead

Set up retention policies from day one. This killed our staging environment when traces filled a 500GB disk in 3 days.

-- ClickHouse retention policy example
ALTER TABLE otel_traces 
MODIFY TTL toDate(Timestamp) + INTERVAL 30 DAY;

High Availability Setup

Single points of failure that will bite you:

  1. ClickHouse failure = complete observability loss
  2. OTLP collector failure = trace ingestion stops
  3. OpenLIT UI failure = dashboards go dark

OpenLIT High Availability

Deploy everything in HA mode from the start. The official Helm chart supports replicas but doesn't configure persistent volumes properly. For production HA deployments, review the Kubernetes high availability patterns and ClickHouse replication strategies. The OpenTelemetry collector high availability guide covers load balancing approaches, while the observability stack resilience patterns explain disaster recovery procedures.

ClickHouse clustering is painful but necessary:

  • 3+ nodes minimum for fault tolerance
  • Shared storage or replication required
  • ZooKeeper dependency adds complexity

Budget 2-3 days for proper ClickHouse clustering setup. The ClickHouse operator helps but brings its own operational overhead. For production clustering, follow the ClickHouse distributed table setup and ZooKeeper cluster configuration guide. The Kubernetes StatefulSet patterns show how to deploy clustered databases properly.
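
If you go the operator route, the cluster definition ends up looking roughly like this. A sketch assuming the Altinity clickhouse-operator CRDs; field names can shift between operator versions, and the ZooKeeper host is a placeholder:

apiVersion: clickhouse.altinity.com/v1
kind: ClickHouseInstallation
metadata:
  name: openlit-clickhouse
spec:
  configuration:
    zookeeper:
      nodes:
        - host: zookeeper.zoo.svc.cluster.local   ## placeholder ZooKeeper service
    clusters:
      - name: traces
        layout:
          shardsCount: 1      ## start with one shard; replicas buy you the fault tolerance
          replicasCount: 3    ## matches the 3+ node minimum above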

Production Deployment Questions (The Real Shit)

Q: How much does this actually cost to run in production?

A: AWS costs for a moderate workload (5M traces/day):

  • 3x c5.2xlarge for ClickHouse cluster: $1,200/month
  • 2x c5.large for OTLP collectors: $300/month
  • 1TB EBS storage: $100/month
  • Network transfer: $50-200/month

Total: ~$1,650/month for a solid production setup. Cheaper than Datadog but you're managing it yourself.

Q: What breaks first when you scale up?

A: ClickHouse memory limits. You'll see "Memory limit exceeded" errors when trace ingestion spikes. The default memory limits in Helm charts are too conservative.

Set max_memory_usage = 20000000000 (20GB) minimum in ClickHouse config. Also tune max_bytes_before_external_group_by for large aggregation queries.

Q: How do you handle OTLP collector failures?

A: Buffering is critical. Configure persistent queues in the OTLP collector:

exporters:
  otlphttp:
    endpoint: http://clickhouse:8123
    sending_queue:
      enabled: true
      num_consumers: 16
      queue_size: 5000
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s

Without this, trace loss during collector restarts is guaranteed. We lost 2 hours of traces before implementing proper buffering.
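
One caveat: sending_queue on its own buffers in memory, so it won't survive a pod restart by itself. For a queue that does, the collector's file_storage extension can back it. A sketch, assuming the contrib distribution of the collector and a persistent volume at the path shown:

extensions:
  file_storage:
    directory: /var/lib/otelcol/queue   ## must live on a persistent volume, not emptyDir

exporters:
  otlphttp:
    endpoint: http://clickhouse:8123
    sending_queue:
      enabled: true
      storage: file_storage             ## persist the queue through restarts

service:
  extensions: [file_storage]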

Q: What's the real downtime during updates?

A: Depends on the component:

  • ClickHouse updates: 5-10 minutes with proper clustering, 30+ minutes for a single node
  • OTLP collector updates: near zero with rolling deployments
  • OpenLIT UI updates: 30 seconds, traces keep flowing

Plan ClickHouse updates during maintenance windows. Schema migrations can take hours on large trace tables.
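
The near-zero collector number assumes the Deployment actually rolls one pod at a time. A sketch of the strategy we'd pin down (standard Kubernetes fields; the image tag is just an example to pin yourself):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: openlit-collector              ## illustrative name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: openlit-collector
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0                ## keep full ingestion capacity during the roll
      maxSurge: 1                      ## new pod comes up before an old one goes away
  template:
    metadata:
      labels:
        app: openlit-collector
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.103.0   ## pin your own version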

Q: How do you debug when OpenLIT itself is broken?

A: The chicken-and-egg problem: you need observability to debug your observability tool.

Keep these external monitoring tools:

  • Node exporter for host metrics
  • Prometheus for ClickHouse metrics
  • CloudWatch/Datadog for basic health checks

When OpenLIT dashboards are dark, fall back to raw ClickHouse queries:

SELECT count() FROM otel_traces WHERE Timestamp > now() - INTERVAL 1 HOUR;

Q: What security considerations actually matter?

A: Network segmentation: Isolate ClickHouse from direct internet access. Only OTLP collectors should reach the database.

Trace data contains secrets: LLM prompts and responses may include API keys, user data, PII. Enable trace filtering at the collector level:

processors:
  filter:
    traces:
      include:
        match_type: regexp
        attributes:
          - key: service.name
            value: "(openlit|llm-service)"

Default credentials: Change user@openlit.io / openlituser immediately or you'll get pwned.
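
On the network segmentation point, a hedged sketch of "only collectors reach the database" as a Kubernetes NetworkPolicy. The app: labels are assumptions based on the labels used elsewhere in this guide; match them to whatever your Helm chart actually sets:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: clickhouse-ingress-lockdown
spec:
  podSelector:
    matchLabels:
      app: clickhouse                   ## assumed label on the ClickHouse pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: openlit-collector    ## only the OTLP collectors get in
      ports:
        - protocol: TCP
          port: 8123                    ## HTTP interface the collector exports to in this guide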

Q: How do you handle ClickHouse performance issues?

A: Queries timing out? Tune these ClickHouse settings:

  • max_execution_time = 300 (5 minutes)
  • max_memory_usage = 20000000000 (20GB)
  • max_threads = 16 (CPU cores)

Dashboard loading slowly? Add time filters to queries. Scanning >1M traces without time bounds kills performance.

Storage growing too fast? Implement TTL policies and compression:

ALTER TABLE otel_traces MODIFY COLUMN TraceId CODEC(LZ4HC);

Kubernetes Production Setup (The Real Deal)

The official Helm chart gets you 80% there. Here's the other 20% that keeps you up at night.

Helm Values That Actually Work

OpenLIT Kubernetes Deployment

## values-production.yaml
clickhouse:
  persistence:
    enabled: true
    size: 1Ti
    storageClass: gp3
  resources:
    requests:
      memory: 32Gi
      cpu: 8
    limits:
      memory: 64Gi
      cpu: 16
  settings:
    max_memory_usage: 20000000000
    max_execution_time: 300
    
collector:
  replicas: 3
  resources:
    requests:
      memory: 4Gi
      cpu: 2
    limits:
      memory: 8Gi
      cpu: 4
      
ui:
  replicas: 2
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: nginx
      cert-manager.io/cluster-issuer: letsencrypt-prod

Storage classes matter. GP3 volumes give you better IOPS than GP2. Don't cheap out on storage - ClickHouse performance depends on it. For production Helm deployments, follow the official Helm chart documentation and review the Kubernetes best practices guide. The ClickHouse Kubernetes operator provides advanced cluster management, while the OpenTelemetry Kubernetes operator handles auto-instrumentation.
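
If your cluster doesn't already define a gp3 class, a sketch using the AWS EBS CSI driver; the IOPS and throughput numbers are starting points to tune, not recommendations from the OpenLIT docs:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "6000"              ## gp3 lets you buy IOPS independently of volume size
  throughput: "250"         ## MiB/s
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true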

Service Mesh Integration

Istio compatibility: OpenLIT works but requires specific config:

## Disable mTLS for OTLP endpoints
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: openlit-collector
spec:
  selector:
    matchLabels:
      app: openlit-collector
  mtls:
    mode: PERMISSIVE

OTLP traffic doesn't play nice with strict mTLS. We debugged this for 6 hours before finding the solution buried in GitHub issues.

Persistent Volume Gotchas

ClickHouse data corruption is real. Use ReadWriteOnce volumes, never ReadWriteMany. Shared storage corrupts ClickHouse's internal data structures.

persistence:
  storageClass: gp3
  accessMode: ReadWriteOnce
  size: 1Ti

Backup strategy is critical. ClickHouse has a BACKUP command, but nothing runs it for you. Set up automated snapshots:

## Daily backup script
kubectl exec clickhouse-0 -- clickhouse-client --query="BACKUP TABLE otel_traces TO S3('s3://backups/clickhouse/$(date +%Y%m%d)')"
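
To keep that honest, run it on a schedule instead of from someone's shell history. A CronJob sketch: the image is a placeholder, the job needs a ServiceAccount allowed to exec into the ClickHouse pod, and BACKUP ... TO S3 needs S3 credentials configured on the ClickHouse side:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: clickhouse-backup
spec:
  schedule: "0 3 * * *"                          ## daily at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          ## needs a ServiceAccount with permission to exec into the ClickHouse pod
          containers:
            - name: backup
              image: bitnami/kubectl:latest      ## placeholder; any image with kubectl works
              command:
                - /bin/sh
                - -c
                - >
                  kubectl exec clickhouse-0 --
                  clickhouse-client --query="BACKUP TABLE otel_traces
                  TO S3('s3://backups/clickhouse/$(date +%Y%m%d)')"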

Load Balancer Configuration

OTLP collector needs session affinity OFF. Default round-robin works best for trace ingestion. Sticky sessions cause uneven load distribution.

service:
  type: LoadBalancer
  sessionAffinity: None
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"

Health checks matter. OTLP collectors can appear healthy while failing to forward traces:

livenessProbe:
  httpGet:
    path: /
    port: 13133
  initialDelaySeconds: 30
readinessProbe:
  httpGet:
    path: /
    port: 13133
  initialDelaySeconds: 10
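
Port 13133 is the collector's health_check extension, and those probes only work if it's actually enabled in the collector config. A minimal sketch:

extensions:
  health_check:
    endpoint: 0.0.0.0:13133   ## default port, spelled out to match the probes above

service:
  extensions: [health_check]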

Monitoring Your Monitoring

Prometheus metrics for OpenLIT components:

## ServiceMonitor for collector metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: openlit-collector
spec:
  selector:
    matchLabels:
      app: openlit-collector
  endpoints:
  - port: metrics
    interval: 30s

Key metrics to alert on (example alert rule after the list):

  • otelcol_exporter_queue_size - traces backing up
  • clickhouse_query_duration_seconds - database performance
  • openlit_ui_response_time - dashboard availability
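
A hedged example of wiring the first of those into a PrometheusRule; the threshold is a guess keyed off the 5000 queue_size used earlier, so tune it to your own capacity:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: openlit-collector-alerts
spec:
  groups:
    - name: openlit-collector
      rules:
        - alert: OtelExporterQueueBackingUp
          expr: otelcol_exporter_queue_size > 4000   ## ~80% of the 5000 queue_size set earlier
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "OTLP exporter queue is filling up - ClickHouse is likely slow or down"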

Disaster Recovery Procedures

When ClickHouse dies:

  1. Check disk space first - 90% of failures
  2. Restart the pod if memory exhausted
  3. Restore from S3 backup if data corruption
  4. Expect 30+ minutes recovery time

When OTLP collectors fail:

  1. Rolling restart usually fixes it
  2. Check persistent queue disk usage
  3. Scale replicas if overwhelmed
  4. Near-zero recovery time with proper setup

Complete cluster failure:

  1. Restore ClickHouse data from S3
  2. Deploy fresh Helm chart
  3. Update OTLP endpoints in applications
  4. Historical data preserved, new traces resume

Total recovery time: 2-4 hours depending on data size.
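
For step 1 of the full-cluster recovery, the restore mirrors the backup command. A sketch as a one-off Job, with the same RBAC and credential caveats as the backup CronJob and a placeholder snapshot date:

apiVersion: batch/v1
kind: Job
metadata:
  name: clickhouse-restore
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: restore
          image: bitnami/kubectl:latest    ## placeholder image with kubectl
          ## RESTORE is the counterpart of BACKUP; point it at the snapshot date you want
          command:
            - /bin/sh
            - -c
            - >
              kubectl exec clickhouse-0 --
              clickhouse-client --query="RESTORE TABLE otel_traces
              FROM S3('s3://backups/clickhouse/20250101')"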

Production Deployment Options Comparison

| Deployment Method  | Setup Time | Monthly Cost     | Maintenance Overhead | Scale Limit      | Reliability                |
|--------------------|------------|------------------|----------------------|------------------|----------------------------|
| Docker Compose     | 15 minutes | $200 (single VM) | Low                  | ~1M traces/day   | ⚠️ Single point of failure |
| Kubernetes (Basic) | 2-4 hours  | $800-1500        | Medium               | ~10M traces/day  | ✅ Good with replicas      |
| Kubernetes (HA)    | 1-2 days   | $2000-5000       | High                 | ~100M traces/day | ✅ Production ready        |
| Managed Services   | 1 day      | $3000-5000       | Very Low             | Unlimited        | ✅ Enterprise grade        |
