
Production Planning: Size Your Shit Right

The official docs say "minimal resources" which is complete bullshit. Here's what you actually need.

Resource Requirements That Don't Lie

ClickHouse is the resource hog. Plan accordingly:

  • Memory: 32GB minimum, 64GB for >5M traces/day
  • CPU: 8 cores minimum, 16 for heavy workloads
  • Storage: 500GB SSD minimum, grows 10GB per million traces
  • Network: 1Gbps between OTLP collector and ClickHouse

OpenLIT Resource Architecture

We crashed production twice before learning ClickHouse needs room to breathe. Memory usage spikes 5x during aggregations, especially when ingesting burst traffic from LLM workloads. The ClickHouse performance tuning guide covers memory optimization, while the OpenTelemetry collector scaling documentation explains resource planning. For production sizing, review the observability infrastructure requirements and ClickHouse cluster deployment guide. The Kubernetes resource management patterns show how to configure limits properly.
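
As a rough illustration of the headroom we leave now, here's a sketch against the Helm values shown later in this guide; whether your chart passes the settings: block through to ClickHouse verbatim is something to verify against its values schema:

clickhouse:
  resources:
    requests:
      memory: 32Gi            ## steady-state working set
    limits:
      memory: 64Gi            ## roughly 2x headroom for aggregation spikes
  settings:
    max_memory_usage: 20000000000                     ## 20GB per-query ceiling
    max_bytes_before_external_group_by: 10000000000   ## spill big GROUP BYs to disk instead of OOMing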

Network Architecture for Scale

OTLP endpoints are latency-sensitive. Keep collectors geographically close to your apps. Cross-region OTLP calls add 200-500ms to every request - your users will notice.

App Servers (US-East) → OTLP Collector (US-East) → ClickHouse (US-East)
App Servers (EU-West) → OTLP Collector (EU-West) → ClickHouse Replica

Port conflicts are real. The default OTLP/HTTP port 4318 conflicts with:

  • Jaeger collectors
  • Other OpenTelemetry setups
  • Local development proxies

Pick custom ports and document them. We use 4320 for OpenLIT to avoid the mess.
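
If you change the port, it happens in the collector's receiver config. A minimal sketch (4320 is just our convention; the gRPC line is only there to show it stays on its own default):

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4320   ## custom OTLP/HTTP port instead of the default 4318
      grpc:
        endpoint: 0.0.0.0:4317   ## gRPC default, unchanged

Whatever you pick, update OTEL_EXPORTER_OTLP_ENDPOINT (or the equivalent SDK setting) in every app shipping traces, or they'll keep hammering 4318.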

Storage Strategy (Don't Fill Your Disks)

Trace retention grows faster than you think:

  • 1M traces = ~10GB storage
  • 100M traces/month = 1TB storage
  • Indexes and aggregations add 30% overhead

Set up retention policies from day one. This killed our staging environment when traces filled a 500GB disk in 3 days.

-- ClickHouse retention policy example
ALTER TABLE otel_traces 
MODIFY TTL toDate(Timestamp) + INTERVAL 30 DAY;

High Availability Setup

Single points of failure that will bite you:

  1. ClickHouse failure = complete observability loss
  2. OTLP collector failure = trace ingestion stops
  3. OpenLIT UI failure = dashboards go dark

OpenLIT High Availability

Deploy everything in HA mode from the start. The official Helm chart supports replicas but doesn't configure persistent volumes properly. For production HA deployments, review the Kubernetes high availability patterns and ClickHouse replication strategies. The OpenTelemetry collector high availability guide covers load balancing approaches, while the observability stack resilience patterns explain disaster recovery procedures.

ClickHouse clustering is painful but necessary:

  • 3+ nodes minimum for fault tolerance
  • Shared storage or replication required
  • ZooKeeper dependency adds complexity

Budget 2-3 days for proper ClickHouse clustering setup. The ClickHouse operator helps but brings its own operational overhead. For production clustering, follow the ClickHouse distributed table setup and ZooKeeper cluster configuration guide. The Kubernetes StatefulSet patterns show how to deploy clustered databases properly.
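
If you go the operator route, the cluster definition ends up looking roughly like this. A sketch assuming the Altinity clickhouse-operator CRDs; field names can shift between operator versions, and the ZooKeeper host is a placeholder:

apiVersion: clickhouse.altinity.com/v1
kind: ClickHouseInstallation
metadata:
  name: openlit-clickhouse
spec:
  configuration:
    zookeeper:
      nodes:
        - host: zookeeper.zoo.svc.cluster.local   ## placeholder ZooKeeper service
    clusters:
      - name: traces
        layout:
          shardsCount: 1      ## start with one shard; replicas buy you the fault tolerance
          replicasCount: 3    ## matches the 3+ node minimum above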

Production Deployment Questions (The Real Shit)

Q: How much does this actually cost to run in production?

A: AWS costs for a moderate workload (5M traces/day):

  • 3x c5.2xlarge for ClickHouse cluster: $1,200/month
  • 2x c5.large for OTLP collectors: $300/month
  • 1TB EBS storage: $100/month
  • Network transfer: $50-200/month

Total: ~$1,650/month for a solid production setup. Cheaper than Datadog but you're managing it yourself.

Q: What breaks first when you scale up?

A: ClickHouse memory limits. You'll see "Memory limit exceeded" errors when trace ingestion spikes. The default memory limits in Helm charts are too conservative.

Set max_memory_usage = 20000000000 (20GB) minimum in ClickHouse config. Also tune max_bytes_before_external_group_by for large aggregation queries.

Q: How do you handle OTLP collector failures?

A: Buffering is critical. Configure persistent queues in the OTLP collector:

exporters:
  otlphttp:
    endpoint: http://clickhouse:8123
    sending_queue:
      enabled: true
      num_consumers: 16
      queue_size: 5000
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s

Without this, trace loss during collector restarts is guaranteed. We lost 2 hours of traces before implementing proper buffering.
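
One caveat: sending_queue on its own buffers in memory, so it won't survive a pod restart by itself. For a queue that does, the collector's file_storage extension can back it. A sketch, assuming the contrib distribution of the collector and a persistent volume at the path shown:

extensions:
  file_storage:
    directory: /var/lib/otelcol/queue   ## must live on a persistent volume, not emptyDir

exporters:
  otlphttp:
    endpoint: http://clickhouse:8123
    sending_queue:
      enabled: true
      storage: file_storage             ## persist the queue through restarts

service:
  extensions: [file_storage]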

Q: What's the real downtime during updates?

A: Depends on the component:

  • ClickHouse updates: 5-10 minutes with proper clustering, 30+ minutes for a single node
  • OTLP collector updates: near zero with rolling deployments
  • OpenLIT UI updates: 30 seconds, traces keep flowing

Plan ClickHouse updates during maintenance windows. Schema migrations can take hours on large trace tables.
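
The near-zero collector number assumes the Deployment actually rolls one pod at a time. A sketch of the strategy we'd pin down (standard Kubernetes fields; the image tag is just an example to pin yourself):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: openlit-collector              ## illustrative name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: openlit-collector
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0                ## keep full ingestion capacity during the roll
      maxSurge: 1                      ## new pod comes up before an old one goes away
  template:
    metadata:
      labels:
        app: openlit-collector
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.103.0   ## pin your own version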

Q: How do you debug when OpenLIT itself is broken?

A: The chicken-and-egg problem: you need observability to debug your observability tool.

Keep these external monitoring tools:

  • Node exporter for host metrics
  • Prometheus for ClickHouse metrics
  • CloudWatch/Datadog for basic health checks

When OpenLIT dashboards are dark, fall back to raw ClickHouse queries:

SELECT count() FROM otel_traces WHERE Timestamp > now() - INTERVAL 1 HOUR;

Q: What security considerations actually matter?

A: Network segmentation: Isolate ClickHouse from direct internet access. Only OTLP collectors should reach the database.

Trace data contains secrets: LLM prompts and responses may include API keys, user data, PII. Enable trace filtering at the collector level:

processors:
  filter:
    traces:
      include:
        match_type: regexp
        attributes:
          - key: service.name
            value: "(openlit|llm-service)"

Default credentials: Change user@openlit.io / openlituser immediately or you'll get pwned.
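
On the network segmentation point, a hedged sketch of "only collectors reach the database" as a Kubernetes NetworkPolicy. The app: labels are assumptions based on the labels used elsewhere in this guide; match them to whatever your Helm chart actually sets:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: clickhouse-ingress-lockdown
spec:
  podSelector:
    matchLabels:
      app: clickhouse                   ## assumed label on the ClickHouse pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: openlit-collector    ## only the OTLP collectors get in
      ports:
        - protocol: TCP
          port: 8123                    ## HTTP interface the collector exports to in this guide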

Q: How do you handle ClickHouse performance issues?

A: Queries timing out? Tune these ClickHouse settings:

  • max_execution_time = 300 (5 minutes)
  • max_memory_usage = 20000000000 (20GB)
  • max_threads = 16 (CPU cores)

Dashboard loading slowly? Add time filters to queries. Scanning >1M traces without time bounds kills performance.

Storage growing too fast? Implement TTL policies and compression:

ALTER TABLE otel_traces MODIFY COLUMN TraceId CODEC(LZ4HC);

Kubernetes Production Setup (The Real Deal)

The official Helm chart gets you 80% there. Here's the other 20% that keeps you up at night.

Helm Values That Actually Work

OpenLIT Kubernetes Deployment

## values-production.yaml
clickhouse:
  persistence:
    enabled: true
    size: 1Ti
    storageClass: gp3
  resources:
    requests:
      memory: 32Gi
      cpu: 8
    limits:
      memory: 64Gi
      cpu: 16
  settings:
    max_memory_usage: 20000000000
    max_execution_time: 300
    
collector:
  replicas: 3
  resources:
    requests:
      memory: 4Gi
      cpu: 2
    limits:
      memory: 8Gi
      cpu: 4
      
ui:
  replicas: 2
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: nginx
      cert-manager.io/cluster-issuer: letsencrypt-prod

Storage classes matter. GP3 volumes give you better IOPS than GP2. Don't cheap out on storage - ClickHouse performance depends on it. For production Helm deployments, follow the official Helm chart documentation and review the Kubernetes best practices guide. The ClickHouse Kubernetes operator provides advanced cluster management, while the OpenTelemetry Kubernetes operator handles auto-instrumentation.
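
If your cluster doesn't already define a gp3 class, a sketch using the AWS EBS CSI driver; the IOPS and throughput numbers are starting points to tune, not recommendations from the OpenLIT docs:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "6000"              ## gp3 lets you buy IOPS independently of volume size
  throughput: "250"         ## MiB/s
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true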

Service Mesh Integration

Istio compatibility: OpenLIT works but requires specific config:

## Disable mTLS for OTLP endpoints
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: openlit-collector
spec:
  selector:
    matchLabels:
      app: openlit-collector
  mtls:
    mode: PERMISSIVE

OTLP traffic doesn't play nice with strict mTLS. We debugged this for 6 hours before finding the solution buried in GitHub issues.

Persistent Volume Gotchas

ClickHouse data corruption is real. Use ReadWriteOnce volumes, never ReadWriteMany. Shared storage corrupts ClickHouse's internal data structures.

persistence:
  storageClass: gp3
  accessMode: ReadWriteOnce
  size: 1Ti

Backup strategy is critical. ClickHouse has a BACKUP command, but nothing runs it for you. Set up automated snapshots:

## Daily backup script
kubectl exec clickhouse-0 -- clickhouse-client --query="BACKUP TABLE otel_traces TO S3('s3://backups/clickhouse/$(date +%Y%m%d)')"
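
To keep that honest, run it on a schedule instead of from someone's shell history. A CronJob sketch: the image is a placeholder, the job needs a ServiceAccount allowed to exec into the ClickHouse pod, and BACKUP ... TO S3 needs S3 credentials configured on the ClickHouse side:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: clickhouse-backup
spec:
  schedule: "0 3 * * *"                          ## daily at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          ## needs a ServiceAccount with permission to exec into the ClickHouse pod
          containers:
            - name: backup
              image: bitnami/kubectl:latest      ## placeholder; any image with kubectl works
              command:
                - /bin/sh
                - -c
                - >
                  kubectl exec clickhouse-0 --
                  clickhouse-client --query="BACKUP TABLE otel_traces
                  TO S3('s3://backups/clickhouse/$(date +%Y%m%d)')"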

Load Balancer Configuration

OTLP collector needs session affinity OFF. Default round-robin works best for trace ingestion. Sticky sessions cause uneven load distribution.

service:
  type: LoadBalancer
  sessionAffinity: None
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"

Health checks matter. OTLP collectors can appear healthy while failing to forward traces:

livenessProbe:
  httpGet:
    path: /
    port: 13133
  initialDelaySeconds: 30
readinessProbe:
  httpGet:
    path: /
    port: 13133
  initialDelaySeconds: 10
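
Port 13133 is the collector's health_check extension, and those probes only work if it's actually enabled in the collector config. A minimal sketch:

extensions:
  health_check:
    endpoint: 0.0.0.0:13133   ## default port, spelled out to match the probes above

service:
  extensions: [health_check]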

Monitoring Your Monitoring

Prometheus metrics for OpenLIT components:

## ServiceMonitor for collector metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: openlit-collector
spec:
  selector:
    matchLabels:
      app: openlit-collector
  endpoints:
  - port: metrics
    interval: 30s

Key metrics to alert on (example alert rule after the list):

  • otelcol_exporter_queue_size - traces backing up
  • clickhouse_query_duration_seconds - database performance
  • openlit_ui_response_time - dashboard availability
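
A hedged example of wiring the first of those into a PrometheusRule; the threshold is a guess keyed off the 5000 queue_size used earlier, so tune it to your own capacity:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: openlit-collector-alerts
spec:
  groups:
    - name: openlit-collector
      rules:
        - alert: OtelExporterQueueBackingUp
          expr: otelcol_exporter_queue_size > 4000   ## ~80% of the 5000 queue_size set earlier
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "OTLP exporter queue is filling up - ClickHouse is likely slow or down"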

Disaster Recovery Procedures

When ClickHouse dies:

  1. Check disk space first - 90% of failures
  2. Restart the pod if memory exhausted
  3. Restore from S3 backup if data corruption
  4. Expect 30+ minutes recovery time

When OTLP collectors fail:

  1. Rolling restart usually fixes it
  2. Check persistent queue disk usage
  3. Scale replicas if overwhelmed
  4. Near-zero recovery time with proper setup

Complete cluster failure:

  1. Restore ClickHouse data from S3
  2. Deploy fresh Helm chart
  3. Update OTLP endpoints in applications
  4. Historical data preserved, new traces resume

Total recovery time: 2-4 hours depending on data size.
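
For step 1 of the full-cluster recovery, the restore mirrors the backup command. A sketch as a one-off Job, with the same RBAC and credential caveats as the backup CronJob and a placeholder snapshot date:

apiVersion: batch/v1
kind: Job
metadata:
  name: clickhouse-restore
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: restore
          image: bitnami/kubectl:latest    ## placeholder image with kubectl
          ## RESTORE is the counterpart of BACKUP; point it at the snapshot date you want
          command:
            - /bin/sh
            - -c
            - >
              kubectl exec clickhouse-0 --
              clickhouse-client --query="RESTORE TABLE otel_traces
              FROM S3('s3://backups/clickhouse/20250101')"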

Production Deployment Options Comparison

| Deployment Method  | Setup Time | Monthly Cost     | Maintenance Overhead | Scale Limit      | Reliability                |
|--------------------|------------|------------------|----------------------|------------------|----------------------------|
| Docker Compose     | 15 minutes | $200 (single VM) | Low                  | ~1M traces/day   | ⚠️ Single point of failure |
| Kubernetes (Basic) | 2-4 hours  | $800-1500        | Medium               | ~10M traces/day  | ✅ Good with replicas      |
| Kubernetes (HA)    | 1-2 days   | $2000-5000       | High                 | ~100M traces/day | ✅ Production ready        |
| Managed Services   | 1 day      | $3000-5000       | Very Low             | Unlimited        | ✅ Enterprise grade        |
