OpenLIT Production Deployment: AI-Optimized Technical Reference

Critical Resource Requirements

ClickHouse (Primary Resource Constraint)

  • Memory: 32GB minimum, 64GB for >5M traces/day
  • Memory spikes: 5x during aggregations (major failure point)
  • CPU: 8 cores minimum, 16 for heavy workloads
  • Storage: 500GB SSD minimum, grows 10GB per million traces
  • Network: 1Gbps between OTLP collector and ClickHouse

OTLP Collectors

  • Latency-sensitive: 200-500ms cross-region penalty affects user experience
  • Port conflicts: The default OTLP/HTTP port 4318 collides with Jaeger and other OpenTelemetry collectors on the same host
  • Recommended: Move the receiver to port 4320 to avoid conflicts (receiver sketch below)
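
A minimal receiver sketch for the port change, assuming a standard OTLP receiver; the 0.0.0.0 bind address is an assumption, narrow it to your pod network where possible. Applications then need their OTLP endpoint (e.g. OTEL_EXPORTER_OTLP_ENDPOINT) pointed at the new port.

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # OTLP gRPC, unchanged default
      http:
        endpoint: 0.0.0.0:4320   # moved off 4318 to avoid the Jaeger conflict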

Storage Growth Rate (Critical for Capacity Planning)

  • 1M traces ≈ 10GB storage
  • 100M traces/month ≈ 1TB storage
  • Indexes and aggregations add ~30% overhead
  • Worked example: at the 5M traces/day reference workload that is ~50GB/day raw, ~65GB/day with overhead, so a 500GB volume lasts roughly a week at that rate
  • Failure scenario: 500GB disk filled in 3 days without retention policies

Production Cost Analysis

AWS Cost Structure (5M traces/day)

  • 3x c5.2xlarge ClickHouse cluster: $1,200/month
  • 2x c5.large OTLP collectors: $300/month
  • 1TB EBS storage: $100/month
  • Network transfer: $50-200/month
  • Total: ~$1,650/month self-managed, typically well below a comparable Datadog bill at this volume

Failure Modes and Solutions

Primary Failure Points (In Order of Frequency)

  1. ClickHouse memory exhaustion → "Memory limit exceeded" errors

    • Solution: Set max_memory_usage = 20000000000 (20GB minimum)
    • Tune max_bytes_before_external_group_by for large aggregations (see the sketch after this list)
  2. Disk space exhaustion → 90% of ClickHouse failures

    • Implement TTL policies from day one
    • Monitor storage growth: 10GB per million traces
  3. OTLP collector buffering failure → Trace loss during restarts

    • Configure persistent queues with 5000+ queue_size
    • Enable retry_on_failure with exponential backoff
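
A minimal profile-settings sketch for the memory fixes in item 1, in the same style as the production ClickHouse settings further down; the 10GB spill threshold is an assumption, commonly set to roughly half of max_memory_usage.

settings:
  max_memory_usage: 20000000000                     # 20GB hard cap per query
  max_bytes_before_external_group_by: 10000000000   # assumed value: spill GROUP BY state to disk past ~10GB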

High Availability Requirements

  • ClickHouse clustering: 3+ nodes minimum, requires ZooKeeper (or ClickHouse Keeper)
  • Implementation time: 2-3 days for proper setup
  • Single points of failure: without HA, losing any single component means complete observability loss

Performance Thresholds

ClickHouse Performance Limits

  • Query timeout: Set max_execution_time = 300 (5 minutes)
  • Memory per query: max_memory_usage = 20000000000 (20GB)
  • Thread limit: max_threads = 16 (match CPU cores)
  • UI breaking point: querying >1M traces without time bounds kills dashboard performance

Scaling Breakpoints

Deployment Method | Scale Limit | Setup Time | Monthly Cost | Reliability
Docker Compose | ~1M traces/day | 15 minutes | $200 | Single point of failure
Kubernetes Basic | ~10M traces/day | 2-4 hours | $800-1500 | Good with replicas
Kubernetes HA | ~100M traces/day | 1-2 days | $2000-5000 | Production ready
Managed Services | Unlimited | 1 day | $3000-5000 | Enterprise grade

Configuration That Actually Works

Production ClickHouse Settings

settings:
  max_memory_usage: 20000000000   # 20GB hard cap per query
  max_execution_time: 300         # kill queries after 5 minutes
  max_threads: 16                 # match the CPU core count

OTLP Collector Buffering (Critical)

exporters:
  otlphttp:
    sending_queue:
      enabled: true
      num_consumers: 16
      queue_size: 5000        # buffer at least 5000 batches while ClickHouse is slow or down
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s       # exponential backoff capped at 30s
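
The sending_queue above is in-memory, so buffered traces are still lost if the collector itself restarts. To get the persistent queue called for in the failure list, back the queue with the file_storage extension; a minimal sketch, assuming /var/lib/otelcol is a writable volume:

extensions:
  file_storage:
    directory: /var/lib/otelcol/queue

exporters:
  otlphttp:
    sending_queue:
      enabled: true
      storage: file_storage   # persist queued batches to disk across restarts

service:
  extensions: [file_storage]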

Storage Configuration

  • Use GP3 over GP2: Better IOPS per dollar (StorageClass sketch below)
  • ReadWriteOnce only: ReadWriteMany corrupts ClickHouse data
  • Backup strategy: No built-in backup, requires automated S3 snapshots
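
A minimal StorageClass/PVC sketch for the two points above, assuming AWS EKS with the EBS CSI driver; the IOPS and throughput values are assumptions, size them to your ingest rate.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: clickhouse-gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "6000"        # assumed; gp3 baseline is 3000
  throughput: "250"   # MB/s, assumed
allowVolumeExpansion: true
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: clickhouse-data
spec:
  accessModes:
    - ReadWriteOnce   # never ReadWriteMany: concurrent writers corrupt ClickHouse data
  storageClassName: clickhouse-gp3
  resources:
    requests:
      storage: 500Gi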

Security Critical Points

Data Exposure Risks

  • Trace data contains secrets: LLM prompts/responses may include API keys, PII
  • Default credentials vulnerability: Change user@openlit.io/openlituser immediately
  • Network segmentation: Isolate ClickHouse from direct internet access (NetworkPolicy sketch below)
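
A minimal NetworkPolicy sketch for the segmentation point; the namespace and the app labels (clickhouse, otel-collector, openlit) are assumptions, match them to your manifests.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: clickhouse-ingress
  namespace: openlit
spec:
  podSelector:
    matchLabels:
      app: clickhouse
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: otel-collector
        - podSelector:
            matchLabels:
              app: openlit
      ports:
        - protocol: TCP
          port: 8123   # ClickHouse HTTP interface
        - protocol: TCP
          port: 9000   # ClickHouse native protocol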

Required Filtering

processors:
  filter:
    spans:
      include:
        match_type: regexp
        services:
          - "(openlit|llm-service)"   # keep only traces from these services

Downtime Expectations

Update Downtime Windows

  • ClickHouse updates: 5-10 minutes (HA), 30+ minutes (single node)
  • OTLP collector updates: Near zero with rolling deployments
  • OpenLIT UI updates: 30 seconds, traces continue flowing
  • Schema migrations: Hours on large trace tables

Disaster Recovery Times

  • ClickHouse failure recovery: 30+ minutes
  • OTLP collector recovery: Near-zero with proper setup
  • Complete cluster failure: 2-4 hours (depends on data size)

Kubernetes Production Gotchas

Service Mesh Compatibility

  • Istio: Requires PERMISSIVE mTLS mode for OTLP endpoints
  • OTLP traffic: Doesn't work with strict mesh-wide mTLS; relax only the OTLP ports (sketch below)
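
A minimal PeerAuthentication sketch that keeps the workload strict but relaxes the OTLP ports; the selector label and port numbers (4317/4318, or 4320 if you moved the HTTP receiver) are assumptions.

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: otel-collector-otlp
  namespace: openlit
spec:
  selector:
    matchLabels:
      app: otel-collector
  mtls:
    mode: STRICT
  portLevelMtls:
    4317:
      mode: PERMISSIVE   # accept plaintext OTLP gRPC from outside the mesh
    4318:
      mode: PERMISSIVE   # accept plaintext OTLP HTTP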

Load Balancer Requirements

  • Session affinity: Must be OFF for OTLP collectors
  • Health checks: Standard probes miss trace forwarding failures
  • Use the collector's dedicated health_check endpoint on port 13133 for probes (sketch below)
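
A minimal sketch wiring the collector's health_check extension into Kubernetes probes; the extension's defaults are port 13133 and path /, while the Deployment fragment is an assumption about your manifest layout.

# Collector configuration
extensions:
  health_check:
    endpoint: 0.0.0.0:13133

service:
  extensions: [health_check]

# Deployment probe fragment
readinessProbe:
  httpGet:
    path: /
    port: 13133
livenessProbe:
  httpGet:
    path: /
    port: 13133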

Monitoring Requirements

Key metrics for alerting (example alert rule below):

  • otelcol_exporter_queue_size → traces backing up
  • clickhouse_query_duration_seconds → database performance
  • openlit_ui_response_time → dashboard availability
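
A minimal Prometheus alert sketch for the first metric, assuming the collector's internal telemetry is already scraped; the threshold is an assumption pegged at ~80% of the 5000 queue_size configured above.

groups:
  - name: openlit-pipeline
    rules:
      - alert: OtelExporterQueueBackingUp
        expr: otelcol_exporter_queue_size > 4000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "OTLP exporter queue is backing up; ClickHouse is likely slow or unreachable"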

Debugging When OpenLIT is Down

External Monitoring Requirements

  • Node exporter for host metrics
  • Prometheus for ClickHouse metrics (scrape sketch below)
  • CloudWatch/Datadog for basic health checks
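
A minimal Prometheus scrape sketch for the external monitoring above; the target names and ports (node_exporter on 9100, ClickHouse's built-in Prometheus endpoint on 9363, the collector's internal telemetry on 8888) are assumptions based on common defaults.

scrape_configs:
  - job_name: node
    static_configs:
      - targets: ["node-exporter:9100"]
  - job_name: clickhouse
    static_configs:
      - targets: ["clickhouse:9363"]       # requires the <prometheus> endpoint enabled in ClickHouse config
  - job_name: otel-collector
    static_configs:
      - targets: ["otel-collector:8888"]   # collector internal telemetry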

Emergency Query Access

-- Check trace ingestion in last hour
SELECT count() FROM otel_traces WHERE Timestamp > now() - INTERVAL 1 HOUR;

Retention and Compliance

Data Retention Implementation

-- Set 30-day retention
ALTER TABLE otel_traces 
MODIFY TTL toDate(Timestamp) + INTERVAL 30 DAY;

-- Add compression
ALTER TABLE otel_traces MODIFY COLUMN TraceId CODEC(LZ4HC);

GDPR Considerations

  • Configure log_personal_data_max_bytes for data privacy
  • Implement trace filtering for PII removal (attributes processor sketch below)
  • Document data retention policies
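
A minimal PII-stripping sketch using the collector's attributes processor; the gen_ai.* keys are assumptions, check which attributes your OpenLIT SDK version actually emits before deleting them, and add the processor to the traces pipeline ahead of the exporter.

processors:
  attributes/strip_pii:
    actions:
      - key: gen_ai.prompt       # assumed key holding raw LLM prompts
        action: delete
      - key: gen_ai.completion   # assumed key holding raw LLM responses
        action: delete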

Resource References (Production-Tested)

Production Resources (Actually Useful)

Link | Description
OpenLIT Helm Chart | The official OpenLIT Helm chart, providing production-ready values and configurations for deploying OpenLIT on Kubernetes clusters.
ClickHouse Operator | The Altinity ClickHouse Operator for Kubernetes, offering automated management and scaling of ClickHouse clusters, superior to manual configuration.
OpenTelemetry Operator | The OpenTelemetry Operator for Kubernetes, enabling automatic injection and management of OTLP collectors within your cluster for seamless observability.
Kubernetes Resource Monitoring | Official Kubernetes documentation providing comprehensive guidance on how to effectively monitor the resource usage of your cluster and applications.
ClickHouse Performance Tuning | The official ClickHouse documentation offering detailed strategies and best practices for optimizing database performance and query execution.
ClickHouse Backup and Restore | Comprehensive guide from ClickHouse on implementing robust data protection strategies, including backup and restore procedures for critical database instances.
ClickHouse Cluster Configuration | Official documentation detailing the architecture and deployment of ClickHouse clusters, providing a comprehensive guide for multi-node setups.
Memory and CPU Tuning | The official ClickHouse settings documentation covering the critical parameters for memory and CPU tuning to enhance performance.
OTLP Collector Configuration | Official OpenTelemetry documentation providing detailed guidance on configuring the OTLP collector for robust and efficient production environments.
Sampling Strategies | OpenTelemetry sampling strategies for managing and reducing the volume of traces collected in production systems.
Internal Telemetry | How to leverage OpenTelemetry's internal telemetry to monitor the health, performance, and operational metrics of your collectors.
Security Best Practices | Official OpenTelemetry security documentation outlining best practices for securing OTLP endpoints and ensuring data integrity in your observability pipeline.
AWS EKS OpenLIT Setup | Guide for deploying OpenLIT on Amazon EKS, covering the necessary steps and configurations for a successful setup in the AWS cloud environment.
GKE Persistent Volumes | Official Google Kubernetes Engine documentation explaining persistent volumes, essential for providing reliable and durable storage for ClickHouse deployments.
Azure AKS Storage | Microsoft Azure AKS documentation detailing concepts and procedures for setting up persistent volumes, crucial for stateful applications like ClickHouse.
DigitalOcean Kubernetes | DigitalOcean's Kubernetes service, a cost-effective and user-friendly option for deploying and managing containerized applications.
Prometheus ClickHouse Exporter | GitHub repository for the Prometheus ClickHouse Exporter, a tool designed to expose comprehensive database metrics for monitoring with Prometheus.
Grafana Dashboards | A collection of pre-built Grafana dashboards, including those specifically designed for visualizing OpenLIT telemetry data and metrics.
AlertManager Rules | Official Prometheus AlertManager documentation, providing guidance on configuring robust alerting rules for production environments and incident notification.
PagerDuty Integration | PagerDuty guide for integrating with Prometheus, enabling effective on-call incident management and automated alert routing for critical issues.
Velero Kubernetes Backup | Velero's official website, detailing its capabilities as a robust open-source tool for performing cluster-level backup and restore of Kubernetes resources.
S3 Backup Scripts | AWS S3 user guide demonstrating how to implement automated backup and restore procedures for data, including ClickHouse, using AWS CLI scripts.
Database Migration Tools | ClickHouse documentation on `clickhouse-local` and other utilities, useful for efficient data transfer and migration tasks between databases.
Multi-Region Setup Guide | Kubernetes best practices guide for setting up multi-zone and multi-region deployments to achieve high availability and geographic redundancy.
RBAC Configuration | Official Kubernetes documentation on Role-Based Access Control (RBAC) configuration, essential for managing and securing access to cluster resources.
Network Policies | Kubernetes documentation explaining network policies, which are crucial for securing pod-to-pod communication and isolating network traffic within the cluster.
Secret Management | Official Kubernetes documentation on secret management, providing secure methods for handling sensitive information like API keys and credentials.
GDPR Compliance | ClickHouse documentation section on settings related to data privacy, including `log_personal_data_max_bytes`, important for GDPR compliance.
Kubernetes Resource Limits | Official Kubernetes documentation on managing container resources, including setting limits to prevent resource waste and ensure efficient cluster utilization.
Spot Instances Guide | AWS EC2 user guide on using Spot Instances, a cost-effective option to significantly reduce compute costs for fault-tolerant applications.
Storage Optimization | ClickHouse documentation on storing data, covering strategies for storage optimization, including data compression and retention policies to manage costs.
Auto-scaling Setup | Kubernetes documentation on Horizontal Pod Autoscaling, guiding users through setting up automatic scaling of applications based on demand and resource utilization.
OpenLIT Slack Community | The official OpenLIT Slack community for connecting with other users, asking questions, sharing knowledge, and receiving support for your OpenLIT implementations.
GitHub Issues | The OpenLIT GitHub Issues page for reporting bugs, submitting feature requests, and tracking the development progress of the OpenLIT project.
ClickHouse Slack | The ClickHouse Slack community for database-specific support, discussions, and connecting with experts and fellow users of ClickHouse.
OpenTelemetry CNCF Slack | The broader OpenTelemetry and CNCF observability community on Slack for getting help, sharing insights, and discussing best practices.
