LLM Production Monitoring: AI-Optimized Technical Guide
Executive Summary
Traditional APM tools fail for LLM monitoring because they cannot detect semantic failures, cost spirals, or quality degradation. Production LLM monitoring requires specialized tooling and costs $500-6000/month for reliable implementation.
Critical Failure Modes
Silent Failures That Break Production
- Rate Limiting Without Error Codes: OpenAI can throttle requests to a crawl without returning an error code; the API still shows HTTP 200 while users wait 30+ seconds
- Token Usage Explosions: Single user queries can trigger 10,000-token responses, increasing daily costs from $200 to $2,800 overnight
- Prompt Injection Attacks: Users bypass instructions to generate expensive long-form content, burning $3,000+ before detection
- Model Update Breaking Changes: Provider model updates change response formats without warning, breaking parsing logic in production
- Empty Response Failures: Silent rate limit hits cause 30% blank responses for 6+ hours while traditional metrics show 99.9% uptime
Cost Spiral Indicators
- Alert Threshold: 150% of daily normal spend (not 500% - too late for recovery)
- Token Usage Spikes: Average request jumping from 500 to 5,000+ tokens indicates system abuse
- Weekend Debugging: Forgotten test scripts can generate $1,200+ in charges during off-hours
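A minimal sketch of the 150% spend alert described above; `get_daily_spend` data, the 7-day baseline window, and the dollar figures are placeholders for whatever cost source (provider usage API, internal token metering) you actually have.

```python
# Hypothetical cost-spike check: compare today's spend against a rolling baseline.
# The spend history here is illustrative; plug in your real cost source.
from statistics import mean

SPIKE_THRESHOLD = 1.5  # alert at 150% of normal daily spend, per the guidance above

def check_cost_spike(todays_spend_usd: float, last_7_days_usd: list[float]) -> bool:
    """Return True if today's spend exceeds 150% of the 7-day average."""
    baseline = mean(last_7_days_usd)
    return todays_spend_usd > SPIKE_THRESHOLD * baseline

if __name__ == "__main__":
    history = [210.0, 195.0, 205.0, 220.0, 198.0, 215.0, 202.0]
    if check_cost_spike(330.0, history):
        print("ALERT: daily LLM spend above 150% of 7-day baseline")
```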
Technical Implementation Stack
Required Infrastructure
- Kubernetes: v1.24+ minimum (v1.23 breaks compatibility)
- Memory: 16GB minimum (8GB insufficient for Prometheus memory consumption)
- Storage: 200GB+ for metrics retention (retention policies are unreliable)
- Python: 3.9+ required (3.8 breaks OpenTelemetry in production)
- Node.js: 18+ required (LLM libraries incompatible with v16)
Core Monitoring Components
OpenTelemetry Collector (Primary Failure Point)
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s      # required; the collector refuses to start without it
    limit_mib: 512          # Increase to 2048 if OOMing
  batch:
    send_batch_size: 1024   # Reduce if "batch too large" errors

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]  # Order critical: memory_limiter must run first
      # exporters omitted here; add your metrics backend (e.g. Prometheus)
```
Common Failures:
- `connection refused` on port 4317: Docker network configuration failure
- `jaeger-collector not found`: DNS resolution failed, use IP addresses
- `memory limit exceeded`: Double the memory_limiter value
- `batch processor failed`: Reduce batch size for oversized traces
Prometheus Configuration
- Retention: Maximum 7 days (consumes all disk space beyond this)
- Scrape Interval: 15 seconds minimum (shorter intervals crash Prometheus)
- Sampling Rate: 1% maximum for traces (100% fills storage within days)
Jaeger Tracing
- Development: Works reliably
- Production: Frequent "trace collection timed out" errors
- Storage: Default sampling rate (100%) requires immediate reduction to 1%
LLM-Specific Monitoring Tools
OpenLIT (Recommended)
- Advantage: Auto-instrumentation without manual API call wrapping
- Compatibility: OpenAI, Anthropic, LangChain, most major providers
- Implementation: 2-line code addition for full instrumentation
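A minimal sketch of that two-line setup, assuming the `openlit` Python package and an OTel Collector listening on the OTLP/HTTP endpoint from the config above; the `otlp_endpoint` parameter follows OpenLIT's documented `init()` call, but verify against the current docs.

```python
# Hedged sketch: auto-instrument OpenAI/Anthropic/LangChain calls with OpenLIT.
# Assumes `pip install openlit` and a collector reachable at localhost:4318.
import openlit

openlit.init(otlp_endpoint="http://localhost:4318")  # the advertised two-line addition

# LLM calls made after init() are traced automatically; no per-call wrapping needed.
```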
Alternative Solutions by Budget
| Solution Type | Monthly Cost | Setup Time | Reliability |
|---|---|---|---|
| DIY Everything | $300-800 | 2-3 days | Frequent failures |
| Hybrid Approach | $800-2200 | 15-30 hours | Weekend debugging required |
| Full Commercial | $2000-6000 | 5-15 hours | Vendor lock-in risk |
Provider-Specific Support Quality
- OpenAI: Universal support across all tools
- Anthropic: Universal support across all tools
- Google Cloud AI: Inconsistent, half the metrics often broken
- AWS Bedrock: Usually broken or incomplete implementation
- Azure OpenAI: Mostly functional (Microsoft-branded OpenAI)
- Local Models (Ollama/vLLM): Minimal support, requires custom instrumentation
Critical Monitoring Metrics
Immediate Alert Thresholds
- Error Rate: >5% (not 1% - reduces false positives)
- Response Time: >30 seconds (users abandon after 15 seconds)
- Cost Spike: >200% of daily average (not 20% - normal variance)
- Service Availability: No successful requests in 5 minutes
Secondary Alert Thresholds
- Token Usage: 3x increase per request (indicates prompt gaming)
- GPU Memory: >90% for local models
- Disk Space: <10GB remaining (Prometheus storage consumption)
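One way to encode the thresholds from both lists above as data plus a simple evaluator; the metric names and the shape of the `metrics` dict are illustrative, not tied to any particular monitoring API.

```python
# Illustrative encoding of the alert thresholds above; the metrics dict stands in
# for whatever your monitoring pipeline actually reports.
ALERT_RULES = {
    "error_rate":        lambda m: m["error_rate"] > 0.05,        # >5% errors
    "p95_latency_s":     lambda m: m["p95_latency_s"] > 30,       # >30 s responses
    "cost_vs_daily_avg": lambda m: m["cost_vs_daily_avg"] > 2.0,  # >200% of daily average
    "no_success_5m":     lambda m: m["successes_last_5m"] == 0,   # no successful requests
    "token_ratio":       lambda m: m["tokens_vs_baseline"] > 3.0, # 3x tokens per request
    "gpu_mem_pct":       lambda m: m["gpu_mem_pct"] > 0.90,       # local models only
    "disk_free_gb":      lambda m: m["disk_free_gb"] < 10,        # Prometheus storage
}

def evaluate_alerts(metrics: dict) -> list[str]:
    """Return the names of all rules currently firing."""
    return [name for name, rule in ALERT_RULES.items() if rule(metrics)]
```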
Security Requirements
Data Protection Implementation
- Sensitive Data Regex Patterns (redaction sketch below):
  - SSN: `\b\d{3}-\d{2}-\d{4}\b`
  - Credit Cards: `\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b`
  - Email: `\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b`
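A minimal redaction sketch using the three patterns above with Python's `re` module; the pattern labels and the `scrub()` helper are illustrative, not part of any particular monitoring tool.

```python
# Illustrative log scrubber built from the regex patterns listed above.
import re

SENSITIVE_PATTERNS = {
    "SSN":         re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b"),
    "EMAIL":       re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
}

def scrub(text: str) -> str:
    """Replace sensitive values with a typed placeholder before logging prompts/responses."""
    for label, pattern in SENSITIVE_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text

print(scrub("Contact jane@example.com, card 4111 1111 1111 1111"))
```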
Compliance Requirements
- Data Minimization: Hash sensitive fields rather than dropping them entirely, so records can still be correlated during debugging (see the sketch after this list)
- Audit Logging: Permanent storage of access logs (user ID, timestamp, data accessed)
- Access Control: OAuth2/SAML for enterprise, never admin/admin defaults
- Encryption: All data in transit, VPN-protected monitoring dashboards
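A hedged sketch of the hashing and audit-logging items above; the salt handling, record shape, and storage backend are assumptions you would replace with your own.

```python
# Illustrative field hashing + audit record; secret management and the storage
# backend are placeholders, not a prescribed implementation.
import hashlib, hmac, json, time

HASH_KEY = b"rotate-me"  # placeholder secret; keep it out of source control

def hash_field(value: str) -> str:
    """Keyed hash so identical values stay correlatable without being readable."""
    return hmac.new(HASH_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

def audit_record(user_id: str, data_accessed: str) -> str:
    """Append-only audit entry: who, when, what."""
    return json.dumps({
        "user_id": user_id,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "data_accessed": data_accessed,
    })

print(hash_field("jane@example.com"))
print(audit_record("analyst-42", "prompt_logs/2024-06-01"))
```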
Cost Control Implementation
Real-Time Cost Prevention
```python
# Rate limiting by estimated cost per request
rate_limits = {
    "gpt-4": {"max_requests": 10, "window": "1h"},
    "gpt-3.5-turbo": {"max_requests": 100, "window": "1h"},
}
```
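A sketch of how the `rate_limits` table above could be enforced; the in-memory counters and fixed one-hour windows are simplifying assumptions (production use would back this with Redis or the API gateway's own limiter).

```python
# Simplified enforcement of the per-model limits above: fixed 1-hour windows,
# in-memory counters. A real deployment would use Redis or a gateway rate limiter.
import time
from collections import defaultdict

rate_limits = {
    "gpt-4": {"max_requests": 10, "window_s": 3600},
    "gpt-3.5-turbo": {"max_requests": 100, "window_s": 3600},
}
_counters: dict[tuple[str, str, int], int] = defaultdict(int)

def allow_request(user_id: str, model: str) -> bool:
    """Return False once a user exhausts the model's hourly request budget."""
    limit = rate_limits[model]
    window_start = int(time.time() // limit["window_s"])
    key = (user_id, model, window_start)
    if _counters[key] >= limit["max_requests"]:
        return False
    _counters[key] += 1
    return True
```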
Model Selection Rules
- <50 words: Route to cheap models
- >500 words: Route to expensive models
- Accuracy: 70% correct routing (30% error rate acceptable for 50% cost reduction)
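A naive word-count router implementing the rules above; the 50/500-word cutoffs come from the list, while the model names and the default tier for the unspecified middle range are assumptions.

```python
# Naive length-based router per the rules above. Requests between 50 and 500 words
# fall through to the cheap tier here, which is an assumption, not a stated rule.
def pick_model(prompt: str) -> str:
    words = len(prompt.split())
    if words < 50:
        return "gpt-3.5-turbo"   # cheap tier
    if words > 500:
        return "gpt-4"           # expensive tier
    return "gpt-3.5-turbo"       # assumed default for the middle band

print(pick_model("Summarize this sentence."))
```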
Dashboard Configuration
Executive Dashboard
- Primary Metric: Daily cost
- Secondary Metric: System uptime status
- Design Rule: Single number focus, no technical details
Operations Dashboard (3AM Debugging)
- API Error Rates: Visual threshold at 1%
- Response Time Percentiles: Never use averages (misleading)
- Token Usage: Per-minute monitoring
- Alert Status: Large, red, unmistakable failure indicators
Engineering Dashboard
- Request Traces: Individual transaction timing breakdowns
- Token Usage: Per-request granularity
- Quality Scores: Model response evaluation metrics
- Edge Cases: Log unusual user interaction patterns
Troubleshooting Checklist (3AM Debugging)
Primary System Checks
- Docker Status: `docker ps` - daemon frequently fails silently
- Port Accessibility: `telnet localhost 4317` and `telnet localhost 4318`
- Data Flow: Check OTel Collector logs for "received spans" messages
- Prometheus Targets: Visit `http://localhost:9090/targets` for scraping status
- Disk Space: `df -h` - Prometheus storage consumption
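A small scripted version of the port and disk checks above, for when you want one command instead of five; ports and the 10GB floor mirror the defaults used throughout this guide (the Docker and collector-log checks still need `docker ps` / `docker logs` by hand).

```python
# Scripted version of the port-accessibility and disk-space checks above.
import shutil
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for port in (4317, 4318, 9090):
    print(f"port {port}: {'open' if port_open('localhost', port) else 'CLOSED'}")

free_gb = shutil.disk_usage("/").free / 1e9
print(f"disk free: {free_gb:.1f} GB" + (" (LOW)" if free_gb < 10 else ""))
```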
Error Message Decoder
- "connection refused": Service down or firewall blocking
- "no route to host": DNS failure, switch to IP addresses
- "context deadline exceeded": Resource exhaustion, check limits
- "batch processor failed": Trace size too large, reduce batch size
- "memory limit exceeded": OTel Collector OOM, increase RAM allocation
Resource Requirements
Minimum Production Setup
- Budget: $500/month minimum for functional monitoring
- Personnel: 2-3 days initial setup time
- Maintenance: 15-30 hours monthly for hybrid approach
- Hidden Costs: AWS storage doubling, weekend debugging time
Tool-Specific Costs
- Datadog Custom Metrics: $5/month per metric (50+ LLM metrics typical)
- Grafana Cloud: $200-500/month metric ingestion
- Third-party LLM Monitoring: $500-2000/month additional
Quality Monitoring
Automated Quality Checks
- LLM-on-LLM Evaluation: GPT-3.5-turbo evaluating GPT-4 responses ($0.01/evaluation)
- Response Length Monitoring: Track average response length changes
- Refusal Rate Tracking: Monitor responses containing "I cannot" or "I'm not able to"
- Repetition Detection: Alert when boilerplate appears in >20% of responses
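A cheap, non-LLM sketch of the refusal-rate and repetition checks above; the refusal phrases and the 20% boilerplate threshold come from the bullets, everything else is illustrative.

```python
# Illustrative refusal-rate and boilerplate checks over a batch of responses.
from collections import Counter

REFUSAL_MARKERS = ("i cannot", "i'm not able to")

def refusal_rate(responses: list[str]) -> float:
    """Fraction of responses containing a refusal phrase."""
    hits = sum(any(m in r.lower() for m in REFUSAL_MARKERS) for r in responses)
    return hits / max(len(responses), 1)

def boilerplate_alert(responses: list[str], threshold: float = 0.20) -> bool:
    """Alert when any single opening line appears in more than 20% of responses."""
    openings = Counter(r.strip().split("\n")[0] for r in responses if r.strip())
    most_common = openings.most_common(1)[0][1] if openings else 0
    return most_common / max(len(responses), 1) > threshold
```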
Local Model Monitoring
- vLLM: Prometheus metrics available with `--metric-port 8000`
- Ollama: No built-in monitoring, system metrics only
- GPU Memory: Critical monitoring for OOM prevention
- Queue Depth: Alert at >10 requests (response time degradation)
- Model Loading: 30+ second load times appear broken to users
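A hedged sketch of the GPU-memory alert for local models, shelling out to `nvidia-smi` (present on NVIDIA hosts); the 90% threshold matches the alert list earlier, and the query flags are standard `nvidia-smi` options.

```python
# GPU memory watchdog for local-model hosts (vLLM/Ollama on NVIDIA GPUs).
# Reads the first GPU only; extend the parsing for multi-GPU boxes.
import subprocess

def gpu_memory_fraction() -> float:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    used, total = (float(x) for x in out.strip().splitlines()[0].split(","))
    return used / total

if __name__ == "__main__":
    frac = gpu_memory_fraction()
    print(f"GPU memory at {frac:.0%}" + (" - ALERT" if frac > 0.90 else ""))
```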
Useful Links for Further Investigation
Resources That Don't Completely Suck
| Link | Description |
|---|---|
| OpenTelemetry Collector Configuration Reference | The only OTel docs that make sense; bookmark it for Collector configuration. |
| OpenTelemetry LLM Observability Blog | An actually useful introduction to LLM observability with OpenTelemetry, unlike most OTel content. |
| Semantic Conventions for GenAI Systems | The semantic conventions for generative AI systems: boring but necessary for consistent metric and span naming. |
| Complete Guide to LLM Observability with Grafana Cloud | Walkthrough of LLM observability with OpenTelemetry and Grafana Cloud that works as advertised. |
| Grafana Dashboard Templates | Pre-built dashboards you can import and customize instead of building from scratch. |
| Prometheus Config Examples | Practical Prometheus configuration examples that save hours of trial and error. |
| OpenLIT GitHub | The OpenLIT library; auto-instruments LLM calls and gets observability running in minutes. |
| Langfuse Docs | Documentation for Langfuse, an open-source, self-hostable LLM observability and analytics platform; thorough, unlike many alternatives. |
| Arize Phoenix | Local-first LLM observability tool, good for development and experimentation; struggles with very large datasets. |
| LangSmith Docs | LLM monitoring and evaluation platform, most useful for teams already in the LangChain ecosystem. |
| Datadog LLM Observability | Datadog's LLM observability features: powerful at scale, priced like a car payment. |
| Coralogix AI Observability | Getting-started guide for Coralogix AI Observability; capable product, UI stuck in 2015. |
| Building Production-Ready LLM Applications Guide | Medium guide to deploying production LLM applications, including monitoring and observability strategy. |
| vLLM Observability with OpenTelemetry | IBM Data & AI guide to production observability for vLLM; useful for locally hosted deployments. |
| Kong Gateway LLM Metrics Visualization | Kong tutorial on visualizing LLM metrics in Grafana for traffic routed through an API gateway. |
| OpenTelemetry Demo Application | Official OpenTelemetry demo; a reference implementation of microservices observability across languages. |
| LLM Monitoring Docker Compose Stack | Docker Compose templates for a complete LLM monitoring stack (Prometheus, Tempo) for quick experimentation. |
| Kubernetes Monitoring Manifests | The kube-prometheus repository: production-grade Prometheus, Grafana, and Alertmanager configurations for Kubernetes. |
| OpenTelemetry Community Slack | Active community Slack for real-time troubleshooting and direct answers from contributors. |
| LLM Ops Community Discord | Discord forum on LLM operations: production deployment, scaling, and management. |
| Grafana Community Forums | Official Grafana forums for support, shared dashboards, and feature discussion. |
| Cloud Native Computing Foundation LLM Guidelines | CNCF resources and emerging standards for cloud-native AI/ML observability. |
| Anthropic Safety and Trust Center | Anthropic's guidelines for responsible AI deployment and monitoring. |
| MLOps Principles for LLM Applications | MLOps.org standards for lifecycle management of ML and LLM applications. |
| OpenTelemetry Fundamentals Course | CNCF training and certification resources for OpenTelemetry fundamentals. |
| Grafana Fundamentals | Grafana tutorials covering monitoring, visualization, and dashboard creation at all skill levels. |
| Prometheus Monitoring Workshop | Official Prometheus tutorials and workshops: metrics collection, PromQL, and alerting. |
Related Tools & Recommendations
Making LangChain, LlamaIndex, and CrewAI Work Together Without Losing Your Mind
A Real Developer's Guide to Multi-Framework Integration Hell
GitOps Integration Hell: Docker + Kubernetes + ArgoCD + Prometheus
How to Wire Together the Modern DevOps Stack Without Losing Your Sanity
Kafka + MongoDB + Kubernetes + Prometheus Integration - When Event Streams Break
When your event-driven services die and you're staring at green dashboards while everything burns, you need real observability - not the vendor promises that go
OpenTelemetry + Jaeger + Grafana on Kubernetes - The Stack That Actually Works
Stop flying blind in production microservices
Prometheus + Grafana + Jaeger: Stop Debugging Microservices Like It's 2015
When your API shits the bed right before the big demo, this stack tells you exactly why
OpenTelemetry Alternatives - For When You're Done Debugging Your Debugging Tools
I spent last Sunday fixing our collector again. It ate 6GB of RAM and crashed during the fucking football game. Here's what actually works instead.
OpenTelemetry - Finally, Observability That Doesn't Lock You Into One Vendor
Because debugging production issues with console.log and prayer isn't sustainable
Set Up Microservices Monitoring That Actually Works
Stop flying blind - get real visibility into what's breaking your distributed services
Datadog Cost Management - Stop Your Monitoring Bill From Destroying Your Budget
alternative to Datadog
Datadog vs New Relic vs Sentry: Real Pricing Breakdown (From Someone Who's Actually Paid These Bills)
Observability pricing is a shitshow. Here's what it actually costs.
Datadog Enterprise Pricing - What It Actually Costs When Your Shit Breaks at 3AM
The Real Numbers Behind Datadog's "Starting at $23/host" Bullshit
MLflow - Stop Losing Track of Your Fucking Model Runs
MLflow: Open-source platform for machine learning lifecycle management
Pinecone Production Reality: What I Learned After $3200 in Surprise Bills
Six months of debugging RAG systems in production so you don't have to make the same expensive mistakes I did
Claude + LangChain + Pinecone RAG: What Actually Works in Production
The only RAG stack I haven't had to tear down and rebuild after 6 months
New Relic - Application Monitoring That Actually Works (If You Can Afford It)
New Relic tells you when your apps are broken, slow, or about to die. Not cheap, but beats getting woken up at 3am with no clue what's wrong.
LlamaIndex - Document Q&A That Doesn't Suck
Build search over your docs without the usual embedding hell
I Migrated Our RAG System from LangChain to LlamaIndex
Here's What Actually Worked (And What Completely Broke)
RAG on Kubernetes: Why You Probably Don't Need It (But If You Do, Here's How)
Running RAG Systems on K8s Will Make You Hate Your Life, But Sometimes You Don't Have a Choice
ELK Stack for Microservices - Stop Losing Log Data
How to Actually Monitor Distributed Systems Without Going Insane
Stop MLflow from Murdering Your Database Every Time Someone Logs an Experiment
Deploy MLflow tracking that survives more than one data scientist