Microservices Observability: Production-Ready Implementation Guide
Executive Summary
A comprehensive guide to implementing Prometheus, Grafana, Jaeger, and OpenTelemetry for microservices monitoring, covering Docker and Kubernetes deployments with real-world failure scenarios and cost implications.
Critical Prerequisites
Infrastructure Requirements
- Docker Desktop: 8GB+ RAM allocated (4GB minimum causes performance issues)
- Kubernetes: 16GB total RAM across cluster (don't cheap out - seriously)
- Docker Compose: v2 (not the legacy Python-based v1)
- Helm: Required for Kubernetes deployments
- kubectl: Must be properly configured (test with `kubectl get nodes`)
Knowledge Prerequisites
- Basic Docker understanding (`docker ps` command familiarity)
- YAML syntax competency (tabs vs spaces will destroy you)
- Patience for random failures and debugging sessions
Technology Stack Components
OpenTelemetry Collector
Function: Routes telemetry data without configuring exporters per service
Reality Check:
- Works until custom processing needed
- Common error: `failed to build pipelines: service "pipelines" has no listeners`
- YAML documentation requires 4+ hours to understand
- Contrib distributions have processors that actually work
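The "no listeners" class of error usually means a pipeline references receivers or exporters that were never defined. For orientation, a minimal `otel-collector-config.yaml` sketch that lines up with the Docker Compose file later in this guide (the `jaeger-all-in-one` service name comes from that file; adjust to your environment):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:  # buffers and batches telemetry before export

exporters:
  otlp/jaeger:
    endpoint: jaeger-all-in-one:4317  # Jaeger's own OTLP receiver
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889  # scraped by Prometheus

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```

Every pipeline needs at least one receiver and one exporter that actually exist in the top-level sections; a typo in either produces exactly the build error above.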
Prometheus
Function: Metrics storage and querying
Performance Issues:
- Memory usage improvements in latest versions (but not dramatic)
- High-cardinality labels will murder performance
- Default settings fail in production
Jaeger
Function: Distributed tracing visualization
Implementation Reality:
- Surfaces problems like user auth taking 2 seconds because of database calls in loops
- UI breaks at 1000+ spans per trace, making debugging impossible
- Newer versions work better with OpenTelemetry
- Still requires 1 hour of configuration tweaking
- Sampling strategies keep you from drowning in trace data
Grafana
Function: Dashboard creation and visualization
User Experience:
- Looks good in demos, confusing in production
- Query builder induces frustration
- AI features in v11 are basically useless
- Community dashboards save significant time
Resource Requirements and Costs
Time Investment
Task | Expected Time | Reality Check |
---|---|---|
Docker setup | 2-3 hours | 8 hours if Docker misbehaves |
Kubernetes production | 2-3 days minimum | 1 week with CrashLoopBackOff issues |
Useful dashboards | 1 week tweaking | Includes cursing at Grafana query builder |
Infrastructure Costs
Environment | Monthly Cost | Notes |
---|---|---|
Development | $50-100 | Depends on retention settings |
Production | $500+ | Bills exploded to $1000+ with one-year retention |
Kubernetes | Variable | AWS bills increased $400+ first month |
Hidden Cost: DevOps engineer who understands the stack (when Prometheus OOMs at 2am)
Docker Implementation
Working Docker Compose Configuration
```yaml
version: '3.8'

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC receiver
      - "4318:4318"   # OTLP HTTP receiver
      - "8889:8889"   # Prometheus metrics endpoint
    depends_on:
      - jaeger-all-in-one
    networks:
      - observability

  prometheus:
    image: prom/prometheus:latest
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    networks:
      - observability

  jaeger-all-in-one:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # Jaeger UI
      - "14250:14250"  # gRPC endpoint
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    networks:
      - observability

  grafana:
    image: grafana/grafana:10.2.2
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    networks:
      - observability

volumes:
  prometheus-data:
  grafana-data:

networks:
  observability:
    driver: bridge
```
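The compose file mounts `./prometheus.yml` but the guide doesn't show it; a minimal sketch that scrapes the collector's metrics endpoint and Prometheus itself:

```yaml
# prometheus.yml -- minimal scrape config for the compose stack above
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']  # collector's Prometheus exporter
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']       # self-monitoring
```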
Critical Setup Steps
- Create directories: `mkdir -p grafana/provisioning/{dashboards,datasources}`
- Start services: `docker-compose up -d`
- Verify deployment: `docker-compose ps`
- Access endpoints:
  - Grafana: localhost:3000 (admin/admin, 30-second boot time)
  - Prometheus: localhost:9090 (immediate)
  - Jaeger: localhost:16686 (30-second boot time)
Common Deployment Failures
- Port 3000 taken: Kill zombie React dev server with `lsof -ti:3000 | xargs kill -9`
- Grafana won't start: Missing grafana/provisioning directories
- OTel Collector crashes: Wrong config file path or YAML syntax errors
- "Exited (1)" status: YAML tabs vs spaces nightmare
Kubernetes Implementation
Resource Planning (Production)
Component | RAM Requirement | CPU Requirement | Storage |
---|---|---|---|
Prometheus | 4-8GB | 500m-1000m | 50Gi (fast SSD) |
Grafana | 512Mi-1Gi | 100m-500m | 10Gi |
Jaeger + Elasticsearch | 8GB+ total | Variable | Variable |
OTel Collector | 1Gi | 250m-500m | None |
Helm Chart Deployment Sequence
```bash
# Add repositories
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo update

# Deploy OpenTelemetry Collector
helm install otel-collector open-telemetry/opentelemetry-collector \
  -f otel-collector-values.yaml \
  -n observability --create-namespace

# Deploy Jaeger
helm install jaeger jaegertracing/jaeger \
  --set provisionDataStore.cassandra=false \
  --set provisionDataStore.elasticsearch=true \
  --set storage.type=elasticsearch \
  -n observability

# Deploy Prometheus Stack
helm install prometheus prometheus-community/kube-prometheus-stack \
  -f prometheus-values.yaml \
  -n observability
```
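The collector install above references `otel-collector-values.yaml`, which isn't shown. A minimal sketch follows; the chart's values schema changes between versions, so verify against `helm show values open-telemetry/opentelemetry-collector`, and note the `jaeger-collector.observability.svc` service name is an assumption:

```yaml
# otel-collector-values.yaml -- minimal sketch, not a full production config
mode: deployment
image:
  repository: otel/opentelemetry-collector-contrib  # contrib build for the extra processors
config:
  exporters:
    otlp/jaeger:
      endpoint: jaeger-collector.observability.svc:4317
      tls:
        insecure: true
  service:
    pipelines:
      traces:
        exporters: [otlp/jaeger]
```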
Application Instrumentation
Java/Spring Boot Configuration
Dependencies: OpenTelemetry API 1.30.0, SDK, OTLP exporter, Spring Boot starter
Critical Configuration:
```yaml
management:
  otlp:
    metrics:
      export:
        url: http://otel-collector:4318/v1/metrics
    tracing:
      export:
        url: http://otel-collector:4318/v1/traces

otel:
  exporter:
    otlp:
      endpoint: http://otel-collector:4318
  service:
    name: ${spring.application.name}
```
Node.js Implementation Issues
Known Problems:
- Breaks hot reload (nodemon conflicts with HTTP instrumentation)
- OpenTelemetry hooks interfere with file watching
- Performance impact on development workflow
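One workaround is to load the SDK only outside development, leaving nodemon's file watching alone. A sketch, assuming `@opentelemetry/sdk-node` and `@opentelemetry/auto-instrumentations-node` are installed:

```javascript
// tracing.js -- require this before your app (node -r ./tracing.js app.js)
if (process.env.NODE_ENV === 'production') {
  const { NodeSDK } = require('@opentelemetry/sdk-node');
  const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

  // Auto-instrumentation only in production; dev keeps hot reload intact
  const sdk = new NodeSDK({
    instrumentations: [getNodeAutoInstrumentations()],
  });
  sdk.start();
}
```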
Python Flask Limitations
Compatibility Issues:
- Auto-instrumentation fails with async views (`async def` route handlers)
- Manual span management required for `asyncio` operations
- FlaskInstrumentor doesn't handle async properly
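Manual span management for an async view looks roughly like this; a sketch assuming `flask[async]` and the OpenTelemetry SDK are installed and a tracer provider is already configured:

```python
# Manual spans for an async Flask view, since FlaskInstrumentor's
# auto-instrumentation is unreliable with async def handlers
import asyncio

from flask import Flask
from opentelemetry import trace

app = Flask(__name__)
tracer = trace.get_tracer(__name__)

@app.route("/report")
async def report():
    # start_as_current_span works as a context manager in async code
    with tracer.start_as_current_span("build-report") as span:
        span.set_attribute("report.kind", "daily")  # illustrative attribute
        await asyncio.sleep(0.1)  # stand-in for real async I/O
        return {"status": "ok"}
```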
Production Configuration Requirements
Environment Variables:
```yaml
OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector:4318"
OTEL_SERVICE_NAME: "service-name"
OTEL_SERVICE_VERSION: "1.0.0"
OTEL_TRACES_SAMPLER: "parentbased_traceidratio"
OTEL_TRACES_SAMPLER_ARG: "0.01"  # 1% sampling for production
```
Critical Failure Scenarios
Prometheus Context Deadline Exceeded
Root Cause: Expensive PromQL queries or high-cardinality metrics
Impact: Dashboard timeouts, monitoring blindness
Solutions:
- Increase `--query.timeout` from 2 to 5 minutes
- Create recording rules for expensive calculations (sketch below)
- Avoid `rate()` on 10k+ series metrics
- Emergency restart sometimes helps temporarily
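A recording rule precomputes the expensive expression on a schedule so dashboards read a cheap, pre-aggregated series instead. A sketch using a hypothetical `http_requests_total` counter:

```yaml
# recording-rules.yml -- reference it from prometheus.yml under rule_files
groups:
  - name: request-rates
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```

Dashboards then query `job:http_requests:rate5m` instead of re-running the raw `rate()` on every refresh.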
Memory Exhaustion (Prometheus)
Symptoms: 8GB+ RAM consumption, OOM kills
Root Cause: High cardinality labels (user IDs, timestamps, session tokens in labels)
Production Impact: Cluster crashes, monitoring downtime
Emergency Fix: set a container memory limit and cap TSDB size with `--storage.tsdb.retention.size=4GB` (no Prometheus flag directly caps memory)
Permanent Solution: Hunt and eliminate high-cardinality metrics (query sketch below)
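The TSDB status page in the Prometheus UI (Status → TSDB Status) lists the worst offenders; a PromQL sketch that surfaces the same information:

```promql
# Top 10 metric names by series count -- this query is itself expensive,
# so run it once during the hunt, not on a dashboard
topk(10, count by (__name__) ({__name__=~".+"}))
```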
Trace Collection Failures
Common Causes:
- Wrong endpoints (4317 for gRPC, 4318 for HTTP)
- Sampling too aggressive (use `OTEL_TRACES_SAMPLER=always_on` for testing)
- Collector can't reach Jaeger (service name/port issues)
- Apps not sending traces (instrumentation disabled)
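A quick way to confirm the collector's OTLP/HTTP receiver is reachable at all: POST an empty payload and check for a 200 (an empty body is accepted as a no-op export):

```bash
# Expect "200"; connection refused means wrong host/port or a dead collector
curl -s -o /dev/null -w "%{http_code}\n" \
  -X POST http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{}'
```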
Docker Networking Issues
Symptoms: "Connection refused" errors between services
Root Cause: Using `localhost` instead of service names
Fix: Use `service-name:port`, not `localhost:port`
Debug: `docker exec -it prometheus curl http://app:8080/metrics`
Performance Impact Analysis
Auto-Instrumentation Overhead
Platform | Startup Impact | Request Latency | Development Impact |
---|---|---|---|
Java Spring Boot | +200ms startup | +50-100ms per request | Acceptable |
Node.js | Minimal | +50-100ms per request | Breaks hot reload |
Python Flask | Minimal | +50-100ms per request | Async view issues |
Data Volume Reality
Production Experience:
- 50TB+ trace data generated first month without sampling
- Bills jumped to $3K+ before implementing 1% sampling
- Over-instrumented request paths generated 50,000 spans per request
Security Configuration
Production Security Checklist
- RBAC: Enable in Kubernetes (default in managed services)
- NetworkPolicies: Prevent lateral movement between services
- TLS: Required everywhere (painful but necessary)
- Authentication: Change default Grafana admin/admin credentials
- Resource Limits: Prevent resource exhaustion attacks
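As a starting point for the NetworkPolicy item, a sketch that only lets Grafana reach Prometheus (the `app: prometheus` and `app: grafana` labels are assumptions; match whatever your deployments actually use):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: prometheus-ingress
  namespace: observability
spec:
  podSelector:
    matchLabels:
      app: prometheus        # assumed label on the Prometheus pods
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: grafana   # assumed label on the Grafana pods
      ports:
        - protocol: TCP
          port: 9090
```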
Cost Optimization Strategies
Sampling Configuration
```yaml
# Production sampling rates
OTEL_TRACES_SAMPLER: "parentbased_traceidratio"
OTEL_TRACES_SAMPLER_ARG: "0.01"  # 1% for busy applications
# Development: 100% sampling acceptable
# Production: 1% prevents bankruptcy
```
Retention Settings
```yaml
prometheus:
  prometheusSpec:
    retention: 30d      # Not 1 year unless you enjoy massive bills
    resources:
      limits:
        memory: 4Gi     # Set limits or face OOM kills
```
Deployment Comparison Matrix
Aspect | Docker Compose | Kubernetes | Managed Services |
---|---|---|---|
Complexity | Easy (if Docker works) | YAML hell | Easy until customization |
Scalability | Single machine limit | Infinite (expensive) | Auto-scales (3x cost) |
Maintenance | 2am security updates | Automated (if configured) | Provider responsibility |
Production Ready | Development only | Yes (with expertise) | Yes (with SLA) |
High Availability | Not possible | Requires proper configuration | Built-in redundancy |
Cost | Infrastructure only | Infrastructure + DevOps engineer | 3x more but externalized problems |
Troubleshooting Quick Reference
Emergency Commands
```bash
# Check container status
docker-compose ps

# View service logs
docker-compose logs [service-name]

# Test metrics endpoint
docker exec -it prometheus curl http://app:8080/metrics

# Kubernetes pod status
kubectl get pods -n observability

# Resource usage monitoring
docker stats
```
Performance Debugging
- Prometheus slow queries: Test in Prometheus UI before Grafana
- Missing traces: Check collector logs for export errors
- High memory usage: Identify high-cardinality metrics
- Network issues: Verify service names and ports in Docker/K8s
Essential Documentation References
- Prometheus Documentation - Actually comprehensive
- OpenTelemetry Troubleshooting
- Grafana Performance Limitations
- Kubernetes Monitoring Architecture
- CNCF Slack #observability - Active community support
- Prometheus Community Forum - Searchable troubleshooting history
Production Deployment Warnings
Critical Failures to Avoid:
- Default retention settings will bankrupt you
- High cardinality labels crash clusters
- Missing resource limits cause cascading failures
- Auto-instrumentation without sampling generates TB of data
- ServiceMonitors require Prometheus Operator (install kube-prometheus-stack; see the sketch after this list)
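For reference, a minimal ServiceMonitor sketch; kube-prometheus-stack only picks it up if the labels match its selector (by default, the Helm release label), and the `app: my-app` selector and `http-metrics` port name are assumptions:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: observability
  labels:
    release: prometheus   # must match the kube-prometheus-stack release name
spec:
  selector:
    matchLabels:
      app: my-app         # assumed label on your app's Service
  endpoints:
    - port: http-metrics  # named port on the Service, not the container
      interval: 30s
```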
Success Criteria:
- Metrics collection without OOM kills
- Trace visibility under 1% sampling
- Dashboard response under 5 seconds
- Monthly costs under budget expectations
- 2am alerts that actually indicate real problems
Useful Links for Further Investigation
Resources That Don't Suck
Link | Description |
---|---|
Prometheus Documentation | The official docs are actually good, unlike most open source projects. Covers PromQL, recording rules, and why your queries are slow. Spent more time here than I care to admit. |
Jaeger Documentation | Decent documentation, though the deployment examples are basic. Good for understanding sampling and storage options. The troubleshooting section saved me when traces vanished into the void. |
Grafana Documentation | Grafana's docs are decent but their query builder UI is still confusing as hell. The dashboard examples help when you're staring at a blank panel for the 50th time. |
OpenTelemetry Documentation | Comprehensive but overwhelming. Start with the getting started guides, skip the philosophy sections (unless you enjoy existential crises about observability standards). |
Microsoft OpenTelemetry Example | One of the few complete examples that doesn't skip the hard parts. Shows .NET integration with working configs. |
Kubernetes Observability Tutorial | Hands-on tutorial with actual YAML files you can copy. Better than most "enterprise" documentation. |
CNCF OpenTelemetry Demo | Kitchen-sink demo app with every integration. Good for seeing real examples, bad for understanding simple setups. Took me 3 hours to figure out which parts I actually needed. Skip this unless you like complexity for its own sake. |
Prometheus Operator | Makes Prometheus work in Kubernetes without manual YAML hell. Required if you want ServiceMonitors to work. |
Grafana Helm Charts | Official Helm charts that actually work out of the box. Use these instead of writing your own manifests. |
Jaeger Kubernetes Templates | Production templates with Elasticsearch setup. Warning: Elasticsearch will eat your RAM budget. AWS bills went up like 400 bucks or more the first month we deployed this thing. |
Prometheus Performance Tuning | Essential reading for when Prometheus starts eating 32GB RAM. Covers cardinality and why your metrics design sucks. |
Grafana Cloud Observability Guide | Modern patterns with Grafana Cloud. Expensive but someone else's problem when it breaks. Half the examples won't work in your environment. |
CNCF Slack #observability | Active community that actually helps. Maintainers hang out here and answer real questions. |
Prometheus Community Forum | Official forum that's slower than Slack but has better searchable history for troubleshooting. |