Microservices Observability: Production-Ready Implementation Guide
Executive Summary
A comprehensive guide to implementing Prometheus, Grafana, Jaeger, and OpenTelemetry for microservices monitoring, covering Docker and Kubernetes deployments with real-world failure scenarios and cost implications.
Critical Prerequisites
Infrastructure Requirements
- Docker Desktop: 8GB+ RAM allocated (4GB minimum causes performance issues)
- Kubernetes: 16GB total RAM across cluster (don't cheap out - seriously)
- Docker Compose: v2 (not the legacy Python-based v1)
- Helm: Required for Kubernetes deployments
- kubectl: Must be properly configured (test with `kubectl get nodes`)
Knowledge Prerequisites
- Basic Docker understanding (`docker ps` command familiarity)
- YAML syntax competency (tabs vs spaces will destroy you)
- Patience for random failures and debugging sessions
Technology Stack Components
OpenTelemetry Collector
Function: Routes telemetry data without configuring exporters per service
Reality Check:
- Works until custom processing needed
- Common error: `failed to build pipelines: service "pipelines" has no listeners`
- YAML documentation requires 4+ hours to understand
- Contrib distributions have processors that actually work
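The "no listeners" class of error usually means a pipeline references receivers or exporters that were never defined. For orientation, a minimal `otel-collector-config.yaml` sketch that lines up with the Docker Compose file later in this guide (the `jaeger-all-in-one` service name comes from that file; adjust to your environment):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:  # buffers and batches telemetry before export

exporters:
  otlp/jaeger:
    endpoint: jaeger-all-in-one:4317  # Jaeger's own OTLP receiver
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889  # scraped by Prometheus

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```

Every pipeline needs at least one receiver and one exporter that actually exist in the top-level sections; a typo in either produces exactly the build error above.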
Prometheus
Function: Metrics storage and querying
Performance Issues:
- Memory usage improvements in latest versions (but not dramatic)
- High-cardinality labels will murder performance
- Default settings fail in production
Jaeger
Function: Distributed tracing visualization
Implementation Reality:
- Surfaces problems like user auth taking 2 seconds because of database calls in loops
- UI breaks at 1000+ spans per trace, making debugging impossible
- Newer versions work better with OpenTelemetry
- Still requires 1 hour of configuration tweaking
- Sampling strategies keep you from drowning in trace data
Grafana
Function: Dashboard creation and visualization
User Experience:
- Looks good in demos, confusing in production
- Query builder induces frustration
- AI features in v11 are basically useless
- Community dashboards save significant time
Resource Requirements and Costs
Time Investment
Task | Expected Time | Reality Check |
---|---|---|
Docker setup | 2-3 hours | 8 hours if Docker misbehaves |
Kubernetes production | 2-3 days minimum | 1 week with CrashLoopBackOff issues |
Useful dashboards | 1 week tweaking | Includes cursing at Grafana query builder |
Infrastructure Costs
Environment | Monthly Cost | Notes |
---|---|---|
Development | $50-100 | Depends on retention settings |
Production | $500+ | Bills exploded to $1000+ with one-year retention |
Kubernetes | Variable | AWS bills increased $400+ first month |
Hidden Cost: DevOps engineer who understands the stack (when Prometheus OOMs at 2am)
Docker Implementation
Working Docker Compose Configuration
```yaml
version: '3.8'

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC receiver
      - "4318:4318"   # OTLP HTTP receiver
      - "8889:8889"   # Prometheus metrics endpoint
    depends_on:
      - jaeger-all-in-one
    networks:
      - observability

  prometheus:
    image: prom/prometheus:latest
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    networks:
      - observability

  jaeger-all-in-one:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # Jaeger UI
      - "14250:14250"  # gRPC endpoint
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    networks:
      - observability

  grafana:
    image: grafana/grafana:10.2.2
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    networks:
      - observability

volumes:
  prometheus-data:
  grafana-data:

networks:
  observability:
    driver: bridge
```
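The compose file mounts `./prometheus.yml` but the guide doesn't show it; a minimal sketch that scrapes the collector's metrics endpoint and Prometheus itself:

```yaml
# prometheus.yml -- minimal scrape config for the compose stack above
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']  # collector's Prometheus exporter
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']       # self-monitoring
```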
Critical Setup Steps
- Create directories: `mkdir -p grafana/provisioning/{dashboards,datasources}`
- Start services: `docker-compose up -d`
- Verify deployment: `docker-compose ps`
- Access endpoints:
  - Grafana: localhost:3000 (admin/admin, 30-second boot time)
  - Prometheus: localhost:9090 (immediate)
  - Jaeger: localhost:16686 (30-second boot time)
Common Deployment Failures
- Port 3000 taken: Kill zombie React dev server with `lsof -ti:3000 | xargs kill -9`
- Grafana won't start: Missing grafana/provisioning directories
- OTel Collector crashes: Wrong config file path or YAML syntax errors
- "Exited (1)" status: YAML tabs vs spaces nightmare
Kubernetes Implementation
Resource Planning (Production)
Component | RAM Requirement | CPU Requirement | Storage |
---|---|---|---|
Prometheus | 4-8GB | 500m-1000m | 50Gi (fast SSD) |
Grafana | 512Mi-1Gi | 100m-500m | 10Gi |
Jaeger + Elasticsearch | 8GB+ total | Variable | Variable |
OTel Collector | 1Gi | 250m-500m | None |
Helm Chart Deployment Sequence
```bash
# Add repositories
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo update

# Deploy OpenTelemetry Collector
helm install otel-collector open-telemetry/opentelemetry-collector \
  -f otel-collector-values.yaml \
  -n observability --create-namespace

# Deploy Jaeger
helm install jaeger jaegertracing/jaeger \
  --set provisionDataStore.cassandra=false \
  --set provisionDataStore.elasticsearch=true \
  --set storage.type=elasticsearch \
  -n observability

# Deploy Prometheus Stack
helm install prometheus prometheus-community/kube-prometheus-stack \
  -f prometheus-values.yaml \
  -n observability
```
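The collector install above references `otel-collector-values.yaml`, which isn't shown. A minimal sketch follows; the chart's values schema changes between versions, so verify against `helm show values open-telemetry/opentelemetry-collector`, and note the `jaeger-collector.observability.svc` service name is an assumption:

```yaml
# otel-collector-values.yaml -- minimal sketch, not a full production config
mode: deployment
image:
  repository: otel/opentelemetry-collector-contrib  # contrib build for the extra processors
config:
  exporters:
    otlp/jaeger:
      endpoint: jaeger-collector.observability.svc:4317
      tls:
        insecure: true
  service:
    pipelines:
      traces:
        exporters: [otlp/jaeger]
```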
Application Instrumentation
Java/Spring Boot Configuration
Dependencies: OpenTelemetry API 1.30.0, SDK, OTLP exporter, Spring Boot starter
Critical Configuration:
```yaml
management:
  otlp:
    metrics:
      export:
        url: http://otel-collector:4318/v1/metrics
    tracing:
      export:
        url: http://otel-collector:4318/v1/traces

otel:
  exporter:
    otlp:
      endpoint: http://otel-collector:4318
  service:
    name: ${spring.application.name}
```
Node.js Implementation Issues
Known Problems:
- Breaks hot reload (nodemon conflicts with HTTP instrumentation)
- OpenTelemetry hooks interfere with file watching
- Performance impact on development workflow
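One workaround is to load the SDK only outside development, leaving nodemon's file watching alone. A sketch, assuming `@opentelemetry/sdk-node` and `@opentelemetry/auto-instrumentations-node` are installed:

```javascript
// tracing.js -- require this before your app (node -r ./tracing.js app.js)
if (process.env.NODE_ENV === 'production') {
  const { NodeSDK } = require('@opentelemetry/sdk-node');
  const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

  // Auto-instrumentation only in production; dev keeps hot reload intact
  const sdk = new NodeSDK({
    instrumentations: [getNodeAutoInstrumentations()],
  });
  sdk.start();
}
```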
Python Flask Limitations
Compatibility Issues:
- Auto-instrumentation fails with async views (`async def` route handlers)
- Manual span management required for `asyncio` operations
- FlaskInstrumentor doesn't handle async properly
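Manual span management for an async view looks roughly like this; a sketch assuming `flask[async]` and the OpenTelemetry SDK are installed and a tracer provider is already configured:

```python
# Manual spans for an async Flask view, since FlaskInstrumentor's
# auto-instrumentation is unreliable with async def handlers
import asyncio

from flask import Flask
from opentelemetry import trace

app = Flask(__name__)
tracer = trace.get_tracer(__name__)

@app.route("/report")
async def report():
    # start_as_current_span works as a context manager in async code
    with tracer.start_as_current_span("build-report") as span:
        span.set_attribute("report.kind", "daily")  # illustrative attribute
        await asyncio.sleep(0.1)  # stand-in for real async I/O
        return {"status": "ok"}
```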
Production Configuration Requirements
Environment Variables:
```yaml
OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector:4318"
OTEL_SERVICE_NAME: "service-name"
OTEL_SERVICE_VERSION: "1.0.0"
OTEL_TRACES_SAMPLER: "parentbased_traceidratio"
OTEL_TRACES_SAMPLER_ARG: "0.01"  # 1% sampling for production
```
Critical Failure Scenarios
Prometheus Context Deadline Exceeded
Root Cause: Expensive PromQL queries or high-cardinality metrics
Impact: Dashboard timeouts, monitoring blindness
Solutions:
- Increase `--query.timeout` from 2 to 5 minutes
- Create recording rules for expensive calculations (sketch below)
- Avoid `rate()` on 10k+ series metrics
- Emergency restart sometimes helps temporarily
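A recording rule precomputes the expensive expression on a schedule so dashboards read a cheap, pre-aggregated series instead. A sketch using a hypothetical `http_requests_total` counter:

```yaml
# recording-rules.yml -- reference it from prometheus.yml under rule_files
groups:
  - name: request-rates
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```

Dashboards then query `job:http_requests:rate5m` instead of re-running the raw `rate()` on every refresh.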
Memory Exhaustion (Prometheus)
Symptoms: 8GB+ RAM consumption, OOM kills
Root Cause: High cardinality labels (user IDs, timestamps, session tokens in labels)
Production Impact: Cluster crashes, monitoring downtime
Emergency Fix: set a container memory limit and cap TSDB size with `--storage.tsdb.retention.size=4GB` (no Prometheus flag directly caps memory)
Permanent Solution: Hunt and eliminate high-cardinality metrics (query sketch below)
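The TSDB status page in the Prometheus UI (Status → TSDB Status) lists the worst offenders; a PromQL sketch that surfaces the same information:

```promql
# Top 10 metric names by series count -- this query is itself expensive,
# so run it once during the hunt, not on a dashboard
topk(10, count by (__name__) ({__name__=~".+"}))
```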
Trace Collection Failures
Common Causes:
- Wrong endpoints (4317 for gRPC, 4318 for HTTP)
- Sampling too aggressive (use `OTEL_TRACES_SAMPLER=always_on` for testing)
- Collector can't reach Jaeger (service name/port issues)
- Apps not sending traces (instrumentation disabled)
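A quick way to confirm the collector's OTLP/HTTP receiver is reachable at all: POST an empty payload and check for a 200 (an empty body is accepted as a no-op export):

```bash
# Expect "200"; connection refused means wrong host/port or a dead collector
curl -s -o /dev/null -w "%{http_code}\n" \
  -X POST http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{}'
```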
Docker Networking Issues
Symptoms: "Connection refused" errors between services
Root Cause: Using `localhost` instead of service names
Fix: Use `service-name:port`, not `localhost:port`
Debug: `docker exec -it prometheus curl http://app:8080/metrics`
Performance Impact Analysis
Auto-Instrumentation Overhead
Platform | Startup Impact | Request Latency | Development Impact |
---|---|---|---|
Java Spring Boot | +200ms startup | +50-100ms per request | Acceptable |
Node.js | Minimal | +50-100ms per request | Breaks hot reload |
Python Flask | Minimal | +50-100ms per request | Async view issues |
Data Volume Reality
Production Experience:
- 50TB+ trace data generated first month without sampling
- Bills jumped to $3K+ before implementing 1% sampling
- Over-instrumented request paths generated 50,000 spans per request
Security Configuration
Production Security Checklist
- RBAC: Enable in Kubernetes (default in managed services)
- NetworkPolicies: Prevent lateral movement between services
- TLS: Required everywhere (painful but necessary)
- Authentication: Change default Grafana admin/admin credentials
- Resource Limits: Prevent resource exhaustion attacks
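As a starting point for the NetworkPolicy item, a sketch that only lets Grafana reach Prometheus (the `app: prometheus` and `app: grafana` labels are assumptions; match whatever your deployments actually use):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: prometheus-ingress
  namespace: observability
spec:
  podSelector:
    matchLabels:
      app: prometheus        # assumed label on the Prometheus pods
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: grafana   # assumed label on the Grafana pods
      ports:
        - protocol: TCP
          port: 9090
```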
Cost Optimization Strategies
Sampling Configuration
```yaml
# Production sampling rates
OTEL_TRACES_SAMPLER: "parentbased_traceidratio"
OTEL_TRACES_SAMPLER_ARG: "0.01"  # 1% for busy applications
# Development: 100% sampling acceptable
# Production: 1% prevents bankruptcy
```
Retention Settings
```yaml
prometheus:
  prometheusSpec:
    retention: 30d      # Not 1 year unless you enjoy massive bills
    resources:
      limits:
        memory: 4Gi     # Set limits or face OOM kills
```
Deployment Comparison Matrix
Aspect | Docker Compose | Kubernetes | Managed Services |
---|---|---|---|
Complexity | Easy (if Docker works) | YAML hell | Easy until customization |
Scalability | Single machine limit | Infinite (expensive) | Auto-scales (3x cost) |
Maintenance | 2am security updates | Automated (if configured) | Provider responsibility |
Production Ready | Development only | Yes (with expertise) | Yes (with SLA) |
High Availability | Not possible | Requires proper configuration | Built-in redundancy |
Cost | Infrastructure only | Infrastructure + DevOps engineer | 3x more but externalized problems |
Troubleshooting Quick Reference
Emergency Commands
```bash
# Check container status
docker-compose ps

# View service logs
docker-compose logs [service-name]

# Test metrics endpoint
docker exec -it prometheus curl http://app:8080/metrics

# Kubernetes pod status
kubectl get pods -n observability

# Resource usage monitoring
docker stats
```
Performance Debugging
- Prometheus slow queries: Test in Prometheus UI before Grafana
- Missing traces: Check collector logs for export errors
- High memory usage: Identify high-cardinality metrics
- Network issues: Verify service names and ports in Docker/K8s
Essential Documentation References
- Prometheus Documentation - Actually comprehensive
- OpenTelemetry Troubleshooting
- Grafana Performance Limitations
- Kubernetes Monitoring Architecture
- CNCF Slack #observability - Active community support
- Prometheus Community Forum - Searchable troubleshooting history
Production Deployment Warnings
Critical Failures to Avoid:
- Default retention settings will bankrupt you
- High cardinality labels crash clusters
- Missing resource limits cause cascading failures
- Auto-instrumentation without sampling generates TB of data
- ServiceMonitors require Prometheus Operator (install kube-prometheus-stack; see the sketch after this list)
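For reference, a minimal ServiceMonitor sketch; kube-prometheus-stack only picks it up if the labels match its selector (by default, the Helm release label), and the `app: my-app` selector and `http-metrics` port name are assumptions:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: observability
  labels:
    release: prometheus   # must match the kube-prometheus-stack release name
spec:
  selector:
    matchLabels:
      app: my-app         # assumed label on your app's Service
  endpoints:
    - port: http-metrics  # named port on the Service, not the container
      interval: 30s
```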
Success Criteria:
- Metrics collection without OOM kills
- Trace visibility under 1% sampling
- Dashboard response under 5 seconds
- Monthly costs under budget expectations
- 2am alerts that actually indicate real problems
Useful Links for Further Investigation
Resources That Don't Suck
Link | Description |
---|---|
Prometheus Documentation | The official docs are actually good, unlike most open source projects. Covers PromQL, recording rules, and why your queries are slow. Spent more time here than I care to admit. |
Jaeger Documentation | Decent documentation, though the deployment examples are basic. Good for understanding sampling and storage options. The troubleshooting section saved me when traces vanished into the void. |
Grafana Documentation | Grafana's docs are decent but their query builder UI is still confusing as hell. The dashboard examples help when you're staring at a blank panel for the 50th time. |
OpenTelemetry Documentation | Comprehensive but overwhelming. Start with the getting started guides, skip the philosophy sections (unless you enjoy existential crises about observability standards). |
Microsoft OpenTelemetry Example | One of the few complete examples that doesn't skip the hard parts. Shows .NET integration with working configs. |
Kubernetes Observability Tutorial | Hands-on tutorial with actual YAML files you can copy. Better than most "enterprise" documentation. |
CNCF OpenTelemetry Demo | Kitchen-sink demo app with every integration. Good for seeing real examples, bad for understanding simple setups. Took me 3 hours to figure out which parts I actually needed. Skip this unless you like complexity for its own sake. |
Prometheus Operator | Makes Prometheus work in Kubernetes without manual YAML hell. Required if you want ServiceMonitors to work. |
Grafana Helm Charts | Official Helm charts that actually work out of the box. Use these instead of writing your own manifests. |
Jaeger Kubernetes Templates | Production templates with Elasticsearch setup. Warning: Elasticsearch will eat your RAM budget. AWS bills went up like 400 bucks or more the first month we deployed this thing. |
Prometheus Performance Tuning | Essential reading for when Prometheus starts eating 32GB RAM. Covers cardinality and why your metrics design sucks. |
Grafana Cloud Observability Guide | Modern patterns with Grafana Cloud. Expensive but someone else's problem when it breaks. Half the examples won't work in your environment. |
CNCF Slack #observability | Active community that actually helps. Maintainers hang out here and answer real questions. |
Prometheus Community Forum | Official forum that's slower than Slack but has better searchable history for troubleshooting. |