Reality Check: What You're Getting Into

This setup gives you monitoring that doesn't suck when production catches fire:

  • Metrics that matter: Prometheus scraping app metrics (not just "CPU is 50%")
  • Trace everything: Jaeger showing which service is bottlenecking your requests
  • Dashboards that work: Grafana displaying data you can actually use at 3am
  • One data pipeline: OpenTelemetry Collector so you don't configure 12 different exporters

Prerequisites (AKA "Shit You Need Before Starting")

Required or you're fucked:

  • Docker Desktop with 8GB+ RAM allocated (4GB minimum if you hate yourself)
  • Docker Compose that actually works (not the old Python version)
  • Basic Docker knowledge (you know what docker ps does)
  • Patience for when things randomly break

For Kubernetes (prepare for pain):

  • K8s cluster that doesn't randomly restart pods
  • Helm installed and working
  • kubectl configured properly (test with kubectl get nodes)
  • At least 16GB total RAM across your cluster (seriously, don't cheap out)

The Stack (What Each Thing Actually Does)

OpenTelemetry Collector routes your telemetry data without the pain of configuring exporters in every damn service. Worked great on my setup until I needed custom processing - spent 4 hours reading YAML documentation trying to figure out why my processor kept barfing with failed to build pipelines: service "pipelines" has no listeners. Still don't understand half the OpenTelemetry specification, but the contrib distributions have all the processors that actually work.

Prometheus stores your metrics and doesn't crash when you ask it questions. Latest versions supposedly have better memory handling - something about Remote Write improvements that cut memory usage, though honestly I haven't seen dramatic changes in my setup. Check the official docs, PromQL guide, and best practices before you start collecting everything.
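
If PromQL is new to you, this is the kind of query you'll end up running at 3am - a p95 latency sketch, assuming your instrumentation exposes a standard duration histogram (swap in whatever your metric and label names actually are):

# p95 request latency over the last 5 minutes, per service (metric/label names are assumptions)
histogram_quantile(
  0.95,
  sum(rate(http_server_request_duration_seconds_bucket[5m])) by (le, service_name)
)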

Jaeger shows you request traces so you can see that user auth is taking 2 seconds because someone added a database call in a loop. The newer versions work better with OpenTelemetry out of the box, though I still had to mess with the config for an hour. The Jaeger documentation covers deployment, while sampling strategies prevent you from drowning in trace data.

Grafana makes dashboards that look good in demos and confusing in production. Grafana 11 added AI features that are basically useless, and their query builder still makes me want to punch my screen. Just steal from community dashboards - someone else already did the hard work. The provisioning docs will save you from clicking through the UI 50 times to recreate your dashboards. Also check the Grafana getting started guide and dashboard optimization tips for faster loading.

Reality Check: Cost and Time

Time investment:

  • Docker setup: 2-3 hours if Docker behaves, 8 hours if it doesn't (spent a whole Saturday wrestling with Docker Desktop on M1 Mac)
  • Kubernetes production setup: 2-3 days minimum (more like a week if you hit the CrashLoopBackOff hell)
  • Getting useful dashboards: 1 week of tweaking (and cursing at Grafana's query builder)

Infrastructure costs (prepare your wallet):

  • Development: probably 50-100 bucks a month, maybe more if you're not paying attention to retention settings
  • Production: anywhere from a few hundred to fuck-knows-how-much depending on how much data you hoard (bills went completely nuts when someone cranked up retention to a year or something stupid - think it was over a grand that month, maybe more)

Essential cost planning: AWS CloudWatch pricing if you're going cloud, the CNCF landscape for open source alternatives when you're tired of vendor bills, Grafana performance considerations, and Prometheus best practices to avoid shooting yourself in the foot.

The real cost: A DevOps person who understands this stack, because when Prometheus OOMs at 2am, you need someone who knows why.

Free is never free when you factor in the human time to make it work.

Dashboard Metrics Visualization

Observability in MicroServices: Serilog, Grafana, OpenTelemetry, Jaeger, Prometheus & Grafana Loki by Five Minute Tech

Microservices Observability with OpenTelemetry (17 mins)

This Five Minute Tech walkthrough shows you the real setup without the usual YouTube bullshit. Goes through Serilog, Grafana, OpenTelemetry, Jaeger, and Prometheus integration.

Watch: Observability in MicroServices: Serilog, Grafana, OpenTelemetry, Jaeger, Prometheus & Grafana Loki

What you'll actually learn:
- Complete observability stack setup with working configs (actual YAML, not pseudo-code)
- OpenTelemetry integration that doesn't break your app (learned this after it broke mine twice)
- Dashboard creation without the Grafana query builder confusion
- How to connect everything without reading 12 different docs
- Real troubleshooting when things inevitably fail (around 14:30 mark when his collector crashes)

Why this one doesn't suck:
Relatively recent upload, shows actual working code, and covers the full observability pipeline. The presenter actually debugs the OTEL_EXPORTER_OTLP_ENDPOINT error instead of pretending everything works perfectly. Plus he admits when he fucks up the YAML indentation at 9:15.


Docker Setup: Copy-Paste This and Pray

Prometheus and Grafana Architecture

The Docker Compose That Actually Works

Here's the docker-compose.yml that murdered my weekend the first time I tried to set this up. Copy this exactly - don't be like me and think you can "improve" it. Wasted 2 hours wondering why otel-collector:4317 kept throwing connection errors before I realized Docker networking ignores your /etc/hosts file. This config follows Docker Compose best practices, container observability patterns, and Docker networking fundamentals - read those or you'll hit the same stupid networking issues I did.

version: '3.8'

services:
  # OpenTelemetry Collector - Routes telemetry data
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC receiver
      - "4318:4318"   # OTLP HTTP receiver
      - "8889:8889"   # Prometheus metrics endpoint
    depends_on:
      - jaeger-all-in-one
    networks:
      - observability

  # Prometheus - Metrics storage
  prometheus:
    image: prom/prometheus:latest
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    networks:
      - observability

  # Jaeger - Tracing
  jaeger-all-in-one:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # Jaeger UI
      - "14250:14250"  # gRPC endpoint
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    networks:
      - observability

  # Grafana - Dashboards
  grafana:
    image: grafana/grafana:10.2.2
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    networks:
      - observability

volumes:
  prometheus-data:
  grafana-data:

networks:
  observability:
    driver: bridge

OpenTelemetry Collector Configuration

Create otel-collector-config.yaml:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s   # required - the collector refuses to start without it
    limit_mib: 512

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    const_labels:
      cluster: "local"
  # The standalone jaeger exporter was removed from recent collector-contrib releases,
  # so ship traces to Jaeger over OTLP (COLLECTOR_OTLP_ENABLED=true in the compose file)
  otlp/jaeger:
    endpoint: jaeger-all-in-one:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]

Prometheus Configuration

Create prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']
    scrape_interval: 10s

  # Add your application endpoints here
  - job_name: 'microservices'
    static_configs:
      - targets: ['app1:8080', 'app2:8081']
    metrics_path: '/metrics'
    scrape_interval: 10s

Getting This Shit Running

Warning: This will use 4GB+ RAM. Docker Desktop will hate you if you're running on 8GB total.

  1. Create the dirs (or it'll fail silently):

    mkdir observability-stack && cd observability-stack
    mkdir -p grafana/provisioning/{dashboards,datasources}
    # Yes, you need those specific directories or Grafana won't start
    
  2. Start everything:

    docker-compose up -d
    # If this fails with "network not found", run it again. Docker networking is weird.
    
  3. Check if anything died:

    docker-compose ps
    # If anything shows "Exit 1", check logs: docker-compose logs [service-name]
    
  4. Access the UIs (if they actually started):

    • Grafana: localhost port 3000 (admin/admin) - takes 30 seconds to boot
    • Prometheus: localhost port 9090 - should be immediate
    • Jaeger: localhost port 16686 - takes 30 seconds to show traces

When everything goes to shit:

  • Port 3000 taken? Kill whatever's hogging it (usually a zombie React dev server): lsof -ti:3000 | xargs kill -9
  • Grafana refuses to start? Check the grafana/provisioning dirs exist (forgot this 3 different times like an idiot)
  • OTel Collector crashes with no such file or directory? Your config file path is wrong or someone fucked up the YAML
  • Everything shows "Exited (1)"? Classic tabs vs spaces YAML nightmare

This setup creates an "observability" Docker network so containers can talk to each other by service name. The volumes persist data so you don't lose metrics when containers restart.

Essential reading: Docker networking guide when containers won't talk, resource limits before you OOM everything, Docker Compose troubleshooting, and OpenTelemetry Collector config reference. Also check Prometheus configuration docs and Grafana provisioning guide to avoid manual setup hell.
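
One more thing on provisioning: the compose file mounts ./grafana/provisioning, but Grafana only auto-configures datasources if you drop a file in there. A minimal sketch - grafana/provisioning/datasources/datasources.yaml, with service names matching the compose file above:

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Jaeger
    type: jaeger
    access: proxy
    url: http://jaeger-all-in-one:16686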

Kubernetes: Where Your Budget Goes to Die

Kubernetes Architecture

Docker setup working? Awesome. Now for the part where your bank account starts crying - making this shit work on Kubernetes. Everything gets 10x more expensive and complex, but you also get actual scalability instead of praying your single Docker host doesn't die. I learned this lesson when my first K8s observability setup nuked production for 2 hours because I didn't set resource limits and Prometheus ate every byte of RAM on the cluster.

Helm Charts (Because Manual YAML is Masochism)

Kubernetes deployment is where costs explode and complexity multiplies. Helm charts keep you from manually editing YAML files at 3am when production is burning. The Kubernetes monitoring guide covers the fundamentals, while the Kubernetes architecture docs and resource management guide prevent you from blowing your budget. Also check Grafana's high cardinality management to avoid performance hell.

1. Add Required Helm Repositories

# Add OpenTelemetry Helm repository
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts

# Add Prometheus community charts
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

# Add Grafana repository
helm repo add grafana https://grafana.github.io/helm-charts

# Add Jaeger repository
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts

# Update repositories
helm repo update

2. Deploy OpenTelemetry Collector

Create otel-collector-values.yaml:

mode: deployment
replicaCount: 2

config:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
  processors:
    batch:
      timeout: 1s
      send_batch_size: 1024
    memory_limiter:
      check_interval: 1s
      limit_mib: 512
    k8sattributes:
      auth_type: "serviceAccount"
      passthrough: false
      extract:
        metadata:
          - k8s.pod.name
          - k8s.pod.uid
          - k8s.deployment.name
          - k8s.namespace.name
  exporters:
    prometheus:
      endpoint: "0.0.0.0:8889"
      const_labels:
        cluster: "production"
    # Same deal as the Docker setup: the standalone jaeger exporter is gone from
    # recent collector releases, so use OTLP to the Jaeger collector instead
    otlp/jaeger:
      endpoint: jaeger-collector:4317
      tls:
        insecure: true
  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [memory_limiter, k8sattributes, batch]
        exporters: [otlp/jaeger]
      metrics:
        receivers: [otlp]
        processors: [memory_limiter, k8sattributes, batch]
        exporters: [prometheus]

resources:
  limits:
    cpu: 500m
    memory: 1Gi
  requests:
    cpu: 250m
    memory: 512Mi

service:
  type: ClusterIP

Deploy with:

helm install otel-collector open-telemetry/opentelemetry-collector \
  -f otel-collector-values.yaml \
  -n observability --create-namespace

3. Deploy Jaeger

helm install jaeger jaegertracing/jaeger \
  --set provisionDataStore.cassandra=false \
  --set provisionDataStore.elasticsearch=true \
  --set storage.type=elasticsearch \
  --set elasticsearch.replicas=1 \
  --set elasticsearch.minimumMasterNodes=1 \
  -n observability

4. Deploy Prometheus Stack

Create prometheus-values.yaml:

prometheus:
  prometheusSpec:
    retention: 30d
    resources:
      requests:
        memory: 2Gi
        cpu: 500m
      limits:
        memory: 4Gi
        cpu: 1000m
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi

grafana:
  adminPassword: "secure-password-here"
  persistence:
    enabled: true
    size: 10Gi
  resources:
    requests:
      memory: 256Mi
      cpu: 100m
    limits:
      memory: 512Mi
      cpu: 500m

alertmanager:
  alertmanagerSpec:
    resources:
      requests:
        memory: 256Mi
        cpu: 100m
      limits:
        memory: 512Mi
        cpu: 500m

Deploy with:

helm install prometheus prometheus-community/kube-prometheus-stack \
  -f prometheus-values.yaml \
  -n observability

5. Configure Service Discovery

Create ServiceMonitor for automatic metrics discovery:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: otel-collector
  namespace: observability
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: opentelemetry-collector
  endpoints:
  - port: prometheus
    interval: 10s
    path: /metrics

Apply with:

kubectl apply -f servicemonitor.yaml

6. Verify Production Deployment

# Check all pods are running
kubectl get pods -n observability

# Port-forward to access UIs
kubectl port-forward -n observability svc/prometheus-grafana 3000:80
kubectl port-forward -n observability svc/jaeger-query 16686:16686

# Check metrics collection
kubectl port-forward -n observability svc/prometheus-kube-prometheus-prometheus 9090:9090

Production Reality Check

Resource planning (prepare your wallet):

  • Prometheus: probably need 4-8GB RAM, maybe more if you're not careful about retention
  • Grafana: 1GB RAM (512MB if you hate performance)
  • Jaeger with Elasticsearch: 8GB+ RAM total or it'll crash randomly
  • OTel Collector: 1GB RAM (scales with throughput, will OOM if you don't set limits)

Total infrastructure cost: anywhere from 500 to fuck-knows-how-much per month depending on cluster size and cloud provider - could be way more if you're not watching the bills.

High availability (because single points of failure suck):
Deploy 3 replicas minimum for critical components. Use pod anti-affinity or you'll get all replicas on the same node that dies during an update. Check the Kubernetes deployment patterns and pod disruption budgets guides.

Security (so you don't get fired):

  • Enable RBAC (default in most managed K8s now)
  • NetworkPolicies to prevent lateral movement
  • TLS everywhere (pain in the ass but necessary)
  • Don't use default passwords (looking at you, Grafana admin/admin)
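
On the NetworkPolicy point, here's a hedged sketch that only lets in-namespace traffic reach Grafana - the labels are assumptions, adjust them for your cluster:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: grafana-ingress
  namespace: observability
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: grafana
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: observability
      ports:
        - protocol: TCP
          port: 3000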

ServiceMonitors are useless without Prometheus Operator installed first. Another moving piece to deploy and manage. Spent a full day wondering why Prometheus completely ignored my ServiceMonitors before discovering you need the CRDs installed first. The kube-prometheus-stack bundles everything together, so just use that instead of cobbling together individual components like I did.

Must-read: Kubernetes monitoring architecture, resource management, the Prometheus community charts, Jaeger Kubernetes deployment guide, and GeekGrove's optimization guide or you'll blow your budget.

Instrumenting Your Apps: Where Things Break

Microservices Telemetry Architecture

Got your monitoring stack running? Cool. Now for the part that'll make you question your career choices - getting your applications to send data that doesn't suck. Most people either instrument everything and tank their performance, or instrument nothing and wonder why their traces are empty. I went full idiot on my first attempt and instrumented every database call - ended up with 50,000 spans per request and a trace that looked like a Christmas tree.

Adding Telemetry Without Breaking Everything

The OpenTelemetry documentation is your friend here, especially the semantic conventions that define what attributes you should actually be tracking. Also worth reading: the Spring Boot starter guide and OpenTelemetry troubleshooting for Node.js for platform-specific gotchas.

Java/Spring Boot Applications

Add OpenTelemetry dependencies to your pom.xml (check the OpenTelemetry Java instrumentation repo for latest versions):

<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-api</artifactId>
    <version>1.30.0</version>
</dependency>
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-sdk</artifactId>
    <version>1.30.0</version>
</dependency>
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-exporter-otlp</artifactId>
    <version>1.30.0</version>
</dependency>
<dependency>
    <groupId>io.opentelemetry.instrumentation</groupId>
    <artifactId>opentelemetry-spring-boot-starter</artifactId>
    <version>1.32.0</version>
</dependency>

Configure in application.yml:

management:
  otlp:
    metrics:
      export:
        enabled: true
        url: http://otel-collector:4318/v1/metrics
    tracing:
      export:
        enabled: true
        url: http://otel-collector:4318/v1/traces

spring:
  application:
    name: user-service

otel:
  exporter:
    otlp:
      endpoint: http://otel-collector:4318
  service:
    name: ${spring.application.name}
  resource:
    attributes:
      service.version: 1.0.0

Node.js Applications

Install these packages (and pray npm doesn't break something else):

npm install @opentelemetry/api \
           @opentelemetry/sdk-node \
           @opentelemetry/sdk-metrics \
           @opentelemetry/exporter-trace-otlp-http \
           @opentelemetry/exporter-metrics-otlp-http \
           @opentelemetry/instrumentation-http \
           @opentelemetry/instrumentation-express

Warning: Node.js instrumentation seems to break hot reload - at least it did for me. Spent a whole afternoon convinced my Express app was fucked before realizing the OpenTelemetry HTTP instrumentation was making nodemon have a seizure every time I saved a file. The tracing library hooks into HTTP requests and apparently that confuses the hell out of nodemon's file watching. The Node.js instrumentation docs mention this gotcha, and there's a GitHub issue about performance slowdown that affects development too.

Create tracing.js:

const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-http');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');

const sdk = new NodeSDK({
  serviceName: 'order-service',
  // service.version can be set via OTEL_RESOURCE_ATTRIBUTES=service.version=1.0.0
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  // Metrics go through a periodic reader, not a bare exporter
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'http://otel-collector:4318/v1/metrics',
    }),
  }),
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
  ],
});

sdk.start();

Import this file at the top of your main application file:

require('./tracing');
const express = require('express');
const app = express();
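
If the nodemon hot-reload problem from earlier bites you, one hedged workaround is to only load the tracing file outside local dev (assumes you set NODE_ENV in development):

// Skip instrumentation in local dev so it doesn't fight nodemon's file watcher
if (process.env.NODE_ENV !== 'development') {
  require('./tracing');
}
const express = require('express');
const app = express();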

Python Applications

Install OpenTelemetry packages:

pip install opentelemetry-api \
           opentelemetry-sdk \
           opentelemetry-exporter-otlp \
           opentelemetry-instrumentation-flask \
           opentelemetry-instrumentation-requests

Gotcha: Python Flask auto-instrumentation works until you use async views, then it gets confused. FlaskInstrumentor doesn't handle async def route handlers well - at least in my experience. Burned 3 hours wondering why my async endpoints were invisible in traces - turns out you need manual span management for anything touching asyncio. Check the OpenTelemetry Python docs and instrumentation libraries guide for the latest async support.

Create instrumentation setup:

from opentelemetry import trace, metrics
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Name the service or everything shows up as "unknown_service"
resource = Resource.create({"service.name": "flask-service"})

# Configure tracing
trace.set_tracer_provider(TracerProvider(resource=resource))
tracer = trace.get_tracer(__name__)

otlp_exporter = OTLPSpanExporter(
    endpoint="http://otel-collector:4318/v1/traces"
)
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

# Configure metrics - the exporter has to hang off a periodic reader
metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://otel-collector:4318/v1/metrics")
)
metrics.set_meter_provider(MeterProvider(resource=resource, metric_readers=[metric_reader]))
meter = metrics.get_meter(__name__)

# Instrument Flask and requests
FlaskInstrumentor().instrument()
RequestsInstrumentor().instrument()
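
For the async-view gotcha above, manual spans are the workaround. A minimal sketch, assuming Flask async views (flask[async] installed) and the tracer configured above:

import asyncio
from flask import Flask

app = Flask(__name__)

@app.route("/report")
async def report():
    # Auto-instrumentation can miss async handlers, so open the span yourself
    with tracer.start_as_current_span("report.generate"):
        await asyncio.sleep(0.1)  # stand-in for real async work
        return {"status": "ok"}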

Custom Metrics and Traces

Adding Business Metrics

Java example with custom metrics:

@RestController
public class OrderController {

    private final OrderService orderService;
    private final Counter orderCounter;
    private final Timer orderProcessingTime;

    public OrderController(OrderService orderService, MeterRegistry meterRegistry) {
        this.orderService = orderService;
        this.orderCounter = Counter.builder("orders.created")
            .description("Number of orders created")
            .tag("service", "order-service")
            .register(meterRegistry);

        this.orderProcessingTime = Timer.builder("order.processing.time")
            .description("Time taken to process orders")
            .register(meterRegistry);
    }

    @PostMapping("/orders")
    public ResponseEntity<Order> createOrder(@RequestBody Order order) {
        // record(Supplier) times the call without forcing you to handle checked exceptions
        return orderProcessingTime.record(() -> {
            Order savedOrder = orderService.save(order);
            orderCounter.increment();
            return ResponseEntity.ok(savedOrder);
        });
    }
}

Adding Custom Spans

@Service
public class PaymentService {
    
    private final Tracer tracer;
    
    public PaymentService() {
        this.tracer = GlobalOpenTelemetry.getTracer("payment-service");
    }
    
    public PaymentResult processPayment(PaymentRequest request) {
        Span span = tracer.spanBuilder("payment.process")
            .setAttribute("payment.amount", request.getAmount())
            .setAttribute("payment.currency", request.getCurrency())
            .startSpan();
            
        try (Scope scope = span.makeCurrent()) {
            // Payment processing logic
            PaymentResult result = doPaymentProcessing(request);
            
            span.setStatus(StatusCode.OK);
            span.setAttribute("payment.result", result.getStatus());
            
            return result;
        } catch (Exception e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR, "Payment processing failed");
            throw e;
        } finally {
            span.end();
        }
    }
}

Environment Configuration

Docker Environment Variables

services:
  user-service:
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
      - OTEL_SERVICE_NAME=user-service
      - OTEL_SERVICE_VERSION=1.0.0
      - OTEL_RESOURCE_ATTRIBUTES=service.name=user-service,service.version=1.0.0

  order-service:
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
      - OTEL_SERVICE_NAME=order-service
      - OTEL_SERVICE_VERSION=1.0.0

Kubernetes ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-config
data:
  OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector:4318"
  OTEL_SERVICE_VERSION: "1.0.0"
  OTEL_RESOURCE_ATTRIBUTES: "cluster=production,environment=prod"
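
The ConfigMap does nothing until a workload actually consumes it - a hedged Deployment fragment (container name is a placeholder):

# Deployment spec fragment pulling in otel-config via envFrom
spec:
  template:
    spec:
      containers:
        - name: user-service
          envFrom:
            - configMapRef:
                name: otel-config
          env:
            - name: OTEL_SERVICE_NAME   # keep the per-service name out of the shared ConfigMap
              value: user-service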

Auto-instrumentation reality check:

  • Java Spring Boot starter works but tacks on 200ms to startup. Still worth it for not manually instrumenting everything.
  • Node.js auto-instrumentation grabs most HTTP calls but murders hot reload (cost me 4 hours of debugging)
  • Python Flask instrumentation is fine until you need custom spans or async views, then you're on your own
  • All of them vomit massive amounts of data in prod - set sampling rates or watch your AWS bill go insane (we generated crazy amounts of trace data - had to be 50TB or more the first month)

Production tips I wish someone told me earlier:

  • Use environment variables for config - hard-coding collector URLs is some junior-level bullshit
  • Set OTEL_SERVICE_NAME or all your traces show up as "unknown_service" and you'll want to delete everything
  • Set sampling rates or go bankrupt (1% for busy apps, 100% only for dev - bills jumped to like 3K or something before I figured this out)
  • Watch instrumentation overhead - this shit adds 50-100ms to every request (yes, monitoring makes your app slower)
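
For the sampling point above, every SDK honors the same spec-defined environment variables, so a docker-compose sketch looks like this (1% head sampling - tune per service):

services:
  order-service:
    environment:
      - OTEL_TRACES_SAMPLER=parentbased_traceidratio
      - OTEL_TRACES_SAMPLER_ARG=0.01   # sample 1% of new traces, respect the parent's decision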

Key references: OpenTelemetry Java SDK for implementation, sampling strategies to avoid bankruptcy, Baeldung's Spring Boot setup guide, the Aspecto troubleshooting checklist, and SigNoz instrumentation guide for production examples.

Shit That Will Break (And How to Fix It)

Q

Prometheus keeps showing "context deadline exceeded" errors

A

Your PromQL query is too expensive and timing out. This happens when you query high-cardinality metrics or use inefficient functions. Hit this error 20 times before I realized I was querying metrics with user IDs as labels.

What worked for me (sometimes):

  • Simplify your query or throw more RAM at it (I threw RAM at it, still timed out)
  • Bump --query.timeout from 2 minutes to 5 minutes - don't ask me why this helps
  • Create recording rules to pre-calculate the expensive shit (this actually saved my dashboard that was murdering Prometheus)
  • Stop using rate() on metrics with 10k+ series unless you enjoy waiting
  • Sometimes restarting Prometheus fixes it temporarily - no clue why

Example fix: Instead of rate(http_requests_total[5m]), create a recording rule that computes this every minute. Check the query performance guide and cardinality debugging.
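
Roughly what that recording rule looks like - a sketch assuming the http_requests_total metric from above; reference the file from rule_files in prometheus.yml:

# rules.yml - pre-compute the expensive rate() once a minute
groups:
  - name: http_precompute
    interval: 1m
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)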

Q

Can't curl the metrics endpoint and Prometheus shows "connection refused"

A

Your app isn't exposing metrics or Docker networking is fucked.

Shit to try:

  1. Can you actually curl that metrics endpoint from inside the container? (usually no)
  2. Are you using localhost instead of the service name? (Use app:8080, not localhost:8080) - made this mistake like 10 times
  3. Is the app even listening on the port you think it is? Check with netstat -ln or whatever

Command to test: docker exec -it prometheus curl http://your-app:port/metrics

Q

Grafana dashboards are completely blank

A

Check if your time range includes when your apps were actually running. Rookie mistake #1 that I made 5 different times. Spent 30 minutes debugging "broken" metrics when I was looking at data from last week.

Other causes:

  • Data source isn't configured (test the connection in Grafana)
  • You're querying metrics that don't exist (check in Prometheus UI first)
  • Network issues between Grafana and Prometheus

Pro tip: Always test queries in Prometheus UI (http://localhost:9090) before building Grafana dashboards. Use Grafana's query builder or check dashboard examples.

Q

Jaeger UI shows no traces

A

The collector is probably eating your traces. Check docker-compose logs otel-collector for errors.

Common fuckups:

  • Wrong endpoint URLs (4317 for gRPC, 4318 for HTTP)
  • Sampling is set too low (add OTEL_TRACES_SAMPLER=always_on for testing)
  • Apps aren't actually sending traces (check if instrumentation is enabled)
  • Collector can't reach Jaeger (wrong service name or port)
Q

Docker containers randomly crash with "killed"

A

You're out of memory. The OOM killer is terminating containers.

Quick check: docker stats to see memory usage. If anything is near 100%, that's your problem.

Fixes:

  • Give Docker Desktop more RAM allocation
  • Set memory limits on containers so they don't eat everything
  • Reduce metric retention or sampling rate
Q

OpenTelemetry Collector is in a crash loop

A

Usually a config file issue or resource limits.

Debug steps:

  1. Check the logs: docker-compose logs otel-collector
  2. Is the YAML syntax valid? Use yamllint if you're not sure
  3. Are all the service references correct? (jaeger-all-in-one, not jaeger)
  4. Are you hitting memory limits? Increase the memory_limiter processor
Q

Prometheus is eating 8GB RAM and still wants more

A

High cardinality labels are killing you. Someone used user IDs or timestamps as labels. I've seen this take down entire clusters - classic move by a junior dev who thought putting request IDs in metric labels was a good idea.

Nuclear option: Restart Prometheus with --storage.tsdb.retention.size=4GB to cap how much TSDB data it keeps. That's a disk cap, not a memory cap, but shedding old blocks usually eases the pressure.

Actual fix: Hunt down the shitty high-cardinality metrics and murder them. Find labels with thousands of unique values - usually some genius put session tokens or request IDs in labels. Took me like 2 days or more to track down the metric that was using session tokens as labels (yes, fucking session tokens). Use the cardinality explorer and follow metric naming best practices to avoid this nightmare.
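
One way to spot the offenders without extra tooling - count series per metric name straight in the Prometheus UI (heavy query, run it once, don't put it on a dashboard):

# Top 10 metrics by series count
topk(10, count by (__name__)({__name__=~".+"}))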

Q

"Failed to connect to Docker daemon" when using docker-compose

A

Docker Desktop randomly stops working. Just restart it. This happens more on Windows/Mac.

If that doesn't work:

  • Check if Docker is actually running
  • Try docker ps to see if the daemon is responsive
  • Restart your entire machine (the nuclear option that works 90% of the time)


Real Talk: Deployment Options

| Aspect | Docker Compose | Kubernetes + Helm | Managed Services |
|---|---|---|---|
| Complexity | Easy if Docker works | Medium (YAML hell) | Easy until you need custom configs |
| Scalability | Single machine max | Infinite (if you have money) | Auto-scales (expensive) |
| Cost | Infrastructure only | Infrastructure + the DevOps engineer you'll need to hire | Like 3x more expensive but someone else's problem |
| Maintenance | You updating everything at 2am when security alerts hit | Automated if you set it up right | Provider handles it |
| Customization | Change anything you want | Helm overrides for most things | Limited to what they allow |
| Production Ready | Don't even think about it | Yes, but you need to know what you're doing | Yes, and they'll take the blame |
| Data Retention | Gone when container dies | Persistent if configured properly | Built-in backups and SLA |
| Multi-tenancy | One big mess | Namespaces work well | Proper isolation by default |
| High Availability | Ha, good luck | Works if you configure it right | Built-in redundancy |
