How These Three Tools Actually Work Together

The Good, The Bad, and The "Why Is Nothing Scraping?"

[Image: Monitoring stack overview]

Look, Prometheus scrapes metrics every 15 seconds by default. Grafana queries Prometheus to make graphs. Alertmanager yells at you when shit breaks. That's the whole thing.

The reality: you'll spend the first week figuring out why Prometheus isn't scraping anything, the second week wondering why Grafana shows no data, and the third week getting alerts for things that aren't actually broken.

What Each Tool Actually Does

Prometheus is basically a time-series database that fetches metrics from everything every few seconds. Recent versions supposedly fixed the UI (it's still ugly) and added UTF-8 support (because apparently someone was using emojis in metric names).

Grafana makes the ugly metrics look pretty. Latest version added some dashboard advisor that'll probably tell you your dashboards suck. Fair enough.

Alertmanager 0.28.1 handles notifications. It groups alerts so you don't get 500 Slack messages when one server dies and takes everything with it. Sometimes this works too well and you miss actual emergencies.

[Image: Prometheus architecture diagram]

The Data Flow (When It Works)

Here's what happens when everything's actually working:

  1. Prometheus scrapes targets - usually breaks on Docker networking
  2. Stores metrics in TSDB - fills up your disk if you're not careful
  3. Grafana queries Prometheus - times out if you write terrible PromQL
  4. Alert rules fire - usually at 3am on weekends
  5. Alertmanager routes notifications - to the wrong Slack channel

Most common issue? Prometheus can't reach your targets because of some Docker networking bullshit. You'll see context deadline exceeded errors in the logs that tell you nothing useful. Spend a day fighting with docker network inspect before you blame the config.

Why People Use This Stack

The good stuff:

  • Free and open source - no per-host licensing bill waiting to ambush you
  • Pull-based scraping plus service discovery handles services that come and go
  • Huge exporter ecosystem - there's an exporter for almost everything
  • PromQL is genuinely powerful once it finally clicks
  • Grafana turns the raw numbers into dashboards people will actually look at

The painful parts:

  • Setup will break in creative ways
  • Learning curve is steeper than advertised
  • High availability means double the infrastructure costs
  • Storage costs add up fast if you keep everything forever
  • Version upgrades break something every single time

When This Makes Sense

This stack shines when you're running cloud-native stuff where services come and go. It's particularly good on Kubernetes because the service discovery actually works most of the time.
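
If you're on Kubernetes, the discovery side looks roughly like this - a minimal sketch using the common prometheus.io/scrape pod-annotation convention (it's a convention you wire up yourself with relabeling, not something built in):

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only keep pods that opt in with the prometheus.io/scrape: "true" annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Carry namespace and pod name along as proper labels
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod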

Don't use this for application logs - use a log stack like Loki or ELK instead. Prometheus is for metrics, not logs. I learned this the hard way after trying to jam application logs into custom metrics. Took me a week to realize I was an idiot. Don't be me.


Actually Getting This Shit to Work

What Nobody Tells You About Production Deployments


Here's the thing about "production-ready" configs: they break in ways you never imagined. I've spent more nights debugging why Prometheus can't scrape itself than I care to admit. Last time it was because I had localhost:9090 in the config instead of the container name. Took me 3 hours to figure that out.

The Docker Compose That Actually Works

Forget the fancy Kubernetes setup for now. Start with Docker Compose. This took me 3 attempts to get right - the networking always screwed me over:

version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest  # Pin to specific versions in prod
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerts:/etc/prometheus/alerts
      - prometheus-data:/prometheus  # Persist the TSDB or you lose all history every time the container is recreated
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'

  grafana:
    image: grafana/grafana:latest  # Pin this too - 'latest' broke our whole monitoring stack during a routine pull once
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin123  # Change this!
      
  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

volumes:
  grafana-data:
  prometheus-data:

Reality check: This will eat 2GB+ RAM minimum. Plan for 4GB if you want it responsive. Also budget $50-100/month in cloud costs unless you like surprises. I tried running this on a $20 DigitalOcean droplet once - lasted exactly 3 days before everything started OOMing.

The Prometheus Config That Doesn't Suck

Every tutorial shows you the "simple" config that breaks immediately. Here's one that actually works:

global:
  scrape_interval: 15s  # Don't go lower unless you hate your disk
  evaluation_interval: 15s

rule_files:
  - "alerts/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
    scrape_interval: 5s  # More frequent for system metrics
    
  - job_name: 'docker-containers'
    static_configs:
      - targets: ['cadvisor:8080']

Common gotcha: Targets have to use Docker service names, not localhost - the self-scrape on localhost:9090 is the one exception, since Prometheus is scraping itself inside its own container. The node-exporter and cadvisor jobs also assume those services exist in the same Compose network; add them or those targets will sit at DOWN forever. Took me an embarrassing amount of time to figure that out.
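
The rule_files glob above expects real files in ./alerts/. A minimal sketch of one to start from - the thresholds are made up, tune them for your own boxes:

# alerts/node.yml
groups:
  - name: node-basics
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has been unreachable for 5 minutes"
      - alert: DiskAlmostFull
        expr: node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} {{ $labels.mountpoint }} is over 90% full"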

Alertmanager Config That Won't Drive You Insane

The official docs make this look simple. It's not:

global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'monitoring@yourcompany.com'
  smtp_auth_username: 'monitoring@yourcompany.com'
  smtp_auth_password: 'your-app-password'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'

receivers:
- name: 'web.hook'
  slack_configs:
  - api_url: 'YOUR_SLACK_WEBHOOK_URL'
    channel: '#alerts'
    username: 'Prometheus'
    text: 'Something is fucked: {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

Pro tip: Start with Slack webhooks. Email alerts will end up in spam and PagerDuty costs money you probably don't have yet.
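
Once more than one team cares about alerts, you'll end up adding child routes. A minimal sketch of severity-based routing - the 'critical-pager' receiver is hypothetical, you'd define it under receivers: yourself:

route:
  receiver: 'web.hook'              # default catch-all
  group_by: ['alertname', 'cluster']
  routes:
    - matchers:
        - severity = "critical"
      receiver: 'critical-pager'    # hypothetical receiver - define it under receivers:
      group_wait: 0s                # page immediately for the genuinely bad stuff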

What Actually Breaks in Production

Storage fills up fast. Prometheus keeps everything in memory + disk. I accidentally killed a server by keeping 6 months of metrics - got the dreaded no space left on device error right in the middle of debugging a prod issue. I've also hit releases where retention settings didn't behave the way I expected. Set retention to 30 days max unless you have money to burn.

Memory usage spirals. Each metric series eats RAM. Keep an eye on your active series count (prometheus_tsdb_head_series) before adding 50 more exporters. This calculator will save your sanity.
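
A cheap early warning for the RAM spiral is alerting on Prometheus's own active series count - the threshold below is a made-up number, size it to the box you actually run:

groups:
  - name: prometheus-self
    rules:
      - alert: TooManyActiveSeries
        expr: prometheus_tsdb_head_series > 1000000   # made-up threshold - pick one your RAM can survive
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus is holding {{ $value }} active series - cardinality is creeping up"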

Network timeouts everywhere. Default timeouts are too aggressive. Bump them up or watch everything flake:

scrape_configs:
  - job_name: 'slow-service'
    scrape_interval: 30s
    scrape_timeout: 25s  # Default is 10s - raise it for slow targets, but keep it under scrape_interval

Docker networking is hell. Services can't reach each other? You'll get connection refused or no route to host errors that mean nothing useful. Check if they're on the same network with docker network ls and docker inspect until you hate networking. I've seen versions that randomly fail to scrape certain targets, especially when you get creative with network names.
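
The fix that has worked for me on Compose is to stop trusting the default network and pin every service to one explicit network. A sketch of the fragment you'd merge into the compose file above:

networks:
  monitoring: {}

services:
  prometheus:
    networks: [monitoring]
  grafana:
    networks: [monitoring]
  alertmanager:
    networks: [monitoring]

Then docker network inspect on the resulting network (Compose names it something like <project>_monitoring) shows you exactly who's attached, which beats guessing.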

Security That Actually Matters

Lock this shit down because you WILL get owned if you don't:

  1. Change default passwords - admin/admin is not production-ready, genius
  2. Use HTTPS - Let's Encrypt is free, use it
  3. Restrict network access - Don't expose 9090/9093/3000 to the internet, and put at least basic auth in front of Prometheus (sketch after this list)
  4. Update regularly - Security patches matter more than new features
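
For item 3, recent Prometheus versions can handle basic auth and TLS natively through a web config file passed with --web.config.file. A minimal sketch - the hash and cert paths are placeholders, generate a real bcrypt hash with something like htpasswd -nBC 10 admin:

# web.yml - wire it up with --web.config.file=/etc/prometheus/web.yml
basic_auth_users:
  admin: "$2y$10$REPLACE_WITH_A_REAL_BCRYPT_HASH"   # placeholder - never commit real hashes
tls_server_config:
  cert_file: /etc/prometheus/certs/prom.crt   # placeholder path
  key_file: /etc/prometheus/certs/prom.key    # placeholder path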

Scaling Reality Check

High availability means 2x the costs. You run at least 2 identical Prometheus instances scraping the same targets, a clustered pair of Alertmanagers to dedupe the resulting duplicate alerts, and a load balancer or query layer in front. Long-term storage that doesn't suck is on top of that.

Federation is complex. Don't attempt it until you've mastered the basics. I tried to federate 5 Prometheus instances and spent 2 weeks debugging query fanout issues.

Consider Grafana Cloud if you have budget. It's 3x more expensive but saves your sanity. Sometimes that's worth it.


[Image: Grafana dashboard example]

Kubernetes Integration

Integration Approaches Comparison

| Approach | Complexity | Scalability | Maintenance | Best For | Reality Check |
|---|---|---|---|---|---|
| Docker Compose | Low (until you scale) | Limited (breaks at ~10 services) | Easy (famous last words) | Development, small teams | Starts simple, becomes a nightmare to scale. This is where everyone starts before they learn better. |
| Kubernetes Operator | Medium | High | Medium | Production Kubernetes | Works great until the operator breaks and you have to debug CRDs at 3am with a production outage happening. |
| Helm Charts | Medium (if you know Helm) | High | Medium | Kubernetes with customization | Flexible, but you'll spend weeks tweaking values.yaml and questioning your life choices. |
| Manual Installation | High (pain incarnate) | Medium | Complex | Specialized requirements | Only do this if you hate yourself, need something weird, or enjoy debugging systemd unit files. |
| Cloud Managed | Low (vendor lock-in) | Very High | Minimal | Enterprise, cloud-first | Costs 3x more but your sanity is worth it. Sometimes that's the right trade-off. |

The Shit That Actually Breaks at Scale

What Happens When You Have Real Traffic

Here's what nobody tells you: this stack works fine for your pet project. It's when you hit real scale that the pain starts. I've been through three major monitoring disasters - here's what actually happens.

When Prometheus Hits Its Limits

Memory usage explodes. Your 8GB server becomes a 32GB server real quick. Each unique metric series eats RAM, and developers love creating metrics like http_requests{user_id="12345"}. That's 100k unique series if you have 100k users.

This memory calculator saved my ass when we hit 10M series and the server started dying. Rule of thumb: samples compress to a couple of bytes on disk, but every active series costs a few KB of RAM - series count is what kills you, not sample volume.

Query timeouts everywhere. Your dashboards take 30 seconds to load because someone wrote a PromQL query that scans 6 months of data. Use recording rules to pre-compute expensive queries or watch Grafana time out constantly.
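
Recording rules live in the same rule files as alerts: you precompute the expensive query once per evaluation interval and point Grafana at the new series. A minimal sketch - http_requests_total and the job label are stand-ins for whatever your services actually expose:

groups:
  - name: precomputed-dashboard-queries
    interval: 1m
    rules:
      - record: job:http_requests:rate5m          # naming convention: level:metric:operations
        expr: sum by (job) (rate(http_requests_total[5m]))

Dashboards then query job:http_requests:rate5m directly instead of re-running the rate() over raw series on every refresh.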

Federation is a nightmare. Don't federate until you've mastered single-node Prometheus. I tried to federate 5 instances and spent 2 weeks debugging why queries were returning partial results. Turns out the federation job was timing out silently with context deadline exceeded (Client.Timeout exceeded while awaiting headers). The docs make it look easy - it's not.
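
For reference, federation is just another scrape job pointed at /federate with match[] selectors - and the part that bit me was leaving the defaults alone so failures stayed quiet. A sketch with made-up hostnames:

scrape_configs:
  - job_name: 'federate-dc1'
    honor_labels: true
    metrics_path: '/federate'
    scrape_interval: 30s
    scrape_timeout: 25s                           # federation responses are big and slow - raise this
    params:
      'match[]':
        - '{job=~"prometheus|node-exporter"}'     # only pull the series you actually need downstream
    static_configs:
      - targets: ['prometheus-dc1.internal:9090'] # made-up hostname

And alert on up{job="federate-dc1"} so a dead federation job doesn't fail silently again.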

Alert Fatigue is Real

You'll get 500 alerts when one thing breaks. Cascading failures mean everything downstream alerts too. Set up inhibition rules or prepare to be woken up by 47 Slack notifications at 3am.

Alert routing becomes a full-time job. Every team wants their alerts routed differently. The Alertmanager config grows to 500 lines and nobody understands it anymore.

I learned this the hard way when a database went down and triggered 200+ alerts. The phone wouldn't stop buzzing for 20 minutes straight. My manager thought I was ignoring him. Now I group by cluster and severity and use silence patterns for known maintenance. Should've done this from day one.
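
Here's roughly what that grouping (and the inhibition rules mentioned above) look like in the Alertmanager config - a sketch that assumes your alerts carry cluster and severity labels:

route:
  receiver: 'web.hook'
  group_by: ['cluster', 'severity']     # one notification per cluster meltdown, not 200 pings

inhibit_rules:
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: ['cluster']                  # a firing critical mutes the warnings from the same cluster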

Storage Costs Will Kill You

Long-term storage gets expensive fast. Prometheus keeps everything in memory + local disk. Want 6 months of metrics? That's terabytes of storage plus the RAM to serve queries.

Remote storage isn't magic. Thanos and Cortex solve long-term storage but add complexity. I spent a month debugging Thanos sidecar crashes - kept getting signal: killed with no useful logs. Gave up and just kept 30 days local. Sometimes simple wins.

Consider cloud options. Grafana Cloud costs 3x more but handles scaling, storage, and maintenance. Sometimes that's worth it vs spending weeks optimizing retention policies.

Production War Stories

The Great Cardinality Explosion of 2023: A developer added user IDs to metric labels. Went from 50k series to 2M overnight. Prometheus OOM'd every 10 minutes. Took down monitoring during a real outage. Don't be that developer.

The Federation Failure: Tried to aggregate metrics from 5 data centers. Worked fine until one Prometheus went down and queries started returning incomplete data. Took 2 days to figure out the federation queries were silently failing with context deadline exceeded (Client.Timeout exceeded while awaiting headers) buried in the logs. No one thought to check the federation job status. Brilliant.

The Alert Storm: Database cluster went down, triggered like 800+ alerts in a few minutes. PagerDuty, Slack, email all exploded. Phone calls at 2am. Had to silence everything and fix blind. Learned about alert dependencies the hard way.

What Actually Scales (Spoiler: Not What You Think)

Keep it simple. Multiple small Prometheus instances beat one giant federated mess. Easier to debug, easier to upgrade, easier to not break everything.

Monitoring the monitoring. You need alerts on Prometheus being down, Grafana being slow, and Alertmanager not sending notifications. This meta-monitoring guide will save your ass.
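
A couple of self-monitoring rules worth stealing - these use the standard self-metrics both components expose, and they assume Prometheus is scraping Alertmanager as a target (double-check the metric names against your versions):

groups:
  - name: watch-the-watchers
    rules:
      - alert: PrometheusConfigReloadFailed
        expr: prometheus_config_last_reload_successful == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus is running on a stale config - the last reload failed"
      - alert: AlertmanagerNotificationsFailing
        expr: rate(alertmanager_notifications_failed_total[15m]) > 0
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Alertmanager is failing to deliver notifications ({{ $labels.integration }})"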

Real resource requirements:

  • Small setup (< 1k series): 2GB RAM, 2 CPU cores, 100GB SSD
  • Medium (10k-100k series): 16GB RAM, 4 CPU cores, 1TB SSD
  • Large (1M+ series): 64GB+ RAM, 8+ CPU cores, enterprise storage

Alternatives When This Doesn't Work

Sometimes this stack just isn't the right fit and you need to admit defeat:

  • DataDog - costs 10x more, works out of the box. Your CFO will hate you but your sleep schedule will thank you.
  • New Relic - expensive but handles application monitoring better. Good for when you need to trace why your APIs are slow as shit.
  • Grafana Cloud - managed Prometheus/Grafana. Same tools, someone else's headache.
  • AWS CloudWatch - if you're all-in on AWS and don't mind the vendor lock-in. Integrates well, costs add up fast.
  • VictoriaMetrics - Prometheus-compatible but faster. Good when you hit Prometheus's scaling limits.
  • InfluxDB - different data model, better for high-cardinality. Learn a new query language, though.
