The Good, The Bad, and The "Why Is Nothing Scraping?"
Look, Prometheus scrapes metrics every 15 seconds (that's what the stock config sets it to; the built-in default is actually a lazier one minute). Grafana queries Prometheus to make graphs. Alertmanager yells at you when shit breaks. That's the whole thing.
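In config terms, most of that lives in one prometheus.yml. A minimal sketch, assuming a node_exporter and an Alertmanager on their usual ports (the hostnames here are placeholders):

```yaml
# prometheus.yml - minimal sketch; job names and target hostnames are placeholders
global:
  scrape_interval: 15s      # what the stock example config uses; the hard default is 1m
  evaluation_interval: 15s  # how often alerting/recording rules get evaluated

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]      # Prometheus scraping itself

  - job_name: node
    static_configs:
      - targets: ["node-exporter:9100"]  # hypothetical node_exporter instance

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"] # where firing alerts get shipped
```

Grafana isn't in this file at all; you point it at Prometheus as a data source from Grafana's own UI.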
The reality: you'll spend the first week figuring out why Prometheus isn't scraping anything, the second week wondering why Grafana shows no data, and the third week getting alerts for things that aren't actually broken.
What Each Tool Actually Does
Prometheus is basically a time-series database that pulls ("scrapes") metrics over HTTP from anything that exposes them, on whatever interval you set. Recent versions supposedly fixed the UI (it's still ugly) and added UTF-8 support (because apparently someone was using emojis in metric names).
Grafana makes the ugly metrics look pretty. Latest version added some dashboard advisor that'll probably tell you your dashboards suck. Fair enough.
Alertmanager 0.28.1 handles notifications. It groups alerts so you don't get 500 Slack messages when one server dies and takes everything with it. Sometimes this works too well and you miss actual emergencies.
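The grouping behavior lives in the route tree of alertmanager.yml. A rough sketch, assuming a single Slack receiver (the webhook URL and channel are placeholders):

```yaml
# alertmanager.yml - grouping sketch; the receiver details are placeholders
route:
  receiver: team-slack
  group_by: ["alertname", "cluster"]  # alerts sharing these labels collapse into one notification
  group_wait: 30s       # how long to wait before sending the first notification for a new group
  group_interval: 5m    # how long to wait before notifying about new alerts added to the group
  repeat_interval: 4h   # how often to re-send while the alert keeps firing

receivers:
  - name: team-slack
    slack_configs:
      - api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder webhook
        channel: "#alerts"
```

Crank group_interval and repeat_interval up too far and you've built the "works too well" failure mode yourself.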
The Data Flow (When It Works)
Here's what happens when everything's actually working:
- Prometheus scrapes targets - usually breaks on Docker networking
- Stores metrics in TSDB - fills up your disk if you're not careful
- Grafana queries Prometheus - times out if you write terrible PromQL
- Alert rules fire - usually at 3am on weekends (rule sketch after this list)
- Alertmanager routes notifications - to the wrong Slack channel
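Since the alert-rules step is where most people get lost: a rule file is just more YAML that Prometheus evaluates on a schedule. A sketch of the classic "instance down" alert (the file name, threshold, and labels are arbitrary):

```yaml
# rules/alerts.yml - basic alerting rule sketch; names, labels, and thresholds are arbitrary
groups:
  - name: availability
    rules:
      - alert: InstanceDown
        expr: up == 0      # 'up' is 1 when the last scrape of a target succeeded, 0 when it failed
        for: 5m            # the condition must hold this long before the alert actually fires
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.instance }} has been unreachable for 5 minutes"
```

Wire it in with a rule_files entry in prometheus.yml, or it will sit there doing nothing.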
Most common issue? Prometheus can't reach your targets because of some Docker networking bullshit. You'll see `context deadline exceeded` errors in the logs that tell you nothing useful. Spend a day fighting with `docker network inspect` before you blame the config.
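Nine times out of ten the actual fix is putting Prometheus on the same Docker network as whatever it's scraping, and using the service name instead of localhost. A docker-compose sketch with made-up service names:

```yaml
# docker-compose.yml - networking sketch; service, image, and network names are made up
services:
  prometheus:
    image: prom/prometheus
    networks: [monitoring]
    ports: ["9090:9090"]

  my-app:
    image: my-app:latest        # hypothetical app exposing /metrics on port 8080
    networks: [monitoring]

networks:
  monitoring: {}
```

Then scrape `my-app:8080` in prometheus.yml, not `localhost:8080` - inside the Prometheus container, localhost is the Prometheus container.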
Why People Use This Stack
The good stuff:
- Free and open source (as in beer and speech)
- Works with everything via exporters
- PromQL is actually decent once you learn it (see the sketch after this list)
- Scales horizontally if you know what you're doing (read: bolting on Thanos or Mimir)
- Kubernetes integration doesn't completely suck
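On the PromQL point: it clicks once you accept that everything is a labeled time series you filter, rate, and aggregate. A recording-rule sketch, assuming the common http_requests_total / http_request_duration_seconds instrumentation (your metric names may differ):

```yaml
# rules/recording.yml - PromQL sketch; assumes standard http_* metrics, which your app may not expose
groups:
  - name: http
    rules:
      # per-job request rate over the last 5 minutes
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # 95th percentile latency, computed from histogram buckets
      - record: job:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
```

Recording rules also happen to be the fix for the "Grafana times out on terrible PromQL" problem: precompute the expensive stuff.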
The painful parts:
- Setup will break in creative ways
- Learning curve is steeper than advertised
- High availability means double the infrastructure costs
- Storage costs add up fast if you keep everything forever (see the retention sketch below)
- Version upgrades break something every single time
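On the storage point: Prometheus keeps 15 days of data by default, and the retention flags are the first knob to turn. A sketch of those flags as docker-compose command arguments (the 30d / 50GB limits are arbitrary):

```yaml
# docker-compose fragment - retention flags sketch; the limits chosen here are arbitrary
services:
  prometheus:
    image: prom/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=30d   # keep 30 days of data (default is 15d)
      - --storage.tsdb.retention.size=50GB  # or cap by disk usage; whichever limit hits first wins
```

Anything you want to keep longer than that means remote storage (Thanos, Mimir, and friends), which circles back to the scaling caveat above.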
When This Makes Sense
This stack shines when you're running cloud-native stuff where services come and go. It's particularly good on Kubernetes because the service discovery actually works most of the time.
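That service discovery is just another scrape config. A sketch that discovers pods and only keeps the ones opting in via the widespread (but not mandatory) prometheus.io/scrape annotation convention:

```yaml
# prometheus.yml fragment - Kubernetes pod discovery sketch;
# assumes the common prometheus.io/scrape annotation convention, which is a convention, not a rule
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # only keep pods that opted in via the annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # carry namespace and pod name along as proper labels
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

If you're on Kubernetes for real, the Prometheus Operator (further down in the resources) replaces hand-written scrape configs with ServiceMonitor objects.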
Don't use this for application logs - use Loki or an ELK stack for that. Prometheus is for metrics, not logs. I learned this the hard way after trying to jam application logs into custom metrics. Took me a week to realize I was an idiot. Don't be me.
Resources that actually help:
- Official Prometheus docs - dry but accurate
- Grafana documentation - better than most
- PromLabs training - costs money but saves time
- Awesome Prometheus - community resources
- Prometheus Operator docs - for Kubernetes masochists
- DevOpsCube Prometheus tutorial - practical examples
- Brian Brazil's blog - from a longtime core Prometheus developer
- Grafana community forums - when you're stuck
- Prometheus community forum - real war stories
- CNCF Prometheus landscape - see what else is out there