What Actually Happens When You Run Prometheus

I've been running Prometheus in production for 4 years. Here's what the docs won't tell you and why you'll want to throw your laptop out the window.


Prometheus is a monitoring system that scrapes HTTP endpoints for metrics every 15 seconds by default. SoundCloud built it in 2012 because existing tools sucked, and it became the second CNCF project after Kubernetes.

The pull model actually works. Instead of agents pushing metrics everywhere, Prometheus visits your /metrics endpoints and grabs everything. This means when your network is fucked, you don't lose historical data sitting in some agent's buffer.

What You Actually Get

Prometheus Server - This is the thing that eats your memory. It scrapes metrics, stores them in local TSDB files, and runs alerts. No clustering bullshit, no distributed complexity. One server, one problem to debug.

Service Discovery - Automatically finds things to monitor in Kubernetes, AWS, or whatever. Works great until you fat-finger a selector and suddenly you're scraping 10,000 pods instead of 10.

PromQL Query Language - Like SQL but designed by someone who actually uses time series data. Learning curve is steep but once you get it, you can query anything:

## Your app's error rate (actually useful)
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])

## Memory usage that triggers alerts you care about  
container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.8

Alertmanager - Handles alert routing and deduplication. The config syntax will make you want to quit programming, but it works once you figure it out.


The Memory Problem Everyone Hits

Here's the bullshit the docs skip: Prometheus uses roughly 3KB per time series in memory. Sounds innocent until you realize that http_requests_total{method="GET", handler="/", status="200"} and http_requests_total{method="POST", handler="/", status="200"} are completely different series. And that 3KB? Total fucking lie in production.


I learned this the hard way when our Prometheus went from 2GB to 16GB overnight. Some genius added a user_id label to request metrics. With 50,000 users, that single fucking metric became 50,000 series. The cardinality explosion killed our monitoring right when our payment system started failing. Spent 3 hours debugging payments while our monitoring was dead. Good times.

Reality check: 1 million series = anywhere from 3GB to 20GB RAM depending on how fucked your labels are. Hit 10 million series? Your server dies a slow, painful death swapping to disk. That's when you throw in the towel and migrate to VictoriaMetrics like you should have done months ago.
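
If a label like that ever sneaks in again, you don't have to wait for a code fix. Here's a minimal sketch of dropping it at scrape time with metric_relabel_configs (the job name and the user_id label are stand-ins for whatever your setup uses):

scrape_configs:
  - job_name: 'application'
    static_configs:
      - targets: ['app1:8080']
    metric_relabel_configs:
      # Strip the offending label before samples hit the TSDB.
      # Every distinct value of that label was its own series; now they collapse.
      - action: labeldrop
        regex: user_id

This only saves you going forward, though - the series that already exist stay in memory until they go stale and fall out of the head block.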

What's New in Version 3.0+ (And the 2025 Updates)

Prometheus 3.0 finally dropped in November 2024 after 7 years. The big changes:

  • New UI - Thank fuck, the old one looked like it was designed in 2005
  • UTF-8 support - You can finally use Unicode in metric names (why this took 12 years, I don't know)
  • Better OpenTelemetry - Native OTLP support at /api/v1/otlp/v1/metrics (quick sketch after this list)
  • Remote Write 2.0 - Better compression, metadata support, handles partial writes properly
  • Native Histograms - More efficient than classic histograms, but still experimental
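
If you want to poke at that OTLP path, a rough sketch is below. The flag name matches my reading of the 3.0 docs; on 2.x it lived behind a feature flag instead, so check your version before copying this:

# 3.0: the OTLP receiver is off by default, turn it on explicitly
./prometheus --config.file=prometheus.yml --web.enable-otlp-receiver

# 2.x equivalent was a feature flag:
#   ./prometheus --enable-feature=otlp-write-receiver

# Then point your OpenTelemetry Collector's otlphttp exporter at:
#   http://<prometheus-host>:9090/api/v1/otlp/v1/metrics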


The upgrade from 2.x is "mostly painless" according to the docs. Reality: we broke 3 recording rules and our main SLI dashboard during the 3.0 upgrade. Test EVERYTHING first. Native histograms are still experimental in 3.0, so expect some changes as they mature.

Prometheus vs The Monitoring Landscape (Real Talk)

| Feature | Prometheus | InfluxDB | Datadog | Nagios | New Relic |
|---------|------------|----------|---------|--------|-----------|
| Setup Time | 30 min if you know YAML | 2 hours if clustering works | 5 min + credit card | 3 days of pain | 10 min + $500/month |
| Monthly Cost | $0 + your sanity | $0-$3,000+ (clustering = $$) | $15-$500 per host | $0 + therapy | $100-$2,000+ per server |
| Memory Usage | 3KB per series (will surprise you) | Reasonable until you scale | Their problem | Minimal | Their problem |
| Query Language | PromQL (hard to learn, worth it) | SQL-ish but weird | Point and click | Shell scripts from hell | GUI only |
| When It Dies | Lose recent data only | Entire cluster can shit the bed | Your monitoring keeps working | Everything stops | Your monitoring keeps working |
| Kubernetes | Made for each other | Works but why? | Good but expensive | Fuck no | Works fine |
| Learning Curve | Steep but logical | Medium | Easy | Masochistic | Easy |
| Real Production Use | Every serious K8s shop | Time series nerds | Enterprises with money | Legacy infrastructure | App performance monitoring |

How to Actually Deploy Prometheus (And What Goes Wrong)

Docker - Good for Testing, Don't Use in Prod

## This gets you started in 30 seconds
docker run -p 9090:9090 prom/prometheus

Works great until the container gets recreated and you lose all your metrics like a fucking amateur. For anything resembling production, mount a volume and config file:

docker run -p 9090:9090 \
  -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
  -v prometheus-data:/prometheus \
  prom/prometheus

Docker gotcha: Default retention is 15 days, but they don't mention that means 15 days of EVERYTHING you scrape. I watched a 20GB drive fill up in 3 days because someone enabled debug metrics on all pods. Set --storage.tsdb.retention.time=7d unless you enjoy 3AM disk space alerts.
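
Something like this is what I'd actually run. Note that passing any flags to the prom/prometheus image replaces its default command, so you have to restate --config.file and the storage path yourself:

# Flags after the image name go straight to the Prometheus binary
docker run -p 9090:9090 \
  -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
  -v prometheus-data:/prometheus \
  prom/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/prometheus \
  --storage.tsdb.retention.time=7d \
  --storage.tsdb.retention.size=10GB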

Kubernetes - Use the Operator or Suffer

The Prometheus Operator is the only sane way to run Prometheus in K8s. Manual YAML deployments become unmanageable fast.

## Install via Helm (easiest path)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack

What actually happens: The Operator creates ServiceMonitors that magically discover your services. Works beautifully until you hit 500+ services and your Prometheus starts choking on service discovery changes. Then it becomes a stuttering mess.

Kubernetes gotcha: The default scrape config will try to monitor everything in your cluster. This includes test pods, CI runners, and other junk. You'll hit memory limits fast. Use `serviceMonitorSelector` to be selective:

serviceMonitorSelector:
  matchLabels:
    prometheus: main

Binary - For When You Need Control

Download, extract, run. Simple until you need systemd, proper permissions, and security.

## Basic systemd service file - save to /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus
After=network-online.target

[Service]
Type=simple  
User=prometheus
Group=prometheus
ExecStart=/opt/prometheus/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/ \
  --storage.tsdb.retention.time=15d \
  --web.console.libraries=/opt/prometheus/console_libraries \
  --web.console.templates=/opt/prometheus/consoles \
  --web.listen-address=0.0.0.0:9090 \
  --web.external-url=http://monitoring.yourdomain.com/
Restart=always

[Install]
WantedBy=multi-user.target

Binary gotcha: Prometheus runs as whatever user launches it, which for a quick manual install usually means root. Create a prometheus user or you're asking for security problems.
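
A rough version of that setup, assuming the paths from the unit file above (adjust the nologin path for your distro):

# Locked-down system user with no shell and no home directory
sudo useradd --system --no-create-home --shell /usr/sbin/nologin prometheus

# Config and data dirs belong to prometheus; the binary can stay root-owned
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus

sudo systemctl daemon-reload
sudo systemctl enable --now prometheus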

Configuration Reality Check

Here's a config that actually works in production:

global:
  scrape_interval: 30s  # Don't use 15s unless you hate your memory
  evaluation_interval: 30s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['server1:9100', 'server2:9100']
    scrape_interval: 60s  # System metrics don't need 30s resolution

  - job_name: 'application'
    static_configs:
      - targets: ['app1:8080', 'app2:8080']
    metrics_path: '/actuator/prometheus'  # Spring Boot default
    scrape_interval: 15s  # App metrics matter more

## Retention is NOT a config-file setting - it lives on the command line.
## Don't forget these flags or your disk dies:
##   --storage.tsdb.retention.time=7d
##   --storage.tsdb.retention.size=10GB

Configuration gotcha: Every unique label combination creates a new time series. I've seen people add instance_id or request_id labels and accidentally create millions of series. Your Prometheus will OOM.

Memory Planning (The Part That Kills Everyone)

That "3KB per series" rule is complete bullshit if you have messy labels. After watching servers die because of this lie, here's what actually happens in prod:

  • 100k series: 2-4GB RAM
  • 1M series: 8-16GB RAM
  • 10M series: 64-128GB RAM (you need VictoriaMetrics at this point)

Check your cardinality before it kills you:

## Top metrics by series count
topk(10, count by (__name__) ({__name__=~".+"}))

## Total series count
prometheus_tsdb_head_series
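
If you'd rather not write PromQL at all, recent versions also expose cardinality stats over the status API. A sketch (verify the JSON field names on your version before scripting against it):

# Top metrics by series count, plus head stats, straight from the TSDB status endpoint
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName[:10]'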


What Breaks in Production

Memory issues: Cardinality explosion is the #1 killer. One bad label and you're fucked. Monitor `prometheus_tsdb_head_series` and set alerts.

Disk space: Default retention fills disks fast. Set `--storage.tsdb.retention.size` as a safety net.

Service discovery choking: With 500+ pods, Kubernetes service discovery turns into a fucking crawl. During our biggest deploy ever (Black Friday prep), service discovery took 10 minutes to catch up. Our monitoring went completely dark right when we needed it most. I spent 45 minutes refreshing Grafana wondering if our sales were tanking or if monitoring was just dead. Enable `--web.enable-lifecycle` and reload configs via the management API instead of restarts.
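
The reload-instead-of-restart flow looks like this, assuming you started Prometheus with --web.enable-lifecycle:

# Validate the config first, then hot-reload it without a restart
promtool check config /etc/prometheus/prometheus.yml
curl -X POST http://localhost:9090/-/reload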

Alert fatigue: Alertmanager default config spams everything. Group by cluster/service and set proper repeat intervals.

High Availability (There Isn't Any)

Prometheus doesn't cluster. Run multiple identical instances and accept that you might lose some samples. Alertmanager handles deduplication.

The HA setup that works:

  • 2+ Prometheus servers with identical configs
  • Shared Alertmanager cluster (3 nodes minimum for true HA)
  • Load balancer for queries (or just pick one server)
  • Accept that you'll lose ~1% of samples during failover
  • 3.0 improvement: Service discovery is more resilient during failovers compared to 2.x
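
One detail that makes the identical-instances setup livable later: tag each replica with an external label so downstream storage (Thanos, remote write) can deduplicate, and strip that label from alerts so Alertmanager still treats the pair as one source. A sketch, with label names as a convention rather than anything Prometheus enforces:

global:
  external_labels:
    cluster: prod
    replica: prometheus-a        # set to prometheus-b on the second instance

alerting:
  # Drop the replica label from outgoing alerts so Alertmanager dedupes the pair
  alert_relabel_configs:
    - action: labeldrop
      regex: replica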

When You Need More Than Prometheus


Long-term storage: Use Thanos or VictoriaMetrics. Don't try to store months of data in vanilla Prometheus.

Scale beyond 10M series: VictoriaMetrics handles 100M+ series on reasonable hardware.

Multi-tenancy: Thanos or Cortex if you need proper tenant isolation.
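
Wiring any of these up is mostly a remote_write block. The URL below is the VictoriaMetrics single-node convention; Thanos Receive or a clustered setup will look different:

remote_write:
  # Ship a copy of every sample to long-term storage; local TSDB stays for fast recent queries
  - url: http://victoria-metrics:8428/api/v1/write
    queue_config:
      max_samples_per_send: 10000   # fewer, larger batches = less HTTP overhead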

Bottom line: Most companies can run Prometheus + Grafana + Thanos for years without issues. The nightmare scenarios happen when someone ignores the memory warnings or adds user_id labels like a maniac. I've debugged this shit at 3AM more times than I want to remember - just respect the cardinality limits and you'll sleep better.

Questions You'll Actually Ask (And Answers That Work)

Q

Why Does My Prometheus Use 16GB RAM?

A

You have too many fucking time series.

Probably some asshole added a user ID to a counter. Run this query to see what's eating memory:

topk(10, count by (__name__) ({__name__=~".+"}))

Common cardinality killers:

  • User IDs in labels (user_id="12345")
  • Request IDs in labels (request_id="uuid-here")
  • IP addresses as label values
  • Timestamp labels (seriously, don't do this)

Fix: Remove high-cardinality labels or use recording rules to pre-aggregate.
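
If you go the recording-rule route, a minimal sketch looks like this (rule and metric names are illustrative; the point is that dashboards query the pre-aggregated series instead of the raw one):

groups:
  - name: cardinality-relief
    rules:
      # Collapse per-handler / per-user detail down to what dashboards actually need
      - record: job:http_requests:rate5m
        expr: sum by (job, status) (rate(http_requests_total[5m]))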

Q

Why Did My Prometheus Stop Scraping at 2AM?

A

Murphy's law of monitoring: it always breaks when you're sleeping.

First move: journalctl -u prometheus -f to see what died. 90% of the time it's one of these classics:

  1. Out of memory
    • SIGKILL in logs, add more RAM or reduce cardinality
  2. Out of disk space
    • Set --storage.tsdb.retention.size=10GB
  3. Service discovery timeout
    • Kubernetes service discovery shits the bed with 500+ pods, set scrape_timeout to 30s or higher
  4. DNS resolution failed
    • Use IP addresses for critical targets
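
A quick triage sketch for that 2AM page, assuming systemd and the data path from the unit file earlier:

# Was it the OOM killer or something else that died?
journalctl -u prometheus --since "1 hour ago" | grep -Ei 'oom|kill|panic'

# Is the disk full?
df -h /var/lib/prometheus
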
Q

How Do I Stop Getting Alerts for Shit I Don't Care About?

A

Edit your Alertmanager config.

Group alerts and add proper inhibition rules:

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'slack-critical'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']

Pro tip: Use amtool to test your routing before deploying:

amtool config routes test --config.file=alertmanager.yml

Q

Why Does Kubernetes Service Discovery Keep Breaking?

A

The Prometheus Operator is probably fighting with manual config.

Pick one approach:

Option 1, full Operator (recommended):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics

Option 2, static config (when the Operator is broken):

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

Q

My PromQL Query Returns Nothing, What's Wrong?

A

Check these in order:

  1. Metric exists: {__name__=~"your_metric.*"}
  2. Time range is right: Use offset if data is delayed
  3. Labels match exactly: Copy-paste from the metric, don't guess
  4. Rate vs increase: Use rate() for per-second rates, increase() for totals

Common mistakes:

# Wrong - rate of a counter without a time window
rate(http_requests_total)

# Right - rate over a time window
rate(http_requests_total[5m])
Q

Can I Use Prometheus for Logs?

A

No.

Prometheus stores numbers, not text. Use:

  • ELK Stack if you hate yourself
  • Grafana Loki if you want something that works
  • Fluentd + whatever if you're feeling adventurous

Exception: You can create metrics FROM logs (error rates, response times), but don't store the actual log lines.

Q

How Do I Backup Prometheus Data?

A

Easy way: Use file system snapshots if you're on AWS/GCP.

Hard way: Use the snapshot API (only available if you started Prometheus with --web.enable-admin-api):

# Creates a snapshot under data/snapshots/
curl -XPOST localhost:9090/api/v1/admin/tsdb/snapshot

Just use Thanos. Manual snapshots are a pain in the ass and you'll forget to do them when it matters. Object storage backup or gtfo.

Q

Why Did My Recording Rule Break After Upgrading?

A

Prometheus 3.0+ changed some PromQL parsing edge cases.

Check your rules before deploying:

# Test a recording rule expression against a running server
promtool query instant http://localhost:9090 'your_recording_rule_expression'

# Test the entire rules file
promtool check rules /path/to/rules.yml

Common 3.0 breaking changes (learned these the hard way):

  • Empty label matching broke our recording rules in prod
  • Regex .* patterns started matching differently, killed half our service discovery
  • UTF-8 support broke metric names with weird characters (looking at you, Java JMX metrics)
  • Native histogram functions are still experimental, so expect changes
  • Some PromQL edge cases parse differently now

Q

My Grafana Dashboard Shows No Data

A

Debug checklist:

  1. Data source connected: Check Grafana data source page
  2. Time range: Grafana defaults to last 6 hours, your data might be older
  3. PromQL query works: Test the query in Prometheus UI first
  4. Labels exist: Grafana template variables must match actual label values

Quick fix: Change time range to "Last 24 hours" and see if data appears.

Q

Should I Upgrade to Prometheus 3.0?

A

Yes, but not immediately:

  • Wait 6 months for the community to find the rough edges

  • Test thoroughly in staging; the PromQL changes are subtle but breaking

  • Native histograms are cool but still experimental

  • UTF-8 support is actually useful if you have international teams

Stay on 2.x if:

  • You're running critical production workloads and can't afford downtime

  • Your team doesn't have time to rewrite broken recording rules

  • You use external tools that haven't updated for 3.0 changes yet

Upgrade path: Test your recording rules first with promtool. Seriously. We learned this the hard way.

Q

How Do I Know If I Need VictoriaMetrics?

A

You need it when:

  • Prometheus uses >32GB RAM
  • You have >10 million active series
  • Queries take >30 seconds regularly
  • You need years of retention without object storage complexity

Migration is easy: VictoriaMetrics accepts Prometheus remote write and has the same query API.
