What Actually Happens When You Run Prometheus

I've been running Prometheus in production for 4 years. Here's what the docs won't tell you and why you'll want to throw your laptop out the window.


Prometheus is a monitoring system that scrapes HTTP endpoints for metrics every 15 seconds by default. SoundCloud built it in 2012 because existing tools sucked, and it became the second CNCF project after Kubernetes.

The pull model actually works. Instead of agents pushing metrics everywhere, Prometheus visits your /metrics endpoints and grabs everything. This means when your network is fucked, you don't lose historical data sitting in some agent's buffer.

What You Actually Get

Prometheus Server - This is the thing that eats your memory. It scrapes metrics, stores them in local TSDB files, and runs alerts. No clustering bullshit, no distributed complexity. One server, one problem to debug.

Service Discovery - Automatically finds things to monitor in Kubernetes, AWS, or whatever. Works great until you fat-finger a selector and suddenly you're scraping 10,000 pods instead of 10.

PromQL Query Language - Like SQL but designed by someone who actually uses time series data. Learning curve is steep but once you get it, you can query anything:

## Your app's error rate (actually useful)
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])

## Memory usage that triggers alerts you care about  
container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.8

Alertmanager - Handles alert routing and deduplication. The config syntax will make you want to quit programming, but it works once you figure it out.


The Memory Problem Everyone Hits

Here's the bullshit the docs skip: Prometheus uses roughly 3KB per time series in memory. Sounds innocent until you realize that http_requests_total{method="GET", handler="/", status="200"} and http_requests_total{method="POST", handler="/", status="200"} are completely different series. And that 3KB? Total fucking lie in production.


I learned this the hard way when our Prometheus went from 2GB to 16GB overnight. Some genius added a user_id label to request metrics. With 50,000 users, that single fucking metric became 50,000 series. The cardinality explosion killed our monitoring right when our payment system started failing. Spent 3 hours debugging payments while our monitoring was dead. Good times.

Reality check: 1 million series = anywhere from 3GB to 20GB RAM depending on how fucked your labels are. Hit 10 million series? Your server dies a slow, painful death swapping to disk. That's when you throw in the towel and migrate to VictoriaMetrics like you should have done months ago.
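
If a label like that ever sneaks in again, you don't have to wait for a code fix. Here's a minimal sketch of dropping it at scrape time with metric_relabel_configs (the job name and the user_id label are stand-ins for whatever your setup uses):

scrape_configs:
  - job_name: 'application'
    static_configs:
      - targets: ['app1:8080']
    metric_relabel_configs:
      # Strip the offending label before samples hit the TSDB.
      # Every distinct value of that label was its own series; now they collapse.
      - action: labeldrop
        regex: user_id

This only saves you going forward, though - the series that already exist stay in memory until they go stale and fall out of the head block.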

What's New in Version 3.0+ (And the 2025 Updates)

Prometheus 3.0 finally dropped in November 2024 after 7 years. The big changes:

  • New UI - Thank fuck, the old one looked like it was designed in 2005
  • UTF-8 support - You can finally use Unicode in metric names (why this took 12 years, I don't know)
  • Better OpenTelemetry - Native OTLP support at /api/v1/otlp/v1/metrics (quick sketch after this list)
  • Remote Write 2.0 - Better compression, metadata support, handles partial writes properly
  • Native Histograms - More efficient than classic histograms, but still experimental
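
If you want to poke at that OTLP path, a rough sketch is below. The flag name matches my reading of the 3.0 docs; on 2.x it lived behind a feature flag instead, so check your version before copying this:

# 3.0: the OTLP receiver is off by default, turn it on explicitly
./prometheus --config.file=prometheus.yml --web.enable-otlp-receiver

# 2.x equivalent was a feature flag:
#   ./prometheus --enable-feature=otlp-write-receiver

# Then point your OpenTelemetry Collector's otlphttp exporter at:
#   http://<prometheus-host>:9090/api/v1/otlp/v1/metrics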


The upgrade from 2.x is "mostly painless" according to the docs. Reality: we broke 3 recording rules and our main SLI dashboard during the 3.0 upgrade. Test EVERYTHING first. Native histograms are still experimental in 3.0, so expect some changes as they mature.

Prometheus vs The Monitoring Landscape (Real Talk)

| Feature | Prometheus | InfluxDB | Datadog | Nagios | New Relic |
|---------|------------|----------|---------|--------|-----------|
| Setup Time | 30 min if you know YAML | 2 hours if clustering works | 5 min + credit card | 3 days of pain | 10 min + $500/month |
| Monthly Cost | $0 + your sanity | $0-$3,000+ (clustering = $$) | $15-$500 per host | $0 + therapy | $100-$2,000+ per server |
| Memory Usage | 3KB per series (will surprise you) | Reasonable until you scale | Their problem | Minimal | Their problem |
| Query Language | PromQL (hard to learn, worth it) | SQL-ish but weird | Point and click | Shell scripts from hell | GUI only |
| When It Dies | Lose recent data only | Entire cluster can shit the bed | Your monitoring keeps working | Everything stops | Your monitoring keeps working |
| Kubernetes | Made for each other | Works but why? | Good but expensive | Fuck no | Works fine |
| Learning Curve | Steep but logical | Medium | Easy | Masochistic | Easy |
| Real Production Use | Every serious K8s shop | Time series nerds | Enterprises with money | Legacy infrastructure | App performance monitoring |

How to Actually Deploy Prometheus (And What Goes Wrong)

Docker - Good for Testing, Don't Use in Prod

## This gets you started in 30 seconds
docker run -p 9090:9090 prom/prometheus

Works great until the container gets recreated and you lose all your metrics like a fucking amateur. For anything resembling production, mount a volume and config file:

docker run -p 9090:9090 \
  -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
  -v prometheus-data:/prometheus \
  prom/prometheus

Docker gotcha: Default retention is 15 days, but they don't mention that means 15 days of EVERYTHING you scrape. I watched a 20GB drive fill up in 3 days because someone enabled debug metrics on all pods. Set --storage.tsdb.retention.time=7d unless you enjoy 3AM disk space alerts.
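
Something like this is what I'd actually run. Note that passing any flags to the prom/prometheus image replaces its default command, so you have to restate --config.file and the storage path yourself:

# Flags after the image name go straight to the Prometheus binary
docker run -p 9090:9090 \
  -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
  -v prometheus-data:/prometheus \
  prom/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/prometheus \
  --storage.tsdb.retention.time=7d \
  --storage.tsdb.retention.size=10GB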

Kubernetes - Use the Operator or Suffer

The Prometheus Operator is the only sane way to run Prometheus in K8s. Manual YAML deployments become unmanageable fast.

## Install via Helm (easiest path)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack

What actually happens: The Operator creates ServiceMonitors that magically discover your services. Works beautifully until you hit 500+ services and your Prometheus starts choking on service discovery changes. Then it becomes a stuttering mess.

Kubernetes gotcha: The default scrape config will try to monitor everything in your cluster. This includes test pods, CI runners, and other junk. You'll hit memory limits fast. Use `serviceMonitorSelector` to be selective:

serviceMonitorSelector:
  matchLabels:
    prometheus: main

Binary - For When You Need Control

Download, extract, run. Simple until you need systemd, proper permissions, and security.

## Basic systemd service file - save to /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus
After=network-online.target

[Service]
Type=simple  
User=prometheus
Group=prometheus
ExecStart=/opt/prometheus/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/ \
  --storage.tsdb.retention.time=15d \
  --web.console.libraries=/opt/prometheus/console_libraries \
  --web.console.templates=/opt/prometheus/consoles \
  --web.listen-address=0.0.0.0:9090 \
  --web.external-url=http://monitoring.yourdomain.com/
Restart=always

[Install]
WantedBy=multi-user.target

Binary gotcha: Prometheus runs as whatever user launches it, which for a quick manual install usually means root. Create a prometheus user or you're asking for security problems.
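
A rough version of that setup, assuming the paths from the unit file above (adjust the nologin path for your distro):

# Locked-down system user with no shell and no home directory
sudo useradd --system --no-create-home --shell /usr/sbin/nologin prometheus

# Config and data dirs belong to prometheus; the binary can stay root-owned
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus

sudo systemctl daemon-reload
sudo systemctl enable --now prometheus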

Configuration Reality Check

Here's a config that actually works in production:

global:
  scrape_interval: 30s  # Don't use 15s unless you hate your memory
  evaluation_interval: 30s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['server1:9100', 'server2:9100']
    scrape_interval: 60s  # System metrics don't need 30s resolution

  - job_name: 'application'
    static_configs:
      - targets: ['app1:8080', 'app2:8080']
    metrics_path: '/actuator/prometheus'  # Spring Boot default
    scrape_interval: 15s  # App metrics matter more

## Retention is NOT a config-file setting - it lives on the command line.
## Don't forget these flags or your disk dies:
##   --storage.tsdb.retention.time=7d
##   --storage.tsdb.retention.size=10GB

Configuration gotcha: Every unique label combination creates a new time series. I've seen people add instance_id or request_id labels and accidentally create millions of series. Your Prometheus will OOM.

Memory Planning (The Part That Kills Everyone)

That "3KB per series" rule is complete bullshit if you have messy labels. After watching servers die because of this lie, here's what actually happens in prod:

  • 100k series: 2-4GB RAM
  • 1M series: 8-16GB RAM
  • 10M series: 64-128GB RAM (you need VictoriaMetrics at this point)

Check your cardinality before it kills you:

## Top metrics by series count
topk(10, count by (__name__) ({__name__=~".+"}))

## Total series count
prometheus_tsdb_head_series
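
If you'd rather not write PromQL at all, recent versions also expose cardinality stats over the status API. A sketch (verify the JSON field names on your version before scripting against it):

# Top metrics by series count, plus head stats, straight from the TSDB status endpoint
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName[:10]'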


What Breaks in Production

Memory issues: Cardinality explosion is the #1 killer. One bad label and you're fucked. Monitor `prometheus_tsdb_head_series` and set alerts.

Disk space: Default retention fills disks fast. Set `--storage.tsdb.retention.size` as a safety net.

Service discovery choking: With 500+ pods, Kubernetes service discovery turns into a fucking crawl. During our biggest deploy ever (Black Friday prep), service discovery took 10 minutes to catch up. Our monitoring went completely dark right when we needed it most. I spent 45 minutes refreshing Grafana wondering if our sales were tanking or if monitoring was just dead. Enable `--web.enable-lifecycle` and reload configs via the management API instead of restarts.
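
The reload-instead-of-restart flow looks like this, assuming you started Prometheus with --web.enable-lifecycle:

# Validate the config first, then hot-reload it without a restart
promtool check config /etc/prometheus/prometheus.yml
curl -X POST http://localhost:9090/-/reload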

Alert fatigue: Alertmanager default config spams everything. Group by cluster/service and set proper repeat intervals.

High Availability (There Isn't Any)

Prometheus doesn't cluster. Run multiple identical instances and accept that you might lose some samples. Alertmanager handles deduplication.

The HA setup that works:

  • 2+ Prometheus servers with identical configs
  • Shared Alertmanager cluster (3 nodes minimum for true HA)
  • Load balancer for queries (or just pick one server)
  • Accept that you'll lose ~1% of samples during failover
  • 3.0 improvement: Service discovery is more resilient during failovers compared to 2.x
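
One detail that makes the identical-instances setup livable later: tag each replica with an external label so downstream storage (Thanos, remote write) can deduplicate, and strip that label from alerts so Alertmanager still treats the pair as one source. A sketch, with label names as a convention rather than anything Prometheus enforces:

global:
  external_labels:
    cluster: prod
    replica: prometheus-a        # set to prometheus-b on the second instance

alerting:
  # Drop the replica label from outgoing alerts so Alertmanager dedupes the pair
  alert_relabel_configs:
    - action: labeldrop
      regex: replica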

When You Need More Than Prometheus


Long-term storage: Use Thanos or VictoriaMetrics. Don't try to store months of data in vanilla Prometheus.

Scale beyond 10M series: VictoriaMetrics handles 100M+ series on reasonable hardware.

Multi-tenancy: Thanos or Cortex if you need proper tenant isolation.
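
Wiring any of these up is mostly a remote_write block. The URL below is the VictoriaMetrics single-node convention; Thanos Receive or a clustered setup will look different:

remote_write:
  # Ship a copy of every sample to long-term storage; local TSDB stays for fast recent queries
  - url: http://victoria-metrics:8428/api/v1/write
    queue_config:
      max_samples_per_send: 10000   # fewer, larger batches = less HTTP overhead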

Bottom line: Most companies can run Prometheus + Grafana + Thanos for years without issues. The nightmare scenarios happen when someone ignores the memory warnings or adds user_id labels like a maniac. I've debugged this shit at 3AM more times than I want to remember - just respect the cardinality limits and you'll sleep better.

Questions You'll Actually Ask (And Answers That Work)

Q

Why Does My Prometheus Use 16GB RAM?

A

You have too many fucking time series.

Probably some asshole added a user ID to a counter. Run this query to see what's eating memory:

topk(10, count by (__name__) ({__name__=~".+"}))

Common cardinality killers:

  • User IDs in labels (user_id="12345")
  • Request IDs in labels (request_id="uuid-here")
  • IP addresses as label values
  • Timestamp labels (seriously, don't do this)

Fix: Remove high-cardinality labels or use recording rules to pre-aggregate.
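
If you go the recording-rule route, a minimal sketch looks like this (rule and metric names are illustrative; the point is that dashboards query the pre-aggregated series instead of the raw one):

groups:
  - name: cardinality-relief
    rules:
      # Collapse per-handler / per-user detail down to what dashboards actually need
      - record: job:http_requests:rate5m
        expr: sum by (job, status) (rate(http_requests_total[5m]))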

Q

Why Did My Prometheus Stop Scraping at 2AM?

A

Murphy's law of monitoring: it always breaks when you're sleeping.

First move: journalctl -u prometheus -f to see what died. 90% of the time it's one of these classics:

  1. Out of memory
    • SIGKILL in logs, add more RAM or reduce cardinality
  2. Out of disk space
    • Set --storage.tsdb.retention.size=10GB
  3. Service discovery timeout
    • Kubernetes service discovery shits the bed with 500+ pods, set scrape_timeout to 30s or higher
  4. DNS resolution failed
    • Use IP addresses for critical targets
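
A quick triage sketch for that 2AM page, assuming systemd and the data path from the unit file earlier:

# Was it the OOM killer or something else that died?
journalctl -u prometheus --since "1 hour ago" | grep -Ei 'oom|kill|panic'

# Is the disk full?
df -h /var/lib/prometheus
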
Q

How Do I Stop Getting Alerts for Shit I Don't Care About?

A

Edit your Alertmanager config.

Group alerts and add proper inhibition rules:

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'slack-critical'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']

Pro tip: Use amtool to test your routing before deploying:

amtool config routes test --config.file=alertmanager.yml

Q

Why Does Kubernetes Service Discovery Keep Breaking?

A

The Prometheus Operator is probably fighting with manual config.

Pick one approach:

Option 1, full Operator (recommended):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics

Option 2, static config (when the Operator is broken):

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

Q

My PromQL Query Returns Nothing, What's Wrong?

A

Check these in order:

  1. Metric exists: {__name__=~"your_metric.*"}
  2. Time range is right: Use offset if data is delayed
  3. Labels match exactly: Copy-paste from the metric, don't guess
  4. Rate vs increase: Use rate() for per-second rates, increase() for totals

Common mistakes:

# Wrong - rate of a counter without a time window
rate(http_requests_total)

# Right - rate over a time window
rate(http_requests_total[5m])
Q

Can I Use Prometheus for Logs?

A

No.

Prometheus stores numbers, not text. Use:

  • ELK Stack if you hate yourself
  • Grafana Loki if you want something that works
  • Fluentd + whatever if you're feeling adventurous

Exception: You can create metrics FROM logs (error rates, response times), but don't store the actual log lines.

Q

How Do I Backup Prometheus Data?

A

Easy way: Use file system snapshots if you're on AWS/GCP.

Hard way: Use the snapshot API (only available if you started Prometheus with --web.enable-admin-api):

# Creates a snapshot under data/snapshots/
curl -XPOST localhost:9090/api/v1/admin/tsdb/snapshot

Just use Thanos. Manual snapshots are a pain in the ass and you'll forget to do them when it matters. Object storage backup or gtfo.

Q

Why Did My Recording Rule Break After Upgrading?

A

Prometheus 3.0+ changed some PromQL parsing edge cases.

Check your rules before deploying:

# Test a recording rule expression against a running server
promtool query instant http://localhost:9090 'your_recording_rule_expression'

# Test the entire rules file
promtool check rules /path/to/rules.yml

Common 3.0 breaking changes (learned these the hard way):

  • Empty label matching broke our recording rules in prod
  • Regex .* patterns started matching differently, killed half our service discovery
  • UTF-8 support broke metric names with weird characters (looking at you, Java JMX metrics)
  • Native histogram functions are still experimental, so expect changes
  • Some PromQL edge cases parse differently now

Q

My Grafana Dashboard Shows No Data

A

Debug checklist:

  1. Data source connected: Check Grafana data source page
  2. Time range: Grafana defaults to last 6 hours, your data might be older
  3. PromQL query works: Test the query in Prometheus UI first
  4. Labels exist: Grafana template variables must match actual label values

Quick fix: Change time range to "Last 24 hours" and see if data appears.

Q

Should I Upgrade to Prometheus 3.0?

A

Yes, but not immediately:

  • Wait 6 months for the community to find the rough edges

  • Test thoroughly in staging; the PromQL changes are subtle but breaking

  • Native histograms are cool but still experimental

  • UTF-8 support is actually useful if you have international teams

Stay on 2.x if:

  • You're running critical production workloads and can't afford downtime

  • Your team doesn't have time to rewrite broken recording rules

  • You use external tools that haven't updated for 3.0 changes yet

Upgrade path: Test your recording rules first with promtool. Seriously. We learned this the hard way.

Q

How Do I Know If I Need VictoriaMetrics?

A

You need it when:

  • Prometheus uses >32GB RAM
  • You have >10 million active series
  • Queries take >30 seconds regularly
  • You need years of retention without object storage complexity

Migration is easy: VictoriaMetrics accepts Prometheus remote write and has the same query API.
