Prometheus Monitoring: AI-Optimized Technical Reference
System Overview
Core Function: Pull-based monitoring system that scrapes HTTP /metrics endpoints, typically every 15-30 seconds (the shipped default scrape interval is 1m; most production setups lower it).
Architecture Benefits:
- Network failures don't cause data loss (unlike push-based systems)
- Single server deployment eliminates distributed complexity
- Local TSDB storage with no clustering dependencies
Critical Performance Specifications
Memory Usage Reality
- Official claim: 3KB per time series
- Production reality: 8-20KB per series with real-world label complexity
- Scaling thresholds:
- 100k series: 2-4GB RAM
- 1M series: 8-16GB RAM
- 10M series: 64-128GB RAM (requires migration to VictoriaMetrics)
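To sanity-check where a running instance sits on this curve, divide resident memory by active series. A rough PromQL sketch, assuming Prometheus scrapes itself under job="prometheus":

# Active series currently in the head block
prometheus_tsdb_head_series{job="prometheus"}
# Approximate bytes per series (includes all other overhead, so treat it as an upper bound)
process_resident_memory_bytes{job="prometheus"} / prometheus_tsdb_head_series{job="prometheus"}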
Cardinality Explosion Failure Mode
Critical failure scenario: Adding high-cardinality labels (user_id, request_id, IP addresses) creates exponential series growth
- Real example: Single metric with user_id label + 50,000 users = 50,000 series
- Consequence: Memory usage increases 10-50x overnight, causing OOM kills
- Detection query:
topk(10, count by (__name__)({__name__=~".+"}))
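Once you know which metric is exploding, a follow-up query shows how many distinct values a suspect label contributes. A sketch where http_requests_total and user_id are placeholders for your own metric and label:

# Distinct user_id values on a single metric (each one is its own series)
count(count by (user_id) (http_requests_total))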
Deployment Configurations
Production-Ready Settings
global:
  scrape_interval: 30s        # 15s causes memory issues
  evaluation_interval: 30s

# Retention is set via command-line flags, not in prometheus.yml:
#   --storage.tsdb.retention.time=7d    # Default 15d fills disks rapidly
#   --storage.tsdb.retention.size=10GB  # Safety net for disk space

scrape_configs:
  - job_name: 'node-exporter'
    scrape_interval: 60s      # System metrics don't need high resolution
  - job_name: 'application'
    scrape_interval: 15s      # App metrics require higher resolution
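If an exporter insists on shipping a high-cardinality label, it can be stripped at scrape time with metric_relabel_configs. A sketch extending the 'application' job above; user_id is a placeholder label name:

  - job_name: 'application'
    scrape_interval: 15s
    metric_relabel_configs:
      # Drop the offending label before samples reach the TSDB
      # (make sure the remaining labels still uniquely identify each series)
      - action: labeldrop
        regex: user_id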
Memory Planning Formula
Required RAM = (Active Series Count × 15KB) + (Ingestion Rate × Retention Days × 2KB)
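Worked example, assuming the ingestion rate is expressed in samples per second: 1 million active series at roughly 100,000 samples/s with 7-day retention gives (1,000,000 × 15KB) + (100,000 × 7 × 2KB) ≈ 15GB + 1.4GB ≈ 16GB, which lands at the top of the 1M-series range listed above.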
Common Failure Modes
Production Breaking Scenarios
Service Discovery Overload
- Trigger: 500+ Kubernetes pods
- Symptom: 10-minute discovery lag during deployments
- Impact: Complete monitoring blindness during critical periods
- Solution: Use `serviceMonitorSelector` for selective discovery
Cardinality Bombs
- Trigger: Labels with user IDs, request IDs, timestamps
- Symptom: Memory usage 10-50x increase overnight
- Impact: OOM kills, monitoring system failure
- Prevention: Monitor the `prometheus_tsdb_head_series` metric (example alert rule below)
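A minimal rule-file sketch for that prevention step; the 1M threshold and 15-minute hold are assumptions to tune against your own memory headroom:

groups:
  - name: prometheus-self-monitoring
    rules:
      - alert: PrometheusSeriesCountHigh
        expr: prometheus_tsdb_head_series > 1000000   # tune to your capacity
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Active series count is approaching memory limits"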
Disk Space Exhaustion
- Trigger: Default 15-day retention with high-volume metrics
- Timeline: 20GB disk fills in 3 days with debug metrics enabled
- Solution: Set both time and size retention limits
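Both limits are command-line flags rather than prometheus.yml settings. A sketch of how they typically appear in a Kubernetes container spec (values mirror the config section above; whichever limit is hit first triggers deletion):

args:
  - --storage.tsdb.retention.time=7d
  - --storage.tsdb.retention.size=10GB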
Kubernetes Deployment Requirements
Operator vs Manual Deployment
- Prometheus Operator: Only viable option for 500+ services
- Manual YAML: Becomes unmanageable beyond 50 services
- Memory allocation: Set limits 2x expected usage for safety
Critical Configuration
serviceMonitorSelector:
  matchLabels:
    prometheus: main   # Prevents scraping test/CI pods
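For context, that selector lives in the spec of the Operator's Prometheus custom resource. A trimmed sketch; the name and namespace are assumptions:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: main
  namespace: monitoring
spec:
  serviceMonitorSelector:
    matchLabels:
      prometheus: main   # only ServiceMonitors carrying this label get scraped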
Version 3.0 Migration Considerations
Breaking Changes (Production Impact)
- PromQL parsing: Edge cases changed, breaking recording rules
- Regex matching: `.` now also matches newlines, so `.*` patterns can behave differently
- UTF-8 support: Metric and label names may now contain non-ASCII characters, which breaks tooling that assumes the old character set
- Native histograms: Still experimental, expect changes
Upgrade Testing Requirements
- Test all recording rules with `promtool check rules`
- Validate PromQL queries in staging environment
- Verify service discovery configurations
- Monitor for metric name parsing issues
High Availability Limitations
No Built-in Clustering
- Architecture: Run 2+ identical instances
- Data loss: Accept 1% sample loss during failover
- Alertmanager: Requires separate 3-node cluster for true HA
- Limitation: No automatic failover or data replication
Scale-Out Decision Points
When to Migrate to VictoriaMetrics
- Prometheus RAM usage >32GB
- Active series count >10 million
- Query response time >30 seconds consistently
- Need for years of retention without object storage
Long-term Storage Solutions
- Thanos: Best for multi-cluster setups, complex configuration
- VictoriaMetrics: Drop-in replacement, 10x memory efficiency
- Cortex: Multi-tenant but extremely complex deployment
Query Performance Guidelines
Efficient PromQL Patterns
# Correct: Rate with time window
rate(http_requests_total[5m])
# Wrong: rate() requires a range vector; this fails with a parse error
rate(http_requests_total)
# Error rate calculation (aggregate both sides so the division isn't matched per-status series)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
Performance Killers
- Queries without time ranges
- High-cardinality aggregations
- Large time windows on instant queries
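For dashboards that hammer the same heavy aggregation, a recording rule precomputes it once per evaluation interval. A sketch reusing the error-rate expression above, named per the level:metric:operations convention:

groups:
  - name: http-aggregations
    rules:
      - record: job:http_requests:error_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m]))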
Alert Configuration Best Practices
Alertmanager Routing
route:
  receiver: 'default'   # required; must match a receiver defined under receivers: (see sketch below)
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
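The route above has to point at a receiver defined in the same file. A minimal sketch; the name is arbitrary and the actual notification integration still needs filling in:

receivers:
  - name: 'default'
    # add slack_configs / pagerduty_configs / webhook_configs here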
Operational Intelligence
Time Investment Requirements
- Initial setup: 30 minutes (Docker) to 3 days (bare metal with security)
- Kubernetes deployment: 2-4 hours with Operator
- Learning curve: 2-4 weeks for PromQL proficiency
- Production debugging: Plan for 3AM memory alerts
Cost Considerations
- Infrastructure: Free software, pay for hardware/cloud resources
- Operational overhead: High cardinality management becomes full-time job
- Migration costs: VictoriaMetrics migration = 1-2 weeks engineering time
Support Quality
- Community: Active, but assumes deep technical knowledge
- Documentation: Good for basics, lacking for production edge cases
- Commercial support: Available through multiple vendors
Hidden Operational Costs
- Cardinality monitoring: Requires dedicated alerting and regular audits
- Memory capacity planning: Must be updated monthly as application grows
- Service discovery tuning: Ongoing maintenance for large Kubernetes clusters
- Recording rule maintenance: Breaking changes require regular validation
Critical Monitoring Metrics
System Health Indicators
# Memory pressure warning
prometheus_tsdb_head_series > 1000000
# Ingestion rate monitoring
rate(prometheus_tsdb_head_samples_appended_total[5m])
# Query performance degradation
histogram_quantile(0.95, rate(prometheus_http_request_duration_seconds_bucket[5m])) > 5
Capacity Planning Queries
# Storage growth rate
delta(prometheus_tsdb_head_series[1h]) * 24 # Daily series growth (head series is a gauge, so delta, not increase)
# Memory usage projection
process_resident_memory_bytes{job="prometheus"} / (1024^3) # GB usage (standard process collector metric; no prometheus_ prefix)
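Disk usage can be projected the same way; assuming a reasonably recent Prometheus, the TSDB reports its persisted block size directly:

# On-disk size of persisted TSDB blocks in GB (the WAL is tracked separately)
prometheus_tsdb_storage_blocks_bytes / (1024^3)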
Integration Dependencies
Required Components
- Node Exporter: Essential for server monitoring, works on all platforms except Windows
- Alertmanager: Required for production alerting, complex configuration
- Grafana: De facto visualization layer, start with dashboard ID 1860
Optional but Recommended
- Blackbox Exporter: HTTP/HTTPS/DNS monitoring, confusing config but powerful
- Service discovery: Kubernetes, AWS, Consul integrations available
- Remote storage: Thanos or VictoriaMetrics for long-term retention
Decision Framework
Choose Prometheus When
- Kubernetes-native monitoring required
- Pull-based model fits infrastructure
- Team can manage cardinality complexity
- Budget constraints favor open source
Consider Alternatives When
- Need >10M series capacity
- Require multi-tenancy
- Team lacks time for operational complexity
- Budget allows for managed services
Migration Triggers
- Memory usage >50% of available system RAM
- Query response time degradation
- Storage costs exceeding managed service pricing
- Operational overhead consuming >20% engineering time
Useful Links for Further Investigation
Resources That Actually Help (With Reality Checks)
Link | Description |
---|---|
Prometheus Official Documentation | Pretty good but sometimes outdated by 6 months. Check GitHub issues when shit doesn't work. |
PromQL Tutorial | Essential reading. PromQL is weird but powerful once you get it. |
Best Practices | Read this twice. Following these prevents 90% of production issues. |
Prometheus Operator | **Most reliable K8s deployment**, but complex. Start with Helm chart. |
Helm Charts | Use `kube-prometheus-stack`. Ignore `prometheus` chart (it's minimal). |
Why Prometheus Uses So Much Memory | Actually explains the 3KB rule and what breaks it. |
Managing High Cardinality | Practical advice for when your Prometheus eats all the RAM. |
Node Exporter | **Essential** for server monitoring. Works on everything except Windows. |
Blackbox Exporter | HTTP/HTTPS/DNS/TCP monitoring. Config is confusing but powerful. |
All Exporters List | Huge list, most are unmaintained. Check GitHub activity first. |
Thanos | **Best for long-term storage**. Complex setup but scales forever. |
VictoriaMetrics | **Easiest performance upgrade**. Drop-in replacement, 10x less memory usage. |
M3DB | Uber's baby. Scales to infinity but setup will crush your soul. Their docs assume you're already an expert. |
Cortex | **Multi-tenant** Prometheus, but the deployment complexity will make you question your career choices. |
Grafana Dashboards for Prometheus | Pre-made dashboards. Start with ID 1860 (Node Exporter). |
Robust Perception Blog | **Best technical content** about Prometheus. Written by core maintainers. |
AWS Managed Prometheus | Good if you're all-in on AWS. Pricing adds up quickly. |
GitHub Issues | Search before posting. Many "bugs" are config issues. |
PromLens | **PromQL query builder** and explainer. Great for learning. |
Query Performance | How to not kill your Prometheus with bad queries. |