Prometheus & Grafana Performance Monitoring: AI-Optimized Reference
Technology Overview
Primary Function: Application performance monitoring and alerting system using open-source tools
Core Components: Prometheus (metrics collection), Grafana (visualization), AlertManager (notifications)
Architecture: Pull-based metrics collection with time-series database storage
Critical Implementation Requirements
Essential Metric Categories
- Latency: P50, P95, P99 response time distributions, not averages; averages hide 5-second timeouts (see the comparison query after this list)
- Throughput: Requests per second, transaction rates, business operation completion
- Error Rates: HTTP status codes, application exceptions, failed business transactions
- Saturation: Resource utilization approaching limits, queue depths, connection pool usage
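A quick way to see the gap called out in the latency item above is to compare the mean against a tail percentile, assuming the standard http_request_duration_seconds histogram is exposed:
# Mean latency: can look healthy even while a few requests time out
sum(rate(http_request_duration_seconds_sum[5m])) / sum(rate(http_request_duration_seconds_count[5m]))
# P99 latency: surfaces the slow tail that the mean hides
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))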
High-Risk Configuration Pitfalls
Cardinality Explosion (System Killer)
# FATAL: Will crash Prometheus server
http_requests{user_id="12345", session_id="abc123"}
# SAFE: Low cardinality approach
http_requests{service="api", method="GET", status="200"}
Impact: RAM consumption explosion, query timeouts, complete monitoring failure
Detection: Monitor series count per metric with topk(10, count by (__name__)({__name__=~".+"}))
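If a high-cardinality label cannot be removed at the instrumentation source, it can be stripped at scrape time with metric_relabel_configs; a minimal sketch, with the job name, target, and label names as illustrative assumptions:
scrape_configs:
  - job_name: "api"
    static_configs:
      - targets: ["api:8080"]
    metric_relabel_configs:
      # Drop unbounded labels before samples reach the TSDB
      - regex: "user_id|session_id"
        action: labeldrop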
Recording Rules Necessity
# Pre-compute expensive calculations to prevent dashboard timeouts
- record: api:error_rate:5m
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
    /
    sum(rate(http_requests_total[5m])) by (service)
Critical: Without recording rules, dashboards become unusable during incidents
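Dashboards and alerts then query the pre-computed series instead of re-evaluating the raw expression on every refresh; the 0.05 threshold below is illustrative:
# Panel or alert expression against the recorded series
api:error_rate:5m > 0.05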
Performance Thresholds and Limits
Storage Requirements
- High resolution: 7 days at 15-second intervals
- Medium resolution: 90 days at 5-minute intervals
- Long-term: 2+ years at 1-hour intervals
- Storage growth: Expect 200GB to 2.4TB expansion in 3 months without proper retention
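A single Prometheus server stores only one resolution, so tiered retention like the above generally requires recording rules plus Thanos, Cortex, or VictoriaMetrics (linked under long-term storage below). For a single server, estimate disk as retention time x ingested samples per second x roughly 1-2 bytes per sample, and cap retention explicitly rather than relying on defaults; the values in this sketch are illustrative:
# Cap TSDB retention by both time and size (illustrative values)
prometheus \
  --storage.tsdb.retention.time=90d \
  --storage.tsdb.retention.size=500GB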
Query Performance Optimization
# CORRECT: calculate rate() per series first, then aggregate - counter resets are handled per series
sum(rate(http_requests_total[5m]))
# AVOID: aggregating raw counters before rate() (e.g. rate(sum(http_requests_total)[5m:])) hides counter resets and produces wrong results
# To make the correct query fast on dashboards, pre-compute it with a recording rule (see above)
Dashboard Load Time Thresholds
- Acceptable: <5 seconds during normal operations
- Incident-critical: <10 seconds during outages
- Failure point: >10 seconds renders monitoring useless
Service Level Objectives (SLO) Implementation
Effective SLI Selection
- Availability: 99.9% successful requests (43 minutes downtime/month)
- Latency: 95% of requests under 200ms
- Error Budget: Track consumption rate to balance reliability vs feature velocity
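For the availability SLI above, the usual starting point is the ratio of non-5xx requests over the SLO window, assuming the standard http_requests_total counter with a status label:
# Availability SLI: fraction of successful requests over a 30-day window
sum(rate(http_requests_total{status!~"5.."}[30d])) / sum(rate(http_requests_total[30d]))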
SLO Configuration
slo:
  name: "API Response Time"
  sli:
    query: |
      sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m]))
      /
      sum(rate(http_request_duration_seconds_count[5m]))
  objective: 0.95
  time_window: "7d"
Error Budget Calculation
# SLO compliance check: 7-day availability still meets the 99% objective
(1 - (sum(rate(http_requests_total{status=~"5.."}[7d])) / sum(rate(http_requests_total[7d])))) >= 0.99
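To express how much budget remains rather than whether the objective is still met, divide the observed error rate by the allowed error rate (1% for a 99% objective) and subtract from 1; values at or below zero mean the budget is exhausted:
# Fraction of the 7-day error budget remaining for a 99% availability SLO
1 - (sum(rate(http_requests_total{status=~"5.."}[7d])) / sum(rate(http_requests_total[7d]))) / 0.01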
Critical Failure Scenarios
Monitoring System Failures
Prometheus OOMKilled: High cardinality metrics consuming all available RAM
Dashboard Timeouts: Complex queries without recording rules during incidents
Alert Fatigue: Teams mute alert channels after repeated false positives from alerts built on infrastructure metrics rather than user impact
Real-World Failure Examples
- Hidden Payment Failures: API returns HTTP 200 with an error in the JSON body, so monitoring shows false success (see the query sketch after this list)
- Dependency Blindness: Internal metrics perfect while external payment gateway causes 12-second delays
- Average Response Time Lies: 150ms average hiding 30-second timeout experiences for some users
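Catching hidden failures like the payment example above requires instrumenting business outcomes, not just HTTP status codes; a sketch assuming the application exposes hypothetical payment_failures_total and payment_attempts_total counters:
# Business-level failure rate, independent of HTTP status codes (metric names are assumptions)
sum(rate(payment_failures_total[5m])) / sum(rate(payment_attempts_total[5m])) > 0.01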
Alerting Strategy
Multi-Window Alert Configuration
# Fast response for sudden spikes
- alert: HighErrorRateSpike
  expr: sum(rate(http_requests_total{status=~"5.."}[2m])) by (service) / sum(rate(http_requests_total[2m])) by (service) > 0.05
  for: 1m
# Trend detection for gradual degradation
- alert: HighErrorRateTrend
  expr: sum(rate(http_requests_total{status=~"5.."}[15m])) by (service) / sum(rate(http_requests_total[15m])) by (service) > 0.02
  for: 5m
SLO-Based Alerting
# Alert when burning error budget too quickly
- alert: ErrorBudgetBurnRateHigh
  expr: (sum(rate(http_requests_total{status=~"5.."}[1h])) by (service) / sum(rate(http_requests_total[1h])) by (service)) > (0.01 * 14.4)
  for: 2m
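The multiwindow burn-rate pattern from the SRE workbook pairs the fast window above with a slower one, so brief spikes do not page while sustained slow burns still do; a companion rule sketch at a 6x burn rate over 6 hours:
- alert: ErrorBudgetBurnRateElevated
  expr: (sum(rate(http_requests_total{status=~"5.."}[6h])) by (service) / sum(rate(http_requests_total[6h])) by (service)) > (0.01 * 6)
  for: 15m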
Resource Requirements
Implementation Timeline
- Basic Setup: 2-3 weeks including cardinality optimization
- Production-Ready: 5+ months including failure scenario handling
- Team Training: 30+ days for PromQL proficiency and operational procedures
Cost Analysis (Monthly)
- Prometheus + Grafana: $200-900 (before storage growth pushes the bill higher)
- DataDog APM: $1400-5200+ (vendor lock-in premium)
- Infrastructure: Additional 20-30% for monitoring stack resources
Expertise Requirements
- PromQL Mastery: Essential for effective query optimization
- Cardinality Management: Critical for system stability
- Recording Rules Design: Required for incident-ready dashboards
Technology Comparison Matrix
Solution | Setup Complexity | Monthly Cost | Query Performance | Best Use Case |
---|---|---|---|---|
Prometheus + Grafana | Medium (2-3 weeks) | $200-900 | Excellent with optimization | Cost-conscious teams wanting control |
DataDog APM | Low (hours) | $1400-5200+ | Good (vendor-optimized) | Unlimited budget scenarios |
New Relic | Low (days) | $950-3100+ | Good for tracing | Dashboard-focused organizations |
Dynatrace | Medium (vendor-assisted) | $2100-7800+ | Excellent (AI-powered) | Enterprise complex dependencies |
Essential Query Library
Performance Monitoring Queries
# P95 latency tracking
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le))
# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
# Throughput measurement
sum(rate(http_requests_total[5m])) by (service)
# Dependency correlation
increase(http_request_duration_seconds_sum[5m]) and on(instance) cpu_usage_percent > 80
Capacity Planning Queries
# Growth trend prediction
predict_linear(avg_over_time(cpu_usage_percent[7d])[30d:1d], 86400 * 30) > 80
# Seasonal comparison
sum(rate(http_requests_total[5m])) / sum(rate(http_requests_total[5m] offset 7d))
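The same predict_linear approach covers the storage-exhaustion risk flagged elsewhere in this reference, assuming node_exporter filesystem metrics are scraped (the mountpoint value is illustrative):
# Alert if the Prometheus data volume is projected to fill within 4 days, based on the last 6 hours of growth
predict_linear(node_filesystem_avail_bytes{mountpoint="/prometheus"}[6h], 4 * 86400) < 0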
Critical Warnings
Production Deployment Blockers
- Retention left at defaults (no size cap or downsampling) will exhaust storage in 3-6 months
- High cardinality labels (user_id, session_id) will crash Prometheus
- Complex dashboard queries without recording rules fail during incidents
- Average metrics hide critical user experience issues
Operational Failure Indicators
- Dashboard load times >10 seconds during incidents
- Prometheus consuming >80% available RAM
- Alert fatigue leading to muted notification channels
- Users reporting issues before monitoring alerts trigger
Migration Strategy
Parallel Deployment Approach
- Deploy alongside existing monitoring (30-day minimum overlap)
- Implement equivalent alerts in both systems
- Build matching dashboards in Grafana
- Validate against historical incidents
- Gradually shift alerting responsibility
- Decommission legacy after full validation
Training Requirements
- PromQL basics: 1-2 weeks for operational queries
- Dashboard design: 3-4 weeks for effective visualization
- Incident response: 2-3 months for confident troubleshooting
- Advanced optimization: 6+ months for cardinality and performance tuning
Success Criteria
Monitoring Effectiveness Indicators
- Proactive issue detection: Alerts trigger before user complaints
- Incident response speed: Dashboard load times <5 seconds under load
- Alert precision: <10% false positive rate on critical alerts
- Coverage validation: All user-facing failures generate monitoring signals
Business Impact Metrics
- Mean time to detection (MTTD): <5 minutes for critical issues
- Error budget consumption: Tracked and actionable for feature velocity decisions
- Capacity planning accuracy: Infrastructure scaling based on trend analysis
- Cost optimization: Monitoring stack <5% of infrastructure budget
This reference provides the technical foundation for implementing performance monitoring that detects real user impact rather than just infrastructure health, with specific attention to the operational pitfalls that cause monitoring systems to fail when most needed.
Useful Links for Further Investigation
Essential Resources for Performance Monitoring Excellence
Link | Description |
---|---|
Prometheus Performance Best Practices | Read this before you learn the hard way why user_id labels will destroy your Prometheus server. We ignored this and blew up our monitoring stack twice - don't be us. |
Grafana SLO Documentation | Actually readable SLO docs that don't make you want to punch a wall. The error budget examples are the only reason our SLOs didn't completely fail. |
PromQL Query Examples | Just copy these instead of trying to write PromQL from scratch like some masochist. Saved my ass during a 3am incident when my brain stopped working. |
Grafana Dashboard Best Practices | How to build dashboards that don't shit the bed during incidents. The "incident response" section is what you actually need - ignore the rest. |
Google SRE Book - Monitoring Distributed Systems | The book that started it all. Skip the theory and jump to the Four Golden Signals section. This stuff actually works at scale. |
RED Method for Microservices | Tom Wilkie's presentation on the RED method (Rate, Errors, Duration) for monitoring microservices effectively. |
USE Method by Brendan Gregg | Systematic methodology for infrastructure performance analysis focusing on Utilization, Saturation, and Errors. |
SLI/SLO Implementation Guide | Google's practical guide to implementing SLIs and SLOs from the SRE workbook, with concrete examples and measurement strategies. |
Prometheus Storage Documentation | Read this before you blow your entire AWS budget on Prometheus storage like we did. The retention policy section will save you from financial pain. |
Recording Rules Best Practices | How to pre-compute expensive shit so your dashboards don't timeout when you need them most. Wish I'd read this before our Black Friday incident. |
Grafana Variables and Templating | The magic that makes dashboards actually useful instead of static garbage. Took me way too long to figure this out. |
PromQL Performance Tips | How to write PromQL that doesn't make your dashboards slower than molasses. Essential reading if you want your team to not hate you. |
Prometheus Node Exporter | Essential system metrics collection for correlating application performance with infrastructure health. |
JMX Exporter for Java Applications | Critical for monitoring JVM performance metrics including garbage collection, heap usage, and thread pools. |
Blackbox Exporter | External monitoring for API endpoints, SSL certificates, and dependency health checks. |
KEDA - Kubernetes Event-Driven Autoscaling | Integrate performance metrics with autoscaling decisions for optimal resource utilization. |
Grafana Dashboard Repository | Community dashboards with mixed quality. Filter by downloads and ratings to find the good ones. |
Node Exporter Full Dashboard | The dashboard that every ops team steals and customizes. Works out of the box which is rare for open source tools. |
JVM Dashboard | Actually useful Java monitoring that shows why your app is eating memory. The GC analysis panels saved us during a massive memory leak. |
RED Metrics Dashboard | Tom Wilkie's RED method in dashboard form. Copy this if you want microservices monitoring that doesn't suck. |
Thanos - Prometheus Long-term Storage | Highly available Prometheus setup with unlimited retention and cross-cluster queries for enterprise deployments. |
Cortex - Horizontally Scalable Prometheus | Multi-tenant Prometheus solution for organizations requiring massive scale and isolation. |
VictoriaMetrics | High-performance Prometheus-compatible storage with better compression and query performance. |
Sloth - SLO Generator | Automated SLO definition and alert generation for standardized reliability engineering practices. |
k6 Load Testing with Prometheus | Integrate performance testing results with monitoring stack for better performance analysis. |
Artillery.io Prometheus Plugin | Real-time load testing metrics integration for continuous performance validation. |
Grafana k6 Dashboard | Visualize load testing results alongside production performance metrics for correlation analysis. |
Grafana SLO Examples | Practical SLI implementations for availability, latency, and custom business metrics. |
OpenSLO Specification | Vendor-neutral standard for defining SLOs as code, enabling portability across platforms. |
Pyrra - Prometheus SLO Tool | Kubernetes-native SLO monitoring with automated multi-window alerting and burn rate analysis. |
Prometheus Alerting Rules | Comprehensive guide to creating effective alerts that reduce noise while catching critical issues. |
AlertManager Configuration | Advanced alert routing, inhibition, and integration with incident management systems. |
Grafana Unified Alerting | Next-generation alerting system with multi-dimensional rule evaluation and flexible notification policies. |
Grafana Enterprise Reports | Automated performance reporting for stakeholder communication and compliance requirements. |
Prometheus Federation | Hierarchical monitoring setup for large organizations with multiple teams and environments. |
Cost Monitoring with Prometheus | Integrate infrastructure costs with performance metrics for ROI analysis and optimization decisions. |
PromCon Conference Talks | Real-world case studies and advanced techniques from Prometheus community conferences. |
Grafana Community Forum | Active community for troubleshooting dashboard issues and sharing monitoring strategies. |
CNCF Slack #prometheus Channel | Direct access to Prometheus maintainers and expert community for advanced technical discussions. |
Site Reliability Engineering Course | Google's practical SRE training covering SLO implementation and performance monitoring strategies. |