
Prometheus Node Exporter Production Configuration Guide

Critical Configuration Requirements

Default Configuration Failures

  • Default Risk: Node Exporter v1.9.1 ships with 70+ collectors enabled by default
  • Production Impact: drives 8GB+ of Prometheus RAM usage on a 200-node cluster and can crash Prometheus servers
  • Memory Growth Pattern: interrupts collector generates 500+ metrics per server, slabinfo can consume 2GB+ memory alone
  • Kernel Stability: Parallel I/O operations crash Linux kernels on 96-core servers (documented in GitHub issue #2530)

Essential Collector Configuration

Production-Safe Minimal Setup:

./node_exporter \
  --collector.disable-defaults \
  --collector.cpu \
  --collector.meminfo \
  --collector.filesystem \
  --collector.diskstats \
  --collector.netdev \
  --collector.loadavg

Critical Filesystem Filtering (Required for Docker/Kubernetes):

--collector.filesystem.mount-points-exclude="^/(sys|proc|dev|host|etc|var/lib/docker)($|/)"
--collector.filesystem.fs-types-exclude="^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$"
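These patterns are easy to get subtly wrong, so it pays to dry-run them before rollout. A sketch of one way to do that with grep's extended-regex mode (the mount points listed are made-up examples):

```shell
# Candidate mount points piped through the exclusion regex; only lines
# that do NOT match survive, mirroring what Node Exporter would export.
EXCLUDE='^/(sys|proc|dev|host|etc|var/lib/docker)($|/)'
printf '%s\n' / /data /home /proc/sys /var/lib/docker/overlay2 \
  | grep -Ev "$EXCLUDE"
# /, /data and /home survive; the proc and Docker mounts drop out.
```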

Collector Decision Matrix

Collector   | Memory Usage             | Production Value | Critical Warnings
------------|--------------------------|------------------|------------------
cpu         | 10MB                     | Essential        | Basic alerting foundation
meminfo     | 5MB                      | Essential        | Memory leak detection
filesystem  | 200MB+ without filtering | Critical         | Prevents disk-full disasters; MUST filter Docker mounts
diskstats   | 50MB                     | High             | Detects database I/O thrashing
netdev      | 100MB                    | High             | Bandwidth saturation alerts; filter Docker interfaces
pressure    | 20MB                     | High             | Modern load indicators (PSI stall metrics)
hwmon       | 50MB                     | Medium           | Temperature/fan monitoring; filter voltage readings
interrupts  | 500MB+                   | Rarely useful    | Crashes on 96-core systems
slabinfo    | 1GB+                     | Never useful     | Kernel debugging only
ethtool     | 200MB                    | Rarely useful    | Can cause packet drops
systemd     | 100MB                    | Medium           | Service status monitoring on systemd systems

Performance Thresholds and Failure Points

Critical Performance Metrics

  • Maximum metrics per node: 2000 (above this causes Prometheus performance degradation)
  • Memory limit per instance: 200MB (production deployments should set container limits)
  • Scrape timeout threshold: 5 seconds (longer indicates too many collectors enabled)
  • Cardinality explosion pattern: 200 nodes × 2000 metrics = 400k series (Prometheus failure point)
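The cardinality arithmetic above is worth keeping as a rule of thumb; a trivial sketch of the same calculation:

```shell
# 200 nodes at the 2000-metric ceiling already yields 400k series,
# the Prometheus failure point cited above.
NODES=200
METRICS_PER_NODE=2000
echo "total series: $((NODES * METRICS_PER_NODE))"
```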

Single-Threading Constraint

  • GOMAXPROCS=1 default since v1.5.0: parallel I/O against /proc and /sys has crashed Linux kernels
  • Performance trade-off: slower scraping in exchange for avoiding kernel panics on high-core systems
  • Don't raise it: increasing GOMAXPROCS (via the --runtime.gomaxprocs flag) reintroduces the instability on large systems

Kubernetes Deployment Requirements

Essential Host Access Configuration

spec:
  hostNetwork: true
  hostPID: true
  securityContext:
    runAsUser: 65534  # nobody user
    runAsNonRoot: true
  containers:
  - name: node-exporter
    args:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/host/root'
      - '--collector.filesystem.mount-points-exclude=^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/.+)($|/)'
    volumeMounts:
    - name: root
      mountPath: /host/root
      mountPropagation: HostToContainer  # Critical: missing causes mount detection failures
      readOnly: true

Resource Limits (Required)

resources:
  limits:
    memory: 200Mi
  requests:
    cpu: 100m
    memory: 100Mi

Security Implementation Requirements

Network Security (Critical)

# Bind to a specific interface; never 0.0.0.0 (exposes system details to everyone)
./node_exporter --web.listen-address="192.168.1.100:9100"

# TLS configuration (mandatory for production)
./node_exporter \
  --web.config.file=/etc/node_exporter/web.yml \
  --web.listen-address=":9100"

TLS Configuration Template

tls_server_config:
  cert_file: /etc/ssl/certs/node_exporter.crt
  key_file: /etc/ssl/private/node_exporter.key
  min_version: TLS12
  cipher_suites:
    - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
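On the Prometheus side, scraping a TLS-enabled exporter needs a matching tls_config. A minimal sketch, where the CA path and target address are assumptions for illustration:

```yaml
scrape_configs:
  - job_name: node-exporter
    scheme: https
    tls_config:
      # CA that signed node_exporter.crt above (path assumed)
      ca_file: /etc/prometheus/node_exporter_ca.crt
    static_configs:
      - targets: ['192.168.1.100:9100']
```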

Systemd Security Hardening

[Service]
NoNewPrivileges=true
PrivateTmp=true
ProtectHome=true
ProtectSystem=strict
ReadWritePaths=/var/lib/node_exporter
CapabilityBoundingSet=CAP_DAC_OVERRIDE
AmbientCapabilities=CAP_DAC_OVERRIDE
LimitNOFILE=8192
MemoryMax=512M

Textfile Collector Implementation

Atomic File Writing (Required)

#!/bin/bash
set -euo pipefail

TEXTFILE_DIR="/var/lib/node_exporter/textfiles"
# Create the temp file in the target directory so mv is an atomic rename
# (mktemp's default /tmp may be a different filesystem, making mv a copy)
TEMP_FILE=$(mktemp "$TEXTFILE_DIR/backup_status.prom.XXXXXX")

# Generate metrics in the temp location
{
  echo "# HELP backup_last_success_timestamp Last successful backup time"
  echo "# TYPE backup_last_success_timestamp gauge"
  echo "backup_last_success_timestamp $(date +%s)"
} > "$TEMP_FILE"

# Atomic rename prevents partial file reads
mv "$TEMP_FILE" "$TEXTFILE_DIR/backup_status.prom"

Critical Requirement: Generate metrics into a temporary file on the same filesystem, then mv it into place; writing directly into the textfile directory lets Prometheus scrape half-written files and corrupt metrics.

Troubleshooting Common Failures

Memory Usage Explosion

Symptoms: Node Exporter consuming 1GB+ memory
Root Cause: interrupts or slabinfo collectors enabled
Solution:

# Check cardinality (ignore HELP/TYPE comment lines)
curl -s localhost:9100/metrics | grep -vc '^#'  # Should be <2000

# Identify problematic collector
curl -s localhost:9100/metrics | grep "^node_" | cut -d'_' -f1,2 | sort | uniq -c | sort -nr
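The second pipeline groups series by the first two components of the metric name. On canned sample lines (the metric names here are illustrative, not live output) it behaves like this:

```shell
# Each metric name collapses to its collector prefix, then gets counted;
# the largest count points at the collector flooding your metrics.
printf '%s\n' \
  node_cpu_seconds_total \
  node_cpu_guest_seconds_total \
  node_memory_MemFree_bytes \
  | cut -d'_' -f1,2 | sort | uniq -c | sort -nr
```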

Kubernetes Mount Point Explosion

Symptoms: 500+ filesystem metrics from single node
Root Cause: Docker/Kubernetes overlay mounts not filtered
Solution: Apply mount point exclusion regex (shown above)

Network Interface Cardinality

Symptoms: 100+ network interface metrics on AWS ECS
Solution:

--collector.netdev.device-include="^(eth|ens|eno|enp).*"
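As with the mount-point filter, the include pattern can be dry-run against interface names before rollout (the names below are examples):

```shell
# Only physical-style interface names pass the include filter;
# docker0, veth pairs and loopback are dropped from netdev metrics.
INCLUDE='^(eth|ens|eno|enp).*'
printf '%s\n' eth0 ens5 enp0s3 docker0 veth1a2b lo | grep -E "$INCLUDE"
```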

Version-Specific Considerations

Version 1.9.1 (Current Recommended)

  • Memory leak fixes: Resolved in IRQ pressure collector
  • Multiple textfile directories: Supports comma-separated paths
  • Improved filtering: URL parameter filtering available

Upgrade Risk Assessment

  • High Risk: Cardinality changes between versions can crash Prometheus
  • Testing Required: Always test in staging environment first
  • Rollback Plan: Backup configurations before upgrade
  • Silent Failures: Metric drops after upgrade often go unnoticed

Critical Alerts Configuration

# Essential Node Exporter health monitoring
- alert: NodeExporterDown
  expr: up{job="node-exporter"} == 0
  for: 1m

- alert: NodeExporterHighCardinality
  expr: prometheus_tsdb_symbol_table_size_bytes > 16000000
  for: 5m

- alert: NodeExporterHighMemory
  expr: process_resident_memory_bytes{job="node-exporter"} > 200000000
  for: 5m

Resource Requirements and Planning

Infrastructure Sizing

  • CPU: Single core utilization (GOMAXPROCS=1 limitation)
  • Memory: 100MB baseline + 2MB per 100 metrics
  • Network: negligible bandwidth; expect 1-5 second scrape times with proper filtering
  • Storage: Minimal (stateless service)
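Plugging the per-node metric ceiling into that memory formula gives a quick capacity estimate (integer-math sketch):

```shell
# 100MB baseline + 2MB per 100 metrics; at the 2000-metric ceiling:
METRICS=2000
echo "estimated RSS: $((100 + 2 * METRICS / 100))MB"
# prints: estimated RSS: 140MB
```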

Scaling Considerations

  • Per-node deployment: Required for host-level metrics
  • Load balancer support: Possible but complex due to instance-specific metrics
  • High availability: Achieved through multiple Prometheus servers, not Node Exporter clustering

Common Misconceptions and Failures

  1. "Enable all collectors for complete monitoring" → Causes system crashes and memory exhaustion
  2. "Node Exporter monitors containers" → Only monitors host system; use cAdvisor for containers
  3. "Default configuration is production-ready" → Default configuration can consume 8GB+ RAM
  4. "Binding to 0.0.0.0 is safe on internal networks" → Exposes detailed system information to attackers
  5. "More GOMAXPROCS improves performance" → Causes kernel panics on large systems

Essential Documentation References


  • Prometheus Node Exporter GitHub: The only source of truth. The Issues section has all the "holy shit, that crashed my server too" war stories
  • Node Exporter Guide: Official docs are garbage for troubleshooting but they won't outright lie about basic setup
  • Release Notes v1.9.1: Actually read these or suffer when your memory usage explodes after upgrading
  • Collector Documentation: Lists all collectors but doesn't warn you which ones will murder your server
  • Better Stack Node Exporter Guide: Finally explains the Docker mount hell properly, unlike the garbage tutorials everywhere else
  • Prometheus Best Practices: The cardinality section will save you from metric explosions that murder your server
  • Robust Perception Blog: Brian Brazil actually knows what the fuck he's talking about with Node Exporter performance
  • Textfile Collector Scripts: These scripts saved me from writing custom metrics code. The backup monitoring one actually works
  • GOMAXPROCS=1 Issue: The GitHub issue that explains why Node Exporter is single-threaded (spoiler: parallel I/O crashes Linux)
  • High Cardinality Debugging: How to find which collector is murdering your Prometheus server
  • Prometheus Community Slack: The #node-exporter channel has people who've actually survived production disasters
  • Prometheus Mailing Lists: War stories and solutions from people who've been through the same shit
  • Stack Overflow Node Exporter Tag: Usually garbage but occasionally has the exact error message you're staring at
  • Windows Exporter: If you're stuck monitoring Windows (my condolences)
  • cAdvisor: For actual per-container metrics that Node Exporter can't provide
  • Node Exporter Dashboard 1860: The only Grafana dashboard that doesn't look like it was designed by a colorblind toddler
  • Memory Usage Troubleshooting: The canonical "why is Node Exporter eating all my RAM" issue
  • Kernel Panic Debugging: What to do when Node Exporter crashes your 96-core server
  • Prometheus Getting Started Guide: Official setup and configuration documentation
