Prometheus Node Exporter Production Configuration Guide
Critical Configuration Requirements
Default Configuration Failures
- Default Risk: Node Exporter v1.9.1 ships with 70+ collectors, a large share of them enabled out of the box
- Production Impact: across a 200-node cluster the resulting series can push Prometheus past 8GB of RAM and crash the server
- Memory Growth Pattern: the interrupts collector alone generates 500+ metrics per server; slabinfo can consume 2GB+ of memory by itself
- Kernel Stability: Parallel I/O operations crash Linux kernels on 96-core servers (documented in GitHub issue #2530)
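To see which collectors a running instance actually has enabled (and whether each succeeded on the last scrape), the exporter reports this about itself; assuming the default port 9100:
curl -s localhost:9100/metrics | grep '^node_scrape_collector_success'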
Essential Collector Configuration
Production-Safe Minimal Setup:
./node_exporter \
--collector.disable-defaults \
--collector.cpu \
--collector.meminfo \
--collector.filesystem \
--collector.diskstats \
--collector.netdev \
--collector.loadavg
Critical Filesystem Filtering (Required for Docker/Kubernetes):
--collector.filesystem.mount-points-exclude="^/(sys|proc|dev|host|etc|var/lib/docker)($|/)"
--collector.filesystem.fs-types-exclude="^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$"
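To confirm the filters took effect, count the filesystem series after a restart; a typical host should report a handful of mounts, not hundreds:
curl -s localhost:9100/metrics | grep -c '^node_filesystem_size_bytes'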
Collector Decision Matrix
Collector | Memory Usage | Production Value | Notes / Warnings
---|---|---|---
cpu | 10MB | Essential | Basic alerting foundation |
meminfo | 5MB | Essential | Memory leak detection |
filesystem | 200MB+ without filtering | Critical | Prevents disk full disasters; MUST filter Docker mounts |
diskstats | 50MB | High | Detects database I/O thrashing |
netdev | 100MB | High | Bandwidth saturation alerts; filter Docker interfaces |
pressure | 20MB | High | Modern load indicators (PSI stall metrics) |
hwmon | 50MB | Medium | Temperature/fan monitoring; filter voltage readings |
interrupts | 500MB+ | Rarely useful | Crashes on 96-core systems |
slabinfo | 1GB+ | Never useful | Kernel debugging only |
ethtool | 200MB | Rarely useful | Can cause packet drops |
systemd | 100MB | Medium | Service status monitoring on systemd systems |
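A reasonable middle ground, building on the minimal setup above plus the filesystem exclude flags shown earlier: enable the high-value collectors from the matrix and filter virtual interfaces. This is a sketch; the device regex is an example and should be adjusted to your interface naming:
./node_exporter \
--collector.disable-defaults \
--collector.cpu \
--collector.meminfo \
--collector.loadavg \
--collector.filesystem \
--collector.diskstats \
--collector.netdev \
--collector.pressure \
--collector.netdev.device-exclude="^(veth.*|docker.*|br-.*)$"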
Performance Thresholds and Failure Points
Critical Performance Metrics
- Maximum metrics per node: 2000 (above this causes Prometheus performance degradation)
- Memory limit per instance: 200MB (production deployments should set container limits)
- Scrape timeout threshold: 5 seconds (longer indicates too many collectors enabled)
- Cardinality explosion pattern: 200 nodes × 2000 metrics = 400k series (Prometheus failure point)
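Two PromQL spot checks make these thresholds easy to watch (run against Prometheus; the job label "node-exporter" matches the alert rules further down and may differ in your setup):
count({job="node-exporter"}) by (instance)  # series per node; investigate anything above ~2000
count({job="node-exporter"})  # total series the job contributes cluster-wide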
Single-Threading Constraint
- GOMAXPROCS=1 enforced since v1.5.0: Parallel I/O operations crash Linux kernels
- Performance trade-off: Slower scraping prevents kernel panics on high-core systems
- Do not override: raising GOMAXPROCS reintroduces the instability on high-core systems
Kubernetes Deployment Requirements
Essential Host Access Configuration
spec:
  hostNetwork: true
  hostPID: true
  securityContext:
    runAsUser: 65534  # nobody user
    runAsNonRoot: true
  containers:
    - name: node-exporter
      args:
        - '--path.procfs=/host/proc'
        - '--path.sysfs=/host/sys'
        - '--path.rootfs=/host/root'
        - '--collector.filesystem.mount-points-exclude=^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/.+)($|/)'
      volumeMounts:
        - name: root
          mountPath: /host/root
          mountPropagation: HostToContainer  # Critical: missing causes mount detection failures
          readOnly: true
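The args above point at /host/proc, /host/sys, and /host/root, so the same pod spec also needs matching hostPath volumes (and corresponding proc/sys volumeMounts, not shown here); a minimal sketch:
  volumes:
    - name: proc
      hostPath:
        path: /proc
    - name: sys
      hostPath:
        path: /sys
    - name: root
      hostPath:
        path: /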
Resource Limits (Required)
resources:
  limits:
    memory: 200Mi
  requests:
    cpu: 100m
    memory: 100Mi
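If the exporter runs as a DaemonSet and you also want metrics from tainted control-plane nodes, a blanket toleration is the usual addition; a sketch (taint keys vary by distribution, so verify against your cluster):
tolerations:
  - operator: Exists
    effect: NoSchedule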
Security Implementation Requirements
Network Security (Critical)
# Bind to a specific address, not all interfaces (0.0.0.0 exposes system details to the whole network)
./node_exporter --web.listen-address="192.168.1.100:9100"
# TLS configuration (mandatory for production)
./node_exporter \
--web.config.file=/etc/node_exporter/web.yml \
--web.listen-address=":9100"
TLS Configuration Template
tls_server_config:
  cert_file: /etc/ssl/certs/node_exporter.crt
  key_file: /etc/ssl/private/node_exporter.key
  min_version: TLS12
  cipher_suites:
    - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
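On the Prometheus side, the scrape job must then use HTTPS; a sketch, with the CA path and target as placeholders:
scrape_configs:
  - job_name: node-exporter
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/node_exporter_ca.crt
    static_configs:
      - targets: ['192.168.1.100:9100']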
Systemd Security Hardening
[Service]
NoNewPrivileges=true
PrivateTmp=true
ProtectHome=true
ProtectSystem=strict
ReadWritePaths=/var/lib/node_exporter
CapabilityBoundingSet=CAP_DAC_OVERRIDE
AmbientCapabilities=CAP_DAC_OVERRIDE
LimitNOFILE=8192
# MemoryMax= is the cgroup v2 directive; use MemoryLimit=512M on older cgroup v1 hosts
MemoryMax=512M
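These directives assume a dedicated unit file; a minimal skeleton they drop into, with the binary path and service user as assumptions:
[Unit]
Description=Prometheus Node Exporter
After=network-online.target

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter --web.config.file=/etc/node_exporter/web.yml
# ...hardening directives from above...

[Install]
WantedBy=multi-user.target
Running systemd-analyze security node_exporter.service afterwards scores how locked down the unit actually is.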
Textfile Collector Implementation
Atomic File Writing (Required)
#!/bin/bash
set -euo pipefail
TEXTFILE_DIR="/var/lib/node_exporter/textfiles"
# Create the temp file on the same filesystem so the rename below is atomic
TEMP_FILE=$(mktemp "$TEXTFILE_DIR/.backup_status.prom.XXXXXX")
# Generate metrics in the temp location
{
echo "# HELP backup_last_success_timestamp Last successful backup time"
echo "# TYPE backup_last_success_timestamp gauge"
echo "backup_last_success_timestamp $(date +%s)"
} > "$TEMP_FILE"
# Atomic rename prevents partial file reads
mv "$TEMP_FILE" "$TEXTFILE_DIR/backup_status.prom"
Critical Requirement: Write metrics to a temporary file on the same filesystem and rename it into place; writing directly to the final file in the textfile directory lets the exporter scrape partially written metrics.
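The exporter only reads these files when the textfile collector is pointed at the directory; the path here matches the script above, and --collector.textfile must also be passed if you started from --collector.disable-defaults:
./node_exporter --collector.textfile.directory=/var/lib/node_exporter/textfiles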
Troubleshooting Common Failures
Memory Usage Explosion
Symptoms: Node Exporter consuming 1GB+ memory
Root Cause: interrupts or slabinfo collectors enabled
Solution:
# Check cardinality
curl -s localhost:9100/metrics | grep -vc '^#'  # Sample count (HELP/TYPE lines excluded); should be <2000
# Identify problematic collector
curl -s localhost:9100/metrics | grep "^node_" | cut -d'_' -f1,2 | sort | uniq -c | sort -nr
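If the breakdown points at a heavyweight collector, turn it off explicitly with the --no-collector.<name> flag form (or drop the --collector.<name> flag that enabled it) and restart; for the two usual offenders:
./node_exporter --no-collector.interrupts --no-collector.slabinfo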
Kubernetes Mount Point Explosion
Symptoms: 500+ filesystem metrics from single node
Root Cause: Docker/Kubernetes overlay mounts not filtered
Solution: Apply mount point exclusion regex (shown above)
Network Interface Cardinality
Symptoms: 100+ network interface metrics on AWS ECS
Solution:
--collector.netdev.device-include="^(eth|ens|eno|enp).*"
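Verify the result by counting interface series afterwards:
curl -s localhost:9100/metrics | grep -c '^node_network_receive_bytes_total'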
Version-Specific Considerations
Version 1.9.1 (Current Recommended)
- Memory leak fixes: Resolved in IRQ pressure collector
- Multiple textfile directories: Supports comma-separated paths
- Improved filtering: URL parameter filtering available
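For the multiple-directory support, the flag takes the extra paths directly per the release note above (the second path here is purely illustrative; verify the separator syntax against your installed version):
./node_exporter --collector.textfile.directory=/var/lib/node_exporter/textfiles,/run/node_exporter/textfiles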
Upgrade Risk Assessment
- High Risk: Cardinality changes between versions can crash Prometheus
- Testing Required: Always test in staging environment first
- Rollback Plan: Backup configurations before upgrade
- Silent Failures: Metric drops after upgrade often go unnoticed
Critical Alerts Configuration
# Essential Node Exporter health monitoring
- alert: NodeExporterDown
  expr: up{job="node-exporter"} == 0
  for: 1m
- alert: NodeExporterHighCardinality
  expr: prometheus_tsdb_symbol_table_size_bytes > 16000000
  for: 5m
- alert: NodeExporterHighMemory
  expr: process_resident_memory_bytes{job="node-exporter"} > 200000000
  for: 5m
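These rules belong inside a standard groups:/rules: wrapper in a rule file; once there, promtool validates them before deployment (the path is an example):
promtool check rules /etc/prometheus/rules/node-exporter-health.yml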
Resource Requirements and Planning
Infrastructure Sizing
- CPU: Single core utilization (GOMAXPROCS=1 limitation)
- Memory: 100MB baseline + 2MB per 100 metrics
- Scrape duration: 1-5 seconds per scrape with proper filtering
- Storage: Minimal (stateless service)
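A back-of-envelope check using those figures, for a node at the 2000-metric ceiling:
echo $((100 + 2 * 2000 / 100))  # 100MB baseline + 2MB per 100 metrics => 140 (MB)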
Scaling Considerations
- Per-node deployment: Required for host-level metrics
- Load balancer support: Possible but complex due to instance-specific metrics
- High availability: Achieved through multiple Prometheus servers, not Node Exporter clustering
Common Misconceptions and Failures
- "Enable all collectors for complete monitoring" → Causes system crashes and memory exhaustion
- "Node Exporter monitors containers" → Only monitors host system; use cAdvisor for containers
- "Default configuration is production-ready" → Default configuration can consume 8GB+ RAM
- "Binding to 0.0.0.0 is safe on internal networks" → Exposes detailed system information to attackers
- "More GOMAXPROCS improves performance" → Causes kernel panics on large systems
Essential Documentation References
- GitHub Issues #2530: GOMAXPROCS=1 explanation and kernel panic prevention
- Release Notes v1.9.1: Current version improvements and fixes
- Textfile Collector Scripts: Community-maintained metric collection scripts
- Robust Perception Blog: Performance optimization and cardinality management
Useful Links for Further Investigation
Essential Resources and Documentation
Link | Description |
---|---|
Prometheus Node Exporter GitHub | The only source of truth. The Issues section has all the "holy shit, that crashed my server too" war stories |
Node Exporter Guide | Official docs are garbage for troubleshooting but they won't outright lie about basic setup |
Release Notes v1.9.1 | Actually read these or suffer when your memory usage explodes after upgrading |
Collector Documentation | Lists all collectors but doesn't warn you which ones will murder your server |
Better Stack Node Exporter Guide | Finally explains the Docker mount hell properly, unlike the garbage tutorials everywhere else |
Prometheus Best Practices | The cardinality section will save you from metric explosions that murder your server |
Robust Perception Blog | Brian Brazil actually knows what the fuck he's talking about with Node Exporter performance |
Textfile Collector Scripts | These scripts saved me from writing custom metrics code. The backup monitoring one actually works |
GOMAXPROCS=1 Issue | The GitHub issue that explains why Node Exporter is single-threaded (spoiler: parallel I/O crashes Linux) |
High Cardinality Debugging | How to find which collector is murdering your Prometheus server |
Prometheus Community Slack | The #node-exporter channel has people who've actually survived production disasters |
Prometheus Mailing Lists | War stories and solutions from people who've been through the same shit |
Stack Overflow Node Exporter Tag | Usually garbage but occasionally has the exact error message you're staring at |
Windows Exporter | If you're stuck monitoring Windows (my condolences) |
cAdvisor | For actual per-container metrics that Node Exporter can't provide |
Node Exporter Dashboard 1860 | The only Grafana dashboard that doesn't look like it was designed by a colorblind toddler |
Memory Usage Troubleshooting | The canonical "why is Node Exporter eating all my RAM" issue |
Kernel Panic Debugging | What to do when Node Exporter crashes your 96-core server |
Prometheus Getting Started Guide | Official setup and configuration documentation |