Node Exporter v1.9.1 ships with dozens of collectors enabled by default, which is operational suicide if you run it in production. I learned this when our Prometheus server hit 16GB of RAM usage because some jackass deployed Node Exporter with defaults across 200 nodes. The official docs mention "careful consideration" but don't tell you it'll murder your server.
The Collectors That Actually Matter (And The Ones That Don't)
Most collectors are useless noise that'll tank your Prometheus performance. The interrupts collector alone shits out hundreds to thousands of metrics per server, depending on core count. I watched slabinfo eat 2GB of memory before I learned to blacklist the fucking thing.
Disable the defaults and cherry-pick what you need:
# This saved our ass when memory usage hit 8GB per node
./node_exporter \
  --collector.disable-defaults \
  --collector.cpu \
  --collector.meminfo \
  --collector.filesystem \
  --collector.diskstats \
  --collector.netdev \
  --collector.loadavg
The collectors worth keeping (there's a one-liner after this list to check what's actually running):
- cpu: CPU utilization - obviously you need this shit
- meminfo: Memory stats - because OOM kills are fun to debug at 3am
- filesystem: Disk space monitoring - saved my ass from "disk full" disasters more times than I can count
- diskstats: I/O metrics - catches when your database decides to hammer the disk
- netdev: Network stats - spots when someone's torrenting on the production network
- loadavg: Load average - the one Unix metric that hasn't been ruined by containers
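To see what's actually running once it's up, node_exporter reports a success flag and a scrape duration for every enabled collector, so a quick grep against the endpoint (localhost:9100 assumed) tells you exactly what you've got:

# 1 = collector ran, 0 = it failed; anything missing here isn't enabled at all
curl -s localhost:9100/metrics | grep '^node_scrape_collector_success'
# Per-collector scrape time - slow collectors stand out immediately
curl -s localhost:9100/metrics | grep '^node_scrape_collector_duration_seconds'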
Skip these memory hogs (flag example after the list):
- interrupts: Generates 500+ metrics per server, crashes on 96-core boxes
- slabinfo: Linux kernel memory stats nobody looks at
- softnet: Network softirq stats that are rarely useful
- entropy: Random number entropy - interesting but not actionable
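If you can't go full --collector.disable-defaults, each collector also has a --no-collector.<name> switch, so you can keep the default set and just kill the worst offenders. A minimal sketch:

# Keep the default collector set but explicitly turn off the noisy ones
./node_exporter \
  --no-collector.interrupts \
  --no-collector.slabinfo \
  --no-collector.softnet \
  --no-collector.entropy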
Filtering - Because Docker Mount Spam Will Kill You
Filesystem collector without filtering is a cardinality bomb:
# This prevents 500+ filesystem metrics from Docker containers
--collector.filesystem.mount-points-exclude="^/(sys|proc|dev|host|etc|var/lib/docker)($|/)"
--collector.filesystem.fs-types-exclude="^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$"
Docker and Kubernetes will shit out hundreds of overlay mounts. I've debugged servers generating 2000+ filesystem metrics because Kubernetes was constantly churning pods. That regex above defused a cardinality bomb that was eating 4GB of our Prometheus memory.
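To size the damage before rolling the exclude flags out, you can ask Prometheus how many filesystem series it's already holding. A sketch; PROM_URL is a placeholder for your Prometheus base URL:

# Total node_filesystem series across all targets - watch this drop once the exclude regex lands
curl -sG "$PROM_URL/api/v1/query" \
  --data-urlencode 'query=count(node_filesystem_size_bytes)'
# Worst offenders by instance
curl -sG "$PROM_URL/api/v1/query" \
  --data-urlencode 'query=topk(10, count by (instance) (node_filesystem_size_bytes))'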
Hardware monitoring that doesn't suck:
# Only monitor temps and fans - voltage readings are usually garbage
--collector.hwmon.chip-include="^(coretemp|k10temp|drivetemp|acpi).*"
--collector.hwmon.sensor-include="^(temp|fan).*"
Most server hwmon sensors report meaningless voltage readings that fluctuate randomly. Temperature and fan RPM are the only metrics that matter for alerting.
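Before writing include regexes, look at what hwmon actually exposes on the box; chip names and sensor files live under the standard sysfs hwmon path, which is what the two flags match against. A quick sketch:

# Chip names (what chip-include matches) and their temp/fan sensors (what sensor-include matches)
for d in /sys/class/hwmon/hwmon*; do
  echo "$(cat "$d/name"): $(ls "$d" | grep -E '^(temp|fan)[0-9]+_input$' | tr '\n' ' ')"
done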
Textfile Collector - Custom Metrics Without Writing Go
The textfile collector is how you get application metrics without building a full Prometheus exporter. Just dump Prometheus-format .prom files into the configured directory and Node Exporter scrapes them. The community textfile scripts cover backup monitoring, certificate expiry checks, and custom business metrics.
# Version 1.9.0+ supports multiple directories
--collector.textfile.directory=/var/lib/node_exporter/textfiles:/opt/app/metrics
Write files atomically or you'll get partial metrics:
#!/bin/bash
# DON'T write directly to the textfile directory - use temp files
TEXTFILE_DIR="/var/lib/node_exporter/textfiles"
# Create the temp file on the same filesystem so the mv below is an atomic rename
TEMP_FILE=$(mktemp "$TEXTFILE_DIR/.backup_status.XXXXXX")
# Generate metrics in temp location
{
  echo "# HELP backup_last_success_timestamp Last successful backup time"
  echo "# TYPE backup_last_success_timestamp gauge"
  echo "backup_last_success_timestamp $(date +%s)"
} > "$TEMP_FILE"
# Atomic move prevents Node Exporter reading partial files
mv "$TEMP_FILE" "$TEXTFILE_DIR/backup_status.prom"
I've seen textfile metrics get corrupted because scripts write directly to the monitored directory. Use mktemp and atomic moves or you'll get HELP lines without TYPE lines, which breaks Prometheus parsing.
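For an extra safety net on top of the atomic write, promtool can lint the exposition format before the file goes live. A sketch that slots into the script above, assuming promtool is installed on the host:

# Validate before publishing; a malformed file is worse than a stale one
if promtool check metrics < "$TEMP_FILE"; then
  mv "$TEMP_FILE" "$TEXTFILE_DIR/backup_status.prom"
else
  echo "refusing to publish malformed metrics" >&2
  rm -f "$TEMP_FILE"
fi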
Kubernetes Deployment - The Host Mount Hell
Running Node Exporter in Kubernetes is a pain in the ass because it needs to access the host system, not the container. You need hostNetwork, hostPID, and a bunch of volume mounts that make security teams nervous.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true
      hostPID: true
      securityContext:
        runAsUser: 65534  # nobody user
        runAsNonRoot: true
      containers:
        - name: node-exporter
          image: prom/node-exporter:v1.9.1
          args:
            - '--path.procfs=/host/proc'
            - '--path.sysfs=/host/sys'
            - '--path.rootfs=/host/root'
            # This regex saves your life in Kubernetes
            - '--collector.filesystem.mount-points-exclude=^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/.+)($|/)'
            - '--collector.disable-defaults'
            - '--collector.cpu'
            - '--collector.meminfo'
            - '--collector.filesystem'
            - '--collector.diskstats'
            - '--collector.netdev'
            - '--collector.loadavg'
          ports:
            - containerPort: 9100
              protocol: TCP
          resources:
            limits:
              memory: 200Mi
            requests:
              cpu: 100m
              memory: 100Mi
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
            - name: root
              mountPath: /host/root
              mountPropagation: HostToContainer
              readOnly: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
        - name: root
          hostPath:
            path: /
      tolerations:
        - operator: Exists
The filesystem mount point exclusion is critical in Kubernetes. Without it, you'll get thousands of metrics from kubelet and Docker overlay mounts. Set memory limits because Node Exporter can balloon to 1GB+ if you enable the wrong collectors.
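Once the DaemonSet is out, spot-check a node to confirm the exclusions actually took; with hostNetwork: true, port 9100 is bound on the node's own IP (NODE_IP is a placeholder):

# Filesystem series per node should be in the dozens, not the thousands
curl -s "http://$NODE_IP:9100/metrics" | grep -c '^node_filesystem_'
# The exporter's own RSS - should sit comfortably under the 200Mi limit
curl -s "http://$NODE_IP:9100/metrics" | grep '^process_resident_memory_bytes'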
The GOMAXPROCS=1 Story - Why Node Exporter is Single-Threaded
Since version 1.5.0, Node Exporter locks itself to GOMAXPROCS=1 because parallel I/O operations literally crash Linux kernels on big servers. I'm not making this up - we had Node Exporter kernel panic a 96-core AWS c5.24xlarge by doing simultaneous /proc reads. GitHub issue #2530 documents this shitshow.
The server would boot, Node Exporter would start hammering /proc and /sys in parallel, and BOOM - kernel oops. Took me 6 hours of debugging before I found the GOMAXPROCS setting buried in a comment.
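For the record, the setting is tunable - a sketch assuming the --runtime.gomaxprocs flag that shipped alongside the 1.5.0 change; don't raise it on big boxes for exactly the reasons above:

# Explicitly pin to 1 (the default since v1.5.0); raising it brings back the parallel /proc reads
./node_exporter --runtime.gomaxprocs=1
# If your build exposes the Go runtime metrics, this shows the effective value
curl -s localhost:9100/metrics | grep '^go_sched_gomaxprocs_threads'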
The real performance killers:
- Cardinality explosion: 200 nodes × 2000 metrics = 400k series. Your Prometheus will die.
- Slow /metrics endpoint: If scraping takes >5 seconds, you've enabled too many collectors
- Memory growth: Without filtering, Node Exporter hits 1GB+ memory usage per instance
How to debug performance issues:
# Check metric cardinality - if >2000, you're probably screwed
curl -s localhost:9100/metrics | wc -l
# Time the scrape - should be <2 seconds
time curl -s localhost:9100/metrics > /dev/null
# Check memory usage - should be <100MB with proper filtering
docker exec node-exporter ps aux | grep node_exporter
Network interface hell: AWS ECS servers with 100+ network interfaces will absolutely wreck the netdev collector. The v1.9.0 ifAlias optimization helps, but you're still fucked without filtering:
# Only monitor physical interfaces, skip the 200+ Docker bridges
--collector.netdev.device-include="^(eth|ens|eno|enp).*"
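To check whether the filter is worth it on a given box, compare how many interfaces the kernel sees against how many netdev series you're exporting (localhost:9100 assumed):

# Interfaces the kernel sees vs. netdev series you're paying for (one per device per counter)
ip -o link show | wc -l
curl -s localhost:9100/metrics | grep -c '^node_network_receive_bytes_total'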