Node Exporter v1.9.1 ships with dozens of collectors enabled by default, which is operational suicide if you run it in production. I learned this when our Prometheus server hit 16GB of RAM usage because some jackass deployed Node Exporter with defaults across 200 nodes. The official docs mention "careful consideration" but don't tell you it'll murder your server.
The Collectors That Actually Matter (And The Ones That Don't)
Most collectors are useless noise that'll tank your Prometheus performance. The interrupts collector alone shits out hundreds to thousands of metrics per server, depending on core count. I watched slabinfo eat 2GB of memory before I learned to blacklist the fucking thing.
Disable the defaults and cherry-pick what you need:
# This saved our ass when memory usage hit 8GB per node
./node_exporter \
  --collector.disable-defaults \
  --collector.cpu \
  --collector.meminfo \
  --collector.filesystem \
  --collector.diskstats \
  --collector.netdev \
  --collector.loadavg
The collectors worth keeping (there's a one-liner after this list to check what's actually running):
- cpu: CPU utilization - obviously you need this shit
- meminfo: Memory stats - because OOM kills are fun to debug at 3am
- filesystem: Disk space monitoring - saved my ass from "disk full" disasters more times than I can count
- diskstats: I/O metrics - catches when your database decides to hammer the disk
- netdev: Network stats - spots when someone's torrenting on the production network
- loadavg: Load average - the one Unix metric that hasn't been ruined by containers
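To see what's actually running once it's up, node_exporter reports a success flag and a scrape duration for every enabled collector, so a quick grep against the endpoint (localhost:9100 assumed) tells you exactly what you've got:

# 1 = collector ran, 0 = it failed; anything missing here isn't enabled at all
curl -s localhost:9100/metrics | grep '^node_scrape_collector_success'
# Per-collector scrape time - slow collectors stand out immediately
curl -s localhost:9100/metrics | grep '^node_scrape_collector_duration_seconds'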
Skip these memory hogs (flag example after the list):
- interrupts: Generates 500+ metrics per server, crashes on 96-core boxes
- slabinfo: Linux kernel memory stats nobody looks at
- softnet: Network softirq stats that are rarely useful
- entropy: Random number entropy - interesting but not actionable
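If you can't go full --collector.disable-defaults, each collector also has a --no-collector.<name> switch, so you can keep the default set and just kill the worst offenders. A minimal sketch:

# Keep the default collector set but explicitly turn off the noisy ones
./node_exporter \
  --no-collector.interrupts \
  --no-collector.slabinfo \
  --no-collector.softnet \
  --no-collector.entropy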
Filtering - Because Docker Mount Spam Will Kill You
Filesystem collector without filtering is a cardinality bomb:
# This prevents 500+ filesystem metrics from Docker containers
--collector.filesystem.mount-points-exclude="^/(sys|proc|dev|host|etc|var/lib/docker)($|/)"
--collector.filesystem.fs-types-exclude="^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$"
Docker and Kubernetes will shit out hundreds of overlay mounts. I've debugged servers generating 2000+ filesystem metrics because Kubernetes was constantly churning pods. That regex above defused a cardinality bomb that was eating 4GB of our Prometheus memory.
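To size the damage before rolling the exclude flags out, you can ask Prometheus how many filesystem series it's already holding. A sketch; PROM_URL is a placeholder for your Prometheus base URL:

# Total node_filesystem series across all targets - watch this drop once the exclude regex lands
curl -sG "$PROM_URL/api/v1/query" \
  --data-urlencode 'query=count(node_filesystem_size_bytes)'
# Worst offenders by instance
curl -sG "$PROM_URL/api/v1/query" \
  --data-urlencode 'query=topk(10, count by (instance) (node_filesystem_size_bytes))'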
Hardware monitoring that doesn't suck:
# Only monitor temps and fans - voltage readings are usually garbage
--collector.hwmon.chip-include="^(coretemp|k10temp|drivetemp|acpi).*"
--collector.hwmon.sensor-include="^(temp|fan).*"
Most server hwmon sensors report meaningless voltage readings that fluctuate randomly. Temperature and fan RPM are the only metrics that matter for alerting.
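Before writing include regexes, look at what hwmon actually exposes on the box; chip names and sensor files live under the standard sysfs hwmon path, which is what the two flags match against. A quick sketch:

# Chip names (what chip-include matches) and their temp/fan sensors (what sensor-include matches)
for d in /sys/class/hwmon/hwmon*; do
  echo "$(cat "$d/name"): $(ls "$d" | grep -E '^(temp|fan)[0-9]+_input$' | tr '\n' ' ')"
done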
Textfile Collector - Custom Metrics Without Writing Go
The textfile collector is how you get application metrics without building a full Prometheus exporter. Just dump Prometheus-format .prom files into the configured directory and Node Exporter scrapes them. The community textfile scripts cover backup monitoring, certificate expiry checks, and custom business metrics.
# Version 1.9.0+ supports multiple directories
--collector.textfile.directory=/var/lib/node_exporter/textfiles:/opt/app/metrics
Write files atomically or you'll get partial metrics:
#!/bin/bash
# DON'T write directly to the textfile directory - use temp files
TEXTFILE_DIR="/var/lib/node_exporter/textfiles"
# Create the temp file on the same filesystem so the mv below is an atomic rename
TEMP_FILE=$(mktemp "$TEXTFILE_DIR/.backup_status.XXXXXX")
# Generate metrics in temp location
{
  echo "# HELP backup_last_success_timestamp Last successful backup time"
  echo "# TYPE backup_last_success_timestamp gauge"
  echo "backup_last_success_timestamp $(date +%s)"
} > "$TEMP_FILE"
# Atomic move prevents Node Exporter reading partial files
mv "$TEMP_FILE" "$TEXTFILE_DIR/backup_status.prom"
I've seen textfile metrics get corrupted because scripts write directly to the monitored directory. Use mktemp and atomic moves or you'll get HELP lines without TYPE lines, which breaks Prometheus parsing.
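For an extra safety net on top of the atomic write, promtool can lint the exposition format before the file goes live. A sketch that slots into the script above, assuming promtool is installed on the host:

# Validate before publishing; a malformed file is worse than a stale one
if promtool check metrics < "$TEMP_FILE"; then
  mv "$TEMP_FILE" "$TEXTFILE_DIR/backup_status.prom"
else
  echo "refusing to publish malformed metrics" >&2
  rm -f "$TEMP_FILE"
fi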
Kubernetes Deployment - The Host Mount Hell
Running Node Exporter in Kubernetes is a pain in the ass because it needs to access the host system, not the container. You need hostNetwork, hostPID, and a bunch of volume mounts that make security teams nervous.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true
      hostPID: true
      securityContext:
        runAsUser: 65534  # nobody user
        runAsNonRoot: true
      containers:
        - name: node-exporter
          image: prom/node-exporter:v1.9.1
          args:
            - '--path.procfs=/host/proc'
            - '--path.sysfs=/host/sys'
            - '--path.rootfs=/host/root'
            # This regex saves your life in Kubernetes
            - '--collector.filesystem.mount-points-exclude=^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/.+)($|/)'
            - '--collector.disable-defaults'
            - '--collector.cpu'
            - '--collector.meminfo'
            - '--collector.filesystem'
            - '--collector.diskstats'
            - '--collector.netdev'
            - '--collector.loadavg'
          ports:
            - containerPort: 9100
              protocol: TCP
          resources:
            limits:
              memory: 200Mi
            requests:
              cpu: 100m
              memory: 100Mi
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
            - name: root
              mountPath: /host/root
              mountPropagation: HostToContainer
              readOnly: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
        - name: root
          hostPath:
            path: /
      tolerations:
        - operator: Exists
The filesystem mount point exclusion is critical in Kubernetes. Without it, you'll get thousands of metrics from kubelet and Docker overlay mounts. Set memory limits because Node Exporter can balloon to 1GB+ if you enable the wrong collectors.
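Once the DaemonSet is out, spot-check a node to confirm the exclusions actually took; with hostNetwork: true, port 9100 is bound on the node's own IP (NODE_IP is a placeholder):

# Filesystem series per node should be in the dozens, not the thousands
curl -s "http://$NODE_IP:9100/metrics" | grep -c '^node_filesystem_'
# The exporter's own RSS - should sit comfortably under the 200Mi limit
curl -s "http://$NODE_IP:9100/metrics" | grep '^process_resident_memory_bytes'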
The GOMAXPROCS=1 Story - Why Node Exporter is Single-Threaded
Since version 1.5.0, Node Exporter locks itself to GOMAXPROCS=1 because parallel I/O operations literally crash Linux kernels on big servers. I'm not making this up - we had Node Exporter kernel panic a 96-core AWS c5.24xlarge by doing simultaneous /proc reads. GitHub issue #2530 documents this shitshow.
The server would boot, Node Exporter would start hammering /proc and /sys in parallel, and BOOM - kernel oops. Took me 6 hours of debugging before I found the GOMAXPROCS setting buried in a comment.
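For the record, the setting is tunable - a sketch assuming the --runtime.gomaxprocs flag that shipped alongside the 1.5.0 change; don't raise it on big boxes for exactly the reasons above:

# Explicitly pin to 1 (the default since v1.5.0); raising it brings back the parallel /proc reads
./node_exporter --runtime.gomaxprocs=1
# If your build exposes the Go runtime metrics, this shows the effective value
curl -s localhost:9100/metrics | grep '^go_sched_gomaxprocs_threads'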
The real performance killers:
- Cardinality explosion: 200 nodes × 2000 metrics = 400k series. Your Prometheus will die.
- Slow /metrics endpoint: If scraping takes >5 seconds, you've enabled too many collectors
- Memory growth: Without filtering, Node Exporter hits 1GB+ memory usage per instance
How to debug performance issues:
# Check metric cardinality - if >2000, you're probably screwed
curl -s localhost:9100/metrics | wc -l
# Time the scrape - should be <2 seconds
time curl -s localhost:9100/metrics > /dev/null
# Check memory usage - should be <100MB with proper filtering
docker exec node-exporter ps aux | grep node_exporter
Network interface hell: AWS ECS servers with 100+ network interfaces will absolutely wreck the netdev collector. The v1.9.0 ifAlias optimization helps, but you're still fucked without filtering:
# Only monitor physical interfaces, skip the 200+ Docker bridges
--collector.netdev.device-include="^(eth|ens|eno|enp).*"
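To check whether the filter is worth it on a given box, compare how many interfaces the kernel sees against how many netdev series you're exporting (localhost:9100 assumed):

# Interfaces the kernel sees vs. netdev series you're paying for (one per device per counter)
ip -o link show | wc -l
curl -s localhost:9100/metrics | grep -c '^node_network_receive_bytes_total'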