Right, let’s talk about node_exporter. This is the workhorse, the foundation, the thing that goes out and gets the dirt-under-its-fingernails metrics from the machine your software is running on. It’s not glamorous, but without it, you’re flying blind. Think of it as a highly specific, incredibly diligent intern who runs around your server with a clipboard, meticulously counting everything from CPU cycles to disk I/O, and then formats it all for Prometheus to consume.

You don’t run node_exporter as part of your application. You run it as a separate, persistent daemon on every node you want to monitor, be it a physical metal server, a VM, or a Kubernetes node. Its entire job is to expose an HTTP endpoint (port 9100 by default) that spits out a massive wall of plaintext metrics in the Prometheus exposition format whenever you curl it.

The Quickest of Starts

Getting it running is stupidly simple. Download the pre-compiled binary for your architecture from the Prometheus downloads page or the project’s GitHub releases, unpack it, and run the thing. Swap in whatever the current release is for the version below.

wget https://github.com/prometheus/node_exporter/releases/download/v1.8.1/node_exporter-1.8.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.8.1.linux-amd64.tar.gz
cd node_exporter-1.8.1.linux-amd64
./node_exporter

Boom. It’s now running, collecting from your system, and serving metrics at http://localhost:9100/metrics. Go curl that. I’ll wait.

See that glorious, terrifying wall of text? That’s pure gold. Each line is a metric. Some are simple counters (e.g., node_network_receive_bytes_total), some are gauges (e.g., node_memory_MemAvailable_bytes), and they all have labels to differentiate devices, modes, and states.
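If the wall is too much to take in at once, it helps to pull out one metric family at a time. Here is a minimal sketch of that, assuming the plain exposition format described above; metric_samples is a made-up helper name, not something node_exporter ships.

```shell
# Print only the sample lines for one metric family, skipping the
# "# HELP" and "# TYPE" comment lines. Reads /metrics text on stdin.
# metric_samples is a hypothetical helper, not part of node_exporter.
metric_samples() {
  local name="$1"
  # A sample line starts with the metric name followed by either
  # "{" (labels present) or a space (no labels).
  grep "^${name}[{ ]"
}

# Usage against a live exporter (assumes the default port 9100):
#   curl -s http://localhost:9100/metrics | metric_samples node_network_receive_bytes_total
```

Matching on the name plus "{" or a space keeps node_network_receive_bytes_total from also dragging in every other metric that merely shares a prefix.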

Why You Should Use a Service Manager

Running it directly in your shell like that is great for a quick test, but profoundly stupid for anything real. The moment you close your terminal, it dies. Don’t do that.

Use systemd. It’s there for a reason. Create a service file like /etc/systemd/system/node_exporter.service:

[Unit]
Description=Node Exporter
Documentation=https://github.com/prometheus/node_exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target

Then enable and start it:

sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

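Now it’ll survive reboots and you can manage it like a proper adult. The unit above assumes you’ve copied the binary to /usr/local/bin and that the node_exporter user and group already exist. The User and Group lines aren’t strictly necessary, but creating a dedicated, non-privileged user for it is a solid security best practice. It doesn’t need root for most collectors.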

Taming the Firehose: Collector Management

By default, node_exporter enables a ton of collectors, but not all of them. Some are off by default because they might be considered niche or potentially expensive. The list of enabled collectors changes slightly between versions, which is why you should always check what’s actually active.

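Use the --web.disable-exporter-metrics flag to drop the exporter’s metrics about itself (the go_*, process_*, and promhttp_* series), unless you want those, which you might. To see everything it can collect, run node_exporter --help: every collector shows up as a --collector.<name> flag, along with whether it is enabled by default. To see what is actually active on a running instance, look for the node_scrape_collector_success series in the /metrics output, which has one entry per enabled collector. This is a great way to discover weird, hidden metrics.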

Want to enable the systemd collector, which is off by default? Use --collector.systemd. Want to disable the textfile collector because you aren’t using it? Use --no-collector.textfile. Want the textfile collector and nothing else? Use --collector.disable-defaults --collector.textfile. The logic is, frankly, a bit awkward. You use --collector.X to enable a specific one if it’s not in the defaults, --no-collector.X to disable one that is, and --collector.disable-defaults to start from an empty set. Check the help (--help) for your specific version. This is one of those areas where the design feels a bit bolted-on over time.
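After fiddling with those flags, it’s worth confirming what actually ended up enabled. A rough sketch of one way to do it, assuming your version emits the node_scrape_collector_success series (one per enabled collector); active_collectors is a hypothetical helper name.

```shell
# List the collectors that ran during a scrape, with their success flag
# (1 = the collector succeeded, 0 = it failed). Reads /metrics text on stdin.
# active_collectors is a hypothetical helper, not part of node_exporter.
active_collectors() {
  sed -n 's/^node_scrape_collector_success{collector="\([^"]*\)"} \([0-9.e+-]*\)$/\1 \2/p' | sort
}

# Usage against a live exporter (assumes the default port 9100):
#   curl -s http://localhost:9100/metrics | active_collectors
```

A collector that shows up with a 0 is enabled but failing, which usually means a missing kernel feature or a permissions problem rather than a flag mistake.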

The Textfile Collector: Your Escape Hatch

This is arguably the most powerful feature. The textfile collector lets you write your own custom metrics from shell scripts, cron jobs, or any other process on the machine by simply writing them to a file in the Prometheus format. node_exporter will then automatically pick them up and expose them alongside its native metrics.

Say you have a cron job that checks the status of your legacy backup system. Instead of trying to get Prometheus to scrape some weird XML output, have the cron job write a metric to a file:

#!/bin/bash

# Directory node_exporter reads *.prom files from
TEXTFILE_DIR=/var/lib/node_exporter/textfile_backups

# Check if our backup succeeded. Exit code magic.
check_backup_status() {
  # ... your logic here ...
  # $? below is the exit status of the last command in that logic
  if [ $? -eq 0 ]; then
    echo "myapp_backup_successful 1"
  else
    echo "myapp_backup_successful 0"
  fi
}

# Write to a temp file, then rename it into place. The rename is atomic,
# so node_exporter never reads a half-written file, and the temp name
# doesn't end in .prom, so it is ignored until the mv.
check_backup_status > "$TEXTFILE_DIR/myapp_backup.prom.$$"
mv "$TEXTFILE_DIR/myapp_backup.prom.$$" "$TEXTFILE_DIR/myapp_backup.prom"

Then run node_exporter with --collector.textfile.directory=/var/lib/node_exporter/textfile_backups, the same directory the script writes to. Now you’ve extended your node-level metrics with application-specific business logic. It’s brilliant.
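If you end up with more than one of these scripts, the write-then-rename dance is worth pulling into a function. The sketch below is one way to do that, with HELP and TYPE lines thrown in; write_gauge and the metric name in the usage line are made up for illustration.

```shell
# Write one gauge to <dir>/<name>.prom atomically: write a temp file in
# the same directory, then rename it over the final name.
# Arguments: directory, metric name, help text, value.
# write_gauge is a hypothetical helper, not part of node_exporter.
write_gauge() {
  local dir="$1" name="$2" help="$3" value="$4"
  local tmp
  # The temp name does not end in .prom, so the collector skips it.
  tmp="$(mktemp "${dir}/.${name}.XXXXXX")" || return 1
  {
    printf '# HELP %s %s\n' "$name" "$help"
    printf '# TYPE %s gauge\n' "$name"
    printf '%s %s\n' "$name" "$value"
  } > "$tmp"
  # mktemp creates the file as 0600; make it readable by the exporter's user.
  chmod 644 "$tmp"
  mv "$tmp" "${dir}/${name}.prom"
}

# Usage (hypothetical metric name):
#   write_gauge /var/lib/node_exporter/textfile_backups \
#     myapp_backup_last_success_timestamp_seconds \
#     "Unix time of the last successful backup." "$(date +%s)"
```

Recording a last-success timestamp instead of a 0/1 flag is a design choice, not a requirement: it lets you alert on "no good backup in N hours" even if the cron job silently stops running.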

The Pitfalls: It’s All Relative

The biggest “gotcha” with node_exporter metrics is that you almost never care about the absolute value. You care about the rate of change.

node_network_receive_bytes_total is a counter. It only goes up, until the server is rebooted, at which point it starts again from zero. A value of 123456789 is meaningless on its own. What you want in PromQL is rate(node_network_receive_bytes_total[5m]), which gives you the average bytes per second over the last five minutes and handles those resets for you. This is the fundamental shift in thinking that Prometheus forces on you, and it’s the right one.
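To make the arithmetic concrete, here is a deliberately simplified sketch of what a rate boils down to: the increase between two samples divided by the time between them. The real rate() uses every sample in the window and extrapolates to the window edges, so treat this as an illustration of the idea and not a reimplementation. counter_rate is a made-up name.

```shell
# Per-second rate between two samples of a counter.
# Arguments: earlier value, earlier time (s), later value, later time (s).
# If the counter went backwards (a reset, e.g. after a reboot), assume it
# restarted from zero and count the later value as the whole increase.
# counter_rate is a hypothetical helper for illustration only.
counter_rate() {
  awk -v v1="$1" -v t1="$2" -v v2="$3" -v t2="$4" 'BEGIN {
    delta = (v2 >= v1) ? v2 - v1 : v2
    printf "%.2f\n", delta / (t2 - t1)
  }'
}

counter_rate 1000 0 4000 300   # 3000 bytes over 300 s: prints 10.00
counter_rate 5000 0 600 300    # reset in between:      prints 2.00
```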

Another classic pitfall is assuming all metrics are available everywhere. A metric like node_cpu_seconds_total, with its mode="idle" label, shows up on most platforms. But node_disk_io_time_seconds_total? That might only exist on Linux, with specific kernels, and for certain disk types. Always check the /metrics output on your actual target systems. The abstraction is leaky, and node_exporter faithfully exposes those leaks, which is actually a good thing. It tells you what the OS is providing, warts and all.
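One way to make that check routine is to keep a list of the metrics your dashboards and alerts depend on, and test each target for them. A minimal sketch, assuming you feed it the /metrics text on stdin; missing_metrics is a hypothetical helper.

```shell
# Report which of the named metrics are absent from /metrics text on stdin.
# Prints one missing name per line; exit status is 0 only if none are missing.
# missing_metrics is a hypothetical helper, not part of node_exporter.
missing_metrics() {
  local text name status=0
  text="$(cat)"
  for name in "$@"; do
    # Same matching rule as a sample line: name followed by "{" or a space.
    if ! printf '%s\n' "$text" | grep -q "^${name}[{ ]"; then
      echo "$name"
      status=1
    fi
  done
  return "$status"
}

# Usage against a live exporter (assumes the default port 9100):
#   curl -s http://localhost:9100/metrics | \
#     missing_metrics node_cpu_seconds_total node_disk_io_time_seconds_total
```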