38.3 sar: System Activity Reporter and Historical Data

Right, let’s talk about sar. This is the tool you use when you get that 3 AM alert about a server being “slow” and you need to figure out what actually happened six hours ago. While top or htop show you the glorious, burning dumpster fire of the present moment, sar is the historian who kept meticulous, if slightly dry, notes on the entire blaze. It’s part of the sysstat package, and if you’re not installing that by default on every Linux system you touch, we need to have a different conversation first.

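The sheer genius of sar is its scheduling. Once installed and enabled (sudo systemctl enable sysstat), a cron job or systemd timer quietly runs the collector every 10 minutes, gathering a snapshot of your system’s vital signs: CPU, memory, disk I/O, network, you name it. It stashes this data in binary files, one per day, in /var/log/sa/ on RHEL-family systems or /var/log/sysstat/ on Debian and Ubuntu (like sa21 for the 21st of the month). How far back you can go depends on the retention configured in the sysstat config file (the HISTORY setting), which is often only a week or a month out of the box, so raise it if you want more. Either way, you have days, weeks, or even months of data to autopsy, not just your fuzzy recollection that “the CPU was, like, really high.”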
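Since the data directory differs between distribution families, a tiny helper saves some guessing. This is a minimal sketch: find_sa_dir is a hypothetical name, and the two default paths are the common packaging locations rather than a guarantee for your system.

```shell
# find_sa_dir (hypothetical helper): print the first directory that exists.
# With no arguments it tries the two common sysstat locations (an assumption;
# your distribution may use a different path).
find_sa_dir() {
    if [ "$#" -eq 0 ]; then
        set -- /var/log/sa /var/log/sysstat
    fi
    for sa_candidate in "$@"; do
        if [ -d "$sa_candidate" ]; then
            printf '%s\n' "$sa_candidate"
            return 0
        fi
    done
    return 1   # nothing found: sysstat is probably not collecting yet
}

# Example: list the daily files that are available to query.
#   ls -l "$(find_sa_dir)"
```

If the function returns nothing, that is your cue to read the next part before the server catches fire.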

Installing and Enabling sysstat

First things first, let’s get it installed. On most distributions, it’s a simple:

# For Debian/Ubuntu
sudo apt-get install sysstat

# For RHEL/CentOS/Fedora
sudo dnf install sysstat  # or use yum on older versions

Now, here’s the first “questionable choice” you need to fix. On Debian and Ubuntu, data collection may be disabled by default in /etc/default/sysstat. Ensure it’s set to ENABLED="true". Then restart the service: sudo systemctl restart sysstat. Give it a few minutes to run its first collection. If you’re impatient (I am), you can run a collection manually: sudo /usr/lib/sysstat/sa1 1 1. That path is the Debian/Ubuntu one; on RHEL-family systems the script lives at /usr/lib64/sa/sa1.
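If you manage more than a couple of Debian-style hosts, checking that setting by eye gets old fast. Here is a small sketch of a check you could script; collection_enabled is a hypothetical name, and the default path assumes the Debian/Ubuntu layout described above.

```shell
# collection_enabled (hypothetical helper): succeed if the given config file
# turns collection on. Defaults to the Debian/Ubuntu location (an assumption;
# RHEL-family systems do not use this file).
collection_enabled() {
    grep -Eq '^[[:space:]]*ENABLED="?true"?' "${1:-/etc/default/sysstat}" 2>/dev/null
}

# Example:
#   collection_enabled || echo "sysstat collection is switched off on $(hostname)"
```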

Querying Today’s Data in Real-ish Time

The simplest way to use sar is to look at today’s data. The -f flag points sar at a specific file; omit it and sar reads today’s. Want to see the CPU usage every 10 minutes since midnight? Run it with no options at all, since CPU utilization is the default report:

sar
# Output:
# Linux 5.15.0-86-generic (hostname)     10/26/2023     _x86_64_    (4 CPU)
#
# 12:00:01 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
# 12:10:01 AM     all      1.01      0.00      0.38      0.03      0.00     98.58
# 12:20:01 AM     all      0.89      0.00      0.31      0.02      0.00     98.78
# ... (and so on, all day) ...

This is useful, but it’s a lot of data. Let’s say you only care about the disk I/O statistics. Use the -d flag.

sar -d
# Output:
# 12:00:01 AM       DEV       tps     rkB/s     wkB/s   await   areq-sz   aqu-sz     %util
# 12:10:01 AM    dev8-0      2.21      8.27     40.12    1.23     21.91     0.00      0.27
# 12:10:01 AM   dev8-16     12.45    145.33    212.89    2.11     28.78     0.03      2.62

The %util column here is a classic pitfall. People see 100% and think “the disk is maxed out!” But %util only measures the share of the sampling interval during which the device had at least one request outstanding. Modern SSDs and RAID arrays service many requests in parallel, so they can sit at 100% and still have headroom. You need to look at tps (transfers per second), rkB/s and wkB/s (read/write KB/s), and especially await (average I/O response time in ms, queueing included) to get the real performance picture. A high %util with a low await is usually fine; a high %util with a high await means your storage is choking.
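To make the “high %util and high await” rule something you can grep for, here is a hedged sketch. flag_slow_disks is a hypothetical helper, and the default thresholds (80% and 20 ms) are arbitrary placeholders to tune for your hardware. It finds columns by header name because the column order of sar -d differs between sysstat versions.

```shell
# flag_slow_disks (hypothetical helper): read `sar -d` text on stdin and print
# only the samples where BOTH %util and await reach the given thresholds.
# $1 = minimum %util (default 80), $2 = minimum await in ms (default 20);
# both defaults are illustrative assumptions, not recommendations.
flag_slow_disks() {
    awk -v util_min="${1:-80}" -v await_min="${2:-20}" '
        # Header row: remember which field holds each metric.
        /%util/ {
            for (i = 1; i <= NF; i++) {
                if ($i == "%util") util_col = i
                if ($i == "await") await_col = i
                if ($i == "DEV")   dev_col = i
            }
            next
        }
        # Data row: skip short lines and the trailing "Average:" summary,
        # whose fields are shifted relative to the header.
        util_col && await_col && dev_col && NF >= util_col && $1 != "Average:" {
            if ($util_col + 0 >= util_min && $await_col + 0 >= await_min) {
                ts = $1
                for (i = 2; i < dev_col; i++) ts = ts " " $i   # keep AM/PM if present
                print ts, $dev_col, "util=" $util_col, "await=" $await_col
            }
        }
    '
}

# Example (assumes sysstat is installed):
#   sar -d | flag_slow_disks 90 10
```

An empty result means no sample crossed both lines at once, which is usually the answer you were hoping for.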

Delving into the Historical Archives

This is where sar pays for your entire retirement plan. You use the -f flag to point to a specific data file in /var/log/sa/. The files are named saXX where XX is the day of the month. To look at the CPU data from the 21st:

sar -f /var/log/sa/sa21

Now, let’s say the complaint was “the database was slow between 9 AM and 11 AM on the 21st.” You can filter by time with the -s (start) and -e (end) flags. The syntax is hilariously archaic (hh:mm:ss), a clear design choice from when bell-bottoms were in fashion. The times are always given on a 24-hour clock, even if your locale prints the report with AM and PM.

sar -s 09:00:00 -e 11:00:00 -f /var/log/sa/sa21
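Working out which saXX file matches “yesterday” or “last Tuesday” is easy to get wrong at 3 AM, so it can be worth letting date do the arithmetic. A minimal sketch follows: sa_file_for is a hypothetical helper, it relies on the -d option of GNU date, and the default directory is an assumption you may need to change.

```shell
# sa_file_for (hypothetical helper): print the data file path for a given day.
# $1 = any date string GNU date understands ("yesterday", "2023-10-21", ...)
# $2 = data directory (defaults to /var/log/sa, an assumption)
sa_file_for() {
    printf '%s/sa%s\n' "${2:-/var/log/sa}" "$(date -d "$1" +%d)"
}

# Example (assumes sysstat is installed and the file still exists):
#   sar -s 09:00:00 -e 11:00:00 -f "$(sa_file_for yesterday)"
```

Because the files are named by day of the month only, a file older than your retention window may already have been overwritten or rotated away, so check its modification date before trusting it.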

A Practical Example: The Memory Mystery

Someone reports an out-of-memory killer event around 2 PM yesterday. Let’s investigate memory usage. The key flag is -r.

sar -r -s 13:30:00 -e 14:30:00 -f /var/log/sa/sa25
# Output:
# 01:40:01 PM kbmemfree   kbavail kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
# 01:50:01 PM    267488   1044676   7416940     96.51     92396   2912904  10350792    134.71   4659824   2132684       276
# 02:00:01 PM     15364    769128   7662064     99.68     13308   1266224  11572564    150.63   6396500    824044       120
# 02:10:01 PM    268044   1046232   7418384     96.52     93404   2914584  10350928    134.71   4661208   2133984       292

Bingo. Look at that line for 2:00 PM. kbmemfree is terrifyingly low (~15 MB), %memused is at 99.68%, and %commit shows that kbcommit (the kernel’s estimate of the memory needed for current workloads) has reached 150% of RAM and swap combined. The system was drowning, and an OOM kill is exactly what you would expect to follow. The sar data doesn’t lie. You’ve just moved from “something happened” to “here is the exact metric and time of the failure.”
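Spotting the worst row by eye works for three samples but not for a full day. The sketch below pulls out the sample with the highest %commit from sar -r output. peak_commit is a hypothetical helper name, and it locates the column by its header rather than by position, since the set of memory columns varies between sysstat versions.

```shell
# peak_commit (hypothetical helper): read `sar -r` text on stdin and print the
# timestamp and value of the sample with the highest %commit.
peak_commit() {
    awk '
        # Header row: find the %commit column by name.
        /%commit/ {
            for (i = 1; i <= NF; i++) if ($i == "%commit") col = i
            next
        }
        # Data row: track the maximum, ignoring the "Average:" summary line.
        col && NF >= col && $1 != "Average:" && ($col + 0) > max {
            max = $col + 0
            when = $1
            if ($2 == "AM" || $2 == "PM") when = when " " $2
        }
        END { if (when != "") printf "%s %%commit=%.2f\n", when, max }
    '
}

# Example (assumes sysstat is installed and the file exists):
#   sar -r -f /var/log/sa/sa25 | peak_commit
```

Keep in mind that these are samples taken every 10 minutes, so the true peak may have fallen between two of them.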

The best practice? Make sar your first responder. Its data is objective, historical, and comprehensive. The pitfall? Not installing it before you need it. If you wait until the server is on fire, you’ll have no history to learn from. And that, my friend, is a tragedy far more absurd than its command-line flags.