38.3 sar: System Activity Reporter and Historical Data
Right, let’s talk about sar. This is the tool you use when you get that 3 AM alert about a server being “slow” and you need to figure out what actually happened six hours ago. While top or htop show you the glorious, burning dumpster fire of the present moment, sar is the historian who kept meticulous, if slightly dry, notes on the entire blaze. It’s part of the sysstat package, and if you’re not installing that by default on every Linux system you touch, we need to have a different conversation first.
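The sheer genius of sar is its scheduling. Once installed and enabled (sudo systemctl enable sysstat), a cron job or systemd timer, depending on the distribution, quietly runs the collector every 10 minutes, gathering a snapshot of your system’s vital signs: CPU, memory, disk I/O, network, you name it. It stashes this data in binary files in /var/log/sa/ on RHEL-family systems or /var/log/sysstat/ on Debian and Ubuntu (like sa21 for the 21st of the month). How far back you can look depends on the retention setting (the HISTORY value in the sysstat configuration file), which typically defaults to somewhere between a week and a month, so raise it if you want more. Either way, you have real data to autopsy, not just your fuzzy recollection that “the CPU was, like, really high.”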
Installing and Enabling sysstat
First things first, let’s get it installed. On most distributions, it’s a simple:
# For Debian/Ubuntu
sudo apt-get install sysstat
# For RHEL/CentOS/Fedora
sudo dnf install sysstat # or use yum on older versions
Now, here’s the first “questionable choice” you need to fix. On Debian and Ubuntu, data collection may be disabled by default in /etc/default/sysstat. Ensure it’s set to ENABLED="true". Then restart the service: sudo systemctl restart sysstat. Give it a few minutes to run its first collection. If you’re impatient (I am), you can run a collection manually: sudo /usr/lib/sysstat/sa1 1 1. That path is the Debian/Ubuntu one; on RHEL-family systems the script lives in /usr/lib64/sa/.
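If you’d rather check that setting than trust your memory of it, a small helper can do the looking. This is a minimal sketch, assuming the Debian/Ubuntu-style file described above; the function name sysstat_enabled is made up for illustration and is not part of sysstat.

```shell
# Hypothetical helper: succeed if the sysstat defaults file enables collection.
# The default path is the Debian/Ubuntu location mentioned in the text;
# pass a different file as the first argument if yours lives elsewhere.
sysstat_enabled() {
  grep -Eq '^[[:space:]]*ENABLED="?true"?[[:space:]]*$' \
    "${1:-/etc/default/sysstat}" 2>/dev/null
}

# Usage:
#   sysstat_enabled && echo "collection enabled" || echo "collection disabled"
```

A missing file counts as “disabled” here, which is the safe assumption on a box you have never seen before.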
Querying Today’s Data in Real-ish Time
The simplest way to use sar is to look at today’s data. The -f flag specifies a data file; omit it and sar reads today’s. With no other options you get the CPU report. Want to see the CPU usage every 10 minutes since midnight?
sar
# Output:
# Linux 5.15.0-86-generic (hostname) 10/26/2023 _x86_64_ (4 CPU)
#
# 12:00:01 AM CPU %user %nice %system %iowait %steal %idle
# 12:10:01 AM all 1.01 0.00 0.38 0.03 0.00 98.58
# 12:20:01 AM all 0.89 0.00 0.31 0.02 0.00 98.78
# ... (and so on, all day) ...
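A full day at 10-minute intervals is 144 rows, and your eyes will glaze over long before the interesting one. The sketch below pulls out the busiest samples by sorting on %idle. It assumes the text layout shown above and finds columns by their header names; busiest_cpu_samples is a hypothetical helper name, not a sar feature.

```shell
# Hypothetical helper: print the N samples with the lowest %idle
# (default 5), reading plain "sar" text output on stdin.
busiest_cpu_samples() {
  awk '
    # Header row: remember the position of every column name.
    /%idle/ {
      for (i = 1; i <= NF; i++) col[$i] = i
      hdr_nf = NF
      next
    }
    # Data rows for the "all" CPU aggregate; summary rows are skipped.
    hdr_nf && NF == hdr_nf && $1 != "Average:" && $(col["CPU"]) == "all" {
      stamp = $1
      for (i = 2; i < col["CPU"]; i++) stamp = stamp " " $i
      print $(col["%idle"]), stamp
    }
  ' | LC_ALL=C sort -n | head -n "${1:-5}"
}

# Usage:
#   sar | busiest_cpu_samples 3
```

Each output line is the %idle value followed by its timestamp, lowest idle first.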
This is useful, but it’s a lot of data. Let’s say you only care about the disk I/O statistics. Use the -d flag.
sar -d
# Output:
# 12:00:01 AM DEV tps rkB/s wkB/s await areq-sz aqu-sz %util
# 12:10:01 AM dev8-0 2.21 8.27 40.12 1.23 21.91 0.00 0.27
# 12:10:01 AM dev8-16 12.45 145.33 212.89 2.11 28.78 0.03 2.62
The %util column here is a classic pitfall. People see 100% and think “the disk is maxed out!” But for modern SSDs and RAID arrays, that’s often not the case. It only means the device had at least one request in flight for the entire sampling interval, and hardware that services many requests in parallel can be “100% busy” with plenty of headroom left. You need to look at tps (transfers per second), rkB/s and wkB/s (read and write KB/s), and especially await (average I/O response time in ms, queueing included) to get the real performance picture. A high %util with a low await is usually fine; a high %util with a high await means your storage is choking.
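That “busy and slow” rule is easy to turn into a filter. Here is a rough sketch, assuming sar -d text output like the sample above: it looks up %util and await by header name, so a different column order between sysstat versions should not break it. The function name flag_slow_disks and the thresholds in the usage line are illustrative assumptions, not recommended values.

```shell
# Hypothetical helper: list samples where a device is both busy and slow.
#   $1 = minimum %util, $2 = minimum await in ms; "sar -d" text on stdin.
flag_slow_disks() {
  awk -v util_min="$1" -v await_min="$2" '
    # Header row: remember the position of every column name.
    /DEV/ && /%util/ {
      for (i = 1; i <= NF; i++) col[$i] = i
      hdr_nf = NF
      next
    }
    # Data rows: keep only those over both thresholds.
    hdr_nf && NF == hdr_nf && $1 != "Average:" {
      util = $(col["%util"])
      wait = $(col["await"])
      if (util + 0 >= util_min + 0 && wait + 0 >= await_min + 0)
        print $(col["DEV"]), "util=" util, "await=" wait
    }
  '
}

# Usage (thresholds are placeholders; tune them for your hardware):
#   sar -d | flag_slow_disks 90 20
```

Checking both columns together is the point: a device that trips only the %util threshold stays off the list.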
Delving into the Historical Archives
This is where sar pays for your entire retirement plan. You use the -f flag to point to a specific data file in /var/log/sa/ (or /var/log/sysstat/ on Debian and Ubuntu). The files are named saXX where XX is the two-digit day of the month. To look at the CPU data from the 21st:
sar -f /var/log/sa/sa21
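Working out which saXX file matches “last Tuesday” at 3 AM is exactly the kind of arithmetic you will get wrong. A small sketch that does it for you, assuming GNU date (for the -d option) and the two data directories mentioned in this section; sa_file_for is a hypothetical helper name.

```shell
# Hypothetical helper: print the sar data file path for a given date.
#   $1 = any date string GNU date understands ("yesterday", "2023-10-21")
#   $2 = data directory (optional; otherwise the first of the two common
#        locations that exists, falling back to the last one tried)
sa_file_for() {
  when="$1"
  dir="${2:-}"
  if [ -z "$dir" ]; then
    for dir in /var/log/sa /var/log/sysstat; do
      [ -d "$dir" ] && break
    done
  fi
  # %d gives the zero-padded day of the month, matching the saXX naming.
  printf '%s/sa%s\n' "$dir" "$(date -d "$when" +%d)"
}

# Usage:
#   sar -f "$(sa_file_for yesterday)"
```

The helper only builds the path. Whether the file still exists depends on how many days of history your system keeps.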
Now, let’s say the complaint was “the database was slow between 9 AM and 11 AM on the 21st.” You can filter by time with the -s (start) and -e (end) flags. The syntax is hilariously archaic (hh:mm:ss, always in 24-hour time, even when the output itself prints AM and PM), a clear design choice from when bell-bottoms were in fashion.
sar -s 09:00:00 -e 11:00:00 -f /var/log/sa/sa21
A Practical Example: The Memory Mystery
Someone reports an out-of-memory killer event around 2 PM yesterday. Let’s investigate memory usage. The key flag is -r.
sar -r -s 13:30:00 -e 14:30:00 -f /var/log/sa/sa25
# Output:
# 01:40:01 PM kbmemfree kbavail kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty
# 01:50:01 PM 267488 1044676 7416940 96.51 92396 2912904 10350792 134.71 4659824 2132684 276
# 02:00:01 PM 15364 769128 7662064 99.68 13308 1266224 11572564 150.63 6396500 824044 120
# 02:10:01 PM 268044 1046232 7418384 96.52 93404 2914584 10350928 134.71 4661208 2133984 292
Bingo. Look at that line for 2:00 PM. kbmemfree is terrifyingly low (~15 MB), kbavail has dropped by roughly a quarter, %memused is at 99.68%, and %commit says kbcommit (the kernel’s estimate of the memory needed to back the current workload) has reached 150% of RAM and swap combined. The system was absolutely drowning, and an OOM kill at that moment is exactly what you’d expect; the kernel log (dmesg or journalctl -k) will confirm which process got shot. The sar data doesn’t lie. You’ve just moved from “something happened” to “here is the exact metric and time of the failure.”
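Spotting the bad row by eye works for three samples; it works less well for a whole day. The sketch below finds the sample with the lowest kbavail and reports its %commit alongside. It assumes sar -r text output that includes a kbavail column, as in the example above (older sysstat releases lack it, in which case this prints nothing). The name lowest_kbavail is a hypothetical helper.

```shell
# Hypothetical helper: report the sample with the least available memory,
# reading "sar -r" text output on stdin.
lowest_kbavail() {
  awk '
    # Header row: remember the position of every column name.
    /kbmemfree/ {
      for (i = 1; i <= NF; i++) col[$i] = i
      hdr_nf = NF
      next
    }
    # Data rows: track the minimum kbavail and the row it came from.
    hdr_nf && NF == hdr_nf && $1 != "Average:" && ("kbavail" in col) {
      avail = $(col["kbavail"]) + 0
      if (!seen || avail < min) {
        seen = 1
        min = avail
        stamp = $1
        for (i = 2; i < col["kbmemfree"]; i++) stamp = stamp " " $i
        commit = $(col["%commit"])
      }
    }
    END {
      if (seen) printf "%s kbavail=%d %%commit=%s\n", stamp, min, commit
    }
  '
}

# Usage:
#   sar -r -f /var/log/sa/sa25 | lowest_kbavail
```

Run against the three samples above, this picks out the 2:00 PM row, the same one your eyes landed on.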
The best practice? Make sar your first responder. Its data is objective, historical, and comprehensive. The pitfall? Not installing it before you need it. If you wait until the server is on fire, you’ll have no history to learn from. And that, my friend, is a tragedy far more absurd than its command-line flags.