38.4 perf: Linux Performance Counters and Profiling

Right, let’s talk about perf. If you’re serious about figuring out why your code is slow, burning CPU, or stalling on memory on Linux, this is your new best friend. It’s not just one tool; it’s a userspace front end to the kernel’s perf_events subsystem, with a sprawl of subcommands, and it’s so powerful it’s almost absurd that it’s free. Forget guessing. We’re moving to evidence-based profiling.

Think of perf as a high-speed data recorder for your CPU. It can tell you which functions and lines are hot, what’s causing those pesky cache misses, and where the kernel is spending its time on your behalf. It does this mainly through the Performance Monitoring Unit (PMU), a set of hardware counters on the CPU itself, plus software events and tracepoints maintained by the kernel. Counting events this way costs almost nothing, and sampling overhead stays low at sensible frequencies. We’re not talking about adding print statements here; we’re talking about directly querying the processor’s internal statistics.
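To make "directly querying the processor" a bit more concrete, here is a minimal sketch of what sits underneath the perf command: the perf_event_open system call. It opens one hardware counter (instructions retired in user space), runs a piece of work, and reads the count back. The helper names count_instructions and spin are made up for illustration, and the whole thing assumes a Linux machine whose kernel settings and hardware (or hypervisor) actually expose the PMU; when they don't, the helper just returns -1.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: counts user-space instructions retired while work()
 * runs. Returns the count, or -1 if the counter could not be opened or read
 * (for example when kernel.perf_event_paranoid forbids it, or when a VM or
 * container exposes no PMU). */
long long count_instructions(void (*work)(void))
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;        /* start stopped; enabled explicitly below */
    attr.exclude_kernel = 1;  /* user space only */
    attr.exclude_hv = 1;

    /* glibc has no wrapper for perf_event_open, so go through syscall(2).
     * pid = 0, cpu = -1: measure this thread on whichever CPU it runs. */
    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
    if (fd < 0)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    ssize_t n = read(fd, &count, sizeof(count));
    close(fd);
    return n == (ssize_t)sizeof(count) ? (long long)count : -1;
}

/* Example workload: volatile keeps the compiler from deleting the loop. */
void spin(void)
{
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 1000000; i++)
        x += i;
}
```

Calling count_instructions(spin) from a main function and printing the result gives a rough idea of what perf stat does for each event it reports. In practice you let perf do this for you; the sketch is only here to show there is no magic involved.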

First, Get It and See What It Can Do

Your distro probably has it. On Ubuntu, it’s linux-tools-common and linux-tools-$(uname -r), so the tools match your running kernel; on Debian, the package is linux-perf. Install it. Then, just type perf and behold the list of subcommands. This is your toolbox.

sudo apt update && sudo apt install linux-tools-common "linux-tools-$(uname -r)"
perf

You’ll see a list of commands like stat, record, report, top, and mem. We’ll get to the good ones in a second.

The Gateway Drug: `perf stat`

Before you dive into full profiling, just get a feel for what’s happening. perf stat runs a command and gives you a summary of key hardware and software events. It’s like a full-body scan for your program.

Let’s run it on something simple, like listing a directory.

perf stat ls

You’ll get beautiful, terrifying output like this:

 Performance counter stats for 'ls':

              1.21 msec task-clock:u               #    0.747 CPUs utilized
                 0      context-switches:u         #    0.000 /sec
                 0      cpu-migrations:u           #    0.000 /sec
               106      page-faults:u              #   87.595 K/sec
         1,234,567      cycles:u                   #    1.020 GHz
         1,543,210      instructions:u             #    1.25  insn per cycle
           321,098      branches:u                 #  265.361 M/sec
            12,345      branch-misses:u            #    3.84% of all branches

       0.001617400 seconds time elapsed
       0.001686000 seconds user
       0.000000000 seconds sys

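Look at that. Instructions per cycle (IPC) is 1.25? Not bad. A branch-miss rate of nearly 4% for an ls? Maybe a bit high, but it’s a short-lived process. The :u suffix on each event means only user-space activity was counted, which is what you typically get when running perf without elevated privileges. This is the kind of baseline intuition you start to build. A low IPC (say, below 1.0) often means your code is stalling, waiting for memory, though the exact threshold varies by CPU and workload.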
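The numbers in the right-hand column are just ratios of the raw counts on the left, and it helps to know how to recompute them when you are comparing runs. A minimal sketch, with function names invented for this example:

```c
#include <stdint.h>

/* Instructions per cycle: the "insn per cycle" figure. */
double ipc(uint64_t instructions, uint64_t cycles)
{
    return cycles ? (double)instructions / (double)cycles : 0.0;
}

/* Branch-miss rate as a percentage of all branches. */
double branch_miss_pct(uint64_t misses, uint64_t branches)
{
    return branches ? 100.0 * (double)misses / (double)branches : 0.0;
}

/* Effective clock in GHz: cycles divided by task-clock time in nanoseconds. */
double effective_ghz(uint64_t cycles, double task_clock_msec)
{
    return task_clock_msec > 0.0
        ? (double)cycles / (task_clock_msec * 1e6)
        : 0.0;
}
```

Fed the counts from the sample output above, ipc(1543210, 1234567) comes out at about 1.25, branch_miss_pct(12345, 321098) at about 3.84, and effective_ghz(1234567, 1.21) at about 1.02, matching what perf stat printed.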

Actually Profiling: `perf record` and `report`

Now for the main event. perf record runs your command and samples where it is executing at high frequency (typically thousands of times per second), optionally capturing the call stack with each sample. It writes this data to a perf.data file in the current directory. Then, perf report dissects that file into a navigable breakdown of where time was spent.

Let’s profile a naive Fibonacci function, the classic example of terrible performance.

First, create a simple C program, fib.c:

#include <stdio.h>

long long fib(int n) {
    if (n <= 1) return n;
    return fib(n-1) + fib(n-2);
}

int main(void) {
    printf("Fibonacci: %lld\n", fib(45));
    return 0;
}

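Compile it with debug symbols (-g). Treat this as non-negotiable. Strictly speaking, perf reads function names from the symbol table, so an unstripped binary shows names even without -g; what -g adds is source-line detail in perf annotate and sensible handling of inlined functions. It’s a stripped binary that makes perf show you hex addresses instead of function names, and then you will cry.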

gcc -g -o fib fib.c

Now, profile it:

perf record -g -- ./fib

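The -g option tells it to record call graphs (the stack). By default it walks frame pointers, which GCC keeps in an unoptimized build like this one; for optimized builds, compile with -fno-omit-frame-pointer or use --call-graph dwarf instead. Let it run. It’ll take a while because this algorithm is comically bad. Now, look at the results: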

perf report

You are now staring at a hierarchical view of your program’s misery. You’ll see a percentage of samples in each function. I can guarantee you’ll see something like 99.9% of samples inside the fib function, and by drilling down (using the arrow keys and enter), you’ll see the call graph dominated by recursive calls to fib. This is perf holding up a mirror and showing you the exact nature of your performance problem. It’s humbling and incredibly effective.
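Once the profile has pointed at fib, the obvious follow-up is to change the algorithm and profile again. As one possible fix (not the only one; memoization would also do), here is an iterative version. The name fib_iter is made up for this example, and it assumes the result fits in a long long, which holds for n = 45.

```c
/* Iterative replacement for fib(): roughly n additions instead of an
 * exponential number of recursive calls. Returns the same values as the
 * recursive version, including returning n itself when n <= 1. */
long long fib_iter(int n)
{
    long long prev = 0, curr = 1;

    if (n <= 1)
        return n;
    for (int i = 2; i <= n; i++) {
        long long next = prev + curr;
        prev = curr;
        curr = next;
    }
    return curr;
}
```

Swap it into fib.c, rebuild, and run perf record again. The program should now finish so quickly that perf collects only a handful of samples, which is its own kind of confirmation.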

Why Your Perf Output Sucks (Common Pitfalls)

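No Symbols: I already said it, but I’ll say it again. If perf report shows a bunch of hex addresses or [unknown], the binary was probably stripped or perf can’t find its symbols. Build your own code with -g and don’t strip it. If you’re using a packaged binary, you often need the matching -dbgsym or debuginfo package.
Missing Kernel Symbols: Sometimes you need to see time spent in the kernel. perf resolves kernel function names from /proc/kallsyms, which hides addresses from unprivileged users when kernel.kptr_restrict is set. For source-level annotation of kernel code, you need the vmlinux image with debug symbols for your exact kernel. This is often a pain to get. Distributions, in their infinite wisdom, often don’t ship this by default because it’s large. It’s a questionable choice, but we live with it.
Not Enough Privilege: What a normal user can do is controlled by the kernel.perf_event_paranoid sysctl. At restrictive settings you are limited to user-space counts for your own processes, or to nothing at all; system-wide profiling, kernel samples, and tracepoints generally need root or a lowered setting. Use sudo if you’re serious.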
Sampling Too Low (or High): The default sampling frequency is usually fine. But for very short-lived programs, you might need to use -F to crank the frequency up to catch enough events. Conversely, for a long-running program, sampling too high will generate a gigantic perf.data file. Be sensible.

Beyond CPU: `perf mem` and `perf c2c`

perf isn’t just about CPU cycles. Modern versions are brilliant for memory analysis.

perf mem record samples memory loads and stores on CPUs that support it, and perf mem report shows which code is responsible, where the data came from (a cache level or main memory), and how long the access took. This is where you find the real performance killers in data-heavy applications.
perf c2c (False Sharing Detector) is black magic. It can detect “false sharing” – where two threads write to different variables that happen to reside on the same CPU cache line, forcing the cores to invalidate and update the cache constantly. It’s a silent performance murderer, and c2c will point a giant arrow at the offending variables.
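If you have never seen false sharing up close, it helps to build a small victim for perf c2c to find. The sketch below lays two counters side by side in one struct, so they normally share a cache line, and also provides a padded variant where each counter gets its own line. Everything here is illustrative: the struct and function names are invented, and the 64-byte line size is an assumption that holds on most x86-64 parts but not everywhere.

```c
#include <pthread.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumption: 64-byte cache lines */

/* Adjacent counters: they normally sit on the same cache line. */
struct counters_shared {
    volatile long a;
    volatile long b;
};

/* Each counter is aligned to its own (assumed) cache line. */
struct counters_padded {
    _Alignas(CACHE_LINE) volatile long a;
    _Alignas(CACHE_LINE) volatile long b;
};

struct job {
    volatile long *counter;
    long iterations;
};

/* Thread body: increment only the counter this thread was handed. */
static void *bump(void *arg)
{
    struct job *j = arg;
    for (long i = 0; i < j->iterations; i++)
        (*j->counter)++;
    return NULL;
}

/* Hypothetical helper: runs two threads, each incrementing its own counter.
 * Returns 0 on success, -1 if a thread could not be created. */
int run_pair(volatile long *a, volatile long *b, long iterations)
{
    struct job ja = { a, iterations }, jb = { b, iterations };
    pthread_t ta, tb;

    if (pthread_create(&ta, NULL, bump, &ja) != 0)
        return -1;
    if (pthread_create(&tb, NULL, bump, &jb) != 0) {
        pthread_join(ta, NULL);
        return -1;
    }
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    return 0;
}
```

To try it, call run_pair from a main function with a large iteration count, once on a counters_shared instance and once on a counters_padded one, build with gcc -g -pthread, and run each variant under perf c2c record followed by perf c2c report. On a multi-core machine the shared layout would be expected to show contention on a single cache line that the padded layout avoids, though how dramatic the difference looks depends on the hardware and on where the scheduler places the threads.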

The key with perf is to stop guessing. Let the hardware tell you the story. It’s the most knowledgeable, brutally honest friend your code will ever have.
