67.5 py-spy: Sampling Profiler for Production

Alright, let’s talk about something that actually works: py-spy. This is the profiling tool you use when your application is on fire in production and you can’t just restart it with cProfile attached. It’s a sampling profiler, which is a fancy way of saying it peeks at what your Python process is doing, at regular intervals, without your code having to know it’s being watched. It’s like a wildlife documentary filmmaker hiding in a bush, not a stage actor performing for a camera. The key thing here is that it’s low-overhead and safe to run on live, production systems.

The magic, and the reason it’s so safe, is that it works by directly reading the memory of the Python process from the outside, using whatever call the OS provides for that (process_vm_readv on Linux, vm_read on macOS, and ReadProcessMemory on Windows). This means it doesn’t require any code changes, doesn’t run inside your process, and won’t crash your app just because the profiler hiccups. It’s a separate process altogether, which is exactly what you want when debugging a live service.

Installing the Damn Thing

You don’t need the Rust toolchain to install it, even though, yes, a profiling tool for Python is written in Rust. I don’t make the rules, but in this case, the rules are excellent because it gives us a single, self-contained binary with no crazy dependencies. Prebuilt binaries ship on PyPI, so pip is usually enough; Cargo is the fallback.

# The easy way. Probably all you need.
pip install py-spy

# Or, if you're a purist or the pip version is borked (it happens),
# grab it from Cargo directly. You'll need the Rust toolchain.
cargo install py-spy

The Basic Spell: Seeing the Flames

The most immediately useful thing py-spy does is generate a flamegraph. This is your first stop for answering the question “why is my code so unbelievably slow?”

Find your application’s process ID (PID). Then, run this:

sudo py-spy record -o profile.svg --pid 12345

Let’s unpack this incantation. The sudo is often necessary because peeking into another process’s memory is a privileged operation (think of it as a medical ethics violation, but for computers). The -o profile.svg names the output file; a flamegraph SVG is the default format. The --pid flag attaches it to a running process. Now, let it run for 30 seconds, or a minute, then hit Ctrl+C (or pass --duration 60 and let it stop on its own). Open that profile.svg in a web browser. Congratulations, you now have a beautiful, clickable chart showing you which functions showed up in the most samples. The width of each block is proportional to how much of the sampled time was spent there. It’s hard to look at this and not immediately have a suspect.
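If you want the capture to be repeatable (say, from an incident runbook) rather than typed by hand, the same command can be assembled in a few lines of Python. This is a minimal sketch, not part of py-spy itself: build_record_command and record_profile are hypothetical helper names, and the only py-spy flags used are --pid, -o, --duration, --rate and --native. It does not add sudo for you; run the script with whatever privileges your system requires.

```python
import shutil
import subprocess


def build_record_command(pid, output="profile.svg", duration=30, rate=100, native=False):
    """Build the argument list for a time-boxed `py-spy record` run (hypothetical helper)."""
    cmd = [
        "py-spy", "record",
        "--pid", str(pid),
        "-o", output,
        "--duration", str(duration),  # stop after N seconds instead of waiting for Ctrl+C
        "--rate", str(rate),          # samples per second
    ]
    if native:
        cmd.append("--native")        # also collect native (C/C++) frames
    return cmd


def record_profile(pid, **options):
    """Run py-spy against `pid`. Returns its exit code, or None if py-spy is not on PATH."""
    if shutil.which("py-spy") is None:
        return None
    return subprocess.run(build_record_command(pid, **options), check=False).returncode
```

Calling record_profile(12345, duration=60) would then be the scripted equivalent of the command above with a one-minute time box.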

Why Sampling is Your Production Best Friend

You might be wondering, “Why not just use cProfile everywhere?” Because cProfile is a deterministic (tracing) profiler. It hooks every single function call and return. The overhead grows with the number of calls your code makes: 10-30% added latency is a plausible ballpark, and code built from many tiny function calls can fare much worse, which in a production environment is completely unacceptable. It’s like trying to diagnose a car’s engine problem by putting a microphone inside each cylinder: you’ll get amazing data, but you’ll also change the way the engine runs, probably for the worse.
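To make “every single function call” concrete, here is a small sketch that runs a call-heavy loop under cProfile and asks how many calls it recorded. The functions tiny, call_heavy and count_profiled_calls are made up for illustration; the point is only that the profiler’s bookkeeping grows in lockstep with the number of calls.

```python
import cProfile
import pstats


def tiny(x):
    """A deliberately trivial function: the call costs more than the work."""
    return x + 1


def call_heavy(n):
    """Call tiny() n times, the worst case for a tracing profiler."""
    total = 0
    for _ in range(n):
        total = tiny(total)
    return total


def count_profiled_calls(n):
    """Run call_heavy(n) under cProfile and return how many calls it recorded."""
    profiler = cProfile.Profile()
    profiler.enable()
    call_heavy(n)
    profiler.disable()
    return pstats.Stats(profiler).total_calls
```

With n set to a million, cProfile has handled at least a million call events to tell you something a few thousand samples would have revealed just as well.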

Sampling, by contrast, has minimal overhead, usually somewhere in the 1-5% range. By default py-spy pokes its head in 100 times a second (every 10 milliseconds, tunable with --rate), sees what the program is doing, and notes it down. If a function is showing up in a huge number of samples, it’s clearly a hot spot. It’s a statistical approach, and for finding your big, architectural performance problems, it’s more than precise enough. Don’t use a microscope when a pair of binoculars will do.
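The statistical idea fits in a few lines. The sketch below is a toy, in-process sampler built on sys._current_frames(); it is emphatically not how py-spy works (py-spy reads memory from a separate process), and every name in it (sample_thread, hot_spot, cheap_step, workload, profile) is invented for the illustration. It only shows why counting “what was running when I looked” finds the hot spot.

```python
import collections
import sys
import threading
import time


def sample_thread(thread_id, stop_event, counts, interval):
    """Every `interval` seconds, note which function `thread_id` is executing."""
    while not stop_event.is_set():
        frame = sys._current_frames().get(thread_id)
        if frame is not None:
            counts[frame.f_code.co_name] += 1
        time.sleep(interval)


def hot_spot(iterations):
    """Where nearly all of the time goes."""
    total = 0
    for i in range(iterations):
        total += i * i
    return total


def cheap_step():
    """Called just as often as hot_spot, but costs almost nothing."""
    return 1


def workload(seconds):
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        cheap_step()
        hot_spot(20_000)


def profile(target, *args, interval=0.005):
    """Run target(*args) on this thread while a helper thread samples it."""
    counts = collections.Counter()
    stop_event = threading.Event()
    sampler = threading.Thread(
        target=sample_thread,
        args=(threading.get_ident(), stop_event, counts, interval),
        daemon=True,
    )
    sampler.start()
    try:
        target(*args)
    finally:
        stop_event.set()
        sampler.join()
    return counts
```

Running profile(workload, 0.3) and looking at the most common entry should point at hot_spot, even though cheap_step is called exactly as many times. Call counts and time spent are different questions, and sampling answers the second one.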

Advanced Sorcery: Dumping and Diagnosing

Sometimes you can’t just run a profiler for a full minute. Sometimes the problem is right now. For those moments, you can get an immediate snapshot of what every thread in your process is doing.

py-spy dump --pid 12345

This will print out the current call stack for every thread. It’s the digital equivalent of kicking the machine and seeing which error messages fall out. It’s incredibly useful for diagnosing deadlocks or sudden latency spikes. You’ll see exactly which line of code each thread is stuck on.
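If you have never read a thread dump, it helps to produce one by hand. The sketch below builds a similar per-thread snapshot from inside a process using sys._current_frames(); py-spy gets the same kind of information from outside, without any cooperation from your code. The names snapshot_stacks, stuck_worker and demo are hypothetical, and the output format is only loosely modelled on a stack dump.

```python
import sys
import threading
import traceback


def snapshot_stacks():
    """Return {thread name: [stack lines, most recent call first]} for all threads."""
    names = {t.ident: t.name for t in threading.enumerate()}
    stacks = {}
    for thread_id, frame in sys._current_frames().items():
        name = names.get(thread_id, f"thread-{thread_id}")
        stacks[name] = [
            f"{fs.name} ({fs.filename}:{fs.lineno})"
            for fs in reversed(traceback.extract_stack(frame))
        ]
    return stacks


def stuck_worker(started, release):
    """Simulates a thread waiting on something that never arrives."""
    started.set()
    release.wait()  # blocks here until someone sets the event


def demo():
    """Start a blocked worker, snapshot every thread, then let the worker go."""
    started = threading.Event()
    release = threading.Event()
    worker = threading.Thread(target=stuck_worker, args=(started, release), name="worker")
    worker.start()
    started.wait()
    try:
        return snapshot_stacks()
    finally:
        release.set()
        worker.join()
```

In the snapshot returned by demo(), the "worker" entry contains a stuck_worker frame sitting under the threading internals it is blocked in. That is the shape to look for in a real dump: several threads all parked in a wait, a lock acquire, or a socket read.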

The One Major Gotcha: Native Extensions

Here’s the thing py-spy is brutally honest about: it’s fantastic for pure Python code. But if your performance bottleneck is buried inside a C extension, like something in numpy or cryptography, py-spy by default shows you a stack trace that just stops at the edge of Python. All the time lands on the last Python frame that called into native code, and you think, “well, that was helpful.”

For this, you need a different tool, like perf on Linux, which can profile all the way down to the native machine code. py-spy tries to help here with the --native flag, which instructs it to also show native C frames in the output. It’s not perfect, as it has to unwind compiled code, which is black magic, and it isn’t supported on every platform (check the README for your OS and architecture), but it’s often better than nothing.

# Try this if your bottleneck is hiding in a C library
sudo py-spy record --native -o profile.svg --pid 12345

Living in the Real World: A Practical Example

Let’s say you have a poorly-written Flask endpoint that’s slowing everything down. You’d run py-spy, get the flamegraph, and see that a function called expensive_calculation is taking 80% of the time. The fix isn’t to make py-spy run faster; it’s to go make expensive_calculation less… expensive. Maybe you add caching with functools.lru_cache, or you optimize the algorithm, or you realize you’re doing a database query in a loop like a maniac.
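As a sketch of the caching option, assuming expensive_calculation is a pure function of hashable arguments (if it reads a database or the clock, this fix is wrong): handle_request below is a hypothetical stand-in for the Flask view, so the example runs without Flask installed, and the body of expensive_calculation is a placeholder for whatever the flamegraph pointed at.

```python
import functools


@functools.lru_cache(maxsize=1024)
def expensive_calculation(n):
    """Placeholder for the real hot function; assumed pure and deterministic."""
    return sum(i * i for i in range(n))


def handle_request(n):
    """Stand-in for the endpoint body: same input, same answer, computed once."""
    return {"result": expensive_calculation(n)}
```

After two calls with the same argument, expensive_calculation.cache_info() reports one miss and one hit. The useful follow-up is to run py-spy again and confirm the wide block actually shrank, rather than trusting that it did.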

The value of py-spy isn’t just in showing you the problem—it’s in showing you the right problem. It cuts through the noise and points a giant, flashing arrow at the thing you actually need to fix. Stop guessing. Start sampling.