67.1 timeit: Micro-Benchmarking Code Snippets

Right, let’s talk about timeit. You’ve probably had a thought like, “Is method A faster than method B?” and then, like a chump, wrapped it in a pair of time.time() calls and ran it once. I’ve been there. The results are a lie. Your operating system is a chaotic, beautiful mess of processes fighting for CPU time, and your one-off measurement just captured a moment when a background antivirus scan decided to sigh heavily. We need to do better. timeit is how we do better. It’s the statistical sledgehammer we use to smash uncertainty about tiny, repetitive code.

The core idea is simple but brilliant: run the code snippet many times in a tight loop (timeit.timeit() defaults to one million iterations), measure the total time, and divide to get the average execution time per run. This dilutes the impact of random system events and limited timer resolution, and gives us a stable, comparable number. It’s built into Python’s standard library because the core developers know we need this. Often.
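To make the idea concrete, here is a hand-rolled sketch of the "loop, total, divide" approach. The helper name average_time is mine, not part of timeit, and the real module does more (it compiles the statement into the loop and pauses garbage collection, for instance), so treat this as an illustration of the principle rather than a replacement.

```python
import time


def average_time(func, number=100_000):
    """Hypothetical helper: time `number` calls as one batch, then divide."""
    start = time.perf_counter()
    for _ in range(number):
        func()
    total = time.perf_counter() - start
    return total / number  # average seconds per call


per_call = average_time(
    lambda: "-".join(str(n) for n in range(100)),
    number=10_000,
)
print(f"{per_call * 1e6:.2f} microseconds per call")
```

Timing the whole batch with one pair of clock reads, rather than timing each call separately, is what keeps the clock's own cost out of the per-call figure.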

The Two Ways to Wield timeit

You can use timeit from the command line for quick checks, or within a script for more formal benchmarking. The CLI is my go-to for rapid, “wait, is that really faster?” experiments.

python -m timeit "'-'.join(str(n) for n in range(100))"

This outputs something like: 10000 loops, best of 5: 23.4 usec per loop

Let’s decode that. Each batch ran the statement 10,000 times; since we didn’t pass -n, the CLI picked that loop count itself so that a batch takes long enough to time reliably. It then repeated the batch 5 times (the -r option, default 5 since Python 3.7) and reported the per-loop time from the fastest of those 5 repetitions. Why the best? Because noise from other processes can only make a run slower, never faster, so the minimum is the closest thing to what your machine can actually do. It’s a good assumption.
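If you want to see the moving parts behind that output line, the same steps can be approximated from Python with Timer.autorange() and Timer.repeat(). This is a sketch of the behaviour, not the CLI's actual source, and the print format is only my imitation of its output.

```python
import timeit

timer = timeit.Timer("'-'.join(str(n) for n in range(100))")

# autorange() tries loop counts 1, 2, 5, 10, 20, 50, ... until one batch
# takes at least 0.2 seconds, and returns (loop_count, time_for_that_batch).
loops, _ = timer.autorange()

# Five repetitions of that batch, matching the CLI's default of -r 5.
totals = timer.repeat(repeat=5, number=loops)

# The CLI reports the fastest batch, divided down to a per-loop figure.
best_per_loop = min(totals) / loops
print(f"{loops} loops, best of 5: {best_per_loop * 1e6:.1f} usec per loop")
```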

For more complex snippets, or to benchmark functions you’ve already written, you’ll want to use the Python interface.

import timeit

# Method 1: Using a string (works, but can be clunky with quotes)
time_with_string = timeit.timeit("'-'.join(str(n) for n in range(100))", number=100000)
print(f"String method: {time_with_string / 100000 * 1e6:.2f} microseconds")

# Method 2: Using a callable (the better way)
def list_comprehension_join():
    return '-'.join([str(n) for n in range(100)])

def generator_expression_join():
    return '-'.join(str(n) for n in range(100))

# timeit.Timer creates a timer object
timer_obj = timeit.Timer(list_comprehension_join)
time_taken = timer_obj.timeit(number=100000)
print(f"List comp time: {time_taken / 100000 * 1e6:.2f} microseconds")

# Let's compare them directly!
comp_time = timeit.timeit(list_comprehension_join, number=100000)
gen_time = timeit.timeit(generator_expression_join, number=100000)

print(f"List Comprehension: {comp_time:.4f}s")
print(f"Generator Expression: {gen_time:.4f}s")
print(f"Winner is {'List Comp' if comp_time < gen_time else 'Generator'} by {abs(comp_time - gen_time)/min(comp_time, gen_time)*100:.1f}%")

The setup Parameter You Didn’t Know You Needed

Here’s the first major pitfall: namespace. A statement passed as a string runs in a separate, barren wasteland of a namespace. It doesn’t know about your imports or the variables in your current module. You have to provide them explicitly, either with the setup parameter or, since Python 3.5, by handing timeit a namespace through the globals parameter. This is where most people get it wrong initially.

import timeit

# This will FAIL miserably because 'my_list' doesn't exist in the timed namespace
# my_list = [n for n in range(1000)]
# timeit.timeit('sum(my_list)', number=1000) # NameError: name 'my_list' is not defined

# This is the correct way. The setup code is run once to prime the environment.
setup_code = """
my_list = [n for n in range(1000)]
import math  # You can import inside setup too! (stdlib here, so the example runs anywhere)
"""

stmt = "sum(my_list)"
time_taken = timeit.timeit(stmt, setup=setup_code, number=10000)
print(f"Time: {time_taken:.4f} seconds for 10,000 runs")
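When the data already lives in your script, rebuilding it inside a setup string gets tedious. The globals parameter (Python 3.5+) lets the timed statement see names you pass in. A minimal sketch, reusing my_list from the example above:

```python
import timeit

my_list = list(range(1000))

# Hand the statement exactly the names it needs via a namespace dict.
time_taken = timeit.timeit(
    "sum(my_list)",
    globals={"my_list": my_list},
    number=10_000,
)
print(f"Time: {time_taken:.4f} seconds for 10,000 runs")
```

Passing globals=globals() instead exposes the whole current module to the statement, which is the quickest option in a throwaway script; the explicit dict just makes it obvious what the snippet depends on.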

repeat() and the Art of Statistical Skepticism

Sometimes timeit() isn’t enough. You want to see the distribution of results, not just the best. This is where timeit.repeat() shines. It returns a list with the total time for each repetition; the CLI’s “best of 5” we saw earlier is simply the minimum of such a list. This lets you check for variance. If one run is a massive outlier, maybe your code isn’t the problem; your system is.

import timeit

t = timeit.Timer("sum([1, 2, 3, 4, 5])")
results = t.repeat(repeat=5, number=1000000) # Do 1 million runs, 5 times.

print("Raw results for 5 repetitions (seconds for 1M runs):", results)
print("Best time per run:", min(results) / 1000000, "seconds")
print("Spread of results:", max(results) - min(results), "seconds. Big spread? System was busy.")
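Totals for a million runs are hard to eyeball, so I usually convert them to per-loop figures before comparing. The sketch below does that with the module-level timeit.repeat() and adds a median from the statistics module. The choice of nanoseconds and of 100,000 loops is mine, purely for readability.

```python
import statistics
import timeit

number = 100_000
results = timeit.repeat("sum([1, 2, 3, 4, 5])", repeat=5, number=number)

# Convert each batch total into nanoseconds per loop.
per_loop_ns = [total / number * 1e9 for total in results]

print(f"min    : {min(per_loop_ns):.1f} ns per loop")
print(f"median : {statistics.median(per_loop_ns):.1f} ns per loop")
print(f"max    : {max(per_loop_ns):.1f} ns per loop")
```

The minimum is still the number to quote when comparing two snippets; the median and maximum are there to tell you how noisy the machine was while you measured.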

The Gotchas and Grey Areas

timeit is brilliant, but it’s not a crystal ball. It measures the specific, isolated performance of a snippet. It does not account for real-world factors like cache pressure from the surrounding code, or garbage collection: by default timeit switches the garbage collector off while it times, so GC cost is left out of the number unless you re-enable it in setup.
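If garbage collection is part of what you care about, you can switch it back on from the setup code, since timeit disables it only for the duration of the timing. A small sketch; the allocation-heavy statement is just something I picked so the collector has work to do, and the size of any difference will depend on your workload and Python version.

```python
import timeit

# Allocates a thousand small container objects per run.
stmt = "[[] for _ in range(1000)]"

# Default behaviour: the collector is paused during timing.
gc_off = timeit.timeit(stmt, number=2_000)

# Setup runs after timeit has paused the collector, so this re-enables it.
gc_on = timeit.timeit(stmt, setup="import gc; gc.enable()", number=2_000)

print(f"GC disabled (default): {gc_off:.4f} s")
print(f"GC enabled via setup : {gc_on:.4f} s")
```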

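Another big trap is using it for multi-threaded or asynchronous code. timeit just measures the wall-clock time of running your statement in a loop, so a snippet that spins up threads or an event loop on every iteration will largely measure that startup overhead, not true concurrent performance. For that, you need more specialized tools.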

Also, be wary of what you’re actually testing. Setup runs once per timeit() call and is excluded from the measurement, so heavy setup doesn’t inflate the number. The real risk is a stmt that does very little: then a good share of what you measure is the overhead of the timing loop itself (plus one function call per iteration if you passed a callable). Always sanity-check your results. If a supposedly “faster” method is only 0.001% faster, it’s almost certainly measurement noise. timeit gives you the data, but you still have to bring the brains to interpret it. It tells you the “what,” you have to figure out the “why.” And that, my friend, is where the real fun begins.
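One way to sanity-check a tiny measurement is to time a baseline that does nothing and see how close your real number sits to it. A rough sketch, with noop as a hypothetical do-nothing function of my own:

```python
import timeit

number = 1_000_000


def noop():
    pass


# The bare timing loop with an empty statement.
loop_overhead = timeit.timeit("pass", number=number)

# The same loop, plus one Python function call per iteration.
call_overhead = timeit.timeit(noop, number=number)

# A statement small enough that the overhead above is not negligible.
tiny_stmt = timeit.timeit("x + 1", setup="x = 1", number=number)

for label, total in [
    ("empty loop", loop_overhead),
    ("empty callable", call_overhead),
    ("x + 1", tiny_stmt),
]:
    print(f"{label:>14}: {total / number * 1e9:.1f} ns per loop")
```

If the statement you care about lands within a few nanoseconds of the baseline, the difference between two variants of it is probably not something timeit can resolve on its own.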