11.7 Bootstrapping for Confidence Intervals on Metrics

Right, so you’ve trained your model, calculated your accuracy, and it looks… decent. But that single number is a point estimate. It’s the performance on this specific test set. If you’d drawn a different test set from the same population, would you get a similar number, or did you just get lucky? This is where bootstrapping saunters in, looking like a statistical cheat code. It’s one of the most useful and intuitive tools in your evaluation toolbox, and it works by manufacturing new datasets out of the one you already have.

The core idea is brilliantly simple, almost absurd: we’re going to create new “pseudo-datasets” by randomly sampling from our original test set with replacement. This means a single data point from your test set could be sampled zero, one, or even five times in a single new dataset. It feels a bit like photocopying a document until the words are blurry and then claiming you have new documents, but the math checks out, I promise.

We do this hundreds or thousands of times. For each of these bootstrapped samples, we calculate our metric of choice (accuracy, F1-score, you name it). What you’re left with is a distribution of that metric. Suddenly, instead of saying “my accuracy is 89%,” you can say “I’m 95% confident my model’s true accuracy is between 86% and 92%.” That’s a far more powerful and honest statement.

The Nitty-Gritty: How to Actually Bootstrap

Let’s make this concrete. Here’s how you’d bootstrap a 95% confidence interval for accuracy using Python. We’ll use a simple example with a pretend set of predictions and true labels.

import numpy as np
from sklearn.utils import resample

# Let's assume these are your model's predictions and the true labels
# 1 = positive class, 0 = negative class
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0])  # Note two errors: index 5 (false positive) and index 7 (false negative)

# Our point estimate of accuracy
point_accuracy = np.mean(y_true == y_pred)
print(f"Point estimate accuracy: {point_accuracy:.3f}")

# Now, let's bootstrap
n_iterations = 10000
bootstrap_accuracies = []

# We bootstrap based on the *indices* of our test set
n_samples = len(y_true)
idx = np.arange(n_samples)

for i in range(n_iterations):
    # The magic happens here: sample indices with replacement
    bootstrap_idx = resample(idx, replace=True, n_samples=n_samples)
    
    # Create the bootstrapped sample using the selected indices
    y_true_boot = y_true[bootstrap_idx]
    y_pred_boot = y_pred[bootstrap_idx]
    
    # Calculate metric for this sample and store it
    acc = np.mean(y_true_boot == y_pred_boot)
    bootstrap_accuracies.append(acc)

# Convert to array for easier math
bootstrap_accuracies = np.array(bootstrap_accuracies)

# Calculate the 95% CI using the percentile method (more on this next)
alpha = 100 * 0.05  # For a 95% CI, we take the 2.5th and 97.5th percentiles
ci_lower, ci_upper = np.percentile(bootstrap_accuracies, [alpha/2, 100 - alpha/2])

print(f"Bootstrapped 95% CI: [{ci_lower:.3f}, {ci_upper:.3f}]")
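
The loop above is hard-wired to accuracy, but the same recipe works for any metric. Here is a minimal sketch of a reusable version. The helper names `bootstrap_ci` and `f1_binary` are made up for illustration (they are not from any library), and the F1 function is hand-rolled so the example needs nothing beyond NumPy. Any callable that takes `(y_true, y_pred)` and returns a number should slot in the same way.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """F1 for the positive class (label 1); returns 0.0 when undefined."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def bootstrap_ci(y_true, y_pred, metric, n_iterations=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for metric(y_true, y_pred) (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = np.empty(n_iterations)
    for b in range(n_iterations):
        idx = rng.integers(0, n, size=n)  # sample indices with replacement
        stats[b] = metric(y_true[idx], y_pred[idx])
    tail = 100 * (1 - level) / 2
    lower, upper = np.percentile(stats, [tail, 100 - tail])
    return lower, upper

# Same toy labels and predictions as in the accuracy example
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0])

f1_point = f1_binary(y_true, y_pred)
f1_lower, f1_upper = bootstrap_ci(y_true, y_pred, f1_binary)
print(f"F1 point estimate: {f1_point:.3f}")
print(f"Bootstrapped 95% CI for F1: [{f1_lower:.3f}, {f1_upper:.3f}]")
```

Resampling indices rather than values keeps each true label paired with its prediction, which is what lets the same helper serve any paired metric.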

Why Sampling With Replacement is the Whole Game

This is the non-negotiable part. Sampling with replacement is what makes each bootstrap sample differ from your original data: it is a resample, not a permutation. If you drew n points without replacement from a test set of size n, you’d just get the original dataset in a different order, and every metric would come out identical, which is useless for estimating variance. The “with replacement” rule is what simulates the process of drawing a new sample from the underlying population your test set came from, with the test set standing in for that population. Some data points will be left out (on average, about 37% of the original points are missing from any given sample), and others will be duplicated. This variation is exactly what we need to measure the stability of our metric.
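
Where does the 37% come from? A given point is missed by one draw with probability 1 - 1/n, so it is missed by all n draws with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. A quick simulation can check this. The test-set size and number of resamples below are arbitrary choices for illustration, not values from this section.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000        # size of a hypothetical test set
n_boot = 2000   # number of bootstrap samples to draw

left_out = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)            # indices sampled with replacement
    left_out[b] = 1 - np.unique(idx).size / n   # fraction of points never drawn

theory = (1 - 1 / n) ** n
print(f"Simulated fraction left out: {left_out.mean():.4f}")
print(f"(1 - 1/n)^n:                 {theory:.4f}")
print(f"1/e:                         {np.exp(-1):.4f}")
```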

Choosing Your Confidence Interval Method

The percentile method I used above is the simplest, but it’s not always the best. Its coverage can be off when the bootstrap distribution is skewed or the metric is a biased estimate. Two alternatives are worth knowing about:

The Bias-Corrected and Accelerated (BCa) method: This is often the gold standard. It adjusts for bias and skewness in the bootstrap distribution. It’s more computationally expensive and a bit more complex to implement, but SciPy can help: scipy.stats.bootstrap supports it through method='BCa'.

The Normal approximation method: This assumes the bootstrap distribution is normal and uses the standard error to calculate the interval. It’s quick and dirty, but that assumption is often violated, especially with metrics like precision and recall.
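
To see why the normality assumption can bite, here is a rough sketch of the normal approximation next to the percentile interval, on the same toy labels and predictions as before. The variable names and the choice of 5,000 resamples are mine, picked for illustration.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
correct = (y_true == y_pred).astype(float)  # 1.0 where the prediction is right

rng = np.random.default_rng(0)
n = correct.size
idx = rng.integers(0, n, size=(5000, n))    # one row of indices per resample
boot_acc = correct[idx].mean(axis=1)        # accuracy of each resample

point = correct.mean()
se = boot_acc.std(ddof=1)                   # bootstrap standard error
z = 1.959964                                # 97.5th percentile of the standard normal

normal_lower, normal_upper = point - z * se, point + z * se
pct_lower, pct_upper = np.percentile(boot_acc, [2.5, 97.5])

print(f"Normal approx 95% CI: [{normal_lower:.3f}, {normal_upper:.3f}]")
print(f"Percentile 95% CI:    [{pct_lower:.3f}, {pct_upper:.3f}]")
```

On this 13-example toy set the normal interval’s upper bound lands above 1.0, which is not a possible accuracy. The percentile interval cannot do that, because it is built from values the metric actually took.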

Stick with the percentile method for most quick diagnostics, but if you’re publishing a result or making a critical business decision, take the time to use BCa. A from-scratch implementation is a bit too long for this section, but know that it exists and is generally the more statistically rigorous choice.
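
If you would rather not write BCa by hand, SciPy’s `scipy.stats.bootstrap` can do the heavy lifting. The sketch below assumes SciPy 1.7 or later and reduces accuracy to the mean of a 0/1 “correct” vector so that a single sample can be passed in. It uses the `random_state` keyword for seeding; newer SciPy releases also accept `rng`, so check the documentation for your installed version.

```python
import numpy as np
from scipy.stats import bootstrap

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
correct = (y_true == y_pred).astype(float)  # accuracy is the mean of this vector

result = bootstrap(
    (correct,),              # data is passed as a sequence of samples
    np.mean,                 # statistic to bootstrap
    confidence_level=0.95,
    n_resamples=9999,
    method="BCa",
    random_state=0,          # seed for reproducibility
)
bca_lower, bca_upper = result.confidence_interval
print(f"BCa 95% CI: [{bca_lower:.3f}, {bca_upper:.3f}]")
```

For metrics that need both arrays, such as F1, the same function has a `paired=True` option that resamples the arrays together; treat that as a pointer to the SciPy documentation rather than a tested recipe.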

The Inevitable Pitfalls and How to Avoid Them

The “Garbage In, Garbage Out” Principle. Bootstrapping estimates the variance of your metric given your current test set. If your test set is profoundly unrepresentative of reality, your confidence interval will be beautifully precise but utterly wrong. If it’s tiny, the interval will be wide and the bootstrap distribution itself will be a coarse approximation. It quantifies uncertainty from sampling, not from having a terrible dataset.
It’s Computationally Expensive. Doing 10,000 iterations is trivial for accuracy on a small dataset. It’s a nightmare for a metric like ROC-AUC on a dataset with a million instances. You need to be smart about vectorization and possibly reduce the number of iterations for a first pass.
It’s for Inference, Not Model Selection. Don’t use bootstrapping on your validation set to pick between models. That’s what cross-validation is for. Bootstrapping is for getting a robust estimate of the performance of a single, already-chosen model on unseen data.
Watch for Ties in Your Data. The BCa method, in particular, can be sensitive to ties in your bootstrap statistics. If you have a very small test set, a metric like accuracy can only take a handful of distinct values, so many bootstrap samples produce identical statistics, which can throw off the calculations.
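
On the computational cost point, one shortcut applies to accuracy specifically (and other metrics that are a plain average of per-example 0/1 outcomes). Each resampled example is correct with probability equal to the observed accuracy, so the number correct in a bootstrap sample follows a binomial distribution and you never need to build the resampled arrays. The sketch below uses a synthetic “correct” vector as a stand-in for a large test set. It does not carry over to rank-based metrics like ROC-AUC, which still need real resampling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a large test set: True where the model was right
n = 1_000_000
correct = rng.random(n) < 0.9
n_correct = int(correct.sum())
point = n_correct / n

# Number correct in each bootstrap sample ~ Binomial(n, point),
# so draw those counts directly instead of indexing a million rows per iteration
n_iterations = 10_000
boot_acc = rng.binomial(n, point, size=n_iterations) / n

ci_lower, ci_upper = np.percentile(boot_acc, [2.5, 97.5])
print(f"Point estimate accuracy: {point:.4f}")
print(f"Bootstrapped 95% CI: [{ci_lower:.4f}, {ci_upper:.4f}]")
```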

The bottom line? Stop reporting single numbers. A point estimate is a story half-told. Bootstrapping gives you the confidence to tell the whole story, and it does it with a brute-force elegance that would make any engineer smile. Now go add it to your workflow.