13.2 Bayesian Optimization: Gaussian Processes and Acquisition Functions

Right, so you’ve been grid searching. Bless your heart. You’ve set up your parameter grids, fired it off, and gone to get a coffee. You came back a day later to find your model has barely budged, and you’ve burned enough compute cycles to power a small moon. There’s got to be a smarter way to find good hyperparameters than just brute force, right? There is. It’s called Bayesian Optimization, and it’s basically the opposite of guessing. It’s about being clever, learning from each experiment, and using probability to guide your next move.

Think of it like this: you’re trying to find the highest point in a vast, foggy landscape (your validation score as a function of the hyperparameters). You can’t see the whole thing. A grid search is like dropping people on a pre-defined, evenly-spaced grid and having them shout their altitude. It’s systematic but incredibly wasteful. Bayesian Optimization is like having a seasoned mountaineer who, after each report, uses their intuition (a probabilistic model) to guess where the next most promising place to look might be. It’s adaptive. It learns.

The whole process rests on two brilliant ideas: building a probabilistic model of your objective function (that’s the Gaussian Process) and then using a decision-making helper to pick the next point to evaluate (that’s the Acquisition Function).

The Gaussian Process: Your Function’s Psychic

At the heart of this is the Gaussian Process (GP). Don’t let the name scare you; you don’t need to derive it from first principles. Just think of a GP as a sophisticated way to say, “Based on the points I’ve already evaluated, here’s my best guess for the entire function, and more importantly, here’s how certain I am about that guess at every other point.”

It gives you a mean function (the “probably here is the value”) and a standard deviation (the “but I could be this wrong”). This uncertainty is pure gold. It’s what separates this from just fitting a simple regression to your data points. In regions where you have lots of data, the uncertainty is small. In regions you haven’t explored, the uncertainty is huge. Because the GP is cheap to query and stands in for the expensive objective, it is called a surrogate model.
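
If you want to see where that mean and standard deviation come from, here is the standard GP posterior. It is written under assumptions the text above does not spell out: a zero prior mean, a kernel k, n observed points x_1, ..., x_n with scores y, and an observation-noise variance \sigma_n^2 (zero for noiseless evaluations). The symbols K, \mathbf{k}_*, and \sigma_n are notation introduced here for illustration.

```latex
\begin{aligned}
\mu(x_*) &= \mathbf{k}_*^{\top}\,(K + \sigma_n^2 I)^{-1}\,\mathbf{y}, \\
\sigma^2(x_*) &= k(x_*, x_*) - \mathbf{k}_*^{\top}\,(K + \sigma_n^2 I)^{-1}\,\mathbf{k}_*,
\end{aligned}
\qquad K_{ij} = k(x_i, x_j), \quad (\mathbf{k}_*)_i = k(x_i, x_*).
```

Near an observed point, the subtracted term almost cancels k(x_*, x_*), so the variance collapses. Far from every observation, \mathbf{k}_* goes to zero and the variance returns to its prior value, which is exactly the “small where you have data, huge where you don’t” behavior described above.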

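Here’s a simplified look at building a GP surrogate with scikit-learn. We’re not agonizing over the choice of kernel here, which is its own whole thing; scikit-learn tunes the kernel’s hyperparameters (like the length scale) during fit() by maximizing the marginal likelihood. This just shows the mechanics.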

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C

# Let's pretend we've already evaluated our black-box function at a few points.
# X represents our hyperparameters (e.g., learning rate, batch size), and y is the validation score.
X_samples = np.array([[0.1], [0.5], [1.0]])  # e.g., learning rates
y_samples = np.array([0.65, 0.89, 0.75])      # e.g., accuracy

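# Define a kernel. It encodes your assumptions about the function's shape (how smooth, how quickly it varies).
# An RBF (Radial Basis Function) kernel is a common default. It assumes smooth, continuous functions.
kernel = C(1.0, (1e-3, 1e3)) * RBF(1.0, (1e-2, 1e2))
# n_restarts_optimizer reruns the kernel-hyperparameter fit from random starts; random_state makes that repeatable.
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10, random_state=0)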

# Fit the GP to our existing data. It's learning the landscape.
gp.fit(X_samples, y_samples)

# Now we can ask the GP to predict the mean and std.dev across a range of new points.
X_to_predict = np.linspace(0, 2, 100).reshape(-1, 1)
y_mean, y_std = gp.predict(X_to_predict, return_std=True)
# y_mean is the predicted performance, y_std is the uncertainty.

Acquisition Functions: The Decision Maker

Okay, so you have a model that tells you what it thinks the function looks like and where it’s uncertain. Now what? Do you just sample where the mean prediction is best? That’s exploitation. Or do you sample where uncertainty is highest? That’s exploration. This is the classic dilemma, and it’s exactly what an acquisition function solves.

An acquisition function is a clever formula that balances exploration and exploitation for you. It takes the GP’s output (mean and standard deviation) and calculates a single “utility” score for every point. You then simply choose the point with the highest utility to run your next, expensive, model-training experiment.

The most common one you’ll meet is Expected Improvement (EI). It literally calculates the expectation of how much improvement you’ll get over the current best value. It naturally values points with a high mean (exploitation) but also points with high uncertainty (exploration), because if you’re uncertain, there’s a chance the value could be fantastic. It’s mathematically elegant and works brilliantly in practice.
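
Written out for a maximization problem, and assuming the GP posterior at x is Gaussian with mean \mu(x) and standard deviation \sigma(x) > 0, EI has a closed form. Here f^{+} stands for the best score observed so far, and \Phi and \varphi are the standard normal CDF and PDF; these symbols are notation introduced for this sketch.

```latex
\mathrm{EI}(x) = \mathbb{E}\big[\max\big(f(x) - f^{+},\, 0\big)\big]
             = \big(\mu(x) - f^{+}\big)\,\Phi(Z) + \sigma(x)\,\varphi(Z),
\qquad Z = \frac{\mu(x) - f^{+}}{\sigma(x)}.
```

The first term grows with the predicted mean (exploitation) and the second grows with the uncertainty (exploration). In the limit \sigma(x) \to 0 the expression reduces to \max(\mu(x) - f^{+}, 0).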

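Other popular ones include Upper Confidence Bound (UCB, which adds a tunable multiple of the uncertainty to the mean, so you dial exploration up or down yourself) and Probability of Improvement (PI, a simpler cousin of EI that asks only whether a point will beat the current best, not by how much).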
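
To make those two concrete, here is a minimal sketch of UCB and PI for a maximization problem. The function names, the `kappa` and `xi` parameters, and the three-point posterior are all made up for illustration; the functions take the GP’s mean and standard deviation as plain arrays rather than a fitted model.

```python
import numpy as np
from scipy.stats import norm

def upper_confidence_bound(mu, sigma, kappa=2.0):
    """UCB for maximization: predicted mean plus kappa standard deviations."""
    return np.asarray(mu) + kappa * np.asarray(sigma)

def probability_of_improvement(mu, sigma, current_best, xi=0.0):
    """Probability that a point beats current_best by at least xi."""
    sigma = np.clip(np.asarray(sigma, dtype=float), 1e-10, None)  # avoid division by zero
    return norm.cdf((np.asarray(mu) - current_best - xi) / sigma)

# Hypothetical posterior at three candidates:
# a confident good one, an uncertain one, and a confident bad one.
mu = np.array([0.88, 0.80, 0.60])
sigma = np.array([0.01, 0.10, 0.01])

ucb_cautious = upper_confidence_bound(mu, sigma, kappa=0.5)  # small kappa leans on the mean
ucb_bold = upper_confidence_bound(mu, sigma, kappa=3.0)      # large kappa chases uncertainty
pi_scores = probability_of_improvement(mu, sigma, current_best=0.89)

print("cautious UCB picks candidate", np.argmax(ucb_cautious))
print("bold UCB picks candidate", np.argmax(ucb_bold))
print("PI picks candidate", np.argmax(pi_scores))
```

The only thing that changes between the two UCB calls is `kappa`, which is why UCB’s exploration behavior is really a setting you choose rather than a fixed property.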

import numpy as np
from scipy.stats import norm

def expected_improvement(X, gp, current_best):
    """Calculates EI (for maximization) at points X from a fitted GP and the best value found so far."""
    mu, sigma = gp.predict(X, return_std=True)
    sigma = np.clip(sigma, 1e-10, None)  # Avoid division by zero where the GP is certain
    improvement = mu - current_best
    Z = improvement / sigma
    # First term rewards a high predicted mean, second term rewards uncertainty.
    return improvement * norm.cdf(Z) + sigma * norm.pdf(Z)

# Find the point in our range that maximizes EI
current_best = np.max(y_samples)
ei_values = expected_improvement(X_to_predict, gp, current_best)
next_sample_point = X_to_predict[np.argmax(ei_values)]
print(f"Next point to sample: {next_sample_point[0]:.3f}")
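
Putting the surrogate and the acquisition function together gives the full loop: fit the GP, score candidates with EI, evaluate the winner, repeat. The sketch below is one minimal way to wire that up, not a production optimizer. `toy_objective` is a hypothetical stand-in for an expensive training run, the candidate grid and the ten-step budget are arbitrary choices, and the small `alpha` is an assumption added to keep the fit numerically stable if a point gets sampled twice.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C

def toy_objective(x):
    """Hypothetical stand-in for an expensive training run; peaks at x = 0.6."""
    return float(np.exp(-((x - 0.6) ** 2) / 0.1))

def expected_improvement(X, gp, current_best):
    """EI for maximization, computed from the GP's posterior mean and std."""
    mu, sigma = gp.predict(X, return_std=True)
    sigma = np.clip(sigma, 1e-10, None)
    improvement = mu - current_best
    Z = improvement / sigma
    return improvement * norm.cdf(Z) + sigma * norm.pdf(Z)

# Candidate hyperparameter values to choose from, plus three initial evaluations.
candidates = np.linspace(0.0, 2.0, 201).reshape(-1, 1)
X_obs = np.array([[0.1], [1.0], [1.8]])
y_obs = np.array([toy_objective(x[0]) for x in X_obs])

for step in range(10):
    # 1. Refit the surrogate on everything observed so far.
    kernel = C(1.0, (1e-3, 1e3)) * RBF(1.0, (1e-2, 1e2))
    gp = GaussianProcessRegressor(kernel=kernel, alpha=1e-6,
                                  n_restarts_optimizer=5, random_state=0)
    gp.fit(X_obs, y_obs)

    # 2. Pick the candidate with the highest expected improvement.
    ei = expected_improvement(candidates, gp, y_obs.max())
    x_next = candidates[np.argmax(ei)]

    # 3. Run the "expensive" experiment and record the result.
    y_next = toy_objective(x_next[0])
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, y_next)

best_x = X_obs[np.argmax(y_obs), 0]
best_y = y_obs.max()
print(f"Best x found: {best_x:.3f} with score {best_y:.3f}")
```

Maximizing EI over a fixed grid is fine in one dimension. With several hyperparameters you would typically swap the grid for random candidates or a numerical optimizer over the acquisition function.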

The Nuts, Bolts, and Landmines

Here’s the stuff the pure-math explanations often gloss over. First, GPs get painfully slow as your number of observations (n) grows: fitting requires factorizing an n × n covariance matrix, which costs roughly O(n³). After a few hundred evaluations, you’ll feel it. For larger runs, you might graduate to other surrogate models like Bayesian Neural Networks or Tree-structured Parzen Estimators (TPE), the approach used by tools like Hyperopt.
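
To see where the cubic cost lives, here is a bare-bones GP posterior in NumPy and SciPy. It is a sketch under simplifying assumptions: a zero prior mean, an RBF kernel with a fixed length scale (no hyperparameter fitting), and a tiny jitter term on the diagonal. The helper names `rbf_kernel` and `gp_posterior` are made up for this example.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf_kernel(A, B, length_scale=0.5):
    """Squared-exponential kernel between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dists / length_scale**2)

def gp_posterior(X_train, y_train, X_new, noise=1e-8):
    """Zero-mean GP posterior mean and std with fixed kernel hyperparameters."""
    n = len(X_train)
    K = rbf_kernel(X_train, X_train) + noise * np.eye(n)
    factor = cho_factor(K, lower=True)      # Cholesky of an n x n matrix: the O(n^3) step
    alpha = cho_solve(factor, y_train)      # O(n^2) once the factor exists
    K_star = rbf_kernel(X_train, X_new)     # shape (n, m)
    mean = K_star.T @ alpha
    v = cho_solve(factor, K_star)
    var = 1.0 - np.sum(K_star * v, axis=0)  # k(x, x) = 1 for this kernel
    return mean, np.sqrt(np.clip(var, 0.0, None))

X_train = np.array([[0.1], [0.5], [1.0]])
y_train = np.array([0.65, 0.89, 0.75])
X_new = np.array([[0.5], [5.0]])            # one observed point, one far from all data

mean, std = gp_posterior(X_train, y_train, X_new)
print("mean:", mean)
print("std: ", std)
```

Every new observation adds a row and a column to K, and the factorization has to be redone, which is why each iteration gets slower as the run goes on.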

Second, the kernel choice matters. The default RBF kernel assumes your function is smooth. If your hyperparameter space is full of discrete or categorical parameters, or if the function has sharp discontinuities (which, let’s be honest, it sometimes does thanks to bad software defaults), a standard GP will struggle. You need specialized kernels for that, which is its own deep rabbit hole.
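
For the milder version of this problem, a function that is continuous but rougher and noisier than RBF assumes, one common adjustment is to swap in a Matérn kernel and add an explicit noise term. The sketch below shows that swap in scikit-learn. It is an illustration only: the data values are made up, and it does nothing for categorical or discrete parameters, which need the specialized treatment mentioned above.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel as C, Matern, WhiteKernel

# Made-up observations: learning rates and the accuracies they produced.
X_samples = np.array([[0.1], [0.3], [0.5], [0.7], [1.0]])
y_samples = np.array([0.65, 0.80, 0.89, 0.84, 0.75])

# Matern with nu=2.5 assumes a less smooth function than RBF.
# WhiteKernel lets the GP attribute some of the variation to evaluation noise.
kernel = (
    C(1.0, (1e-3, 1e3)) * Matern(length_scale=1.0, length_scale_bounds=(1e-2, 1e2), nu=2.5)
    + WhiteKernel(noise_level=1e-3, noise_level_bounds=(1e-8, 1e-1))
)

gp_matern = GaussianProcessRegressor(kernel=kernel, normalize_y=True,
                                     n_restarts_optimizer=5, random_state=0)
gp_matern.fit(X_samples, y_samples)

X_grid = np.linspace(0.0, 2.0, 50).reshape(-1, 1)
mean, std = gp_matern.predict(X_grid, return_std=True)

# Inspect the hyperparameters the fit settled on.
print(gp_matern.kernel_)
```

Printing `kernel_` after fitting is a cheap sanity check: a length scale or noise level pinned at one of its bounds is a hint that the kernel’s assumptions don’t match the data.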

Finally, this is not a magic wand. You’re still making assumptions, primarily that your objective function is reasonably behaved. If it’s complete noise, nothing will help. But for most well-defined problems with expensive evaluations and a handful of hyperparameters, Bayesian Optimization will typically find a good configuration in far fewer evaluations than grid search, and usually fewer than random search. It’s not just smarter; it’s ruthlessly efficient. And in this game, efficiency is everything.