79.8 Hyperparameter Tuning: GridSearchCV and RandomizedSearchCV
Right, so you’ve built your model. It’s probably a RandomForestClassifier because that’s what everyone builds first. It’s the “I’m not sure what I’m doing but I want something that works” of machine learning, and honestly, it’s a great choice. But you ran it, and the accuracy is… fine. Not great. Just fine. You stare at your screen. Now what?
Welcome to the single most impactful (and most tedious) part of the machine learning workflow: hyperparameter tuning. Your model is a car with a million unlabeled dials and knobs. Hyperparameter tuning is the process of fiddling with them until you stop getting terrible gas mileage and actually start winning races. We’re going to talk about the two smartest ways to do this fiddling without just randomly twisting things until something breaks.
Your Model’s Knobs: What Are Hyperparameters?
First, a quick distinction. Parameters are what the model learns from the data (like the weights in a linear regression). Hyperparameters are the settings you, the all-powerful wizard, choose before the training even starts. They control the whole learning process.
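To make that distinction concrete, here is a minimal sketch. It uses Ridge (a regularized linear regression) purely as an illustrative stand-in, because it has one obvious dial (alpha) and learns visible weights. The dataset is synthetic and the values are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic regression data, just for illustration
X, y = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)

# Hyperparameter: alpha is chosen by you, before training
model = Ridge(alpha=1.0)
hyperparams_before = model.get_params()

model.fit(X, y)

# Fitting does not touch the hyperparameters...
print("Hyperparameters:", model.get_params())
# ...but it does produce learned parameters (trailing underscore by convention)
print("Learned coefficients:", model.coef_)
print("Learned intercept:", model.intercept_)
```

In scikit-learn, get_params() returns what you set, while attributes ending in an underscore (coef_, intercept_) hold what the model learned.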
Think of a RandomForestClassifier. Key hyperparameters include:
- n_estimators: How many trees in the forest? (More trees = more stable predictions, but slower training)
- max_depth: How deep can each tree grow? (Deeper trees = more complex patterns, risk of overfitting)
- min_samples_split: How many samples are needed to split a node? (Higher number = simpler trees)
Picking these values by hand is a fool’s errand. You will waste days. Instead, we systematize the search.
Brute Force with a Plan: GridSearchCV
GridSearchCV is the meticulous, slightly obsessive-compulsive friend who makes a spreadsheet for everything. You define a grid of every single hyperparameter value you’d like to try, and it will train and evaluate a model for every possible combination of those values. It’s an exhaustive search.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
# Let's use a toy dataset that's actually fun to look at
X, y = make_moons(n_samples=1000, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define your model
model = RandomForestClassifier(random_state=42) # Always set random_state for reproducibility!
# Define the hyperparameter grid. This is the magic box.
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}
# Create the GridSearchCV object
# cv=5 means 5-fold cross-validation. This is non-negotiable. Always use it.
grid_search = GridSearchCV(estimator=model,
                           param_grid=param_grid,
                           cv=5,
                           scoring='accuracy',  # You can use 'f1', 'roc_auc', etc.
                           n_jobs=-1)  # Use all your CPU cores. You paid for them.
# Fit it! This runs one fit per combination per fold:
# (n_estimators values * max_depth values * min_samples_split values * cv folds).
# In this case: 3 * 4 * 3 * 5 = 180 fits, plus one final refit of the winner on all of X_train. Buckle up.
grid_search.fit(X_train, y_train)
# Who won?
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.3f}")
# And finally, see how our champion performs on the held-out test set
best_model = grid_search.best_estimator_
test_score = best_model.score(X_test, y_test)
print(f"Test set score: {test_score:.3f}")
The best_score_ is the average score from the cross-validation folds for the best parameter set. The test set is your final, unbiased judge to see if your tuned model actually generalizes.
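If you want to check that claim about best_score_ for yourself, the per-fold numbers live in cv_results_. The following is a small sketch with a deliberately tiny grid and a smaller dataset (both chosen here only to keep it fast, not taken from the example above).

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_moons(n_samples=300, noise=0.3, random_state=42)

# A tiny grid so this runs in seconds
search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=42),
    param_grid={"max_depth": [2, 5, None]},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

results = search.cv_results_
best = search.best_index_  # row of cv_results_ belonging to the winner

# Pull the winner's score on each of the 5 folds and average them by hand
fold_scores = [results[f"split{i}_test_score"][best] for i in range(5)]
mean_of_folds = sum(fold_scores) / len(fold_scores)

# One line per candidate: its settings, mean CV score, and spread across folds
for params, mean, std in zip(results["params"],
                             results["mean_test_score"],
                             results["std_test_score"]):
    print(params, f"mean={mean:.3f}", f"std={std:.3f}")

print(f"best_score_ = {search.best_score_:.3f}, mean of folds = {mean_of_folds:.3f}")
```

The std column is worth a glance: if the gap between two candidates is smaller than the spread across folds, the "winner" may just be noise.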
The Pitfall: Combinatorial explosion. My example grid has 3 * 4 * 3 = 36 combinations. Add one more parameter with 5 values? Now it’s 36 * 5 = 180 combinations, which is 900 fits with cv=5. The cost multiplies with every parameter you add. You will quickly find yourself grid-searching over a weekend for a model that might be 0.1% better. It’s absurd. Which is exactly the problem the next tool is built to dodge.
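You can do this arithmetic before committing your weekend. A quick sketch using scikit-learn's ParameterGrid to count candidates; the extra max_features values below are made up for illustration.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
}
cv = 5

# Number of distinct combinations, and number of fits once each is cross-validated
n_candidates = len(ParameterGrid(param_grid))
print(n_candidates, n_candidates * cv)  # 36 180

# Add one more parameter with 5 values (hypothetical values)
bigger_grid = dict(param_grid, max_features=[0.2, 0.4, 0.6, 0.8, 1.0])
n_bigger = len(ParameterGrid(bigger_grid))
print(n_bigger, n_bigger * cv)  # 180 900
```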
Smart(ish) Randomness: RandomizedSearchCV
RandomizedSearchCV is the chaotic-good alternative. Instead of trying every combination, you tell it how many candidate settings you’re willing to pay for (n_iter), and it will randomly sample that many from the hyperparameter distributions you define.
Why is this brilliant? You can explore a much, much wider range of values for the same computational cost. You’re statistically likely to get close to the best answer without finding it exactly, which is almost always good enough.
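A back-of-the-envelope way to see why "close to the best" is likely: suppose (and this is an assumption, not a guarantee) that each random draw independently has a 5% chance of landing in the top 5% of settings. Then the chance that at least one of n draws lands there is 1 - 0.95^n. The function name below is made up for this sketch.

```python
def chance_of_hitting_top_region(n_iter, top_fraction=0.05):
    """Probability that at least one of n_iter independent random draws
    lands in the best `top_fraction` of the search space."""
    return 1.0 - (1.0 - top_fraction) ** n_iter

for n in (10, 50, 100):
    print(n, round(chance_of_hitting_top_region(n), 3))
```

With 50 draws the chance is roughly 92%, regardless of how many hyperparameters you are searching over. That independence from the number of dimensions is what a grid cannot offer.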
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
# Define distributions to sample from, not fixed values
param_distributions = {
    'n_estimators': randint(50, 250),  # Random integers from 50 to 249 (upper bound is exclusive)
    'max_depth': [None, 10, 20, 30],  # You can still use a list
    'min_samples_split': randint(2, 15),  # Random integers from 2 to 14
    'max_features': uniform(0.1, 0.8)  # uniform(loc, scale): random floats between 0.1 and 0.1 + 0.8 = 0.9
}
random_search = RandomizedSearchCV(estimator=model,
                                   param_distributions=param_distributions,
                                   n_iter=50,  # I'm only willing to try 50 candidates (50 * 5 folds = 250 fits)
                                   cv=5,
                                   scoring='accuracy',
                                   n_jobs=-1,
                                   random_state=42)  # Again, for reproducibility!
random_search.fit(X_train, y_train)
print(f"Best parameters: {random_search.best_params_}")
print(f"Best cross-validation score: {random_search.best_score_:.3f}")
See what we did there? We even added a new hyperparameter (max_features) that would have made a grid search prohibitively expensive. With 50 iterations, we get a great sense of what works without burning a hole in your laptop.
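One thing worth checking before you trust a search space is what the scipy distributions actually produce, because their arguments are easy to misread: randint(low, high) excludes high, and uniform(loc, scale) covers loc to loc + scale, not loc to scale. Here is a small sketch using scikit-learn's ParameterSampler to draw candidates without training anything. The scale of 0.8 is chosen here so that max_features tops out at 0.9.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterSampler

param_distributions = {
    "n_estimators": randint(50, 250),     # integers 50..249
    "max_depth": [None, 10, 20, 30],      # lists are sampled uniformly
    "min_samples_split": randint(2, 15),  # integers 2..14
    "max_features": uniform(0.1, 0.8),    # floats in [0.1, 0.9]
}

# Draw 200 candidate settings; no models are fitted here
samples = list(ParameterSampler(param_distributions, n_iter=200, random_state=42))

n_est = [s["n_estimators"] for s in samples]
max_feat = [s["max_features"] for s in samples]

print("First candidate:", samples[0])
print("n_estimators range seen:", min(n_est), "to", max(n_est))
print(f"max_features range seen: {min(max_feat):.3f} to {max(max_feat):.3f}")
```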
Best Practices from the Trenches
- Start with RandomizedSearch: Always. Use it to narrow down the range of what good hyperparameters look like. It’s your scouting party.
- Then, maybe, use GridSearch: Once RandomizedSearch has identified a promising region of the hyperparameter space, you can do a finer-grained grid search around that area if you really need to squeeze out that last drop of performance. Most of the time, you won’t need to.
- Know Your Scoring Metric: If you don’t set scoring, the search falls back on the estimator’s own score method, which is accuracy for classifiers and R² for regressors. Accuracy is frequently misleading. Use scoring='roc_auc' for imbalanced classification, scoring='neg_mean_squared_error' for regression. This is one of the biggest levers in making the search actually useful.
- n_jobs=-1 is Your Friend: It parallelizes the process across all your CPU cores. The difference between this and n_jobs=1 is the difference between “I’ll get a coffee” and “I’ll get a coffee, lunch, and a nap.”
- The Data is Key: No amount of hyperparameter tuning will save a model trained on terrible, leaky, or uninformative data. It’s like meticulously tuning the engine of a car with flat tires. Fix the data first.
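The scout-then-refine workflow from the first two bullets might look something like the sketch below. The ranges, the step sizes around the best point, and the small dataset are all arbitrary choices made here to keep it quick. Treat it as one possible shape for the idea, not a recipe.

```python
from scipy.stats import randint
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_moons(n_samples=400, noise=0.3, random_state=42)
model = RandomForestClassifier(n_estimators=30, random_state=42)

# Stage 1: the scouting party. Wide ranges, few candidates.
coarse = RandomizedSearchCV(
    model,
    param_distributions={
        "max_depth": randint(2, 21),          # integers 2..20
        "min_samples_split": randint(2, 31),  # integers 2..30
    },
    n_iter=10,
    cv=3,
    random_state=42,
)
coarse.fit(X, y)
best_depth = coarse.best_params_["max_depth"]
best_split = coarse.best_params_["min_samples_split"]

# Stage 2: a small grid centered on the scout's best point.
# The +/- offsets are arbitrary; the sets drop duplicates at the boundaries.
fine_grid = {
    "max_depth": sorted({max(1, best_depth - 1), best_depth, best_depth + 1}),
    "min_samples_split": sorted({max(2, best_split - 2), best_split, best_split + 2}),
}
fine = GridSearchCV(model, fine_grid, cv=3)
fine.fit(X, y)

print("Coarse best:", coarse.best_params_, f"{coarse.best_score_:.3f}")
print("Fine best:  ", fine.best_params_, f"{fine.best_score_:.3f}")
```

Because the fine grid includes the scout's best point and uses the same folds and the same random_state, the second stage cannot score worse in cross-validation than the first. Whether the improvement is worth the extra fits is a separate question.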
The goal isn’t perfection; it’s a significant step up from your first draft. These tools get you there by replacing guesswork with a systematic process, saving you from the despair of manual tuning. Now go make that model less… fine.