6.3 Bagging and Random Forests: Reducing Variance with Diversity

Right, so you’ve built yourself a decision tree. It’s a beautiful, sprawling thing that fits your training data perfectly. You show it off to your friends, your family, and then, with a trembling hand, you run it on some new data. The result is a catastrophic, humiliating failure. What happened? You’ve just been personally victimized by overfitting. Your tree is too specific; it’s memorized the noise in your data, not the underlying signal. It has high variance.

We need to introduce some stability, some consensus. And what’s the best way to get a good, stable answer? You ask a committee. But here’s the trick: you can’t just ask a committee of experts who all read the same books and went to the same schools. You need a diverse committee. That’s the entire philosophical core of ensemble methods.

The Bootstrap Aggregating (Bagging) Gambit

The core idea of Bagging is so brilliantly simple it feels like cheating:

Create multiple slightly different versions of your dataset by bootstrapping (sampling with replacement).
Train a model (like a decision tree) on each of these bootstrapped datasets.
For a prediction, let all these models vote (for classification) or average their predictions (for regression).
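
If you want to convince yourself there’s no magic in those three steps, here is a minimal hand-rolled sketch. It assumes binary 0/1 labels and a plain majority vote, and the helper names fit_bagged_trees and predict_majority are made up for illustration; they are not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def fit_bagged_trees(X, y, n_estimators=50, seed=0):
    """Train one decision tree per bootstrap sample of (X, y). Hypothetical helper."""
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(n_estimators):
        # Step 1: draw n row indices with replacement (a bootstrap sample)
        idx = rng.integers(0, n, size=n)
        # Step 2: fit a fresh tree on that sample
        trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    return trees


def predict_majority(trees, X):
    """Step 3: majority vote across trees (assumes labels are 0 and 1)."""
    votes = np.stack([tree.predict(X) for tree in trees])  # (n_trees, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)


X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

trees = fit_bagged_trees(X_train, y_train)
bagged_acc = (predict_majority(trees, X_test) == y_test).mean()
print(f"Hand-rolled bagging test accuracy: {bagged_acc:.4f}")
```

Don’t expect this to match a library implementation digit for digit. To the best of my knowledge, scikit-learn’s BaggingClassifier averages predicted class probabilities when the base estimator offers them rather than counting hard votes, and it draws its random samples differently.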

Why does this work? By training on different samples, each model learns a slightly different aspect of the data. Some might be a bit weird, some might be spot on. When you average their predictions, the weird, high-variance errors tend to cancel each other out, while the true signal reinforces itself. You’re reducing variance without increasing bias too much. It’s the wisdom of the crowd, applied to machine learning.

Here’s the beautiful part: because we’re sampling with replacement, a bootstrap sample the same size as the original dataset leaves out, on average, about 37% of the original rows (the fraction tends to 1/e ≈ 0.368 as the dataset grows). These are the so-called Out-of-Bag (OOB) instances, and they’re a free validation set for each tree! Score every training instance using only the trees that never saw it, and you get a pretty decent estimate of the ensemble’s error without needing a separate holdout set, which is incredibly handy when data is scarce.
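
Where does the 37% come from? The chance that a given row is never picked in n draws with replacement is (1 - 1/n)^n, which approaches 1/e. The quick simulation below checks that; the dataset size and number of trials are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000         # size of the original dataset (illustrative)
n_trials = 200   # number of bootstrap samples to draw (illustrative)

oob_fractions = []
for _ in range(n_trials):
    idx = rng.integers(0, n, size=n)        # bootstrap: n draws with replacement
    n_in_bag = np.unique(idx).size          # distinct rows that made it into the sample
    oob_fractions.append(1 - n_in_bag / n)  # fraction of rows left out (out-of-bag)

mean_oob = float(np.mean(oob_fractions))
theory = (1 - 1 / n) ** n                   # P(a given row is never drawn)

print(f"simulated OOB fraction: {mean_oob:.3f}")
print(f"(1 - 1/n)^n:            {theory:.3f}")
print(f"1/e:                    {np.exp(-1):.3f}")
```

Note that this only holds when each bootstrap sample is as large as the original dataset. Draw fewer rows per sample and the out-of-bag share goes up.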

Let’s see it in action. We’ll use scikit-learn’s BaggingClassifier to bag some decision trees.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Make a dataset that's notoriously hard for a single tree (moons are fun)
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# First, let's feel the pain of a single, overfit tree
single_tree = DecisionTreeClassifier(random_state=42)
single_tree.fit(X_train, y_train)
y_pred_tree = single_tree.predict(X_test)
print(f"Single tree test accuracy: {accuracy_score(y_test, y_pred_tree):.4f}")

# Now, let's bring in the committee. 500 trees, each trained on a bootstrap sample.
# Note: the OOB score gives a roughly unbiased accuracy estimate without touching the test set.
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,  # Rows drawn per bootstrap sample (fewer than the 375 training rows,
                      # so each tree leaves out far more than 37% of them)
    bootstrap=True,   # Sample with replacement; this is what makes it bagging
    oob_score=True,   # Let's use that free validation data!
    random_state=42
)
bag_clf.fit(X_train, y_train)

# Predict and compare
y_pred_bag = bag_clf.predict(X_test)
print(f"Bagging test accuracy: {accuracy_score(y_test, y_pred_bag):.4f}")
print(f"Bagging OOB accuracy: {bag_clf.oob_score_:.4f}")  # Usually in the same ballpark as the test score

You’ll almost certainly see a solid bump in accuracy from the bagged model. The OOB score is a fantastic, quick sanity check that usually aligns very closely with the actual test score.

Enter the Random Forest: A Stroke of Design Genius

A bagged ensemble of decision trees is good, but Leo Breiman turbocharged the idea with the Random Forest. It adds one crucial twist to the bagging process: feature randomness.

When building each tree, instead of greedily choosing the best split from all features at every node, the algorithm is forced to choose from a random subset of features. This is a moment of sheer design brilliance. Why?

It introduces even more diversity into the committee. In a standard bagged set of trees, the same powerful feature might dominate the top split across most trees, making them highly correlated. By randomly limiting the feature choices at each node, you decorrelate the trees. One tree might be forced to use a decent-but-not-best feature, leading it down a different path. The resulting trees are individually weaker (slightly higher bias) but much less correlated. This reduction in correlation is the key to a more effective variance reduction when we average them all together.
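
To put a formula behind that claim, here is the standard result for the variance of an average of correlated predictions. The notation is my own assumption, not something defined earlier in this section: B trees with predictions T_b(x), each with the same variance σ² and the same pairwise correlation ρ.

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
```

The second term shrinks as you add trees, but the first one doesn’t care how big B gets. The only way to push it down is to lower ρ, which is exactly what the random feature subsets are for.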

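In scikit-learn, a RandomForestClassifier is optimized for trees and easier to use than a BaggingClassifier of DecisionTreeClassifiers. It applies feature randomness out of the box (controlled by max_features, which for classification defaults to the square root of the number of features) and exposes the usual tree-growing hyperparameters directly.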

from sklearn.ensemble import RandomForestClassifier

# The classic. It just works.
rf_clf = RandomForestClassifier(
    n_estimators=500,
    max_leaf_nodes=16,    # A way to control the size of individual trees
    oob_score=True,
    random_state=42
)
rf_clf.fit(X_train, y_train)

y_pred_rf = rf_clf.predict(X_test)
print(f"Random Forest test accuracy: {accuracy_score(y_test, y_pred_rf):.4f}")
print(f"Random Forest OOB accuracy: {rf_clf.oob_score_:.4f}")
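
To make the "bagging plus feature randomness" relationship concrete, here is a rough side-by-side sketch: a RandomForestClassifier next to a BaggingClassifier whose base trees pick from a random feature subset at each split. Treat them as approximately equivalent, not identical; the two classes consume their random streams differently, so the fitted trees and scores can differ. The tree count of 200 is an arbitrary choice to keep the run short.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The packaged version: bootstrap sampling + random feature subsets per split
rf = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",   # with only 2 features, this means 1 candidate per split
    max_leaf_nodes=16,
    random_state=42,
)

# The hand-assembled version: bagging around trees that randomize their own splits
bag = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=200,
    bootstrap=True,
    random_state=42,
)

rf_acc = accuracy_score(y_test, rf.fit(X_train, y_train).predict(X_test))
bag_acc = accuracy_score(y_test, bag.fit(X_train, y_train).predict(X_test))

print(f"RandomForestClassifier test accuracy:        {rf_acc:.4f}")
print(f"Bagging of feature-randomized trees accuracy: {bag_acc:.4f}")
```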

Practical Considerations and Pitfalls

Don’t just throw 1000 trees at everything and hope for the best. Here’s what you need to know:

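The Law of Diminishing Returns: Adding trees doesn’t make the ensemble overfit more, and on average it reduces variance, but after a point the improvement is negligible and you’re only paying in training time and memory. The OOB error will plateau. Start with 100-500; you’ll know when you have enough.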
n_jobs is Your Best Friend: Training trees is embarrassingly parallel. Set n_jobs=-1 to use all your cores and watch it fly. It’s free performance.
It’s Still a Black Box: A random forest is even less interpretable than a single tree. You can glean feature importance (via rf_clf.feature_importances_), which is fantastic, but you can’t easily trace a single prediction through 500 trees. If you need explainability, this is a real trade-off.
The Bias Problem: Remember, bagging and random forests are primarily variance-reduction techniques. If your underlying model (a shallow tree) is wrong and has high bias, no amount of ensembling will fix it. You’ll just be efficiently averaging a bunch of wrong answers. Your committee is biased from the start. Always ensure your base learner is at least somewhat competent on its own.
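
A quick way to check the first three points on your own data is to grow a forest in stages and watch the OOB error. The sketch below leans on warm_start=True so that each refit only adds new trees instead of starting over; the schedule of tree counts is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(
    warm_start=True,     # keep already-fitted trees and only add new ones on refit
    oob_score=True,      # recompute the OOB accuracy after every fit
    max_leaf_nodes=16,
    n_jobs=-1,           # train trees on all available cores
    random_state=42,
)

oob_errors = {}
for n_trees in (25, 50, 100, 200, 400):   # illustrative schedule
    rf.set_params(n_estimators=n_trees)
    rf.fit(X_train, y_train)
    oob_errors[n_trees] = 1 - rf.oob_score_

for n_trees, err in oob_errors.items():
    print(f"{n_trees:>4} trees: OOB error = {err:.4f}")

# Impurity-based importances, one value per input feature (they sum to 1)
print("feature importances:", rf.feature_importances_)
```

If the error stops moving between the last few stages, more trees are unlikely to buy you anything. On a two-feature toy dataset like this one the importances aren’t very interesting, but the same attribute works on real data.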

The takeaway? If you have a noisy dataset and you need a robust, high-performance model out of the box, a Random Forest is almost always the first thing you should try. It’s a workhorse for a reason. It takes the inherent weakness of a decision tree, its instability, and turns it into its greatest strength.