6.4 Feature Importance in Random Forests

Right, so you’ve built a Random Forest. It’s performing well, and you’re feeling pretty smug. But you’re not the type to just accept a black box, are you? You want to know why it works. You want to know which features are actually pulling their weight and which are just dead weight, collecting a salary while the hard-working variables do all the heavy lifting. That’s where feature importance comes in, and it’s one of the most useful—and most frequently misunderstood—tools in the ensemble learning kit.

Let’s cut through the noise. There are two primary methods for calculating feature importance in a Random Forest, and they are fundamentally different beasts. You need to understand both.

Mean Decrease in Impurity (MDI)

This is the default in scikit-learn and the most common one you’ll see. The logic is beautifully simple: every time a node in any tree is split, the split is chosen to maximize the decrease in impurity (be it Gini or Entropy for classification, or MSE for regression). We can keep a running tally.

For each feature, we add up the impurity decrease from every split that uses it, weighting each split by the fraction of samples that reach the node, and then average the result across all trees in the forest. Features used higher up in the trees (where they affect more samples) and features that cause big purity drops will have higher scores.
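
To make that tally concrete, here is a minimal sketch that rebuilds the score by hand from the arrays scikit-learn stores on each fitted tree. The helper name tree_mdi and the small demo dataset are made up for illustration. Assuming the forest normalizes each tree's scores, averages them, and rescales to sum to 1, the two printed rows should agree up to floating-point error.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def tree_mdi(fitted_tree, n_features):
    """Normalized impurity-decrease importance for one fitted tree (illustrative helper)."""
    t = fitted_tree.tree_
    internal = t.children_left != -1           # leaves are marked with child index -1
    left = t.children_left[internal]
    right = t.children_right[internal]
    w = t.weighted_n_node_samples              # (weighted) number of samples reaching each node
    # Weighted impurity before the split minus weighted impurity of the two children.
    decrease = (w[internal] * t.impurity[internal]
                - w[left] * t.impurity[left]
                - w[right] * t.impurity[right])
    # Credit each decrease to the feature that was used at that split.
    totals = np.bincount(t.feature[internal], weights=decrease, minlength=n_features)
    return totals / totals.sum() if totals.sum() > 0 else totals


# Small demo dataset, separate from the one used in the rest of this section.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, n_informative=2,
                                     n_redundant=1, shuffle=False, random_state=0)
forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# Average the per-tree scores, then rescale so they sum to 1.
manual_mdi = np.mean([tree_mdi(est, X_demo.shape[1]) for est in forest.estimators_], axis=0)
manual_mdi /= manual_mdi.sum()

print("manual :", np.round(manual_mdi, 3))
print("sklearn:", np.round(forest.feature_importances_, 3))
```

Notice that nothing here touches held-out data: every number comes from how the trees carved up the training set.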

Here’s the kicker: it’s derived entirely from statistics the trees recorded during training, so it’s basically free. Let’s see it in action.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

# Let's conjure up a synthetic dataset where we know the truth.
# shuffle=False keeps the columns in generation order, so features 0 and 1 are
# informative, 2 is a linear combination of them (redundant), and 3 and 4 are pure noise.
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, shuffle=False, random_state=42)

# Grow the forest
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Plot the importances
importances = model.feature_importances_
feature_names = [f'Feature {i}' for i in range(X.shape[1])]

plt.figure(figsize=(10, 6))
plt.barh(feature_names, importances)
plt.xlabel("Mean Decrease in Impurity Importance")
plt.title("Random Forest Feature Importance (MDI)")
plt.show()

You’ll likely see Features 0 and 1 dominating, with Feature 2 (the redundant one) picking up a share of its own because it carries the same information, and the noise features (3 and 4) trailing significantly. This makes intuitive sense. But MDI has a well-known flaw: it is biased towards high-cardinality features, meaning continuous features or categorical features with many distinct values. Such a feature offers more candidate split points, so it has more opportunities to produce a split that looks good by pure chance, artificially inflating its importance. Keep that in your back pocket.
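
A toy experiment makes the bias easy to see. The sketch below is a made-up setup, not a benchmark: the label follows a single binary feature with 20% of labels flipped, and a continuous column of pure noise sits next to it. All names (binary_signal, continuous_noise, toy_forest) are invented for the example. With fully grown trees, the noise column typically walks away with a large share of the MDI score, because its many distinct values give the trees plenty of ways to memorize the flipped labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000

binary_signal = rng.integers(0, 2, size=n)     # the only real driver of the label
flip = rng.random(n) < 0.2                     # flip roughly 20% of the labels
y_toy = np.where(flip, 1 - binary_signal, binary_signal)
continuous_noise = rng.random(n)               # unrelated to the label, but high-cardinality

X_toy = np.column_stack([binary_signal, continuous_noise])
toy_forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_toy, y_toy)

for name, score in zip(["binary_signal", "continuous_noise"], toy_forest.feature_importances_):
    print(f"{name:>16}: {score:.3f}")
```

Capping tree depth or raising min_samples_leaf usually shrinks the noise column's share, which hints that the bias is tied to how deeply the trees are allowed to fit the training data.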

Permutation Importance

This method is more computationally expensive but sidesteps MDI’s cardinality bias. The concept is straightforward and brilliant: if the model truly relies on a feature, randomly shuffling its values should noticeably hurt the model’s performance.

Here’s the process:

1. Calculate a baseline score (e.g., accuracy) for your model on a validation dataset.
2. For each feature:
   - Shuffle (permute) the values of that feature in the validation set, breaking its relationship with the target.
   - Recalculate the model’s score using this now-corrupted dataset.
   - The importance is the decrease in score (baseline score - permuted score).
3. Repeat this process multiple times to get a stable estimate.
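
The steps above are short enough to write out by hand. What follows is a minimal sketch, assuming accuracy as the score and a NumPy feature matrix; the function name manual_permutation_importance and its parameters are invented for illustration, and this is not how scikit-learn implements its own version.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def manual_permutation_importance(fitted_model, X_val, y_val, n_repeats=10, seed=0):
    """Mean and std of the accuracy drop when each column of X_val is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = fitted_model.score(X_val, y_val)            # step 1: baseline score
    drops = np.zeros((X_val.shape[1], n_repeats))
    for col in range(X_val.shape[1]):                      # step 2: one feature at a time
        for rep in range(n_repeats):                       # step 3: repeat for stability
            X_shuffled = X_val.copy()
            X_shuffled[:, col] = rng.permutation(X_val[:, col])
            drops[col, rep] = baseline - fitted_model.score(X_shuffled, y_val)
    return drops.mean(axis=1), drops.std(axis=1)


# Demo data: columns 0-1 informative, 2 redundant, 3-4 noise (shuffle=False keeps that order).
X_all, y_all = make_classification(n_samples=1000, n_features=5, n_informative=2,
                                   n_redundant=1, shuffle=False, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_all, y_all, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

mean_drop, std_drop = manual_permutation_importance(clf, X_val, y_val)
for i, (m, s) in enumerate(zip(mean_drop, std_drop)):
    print(f"Feature {i}: {m:.3f} +/- {s:.3f}")
```

The model is never refit inside the loop. Only the validation data is disturbed, which is why the method works for any fitted estimator that exposes a score.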

The beauty of this method is its model-agnostic nature and its direct measurement of “how much does messing with this feature hurt performance?” It doesn’t care about the data type of the feature.

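from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Permutation importance should be measured on data the model has not seen,
# so hold out a test set and refit the forest on the training portion only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)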

# result.importances_mean holds the average importance over the 10 repeats
sorted_idx = result.importances_mean.argsort()

plt.figure(figsize=(10, 6))
plt.barh([feature_names[i] for i in sorted_idx], result.importances_mean[sorted_idx])
plt.xlabel("Permutation Importance (Decrease in Accuracy)")
plt.title("Random Forest Feature Importance (Permutation)")
plt.show()

The story should be similar to the MDI plot, but it’s usually a more trustworthy one, because it measures the effect on the model’s score directly rather than counting splits. The noise features should have an importance hovering around zero (sometimes even slightly negative, purely by chance), while the true drivers show a clear positive drop.

Best Practices and Pitfalls

Don’t just run these and call it a day. Think like a scientist.

Use a Hold-Out Set: Don’t rely on permutation importance computed on the data you trained on. An overfit forest has memorized noise in the training data, so shuffling even a useless feature hurts its training score and makes that feature look important. Always use a validation or test set to see which features help the model generalize.
Correlated Features: This is the big one. If you have two highly correlated features, the model can use either one almost interchangeably. When you permute one, the other can still pick up the slack, leading to a lower importance score for both. MDI, for its part, tends to split the credit between them, and not always evenly. A low score here doesn’t mean the feature is unimportant; it means the information is important, but it’s spread across features. You might need to combine them or accept the lower scores.
MDI Bias is Real: If you have a “number of shoes owned” feature and a “gender” feature, MDI can assign a higher importance to the high-cardinality shoes feature, even if gender is the true driver. Prefer permutation importance when your features have varying levels of cardinality.
The Sum-to-1 Illusion: MDI importance scores are normalized to sum to 1. This makes them easy to read as “percentage contributions,” but don’t be fooled. This is just a scaling artifact. If you add 100 useless features, the importance of your truly important features will be diluted. The relative ordering is what matters.
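
The correlated-features pitfall is easy to reproduce. The sketch below uses an invented setup (two Gaussian features, a label driven mostly by x0, and a hypothetical near-copy of x0) and measures the permutation importance of x0 with and without its twin in the dataset. The exact numbers depend on the seed, but the score for x0 should fall once the twin is present, even though x0 matters exactly as much as before.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
x0 = rng.normal(size=n)
x1 = rng.normal(size=n)
y_corr = (x0 + 0.5 * x1 > 0).astype(int)       # x0 is the main driver of the label
x0_twin = x0 + 0.01 * rng.normal(size=n)       # almost perfectly correlated with x0


def importance_of_x0(X):
    """Permutation importance of column 0, measured on a held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y_corr, test_size=0.3, random_state=0)
    rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
    return result.importances_mean[0]


alone = importance_of_x0(np.column_stack([x0, x1]))
with_twin = importance_of_x0(np.column_stack([x0, x1, x0_twin]))

print(f"x0 importance on its own:       {alone:.3f}")
print(f"x0 importance with a near-copy: {with_twin:.3f}")
```

If a result like this shows up in real data, it is worth checking pairwise correlations before concluding that a low-scoring feature can be dropped.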

So, which one should you use? My advice: start with MDI because it’s instant and gives a good first pass. But for any serious analysis, particularly when presenting results or doing feature selection, invest the compute time and use permutation importance on a proper test set. It’s the gold standard for a reason. It tells you what actually matters, not just what the algorithm found convenient to use.