12.8 Wrapper Methods: RFE and Sequential Feature Selection

Alright, let’s talk about wrapper methods. You’ve probably been eyeballing your dataset, wondering which features are the real MVPs and which are just dead weight. Filter methods (like correlation scores) are a good first date, but they don’t tell you how features actually behave in a relationship with your specific model. That’s where wrapper methods come in. They’re more demanding—they actually train the model over and over to see which subset of features makes it perform best. It’s computationally expensive, like a high-maintenance partner, but you get a much clearer picture of what works.

The core idea is beautifully simple, if a little brute-force: we treat feature selection as a search problem. We have a set of features, and we’re trying to find the subset that gives us the best model performance. We use the model itself as a “wrapper” to evaluate these subsets. It’s the opposite of dry theory; it’s hands-on, empirical, and tells you exactly what you need to know for this model and this data.

How Recursive Feature Elimination (RFE) Works

RFE is the workhorse of wrapper methods, and its name is a bit of a spoiler. It’s recursive because it loops, and it eliminates features. Shocking, I know.

Here’s the clever part: instead of just randomly dropping features and retraining, RFE uses the model’s own internal feature importance metrics to make an educated guess about which one to axe next. For a linear model, it uses the absolute magnitude of the coefficients. For tree-based models, it uses feature importances (like Gini importance). The process is straightforward:

1. Train the model on the entire set of features.
2. Get the importance score for each feature.
3. Say goodbye to the feature(s) with the lowest importance score(s).
4. Repeat steps 1-3 with the remaining features until you hit your desired number.
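
To make the loop concrete, here is a minimal from-scratch sketch of those steps. It is not scikit-learn’s implementation: manual_rfe is a hypothetical helper, the data is synthetic (make_regression), and it assumes a linear model, one feature dropped per round, and features on a comparable scale.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler


def manual_rfe(X, y, n_features_to_select):
    """Hypothetical helper: backward elimination driven by |coefficient|."""
    remaining = list(range(X.shape[1]))  # column indices still in play
    eliminated = []                      # dropped columns, earliest first
    while len(remaining) > n_features_to_select:
        model = LinearRegression().fit(X[:, remaining], y)  # step 1: train
        importances = np.abs(model.coef_)                   # step 2: score
        weakest = remaining[int(np.argmin(importances))]    # step 3: pick loser
        remaining.remove(weakest)
        eliminated.append(weakest)
    return remaining, eliminated


# Synthetic data (an assumption for illustration): 8 features, 3 informative.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, n_informative=3, noise=5.0, random_state=0
)
# Scale first so coefficient magnitudes are comparable across features.
X_demo = StandardScaler().fit_transform(X_demo)

kept, dropped = manual_rfe(X_demo, y_demo, n_features_to_select=3)
print("kept:", sorted(kept))
print("dropped (earliest first):", dropped)
```

Note that the model is refit after every removal. That refit is the whole point: a coefficient can change a lot once a correlated neighbor is gone.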

You tell it how many features you want (n_features_to_select), and it works backwards until it gets there. Let’s see it in action with a linear regression on the California housing dataset (the old Boston housing dataset was the classic choice here, but it has been removed from scikit-learn over ethical concerns).

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load the California Housing dataset (downloaded on first use)
data = fetch_california_housing()
X, y = data.data, data.target
feature_names = data.feature_names

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a pipeline to scale our data and then apply RFE
# We want to find the top 5 features
estimator = LinearRegression()
selector = RFE(estimator, n_features_to_select=5, step=1)
pipeline = Pipeline(steps=[('scaler', StandardScaler()), ('selector', selector)])

# Fit the pipeline
pipeline.fit(X_train, y_train)

# Now, let's see which features made the cut
selected_features = np.array(feature_names)[selector.support_]
print(f"Selected features: {list(selected_features)}")

# And the ranking: 1 means selected; higher numbers were eliminated earlier.
print(f"Feature rankings: {dict(zip(feature_names, selector.ranking_))}")

The beauty here is that the model itself is guiding the process. The step parameter lets you eliminate more than one feature per round to speed things up, but be careful—it’s a trade-off between speed and precision. Removing multiple features at once might boot a feature that would have become important after another was removed.
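
One way to see what step does is to compare the rankings it produces. The sketch below uses synthetic data (make_regression, an assumption for illustration) and fits RFE twice. With step=3, features removed in the same round share a rank, so the ranking is coarser.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: 10 features, 4 of them informative.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, n_informative=4, noise=5.0, random_state=0
)

# One feature per round: 6 rounds to get from 10 features down to 4.
rfe_fine = RFE(LinearRegression(), n_features_to_select=4, step=1).fit(X_demo, y_demo)
# Three features per round: only 2 rounds, so only 2 model fits during elimination.
rfe_chunky = RFE(LinearRegression(), n_features_to_select=4, step=3).fit(X_demo, y_demo)

print("step=1 ranking:", rfe_fine.ranking_)
print("step=3 ranking:", rfe_chunky.ranking_)
print("same subset selected:", bool((rfe_fine.support_ == rfe_chunky.support_).all()))
```

On clean synthetic data the two runs may well agree. The risk of disagreement grows when features are strongly correlated, because that is when one removal changes the importance of what is left.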

The Sequential Approach: Forward and Backward Selection

RFE is a backward elimination method. It starts with everything and removes the weak links. But you can also go the other way. Sequential Feature Selection (SFS) can be either forward or backward.

Forward Selection: Starts with zero features and greedily adds the one that gives the biggest performance boost. It repeats this until a stopping criterion is met, typically a target number of features or no further improvement in the score.
Backward Selection: The same idea as RFE, but instead of using model coefficients, it uses a performance metric (like R² or AUC) to decide which feature to remove next. This is often more robust, because it needs no importance scores from the model, but it is much slower: every candidate removal means refitting and rescoring the model.
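
The greedy forward loop is short enough to write by hand. This is a minimal sketch under stated assumptions, not how any particular library does it: forward_select is a hypothetical helper, the data is synthetic, and the stopping criterion is simply a fixed number of features k.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def forward_select(X, y, k, cv=5):
    """Hypothetical helper: greedy forward selection scored by mean CV R^2."""
    selected = []                            # columns chosen so far, in order
    candidates = list(range(X.shape[1]))     # columns not yet chosen
    history = []                             # (column added, score after adding)
    while len(selected) < k:
        scores = {}
        for j in candidates:
            cols = selected + [j]            # try adding candidate j
            scores[j] = cross_val_score(
                LinearRegression(), X[:, cols], y, cv=cv, scoring="r2"
            ).mean()
        best = max(scores, key=scores.get)   # keep the biggest boost
        selected.append(best)
        candidates.remove(best)
        history.append((best, scores[best]))
    return selected, history


# Synthetic data (an assumption for illustration): 8 features, 3 informative.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, n_informative=3, noise=5.0, random_state=0
)

selected, history = forward_select(X_demo, y_demo, k=3)
for col, score in history:
    print(f"added feature {col}: mean CV R^2 = {score:.3f}")
```

Picking 3 of 8 features this way evaluates 8 + 7 + 6 = 21 candidate subsets, each with 5 cross-validation fits. Backward selection down to 3 features would need 8 + 7 + 6 + 5 + 4 = 30 subsets, all on larger models.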

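Why would you use one over the other? Forward selection is cheaper when you expect to keep only a few features, because it stops early and only ever fits small models; backward selection is the cheaper direction when you expect to keep most of them. Backward selection can also spot synergies that forward selection misses: it starts with every feature present, so two features that are only useful together both get a chance to prove it. mlxtend has a great implementation of SFS.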

# You'll need to install mlxtend first: `pip install mlxtend`
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

sfs = SFS(estimator,
          k_features=5,
          forward=True,  # Set to False for Backward Selection
          scoring='r2',
          cv=5)  # Score each candidate subset with 5-fold cross-validation

# Create a new pipeline with just a scaler and SFS
pipeline_sfs = Pipeline(steps=[('scaler', StandardScaler()), ('sfs', sfs)])
pipeline_sfs.fit(X_train, y_train)

# Get the details
print("Forward Selection Results:")
print(pipeline_sfs.named_steps['sfs'].subsets_)
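
If you would rather not add a dependency, scikit-learn (0.24 and later) ships its own SequentialFeatureSelector with a slightly different interface: n_features_to_select instead of k_features, and direction instead of forward. The sketch below mirrors the pipeline above on synthetic data (make_regression, an assumption so the example runs without a download).

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 8 features, 3 of them informative.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, n_informative=3, noise=5.0, random_state=0
)

sk_sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=3,
    direction="forward",   # use "backward" for backward selection
    scoring="r2",
    cv=5,                  # each candidate subset is scored with 5-fold CV
)
pipe = Pipeline(steps=[("scaler", StandardScaler()), ("sfs", sk_sfs)])
pipe.fit(X_demo, y_demo)

# Boolean mask over the original columns: True means the feature was kept.
mask = pipe.named_steps["sfs"].get_support()
print("selected column indices:", [i for i, keep in enumerate(mask) if keep])
```

Unlike mlxtend’s subsets_, the scikit-learn selector does not keep the per-step scores, so mlxtend remains the better choice when you want to inspect the search path.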

The Crucial Pitfalls and Best Practices

This is where the rubber meets the road. Wrapper methods are powerful, but they have teeth.

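The Leakage Trap: This is the biggest one. You must perform feature selection within the folds of your cross-validation. If you perform RFE on your entire training set and then do CV, you’ve leaked information from the held-out folds into your feature selection process, and your performance estimates will be wildly optimistic garbage. Put the selector (RFE or SFS) and the final model in a Pipeline and pass the whole pipeline to cross_val_score, so that selection is redone on each training fold. The cv parameter in mlxtend’s SFS helps in a different way: it scores candidate subsets with cross-validation, which makes the choice of subset less prone to overfitting. The winning subset’s internal CV score is still optimistically biased, though, so judge the final model on the held-out test set or with an outer CV loop.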
Computational Cost: Let’s be honest, this is slow. Training a model dozens or hundreds of times isn’t trivial. For large datasets, you might need to use the step parameter in RFE to eliminate features in chunks or stick with faster models (like linear models instead of large GBMs) for the selection process itself.
The Model is the Boss: Remember, the features are selected for that specific model. The best subset for a LinearRegression might be useless for a RandomForest. If you change your model, you need to re-run the feature selection. There’s no universal “best feature set.”
The Random State Gambit: Many models have stochastic elements. If you don’t set random_state in your estimator (e.g., for an SGDRegressor or a RandomForest), the feature importance scores can wobble between runs, leading to slightly different selected features. For reproducibility, lock it down.
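
A minimal sketch of the leak-free pattern from the first pitfall, assuming synthetic data and a plain LinearRegression as the final model. The step names are arbitrary. Because the scaler and the selector live inside the pipeline, cross_val_score refits both on each training fold and never lets them see the fold being scored.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 10 features, 4 informative.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, n_informative=4, noise=5.0, random_state=0
)

leak_free = Pipeline(steps=[
    ("scaler", StandardScaler()),                                   # fit per fold
    ("selector", RFE(LinearRegression(), n_features_to_select=4)),  # fit per fold
    ("model", LinearRegression()),                                  # final estimator
])

# Each of the 5 folds reruns scaling, selection, and training from scratch.
scores = cross_val_score(leak_free, X_demo, y_demo, cv=5, scoring="r2")
print(f"mean R^2 across folds: {scores.mean():.3f}")
```

The selected subset can differ from fold to fold, and that is fine. The score estimates how well the whole procedure (scale, select, fit) generalizes, which is the number you actually care about.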

Wrapper methods are your go-to when you need a precise, model-specific answer to the feature selection question. They cut through the noise by letting the model’s performance do the talking. Just be sure to avoid the leakage trap—it’s the number one mistake that makes the results look better than they are. Trust me, I’ve been there. It’s a facepalm moment you want to avoid.