12.9 Embedded Methods: LASSO and Tree Feature Importance

Right, so you’ve got your data, you’ve thrown a bunch of features at the wall, and now you’re wondering which ones are actually sticking. Fun as that is, you’re supposed to be building a damn suspension bridge, not decorating a wall with spaghetti. This is where embedded methods come in. They’re the smart, multitasking construction crew that builds the bridge and tells you which steel beams are load-bearing and which are just for show: feature selection happens as part of model training itself. No separate step. Efficient. I like it.

Let’s talk about two of the heavy hitters: LASSO for your linear models and Tree-based Feature Importance for your, well, trees.

The LASSO’s Mean Trick: Shrinkage to Zero

LASSO, which stands for Least Absolute Shrinkage and Selection Operator, is a linear regression technique that has a neat party trick: it can set the coefficients of useless features to exactly zero. Poof. Gone. It’s feature selection baked right into the model.

How does it manage this black magic? It all comes down to the cost function. Ordinary least squares (OLS) regression just tries to minimize the sum of squared errors. LASSO adds a penalty term to this: the sum of the absolute values of the coefficients (the L1 norm), multiplied by a tuning parameter, alpha. So, the cost function becomes: Minimize(Sum of Squared Errors + alpha * Sum of |coefficient_j| over all features j). The intercept is left out of the penalty. (scikit-learn divides the squared-error term by 2 * n_samples, which changes the numeric scale of alpha but not the idea.)

This alpha is your boss’s level of micromanagement. A tiny alpha means little penalty, and LASSO acts almost like a friendly OLS regression. A huge alpha means a massive penalty for having large coefficients, so it aggressively shrinks them all down to zero, useful ones included. The beautiful thing happens in between. The absolute-value penalty charges the same price per unit of coefficient no matter how small that coefficient already is, so a feature whose contribution to the fit can’t cover that price doesn’t just get shrunk, it gets zeroed out completely. It’s brutally efficient.
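To see why the zeros are exact rather than merely small, here’s a minimal sketch under one big simplifying assumption: the features are standardized and uncorrelated. In that special case the LASSO solution (using scikit-learn’s scaling of alpha) reduces to "soft-thresholding" each OLS coefficient. The soft_threshold helper and the beta_ols values are made up for illustration; real data with correlated features won’t follow this formula exactly.

```python
import numpy as np

def soft_threshold(beta_ols, alpha):
    """Pull each coefficient toward zero by alpha; anything within
    alpha of zero lands on exactly 0 (illustrative helper)."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - alpha, 0.0)

# Hypothetical OLS coefficients: two strong, one weak, two near-useless
beta_ols = np.array([3.0, -2.0, 0.4, 0.05, -0.02])

for alpha in [0.0, 0.1, 0.5, 5.0]:
    beta_lasso = soft_threshold(beta_ols, alpha)
    print(f"alpha={alpha}: {beta_lasso}  non-zero: {np.count_nonzero(beta_lasso)}")
```

As alpha climbs, the weakest coefficients drop out first, and by alpha=5.0 even the strong ones are gone. That’s the micromanaging boss in action.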

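Here’s how you use it in Python with scikit-learn. Notice we usually standardize our features first. The penalty acts on the size of the coefficients, and a coefficient’s size depends on the units of its feature: a length recorded in millimetres needs a far smaller coefficient than the same length in metres, so it would be penalized less for no good reason.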

import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_regression

# Generate a synthetic dataset with some useless features
X, y = make_regression(n_samples=1000, n_features=20, n_informative=5, noise=0.5, random_state=42)

# Standardize the features (very important for LASSO)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit LASSO with Cross-Validation to find the best alpha
lasso = LassoCV(cv=5, random_state=42).fit(X_scaled, y)

# Check which coefficients were zeroed out
print("Number of features used:", np.sum(lasso.coef_ != 0))
print("Coefficients for all features:", lasso.coef_)

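In this example the output should show most of the 20 coefficients at exactly zero, leaving roughly the five informative ones. On messier real data, don’t expect it to be that clean: LassoCV tunes alpha for prediction error, not for recovering the exact set of true features, so a few weak features can survive with small non-zero coefficients. The pitfalls? If you have highly correlated features, LASSO tends to keep one and zero out the others more or less arbitrarily, which is a problem if you’re reading the survivors as "the features that matter." Also, if alpha is set too high, it’ll start zeroing out useful features. Always use cross-validation (LassoCV is your friend) to find a sensible value.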
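Once LASSO has picked its survivors, you usually want the slimmed-down feature matrix, not just a printout of coefficients. One way to get it, sketched here on the same synthetic setup, is scikit-learn’s SelectFromModel wrapped around the fitted model. The variable names are illustrative, and this is one reasonable approach rather than the only one (plain boolean indexing on lasso.coef_ works too).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Same synthetic setup as before: 20 features, only 5 informative
X, y = make_regression(n_samples=1000, n_features=20, n_informative=5,
                       noise=0.5, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

lasso = LassoCV(cv=5, random_state=42).fit(X_scaled, y)

# prefit=True reuses the already-fitted model instead of refitting it.
# For LASSO-type models the default threshold keeps |coef| > 1e-5.
selector = SelectFromModel(lasso, prefit=True)
X_selected = selector.transform(X_scaled)

kept = np.flatnonzero(selector.get_support())
print("Kept feature indices:", kept)
print("Shape before:", X_scaled.shape, "after:", X_selected.shape)
```

In a real project you’d put the scaler, the selector, and the downstream model in a Pipeline so the selection is refit inside each cross-validation fold instead of peeking at the whole dataset.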

Tree Feature Importance: The Gini Gauge

Now, for tree-based methods like Random Forests or Gradient Boosted Machines (GBM), we use a different embedded approach. These models don’t have coefficients like linear models. Instead, they provide a metric called “feature importance.”

The most common type is mean decrease in impurity. Think of it this way: every time a feature is used to split a node in a decision tree, it reduces the impurity (Gini impurity or entropy for classification, variance for regression) of that node. For a single tree, a feature’s importance is the total impurity reduction from every split that uses it, with each split weighted by the fraction of training samples that reach that node. For a forest, those per-tree scores are averaged across all trees (scikit-learn also normalizes them so they sum to 1). Features used near the top of the trees to make big, impactful decisions get high scores. Features that are rarely used or only make tiny adjustments get low scores.
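To make "weighted impurity reduction" concrete, here’s a hand calculation for a single regression split, using variance as the impurity. The impurity_decrease helper and the toy target values are made up for illustration; the weighting follows the formula scikit-learn documents for a split’s impurity decrease, but treat this as a sketch of the idea, not the library’s internals.

```python
import numpy as np

def impurity_decrease(y_parent, y_left, y_right, n_total):
    """Weighted variance reduction for one split (illustrative helper).

    n_total is the number of samples in the whole training set, so the
    result is weighted by the fraction of samples reaching this node.
    """
    n_p, n_l, n_r = len(y_parent), len(y_left), len(y_right)
    child_impurity = (n_l / n_p) * np.var(y_left) + (n_r / n_p) * np.var(y_right)
    return (n_p / n_total) * (np.var(y_parent) - child_impurity)

# Toy targets at the root node of a tiny tree
y_parent = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])

# A split that separates the two groups perfectly
good = impurity_decrease(y_parent, y_parent[:3], y_parent[3:], n_total=6)

# A split that mixes them and tells you almost nothing
bad = impurity_decrease(y_parent,
                        np.array([1.0, 5.0, 1.0]),
                        np.array([5.0, 1.0, 5.0]),
                        n_total=6)

print(f"good split: {good:.4f}, useless split: {bad:.4f}")
```

The clean split removes all 4.0 units of variance, while the mixed one manages about 0.44. Add up numbers like these for every split a feature makes, and you have that feature’s score for the tree.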

It’s intuitive and powerful, but you have to be aware of its quirks.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import fetch_california_housing

# Load a real dataset
data = fetch_california_housing()
X, y = data.data, data.target
feature_names = data.feature_names

# Fit a Random Forest
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X, y)

# Get and display feature importances
importances = rf.feature_importances_
sorted_indices = np.argsort(importances)[::-1]

print("Feature ranking:")
for i, idx in enumerate(sorted_indices):
    print(f"{i+1}. {feature_names[idx]} ({importances[idx]:.4f})")

The Dirty Secret of Tree Importance

Here’s the thing the manual often glosses over: this method is biased towards high-cardinality features, meaning features with many unique values. A continuous feature has more potential split points than a binary one, so it has more opportunities to look “important.” On top of that, the scores are computed from the training data, so splits that merely memorize noise still earn credit. This is why you might see a useless but unique feature (like a row ID) ranked highly if you’re not careful. Always be skeptical.
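You can watch this bias happen with a small experiment. The sketch below bolts two pure-noise columns onto a synthetic dataset, one continuous and one binary, and compares their scores. Everything here (the column names, the sample size, the noise level) is invented for illustration, and the exact numbers will vary with the seed.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Three genuinely informative features (synthetic, illustration only)
X_real, y = make_regression(n_samples=500, n_features=3, n_informative=3,
                            noise=20.0, random_state=0)

# Two pure-noise columns: one continuous (many unique values), one binary
noise_continuous = rng.normal(size=(500, 1))
noise_binary = rng.integers(0, 2, size=(500, 1)).astype(float)
X = np.hstack([X_real, noise_continuous, noise_binary])
names = ["real_0", "real_1", "real_2", "noise_continuous", "noise_binary"]

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

for name, score in zip(names, rf.feature_importances_):
    print(f"{name:>17}: {score:.4f}")
```

Neither noise column has anything to do with y, yet the continuous one should pick up noticeably more importance than the binary one, purely because deep trees have more places to split it.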

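The best practice? Use it, but don’t trust it blindly. Combine it with other methods like permutation importance, which measures how much your model’s score drops when you randomly shuffle a feature’s values. Permutation importance is more computationally expensive, but when you compute it on held-out data it doesn’t share the cardinality bias, making it a great sanity check. It has its own blind spot, though: with strongly correlated features, shuffling one still leaves the model the other to lean on, so both can look less important than they are.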
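Here’s a minimal sketch of that sanity check using scikit-learn’s permutation_importance on a held-out split. The dataset is synthetic with one deliberately useless continuous column tacked on; the names and sizes are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Three informative features plus one pure-noise continuous column
X_real, y = make_regression(n_samples=500, n_features=3, n_informative=3,
                            noise=20.0, random_state=0)
X = np.hstack([X_real, rng.normal(size=(500, 1))])
names = ["real_0", "real_1", "real_2", "noise_continuous"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle each column of the held-out set 10 times and record the score drop
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)

print(f"{'feature':>17}  impurity  permutation")
for name, mdi, perm in zip(names, rf.feature_importances_, result.importances_mean):
    print(f"{name:>17}  {mdi:8.4f}  {perm:11.4f}")
```

The noise column still gets a small slice of the impurity-based score, but its permutation importance should hover around zero (it can even dip slightly negative), because shuffling a feature the model never needed doesn’t hurt held-out performance.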

So, there you have it. LASSO gives you a sparse model by driving coefficients to zero with mathematical precision, while tree-based importance tells you which features were the workhorses inside the black box. Use them both, but understand their biases. They’re tools, not oracles.