5.4 Regularization: Ridge (L2), Lasso (L1), and Elastic Net

Right, let’s talk about keeping your models from getting a bit too full of themselves. You’ve trained a linear regression, the predictions look great on your training data, and then you show it new data and it completely faceplants. This, my friend, is the classic sign of overfitting. Your model has basically memorized the training set, quirks, noise, and all, instead of learning the general patterns. It’s the equivalent of cramming for a test without understanding the concepts—you’ll fail the final.

Regularization is our go-to tool to smack some sense into an overzealous model. The core idea is beautifully simple: we add a penalty to the model’s loss function (the thing we’re trying to minimize) that discourages large coefficients, so the model can’t lean too heavily on any one feature. We’re basically saying, “Sure, try to make good predictions, but also, try to keep your coefficients small and well-behaved.” It’s a trade-off between fitting the training data perfectly and keeping the model simple and robust. This is the famous bias-variance tradeoff in action: we accept a little extra bias in exchange for a bigger drop in variance.

The Math: It’s Just a Penalty Function

Don’t let the fancy names scare you. All we’re doing is taking our standard loss function (usually Mean Squared Error for regression) and adding a new term. The new function looks like this:

Total Loss = Loss(MSE) + Penalty Term

The Penalty Term is where Ridge (L2) and Lasso (L1) differ. They’re named after the type of norm they use on the coefficients.

Ridge Regression (L2 Regularization)

Ridge regression adds a penalty equal to the sum of the squared coefficients (the squared L2 norm). The loss function becomes:

MSE + λ * (coefficient₁² + coefficient₂² + ... + coefficientₙ²)

That λ (lambda) is the critical hyperparameter here. You’ll also see it called alpha (scikit-learn does this, as the code in this section shows), which is annoying, but just know they’re the same thing. Think of λ as the dial on our penalty strength.

λ = 0: No penalty. We’re back to standard linear regression, warts and all.
λ -> infinity: The penalty crushes all coefficients towards zero, leading to underfitting.
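
To make the dial concrete, here is a minimal sketch that fits Ridge at several penalty strengths and prints the length of the coefficient vector each time. The synthetic dataset, its size, the seed, and the particular alpha values are all arbitrary choices made for illustration, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data; sizes, noise level, and seed are illustrative assumptions.
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

# The "lambda = 0" end of the dial: ordinary least squares, no penalty.
ols_norm = np.linalg.norm(LinearRegression().fit(X, y).coef_)
print(f"OLS (no penalty)     ||w|| = {ols_norm:.3f}")

# Turn the dial up and record how long the coefficient vector is each time.
alphas = [0.01, 1.0, 100.0, 10_000.0, 1_000_000.0]
norms = []
for alpha in alphas:
    w = Ridge(alpha=alpha).fit(X, y).coef_
    norms.append(np.linalg.norm(w))
    print(f"alpha = {alpha:>11g}  ||w|| = {norms[-1]:.3f}")
```

The norm should fall steadily as alpha grows: close to the unpenalized fit at the small end, and close to zero (a model that predicts little more than the mean) at the large end.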

Ridge is great because its penalty is smooth (differentiable everywhere), which makes it computationally friendly: the problem even has a closed-form solution. It shrinks coefficients towards zero, but it will rarely set them to exactly zero. This means all your features stay in the model, even if their influence is tiny. It’s a gentle nudge, not an amputation.

from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_regression

# Generate data with as many features as samples, a setup where plain least squares overfits
X, y = make_regression(n_samples=100, n_features=100, noise=0.5, random_state=42)

# Crucial: Regularization is sensitive to feature scale. ALWAYS scale your data.
# A pipeline makes this easy and prevents data leakage.
ridge_model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge_model.fit(X, y)

# Let's see what it did to the coefficients
coefficients = ridge_model.named_steps['ridge'].coef_
print(f"Number of coefficients: {len(coefficients)}")
print(f"Number of coefficients exactly zero: {sum(coefficients == 0)}")
print(f"Min/Max coefficient: {coefficients.min():.4f} / {coefficients.max():.4f}")
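
If the scaling warning sounds like superstition, here is a small sketch of what goes wrong without it. It fits Ridge on the same two-feature data twice, once with the first feature in "kilometers" and once in "meters", then repeats the experiment inside a scaling pipeline. The data, the unit labels, and the strong alpha=100 are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
X_km = rng.normal(size=(n, 2))  # pretend feature 0 is measured in kilometers
y_units = 2.0 * X_km[:, 0] + 1.0 * X_km[:, 1] + rng.normal(scale=0.5, size=n)

# Same information, different units: feature 0 re-expressed in meters.
X_m = X_km.copy()
X_m[:, 0] *= 1000.0

alpha = 100.0  # deliberately strong so the effect is easy to see

# Without scaling, the penalty bites differently depending on the units.
raw_km = Ridge(alpha=alpha).fit(X_km, y_units).predict(X_km)
raw_m = Ridge(alpha=alpha).fit(X_m, y_units).predict(X_m)
raw_gap = np.max(np.abs(raw_km - raw_m))

# With a scaler in the pipeline, the unit change is undone before the penalty applies.
scaled_km = make_pipeline(StandardScaler(), Ridge(alpha=alpha)).fit(X_km, y_units).predict(X_km)
scaled_m = make_pipeline(StandardScaler(), Ridge(alpha=alpha)).fit(X_m, y_units).predict(X_m)
scaled_gap = np.max(np.abs(scaled_km - scaled_m))

print(f"Largest prediction difference, unscaled: {raw_gap:.4f}")
print(f"Largest prediction difference, scaled:   {scaled_gap:.2e}")
```

The unscaled models disagree noticeably even though the data carry identical information, while the pipeline versions agree up to floating-point noise.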

Lasso Regression (L1 Regularization)

Lasso (Least Absolute Shrinkage and Selection Operator) takes a more aggressive approach. It adds a penalty equal to the sum of the absolute values of the coefficients (the L1 norm):

MSE + λ * (|coefficient₁| + |coefficient₂| + ... + |coefficientₙ|)

This change is a game-changer. The nature of the L1 penalty tends to produce sparse solutions, meaning it forces some coefficients to be exactly zero. This is effectively automatic feature selection. Lasso looks at your feature set and says, “You know what? We don’t actually need half of this junk,” and throws it in the bin. This is incredibly useful for high-dimensional datasets where you suspect many features are irrelevant or redundant.

from sklearn.linear_model import Lasso

# Again, scale your data. I'm not kidding. Do it.
lasso_model = make_pipeline(StandardScaler(), Lasso(alpha=0.1, max_iter=10000))
lasso_model.fit(X, y)

lasso_coefficients = lasso_model.named_steps['lasso'].coef_
print(f"Number of coefficients: {len(lasso_coefficients)}")
print(f"Number of coefficients exactly zero: {sum(lasso_coefficients == 0)}")  # Typically > 0 with Lasso
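
Why does L1 produce exact zeros when L2 doesn’t? One way to see it is to strip the problem down to a single coefficient. The sketch below is a simplified one-dimensional picture, not how scikit-learn is implemented: z stands for the unpenalized solution, t for the penalty strength, and the helper names l1_shrink and l2_shrink are made up for this example.

```python
import numpy as np

def l1_shrink(z, t):
    """Minimizer of 0.5*(w - z)**2 + t*|w|: soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def l2_shrink(z, t):
    """Minimizer of 0.5*(w - z)**2 + 0.5*t*w**2: proportional shrinkage."""
    return z / (1.0 + t)

z = np.array([3.0, 0.5, -2.0, -0.2])  # unpenalized one-dimensional solutions
t = 1.0                               # penalty strength

l1_result = l1_shrink(z, t)
l2_result = l2_shrink(z, t)
print("L1:", l1_result)
print("L2:", l2_result)
```

The L1 rule subtracts a fixed amount and clips at zero, so anything smaller than t in magnitude is wiped out entirely. The L2 rule divides by a constant, so small values get smaller but never reach zero.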

Watch Out: Lasso can be unstable. If you have highly correlated features, it might arbitrarily pick one and zero the others, which isn’t always ideal. Also, scikit-learn fits Lasso iteratively (by coordinate descent), and it can be slow to converge, so you might need to bump up max_iter to avoid a ConvergenceWarning.

Elastic Net: The Best of Both Worlds?

So, Ridge keeps all features, Lasso selects features. What if you want both? Enter Elastic Net. It’s a pragmatic hybrid that uses a linear combination of both L1 and L2 penalties.

MSE + λ * [ (l1_ratio * L1 Penalty) + ((1 - l1_ratio) * L2 Penalty) ]

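You now have two hyperparameters to tune: λ (the overall strength, alpha in scikit-learn) and l1_ratio (the mix between L1 and L2). If l1_ratio = 1, it’s pure Lasso. If l1_ratio = 0, the penalty is pure L2, as in Ridge, although scikit-learn’s ElasticNet scales its loss differently from its Ridge estimator, so the same alpha won’t reproduce the same coefficients.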
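
The l1_ratio = 1 end of the dial is easy to check for yourself. The sketch below fits scikit-learn’s Lasso and an ElasticNet with l1_ratio=1.0 on the same synthetic data and compares the coefficients; the dataset and the alpha value are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data; sizes, noise level, and seed are illustrative assumptions.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=30, n_informative=5, noise=5.0, random_state=0
)

lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)
enet_as_lasso = ElasticNet(alpha=0.5, l1_ratio=1.0).fit(X_demo, y_demo)

# With l1_ratio=1.0 the L2 part of the penalty vanishes, leaving the Lasso objective.
same = bool(np.allclose(lasso.coef_, enet_as_lasso.coef_))
print("ElasticNet(l1_ratio=1.0) matches Lasso:", same)

# Dial in some L2 and the coefficients start to differ.
enet_mixed = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X_demo, y_demo)
print("Zeroed by Lasso:              ", int(np.sum(lasso.coef_ == 0)))
print("Zeroed by ElasticNet (0.5 mix):", int(np.sum(enet_mixed.coef_ == 0)))
```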

Elastic Net is particularly useful when you have multiple features that are correlated with each other (a common reality). Lasso might struggle here, but the L2 component of Elastic Net helps it handle groups of correlated variables more sensibly.

from sklearn.linear_model import ElasticNet

# A mix of 70% Lasso's aggressiveness and 30% Ridge's gentleness
elastic_net_model = make_pipeline(StandardScaler(), ElasticNet(alpha=0.1, l1_ratio=0.7))
elastic_net_model.fit(X, y)

en_coefficients = elastic_net_model.named_steps['elasticnet'].coef_
print(f"Coefficients zeroed out: {sum(en_coefficients == 0)}")
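
To see the correlated-features behavior in isolation, here is a deliberately extreme sketch: two columns that are exact copies of each other. Real data is rarely this clean, and the sizes, seed, and penalty settings are made up for illustration (the tight tol is only there so the Elastic Net fit settles fully).

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x = rng.normal(size=300)
X_dup = np.column_stack([x, x])  # two perfectly correlated columns
y_dup = 3.0 * x + rng.normal(scale=0.5, size=300)

lasso_dup = make_pipeline(StandardScaler(), Lasso(alpha=0.1)).fit(X_dup, y_dup)
enet_dup = make_pipeline(
    StandardScaler(),
    ElasticNet(alpha=1.0, l1_ratio=0.5, tol=1e-8, max_iter=100_000),
).fit(X_dup, y_dup)

lasso_w = lasso_dup.named_steps["lasso"].coef_
enet_w = enet_dup.named_steps["elasticnet"].coef_

print("Lasso coefficients:      ", np.round(lasso_w, 3))
print("Elastic Net coefficients:", np.round(enet_w, 3))
```

In this setup Lasso should hand essentially all the weight to one copy and leave the other at (or within rounding error of) zero, while Elastic Net splits the weight evenly between the two. Which copy Lasso favors is an accident of the solver’s update order, which is exactly the arbitrariness the warning above is about.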

Best Practices and Pitfalls

Scale Your Features. Always. I’ve said it three times now because it’s the most common mistake. Regularization penalizes large coefficients, and a coefficient’s size depends on its feature’s units: a feature measured in millimeters needs a far smaller coefficient than the same feature measured in kilometers, so it gets penalized far less. Standardize (zero mean, unit variance) or normalize your data first so the penalty treats every feature evenly. Using a Pipeline is the easiest way to ensure this happens cleanly.
Tune Your Hyperparameters. You can’t just guess alpha and l1_ratio. Use GridSearchCV or RandomizedSearchCV to find the right values through cross-validation; scikit-learn also ships RidgeCV, LassoCV, and ElasticNetCV, which build the search into the estimator. Throwing a default alpha=1.0 at your problem is a recipe for mediocrity.
Interpretability is a Superpower. A Lasso model that zeroes out 80% of your features isn’t just a simpler predictor; it’s a signal. It hints at which features actually matter, and that insight is often more valuable than a slight boost in accuracy. Just remember the correlated-features caveat: a zeroed feature may simply have lost out to a near-duplicate.
It Doesn’t Solve All Problems. Regularization helps with overfitting caused by complexity, but it won’t fix a fundamentally useless set of features or a broken model assumption. It’s a powerful tool, not a magic wand.
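
The first two practices combine naturally: put the scaler and the model in one pipeline, then let cross-validation pick the hyperparameters. The sketch below is one way to do that. The synthetic dataset, the grid values, and the 5-fold setting are assumptions chosen for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; sizes, noise level, and seed are illustrative assumptions.
X_all, y_all = make_regression(
    n_samples=300, n_features=50, n_informative=10, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.25, random_state=0
)

# Scaling lives inside the pipeline, so each CV fold is scaled using only its own training part.
pipeline = make_pipeline(StandardScaler(), ElasticNet(max_iter=10_000))

# make_pipeline names each step after its lowercased class, hence the "elasticnet__" prefix.
param_grid = {
    "elasticnet__alpha": [0.01, 0.1, 1.0, 10.0],
    "elasticnet__l1_ratio": [0.2, 0.5, 0.8, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)

best_params = search.best_params_
test_r2 = search.best_estimator_.score(X_test, y_test)  # R^2 on held-out data
print("Best parameters:", best_params)
print(f"Held-out R^2: {test_r2:.3f}")
```

Keeping a separate test split that the search never sees gives an honest read on how the tuned model generalizes.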