79.5 Regression: Linear, Ridge, Lasso
Right, so you want to make a machine predict a number. Not just any number, but a specific, continuous number. Like the price of a house, the temperature tomorrow, or how many milliseconds it will take for a user to close your app after seeing that garish new banner ad. This isn’t classification anymore; this is regression, and it’s where we get to draw lines. Beautiful, predictive lines.
We’ll start with the granddaddy of them all: Linear Regression. The idea is almost stupidly simple. We’re going to find a straight line (or a hyperplane, if you want to be fancy and multidimensional about it) that best fits our data. The “best fit” is defined as the line that minimizes the sum of the squared differences between the actual data points and the points predicted by our line. These differences are called residuals, and squaring them does two wonderfully useful things: it makes every term non-negative (so a point above the line can’t cancel out one below it) and it penalizes larger errors much more severely.
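To make the “best fit” criterion concrete, here is a minimal sketch that fits a line with NumPy’s least-squares solver and then checks that nudging the slope or intercept only makes the sum of squared residuals worse. The toy data and the helper name sum_squared_residuals are assumptions made up for this illustration.

```python
import numpy as np

# Toy data (an assumption for illustration): y = 4 + 3x + noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=50)
y = 4 + 3 * x + rng.normal(0, 1, size=50)

# Design matrix with a column of ones so the solver also fits an intercept.
A = np.column_stack([np.ones_like(x), x])
(intercept, slope), *_ = np.linalg.lstsq(A, y, rcond=None)

def sum_squared_residuals(b0, b1):
    """Sum of squared residuals for the line y = b0 + b1 * x."""
    residuals = y - (b0 + b1 * x)
    return float(np.sum(residuals ** 2))

best = sum_squared_residuals(intercept, slope)
nudged_slope = sum_squared_residuals(intercept, slope + 0.1)
nudged_intercept = sum_squared_residuals(intercept + 0.1, slope)

print(f"Fitted line: y = {intercept:.2f} + {slope:.2f}x")
print(f"SSE at the fit:          {best:.3f}")
print(f"SSE with slope + 0.1:     {nudged_slope:.3f}")
print(f"SSE with intercept + 0.1: {nudged_intercept:.3f}")
```

Any other line you try should give a larger sum than the fitted one, which is all “least squares” really means.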
The Straightforward: Ordinary Least Squares
Let’s get our hands dirty. Here’s the classic fit and predict dance you’ll come to know and love.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Let's fabricate some data. Pure linear relationship with some noise.
np.random.seed(42) # for reproducibility
X = 2 * np.random.rand(100, 1) # 100 data points, 1 feature
y = 4 + 3 * X + np.random.randn(100, 1) # y = 4 + 3x + noise
# Split the data. Always. Do not train on your test data. Ever.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# See how badly we did (or well, in this case)
print(f"Coefficient: {model.coef_[0][0]:.2f}")
print(f"Intercept: {model.intercept_[0]:.2f}")
print(f"Mean Squared Error: {mean_squared_error(y_test, y_pred):.2f}")
print(f"R² Score: {r2_score(y_test, y_pred):.2f}")
You should see an intercept hovering around 4 and a coefficient around 3. The R² score tells you the proportion of the variance in the dependent variable that’s predictable from the independent variable. 1.0 is perfect; 0.0 means your model does no better than predicting the mean of y for everything. If it’s negative, your model is actually worse than that constant guess, which is a truly impressive feat of failure.
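If the R² scale feels abstract, it can help to compute it by hand. The sketch below uses four made-up target values and three made-up sets of predictions (good, constant-mean, and deliberately reversed) to show where 0.0 and negative scores come from; r2_by_hand is a hypothetical helper, checked against scikit-learn’s r2_score.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up values purely for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_good = np.array([2.8, 5.1, 7.2, 8.9])         # close to the truth
y_mean = np.full_like(y_true, y_true.mean())    # always predict the mean
y_bad = y_true[::-1].copy()                     # predictions in reverse order

def r2_by_hand(y, y_hat):
    """R² = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

for name, y_hat in [("good", y_good), ("mean", y_mean), ("bad", y_bad)]:
    print(f"{name:>4}: by hand = {r2_by_hand(y_true, y_hat):.3f}, "
          f"sklearn = {r2_score(y_true, y_hat):.3f}")
```

The constant-mean prediction scores exactly 0, and the reversed predictions score well below it, because their squared errors exceed the total variance of y.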
OLS is brilliant, but it has a fatal flaw. It’s an unbiased estimator, which is statistically great, but it can have very high variance. This means if you trained it on a slightly different dataset, the resulting model (the slope and intercept of that line) could be wildly different. This is a sign of overfitting, especially when you have features that are highly correlated with each other (multicollinearity). The model gets jittery, trying to assign precise credit to features that are essentially telling it the same story.
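Here is a rough sketch of that jitter. It draws many small datasets from the same process and measures how much the first OLS coefficient moves from one dataset to the next, once with two unrelated features and once with two nearly identical ones. The data-generating setup, the noise levels, and the helper name first_coef_spread are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def first_coef_spread(collinear, n_datasets=200, n_samples=50):
    """Std. dev. of the first fitted coefficient across freshly drawn datasets."""
    coefs = []
    for _ in range(n_datasets):
        x1 = rng.normal(size=n_samples)
        if collinear:
            x2 = x1 + 0.05 * rng.normal(size=n_samples)  # near-copy of x1
        else:
            x2 = rng.normal(size=n_samples)              # unrelated to x1
        y = x1 + x2 + rng.normal(size=n_samples)         # true coefficients: 1 and 1
        X = np.column_stack([x1, x2])
        coefs.append(LinearRegression().fit(X, y).coef_[0])
    return float(np.std(coefs))

spread_independent = first_coef_spread(collinear=False)
spread_collinear = first_coef_spread(collinear=True)

print(f"Spread of first coefficient, independent features: {spread_independent:.2f}")
print(f"Spread of first coefficient, collinear features:   {spread_collinear:.2f}")
```

The true coefficient is 1 in both cases; with the near-duplicate feature, the individual estimates should swing far more widely around it, even though their average is still about right.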
The Well-Behaved: Ridge Regression (L2 Regularization)
Enter regularization. Think of it as putting a leash on your model. Ridge Regression (or L2 regularization) tackles the high variance problem by penalizing large coefficients. It adds a “squared magnitude” of the coefficients to the loss function (the thing we’re minimizing). The model now has to ask itself, “Is making this coefficient huge really worth the improvement in fit, or will I get penalized for it?”
The strength of the leash is controlled by the alpha parameter. alpha=0 is just plain OLS. A higher alpha means more constraint, pulling all coefficients closer to zero (but never quite to zero).
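To see the leash tighten, the sketch below solves the Ridge problem directly from its closed form, w = (XᵀX + alpha·I)⁻¹ Xᵀy, and prints the size of the coefficient vector as alpha grows. For simplicity it assumes no intercept, and it uses made-up data; ridge_closed_form is a hypothetical helper, compared against scikit-learn’s Ridge with fit_intercept=False.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Made-up data for illustration; no intercept term to keep the algebra simple.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 3.0, 0.5]) + rng.normal(size=100)

def ridge_closed_form(X, y, alpha):
    """Minimizes ||y - Xw||^2 + alpha * ||w||^2 by solving the normal equations."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

alphas = [0.0, 1.0, 10.0, 100.0, 1000.0]
norms = []
for alpha in alphas:
    w = ridge_closed_form(X, y, alpha)
    norms.append(float(np.linalg.norm(w)))
    print(f"alpha={alpha:>7}: ||w|| = {norms[-1]:.3f}, coefficients = {np.round(w, 3)}")

# Sanity check against scikit-learn (same objective when fit_intercept=False).
sk_coef = Ridge(alpha=10.0, fit_intercept=False).fit(X, y).coef_
print("Matches sklearn at alpha=10:", np.allclose(sk_coef, ridge_closed_form(X, y, 10.0)))
```

With alpha=0 the formula collapses to ordinary least squares; as alpha grows, the coefficient vector keeps shrinking, but the individual entries get small rather than landing exactly on zero.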
from sklearn.linear_model import Ridge
# Let's use a slightly more complex dataset
X = 3 * np.random.rand(100, 5) # 5 features now
y = 2 + X @ np.array([1, -2, 0, 3, 0.5]) + np.random.randn(100) # @ is matrix multiplication
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Compare OLS and Ridge
ols_model = LinearRegression()
ols_model.fit(X_train, y_train)
ridge_model = Ridge(alpha=10.0) # This is our leash strength
ridge_model.fit(X_train, y_train)
print("OLS Coefficients:", ols_model.coef_)
print("Ridge Coefficients:", ridge_model.coef_)
print("\nOLS R² on Test:", r2_score(y_test, ols_model.predict(X_test)))
print("Ridge R² on Test:", r2_score(y_test, ridge_model.predict(X_test)))
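Don’t expect fireworks on this particular dataset: the five features are independent and there are plenty of samples, so the two R² scores will usually land close together, and OLS may even edge ahead. Ridge earns its keep when features are correlated or samples are scarce. There it trades a little bias for a lot less variance, making it more robust and generalizable. It’s the model that plays well with others and doesn’t cause a scene.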
The Brutal: Lasso Regression (L1 Regularization)
Now, meet Ridge’s more ruthless cousin: Lasso (L1 regularization). Lasso also penalizes coefficients to prevent overfitting, but it does so using the absolute value of the coefficients, not the squared value. This subtle change has a nuclear consequence: it can force coefficients all the way to zero.
This is called feature selection. Lasso doesn’t just gently shrink unimportant features; it mercilessly annihilates them. It’s the Marie Kondo of regression algorithms—if a feature doesn’t “spark joy” (i.e., significantly improve the model fit), it thanks it for its service and chucks it out the window.
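A quick way to watch the annihilation happen is to sweep alpha and count how many coefficients are exactly zero. The sketch below assumes a made-up dataset with ten features, of which only the first three actually matter; the alpha values are arbitrary picks for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Made-up data: 10 features, only the first 3 influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_coef = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_coef + rng.normal(size=100)

# OLS gives every feature some nonzero weight, however tiny.
ols_zeros = int(np.sum(LinearRegression().fit(X, y).coef_ == 0.0))
print(f"OLS:              {ols_zeros} coefficients exactly zero")

zero_counts = {}
for alpha in [0.01, 0.1, 0.5, 10.0]:
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    zero_counts[alpha] = int(np.sum(coef == 0.0))
    print(f"Lasso alpha={alpha:>5}: {zero_counts[alpha]} coefficients exactly zero")
```

Push alpha high enough and Lasso throws out everything, which is a reminder that the penalty strength has to be tuned rather than cranked up.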
from sklearn.linear_model import Lasso
lasso_model = Lasso(alpha=0.1) # Need a different alpha scale than Ridge
lasso_model.fit(X_train, y_train)
print("OLS Coefficients:", ols_model.coef_)
print("Ridge Coefficients:", ridge_model.coef_)
print("Lasso Coefficients:", lasso_model.coef_)
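Look at that Lasso output. Bet you a dollar at least one of those coefficients is exactly 0.0, and the likeliest casualty is the third feature, whose true coefficient we set to 0. This is incredibly useful for high-dimensional datasets where you suspect most features are useless noise. Lasso will usually find the few that actually matter, with one caveat: when features are highly correlated, which one of the group survives can be fairly arbitrary.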
The Practicalities: What You Actually Need to Know
Scaling is Non-Negotiable. I cannot stress this enough. Regularization methods like Ridge and Lasso penalize coefficients based on their magnitude, and a coefficient’s magnitude depends on the units of its feature. If one feature is “age” (0-100) and another is “annual income” (0-500,000), the income feature will naturally have a much smaller coefficient to avoid predicting house prices in the billions, so the penalty barely touches it. The age coefficient has to be far larger to have the same effect on the prediction, so the penalty crushes it instead, for no better reason than the units you happened to pick. Standardize your features (StandardScaler) before you use these models. Always.
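The sketch below makes the unfairness visible. It invents an “age” and an “income” feature that contribute about equally to a made-up target, then compares how much Ridge shrinks each coefficient relative to OLS, first on raw features and then on standardized ones. The data, the target formula, and both alpha values are assumptions picked so the effect is easy to see. The last lines show the usual way to bundle scaling and model together with make_pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
age = rng.uniform(0, 100, size=n)
income = rng.uniform(0, 500_000, size=n)
X = np.column_stack([age, income])
# Hypothetical target: each feature contributes roughly 0 to 200.
y = 2.0 * age + 0.0004 * income + rng.normal(0, 10, size=n)

# Raw features: compare Ridge coefficients to OLS coefficients.
ols_raw = LinearRegression().fit(X, y)
ridge_raw = Ridge(alpha=1e5).fit(X, y)
ratio_raw = ridge_raw.coef_ / ols_raw.coef_   # 1.0 means "not shrunk at all"

# Standardized features: same comparison.
X_std = StandardScaler().fit_transform(X)
ols_std = LinearRegression().fit(X_std, y)
ridge_std = Ridge(alpha=120.0).fit(X_std, y)
ratio_std = ridge_std.coef_ / ols_std.coef_

print("Ridge/OLS ratio on raw features   [age, income]:", np.round(ratio_raw, 3))
print("Ridge/OLS ratio on scaled features [age, income]:", np.round(ratio_std, 3))

# In practice, let a pipeline handle the scaling so it is never forgotten.
model = make_pipeline(StandardScaler(), Ridge(alpha=120.0))
model.fit(X, y)
print("Pipeline coefficients:", np.round(model.named_steps["ridge"].coef_, 3))
```

On the raw features, age should take nearly all of the shrinkage while income gets off almost free; after standardizing, both are shrunk by about the same proportion. Putting the scaler inside the pipeline also means it is refit on each training split during cross-validation rather than peeking at held-out data.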
Tune Alpha. The alpha parameter is hyper-critical. The default is rarely the best. You must use cross-validation (like RidgeCV or LassoCV) to find the right value. Don’t just guess.
from sklearn.linear_model import LassoCV
# Let LassoCV find the best alpha for us
lasso_cv = LassoCV(alphas=[0.001, 0.01, 0.1, 1.0, 10.0], cv=5)
lasso_cv.fit(X_train, y_train)
print(f"Best alpha chosen by cross-validation: {lasso_cv.alpha_}")
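RidgeCV works the same way. The sketch below is a self-contained counterpart to the LassoCV snippet, assuming the same kind of synthetic five-feature data as earlier and a log-spaced alpha grid chosen arbitrarily for illustration.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic data in the same spirit as the earlier examples (an assumption).
rng = np.random.default_rng(0)
X = 3 * rng.random((100, 5))
y = 2 + X @ np.array([1, -2, 0, 3, 0.5]) + rng.normal(size=100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate alphas spread over several orders of magnitude.
alphas = np.logspace(-3, 3, 13)
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(X_train, y_train)

print(f"Best alpha chosen by cross-validation: {ridge_cv.alpha_}")
print(f"R² on the held-out test set: {ridge_cv.score(X_test, y_test):.3f}")
```

A log-spaced grid is a reasonable starting point because the useful range of alpha often spans orders of magnitude; if the winner sits at either end of the grid, widen the grid and try again.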
When to Use Which?
- OLS: Your baseline. Use it on simple, clean, low-dimensional problems. It’s your control group.
- Ridge: Your go-to. Use it when you have many correlated features and you want to improve generalization. It’s reliable and robust.
- Lasso: Your feature selector. Use it when you believe only a few features are truly important and you want a simpler, more interpretable model.
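If you are unsure which one fits your problem, the honest answer is to try all three under the same cross-validation and compare. The sketch below does that on a made-up dataset with few samples, many features, one correlated pair, and only three features that matter. The data and the alpha values are assumptions (in real use you would tune alpha as described above), so treat the printed scores as an illustration of the workflow, not as a verdict.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data: 60 samples, 30 features, only the first 3 matter.
rng = np.random.default_rng(0)
n_samples, n_features = 60, 30
X = rng.normal(size=(n_samples, n_features))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=n_samples)  # one highly correlated pair
coef = np.zeros(n_features)
coef[:3] = [3.0, -2.0, 1.5]
y = X @ coef + rng.normal(size=n_samples)

# Same preprocessing for every candidate; alpha values are untuned placeholders.
candidates = {
    "OLS": make_pipeline(StandardScaler(), LinearRegression()),
    "Ridge": make_pipeline(StandardScaler(), Ridge(alpha=10.0)),
    "Lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.1)),
}

mean_r2 = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    mean_r2[name] = float(scores.mean())
    print(f"{name:>5}: mean R² = {mean_r2[name]:.3f} (std {scores.std():.3f})")
```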
They all have the same core goal: draw the best damn line. But as you’ve seen, there’s a world of difference in how they decide what “best” really means.