5.2 Multiple Linear Regression and Feature Matrices

Right, so you’ve mastered predicting house prices based on square footage alone. That’s cute. A fine parlor trick, but the real world is a messy, multivariate place. What about the number of bedrooms? The age of the roof? The proximity to a suspiciously aromatic chemical plant? You need a model that can handle more than one input feature. Enter Multiple Linear Regression, the workhorse algorithm that says, “Give me all your numbers, I’ll sort them out.”

The core idea is gloriously simple: we’re just adding more terms to our old linear equation. Instead of y = mx + b, we now have:

y = b + w₁x₁ + w₂x₂ + ... + wₙxₙ

Here, y is still our target (like price). b is the intercept (the value when all features are zero—often a bizarre hypothetical state). And now we have a whole squad of weights (w₁, w₂, ... wₙ), each representing the contribution of its corresponding feature (x₁, x₂, ... xₙ). Think of w₁ as “dollars per square foot” and w₂ as “dollars per bedroom.” A negative weight for the “distance to chemical plant” feature would be very, very good.

The Matrix Revolution: Why We Stop Messing Around

Writing out equations with a dozen w and x terms is for masochists and textbook writers. We use matrices. It’s not just to look cool (though it does); it’s because it allows us to leverage the terrifying efficiency of linear algebra libraries. Our equation becomes beautifully compact:

y = Xw

Hold on. Let’s unpack this. In this elegant little bomb of notation:

y is now a vector of all our target values.
w is a vector of all our weights (including the intercept, with a clever trick).
X is our feature matrix. This is the key.

The feature matrix X isn’t just a list of values; it’s a specific structure. Each row represents an instance (e.g., a single house). Each column represents a feature (e.g., sq. footage, bedrooms, age). And here’s the magic part: we almost always prepend a column of ones to this matrix. Why? Because when that column of ones is multiplied by the first weight in our vector w, it literally becomes 1 * w₀ = w₀… which is our intercept b! This little trick lets us handle the intercept within the matrix multiplication instead of as a separate entity.

So, for m houses and n features, the dimensions look like this:

X is an m x (n+1) matrix (including the “ones” column).
w is an (n+1) x 1 vector.
y is an m x 1 vector.

The solution, the way to find the optimal weights w that minimize the sum of squared errors, is given by the Normal Equation:

w = (XᵀX)⁻¹Xᵀy

Yes, it involves the transpose and the inverse. No, you will almost never compute this by hand. You let the computer do it. Let’s see it in action.

import numpy as np

# Sample data: [Square Footage, Bedrooms, Age] for 4 houses
# We'll create our feature matrix X *without* the column of ones first.
X_data = np.array([
    [2100, 3, 15],  # House 1
    [1400, 2, 45],  # House 2
    [2300, 4, 5],   # House 3
    [1600, 2, 25]   # House 4
])
# Target vector: Price in thousands of dollars ($)
y = np.array([450, 350, 550, 375])

# Here's the crucial part: prepend a column of ones to X_data for the intercept.
ones_column = np.ones((X_data.shape[0], 1))
X = np.hstack([ones_column, X_data])  # X is now our proper feature matrix

print("Feature Matrix X:\n", X)
# X is now:
# [[1. 2100.    3.   15.]
#  [1. 1400.    2.   45.]
#  [1. 2300.    4.    5.]
#  [1. 1600.    2.   25.]]

# Calculate the optimal weights using the Normal Equation
X_transpose = X.T
X_transpose_X = X_transpose.dot(X)
X_transpose_X_inv = np.linalg.inv(X_transpose_X) # This is the potentially hairy part
w = X_transpose_X_inv.dot(X_transpose).dot(y)

print(f"\nOptimal weights vector w: {w}")
# w[0] is the intercept, w[1] is for sq. ft, w[2] for bedrooms, w[3] for age.

# Let's predict the price of a new house: 2000 sq ft, 3 bedrooms, 10 years old
new_house = np.array([1, 2000, 3, 10])  # Don't forget the leading '1'!
predicted_price = new_house.dot(w)
print(f"Predicted price for new house: ${predicted_price:.2f} thousand")

The Normal Equation’s Kryptonite

See that np.linalg.inv() line? That’s the bottleneck. Inverting a matrix is computationally expensive, roughly O(n³) for n features. If you have a thousand features, that’s a billion operations. Ouch. It can also be numerically unstable (prone to rounding errors) if your features are highly correlated—a problem called multicollinearity. If two features (like “sq. footage” and “number of bathrooms”) are tightly linked, the XᵀX matrix becomes ill-conditioned and its inverse can’t be computed reliably. The model won’t know how to divide the credit between the two features, leading to weights that are nonsensical and wildly unstable.

This is why for large datasets, we use iterative optimization algorithms like Gradient Descent. It’s slower per step but doesn’t require inverting a giant matrix, making it feasible for massive problems.

The Golden Rule: You Must Standardize Your Features

Here’s a classic rookie mistake. You run the code above and get weights like w = [ 50.1, 0.2, -15.7, -0.8 ]. The weight for sq. footage (w₁=0.2) is tiny, and the weight for bedrooms (w₂=-15.7) is huge and negative! Does this mean bedrooms decrease a home’s value? Probably not. It likely means your features are on different scales.

Square footage is in the thousands, bedrooms are single digits, and age is in tens. The algorithm sees that a change of 1 unit in square footage is a trivial percentage change, but a change of 1 unit in bedrooms is a massive 50% shift. To fairly compare the importance of the weights, you must standardize your features (e.g., convert them to z-scores: subtract the mean, divide by the standard deviation). This puts everything on a common scale of “standard deviations from the mean.” Suddenly, the weights make sense and can be compared directly. Always do this.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_data_scaled = scaler.fit_transform(X_data) # Scale the original data

# Now build the feature matrix WITH the intercept column
X_scaled = np.hstack([ones_column, X_data_scaled])

# Recalculate weights using the scaled data
# ... (Normal Equation code from above) ...

print(f"\nOptimal weights vector w (with scaled features): {w}")
# Now the weights are on a comparable scale. A larger absolute weight = more important feature.

The takeaway? Multiple linear regression is powerful precisely because it’s simple. You’re just drawing the best-fitting hyperplane through your multi-dimensional data. Respect the process: use matrices, be wary of the normal equation’s limits, and for the love of all that is good, standardize your features. Now go forth and predict.