5.1 Simple Linear Regression: Least Squares and the Normal Equation

Alright, let’s get down to brass tacks. You want to predict something. You have one thing you want to predict (the ‘dependent variable’) and one thing you think might predict it (the ‘independent variable’). Simple Linear Regression is your go-to, no-nonsense starting point. It’s the “draw the rest of the owl” of machine learning, but we’re going to learn how to actually draw the owl.

The core idea is embarrassingly straightforward: find the single straight line that best fits your scatterplot of data. “Best” here is defined as the line that minimizes the sum of the squared differences between the actual data points and the points predicted by our line. These differences are called residuals. We square them for two brilliantly practical reasons: 1) it makes all the values positive, and 2) it heavily penalizes large errors, which is usually what we want. A line that’s mostly okay but has one catastrophically wrong prediction is worse than a line that’s consistently a little off.

The Model and the Goal

Formally, we’re modeling the relationship as: y = β₀ + β₁*x + ε

Where:

y is the dependent variable (what we’re predicting).
x is the independent variable (our predictor).
β₀ is the y-intercept. It’s the value of y when x is zero. Sometimes this makes sense, sometimes it’s a formality—we’ll get to that.
β₁ is the slope. This is the star of the show. It tells you how much y changes for a one-unit change in x.
ε is the error term (the residual). This represents all the stuff our simple model couldn’t account for (other variables, randomness, etc.).

Our goal is to find the values of β₀ and β₁—let’s call them b0 and b1—that minimize the Sum of Squared Errors (SSE): SSE = Σ(yᵢ - ŷᵢ)² = Σ(yᵢ - (b0 + b1*xᵢ))²

We could guess values all day. Or we could use calculus, which is vastly more efficient.

The Normal Equation: No Calculus Required (I Did It For You)

Thankfully, you don’t need to re-derive calculus every time you want to fit a line. The minimization of the SSE leads us to a beautiful, closed-form solution—a direct formula—known as the Normal Equation. It’s called “normal” because it uses the perpendicularity of the residual vector, but you can just think of it as the “direct answer equation.”

The formulas for our two parameters are:

b1 = (Σ(xᵢ - x̄)(yᵢ - ȳ)) / (Σ(xᵢ - x̄)²) b0 = ȳ - b1 * x̄

Where x̄ and ȳ are the means of x and y. Let’s demystify this. The slope, b1, is basically the covariance of x and y divided by the variance of x. It makes intuitive sense: the relationship’s strength (covariance) is normalized by how spread out your predictor is (variance). The intercept, b0, just ensures the regression line always passes through the point (x̄, ȳ).

Let’s code this from absolute scratch. It’s important to see the guts once.

import numpy as np

# Let's create some ridiculously simple, fake data.
# X is feature, y is target. This is the "toy dataset" you'll see in every tutorial.
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# Calculate the means
x_mean = np.mean(x)
y_mean = np.mean(y)

# Calculate the terms needed for b1 numerator and denominator
numerator = np.sum((x - x_mean) * (y - y_mean))
denominator = np.sum((x - x_mean) ** 2)

# Calculate the coefficients
b1 = numerator / denominator
b0 = y_mean - b1 * x_mean

print(f"Slope (b1): {b1:.2f}")
print(f"Intercept (b0): {b0:.2f}")
# Output: Slope (b1): 0.60, Intercept (b0): 2.20

So our best-fit line is y = 2.20 + 0.60*x. Let’s plot it to see how we did.

import matplotlib.pyplot as plt

# Create the regression line predictions
y_pred = b0 + b1 * x

# Plot the raw data
plt.scatter(x, y, color='blue', label='Actual Data')
# Plot the regression line
plt.plot(x, y_pred, color='red', label='Regression Line')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.show()

You’ll see the red line cutting through the center of the blue dots. It’s not perfect—it shouldn’t be! The data isn’t perfectly linear. But it’s the single best straight line for this job.

The Matrix Magic: Because Scalability is a Thing

The above method works, but it’s, frankly, a bit pedestrian. It also doesn’t scale to multiple regression (where you have more than one predictor). For that, we need to think in terms of matrices. The Normal Equation in its full, glorious, matrix form is:

β = (XᵀX)⁻¹Xᵀy

Whoa. Hold on. This isn’t as scary as it looks.

X is now our design matrix. It’s not just a vector of x values anymore. We add a column of 1s at the beginning to account for the intercept term β₀. It’s a neat trick that lets us handle the intercept within the matrix multiplication.
y is the vector of our target values.
β is the vector of coefficients we’re solving for [β₀, β₁].

Let’s do this in code. Notice how we add that column of ones.

# Add a column of ones to x to create the design matrix X
X = np.column_stack((np.ones(len(x)), x))  # Now X has shape (5, 2)
print("Design Matrix X:\n", X)

# Apply the Normal Equation
# np.linalg.inv() calculates the matrix inverse
# The @ operator is for matrix multiplication
beta = np.linalg.inv(X.T @ X) @ X.T @ y

print(f"Coefficients from matrix equation [b0, b1]: {beta}")
# Output: Coefficients from matrix equation [b0, b1]: [2.2  0.6]

Look at that! The same result: b0=2.2, b1=0.6. This matrix approach is how every serious library does it under the hood because it generalizes perfectly to a million features.

The Pitfalls and the “Gotchas”

This isn’t all rainbows and unicorns. The Normal Equation has some serious drawbacks you must know.

Computational Complexity: Inverting a matrix is expensive. The computation cost for inverting an n x n matrix (X.T @ X is n x n) is roughly O(n³). So if you have 1000 features (a modest number by today’s standards), you’re inverting a 1000x1000 matrix, which is already slow. For 10,000 features, it’s practically impossible. This is why for large-scale problems, we use iterative methods like Gradient Descent instead. The Normal Equation is for “smallish” problems.
Non-Invertibility: Sometimes, (X.T @ X) is a singular matrix and can’t be inverted. This usually happens for one of two reasons:
- Redundant Features (Perfect Multicollinearity): If one feature is a perfect linear combination of another (e.g., one column is twice another, or you’ve one-hot encoded poorly and introduced a “dummy variable trap”), your matrix is singular. The solution is to remove the redundant feature.
- Too Many Features: If you have more features than data points (e.g., 100 genes for 50 patients), (X.T @ X) won’t be invertible. This is a classic “p > n” problem. The solution is regularization (like Ridge Regression), which we’ll tackle later.

You can often spot this in code when np.linalg.inv() throws a LinAlgError: Singular matrix.

So, while the Normal Equation is a perfect, exact solution for small, well-behaved problems, it’s not the final word. It’s the elegant theory that introduces you to the practical, often-messy world of building models. Remember: the best-fit line is just the beginning. Next, we need to ask how good of a fit it actually is. But that’s a story for another section.