Right, let’s talk about the fairy tale we tell ourselves when we fit a linear model. We imagine a perfect, orderly world where our data behaves itself. This is that world: the assumptions of linear regression. They’re not just pedantic statistics homework; they’re the promise you’re making about your data so that the neat little model.summary() printout actually means something. When these break, your model doesn’t just get a little worse—it becomes a confident liar, handing you coefficients that are biased and predictions that are nonsense. Let’s pop the hood and see what we’re actually assuming.

The Holy Trinity: Linearity, Independence, and Homoscedasticity

These three are the non-negotiables. Get these wrong, and your model’s foundation is built on sand.

First, linearity. This one seems obvious, but it’s the most frequently violated. We’re assuming the relationship between your predictors (X) and the outcome (y) is, you know, linear. It’s a straight line. If the true relationship is a curve, fitting a line is like using a shoe to hammer a nail—it might kinda work, but it’s the wrong tool and you’ll damage the wall. Always, always plot your residuals (the differences between your predictions and the actual values). If you see a pattern (a curve, a funnel), you’ve got a problem.

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Let's create a classic nonlinear relationship
np.random.seed(42)
x = np.linspace(1, 10, 100)
y_true = np.log(x)  # The true relationship is logarithmic!
y_observed = y_true + np.random.normal(0, 0.2, len(x)) # Add some noise

# The crime: fitting a linear model to a nonlinear process
X = sm.add_constant(x)  # Adds the intercept term
model = sm.OLS(y_observed, X).fit()
predictions = model.predict(X)

# The verdict: plot residuals vs. the predictor
residuals = y_observed - predictions
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(x, residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('X')
plt.ylabel('Residuals')
plt.title('Clear Pattern = Linearity Assumption Violated')

# Now for Homoscedasticity: constant error variance
# Let's make the noise get worse as X increases (a common real-world issue)
y_heteroscedastic = y_true + np.random.normal(0, 0.2 * x, len(x))
model_het = sm.OLS(y_heteroscedastic, X).fit()
residuals_het = y_heteroscedastic - model_het.predict(X)

plt.subplot(1, 2, 2)
plt.scatter(x, residuals_het)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('X')
plt.title('Funnel Shape = Heteroscedasticity')
plt.tight_layout()
plt.show()

# A formal test for heteroscedasticity (Breusch-Pagan)
lm_test, p_value, f_stat, f_p_value = het_breuschpagan(residuals_het, X)
print(f"Breusch-Pagan p-value: {p_value:.4f}") # A low p-value (<0.05) indicates heteroscedasticity

Independence is simpler: your data points shouldn’t be whispering to each other. This is crucial for time series data (where today’s stock price depends on yesterday’s) or repeated measurements on the same subject. If they’re not independent, your standard errors shrink to meaninglessly small sizes, making you think your results are far more certain than they are. There’s no easy plot for this; you need to understand how your data was collected.

Homoscedasticity is a fancy word for “constant variance.” It means the amount of noise in your data should be the same across all levels of your predictors. In the plot above, the funnel shape is a dead giveaway. When this breaks, your OLS estimates are still unbiased, but the standard errors are wrong, so your confidence intervals and p-values are junk. The fix? Use robust standard errors (cov_type='HC3' in statsmodels) or switch to a model like Generalized Least Squares. It’s an easy fix once you know to look for it.

The Often-Ignored Fourth: Normality of Errors

This is the most misunderstood assumption. We assume the errors (the residuals) are normally distributed. We do not assume your data is normally distributed. This matters mainly for generating those neat confidence intervals and hypothesis tests. For large sample sizes, the Central Limit Theorem often bails you out. But for small samples, non-normal errors will screw up your p-values.

Why did they design it this way? Because the math for proving optimality (BLUE estimators) works beautifully under normality. It’s a mathematical convenience that happens to work well in practice. Check it with a Q-Q plot.

from scipy import stats

# Get residuals from our first bad linear model
residuals = y_observed - predictions

# Create a Q-Q plot
plt.figure(figsize=(6, 6))
stats.probplot(residuals, dist="norm", plot=plt)
plt.title('Q-Q Plot: How Normal Are Your Errors?')
plt.show()

# If the points largely follow the red line, you're fine. If they deviate, especially at the tails, you've got non-normality.

What to Do When the World Isn’t Perfect

So your diagnostic plots look like a toddler’s abstract art. Now what?

  1. For Non-Linearity: Transform your predictors (log(x), x**2, np.sqrt(x)) or use polynomial regression. For complex curves, move to splines or tree-based models.
  2. For Heteroscedasticity: Use robust standard errors. It’s a one-line fix in most modern libraries and should be your default for real-world data. model = sm.OLS(y, X).fit(cov_type='HC3').
  3. For Non-Normal Errors: Again, large samples are forgiving. For small samples, a transformation of the target variable (like log(y)) can sometimes help. If the model’s purpose is prediction, you might not even care.
  4. For Non-Independence: This is a big one. You need a different model class entirely, like time series models (ARIMA) or mixed-effects models that account for the grouping structure.

The goal isn’t to pass a statistical test perfectly. It’s to understand the ways your data is misbehaving so you can either fix the model or, more importantly, honestly interpret what it’s actually telling you. A model that violates assumptions is like a GPS giving directions while upside down—it might occasionally be right, but you’d be a fool to trust it.