15.6 L1 and L2 Regularization in Neural Networks

Right, so you’ve built this beautiful, complex neural network. It’s learning, it’s fitting your training data like a glove… and it’s completely, utterly useless on anything else. It’s memorized the answers to the practice test but hasn’t learned a single underlying concept. This, my friend, is the dreaded overfitting. Your model has become a high-variance, low-bias monstrosity. We need to give it a little… discipline. That’s where L1 and L2 regularization come in. Think of them as the parental controls for your weights.

Their core job is brutally simple: punish the model for getting too big for its boots. Or, more accurately, for having weights that get too big. The fundamental assumption here is that a model with smaller weights is a simpler model, and a simpler model is less likely to overfit to the noise in our training data. It’s a bet on Occam’s Razor: the simplest solution is probably the right one.

Both L1 and L2 do this by adding a special “penalty term” to our loss function. So instead of just calculating how wrong our predictions are (e.g., using cross-entropy or mean squared error), we now have:

Total Loss = Original Loss + Regularization Penalty

This means the optimizer now has a dual mission: minimize the error and keep the weights small. It’s a constant tug-of-war between fitting the data and staying generalizable.
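To make that sum concrete, here is a minimal NumPy sketch. The function name `total_loss`, the toy data, and the choice of mean squared error with a (lambda/2) squared-weight penalty are illustrative assumptions, not a prescribed implementation.

```python
import numpy as np

def total_loss(w, X, y, lam):
    """Mean squared error plus a squared-weight penalty scaled by lam / 2."""
    data_loss = np.mean((X @ w - y) ** 2)    # how wrong the predictions are
    penalty = (lam / 2) * np.sum(w ** 2)     # how large the weights are
    return data_loss + penalty

# Toy data (hypothetical): 3 samples, 2 features
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([0.0, 0.5])   # fits this toy data exactly: X @ w equals y

loss_plain = total_loss(w, X, y, lam=0.0)   # data loss only
loss_reg = total_loss(w, X, y, lam=0.1)     # same fit, now charged for weight size
print(loss_plain, loss_reg)
```

Even with a perfect fit, the regularized loss is above zero, so the optimizer always has a reason to look for smaller weights.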

The Math: L2 (Ridge) Regularization - The Gentle Nudge

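L2 regularization, also called Tikhonov regularization and often loosely called weight decay (the two coincide for plain gradient descent, though not for every optimizer), adds the squared magnitude of the weights as the penalty term. For a network, this means we sum the squares of all the weights.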

The penalty term is (lambda/2) * ||w||²₂, where ||w||²₂ is the squared L2 norm: simply the sum of the squared weights. (The L2 norm itself is the square root of that sum; squaring it removes the root.) That lambda (or sometimes alpha) is our regularization strength hyperparameter. This is your dial. Crank it up, and your weights will be punished more severely for growing large.

Why squares? Because it’s disproportionately harsh on large weights. A weight of 2 contributes a penalty of 4, but a weight of 4 contributes a penalty of 16. This makes the optimizer really want to avoid letting any single weight become a dominant superstar, encouraging it to spread the “responsibility” across many weights. It leads to diffuse, small weights. The “2” in the denominator is just a common convention to make the derivative nice and clean (d/dw[(lambda/2) * w²] = lambda * w).
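The derivative above can be dropped straight into a gradient-descent update. The following is a minimal sketch, assuming plain gradient descent; the function name `sgd_step_l2` and the example values are made up for illustration.

```python
import numpy as np

def sgd_step_l2(w, grad_loss, lr, lam):
    """One gradient-descent step on loss + (lam / 2) * ||w||^2."""
    # The penalty's gradient is lam * w, so it is added to the data gradient.
    return w - lr * (grad_loss + lam * w)

w = np.array([4.0, 2.0, 0.5])
zero_grad = np.zeros_like(w)   # pretend the data loss is flat, to isolate the penalty

w_next = sgd_step_l2(w, zero_grad, lr=0.1, lam=0.5)
shrinkage = w - w_next

print(w_next)      # every weight is multiplied by (1 - lr * lam)
print(shrinkage)   # the largest weight loses the most in absolute terms
```

With the data gradient set to zero, each step multiplies every weight by the same factor, which is why L2 shrinks big weights a lot and small weights only a little.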

The Math: L1 (Lasso) Regularization - The Brutal Chopper

L1 regularization takes a different, more extreme approach. It adds the absolute value of the weights as the penalty term.

The penalty term is lambda * ||w||₁, where ||w||₁ is the L1 norm (the sum of absolute values).

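Here’s the magic and the madness of L1: the size of its gradient is constant. The gradient of |w| is +1 for positive weights and -1 for negative ones (technically it’s undefined at 0, but in practice we use a subgradient of 0 there). This means that during gradient descent, L1 regularization doesn’t shrink each weight in proportion to its size like L2 does; it pushes every weight toward zero by the same fixed amount (learning rate times lambda) every step. Small weights therefore reach zero quickly, whereas under L2 they just shrink more and more slowly. With plain gradient steps a weight tends to hover around zero rather than land on it; solvers that clip at zero (coordinate descent, as in scikit-learn’s Lasso, or proximal methods) are what drive many weights exactly to zero.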
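The difference between a plain subgradient step and a step that clips at zero is easy to see in a few lines. This is a sketch only: the soft-thresholding (proximal) update is a standard technique brought in here for illustration, and both function names are made up.

```python
import numpy as np

def l1_subgradient_step(w, lr, lam):
    """Plain subgradient step on lam * ||w||_1 (np.sign(0) is 0, covering w == 0)."""
    return w - lr * lam * np.sign(w)

def soft_threshold_step(w, lr, lam):
    """Proximal step: move each weight toward zero by lr * lam, clipping at zero."""
    return np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)

w = np.array([1.0, -0.3, 0.04])
lr, lam = 0.1, 1.0   # each step moves a weight by lr * lam = 0.1

w_sub = l1_subgradient_step(w, lr, lam)
w_prox = soft_threshold_step(w, lr, lam)

print(w_sub)    # the smallest weight overshoots past zero and flips sign
print(w_prox)   # the smallest weight lands exactly on zero
```

Both updates move the large weights by the same fixed amount; they differ only in how they treat a weight that is already closer to zero than one step.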

Why is this useful? It performs feature selection. It effectively says, “I don’t need all these neurons and connections. Most of you are useless. Begone!” It creates sparse models, which can be fantastic for interpretability and efficiency, especially if you suspect most of your input features are noise.

A Side-by-Side Code Example

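Let’s see this in action with a polynomial regression on some synthetic data. We’ll use scikit-learn so you can see the effect clearly. Two things to know: scikit-learn calls lambda alpha, and its Ridge and Lasso scale their loss terms differently (Lasso divides the squared error by 2 * n_samples, Ridge does not), so the two alpha values below are not directly comparable.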

import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt

# Generate some data that's mostly linear with a bit of noise
np.random.seed(42)
X = np.linspace(0, 10, 20)
y = 2 * X + 1 + np.random.normal(0, 2, len(X)) # True function: y = 2x + 1 + noise

# Now let's create a nightmare scenario: we'll fit a ridiculously high-degree polynomial
# This is a perfect recipe for overfitting.
X_plot = np.linspace(0, 10, 100) # Smooth line for plotting
degree = 15

def poly_model(regressor):
    # x**15 reaches 1e15 on this data, so standardize the polynomial features
    # before any penalty is applied to their weights
    return make_pipeline(PolynomialFeatures(degree, include_bias=False),
                         StandardScaler(), regressor)

# Models to compare (scikit-learn's alpha plays the role of λ)
models = {
    'No Regularization': poly_model(LinearRegression()),
    'L2 (Ridge, λ=0.1)': poly_model(Ridge(alpha=0.1)),
    # Lasso needs more iterations; raise max_iter further if it warns about convergence
    'L1 (Lasso, λ=0.01)': poly_model(Lasso(alpha=0.01, max_iter=100000))
}

# Fit and plot
plt.figure(figsize=(12, 8))
plt.scatter(X, y, color='black', label='Data points')

for name, model in models.items():
    model.fit(X.reshape(-1, 1), y)
    y_plot = model.predict(X_plot.reshape(-1, 1))
    plt.plot(X_plot, y_plot, linewidth=2, label=name)

plt.legend()
plt.ylim(-5, 30)
plt.title("The Dramatic Effects of Regularization")
plt.show()

Run this. The unregularized model will be a wild, squiggly mess that tries to hit every single data point. The Ridge (L2) model should be a much smoother, more reasonable curve. The Lasso (L1) model will likely be an even simpler, almost linear fit, because Lasso typically zeroes out most of the unnecessary high-degree terms. The exact curves depend on the random seed and the alpha values, so treat the plot as an illustration rather than a benchmark.
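The claim that Lasso zeroes out coefficients while Ridge only shrinks them can be checked by counting. The sketch below uses a separate, made-up dataset (20 features, of which only the first two influence the target); the data and the alpha values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))   # 20 features, only the first 2 matter
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

# Count coefficients that are exactly zero in each fitted model
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print("Lasso zeroed coefficients:", n_zero_lasso)
print("Ridge zeroed coefficients:", n_zero_ridge)
```

Inspecting `lasso.coef_` afterwards shows which features survived, which is the feature-selection behavior described earlier.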

Best Practices and Pitfalls

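Normalize Your Data: This is non-negotiable. Regularization penalizes weights based on their magnitude, and the magnitude a weight needs depends on its feature’s units. If your features are on different scales (e.g., age: 0-100, income: 0-500,000), the penalty lands unevenly: income needs only a tiny weight and barely feels the penalty, while age needs a larger one and gets punished harder for carrying the same amount of signal. StandardScaler is your best friend here.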
Tune Lambda (α): This is the single most important hyperparameter here. Use a technique like grid search or random search with cross-validation to find the right value. Too low, and you’re not regularizing. Too high, and you’ll crush all your weights to zero and underfit, creating a high-bias model that can’t learn anything.
L1 vs L2 Choice: Use L2 as your default. It’s stable, well-understood, and works great for most cases. Turn to L1 if you have a massive number of features and you believe only a few are actually important, and you need the sparsity.
Bias Term: Typically, we do not regularize the bias term. The bias is just an offset; shrinking it drags the model’s baseline output toward zero without making the model any simpler, so it doesn’t help with overfitting. In Keras, kernel_regularizer touches only the weights, and the bias stays unpenalized unless you set bias_regularizer yourself.
Lasso Can Be Stubborn: L1 regularization can be slow to converge because the penalty is not smooth at zero. If you’re using scikit-learn’s Lasso and it warns about convergence, you might need to crank up the max_iter parameter significantly.
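The first two practices above, scaling and tuning, fit naturally into one scikit-learn pipeline. This is a minimal sketch under stated assumptions: the age-like and income-like features, the target, and the alpha grid are all invented for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features on very different scales
age = rng.uniform(0, 100, size=200)
income = rng.uniform(0, 500_000, size=200)
X = np.column_stack([age, income])
y = 0.5 * age + 1e-4 * income + rng.normal(scale=5.0, size=200)

# Scaling lives inside the pipeline, so each CV fold is scaled
# using statistics from its own training split only.
pipeline = make_pipeline(StandardScaler(), Ridge())

# make_pipeline names each step after its lowercased class, hence "ridge__alpha"
param_grid = {"ridge__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

best_alpha = search.best_params_["ridge__alpha"]
print("Best alpha:", best_alpha)
```

Swapping `Ridge()` for `Lasso(max_iter=10000)` and renaming the grid key to `lasso__alpha` would tune L1 in the same way.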

In frameworks like Keras, adding it is dead simple. You just add a kernel_regularizer to your layer:

from tensorflow.keras import models, layers
from tensorflow.keras.regularizers import l2, l1_l2

model = models.Sequential([
    layers.Dense(64, activation='relu', input_shape=(100,),
                 kernel_regularizer=l2(0.001)),  # L2: adds 0.001 * sum(w**2) to the loss (no 1/2 factor)
    layers.Dense(64, activation='relu',
                 kernel_regularizer=l1_l2(l1=0.001, l2=0.001)),  # You can even combine them!
    layers.Dense(1)
])

So there you have it. Regularization isn’t a magic bullet, but it’s one of the most powerful and essential tools in your kit to fight overfitting. It forces your model to be less of a know-it-all and more of a thoughtful generalist. And in the real world, generalists usually win.