5.3 Gradient Descent: Batch, Stochastic, and Mini-Batch

Right, let’s get down to brass tacks. You’ve got your cost function, that mathematical measure of how spectacularly wrong your model’s predictions are. You need to minimize it. You could, I suppose, try to solve for the exact analytical solution by setting the derivative to zero. For linear regression, that’s the normal equation: θ = (XᵀX)⁻¹Xᵀy. It looks elegant, doesn’t it? And it is. Until your dataset has more than a few thousand features or instances. Then that (XᵀX)⁻¹ term becomes a computational nightmare—an O(n³) operation that will have your computer weeping softly in the corner.

So, we imitate nature instead. Imagine you’re at the top of a fog-covered mountain and need to get to the bottom. You can’t see the entire path, but you can feel the ground beneath your feet. The steepest slope right where you’re standing is your gradient. You take a step in that downhill direction. Then you feel again. And again. This is gradient descent. It’s an iterative, often painfully slow, but remarkably effective process of finding the minimum of a function by walking repeatedly in the direction of steepest descent.

The update rule is the heart of it. For a parameter θj, we update it as: θj := θj - α * (∂J(θ)/∂θj) Where α is the learning rate—your step size. Set it too small, and your descent will be agonizingly slow; too large, and you’ll overshoot the minimum, potentially bouncing around it or even diverging spectacularly. Getting this right is more art than science.

The Three Flavors: Batch, Stochastic, and Mini-Batch

The crucial difference between these methods boils down to one question: how much data do you use to calculate that gradient before you take a single step?

Batch Gradient Descent is the conscientious, thorough one. It calculates the gradient using the entire training dataset. Every. Single. Time.

import numpy as np

# X is the feature matrix (with a column of 1s for the intercept), y is the target
def batch_gradient_descent(X, y, learning_rate=0.01, n_iters=1000):
    m, n = X.shape
    theta = np.zeros(n)  # Initialize parameters
    cost_history = []

    for _ in range(n_iters):
        # The key part: Calculate gradient using ALL examples
        gradient = (1/m) * X.T.dot(X.dot(theta) - y)
        theta -= learning_rate * gradient
        cost = np.mean((X.dot(theta) - y)**2) / 2
        cost_history.append(cost)

    return theta, cost_history

This gives you a stable, smooth path directly toward the minimum. It’s also computationally suicidal for large datasets. Each step requires a pass over the entire dataset, which is just painfully slow.

Stochastic Gradient Descent (SGD) is its chaotic, impulsive sibling. It calculates the gradient using just one randomly chosen training example per step.

def stochastic_gradient_descent(X, y, learning_rate=0.01, n_epochs=50):
    m, n = X.shape
    theta = np.zeros(n)
    cost_history = []

    for epoch in range(n_epochs):
        # Shuffle your data each epoch to ensure randomness
        shuffled_indices = np.random.permutation(m)
        X_shuffled = X[shuffled_indices]
        y_shuffled = y[shuffled_indices]

        for i in range(m):
            # The key part: Use ONE random example
            xi = X_shuffled[i:i+1]  # Keep as 2D array
            yi = y_shuffled[i:i+1]
            gradient = xi.T.dot(xi.dot(theta) - yi)  # No 1/m term for a single example
            theta -= learning_rate * gradient

        # Calculate cost on entire dataset for tracking (this is optional and expensive)
        cost = np.mean((X.dot(theta) - y)**2) / 2
        cost_history.append(cost)

    return theta, cost_history

SGD is fast. Blazingly fast per step. It can jump all over the place, often bouncing around the minimum rather than settling neatly into it. This noise can actually help it escape shallow local minima, which is a nice bonus. The downside? The path is so noisy that the final parameters are never the true minimum, just a very good approximation. You’ll need to use a learning rate schedule (slowly reducing the learning rate over time) to get it to finally converge instead of just buzzing around the target.

Mini-Batch Gradient Descent is the Goldilocks compromise, and it’s what everyone actually uses in practice. It splits the training data into small, random batches (e.g., 32, 64, 128 examples) and computes the gradient for each batch.

def mini_batch_gradient_descent(X, y, learning_rate=0.01, n_epochs=50, batch_size=32):
    m, n = X.shape
    theta = np.zeros(n)
    cost_history = []

    for epoch in range(n_epochs):
        shuffled_indices = np.random.permutation(m)
        X_shuffled = X[shuffled_indices]
        y_shuffled = y[shuffled_indices]

        for i in range(0, m, batch_size):
            # The key part: Use a small BATCH of examples
            X_batch = X_shuffled[i:i+batch_size]
            y_batch = y_shuffled[i:i+batch_size]

            gradient = (1 / batch_size) * X_batch.T.dot(X_batch.dot(theta) - y_batch)
            theta -= learning_rate * gradient

        cost = np.mean((X.dot(theta) - y)**2) / 2
        cost_history.append(cost)

    return theta, cost_history

This gets you the best of both worlds: a significant computational speedup over Batch GD (via vectorized operations on the batch) and much less noisy convergence than SGD. The batch size becomes a new hyperparameter to tune. Smaller batches = more noise and regularization. Larger batches = more stable convergence but closer to the slowness of Batch GD.

Best Practices and Pitfalls

Feature Scaling is Non-Negotiable: If your features are on different scales (e.g., age (0-100) and income (0-500,000)), the gradient will be dominated by the larger-scale feature. Your descent path will be inefficient and janky. Standardize (subtract mean, divide by standard deviation) or Normalize (scale to [0,1]) your features first. Always.
Monitor Your Loss: Plot the cost function value over iterations. Batch GD should show a smooth, declining curve. SGD will look like a heart rate monitor during a panic attack. Mini-batch will be somewhere in between. If the cost is increasing, your learning rate is too high. Dial it down.
The Magic of Shuffling: For SGD and Mini-batch, you must shuffle your data after each epoch. If you don’t, and your data is ordered (e.g., all examples of class 1 first, then class 2), the algorithm will fixate on one pattern and then completely forget it to learn the next, leading to horrible performance. It’s like trying to learn French and Mandarin by studying each language for a week straight instead of alternating days.
Convergence is a Suggestion: Unlike Batch GD, which will converge to the exact minimum, SGD and Mini-batch will typically hover around it. This is usually fine—the slight inaccuracy is a cheap price to pay for the massive speed gain, and the noise can help prevent overfitting. To actually stop the algorithm, you can’t just wait for the gradient to be zero. You’ll need to set a maximum number of iterations or stop when the cost improvement between epochs falls below a tiny threshold.