15.3 SGD with Momentum: Accelerating Gradient Descent

Right, so you’ve met Stochastic Gradient Descent (SGD). It’s the workhorse, the foundational algorithm. But let’s be honest, vanilla SGD can be a bit of a klutz. It’s like a well-intentioned but myopic explorer, taking small, precise steps straight down the slope of whatever hill it’s currently standing on. This is great in a smooth, bowl-shaped canyon, but our loss landscapes are more like badly drawn topographical maps of the Himalayas after a few beers. They are riddled with ravines—long, steep, narrow valleys with a gentle slope along the length but brutally sharp slopes on the sides.

Vanilla SGD will naively zig-zag across this ravine. It takes a step down the steep side, then immediately has to take a step down the other steep side. Its path is a series of inefficient side-to-side oscillations: most of each step goes into crossing the ravine, and very little into moving along it. It makes progress, but it’s painfully slow and computationally wasteful. It’s the algorithmic equivalent of herding cats downhill.
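To make the zig-zag tangible, here is a minimal sketch on a made-up ravine. The loss J(x, y) = 0.5 * (x^2 + 100 * y^2), the starting point, and the learning rate are all assumptions chosen for illustration, and the gradient is exact rather than stochastic to keep the picture clean.

```python
# Hypothetical ravine: J(x, y) = 0.5 * (x**2 + 100 * y**2)
# x runs along the ravine (gentle slope), y runs across it (steep walls).
def ravine_gradient(x, y):
    return 1.0 * x, 100.0 * y

learning_rate = 0.018  # assumed; anything above 2/100 = 0.02 makes y diverge
x, y = 10.0, 1.0       # assumed starting point
ys = [y]

for step in range(20):
    gx, gy = ravine_gradient(x, y)
    x -= learning_rate * gx
    y -= learning_rate * gy
    ys.append(y)

# Count how often the across-ravine coordinate changes sign between steps
sign_flips = sum(1 for a, b in zip(ys, ys[1:]) if a * b < 0)
print(f"sign flips in y: {sign_flips} out of {len(ys) - 1} steps")
print(f"x after 20 steps: {x:.3f} (started at 10.0)")
```

With these settings the steep coordinate hops from one wall to the other on every step, while x has only crept from 10 to roughly 7. The steep direction caps the learning rate, and the gentle direction pays for it.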

The Core Idea: It’s Physics, Not Magic

We fix this by giving our myopic explorer a heavy ball. This is SGD with Momentum. It’s not a new concept; it’s literally basic physics. A moving object tends to stay in motion.

Instead of just using the current gradient to determine our step, we maintain an exponentially decaying running sum of past gradients, called the velocity (v). We calculate our step by combining the current gradient with this historical velocity. The hyperparameter gamma (γ), usually set between 0.8 and 0.99, controls how much of the previous velocity survives each step, so the “friction” is really 1 - γ. A high gamma means low friction: the ball has a lot of momentum and won’t be swayed easily by new, contradictory gradients (like those pesky oscillations in our ravine).

Mathematically, it’s beautifully simple:

v_t = γ * v_{t-1} + η * ∇J(θ_t)
θ_{t+1} = θ_t - v_t

Where:

v_t is the velocity at time step t.
γ is the momentum coefficient.
η is the learning rate.
∇J(θ_t) is the gradient of the objective function with respect to the parameters.
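It can help to unroll the recursion once. Assuming v_0 = 0, the velocity is just a weighted sum of every gradient seen so far, with the newest gradient weighted by 1, the previous one by γ, the one before that by γ^2, and so on. The sketch below checks this numerically on a made-up gradient sequence (the numbers are arbitrary), and also shows what happens when the gradient stays constant.

```python
import math

gamma = 0.9   # momentum coefficient
eta = 0.1     # learning rate
grads = [1.0, -2.0, 0.5, 3.0, -1.0]  # arbitrary gradients, for illustration only

# Recursive form, exactly as in the update rule (starting from v_0 = 0)
v = 0.0
for g in grads:
    v = gamma * v + eta * g

# Unrolled form: newest gradient has weight 1, the one before it gamma, ...
v_unrolled = eta * sum(gamma**k * g for k, g in enumerate(reversed(grads)))
print(v, v_unrolled, math.isclose(v, v_unrolled))

# With a constant gradient the velocity approaches eta * g / (1 - gamma)
g_const = 1.0
v_const = 0.0
for _ in range(200):
    v_const = gamma * v_const + eta * g_const
print(v_const, eta * g_const / (1 - gamma))
```

The second part is the useful rule of thumb: along a direction where the gradient keeps pointing the same way, the step grows to about 1 / (1 - γ) times the plain SGD step, which is a factor of 10 for γ = 0.9.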

Why does this work in the ravine? The oscillations (the gradients pointing left and right across the ravine) are roughly opposite. When you average them together in the velocity term, they cancel each other out. Meanwhile, the consistent gradient along the length of the ravine (the direction we actually want to go) is reinforced with each step. The velocity builds up in that direction, leading to faster, more stable convergence. It’s like giving SGD a sense of inertia.
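Here is a rough sketch of that argument in action. The ravine J(x, y) = 0.5 * (x^2 + 100 * y^2), the starting point, and the hyperparameters are assumptions picked for illustration, and the gradients are exact rather than stochastic. Setting gamma to 0.0 turns the same loop into vanilla gradient descent, so both runs share one code path.

```python
import math

# Hypothetical ravine: gentle along x, steep across y
def ravine_gradient(x, y):
    return 1.0 * x, 100.0 * y

def run(gamma, learning_rate=0.018, steps=200, start=(10.0, 1.0)):
    """Run the momentum update; gamma = 0.0 reduces to vanilla gradient descent."""
    x, y = start
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        gx, gy = ravine_gradient(x, y)
        vx = gamma * vx + learning_rate * gx
        vy = gamma * vy + learning_rate * gy
        x, y = x - vx, y - vy
    return math.hypot(x, y)  # distance from the minimum at (0, 0)

dist_vanilla = run(gamma=0.0)
dist_momentum = run(gamma=0.9)
print(f"distance to minimum after 200 steps, vanilla : {dist_vanilla:.6f}")
print(f"distance to minimum after 200 steps, momentum: {dist_momentum:.6f}")
```

Under these assumptions the vanilla run is still a few tenths away from the minimum after 200 steps, while the momentum run is orders of magnitude closer. Same learning rate, same number of gradient evaluations; the only change is the velocity term.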

Implementing It From Scratch

Let’s make this concrete. Here’s how you’d implement SGD with Momentum in pure Python, just so we see the gears turning. No magic, just code.

# Let's define a simple quadratic loss function: J(theta) = theta^2
# Its gradient is: dJ/dtheta = 2 * theta
def loss_function(theta):
    return theta**2

def gradient(theta):
    return 2 * theta

# Hyperparameters
theta = 10.0  # Initial parameter value (we start at 10)
learning_rate = 0.1
momentum_gamma = 0.9
velocity = 0.0  # Initialize velocity
num_epochs = 50

# Store the path for plotting
path = [theta]

for epoch in range(num_epochs):
    # 1. Compute the gradient at current parameters
    grad = gradient(theta)
    
    # 2. MOST IMPORTANT STEP: Update the velocity
    velocity = momentum_gamma * velocity + learning_rate * grad
    
    # 3. Update the parameters using the velocity, not the raw gradient
    theta = theta - velocity
    
    path.append(theta)
    print(f"Epoch {epoch}: theta = {theta:.4f}, velocity = {velocity:.4f}")

print(f"\nFinal theta: {theta}")

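Run this. Watch how the velocity builds up: theta drops from 10 to about 0.6 in three steps, where vanilla SGD with the same learning rate would still be above 5. Then the momentum carries it past the minimum, and theta overshoots to negative values and oscillates before settling. That overshoot is classic momentum behavior. On a well-conditioned one-dimensional bowl like this one it actually costs you: after 50 steps vanilla SGD ends up closer to zero. Momentum earns its keep on ill-conditioned problems like the ravine, not here.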
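To see how the loop above stacks up against plain SGD on this same toy problem, here is a small side-by-side sketch. The `run` helper is hypothetical (it is not part of the listing above); it reuses the same update rule, and gamma = 0.0 gives the vanilla baseline.

```python
def gradient(theta):
    return 2 * theta  # gradient of J(theta) = theta^2

def run(gamma, learning_rate=0.1, steps=50, theta=10.0):
    """Momentum update from the text; gamma = 0.0 gives vanilla gradient descent."""
    velocity = 0.0
    path = [theta]
    for _ in range(steps):
        velocity = gamma * velocity + learning_rate * gradient(theta)
        theta = theta - velocity
        path.append(theta)
    return path

vanilla_path = run(gamma=0.0)
momentum_path = run(gamma=0.9)

print("first 5 (vanilla) :", [round(t, 3) for t in vanilla_path[:5]])
print("first 5 (momentum):", [round(t, 3) for t in momentum_path[:5]])
print("most negative theta with momentum:", round(min(momentum_path), 3))
print("final |theta| vanilla :", abs(vanilla_path[-1]))
print("final |theta| momentum:", abs(momentum_path[-1]))
```

The momentum run gets near zero sooner, then sails past it and rings for a while, whereas the vanilla run never changes sign and shrinks by a steady factor of 0.8 per step. On a single well-scaled parameter there is no ravine to fix, so the heavy ball mostly just overshoots.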

The Real-World Way: Using a Library

You are not, of course, going to write your own optimizer from scratch for a real project. Here’s how you use the battle-tested version in PyTorch and TensorFlow/Keras. Notice how the momentum argument plays the role of our gamma (γ).

# PyTorch
import torch.optim as optim

model = ... # your PyTorch model
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Inside your training loop:
optimizer.zero_grad()
loss = ... # calculate loss
loss.backward()
optimizer.step()

# TensorFlow / Keras
from tensorflow.keras.optimizers import SGD

model = ... # your Keras model
optimizer = SGD(learning_rate=0.01, momentum=0.9)

model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy')
model.fit(x_train, y_train, epochs=10)
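One detail worth knowing when you move from the hand-written loop to a library: PyTorch's SGD documentation writes the update slightly differently from this section. There, the velocity accumulates raw gradients and the learning rate is applied at the parameter step (with dampening left at its default of 0). The sketch below is plain Python, not library code, and simply checks that the two forms trace the same path when the learning rate is constant.

```python
import math

def gradient(theta):
    return 2 * theta  # gradient of J(theta) = theta^2

lr, gamma, steps = 0.1, 0.9, 50

# Form used in this section: learning rate folded into the velocity
theta_a, v = 10.0, 0.0
# Form described in PyTorch's SGD docs (dampening = 0): the buffer accumulates
# raw gradients and the learning rate is applied at the parameter update
theta_b, buf = 10.0, 0.0

for _ in range(steps):
    v = gamma * v + lr * gradient(theta_a)
    theta_a = theta_a - v

    buf = gamma * buf + gradient(theta_b)
    theta_b = theta_b - lr * buf

print(theta_a, theta_b)
print(math.isclose(theta_a, theta_b, rel_tol=1e-9, abs_tol=1e-12))
```

The two forms stop being interchangeable once the learning rate changes during training, for example under a schedule, because the first form bakes old learning rates into the velocity and the second does not. If you port hyperparameters between frameworks, that is the place to look first.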

Pitfalls and Best Practices

This isn’t a silver bullet. That heavy ball can be a liability if you’re not careful.

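Overshooting: The most common issue. With a steady gradient the velocity settles at about η / (1 - γ) times that gradient, so with γ = 0.9 the effective step is roughly ten times what the learning rate alone suggests. If your learning rate is too high, the momentum will carry the parameters right past a good minimum and potentially into a worse part of the loss landscape. If you see your loss suddenly skyrocket after a period of steady decline, you’ve likely overshot. The fix? Turn down the learning rate (or, failing that, gamma).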
The Initial Wobble: At the very start of training, the velocity is zero. The first update is just η * ∇J, a standard SGD step. The momentum hasn’t built up yet, so the first few steps are no faster than plain SGD. Once it has built up, the opposite problem appears: the ball tends to overshoot and oscillate. A refinement called “Nesterov Accelerated Gradient” (NAG) addresses this by evaluating the gradient at the look-ahead point θ_t - γ * v_{t-1}, where the momentum alone would carry the parameters, instead of at θ_t. That early correction often leads to less wobble.
It Doesn’t Solve Everything: Momentum helps with ravines, but it doesn’t automatically adapt to the scale of different parameters (as Adam does). It’s still a fundamentally simple method, and that’s its strength. It’s often the optimizer of choice for well-tuned computer vision models, and in other settings where a single, carefully tuned learning rate suits all parameters.
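The Nesterov variant mentioned above is a one-line change to the from-scratch loop. The sketch below uses the look-ahead form (gradient evaluated at theta - gamma * velocity) on the same toy loss J(theta) = theta^2; the `run` helper and its `nesterov` flag are hypothetical names for this illustration, and library implementations may use an algebraically rearranged version of the same idea.

```python
def gradient(theta):
    return 2 * theta  # gradient of J(theta) = theta^2

def run(nesterov, learning_rate=0.1, gamma=0.9, steps=50, theta=10.0):
    velocity = 0.0
    path = [theta]
    for _ in range(steps):
        # Nesterov: evaluate the gradient where momentum alone would take us
        point = theta - gamma * velocity if nesterov else theta
        velocity = gamma * velocity + learning_rate * gradient(point)
        theta = theta - velocity
        path.append(theta)
    return path

classical = run(nesterov=False)
nag = run(nesterov=True)
print("largest overshoot, classical:", round(min(classical), 3))
print("largest overshoot, Nesterov :", round(min(nag), 3))
```

On this toy problem both runs still swing past zero, but the Nesterov run's deepest excursion is noticeably shallower. In the libraries shown earlier you would not write this yourself: both `optim.SGD` in PyTorch and `SGD` in Keras accept `nesterov=True` alongside `momentum`.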

The takeaway? Try vanilla SGD first to get a baseline. The moment you see the training loss crawling along or bouncing up and down from step to step, the signature of zig-zagging across a ravine, smile, and turn on momentum. It’s one of the most reliable upgrades you can make.