15.5 Learning Rate Schedules: Warmup, Cosine Decay, One-Cycle

Right, let’s talk about learning rates. You’ve probably already been told it’s the single most important hyperparameter. That’s mostly true, but it’s also a massive oversimplification. Picking one static number and hoping for the best is like trying to drive across the country by flooring the accelerator until you’re “probably close” and then slamming on the brakes. It’s inefficient, you’ll overshoot your destination, and you’ll probably break something expensive.

A fixed learning rate is a first-date strategy: you show up with one level of energy and hope it’s appropriate for the entire, often awkward, evening. The real world of training a neural network is messier. You need to start carefully, gain momentum, and then slow down to finesse your way into a good local minimum. That’s what learning rate schedules are for. They dynamically adjust your learning rate during training, and if you’re not using one, you’re leaving performance on the table. It’s that simple.

Why a Warmup is Non-Negotiable for Modern Optimizers

You’re using Adam or AdamW, right? Of course you are. Everyone is. Well, here’s the thing the papers don’t always scream from the rooftops: these adaptive optimizers are terrible at the very beginning of training.

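Why? They keep running estimates of the gradient’s mean and of its squared magnitude (the first and second moments, in math-speak). Both start at zero, and for the first stretch of training they’re built from only a handful of noisy mini-batch gradients. Bias correction fixes the zero start, but it can’t fix the noise: on step one the update works out to roughly lr * gradient / (|gradient| + epsilon), so every parameter moves by about lr no matter how small or unreliable its gradient is. See the problem? The optimizer is taking full-size steps based on statistics it hasn’t earned yet, which can knock those carefully initialized weights off course immediately.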
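It helps to see what the first Adam update actually looks like. Here’s a minimal sketch of textbook Adam (with the standard bias correction) applied to a single scalar parameter. The function name adam_update_sizes and the toy gradient values are made up for illustration; this is the update rule written out by hand, not PyTorch’s implementation.

```python
import math

def adam_update_sizes(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of each update for one scalar parameter under textbook Adam."""
    m, v = 0.0, 0.0  # first and second moment estimates both start at zero
    sizes = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction undoes the zero start
        v_hat = v / (1 - beta2 ** t)
        sizes.append(abs(lr * m_hat / (math.sqrt(v_hat) + eps)))
    return sizes

# Hypothetical first gradients that differ by six orders of magnitude.
tiny_first_step = adam_update_sizes([1e-4])[0]
huge_first_step = adam_update_sizes([1e2])[0]
print(tiny_first_step, huge_first_step)
```

Both calls give a first step of about lr, even though the gradients differ by six orders of magnitude: with bias correction, the first update is essentially lr * sign(gradient). A noisy first gradient gets exactly as much say as a reliable one, and a small learning rate early on keeps that step small while the statistics settle.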

The warmup phase is a brilliantly simple hack to solve this. We start with a ludicrously small learning rate (like max_lr * 1e-3, or effectively zero) and linearly ramp it up to our chosen maximum learning rate over a few thousand steps. This gives the optimizer’s internal statistics time to stabilize before we trust them to drive at full speed. Skipping warmup is like revving a cold engine to the redline. You might get away with it, but it’s deeply inadvisable and the gods of gradient descent will frown upon you.

import torch
from torch.optim import AdamW
import math

def get_lr_with_warmup(step, warmup_steps, max_lr):
    """Linear warmup to max_lr, then hold at max_lr."""
    if step < warmup_steps:
        return max_lr * (step / warmup_steps)
    return max_lr

# Example usage in a training loop
model = torch.nn.Linear(10, 1)  # placeholder model so the snippet runs; swap in your own
max_lr = 1e-3
optimizer = AdamW(model.parameters(), lr=max_lr)  # this 'lr' is overwritten every step below
warmup_steps = 2000
total_steps = 20000

for step in range(1, total_steps + 1):
    # Set the scheduled LR on every parameter group before taking the step
    current_lr = get_lr_with_warmup(step, warmup_steps, max_lr)
    for param_group in optimizer.param_groups:
        param_group['lr'] = current_lr

    # ... your usual training step code goes here ...
    # optimizer.zero_grad(), loss.backward(), optimizer.step()

Cosine Annealing: The Smooth Landing

Once we’re warmed up, we need a strategy to decay the learning rate. My personal favorite, and a staple in the literature, is cosine annealing. The idea is elegant: we smoothly lower the learning rate from its maximum value to (often) zero using the shape of a cosine curve.

Why a cosine? Because it’s a beautifully smooth transition. There are no sharp drops or jarring changes in speed, which helps the model settle more gracefully into the loss landscape. It’s like coasting to a stop instead of throwing an anchor out the window. Mathematically, it looks like this:

current_lr = min_lr + 0.5 * (max_lr - min_lr) * (1 + cos(π * current_step / total_steps))

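The min_lr is often set to zero, but setting it to a very small value (like max_lr * 1e-4) can sometimes let the model do a final bit of “fine-tuning” instead of grinding to a complete halt. The One-Cycle policy, covered below, builds a similar low-LR finish into the schedule itself. When you combine cosine decay with warmup, measure progress from the end of warmup rather than from step zero, so the decay still starts at max_lr.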
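To get a feel for the curve, here’s a small sketch that evaluates the cosine formula above at a few checkpoints. The helper name cosine_lr and the numbers (max_lr=1e-3, 1,000 steps) are placeholders picked for illustration.

```python
import math

def cosine_lr(current_step, total_steps, max_lr, min_lr=0.0):
    """Plain cosine annealing from max_lr to min_lr, with no warmup."""
    progress = min(current_step / total_steps, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Sample the schedule at 0%, 25%, 50%, 75% and 100% of training.
checkpoints = [0, 250, 500, 750, 1000]
samples = [cosine_lr(s, total_steps=1000, max_lr=1e-3) for s in checkpoints]
for step, lr in zip(checkpoints, samples):
    print(f"step {step:4d}: lr = {lr:.6f}")
```

The decay is slow at both ends and fastest in the middle: a quarter of the way in you still have about 85% of max_lr, and at three quarters you’re down to about 15%.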

def get_cosine_schedule_with_warmup(step, warmup_steps, total_steps, max_lr, min_lr=0):
    """Combines warmup and cosine decay."""
    # 1. Linear Warmup
    if step < warmup_steps:
        return max_lr * (step / warmup_steps)

    # 2. Cosine Decay (progress runs from 0 to 1 over the post-warmup steps)
    decay_steps = max(1, total_steps - warmup_steps)  # avoid dividing by zero
    progress = (step - warmup_steps) / decay_steps
    # Clamp progress to 1.0 in case steps > total_steps
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

The One-Cycle Policy: Going Super Saiyan

Now, let’s talk about Leslie Smith’s One-Cycle policy. This is where we stop being polite and start getting real. It’s a specific, aggressive scheduling strategy that often leads to faster convergence and better generalization. It looks absolutely insane on a graph.

Here’s the recipe:

1. Ramp Up: Use the warmup period we discussed, but instead of just warming up to your conservative guess of a good learning rate, you warm up to a much higher maximum learning rate. We’re talking 5x to 10x your initial guess. Training at these high rates is what Smith calls “super-convergence.”
2. Annihilate: Then, you decay the learning rate just as aggressively back down, even past your starting point, to a very low minimum value. Cosine annealing is a common choice for this (Smith’s original formulation used linear ramps).
3. (Optional but recommended): In the final ~10% of training, you drop the learning rate by another order of magnitude or two for a final fine-tuning phase.

The high LR phase is thought to act as a form of strong regularization, bouncing the model out of sharp minima and helping it find a wider, more generalizable basin. The key is that the entire cycle happens over a relatively small number of epochs. You’re not slogging through training; you’re blitzing through it.

def get_one_cycle_schedule(step, warmup_steps, total_steps, max_lr, div_factor=25.0, final_div_factor=1e4):
    """
    Implements the one-cycle policy with cosine annealing.
    max_lr here is the highest LR we'll hit.
    The initial LR is max_lr / div_factor.
    The final LR is max_lr / final_div_factor.
    """
    # Initial and final LR calculations
    initial_lr = max_lr / div_factor
    min_lr = max_lr / final_div_factor

    # Phase 1: Warmup from initial_lr to max_lr
    if step < warmup_steps:
        return initial_lr + (max_lr - initial_lr) * (step / warmup_steps)

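    # Phase 2: Cosine anneal from max_lr down to min_lr. The optional final
    # drop from the recipe is folded into this low min_lr, not run separately.
    decay_steps = max(1, total_steps - warmup_steps)  # avoid dividing by zero
    progress = (step - warmup_steps) / decay_steps
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))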

# Example: A very aggressive one-cycle schedule
# max_lr = 0.1, so we warmup from 0.1/25=0.004 to 0.1, then decay to 0.1/10000=0.00001
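
One thing to watch if you move between this function and a library scheduler: the parameter names div_factor and final_div_factor also appear on torch.optim.lr_scheduler.OneCycleLR, but PyTorch documents its final LR as the initial LR (not max_lr) divided by final_div_factor. The sketch below just does the arithmetic for both conventions with the example numbers above; the variable names are made up for illustration.

```python
max_lr = 0.1
div_factor = 25.0
final_div_factor = 1e4

initial_lr = max_lr / div_factor  # starting LR under both conventions

# Convention used by get_one_cycle_schedule in this section.
final_lr_this_section = max_lr / final_div_factor

# Convention documented for torch.optim.lr_scheduler.OneCycleLR.
final_lr_pytorch_style = initial_lr / final_div_factor

print(f"initial LR:              {initial_lr:.1e}")
print(f"final LR (this section): {final_lr_this_section:.1e}")
print(f"final LR (PyTorch):      {final_lr_pytorch_style:.1e}")
```

Same two numbers in, a 25x difference in where you land, so double-check which convention you’re on before copying settings across.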

The best practice? Don’t guess these hyperparameters. Use a learning rate range test to find a good max_lr. Start with something like div_factor=25 and final_div_factor=10000 and see how it goes. The payoff for this small bit of configuration is one of the biggest free lunches in deep learning. Stop driving with the parking brake on.
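
If you haven’t run a learning rate range test before, the idea is to sweep the LR geometrically from tiny to huge over a short run, record the loss at each step, and stop once the loss blows up. Here’s a minimal sketch on a toy one-dimensional quadratic so it runs without a model or data. Everything in it is an assumption made for illustration: the function name lr_range_test, the toy loss, the stopping rule (loss exceeding 4x the best seen), and the rule of thumb of backing off from the best-loss LR by a factor of 10. On a real network you’d take one mini-batch step per LR value instead.

```python
def lr_range_test(loss_fn, grad_fn, w0, lr_min=1e-5, lr_max=10.0,
                  num_steps=100, diverge_factor=4.0):
    """Sweep the LR geometrically, taking one gradient step per value.

    Returns a list of (lr, loss) pairs, stopping early once the loss
    exceeds diverge_factor times the best loss seen so far.
    """
    w = w0
    best_loss = float("inf")
    history = []
    for i in range(num_steps):
        # Geometric interpolation between lr_min and lr_max.
        lr = lr_min * (lr_max / lr_min) ** (i / (num_steps - 1))
        w = w - lr * grad_fn(w)
        loss = loss_fn(w)
        history.append((lr, loss))
        best_loss = min(best_loss, loss)
        if loss > diverge_factor * best_loss:
            break  # the loss has blown up; no point sweeping further
    return history

# Toy problem (an assumption for illustration): loss = 1.5 * w^2,
# for which plain gradient descent diverges once lr exceeds 2/3.
history = lr_range_test(
    loss_fn=lambda w: 1.5 * w * w,
    grad_fn=lambda w: 3.0 * w,
    w0=1.0,
)
best_lr, _ = min(history, key=lambda pair: pair[1])
suggested_max_lr = best_lr / 10  # rule of thumb: back off from the edge
print(f"lowest loss at lr={best_lr:.3f}, suggested max_lr={suggested_max_lr:.3f}")
```

On this toy problem the lowest loss shows up just below the divergence threshold, which is exactly why you back off from it: the LR that looks best in a short sweep is usually too close to the cliff for a full training run.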