80.6 PyTorch Training Loop: Forward, Loss, Backward, Optimizer Step

Alright, let’s get our hands dirty. The training loop is the beating heart of any PyTorch model. It’s where your theoretical architecture meets the cold, hard data and hopefully learns something. If you’ve ever written a for loop, you can do this. But doing it well is the difference between a model that converges smoothly and one that just… doesn’t.

The core of it is a beautifully simple, four-step ritual that you’ll repeat thousands of times:

1. Forward Pass: Make a prediction.
2. Calculate Loss: Quantify how bad that prediction was.
3. Backward Pass: Figure out who to blame for the error.
4. Optimizer Step: Actually learn something.

Let’s break it down. First, the setup. You’ll need your model, your data, a loss function, and an optimizer.

import torch
import torch.nn as nn
import torch.optim as optim

# Assume we have a simple neural network class defined
model = MyNeuralNet()

# A loss function - CrossEntropy is common for classification
criterion = nn.CrossEntropyLoss()

# An optimizer - Adam is the sensible default for most things
optimizer = optim.Adam(model.parameters(), lr=0.001) # lr is crucial!

# Your training data loader (e.g., from a Dataset and DataLoader)
train_loader = ...
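
If you want something you can actually run while reading along, here is one possible way to fill in the blanks. Everything in it is an assumption made for illustration: the layer sizes inside MyNeuralNet, the 20 input features, the 3 classes, and the random tensors standing in for a real dataset.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical stand-in for MyNeuralNet: 20 input features, 3 output classes.
class MyNeuralNet(nn.Module):
    def __init__(self, in_features=20, hidden=32, num_classes=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),  # raw logits, no softmax here
        )

    def forward(self, x):
        return self.net(x)

# Synthetic data, purely illustrative: 256 random samples with random labels.
features = torch.randn(256, 20)
targets = torch.randint(0, 3, (256,))
train_loader = DataLoader(TensorDataset(features, targets),
                          batch_size=32, shuffle=True)

model = MyNeuralNet()

# Pull one batch and check that the shapes line up before training anything.
inputs, labels = next(iter(train_loader))
logits = model(inputs)
print(logits.shape)  # one row of class scores per sample in the batch
```

Checking a single batch like this before writing the loop is cheap, and it catches most shape mismatches early.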

The Basic Loop Structure

Here’s the canonical training loop. It’s so common it’s practically muscle memory.

num_epochs = 10

for epoch in range(num_epochs):
    # Set the model to training mode. This is crucial for layers like Dropout and BatchNorm.
    model.train()

    # Iterate over batches of data
    for inputs, labels in train_loader:
        # Step 1: Forward pass
        outputs = model(inputs)

        # Step 2: Calculate the loss
        loss = criterion(outputs, labels)

        # Step 3: Backward pass
        optimizer.zero_grad()  # Critical: clear old gradients
        loss.backward()        # Compute gradients

        # Step 4: Update parameters
        optimizer.step()

    # Optional: print the loss at the end of each epoch. Note this is the loss of
    # the last batch only; an average over the epoch is a steadier signal.
    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')
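
The loop above reports whatever the final batch happened to produce. A minimal sketch of tracking the mean loss over the whole epoch instead might look like this. The helper name train_one_epoch, the toy linear model, and the synthetic data are all assumptions for the sake of a runnable example.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_one_epoch(model, loader, criterion, optimizer):
    """Run one pass over the loader and return the mean per-sample loss."""
    model.train()
    total_loss, total_samples = 0.0, 0
    for inputs, labels in loader:
        outputs = model(inputs)            # Step 1: forward
        loss = criterion(outputs, labels)  # Step 2: loss
        optimizer.zero_grad()              # clear old gradients
        loss.backward()                    # Step 3: backward
        optimizer.step()                   # Step 4: update
        # The criterion averages over the batch by default, so weight each
        # batch by its size to get a correct per-sample mean.
        total_loss += loss.item() * inputs.size(0)
        total_samples += inputs.size(0)
    return total_loss / total_samples

torch.manual_seed(0)

# Toy problem (illustrative): the label depends only on the sign of feature 0.
X = torch.randn(128, 10)
y = (X[:, 0] > 0).long()
loader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)

model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)

history = [train_one_epoch(model, loader, criterion, optimizer) for _ in range(5)]
for epoch, avg in enumerate(history, start=1):
    print(f'Epoch [{epoch}/5], Avg loss: {avg:.4f}')
```

Weighting by batch size matters when the last batch is smaller than the rest; with equal-sized batches a plain mean of the batch losses gives the same number.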

Why optimizer.zero_grad() is Non-Negotiable

Forgetting this call is the most common rookie mistake, and it’s a doozy. PyTorch accumulates gradients: when you call .backward(), it computes the gradients and adds them to whatever is already stored in the .grad attribute of each parameter.

If you don’t zero them out before the next backward pass, you’re effectively summing the gradients across multiple batches. This is almost never what you want. It’s like trying to learn from today’s mistakes while still being mad about yesterday’s. The model’s performance will be utterly bizarre and unstable. So, always, always zero_grad().
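
You can watch the accumulation happen on a two-element tensor. This is a small sketch rather than anything from a real training run; the tensors w and x and the SGD optimizer are made up for the demonstration.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
x = torch.tensor([3.0, 4.0])
optimizer = torch.optim.SGD([w], lr=0.1)

# First backward pass: d/dw sum(w * x) = x
(w * x).sum().backward()
first = w.grad.clone()

# Second backward pass WITHOUT zeroing: the new gradient is added on top.
(w * x).sum().backward()
accumulated = w.grad.clone()

# Clear, then run backward again: we are back to the gradient of one pass.
optimizer.zero_grad()
(w * x).sum().backward()
after_reset = w.grad.clone()

print(first, accumulated, after_reset)
```

The second value is exactly twice the first, which is what silently happens to every parameter in your model when zero_grad() goes missing.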

Choosing Your Loss Function Wisely

The loss function isn’t just a technicality; it’s the objective you’re telling your model to optimize. Choosing the wrong one is like giving a chef a recipe for cake when you want a steak.

nn.MSELoss(): For regression tasks (predicting a continuous value like house prices). It punishes large errors heavily.
nn.CrossEntropyLoss(): For multi-class classification. It combines a log-softmax and a negative log-likelihood loss, so it expects raw, unnormalized logits and (typically) integer class indices as targets. Please, for the love of all that is holy, do not pass the outputs through softmax yourself before feeding them into this loss. It will do it for you. I’ve seen it happen. It’s not pretty.
nn.BCEWithLogitsLoss(): For binary classification. Similarly, it combines a sigmoid and binary cross-entropy. Don’t pre-sigmoid your outputs.
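
To see what “it will do it for you” means in practice, the sketch below compares each fused loss against its spelled-out equivalent, and shows what the double-softmax mistake does to the number. The logits and labels are arbitrary values picked for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Arbitrary logits for 4 samples and 3 classes, with integer class labels.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0],
                       [-1.0, 2.5, 0.3],
                       [1.5, 1.5, 0.0]])
labels = torch.tensor([0, 2, 1, 0])

# Correct usage: raw logits go straight in.
ce = nn.CrossEntropyLoss()(logits, labels)

# The same thing spelled out: log-softmax followed by negative log-likelihood.
manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)

# The mistake: softmax applied first, so the loss normalizes a second time.
double = nn.CrossEntropyLoss()(F.softmax(logits, dim=1), labels)

# Binary case: BCEWithLogitsLoss on logits matches BCELoss on sigmoid outputs.
bin_logits = torch.tensor([1.2, -0.7, 0.3, -2.0])
bin_targets = torch.tensor([1.0, 0.0, 1.0, 0.0])
bce_fused = nn.BCEWithLogitsLoss()(bin_logits, bin_targets)
bce_manual = nn.BCELoss()(torch.sigmoid(bin_logits), bin_targets)

print(f'correct: {ce.item():.4f}  manual: {manual.item():.4f}  '
      f'double softmax: {double.item():.4f}')
```

The double-softmax version still runs without complaint, which is exactly why the bug is so easy to miss: the loss is just quietly wrong and the gradients are squashed.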

The Optimizer’s Job and The Learning Rate

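The optimizer’s job is to look at the gradients (which point in the direction of steepest ascent for the loss) and take a step the other way. Plain SGD steps exactly along the negative gradient; Adam additionally adapts the step per parameter using running gradient statistics. Either way, the size of the step is governed by the learning rate (lr).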

Set lr too high, and your model will bounce around the minimum like a pinball, never converging, or even diverge (loss -> NaN, which is your cue to panic). Set it too low, and your model will take a geological era to train. Adam with lr=1e-3 or lr=1e-4 is a great starting point. For SGD, you’ll often need a higher rate, like 0.01 or 0.1, but it’s much less forgiving.
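
Here is a toy illustration of both points, using plain SGD on the one-parameter loss w^2. The problem, the helper name final_abs_w, and the specific learning rates are all chosen for the demonstration; real loss surfaces are far less tidy.

```python
import torch

# Part 1: a single SGD step is just "parameter minus lr times gradient".
w = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()                  # gradient is 2 * w
optimizer.zero_grad()
loss.backward()
expected = w.detach() - 0.1 * w.grad   # the update written out by hand
optimizer.step()
print(w.detach(), expected)

# Part 2: the same problem with different learning rates.
def final_abs_w(lr, steps=20):
    """Minimize w^2 from w = 1 with plain SGD; return |w| after `steps` updates."""
    p = torch.tensor([1.0], requires_grad=True)
    opt = torch.optim.SGD([p], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (p ** 2).sum().backward()
        opt.step()
    return p.abs().item()

for lr in (0.001, 0.1, 1.1):
    print(f'lr={lr}: |w| after 20 steps = {final_abs_w(lr):.4f}')
```

On this problem each step multiplies w by (1 - 2 * lr), so a tiny rate barely moves, a moderate rate shrinks w quickly, and a rate above 1 makes every step overshoot further than the last.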

Gradient Clipping: A Safety Net for Exploding Gradients

Sometimes, especially in recurrent networks, gradients can become absurdly large (“explode”). When the optimizer takes a step based on these massive gradients, it can completely destabilize the model.

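Gradient clipping is a simple but effective hack: if the overall norm of the gradients exceeds a chosen threshold, we scale all of them down so the norm equals that threshold. The direction of the update is preserved; only its size is capped.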

# ... after loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()

Think of it as putting training wheels on your model. It won’t make it learn better, but it might prevent a catastrophic crash.
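
To make the rescaling concrete, here is a small sketch with a hand-built gradient of norm 5. The tensor w and the coefficients are invented for the example so the arithmetic is easy to follow.

```python
import torch

# A parameter whose gradient we can predict: d/dw sum(w * [3, 4]) = [3, 4].
w = torch.zeros(2, requires_grad=True)
(w * torch.tensor([3.0, 4.0])).sum().backward()

grad_before = w.grad.clone()           # [3, 4], which has L2 norm 5

# clip_grad_norm_ rescales the gradients in place and returns the total
# norm as it was BEFORE clipping.
norm_before = torch.nn.utils.clip_grad_norm_([w], max_norm=1.0)

grad_after = w.grad.clone()            # roughly [0.6, 0.8]
norm_after = grad_after.norm()

print(f'norm before: {norm_before.item():.4f}, after: {norm_after.item():.4f}')
print(grad_before, grad_after)
```

The returned pre-clipping norm is worth logging: if it regularly sits far above max_norm, clipping is masking a deeper problem such as a learning rate that is too high.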

The Validation Loop: How You Know You’re Not Just Memorizing

Training is useless if it doesn’t generalize. After each epoch, you need to evaluate your model on validation data it has never seen. The key differences:

model.eval()  # Sets the model to evaluation mode (turns off Dropout, etc.)
with torch.no_grad():  # Disables gradient tracking: saves memory and some compute.
    for inputs, labels in validation_loader:
        outputs = model(inputs)
        # ... calculate validation loss/metrics ...
# Switch back to training mode for the next epoch
model.train()

This loop is where your ego goes to die, but it’s the only way to get a model that’s actually useful. The dance between model.train() and model.eval() is essential. Forget it, and your validation metrics will be a lie.
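
One way to fill in the “calculate validation loss/metrics” part is sketched below. The helper name evaluate, the small Dropout model, and the random validation data are assumptions for illustration, and accuracy is just one metric you might pick.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, loader, criterion):
    """Return (mean loss, accuracy) over the loader without tracking gradients."""
    was_training = model.training      # remember the mode we were called in
    model.eval()
    total_loss, correct, total = 0.0, 0, 0
    with torch.no_grad():
        for inputs, labels in loader:
            outputs = model(inputs)
            total_loss += criterion(outputs, labels).item() * inputs.size(0)
            correct += (outputs.argmax(dim=1) == labels).sum().item()
            total += inputs.size(0)
    model.train(was_training)          # restore the previous mode
    return total_loss / total, correct / total

torch.manual_seed(0)

# Illustrative model with Dropout, so train vs. eval mode actually matters.
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(),
                      nn.Dropout(p=0.5), nn.Linear(16, 2))
X_val = torch.randn(64, 10)
y_val = torch.randint(0, 2, (64,))
validation_loader = DataLoader(TensorDataset(X_val, y_val), batch_size=16)

criterion = nn.CrossEntropyLoss()
model.train()
first = evaluate(model, validation_loader, criterion)
second = evaluate(model, validation_loader, criterion)
val_loss, val_acc = first
print(f'Validation loss: {val_loss:.4f}, accuracy: {val_acc:.2%}')
```

Because Dropout is switched off in eval mode, two evaluations of the same weights give identical numbers. Restoring the previous mode inside the helper means a forgotten model.train() call afterwards cannot quietly disable Dropout for the rest of training.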