3.4 Derivatives and the Chain Rule: Foundations of Backpropagation

Alright, let’s get our hands dirty with derivatives. Forget the dusty old definition from calculus class with the limit of the secant line. In the AI world, you need a more practical, almost physical intuition. Think of a derivative not as a slope, but as a sensitivity measurement.

If you have a function f(x), the derivative f'(x) or df/dx tells you one thing: if you give x a tiny nudge h, how much will the output f(x) nudge in response? For a small enough h, the output moves by roughly f'(x) * h. It’s the function’s amplification factor for change at that specific point. A derivative with large magnitude means the function is super sensitive there; one near zero means it barely cares, and the sign tells you which direction the output moves. This is the absolute bedrock of training neural networks. We nudge the weights (our x) based on how sensitive the loss (our f(x)) is to them. It’s how the network learns.
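To see the "nudge" idea in numbers, here is a minimal sketch that estimates a derivative by actually nudging the input. The function f(x) = x**3, the point x0 = 2.0, and the helper name sensitivity are all assumptions picked for illustration; none of them come from a library.

```python
def f(x):
    # Example function chosen for illustration: f(x) = x^3, so f'(x) = 3x^2.
    return x ** 3


def sensitivity(func, x, h):
    """Change in output per unit of nudge: (f(x + h) - f(x)) / h."""
    return (func(x + h) - func(x)) / h


x0 = 2.0
exact = 3 * x0 ** 2  # analytic derivative at x0

# Shrink the nudge and watch the estimate settle toward the exact value.
for h in (1e-1, 1e-2, 1e-3, 1e-4):
    estimate = sensitivity(f, x0, h)
    print(f"h={h:<7} estimate={estimate:.6f} error={abs(estimate - exact):.6f}")
```

As h shrinks, the estimate should close in on the analytic value, which is the "amplification factor" reading of the derivative in action.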

The Chain Rule: The Glue That Holds AI Together

Here’s where it gets fun. In real neural networks, your loss function isn’t a simple f(x). It’s a monstrous, nested function of functions of functions: think Loss(Softmax(ReLU(Linear(Inputs, Weights)))). Trying to write out the derivative of this mess by hand as one giant expression would be like trying to calculate the exact air resistance on a falling piece of toast. It’s not happening.

Enter the chain rule, the unsung hero of backpropagation. The chain rule is a simple rule that lets you break down the derivative of a composite function into a chain of simpler derivatives. If y = f(g(x)), then the sensitivity of y to x is the sensitivity of y to g (measured at the current value of g(x)) multiplied by the sensitivity of g to x:

dy/dx = (dy/dg) * (dg/dx)

Why is this so powerful? It means we can compute the derivative of the whole complicated system by localizing the problem. Each layer or operation in your network only needs to know how to compute its own derivative with respect to its own inputs. The chain rule then multiplies these local gradients together to get the overall gradient. It’s beautifully modular. The forward pass calculates the output and stashes the intermediate values along the way; the backward pass walks from the output back toward the inputs, computing each local derivative and chaining it onto the running product. That’s backpropagation in a nutshell.
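The modularity described above can be sketched in a few lines of plain Python. The classes Scale, AddConstant, and Square and the helper forward_backward are hypothetical names invented for this illustration, not part of any framework. Each one only knows its own local derivative, and the loop does the chaining.

```python
class Scale:
    """y = factor * x; local derivative dy/dx = factor."""

    def __init__(self, factor):
        self.factor = factor

    def forward(self, x):
        return self.factor * x

    def backward(self, upstream):
        return upstream * self.factor


class AddConstant:
    """y = x + c; local derivative dy/dx = 1."""

    def __init__(self, c):
        self.c = c

    def forward(self, x):
        return x + self.c

    def backward(self, upstream):
        return upstream * 1.0


class Square:
    """y = x^2; local derivative dy/dx = 2x, so the input must be cached."""

    def forward(self, x):
        self.x = x  # remember the input for the backward pass
        return x * x

    def backward(self, upstream):
        return upstream * 2.0 * self.x


def forward_backward(layers, x):
    # Forward pass: feed the value through each operation in order.
    out = x
    for layer in layers:
        out = layer.forward(out)
    # Backward pass: start from d(out)/d(out) = 1 and chain local derivatives
    # in reverse order.
    grad = 1.0
    for layer in reversed(layers):
        grad = layer.backward(grad)
    return out, grad


# y = (3x + 1)^2, so by hand dy/dx = 6 * (3x + 1).
layers = [Scale(3.0), AddConstant(1.0), Square()]
y, dy_dx = forward_backward(layers, 2.0)
print("y =", y, "dy/dx =", dy_dx)
```

Swapping an operation in or out only requires that it supply its own forward and backward; nothing else in the chain has to change. That is the design idea real autograd systems build on, in a much more general form.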

Implementing It: From Math to Code

Let’s make this concrete with a simple example. Say we have y = f(g(x)) where g(x) = x**2 and f(z) = sin(z), so y = sin(x**2).

We want dy/dx. By the chain rule:

df/dz = cos(z)
dg/dx = 2*x
dy/dx = (df/dz) * (dg/dx) = cos(x**2) * 2*x

Now, let’s check this in code, first with PyTorch’s autograd and then with a manual calculation, to show you how it’s done in practice.

import torch
import math

# Let's pick a specific point, say x = 2
x = torch.tensor(2.0, requires_grad=True)  # requires_grad=True tells PyTorch to track computations for grad

# Forward pass: compute y = sin(x^2)
z = x ** 2
y = torch.sin(z)

# Backward pass: compute the gradient dy/dx
y.backward()

# Now x.grad will contain dy/dx
print("PyTorch computed gradient:", x.grad.item())

# Let's verify with our manual calculation:
manual_grad = math.cos(2**2) * 2 * 2
print("Manually computed gradient:", manual_grad)
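If you don't have PyTorch handy, or you want an independent sanity check, a numerical gradient check works too. The sketch below compares the chain rule result for y = sin(x**2) against a central difference. The helper names and the step size h = 1e-5 are assumptions chosen for illustration.

```python
import math


def y(x):
    # The worked example from the text: y = sin(x^2)
    return math.sin(x ** 2)


def analytic_grad(x):
    # Chain rule result derived above: cos(x^2) * 2x
    return math.cos(x ** 2) * 2 * x


def central_difference(func, x, h=1e-5):
    # Nudge in both directions and divide by the total nudge width.
    return (func(x + h) - func(x - h)) / (2 * h)


x0 = 2.0
numeric = central_difference(y, x0)
analytic = analytic_grad(x0)

print("numeric :", numeric)
print("analytic:", analytic)
print("match   :", math.isclose(numeric, analytic, rel_tol=1e-6))
```

This kind of check is a handy habit whenever you derive a gradient by hand: if the two numbers disagree beyond small numerical error, the derivation (or the code) is probably wrong.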

Common Pitfalls and Gotchas

This isn’t all sunshine and rainbows. Here are the places you’ll get tripped up.

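The Order of Multiplication is Everything: In neural networks, you’re not dealing with scalars x; you’re dealing with vectors, matrices, and tensors. The chain rule still holds, but each local derivative becomes a Jacobian matrix and the multiplication becomes a matrix product. And matrix multiplication is not commutative: A @ B is generally not the same as B @ A. If you mess up the order of your gradients (or forget a transpose) during backprop, you’ll get a shape error or, worse, silently incorrect results when the shapes happen to line up. Your deep learning framework handles this for you in its built-in operations, but understanding it keeps you from being mystified by shape-mismatch errors in custom code. A related trap: PyTorch raises RuntimeError: grad can be implicitly created only for scalar outputs when you call backward() on a non-scalar tensor without passing a gradient argument, because there’s no single upstream gradient to start the chain from.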
The Vanishing Gradient Problem: Remember, the chain rule involves multiplying derivatives. What happens if most of those derivatives have magnitude less than 1? The sigmoid’s derivative, for instance, never exceeds 0.25. You multiply a lot of small numbers together, and the product shrinks exponentially with depth. The gradient effectively becomes zero, and the weights in your early layers barely get updated. This is the infamous “vanishing gradient” problem, one of the main reasons deep networks with saturating activations like sigmoid and tanh were so hard to train. ReLU and its variants (Leaky ReLU, etc.) help: ReLU’s derivative is exactly 1 for positive inputs, so active units pass the gradient through without shrinking it. Its derivative is 0 for negative inputs, though, so a unit stuck there passes nothing back at all, which is the problem variants like Leaky ReLU are designed to soften.
Watch Your Shapes: This is the most common practical headache. The gradient of a scalar loss with respect to a weight matrix must have the exact same shape as the weight matrix itself. This is non-negotiable. If your gradient has a different shape, your update rule weights = weights - learning_rate * gradient will either fail outright or, if the shapes happen to broadcast, quietly produce garbage. Always print(weight.shape, gradient.shape) if you’re ever doing anything remotely custom. The framework does this correctly for its built-in layers, but the moment you write a custom function, the responsibility falls on you.
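The ordering and shape points above can be seen together in a tiny hand-rolled example. This is a minimal sketch in plain Python using nested lists, assuming a single linear layer y = W x with loss = 0.5 * sum(y_i^2). The matrix values and the helpers shape, matvec, and outer are made up for illustration.

```python
# Hypothetical 2x3 weight matrix and length-3 input; values are arbitrary.
W = [[0.5, -1.0, 2.0],
     [1.5, 0.0, -0.5]]
x = [1.0, 2.0, 3.0]


def shape(m):
    """Return (rows, cols) of a nested-list matrix."""
    return (len(m), len(m[0]))


def matvec(m, v):
    """Matrix-vector product m @ v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def outer(a, b):
    """Outer product: result[i][j] = a[i] * b[j]."""
    return [[a_i * b_j for b_j in b] for a_i in a]


# Forward pass: y = W x, loss = 0.5 * sum(y_i^2)
y = matvec(W, x)
loss = 0.5 * sum(y_i ** 2 for y_i in y)

# Backward pass: dloss/dy = y, and dloss/dW[i][j] = dloss/dy[i] * x[j].
grad_y = y
grad_W = outer(grad_y, x)          # correct order: same shape as W
grad_W_swapped = outer(x, grad_y)  # swapped order: transposed shape

print("W:", shape(W), "grad_W:", shape(grad_W), "swapped:", shape(grad_W_swapped))

# Only apply the update once the shapes are known to match.
learning_rate = 0.1
assert shape(grad_W) == shape(W)
W_new = [[w - learning_rate * g for w, g in zip(w_row, g_row)]
         for w_row, g_row in zip(W, grad_W)]
```

The swapped version comes out transposed, which is exactly the kind of mistake a shape check before the update is meant to catch. With a square weight matrix the shapes would match either way, and the bug would be silent.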