14.5 Computational Graphs and Automatic Differentiation

Right, let’s get our hands dirty. You’ve probably heard the term “backpropagation” thrown around like a party favor at a machine learning conference. It’s the magical, mystical process that makes neural networks learn. But strip away the mystique, and what you find is a shockingly elegant and practical piece of computer science called automatic differentiation (autodiff), built on top of a data structure called a computational graph.

Think of a computational graph not as some terrifying abstract concept, but as a detailed recipe for your calculation. Every variable (ingredient) and operation (step) is a node, and the edges show the flow of data. We break a complex calculation into its tiniest, most fundamental steps. Why? Because it’s far easier to teach a computer how to compute the derivative of a + b once than it is to teach it the derivative of an entire monstrous loss function from scratch every time.

Let’s build one. Say we want to compute z = (x * y) + (x + 5). Our computational graph looks like this:

x ---> (*) ---> (+) ---> z
y ----/         /
x ---> (+) ----/
5 ----/

We introduce intermediate variables (a, b) to make the graph explicit:

a = x * y
b = x + 5
z = a + b

Now, why did we bother? Because when we want to know how sensitive z is to a change in x (i.e., dz/dx), the graph and the chain rule from calculus give us a perfect, step-by-step guide to find out.

Reverse-Mode Autodiff: The Engine of Deep Learning

This is backpropagation’s secret sauce. We walk through the graph forward, calculating the intermediate values (like a and b). Then, we walk backwards from the final output (z) to compute the gradients for every parameter.

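We calculate the “local” gradient at each node: the derivative of that node’s operation with respect to its immediate inputs. The chain rule then tells us that the gradient arriving at a node from downstream (the derivative of the final output with respect to that node) gets multiplied by each local gradient to produce the gradient for the corresponding input. When a variable feeds several nodes, the gradients arriving along each path are summed.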

Let’s do it for dz/dx. We want to know how z changes with x. We see that x feeds into two nodes: the multiplication (a) and the addition (b).

Forward Pass: Compute the values. Let x = 3, y = 4.
- a = x * y = 3 * 4 = 12
- b = x + 5 = 3 + 5 = 8
- z = a + b = 12 + 8 = 20
Backward Pass: Compute the gradients, starting from z and moving back.
- Gradient at z: dz/dz = 1. (This is our starting signal.)
- Node z = a + b: The local gradients are ∂z/∂a = 1 and ∂z/∂b = 1. So, the gradient flowing back to a is 1 * 1 = 1. Similarly, 1 flows back to b.
- Node b = x + 5: The local gradient ∂b/∂x = 1. The gradient flowing back from b is 1. So, b’s contribution to dz/dx is 1 * 1 = 1.
- Node a = x * y: The local gradients are ∂a/∂x = y = 4 and ∂a/∂y = x = 3. The gradient flowing back from a is 1. So, a’s contribution to dz/dx is 1 * 4 = 4.
- Total dz/dx: We sum the contributions from all paths. From a: 4. From b: 1. So, dz/dx = 4 + 1 = 5.

You can verify this the old-fashioned way: z = (x*y) + (x+5) = x*y + x + 5. The derivative dz/dx = y + 1 = 4 + 1 = 5. Perfect.
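If you don’t trust the algebra, you can also check the result numerically. The sketch below is a sanity check, not part of autodiff itself: it nudges each input by a small step h and measures how much z moves (a central finite difference). The helper name central_difference and the step size 1e-6 are choices made for this illustration.

```python
def f(x, y):
    """The running example: z = (x * y) + (x + 5)."""
    return (x * y) + (x + 5)


def central_difference(func, x, y, wrt, h=1e-6):
    """Estimate a partial derivative of func at (x, y) by nudging one input."""
    if wrt == "x":
        return (func(x + h, y) - func(x - h, y)) / (2 * h)
    return (func(x, y + h) - func(x, y - h)) / (2 * h)


dz_dx = central_difference(f, 3.0, 4.0, wrt="x")
dz_dy = central_difference(f, 3.0, 4.0, wrt="y")

# Both estimates should land very close to the hand-derived values 5 and 3.
print(f"numerical dz/dx: {dz_dx:.4f}")
print(f"numerical dz/dy: {dz_dy:.4f}")
```

Finite differences need one or two extra function evaluations per input, which is fine for two variables and hopeless for millions of parameters. They are mostly useful as a test for gradient code.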

The computer does exactly this. It records the operations during the forward pass (this recording is often called a trace or a tape) and then executes this precise backward walk. This specific method, where we compute the gradient of the final output with respect to all inputs in one backward sweep, is called reverse-mode autodiff. It’s spectacularly efficient for neural networks, where we have many parameters (inputs) and one loss (output): a single backward sweep yields every gradient at once, whereas working forward from the inputs would take one sweep per parameter.
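To make the record-then-walk-backward idea concrete, here is a minimal sketch of reverse-mode autodiff for scalars in plain Python. The Value class and its attribute names (parents, local_grads) are hypothetical, invented for this illustration; real frameworks are organized differently and handle far more than addition and multiplication.

```python
class Value:
    """A scalar node in a computational graph (toy class for illustration)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0                 # will hold d(output)/d(this node)
        self.parents = parents          # inputs of the operation that made this node
        self.local_grads = local_grads  # d(this node)/d(parent), one per parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        # Local gradients of a + b are 1 with respect to both inputs.
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        # Local gradients of a * b are b (w.r.t. a) and a (w.r.t. b).
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Order the nodes so that every node comes after all of its inputs.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # the starting signal: d(output)/d(output)
        # Walk from the output back to the inputs. Each node's gradient is
        # complete before it is pushed to its parents, and "+=" sums the
        # contributions of every path that reaches the same node.
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += node.grad * local


x, y = Value(3.0), Value(4.0)
a = x * y
b = x + 5
z = a + b
z.backward()
print(z.data, x.grad, y.grad)
```

The `+=` in the backward walk is the same summing-over-paths step from the hand calculation: x receives 4 through a and 1 through b.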

Where This Actually Matters: The Vanishing Gradient

Here’s where theory meets the brutal reality of training. Look at the core operation of a neural network: output = activation(weight * input + bias).

Now imagine a deep network with, say, 10 layers. The gradient for an early layer is calculated by multiplying the local gradients of every single layer that comes after it. It’s the chain rule, all the way down.

What’s the local gradient of a popular activation function like sigmoid? ∂sigmoid(x)/∂x = sigmoid(x) * (1 - sigmoid(x)). Look at that. Its maximum value is 0.25, reached at x = 0. If you have 10 sigmoid layers, the gradient for the first layer includes 10 of these factors, each <= 0.25 (alongside the weights of each layer). Even in the best case, 0.25^10 is a vanishingly small number, roughly 9.5e-7. Your gradient signal for the early layers effectively becomes zero. They stop learning. This isn’t a theoretical concern; it’s the reason deep networks were famously hard to train before ReLU and better initialization schemes came along. The computational graph makes this multiplicative nature of the backward pass painfully clear.
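The arithmetic is easy to reproduce. The sketch below looks only at the activation-derivative factors and ignores the weights, so it is an illustration of the shrinking product rather than a simulation of a real network. The pre-activation value 2.0 used for the second case is an arbitrary choice for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Local gradient of sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)


depth = 10

# Best case: every layer sits exactly at x = 0, where the derivative peaks.
peak = sigmoid_grad(0.0)
best_case = peak ** depth

# A less lucky case: every layer sees a pre-activation of 2.0 (arbitrary).
off_peak = 1.0
for _ in range(depth):
    off_peak *= sigmoid_grad(2.0)

print(f"peak local gradient: {peak}")
print(f"best-case product over {depth} layers: {best_case:.3e}")
print(f"product at x = 2.0 over {depth} layers: {off_peak:.3e}")
```

Moving away from x = 0 only makes things worse, since the sigmoid flattens out on both sides.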

Implementing It in Code: A Peek Under the Hood

You don’t write this yourself; frameworks like PyTorch and TensorFlow do it for you. But understanding what their Tensor object is doing is critical. It’s not just a data holder; it’s a node in the computational graph.

import torch

# Tell PyTorch we need to track operations for gradients later.
x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(4.0, requires_grad=True)

# Build the graph forward. PyTorch secretly records every operation.
a = x * y
b = x + 5
z = a + b

# This is the magic command. It initiates the backward pass.
# It calculates gradients of z with respect to every leaf tensor that has requires_grad=True.
z.backward()

# Now, x.grad and y.grad contain dz/dx and dz/dy.
print(f"dz/dx: {x.grad}")  # Output: tensor(5.)
print(f"dz/dy: {y.grad}")  # Output: tensor(3.)

The crucial detail: Notice y.grad is 3. From our earlier math: the path from z to a to y gives us dz/dy = 1 * x = 1 * 3 = 3.
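You can peek at the recorded graph yourself. The sketch below uses the tensor attributes is_leaf and grad_fn, plus next_functions on the backward node, to walk one step back from z. Treat the printed class names as an internal detail: they tend to look like AddBackward0 or MulBackward0, but they may differ between PyTorch versions.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(4.0, requires_grad=True)
a = x * y
b = x + 5
z = a + b

# Leaf tensors were created directly by us, so no operation produced them.
print("x is a leaf:", x.is_leaf, "| x.grad_fn:", x.grad_fn)

# Non-leaf tensors remember the operation that produced them.
print("z is a leaf:", z.is_leaf, "| z.grad_fn:", type(z.grad_fn).__name__)

# Each backward node points at the backward nodes of its inputs,
# which is how PyTorch finds its way from z back to x and y.
for fn, _ in z.grad_fn.next_functions:
    print("feeds into z:", type(fn).__name__)
```

This is the same structure as the hand-drawn graph, just stored as linked objects instead of arrows.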

The Number One Pitfall: The Dangling `.grad` Attribute

Here’s the part that trips up everyone, including me, at 2 AM. In PyTorch, gradients are accumulated. When you call .backward(), the calculated gradients are added to the .grad attribute of the leaf tensors (x and y in our example).

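Why? Accumulation is useful when a single weight update should combine several backward passes, for example when you split a large batch into smaller chunks that fit in memory, or sum losses over the time steps of an RNN. But in an ordinary training loop, you don’t want this. You want the gradients for the current batch only.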

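# Run the same cell again: rebuild the graph, then call backward. What happens?
# (Calling z.backward() alone on the old graph raises a RuntimeError,
# because PyTorch frees the graph's saved values after the first backward pass.)
z = x * y + (x + 5)
z.backward()
print(f"dz/dx after second .backward(): {x.grad}")  # Output: tensor(10.)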

It’s now 10! (5 + 5). This is almost never what you want. You must zero the gradients before each backward pass when training.

# The correct pattern for a training loop:
for epoch in range(100):
    # ... calculate loss ...
    optimizer.zero_grad()  # Clears the gradients left over from the previous iteration.
    loss.backward()
    optimizer.step()  # Uses the freshly calculated gradients to update weights.

Forgetting optimizer.zero_grad() is the programmer’s equivalent of leaving the fridge door open. Everything seems fine until everything is ruined and you have no idea why.
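For a version of that loop you can actually run, here is a minimal sketch that fits a line to toy data. Everything specific in it is an assumption made for the example: the target function y = 2x + 1, the single torch.nn.Linear layer, the learning rate of 0.1, and the 200 iterations.

```python
import torch

torch.manual_seed(0)  # make the random initial weights repeatable

# Toy data generated from y = 2x + 1 (an arbitrary target for this example).
inputs = torch.linspace(-1.0, 1.0, 20).unsqueeze(1)  # shape (20, 1)
targets = 2.0 * inputs + 1.0

model = torch.nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(200):
    optimizer.zero_grad()          # clear gradients from the previous iteration
    predictions = model(inputs)    # forward pass builds a fresh graph
    loss = torch.nn.functional.mse_loss(predictions, targets)
    loss.backward()                # backward pass fills .grad on the parameters
    optimizer.step()               # update the weights using those gradients

final_loss = loss.item()
print(f"final loss: {final_loss:.6f}")
print(f"learned weight: {model.weight.item():.3f}, bias: {model.bias.item():.3f}")
```

Try commenting out the optimizer.zero_grad() line and watching the loss: each step then uses the sum of every gradient computed so far, not the current one.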

So there you have it. Autodiff isn’t magic; it’s a clever, systematic application of the chain rule that is brutally honest about the math. It gives us our training signal but also explains why deep networks can be so fragile. You’re not just using a framework anymore; you understand the engine it’s built on.
