80.4 PyTorch Tensors and Autograd
Right, let’s talk about PyTorch’s two-fisted approach to getting things done: Tensors and Autograd. This isn’t just a data structure and a library feature; it’s the core philosophical difference that makes PyTorch feel so immediate and, frankly, human. While other frameworks were drawing elaborate blueprints, PyTorch handed you a lump of clay and said, “Go on, shape it. I’ll figure out the math for the changes you make.” It’s brilliant.
The Tensor: Your Universal Data Soldier
Forget everything you know about tensors from that physics class you barely passed. In PyTorch, a tensor is basically just a multi-dimensional array, a NumPy ndarray on steroids that can live on a GPU and, crucially, knows how to track its own history for gradients.
You create them. You bash them into new shapes. You perform operations on them. They are your fundamental unit of data. Let’s make a few. Notice I’m not just showing you the syntax; I’m showing you the debugging syntax—the stuff you’ll actually type a thousand times to figure out what went wrong.
import torch
# Your basic tensor, living a quiet life on the CPU
x = torch.tensor([1.0, 2.0, 3.0])
print(f"Values: {x}")
print(f"Shape: {x.shape}") # Crucial. The first thing you check when anything breaks.
print(f"Device: {x.device}") # The second thing you check. Is it on GPU? Did you forget to send it?
# Let's get this party on the GPU. Requires a CUDA-capable device, obviously.
if torch.cuda.is_available():
    device = torch.device("cuda")
    y = torch.tensor([4.0, 5.0, 6.0], device=device) # Create it there directly
    x = x.to(device) # Or move it there later
    z = x + y # Now this operation happens at ludicrous speed on the GPU
    print(z)
The most common pitfall here, bar none, is the device mismatch. You will, at some point, try to add a tensor on CPU to a tensor on GPU. PyTorch will not quietly fix this for you. It will stop and yell at you. This is a good thing! It’s saving you from silent, mysterious bugs. Get in the habit of checking .device religiously.
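If you want that habit enforced rather than remembered, a small guard can do the checking for you. What follows is only a sketch: check_same_device is a hypothetical helper made up for this illustration, not part of PyTorch, and it does nothing more than compare the .device attribute of whatever tensors you hand it.

```python
import torch

def check_same_device(**named_tensors):
    """Hypothetical helper: raise early, with names, if tensors live on different devices."""
    devices = {name: str(t.device) for name, t in named_tensors.items()}
    if len(set(devices.values())) > 1:
        raise ValueError(f"Device mismatch: {devices}")
    # All tensors agree, so return the device of any one of them
    return next(iter(named_tensors.values())).device

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([4.0, 5.0, 6.0])

device = check_same_device(x=x, y=y)  # Passes: both tensors are on the CPU
z = x + y
print(device, z)
```

The only thing this buys you over PyTorch's own error is that the message names the offending tensors, which helps when the mismatch is buried three function calls deep.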
Autograd: The Magic That Isn’t Magic
This is the killer app. Autograd (automatic gradient computation) is why we’re here. It’s the engine that makes neural networks trainable. And it works by being brutally, beautifully simple: every operation on a tensor is tracked.
When you create a tensor and set requires_grad=True, you’re essentially putting a tiny little GPS tracker on it. Every operation that originates from this tensor is logged in a directed acyclic graph (a “compute graph”). This graph knows exactly how the final result was calculated from the original inputs.
# Let's create a tensor that wants gradients.
a = torch.tensor([2.0], requires_grad=True)
b = torch.tensor([3.0], requires_grad=True)
# Do something with them. PyTorch is now watching and taking notes.
c = a * b # c = 2 * 3 = 6
# Let's say 'c' is our loss. We want to know: how sensitive is c to changes in a and b?
# In math: we want dc/da and dc/db.
# We call .backward() on c to compute these gradients and accumulate them in the .grad attribute of the original tensors.
c.backward()
print(a.grad) # dc/da = b = 3
print(b.grad) # dc/db = a = 2
See? It’s not magic. It’s just a very clever application of the chain rule. The .backward() method traverses the compute graph from the final output (c) back to the roots (a, b), calculating the derivative at every step and storing the final results in the .grad attribute of those root tensors. The reason this feels like witchcraft is that PyTorch builds this graph dynamically as your code runs. This is the “define-by-run” mentality: the graph is defined by the execution path of your code, which is why it’s so intuitive to debug.
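To make define-by-run concrete, here is a minimal sketch in which an ordinary Python if decides which operations end up in the graph. The function name branchy and the two branches are invented for illustration; the point is only that each call records whatever actually executed, and .backward() differentiates that.

```python
import torch

def branchy(x):
    # Plain Python control flow: only the branch that runs is recorded
    if x.item() > 0:
        return x * x   # derivative is 2x
    return x + 10      # derivative is 1

grads = {}
for value in (2.0, -1.0):
    x = torch.tensor([value], requires_grad=True)
    y = branchy(x)
    y.backward()
    grads[value] = x.grad.item()
    # grad_fn is the graph node that produced y; its type differs per branch
    print(value, type(y.grad_fn).__name__, x.grad)
```

Because the graph is rebuilt on every call, you can drop a print or a breakpoint inside branchy and inspect real tensor values, exactly as you would in any other Python code.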
The with torch.no_grad(): Guard
Now, here’s a crucial best practice. You don’t always want this tracking. It consumes memory and compute. When you’re evaluating your model on validation data, or just doing any intermediate calculation that doesn’t contribute to gradients, you need to turn it off. Enter the context manager.
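model = torch.nn.Linear(3, 1) # Stand-in model so the snippet runs; swap in your own
x = torch.randn(4, 3)
learning_rate = 0.01
# This will track history and eat your memory
prediction = model(x) * 2
# This won't. Nothing is recorded, so no graph is kept around.
with torch.no_grad():
    prediction = model(x) * 2
# Also useful for manually updating parameters without messing up the graph
model(x).sum().backward() # Populate model.weight.grad first
with torch.no_grad():
    model.weight -= learning_rate * model.weight.grad # Prefer this over the old .data idiom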
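Forgetting to use no_grad() during inference is a classic rookie mistake. Every forward pass builds a graph you will never use, which costs memory and time. It gets worse if you hang on to the outputs, say by summing loss tensors across batches: each one keeps its graph alive, memory climbs, and you’ll sit there wondering why your evaluation loop eventually falls over. Don’t be that person.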
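An evaluation loop that stays out of trouble might look roughly like this. Treat it as a sketch under assumptions: the Linear model, the random batches, and the squared-output "score" are all placeholders invented for the example, standing in for your real model, data loader, and metric.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(3, 1)                     # Placeholder model
batches = [torch.randn(4, 3) for _ in range(5)]   # Placeholder validation data

model.eval()  # Switches layers like dropout and batch norm to eval behavior
total = 0.0
with torch.no_grad():
    for batch in batches:
        out = model(batch)
        untracked = out.requires_grad          # False inside no_grad
        # .item() pulls out a plain Python float, so nothing tensor-shaped accumulates
        total += out.pow(2).mean().item()
mean_score = total / len(batches)

# For contrast: the same call outside no_grad carries graph history
tracked = model(batches[0])
print(untracked, tracked.requires_grad, mean_score)
```

Note that model.eval() and torch.no_grad() do different jobs: the first changes layer behavior, the second turns off graph recording. You usually want both.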
In-Place Operations: The Cardinal Sin
This is the big one. The trap. You must understand this.
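An in-place operation is one that modifies a tensor’s data directly, denoted by a trailing underscore (e.g., .add_() instead of .add()). The problem? It can overwrite values that autograd saved for the backward pass. If a saved value is gone, the gradient can no longer be computed correctly. PyTorch keeps a version counter on each tensor to catch this and raises an error, but that error often surfaces at .backward(), far from the line that caused it. Bypass the check (by writing through .data, for example) and the result can be worse: silently wrong gradients.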
a = torch.tensor([2.0], requires_grad=True)
# GOOD: Out-of-place operation. Creates a new tensor, history is preserved.
b = a + 1
b.backward() # Works perfectly. a.grad is now tensor([1.])
# Let's reset the gradients for demonstration
a.grad = None
# BAD: In-place operation on a leaf tensor that requires grad.
try:
    a.add_(1) # PyTorch refuses this outright with a RuntimeError
except RuntimeError as err:
    print(f"Refused: {err}")
# In-place ops on intermediate results are sneakier: they can succeed here and only blow up at .backward().
PyTorch raises an error right away if you try an in-place op on a leaf tensor that requires grad. On intermediate tensors it is less predictable: the op may succeed and only fail later at .backward(), or work fine because the overwritten value was never needed. The rule of thumb is simple: avoid in-place operations on tensors that require gradients. Just don’t do it. The microscopic memory savings are not worth the debugging nightmare you will inevitably create for yourself. Consider it forbidden knowledge, best left unused.
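Here is a small sketch of why in-place ops on intermediate tensors are so hard to reason about. Both cases do the same thing, an .add_(1) on a computed result, and only one of them breaks. The variable names are arbitrary; the difference comes down to whether the operation that produced the tensor needs its own output to compute its gradient.

```python
import torch

# Case 1: exp() keeps its output around, because d/dx exp(x) = exp(x)
a = torch.tensor([2.0], requires_grad=True)
b = a.exp()
b.add_(1)  # Overwrites the saved output in place; no complaint yet
try:
    b.backward()
    failed = False
except RuntimeError as err:
    failed = True
    print(f"backward() failed: {err}")

# Case 2: multiplying by a constant does not need its output for the gradient
c = torch.tensor([2.0], requires_grad=True)
d = c * 2
d.add_(1)  # Same in-place op, but nothing autograd needs was touched
d.backward()
print(failed, c.grad)  # c.grad is the gradient of 2*c + 1 with respect to c
```

Telling these two cases apart requires knowing what each operation saves internally, which is exactly the kind of detail you don't want your training loop to depend on.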