Backpropagation | mikePietsch.com

14.7 Vanishing and Exploding Gradients

Right, so you’ve built your first few neural networks. They’re training, the loss is (mostly) going down, and you’re feeling pretty good about yourself. Then you try to build something a bit deeper: maybe ten, twenty, or a hundred layers. Suddenly, your model’s performance flatlines. The loss stops improving, or worse, the model starts outputting complete gibberish from the very first epoch. Welcome to the two gremlins that have haunted deep learning since its inception: vanishing and exploding gradients.
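To put rough numbers on those gremlins, here is a minimal sketch, assuming a toy "network" that is just a chain of one-unit layers sharing a single weight. The function names and the chosen weights are mine, purely for illustration. Backpropagation multiplies one local derivative per layer, so the product either shrinks toward zero or blows up as depth grows.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_chain_gradient(weight, depth, x=0.5):
    """|d(output)/d(input)| for `depth` stacked one-unit sigmoid layers (toy setup)."""
    activation = x
    grad = 1.0
    for _ in range(depth):
        activation = sigmoid(weight * activation)
        # Chain rule: multiply by this layer's local derivative,
        # d sigmoid(w * a_prev) / d a_prev = w * s * (1 - s).
        grad *= weight * activation * (1.0 - activation)
    return abs(grad)


def linear_chain_gradient(weight, depth):
    """Same chain without an activation: the gradient is simply weight ** depth."""
    return abs(weight) ** depth


for depth in (5, 20, 50):
    vanishing = sigmoid_chain_gradient(1.0, depth)
    exploding = linear_chain_gradient(1.5, depth)
    print(f"depth={depth:3d}  sigmoid chain: {vanishing:.3e}  linear chain (w=1.5): {exploding:.3e}")
```

The sigmoid's derivative never exceeds 0.25, so with a weight of 1 every extra layer shrinks the gradient by at least a factor of four. The linear chain with a weight of 1.5 does the opposite.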

14.6 Weight Initialization: Xavier, He, and Orthogonal

Right, let’s talk about the very first thing your network does before it even gets a chance to be smart: guessing. That’s all weight initialization is. You’re setting the starting values for the millions of parameters your model will spend the next however-long tweaking. Get this wrong, and you’re not just starting on the back foot; you’re starting in a different stadium, facing the wrong way. Think of it like this: if you initialize all your weights to zero, every neuron in a layer will calculate the exact same thing on the first forward pass. On the backward pass, they’ll all get the exact same gradient. They’ll all update in the exact same way. You don’t have a hundred neurons; you have one neuron with a hundred clones. It’s a spectacular waste of compute, and gradient descent on its own will never break the symmetry. So, we need to start with random values. But “random” is a big, scary universe. Do we use a uniform distribution between -1 and 1? A normal distribution? This is where the math nerds (bless them) come in to save us from ourselves.
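A small sketch of both points, using only the standard library. The helper names (`xavier_std`, `he_std`, `init_layer`, `forward`) are hypothetical, not any framework's API. The standard deviations are the commonly quoted normal-distribution variants: sqrt(2 / (fan_in + fan_out)) for Xavier (Glorot) and sqrt(2 / fan_in) for He.

```python
import math
import random


def xavier_std(fan_in, fan_out):
    """Standard deviation for Xavier (Glorot) normal initialization."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def he_std(fan_in):
    """Standard deviation for He normal initialization."""
    return math.sqrt(2.0 / fan_in)


def init_layer(fan_in, fan_out, std, rng):
    """One weight row per neuron, drawn from a zero-mean normal distribution."""
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


def forward(weights, x):
    """Pre-activation of every neuron in the layer (no bias, for brevity)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
x = [0.2, -0.4, 0.9]

zero_w = [[0.0] * 3 for _ in range(4)]
he_w = init_layer(fan_in=3, fan_out=4, std=he_std(3), rng=rng)

print("zero init:", forward(zero_w, x))  # four identical outputs: the clones
print("He init:  ", forward(he_w, x))    # four different outputs
```

With zero weights all four neurons produce the same number, which is the clone problem in miniature. Any random scheme fixes that. Xavier and He differ only in how they scale the randomness to the layer's size.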

14.5 Computational Graphs and Automatic Differentiation

Right, let’s get our hands dirty. You’ve probably heard the term “backpropagation” thrown around like a party favor at a machine learning conference. It’s the magical, mystical process that makes neural networks learn. But strip away the mystique, and what you find is a shockingly elegant and practical piece of computer science called automatic differentiation (autodiff), built on top of a computational graph. Think of a computational graph not as some terrifying abstract concept, but as a detailed recipe for your calculation. Every variable (ingredient) and operation (step) is a node, and the edges show the flow of data. We break a complex calculation into its tiniest, most fundamental steps. Why? Because it’s far easier to teach a computer how to compute the derivative of a + b once than it is to teach it the derivative of an entire monstrous loss function from scratch every time.
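Here is what that recipe idea can look like in code: a deliberately tiny reverse-mode autodiff sketch that knows only addition and multiplication. The `Node` class and `backward` function are illustrative names of my own, not a real library's interface, and a real framework handles far more than this.

```python
class Node:
    """One value in the computational graph, plus how it was produced."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuples of (parent_node, local_derivative)
        self.grad = 0.0

    def __add__(self, other):
        # d(a + b)/da = 1 and d(a + b)/db = 1
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        # d(a * b)/da = b and d(a * b)/db = a
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))


def backward(output):
    """Fill in .grad on every node that feeds into `output`."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)  # parents are appended before their children

    visit(output)
    output.grad = 1.0
    for node in reversed(order):  # walk from the output back to the inputs
        for parent, local in node.parents:
            parent.grad += node.grad * local  # chain rule, summed over paths


a, b, c = Node(2.0), Node(3.0), Node(4.0)
y = (a + b) * c + a  # `a` is used twice, so its gradient has two contributions
backward(y)
print(y.value, a.grad, b.grad, c.grad)  # 22.0 5.0 4.0 5.0
```

Each operation only has to know its own local derivative. The graph traversal does the rest, which is why the derivative of `a + b` only ever needs to be written once.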

14.4 Backpropagation: The Chain Rule at Scale

Right, so you’ve built your network, fed it some data, and… nothing happens. Or rather, something happens, but it’s catastrophically, hilariously wrong. Your model’s predictions are less “insightful AI” and more “random number generator with a drinking problem.” This is the moment. You can’t just shrug and hope it gets better. You need to tell it exactly how it messed up, and more importantly, which of its millions of knobs to tweak and by how much. That, my friend, is backpropagation. It’s not magic; it’s the chain rule from calculus, applied with a level of persistence that would make a debt collector blush.
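To make "the chain rule, applied persistently" concrete, here is a minimal sketch on the smallest network I could get away with: one input, one tanh hidden unit, one linear output, and a squared-error loss. The architecture, the variable names, and the sample numbers are all assumptions chosen for illustration. The hand-derived gradients are checked against a finite-difference estimate.

```python
import math


def forward(w1, w2, x):
    h = math.tanh(w1 * x)  # hidden activation
    y_hat = w2 * h         # network output
    return h, y_hat


def loss(w1, w2, x, y):
    _, y_hat = forward(w1, w2, x)
    return 0.5 * (y_hat - y) ** 2


def backprop(w1, w2, x, y):
    """Gradients of the loss with respect to w1 and w2, via the chain rule."""
    h, y_hat = forward(w1, w2, x)
    dL_dyhat = y_hat - y               # derivative of 0.5 * (y_hat - y)^2
    dL_dw2 = dL_dyhat * h              # y_hat = w2 * h
    dL_dh = dL_dyhat * w2              # pass the blame back to the hidden unit
    dL_dw1 = dL_dh * (1.0 - h * h) * x  # tanh'(z) = 1 - tanh(z)^2, and z = w1 * x
    return dL_dw1, dL_dw2


def numeric_grad(f, w, eps=1e-6):
    """Central finite difference, used only as a sanity check."""
    return (f(w + eps) - f(w - eps)) / (2.0 * eps)


w1, w2, x, y = 0.5, -1.2, 0.8, 0.3
g1, g2 = backprop(w1, w2, x, y)
n1 = numeric_grad(lambda w: loss(w, w2, x, y), w1)
n2 = numeric_grad(lambda w: loss(w1, w, x, y), w2)
print("analytic:", g1, g2)
print("numeric: ", n1, n2)
```

The two rows should agree to several decimal places. Each line of `backprop` is one link in the chain, telling one knob how much of the error is its fault.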

14.3 Multi-Layer Perceptrons: Universal Approximation Theorem

Right, so you’ve got your single neuron. It’s a plucky little thing, tries its best, but let’s be honest: drawing a single straight line through your data is about as effective as using a butter knife to perform brain surgery. Most interesting problems in the world aren’t linearly separable. They’re curvy, swirly, gloriously messy affairs. This is where we stop playing with kindergarten blocks and start building cathedrals. We stack neurons into layers with a nonlinear activation between them, and in doing so, we unlock the ability to approximate any continuous function on a bounded domain as closely as we like, provided the hidden layer is wide enough. This isn’t just hopeful thinking; it’s a proven result, formally known as the Universal Approximation Theorem. The catch: it tells you that such a network exists, not how to find its weights.
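As a rough illustration of the idea (not a proof, and not how networks are actually trained), the sketch below builds a one-hidden-layer ReLU network by hand so that it traces a piecewise-linear copy of a target function. The construction, the function names, and the choice of sin on [0, 2π] are my own assumptions. The point is only that more hidden units give a closer fit.

```python
import math


def relu(z):
    return max(0.0, z)


def build_relu_net(f, lo, hi, hidden_units):
    """Hand-set weights so the net linearly interpolates f at evenly spaced knots."""
    step = (hi - lo) / hidden_units
    knots = [lo + i * step for i in range(hidden_units + 1)]
    slopes = [(f(knots[i + 1]) - f(knots[i])) / step for i in range(hidden_units)]
    # Hidden unit i computes relu(x - knots[i]); its output weight is the
    # change in slope needed at that knot.
    out_weights = [slopes[0]] + [slopes[i] - slopes[i - 1]
                                 for i in range(1, hidden_units)]
    offset = f(lo)  # output bias

    def net(x):
        return offset + sum(w * relu(x - k) for w, k in zip(out_weights, knots))

    return net


def max_error(f, net, lo, hi, samples=1000):
    """Largest absolute gap between f and the net on an evenly spaced grid."""
    xs = (lo + (hi - lo) * i / samples for i in range(samples + 1))
    return max(abs(f(x) - net(x)) for x in xs)


lo, hi = 0.0, 2.0 * math.pi
for units in (4, 16, 64):
    net = build_relu_net(math.sin, lo, hi, units)
    print(f"{units:3d} hidden units -> max error {max_error(math.sin, net, lo, hi):.4f}")
```

A single neuron can only ever produce one straight line. A layer of them, each kinking the line at a different spot, can hug a curve as tightly as you can afford.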

14.2 The Perceptron and Its Limitations

Alright, let’s get our hands dirty with the perceptron. It’s the Lego brick of neural networks—the simplest possible building block you can have. The idea, dreamed up by Frank Rosenblatt in 1958, is almost childishly simple, which is precisely why it’s so brilliant. It’s a linear binary classifier. Fancy term, simple idea: it draws a straight line (or a plane, or a hyperplane if you’re feeling fancy) to separate two categories of things. Is this email spam or not? Is this image a cat or a dog? That’s its entire job description.
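A minimal sketch of the classic perceptron learning rule, assuming two binary inputs, a step activation, and toy truth-table data. The function names are illustrative. XOR is my choice of example for the "limitations" half of the story, since no single straight line separates its two classes.

```python
def predict(weights, bias, x):
    """Step activation: fire (1) if the weighted sum is positive, else 0."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=25, lr=1.0):
    """Perceptron rule: nudge the weights only when a prediction is wrong."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(weights, bias, x)  # -1, 0, or +1
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


def accuracy(samples, weights, bias):
    hits = sum(predict(weights, bias, x) == target for x, target in samples)
    return hits / len(samples)


AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_w, and_b = train_perceptron(AND_DATA)
xor_w, xor_b = train_perceptron(XOR_DATA)
print("AND accuracy:", accuracy(AND_DATA, and_w, and_b))
print("XOR accuracy:", accuracy(XOR_DATA, xor_w, xor_b))
```

AND is linearly separable, so the rule settles on a line that gets every row right. On XOR it keeps oscillating and always leaves at least one row wrong, no matter how many epochs you give it.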

14.1 The Biological Neuron and Its Mathematical Abstraction

Right, so you want to build a brain. Well, a pathetic, simplified, mathematical caricature of one. Don’t worry, that’s all we need. To do that, we first need to look at the biological blueprint: the neuron. It’s a fantastically complicated little beast, but we’re going to strip it down to its absolute essence for our purposes. Don’t @ me, neuroscientists; this is engineering, not a PhD thesis. The real star of the show is the synapse, the gap between neurons where the magic of learning actually happens. An electrical signal (the action potential) zooms down the axon of one neuron and triggers the release of neurotransmitters. These chemicals float across the synaptic gap and bind to receptors on the next neuron, which can either encourage it to fire (excite it) or discourage it (inhibit it). The strength of this connection isn’t fixed; it changes based on experience. This synaptic plasticity is the biological basis of learning, and its best-known account is Hebbian theory, usually summarized as “neurons that fire together, wire together.”
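The "fire together, wire together" slogan is often written as a one-line update rule: the change in a connection's weight is a learning rate times the activity on both sides of the synapse. Here is a minimal sketch of that textbook form. The function name, the learning rate, and the 0/1 activity values are assumptions for illustration, and real synapses are far messier.

```python
def hebbian_update(weight, pre, post, lr=0.1):
    """Strengthen the connection in proportion to joint pre/post activity."""
    return weight + lr * pre * post


# Two scenarios over ten time steps, both starting from a weight of zero.
together = 0.0  # both neurons active on every step
apart = 0.0     # the presynaptic neuron fires, the postsynaptic one stays silent
for _ in range(10):
    together = hebbian_update(together, pre=1.0, post=1.0)
    apart = hebbian_update(apart, pre=1.0, post=0.0)

print("fired together:", together)
print("fired apart:   ", apart)
```

The connection between neurons that are active at the same time grows step by step, while the other one never moves. This plain form has no way to weaken a weight, which is one of the reasons artificial networks ended up with a different learning rule.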