18.9 Efficient Transformers: Sparse Attention, Linear Attention, Flash Attention

Alright, let’s pull back the curtain on one of the biggest open secrets in modern machine learning: the standard Transformer’s attention mechanism is a computational monster. Both its compute and its memory scale with the square of the sequence length (O(n²)), which is the technical way of saying “it gets stupidly slow and memory-hungry the moment you try to do anything interesting.” Double the sequence length and the attention matrix quadruples. Trying to process a long document or a high-resolution image? Forget about it. Your GPU will wave a little white flag and give up.
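To put rough numbers on that O(n²), here is a back-of-the-envelope sketch. It counts only the n × n matrix of attention scores, and it assumes 32-bit floats, a single head, a single layer, and a batch of one. The function name and defaults are made up for illustration; real memory use depends on the implementation.

```python
def attention_scores_bytes(seq_len, bytes_per_value=4, num_heads=1):
    """Bytes needed to hold the full seq_len x seq_len attention score matrix.

    Assumption: 4 bytes per value (float32) and one head unless told otherwise.
    """
    return seq_len * seq_len * bytes_per_value * num_heads


for n in (1_024, 8_192, 65_536):
    mib = attention_scores_bytes(n) / 2**20
    print(f"n = {n:>6}: {mib:,.0f} MiB for one score matrix")
```

Under those assumptions, 1,024 tokens cost 4 MiB, 8,192 tokens cost 256 MiB, and 65,536 tokens cost 16 GiB, and that is before multiplying by heads, layers, and batch size.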

18.8 GPT: Autoregressive Decoder-Only Pre-Training

Right, so you’ve heard the hype. “GPT changed everything!” It did, but not by inventing some alien technology. It took the core Transformer block we just talked about and made one brutally simple, wildly effective architectural choice: it threw away the encoder. That’s it. That’s the big secret. All those GPT models—GPT-2, GPT-3, the one you’re probably using to get summaries of this book—are just a stack of Transformer decoder blocks, with one small but critical tweak.

18.7 BERT: Bidirectional Encoder Pre-Training

Right, so you’ve heard of Transformers. You’ve seen the diagrams with all the “Attention” arrows pointing everywhere like a conspiracy theorist’s bulletin board. But BERT? BERT is the one that actually read the manual. While left-to-right language models were busy staring in one direction like they were reading a particularly dull novel, BERT had a brilliant, simple idea: maybe words are defined by the words on both sides of them. You know, like in every human conversation ever.

18.6 The Decoder Stack: Masked Attention + Cross-Attention

Right, so you’ve made it past the encoder. Good. That was the warm-up. Now we get to the real party trick of the Transformer: the decoder. This is where the model actually becomes a generative model, where it takes all that juicy contextual understanding from the encoder and uses it to produce something new, one token at a time. It’s a beautiful, slightly unhinged process of creative constraint. The decoder stack looks suspiciously like the encoder stack (it’s built from layers of self-attention and feed-forward networks), but it has two absolutely critical modifications: masked self-attention, which stops each position from peeking at the tokens that come after it, and cross-attention, which lets the decoder read the encoder’s output. The mask is what prevents cheating. And I mean really prevents it. Because if the decoder could cheat, it would be useless.
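Here is a minimal sketch of the masking half of that story, in plain Python. It assumes the usual causal rule (position i may attend to positions 0 through i) and applies it by setting forbidden scores to negative infinity before the softmax. The function names and the all-zero toy scores are made up for illustration. Cross-attention is not shown.

```python
import math


def causal_mask(n):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask):
    """Row-wise softmax in which disallowed positions get exactly zero weight."""
    weights = []
    for row, allowed in zip(scores, mask):
        masked = [s if ok else -math.inf for s, ok in zip(row, allowed)]
        top = max(masked)  # finite, because the diagonal is always allowed
        exps = [math.exp(s - top) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy scores: every pair of positions is equally "interesting".
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_softmax(scores, causal_mask(4))
for row in weights:
    print(row)
```

With equal scores, the first position can only look at itself, the second splits its attention over two positions, and so on. Everything above the diagonal is zero, so no position can see the future.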

18.5 The Encoder Stack: Self-Attention + FFN + LayerNorm

Right, so you’ve got your input embeddings and you’ve added positional encoding. Now the real party starts: the Encoder Stack. This isn’t just one layer; it’s a series of identical layers stacked on top of each other. And each one is a beautifully engineered little machine with two main workhorses and one crucial piece of organizational glue: Self-Attention, a Feed-Forward Network (FFN), and Layer Normalization. Don’t let the simplicity fool you—this is where the magic of context gets woven into your data.
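As a rough sketch of how those pieces fit together, the code below wires one encoder layer in the post-norm arrangement of the original paper: each sublayer’s output is added back to its input and then layer-normalized. The residual add, the parameter-free `layer_norm`, and the two stand-in sublayers (`uniform_attention`, `relu_ffn`) are assumptions made to keep the example tiny. A real layer uses learned attention, a two-layer FFN, and a learned gain and bias in the normalization.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization, row by row."""
    return [layer_norm([a + b for a, b in zip(x_row, s_row)])
            for x_row, s_row in zip(x, sublayer_out)]


def encoder_layer(x, self_attention, ffn):
    """One encoder layer: attention sublayer, then position-wise FFN sublayer."""
    x = add_and_norm(x, self_attention(x))
    x = add_and_norm(x, [ffn(row) for row in x])
    return x


# Hypothetical stand-ins so the sketch runs without any learned weights.
def uniform_attention(x):
    """Pretend every position attends equally to every position."""
    mean_row = [sum(col) / len(x) for col in zip(*x)]
    return [mean_row[:] for _ in x]


def relu_ffn(vec):
    """A stand-in for the feed-forward network: just a ReLU."""
    return [max(0.0, v) for v in vec]


x = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]
out = encoder_layer(x, uniform_attention, relu_ffn)
print(out)
```

The point is the shape of the thing: the output has the same dimensions as the input, which is exactly what lets you stack these layers on top of each other.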

18.4 Positional Encoding: Fixed and Learned

Right, so we’ve got these fancy word embeddings now. Your sequence of words is a tidy stack of vectors, each representing a word’s meaning in a high-dimensional space. Neat, but there’s a colossal problem: our model is, for all intents and purposes, a fancy bag-of-words. Without position information, “dog bites man” and “man bites dog” hand the model exactly the same set of vectors, and self-attention has no way to tell the two orderings apart. That’s a deal-breaker for understanding language, where order is, you know, the entire point.
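A tiny experiment makes the problem concrete. The sketch below uses made-up 4-dimensional embeddings for three words and compares the two sentences as unordered collections of vectors, first without and then with the fixed sinusoidal encoding from the original Transformer paper. The embedding values and helper names are invented for illustration. Learned positional embeddings are not shown.

```python
import math


def sinusoidal_encoding(position, d_model):
    """Fixed encoding: sin on even dimensions, cos on odd, sharing a frequency per pair."""
    enc = []
    for k in range(d_model):
        angle = position / (10000 ** ((k - k % 2) / d_model))
        enc.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return enc


# Hypothetical toy embeddings, one vector per word.
embeddings = {
    "dog": [1.0, 0.0, 0.0, 0.0],
    "bites": [0.0, 1.0, 0.0, 0.0],
    "man": [0.0, 0.0, 1.0, 0.0],
}


def encode(tokens, use_positions):
    """Look up each token and optionally add its positional encoding."""
    rows = []
    for pos, tok in enumerate(tokens):
        vec = embeddings[tok]
        if use_positions:
            pe = sinusoidal_encoding(pos, len(vec))
            vec = [e + p for e, p in zip(vec, pe)]
        rows.append(vec)
    return rows


a, b = ["dog", "bites", "man"], ["man", "bites", "dog"]

# Sorting the rows throws away order, leaving only "which vectors are present".
same_without = sorted(encode(a, False)) == sorted(encode(b, False))
same_with = sorted(encode(a, True)) == sorted(encode(b, True))
print("identical without positions:", same_without)
print("identical with positions:   ", same_with)
```

Without positions the two sentences are the same bag of vectors. Once each vector carries its position, “dog” at the start and “dog” at the end are no longer the same input.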

18.3 Multi-Head Attention: Attending to Multiple Representation Subspaces

Right, so we’ve established that self-attention is the magic trick that lets every word in a sequence have a little meeting with every other word to figure out how much they should care about each other. But if that’s all we had, it would be a bit of a blunt instrument. It’s like only having one tool in your workshop—a hammer. Sure, you can attend to everything, but you’re probably going to treat every relationship like a nail.
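One way to picture the fix is to give each head its own slice of the feature dimension and let it run attention there independently. The sketch below does exactly that and nothing more. It deliberately leaves out the learned projection matrices that a real multi-head layer applies before and after the heads, and the function names are made up for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    top = max(row)
    exps = [math.exp(s - top) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention on lists of row vectors."""
    d_k = len(q[0])
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k)
                  for k_row in k]
        w = softmax(scores)
        out.append([sum(wi * v_row[j] for wi, v_row in zip(w, v))
                    for j in range(len(v[0]))])
    return out


def split_heads(x, num_heads):
    """Cut each row into num_heads equal slices: one sub-sequence per head."""
    if len(x[0]) % num_heads != 0:
        raise ValueError("feature size must be divisible by num_heads")
    d_head = len(x[0]) // num_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in x]
            for h in range(num_heads)]


def multi_head_self_attention(x, num_heads):
    """Run attention per head, then concatenate the head outputs per position."""
    heads = [attention(part, part, part) for part in split_heads(x, num_heads)]
    return [[val for head in heads for val in head[t]] for t in range(len(x))]


x = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0, 0.0]]
out = multi_head_self_attention(x, num_heads=2)
print(out)
```

Each head only ever sees its own two features, so the two heads are free to come up with different attention patterns over the same three positions. That is the “more than one tool in the workshop” idea in miniature.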

18.2 Scaled Dot-Product Attention

Alright, let’s get our hands dirty with the star of the show: Scaled Dot-Product Attention. If the Transformer architecture is a party, this is the charismatic host who introduces everyone to each other and decides who gets to have a meaningful conversation. It’s the core mechanism that allows the model to dynamically focus on different parts of the input sequence. And despite the fancy name, its guts are just a few matrix multiplications and a softmax. Don’t let anyone tell you otherwise.
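To back up the “few matrix multiplications and a softmax” claim, here is a minimal sketch in plain Python, with lists of lists standing in for matrices. It follows the standard formula softmax(QKᵀ / √d_k)V. The helper names and the toy Q, K, V values are made up for illustration, and there is no batching, masking, or dropout.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    top = max(row)
    exps = [math.exp(s - top) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q, k, v):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))  # how well each query matches each key
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights  # weighted average of the values


# Toy inputs: two positions, two features each.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[10.0, 0.0], [0.0, 10.0]]

out, weights = scaled_dot_product_attention(q, k, v)
print(weights)
print(out)
```

Each row of `weights` sums to 1, and each query leans toward the key it matches best, so the first output row is pulled toward the first value vector and the second toward the second.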

18.1 Attention Is All You Need: The Paper That Changed Everything

Right, let’s talk about the paper that dropped in 2017 and promptly broke the collective brain of the entire NLP field. It was called “Attention Is All You Need,” which is a fantastically audacious title. The authors weren’t wrong. Before this, we were all meticulously building recurrent networks (RNNs, LSTMs) and convolutional networks (CNNs) for language, carefully stacking them like Jenga towers that were always on the verge of collapsing from vanishing gradients or just taking a geological age to train.