18.1 Attention Is All You Need: The Paper That Changed Everything
Right, let’s talk about the paper that dropped in 2017 and promptly broke the collective brain of the entire NLP field. It was called “Attention Is All You Need,” which is a fantastically audacious title. The authors weren’t wrong. Before this, we were all meticulously building recurrent networks (RNNs, LSTMs) and convolutional networks (CNNs) for language, carefully stacking them like Jenga towers that were always on the verge of collapsing from vanishing gradients or taking a geological age to train.
The core insight, the big one, is that recurrence is slow because it’s sequential. You can’t compute step N+1 until step N is done. That’s a nightmare for modern parallel processors (GPUs/TPUs), which, as the name suggests, love to do things in parallel. The Transformer architecture said, “Forget that. Let’s just look at the entire sequence at once.” And the mechanism that lets us do that? You guessed it: self-attention.
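To make the sequential bottleneck tangible, here is a minimal sketch contrasting the two styles of computation. The weight matrices `W_h` and `W_x` and the toy sizes are made up for illustration; this is not a real RNN or a real Transformer, just the shape of the problem.

```python
import torch

torch.manual_seed(0)
seq_len, d = 6, 4
x = torch.randn(seq_len, d)      # one sequence of 6 token vectors
W_h = torch.randn(d, d) * 0.1    # hypothetical recurrent weights
W_x = torch.randn(d, d) * 0.1    # hypothetical input weights

# Recurrent style: each hidden state depends on the previous one,
# so step t+1 cannot start until step t has finished.
h = torch.zeros(d)
rnn_states = []
for t in range(seq_len):
    h = torch.tanh(h @ W_h + x[t] @ W_x)
    rnn_states.append(h)
rnn_states = torch.stack(rnn_states)   # (seq_len, d)

# Attention style: every pairwise token interaction falls out of a
# single matrix product, with no dependency between positions.
pairwise = x @ x.T                     # (seq_len, seq_len)
print(rnn_states.shape, pairwise.shape)
```

The loop has to run its six steps in order; the matrix product hands all 36 pairwise scores to the hardware in one go.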
The Self-Attention Mechanism, Demystified
Don’t let the math scare you. At its heart, self-attention is a glorified (and incredibly powerful) way of asking, “For every word in this sentence, which other words matter most to you?” It’s not just looking at the word itself; it’s understanding its context by measuring its affinity with every other word.
Here’s the magic trick: we take our input embeddings (fancy numbers representing words) and project them into three new vectors for each word: a Query, a Key, and a Value. Think of it like this:
- The Query is like the question a word is asking: “Who’s relevant to me right now?”
- The Key is what each word holds up as an answer: “This is what I’m about.”
- The Value is the actual information that gets passed forward if a word is deemed important.
We take the dot product of the Query of one word with the Keys of all words (including itself) to get a score for each pair. A high score means a strong relationship. We then scale these scores and normalize them into a probability distribution (using softmax, because of course we are), and use that distribution to form a weighted sum of all the Value vectors. The result? A new, context-aware representation for every word.
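For the record, the whole recipe fits in one line. This is the scaled dot-product attention formula from the original paper, where the rows of Q, K, and V are the per-word Query, Key, and Value vectors and d_k is the dimension of the Key vectors; the division by the square root of d_k is a scaling step applied before the softmax:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```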
Let’s make this painfully concrete with a tiny example. We’ll skip the linear projections for clarity.
import torch
import torch.nn.functional as F
# A sentence with 3 words (sequence length 3) and an embedding size of 4
embeddings = torch.tensor([[1.0, 0.0, 1.0, 0.0],   # Word 1
                           [0.0, 2.0, 0.0, 2.0],   # Word 2
                           [1.0, 1.0, 1.0, 1.0]])  # Word 3
# In reality, these would be learned linear transformations of the embeddings.
# For this demo, let's just use the embeddings as our Queries, Keys, and Values.
queries = keys = values = embeddings
# Step 1: Compute attention scores (dot product of queries and keys)
scores = queries @ keys.T # @ is Python's matrix-multiplication operator, which PyTorch tensors support
print("Raw Scores:\n", scores)
# Step 2: Scale and apply softmax. Dividing by sqrt(d_k) keeps large dot products
# from saturating the softmax, where gradients become extremely small.
d_k = keys.size(-1) # dimension of key vectors, which is 4
scaled_scores = scores / (d_k ** 0.5)
attention_weights = F.softmax(scaled_scores, dim=-1)
print("\nAttention Weights:\n", attention_weights)
# Step 3: Weighted sum of values using the attention weights
output = attention_weights @ values
print("\nOutput (Contextualized Embeddings):\n", output)
This output isn’t just a representation of each word in isolation; it’s each word enriched by the context of every other word. And crucially, every step here is a matrix operation that can be parallelized over the entire sequence, with no step waiting on the previous position. This is a large part of why Transformers train so much faster than RNNs on parallel hardware.
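One detail from the example deserves a closer look: the division by sqrt(d_k). Here is a rough sketch of what it buys you, assuming random unit-variance queries and keys and a key dimension of 512 (both arbitrary choices for the demo). Without scaling, the dot products spread out roughly in proportion to sqrt(d_k), and the softmax tends to put almost all its weight on one key.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k = 512
q = torch.randn(d_k)          # one query with unit-variance entries
keys = torch.randn(10, d_k)   # ten keys drawn the same way

raw = keys @ q                                  # unscaled dot products
unscaled = F.softmax(raw, dim=-1)               # tends toward one-hot
scaled = F.softmax(raw / d_k ** 0.5, dim=-1)    # noticeably softer

print("max weight without scaling:", unscaled.max().item())
print("max weight with scaling:   ", scaled.max().item())
```

A near one-hot softmax is exactly the regime where gradients get tiny, which is the problem the scaling is there to avoid.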
Multi-Head Attention: The Committee of Experts
Now, here’s where the designers were genuinely clever. Using just one self-attention process might be limiting. A word can have different kinds of relationships simultaneously. For example, in “The animal didn’t cross the street because it was too tired,” the word “it” has a strong coreference relationship with “animal” (the thing it refers to) and a strong semantic relationship with “tired” (the state being described).
Multi-head attention solves this. Instead of one set of Q, K, V projections, we have multiple (h heads). Each head learns to project the original embeddings into a different subspace, essentially focusing on a different type of relationship. It’s like having a committee of attention experts, each looking at the sentence from a different angle.
# A simplified sketch: self-attention only, with no masking or dropout.
class MultiHeadAttention(torch.nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        # The heads split d_model evenly, so it must divide cleanly
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # Dimension of each head's key/query/value
        # Linear projection layers for Q, K, V, and the final output
        self.wq = torch.nn.Linear(d_model, d_model)
        self.wk = torch.nn.Linear(d_model, d_model)
        self.wv = torch.nn.Linear(d_model, d_model)
        self.wo = torch.nn.Linear(d_model, d_model)

    def forward(self, x):
        batch_size, seq_len, d_model = x.size()
        # Project inputs to Q, K, V
        Q = self.wq(x)  # Shape: (batch_size, seq_len, d_model)
        K = self.wk(x)
        V = self.wv(x)
        # Reshape to add a head dimension. Now it's (batch_size, num_heads, seq_len, d_k)
        Q = Q.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = K.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = V.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        # Compute attention for each head (this is the scaled dot-product from before)
        scores = (Q @ K.transpose(-2, -1)) / (self.d_k ** 0.5)
        attn_weights = F.softmax(scores, dim=-1)
        context = attn_weights @ V  # (batch_size, num_heads, seq_len, d_k)
        # Concatenate heads and put back through final linear layer
        context = context.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)
        return self.wo(context)
# Example usage
d_model = 512
num_heads = 8
mha = MultiHeadAttention(d_model, num_heads)
x = torch.randn(1, 10, d_model) # Batch of 1, sequence of 10 tokens, 512-dim embeddings
output = mha(x)
print(output.shape) # torch.Size([1, 10, 512]) - same shape as input, but now context-rich!
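In practice you would rarely hand-roll this. PyTorch ships a built-in `torch.nn.MultiheadAttention` module, and the sketch below shows roughly how the same call looks with it. The sizes mirror the example above; `batch_first=True` is set so the input layout is (batch, sequence, embedding). Its weights are initialized independently, so do not expect numerically identical outputs to the hand-written class.

```python
import torch

d_model, num_heads = 512, 8
builtin_mha = torch.nn.MultiheadAttention(d_model, num_heads, batch_first=True)

x = torch.randn(1, 10, d_model)   # batch of 1, 10 tokens, 512-dim embeddings
# Self-attention: the same tensor serves as query, key, and value
out, weights = builtin_mha(x, x, x)

print(out.shape)      # torch.Size([1, 10, 512])
print(weights.shape)  # torch.Size([1, 10, 10]), averaged over heads by default
```

The second return value is the attention-weight matrix, which is handy for peeking at which tokens attend to which.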
The Encoder-Decoder Dance and The Missing Piece: Positional Encoding
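Here’s the first rough edge you’ll hit. Self-attention on its own is permutation equivariant: shuffle the input tokens and you get the same output vectors back, just shuffled the same way. In other words, it treats a sequence as a bag of words. Without position information, every word in “The dog bit the man” would get exactly the same representation as it does in “The man bit the dog,” which is… problematic. RNNs got word order for free through their sequential nature.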
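A quick way to convince yourself of this order-blindness is the following minimal sketch. It assumes Q = K = V = the raw embeddings (no learned projections, no positional information), and `self_attention` is a hypothetical helper written just for this demo.

```python
import torch
import torch.nn.functional as F

def self_attention(x):
    # Scaled dot-product self-attention with Q = K = V = x
    scores = x @ x.T / (x.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ x

torch.manual_seed(0)
x = torch.randn(5, 8)                  # 5 tokens, 8-dim embeddings
perm = torch.tensor([4, 2, 0, 3, 1])   # an arbitrary reordering of the tokens

out_original = self_attention(x)
out_shuffled = self_attention(x[perm])

# Shuffling the input only shuffles the output rows in the same way:
# each token ends up with the same vector wherever it sits in the sequence.
same_up_to_order = torch.allclose(out_shuffled, out_original[perm], atol=1e-5)
print(same_up_to_order)
```

If the check comes back true, the mechanism has no idea which token came first, which is exactly the gap that position information has to fill.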
The Transformer’s solution is a hack, but a brilliantly effective one: positional encoding. We manually inject information about the absolute (and sometimes relative) position of each token into its embeddings before feeding them into the attention mechanism. The original paper used a fixed, sinusoidal function. Nowadays, learned positional embeddings are just as common. It feels like an afterthought, but it’s utterly essential. Never forget to add this.
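Here is a sketch of the fixed sinusoidal scheme, following the formula from the original paper: even dimensions get sin(pos / 10000^(2i/d_model)) and odd dimensions get the matching cosine. The function name, the toy sizes, and the random stand-in embeddings are assumptions for illustration, and the code assumes `d_model` is even.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    # Assumes d_model is even so sine and cosine columns pair up
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    # Frequencies 1 / 10000^(2i / d_model) for i = 0 .. d_model/2 - 1
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe

token_embeddings = torch.randn(10, 16)   # hypothetical: 10 tokens, 16-dim embeddings
pe = sinusoidal_positional_encoding(10, 16)
x = token_embeddings + pe                # positions are added, not concatenated
print(x.shape)
```

Because the encoding is simply added to the embeddings, the attention layers themselves stay unchanged; they just see inputs that now differ by position.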
The full Transformer architecture is an encoder-decoder model, perfect for sequence-to-sequence tasks like translation. The encoder uses self-attention to build a rich understanding of the input sentence. Each decoder layer then uses two attention sub-layers: first a masked self-attention layer over the output generated so far, where the mask prevents it from cheating by looking at future words during training, and then a cross-attention layer that attends to the encoder’s output (the classic encoder-decoder attention).
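The masking trick is simpler than it sounds. In this minimal sketch, the random `scores` tensor stands in for the scaled Query-Key scores of a 4-token sequence (an assumption for the demo); future positions are set to negative infinity before the softmax, so they end up with zero weight.

```python
import torch
import torch.nn.functional as F

seq_len = 4
torch.manual_seed(0)
scores = torch.randn(seq_len, seq_len)   # stand-in for scaled Q @ K.T scores

# True above the diagonal marks "future" positions a token must not see
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
masked_scores = scores.masked_fill(future, float("-inf"))

# softmax turns -inf into zero weight, so each row only covers
# positions up to and including its own
causal_weights = F.softmax(masked_scores, dim=-1)
print(causal_weights)
```

The first token can only attend to itself, the second to the first two, and so on, which is what lets the whole target sequence be trained in parallel without leaking the answers.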
The real legacy of “Attention Is All You Need” isn’t that it made RNNs obsolete overnight. It’s that it provided a scalable, parallelizable, and astonishingly powerful foundation. We quickly realized the encoder alone (BERT) or the decoder alone (GPT) were ridiculously powerful on their own. But it all started here, with this one beautifully simple and outrageously ambitious idea.