17.8 Attention Mechanism: The Precursor to Transformers

Alright, let’s talk about the elephant in the room. You’ve just spent all this mental energy wrapping your head around LSTMs and GRUs, these fantastically complex gates designed to solve the vanishing gradient problem and remember things for more than five seconds. And they work!… sort of. For shorter sequences, they’re brilliant. But ask an LSTM to read War and Peace and then summarize the plot based on a subtle hint from the first chapter, and it will, politely, have a stroke.

The core issue is the bottleneck problem. Think of the LSTM’s hidden state as its memory. It’s a single, fixed-size vector. At every timestep, it has to cram all the new information from the current input and the entire history of what it’s seen so far into this one, finite container. By the time it gets to the end of a long sentence or document, that initial, crucial piece of information has been squeezed, diluted, and compressed into near oblivion. It’s like trying to summarize your entire life story into a single tweet. You lose the nuance. You lose the details. You lose the good stuff.

This is where the attention mechanism waltzes in, not as a replacement for the LSTM, but as a brilliantly simple augment that makes it infinitely more powerful. The core idea is so intuitive you’ll kick yourself for not thinking of it: Instead of forcing the entire history into one compressed vector, why not let the model just look back at all the previous vectors whenever it needs to?

How Attention Actually Works: The Key, Query, Value Metaphor

Don’t let the fancy terms scare you. Imagine you have a set of notes (the Values V) from a lecture—a sequence of hidden states [h1, h2, h3, ..., hN] from your encoder RNN.

Now, you’re writing a summary (the decoder’s current task). To write the next word of your summary, you ask a specific question: “What was the main subject of the lecture?” This question is your Query Q (often the decoder’s current hidden state).

You then take your stack of notes and, for each individual note, you ask, “How relevant is this specific note to my question?” You do this by comparing your Query Q to a Key K derived from each note. In the simplest case, the Key is just the Value itself (K = V).

This comparison is usually done by calculating a simple dot product or using a small neural network, resulting in a score for each note. A high score means “This note is very relevant to your current question!” A low score means “Ignore this for now.”

You then run these scores through a softmax function. This converts them into a set of attention weights that sum to 1. These weights are a probability distribution over your notes. Finally, you create a context vector by taking a weighted sum of all your Values V, using these attention weights.

The magic is in that context vector. It’s no longer a generic summary; it’s a specific, focused summary that is dynamically crafted to help answer the exact question the decoder is asking at this very moment.

Here’s a bare-bones, from-scratch implementation in PyTorch to make it concrete. We’ll create a SimpleAttention module.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleAttention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        # Often, we project the keys and queries for better flexibility
        self.key_proj = nn.Linear(hidden_dim, hidden_dim)
        self.query_proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, query, keys, values):
        """
        query: [batch_size, hidden_dim] (decoder's current state)
        keys: [seq_len, batch_size, hidden_dim] (all encoder hidden states)
        values: [seq_len, batch_size, hidden_dim] (usually same as keys)
        """
        # Project the query and keys for a better comparison
        proj_query = self.query_proj(query).unsqueeze(2)  # [batch_size, hidden_dim, 1]
        proj_keys = self.key_proj(keys)  # [seq_len, batch_size, hidden_dim]

        # Swap dimensions for batch matrix multiplication
        proj_keys = proj_keys.permute(1, 2, 0)  # [batch_size, hidden_dim, seq_len]

        # Calculate attention scores: a dot product for each key
        # [batch_size, 1, seq_len]
        scores = torch.bmm(proj_query.transpose(1, 2), proj_keys)
        attention_weights = F.softmax(scores, dim=-1)  # [batch_size, 1, seq_len]

        # Prepare values for multiplication
        values = values.permute(1, 0, 2)  # [batch_size, seq_len, hidden_dim]

        # Calculate the context vector as weighted sum of values
        context = torch.bmm(attention_weights, values)  # [batch_size, 1, hidden_dim]
        return context.squeeze(1), attention_weights.squeeze(1)  # [batch_size, hidden_dim], [batch_size, seq_len]

# Example usage:
batch_size = 2
seq_len = 10
hidden_dim = 16

# Your encoder's output (all hidden states)
encoder_states = torch.randn(seq_len, batch_size, hidden_dim)
# Your decoder's current hidden state
decoder_state = torch.randn(batch_size, hidden_dim)

attn_layer = SimpleAttention(hidden_dim)
context_vector, attn_weights = attn_layer(decoder_state, encoder_states, encoder_states)

print(f"Context vector shape: {context_vector.shape}") # [2, 16]
print(f"Attention weights shape: {attn_weights.shape}") # [2, 10] - a weight for each input element for each batch

Why This is a Game-Changer and The Path to Transformers

This mechanism solves the bottleneck problem elegantly. The decoder now has direct, weighted access to every single encoder state at every decoding step. The fixed-length vector bottleneck is gone. The model can learn to align words across languages (e.g., “chat” in French with “cat” in English) or find relevant information from distant parts of the input, which was the LSTM’s Achilles’ heel.

But here’s the kicker, and the designers’ next questionable-turned-genius choice: if attention is this good at figuring out relationships within a sequence, do we even need the recurrent part anymore?

Seriously. Why have all the complex, sequential, impossible-to-parallelize LSTM gates if a stack of these simple, highly parallelizable attention mechanisms can do the job better? This exact line of questioning is what led directly to the Transformer architecture, which threw out recurrence altogether and went all-in on attention. The “Attention Is All You Need” paper wasn’t a boast; it was a conclusion.

A common pitfall? Assuming attention is a silver bullet. It adds computational and memory complexity that scales with the square of the sequence length (O(seq_len^2)). For very long sequences, this becomes prohibitively expensive, a problem later solved by more efficient variants like sparse attention. But for most tasks, it’s the single most important upgrade you can give a classic seq2seq model, and understanding it is the key to unlocking everything that came after.