39.3 Attention Mechanism in Seq2Seq Models

Right, so here we are. You’ve built your first sequence-to-sequence model, probably an LSTM-based encoder-decoder. It works. Sort of. You feed it a sentence, it encodes the whole thing into a single, fixed-size vector—a “thought vector,” if you will—and then the decoder has to spit out a perfect translation based solely on that compressed memory.

And you quickly realize this is a bit of a nightmare. That one vector becomes an information bottleneck. The poor decoder, especially when dealing with long sequences, is like a student trying to write a perfect essay based on a single, crammed note they wrote two weeks ago. It forgets the beginning of the sentence by the time it gets to the end. The translations for long inputs become vague, generic, and frankly, a bit useless.

This is the problem the attention mechanism solves, and it’s so brilliantly simple in concept you’ll kick yourself for not thinking of it first. Instead of forcing the encoder to cram the entire meaning of the source sentence into one final hidden state, we let the encoder pass along all of its hidden states. Then, at every single step of the decoding process, we allow the decoder to “attend” to—or, focus on—the most relevant parts of the input sequence.

Think of it like a human translator. They don’t read the entire German document, forget it, and then write the entire English translation from memory. They constantly glance back at the source text, focusing on the specific word or phrase they’re translating right now. Attention gives our model a way to “glance back.”

How Attention Actually Works: The Three-Step Dance

The process for generating each output word involves a tiny three-step dance between the decoder and the encoder’s hidden states.

1. Score All the Things: First, for this decoder step, we need to figure out which encoder hidden states are most relevant. We take the decoder’s current hidden state (let’s call it s_t) and compare it to every single encoder hidden state (h_1, h_2, … h_n) using a scoring function (e.g., a simple dot product or a small neural network). This gives us one score per input position, a measure of its relevance right now.
2. Softmax = Weights: We run these scores through a softmax. This converts them into a set of non-negative attention weights that sum to 1. A weight of 0.8 for h_5 means “at this moment, put 80% of your attention on the fifth input position.”
3. Compute the Context Vector: We multiply each encoder hidden state by its attention weight and sum them all together. This produces a new vector called the context vector. This isn’t a random vector; it’s a weighted average of the encoder states, a summary of the most relevant parts of the input for generating the next output word.
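To see the three steps on numbers small enough to check by hand, here is a minimal sketch in plain Python. The two-dimensional vectors and the helper names (dot, softmax, attend) are made up for illustration; a real model would use learned, much larger hidden states.

```python
import math

def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(s_t, encoder_states):
    # Step 1: one score per encoder state (plain dot product here)
    scores = [dot(s_t, h) for h in encoder_states]
    # Step 2: scores -> weights that sum to 1
    weights = softmax(scores)
    # Step 3: weighted sum of encoder states, one output dimension at a time
    context = [
        sum(w * h[i] for w, h in zip(weights, encoder_states))
        for i in range(len(s_t))
    ]
    return scores, weights, context

# Toy values, chosen only so the arithmetic is easy to follow
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # h_1, h_2, h_3
s_t = [2.0, 0.0]                                        # decoder state

scores, weights, context = attend(s_t, encoder_states)
print("scores: ", scores)
print("weights:", [round(w, 3) for w in weights])
print("context:", [round(c, 3) for c in context])
```

The decoder state points along the first dimension, so h_1 and h_3 tie for the highest score and split most of the weight between them, while h_2 gets only a sliver.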

This context vector then gets fed into the decoder alongside its previous hidden state and its previous output word to help it make an informed prediction. And we repeat this elegant little dance for every single word we generate.

Here’s a simplified, conceptual code block to make this concrete. We’ll use a basic scaled dot-product attention on a single, unbatched example.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DotProductAttention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        # Scale by 1/sqrt(hidden_dim) so the dot products don't grow with the
        # dimension and push the softmax into a near one-hot, low-gradient regime
        self.scale = 1.0 / (hidden_dim ** 0.5)

    def forward(self, decoder_state, encoder_states):
        """
        decoder_state: [1, hidden_dim]  (current decoder hidden state, s_t)
        encoder_states: [src_len, hidden_dim] (all encoder hidden states)

        Returns: context vector, attention_weights
        """
        # Calculate scores for each encoder state. Simple dot product.
        # decoder_state: [1, hidden_dim] -> [hidden_dim, 1] for the matmul
        scores = torch.matmul(encoder_states, decoder_state.view(-1, 1)) # [src_len, 1]
        scores = scores * self.scale

        # Convert scores to weights using softmax
        attention_weights = F.softmax(scores, dim=0) # [src_len, 1]

        # Calculate the context vector as weighted sum of encoder states
        context_vector = (attention_weights * encoder_states).sum(dim=0) # [hidden_dim]

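        # squeeze(1) drops only the trailing dim, so the result stays [src_len]
        # even when src_len == 1 (a bare squeeze() would collapse it to a scalar)
        return context_vector, attention_weights.squeeze(1)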
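How the context vector then gets used is a design choice this section doesn’t pin down. One common option is to concatenate it with the previous output word’s embedding and feed the pair into the recurrent cell. The sketch below assumes that wiring and a GRU cell; the class name AttnDecoderStep and the sizes are hypothetical, and the context vector is a random stand-in for whatever an attention module would return.

```python
import torch
import torch.nn as nn

class AttnDecoderStep(nn.Module):
    """Hypothetical single decoder step that consumes a context vector."""

    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # The cell input is [previous word embedding ; context vector]
        self.cell = nn.GRUCell(emb_dim + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_token, prev_hidden, context_vector):
        # prev_token:     [1]              (index of the previous output word)
        # prev_hidden:    [1, hidden_dim]  (previous decoder hidden state)
        # context_vector: [hidden_dim]     (output of the attention step)
        emb = self.embed(prev_token)                          # [1, emb_dim]
        rnn_in = torch.cat([emb, context_vector.unsqueeze(0)], dim=1)
        hidden = self.cell(rnn_in, prev_hidden)               # [1, hidden_dim]
        logits = self.out(hidden)                             # [1, vocab_size]
        return logits, hidden

torch.manual_seed(0)
step = AttnDecoderStep(vocab_size=10, emb_dim=4, hidden_dim=8)
prev_token = torch.tensor([3])
prev_hidden = torch.zeros(1, 8)
context_vector = torch.randn(8)  # stand-in for a real attention output

logits, hidden = step(prev_token, prev_hidden, context_vector)
print(logits.shape, hidden.shape)
```

In a full decoding loop, the returned hidden state would be used to compute the next step’s attention scores, and the argmax (or a sample) of the logits would become the next prev_token.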

Why This Is a Game-Changer

The magic here isn’t just the slightly higher BLEU score. It’s the interpretability. With attention, your model suddenly develops a mind you can peer into. After you run a translation, you can plot the attention weights. You’ll get a matrix where rows are output words and columns are input words, and in it you’ll see the model learning alignment. When it outputs “apple,” the highest weights will typically be on “pomme.” It also picks up on word-order differences, like attending to the verb at the end of a German clause when outputting the verb much earlier in the English sentence.
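Here is a rough sketch of reading such a matrix without a plotting library. The weights are invented for illustration and are not the output of any trained model; the function name hard_alignment is also just a placeholder.

```python
# Toy attention matrix: one row per output word, one column per input word.
# Values are made up for illustration, not taken from a real model.
src = ["je", "mange", "une", "pomme"]
tgt = ["I", "eat", "an", "apple"]
attn = [
    [0.85, 0.05, 0.05, 0.05],
    [0.10, 0.80, 0.05, 0.05],
    [0.05, 0.05, 0.70, 0.20],
    [0.02, 0.03, 0.15, 0.80],
]

def hard_alignment(attn, src, tgt):
    """For each output word, pick the input word with the largest weight."""
    pairs = []
    for out_word, row in zip(tgt, attn):
        best = max(range(len(row)), key=row.__getitem__)
        pairs.append((out_word, src[best], row[best]))
    return pairs

alignment = hard_alignment(attn, src, tgt)
for out_word, src_word, weight in alignment:
    print(f"{out_word:>6} <- {src_word:<6} ({weight:.2f})")
```

If matplotlib is available, passing the same nested list to plt.imshow gives the familiar heatmap view, with src on the x-axis and tgt on the y-axis.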

This was the real breakthrough. We stopped being wizards casting spells with giant matrices and started being mechanics who could pop the hood and see which parts were actually doing the work. It’s the reason attention became the foundation for everything that came next, most notably the Transformer, which said, “If attention is this good, why do we need the RNNs at all?” (Spoiler: we don’t).

The Pitfalls and “Well, Actually…” Moments

It’s not all rainbows. Attention has its quirks.

Computational Cost: That three-step dance? It happens for every output token. And for each token, we’re comparing against every input token. This means the computation cost grows as O(src_len * tgt_len): a 50-token source and a 60-token target already mean 3,000 score computations. When both sides get long together, that product behaves quadratically, which becomes a real problem for very long documents (this is what the newer “efficient attention” folks are trying to solve).
Sometimes It Attends to Nothing… or Everything: You might find the softmax weights are almost uniform across the entire input sequence for some outputs. This often suggests the model isn’t finding a strong signal, so it’s just averaging everything together, a kind of “I give up” gesture. Conversely, it might lock almost all of its weight onto one word, missing the broader context.
It’s a Band-Aid on a Broken Leg: For a while, attention made lousy encoder-decoder architectures tolerable. The real genius was realizing that the RNNs themselves were the main limitation. Attention was the key insight that allowed us to throw them away entirely.
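One rough way to spot the “nothing or everything” pattern described above is to measure the entropy of each row of attention weights. This is a diagnostic sketch, not something this section prescribes: the function names are hypothetical, and where you draw the line between “too flat” and “too peaked” is a judgment call for your own data.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)

def normalized_entropy(weights):
    """Entropy divided by its maximum, log(n): 1.0 is uniform, 0.0 is one-hot."""
    n = len(weights)
    if n <= 1:
        return 0.0
    return attention_entropy(weights) / math.log(n)

uniform = [0.25, 0.25, 0.25, 0.25]   # "attends to everything"
peaked = [0.97, 0.01, 0.01, 0.01]    # locked onto one input position
one_hot = [1.0, 0.0, 0.0, 0.0]

for name, w in [("uniform", uniform), ("peaked", peaked), ("one_hot", one_hot)]:
    print(f"{name:>8}: {normalized_entropy(w):.3f}")
```

Values near 1.0 flag rows where the model is averaging over the whole input; values near 0.0 flag rows that are effectively hard lookups. Neither is automatically a bug, but output positions that sit at either extreme are worth a closer look.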

The takeaway? Attention isn’t just a nifty trick; it’s a fundamental rethinking of how sequences should relate to each other. It gives the model a working memory and, crucially, gives you a window into its decision-making process. And that’s always a good thing.