17.7 Sequence-to-Sequence with Encoder-Decoder Architecture

Right, so you’ve got a handle on vanilla RNNs, and you’ve seen how LSTMs and GRUs solve their chronic short-term memory problem. Fantastic. But let’s be honest, a single LSTM cell, no matter how brilliant, is a bit of a one-trick pony. It’s great for predicting the next word or classifying a sentiment, but what if you need to transform one sequence into another? Translate French to English? Summarize a long article? Have a coherent conversation? For that, you need a bigger gun. You need the Sequence-to-Sequence (Seq2Seq) architecture, and it’s one of the most elegant and powerful ideas in modern deep learning.

The core concept is so simple it’s almost stupid: use one RNN (the Encoder) to read and compress the entire input sequence into a fixed-length context vector—a thought, if you will. Then, use another RNN (the Decoder) to read that thought and spit out the transformed output sequence. It’s like having a brilliant but terribly forgetful friend who reads a German sentence, summarizes the entire meaning into a single, dense concept in their head (“Ah, the inherent melancholy of a rainy Tuesday!”), and then another friend who takes that concept and writes a poetic English translation. The potential for catastrophic miscommunication is both the flaw and the fun of the whole endeavor.

The Encoder: Squashing a Saga into a Sentence

The encoder’s job is straightforward. It processes the input sequence, one token at a time. At each step, it updates its hidden state. After it chews through the last element, we take its final hidden state (h_n), plus its final cell state (c_n) if it’s an LSTM, and call this our “context vector.” This tiny tensor is now the sole representative of everything the network knows about the input. No pressure.

Here’s the catch, and it’s a big one: this is a massive information bottleneck. Imagine trying to cram the entire plot of War and Peace into a 256-dimensional vector. You’re going to lose some details. This is the fundamental limitation of the basic Seq2Seq model and the reason why the attention mechanism (which we’ll get to) was such a revolutionary improvement.

import torch
import torch.nn as nn

class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size
        # An embedding layer to convert input indices to dense vectors
        self.embedding = nn.Embedding(input_size, hidden_size)
        # Using a GRU here for simplicity; you could use LSTM too.
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        # Embed the input
        embedded = self.embedding(input).view(1, 1, -1)
        output, hidden = self.gru(embedded, hidden)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size)
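
To make the bottleneck concrete, here is a minimal sketch that feeds a 2-token and a 95-token sequence through the same embedding-plus-GRU pair, one token at a time, just as EncoderRNN.forward does. The sizes, the token ids, and the encode helper are illustrative assumptions, not part of the model above. The point to notice is that the context vector has the same shape in both cases.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions for this sketch, not values from the text)
vocab_size, hidden_size = 20, 8

embedding = nn.Embedding(vocab_size, hidden_size)
gru = nn.GRU(hidden_size, hidden_size)

def encode(token_ids):
    """Hypothetical helper: step through the tokens one by one and
    return the final hidden state, i.e. the context vector."""
    hidden = torch.zeros(1, 1, hidden_size)
    for token in token_ids:
        embedded = embedding(torch.tensor([token])).view(1, 1, -1)
        _, hidden = gru(embedded, hidden)
    return hidden

short_context = encode([3, 7])                 # 2 tokens
long_context = encode(list(range(1, 20)) * 5)  # 95 tokens

print(short_context.shape)  # torch.Size([1, 1, 8])
print(long_context.shape)   # torch.Size([1, 1, 8])
```

Two tokens or ninety-five, the decoder gets exactly hidden_size numbers to work with.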

The Decoder: The Artistic Half of the Operation

The decoder is another RNN, but its job is generative. It starts with the context vector from the encoder as its initial hidden state. Its first input is almost always a special start-of-sequence token (<SOS>). It uses this to generate its first output and a new hidden state. Crucially, for the next step, it uses its own previous output as its next input. This feedback loop is called autoregressive decoding, and it is the only option at inference time, when no target sequence exists. During training we can instead feed in the true target token from the previous step, a trick called “teacher forcing” that speeds up learning. Because the decoder keeps feeding itself until it emits an end-of-sequence token (<EOS>) or hits a length cap, the output can be any length, regardless of how long the input was.

class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(DecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)
        output, hidden = self.gru(embedded, hidden)
        output = self.softmax(self.out(output[0]))
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size)
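
The inference-time loop described above can be sketched as a simple greedy decoder. This is one possible version under stated assumptions: TinyDecoder is a stand-in with the same interface as DecoderRNN so the snippet runs on its own, the SOS_token and EOS_token indices are made up, the context is a zero tensor standing in for a real encoder state, and greedy_decode is a hypothetical helper name. The decoder is untrained, so the tokens it emits are meaningless; only the control flow matters here.

```python
import torch
import torch.nn as nn

# Assumed vocabulary indices for the special tokens
SOS_token, EOS_token = 0, 1

class TinyDecoder(nn.Module):
    """Stand-in with the same interface as DecoderRNN."""
    def __init__(self, hidden_size, output_size):
        super().__init__()
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)
        output, hidden = self.gru(embedded, hidden)
        return torch.log_softmax(self.out(output[0]), dim=1), hidden

def greedy_decode(decoder, context, max_length=10):
    """Feed the decoder its own best guess until <EOS> or max_length."""
    decoder_input = torch.tensor([[SOS_token]])
    hidden = context
    tokens = []
    with torch.no_grad():  # no gradients needed at inference time
        for _ in range(max_length):
            log_probs, hidden = decoder(decoder_input, hidden)
            next_token = log_probs.argmax(dim=1)  # most likely token, shape (1,)
            if next_token.item() == EOS_token:
                break
            tokens.append(next_token.item())
            decoder_input = next_token  # the prediction becomes the next input
    return tokens

torch.manual_seed(0)
decoder = TinyDecoder(hidden_size=8, output_size=12)
context = torch.zeros(1, 1, 8)  # placeholder for the encoder's final hidden state
generated = greedy_decode(decoder, context)
print(generated)
```

The max_length cap is there because an untrained (or badly trained) decoder may never produce <EOS> on its own.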

The Training Loop and the Teacher Forcing Gambit

This is where you see the whole thing come together. You run the encoder on the source sequence, grab its final hidden state, and pass it to the decoder. The decoder then tries to predict the target sequence, one token per step. The loss is computed for every token it predicts and summed over the sequence.

Now, about teacher forcing: a pure decoder, left to its own devices, would use its own often-terrible early predictions as inputs for later steps, leading to a rapid downhill slide into gibberish during training. To stabilize this, we often use a trick: we randomly decide to feed the decoder the actual next word from the target sequence (the “ground truth”) instead of its own prediction. This gives the model a fighting chance to learn correctly before it has to fly solo. You control this with a teacher_forcing_ratio probability.

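import random

SOS_token = 0  # Assumed vocabulary index of the <SOS> token; match this to your own vocabulary

def train(input_tensor, target_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion, max_length, teacher_forcing_ratio=0.5):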
    encoder_hidden = encoder.initHidden()

    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    input_length = input_tensor.size(0)
    target_length = target_tensor.size(0)

    # Run the encoder
    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(input_tensor[ei], encoder_hidden)

    # The encoder's final hidden state is the decoder's initial context
    decoder_hidden = encoder_hidden
    decoder_input = torch.tensor([[SOS_token]]) # Start-of-Sequence token

    loss = 0

    # Run the decoder
    for di in range(target_length):
        decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden)
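        # Score this step's prediction against the true token.
        # criterion is assumed to be nn.NLLLoss, which pairs with the decoder's LogSoftmax output.
        # view(1) gives the target shape (1,) whether target_tensor is 1-D or (length, 1).
        loss += criterion(decoder_output, target_tensor[di].view(1))

        # The Teacher Forcing decision: use the actual next item?
        if random.random() < teacher_forcing_ratio:
            decoder_input = target_tensor[di] # <-- Ground truth
        else:
            # Otherwise feed back the decoder's own best guess (the top candidate)
            topv, topi = decoder_output.topk(1)
            decoder_input = topi.squeeze().detach() # Detach from history as input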

    loss.backward()
    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.item() / target_length
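
A cheap way to check that everything is wired correctly is to overfit a single source/target pair and watch the loss fall. The sketch below does that under several assumptions: Encoder, Decoder, and train_step are compact stand-ins written so the snippet runs on its own, the vocabulary size, hidden size, and toy sequences are invented, and a single Adam optimizer covers both networks for brevity (the function above uses two). It pairs nn.NLLLoss with the log-softmax output, which is the combination the decoder's LogSoftmax layer calls for.

```python
import random
import torch
import torch.nn as nn
import torch.optim as optim

SOS_token = 0  # assumed index of <SOS>

class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, token, hidden):
        return self.gru(self.embedding(token).view(1, 1, -1), hidden)

class Decoder(nn.Module):
    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, token, hidden):
        output, hidden = self.gru(self.embedding(token).view(1, 1, -1), hidden)
        return torch.log_softmax(self.out(output[0]), dim=1), hidden

def train_step(source, target, encoder, decoder, optimizer, criterion,
               teacher_forcing_ratio=0.5):
    optimizer.zero_grad()
    # Encode: the final hidden state becomes the context vector
    hidden = torch.zeros(1, 1, encoder.gru.hidden_size)
    for token in source:
        _, hidden = encoder(token, hidden)
    # Decode: accumulate the loss one step at a time
    decoder_input = torch.tensor([SOS_token])
    loss = 0.0
    for target_token in target:
        log_probs, hidden = decoder(decoder_input, hidden)
        loss = loss + criterion(log_probs, target_token.view(1))
        if random.random() < teacher_forcing_ratio:
            decoder_input = target_token.view(1)               # ground truth
        else:
            decoder_input = log_probs.argmax(dim=1).detach()   # own prediction
    loss.backward()
    optimizer.step()
    return loss.item() / len(target)

random.seed(0)
torch.manual_seed(0)
encoder, decoder = Encoder(10, 16), Decoder(16, 10)
optimizer = optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=0.01)
criterion = nn.NLLLoss()

# Toy pair: "translate" a sequence into its reverse
source = torch.tensor([2, 3, 4, 5])
target = torch.tensor([5, 4, 3, 2])

losses = [train_step(source, target, encoder, decoder, optimizer, criterion)
          for _ in range(200)]
print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

If the loss refuses to drop even on one memorizable example, the problem is in the plumbing (shapes, optimizer parameters, loss pairing), not in the data.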

The Elephant in the Room: The Bottleneck

I’ve been hinting at it. That context vector is a disaster for long sequences. Every new token overwrites a little more of the encoder’s hidden state, so information from early in the input gets watered down; by the end, the beginning of the sentence is all but forgotten. This is why the basic Seq2Seq model is often terrible at long-range dependencies. It’s the model’s equivalent of only remembering the last thing your partner said in an argument. The answer, of course, is attention. It allows the decoder to “peek back” at the encoder’s entire sequence of hidden states at every decoding step, dynamically deciding which parts of the input are most important right now. Because the decoder no longer depends on a single fixed-length vector, the bottleneck largely disappears. But that, my friend, is a topic for the next section. For now, understand that this Encoder-Decoder setup is the fundamental chassis on which the lightning-fast sports car of attention is built.