39.2 Neural Machine Translation with Encoder-Decoder

Alright, let’s pull back the curtain on how we actually get a machine to translate languages. Forget the clunky, rule-based systems of yore; we’re talking about Neural Machine Translation (NMT). The architecture that kicked off the modern era is the Encoder-Decoder model, and it’s a beautiful, intuitive piece of work. Think of it like this: I (the encoder) read the entire German sentence you hand me and compress its essence into a single, dense “thought” vector. Then, I (the decoder) take that thought and slowly, carefully, unfold it into a proper English sentence, one word at a time.

The magic here isn’t just compression; it’s distillation. We’re not storing every word in a lookup table. We’re capturing the meaning and the relationships between the words into a fixed-length numerical representation. This is the context vector—the model’s “understanding” of the source sentence. It’s a ridiculously high-stakes game of telephone, but with matrix multiplication.

The Encoder: Your Sentence’s Personal Compressor

The encoder’s job is simple on paper: take a variable-length input sequence (your sentence) and turn it into a fixed-length context vector. In practice, we use a Recurrent Neural Network (RNN), usually an LSTM or GRU, because they’re built to handle sequences. Here’s the basic flow:

import torch
import torch.nn as nn

class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)
        output, hidden = self.gru(embedded, hidden)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size)

# Example: Let's say our input language has 10,000 unique words (input_size)
# and we want a hidden state size of 256.
encoder = EncoderRNN(10000, 256)
hidden = encoder.initHidden()

# 'input_word' is an integer tensor holding a single word index
# (42 is an arbitrary placeholder). In a real setup, you'd call the
# encoder once per word, carrying 'hidden' forward through the sentence.
input_word = torch.tensor([42])
output, hidden = encoder(input_word, hidden)

Notice that after processing the entire sequence, we largely throw away the encoder’s output and keep only the final hidden state. This hidden state, having been updated at each step, is supposed to contain all the information from the sequence. It’s the context vector. This is the first big “questionable choice” I have to call out: expecting the final hidden state to perfectly encapsulate a long, complex sentence is a tall order. It’s like trying to summarize War and Peace with a single, slightly breathless sigh. This is why attention mechanisms were invented—but we’ll get to that glorious mess later.
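To make "process the whole sentence sequentially" concrete, here is a minimal sketch of that loop. The helper name encode_sentence and the word indices are made up for illustration, and the encoder class is repeated only so the block runs on its own.

```python
import torch
import torch.nn as nn

class EncoderRNN(nn.Module):
    """Same single-layer GRU encoder as above, repeated so this block is self-contained."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)
        return self.gru(embedded, hidden)

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size)

def encode_sentence(encoder, token_ids):
    """Hypothetical helper: feed one token at a time, carrying the hidden state forward."""
    hidden = encoder.initHidden()
    outputs = []
    for token in token_ids.view(-1, 1):   # each token has shape (1,)
        output, hidden = encoder(token, hidden)
        outputs.append(output[0, 0])      # per-step output, shape (hidden_size,)
    return torch.stack(outputs), hidden

torch.manual_seed(0)
encoder = EncoderRNN(10000, 256)
sentence = torch.tensor([12, 845, 3, 77, 1])  # made-up word indices
encoder_outputs, context_vector = encode_sentence(encoder, sentence)

print(encoder_outputs.shape)  # torch.Size([5, 256]): one output per source word
print(context_vector.shape)   # torch.Size([1, 1, 256]): the single context vector
```

For a single-layer, one-directional GRU like this one, the last per-step output is the same tensor of numbers as the final hidden state, so keeping only the hidden state loses nothing about the last step. What it does discard is the four earlier per-step outputs.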

The Decoder: The Careful Unpacker

The decoder is another RNN. Its job is to take the context vector from the encoder and generate the target sequence, one token at a time. It starts with a “start of sentence” token (<SOS>), uses the context vector as its initial hidden state, and generates a distribution over possible first words. We pick one (usually the most probable, which is known as greedy decoding), feed it back in as the next input, and repeat until we hit an “end of sentence” token (<EOS>). That feedback loop is how the decoder runs at inference time. During training, we often feed in the real target word as the next input instead to keep things stable, a trick called “teacher forcing” that gets its own section below.

class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(DecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        output = self.embedding(input).view(1, 1, -1)
        output = torch.relu(output)
        output, hidden = self.gru(output, hidden)
        output = self.softmax(self.out(output[0]))
        return output, hidden

# Example: Output size is the number of words in the target language.
decoder = DecoderRNN(256, 15000)

# Start with the SOS token (let's assume it's index 0)
decoder_input = torch.tensor([[0]])
# The decoder's initial hidden state is the encoder's final context vector!
decoder_hidden = hidden

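# Cap the output length so decoding stops even if EOS is never produced.
# (10 is an arbitrary limit for this example.)
max_length = 10

# In a loop, you'd generate one word at a time.
for di in range(max_length):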
    decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden)
    # Get the most likely next word
    topv, topi = decoder_output.topk(1)
    decoder_input = topi.squeeze().detach()  # detach from history

    if decoder_input.item() == 1:  # Assuming 1 is the EOS token index
        break

The Information Bottleneck and Why It’s a Problem

You’ve probably spotted the flaw. The entire meaning of a potentially long sentence must be squeezed into a single vector of fixed size (e.g., 256 numbers), no matter how many words went in. This is the infamous information bottleneck. It works surprisingly well for short sentences, but translation quality tends to degrade as sentences get longer. The decoder has to reconstruct a translation from this compressed thought alone, and crucial details can get lost along the way. It’s the main reason vanilla encoder-decoder models often produce generic or inaccurate translations for complex source material. The context vector becomes a fuzzy average, not a precise blueprint.
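A quick way to see the bottleneck is to encode a short and a long sentence and compare what comes out. This is only a sketch: it uses an untrained nn.Embedding and nn.GRU with random word indices, and the context_vector helper is a name invented here, so it shows the shapes involved, not translation quality.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size = 256
embedding = nn.Embedding(10000, hidden_size)
gru = nn.GRU(hidden_size, hidden_size)

def context_vector(token_ids):
    """Hypothetical helper: encode a whole sentence in one call, keep only the final hidden state."""
    embedded = embedding(token_ids).unsqueeze(1)  # (seq_len, batch=1, hidden_size)
    _, hidden = gru(embedded)                     # initial hidden state defaults to zeros
    return hidden

short_sentence = torch.randint(0, 10000, (4,))   # 4 random word indices
long_sentence = torch.randint(0, 10000, (60,))   # 60 random word indices

short_context = context_vector(short_sentence)
long_context = context_vector(long_sentence)

# Both print torch.Size([1, 1, 256]): the decoder gets 256 numbers either way.
print(short_context.shape, long_context.shape)
```

Four words or sixty, the decoder starts from the same 256 numbers. That fixed budget is the whole problem in one line of output.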

Training and The Teacher Forcing Trick

How do we train this thing? With a clever trick called teacher forcing. For each training example, we feed the source sentence into the encoder. The decoder then tries to predict the target sequence. But here’s the key: during training, we don’t just use its own previous, likely-incorrect prediction as the next input. We help it out by giving it the actual correct word from the target sequence as its next input. This stabilizes training immensely by preventing the model from spiraling off into a fantasy land of its own mistakes early on. The downside? It can create a mismatch between training (where it sees perfect inputs) and inference (where it sees its own often-imperfect predictions), a phenomenon known as exposure bias. One common mitigation is scheduled sampling: we gradually lower the probability of feeding in the ground-truth word, slowly weaning the model off teacher forcing.
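Here is one way a single training step might look, as a sketch under stated assumptions, not a reference implementation. The compact Encoder and Decoder classes, the train_step function, the teacher_forcing_ratio parameter, the toy vocabulary sizes, the token indices, and the linear decay schedule are all choices made for this illustration. At each decoder step a coin flip decides whether the next input is the true target word (teacher forcing) or the model's own prediction.

```python
import random
import torch
import torch.nn as nn

SOS_token, EOS_token = 0, 1  # assumed special-token indices

class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, source_ids):
        # source_ids: (src_len,) -> final hidden state: (1, 1, hidden_size)
        _, hidden = self.gru(self.embedding(source_ids).unsqueeze(1))
        return hidden

class Decoder(nn.Module):
    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, token, hidden):
        # token: (1,) -> log-probabilities over the target vocabulary: (1, vocab_size)
        output, hidden = self.gru(self.embedding(token).view(1, 1, -1), hidden)
        return torch.log_softmax(self.out(output[0]), dim=1), hidden

def train_step(encoder, decoder, optimizer, source_ids, target_ids, teacher_forcing_ratio):
    """One gradient update on a single sentence pair; returns the mean per-token loss."""
    optimizer.zero_grad()
    criterion = nn.NLLLoss()
    hidden = encoder(source_ids)               # context vector seeds the decoder
    decoder_input = torch.tensor([SOS_token])
    loss = 0.0
    for t in range(target_ids.size(0)):
        log_probs, hidden = decoder(decoder_input, hidden)
        loss = loss + criterion(log_probs, target_ids[t].view(1))
        if random.random() < teacher_forcing_ratio:
            decoder_input = target_ids[t].view(1)             # feed the correct word
        else:
            decoder_input = log_probs.argmax(dim=1).detach()  # feed the model's own guess
    loss.backward()
    optimizer.step()
    return loss.item() / target_ids.size(0)

torch.manual_seed(0)
random.seed(0)
encoder, decoder = Encoder(50, 32), Decoder(32, 60)  # toy vocabulary and hidden sizes
optimizer = torch.optim.SGD(list(encoder.parameters()) + list(decoder.parameters()), lr=0.1)
source = torch.tensor([5, 9, 23, 7, EOS_token])   # made-up source word indices
target = torch.tensor([11, 4, 42, EOS_token])     # made-up target word indices

num_steps = 20
losses = []
for step in range(num_steps):
    ratio = 1.0 - step / num_steps  # assumed schedule: linear decay from 1.0 towards 0.0
    losses.append(train_step(encoder, decoder, optimizer, source, target, ratio))
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

With the ratio fixed at 1.0 this is pure teacher forcing, and at 0.0 the decoder always eats its own predictions, just as it does at inference. Real schedules vary (linear, exponential, inverse sigmoid), and the linear one here is only the simplest to read.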

The encoder-decoder architecture is the bedrock of NMT. It’s elegant, powerful, and deeply flawed, which makes it the perfect starting point for understanding everything that came after it to fix those flaws.