39.1 Statistical Machine Translation: IBM Models and Phrase-Based

Alright, let’s get our hands dirty with how we used to translate languages before neural nets started showing off. This isn’t just a history lesson; understanding Statistical Machine Translation (SMT) is like learning the fundamentals of chess. It teaches you the core problems of translation, and frankly, some of these ideas are so clever they’ll make you want to high-five a long-deceased IBM researcher.

We start with the IBM models, developed in the late 80s and 90s. Their genius was framing translation as a noisy channel problem. Think of it like this: I have a target sentence in English (e.g., “the cat sat on the mat”). This sentence gets corrupted into a sort of “French-ish” noise, and out pops the source sentence (“le chat s’est assis sur le tapis”). Our job is to reverse-engineer this process.

The fundamental equation they derived is this beauty:

$$ \arg \max_{e} p(e | f) = \arg \max_{e} p(f | e) , p(e) $$

Translation isn’t about finding the English (e) that’s most probable given the French (f)—that’s intractable. Instead, we lean on Bayes’ theorem. We find the e that maximizes the product of two things: 1) the translation model $p(f | e)$ (how likely is this French string to be the garbage-noise version of this English string?), and 2) the language model $p(e)$ (how likely is this English string to be a coherent, grammatical sentence on its own?). The language model ensures we output “the cat sat on the mat” and not “mat the on sat cat the.”

The Humble (and Slightly Unhinged) IBM Model 1

Model 1 is the gateway drug. Its goal is simple: given an English word, what’s the probability it translates into a given French word? We call these translation probabilities, or $t(f | e)$.

The wild part? Model 1 ignores word order completely. It assumes all connections between words in a sentence are equally likely. It’s like taking the two sentences, shredding them, throwing the words in the air, and then figuring out which word pairs landed together. It’s absurd, but it’s a start. We learn these probabilities using the EM algorithm (Expectation-Maximization), which is a fancy way of saying “make an educated guess, see how wrong you are, and slowly get less wrong over many iterations.”

Here’s a grotesquely simplified Python snippet to give you the flavor of the E-step for Model 1. Don’t run this on real data; it’s just to show the mechanics.

# A tiny corpus: English sentences and their French translations
corpus = [
    (['the', 'house'], ['la', 'maison']),
    (['the', 'book'], ['le', 'livre'])
]

# Initialize translation probabilities uniformly. P(french | english)
t_probs = {
    ('la', 'the'): 0.5,   ('maison', 'the'): 0.5,
    ('le', 'the'): 0.5,   ('livre', 'the'): 0.5,
    ('la', 'house'): 0.5, ('maison', 'house'): 0.5,
    ('le', 'book'): 0.5,  ('livre', 'book'): 0.5,
    # ... and so on. In reality, you'd initialize programmatically.
}

# E-step: Calculate expected counts for each (french, english) word pair.
# This is the "how wrong were we?" part.
for (e_sent, f_sent) in corpus:
    for f_word in f_sent:
        total = sum(t_probs.get((f_word, e_word), 0) for e_word in e_sent)
        for e_word in e_sent:
            # The count for (f_word, e_word) is its proportion of the total probability
            delta = t_probs.get((f_word, e_word), 0) / total
            # We'd normally accumulate this delta in a structure for the M-step
            print(f"Count for ({f_word}, {e_word}) += {delta:.4f}")
# The M-step would then update t_probs by normalizing these accumulated counts.

After many iterations of E and M steps, you’ll hopefully find that t('la' | 'the') and t('le' | 'the') converge to some value, while t('maison' | 'the') plummets towards zero. Magic!

Getting Serious with Phrase-Based Translation

The IBM models are brilliant but have a fatal flaw: they’re obsessed with words. Translation doesn’t work word-by-word; it works by phrases (contiguous sequences of words). “Green card” doesn’t mean “carte verte” in French; it means “carte de séjour”. The meaning is in the chunk.

Phrase-Based Machine Translation (PBMT) was the pragmatic response that dominated until ~2014. It ditches the theoretical purity of generative models for a simpler, more effective approach: 1) break the source sentence into phrases, 2) translate each phrase, and 3) reorder the translated phrases into a coherent target sentence.

Its power comes from three key probability distributions:

Phrase Translation Table ($φ(\bar{f}|\bar{e})$): The core. The probability that a source phrase $\bar{f}$ translates to a target phrase $\bar{e}$. Mined from massive parallel corpora.
Reordering Model: Penalizes crazy reordering of phrases. Going from Subject-Verb-Object (English) to Subject-Object-Verb (German) is fine. Putting the verb at the very beginning for no reason is not.
Language Model ($p(e)$): Same as before. Makes sure the final stitched-together sentence is fluent.

The search for the best translation becomes a classic dynamic programming problem, often solved with a beam search.

# A hypothetical, simplified phrase table lookup for a decoder.
# This is what the model's "knowledge" looks like.

phrase_table = {
    'das': [('the', 0.9), ('that', 0.1)],
    'haus': [('house', 0.8), ('home', 0.2)],
    'ist': [('is', 0.95)],
    'klein': [('small', 0.7), ('little', 0.3)],
    'das haus': [('the house', 0.7)], # Phrase translation!
    'ist klein': [('is small', 0.6), ('is little', 0.4)]
}

def translate(german_sentence):
    words = german_sentence.split()
    # A real decoder would consider all possible segmentations and translations.
    # Here, we just naively check for a multi-word phrase first.
    if 'das haus' in german_sentence:
        print("Using phrase translation: 'the house'")
    else:
        print("Falling back to word-by-word translation.")
    # ... immense complexity hidden here ...

translate('das haus ist klein')

The Pitfalls and the Glory

SMT is a testament to engineering. It’s robust, its components are interpretable (you can inspect a phrase table!), and it doesn’t need a GPU farm. But the pitfalls are deep.

First, reordering. Modeling the permutations of phrases is a nightmare. The models are clunky and often get it wrong on long-range dependencies.

Second, data sparsity. That phrase table is built from observed co-occurrences. If your parallel corpus never contains the phrase “green card,” the model will never learn to translate it correctly. It has zero ability to reason about the meaning of unseen word combinations. This is its ultimate weakness and the very problem neural models solve.

So, while we’ve moved on, we stand on the shoulders of these statistical giants. They taught us the rules of the game.