21.3 WordPiece: BERT's Tokenization Algorithm

Alright, let’s talk about WordPiece. You know BPE, that scrappy algorithm that builds a vocabulary by gluing the most frequent pairs of symbols together? WordPiece is its more meticulous, slightly neurotic cousin who showed up to the party with a spreadsheet. It’s the algorithm that powers BERT and its many descendants. It wasn’t invented for BERT, though: it predates it, having first been used at Google for Japanese and Korean voice search and later for neural machine translation. The goal has stayed the same throughout: cover the messiness of human language with a fixed-size vocabulary, without drowning in unknown words.

The core difference between BPE and WordPiece isn’t in the merging—they both do that greedy pair-merging dance. It’s in the scoring. BPE just asks, “Which pair of symbols appears the most frequently?” It’s a simple frequency contest. WordPiece, in its infinite wisdom, asks a more nuanced question: “Which pair, when merged, will increase the likelihood of our training data the most?” It uses a maximum likelihood objective, which is just a fancy way of saying it wants to build a vocabulary that makes its data look as statistically probable as possible.

Think of it this way: merging ‘e’ and ‘s’ might be frequent, forming ‘es’, a common plural ending. But both letters are so common on their own that seeing them side by side isn’t much of a surprise. Merging ‘u’ and ‘n’ to form ‘un’ might be the better buy, even if the pair is slightly less frequent, as long as ‘u’ and ‘n’ turn up together more often than their individual frequencies would predict. WordPiece’s scoring mechanism is designed to find those units. It’s not just counting; it’s evaluating.

The Gory Details of the Training Process

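The training starts just like BPE: with a base vocabulary of all individual characters in your corpus. Then, on each round, it scores every pair of symbols that appear next to each other in the corpus. Google never released the original training code, so what follows is the formulation usually given in public descriptions. The score for a pair (say, ‘u’ and ‘n’) is calculated as: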

score('u', 'n') = freq('u', 'n') / (freq('u') * freq('n'))

Wait, don’t glaze over. This is the good part. It’s not just raw frequency. It’s dividing the frequency of the bigram (the pair) by the product of the frequencies of each individual piece. This rewards pairs that occur together way more often than you’d expect by random chance. A high score means these two symbols are deeply, statistically committed to each other. They belong together. So we merge them, add the new symbol ‘un’ to the vocabulary, and update our frequency counts for the next round. We repeat this until we hit our target vocabulary size.
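If you want to watch the score disagree with raw frequency, here is a minimal sketch of a single scoring round. The toy corpus and the helper name `pair_scores` are made up for illustration, and the sketch skips the ## prefix on non-initial characters to stay close to the formula above. Treat it as a rough picture of the idea, not a reference implementation.

```python
from collections import Counter

# Hypothetical toy corpus: word -> number of occurrences.
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

def pair_scores(word_freqs):
    """Count symbols and adjacent pairs, then score each pair as
    freq(pair) / (freq(left) * freq(right))."""
    symbol_freq = Counter()
    pair_freq = Counter()
    for word, count in word_freqs.items():
        symbols = list(word)  # the base vocabulary is single characters
        for symbol in symbols:
            symbol_freq[symbol] += count
        for left, right in zip(symbols, symbols[1:]):
            pair_freq[(left, right)] += count
    scores = {
        (left, right): freq / (symbol_freq[left] * symbol_freq[right])
        for (left, right), freq in pair_freq.items()
    }
    return pair_freq, scores

pair_freq, scores = pair_scores(word_freqs)

# What a pure frequency contest (BPE-style) would merge first.
most_frequent = max(pair_freq, key=pair_freq.get)
# What the WordPiece-style score would merge first.
best_scored = max(scores, key=scores.get)

print(most_frequent)  # ('u', 'g')
print(best_scored)    # ('g', 's')
```

In this corpus ‘u’ appears in every word, so any pair containing it gets its score dragged down by that huge denominator. The pair (‘g’, ‘s’) occurs only 5 times, but ‘s’ never appears without ‘g’ in front of it, and that exclusivity is what the score rewards.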

How Tokenization Actually Works in Practice

This is where most people get tripped up. The training algorithm is one thing; the tokenization algorithm at runtime is another. And for WordPiece, it’s a sight to behold. It uses a greedy longest-match-first strategy, also known as maximum matching.

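It starts from the beginning of the word and finds the longest prefix that’s in the vocabulary. Then it moves past that match and does the same thing on whatever is left, repeating until the word is used up. The order in which merges were learned during training plays no part here. WordPiece keeps only the final vocabulary, so at runtime it’s simply looking for the longest possible token from its list.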

Let’s see it in action. Imagine our vocab has ‘un’, ‘##afford’, ‘##able’, and ‘##ing’. Now, tokenize “unaffordable”.

# This is a simplified illustration of the greedy algorithm
vocab = {"un", "##afford", "##able", "##ing"}

word = "unaffordable"
start = 0
tokens = []

while start < len(word):
    end = len(word)
    # Try to find the longest substring from 'start' that's in the vocab
    while end > start:
        substr = word[start:end]
        # Pieces that don't start the word carry the '##' prefix
        if start > 0:
            substr = "##" + substr
        if substr in vocab:
            tokens.append(substr)
            start = end
            break
        end -= 1
    else:
        # No piece matches at this position: the whole word becomes [UNK]
        tokens = ["[UNK]"]
        break

print(tokens)  # Outputs: ['un', '##afford', '##able']

See that? It didn’t break it into ['un', '##a', '##ff', '##ord', '##able']. It greedily grabbed the biggest chunks it could: first ‘un’, then ‘##afford’, then ‘##able’. Note that the middle piece is ‘##afford’, not ‘afford’. Anything that doesn’t start the word has to be in the vocab in its ## form, so a plain ‘afford’ entry wouldn’t match here. Fewer, longer pieces mean shorter input sequences and tokens that are more likely to line up with meaningful units.

The Infamous Double-Hash Prefix

This is WordPiece’s most obvious quirk, and it’s a brilliantly simple hack. How do you distinguish a token that can start a word from one that can only appear in the middle? You prefix the latter with ##.

This solves a massive ambiguity problem. Without it, the token “ing” could be the word “ing” (admittedly rare) or the ending “-ing”. The double-hash tells the model unequivocally, “Hey, I’m a continuation. I can’t start a word. Don’t even think about putting me at the beginning.” It also makes putting words back together trivial: strip the ## and glue the piece onto the token before it. It’s a small piece of structural information baked directly into the token itself.
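To make the prefix’s job concrete, here is a small sketch of going the other way, from pieces back to words. The function name `join_wordpieces` is made up for this example, and it assumes the tokens came from a ##-style tokenizer with special tokens already removed.

```python
def join_wordpieces(tokens):
    """Rebuild whitespace-separated words from WordPiece-style tokens."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: drop the prefix and attach to the previous word.
            words[-1] += token[2:]
        else:
            # Word-initial piece: start a new word.
            words.append(token)
    return " ".join(words)

tokens = ["the", "un", "##afford", "##able", "play", "##ing"]
print(join_wordpieces(tokens))  # the unaffordable playing
```

Without the prefix there would be no way to tell whether “play” followed by “ing” was one word or two.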

Common Pitfalls and How to Avoid Them

Mismatched Vocabularies: This is the number one cause of “why is my model broken?” The WordPiece vocab is a sacred object. You must use the exact same one that your pre-trained model (like BERT) used. Tokenizing your fine-tuning data with a different vocabulary, even one trained on a similar corpus, will lead to catastrophic nonsense. The model never sees strings, only token IDs that index into its embedding table, so a different vocab means the IDs point at the wrong embeddings.
Forgetting the CLS/SEP Tokens: The vocabulary isn’t just word pieces. Special tokens like [CLS], [SEP], and [MASK] live in it too, and [CLS] and [SEP] are non-negotiable for structuring input for BERT. Your tokenization library should add these automatically, but if you’re rolling your own, you must add them manually.
Handling Unknowns: The [UNK] token is a failure state. It shows up when a word can’t be fully covered by pieces in the vocabulary, typically because it contains a character the tokenizer never saw during training. In BERT-style implementations it swallows the entire word, not just the offending piece. For a well-trained tokenizer on a large corpus, this should be rare. If you’re processing domain-specific text (e.g., biomedical papers with lots of chemical names), you might need to augment the vocabulary or use a different approach. A high [UNK] rate is a major red flag.
The Whitespace Trap: Remember, the tokenizer is designed to work on text that has already been pre-tokenized into words (BERT’s own pipeline splits on whitespace and punctuation first). The algorithm runs on each “word” individually. If you feed it a string with no spaces, it will try to tokenize the entire gargantuan string as one single word. The likely result is a bizarre sequence of subword tokens or, since BERT-style implementations typically cap the word length, a single [UNK] for the whole thing. Always pre-tokenize your text first.
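Several of these pitfalls are easy to see in a toy pipeline. The sketch below is a rough illustration only: the vocabulary is made up, the names `wordpiece`, `encode`, and `unk_rate` are hypothetical, and the pre-tokenization is a bare `str.split()`, whereas real pipelines also split punctuation and normalize the text.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # non-initial pieces carry the prefix
            if piece in vocab:
                break
            end -= 1
        else:
            # Nothing matched at this position: the whole word is unknown.
            return [unk]
        tokens.append(piece)
        start = end
    return tokens

def encode(text, vocab):
    """Whitespace pre-tokenize, split each word, and add [CLS]/[SEP]."""
    tokens = ["[CLS]"]
    for word in text.split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens

def unk_rate(tokens):
    """Fraction of tokens that are [UNK]."""
    return tokens.count("[UNK]") / len(tokens)

# Hypothetical toy vocabulary, special tokens included.
vocab = {"[CLS]", "[SEP]", "[UNK]", "un", "##afford", "##able",
         "play", "##ing", "is"}

spaced = encode("playing is unaffordable zzz", vocab)
print(spaced)
# ['[CLS]', 'play', '##ing', 'is', 'un', '##afford', '##able', '[UNK]', '[SEP]']

# The whitespace trap: the same words with the spaces removed.
unspaced = encode("playingisunaffordable", vocab)
print(unspaced)  # ['[CLS]', '[UNK]', '[SEP]']
```

In the unspaced case the tokenizer happily matches ‘play’ and ‘##ing’, then hits ‘is’ in a non-initial position, finds no ‘##is’ in the vocab, and throws the whole string away as one [UNK]. Tracking something like `unk_rate` over a sample of your data is a cheap way to catch this kind of problem early.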

In essence, WordPiece is a workhorse. It’s not the newest algorithm on the block, but its design (the likelihood scoring, the greedy matching, the double-hash trick) is engineered with a deep understanding of both statistics and language. It’s the reason BERT can read a word like “unaffordable” as a handful of familiar pieces instead of giving up and calling it unknown. It’s not just breaking text; it’s building meaning.