21.2 Byte Pair Encoding (BPE): Building the Vocabulary from Merges
Alright, let’s get our hands dirty with Byte Pair Encoding (BPE). Forget the intimidating name for a second; the core idea is so stupidly simple a compression engineer from the 90s would slap their forehead and say “why didn’t I think of that for text?” Actually, they did think of it for text, but we’ve since weaponized it for AI. BPE is a data compression algorithm that we’ve hijacked to solve a fundamental NLP problem: how do you build a vocabulary for a model when words are infinite but your GPU’s memory is very, very finite?
You can’t just use single words. “Unconstitutionality” is a word, but so is “a”. Do you really want to waste a vocabulary slot on every single punctuation mark? And what about misspellings, or “dog” vs. “dogs”? A naive word-level vocabulary would be enormous, inefficient, and brittle. BPE solves this by starting with the absolute smallest possible units—individual bytes, or more commonly, characters—and then merging the most frequent pairs into new, larger units. It’s a greedy, statistical, bottom-up process that ends up creating a vocabulary of the most statistically significant “subword” chunks in your training corpus. These chunks can be whole words (“cat”), common morphemes (“un-”, “-ing”), or even parts of words (“uca” from “education”). This is brilliant because it lets the model handle never-before-seen words by just breaking them down into known subwords. No more <UNK> tokens for every proper noun.
The Algorithm: A Step-by-Step Walkthrough
Let’s break down the algorithm, which is best understood by seeing it in action. We’ll train a tiny BPE vocabulary on a laughably small corpus. Pay attention to the frequency; that’s the engine driving this whole process.
First, we start with our initial vocabulary: every character in our text, plus a special </w> token appended to each word to mark its end. This end-of-word token is crucial. It’s the difference between “cat” being a word and “cat” being the start of “catalogue”. We need to know where words stop.
# Our 'corpus'
corpus = "low lower newest widest"
# Pre-tokenize by word, add the end-of-word token
words = [word + '</w>' for word in corpus.split()]
print("Initial words:", words)
# Count frequency of each word
from collections import defaultdict
word_freq = defaultdict(int)
for word in words:
word_freq[word] += 1
print("Word frequencies:", dict(word_freq))
Output:
Initial words: ['low</w>', 'lower</w>', 'newest</w>', 'widest</w>']
Word frequencies: {'low</w>': 1, 'lower</w>': 1, 'newest</w>': 1, 'widest</w>': 1}
Now, we break each word into its constituent characters. This is our starting point. Our initial vocabulary is all the characters, and our “tokens” for each word are just sequences of those characters.
# Get the initial splits (characters + </w>)
splits = {word: list(word) for word in word_freq.keys()}
print("Initial splits:", splits)
Output:
Initial splits: {
'low</w>': ['l', 'o', 'w', '</w>'],
'lower</w>': ['l', 'o', 'w', 'e', 'r', '</w>'],
'newest</w>': ['n', 'e', 'w', 'e', 's', 't', '</w>'],
'widest</w>': ['w', 'i', 'd', 'e', 's', 't', '</w>']
}
The core loop is simple: count the frequency of every adjacent pair of symbols (bytes, characters, or merged tokens) in your vocabulary. Then, find the most frequent pair. Finally, merge that pair everywhere it appears. You repeat this process for a predetermined number of merge operations.
Let’s do the first merge. What are the most common pairs? Let’s count all adjacent pairs across all words.
from collections import Counter
pair_freq = Counter()
for word, freq in word_freq.items():
split = splits[word]
for i in range(len(split) - 1):
pair = (split[i], split[i+1])
pair_freq[pair] += freq
print("Pair frequencies:", pair_freq.most_common())
Output:
Pair frequencies: [
(('e', 's'), 2), # Found in 'newest</w>' and 'widest</w>'
(('s', 't'), 2), # Found in 'newest</w>' and 'widest</w>'
(('w', '</w>'), 1), (('l', 'o'), 1), ... # and others
]
The pair ('e', 's') wins with a frequency of 2. So, we merge it. We create a new token called es and replace every occurrence of ’e’ followed by ’s’ with this new merged token.
# Perform the merge for the most frequent pair ('e', 's')
most_frequent_pair = pair_freq.most_common(1)[0][0]
new_token = ''.join(most_frequent_pair)
print(f"Merging '{most_frequent_pair}' into new token: '{new_token}'")
# Update the splits for every word
for word in word_freq:
split = splits[word]
i = 0
new_split = []
while i < len(split):
if i < len(split) - 1 and (split[i], split[i+1]) == most_frequent_pair:
new_split.append(new_token)
i += 2 # Skip the next element since we merged it
else:
new_split.append(split[i])
i += 1
splits[word] = new_split
print("Splits after first merge:", splits)
Output:
Merging ('e', 's') into new token: 'es'
Splits after first merge: {
'low</w>': ['l', 'o', 'w', '</w>'],
'lower</w>': ['l', 'o', 'w', 'e', 'r', '</w>'], # 'e','s' not adjacent here, no change
'newest</w>': ['n', 'e', 'w', 'es', 't', '</w>'], # merged!
'widest</w>': ['w', 'i', 'd', 'es', 't', '</w>'] # merged!
}
You rinse and repeat this process—recounting pair frequencies on the updated splits and merging the most common pair—for as many steps as you want. The number of merges is a key hyperparameter; it directly controls the size of your final vocabulary. More merges = a smaller vocabulary of larger chunks. Fewer merges = a larger vocabulary of smaller chunks.
The Devil’s in the Details: Implementation Gotchas
Now, the above code is pedagogical trash. It’s slow and inefficient. In the real world, you use a priority queue to track pair frequencies and update counts incrementally after each merge instead of recomputing everything from scratch. The Hugging Face tokenizers library does this heavy lifting for you, and you should absolutely use it instead of rolling your own, unless you’re a masochist.
Here’s how you’d do it properly:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
# Initialize a tokenizer with the BPE model
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Tell it to pre-tokenize by splitting on whitespace
tokenizer.pre_tokenizer = Whitespace()
# Trainer specifies the vocab size target and special tokens
trainer = BpeTrainer(vocab_size=300, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
# Train it on a text file (or a list of strings)
tokenizer.train(["path/to/your/corpus.txt"], trainer)
# Now use it
output = tokenizer.encode("This is a sentence with unconstitutionality.")
print("Tokens:", output.tokens)
print("IDs: ", output.ids)
The biggest pitfall? The corpus you use for training is everything. Train on code and your merges will include def and import . Train on English literature and you’ll get th and ing</w>. Train on a messy mix of social media posts, code, and technical manuals, and you’ll get a Frankenstein’s monster of a vocabulary that’s probably not optimal for any one task. Garbage in, garbage out. Choose your training data like you’d choose a parachute: very, very carefully.
Why This Works and When It Doesn’t
BPE is elegant because it’s data-driven. It doesn’t need a linguistic preprocessor; it just statistically finds common byte sequences. This makes it robust across languages, even those without whitespace.
The questionable choice? It’s greedy. It makes the best local merge at each step without any consideration for the global optimum. The merge ('e', 's') might seem great now, but it might prevent a more meaningful merge like ('est', '</w>') later. It’s a classic case of a greedy algorithm: often very good, rarely perfect.
The other rough edge is that the same word can have multiple valid segmentations based on the merge rules. Most implementations just use a greedy match (longest-match-first) during tokenization, which works well enough but isn’t guaranteed to be optimal. It’s a heuristic, and sometimes it does something downright silly. But for the vast majority of cases, it’s shockingly effective. It’s the workhorse behind GPT and countless other models for a reason. It just works.