21.1 Why We Tokenize: Characters, Words, and Subwords
Right, let’s get this out of the way: you can’t just feed raw text into a model. I know, I know, it feels like we should be able to. You and I can read a string of characters and understand it, but your model sees a sequence of numbers. Tokenization is the process of figuring out how to turn text into those numbers in a way that’s actually useful. It’s the first, most crucial, and often most infuriating step in the whole NLP pipeline. Get it wrong here, and everything that follows is built on a shaky foundation.
Think of it like this: we need to break text into chunks, or “tokens,” that our model can understand. The question is, what is a token? We have three main contenders, each with its own philosophical baggage.
The Character: The Minimalist’s Dream
At one extreme, we have character-level tokenization. Every single letter, digit, punctuation mark, and weird Unicode symbol gets its own token ID. It’s incredibly simple.
text = "hello!"
char_tokens = list(text)
print(char_tokens) # Output: ['h', 'e', 'l', 'l', 'o', '!']
The Upside: The vocabulary (the set of all possible tokens) is tiny: typically a few hundred symbols for alphabetic scripts like English, though it climbs into the thousands for languages such as Chinese or Japanese. You’ll almost never run into an out-of-vocabulary (OOV) word, because anything can be spelled out from these basic building blocks; the only gap is a character that never showed up in the training data, and byte-level schemes close even that. It’s elegant.
The Downside: It’s brutally inefficient. The sequence for a single sentence balloons to several times the length of a word-level split, and the model has to work incredibly hard to learn that ‘h’, ‘e’, ‘l’, ‘l’, ‘o’ actually form a meaningful unit. It’s like trying to understand a novel by examining each molecule of ink on the page. Because self-attention cost grows quadratically with sequence length, the computational bill gets hard to justify for most applications.
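To put rough numbers on that trade-off, here is a minimal sketch that counts vocabulary size and sequence length for one short sentence. The sentence is an arbitrary toy example, and building the vocabulary from the sentence itself is an assumption made for brevity; a real vocabulary would come from a full training corpus.

```python
text = "the cat sat on the mat"

# Toy assumption: build the character vocabulary from this one sentence.
char_vocab = {ch: idx for idx, ch in enumerate(sorted(set(text)))}

# Character level: one ID per character, spaces included.
char_ids = [char_vocab[ch] for ch in text]

# Word level (naive whitespace split), for comparison.
word_tokens = text.split()

print(len(char_vocab))   # number of distinct characters
print(len(char_ids))     # sequence length at character level
print(len(word_tokens))  # sequence length at word level
```

Ten distinct symbols cover the whole sentence, but the model has to process 22 positions instead of 6.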
The Word: The Obvious (But Flawed) Choice
The intuitive solution is word-level tokenization. Split on whitespace and punctuation, and boom, you have tokens that mean something.
import re
text = "Can't believe it! Hello, world."
# A naive word-level split
word_tokens = re.findall(r"\w+|\S", text)
print(word_tokens) # Output: ['Can', "'", 't', 'believe', 'it', '!', 'Hello', ',', 'world', '.']
…See the problem immediately? Our naive split butchered “Can’t” into “Can”, an orphaned apostrophe, and “t”. This is the first of many headaches. To do this properly, you need a robust vocabulary and a way to handle unseen words. The vocabulary size explodes into the hundreds of thousands. What do you do with “unhappiness”? Is it one token? What about “model’s” versus “models”? And God help you if your text has a typo or a rare medical term: the model just sees an <UNK> (unknown) token and loses all information about that word. It’s a brittle, high-maintenance approach.
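The <UNK> problem is easy to reproduce. The sketch below builds a word-level vocabulary from a made-up training sentence and then encodes a sentence containing a word it has never seen. The training text, the `encode` helper, and the choice of ID 0 for <UNK> are all illustrative assumptions, not a reference implementation.

```python
from collections import Counter

# Hypothetical training text; a real corpus would be far larger.
train_text = "the patient has a fever the patient has a cough"
UNK = "<UNK>"

# Build a word-level vocabulary, reserving ID 0 for unknown words.
counts = Counter(train_text.split())
vocab = {UNK: 0}
for word in sorted(counts):
    vocab[word] = len(vocab)

def encode(text):
    """Map each whitespace-separated word to its ID, or to the <UNK> ID."""
    return [vocab.get(word, vocab[UNK]) for word in text.split()]

seen = encode("the patient has a fever")
unseen = encode("the patient has tachycardia")
print(seen)
print(unseen)
```

The rare term collapses to ID 0, and nothing downstream can tell it apart from any other unknown word.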
The Subword: The Clever Compromise
This is where we get smart. Subword tokenization is the Goldilocks solution. The family includes BPE (used by the GPT models), WordPiece (used by BERT), and the Unigram model, with SentencePiece as a popular toolkit that implements BPE and Unigram. The core idea is brilliantly simple: frequent words should stay as whole tokens, but rare words should be broken down into meaningful sub-components.
This solves the OOV problem elegantly. Imagine the tokenizer learns the subword tokens "un", "happi", "ness". It can now represent “unhappiness” and “happiness” directly, and because subword vocabularies also keep single characters (or bytes) as a fallback, it can still spell out “unhappy” or even a misspelling like “unhappynes” if it had to. The vocabulary size stays manageable (roughly 30,000 to 50,000 tokens in many models) while providing extensive coverage.
Why does this work so well? Because it loosely mirrors how we understand language. You don’t need to have seen the word “antidisestablishmentarianism” before to know that “anti-” means against, “dis-” means not, “establish” is a thing, and “-arianism” is some kind of belief system. Subword tokenization gives models a similar, data-driven way to deconstruct words, with one caveat: the splits are driven by frequency statistics, so they often resemble morphemes but are not guaranteed to line up with them.
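Here is a minimal sketch of how a fixed subword vocabulary can cover words it has never seen. It uses greedy longest-match-first segmentation, which resembles WordPiece-style encoding but is a simplification and not the BPE merge procedure. The vocabulary is hypothetical: the three subwords from the example plus single lowercase letters as a fallback, which is an assumption added here so that every lowercase word can be spelled out.

```python
import string

# Hypothetical toy vocabulary: three subwords plus single-letter fallbacks.
VOCAB = {"un", "happi", "ness"} | set(string.ascii_lowercase)

def segment(word, vocab=VOCAB):
    """Greedy longest-match-first segmentation of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:
            # Nothing matched, not even a single character.
            pieces.append("<UNK>")
            start += 1
    return pieces

for w in ["unhappiness", "happiness", "unhappy", "unhappynes"]:
    print(w, segment(w))
```

With this toy vocabulary, “unhappiness” comes out as three pieces, while “unhappy” keeps "un" and falls back to single letters for the rest. Nothing maps to <UNK>, but the less a word overlaps with the learned subwords, the longer its token sequence gets.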
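To make “data-driven” concrete, here is a compact sketch of how BPE learns its merges: start from characters, repeatedly find the most frequent adjacent pair, and fuse it into a new symbol. The word counts are invented for illustration, the function names are my own, and production tokenizers add details this sketch omits (byte-level handling, pre-tokenization, end-of-word markers).

```python
from collections import Counter

def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus_counts, num_merges):
    """Learn up to `num_merges` merges from a {word: count} mapping."""
    word_freqs = {tuple(word): freq for word, freq in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(word_freqs)
        if not pairs:
            break
        # Most frequent pair wins; ties go to the pair counted first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        word_freqs = merge_pair(best, word_freqs)
    return merges, word_freqs

# Hypothetical word counts, purely for illustration.
corpus_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(corpus_counts, num_merges=4)
print(merges)
print(segmented)
```

After four merges, the frequent word “low” has become a single token, while the rarer “lower” is still split into "low", "e", "r". That is the core idea in miniature: frequency decides what gets to be a whole token.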
The Inevitable Trade-offs and Pitfalls
No approach is perfect. Subword tokenization has its own quirks that will drive you up a wall.
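First, the same word doesn’t always get the same tokens. For a fixed vocabulary and set of merge rules, standard BPE and WordPiece encoding is deterministic: the same string always yields the same tokens. But how a given word is split depends on how it appears in the text (its capitalization, whether a space precedes it) and on which tokenizer and vocabulary version you load. Some training setups also sample among segmentations on purpose (subword regularization, BPE-dropout). Mix up tokenizer versions between training and inference, and reproducibility becomes a nightmare.
Second, whitespace is significant. Byte-level BPE tokenizers such as GPT-2’s keep the space as an ordinary symbol and usually merge it onto the beginning of the following word, resulting in tokens like "hello" and " world". WordPiece as used in BERT instead splits on whitespace first and marks word-internal continuation pieces with a "##" prefix. Either way, decoding has to rebuild the spacing from these markers, which is why you’ll sometimes see text with missing or extra spaces when tokens are joined naively. SentencePiece treats the input as a raw stream of Unicode characters and escapes each space as the visible marker “▁” (U+2581), so it needs no language-specific pre-tokenization and copes with languages without clear word separators.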
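The space-marker convention is easier to see in code. The sketch below imitates only the SentencePiece-style escaping of spaces as “▁” and shows that it makes decoding lossless. It splits at word boundaries rather than learning subword pieces, and the helper names `to_pieces` and `from_pieces` are hypothetical, so treat it as an illustration of the convention, not of the library.

```python
MARKER = "\u2581"  # "▁", a visible stand-in for a space

def to_pieces(text):
    """Escape spaces with the marker, then start a new piece at each marker."""
    # A leading marker is added so the first word looks like every other word.
    escaped = MARKER + text.replace(" ", MARKER)
    pieces = []
    for ch in escaped:
        if ch == MARKER or not pieces:
            pieces.append(ch)
        else:
            pieces[-1] += ch
    return pieces

def from_pieces(pieces):
    """Invert to_pieces: join, restore spaces, drop the added leading space."""
    return "".join(pieces).replace(MARKER, " ")[1:]

pieces = to_pieces("hello world")
print(pieces)
print(from_pieces(pieces))
```

Because the space travels inside the token, decoding is plain concatenation plus one replacement, and even runs of consecutive spaces survive the round trip. The sketch does assume the input never contains “▁” itself.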
Third, the chosen vocabulary size is a hyperparameter you can’t ignore. Too small, and your tokens are almost back to being characters, losing efficiency. Too large, and you’re back to the bloat of word-level tokenization: a huge embedding table full of rare tokens that appear too seldom in training to be learned well. You have to find the sweet spot for your specific dataset.
Finally, always remember that the tokenizer is learned from a corpus: yours if you train it yourself, someone else’s if you pick up a pretrained one. Its biases become your model’s biases. If that corpus is full of “Python” (the language) but rarely mentions “python” (the snake), the tokenizer may keep the former whole and split the latter into something like "p" and "ython". The tokenizer’s worldview is baked into every single prediction your model makes, so choose its training text wisely.
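One part of the “too large” cost is simple arithmetic: the input embedding table has one row per token, so its parameter count is the vocabulary size times the embedding width. The sketch below assumes an embedding width of 768 purely for illustration, and the three vocabulary sizes are arbitrary points on the character-to-word spectrum.

```python
d_model = 768  # assumed embedding width, for illustration only

# Arbitrary sizes: byte-like, typical subword, word-level territory.
vocab_sizes = (256, 32_000, 500_000)

# Embedding parameters = vocabulary size * embedding width.
params = {v: v * d_model for v in vocab_sizes}

for vocab_size, count in params.items():
    print(f"{vocab_size:>7,} tokens -> {count:>11,} embedding parameters")
```

The table grows linearly with the vocabulary, and every extra row is a token the model has to see often enough to learn.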