21.4 SentencePiece: Language-Agnostic Tokenization
Alright, let’s talk about SentencePiece. You’ve met BPE and WordPiece, which are brilliant but come with a massive, unspoken assumption: that you can pre-tokenize your text into neat little words using spaces. Cue the record scratch. This is a problem for a lot of the planet’s languages.
Languages like Japanese, Chinese, or Thai don’t use whitespace to separate words. Trying to pre-tokenize them is a fool’s errand that usually involves bringing in another, equally complex NLP model just to get started. And even for languages that do use spaces, what about all the other gunk? Punctuation, emoji, that weird combined apostrophe-word thing we do in English (“don’t” vs “do not”)? Pre-tokenizers have to make a bunch of arbitrary calls on how to handle that, and they often get it wrong, losing information before the main tokenization algorithm even gets a shot.
SentencePiece looks at this mess, says “hard pass,” and handles the raw text directly. It’s the only one of the three we’re discussing that is truly end-to-end. Its genius move is to treat the input text as a raw stream of Unicode characters, including the whitespace. This makes it completely language-agnostic. It couldn’t care less what script you’re using.
How SentencePiece Rolls: The Whitespace Trick
The core of its sorcery is how it handles the space. Since it operates on the raw character sequence, the space is just another character. But it’s a special character: it is invisible, and a tokenizer that silently drops or re-adds it can’t be reversed. To keep whitespace visible and unambiguous, SentencePiece replaces each space with a meta-character, the Unicode U+2581 “LOWER ONE EIGHTH BLOCK” (▁). It looks like an underscore but isn’t. This symbol is the word-boundary marker.
So, the sentence “Hello world” is initially represented as the character sequence: ▁ H e l l o ▁ w o r l d. The leading ▁ is a dummy prefix that SentencePiece adds by default, so that a word looks the same at the start of a sentence as it does in the middle.
The BPE algorithm (or Unigram, which we’ll get to) then runs on this sequence. You’ll end up with tokens like ['▁He', 'llo', '▁world']. The huge advantage? The model explicitly learns which tokens are word-initial and which are not. This is a superpower for downstream tasks. When you decode the token sequence, you just convert the ▁ back to a space, and voilà, perfect reconstruction. No more guessing if a token should have a preceding space or not.
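To make the whitespace trick concrete, here is a minimal pure-Python sketch of the escaping and unescaping steps. It is not SentencePiece’s actual code: the function names to_meta and from_pieces are made up for illustration, and the sketch assumes a dummy prefix is added, which is the library’s default behavior.

```python
from typing import List

META = "\u2581"  # "▁", the visible stand-in for a space


def to_meta(text: str, add_dummy_prefix: bool = True) -> str:
    """Replace spaces with the meta symbol, optionally marking the first word too."""
    if add_dummy_prefix:
        text = " " + text
    return text.replace(" ", META)


def from_pieces(pieces: List[str]) -> str:
    """Concatenate pieces, turn meta symbols back into spaces, drop the dummy prefix."""
    text = "".join(pieces).replace(META, " ")
    return text[1:] if text.startswith(" ") else text


print(to_meta("Hello world"))                 # ▁Hello▁world
print(from_pieces(["▁He", "llo", "▁world"]))  # Hello world
```

Notice that decoding never has to guess where spaces go: the information travels inside the pieces themselves.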
Picking Your Algorithm: BPE vs. Unigram
Here’s something most guides gloss over: SentencePiece isn’t an algorithm itself; it’s a system. Under the hood, it can use one of two subword algorithms: BPE (the one we know and love) or the Unigram Language Model algorithm. (The model_type option also accepts plain char and word modes, which we’ll skip here.)
BPE mode is what you’d expect. It starts with a base vocabulary of all individual characters and iteratively merges the most frequent pairs. It’s a bottom-up, greedy approach. It’s fast and effective, but note that it is not SentencePiece’s default: you have to ask for it with model_type='bpe'.
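The merge loop is easier to trust once you’ve watched it run. Below is a toy sketch of BPE merge learning over words that already carry the ▁ marker. It is an illustration only, not SentencePiece’s implementation: the function name learn_bpe and the tie-breaking rule (prefer the lexicographically largest pair) are assumptions made so the output is deterministic.

```python
from collections import Counter
from typing import List, Tuple


def learn_bpe(words: List[str], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to `num_merges` merges from words written with the ▁ marker."""
    # Each word starts as a tuple of single characters; the counter holds its frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties go to the lexicographically largest pair.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the winning pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


toy_words = ["▁low", "▁low", "▁lower", "▁lowest"]
print(learn_bpe(toy_words, 3))
# [('▁', 'l'), ('▁l', 'o'), ('▁lo', 'w')]
```

Three merges in, the toy vocabulary already contains ▁low as a single word-initial token, which is exactly the kind of piece the ▁ marker is there to produce.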
Unigram mode is the less famous but often more interesting choice. It works top-down. You start with a deliberately oversized seed vocabulary (all characters plus the most frequent substrings in the corpus) and a unigram language model that assigns each token a probability. Then you iteratively remove the tokens whose removal hurts the overall likelihood of your training corpus the least, until you hit the target vocabulary size. It’s a pruning method. Why would you use it? It often produces better tokenization for rare words because it evaluates the overall utility of a token, not just its frequency in a single merge step. The paper that introduced it (Kudo, 2018) reports translation quality on par with or better than BPE, particularly when combined with subword sampling.
You choose the algorithm with the model_type argument when you train the model.
import sentencepiece as spm

# Training a model with the BPE algorithm
spm.SentencePieceTrainer.train(
    input='my_corpus.txt',
    model_prefix='sp_bpe_model',
    vocab_size=8000,
    model_type='bpe',  # Must be set explicitly; the library's default is 'unigram'
)

# Training a model with the Unigram algorithm
spm.SentencePieceTrainer.train(
    input='my_corpus.txt',
    model_prefix='sp_unigram_model',
    vocab_size=8000,
    model_type='unigram',  # The default, spelled out here for clarity
)
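If the Unigram description still feels abstract, this sketch may help. It shows the segmentation half of the story: given a vocabulary with a probability per token, pick the split with the highest total log-probability using Viterbi search. The vocabulary and its probabilities are invented for the example, viterbi_segment is a hypothetical helper, and the real trainer also re-estimates probabilities with EM, which is left out here.

```python
import math
from typing import Dict, List


def viterbi_segment(text: str, log_probs: Dict[str, float]) -> List[str]:
    """Return the segmentation of `text` with the highest total log-probability."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best score of any split of text[:i]
    best_start = [0] * (n + 1)          # where the last piece of that split begins
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best_score[start] > -math.inf:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    best_start[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the text to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best_start[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


# Made-up probabilities, for illustration only.
probs = {"▁": 0.05, "h": 0.05, "e": 0.05, "l": 0.05, "o": 0.05,
         "ll": 0.05, "▁he": 0.1, "llo": 0.1, "▁hello": 0.02}
log_probs = {piece: math.log(p) for piece, p in probs.items()}

print(viterbi_segment("▁hello", log_probs))  # ['▁hello']

# Pruning in miniature: drop one token and see what the model falls back to.
pruned = {piece: lp for piece, lp in log_probs.items() if piece != "▁hello"}
print(viterbi_segment("▁hello", pruned))     # ['▁he', 'llo']
```

The drop in likelihood between the two segmentations is the kind of quantity the pruning step weighs: tokens whose removal costs the corpus little are the first to go.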
The Nasty Little Secret: It’s Not Fully Agnostic
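I told you it was language-agnostic, and for the most part, it is. But we have to be honest about the rough edges. By default, SentencePiece normalizes your input before it tokenizes anything: it applies an NFKC-based Unicode normalization rule, collapses runs of whitespace, and adds a dummy ▁ prefix at the start of the text. These are trainer options (normalization_rule_name, remove_extra_whitespaces, add_dummy_prefix), so you can change them, but only if you know they exist, and the choice is baked into the model file at training time.
The biggest offender? With the default settings, tabs and other whitespace variants are mapped to a plain space, repeated spaces collapse into one, and every space becomes ▁. Newlines fare no better: the trainer reads its corpus one sentence per line, so a newline is a sentence boundary rather than something to tokenize. This is usually what you want, until it very much isn’t.
The Pitfall: If your task genuinely cares about the difference between a space, a tab, and a newline (say, you’re tokenizing source code or a structured text file), a model trained on the defaults will gleefully obliterate that distinction. All of them become ▁. Poof. Information lost. You can’t recover it on the other side. For most natural language, this is fine. For code, it’s a potential disaster. If you need those distinctions, you have to plan for them at training time, for example with normalization_rule_name='identity' and remove_extra_whitespaces=False, and then check the round trip on your own data. It’s the one place where its “treat everything as a raw stream” philosophy has a pretty significant leak.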
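Here is the pitfall in miniature. The function below is a rough pure-Python stand-in for the default whitespace handling (collapse runs, trim the ends, add the dummy prefix, swap in ▁). It is an approximation written for this example, not the library’s normalizer, and the name approx_default_normalize is made up.

```python
import re

META = "\u2581"  # "▁"


def approx_default_normalize(text: str) -> str:
    """Approximate the default whitespace handling; NOT the library's real normalizer."""
    collapsed = re.sub(r"\s+", " ", text).strip()  # tabs, newlines, runs -> one space
    return (" " + collapsed).replace(" ", META)    # dummy prefix, then spaces -> ▁


a = "def f():\n\treturn 1"  # newline plus tab: meaningful in Python source
b = "def f(): return 1"     # a single space

print(approx_default_normalize(a))  # ▁def▁f():▁return▁1
print(approx_default_normalize(a) == approx_default_normalize(b))  # True
```

Two inputs that mean different things to a Python interpreter end up as the same character sequence, so no decoder can tell them apart afterwards.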
Using Your Trained Model
Using a trained SentencePiece model is refreshingly consistent. The same object handles both encoding and decoding, and it handles all the whitespace conversion for you seamlessly.
import sentencepiece as spm
# Load the model
sp = spm.SentencePieceProcessor()
sp.load('sp_bpe_model.model')
# Encode some text
text = "I don't know... 🤷 This is complex!"
pieces = sp.encode_as_pieces(text)
ids = sp.encode_as_ids(text)
print("Text:", text)
print("Pieces:", pieces) # e.g., ['▁I', '▁don', "'", 't', '▁know', '...', '▁', '🤷', '▁This', '▁is', '▁complex', '!']
print("IDs:", ids)
# Decode it back
decoded_text = sp.decode_pieces(pieces)
print("Decoded:", decoded_text)  # Matches the original as long as normalization left the text unchanged and no character mapped to <unk>
The best practice here is to always decode using the model’s own methods (decode_pieces or decode_ids). Don’t try to manually concatenate tokens yourself, because you’ll inevitably mess up the ▁-to-space conversion.
So, when should you reach for SentencePiece? Always, if you’re working with multiple languages or a language without spaces. And strongly consider it even for English, because its explicit word boundary markers are just a cleaner, more intelligent design. Just keep that whitespace normalization behavior in the back of your mind: it’s the one thing that can trip you up if your data isn’t plain old prose.