21.7 Special Tokens: BOS, EOS, PAD, and Custom Tokens
Right, let’s talk about the stagehands of the tokenization world: special tokens. You’ve got your <s>, </s>, [PAD], and a whole cast of custom characters. They don’t get the glory of the actual text tokens, but without them, the whole show falls apart. They’re the invisible scaffolding that holds your model’s input and output together, telling it where a sequence begins, ends, and how to deal with the messy reality of batching variable-length text.
Ignoring them is a common cause of “my model is producing gibberish” and “I’m getting a weird shape error” problems. So let’s get this sorted.
The Core Four (Plus One)
You’ll primarily deal with four special tokens. Their exact string representation can vary (e.g., <s> vs [CLS]), but their jobs are universal.
- BOS (Beginning of Sequence): This token marks the start of an input sequence. It’s the model’s “wake up” call. BERT-style encoders put [CLS] in the same position; it isn’t strictly a BOS token, but its final hidden state is often used as a pooled representation of the entire sequence for classification tasks. In decoder-only models (like GPT), BOS is simply the starting gun for generation, though not every tokenizer adds it for you automatically.
- EOS (End of Sequence): This is the model’s way of saying “I’m done talking.” It’s crucial for autoregressive generation, as the model learns to stop producing output once it predicts this token. During training, it helps the model learn the concept of a complete thought or sentence.
- PAD (Padding Token): Here’s where we get practical. We train models on batches of data for efficiency, but sentences aren’t all the same length. We pad shorter sequences up to a predefined max_length with this token. The critical part: we use an attention mask to tell the model “ignore the PAD tokens completely.” If we don’t, the model tries to find meaning in nothingness and it all goes pear-shaped.
- UNK (Unknown Token): The tokenizer’s white flag. When it encounters a word or subword it has absolutely no clue about (usually because it’s out-of-vocabulary), it maps it to this token. Seeing too many of these is a sign your tokenizer wasn’t trained on a representative corpus or you’re dealing with truly novel text (like misspelled garbage).
And then there’s the “plus one”:
- MASK (Masking Token): The star of masked language modeling, made famous by BERT. We randomly replace tokens with this guy and task the model with predicting the original. It’s a clever self-supervised training trick.
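To make the roles of BOS, EOS, and PAD concrete, here is a minimal pure-Python sketch of what happens when a batch is prepared. The token IDs (PAD_ID = 0, BOS_ID = 1, EOS_ID = 2) and the helper name prepare_batch are made up for illustration; real tokenizers define their own IDs and differ on whether they add BOS/EOS at all.

```python
# Hypothetical special-token IDs, chosen only for this sketch.
PAD_ID, BOS_ID, EOS_ID = 0, 1, 2

def prepare_batch(sequences, max_length):
    """Wrap each sequence in BOS/EOS, then right-pad to max_length.

    Returns (input_ids, attention_mask) as lists of lists.
    """
    input_ids, attention_mask = [], []
    for seq in sequences:
        ids = [BOS_ID] + list(seq) + [EOS_ID]
        ids = ids[:max_length]                 # naive truncation (may drop EOS)
        n_pad = max_length - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask

batch_ids, batch_mask = prepare_batch([[11, 12, 13], [21]], max_length=6)
print(batch_ids)   # [[1, 11, 12, 13, 2, 0], [1, 21, 2, 0, 0, 0]]
print(batch_mask)  # [[1, 1, 1, 1, 1, 0], [1, 1, 1, 0, 0, 0]]
```

Both rows end up the same length, and the mask records exactly which positions hold real tokens (1) and which are filler (0).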
How They Actually Work in Code
Don’t just take my word for it. Let’s see how this plays out with the Hugging Face transformers library, which handles this with more grace than most.
from transformers import AutoTokenizer
# Let's use a common model, like GPT-2
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# By default, GPT-2's tokenizer doesn't add special tokens automatically. Let's see what it does.
print("Default behavior:")
print(tokenizer("Hello, how are you?"))
# Output: {'input_ids': [15496, 11, 703, 389, 345, 30], 'attention_mask': [1, 1, 1, 1, 1, 1]}
# Notice: no BOS or EOS. For GPT-2, that stays true even when we pad.
print("\nWith padding:")
# GPT-2 doesn't have a dedicated pad token, so we reuse EOS
# (a common, if slightly messy, practice)
tokenizer.pad_token = tokenizer.eos_token
output = tokenizer(
    "Hello, how are you?",
    padding="max_length",  # Pad to max_length
    max_length=10,         # Let's make it 10
    truncation=True,       # In case our text is too long
    return_tensors="pt",   # Return PyTorch tensors
)
print("Input IDs:", output["input_ids"])
print("Attention Mask:", output["attention_mask"])
# Output will look something like (exact spacing may differ):
# Input IDs: tensor([[15496, 11, 703, 389, 345, 30, 50256, 50256, 50256, 50256]])
# Attention Mask: tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])
See what happened? Even with padding switched on, GPT-2’s tokenizer added no BOS or EOS around the text: the six real tokens are unchanged. It then padded the sequence up to max_length=10 with four copies of ID 50256, the <|endoftext|> token that GPT-2 uses as both BOS and EOS (and now, because we set pad_token = eos_token, as PAD too). If you want an explicit BOS or EOS with GPT-2, you have to add it to the text yourself, e.g. tokenizer.bos_token + text. Tokenizers such as BERT’s, by contrast, do insert their special tokens ([CLS] and [SEP]) by default. The attention mask is 1 for all real tokens and 0 for all padding tokens. This mask is your golden ticket: you must pass it to the model during training and inference.
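Why does the mask matter so much? A rough sketch, assuming a simplified single-query attention step: positions where the mask is 0 have their scores pushed to negative infinity before the softmax, so they receive exactly zero weight. The function name masked_softmax and the scores below are invented for illustration; real models do this inside their attention layers, typically by adding a large negative number to the scores.

```python
import math

def masked_softmax(scores, attention_mask):
    """Softmax over scores, giving zero weight where attention_mask is 0."""
    # Masked positions become -inf, so exp() turns them into exactly 0.0.
    masked = [s if m == 1 else float("-inf")
              for s, m in zip(scores, attention_mask)]
    peak = max(masked)                        # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up attention scores: 3 real tokens followed by 2 padding slots.
scores = [0.5, 1.0, 0.2, 2.0, 2.0]
mask = [1, 1, 1, 0, 0]

with_mask = masked_softmax(scores, mask)
without_mask = masked_softmax(scores, [1] * len(scores))

print(sum(with_mask[3:]))     # 0.0: padding gets no attention
print(sum(without_mask[3:]))  # most of the weight lands on padding
```

Without the mask, the padding slots here happen to have the highest scores and soak up most of the attention, which is the "finding meaning in nothingness" failure in miniature.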
The Custom Token Trap
Sometimes, the pre-defined special tokens aren’t enough. You might need to add a special [SEP] token for separating sentences or a [DOMAIN_XYZ] token to give your model a hint. Most tokenizers let you add these.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Let's say we want to add a custom token for a specific task
custom_token = "[MY_SPECIAL_TOKEN]"
# Check if it exists first (it probably doesn't)
if custom_token not in tokenizer.get_vocab():
    num_added_tokens = tokenizer.add_tokens([custom_token])
    print(f"Added {num_added_tokens} tokens. New vocab size: {len(tokenizer)}")
# Important: If you're training, you need to resize your model's embedding layer too!
# model.resize_token_embeddings(len(tokenizer))
# Now use it
output = tokenizer("This is a sentence with " + custom_token, return_tensors="pt")
print(output["input_ids"])
The big, honking “gotcha” here that everyone misses: adding a token changes the vocabulary size. If you’ve already loaded a pre-trained model, its embedding layer is a fixed matrix. You must resize it (model.resize_token_embeddings(len(tokenizer))) to accommodate the new token, otherwise you’ll get a catastrophic index-out-of-bounds error. The new embedding will be initialized randomly, so you’ll probably need to fine-tune the model to teach it what this new token means.
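The failure mode is easy to reproduce without any model at all. Below is a toy sketch that treats the embedding layer as a plain list of rows; resize_embeddings is a hypothetical stand-in for what model.resize_token_embeddings does conceptually, and the 0.02 standard deviation for the new row is an assumption picked for the example, not the library's documented initialization.

```python
import random

def resize_embeddings(embeddings, new_vocab_size, dim, seed=0):
    """Return a copy of `embeddings` grown to new_vocab_size rows.

    Existing rows are kept as-is; new rows get small random values.
    """
    rng = random.Random(seed)
    resized = [row[:] for row in embeddings]
    while len(resized) < new_vocab_size:
        resized.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return resized

dim = 4
old_vocab_size = 5                       # toy "pre-trained" vocabulary
embeddings = [[float(i)] * dim for i in range(old_vocab_size)]

new_token_id = old_vocab_size            # the ID a newly added token would get
try:
    embeddings[new_token_id]
except IndexError:
    print("lookup failed: ID", new_token_id, "has no embedding row")

embeddings = resize_embeddings(embeddings, old_vocab_size + 1, dim)
print(len(embeddings))                   # 6
print(embeddings[0])                     # [0.0, 0.0, 0.0, 0.0] (unchanged)
```

The old rows survive untouched, while the new row is just noise until fine-tuning gives it meaning.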
Best Practices and Pitfalls
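- Never Let Padding Leak Into Your Loss: This is non-negotiable. When calculating your loss (e.g., with CrossEntropyLoss), you must exclude the padding positions, or the model gets graded on predicting filler. In PyTorch, the usual approach is to set the labels at padded positions to -100, which CrossEntropyLoss skips by default (ignore_index=-100); the attention mask tells you which positions those are. Good training loops often handle this automatically, but check that yours does.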
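- Choose Your Pad Token Wisely: Using EOS as PAD (like in the GPT-2 example) is convenient but can be conceptually messy. The concrete risk: if your training code masks the loss wherever the token ID equals the pad ID, it also masks every genuine EOS, and the model never learns when to stop. Building the loss mask from the attention mask instead avoids this. Sharing the token often works in practice, but be aware of the potential for weirdness.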
- Truncation is a Necessary Evil: You have to set a max_length. Text longer than this gets unceremoniously chopped off. There’s no elegant solution. Choose a length that covers the vast majority of your data without being so large it causes memory issues. It’s a trade-off.
- Your Tokenizer and Model Must Agree: This feels obvious, but I’ve seen it blow up countless times. The special token mappings (their IDs, their meanings) are baked into the pre-trained model’s weights. If you use a different tokenizer with a model, or change the tokenizer’s special tokens after the fact, you’re asking for incomprehensible garbage output. Always use the tokenizer that belongs to your model.
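The first two points in the list above can be sketched in a few lines of plain Python. Everything here is illustrative: the token IDs are invented, labels_from_mask and mean_nll are hypothetical helpers, and the label shift used in real causal language modeling is left out to keep the example small. The one borrowed fact is that -100 is the default ignore_index of PyTorch's CrossEntropyLoss.

```python
import math

IGNORE_INDEX = -100    # default ignore_index of PyTorch's CrossEntropyLoss
EOS_ID = PAD_ID = 9    # hypothetical ID, shared as in the EOS-as-PAD setup

def labels_from_mask(input_ids, attention_mask):
    """Copy input_ids, replacing padded positions with IGNORE_INDEX."""
    return [tok if m == 1 else IGNORE_INDEX
            for tok, m in zip(input_ids, attention_mask)]

def mean_nll(log_probs, labels):
    """Average negative log-likelihood over non-ignored positions.

    log_probs[i] is a dict {token_id: log-probability} for position i.
    """
    kept = [-lp[y] for lp, y in zip(log_probs, labels) if y != IGNORE_INDEX]
    return sum(kept) / len(kept)

input_ids      = [5, 6, 7, EOS_ID, PAD_ID, PAD_ID]
attention_mask = [1, 1, 1, 1,      0,      0]

good_labels = labels_from_mask(input_ids, attention_mask)
# Tempting shortcut that breaks when PAD and EOS share an ID:
bad_labels = [IGNORE_INDEX if tok == PAD_ID else tok for tok in input_ids]

print(good_labels)  # [5, 6, 7, 9, -100, -100]   (the real EOS is still a target)
print(bad_labels)   # [5, 6, 7, -100, -100, -100] (no stop target left)

# Toy predictions: probability 0.5 for the correct token at every position.
log_probs = [{tok: math.log(0.5)} for tok in input_ids]
print(round(mean_nll(log_probs, good_labels), 4))  # 0.6931, over 4 positions
```

Deriving the labels from the attention mask keeps the genuine EOS in the loss while still dropping the padding, which is why it is the safer habit when the two tokens share an ID.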
Think of special tokens not as an afterthought, but as the first piece of configuration you set up. They control the very grammar of how your model sees the world. Get them right, and everything else gets a whole lot easier.