Tokenization | mikePietsch.com

37.7 NLTK: Classical NLP Toolkit

Right, let’s talk about NLTK. If you’re in this field, you’ve probably heard of it. The Natural Language Toolkit is the grand old dame of Python NLP libraries. It’s not the fastest, it’s not the shiniest, but it’s a fantastic pedagogical tool and a reliable workhorse for a lot of classical, non-neural NLP tasks. Think of it as your well-stocked, slightly dusty university lab—everything you need to understand the fundamentals is in here, even if the new grad students are all running off to the fancy new building with the laser cutters (that’s spaCy and Hugging Face, by the way).

37.6 spaCy: Industrial-Strength NLP Pipelines

Alright, let’s get our hands dirty with spaCy. Forget those academic toolkits that feel like they’re held together with string and theoretical hope; spaCy is the one you actually want to use to build something real. It’s a library built by people who clearly had to meet a deadline and deal with messy, real-world text. It’s fast, it’s efficient, and its API is so sensible you’ll want to weep with joy after using some of the alternatives.

37.5 Coreference Resolution

Right, coreference resolution. This is where your NLP pipeline stops just pointing at words and starts actually reading. It’s the task of figuring out all the nouns and pronouns that refer to the same real-world entity. When I say “The model loaded its weights. It was trained for weeks,” you know that “It” and “its” are pointing back to “The model.” You do this effortlessly. Getting a computer to do it is, predictably, a bit of a circus.

37.4 Dependency Parsing: Syntactic Structure of Sentences

Right, so you’ve tagged your words, you’ve found your entities, and now you’re staring at a sentence like “The old man the boat.” and your brain just did a little somersault, didn’t it? Welcome to the party. This is why we need dependency parsing. It’s the process of mapping out the grammatical structure of a sentence and figuring out how the words relate to each other. It’s the difference between seeing a pile of lumber and seeing the blueprint for the house.

37.3 Named Entity Recognition: Rule-Based and Neural Approaches

Right, let’s talk about Named Entity Recognition, or NER. Your goal here is simple: teach a machine to read a sentence like “Apple is looking to buy a U.K. startup for $1 billion” and not have an existential crisis about whether we’re discussing fruit, a tech giant, or a very expensive piece of produce. It’s the process of finding and classifying named entities—things like people, organizations, locations, monetary values, and more—into pre-defined categories.

37.2 Part-of-Speech Tagging

Right, let’s talk about giving words jobs. That’s essentially what Part-of-Speech (PoS) tagging is. You’ve got a string of words, and your job is to assign each one a grammatical role: is it a noun, a verb, an adjective? This isn’t just academic hoop-jumping; it’s the bedrock for almost everything interesting in NLP. You can’t figure out who did what to whom (“The dog chased the cat” vs. “The cat chased the dog”) until you know which words are the nouns and which one is the verb. It’s the first step in turning text into structured data instead of just a bag of words.

37.1 Text Preprocessing: Lowercasing, Stemming, Lemmatization

Right, let’s get your text ready for the real NLP heavy lifting. Think of this step as the pre-flight checklist. You wouldn’t try to fly a jet with mud on the wings, and you shouldn’t try to train a model on raw, chaotic text. Our goal here is to reduce noise and variation without losing the essential meaning. We’re standardizing. We’re simplifying. We’re making the data’s life less complicated so our models can have an easier time finding the signal.
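
To make the three steps in the heading concrete, here is a minimal sketch in plain Python. The suffix list and the tiny lemma dictionary are made up for illustration; real stemmers (Porter, Snowball) and lemmatizers use far richer rules and lexicons, so treat `naive_stem` and `naive_lemma` as hypothetical stand-ins rather than anything a library ships.

```python
# Toy preprocessing: lowercasing, crude suffix stemming, dictionary lemmatization.
# SUFFIXES and LEMMAS are illustrative assumptions, not a real linguistic resource.
SUFFIXES = ("ing", "ed", "es", "s")
LEMMAS = {"ran": "run", "mice": "mouse"}


def naive_stem(word):
    """Chop the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def naive_lemma(word):
    """Look the word up in a hand-written lemma table; fall back to the word itself."""
    return LEMMAS.get(word, word)


text = "The Cats Ran Jumping"
tokens = text.lower().split()            # lowercasing + whitespace split
stems = [naive_stem(t) for t in tokens]  # rule-based chopping
lemmas = [naive_lemma(t) for t in tokens]  # dictionary lookup

print(tokens)
print(stems)
print(lemmas)
```

Notice the trade-off: the stemmer happily turns "cats" into "cat" but has no idea that "ran" is a form of "run", while the lemma lookup gets "ran" right and only knows the words someone bothered to list.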

37. NLP Fundamentals: Tokenization, PoS, NER, and Parsing

21.8 Token Counting and Cost Estimation

Right, let’s talk about money. Not the abstract “compute” kind, but the very real, “oh god, my API bill” kind. When you’re working with Large Language Models, every token is a tiny little digital penny flying out of your wallet. Understanding how to count them and estimate cost isn’t just academic; it’s a survival skill. I’ve seen more than one developer’s jaw drop when they got their first monthly invoice after a careless for loop. Let’s make sure that’s not you.
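
Here is a back-of-the-envelope sketch of that arithmetic. The prices are placeholders I made up, and the four-characters-per-token figure is only a rough rule of thumb for English text; for a real bill you would count tokens with the model's actual tokenizer and plug in your provider's current rates.

```python
def rough_token_count(text, chars_per_token=4):
    """Very rough estimate: assumes ~4 characters per token (an assumption)."""
    return -(-len(text) // chars_per_token)  # ceiling division


def estimate_cost(input_tokens, output_tokens, usd_per_1k_input, usd_per_1k_output):
    """Cost of one call, given per-1,000-token prices for input and output."""
    return (input_tokens / 1000) * usd_per_1k_input + (
        output_tokens / 1000
    ) * usd_per_1k_output


# Hypothetical prices, purely for illustration.
PRICE_IN = 0.01   # USD per 1,000 input tokens
PRICE_OUT = 0.03  # USD per 1,000 output tokens

calls = 1000  # e.g. that careless for loop
per_call = estimate_cost(500, 200, PRICE_IN, PRICE_OUT)
total = per_call * calls

print(f"per call: ${per_call:.4f}, total for {calls} calls: ${total:.2f}")
```

The point of the exercise is the multiplication at the end: a cost that looks like a rounding error per call stops looking like one once a loop is involved.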

21.7 Special Tokens: BOS, EOS, PAD, and Custom Tokens

Right, let’s talk about the stagehands of the tokenization world: special tokens. You’ve got your <s>, </s>, [PAD], and a whole cast of custom characters. They don’t get the glory of the actual text tokens, but without them, the whole show falls apart. They’re the invisible scaffolding that holds your model’s input and output together, telling it where a sequence begins, ends, and how to deal with the messy reality of batching variable-length text.
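
A small sketch of what that scaffolding looks like in a batch. The token strings are the ones named above; the function name `prepare_batch`, the right-padding choice, and the 1/0 attention mask are my assumptions for illustration, since real tokenizers differ on all three.

```python
BOS, EOS, PAD = "<s>", "</s>", "[PAD]"


def prepare_batch(sequences):
    """Wrap each sequence in BOS/EOS, then right-pad to the longest one.

    Returns the padded sequences plus a mask with 1 for real tokens
    and 0 for padding, so a model can ignore the filler.
    """
    wrapped = [[BOS] + list(seq) + [EOS] for seq in sequences]
    max_len = max(len(seq) for seq in wrapped)
    padded, masks = [], []
    for seq in wrapped:
        n_pad = max_len - len(seq)
        padded.append(seq + [PAD] * n_pad)
        masks.append([1] * len(seq) + [0] * n_pad)
    return padded, masks


batch = [["hello", "world"], ["hi"]]
padded, masks = prepare_batch(batch)

for row, mask in zip(padded, masks):
    print(row, mask)
```

Every row comes out the same length, which is exactly what a tensor needs, and the mask records which positions are text and which are stuffing.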

21.6 The Effect of Tokenization on Model Performance

Now, let’s get to the heart of the matter: why you should care about this text-slicing nonsense in the first place. It’s not just a pre-processing step; it’s the first and most fundamental act of translation between your world and the model’s. Get it wrong, and you’re building a palace on a wobbly foundation. The choice of tokenizer and its vocabulary directly shapes what your model can even conceive, let alone learn.

21.5 Tiktoken: OpenAI's Fast BPE Implementation

Alright, let’s talk about Tiktoken. No, it’s not a social media app for time-traveling insects. It’s OpenAI’s blisteringly fast implementation of Byte Pair Encoding (BPE), and it’s the reason their models can chomp through your prompts without taking a coffee break first. You might be thinking, “BPE? Been there, done that, got the t-shirt.” But hold on. While the core algorithm is the same (merging the most frequent pairs of bytes or characters), OpenAI’s implementation is a masterclass in optimization. They didn’t just implement BPE; they weaponized it for production at a scale that would make most of our laptops spontaneously combust. The key difference is that the merges for a given vocabulary ship precomputed as a rank lookup table, and the merge loop itself runs in compiled Rust, so tokenizing your text means applying known merges at speed rather than learning anything on the fly. It’s the difference between building a car every time you need to go to the store and just having a car in your garage.

21.4 SentencePiece: Language-Agnostic Tokenization

Alright, let’s talk about SentencePiece. You’ve met BPE and WordPiece, which are brilliant but come with a massive, unspoken assumption: that you can pre-tokenize your text into neat little words using spaces. Cue the record scratch. This is a problem for a large share of the planet’s languages. Languages like Japanese, Chinese, or Thai don’t use whitespace to separate words. Trying to pre-tokenize them is a fool’s errand that usually involves bringing in another, equally complex NLP model just to get started. And even for languages that do use spaces, what about all the other gunk? Punctuation, emoji, that weird combined apostrophe-word thing we do in English (“don’t” vs “do not”)? Pre-tokenizers have to make a bunch of arbitrary calls on how to handle that, and they often get it wrong, losing information before the main tokenization algorithm even gets a shot.

21.3 WordPiece: BERT's Tokenization Algorithm

Alright, let’s talk about WordPiece. You know BPE, that scrappy algorithm that builds a vocabulary by gluing the most frequent pairs of bytes together? WordPiece is its more meticulous, slightly neurotic cousin who showed up to the party with a spreadsheet. It’s the algorithm that powers BERT and its many descendants, and it was designed with one primary goal in mind: to handle the messiness of human language as effectively as possible for a masked language model.

21.2 Byte Pair Encoding (BPE): Building the Vocabulary from Merges

Alright, let’s get our hands dirty with Byte Pair Encoding (BPE). Forget the intimidating name for a second; the core idea is so stupidly simple a compression engineer from the 90s would slap their forehead and say “why didn’t I think of that for text?” Actually, they did think of it for text, but we’ve since weaponized it for AI. BPE is a data compression algorithm that we’ve hijacked to solve a fundamental NLP problem: how do you build a vocabulary for a model when words are infinite but your GPU’s memory is very, very finite?
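
If you want to see how simple the idea really is, here is a minimal sketch of the merge-learning loop on a toy corpus. It works on characters rather than raw bytes, skips end-of-word markers, and breaks ties by whichever pair was seen first; those are my simplifications, and the function names are made up for this example.

```python
from collections import Counter


def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus_counts, num_merges):
    """Start from single characters and greedily merge the most frequent pair."""
    word_freqs = {tuple(word): freq for word, freq in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(word_freqs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the first pair seen
        merges.append(best)
        word_freqs = merge_pair(best, word_freqs)
    return merges, word_freqs


# Toy corpus: word -> frequency.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(corpus, num_merges=3)

print(merges)
print(segmented)
```

The list of merges is the vocabulary recipe: the number of merges you allow is the knob that trades vocabulary size against sequence length.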

21.1 Why We Tokenize: Characters, Words, and Subwords

Right, let’s get this out of the way: you can’t just feed raw text into a model. I know, I know, it feels like we should be able to. You and I can read a string of characters and understand it, but your model sees a sequence of numbers. Tokenization is the process of figuring out how to turn text into those numbers in a way that’s actually useful. It’s the first, most crucial, and often most infuriating step in the whole NLP pipeline. Get it wrong here, and everything that follows is built on a shaky foundation.
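
To see the "text into numbers" step at its most basic, here is a tiny sketch that encodes the same sentence at the character level and at the word level. The helper `build_vocab` is a made-up name, and assigning ids in order of first appearance is just one arbitrary choice among many.

```python
def build_vocab(tokens):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


text = "the cat sat on the mat"

char_tokens = list(text)    # one token per character, spaces included
word_tokens = text.split()  # one token per whitespace-separated word

char_vocab = build_vocab(char_tokens)
word_vocab = build_vocab(word_tokens)

char_ids = [char_vocab[t] for t in char_tokens]
word_ids = [word_vocab[t] for t in word_tokens]

print(len(char_vocab), len(char_ids))  # small vocabulary, long sequence
print(len(word_vocab), len(word_ids))  # shorter sequence, vocabulary grows with the corpus
print(word_ids)
```

Characters give you a tiny vocabulary and long sequences; words give you short sequences and a vocabulary that balloons with every new word. Subword schemes exist to sit between those two extremes.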