81.1 NLTK: Tokenization, Stemming, POS Tagging, and Corpora
Before we dive into the fancy deep learning stuff, we need to talk about the fundamentals. And for that, we’re going to spend some quality time with NLTK, the Natural Language Toolkit. Think of it not as the shiny new power tool, but as the rock-solid, slightly-scuffed-but-infinitely-reliable toolbox your grandpa gave you. It’s where you learn the why before you rely on the wow of modern transformers.
Hugging Face’s transformers library is incredible, but it often feels like magic. NLTK is where the magicians learn how the tricks are actually done. It provides the essential utilities (tokenization, stemming, part-of-speech tagging) that classical NLP pipelines are built on, and the same concepts still matter when a billion-parameter model handles much of the work for you.
Tokenization: The First Cut Is the Deepest
Your first step with any text is to break it down into manageable pieces. This is tokenization, and it’s less obvious than you think. Is “don’t” one token or two? What about “U.K.” or “rock ’n’ roll”? NLTK gives you a few choices.
The word_tokenize function is your workhorse. It follows the Penn Treebank tokenization conventions, which are sensible and widely adopted. Contractions are split into their parts, so “don’t” comes out as do and n't.
import nltk
nltk.download('punkt')  # One-time download. Recent NLTK releases look for 'punkt_tab' instead.
from nltk.tokenize import word_tokenize, sent_tokenize
text = "Dr. Smith paid $29.99 for a Python book. He said, 'It's worth it!'"
# Sentence tokenization: Split into sentences.
sentences = sent_tokenize(text)
print("Sentences:", sentences)
# Word tokenization: Split into words/tokens.
tokens = word_tokenize(text)
print("Tokens:", tokens)
Output:
Sentences: ['Dr. Smith paid $29.99 for a Python book.', "He said, 'It's worth it!'"]
Tokens: ['Dr.', 'Smith', 'paid', '$', '29.99', 'for', 'a', 'Python', 'book', '.', 'He', 'said', ',', "'It", "'s", 'worth', 'it', '!', "'"]
See the issue? The opening quote stayed glued to the first word ('It), the contraction was split off as its own token ('s), and the closing quote ended up as a separate token after the exclamation mark. This is the Penn Treebank way, and while it’s consistent, it can be messy for downstream tasks. This is a classic pitfall: your tokenizer’s choices directly impact everything that comes next. Always inspect your tokens; never assume they’ll be perfect.
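To see how much the splitting rule itself matters, here is a minimal sketch that uses only Python's re module. Both patterns are assumptions made up for illustration; neither is what word_tokenize does internally.

```python
import re

text = "Don't panic: the U.K. loves rock 'n' roll."

# Rule A (illustrative assumption): runs of word characters, or single punctuation marks.
naive = re.findall(r"\w+|[^\w\s]", text)

# Rule B (also an assumption): keep dotted abbreviations and
# word-internal apostrophes together, then fall back to single punctuation marks.
careful = re.findall(r"(?:[A-Za-z]\.){2,}|\w+(?:'\w+)*|[^\w\s]", text)

print("Rule A:", naive)
print("Rule B:", careful)
```

Rule A shreds both the contraction and the abbreviation, while Rule B keeps them whole but still cannot make sense of the quotes around 'n'. Neither rule is right in general, which is the point: every tokenizer encodes a set of decisions like these.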
Stemming and Lemmatization: Chopping Words Down to Size
Often, you care about the core meaning of a word, not its grammatical form. “Running,” “runs,” and “ran” all point to the concept of “run.” Stemming and lemmatization reduce words to their base form.
- Stemming is a crude, heuristic process of chopping off suffixes. It’s fast but dumb. The most famous stemmer is the Porter stemmer, an algorithm that feels like it was designed by someone with a deep-seated grudge against the letter ’e’.
- Lemmatization is smarter and more sophisticated. It uses a vocabulary and morphological analysis to return the actual base word, or lemma. “Better” becomes “good,” “ran” becomes “run.” This requires knowing the word’s part of speech, which is why it’s slower and more complex.
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet') # The lexical database for lemmatization
nltk.download('omw-1.4')  # Open Multilingual WordNet; some NLTK versions need it for the lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
words = ["running", "runs", "ran", "better", "cats", "geese"]
print("Stemming Results:")
for word in words:
    print(f"{word} -> {stemmer.stem(word)}")
print("\nLemmatization Results (default - noun):")
for word in words:
    print(f"{word} -> {lemmatizer.lemmatize(word)}")  # Without POS tag, assumes noun
print("\nLemmatization Results (with correct POS - verb or adj):")
# 'v' for verb, 'a' for adjective
print(f"running (v) -> {lemmatizer.lemmatize('running', 'v')}")
print(f"better (a) -> {lemmatizer.lemmatize('better', 'a')}")
Output:
Stemming Results:
running -> run
runs -> run
ran -> ran
better -> better
cats -> cat
geese -> gees
Lemmatization Results (default - noun):
running -> running # Oops, without 'v' tag, it thinks it's a noun.
runs -> run
ran -> ran
better -> better
cats -> cat
geese -> goose
Lemmatization Results (with correct POS - verb or adj):
running (v) -> run
better (a) -> good
The stemmer left “ran” and “better” untouched, because no suffix rule connects them to “run” or “good,” and it mangled “geese” into the non-word “gees.” The lemmatizer handled “geese” out of the box and, once given the correct part of speech, resolved “running” and “better” too. The practical takeaway: prefer lemmatization when you can afford the extra cost, and always provide the part-of-speech tag. Otherwise, you’re not getting the full benefit.
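The difference between the two approaches boils down to rules versus lookup, which a toy sketch can make concrete. Everything below is hypothetical: the three suffix rules are far cruder than Porter's algorithm, and the four-entry dictionary merely stands in for a real lexicon such as WordNet.

```python
# Toy suffix rules (hypothetical, much simpler than the Porter stemmer).
SUFFIXES = ("ing", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Tiny hand-written lookup keyed by (word, part of speech).
TOY_LEMMAS = {
    ("running", "v"): "run",
    ("ran", "v"): "run",
    ("better", "a"): "good",
    ("geese", "n"): "goose",
}

def toy_lemmatize(word, pos="n"):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return TOY_LEMMAS.get((word, pos), word)

for word in ["running", "runs", "ran", "geese"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word, "v"), "|", toy_lemmatize(word))
```

The rule-based function never needs a dictionary but has no way to connect “ran” to “run.” The lookup gets irregular forms right, but only when it is asked with the right part of speech.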
Part-of-Speech Tagging: Grammar Classifier
Which brings us to Part-of-Speech (POS) tagging: labeling each word in a sentence as a noun, verb, adjective, etc. NLTK’s default POS tagger is pretty good. It’s an averaged perceptron model trained on the Penn Treebank corpus, and it tags each word using the surrounding words and the tags it has already assigned.
nltk.download('averaged_perceptron_tagger')  # Recent NLTK releases use 'averaged_perceptron_tagger_eng'
tokens = word_tokenize("The quick brown fox jumps over the lazy dog.")
pos_tags = nltk.pos_tag(tokens)
print("POS Tags:", pos_tags)
Output:
POS Tags: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]
Those cryptic codes (JJ, NN, VBZ) are the Penn Treebank tags. You’ll need to look them up; nobody expects you to memorize them. “VBZ” means verb, 3rd person singular present, which is exactly what “jumps” is. The tagger is accurate on well-formed text, but it’s not perfect: ambiguous words (is “brown” an adjective or a noun?) and unusual sentence structure can trip it up, and the exact tags you see may differ slightly between NLTK versions. For most practical purposes, though, it’s reliable.
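Since the lemmatizer wants a part-of-speech hint and the tagger produces Penn Treebank tags, a small bridge between the two is a common pattern. The sketch below is one way to write it; the helper name penn_to_wordnet is made up here, and it relies on WordNet's single-letter POS codes ('a', 'v', 'r', 'n').

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith("JJ"):
        return "a"  # adjective: JJ, JJR, JJS
    if tag.startswith("VB"):
        return "v"  # verb: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("RB"):
        return "r"  # adverb: RB, RBR, RBS
    return "n"      # everything else falls back to noun, the lemmatizer's default

# A few (token, tag) pairs copied from the tagger output above.
tagged = [("quick", "JJ"), ("fox", "NN"), ("jumps", "VBZ"), ("over", "IN")]
wordnet_tags = [(token, penn_to_wordnet(tag)) for token, tag in tagged]
print(wordnet_tags)
```

With this in place, each tagged token can be lemmatized as lemmatizer.lemmatize(token, penn_to_wordnet(tag)). Falling back to noun for prepositions and other function words is a simplification; those words rarely change under lemmatization anyway.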
Corpora: The Data You Didn’t Know You Needed
Finally, NLTK gives you easy access to a ton of corpora and linguistic data, each fetched with a one-line nltk.download call: everything from Shakespeare’s works to the WordNet lexical database to lists of stopwords. This is its secret weapon for teaching and prototyping.
nltk.download('stopwords')
nltk.download('gutenberg')  # Needed for the Project Gutenberg sample below
from nltk.corpus import stopwords, gutenberg
# Stopwords: common words often filtered out
stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
print("Tokens without stopwords:", filtered_tokens)
# Accessing a corpus
print("\nFirst 100 chars of Austen's 'Emma':")
print(gutenberg.raw('austen-emma.txt')[:100])
This is invaluable. Want to test a new idea on a classic text without scouring the web? It’s right there. Need a list of common words to filter? Sorted. NLTK understands that data is the first hurdle in learning NLP, and it gracefully removes that barrier.
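A typical first experiment on any text is counting its most frequent content words after removing stopwords. The sketch below shows the idea with the standard library only, so it runs without any downloads; the sample sentence and the three-word stopword set are made-up stand-ins for a real corpus and for stopwords.words('english').

```python
import re
from collections import Counter

# Made-up sample text and a hand-picked stopword subset (both assumptions).
sample = "The quick brown fox jumps over the lazy dog. The dog barks, and the fox runs."
toy_stopwords = {"the", "over", "and"}

# Lowercase, keep alphabetic runs, then drop the stopwords.
words = re.findall(r"[a-z]+", sample.lower())
content_words = [w for w in words if w not in toy_stopwords]

counts = Counter(content_words)
print(counts.most_common(3))
```

Swapping in gutenberg.words('austen-emma.txt') for the word list and the NLTK stopword set for toy_stopwords should turn this into the same experiment on a full novel.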
So, while you might eventually graduate to the transformer-based big leagues, your time with NLTK is what will give you the intuition to understand why those models work and, more importantly, when they fail. It’s the foundation. And you always need a good foundation, even if you’re building a skyscraper.