37.2 Part-of-Speech Tagging
Right, let’s talk about giving words jobs. That’s essentially what Part-of-Speech (PoS) tagging is. You’ve got a string of words, and your job is to assign each one a grammatical role: is it a noun, a verb, an adjective? This isn’t just academic hoop-jumping; it’s the bedrock for almost everything interesting in NLP. You can’t figure out who did what to whom (“The dog chased the cat” vs. “The cat chased the dog”) if you don’t know which is the noun and which is the verb. It’s the first step in making text structured data instead of just a bag of words.
Now, you might be thinking, “How hard could this be? I learned this in third grade.” And for a human, it’s trivial. For a computer, it’s a nightmare of ambiguity. Let’s take the word “saw.” Go on, give it a PoS tag. You can’t, can you? Without context, it could be a noun (“I used a saw”), a verb (“I saw a movie”), or a noun modifying another noun (“saw blade”). This is the crux of the problem. The earliest algorithms were basically giant lookup tables with some hand-crafted rules for context (e.g., “if the previous word is ’the’, the next word is probably a noun”). These were… not great. They broke immediately upon encountering anything they hadn’t been explicitly told about.
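To make the lookup-table-plus-rules idea concrete, here is a minimal sketch of such a tagger. The lexicon, the single context rule, and the name rule_based_tag are all invented for illustration; no historical system looked exactly like this.

```python
# Toy lexicon: each known word maps to one default tag (hypothetical data).
LEXICON = {
    "the": "DT",
    "a": "DT",
    "i": "PRP",
    "saw": "VBD",   # default guess: past tense of "see"
    "used": "VBD",
    "movie": "NN",
}


def rule_based_tag(tokens):
    """Tag tokens with a lookup table plus one hand-written context rule."""
    tags = []
    for token in tokens:
        # Words missing from the table get a placeholder tag.
        tag = LEXICON.get(token.lower(), "UNK")
        # Context rule: a word right after a determiner is probably a noun.
        if tags and tags[-1] == "DT" and tag != "DT":
            tag = "NN"
        tags.append(tag)
    return list(zip(tokens, tags))


print(rule_based_tag(["I", "used", "a", "saw"]))
print(rule_based_tag(["I", "saw", "a", "movie"]))
print(rule_based_tag(["the", "quick", "fox"]))
```

The first two calls get “saw” right in both readings, because the determiner rule overrides the default. The third shows both failure modes at once: “quick” is forced to ‘NN’ by the rule even though it is an adjective, and “fox” falls out of the table entirely.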
How It Actually Works: Statistical Sorcery
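Thankfully, we’ve moved on. Statistical PoS taggers use probabilistic models, and the classic example is the Hidden Markov Model (HMM). Today’s strongest taggers are mostly neural or, like NLTK’s default, perceptron-based, but the HMM is still the cleanest way to see the idea. Don’t let the name intimidate you. The intuition is simple: the tag of a word depends on the tag of the word before it and on the word itself.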
Think of it like this: if the previous word was tagged as an article (‘DT’ like ’the’ or ‘a’), the probability that the next word is a noun or adjective is extremely high. The model learns these transition probabilities (from one tag to another) and emission probabilities (from a tag to a specific word) from a massive corpus of text that has already been manually tagged by long-suffering linguists. This pre-tagged data is called “gold-standard” data. The model’s job is to find the most probable sequence of tags for a given sequence of words. It’s not perfect, but it’s astonishingly accurate.
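Finding the most probable tag sequence is usually done with the Viterbi algorithm. The sketch below is a toy version under loud assumptions: the three-tag inventory and every number in start_p, trans_p, and emit_p are made up rather than learned from a corpus, and it multiplies raw probabilities where a real tagger would use log probabilities and smoothing.

```python
# Hypothetical, hand-picked probabilities (a real model estimates these
# by counting tags and words in gold-standard data).
TAGS = ["DT", "NN", "VBD"]
start_p = {"DT": 0.6, "NN": 0.3, "VBD": 0.1}
trans_p = {  # trans_p[prev][cur] = P(cur tag | prev tag)
    "DT": {"DT": 0.01, "NN": 0.94, "VBD": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VBD": 0.6},
    "VBD": {"DT": 0.6, "NN": 0.3, "VBD": 0.1},
}
emit_p = {  # emit_p[tag][word] = P(word | tag)
    "DT": {"the": 1.0},
    "NN": {"dog": 0.5, "saw": 0.5},
    "VBD": {"saw": 1.0},
}


def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for a list of words."""
    if not words:
        return []
    # best[i][tag] = (probability of the best path ending in tag at i, previous tag)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for cur in tags:
            emission = emit_p[cur].get(words[i], 0.0)
            # Pick the previous tag that maximizes the path probability.
            column[cur] = max(
                (best[i - 1][prev][0] * trans_p[prev][cur] * emission, prev)
                for prev in tags
            )
        best.append(column)
    # Backtrack from the most probable final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    path.reverse()
    return path


print(viterbi(["the", "dog", "saw", "the", "saw"], TAGS, start_p, trans_p, emit_p))
# ['DT', 'NN', 'VBD', 'DT', 'NN']
```

The same word “saw” comes out as a verb after a noun and as a noun after a determiner, purely because of the transition probabilities. Note that any word missing from emit_p gets probability zero under every tag, which is exactly the unknown-word problem in miniature.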
Let’s see it in action with NLTK, the old workhorse of NLP. It uses the Penn Treebank tagset, which has about 36 word-level tags (plus a handful for punctuation). Why so many? Because they decided to get specific. Instead of just ‘verb’, we have ‘VB’ (base form), ‘VBD’ (past tense), ‘VBG’ (gerund or present participle), etc. It’s pedantic, but that specificity is powerful.
import nltk
from nltk import word_tokenize, pos_tag
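# You might need to download the tokenizer and tagger data if you haven't already.
# Resource names depend on your NLTK version; newer releases use
# 'punkt_tab' and 'averaged_perceptron_tagger_eng' instead.
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')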
text = "The quick brown fox jumps over the lazy dog but doesn't land gracefully."
tokens = word_tokenize(text)
tags = pos_tag(tokens)
print(tags)
This will output:
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('but', 'CC'), ('does', 'VBZ'), ("n't", 'RB'), ('land', 'VB'), ('gracefully', 'RB'), ('.', '.')]
See? It correctly navigated the minefield. It knew “jumps” and “does” are verbs (‘VBZ’), “n’t” is an adverb (‘RB’), and “land” is a verb (‘VB’). Not bad for a few milliseconds of work.
The spaCy Way: Faster, and Built for Pipelines
While NLTK is the classic, spaCy is the modern industrial-grade tool. It’s faster, generally more accurate out of the box, and designed to be part of a pipeline. It reports two tags per token, a coarse one and a fine-grained one, and it comes bundled with its own statistical models.
import spacy
# Load the small English model
nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog but doesn't land gracefully.")
# Print each token with its coarse tag, fine-grained tag, and a short gloss
for token in doc:
    print(f"{token.text:12} {token.pos_:8} {token.tag_:8} {spacy.explain(token.tag_)}")
This gives a richer output, showing both the coarse-grained POS (token.pos_) and the fine-grained tag (token.tag_), which is closer to Penn Treebank.
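The relationship between the two levels can be sketched as a many-to-one mapping from fine-grained tags to coarse categories. The table below is a partial, hand-written illustration, not spaCy’s or NLTK’s actual mapping; PTB_TO_COARSE and to_coarse are hypothetical names.

```python
# Partial, illustrative mapping from Penn Treebank tags to coarse categories.
# A real mapping covers every tag in the tagset.
PTB_TO_COARSE = {
    "NN": "NOUN", "NNS": "NOUN",
    "NNP": "PROPN", "NNPS": "PROPN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
    "DT": "DET", "IN": "ADP", "CC": "CCONJ",
}


def to_coarse(tagged_tokens):
    """Collapse (word, fine_tag) pairs to (word, coarse_tag); unmapped tags become 'X'."""
    return [(word, PTB_TO_COARSE.get(tag, "X")) for word, tag in tagged_tokens]


fine = [("The", "DT"), ("fox", "NN"), ("jumps", "VBZ"), ("gracefully", "RB")]
print(to_coarse(fine))
# [('The', 'DET'), ('fox', 'NOUN'), ('jumps', 'VERB'), ('gracefully', 'ADV')]
```

A pure tag-to-tag table like this is lossy in both directions. Going from coarse to fine is impossible without the word itself, and even fine to coarse needs context in places: Universal Dependencies-style annotation separates auxiliaries (‘AUX’) from main verbs, so a ‘VBZ’ like “does” in “doesn’t land” would not simply be a ‘VERB’.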
Common Pitfalls and Why You Should Care
- The Garbage-In-Garbage-Out Principle: Your tokenization directly impacts your tagging. If one tokenizer splits “doesn’t” into ['does', "n't"] and another mangles it into ['doesn', "'t"], the tagger will produce different results. Your entire pipeline is only as strong as its weakest link, and that link is almost always data cleaning and tokenization.
- Unknown Words: This is the kryptonite of statistical models. What happens when your model encounters the word “Blorfable”? It has never seen it before, so its emission probability is effectively zero. Modern taggers have tricks for this, like using word shapes (is it capitalized? does it have digits? does it have a common suffix like ‘-able’?), but it remains a hard problem. This is why domain-specific models often outperform general ones on in-domain text: they’ve seen the weird words of their domain before.
- Tagset Mismatch: This one will bite you when you least expect it. The Penn Treebank tagset (NLTK’s default) and the Universal Dependencies tagset (used by many newer tools) are different. The first has about 36 tags, the second 17. If you’re training a model on one tagset and comparing its output against text tagged with another, you’ll have a bad time. Always know your tagset.
- Assuming Perfection: Even the best taggers are only about 97% accurate per token, and that figure is for text similar to what they were trained on. It sounds high until you realize that on a 100-word document, that’s about three mistakes. For a downstream task like parsing, those three mistakes can completely derail the meaning. Never assume the tags are gospel. Always have a strategy for handling errors.
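The arithmetic behind that last pitfall is worth a quick sketch. The function names are hypothetical, and the second calculation assumes tagging errors are independent from token to token, which is a simplification (real errors tend to cluster around hard words).

```python
def expected_errors(num_tokens, token_accuracy=0.97):
    """Expected number of mis-tagged tokens at a given per-token accuracy."""
    return num_tokens * (1 - token_accuracy)


def prob_all_correct(num_tokens, token_accuracy=0.97):
    """Chance that every token is tagged correctly.

    Assumes errors are independent across tokens (a simplification).
    """
    return token_accuracy ** num_tokens


print(round(expected_errors(100), 2))   # 3.0
print(round(prob_all_correct(20), 3))   # 0.544
```

Under these assumptions, a 20-token sentence comes out fully correct only a little more than half the time, which is why per-token accuracy flatters a tagger when the downstream task needs whole sentences to be right.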
So, while it might seem like a solved problem, PoS tagging is a nuanced, critical first step. Getting it right matters because every step that comes after it is standing on its shoulders. And nobody wants their semantic parser to fall flat on its face because we mis-tagged a verb.