37.7 NLTK: Classical NLP Toolkit

Right, let’s talk about NLTK. If you’re in this field, you’ve probably heard of it. The Natural Language Toolkit is the grande dame of Python NLP libraries. It’s not the fastest, it’s not the shiniest, but it’s a fantastic pedagogical tool and a reliable workhorse for a lot of classical, non-neural NLP tasks. Think of it as your well-stocked, slightly dusty university lab: everything you need to understand the fundamentals is in here, even if the new grad students are all running off to the fancy new building with the laser cutters (that’s spaCy and Hugging Face, by the way).

We’re going to walk through how it handles the big four: tokenization, part-of-speech tagging, named entity recognition, and parsing. But first, one non-negotiable setup step:

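import nltk

# You're gonna see this a lot. NLTK ships without its models and corpora,
# which is a bit of a pain but keeps the core library lightweight.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
# Recent NLTK releases (3.9 and later) look for renamed resources instead:
# 'punkt_tab', 'averaged_perceptron_tagger_eng', 'maxent_ne_chunker_tab'.
# If a LookupError names a resource, download the one it asks for.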

Tokenization: Where the Rubber Meets the Road

Forget AI for a second. The first and most humbling task in NLP is just breaking a string of text into individual words and sentences. It seems trivial until you hit a sentence like “Mr. O’Neill paid $29.99 for the first edition, didn’t he?”.

NLTK’s word tokenizer isn’t just splitting on spaces. It’s a rule-based system (a set of regular expressions following the Penn Treebank tokenization conventions) that handles contractions, punctuation, and other edge cases with surprising grace.

from nltk.tokenize import word_tokenize, sent_tokenize

text = "Mr. O'Neill paid $29.99 for the first edition, didn't he? What a deal!"
words = word_tokenize(text)
sentences = sent_tokenize(text)

print("Words:", words)
print("Sentences:", sentences)

This will output:

Words: ['Mr.', "O'Neill", 'paid', '$', '29.99', 'for', 'the', 'first', 'edition', ',', 'did', "n't", 'he', '?', 'What', 'a', 'deal', '!']
Sentences: ["Mr. O'Neill paid $29.99 for the first edition, didn't he?", 'What a deal!']

See what it did? It correctly kept “Mr.” as a single token, split “didn’t” into “did” and “n’t”, and isolated punctuation. This is a good thing for most classical NLP tasks, as punctuation and the structure of contractions carry meaning. sent_tokenize is similarly robust: it uses Punkt, a model trained without labeled data to learn which periods belong to abbreviations like “Mr.” and which ones actually end a sentence.
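
To see that these splits come from rules rather than a trained model, here is a minimal sketch that calls NLTK's Treebank-style tokenizer class directly and compares it with plain whitespace splitting. TreebankWordTokenizer needs no downloaded resources; the shortened sample string is just for illustration, and word_tokenize itself adds sentence splitting and a few extra rules on top of this.

```python
from nltk.tokenize import TreebankWordTokenizer

sample = "Mr. O'Neill paid $29.99, didn't he?"

# Naive whitespace splitting leaves punctuation glued to the words.
naive_tokens = sample.split()

# The Treebank-style tokenizer applies regex rules: it separates
# punctuation and currency signs and splits off the contraction "n't".
rule_tokens = TreebankWordTokenizer().tokenize(sample)

print("Naive:", naive_tokens)
print("Rules:", rule_tokens)
```

Comparing the two lists side by side makes it obvious why a bare split() is rarely good enough: the price, the comma, and the question mark all end up stuck to neighboring words.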

Pitfall: This tokenization is not the same as the subword tokenization (WordPiece, SentencePiece) used by modern transformers. If you feed these tokens straight into a BERT model, it will have a very bad day, because the model only understands the subword vocabulary it was trained with. NLTK’s tokens are for a different workflow.
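
For contrast, here is a toy sketch of the greedy longest-match idea behind WordPiece-style tokenization. Everything in it is an assumption for illustration: the function name wordpiece_sketch is hypothetical, the vocabulary is made up, and real tokenizers also handle normalization, casing, and a learned vocabulary of tens of thousands of pieces. SentencePiece works differently again.

```python
def wordpiece_sketch(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Made-up vocabulary, purely for demonstration.
toy_vocab = {"token", "##ization", "##s", "deal"}

print(wordpiece_sketch("tokenization", toy_vocab))
print(wordpiece_sketch("tokens", toy_vocab))
print(wordpiece_sketch("xyz", toy_vocab))
```

The point is that a single NLTK token such as "tokenization" can turn into several model inputs, so the two schemes cannot be swapped for one another.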

Part-of-Speech Tagging: Labeling the Parts

Once we have tokens, the next step is figuring out what grammatical role each one plays. Is “book” a noun or a verb? Is “right” an adjective, adverb, noun, or verb? This is Part-of-Speech (PoS) tagging.

NLTK’s default tagger uses a pre-trained Averaged Perceptron model. It’s a statistical model that looks at the context of a word (the words around it) to make an educated guess. It’s not perfect, but it’s remarkably accurate for something trained on news text decades ago.

from nltk import pos_tag

tagged_tokens = pos_tag(words)
print("PoS Tags:", tagged_tokens)

You’ll get something like:

PoS Tags: [('Mr.', 'NNP'), ("O'Neill", 'NNP'), ('paid', 'VBD'), ('$', '$'), ('29.99', 'CD'), ('for', 'IN'), ('the', 'DT'), ('first', 'JJ'), ('edition', 'NN'), (',', ','), ('did', 'VBD'), ("n't", 'RB'), ('he', 'PRP'), ('?', '.'), ('What', 'WP'), ('a', 'DT'), ('deal', 'NN'), ('!', '.')]

Those cryptic codes are Penn Treebank tags. NNP is a proper noun, VBD is a past tense verb, JJ is an adjective. You’ll need a cheat sheet until you memorize the common ones (nltk.help.upenn_tagset() is your friend, though it needs its own tagset resource downloaded first). The model correctly tagged “paid” as a verb (VBD) and “first” as an adjective (JJ).
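
Once you have (word, tag) pairs, most downstream work is plain Python. The sketch below groups words by the first two letters of their tag, relying on the fact that Penn Treebank noun tags start with NN, verb tags with VB, and adjective tags with JJ. The helper name group_by_coarse_tag and the prefix table are hypothetical, and the input is a handful of pairs copied from the sample output above rather than fresh tagger output.

```python
from collections import defaultdict

# A few (word, tag) pairs copied from the sample output above.
tagged_sample = [
    ("Mr.", "NNP"), ("O'Neill", "NNP"), ("paid", "VBD"),
    ("the", "DT"), ("first", "JJ"), ("edition", "NN"),
]

# Penn Treebank tags share prefixes: NN* nouns, VB* verbs, JJ* adjectives.
COARSE_PREFIXES = {"NN": "noun", "VB": "verb", "JJ": "adjective"}

def group_by_coarse_tag(pairs):
    """Bucket words by a coarse category derived from the tag prefix."""
    groups = defaultdict(list)
    for word, tag in pairs:
        label = COARSE_PREFIXES.get(tag[:2], "other")
        groups[label].append(word)
    return dict(groups)

print(group_by_coarse_tag(tagged_sample))
```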

Named Entity Recognition: Finding the Whos and Whats

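NER is the process of finding and classifying named entities: people, organizations, locations, etc. NLTK’s approach is a two-step process: first it PoS tags the text, then it runs a chunker over the tagged sequence. That chunker is a pre-trained maximum-entropy classifier (the maxent_ne_chunker resource downloaded earlier) that decides which runs of tokens form an entity and what type each one is.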

from nltk import ne_chunk

# ne_chunk expects a list of (word, pos) tuples
ner_tree = ne_chunk(tagged_tokens)
print(ner_tree)

The output is a nested nltk.Tree object. It might look messy, but somewhere in there it should be labeling O'Neill as a PERSON (exactly how it chunks “Mr.” can vary). This is where NLTK starts to show its age. Its NER is serviceable but not great. It’s trained on newswire data, so it’s good at finding things like “General Motors” but will often miss modern tech company names or niche entities. For any serious NER work, you’d jump to spaCy or a transformer model. But for a quick-and-dirty analysis or an educational tool, it gets the point across.
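
Printing the tree is fine for a demo, but you usually want the entities as plain data. Here is a minimal sketch of walking the top level of such a tree. The tree is built by hand as a stand-in for ne_chunk output (an assumption, not real model output), and extract_entities is a hypothetical helper name.

```python
from nltk.tree import Tree

# Hand-built stand-in for ne_chunk output (not real model output).
sample_tree = Tree("S", [
    Tree("PERSON", [("O'Neill", "NNP")]),
    ("paid", "VBD"),
    ("the", "DT"),
    ("bill", "NN"),
])

def extract_entities(chunked):
    """Return (label, text) pairs for each entity subtree at the top level."""
    entities = []
    for node in chunked:
        # Plain tokens are (word, tag) tuples; entities are Tree subtrees.
        if isinstance(node, Tree):
            text = " ".join(word for word, _tag in node.leaves())
            entities.append((node.label(), text))
    return entities

print(extract_entities(sample_tree))
```

The same loop should work on the ner_tree produced earlier, since ne_chunk returns a Tree whose children are either tagged tokens or labeled entity subtrees.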

Parsing: Diagramming Sentences, Robot-Style

Parsing is the ultimate grammar nerd fantasy: building a full syntactic tree that shows the grammatical structure of a sentence. NLTK provides parsers, but a word of caution: the good ones are slow. They’re based on probabilistic context-free grammars (PCFGs).

# The parsed Penn Treebank sample ships as a corpus, not as a parser model
nltk.download('treebank')

from nltk.corpus import treebank

# No parser runs here: we load a hand-annotated tree from the sample
# to see what a constituency parse looks like.
first_tree = treebank.parsed_sents('wsj_0001.mrg')[0]
first_tree.pretty_print()

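The output is a beautiful, hierarchical tree that breaks down the sentence into nested phrases (Noun Phrases, Verb Phrases, etc.). Keep in mind that this particular tree was annotated by hand as part of the treebank; no parser produced it. Running an actual chart parser with a large grammar on a long sentence can be painfully slow, since the work grows roughly with the cube of the sentence length. That is one reason the NLP world has largely moved to dependency parsing (which NLTK also offers through wrappers around the Stanford tools, nltk.parse.stanford and its successor nltk.parse.corenlp, but they require Java).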
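
To watch a PCFG parser actually run, here is a small sketch using NLTK's ViterbiParser on a toy grammar. The grammar and its probabilities are invented for illustration and cover only this one sentence; a realistic grammar would have thousands of rules, which is where the slowness comes from.

```python
from nltk.grammar import PCFG
from nltk.parse.viterbi import ViterbiParser

# Toy grammar with made-up probabilities; each left-hand side sums to 1.
toy_grammar = PCFG.fromstring("""
    S -> NP VP [1.0]
    NP -> Det N [0.6] | 'she' [0.4]
    VP -> V NP [1.0]
    Det -> 'the' [1.0]
    N -> 'book' [1.0]
    V -> 'read' [1.0]
""")

parser = ViterbiParser(toy_grammar)
tokens = "she read the book".split()

# ViterbiParser yields the single most probable tree for the tokens.
best_tree = next(parser.parse(tokens))

print(best_tree)
print("Probability:", best_tree.prob())
```

The tree's probability is the product of the probabilities of every rule used to build it, which is how the parser chooses between competing analyses when a sentence is ambiguous.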

The NLTK Verdict: So, should you use it for a production system in 2024? Probably not for the heavy lifting. Its tokenization is still solid, but for PoS, NER, and especially parsing, newer tools are faster and more accurate. However, as a learning tool, it is unparalleled. It forces you to understand the pipeline, the data structures (like those Tree objects), and the fundamentals of how these tasks work before they became magic API calls in a neural network. It’s the foundation you need before you can appreciate why the new stuff is so revolutionary.