37.1 Text Preprocessing: Lowercasing, Stemming, Lemmatization
Right, let’s get your text ready for the real NLP heavy lifting. Think of this step as the pre-flight checklist. You wouldn’t try to fly a jet with mud on the wings, and you shouldn’t try to train a model on raw, chaotic text. Our goal here is to reduce noise and variation without losing the essential meaning. We’re standardizing. We’re simplifying. We’re making the data’s life less complicated so our models can have an easier time finding the signal.
The big three tools in our belt for this are lowercasing, stemming, and lemmatization. They often get lumped together, but they serve different masters, and choosing the wrong one is a classic rookie mistake. Let’s break them down.
To Lowercase or Not to Lowercase?
This seems like a no-brainer, right? "The" and "the" are obviously the same word. Just .lower() everything and be done with it! Well, hold on. It’s usually the first thing we do, but it’s not always the right thing to do.
The primary reason to lowercase is to reduce the vocabulary size of your text data. Your model now has one representation for “apple” instead of separate ones for “Apple”, “APPLE”, and “apple”. This is fantastic for efficiency, and it is applied almost universally in bag-of-words and TF-IDF pipelines.
But here’s the catch: You are literally throwing away information. Capitalization tells you something is a proper noun (e.g., “Apple” the company vs. “apple” the fruit). It marks the start of a sentence. For a task like Named Entity Recognition (NER), lowercasing everything immediately makes the job harder. The model loses a major clue.
My rule of thumb: if you’re doing a simple task like sentiment analysis or topic modeling, lowercase. If your task hinges on identifying specific entities or precise grammatical structure, pause and think before you lowercase.
# The standard, and usually correct, approach
text = "The Quick Brown Fox jumped over the Lazy Dog."
processed_text = text.lower()
print(processed_text)
# Output: the quick brown fox jumped over the lazy dog.
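To make the trade-off concrete, here is a minimal sketch that counts distinct tokens before and after lowercasing. The `vocabulary` helper, its crude regex tokenizer, and the sample sentence are all made up for illustration; a real pipeline would use a proper tokenizer.

```python
import re

def vocabulary(text, lowercase=False):
    """Return the set of distinct word tokens in `text`."""
    if lowercase:
        text = text.lower()
    # Deliberately crude tokenizer (an assumption): runs of letters and apostrophes.
    return set(re.findall(r"[A-Za-z']+", text))

sample = "Apple unveiled a new phone. I ate an apple. APPLE SHARES ROSE. The apple was good."

cased = vocabulary(sample)
uncased = vocabulary(sample, lowercase=True)

# Lowercasing shrinks the vocabulary...
print(len(uncased) < len(cased))

# ...but the company and the fruit now share a single token.
merged = sorted(w for w in cased if w.lower() == "apple")
print(merged)
```

The second print lists every surface form that collapses into “apple”. That is exactly the information an NER model would have liked to keep.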
Stemming: The Brutalist Architecture of NLP
Stemming is the process of brutally chopping off the end of words to leave only their root or stem. It’s fast, it’s simple, and it’s kind of a blunt instrument. The resulting “stem” may not even be a real word.
The most common algorithm is the Porter Stemmer, a classic from 1980 that’s basically a big list of rules for snipping suffixes. It’s seen some things. It gets the job done, but it’s not subtle.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
words = ["running", "happily", "flies", "generously", "better"]
for word in words:
    print(f"{word} -> {stemmer.stem(word)}")
# Output:
# running -> run
# happily -> happili # See? Not a word.
# flies -> fli # Yikes.
# generously -> gener
# better -> better
Notice the absurdity? “Happily” becomes “happili” and “flies” becomes “fli”. It’s ugly, but for many information retrieval tasks (like a search engine), this is actually fine. The goal is to conflate “running” and “run” so a search for one finds the other. Precision of the word form is less important than the core concept.
Pitfall: Over-stemming is its fatal flaw. It can destroy meaningful distinctions. “University” and “universe” both stem to “univers”. That’s… not great.
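To see why search engines put up with ugly stems, here is a rough sketch of a stem-keyed index. The `toy_stem` function is a deliberately naive suffix stripper invented for this example (it is not the Porter algorithm), and `build_index` and `search` are hypothetical helpers. The only point is that documents and queries go through the same normalizer.

```python
from collections import defaultdict

def toy_stem(word):
    """Naive suffix stripper for illustration only; NOT the Porter algorithm."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Undo consonant doubling: "runn" -> "run".
    if len(word) > 3 and word[-1] == word[-2] and word[-1] not in "aeiouls":
        word = word[:-1]
    return word

def build_index(docs, stem=toy_stem):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[stem(token.strip(".,!?"))].add(doc_id)
    return index

def search(index, query, stem=toy_stem):
    """Stem the query the same way the documents were stemmed."""
    return sorted(index.get(stem(query), set()))

docs = {
    1: "She was running late.",
    2: "He runs every morning.",
    3: "A quiet walk home.",
}
index = build_index(docs)
print(search(index, "run"))      # [1, 2]
print(search(index, "running"))  # [1, 2]
```

Because the stemmer is just a parameter, you could pass `stemmer.stem` from the Porter example instead. The flip side is the pitfall above: any two words that share a stem, such as “university” and “universe”, end up under the same index key.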
Lemmatization: Stemming’s More Sophisticated Cousin
Lemmatization is what you wish stemming was. It uses a vocabulary and morphological analysis to return the base or dictionary form of a word, known as the lemma. It’s smarter, slower, and requires knowing the word’s part of speech to do its job correctly.
“Better” becomes “good”. “Running” becomes “run”. “Went” becomes “go”. This is a much more linguistically sound result.
import spacy
# Load a model (this is the small English model)
nlp = spacy.load("en_core_web_sm")
doc = nlp("I am running happily in the better universe while going to the university.")
for token in doc:
    print(f"{token.text} -> {token.lemma_}")
# Output (exact lemmas can vary with the spaCy and model version):
# I -> I
# am -> be # Much more useful for a model!
# running -> run
# happily -> happily
# in -> in
# the -> the
# better -> good # Wow.
# universe -> universe
# while -> while
# going -> go
# to -> to
# the -> the
# university -> university
# . -> .
See the difference? “Better” correctly lemmatized to “good”. “Am” became “be”. This is a far more meaningful normalization. The key here is that spaCy uses the word’s context and part-of-speech tag to determine the correct lemma.
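Why does the part-of-speech tag matter so much? A toy sketch makes it obvious. The lookup table below is hand-written and hypothetical; a real lemmatizer gets this information from a dictionary plus morphological rules. The sketch only shows that the same surface form can map to different lemmas depending on its tag.

```python
# Hypothetical, hand-written table keyed by (word, part of speech).
LEMMAS = {
    ("better", "ADJ"): "good",
    ("better", "ADV"): "well",
    ("leaves", "NOUN"): "leaf",
    ("leaves", "VERB"): "leave",
    ("saw", "NOUN"): "saw",
    ("saw", "VERB"): "see",
}

def toy_lemmatize(word, pos=None):
    """Look up the lemma for (word, pos); fall back to the word itself."""
    word = word.lower()
    return LEMMAS.get((word, pos), word)

examples = [("leaves", "NOUN"), ("leaves", "VERB"), ("saw", "VERB"), ("saw", None)]
for word, pos in examples:
    print(f"{word} ({pos}) -> {toy_lemmatize(word, pos)}")

# Output:
# leaves (NOUN) -> leaf
# leaves (VERB) -> leave
# saw (VERB) -> see
# saw (None) -> saw
```

Without a tag, the toy function has nothing to go on and simply returns the word unchanged. A real lemmatizer that is missing the tag has to guess in much the same way.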
The Catch: It’s computationally more expensive than stemming. You need a trained pipeline loaded in memory, and every token has to be tagged before it can be lemmatized. For massive datasets, this can be a real bottleneck. It’s also not perfect; you’ll still see the occasional head-scratcher from any lemmatizer.
So, which one do you use?
- Stemming: when speed is critical, the task is simple and broad (e.g., search), and you can tolerate some noise.
- Lemmatization: when accuracy is critical, the task is complex (e.g., question answering, precise language analysis), and you have the computational budget.
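In practice, for most modern projects, I just use lemmatization. Computers are fast enough, and the payoff in accuracy is almost always worth it. Just remember that a lemmatizer needs part-of-speech information. spaCy’s pipeline tags each token for you, but a standalone lemmatizer such as NLTK’s WordNetLemmatizer treats every word as a noun unless you pass it a tag. Skip the tag and you’re leaving it to guess, and you’ll be right back in the land of questionable results.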