38.2 TF-IDF and Bag-of-Words for Classical Classifiers

Right, let’s talk about the two workhorses of classical NLP that refuse to die: Bag-of-Words and its slightly smarter cousin, TF-IDF. They’re the foundational techniques you need to understand, even if you’re eventually going to run off with some fancy neural network. Why? Because they’re fast, surprisingly effective for a lot of tasks, and they’ll teach you more about the texture of language than you might think. Plus, they’re the secret weapon for getting a quick baseline model before you blow the budget on GPU time.

The core idea is gloriously, almost offensively, simple. We’re going to take your beautifully structured text—with its syntax, semantics, and narrative flow—and throw all of that out the window. We treat a document as if it were just a… bag of words. Unordered. Contextless. Just a multiset of tokens. It’s the textual equivalent of taking a meticulously crafted lasagna, pureeing it, and then counting the number of peas. It sounds absurd, but you’d be shocked how often you can still tell it’s lasagna from the puree. This approach is the computational linguist’s version of “smash it with a hammer until it fits our spreadsheet.”

The Bag-of-Words Model: It’s Exactly What It Sounds Like

Here’s the recipe:

Tokenize: Split your text into words (or n-grams, but we’ll get to that).
Count: For each document, count how often each word in your entire vocabulary appears.

The result is a massive matrix, often called a Document-Term Matrix. Each row is a document. Each column is a word from the entire corpus (your collection of documents). Each cell is a count. That’s it.
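Before reaching for a library, the recipe above can be sketched in a dozen lines of plain Python. This is a minimal illustration, not how any production vectorizer is implemented: the two-sentence corpus and the whitespace `tokenize` helper are made up for the example, and real tokenizers handle punctuation and casing far more carefully.

```python
from collections import Counter

# Hypothetical toy corpus, chosen only to keep the matrix tiny.
corpus = ["red fish blue fish", "one fish two fish"]

def tokenize(text):
    # Crude tokenizer (an assumption for this sketch): lowercase, split on whitespace.
    return text.lower().split()

tokenized = [tokenize(doc) for doc in corpus]

# The vocabulary is every unique token in the corpus, sorted for a stable column order.
vocabulary = sorted({token for doc in tokenized for token in doc})

# One row per document, one column per vocabulary term, each cell a count.
matrix = []
for doc in tokenized:
    counts = Counter(doc)  # missing terms count as 0
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)  # ['blue', 'fish', 'one', 'red', 'two']
for row in matrix:
    print(row)     # [1, 2, 0, 1, 0] then [0, 2, 1, 0, 1]
```

Notice that word order is gone the moment `Counter` touches the tokens. That is the whole "bag" in Bag-of-Words.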

Let’s build one on a toy corpus so you can see the gears turn. We’ll use sklearn’s CountVectorizer, which is essentially a very fancy, production-ready bag-of-words machine.

from sklearn.feature_extraction.text import CountVectorizer

# Our stunningly profound corpus
corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "The cat and the dog are friends."
]

# Initialize the vectorizer. We'll leave it with defaults for now.
vectorizer = CountVectorizer()

# fit_transform does two things:
# 1. fit: learns the vocabulary (all unique words) from the corpus.
# 2. transform: counts the words in each document and returns the matrix.
X = vectorizer.fit_transform(corpus)

# Let's see what we made.
print("Vocabulary:", vectorizer.get_feature_names_out())
print("Dense representation of the matrix:")
print(X.toarray())

This will output:

Vocabulary: ['and' 'are' 'cat' 'dog' 'friends' 'log' 'mat' 'on' 'sat' 'the']
Dense representation of the matrix:
[[0 0 1 0 0 0 1 1 1 2]
 [0 0 0 1 0 1 0 1 1 2]
 [1 1 1 1 1 0 0 0 0 2]]

Look at that. Our three sentences have been reduced to vectors of counts. The first document, [0 0 1 0 0 0 1 1 1 2], lines up with the vocabulary in order: ‘and’, ‘are’, ‘cat’, … ‘the’. So it has 1 for ‘cat’, 1 for ‘mat’, 1 for ‘on’, 1 for ‘sat’, and 2 for ‘the’. Note that the vectorizer also lowercased everything and dropped the punctuation along the way. Perfect.
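The split between fit and transform matters as soon as you have text the vectorizer has never seen. Here is a small sketch of that situation; the new sentence is invented for the example. Words that were not in the fitted vocabulary ("chased", "bird") simply have no column, so they vanish.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "The cat and the dog are friends.",
]

# Learn the vocabulary once, from the training corpus only.
vectorizer = CountVectorizer()
vectorizer.fit(corpus)

# Hypothetical unseen document: 'chased' and 'bird' are not in the vocabulary.
new_doc = ["The cat chased the bird."]
X_new = vectorizer.transform(new_doc)

# Same 10 columns as before; only 'cat' and 'the' register.
print(X_new.toarray())  # [[0 0 1 0 0 0 0 0 0 2]]
```

This is why you fit on training data and only transform test data: the columns have to mean the same thing in both.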

Now, the immediate problem slaps you in the face. The word “the” dominates. It’s the most frequent word in all three documents, but it carries the least amount of interesting information. This is where TF-IDF waltzes in to save the day.

TF-IDF: Term Frequency-Inverse Document Frequency

TF-IDF is a feature weighting scheme, not a different model. We’re still building a Bag-of-Words. We’re just replacing the simple counts with a more informative value. The goal is to highlight words that are important to a specific document, but not common across all documents.

Term Frequency (TF): How often a word appears in a single document. (Same as BoW).
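Inverse Document Frequency (IDF): A measure of how rare a word is across the entire corpus. The textbook form is log(total_number_of_documents / number_of_documents_containing_the_word). Implementations usually smooth it; scikit-learn’s default is ln((1 + total_number_of_documents) / (1 + number_of_documents_containing_the_word)) + 1. The 1s inside the logarithm avoid division by zero, and the trailing + 1 keeps a word that appears in every document from being zeroed out entirely.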

The TF-IDF score is TF * IDF. A high TF-IDF score means:

The term is frequent in the current document (high TF).
The term is rare in other documents (high IDF).

Words like “the” will have a high TF but the lowest possible IDF (because they’re in every document). With the textbook formula their score drops to zero; with a smoothed variant it is merely pulled down relative to rarer words. A word like “mat” appears in one document and not in the others, so it gets a higher IDF and a boost.
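To make the IDF half concrete, here is a rough hand computation on this section's three-sentence corpus. It assumes scikit-learn's default smoothed formula, ln((1 + N) / (1 + df)) + 1, and uses a deliberately crude tokenizer; the `smoothed_idf` helper is a name made up for this sketch.

```python
import math

corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "The cat and the dog are friends.",
]

# Crude tokenization (an assumption): lowercase, drop periods, split on whitespace.
# A set per document is enough, since IDF only asks "does the word occur at all?"
docs = [set(text.lower().replace(".", "").split()) for text in corpus]
n_docs = len(docs)

def smoothed_idf(term):
    # Document frequency: number of documents that contain the term.
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

for term in ("the", "cat", "mat"):
    print(term, round(smoothed_idf(term), 2))
# the 1.0   (in all 3 documents)
# cat 1.29  (in 2 documents)
# mat 1.69  (in 1 document)
```

The rarer the word, the bigger the multiplier. On three documents the spread is modest; on a corpus of thousands it is much wider.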

from sklearn.feature_extraction.text import TfidfVectorizer

# Same corpus
corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "The cat and the dog are friends."
]

# The TfidfVectorizer API is identical to CountVectorizer. Convenient!
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(corpus)

print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())
print("TF-IDF representation:")
print(X_tfidf.toarray().round(2)) # rounding for readability

With scikit-learn’s defaults (smoothed IDF, then each row scaled to unit L2 length), the output is:

Vocabulary: ['and' 'are' 'cat' 'dog' 'friends' 'log' 'mat' 'on' 'sat' 'the']
TF-IDF representation:
[[0.   0.   0.37 0.   0.   0.   0.49 0.37 0.37 0.58]
 [0.   0.   0.   0.37 0.   0.49 0.   0.37 0.37 0.58]
 [0.42 0.42 0.32 0.32 0.42 0.   0.   0.   0.   0.5 ]]

See the magic? It’s subtler than you might hope. In the first document, “the” still has the largest weight (0.58), because it occurs twice and scikit-learn’s smoothed IDF never drops below 1. But its lead has shrunk: in the raw counts it was worth twice as much as “mat”, and now it’s 0.58 against 0.49. “mat”, which appears in only one document, has pulled ahead of “cat”, “on”, and “sat” (0.37 each), which all appear in two. In the third document, the one-document words “and”, “are”, and “friends” (0.42) likewise outrank “cat” and “dog” (0.32). On a real corpus with thousands of documents, the gap between the IDF of “the” and the IDF of a rare word is far wider, and that’s where the demotion really bites.
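If you want to convince yourself where those numbers come from, the sketch below rebuilds the matrix by hand from raw counts. It assumes TfidfVectorizer's default settings (smoothed IDF, L2 row normalization, no sublinear TF scaling); change any of those and the comparison no longer holds.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "The cat and the dog are friends.",
]

# Raw counts. The corpus is tiny, so a dense array is fine here.
counts = CountVectorizer().fit_transform(corpus).toarray()
n_docs = counts.shape[0]

# Document frequency per term, then the smoothed IDF.
df = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# TF * IDF, then scale every row to unit Euclidean length.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(manual, reference))  # True
```

The normalization step is why no value exceeds 1 and why longer documents do not automatically get bigger vectors.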

The Devil’s in the Details: Practical Considerations

This is where most tutorials stop. I won’t. Here’s what you actually need to know to use this effectively.

stop_words: You can pass a list of stop words (like ‘the’, ‘and’, ‘is’) to the vectorizer to ignore them completely. sklearn has a built-in English list (stop_words='english'). Use it cautiously. Sometimes these words are important: in sentiment analysis, “not” is on the built-in list, and it is definitely a word you want to keep.

ngram_range: This is the biggest lever you can pull. By default, we use (1, 1) (single words, or unigrams). But often the meaning lives in phrases. (1, 2) will give you both unigrams and bigrams (e.g., “cat sat”, “sat on”). This can massively improve performance by capturing some local context, but it also explodes the number of features. Your vocabulary can easily grow from 10,000 words to 500,000 n-grams. It’s a classic trade-off.

max_features: You will often want to set this to, say, 10,000. It tells the vectorizer to only keep the top N most frequent terms in the vocabulary. This is essential for managing memory and preventing your model from overfitting on ridiculously rare typos.

The Memory Bomb: This is the single biggest pitfall. That document-term matrix? It’s overwhelmingly zeros. For a large corpus, it’s 99.9% zeros. sklearn returns a compressed sparse matrix object (csr_matrix) by default. Do not call .toarray() on a large matrix unless you want to crash your kernel. Feed the sparse matrix directly into your classifier (LogisticRegression, SVM, Naive Bayes). They are designed to handle it efficiently.
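To get a feel for the numbers, here is a rough sketch that inspects the sparse matrix without densifying it, followed by some back-of-the-envelope arithmetic. The corpus size of one million documents and 500,000 features is a hypothetical chosen for illustration.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "The cat and the dog are friends.",
]

X = TfidfVectorizer().fit_transform(corpus)

# Inspect the matrix without ever calling .toarray().
print(sparse.issparse(X))  # True
print(X.shape, X.nnz)      # (3, 10) 16 -> only 16 values are actually stored
density = X.nnz / (X.shape[0] * X.shape[1])
print(round(density, 2))   # 0.53 here; real corpora are far sparser

# Hypothetical large corpus: what a dense float64 array would cost.
n_docs, n_features = 1_000_000, 500_000
dense_bytes = n_docs * n_features * 8  # 8 bytes per float64
print(dense_bytes / 1e12, "TB")        # 4.0 TB
```

Four terabytes for a matrix that is almost entirely zeros. The sparse format stores only the nonzero entries, which is the only reason any of this fits in RAM.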

So, you vectorize your text, get your sparse matrix, and then you just… pipe it into a classifier. That’s it. That’s the whole game for classical NLP. It’s simple, brutal, and undeniably effective. It’s your baseline. If you can’t beat TF-IDF with a linear model, you have no business trying to beat it with a transformer. Now go run TfidfVectorizer(ngram_range=(1,2), max_features=10000) into a LogisticRegression and see for yourself.
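If you want a starting point, here is a minimal sketch of that baseline wired up as a scikit-learn pipeline. The four-review dataset and its labels are invented purely to make the snippet runnable; swap in your own texts and labels, and expect to need a proper train/test split before trusting any number it produces.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, far too small to learn anything real.
texts = [
    "great movie, loved it",
    "terrible plot and bad acting",
    "loved the acting, great fun",
    "bad movie, terrible pacing",
]
labels = ["pos", "neg", "pos", "neg"]

# The vectorizer hands its sparse matrix straight to the classifier.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=10000),
    LogisticRegression(),
)
model.fit(texts, labels)

# Raw strings in, labels out: the pipeline applies transform() for you.
predictions = model.predict(["great fun", "terrible acting"])
print(predictions)
```

Bundling the vectorizer and the classifier in one pipeline also keeps you honest: the vocabulary is learned only from whatever you pass to fit, so test data cannot leak into it.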