12.6 Text Features: TF-IDF, CountVectorizer, Embeddings
Right, let’s talk about turning words into numbers, because your model is a glorified calculator and it doesn’t speak Shakespeare. It speaks vectors. Our job is to translate the messy, beautiful chaos of human language into a tidy spreadsheet of numbers it can actually crunch. We’ve got three main tools for this, and I’ll be honest with you: they range from “simple but surprisingly effective” to “black magic that works suspiciously well.”
First up, the classics. They’re not sexy, but they get the job done and you’d be a fool to not have them in your toolkit.
The Humble Bag of Words: CountVectorizer
Don’t let the fancy name fool you. CountVectorizer is brutally simple. It looks at your text, compiles a vocabulary of every word (or “token”), and then for each document, it just counts how many times each word appears. The result is a matrix (stored sparse, since most entries are zero) where each row is a document and each column is a word, filled with integers.
Think of it like this: it’s creating a massive checklist for every document. “Word ‘the’? Present. Word ‘algorithm’? Present 3 times. Word ‘banana’? Absent.”
from sklearn.feature_extraction.text import CountVectorizer
# Let's use some truly profound texts
corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log, and the log was soggy.",
    "This is a cat, a very specific cat.",
]
# Initialize the vectorizer. We'll tell it to ignore English stop words ("the", "a", etc.)
vectorizer = CountVectorizer(stop_words='english')
# Fit learns the vocabulary, transform counts the words
X = vectorizer.fit_transform(corpus)
# Let's see what we got
print("Vocabulary:", vectorizer.get_feature_names_out())
print("Dense matrix:\n", X.toarray())
This will output something like:
Vocabulary: ['cat' 'dog' 'log' 'mat' 'sat' 'soggy' 'specific']
Dense matrix:
 [[1 0 0 1 1 0 0]
 [0 1 2 0 1 1 0]
 [2 0 0 0 0 0 1]]
See? The first document has ‘cat’ (1), ‘mat’ (1), and ‘sat’ (1). The second document has ‘log’ twice. It’s dumb. It has no concept of grammar, word order, or meaning. But this sheer, brute-force simplicity is often all you need for a first pass. The biggest pitfall? Left unfiltered, common words like “the” will dominate the counts, which is why we almost always remove those “stop words.” Also, “cat” and “cats” are different words to it. CountVectorizer has no built-in stemming or lemmatization, so if you want those merged you have to supply your own tokenizer or analyzer (built on NLTK or spaCy, say) through its tokenizer or analyzer parameters.
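To take the mystery out of it, here is a minimal plain-Python sketch of the same counting. It assumes a tiny hand-picked stop-word list (the hypothetical STOP_WORDS below, not scikit-learn’s real list) and a regex that keeps tokens of two or more characters, which is how scikit-learn’s default token pattern behaves. Treat it as an illustration of the idea, not as the library’s actual implementation.

```python
import re
from collections import Counter

corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log, and the log was soggy.",
    "This is a cat, a very specific cat.",
]

# Hypothetical mini stop-word list, just big enough for this corpus
STOP_WORDS = {"the", "on", "and", "was", "this", "is", "very"}

def tokenize(text):
    # Lowercase, keep runs of 2+ word characters, then drop stop words
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

# One Counter per document: token -> number of occurrences
doc_counts = [Counter(tokenize(doc)) for doc in corpus]

# The vocabulary is every distinct token, sorted alphabetically
vocab = sorted(set().union(*doc_counts))

# One row per document, one column per vocabulary word (missing words count as 0)
matrix = [[counts[word] for word in vocab] for counts in doc_counts]

print(vocab)
for row in matrix:
    print(row)
```

The rows should line up with the CountVectorizer matrix above. The single-letter “a” never makes it into the vocabulary because the regex only accepts tokens of at least two characters.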
TF-IDF: Bag of Words, But With Math
Okay, so CountVectorizer is a bit of a blunt instrument. A word appearing 10 times is 10 times more important, right? Not necessarily. The word “quantum” appearing once in a physics paper is probably more meaningful than the word “the” appearing fifty times.
Enter TF-IDF (Term Frequency-Inverse Document Frequency). It’s a way to weight the counts, not just count them. The intuition is brilliant: the importance of a word is proportional to how often it appears in this document (Term Frequency), but inversely proportional to how common it is across all documents (Inverse Document Frequency). A word that’s everywhere (“the”, “a”) gets penalized heavily. A word that’s rare but shows up a lot in one document gets a huge boost.
from sklearn.feature_extraction.text import TfidfVectorizer
# Same corpus
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
X_tfidf = tfidf_vectorizer.fit_transform(corpus)
print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())
print("TF-IDF matrix:\n", X_tfidf.toarray().round(2))
You’ll get a matrix of floats instead of integers. In the second document, “log” gets the highest score because it’s frequent there and doesn’t appear in the others. The word “sat” scores lower: it’s in two of the three documents, so it’s not that unique. TF-IDF is usually a better default than raw counts. It’s still a Bag of Words, but it’s a smarter Bag of Words.
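If you want to see where those floats come from, the weighting can be reproduced by hand. The sketch below starts from the count matrix shown earlier and assumes TfidfVectorizer’s documented defaults: raw counts as term frequency, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. If you change those options on the vectorizer, the numbers will differ.

```python
import math

vocab = ["cat", "dog", "log", "mat", "sat", "soggy", "specific"]
counts = [
    [1, 0, 0, 1, 1, 0, 0],
    [0, 1, 2, 0, 1, 1, 0],
    [2, 0, 0, 0, 0, 0, 1],
]
n_docs = len(counts)

# Document frequency: in how many documents does each word appear?
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]

# Smoothed inverse document frequency (assumed scikit-learn default)
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def tfidf_row(row):
    # Weight each count by its IDF, then scale the row to unit length
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted]

tfidf = [tfidf_row(row) for row in counts]

for row in tfidf:
    print([round(v, 2) for v in row])
```

Note that with this smoothed formula a word appearing in every document still gets an IDF of 1 rather than 0, so it is down-weighted relative to rarer words but never erased.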
The Modern Magic: Word Embeddings
Here’s where we leave the comfortable world of simple counting and enter the realm of dark sorcery. Bag-of-words models have a fatal flaw: they have no concept of semantics. “King” and “queen” are as different to them as “king” and “avocado.” Word embeddings fix this.
An embedding is a dense vector of floating-point numbers (typically somewhere between 50 and 300 dimensions) that represents a word. The magic is that these vectors are learned from massive amounts of text, and they capture semantic meaning through geometry. The classic example: the vector for “king” minus “man” plus “woman” is eerily close to the vector for “queen.” It’s mind-blowing the first time you see it.
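The geometry is easier to believe once you have done the arithmetic yourself. Here is a toy sketch using hand-made two-dimensional vectors, where one axis loosely stands for “royalty” and the other for “gender.” These numbers are invented purely for illustration; real embeddings are learned from text, and their dimensions don’t come with tidy labels.

```python
import math

# Hand-made 2-D toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
# Invented for illustration only; not taken from any trained model.
toy = {
    "king":    [1.0,  1.0],
    "queen":   [1.0, -1.0],
    "man":     [0.0,  1.0],
    "woman":   [0.0, -1.0],
    "avocado": [-1.0, 0.0],
}

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# king - man + woman, component by component
target = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# Rank the words that were not part of the query by similarity to the target
candidates = {w: v for w, v in toy.items() if w not in {"king", "man", "woman"}}
nearest = max(candidates, key=lambda w: cosine(target, candidates[w]))

print("king - man + woman is closest to:", nearest)
```

Leaving the query words out of the candidate list matters: otherwise “king” itself can end up as the nearest neighbour of its own slightly shifted vector.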
You don’t typically train these yourself; you download a pre-trained model like Word2Vec or GloVe. For a document, you can average the embeddings of all its words to get a halfway decent document vector.
# Example using the gensim library to load a pre-trained model
import gensim.downloader as api
# The first call downloads the model (roughly 130 MB) and caches it locally
glove_vectors = api.load("glove-wiki-gigaword-100")
# Get the vector for a word
king_vector = glove_vectors['king']
print("Vector for 'king' (first 5 dims):", king_vector[:5])
# The famous analogy
result = glove_vectors.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print("king - man + woman =", result[0])
The output should show “queen” as the top result. It’s absurd that this works, but it does. The best practice here is to use a model pre-trained on a huge corpus. The pitfalls? The models are big to download and hold in memory, and averaging word vectors throws away word order. For that, you’d need more complex models like transformers (BERT, etc.), which are a whole other chapter of “how is this even possible.” For now, know that if your bag-of-words model is plateauing, embeddings are the next, more powerful step.
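The averaging trick, and the word-order problem that comes with it, can be sketched in a few lines. The three-dimensional vectors below are hypothetical stand-ins; with the gensim model loaded earlier, lookups like glove_vectors['king'] would take the place of the toy dictionary.

```python
import math

# Hypothetical 3-D toy embeddings; a real model would supply these vectors
toy_embeddings = {
    "dog":   [0.5, 0.25, 0.0],
    "bites": [0.0, 0.5, 0.5],
    "man":   [0.25, 0.0, 0.75],
}

def document_vector(tokens, embeddings):
    # Average the vectors of the tokens we have embeddings for
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        raise ValueError("no known tokens in document")
    dims = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dims)]

a = document_vector("dog bites man".split(), toy_embeddings)
b = document_vector("man bites dog".split(), toy_embeddings)

# Same words in a different order give the same average
same = all(math.isclose(x, y) for x, y in zip(a, b))
print(a)
print(b)
print("identical:", same)
```

Two sentences with very different meanings collapse to the same vector, which is exactly the limitation that order-aware models are meant to address.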