38.6 Topic Modeling: LDA and BERTopic

Right, so you’ve got a mountain of text and you need to make sense of it. Sentiment analysis tells you how people feel, but it doesn’t tell you what they’re actually talking about. That’s where topic modeling comes in. Think of it as a brilliant, albeit slightly messy, librarian who takes your pile of books (documents), scans them all at superhuman speed, and starts sorting them into piles based on recurring themes. It’s unsupervised, which means we’re not giving it labels. We’re just saying, “Here’s the data, find me the hidden structure.” And the granddaddy of all topic models is LDA. Let’s get into it.

The Core Idea of LDA: A Generative Story

Latent Dirichlet Allocation (LDA) is a “generative probabilistic model.” Fancy term, simple idea. It pretends there’s a fictional, step-by-step process for how every single document in your collection was written. It goes like this:

1. For each topic, decide what words are likely to belong to it. Topic 1 might have “battery,” “life,” “charge,” “power.” Topic 2 might have “screen,” “display,” “resolution,” “inch.”
2. For each document, decide what mix of topics it’s going to be about. Maybe Document A is 70% Topic 1 (battery life) and 30% Topic 2 (screen).
3. For every word in that document:
   a. First, roll the dice based on the document’s topic mix to pick a topic.
   b. Then, roll the dice again based on that topic’s word mix to pick a specific word.
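The story above is easy to simulate. Here is a minimal sketch with NumPy, assuming a made-up six-word vocabulary, two topics, and a Dirichlet concentration of 0.5 (all of these values are illustrative choices, not anything LDA prescribes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary and sizes, chosen only for illustration.
vocab = ["battery", "charge", "power", "screen", "display", "resolution"]
n_topics, doc_length = 2, 12

# Step 1: each topic is a probability distribution over the vocabulary.
topic_word = rng.dirichlet(alpha=np.full(len(vocab), 0.5), size=n_topics)

# Step 2: the document gets its own mix of topics.
doc_topic = rng.dirichlet(alpha=np.full(n_topics, 0.5))

# Step 3: for each word slot, pick a topic, then pick a word from that topic.
words = []
for _ in range(doc_length):
    topic = rng.choice(n_topics, p=doc_topic)
    word = rng.choice(vocab, p=topic_word[topic])
    words.append(str(word))

print("topic mix:", doc_topic.round(2))
print("document: ", " ".join(words))
```

The output is word salad, and that is the point: LDA only cares about which words show up and how often, not about grammar. Fitting LDA means running this story in reverse, starting from the words and recovering topic_word and doc_topic.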

The “latent” part means the topics are hidden: we never observe them directly, so we have to reverse-engineer this fictional process. We have the final documents (the words), and we need to work backwards to discover both the topics (the lists of words) and the topic mixtures for each document. It does this using approximate Bayesian inference (scikit-learn uses a variational method), which is essentially a very sophisticated form of educated guessing and updating those guesses. The “Dirichlet” part is the statistical distribution it uses as a prior over those mixtures of topics and words. Its parameters control how concentrated the mixtures are: with small values, each document leans on a handful of topics and each topic on a relatively small set of words.

Here’s how you do it in Python with scikit-learn. We’ll use a classic toy dataset because you’ve probably seen it before.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Let's grab some text data that isn't just "hello world"
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'), categories=['sci.space', 'comp.graphics'])
documents = newsgroups.data

# We need to turn text into numbers. Bag-of-words is the classic approach.
# We'll ignore super common and super rare words.
vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
doc_term_matrix = vectorizer.fit_transform(documents)

# Now, let's say we want to find 5 topics. This is a GUESS. It's the hardest part.
lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(doc_term_matrix)

# Let's see what we got!
def print_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx}:")
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))
        print()

# Get the feature names (the words) from our vectorizer
feature_names = vectorizer.get_feature_names_out()
print_topics(lda, feature_names, 10)

You’ll get output that looks like a bunch of word lists. Your job is to look at the words and assign a human-readable label. If you see “orbit,” “NASA,” “shuttle,” “moon,” you’d call it “Space Exploration.” The model doesn’t know the label; it just knows these words hang out together a lot.
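The word lists are only half of what LDA learns. The other half is the topic mix for each document, which scikit-learn returns from transform (or fit_transform). Here is a small sketch on a four-sentence corpus invented for this example, so it runs without downloading anything. The toy_ names are just placeholders to avoid clobbering the variables above.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up mini corpus: two space-flavoured and two graphics-flavoured sentences.
toy_docs = [
    "the shuttle reached orbit and nasa tracked the launch",
    "nasa plans a moon orbit mission after the shuttle launch",
    "the graphics card renders each image pixel by pixel",
    "image rendering quality depends on the graphics pixel shader",
]

toy_vectorizer = CountVectorizer(stop_words="english")
toy_counts = toy_vectorizer.fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
# One row per document, one column per topic; each row sums to 1.
toy_doc_topic = toy_lda.fit_transform(toy_counts)

for text, mix in zip(toy_docs, toy_doc_topic):
    print(np.round(mix, 2), text[:40])
```

On a corpus this tiny the split may or may not look clean, so don't read much into the numbers. The thing to notice is the shape of the result: every document gets a weight for every topic.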

The BERTopic Revolution: Context is Everything

LDA is great, but it’s a bit… dumb. It’s based on bag-of-words, which throws all word order and context out the window. “Dog bites man” and “Man bites dog” are identical to LDA, and once stop words are stripped out, so are “The movie was not good” and “The movie was good.” That’s a problem.

Enter BERTopic. This is where modern transformers come in and show off. BERTopic’s genius is in its two-stage approach:

1. Embedding: It uses a sentence transformer (like all-MiniLM-L6-v2) to convert each document into a dense vector representation. This embedding captures the meaning and context of the document rather than raw word counts, so word order and negation can influence the result, and documents that use different words for the same idea can still land close together. This is a massive upgrade.
2. Clustering: It then uses a dimensionality reduction technique (UMAP) and a clustering algorithm (HDBSCAN) to group these document vectors. Documents with similar meanings end up in the same cluster. Each cluster is a topic.

The beauty here is that the topics are defined by the clusters of documents first. The words for each topic are then derived in a separate step, often using a class-based TF-IDF, which finds the words most representative of a cluster compared to all others.

# pip install bertopic
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Same data as before
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'), categories=['sci.space', 'comp.graphics'])
documents = newsgroups.data

# The magic happens here. fit_transform runs the embedding, reduction, and clustering.
topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(documents)

# Let's see the results. This output is far more intuitive and rich.
# (In a notebook the bare expression is enough; print makes it work in a script too.)
print(topic_model.get_topic_info())

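BERTopic will give you a nice table with topic counts and the top words. Crucially, it usually also has a topic -1. That’s the outlier topic. HDBSCAN is great because it doesn’t force every document into a topic; if a doc doesn’t fit well, it gets labeled as an outlier. That is more honest than LDA, which gives every document a mixture over the available topics and has no “none of the above” option. The flip side is that the outlier pile can get large, so check how many documents ended up in it.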
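A quick way to check the size of the outlier pile is to count the -1 entries in the topics list that fit_transform returned. The helper below, summarize_assignments, is a hypothetical name, and the example assignments are invented so the snippet runs on its own.

```python
from collections import Counter

def summarize_assignments(topics):
    """Return the share of outliers (-1) and the sizes of the real topics."""
    counts = Counter(topics)
    n_outliers = counts.pop(-1, 0)
    share = n_outliers / len(topics) if topics else 0.0
    return share, counts.most_common()

# Made-up assignments standing in for the `topics` list from BERTopic.
example_topics = [0, 0, 1, -1, 2, 0, -1, 1]
share, sizes = summarize_assignments(example_topics)

print(f"outlier share: {share:.0%}")
print("topic sizes:", sizes)
```

If the share is uncomfortably high, loosening the HDBSCAN settings is one option. Recent BERTopic releases also have a reduce_outliers method that reassigns outliers to their nearest topic; check the documentation for your version before relying on it.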

Best Practices and Pitfalls

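The Number of Topics (n_components in LDA): This is the question. There’s no right answer. Use coherence scores (gensim.models.CoherenceModel) as a guide, but ultimately, you have to look at the topics and ask, “Do these make sense? Are they distinct?” With BERTopic, you often don’t need to set this, because HDBSCAN decides how many clusters to form, but you might need to adjust the UMAP and HDBSCAN parameters (or BERTopic’s own min_topic_size and nr_topics) to control cluster granularity.
Preprocessing is Still King: While BERTopic handles context better, you still need to clean your text. Remove boilerplate and HTML tags for both models. For LDA, also strip punctuation and stop words, and consider lemmatization. For BERTopic, leave punctuation and stop words in the text you embed, because sentence transformers are trained on natural sentences; filter stop words out of the topic word lists instead, for example by passing a CountVectorizer with stop_words set as the vectorizer_model. Lemmatization is usually less critical with modern embeddings than it was for LDA. Test it.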
The Garbage In, Garbage Out Principle: If your corpus is a complete mess with no coherent themes, no model will save you. Topic modeling amplifies structure; it doesn’t create it from nothing.
Interpreting Topics is Your Job: The model gives you word lists. You are the human who must assign meaning. Two people might look at a topic with words “president,” “bill,” “senate,” “law” and call it either “U.S. Politics” or “The Clinton Administration.” The model doesn’t care. It just found the pattern.
BERTopic is Computationally Hungry: Generating embeddings for a large corpus (100k+ documents) can be slow and require significant memory. For massive datasets, you might still need to reach for the older, faster LDA. It’s a trade-off.
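To take some of the mystery out of coherence scores, here is a hand-rolled, simplified version of the UMass-style measure: for each pair of top words, it asks how often the two appear in the same document relative to how often the first appears at all. This is a sketch for intuition only, with a made-up function name and toy documents. For real work, use gensim’s CoherenceModel, which implements this and several other measures properly.

```python
import math
from itertools import combinations

def umass_coherence(top_words, tokenized_docs):
    """Average log co-occurrence ratio over word pairs; higher is more coherent."""
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    scores = []
    for earlier, later in combinations(top_words, 2):
        denom = doc_freq(earlier)
        if denom == 0:
            continue  # word never occurs; skip the pair rather than divide by zero
        scores.append(math.log((doc_freq(earlier, later) + 1) / denom))
    return sum(scores) / len(scores) if scores else float("nan")

# Toy tokenized corpus, invented for illustration.
toy_corpus = [
    ["orbit", "shuttle", "nasa"],
    ["orbit", "nasa", "moon"],
    ["pixel", "image", "render"],
    ["image", "pixel", "shader"],
]

coherent = umass_coherence(["orbit", "nasa"], toy_corpus)
mixed = umass_coherence(["orbit", "pixel"], toy_corpus)
print(f"coherent topic: {coherent:.3f}, mixed topic: {mixed:.3f}")
```

A topic whose top words actually co-occur scores higher than one that glues unrelated words together. To choose the number of topics, you would fit models at several values, score the top words of each, and then still read the topics yourself before trusting the number.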

LDA is the reliable old pick-up truck—it gets the job done on a well-defined path. BERTopic is the new self-driving car—it understands the terrain much better but requires more computational fuel. Your choice depends on the road you’re on and what you’re hauling.