K-Means | mikePietsch.com

8.8 Evaluating Clustering: Internal and External Metrics

Right, so you’ve thrown some data into a clustering algorithm and it gave you back… some clusters. Great. Now for the million-dollar question: are they any good? Or did you just perform a very expensive, automated version of sorting marbles by color while blindfolded? This is where evaluation comes in, and it’s arguably more art than science. We have two families of metrics to help us: internal and external. Internal metrics don’t need ground-truth labels; they judge a clustering by its own structure, such as how compact and how well separated the clusters are. External metrics require the actual labels (which, let’s be honest, if you had those you might not be clustering in the first place) and measure how well our clusters match the known classes.
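To make the internal/external split concrete, here is a minimal sketch, assuming scikit-learn is installed. The synthetic dataset, the choice of silhouette score as the internal metric, and the adjusted Rand index as the external one are all illustrative assumptions, not necessarily what the full post uses.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic data with known labels (assumed here purely for illustration).
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

# Cluster without looking at y_true.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Internal metric: needs only the data and the predicted labels.
internal = silhouette_score(X, labels)

# External metric: compares predicted labels against the known classes.
external = adjusted_rand_score(y_true, labels)

print(f"silhouette (internal): {internal:.3f}")
print(f"adjusted Rand index (external): {external:.3f}")
```

Note that the external score ignores how the cluster IDs are numbered, so cluster 0 does not have to line up with class 0 for the match to count.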

8.7 Spectral Clustering: Graph-Based Approach

Alright, let’s get our hands dirty with Spectral Clustering. You’ve probably hit the wall with K-Means and its spherical obsession, or found DBSCAN’s parameter-tuning to be a dark art. This is where we step into the world of graph theory and linear algebra to solve clustering problems that those other methods just can’t handle. The core idea is brilliantly simple: instead of trying to cluster the data points directly in their original space, we use a graph representation to transform the data into a new space where the clusters become trivial to separate. It’s like taking a tangled mess of Christmas lights, laying them out neatly, and then just snipping the obvious gaps.
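As a rough illustration of that idea, the sketch below runs K-Means and Spectral Clustering on the classic two-moons dataset. It assumes scikit-learn; the dataset, the nearest-neighbors affinity, and `n_neighbors=10` are choices made for this example only.

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: a shape that centroid-based methods struggle with.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# K-Means clusters directly in the original space.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Spectral Clustering builds a neighbor graph, embeds the points using the
# graph Laplacian's eigenvectors, and clusters in that embedded space.
sc = SpectralClustering(
    n_clusters=2,
    affinity="nearest_neighbors",
    n_neighbors=10,
    random_state=0,
)
sc_labels = sc.fit_predict(X)

ari_kmeans = adjusted_rand_score(y_true, km_labels)
ari_spectral = adjusted_rand_score(y_true, sc_labels)
print(f"K-Means ARI:  {ari_kmeans:.3f}")
print(f"Spectral ARI: {ari_spectral:.3f}")
```

On data like this the graph-based method typically recovers the two moons far better than K-Means, though the result does depend on how the graph is built.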

8.6 Gaussian Mixture Models: Soft Cluster Assignments with EM

Right, so you’ve met K-Means. It’s fast, it’s simple, and it’s about as subtle as a sledgehammer. Every data point gets a one-way ticket to a single cluster. But let’s be honest, the world is messy. Is that customer really 100% a ‘bargain hunter’ or 100% a ‘premium spender’? Or are they maybe 70% premium and 30% bargain? That’s where Gaussian Mixture Models (GMMs) come in. They’re the sophisticated, probabilistic cousin of K-Means, and they deal in shades of gray, not just black and white.
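Here is a small sketch of what "soft" assignment looks like in practice, assuming scikit-learn's `GaussianMixture`. The one-dimensional toy data and the three query points are invented for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Two overlapping 1-D groups (toy data, assumed for this example).
X = np.concatenate([
    rng.normal(0.0, 1.0, 200),
    rng.normal(3.0, 1.0, 200),
]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Query one point near each group and one in the overlap between them.
queries = np.array([[0.0], [1.5], [3.0]])
probs = gmm.predict_proba(queries)  # one row per point, one column per component

for q, p in zip(queries.ravel(), probs):
    print(f"x = {q:.1f} -> membership probabilities {np.round(p, 2)}")
```

Each row sums to 1, and the point in the overlap region gets a noticeably more mixed assignment than the points sitting near a component mean.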

8.5 Hierarchical Clustering: Dendrograms and Linkage Criteria

Right, so you’ve met K-Means, the eager intern who needs to be told exactly how many clusters to find. And you’ve met DBSCAN, the grizzled detective who finds clusters of any shape but gets weirdly existential about noise points. Now, let’s talk about Hierarchical Clustering, the method that refuses to make a single decision and instead shows you every possible cluster configuration from one big blob to individual data points. It’s the “choose your own adventure” of clustering, and its map is a gloriously weird tree called a dendrogram.
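A minimal sketch of that "every configuration at once" idea, assuming SciPy is available. Ward linkage and the toy two-blob data are assumptions for this example; the full post may use other linkage criteria.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)

# Two tight, well-separated blobs of 10 points each (toy data).
X = np.vstack([
    rng.normal(0.0, 0.3, (10, 2)),
    rng.normal(5.0, 0.3, (10, 2)),
])

# The linkage matrix records every merge: n points produce n - 1 merges.
Z = linkage(X, method="ward")
print("merges recorded:", Z.shape[0])

# The same tree can be cut at different levels without re-running anything.
for k in (1, 2, 4):
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(f"cut into at most {k} clusters -> {len(np.unique(labels))} found")

# no_plot=True returns the dendrogram layout without needing matplotlib.
tree = dendrogram(Z, no_plot=True)
print("leaf order:", tree["leaves"])
```

Dropping `no_plot=True` draws the actual tree, provided matplotlib is installed.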

8.4 HDBSCAN: Hierarchical DBSCAN with Soft Clustering

Alright, let’s talk about HDBSCAN. You remember DBSCAN, right? It’s that clever algorithm that finds clusters based on dense regions of data points and has the good sense to call some points noise. Its two big knobs are eps (the neighborhood radius) and min_samples (the party threshold: how many points must fall within that radius for a point to count as a core point and seed a cluster). The problem? Choosing eps is a massive pain. Get it wrong, and your entire clustering falls apart. It’s like trying to tune a radio with a sledgehammer.
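To see how touchy eps can be, here is a hedged sketch using scikit-learn's `DBSCAN` on the two-moons toy dataset. The three eps values are arbitrary picks meant to span "too small", "plausible", and "too large" for this particular data.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

results = {}
for eps in (0.02, 0.2, 2.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    # DBSCAN marks noise points with the label -1.
    n_clusters = len(set(labels.tolist()) - {-1})
    n_noise = int((labels == -1).sum())
    results[eps] = (n_clusters, n_noise)
    print(f"eps={eps}: {n_clusters} cluster(s), {n_noise} noise point(s)")
```

A tiny radius leaves most points without enough neighbors, so they get written off as noise. A huge radius connects everything into one blob. Only the middle ground has a chance of finding the two moons.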

8.3 DBSCAN: Density-Based Clustering for Arbitrary Shapes

Right, so you’ve tried K-Means. You’ve squinted at the results, looked at those perfectly spherical clusters it forced onto your beautifully weird, non-spherical data, and thought, “Well, this is a lie.” You’re not wrong. The world isn’t made of neat circles. It’s made of squiggles, dense blobs, and lonely, weird points that don’t belong anywhere. Enter DBSCAN (Density-Based Spatial Clustering of Applications with Noise). This is the algorithm that looks at your messy, real-world data and says, “I get you.”

8.2 Choosing K: Elbow Method, Silhouette Score, and Gap Statistic

Alright, let’s get down to the brass tacks of choosing K. You’ve got your data, you’ve fired up sklearn, and you’re ready to unleash K-Means on it. You type KMeans().fit(X) and it hits you like a ton of bricks: “Wait, how many clusters do I actually want?” This is the million-dollar question, and anyone who tells you there’s one single, magic-bullet answer is trying to sell you something. The truth is, we have a toolbox of heuristics—some brilliant, some flawed, all useful in context. Let’s open it up.
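As a starting point, the sketch below sweeps K and records the two quantities behind the elbow method and the silhouette score. It assumes scikit-learn and a synthetic four-blob dataset; the gap statistic is left out because scikit-learn does not ship an implementation of it.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data generated from 4 blobs (an assumption for this example).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

inertias = {}
silhouettes = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_  # within-cluster sum of squares (elbow method)
    silhouettes[k] = silhouette_score(X, km.labels_)
    print(f"k={k}: inertia={inertias[k]:.1f}, silhouette={silhouettes[k]:.3f}")

# Silhouette gives a direct argmax; the elbow has to be judged from the curve.
best_k = max(silhouettes, key=silhouettes.get)
print("k with the highest silhouette:", best_k)
```

Inertia keeps dropping as K grows, which is why the elbow method looks for the bend in the curve rather than the minimum.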

8.1 K-Means: Lloyd's Algorithm, Initialization, and K-Means++

Alright, let’s talk about K-Means. It’s the algorithm you reach for when you want to “just get some clusters” and it’s so conceptually simple it’s almost stupid. The goal is to partition your data into K distinct, non-overlapping groups. It’s like herding cats, but with math. The core idea is Lloyd’s Algorithm, and it’s an elegant little dance that alternates between two steps until it gets bored (or converges, in technical terms). Here’s how it goes:
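The two alternating steps are usually described as assignment (attach each point to its nearest centroid) and update (move each centroid to the mean of its points). Below is a minimal NumPy sketch of that loop. It is not the post's own code: the function name `lloyd_kmeans` is hypothetical, and it uses naive random initialization rather than K-Means++.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iters=100, seed=0):
    """Hypothetical helper: plain Lloyd's algorithm with random initialization."""
    rng = np.random.default_rng(seed)
    # Naive initialization: pick k distinct data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        # An empty cluster keeps its old centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Usage on two well-separated toy blobs.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0.0, 0.5, (50, 2)),
    rng.normal(5.0, 0.5, (50, 2)),
])
centroids, labels = lloyd_kmeans(X, k=2)
print(np.round(centroids, 2))
```

Random initialization is the weak spot of this version: a bad starting draw can land in a poor local optimum, which is the problem K-Means++ is designed to reduce.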