79.6 Clustering: KMeans, DBSCAN, Hierarchical

Right, so you’ve got your data, it’s not labeled, and you’re staring at it wondering, “What natural groups are hiding in this mess?” Welcome to clustering, the unsupervised learning equivalent of throwing a bunch of magnets on a table and seeing how they clump together. It’s part art, part science, and a great way to either find profound insights or produce beautifully colored, utterly meaningless scatter plots. Let’s make sure you end up with the former.

We’ll cover the three heavy hitters: the straightforward but fussy K-Means, the robust but enigmatic DBSCAN, and the flexible but computationally indulgent Hierarchical clustering. Each has its superpower and its kryptonite.

The K-Means Kettlebell: Simple, Fast, and Kind of a Bully

K-Means is the go-to for a reason. It’s conceptually simple and computationally efficient. Its goal is noble: partition your data into K clusters where each point belongs to the cluster with the nearest mean. Think of it as a very organized event planner who insists on grouping guests by their average position on the dance floor.

The algorithm is a beautiful, iterative dance:

1. Initialization: Pick K points at random to be your initial cluster centers (centroids). This is where things can get dicey, but we’ll get to that.
2. Assignment: For every data point, find the nearest centroid. “Nearest” almost always means Euclidean distance, so make sure your features are scaled unless you want one loud feature to dominate the party.
3. Update: Calculate the mean of all points assigned to each centroid. That mean becomes the new centroid.
4. Repeat: Do steps 2 and 3 until the centroids stop moving (or barely move).
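
If you want to see the gears turn, the four steps above can be sketched in plain NumPy. This is a teaching sketch, not what scikit-learn runs internally: the function name kmeans_sketch, its parameters, and the toy blobs are all made up for illustration, and it assumes the simplest possible initialization (K random data points).

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, tol=1e-6, seed=0):
    """Plain Lloyd-style K-Means; a hypothetical helper for illustration only."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assignment: squared Euclidean distance from every point to every centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update: each centroid moves to the mean of the points assigned to it
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        # Repeat: stop once the centroids barely move
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids

# Toy data (made up): two tight, well-separated blobs
rng = np.random.default_rng(1)
X_demo = np.vstack([
    rng.normal(0.0, 0.3, size=(50, 2)),
    rng.normal(5.0, 0.3, size=(50, 2)),
])
labels_demo, centroids_demo = kmeans_sketch(X_demo, k=2)
print(centroids_demo)
```

On blobs this clean, the two centroids should settle near (0, 0) and (5, 5), though which one gets which label depends on the random start.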

Here it is in code. We’ll use the classic Iris dataset, but pretend we don’t have the labels.

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Load up our data
iris = load_iris()
X = iris.data

# Remember what I said about scaling? Let's do that.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Let's assume we think there are 3 clusters (it's Iris, after all)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10) # n_init is important, trust me for now.
cluster_labels = kmeans.fit_predict(X_scaled)

# Plot the clusters based on the first two features
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=cluster_labels, cmap='viridis')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.title("K-Means Clustering on Iris Data (Scaled)")
plt.show()

The big, glaring problem here? You had to tell it K. Life rarely gives you the number of clusters upfront. This is K-Means’ greatest weakness. You might use the “elbow method”: plot inertia (the sum of squared distances from each point to its assigned centroid) against K and look for the bend. It’s often more of a suggestion than a clear answer. Also, those random initial centroids? Different starting points can converge to different, suboptimal solutions. That’s why we use n_init to run it several times from different starts and keep the run with the lowest inertia. And for the love of all that is reproducible, always set random_state.
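
Here is a minimal sketch of the elbow method, assuming the same scaled Iris data as above. The range of K values (1 through 8) is an arbitrary choice for illustration, and the variable names X_elbow and inertias are mine. It prints the numbers rather than plotting them, but the idea is the same: look for the K where the drop in inertia starts to flatten.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_elbow = StandardScaler().fit_transform(load_iris().data)

# Fit K-Means for a range of K and record the inertia of each fit
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_elbow)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(f"K={k}: inertia={value:.1f}")
```

Inertia drops every time you add a cluster, because more centroids always fit the data at least as well, so the raw minimum is useless. You are hunting for the point of diminishing returns, and on real data that point can be frustratingly vague.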

It’s great for large datasets and well-separated, spherical-ish clusters. If your clusters are weirdly shaped or have variable densities, it will fail spectacularly, like trying to fit a square peg into a round hole by just pushing harder.
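
To see that failure rather than take my word for it, here is a small sketch on two interleaved half-moons generated with scikit-learn’s make_moons. The sample size and noise level are assumptions chosen for illustration, and we peek at the generator’s true labels only to score the result with the adjusted Rand index (1.0 means a perfect match, values near 0 mean no better than chance).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Two crescent-shaped clusters: about as non-spherical as it gets
X_m, y_true = make_moons(n_samples=200, noise=0.05, random_state=42)
X_m = StandardScaler().fit_transform(X_m)

labels_km = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_m)

# Compare against the true moon membership (used for scoring only)
ari = adjusted_rand_score(y_true, labels_km)
print(f"Adjusted Rand index for K-Means on moons: {ari:.2f}")
```

K-Means tends to slice the moons with a straight boundary, so each cluster ends up with a chunk of both crescents and the score lands well short of 1.0.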

DBSCAN: The Anarchist’s Algorithm

If K-Means is a rigid event planner, DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is the cool, observant friend who points out the real social groups, including the loners and the tight-knit cliques. It doesn’t need you to specify the number of clusters. Instead, it defines clusters based on dense regions of points, separated by areas of low density.

It has two key parameters:

eps (ε): The maximum distance two points can be from each other to be considered neighbors. This defines your “personal space” bubble.
min_samples: The minimum number of points within a point’s eps radius for that point to be considered a core point (scikit-learn counts the point itself). This defines how popular you need to be to start a clique.

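A cluster is built by starting with a core point and pulling in every point within its eps neighborhood; whenever one of those neighbors is itself a core point, its neighborhood gets absorbed too, and so on. Points that land inside a core point’s neighborhood without being core themselves become border points: they join the cluster but don’t extend it. Anything that is neither core nor within reach of a core point is labeled as noise (-1). This is DBSCAN’s killer feature: it can tell you what doesn’t belong.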
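
The core, border, and noise distinction is easier to see on a dataset small enough to check by hand. The seven points below are made up for illustration, as are the eps and min_samples values: five points in a tight row, one straggler hanging off the end, and one loner far away.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hand-made toy data (hypothetical): a row of five, a straggler, and a loner
X_toy = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [0.3, 0.0], [0.4, 0.0],
    [0.6, 0.0],   # within eps of the point at 0.4, but too few neighbors of its own
    [5.0, 5.0],   # nowhere near anything
])

db = DBSCAN(eps=0.25, min_samples=3).fit(X_toy)
labels = db.labels_

# Core points are listed in core_sample_indices_; noise has label -1
is_core = np.zeros(len(X_toy), dtype=bool)
is_core[db.core_sample_indices_] = True
is_noise = labels == -1
is_border = ~is_core & ~is_noise  # in a cluster, but not core

print("labels:", labels.tolist())
print("core:  ", np.flatnonzero(is_core).tolist())
print("border:", np.flatnonzero(is_border).tolist())
print("noise: ", np.flatnonzero(is_noise).tolist())
```

The five points in the row each have at least three points (themselves included) within 0.25, so they are core. The straggler at 0.6 has only itself and the point at 0.4 in range, so it rides along as a border point, and the loner is noise.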

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Let's create some obviously non-spherical data
from sklearn.datasets import make_moons
X_moons, _ = make_moons(n_samples=200, noise=0.05, random_state=42)
X_moons_scaled = StandardScaler().fit_transform(X_moons)

# KMeans would utterly fail here. Let's see DBSCAN handle it.
dbscan = DBSCAN(eps=0.3, min_samples=5)
cluster_labels = dbscan.fit_predict(X_moons_scaled)

# Plot the results. Noise points, if any, have label -1 and get the darkest viridis color (dark purple).
plt.scatter(X_moons_scaled[:, 0], X_moons_scaled[:, 1], c=cluster_labels, cmap='viridis')
plt.title("DBSCAN on Moons Data")
plt.show()

The catch? Choosing eps and min_samples is black magic. There are heuristics (like using a k-distance graph), but it often involves trial and error. It also struggles when clusters have very different densities, because a single eps has to serve all of them, and its performance can start to drag on very large datasets. But when your data has outliers and funky shapes, it’s an absolute hero.
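
One common version of the k-distance heuristic goes like this: for every point, measure the distance to its k-th nearest neighbor (with k tied to min_samples), sort those distances, and look for the knee where they shoot up. The sketch below assumes the same moons data and min_samples=5; the percentiles it prints are only a rough stand-in for eyeballing the plotted curve, not a rule.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X_k, _ = make_moons(n_samples=200, noise=0.05, random_state=42)
X_k = StandardScaler().fit_transform(X_k)

min_samples = 5
# Querying the training data returns each point as its own nearest neighbor
# (distance 0), which matches min_samples counting the point itself.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_k)
distances, _ = nn.kneighbors(X_k)

# Distance to the farthest of those neighbors, sorted ascending
k_dist = np.sort(distances[:, -1])

for q in (50, 90, 95, 99):
    print(f"{q}th percentile of k-distance: {np.percentile(k_dist, q):.3f}")
```

Plot k_dist and the flat stretch shows the typical spacing inside dense regions, while the sharp rise at the end is the outliers. A reasonable starting eps sits around the knee, and then you go back to trial and error with a bit more dignity.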

Hierarchical Clustering: The Family Tree

Hierarchical clustering is less of a single method and more of a framework. Instead of a single partitioning, it builds a multilevel hierarchy of clusters. You can look at the results at different levels of granularity. The most common type is agglomerative (“bottom-up”), where each point starts as its own cluster and pairs of clusters are merged as you move up the hierarchy.

You don’t need to pre-specify K; you can choose it later by cutting the “dendrogram”—a tree diagram that shows the order of merges and the distance at which they occurred. The choice of linkage criterion (how to measure distance between clusters) is critical:

ward: Merges the pair of clusters that gives the smallest increase in total within-cluster variance. Tends to create nice, compact, spherical clusters.
average: Uses the average distance between all pairs of points in the two clusters.
complete: Uses the maximum distance between points in the two clusters.
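
To get a feel for how much the linkage choice matters, here is a small sketch that builds the hierarchy three ways with SciPy and cuts each tree into at most three clusters using fcluster. It assumes the first two Iris features as input; the dictionary name cuts is mine. Cutting with criterion="maxclust" is just one option (you can also cut at a distance threshold with criterion="distance").

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import load_iris

X_h = load_iris().data[:, :2]

cuts = {}
for method in ("ward", "average", "complete"):
    Z = linkage(X_h, method=method)                   # (n - 1) merges, one per row
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut into at most 3 clusters
    cuts[method] = labels
    sizes = np.bincount(labels)[1:]                   # fcluster labels start at 1
    print(f"{method:>8}: cluster sizes = {sizes.tolist()}")
```

Same data, same requested number of clusters, and the group sizes can still differ from one linkage to the next. That is your cue to check the result against what you know about the data instead of trusting whichever default you happened to start with.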

from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# Let's go back to the Iris data
X = iris.data[:, :2] # Just use first two features for a clearer dendrogram

# Create the linkage matrix for the dendrogram
linked = linkage(X, 'ward') # Using ward linkage

# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(linked)
plt.title('Dendrogram for Iris Data')
plt.xlabel('Sample Index')
plt.ylabel('Euclidean Distance')
plt.show()

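# Now let's actually cluster, choosing 3 clusters after looking at the dendrogram.
# linkage='ward' is the default; it is spelled out here to match the dendrogram above.
agg_cluster = AgglomerativeClustering(n_clusters=3, linkage='ward')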
cluster_labels = agg_cluster.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=cluster_labels, cmap='viridis')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.title("Agglomerative Clustering on Iris (n_clusters=3)")
plt.show()

The main drawback? It’s computationally expensive. Don’t even think about it on a dataset with 100,000 samples. The naive algorithm runs in O(n³) time, and standard implementations hold O(n²) pairwise distances in memory. Optimized variants get the time closer to O(n²) for some linkages, which is still not great. But for smaller datasets, the dendrogram provides an invaluable visual tool for understanding the inherent structure of your data and making an informed choice about K.
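
A quick back-of-the-envelope calculation shows why large datasets are a problem before the clustering even starts. This sketch assumes the implementation stores every pairwise distance once, as a 64-bit float, which is n(n-1)/2 values; the helper name condensed_distance_bytes is hypothetical.

```python
def condensed_distance_bytes(n_samples, bytes_per_value=8):
    """Memory needed to store each pairwise distance once (hypothetical helper)."""
    n_pairs = n_samples * (n_samples - 1) // 2
    return n_pairs * bytes_per_value

for n in (1_000, 10_000, 100_000):
    gib = condensed_distance_bytes(n) / 2**30
    print(f"{n:>7,} samples -> {gib:,.2f} GiB of pairwise distances")
```

At 100,000 samples that works out to roughly 37 GiB for the distances alone, which is the point where sampling your data first, or reaching for K-Means, starts to look very attractive.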