8.8 Evaluating Clustering: Internal and External Metrics

Right, so you’ve thrown some data into a clustering algorithm and it gave you back… some clusters. Great. Now for the million-dollar question: are they any good? Or did you just perform a very expensive, automated version of sorting marbles by color while blindfolded?

This is where evaluation comes in, and it’s arguably more art than science. We have two families of metrics to help us: internal and external. Internal metrics don’t need the ground truth labels; they judge a cluster by its own structure. External metrics require the actual labels (which, let’s be honest, if you had those you might not be clustering in the first place) and measure how well our clusters match the known classes.

The Silhouette Coefficient: Narcissistic Clustering

This is my favorite internal metric. It measures how well each sample fits into its own cluster compared to other clusters. It’s like a measure of cluster narcissism: “How much better am I than those other guys?” The score ranges from -1 to 1.

A score near 1 means the sample is very close to its cluster mates and far from others. Excellent.
A score near 0 means the sample is on the boundary between two clusters. Meh.
A score near -1 means the sample is probably in the wrong cluster. Yikes.
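
To make those three cases concrete, here is a minimal sketch of the per-sample calculation, assuming the standard definition s = (b - a) / max(a, b), where a is the mean distance from the sample to the other members of its own cluster and b is the mean distance to the members of the nearest other cluster. The toy points and the helper name silhouette_by_hand are made up for illustration; silhouette_samples is scikit-learn’s per-sample function, used here only as a cross-check.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Illustrative toy data: two tight groups on a line
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
labels_toy = np.array([0, 0, 0, 1, 1, 1])


def silhouette_by_hand(X, labels, i):
    """Silhouette of sample i (hypothetical helper).

    Assumes Euclidean distance and no single-point clusters.
    """
    dists = np.linalg.norm(X - X[i], axis=1)
    own = labels == labels[i]
    own[i] = False  # a sample is not its own cluster mate
    a = dists[own].mean()  # mean distance to own cluster
    b = min(  # mean distance to the nearest other cluster
        dists[labels == other].mean()
        for other in np.unique(labels)
        if other != labels[i]
    )
    return (b - a) / max(a, b)


by_hand = np.array(
    [silhouette_by_hand(X_toy, labels_toy, i) for i in range(len(X_toy))]
)
from_sklearn = silhouette_samples(X_toy, labels_toy)

print(np.round(by_hand, 3))
print(np.round(from_sklearn, 3))
```

The single number reported by silhouette_score is just the mean of these per-sample values.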

The Silhouette score is fantastic for comparing results from the same algorithm (e.g., which ‘k’ in K-Means is best?) but less useful for comparing different algorithms. It rewards compact, roughly convex clusters, so a density-based method like DBSCAN can find perfectly sensible elongated or nested clusters and still score worse than a centroid-based one.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Create some obvious, well-separated blobs
X, y = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Cluster it with the right k
kmeans_good = KMeans(n_clusters=3, random_state=42)
labels_good = kmeans_good.fit_predict(X)
score_good = silhouette_score(X, labels_good)
print(f"Silhouette Score for k=3: {score_good:.3f}")  # Should be high

# Now cluster it with a stupidly wrong k
kmeans_bad = KMeans(n_clusters=10, random_state=42)  # Way too many clusters!
labels_bad = kmeans_bad.fit_predict(X)
score_bad = silhouette_score(X, labels_bad)
print(f"Silhouette Score for k=10: {score_bad:.3f}")  # Will be lower
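
Comparing two values of k generalizes naturally to a sweep. The following is a rough sketch, not a recipe: the range of k, the variable names, and the explicit n_init=10 are my own choices, and the data is regenerated so the block runs on its own.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Same kind of well-separated blobs as above, regenerated for a standalone run
X_sweep, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

scores_by_k = {}
for k in range(2, 8):  # silhouette needs at least 2 clusters
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_sweep)
    scores_by_k[k] = silhouette_score(X_sweep, labels_k)
    print(f"k={k}: silhouette={scores_by_k[k]:.3f}")

# Pick the k with the highest mean silhouette
best_k = max(scores_by_k, key=scores_by_k.get)
print(f"Best k by silhouette: {best_k}")
```

On messier real data the curve is rarely this clean, so treat the winning k as a suggestion and look at the whole set of scores.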

Davies-Bouldin Index: Keeping Your Centroids to Yourself

Another internal metric. The idea is simple: good clusters have small within-cluster distances (all points are close to their centroid) and large between-cluster distances (centroids are far apart from each other). The Davies-Bouldin (DB) index is the average similarity between each cluster and its most similar counterpart, where “similarity” is the two clusters’ combined within-cluster spread divided by the distance between their centroids. Lower values are better. Zero is the lower bound, and you only approach it as clusters get very tight relative to the gaps between them.

It’s less intuitive to interpret than the Silhouette score on its own, but it’s cheap to compute: it needs each point’s distance to its own centroid plus the distances between centroids, not the distance between every pair of points that the Silhouette score requires. Use it for a quick, centroid-focused sanity check.

from sklearn.metrics import davies_bouldin_score

db_score_good = davies_bouldin_score(X, labels_good)
db_score_bad = davies_bouldin_score(X, labels_bad)

print(f"DB Index for k=3: {db_score_good:.3f}")  # Lower is better
print(f"DB Index for k=10: {db_score_bad:.3f}")   # This will be higher (worse)
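
For the curious, here is a minimal sketch of what goes into that number, assuming Euclidean distance and the usual definition: for each cluster, take the worst-case ratio of (spread of cluster i + spread of cluster j) to the distance between their centroids, then average over clusters. The helper name davies_bouldin_by_hand is hypothetical, and the data is regenerated so the block stands alone.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)
labels_demo = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_demo)


def davies_bouldin_by_hand(X, labels):
    """Davies-Bouldin index from centroids and spreads (hypothetical helper)."""
    ids = np.unique(labels)
    centroids = np.array([X[labels == k].mean(axis=0) for k in ids])
    # Spread: mean distance from each point to its own centroid
    spread = np.array(
        [
            np.linalg.norm(X[labels == k] - centroids[n], axis=1).mean()
            for n, k in enumerate(ids)
        ]
    )
    worst_ratios = []
    for i in range(len(ids)):
        ratios = [
            (spread[i] + spread[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(len(ids))
            if j != i
        ]
        worst_ratios.append(max(ratios))  # the most similar neighbour
    return float(np.mean(worst_ratios))


manual_db = davies_bouldin_by_hand(X_demo, labels_demo)
library_db = davies_bouldin_score(X_demo, labels_demo)
print(f"By hand: {manual_db:.4f}  scikit-learn: {library_db:.4f}")
```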

When You Have the Answers: External Metrics

Okay, let’s pretend you’re a god and you actually possess the ground truth labels. This is the luxury suite of cluster evaluation. The two big ones are Adjusted Rand Index (ARI) and Adjusted Mutual Information (AMI).

Forget the regular Rand Index or raw Mutual Information. They need to be adjusted for chance. Why? Because the unadjusted versions hand out credit for nothing: the raw Rand Index gives a completely random labeling a score well above zero, and raw Mutual Information tends to creep upward as you add more clusters, whether or not those clusters mean anything. The adjusted versions fix this; they’ll give you a score around 0 for random labeling (it can even dip slightly negative) and 1.0 for perfect agreement.

Adjusted Rand Index (ARI): Measures the pairwise agreement between two label assignments, adjusted for chance. It’s symmetric and doesn’t care about the actual label values, just the groupings.
Adjusted Mutual Information (AMI): Based on the information-theoretic concept of mutual information, then adjusted for chance. Also symmetric.

They usually tell you the same story, so pick one. I default to ARI.

from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

# Let's use our 'good' k-means labels from before
ari_score = adjusted_rand_score(y, labels_good)
ami_score = adjusted_mutual_info_score(y, labels_good)

print(f"Adjusted Rand Index: {ari_score:.3f}")  # Should be very close to 1.0
print(f"Adjusted MI Score: {ami_score:.3f}")    # Should be very close to 1.0

# What happens with the terrible clustering?
ari_score_bad = adjusted_rand_score(y, labels_bad)
print(f"ARI for terrible clustering: {ari_score_bad:.3f}") # Will be much lower
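
If the case for chance adjustment feels abstract, a quick experiment with labels drawn at random makes it visible. This is a sketch under my own assumptions: the sample size, the cluster counts (3 and 200), and the variable names are arbitrary, and rand_score requires a reasonably recent scikit-learn (0.24 or later, as far as I know).

```python
import numpy as np
from sklearn.metrics import (
    adjusted_mutual_info_score,
    adjusted_rand_score,
    mutual_info_score,
    rand_score,
)

rng = np.random.default_rng(0)
truth = rng.integers(0, 3, size=1000)          # pretend ground truth, 3 classes
few_random = rng.integers(0, 3, size=1000)     # random guess with 3 clusters
many_random = rng.integers(0, 200, size=1000)  # random guess with 200 clusters

for name, guess in [("3 random clusters", few_random),
                    ("200 random clusters", many_random)]:
    print(name)
    print(f"  Rand index: {rand_score(truth, guess):.3f}")
    print(f"  ARI:        {adjusted_rand_score(truth, guess):.3f}")
    print(f"  Raw MI:     {mutual_info_score(truth, guess):.3f}")
    print(f"  AMI:        {adjusted_mutual_info_score(truth, guess):.3f}")
```

Neither guess knows anything about the truth, yet the unadjusted scores do not sit at zero, while ARI and AMI should hover around it.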

The Homogeneity-Completeness-V-Measure Triad

This one is a bit more nuanced. It breaks the concept of “goodness” into two separate ideas:

Homogeneity: Are my clusters pure? Does each cluster contain only members of a single class? (“All points in cluster A are cats.”)
Completeness: Did I capture everyone? Are all members of a given class assigned to the same cluster? (“All cats are in cluster A.”)

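You can have one without the other. A clustering result that satisfies both is, well, perfect. The V-measure is their harmonic mean, giving you a single number to chase. Unlike ARI and AMI, it is not adjusted for chance, so a random labeling with lots of clusters can still score well above zero; be careful with it when the number of clusters is large or the sample is small. It’s useful when you care about this specific dichotomy.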
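
The two extremes make the dichotomy easy to see. Below is a small sketch using scikit-learn’s homogeneity_completeness_v_measure on made-up labels (three classes of four points each; the names are my own): one clustering gives every point its own cluster, the other dumps everything into a single cluster.

```python
import numpy as np
from sklearn.metrics import homogeneity_completeness_v_measure

# Hypothetical ground truth: 3 classes, 4 points each
classes = np.array([0] * 4 + [1] * 4 + [2] * 4)

candidates = {
    "one cluster per point": np.arange(len(classes)),
    "one giant cluster": np.zeros(len(classes), dtype=int),
    "perfect": classes.copy(),
}

results = {}
for name, predicted in candidates.items():
    h, c, v = homogeneity_completeness_v_measure(classes, predicted)
    results[name] = (h, c, v)
    print(f"{name:>22}: homogeneity={h:.3f} completeness={c:.3f} v={v:.3f}")
```

Singleton clusters are trivially pure but scatter every class across many clusters; the giant cluster keeps every class together but is as impure as it gets.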

The Most Important Pitfall: Garbage In, Gospel Out

Here’s the brutal truth: All internal metrics are fundamentally flawed. They’re based on geometric assumptions (distance, density) that you baked into the data during preprocessing. If you scaled your features weirdly, used a questionable distance metric, or your data is just a tangled mess, these metrics will confidently give you a score that makes your nonsensical clusters look mathematically brilliant. They measure the self-consistency of the clustering given your assumptions, not whether the clusters are meaningful.
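
You can watch this happen with data that has no clusters at all. The sketch below is an illustration under my own assumptions (uniform noise in a unit square, k=4, arbitrary seeds), not a benchmark: K-Means will happily carve the square into pieces, and the Silhouette score will rate those pieces as reasonably well separated.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
noise = rng.uniform(0.0, 1.0, size=(500, 2))  # structureless by construction

labels_noise = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(noise)
noise_score = silhouette_score(noise, labels_noise)

print(f"Silhouette on pure noise with k=4: {noise_score:.3f}")
```

A clearly positive score here says only that the partition is geometrically tidy, which is exactly what K-Means is built to produce.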

The ultimate test is always, always validation against the real world, which is a bigger thing than the label-matching external metrics above. Do these clusters tell you something new about your data? Can you use them to make a prediction? Do they correlate with a real-world outcome? No metric can replace that. Use these scores as a guide, not a gospel. They’re here to help you choose between different runs of the same algorithm, not to anoint your results as objectively “true.”