8.1 K-Means: Lloyd's Algorithm, Initialization, and K-Means++

Alright, let’s talk about K-Means. It’s the algorithm you reach for when you want to “just get some clusters” and it’s so conceptually simple it’s almost stupid. The goal is to partition your data into K distinct, non-overlapping groups. It’s like herding cats, but with math.

The core idea is Lloyd’s Algorithm, and it’s an elegant little dance that alternates between two steps until it gets bored (or converges, in technical terms). Here’s how it goes:

Assignment: For every data point, find the nearest cluster center (we call these “centroids”). Assign the point to that cluster. You’re basically just asking, “Which of these K bullies is closest to me?”
Update: For every cluster, calculate its new centroid by taking the mean of all the points currently assigned to it. The centroid literally just moves to the average position of its new gang.

You repeat these two steps until the assignments stop changing (or change very little). That’s it. The algorithm has found a state where moving the centroids any further wouldn’t change the cluster assignments. It’s a thing of beauty.

The Initialization Problem: Garbage In, Garbage Out

Here’s the first place K-Means shows its tragic flaw. It’s brutally sensitive to where you plop those initial centroids down. Bad starting positions can lead to disastrously bad final clusters. It’s like trying to find the best pub in town but starting your search from three different dumpsters.

The most common method is 'random', where you pick K random points from your dataset to be the first centroids. Run it multiple times and you’ll get different results. It’s a bit of a lottery.

Let’s see this train wreck in action with some Python.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Let's make some fake, well-separated data to play with
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.90, random_state=0)

# Now, let's run K-Means twice with random initialization. Watch the chaos.
kmeans1 = KMeans(n_clusters=3, init='random', n_init=1, random_state=100)
kmeans2 = KMeans(n_clusters=3, init='random', n_init=1, random_state=200)

labels1 = kmeans1.fit_predict(X)
labels2 = kmeans2.fit_predict(X)

# The centroids will be in completely different places, and the labels might be permuted.
print("Centroids from run 1:\n", kmeans1.cluster_centers_)
print("Centroids from run 2:\n", kmeans2.cluster_centers_)
# You might need to plot these to see the different outcomes truly.

This is why the KMeans class in scikit-learn has an n_init parameter (defaults to 10). It runs the whole algorithm n_init times with different random seeds and keeps the best one. It’s a pragmatic solution to a fundamental weakness.

K-Means++: Actually Smart Starting Points

Someone finally decided that starting from random dumpsters was a bad strategy and invented K-Means++. This is the default initialization method in most serious libraries for a reason: it’s dramatically better.

Instead of picking centroids completely at random, it uses a probabilistic approach to spread them out. The algorithm is wonderfully straightforward:

Pick one centroid uniformly at random from the data points.
For each data point, compute the squared distance to the nearest existing centroid.
Choose the next centroid from the data points with a probability proportional to that squared distance. Points that are far from any existing centroid have a higher chance of being chosen.
Repeat steps 2 and 3 until you have K centroids.

Why squared distance? Because it makes the algorithm even more likely to pick a point that’s far away, ensuring the initial centroids are spread out across the entire dataset. It’s basically a way to say, “Okay, we’ve got a cluster here, now for the love of god, let’s put the next one somewhere far away where the data is probably different.”

This one simple trick reduces the number of times you need to run the algorithm (n_init can be much lower, like 1 or 5) and consistently leads to better and more stable final results. You should almost never not use it.

# This is how you do it right. 'k-means++' is the default for a reason.
kmeans_good = KMeans(n_clusters=3, init='k-means++', n_init=10)
labels_good = kmeans_good.fit_predict(X)

print("Optimal centroids using k-means++:\n", kmeans_good.cluster_centers_)
print("Inertia (sum of squared distances):", kmeans_good.inertia_)

The Inertia Trap and the Elbow Method

“Inertia” is K-Means’s fancy word for “how bad my clusters are.” It’s the sum of squared distances of each point to its closest centroid. A lower inertia means tighter clusters. The algorithm’s goal is to minimize this value.

This leads us to the million-dollar question: how do you choose K? You can’t minimize inertia by just increasing K. If you set K equal to the number of data points, inertia will be zero! Perfect clusters! And completely useless.

The classic, albeit flawed, solution is the “elbow method.” You plot inertia for a range of K values and look for the “elbow” – the point where the rate of decrease sharply bends. It’s more of an art than a science and often fails miserably on real-world data where there’s no clear elbow. It’s a starting point, not a gospel.

import matplotlib.pyplot as plt

inertias = []
K_range = range(1, 10)

for k in K_range:
    kmeans = KMeans(n_clusters=k, init='k-means++', n_init=10)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)

plt.plot(K_range, inertias, '-o')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.title('The Elbow Method (Good Luck)')
plt.show()

Remember, K-Means isn’t a silver bullet. It assumes clusters are spherical and of roughly similar size and density, which is almost never true in the wild. It will happily split a long,蜿蜒的 cluster in half if you ask for K=2. But for a quick and dirty clustering of roughly blob-shaped data, it’s still your best first stop. Just remember to use init='k-means++' and don’t trust the elbow plot blindly.