9.5 UMAP: Faster and More Globally Faithful than t-SNE
Alright, let’s get into UMAP. If t-SNE is the brilliant but moody artist who gets lost in the details, UMAP is the pragmatic engineer who understands the big picture and actually cares about how long the project takes. It stands for Uniform Manifold Approximation and Projection, which sounds like a mouthful dreamed up by a committee, but the underlying ideas are actually elegant, powerful, and—blessedly—fast.
The core genius of UMAP is that it’s built on a solid theoretical foundation drawn from Riemannian geometry and algebraic topology (specifically, fuzzy simplicial sets). Before you close this tab, don’t worry: this isn’t going to turn into a math lecture. The key takeaway is this: UMAP assumes your data isn’t just a meaningless cloud of points; it lies on some underlying surface, a manifold. Think of it like a crumpled piece of paper (the manifold) stuffed into a box (your high-dimensional space). PCA can only see the box. t-SNE tries to uncrumple it but gets distracted by the local texture. UMAP’s goal is to find a low-dimensional representation that best respects the topology of that original crumpled paper: its connectedness and shape.
How UMAP Actually Works: A Two-Step Dance
It pulls this off in two main stages, both of which are smarter and more efficient than t-SNE’s approach.
Build a High-Dimensional Graph: First, UMAP figures out the neighbors for each point in your original space. It doesn’t use a perplexity setting like t-SNE. Instead, it uses a nifty adaptive mechanism. For each point, it finds the n_neighbors closest points. The distance to the nearest neighbor is subtracted from the other distances for that point, and what remains is rescaled by a per-point bandwidth. This means the notion of “close” is personalized for every point in the dataset, which is brilliant for handling density variations. It then creates a fuzzy simplicial set, a fancy term for a graph where the edges have weights (probabilities of connection) that capture how connected points are.
Optimize a Low-Dimensional Graph: Now, it creates a similar graph in your low-dimensional target space (e.g., 2D). The optimization process (the heavy lifting) involves yanking and pulling on this low-dimensional graph until its structure looks as much like the high-dimensional graph as possible. The huge win here is the loss function it uses: cross-entropy. Unlike t-SNE, which basically just minimizes the KL divergence, cross-entropy allows UMAP to do something crucial: it can say “Hey, points that weren’t neighbors in high dimensions should definitely be far apart in low dimensions.” This is the secret sauce for its superior global structure preservation. t-SNE mostly just cares about preserving local similarities and is relatively apathetic about where it puts unrelated clusters in relation to each other. UMAP actively repels them.
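To make the first step less abstract, here is a minimal brute-force sketch of how those per-point edge weights could be computed with plain NumPy. It is not umap-learn's implementation (which uses approximate nearest-neighbor search and compiled code); the function name fuzzy_graph_weights, the bisection bounds, and the toy data are all made up for illustration. The idea it follows is the one described above: subtract the nearest-neighbor distance, pick a per-point bandwidth so the weights sum to log2(n_neighbors), then merge the two directions of each edge with a fuzzy union.

```python
import numpy as np

def fuzzy_graph_weights(X, n_neighbors=5, n_iter=64):
    """Brute-force sketch of UMAP-style edge weights (small datasets only)."""
    n = X.shape[0]
    # Pairwise Euclidean distances.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor

    # Indices and distances of the n_neighbors closest points, nearest first.
    knn_idx = np.argsort(dist, axis=1)[:, :n_neighbors]
    knn_dist = np.take_along_axis(dist, knn_idx, axis=1)

    target = np.log2(n_neighbors)
    directed = np.zeros((n, n))
    for i in range(n):
        rho = knn_dist[i, 0]  # distance to the nearest neighbor
        shifted = np.maximum(knn_dist[i] - rho, 0.0)
        # Bisection (in log space) for the per-point bandwidth sigma.
        lo, hi = 1e-8, 1e8  # assumed search bounds for this sketch
        for _ in range(n_iter):
            sigma = np.sqrt(lo * hi)
            if np.exp(-shifted / sigma).sum() > target:
                hi = sigma
            else:
                lo = sigma
        directed[i, knn_idx[i]] = np.exp(-shifted / sigma)

    # Fuzzy union: an edge exists if either endpoint "believes" in it.
    return directed + directed.T - directed * directed.T

# Toy data, purely illustrative.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 5))
W = fuzzy_graph_weights(X_demo, n_neighbors=5)
print(W.shape, W.max(axis=1)[:3])
```

Notice that every point ends up with at least one edge of weight 1 (its nearest neighbor), no matter how sparse its neighborhood is. That is the "personalized notion of close" in action.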
The result? You get the local cluster goodness of t-SNE, but the relative distances between clusters tend to mean something. A cluster that’s closer to another in your UMAP plot is generally more similar than one that’s farther away. With t-SNE, that is far less reliable.
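The attract-and-repel behavior of the cross-entropy loss is easier to believe with numbers in front of you. The sketch below splits the edge-wise cross-entropy into its two terms and evaluates them on a hand-made toy problem: two pairs of points that are neighbors within a pair but not across pairs. Everything here is an assumption for illustration: the helper names, the toy layouts, and the low-dimensional kernel 1 / (1 + a * d^(2b)) with a = b = 1 (UMAP fits a and b from min_dist; the simplest case is used here).

```python
import numpy as np

def low_dim_weights(Y, a=1.0, b=1.0):
    """Edge weights in the embedding: 1 / (1 + a * d^(2b)) for each pair."""
    diff = Y[:, None, :] - Y[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)
    return 1.0 / (1.0 + a * d2 ** b)

def cross_entropy_terms(P, Q, eps=1e-12):
    """Return (attraction, repulsion) summed over the given pairs."""
    P = np.clip(P, eps, 1 - eps)
    Q = np.clip(Q, eps, 1 - eps)
    attraction = np.sum(P * np.log(P / Q))                    # pulls neighbors together
    repulsion = np.sum((1 - P) * np.log((1 - P) / (1 - Q)))   # pushes non-neighbors apart
    return attraction, repulsion

# High-dimensional graph: points 0-1 are neighbors, points 2-3 are neighbors.
P_full = np.array([
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
pairs = np.triu_indices(4, k=1)  # each unordered pair once

# Two candidate layouts: clusters kept apart vs. everything squashed together.
Y_good = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [10.1, 0.0]])
Y_bad = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [0.3, 0.0]])

att_good, rep_good = cross_entropy_terms(P_full[pairs], low_dim_weights(Y_good)[pairs])
att_bad, rep_bad = cross_entropy_terms(P_full[pairs], low_dim_weights(Y_bad)[pairs])
print("good layout:", att_good, rep_good)
print("bad layout: ", att_bad, rep_bad)
```

In both layouts the true neighbors sit equally close, so the attraction term barely changes. Only the repulsion term penalizes the squashed layout, which is exactly the piece that keeps unrelated clusters apart.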
Let’s See It in Action: Code That Doesn’t Take a Coffee Break
Here’s the beautiful part. Using UMAP is dead simple and incredibly fast. Let’s use the classic digits dataset. Watch how snappy this is.
# Import the usual suspects
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
import umap # You'll need to 'pip install umap-learn'
# Load up some data. Let's use the digits dataset.
digits = load_digits()
X, y = digits.data, digits.target
# This is a best practice you shouldn't skip. UMAP is distance-based, so scale your data!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Now, let's do the thing. This will take seconds, not minutes.
reducer = umap.UMAP(random_state=42) # Always set a random_state for reproducibility
X_umap = reducer.fit_transform(X_scaled)
# Plot the glorious result
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap='Spectral', s=5)
plt.colorbar(scatter, label='Digit Class')
plt.title('UMAP projection of the Digits dataset', fontsize=16)
plt.xlabel('UMAP 1')
plt.ylabel('UMAP 2')
plt.show()
You should see beautiful, tight clusters representing the digits 0-9. The global structure is meaningful; similar digits (like 4 and 9, or 3 and 8) are often closer together, while very different ones (like 1 and 0) are farther apart.
Taming the Beast: Key Hyperparameters and Pitfalls
UMAP is powerful, but it’s not magic. It has knobs to turn. The defaults are great, but you need to know what to tweak when.
n_neighbors: This is your most important knob. It balances local vs. global structure. A small n_neighbors value (e.g., 5-15) will give you very localized, fine-grained clusters, potentially breaking up broader structures. A large n_neighbors value (e.g., 50-200) will make UMAP prioritize the big picture, gluing together smaller local clusters that it decides are part of a larger whole. If your clusters look too fragmented, increase this value. If they look too merged and you’re losing important detail, decrease it.
min_dist: This controls how tightly UMAP packs points together in the low-dimensional space. A very low min_dist (e.g., 0.0) will give you extremely tight, dense clusters, which is good for a crisp look but can hide internal structure. A higher min_dist (e.g., 0.1-0.5) will let the clusters breathe more, allowing you to see potential sub-structure within them. Don’t be afraid to experiment with this.
metric: This is huge. UMAP isn’t limited to Euclidean distance. It ships with a long list of built-in metrics, and you can supply your own as a Numba-compiled function. Working with text? Try metric='cosine'. Working with biological sequences? Maybe metric='hamming'. This flexibility is a massive advantage over many other algorithms.
The biggest pitfall? Interpreting the absolute space between clusters as gospel. While it’s far better than t-SNE, it’s still a non-linear projection. The space between clusters is meaningful in a relative sense (“A is closer to B than to C”), but you can’t draw a conclusion like “The distance between cluster A and B is exactly twice that of A and C.” It’s a visualization tool, not a precision instrument for that kind of measurement.
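If you want a sanity check on that "relative, not absolute" reading, one option is to compare the ordering of between-cluster distances before and after projection. The sketch below is one hedged way to do it: compute centroid-to-centroid distances in both spaces and look at their rank correlation. The helper names are hypothetical, and the "embedding" here is a synthetic stand-in (a rescaled slice of the toy data), not real UMAP output.

```python
import numpy as np

def centroid_distances(X, labels):
    """Distances between class centroids, one value per unordered pair."""
    classes = np.unique(labels)
    centroids = np.stack([X[labels == c].mean(axis=0) for c in classes])
    diff = centroids[:, None, :] - centroids[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    return D[np.triu_indices(len(classes), k=1)]

def rank_correlation(a, b):
    """Spearman-style correlation computed on ranks (assumes no ties)."""
    rank_a = np.argsort(np.argsort(a)).astype(float)
    rank_b = np.argsort(np.argsort(b)).astype(float)
    return np.corrcoef(rank_a, rank_b)[0, 1]

# Synthetic stand-in data: three clusters at different spacings in 5-D.
rng = np.random.default_rng(0)
centers = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [3.0, 0.0, 0.0, 0.0, 0.0],
    [10.0, 0.0, 0.0, 0.0, 0.0],
])
labels = np.repeat([0, 1, 2], 50)
X_high = centers[labels] + 0.1 * rng.normal(size=(150, 5))
X_low = 5.0 * X_high[:, :2]  # stand-in "embedding": ordering kept, scale changed

score = rank_correlation(
    centroid_distances(X_high, labels),
    centroid_distances(X_low, labels),
)
print("rank correlation of centroid distances:", score)
```

To try it on the digits example, pass X_scaled, X_umap, and y in place of the synthetic arrays. A score near 1 says the ordering of cluster distances survived; it says nothing about the ratios, which is the point of the warning above.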
So, when should you use UMAP over t-SNE? Almost always. It’s faster, it preserves global structure better, and it’s more flexible. Use t-SNE only if you have a specific, weird affinity for its particular aesthetic output or if you’re on a machine so old it might actually be a steam-powered contraption. For everyone else, UMAP is the new default.