9.4 t-SNE: Preserving Local Structure for Visualization

Alright, let’s talk about t-SNE. If PCA is the sober cartographer, meticulously drawing a scaled map of your data’s grandest themes, then t-SNE is the abstract expressionist painter. It doesn’t care about global distances or precise scales. Its entire raison d’être is to preserve the local structure: it wants to show you which data points are huddled together in little clumps and neighborhoods. This makes it phenomenally good for visualization, especially of high-dimensional stuff like word embeddings or single-cell RNA sequencing data, where you just need to see the clusters. But, and this is a massive but, it will lie to your face about the big picture. More on that later.

The Core Idea: It’s All About Neighbors

Here’s the gist. t-SNE tries to make the arrangement of points in its low-dimensional projection (usually 2D) reflect the local similarities from the high-dimensional space. It does this by first constructing a probability distribution that represents the likelihood that any two points are neighbors. A point picks another point as its neighbor with a probability proportional to a Gaussian (normal) density centered on itself, normalized over all the other points. Points that are close in the high-D space have a high probability of being neighbors; points that are far away have a vanishingly small probability.
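To make the neighbor probabilities concrete, here is a minimal NumPy sketch. It assumes a single fixed bandwidth sigma shared by every point, which is a simplification: real t-SNE tunes a separate bandwidth per point to match the chosen perplexity. The function name and the toy data are made up for illustration.

```python
import numpy as np

def gaussian_neighbor_probs(X, sigma=1.0):
    """Row i holds p(j | i): the chance that point i picks point j as a neighbor.

    Assumption: one shared sigma for all points (real t-SNE fits one per point).
    """
    # Squared Euclidean distance between every pair of points
    diff = X[:, None, :] - X[None, :, :]
    sq_dists = np.sum(diff ** 2, axis=-1)
    # Gaussian affinity centered on each point
    affinities = np.exp(-sq_dists / (2.0 * sigma ** 2))
    # A point never picks itself
    np.fill_diagonal(affinities, 0.0)
    # Normalize each row so it sums to 1
    return affinities / affinities.sum(axis=1, keepdims=True)

# Toy data: two points close together, one far away
X_toy = np.array([[0.0, 0.0], [0.5, 0.0], [5.0, 0.0]])
P_cond = gaussian_neighbor_probs(X_toy)
print(np.round(P_cond, 4))
```

In the first row, almost all of the probability lands on the nearby point and next to none on the distant one, which is the “vanishingly small” behavior described above.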

Then, in the low-dimensional space, it constructs a similar probability distribution (but using a heavier-tailed Student t-distribution with one degree of freedom—this is the ’t’ in t-SNE). This is a crucial trick. The fatter tails of the t-distribution make it easier for points to be modeled as “moderately distant” in the 2D/3D plot, which alleviates the dreaded “crowding problem” where everything gets squished into the center.
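A quick numeric comparison shows what the heavier tails buy you. The sketch below evaluates an unnormalized Gaussian kernel (with sigma = 1, an arbitrary choice) and the Student t kernel with one degree of freedom at a few distances, then builds the low-dimensional similarity matrix from the t kernel. The helper name and the sample distances are illustrative, not taken from any library.

```python
import numpy as np

def student_t_similarities(Y):
    """Low-dimensional similarities q_ij from a Student t kernel (1 degree of freedom)."""
    diff = Y[:, None, :] - Y[None, :, :]
    sq_dists = np.sum(diff ** 2, axis=-1)
    kernel = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(kernel, 0.0)
    # Normalize over all pairs so the whole matrix sums to 1
    return kernel / kernel.sum()

# Compare the two kernels at increasing distances
distances = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
gaussian_kernel = np.exp(-distances ** 2 / 2.0)
t_kernel = 1.0 / (1.0 + distances ** 2)

for d, g, t in zip(distances, gaussian_kernel, t_kernel):
    print(f"distance {d:4.1f}   gaussian {g:.2e}   student-t {t:.2e}")

# Similarities for three points laid out on a line in 2D
Y_toy = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]])
Q = student_t_similarities(Y_toy)
```

At larger distances the Gaussian value collapses toward zero while the t kernel decays only polynomially, so moderately distant points can sit apart in the plot without being treated as infinitely far away.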

Finally, it uses gradient descent to minimize the Kullback-Leibler (KL) divergence between these two distributions. In plain English, it moves the points around on the 2D plane, trying to make the “neighborhood” relationships look as similar as possible to how they did in the high-D space. It’s basically playing a game of “keep your friends close.”
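The optimization loop can be sketched in a few lines of NumPy. This is a bare-bones illustration under stated assumptions, not how production implementations work: it uses one fixed Gaussian bandwidth instead of per-point bandwidths, plain gradient descent with a hand-picked learning rate, and none of the usual tricks such as momentum or early exaggeration. All function names and the toy data are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(Z):
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def joint_gaussian_probs(X, sigma=1.0):
    """Symmetric high-dimensional similarities p_ij (fixed sigma: a simplification)."""
    affinities = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(affinities, 0.0)
    cond = affinities / affinities.sum(axis=1, keepdims=True)
    return (cond + cond.T) / (2.0 * len(X))

def kl_and_gradient(P, Y):
    """KL(P || Q) and its gradient with respect to the 2D coordinates Y."""
    kernel = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(kernel, 0.0)
    Q = kernel / kernel.sum()
    mask = P > 0
    kl = np.sum(P[mask] * np.log(P[mask] / Q[mask]))
    # Gradient for point i: 4 * sum_j (p_ij - q_ij) * kernel_ij * (y_i - y_j)
    weights = (P - Q) * kernel
    grad = 4.0 * (weights.sum(axis=1)[:, None] * Y - weights @ Y)
    return kl, grad

# Toy data: two well-separated blobs in 5 dimensions
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0.0, 0.5, size=(10, 5)),
                   rng.normal(3.0, 0.5, size=(10, 5))])
P = joint_gaussian_probs(X_toy)

# Start from a tiny random 2D layout and walk downhill
Y = rng.normal(0.0, 1e-2, size=(20, 2))
kl_start, _ = kl_and_gradient(P, Y)
learning_rate = 1.0  # hand-picked for this toy problem
for _ in range(300):
    _, grad = kl_and_gradient(P, Y)
    Y = Y - learning_rate * grad
kl_end, _ = kl_and_gradient(P, Y)

print(f"KL before: {kl_start:.4f}   KL after: {kl_end:.4f}")
```

Pairs with p_ij larger than q_ij pull together and pairs with p_ij smaller than q_ij push apart, which is the “keep your friends close” game expressed as a gradient.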

A Code Example: From Theory to Plot

Let’s see it in action. We’ll use the classic Iris dataset because everyone does, and then we’ll do something more interesting.

# Import the usual suspects
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Load the data and standardize it (CRUCIAL for distance-based methods)
iris = datasets.load_iris()
X = iris.data
y = iris.target
X_scaled = StandardScaler().fit_transform(X)

# Create and fit a t-SNE model. Let's use some common parameters.
tsne = TSNE(n_components=2,  # We want a 2D plot
            random_state=42, # So our results are reproducible
            perplexity=30)   # More on this magic number next
# The optimizer runs for 1000 iterations by default. To change that, pass
# max_iter (scikit-learn 1.5+) or n_iter (older releases).

X_tsne = tsne.fit_transform(X_scaled)

# Now let's plot the results
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis', alpha=0.8)
plt.colorbar(scatter, label='Iris Species')
plt.title('t-SNE of Iris Dataset')
plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.show()

You should see one clearly isolated cluster (setosa) and two groups (versicolor and virginica) that sit right next to each other and may partly overlap. Notice that while the groups are visible, the distances between them are meaningless. One cluster isn’t necessarily “further” from another in a way you can interpret. You can only interpret the clustering itself.

The Perplexity Parameter: Your Knob for Guesswork

This is the most important hyperparameter and the one that feels the most like black magic. Perplexity is essentially a guess about the number of close neighbors each point has. It’s a smooth measure of a point’s immediate social circle size.
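The “social circle size” reading comes straight from the definition: perplexity is 2 raised to the Shannon entropy (in bits) of a point’s neighbor distribution. A small sketch, with a made-up helper name and made-up distributions, shows why that behaves like an effective neighbor count.

```python
import numpy as np

def perplexity(probs):
    """Perplexity of a discrete distribution: 2 ** (Shannon entropy in bits)."""
    probs = np.asarray(probs, dtype=float)
    nonzero = probs[probs > 0]
    entropy_bits = -np.sum(nonzero * np.log2(nonzero))
    return 2.0 ** entropy_bits

# Probability spread evenly over 8 neighbors
uniform_over_8 = np.full(8, 1.0 / 8.0)
# Probability concentrated almost entirely on one neighbor
lopsided = np.array([0.90, 0.05, 0.03, 0.02])

print(perplexity(uniform_over_8))
print(perplexity(lopsided))
```

A point whose neighbor probabilities are spread evenly over 8 others has a perplexity of 8, while the lopsided distribution scores between 1 and 2 even though it technically has four neighbors. When you set perplexity, t-SNE adjusts each point’s Gaussian bandwidth until its neighbor distribution hits that target.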

Low perplexity (e.g., 5): The algorithm will focus on very local structure. You’ll get many, very small, tight clusters. It’s like looking at the data with a microscope—you see the tiny neighborhoods but miss the city blocks.
High perplexity (e.g., 50): The algorithm considers more points to be neighbors. It will produce fewer, looser clusters and capture more of the global structure (though it’s still bad at this compared to PCA). It’s like looking at the data from a satellite.

The rule of thumb is to set perplexity between 5 and 50. A value around 30 is often a good starting point. If your dataset is tiny (like <100 points), you’ll need to set it lower; scikit-learn requires perplexity to be smaller than the number of samples. The fact that you have to guess this is, frankly, a bit absurd, but it’s the price of admission.
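Since the right value is a guess, the practical move is to try a few and compare. The sketch below refits Iris at three perplexities and keeps each embedding in a dictionary. The particular values and the dictionary name are arbitrary choices for illustration.

```python
from sklearn import datasets
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(datasets.load_iris().data)

# One embedding per perplexity value, keyed by that value
embeddings = {}
for perplexity in (5, 30, 50):
    model = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    embeddings[perplexity] = model.fit_transform(X_scaled)
    print(f"perplexity={perplexity:>2}  final KL divergence={model.kl_divergence_:.3f}")
```

Plot each entry of embeddings side by side, colored by species, and look for structure that survives across settings. Treat the printed KL values as a convergence check rather than a score for picking the perplexity, since each perplexity defines a different target distribution.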

The Major Pitfalls: Why t-SNE Will Mess With You

Interpretative Lies: I cannot stress this enough. The axes are meaningless. The scale is meaningless. The distances between clusters are meaningless. You can rotate the plot, flip it, or shift it, and it would be just as “correct,” because the objective only looks at distances between pairs of points. The only thing you can trust is the proximity of points within a cluster. Anyone who tells you “Feature X is responsible for t-SNE axis 1” should be gently reminded to stick to PCA.
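Stochasticity: Run t-SNE twice with the same parameters and a different random_state, and you can get a different plot. The overall cluster pattern should be similar, but the exact layout will change. This is because the optimization depends on the seed: classically, the gradient descent starts from a random initialization. Recent scikit-learn versions initialize from PCA by default, which tends to make layouts more stable but doesn’t remove the seed dependence entirely. Always set random_state for reproducibility during exploration.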
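Computational Cost: That KL divergence minimization is expensive. The exact algorithm is O(N^2) in the number of points, because it compares every pair. scikit-learn’s default Barnes-Hut approximation cuts this to roughly O(N log N), but for datasets larger than tens of thousands of points, it still gets painfully slow. In those cases, you might look at faster implementations or alternatives like UMAP.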
Sensitivity to Parameters: The shape and number of clusters you see can change dramatically based on perplexity and the learning rate for the gradient descent. You must experiment. A plot is not a single truth; it’s a view through a specific lens you’ve chosen.
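The stochasticity pitfall is easy to check for yourself. This sketch embeds Iris three times: twice with the same seed and once with a different one. It deliberately sets init="random" and method="exact" to make the effect plain on a small dataset; those are choices for this demo, not library defaults, and the embed helper is hypothetical.

```python
import numpy as np
from sklearn import datasets
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(datasets.load_iris().data)

def embed(seed):
    # init="random" and method="exact" are chosen for this demo, not library defaults
    model = TSNE(n_components=2, perplexity=30, init="random",
                 method="exact", random_state=seed)
    return model.fit_transform(X_scaled)

run_a = embed(seed=0)
run_b = embed(seed=0)   # same seed as run_a
run_c = embed(seed=1)   # different seed

print("same seed, identical layout:     ", np.allclose(run_a, run_b))
print("different seed, identical layout:", np.allclose(run_a, run_c))
```

If you plot run_a and run_c, expect the same broad grouping of species drawn in different positions and orientations, which is exactly why coordinates from separate runs should never be compared directly.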

So, when should you use it? Almost exclusively for exploratory data analysis and visualization. It’s a fantastic tool for seeing clusters you didn’t know were there. It is categorically not a tool for feature engineering or for producing inputs to another machine learning model: the embedding distorts global distances, and scikit-learn’s TSNE cannot project new, unseen points into an existing embedding. For that, you’d use PCA. t-SNE is for your eyes, and your brain, only.
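One practical reason PCA fits into a modeling pipeline and t-SNE doesn’t shows up directly in the scikit-learn API. The sketch below fits PCA on a training split and projects held-out rows with it, then checks whether TSNE offers the same method. The split size and variable names are arbitrary.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, _ = datasets.load_iris(return_X_y=True)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=2).fit(scaler.transform(X_train))

# PCA learns a fixed linear map, so unseen rows can be projected later
X_test_pca = pca.transform(scaler.transform(X_test))
print(X_test_pca.shape)

# TSNE only offers fit / fit_transform: there is no transform for new rows
print(hasattr(TSNE(), "transform"))
```

Without a way to map new rows into the same coordinates, a t-SNE embedding can’t serve as a stable feature space for a downstream model.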