Pca | mikePietsch.com

9.8 Linear Discriminant Analysis (LDA)

Alright, let’s talk about Linear Discriminant Analysis, or LDA. Don’t get it twisted—this isn’t the Latent Dirichlet Allocation for topic modeling. This is the other LDA, the one that’s like a much more sophisticated, class-conscious cousin to PCA. While PCA is obsessed with maximum variance and ignores your class labels entirely (how rude), LDA actually uses those labels to find the axes that maximize the separation between your pre-defined classes. It’s a supervised learning algorithm moonlighting as a dimensionality reduction technique.
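To make the "PCA ignores labels, LDA uses them" point concrete, here is a minimal NumPy sketch of the two-class case. It is an illustration under stated assumptions, not a full LDA implementation: the toy data (two blobs stretched along x and separated along y) and the helper name `separation` are made up for this example, and the direction is the classic Fisher discriminant, proportional to the inverse within-class scatter times the difference of class means.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: both classes are stretched along x, but separated along y.
scale = np.array([5.0, 0.5])
X0 = rng.normal(size=(200, 2)) * scale + np.array([0.0, -1.5])
X1 = rng.normal(size=(200, 2)) * scale + np.array([0.0, 1.5])
X = np.vstack([X0, X1])

# PCA direction: top eigenvector of the total covariance (labels never used).
_, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
pca_dir = eigvecs[:, -1]  # eigh sorts eigenvalues in ascending order

# Fisher LDA direction for two classes: w is proportional to Sw^{-1} (mu1 - mu0).
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
Sw = (np.cov(X0, rowvar=False) * (len(X0) - 1)
      + np.cov(X1, rowvar=False) * (len(X1) - 1))  # within-class scatter
w = np.linalg.solve(Sw, mu1 - mu0)
lda_dir = w / np.linalg.norm(w)

def separation(direction):
    """Fisher criterion on a 1-D projection: squared mean gap over summed variances."""
    p0, p1 = X0 @ direction, X1 @ direction
    return (p1.mean() - p0.mean()) ** 2 / (p0.var() + p1.var())

print("PCA direction:", pca_dir, "separation:", separation(pca_dir))
print("LDA direction:", lda_dir, "separation:", separation(lda_dir))
```

On data like this, PCA should lock onto the x axis because that is where the spread is, while the LDA direction should point along y, where the classes actually differ.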

9.7 Feature Selection vs Feature Extraction

Right, let’s settle this. Before we dive into the glorious math of PCA and the beautiful visualizations of t-SNE, we need to get this fundamental distinction straight. It’s the difference between throwing out entire bags of groceries and making a gourmet reduction sauce. Both get you a smaller kitchen, but the results are… wildly different. You’re drowning in features. Your dataset has hundreds, maybe thousands of columns. Your model is slow, noisy, and probably overfit. You need to reduce the dimensionality. Your two main weapons are Feature Selection and Feature Extraction. Don’t mix them up.
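Here is a rough side-by-side of the two approaches in NumPy. It is a sketch under assumptions: the toy matrix is random, the selection rule (keep the highest-variance columns) is just one simple filter among many, and the extraction step uses PCA via SVD as a stand-in for extraction methods in general.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 100 rows, 5 columns with very different spreads.
X = rng.normal(size=(100, 5)) * np.array([3.0, 0.1, 2.0, 0.2, 1.0])
k = 2

# Feature selection: keep k of the ORIGINAL columns (here, the highest-variance ones).
keep = np.sort(np.argsort(X.var(axis=0))[-k:])
X_selected = X[:, keep]

# Feature extraction: build k NEW columns, each a linear combination of all originals.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X_extracted = Xc @ Vt[:k].T

print("kept original columns:", keep)
print("selected shape:", X_selected.shape, "extracted shape:", X_extracted.shape)
```

Both outputs have the same shape, but the selected columns are still columns you can name, while the extracted ones are blends of everything.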

9.6 Autoencoders for Dimensionality Reduction

Right, so you’ve slogged through PCA, marveled at the weird, clumpy art of t-SNE, and felt the clean, topological embrace of UMAP. They’re all fantastic, but they share a common trait: they’re projection methods. They take your high-dimensional data and squash it down onto a lower-dimensional plane for you to look at. Useful, but a bit like a tourist taking a photo of a city—you get a nice 2D view, but you can’t really go and build a new building there.

9.5 UMAP: Faster and More Globally Faithful than t-SNE

Alright, let’s get into UMAP. If t-SNE is the brilliant but moody artist who gets lost in the details, UMAP is the pragmatic engineer who understands the big picture and actually cares about how long the project takes. It stands for Uniform Manifold Approximation and Projection, which sounds like a mouthful dreamed up by a committee, but the underlying ideas are actually elegant, powerful, and (blessedly) fast. The core genius of UMAP is that it’s built on a solid theoretical foundation from Riemannian geometry and algebraic topology (specifically, something called fuzzy simplicial sets). Before you close this tab, don’t worry, you’re not going to get a math lecture. The key takeaway is this: UMAP assumes your data isn’t just a meaningless cloud of points; it’s lying on some underlying surface, a manifold. Think of it like a crumpled piece of paper (the manifold) stuffed into a box (your high-dimensional space). PCA can only see the box. t-SNE tries to uncrumple it but gets distracted by the local texture. UMAP’s goal is to find a low-dimensional representation that best respects the topology of that original crumpled paper: its connectedness and shape.

9.4 t-SNE: Preserving Local Structure for Visualization

Alright, let’s talk about t-SNE. If PCA is the sober cartographer, meticulously drawing a scaled map of your data’s grandest themes, then t-SNE is the abstract expressionist painter. It doesn’t care about global distances or precise scales. Its entire raison d’être is to preserve the local structure: it wants to show you which data points are huddled together in little clumps and neighborhoods. This makes it phenomenally good for visualization, especially of high-dimensional stuff like word embeddings or single-cell RNA-seq data, where you just need to see the clusters. But, and this is a massive but, it will lie to your face about the big picture: the sizes of the clusters and the distances between them in the plot can’t be taken at face value. More on that later.

9.3 Kernel PCA and Non-Linear PCA

Right, so you’ve mastered standard PCA. You can take a high-dimensional dataset, project it onto a set of orthogonal axes that maximize variance, and get a lower-dimensional representation that’s actually useful. It’s brilliant, and it’s linear. And that’s the problem. The universe, much to the chagrin of mathematicians everywhere, is stubbornly non-linear. What happens when your data lives on some twisted manifold—think of a rolled-up sheet of paper or a squiggly line in 3D space? Standard PCA, which can only perform linear projections, will completely butcher it. It’s like trying to use a flat map of the world: useful for some things, but it will never accurately represent the distances between points on a sphere. We need a way to “unroll” the manifold. Enter Kernel PCA, our first and most mathematically elegant tool for the job.

9.2 PCA: Eigenvectors, Explained Variance, and Scree Plots

Alright, let’s get our hands dirty with PCA. Forget the textbook definition for a second. Here’s what PCA actually does: it finds the directions in your data where things are most stretched out, the axes of maximum variance. Think of it like taking a messy, tilted cloud of points and rotating it so you can look at it from the most informative angles. The first new angle you look from (Principal Component 1) shows you the most spread. The next one (PC2) shows you the next most spread, and so on. It’s a workhorse. It’s not flashy, but it’s the first thing you should reach for when you need to simplify your data or see its structure.
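The "rotate the tilted cloud" picture can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a made-up 2-D cloud stretched along a 30-degree line; the function name `pca` and its return values are choices for this example, not a library API.

```python
import numpy as np

def pca(X, n_components):
    """PCA via eigendecomposition of the covariance matrix (illustrative sketch)."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]             # re-sort: largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained_ratio = eigvals / eigvals.sum()     # share of variance per component
    components = eigvecs[:, :n_components]        # columns are PC1, PC2, ...
    return X_centered @ components, components, explained_ratio

rng = np.random.default_rng(0)

# Hypothetical toy data: a cloud stretched along one axis, then tilted by 30 degrees.
base = rng.normal(size=(300, 2)) * np.array([5.0, 0.5])
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = base @ R.T

scores, components, ratio = pca(X, 2)
print("PC1 direction:", components[:, 0])
print("explained variance ratio:", ratio)
```

PC1 should come out (up to sign) along the tilt direction, and the explained variance ratios are exactly the numbers you would plot in a scree plot.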

9.1 The Curse of Dimensionality

Right, let’s talk about the monster in the closet of every data scientist: the Curse of Dimensionality. It sounds like a bad Indiana Jones sequel, but I promise you, it’s far more real and it’s actively trying to ruin your models. The core problem is this: in high dimensions, our intuition about space and distance (the very foundation of most machine learning algorithms) completely and utterly falls apart. Think of it this way. In one dimension (a line), data is simple. In two dimensions (a plane), you can still visualize clusters. In three dimensions (a volume), it gets trickier, but we can still reason about it. Now, imagine a dataset with 100, or 1,000, or 10,000 features. You’re not in Kansas anymore; you’re in a hyper-dimensional nightmare where every point is nearly equidistant from every other point, so "nearest" and "farthest" neighbors stop meaning much. This isn’t just a theoretical curiosity; it’s the reason your brilliant k-Nearest Neighbors model suddenly becomes useless on raw, high-dimensional data.
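You can watch distances collapse toward each other with a quick experiment. The sketch below assumes points drawn uniformly from the unit hypercube, which is a convenient toy setup rather than a model of real data; `distance_contrast` is a hypothetical helper that measures how much farther the farthest neighbor is than the nearest one, relative to the nearest.

```python
import numpy as np

def distance_contrast(n_points, n_dims, seed=0):
    """Return (max - min) / min over distances from one query point to all others."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))   # uniform in the unit hypercube (assumption)
    query = points[0]
    dists = np.linalg.norm(points[1:] - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())

for d in (2, 10, 100, 1000):
    print(f"{d:>5} dims: relative contrast = {distance_contrast(500, d):.3f}")
```

As the dimension grows, the contrast should shrink toward zero: the farthest point ends up only slightly farther away than the nearest, which is bad news for anything that relies on nearest neighbors.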