79.9 Feature Selection and Dimensionality Reduction: PCA, SelectKBest

Right, let’s talk about one of the most common and quietly frustrating parts of the job: your data has too many columns. You’re not just being messy; you’ve probably got dozens or hundreds of features, and a nagging suspicion that most of them are either useless, redundant, or actively plotting against your model’s performance. This isn’t a data hoarding intervention; it’s about being smart. We’re going to cover two of your most powerful allies in this fight: brute-force statistical scoring (SelectKBest) and the elegant, geometric magic of Principal Component Analysis (PCA).

The goal is simple: reduce the number of features while keeping the good stuff, the signal. Why bother? Three big reasons. First, it counters the “curse of dimensionality”: the volume of the feature space grows exponentially with the number of features, so the same number of samples covers it ever more sparsely, and your model has a harder time finding meaningful patterns. It’s like trying to find a friend in an empty field versus in the entire continent of Asia. Second, it speeds up training. Fewer features mean less computation. And third, it can actually improve your model’s accuracy by reducing overfitting. You’re cutting out the noise.

The Statistical Bouncer: SelectKBest

Think of SelectKBest as a no-nonsense bouncer at a club. It lines up all your features, applies a statistical test to each one to see how well it relates to the target variable, gives each a score, and then only lets the top K highest-scoring features through the door. It’s simple, effective, and brutally direct.

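For regression problems the usual scoring function is f_regression (a univariate linear-regression F-test). For classification it’s f_classif (the ANOVA F-value) or chi2. One catch: chi2 requires non-negative features and is really designed for counts and frequencies. Every wine feature is non-negative, so chi2 runs fine below, but f_classif is the more natural fit for continuous measurements. Let’s see it in action. We’ll use the classic wine dataset because, well, it’s more interesting than housing prices.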

from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split

# Load some classy data
wine = load_wine()
X, y = wine.data, wine.target
feature_names = wine.feature_names

print(f"Original number of features: {X.shape[1]}")

# Split first! You never, ever fit on your test data. Not even for feature selection.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the bouncer: we want the top 5 features, judged by chi-squared
selector = SelectKBest(score_func=chi2, k=5)

# Fit it on the training data and transform the training data
X_train_selected = selector.fit_transform(X_train, y_train)

# Now, crucially, we transform the test data using the selector we fit on the training data.
# We do NOT fit again on the test data. That would be cheating.
X_test_selected = selector.transform(X_test)

print(f"Reduced number of features: {X_train_selected.shape[1]}")
print("Top 5 features:", [feature_names[i] for i in selector.get_support(indices=True)])

The Pitfall: The biggest mistake here is thinking SelectKBest understands feature interactions. It doesn’t. It judges each feature in isolation. If you have two features that are useless alone but brilliant together, this bouncer will kick them both out. It’s a univariate method. Also, you must do the train-test split before the feature selection. Fitting the selector on your entire dataset leaks information from the test set into the training process, giving you a hopelessly optimistic performance estimate.
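To make the univariate blind spot concrete, here is a minimal sketch on made-up data. The features a, b, and noisy are hypothetical, and f_classif is used instead of chi2 because one column goes negative. The target is the XOR of a and b, so the pair predicts it perfectly, yet each one scores roughly zero on its own and the bouncer keeps the noisy column instead.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical toy data: 'a' and 'b' are balanced binary features,
# and the target is their XOR, so neither one predicts y alone.
n_repeats = 100
a = np.tile([0, 0, 1, 1], n_repeats)
b = np.tile([0, 1, 0, 1], n_repeats)
y = a ^ b

# A third feature that tracks y directly, but buried in noise.
rng = np.random.default_rng(0)
noisy = y + rng.normal(0.0, 1.0, size=y.shape[0])

X = np.column_stack([a, b, noisy])

# Ask the bouncer for the single best feature, scored one at a time.
selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)
xor_scores = selector.scores_
kept = selector.get_support(indices=True)

print("F-scores (a, b, noisy):", xor_scores)
print("Kept column index:", kept)
print("a XOR b reproduces y exactly:", bool(np.array_equal(a ^ b, y)))
```

If you suspect interactions like this, a univariate filter is the wrong tool for that part of the job, however well it works elsewhere.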

The Geometric Alchemist: Principal Component Analysis (PCA)

PCA is a different beast. It doesn’t just pick existing features; it creates brand new ones, called principal components. These new features are linear combinations of the original ones, designed to capture the maximum amount of variance in the data. The first PC points in the direction of the greatest variance. The second PC is orthogonal to the first and points in the direction of the next greatest variance, and so on.

Why is this brilliant? It takes a bunch of potentially correlated features (like “height in inches” and “height in centimeters”) and creates a new, uncorrelated set of features that efficiently summarizes the data. It’s like taking a messy, tilted cloud of points in 3D space and rotating it so you can see its true shape clearly along new, better axes.
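The height example can be sketched directly. The data below is invented (200 hypothetical heights recorded in both inches and centimeters). Because the two columns carry the same information, the first principal component should soak up essentially all of the variance, with equal weight on each column.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical heights: the same measurement recorded in two units.
rng = np.random.default_rng(0)
height_in = rng.normal(67.0, 4.0, size=200)
height_cm = height_in * 2.54

X_height = np.column_stack([height_in, height_cm])
X_height_scaled = StandardScaler().fit_transform(X_height)

# Two input columns, so at most two components.
pca_height = PCA(n_components=2).fit(X_height_scaled)
height_ratios = pca_height.explained_variance_ratio_

print("Explained variance ratio:", height_ratios)
print("First component weights:", pca_height.components_[0])
```

The second component is left with nothing to explain, which is PCA’s way of telling you the second column was redundant.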

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# PCA is driven by variance, so you should almost always standardize your data first.
# If you don't, a feature with a large range (like "annual income") will
# completely dominate a feature with a small range (like "age") and you'll
# get nonsense components. Don't forget this. I've forgotten it. It's bad.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Scale test set with training parameters

# Let's say we want to reduce our data to 2 dimensions for visualization
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled) # Again, transform with the fitted PCA

print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Shape after PCA: {X_train_pca.shape}")

The explained_variance_ratio_ tells you the fraction of the total variance in the (scaled) data that is captured by each component. If the first two components have ratios of [0.45, 0.3], then together they’ve captured 75% of the variance. Keep in mind that variance is only a proxy for useful information: PCA never looks at the target. Still, this is your best guide for choosing n_components. You can run PCA without specifying n_components and look at the cumulative explained variance plot to find a good cutoff (e.g., 95% variance retained).
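Here is one way to find that cutoff without a plot, as a sketch on the same wine split. The 95% threshold is just the example figure from above, not a rule. PCA also accepts a float between 0 and 1 for n_components, in which case it picks the number of components needed to reach that share of the variance.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train_scaled = StandardScaler().fit_transform(X_train)

# Keep every component so the full variance spectrum is visible.
pca_full = PCA().fit(X_train_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%.
n_for_95 = int(np.argmax(cumulative >= 0.95)) + 1
print("Cumulative explained variance:", np.round(cumulative, 3))
print("Components needed for 95%:", n_for_95)

# Shortcut: a float in (0, 1) asks PCA to choose that number itself.
pca_95 = PCA(n_components=0.95).fit(X_train_scaled)
print("Chosen by PCA(n_components=0.95):", pca_95.n_components_)
```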

The Rough Edge: Here’s the honest part: the new features from PCA are much harder to interpret than the originals. Your model might perform better, but you can no longer point to “alcohol content” as an important factor. You’re pointing to “a mysterious blend of alcohol, malic acid, and ash that explains 32% of the variance.” You can inspect the weights behind each component, but a weighted blend of thirteen measurements is still a hard sell. This is often a trade-off worth making for performance, but it’s a nightmare if you need to explain your model to a non-technical stakeholder.
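If you do need to say something about what a component means, the fitted PCA object exposes its weights in components_, one row per component and one column per original feature. A small sketch, assuming the same wine split as before and an arbitrary choice of showing the three largest weights per component:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42
)
X_train_scaled = StandardScaler().fit_transform(X_train)

pca = PCA(n_components=2).fit(X_train_scaled)
loadings = pca.components_  # shape: (n_components, n_features)

# For each component, list the three features with the largest absolute weight.
for i, weights in enumerate(loadings, start=1):
    top = np.argsort(np.abs(weights))[::-1][:3]
    summary = ", ".join(f"{wine.feature_names[j]} ({weights[j]:+.2f})" for j in top)
    print(f"PC{i}: {summary}")
```

This tells you which original features pull hardest on each component. It does not turn the component back into a single, nameable quantity.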

So, Which One Do You Use?

It’s not an either/or choice. Use SelectKBest when interpretability is key and you believe your best features are strong on their own. Use PCA when you suspect multicollinearity (highly correlated features) or when you need to crush dimensionality for a performance boost, and you’re willing to sacrifice some understanding of the features. You can even use them together in a pipeline—filter out the worst offenders with SelectKBest first, then use PCA on the remaining features to squeeze out even more redundancy. Just remember the golden rule: fit everything on your training data, and then use that fitted object to transform your test data. Now go forth and simplify.
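A pipeline is also the easiest way to obey that golden rule, because every step gets refit on the training portion of each cross-validation fold. The sketch below is one possible arrangement, not a recommendation: k=8, n_components=3, and the logistic regression classifier are arbitrary choices for illustration, and f_classif replaces chi2 because scaled features can be negative.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scale, filter with the bouncer, compress with PCA, then classify.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=8)),
    ("pca", PCA(n_components=3)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Each fold fits the scaler, selector, and PCA on its own training data only.
cv_scores = cross_val_score(pipeline, X, y, cv=5)
print("Fold accuracies:", cv_scores.round(3))
print("Mean accuracy:", round(cv_scores.mean(), 3))
```

Treat k and n_components as hyperparameters and tune them the same way you would tune anything else in the model.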