4.7 Class Imbalance: Oversampling, Undersampling, and SMOTE

Right, so you’ve built your model, you’re feeling pretty good, and then… it predicts everything as the majority class. 99% accuracy! Fantastic! Except it’s completely useless, because you’re trying to find the one fraudulent transaction in a sea of legitimate ones. Welcome to the wonderfully frustrating world of class imbalance, one of the biggest party poopers in classification. Most classifiers are trained to minimize overall error, and when one class makes up 99% of the data, the easiest way to do that is to always guess it. Lazy little things.

We’re not here to tolerate laziness. We’re going to force our model to actually look at the minority class by balancing the dataset ourselves. We have two main, brutally simple strategies, and one clever-but-slightly-dangerous one.
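To put a number on that laziness first, here is a minimal sketch using scikit-learn’s DummyClassifier, which ignores the features and always predicts the most frequent class. The toy labels and random features are made up for illustration; only the 99:1 ratio matters.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Toy data (an assumption for illustration): 9,900 negatives, 100 positives.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(10_000, 3))
y_toy = np.array([0] * 9_900 + [1] * 100)

# A "model" that never looks at the features and always guesses class 0.
lazy_model = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
y_lazy = lazy_model.predict(X_toy)

lazy_accuracy = accuracy_score(y_toy, y_lazy)  # share of correct guesses
lazy_recall = recall_score(y_toy, y_lazy)      # share of positives actually found

print(f"Accuracy: {lazy_accuracy:.2f}")
print(f"Minority recall: {lazy_recall:.2f}")
```

Accuracy comes out at 0.99 and minority recall at 0.00, which is why accuracy alone is a poor yardstick on data like this.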

The Sledgehammers: Undersampling and Oversampling

The core idea is stupidly simple: either we remove a bunch of the majority class (undersampling) or we add copies of the minority class (oversampling). Let’s see this in action with some code. First, let’s create a badly imbalanced dataset with scikit-learn. For the resampling itself we’ll use imbalanced-learn, which is the go-to library for this stuff (pip install imbalanced-learn).

import numpy as np
from sklearn.datasets import make_classification

# Create a synthetic imbalanced dataset
X, y = make_classification(
    n_samples=10000,
    n_features=10,
    n_informative=2,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=1,
    weights=[0.99, 0.01],  # 99% class 0, 1% class 1
    flip_y=0,  # no random label flips (the default of 0.01 would shift the counts)
    random_state=42
)

print(f"Original class counts: {np.bincount(y)}")
# Output: Original class counts: [9900  100]

Yikes. 9900 to 100. Let’s bring out the sledgehammers.

Undersampling is like kicking people out of the party to get an even ratio. It’s fast and efficient, but you’re literally throwing away data. This is a terrible idea if you don’t have a massive dataset to begin with.

from imblearn.under_sampling import RandomUnderSampler

undersampler = RandomUnderSampler(random_state=42)
X_under, y_under = undersampler.fit_resample(X, y)

print(f"Undersampled class counts: {np.bincount(y_under)}")
# Output: Undersampled class counts: [100 100]

Oversampling is the opposite: we just invite the same few people over and over again until the room is full. Also simple, but now you’re at real risk of overfitting to the handful of minority points you actually have. Your model can learn those specific examples by heart, which is not great for generalization.

from imblearn.over_sampling import RandomOverSampler

oversampler = RandomOverSampler(random_state=42)
X_over, y_over = oversampler.fit_resample(X, y)

print(f"Oversampled class counts: {np.bincount(y_over)}")
# Output: Oversampled class counts: [9900 9900]

See the problem? Undersampling decimates your valuable data. Oversampling makes your model a memorization machine for the minority class. We need something smarter.
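The trade-off can be checked by hand. The sketch below redoes both sledgehammers with plain NumPy on made-up data and counts what gets lost or merely repeated. The array names are illustrative, and this is not a claim about how imbalanced-learn implements its samplers internally.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up data: 9,900 majority rows and 100 minority rows, 10 features each.
X_major = rng.normal(loc=0.0, size=(9_900, 10))
X_minor = rng.normal(loc=2.0, size=(100, 10))

# Undersampling by hand: keep a random 100 of the majority rows.
keep_idx = rng.choice(len(X_major), size=len(X_minor), replace=False)
X_major_under = X_major[keep_idx]
discarded = len(X_major) - len(X_major_under)

# Oversampling by hand: keep the originals and add 9,800 random repeats.
extra_idx = rng.choice(len(X_minor), size=len(X_major) - len(X_minor), replace=True)
X_minor_over = np.vstack([X_minor, X_minor[extra_idx]])
distinct_rows = len(np.unique(X_minor_over, axis=0))

print(f"Majority rows thrown away by undersampling: {discarded}")
print(f"Minority rows after oversampling: {len(X_minor_over)}")
print(f"Distinct minority rows among them: {distinct_rows}")
```

Undersampling discards 9,800 real rows, while oversampling yields 9,900 minority rows that contain only 100 distinct points.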

The Clever Trick: SMOTE

This is where SMOTE (Synthetic Minority Over-sampling Technique) comes in. Instead of just duplicating minority samples, it creates new ones. How? It looks at a data point from the minority class, finds its k nearest neighbors (also from the minority class), picks one of them at random, and then creates a new point at a random position on the line segment connecting the two.

It’s basically saying, “The space between these two fraudulent transactions probably also represents a fraudulent transaction.” This is genius because it generates plausible new data instead of crude copies. Let’s see it.

from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X, y)

print(f"SMOTE class counts: {np.bincount(y_smote)}")
# Output: SMOTE class counts: [9900 9900]

The counts are the same as naive oversampling, but the content is different. We now have 9800 new, synthetic minority samples instead of 9800 copies.
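To make the interpolation step less of a black box, here is a hand-rolled sketch of the idea using NumPy and scikit-learn’s NearestNeighbors. The function name smote_sketch and its parameters are made up for this example, and imbalanced-learn’s real implementation differs in its details, so treat this as an illustration of the technique rather than a replacement for the library.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_sketch(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)

    # Ask for k + 1 neighbors because each point is its own nearest neighbor
    # (assuming there are no exact duplicate rows).
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbor_idx = nn.kneighbors(X_minority)
    neighbor_idx = neighbor_idx[:, 1:]  # drop the self-match in column 0

    base = rng.integers(0, len(X_minority), size=n_new)  # starting point
    pick = rng.integers(0, k, size=n_new)                # which of its k neighbors
    gap = rng.uniform(0.0, 1.0, size=(n_new, 1))         # how far along the segment

    start = X_minority[base]
    end = X_minority[neighbor_idx[base, pick]]
    # One gap per synthetic row, applied to all features at once.
    return start + gap * (end - start)


# Usage on made-up 2D minority data.
X_min = np.random.default_rng(1).normal(size=(100, 2))
X_new = smote_sketch(X_min, n_new=9_800)
print(X_new.shape)
```

Because every synthetic row is a blend of two real minority rows, none of them can land outside the range the minority class already covers. SMOTE fills in gaps; it does not discover new territory.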

Where SMOTE Goes Horribly Wrong

SMOTE is brilliant, but it’s not magic. It has a fatal flaw: it’s utterly oblivious to the broader context of your data. It will happily create synthetic samples in places that make no sense.

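Consider this nightmare scenario: you’ve label-encoded a categorical feature like “Merchant Category” as integers, say 0 for groceries and 3 for jewelry. SMOTE draws a line between a fraudulent grocery charge and a fraudulent jewelry charge and plants a synthetic point at category 1.9. That’s… not a merchant category. That’s a number that happens to sit between two of them, and you probably don’t want it in your training data.

This happens because SMOTE interpolates every feature along a straight line between two minority points and simply assumes that everything on that line is also a valid minority sample. It treats the feature space as continuous and the minority region as one connected blob, which is often a dangerous assumption. Categorical or integer-coded features break it, and so do minority classes that live in several separate clusters, where the line between two clusters can run straight through majority territory. (For mixed numeric and categorical data, imbalanced-learn offers SMOTENC, which handles the categorical columns separately.)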
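One concrete case where a synthetic point lands somewhere meaningless is a label-encoded categorical column. The sketch below applies the SMOTE-style interpolation step by hand to two made-up fraud rows; the merchant codes and amounts are hypothetical.

```python
import numpy as np

# Hypothetical label encoding, made up for this example.
merchant_codes = {0: "grocery", 1: "fuel", 2: "travel", 3: "jewelry"}

# Two minority (fraud) rows: [amount, merchant_code]
fraud_a = np.array([20.0, 0.0])   # small grocery charge
fraud_b = np.array([900.0, 3.0])  # large jewelry charge

# SMOTE-style interpolation: a random position on the segment between them.
rng = np.random.default_rng(0)
gap = rng.uniform(0.0, 1.0)
synthetic = fraud_a + gap * (fraud_b - fraud_a)

# Is the interpolated merchant code one of the real categories?
code_is_valid = float(synthetic[1]) in merchant_codes

print(f"gap = {gap:.2f}, synthetic row = {synthetic.round(2)}")
print(f"Merchant code is a real category: {code_is_valid}")
```

The amount is at least a plausible number, but the merchant code is a fraction that maps to no category at all, and rounding it would just pick an unrelated category.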

Best practice: Never, ever apply SMOTE to the entire dataset before splitting. Synthetic points built from what will become test rows end up in your training set (and synthetic points end up in your test set), so information leaks across the split and your model’s performance will be a beautiful, profound lie. Always split your data into train and test sets first, then apply SMOTE only on the training set.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Split FIRST
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Apply SMOTE ONLY to the training data
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Train on the resampled data
model = RandomForestClassifier(random_state=42)
model.fit(X_train_smote, y_train_smote)

# Evaluate on the pristine, un-tampered-with test data
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

So, What Should You Actually Do?

Start simple. Try class weighting first. Many scikit-learn classifiers (like RandomForestClassifier or SVC) accept class_weight='balanced'. This weights each class inversely to its frequency, so mistakes on the minority class are penalized more heavily. It’s often all you need and doesn’t involve any dangerous synthetic data generation.
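For a sense of how heavily 'balanced' leans on the minority class, the weights can be worked out by hand. scikit-learn documents the heuristic as n_samples / (n_classes * count of each class); the sketch below computes that on made-up 99:1 labels and compares it with scikit-learn’s compute_class_weight helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels with the same 99:1 split used in this section.
y_example = np.array([0] * 9_900 + [1] * 100)
classes = np.unique(y_example)

# The "balanced" heuristic, computed by hand.
manual_weights = len(y_example) / (len(classes) * np.bincount(y_example))

# The same thing via scikit-learn's helper.
sklearn_weights = compute_class_weight(
    class_weight="balanced", classes=classes, y=y_example
)

print(f"Manual:       {manual_weights.round(3)}")
print(f"scikit-learn: {sklearn_weights.round(3)}")
```

Class 0 gets a weight of about 0.505 and class 1 gets 50, so a mistake on a minority sample costs roughly 99 times as much as a mistake on a majority one.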

If you must use sampling, try a combination. SMOTE is often paired with undersampling or cleaning: either random undersampling of the majority class, or a cleaning pass such as SMOTEENN, which removes samples that disagree with their neighbors after SMOTE has run. And for the love of all that is good, visualize your data after applying SMOTE. Use PCA to project it into 2D and see if those synthetic points are landing in sane places or if you’ve just created a feature-space Lovecraftian horror. Your model’s performance, and your sanity, depend on it.