11.6 Cross-Validation: k-Fold, Stratified, and Time-Series CV

Alright, let’s get our hands dirty with cross-validation. If you’ve been following along, you know that training and testing on the same data is the ML equivalent of a student writing their own exam—it feels great, but the real world is going to be a brutal wake-up call. A simple train-test split is a good start, but it’s a single, fragile snapshot. Your model’s performance could be wildly different depending on which 20% of the data you randomly held out. Enter cross-validation: the way to stress-test your model and get a robust, realistic estimate of how it will perform on unseen data.

The core idea is elegantly simple: instead of one train-test split, you create many. You train the model on different subsets of the data and test it on the complementary parts, then you average the results. This tells you not just if your model works, but how consistently it works.

The Granddaddy: k-Fold Cross-Validation

This is the workhorse. Here’s the drill:

1. Shuffle your dataset randomly (crucial!).
2. Split it into k equal-sized (or as equal as possible), non-overlapping chunks, called “folds”.
3. For each unique fold:
   - Treat that fold as your holdout test set.
   - Train your model on the other k-1 folds.
   - Evaluate your model on the held-out fold and record the score.
4. Once you’ve done this for all k folds, calculate the average of all the recorded scores. This is your cross-validation score. The standard deviation of those scores tells you about the variance of your model’s performance.
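If you want to see the drill above with the lid off, here is a minimal sketch that writes the loop out by hand. The synthetic data, the LogisticRegression model, and the `_demo` variable names are assumptions made up for illustration; any estimator with fit and score would do.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic data, purely for illustration
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Steps 1 and 2: shuffle, then split into k non-overlapping folds
k = 5
kf_demo = KFold(n_splits=k, shuffle=True, random_state=0)

# Step 3: each fold takes one turn as the held-out test set
fold_scores = []
for train_idx, test_idx in kf_demo.split(X_demo):
    clf = LogisticRegression(max_iter=1000)  # fresh model for every fold
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(clf.score(X_demo[test_idx], y_demo[test_idx]))

# Step 4: average the scores and look at their spread
fold_scores = np.array(fold_scores)
print(f"Per-fold accuracy: {fold_scores}")
print(f"Mean: {fold_scores.mean():.4f}, std: {fold_scores.std():.4f}")
```

Creating a new model inside the loop matters: reusing one fitted object across folds is an easy way to let state leak between iterations.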

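Why k=5 or k=10? It’s a classic bias-variance trade-off. A higher k means each model trains on a larger share of the data (a fraction (k-1)/k), so the performance estimate is less pessimistically biased. But each test fold gets smaller and the training sets overlap more and more, so the per-fold scores get noisier and more correlated with each other, which can raise the variance of the averaged estimate. You also pay for k model fits. k=5 or 10 hits a sweet spot for most datasets. Using k=n (aka “Leave-One-Out”) means n model fits, which is often computationally insane, and the resulting estimate can be high-variance. Don’t do it unless your dataset is tiny.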
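To put rough numbers on that trade-off, here is a small back-of-the-envelope sketch. The helper `fold_sizes` and the dataset size of 1000 are hypothetical, and the arithmetic assumes k divides the sample count evenly.

```python
def fold_sizes(n_samples, k):
    """Return (train_size, test_size) for one iteration of k-fold CV.

    Hypothetical helper; assumes n_samples is divisible by k.
    """
    test_size = n_samples // k
    return n_samples - test_size, test_size

n_samples = 1000  # made-up dataset size
for k in (2, 5, 10, n_samples):  # k = n_samples is Leave-One-Out
    train_size, test_size = fold_sizes(n_samples, k)
    print(f"k={k:>4}: train on {train_size} ({train_size / n_samples:.1%}), "
          f"test on {test_size}, model fits needed: {k}")
```

Going from k=5 to k=10 buys each model only ten more percentage points of training data while doubling the number of fits, which is part of why few people bother going higher.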

Here’s how you do it in scikit-learn. It’s so straightforward it feels like cheating.

from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Let's conjure a sample dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Initialize your model
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Initialize the k-Fold splitter. shuffle=True is critical.
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Get the cross-validated scores
cv_scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')

print(f"CV Accuracy Scores: {cv_scores}")
print(f"Mean CV Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
# Output might look like (illustrative numbers, not a recorded run):
# CV Accuracy Scores: [0.87  0.89  0.855 0.88  0.865]
# Mean CV Accuracy: 0.8720 (+/- 0.0242)

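Pitfall Alert: The most common mistake is forgetting to shuffle the data first. If your data has any inherent order (e.g., all samples of class A first, then class B), you’ll end up with folds that are not representative of the overall distribution, completely poisoning the result. KFold defaults to shuffle=False, so set shuffle=True explicitly. The one exception is time-ordered data, where shuffling is exactly the wrong move (more on that in the time-series section).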
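Here is a small sketch of that pitfall in action. The sorted toy labels and the helper `fold_class_counts` are made up for illustration; the point is only to count how many samples of each class land in each test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical worst case: every class-0 sample comes before every class-1 sample
y_sorted = np.array([0] * 50 + [1] * 50)
X_sorted = np.arange(100).reshape(-1, 1)  # placeholder features

def fold_class_counts(splitter):
    """Return (n_class0, n_class1) for each test fold produced by the splitter."""
    return [
        tuple(int(c) for c in np.bincount(y_sorted[test_idx], minlength=2))
        for _, test_idx in splitter.split(X_sorted)
    ]

unshuffled = fold_class_counts(KFold(n_splits=2, shuffle=False))
shuffled = fold_class_counts(KFold(n_splits=2, shuffle=True, random_state=0))

print("shuffle=False:", unshuffled)  # each test fold holds a single class
print("shuffle=True: ", shuffled)    # both classes show up in each test fold
```

Without shuffling, each model trains only on the class it will never be tested on, so the resulting scores say nothing useful about the model.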

When Your Data is Lopsided: Stratified k-Fold

What if you’re working on a classification problem and one class is ridiculously rare? A random k-fold might by chance put all 10 examples of “users who actually clicked this ad” into a single fold. When that fold is the test set, the model has been trained on data with zero positive examples, so its score will be terrible. In every other iteration, the test fold contains no positives at all, so metrics like recall can’t even be computed meaningfully. Not ideal.

Stratified k-Fold is here to save the day. It preserves the percentage of samples for each class in every fold. So if 10% of your data is class ‘1’, each fold will have approximately 10% class ‘1’. It’s k-Fold with manners.

In scikit-learn, you just swap the splitter. It’s a one-line change that gives you far more trustworthy estimates on imbalanced data.

from sklearn.model_selection import StratifiedKFold

# Use StratifiedKFold instead
stratified_kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

stratified_scores = cross_val_score(model, X, y, cv=stratified_kf, scoring='accuracy')

print(f"Stratified CV Accuracy: {stratified_scores.mean():.4f} (+/- {stratified_scores.std() * 2:.4f})")

Rule of Thumb: For classification problems, you almost always want StratifiedKFold. It’s a best practice. Use plain KFold mostly for regression problems.
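To check the stratification promise for yourself, here is a minimal sketch that counts the rare class in every test fold. The 10-in-100 label vector and the `_imb` names are assumptions chosen to mirror the 10% example in the text.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 10 positives out of 100 samples (10%)
y_imb = np.array([1] * 10 + [0] * 90)
X_imb = np.zeros((100, 1))  # placeholder features; only the labels drive the split

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
positives_per_fold = [
    int(y_imb[test_idx].sum()) for _, test_idx in skf_demo.split(X_imb, y_imb)
]
print(positives_per_fold)
```

With 10 positives spread over 5 folds, each 20-sample test fold should hold 2 of them, which is the same 10% rate as the full dataset.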

When Time is of the Essence: Time Series Cross-Validation

Here’s where we throw the previous rules out the window. Data with a time component (stock prices, daily temperatures, website traffic) has a cardinal rule: the future cannot influence the past. Random shuffling is strictly forbidden. If you shuffle, you’re leaking future information into your training set, making your model look like a prophet when it’s just a cheater.

Time Series CV respects causality. You walk forward in time.

1. You start with a small initial training set.
2. You train on that data and test on the next chunk.
3. You then add that test chunk to your training set and repeat.

This isn’t just a different split; it’s a fundamentally different philosophy. You’re simulating how you’d actually use the model in production: train on everything you know up to now, predict tomorrow.
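The walk-forward steps above can be sketched in plain Python. The generator `walk_forward_splits` is a hypothetical helper, and its chunk-sizing rule is just one reasonable choice; it assumes the samples are already sorted by time.

```python
def walk_forward_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) for an expanding-window walk forward.

    Hypothetical helper for illustration; assumes rows are in time order.
    """
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("Not enough samples for the requested number of splits")
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train = list(range(0, test_start))                      # everything so far
        test = list(range(test_start, test_start + test_size))  # the next chunk
        yield train, test

for train, test in walk_forward_splits(n_samples=8, n_splits=3):
    print(f"train={train} test={test}")
# train=[0, 1] test=[2, 3]
# train=[0, 1, 2, 3] test=[4, 5]
# train=[0, 1, 2, 3, 4, 5] test=[6, 7]
```

Every test chunk starts right where its training window ends, so no future row ever sneaks into training.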

scikit-learn has a TimeSeriesSplit for this. Notice how the test index is always after the training index.

from sklearn.model_selection import TimeSeriesSplit
import numpy as np

# Create a simple time-ordered dataset
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 1, 2, 1, 2, 1, 2])

tscv = TimeSeriesSplit(n_splits=3)

for train_index, test_index in tscv.split(X):
    print(f"TRAIN: {train_index} TEST: {test_index}")
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

# Output:
# TRAIN: [0 1] TEST: [2 3]
# TRAIN: [0 1 2 3] TEST: [4 5]
# TRAIN: [0 1 2 3 4 5] TEST: [6 7]

The big gotcha? Your model’s performance can decay over time if the underlying patterns are changing (a concept called “concept drift”). The scores from your later folds might be much worse than the early ones. This isn’t a bug; it’s a feature! It’s your model telling you, “Hey, the world is changing, you might want to retrain me.”
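One way to watch for this is to print the score of each fold in time order instead of only the mean. The sketch below fabricates a drifting series (the slope flips sign halfway through) purely to make the effect visible; the data, the Ridge model, and the variable names are all assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)

# Hypothetical drifting series: the relationship between x and y changes midway
n = 400
x_ts = rng.normal(size=(n, 1))
slope = np.where(np.arange(n) < n // 2, 2.0, -1.0)
y_ts = slope * x_ts[:, 0] + rng.normal(scale=0.1, size=n)

tscv_demo = TimeSeriesSplit(n_splits=4)
fold_r2 = cross_val_score(Ridge(), x_ts, y_ts, cv=tscv_demo, scoring="r2")

# Scores come back in chronological fold order, so a downward trend is easy to spot
for i, score in enumerate(fold_r2, start=1):
    print(f"fold {i}: R^2 = {score:.3f}")
```

In this toy setup the first fold trains and tests entirely before the change and scores well, while the folds that test after the change score much worse. On real data the signal is rarely this clean, and later folds also train on more data, so treat a trend as a prompt to investigate rather than proof of drift.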

So, which one do you use? It’s not really a choice. Let your data decide: shuffled StratifiedKFold for classification, shuffled KFold for regression, and TimeSeriesSplit for anything that has a time order. Master these three, and you’ve got the vast majority of everyday use cases covered. Now go forth and validate properly.
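If it helps to see that decision rule written down, here is a tiny sketch. `choose_splitter` is a hypothetical helper invented for this section, not a scikit-learn function, and its defaults (5 splits, seed 42) are arbitrary.

```python
from sklearn.model_selection import KFold, StratifiedKFold, TimeSeriesSplit

def choose_splitter(task, time_ordered=False, n_splits=5, random_state=42):
    """Pick a CV splitter from the rule of thumb in this section.

    Hypothetical helper; `task` is either "classification" or "regression".
    """
    if time_ordered:
        # Time order wins over everything else: never shuffle
        return TimeSeriesSplit(n_splits=n_splits)
    if task == "classification":
        return StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    if task == "regression":
        return KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    raise ValueError(f"Unknown task: {task!r}")

print(choose_splitter("classification"))
print(choose_splitter("regression"))
print(choose_splitter("regression", time_ordered=True))
```

The returned object can be passed straight to the cv argument of cross_val_score.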