4.3 Handling Missing Values: Imputation Strategies
Alright, let’s get our hands dirty. Missing data isn’t an if, it’s a when. You’ll find NaN, None, NA, or just a suspicious-looking empty string staring back at you from the dataset, and your first instinct might be to just drop those rows. Resist it. That’s the data science equivalent of throwing away a puzzle because a single piece is missing. It’s lazy, and it can introduce serious bias into your model. Your model learns from the data you give it, and if you’ve systematically removed every record where, say, income was missing (which might correlate with a certain demographic), congratulations, you’ve just built a biased model. So we’re going to impute, which is a fancy word for “make an educated guess.”
The golden rule of imputation: your goal isn’t to be right. You can’t know the true value. Your goal is to be less wrong in a way that doesn’t screw up your downstream analysis.
The Simplest Thing That Could Possibly Work: Mean/Median/Mode
Let’s start with the classics. For numerical data, you fill missing values with the mean or median. For categorical data, you use the mode (the most frequent value).
Why you’d do it: It’s fast, simple, and requires no complex machinery. It’s a decent baseline.
Why it’s often a terrible idea: It drastically underestimates the variance in your data. You’re artificially creating a bunch of identical values right at the center of your distribution, which makes your data look less spread out than it actually is. It also completely ignores relationships between variables. If height is missing for a person, wouldn’t weight and gender be better clues than the average height of the entire population?
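The categorical case mentioned above is a one-liner in pandas. Here is a minimal sketch, assuming a made-up Color column that is not part of this section’s running example:

```python
import pandas as pd

# Hypothetical categorical column, invented purely for illustration
colors = pd.Series(["red", "blue", None, "red", None, "green"], name="Color")

# mode() ignores missing values and returns a Series (ties are possible),
# so take the first entry as the fill value
most_frequent = colors.mode().iloc[0]
colors_imputed = colors.fillna(most_frequent)

print(colors_imputed.tolist())
```

If several categories tie for most frequent, mode() returns all of them in sorted order, so iloc[0] quietly picks one. That is worth knowing before you trust the result.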
import pandas as pd
import numpy as np
# Create a sample dataframe with some obvious missingness
df = pd.DataFrame({
'Age': [25, 28, np.nan, 35, 22, np.nan, 40],
'Salary': [50000, 54000, 52000, np.nan, 48000, np.nan, 70000]
})
print("Original DataFrame:")
print(df)
# Impute numerical columns with their median (more robust than mean)
df_imputed_simple = df.copy()
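# Assign the result back; fillna(inplace=True) on a selected column is
# deprecated in recent pandas and may not modify the DataFrame
df_imputed_simple['Age'] = df_imputed_simple['Age'].fillna(df['Age'].median())
df_imputed_simple['Salary'] = df_imputed_simple['Salary'].fillna(df['Salary'].median())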
print("\nAfter Simple Median Imputation:")
print(df_imputed_simple)
See how the two missing Salary values became identical? That’s the problem. Use this strategy only as a first step or when you have absolutely no other information to go on.
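To put a number on the variance problem, here is a small sketch that compares the standard deviation of the observed Salary values with the standard deviation after median filling. It reuses the Salary values from the example above; the variable names are just illustrative.

```python
import numpy as np
import pandas as pd

salary = pd.Series(
    [50000, 54000, 52000, np.nan, 48000, np.nan, 70000], name="Salary"
)

# pandas skips NaN by default, so this is the spread of the 5 observed values
std_observed = salary.std()

# Fill both gaps with the median, then measure the spread again
salary_filled = salary.fillna(salary.median())
std_filled = salary_filled.std()

print(f"std of observed values:   {std_observed:.0f}")
print(f"std after median filling: {std_filled:.0f}")
```

On this toy column the standard deviation drops from roughly 8,800 to roughly 7,300, and only two of the seven values were filled in. The more gaps you fill this way, the more the distribution gets squeezed toward its center.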
Getting Smarter: K-Nearest Neighbors (KNN) Imputation
This is where we start using the relationships in the data. The idea is brilliant in its simplicity: for a row with a missing value, find the k other rows that are most similar to it (its “neighbors”), and then take the mean or mode of those neighbors’ values for the missing feature.
Why it’s better: It preserves relationships. The imputed value for a tall, heavy person’s missing weight will be based on other tall, heavy people, not the entire dataset.
The catch: It’s computationally expensive. For every row with a gap, you have to compute distances to the other rows, so the cost grows roughly with the square of the number of rows. It also requires you to scale your data first (because a distance calculation cares if one feature is in the 1000s and another is between 0 and 1), and it gets painfully slow on very large datasets.
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
# Scale the data first! Crucial step.
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
# Initialize the imputer. n_neighbors=2 means use the 2 most similar rows.
imputer = KNNImputer(n_neighbors=2)
df_imputed_knn = imputer.fit_transform(df_scaled)
# Transform back to the original scale
df_imputed_knn = scaler.inverse_transform(df_imputed_knn)
df_imputed_knn = pd.DataFrame(df_imputed_knn, columns=df.columns)
print("\nAfter KNN Imputation (n_neighbors=2):")
print(df_imputed_knn.round(1))
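Notice the imputed values are different from the median now. They’re based on the specific rows that were closest. Much smarter. One caveat: row 5 is missing both Age and Salary, so there is nothing to measure its distance with, and KNNImputer falls back to the column mean for that row.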
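If you want to convince yourself what the imputer is doing, you can reproduce one value by hand. The sketch below works out the missing Age in row 2 using plain pandas and compares it with KNNImputer. It skips the scaling step as a simplification: with only one shared feature (Salary) to measure distance on, scaling does not change which rows are nearest.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "Age": [25, 28, np.nan, 35, 22, np.nan, 40],
    "Salary": [50000, 54000, 52000, np.nan, 48000, np.nan, 70000],
})

# Row 2 is missing Age, so Salary is the only feature available for similarity
target_salary = df.loc[2, "Salary"]

# Usable donors need an observed Age (to donate) and Salary (to compare on)
donors = df.dropna(subset=["Age", "Salary"])
gaps = (donors["Salary"] - target_salary).abs()
nearest = gaps.nsmallest(2).index          # the 2 most similar rows
age_by_hand = donors.loc[nearest, "Age"].mean()

# The same cell as filled in by KNNImputer (column 0 is Age)
age_by_sklearn = KNNImputer(n_neighbors=2).fit_transform(df)[2, 0]

print(age_by_hand, age_by_sklearn)
```

Rows 0 and 1 have the salaries closest to 52,000, so the imputed Age is the average of 25 and 28. With more features, the distance combines all of the columns the two rows share, which is exactly where scaling starts to matter.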
The Heavy Artillery: Multivariate Imputation by Chained Equations (MICE)
If KNN is smart, MICE is brilliant. It models each feature with missing values as a function of all the other features. It’s an iterative process: it makes an initial guess (like using the mean), then it cycles through each feature, refining its predictions each time based on the newly imputed values from the last round.
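Why it’s often the best: Of the three approaches here, it does the best job of capturing the multivariate relationships in your data. When the imputed values are sampled from each model’s predictive distribution rather than taken as point predictions, it can also recreate a realistic variance for the missing values. Each imputed value comes from a model built on the other features.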
The serious catch: It’s complex and computationally very expensive. It’s also a black box; you’re relying on the convergence of these chained models.
from sklearn.experimental import enable_iterative_imputer # Required import
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
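# IterativeImputer is scikit-learn's MICE-inspired imputer; with its defaults
# it returns a single imputed dataset rather than multiple imputations
# BayesianRidge is already the default estimator; passing it explicitly just makes that visible
imputer = IterativeImputer(estimator=BayesianRidge(), random_state=42, max_iter=10)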
df_imputed_mice = imputer.fit_transform(df)
df_imputed_mice = pd.DataFrame(df_imputed_mice, columns=df.columns)
print("\nAfter MICE Imputation:")
print(df_imputed_mice.round(1))
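With its default settings, IterativeImputer fills each gap with the model’s point prediction, so you get one deterministic answer per cell. Classical MICE instead produces several completed datasets so you can see how uncertain each guess is. One way to approximate that, sketched below, is to set sample_posterior=True and rerun with different seeds. The choice of five runs and of inspecting row 3’s Salary is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    "Age": [25, 28, np.nan, 35, 22, np.nan, 40],
    "Salary": [50000, 54000, 52000, np.nan, 48000, np.nan, 70000],
})

draws = []
for seed in range(5):
    # sample_posterior=True draws each imputed value from the estimator's
    # predictive distribution instead of using its point prediction
    imputer = IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    completed = imputer.fit_transform(df)
    draws.append(completed[3, 1])  # the imputed Salary for row 3

print(np.round(draws))
```

The spread across runs is a rough picture of how unsure the model is about that one cell. On a seven-row toy dataset the spread will be wide, and you may see convergence warnings. That is expected here, not a bug.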
The Non-Answer: Adding an Indicator Column
Here’s a pro move that most beginners miss. However you impute, you should almost always add a new binary column for each column you impute: Age_Was_Missing, Salary_Was_Missing, etc.
Why this is a game-changer: It tells your model, “Hey, the original value here was unknown.” This is a huge piece of information! The fact that data is missing is rarely random; it’s often systematic. Maybe people with very high salaries are less likely to report them. Your model can learn to treat imputed values differently if it knows they were imputed. It’s like giving your model a clue about the underlying uncertainty.
df_with_indicators = df.copy()
for col in df.columns:
    # Create a new indicator column before imputation
    df_with_indicators[col + '_Was_Missing'] = df[col].isna().astype(int)
# Now perform your imputation of choice on the original columns
imputer = IterativeImputer(random_state=42)
df_imputed = imputer.fit_transform(df_with_indicators[['Age', 'Salary']])
df_with_indicators[['Age', 'Salary']] = df_imputed
print("\nDataFrame with Missing Indicator Columns:")
print(df_with_indicators.round(1))
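scikit-learn can also build the indicator columns for you: its imputers accept add_indicator=True, which appends one flag column for each feature that had missing values during fit. Here is a minimal sketch using SimpleImputer with the median. The _Was_Missing column names are assigned by hand to match the naming used above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "Age": [25, 28, np.nan, 35, 22, np.nan, 40],
    "Salary": [50000, 54000, 52000, np.nan, 48000, np.nan, 70000],
})

imputer = SimpleImputer(strategy="median", add_indicator=True)
filled = imputer.fit_transform(df)

# Output layout: the imputed features first, then one indicator column for
# each feature that contained missing values when the imputer was fitted
flagged = df.columns[imputer.indicator_.features_]
names = list(df.columns) + [f"{col}_Was_Missing" for col in flagged]
df_flagged = pd.DataFrame(filled, columns=names)

print(df_flagged)
```

The advantage over the manual loop is that the flags travel with the fitted imputer, so the same columns get produced when you later call transform on new data.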
There is no single “best” method. Your choice depends on your data size, the nature of the missingness (is it random or not?), and your tolerance for complexity. Always try a few methods and see how they impact your model’s performance. And for the love of all that is holy, never impute your training and test data together. Fit the imputer on the training set only, then use that fitted imputer to transform the test set. Otherwise, information from the test set leaks into training. It’s one of the most common pitfalls, and it will invalidate your results. You’ve been warned.
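Here is what the fit-on-train, transform-on-test rule looks like in code. This is a minimal sketch on the toy data from this section, with a hand-picked split (first five rows for training, last two for testing) so the numbers are easy to follow. In real work you would split with something like train_test_split.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "Age": [25, 28, np.nan, 35, 22, np.nan, 40],
    "Salary": [50000, 54000, 52000, np.nan, 48000, np.nan, 70000],
})

# Illustrative split only: rows 0-4 for training, rows 5-6 for testing
train, test = df.iloc[:5], df.iloc[5:]

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train)  # medians learned from train only
test_filled = imputer.transform(test)        # reuse them; never refit on test

print("Medians learned from train:", imputer.statistics_)
print(test_filled)
```

The fully missing test row gets filled with the training medians (26.5 and 51,000), not with the full-data medians (28 and 52,000). The 70,000 salary sitting in the test set never influenced the fill values, which is the whole point.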