79.2 Preprocessing: Scalers, Encoders, and Imputers
Right, let’s get your data ready for the machine learning party. Think of this as the part where we stop our algorithms from throwing a tantrum because you fed them numbers in the wrong format. Most machine learning models are, to put it bluntly, a bit stupid and incredibly fussy. They expect all their input features to be on the same scale, in purely numerical form, and without any pesky missing values. If you don’t do this prep work, a model like a Support Vector Machine or k-Nearest Neighbors will treat a salary feature in the tens of thousands as vastly more important than an age feature under 100, not because it is, but purely because the numbers are bigger. It’s our job to fix that.
The Absolute Necessity of Scaling
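Here’s the deal: unless you’re using a tree-based model (like a Random Forest or Gradient Boosting), you should almost always scale your data. Algorithms that rely on calculating distances (like k-NN or SVMs) or on regularized or gradient-fitted coefficients (like Ridge, Lasso, or Logistic Regression) are pathologically sensitive to the scale of your features. Plain least-squares Linear Regression is the odd one out: rescaling changes its coefficients but not its predictions.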
Let’s say you have two features: distance_to_office (in meters, ranging from 100 to 10,000) and number_of_coffees (ranging from 0 to 5). The distance feature’s values are orders of magnitude larger. To a distance-based algorithm, a difference of 2 coffees will seem trivial compared to a difference of 100 meters, even though 2 coffees is a big chunk of that feature’s range and 100 meters is a rounding error in the other. It will effectively ignore the number_of_coffees feature, which is almost certainly not what you want.
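To see which feature wins the shouting match, here is a minimal sketch using plain Euclidean distance. The three employees and their values are made up purely for illustration.

```python
import numpy as np

# Hypothetical employees: [distance_to_office in meters, number_of_coffees]
a = np.array([1000.0, 0.0])
b = np.array([1100.0, 0.0])  # 100 m farther away, same coffee habit
c = np.array([1000.0, 5.0])  # same distance, opposite end of the coffee range

dist_ab = np.linalg.norm(a - b)  # driven entirely by the 100 m gap
dist_ac = np.linalg.norm(a - c)  # driven entirely by the 5-coffee gap

print("a vs b:", dist_ab)
print("a vs c:", dist_ac)
```

On the raw numbers, a looks twenty times closer to c than to b, even though c sits at the far end of the coffee range and b differs by about 1% of the distance range.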
We fix this by putting all features on a common scale. The two heavy hitters are StandardScaler and MinMaxScaler.
StandardScaler: This is my go-to 90% of the time. It transforms each feature so that it has a mean of 0 and a standard deviation of 1. It’s excellent for when your data is roughly normally distributed (or you have no idea what the distribution is). It isn’t immune to outliers, though: the mean and standard deviation are both pulled by extreme values, so extreme outliers will still exert a lot of influence.
MinMaxScaler: This one squishes all your data into a specified range, [0, 1] by default. It’s great if you need bounded values (like for the input of some neural networks) or if you have a non-normal distribution. The catch? It’s extremely sensitive to outliers. A single huge outlier will compress all the other data into a tiny little ball.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Some fake, terribly-scaled data
X_train = np.array([[1000, 2],
                    [2000, 5],
                    [300, 1],
                    [4500, 4]])
# Standard Scaling
scaler_standard = StandardScaler()
X_scaled_standard = scaler_standard.fit_transform(X_train)
print("StandardScaler result:\n", X_scaled_standard)
# Notice the values are now centered around 0.
# MinMax Scaling
scaler_minmax = MinMaxScaler()
X_scaled_minmax = scaler_minmax.fit_transform(X_train)
print("\nMinMaxScaler result:\n", X_scaled_minmax)
# Everything is neatly between 0 and 1.
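To make the outlier problem concrete, here is a small sketch with one feature and a single extreme value. The numbers are invented for illustration; the point is only to compare how much room the ordinary values get after scaling.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One feature: four ordinary values plus one extreme outlier (made-up numbers)
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

x_minmax = MinMaxScaler().fit_transform(x)
x_standard = StandardScaler().fit_transform(x)

# Spread of the four ordinary values after each scaling
spread_minmax = x_minmax[:4].max() - x_minmax[:4].min()
spread_standard = x_standard[:4].max() - x_standard[:4].min()

print("MinMaxScaler:  ", x_minmax.ravel())
print("StandardScaler:", x_standard.ravel())
print("Spread of ordinary values:", spread_minmax, spread_standard)
```

With both scalers the four ordinary values end up crammed into a sliver of the output range, because the outlier defines the maximum for MinMaxScaler and inflates the mean and standard deviation for StandardScaler.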
Crucial Best Practice: You must fit the scaler only on your training data, then use it to transform both the training and test data. If you fit on the entire dataset, you’re peeking at the test set and leaking information about its distribution, which is cheating and will give you wildly over-optimistic results. This rule applies to almost all transformers in scikit-learn.
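As a minimal sketch of that rule, the snippet below fits a StandardScaler on the training rows only and then reuses the learned statistics on a held-out set. The X_test rows are hypothetical values added for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1000, 2], [2000, 5], [300, 1], [4500, 4]], dtype=float)
X_test = np.array([[1500, 3], [6000, 0]], dtype=float)  # hypothetical held-out rows

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean_ and scale_ from training data only
X_test_scaled = scaler.transform(X_test)        # reuses those statistics, no refitting

print("Means learned from training data:", scaler.mean_)
print("Scaled test set:\n", X_test_scaled)
```

The scaled test values need not have mean 0 or land in any particular range. That is expected: they are measured against the training distribution, which is exactly what the model saw.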
Taming Categorical Variables with Encoders
Machines love numbers. They don’t know what to do with “red”, “blue”, or “green”. Encoding is the process of turning these categories into numbers, but you have to do it smartly. Dumping arbitrary numbers (e.g., red=1, blue=2, green=3) is usually a terrible idea because it implies an order that doesn’t exist (is red less than blue?).
Enter OneHotEncoder. This is the gold standard for nominal data (categories with no inherent order). It creates a new binary feature for each possible category. For the color “red”, it turns off all the other color switches and flicks the “is_red” switch to 1.
from sklearn.preprocessing import OneHotEncoder
# Data with categorical features
X_train_cat = [['red'], ['blue'], ['green'], ['blue'], ['red']]
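# OneHotEncoder: handle_unknown='ignore' is a lifesaver for unseen categories in the test set.
# sparse_output=False returns a dense array (scikit-learn 1.2+; older versions call this parameter `sparse`).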
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
X_encoded = encoder.fit_transform(X_train_cat)
print("Encoded categories:\n", X_encoded)
print("Feature names:", encoder.get_feature_names_out())
# What if our test set has a new category?
X_test = [['red'], ['purple']] # 'purple' wasn't in training!
X_test_encoded = encoder.transform(X_test)
print("\nTest set encoded (note purple is all zeros):\n", X_test_encoded)
For ordinal data (categories with a clear order, like “low”, “medium”, “high”), you’d use OrdinalEncoder and spell out the correct order yourself through its categories parameter; left to its own devices it sorts the categories alphabetically, which is rarely the order you meant. OneHotEncoder would also work, but it loses the ordering information.
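Here is a short sketch of OrdinalEncoder with an explicit category order. The variable names and the sample values are just for illustration.

```python
from sklearn.preprocessing import OrdinalEncoder

# Ordinal feature with a clear order: low < medium < high
X_levels = [['low'], ['high'], ['medium'], ['low']]

# One list of categories per column, given in the order we want them ranked
ordinal_encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
X_levels_encoded = ordinal_encoder.fit_transform(X_levels)

print(X_levels_encoded.ravel())  # [0. 2. 1. 0.]
```

Without the categories argument the encoder would rank them high, low, medium, which preserves nothing useful.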
The Art of Dealing with Missing Data
Real-world data is messy. Values are missing. You can’t just drop rows with missing values willy-nilly; you might decimate your dataset. Imputation is the fancy word for making an educated guess about what those missing values should be.
The SimpleImputer is your first port of call. Its strategies are straightforward:
mean: The average value. Good for numerical data without extreme outliers.
median: The middle value. My preferred choice for numerical data as it’s robust to outliers.
most_frequent: The mode. The go-to for categorical data.
constant: Fill with a fixed value you specify.
from sklearn.impute import SimpleImputer
# Data with some very strategic missing values (NaNs)
X_train_with_nans = np.array([[5, 2],
                              [np.nan, 3],
                              [7, np.nan],
                              [9, 6]])
# Impute with the median for each feature
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X_train_with_nans)
print("Original data with NaNs:\n", X_train_with_nans)
print("\nImputed data (median):\n", X_imputed)
print("The medians learned for each column:", imputer.statistics_)
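The same imputer handles categorical columns too. Below is a small sketch of the most_frequent and constant strategies on a made-up color column; the fill value 'missing' is an arbitrary label chosen for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A categorical column with one missing entry (object dtype so strings and NaN can coexist)
colors = np.array([['red'], ['blue'], [np.nan], ['blue']], dtype=object)

# Fill the gap with the mode of the column
mode_imputer = SimpleImputer(strategy='most_frequent')
colors_mode = mode_imputer.fit_transform(colors)

# Or flag it explicitly with a placeholder category
constant_imputer = SimpleImputer(strategy='constant', fill_value='missing')
colors_constant = constant_imputer.fit_transform(colors)

print("most_frequent:", colors_mode.ravel())
print("constant:     ", colors_constant.ravel())
```

The constant strategy is worth considering when the fact that a value is missing might itself carry information, since it keeps those rows distinguishable after encoding.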
The Big Pitfall: Just like with scaling, you fit the imputer on the training data and use those learned values (e.g., the median) to transform the test set. Never calculate the median of the test set to fill in the test set! You’re not allowed to know anything about the test set, remember?
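A minimal sketch of that workflow follows. The training array matches the earlier example; the X_test rows are hypothetical and deliberately sit far from the training values so you can see where the fill values come from.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[5, 2], [np.nan, 3], [7, np.nan], [9, 6]])
X_test = np.array([[np.nan, 10], [100, np.nan]])  # hypothetical held-out rows

imputer = SimpleImputer(strategy='median')
imputer.fit(X_train)                        # medians come from the training data only
X_test_imputed = imputer.transform(X_test)  # test gaps are filled with those training medians

print("Training medians:", imputer.statistics_)  # [7. 3.]
print("Imputed test set:\n", X_test_imputed)
```

The missing test values become 7 and 3, the training medians, not anything computed from the test rows themselves.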
Putting It All Together: The Power of a Pipeline
You’re probably thinking, “This is a lot of steps. I have to scale, impute, and encode, and I have to remember to fit on train and transform on test for each one.” You are absolutely right. This is a recipe for a bug-filled nightmare.
Scikit-learn’s Pipeline is your salvation. It chains all these steps together into a single, tidy object that you can fit and predict with, and its sidekick ColumnTransformer routes each column to the right chain of steps. As long as you fit it on the training data only, it handles all the fit/transform logic correctly, preventing data leakage and making your code incredibly robust and clean.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import pandas as pd
# Let's use a DataFrame because it's more realistic
df = pd.DataFrame({
    'salary': [50000, 80000, np.nan, 120000],
    'age': [25, 40, 30, 25],
    'color': ['red', 'blue', np.nan, 'blue']
})
# Separate features and target (if you had one)
X = df
# Define which preprocessor goes to which column
preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())]), ['salary', 'age']),
        ('cat', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('encoder', OneHotEncoder(handle_unknown='ignore'))]), ['color'])
    ])
# Fit and transform in one call. With a real train/test split, call
# fit_transform on the training data and transform on the test data.
X_processed = preprocessor.fit_transform(X)
print("Fully processed feature matrix:\n", X_processed)
This pipeline is a thing of beauty. You can now feed raw, messy data in and get perfectly preprocessed, model-ready data out. This is the professional way to do it. Anything less is just playing with fire.
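To round this off, here is a sketch of the same preprocessor chained to a model and used with a train/test split. Everything beyond the preprocessor is an assumption made for illustration: the extra rows, the 'bought' target column, and the choice of LogisticRegression are all invented, and eight rows is far too few to learn anything real.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data; the 'bought' target is purely illustrative
df = pd.DataFrame({
    'salary': [50000, 80000, np.nan, 120000, 45000, 95000, 60000, np.nan],
    'age': [25, 40, 30, 25, 52, 38, 29, 47],
    'color': ['red', 'blue', np.nan, 'blue', 'green', 'red', np.nan, 'blue'],
    'bought': [0, 1, 0, 1, 0, 1, 0, 1],
})
X = df.drop(columns='bought')
y = df['bought']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())]), ['salary', 'age']),
    ('cat', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore'))]), ['color']),
])

# Preprocessing and model live in one object, so nothing is ever fit on test data
model = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('classifier', LogisticRegression()),
])

model.fit(X_train, y_train)          # imputers, scaler, encoder, and classifier all fit on train
predictions = model.predict(X_test)  # test rows are only ever transformed, then predicted
print("Predictions:", predictions)
```

Because the whole thing is a single estimator, it should also slot into tools like cross_val_score, which refit the preprocessing inside every fold for you.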