12.2 Encoding Categorical Variables: One-Hot, Ordinal, Target Encoding

Alright, let’s talk about turning your messy, non-numeric categories into something a model can actually digest. Most machine learning algorithms are, at their heart, just glorified calculators. They love numbers. They dream in matrices. They have no idea what to do with a “red,” “blue,” or “green.” Our job is to translate that categorical gibberish into a numerical dialect they understand, and we’ve got three primary methods for that: one-hot, ordinal, and target encoding. Choose wisely, because the encoding you pick can noticeably help or hurt a model, especially when a feature has many categories.

The Trusty Workhorse: One-Hot Encoding

The most straightforward way to encode categories is to just ask a series of yes/no questions. Is this sample “blue”? Is it “green”? Is it “red”? That’s one-hot encoding (OHE). It creates a new binary feature for each unique category in your original variable.

Why it works: It makes no assumptions about the relationships between your categories. “Red” isn’t higher or lower than “blue”; it’s just different. OHE represents this perfectly, preventing a model from learning a false ordinal relationship (e.g., that red = 1 and blue = 2 means blue is “greater than” red).

The main pitfall? Cardinality. If you have a categorical feature with 1000 unique values, OHE will spit out 1000 new columns. This can explode your dataset’s dimensionality, making it sparse and potentially causing memory issues or sending your model down a nasty overfitting rabbit hole. Use it for features with a relatively low number of categories (<15 is a decent rule of thumb).

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample data with a low-cardinality feature
data = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue', 'red']})

# Initialize and fit the encoder
# (sparse_output requires scikit-learn 1.2+; older releases call it sparse)
encoder = OneHotEncoder(sparse_output=False, drop=None)  # We'll talk about 'drop' in a second
encoded_array = encoder.fit_transform(data[['color']])

# Get the feature names for the new columns and create a DataFrame
feature_names = encoder.get_feature_names_out(['color'])
encoded_df = pd.DataFrame(encoded_array, columns=feature_names)

print(encoded_df)

This outputs:

   color_blue  color_green  color_red
0         0.0          0.0        1.0
1         1.0          0.0        0.0
2         0.0          1.0        0.0
3         1.0          0.0        0.0
4         0.0          0.0        1.0
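To make the cardinality warning above concrete, here is a small sketch. The user_id column and its 1000 values are made up for illustration; the point is only how the column count scales with the number of unique categories, and how much of the result is zeros.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical high-cardinality feature: 1000 distinct IDs, 5 rows each
user_ids = pd.DataFrame({'user_id': [f'user_{i}' for i in range(1000)] * 5})

# Left at its defaults, OneHotEncoder returns a SciPy sparse matrix
sparse_encoder = OneHotEncoder()
encoded_sparse = sparse_encoder.fit_transform(user_ids[['user_id']])

n_rows, n_cols = encoded_sparse.shape
print(f"rows: {n_rows}, columns: {n_cols}")
print(f"non-zero entries: {encoded_sparse.nnz}")  # one per row
print(f"cells if converted to dense: {n_rows * n_cols}")
```

One column per category means 1000 columns here, with a single non-zero entry in each row. Keeping the sparse representation softens the memory cost, but the model still has to cope with 1000 mostly empty features.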

Watch out for the dummy variable trap (a perfect multicollinearity issue). With k categories, the k one-hot columns always sum to 1, so any one of them is fully determined by the others, and k-1 columns carry the same information. The drop parameter in scikit-learn handles this. Using drop='first' drops the first category (in sorted order), which is often a good idea for unregularized linear models. Tree-based models? They mostly don’t care.

When Order Actually Matters: Ordinal Encoding

Sometimes, your categories do have a natural order. Think “small,” “medium,” “large” or “low,” “medium,” “high.” Here, slapping on OHE would be throwing away valuable information. For this, we use ordinal encoding, which maps the categories to integers respecting their order.

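Why it works: It preserves the order relationship, which is exactly the information you want the model to learn. A model can now understand that “large” > “medium” > “small.” Keep in mind that the integers also imply equal spacing between levels. Linear models take that spacing literally, while tree-based models only use the order.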

The pitfall? Don’t just map to 0, 1, 2 arbitrarily. You must manually define the order. Letting a library infer it alphabetically is a one-way ticket to nonsense town (“large”=0, “medium”=1, “small”=2). You have to be the brains of the operation here.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Sample data with an ordinal feature
data = pd.DataFrame({'size': ['small', 'large', 'medium', 'small', 'large']})

# YOU define the order. This is non-negotiable.
size_categories = [['small', 'medium', 'large']]

encoder = OrdinalEncoder(categories=size_categories)
# fit_transform returns a 2D array of shape (n_rows, 1); flatten it for a single column
data['size_encoded'] = encoder.fit_transform(data[['size']]).ravel()

print(data)

This outputs:

     size  size_encoded
0   small           0.0
1   large           2.0
2  medium           1.0
3   small           0.0
4   large           2.0

Perfect. The model now knows the intended hierarchy.
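For contrast, here is a quick sketch of the “nonsense town” scenario described above: the same size data, but with the encoder left to infer the category order on its own.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({'size': ['small', 'large', 'medium', 'small', 'large']})

# No categories argument: the encoder sorts the labels it sees
naive_encoder = OrdinalEncoder()
naive_codes = naive_encoder.fit_transform(sizes[['size']]).ravel()

print(naive_encoder.categories_[0])  # the order the encoder inferred
print(naive_codes)
```

The inferred order is large, medium, small, so “small” ends up with the highest code and “large” with the lowest. Nothing crashes and nothing warns you, which is exactly why this mistake is easy to ship.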

The Powerful but Dangerous Shortcut: Target Encoding

Here’s where things get spicy. What if we encode a category by using the target variable itself? Specifically, we replace each category with the average value of the target for that category. For a regression problem, it’s the mean; for classification, it’s the probability of the positive class.
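The arithmetic behind this is nothing more than a group-by. The sketch below uses a toy city/target table and computes the means over every row, which is fine for seeing how the numbers come about but is deliberately naive. Do not build training features this way; the leakage discussion that follows explains why.

```python
import pandas as pd

cities = pd.DataFrame({
    'city': ['A', 'B', 'A', 'C', 'B', 'C', 'A', 'B'],
    'target': [10, 20, 12, 45, 22, 55, 15, 23],
})

# Mean target per category
city_means = cities.groupby('city')['target'].mean()
print(city_means)

# Replace each category with its mean
# (naive on purpose: every row's own target feeds into its feature)
cities['city_encoded'] = cities['city'].map(city_means)
print(cities)
```

City ‘C’ has targets 45 and 55, so both of its rows are encoded as 50.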

Why it works: It can be incredibly powerful. It directly captures a relationship between the category and what we’re trying to predict, often leading to faster convergence and better performance with tree-based models, especially on high-cardinality features where OHE would fail miserably.

Why it’s dangerous: It’s a massive leakage risk. You are using the target to create a feature. If you compute the encodings on the full dataset before splitting off validation or test data, target information from those held-out rows leaks into your features and makes your model look unrealistically good. You must calculate the encodings strictly from the training fold and then apply them to the validation/test fold. In scikit-learn, the usual way is to put the encoder inside a Pipeline and evaluate with cross_val_score, so the encoder is refit on the training portion of every fold.
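To show what “strictly from the training fold” means mechanically, here is a hand-rolled sketch of out-of-fold target encoding. The helper out_of_fold_target_encode is a hypothetical name written for this example, not a library function, and the fold count and seed are arbitrary choices.

```python
import pandas as pd
from sklearn.model_selection import KFold

cities = pd.DataFrame({
    'city': ['A', 'B', 'A', 'C', 'B', 'C', 'A', 'B'],
    'target': [10, 20, 12, 45, 22, 55, 15, 23],
})

def out_of_fold_target_encode(frame, column, target, n_splits=4, seed=0):
    """Encode each row using target means computed from the other folds only."""
    encoded = pd.Series(index=frame.index, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, apply_idx in folds.split(frame):
        fit_rows = frame.iloc[fit_idx]
        fold_means = fit_rows.groupby(column)[target].mean()
        # Fallback for categories that do not appear in fit_rows
        fallback = fit_rows[target].mean()
        mapped = frame.iloc[apply_idx][column].map(fold_means).fillna(fallback)
        encoded.iloc[apply_idx] = mapped.to_numpy()
    return encoded

cities['city_oof'] = out_of_fold_target_encode(cities, 'city', 'target')
print(cities)
```

No row’s own target contributes to its encoded value, so the ‘C’ rows never receive the in-sample mean of 50. Wrapping an encoder in a Pipeline gets you the same separation between folds during cross-validation without writing the loop yourself.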

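It also risks overfitting to rare categories. If a category only appears a few times, its average target value will be very noisy. The usual remedy is smoothing: blend each category’s mean with the overall mean, trusting the category mean more as the number of rows behind it grows.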
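One common way to tame noisy estimates for rare categories is additive smoothing, sketched below. The formula and the smoothing_weight value are assumptions chosen for illustration; encoder libraries implement their own variants, so do not expect these exact numbers from any particular package.

```python
import pandas as pd

# Toy training data: city 'A' is well represented, city 'B' appears once
train = pd.DataFrame({
    'city': ['A', 'A', 'A', 'A', 'A', 'A', 'B'],
    'target': [10, 12, 11, 13, 9, 11, 40],
})

smoothing_weight = 5  # hypothetical prior strength m; treat it as a tunable

global_mean = train['target'].mean()
stats = train.groupby('city')['target'].agg(['mean', 'count'])

# smoothed = (n * category_mean + m * global_mean) / (n + m)
stats['smoothed'] = (
    stats['count'] * stats['mean'] + smoothing_weight * global_mean
) / (stats['count'] + smoothing_weight)
print(stats)
```

City ‘B’, backed by a single row, is pulled most of the way from its raw mean of 40 toward the overall mean, while city ‘A’, with six rows, moves only a little.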

import pandas as pd
from sklearn.model_selection import train_test_split
from category_encoders import TargetEncoder  # third-party: pip install category_encoders

# Sample data
data = pd.DataFrame({
    'city': ['A', 'B', 'A', 'C', 'B', 'C', 'A', 'B'],  # Toy stand-in for a high-cardinality feature
    'target': [10, 20, 12, 45, 22, 55, 15, 23]
})

# Split FIRST to avoid data leakage
X_train, X_test, y_train, y_test = train_test_split(data[['city']], data['target'], test_size=0.25, random_state=42)

# Initialize and fit the encoder ON THE TRAINING DATA ONLY
encoder = TargetEncoder()
encoder.fit(X_train, y_train)

# Transform both training and test data
X_train_encoded = encoder.transform(X_train)
X_test_encoded = encoder.transform(X_test)

print("Training Data Encoded:")
print(X_train_encoded)
print("\nTest Data Encoded:")
print(X_test_encoded)

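Notice that the encoder only ever sees y_train. The test rows are encoded with per-city values learned from the training rows, so their own targets never touch the feature. Which cities land in the test split depends on the random seed. If a city shows up in the test set without appearing in training, category_encoders by default falls back to the overall training mean rather than inventing a per-city value. Expect the encoded numbers to differ somewhat from the raw per-city means as well, because the encoder shrinks them toward the overall mean. This is the correct, leak-free way to do it. It’s a fantastic tool, but it demands respect. Use it carefully, and always validate its performance against simpler methods.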