12.1 Domain-Driven Feature Creation
Alright, let’s get our hands dirty. You’ve got your raw data, and it’s… fine. It’s a start. But if you want your model to do more than mediocre guesswork, you need to feed it something better. That’s where domain-driven feature creation comes in. This isn’t about blindly applying one-hot encoding and calling it a day. This is the art of using your brain (your understanding of the problem space) to create features that scream the important patterns to your model. It’s often the single biggest lever you have to improve performance, and frankly, it’s where the real fun is.
The Alchemy of Date/Time Features
Your dataset has a timestamp column. The rookie move is to just drop it into the model as some giant integer. Don’t do that. The model doesn’t inherently know that 1640995200 (January 1, 2022) and 1641081600 (January 2, 2022) are consecutive days. It just sees two large, seemingly unrelated numbers. Your job is to perform the alchemy that turns that useless Unix epoch timestamp into a goldmine of signal.
Think like a human. If you’re predicting foot traffic for a bakery, what matters? Is it a weekend? What hour of the day is it? Is it a holiday? The model needs these concepts broken down explicitly.
import numpy as np
import pandas as pd
# Start with a useless big number
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s') # Make it a datetime object first! unit='s' matters: pandas reads bare integers as nanoseconds
# Now, let's get smart
df['hour_of_day'] = df['timestamp'].dt.hour # Mornings are different from nights
df['day_of_week'] = df['timestamp'].dt.dayofweek # Mondays are... Mondays
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int) # Weekend flag is pure gold
df['month'] = df['timestamp'].dt.month # December retail sales? Yeah, that's a thing.
df['is_holiday'] = ... # You'd define a custom function here for your region
# For cyclical features like hour or day, use sine/cosine to encode the circularity.
# This is a pro move that tells the model that hour 23 is close to hour 0, not 23 hours away.
df['hour_sin'] = np.sin(2 * np.pi * df['hour_of_day'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour_of_day'] / 24)
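If the sine/cosine trick feels like hand-waving, it’s easy to check. The sketch below is a minimal, standalone test of the idea; the helper names encode_hour and circle_distance are made up for illustration and are not part of pandas or NumPy. It measures how far apart two hours sit once they are mapped onto the circle.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour (0-23) to a point on the unit circle. Hypothetical helper."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

def circle_distance(h1, h2):
    """Euclidean distance between two hours in (sin, cos) space."""
    return float(np.linalg.norm(encode_hour(h1) - encode_hour(h2)))

late_to_midnight = circle_distance(23, 0)   # one hour apart on the clock
midnight_to_one = circle_distance(0, 1)     # also one hour apart
noon_to_midnight = circle_distance(12, 0)   # opposite sides of the clock

print(late_to_midnight, midnight_to_one, noon_to_midnight)
```

The 23-to-0 gap comes out the same as the 0-to-1 gap, and noon versus midnight is the largest distance possible. Note that you need both columns: sine alone can’t tell 1 a.m. from 11 a.m.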
Ratios and Aggregates: Seeing the World in Context
A raw number is often meaningless without context. A $100 purchase is a big deal for a coffee shop but a rounding error for a car dealership. You have to create that context.
# Pitfall: Using raw values
# df['purchase_amount'] # Meh.
# Better: Create ratios
df['price_per_square_foot'] = df['home_price'] / df['square_footage']
df['clickthrough_rate'] = df['clicks'] / (df['impressions'] + 1e-10) # Avoid division by zero. Always.
# Best: Create aggregates from other tables (if you have them)
# Example: Get a customer's average historical purchase amount
# (build 'orders' only from purchases made before the rows you're scoring, or the future leaks in)
customer_agg = orders.groupby('customer_id')['purchase_amount'].agg(['mean', 'max', 'count']).reset_index()
customer_agg.columns = ['customer_id', 'cust_avg_purchase', 'cust_max_purchase', 'cust_order_count']
# Then merge it back to your main dataframe
df = df.merge(customer_agg, on='customer_id', how='left')
# Now you can create powerful contextual features
df['purchase_vs_avg'] = df['purchase_amount'] / df['cust_avg_purchase'] # Is this purchase large FOR THIS CUSTOMER?
This last feature, purchase_vs_avg, is the kind of thing that separates adequate models from great ones. It’s not just what the number is; it’s what the number means.
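One wrinkle worth seeing on real rows: a left merge leaves brand-new customers with no average at all, and the ratio quietly becomes NaN. Here is a small sketch on invented toy data; the is_new_customer flag and the choice to leave the ratio missing are assumptions for illustration, not the only reasonable policy.

```python
import pandas as pd

# Hypothetical toy data: past orders, plus new purchases to score.
orders = pd.DataFrame({
    'customer_id': [1, 1, 2],
    'purchase_amount': [10.0, 30.0, 500.0],
})
df = pd.DataFrame({
    'customer_id': [1, 2, 3],  # customer 3 has no order history
    'purchase_amount': [100.0, 100.0, 100.0],
})

# Per-customer aggregates via named aggregation.
customer_agg = (
    orders.groupby('customer_id')['purchase_amount']
    .agg(cust_avg_purchase='mean', cust_order_count='count')
    .reset_index()
)
df = df.merge(customer_agg, on='customer_id', how='left')

# New customers get NaN from the left merge; flag them rather than hide it.
df['is_new_customer'] = df['cust_avg_purchase'].isna().astype(int)
df['cust_order_count'] = df['cust_order_count'].fillna(0).astype(int)

# Keep the ratio missing where there is no (or a zero) average to compare against.
safe_avg = df['cust_avg_purchase'].where(df['cust_avg_purchase'] > 0)
df['purchase_vs_avg'] = df['purchase_amount'] / safe_avg

print(df)
```

The same $100 purchase reads as five times the usual spend for customer 1 and a fifth of it for customer 2. Whether you impute the missing ratio or let a tree-based model handle the NaN is a judgment call that depends on your model.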
The Pit of Leakage and How to Avoid It
Here’s the part where I stop being your witty friend and become the one who slaps the coffee out of your hand before you make a catastrophic mistake. Data leakage. It’s the silent killer of models in production. It occurs when information from the future (or the target) inadvertently sneaks into your training data. Your model will look incredibly, deceptively good during testing and then faceplant in the real world.
The most common crime scene? Creating aggregates without a time-aware split.
# WRONG. This leaks future information into every row.
df['global_mean_feature'] = df['value'].mean()
# LESS WRONG but still often WRONG. This averages over each customer's ENTIRE history, future rows included.
df['customer_historical_mean'] = df.groupby('customer_id')['value'].transform('mean')
# RIGHT. You must use only data that would have been available AT THE TIME OF THE EVENT.
# Sort by time, then take an expanding mean over each customer's PRIOR rows only.
df = df.sort_values('timestamp')
df['customer_historical_mean'] = df.groupby('customer_id')['value'].transform(lambda s: s.shift(1).expanding().mean())
# shift(1) keeps the current row out of its own average; a customer's first row comes out as NaN.
# Also validate with a time-ordered split such as scikit-learn's TimeSeriesSplit instead of a random shuffle, so evaluation can't peek ahead either.
The rule is simple but brutal: When creating features, you can only use information that would have been available before the thing you’re trying to predict happened. Violate this, and you’re not building a model; you’re building a time bomb with a surprisingly high ROC-AUC score.
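To see the rule bite, compare a leaky per-customer mean with a strictly historical one on a toy event log. This is a sketch under assumptions: the data is invented, and it treats value as something you only learn after each event, so a row may not contribute to its own feature.

```python
import pandas as pd

# Hypothetical event log: two customers, values observed over time.
df = pd.DataFrame({
    'customer_id': [1, 2, 1, 1, 2],
    'timestamp': pd.to_datetime(
        ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05']
    ),
    'value': [10.0, 100.0, 20.0, 60.0, 300.0],
})
df = df.sort_values('timestamp').reset_index(drop=True)

by_customer = df.groupby('customer_id')['value']

# Leaky: every row sees the customer's full history, future included.
df['leaky_mean'] = by_customer.transform('mean')

# Safe: shift(1) hides the current row, expanding() averages what came before.
df['prior_mean'] = by_customer.transform(lambda s: s.shift(1).expanding().mean())

print(df[['customer_id', 'timestamp', 'value', 'leaky_mean', 'prior_mean']])
```

Customer 1’s first row already “knows” a mean of 30 in the leaky column, a number that depends on purchases that haven’t happened yet. In the safe column that row is NaN, and later rows only ever reflect the past.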