12.5 Date and Time Feature Extraction

Right, let’s talk about dates and times. Your model doesn’t understand that “January 1st, 2023” is a Sunday, comes after a Saturday, and is a national holiday. It just sees a string or, heaven forbid, an integer. Our job is to translate the rich, contextual information hidden in a timestamp into a language your algorithm can actually use. This isn’t just data cleaning; it’s data archaeology. We’re excavating meaning.

The first and most critical rule: never, ever store or use your datetime as a raw string. You’re just asking for pain. The moment you get a new data source with a slightly different format ('01-Jan-2023' vs. '2023/01/01'), your entire pipeline grinds to a halt. Your first line of defense is to parse it into a proper datetime object immediately. In Python, that means datetime.datetime; in pandas, a datetime64 column built with pd.to_datetime.

import pandas as pd

# Your raw, messy data
date_data = {'messy_date': ['2023-01-01', '01/02/2023', 'March 3, 2023 14:30']}
df = pd.DataFrame(date_data)

# The forgiving way: format='mixed' (pandas 2.0+) infers the format of each value separately.
# Note that '01/02/2023' is read month-first (January 2) unless you pass dayfirst=True.
df['clean_date'] = pd.to_datetime(df['messy_date'], format='mixed')

# For more control, or if you're a control freak (like me), specify the format.
# With errors='coerce', values that don't match become NaT instead of being guessed.
df['clean_date_controlled'] = pd.to_datetime(df['messy_date'], format='%Y-%m-%d', errors='coerce')
print(df.dtypes)
print(df)

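Pandas’ to_datetime is brilliantly forgiving. The format='mixed' parameter is your friend here (the older infer_datetime_format flag is deprecated as of pandas 2.0), but for mission-critical stuff, I often use a specific format string to avoid any ambiguity. Is '01/02/2023' January 2 or February 1? A forgiving parser will pick one without telling you. Trust, but verify.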
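To make the "trust, but verify" step concrete, here is a minimal sketch of strict parsing. The sample values and the names raw, parsed, and bad_rows are made up for illustration, and it assumes the column is supposed to be ISO-formatted (YYYY-MM-DD).

```python
import pandas as pd

# Hypothetical column where every value is supposed to look like YYYY-MM-DD.
raw = pd.Series(['2023-01-01', '2023-01-02', '01/03/2023'])

# An explicit format plus errors='coerce' turns anything that does not match
# into NaT instead of guessing or raising.
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')

# Rows that failed to parse can be inspected rather than silently lost.
bad_rows = raw[parsed.isna()]
print(parsed)
print(bad_rows)
```

Counting the NaT values right after parsing is a cheap check that a new data source has not quietly changed its format.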

Decomposing the Datetime

Now for the fun part: pulling this single datetime object apart into its constituent features. Think about what’s semantically important in a date.

# Extract the obvious temporal components
df['year'] = df['clean_date'].dt.year
df['month'] = df['clean_date'].dt.month
df['day'] = df['clean_date'].dt.day
df['day_of_week'] = df['clean_date'].dt.dayofweek  # Monday=0, Sunday=6
df['hour'] = df['clean_date'].dt.hour
df['minute'] = df['clean_date'].dt.minute
df['is_weekend'] = (df['clean_date'].dt.dayofweek >= 5).astype(int)

import numpy as np

# For cyclical features like hour, month, and day_of_week, we need to encode them properly.
# To a linear model, the step from month 11 (Nov) to 12 (Dec) is +1, but the step from
# 12 (Dec) to 1 (Jan) is -11, even though both are one month apart.
# We solve this with a sine/cosine transformation.
df['month_sin'] = np.sin(2 * np.pi * df['month']/12)
df['month_cos'] = np.cos(2 * np.pi * df['month']/12)

# Do the same for day of the week and hour
df['dow_sin'] = np.sin(2 * np.pi * df['day_of_week']/7)
df['dow_cos'] = np.cos(2 * np.pi * df['day_of_week']/7)
df['hour_sin'] = np.sin(2 * np.pi * df['hour']/24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour']/24)

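The cyclical encoding is close to non-negotiable for periodic data fed to linear models, neural networks, or anything distance-based. Without it, your model treats the jump from 11 PM (23) to midnight (0) as a 23-step difference instead of a 1-step difference. It’s a great way to get useless results. Tree-based models, which split on thresholds, are less sensitive to this, but the encoding rarely hurts.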
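A quick way to convince yourself that the encoding closes the gap at midnight is to measure distances in the encoded space. The sketch below uses a hypothetical helper, encode_cyclical, which is not part of pandas or NumPy, and checks that each one-hour step has the same length whether or not it crosses midnight.

```python
import numpy as np
import pandas as pd

def encode_cyclical(values, period):
    """Return (sin, cos) coordinates of `values` on a circle of length `period`."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)

# Four consecutive hours that straddle midnight.
hours = pd.Series([22, 23, 0, 1])
hour_sin, hour_cos = encode_cyclical(hours, period=24)
points = np.column_stack([hour_sin, hour_cos])

# Euclidean distance between consecutive hours in the encoded space.
step_distances = np.linalg.norm(np.diff(points, axis=0), axis=1)
print(step_distances)  # all three steps have the same length, up to float error
```

You need both the sine and the cosine column: either one alone maps two different hours to the same value.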

Beyond the Basics: The Real World Intrudes

A timestamp isn’t just a point on a graph; it exists in a human context. This is where you separate the adequate from the exceptional.

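# Is this a US federal holiday? Requires the `holidays` library
# pip install holidays
import holidays

us_holidays = holidays.US()
# Test membership date by date: the calendar fills in each year lazily on lookup,
# so passing it straight to .isin() can see an empty collection and return all False.
df['is_holiday'] = df['clean_date'].dt.date.map(lambda d: d in us_holidays)
df['holiday_name'] = df['clean_date'].dt.date.map(lambda d: us_holidays.get(d, ''))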

# Calendar quarter (1-4)? Super important for business data.
# If your fiscal year doesn't start in January, you'll need to shift this yourself.
df['quarter'] = df['clean_date'].dt.quarter

# Part of the day? Useful for energy or web traffic models.
def part_of_day(hour):
    if 5 <= hour < 12:
        return 'morning'
    elif 12 <= hour < 17:
        return 'afternoon'
    elif 17 <= hour < 21:
        return 'evening'
    else:
        return 'night'

df['time_of_day'] = df['hour'].apply(part_of_day)

# Finally, get dummy variables for categorical time features like time_of_day
df = pd.get_dummies(df, columns=['time_of_day'], prefix='', prefix_sep='')
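The part_of_day bucketing can also be written without a row-by-row apply. This is a sketch of one alternative using pd.cut with the same boundaries; the sample hours are made up, and it assumes pandas 1.1 or later, where ordered=False permits the repeated 'night' label.

```python
import pandas as pd

# Hypothetical sample of hours covering every bucket boundary.
hours = pd.Series([3, 5, 11, 12, 16, 17, 20, 21, 23])

# Left-closed bins: [0,5) night, [5,12) morning, [12,17) afternoon,
# [17,21) evening, [21,24) night. 'night' appears twice, so ordered=False.
time_of_day = pd.cut(
    hours,
    bins=[0, 5, 12, 17, 21, 24],
    right=False,
    labels=['night', 'morning', 'afternoon', 'evening', 'night'],
    ordered=False,
)
print(time_of_day.astype(str).tolist())
```

The result is a categorical column, which pd.get_dummies handles the same way as the string version.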

The Difference Between Two Dates is Gold

Often, the most powerful feature isn’t the date itself, but the time elapsed since a key event.

# Time since a critical event (e.g., product launch, last user login)
critical_date = pd.to_datetime('2023-01-15')
df['days_since_critical'] = (df['clean_date'] - critical_date).dt.days

# For user data, time since last activity is often one of the strongest predictors of churn.
# This assumes df also has a user_id column (the toy DataFrame above does not).
df = df.sort_values(by=['user_id', 'clean_date'])  # Sort by time within each user first!
df['time_since_last_event'] = df.groupby('user_id')['clean_date'].diff().dt.total_seconds() / 3600  # hours

Pitfall Warning: Always, always sort your DataFrame by time before calculating lags or time differences. If your data is out of order, this calculation becomes meaningless noise. I’ve seen this mistake tank more models than I care to admit.
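Here is a small, self-contained sketch of that sort-then-diff pattern. The events table, its user_id values, and the column names are invented for illustration; the rows are deliberately out of order to show why the sort matters.

```python
import pandas as pd

# Hypothetical event log, deliberately out of order.
events = pd.DataFrame({
    'user_id': ['a', 'b', 'a', 'b', 'a'],
    'event_time': pd.to_datetime([
        '2023-01-03 12:00', '2023-01-01 08:00', '2023-01-01 12:00',
        '2023-01-01 20:00', '2023-01-02 12:00',
    ]),
})

# Sort by user, then by time, so each diff compares an event with the
# one immediately before it for the same user.
events = events.sort_values(['user_id', 'event_time']).reset_index(drop=True)
events['hours_since_last_event'] = (
    events.groupby('user_id')['event_time'].diff().dt.total_seconds() / 3600
)
print(events)
```

Each user’s first event has no predecessor, so its value is NaN. Decide deliberately how to fill that (a sentinel, a separate is_first_event flag) rather than letting a default imputer choose for you.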

The goal here is to think like a domain expert. What about this moment in time matters? Is it the end of a quarter? Is it 3 AM on a Tuesday? Is it the user’s first login since a major holiday? Your features should answer those questions. Now go make your timestamps talk.