76.8 Time Series: DatetimeIndex, Resampling, and Rolling

Right, let’s talk about time. Specifically, your data’s time. Because if your timestamps are just strings of text sitting in a column, you’re fighting with one hand tied behind your back. We’re going to fix that. The goal is to make time a first-class citizen in your DataFrame, which unlocks a whole suite of powerful, almost magical, operations.

The absolute, non-negotiable first step is getting your dates and times into a proper datetime format. Pandas will try its best if you use pd.to_datetime(), but you should never rely on its guesswork for anything serious. Be explicit. It’s like dating: assumptions lead to pain.

import pandas as pd

# Let's say you have this mess, which is basically every CSV ever.
df = pd.DataFrame({
    'date_str': ['2023-01-01 08:30:00', '2023-01-01 09:45:00', '2023-01-02 10:15:00'],
    'value': [100, 150, 200]
})

# The "please just figure it out" method (pandas infers the format; often works, but you're living on the edge)
df['date_guess'] = pd.to_datetime(df['date_str'])

# The professional "I am an adult" method (always do this)
# Every entry must match the format exactly; anything that doesn't becomes NaT.
df['date_proper'] = pd.to_datetime(df['date_str'], format='%Y-%m-%d %H:%M:%S', errors='coerce')
print(df.dtypes)
print(df)

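The errors='coerce' argument is your best friend. It forces pandas to convert what it can and stick NaT (Not a Time, datetime’s version of NaN) in for the entries it can’t parse, instead of just throwing a tantrum and exiting. This saves you from a single malformed entry blowing up your entire script. The catch is that it fails silently, so count the NaT values afterwards: a column that comes back mostly NaT usually means your format string is wrong, not your data.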
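Here is a minimal sketch of what coercion looks like in practice. The input values are made up for illustration: one good timestamp, one piece of junk, and one date that looks plausible but doesn't exist.

```python
import pandas as pd

# Hypothetical raw input: one valid timestamp, one junk string, one impossible date.
raw = pd.Series(['2023-01-01 08:30:00', 'not a date', '2023-02-30 12:00:00'])

# Entries that cannot be parsed with this format become NaT instead of raising.
parsed = pd.to_datetime(raw, format='%Y-%m-%d %H:%M:%S', errors='coerce')
print(parsed)

# Pull out the original strings that failed, so you can decide what to do with them.
bad = raw[parsed.isna()]
print(f"{len(bad)} unparseable entries:")
print(bad)
```

Looking at the failures before dropping them is a cheap habit that catches format mistakes early.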

The Superpower: Setting a DatetimeIndex

Converting a column to datetime is just step one. The real magic happens when you tell pandas, “Hey, this column is the timeline for this data.” You do that by setting it as the index. This transforms your regular DataFrame into a time-series-aware DataFrame.

# Set the 'date_proper' column as the index. set_index returns a new DataFrame (inplace=False by default), so we assign the result.
df_ts = df.set_index('date_proper')
print(df_ts.index)  # Behold: a DatetimeIndex

Now, your index is a DatetimeIndex. This is not just a fancy label; it’s a supercharged index that understands time. You can slice data intuitively in ways that would make a standard index cry:

# Get all data from a specific day (use .loc; plain df_ts['2023-01-01'] looks for a column with that name)
print(df_ts.loc['2023-01-01'])

# Slice by a time range (yes, it's this easy)
print(df_ts.loc['2023-01-01 08:00':'2023-01-01 10:00'])
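Partial strings work at coarser resolutions too. The sketch below uses a small made-up frame (the name ts and its values are purely illustrative) to show month-level, day-level, and range selection with .loc.

```python
import pandas as pd

# Hypothetical readings spanning two months, already in chronological order.
ts = pd.DataFrame(
    {'value': [1, 2, 3, 4, 5]},
    index=pd.to_datetime([
        '2023-01-30 09:00', '2023-01-31 09:00', '2023-01-31 18:00',
        '2023-02-01 09:00', '2023-02-02 09:00',
    ]),
)

january = ts.loc['2023-01']                # every row in January 2023
one_day = ts.loc['2023-01-31']             # every row on that calendar day
span = ts.loc['2023-01-31':'2023-02-01']   # both endpoints included, as whole days

print(january)
print(one_day)
print(span)
```

Note that the slice includes the whole of its end date, unlike ordinary Python slicing, which excludes the endpoint.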

Resampling: Changing Time’s Granularity

Here’s where things get fun. You have minute-level data but need a daily report? You have daily sales but need to compare weekly performance? This is called resampling. You’re changing the frequency of your observations. The resample() method is the gateway drug for this, and it works almost exactly like groupby(), but for time.

# Let's create some more realistic data first
import numpy as np

rng = pd.date_range('2023-01-01', periods=100, freq='D')  # 100 days of data
df_daily = pd.DataFrame({'date': rng, 'sales': np.random.randint(50, 200, size=100)})
df_daily = df_daily.set_index('date')

# Resample daily data down to weekly totals (downsampling)
weekly_sales = df_daily.resample('W-MON').sum()  # Weekly bins, each ending on (and labeled by) a Monday
print(weekly_sales.head())

# You can use different methods: mean, max, min, oh my!
weekly_avg = df_daily.resample('W-MON').mean()
weekly_max = df_daily.resample('W-MON').max()

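The magic is in the frequency string: 'W-MON', 'D', 'h' (hourly), '5min', 'ME' (month end), 'MS' (month start). Older pandas releases spell hourly and month end as 'H' and 'M'; pandas 2.2 deprecated those spellings in favor of 'h' and 'ME', so which one you type depends on your version. This is one of those places where the pandas API is both brilliant and utterly maddening. You just have to memorize the most common ones; for everything else, keep the time series offset aliases documentation bookmarked. I do.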
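Because resample() behaves like groupby(), you can hand it several aggregations at once. The following is a rough sketch on deterministic synthetic data (the series is just 0 through 99, so the numbers are easy to check by hand), using only the 'W-MON' and 'MS' aliases.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: the values 0..99 on 100 consecutive days.
rng = pd.date_range('2023-01-01', periods=100, freq='D')
sales = pd.Series(np.arange(100), index=rng, name='sales')

# Several statistics per weekly bin, exactly as you would with groupby().agg().
weekly = sales.resample('W-MON').agg(['sum', 'mean', 'max'])
print(weekly.head())

# Monthly totals, labeled by the first day of each month.
monthly = sales.resample('MS').sum()
print(monthly)
```

The first weekly bin is short: 2023-01-01 is a Sunday, so the bin ending Monday 2023-01-02 holds only two days. Partial bins at the edges are worth checking before you compare totals.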

Rolling Windows: Calculating Moving Statistics

Resampling gives you new, fixed points in time. Rolling calculations give you a window that slides through your data, calculating a statistic for every point based on its neighbors. Think “7-day moving average.” It’s fantastic for smoothing out noisy data and seeing trends.

# Calculate a 7-day rolling average of our daily sales
df_daily['rolling_avg_7'] = df_daily['sales'].rolling(window=7).mean()
print(df_daily.head(10))

Notice the first 6 values for rolling_avg_7 are NaN. That’s because there isn’t a full 7-day window to calculate the average until the 7th day. This is the most common “gotcha” – your rolling series will have a lead-in of NaN values. You can control this with the min_periods argument if you’re okay with, say, a 3-day average for the first few points.
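To see the effect of min_periods side by side, here is a small sketch on a made-up series of the numbers 1 through 10; the names strict and relaxed are only for illustration.

```python
import pandas as pd

# Ten days of easy-to-check values: 1, 2, ..., 10.
s = pd.Series(range(1, 11), index=pd.date_range('2023-01-01', periods=10, freq='D'))

# Default behavior: no result until the window holds 7 values.
strict = s.rolling(window=7).mean()

# Relaxed: emit a result as soon as at least 3 values are available.
relaxed = s.rolling(window=7, min_periods=3).mean()

print(pd.DataFrame({'value': s, 'strict': strict, 'relaxed': relaxed}))
```

The trade-off is that the early relaxed values are averages over fewer points, so they are noisier than the rest of the series.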

The real power user move is using a time-based window instead of a fixed number of periods. This is a lifesaver for irregular time series.

# Let's make an irregular time series for demonstration
irregular_dates = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-05', '2023-01-09']) # Note the gaps!
df_irregular = pd.DataFrame({'sales': [10, 20, 30, 40]}, index=irregular_dates)

# A 3-day *time* window: for each point, look back 3 days and calculate the mean
df_irregular['rolling_time_avg'] = df_irregular['sales'].rolling('3D').mean()
print(df_irregular)

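See? A '3D' window ending at a given timestamp covers the three days up to and including it, with the far edge excluded. For the point on 2023-01-09, that means anything after 2023-01-06: there is no 2023-01-07 (doesn’t exist) or 2023-01-08 (nope), and 2023-01-05 is four days back, so the average is just that day’s own value, 40. Likewise, 2023-01-05 averages only itself, because 2023-01-02 sits exactly on the excluded edge. Only 2023-01-02 gets a real two-point average (15), from itself and 2023-01-01. This is far more logical for real-world data than a fixed-period window: window=3 would average the last three rows no matter how many days separate them. Prefer a time-based window ('3D', '5h') whenever your DatetimeIndex is irregularly spaced, and make sure the index is sorted, because time-based windows require a monotonic index.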
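If you want to check the difference yourself, this sketch puts a row-count window and a time-based window next to each other on the same four irregular points. The column names are made up, and min_periods=1 is set on the fixed window only so that it produces a value on every row.

```python
import pandas as pd

dates = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-05', '2023-01-09'])
sales = pd.Series([10, 20, 30, 40], index=dates)

compare = pd.DataFrame({
    'sales': sales,
    # Last three rows, however far apart in time they are.
    'last_3_rows': sales.rolling(window=3, min_periods=1).mean(),
    # Only rows that fall within the three days ending at each timestamp.
    'last_3_days': sales.rolling('3D').mean(),
})
print(compare)
```

On 2023-01-09 the row-count window blends in values from as far back as 2023-01-02, while the time-based window sees only that day's own value.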

The designers got this one right. It’s a beautifully implemented feature. Use it.