10.4 ARIMA: AutoRegressive Integrated Moving Average

Right, let’s talk about ARIMA. You’ve probably heard the name thrown around like a magic incantation. It stands for AutoRegressive Integrated Moving Average, which sounds like a committee named it, and they did. It’s the workhorse of classical time series forecasting, the thing you try before you break out the big neural network guns. It’s not magic, but when you understand its components, it becomes a shockingly powerful and intuitive tool. Think of it as giving your past data and your past mistakes a vote in predicting the future.

The name is the recipe. Let’s break it down.

The AR: Autoregression

This is the part where the model says, “Hey, today’s value probably looks a lot like yesterday’s value… and the day before’s.” An AR(p) model is a linear regression that uses the last p values to predict the next one. The p is called the order.

The formula for an AR(1) model is: $X_t = c + \phi X_{t-1} + \epsilon_t$

Where c is a constant, \phi is the parameter we need to estimate (how much influence the previous value really has), and \epsilon_t is white noise (the stuff we can’t predict). If \phi is 1, you’re basically just saying “tomorrow will be today,” which is a bold strategy. We’ll need to fix that later.

The I: Integrated

This is the secret sauce that makes non-stationary data play nice. “Integrated” is a fancy word for “differenced.” Stationarity—having a constant mean and variance over time—is a non-negotiable requirement for AR and MA models. Most real-world data (stock prices, sales trends) is non-stationary; it has a trend.

So, we difference the data. Instead of modeling the raw value X_t, we model the change from one time step to the next: X_t - X_{t-1}. That difference series often has a steady mean around zero. The d parameter in ARIMA(p, d, q) is the number of times you need to difference the data to achieve stationarity. Usually, d=1 does the trick. Sometimes you’ll need d=2. Any more than that and you should probably step away from the data and have a coffee while you reconsider your life choices.

The MA: Moving Average

Don’t let the name fool you; this has nothing to do with smoothing. The MA(q) part models the future value as a linear combination of past error terms (the shocks we didn’t see coming). It’s the model saying, “If I really messed up my prediction yesterday, that probably means my world-view was wrong, and that error will echo into today’s prediction.”

The formula for an MA(1) model is: $X_t = \mu + \epsilon_t + \theta \epsilon_{t-1}$

Where \mu is the mean of the series, \epsilon_t is today’s white noise shock, and \theta is the parameter telling us how much of yesterday’s shock carries over. It captures the unpredictable events that have a ripple effect.

Putting It All Together: ARIMA(p,d,q)

An ARIMA model is simply an ARMA model (AR + MA) fitted on differenced (Integrated) data. The full formula for an ARIMA(1,1,1) model, for example, would be: $(X_t - X_{t-1}) = c + \phi (X_{t-1} - X_{t-2}) + \epsilon_t + \theta \epsilon_{t-1}$

We are modeling the change in X (because d=1) as a function of the previous change (the AR part) and the previous error (the MA part).

How You Actually Do It in Python

Enough theory. Let’s get our hands dirty with statsmodels. First, let’s make some fake data we understand completely.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA

# Set a seed for reproducibility
np.random.seed(42)

# Create a simple ARIMA(1,1,1) process
n = 1000
epsilon = np.random.normal(size=n) # White noise
X = np.zeros(n)

# Simulate the process: (1 - phi*B)(1 - B)X_t = (1 + theta*B)epsilon_t
# This is a bit more hands-on to show the mechanics
for t in range(2, n):
    # This embodies our ARIMA(1,1,1) formula above
    X[t] = X[t-1] + 0.7 * (X[t-1] - X[t-2]) + epsilon[t] + 0.6 * epsilon[t-1]

# Convert to a Pandas Series because statsmodels likes it that way
time_series = pd.Series(X)
time_series.plot(title="Our Simulated ARIMA(1,1,1) Data")
plt.show()

Now, let’s fit a model to this data. The hard part in the real world is choosing p, d, and q. For now, we’ll pretend we know the right orders (we built the thing, after all).

# Fit an ARIMA(1,1,1) model to our data
model = ARIMA(time_series, order=(1, 1, 1))
fitted_model = model.fit()

# Print a summary of the fit. This is where you see if the model 'got it'.
print(fitted_model.summary())

Look at the coefficient table. You should see an ar.L1 parameter close to our simulated 0.7 and a ma.L1 parameter close to 0.6. The fact that it gets so close is borderline magical. The summary also gives you p-values (P>|z|) to check if each parameter is statistically significant (a value below 0.05 is a good sign it’s a meaningful part of the model).

The Real World: ACF, PACF, and the Art of Guessing (p,d,q)

Here’s where the rubber meets the road. You won’t know the true p, d, and q. You have to deduce them.

Find d: Plot your data. Does it have a obvious trend? Difference it once (d=1) and plot it again. Does it look stationary (wobbling around a fixed mean)? If not, difference it again. Rarely do you need more than d=2.
Find p and q: This is the tricky bit. You use two plots:
- ACF (AutoCorrelation Function): Shows the correlation between a series and its lagged self. A slow, graceful decay in the ACF suggests an AR process. A sharp cutoff after lag q suggests an MA(q) process.
- PACF (Partial AutoCorrelation Function): Shows the correlation between a series and a lag, after controlling for the correlations of all shorter lags. A sharp cutoff after lag p suggests an AR(p) process.

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Plot ACF and PACF of the ORIGINAL data (to help find d)
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))
plot_acf(time_series, lags=40, ax=ax1)
plot_pacf(time_series, lags=40, ax=ax2)
plt.show()

# Now difference the data once and plot ACF/PACF again to find p and q
diff_series = time_series.diff().dropna()
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))
plot_acf(diff_series, lags=40, ax=ax1)
plot_pacf(diff_series, lags=40, ax=ax2)
plt.show()

Interpreting these plots is more art than science. You’ll squint at the plots, make a guess (e.g., “looks like an AR(2)”), fit the model, and then check the model’s residuals. If the residuals look like white noise (no significant lags in the ACF/PACF), you’ve probably got a good model. If not, you go back and try again.

The Biggest Pitfall: Overfitting and Overdifferencing

It’s incredibly easy to make an ARIMA model that perfectly fits your historical data by just throwing more parameters at it (p and q too high). This model will be utterly useless for forecasting the future. Always use the AIC or BIC score in the model summary to compare models. Lower is better. They penalize models for having extra parameters, helping you avoid this trap.

Also, do not just set d=2 because you can. Every time you difference, you lose a data point and inject noise. Overdifferencing will actually introduce patterns that the model then has to waste parameters modeling. If d=1 makes the series stationary, stop there.

ARIMA is a brilliant, robust tool. It’s not the flashiest kid on the block anymore, but it will often outperform far more complex models with a fraction of the computational cost. Master its mechanics, and you’ve got a trusted friend in your forecasting toolbox.