10.8 Neural Methods: LSTM and Temporal Convolutional Networks
Right, let’s talk about neural networks for time series. You’ve probably hit the wall with ARIMA and its ilk. They’re like that reliable but deeply boring coworker—great for linear problems with a firm handshake, but they fall apart the moment things get even a little bit… interesting. Non-linear trends? Complex, long-range dependencies? They just shrug. That’s where our flashier friends, LSTMs and Temporal Convolutional Networks (TCNs), come in. They’re the data scientists who show up to the company picnic in leather jackets. They can model those complex, non-linear relationships that make traditional methods weep. But I’ll be your brilliant, slightly cynical friend here: they’re not magic. They come with their own brand of absurdity and a whole new set of problems to solve.
The Problem with Vanilla RNNs (And Why We Need LSTMs)
Before we get to the hero, let’s meet the failed prototype. A simple Recurrent Neural Network (RNN) seems perfect for time series: it has a memory loop, allowing it to pass information from one step to the next. Theoretically, it should remember what happened ten steps ago. In practice? It either suffers from a case of catastrophic amnesia or its training blows up entirely. This is the infamous vanishing/exploding gradient problem.
During training, gradients (which are used to update the weights) are calculated via backpropagation through time. In an RNN, this means multiplying by the same recurrent weight matrix (together with the activation derivatives) once for every time step. Think of it like a game of telephone. If each multiplication shrinks the signal (loosely, the matrix behaves like a number less than 1), the gradient shrinks to nothing (vanishes) by the time it gets back to the early steps. The network forgets. If each multiplication amplifies it (the matrix behaves like a number greater than 1), the gradient grows exponentially (explodes) and training blows up. Either way, your model can’t learn long-range dependencies. It’s a deal-breaker.
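To make the telephone-game picture concrete, here is a minimal sketch in plain Python. It assumes a deliberately simplified scalar, linear RNN, h_t = w * h_{t-1} + x_t, so the gradient of the final state with respect to the first one is just w multiplied by itself once per step. The function name gradient_through_time is made up for this illustration; a real RNN multiplies Jacobian matrices instead of a single number, but the shrinking and ballooning work the same way.

```python
def gradient_through_time(w: float, steps: int) -> float:
    """Gradient of h_T w.r.t. h_0 for the scalar linear RNN h_t = w * h_{t-1} + x_t.

    Backpropagation multiplies by the same recurrent weight once per time step,
    so the result is simply w ** steps.
    """
    grad = 1.0
    for _ in range(steps):
        grad *= w  # one factor of w for every step walked back in time
    return grad


for w in (0.9, 1.0, 1.1):
    print(f"w={w}: gradient after 100 steps = {gradient_through_time(w, 100):.3e}")
```

With w = 0.9 the gradient has shrunk to roughly 0.00003 after 100 steps, and with w = 1.1 it has grown to roughly 14,000. Only the knife-edge case w = 1.0 keeps the signal intact.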
Enter the LSTM: The Fancy, Overengineered Memory Cell
The Long Short-Term Memory (LSTM) network is the brilliant, albeit slightly convoluted, solution to this. Instead of a naive memory loop, it has a carefully regulated system of gates—think of it as a bureaucratic process for managing information. It has three main gates:
- Forget Gate: “What parts of the old memory should we trash?” It looks at the new input and the previous hidden state and outputs a number between 0 (completely forget) and 1 (remember everything) for each number in the cell state.
- Input Gate: “What new information are we going to store in the cell state?” It decides which values to update, while a companion tanh layer proposes the candidate values to write.
- Output Gate: “What parts of the cell state are we going to output?” The cell state is filtered to create the next hidden state.
This gated mechanism allows the LSTM to learn what to remember, what to forget, and what to pass on, over much longer sequences than a vanilla RNN can manage. It’s gloriously overengineered, and it works.
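One step of that gate bureaucracy can be sketched in plain Python. This is a toy under loud assumptions: a single hidden unit, a single input feature, and hand-picked (not trained) weights stored in a hypothetical params dictionary. The update itself follows the standard LSTM equations; nn.LSTM does the same thing with weight matrices instead of scalars.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step for a single hidden unit (scalars only).

    params maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre_activation(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre_activation("forget"))        # 0 = erase old memory, 1 = keep it
    i = sigmoid(pre_activation("input"))         # how much new information to write
    g = math.tanh(pre_activation("candidate"))   # the new information itself
    o = sigmoid(pre_activation("output"))        # how much of the memory to expose

    c = f * c_prev + i * g   # updated cell state (long-term memory)
    h = o * math.tanh(c)     # updated hidden state (what the next layer sees)
    return h, c


def run(params, inputs, c_start):
    """Feed a sequence through the cell and return the final cell state."""
    h, c = 0.0, c_start
    for x in inputs:
        h, c = lstm_step(x, h, c, params)
    return c


# Hand-picked, hypothetical weights: biases push the gates fully open or shut.
keeper = {
    "forget": (0.0, 0.0, 10.0),     # forget gate ~1: keep almost everything
    "input": (0.0, 0.0, -10.0),     # input gate ~0: write almost nothing
    "candidate": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 10.0),
}
forgetter = dict(keeper, forget=(0.0, 0.0, -10.0))  # forget gate ~0: wipe memory

inputs = [0.5, -0.3, 0.8, -0.9, 0.1] * 10  # 50 arbitrary input values

c_keep = run(keeper, inputs, c_start=1.0)
c_forget = run(forgetter, inputs[:1], c_start=1.0)
print(f"keeper cell state after 50 steps:   {c_keep:.4f}")
print(f"forgetter cell state after 1 step:  {c_forget:.4f}")
```

The keeper still holds a value close to the 1.0 it started with after 50 steps, while the forgetter drops it almost immediately. Along the direct cell-state path the old memory is only multiplied by the forget gate, so when that gate sits near 1 the repeated shrinking that cripples a vanilla RNN is avoided.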
Here’s a classic example of setting up an LSTM in PyTorch for a univariate forecasting task. Notice the nn.LSTM layer and how we handle its output.
import torch
import torch.nn as nn
import numpy as np

# Fix the seeds so the run is repeatable
np.random.seed(0)
torch.manual_seed(0)

# Let's create a simple sine wave as our toy data
time = np.linspace(0, 100, 1000)
data = np.sin(time) + 0.1 * np.random.randn(1000)  # sine wave with a bit of noise

# Prepare sequences: using 50 steps to predict the next 1
sequence_length = 50
X, y = [], []
for i in range(len(data) - sequence_length):
    X.append(data[i:i+sequence_length])
    y.append(data[i+sequence_length])
X = np.array(X)
y = np.array(y)

# Convert to PyTorch tensors and add a feature dimension: (batch_size, seq_len, features)
X_tensor = torch.FloatTensor(X).unsqueeze(-1)  # now shape [950, 50, 1]
y_tensor = torch.FloatTensor(y)  # shape [950]


class LSTMForecaster(nn.Module):
    def __init__(self, input_size=1, hidden_size=50, output_size=1):
        super().__init__()
        self.hidden_size = hidden_size
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.linear = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # The LSTM returns: output, (hidden_state, cell_state)
        lstm_out, _ = self.lstm(x)
        # We take the output from the VERY LAST time step only
        last_time_step_out = lstm_out[:, -1, :]
        prediction = self.linear(last_time_step_out)
        # Drop only the trailing output dim so a batch of one keeps its batch axis
        return prediction.squeeze(-1)


model = LSTMForecaster()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Training loop (simplified: full batch, no validation split)
for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    outputs = model(X_tensor)
    loss = criterion(outputs, y_tensor)
    loss.backward()
    optimizer.step()
    if epoch % 10 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item():.4f}')
Temporal Convolutional Networks: The Cool New Kid
While we were all obsessed with LSTMs, a powerful idea from computer vision was waiting in the wings: convolution. A Temporal Convolutional Network (TCN) uses convolutional layers designed for sequence data. Its key feature is causal convolution: the output at time t depends only on elements from time t and earlier in the previous layer. No peeking into the future! This is enforced by padding the input on the left only (with (k-1)·d zeros for kernel size k and dilation d), which also keeps the output the same length as the input.
But the real magic is dilated convolutions. This is where the filter has holes in it, allowing it to have a much larger receptive field without adding parameters. A dilation rate of d means the filter taps sit d steps apart, with d-1 skipped positions between neighbors. Stack layers whose dilation doubles each time (1, 2, 4, and so on) and the receptive field grows exponentially with depth while the parameter count grows only linearly. This allows the network to see very long historical contexts efficiently. Because a convolution processes all time steps in parallel rather than one after another, a TCN is often faster to train than an LSTM, and in several published benchmarks it matches or outperforms recurrent models.
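The two ideas above, left-only padding and dilation, fit in a few lines of plain Python. This is a sketch for intuition rather than a fast implementation: the function name causal_dilated_conv and the filter taps are invented for the example, and weights[0] is assumed to apply to the current time step.

```python
def causal_dilated_conv(x, weights, dilation=1):
    """1-D causal convolution over a list of floats.

    Output t is sum_k weights[k] * x[t - k * dilation]. Positions before the
    start of the series count as zeros, which is exactly left padding.
    """
    out = []
    for t in range(len(x)):
        total = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation   # only time t and earlier are ever read
            if idx >= 0:             # idx < 0 falls inside the zero left-padding
                total += w * x[idx]
        out.append(total)
    return out


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
weights = [0.5, 0.3, 0.2]  # hypothetical filter taps

# With dilation=2 the three taps read x[t], x[t-2], x[t-4]:
# a span of five steps covered by only three weights.
y = causal_dilated_conv(series, weights, dilation=2)

# Tamper with the "future": change the last value and recompute.
tampered = series[:-1] + [100.0]
y_tampered = causal_dilated_conv(tampered, weights, dilation=2)

print(len(y) == len(series))       # True: output keeps the input length
print(y[:-1] == y_tampered[:-1])   # True: earlier outputs are unaffected
```

Changing the final input only changes the final output, which is the property that makes it safe to train such a layer on historical data.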
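import torch.nn.functional as F  # torch and nn are imported in the LSTM example above


class TCNForecaster(nn.Module):
    def __init__(self, input_size=1, n_channels=64, kernel_size=3, dilation_base=2):
        super().__init__()
        # A simple stack of dilated causal conv layers (dilations 1, 2, 4 by default).
        # padding='same' would pad both sides and let each output see future steps,
        # so no padding is set here; forward() left-pads by (kernel_size - 1) * dilation.
        self.dilations = [dilation_base**i for i in range(3)]
        self.left_pads = [(kernel_size - 1) * d for d in self.dilations]
        self.conv1 = nn.Conv1d(input_size, n_channels, kernel_size, dilation=self.dilations[0])
        self.conv2 = nn.Conv1d(n_channels, n_channels, kernel_size, dilation=self.dilations[1])
        self.conv3 = nn.Conv1d(n_channels, n_channels, kernel_size, dilation=self.dilations[2])
        self.linear = nn.Linear(n_channels, 1)

    def forward(self, x):
        # Input x shape: (batch_size, seq_len, features) -> needs to be (batch_size, features, seq_len) for Conv1d
        x = x.transpose(1, 2)
        for conv, pad in zip((self.conv1, self.conv2, self.conv3), self.left_pads):
            # Zero-pad on the left only, so the output at time t never sees inputs after t
            x = torch.relu(conv(F.pad(x, (pad, 0))))
        # Take the output from the very last time step of the last layer
        x = x[:, :, -1]  # shape becomes (batch_size, n_channels)
        return self.linear(x).squeeze(-1)


tcn_model = TCNForecaster()
# ... same training loop as before, with tcn_model (and an optimizer built from its parameters) in place of model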
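It’s worth checking how far back a stack like this can actually see. For stride-1 causal convolutions with one layer per dilation, the receptive field is 1 + (kernel_size - 1) * (sum of the dilations). The helper names below are made up for illustration; the numbers plugged in are the hyperparameters used in this section (kernel size 3, dilation base 2, 50-step input windows).

```python
def receptive_field(kernel_size, dilations):
    """Number of input steps that can influence one output step of a stack
    of dilated causal convolutions (one conv per dilation, stride 1)."""
    return 1 + (kernel_size - 1) * sum(dilations)


def layers_needed(kernel_size, dilation_base, window):
    """Smallest number of layers, with dilations base**0, base**1, ...,
    whose receptive field covers `window` input steps."""
    n_layers = 1
    while receptive_field(kernel_size, [dilation_base**i for i in range(n_layers)]) < window:
        n_layers += 1
    return n_layers


print(receptive_field(3, [1, 2, 4]))   # 15
print(layers_needed(3, 2, 50))         # 5
```

So a three-layer causal stack with these settings looks back over only 15 steps, even though each window holds 50. Covering the whole window would take five layers (dilations 1, 2, 4, 8, 16, for a receptive field of 63), or a larger kernel.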
The Devil’s in the Details: Pitfalls and Best Practices
- Input Scaling is Non-Negotiable: Unlike tree-based models, neural nets are pathologically sensitive to input scale. You must normalize your data (e.g., using StandardScaler), and fit the scaler on the training portion only so that statistics from the future don’t leak into the past. Feeding in raw values like [1250.45, 1301.22, 1189.90] is a fantastic way to waste a week of your life debugging why loss won’t go down.
- The “Last Step” Fallacy: Notice in both code examples we only used the output from the last time step for the final prediction. This is common for “many-to-one” forecasting. For “many-to-many,” you’d use the whole output sequence. This is a crucial architectural decision.
- They’re Data Gluttons: These models have a staggering number of parameters. Even the small LSTM above has about 10,600 of them and the TCN about 25,000, against only 950 training windows. They will overfit on your tiny 1000-point dataset and memorize the noise. You need either a massive amount of data or aggressive regularization: dropout (e.g., nn.Dropout after activations), weight decay in your optimizer, and early stopping.
- Hyperparameter Hell: The number of layers, hidden units, learning rate, dilation rates, kernel sizes… it’s a jungle. You will spend more time tuning these than you did building the model. Automate the search with a library like Optuna, and at the very least use a learning rate scheduler.
- Interpretability is a Joke: Forget it. You can try to inspect gate activations or analyze filters, but you’ll mostly be staring at a black box that happens to make good predictions. If you need to explain why to a stakeholder, you’re better off keeping a simpler model in your back pocket.
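Two of these habits, leakage-free scaling and early stopping, are easy to sketch without any framework. The snippet below is a dependency-free illustration, not a drop-in replacement for StandardScaler or for a real training loop: fit_scaler, transform, and EarlyStopping are hypothetical helpers, the series extends the three raw values quoted above with made-up numbers, and the validation losses are invented stand-ins for what a training loop would report.

```python
import statistics


def fit_scaler(train_values):
    """Return (mean, std) computed from the training portion only."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)


def transform(values, mean, std):
    """Standardize values with statistics that were fitted elsewhere."""
    return [(v - mean) / std for v in values]


class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss      # new best: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1      # no improvement this epoch
        return self.bad_epochs >= self.patience


# Illustrative series (made-up values), split chronologically: no shuffling.
raw = [1250.45, 1301.22, 1189.90, 1275.10, 1320.75,
       1298.30, 1340.00, 1365.55, 1310.20, 1388.80]
split = int(len(raw) * 0.8)
train, valid = raw[:split], raw[split:]

mean, std = fit_scaler(train)                # statistics from the past only
train_scaled = transform(train, mean, std)
valid_scaled = transform(valid, mean, std)   # reuse them on the held-out tail

# Invented validation losses standing in for a real training loop.
fake_val_losses = [0.90, 0.60, 0.45, 0.47, 0.46, 0.48, 0.50]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(fake_val_losses):
    if stopper.should_stop(val_loss):
        stopped_at = epoch
        break

print(f"stopped at epoch {stopped_at}, best validation loss {stopper.best}")
```

In a real run you would also save the model weights each time the validation loss improves, so that stopping late still leaves you with the best checkpoint.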
So, which one should you use? Try both. TCNs are often faster and sometimes more accurate. LSTMs are the established, well-understood workhorse. The “best” model is the one you can get to work reliably on your specific data, with your specific constraints. Now go forth and overfit. Then regularize.