17.6 Stacked and Deep RNNs | mikePietsch.com

Right, so you’ve got the basic LSTM or GRU cell working. It’s a marvel of engineering, a tiny state machine that almost, almost remembers things like you do. Now, let’s be honest: a single layer of these things is often about as powerful as a bicycle engine in a semi-truck. For anything remotely complex—like translating entire sentences, generating coherent paragraphs, or modeling polyphonic music—you need depth. You need to stack these cells into a deep RNN. It’s the difference between a soloist and a full orchestra; each layer adds a new level of abstraction and representation.

Think of it this way: the first layer might be great at capturing low-level patterns, like the short-term dependencies between words. “The cat sat on the…” – that layer is screaming “mat!” But the next layer can take those processed sequences and look for higher-order patterns. Maybe it’s figuring out the overall sentiment of the sentence, or that we’re in the middle of a conditional clause. Each subsequent layer operates on a more abstract representation of the sequence below it.

The Architecture of a Stacked RNN

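Architecturally, it’s beautifully simple and a computational nightmare. You just take the output of one RNN layer and feed it as the input to the next RNN layer above it. That’s it. At each time step, the hidden state that layer one emits becomes the input to layer two at that same time step, where it gets combined with layer two’s own hidden state from the previous step. The nightmare part: every layer still has to walk the sequence one step at a time, so compute and memory grow with both sequence length and depth.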
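To make that wiring concrete, here is a minimal sketch that builds the stack by hand out of single-layer LSTMs. The class name ManualStackedLSTM and the sizes are made up for illustration, and this is not how you would normally write it; it just spells out "output of one layer is the input of the next".

```python
import torch
import torch.nn as nn


class ManualStackedLSTM(nn.Module):
    """Hypothetical helper: a deep RNN assembled from single-layer LSTMs."""

    def __init__(self, input_size, hidden_size, num_layers):
        super().__init__()
        # Layer 0 reads the raw input; every later layer reads the
        # hidden states produced by the layer directly below it.
        in_sizes = [input_size] + [hidden_size] * (num_layers - 1)
        self.layers = nn.ModuleList(
            [nn.LSTM(s, hidden_size, num_layers=1, batch_first=True) for s in in_sizes]
        )

    def forward(self, x):
        for layer in self.layers:
            # (batch, seq, features) -> (batch, seq, hidden_size)
            x, _ = layer(x)
        return x


stack = ManualStackedLSTM(input_size=100, hidden_size=128, num_layers=3)
out = stack(torch.randn(4, 10, 100))
print(out.shape)  # torch.Size([4, 10, 128])
```

The sequence length never changes on the way up; only the feature dimension does, and only at the first layer.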

In code, with a modern framework like PyTorch, this is embarrassingly straightforward. The num_layers argument in the RNN, LSTM, or GRU module is your best friend here.

import torch
import torch.nn as nn

# Hyperparameters
input_size = 100   # e.g., size of word embeddings
hidden_size = 128  # Size of the hidden state in each cell
num_layers = 3     # This is the magic number: the depth
batch_size = 32
seq_length = 50

# Instantiate a deep LSTM
deep_lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)

# Create some dummy input data (batch_size, seq_length, input_size)
dummy_input = torch.randn(batch_size, seq_length, input_size)

# Initialize hidden and cell states for all layers
# The shape is (num_layers, batch_size, hidden_size)
h0 = torch.zeros(num_layers, batch_size, hidden_size)
c0 = torch.zeros(num_layers, batch_size, hidden_size)

# Forward pass
output, (hidden, cell) = deep_lstm(dummy_input, (h0, c0))

print(f"Output shape: {output.shape}")  # (32, 50, 128) - [batch, seq, hidden]
print(f"Final hidden state shape: {hidden.shape}") # (3, 32, 128) - [layer, batch, hidden]

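See? The output tensor contains the hidden states from the top layer only, one for every time step. The hidden tensor contains the final hidden state of every layer at the last time step, and cell does the same for the cell states. This is crucial for things like sequence-to-sequence models, where the encoder’s final states are typically handed to the decoder as its starting point.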
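If the two tensors still feel interchangeable, a quick sanity check helps. The sketch below (sizes copied from the example above, seed chosen arbitrarily) reads the top layer's last time step in two ways, once from the output tensor and once from the final hidden states, and confirms they match for a unidirectional LSTM.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=100, hidden_size=128, num_layers=3, batch_first=True)
x = torch.randn(32, 50, 100)

# Initial states default to zeros when they are not passed in.
output, (h_n, c_n) = lstm(x)

last_step_from_output = output[:, -1, :]  # top layer, last time step: (32, 128)
top_layer_from_h_n = h_n[-1]              # last time step, top layer: (32, 128)

same = torch.allclose(last_step_from_output, top_layer_from_h_n)
print(same)  # True
```

The lower layers' final states, h_n[0] and h_n[1], appear nowhere in output, which is why you need the hidden tensor when you want to carry the state of the whole stack forward.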

The Vanishing Gradient Problem: Now in Stereo!

Remember how LSTMs and GRUs were invented to fight the vanishing gradient problem in simple RNNs? Well, welcome to the next round. Stacking layers creates a deeper network, and gradients have to flow back through both time and depth. It’s a brutal journey. Even with LSTM’s gated highway, gradients can still attenuate as they travel down through multiple layers. This is why you often see people cap deep RNNs at around 3-4 layers unless they have a truly massive dataset. If your model’s performance gets worse after the third layer, don’t panic. You’ve just rediscovered this fundamental limitation.
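One rough way to watch this happen is to compare gradient norms layer by layer after a backward pass. The sketch below is a diagnostic, not a proof: the toy loss, the random input, and the choice to look at the input-to-hidden weights are all assumptions made for illustration, and the numbers you get will depend on initialization and data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_layers = 4
lstm = nn.LSTM(input_size=100, hidden_size=128, num_layers=num_layers, batch_first=True)
x = torch.randn(32, 50, 100)

output, _ = lstm(x)
# Toy loss (an assumption): depends only on the top layer at the last time step.
loss = output[:, -1, :].pow(2).mean()
loss.backward()

# nn.LSTM names its per-layer parameters weight_ih_l0, weight_ih_l1, ...
grad_norms = {}
for layer in range(num_layers):
    weight = getattr(lstm, f"weight_ih_l{layer}")
    grad_norms[layer] = weight.grad.norm().item()
    print(f"layer {layer}: grad norm of weight_ih = {grad_norms[layer]:.3e}")
```

If the norms for the bottom layers are orders of magnitude smaller than those at the top during real training, that is a hint the extra depth is not being trained effectively.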

Best Practices and Pitfalls

This isn’t a free lunch. Here’s what you need to know to avoid blowing up your GPU and your sanity.

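Initialization is Everything: Zeros for your initial hidden states are what PyTorch gives you when you pass nothing, and they’re a fine baseline. They’re also the lazy option. For deep networks, it can be worth trying something else for your hidden and cell states, such as treating them as learned parameters or sampling them from a small normal distribution, and checking whether convergence improves. The same goes for the weights of the LSTM itself: by default nn.LSTM draws every weight and bias from a uniform distribution scaled by 1/sqrt(hidden_size), and you can override that with something like Xavier through torch.nn.init.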
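Here is one way that might look in practice. The wrapper name LSTMWithLearnedInit is hypothetical, and the particular scheme (Xavier for input weights, orthogonal for recurrent weights, zero biases, learned initial states) is a common choice assumed for illustration rather than a recommendation from any particular source.

```python
import torch
import torch.nn as nn


class LSTMWithLearnedInit(nn.Module):
    """Hypothetical wrapper: learns h0/c0 and re-initializes the LSTM weights."""

    def __init__(self, input_size, hidden_size, num_layers):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        # One learned initial vector per layer, shared across the batch.
        self.h0 = nn.Parameter(torch.zeros(num_layers, 1, hidden_size))
        self.c0 = nn.Parameter(torch.zeros(num_layers, 1, hidden_size))
        # Assumed scheme: Xavier for input weights, orthogonal for
        # recurrent weights, zeros for biases.
        for name, param in self.lstm.named_parameters():
            if name.startswith("weight_ih"):
                nn.init.xavier_uniform_(param)
            elif name.startswith("weight_hh"):
                nn.init.orthogonal_(param)
            elif name.startswith("bias"):
                nn.init.zeros_(param)

    def forward(self, x):
        batch_size = x.size(0)
        # Broadcast the learned states over the batch dimension.
        h0 = self.h0.expand(-1, batch_size, -1).contiguous()
        c0 = self.c0.expand(-1, batch_size, -1).contiguous()
        return self.lstm(x, (h0, c0))


model = LSTMWithLearnedInit(input_size=100, hidden_size=128, num_layers=3)
out, (h_n, c_n) = model(torch.randn(8, 20, 100))
print(out.shape)  # torch.Size([8, 20, 128])
```

The learned states start at zero, so at step zero of training this behaves exactly like the default; the optimizer is then free to move them.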

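Skip Connections are a Lifesaver: The designers of residual networks (ResNets) were onto something glorious. That same idea applies here. If you’re going really deep (for an RNN, “really deep” means more than 4 layers), consider adding residual connections. They give the gradient a shortcut and often make deeper networks actually trainable. The version below is the simplest one: a single skip connection around the whole stack, with a linear projection when the input and hidden sizes differ. Skipping around each individual layer means stacking single-layer LSTMs yourself, because nn.LSTM doesn’t expose what happens between its internal layers.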

class ResidualLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.residual = nn.Linear(input_size, hidden_size) if input_size != hidden_size else None

    def forward(self, x):
        out, _ = self.lstm(x)
        # Add residual connection from input to output of the last layer
        residual = x if self.residual is None else self.residual(x)
        # We add the residual for each time step
        return out + residual
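For skip connections between layers rather than around the whole stack, one option is to build the stack from single-layer LSTMs and add each layer's input to its output. This is a sketch under assumptions: the class name LayerwiseResidualLSTM is made up, and projecting the input to hidden_size up front is just one way to make the shapes line up for the addition.

```python
import torch
import torch.nn as nn


class LayerwiseResidualLSTM(nn.Module):
    """Hypothetical variant: a residual connection around every layer."""

    def __init__(self, input_size, hidden_size, num_layers):
        super().__init__()
        # Project once so every layer's input and output share a width.
        self.input_proj = nn.Linear(input_size, hidden_size)
        self.layers = nn.ModuleList(
            [nn.LSTM(hidden_size, hidden_size, batch_first=True) for _ in range(num_layers)]
        )

    def forward(self, x):
        x = self.input_proj(x)  # (batch, seq, hidden_size)
        for layer in self.layers:
            out, _ = layer(x)
            x = x + out         # skip connection around this layer
        return x


model = LayerwiseResidualLSTM(input_size=100, hidden_size=128, num_layers=6)
y = model(torch.randn(4, 10, 100))
print(y.shape)  # torch.Size([4, 10, 128])
```

The trade-off is that you give up the fused multi-layer implementation inside nn.LSTM, so this version may run slower than a single module with num_layers=6.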

Bidirectional is a Force Multiplier: Sometimes, context from the future is just as important as context from the past. For tasks like sentiment analysis or named entity recognition, you want your network to be omniscient. This is where bidirectional RNNs come in. You run two separate RNNs, one on the forward sequence and one on the reversed sequence, and concatenate their outputs. It roughly doubles your parameter count and compute time, but the boost in performance is often worth it. The catch is that you need the whole sequence up front, so this doesn’t fit step-by-step generation or streaming. And yes, you can stack bidirectional layers too. It’s a deep, bidirectional RNN. The code change? Just bidirectional=True.

# A beast: A 3-layer, bidirectional LSTM
deep_bi_lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True, bidirectional=True)
# The output dimension will be hidden_size * 2 because of the concatenation
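The shapes are where bidirectional models tend to trip people up, so here is a small check using the same sizes as above. The variable sentence_vec is a hypothetical name for the usual "one vector per sequence" summary you would feed to a classifier.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bi_lstm = nn.LSTM(100, 128, num_layers=3, batch_first=True, bidirectional=True)
output, (h_n, c_n) = bi_lstm(torch.randn(32, 50, 100))

print(output.shape)  # torch.Size([32, 50, 256]) - forward and backward concatenated
print(h_n.shape)     # torch.Size([6, 32, 128])  - num_layers * 2 directions

# h_n is ordered layer by layer, forward then backward within each layer,
# so the top layer's two final states are the last two entries.
top_forward = h_n[-2]   # forward direction, after reading the last token
top_backward = h_n[-1]  # backward direction, after reading the first token
sentence_vec = torch.cat([top_forward, top_backward], dim=1)
print(sentence_vec.shape)  # torch.Size([32, 256])
```

Note that output[:, -1, :] is not the same thing as sentence_vec: its backward half has only seen the final token, while the backward direction's fully informed state lives at time step 0.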

Regularize Like Your Model Depends on It (It Does): Deep RNNs are massive over-parameterization machines begging to overfit. You need to fight back. Dropout is your primary weapon. But here’s the key insight: you don’t apply dropout within the inner gates of the LSTM—that usually cripples it. You apply dropout on the outputs between layers. In PyTorch, the dropout argument in the LSTM module does exactly this, applying dropout to the outputs of each layer except the last.
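As a sketch of how that fits together, the hypothetical classifier below uses the dropout argument for the between-layer dropout and adds a separate nn.Dropout on the top layer's output, since the built-in argument leaves the last layer alone. The class name, the rate of 0.3, and the classification head are assumptions for illustration. Keep in mind that the dropout argument does nothing useful with num_layers=1, because there is no "between layers" to apply it to.

```python
import torch
import torch.nn as nn


class DropoutLSTMClassifier(nn.Module):
    """Hypothetical sequence classifier with dropout between and after layers."""

    def __init__(self, input_size, hidden_size, num_layers, num_classes, p=0.3):
        super().__init__()
        # dropout=p is applied to the outputs of every layer except the last.
        self.lstm = nn.LSTM(
            input_size, hidden_size, num_layers, batch_first=True, dropout=p
        )
        # Extra dropout for the top layer's output, which the argument above skips.
        self.out_dropout = nn.Dropout(p)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        _, (h_n, _) = self.lstm(x)
        top = h_n[-1]  # final hidden state of the top layer: (batch, hidden_size)
        return self.fc(self.out_dropout(top))


model = DropoutLSTMClassifier(input_size=100, hidden_size=128, num_layers=3, num_classes=5)
x = torch.randn(8, 20, 100)

model.eval()  # dropout is only active in train() mode
with torch.no_grad():
    logits = model(x)
print(logits.shape)  # torch.Size([8, 5])
```

Forgetting model.eval() at inference time is the classic mistake here: predictions become noisy because dropout is still randomly zeroing activations.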

So, stack ’em up. But do it wisely. Start with 2 layers. See if 3 helps. If you go to 4, have a good reason and keep a close eye on your loss curve. You’re not just stacking cells; you’re building a hierarchy of temporal understanding.