18.4 Positional Encoding: Fixed and Learned
Right, so we’ve got these fancy word embeddings now. Your sequence of words is a tidy stack of vectors, each representing a word’s meaning in a high-dimensional space. Neat, but there’s a colossal problem: our model is, for all intents and purposes, a fancy bag-of-words. The words “dog bites man” and “man bites dog” have the exact same input representation. That’s a deal-breaker for understanding language, where order is, you know, the entire point.
The Transformer architecture, being brilliantly parallel, threw out recurrence and convolution. This gave it speed and scalability, but it also ripped out the innate sense of order those mechanisms provided. So, we have to manually inject the order back in. We do this by adding a positional encoding—a vector that uniquely represents each position in the sequence—directly to the word embedding at that position.
Think of it this way: your word embedding tells the model what the word is. The positional encoding tells it where the word is. The sum of these two vectors is what actually gets fed into the self-attention mechanism.
The Sinusoidal (Fixed) Encoding: The OG Solution
The original “Attention is All You Need” paper proposed a fixed, deterministic encoding using sine and cosine waves of different frequencies. This wasn’t an arbitrary choice; it was genius. The authors hypothesized that these sinusoidal patterns would allow the model to easily learn to attend to relative positions, since for any fixed offset k, the encoding at position pos + k can be represented as a linear function of the encoding at position pos.
The math looks a bit gnarly at first, but it’s actually quite elegant. For each dimension of the encoding vector, we use a different wavelength, forming a geometric progression from 2π to 10000 · 2π.
import torch
import torch.nn as nn
import math
def sinusoidal_positional_encoding(seq_len, d_model, device=None):
"""
Generates sinusoidal positional encodings.
Args:
seq_len: Length of the input sequence.
d_model: Dimensionality of the model embeddings.
device: The device to create the tensor on.
Returns:
A tensor of shape (1, seq_len, d_model) containing the positional encodings.
"""
if d_model % 2 != 0:
raise ValueError("Cannot use sin/cos positional encoding with "
"odd dim (got dim={:d})".format(d_model))
pe = torch.zeros(seq_len, d_model, device=device)
position = torch.arange(0, seq_len, dtype=torch.float, device=device).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float, device=device) *
-(math.log(10000.0) / d_model))
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
pe = pe.unsqueeze(0) # Add batch dimension
return pe
# Example: Generate encodings for a sequence of length 10 with a model dimension of 512
seq_len = 10
d_model = 512
pe = sinusoidal_positional_encoding(seq_len, d_model)
print(pe.shape) # torch.Size([1, 10, 512])
The beauty of this fixed approach is that it’s deterministic and doesn’t require any learned parameters. It generalizes to sequence lengths longer than those seen during training, though its performance might understandably degrade a bit. It’s the elegant, mathematically-grounded solution.
Learned Positional Embeddings: The Pragmatic Choice
Meanwhile, over in the practical world, many implementors (including the authors of the original Transformer codebase for machine translation) said, “Sinusoids? That’s cute.” and just used a second embedding layer. This is exactly what it sounds like: you have a nn.Embedding layer that learns a unique vector for every single position up to some maximum length.
class TransformerModelWithLearnedPE(nn.Module):
def __init__(self, vocab_size, d_model, max_seq_len, nhead, num_layers):
super().__init__()
self.d_model = d_model
self.word_embedding = nn.Embedding(vocab_size, d_model)
self.pos_embedding = nn.Embedding(max_seq_len, d_model) # The learned PE layer
self.transformer = nn.TransformerEncoder(
nn.TransformerEncoderLayer(d_model, nhead),
num_layers
)
def forward(self, src, src_mask=None):
# src: (Batch, Seq_Len)
seq_len = src.size(1)
positions = torch.arange(seq_len, device=src.device).expand(src.size(0), seq_len) # Create position indices
# Embed words and positions, then sum them
word_embeds = self.word_embedding(src) * math.sqrt(self.d_model)
pos_embeds = self.pos_embedding(positions)
x = word_embeds + pos_embeds
x = self.transformer(x, src_mask)
return x
# Example usage
model = TransformerModelWithLearnedPE(vocab_size=10000, d_model=512, max_seq_len=512, nhead=8, num_layers=6)
src = torch.randint(0, 10000, (32, 50)) # Batch of 32, sequence length of 50
output = model(src)
It’s brutally simple. The obvious downside is that it’s rigidly fixed to max_seq_len; your model will have no idea what to do with a position it hasn’t seen during training. The upside? It’s learned. It can potentially discover positional representations that are more optimal for your specific task and dataset than a generic sine wave. It’s the “why overthink it?” solution.
Which One Should You Actually Use?
Here’s the trench wisdom you paid for: it often doesn’t matter nearly as much as you think.
- Fixed (Sinusoidal): The safe, theoretically-sound choice. Use it if you’re replicating a paper, if you’re worried about generalizing to longer sequences, or if you just want one less thing to worry about during training. It’s the default for a reason.
- Learned: The pragmatic choice. It often performs slightly better in practice, especially when your training data has plenty of examples of all sequence lengths up to your maximum. The downside is the hard limit on sequence length.
The real best practice is to just pick one and move on. Your model’s performance will be dominated by the quality of your data, your overall architecture, and your hyperparameter tuning, not by this choice. I’ve seen teams waste weeks A/B testing this when they should have been cleaning their training data. Don’t be that team.
The one non-negotiable rule is that you must add the positional encoding to the word embedding, not concatenate it. The whole point of the additive operation is that it creates a new vector that contains information from both sources, which is exactly what the subsequent self-attention layers are designed to work with. Concatenation would just create a wider, disjointed input and would likely harm performance.