18.8 GPT: Autoregressive Decoder-Only Pre-Training

Right, so you’ve heard the hype. “GPT changed everything!” It did, but not by inventing some alien technology. It took the core Transformer block we just talked about and made one brutally simple, wildly effective architectural choice: it threw away the encoder. That’s it. That’s the big secret. All those GPT models—GPT-2, GPT-3, the one you’re probably using to get summaries of this book—are just a stack of Transformer decoder blocks, with one small but critical tweak.

Think about the original decoder. It had this “encoder-decoder cross-attention” layer to look at the input sentence. But what if your entire task is to generate text from a prompt? You don’t have a separate input; it’s all one continuous stream. GPT said, “Fine, we’ll do it live.” It removes the cross-attention layer entirely. So what you’re left with is a stack of identical layers, each containing:

A masked self-attention mechanism (so the model can only look at previous words).
A feed-forward neural network.
Residual connections and layer normalization around each of those.

It’s beautifully, almost stupidly, parsimonious. This architecture is often called “decoder-only,” which is a bit of a misnomer because it’s not decoding anything from an encoder. It’s more accurate to think of it as a “generative Transformer stack.” Its entire existence is to take a sequence of tokens and predict the next most plausible token, over and over, until it tells you how to bake a cake or explains quantum mechanics in a pirate accent.
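To make the three-item recipe above concrete, here is a minimal sketch of one such layer in PyTorch. The class name DecoderOnlyBlock and the sizes (d_model=64, n_heads=4, d_ff=256) are my own illustrative choices, not anything from a released GPT. The layer norm is placed after each residual connection, as in the original Transformer; GPT-2 and later moved it to the input of each sub-layer, so treat the ordering as one option among several.

```python
import torch
import torch.nn as nn


class DecoderOnlyBlock(nn.Module):
    """One GPT-style layer: masked self-attention + feed-forward, no cross-attention."""

    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        # Boolean mask: True marks positions that may NOT be attended to (the future).
        future = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        attn_out, _ = self.attn(x, x, x, attn_mask=future, need_weights=False)
        x = self.norm1(x + attn_out)    # residual + layer norm around attention
        x = self.norm2(x + self.ff(x))  # residual + layer norm around feed-forward
        return x


block = DecoderOnlyBlock().eval()
x = torch.randn(2, 5, 64)
y = block(x)
print(y.shape)  # torch.Size([2, 5, 64])
```

A quick sanity check on any block like this: perturb the last token of the input and confirm that the outputs at all earlier positions do not change. If they do, the mask is leaking.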

The Mechanics of Autoregression

This is the heart of the thing. “Autoregressive” sounds fancy, but it just means the model uses its own previous outputs as the input for generating the next step. You give it a prompt (“The meaning of life is”), it generates the first token (“to”), uses “The meaning of life is to” as the new input, generates the next token (“seek”), and so on, until it hits an end-of-sequence token or a length limit.
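The loop described above can be sketched in a few lines. Everything here is illustrative: toy_logits is a hypothetical stand-in for a trained model (it always favours "previous token plus one"), and generate_greedy is my own helper name. It picks the single most likely token at each step; real systems usually sample from the predicted distribution instead.

```python
import torch


def toy_logits(tokens, vocab_size=6):
    """Hypothetical stand-in for a trained GPT: returns (seq_len, vocab_size) logits.

    It always favours (token + 1) % vocab_size, so the output is easy to check.
    """
    nxt = (tokens + 1) % vocab_size
    return torch.nn.functional.one_hot(nxt, vocab_size).float()


def generate_greedy(logits_fn, prompt, eos_id, max_new_tokens=20):
    tokens = prompt.clone()
    for _ in range(max_new_tokens):
        logits = logits_fn(tokens)     # re-run on everything generated so far
        next_id = logits[-1].argmax()  # only the last position predicts the new token
        tokens = torch.cat([tokens, next_id.view(1)])
        if next_id.item() == eos_id:   # stop at end-of-sequence
            break
    return tokens


out = generate_greedy(toy_logits, torch.tensor([2, 3]), eos_id=0)
print(out.tolist())  # [2, 3, 4, 5, 0]
```

Note the max_new_tokens cap: a model is never guaranteed to emit the end-of-sequence token, so practical decoders always carry a length limit as well.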

This is why that “masked” part of self-attention is non-negotiable. During training, if we let the model peek at the token it’s supposed to be predicting, it would be the world’s easiest multiple-choice test. It would cheat instantly. The mask ensures that for each position i, the attention mechanism can only see positions j <= i. In practice, this is done by adding a mask matrix to the attention scores before the softmax: 0 where attention is allowed, -inf at every future position, so those positions end up with exactly zero weight.

Here’s a terribly inefficient but illustrative way to implement the causal mask:

import torch
import torch.nn as nn

def causal_mask(size):
    """Additive causal mask: -inf above the diagonal, 0 on and below it."""
    # torch.full avoids building a ones matrix just to scale it by -inf.
    return torch.triu(torch.full((size, size), float('-inf')), diagonal=1)

# Let's say we have a sequence of length 4
seq_len = 4
mask = causal_mask(seq_len)
print(mask)
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])

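In reality, you’d rarely build this by hand. You can pass the mask to torch.nn.MultiheadAttention as attn_mask, or call torch.nn.functional.scaled_dot_product_attention with is_causal=True (PyTorch 2.0 and later). Note that nn.TransformerDecoder is not the natural fit here: its layers still include the cross-attention sub-layer and expect an encoder memory, and you still have to pass the causal mask yourself as tgt_mask. A GPT-style stack is closer to nn.TransformerEncoder run with a causal mask.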
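To see the mask actually doing its job, here is a small sketch that adds it to some random attention scores, applies the softmax, and compares the result with PyTorch's built-in causal attention. The tensor sizes and the single batch and head dimension are arbitrary choices for illustration, and the comparison assumes PyTorch 2.0 or later.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_k = 4, 8
# Shape (batch=1, heads=1, seq_len, d_k); random values stand in for real projections.
q = torch.randn(1, 1, seq_len, d_k)
k = torch.randn(1, 1, seq_len, d_k)
v = torch.randn(1, 1, seq_len, d_k)

# Additive mask: -inf above the diagonal, 0 elsewhere.
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # raw attention scores
weights = torch.softmax(scores + mask, dim=-1)     # future positions get weight 0
manual = weights @ v

print(weights[0, 0])  # lower-triangular; each row sums to 1

# The fused kernel builds the same causal mask internally.
fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(torch.allclose(manual, fused, atol=1e-5))
```

The first row of the weight matrix is the degenerate case: position 0 can only attend to itself, so its weight there is 1.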

Pre-training: The Hunger Games for GPUs

This is where the magic—and the immense cost—comes in. Pre-training doesn’t mean teaching GPT about “facts.” It means teaching it about probability. We give it a colossal dataset of text (the entire internet, more or less) and ask it to perform one task, trillions of times: “Given these n tokens, what is the most likely next token?”

This task, called Language Modeling, is deceptively powerful. To get really, really good at it, the model can’t just memorize phrases. It has to internalize grammar, syntax, some level of reasoning, style, and a shadowy semblance of common sense—all just as statistical byproducts of minimizing the negative log likelihood of the next word.
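Written out, the objective is short. The notation here is my own, not something fixed earlier in the chapter: x_1, ..., x_T is a token sequence, x_{<t} means all tokens before position t, and theta is the model's parameters. The model factorizes the probability of the whole sequence into next-token predictions, and the loss is the average negative log of those predictions:

```latex
p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}),
\qquad
\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})
```

Minimizing this loss is the same as maximizing the probability the model assigns to the training text, one token at a time. In code it is just the cross-entropy between the predicted distribution and the token that actually came next.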

The best practice here is to use a tool like the Hugging Face transformers library, which has done the heroic work of wrapping this immense complexity. But to understand what’s happening under the hood, here’s the core training loop concept:

# Pseudo-code for the core pre-training step.
# gpt_model, optimizer, and massive_text_dataset are placeholders defined elsewhere.
for batch in massive_text_dataset:  # batch: (batch_size, seq_len + 1) token ids
    # Inputs and targets are the same tokens, offset by one position
    inputs = batch[:, :-1]   # All tokens except the last
    targets = batch[:, 1:]   # All tokens except the first

    # Forward pass through the model
    logits = gpt_model(inputs)  # Shape: (batch_size, seq_len, vocab_size)

    # Loss: cross-entropy between the prediction at each position and the
    # token that actually comes next, averaged over all positions
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (batch_size * seq_len, vocab_size)
        targets.reshape(-1)  # reshape, not view: the slice is non-contiguous
    )

    # Backward pass and optimizer step
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
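For a version you can actually run, here is the same step on a toy scale. Everything in it is an assumption made for illustration: toy_model is a hypothetical stand-in (an embedding followed by a linear layer, so it only ever looks at the current token), the "dataset" is a single batch of random token ids, and the sizes and learning rate are arbitrary. It shows the mechanics of the loop and nothing about real pre-training behaviour.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, d_model = 50, 32

# Hypothetical stand-in for gpt_model: embedding -> linear head (a bigram model).
toy_model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, vocab_size),
)
optimizer = torch.optim.AdamW(toy_model.parameters(), lr=1e-2)

# One fixed batch of random token ids, shape (batch_size, seq_len + 1).
batch = torch.randint(0, vocab_size, (8, 17))
inputs, targets = batch[:, :-1], batch[:, 1:]

losses = []
for step in range(50):
    logits = toy_model(inputs)           # (8, 16, vocab_size)
    loss = F.cross_entropy(
        logits.reshape(-1, vocab_size),  # (8 * 16, vocab_size)
        targets.reshape(-1),             # reshape: the slice is non-contiguous
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

The first loss should land in the neighbourhood of ln(50), roughly 3.9, which is what a model guessing uniformly over 50 tokens would score. It then drops as the toy model memorizes the one batch it keeps seeing, which is exactly the overfitting a colossal dataset is meant to prevent.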

The Good, The Bad, and The Hallucinatory

This architecture has strengths and glaring weaknesses. Its biggest strength is its coherence and generativity. Because it’s trained to always produce what comes next, it’s fantastic at creating long-form, stylistically consistent text.

Its weakness is its rigidity. Generation only moves forward: once a token is sampled, it becomes fixed context for everything after it, and the model can’t “go back” and revise its work. This is why it sometimes gets stuck in loops or starts confidently contradicting itself; it’s committed to the path it’s on. “Hallucination” has a related root in the training objective: the model generates what is statistically plausible, not what is factually correct. It’s a bullshitter, not a liar. There’s a difference.

Another pitfall is its computational inefficiency during generation. Training and prompt processing run over all positions in parallel, much as an encoder model like BERT processes its input, but each new token depends on the one before it, so GPT must generate one token at a time, sequentially. This autoregressive decoding is a primary bottleneck for latency in large models.

The designers made a questionable choice, in my opinion, by sticking solely with this next-token prediction objective. It’s powerful, but it inherently prioritizes plausibility over truthfulness. Other models (like T5, with its span-corruption objective) explored alternative pre-training tasks, but none matched the raw generative fluency of the GPT approach. They bet everything on scale, and frankly, they won. For now.