20.1 What Makes an LLM: Scale, Data, and Compute
Alright, let’s cut through the marketing fluff. When someone says “Large Language Model,” they’re really talking about a perfect storm of three things: Scale, Data, and Compute. Miss one leg of this tripod, and your fancy AI collapses into a pile of overhyped matrix multiplication. It’s not magic; it’s a brutally expensive engineering experiment that, against all odds, actually worked.
Think of it like this: you’re trying to build a perfect model of the world, but all you have to work with is the text humans have written down. The only way to do that is to find statistical patterns so deep and so nuanced that they approximate understanding. To find those patterns, you need an absurdly large network (scale), an ungodly amount of text for it to learn from (data), and a small fortune to pay for the electricity to make it all happen (compute).
The Unholy Trinity: Parameters, Tokens, and FLOPs
Let’s get specific. “Scale” primarily refers to the number of parameters in the model. A parameter is a weight or a bias the model learns during training. It’s a knob that gets tuned. You have billions of them. Don’t think of them as storing facts; think of them as storing relationships between concepts. More parameters mean a higher capacity to learn more subtle and complex relationships. Going from millions to billions of parameters is what unlocked the ability to hold a coherent conversation for more than two sentences.
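If "a weight or a bias" feels abstract, counting them by hand helps. Here is a minimal sketch that tallies the parameters in a tiny stack of fully connected layers. The layer sizes and the helper names (`dense_layer_params`, `count_params`) are made up for illustration; a real LLM has attention blocks, embeddings, and more, but the bookkeeping is the same idea.

```python
# Count the learnable parameters in a stack of fully connected layers.
# The layer sizes below are hypothetical, chosen only for illustration.

def dense_layer_params(n_inputs: int, n_outputs: int) -> int:
    """One weight per input-output pair, plus one bias per output."""
    weights = n_inputs * n_outputs
    biases = n_outputs
    return weights + biases

def count_params(layer_sizes: list[int]) -> int:
    """Sum parameters over each consecutive pair of layer widths."""
    return sum(
        dense_layer_params(n_in, n_out)
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )

toy_sizes = [512, 2048, 512]  # hypothetical: input, hidden, output widths
total = count_params(toy_sizes)
print(f"{total:,} parameters")  # 2,099,712 parameters
```

Two small layers already cost about two million knobs. Stack wider layers dozens of times and the billions arrive quickly.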
The “Data” is measured in tokens. A token is roughly a word or a sub-word piece. We use sub-word tokenization because, frankly, our vocabularies are gigantic and messy. This process handles out-of-vocabulary words by breaking them into known parts. Let’s see how a common tokenizer, like OpenAI’s tiktoken, works on some text.
import tiktoken
# Let's use the tokenizer for GPT-4
encoder = tiktoken.encoding_for_model("gpt-4")
text = "Let's tokenize this! What do you think, fiancé?"
tokens = encoder.encode(text)
print("Tokens:", tokens)
print("Token pieces:", [encoder.decode([token]) for token in tokens])
print("Total tokens:", len(tokens))
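This might output something like the following (the IDs and splits shown are illustrative, not a verified run, and the ID list is truncated):
Tokens: [1335, 62, 11927, 536, 1010, 703, 527, 1114, 101, 24426, 10141, 30, ...]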
Token pieces: ['Let', "'", 's', ' token', 'ize', ' this', '!', ' What', ' do', ' you', ' think', ',', ' f', 'ian', 'cé', '?']
Total tokens: 16
See how it chopped up “fiancé” into ' f', 'ian', 'cé'? That is the whole point of sub-word tokenization: the model can handle rare words and names it hasn’t seen before by composing them from known sub-word units. A common pitfall is assuming one word equals one token. It often doesn’t, especially for non-English text, which can drastically inflate token counts and costs.
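To see the "compose from known parts" idea without installing anything, here is a toy splitter. Fair warning: this is a simplified greedy longest-match scheme, not the byte-pair encoding that tiktoken actually uses, and both the vocabulary and the function name `greedy_subword_split` are invented for this sketch.

```python
# Toy sub-word splitter: greedy longest-match against a known vocabulary.
# This is NOT tiktoken's BPE algorithm; it only illustrates the idea of
# building an unseen word out of known pieces.

def greedy_subword_split(word: str, vocab: set[str]) -> list[str]:
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known piece starts here: fall back to a single character.
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces

toy_vocab = {"token", "ize", "un", "believ", "able", "s"}  # hypothetical
print(greedy_subword_split("tokenizes", toy_vocab))     # ['token', 'ize', 's']
print(greedy_subword_split("unbelievable", toy_vocab))  # ['un', 'believ', 'able']
```

Neither full word is in the vocabulary, yet both come out as short sequences of known pieces. The single-character fallback plays the role that byte-level fallback plays in real tokenizers: nothing is ever truly out of vocabulary, it just gets expensive.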
“Compute” is the dirty secret, measured in FLOPs (floating-point operations). Training a modern LLM requires on the order of 10^23 to 10^25 FLOPs. Yes, those numbers are real. To put that in perspective, a single high-end GPU sustaining on the order of 10^14 floating-point operations per second would need anywhere from decades to thousands of years to get through that much work. This is why we use massive clusters of thousands of GPUs running in parallel for months. The compute cost is the primary reason there are only a handful of organizations on the planet that can play this game. Brute force is a questionable way to get here, and a damn expensive one, but hey, that’s the path we’re on.
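You can sanity-check those numbers on the back of an envelope. The sketch below leans on two assumptions that are not from this chapter: the widely used approximation that training costs about 6 FLOPs per parameter per token, and a GPU that sustains 10^14 floating-point operations per second. The model size and token count are hypothetical inputs.

```python
# Back-of-envelope training compute, under two stated assumptions:
#   1. total FLOPs ~= 6 * parameters * tokens (a common approximation)
#   2. one GPU sustains 1e14 FLOP/s (assumed, not a measured figure)

SECONDS_PER_YEAR = 365 * 24 * 3600

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute as 6 * N * D."""
    return 6.0 * n_params * n_tokens

def single_gpu_years(total_flops: float, sustained_flops_per_sec: float) -> float:
    """How long one GPU would need at the given sustained throughput."""
    return total_flops / sustained_flops_per_sec / SECONDS_PER_YEAR

# Hypothetical run: 70 billion parameters, 1.4 trillion tokens.
flops = training_flops(70e9, 1.4e12)
years = single_gpu_years(flops, 1e14)
print(f"{flops:.2e} FLOPs, roughly {years:.0f} years on one GPU")
```

Under these assumptions the run lands near 6 x 10^23 FLOPs and a couple of centuries on a single card. Spread that across a few thousand GPUs and you get the "months" figure, which is why nobody does this on a workstation.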
The Scaling Laws: More is More (Until It Isn’t)
This isn’t just “throw more money at the problem.” The Chinchilla scaling laws (from DeepMind’s seminal 2022 paper) gave us a recipe. They showed that, for a fixed compute budget, the model size (parameters) and the dataset size (tokens) should scale together. The commonly cited rule of thumb is roughly 20 tokens for every parameter.
So, for a 70 billion parameter model? You need about 1.4 trillion tokens of high-quality text. This was a huge insight. Before this, people were training massive models on too little data. The law tells us that, for the same compute, a smaller model trained on more data can often outperform a larger, data-starved one. The common pitfall here is skimping on data quality. Scraping the dregs of the internet will get you a model that excels at generating toxic Reddit rants and SEO spam. To be fair, it will be very good at that, but it’s probably not what you want.
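The 20-to-1 rule of thumb is simple enough to turn into a calculator. This is only the ratio stated above applied mechanically; the function name `compute_optimal_tokens` and the list of model sizes are illustrative choices, and the actual fitted scaling curves are more nuanced than a single constant.

```python
# Apply the roughly-20-tokens-per-parameter rule of thumb from the text.
# The model sizes in the loop are hypothetical examples.

TOKENS_PER_PARAM = 20

def compute_optimal_tokens(n_params: int) -> int:
    """Training tokens suggested by the 20:1 rule of thumb."""
    return TOKENS_PER_PARAM * n_params

for n_params in (1_000_000_000, 7_000_000_000, 70_000_000_000):
    n_tokens = compute_optimal_tokens(n_params)
    print(f"{n_params / 1e9:>4.0f}B params -> {n_tokens / 1e12:.2f}T tokens")
```

The last row reproduces the 70 billion parameter, 1.4 trillion token pairing from the paragraph above.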
The Training Slog: It’s Just Guess-the-Next-Word
Never forget what an LLM is fundamentally doing: predicting the most plausible next token in a sequence. That’s it. All its apparent intelligence, creativity, and reasoning emerge from this simple objective. The training process is a monumentally scaled-up version of this:
- Feed it a sequence of tokens (e.g., “The cat sat on the”).
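- Let it predict the next token (it assigns a probability to every token in its vocabulary; say its top guess is “mat”).
- Compare its prediction to the actual next token in your data (which was “rug”).
- Use the loss (a measure of how little probability it gave the actual token) to nudge all those billions of parameters, via backpropagation and gradient descent, toward a slightly better guess next time.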
- Repeat this a few trillion times.
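The loop above fits in a few dozen lines if you shrink everything to toy size. In this sketch the "model" is just a table of scores indexed by the previous token (a bigram table), which is a stand-in I am assuming for illustration and nothing like a real transformer. The vocabulary, learning rate, and epoch count are arbitrary. What carries over is the shape of the loop: predict, measure cross-entropy loss, nudge, repeat.

```python
import math

# Toy next-token trainer. The "model" is a bigram table of logits:
# one row of scores per previous token. Everything here is illustrative.

vocab = ["the", "cat", "sat", "on", "rug", "mat"]
token_id = {tok: i for i, tok in enumerate(vocab)}
logits = [[0.0] * len(vocab) for _ in vocab]  # the "parameters"

def softmax(row):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def train_step(prev_tok, next_tok, lr=0.5):
    """One predict / compare / nudge cycle. Returns the loss before the update."""
    row = logits[token_id[prev_tok]]
    probs = softmax(row)
    target = token_id[next_tok]
    loss = -math.log(probs[target])  # cross-entropy for the actual token
    # Gradient of cross-entropy w.r.t. the logits is probs - one_hot(target).
    for j in range(len(row)):
        grad = probs[j] - (1.0 if j == target else 0.0)
        row[j] -= lr * grad
    return loss

text = "the cat sat on the rug".split()
pairs = list(zip(text, text[1:]))  # (previous token, actual next token)

losses = []
for epoch in range(50):
    epoch_loss = sum(train_step(p, n) for p, n in pairs) / len(pairs)
    losses.append(epoch_loss)

print(f"first epoch loss: {losses[0]:.3f}, last epoch loss: {losses[-1]:.3f}")
```

The loss falls but never reaches zero, because “the” is followed by both “cat” and “rug” in the data and a table keyed on one previous token cannot tell those cases apart. That is a small preview of why longer context and more parameters matter.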
Calling clean data a “best practice” is an understatement: your data must be impeccably cleaned and curated. Garbage in, gospel out. And this is no edge case; it is the default behavior. The model will faithfully learn all the biases, inconsistencies, and errors present in your data. There’s no magic filter. It’s a mirror, and sometimes the reflection is ugly.
So, that’s the deal. It’s not alchemy. It’s a testament to what happens when you combine a simple, scalable idea with enough data, enough parameters, and enough raw computing power to make a dent in a power grid. It’s absurd, it’s wildly expensive, and somehow, it works.