20.7 Open-Source LLMs: LLaMA, Mistral, Gemma, Phi, Qwen

Right, let’s talk about the open-source revolution. Because let’s be honest, the big, proprietary models from OpenAI and Google are impressive, but they’re also black boxes. You can’t see the gears turning, you can’t fine-tune them on your own secret data without paying an arm and a leg, and you certainly can’t run them on your own hardware without a corporate-sized trust fund. That’s where this motley crew of open-source models comes in. They’re the rebels, the tinkerer’s paradise, and frankly, the reason this field is moving at lightspeed. We’re not just users here; we’re mechanics.

20.6 Emergent Capabilities: In-Context Learning, Chain-of-Thought

Right, so you’ve heard the hype: LLMs are “magical” and “emergent.” Let’s cut through that. They’re not magical, but what they do is often emergent, meaning it’s a capability that wasn’t explicitly programmed but arises from the sheer scale of the model and its training. It’s the difference between teaching a kid arithmetic by rote memorization (boring) and watching them suddenly figure out how to reason through a word problem (wild). The two biggest party tricks in this category are In-Context Learning (ICL) and Chain-of-Thought (CoT) reasoning. They’re the reason these models feel so spookily intelligent instead of just being fancy autocomplete.
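
To make the two tricks concrete, here is a minimal sketch of how a few-shot prompt might be assembled as plain text, with an optional chain-of-thought variant. The function name `build_few_shot_prompt`, the `Q:`/`A:` layout, and the example problems are illustrative assumptions, not a format any particular model requires.

```python
def build_few_shot_prompt(examples, question, chain_of_thought=False):
    """Assemble a few-shot prompt as a single string.

    examples: list of (question, reasoning, answer) tuples (hypothetical format).
    Nothing here touches model weights; the "learning" lives entirely in the prompt.
    """
    blocks = []
    for q, reasoning, answer in examples:
        if chain_of_thought:
            # CoT: show the intermediate steps before the final answer.
            blocks.append(f"Q: {q}\nA: {reasoning} The answer is {answer}.")
        else:
            # Plain ICL: show only input/output pairs.
            blocks.append(f"Q: {q}\nA: {answer}")
    # The new question ends with an open "A:" for the model to complete.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


examples = [
    ("Tom has 3 apples and buys 2 more. How many apples does he have?",
     "He starts with 3 and adds 2, so 3 + 2 = 5.",
     "5"),
]
prompt = build_few_shot_prompt(
    examples,
    "A train has 4 cars with 10 seats each. How many seats are there?",
    chain_of_thought=True,
)
print(prompt)
```

The only difference between the two modes is whether the worked reasoning appears in the demonstrations. The resulting string would be sent to whatever model you are using.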

20.5 Mixture of Experts (MoE): Scaling Without Proportional Compute Cost

Right, so you’ve built a colossal dense transformer model. It’s a beast. 175 billion parameters. The problem? Every single time you want to generate a single, lousy token, you have to fire up every one of those 175 billion parameters. It’s like calling in a full-scale military operation to swat a fly. The compute cost is astronomical, and the latency is… well, let’s just say you have time to brew a coffee. Maybe two.
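
A rough back-of-the-envelope calculation shows the scale of the problem. It leans on the common approximation that a transformer's forward pass costs about 2 floating-point operations per active parameter per token. That approximation, the function name, and the 25% active fraction used for comparison are assumptions for illustration, not measurements of any real model.

```python
def forward_flops_per_token(n_params, active_fraction=1.0):
    # Rule of thumb: one multiply and one add per active parameter per token.
    return 2 * n_params * active_fraction


n_params = 175e9

# Dense model: every parameter participates in every token.
dense = forward_flops_per_token(n_params)

# Hypothetical sparse model: only a quarter of the parameters are active per token.
sparse = forward_flops_per_token(n_params, active_fraction=0.25)

print(f"dense:  {dense:.2e} FLOPs per token")
print(f"sparse: {sparse:.2e} FLOPs per token")
```

Under these assumptions the dense model spends about 3.5e11 FLOPs on each generated token, while the total parameter count stays the same in both cases.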

20.4 Context Window: KV Cache, RoPE Embeddings, and Long Context

Alright, let’s talk about the single biggest constraint you’ll wrestle with when building with LLMs: the context window. Think of it as the model’s working memory. It’s the total number of tokens—that’s your input and the generated output combined—that the model can “see” at any one time. Early models had the attention span of a goldfish in a caffeine lab; we’re talking a paltry 2048 tokens. Now, we’re seeing models that can process entire books, technical manuals, or, let’s be honest, shockingly long rants. This expansion isn’t magic; it’s a series of clever, sometimes hacky, engineering triumphs. Let’s break them down.
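
Because input and output share the same window, a quick budget check is worth doing before every generation call. The sketch below assumes you already have a token count for the prompt (which in practice comes from the model's own tokenizer). The function name and the numbers are hypothetical.

```python
def max_new_tokens_allowed(prompt_tokens, context_window, requested_new_tokens):
    """Return how many tokens can be generated without overflowing the window."""
    if prompt_tokens >= context_window:
        raise ValueError("prompt alone fills or exceeds the context window")
    # Whatever the prompt does not use is left over for the output.
    remaining = context_window - prompt_tokens
    return min(requested_new_tokens, remaining)


# A 1800-token prompt in a 2048-token window leaves room for only 248 new tokens.
print(max_new_tokens_allowed(1800, 2048, 512))  # 248
# A shorter prompt leaves enough room for the full request.
print(max_new_tokens_allowed(500, 2048, 512))   # 512
```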

20.3 Decoder-Only Architecture: Why GPT-Style Dominates

Alright, let’s talk about why the world seems to run on GPT-style models. You’ve heard of them: GPT-3, Jurassic-1, BLOOM, LLaMA. They’re the celebrities of the AI world. But why did this particular architecture, the “decoder-only” transformer, absolutely dominate the scene? It wasn’t an accident. It was a brutally pragmatic bet on scale, and it paid off in a way that left other, more elegant architectures in the dust.

Think of the original Transformer from the famous 2017 paper, “Attention Is All You Need,” as a balanced, well-rounded athlete. It had an encoder (to read and understand input) and a decoder (to generate output). This was perfect for translation, where you need to deeply comprehend a sentence before you start writing its new version. But then we all got a bit obsessed with just generating stuff: stories, code, excuses for missing a deadline. For that, you don’t need a separate understanding phase; understanding and generation become the same dance. The decoder is already a phenomenal generator. So we asked: what if we just used the decoder part, gave it a truly absurd amount of data, and saw what happened?

20.2 Scaling Laws: Compute-Optimal Training (Chinchilla)

Alright, let’s talk about Chinchilla. You’ve probably heard the mantra: bigger models are better. More parameters, more smarts. It’s a seductive idea, and for a while, we all just kinda ran with it. We were building ever-larger monuments of parameters, throwing ungodly amounts of compute at them, and feeding them whatever data we had lying around. It was the era of “just scale it up, it’ll probably work.” Then a bunch of very smart people from DeepMind asked a disarmingly simple question: “Are we being profoundly wasteful?” The answer, detailed in their 2022 paper “Training Compute-Optimal Large Language Models,” was a resounding yes. We were. Chinchilla is the model that resulted from this question, and its real legacy isn’t the model itself; it’s the scaling law it backed with evidence: for a fixed compute budget, most large models of the time were too big and trained on too little data. It showed us we’d been driving a Formula 1 car with the parking brake on.
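
To get a feel for what “compute-optimal” means in numbers, here is a rough sketch built on two rules of thumb often quoted alongside the paper: training compute is approximately 6 × parameters × tokens, and the compute-optimal token count is roughly 20 tokens per parameter. Both are approximations rather than exact laws, and the function name is hypothetical.

```python
import math


def compute_optimal_split(compute_flops, tokens_per_param=20.0):
    """Split a training compute budget into a model size and a token count.

    Assumes C ~ 6 * N * D and D ~ tokens_per_param * N,
    which gives N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# A budget of about 5.88e23 FLOPs lands near Chinchilla's own configuration:
# roughly 70 billion parameters trained on roughly 1.4 trillion tokens.
n, d = compute_optimal_split(5.88e23)
print(f"params: {n:.2e}, tokens: {d:.2e}")
```

Under these assumptions, model size and data both grow with the square root of the compute budget, so quadrupling compute roughly doubles each.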

20.1 What Makes an LLM: Scale, Data, and Compute

Alright, let’s cut through the marketing fluff. When someone says “Large Language Model,” they’re really talking about a perfect storm of three things: Scale, Data, and Compute. Miss one leg of this tripod, and your fancy AI collapses into a pile of overhyped matrix multiplication. It’s not magic; it’s a brutally expensive engineering experiment that, against all odds, actually worked. Think of it like this: you’re trying to build a perfect model of the world, but all you have to work with is the text humans have written down. The only way to do that is to find statistical patterns so deep and so nuanced that they approximate understanding. To find those patterns, you need an absurdly large network (scale), an ungodly amount of text for it to learn from (data), and a small fortune to pay for the electricity to make it all happen (compute).


...