20.6 Emergent Capabilities: In-Context Learning, Chain-of-Thought

Right, so you’ve heard the hype: LLMs are “magical” and “emergent.” Let’s cut through that. They’re not magical, but a lot of what they do is emergent: capabilities that nobody explicitly programmed, and that show up as a by-product of the model’s scale and training. It’s the difference between teaching a kid arithmetic by rote memorization (boring) and watching them suddenly figure out how to reason through a word problem (wild). The two biggest party tricks in this category are In-Context Learning (ICL) and Chain-of-Thought (CoT) reasoning. They’re the reason these models feel so spookily intelligent instead of just being fancy autocomplete.

The Party Trick: In-Context Learning

Forget fine-tuning for a second. ICL is the model’s ability to pick up a task on the fly from just a few examples you provide in the prompt. No weights change; you’re conditioning the model’s next-token probabilities by giving it a pattern to continue. It works because during training the model saw a gazillion documents with patterns in them (lists, question-answer pairs, code followed by its explanation) and learned to identify and continue those patterns.

Think of it like giving a brilliant but very literal-minded assistant a style guide. You don’t say “be helpful”; you show it.

# A zero-shot prompt: often fine, but the output format is left to chance:
prompt_bad = """
Translate this to French:
Hello, how are you?
"""

# A few-shot prompt using ICL: the examples pin down the format:
prompt_good = """
Translate the following English sentences to French.

English: Hello, how are you?
French: Bonjour, comment ça va ?

English: I would like a coffee.
French: Je voudrais un café.

English: Where is the library?
French: 
"""

The first prompt will often work on a capable model, but it’s weak: nothing tells the model whether to reply with the bare translation, a full sentence, or a paragraph of commentary. The second prompt primes the model. You’ve shown it the exact format (the “English: … French: …” structure), so the most likely continuation is a French translation and nothing else. This is what people mean by “few-shot learning” (one example is one-shot, none is zero-shot). You’re not retraining the model; you’re just being very, very clear about what you want the next output to look like.
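Writing few-shot prompts by hand gets tedious once the examples change often. Here is a minimal sketch of assembling the same “English: … French: …” prompt from a list of pairs. The function name `build_few_shot_prompt` and its parameters are made up for illustration; no particular library is assumed.

```python
def build_few_shot_prompt(instruction, examples, query,
                          source_label="English", target_label="French"):
    """Assemble an instruction, worked examples, and an open final slot.

    Hypothetical helper for illustration; not part of any library.
    """
    lines = [instruction, ""]
    for source, target in examples:
        lines.append(f"{source_label}: {source}")
        lines.append(f"{target_label}: {target}")
        lines.append("")  # blank line between examples
    # The last entry is left unanswered so the model completes the pattern.
    lines.append(f"{source_label}: {query}")
    lines.append(f"{target_label}:")
    return "\n".join(lines)


examples = [
    ("Hello, how are you?", "Bonjour, comment ça va ?"),
    ("I would like a coffee.", "Je voudrais un café."),
]
prompt = build_few_shot_prompt(
    "Translate the following English sentences to French.",
    examples,
    "Where is the library?",
)
print(prompt)
```

Keeping the examples as data rather than as one long string makes it easy to swap them out, reorder them, or check them for the sloppy or contradictory entries that would teach the model the wrong pattern.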

The Pitfall: The model is pattern-matching, not understanding. If your examples are sloppy or contradictory, the model will happily pick up the wrong pattern. Garbage in, garbage out. Also, the model has a finite context window, and every example eats into it. If your task needs more examples than fit in that window, ICL falls apart, and you’ll have to go back to fine-tuning.
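To make the context-window constraint concrete, here is a rough sketch that keeps only as many examples as fit in a budget. Everything in it is an assumption for illustration: `approx_tokens` just counts whitespace-separated words, which is a crude stand-in for a real tokenizer, and `select_examples` and its `budget` parameter are hypothetical names.

```python
def approx_tokens(text):
    # Assumption: word count as a crude proxy for token count.
    # A real tokenizer will give different (usually larger) numbers.
    return len(text.split())


def select_examples(examples, query, budget):
    """Keep examples, in order, until the estimated budget would be exceeded."""
    used = approx_tokens(query)  # the query itself always has to fit
    kept = []
    for source, target in examples:
        cost = approx_tokens(source) + approx_tokens(target)
        if used + cost > budget:
            break
        kept.append((source, target))
        used += cost
    return kept


pairs = [
    ("Hello, how are you?", "Bonjour, comment ça va ?"),
    ("I would like a coffee.", "Je voudrais un café."),
]
print(select_examples(pairs, "Where is the library?", budget=15))
```

In practice you would measure length with the tokenizer that belongs to your model and leave headroom for the instruction and the generated answer.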

From Pattern Matching to Reasoning: Chain-of-Thought

Here’s where things get genuinely interesting. Standard ICL gives you an answer. Chain-of-Thought prompting forces the model to show its work. This is crucial for complex reasoning, math, or logic problems where the answer isn’t just a pattern but a multi-step derivation.

Why does this work? Because the training data included textbooks, math problem sites, and forum posts where people literally reason step-by-step. The model learned that the token sequence “Let’s think step by step…” is very often followed by a logical breakdown and then a final answer. You’re tapping into that deeply ingrained sequence.

# A standard prompt that often fails on multi-step problems:
prompt_standard = """
Question: You have an empty 4-liter jug and an empty 7-liter jug, and unlimited water. How can you measure exactly 5 liters?
Answer:
"""

# The same question, with a Chain-of-Thought trigger instead of a bare "Answer:":
prompt_cot = """
Question: You have an empty 4-liter jug and an empty 7-liter jug, and unlimited water. How can you measure exactly 5 liters?
Let's think step by step.
"""

# A hand-written continuation in the style of a model's reasoning trace.
# It is NOT part of the prompt, and it loses the thread on purpose:
sample_trace = """
1. Fill the 7-liter jug.
2. Pour water from the 7-liter jug into the 4-liter jug until the 4-liter jug is full. This leaves 3 liters in the 7-liter jug.
3. Empty the 4-liter jug.
4. Pour the remaining 3 liters from the 7-liter jug into the 4-liter jug.
5. Fill the 7-liter jug again.
6. Pour from the 7-liter jug into the 4-liter jug until it's full. Since the 4-liter jug already had 3 liters, it can only take 1 more liter. This leaves 6 liters in the 7-liter jug? That doesn't seem right. Let me correct.

Wait, correction after step 4:
4. Pour the remaining 3 liters from the 7-liter jug into the empty 4-liter jug. Now the 4-liter jug has 3 liters.
5. Fill the 7-liter jug again. Now we have 7 liters in the big jug and 3 in the small one.
6. Pour from the 7-liter jug into the 4-liter jug until the 4-liter jug is full. It only needs 1 more liter. So, we pour 1 liter out of the 7-liter jug.
7. Now the 7-liter jug has 7 - 1 = 6 liters? That's not 5. I'm messing this up.

Let me try a different approach. The goal is 5. 7 - 4 = 3, which isn't 5. 4 + 4 - 7 = 1, not 5. This is trickier than it looks.
"""

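Okay, even my own example went off the rails (see what I mean about being honest?). The irony is that the arithmetic in that trace was fine: 6 liters in the 7-liter jug is correct, and the trace simply gave up on a path that does eventually reach 5. A shorter route starts from the small jug: fill the 4, pour it into the 7, fill the 4 again, top up the 7 (leaving 1 liter in the 4), empty the 7, pour the 1 liter across, then fill the 4 once more and pour it in, for 1 + 4 = 5. So yes, the model can reason itself into a corner, abandon a good path, or make arithmetic errors. The point of CoT is that the intermediate steps get written down, so each step conditions the next, and you can see where things went wrong. The model’s “reasoning” is still just a highly plausible token sequence that mimics reasoning. But the mimicry is good enough that it often reaches the correct answer on problems that would stump it with a direct prompt.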
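When a reasoning trace wobbles like this, it helps to check the puzzle independently of any model. The sketch below is a plain breadth-first search over jug states, written for this section (the name `solve_jugs` is made up); it confirms the 4-liter/7-liter puzzle is solvable and returns a shortest sequence of states.

```python
from collections import deque


def solve_jugs(cap_a, cap_b, target):
    """Breadth-first search over (a, b) fill levels.

    Returns the shortest list of states from (0, 0) to any state where
    either jug holds `target` liters, or None if no such state is reachable.
    """
    start = (0, 0)
    parent = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        a, b = queue.popleft()
        if a == target or b == target:
            # Walk parent links back to the start, then reverse.
            path = []
            state = (a, b)
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        pour_ab = min(a, cap_b - b)  # how much can move from a to b
        pour_ba = min(b, cap_a - a)  # how much can move from b to a
        moves = [
            (cap_a, b), (a, cap_b),      # fill either jug
            (0, b), (a, 0),              # empty either jug
            (a - pour_ab, b + pour_ab),  # pour a into b
            (a + pour_ba, b - pour_ba),  # pour b into a
        ]
        for nxt in moves:
            if nxt not in parent:
                parent[nxt] = (a, b)
                queue.append(nxt)
    return None


path = solve_jugs(4, 7, 5)
print(path)
print(len(path) - 1, "moves")  # 8 moves
```

Each tuple is (liters in the 4-liter jug, liters in the 7-liter jug). The search finds the eight-move solution that ends with 5 liters in the big jug, which is exactly the kind of ground truth you want on hand before trusting, or blaming, a model’s chain of thought.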

Best Practice & The Big Gun: The most reliable way to use CoT is to include worked examples of good reasoning in your prompt, which combines ICL and CoT (few-shot CoT). But the real pro move is Self-Consistency: you sample the CoT prompt several times with a nonzero temperature, get a bunch of different reasoning paths, and then take the majority vote on the final answer. It’s like asking a room of overconfident interns to each show their work, and then you just pick the answer most of them agreed on. It often beats a single sample by a clear margin, at the cost of several times the compute, and it highlights that we’re dealing with a stochastic, not a deterministic, reasoner.
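Here is a minimal sketch of the majority-vote step. It assumes two things that are not given above: that each reasoning trace ends its final answer with an “Answer:” marker, and that you have some function `sample_fn(prompt)` that returns one sampled completion. Both are hypothetical; the example uses canned strings instead of a real model so it runs on its own.

```python
import re
from collections import Counter


def extract_answer(trace):
    """Return the text after the last 'Answer:' marker, or None if absent."""
    matches = re.findall(r"Answer:\s*(.+)", trace)
    return matches[-1].strip() if matches else None


def self_consistency(sample_fn, prompt, n_samples=5):
    """Sample several traces; return (most common answer, vote count)."""
    answers = []
    for _ in range(n_samples):
        answer = extract_answer(sample_fn(prompt))
        if answer is not None:  # traces without a final answer get no vote
            answers.append(answer)
    if not answers:
        return None, 0
    return Counter(answers).most_common(1)[0]


# Canned traces standing in for five sampled model outputs (illustrative only).
canned = iter([
    "7 - 1 = 6, so the big jug holds 6. Answer: 6",
    "Fill the 4, pour, repeat, then 1 + 4. Answer: 5",
    "Work from the small jug. Answer: 5",
    "I am not sure how to proceed.",
    "After emptying the 7, add 1 and then 4. Answer: 5",
])
answer, votes = self_consistency(lambda prompt: next(canned), "ignored", n_samples=5)
print(answer, votes)  # 5 3
```

The vote only works when final answers can be compared, so in practice you would normalize them first (strip units, lowercase, parse numbers), and treat a narrow majority as a sign that the question deserves a closer look.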