20.7 Open-Source LLMs: LLaMA, Mistral, Gemma, Phi, Qwen

Right, let’s talk about the open-source revolution. Because let’s be honest, the big, proprietary models from OpenAI and Google are impressive, but they’re also black boxes. You can’t see the gears turning, you can’t fine-tune them on your own secret data without paying an arm and a leg, and you certainly can’t run them on your own hardware without a corporate-sized trust fund. That’s where this motley crew of open-source models comes in. They’re the rebels, the tinkerer’s paradise, and frankly, the reason this field is moving at lightspeed. We’re not just users here; we’re mechanics.

20.6 Emergent Capabilities: In-Context Learning, Chain-of-Thought

Right, so you’ve heard the hype: LLMs are “magical” and “emergent.” Let’s cut through that. They’re not magical, but what they do is often emergent, meaning it’s a capability that wasn’t explicitly programmed but arises from the sheer scale of the model and its training. It’s the difference between teaching a kid arithmetic by rote memorization (boring) and watching them suddenly figure out how to reason through a word problem (wild). The two biggest party tricks in this category are In-Context Learning (ICL) and Chain-of-Thought (CoT) reasoning. They’re the reason these models feel so spookily intelligent instead of just being fancy autocomplete.
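To make the two party tricks concrete, here is a minimal sketch of how a few-shot (ICL) prompt and a chain-of-thought cue are typically assembled as plain text. The function name `build_few_shot_prompt`, the "Q:/A:" layout, and the "Let's think step by step" wording are illustrative conventions, not part of any particular model's API.

```python
def build_few_shot_prompt(examples, query, chain_of_thought=False):
    """Assemble a plain-text prompt from (question, answer) demonstration pairs."""
    lines = []
    for question, answer in examples:
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer}")
        lines.append("")  # blank line between demonstrations
    lines.append(f"Q: {query}")
    # A commonly used chain-of-thought cue; the wording is a convention, not an API.
    lines.append("A: Let's think step by step." if chain_of_thought else "A:")
    return "\n".join(lines)


# In-context learning: the "training" is just the demonstrations in the prompt.
demos = [("2 + 2", "4"), ("7 + 5", "12")]
prompt = build_few_shot_prompt(demos, "13 + 8")
cot_prompt = build_few_shot_prompt(demos, "13 + 8", chain_of_thought=True)
print(prompt)
```

Nothing about the model changes between the two prompts; only the text it is asked to continue does.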

20.5 Mixture of Experts (MoE): Scaling Without Proportional Compute Cost

Right, so you’ve built a colossal dense transformer model. It’s a beast. 175 billion parameters. The problem? Every single time you want to generate a single, lousy token, you have to fire up every one of those 175 billion parameters. It’s like calling in a full-scale military operation to swat a fly. The compute cost is astronomical, and the latency is… well, let’s just say you have time to brew a coffee. Maybe two.
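A back-of-the-envelope sketch of that cost, assuming the common rule of thumb of roughly 2 FLOPs per active parameter per generated token. The 25% active fraction for the sparse case is a made-up number purely for illustration; real MoE models vary.

```python
def forward_flops_per_token(active_params):
    # Rule of thumb: about 2 FLOPs (one multiply, one add) per active parameter.
    return 2 * active_params


TOTAL_PARAMS = 175e9

# Dense model: every parameter participates in every token.
dense_flops = forward_flops_per_token(TOTAL_PARAMS)

# Hypothetical sparse model: same total size, but only a fraction is active per token.
ACTIVE_FRACTION = 0.25  # illustrative assumption, not taken from any real model
sparse_flops = forward_flops_per_token(TOTAL_PARAMS * ACTIVE_FRACTION)

print(f"dense:  {dense_flops:.2e} FLOPs per token")
print(f"sparse: {sparse_flops:.2e} FLOPs per token")
```

The total parameter count (and the memory needed to hold it) stays the same in both cases; only the work done per token shrinks.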

20.4 Context Window: KV Cache, RoPE Embeddings, and Long Context

Alright, let’s talk about the single biggest constraint you’ll wrestle with when building with LLMs: the context window. Think of it as the model’s working memory. It’s the total number of tokens—that’s your input and the generated output combined—that the model can “see” at any one time. Early models had the attention span of a goldfish in a caffeine lab; we’re talking a paltry 2048 tokens. Now, we’re seeing models that can process entire books, technical manuals, or, let’s be honest, shockingly long rants. This expansion isn’t magic; it’s a series of clever, sometimes hacky, engineering triumphs. Let’s break them down.
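Because input and output share the same window, the practical question is usually "how many tokens do I have left to generate?" A minimal sketch of that bookkeeping follows; the function name and the `reserve` parameter are hypothetical, and token counts are assumed to come from whatever tokenizer your model uses.

```python
def max_new_tokens(context_window, prompt_tokens, reserve=0):
    """Tokens left for generation once the prompt (and an optional reserve) is counted."""
    remaining = context_window - prompt_tokens - reserve
    if remaining < 0:
        raise ValueError("prompt does not fit in the context window")
    return remaining


# With an early-style 2048-token window, a 1500-token prompt leaves little room.
budget = max_new_tokens(context_window=2048, prompt_tokens=1500)
print(budget)
```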

20.3 Decoder-Only Architecture: Why GPT-Style Dominates

Alright, let’s talk about why the world seems to run on GPT-style models. You’ve heard of them: GPT-3, Jurassic-1, BLOOM, LLaMA. They’re the celebrities of the AI world. But why did this particular architecture, the “decoder-only” transformer, absolutely dominate the scene? It wasn’t an accident. It was a brutally pragmatic bet on scale, and it paid off in a way that left other, more elegant architectures in the dust.

Think of the original “Transformer” model from the famous 2017 paper as a balanced, well-rounded athlete. It had an encoder (to read and understand input) and a decoder (to generate output). This was perfect for translation, where you need to deeply comprehend a sentence before you start writing its new version. But then we all got a bit obsessed with just generating stuff: stories, code, excuses for missing a deadline. For that, you don’t need a separate understanding phase; understanding and generation become the same dance. The decoder is already a phenomenal generator. So we asked: what if we just used the decoder part, gave it a truly absurd amount of data, and saw what happened?

20.2 Scaling Laws: Compute-Optimal Training (Chinchilla)

Alright, let’s talk about Chinchilla. You’ve probably heard the mantra: bigger models are better. More parameters, more smarts. It’s a seductive idea, and for a while, we all just kinda ran with it. We were building ever-larger monuments of parameters, throwing ungodly amounts of compute at them, and feeding them whatever data we had lying around. It was the era of “just scale it up, it’ll probably work.” Then a bunch of very smart people from DeepMind asked a profoundly simple question: “Are we being profoundly wasteful?” The answer, detailed in their 2022 paper “Training Compute-Optimal Large Language Models,” was a resounding yes. We were. Chinchilla is the model that resulted from this question, and its real legacy isn’t the model itself; it’s the scaling law it established: for a fixed compute budget, most large models of the time were too big and trained on too little data. It showed us we’d been driving a Formula 1 car with the parking brake on.
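The paper’s headline result is often boiled down to two approximations: training compute is roughly 6 × parameters × tokens, and the compute-optimal recipe uses roughly 20 training tokens per parameter. Both are rules of thumb rather than exact constants, and the sketch below simply takes them as assumptions to split a compute budget; the function name is hypothetical.

```python
import math

TOKENS_PER_PARAM = 20       # rough rule of thumb, not an exact constant
FLOPS_PER_PARAM_TOKEN = 6   # common approximation: C ~ 6 * N * D


def compute_optimal_split(compute_flops):
    """Split a training budget into (parameters, tokens) under the two assumptions above."""
    # C = 6 * N * D and D = 20 * N  =>  C = 120 * N^2
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


# A budget on the order of Chinchilla's own: 70B parameters on 1.4T tokens.
budget = 6 * 70e9 * 1.4e12
params, tokens = compute_optimal_split(budget)
print(f"{params:.3g} parameters, {tokens:.3g} tokens")
```

Doubling the budget under these assumptions grows both the model and the dataset by about the square root of two, rather than pouring everything into parameters.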

20.1 What Makes an LLM: Scale, Data, and Compute

Alright, let’s cut through the marketing fluff. When someone says “Large Language Model,” they’re really talking about a perfect storm of three things: Scale, Data, and Compute. Miss one leg of this tripod, and your fancy AI collapses into a pile of overhyped matrix multiplication. It’s not magic; it’s a brutally expensive engineering experiment that, against all odds, actually worked. Think of it like this: you’re trying to build a perfect model of the world, but all you have to work with is the text humans have written down. The only way to do that is to find statistical patterns so deep and so nuanced that they approximate understanding. To find those patterns, you need an absurdly large network (scale), an ungodly amount of text for it to learn from (data), and a small fortune to pay for the electricity to make it all happen (compute).

11.8 Cluster Autoscaler: Adding and Removing Nodes

Right, so you’ve got your pods scaling horizontally like a well-rehearsed flash mob. But what happens when the entire party runs out of room? That’s where the Cluster Autoscaler (CA) comes in. Think of it as the pragmatic bouncer for your Kubernetes nightclub. HPA and VPA handle the guest list (pods), but when the club is at capacity, the CA is the one who calls the building manager to add a new floor or, when things quiet down, tells the unused floors they can go home. It doesn’t look at actual CPU or memory usage inside your pods; it looks at whether pods can be scheduled at all, judged by their resource requests against what the nodes have left.
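As a toy illustration of the scale-up side, the sketch below estimates how many identical new nodes a set of pending pods would need, using first-fit-decreasing packing on CPU requests alone. This is a deliberate simplification: the real Cluster Autoscaler simulates the scheduler with memory, affinity, taints, and node-group limits, and the function name here is hypothetical.

```python
def nodes_needed_for_pending(pending_requests_m, node_capacity_m):
    """Estimate new nodes needed for pending pods (CPU requests in millicores)."""
    free = []  # remaining CPU on each hypothetical new node
    for request in sorted(pending_requests_m, reverse=True):
        if request > node_capacity_m:
            raise ValueError("a pod this large can never fit on this node type")
        for i, capacity in enumerate(free):
            if capacity >= request:
                free[i] = capacity - request  # pack onto an already-added node
                break
        else:
            free.append(node_capacity_m - request)  # no room anywhere: add a node
    return len(free)


# Four unschedulable pods, nodes with 2000m of allocatable CPU each.
needed = nodes_needed_for_pending([1500, 1500, 1000, 500], node_capacity_m=2000)
print(needed)
```

Note that the inputs are requests, not measured usage, which is exactly why badly sized requests lead to badly sized clusters.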

11.7 Combining HPA and VPA: Caveats and Best Practices

Right, so you’ve decided you want both horizontal and vertical autoscaling. Ambitious. A little greedy, even. I like it. It’s the “have your cake and eat it too” of Kubernetes resource management. But let’s be absolutely clear: combining HPA and VPA is like putting two brilliant, highly opinionated chefs in the same kitchen. If you don’t set very strict rules, they will absolutely fight over the stove, and you’ll end up with a culinary disaster (read: a cascading pod eviction nightmare).

11.6 VPA Modes: Off, Initial, Auto

Alright, let’s talk about VPA modes. This is where you decide just how much authority you’re willing to hand over to this particular robot butler. You’ve installed VPA, you’ve defined a VerticalPodAutoscaler resource, and now you have to choose its updateMode. The three you’ll weigh are Off, Initial, and Auto (the API also accepts Recreate, a close cousin of Auto). Picking the right one is the difference between getting helpful advice and handing your cluster the keys to the kingdom with a blindfold on.
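For orientation, here is a minimal sketch of where updateMode lives in a VerticalPodAutoscaler object, built as a Python dictionary and printed as JSON (which kubectl accepts just like YAML). The resource and Deployment names are placeholders, and the helper function is hypothetical; only the three modes named in this section are accepted by it.

```python
import json

# The three modes named in this section; treated here as the allowed set.
UPDATE_MODES = {"Off", "Initial", "Auto"}


def vpa_manifest(name, deployment, update_mode="Off"):
    """Build a VerticalPodAutoscaler manifest targeting a Deployment."""
    if update_mode not in UPDATE_MODES:
        raise ValueError(f"unknown updateMode: {update_mode!r}")
    return {
        "apiVersion": "autoscaling.k8s.io/v1",
        "kind": "VerticalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            "targetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,  # placeholder workload name
            },
            "updatePolicy": {"updateMode": update_mode},
        },
    }


# "Off" only produces recommendations; nothing is evicted or resized.
manifest = vpa_manifest("my-app-vpa", "my-app", update_mode="Off")
print(json.dumps(manifest, indent=2))
```

Starting in Off and reading the recommendations before granting more authority is a cautious default, not a requirement.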

11.5 VPA: Right-Sizing Container Resource Requests

Right, so you’ve got HPA scaling the number of your pods based on traffic. That’s great. But what if the pods themselves are the problem? You’ve got a container running with a paltry 100m CPU request and a limit to match, but it keeps trying to spike to 800m and gets throttled into next Tuesday by the kernel. Or worse, you’ve got a memory leak slowly filling up a node because some container requested a laughably small 128Mi and is now trying to swallow 2Gi. This is where Vertical Pod Autoscaler (VPA) comes in: it’s the friend that tells you you’ve been wearing the wrong-sized clothes all along and helps you get a better fit.

11.4 HPA Behavior: Scale-Up and Scale-Down Stabilization

Alright, let’s talk about what happens after the HPA calculates it needs to scale. The raw metric says “we need 10 pods, NOW!” If we just blindly obeyed that command every polling interval, we’d be creating a chaotic mess. Pods would be frantically scaling up and down like a hyperactive yo-yo, your cluster’s control plane would weep, and your application’s performance would be a jagged nightmare of cold starts and sudden load drops. This is where behavior comes in—it’s the built-in shock absorber and common sense that prevents your cluster from having a panic attack.
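The scale-down half of that shock absorber can be sketched in a few lines: instead of obeying the latest recommendation, take the highest one seen inside the stabilization window (300 seconds is the documented default for scale-down). The function name and the list-of-tuples history format are assumptions made for this illustration.

```python
def stabilize_scale_down(recommendations, now, window_seconds=300):
    """Pick the highest desired replica count recommended within the window.

    recommendations: list of (timestamp_seconds, desired_replicas) pairs.
    """
    recent = [desired for ts, desired in recommendations if now - ts <= window_seconds]
    if not recent:
        raise ValueError("no recommendations inside the stabilization window")
    return max(recent)


history = [(0, 10), (100, 4), (200, 3)]

# At t=250 the spike to 10 is still inside the window, so we hold at 10.
held = stabilize_scale_down(history, now=250)
# At t=400 the spike has aged out; the highest recent recommendation is 4.
relaxed = stabilize_scale_down(history, now=400)
print(held, relaxed)
```

The effect is that replicas drop only after the load has stayed low for the whole window, which is what damps the yo-yo.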

11.3 Custom and External Metrics with KEDA

Right, so you’ve got HPA and VPA humming along, scaling based on CPU and memory like a well-trained golden retriever. It’s obedient, but let’s be honest, it’s not exactly clever. Your application’s real scaling triggers are probably more nuanced: the number of messages clogging your RabbitMQ queue, the throttle percentage on your third-party API, or the sheer number of users hammering your authentication service. This is where we graduate from the dog to a fox: sly, clever, and resource-aware. We do this by bringing in custom and external metrics, and the easiest, most elegant way to do that is with KEDA, short for Kubernetes Event-driven Autoscaling.
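To see what scaling on a queue looks like as arithmetic, here is a simplified sketch: divide the backlog by the number of messages you want each replica to handle, round up, and clamp to configured bounds. The function and parameter names are hypothetical and this is not KEDA’s code; it only mirrors the general shape of an average-value target.

```python
import math


def replicas_for_queue(queue_length, target_per_replica, min_replicas=0, max_replicas=10):
    """Replica count for a backlog, given a per-replica message target."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(queue_length / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


print(replicas_for_queue(95, target_per_replica=20))    # backlog of 95 messages
print(replicas_for_queue(0, target_per_replica=20))     # empty queue
print(replicas_for_queue(1000, target_per_replica=20))  # backlog beyond the cap
```

A minimum of zero is what lets an event-driven workload disappear entirely when the queue is empty.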

11.2 The Metrics Server: Required Infrastructure for HPA

Right, so you want to use the Horizontal Pod Autoscaler (HPA). Excellent choice. It’s basically magic, letting your application breathe in and out based on load. But here’s the thing about magic: it’s mostly just applied science, and the science here requires a specific piece of infrastructure. You can’t just wave a kubectl wand and expect it to work. You need the Metrics Server.

Think of the Metrics Server as the nervous system for your cluster’s autoscaling. The kubelets on each node (the muscle) are constantly measuring the resource usage (CPU and memory) of every pod. But those metrics are isolated, trapped on their individual nodes. The Metrics Server’s job is to be the brainstem: it periodically scrapes those usage stats from every kubelet, aggregates them in memory, and exposes them through the Metrics API (metrics.k8s.io), where the HPA and kubectl top can read them. Without it, the HPA is just a guy in a room staring at a blank teleprompter. He has no data. He can’t make decisions.

11.1 HPA: Scaling Based on CPU, Memory, and Custom Metrics

Alright, let’s talk about making your applications bend instead of break under pressure. We’re moving past the stone age of static replica counts. You don’t want to pay your cloud provider for a fleet of sleeping Pods, and manually scaling with kubectl scale is a party trick, not a strategy. Enter the Horizontal Pod Autoscaler (HPA), your automated, albeit occasionally dim, bartender who tops up your drinks (Pods) based on how thirsty (busy) your patrons are.
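The bartender’s arithmetic is simple. The Kubernetes documentation describes the core calculation as the current replica count multiplied by the ratio of the observed metric to the target, rounded up, with no change when that ratio is within a small tolerance of 1 (0.1 by default). The sketch below implements just that core; it leaves out min/max bounds, pod readiness handling, and multiple metrics, and the function name is hypothetical.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """Core HPA calculation: scale replicas by the metric-to-target ratio."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: leave things alone
    return math.ceil(current_replicas * ratio)


# 4 Pods averaging 90% CPU against a 60% target.
print(desired_replicas(4, current_metric=90, target_metric=60))
# 4 Pods at 63%: within tolerance, so no change.
print(desired_replicas(4, current_metric=63, target_metric=60))
```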
