22.8 Evaluation: Perplexity, Benchmarks, and Human Evaluation
Right, so you’ve spent all that time and money fine-tuning your model. You’ve babysat the training loop, prayed to the gradient gods, and now you have a shiny new set of weights. Is it any good? Or did you just create a very expensive, very specialized nonsense generator? This is where we separate the signal from the noise. Evaluation isn’t a box to check; it’s the whole point.

The Perplexity Predicament

Let’s start with perplexity, the ML community’s favorite unintuitive metric. Perplexity (PPL) is, technically, the exponentiated average negative log-likelihood per token. I know, that’s a mouthful. Think of it this way: it’s a measure of how surprised your model is by the data it’s seeing. A lower perplexity means the model finds the data less surprising, which generally means it’s modeling it better.
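To make the definition concrete, here is a minimal sketch in plain Python. It assumes you already have the natural-log probability your model assigned to each token of some held-out text; the function name `perplexity` and the toy inputs are made up for illustration and don't come from any particular library.

```python
import math


def perplexity(token_log_probs):
    """Exponentiated average negative log-likelihood per token.

    token_log_probs: natural-log probabilities the model assigned to each
    observed token (hypothetical input; how you extract these depends on
    your framework).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Average negative log-likelihood per token.
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    # Exponentiate to get back to "effective number of choices" units.
    return math.exp(avg_nll)


# Toy data: a model that gives every observed token probability 0.25,
# and a more confident one that gives every observed token probability 0.9.
uniform = [math.log(0.25)] * 4
confident = [math.log(0.9)] * 4

ppl_uniform = perplexity(uniform)
ppl_confident = perplexity(confident)
```

The toy model that spreads its bets evenly over four options lands at a perplexity of about 4, which is why PPL is often read as an effective branching factor: roughly how many equally likely tokens the model is choosing between at each step. The more confident model scores lower, matching the "less surprised" intuition.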