22.8 Evaluation: Perplexity, Benchmarks, and Human Evaluation

Right, so you’ve spent all that time and money fine-tuning your model. You’ve babysat the training loop, prayed to the gradient gods, and now you have a shiny new set of weights. Is it any good? Or did you just create a very expensive, very specialized nonsense generator? This is where we separate the signal from the noise. Evaluation isn’t a box to check; it’s the whole point.

The Perplexity Predicament

Let’s start with perplexity, the ML community’s favorite unintuitive metric. Perplexity (PPL) is, technically, the exponentiated average negative log-likelihood per token. I know, that’s a mouthful. Think of it this way: it’s a measure of how surprised your model is by the data it’s seeing. A lower perplexity means the model finds the data less surprising, which generally means it’s modeling it better.
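To make the "surprise" intuition concrete, here is a minimal sketch that computes perplexity straight from the definition. The probability lists are invented for illustration; they stand in for the probability a model assigned to each correct next token.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the mean negative log-probability per token."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical probabilities assigned to each correct next token
confident = [0.5, 0.5, 0.5, 0.5]
unsure = [0.1, 0.1, 0.1, 0.1]

ppl_confident = perplexity(confident)  # about 2: like guessing among 2 options
ppl_unsure = perplexity(unsure)        # about 10: like guessing among 10 options
print(ppl_confident, ppl_unsure)
```

A handy way to read the number: a perplexity of k means the model is, on average, about as uncertain as if it were picking uniformly among k tokens.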

It’s fantastic for apples-to-apples comparisons during the same training run. Watching your validation perplexity drop is a beautiful thing. It tells you the model is learning. But here’s the massive, glaring caveat: perplexity is not a measure of quality; it’s a measure of probability. You can have a model with a stunningly low perplexity that outputs perfectly probable, grammatically correct, and utterly useless text. It’s also highly dependent on your tokenizer. A different tokenizer splits the same text into a different number of tokens, so per-token perplexity values aren’t directly comparable across models that use different tokenizers.
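The tokenizer caveat is easy to see with arithmetic. In the sketch below, every number is made up: two hypothetical models assign the same total negative log-likelihood to the same text but split it into different numbers of tokens. One common workaround, shown here as an assumption rather than a prescription, is to normalize by a unit that doesn't depend on the tokenizer, such as characters.

```python
import math

def per_token_ppl(total_nll, n_tokens):
    """Perplexity normalized by token count (tokenizer-dependent)."""
    return math.exp(total_nll / n_tokens)

def per_char_ppl(total_nll, n_chars):
    """Perplexity normalized by character count (tokenizer-independent)."""
    return math.exp(total_nll / n_chars)

# Made-up values: same text, same total NLL in nats, different tokenizations
total_nll = 120.0
n_chars = 200
tokens_a = 40  # hypothetical coarse tokenizer
tokens_b = 60  # hypothetical finer-grained tokenizer

ppl_a = per_token_ppl(total_nll, tokens_a)  # exp(3.0), roughly 20.1
ppl_b = per_token_ppl(total_nll, tokens_b)  # exp(2.0), roughly 7.4
ppl_char = per_char_ppl(total_nll, n_chars)  # identical for both models
print(ppl_a, ppl_b, ppl_char)
```

Both models are equally good at predicting this text, yet the per-token numbers differ by almost 3x purely because of how the text was chopped up.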

You calculate it by running your evaluation dataset through the model, averaging the negative log-likelihood over every predicted token, and exponentiating the result. Here’s how you might do it with a Hugging Face model:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from datasets import load_dataset

model_name = "your-finetuned-model"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Add padding token if it doesn't exist
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load your evaluation dataset
eval_dataset = load_dataset('text', data_files={'eval': 'eval_data.txt'})['eval']

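def compute_perplexity(model, tokenizer, dataset, max_length=512):
    """Corpus-level perplexity: exp of the token-weighted mean NLL."""
    model.eval()  # make sure dropout is disabled
    total_nll, total_tokens = 0.0, 0
    for example in dataset:
        # One example at a time, so no padding is needed
        inputs = tokenizer(example['text'], truncation=True, max_length=max_length, return_tensors="pt")
        # The model shifts labels internally, so n tokens yield n - 1 predictions
        n_predicted = inputs['input_ids'].size(1) - 1
        if n_predicted < 1:
            continue  # skip empty or single-token lines
        with torch.no_grad():
            outputs = model(**inputs, labels=inputs['input_ids'])
        # outputs.loss is the mean NLL over predicted tokens; convert back to a sum
        total_nll += outputs.loss.item() * n_predicted
        total_tokens += n_predicted
    # Exponentiate once, after averaging over all tokens. Averaging
    # per-example perplexities instead would overweight short, hard lines.
    return torch.exp(torch.tensor(total_nll / total_tokens)).item()

ppl = compute_perplexity(model, tokenizer, eval_dataset)
print(f"Perplexity: {ppl:.2f}")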

Use PPL as a first-line health check, not as your final verdict.

The Benchmarking Blender

Next up: benchmarks. These are standardized tests (MMLU, HellaSwag, TruthfulQA, and so on) that try to measure general capabilities. They’re essential because they provide a somewhat objective, comparable yardstick against other models.

The good news: they’re standardized. The bad news: they’re standardized. The designers of these benchmarks had to make choices, and some of those choices are…questionable. They can be noisy, they can be gamed, and they often measure a very specific kind of knowledge that might have zero relevance to your specific use case (does your customer support bot really need to know ancient Sumerian history?).

Running them is a bit of a chore, but libraries like lm-evaluation-harness make it manageable. The key is to run the same benchmarks, with the same settings, on both the base model you started with and your fine-tuned model. This gives you a before-and-after picture. Did your fine-tuning on medical data improve its score on MedQA? Great! Did it completely nuke its common-sense reasoning score on HellaSwag? Uh oh. That’s a sign of catastrophic forgetting, and it’s a real problem.
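Once you have before-and-after scores, the regression check itself is simple. The sketch below assumes you have already collected accuracies into two plain dictionaries; the function name, the 0.02 tolerance, and every score are placeholders invented for illustration, not real results.

```python
def find_regressions(base_scores, tuned_scores, tolerance=0.02):
    """Return benchmarks where the tuned model dropped by more than `tolerance`."""
    regressions = {}
    for task, base in base_scores.items():
        if task not in tuned_scores:
            continue  # only compare tasks that were run on both models
        delta = tuned_scores[task] - base
        if delta < -tolerance:
            regressions[task] = round(delta, 4)
    return regressions

# Placeholder accuracies, invented for illustration only
base = {"medqa": 0.41, "hellaswag": 0.78, "mmlu": 0.62}
tuned = {"medqa": 0.55, "hellaswag": 0.61, "mmlu": 0.61}

flagged = find_regressions(base, tuned)
print(flagged)
```

The tolerance exists because benchmark scores are noisy; a drop of a point or so may mean nothing, while a double-digit drop deserves a closer look.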

Best practice: Use benchmarks to catch major regressions, not as your primary measure of success. Your application is the ultimate benchmark.

The Human Evaluation Hassle

This is the most important, most expensive, and most annoying part. There is no substitute for putting your model in front of a human and asking, “Is this good?” This is where you learn that your technically-perfect model has a habit of being passive-aggressive or subtly missing the point.

You need a clear rubric. Don’t just say “rate this output 1-5.” Define what a 1 means vs. a 5. Is it accuracy? Helpfulness? Conciseness? Safety? Style? You’ll probably need a combination. For example:

Factual Correctness: Is the information accurate?
Relevance: Does it actually answer the query or go on a tangent?
Tone & Safety: Is it appropriate and free from harmful content?
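A rubric like this can be written down as data so every rater sees the same anchors. What follows is a hedged sketch: the anchor wording, the 1-5 scale, and the `summarize` helper are all assumptions made for illustration, and the ratings are invented.

```python
from statistics import mean

# Hypothetical rubric: each criterion spells out what the scale endpoints mean
RUBRIC = {
    "factual_correctness": {1: "Contains a clear factual error", 5: "Every claim is accurate"},
    "relevance": {1: "Does not address the query", 5: "Answers the query directly"},
    "tone_safety": {1: "Harmful or inappropriate", 5: "Appropriate and harmless"},
}

def summarize(ratings):
    """Average each criterion across raters, rejecting malformed ratings."""
    per_criterion = {name: [] for name in RUBRIC}
    for rating in ratings:
        for name, score in rating.items():
            if name not in RUBRIC:
                raise KeyError(f"Unknown criterion: {name}")
            if not 1 <= score <= 5:
                raise ValueError(f"Score out of range for {name}: {score}")
            per_criterion[name].append(score)
    return {name: mean(scores) for name, scores in per_criterion.items() if scores}

# Invented ratings from two raters for a single model output
ratings = [
    {"factual_correctness": 5, "relevance": 4, "tone_safety": 5},
    {"factual_correctness": 4, "relevance": 2, "tone_safety": 5},
]
summary = summarize(ratings)
print(summary)
```

Keeping the criteria separate instead of collapsing them into one score tells you what to fix: here the raters mostly agree on correctness but disagree on relevance, which is worth a conversation.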

Use a tool like Toloka or Amazon Mechanical Turk, or better yet, use domain experts (if you’re building a legal model, have lawyers check it). And for the love of all that is holy, do not just look at cherry-picked examples. You must systematically evaluate a held-out test set of prompts you never trained on.
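Avoiding cherry-picking mostly comes down to letting a random number generator choose the prompts. Here is a minimal sketch, assuming your prompts are plain strings; `sample_heldout` is a hypothetical helper, and its exact-match check after normalization is deliberately crude, so it will not catch paraphrased duplicates.

```python
import random

def sample_heldout(candidate_prompts, training_prompts, k, seed=0):
    """Pick k evaluation prompts at random, excluding any seen in training."""
    def normalize(prompt):
        # Lowercase and collapse whitespace so trivial variants still match
        return " ".join(prompt.lower().split())

    seen = {normalize(p) for p in training_prompts}
    heldout = [p for p in candidate_prompts if normalize(p) not in seen]
    if k > len(heldout):
        raise ValueError(f"Only {len(heldout)} held-out prompts available, need {k}")
    # A fixed seed makes the sample reproducible across evaluation rounds
    return random.Random(seed).sample(heldout, k)

# Invented prompts for illustration
training = ["How do I reset my password?", "Where is my order?"]
candidates = [
    "how do I reset my  password?",  # duplicate of a training prompt
    "Can I change my shipping address?",
    "Why was I charged twice?",
    "How do I cancel my subscription?",
]
eval_prompts = sample_heldout(candidates, training, k=2)
print(eval_prompts)
```

Reusing the same seed means every model version is judged on the same prompts, which keeps comparisons between rounds fair.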

The brutal truth: If you can only do one type of evaluation, do human eval. All the low-perplexity, high-benchmark-score models in the world are worthless if a real user hates interacting with them. This is the trench work. It’s tedious, but it’s the only way to build something that doesn’t just look good on a chart, but actually works.