32.2 Standard Benchmarks: MMLU, HellaSwag, HumanEval, GSM8K, MATH

Right, let’s talk benchmarks. You can’t throw a rock in AI research without hitting a new paper claiming state-of-the-art performance, and these benchmarks are the yardsticks those claims are measured against. They’re the standardized tests of the LLM world: flawed, often infuriating, but for now the best we’ve got for comparing these digital oracles. Think of them less as a final exam and more like a physical for a pro athlete: they measure specific, important muscles, but they don’t tell you who’ll win the championship game.

The core idea is simple: train a model on a massive, general corpus, then see how it performs on a held-out set of questions it’s never seen. The key is that these benchmarks are standardized. Everyone uses the same test, so we can (theoretically) compare OpenAI’s latest model against Anthropic’s and Meta’s without them all just grading their own homework.

The Usual Suspects: A Quick Rundown

Here are the heavy hitters you’ll see in every other paper. Memorize these acronyms; they’re your new vocabulary.

MMLU (Massive Multitask Language Understanding): The granddaddy. It’s a four-option multiple-choice test covering 57 subjects, ranging in difficulty from high-school level to expert professional and in topic from law and history to computer science and moral scenarios. A high MMLU score is the closest thing to claiming “this model is smart.” It’s a brutal test of stored knowledge and, to a lesser extent, reasoning.
HellaSwag: This one is brilliantly sneaky. It tests commonsense reasoning by asking a model to pick the most sensible ending for a short scenario. The trick is that the wrong endings are machine-generated and adversarially filtered: they read fluently on the surface but are absurd if you really think about them, so humans find them easy while models that just look for superficial statistical patterns get fooled. If a model aces HellaSwag, it has probably picked up a halfway decent sense of how everyday physical situations unfold.
HumanEval: Brought to you by the good folks at OpenAI, this is a code generation benchmark of 164 hand-written Python problems. Each one provides a function signature and a docstring and asks the model to generate the function body, which is then run against a set of unit tests. It measures functional correctness on short, self-contained functions, not the ability to build or maintain real software.
GSM8K (Grade School Math 8K): Basic math word problems. The genius of GSM8K is that it requires multi-step reasoning. A model can’t just parrot an answer; it has to break the problem down (“chain-of-thought”) and solve it step by step. It’s a fantastic probe for logical coherence.
MATH: GSM8K’s bigger, meaner sibling. These are competition-level math problems that require a much deeper understanding of mathematical concepts. If a model does well here, its reasoning capabilities are seriously impressive.
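To make the HumanEval idea concrete, here is a minimal sketch of unit-test grading: glue the prompt and the model’s completion together, execute the result, and see whether the tests pass. The toy problem, the helper name passes_unit_tests, and the check(candidate) test convention are illustrative assumptions for this sketch, not the benchmark’s real harness, which sandboxes execution far more carefully.

```python
def passes_unit_tests(prompt: str, completion: str, test_code: str, entry_point: str) -> bool:
    """Return True if prompt + completion defines a function that passes the tests."""
    namespace: dict = {}
    try:
        # WARNING: exec runs arbitrary code. Real harnesses isolate this step
        # in a sandboxed process with timeouts; this sketch does not.
        exec(prompt + completion, namespace)      # define the candidate function
        exec(test_code, namespace)                # define check(candidate)
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a fail.
        return False


# Hypothetical problem in the signature-plus-docstring style (not a real benchmark item).
PROMPT = 'def add(a, b):\n    """Return the sum of a and b."""\n'
TESTS = (
    "def check(candidate):\n"
    "    assert candidate(2, 3) == 5\n"
    "    assert candidate(-1, 1) == 0\n"
)

good_completion = "    return a + b\n"
bad_completion = "    return a - b\n"

print(passes_unit_tests(PROMPT, good_completion, TESTS, "add"))  # True
print(passes_unit_tests(PROMPT, bad_completion, TESTS, "add"))   # False
```

Note that the grade is binary per problem: a completion that is 95% right still fails, which is part of why code benchmarks are harder to bluff than multiple-choice ones.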

The Devil’s in the Details: How to Actually Run One

You don’t need a supercomputer to play along. EleutherAI’s lm-evaluation-harness (installed from PyPI as lm-eval; they are the same project, not two libraries) makes this surprisingly accessible. Let’s say you have a model loaded via the Hugging Face transformers library. Here’s how you’d get its GSM8K score.

# First, install the evaluation harness. Prepare for a dependency soup.
# pip install lm-eval
# This sketch assumes the 0.4-series Python API; names may differ in other versions.

import lm_eval
from lm_eval.models.huggingface import HFLM

# Load your model. This is a tiny example, your model will be bigger.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-125M"  # Using a small model for illustration
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Wrap it in the harness's Hugging Face adapter
lm = HFLM(pretrained=model, tokenizer=tokenizer)

# Define the tasks you want to run
task_names = ["gsm8k"]  # You can add more: task_names = ["gsm8k", "hellaswag"]

# Run the evaluation; simple_evaluate resolves the task names for you
results = lm_eval.simple_evaluate(model=lm, tasks=task_names)

# GSM8K is scored by exact match on the final answer rather than a plain "acc" key,
# and metric names vary by task and harness version, so print everything reported.
for metric, value in results["results"]["gsm8k"].items():
    print(f"{metric}: {value}")

This will churn for a while (even on a small model) and spit out a score: the fraction of problems where the model’s final answer exactly matches the reference. For a 125M parameter model, expect a score perilously close to zero. For a modern large model, you’d hope for well over 50% on GSM8K.
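If you’re wondering how a free-text answer gets turned into a number, here is a rough sketch of exact-match grading for GSM8K-style problems. It relies on one fact about the dataset (reference solutions end with a line like "#### 18") and one assumption of mine: that the model’s answer is the last number in its output. The function names and the sample data are made up for illustration, and real harnesses use more careful extraction rules.

```python
import re
from typing import List, Optional

# Matches integers and decimals, allowing thousands separators like 1,000.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def reference_answer(solution: str) -> str:
    """GSM8K reference solutions end with '#### <answer>'; keep only that answer."""
    return solution.split("####")[-1].strip().replace(",", "")


def predicted_answer(generation: str) -> Optional[str]:
    """Assumption: treat the last number in the model's output as its answer."""
    matches = NUMBER.findall(generation)
    return matches[-1].replace(",", "") if matches else None


def exact_match(generations: List[str], solutions: List[str]) -> float:
    """Fraction of problems where the extracted answer equals the reference."""
    hits = sum(
        predicted_answer(g) == reference_answer(s)
        for g, s in zip(generations, solutions)
    )
    return hits / len(solutions)


# Made-up examples in the dataset's format.
solutions = [
    "16 - 3 - 4 = 9 eggs left.\n9 * 2 = 18\n#### 18",
    "5 * 200 = 1000\n#### 1,000",
]
generations = [
    "She sells 9 eggs at $2 each, so she makes 18 dollars.",
    "That makes 900 in total.",
]

print(exact_match(generations, solutions))  # 0.5
```

String comparison is deliberately strict here, so "18.0" would not match "18". That strictness is one reason the same model can get different GSM8K scores under different extraction rules.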

The Inevitable Caveats: Why This Is All Kind of BS

Here’s where I get direct. These benchmarks are gameable, and the gaming has already begun.

Data Contamination: This is the big one. These benchmark datasets are publicly available. What if the model’s training data already contained the test questions and answers? It’s not learning; it’s memorizing. Researchers try to de-duplicate, but it’s a constant cat-and-mouse game. A surprisingly high score on a benchmark can sometimes be a red flag for contamination, not intelligence.
Narrow Focus: A model can crush MATH but be utterly incapable of writing a compelling poem. It can ace HellaSwag but fail to understand the emotional subtext of your conversation. These benchmarks test a specific, academic form of intelligence. They don’t test creativity, safety, or alignment with human values.
The “Clever Hans” Effect: Named after the horse that seemed to do math but was actually just reading subtle cues from its trainer, models can learn to simulate reasoning without actually doing it. They find statistical shortcuts in the data. HellaSwag was designed to combat this, but the arms race continues.
They’re Static: The world changes. A benchmark from 2023 doesn’t have questions about events in 2024. A model’s knowledge can become stale, but its benchmark score remains frozen in time, giving a false impression of its current utility.
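For the contamination problem in particular, a common first-pass check is to look for long word sequences that a benchmark item shares with a training document. The sketch below is a toy version under stated assumptions: whitespace tokenization, an arbitrary window of n = 8 words, and made-up example strings. Real de-duplication pipelines work at corpus scale and use more robust normalization.

```python
def ngrams(text: str, n: int) -> set:
    """All runs of n consecutive lowercase, whitespace-separated tokens."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, training_doc: str, n: int = 8) -> float:
    """Share of the item's n-grams that also appear in the training document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item is shorter than n tokens, nothing to compare
    return len(item_grams & ngrams(training_doc, n)) / len(item_grams)


# Hypothetical benchmark question and two hypothetical training documents.
question = "a farmer has 12 cows and buys 7 more cows how many cows does the farmer have now"
leaked_doc = "forum post " + question + " answer 19"
clean_doc = "cows are large domesticated animals commonly raised on farms for milk and meat production"

print(overlap_fraction(question, leaked_doc))  # 1.0
print(overlap_fraction(question, clean_doc))   # 0.0
```

A high overlap is a reason to investigate, not proof of cheating, and a low overlap proves little either: a paraphrased or translated copy of the question sails straight past this check.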

So, the best practice? Use these scores as a directional guide, not as gospel truth. A model that scores terribly on MMLU and GSM8K is probably not very capable. A model that scores highly is likely very capable, but you absolutely must test it on your own specific use cases. Its performance on your data, for your task, is the only benchmark that truly matters.
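A home-grown benchmark doesn’t have to be fancy. Here is a bare-bones sketch: a list of prompts, each paired with a check that encodes what a good answer looks like for your task. Everything in it is hypothetical, including the run_custom_eval helper, the example cases, and the fake_model stand-in, which you would swap for a call to your actual model.

```python
from typing import Callable, List, Tuple

# A case is a prompt plus a function that decides whether the output is acceptable.
Case = Tuple[str, Callable[[str], bool]]


def run_custom_eval(generate: Callable[[str], str], cases: List[Case]) -> float:
    """Return the fraction of cases whose check passes on the model's output."""
    passed = sum(1 for prompt, check in cases if check(generate(prompt)))
    return passed / len(cases)


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call; always returns the same canned reply."""
    return "Refunds are issued within 14 days."


cases: List[Case] = [
    ("What is our refund window?", lambda out: "14 days" in out),
    ("Answer in one sentence.", lambda out: out.count(".") <= 1),
    ("What is our shipping time?", lambda out: "3 business days" in out),
]

score = run_custom_eval(fake_model, cases)
print(f"Passed {score:.0%} of custom cases")
```

Even a few dozen cases like these, drawn from real traffic, will tell you more about fitness for your task than another point of MMLU.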