32.1 Why LLM Evaluation Is Hard
Right, so you’ve built your fancy LLM application. It’s a beautiful RAG pipeline, a sleek agent, or maybe just a cleverly prompted chatbot. It works on your laptop. Your demo to the CEO was flawless. You’re feeling like a genius. Then you deploy it, and a user immediately writes in: “So, according to your AI, Napoleon won the Battle of Waterloo with a fleet of hot air balloons?” Your entire sense of professional competence evaporates. Welcome to the Thunderdome. Evaluating these things is brutally, hilariously difficult, and anyone who tells you otherwise is trying to sell you something.
The core of the problem is that we’re trying to grade a creative, stochastic process with the rigid, deterministic tools of software engineering. It’s like trying to judge a jazz improvisation with a multiple-choice test. You’re not checking for a correct output; you’re checking for a good one, and the definition of “good” shifts with the context.
The Moving Target of “Correctness”
First, let’s murder a sacred cow: there is no single “correct” answer for most LLM tasks. If I ask an LLM to summarize a news article, is the ideal output five sentences? Ten bullet points? A single emoji? The “correctness” is a spectrum influenced by subjectivity, context, and user intent.
Traditional software testing relies on exact matches (output == "expected string") or simple regex. This fails instantly with LLMs. Let’s prove it.
# How NOT to evaluate an LLM summary
generated_summary = "The study found a significant correlation between coffee consumption and productivity."
expected_summary = "The research demonstrated a strong link between drinking coffee and increased productivity."

# Exact match fails miserably
if generated_summary == expected_summary:
    print("Perfect!")  # This will never print. You fail.

# A substring check is just as brittle
if generated_summary in expected_summary:
    print("Close enough!")  # Won't print either: the wording differs.
We need semantic similarity, not syntactic equality. This is where embeddings and metrics like cosine similarity come in. But even that’s not perfect. Which embedding model do you use? all-MiniLM-L6-v2? text-embedding-3-large? Your score changes based on this choice. You’re not measuring absolute truth; you’re measuring proximity according to one particular model’s worldview.
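To make the embedding-choice point concrete, here is a minimal sketch that uses only the standard library. The two "embedders" (`word_count_embedding` and `char_trigram_embedding`) are toy stand-ins invented for this illustration, not real embedding models. The only point is that the same pair of sentences will generally get a different cosine score depending on how the text is turned into vectors.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def word_count_embedding(text: str) -> Counter:
    """Toy 'model' #1 (hypothetical): bag of lowercase words."""
    return Counter(text.lower().replace(".", "").split())


def char_trigram_embedding(text: str) -> Counter:
    """Toy 'model' #2 (hypothetical): bag of character trigrams."""
    cleaned = text.lower()
    return Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))


generated_summary = "The study found a significant correlation between coffee consumption and productivity."
expected_summary = "The research demonstrated a strong link between drinking coffee and increased productivity."

word_score = cosine_similarity(
    word_count_embedding(generated_summary), word_count_embedding(expected_summary)
)
trigram_score = cosine_similarity(
    char_trigram_embedding(generated_summary), char_trigram_embedding(expected_summary)
)

# Same two sentences, two "worldviews", two scores
print(f"word-count score:   {word_score:.2f}")
print(f"char-trigram score: {trigram_score:.2f}")
```

Swapping a real model in for either toy function changes the numbers again, which is why a similarity threshold tuned for one embedding model should not be assumed to carry over to another.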
The Benchmark Mirage
“So,” you say, “I’ll just use a standard benchmark! GLUE! MMLU! Hell, even GSM8K for math!” Great. Please do. Then immediately understand their fatal flaw: these benchmarks test knowledge and reasoning on a fixed set of tasks, not obedience or safety inside your application.
An LLM can ace a knowledge benchmark by having memorized the internet, but that same model might be utterly incapable of following your specific instruction to “respond in the style of a pirate and also don’t mention the war.” Benchmarks measure what the model knows, not how it behaves when you ask it something. Your application cares deeply about the latter. It’s the difference between giving a history exam and hiring a historian who won’t yell obscenities at your clients.
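A behavioral check can start out very simple. The sketch below tests a response against the two constraints from the pirate example using plain string rules. The function name `check_behavior`, the marker list, and the sample responses are all hypothetical, and keyword rules like these are only a crude first filter, not a full instruction-following evaluation.

```python
import re


def check_behavior(
    response: str, forbidden_terms: list[str], required_markers: list[str]
) -> dict[str, bool]:
    """Return pass/fail results for two instruction-following constraints."""
    lowered = response.lower()
    # Word boundaries stop "war" from matching inside words like "forward"
    mentions_forbidden = any(
        re.search(rf"\b{re.escape(term.lower())}\b", lowered)
        for term in forbidden_terms
    )
    # Style is approximated by the presence of at least one marker word
    has_style_marker = any(marker.lower() in lowered for marker in required_markers)
    return {
        "avoids_forbidden_terms": not mentions_forbidden,
        "uses_required_style": has_style_marker,
    }


PIRATE_MARKERS = ["ahoy", "matey", "arr"]  # hypothetical style markers
FORBIDDEN_TERMS = ["war"]

good_response = "Ahoy, matey! Yer order ships on the morrow."
bad_response = "Your order was delayed because of the war."

good_result = check_behavior(good_response, FORBIDDEN_TERMS, PIRATE_MARKERS)
bad_result = check_behavior(bad_response, FORBIDDEN_TERMS, PIRATE_MARKERS)
print(good_result)
print(bad_result)
```

Checks like this say nothing about what the model knows, which is the point: they measure the behavior a knowledge benchmark ignores.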
The Hallucination Hydra
Ah, hallucinations. The party trick that will get you fired. The problem isn’t just that models make stuff up; it’s that they do it with the serene, unshakable confidence of a con artist who believes their own lie.
Evaluating for hallucinations is a multi-headed beast. Sometimes it’s a factual inaccuracy against a known source (a RAG nightmare). Sometimes it’s a logical inconsistency within the same response. Sometimes it’s a subtle fabrication that sounds plausible unless you’re a domain expert.
# A simple check for a RAG system: is the answer grounded in the source?
from sentence_transformers import SentenceTransformer, util

# Load the model once; reloading it on every call is slow
model = SentenceTransformer('all-MiniLM-L6-v2')

def is_answer_grounded(source_text, generated_answer, threshold=0.8):
    # Encode the source and the answer
    source_embedding = model.encode(source_text, convert_to_tensor=True)
    answer_embedding = model.encode(generated_answer, convert_to_tensor=True)
    # Calculate cosine similarity
    similarity = util.cos_sim(source_embedding, answer_embedding).item()
    return similarity >= threshold

# Example
source = "The company's Q3 earnings were $5.2 billion, a 4% year-over-year increase."
answer_good = "Q3 earnings saw a 4% increase to $5.2 billion."
answer_bad = "The company reported a massive $10 billion profit in Q3."
print(is_answer_grounded(source, answer_good))  # Likely True
print(is_answer_grounded(source, answer_bad))   # Likely False, but depends on the model and threshold
See the issue? On a good day this catches the egregious lie, though even that is not guaranteed: embedding similarity measures topical closeness, not factual agreement. And it can easily miss a more subtle one like “The company’s profits, driven by a 4% earnings increase, delighted shareholders.” The source didn’t mention shareholder reaction. It’s probably fine, but is it true? This is the gray area you’ll live in.
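One cheap complement to a similarity score is a literal check on the numbers. The sketch below flags any number in the answer that never appears in the source. The function name `unsupported_numbers` is made up for this example, and the comparison is purely textual (so "5.20" and "5.2" would not match). It is a narrow heuristic under those assumptions, not a hallucination detector.

```python
import re

# Matches integers and simple decimals such as "4" or "5.2"
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")


def unsupported_numbers(source_text: str, generated_answer: str) -> set[str]:
    """Return numbers that appear in the answer but nowhere in the source."""
    source_numbers = set(NUMBER_PATTERN.findall(source_text))
    answer_numbers = set(NUMBER_PATTERN.findall(generated_answer))
    return answer_numbers - source_numbers


source = "The company's Q3 earnings were $5.2 billion, a 4% year-over-year increase."
answer_good = "Q3 earnings saw a 4% increase to $5.2 billion."
answer_bad = "The company reported a massive $10 billion profit in Q3."
answer_subtle = "The company's profits, driven by a 4% earnings increase, delighted shareholders."

print(unsupported_numbers(source, answer_good))    # empty set: every number is in the source
print(unsupported_numbers(source, answer_bad))     # flags the invented "10"
print(unsupported_numbers(source, answer_subtle))  # empty set: the fabrication has no number in it
```

The last call is the instructive one. The shareholder claim passes untouched, because a fabrication with no numbers in it is invisible to this check.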
The RAGAS Paradox
Frameworks like RAGAS are brilliant because they try to break down this impossible problem into measurable components: Faithfulness, Answer Relevancy, Context Recall, etc. But here’s the catch: RAGAS often uses another LLM (like GPT-4) to judge your LLM. You are now using a stochastic process to evaluate a stochastic process. You have to trust the judge model’s own biases and knowledge. It’s like asking a random, brilliant, but occasionally sleepy professor to grade your exam. It’s the best tool we have, but it’s not a ground-truth oracle.
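One common way to soften the sleepy-professor problem is to ask the judge several times and look at how often it agrees with itself. The sketch below is not how RAGAS works internally. It is a generic majority vote around any judge callable, and `make_scripted_judge` is a hypothetical stand-in that replays fixed verdicts, so the example runs without calling a real model.

```python
import itertools
from collections import Counter
from typing import Callable

Judge = Callable[[str, str], str]


def majority_verdict(
    judge: Judge, question: str, answer: str, runs: int = 5
) -> tuple[str, float]:
    """Call the judge `runs` times; return the top verdict and its share of votes."""
    votes = Counter(judge(question, answer) for _ in range(runs))
    verdict, count = votes.most_common(1)[0]
    return verdict, count / runs


def make_scripted_judge(verdicts: list[str]) -> Judge:
    """Hypothetical stand-in for an LLM judge: replays a fixed script of verdicts."""
    script = itertools.cycle(verdicts)

    def judge(question: str, answer: str) -> str:
        return next(script)

    return judge


# A judge that disagrees with itself one time in five
sleepy_judge = make_scripted_judge(
    ["faithful", "faithful", "unfaithful", "faithful", "faithful"]
)
verdict, agreement = majority_verdict(
    sleepy_judge, "What were Q3 earnings?", "Q3 earnings were $5.2 billion."
)
print(verdict, agreement)
```

A low agreement share is a signal in its own right: it marks the cases where the judge is unsure, and those are good candidates for human review.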
The best practice? Embrace the chaos. Use a combination of methods: automated metrics (RAGAS, cosine similarity), human evaluation (a simple thumbs up/down in your UI is gold), and rigorous manual testing on edge cases. You will never have a single score that tells you “this is good.” You will have a dashboard of conflicting signals, and your job is to learn which ones actually correlate with your application’s success. It’s hard because it has to be.
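Learning which signals matter can begin with something as plain as a correlation between each automated metric and the thumbs up/down you collect. The sketch below computes Pearson correlation by hand. All of the numbers are invented purely for illustration, and `metric_a` and `metric_b` are hypothetical metric names, not the output of any real system.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    if variance_x == 0 or variance_y == 0:
        return 0.0  # A constant signal carries no information
    return covariance / math.sqrt(variance_x * variance_y)


# Illustrative, made-up data: one row per logged response
thumbs_up = [1, 1, 0, 1, 0, 0]              # human feedback from the UI
metric_a = [0.9, 0.8, 0.3, 0.7, 0.4, 0.2]   # hypothetical automated score
metric_b = [0.5, 0.6, 0.6, 0.5, 0.5, 0.6]   # another hypothetical score

correlation_a = pearson(metric_a, thumbs_up)
correlation_b = pearson(metric_b, thumbs_up)
print(f"metric_a vs. thumbs up: {correlation_a:.2f}")
print(f"metric_b vs. thumbs up: {correlation_b:.2f}")
```

With a real log you would want far more than six rows before trusting either number, but the habit is the same: keep the metrics that track what users tell you, and be suspicious of the ones that don't.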