24.9 Evaluating RAG: RAGAS Framework
Right, so you’ve built your RAG pipeline. You’ve got your vector store humming, your embeddings are pristine, and your LLM isn’t hallucinating nearly as much. You pat yourself on the back. But then the terrifying question hits: How good is it, actually? You can’t just eyeball a few responses and call it a day. That’s like testing a parachute by jumping out of a plane and saying “Seemed fine!” on the way down. We need metrics. We need a framework. Enter RAGAS.
RAGAS (RAG Assessment) is, frankly, a lifesaver. It’s a framework designed to evaluate your RAG system without needing human-labeled ground truth data, which is a fancy way of saying it saves you from hundreds of hours of mind-numbing manual grading. It does this by using an LLM as a judge to score your system across several key dimensions. It’s brilliantly meta.
The Core Metrics: What Are We Even Measuring?
RAGAS breaks down the problem into three core metrics that get to the heart of what makes a RAG response good.
Faithfulness: This answers the question: “Is the answer grounded entirely in the retrieved context?” A high faithfulness score means every claim in the answer can be traced back to the retrieved context; your system isn’t injecting facts that it made up on the spot. If your answer is hallucinating, this score will be low. It’s arguably the most important metric for making your system trustworthy.
Answer Relevance: This is about the quality of the answer itself. Is it actually relevant to the question? Is it concise and direct, or is it a verbose, meandering mess that avoids the point? A poorly phrased or irrelevant answer, even if it’s based on the context, gets a low score here.
Context Relevance: This is the one that critiques your retriever. It measures how much of the retrieved context is actually relevant to the question. Did you get a bunch of perfect, golden paragraphs, or did you also retrieve five irrelevant documents that just added noise? A high score means your retriever is a precise sniper; a low score means it’s a messy shotgun.
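To make the three definitions concrete, here is a minimal sketch of the arithmetic behind them. It assumes the hard part is already done: in RAGAS an LLM produces the per-claim and per-sentence verdicts and an embedding model produces the vectors, whereas here they are hand-written toy values. The function names are hypothetical and are not part of the ragas API.

```python
import math

def faithfulness_score(claim_supported):
    """Fraction of the answer's claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)

def context_relevance_score(sentence_relevant):
    """Fraction of retrieved sentences that are relevant to the question."""
    if not sentence_relevant:
        return 0.0
    return sum(sentence_relevant) / len(sentence_relevant)

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def answer_relevance_score(question_vec, generated_question_vecs):
    """Mean similarity between the real question and questions
    reverse-engineered from the answer."""
    sims = [cosine(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)

# Toy verdicts and vectors (hypothetical; a judge LLM and an embedding
# model would normally produce these).
faith = faithfulness_score([True, True, False])             # 2 of 3 claims supported
ctx = context_relevance_score([True, True, True, False])    # 3 of 4 sentences relevant
ans = answer_relevance_score([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])

print(faith, ctx, ans)
```

The point of the sketch is that each metric boils down to a ratio or an average; the LLM judge only supplies the verdicts that feed it.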
These three scores can then be combined (in early releases of the library, as their harmonic mean) into a single RAGAS score to give you an overall picture.
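Why a harmonic mean rather than a plain average? A quick sketch with made-up scores shows the difference: the harmonic mean gets dragged toward the weakest metric, so one bad component can’t hide behind two good ones. The three numbers below are illustrative, not real evaluation results.

```python
from statistics import harmonic_mean, mean

# Hypothetical scores: faithfulness, answer relevance, context relevance
scores = [0.9, 0.8, 0.3]

combined = harmonic_mean(scores)  # penalizes the weak 0.3 heavily
average = mean(scores)            # lets the strong scores mask it

print(f"harmonic mean:   {combined:.3f}")  # 0.527
print(f"arithmetic mean: {average:.3f}")   # 0.667
```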
A Practical Example: Let’s Get Our Hands Dirty
Enough theory. Let’s see how you’d actually use this. First, install it: pip install ragas. Now, imagine we’ve run a query through our system and gotten a result. RAGAS needs just four things to evaluate it (the question, the generated answer, the retrieved contexts, and an optional ground-truth answer):
from ragas import evaluate
# The metric objects are spelled with a "-cy" suffix. Metric and column names
# have changed between ragas releases, so check them against your installed version.
from ragas.metrics import faithfulness, answer_relevancy, context_relevancy
from datasets import Dataset

# Your hypothetical RAG run results
question = "What are the key benefits of using retrieval-augmented generation?"
ground_truth = "The main benefits are reduced hallucination by grounding responses in external knowledge, improved accuracy for domain-specific queries, and the ability to leverage up-to-date information without retraining the entire model."  # This is optional in RAGAS, but helpful if you have it!
answer = "RAG primarily helps by reducing model hallucination, as it forces the generator to base its answers on the provided context. This leads to more accurate and trustworthy responses, especially for fact-based queries."
contexts = [
    "Retrieval-Augmented Generation (RAG) enhances large language models by integrating a retrieval component. This component fetches relevant documents from a knowledge source, which the generator then uses to produce answers. The key advantages include a significant reduction in fabricated outputs, as the model is anchored to factual data.",
    "Popular desserts in Paris include croissants, macarons, and éclairs.",  # <- Look, irrelevant noise!
]

# Format the data for the RAGAS evaluate function (one list entry per query)
data_dict = {
    "question": [question],
    "answer": [answer],
    "contexts": [contexts],  # a list of context strings per query
    "ground_truth": [ground_truth],  # Optional but recommended
}
dataset = Dataset.from_dict(data_dict)

# Run the evaluation. This calls the judge LLM, so it needs model credentials
# (by default ragas uses OpenAI models and expects an OPENAI_API_KEY).
score = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_relevancy])
print(score)
You’ll get back a result object with a score for each metric. That irrelevant context about Parisian desserts is going to murder our context relevance score, and rightly so. Your retriever needs a tune-up.
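One query is a demo, not an evaluation. In practice you’ll have a pile of logged runs, one record per query, and you need to pivot them into the column layout used above. Here is a minimal sketch of that reshaping step; the helper name to_columns and the sample records are hypothetical, and the column names simply mirror the example above (your ragas version may expect different ones).

```python
from typing import Dict, List

# Column names copied from the single-query example; treat them as an
# assumption and adjust to whatever your ragas version expects.
COLUMNS = ("question", "answer", "contexts", "ground_truth")

def to_columns(records: List[dict]) -> Dict[str, list]:
    """Pivot a list of per-query records into a dict of equal-length lists."""
    return {col: [rec[col] for rec in records] for col in COLUMNS}

# Two made-up logged runs
runs = [
    {
        "question": "What does RAG reduce?",
        "answer": "It reduces hallucination.",
        "contexts": ["RAG anchors the model to retrieved documents."],
        "ground_truth": "Hallucination.",
    },
    {
        "question": "Does RAG require retraining?",
        "answer": "No, new knowledge comes from the retrieved documents.",
        "contexts": ["RAG can use up-to-date sources without retraining.",
                     "Popular desserts in Paris include macarons."],
        "ground_truth": "No.",
    },
]

data_dict = to_columns(runs)
print(len(data_dict["question"]), "queries ready for evaluation")
```

The resulting data_dict has the same shape as the one in the example, so it can be handed to Dataset.from_dict in the same way.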
Pitfalls and The Art of Not Fooling Yourself
RAGAS is powerful, but it’s not a magic truth box. You have to be aware of its quirks.
First, the LLM-as-a-judge approach inherits all the biases and weaknesses of the judge model itself. If you use a weaker model for evaluation, its judgment will be less reliable. A stronger judge such as GPT-4 generally gives more trustworthy scores, which, yes, adds cost. This is the part where you sigh and add another line item to your project budget.
Second, while you don’t need human labels, having a few ground truth answers (the ground_truth field in the example) is incredibly valuable for validation. It helps you calibrate and sanity-check what RAGAS is telling you. If RAGAS gives a high score to an answer you know is wrong, you’ve got a problem with your evaluator setup.
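That sanity check is easy to automate once you have even a handful of human verdicts. The sketch below flags the cases worth a second look: answers the judge scored highly that a human marked as wrong. The field names, the 0.8 threshold, and the sample data are all hypothetical; pick whatever matches your own logs.

```python
def find_disagreements(samples, threshold=0.8):
    """Return ids of samples the judge scored at or above `threshold`
    even though a human reviewer marked the answer as wrong."""
    return [
        s["id"]
        for s in samples
        if s["judge_score"] >= threshold and not s["human_ok"]
    ]

# Made-up calibration set: judge score plus a human verdict per query
samples = [
    {"id": "q1", "judge_score": 0.95, "human_ok": True},
    {"id": "q2", "judge_score": 0.90, "human_ok": False},  # judge was fooled
    {"id": "q3", "judge_score": 0.40, "human_ok": False},
]

suspicious = find_disagreements(samples)
print(suspicious)  # ['q2']
```

If that list is more than a rare exception, look at the judge model or the prompts before trusting any of the scores.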
Finally, don’t become a slave to the aggregate score. A single overall RAGAS score is useful for a quick gut check, but the real value is in breaking it down. Is your faithfulness high but your answer relevance low? Your generator might be struggling to synthesize a good answer from good context. Is your context relevance in the toilet? Your problem is almost certainly in retrieval, so stop blaming the LLM and go fix your chunks or your embedding model. The metrics tell you where to focus your engineering efforts, which is the whole point.
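That triage logic can be written down as a few lines of code, which is handy if you want it in a CI report rather than in your head. This is only a sketch: the diagnose function, the dictionary keys, and the 0.7 cutoff are hypothetical choices, not anything RAGAS prescribes.

```python
def diagnose(scores, threshold=0.7):
    """Map low per-metric scores to the pipeline stage most likely at fault."""
    hints = []
    if scores["context_relevance"] < threshold:
        hints.append("retrieval: revisit chunking or the embedding model")
    if scores["faithfulness"] < threshold:
        hints.append("generation: answer makes claims the context does not support")
    if scores["answer_relevance"] < threshold:
        hints.append("generation: answer does not address the question directly")
    return hints or ["no metric below threshold"]

# Made-up scores: good generator, noisy retriever
report = diagnose(
    {"faithfulness": 0.92, "answer_relevance": 0.85, "context_relevance": 0.30}
)
for line in report:
    print(line)
```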