32.5 RAG Evaluation with RAGAS: Faithfulness, Answer Relevancy, Context Recall

Right, so you’ve built your RAG pipeline. You’ve chunked your documents, you’ve got a fancy vector store, and you’re feeling pretty good about yourself. Then you ask it a simple question like “What year was the company founded?” and it confidently tells you “The company was founded in 1492, primarily to explore new trade routes to the Indies using large language models.” Fantastic. You’ve just been introduced to the number one problem in RAG: your system can lie to you with a straight face, even with your own documents sitting right in front of it.

This is why we don’t just throw our RAG system over the wall and hope for the best. We need to evaluate it, rigorously and repeatedly. And that’s where RAGAS (RAG Assessment) comes in. It’s a framework that gives you a set of metrics to grade your pipeline, and it’s about as forgiving as a stern but fair headmaster.

The Core Triad: Faithfulness, Answer Relevancy, and Context Recall

RAGAS ships a whole range of metrics, but this section focuses on three that get to the heart of whether your RAG system is actually useful, rather than just a fancy random fact generator.

Faithfulness: This is the big one. It answers the question: “Is the LLM’s answer based solely on the context I provided, or did it start making things up?” A high faithfulness score means the answer is grounded in the context. A low score means your model is hallucinating, even with the right documents in front of it. This is the metric that catches our “founded in 1492” nonsense.
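To make the score less abstract: roughly speaking, RAGAS splits the answer into individual claims, has an LLM judge whether each claim is supported by the retrieved context, and reports the supported fraction. The sketch below shows only that arithmetic. The claims and the True/False verdicts are made up for illustration, and faithfulness_score is a hypothetical helper, not a RAGAS function.

```python
def faithfulness_score(verdicts):
    """Fraction of answer claims that the judge marked as supported by the context."""
    if not verdicts:
        return 0.0
    supported = sum(1 for is_supported in verdicts.values() if is_supported)
    return supported / len(verdicts)


# Hypothetical verdicts for the "founded in 1492" answer (assumed, not real judge output).
claim_verdicts = {
    "The company was founded in 1492.": False,
    "The company was founded to explore trade routes to the Indies.": False,
    "The company uses large language models.": True,
}

score = faithfulness_score(claim_verdicts)
print(round(score, 2))  # 0.33
```

One supported claim out of three gives 0.33, which is the kind of number that should send you straight to your prompt and your contexts.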

Answer Relevancy: This measures how directly the final answer addresses the original query. An answer can be perfectly faithful to a context that’s totally irrelevant to the question. If you ask “What’s the capital of France?” and the system answers, “The Eiffel Tower is 300 meters tall,” based on a context about the Eiffel Tower, it’s faithful but completely irrelevant. This metric penalizes that.
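Under the hood, the usual description of this metric is: an LLM generates a few questions that the answer would plausibly be answering, those questions and the original question are embedded, and the score is the average cosine similarity between them. Here is a minimal sketch of that last step. The three-dimensional vectors are toy stand-ins for real embeddings, and answer_relevancy_score is a hypothetical helper name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer_relevancy_score(question_vec, generated_question_vecs):
    """Mean similarity between the real question and questions implied by the answer."""
    sims = [cosine(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims) if sims else 0.0


# Toy embeddings (assumed values, purely illustrative).
question = [1.0, 0.0, 0.0]                      # "What's the capital of France?"
on_topic = [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]]   # questions implied by "Paris is the capital."
off_topic = [[0.1, 0.9, 0.2], [0.0, 1.0, 0.0]]  # questions implied by "The tower is 300 m tall."

relevant = answer_relevancy_score(question, on_topic)
irrelevant = answer_relevancy_score(question, off_topic)
print(relevant > irrelevant)  # True
```

The Eiffel Tower answer implies questions about tower heights, which sit far from the original question in embedding space, so its score collapses.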

Context Recall: This one judges your retrieval system. It measures how much of the ground truth information (the information you know should be in the answer) was actually retrieved and presented to the LLM. If the perfect context exists in your database but your retriever didn’t find it, your answer will be bad, and this metric will tell you the retriever is the problem, not the LLM.
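A toy version of the idea, assuming we have already reduced the ground truth to a few key phrases: count how many of them show up anywhere in the retrieved chunks. RAGAS uses an LLM to make that attribution call, so the substring matching here is a deliberately crude stand-in, and context_recall_score is a hypothetical helper, not the library's implementation.

```python
def context_recall_score(key_phrases, retrieved_chunks):
    """Share of ground-truth key phrases found in at least one retrieved chunk.

    Substring matching is a crude stand-in for an LLM-based attribution check.
    """
    if not key_phrases:
        return 0.0
    haystack = " ".join(retrieved_chunks).lower()
    hits = sum(1 for phrase in key_phrases if phrase.lower() in haystack)
    return hits / len(key_phrases)


key_phrases = ["capital of France", "Eiffel Tower"]  # assumed ground-truth facts

good_retrieval = [
    "Paris is the most populous city and capital of France.",
    "The Eiffel Tower is located there.",
]
poor_retrieval = ["The Eiffel Tower is 300 meters tall."]

print(context_recall_score(key_phrases, good_retrieval))  # 1.0
print(context_recall_score(key_phrases, poor_retrieval))  # 0.5
```

The second retriever never surfaced the "capital of France" fact, so no amount of prompt tuning on the generator side will rescue the answer.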

Implementing This in Code

Let’s get our hands dirty. You’ll need to install ragas and have your OpenAI (or other LLM) API key ready.

pip install ragas

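Now, let’s set up a simple evaluation. We need four things for each data point: a question, the ground_truth answer (the manual, gold-standard answer you expect), the contexts your system actually retrieved, and the answer your RAG pipeline generated. These column names follow the older RAGAS schema; newer releases rename them (user_input, reference, retrieved_contexts, and response), so check the documentation for the version you installed.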

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall
from datasets import Dataset
import os

# RAGAS uses an LLM to compute its metrics, so it needs an API key.
# Prefer exporting OPENAI_API_KEY in your shell; setdefault only fills it in if it is unset.
os.environ.setdefault("OPENAI_API_KEY", "your-key-here")

# Example data. In reality, you'd have hundreds of these.
data_samples = {
    'question': ['What is the capital of France?'],
    'ground_truth': ['The capital of France is Paris.'],
    'contexts': [['Paris is the most populous city and capital of France. The Eiffel Tower is located there.']],
    'answer': ['Paris is the capital of France.']
}

# Convert to a Hugging Face Dataset format
dataset = Dataset.from_dict(data_samples)

# Run the evaluation
score = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_recall],
)

print(score)

You’ll get output that looks something like this (the metrics are LLM-judged, so exact values vary between runs):

{'faithfulness': 1.0, 'answer_relevancy': 0.987, 'context_recall': 1.0}

In this perfect, textbook example, all scores are high. But the real world is a mess. Let’s break down what happens when things go wrong.
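With hundreds of samples, the averages hide exactly the rows you care about, so the first step is usually to pull out the failures. The sketch below assumes you have exported per-sample scores as a list of dicts (RAGAS results can typically be converted to a table, for example via a to_pandas() method, but check your version). The rows, the thresholds, and the find_failures helper are all hypothetical.

```python
# Hypothetical cut-offs; tune these to your own tolerance for bad answers.
THRESHOLDS = {"faithfulness": 0.8, "answer_relevancy": 0.7, "context_recall": 0.7}


def find_failures(rows, thresholds=THRESHOLDS):
    """Return (row index, metric name, score) for every score below its threshold."""
    failures = []
    for index, row in enumerate(rows):
        for metric, minimum in thresholds.items():
            if row[metric] < minimum:
                failures.append((index, metric, row[metric]))
    return failures


# Made-up per-sample scores standing in for a real evaluation run.
rows = [
    {"faithfulness": 1.0, "answer_relevancy": 0.99, "context_recall": 1.0},
    {"faithfulness": 0.33, "answer_relevancy": 0.91, "context_recall": 1.0},
    {"faithfulness": 1.0, "answer_relevancy": 0.95, "context_recall": 0.5},
]

failures = find_failures(rows)
for index, metric, value in failures:
    print(f"sample {index}: {metric} = {value}")
```

Sample 1 points at the generator (it had the context and still made things up), while sample 2 points at the retriever. That split is what tells you where to spend your afternoon.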

Common Pitfalls and The Art of Interpretation

The “I’m not telling you” Pitfall: What if your retrieved context is empty or doesn’t contain the answer? A good RAG system should say “I don’t know.” But many LLMs, feeling helpful, will invent an answer anyway. This will murder your faithfulness score. The fix? Prompt engineering. Your system prompt needs to be brutally explicit: “ONLY use the provided context. If the answer is not in the context, say ‘I cannot find the information in the provided documents.’”
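One way to wire that instruction in is a small prompt builder, sketched below. The function name, the prompt layout, and the numbered-chunk format are assumptions for illustration; adapt them to whatever your LLM client expects.

```python
REFUSAL = "I cannot find the information in the provided documents."


def build_grounded_prompt(question, contexts):
    """Assemble a prompt that restricts the model to the retrieved context."""
    # Number each chunk so the model (and you) can see what it was given.
    context_block = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(contexts, start=1)
    )
    if not context_block:
        # Make an empty retrieval explicit instead of leaving a blank section.
        context_block = "(no documents retrieved)"
    return (
        "ONLY use the provided context to answer. "
        f"If the answer is not in the context, say '{REFUSAL}'\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_grounded_prompt("What year was the company founded?", [])
print(prompt)
```

Keeping the refusal string in a constant also lets you detect refusals in the pipeline output later with a plain string comparison.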

The “Wikipedia Blob” Problem: Your retriever fetches eight long documents. The answer is buried in the second paragraph of the third one. The LLM finds it and gives a faithful, relevant answer, and your context_recall score will look fine too, because recall only asks whether the needle is somewhere in the haystack. None of the three metrics above penalizes the seven documents of noise you also retrieved. That is a blind spot, not a clean bill of health: the noise costs tokens and latency, and it makes a wrong answer more likely on the next query. To measure the precision of your retrieval, add a precision-oriented metric such as RAGAS’s context_precision, which rewards retrievers that rank the relevant chunks first.
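To put a number on retrieval precision for the eight-document scenario, here is a small sketch with two measures: plain precision (the share of retrieved documents that are relevant) and a rank-aware average precision, which to my understanding is close in spirit to how RAGAS’s context_precision is defined. The relevance flags are made up, and both helper functions are hypothetical rather than part of the library.

```python
def plain_precision(relevance_flags):
    """Share of retrieved documents that are relevant, ignoring their order."""
    if not relevance_flags:
        return 0.0
    return sum(relevance_flags) / len(relevance_flags)


def average_precision(relevance_flags):
    """Mean of precision@k taken at each rank k where a relevant document appears."""
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(relevance_flags, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Eight retrieved documents; only one contains the answer (assumed labels).
answer_in_third = [False, False, True, False, False, False, False, False]
answer_in_first = [True, False, False, False, False, False, False, False]

print(plain_precision(answer_in_third))              # 0.125
print(round(average_precision(answer_in_third), 3))  # 0.333
print(average_precision(answer_in_first))            # 1.0
```

Plain precision is 0.125 in both cases, but the rank-aware score separates the retriever that put the right document first from the one that buried it.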

The “True but Useless” Edge Case: Ask “How do I reset my password?” and the context says “For password reset, see section 5-A.” The LLM faithfully outputs “See section 5-A.” This is technically faithful and relevant, but it’s a terrible user experience. Faithfulness and answer relevancy will likely both score this highly, which is correct from a pure RAG perspective. This highlights a crucial point: RAGAS metrics are necessary but not sufficient. You still need human-in-the-loop evaluation for UX and completeness. Don’t outsource all your judgment to a framework.

RAGAS is one of the best tools we have to move from “hoping it works” to “knowing how well it works.” Use it to find the cracks in your pipeline, but remember: it gives you the symptoms, not the diagnosis. A low faithfulness score means your LLM is making things up, but it’s your job to figure out if it’s due to a bad prompt, poorly structured contexts, or an LLM that’s just too creatively inclined for its own good. Now go fix it.