32.6 Evals Framework: OpenAI Evals and Custom Evaluation Harnesses

Right, so you’ve built your RAG pipeline. You’ve got your vector store humming, your chunking strategy is… well, it exists, and you’re ready to unleash this marvel upon your users. But how do you know it’s not about to confidently tell them that the capital of France is a delicious pastry? You don’t. Not until you build a rigorous evaluation framework. This is where we move from “hoping it works” to knowing it works.

Think of evals not as a test, but as your continuous integration (CI) for intelligence. You wouldn’t push a code change without tests, so why would you deploy a prompt change without testing its impact on accuracy, tone, and its propensity for making things up?

The OpenAI Evals Framework: Your Starting Pistol

OpenAI open-sourced their evals framework, and it’s a fantastic place to start. It’s essentially a standardized way to define tasks (e.g., “answer this question based on this context”), run your model against them, and grade the results. The beauty is in its structure; it forces you to think in terms of completions, metrics, and data.

First, install it. A word of warning: it can be a bit finicky. You might wrestle with dependency conflicts for a bit. This is your first taste of the “in the trenches” experience. Persevere.

pip install evals

Now, let’s say you want to evaluate a simple Q&A task. The core concept is an Eval class. Here’s a minimalist example evaluating a model’s ability to extract an answer from a given context, a fundamental RAG skill.

# my_custom_eval.py
import evals
from evals import EloSuite
from evals.elsuite import utils

class SimpleRAGEval(evals.Eval):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def eval_sample(self, sample, rng):
        # Your test data. 'sample' is a row from your dataset.
        context = sample["input"] # The provided context
        question = sample["ideal"] # The question to ask

        # This is the prompt you're testing
        prompt = [
            {"role": "system", "content": "Answer the question based only on the context below."},
            {"role": "system", "content": f"Context: {context}"},
            {"role": "user", "content": question},
        ]

        # Run the model with your prompt
        response = self.completion_fn(
            prompt=prompt,
            model="gpt-4-turbo", # The model you're evaluating
            temperature=0.0, # Zero temperature for deterministic outputs
        )
        result = response.get_completions()[0]

        # Grade the response. This is the crucial part!
        # Let's use a simple 'match' metric for this example.
        correct_answer = sample["ideal"] # The expected answer
        is_correct = utils.get_exact_match(result, correct_answer)

        # Record the metrics for this sample
        evals.record.record_metrics(
            accuracy=is_correct,
            model_response=result,
            expected_response=correct_answer
        )

To run this, you’d need a dataset file (e.g., dataset.jsonl) with lines like {"input": "The sky is blue due to Rayleigh scattering.", "ideal": "Why is the sky blue?"}. You’d then run it with the CLI:

oaieval gpt-4-turbo my_custom_eval --dataset <path_to_dataset>

The key takeaway here isn’t the specific code, but the pattern: define a task, run the model, grade the output, record the result. OpenAI’s framework provides the scaffolding so you don’t have to build the runner and metrics tracker from scratch.

Grading: The Art of Automating Judgment

The utils.get_exact_match above is brutally simplistic. In the real world, an answer can be correct without being a string literal match. This is where grading logic gets interesting. You have two main allies:

Model-Graded Evals: This is the meta move. You use a more powerful LLM (like GPT-4) to judge the output of your system-under-test. You write a prompt that describes the criteria for a good answer (e.g., “Is this answer supported by the context? Is it accurate? Is it concise?”) and have the judge model output a score or a classification. The evals framework has built-in support for this pattern. It feels weirdly circular until you try it and realize it’s surprisingly effective and scalable.
Heuristic-Based Metrics: For RAG, libraries like RAGAS formalize this. They break down the problem. Instead of one monolithic “is this good?” score, they compute:
- Answer Faithfulness: Is the generated answer based only on the provided context? This directly measures hallucination.
- Answer Relevance: Is the answer actually relevant to the question asked?
- Context Precision/Recall: Did the retrieval system find all and only the relevant context chunks?

RAGAS implements these as model-graded evals under the hood. The best practice is to combine both approaches. Use the nuanced, multi-faceted scores from RAGAS for deep analysis, and simpler, faster heuristic checks (like regex or exact match for specific facts) for quick sanity tests in your CI pipeline.

Common Pitfalls and The Honest Truth

Here’s where the committee-written manual would gloss over the messy parts. I won’t.

The Benchmarking Lie: Your eval dataset is everything. If it’s too small, your results are noise. If it’s not representative of real user queries, your results are a beautiful lie. You will constantly be curating and expanding this dataset. It is your single most important asset.
Cost and Latency: Running thousands of model-graded evals using GPT-4 as the judge gets expensive and slow. Fast. This is why you start small and invest heavily in sampling strategies. You don’t need to run the full suite on every commit; run a lightweight subset on PRs and the full battery nightly.
The Judge is Not God: The model-grading model has its own biases and can be fooled. Always spot-check its judgments. Your grading prompt is code—it needs to be versioned, tested, and refined. A bad prompt leads to garbage metrics.
You’re Testing the Whole System: Remember, an eval doesn’t just test the LLM. It tests your prompts, your retrieval logic, your chunking strategy, and everything in between. When the score drops, your first question should be “What else changed?”

The goal isn’t to achieve a perfect score. That’s impossible. The goal is to establish a baseline and then measure the impact of every change you make. Did tweaking the system prompt improve faithfulness but destroy conciseness? Your evals will tell you immediately. It transforms development from a game of guesswork into a process of informed iteration. Now go build your harness. Your future self, debugging a production hallucination at 2 AM, will thank you.