23.9 Evaluating and Iterating on Prompts
Alright, you’ve crafted a prompt. You’ve stared at it, tweaked a word, stared some more, and finally hit Enter with the cautious optimism of someone defusing a bomb. The model spits something out. Is it good? Is it what you actually needed? Or is it just… plausible? This, my friend, is where the real work begins. Prompt engineering isn’t a one-and-done incantation; it’s a dialogue. And like any good conversation, you have to listen to the responses to know what to say next.
The Brutal Necessity of Evaluation
First, a hard truth: if you don’t have a way to judge the output, you’re just wandering in the dark hoping to bump into a solution. “It looks okay” is the enemy of production-ready AI. You need a rigorous, repeatable evaluation process. This doesn’t have to be a 1000-point inspection system out of the gate, but it must be more than a gut feeling.
Start by defining what “good” even means for your task. Is it:
- Accuracy: Does it factually match a known source or ground truth?
- Relevance: Does it directly answer the question or stay on topic?
- Correctness: Does the code it writes run? Does the logic hold up?
- Style: Does it match a desired tone, length, or format?
- Brevity: Is it concise, or is it peppering you with unnecessary disclaimers?
For simple tasks, you might manually review a handful of outputs. For anything serious, you’ll want to build an evaluation dataset—a collection of representative inputs and your ideal outputs. This is your benchmark. Without it, you’re just guessing.
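As a rough illustration, an evaluation dataset can start life as a plain list of cases. The sketch below is one possible shape, not a standard: the `EvalCase` fields, the sample ideal outputs, and the keyword-based `passes` check are all assumptions made for this example. A real benchmark would use ideal outputs you have written and reviewed yourself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One benchmark item (hypothetical structure, for illustration only)."""
    term: str            # representative input
    ideal_output: str    # what a good answer looks like (placeholder text here)
    must_mention: tuple  # keywords a passing answer should contain
    max_words: int = 100


EVAL_SET = [
    EvalCase(
        term="Quantitative Easing",
        ideal_output="A central bank creates money to buy bonds so that borrowing gets cheaper.",
        must_mention=("central bank",),
    ),
    EvalCase(
        term="Mortgage-Backed Security",
        ideal_output="A bundle of home loans sold to investors, who receive the mortgage payments.",
        must_mention=("loan", "investor"),
    ),
]


def passes(case: EvalCase, output: str) -> bool:
    """Crude, repeatable check: required keywords present and length within budget."""
    text = output.lower()
    has_keywords = all(keyword in text for keyword in case.must_mention)
    short_enough = len(output.split()) <= case.max_words
    return has_keywords and short_enough
```

A keyword check like this is deliberately blunt. Its value is that it gives the same verdict every time you run it, which "it looks okay" never does.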
Building Your Evaluation Loop
Here’s the cycle: Prompt -> Output -> Evaluate -> Hypothesize -> Revise -> Repeat.
Let’s make this concrete. Say we’re building a bot to explain financial jargon. Our first shot might be a simple zero-shot prompt.
# Example Initial Prompt
prompt_v1 = """
Explain the following term: Quantitative Easing
"""
The output is… fine. It’s a textbook definition. But we wanted something for a layperson. So we evaluate: “Fails on style - too academic.” Our hypothesis is that specifying the audience will help.
# Iteration: Specify Audience
prompt_v2 = """
Explain the following financial term to a beginner who has no economics background. Use a simple analogy.
Term: Quantitative Easing
"""
Better! The model might output something about the central bank being “like a parent giving their kid more allowance to get them to spend.” Now we evaluate again. Is the analogy sound? Mostly. But what if we try “Mortgage-Backed Security”? The analogy might break down. We found an edge case! We need to handle terms that don’t lend themselves easily to simple analogies. Our next hypothesis: we need to give the model a way out.
# Iteration: Handle Edge Cases Gracefully
prompt_v3 = """
Explain the following financial term to a beginner who has no economics background.
- First, try to use a simple, relatable analogy.
- If a good analogy is too difficult, provide two clear, concise sentences explaining what it is and why it matters.
Term: {user_term}
"""
Now we’re getting robust. We’ve instructed the model on how to handle its own failure modes. This is a hallmark of a well-engineered prompt.
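To see the loop in motion, here is a minimal sketch that fills the `{user_term}` placeholder with `str.format` and scores each prompt version on the same set of terms. Everything beyond the prompt text is an assumption for the demo: `fake_generate` stands in for whatever model client you actually use (its canned replies are invented), and `has_analogy` is a very rough stand-in for a real style check.

```python
from typing import Callable

# The three prompt versions from above, all using the same placeholder.
PROMPTS = {
    "v1": "Explain the following term: {user_term}",
    "v2": (
        "Explain the following financial term to a beginner who has no "
        "economics background. Use a simple analogy.\nTerm: {user_term}"
    ),
    "v3": (
        "Explain the following financial term to a beginner who has no "
        "economics background.\n"
        "- First, try to use a simple, relatable analogy.\n"
        "- If a good analogy is too difficult, provide two clear, concise "
        "sentences explaining what it is and why it matters.\n"
        "Term: {user_term}"
    ),
}


def fake_generate(prompt: str) -> str:
    """Stand-in for a real model call; the canned replies are invented."""
    if "beginner" in prompt:
        return "It is like a parent giving a kid more allowance so they spend more."
    return "An unconventional monetary policy instrument used by central banks."


def has_analogy(output: str) -> bool:
    """Very rough style check: does the text signal a comparison?"""
    return " like " in f" {output.lower()} "


def pass_rate(
    template: str,
    terms: list,
    generate: Callable[[str], str],
    check: Callable[[str], bool],
) -> float:
    """Fraction of terms whose generated output passes the check."""
    outputs = [generate(template.format(user_term=term)) for term in terms]
    return sum(check(out) for out in outputs) / len(outputs)


TERMS = ["Quantitative Easing", "Mortgage-Backed Security"]
scores = {
    name: pass_rate(template, TERMS, fake_generate, has_analogy)
    for name, template in PROMPTS.items()
}
```

With a real model behind `generate`, the numbers in `scores` are what turn "v2 feels better" into a claim you can check. One caveat with `str.format`: literal braces in a template must be doubled (`{{` and `}}`).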
Leveraging the AI to Evaluate the AI
Manually judging outputs is a bottleneck. Here’s a pro-level move: use an LLM to help evaluate its own (or another model’s) work. This is often called LLM-as-a-judge, and the prompt that does the judging is a critique prompt.
Let’s say you need the output to be under 100 words. You can automate a length check with ordinary code, but what about relevance? For that, you can make a second model call and ask it to act as the judge.
# Example of a simple critique prompt
evaluation_prompt = """
You are a strict editor. Review the following text and determine if it fulfills the goal.
Goal: Explain 'Quantitative Easing' to a beginner using a simple analogy.
Text: {generated_text}
Answer only 'YES' if the text is a relevant, simple, and correct analogy. Answer 'NO' if it is not, or if it is overly complex or inaccurate.
"""
You can run this against dozens of outputs to get a quick success rate. It’s not perfect: the judge is itself a model and can be wrong, so spot-check a sample of its verdicts by hand. But it scales. For high-stakes applications, you’d use a more sophisticated rubric, but the principle is the same: automate the boring parts of evaluation so you can focus on the nuanced ones.
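Here is one way the two checks might fit together: a deterministic word-count gate first, then the critique prompt for anything that survives it. This is a sketch under assumptions, not a reference implementation. `fake_judge` is a hypothetical stand-in for the second model call, and treating any reply that does not start with YES as a failure is a design choice made for this example.

```python
from typing import Callable

EVALUATION_PROMPT = """
You are a strict editor. Review the following text and determine if it fulfills the goal.
Goal: Explain 'Quantitative Easing' to a beginner using a simple analogy.
Text: {generated_text}
Answer only 'YES' if the text is a relevant, simple, and correct analogy. Answer 'NO' if it is not, or if it is overly complex or inaccurate.
"""


def within_word_limit(text: str, limit: int = 100) -> bool:
    """Deterministic check: strictly under `limit` words."""
    return len(text.split()) < limit


def parse_verdict(reply: str) -> bool:
    """Only a reply starting with YES counts as a pass; anything else fails."""
    return reply.strip().upper().startswith("YES")


def success_rate(outputs: list, judge: Callable[[str], str], limit: int = 100) -> float:
    """Share of outputs that pass both the length gate and the judge."""
    passed = 0
    for text in outputs:
        if not within_word_limit(text, limit):
            continue  # fail fast on length; no need to spend a model call
        reply = judge(EVALUATION_PROMPT.format(generated_text=text))
        passed += int(parse_verdict(reply))
    return passed / len(outputs)


def fake_judge(prompt: str) -> str:
    """Stand-in for a second model call; a real judge would be an LLM."""
    return "YES" if "allowance" in prompt else "NO"


sample_outputs = [
    "It is like a parent giving a kid more allowance.",
    "QE is an unconventional monetary policy.",
    "word " * 150,  # too long, never reaches the judge
]
rate = success_rate(sample_outputs, fake_judge)
```

Failing closed in `parse_verdict` means a rambling or malformed judge reply lowers the score instead of silently inflating it.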
Common Pitfalls and The Art of The Fix
- The XY Problem: You keep iterating on the prompt for “write a function to sort a list” when what you really needed was “check if this data is already sorted.” You’re solving the wrong problem. Step back and question the initial task. This happens more often than you’d think.
- Overfitting to Weirdness: You get one bizarre output for a rare edge case and you rewrite the entire prompt to prevent that one thing, making it worse for the 99% of normal cases. Don’t let the tail wag the dog. Handle edge cases with targeted instructions (like we did above) rather than breaking the core prompt.
- Ignoring Context Length: Your prompt works great for short tasks but becomes an incoherent mess when the conversation history gets long. The model loses the plot. Be aware of the token limit and structure your prompts to be resilient within the context window you have.
- Chasing Perfection: You will never, ever create a prompt that is 100% flawless on the first try for every possible input. The goal is not perfection; it’s reliability and a known, acceptable error rate. Know when to stop iterating and ship it.
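Two of these pitfalls, overfitting to one odd case and chasing perfection, can be kept in check by comparing per-case results before and after a prompt change. The sketch below assumes you already have pass/fail results keyed by case name for the same evaluation set. The case names are made up, and the 5% error threshold is an arbitrary placeholder to replace with whatever rate your application can tolerate.

```python
def compare_runs(old: dict, new: dict) -> dict:
    """Summarize what a prompt change fixed, what it broke, and the new error rate.

    `old` and `new` map case name -> bool (passed) for the same evaluation set.
    """
    fixed = sorted(case for case in new if new[case] and not old[case])
    broken = sorted(case for case in new if old[case] and not new[case])
    error_rate = 1 - sum(new.values()) / len(new)
    return {"fixed": fixed, "broken": broken, "error_rate": error_rate}


def should_ship(report: dict, max_error_rate: float = 0.05) -> bool:
    """Ship only if nothing regressed and the error rate is acceptable."""
    return not report["broken"] and report["error_rate"] <= max_error_rate


# Hypothetical results: the revision fixed one edge case but broke a normal one.
old_results = {"qe": True, "mbs": False, "etf": True, "bond": True}
new_results = {"qe": True, "mbs": True, "etf": False, "bond": True}

report = compare_runs(old_results, new_results)
ship = should_ship(report)
```

Here the revision rescues the edge case but breaks a previously passing one, so `should_ship` says no. That is the tail wagging the dog, caught before it reaches users.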
The key insight is that iteration isn’t a sign of failure; it’s the entire process. Every “bad” output isn’t a mistake—it’s data. It’s the model telling you, “I misunderstood what you meant by ‘simple’” or “I need a clearer boundary on this topic.” Your job is to listen, diagnose, and clarify. Now go on, have that conversation.