Benchmarks | mikePietsch.com

32.7 A/B Testing LLM Prompts and Models

Right, so you’ve crafted what you think is the perfect prompt. You’ve tweaked it, you’ve whispered sweet nothings to it, and you’re pretty sure it’s going to produce pure gold. But are you? Or are you just high on your own supply of syntactic cleverness? This is where we stop guessing and start measuring. We’re going to A/B test this thing, because in the world of LLMs, your intuition is often a liar.

32.6 Evals Framework: OpenAI Evals and Custom Evaluation Harnesses

Right, so you’ve built your RAG pipeline. You’ve got your vector store humming, your chunking strategy is… well, it exists, and you’re ready to unleash this marvel upon your users. But how do you know it’s not about to confidently tell them that the capital of France is a delicious pastry? You don’t. Not until you build a rigorous evaluation framework. This is where we move from “hoping it works” to knowing it works.

32.5 RAG Evaluation with RAGAS: Faithfulness, Answer Relevancy, Context Recall

Right, so you’ve built your RAG pipeline. You’ve chunked your documents, you’ve got a fancy vector store, and you’re feeling pretty good about yourself. Then you ask it a simple question like “What year was the company founded?” and it confidently tells you “The company was founded in 1492, primarily to explore new trade routes to the Indies using large language models.” Fantastic. You’ve just been introduced to the number one problem in RAG: your system is lying to you with information it found in your own documents.

32.4 Hallucination Detection: Fact-Checking and Grounding

Right, let’s talk about the LLM’s most infamous party trick: hallucination. It’s not the fun, psychedelic kind. It’s the “I will confidently state that the capital of France is Berlin because it sounds right” kind. As you start building systems on top of these models, this isn’t just a quirky bug; it’s a critical failure mode that can torpedo user trust, business logic, and your reputation. So, how do we catch these fabrications before they escape into the wild? We ground them and we fact-check them.

32.3 LLM-as-a-Judge: Using GPT-4 to Evaluate LLM Outputs

Right, so you’ve built your RAG pipeline or fine-tuned your model. It feels better. But does it perform better? You can’t just eyeball a few cherry-picked outputs and call it a day. You need data. You need metrics. And hiring a team of PhDs to manually score thousands of responses is, to put it mildly, a non-starter. Enter one of the more clever and slightly meta ideas in this space: using a powerful, general-purpose LLM like GPT-4 as an automated judge. The premise is beautifully simple: if you can’t trust a smaller model to answer correctly, maybe you can trust a bigger, more expensive one to evaluate correctly. It’s like bringing in a celebrity critic to judge a local baking contest.

32.2 Standard Benchmarks: MMLU, HellaSwag, HumanEval, GSM8K, MATH

Right, let’s talk benchmarks. You can’t throw a rock in AI research without hitting a new paper claiming state-of-the-art performance, and these benchmarks are the rock-throwing targets. They’re the standardized tests of the LLM world: flawed, often infuriating, but for now, the best we’ve got to compare these digital oracles. Think of them less as a final exam and more like a physical for a pro athlete—they measure specific, important muscles, but they don’t tell you who’ll win the championship game.

32.1 Why LLM Evaluation Is Hard

Right, so you’ve built your fancy LLM application. It’s a beautiful RAG pipeline, a sleek agent, or maybe just a cleverly prompted chatbot. It works on your laptop. Your demo to the CEO was flawless. You’re feeling like a genius. Then you deploy it, and a user immediately asks, “So, according to your AI, Napoleon won the Battle of Waterloo with a fleet of hot air balloons,” and your entire sense of professional competence evaporates. Welcome to the thunderdome. Evaluating these things is brutally, hilariously difficult, and anyone who tells you otherwise is trying to sell you something.

32. LLM Evaluation: Benchmarks, Hallucination, and RAGAS

37.8 Benchmarking Best Practices and Avoiding Compiler Tricks

Right, let’s get our hands dirty. Benchmarking in Go is deceptively simple, which is precisely why so many people get it subtly, tragically wrong. The testing package gives you just enough rope to hang yourself with, and the compiler—oh, the clever, clever compiler—is actively looking for a reason to snip your code into oblivion. Our job is to outsmart it, to force it to show us the real performance cost, not the cost of a cleverly optimized mirage.

37.7 String Interning and bytes.Buffer vs strings.Builder

Right, let’s talk about strings. You love them, I love them, the Go runtime tolerates them. They’re the duct tape of our programs, holding everything together until they suddenly become the number one reason your elegant service is now gasping for memory like a fish on a sidewalk. The fundamental problem is that strings in Go are immutable. This is a fantastic feature for concurrency and safety, but a real pain when you’re building them up in a hot loop. Every time you write s += "new piece", you’re not just appending; you’re allocating a whole new string, copying both s and "new piece" into it, and then sending the old s off to be cleaned up by the garbage collector (GC). Do this a few thousand times and your GC is going to be working overtime, putting a serious damper on your throughput.
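Here is a minimal sketch of the two approaches side by side. The function names joinConcat and joinBuilder are made up for this illustration; strings.Builder and its Grow and WriteString methods are from the standard library.

```go
package main

import (
	"fmt"
	"strings"
)

// joinConcat builds the result with +=. Each iteration generally allocates
// a new string and copies everything accumulated so far into it.
func joinConcat(parts []string) string {
	s := ""
	for _, p := range parts {
		s += p
	}
	return s
}

// joinBuilder appends into a single growing buffer instead.
func joinBuilder(parts []string) string {
	total := 0
	for _, p := range parts {
		total += len(p)
	}
	var sb strings.Builder
	sb.Grow(total) // optional: reserve the final size up front
	for _, p := range parts {
		sb.WriteString(p)
	}
	return sb.String()
}

func main() {
	parts := []string{"a", "b", "c"}
	// Both produce the same string; they differ in how much garbage they create.
	fmt.Println(joinConcat(parts) == joinBuilder(parts))
}
```

Both functions return the same result, so the choice only matters once the loop is hot enough for the extra allocations to show up in a profile.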

37.6 Reducing Allocations: sync.Pool, Value Types, and Preallocating Slices

Right, let’s talk about allocations. In the world of Go, allocations are like trips to the garbage can: you have to do them, but if you’re running back and forth every five seconds, you’re not getting any real work done. The garbage collector is incredibly smart, but it’s not clairvoyant. Every time you escape to the heap, you’re giving it more work to do later, which means eventually, it will have to stop your world (or at least a big part of it) to clean up your mess.
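As a rough sketch of two of the techniques named in the section title, the snippet below preallocates a slice and reuses buffers through a sync.Pool. The names squares, bufPool, and greet are hypothetical and exist only for this example.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// squares reserves the slice's capacity up front, so append never has to
// grow the backing array inside the loop.
func squares(n int) []int {
	out := make([]int, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, i*i)
	}
	return out
}

// bufPool hands out reusable buffers instead of allocating one per call.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// greet is a hypothetical hot-path function that borrows a buffer from the pool.
func greet(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	defer bufPool.Put(buf)
	buf.Reset() // a pooled buffer may still hold data from its last user
	buf.WriteString("hello, ")
	buf.WriteString(name)
	return buf.String() // String copies, so returning the buffer to the pool is safe
}

func main() {
	fmt.Println(squares(5))
	fmt.Println(greet("gopher"))
}
```

Whether a pool actually pays off depends on the workload, so treat this as a pattern to measure rather than a default.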

37.5 Escape Analysis: go build -gcflags -m

Alright, let’s get our hands dirty with one of Go’s coolest party tricks: escape analysis. This isn’t some abstract academic concept; it’s the compiler’s way of making a crucial decision for you: “Should this variable live on the stack, nice and cheap, or does it need to escape to the heap, the land of garbage collection and slower allocations?” To see the compiler’s thought process laid bare, we use the -gcflags="-m" flag. Running go build -gcflags="-m" your_file.go will spit out a torrent of messages telling you what escapes; pass -m twice (-gcflags="-m -m") for more detail on why. Let’s decode this output together.
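A small file to try the flag on might look like the sketch below. The point type and both functions are invented for the example, and the exact wording of the compiler's messages varies between Go versions, so take the comments as what you would typically expect rather than guaranteed output.

```go
package main

import "fmt"

type point struct{ x, y int }

// byValue returns a copy of p, so the compiler can typically keep p on the stack.
func byValue() point {
	p := point{1, 2}
	return p
}

// byPointer returns the address of a local variable. p has to outlive the
// call, so escape analysis is expected to report it as moved to the heap.
func byPointer() *point {
	p := point{1, 2}
	return &p
}

func main() {
	v := byValue()
	p := byPointer()
	fmt.Println(v.x+v.y, p.x+p.y)
}
```

Building this with go build -gcflags="-m" should mention p in byPointer, and will likely also report the arguments passed to fmt.Println, since values handed to an interface parameter commonly escape.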

37.4 go tool trace: Goroutine and Scheduler Traces

Alright, let’s get our hands dirty with go tool trace. You’ve probably been staring at CPU and memory profiles until your eyes cross, wondering why your beautifully concurrent Go application isn’t going as fast as it should. Sometimes, the problem isn’t what your code is doing, but how and when the goroutines are being scheduled to do it. That’s where the execution tracer comes in. It’s like getting a top-down view of a busy highway system; a CPU profile just tells you which cars are revving their engines the hardest.

37.3 go tool pprof: Reading Profiles and Flame Graphs

Right, let’s get our hands dirty. You’ve just run your Go service under pprof, you’ve captured a profile, and now you’re staring at a terminal prompt or a scary-looking SVG. It feels like you’ve been handed the blueprints to a skyscraper written in a foreign language. Don’t panic. We’re going to learn that language together. The first thing to internalize is that pprof is not a single tool; it’s a Swiss Army knife with a dozen blades. The most common profiles you’ll grab are the CPU profile and the Heap (memory) profile. They answer two fundamentally different questions: “What is burning my CPU time?” and “Where is my memory getting allocated?”.

37.2 net/http/pprof: Live Profiling of Running Servers

Right, so you’ve got a service running. It’s chugging along, but something’s off. Maybe it’s a bit sluggish under load, or perhaps memory usage is doing a concerning impression of a ski jump. You need to see what’s happening right now, on its terms, in production. You don’t get to stop the world and attach a debugger. This is where net/http/pprof becomes your best friend—a Swiss Army knife that’s mostly sharp blades for introspection.

37.1 pprof: CPU and Memory Profiling

Right, let’s talk about pprof. This isn’t some abstract academic concept; it’s the scalpel you use when your application starts coughing up blood. You don’t just “think” your code is slow—you know it, with data. pprof is how you get that data. It’s the single most powerful tool in the Go profiler’s arsenal, and it’s built right into the standard library. The designers at Google, for all their quirks, absolutely nailed this one.

37. Performance Profiling and Optimization in Go

29.8 testify: Assertions, Suites, and Mocks

Alright, let’s talk about testify. You’ve probably already felt the raw, existential pain of writing a test with the standard library’s testing package and thought, “There has to be a better way than if got != want { t.Errorf(...) } for the ten thousandth time.” You’re right. There is. Enter testify. This third-party library is practically part of the standard library at this point, given its ubiquity. It’s a toolkit that gives you three big weapons: assertions to make your test conditions readable, test suites to structure your tests, and mocks to, well, mock things. Let’s break it down.

29.7 go test Flags: -run, -bench, -count, -race, -cover

Right, let’s talk about go test flags. This is where you stop just running tests and start interrogating them. The default go test is polite; it runs your tests and tells you if they passed. These flags are how you get it to spill its guts, confess its secrets, and do a little performance art for you. We’ll focus on the ones you’ll use daily.

The -run Flag: Your Test Search Bar

The -run flag is your first line of defense against running your entire 5,000-test suite when you just tweaked one function. It takes a regular expression and only runs tests whose names match it. Simple, right? The devil is in the details.
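One of those details is that the pattern is not anchored, so it matches anywhere in the name. The sketch below approximates that with the regexp package; the selected function and the test names are hypothetical, and the real go test also splits the pattern on slashes to match subtests, which this ignores.

```go
package main

import (
	"fmt"
	"regexp"
)

// selected returns the names an unanchored pattern would pick, roughly the
// way -run filters top-level test names.
func selected(pattern string, names []string) []string {
	re := regexp.MustCompile(pattern)
	var out []string
	for _, n := range names {
		if re.MatchString(n) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	names := []string{"TestUser", "TestUserCreate", "TestAdminUser", "TestOrder"}

	// Matches anywhere in the name: TestUser and TestUserCreate.
	fmt.Println(selected("TestUser", names))

	// A shorter pattern is broader: it also picks up TestAdminUser.
	fmt.Println(selected("User", names))

	// Anchoring both ends selects exactly one test.
	fmt.Println(selected("^TestUser$", names))
}
```

On the command line the anchored form would be go test -run '^TestUser$', quoted so the shell leaves the dollar sign alone.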

29.6 b.ReportAllocs and b.ResetTimer

Right, let’s talk about making your benchmarks actually mean something. You’ve probably written a simple one, run it with go test -bench=., and stared at a number. But that number is a liar. It’s a filthy, opportunistic liar that will include all the setup time you did outside the loop, the time it took to generate your test data, and the cost of that one fmt.Printf you left in there “just for debugging.” We’re not here to be lied to. We’re here to get the truth, and for that, we have b.ResetTimer() and b.ReportAllocs().
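A minimal sketch of both calls in one benchmark follows. makeInput, BenchmarkSort, and runSort are hypothetical names; under go test the benchmark would sit in a _test.go file, but here testing.Benchmark from the standard library runs it from main so the file works on its own.

```go
package main

import (
	"fmt"
	"sort"
	"testing"
)

// makeInput is hypothetical setup work that should not be counted.
func makeInput(n int) []int {
	s := make([]int, n)
	for i := range s {
		s[i] = n - i
	}
	return s
}

func BenchmarkSort(b *testing.B) {
	input := makeInput(10000) // setup, done once
	scratch := make([]int, len(input))

	b.ReportAllocs() // include B/op and allocs/op in the result
	b.ResetTimer()   // discard the time and allocations spent on setup above

	for i := 0; i < b.N; i++ {
		copy(scratch, input) // restore unsorted data for each iteration
		sort.Ints(scratch)
	}
}

// runSort executes the benchmark without the go test command.
func runSort() testing.BenchmarkResult {
	return testing.Benchmark(BenchmarkSort)
}

func main() {
	res := runSort()
	fmt.Println(res.String(), res.MemString())
}
```

Note that the copy inside the loop is still timed. If that matters for your measurement, b.StopTimer and b.StartTimer can bracket it, at the cost of some overhead of their own.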

29.5 Benchmarks: func BenchmarkXxx(b *testing.B)

Right, so you’ve written some code and it doesn’t explode. Congratulations. But is it fast? Or, more importantly, is it fast enough? And how do you know if your latest “optimization” actually made things better or just made the code look like a Rube Goldberg machine? You don’t guess. You benchmark. In Go, benchmarking isn’t a dark art; it’s a first-class citizen built right into the testing package. A benchmark function looks almost identical to a test function, but it uses a different parameter: *testing.B instead of *testing.T.
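To make the shape concrete, here is a small sketch. BenchmarkRepeat and runRepeat are hypothetical names; normally the benchmark would live in a _test.go file and run via go test -bench=., but testing.Benchmark lets this file run it directly.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// BenchmarkRepeat has the BenchmarkXxx(b *testing.B) signature. The framework
// picks b.N and grows it until the timing is stable enough to report.
func BenchmarkRepeat(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = strings.Repeat("go", 64)
	}
}

// runRepeat runs the benchmark outside of the go test command.
func runRepeat() testing.BenchmarkResult {
	return testing.Benchmark(BenchmarkRepeat)
}

func main() {
	res := runRepeat()
	fmt.Println(res) // iteration count and ns/op
}
```

The numbers printed depend entirely on the machine, so compare runs on the same hardware rather than reading them as absolutes.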

29.4 t.Helper: Better Error Attribution in Helper Functions

Right, so you’ve written a helper function for your tests. It’s a beautiful, DRY little piece of logic that you’re rightfully proud of. You call it from three different test cases. Then you run go test and it fails. The output hits your terminal:

    --- FAIL: TestSomethingImportant (0.00s)
        my_test.go:47: Expected user to be active.

You stare at the screen. Line 47? Which one of the three test cases called the helper? Which set of inputs caused this failure? You now have to play detective, tracing back through your test logic to figure out which specific scenario just blew up. This is annoying, and it violates a core principle of good testing: failures should be immediately obvious.

This is where t.Helper() comes in: it’s the way you tell the testing framework, “Hey, when you report a failure, blame the function that called me, not me.”
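In code, that looks roughly like the sketch below. User, NewUser, requireActive, and TestUsers are all hypothetical names. In a real project the helper and the test would live in a _test.go file; the small main is only there so the sketch builds on its own.

```go
package main

import (
	"fmt"
	"testing"
)

// User is a hypothetical type for the example.
type User struct {
	Name   string
	Active bool
}

// NewUser is a hypothetical constructor under test.
func NewUser(name string) User {
	return User{Name: name, Active: true}
}

// requireActive is a shared test helper. It accepts testing.TB so it can be
// called from tests and benchmarks alike.
func requireActive(t testing.TB, u User) {
	t.Helper() // report failures at the caller's line, not inside this function
	if !u.Active {
		t.Errorf("expected user %q to be active", u.Name)
	}
}

func TestUsers(t *testing.T) {
	requireActive(t, NewUser("ada"))
	requireActive(t, NewUser("bob")) // a failure for bob would point at this line
}

// main exists only so the sketch compiles as a standalone program.
func main() {
	fmt.Println(NewUser("ada"))
}
```

Putting the offending value in the message, as the Errorf call does with the user name, helps with the "which inputs?" question even before the line number does.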

29.3 t.Run: Subtests and Parallel Subtests

Right, so you’ve written a table-driven test. It’s clean, it’s elegant, and you’re feeling pretty good about yourself. And you should. But now you run go test -v and you’re greeted with a single monolithic result for TestMyFunction, and when case #7 fails, you have to squint at the output to figure out which specific input scenario just blew up. And heaven forbid you want to run just that one failing scenario to debug it. You can’t. You have to run the whole table.

29.2 Table-Driven Tests: Slices of Test Cases

Right, let’s talk about table-driven tests. If you’re still writing tests by copying and pasting a test function and changing one or two values, I’m going to ask you politely, yet firmly, to stop. You’re not just creating more code to maintain; you’re missing out on one of the most elegant and powerful patterns in the Go testing ecosystem. The idea is brilliantly simple: you separate your test logic from your test data. Your test function becomes a single, clean engine, and your test cases become a slice of data that engine processes. It’s the difference between hand-crafting each meatball and having a perfectly calibrated meatball-making machine.
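A bare-bones version of that machine might look like this. Abs and TestAbs are hypothetical names; the test would normally sit in a _test.go file, and the main function is only there so the sketch compiles by itself.

```go
package main

import (
	"fmt"
	"testing"
)

// Abs is a hypothetical function under test.
func Abs(n int) int {
	if n < 0 {
		return -n
	}
	return n
}

// TestAbs keeps the logic in one loop and the cases in a slice of structs.
func TestAbs(t *testing.T) {
	cases := []struct {
		name string
		in   int
		want int
	}{
		{"positive", 3, 3},
		{"negative", -3, 3},
		{"zero", 0, 0},
	}
	for _, tc := range cases {
		if got := Abs(tc.in); got != tc.want {
			t.Errorf("%s: Abs(%d) = %d, want %d", tc.name, tc.in, got, tc.want)
		}
	}
}

// main exists only so the sketch compiles as a standalone program.
func main() {
	fmt.Println(Abs(-3))
}
```

Adding a new scenario is now a one-line change to the table, and the name field keeps failure messages readable.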

29.1 Writing Tests: func TestXxx(t *testing.T)

Right, so you’ve decided to write a test. Good for you. It’s the responsible thing to do, like flossing or not putting off that oil change. And in Go, the entry point for this particular brand of responsibility is a function named TestXxx, where Xxx is anything that doesn’t start with a lowercase letter. It’s not a suggestion; it’s how the go test command finds your work. You’ll be handed a *testing.T—think of it as your all-access pass to the test framework, your bullhorn for shouting about failures, and your notepad for logging what the heck is going on.
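For reference, the smallest useful version looks something like this sketch. Add and TestAdd are hypothetical names. In a real package, TestAdd would live in a file whose name ends in _test.go so that go test can find it; the main function here is only so the file builds on its own.

```go
package main

import (
	"fmt"
	"testing"
)

// Add is a hypothetical function under test.
func Add(a, b int) int {
	return a + b
}

// TestAdd follows the TestXxx naming rule and takes the *testing.T that
// go test hands to every test function.
func TestAdd(t *testing.T) {
	got := Add(2, 3)
	want := 5
	t.Logf("Add(2, 3) = %d", got) // shown with -v, or when the test fails
	if got != want {
		t.Errorf("Add(2, 3) = %d, want %d", got, want) // marks failure, keeps going
	}
}

// main exists only so the sketch compiles as a standalone program.
func main() {
	fmt.Println(Add(2, 3))
}
```

t.Errorf records the failure and lets the test continue, while t.Fatalf would stop the test function on the spot; which one fits depends on whether later checks still make sense after the first failure.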