30.2 Seed Corpus and Generated Inputs

Right, let’s talk about the one thing that separates a fuzzer that finds real bugs from one that just makes your CPU fan sing the song of its people: the input. You can’t just throw a fuzzer at your code and hope it magically stumbles upon the malloc call that will make your program weep. You have to give it a head start. This is where your seed corpus and its generated offspring come in.

Think of your fuzzer not as a magic bug-finding machine, but as a very, very fast and stupid intern. If you tell it to “test the image decoder,” it might try to feed it a copy of the corporate HR manual. It’ll try, god bless it, but it won’t get far. If, instead, you give it a folder of sample JPEGs and PNGs (the seed corpus), it can start mutating those—flipping bits, adding chunks, truncating files—to create new, bizarre, and potentially program-shattering test cases (the generated inputs). You’re not doing the work for it; you’re just giving it a sensible starting point so it doesn’t spend its first million cycles trying to parse “Hello World” as a TIFF.

The Seed Corpus: Your Fuzzer’s Starter Kit

Your seed corpus is a collection of small, valid, and interesting input files. “Interesting” is the key word here. You want a diversity of structures and edge cases.

Valid inputs: These are crucial. They teach the fuzzer what a “good” file looks like, so it knows what structures to mutate instead of blindly corrupting. A valid PDF makes a better seed for a PDF parser than a text file containing “lol i’m a pdf.”
Small inputs: The fuzzer is going to run millions of these. A 10 MB seed is slow to load, slow to mutate, and slow to execute, and most random mutations of it land in bytes that don’t matter. A 50-byte input can often trigger the same bug as a 5 MB one, and the fuzzer can try thousands of variants of it in the time a single large file takes.
Edge cases: Include the weird stuff. A zero-byte file. A file with a massive header. A file that’s all zero bytes. A file that’s all 0xFF bytes. This isn’t cheating; it’s curating. You’re seeding the fuzzer’s imagination with the kind of chaos you know exists in the wild.

Here’s how you might build a simple corpus for a hypothetical parser that understands a key:value format:

# Create a directory for your corpus seeds (-p keeps this safe to re-run)
mkdir -p ./corpus

# Create a few simple, valid seeds
echo "name:Alice" > ./corpus/seed1.txt
echo "id:1001" > ./corpus/seed2.txt
# An edge case: an empty value
echo "title:" > ./corpus/seed3.txt
# Another edge case: a potentially long value
echo "data:AAAAAAAAAAAAAAAAAAAA" > ./corpus/seed4.txt

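You’d then point your fuzzer at this ./corpus directory. libFuzzer takes it as a positional argument to the fuzz binary; AFL takes it with -i and writes what it generates to a separate directory given with -o.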
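If you are using Go’s built-in fuzzer rather than libFuzzer or AFL, seeds are usually registered in code with f.Add instead of being dropped into a directory. Here is a minimal sketch, assuming a hypothetical parseKV function that stands in for the real parser. FuzzMyParser is the target name used with go test later in this section.

```go
// Save as kv_fuzz_test.go; the package name is arbitrary.
package main

import (
	"strings"
	"testing"
)

// parseKV is a hypothetical stand-in for the parser under test.
// It splits "key:value" at the first colon and rejects empty keys.
func parseKV(line string) (key, value string, ok bool) {
	line = strings.TrimSuffix(line, "\n")
	key, value, found := strings.Cut(line, ":")
	if !found || key == "" {
		return "", "", false
	}
	return key, value, true
}

func FuzzMyParser(f *testing.F) {
	// The same four seeds as the shell example above.
	seeds := []string{"name:Alice", "id:1001", "title:", "data:AAAAAAAAAAAAAAAAAAAA"}
	for _, seed := range seeds {
		f.Add(seed)
	}
	f.Fuzz(func(t *testing.T, line string) {
		key, value, ok := parseKV(line)
		if !ok {
			return // rejecting malformed input is fine; panicking is not
		}
		// Property: a parsed pair reassembles to the input minus one trailing newline.
		if got := key + ":" + value; got != strings.TrimSuffix(line, "\n") {
			t.Errorf("round trip mismatch: %q -> %q", line, got)
		}
	})
}
```

The round-trip check is only one possible property. Even a fuzz target with no assertions at all will still catch panics and hangs.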

How The Magic Happens: Mutation Strategies

The fuzzer isn’t just randomly flipping bits. Well, it is, but it’s guided randomness. It uses smart mutation strategies on your seeds to generate new inputs. It might:

Flip bits: The classic. Change a single bit in a byte. A (0x41) becomes Q (0x51).
Insert/Delete bytes: Suddenly your neatly formatted input has an extra colon or is missing a crucial digit.
Splice parts: Take a chunk from one successful input and jam it into another. This is how it builds complex structures from simple ones.
Havoc: Apply multiple random mutations at once. This is where the real, beautiful breakage usually occurs.
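
As a rough illustration, the first three strategies are small byte-slice operations. The function names below are made up for this sketch and do not correspond to any particular fuzzer’s internals.

```go
package main

import "fmt"

// flipBit returns a copy of in with one bit inverted.
// Bit 0 is the least significant bit of the byte at index pos.
func flipBit(in []byte, pos int, bit uint) []byte {
	out := append([]byte(nil), in...)
	out[pos] ^= 1 << bit
	return out
}

// insertByte returns a copy of in with b inserted before index pos.
func insertByte(in []byte, pos int, b byte) []byte {
	out := make([]byte, 0, len(in)+1)
	out = append(out, in[:pos]...)
	out = append(out, b)
	return append(out, in[pos:]...)
}

// deleteByte returns a copy of in without the byte at index pos.
// The input must be non-empty and pos must be a valid index.
func deleteByte(in []byte, pos int) []byte {
	out := make([]byte, 0, len(in)-1)
	out = append(out, in[:pos]...)
	return append(out, in[pos+1:]...)
}

// splice joins the first cutA bytes of a to the bytes of b from cutB onward.
func splice(a, b []byte, cutA, cutB int) []byte {
	out := append([]byte(nil), a[:cutA]...)
	return append(out, b[cutB:]...)
}

func main() {
	seed := []byte("name:Alice")
	other := []byte("id:1001")
	fmt.Printf("%q\n", flipBit([]byte("A"), 0, 4)) // "Q"
	fmt.Printf("%q\n", insertByte(seed, 4, ':'))   // "name::Alice"
	fmt.Printf("%q\n", deleteByte(other, 3))       // "id:001"
	fmt.Printf("%q\n", splice(seed, other, 5, 3))  // "name:1001"
}
```

Havoc is then just several of these applied in a row at random positions.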

The fuzzer’s feedback mechanism (coverage guidance) tells it which of these mutations actually led the program to execute new code paths. It saves those inputs back into the corpus to be mutated again, a Darwinian evolution in which each generation reaches deeper into the code, and deeper code is where the crashes tend to hide.

The Generated Corpus: Don’t Throw It Away!

This is the most common rookie mistake. You run the fuzzer for a day, it finds a crash, and you stop. You fix the bug and delete the thousands of generated files it created. You have just thrown away the most valuable part of the exercise.

Those generated files are a map of how to reach your program’s darkest corners. They represent hours of computation spent learning which byte patterns unlock which code paths. You must save the generated corpus. Most coverage-guided fuzzers persist it for you: libFuzzer writes new interesting inputs into the first corpus directory you pass it, while AFL keeps them in the queue folder under its output directory. This is how your fuzzer gets smarter across runs. Next time you start it, it begins not from the simple seeds, but from the advanced, weaponized inputs it created last time. Carrying the corpus forward like this is the superpower of coverage-guided fuzzing. (Corpus distillation, a term you will also run into, means something narrower: pruning a corpus down to the smallest set of inputs that still reaches the same coverage.)

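With Go’s built-in fuzzer (go test -fuzz), this happens without any setup. Inputs that expand coverage are stored in the Go build cache under $GOCACHE/fuzz, so they survive between runs until you run go clean -fuzzcache. Inputs that make the fuzz target fail are written to testdata/fuzz/<FuzzName> inside your package; commit those, because a plain go test replays them as regression tests.

# First run: starts from the seeds and caches every coverage-expanding input
go test -fuzz=FuzzMyParser -fuzztime=1h

# The generated corpus is now under $(go env GOCACHE)/fuzz;
# any failing input is in ./testdata/fuzz/FuzzMyParser
# Second run picks up the cached corpus and digs deeper
go test -fuzz=FuzzMyParser -fuzztime=1h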

The initial seeds are the spark, but the generated corpus is the raging fire. Never, ever extinguish it.