Evals and Golden Datasets

Evals are the mechanism that turns a working agent into a reliable one. Here is what they are, how golden datasets work, and how to design test cases that actually find failures.

April 12, 2026 · 11 min read · 7 / 10

The last post finished the UI layer. The agent draws on the canvas, responds in chat, and shows tool status as it works. You could ship it right now. But it is not reliable, and reliability is the actual job.

This is where evals come in. In my experience, this is the part that separates engineers who can build AI features from engineers who can make those features dependable enough to matter. It is also the part most developers have never done before, because deterministic code does not need it.

Why Evals Matter

The agent has a quality problem, but "give it a better prompt" is not a real answer. What does "better" mean? You can only find out by measuring.

If you are using an off-the-shelf agent you did not build, your options are narrow: change the prompt, maybe add MCP tools, not much else. But when you build the agent yourself, you can do anything. You can add tools, change the model, restructure the context, add a classifier, swap out the tool loop entirely. The problem is deciding which of those things will actually help. That is what evals are for.

Evals are similar to unit tests, but they are not the same thing. Unit tests give you a binary answer: pass or fail. Evals give you a score: better or worse. AI is probabilistic. A single run might succeed. Ten runs might show an 80% success rate. You cannot test your way to confidence with a single pass/fail check when the underlying system is non-deterministic. You need a way to track the probability of a good outcome over time.

The four things evals let you do:

Establish a baseline. Before you can improve anything, you need to know how bad it is right now. Not a gut feeling, an actual number.

Catch regressions. One sentence added to a system prompt can break something it previously did well. Evals run after every change so you can see what moved in either direction. The sentence you cannot remove is the one that cost someone two weeks to figure out.

Prove an improvement worked. Shipping a change and calling it better based on vibes is not engineering. Evals let you show that a change moved a metric.

Expose weaknesses. The agent only covers what it has tools for. Evals will surface the cases where users ask for something the agent cannot handle, or where the tool that should handle it produces poor output.

If you cannot measure it, you cannot improve it. That is the whole argument.

What a Golden Dataset Is

A golden dataset is a curated set of inputs with their expected outputs. It is the reference you compare your agent against. The name comes from general data engineering, where the "golden" copy is the one you trust.

A concrete example: imagine you have 100 screenshot tests for a web app built in Vue. You decide to rewrite the whole app in React. The functionality should be identical, so every screen should look the same. When you run the screenshot tests against the new React app, you compare each output against the Vue screenshots. Any pixel difference means the React version is wrong, because the Vue screenshots are the validated truth. Those Vue screenshots are the golden dataset.

For an AI agent, the golden dataset works the same way. You define inputs (things users might say to your agent) and expected outputs (what a correct response should look like). You run the agent on each input, compare the result to the expected output, and score the difference.
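A minimal sketch of that loop in Python. The names `run_eval`, `score`, and `stub_agent` are hypothetical, as is the dictionary shape of the outputs; the stub exists only so the example runs without a real agent behind it.

```python
from typing import Callable

# Hypothetical golden dataset: each entry pairs an input with the output we trust.
GOLDEN = [
    {"input": "two-step flowchart", "expected": {"nodes": 2, "edges": 1}},
    {"input": "three-step flowchart", "expected": {"nodes": 3, "edges": 2}},
]


def score(actual: dict, expected: dict) -> float:
    """Fraction of expected fields that the actual output matches exactly."""
    if not expected:
        return 1.0
    hits = sum(1 for key, value in expected.items() if actual.get(key) == value)
    return hits / len(expected)


def run_eval(run_agent: Callable[[str], dict], dataset: list[dict]) -> list[dict]:
    """Run the agent on every golden input and score it against the expected output."""
    results = []
    for case in dataset:
        actual = run_agent(case["input"])
        results.append({"input": case["input"], "score": score(actual, case["expected"])})
    return results


def stub_agent(prompt: str) -> dict:
    """Stand-in for the real agent: always draws two nodes and one edge."""
    return {"nodes": 2, "edges": 1}


results = run_eval(stub_agent, GOLDEN)
mean_score = sum(r["score"] for r in results) / len(results)
```

Exact field matching is the crudest possible scorer. It is enough to show the shape of the loop: input in, output compared to the trusted reference, number out.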

Building the dataset is your job as the person who built the agent. You understand the purpose of the tool. You know what users will ask. You can write down what a good response looks like. That knowledge does not live anywhere else.

What Goes Into the Dataset

A useful golden dataset has four things:

Expected inputs. What will users actually say? Think broadly. What will they definitely say, what might they say, what do you want to support, and what do you explicitly not want to support? All of those are valid test cases.

Expected characteristics. For each input, what does a good response look like? For the diagram agent, this might mean: the correct number of nodes, no overlapping elements, the right shapes, a layout that a human could read. Write this down explicitly. If the agent generates something you would accept, that is your expected output.

Difficulty levels. Not all test cases are equal. A simple case might be "make a basic flowchart with three steps." A medium case might be a multi-entity relationship diagram with specific connections. A hard case might be a full microservice architecture with domain knowledge baked in. Tracking difficulty lets you see how your agent performs at the edges, not just on ideal inputs.

Categories. Group test cases by the failure pattern they are testing. Tool coverage failures go in one bucket. Context quality failures go in another. Layout problems in another. When you decide to work on tool coverage for a few weeks, you want to pull up just that bucket and see if your changes moved the needle.
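Those four things map naturally onto a record with four fields. The sketch below is one possible shape, not the dataset format used later in this series; `GoldenCase`, `select`, the field names, and the example entries are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Difficulty(Enum):
    SIMPLE = "simple"
    MEDIUM = "medium"
    HARD = "hard"


@dataclass
class GoldenCase:
    prompt: str             # expected input: what a user might say
    expected: dict          # expected characteristics of an acceptable response
    difficulty: Difficulty  # simple, medium, or hard
    category: str           # the failure pattern this case targets


# Illustrative entries only; the field names and values are assumptions.
DATASET = [
    GoldenCase(
        prompt="Make a basic flowchart with three steps",
        expected={"node_count": 3, "overlapping_elements": 0},
        difficulty=Difficulty.SIMPLE,
        category="layout",
    ),
    GoldenCase(
        prompt="Draw a full microservice architecture for an online shop",
        expected={"overlapping_elements": 0, "readable_layout": True},
        difficulty=Difficulty.HARD,
        category="context_quality",
    ),
    GoldenCase(
        prompt="Write me an essay about diagrams",
        expected={"declined": True},
        difficulty=Difficulty.SIMPLE,
        category="unsupported_request",
    ),
]


def select(cases, *, category=None, difficulty=None):
    """Return the cases matching the given category and/or difficulty."""
    return [
        case
        for case in cases
        if (category is None or case.category == category)
        and (difficulty is None or case.difficulty == difficulty)
    ]


layout_cases = select(DATASET, category="layout")
hard_cases = select(DATASET, difficulty=Difficulty.HARD)
```

Keeping category and difficulty as plain fields is what makes "pull up just that bucket" a one-line filter.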

How to Design Test Cases

There is no single right way to do this, but the pattern I use is to split cases into three bands.

Simple cases are the happy path. "Make a basic shape." "Generate a two-step flowchart." Your agent should handle these without any trouble. If it fails simple cases, you have a fundamental problem before you even get to the interesting ones.

Medium cases are where layout and multi-step logic become challenges. Entity relationship diagrams, sequence diagrams with several actors, flows that branch. The agent can probably produce the individual elements correctly, but connecting and arranging them in a readable way is harder.

Hard cases are stress tests. A microservice architecture diagram where the agent needs domain knowledge it may not have. A prompt that asks for five things at once in a complex order. The agent should eventually handle these, but they will expose coverage gaps and context limits that medium cases will not.

The coverage question runs in both directions. You want test cases for things users will definitely ask and should get a good answer for. You also want test cases for things users might ask that your agent should decline. If someone asks the diagram agent to write them an essay, the right behaviour is to say no. That is a test case too.

Edge cases are their own category. These are the strange inputs that nobody planned for: a shape that is also a circle, a super-long prompt with "and, and, and" repeated until it breaks something, a Google Doc pasted in as a prompt. What matters here is not always whether the agent gets it right. What matters is how gracefully it fails. Does it produce garbage, or does it tell the user it cannot help with that?

[Figure: Eval pipeline: golden dataset in, scored results out]

Measurements Over Time, Not Pass or Fail

The thing that took me a while to fully absorb is that evals are not a snapshot. A single run tells you almost nothing. The agent might get lucky. It might get unlucky. What you actually want is the trend across many runs over time.

The more data you accumulate, the more clearly you can see the dependability of the system. A unit test that fails right now tells you something is broken. An eval that fails on 8 out of 20 runs tells you the system has about a 40% failure rate on that input type. That is useful information, but it is only meaningful if you have the 20 runs to look at.
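To put a number on how much 20 runs actually tell you, one option is a confidence interval around the observed failure rate. The sketch below uses the Wilson score interval, a standard formula for a binomial proportion. It is my choice for illustration rather than something this series prescribes, and it assumes the runs are independent.

```python
import math


def wilson_interval(failures: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a failure rate; z=1.96 gives roughly 95% coverage."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = failures / runs
    denom = 1 + z**2 / runs
    centre = (p + z**2 / (2 * runs)) / denom
    half_width = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    return centre - half_width, centre + half_width


# 8 failures in 20 runs: the point estimate is 40%.
low, high = wilson_interval(failures=8, runs=20)
```

For 8 failures in 20 runs the interval comes out at roughly 22% to 61%. The 40% figure is the best single guess, but the range is wide, which is the practical argument for accumulating more runs over time.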

This also matters when people ask about model changes. If you swap from one model to a newer version, do not just run the eval once and call it better. Run it many times. Maybe the new model is more accurate but slower. Maybe the communication style is different enough that it breaks assumptions baked into your system prompt. "Better" is never obvious when you change a model. Evals that run across many attempts will surface what actually changed.

There is a distinction worth making here between evals and benchmarking. If you want to compare two models on your specific use case, you keep everything else identical and only change the model. That is benchmarking. You are not evaluating something you control. You are running your agent through the same obstacle course twice with a different engine. The result tells you which model fits your use case better, but the benchmark might not match real-world behaviour on every axis you care about.
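As a sketch of what "keep everything else identical" means in code: the prompts and the run count are shared, and the model is the only thing that varies. `benchmark`, `ModelRunner`, and the two stub models are hypothetical names; a real runner would call the agent with a given model and score the result.

```python
from typing import Callable

# Each "model" is a hypothetical callable: prompt in, True if the output passed scoring.
ModelRunner = Callable[[str], bool]


def benchmark(models: dict[str, ModelRunner], prompts: list[str], runs: int) -> dict[str, float]:
    """Pass rate per model, using identical prompts and an identical run count."""
    rates = {}
    for name, run_case in models.items():
        outcomes = [run_case(prompt) for prompt in prompts for _ in range(runs)]
        rates[name] = sum(outcomes) / len(outcomes)
    return rates


# Deterministic stubs so the sketch runs without a real model behind it.
def model_a(prompt: str) -> bool:
    return "flowchart" in prompt


def model_b(prompt: str) -> bool:
    return True


PROMPTS = ["two-step flowchart", "sequence diagram with three actors"]
rates = benchmark({"model-a": model_a, "model-b": model_b}, PROMPTS, runs=5)
```

A single pass rate hides the other axes mentioned above, such as latency or tone. A fuller version would record those per run alongside the pass/fail outcome.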

Pass@K and Pass^K

These two terms come up constantly in research and at companies that run evals seriously. They confused me for a long time, so here is what they actually mean.

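Pass@K is the probability that at least one of K attempts produces a correct answer. You estimate it from repeated runs. If you run the same input through your agent 20 times and 4 of those runs pass, the per-attempt pass rate is 20%, which is Pass@1. Allowing more attempts raises the number: if each attempt passes independently with probability p, Pass@K is 1 - (1 - p)^K, so a 20% pass rate gives a Pass@5 of about 67%. Run it again after an improvement. If 8 out of 20 pass now, the per-attempt rate is 40% and you can see the improvement clearly. This metric measures quality: what is the best your system can do across multiple attempts.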

Pass^K (pass to the power of K) is stricter. It measures the probability that all K attempts are correct. Instead of asking "did at least one pass?", it asks "did every single one pass?" This is a reliability metric. If even one run fails, the score for that batch is zero. If each attempt passes independently with probability p, Pass^K is p^K, so a 40% per-attempt pass rate gives a Pass^5 of about 1%. A high Pass^K number means the system is dependable, not just occasionally capable.

In practice, you are almost always tracking both. Quality tells you the ceiling. Reliability tells you the floor. A system with high quality but low reliability means the agent can do impressive things but you cannot count on it. A system with high reliability but low quality means it consistently does something, but not the right something.
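Both numbers can be estimated from the same batch of recorded runs. The sketch below uses combinatorial estimators that are common in the research literature: given n runs of which c passed, it asks how likely a random draw of k of those runs is to contain at least one pass (Pass@K) or only passes (Pass^K). The function names `pass_at_k` and `pass_hat_k` are my own, and the estimators assume the n runs are independent samples of the same input.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k attempts passes) from n runs with c passes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the draws of k runs that contain no pass at all.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k attempts pass) from n runs with c passes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(c, k) counts the draws of k runs made up entirely of passes.
    return comb(c, k) / comb(n, k)


# 20 runs, before and after a hypothetical improvement.
before = {"pass@5": pass_at_k(20, 4, 5), "pass^5": pass_hat_k(20, 4, 5)}
after = {"pass@5": pass_at_k(20, 8, 5), "pass^5": pass_hat_k(20, 8, 5)}
```

With k = 1 both functions reduce to the plain pass rate c / n. As k grows they pull apart: Pass@K climbs toward 1 and Pass^K falls toward 0, which is the ceiling and floor described above.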

Capability Evals and Regression Evals

These map directly to what you already do with unit tests, just applied to a probabilistic system.

Capability evals measure what the system can do. Hard test cases live here. Can the agent handle a complex org chart? Can it produce a multi-level architecture diagram with correct structure? These tests tell you the upper limit of what the system is capable of in its current state.

Regression evals measure what the system should still do. Simple and medium test cases go here. After you make a change, did anything break that was already working? This is the same as running regression tests after a refactor. The simple cases should keep passing. If they do not, something regressed.

The combination is the full picture: capability evals tell you where you are improving, regression evals tell you what you did not break.
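A rough sketch of how the two suites might be read after a change. It assumes per-case pass rates have already been collected for a baseline and for the current build. The function names, the 0.05 tolerance, and the case ids are illustrative choices, not values from this series.

```python
def find_regressions(baseline: dict[str, float], current: dict[str, float],
                     tolerance: float = 0.05) -> list[str]:
    """Case ids whose pass rate fell by more than `tolerance` since the baseline."""
    return sorted(
        case_id
        for case_id, old_rate in baseline.items()
        # A case missing from the current run counts as a pass rate of 0.
        if current.get(case_id, 0.0) < old_rate - tolerance
    )


def capability_score(pass_rates: dict[str, float]) -> float:
    """Mean pass rate across the hard cases; a rough 'upper limit' number."""
    return sum(pass_rates.values()) / len(pass_rates)


# Illustrative numbers, not measurements.
regression_baseline = {"two-step-flowchart": 1.0, "er-diagram": 0.8}
regression_current = {"two-step-flowchart": 0.7, "er-diagram": 0.85}
capability_current = {"org-chart": 0.3, "microservice-architecture": 0.1}

regressed = find_regressions(regression_baseline, regression_current)
capability = capability_score(capability_current)
```

The tolerance exists because pass rates on a probabilistic system wobble between runs. Without it, ordinary noise would be reported as a regression.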

Manual Scoring

Before building any tooling, the first step is manual scoring. This is exactly what it sounds like: you look at an input, you look at the output the agent produced, and you decide whether it is good.

This is painful and slow. But it is essential, especially when you are the subject matter expert, the person who deeply understands the domain and knows what a good result looks like. Before LLMs made it possible to automate parts of this, companies would hire large teams of contractors to look at outputs and rate them. That is still how a lot of the labelled data used to train AI models gets produced. The work is valuable because human judgment is the ground truth: the definitive answer that everything else is measured against.

There is a useful diagnostic buried in this. If you cannot describe what a good response looks like for a given input, you will not be able to build something that produces it. If you can look at an output and say "I know this is wrong but I cannot tell you why," that is a signal. You need to understand the domain well enough to articulate goodness before you can measure it. If you cannot, that is the first problem to solve.

The goal with manual scoring is not to do it forever. It is to do it at the start, build up enough understanding of what good looks like, and then use that understanding to build automated scoring that can run faster and at scale.
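Even manual scoring benefits from being written down in a consistent shape. The sketch below is one hypothetical way to record it: every label carries a reason, so the "why" has to be articulated, and failures can be tallied by reason afterwards. `ManualLabel`, `summarize`, and the example labels are assumptions for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ManualLabel:
    case_id: str
    passed: bool
    reason: str  # the reviewer's stated reason; forces "why" to be written down


def summarize(labels: list[ManualLabel]) -> tuple[float, Counter]:
    """Return the overall pass rate and a tally of reasons for failures."""
    pass_rate = sum(label.passed for label in labels) / len(labels)
    reasons = Counter(label.reason for label in labels if not label.passed)
    return pass_rate, reasons


# Illustrative labels, not real results.
labels = [
    ManualLabel("two-step-flowchart", True, "readable layout"),
    ManualLabel("er-diagram", False, "overlapping elements"),
    ManualLabel("sequence-diagram", False, "overlapping elements"),
    ManualLabel("microservice-architecture", False, "missing nodes"),
]
pass_rate, reasons = summarize(labels)
```

The reasons that keep recurring in the tally are natural candidates for the first automated checks.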

The next post takes this from concept to code: building the eval harness, writing the golden dataset, and running the first real eval.

