Evals (AI Evaluation)

How you measure whether an AI system is actually any good. The thing every vendor claims to do and almost nobody does well. Without evals, you're flying blind.

The Technical Definition

Evals (short for evaluations) are the tests you run against an AI system to measure whether it’s producing good outputs. A typical eval has three parts: a dataset of representative inputs, an expected behavior or grading rubric, and a method for scoring how the model did. You run the eval before you ship a change, and you run it again after, and you compare.
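Here is a minimal sketch of those three parts in Python. The names (`EvalCase`, `run_eval`, `old_system`, `new_system`) and the toy data are illustrative assumptions, not taken from any particular framework; a real suite would call your actual system instead of the stand-in functions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str    # a representative input
    expected: str  # the expected behavior, here a reference answer


def exact_match(output: str, expected: str) -> float:
    """Scoring method: 1.0 if the normalized output equals the reference."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(system: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run every case through the system and return the mean score."""
    scores = [exact_match(system(c.prompt), c.expected) for c in cases]
    return sum(scores) / len(scores)


# Hypothetical stand-ins for the system before and after a change.
def old_system(prompt: str) -> str:
    return {"Refund window?": "30 days"}.get(prompt, "I don't know")


def new_system(prompt: str) -> str:
    answers = {"Refund window?": "30 days", "Support hours?": "9am-5pm"}
    return answers.get(prompt, "I don't know")


cases = [
    EvalCase("Refund window?", "30 days"),
    EvalCase("Support hours?", "9am-5pm"),
]

before = run_eval(old_system, cases)
after = run_eval(new_system, cases)
print(f"before={before:.2f} after={after:.2f} delta={after - before:+.2f}")
```

The comparison only means something if the same cases and the same scoring method are used on both runs.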

There are three common approaches. Exact-match evals check whether the output matches a known correct answer. Rubric-based evals score outputs against a checklist (was it accurate, on-brand, complete). LLM-as-a-judge evals use a separate model to grade the outputs of the first model — which sounds circular until you accept that humans grading thousands of outputs is not realistic at scale.
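The three approaches can be sketched side by side. This is an illustration under assumptions: the rubric criteria, the `JUDGE_PROMPT` wording, and `fake_judge` are all made up for the example, and the judge is passed in as a plain function so that no specific model API is implied.

```python
from typing import Callable


def exact_match(output: str, expected: str) -> bool:
    """Exact-match: the output must equal the known correct answer."""
    return output.strip() == expected.strip()


# Rubric-based: each criterion is a named check; the score is the share passed.
# These three criteria are hypothetical examples.
RUBRIC: dict[str, Callable[[str], bool]] = {
    "mentions_policy": lambda out: "policy" in out.lower(),
    "under_50_words": lambda out: len(out.split()) <= 50,
    "at_most_one_apology": lambda out: out.lower().count("sorry") <= 1,
}


def rubric_score(output: str) -> float:
    passed = sum(check(output) for check in RUBRIC.values())
    return passed / len(RUBRIC)


# LLM-as-a-judge: a second model reads the answer and returns a verdict.
JUDGE_PROMPT = "Question: {question}\nAnswer: {answer}\nReply PASS or FAIL."


def llm_judge(output: str, question: str, judge: Callable[[str], str]) -> bool:
    reply = judge(JUDGE_PROMPT.format(question=question, answer=output))
    return reply.strip().upper().startswith("PASS")


def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "PASS" if "30 days" in prompt else "FAIL"


answer = "Per our policy, refunds are accepted within 30 days."
print(exact_match(answer, "30 days"))
print(rubric_score(answer))
print(llm_judge(answer, "What is the refund window?", fake_judge))
```

Note that exact match fails this answer even though it is correct, which is one reason teams reach for rubrics and judges on free-text outputs.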

What This Actually Means for Your Business

Almost every AI vendor will tell you they “test rigorously.” Almost none of them will show you the evals. There’s a reason: most don’t have them, or the ones they have are anecdotal — a developer ran ten examples, the outputs looked fine, they shipped.

That’s not evaluation. That’s spot-checking.

Real evals are the difference between an AI deployment that improves over time and one that drifts silently. Without evals, you have no way to know whether a model update made things better or worse. You have no way to catch regressions when a vendor pushes a change. You have no way to compare two products on the same task. You have no way to tell your board whether the AI is actually working.

The companies doing this well treat evals like a unit-test suite for AI behavior. Every important capability — the customer service bot’s ability to handle a refund request, the document search tool’s ability to find the right policy, the coding assistant’s ability to produce working code — has a dataset of representative cases and a grading mechanism. They run the suite on every change. They watch the score over time. When the score drops, they investigate before the customer notices.
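A rough sketch of what "investigate when the score drops" can look like in code. The capability names, the scores, and the 0.02 tolerance are assumptions chosen for illustration; in practice the baseline would come from the last accepted run of your suite.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return capabilities whose score dropped by more than `tolerance`."""
    drops = {}
    for capability, old in baseline.items():
        # A capability missing from the current run counts as a score of 0.
        new = current.get(capability, 0.0)
        if old - new > tolerance:
            drops[capability] = round(new - old, 4)
    return drops


# Hypothetical per-capability scores from two runs of the same suite.
baseline = {"refund_requests": 0.92, "policy_search": 0.88, "code_generation": 0.75}
current = {"refund_requests": 0.93, "policy_search": 0.81, "code_generation": 0.74}

regressions = find_regressions(baseline, current)
if regressions:
    # In a CI pipeline this is where the change would be blocked.
    print("Regressions found:", regressions)
else:
    print("No regressions beyond tolerance.")
```

The tolerance exists because scores wobble a little from run to run; set it too tight and every change looks like a regression.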

The other thing real evals do: they expose what the AI is bad at. A team that runs evals weekly knows that their assistant fails on multi-step billing questions and is great at first-line FAQs. That knowledge changes deployment decisions. It changes which queries get routed to humans. It changes how the product gets pitched internally. Without evals, the AI is a black box that everyone has opinions about and nobody can prove right or wrong.
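One way to get that kind of knowledge is to tag each eval case with a category and report pass rates per category rather than one overall number. The categories, results, and the 0.6 routing threshold below are invented for the example.

```python
from collections import defaultdict

# Hypothetical graded results: (category, passed).
results = [
    ("faq", True), ("faq", True), ("faq", True), ("faq", False),
    ("multi_step_billing", False), ("multi_step_billing", True),
    ("multi_step_billing", False), ("multi_step_billing", False),
]


def pass_rate_by_category(results: list[tuple[str, bool]]) -> dict[str, float]:
    totals = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for category, passed in results:
        totals[category][0] += int(passed)
        totals[category][1] += 1
    return {cat: p / t for cat, (p, t) in totals.items()}


rates = pass_rate_by_category(results)

# Assumed policy: categories scoring below this go to a human queue.
ROUTE_TO_HUMAN_BELOW = 0.6
human_queue = sorted(cat for cat, rate in rates.items() if rate < ROUTE_TO_HUMAN_BELOW)

print(rates)
print("Route to humans:", human_queue)
```

The overall pass rate here is 50%, which hides the fact that one category is mostly fine and the other mostly is not.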

The cost of not having evals shows up later, usually as a public failure. The model started behaving differently three weeks ago and nobody caught it. The vendor pushed an update and now the assistant is hallucinating policies that don’t exist. A customer complaint becomes a Twitter screenshot becomes a compliance review.

Reality Check

What the vendor says: “Our model has 95% accuracy on customer queries.”

What that means in practice: They ran the model against a dataset they assembled, scored by a method they chose, on tasks they curated. The 95% number is real — for that test. It says almost nothing about how the model performs on your actual customers, your actual data, and your actual edge cases. The eval that matters is the one you build.

What Operators Actually Do

Teams getting real value from AI build their own evals from day one. They start small — fifty examples drawn from real usage, graded by an internal expert. They expand the set as they encounter edge cases. They run the suite on every model update, every prompt change, every retrieval-system tweak. The eval becomes the source of truth for whether the system is improving.

They also build evals that match the actual decision being made. If the AI’s job is to draft refund responses, the eval doesn’t just measure tone or fluency — it measures whether the proposed refund amount is correct, whether the policy citation is real, and whether the customer would accept it. The eval has to be specific to the work, not generic to the model.
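For the refund example, a task-specific grader might look something like this. The policy IDs, field names, and dataclasses are hypothetical; amounts are held in cents to avoid floating-point comparisons. Whether the customer would accept the response is not something a simple check can decide, so that part is left to a human or a validated judge.

```python
from dataclasses import dataclass

# Hypothetical set of policy IDs that actually exist.
KNOWN_POLICIES = {"RET-30", "DMG-01"}


@dataclass
class RefundDraft:
    amount_cents: int  # refund amount proposed by the AI
    policy_id: str     # policy the AI cited


@dataclass
class RefundCase:
    expected_amount_cents: int
    expected_policy_id: str


def grade_refund(draft: RefundDraft, case: RefundCase) -> dict[str, bool]:
    """Check the parts of the draft that have a verifiable right answer."""
    return {
        "amount_correct": draft.amount_cents == case.expected_amount_cents,
        "policy_exists": draft.policy_id in KNOWN_POLICIES,
        "policy_matches": draft.policy_id == case.expected_policy_id,
    }


case = RefundCase(expected_amount_cents=4999, expected_policy_id="RET-30")
draft = RefundDraft(amount_cents=4999, policy_id="RET-99")  # cites a made-up policy

print(grade_refund(draft, case))
```

A fluent, polite draft would pass a generic quality check here while still citing a policy that does not exist.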

The other pattern: humans grade enough samples to validate the LLM-judge. You can’t have humans grade thousands of outputs every week. But you can have humans grade fifty, then run an LLM-judge against the same fifty, and verify that the judge agrees with the humans, especially on the outputs the humans failed. Once the judge is trustworthy, you scale. Skip the validation step and the LLM-judge may quietly grade everything as fine.
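A small sketch of that validation step, assuming pass/fail grades from both the humans and the judge on the same samples. The function name, the two metrics, and the grades are illustrative; ten samples are used instead of fifty to keep the example short.

```python
def validate_judge(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Compare judge grades to human grades on the same outputs."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty grade lists of equal length")
    pairs = list(zip(human, judge))
    agreement = sum(h == j for h, j in pairs) / len(pairs)
    # Of the outputs humans failed, what share did the judge also fail?
    judge_on_human_fails = [j for h, j in pairs if not h]
    if judge_on_human_fails:
        failure_recall = sum(not j for j in judge_on_human_fails) / len(judge_on_human_fails)
    else:
        failure_recall = 1.0  # humans failed nothing, so there was nothing to miss
    return {"agreement": agreement, "failure_recall": failure_recall}


# Hypothetical grades: humans failed 2 of 10 outputs.
human_grades = [True] * 8 + [False] * 2
# A lenient judge that passes everything.
lenient_judge_grades = [True] * 10

report = validate_judge(human_grades, lenient_judge_grades)
print(report)
```

The lenient judge agrees with the humans 80% of the time and still catches none of the failures, which is why raw agreement alone is not enough to trust a judge.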

The Questions to Ask

  1. Can you show me the eval suite this product is tested against? Real evals have datasets, scoring methods, and tracked results over time. If the vendor hands you a slide with one accuracy number, they don’t have evals — they have a marketing claim.

  2. How will we know if the model gets worse? What’s the regression-detection mechanism? Who runs it? How often? What’s the threshold that triggers an alert?

  3. Can we add our own test cases to the eval suite? Your edge cases are not their edge cases. If the platform doesn’t let you contribute test data and run the suite against your scenarios, you’re trusting their definition of “good enough” instead of yours.
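To make questions 2 and 3 concrete, here is a sketch of running vendor-supplied cases and your own cases through the same system, scoring them separately, and flagging any group that falls below a threshold. The case format, the `source` tag, the stand-in `system` function, and the 0.9 threshold are all assumptions for illustration.

```python
from typing import Callable

Case = dict[str, str]  # keys: "prompt", "expected", "source"


def score_by_source(system: Callable[[str], str], cases: list[Case]) -> dict[str, float]:
    """Exact-match pass rate, reported separately for each case source."""
    hits: dict[str, int] = {}
    totals: dict[str, int] = {}
    for case in cases:
        src = case["source"]
        totals[src] = totals.get(src, 0) + 1
        hits[src] = hits.get(src, 0) + int(system(case["prompt"]) == case["expected"])
    return {src: hits[src] / totals[src] for src in totals}


def needs_alert(scores: dict[str, float], threshold: float = 0.9) -> list[str]:
    """Sources whose score is below the alert threshold."""
    return sorted(src for src, score in scores.items() if score < threshold)


def system(prompt: str) -> str:
    """Hypothetical stand-in for the product under test."""
    return {"Reset password?": "Use the reset link", "Refund window?": "30 days"}.get(prompt, "")


vendor_cases = [
    {"prompt": "Reset password?", "expected": "Use the reset link", "source": "vendor"},
    {"prompt": "Refund window?", "expected": "30 days", "source": "vendor"},
]
our_cases = [
    {"prompt": "Refund window?", "expected": "30 days", "source": "ours"},
    {"prompt": "Refund on a gift card?", "expected": "Store credit only", "source": "ours"},
]

scores = score_by_source(system, vendor_cases + our_cases)
print(scores)
print("Alert on:", needs_alert(scores))
```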
