What Are Evals? A Practical Introduction to Evaluating AI Systems
Why AI systems need a different quality discipline than traditional software.
Core idea: evals are the disciplined practice of measuring whether an AI system behaves usefully, safely, and consistently enough for the job it is supposed to do.
AI applications do not fail like traditional software
Traditional software testing assumes a fairly stable relationship between input and output. If a user clicks a button, submits a form, or calls an API with a known payload, the expected result is usually deterministic. We can write unit tests, integration tests, contract tests, and regression suites around that assumption. The system either behaves as specified or it does not. For decades, this has been a strong and effective model for software quality.
AI systems break that neat contract. A language model can produce different responses to the same prompt, a retrieval pipeline can surface different evidence as documents change, and an agent can take different action paths depending on what it sees in the environment. That does not mean the system is broken. It means the system is probabilistic, context-sensitive, and adaptive. Those properties are exactly what make AI useful, but they also make quality harder to define and measure.
Consider a customer support copilot. A deterministic test can verify that the UI loads, the ticket ID is passed correctly, and the escalation button works. It cannot, by itself, tell you whether the generated reply is accurate, whether it cites the right refund policy, or whether it sounds confident while being subtly wrong. In a claims-processing assistant, the interface may work perfectly while the model extracts the wrong injury code from a physician note. In a coding assistant, the generated code may compile and still violate a security requirement or ignore an edge case hidden in the prompt. These are not failures that traditional testing catches reliably, because the problem is not only whether the software ran. The problem is whether the AI made a good judgment.
This is why AI applications require more than conventional testing. Deterministic tests still matter. You still need them for business logic, permissions, APIs, schemas, and workflows. But they are not sufficient for judging model quality, retrieval quality, planning quality, tool-use quality, or user-perceived usefulness. That gap is where evals come in, and the distinction becomes clearer in Testing vs Evals.
What are evals?
Evals are systematic methods for assessing how well an AI system performs against the behaviors, outcomes, and risks that matter for a real use case. Instead of asking only, “did the program execute correctly?”, evals ask broader questions: did the model answer correctly, did it follow instructions, did it use the right evidence, did it choose the right tool, did it complete the task, did it stay within policy, and did it do so consistently enough to trust?
In practice, an eval usually combines a dataset, a task, a rubric, and a scoring method. The dataset may contain prompts, documents, conversations, workflows, tool traces, or production examples. The rubric defines what good looks like. The scorer may be a human reviewer, a programmatic checker, another model acting as a judge, or some combination of the three. The output is not merely pass or fail. It is a structured signal about system quality, which is why dataset and scenario design matters so much.
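That anatomy can be made concrete in a few lines. The sketch below is a minimal, hypothetical harness, not a real library: `EvalExample`, `run_eval`, and `contains_reference` are illustrative names, and the "system under test" is a stand-in lambda. The point is the shape: a dataset of examples, a system under test, and an explicit scorer producing a structured signal rather than a pass/fail bit.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalExample:
    prompt: str     # input to the system under test
    reference: str  # what a good answer should contain

def run_eval(examples: list[EvalExample],
             system: Callable[[str], str],
             scorer: Callable[[str, str], float]) -> float:
    """Run every example through the system and average the scores."""
    scores = [scorer(system(ex.prompt), ex.reference) for ex in examples]
    return sum(scores) / len(scores)

# A deliberately trivial scorer: does the output mention the reference fact?
def contains_reference(output: str, reference: str) -> float:
    return 1.0 if reference.lower() in output.lower() else 0.0

examples = [
    EvalExample("What is our refund window?", "30 days"),
    EvalExample("Do we ship internationally?", "yes"),
]

# Stand-in for the real AI system; always gives the same canned reply.
fake_system = lambda prompt: "Refunds are accepted within 30 days."
print(run_eval(examples, fake_system, contains_reference))  # 0.5 on this toy set
```

In a real program the scorer would be a rubric-driven human review, a programmatic checker, or a model-based judge, but the interface stays the same, which is what makes evals repeatable and comparable across versions.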
That is the important shift. Traditional testing is mostly about correctness against explicit specifications. Evals are about fitness for purpose under uncertainty. They help teams decide whether an AI system is good enough to ship, safe enough to scale, and stable enough to change without causing regressions.
What exactly do we evaluate?
There is no single score that captures whether an AI system is good. Different use cases require different dimensions, and many of them trade off against one another. A model can be helpful but too slow. It can be fluent but ungrounded. It can complete tasks but overuse tools. It can be accurate on average and still fail badly on critical edge cases. That is why mature teams define a set of evaluators or evaluation dimensions rather than chasing one headline metric.
Task Quality
Did the system actually solve the user’s problem? This includes correctness, completeness, relevance, and whether the response or action was useful in context.
Groundedness
Was the output supported by the supplied evidence, retrieved documents, or tool results, or did the system fabricate unsupported claims?
Instruction Following
Did the model follow the requested format, constraints, tone, policy, or workflow, especially when the prompt included nontrivial rules?
Safety and Policy
Did the system avoid unsafe behavior, privacy leaks, disallowed content, biased outcomes, and actions outside approved boundaries?
Tool Use and Planning
For agents, did it choose the right tool, call it with the right inputs, interpret the result correctly, and follow a sensible plan?
Operational Fitness
Was the behavior reliable enough in production terms, including latency, cost, consistency, fallbacks, and resilience under messy real inputs?
These evaluators can be implemented in several ways. Some are deterministic, such as exact-match checks, schema validation, citation presence, or tool-call correctness. Some are human, such as rating answer quality or comparing two outputs. Some are model-based, where a stronger model grades helpfulness, groundedness, or rubric adherence. Most serious evaluation programs use a mix, because each evaluator catches a different class of failure.
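The deterministic class is the easiest to show. The checks below are illustrative sketches of three evaluators mentioned above: exact match, schema validation, and citation presence. The citation format `[doc-1]` is an assumed convention for this example, not a standard.

```python
import json
import re

def exact_match(output: str, expected: str) -> bool:
    """Strict string equality after whitespace normalization."""
    return output.strip() == expected.strip()

def valid_schema(output: str, required_keys: set[str]) -> bool:
    """Check that the output is a JSON object with the required top-level keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys <= data.keys()

def cites_sources(output: str, allowed_ids: set[str]) -> bool:
    """Check that citation markers like [doc-3] exist and refer only to supplied documents."""
    cited = set(re.findall(r"\[([\w-]+)\]", output))
    return bool(cited) and cited <= allowed_ids

print(exact_match("42", " 42 "))                                                   # True
print(valid_schema('{"answer": "yes", "source": "doc-1"}', {"answer", "source"}))  # True
print(cites_sources("Per the refund policy [doc-1], ...", {"doc-1", "doc-2"}))     # True
```

Checks like these are cheap enough to run on every change, which is why teams typically layer them underneath the slower human and model-based evaluators rather than choosing one approach.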
Who evaluates AI?
The short answer is: more people than in traditional software. AI quality is not owned by engineering alone because many of the most important judgments are not purely technical. Whether an answer is helpful, whether a recommendation is fair, whether a generated explanation is acceptable for a regulated workflow, and whether an action is safe to automate all require multiple perspectives.
Users evaluate AI every time they decide whether to trust it. Domain experts evaluate it when they judge correctness in medicine, law, finance, support, or operations. Product teams evaluate it when they ask whether it improves the task outcome or merely adds novelty. Safety teams and policy owners evaluate it for compliance, abuse resistance, and failure severity. Engineers evaluate the mechanics: prompt quality, retrieval quality, model selection, orchestration, and regressions. In many cases, another model also evaluates AI by acting as a judge or triaging large datasets before humans review the highest-risk cases.
That is why AI evaluation is inherently socio-technical. It sits at the intersection of model behavior, product intent, operational risk, and human expectations. A technically impressive system can still fail if the domain experts do not trust it or if the business cannot explain when it should be used and when it should not.
What it takes to evaluate AI well
Evaluating AI is not a one-time benchmark. It is an operating discipline. Teams need representative datasets, clear rubrics, baseline versions, repeatable execution, and a way to compare changes over time. That sounds obvious, but it is where many teams fail. They run a few prompts manually, like what they see, and mistake that for evidence. Reliable AI evaluation starts when examples become curated, scoring becomes explicit, and regressions become visible.
Good evals also require realism. Synthetic prompts can help expand coverage, but they should not replace production-like scenarios. The system should be tested on messy language, vague requests, conflicting instructions, poor source documents, adversarial cases, and long-tail examples that reflect the actual environment. If the system will be used by thousands of people, the eval set should not look like a demo prepared by the build team.
Finally, teams need feedback loops. Pre-release evals are necessary, but they are not enough because AI systems drift as prompts change, models are swapped, retrieval corpora evolve, and user behavior shifts. Production monitoring, sampled human review, incident analysis, and regression gates are all part of the evaluation system, which is where build-time evals and runtime evals become essential. In other words, evaluating AI takes test assets, judgment assets, and operational discipline.
- Representative examples: prompts, documents, tool traces, and real edge cases that match the product’s usage.
- Clear rubrics: explicit criteria for quality, safety, and task success so scoring does not collapse into vague opinion.
- Layered evaluators: deterministic checks, human review, and model-based graders used where each is strongest.
- Regression infrastructure: the ability to compare model, prompt, retrieval, and orchestration changes before shipping them broadly.
- Production learning loops: telemetry, spot-checking, escalation analysis, and failure review after deployment.
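The regression-infrastructure item above can be sketched as a simple gate: run the same eval suite against the baseline and the candidate, then flag any evaluator whose score dropped beyond a tolerance. The function and the dimension names here are hypothetical; real gates usually add statistical significance checks and per-dimension thresholds.

```python
def regression_gate(baseline: dict[str, float],
                    candidate: dict[str, float],
                    tolerance: float = 0.02) -> list[str]:
    """Return the evaluator names where the candidate regressed beyond tolerance."""
    return [name for name, base_score in baseline.items()
            if candidate.get(name, 0.0) < base_score - tolerance]

# Scores from running the same eval suite against two versions of the system.
baseline  = {"task_quality": 0.84, "groundedness": 0.91, "safety": 0.99}
candidate = {"task_quality": 0.86, "groundedness": 0.85, "safety": 0.99}

failures = regression_gate(baseline, candidate)
print(failures)  # ['groundedness'] -- block the release and investigate
```

The candidate improved on task quality, but the gate still blocks it because groundedness regressed. That is the point of comparing against a baseline: headline gains can hide dimension-level losses.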
Why subsystem evals matter even more for agents
Agentic systems raise the bar because they are compositional. They are built from interacting parts: prompt templates, planners, retrievers, tools, memory, policies, model calls, execution loops, and escalation logic. When an agent succeeds end to end, that success can hide fragility inside one or more of those parts. A lucky retrieval hit, a forgiving user request, or a narrow tool path can make the whole flow look stable even when an individual subsystem is brittle.

This is why end-to-end testing alone is dangerous for agentic systems. End-to-end tests are still useful because they tell you whether the assembled system can complete realistic tasks. But they also smooth over the reasons it worked. If an agent completes a travel-booking workflow, was the plan sound, was the tool selection correct, did it ground its decisions in reliable evidence, did it recover from tool failures, and did it escalate when confidence dropped? End-to-end success does not answer those questions cleanly.
Subsystem evals do. You can evaluate retrieval quality separately from answer quality. You can test whether the planner decomposes tasks well before you measure overall task completion. You can inspect whether tool arguments are correct even when the final user-visible output looks plausible. You can stress-test memory contamination, fallback behavior, and policy enforcement without needing the whole agent loop to run perfectly. This layered view is how teams identify root causes instead of only observing symptoms.
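Two of those subsystem checks can be sketched in isolation. Below, a recall@k metric scores the retriever against labeled relevant documents, and a trace check verifies tool selection and arguments from a logged call. Both are hypothetical illustrations: the trace format and the `book_flight` tool are assumptions, and neither check needs the full agent loop to run.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of known-relevant documents that appear in the top-k retrieval."""
    hits = relevant_ids & set(retrieved_ids[:k])
    return len(hits) / len(relevant_ids)

def tool_args_correct(call: dict, expected: dict) -> bool:
    """Check the agent picked the right tool and passed the expected arguments."""
    return call.get("tool") == expected["tool"] and call.get("args") == expected["args"]

# Retrieval subsystem scored on its own, regardless of the final answer:
print(recall_at_k(["d7", "d2", "d9"], {"d2", "d4"}, k=3))  # 0.5

# Tool-use subsystem scored from a logged trace, not from the user-visible output:
trace = {"tool": "book_flight", "args": {"from": "SFO", "to": "JFK"}}
print(tool_args_correct(trace, {"tool": "book_flight",
                                "args": {"from": "SFO", "to": "JFK"}}))  # True
```

A retriever scoring 0.5 recall here is a root cause you can fix directly, even if the end-to-end answer happened to look plausible, which is exactly the layered visibility the paragraph above describes.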
The practical lesson is simple: evaluate the parts and the whole. Use subsystem evals to find brittle links early, and use end-to-end evals to validate user outcomes under realistic conditions. If you skip subsystem evals, end-to-end tests can give a false sense of confidence. If you skip end-to-end evals, strong components may still combine into a poor product experience. Agentic systems require both, especially once you get into RAG evaluation and pre-build evaluation design.
Bottom line: for AI, especially for agents, quality is not just whether the software runs. It is whether a probabilistic, compositional system behaves well enough, often enough, under the conditions that matter.