AI Evaluation Strategy
Pre-Build Evals for AI Agents
The biggest mistake many teams make with AI agents is starting implementation before they have defined how success, failure, and quality will be judged. Pre-build evals correct that sequencing problem.
Start With Evaluation, Not Implementation
Most teams build AI agents backwards. They start with prompts, tools, frameworks, and demos. Only later do they try to define what the system was actually supposed to do, what kinds of failures matter, and how quality should be judged.
That sequence creates confusion. A prompt that looks impressive in a demo may fail on an edge case the team never wrote down. A workflow that seems useful in development may quietly violate a policy requirement that no one translated into an eval scenario. By the time these problems are visible, the implementation is already shaping the requirements.
Pre-build evals reverse that order. They force a team to decide what success looks like before writing prompts, wiring tools, building orchestration logic, or integrating retrieval. They are the AI equivalent of specifying tests before building the system, except the focus is not just on deterministic software behavior. It is on expected agent behavior under ambiguity, imperfect instructions, safety boundaries, and tool interactions, which is why they sit naturally between eval fundamentals and build-time eval execution.
Why This Matters
Traditional software teams usually begin with requirements, edge cases, and acceptance criteria. AI projects often skip that discipline because the system feels exploratory. But agents are still systems, and systems still need explicit behavioral expectations.
If you do not define the evaluation set early, several problems appear later: teams confuse a good demo with a reliable system, prompt changes are judged subjectively instead of against a fixed test set, tool-calling failures are discovered late, and safety requirements remain vague until a production incident forces clarity.
Pre-build evals do not eliminate uncertainty. They make uncertainty explicit and testable.
Where Pre-Build Evals Fit
It helps to place pre-build evals inside a broader evaluation lifecycle rather than treating them as the only kind of evaluation that matters.
Pre-Build Evals
Define behaviors, scenarios, failure modes, rubrics, and acceptance criteria.
Build-Time Evals
Test prompts, tools, workflows, retrieval, and outputs while developing.
Runtime Evals
Monitor real-world behavior after release using production signals, feedback, and incidents.
These stages serve different purposes. Pre-build evals answer what the system should do. Build-time evals answer whether the implementation satisfies those expectations. Runtime evals answer how the live system behaves under real traffic, real users, and real distribution shift.
If you skip pre-build evals, build-time evaluation becomes reactive. If you skip build-time evals, production becomes the test environment. And if you skip runtime evals, you never learn how the system behaves outside the lab.
What Should Be Evaluated Before Build
At the pre-build stage, you are not yet measuring production quality. You are designing the evaluation lens through which quality will later be measured. That means deciding which artifacts matter and what questions each artifact must answer.
Models
- Reasoning depth: Does the use case require decomposition, ranking, or multi-step logic?
- Knowledge dependence: Can the task rely on model knowledge, or will it need external grounding?
- Context handling: Will long instructions, long dialogues, or large retrieved contexts stress the model?
- Reliability expectations: Is mild inconsistency acceptable, or does the task require tightly bounded outputs?
Meta Prompts
- Instruction following: Must the system always return a schema, persona, or strict format?
- Sensitivity: Would minor phrasing changes alter output quality or tool behavior?
- Negative constraints: What must never appear in the response?
- Escalation rules: When should the system refuse, defer, or ask a clarifying question?
Platform-Level Guardrails
- False positives: Which legitimate requests must not be blocked?
- False negatives: Which harmful requests must always be intercepted?
- Boundary cases: Which ambiguous prompts require escalation or review?
- User experience impact: How should the system respond when a guardrail triggers?
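The false-positive and false-negative questions above can be made mechanical before any guardrail exists, by scoring a candidate guardrail against a labeled prompt set. A minimal sketch, assuming the guardrail is any callable that returns `True` when it blocks a prompt (the labels and example prompts are illustrative):

```python
# Sketch: scoring a guardrail against a labeled prompt set.
# `guardrail` is any callable returning True when it blocks a prompt;
# the prompts and labels below are illustrative placeholders.
def score_guardrail(guardrail, labeled_prompts):
    """labeled_prompts: list of (prompt, should_block) pairs."""
    false_positives = []  # legitimate requests that were blocked
    false_negatives = []  # harmful requests that slipped through
    for prompt, should_block in labeled_prompts:
        blocked = guardrail(prompt)
        if blocked and not should_block:
            false_positives.append(prompt)
        elif not blocked and should_block:
            false_negatives.append(prompt)
    return {"false_positives": false_positives,
            "false_negatives": false_negatives}
```

Running this against a naive keyword filter immediately surfaces the boundary cases: a rule that blocks anything mentioning "password" will flag a legitimate password-reset request as well as a harmful one, which is exactly the kind of tradeoff a pre-build eval should expose.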
Tool Specifications
- Parameter accuracy: What fields must be extracted from user input?
- Tool selection: When is tool use mandatory, optional, or prohibited?
- Tool hallucination: How will you detect attempts to call tools that do not exist?
- Recovery behavior: What should happen when a tool fails or returns no result?
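The tool-specification questions above translate directly into checks that can be written before the agent exists. A minimal sketch, assuming tool calls arrive as dictionaries with `name` and `arguments` keys; the registry and field names are hypothetical:

```python
# Sketch of pre-build tool-call checks: hallucinated tools and missing
# required parameters. The registry contents and call shape are
# assumptions, not a real API.
TOOL_REGISTRY = {
    "incident_lookup": {"required": ["account_id", "product_area"]},
    "create_ticket": {"required": ["account_id", "severity", "summary"]},
}

def check_tool_call(call: dict) -> list[str]:
    """Return a list of failure reasons for one proposed tool call."""
    failures = []
    spec = TOOL_REGISTRY.get(call.get("name"))
    if spec is None:
        failures.append(f"hallucinated tool: {call.get('name')}")
        return failures
    for field in spec["required"]:
        if field not in call.get("arguments", {}):
            failures.append(f"missing required parameter: {field}")
    return failures
```

A call to a nonexistent tool such as `{"name": "reboot_server", "arguments": {}}` is flagged as a hallucination, while a real tool called with incomplete arguments is flagged for parameter accuracy, keeping the two failure modes separately countable.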
RAG or Retrieval Design
- Corpus coverage: Does the source material even contain the answers users need?
- Chunking strategy: How small or large should retrieval units be?
- Relevance expectations: What does a good retrieval result look like?
- Grounding behavior: When must the system answer only from retrieved content?
Those retrieval expectations should later become concrete checks in RAG evals rather than remaining design-time intentions.
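A grounding rule can be prototyped as a function even at this stage. The sketch below uses a crude content-word-overlap heuristic to flag answer sentences unsupported by any retrieved chunk; real RAG evals use stronger entailment or citation checks, so treat this only as an illustration of how "answer only from retrieved content" becomes testable:

```python
# Crude grounding heuristic for design-time discussion: flag answer
# sentences that share too few content words with any retrieved chunk.
# The overlap threshold is an arbitrary placeholder.
def ungrounded_sentences(answer: str, chunks: list[str],
                         min_overlap: int = 3) -> list[str]:
    flagged = []
    chunk_words = [set(c.lower().split()) for c in chunks]
    for sentence in filter(None, (s.strip() for s in answer.split("."))):
        words = set(sentence.lower().split())
        if not any(len(words & cw) >= min_overlap for cw in chunk_words):
            flagged.append(sentence)
    return flagged
```

Given a chunk describing a dashboard outage, a sentence about the outage passes while an invented claim about refunds is flagged, which is the behavior a grounding expectation is meant to enforce.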
Token and Cost Estimation
- Token assumptions: How large will instructions, context, and outputs likely be?
- Cost targets: What cost per task is acceptable?
- Latency tradeoffs: Is a slower but better workflow acceptable?
- Fallback strategy: When should cheaper models or staged flows be considered?
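Token and cost assumptions are easiest to pressure-test with a back-of-envelope model. The sketch below multiplies estimated token counts by per-1K-token prices; the model names and prices are placeholders, not real vendor pricing:

```python
# Back-of-envelope cost model for pre-build planning. The model names
# and per-1K-token prices below are placeholders, not vendor pricing.
PRICE_PER_1K = {
    "large_model": {"input": 0.010, "output": 0.030},
    "small_model": {"input": 0.001, "output": 0.002},
}

def estimate_cost_per_task(model: str, input_tokens: int,
                           output_tokens: int,
                           calls_per_task: int = 1) -> float:
    """Rough dollar cost of one end-to-end task."""
    p = PRICE_PER_1K[model]
    per_call = (input_tokens / 1000) * p["input"] \
             + (output_tokens / 1000) * p["output"]
    return calls_per_task * per_call
```

Even this rough arithmetic makes tradeoffs concrete: at the placeholder prices, a workflow making four large-model calls of 3,000 input and 500 output tokens costs about $0.18 per task, which is the kind of number a cost target or fallback strategy can be negotiated against.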
Core Evaluation Dimensions
Pre-build evals translate product intent into testable expectations. These are not implementation tasks yet. They are requirements that should later become datasets, rubrics, and pass-fail checks.
- Intent classification: Can the system correctly recognize what the user is trying to do?
- Data extraction: Can it reliably extract the fields needed for reasoning or tool use?
- Multi-intent classification: Can it split, sequence, or reject mixed requests correctly?
- Multi-turn dialogue management: Can it maintain context without inventing facts or losing the user goal?
- Tool selection: Does it know when tool use is mandatory, optional, or prohibited?
- Response generation: Does the final answer remain accurate, grounded, complete, clear, and policy compliant?
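Each dimension above eventually becomes a per-scenario check. One simple way to see that is a comparison of expected versus observed behavior across whichever dimensions a scenario defines; the dimension names and record shape here are assumptions, not a prescribed schema:

```python
# Sketch: scoring one scenario across the eval dimensions it defines.
# Dimension names and the expected/observed record shapes are
# illustrative assumptions.
DIMENSIONS = ["intent", "extraction", "tool_selection", "response"]

def evaluate_scenario(expected: dict, observed: dict) -> dict:
    """Compare expected vs. observed behavior per defined dimension;
    the scenario passes only if every defined dimension matches."""
    results = {d: observed.get(d) == expected[d]
               for d in DIMENSIONS if d in expected}
    results["pass"] = all(results.values())
    return results
```

The point of the structure is that a scenario can fail on tool selection while passing on intent, so failures stay attributable to a specific dimension instead of collapsing into a single vague "bad output" verdict.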
What the Pre-Build Eval Workflow Looks Like
If you assign pre-build evals to an evaluation team, the work should follow a repeatable workflow rather than an ad hoc brainstorming exercise. The team is not trying to tune prompts yet. It is producing the behavioral specification that builders will later implement against.
Scope the use case
Clarify goals, users, constraints, non-goals, and the failures that matter most.
Define behaviors and scenarios
Break the future agent into testable behaviors and write representative scenarios.
Set rubrics and thresholds
Write expected behavior, pass-fail rules, severity levels, and acceptance criteria.
Review and hand off
Version the eval pack, get sign-off, and map it into build-time evaluation.
1. Frame the use case and risk
Start by clarifying the system goal, target users, task boundaries, and business or policy risks. This is where the eval team decides what kinds of mistakes matter most and which failures are merely inconvenient versus unacceptable.
Output: Use-case scope, non-goals, and a first risk register.
2. Break the job into behaviors
Decompose the future agent into behaviors that can later be evaluated: intent recognition, data extraction, tool choice, multi-turn handling, grounding, refusal behavior, escalation, and final response quality.
Output: A behavior map showing what the system must do end to end and at each subsystem boundary.
3. Collect representative scenarios
Write scenarios across happy paths, edge cases, ambiguous requests, adversarial inputs, multi-turn conversations, and tool-dependent tasks. The point is coverage, not volume.
Output: A draft scenario set organized by category and severity.
4. Define expected behavior for each scenario
For every scenario, specify expected intent, required extracted data, allowed and disallowed tool behavior, grounding rules, response expectations, and failure conditions. This is where vague requirements become testable requirements.
Output: Scenario specifications with explicit expected behavior.
5. Write the rubric and pass-fail logic
Define how quality will be judged. Some checks may be binary, such as whether tool use was mandatory. Others may be rubric-based, such as clarity, completeness, or groundedness. Critical failures should be separated from minor quality issues.
Output: Rubrics, severity levels, and acceptance thresholds.
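The split between binary critical checks and rubric-based quality scores can be sketched in a few lines. The check names, rubric weights, 1-to-5 scale, and threshold below are all illustrative assumptions:

```python
# Sketch: combining binary critical checks with weighted rubric scores.
# Check names, weights, the 1-5 scale, and the threshold are all
# illustrative, not a prescribed rubric.
CRITICAL_CHECKS = ["used_required_tool", "no_invented_ids"]
RUBRIC_WEIGHTS = {"clarity": 0.4, "completeness": 0.3, "groundedness": 0.3}

def judge(binary_results: dict, rubric_scores: dict,
          threshold: float = 4.0) -> dict:
    # Any critical failure fails the scenario outright, regardless
    # of how well the response scores on quality dimensions.
    critical_failures = [c for c in CRITICAL_CHECKS
                         if not binary_results.get(c, False)]
    weighted = sum(RUBRIC_WEIGHTS[d] * rubric_scores[d]
                   for d in RUBRIC_WEIGHTS)
    return {"critical_failures": critical_failures,
            "rubric_score": round(weighted, 2),
            "pass": not critical_failures and weighted >= threshold}
```

The design choice worth noticing is that critical failures gate the result: a fluent, well-written response that invented a ticket ID still fails, which keeps severity levels from being averaged away by good prose.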
6. Run cross-functional review
The eval team should review the scenario set with product, domain experts, engineering, security, policy, and operations as needed. This prevents the eval set from reflecting only one team’s mental model of the system.
Output: A reviewed and corrected eval design with stakeholder sign-off.
7. Freeze the initial eval pack
Once reviewed, the team should version the scenarios, rubrics, and acceptance criteria so implementation has a stable target. If the target keeps moving informally, build-time evals will be impossible to interpret.
Output: Versioned baseline eval pack for development.
8. Hand off to build-time evaluation
Map each pre-build scenario into the datasets, automated checks, human review loops, or release gates that will be used during development. The pre-build phase ends only once the transition into measurable build-time execution is explicit.
Output: Traceability from pre-build expectations to build-time eval execution.
In practice, that workflow usually produces a small set of durable artifacts that the rest of the project depends on.
| Artifact | What it contains | Why it matters |
|---|---|---|
| Use-case brief | Goals, constraints, users, non-goals, and risk context. | Prevents the eval set from drifting into generic examples. |
| Scenario library | Representative prompts, conversations, and task situations. | Creates coverage across realistic and difficult conditions. |
| Behavior spec | Expected intent, extraction, tool behavior, and response rules per scenario. | Turns product intent into testable requirements. |
| Rubric and thresholds | Pass-fail rules, scoring dimensions, and severity definitions. | Makes quality decisions consistent instead of subjective. |
| Build-time mapping | How scenarios will be executed during development and release review. | Connects planning work to actual engineering practice. |
A Concrete End-To-End Example
Consider a support agent for enterprise software. The team has not built the workflow yet, but it already knows that the future agent will answer product questions, look up account status, and create support tickets through tools.
| Field | Example |
|---|---|
| Scenario ID | PB-017 |
| User Goal | Resolve an outage and understand account impact. |
| Input | Our analytics dashboard has been down since this morning. Can you check whether there is an active incident on our account and open a severity-one ticket if needed? |
| Expected Intent | Primary intent: incident status lookup. Secondary intent: ticket creation if outage is confirmed or required. |
| Required Data | Product area, time reference, severity signal, and account context. |
| Tool Expectation | Must use incident lookup first. Must not create a ticket without enough account context or confirmation. |
| Safety Expectation | Must not invent incident IDs, outage status, ticket IDs, or account details. |
| Response Expectation | Acknowledge the report, explain that status must be verified through tools, and ask for missing account details if needed. |
| Pass-Fail Rule | Pass if the agent chooses tool-first behavior, extracts the right fields, and asks for clarification when needed. Fail if it answers from prior knowledge, skips required tool use, or creates a ticket prematurely. |
Even a single scenario like this exercises multiple eval dimensions at once. It tests intent classification, extraction, tool selection, multi-turn handling, and response generation before a single orchestration workflow exists.
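A scenario like PB-017 can also be frozen as data rather than prose, so build-time evals can execute it directly against logged agent behavior. A minimal sketch; the field names and the `trace` shape are assumptions about how the future agent's behavior might be recorded:

```python
# PB-017 from the table above, captured as a machine-checkable record.
# Field names and the trace structure are illustrative assumptions.
PB_017 = {
    "id": "PB-017",
    "required_first_tool": "incident_lookup",
    "forbidden_without_confirmation": "create_ticket",
    "required_fields": {"product_area", "time_reference", "severity_signal"},
}

def check_trace(scenario: dict, trace: dict) -> list[str]:
    """trace: {'tool_calls': [names in order],
               'extracted': {field: value}, 'confirmed': bool}"""
    failures = []
    calls = trace.get("tool_calls", [])
    if not calls or calls[0] != scenario["required_first_tool"]:
        failures.append("did not use required tool first")
    if (scenario["forbidden_without_confirmation"] in calls
            and not trace.get("confirmed")):
        failures.append("created ticket without confirmation")
    missing = scenario["required_fields"] - set(trace.get("extracted", {}))
    if missing:
        failures.append(f"missing extracted fields: {sorted(missing)}")
    return failures
```

Expressed this way, the pass-fail rule in the table stops being a sentence to interpret and becomes a function every prompt or workflow revision must satisfy, which is exactly the traceability the hand-off step asks for.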
How To Use This In a Real Project
An effective pre-build eval set does not need to be huge. It needs to be representative. Start with happy paths, edge cases, ambiguous cases, adversarial or policy-sensitive cases, multi-turn cases, and tool-dependent cases.
For each scenario, define the user input, expected intent, required extracted fields, expected tool behavior, allowed and disallowed response characteristics, and a clear pass-fail rule. If a team cannot write these expectations down, it is usually not ready to build the agent.
Once development starts, use the same scenarios to run build-time evals against prompts, tools, and workflows. After release, use production failures and real-user feedback from runtime evaluation to expand the scenario set and improve future pre-build planning.
Common Mistakes
- Treating evals as something to add after the prototype works.
- Writing only happy-path examples.
- Confusing format correctness with task correctness.
- Leaving ambiguous cases unspecified.
- Ignoring tool failure and recovery behavior.
- Defining quality with vague language such as "helpful" or "natural" without measurable criteria.
- Failing to distinguish between pre-build planning, build-time testing, and runtime monitoring.
Final Thought
The point of pre-build evals is not to predict every failure before implementation. That is impossible. The point is to force precision early enough that the team builds the right system, with the right constraints, for the right behaviors.
If you wait until prompts are written and tools are integrated before deciding how the agent should be evaluated, the implementation will start shaping the requirements. Pre-build evals reverse that order. They make requirements shape the implementation.