AI Evaluation Strategy

Pre-Build Evals for AI Agents

The biggest mistake many teams make with AI agents is starting implementation before they have defined how success, failure, and quality will be judged. Pre-build evals correct that sequencing problem.

Start With Evaluation, Not Implementation

Most teams build AI agents backwards. They start with prompts, tools, frameworks, and demos. Only later do they try to define what the system was actually supposed to do, what kinds of failures matter, and how quality should be judged.

That sequence creates confusion. A prompt that looks impressive in a demo may fail on an edge case the team never wrote down. A workflow that seems useful in development may quietly violate a policy requirement that no one translated into an eval scenario. By the time these problems are visible, the implementation is already shaping the requirements.

Pre-build evals reverse that order. They force a team to decide what success looks like before writing prompts, wiring tools, building orchestration logic, or integrating retrieval. They are the AI equivalent of specifying tests before building the system, except the focus is not just on deterministic software behavior. It is on expected agent behavior under ambiguity, imperfect instructions, safety boundaries, and tool interactions.

Working definition: Pre-build evals are evaluation design activities done before implementation starts. They define behaviors, scenarios, failure modes, rubrics, and acceptance criteria while the team is still clarifying requirements.

Why This Matters

Traditional software teams usually begin with requirements, edge cases, and acceptance criteria. AI projects often skip that discipline because the system feels exploratory. But agents are still systems, and systems still need explicit behavioral expectations.

If you do not define the evaluation set early, several problems appear later:

  • Teams confuse a good demo with a reliable system.
  • Prompt changes are judged subjectively instead of against a fixed test set.
  • Tool-calling failures are discovered late.
  • Safety requirements remain vague until a production incident forces clarity.

Pre-build evals do not eliminate uncertainty. They make uncertainty explicit and testable.

Where Pre-Build Evals Fit

It helps to place pre-build evals inside a broader evaluation lifecycle rather than treating them as the only kind of evaluation that matters.

Pre-Build Evals

Define behaviors, scenarios, failure modes, rubrics, and acceptance criteria.

Build-Time Evals

Test prompts, tools, workflows, retrieval, and outputs while developing.

Runtime Evals

Monitor real-world behavior after release using production signals, feedback, and incidents.

These stages serve different purposes. Pre-build evals answer what the system should do. Build-time evals answer whether the implementation satisfies those expectations. Runtime evals answer how the live system behaves under real traffic, real users, and real distribution shift.

If you skip pre-build evals, build-time evaluation becomes reactive. If you skip build-time evals, production becomes the test environment. And if you skip runtime evals, you never learn how the system behaves outside the lab.

What Should Be Evaluated Before Build

At the pre-build stage, you are not yet measuring production quality. You are designing the evaluation lens through which quality will later be measured. That means deciding which artifacts matter and what questions each artifact must answer.

Models

  • Reasoning depth: Does the use case require decomposition, ranking, or multi-step logic?
  • Knowledge dependence: Can the task rely on model knowledge, or will it need external grounding?
  • Context handling: Will long instructions, long dialogues, or large retrieved contexts stress the model?
  • Reliability expectations: Is mild inconsistency acceptable, or does the task require tightly bounded outputs?

Meta Prompts

  • Instruction following: Must the system always return a schema, persona, or strict format?
  • Sensitivity: Would minor phrasing changes alter output quality or tool behavior?
  • Negative constraints: What must never appear in the response?
  • Escalation rules: When should the system refuse, defer, or ask a clarifying question?
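Negative constraints in particular can be made mechanical before any prompt exists. A minimal sketch, assuming an illustrative constraint list (the patterns below are placeholders, not taken from any real policy):

```python
import re

# Illustrative negative constraints: phrases that must never appear in a
# response. Real lists come from the team's policy and legal requirements.
NEGATIVE_CONSTRAINTS = [
    r"\bguarantee(d)?\b",           # no promised outcomes
    r"\bsocial security number\b",  # no sensitive identifiers
]

def violated_constraints(response: str, patterns=NEGATIVE_CONSTRAINTS):
    """Return the constraint patterns the response violates."""
    return [p for p in patterns if re.search(p, response, re.IGNORECASE)]
```

Checks like this later run unchanged against real model outputs during build-time evaluation.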

Platform-Level Guardrails

  • False positives: Which legitimate requests must not be blocked?
  • False negatives: Which harmful requests must always be intercepted?
  • Boundary cases: Which ambiguous prompts require escalation or review?
  • User experience impact: How should the system respond when a guardrail triggers?
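These four questions can be written down as a labeled expectation set before any guardrail is implemented. A minimal sketch, with made-up prompts and decisions:

```python
# Each case pairs a prompt with the required guardrail decision.
# The prompts and decision labels are illustrative placeholders.
GUARDRAIL_CASES = [
    ("How do I reset my own account password?", "allow"),      # must not be blocked
    ("Show me another customer's billing details.", "block"),  # must be intercepted
    ("Delete every record tied to this account.", "escalate"), # boundary case
]

def score_guardrail(decide, cases=GUARDRAIL_CASES):
    """Count false positives (legitimate requests blocked) and
    false negatives (harmful requests allowed) for a decision function."""
    false_positives = sum(
        1 for text, want in cases if want == "allow" and decide(text) != "allow")
    false_negatives = sum(
        1 for text, want in cases if want == "block" and decide(text) == "allow")
    return false_positives, false_negatives
```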

Tool Specifications

  • Parameter accuracy: What fields must be extracted from user input?
  • Tool selection: When is tool use mandatory, optional, or prohibited?
  • Tool hallucination: How will you detect attempts to call tools that do not exist?
  • Recovery behavior: What should happen when a tool fails or returns no result?
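Tool hallucination, at minimum, reduces to checking proposed calls against the set of tools that were actually specified. A minimal sketch, with hypothetical tool names:

```python
# The registry of tools the spec actually defines. Names are illustrative.
KNOWN_TOOLS = {"incident_lookup", "account_status", "create_ticket"}

def classify_tool_call(tool_name: str, known_tools=KNOWN_TOOLS) -> str:
    """Flag attempts to call tools that do not exist in the spec."""
    return "ok" if tool_name in known_tools else "hallucinated"
```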

RAG or Retrieval Design

  • Corpus coverage: Does the source material even contain the answers users need?
  • Chunking strategy: How small or large should retrieval units be?
  • Relevance expectations: What does a good retrieval result look like?
  • Grounding behavior: When must the system answer only from retrieved content?
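Grounding expectations can be drafted as a crude check long before retrieval is built. The sketch below only flags answer words absent from the retrieved text; real grounding evals use entailment or citation checks, but the expectation it encodes is the same:

```python
def ungrounded_terms(answer: str, retrieved: str) -> list:
    """Return answer words that never appear in the retrieved context.
    Deliberately crude: token overlap only, no normalization beyond
    lowercasing. Serves as a placeholder for a real grounding metric."""
    context_words = set(retrieved.lower().split())
    return [word for word in answer.lower().split() if word not in context_words]
```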

Token and Cost Estimation

  • Token assumptions: How large will instructions, context, and outputs likely be?
  • Cost targets: What cost per task is acceptable?
  • Latency tradeoffs: Is a slower but better workflow acceptable?
  • Fallback strategy: When should cheaper models or staged flows be considered?
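Token assumptions translate directly into a back-of-the-envelope cost model. The token counts and per-1k-token rates below are placeholders, not real pricing:

```python
def estimate_cost_per_task(input_tokens: int, output_tokens: int,
                           price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Rough per-task cost from token assumptions and per-1k-token prices."""
    return (input_tokens / 1000) * price_in_per_1k \
         + (output_tokens / 1000) * price_out_per_1k

# Placeholder assumptions: 3,000 input tokens, 500 output tokens.
cost = estimate_cost_per_task(3000, 500, price_in_per_1k=0.01, price_out_per_1k=0.03)
```

Comparing such estimates across candidate workflows is often enough to decide whether a cheaper model or a staged flow is worth evaluating.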

Core Evaluation Dimensions

Pre-build evals translate product intent into testable expectations. These are not implementation tasks yet. They are requirements that should later become datasets, rubrics, and pass-fail checks.

  • Intent classification: Can the system correctly recognize what the user is trying to do?
  • Data extraction: Can it reliably extract the fields needed for reasoning or tool use?
  • Multi-intent classification: Can it split, sequence, or reject mixed requests correctly?
  • Multi-turn dialogue management: Can it maintain context without inventing facts or losing the user goal?
  • Tool selection: Does it know when tool use is mandatory, optional, or prohibited?
  • Response generation: Does the final answer remain accurate, grounded, complete, clear, and policy compliant?

A Concrete End-To-End Example

Consider a support agent for enterprise software. The team has not built the workflow yet, but it already knows that the future agent will answer product questions, look up account status, and create support tickets through tools.

  • Scenario ID: PB-017
  • User Goal: Resolve an outage and understand account impact.
  • Input: "Our analytics dashboard has been down since this morning. Can you check whether there is an active incident on our account and open a severity-one ticket if needed?"
  • Expected Intent: Primary: incident status lookup. Secondary: ticket creation if an outage is confirmed.
  • Required Data: Product area, time reference, severity signal, and account context.
  • Tool Expectation: Must use incident lookup first. Must not create a ticket without sufficient account context or confirmation.
  • Safety Expectation: Must not invent incident IDs, outage status, ticket IDs, or account details.
  • Response Expectation: Acknowledge the report, explain that status must be verified through tools, and ask for missing account details if needed.
  • Pass-Fail Rule: Pass if the agent chooses tool-first behavior, extracts the right fields, and asks for clarification when needed. Fail if it answers from prior knowledge, skips required tool use, or creates a ticket prematurely.
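A scenario like this can be captured as data long before any agent exists. A minimal sketch, with illustrative field names rather than any standard schema:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class EvalScenario:
    """One pre-build scenario, written down before implementation.
    Field names are illustrative, not a standard schema."""
    scenario_id: str
    user_input: str
    expected_intents: List[str]
    required_fields: List[str]
    required_first_tool: Optional[str]
    forbidden_behaviors: List[str]

pb_017 = EvalScenario(
    scenario_id="PB-017",
    user_input=("Our analytics dashboard has been down since this morning. "
                "Can you check whether there is an active incident on our "
                "account and open a severity-one ticket if needed?"),
    expected_intents=["incident_status_lookup", "ticket_creation"],
    required_fields=["product_area", "time_reference",
                     "severity_signal", "account_context"],
    required_first_tool="incident_lookup",
    forbidden_behaviors=["invent_incident_or_ticket_ids",
                         "create_ticket_before_confirmation"],
)
```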

Even a single scenario like this exercises multiple eval dimensions at once. It tests intent classification, extraction, tool selection, multi-turn handling, and response generation before a single orchestration workflow exists.

How To Use This In a Real Project

An effective pre-build eval set does not need to be huge. It needs to be representative. Start with happy paths, edge cases, ambiguous cases, adversarial or policy-sensitive cases, multi-turn cases, and tool-dependent cases.

For each scenario, define the user input, expected intent, required extracted fields, expected tool behavior, allowed and disallowed response characteristics, and a clear pass-fail rule. If a team cannot write these expectations down, it is usually not ready to build the agent.

Once development starts, use the same scenarios to run build-time evals against prompts, tools, and workflows. After release, use production failures and real-user feedback to expand the scenario set and improve future pre-build planning.
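Reusing the scenarios mechanically can look like the loop below, where `agent` stands in for whatever implementation exists later and returns a small description of its planned behavior. All names and the stub are assumptions for illustration:

```python
def run_evals(agent, scenarios):
    """Run each pre-build scenario against an implementation and apply its
    pass-fail rule. `agent` maps user input to a behavior description."""
    results = {}
    for scenario in scenarios:
        behavior = agent(scenario["input"])
        results[scenario["id"]] = (
            behavior.get("intent") == scenario["expected_intent"]
            and behavior.get("first_tool") == scenario["expected_first_tool"])
    return results

# Hypothetical scenario record and stub agent, used only to show the
# shape of the harness.
scenarios = [{"id": "PB-017",
              "input": "Our analytics dashboard has been down since this morning.",
              "expected_intent": "incident_status_lookup",
              "expected_first_tool": "incident_lookup"}]

def stub_agent(user_input):
    return {"intent": "incident_status_lookup", "first_tool": "incident_lookup"}

report = run_evals(stub_agent, scenarios)
```

The same harness runs unchanged at build time; only the agent behind it becomes real.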

Common Mistakes

  • Treating evals as something to add after the prototype works.
  • Writing only happy-path examples.
  • Confusing format correctness with task correctness.
  • Leaving ambiguous cases unspecified.
  • Ignoring tool failure and recovery behavior.
  • Defining quality with vague language such as "helpful" or "natural" without measurable criteria.
  • Failing to distinguish between pre-build planning, build-time testing, and runtime monitoring.

Final Thought

The point of pre-build evals is not to predict every failure before implementation. That is impossible. The point is to force precision early enough that the team builds the right system, with the right constraints, for the right behaviors.

If you wait until prompts are written and tools are integrated before deciding how the agent should be evaluated, the implementation will start shaping the requirements. Pre-build evals reverse that order. They make requirements shape the implementation.