Series Guide

AI Evals Content Map

A structured path through the evals cluster, from core concepts to deployment operations and business value.

This page is the front door to the evals section. It organizes the material by reading order and by operating concern so a reader can move from fundamentals into datasets, engineering workflow, production measurement, safety, governance, and executive outcomes without losing the thread.

What this section covers

Quality foundations: What evals are, how they differ from testing, and how evaluation dimensions get defined.
Operational practice: How teams design datasets, run build-time checks, evaluate RAG, and monitor live systems.
Control and value: How safety, EvalOps, and business metrics turn evaluation into a real operating discipline.
Reading Order

Recommended path through the series

If someone is new to the topic, this sequence builds the right mental model first, then moves into implementation and production discipline.

1. What Are Evals?

The entry point for the whole cluster. It explains why AI quality requires more than deterministic testing and introduces the main evaluation dimensions.

2. Testing vs Evals

Clarifies where conventional testing stops and where probabilistic evaluation starts, so the rest of the series has sharper boundaries.

3. Enterprise QA Teams Need Evals, Not Just Tests

Translates the fundamentals into a practical operating shift for QA leaders who need to extend deterministic assurance into evals, rubrics, slices, and release evidence.

4. Pre-Build Evals for AI Agents

Frames evaluation as a design-time activity. This is where teams define behaviors, scenarios, rubrics, and failure modes before implementation begins.

5. Datasets, Golden Sets, and Scenario Design

Explains the substrate of good evals: representative examples, edge cases, adversarial cases, and versioned golden sets.
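To make the idea of a versioned golden set concrete, here is a minimal sketch of what one entry might look like. The field names (`kind`, `must_mention`, `must_not_claim`) are illustrative assumptions, not taken from any specific framework.

```python
# Hypothetical golden-set entry; all field names are illustrative.
golden_case = {
    "id": "refund-policy-007",
    "version": "2.1",
    "kind": "edge_case",  # representative | edge_case | adversarial
    "input": "Can I return an opened item after 45 days?",
    "expected": {
        "must_mention": ["45-day window", "opened items"],
        "must_not_claim": ["full refund guaranteed"],
    },
}

def validate(case: dict) -> bool:
    """Check that a golden-set entry carries the fields reviewers rely on."""
    required = {"id", "version", "kind", "input", "expected"}
    return required.issubset(case)  # issubset on a dict checks its keys

print(validate(golden_case))  # → True
```

Versioning each case (and the set as a whole) is what makes later regression comparisons meaningful: a score change can be attributed to the system or to the data, never ambiguously to both.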

6. Build-Time Evals

Connects evaluation to engineering workflow through regression suites, CI/CD, release gates, and change review.
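A release gate of the kind described here can be sketched in a few lines. This assumes each eval run produces per-dimension scores in [0, 1]; the dimension names and thresholds are illustrative, and a real gate would live in CI configuration rather than a script.

```python
# Illustrative thresholds; real floors come from the team's rubric.
THRESHOLDS = {"accuracy": 0.90, "grounding": 0.95, "safety": 0.99}

def release_gate(scores: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (passed, failing_dimensions) for a candidate build."""
    failing = [dim for dim, floor in THRESHOLDS.items()
               if scores.get(dim, 0.0) < floor]
    return (not failing, failing)

passed, failing = release_gate(
    {"accuracy": 0.93, "grounding": 0.96, "safety": 0.97}
)
print(passed, failing)  # safety misses its 0.99 floor, so the gate fails
```

The point of the sketch is the shape, not the numbers: the gate reports which dimension regressed, which is what turns a failed build into a triage item instead of a mystery.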

7. RAG Evals

Separates retrieval relevance, grounding, citation fidelity, and answer quality so RAG failures are diagnosable rather than hidden.
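The separation argued for here can be shown with two deliberately independent checks, assuming gold-relevant document IDs are known per query. The helper names are hypothetical, and the grounding check is a naive string match standing in for an NLI model or LLM judge.

```python
def retrieval_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of gold-relevant documents that retrieval surfaced."""
    if not relevant:
        return 1.0
    return len(relevant.intersection(retrieved)) / len(relevant)

def is_grounded(answer_claims: list[str], context: str) -> bool:
    """Naive grounding check: every claim appears verbatim in the
    retrieved context. Real systems would use an NLI model or judge."""
    return all(claim in context for claim in answer_claims)

recall = retrieval_recall(["doc1", "doc3"], {"doc1", "doc2"})
print(recall)  # 0.5: retrieval missed doc2, independent of answer quality
```

Scoring these separately is what makes RAG failures diagnosable: a low recall with high grounding points at the retriever, while the reverse points at the generator.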

8. Runtime Evals and Observability

Moves the discussion into production behavior, traces, drift detection, anomalies, and learning from real usage.

9. Safety Evals and Red Teaming

Shows how to pressure-test systems for prompt injection, policy bypass, data leakage, unsafe tool use, and adversarial behavior.

10. EvalOps

Turns evaluation into an organizational operating model with ownership, versioned artifacts, sign-off, and drift management.

11. Business Metrics for Evals

Closes the loop by connecting evaluation to SLA adherence, cost-to-serve, productivity, CSAT, and ROI.

By Topic

How the content clusters together

Coverage

What the cluster currently covers

The section covers the full eval lifecycle, with a practical bias toward agentic systems and enterprise deployment.

| Area | Primary article | What it helps answer |
| --- | --- | --- |
| Evals fundamentals | What Are Evals? | What evaluation is, why it exists, and which quality dimensions matter for AI systems. |
| Testing versus evaluation | Testing vs Evals | Which problems deterministic tests catch and which ones require probabilistic or rubric-based evaluation. |
| QA operating transition | Enterprise QA Teams Need Evals, Not Just Tests | How enterprise QA teams should extend deterministic assurance into evals, datasets, rubrics, slices, and release evidence for AI products. |
| Evaluation design before build | Pre-Build Evals for AI Agents | How to define scenarios, acceptance criteria, and failure modes before implementation starts, so evaluation shapes the requirements. |
| Datasets and scenario design | Datasets, Golden Sets, and Scenario Design | How to build representative, versioned, and risk-aware evaluation sets. |
| Build-time workflow | Build-Time Evals | How teams turn evals into engineering workflow with regression checks, release gates, and triage. |
| RAG-specific evaluation | RAG Evals | How to separate retrieval failure from grounding failure and final-answer failure. |
| Production behavior | Runtime Evals and Observability | How to detect drift, learn from traces, and evaluate performance after release. |
| Adversarial and safety pressure testing | Safety Evals and Red Teaming | How to test guardrails, prompt injection resistance, policy robustness, and unsafe tool behavior. |
| Governance and operating model | EvalOps | How teams own, version, approve, and review evaluation artifacts over time. |
| Business value | Business Metrics for Evals | How evaluation connects to cost, productivity, SLA adherence, containment, CSAT, and ROI. |
Quick Starts

Fast entry points for different readers