RAG Evals: Retrieval Relevance, Grounding, and Citation Fidelity

A retrieval-augmented system can fail even when the final answer sounds plausible, so evaluation has to pinpoint where the failure actually happened.


Core idea: do not score RAG as one blended behavior. Evaluate retrieval quality, grounding quality, citation quality, and answer quality separately so you can tell whether the problem is the corpus, the retriever, the generator, or the interface between them.

Decomposition

Why RAG evaluation breaks when teams use one score

Retrieval-augmented generation systems are compositional. They depend on the corpus, chunking strategy, metadata, ranking, context assembly, prompt design, model behavior, and answer formatting. When teams ask only “was the final answer correct?” they lose the ability to diagnose where failure started. That is also why dataset design matters so much for RAG.

A wrong answer may reflect bad retrieval, but it may also reflect correct retrieval combined with poor grounding. A confident answer with citations may still be misleading if the citations do not actually support the claim being made. A low answer score may even hide a corpus problem rather than a model problem. Treating all of that as one metric leads to vague fixes and repeated regressions.

The solution is to evaluate RAG as a small system, not as a single response surface.

Dimensions

The four evaluation layers that matter most

Retrieval relevance

Did the system fetch the right documents or passages for the question, including the evidence needed to answer correctly?

Grounding quality

Did the answer stay faithful to the retrieved material, or did the model add unsupported claims, guesses, or synthesis errors?

Citation fidelity

Do the cited passages actually support the specific claims they are attached to, and are the citations complete enough for verification?

Answer usefulness

Even if the answer is grounded, is it complete, clear, appropriately scoped, and helpful for the user’s real task?

Practical rule: if you cannot tell whether a failure came from retrieval, grounding, or final answer composition, your RAG evaluation is too collapsed to guide engineering work.
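One way to make this rule operational is to record one score per layer and never collapse them. The sketch below is hypothetical (the class and field names are not a standard API); it keeps four separate scores and includes a helper that points at the weakest layer:

```python
from dataclasses import dataclass

@dataclass
class LayeredRAGScore:
    """One eval case, scored per layer instead of as one blended number.
    (Illustrative sketch, not a standard API.)"""
    retrieval: float  # did we fetch the needed evidence? (0-1)
    grounding: float  # is the answer faithful to that evidence? (0-1)
    citation: float   # do citations support the exact claims? (0-1)
    answer: float     # is the final answer useful? (0-1)

    def weakest_layer(self) -> str:
        """Point engineering effort at the lowest-scoring layer."""
        scores = {
            "retrieval": self.retrieval,
            "grounding": self.grounding,
            "citation": self.citation,
            "answer": self.answer,
        }
        return min(scores, key=scores.get)

# Retrieval worked; the generator drifted from the evidence.
case = LayeredRAGScore(retrieval=0.9, grounding=0.4, citation=0.7, answer=0.6)
```

Averaging the same case into a single number would read as a bland 0.65 and hide the grounding failure entirely.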

Metrics

What teams should actually measure

Not every team needs exactly the same metric set, but most serious RAG programs need evaluators across the layers below.

  • Corpus coverage: Does the source material contain the needed answer at all, or is the system being asked to solve impossible questions?
  • Chunk and document retrieval: Were the right passages retrieved, ranked, and included in context for this query?
  • Grounded response generation: Did the model answer using the retrieved evidence rather than unsupported prior knowledge or invention?
  • Citation fidelity: Do the references point to text that genuinely supports the exact statement being made?
  • Failure behavior: When evidence is missing, conflicting, or weak, does the system abstain, clarify, or overstate confidence?

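For the retrieval layer, two standard metrics are recall@k and mean reciprocal rank (MRR) over gold passage ids. A minimal sketch, assuming each eval case is labeled with the ids of the passages that contain the needed evidence:

```python
def recall_at_k(retrieved_ids, gold_ids, k):
    """Fraction of gold passages that appear in the top-k retrieved results."""
    top_k = set(retrieved_ids[:k])
    return sum(1 for g in gold_ids if g in top_k) / len(gold_ids)

def mrr(retrieved_ids, gold_ids):
    """Reciprocal rank of the first gold passage; 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in gold_ids:
            return 1.0 / rank
    return 0.0
```

Recall@k answers "was the evidence in context at all?"; MRR answers "was it ranked high enough to matter?" Both require gold evidence labels, which is why dataset design comes before metrics.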
Failure Analysis

The RAG failure modes worth isolating

Strong RAG evals are built around concrete failure modes rather than abstract scoreboards.

Missed evidence

The correct information exists in the corpus but the retriever fails to surface it.

Distracting evidence

The retriever returns passages that look relevant but actually push the answer toward the wrong interpretation.

Ungrounded synthesis

The answer mixes retrieved facts with unsupported claims, invented constraints, or overconfident interpolation.

Bad citations

The answer cites text that is adjacent to the claim, only partially supportive, or unrelated to the exact statement.

These categories matter because each one points to a different engineering response. Missed evidence may require better chunking, metadata, or ranking. Ungrounded synthesis may require a stricter generation prompt or a different rubric. Bad citations may reflect answer formatting logic rather than retrieval quality. Those fixes typically enter the delivery loop through build-time evals.
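To see why ungrounded synthesis needs its own evaluator, consider a deliberately naive claim-support check. The 0.7 overlap threshold and helper names below are illustrative only; production graders typically use entailment models or LLM judges rather than lexical overlap:

```python
def claim_supported(claim: str, passages: list) -> bool:
    """Naive lexical check: a claim counts as supported if most of its
    content words appear in a single retrieved passage. Illustrative only."""
    words = {w.lower().strip(".,") for w in claim.split() if len(w) > 3}
    for passage in passages:
        p = passage.lower()
        hits = sum(1 for w in words if w in p)
        if words and hits / len(words) >= 0.7:
            return True
    return False

def grounding_rate(claims, passages):
    """Fraction of answer claims supported by at least one passage."""
    if not claims:
        return 1.0
    return sum(claim_supported(c, passages) for c in claims) / len(claims)

passages = ["Unused leave expires after 31 March unless a documented "
            "regional exception applies."]
supported = "Unused leave expires after March unless a regional exception applies"
unsupported = "Employees receive a cash payout for unused leave balances"
```

Even this toy version makes the layering visible: an answer mixing the two claims above would score 0.5 on grounding no matter how fluent it sounds.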

Dataset Design

What a RAG eval dataset should include

RAG datasets should not consist only of question-answer pairs. They should also preserve what evidence should be available, what evidence should not be used, and how the system should behave when the corpus is incomplete.

  • Answerable queries: cases where the corpus clearly contains the answer.
  • Unanswerable queries: cases where the system should say the corpus does not support a definitive answer.
  • Conflicting evidence: cases where documents disagree and the system must reflect uncertainty or explain the discrepancy.
  • Near-match traps: cases where documents look related but answer a different question.
  • Citation-sensitive cases: claims where precise source mapping matters, especially in policy, legal, medical, or enterprise knowledge settings.

This is one reason RAG evaluation depends heavily on the dataset design discipline described in the eval data article. Without well-structured scenarios, retrieval results are hard to interpret.
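A case schema along these lines can preserve evidence expectations alongside the query. The field names and behavior labels below are hypothetical, not a standard format:

```python
from dataclasses import dataclass

@dataclass
class RAGEvalCase:
    """One labeled scenario: what should be retrieved, what should not,
    and how the system should behave. (Hypothetical schema.)"""
    query: str
    expected_evidence_ids: list   # passages that must be retrieved
    forbidden_evidence_ids: list  # near-match traps that must not drive the answer
    answerable: bool              # does the corpus support an answer at all?
    expected_behavior: str        # "answer", "abstain", or "flag_conflict"

cases = [
    RAGEvalCase("What is the PTO carryover limit?",
                ["policy/leave#carryover"], [], True, "answer"),
    RAGEvalCase("What is the carryover limit in Brazil?",
                [], ["policy/leave#carryover"], False, "abstain"),
]
```

The second case is the important one: without an explicit `answerable` flag and an expected behavior, an abstention would be scored as a wrong answer instead of the correct response.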

Example

A concrete RAG eval example

Imagine an internal policy assistant answering benefits questions from an enterprise handbook corpus. A user asks whether unused leave can be carried over into next year.

  • Query: "Can I carry over my unused leave balance into next year?"
  • Required evidence: The leave policy section defining carryover limits, expiration timing, and any regional exceptions.
  • Retrieval expectation: The system should retrieve the relevant policy section rather than general PTO descriptions or unrelated holiday rules.
  • Grounding expectation: The answer should state the carryover rule exactly as written and note exceptions only if they are documented in the retrieved material.
  • Citation expectation: The cited passage should specifically contain the carryover rule, not just a nearby section from the same handbook.
  • Failure behavior: If region-specific information is missing, the system should say so rather than inventing an exception policy.
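The citation expectation in this example can be checked mechanically at the claim level. A naive sketch, with keyword matching standing in for what would usually be a judge-based check:

```python
def citation_supports_claim(claim_keywords: list, cited_text: str) -> bool:
    """Claim-level citation check: the cited passage itself, not a nearby
    section, must mention the key terms of the claim it is attached to.
    (Keyword matching is illustrative; real checks are usually judge-based.)"""
    text = cited_text.lower()
    return all(kw.lower() in text for kw in claim_keywords)

carryover_keywords = ["carry over", "unused leave"]
good_citation = ("Employees may carry over up to five days of unused leave "
                 "into the next calendar year.")
# Same handbook, adjacent section, does not support the carryover claim.
adjacent_citation = "Employees accrue 1.5 days of leave per month of service."
```

Document-level scoring would accept both citations, since both come from the right handbook; only a claim-level check catches the adjacent-section failure.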

This one case can reveal several different system weaknesses. The value of the eval is not just the score. The value is the ability to isolate which layer failed.

Pitfalls

Common RAG eval mistakes

  • Using answer correctness as the only metric: this hides whether the problem was retrieval, grounding, or citation quality.
  • Ignoring unanswerable cases: good RAG systems must know when the corpus does not justify an answer.
  • Scoring citations loosely: document-level overlap is often too weak when claim-level support is what matters.
  • Evaluating only ideal queries: real users ask vague, incomplete, or poorly phrased questions that stress retrieval differently.
  • Blaming the model for corpus gaps: some failures should trigger corpus improvement rather than prompt iteration.

Useful discipline: every failed RAG case should lead to one of four diagnoses: corpus problem, retrieval problem, grounding problem, or answer-presentation problem. If the diagnosis is always vague, the eval design needs work.
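That discipline can be encoded directly: check the layers in pipeline order, since an upstream failure explains everything downstream of it. The signal names below are hypothetical:

```python
def diagnose(case: dict) -> str:
    """Route a failed case to one of four diagnoses, checked in pipeline
    order so the first upstream failure wins. (Signal names hypothetical.)"""
    if not case["evidence_exists_in_corpus"]:
        return "corpus problem"
    if not case["evidence_retrieved"]:
        return "retrieval problem"
    if not case["answer_grounded"]:
        return "grounding problem"
    return "answer-presentation problem"

# The evidence existed and was retrieved, but the answer drifted from it.
failed = {"evidence_exists_in_corpus": True,
          "evidence_retrieved": True,
          "answer_grounded": False}
```

The ordering matters: blaming the model for a missing corpus section, or tuning the retriever when the generator is the problem, are exactly the vague fixes the decomposition is meant to prevent.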

Action

How teams should start evaluating RAG in practice

  1. Identify the highest-value question types users actually ask.
  2. Create labeled cases with expected evidence, not just expected answers.
  3. Separate retrieval, grounding, and citation scoring.
  4. Add unanswerable and conflicting-evidence scenarios early.
  5. Review failures by layer so engineering effort goes to the real source of error.
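Step 5 can start as simply as counting failed cases per layer. A minimal sketch, where each result pairs a case id with the layer that failed (or `None` for a pass):

```python
from collections import Counter

def review_by_layer(results):
    """Group failed cases by the layer that failed, so engineering effort
    goes to the real source of error. Each result is (case_id, failed_layer),
    with failed_layer None when the case passed."""
    return Counter(layer for _, layer in results if layer is not None)

# Hypothetical run: q1 passed; the others failed at a specific layer.
results = [("q1", None), ("q2", "retrieval"),
           ("q3", "retrieval"), ("q4", "grounding")]
```

A tally like this is already enough to decide whether the next sprint should go to the retriever or to the generation prompt.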

Teams do not need a perfect benchmark on day one. They do need a benchmark that reflects how RAG systems actually fail.