My go-to method is sentence-level grounding checks plus an "unanswerable" test set. We measure citation coverage, meaning the percentage of answer sentences supported by at least one retrieved chunk, and entailment-style faithfulness, where we verify the claim is actually backed by the cited text, not just "nearby." Practically, this is a small curated evaluation set (50 to 200 questions per domain) that includes trick questions and questions where the correct behavior is "I don't know." One failure this caught early was a confident answer citing the right document but the wrong section, because chunking and retrieval favored a similar-looking policy paragraph. The fix was to tighten chunk boundaries, boost section headers in retrieval, and require quote-level evidence for high-risk answers. We now run this evaluation on every retrieval config change before shipping.
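The citation-coverage metric described above could be sketched roughly as follows. This is a minimal illustration: it uses lexical overlap as a stand-in for the entailment check the contributor describes, and the 0.5 overlap threshold is an assumption, not their actual configuration.

```python
import re

def sentence_supported(sentence: str, chunk: str, min_overlap: float = 0.5) -> bool:
    """Crude lexical-overlap proxy for 'supported by this chunk'.
    A production check would use an entailment model instead;
    the 0.5 threshold here is purely illustrative."""
    sent_tokens = set(re.findall(r"\w+", sentence.lower()))
    chunk_tokens = set(re.findall(r"\w+", chunk.lower()))
    if not sent_tokens:
        return False
    return len(sent_tokens & chunk_tokens) / len(sent_tokens) >= min_overlap

def citation_coverage(answer: str, retrieved_chunks: list[str]) -> float:
    """Fraction of answer sentences supported by at least one retrieved chunk."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    if not sentences:
        return 0.0
    supported = sum(
        any(sentence_supported(s, c) for c in retrieved_chunks) for s in sentences
    )
    return supported / len(sentences)
```

An answer where half the sentences have no retrieved support would score 0.5, which is exactly the kind of gap this metric surfaces.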
Our primary application here is constructing a curated, adversarial test set for grounding. It's relatively easy to assess whether a model is "faithful" using the naive (and commonly used) metrics as a starting point, but it's important not to stop there, because models can fail in plenty of subtle ways. We specifically build cases where the retrieval context contains plausible but incorrect grounding information. In one case we retrieved an archived copy of a policy document stating "we retain employee data for seven years" when the actual current policy was three. Prompted with "what is the employee data retention period," the model dutifully and faithfully answered "seven years," citing the wrong source. A naive faithfulness score would call that a pass because the answer closely matched the retrieved text. Building these cases into our test set forced us to add a subtler source-weighting, recency-ranking layer to our retrieval logic that we'd never have thought to touch pre-production.
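A recency- and source-weighted reranking layer like the one this failure motivated might look something like the sketch below. The exponential half-life and the archive penalty are illustrative assumptions, not the contributor's actual values.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Chunk:
    text: str
    similarity: float      # base retrieval score from the vector store
    last_updated: date
    is_archived: bool

def rerank(chunks: list[Chunk], today: date,
           half_life_days: float = 365.0,
           archive_penalty: float = 0.5) -> list[Chunk]:
    """Down-weight stale and archived sources so a highly similar but
    outdated passage no longer outranks the current policy.
    The decay constants are assumptions for illustration."""
    def score(c: Chunk) -> float:
        age_days = (today - c.last_updated).days
        recency = 0.5 ** (age_days / half_life_days)   # exponential decay
        penalty = archive_penalty if c.is_archived else 1.0
        return c.similarity * recency * penalty
    return sorted(chunks, key=score, reverse=True)
```

With this in place, an archived seven-year retention paragraph with a slightly higher raw similarity score loses to the current three-year version.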
Before production, our go-to approach for evaluating RAG systems is systematic grounding checks using source-linked evaluation, not just answer-quality scoring. Every generated response is traced back to its retrieved documents, and we explicitly verify whether each key claim is supported by an actual citation. One metric that consistently catches issues is the unsupported-claim rate: the percentage of statements in an answer that cannot be directly mapped to the retrieved context. We measure this on a small but carefully curated test set that includes edge cases, such as ambiguous queries, outdated documents, and near-duplicate sources. This process once revealed a failure where answers sounded correct but were subtly combining facts from multiple documents in a way that none of the sources actually stated. Without grounding checks, it would have passed traditional accuracy reviews. That insight led us to tighten retrieval filters and add claim-level validation before deployment.
My go-to method is a claim-level attribution check that decomposes each answer into atomic claims and requires a minimum cosine similarity threshold between each claim and at least one retrieved chunk, plus an abstain flag when no chunk clears the bar. We pair this with a small adversarial test set seeded with near-duplicate passages and outdated facts. This process caught a failure where answers looked fluent but stitched together facts from multiple documents without a single supporting source, something standard retrieval recall and BLEU-style metrics completely missed.

Albert Richer, Founder, WhatAreTheBest.com
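The claim-level attribution check with an abstain flag, as described in that last answer, could be sketched as follows. It assumes claims and chunks have already been embedded; the 0.8 cosine threshold is an assumed cutoff, not a value from the original.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def attribute_claims(claim_vecs: list[list[float]],
                     chunk_vecs: list[list[float]],
                     threshold: float = 0.8) -> list[dict]:
    """For each claim embedding, find its best-matching chunk and flag
    abstain when no chunk clears the threshold (0.8 is an assumption)."""
    results = []
    for i, cv in enumerate(claim_vecs):
        sims = [cosine(cv, ch) for ch in chunk_vecs]
        best = max(range(len(sims)), key=sims.__getitem__)
        results.append({
            "claim": i,
            "best_chunk": best,
            "similarity": sims[best],
            "abstain": sims[best] < threshold,
        })
    return results
```

The abstain flag is the key design choice: a claim with no sufficiently close chunk is surfaced as unsupported rather than silently attributed to its nearest neighbor.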