Our offline evals looked great. Production was on fire. The gap: a golden set full of queries we imagined, not queries that actually torched the system.

The harness was simple. Retrieval: recall@5 and nDCG@10 against labeled relevance. Generation: groundedness via LLM-as-judge (does every claim trace to a chunk?). Both ran on every PR. Metrics didn't predict prod. Until we changed how we built the dataset.

The trick: 30% of our golden set now comes from error logs. Thumbs-down queries. Zero-retrieval queries. Misspellings, jargon, phrasing we never would have invented. Before: recall@5 correlated 0.52 with prod satisfaction. After adding failure cases: 0.84. Same metric. Different dataset.

The rubric shift: for groundedness, we stopped asking "is the answer correct" and started asking "does every sentence cite a chunk." Small tweak. Massive difference in killing hallucinations before ship.
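For reference, the retrieval half of that harness is tiny. A minimal sketch in plain Python, assuming each golden-set entry carries labeled-relevant chunk IDs (binary labels for recall@5, graded labels for nDCG@10); all names here are illustrative, not our exact code:

```python
import math

def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of labeled-relevant chunks that show up in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def ndcg_at_k(retrieved_ids, relevance, k=10):
    """nDCG@k, where `relevance` maps chunk id -> graded label (0 = irrelevant)."""
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(retrieved_ids[:k])
    )
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# One golden-set query: ranked IDs from the retriever vs. human labels.
retrieved = ["c7", "c2", "c9", "c1", "c4"]
print(recall_at_k(retrieved, {"c1", "c2", "c8"}))         # 2 of 3 relevant -> 0.67
print(ndcg_at_k(retrieved, {"c2": 3, "c1": 2, "c8": 1}))  # graded labels
```

The point of keeping the metric code this small: it never changes. Only the dataset does, and that's exactly what moved the correlation from 0.52 to 0.84.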
We built our offline harness by splitting the retrieval side from the generation side. You can't evaluate them as one big blob. For the retrieval part, we run batch jobs against a versioned vector database using what we call a Golden Dataset of query-context pairs. We look specifically at nDCG and recall@k, because if the right info isn't hitting those top three slots, the generator is going to struggle regardless of how good it is.

For groundedness, we use an LLM-as-a-judge process. It goes through the output sentence by sentence and cross-references it against the retrieved chunks. If a claim doesn't have a direct semantic match in the source data, it gets flagged immediately.

The one curation trick that actually tells us whether we're going to fail in production is adversarial unanswerables. We have an LLM cook up questions that are topically related to our documents but can't actually be answered by the specific data in our index. It's a stress test (a sketch of the check follows below). If the system returns a high confidence score or tries to provide a grounded answer for these near-misses, we know our retrieval thresholds are way too loose. It stops the hallucination of relevance that usually kills RAG systems once they face real-world, messy user intent.

At the end of the day, building these systems is less about writing complex code and more about the integrity of your test data. If your eval set doesn't reflect the edge cases and noisy queries of real users, you're basically shipping on a wing and a prayer.
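Back to the unanswerables, since people always ask what the check looks like. This is a sketch, not our exact code: `generate_unanswerables` stands in for the LLM call that writes the near-miss questions, `answer` for your RAG pipeline's entry point, and the 0.7 threshold and refusal phrase are illustrative assumptions:

```python
def unanswerable_failure_rate(doc_topics, generate_unanswerables, answer,
                              confidence_threshold=0.7):
    """Fraction of adversarial unanswerables the system answers instead of refusing.

    Assumed interfaces (hypothetical; wire up to your own stack):
      generate_unanswerables(topic) -> list[str]  # LLM writes near-miss questions
      answer(question) -> (text, confidence)      # your RAG pipeline
    """
    questions = [q for topic in doc_topics for q in generate_unanswerables(topic)]
    failures = 0
    for question in questions:
        text, confidence = answer(question)
        # Naive refusal detection; a real harness should use a structured
        # abstain signal from the pipeline rather than string matching.
        refused = "not enough information" in text.lower()
        # A confident, grounded-sounding answer to a near-miss question
        # means the retrieval thresholds are too loose.
        if confidence >= confidence_threshold and not refused:
            failures += 1
    return failures / len(questions) if questions else 0.0
```

In a harness like ours you'd gate on this the same way as recall and nDCG: pick a budget and fail the run when the rate crosses it.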
I built an offline RAG eval harness by freezing a gold set of questions and expected source passages. We scored retrieval with recall@k and nDCG, then checked groundedness by verifying that answers cite only retrieved text. It keeps changes honest. I curated the dataset from real support logs and added hard negatives that look similar but are wrong. We used a simple rubric for faithfulness and usefulness, scored by two reviewers. This predicted production drops within one sprint. Tight data beats fancy metrics every time.
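Concretely, the citation check was just a per-sentence loop. A minimal sketch, assuming a `judge(sentence, chunks)` callable that wraps whatever LLM-as-judge prompt you use (a hypothetical stand-in, not a real API):

```python
import re

def split_sentences(text):
    """Naive splitter; swap in a real sentence tokenizer for production use."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def ungrounded(answer, chunks, judge):
    """Sentences of `answer` with no support in the retrieved chunks.

    `judge(sentence, chunks) -> bool` is a hypothetical LLM-as-judge call that
    returns True only if every claim in the sentence traces to a chunk.
    """
    return [s for s in split_sentences(answer) if not judge(s, chunks)]

def groundedness(answer, chunks, judge):
    """Share of answer sentences fully supported by the retrieved chunks."""
    sentences = split_sentences(answer)
    if not sentences:
        return 1.0
    return 1 - len(ungrounded(answer, chunks, judge)) / len(sentences)
```

Scoring per sentence instead of per answer is what makes failures actionable: you see the exact claim with no support, which is the whole shift from "is the answer correct" to "does every sentence cite a chunk."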