I've built and migrated dozens of websites from Bubble to Webflow and other platforms, and while I haven't specifically taught RAG systems, I've evaluated what makes technical migrations succeed or fail--it's the same core skill: can someone prove their system actually works under real conditions? The most predictive criterion I'd look for is **retrieval precision under edge cases**. When we migrated GoFIGR's 50+ CMS items from Bubble to Webflow, the project only succeeded because we tested every filter, category, and search scenario before launch--not just the happy path. For a RAG capstone, I'd award full credit when students show me their system handles ambiguous queries, contradictory documents, or out-of-domain questions without hallucinating. Concrete evidence: query logs with failure cases, notes on how they tuned chunk sizes or similarity thresholds, and A/B comparisons showing retrieval improved response accuracy by X%. Most students will demo their RAG working perfectly on sanitized test data. The ones who deserve full marks are the ones who break their own system first, document why it failed (like when our Hopstack redesign needed custom filtering beyond native Webflow CMS), then show measurable improvements. That's how you prove you understood the tool instead of just copy-pasting a tutorial.
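For what that evidence can look like in practice, here is a minimal sketch of an edge-case retrieval harness. The query set, chunk IDs, and the `retrieve` stub are hypothetical placeholders; a real project would swap in its own vector search and a much larger labeled set.

```python
from collections import defaultdict

# Each case: a query, the chunk IDs that *should* come back (empty set = out-of-domain),
# and a label so failures can be grouped by edge-case type.
EDGE_CASES = [
    {"query": "refund policy for digital goods", "relevant": {"policy_04", "policy_07"}, "type": "ambiguous"},
    {"query": "what is the capital of France",   "relevant": set(),                      "type": "out_of_domain"},
    {"query": "2022 vs 2024 pricing",            "relevant": {"pricing_2024"},           "type": "contradictory_docs"},
]

def retrieve(query: str, k: int = 3) -> list[str]:
    """Placeholder retriever using word overlap -- swap in your vector-store query."""
    corpus = {
        "policy_04": "refund policy digital goods",
        "policy_07": "store credit refunds",
        "pricing_2024": "2024 pricing tiers",
    }
    words = set(query.lower().split())
    scored = [(len(words & set(text.split())), cid) for cid, text in corpus.items()]
    scored = [(s, cid) for s, cid in scored if s > 0]   # crude stand-in for a similarity threshold
    scored.sort(reverse=True)
    return [cid for _, cid in scored[:k]]

def eval_query(retrieved: list[str], relevant: set[str], k: int) -> tuple[float, float]:
    """Precision and recall over the top-k retrieved chunk IDs."""
    top = retrieved[:k]
    if not relevant:                        # out-of-domain: success means returning nothing
        return (1.0, 1.0) if not top else (0.0, 0.0)
    hits = sum(1 for cid in top if cid in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant)
    return precision, recall

by_type = defaultdict(list)
for case in EDGE_CASES:
    retrieved = retrieve(case["query"])
    p, r = eval_query(retrieved, case["relevant"], k=3)
    by_type[case["type"]].append((p, r))
    if p < 1.0 or r < 1.0:                  # log the failure itself, not just the score
        print(f"FAIL [{case['type']}] {case['query']!r} -> {retrieved}")

for qtype, scores in by_type.items():
    avg_p = sum(p for p, _ in scores) / len(scores)
    avg_r = sum(r for _, r in scores) / len(scores)
    print(f"{qtype}: precision={avg_p:.2f} recall={avg_r:.2f}")
```

The value is the failure log: each edge-case type gets its own precision and recall, so a regression in, say, out-of-domain handling shows up as a number rather than a feeling during the demo.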
I run a landscaping company in Massachusetts, not a tech program, but I evaluate contractors and crew performance daily based on whether they can recall the *right* procedure for the *specific* problem in front of them--not just general knowledge. When I'm reviewing how my team handled a drainage issue or pest outbreak, the criterion I care about most is **contextual accuracy**--did they apply the solution that matches the exact soil conditions, time of year, and plant species we're dealing with? We had a crew nearly kill a client's native ferns by applying generic compost amendments instead of retrieving the correct protocol for acidic woodland soil. Full credit goes to whoever shows me they cross-referenced the specific site conditions before acting. The concrete evidence I look for: documentation showing they considered multiple options and explained why they rejected the irrelevant ones. For spring cleanups, I want to see notes proving they identified *why* pre-emergent herbicides work before 55°F soil temps but fail after--not just that "herbicides control weeds." That decision-making trail proves they understood context, not just memorized steps. In our testimonials, clients rave when we solve problems other companies said were "too difficult"--that happens because my team retrieves relevant past project details (equipment access limits, urban constraints) and adapts. That's the skillset I'd demand proof of.
I spent 14 years doing failure analysis at Intel--staring at silicon dies under microscopes, hunting for the one defect that killed performance. That trained me to spot the difference between "it works" and "it works reliably," which is exactly what separates decent RAG systems from ones that fall apart in production. The most predictive criterion I'd look for is **source attribution accuracy under retrieval conflicts**. When I'm doing data recovery, customers don't just want their photos back--they need to know which drive they came from, what folder, what date. Same principle applies here. I'd award full credit when students show me their system can correctly cite which document chunk produced each part of the answer, especially when two retrieved passages contradict each other. Concrete evidence: a test set where they deliberately feed conflicting sources and demonstrate how their system flags uncertainty or prioritizes based on metadata like document date or authority score. Most capstone projects show the RAG spitting out answers. The ones I'd give top marks to are students who include logs showing: "Query X retrieved chunks from Doc A (confidence 0.87) and Doc B (confidence 0.34), system selected A because [specific ranking logic], here's why that was correct." That's diagnostic thinking--proving you understand what's happening under the hood when retrieval gets messy. At my shop, I don't just fix a phone and hope it works. I test it, document what failed, explain why my repair addressed the root cause, and guarantee it for a year. Apply that same standard to your RAG system--show me the failure cases you caught and fixed before demo day.
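To make that kind of log concrete, here is a minimal sketch of a conflict-aware source picker. The names (`Chunk`, `pick_source`) and the ranking weights are illustrative assumptions, not a prescribed method; it presumes the retriever already returns chunks with a similarity score plus date and authority metadata.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Chunk:
    doc_id: str
    text: str
    score: float      # retrieval similarity, 0-1
    published: date   # document date
    authority: float  # e.g. 1.0 for official docs, 0.5 for forum posts

def pick_source(chunks: list[Chunk], margin: float = 0.15) -> tuple[Chunk, str]:
    """Rank candidate chunks and return the winner plus a human-readable audit line."""
    # Blend similarity, authority, and recency; the weights are illustrative, not tuned.
    def rank_key(c: Chunk) -> float:
        recency = c.published.year / date.today().year
        return 0.6 * c.score + 0.25 * c.authority + 0.15 * recency

    ranked = sorted(chunks, key=rank_key, reverse=True)
    top = ranked[0]
    runner_up = ranked[1] if len(ranked) > 1 else None
    if runner_up and abs(top.score - runner_up.score) < margin:
        note = (f"CONFLICT: {top.doc_id} ({top.score:.2f}) vs {runner_up.doc_id} "
                f"({runner_up.score:.2f}); chose {top.doc_id} (authority {top.authority}, "
                f"published {top.published}); flag the answer as uncertain.")
    else:
        note = f"Selected {top.doc_id} (score {top.score:.2f}); no close competitor."
    return top, note

# Two retrieved passages that disagree about the same fact.
candidates = [
    Chunk("Doc_A", "Warranty is 12 months.", 0.87, date(2024, 3, 1), 1.0),
    Chunk("Doc_B", "Warranty is 6 months.", 0.80, date(2021, 7, 9), 0.5),
]
chosen, audit = pick_source(candidates)
print(audit)   # the kind of per-query trail you'd include in the capstone write-up
```

The printed audit line is the artifact that matters for grading: it records which chunk won, why, and whether the answer should be flagged as uncertain.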
I've spent years managing our Vendor Managed Inventory program across 60+ customer locations--when you're tracking thousands of SKUs across that many sites, you learn fast that retrieval is only half the battle. The real test is whether your system knows when it *doesn't* know something. The criterion I'd look for: **graceful handling of out-of-scope queries with clear boundary documentation**. When a contractor calls asking for a specialty valve we don't stock, the worst thing my team can do is guess or force-fit a wrong answer. Same with RAG systems. I'd award full credit when students demonstrate their system explicitly flags "insufficient context" and shows exactly what query terms caused the retrieval to fail--not just silence or hallucination. Concrete evidence I'd want: a test log showing queries that *should* fail (asking about topics not in the vector database), with the system returning "No relevant documents found for [specific terms]" plus a similarity score breakdown showing why the closest matches still fell below threshold. At Standard, we built our reputation by telling contractors "we'll get that for you tomorrow" instead of shipping the wrong part today. The students who document their system's limits as thoroughly as its successes--those are the ones building something you'd trust with real customer money on the line.
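Here is a minimal sketch of that refusal behavior, with hypothetical names (`search`, `answer_or_refuse`) and an illustrative threshold; it assumes the vector store returns (chunk_id, similarity) pairs and leaves the generation step out.

```python
THRESHOLD = 0.75  # minimum similarity to treat a chunk as relevant (illustrative value)

def search(query: str, k: int = 3) -> list[tuple[str, float]]:
    """Placeholder vector search -- replace with your store's query call."""
    return [("hvac_manual_p12", 0.41), ("valve_catalog_p3", 0.38), ("faq_shipping", 0.22)]

def answer_or_refuse(query: str) -> str:
    hits = search(query)
    passing = [(cid, s) for cid, s in hits if s >= THRESHOLD]
    if not passing:
        # Refuse explicitly and show *why*: the closest matches and their scores.
        breakdown = ", ".join(f"{cid}={s:.2f}" for cid, s in hits)
        return (f"No relevant documents found for {query!r}. "
                f"Closest matches fell below threshold {THRESHOLD}: {breakdown}")
    # Otherwise hand the passing chunks to the generation step (not shown here).
    return f"Answering from: {[cid for cid, _ in passing]}"

print(answer_or_refuse("lead time on a 6-inch specialty cryogenic valve"))
```

Logging the score breakdown alongside the refusal is what turns "the system stayed quiet" into documented evidence of where the knowledge base ends.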