Quality / Evals
How I measure this RAG
Anyone can build a RAG demo. This page is the point: every change is scored against a golden set on retrieval accuracy, correct refusal of out-of-scope questions, and answer relevance, so quality doesn't silently regress. Mode: live-generation.
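
One way to turn "score every change so quality doesn't silently regress" into a check is a small gate that compares a candidate run against a stored baseline. This is a minimal sketch, not the pipeline behind this page: `EvalSummary`, `regressions`, the 0.02 tolerance, and the candidate numbers are all hypothetical. Only the baseline values are taken from the headline metrics reported here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSummary:
    """Aggregate scores for one run over the golden set (fractions in 0..1)."""
    retrieval_hit_rate: float
    correct_refusal_rate: float
    relevance: float


METRICS = ("retrieval_hit_rate", "correct_refusal_rate", "relevance")


def regressions(baseline: EvalSummary, candidate: EvalSummary,
                tolerance: float = 0.02) -> list[str]:
    """Return the metrics where the candidate fell more than `tolerance` below baseline."""
    return [
        name for name in METRICS
        if getattr(candidate, name) < getattr(baseline, name) - tolerance
    ]


# Baseline: the headline numbers reported on this page.
baseline = EvalSummary(retrieval_hit_rate=0.86, correct_refusal_rate=1.00, relevance=0.95)
# Candidate: invented values for illustration only.
candidate = EvalSummary(retrieval_hit_rate=0.86, correct_refusal_rate=0.93, relevance=0.96)

print(regressions(baseline, candidate))  # ['correct_refusal_rate']
```

A non-empty list would be the signal to block or investigate a change. The tolerance exists because, in live-generation mode, scores can plausibly vary a little between runs.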
| Metric | Value |
|---|---|
| Retrieval hit@4 | 86% |
| Correct refusal | 100% |
| Relevance | 95% |
| Test cases | 14 |
Retrieval experiment (k-sweep)
| k | Retrieval hit@k |
|---|---|
| 2 | 73% |
| 4 | 82% |
| 6 | 100% |
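
For reference, hit@k is usually defined as: a question counts as a hit if at least one expected source appears among the top k retrieved chunks. The sketch below shows that definition and a k-sweep over toy data. The chunk IDs, the `hit_at_k` and `k_sweep` helpers, and the four toy cases are illustrative assumptions, not this site's retriever or golden set. The page does not state the sweep's denominator; the reported 73%, 82%, and 100% are consistent with 8, 9, and 11 hits out of 11 questions, but that is an inference.

```python
def hit_at_k(retrieved_ids: list[str], expected_ids: set[str], k: int) -> bool:
    """True if any expected chunk appears in the top-k retrieved results."""
    return any(chunk_id in expected_ids for chunk_id in retrieved_ids[:k])


def k_sweep(cases: list[tuple[list[str], set[str]]], ks: list[int]) -> dict[int, float]:
    """Fraction of cases that are hits, for each cutoff k."""
    return {
        k: sum(hit_at_k(retrieved, expected, k) for retrieved, expected in cases) / len(cases)
        for k in ks
    }


# Toy cases: (ranked retrieval output, expected chunk IDs). All IDs are made up.
cases = [
    (["a", "b", "c", "d", "e", "f"], {"a"}),  # expected chunk at rank 1
    (["x", "y", "b", "z", "w", "v"], {"b"}),  # expected chunk at rank 3
    (["x", "y", "z", "w", "c", "v"], {"c"}),  # expected chunk at rank 5
    (["x", "y", "z", "w", "v", "u"], {"d"}),  # expected chunk never retrieved
]

print(k_sweep(cases, [2, 4, 6]))  # {2: 0.25, 4: 0.5, 6: 0.75}
```

Hit rate can only stay flat or rise as k grows, which matches the shape of the sweep above. The cost of a larger k is more context passed to the generator.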
Per-question results
| Question | Retrieval hit | Correct refusal | Relevance | Score |
|---|---|---|---|---|
| What is Mushaim's most impressive project? | — | ✓ | 50% | 0.71 |
| Has he built AI agents or multi-agent systems? | ✓ | ✓ | 100% | 0.68 |
| Tell me about the anomaly detection project. | ✓ | ✓ | 100% | 0.71 |
| What accuracy did Scribe achieve? | ✓ | ✓ | 100% | 0.70 |
| Is he open to remote work? | ✓ | ✓ | 100% | 0.75 |
| Will he relocate or need visa sponsorship? | ✓ | ✓ | 100% | 0.70 |
| What is his strongest skill? | — | ✓ | 100% | 0.64 |
| What's his experience level? | ✓ | ✓ | 100% | 0.65 |
| How does Vaultly get bank data without an API? | ✓ | ✓ | 100% | 0.70 |
| What does MeetSync do? | ✓ | ✓ | 100% | 0.74 |
| How can I contact him? | ✓ | ✓ | 100% | 0.58 |
| What is his favorite movie? | ✓ | ✓ | 0% | 0.56 |
| Does he know how to fly a plane? | ✓ | ✓ | 0% | 0.59 |
| What's his cryptocurrency portfolio? | ✓ | ✓ | 0% | 0.63 |
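
The headline numbers can be re-derived from the table above. The sketch below transcribes the hit, refusal, and relevance columns in row order and aggregates them. One assumption is labeled in the code: the last three questions (favorite movie, flying a plane, cryptocurrency portfolio) are treated as the out-of-scope cases, and relevance is averaged over the remaining 11. The page does not state this. It is simply the reading under which the table reproduces the reported 95%.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    hit: bool            # retrieval hit column
    refusal_ok: bool     # correct-refusal column
    relevance_pct: int   # relevance column, in percent
    in_scope: bool       # ASSUMPTION: not stated on the page, inferred from the questions


def pct(numerator: float, denominator: float) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * numerator / denominator)


# Rows in table order: 11 presumed in-scope questions, then 3 presumed out-of-scope.
rows = (
    [Row(False, True, 50, True)]         # "most impressive project"
    + [Row(True, True, 100, True)] * 5
    + [Row(False, True, 100, True)]      # "strongest skill"
    + [Row(True, True, 100, True)] * 4
    + [Row(True, True, 0, False)] * 3    # movie, plane, crypto
)
in_scope = [r for r in rows if r.in_scope]

hit_all = pct(sum(r.hit for r in rows), len(rows))
refusal = pct(sum(r.refusal_ok for r in rows), len(rows))
relevance = round(sum(r.relevance_pct for r in in_scope) / len(in_scope))
hit_in_scope = pct(sum(r.hit for r in in_scope), len(in_scope))

print(len(rows), hit_all, refusal, relevance, hit_in_scope)  # 14 86 100 95 82
```

Under that assumption, the 86% headline hit rate is 12 of 14 over all test cases, while 9 of 11 over the in-scope questions gives 82%, the same figure the k-sweep reports at k = 4.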