How I measure this RAG

Anyone can build a RAG demo. This page is the point: every change is scored against a golden set for retrieval accuracy, correct refusal on out-of-scope questions, and answer relevance, so quality doesn't silently regress. Mode: live-generation.

Retrieval hit@4: 86%
Correct refusal: 100%
Relevance: 95%
Test cases: 14
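
As a rough illustration of how a golden set can produce numbers like these, here is a minimal Python sketch. It is not the code behind this page: `GoldenCase`, `RunResult`, `hit_at_k`, and `correct_refusal_rate` are hypothetical names, and the choice to score hit@k over in-scope questions and refusal over out-of-scope questions is an assumption, since the page does not state its denominators.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class GoldenCase:
    question: str
    expected_doc: Optional[str]  # None marks an out-of-scope question (assumed convention)


@dataclass
class RunResult:
    retrieved: Sequence[str]  # ranked document ids returned by the retriever
    refused: bool             # True if the system declined to answer


def hit_at_k(cases: Sequence[GoldenCase], results: Sequence[RunResult], k: int) -> float:
    """Fraction of in-scope cases whose expected document is in the top k."""
    pairs = [(c, r) for c, r in zip(cases, results) if c.expected_doc is not None]
    if not pairs:
        return 0.0
    hits = sum(1 for c, r in pairs if c.expected_doc in list(r.retrieved)[:k])
    return hits / len(pairs)


def correct_refusal_rate(cases: Sequence[GoldenCase], results: Sequence[RunResult]) -> float:
    """Fraction of out-of-scope cases where the system refused."""
    pairs = [(c, r) for c, r in zip(cases, results) if c.expected_doc is None]
    if not pairs:
        return 0.0
    return sum(1 for _, r in pairs if r.refused) / len(pairs)


# Tiny made-up golden set, for illustration only.
cases = [
    GoldenCase("in-scope question A", "doc-a"),
    GoldenCase("in-scope question B", "doc-b"),
    GoldenCase("out-of-scope question", None),
]
results = [
    RunResult(["doc-a", "doc-x"], refused=False),
    RunResult(["doc-x", "doc-y", "doc-b"], refused=False),
    RunResult([], refused=True),
]
print(hit_at_k(cases, results, k=2), correct_refusal_rate(cases, results))
```

Running the same functions on every change, against the same fixed cases, is what turns a silent regression into a number that visibly drops.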

Retrieval experiment (k-sweep)

k = 2: 73%
k = 4: 82%
k = 6: 100%
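
A k-sweep like this can be sketched in a few lines of Python. The function name `k_sweep` and the rank list below are hypothetical. The ranks are invented so that the output matches the percentages above (which are consistent with 8, 9, and 11 hits out of 11 in-scope questions, although the page does not state the denominator). They are not the real per-question ranks.

```python
from typing import Dict, Iterable, Optional, Sequence


def k_sweep(first_hit_ranks: Sequence[Optional[int]], ks: Iterable[int]) -> Dict[int, float]:
    """Hit rate at each k.

    first_hit_ranks holds, per question, the 1-based rank of the first
    relevant chunk, or None if it was never retrieved.
    """
    if not first_hit_ranks:
        raise ValueError("need at least one question")
    total = len(first_hit_ranks)
    return {
        k: sum(1 for rank in first_hit_ranks if rank is not None and rank <= k) / total
        for k in ks
    }


# Hypothetical ranks for 11 in-scope questions (illustrative, not measured).
ranks = [1, 1, 1, 2, 2, 1, 2, 1, 3, 5, 6]
sweep = k_sweep(ranks, [2, 4, 6])
for k, rate in sweep.items():
    print(f"k = {k}: {rate:.0%}")
```

Because the retriever only needs to run once at the largest k, a sweep of this kind is cheap: each smaller k is just a stricter cutoff on the same ranked list.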

Per-question results

| Question | Hit | Refusal | Rel. | Score |
|---|---|---|---|---|
| What is Mushaim's most impressive project? | | | 50% | 0.71 |
| Has he built AI agents or multi-agent systems? | | | 100% | 0.68 |
| Tell me about the anomaly detection project. | | | 100% | 0.71 |
| What accuracy did Scribe achieve? | | | 100% | 0.7 |
| Is he open to remote work? | | | 100% | 0.75 |
| Will he relocate or need visa sponsorship? | | | 100% | 0.7 |
| What is his strongest skill? | | | 100% | 0.64 |
| What's his experience level? | | | 100% | 0.65 |
| How does Vaultly get bank data without an API? | | | 100% | 0.7 |
| What does MeetSync do? | | | 100% | 0.74 |
| How can I contact him? | | | 100% | 0.58 |
| What is his favorite movie? | | | 0% | 0.56 |
| Does he know how to fly a plane? | | | 0% | 0.59 |
| What's his cryptocurrency portfolio? | | | 0% | 0.63 |
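
The headline Relevance figure appears to be the mean of the per-question percentages over the in-scope questions only. The short Python check below works under two assumptions: that the percentage in each row is the Rel. value, and that the last three rows (favorite movie, flying a plane, cryptocurrency portfolio) are the out-of-scope cases excluded from the average. The variable names are illustrative.

```python
from statistics import mean

# Per-question relevance (%) for the 11 in-scope rows, in table order.
in_scope_relevance = [50, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100]
# The three rows assumed to be out-of-scope questions.
out_of_scope_relevance = [0, 0, 0]

in_scope_mean = mean(in_scope_relevance)
all_rows_mean = mean(in_scope_relevance + out_of_scope_relevance)

print(f"in-scope only: {in_scope_mean:.0f}%")
print(f"all 14 rows:   {all_rows_mean:.0f}%")
```

Only the in-scope average rounds to the 95% shown at the top of the page, which suggests refusals are scored separately and do not count against relevance.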