Quality / Evals

How I measure this RAG

Anyone can build a RAG demo; measuring it is the point of this page. Every change is scored against a golden set for retrieval accuracy, correct refusal of out-of-scope questions, and answer relevance, so quality doesn't silently regress. Mode: live-generation.
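One way to make "quality doesn't silently regress" concrete is a regression gate that compares a candidate run against the last accepted baseline. The sketch below is illustrative only: the `Metrics` class, the `regressions` function, the 0.02 tolerance, and the candidate numbers are all hypothetical. Only the baseline values are taken from the headline figures on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    """Aggregate scores for one eval run, each in [0, 1]. Hypothetical schema."""
    retrieval_hit_rate: float  # share of cases where an expected source was retrieved
    refusal_rate: float        # share of cases where refuse/answer behaviour was correct
    relevance: float           # mean answer relevance


def regressions(baseline: Metrics, current: Metrics, tolerance: float = 0.02) -> list[str]:
    """Return the names of metrics that dropped by more than `tolerance`."""
    failed = []
    for name in ("retrieval_hit_rate", "refusal_rate", "relevance"):
        if getattr(current, name) < getattr(baseline, name) - tolerance:
            failed.append(name)
    return failed


# Baseline: headline figures from this page. Candidate: made-up numbers.
baseline = Metrics(retrieval_hit_rate=0.86, refusal_rate=1.00, relevance=0.95)
candidate = Metrics(retrieval_hit_rate=0.86, refusal_rate=0.93, relevance=0.95)

print(regressions(baseline, candidate))  # ['refusal_rate']
```

A non-empty result would block the change. The tolerance absorbs run-to-run noise, which matters in live-generation mode where answers are not deterministic.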

Retrieval hit@4: 86%

Correct refusal: 100%

Relevance: 95%

Test cases

Retrieval experiment (k-sweep)

k = 2: 73%

k = 4: 82%

k = 6: 100%
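For readers unfamiliar with the metric, hit@k counts a question as a hit when at least one expected source appears among the top k retrieved chunks. The sketch below shows a k-sweep over toy data. The function names, chunk ids, and cases are invented for illustration and are not this page's golden set or retriever.

```python
def hit_at_k(ranked_ids: list[str], expected_ids: set[str], k: int) -> bool:
    """True if any expected chunk id appears in the top-k retrieved ids."""
    return any(chunk_id in expected_ids for chunk_id in ranked_ids[:k])


def k_sweep(cases: list[tuple[list[str], set[str]]], ks=(2, 4, 6)) -> dict[int, float]:
    """Hit rate at each k over (ranked_ids, expected_ids) pairs."""
    return {
        k: sum(hit_at_k(ranked, expected, k) for ranked, expected in cases) / len(cases)
        for k in ks
    }


# Toy cases: the expected chunk sits at rank 1, rank 3, rank 5, or is never retrieved.
toy_cases = [
    (["a", "x", "y", "z", "q", "r"], {"a"}),
    (["x", "y", "b", "z", "q", "r"], {"b"}),
    (["x", "y", "z", "q", "c", "r"], {"c"}),
    (["x", "y", "z", "q", "r", "s"], {"d"}),
]

print(k_sweep(toy_cases))  # {2: 0.25, 4: 0.5, 6: 0.75}
```

Hit rate can only stay flat or rise as k grows, which matches the 73%, 82%, 100% progression above. The trade-off is that a larger k also passes more, possibly irrelevant, context to the generator.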

Per-question results

Question	Hit	Refusal	Relevance	Score
What is Mushaim's most impressive project?	—	✓	50%	0.71
Has he built AI agents or multi-agent systems?	✓	✓	100%	0.68
Tell me about the anomaly detection project.	✓	✓	100%	0.71
What accuracy did Scribe achieve?	✓	✓	100%	0.7
Is he open to remote work?	✓	✓	100%	0.75
Will he relocate or need visa sponsorship?	✓	✓	100%	0.7
What is his strongest skill?	—	✓	100%	0.64
What's his experience level?	✓	✓	100%	0.65
How does Vaultly get bank data without an API?	✓	✓	100%	0.7
What does MeetSync do?	✓	✓	100%	0.74
How can I contact him?	✓	✓	100%	0.58
What is his favorite movie?	✓	✓	0%	0.56
Does he know how to fly a plane?	✓	✓	0%	0.59
What's his cryptocurrency portfolio?	✓	✓	0%	0.63
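The headline figures can be cross-checked against the rows above. The sketch below transcribes the Hit and relevance columns and averages them two ways. It assumes the last three questions (favorite movie, flying a plane, cryptocurrency) are the out-of-scope cases. That split is an inference from the questions themselves, not something the page states, and the Score column is left out because the page does not define it.

```python
# (hit, relevance_pct) per row, transcribed from the table above.
# Assumption: the first 11 questions are in-scope, the last 3 are out-of-scope.
in_scope = [(False, 50)] + [(True, 100)] * 5 + [(False, 100)] + [(True, 100)] * 4
out_of_scope = [(True, 0)] * 3
all_rows = in_scope + out_of_scope


def mean(values):
    return sum(values) / len(values)


hit_all = round(100 * mean([hit for hit, _ in all_rows]))        # 12 of 14
hit_in_scope = round(100 * mean([hit for hit, _ in in_scope]))   # 9 of 11
relevance_in_scope = round(mean([rel for _, rel in in_scope]))   # 1050 / 11
relevance_all = round(mean([rel for _, rel in all_rows]))        # 1050 / 14

print(hit_all, hit_in_scope, relevance_in_scope, relevance_all)  # 86 82 95 75
```

Under that assumption, the 86% headline hit@4 corresponds to all 14 rows, the 82% shown for k = 4 in the sweep corresponds to the 11 in-scope rows, and the 95% relevance corresponds to the in-scope rows only. This is one reading that reproduces the published numbers, not a confirmed description of how the page computes them.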