Hippo Evaluation Scorecard

Corpus

Summary

Metric            Mean    Median
recall@k          0.210   0.000
mrr               0.213   0.000
ndcg@k            0.167   0.000
source_diversity  0.890   0.971
coverage_gap      0.000   0.000
groundedness      -       -
keyword_hit_rate  0.000   0.000
latency_ms_p50    4918.6  -
latency_ms_p95    5037.3  -
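
The scorecard names its ranking metrics without defining them. As a rough sketch only, assuming binary relevance and the textbook definitions, recall@k, reciprocal rank (averaged over questions to give mrr), and ndcg@k could be computed per question as shown here. The function names and the `ranked` / `relevant` inputs are hypothetical and are not taken from Hippo's evaluator.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-gain nDCG: DCG of the top k divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical example: four retrieved sources, three relevant ones.
ranked = ["a", "b", "c", "d"]
relevant = {"b", "d", "x"}
r_at_3 = recall_at_k(ranked, relevant, k=3)   # one of three relevant items found
rr = reciprocal_rank(ranked, relevant)        # first hit is at rank 2
n_at_3 = ndcg_at_k(ranked, relevant, k=3)
```

Under these definitions a question scores 0 on all three metrics whenever no relevant source appears in the top k, which would be consistent with a median of 0.000 alongside a nonzero mean.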

Stratified by enrichment_model

model                   n   mean recall@k  mean gap  mean ground
google/gemma-4-26b-a4b  8   0.274          0.000     -
google/gemma-4-31b      39  0.210          0.000     -
gpt-oss-120b            16  0.241          0.000     -
gpt-oss-120b-mlx-crack  20  0.280          0.000     -
qwen/qwen3.6-27b        1   -              0.000     -
qwen3.5-35b-a3b         20  0.178          0.000     -
qwen3.6-35b-a3b-ud-mlx  37  0.211          0.000     -

Caveats

Per-question

id   intent         top    gap    div    ground  kw  degraded
q01  why-decision   1.000  0.000  1.000  -       -   -
q02  how-it-works   1.000  0.000  1.000  -       -   -
q03  why-decision   1.000  0.000  1.000  -       -   -
q04  how-it-works   1.000  0.000  0.971  -       -   -
q05  state-lookup   1.000  0.000  1.000  -       -   -
q06  why-decision   1.000  0.000  0.783  -       -   -
q07  how-it-works   1.000  0.000  0.971  -       -   -
q08  state-lookup   1.000  0.000  1.000  -       -   -
q09  how-it-works   1.000  0.000  0.971  -       -   -
q10  how-it-works   1.000  0.000  0.971  -       -   -
q11  why-decision   1.000  0.000  1.000  -       -   -
q12  how-it-works   1.000  0.000  0.000  -       -   -
q13  state-lookup   1.000  0.000  0.783  -       -   -
q14  state-lookup   1.000  0.000  0.881  -       -   -
q15  why-decision   1.000  0.000  0.971  -       -   -
q16  state-lookup   1.000  0.000  0.881  -       -   -
q17  why-decision   1.000  0.000  0.971  -       -   -
q18  how-it-works   1.000  0.000  1.000  -       -   -
q19  state-lookup   1.000  0.000  0.469  -       -   -
q20  how-it-works   1.000  0.000  0.971  -       -   -
q21  why-decision   1.000  0.000  1.000  -       -   -
q22  adversarial    1.000  0.000  0.881  -       -   -
q23  why-decision   1.000  0.000  0.971  -       -   -
q24  how-it-works   1.000  0.000  0.971  -       -   -
q25  how-it-works   1.000  0.000  0.971  -       -   -
q26  cross-source   1.000  0.000  0.971  -       -   -
q27  how-it-works   1.000  0.000  0.722  -       -   -
q28  why-decision   1.000  0.000  1.000  -       -   -
q29  how-it-works   1.000  0.000  0.722  -       -   -
q30  why-decision   1.000  0.000  1.000  -       -   -
q31  adversarial    1.000  0.000  0.722  -       -   -
q32  how-it-works   1.000  0.000  0.971  -       -   -
q33  why-decision   1.000  0.000  0.971  -       -   -
q34  state-lookup   1.000  0.000  0.971  -       -   -
q35  adversarial    1.000  0.000  0.881  -       -   -
q36  state-lookup   1.000  0.000  0.971  -       -   -
q37  cross-source   1.000  0.000  0.971  -       -   -
q38  adversarial    1.000  0.000  0.881  -       -   -
q39  state-lookup   1.000  0.000  0.469  -       -   -
q40  state-lookup   1.000  0.000  0.971  -       -   -

Coverage gaps (weakest 10 questions)