FTS5 phrase-wrap (R-03): lexical mode wraps a multi-word query in a single phrase, so recall on long natural-language questions is pathologically low. This is a query-construction effect, not a ranking bug.
RRF normalization (R-07): hybrid scores are normalized to top=1.0 per query. Absolute score thresholds are not comparable across queries.
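The following sketch illustrates why a per-query top of 1.0 carries no absolute meaning. It uses the standard reciprocal rank fusion formula; the constant `k=60` and the function names are assumptions for illustration, not hippo's actual values.

```python
def rrf_fuse(rankings, k=60):
    """Sum 1 / (k + rank) over every ranked list a document appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores

def normalize_top(scores):
    """Rescale so the best document of this query scores exactly 1.0."""
    if not scores:
        return {}
    top = max(scores.values())
    return {doc_id: s / top for doc_id, s in scores.items()}

# Query A: both retrievers agree on the best document.
raw_a = rrf_fuse([["a", "b", "c"], ["a", "c", "d"]])
# Query B: the lexical retriever returned nothing at all.
raw_b = rrf_fuse([["x", "y"], []])

print(max(raw_a.values()), max(raw_b.values()))  # raw tops differ
print(max(normalize_top(raw_a).values()), max(normalize_top(raw_b).values()))
```

After normalization both queries report a top of 1.0 even though query B had support from only one retriever, so a fixed cutoff such as "score > 0.8" means something different for each query.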
vec0 brute-force (R-02): there is no ANN index on knowledge_vectors, so every vector query is a full scan and latency is O(N). The hybrid≥LanceDB result will stop holding once the corpus grows well past ~2K nodes.
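As a rough illustration of what a scan without an ANN index implies, the sketch below computes the distance from the query to every stored vector before picking the top k. It is a pure-Python stand-in, not sqlite-vec's implementation; the cosine metric and the `(node_id, vector)` row shape are assumptions.

```python
import heapq
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat a zero vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def brute_force_knn(query, rows, k):
    """Visit every row once: the work grows linearly with the corpus size."""
    scored = ((cosine_distance(query, vec), node_id) for node_id, vec in rows)
    return heapq.nsmallest(k, scored)

rows = [("n1", [1.0, 0.0]), ("n2", [0.0, 1.0]), ("n3", [0.9, 0.1])]
for distance, node_id in brute_force_knn([1.0, 0.0], rows, k=2):
    print(node_id, round(distance, 4))
```

Doubling the number of rows doubles the number of distance computations per query, which is the scaling behaviour the caveat refers to.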
events.git_repo is NULL across the live v5 corpus, so project filtering silently falls back to cwd-prefix. Low recall on project-filtered queries is a data bug, not a retrieval bug.
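A small sketch of how such a fallback can lose events. The function `matches_project`, the `cwd` field, and the example paths are hypothetical; only `git_repo` being NULL comes from the note above.

```python
def matches_project(event, project_repo, project_root):
    """Prefer git_repo; fall back to a cwd prefix test when it is NULL."""
    repo = event.get("git_repo")
    if repo is not None:
        return repo == project_repo
    cwd = event.get("cwd") or ""
    root = project_root.rstrip("/")
    # Require a path boundary so /x/hippo does not match /x/hippo-old.
    return cwd == root or cwd.startswith(root + "/")

events = [
    {"git_repo": None, "cwd": "/home/u/projects/hippo/crates"},
    {"git_repo": None, "cwd": "/tmp/scratch"},  # project work done elsewhere
    {"git_repo": "hippo", "cwd": "/tmp/scratch"},  # what populated data allows
]
for event in events:
    print(event, matches_project(event, "hippo", "/home/u/projects/hippo"))
```

With `git_repo` unset, the second event cannot be attributed to the project no matter how good retrieval is, which is why the low recall is classified as a data bug.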
Branch corpus coverage: on the postgres branch only ~1.7% of events have knowledge-node coverage (vs ~13.4% on main). Labels are drawn from the main-hippo corpus until backfill runs.
Per-question
| id | intent | top | gap | div | ground | kw | degraded |
|---|---|---|---|---|---|---|---|
| q01 | why-decision | 1.000 | 0.000 | 1.000 | — | ✗ | |
| q02 | how-it-works | 1.000 | 0.000 | 1.000 | — | ✗ | |
| q03 | why-decision | 1.000 | 0.000 | 1.000 | — | ✗ | |
| q04 | how-it-works | 1.000 | 0.000 | 0.971 | — | ✗ | |
| q05 | state-lookup | 1.000 | 0.000 | 1.000 | — | ✗ | |
| q06 | why-decision | 1.000 | 0.000 | 0.783 | — | ✗ | |
| q07 | how-it-works | 1.000 | 0.000 | 0.971 | — | ✗ | |
| q08 | state-lookup | 1.000 | 0.000 | 1.000 | — | ✗ | |
| q09 | how-it-works | 1.000 | 0.000 | 0.971 | — | ✗ | |
| q10 | how-it-works | 1.000 | 0.000 | 0.971 | — | ✗ | |
| q11 | why-decision | 1.000 | 0.000 | 1.000 | — | ✗ | |
| q12 | how-it-works | 1.000 | 0.000 | 0.000 | — | ✗ | |
| q13 | state-lookup | 1.000 | 0.000 | 0.783 | — | ✗ | |
| q14 | state-lookup | 1.000 | 0.000 | 0.881 | — | ✗ | |
| q15 | why-decision | 1.000 | 0.000 | 0.971 | — | ✗ | |
| q16 | state-lookup | 1.000 | 0.000 | 0.881 | — | ✗ | |
| q17 | why-decision | 1.000 | 0.000 | 0.971 | — | ✗ | |
| q18 | how-it-works | 1.000 | 0.000 | 1.000 | — | ✗ | |
| q19 | state-lookup | 1.000 | 0.000 | 0.469 | — | ✗ | |
| q20 | how-it-works | 1.000 | 0.000 | 0.971 | — | ✗ | |
| q21 | why-decision | 1.000 | 0.000 | 1.000 | — | ✗ | |
| q22 | adversarial | 1.000 | 0.000 | 0.881 | — | ✗ | |
| q23 | why-decision | 1.000 | 0.000 | 0.971 | — | ✗ | |
| q24 | how-it-works | 1.000 | 0.000 | 0.971 | — | ✗ | |
| q25 | how-it-works | 1.000 | 0.000 | 0.971 | — | ✗ | |
| q26 | cross-source | 1.000 | 0.000 | 0.971 | — | ✗ | |
| q27 | how-it-works | 1.000 | 0.000 | 0.722 | — | ✗ | |
| q28 | why-decision | 1.000 | 0.000 | 1.000 | — | ✗ | |
| q29 | how-it-works | 1.000 | 0.000 | 0.722 | — | ✗ | |
| q30 | why-decision | 1.000 | 0.000 | 1.000 | — | ✗ | |
| q31 | adversarial | 1.000 | 0.000 | 0.722 | — | ✗ | |
| q32 | how-it-works | 1.000 | 0.000 | 0.971 | — | ✗ | |
| q33 | why-decision | 1.000 | 0.000 | 0.971 | — | ✗ | |
| q34 | state-lookup | 1.000 | 0.000 | 0.971 | — | ✗ | |
| q35 | adversarial | 1.000 | 0.000 | 0.881 | — | ✗ | |
| q36 | state-lookup | 1.000 | 0.000 | 0.971 | — | ✗ | |
| q37 | cross-source | 1.000 | 0.000 | 0.971 | — | ✗ | |
| q38 | adversarial | 1.000 | 0.000 | 0.881 | — | ✗ | |
| q39 | state-lookup | 1.000 | 0.000 | 0.469 | — | ✗ | |
| q40 | state-lookup | 1.000 | 0.000 | 0.971 | — | ✗ | |
Coverage gaps (weakest 10 questions)
q01 (gap=0.000): Why did we replace LanceDB with sqlite-vec?
q02 (gap=0.000): How does the enrichment pipeline run shell, claude, and browser sources concurrently?
q03 (gap=0.000): What fixed the Claude session hook PID problem?
q04 (gap=0.000): How does the Firefox browser source send events to the daemon?
q05 (gap=0.000): What runs in the OTel observability stack locally?
q06 (gap=0.000): What is the north star vision for hippo?
q07 (gap=0.000): How are secrets redacted before storage?
q08 (gap=0.000): What is the current schema version and what changed in v6?
q09 (gap=0.000): How does hybrid retrieval combine vector and lexical scores?
q10 (gap=0.000): What does a degraded-mode response from rag.ask look like?