Baseline: 2026-04-29 pre-v3 vs v3-partial

A snapshot of intrinsic enrichment quality (5-expert panel) and end-to-end retrieval (hippo-eval), captured before the qwen3.6-35b-a3b-ud-mlx re-enrichment job touches the labeled gold corpus.

This is the before half of an A/B comparison. To get the after, re-run the panel + hippo-eval after the re-enrich job completes (or after force-prioritizing the 70 gold uuids).

When this was captured

What’s in this directory

Panel artifacts (intrinsic node quality)

Hippo-eval artifacts (end-to-end retrieval)

Reproduction scripts

Headline numbers

Panel (1–5 Likert, n=100)

dimensionmeanstdev across 5 experts
accuracy4.590.26
succinctness4.100.64
usefulness3.800.91
ask_suitability3.980.72
mcp_suitability4.000.47

Hippo-eval (40 questions, hybrid k=10)

metricmeanmedian
recall@100.2100.000
mrr0.2130.000
ndcg@100.1670.000
source_diversity0.8900.971
coverage_gap0.0000.000 (broken in hybrid mode — see caveats)

13/40 questions have any gold-uuid hit; 15/40 have recall@10 = 0; remaining 12 are adversarial-with-empty-gold.

How to compare after re-enrich finishes

Option A: same nodes, after corpus is fully v3

# 1. Confirm the 100 dossier uuids are now at v3
python3 - <<'PY'
import json, sqlite3
from pathlib import Path
uuids = [json.loads(l)["uuid"] for l in open("dossier.jsonl")]
conn = sqlite3.connect(Path.home() / ".local/share/hippo/hippo.db")
ph = ",".join("?"*len(uuids))
rows = conn.execute(
    f"SELECT enrichment_version, COUNT(*) FROM knowledge_nodes "
    f"WHERE uuid IN ({ph}) GROUP BY enrichment_version", uuids
).fetchall()
print(rows)
PY

# 2. Re-build dossier from current DB (same uuids, fresh content)
python3 build_dossier.py  # edit to read uuids from existing dossier.jsonl

# 3. Re-run the 5-expert panel against the new dossier
#    (use the agent dispatch pattern from the original session)

# 4. Re-aggregate
python3 aggregate.py

Option B: hippo-eval re-run

# The labeled gold uuids are stable; hippo-eval just runs against current DB.
uv run --project brain hippo-eval --no-synthesis --no-judge --out v3-after.md
# For groundedness + keyword_hit (slower):
uv run --project brain hippo-eval --out v3-after-full.md

Option C: force-prioritize gold uuids (fastest path to v3-after)

# Extract the 70 gold uuids
python3 - <<'PY'
import json
qs = json.load(open("../../../brain/tests/eval_questions.json"))
data = qs.get("questions", qs) if isinstance(qs, dict) else qs
golds = sorted({u for q in data for u in q.get("relevant_knowledge_node_uuids", [])})
print("\n".join(golds))
PY > gold_uuids.txt

# Modify re-enrich-knowledge-nodes.py to accept --uuids gold_uuids.txt
# OR raise the priority of those 70 nodes in the candidate query.
# Then re-run hippo-eval.

Cross-cutting findings (from the panel — for v3 evaluation criteria)

The “after” comparison should look for these specific signals to confirm whether v3 actually fixed each:

  1. Render plumbing: 66/100 nodes have orphan key_decisions/problems_encountered. Fix is in brain/src/hippo_brain/rag.py::_hit_lines, NOT in re-enrichment. Re-running won’t change this.
  2. Hallucinated env_vars/versions: 8/100 nodes. Should drop with a verbatim-validator post-LLM, NOT with re-enrichment alone. Re-running may not fix.
  3. Worktree-prefix leakage: 5–6/100 nodes. Should drop only if upsert_entities is fixed; re-enrichment alone won’t strip already-written rows.
  4. Long-session coverage drop: median 0.53 of source identifiers survive. Should drop with per-segment chunking.
  5. Tool-bucket pollution: 19/100 nodes have cargo clippy instead of cargo. Fix in upsert_entities.
  6. Filler-opening summaries: 10/100 nodes. Could improve with prompt tuning.
  7. Empty design_decisions when alternatives weighed: 6/100 nodes. Could improve with prompt tuning.

For each, “v3-after - v3-pre” delta on the relevant counts is the actionable comparison.

Known issues with hippo-eval (caveats from the harness itself)

Why this lives on PR #102

PR #102 ships hippo-bench-v2 — a benchmark framework. This baseline is what such a framework would consume as the before snapshot. Saving it here means: