Hippo v3 re-enrichment — expert panel scorecard
Sample: 100 stratified nodes from 2,256 currently-v3 nodes (see dossier.jsonl).
Model: qwen3.6-35b-a3b-ud-mlx.
Panel: 5 experts, independent, same dossier.
Date: 2026-04-29.
Per-expert means
| expert | accuracy | succinctness | usefulness | ask | mcp | overall |
|---|---|---|---|---|---|---|
| enrichment | 4.74 | 4.60 | 4.59 | 3.79 | 4.42 | 4.43 |
| vector | 4.26 | 3.15 | 2.46 | 3.23 | 3.31 | 3.28 |
| schema | 4.82 | 4.51 | 4.38 | 4.94 | 4.43 | 4.62 |
| rag | 4.36 | 3.73 | 3.26 | 3.44 | 4.02 | 3.76 |
| mcp | 4.78 | 4.53 | 4.33 | 4.50 | 3.80 | 4.39 |
| panel mean | 4.59 | 4.10 | 3.80 | 3.98 | 4.00 | 4.10 |
Cross-expert stdev per dim: accuracy 0.26, succinctness 0.64, usefulness 0.91, ask 0.72, mcp 0.47.
Inter-rater agreement
| dim | tight (Δ≤1) | medium (Δ=2) | wide (Δ≥3) |
|---|---|---|---|
| accuracy | 57 | 39 | 4 |
| succinctness | 19 | 36 | 45 |
| usefulness | 12 | 38 | 50 |
| ask_suitability | 34 | 16 | 50 |
| mcp_suitability | 26 | 31 | 43 |
The panel agrees on accuracy. They diverge widely on usefulness, ask, and mcp — expected, because each expert weights these against their lens. The wide spread on usefulness is mostly the vector specialist scoring against identifier density, while the schema/mcp specialists score against structural utility.
Worst 10 nodes (consensus)
| # | uuid | source | stratum | overall |
|---|---|---|---|---|
| 1 | 25b32204… | claude | long_content | 3.16 |
| 2 | 6dd039b8… | claude | long_content | 3.16 |
| 3 | b63c3f9b… | claude | short_embed_text | 3.28 |
| 4 | b239a21e… | shell | shell_random | 3.28 |
| 5 | c1260596… | claude | claude_random | 3.40 |
| 6 | 1b1ac570… | shell | shell_random | 3.40 |
| 7 | 87976564… | claude | claude_random | 3.44 |
| 8 | 5bbbc30a… | dual | short_embed_text | 3.48 |
| 9 | 432b32b2… | dual | dual_source | 3.48 |
| 10 | da979c70… | shell | shell_random | 3.52 |
25b32204… and 6dd039b8… are the duplicated-enrichment pair the enrichment
expert flagged. Both ironically score high on MCP search-input quality
(identifier-dense embed_text) and low on accuracy + ask suitability
(fabricated env_vars/flags + orphaned key_decisions).
Mean by stratum + source
| stratum | n | mean overall |
|---|---|---|
| topup_random | 2 | 4.44 |
| claude_random | 49 | 4.30 |
| long_content | 10 | 4.04 |
| shell_random | 24 | 3.90 |
| short_embed_text | 10 | 3.75 |
| dual_source | 5 | 3.66 |
| source | n | mean |
|---|---|---|
| claude | 67 | 4.23 |
| shell | 26 | 3.89 |
| dual | 7 | 3.62 |
Dual-source nodes (re-enriched twice — once from shell side, once from claude side, second pass overwriting) score worst on average.
Cross-cutting findings (themes appearing in ≥3 expert summaries)
1. Worktree-prefix leakage in path-typed entity names
Experts: enrichment, schema, mcp.
Evidence: schema counted 6/100 violations; mcp’s drift table shows ~17
distinct files that exist in BOTH worktree-prefixed and clean forms across
the corpus (crates/hippo-daemon/src/claude_session.rs appears in 14 nodes
with 3 distinct surface forms). The v3 prompt rule 5 told the model to strip
these; it doesn’t always.
Fix path: unconditional strip_worktree_prefix at enrichment-write time
inside upsert_entities on path-typed entities, plus a one-shot DB pass to
clean already-written rows. The entities.canonical column exists for this
and is largely unused.
2. Hallucinated env_vars and version strings
Experts: enrichment, schema (env_var case bucket), rag (orphaned data).
Evidence: 8/100 nodes have at least one env_var or semver fabricated
(model adds CARGO_HOME, PATH to Rust/Cargo work; invents 1.93.1,
0.149.0, 0.2.0 for release-flavored sessions). Most damaging failure mode
because it cannot be detected by retrieval.
Fix path: post-LLM verbatim-validator that rejects entity tokens absent
from source rows. Cheap to implement; would catch every flagged case.
3. Render plumbing leaves substantive content stranded
Experts: rag (top weakness), enrichment (long-session coverage), mcp
(answer hand-waviness).
Evidence: 66/100 nodes have populated key_decisions and/or
problems_encountered content that brain/src/hippo_brain/rag.py::_hit_lines
silently drops. 5 of these are extreme (≥600 chars + ≥5 unique identifiers
not visible in summary or embed_text). The synthesizing LLM never sees this
content, regardless of enrichment quality.
Fix path: add Decisions: and Problems: render branches in
rag.py::_hit_lines under the existing proportional truncation. This is the
single biggest leverage available — improves hippo ask materially without
re-enriching anything.
4. Long-session coverage drop
Experts: vector, enrichment.
Evidence: median surviving fraction of source identifiers in embed_text
is 0.53 (p10 = 0.25). On long Claude sessions with 60+ identifiers, the model
picks ~25 and silently drops the rest — often the deepest file paths users
will query for. embed_text length plateaus while content_len grows past 4000.
Fix path: per-segment chunking before enrichment for long Claude sessions
(re-enrich script can split a session whose content_len > 4000 into
multiple LLM calls and merge entity buckets).
5. Tool-bucket pollution with shell-invocation phrases
Experts: schema (top fault — 19/100 nodes), mcp (drift list).
Evidence: cargo clippy, git log, uv run --project ... get stored as
single tool-entity rows instead of being normalized to bare command names
(cargo, git, uv). This defeats cross-node dedup, inflates
get_entities(type='tool') cardinality, and splits hybrid-search ranking.
Fix path: normalize tool entities at upsert_entities time — split on
whitespace, take first token. A 5-line change to enrichment.py.
6. Filler-opening summaries that burn the 120-char RAG budget
Experts: rag (10/100), mcp (worst-for-ask uuids), enrichment (low-content nodes). Evidence: “The user requested…”, “Conducted a comprehensive…” lead 10/100 summaries; after 120-char truncation the synthesizer sees no concrete artifact. Fix path: prompt fix in the enricher: “Lead the summary with a concrete verb + artifact, never with subject-first prose.”
7. Empty design_decisions when source weighed alternatives
Experts: rag (6/100), enrichment.
Evidence: alternative-weighing language (“instead of”, “considered”,
“rather than”) in source text but design_decisions: [] in output. “Why did
I pick X?” questions return nothing useful from these nodes.
Fix path: prompt strengthening with positive examples; or a post-LLM
detector that re-prompts when the source contains alternative-weighing
phrases but the output emits an empty list.
Notable duplication finding (mentioned in plan)
The re-enrichment script’s UNION query in _select_candidate_nodes produces
a _source='shell' row AND a _source='claude' row for any node linked to
both event types. The script processes both, with the second pass
overwriting the first. Evidence: 25b32204… and 6dd039b8… are produced
by this path and score worst on the panel; node 6463 in the log was
re-enriched twice ((claude) then (shell)). Dual-source stratum in this
sample averages 3.62 — the worst by source. Worth a follow-up fix: pick the
better source for dual-linked nodes (probably claude if claude_segments
are populated, since they carry richer narrative) and skip the second pass.
What this scorecard is NOT
- Not a substitute for
hippo-eval(brain/src/hippo_brain/evaluation.py), which scores end-to-end retrieval recall/MRR against a labeled Q/A set. Different question, different methodology. Run that next; this panel is intrinsic-quality, that one is task-success. - Not a recommendation to roll back v3. The corpus is meaningfully better than what came before; the issues above are mostly post-LLM plumbing, not the model itself.
Outputs on disk
/tmp/hippo-eval-panel/dossier.jsonl— 100-node sample/tmp/hippo-eval-panel/RUBRIC.md— what the panel scored against/tmp/hippo-eval-panel/scores_<expert>.jsonl(5 files, 100 rows each)/tmp/hippo-eval-panel/summary_<expert>.md(4 of 5; schema expert returned summary inline only)/tmp/hippo-eval-panel/panel_scorecard.jsonl— per-node aggregate (panel mean + each expert’s scores + each expert’s note)/tmp/hippo-eval-panel/FINAL_REPORT.md— this file