Hippo v3 re-enrichment — expert panel scorecard

Sample: 100 stratified nodes from 2,256 currently-v3 nodes (see dossier.jsonl). Model: qwen3.6-35b-a3b-ud-mlx. Panel: 5 experts, independent, same dossier. Date: 2026-04-29.

Per-expert means

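| expert | accuracy | succinctness | usefulness | ask | mcp | overall |
|---|---|---|---|---|---|---|
| enrichment | 4.74 | 4.60 | 4.59 | 3.79 | 4.42 | 4.43 |
| vector | 4.26 | 3.15 | 2.46 | 3.23 | 3.31 | 3.28 |
| schema | 4.82 | 4.51 | 4.38 | 4.94 | 4.43 | 4.62 |
| rag | 4.36 | 3.73 | 3.26 | 3.44 | 4.02 | 3.76 |
| mcp | 4.78 | 4.53 | 4.33 | 4.50 | 3.80 | 4.39 |
| panel mean | 4.59 | 4.10 | 3.80 | 3.98 | 4.00 | 4.10 |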

Cross-expert stdev per dim: accuracy 0.26, succinctness 0.64, usefulness 0.91, ask 0.72, mcp 0.47.
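
The stdev figures above are consistent with a sample (n-1) standard deviation taken across the five per-expert means. A minimal Python sketch that recomputes the panel means and the spread from the published per-expert means (not from the raw per-node scores, which are not reproduced here):

```python
from statistics import mean, stdev

# Per-expert means copied from the table above.
scores = {
    "enrichment": {"accuracy": 4.74, "succinctness": 4.60, "usefulness": 4.59, "ask": 3.79, "mcp": 4.42},
    "vector":     {"accuracy": 4.26, "succinctness": 3.15, "usefulness": 2.46, "ask": 3.23, "mcp": 3.31},
    "schema":     {"accuracy": 4.82, "succinctness": 4.51, "usefulness": 4.38, "ask": 4.94, "mcp": 4.43},
    "rag":        {"accuracy": 4.36, "succinctness": 3.73, "usefulness": 3.26, "ask": 3.44, "mcp": 4.02},
    "mcp":        {"accuracy": 4.78, "succinctness": 4.53, "usefulness": 4.33, "ask": 4.50, "mcp": 3.80},
}
dims = ["accuracy", "succinctness", "usefulness", "ask", "mcp"]

# Mean of the five expert means per dimension.
panel_mean = {d: round(mean(s[d] for s in scores.values()), 2) for d in dims}

# statistics.stdev is the sample standard deviation (divides by n - 1).
cross_expert_stdev = {d: round(stdev(s[d] for s in scores.values()), 2) for d in dims}

print(panel_mean)
print(cross_expert_stdev)
# {'accuracy': 0.26, 'succinctness': 0.64, 'usefulness': 0.91, 'ask': 0.72, 'mcp': 0.47}
```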

Inter-rater agreement

| dim | tight (Δ≤1) | medium (Δ=2) | wide (Δ≥3) |
|---|---|---|---|
| accuracy | 57 | 39 | 4 |
| succinctness | 19 | 36 | 45 |
| usefulness | 12 | 38 | 50 |
| ask_suitability | 34 | 16 | 50 |
| mcp_suitability | 26 | 31 | 43 |

The panel agrees on accuracy but diverges widely on usefulness, ask, and mcp. This is expected, because each expert weights these dimensions through their own lens. The wide spread on usefulness comes mostly from the vector specialist scoring against identifier density while the schema and mcp specialists score against structural utility.
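
The bucketing in the agreement table can be sketched as follows. This assumes Δ is the difference between the highest and lowest score the five experts gave a node on one dimension, and that scores are integers; neither is stated explicitly above. The node names and scores in the example are hypothetical.

```python
from collections import Counter

def spread_bucket(scores: list[int]) -> str:
    """Classify one node's scores on one dimension by max-min spread."""
    delta = max(scores) - min(scores)  # ASSUMPTION: delta = max - min across experts
    if delta <= 1:
        return "tight"
    if delta == 2:
        return "medium"
    return "wide"

# Hypothetical scores from the five experts for three nodes on one dimension.
example = {
    "node-a": [5, 5, 4, 5, 4],
    "node-b": [5, 3, 4, 5, 4],
    "node-c": [5, 1, 4, 3, 5],
}

counts = Counter(spread_bucket(s) for s in example.values())
print(counts)
```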

Worst 10 nodes (consensus)

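| # | uuid | source | stratum | overall |
|---|---|---|---|---|
| 1 | 25b32204… | claude | long_content | 3.16 |
| 2 | 6dd039b8… | claude | long_content | 3.16 |
| 3 | b63c3f9b… | claude | short_embed_text | 3.28 |
| 4 | b239a21e… | shell | shell_random | 3.28 |
| 5 | c1260596… | claude | claude_random | 3.40 |
| 6 | 1b1ac570… | shell | shell_random | 3.40 |
| 7 | 87976564… | claude | claude_random | 3.44 |
| 8 | 5bbbc30a… | dual | short_embed_text | 3.48 |
| 9 | 432b32b2… | dual | dual_source | 3.48 |
| 10 | da979c70… | shell | shell_random | 3.52 |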

25b32204… and 6dd039b8… are the duplicated-enrichment pair that the enrichment expert flagged. Ironically, both score high on MCP search-input quality (identifier-dense embed_text) and low on accuracy and ask suitability (fabricated env_vars/flags and orphaned key_decisions).

Mean by stratum + source

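| stratum | n | mean overall |
|---|---|---|
| topup_random | 2 | 4.44 |
| claude_random | 49 | 4.30 |
| long_content | 10 | 4.04 |
| shell_random | 24 | 3.90 |
| short_embed_text | 10 | 3.75 |
| dual_source | 5 | 3.66 |

| source | n | mean |
|---|---|---|
| claude | 67 | 4.23 |
| shell | 26 | 3.89 |
| dual | 7 | 3.62 |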

Dual-source nodes (re-enriched twice — once from shell side, once from claude side, second pass overwriting) score worst on average.

Cross-cutting findings (themes appearing in ≥3 expert summaries)

1. Worktree-prefix leakage in path-typed entity names

Experts: enrichment, schema, mcp.

Evidence: schema counted 6/100 violations; mcp’s drift table shows ~17 distinct files that exist in both worktree-prefixed and clean forms across the corpus (crates/hippo-daemon/src/claude_session.rs appears in 14 nodes with 3 distinct surface forms). Rule 5 of the v3 prompt tells the model to strip these prefixes, but it does not always comply.

Fix path: apply strip_worktree_prefix unconditionally at enrichment-write time, inside upsert_entities, to path-typed entities, plus a one-shot DB pass to clean already-written rows. The entities.canonical column exists for this purpose and is largely unused.
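
A minimal sketch of what an unconditional strip could look like. The actual worktree directory layout is not given above, so the pattern below is an assumption (checkouts under a `worktrees/<name>/` or `.worktrees/<name>/` segment), and the prefixed example paths are hypothetical.

```python
import re

# ASSUMPTION: worktree checkouts live under ".../worktrees/<name>/" or
# ".../.worktrees/<name>/". Adjust the pattern to the real layout.
_WORKTREE_PREFIX = re.compile(r"^(?:.*/)?\.?worktrees/[^/]+/")

def strip_worktree_prefix(path: str) -> str:
    """Return the repo-relative path with any leading worktree prefix removed."""
    return _WORKTREE_PREFIX.sub("", path, count=1)

# Hypothetical surface forms of the same file.
surface_forms = [
    "crates/hippo-daemon/src/claude_session.rs",
    ".worktrees/feat-x/crates/hippo-daemon/src/claude_session.rs",
    "/home/dev/hippo/.claude/worktrees/fix-y/crates/hippo-daemon/src/claude_session.rs",
]

canonical = {strip_worktree_prefix(p) for p in surface_forms}
print(canonical)
```

Under this assumption all three forms collapse to one canonical value, which could then be written to entities.canonical. A repo that legitimately contains a directory named worktrees would be over-stripped by this pattern, so the real rule likely needs to be anchored more tightly.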

2. Hallucinated env_vars and version strings

Experts: enrichment, schema (env_var case bucket), rag (orphaned data).

Evidence: 8/100 nodes have at least one fabricated env_var or semver string (the model adds CARGO_HOME and PATH to Rust/Cargo work, and invents 1.93.1, 0.149.0, 0.2.0 for release-flavored sessions). This is the most damaging failure mode because retrieval cannot detect it.

Fix path: a post-LLM verbatim validator that rejects entity tokens absent from the source rows. It is cheap to implement and would catch every flagged case.
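
One possible shape for such a validator, as a sketch. The bucket name `versions`, the function names, and the example source text are hypothetical; only `env_vars` is named in the findings above.

```python
import re

def appears_verbatim(token: str, source_text: str) -> bool:
    """True if token occurs in source_text and is not part of a longer identifier."""
    pattern = r"(?<![A-Za-z0-9_])" + re.escape(token) + r"(?![A-Za-z0-9_])"
    return re.search(pattern, source_text) is not None

def validate_entities(entities, source_text, checked_buckets=("env_vars", "versions")):
    """Split entities into (kept, rejected); only checked_buckets are validated."""
    kept, rejected = {}, {}
    for bucket, tokens in entities.items():
        if bucket not in checked_buckets:
            kept[bucket] = list(tokens)
            continue
        kept[bucket] = [t for t in tokens if appears_verbatim(t, source_text)]
        dropped = [t for t in tokens if not appears_verbatim(t, source_text)]
        if dropped:
            rejected[bucket] = dropped
    return kept, rejected

# Hypothetical source text and LLM output.
source = "RUST_LOG=debug cargo build; bumped crate to 0.4.2"
llm_entities = {
    "env_vars": ["RUST_LOG", "CARGO_HOME"],
    "versions": ["0.4.2", "1.93.1"],
    "tools": ["cargo"],
}

kept, rejected = validate_entities(llm_entities, source)
print(rejected)  # {'env_vars': ['CARGO_HOME'], 'versions': ['1.93.1']}
```

With this boundary rule, a version written as v0.4.2 in the source would not match the bare token 0.4.2. Whether that should count as verbatim is a policy choice the real validator would have to make.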

3. Render plumbing leaves substantive content stranded

Experts: rag (top weakness), enrichment (long-session coverage), mcp (answer hand-waviness).

Evidence: 66/100 nodes have populated key_decisions and/or problems_encountered content that brain/src/hippo_brain/rag.py::_hit_lines silently drops. 5 of these are extreme (≥600 chars and ≥5 unique identifiers not visible in summary or embed_text). The synthesizing LLM never sees this content, regardless of enrichment quality.

Fix path: add Decisions: and Problems: render branches in rag.py::_hit_lines under the existing proportional truncation. This is the single biggest point of leverage available: it improves hippo ask materially without re-enriching anything.
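
The signature and truncation logic of _hit_lines are not shown here, so the following is only an illustration of the idea, not the existing code. The helper name, the `budget` parameter, and the proportional rule are all assumptions.

```python
def render_decisions_and_problems(hit: dict, budget: int = 240) -> list[str]:
    """Render 'Decisions:' / 'Problems:' lines that share `budget` characters.

    ASSUMPTION: each field gets a share of the budget proportional to its
    length; the real proportional truncation in rag.py may differ.
    """
    fields = [
        ("Decisions", hit.get("key_decisions") or []),
        ("Problems", hit.get("problems_encountered") or []),
    ]
    texts = [(label, "; ".join(items)) for label, items in fields if items]
    total = sum(len(text) for _, text in texts)

    lines = []
    for label, text in texts:
        if total <= budget:
            share = len(text)  # everything fits, no truncation
        else:
            share = max(1, budget * len(text) // total)
        lines.append(f"{label}: {text[:share]}")
    return lines

# Hypothetical hit with both fields populated, rendered under a tight budget.
hit = {"key_decisions": ["a" * 30], "problems_encountered": ["b" * 10]}
print(render_decisions_and_problems(hit, budget=10))
# ['Decisions: aaaaaaa', 'Problems: bb']
```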

4. Long-session coverage drop

Experts: vector, enrichment.

Evidence: the median surviving fraction of source identifiers in embed_text is 0.53 (p10 = 0.25). On long Claude sessions with 60+ identifiers, the model picks ~25 and silently drops the rest, often the deepest file paths that users will query for. embed_text length plateaus while content_len grows past 4000.

Fix path: per-segment chunking before enrichment for long Claude sessions. The re-enrich script can split a session whose content_len > 4000 into multiple LLM calls and merge the resulting entity buckets.
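
A sketch of the split-and-merge step under simple assumptions: a session is a list of text segments, chunks are packed greedily up to 4000 characters, and each LLM call returns a dict of entity buckets. The function names and bucket names are hypothetical, and the LLM call itself is left out.

```python
def split_segments(segments: list[str], max_chars: int = 4000) -> list[list[str]]:
    """Greedily pack consecutive segments into chunks of at most max_chars.

    A single segment longer than max_chars still gets a chunk of its own.
    """
    chunks, current, size = [], [], 0
    for seg in segments:
        if current and size + len(seg) > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(seg)
        size += len(seg)
    if current:
        chunks.append(current)
    return chunks

def merge_entity_buckets(results: list[dict[str, list[str]]]) -> dict[str, list[str]]:
    """Union per-chunk entity buckets, keeping first-seen order and dropping duplicates."""
    merged: dict[str, list[str]] = {}
    for result in results:
        for bucket, tokens in result.items():
            seen = merged.setdefault(bucket, [])
            for token in tokens:
                if token not in seen:
                    seen.append(token)
    return merged

chunks = split_segments(["a" * 3000, "b" * 2000, "c" * 1500])
print([sum(len(s) for s in chunk) for chunk in chunks])  # [3000, 3500]

merged = merge_entity_buckets([
    {"files": ["a.rs", "b.rs"], "tools": ["cargo"]},
    {"files": ["b.rs", "c.rs"]},
])
print(merged)  # {'files': ['a.rs', 'b.rs', 'c.rs'], 'tools': ['cargo']}
```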

5. Tool-bucket pollution with shell-invocation phrases

Experts: schema (top fault, 19/100 nodes), mcp (drift list).

Evidence: invocation phrases such as cargo clippy, git log, and uv run --project ... are stored as single tool-entity rows instead of being normalized to bare command names (cargo, git, uv). This defeats cross-node dedup, inflates get_entities(type='tool') cardinality, and splits hybrid-search ranking.

Fix path: normalize tool entities at upsert_entities time by splitting on whitespace and taking the first token. This is a 5-line change to enrichment.py.
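
The split-and-take-first-token rule can be sketched in a few lines. The function names are hypothetical; how this is wired into upsert_entities is not shown.

```python
def normalize_tool(name: str) -> str:
    """Reduce a shell-invocation phrase to its bare command name."""
    parts = name.split()
    return parts[0] if parts else ""

def normalize_tools(names: list[str]) -> list[str]:
    """Normalize and dedupe tool names, keeping first-seen order."""
    seen: list[str] = []
    for name in names:
        tool = normalize_tool(name)
        if tool and tool not in seen:
            seen.append(tool)
    return seen

print(normalize_tools(["cargo clippy", "git log", "uv run --project ...", "cargo"]))
# ['cargo', 'git', 'uv']
```

Taking the first token would mis-handle invocations with a leading wrapper or assignment (for example a phrase starting with sudo or VAR=value), so the real rule may need a small skip list.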

6. Filler-opening summaries that burn the 120-char RAG budget

Experts: rag (10/100), mcp (worst-for-ask uuids), enrichment (low-content nodes).

Evidence: openers such as “The user requested…” and “Conducted a comprehensive…” lead 10/100 summaries; after 120-char truncation the synthesizer sees no concrete artifact.

Fix path: a prompt fix in the enricher: “Lead the summary with a concrete verb + artifact, never with subject-first prose.”
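
To track whether the prompt fix works, a simple check for filler openers could be run over stored summaries. This is a measurement sketch only: the opener list contains just the two phrases quoted above, and the example summaries are hypothetical.

```python
# Openers quoted in the evidence above; extend as more are observed.
FILLER_OPENERS = ("the user requested", "conducted a comprehensive")

def has_filler_opening(summary: str, openers: tuple[str, ...] = FILLER_OPENERS) -> bool:
    """True if the summary starts with one of the known filler phrases."""
    return summary.lstrip().lower().startswith(openers)

# Hypothetical summaries.
summaries = [
    "The user requested a review of the daemon startup path.",
    "Fixed claude_session.rs to strip worktree prefixes before upsert.",
]
flagged = [s for s in summaries if has_filler_opening(s)]
print(len(flagged), "of", len(summaries), "summaries open with filler")
```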

7. Empty design_decisions when source weighed alternatives

Experts: rag (6/100), enrichment.

Evidence: the source text contains alternative-weighing language (“instead of”, “considered”, “rather than”) but the output has design_decisions: []. “Why did I pick X?” questions return nothing useful from these nodes.

Fix path: strengthen the prompt with positive examples, or add a post-LLM detector that re-prompts when the source contains alternative-weighing phrases but the output emits an empty list.
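
The detector variant could look roughly like this. The phrase list is limited to the three phrases quoted above, the function name is hypothetical, and the re-prompt itself is left out.

```python
import re

# Phrases quoted in the evidence above; a real list would likely be longer.
_ALTERNATIVE_WEIGHING = re.compile(r"\b(instead of|considered|rather than)\b", re.IGNORECASE)

def needs_design_decision_reprompt(source_text: str, enrichment: dict) -> bool:
    """True when the source weighs alternatives but design_decisions came back empty."""
    weighs_alternatives = _ALTERNATIVE_WEIGHING.search(source_text) is not None
    return weighs_alternatives and not enrichment.get("design_decisions")

# Hypothetical source text and enrichment output.
source = "Used a post-write pass rather than a trigger, to keep migrations simple."
print(needs_design_decision_reprompt(source, {"design_decisions": []}))  # True
```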

Notable duplication finding (mentioned in plan)

The re-enrichment script’s UNION query in _select_candidate_nodes produces a _source='shell' row and a _source='claude' row for any node linked to both event types. The script processes both, and the second pass overwrites the first. Evidence: 25b32204… and 6dd039b8… are produced by this path and score worst on the panel; node 6463 in the log was re-enriched twice (first as (claude), then as (shell)). Dual-source nodes in this sample average 3.62, the worst by source (the dual_source stratum averages 3.66, the worst by stratum). This is worth a follow-up fix: pick the better source for dual-linked nodes (probably claude if claude_segments are populated, since they carry richer narrative) and skip the second pass.
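
A sketch of the proposed selection rule, applied to candidate rows after the UNION query. The row shape is an assumption: `node_id` and the boolean `has_claude_segments` are hypothetical field names, and only `_source` appears in the description above.

```python
def pick_one_source(rows: list[dict]) -> list[dict]:
    """Keep exactly one candidate row per node.

    Prefers the claude row when its segments are populated, otherwise falls
    back to a non-claude row, otherwise to whatever row exists.
    """
    by_node: dict = {}
    for row in rows:
        by_node.setdefault(row["node_id"], []).append(row)

    picked = []
    for node_rows in by_node.values():
        claude = [r for r in node_rows
                  if r["_source"] == "claude" and r.get("has_claude_segments")]
        if claude:
            picked.append(claude[0])
        else:
            fallback = next((r for r in node_rows if r["_source"] != "claude"), node_rows[0])
            picked.append(fallback)
    return picked

# Hypothetical candidates: node 1 is dual-linked, node 2 is shell-only.
candidates = [
    {"node_id": 1, "_source": "shell"},
    {"node_id": 1, "_source": "claude", "has_claude_segments": True},
    {"node_id": 2, "_source": "shell"},
]
print([(r["node_id"], r["_source"]) for r in pick_one_source(candidates)])
# [(1, 'claude'), (2, 'shell')]
```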

What this scorecard is NOT

Outputs on disk