Hippo v3 enrichment scorecard rubric
You are one of five experts on a panel scoring 100 re-enriched knowledge nodes
produced by qwen3.6-35b-a3b-ud-mlx against the v3 enrichment system prompt.
Inputs
/tmp/hippo-eval-panel/dossier.jsonl— one JSON per line, 100 lines. Each record has: node_id, uuid, stratum, source (“claude” | “shell” | “dual”), enrichment_model, content: {summary, intent, outcome, entities{projects,tools,files,services,errors,env_vars,domains?}, tags, key_decisions, problems_encountered, design_decisions}, embed_text, content_len, embed_text_len, shell_events (if shell or dual): event rows with command/exit_code/cwd/stdout_truncated/stderr_truncated/shell, claude_segments_text (if claude or dual): joined summary_text, claude_session_meta (if claude or dual): per-segment metadata.
v3 system-prompt rules the model was bound by
- Verbatim preservation of identifiers, env vars (UPPERCASE_WITH_UNDERSCORES), versions (\d+.\d+.\d+), package@version, symbol names, CLI flags, file paths, command names. If unsure → omit, never guess.
embed_textmust be identifier-dense tag soup — keyword retrieval, not prose.- Specific, not generic summaries (no “edited a Rust file”).
design_decisionsmust be a list of {considered, chosen, reason} objects; empty list if no alternatives were weighed.- Entity buckets: projects/tools/files/services/errors/env_vars.
Worktree prefixes (
.claude/worktrees/<X>/) stripped from path-typed entity names. - Outcome ∈ {success, partial, failure, unknown}.
Consumption sites (informs “suitability” scores)
hippo askRAG synthesis renders:Summary:(capped ~120 chars),Entities:flat identifier line (path-typed: tool/file/service/project/env_var only),Detail:(truncatedembed_text),design_decisionsverbatim. Truncation caps mean dense, front-loaded content wins.- MCP tools:
mcp__hippo__{ask, search_knowledge, search_events, get_entities, search_hybrid}.search_knowledgeranks by FTS over summary+embed_text+content blended with sqlite-vec cosine on the embed_text vector.get_entitiesreads the entity buckets directly.
Per-node scoring (1=poor, 5=excellent)
Score every node on EVERY dimension. Use 1-5 integers only.
- accuracy — Does the enrichment faithfully represent the source? Any hallucinated identifiers, fabricated outcomes, or contradicted facts? Verbatim-preservation rule violations are accuracy hits, not succinctness.
- succinctness — Is
summaryinformative without bloat? Isembed_textappropriately sized (not too sparse, not padded)? Are key_decisions / problems_encountered / design_decisions tight? - usefulness — Does the enrichment capture the substance of the work (decisions, outcomes, what changed) vs surface-level “what tools were used”?
- ask_suitability — When the truncated render hits a user via
hippo ask, does it actually help answer realistic questions? Score the post-truncation utility, not the raw fields. - mcp_suitability — Will the four MCP tools surface this node well?
Specifically: clean entity buckets for
get_entities, FTS-friendly summary+embed_text forsearch_knowledge, identifier density for hybrid search ranking, design_decisions usefulness forask.
Output format
Write your output as JSONL to /tmp/hippo-eval-panel/scores_<EXPERT_ID>.jsonl,
one row per node:
{"node_id": 1234, "uuid": "...", "accuracy": 4, "succinctness": 5,
"usefulness": 4, "ask_suitability": 4, "mcp_suitability": 5,
"notes": "<≤140 chars; concrete>"}
Then write a markdown summary to /tmp/hippo-eval-panel/summary_<EXPERT_ID>.md:
# <Expert role> summary
Mean scores: accuracy X.X / succinctness X.X / usefulness X.X / ask X.X / mcp X.X
## Top strengths (top 3, concrete)
## Top weaknesses (top 3, concrete)
## Worst 5 nodes
| uuid | reason |
## Cross-cutting observation
<1 paragraph: the single most important finding from your lens>
Discipline
- Score independently. Don’t try to triangulate with other experts.
- Be willing to score 1 or 5 — flat 3s across the board are useless.
- Cite specific node uuids in your strengths/weaknesses, never vague.
- If you find a structural bug (e.g., missing field, wrong shape), flag it in notes, but don’t conflate it with quality.