hippo-bench Operator Runbook
This runbook covers the operator-driven gates that the autonomous bench loop deliberately doesn’t run because the blast radius is too high. The current critical entry is BT-29: deterministic-rerun verification.
mise shortcut: the everyday flow is wrapped in bench:* mise tasks (mise run bench:status, mise run bench:run <model-id>, etc.). See the bench README. This runbook remains authoritative for the operator-gated BT-29 procedure.
Required pre-BT-29 corpus/Q/A gate
Before running the three BT-29 model passes, confirm the corpus is schema-current and the Q/A fixture is fully scoreable against it. A mislabeled or stale golden can shift MRR by up to 1/N per affected question (0.01 at the 100-question minimum), so a few bad goldens are enough to exceed the 0.02 BT-29 budget. This gate protects the trust claim at its root.
uv run --project brain hippo-bench corpus verify --corpus-version corpus-v2
uv run --project brain hippo-bench qa validate \
  --qa-path ~/.local/share/hippo-bench/fixtures/eval-qa-v1.jsonl \
  --corpus-sqlite ~/.local/share/hippo-bench/fixtures/corpus-v2.sqlite \
  --min-scoreable 100
Both commands must exit 0. The Q/A golden_event_ids are corpus-grounded and
bound to a specific corpus_content_hash (recorded in
brain/src/hippo_brain/bench/qa_template.provenance.json). If you rebuild the
corpus from a different live DB, the sampled events change and the goldens stop
resolving — qa validate will report the shortfall. Re-annotate against the
new corpus (corpus-derived authoring, Mode B in
docs/baselines/QA-ANNOTATION.md) before
trusting any retrieval metric.
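The two gate commands above can be chained so a failure in either one stops you before any BT-29 pass starts. This is a minimal sketch, not part of hippo-bench: `pre_bt29_gate`, `HIPPO_BENCH`, and `FIXTURES` are hypothetical names introduced here, and the paths are the ones shown above.

```shell
# Sketch: run both pre-BT-29 gates and stop at the first failure.
# HIPPO_BENCH and FIXTURES are hypothetical overrides (not hippo-bench
# features); they only make the wrapper easy to point elsewhere.
HIPPO_BENCH="${HIPPO_BENCH:-uv run --project brain hippo-bench}"
FIXTURES="${FIXTURES:-$HOME/.local/share/hippo-bench/fixtures}"

pre_bt29_gate() {
  # $HIPPO_BENCH is left unquoted on purpose: it holds a command plus arguments.
  $HIPPO_BENCH corpus verify --corpus-version corpus-v2 || {
    echo "corpus verify failed; do not start BT-29" >&2
    return 1
  }
  $HIPPO_BENCH qa validate \
    --qa-path "$FIXTURES/eval-qa-v1.jsonl" \
    --corpus-sqlite "$FIXTURES/corpus-v2.sqlite" \
    --min-scoreable 100 || {
    echo "qa validate failed; re-annotate before trusting retrieval metrics" >&2
    return 1
  }
  echo "pre-BT-29 gate passed"
}
```

Calling `pre_bt29_gate && echo "ok to start BT-29"` keeps the "both commands must exit 0" rule in one place.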
BT-29: deterministic-rerun verification
Why this exists
The bench’s “trust” claim — “if it says model A > model B, that’s true” — falls apart if the same model produces materially different verdicts across identical reruns. Per the tracking doc’s Definition of Done #1:
Three consecutive runs of the same model against the frozen reference corpus produce identical verdicts (Hit@1 ± 0.02, MRR ± 0.02, judge-mean ± 0.1).
Until BT-29 fires green at least once, the trust foundation is unverified.
Why the autonomous loop doesn’t run it
Each run pauses prod brain for ~30 min and consumes the local inference server (oMLX) exclusively. Three consecutive runs is ~90 min of blocked prod observability. That’s unsafe to trigger from a multi-iteration ralph loop where a hung model can extend the pause indefinitely.
Procedure
Prerequisites:
- The local inference server (oMLX, default http://localhost:8000/v1) is running and idle (no other consumers). Pass --base-url if it differs.
- Prod brain is running and healthy (hippo doctor is green).
- The frozen corpus snapshot is present at the path you’ll pass to --corpus-version.
- You have ~90 min where prod observability gaps are acceptable.
Run:
hippo-bench run builds an internal run_id per invocation (timestamp +
short hash) and writes a JSONL there. The --out flag forces a specific
output path; that’s what we use here so all three runs land in known
locations the harness can compare.
# Pick a model your inference server (oMLX) can serve. Use the SAME model +
# temperature across all three runs; BT-29 measures bench-verdict
# reproducibility at the settings you actually deploy with, not at
# temperature=0 (which would make self-consistency a vacuous signal).
MODEL="Qwen3.6-35B-A3B-UD-MLX-4bit"
for i in 1 2 3; do
  uv run --project brain hippo-bench run \
    --models "$MODEL" \
    --corpus-version corpus-v2 \
    --out "/tmp/bt29-r$i.jsonl"
done
# Compare. Exits 1 if any model exceeds the 0.02 budget.
uv run --project brain hippo-bench determinism \
/tmp/bt29-r1.jsonl /tmp/bt29-r2.jsonl /tmp/bt29-r3.jsonl
The harness prints a determinism report with one row per compared model listing the mrr range, mrr delta, hit@1 range, hit@1 delta, and verdict. It exits 0 when every model’s MRR delta and Hit@1 delta are within budget, at which point the trust foundation is verified for this model. No reference metrics are published in this runbook on purpose: real numbers belong in the trust ledger (see below), not in copy-paste templates.
The harness defaults to comparing the hybrid retrieval mode (production
path). To verify a different mode (e.g. semantic-only deployment), pass
--mode semantic. To loosen or tighten the budget, use --mrr-budget and
--hit-at-1-budget.
If any compared run is missing downstream_proxy.modes[<mode>].mrr or
hit_at_1 (e.g. the proxy step raised and was captured into errors[]),
the model gets a FAIL (missing: ...) verdict rather than a silent PASS —
determinism cannot be assessed when one of the data points is absent.
Expected output (FAIL path):
If any model’s MRR delta or Hit@1 delta crosses 0.02, the verdict is FAIL and the harness exits 1. This means the model is not deterministic enough for the bench to rank it reliably — the verdict is dominated by run-to-run noise rather than actual ranking signal.
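To make the budget rule concrete, here is a small sketch of the arithmetic. It assumes "delta" means the spread (max minus min) of a metric across the compared runs and that a delta exactly equal to the budget passes; this runbook does not spell out either detail, so treat both as assumptions. `within_budget` is a hypothetical helper, not a hippo-bench command, and the sample values are made up for illustration.

```shell
# within_budget BUDGET VALUE...
# Prints the spread (max - min) of the values and succeeds when the spread
# is at or below BUDGET. Assumption: this mirrors how the harness derives
# "delta"; check the harness source before relying on it.
within_budget() {
  budget="$1"
  shift
  printf '%s\n' "$@" | awk -v budget="$budget" '
    NR == 1 { min = $1; max = $1 }
    {
      if ($1 < min) min = $1
      if ($1 > max) max = $1
    }
    END {
      delta = max - min
      printf "delta=%.4f\n", delta
      exit (delta > budget)   # exit 1 when the spread exceeds the budget
    }'
}

# Example with made-up MRR values from three runs:
#   within_budget 0.02 0.610 0.620 0.625 && echo PASS || echo FAIL
```

With these illustrative inputs the spread is 0.015, which sits inside the 0.02 budget; a set such as 0.60, 0.65, 0.61 would not.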
Possible causes (ordered by likelihood):
- Model quantization mismatch: if the inference server unloaded and reloaded the model between runs, you may have hit a different quantization. Confirm the served model card stayed identical before each run (e.g. curl -s http://localhost:8000/v1/models).
- Corpus drift: if corpus.sqlite was rebuilt mid-experiment, the inputs differ. Check sha256sum of the corpus file across runs.
- Real model nondeterminism above the budget: sampling at default temperature (0.7) means MoE routing or stochastic decoding can produce spread; the budget is tuned to accept production-realistic noise. If the delta is consistently >0.02, the model is too noisy for the bench to rank reliably and should be excluded or flagged in the trust ledger.
- Temperature drift: if you bumped --temperature between runs, the spread is expected. Confirm all three invocations used the same flag.
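The first two causes are easier to rule out if you snapshot the inputs before each run rather than reconstructing them afterwards. The sketch below is one way to do that under stated assumptions: `snapshot_inputs`, `all_identical`, `CORPUS_SQLITE`, `BASE_URL`, and `BT29_DIR` are hypothetical names, and the corpus path is borrowed from the Q/A gate above, so point it at whichever file your runs actually read.

```shell
# Sketch: record the run inputs before each BT-29 pass, compare afterwards.
# All three variables are illustrative defaults; adjust them to your setup.
CORPUS_SQLITE="${CORPUS_SQLITE:-$HOME/.local/share/hippo-bench/fixtures/corpus-v2.sqlite}"
BASE_URL="${BASE_URL:-http://localhost:8000/v1}"
BT29_DIR="${BT29_DIR:-/tmp}"

# snapshot_inputs RUN_INDEX
# Saves the corpus hash and the served-model listing for one run.
snapshot_inputs() {
  sha256sum "$CORPUS_SQLITE" | awk '{ print $1 }' \
    > "$BT29_DIR/bt29-r$1.corpus.sha256"
  curl -s "$BASE_URL/models" > "$BT29_DIR/bt29-r$1.models.json"
}

# all_identical FILE...
# Succeeds only if every listed file has the same SHA-256 digest.
all_identical() {
  sha256sum "$@" | awk '
    { seen[$1] = 1 }
    END { n = 0; for (h in seen) n++; exit (n != 1) }'
}
```

Call `snapshot_inputs "$i"` as the first statement inside the run loop, then check `all_identical /tmp/bt29-r1.corpus.sha256 /tmp/bt29-r2.corpus.sha256 /tmp/bt29-r3.corpus.sha256` once the runs finish. The model listings are better compared by eye with `diff`, since the response may carry fields that vary for harmless reasons.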
Recording the result
Once a model passes BT-29, append a line to the trust ledger:
# (proposal — table doesn't exist yet, see Phase 3 work)
echo "$(date -u +%Y-%m-%dT%H:%MZ) | $MODEL | $VERDICT | mrr_delta=$MRR_DELTA | hit_at_1_delta=$HIT_AT_1_DELTA" \
>> docs/baselines/bt29-trust-ledger.tsv
This gives you “is this new model better than the last passing model on the ledger?” without re-running BT-29 on every challenger.
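The ledger command above relies on VERDICT, MRR_DELTA, and HIT_AT_1_DELTA already being set. A minimal sketch of filling them in follows; `verdict_from_exit` and `ledger_line` are hypothetical helpers, the mapping of exit codes other than 0 and 1 to an error label is an assumption, and the two delta values are copied by hand from the determinism report because its machine-readable format is not described here.

```shell
# verdict_from_exit EXIT_CODE
# Maps the determinism harness exit code to a ledger verdict.
# 0 and 1 follow this runbook; any other code is treated as an error (assumption).
verdict_from_exit() {
  case "$1" in
    0) echo "PASS" ;;
    1) echo "FAIL" ;;
    *) echo "ERROR(exit=$1)" ;;
  esac
}

# ledger_line MODEL VERDICT MRR_DELTA HIT_AT_1_DELTA
# Prints one ledger row in the same layout as the proposal above.
ledger_line() {
  printf '%s | %s | %s | mrr_delta=%s | hit_at_1_delta=%s\n' \
    "$(date -u +%Y-%m-%dT%H:%MZ)" "$1" "$2" "$3" "$4"
}

# Example (run right after the determinism command so $? is still its status):
#   VERDICT="$(verdict_from_exit $?)"
#   ledger_line "$MODEL" "$VERDICT" "$MRR_DELTA" "$HIT_AT_1_DELTA" \
#     >> docs/baselines/bt29-trust-ledger.tsv
```

Keeping the formatting in one function means the row layout changes in a single place if the ledger proposal is revised.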
Skipping BT-29 (and being honest about it)
If you ship a bench result without BT-29 having run, the verdict is unverified empirically — the model passed lint + golden retrieval test (BT-19) but nothing has confirmed run-to-run stability. Document this on the run summary as “BT-29 deferred.” Do NOT claim “trust foundation complete” without it.
Other operator-gated procedures
(Empty for now. Add new entries here as Phase 2/3 work surfaces gates that shouldn’t run autonomously — judge-LLM rubric calibration, frozen-corpus re-freeze cadence, etc.)