Capture Operator Runbook
First-aid recipes for “something looks wrong with capture.” Companion to architecture.md (the system reference) and sources.md (per-source detail).
For an architectural overview of what each tool does, see architecture.md. The recipes here assume you already understand the layers.
At-a-glance: which tool answers which question?
| Question | Tool |
|---|---|
| Is capture broken right now? | hippo doctor (~2 s, exit code = fail count) |
| What’s broken and what should I do about it? | hippo doctor --explain (CAUSE / FIX / DOC per failure) |
| Has anything quietly broken in the last hour? | hippo alarms list (exits 1 if any unacknowledged) |
| Is a specific source healthy right now? | hippo probe --source <name> (synthetic round-trip) |
| Is the brain enriching properly? | hippo doctor (the brain section) — capture and enrichment are decoupled (I-10) |
| What did I just lose? | ~/.local/share/hippo/fallback/*.jsonl — fallback files (one per UTC date); replayed on next daemon start |
Doctor
hippo doctor runs ten checks in under 2 seconds. Each emits one of [OK], [WW] (warning), [!!] (failure), or [--] (informational, e.g., “no rows ever”). Exit code is the count of [!!] failures.
Use --explain to get CAUSE / FIX / DOC per failure. The DOC link points back into this directory.
hippo doctor # snapshot
hippo doctor --explain # snapshot + remediation per failure
A clean run looks like this:
[OK] CLI version: 0.20.0
[OK] Daemon is running (uptime 12h)
[OK] Daemon version matches CLI
[OK] Database exists (167 MB)
[OK] Brain queue depth: 0 pending, 0 failed
...
A failed run will list the specific check, the source, and (with --explain) what to do.
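The marker and exit-code convention above can be checked against captured output. The following is a minimal sketch, assuming the output format matches the sample lines in this section; `summarize_doctor` and `expected_exit_code` are hypothetical helper names, not part of hippo, and the `[WW]` line in the sample is made up for illustration.

```python
import re
from collections import Counter

# Status markers as described in this runbook.
MARKERS = {"OK": "ok", "WW": "warning", "!!": "failure", "--": "info"}
LINE_RE = re.compile(r"^\[(OK|WW|!!|--)\]\s+(.*)$")


def summarize_doctor(output: str) -> Counter:
    """Count check lines per status; lines without a marker are ignored."""
    counts = Counter()
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            counts[MARKERS[match.group(1)]] += 1
    return counts


def expected_exit_code(output: str) -> int:
    """Per the runbook, the exit code equals the number of [!!] lines."""
    return summarize_doctor(output)["failure"]


# Illustrative sample only; real output has more checks.
sample = """\
[OK] CLI version: 0.20.0
[OK] Daemon is running (uptime 12h)
[WW] Brain queue depth: 40 pending, 0 failed
[!!] watchdog heartbeat: 4m ago (FAIL)
"""
print(dict(summarize_doctor(sample)))
print(expected_exit_code(sample))
```

Comparing `expected_exit_code` with the real process exit code is one way to notice if a wrapper script is swallowing the doctor's status.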
Alarms
capture_alarms is an append-only ledger of invariant violations. The watchdog writes; you acknowledge.
hippo alarms list # unacknowledged alarms; exit 1 if any
hippo alarms ack <id> --note "..." # acknowledge with a note
hippo alarms prune # clear auto-resolved alarms
Acknowledgment is permanent. Use --note to record what you did about it.
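For scripted monitoring, the exit-code contract of hippo alarms list can be wrapped in a few lines. This is a sketch under stated assumptions: only exit codes 0 and 1 are documented here, so treating any other code as an error is this sketch's own choice, and `has_unacknowledged_alarms` plus the fake runners are hypothetical names used so the example runs without hippo installed.

```python
import subprocess
from types import SimpleNamespace


def has_unacknowledged_alarms(run=subprocess.run) -> bool:
    """True when `hippo alarms list` exits 1, the code this runbook documents."""
    result = run(["hippo", "alarms", "list"], capture_output=True, text=True)
    # Assumption: anything other than 0 or 1 means the command itself failed.
    if result.returncode not in (0, 1):
        raise RuntimeError(f"unexpected exit code {result.returncode}: {result.stderr}")
    return result.returncode == 1


# Stand-in runners that mimic subprocess.run's result shape.
def fake_run_with_alarm(cmd, **kwargs):
    return SimpleNamespace(returncode=1, stdout="", stderr="")


def fake_run_clean(cmd, **kwargs):
    return SimpleNamespace(returncode=0, stdout="", stderr="")


alarm_pending = has_unacknowledged_alarms(run=fake_run_with_alarm)
alarm_clear = has_unacknowledged_alarms(run=fake_run_clean)
print(alarm_pending, alarm_clear)
```

In real use, call `has_unacknowledged_alarms()` with no arguments so it shells out to the installed CLI.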
Probes
hippo probe --source <name> runs one synthetic round-trip on demand. Useful when you’ve just changed a configuration and want to confirm the source still lands. Probe rows are tagged with probe_tag IS NOT NULL and never appear in user-facing queries (see anti-patterns.md AP-6).
The launchd com.hippo.probe job runs probes every 5 minutes automatically; manual invocation is for confirming a specific source after operator action.
Recipes
“I ran a command but it’s not in hippo events”
# 1. Is the daemon up?
hippo doctor
# 2. Did the event land?
sqlite3 ~/.local/share/hippo/hippo.db "
SELECT id, command, timestamp
FROM events
WHERE source_kind = 'shell'
AND timestamp > strftime('%s','now') * 1000 - 600000
ORDER BY id DESC
LIMIT 10;
"
# 3. If not, is the source healthy per the watchdog?
sqlite3 ~/.local/share/hippo/hippo.db "
SELECT source, last_event_ts, consecutive_failures, probe_ok, probe_lag_ms
FROM source_health
WHERE source = 'shell';
"
# 4. Is the shell hook actually sourced?
grep -l 'hippo.zsh' ~/.zshrc ~/.zshenv ~/.config/zsh/*.zsh 2>/dev/null
# 5. Is the daemon socket responsive?
hippo probe --source shell
If the probe lands but the original command didn’t, the hook silently dropped the frame — check the fallback files (one JSONL per UTC date, written when the daemon was unreachable):
ls -la ~/.local/share/hippo/fallback/*.jsonl 2>/dev/null
A fallback file existing means the daemon was unreachable; the next daemon start will replay it via recover_fallback_files (crates/hippo-core/src/storage.rs).
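The window arithmetic in step 2 (epoch milliseconds, minus 600000 ms for a ten-minute lookback) can also be expressed from Python. This is a minimal sketch, assuming only the columns that the query in this recipe names; the in-memory table and its rows are invented for the demo, and `recent_shell_events` is a hypothetical helper.

```python
import sqlite3
import time

WINDOW_MS = 600_000  # ten minutes, matching the recipe's 600000


def recent_shell_events(conn, now_ms=None, limit=10):
    """Return (id, command, timestamp) for shell events inside the window."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return conn.execute(
        "SELECT id, command, timestamp FROM events "
        "WHERE source_kind = 'shell' AND timestamp > ? "
        "ORDER BY id DESC LIMIT ?",
        (now_ms - WINDOW_MS, limit),
    ).fetchall()


# Demo against a throwaway in-memory table (assumed minimal schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, command TEXT, "
    "timestamp INTEGER, source_kind TEXT)"
)
now_ms = 1_700_000_000_000
conn.executemany(
    "INSERT INTO events (command, timestamp, source_kind) VALUES (?, ?, ?)",
    [
        ("ls", now_ms - 5_000, "shell"),       # inside the window
        ("make", now_ms - 700_000, "shell"),   # older than ten minutes
        ("visit", now_ms - 1_000, "browser"),  # wrong source_kind
    ],
)
rows = recent_shell_events(conn, now_ms)
print(rows)
```

Against the real database, pass `sqlite3.connect` the path used throughout this runbook and omit `now_ms`.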
“Doctor shows red”
hippo doctor --explain
Pick the first [!!] failure. The CAUSE/FIX/DOC block will tell you which file in this directory documents the relevant invariant. For example:
- [!!] shell events: 8m ago (FAIL) → I-1 violation. See architecture.md I-1; check whether your shell session has been idle (suppression) or whether the hook actually fired (run hippo probe --source shell).
- [!!] watchdog heartbeat: 4m ago (FAIL) → I-7 violation. Watchdog crashed or its launchd job is missing. Check launchctl list | grep hippo. If com.hippo.watchdog is missing, run hippo daemon install --force.
- [!!] fallback files: 5 files > 24h (recovery broken) → I-9 violation. Daemon is up but old fallback files under ~/.local/share/hippo/fallback/ aren’t being drained. Check the daemon’s launchd logs (~/.local/share/hippo/daemon.stderr.log and the rolling daemon.YYYY-MM-DD.log files written by the tracing appender) for write errors.
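To see which fallback files would trip the I-9 example above, a short script can list the ones older than 24 hours. This is a sketch, not the doctor's actual check: it uses file modification time as a proxy for age, which is an assumption, and the date-style file names in the demo are invented. `stale_fallback_files` is a hypothetical helper.

```python
import os
import tempfile
import time
from pathlib import Path

FALLBACK_DIR = Path.home() / ".local/share/hippo/fallback"
MAX_AGE_S = 24 * 60 * 60  # the 24h figure from the I-9 example


def stale_fallback_files(directory=FALLBACK_DIR, now=None, max_age_s=MAX_AGE_S):
    """Return *.jsonl files whose mtime is more than max_age_s before now."""
    now = time.time() if now is None else now
    directory = Path(directory)
    if not directory.is_dir():
        return []
    return sorted(
        p for p in directory.glob("*.jsonl")
        if now - p.stat().st_mtime > max_age_s
    )


# Demo in a temporary directory so nothing under ~/.local is touched.
with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp) / "2024-01-01.jsonl"  # invented name
    new = Path(tmp) / "2024-01-03.jsonl"  # invented name
    old.write_text("{}\n")
    new.write_text("{}\n")
    now = time.time()
    two_days_ago = now - 2 * MAX_AGE_S
    os.utime(old, (two_days_ago, two_days_ago))  # backdate the first file
    demo_stale = [p.name for p in stale_fallback_files(tmp, now=now)]
print(demo_stale)
```

Calling `stale_fallback_files()` with no arguments inspects the real fallback directory read-only.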
“Brain queue is backing up”
Capture and enrichment are decoupled (I-10). A backed-up brain queue is an enrichment problem, not a capture problem; events are still landing.
sqlite3 ~/.local/share/hippo/hippo.db "
SELECT status, COUNT(*) FROM enrichment_queue GROUP BY status;
"
# Live brain log
tail -f ~/.local/share/hippo/brain.stderr.log
Common causes:
- Inference backend model unloaded: load the model in your backend’s UI (LM Studio / oMLX / ollama / …) or set it to stay loaded.
- Inference backend model swapped: [models].enrichment in ~/.config/hippo/config.toml doesn’t match a loaded model on the backend.
- [lmstudio] → [inference] config-section drift: upgrading an old install? Rename the section in ~/.config/hippo/config.toml. Both the daemon and the brain reject the legacy name with a clear migration error. The same [inference] key works for LM Studio, oMLX, ollama, vLLM, and any other OpenAI-compatible backend.
- Brain crashed: mise run restart (or launchctl bootout/bootstrap the brain agent).
- Watchdog I-12 (“brain preflight stuck”) will fire after the inference backend has been unreachable for ~1 minute; check hippo alarms list and the Stack Health Grade panel.
The watchdog reaper handles transient locks (rows stuck in processing for > lock_timeout_secs); see docs/brain-watchdog.md. A persistent backlog is operator-visible — neither the watchdog nor doctor will silently drop work.
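The status breakdown query from this recipe can be wrapped for repeated polling while you work through the causes above. A minimal sketch, assuming the `enrichment_queue.status` column holds string values such as 'pending', 'processing', and 'failed' (inferred from the doctor output and the reaper description, not confirmed); `queue_counts` and the demo rows are hypothetical.

```python
import sqlite3


def queue_counts(conn) -> dict:
    """Run the recipe's GROUP BY query and return {status: count}."""
    cursor = conn.execute(
        "SELECT status, COUNT(*) FROM enrichment_queue GROUP BY status"
    )
    return dict(cursor.fetchall())


# Demo with a throwaway in-memory table (assumed minimal schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE enrichment_queue (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO enrichment_queue (status) VALUES (?)",
    [("pending",), ("pending",), ("pending",), ("processing",), ("failed",)],
)
counts = queue_counts(conn)
print(counts)
```

A pending count that keeps growing between polls while the processing count stays flat points at the enrichment side, consistent with the decoupling described at the top of this recipe.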
“Schema mismatch — daemon refuses to bind”
The daemon’s startup handshake (crates/hippo-daemon/src/schema_handshake.rs) requires the daemon and brain schema versions to match exactly. If they don’t, the daemon refuses to bind its socket.
# Run the unified handshake check (compares all three at once).
hippo doctor --explain | grep -A 4 "schema"
# Or inspect each side individually:
# 1. What does the live DB say?
sqlite3 ~/.local/share/hippo/hippo.db "PRAGMA user_version;"
# 2. What version does the daemon binary expect? (compiled-in constant)
grep -E "^pub const EXPECTED_VERSION" \
~/projects/hippo/crates/hippo-core/src/storage.rs
# 3. What version does the brain expect?
uv run --project brain python -c \
"from hippo_brain.schema_version import EXPECTED_SCHEMA_VERSION; print(EXPECTED_SCHEMA_VERSION)"
All three numbers must match. If they don’t, mise run install (or mise run install --clean) brings everything to the same version. Don’t manually PRAGMA user_version = N on the DB — migrations have to run.
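The three-way comparison can be sketched as a small read-only check. This assumes you have already obtained the daemon and brain expected versions by the steps above and pass them in as integers; `db_user_version` and `schema_mismatches` are hypothetical helpers, and the version numbers in the demo are invented.

```python
import sqlite3


def db_user_version(conn) -> int:
    """Read the schema version stored in the database header."""
    return conn.execute("PRAGMA user_version").fetchone()[0]


def schema_mismatches(db_version: int, daemon_expected: int, brain_expected: int) -> dict:
    """Return {} when all three agree, else the full mapping for display."""
    versions = {"db": db_version, "daemon": daemon_expected, "brain": brain_expected}
    if len(set(versions.values())) == 1:
        return {}
    return versions


# Demo on a throwaway in-memory DB. Setting user_version is acceptable here
# only because this is not the real hippo.db; never do this on the live DB.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA user_version = 7")
db_version = db_user_version(conn)

all_match = schema_mismatches(db_version, daemon_expected=7, brain_expected=7)
drifted = schema_mismatches(db_version, daemon_expected=8, brain_expected=7)
print(all_match, drifted)
```

A non-empty result shows which side is out of step, which is the cue to run the install task rather than edit the database.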
“Probe lag is climbing”
source_health.probe_lag_ms is the end-to-end latency for a synthetic round-trip. Healthy: tens to hundreds of milliseconds for shell, low seconds for browser/claude-session. Climbing lag suggests the daemon is starving (load, disk pressure, or socket backlog).
sqlite3 ~/.local/share/hippo/hippo.db "
SELECT source, probe_lag_ms, datetime(probe_last_run_ts/1000, 'unixepoch', 'localtime')
FROM source_health
WHERE probe_lag_ms IS NOT NULL
ORDER BY probe_lag_ms DESC;
"
If lag exceeds the I-8 threshold (15 min for probe_last_run_ts), the watchdog will fire an I-8 alarm. Climbing-but-under-threshold lag is informational only.
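The 15-minute figure can be applied to a `probe_last_run_ts` value by hand. The sketch below assumes the column is epoch milliseconds (as the `/1000` in the query suggests) and that staleness is measured as now minus the last run; the watchdog's exact rule is not shown in this runbook, and `is_probe_stale` and `format_ts` are hypothetical helpers.

```python
from datetime import datetime, timezone

I8_THRESHOLD_MS = 15 * 60 * 1000  # 15 minutes, per this recipe


def probe_age_ms(probe_last_run_ts_ms: int, now_ms: int) -> int:
    """Milliseconds elapsed since the last probe run."""
    return now_ms - probe_last_run_ts_ms


def is_probe_stale(probe_last_run_ts_ms: int, now_ms: int,
                   threshold_ms: int = I8_THRESHOLD_MS) -> bool:
    """True when the last probe run is older than the threshold."""
    return probe_age_ms(probe_last_run_ts_ms, now_ms) > threshold_ms


def format_ts(ts_ms: int) -> str:
    """Render an epoch-millisecond timestamp as an ISO 8601 UTC string."""
    return datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).isoformat()


now_ms = 1_700_000_000_000
fresh = is_probe_stale(now_ms - 5 * 60 * 1000, now_ms)   # ran 5 minutes ago
stale = is_probe_stale(now_ms - 20 * 60 * 1000, now_ms)  # ran 20 minutes ago
print(fresh, stale, format_ts(now_ms))
```

Keeping the threshold as a named constant makes it easy to adjust if the I-8 definition in architecture.md differs from the value quoted here.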
“I-16 fired / duplicate knowledge nodes detected”
The watchdog found one or more agentic_sessions segments carrying multiple knowledge nodes with identical (content, embed_text, node_type) — an enricher is (or was) minting duplicate nodes instead of replacing/reusing them (AP-13). The write-time guards (replace_prior_agentic_nodes for agentic; find_identical_node for workflow/browser) keep this at zero; a sustained non-zero count means a writer bypassed its guard or a backlog of pre-fix duplicates remains.
Remediate with the one-shot dedup script. It collapses each identity group to the earliest node (MIN(id)), re-points every knowledge_node_* link onto the survivor (union), and deletes the losers’ vectors + rows. Run it with writers stopped and a fresh backup in place:
# 1. stop writers so the BEGIN IMMEDIATE transaction can't race a live write
mise run stop
# 2. checkpoint the WAL into the main DB so the backup clone is complete
sqlite3 ~/.local/share/hippo/hippo.db "PRAGMA wal_checkpoint(TRUNCATE);"
# 3. instant APFS-clone backup
cp -c ~/.local/share/hippo/hippo.db ~/.local/share/hippo/hippo.db.pre-dedup-$(date +%Y%m%d-%H%M%S).bak
# 4. DRY RUN (default — reports groups/losers/predicted-after, changes nothing)
uv run --project brain python brain/scripts/dedup-knowledge-nodes.py --db ~/.local/share/hippo/hippo.db
# 5. APPLY (irreversible)
uv run --project brain python brain/scripts/dedup-knowledge-nodes.py --db ~/.local/share/hippo/hippo.db --apply
# 6. restart + verify
mise run start && hippo doctor && hippo alarms list
The script is idempotent (a second --apply deletes 0) and verifies PRAGMA foreign_key_check is clean before committing. Post-run, knowledge_nodes count == knowledge_vectors count and the I-16 query returns 0.
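To build intuition for what the dry run reports, the identity grouping can be reproduced on a toy table. This is an illustration only, not the dedup script: it assumes a `knowledge_nodes` table with `id`, `content`, `embed_text`, and `node_type` columns, ignores the per-segment scoping and link re-pointing described above, and never deletes anything. `dry_run` and the demo rows are hypothetical.

```python
import sqlite3

# Groups of rows sharing the identity tuple; MIN(id) is the survivor.
DUP_QUERY = """
SELECT content, embed_text, node_type, MIN(id) AS survivor, COUNT(*) AS n
FROM knowledge_nodes
GROUP BY content, embed_text, node_type
HAVING COUNT(*) > 1
"""


def dry_run(conn) -> dict:
    """Report duplicate groups, loser count, and predicted row count."""
    groups = conn.execute(DUP_QUERY).fetchall()
    losers = sum(n - 1 for *_, n in groups)  # every row except the survivor
    total = conn.execute("SELECT COUNT(*) FROM knowledge_nodes").fetchone()[0]
    return {
        "groups": len(groups),
        "losers": losers,
        "predicted_after": total - losers,
        "survivors": [row[3] for row in groups],
    }


# Demo with a throwaway in-memory table (assumed minimal schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE knowledge_nodes (id INTEGER PRIMARY KEY, content TEXT, "
    "embed_text TEXT, node_type TEXT)"
)
conn.executemany(
    "INSERT INTO knowledge_nodes (content, embed_text, node_type) VALUES (?, ?, ?)",
    [
        ("fixed flaky test", "fixed flaky test", "observation"),
        ("fixed flaky test", "fixed flaky test", "observation"),
        ("fixed flaky test", "fixed flaky test", "observation"),
        ("bumped dependency", "bumped dependency", "observation"),
    ],
)
report = dry_run(conn)
print(report)
```

Here three identical rows form one group with two losers, so the predicted count drops from four to two, which mirrors the groups/losers/predicted-after shape of the script's dry-run output.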
Recovery: manual operations
| Operation | Command |
|---|---|
| Backfill a specific Claude JSONL | hippo ingest claude-session <path> |
| Dedup duplicate knowledge nodes (I-16 / AP-13) | brain/scripts/dedup-knowledge-nodes.py (see I-16 recipe above; stop writers + back up first) |
| Run a probe on demand | hippo probe --source <name> |
| Force install (overwrite plists, native messaging manifest, shell-hook config) | hippo daemon install --force |
| Stop everything (preserves data) | mise run stop |
| Stop everything hard (SIGKILL, preserves data) | mise run nuke |
| Start everything | mise run start |
| Full clean reinstall (rebuild + reinstall) | mise run install --clean |
When to escalate to a follow-up issue
If hippo doctor --explain doesn’t tell you what to do, or you’re seeing a failure mode that isn’t in this runbook, file a GitHub issue with:
- The exact hippo doctor --explain output
- The relevant source_health row(s)
- Any unacknowledged capture_alarms
- Recent ~/.local/share/hippo/*.log lines
anti-patterns.md AP-1..AP-12 are the review blockers; test-matrix.md is the failure-mode-to-test reference for adding regression tests.