Capture Operator Runbook

First-aid recipes for “something looks wrong with capture.” Companion to architecture.md (the system reference) and sources.md (per-source detail).

The recipes here assume you already understand the layers. If you don’t, read the overview of what each tool does in architecture.md first.

At-a-glance: which tool answers which question?

Question	Tool
Is capture broken right now?	`hippo doctor` (~2 s, exit code = fail count)
What’s broken and what should I do about it?	`hippo doctor --explain` (CAUSE / FIX / DOC per failure)
Has anything quietly broken in the last hour?	`hippo alarms list` (exits 1 if any unacknowledged)
Is a specific source healthy right now?	`hippo probe --source <name>` (synthetic round-trip)
Is the brain enriching properly?	`hippo doctor` (the brain section) — capture and enrichment are decoupled (I-10)
What did I just lose?	`~/.local/share/hippo/fallback/*.jsonl` — fallback files (one per UTC date); replayed on next daemon start

Doctor

hippo doctor runs ten checks in under 2 seconds. Each emits one of [OK], [WW] (warning), [!!] (failure), or [--] (informational, e.g., “no rows ever”). Exit code is the count of [!!] failures.

Use --explain to get CAUSE / FIX / DOC per failure. The DOC link points back into this directory.

hippo doctor             # snapshot
hippo doctor --explain   # snapshot + remediation per failure

A clean run looks like this:

[OK] CLI version: 0.20.0
[OK] Daemon is running (uptime 12h)
[OK] Daemon version matches CLI
[OK] Database exists (167 MB)
[OK] Brain queue depth: 0 pending, 0 failed
...

A failed run will list the specific check, the source, and (with --explain) what to do.
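
The exit-code convention makes doctor easy to script. A minimal sketch, assuming every check line starts with its marker exactly as in the sample output above; summarize_doctor is a hypothetical helper, not part of hippo:

```shell
# Hypothetical helper (not part of hippo): tally doctor markers read from stdin.
# Assumes each check line starts with [OK], [WW], [!!], or [--].
summarize_doctor() {
  local ok warn fail info line
  ok=0 warn=0 fail=0 info=0
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      '[OK]'*) ok=$((ok + 1)) ;;
      '[WW]'*) warn=$((warn + 1)) ;;
      '[!!]'*) fail=$((fail + 1)) ;;
      '[--]'*) info=$((info + 1)) ;;
    esac
  done
  printf 'ok=%d warn=%d fail=%d info=%d\n' "$ok" "$warn" "$fail" "$info"
  # Same convention as doctor itself: status = number of [!!] lines.
  return "$fail"
}

# Usage: hippo doctor | summarize_doctor
```

In a pipeline the shell reports the status of summarize_doctor, so the one-line summary and the failure count survive even when the full output is discarded.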

Alarms

capture_alarms is an append-only ledger of invariant violations. The watchdog writes; you acknowledge.

hippo alarms list                    # unacknowledged alarms; exit 1 if any
hippo alarms ack <id> --note "..."   # acknowledge with a note
hippo alarms prune                   # clear auto-resolved alarms

Acknowledgment is permanent. Use --note to record what you did about it.
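
Because hippo alarms list exits 1 when anything is unacknowledged, it can gate a login hook or a cron job. A minimal sketch; alarm_gate is a hypothetical wrapper that takes the command as its arguments, so it can be exercised without a live daemon:

```shell
# Hypothetical wrapper: run the given alarm-listing command and report on its status.
# Any non-zero status is treated as "needs attention", including the command itself failing.
alarm_gate() {
  local output status
  output=$("$@" 2>&1) && status=0 || status=$?
  if [ "$status" -eq 0 ]; then
    echo "no unacknowledged alarms"
  else
    echo "unacknowledged alarms (exit $status):"
    printf '%s\n' "$output"
  fi
  return "$status"
}

# Usage: alarm_gate hippo alarms list
```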

Probes

hippo probe --source <name> runs one synthetic round-trip on demand. It is useful when you’ve just changed a configuration and want to confirm the source still lands. Probe rows are the ones with a non-NULL probe_tag (probe_tag IS NOT NULL); they never appear in user-facing queries (see anti-patterns.md AP-6).

The launchd com.hippo.probe job runs probes every 5 minutes automatically; manual invocation is for confirming a specific source after operator action.

Recipes

“I ran a command but it’s not in `hippo events`”

# 1. Is the daemon up?
hippo doctor

# 2. Did the event land?
sqlite3 ~/.local/share/hippo/hippo.db "
  SELECT id, command, timestamp
  FROM events
  WHERE source_kind = 'shell'
    AND timestamp > strftime('%s','now') * 1000 - 600000
  ORDER BY id DESC
  LIMIT 10;
"

# 3. If not, is the source healthy per the watchdog?
sqlite3 ~/.local/share/hippo/hippo.db "
  SELECT source, last_event_ts, consecutive_failures, probe_ok, probe_lag_ms
  FROM source_health
  WHERE source = 'shell';
"

# 4. Is the shell hook actually sourced?
grep -lF 'hippo.zsh' ~/.zshrc ~/.zshenv ~/.config/zsh/*.zsh 2>/dev/null

# 5. Is the daemon socket responsive?
hippo probe --source shell

If the probe lands but the original command didn’t, the hook silently dropped the frame — check the fallback files (one JSONL per UTC date, written when the daemon was unreachable):

ls -la ~/.local/share/hippo/fallback/*.jsonl 2>/dev/null

A fallback file existing means the daemon was unreachable; the next daemon start will replay it via recover_fallback_files (crates/hippo-core/src/storage.rs).
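
To see how much is waiting for replay, count the lines in each fallback file. A minimal sketch, assuming one event per JSONL line; fallback_pending is a hypothetical helper, and its directory argument defaults to the path used above:

```shell
# Hypothetical helper: print "<file name><TAB><line count>" for each fallback file.
fallback_pending() {
  local dir f
  dir=${1:-$HOME/.local/share/hippo/fallback}
  for f in "$dir"/*.jsonl; do
    [ -e "$f" ] || continue   # the glob matched nothing
    # grep -c '' counts lines, including a final line with no trailing newline.
    printf '%s\t%s\n' "$(basename "$f")" "$(grep -c '' "$f")"
  done
}

# Usage: fallback_pending
```

No output means nothing is queued for replay.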

“Doctor shows red”

hippo doctor --explain

Pick the first [!!] failure. The CAUSE/FIX/DOC block will tell you which file in this directory documents the relevant invariant. For example:

[!!] shell events: 8m ago (FAIL) → I-1 violation. See architecture.md I-1; check whether your shell session has been idle (suppression) or whether the hook actually fired (run hippo probe --source shell).
[!!] watchdog heartbeat: 4m ago (FAIL) → I-7 violation. Watchdog crashed or its launchd job is missing. Check launchctl list | grep hippo. If com.hippo.watchdog is missing, run hippo daemon install --force.
[!!] fallback files: 5 files > 24h (recovery broken) → I-9 violation. Daemon is up but old fallback files under ~/.local/share/hippo/fallback/ aren’t being drained. Check the daemon’s launchd logs (~/.local/share/hippo/daemon.stderr.log and the rolling daemon.YYYY-MM-DD.log files written by the tracing appender) for write errors.
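
The I-9 condition in the last example can be checked by hand. A minimal sketch, assuming file modification time is a fair proxy for when a fallback file was last written; stale_fallback_files is a hypothetical helper, and 1440 minutes stands in for the 24h limit that doctor quotes:

```shell
# Hypothetical helper: list fallback files last modified more than 24 hours ago.
stale_fallback_files() {
  local dir
  dir=${1:-$HOME/.local/share/hippo/fallback}
  [ -d "$dir" ] || return 0
  # -mmin +1440 is accepted by both GNU and BSD find.
  find "$dir" -maxdepth 1 -type f -name '*.jsonl' -mmin +1440 | sort
}

# Usage: stale_fallback_files
```

Any path it prints is a file the daemon should already have drained.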

“Browser capture is idle or doctor shows browser [!!]”

Firefox Developer Edition is the supported browser. Capture requires both the extension and the native-messaging host.

# Snapshot + remediation (browser lines show extension connectivity + last_error)
hippo doctor --explain

# One-shot synthetic round-trip
hippo probe --source browser

Permanent extension install (survives restarts):

1. One-time in Firefox Dev Edition: about:config → xpinstall.signatures.required = false
2. From the hippo repo: mise run install:ext (also runs as part of mise run install)
3. Restart Firefox if it was running during install
4. Verify: hippo doctor should show [OK] Firefox extension installed permanently

about:debugging → Load Temporary Add-on is for active extension development only. Firefox clears temporary loads when it restarts, so the I-4 and doctor browser checks fail after a restart until you run mise run install:ext again.

Native messaging host:

hippo daemon install --force   # writes ~/Library/Application Support/Mozilla/NativeMessagingHosts/hippo_daemon.json
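
If the browser still can’t reach the daemon after the install, it is worth confirming that the manifest exists and that its path field points at something executable. A minimal sketch, assuming the standard native-messaging manifest layout with a top-level "path" key; check_native_host is a hypothetical helper, and its parsing is deliberately naive:

```shell
# Hypothetical helper: sanity-check a native-messaging manifest.
check_native_host() {
  local manifest target
  manifest=${1:-"$HOME/Library/Application Support/Mozilla/NativeMessagingHosts/hippo_daemon.json"}
  if [ ! -s "$manifest" ]; then
    echo "manifest missing or empty: $manifest"
    return 1
  fi
  # Pull the value of the "path" key (naive parse; escaped quotes are not handled).
  target=$(sed -n 's/.*"path"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$manifest" | head -n 1)
  if [ -z "$target" ] || [ ! -x "$target" ]; then
    echo "manifest path is not executable: ${target:-<none>}"
    return 1
  fi
  echo "native host ok: $target"
}

# Usage: check_native_host
```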

See extension/firefox/README.md and the Browser section in sources.md.

“Brain queue is backing up”

Capture and enrichment are decoupled (I-10). A backed-up brain queue is an enrichment problem, not a capture problem; events are still landing.

sqlite3 ~/.local/share/hippo/hippo.db "
  SELECT status, COUNT(*) FROM enrichment_queue GROUP BY status;
"

# Live brain log
tail -f ~/.local/share/hippo/brain.stderr.log

Common causes:

Inference backend model unloaded — load the model in your backend’s UI (LM Studio / oMLX / ollama / …) or set it to stay loaded.
Inference backend model swapped — [models].enrichment in ~/.config/hippo/config.toml doesn’t match a loaded model on the backend.
[lmstudio] → [inference] config-section drift — upgrading an old install? Rename the section in ~/.config/hippo/config.toml. Both the daemon and the brain reject the legacy name with a clear migration error. The same [inference] key works for LM Studio, oMLX, ollama, vLLM, and any other OpenAI-compatible backend.
Brain crashed — mise run restart (or launchctl bootout/bootstrap the brain agent).
Watchdog I-12 (“brain preflight stuck”) will fire after the inference backend has been unreachable for ~1 minute; check hippo alarms list and the Stack Health Grade panel.

The watchdog reaper handles transient locks (rows stuck in processing for > lock_timeout_secs); see docs/brain-watchdog.md. A persistent backlog is operator-visible — neither the watchdog nor doctor will silently drop work.
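
To turn the status counts into a quick verdict, the query output can be piped through a small filter. A minimal sketch, assuming sqlite3’s default status|count output and that the queue uses the status values pending and failed (the words doctor prints); queue_verdict and its default ceiling of 100 pending rows are hypothetical:

```shell
# Hypothetical filter: read "status|count" lines on stdin and print a one-line verdict.
# Returns 1 when any row has failed or the pending count exceeds the ceiling ($1, default 100).
queue_verdict() {
  awk -F'|' -v max="${1:-100}" '
    { count[$1] = $2 }
    END {
      pending = count["pending"] + 0
      failed = count["failed"] + 0
      printf "pending=%d failed=%d\n", pending, failed
      bad = (failed > 0 || pending > max + 0) ? 1 : 0
      exit bad
    }
  '
}

# Usage:
#   sqlite3 ~/.local/share/hippo/hippo.db \
#     "SELECT status, COUNT(*) FROM enrichment_queue GROUP BY status;" | queue_verdict 100
```

Running it twice a few minutes apart shows whether the pending count is draining or growing.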

“Schema mismatch — daemon refuses to bind”

The daemon’s startup handshake (crates/hippo-daemon/src/schema_handshake.rs) requires three schema versions to match exactly: the live database’s user_version, the version the daemon binary expects, and the version the brain expects. If they don’t, the daemon refuses to bind its socket.

# Run the unified handshake check (compares all three at once).
hippo doctor --explain | grep -i -A 4 "schema"

# Or inspect each side individually:

# 1. What does the live DB say?
sqlite3 ~/.local/share/hippo/hippo.db "PRAGMA user_version;"

# 2. What version does the daemon binary expect? (compiled-in constant;
#    adjust the path to wherever your hippo checkout lives)
grep -E "^pub const EXPECTED_VERSION" \
  ~/projects/hippo/crates/hippo-core/src/storage.rs

# 3. What version does the brain expect? (run from the hippo repo root)
uv run --project brain python -c \
  "from hippo_brain.schema_version import EXPECTED_SCHEMA_VERSION; print(EXPECTED_SCHEMA_VERSION)"

All three numbers must match. If they don’t, mise run install (or mise run install --clean) brings everything to the same version. Don’t set PRAGMA user_version = N on the DB by hand: that changes the number without running the migrations it stands for.
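
The three-way comparison is simple enough to script once the numbers are in hand. A minimal sketch; both helpers are hypothetical, and const_value assumes the declaration fits on one line in the form pub const EXPECTED_VERSION: <type> = N;

```shell
# Hypothetical helper: extract N from a one-line declaration such as
#   pub const EXPECTED_VERSION: i64 = 12;   (type and value here are examples only)
const_value() {
  sed -n 's/.*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: succeed only if all three versions are non-empty and equal.
check_schema_versions() {
  local db daemon brain
  db=$1 daemon=$2 brain=$3
  if [ -n "$db" ] && [ "$db" = "$daemon" ] && [ "$daemon" = "$brain" ]; then
    echo "schema versions agree: $db"
    return 0
  fi
  echo "schema mismatch: db=$db daemon=$daemon brain=$brain"
  return 1
}

# Usage (from the hippo repo root):
#   db=$(sqlite3 ~/.local/share/hippo/hippo.db "PRAGMA user_version;")
#   daemon=$(grep -E "^pub const EXPECTED_VERSION" crates/hippo-core/src/storage.rs | const_value)
#   brain=$(uv run --project brain python -c \
#     "from hippo_brain.schema_version import EXPECTED_SCHEMA_VERSION; print(EXPECTED_SCHEMA_VERSION)")
#   check_schema_versions "$db" "$daemon" "$brain"
```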

“Probe lag is climbing”

source_health.probe_lag_ms is the end-to-end latency for a synthetic round-trip. Healthy: tens to hundreds of milliseconds for shell, low seconds for browser/claude-session. Climbing lag suggests the daemon is starving (load, disk pressure, or socket backlog).

sqlite3 ~/.local/share/hippo/hippo.db "
  SELECT source, probe_lag_ms, datetime(probe_last_run_ts/1000, 'unixepoch', 'localtime')
  FROM source_health
  WHERE probe_lag_ms IS NOT NULL
  ORDER BY probe_lag_ms DESC;
"

The I-8 threshold is 15 min, measured on probe_last_run_ts; once it is exceeded, the watchdog fires an I-8 alarm. Lag that is climbing but still under the threshold is informational only.
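
For reference, the arithmetic behind a 15-minute threshold on probe_last_run_ts, as a minimal sketch. It assumes the check compares the current time against probe_last_run_ts, both in epoch milliseconds (the unit the query above converts from); probe_is_stale is a hypothetical helper, not the watchdog’s actual code:

```shell
# Hypothetical helper: succeed when the last probe run is more than 15 minutes old.
# Both arguments are epoch milliseconds.
probe_is_stale() {
  local last_run_ms now_ms threshold_ms
  last_run_ms=$1
  now_ms=$2
  threshold_ms=$((15 * 60 * 1000))   # 15 min = 900000 ms
  [ $((now_ms - last_run_ms)) -gt "$threshold_ms" ]
}

# Usage:
#   now_ms=$(( $(date +%s) * 1000 ))
#   probe_is_stale "$last_run_ms" "$now_ms" && echo "stale"
```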

“I-16 fired / duplicate knowledge nodes detected”

The watchdog found one or more agentic_sessions segments carrying multiple knowledge nodes with identical (content, embed_text, node_type) — an enricher is (or was) minting duplicate nodes instead of replacing/reusing them (AP-13). The write-time guards (replace_prior_agentic_nodes for agentic; find_identical_node for workflow/browser) keep this at zero; a sustained non-zero count means a writer bypassed its guard or a backlog of pre-fix duplicates remains.

Remediate with the one-shot dedup script. It collapses each identity group to the earliest node (MIN(id)), re-points every knowledge_node_* link onto the survivor so that it holds the union of the group’s links, and deletes the losers’ vectors and rows. Run it with writers stopped and a fresh backup in place:

# 1. stop writers so the BEGIN IMMEDIATE transaction can't race a live write
mise run stop
# 2. checkpoint the WAL into the main DB so the backup clone is complete
sqlite3 ~/.local/share/hippo/hippo.db "PRAGMA wal_checkpoint(TRUNCATE);"
# 3. instant APFS-clone backup
cp -c ~/.local/share/hippo/hippo.db ~/.local/share/hippo/hippo.db.pre-dedup-$(date +%Y%m%d-%H%M%S).bak
# 4. DRY RUN (default — reports groups/losers/predicted-after, changes nothing)
uv run --project brain python brain/scripts/dedup-knowledge-nodes.py --db ~/.local/share/hippo/hippo.db
# 5. APPLY (irreversible)
uv run --project brain python brain/scripts/dedup-knowledge-nodes.py --db ~/.local/share/hippo/hippo.db --apply
# 6. restart + verify
mise run start && hippo doctor && hippo alarms list

The script is idempotent (a second --apply deletes 0) and verifies that PRAGMA foreign_key_check is clean before committing. After the run, the knowledge_nodes and knowledge_vectors row counts should be equal, and the I-16 query should return 0.

Recovery: manual operations

Operation	Command
Backfill a specific Claude JSONL	`hippo ingest claude-session <path>`
Dedup duplicate knowledge nodes (I-16 / AP-13)	`brain/scripts/dedup-knowledge-nodes.py` (see I-16 recipe above; stop writers + back up first)
Run a probe on demand	`hippo probe --source <name>`
Force install (overwrite plists, native messaging manifest, shell-hook config)	`hippo daemon install --force`
Stop everything (preserves data)	`mise run stop`
Stop everything hard (SIGKILL, preserves data)	`mise run nuke`
Start everything	`mise run start`
Full clean reinstall (rebuild + reinstall)	`mise run install --clean`

When to escalate to a follow-up issue

If hippo doctor --explain doesn’t tell you what to do, or you’re seeing a failure mode that isn’t in this runbook, file a GitHub issue with:

The exact hippo doctor --explain output
The relevant source_health row(s)
Any unacknowledged capture_alarms
Recent ~/.local/share/hippo/*.log lines
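
Gathering those four items can be scripted. A minimal sketch; collect_report is a hypothetical helper, the output file names are arbitrary, and every step tolerates failure so that a broken daemon or a missing table still yields a partial bundle:

```shell
# Hypothetical helper: collect_report [data_dir] [out_dir]
collect_report() {
  local data_dir out db
  data_dir=${1:-$HOME/.local/share/hippo}
  out=${2:-./hippo-report}
  db=$data_dir/hippo.db
  mkdir -p "$out" || return 1
  hippo doctor --explain > "$out/doctor.txt" 2>&1 || true
  # Only query an existing DB; sqlite3 would otherwise create an empty file.
  { [ -f "$db" ] && sqlite3 "$db" 'SELECT * FROM source_health;'; } \
    > "$out/source_health.txt" 2>&1 || true
  hippo alarms list > "$out/alarms.txt" 2>&1 || true
  # Last 200 lines of each log in the data directory.
  tail -n 200 "$data_dir"/*.log > "$out/logs.txt" 2>&1 || true
  echo "report written to $out"
}

# Usage: collect_report
```

Review the bundle before attaching it; captured commands and log lines may contain things you don’t want in a public issue.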

anti-patterns.md AP-1..AP-12 are the review blockers; test-matrix.md is the failure-mode-to-test reference for adding regression tests.