Capture Operator Runbook

First-aid recipes for “something looks wrong with capture.” Companion to architecture.md (the system reference) and sources.md (per-source detail).

For an architectural overview of what each tool does, see architecture.md. The recipes here assume you already understand the layers.

At-a-glance: which tool answers which question?

| Question | Tool |
| --- | --- |
| Is capture broken right now? | hippo doctor (~2 s, exit code = fail count) |
| What’s broken and what should I do about it? | hippo doctor --explain (CAUSE / FIX / DOC per failure) |
| Has anything quietly broken in the last hour? | hippo alarms list (exits 1 if any unacknowledged) |
| Is a specific source healthy right now? | hippo probe --source <name> (synthetic round-trip) |
| Is the brain enriching properly? | hippo doctor (the brain section); capture and enrichment are decoupled (I-10) |
| What did I just lose? | ~/.local/share/hippo/fallback/*.jsonl: fallback files (one per UTC date), replayed on next daemon start |

Doctor

hippo doctor runs ten checks in under 2 seconds. Each emits one of [OK], [WW] (warning), [!!] (failure), or [--] (informational, e.g., “no rows ever”). Exit code is the count of [!!] failures.

Use --explain to get CAUSE / FIX / DOC per failure. The DOC link points back into this directory.

hippo doctor             # snapshot
hippo doctor --explain   # snapshot + remediation per failure

A clean run looks like this:

[OK] CLI version: 0.20.0
[OK] Daemon is running (uptime 12h)
[OK] Daemon version matches CLI
[OK] Database exists (167 MB)
[OK] Brain queue depth: 0 pending, 0 failed
...

A failed run will list the specific check, the source, and (with --explain) what to do.
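
Because the exit code equals the number of [!!] lines, a wrapper script can cross-check the two. The following is a minimal sketch, assuming every check prints one line that begins with its marker, as in the sample above. summarize_doctor is a hypothetical helper (not part of hippo), and the failing and warning lines in the sample are invented for illustration.

```python
from collections import Counter

# Markers as described above: OK, warning, failure, informational.
MARKERS = ("[OK]", "[WW]", "[!!]", "[--]")


def summarize_doctor(output: str) -> Counter:
    """Count doctor result lines by their leading marker (assumed line format)."""
    counts = Counter()
    for line in output.splitlines():
        stripped = line.lstrip()
        for marker in MARKERS:
            if stripped.startswith(marker):
                counts[marker] += 1
                break
    return counts


# Illustrative text only; the last two lines are made up for the example.
sample = """\
[OK] CLI version: 0.20.0
[OK] Daemon is running (uptime 12h)
[!!] Hypothetical failing check
[WW] Hypothetical warning
"""
counts = summarize_doctor(sample)
# Per the runbook, the doctor exit code should equal counts["[!!]"].
print(counts["[!!]"])
```

Comparing this count against the real exit code is a cheap sanity check in scripts that gate on hippo doctor.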

Alarms

capture_alarms is an append-only ledger of invariant violations. The watchdog writes; you acknowledge.

hippo alarms list                    # unacknowledged alarms; exit 1 if any
hippo alarms ack <id> --note "..."   # acknowledge with a note
hippo alarms prune                   # clear auto-resolved alarms

Acknowledgment is permanent. Use --note to record what you did about it.

Probes

hippo probe --source <name> runs one synthetic round-trip on demand. It is useful when you’ve just changed a configuration and want to confirm the source still lands. Probe rows carry a non-NULL probe_tag and never appear in user-facing queries (see anti-patterns.md AP-6).

The launchd com.hippo.probe job runs probes every 5 minutes automatically; manual invocation is for confirming a specific source after operator action.

Recipes

“I ran a command but it’s not in hippo events”

# 1. Is the daemon up?
hippo doctor

# 2. Did the event land?
sqlite3 ~/.local/share/hippo/hippo.db "
  SELECT id, command, timestamp
  FROM events
  WHERE source_kind = 'shell'
    AND timestamp > strftime('%s','now') * 1000 - 600000
  ORDER BY id DESC
  LIMIT 10;
"

# 3. If not, is the source healthy per the watchdog?
sqlite3 ~/.local/share/hippo/hippo.db "
  SELECT source, last_event_ts, consecutive_failures, probe_ok, probe_lag_ms
  FROM source_health
  WHERE source = 'shell';
"

# 4. Is the shell hook actually sourced?
grep -l 'hippo.zsh' ~/.zshrc ~/.zshenv ~/.config/zsh/*.zsh 2>/dev/null

# 5. Is the daemon socket responsive?
hippo probe --source shell

If the probe lands but the original command didn’t, the hook silently dropped the frame — check the fallback files (one JSONL per UTC date, written when the daemon was unreachable):

ls -la ~/.local/share/hippo/fallback/*.jsonl 2>/dev/null

A fallback file existing means the daemon was unreachable; the next daemon start will replay it via recover_fallback_files (crates/hippo-core/src/storage.rs).
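
To see how much is waiting for replay, the fallback directory can be summarized before restarting the daemon. This is a minimal sketch that only assumes what is stated above: one JSONL file per UTC date, one frame per line. pending_fallback_frames is a hypothetical helper. It reads the files without modifying them and does not attempt to parse the frames.

```python
from pathlib import Path


def pending_fallback_frames(fallback_dir: Path) -> dict[str, int]:
    """Map each fallback file name to its count of non-empty JSONL lines."""
    if not fallback_dir.is_dir():
        return {}
    pending = {}
    for path in sorted(fallback_dir.glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            pending[path.name] = sum(1 for line in handle if line.strip())
    return pending


if __name__ == "__main__":
    directory = Path.home() / ".local/share/hippo/fallback"
    for name, frames in pending_fallback_frames(directory).items():
        print(f"{name}: {frames} frame(s) awaiting replay")
```

An empty result means nothing was diverted to fallback, so the missing command was lost somewhere other than the daemon socket.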

“Doctor shows red”

hippo doctor --explain

Pick the first [!!] failure. The CAUSE/FIX/DOC block will tell you which file in this directory documents the relevant invariant. For example:

“Brain queue is backing up”

Capture and enrichment are decoupled (I-10). A backed-up brain queue is an enrichment problem, not a capture problem; events are still landing.

sqlite3 ~/.local/share/hippo/hippo.db "
  SELECT status, COUNT(*) FROM enrichment_queue GROUP BY status;
"

# Live brain log
tail -f ~/.local/share/hippo/brain.stderr.log

Common causes:

The watchdog reaper handles transient locks (rows stuck in processing for > lock_timeout_secs); see docs/brain-watchdog.md. A persistent backlog is operator-visible — neither the watchdog nor doctor will silently drop work.
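
To tell a transient lock from a persistent backlog, it helps to see which rows have sat in processing longer than the lock timeout. The sketch below runs against a throwaway in-memory table. Only the enrichment_queue table name, the status column, and the lock_timeout_secs setting come from this runbook. The locked_at column (epoch milliseconds) and the 300-second timeout are assumptions made for illustration, so check the real schema before adapting the query.

```python
import sqlite3

LOCK_TIMEOUT_SECS = 300  # assumed value; the real setting is lock_timeout_secs


def stuck_rows(conn, now_ms, timeout_secs=LOCK_TIMEOUT_SECS):
    """Return ids of rows in 'processing' whose lock is older than the timeout."""
    cutoff_ms = now_ms - timeout_secs * 1000
    rows = conn.execute(
        "SELECT id FROM enrichment_queue "
        "WHERE status = 'processing' AND locked_at < ? ORDER BY id",
        (cutoff_ms,),
    ).fetchall()
    return [row[0] for row in rows]


# Demo table: locked_at is a hypothetical column, not confirmed by the runbook.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE enrichment_queue ("
    "id INTEGER PRIMARY KEY, status TEXT, locked_at INTEGER)"
)
now_ms = 1_700_000_000_000
conn.executemany(
    "INSERT INTO enrichment_queue (status, locked_at) VALUES (?, ?)",
    [
        ("pending", None),
        ("processing", now_ms - 10_000),   # locked 10 s ago: still fresh
        ("processing", now_ms - 900_000),  # locked 15 min ago: past the timeout
        ("failed", None),
    ],
)
print(stuck_rows(conn, now_ms))  # [3]
```

Rows that keep reappearing in this list after the reaper has run point to a persistent problem and not a transient lock.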

“Schema mismatch: daemon refuses to bind”

The daemon’s startup handshake (crates/hippo-daemon/src/schema_handshake.rs) requires three schema versions to match exactly: the live DB’s user_version, the version the daemon binary expects, and the version the brain expects. If they don’t, the daemon refuses to bind its socket.

# Run the unified handshake check (compares all three at once).
hippo doctor --explain | grep -A 4 "schema"

# Or inspect each side individually:

# 1. What does the live DB say?
sqlite3 ~/.local/share/hippo/hippo.db "PRAGMA user_version;"

# 2. What version does the daemon binary expect? (compiled-in constant)
grep -E "^pub const EXPECTED_VERSION" \
  ~/projects/hippo/crates/hippo-core/src/storage.rs

# 3. What version does the brain expect?
uv run --project brain python -c \
  "from hippo_brain.schema_version import EXPECTED_SCHEMA_VERSION; print(EXPECTED_SCHEMA_VERSION)"

All three numbers must match. If they don’t, mise run install (or mise run install --clean) brings everything to the same version. Don’t manually PRAGMA user_version = N on the DB — migrations have to run.
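
The three-way comparison can be sketched as follows. PRAGMA user_version is standard SQLite. The helper names and the version numbers 7 and 6 are placeholders for illustration, and the demo sets user_version only on a throwaway in-memory database (never on the live DB, as noted above).

```python
import sqlite3


def db_user_version(conn: sqlite3.Connection) -> int:
    """Read the schema version stored in the database header."""
    return conn.execute("PRAGMA user_version").fetchone()[0]


def versions_match(versions: dict[str, int]) -> bool:
    """True only when every component reports the same schema version."""
    return len(set(versions.values())) == 1


# Throwaway demo database; setting user_version here touches no real data.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA user_version = 7")

versions = {
    "db": db_user_version(conn),
    "daemon": 7,  # placeholder for the daemon's compiled-in constant
    "brain": 6,   # placeholder for the brain's EXPECTED_SCHEMA_VERSION
}
print(versions_match(versions))  # False
```

A False result corresponds to the case where the daemon refuses to bind, and the fix is the reinstall described above.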

“Probe lag is climbing”

source_health.probe_lag_ms is the end-to-end latency for a synthetic round-trip. Healthy: tens to hundreds of milliseconds for shell, low seconds for browser/claude-session. Climbing lag suggests the daemon is starving (load, disk pressure, or socket backlog).

sqlite3 ~/.local/share/hippo/hippo.db "
  SELECT source, probe_lag_ms, datetime(probe_last_run_ts/1000, 'unixepoch', 'localtime')
  FROM source_health
  WHERE probe_lag_ms IS NOT NULL
  ORDER BY probe_lag_ms DESC;
"

If lag exceeds the I-8 threshold (15 min for probe_last_run_ts), the watchdog will fire an I-8 alarm. Lag that is climbing but still under the threshold is informational only.
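
The timestamp arithmetic behind the 15-minute threshold can be sketched like this, assuming probe_last_run_ts is stored in epoch milliseconds (which the /1000 in the query above suggests) and that the threshold applies to the age of that timestamp. The helper names are hypothetical, and the watchdog’s actual comparison may differ.

```python
from datetime import datetime, timezone

I8_THRESHOLD_MS = 15 * 60 * 1000  # 15 minutes, per the runbook


def probe_age_ms(probe_last_run_ts_ms: int, now_ms: int) -> int:
    """Milliseconds elapsed since the last probe run."""
    return now_ms - probe_last_run_ts_ms


def exceeds_i8(probe_last_run_ts_ms: int, now_ms: int) -> bool:
    """True when the last probe run is older than the 15-minute threshold."""
    return probe_age_ms(probe_last_run_ts_ms, now_ms) > I8_THRESHOLD_MS


def to_utc(ts_ms: int) -> datetime:
    """Convert epoch milliseconds to an aware UTC datetime."""
    return datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)


now_ms = 1_700_000_000_000
recent = now_ms - 4 * 60 * 1000   # probe ran 4 minutes ago
stale = now_ms - 20 * 60 * 1000   # probe ran 20 minutes ago

print(to_utc(now_ms).isoformat())  # 2023-11-14T22:13:20+00:00
print(exceeds_i8(recent, now_ms))  # False
print(exceeds_i8(stale, now_ms))   # True
```

The query above renders the timestamp in local time, while this sketch uses UTC to keep the output unambiguous.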

“I-16 fired / duplicate knowledge nodes detected”

The watchdog found one or more agentic_sessions segments carrying multiple knowledge nodes with identical (content, embed_text, node_type) — an enricher is (or was) minting duplicate nodes instead of replacing/reusing them (AP-13). The write-time guards (replace_prior_agentic_nodes for agentic; find_identical_node for workflow/browser) keep this at zero; a sustained non-zero count means a writer bypassed its guard or a backlog of pre-fix duplicates remains.

Remediate with the one-shot dedup script. It collapses each identity group to the earliest node (MIN(id)), re-points every knowledge_node_* link onto the survivor (union), and deletes the losers’ vectors + rows. Run it with writers stopped and a fresh backup in place:

# 1. stop writers so the BEGIN IMMEDIATE transaction can't race a live write
mise run stop
# 2. checkpoint the WAL into the main DB so the backup clone is complete
sqlite3 ~/.local/share/hippo/hippo.db "PRAGMA wal_checkpoint(TRUNCATE);"
# 3. instant APFS-clone backup
cp -c ~/.local/share/hippo/hippo.db ~/.local/share/hippo/hippo.db.pre-dedup-$(date +%Y%m%d-%H%M%S).bak
# 4. DRY RUN (default — reports groups/losers/predicted-after, changes nothing)
uv run --project brain python brain/scripts/dedup-knowledge-nodes.py --db ~/.local/share/hippo/hippo.db
# 5. APPLY (irreversible)
uv run --project brain python brain/scripts/dedup-knowledge-nodes.py --db ~/.local/share/hippo/hippo.db --apply
# 6. restart + verify
mise run start && hippo doctor && hippo alarms list

The script is idempotent (a second --apply deletes 0) and verifies PRAGMA foreign_key_check is clean before committing. Post-run, knowledge_nodes count == knowledge_vectors count and the I-16 query returns 0.
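
The identity grouping the script relies on can be illustrated with a read-only query. This is a simplified sketch against an in-memory table, not the script’s actual code. It groups on (content, embed_text, node_type) and picks MIN(id) as the survivor, but it ignores the per-segment scoping and the knowledge_node_* link tables, and the sample rows are invented.

```python
import sqlite3


def duplicate_groups(conn):
    """Return (survivor_id, loser_ids) for each identity group with duplicates."""
    rows = conn.execute(
        """
        SELECT MIN(id), GROUP_CONCAT(id)
        FROM knowledge_nodes
        GROUP BY content, embed_text, node_type
        HAVING COUNT(*) > 1
        ORDER BY MIN(id)
        """
    ).fetchall()
    groups = []
    for survivor, id_list in rows:
        # GROUP_CONCAT order is not guaranteed, so sort explicitly.
        ids = sorted(int(part) for part in id_list.split(","))
        groups.append((survivor, [i for i in ids if i != survivor]))
    return groups


# Demo data only; the real table has more columns than shown here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE knowledge_nodes ("
    "id INTEGER PRIMARY KEY, content TEXT, embed_text TEXT, node_type TEXT)"
)
conn.executemany(
    "INSERT INTO knowledge_nodes (content, embed_text, node_type) VALUES (?, ?, ?)",
    [
        ("a", "a", "agentic"),
        ("a", "a", "agentic"),
        ("b", "b", "agentic"),
        ("a", "a", "agentic"),
    ],
)
print(duplicate_groups(conn))  # [(1, [2, 4])]
```

Like the script’s dry run, this changes nothing. An empty list is the healthy state.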

Recovery: manual operations

| Operation | Command |
| --- | --- |
| Backfill a specific Claude JSONL | hippo ingest claude-session <path> |
| Dedup duplicate knowledge nodes (I-16 / AP-13) | brain/scripts/dedup-knowledge-nodes.py (see I-16 recipe above; stop writers + back up first) |
| Run a probe on demand | hippo probe --source <name> |
| Force install (overwrite plists, native messaging manifest, shell-hook config) | hippo daemon install --force |
| Stop everything (preserves data) | mise run stop |
| Stop everything hard (SIGKILL, preserves data) | mise run nuke |
| Start everything | mise run start |
| Full clean reinstall (rebuild + reinstall) | mise run install --clean |

When to escalate to a follow-up issue

If hippo doctor --explain doesn’t tell you what to do, or you’re seeing a failure mode that isn’t in this runbook, file a GitHub issue with:

anti-patterns.md AP-1..AP-12 are the review blockers; test-matrix.md is the failure-mode-to-test reference for adding regression tests.