Capture Architecture

Reference for hippo’s capture-reliability stack: how events land, what the system promises, and what fires when something breaks. For per-source detail (what each source captures and where it lands), see sources.md. For first-aid recipes when something goes wrong, see operator-runbook.md. For review-blocker rules every contributor needs to internalize, see anti-patterns.md.

TL;DR

Every capture path writes two things in the same SQLite transaction: the event row and a source_health row. A background watchdog reads source_health once a minute, asserts sixteen named invariants, and writes alarms to capture_alarms on violations. A separate probe job sends synthetic events through each path every five minutes and records round-trip latency. Operators see all of this through hippo doctor and hippo alarms.

The four layers

+--------------------+       +-------------------+      +-------------+
|  capture path      |       |  source_health    |      | capture_    |
|  (per source)      | ----> |  (one row/source) | <----| alarms      |
|  writes event +    |       |                   |      |             |
|  health in same Tx |       |                   |      +-------------+
+--------------------+       +-------------------+              ^
                                       ^                        |
                                       |                        |
                              +-----------------+      +-----------------+
                              |  watchdog       |      |  doctor / CLI   |
                              |  every 60s,     |----> |  reads alarms,  |
                              |  asserts I-1..  |      |  shows status   |
                              |  I-16           |      +-----------------+
                              +-----------------+
                                       ^
                                       |
                              +-----------------+
                              |  probe          |
                              |  every 5m,      |
                              |  synthetic      |
                              |  round-trip     |
                              +-----------------+

Capture path — the per-source code that writes events. Shell hook → daemon socket. FSEvents watcher → daemon. Native messaging → daemon. Each path writes to its source’s events table AND to source_health in the same SQLite transaction. (See anti-patterns.md AP-1: writing health from inside the user’s interactive prompt is forbidden — health writes happen in the daemon’s flush_events, never in shell/hippo.zsh.)
source_health table — one row per source, holds the latest “did the event land?” signal: last_event_ts, consecutive_failures, events_last_1h, probe_ok, probe_last_run_ts, probe_lag_ms. Single SQL ground truth.
Watchdog (com.hippo.watchdog, every 60 s) — asserts sixteen invariants (most against source_health; I-14 against the knowledge-node vector store, I-16 against the knowledge-node/agentic-session join), writes capture_alarms rows on violations. Rate-limited per invariant (one alarm per invariant per hour). Implemented in crates/hippo-daemon/src/watchdog.rs.
Probe (com.hippo.probe, every 5 minutes) — sends synthetic events through each capture path, measures end-to-end latency, records probe_lag_ms in source_health. Probe rows carry probe_tag IS NOT NULL and are filtered out of every user-facing query (RAG, MCP tools, hippo events). See crates/hippo-daemon/src/probe.rs. (See anti-patterns.md AP-6: probe rows must never appear in user-facing queries.)
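
The same-transaction rule for capture paths can be sketched with Python's sqlite3 module. This is a minimal illustration, not the daemon's code (the real write happens in Rust, in flush_events). The simplified events layout, the write_event helper, and the upsert shape are assumptions made for the example; only the source_health column names come from this document.

```python
import sqlite3
import time


def open_db() -> sqlite3.Connection:
    """Open an in-memory database with a simplified, assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE events (
            id INTEGER PRIMARY KEY,
            source_kind TEXT NOT NULL,
            payload TEXT NOT NULL,
            ts INTEGER NOT NULL
        );
        CREATE TABLE source_health (
            source TEXT PRIMARY KEY,
            last_event_ts INTEGER,
            consecutive_failures INTEGER NOT NULL DEFAULT 0,
            events_last_1h INTEGER NOT NULL DEFAULT 0,
            updated_at INTEGER
        );
        """
    )
    return conn


def write_event(conn, source, payload, now_ms=None):
    """Write the event row and the health row in one transaction.

    Hypothetical helper: if either statement fails, neither row is kept.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    with conn:  # commits on success, rolls back on any exception
        conn.execute(
            "INSERT INTO events (source_kind, payload, ts) VALUES (?, ?, ?)",
            (source, payload, now_ms),
        )
        conn.execute(
            """
            INSERT INTO source_health
                (source, last_event_ts, consecutive_failures,
                 events_last_1h, updated_at)
            VALUES (?, ?, 0, 1, ?)
            ON CONFLICT(source) DO UPDATE SET
                last_event_ts = excluded.last_event_ts,
                consecutive_failures = 0,
                events_last_1h = events_last_1h + 1,
                updated_at = excluded.updated_at
            """,
            (source, now_ms, now_ms),
        )
```

Because both statements run inside one `with conn:` block, a reader of source_health never sees a health update without its event, or the reverse.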

Operator interface: hippo doctor for a snapshot, hippo alarms for unacknowledged violations, hippo probe to run a one-off synthetic check.

Tables

`source_health`

One row per source. Updated in the same transaction as event writes; the watchdog reads it; the probe job updates probe_* columns.

Column	Type	Meaning
`source`	TEXT PK	Source name (one row per source; non-exhaustive): `shell`, `claude-tool`, `agentic-session-claude`, `agentic-session-opencode`, `agentic-session-codex`, `agentic-session-cursor`, `browser`, `claude-session-watcher`, `workflow`, `watchdog`, `brain-preflight`. (There is no `probe` row — the probe job writes `probe_*` columns onto each real source’s row, not a separate probe heartbeat.)
`last_event_ts`	INTEGER	Epoch ms of the most recent successful event write for this source.
`consecutive_failures`	INTEGER	Bumped on each failure; reset on success. Backstop for I-1, I-4 freshness alarms.
`events_last_1h` / `_24h`	INTEGER	Rolling counts. Maintained by the daemon: incremented per-write in `crates/hippo-daemon/src/daemon.rs::flush_events`, then periodically corrected by `recompute_rolling_counts` (same file, every 5 min) which overwrites them with fresh `COUNT(*)` queries against `events` / `claude_sessions` / `browser_events`. The watchdog reads these values; it does not compute them.
`probe_ok`	INTEGER	Last probe-job result: 1 = healthy, 0 = unhealthy. Source-specific definition (see “Probes” below).
`probe_last_run_ts`	INTEGER	When the probe last completed for this source.
`probe_lag_ms`	INTEGER	End-to-end latency of the most recent successful probe.
`updated_at`	INTEGER	Always bumped on any column update.
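
The increment-then-correct pattern for the rolling counters can be illustrated as follows. This is a sketch under simplifying assumptions: it counts against a single events table with an assumed ts column, whereas the real recompute_rolling_counts queries `events`, `claude_sessions`, and `browser_events` and is written in Rust. The stale value 7 stands in for a counter that has drifted.

```python
import sqlite3

HOUR_MS = 60 * 60 * 1000


def recompute_rolling_counts(conn: sqlite3.Connection, now_ms: int) -> None:
    """Overwrite the rolling counters with fresh COUNT(*) results."""
    with conn:
        conn.execute(
            """
            UPDATE source_health
            SET events_last_1h = (
                    SELECT COUNT(*) FROM events
                    WHERE events.source_kind = source_health.source
                      AND events.ts > :h1
                ),
                events_last_24h = (
                    SELECT COUNT(*) FROM events
                    WHERE events.source_kind = source_health.source
                      AND events.ts > :h24
                ),
                updated_at = :now
            """,
            {"h1": now_ms - HOUR_MS, "h24": now_ms - 24 * HOUR_MS, "now": now_ms},
        )


# Demo: a drifted counter is corrected by the periodic recompute.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events (id INTEGER PRIMARY KEY, source_kind TEXT, ts INTEGER);
    CREATE TABLE source_health (
        source TEXT PRIMARY KEY,
        events_last_1h INTEGER NOT NULL DEFAULT 0,
        events_last_24h INTEGER NOT NULL DEFAULT 0,
        updated_at INTEGER
    );
    """
)
now = 100 * HOUR_MS
conn.executemany(
    "INSERT INTO events (source_kind, ts) VALUES (?, ?)",
    [
        ("shell", now - 10),            # inside the 1 h window
        ("shell", now - 2 * HOUR_MS),   # inside the 24 h window only
        ("shell", now - 30 * HOUR_MS),  # outside both windows
        ("browser", now - 5),           # different source, not counted
    ],
)
conn.execute(
    "INSERT INTO source_health (source, events_last_1h, events_last_24h) "
    "VALUES ('shell', 7, 7)"  # stale values left by per-write increments
)
conn.commit()
recompute_rolling_counts(conn, now)
counts = conn.execute(
    "SELECT events_last_1h, events_last_24h FROM source_health "
    "WHERE source = 'shell'"
).fetchone()
```

After the recompute, `counts` holds one event in the last hour and two in the last 24 hours, whatever the incremental counters had drifted to.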

`capture_alarms`

Append-only ledger of invariant violations. The watchdog writes rows; hippo alarms ack <id> records the acknowledgment by setting acked_at and, optionally, ack_note.

Column	Meaning
`id`	PK
`invariant_id`	One of `I-1` … `I-16`
`raised_at`	First detection time (epoch ms)
`details_json`	Invariant-specific diagnostic context — affected source, `since_ms`, and per-invariant details
`acked_at`	NULL until `hippo alarms ack <id>`
`ack_note`	Optional note supplied at acknowledgment
`resolved_at`	Set once the invariant has stayed clean for 2 consecutive ticks
`clean_ticks`	Consecutive-clean tick count driving the auto-resolve loop
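
The alarm lifecycle implied by these columns (raise, refresh, auto-resolve after two clean ticks) can be sketched as a small state machine. The record_tick helper, its return labels, and the exact interplay between refresh and the hourly rate limit are assumptions for illustration; the authoritative logic is in crates/hippo-daemon/src/watchdog.rs.

```python
import json
import sqlite3

RATE_LIMIT_MS = 60 * 60 * 1000  # one alarm per invariant per hour
CLEAN_TICKS_TO_RESOLVE = 2      # consecutive clean ticks before auto-resolve

SCHEMA = """
CREATE TABLE capture_alarms (
    id INTEGER PRIMARY KEY,
    invariant_id TEXT NOT NULL,
    raised_at INTEGER NOT NULL,
    details_json TEXT NOT NULL,
    acked_at INTEGER,
    ack_note TEXT,
    resolved_at INTEGER,
    clean_ticks INTEGER NOT NULL DEFAULT 0
);
"""


def record_tick(conn, invariant_id, violated, now_ms, details=None):
    """Apply one watchdog tick for one invariant and report what happened."""
    with conn:
        open_row = conn.execute(
            "SELECT id, clean_ticks FROM capture_alarms "
            "WHERE invariant_id = ? AND resolved_at IS NULL "
            "ORDER BY raised_at DESC LIMIT 1",
            (invariant_id,),
        ).fetchone()
        if violated:
            if open_row is not None:
                # Still violated: restart the clean-tick countdown.
                conn.execute(
                    "UPDATE capture_alarms SET clean_ticks = 0 WHERE id = ?",
                    (open_row[0],),
                )
                return "refreshed"
            last = conn.execute(
                "SELECT MAX(raised_at) FROM capture_alarms "
                "WHERE invariant_id = ?",
                (invariant_id,),
            ).fetchone()[0]
            if last is not None and now_ms - last < RATE_LIMIT_MS:
                return "rate_limited"
            conn.execute(
                "INSERT INTO capture_alarms "
                "(invariant_id, raised_at, details_json) VALUES (?, ?, ?)",
                (invariant_id, now_ms, json.dumps(details or {})),
            )
            return "raised"
        if open_row is None:
            return "clean"
        ticks = open_row[1] + 1
        resolved = now_ms if ticks >= CLEAN_TICKS_TO_RESOLVE else None
        conn.execute(
            "UPDATE capture_alarms SET clean_ticks = ?, resolved_at = ? "
            "WHERE id = ?",
            (ticks, resolved, open_row[0]),
        )
        return "resolved" if resolved is not None else "clean_tick"
```

In this sketch a violation that recurs shortly after auto-resolve is suppressed until the hour has elapsed, which is one plausible reading of the rate limit rather than a confirmed behavior.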

Invariants (I-1..I-16)

Asserted by the watchdog every 60 s. Each has a formal predicate in crates/hippo-daemon/src/watchdog.rs. Violations create or refresh a capture_alarms row; the doctor surfaces them with [!!] severity.

ID	Assertion	Threshold	Suppressed when	Backstop
I-1 Shell liveness	If user has an active zsh and `hippo.zsh` is sourced, shell/probe events must land within one probe cadence plus jitter grace.	7 min	No zsh process; HID idle > 5 min; night-hours window with no recent command.	Watchdog alarm + doctor `[!!] shell events`.
I-2 Claude-session end-to-end	For every Claude JSONL modified within the last 5 min, a matching `claude_sessions` row must exist.	5 min	No live JSONL.	Watchdog alarm naming each missing `session_id`.
I-3 Claude-tool concurrency	If a live JSONL has received a `tool_use` line within 5 min, at least one matching `events.source_kind='claude-tool'` row must exist in that window.	5 min	No live JSONL with recent `tool_use`.	Structured log only by default; opt-in alarm via `[watchdog] claude_tool_alarm = true`.
I-4 Browser round-trip	If extension heartbeat is recent (within 5 min cadence + grace) and Firefox is running, browser events must land within one probe cadence plus jitter grace.	7 min	Extension heartbeat absent or stale; Firefox not running (doctor suppresses).	Watchdog alarm + doctor `[!!] browser events` with connectivity state.
I-5 Drop visibility	Every event dropped (socket accept + crash, buffer overflow) increments a persistent counter. Zero tolerance for invisible drops.	every drop	—	OTel counter `hippo.daemon.events.dropped` (paired with `hippo.daemon.events.ingested`); see `crates/hippo-daemon/src/metrics.rs`.
I-6 Buffer non-saturation	Sustained drop rate over any 5 min sliding window ≤ 0.1% of total event traffic.	0.1% / 5 min	—	Watchdog alarm + doctor `[!!] drop-rate`.
I-7 Watchdog liveness	The watchdog itself writes to `source_health WHERE source='watchdog'` at least every 60 s.	180 s stale	—	Doctor only (a dead watchdog can’t alarm about itself).
I-8 Probe freshness	For each source with `probe_last_run_ts IS NOT NULL`: `probe_ok = 1` OR `probe_last_run_ts > now − 15 min`.	15 min	—	Watchdog alarm + doctor `[!!] <source> probe`.
I-9 Fallback file age	If any JSONL fallback file under `~/.local/share/hippo/` is > 24 h old AND the daemon socket is responsive, recovery is broken.	24 h	Daemon down (fallback drain happens at startup).	Doctor `[!!] fallback files`.
I-10 Capture/enrichment decoupling	Brain being down (HTTP 5xx/timeout) MUST NOT prevent `source_health` updates for capture sources. Architectural — verified via canary in CI, not at runtime.	—	—	Architectural enforcement; if violated, every other invariant becomes unreliable.
I-11 Opencode-session coverage	If `agentic-session-opencode.consecutive_failures > 3`, the poller is actively broken. Proxy predicate; full freshness check lives in `hippo doctor` which suppresses on idle opencode-DB mtime.	proxy	Bench pause window.	Watchdog alarm + doctor `[!!] agentic-session-opencode events`.
I-12 Brain preflight stuck	If `brain-preflight.consecutive_failures > 12` (≈ 1 minute at the brain’s 5 s poll), the inference backend has been unreachable for long enough to be a real outage. Motivating incident: silent `[lmstudio]` → `[inference]` config-section drift made the brain point at port 1234 forever with no alarm.	~1 min	—	Watchdog alarm. Doctor surfaces it via the existing `Brain inference backend: unreachable` line.
I-13 Codex-session coverage	If `agentic-session-codex.consecutive_failures > 3`, the Codex rollout poller is actively broken. Proxy predicate; full freshness check lives in `hippo doctor`.	proxy	Bench pause window.	Watchdog alarm + doctor `[!!] agentic-session-codex events`.
I-14 Embedding orphan backlog	Count of `knowledge_nodes` older than `reaper.orphan_stale_secs` with no row in the `knowledge_vectors` shadow table must stay ≤ `reaper.alarm_threshold`. A sustained backlog means the embedding orphan-reaper is down or wedged.	25 orphans (configurable)	Shadow table absent — fresh install, nothing embedded yet.	Watchdog alarm.
I-15 Cursor-session coverage	If `agentic-session-cursor.consecutive_failures > 3`, the Cursor poller is actively broken. Proxy predicate; full freshness in `hippo doctor`.	proxy	Bench pause window.	Watchdog alarm + doctor `[!!] agentic-session-cursor events`.
I-16 Agentic node dedup	No single `agentic_sessions` segment may carry more than one knowledge node (any type) with identical `(content, embed_text, node_type)`. A non-zero count means an enricher is minting duplicate nodes instead of replacing/reusing them (the historical KB-duplication class — see AP-13). Covers all node types now that every writer is guarded (agentic = write-time replacement; workflow/browser = write-time content dedup).	0 dup-groups (configurable via `[watchdog] dup_node_alarm_threshold`)	`knowledge_node_agentic_sessions` absent — fresh install.	Watchdog alarm. Remediate with `brain/scripts/dedup-knowledge-nodes.py`.
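
A few of the simpler predicates in the table can be written as pure functions. The sketch below transcribes the thresholds for I-6, I-8, I-11/I-13/I-15, and I-12 literally from the table; the SourceHealth dataclass, the function names, and the choice of dropped + ingested as the I-6 denominator are assumptions, and suppression rules are left out.

```python
from dataclasses import dataclass
from typing import Optional

FIFTEEN_MIN_MS = 15 * 60 * 1000


@dataclass
class SourceHealth:
    """Subset of a source_health row used by the predicates below."""
    source: str
    consecutive_failures: int = 0
    probe_ok: Optional[int] = None
    probe_last_run_ts: Optional[int] = None


def i6_drop_rate_holds(dropped: int, ingested: int) -> bool:
    """I-6: drops over a 5 min window stay at or below 0.1% of traffic."""
    total = dropped + ingested  # assumed definition of total traffic
    return total == 0 or dropped / total <= 0.001


def i8_probe_freshness_holds(row: SourceHealth, now_ms: int) -> bool:
    """I-8: probe_ok = 1 OR the probe ran within the last 15 min."""
    if row.probe_last_run_ts is None:
        return True  # the invariant only covers sources that are probed
    return row.probe_ok == 1 or row.probe_last_run_ts > now_ms - FIFTEEN_MIN_MS


def poller_proxy_holds(row: SourceHealth) -> bool:
    """I-11 / I-13 / I-15 proxy: more than 3 failures means a broken poller."""
    return row.consecutive_failures <= 3


def i12_brain_preflight_holds(row: SourceHealth) -> bool:
    """I-12: more than 12 failures (about 1 min at a 5 s poll) is an outage."""
    return row.consecutive_failures <= 12
```

Each function returns True when the invariant holds, so a watchdog tick in this style would raise an alarm on the first False.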

Probes

Synthetic round-trip verification, every 5 minutes per source.

Mechanism. A hippo probe --source <name> invocation generates a synthetic event tagged with a per-run UUID in probe_tag, then waits for it to appear in the source’s events table. Browser probes use the same per-run tag to bypass the normal browser URL/time-bucket dedup window. End-to-end latency is recorded in source_health.probe_lag_ms.
Where they live. Probe rows have probe_tag IS NOT NULL. Every user-facing query (RAG retrieval, MCP search_events / search_knowledge / get_entities, hippo events, hippo ask) filters them out at the daemon-side query path. A Semgrep rule blocks new query call-sites that omit the filter. (See anti-patterns.md AP-6.)
Per-source probe_ok definition. Shell: pgrep -x zsh is non-empty AND hippo.zsh is sourced AND HID idle < 5 min. Browser: the last synthetic hippo probe --source browser round-trip succeeded (written by the probe job). Extension connectivity is tracked separately via source_health.last_heartbeat_ts (NM heartbeats every 5 min); extension-side native-messaging failures are forwarded on the heartbeat as last_error_msg and surfaced by hippo doctor --explain. Claude-session: at least one JSONL under ~/.claude/projects has a recent mtime. The watchdog computes the shell and claude-session predicates on every cycle; browser I-4 uses heartbeat freshness, not probe_ok.
Manual probe. hippo probe --source <name> runs one cycle on demand. Useful when bringing a source back up after a configuration change.
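
The round-trip mechanism and the probe filter described above can be sketched together. This is an illustration under stated assumptions, not the code in crates/hippo-daemon/src/probe.rs: the run_probe and user_facing_events helpers, the simplified schema, and the send callback (a stand-in for a real capture path) are all hypothetical.

```python
import sqlite3
import time
import uuid

SCHEMA = """
CREATE TABLE events (
    id INTEGER PRIMARY KEY,
    source_kind TEXT NOT NULL,
    payload TEXT NOT NULL,
    probe_tag TEXT              -- NULL for real user events
);
CREATE TABLE source_health (
    source TEXT PRIMARY KEY,
    probe_ok INTEGER,
    probe_last_run_ts INTEGER,
    probe_lag_ms INTEGER
);
"""


def run_probe(conn, source, send, timeout_s=2.0, poll_s=0.01):
    """Send one tagged synthetic event and wait for it to land.

    Returns the measured lag in ms, or None if the event never appeared.
    """
    tag = str(uuid.uuid4())  # per-run tag, as described under Mechanism
    start = time.monotonic()
    send(source, tag)
    lag_ms = None
    while True:
        hit = conn.execute(
            "SELECT 1 FROM events WHERE source_kind = ? AND probe_tag = ?",
            (source, tag),
        ).fetchone()
        if hit is not None:
            lag_ms = int((time.monotonic() - start) * 1000)
            break
        if time.monotonic() - start >= timeout_s:
            break
        time.sleep(poll_s)
    with conn:
        conn.execute(
            """
            INSERT INTO source_health
                (source, probe_ok, probe_last_run_ts, probe_lag_ms)
            VALUES (?, ?, ?, ?)
            ON CONFLICT(source) DO UPDATE SET
                probe_ok = excluded.probe_ok,
                probe_last_run_ts = excluded.probe_last_run_ts,
                -- keep the lag of the most recent successful probe
                probe_lag_ms = COALESCE(excluded.probe_lag_ms, probe_lag_ms)
            """,
            (source, 1 if lag_ms is not None else 0,
             int(time.time() * 1000), lag_ms),
        )
    return lag_ms


def user_facing_events(conn):
    """Return event payloads with probe rows filtered out (AP-6)."""
    rows = conn.execute(
        "SELECT payload FROM events WHERE probe_tag IS NULL ORDER BY id"
    ).fetchall()
    return [r[0] for r in rows]
```

The `probe_tag IS NULL` clause is the part every user-facing query has to carry; the rest of the sketch only exists to produce a row that the filter must hide.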

Backstops

The system promises observability, not correctness. If something breaks, the goal is for the user to see it within minutes, not 21 days (the duration of an actual past silent browser-capture outage that motivated this architecture).

hippo doctor — interactive, < 2 s wall-clock, ten checks, exit code = fail count. --explain adds CAUSE/FIX/DOC per failure.
hippo alarms list — unacknowledged alarms; exits 1 if any.
macOS notification — opt-in via [watchdog] notify_macos = true. Rate-limited to one per invariant per hour.
OTel — every counter is a Prometheus metric when the otel feature is built (default-on). See otel/README.md for the local Grafana stack.
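
The doctor's "exit code = fail count" contract can be illustrated with a small check runner. Everything here is hypothetical scaffolding: the Check and Failure types, the run_doctor function, the `[ok]` marker, and the example check bodies are assumptions, not the real hippo doctor implementation. Only the `[!!]` marker and the CAUSE/FIX/DOC labels come from this document.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Failure:
    cause: str
    fix: str
    doc: str


@dataclass
class Check:
    name: str
    run: Callable[[], Optional[Failure]]  # None means the check passed


def run_doctor(checks: List[Check], explain: bool = False) -> Tuple[List[str], int]:
    """Run every check; return report lines and the number of failures."""
    lines: List[str] = []
    fails = 0
    for check in checks:
        failure = check.run()
        if failure is None:
            lines.append(f"[ok] {check.name}")  # assumed pass marker
            continue
        fails += 1
        lines.append(f"[!!] {check.name}")
        if explain:
            lines.append(f"     CAUSE: {failure.cause}")
            lines.append(f"     FIX:   {failure.fix}")
            lines.append(f"     DOC:   {failure.doc}")
    return lines, fails


# Example checks with made-up results.
checks = [
    Check("shell events", lambda: None),
    Check(
        "browser events",
        lambda: Failure(
            cause="no browser event within the freshness window",
            fix="run hippo probe --source browser",
            doc="operator-runbook.md",
        ),
    ),
]
report, exit_code = run_doctor(checks, explain=True)
```

A caller would pass `exit_code` to `sys.exit`, so a shell script can branch on how many checks failed.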