Capture Architecture

Reference for hippo’s capture-reliability stack: how events land, what the system promises, and what fires when something breaks. For per-source detail (what each source captures and where it lands), see sources.md. For first-aid recipes when something goes wrong, see operator-runbook.md. For review-blocker rules every contributor needs to internalize, see anti-patterns.md.

TL;DR

Every capture path writes two things in the same SQLite transaction: the event row and a source_health row. A background watchdog reads source_health once a minute, asserts sixteen named invariants, and writes alarms to capture_alarms on violations. A separate probe job sends synthetic events through each path every five minutes and records round-trip latency. Operators see all of this through hippo doctor and hippo alarms.

The four layers

+--------------------+       +-------------------+      +-------------+
|  capture path      |       |  source_health    |      | capture_    |
|  (per source)      | ----> |  (one row/source) | <----| alarms      |
|  writes event +    |       |                   |      |             |
|  health in same Tx |       |                   |      +-------------+
+--------------------+       +-------------------+              ^
                                       ^                        |
                                       |                        |
                              +-----------------+      +-----------------+
                              |  watchdog       |      |  doctor / CLI   |
                              |  every 60s,     |----> |  reads alarms,  |
                              |  asserts I-1..  |      |  shows status   |
                              |  I-16           |
                              +-----------------+
                                       ^
                                       |
                              +-----------------+
                              |  probe          |
                              |  every 5m,      |
                              |  synthetic      |
                              |  round-trip     |
                              +-----------------+
  1. Capture path — the per-source code that writes events. Shell hook → daemon socket. FSEvents watcher → daemon. Native messaging → daemon. Each path writes to its source’s events table AND to source_health in the same SQLite transaction. (See anti-patterns.md AP-1: writing health from inside the user’s interactive prompt is forbidden — health writes happen in the daemon’s flush_events, never in shell/hippo.zsh.)
  2. source_health table — one row per source, holds the latest “did the event land?” signal: last_event_ts, consecutive_failures, events_last_1h, probe_ok, probe_last_run_ts, probe_lag_ms. Single SQL ground truth.
  3. Watchdog (com.hippo.watchdog, every 60 s) — asserts sixteen invariants (most against source_health; I-14 against the knowledge-node vector store, I-16 against the knowledge-node/agentic-session join), writes capture_alarms rows on violations. Rate-limited per invariant (one alarm per invariant per hour). Implemented in crates/hippo-daemon/src/watchdog.rs.
  4. Probe (com.hippo.probe, every 5 minutes) — sends synthetic events through each capture path, measures end-to-end latency, records probe_lag_ms in source_health. Probe rows carry probe_tag IS NOT NULL and are filtered out of every user-facing query (RAG, MCP tools, hippo events). See crates/hippo-daemon/src/probe.rs. (See anti-patterns.md AP-6: probe rows must never appear in user-facing queries.)
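
The same-transaction rule from layer 1 can be sketched as follows. This is a minimal illustration in Python's stdlib sqlite3, not the daemon's flush_events; the events schema, the open_db helper, and the record_event function are hypothetical, and only the source_health column names come from this document.

```python
import sqlite3
import time

# Hypothetical, simplified schema. Only the source_health column names
# are taken from this document; everything else is an assumption.
SCHEMA = """
CREATE TABLE events (
    id INTEGER PRIMARY KEY,
    source TEXT NOT NULL,
    payload TEXT NOT NULL,
    ts INTEGER NOT NULL
);
CREATE TABLE source_health (
    source TEXT PRIMARY KEY,
    last_event_ts INTEGER,
    consecutive_failures INTEGER NOT NULL DEFAULT 0,
    events_last_1h INTEGER NOT NULL DEFAULT 0,
    updated_at INTEGER
);
"""


def open_db(path=":memory:"):
    """Open a database and create the illustrative tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_event(conn, source, payload, now_ms=None):
    """Write the event row and the health row in one transaction."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # `with conn` commits if the block succeeds and rolls back if it raises,
    # so the event row and the health row land together or not at all.
    with conn:
        conn.execute(
            "INSERT INTO events (source, payload, ts) VALUES (?, ?, ?)",
            (source, payload, now_ms),
        )
        conn.execute(
            """
            INSERT INTO source_health
                (source, last_event_ts, consecutive_failures,
                 events_last_1h, updated_at)
            VALUES (?, ?, 0, 1, ?)
            ON CONFLICT(source) DO UPDATE SET
                last_event_ts = excluded.last_event_ts,
                consecutive_failures = 0,
                events_last_1h = events_last_1h + 1,
                updated_at = excluded.updated_at
            """,
            (source, now_ms, now_ms),
        )
```

Because both statements share one transaction, a reader of source_health should never see a last_event_ts for an event that failed to commit.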

Operator interface: hippo doctor for a snapshot, hippo alarms for unacknowledged violations, hippo probe to run a one-off synthetic check.

Tables

source_health

One row per source. Updated in the same transaction as event writes; the watchdog reads it; the probe job updates probe_* columns.

Column | Type | Meaning
source | TEXT PK | Source name (one row per source; non-exhaustive): shell, claude-tool, agentic-session-claude, agentic-session-opencode, agentic-session-codex, agentic-session-cursor, browser, claude-session-watcher, workflow, watchdog, brain-preflight. (There is no probe row: the probe job writes probe_* columns onto each real source’s row, not a separate probe heartbeat.)
last_event_ts | INTEGER | Epoch ms of the most recent successful event write for this source.
consecutive_failures | INTEGER | Bumped on each failure; reset on success. Backstop for I-1, I-4 freshness alarms.
events_last_1h / _24h | INTEGER | Rolling counts. Maintained by the daemon: incremented per-write in crates/hippo-daemon/src/daemon.rs::flush_events, then periodically corrected by recompute_rolling_counts (same file, every 5 min) which overwrites them with fresh COUNT(*) queries against events / claude_sessions / browser_events. The watchdog reads these values; it does not compute them.
probe_ok | INTEGER | Last probe-job result: 1 = healthy, 0 = unhealthy. Source-specific definition (see “Probes” below).
probe_last_run_ts | INTEGER | When the probe last completed for this source.
probe_lag_ms | INTEGER | End-to-end latency of the most recent successful probe.
updated_at | INTEGER | Always bumped on any column update.
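
The increment-then-correct pattern described for events_last_1h can be sketched like this. It is an illustration under assumptions, not the real recompute_rolling_counts: the single events table with source and ts columns is a hypothetical stand-in for the actual events / claude_sessions / browser_events tables.

```python
import sqlite3

HOUR_MS = 60 * 60 * 1000

# Hypothetical, simplified schema for illustration only.
SCHEMA = """
CREATE TABLE events (
    id INTEGER PRIMARY KEY,
    source TEXT NOT NULL,
    ts INTEGER NOT NULL
);
CREATE TABLE source_health (
    source TEXT PRIMARY KEY,
    events_last_1h INTEGER NOT NULL DEFAULT 0,
    updated_at INTEGER
);
"""


def recompute_events_last_1h(conn, source, now_ms):
    """Overwrite the incrementally maintained counter with a fresh count.

    Per-write increments never decay as events age out of the window,
    so a periodic COUNT(*) is what pulls the counter back to the truth.
    """
    with conn:
        (fresh,) = conn.execute(
            "SELECT COUNT(*) FROM events WHERE source = ? AND ts > ?",
            (source, now_ms - HOUR_MS),
        ).fetchone()
        conn.execute(
            "UPDATE source_health SET events_last_1h = ?, updated_at = ? "
            "WHERE source = ?",
            (fresh, now_ms, source),
        )
    return fresh
```

The counter can drift between corrections, which is acceptable here because the watchdog only reads it as a health signal.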

capture_alarms

Append-only ledger of invariant violations. The watchdog writes rows; hippo alarms ack <id> sets acked_at and, optionally, ack_note.

Column | Meaning
id | PK
invariant_id | One of I-1..I-16
raised_at | First detection time (epoch ms)
details_json | Invariant-specific diagnostic context: affected source, since_ms, and per-invariant details
acked_at | NULL until hippo alarms ack <id>
ack_note | Optional note supplied at acknowledgment
resolved_at | Set once the invariant has stayed clean for 2 consecutive ticks
clean_ticks | Consecutive-clean tick count driving the auto-resolve loop
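
The auto-resolve loop driven by clean_ticks and resolved_at can be sketched as below. This is one plausible reading, not the logic in watchdog.rs: the observe function is hypothetical, and the choice to reset clean_ticks and refresh details_json when an open alarm is violated again is an assumption. The per-invariant hourly rate limit is not modeled.

```python
import sqlite3

# Column names follow the capture_alarms table; types are assumptions.
SCHEMA = """
CREATE TABLE capture_alarms (
    id INTEGER PRIMARY KEY,
    invariant_id TEXT NOT NULL,
    raised_at INTEGER NOT NULL,
    details_json TEXT NOT NULL,
    acked_at INTEGER,
    ack_note TEXT,
    resolved_at INTEGER,
    clean_ticks INTEGER NOT NULL DEFAULT 0
);
"""

CLEAN_TICKS_TO_RESOLVE = 2  # "clean for 2 consecutive ticks"


def observe(conn, invariant_id, violated, now_ms, details_json="{}"):
    """Apply one watchdog tick's result for a single invariant."""
    with conn:
        open_alarm = conn.execute(
            "SELECT id, clean_ticks FROM capture_alarms "
            "WHERE invariant_id = ? AND resolved_at IS NULL "
            "ORDER BY raised_at DESC LIMIT 1",
            (invariant_id,),
        ).fetchone()
        if violated:
            if open_alarm is None:
                # First detection: create the alarm row.
                conn.execute(
                    "INSERT INTO capture_alarms "
                    "(invariant_id, raised_at, details_json) VALUES (?, ?, ?)",
                    (invariant_id, now_ms, details_json),
                )
            else:
                # Assumption: a repeat violation refreshes the open row
                # and restarts the clean-tick count.
                conn.execute(
                    "UPDATE capture_alarms SET details_json = ?, clean_ticks = 0 "
                    "WHERE id = ?",
                    (details_json, open_alarm[0]),
                )
        elif open_alarm is not None:
            alarm_id, clean = open_alarm
            clean += 1
            resolved_at = now_ms if clean >= CLEAN_TICKS_TO_RESOLVE else None
            conn.execute(
                "UPDATE capture_alarms SET clean_ticks = ?, resolved_at = ? "
                "WHERE id = ?",
                (clean, resolved_at, alarm_id),
            )
```

Requiring two clean ticks rather than one keeps a flapping invariant from resolving and re-raising on alternate ticks.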

Invariants (I-1..I-16)

Asserted by the watchdog every 60 s. Each has a formal predicate in crates/hippo-daemon/src/watchdog.rs. Violations create or refresh a capture_alarms row; the doctor surfaces them with [!!] severity.

ID | Assertion | Threshold | Suppressed when | Backstop
I-1 Shell liveness | If user has an active zsh and hippo.zsh is sourced, shell/probe events must land within one probe cadence plus jitter grace. | 7 min | No zsh process; HID idle > 5 min; night-hours window with no recent command. | Watchdog alarm + doctor [!!] shell events.
I-2 Claude-session end-to-end | For every Claude JSONL with mtime < 5 min, a matching claude_sessions row must exist. | 5 min | No live JSONL. | Watchdog alarm naming each missing session_id.
I-3 Claude-tool concurrency | If a live JSONL has received a tool_use line within 5 min, at least one matching events.source_kind='claude-tool' row must exist in that window. | 5 min | No live JSONL with recent tool_use. | Structured log only by default; opt-in alarm via [watchdog] claude_tool_alarm = true.
I-4 Browser round-trip | If Firefox is up AND extension heartbeat is recent, browser/probe events must land within one probe cadence plus jitter grace. | 7 min | Firefox not running; extension heartbeat absent or stale. | Watchdog alarm + doctor [!!] browser events.
I-5 Drop visibility | Every event dropped (socket accept + crash, buffer overflow) increments a persistent counter. Zero tolerance for invisible drops. | every drop | - | OTel counter hippo.daemon.events.dropped (paired with hippo.daemon.events.ingested); see crates/hippo-daemon/src/metrics.rs.
I-6 Buffer non-saturation | Sustained drop rate over any 5 min sliding window ≤ 0.1% of total event traffic. | 0.1% / 5 min | - | Watchdog alarm + doctor [!!] drop-rate.
I-7 Watchdog liveness | The watchdog itself writes to source_health WHERE source='watchdog' at least every 60 s. | 180 s stale | - | Doctor only (a dead watchdog can’t alarm about itself).
I-8 Probe freshness | For each source with probe_last_run_ts IS NOT NULL: probe_ok = 1 OR probe_last_run_ts > now − 15 min. | 15 min | - | Watchdog alarm + doctor [!!] <source> probe.
I-9 Fallback file age | If any JSONL fallback file under ~/.local/share/hippo/ is > 24 h old AND the daemon socket is responsive, recovery is broken. | 24 h | Daemon down (fallback drain happens at startup). | Doctor [!!] fallback files.
I-10 Capture/enrichment decoupling | Brain being down (HTTP 5xx/timeout) MUST NOT prevent source_health updates for capture sources. Architectural: verified via canary in CI, not at runtime. | - | - | Architectural enforcement; if violated, every other invariant becomes unreliable.
I-11 Opencode-session coverage | If agentic-session-opencode.consecutive_failures > 3, the poller is actively broken. Proxy predicate; full freshness check lives in hippo doctor which suppresses on idle opencode-DB mtime. | proxy | Bench pause window. | Watchdog alarm + doctor [!!] agentic-session-opencode events.
I-12 Brain preflight stuck | If brain-preflight.consecutive_failures > 12 (≈ 1 minute at the brain’s 5 s poll), the inference backend has been unreachable for long enough to be a real outage. Motivating incident: silent [lmstudio][inference] config-section drift made the brain point at port 1234 forever with no alarm. | ~1 min | - | Watchdog alarm. Doctor surfaces it via the existing Brain inference backend: unreachable line.
I-13 Codex-session coverage | If agentic-session-codex.consecutive_failures > 3, the Codex rollout poller is actively broken. Proxy predicate; full freshness check lives in hippo doctor. | proxy | Bench pause window. | Watchdog alarm + doctor [!!] agentic-session-codex events.
I-14 Embedding orphan backlog | Count of knowledge_nodes older than reaper.orphan_stale_secs with no row in the knowledge_vectors shadow table must stay ≤ reaper.alarm_threshold. A sustained backlog means the embedding orphan-reaper is down or wedged. | 25 orphans (configurable) | Shadow table absent (fresh install, nothing embedded yet). | Watchdog alarm.
I-15 Cursor-session coverage | If agentic-session-cursor.consecutive_failures > 3, the Cursor poller is actively broken. Proxy predicate; full freshness in hippo doctor. | proxy | Bench pause window. | Watchdog alarm + doctor [!!] agentic-session-cursor events.
I-16 Agentic node dedup | No single agentic_sessions segment may carry more than one knowledge node (any type) with identical (content, embed_text, node_type). A non-zero count means an enricher is minting duplicate nodes instead of replacing/reusing them (the historical KB-duplication class; see AP-13). Covers all node types now that every writer is guarded (agentic = write-time replacement; workflow/browser = write-time content dedup). | 0 dup-groups (configurable via [watchdog] dup_node_alarm_threshold) | knowledge_node_agentic_sessions absent (fresh install). | Watchdog alarm. Remediate with brain/scripts/dedup-knowledge-nodes.py.
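
Two of the predicates above translate almost directly into queries against source_health. The sketch below encodes I-8 and the consecutive_failures proxies (I-11, I-12, I-13, I-15) exactly as the table states them. The function names and the simplified schema are hypothetical, and suppression (for example the bench pause window) is not modeled.

```python
import sqlite3

FIFTEEN_MIN_MS = 15 * 60 * 1000

# Hypothetical, simplified schema; column names follow source_health.
SCHEMA = """
CREATE TABLE source_health (
    source TEXT PRIMARY KEY,
    consecutive_failures INTEGER NOT NULL DEFAULT 0,
    probe_ok INTEGER,
    probe_last_run_ts INTEGER
);
"""

# invariant id -> (source row, failure count that must be exceeded)
PROXY_LIMITS = {
    "I-11": ("agentic-session-opencode", 3),
    "I-12": ("brain-preflight", 12),
    "I-13": ("agentic-session-codex", 3),
    "I-15": ("agentic-session-cursor", 3),
}


def probe_freshness_violations(conn, now_ms):
    """I-8: sources whose probe is both unhealthy and older than 15 min."""
    rows = conn.execute(
        "SELECT source FROM source_health "
        "WHERE probe_last_run_ts IS NOT NULL "
        "AND NOT (COALESCE(probe_ok, 0) = 1 OR probe_last_run_ts > ?) "
        "ORDER BY source",
        (now_ms - FIFTEEN_MIN_MS,),
    ).fetchall()
    return [row[0] for row in rows]


def proxy_violations(conn):
    """Invariant ids whose source exceeded its consecutive_failures limit."""
    hits = []
    for invariant_id, (source, limit) in sorted(PROXY_LIMITS.items()):
        row = conn.execute(
            "SELECT consecutive_failures FROM source_health WHERE source = ?",
            (source,),
        ).fetchone()
        if row is not None and row[0] > limit:
            hits.append(invariant_id)
    return hits
```

Note that I-8 as written passes a source whose last probe succeeded even if that probe is old; the sketch keeps that behavior rather than tightening it.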

Probes

Synthetic round-trip verification, every 5 minutes per source.
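
How a probe run might record its result, and how user-facing reads exclude probe rows, can be sketched as follows. This is a hedged illustration rather than the code in probe.rs: both function names and the events schema are hypothetical; only probe_tag and the probe_* column names come from this document.

```python
import sqlite3

# Hypothetical, simplified schema for illustration only.
SCHEMA = """
CREATE TABLE events (
    id INTEGER PRIMARY KEY,
    source TEXT NOT NULL,
    payload TEXT NOT NULL,
    ts INTEGER NOT NULL,
    probe_tag TEXT
);
CREATE TABLE source_health (
    source TEXT PRIMARY KEY,
    probe_ok INTEGER,
    probe_last_run_ts INTEGER,
    probe_lag_ms INTEGER,
    updated_at INTEGER
);
"""


def record_probe_result(conn, source, now_ms, lag_ms):
    """Store one probe outcome; lag_ms is None if the event never landed."""
    ok = 1 if lag_ms is not None else 0
    with conn:
        conn.execute(
            "UPDATE source_health SET probe_ok = ?, probe_last_run_ts = ?, "
            # Keep the previous lag on failure: probe_lag_ms is documented
            # as the latency of the most recent successful probe.
            "probe_lag_ms = COALESCE(?, probe_lag_ms), updated_at = ? "
            "WHERE source = ?",
            (ok, now_ms, lag_ms, now_ms, source),
        )


def user_facing_events(conn):
    """Return payloads of real events only; probe rows are filtered out."""
    rows = conn.execute(
        "SELECT payload FROM events WHERE probe_tag IS NULL ORDER BY ts"
    ).fetchall()
    return [row[0] for row in rows]
```

Putting the probe_tag IS NULL filter in one shared read helper, rather than repeating it per query, is one way to make the AP-6 rule hard to forget.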

Backstops

The system promises observability, not correctness. If something breaks, the goal is for the user to see it within minutes, not 21 days (the duration of an actual past silent browser-capture outage that motivated this architecture).

See also