Capture Architecture
Reference for hippo’s capture-reliability stack: how events land, what the system promises, and what fires when something breaks. For per-source detail (what each source captures and where it lands), see sources.md. For first-aid recipes when something goes wrong, see operator-runbook.md. For review-blocker rules every contributor needs to internalize, see anti-patterns.md.
TL;DR
Every capture path writes two things in the same SQLite transaction: the event row and a source_health row. A background watchdog reads source_health once a minute, asserts sixteen named invariants, and writes alarms to capture_alarms on violations. A separate probe job sends synthetic events through each path every five minutes and records round-trip latency. Operators see all of this through hippo doctor and hippo alarms.
The four layers
    +--------------------+       +-------------------+      +-------------+
    | capture path       |       | source_health     |      | capture_    |
    | (per source)       | ----> | (one row/source)  | <----| alarms      |
    | writes event +     |       |                   |      |             |
    | health in same Tx  |       |                   |      +-------------+
    +--------------------+       +-------------------+             ^
                                          ^                        |
                                          |                        |
                                 +-----------------+      +-----------------+
                                 | watchdog        |      | doctor / CLI    |
                                 | every 60s,      |----> | reads alarms,   |
                                 | asserts I-1..   |      | shows status    |
                                 | I-16            |      +-----------------+
                                 +-----------------+
                                          ^
                                          |
                                 +-----------------+
                                 | probe           |
                                 | every 5m,       |
                                 | synthetic       |
                                 | round-trip      |
                                 +-----------------+
- Capture path: the per-source code that writes events (shell hook → daemon socket, FSEvents watcher → daemon, native messaging → daemon). Each path writes to its source’s events table AND to source_health in the same SQLite transaction. (See anti-patterns.md AP-1: writing health from inside the user’s interactive prompt is forbidden; health writes happen in the daemon’s flush_events, never in shell/hippo.zsh.)
- source_health table: one row per source, holding the latest “did the event land?” signal: last_event_ts, consecutive_failures, events_last_1h, probe_ok, probe_last_run_ts, probe_lag_ms. Single SQL ground truth.
- Watchdog (com.hippo.watchdog, every 60 s): asserts sixteen invariants (most against source_health; I-14 against the knowledge-node vector store, I-16 against the knowledge-node/agentic-session join) and writes capture_alarms rows on violations. Rate-limited per invariant (one alarm per invariant per hour). Implemented in crates/hippo-daemon/src/watchdog.rs.
- Probe (com.hippo.probe, every 5 minutes): sends synthetic events through each capture path, measures end-to-end latency, and records probe_lag_ms in source_health. Probe rows carry probe_tag IS NOT NULL and are filtered out of every user-facing query (RAG, MCP tools, hippo events). See crates/hippo-daemon/src/probe.rs. (See anti-patterns.md AP-6: probe rows must never appear in user-facing queries.)
Operator interface: hippo doctor for a snapshot, hippo alarms for unacknowledged violations, hippo probe to run a one-off synthetic check.
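The same-transaction rule for the capture path can be illustrated with a minimal Python sketch. This is not the daemon's Rust implementation: the events columns (payload, ts) and the helper names open_db and record_event are hypothetical, and only the source_health columns named in this document are used. The point it shows is that the event row and the health row commit or roll back together.

```python
import sqlite3
import time

# Hypothetical, simplified schema: only the columns needed for the illustration.
SCHEMA = """
CREATE TABLE events (
    id          INTEGER PRIMARY KEY,
    source_kind TEXT NOT NULL,
    payload     TEXT NOT NULL,    -- hypothetical column
    ts          INTEGER NOT NULL  -- epoch ms
);
CREATE TABLE source_health (
    source               TEXT PRIMARY KEY,
    last_event_ts        INTEGER,
    consecutive_failures INTEGER NOT NULL DEFAULT 0,
    events_last_1h       INTEGER NOT NULL DEFAULT 0,
    updated_at           INTEGER
);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def record_event(conn: sqlite3.Connection, source: str, payload: str) -> int:
    """Write the event row and its health row in one transaction."""
    ts = int(time.time() * 1000)
    with conn:  # commits on success, rolls back if either statement raises
        conn.execute(
            "INSERT INTO events (source_kind, payload, ts) VALUES (?, ?, ?)",
            (source, payload, ts),
        )
        # Upsert the single health row for this source: success resets the
        # failure counter and bumps the rolling count.
        conn.execute(
            """
            INSERT INTO source_health
                (source, last_event_ts, consecutive_failures, events_last_1h, updated_at)
            VALUES (?, ?, 0, 1, ?)
            ON CONFLICT(source) DO UPDATE SET
                last_event_ts        = excluded.last_event_ts,
                consecutive_failures = 0,
                events_last_1h       = events_last_1h + 1,
                updated_at           = excluded.updated_at
            """,
            (source, ts, ts),
        )
    return ts
```

Because both statements share one transaction, a failed event insert leaves source_health untouched, so the health row never claims an event that did not land.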
Tables
source_health
One row per source. Updated in the same transaction as event writes; the watchdog reads it; the probe job updates probe_* columns.
| Column | Type | Meaning |
|---|---|---|
| source | TEXT PK | Source name (one row per source; non-exhaustive): shell, claude-tool, agentic-session-claude, agentic-session-opencode, agentic-session-codex, agentic-session-cursor, browser, claude-session-watcher, workflow, watchdog, brain-preflight. (There is no probe row: the probe job writes probe_* columns onto each real source’s row, not a separate probe heartbeat.) |
| last_event_ts | INTEGER | Epoch ms of the most recent successful event write for this source. |
| consecutive_failures | INTEGER | Bumped on each failure; reset on success. Backstop for I-1, I-4 freshness alarms. |
| events_last_1h / _24h | INTEGER | Rolling counts. Maintained by the daemon: incremented per-write in crates/hippo-daemon/src/daemon.rs::flush_events, then periodically corrected by recompute_rolling_counts (same file, every 5 min) which overwrites them with fresh COUNT(*) queries against events / claude_sessions / browser_events. The watchdog reads these values; it does not compute them. |
| probe_ok | INTEGER | Last probe-job result: 1 = healthy, 0 = unhealthy. Source-specific definition (see “Probes” below). |
| probe_last_run_ts | INTEGER | When the probe last completed for this source. |
| probe_lag_ms | INTEGER | End-to-end latency of the most recent successful probe. |
| updated_at | INTEGER | Always bumped on any column update. |
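The increment-then-correct scheme for the rolling counts can be sketched as follows. This is an illustration in Python, not the Rust recompute_rolling_counts: it assumes a single events table with hypothetical source_kind and ts columns, whereas the real job also counts against claude_sessions and browser_events.

```python
import sqlite3

HOUR_MS = 60 * 60 * 1000


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE events (
            id          INTEGER PRIMARY KEY,
            source_kind TEXT NOT NULL,
            ts          INTEGER NOT NULL  -- epoch ms (hypothetical column)
        );
        CREATE TABLE source_health (
            source          TEXT PRIMARY KEY,
            events_last_1h  INTEGER NOT NULL DEFAULT 0,
            events_last_24h INTEGER NOT NULL DEFAULT 0,
            updated_at      INTEGER
        );
    """)
    return conn


def recompute_rolling_counts(conn: sqlite3.Connection, now_ms: int) -> None:
    """Overwrite the incrementally maintained counters with fresh COUNT(*) values."""
    with conn:
        conn.execute(
            """
            UPDATE source_health SET
                events_last_1h  = (SELECT COUNT(*) FROM events
                                   WHERE source_kind = source_health.source AND ts > ?),
                events_last_24h = (SELECT COUNT(*) FROM events
                                   WHERE source_kind = source_health.source AND ts > ?),
                updated_at      = ?
            """,
            (now_ms - HOUR_MS, now_ms - 24 * HOUR_MS, now_ms),
        )
```

The per-write increment keeps the counters cheap to maintain, but they can only grow; the periodic overwrite is what lets events age out of the 1 h and 24 h windows and corrects any drift.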
capture_alarms
Append-only ledger of invariant violations. The watchdog writes; hippo alarms ack flips the acknowledgment flag.
| Column | Meaning |
|---|---|
| id | PK |
| invariant_id | One of I-1 … I-16 |
| raised_at | First detection time (epoch ms) |
| details_json | Invariant-specific diagnostic context: affected source, since_ms, and per-invariant details |
| acked_at | NULL until hippo alarms ack <id> |
| ack_note | Optional note supplied at acknowledgment |
| resolved_at | Set once the invariant has stayed clean for 2 consecutive ticks |
| clean_ticks | Consecutive-clean tick count driving the auto-resolve loop |
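The alarm lifecycle implied by these columns (raise, refresh, auto-resolve after 2 clean ticks, at most one new alarm per invariant per hour) can be sketched in Python. This is one plausible reading, not the logic in watchdog.rs: the function name watchdog_tick, its return labels, and the exact way the rate limit interacts with a recently resolved alarm are assumptions.

```python
import json
import sqlite3

HOUR_MS = 60 * 60 * 1000
CLEAN_TICKS_TO_RESOLVE = 2  # "stayed clean for 2 consecutive ticks"


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE capture_alarms (
            id           INTEGER PRIMARY KEY,
            invariant_id TEXT NOT NULL,
            raised_at    INTEGER NOT NULL,
            details_json TEXT,
            acked_at     INTEGER,
            ack_note     TEXT,
            resolved_at  INTEGER,
            clean_ticks  INTEGER NOT NULL DEFAULT 0
        )
    """)
    return conn


def watchdog_tick(conn, invariant_id, violated, now_ms, details=None):
    """Apply one watchdog tick for one invariant and report what happened."""
    with conn:
        open_alarm = conn.execute(
            "SELECT id, clean_ticks FROM capture_alarms "
            "WHERE invariant_id = ? AND resolved_at IS NULL",
            (invariant_id,),
        ).fetchone()
        if violated:
            if open_alarm:
                # Still violated: restart the clean-tick countdown.
                conn.execute(
                    "UPDATE capture_alarms SET clean_ticks = 0 WHERE id = ?",
                    (open_alarm[0],),
                )
                return "refreshed"
            last = conn.execute(
                "SELECT MAX(raised_at) FROM capture_alarms WHERE invariant_id = ?",
                (invariant_id,),
            ).fetchone()[0]
            if last is not None and now_ms - last < HOUR_MS:
                return "suppressed"  # assumed rate limit: one new alarm per hour
            conn.execute(
                "INSERT INTO capture_alarms (invariant_id, raised_at, details_json) "
                "VALUES (?, ?, ?)",
                (invariant_id, now_ms, json.dumps(details or {})),
            )
            return "raised"
        if open_alarm is None:
            return "clean"
        ticks = open_alarm[1] + 1
        if ticks >= CLEAN_TICKS_TO_RESOLVE:
            conn.execute(
                "UPDATE capture_alarms SET clean_ticks = ?, resolved_at = ? WHERE id = ?",
                (ticks, now_ms, open_alarm[0]),
            )
            return "resolved"
        conn.execute(
            "UPDATE capture_alarms SET clean_ticks = ? WHERE id = ?",
            (ticks, open_alarm[0]),
        )
        return "clean-tick"
```

Requiring two clean ticks before resolving keeps a flapping invariant from closing and reopening alarms on alternate minutes; acknowledgment (acked_at, ack_note) is left to the operator and is not touched here.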
Invariants (I-1..I-16)
Asserted by the watchdog every 60 s. Each has a formal predicate in crates/hippo-daemon/src/watchdog.rs. Violations create or refresh a capture_alarms row; the doctor surfaces them with [!!] severity.
| ID | Assertion | Threshold | Suppressed when | Backstop |
|---|---|---|---|---|
| I-1 Shell liveness | If user has an active zsh and hippo.zsh is sourced, shell/probe events must land within one probe cadence plus jitter grace. | 7 min | No zsh process; HID idle > 5 min; night-hours window with no recent command. | Watchdog alarm + doctor [!!] shell events. |
| I-2 Claude-session end-to-end | For every Claude JSONL modified within the last 5 min (mtime age < 5 min), a matching claude_sessions row must exist. | 5 min | No live JSONL. | Watchdog alarm naming each missing session_id. |
| I-3 Claude-tool concurrency | If a live JSONL has received a tool_use line within 5 min, at least one matching events.source_kind='claude-tool' row must exist in that window. | 5 min | No live JSONL with recent tool_use. | Structured log only by default; opt-in alarm via [watchdog] claude_tool_alarm = true. |
| I-4 Browser round-trip | If Firefox is up AND extension heartbeat is recent, browser/probe events must land within one probe cadence plus jitter grace. | 7 min | Firefox not running; extension heartbeat absent or stale. | Watchdog alarm + doctor [!!] browser events. |
| I-5 Drop visibility | Every event dropped (socket accept + crash, buffer overflow) increments a persistent counter. Zero tolerance for invisible drops. | every drop | — | OTel counter hippo.daemon.events.dropped (paired with hippo.daemon.events.ingested); see crates/hippo-daemon/src/metrics.rs. |
| I-6 Buffer non-saturation | Sustained drop rate over any 5 min sliding window ≤ 0.1% of total event traffic. | 0.1% / 5 min | — | Watchdog alarm + doctor [!!] drop-rate. |
| I-7 Watchdog liveness | The watchdog itself writes to source_health WHERE source='watchdog' at least every 60 s. | 180 s stale | — | Doctor only (a dead watchdog can’t alarm about itself). |
| I-8 Probe freshness | For each source with probe_last_run_ts IS NOT NULL: probe_ok = 1 OR probe_last_run_ts > now − 15 min. | 15 min | — | Watchdog alarm + doctor [!!] <source> probe. |
| I-9 Fallback file age | If any JSONL fallback file under ~/.local/share/hippo/ is > 24 h old AND the daemon socket is responsive, recovery is broken. | 24 h | Daemon down (fallback drain happens at startup). | Doctor [!!] fallback files. |
| I-10 Capture/enrichment decoupling | Brain being down (HTTP 5xx/timeout) MUST NOT prevent source_health updates for capture sources. Architectural — verified via canary in CI, not at runtime. | — | — | Architectural enforcement; if violated, every other invariant becomes unreliable. |
| I-11 Opencode-session coverage | If agentic-session-opencode.consecutive_failures > 3, the poller is actively broken. Proxy predicate; full freshness check lives in hippo doctor which suppresses on idle opencode-DB mtime. | proxy | Bench pause window. | Watchdog alarm + doctor [!!] agentic-session-opencode events. |
| I-12 Brain preflight stuck | If brain-preflight.consecutive_failures > 12 (≈ 1 minute at the brain’s 5 s poll), the inference backend has been unreachable for long enough to be a real outage. Motivating incident: silent [lmstudio] → [inference] config-section drift made the brain point at port 1234 forever with no alarm. | ~1 min | — | Watchdog alarm. Doctor surfaces it via the existing Brain inference backend: unreachable line. |
| I-13 Codex-session coverage | If agentic-session-codex.consecutive_failures > 3, the Codex rollout poller is actively broken. Proxy predicate; full freshness check lives in hippo doctor. | proxy | Bench pause window. | Watchdog alarm + doctor [!!] agentic-session-codex events. |
| I-14 Embedding orphan backlog | Count of knowledge_nodes older than reaper.orphan_stale_secs with no row in the knowledge_vectors shadow table must stay ≤ reaper.alarm_threshold. A sustained backlog means the embedding orphan-reaper is down or wedged. | 25 orphans (configurable) | Shadow table absent — fresh install, nothing embedded yet. | Watchdog alarm. |
| I-15 Cursor-session coverage | If agentic-session-cursor.consecutive_failures > 3, the Cursor poller is actively broken. Proxy predicate; full freshness in hippo doctor. | proxy | Bench pause window. | Watchdog alarm + doctor [!!] agentic-session-cursor events. |
| I-16 Agentic node dedup | No single agentic_sessions segment may carry more than one knowledge node (any type) with identical (content, embed_text, node_type). A non-zero count means an enricher is minting duplicate nodes instead of replacing/reusing them (the historical KB-duplication class — see AP-13). Covers all node types now that every writer is guarded (agentic = write-time replacement; workflow/browser = write-time content dedup). | 0 dup-groups (configurable via [watchdog] dup_node_alarm_threshold) | knowledge_node_agentic_sessions absent — fresh install. | Watchdog alarm. Remediate with brain/scripts/dedup-knowledge-nodes.py. |
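Several of the predicates above reduce to short queries against source_health. The Python sketch below restates I-7, I-8, and the consecutive_failures proxies (I-11, I-12, I-13, I-15) using the thresholds from the table. The function names are hypothetical, suppression rules such as the bench pause window are omitted, and using updated_at as the watchdog heartbeat column is an assumption.

```python
import sqlite3

MIN_MS = 60 * 1000

# (source, threshold) per invariant; the predicate is consecutive_failures > threshold.
FAILURE_PROXIES = {
    "I-11": ("agentic-session-opencode", 3),
    "I-12": ("brain-preflight", 12),  # 12 failed polls x 5 s = 60 s, about 1 minute
    "I-13": ("agentic-session-codex", 3),
    "I-15": ("agentic-session-cursor", 3),
}


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE source_health (
            source               TEXT PRIMARY KEY,
            consecutive_failures INTEGER NOT NULL DEFAULT 0,
            probe_ok             INTEGER,
            probe_last_run_ts    INTEGER,
            updated_at           INTEGER
        )
    """)
    return conn


def i7_watchdog_stale(conn, now_ms):
    """I-7: the watchdog row is missing or older than 180 s (evaluated by the doctor)."""
    row = conn.execute(
        "SELECT updated_at FROM source_health WHERE source = 'watchdog'"
    ).fetchone()
    return row is None or row[0] is None or now_ms - row[0] > 180 * 1000


def i8_probe_violations(conn, now_ms):
    """I-8: probed sources where NOT (probe_ok = 1 OR last run within 15 min)."""
    rows = conn.execute(
        """
        SELECT source FROM source_health
        WHERE probe_last_run_ts IS NOT NULL
          AND NOT (COALESCE(probe_ok, 0) = 1 OR probe_last_run_ts > ?)
        ORDER BY source
        """,
        (now_ms - 15 * MIN_MS,),
    ).fetchall()
    return [r[0] for r in rows]


def failure_proxy_violations(conn):
    """Invariant IDs whose source has exceeded its consecutive_failures threshold."""
    violated = []
    for invariant_id, (source, threshold) in sorted(FAILURE_PROXIES.items()):
        row = conn.execute(
            "SELECT consecutive_failures FROM source_health WHERE source = ?",
            (source,),
        ).fetchone()
        if row is not None and row[0] > threshold:
            violated.append(invariant_id)
    return violated
```

Note that the thresholds are strict inequalities: a source sitting at exactly 3 consecutive failures does not yet trip I-11, I-13, or I-15.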
Probes
Synthetic round-trip verification, every 5 minutes per source.
- Mechanism. A hippo probe --source <name> invocation generates a synthetic event tagged with a per-run UUID in probe_tag, then waits for it to appear in the source’s events table. Browser probes use the same per-run tag to bypass the normal browser URL/time-bucket dedup window. End-to-end latency is recorded in source_health.probe_lag_ms.
- Where they live. Probe rows have probe_tag IS NOT NULL. Every user-facing query (RAG retrieval, MCP search_events/search_knowledge/get_entities, hippo events, hippo ask) filters them out at the daemon-side query path. A Semgrep rule blocks new query call-sites that omit the filter. (See anti-patterns.md AP-6.)
- Per-source probe_ok definition. For shell: pgrep -x zsh non-empty AND hippo.zsh sourced AND HID idle < 5 min. For browser: Firefox running AND extension heartbeat fresh. For claude-session: at least one JSONL under ~/.claude/projects with recent mtime. The watchdog computes these on every cycle.
- Manual probe. hippo probe --source <name> runs one cycle on demand. Useful when bringing a source back up after a configuration change.
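The probe mechanism and the user-facing filter can be sketched together in Python. This is a simplified illustration, not probe.rs: deliver is a hypothetical callable standing in for the real capture path, the events columns are invented for the example, and probe_ok is set purely from the round-trip result rather than from the per-source definitions listed above.

```python
import sqlite3
import time
import uuid


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE events (
            id          INTEGER PRIMARY KEY,
            source_kind TEXT NOT NULL,
            payload     TEXT NOT NULL,  -- hypothetical column
            probe_tag   TEXT            -- NULL for real events
        );
        CREATE TABLE source_health (
            source            TEXT PRIMARY KEY,
            probe_ok          INTEGER,
            probe_last_run_ts INTEGER,
            probe_lag_ms      INTEGER,
            updated_at        INTEGER
        );
    """)
    return conn


def run_probe(conn, source, deliver, timeout_s=2.0, poll_s=0.01):
    """Send one tagged synthetic event and wait for it to land."""
    tag = str(uuid.uuid4())  # per-run tag, unique to this probe cycle
    started = time.monotonic()
    deliver(source, tag)
    landed = False
    while True:
        hit = conn.execute(
            "SELECT 1 FROM events WHERE source_kind = ? AND probe_tag = ?",
            (source, tag),
        ).fetchone()
        if hit:
            landed = True
            break
        if time.monotonic() - started >= timeout_s:
            break
        time.sleep(poll_s)
    lag_ms = int((time.monotonic() - started) * 1000)
    now_ms = int(time.time() * 1000)
    with conn:  # assumes the source_health row for this source already exists
        if landed:
            conn.execute(
                "UPDATE source_health SET probe_ok = 1, probe_last_run_ts = ?, "
                "probe_lag_ms = ?, updated_at = ? WHERE source = ?",
                (now_ms, lag_ms, now_ms, source),
            )
        else:
            # probe_lag_ms keeps the latency of the last successful probe.
            conn.execute(
                "UPDATE source_health SET probe_ok = 0, probe_last_run_ts = ?, "
                "updated_at = ? WHERE source = ?",
                (now_ms, now_ms, source),
            )
    return landed


def user_facing_events(conn):
    """Every user-facing query carries the probe filter (AP-6)."""
    rows = conn.execute(
        "SELECT payload FROM events WHERE probe_tag IS NULL ORDER BY id"
    ).fetchall()
    return [r[0] for r in rows]
```

Tagging each run with a fresh UUID means the probe only ever matches its own event, so a stale row from an earlier cycle cannot make a broken path look healthy.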
Backstops
The system promises observability, not correctness. If something breaks, the goal is for the user to see it within minutes, not 21 days (the duration of an actual past silent browser-capture outage that motivated this architecture).
- hippo doctor: interactive, < 2 s wall-clock, ten checks, exit code = fail count. --explain adds CAUSE/FIX/DOC per failure.
- hippo alarms list: unacknowledged alarms; exits 1 if any.
- macOS notification: opt-in via [watchdog] notify_macos = true. Rate-limited to one per invariant per hour.
- OTel: every counter is a Prometheus metric when the otel feature is built (default-on). See otel/README.md for the local Grafana stack.
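The "exit code = fail count" contract of hippo doctor can be illustrated with a small check runner. This is a hypothetical Python sketch of the convention only: run_checks and the example check names are invented, and the [OK] marker is an assumption (only [!!] appears in this document).

```python
from typing import Callable, List, Tuple

# A check is a (name, callable) pair; the callable returns True when healthy.
Check = Tuple[str, Callable[[], bool]]


def run_checks(checks: List[Check]) -> int:
    """Run every check, print one status line each, and return the failure count."""
    failures = 0
    for name, check in checks:
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a check that crashes counts as a failure
        print(f"[{'OK' if ok else '!!'}] {name}")
        if not ok:
            failures += 1
    return failures
```

Passing the return value to sys.exit gives a process exit status equal to the number of failed checks, so 0 means fully healthy and a calling script can branch on the status without parsing any output.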
See also
- docs/lifecycle.md: end-to-end trace of how each event type becomes a knowledge node, with diagnostic SQL recipes.
- sources.md: what each source captures, where it lands, what fires.
- anti-patterns.md: AP-1..AP-12, the review blockers.
- operator-runbook.md: doctor recipes, alarm responses, recovery flows.
- test-matrix.md: failure-mode-to-test mapping and the contract for adding new tests.
- docs/archive/capture-reliability-overhaul/: historical design records (P0–P3 overhaul, post-mortems).