Adding a New Capture Source

The contract for plugging a new event source into hippo. Companion to architecture.md (the system reference) and sources.md (per-source coverage). For the rules that constrain how you implement these steps, see anti-patterns.md — every step below points back at one of AP-1..AP-12.

The bar for “it ships”: hippo doctor is green, the watchdog asserts at least one freshness invariant for the new source, and the source-audit and test-matrix tables both have rows naming a regression test.

Before you start

Adding a source is a multi-week piece of work, not a weekend project. The steps below are deliberately exhaustive because every step is a place a previous source’s bug came from. If you’re trying to plumb something through quickly, ask whether the data is already available via an existing source:

| Want to capture | Already covered by | If you still need a new source |
| --- | --- | --- |
| Bash history | Shell hook (hippo.zsh); a bash equivalent is welcome as a small variant rather than a new source kind | A bash version of hippo.zsh belongs in shell/ and reuses the daemon socket; no new source_kind needed unless metadata diverges |
| macOS Notification Center events | Not yet | New source: full contract below |
| Cursor Agent CLI sessions | Covered: com.hippo.cursor-session Rust poller (cursor_session.rs); see docs/superpowers/specs/2026-05-25-cursor-ingestion-design.md | Aider is a separate tool and would require its own source if the JSONL shape diverges |
| iMessage / Slack messages | No, and probably should not be added: the privacy footprint exceeds redaction’s reach | Don’t |

If after that filter you still want to add a source, read on.

The contract

Every new source must implement all of these. Skipping any one is a known-bug shape.

1. Source identity

Two distinct identifiers are involved; pick a value for each.

events.source_kind distinguishes rows in the events table. Today’s set: 'shell' and 'claude-tool' (rows where tool_name IS NOT NULL). Pick a kebab-case value if your source writes into the events table. Sources with their own table (browser, claude-session) don’t need a source_kind value.

source_health.source is the watchdog’s per-source heartbeat key. Today’s set: 'shell', 'claude-tool', 'agentic-session-claude', 'agentic-session-opencode', 'agentic-session-codex', 'agentic-session-cursor', 'browser', 'claude-session-watcher', 'watchdog', 'brain-preflight'. (Note: there is no 'probe' row — probes write probe_ok / probe_lag_ms / probe_last_run_ts onto the real source’s row, not into a separate probe heartbeat.)

The new value goes in:

2. Schema migration

Bump EXPECTED_VERSION in storage.rs AND EXPECTED_SCHEMA_VERSION in brain/src/hippo_brain/schema_version.py (they must agree — see docs/schema.md).

The migration block in storage.rs::open_db does at minimum:

-- Seed a source_health row so the watchdog has something to assert against
INSERT OR IGNORE INTO source_health (source, last_event_ts, updated_at)
VALUES ('<your-source-kind>', NULL, unixepoch('now') * 1000);

If your source needs a dedicated table (browser-style, with extracted text + dwell), add it here too. Match the existing patterns: PK id INTEGER PRIMARY KEY, created_at INTEGER NOT NULL DEFAULT (unixepoch('now', 'subsec') * 1000), dedup column envelope_id TEXT with a unique index WHERE envelope_id IS NOT NULL.

If your source events flow through the existing events table (shell/claude-tool style), no new table is needed; just source_kind = '<your-source-kind>'.
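The migration rules above can be sketched as follows. This is an illustration in Python's sqlite3, not the Rust code in storage.rs. The table name my_source_events, the payload column, the source value 'my-source', and the reduced source_health layout are all assumptions made for the example. The timestamp is computed in Python so the sketch does not depend on the SQLite version shipping unixepoch().

```python
import sqlite3
import time

SOURCE = "my-source"  # hypothetical source_health key


def migrate(conn: sqlite3.Connection) -> None:
    """Idempotent sketch: dedicated table, partial unique index, seeded health row."""
    now_ms = int(time.time() * 1000)
    conn.execute("BEGIN")
    try:
        conn.execute(
            """CREATE TABLE IF NOT EXISTS my_source_events (
                   id INTEGER PRIMARY KEY,
                   created_at INTEGER NOT NULL,
                   envelope_id TEXT,
                   payload TEXT NOT NULL
               )"""
        )
        # Dedup only applies to rows that actually carry an envelope_id.
        conn.execute(
            """CREATE UNIQUE INDEX IF NOT EXISTS idx_my_source_envelope
               ON my_source_events (envelope_id)
               WHERE envelope_id IS NOT NULL"""
        )
        # Seed row: INSERT OR IGNORE makes re-running the migration harmless.
        conn.execute(
            "INSERT OR IGNORE INTO source_health (source, last_event_ts, updated_at) "
            "VALUES (?, NULL, ?)",
            (SOURCE, now_ms),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


# isolation_level=None puts the connection in autocommit mode so the explicit
# BEGIN/COMMIT above controls the transaction, DDL included.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE source_health ("
    "source TEXT PRIMARY KEY, last_event_ts INTEGER, updated_at INTEGER NOT NULL)"
)
migrate(conn)
migrate(conn)  # second run must not duplicate the seed row
seed_rows = conn.execute(
    "SELECT COUNT(*) FROM source_health WHERE source = ?", (SOURCE,)
).fetchone()[0]
```

Running the migration twice leaves exactly one seed row, which is the property the watchdog relies on.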

Add a row to docs/schema.md’s changelog. Add a doc note about what the new source captures.

3. Capture path

Where the actual writes happen. The contract is two writes in one SQLite transaction — the event row AND a source_health UPDATE — so the watchdog sees source health in lockstep with the event landing.
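A minimal sketch of the two-write contract, again in Python's sqlite3 rather than the daemon's Rust. The events columns shown here (payload, created_at) and the choice to use the same string for source_kind and the source_health key are assumptions for illustration. The point is only that the event row and the health update commit or roll back together.

```python
import sqlite3
import time


def record_event(conn: sqlite3.Connection, source: str, payload: str) -> None:
    """Write the event row and the source_health update in one transaction."""
    now_ms = int(time.time() * 1000)
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute(
            "INSERT INTO events (source_kind, payload, created_at) VALUES (?, ?, ?)",
            (source, payload, now_ms),
        )
        cur = conn.execute(
            "UPDATE source_health SET last_event_ts = ?, updated_at = ? WHERE source = ?",
            (now_ms, now_ms, source),
        )
        if cur.rowcount != 1:
            # No seeded health row: fail loudly instead of landing an
            # event the watchdog cannot see.
            raise RuntimeError(f"no source_health row seeded for {source!r}")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


conn = sqlite3.connect(":memory:", isolation_level=None)  # explicit transactions
conn.execute(
    "CREATE TABLE events ("
    "id INTEGER PRIMARY KEY, source_kind TEXT NOT NULL, "
    "payload TEXT NOT NULL, created_at INTEGER NOT NULL)"
)
conn.execute(
    "CREATE TABLE source_health ("
    "source TEXT PRIMARY KEY, last_event_ts INTEGER, updated_at INTEGER NOT NULL)"
)
conn.execute("INSERT INTO source_health VALUES ('my-source', NULL, 0)")

record_event(conn, "my-source", "hello")
try:
    record_event(conn, "unseeded-source", "lost")
except RuntimeError:
    pass  # the event insert was rolled back along with the failed health update

event_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
last_ts = conn.execute(
    "SELECT last_event_ts FROM source_health WHERE source = 'my-source'"
).fetchone()[0]
```

After both calls, only the first event exists and its source_health row carries a timestamp, so the watchdog never observes an event without matching health.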

For sources that flow through the daemon’s existing flush_events path (recommended for anything that can speak the Unix socket protocol):

For sources that watch a file system path (claude-session-style):

For sources that poll a remote API (workflow-runs style):

AP-1 forbids doing the source_health write from inside the user-facing capture call site (e.g. inside the shell hook). Health writes must happen on the daemon side, in the same SQLite transaction as the event row, after the event is buffered.

AP-2 forbids coupling capture-side health to enrichment health. Your source_health row must NOT track LM Studio reachability, brain-process state, or queue depth.

AP-11 forbids .filter_map(Result::ok) or .ok().unwrap_or_default() in any write path. Every error gets a warn! log and a counter bump.

4. Redaction (if applicable)

If your source captures user-typed content (not metadata), it must run through crates/hippo-core/src/redaction.rs::RedactionEngine before storage. See docs/redaction.md for what the engine does. Browser content goes through redaction; URLs go through strip_sensitive_params separately.

Sources that capture only metadata (workflow-runs is structural, not user-typed) can skip this. Document the reasoning if you skip.

5. Probe coverage

The synthetic-probe job runs every 5 minutes and exercises every probed source. To add probe support:

If your source genuinely cannot be probed (e.g., a one-shot import path, or a source that depends on user activity that synthetic probes can’t fake), add an explicit “probe-exempt” entry in sources.md with a one-sentence rationale.

AP-6 forbids letting probe rows appear in user-facing queries. Every query path against the source’s table must include AND probe_tag IS NULL. The Semgrep rule blocks new query call-sites that omit it.
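As an illustration of the AP-6 filter, the sketch below builds a throwaway events table and shows a user-facing query that excludes probe rows. The command column and the probe tag value are assumptions; only the probe_tag IS NULL predicate comes from the rule itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "id INTEGER PRIMARY KEY, source_kind TEXT NOT NULL, "
    "command TEXT NOT NULL, probe_tag TEXT)"
)
conn.executemany(
    "INSERT INTO events (source_kind, command, probe_tag) VALUES (?, ?, ?)",
    [
        ("bash", "cargo test", None),
        ("bash", "true", "probe-0001"),  # synthetic probe row
        ("bash", "git status", None),
    ],
)


def user_visible_commands(conn: sqlite3.Connection, source_kind: str) -> list:
    """Return commands for one source, never including synthetic probe rows."""
    rows = conn.execute(
        "SELECT command FROM events "
        "WHERE source_kind = ? AND probe_tag IS NULL "
        "ORDER BY id",
        (source_kind,),
    )
    return [row[0] for row in rows]


commands = user_visible_commands(conn, "bash")
```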

6. Enrichment eligibility

Edit brain/src/hippo_brain/enrichment.py::is_enrichment_eligible to add a branch for your source. Decide:

If your source warrants its own queue table (browser-style), add it to the migration in step 2. Otherwise reuse enrichment_queue (events) or one of the existing per-source tables.
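A hedged sketch of what a new eligibility branch might look like. The real is_enrichment_eligible in enrichment.py is not reproduced here: the trivial-command set, the 100 ms duration floor, the event keys, and the way the two checks combine are all assumptions chosen for illustration.

```python
# Hypothetical values; the real shell branch defines its own set and threshold.
TRIVIAL_COMMANDS = frozenset({"ls", "cd", "pwd", "clear"})
MIN_DURATION_MS = 100


def is_enrichment_eligible(event: dict, source: str) -> bool:
    """Decide whether an event is worth sending to the enrichment queue."""
    if source in ("shell", "my-source"):
        command = (event.get("command") or "").strip()
        if not command:
            return False
        if command.split()[0] in TRIVIAL_COMMANDS:
            return False
        return event.get("duration_ms", 0) >= MIN_DURATION_MS
    # Unknown sources are ineligible until a branch is added for them.
    return False


eligible = is_enrichment_eligible(
    {"command": "cargo build --release", "duration_ms": 4200}, "my-source"
)
trivial = is_enrichment_eligible({"command": "ls -la", "duration_ms": 4200}, "my-source")
```

Defaulting unknown sources to ineligible means a forgotten branch shows up as missing enrichment rather than as noise in the knowledge graph.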

7. Brain-side enrichment path

Add a _enrich_<your_source>_batches method to brain/src/hippo_brain/server.py, modeled on _enrich_shell_batches / _enrich_browser_batches. The shape:

  1. claim_pending_<source> returns batches of events to enrich.
  2. Build a prompt via a source-specific build_<source>_enrichment_prompt(events) function in your own module under brain/src/hippo_brain/.
  3. Call _call_llm_with_retries(SYSTEM_PROMPT, prompt, "<source-label>").
  4. Parse with parse_enrichment_response(raw) (returns the canonical EnrichmentResult shape).
  5. Write via write_knowledge_node (or a source-specific writer if you need extra link-table columns, like write_claude_knowledge_node).
  6. Background-embed via embed_knowledge_node (asyncio task, gathered at end of batch).

Copy a system prompt from one of the existing modules; honor the verbatim-preservation rule (PR #100) and the identifier-density rule for embed_text.
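The six steps above can be sketched as one async function. Every helper below is a stand-in with an invented body and a simplified signature; only the order of the calls and the gather-at-end-of-batch pattern are taken from the description. The real method lives on the server class in server.py and uses the real helpers.

```python
import asyncio
import json

SYSTEM_PROMPT = "placeholder system prompt"  # stand-in
embedded = []  # records which node ids were embedded, for demonstration only


def claim_pending_my_source():
    """Stand-in: return batches of events awaiting enrichment."""
    return [
        [{"id": 1, "text": "cargo test -p hippo-core"}],
        [{"id": 2, "text": "git rebase main"}],
    ]


def build_my_source_enrichment_prompt(events):
    return "\n".join(event["text"] for event in events)


async def call_llm_with_retries(system_prompt, prompt, label):
    """Stand-in for the LLM call: echoes the prompt back as JSON."""
    return json.dumps({"summary": prompt, "embed_text": prompt})


def parse_enrichment_response(raw):
    return json.loads(raw)


def write_knowledge_node(result, events):
    """Stand-in writer: pretend the node id equals the first event id."""
    return events[0]["id"]


async def embed_knowledge_node(node_id):
    await asyncio.sleep(0)  # yield once, as a real embedding call would
    embedded.append(node_id)


async def enrich_my_source_batches():
    embed_tasks = []
    for events in claim_pending_my_source():                     # step 1
        prompt = build_my_source_enrichment_prompt(events)       # step 2
        raw = await call_llm_with_retries(SYSTEM_PROMPT, prompt, "my-source")  # step 3
        result = parse_enrichment_response(raw)                  # step 4
        node_id = write_knowledge_node(result, events)           # step 5
        # Step 6: embedding runs in the background and is awaited once at the end.
        embed_tasks.append(asyncio.create_task(embed_knowledge_node(node_id)))
    await asyncio.gather(*embed_tasks)
    return len(embed_tasks)


enriched = asyncio.run(enrich_my_source_batches())
```

Deferring the gather to the end keeps a slow embedding call from blocking the next batch's LLM request.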

8. Watchdog invariant

Add an entry to the I-1..I-N invariant list in crates/hippo-daemon/src/watchdog.rs and the matching documentation in architecture.md.

A typical freshness invariant looks like:

I-N: If <context-condition> is true (your source is “active”), source_health.<your-source>.last_event_ts must be within <threshold> of now.

The context-condition is essential: shell silence overnight is normal, browser silence is normal when Firefox is closed. Don’t fire alarms on absolute silence — gate on a positive activity signal. (AP-3 forbids unconditional silence alarms.)

Threshold guidance: pick at least 3× the expected interval between events. Shell is 60 s (1200× the 50 ms typical hook latency). Claude-session is 5 min. Browser is 2 min.
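A gated freshness check can be sketched as a pure function. This is an illustration in Python, not the code in watchdog.rs; the function name, the millisecond units, and the decision to treat "active but never seen an event" as a violation are assumptions.

```python
from typing import Optional


def freshness_violation(
    active: bool,
    last_event_ts_ms: Optional[int],
    now_ms: int,
    threshold_ms: int,
) -> bool:
    """Return True only when the source is active AND its last event is too old."""
    if not active:
        # AP-3: silence alone is never an alarm.
        return False
    if last_event_ts_ms is None:
        # Assumption: an active source with no recorded event is stale.
        return True
    return now_ms - last_event_ts_ms > threshold_ms


NOW_MS = 1_700_000_000_000
SHELL_THRESHOLD_MS = 60_000  # the 60 s shell threshold quoted above

idle_overnight = freshness_violation(False, None, NOW_MS, SHELL_THRESHOLD_MS)
recent = freshness_violation(True, NOW_MS - 30_000, NOW_MS, SHELL_THRESHOLD_MS)
stale = freshness_violation(True, NOW_MS - 120_000, NOW_MS, SHELL_THRESHOLD_MS)
```

Here idle_overnight and recent are False, and stale is True: the alarm fires only when the activity signal is positive and the threshold is exceeded.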

9. Doctor check

hippo doctor runs two related sets of source checks (both in crates/hippo-daemon/src/commands.rs); a new source needs to be wired into both:

Severity:

Keep the soft + hard thresholds in the same commands.rs constants/source_freshness_probes() definitions that check_source_freshness consumes today, so doctor and watchdog behavior share a single source of truth rather than duplicating values. Watchdog thresholds are not config-driven today; if you need runtime override, that is a follow-up task.
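The single-source-of-truth idea can be sketched as one threshold table that both consumers read. The dataclass, its field names, the threshold values, the severity labels, and the choice that the watchdog uses the hard threshold are all hypothetical; the real definitions live in commands.rs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FreshnessThresholds:
    source: str
    soft_ms: int  # beyond this, doctor warns
    hard_ms: int  # beyond this, doctor fails


# Defined once; nothing else in the sketch hard-codes a threshold.
THRESHOLDS = (
    FreshnessThresholds("my-source", soft_ms=120_000, hard_ms=600_000),
)


def thresholds_for(source: str) -> FreshnessThresholds:
    for entry in THRESHOLDS:
        if entry.source == source:
            return entry
    raise KeyError(source)


def doctor_severity(source: str, age_ms: int) -> str:
    """Map the age of the last event onto ok / warn / fail."""
    entry = thresholds_for(source)
    if age_ms > entry.hard_ms:
        return "fail"
    if age_ms > entry.soft_ms:
        return "warn"
    return "ok"


def watchdog_threshold_ms(source: str) -> int:
    """The watchdog reads the same table instead of keeping its own copy."""
    return thresholds_for(source).hard_ms


severities = [doctor_severity("my-source", age) for age in (60_000, 300_000, 900_000)]
```

Because both functions go through thresholds_for, changing a value in THRESHOLDS changes doctor and watchdog behavior together.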

10. Test matrix + source audit

Every source row in sources.md names an integration test that proves rows land. Add yours:

// crates/hippo-daemon/tests/source_audit.rs::your_source_events
#[test]
fn your_source_events() {
    // Drive a real event through the capture path (no mocks of the
    // write layer; mocks of upstream are okay if needed).
    // Assert that the event row exists in the source's table and that
    // the source_health row was updated.
}

Add a row to test-matrix.md for each known failure mode you’ve thought through. For a new source you’ll typically add at least:

11. Documentation

Add rows to:

Don’t add to anti-patterns.md — that file is for review-blocker rules learned from real bugs, not for net-new constraints from speculation. If your source uncovers a genuine new failure mode in review or production, then add it.

Worked example: bash history

Suppose you’re adding bash history capture as a new source. (For zsh, just edit hippo.zsh; this example is illustrative.)

| Step | What you’d do |
| --- | --- |
| 1. Source identity | source_kind = 'bash' |
| 2. Migration | Bump EXPECTED_VERSION to N+1; seed INSERT OR IGNORE INTO source_health (source, last_event_ts, updated_at) VALUES ('bash', NULL, ...). No new table: bash events flow through events. |
| 3. Capture path | Write shell/hippo.bash analogous to hippo.zsh (preexec/precmd → fire-and-forget on the daemon socket). The daemon side reuses handle_send_event_shell with source_kind parameterized, a minor refactor in commands.rs. |
| 4. Redaction | Same RedactionEngine runs on commands + stdout + stderr; nothing source-specific. |
| 5. Probe | Add a "bash" arm to the source: Option<&str> dispatch in probe.rs::run and a probe_bash async function (no enum); probe_ok is pgrep -x bash non-empty. The probe injects a synthetic command via the same socket path. |
| 6. Eligibility | is_enrichment_eligible(event_dict, "bash") mirrors the shell branch: same trivial-command set, same duration threshold. |
| 7. Brain | If shell/bash share enrichment shape, no new method needed: parameterize _enrich_shell_batches to pull from both source_kind in ('shell','bash'). |
| 8. Watchdog invariant | I-N+1 mirrors I-1: bash liveness when bash is the user’s active shell, 60 s threshold. |
| 9. Doctor | New SourceFreshnessProbe entry in source_freshness_probes() (used by check_source_freshness); also extend check_source_staleness to include the new source’s source_health row. |
| 10. Tests | source_audit::bash_events (event lands), nm_bash_restart (capture survives daemon restart), bash-specific probe round-trip. |
| 11. Docs | Row in sources.md; changelog in schema.md; bash hook documented in shell/README.md. |

Estimated effort for a competent contributor: 3-5 days of focused work if all the daemon/brain abstractions accommodate the new source cleanly. The larger share of that time goes to the test matrix.

When NOT to add a new source

See also