Capture Sources

Per-source coverage map: what each source captures, the entry point that writes it, the tables it lands in, and what fires when it stops landing. Companion to architecture.md (the system reference) and test-matrix.md (failure-mode-to-test mapping).

For the rules every contributor must follow when adding a new source, see anti-patterns.md. For first-aid when one of these sources stops working, see operator-runbook.md.

Source matrix

| # | Source | Entry point | Tables | Invariant | Probe | Status |
|---|---|---|---|---|---|---|
| 1 | Shell commands (zsh hook) | hippo.zsh::preexec/precmd → unix socket → commands.rs::handle_send_event_shell → daemon.rs::flush_events → storage.rs::insert_event_at | events (source_kind='shell') | I-1 | Yes (synthetic command via hippo probe --source shell) | healthy |
| 2 | Claude tool events | Derived during Claude-session ingest. claude_session.rs::build_envelope → same flush_events path; storage.rs derives source_kind='claude-tool' when tool_name.is_some() | events (source_kind='claude-tool', tool_name set) | I-3 | Indirect (rides on claude-session probe) | healthy |
| 3 | Claude session segments (FS watcher) | com.hippo.claude-session-watcher LaunchAgent → FSEvents on ~/.claude/projects/**/*.jsonl → watch_claude_sessions.rs::process_file → claude_session.rs::ingest_session_file → insert_segments | claude_sessions, claude_enrichment_queue (capture-health key agentic-session-claude) | I-2 | Yes (synthetic JSONL append) | healthy |
| 4 | Claude subagent sessions (agent-*.jsonl) | Same FS-watcher path; SessionFile::from_path detects <project>/<parent-uuid>/subagents/<id>.jsonl and sets is_subagent=true | claude_sessions with is_subagent=1, parent_session_id | I-2 | (rides on parent) | healthy |
| 5 | Browser visits (Firefox extension) | extension/firefox content script → background → native messaging stdio → native_messaging.rs::run → send_event_fire_and_forget → flush_events → storage.rs::insert_browser_event | browser_events, browser_enrichment_queue | I-4 | Yes (synthetic NM frame) | healthy |
| 6 | GitHub workflow runs | gh_poll.rs::run_once (poller) → storage.rs::workflow_store::* | workflow_runs, workflow_jobs, workflow_annotations, workflow_log_excerpts, workflow_enrichment_queue | (no real-time invariant; doctor checks freshness) | No | healthy; opt-in via [github] enabled = true |
| 7 | Xcode ClaudeAgentConfig sessions | com.hippo.xcode-claude-ingest LaunchAgent → scripts/hippo-ingest-claude.py → Python insert_segment (same claude_sessions schema) | claude_sessions, claude_enrichment_queue | I-2 (shared) | Indirect | healthy |
| 8 | Codex sessions | com.hippo.codex-session LaunchAgent (scheduled poller) → hippo codex-poll → codex_session::poll_tick (Rust) walks ~/.codex/sessions (+ archived + Xcode CodingAssistant/codex/sessions as secondary) for rollout-*.jsonl, upserts segmented rows into claude_sessions via (session_id, segment_index) ON CONFLICT; capture-health key agentic-session-codex | claude_sessions (shared), claude_enrichment_queue | I-2 (shared) | No | healthy; Rust poller, same source-audit coverage as other claude_sessions sources |
| 9 | Cursor Agent sessions | com.hippo.cursor-session LaunchAgent (scheduled poller) → hippo cursor-poll → cursor_session::poll_tick (Rust) walks ~/.cursor/projects/**/agent-transcripts/**/*.jsonl, parses char-bounded segments stamped from file mtime, upserts segmented rows into claude_sessions via (session_id, segment_index) ON CONFLICT; capture-health key agentic-session-cursor | claude_sessions (shared), claude_enrichment_queue | I-15 | Assertion-only (settled transcripts present → matching claude_sessions row) | healthy; Rust poller, same source-audit coverage as other claude_sessions sources |
| 10 | Probe events | com.hippo.probe LaunchAgent → crates/hippo-daemon/src/probe.rs → per-source synthetic-event path | events / browser_events / claude_sessions with probe_tag IS NOT NULL | I-8 | (drives the others’ probes) | healthy |
| 11 | Watchdog heartbeat | com.hippo.watchdog → crates/hippo-daemon/src/watchdog.rs → source_health WHERE source='watchdog' UPDATE every cycle | source_health only | I-7 | n/a | healthy |
| 12 | Opencode sessions | com.hippo.opencode-poll LaunchAgent (every [opencode] poll_interval_secs) → hippo opencode-poll → opencode_session.rs::poll_tick reads opencode’s own SQLite → upserts agentic_sessions + enqueues agentic_enrichment_queue | agentic_sessions (harness='opencode'), agentic_enrichment_queue, knowledge_node_agentic_sessions | I-11 | No (deferred; doctor uses opencode DB mtime as a freshness proxy) | new in v14; no production probe yet |

Per-source notes

Shell

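The shell hook runs in the critical path of the user’s interactive prompt, so the capture path is fire-and-forget at the socket boundary: the hook writes the event to the unix socket and returns without waiting for the daemon to flush it.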

Captures: command, exit_code, duration_ms, cwd, hostname, shell, git_branch, git_commit, git_repo, stdout (truncated to head/tail), stderr (truncated). Redaction runs before storage; see config/README.md.
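The head/tail truncation of stdout and stderr can be illustrated with a minimal Python sketch. The function name, the size limits, and the marker string are hypothetical; the real limits live in the daemon and are not stated here.

```python
def truncate_head_tail(text: str, head: int = 2048, tail: int = 2048,
                       marker: str = "\n...[truncated]...\n") -> str:
    """Keep the first `head` and last `tail` characters of `text`.

    The limits and the marker are illustrative assumptions, not Hippo's
    actual values.
    """
    if len(text) <= head + tail:
        # Short enough to store verbatim.
        return text
    # Slice from an explicit start index so that tail=0 keeps nothing.
    return text[:head] + marker + text[len(text) - tail:]
```

Keeping both ends preserves the start of the output (often the command banner) and the end (often the error), which are usually the most useful parts of a long log.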

Claude session segments

Two distinct paths write Claude Code segments to claude_sessions:

  1. The FS watcher (com.hippo.claude-session-watcher, KeepAlive=true) is the canonical real-time path. FSEvents on ~/.claude/projects/**/*.jsonl triggers extract_segments; segments are upserted via (session_id, segment_index) ON CONFLICT. The legacy per-session tmux tailer was deleted in T-8 (PR #89); the SessionStart hook is now a no-op debug log.
  2. Manual recovery via hippo ingest claude-session <path> does a one-shot batch import. Useful when the watcher is wedged or for backfilling a single file.

The watcher’s resume state lives in claude_session_offsets per file. Content-hash dedup gates re-enrichment: a segment whose content hasn’t changed since last enrichment is not re-enqueued. (See anti-patterns.md AP-12 for the historical bug class that motivated content-hash dedup.)
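The upsert and the content-hash gate can be sketched together in Python with an in-memory SQLite database. This is an illustration under assumptions, not the Rust implementation: the column names content_hash and enriched_hash, the queue layout, and the use of SHA-256 are all hypothetical.

```python
import hashlib
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create a toy schema; column names are assumptions for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE claude_sessions (
            session_id    TEXT NOT NULL,
            segment_index INTEGER NOT NULL,
            content       TEXT NOT NULL,
            content_hash  TEXT NOT NULL,
            enriched_hash TEXT,  -- hash of the content at last enrichment
            PRIMARY KEY (session_id, segment_index)
        );
        CREATE TABLE claude_enrichment_queue (
            session_id    TEXT NOT NULL,
            segment_index INTEGER NOT NULL,
            PRIMARY KEY (session_id, segment_index)
        );
    """)
    return conn


def upsert_segment(conn: sqlite3.Connection, session_id: str,
                   segment_index: int, content: str) -> bool:
    """Upsert one segment; return True if it was enqueued for enrichment."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    key = (session_id, segment_index)
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            """INSERT INTO claude_sessions
                   (session_id, segment_index, content, content_hash)
               VALUES (?, ?, ?, ?)
               ON CONFLICT (session_id, segment_index) DO UPDATE SET
                   content = excluded.content,
                   content_hash = excluded.content_hash""",
            key + (content, digest),
        )
        enriched = conn.execute(
            "SELECT enriched_hash FROM claude_sessions "
            "WHERE session_id = ? AND segment_index = ?", key,
        ).fetchone()[0]
        if enriched == digest:
            return False  # unchanged since last enrichment: do not re-enqueue
        conn.execute(
            "INSERT OR IGNORE INTO claude_enrichment_queue VALUES (?, ?)", key)
        return True


def mark_enriched(conn: sqlite3.Connection, session_id: str,
                  segment_index: int) -> None:
    """Record that the current content has been enriched and dequeue it."""
    key = (session_id, segment_index)
    with conn:
        conn.execute(
            "UPDATE claude_sessions SET enriched_hash = content_hash "
            "WHERE session_id = ? AND segment_index = ?", key)
        conn.execute(
            "DELETE FROM claude_enrichment_queue "
            "WHERE session_id = ? AND segment_index = ?", key)
```

Re-reading an unchanged file therefore touches the row but never grows the queue; only a segment whose content actually changed is enriched again.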

Browser

The Firefox extension is a TypeScript build (extension/firefox/); the daemon-side adapter is native_messaging.rs. The extension only captures from allow-listed domains ([browser.allowlist] in config.toml). Page content is extracted via Mozilla Readability on page departure — full readable article text plus URL, title, dwell time, and scroll depth. URL query parameters listed in [browser.url_redaction] are stripped before storage.
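The allow-list check and query-parameter stripping can be sketched as follows. The real extension is TypeScript and the daemon adapter is Rust, so this Python version only illustrates the idea; the function names are hypothetical, and treating subdomains of an allow-listed domain as allowed is an assumption.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def is_allowed(url: str, allowlist: list[str]) -> bool:
    """True if the URL's host is an allow-listed domain or a subdomain of one.

    Subdomain matching is an assumption made for this sketch.
    """
    host = urlsplit(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowlist)


def redact_url(url: str, redacted_params: set[str]) -> str:
    """Drop every query parameter whose name is in `redacted_params`."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in redacted_params
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

For example, redact_url("https://example.com/a?token=abc&page=2", {"token"}) returns "https://example.com/a?page=2".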

The native messaging manifest at ~/Library/Application Support/Mozilla/NativeMessagingHosts/hippo-native-messaging.json is installed by hippo daemon install --force.

Workflow runs (GitHub Actions)

Opt-in via [github] enabled = true and a token in HIPPO_GITHUB_TOKEN (env var, ~/.config/zsh/.env, or gh auth token; see config/config.default.toml for full token-scope guidance). Polls the Actions API every [github] poll_interval_secs (default 60), upserts runs/jobs/annotations/log excerpts, and enqueues for enrichment.

There is no real-time invariant — doctor’s source-freshness probe (crates/hippo-daemon/src/commands.rs::source_freshness_probes) checks MAX(workflow_runs.started_at) against soft (3 d) and hard (30 d) thresholds.
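The two-threshold freshness check can be sketched in Python. Only the 3-day and 30-day values come from the text above; the status labels ("ok", "warn", "fail", "no-data") and the handling of an empty table are assumptions, and the real check is the Rust function named above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

SOFT_THRESHOLD = timedelta(days=3)
HARD_THRESHOLD = timedelta(days=30)


def classify_freshness(latest: Optional[datetime], now: datetime) -> str:
    """Classify the age of MAX(workflow_runs.started_at).

    The returned labels are hypothetical names for this sketch.
    """
    if latest is None:
        return "no-data"  # assumption: an empty table is reported separately
    age = now - latest
    if age > HARD_THRESHOLD:
        return "fail"
    if age > SOFT_THRESHOLD:
        return "warn"
    return "ok"
```

A soft threshold that only warns tolerates a quiet long weekend, while the hard threshold still catches a poller that has silently stopped.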

Xcode-side sources (ClaudeAgentConfig + Codex)

Two LaunchAgents write into the shared claude_sessions table. The ClaudeAgentConfig path (com.hippo.xcode-claude-ingest) still uses the Python script scripts/hippo-ingest-claude.py. The Codex rollout path (com.hippo.codex-session) is a Rust poller: hippo codex-poll runs codex_session::poll_tick, which walks ~/.codex/sessions (+ archived + the Xcode CodingAssistant directory) for rollout-*.jsonl files, parses their distinct JSONL envelope (session_meta, response_item/function_call) natively in Rust, and upserts segments into claude_sessions. Both paths share the claude_enrichment_queue enrichment path.

Cursor Agent sessions

The Cursor poller (com.hippo.cursor-session) is a Rust poller: hippo cursor-poll runs cursor_session::poll_tick, which walks ~/.cursor/projects recursively for agent-transcripts/**/*.jsonl files (main sessions and subagents). Cursor transcripts carry no per-line timestamps; segments are bounded by accumulated character count and time-stamped from the file mtime. Subagents are ingested as their own claude_sessions rows with is_subagent=1 and parent_session_id set. The poller shares the claude_enrichment_queue enrichment path with Claude Code, Codex, and other claude_sessions sources. Capture-health key is agentic-session-cursor; watchdog invariant I-15 gates on consecutive_failures > 3.
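Character-bounded segmentation with mtime stamping can be sketched like this. The Segment type, the max_chars default, and the rule of closing a segment before the line that would exceed the budget are assumptions for illustration; the Rust poller may split differently.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    segment_index: int
    text: str
    timestamp: float  # file mtime, since transcript lines carry no timestamps


def segment_lines(lines: list[str], mtime: float,
                  max_chars: int = 8000) -> list[Segment]:
    """Group transcript lines into segments of at most ~max_chars characters.

    A single line longer than max_chars still becomes its own segment.
    """
    segments: list[Segment] = []
    buffer: list[str] = []
    size = 0
    for line in lines:
        # Close the current segment before it would exceed the budget.
        if buffer and size + len(line) > max_chars:
            segments.append(Segment(len(segments), "\n".join(buffer), mtime))
            buffer, size = [], 0
        buffer.append(line)
        size += len(line)
    if buffer:
        segments.append(Segment(len(segments), "\n".join(buffer), mtime))
    return segments
```

In a real poller the mtime argument would come from something like os.path.getmtime(path); every segment of a file then shares that one timestamp.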

Opencode sessions

Polled (not watched) — opencode owns its SQLite DB and we open it read-only, so we cannot subscribe to writes the way the Claude FS watcher does for JSONL files. hippo opencode-poll runs every [opencode] poll_interval_secs (default 30 s) under com.hippo.opencode-poll.

Schema-wise this source is harness-agnostic by design: agentic_sessions carries a harness column ('claude-code', 'opencode', 'codex') and is the destination for any future agentic-harness poller. v14 only wires opencode; codex/claude-code rows in this table are aspirational.

Change detection is a per-session watermark, not a global cursor. Each tick full-scans opencode’s session table and compares every row’s time_updated against the end_time Hippo already stored for that same session in agentic_sessions — the row keyed by (session_id, harness='opencode') (a session with no such row is new; one whose source time_updated exceeds its stored end_time has grown). INSERT … ON CONFLICT DO UPDATE keeps the destination row idempotent across re-reads. Because each session is its own watermark, this is gap-free and duplicate-free under partial failure: a failed upsert rolls back, leaving end_time behind, so that one session is retried next tick while unchanged siblings — including same-millisecond ones — are left alone. (Opencode does not use agentic_cursor; that table now serves only the codex per-file cursor. The earlier global-cursor design produced an unbounded duplicate-node “loop of sadness” at the watermark boundary and a partial-failure lost-update — both eliminated by going per-session.)
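The per-session watermark can be sketched in Python against an in-memory SQLite destination. The source rows are passed in as plain tuples rather than read from opencode's database, and the title column stands in for whatever payload the real upsert carries; both are simplifying assumptions.

```python
import sqlite3
from typing import Iterable


def open_dest() -> sqlite3.Connection:
    """Toy destination table keyed by (session_id, harness)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE agentic_sessions (
            session_id TEXT NOT NULL,
            harness    TEXT NOT NULL,
            end_time   INTEGER NOT NULL,
            title      TEXT,
            PRIMARY KEY (session_id, harness)
        )""")
    return conn


def poll_tick(source_rows: Iterable[tuple[str, int, str]],
              dest: sqlite3.Connection) -> list[str]:
    """Full-scan the source rows; upsert only new or grown sessions.

    Each source row is (session_id, time_updated, title).
    Returns the session ids that were upserted this tick.
    """
    changed = []
    for session_id, time_updated, title in source_rows:
        stored = dest.execute(
            "SELECT end_time FROM agentic_sessions "
            "WHERE session_id = ? AND harness = 'opencode'", (session_id,),
        ).fetchone()
        if stored is not None and time_updated <= stored[0]:
            continue  # this session's own watermark says nothing changed
        try:
            with dest:  # per-session transaction: rolls back on failure
                dest.execute(
                    """INSERT INTO agentic_sessions
                           (session_id, harness, end_time, title)
                       VALUES (?, 'opencode', ?, ?)
                       ON CONFLICT (session_id, harness) DO UPDATE SET
                           end_time = excluded.end_time,
                           title = excluded.title""",
                    (session_id, time_updated, title),
                )
        except sqlite3.Error:
            continue  # end_time stays behind, so the next tick retries it
        changed.append(session_id)
    return changed
```

Because the comparison is made per row, two sessions updated in the same millisecond are handled independently, which a single global cursor cannot guarantee.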

agentic_sessions.summary_text is built at write time from the opencode columns we have (title, agent, model, snapshot diff stats). The brain’s _enrich_opencode_batches reads this column verbatim as the LLM prompt body, so any future enrichment quality work flows through build_summary_text in opencode_session.rs.

The brain side mirrors claude_sessions.py: claim_pending_opencode_segments flips queue rows to processing, the LLM call produces a knowledge_nodes row, write_opencode_knowledge_node links via knowledge_node_agentic_sessions and closes out the queue entry. Eligibility filter (in enrichment.py::is_enrichment_eligible) skips sessions with <3 messages and no diffs/commits.
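One reading of the eligibility rule is sketched below. It assumes a session is skipped only when both conditions hold (fewer than 3 messages and no diffs or commits); the function and parameter names are hypothetical and do not reproduce the signature in enrichment.py.

```python
def is_eligible_sketch(message_count: int, diff_count: int,
                       commit_count: int) -> bool:
    """Return False for sessions that are both short and change-free.

    Assumption: the two conditions are combined with AND, so a short session
    that produced a diff or commit is still enriched.
    """
    too_short = message_count < 3
    no_changes = diff_count == 0 and commit_count == 0
    return not (too_short and no_changes)
```

Under this reading a two-message session that produced a commit is enriched, while a two-message session with no diffs or commits is skipped.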

No production probe yet — hippo probe --source opencode is deferred. The doctor freshness check uses the opencode DB’s own mtime as a suppression signal so an idle day in opencode doesn’t fail the run.

Probes

Synthetic events are sent through each path every 5 minutes. Probe rows carry a per-run UUID in probe_tag (so probe_tag IS NOT NULL identifies them) and are filtered out of every user-facing query. The filter is enforced both upstream (the daemon never enqueues probe events for enrichment) and downstream (every query in commands.rs, mcp.py, and retrieval.py adds AND probe_tag IS NULL). A Semgrep rule blocks new query call-sites that omit the filter. (See anti-patterns.md AP-6.)
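The downstream filter can be sketched in Python with SQLite. The table layout and helper names are hypothetical; only the probe_tag column and the AND probe_tag IS NULL predicate come from the text above.

```python
import sqlite3
import uuid
from typing import Optional


def open_events_db() -> sqlite3.Connection:
    """Toy events table; the real schema has many more columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE events (
            id          INTEGER PRIMARY KEY,
            source_kind TEXT NOT NULL,
            command     TEXT NOT NULL,
            probe_tag   TEXT  -- NULL for real user events
        )""")
    return conn


def insert_event(conn: sqlite3.Connection, command: str,
                 probe_tag: Optional[str] = None) -> None:
    with conn:
        conn.execute(
            "INSERT INTO events (source_kind, command, probe_tag) "
            "VALUES ('shell', ?, ?)", (command, probe_tag))


def user_events(conn: sqlite3.Connection) -> list[str]:
    """User-facing query: probe rows are excluded by the mandatory filter."""
    rows = conn.execute(
        "SELECT command FROM events "
        "WHERE source_kind = 'shell' AND probe_tag IS NULL "
        "ORDER BY id")
    return [row[0] for row in rows]


def probe_events(conn: sqlite3.Connection, run_tag: str) -> list[str]:
    """Probe-side query: find the rows written by one probe run."""
    rows = conn.execute(
        "SELECT command FROM events WHERE probe_tag = ? ORDER BY id",
        (run_tag,))
    return [row[0] for row in rows]
```

A probe run would generate its tag with something like str(uuid.uuid4()), insert its synthetic command with that tag, and then confirm the row landed via probe_events, while user_events never sees it.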

Adding a new source

The full contract for a new capture source is documented in adding-a-source.md. It covers the eleven required pieces: source identity, schema migration, capture path, redaction, probes, eligibility predicate, brain enrichment path, watchdog invariant, doctor check, test matrix, and documentation. A worked example (hypothetical bash source) walks through every step with concrete file references.