docs/EVENT_DRIVEN_CAPTURE_SPEC.md
Status: Draft | Date: 2026-02-20
Three independent capture systems run on their own clocks with zero synchronization.
When a user searches a keyword that exists in accessibility data, the nearest screenshot is from a different moment. The thumbnail is wrong. The user doesn't trust the results.
Meanwhile the vision pipeline burns CPU comparing and skipping identical frames on a static screen. The ActivityFeed already detects every click, keystroke, and app switch — but instead of triggering a capture, it nudges a polling rate. That's backwards.
Kill the three-system split. One system: event happens → screenshot + text extraction → store together.
Event (click / app switch / typing pause / scroll stop / idle timer)
→ Screenshot (reuse capture_monitor_image)
→ Accessibility tree walk (reuse walk_focused_window)
→ If accessibility empty → OCR fallback (reuse process_ocr_task)
→ Write JPEG to disk
→ Insert frame + text into DB (single row, single timestamp)
Screenshot and text share the same timestamp because they come from the same capture. No desync possible.
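
The whole chain fits in one function. A minimal sketch, assuming the reused helpers keep roughly their current shapes (the `Db` handle, `write_snapshot_jpeg`, and every signature here are illustrative, not existing code):

```rust
// Sketch only: helper names come from this spec, signatures are assumed.
async fn capture_once(monitor_id: u32, trigger: &str, db: &Db) -> anyhow::Result<()> {
    let timestamp = chrono::Utc::now();

    // 1. Screenshot (reuses the existing monitor capture)
    let image = capture_monitor_image(monitor_id).await?;

    // 2. Accessibility first; OCR only if the walk returned nothing usable
    let (text, text_source) = match walk_focused_window().await {
        Ok(result) if !result.text_content.is_empty() => (result.text_content, "accessibility"),
        _ => (process_ocr_task(&image).await?, "ocr"),
    };

    // 3. One JPEG and one DB row, both stamped with the same `timestamp`
    let snapshot_path = write_snapshot_jpeg(&image, monitor_id, timestamp)?;
    db.insert_snapshot_frame(monitor_id, timestamp, &snapshot_path, &text, text_source, trigger)
        .await
}
```
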
| Trigger | Debounce | Why |
|---|---|---|
| App switch | 300ms settle | Highest-value event — user changed context |
| Window focus change | 300ms settle | New tab, new document, new conversation |
| Mouse click | 200ms | User interacted — screen likely changed |
| Typing pause | 500ms after last key | Capture the result of typing, not every character |
| Scroll stop | 400ms after last scroll | New content scrolled into view |
| Clipboard copy | 200ms | User grabbed something — capture context |
| Idle fallback | Every 5s | Catch passive changes: notifications, incoming messages, auto-play |
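
For illustration, the trigger set and settle windows above map directly onto a small enum (the name `CaptureTrigger` and this exact shape are assumptions, not existing code):

```rust
use std::time::Duration;

// Sketch: one variant per row of the trigger table above.
#[derive(Clone, Copy, Debug)]
enum CaptureTrigger {
    AppSwitch,
    WindowFocus,
    Click,
    TypingPause,
    ScrollStop,
    Clipboard,
    Idle,
}

impl CaptureTrigger {
    /// Settle window before capturing, per the table above.
    fn settle(self) -> Duration {
        Duration::from_millis(match self {
            CaptureTrigger::AppSwitch | CaptureTrigger::WindowFocus => 300,
            CaptureTrigger::Click | CaptureTrigger::Clipboard => 200,
            CaptureTrigger::TypingPause => 500,
            CaptureTrigger::ScrollStop => 400,
            CaptureTrigger::Idle => 5_000, // fallback interval, not a settle window
        })
    }
}
```
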
Hard constraints:

- Minimum 200ms between captures per monitor, enforced by the debounce/dedup layer.
- Idle fallback captures are deduplicated against the previous frame (the only remaining job of `FrameComparer`).

Accessibility first. OCR as fallback. No "both" mode at capture time — keep it simple.

```
walk_focused_window() → result
if result.text_content is non-empty → done (text_source = "accessibility")
if result is empty/error            → run OCR → done (text_source = "ocr")
```
Accessibility tree walk has a 200ms hard timeout. If the app has a massive AX tree (Electron apps with 10k+ nodes), we take whatever text we got in 200ms and move on. This keeps capture latency predictable.
OCR is the safety net for windows where the accessibility walk comes back empty or fails.
The user doesn't choose. The system picks the right method per-capture.
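
A minimal sketch of that selection, assuming a tokio timeout wraps the walk (the wrapper function is illustrative; note that a plain timeout discards partial results, so the real walk has to check the deadline internally to return whatever text it already collected):

```rust
use std::time::Duration;

// Per-platform values for this constant appear later in the spec.
const AX_WALK_TIMEOUT_MS: u64 = 200;

// Returns Some(text) from accessibility, or None → caller runs process_ocr_task().
async fn text_via_accessibility() -> Option<String> {
    match tokio::time::timeout(
        Duration::from_millis(AX_WALK_TIMEOUT_MS),
        walk_focused_window(),
    )
    .await
    {
        Ok(Ok(result)) if !result.text_content.is_empty() => Some(result.text_content),
        _ => None, // timed out, errored, or empty
    }
}
```
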
No more H.265 video encoding. No more FFmpeg for frame extraction.
Each capture writes a JPEG directly to disk:

```
~/.screenpipe/data/
  2026-02-20/
    1771581935123_m0.jpg   # monitor 0 screenshot
    1771581937456_m0.jpg
    1771581939100_m1.jpg   # monitor 1 screenshot
    ...
```
Metadata (text, app name, trigger, etc.) lives in the DB, not sidecar files. The JPEG is just pixels.
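
A sketch of the path construction (the helper name and the `chrono` usage are illustrative):

```rust
use std::path::{Path, PathBuf};

// One JPEG per capture: <epoch_ms>_m<monitor>.jpg inside a per-day directory.
fn snapshot_path(
    data_dir: &Path,
    monitor_id: u32,
    ts: chrono::DateTime<chrono::Utc>,
) -> std::io::Result<PathBuf> {
    let day_dir = data_dir.join(ts.format("%Y-%m-%d").to_string());
    std::fs::create_dir_all(&day_dir)?; // e.g. ~/.screenpipe/data/2026-02-20/
    Ok(day_dir.join(format!("{}_m{}.jpg", ts.timestamp_millis(), monitor_id)))
}
```
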
Why kill video?
Storage math (8 hours active use, 1080p, JPEG quality 80 ≈ 80 KB/frame): fewer frames, each slightly larger, far less total storage.
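
As a rough illustration (the exact count depends on how active the user is, since capture is event-driven rather than fixed-rate): the ~300 MB target in the metrics table corresponds to roughly 3,750 frames × 80 KB ≈ 300 MB over the 8-hour day, i.e. an average of about one capture every 8 seconds.
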
Reading old data: Legacy video-chunk frames stay on disk forever. The frame retrieval endpoint checks snapshot_path on the frame row — if set, serve JPEG directly; if NULL, use the existing FFmpeg extraction path. Old data keeps working with zero migration effort.
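
A sketch of that branch (only `extract_frame_from_video()` is named by this spec; the row type, field names, and helper signature here are assumptions):

```rust
// Read path: new frames are plain JPEGs, legacy frames go through FFmpeg extraction.
struct FrameRow {
    snapshot_path: Option<String>,
    video_chunk_path: String,
    offset_index: i64,
}

async fn frame_bytes(frame: &FrameRow) -> anyhow::Result<Vec<u8>> {
    match &frame.snapshot_path {
        Some(path) => Ok(tokio::fs::read(path).await?),
        None => extract_frame_from_video(&frame.video_chunk_path, frame.offset_index).await,
    }
}
```
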
One migration, additive only:

```sql
ALTER TABLE frames ADD COLUMN snapshot_path TEXT;
ALTER TABLE frames ADD COLUMN accessibility_text TEXT;
ALTER TABLE frames ADD COLUMN capture_trigger TEXT;            -- 'app_switch', 'click', 'typing_pause', 'scroll_stop', 'clipboard', 'idle', etc.
ALTER TABLE frames ADD COLUMN text_source TEXT DEFAULT 'ocr';  -- 'ocr' or 'accessibility'
CREATE INDEX idx_frames_ts_device ON frames(timestamp, device_name);
```
New frames: snapshot_path set, video_chunk_id may be NULL, accessibility_text populated.
Old frames: snapshot_path NULL, existing video_chunk_id + offset_index used.
Both coexist in the same table. Timeline and search show both. No data loss.
Keyword search queries both ocr_text (via existing ocr_text_fts) and the new accessibility_text on frames. Since accessibility_text is on the frame row, the matched thumbnail is always correct.
For the keyword search handler (/search/keyword):

```sql
-- Existing OCR path (unchanged)
SELECT ... FROM ocr_text_fts WHERE ocr_text_fts MATCH ?

-- New accessibility path
UNION
SELECT ... FROM frames
WHERE accessibility_text LIKE '%' || ? || '%'
   OR frame_id IN (SELECT rowid FROM accessibility_text_fts WHERE accessibility_text_fts MATCH ?)
```
Results are merged, deduplicated by frame ID, sorted by timestamp. Thumbnails are always correct regardless of which text source matched.
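
A sketch of the merge step (the hit type and field names are illustrative, not the real handler's response type):

```rust
use std::collections::HashMap;

struct SearchHit {
    frame_id: i64,
    timestamp: i64,
    snippet: String,
    text_source: &'static str, // "ocr" or "accessibility"
}

// Dedupe by frame id (both sources point at the same frame row), then sort by time.
fn merge_hits(ocr: Vec<SearchHit>, accessibility: Vec<SearchHit>) -> Vec<SearchHit> {
    let mut by_frame: HashMap<i64, SearchHit> = HashMap::new();
    for hit in ocr.into_iter().chain(accessibility) {
        by_frame.entry(hit.frame_id).or_insert(hit); // first source wins
    }
    let mut merged: Vec<SearchHit> = by_frame.into_values().collect();
    merged.sort_by_key(|h| h.timestamp);
    merged
}
```
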
Events are monitor-specific where possible: a qualifying trigger captures only the monitor it occurred on.
Other monitors get idle fallback captures only (every 5s, deduplicated).
This avoids capturing all monitors on every click — important for 3+ monitor setups.

```
┌─────────────────────┐
│ Event Listener      │   (reuse existing CGEventTap / UI Automation)
│ (real-time thread)  │
└─────────┬───────────┘
          │ EventTrigger (type + monitor + timestamp)
          ▼
┌─────────────────────┐
│ Debounce + Dedup    │   (per-monitor, 200ms min interval)
│ (async task)        │
└─────────┬───────────┘
          │ qualified trigger
          ▼
┌───────────────────────────────┐
│ Capture Worker (per monitor)  │
│ 1. capture_monitor_image()    │  ~5ms
│ 2. capture_windows()          │  ~10ms
│ 3. walk_focused_window()      │  ~10-200ms (200ms timeout)
│ 4. if empty → process_ocr()   │  ~100-500ms (rare)
│ 5. encode JPEG, write to disk │  ~5-10ms
│ 6. insert frame + text to DB  │  ~5ms (batched)
└───────────────────────────────┘
```
Total latency per capture: ~30-50ms typical (accessibility path), ~200-600ms worst case (OCR fallback). The typical path fits comfortably inside the 200ms minimum interval.
One capture worker per monitor. Workers are independent — a slow OCR on monitor 1 doesn't block capture on monitor 2.
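
A minimal sketch of the debounce/dedup layer, assuming triggers arrive on a channel (struct and constant names are illustrative; a fuller version would also collapse repeated triggers of the same kind that land inside one settle window):

```rust
use std::time::{Duration, Instant};
use tokio::sync::mpsc;

const MIN_CAPTURE_INTERVAL: Duration = Duration::from_millis(200);

#[derive(Clone)]
struct EventTrigger {
    kind: &'static str, // "app_switch", "click", ...
    monitor_id: u32,
    settle: Duration,   // per-trigger settle window from the table above
}

// One loop per monitor; qualified triggers flow on to that monitor's capture worker.
async fn debounce_loop(
    mut events: mpsc::Receiver<EventTrigger>,
    captures: mpsc::Sender<EventTrigger>,
) {
    let mut last_capture = Instant::now() - MIN_CAPTURE_INTERVAL;
    while let Some(event) = events.recv().await {
        tokio::time::sleep(event.settle).await; // let the screen settle
        if last_capture.elapsed() >= MIN_CAPTURE_INTERVAL {
            last_capture = Instant::now();
            let _ = captures.send(event).await;
        }
    }
}
```
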
The work falls into three buckets: code to remove (the table below), code to add (the event-driven pipeline in the rollout plan), and legacy code to keep for the read path. This is not additive. Old code gets removed.
| Removed | Reason |
|---|---|
| `continuous_capture()` loop in `core.rs` | Replaced by event-driven capture |
| `save_frames_as_video()` in `video.rs` | No more video encoding |
| `FrameWriteTracker` in `video.rs` | No video chunks to track offsets in |
| `FrameComparer` as capture gatekeeper | Events decide when to capture, not frame diffs. Keep only for idle dedup. |
| `ActivityFeed::get_capture_params()` | No FPS to adjust. Feed becomes event source. |
| Adaptive FPS feature flag | Gone entirely |
| `ocr_work_queue` / OCR worker thread | OCR runs inline on accessibility fallback only |
| `video_frame_queue` / video encoding thread | No video to encode |
| FFmpeg encoding dependency (write path) | Still needed for legacy frame extraction (read path only) |
| `WindowOcrCache` (300s TTL, 100 entries) | Accessibility is fast enough to not need caching. OCR fallback is rare. |
What stays for backward compat (read path only):

- `extract_frame_from_video()` — for displaying old video-chunk frames
- `video_chunks` table — for old data
- `offset_index` / `fps` columns on frames — for old data

These remain but receive no new writes. They're read-only legacy support.
Not phased. One PR per step, each shippable independently, but all ship in the same release.

1. DB migration: new columns + index on `frames` (see above)
2. `SnapshotWriter`: JPEG write to `~/.screenpipe/data/YYYY-MM-DD/`
3. `insert_snapshot_frame()` in DB
4. `get_frame_data()` to serve snapshots directly
5. `paired_capture()`: screenshot + accessibility walk + OCR fallback
6. `PairedCaptureResult` with image bytes + text + metadata
7. `ActivityFeed` with `tokio::sync::Notify` + event type
8. `EventDrivenCapture::wait_for_trigger()` — debounce + dedup logic
9. `CGEventTap` / UI Automation hooks
10. Replace `VisionManager`'s capture task with event-driven loop: `wait_for_trigger` → `paired_capture` → `snapshot_write` → `db_insert`
11. Remove `continuous_capture()`, `save_frames_as_video()`, `FrameWriteTracker`, `get_capture_params()`
12. Add `accessibility_text` to keyword search FTS
13. Remove `adaptive-fps` feature flag from Cargo.toml
14. Ensure `~/.screenpipe/data/YYYY-MM-DD/` is created automatically

The same APIs we use for capture can drive automated E2E tests. On macOS, osascript opens apps, clicks buttons, types text, switches windows. On Windows, PowerShell + UI Automation does the same. Tests perform real user actions, wait for captures to appear in the DB, and assert correctness.
Layer 1: Unit tests (fast, CI, no UI)
Layer 2: Integration tests (CI, headless)
Layer 3: E2E robot tests (real machines, real UI, nightly CI)

```
# macOS: osascript drives real apps
# Windows: PowerShell + [System.Windows.Automation] drives real apps

test_app_switch_capture:
  1. open TextEdit, type "test document alpha"
  2. open Safari, navigate to example.com
  3. sleep 1s
  4. query DB: frames WHERE capture_trigger = 'app_switch' AND timestamp > test_start
  5. assert: >= 2 frames
  6. assert: frame 1 accessibility_text contains "test document alpha"
  7. assert: frame 2 app_name = "Safari"
  8. assert: both snapshot_path files are valid JPEGs

test_typing_pause_capture:
  1. focus TextEdit, type "meeting notes for project X"
  2. sleep 1s
  3. assert: frame with capture_trigger = 'typing_pause'
  4. assert: accessibility_text contains "meeting notes for project X"

test_scroll_capture:
  1. open Safari, navigate to long page
  2. scroll down 5 times
  3. sleep 1s
  4. assert: frame with capture_trigger = 'scroll_stop'
  5. assert: content differs from pre-scroll frame

test_click_capture:
  1. open System Settings, click "General"
  2. sleep 500ms
  3. assert: frame with capture_trigger = 'click'

test_idle_fallback:
  1. do nothing for 12s
  2. assert: >= 1 frame with capture_trigger = 'idle'

test_rapid_events_debounce:
  1. click 20 times in 1s
  2. sleep 1s
  3. assert: <= 5 frames (200ms min interval)

test_search_thumbnail_correctness:
  1. open TextEdit, type "unique_term_xyz"
  2. switch to Safari
  3. sleep 2s
  4. GET /search?q=unique_term_xyz
  5. assert: result thumbnail shows TextEdit, not Safari
```
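
For a concrete feel, one of these robot tests could be written as a Rust integration test on macOS (the DB path, the `timestamp` column being epoch milliseconds, and the `rusqlite`/`dirs` dependencies are all assumptions; the real harness may just as well stay in shell + AppleScript):

```rust
use std::process::Command;

#[test]
fn app_switch_produces_capture() {
    let test_start_ms = chrono::Utc::now().timestamp_millis();

    // Drive real UI: type in TextEdit, then switch to Safari.
    Command::new("osascript")
        .arg("-e").arg(r#"tell application "TextEdit" to activate"#)
        .arg("-e").arg(r#"tell application "System Events" to keystroke "test document alpha""#)
        .arg("-e").arg(r#"tell application "Safari" to activate"#)
        .status()
        .expect("osascript failed");

    std::thread::sleep(std::time::Duration::from_secs(1));

    // Assert the event-driven capture landed in the DB (path and schema assumed).
    let db_path = dirs::home_dir().unwrap().join(".screenpipe/db.sqlite");
    let conn = rusqlite::Connection::open(db_path).unwrap();
    let count: i64 = conn
        .query_row(
            "SELECT COUNT(*) FROM frames WHERE capture_trigger = 'app_switch' AND timestamp > ?1",
            [test_start_ms],
            |row| row.get(0),
        )
        .unwrap();
    assert!(count >= 1, "expected at least one app_switch capture");
}
```
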
Layer 4: Soak test (8 hours real use, before every release)
~90% of new code is platform-agnostic (debounce, paired capture, snapshot writer, DB, search). Platform-specific code already exists and is abstracted:
| Component | macOS | Windows | New code needed? |
|---|---|---|---|
| Event detection | CGEventTap | SetWindowsHookEx | No — already exists in platform/ |
| Screenshot | ScreenCaptureKit | DXGI/GDI | No — already abstracted |
| Accessibility tree | AX API | UI Automation | No — already in tree/ |
| Debounce/dedup | Pure Rust | Same | No |
| Snapshot writer | File I/O | Same | No |
| JPEG encoding | image crate | Same | No |
| DB | SQLite | Same | No |
One platform-specific tuning: UIA tree walk is slower on Windows (200-500ms vs 10-50ms on macOS). The accessibility timeout constant needs #[cfg(target_os)]:

```rust
#[cfg(target_os = "macos")]
const AX_WALK_TIMEOUT_MS: u64 = 200;

#[cfg(target_os = "windows")]
const AX_WALK_TIMEOUT_MS: u64 = 350;
```
E2E robot tests use platform-native automation:

- macOS: `osascript` (AppleScript)
- Windows: `System.Windows.Automation` (driven from PowerShell)

| Metric | Today | Target |
|---|---|---|
| CPU idle (static screen, release) | 3-5% | < 0.5% |
| CPU active (browsing, release) | 8-15% | < 5% |
| App switch → frame in DB | 1-5s | < 500ms |
| Search thumbnail correctness | ~60% for accessibility matches | 100% |
| Frame serve latency (new frames) | 100-500ms (FFmpeg) | < 5ms |
| Storage (8hr active day) | 800 MB – 1.6 GB | ~300 MB |
| Lines of code in capture pipeline | ~2500 (core.rs + video.rs + frame_comparison.rs) | ~800 |