# architecture

## overview

screenpipe is a Rust application that captures your screen and audio using an event-driven architecture, processes them locally, and stores everything in a SQLite database. instead of recording every second, it listens for meaningful OS events and captures only when something actually changes — pairing each screenshot with accessibility tree data for maximum quality at minimal cost.

```mermaid
graph LR
    subgraph trigger["event triggers"]
        E1[app switch]
        E2[click / scroll]
        E3[typing pause]
        E4[idle timer]
    end

    subgraph capture["paired capture"]
        SS[screenshot]
        A11Y[accessibility tree]
        OCR[OCR fallback]
    end

    subgraph audio["audio"]
        MIC[microphone]
        SYS[system audio]
        STT[speech-to-text]
    end

    subgraph store["storage"]
        DB[(SQLite)]
        FS[JPEG files]
    end

    subgraph serve["API · localhost:3030"]
        REST[REST API]
        MCP[MCP server]
    end

    E1 & E2 & E3 & E4 --> SS
    SS --> A11Y
    A11Y -->|empty?| OCR
    A11Y --> DB
    OCR --> DB
    SS --> FS

    MIC & SYS --> STT --> DB

    DB --> REST
    DB --> MCP
    FS --> REST

    REST --> P[pipes / AI agents]
    MCP --> AI[Claude · Cursor · etc.]
```

## data flow

```mermaid
sequenceDiagram
    participant OS as OS Events
    participant Capture
    participant A11Y as Accessibility
    participant OCR as OCR (fallback)
    participant Audio
    participant SQLite
    participant API
    participant AI

    OS->>Capture: meaningful event (click, app switch, typing pause...)
    Capture->>Capture: screenshot
    Capture->>A11Y: walk accessibility tree
    alt accessibility data available
        A11Y->>SQLite: structured text + metadata
    else accessibility empty (remote desktop, games)
        A11Y->>OCR: fallback
        OCR->>SQLite: extracted text + metadata
    end
    Capture->>SQLite: JPEG frame

    loop every 30s chunk
        Audio->>SQLite: transcription + speaker
        Audio->>SQLite: audio file
    end

    AI->>API: search query
    API->>SQLite: SQL lookup
    SQLite-->>API: results
    API-->>AI: JSON response
```
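the accessibility-first, OCR-fallback decision in the sequence above can be sketched in a few lines (a Python sketch; the function names are illustrative, not screenpipe's actual API):

```python
def extract_text(a11y_text, run_ocr):
    """Prefer structured accessibility text; fall back to OCR only
    when the tree is empty (remote desktops, games, some Linux apps).
    Returns the text plus which engine produced it."""
    if a11y_text and a11y_text.strip():
        return a11y_text, "accessibility"
    return run_ocr(), "ocr"

# normal case: accessibility tree has text, OCR never runs
text, source = extract_text("inbox - 3 unread", lambda: "ocr text")
```

because OCR is passed as a callable, the expensive path is only executed when the accessibility tree actually comes back empty.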

## crates

screenpipe is a Rust workspace with specialized crates:

```mermaid
graph TD
    APP[screenpipe-app-tauri
<i>desktop app</i>]
    SERVER[screenpipe-server
<i>REST API · routes</i>]
    DB[screenpipe-db
<i>SQLite · types</i>]
    VISION[screenpipe-vision
<i>screen capture · OCR</i>]
    AUDIO[screenpipe-audio
<i>audio capture · STT</i>]
    CORE[screenpipe-core
<i>pipes · config</i>]
    EVENTS[screenpipe-events
<i>event system</i>]
    A11Y[screenpipe-accessibility
<i>UI events · macOS, Windows</i>]
    AI[screenpipe-apple-intelligence
<i>Foundation Models</i>]
    INT[screenpipe-integrations
<i>MCP · reminders</i>]

    APP --> SERVER
    APP --> AI
    SERVER --> DB
    SERVER --> VISION
    SERVER --> AUDIO
    SERVER --> CORE
    SERVER --> EVENTS
    AUDIO --> DB
    VISION --> DB
    CORE --> DB
    A11Y --> DB
    INT --> SERVER
```

## layers

### 1. event-driven capture

screenpipe listens for meaningful OS events instead of polling at a fixed FPS. when an event fires, it captures a screenshot and walks the accessibility tree together — same timestamp, same frame.

| trigger | description |
| --- | --- |
| app switch | user switched to a different application |
| window focus | a new window gained focus |
| click / scroll | user interacted with the UI |
| typing pause | user stopped typing (debounced) |
| clipboard copy | content copied to clipboard |
| idle fallback | periodic capture every ~5s when nothing is happening |

| what | how | crate |
| --- | --- | --- |
| screen | event-triggered screenshot of the active monitor | screenpipe-vision |
| text extraction | accessibility tree walk (structured text: buttons, labels, fields) | screenpipe-accessibility |
| OCR fallback | when accessibility data is empty (remote desktops, games, some Linux apps) | screenpipe-vision |
| audio | multiple input/output devices in configurable chunks (default 30s) | screenpipe-audio |
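as a rough illustration, the debounced typing-pause trigger amounts to "fire once when no keystroke has arrived for a quiet period" (a minimal sketch with an assumed 1 s threshold; screenpipe's actual timings and implementation may differ):

```python
class TypingPauseTrigger:
    """Fires once after keystrokes stop for `quiet_secs` (debounce)."""

    def __init__(self, quiet_secs: float = 1.0):
        self.quiet_secs = quiet_secs
        self._last_keystroke = None  # timestamp of most recent keystroke

    def on_keystroke(self, now: float) -> None:
        # every keystroke resets the quiet-period clock
        self._last_keystroke = now

    def should_capture(self, now: float) -> bool:
        if self._last_keystroke is None:
            return False
        if now - self._last_keystroke >= self.quiet_secs:
            self._last_keystroke = None  # fire only once per pause
            return True
        return False
```

the reset-on-fire step is what makes this a trigger rather than a poller: one pause in typing produces exactly one capture.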

### 2. processing

| engine | type | platform | when used |
| --- | --- | --- | --- |
| accessibility tree | text extraction | macOS, Windows | primary — used for every capture |
| Apple Vision | OCR | macOS | fallback when accessibility is empty |
| Windows native | OCR | Windows | fallback when accessibility is empty |
| Tesseract | OCR | Linux | primary (accessibility support varies) |
| Whisper | speech-to-text | local, all platforms | audio transcription |
| Deepgram | speech-to-text | cloud API | optional cloud audio |

additional processing: speaker identification, PII redaction, frame deduplication (skips identical frames).
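frame deduplication can be pictured as a content-hash check: if the incoming frame hashes to the same digest as the previous one, it is skipped (a minimal sketch; screenpipe's actual dedup logic is not shown here and may compare frames by similarity rather than exact bytes):

```python
import hashlib

class FrameDeduper:
    """Skips frames whose bytes are identical to the previous frame."""

    def __init__(self):
        self._last_digest = None

    def is_new(self, frame_bytes: bytes) -> bool:
        digest = hashlib.sha256(frame_bytes).hexdigest()
        if digest == self._last_digest:
            return False  # identical to last frame: skip storage
        self._last_digest = digest
        return True
```

only frames for which `is_new` returns true would be written to disk, which is what keeps storage at a few GB per month instead of continuous video.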

### 3. storage

all data stays local on your machine:

- SQLite at `~/.screenpipe/db.sqlite` — metadata, accessibility text, OCR text, transcriptions, speakers, tags, UI elements
- media at `~/.screenpipe/data/` — JPEG screenshots (event-driven frames), audio chunks

### 4. API

REST API on localhost:3030:

| endpoint | description |
| --- | --- |
| `/search` | filtered content retrieval (OCR, audio, accessibility) |
| `/search/keyword` | keyword search with text positions |
| `/elements` | lightweight UI element search (accessibility tree data) |
| `/frames/{id}` | access captured frames |
| `/frames/{id}/context` | accessibility text + URLs + OCR fallback for a frame |
| `/health` | system status and metrics |
| `/raw_sql` | direct database queries |
| `/ai/chat/completions` | Apple Intelligence (macOS 26+) |

see API reference for the full endpoint list.
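a search request is a plain HTTP GET against `localhost:3030`. the sketch below composes such a URL; the parameter names (`q`, `content_type`, `limit`) are assumptions to verify against the API reference:

```python
from urllib.parse import urlencode

def build_search_url(query: str, content_type: str = "ocr", limit: int = 10) -> str:
    """Compose a /search URL (parameter names are illustrative)."""
    params = urlencode({"q": query, "content_type": content_type, "limit": limit})
    return f"http://localhost:3030/search?{params}"

# with screenpipe running, fetch this with any HTTP client:
url = build_search_url("standup notes", content_type="audio", limit=5)
```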

### 5. pipes

pipes are prompt files (`.md`) that define AI agents over your screen data: an agent reads the prompt, queries the screenpipe API, and takes action.

pipes live in ~/.screenpipe/pipes/{name}/ and run on cron-like schedules.
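a pipe might look like this (a hypothetical example; the file name and any frontmatter screenpipe expects are not shown here):

```markdown
<!-- ~/.screenpipe/pipes/daily-summary/pipe.md (hypothetical name and path) -->
every evening, query the screenpipe /search API for today's OCR text and
audio transcriptions, summarize what I worked on, and append the summary
to my notes file.
```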

### 6. desktop app

the desktop app is built with Tauri (Rust backend) + Next.js (React frontend):

```mermaid
graph LR
    subgraph tauri["Tauri shell"]
        RS[Rust backend
commands · permissions · tray]
        WV[WebView]
    end

    subgraph frontend["Next.js frontend"]
        PAGES[pages
chat · timeline · settings]
        STORE[Zustand stores]
        UI[shadcn/ui components]
    end

    subgraph backend["screenpipe-server"]
        API[REST API :3030]
    end

    RS --> WV
    WV --> PAGES
    PAGES --> STORE
    STORE --> UI
    PAGES --> API
```

## database schema

key tables:

| table | stores |
| --- | --- |
| `frames` | captured screen frame metadata (includes snapshot_path, accessibility_text, capture_trigger) |
| `ocr_text` | OCR fallback text extracted from frames |
| `elements` | UI elements from accessibility tree (buttons, labels, text fields) with FTS5 search |
| `audio_chunks` | audio recording metadata |
| `audio_transcriptions` | text from audio |
| `speakers` | identified speakers |
| `ui_events` | keyboard, mouse, clipboard events |
| `tags` | user-applied tags on content |

inspect directly:

```bash
sqlite3 ~/.screenpipe/db.sqlite .schema
```
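as an example query, the following sketch builds a tiny in-memory database with a simplified `frames` table (columns reduced for illustration; the real schema has more) and pulls the most recent accessibility text, roughly what the API does internally:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# simplified stand-in for screenpipe's frames table (real columns differ)
conn.execute("""CREATE TABLE frames (
    id INTEGER PRIMARY KEY,
    timestamp TEXT,
    snapshot_path TEXT,
    accessibility_text TEXT,
    capture_trigger TEXT)""")
conn.executemany(
    "INSERT INTO frames (timestamp, snapshot_path, accessibility_text, capture_trigger) "
    "VALUES (?, ?, ?, ?)",
    [("2025-01-01T09:00:00Z", "/tmp/a.jpg", "inbox - 3 unread", "app_switch"),
     ("2025-01-01T09:05:00Z", "/tmp/b.jpg", "meeting notes", "typing_pause")])

# latest capture, newest first
row = conn.execute(
    "SELECT accessibility_text, capture_trigger "
    "FROM frames ORDER BY timestamp DESC LIMIT 1").fetchone()
```

the same SELECT run through the real `/raw_sql` endpoint (or `sqlite3` directly against `~/.screenpipe/db.sqlite`) would return your actual captures.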

## resource usage

runs 24/7 on a MacBook Pro M3 (32 GB) or a $400 Windows laptop:

| metric | typical value |
| --- | --- |
| RAM | ~600 MB |
| CPU | ~5-10% |
| storage | ~5-10 GB/month (event-driven capture only stores frames when something changes) |

## source code

| component | path |
| --- | --- |
| API server | `screenpipe-server/src/` |
| screen capture | `screenpipe-vision/src/core.rs` |
| audio capture | `screenpipe-audio/src/` |
| database | `screenpipe-db/src/db.rs` |
| pipes | `screenpipe-core/src/pipes/` |
| MCP server | `screenpipe-mcp/src/index.ts` |
| desktop app | `screenpipe-app-tauri/` |