docs/BATCH_TRANSCRIPTION_SPEC.md
Whisper inference on Metal (GPU) competes with video call apps (Zoom, Meet, Teams, FaceTime) for GPU resources. Users report lag during calls and have to quit screenpipe — defeating the purpose of recording everything.
Root cause: Real-time transcription runs Whisper Large v3 Turbo (1.6GB model) on Metal every ~21-30 seconds, consuming 27-29% CPU and significant GPU bandwidth. Video call apps also need Metal for encoding/decoding.
Defer audio transcription to idle periods. Audio is still captured and saved to disk in real-time (no data loss), but Whisper inference only runs when the system isn't under load. This is acceptable because screenpipe is a search-your-history tool — users don't need live captions, they need transcriptions available before they search.
The current pipeline (real-time, blocking):

```
Capture → .mp4 to disk → Whisper inference → DB insert
```

New pipeline:

```
Capture → .mp4 to disk → [idle?] ── YES → Whisper inference → DB insert
                                 └─ NO  → queue segment metadata for later
```
Audio files are already saved to disk before transcription. The change is: instead of immediately running Whisper, we check system load first and potentially defer.
Current bug: audio_transcriptions.timestamp uses Utc::now() at DB insertion time, not audio capture time. Today this is ~3-7 seconds off (transcription latency). With batch mode, it could be minutes or hours off.
Fix: Pass the original capture timestamp through the pipeline and use it for the DB insert. The AudioInput struct already carries timing info from the capture loop — thread it through to insert_audio_transcription().
Define "idle" as ALL of:
- CPU usage below `batchCpuThreshold` (default: 70%)
- No video call app running or frontmost

Check interval: every 10 seconds.

Video call app detection: check running processes or the frontmost app name. On macOS, use NSWorkspace.shared.frontmostApplication. On Windows, check the foreground window's process name.
Option A (simple): Keep segments in the existing crossbeam channel. The channel is bounded at 1000 segments. At 30s per segment, that's ~8.3 hours of audio. If the channel fills, fall back to real-time transcription (don't drop audio).
Option B (durable): New DB table for pending segments with file path and capture timestamp. Survives app restarts. Adds complexity but handles edge cases (crash recovery, multi-day backlogs).
Recommendation: Start with Option A. 1000 segments / 8.3 hours is sufficient for any reasonable meeting day. Add Option B later if users hit limits.
When idle is detected:
- Pull pending segments from the queue and run Whisper inference on each.
- Insert results using the original capture timestamps.
- Stop draining as soon as the idle conditions no longer hold.
GPU batching optimization (future): process multiple segments per Whisper model load to reduce init/teardown overhead.
Add to the /health response:

```json
{
  "audio_pipeline": {
    ... existing fields ...
    "transcription_mode": "realtime" | "batched",
    "pending_segments": 42,
    "oldest_pending_age_secs": 1800,
    "batch_paused_reason": "cpu_high" | "video_call_detected" | null
  }
}
```
When pending > 0:

```
● recording (12 segments pending transcription)
```

When caught up:

```
● recording
```
- `transcriptionMode`: `"realtime" | "smart" | "manual"` (default: `"realtime"`)
- `batchCpuThreshold`: number (0-100, default: 70)
| Scenario | Behavior |
|---|---|
| Back-to-back meetings all day (8h) | Channel holds up to ~8.3h of segments. If exceeded, spill to real-time processing (accept GPU load) rather than drop. |
| User searches during backlog | Returns all completed transcriptions. Pending segments shown as "pending". |
| App crash with pending segments | With Option A (channel), pending segments are lost (but .mp4 files exist on disk — future: rescan). With Option B (DB), pending segments survive. |
| Laptop sleep during backlog | On wake, idle detector resumes. If CPU is low, batch processing continues. |
| User switches from "smart" to "realtime" mid-backlog | New segments go through real-time processing immediately; the existing pending backlog is drained as well, not dropped. |
| User disables audio recording | Stop capturing. Pending backlog still processes to completion. |
| Multiple audio devices | Each device's segments enter the same channel. Processing is device-agnostic. |
| Deepgram engine (not Whisper) | Batch mode still applies — Deepgram API calls are deferred too. Reduces API call frequency during meetings. |
Add to AudioPipelineMetrics:
- `segments_deferred: AtomicU64` — segments sent to the batch queue instead of real-time
- `segments_batch_processed: AtomicU64` — segments processed from the batch queue
- `batch_pause_events: AtomicU64` — number of times batch mode activated
- `batch_resume_events: AtomicU64` — number of times batch processing resumed

PostHog events:
- `batch_transcription_activated` — with reason (`cpu_high`, `video_call`)
- `batch_transcription_resumed` — with `pending_count`, `idle_duration`
- `batch_backlog_cleared` — with `total_segments`, `total_duration`

Implementation checklist:
- Fix `audio_transcriptions.timestamp` to use capture time
- Add a `transcription_paused` AtomicBool to AudioManager
- Add `pending_segments` to the health endpoint
- Add the `transcriptionMode` setting