packages/kilo-docs/pages/contributing/architecture/voice-transcription.md
Developers can code 3-5x faster by dictating rather than typing, yet Kilo Code currently has no voice input capability. This creates friction for users who want to quickly describe complex features or iterate on ideas hands-free.
This spec proposes adding live voice transcription to the chat interface, replacing the send button with a microphone icon when the text box is empty. Users can speak naturally while seeing real-time transcription appear in the input field, dramatically improving coding velocity for voice-preferred workflows.
The MVP will use OpenAI's Realtime API with FFmpeg-based audio streaming for low-latency transcription (~100ms). This mirrors the approach used by Cursor and Cline, proven to work well in VS Code environments.
The system follows a straightforward streaming architecture where user voice input is captured by FFmpeg, streamed as PCM16 audio to OpenAI's Realtime API via WebSocket, and transcribed text is displayed live in the chat input box. Typing interrupts recording instantly.
avfoundationdshow (DirectShow)alsa or pulseInstallation Check Flow:
ffmpeg -versionDocumentation Structure:
docs/user-guide/voice-transcription-setup.md
AudioCaptureService class with platform-specific FFmpeg spawning