# Voicebox
Clone any voice. Generate speech. Dictate into any app. Talk to agents in voices you own.
The full voice I/O stack, running locally on your machine.
<p align="center"> <a href="https://voicebox.sh">voicebox.sh</a> • <a href="https://docs.voicebox.sh">Docs</a> • <a href="#download">Download</a> • <a href="#features">Features</a> • <a href="#api">API</a> • <a href="docs/content/docs/overview/troubleshooting.mdx">Troubleshooting</a> </p>

<p align="center"> <em>Watch the demo video on <a href="https://voicebox.sh">voicebox.sh</a></em> </p>

Voicebox is a local-first AI voice studio — a free and open-source alternative to ElevenLabs and WisprFlow in one app. Clone voices from a few seconds of audio, generate speech in 23 languages across 7 TTS engines, dictate into any text field with a global hotkey, and give any MCP-aware AI agent a voice of your choosing.
The two cloud incumbents sit on opposite halves of the voice I/O loop — ElevenLabs on output, WisprFlow on input. Voicebox does both, bridges them with a bundled local LLM for refinement and per-profile personas, and runs the whole thing on your machine.
Expressive tags (`[laugh]`, `[sigh]`, `[gasp]`) via Chatterbox Turbo; natural-language delivery control via Qwen CustomVoice. One tool call (`voicebox.speak`) and any MCP-aware agent (Claude Code, Cursor, Cline) speaks to you in a voice you've cloned.

| Platform | Download |
|---|---|
| macOS (Apple Silicon) | Download DMG |
| macOS (Intel) | Download DMG |
| Windows | Download MSI |
| Docker | docker compose up |
Linux — Pre-built binaries are not yet available. See voicebox.sh/linux-install for build-from-source instructions.
Having trouble? See the Troubleshooting Guide for common install, generation, model-download, and GPU issues.
Seven TTS engines with different strengths, switchable per-generation:
| Engine | Languages | Strengths |
|---|---|---|
| Qwen3-TTS (0.6B / 1.7B) | 10 | High-quality multilingual cloning, delivery instructions ("speak slowly", "whisper") |
| Qwen CustomVoice | 10 | 9 curated preset voices with natural-language delivery control — no reference audio required |
| LuxTTS | English | Lightweight (~1GB VRAM), 48kHz output, 150x realtime on CPU |
| Chatterbox Multilingual | 23 | Broadest language coverage — Arabic, Danish, Finnish, Greek, Hebrew, Hindi, Malay, Norwegian, Polish, Swahili, Swedish, Turkish and more |
| Chatterbox Turbo | English | Fast 350M model with paralinguistic emotion/sound tags |
| TADA (1B / 3B) | 10 | HumeAI speech-language model — 700s+ coherent audio, text-acoustic dual alignment |
| Kokoro | 8 | 50 curated preset voices, tiny 82M model, fast CPU inference |
Only Chatterbox Turbo interprets paralinguistic tags like [laugh] and
[sigh]. Qwen3-TTS, LuxTTS, Chatterbox Multilingual, and HumeAI TADA read them
literally as text.
With Chatterbox Turbo selected, type / in the text input to open the tag
inserter and add expressive tags inline with speech:
[laugh] [chuckle] [gasp] [cough] [sigh] [groan] [sniff] [shush] [clear throat]
8 audio effects powered by Spotify's pedalboard library. Apply after generation, preview in real time, build reusable presets.
| Effect | Description |
|---|---|
| Pitch Shift | Up or down by up to 12 semitones |
| Reverb | Configurable room size, damping, wet/dry mix |
| Delay | Echo with adjustable time, feedback, and mix |
| Chorus / Flanger | Modulated delay for metallic or lush textures |
| Compressor | Dynamic range compression |
| Gain | Volume adjustment (-40 to +40 dB) |
| High-Pass Filter | Remove low frequencies |
| Low-Pass Filter | Remove high frequencies |
Ships with 4 built-in presets (Robotic, Radio, Echo Chamber, Deep Voice) and supports custom presets. Effects can be assigned per-profile as defaults.
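Conceptually, a preset is just an ordered effect chain applied to the generated audio. A minimal pure-Python sketch of that idea (the app itself uses Pedalboard; the function names and the "Deep Voice" chain below are illustrative, not Voicebox internals):

```python
def gain(db):
    # Convert a dB amount to a linear factor: +20 dB multiplies amplitude by 10.
    factor = 10 ** (db / 20)
    return lambda samples: [s * factor for s in samples]

def low_pass(alpha):
    # Naive one-pole low-pass: smooths high-frequency content, darkens the voice.
    def apply(samples):
        out, prev = [], 0.0
        for s in samples:
            prev = prev + alpha * (s - prev)
            out.append(prev)
        return out
    return apply

def make_preset(*effects):
    # A preset is an ordered chain applied left to right.
    def apply(samples):
        for fx in effects:
            samples = fx(samples)
        return samples
    return apply

# Hypothetical "Deep Voice"-style preset: roll off highs, then boost level.
deep_voice = make_preset(low_pass(0.2), gain(3.0))
processed = deep_voice([0.0, 0.5, -0.5, 0.25])
```

Chain order matters (filtering then boosting is not the same as the reverse), which is why presets store effects as an ordered list rather than a set.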
Text is automatically split at sentence boundaries and each chunk is generated independently, then crossfaded together. Works with all engines.
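The split-and-crossfade idea can be sketched as follows. This is an illustration of the technique, not Voicebox's actual implementation; the regex splitter and the overlap length are stand-ins:

```python
import re

def split_sentences(text):
    # Naive sentence splitter: break after ., !, or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def crossfade(a, b, overlap):
    # Linearly fade out the tail of `a` while fading in the head of `b`,
    # so independently generated chunks join without an audible seam.
    overlap = min(overlap, len(a), len(b))
    mixed = [
        a[len(a) - overlap + i] * (1 - i / overlap) + b[i] * (i / overlap)
        for i in range(overlap)
    ]
    return a[: len(a) - overlap] + mixed + b[overlap:]

chunks = split_sentences("First sentence. Second one! A third?")
# Each chunk would be synthesized independently, then joined pairwise:
# audio = reduce(lambda x, y: crossfade(x, y, overlap_samples), per_chunk_audio)
```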
Every generation supports multiple versions with provenance tracking.
Generation is non-blocking. Submit and immediately start typing the next one.
Multi-voice timeline editor for conversations, podcasts, and narratives.
The other half of the voice I/O loop. Hold a hotkey anywhere on your system, speak, release — on macOS the transcript pastes straight into the focused text field. Or hit the mic on any Voicebox text input and dictate directly into the app.
Pressing Space mid-hold upgrades into a toggle session without a gap in audio. An on-screen pill cycles through recording, transcribing, refining, and speaking states; it's the same pill agents use when they speak to you, so there's one mental model for both directions of the loop.

Voicebox runs OpenAI Whisper for transcription — the same model that backs dictation, the Captures tab, and the /transcribe API. It runs on MLX (Apple Silicon) or PyTorch (CUDA / ROCm / DirectML / CPU) depending on your platform.
| Size | Notes |
|---|---|
| Base / Small / Medium / Large | Standard Whisper quality ladder |
| Turbo | ~8x faster than Whisper Large, minimal quality loss |
More engines (Parakeet v3, Qwen3-ASR) are planned — see Roadmap.
Every dictation, in-app recording, and uploaded audio file lands in the Captures tab — original audio paired with transcript, always preserved.
Every agent gets a voice. One tool call and any MCP-aware agent can speak to you in a voice you've cloned — task completions, questions, notifications. The same pill that surfaces during dictation surfaces during agent speech, so you always see what's coming out of your machine.
// In any MCP-aware agent:
await voicebox.speak({
text: "Deploy complete.",
profile: "Morgan",
});
Also exposed as POST /speak for anything that doesn't speak MCP — ACP, A2A, shell scripts, custom harnesses.
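For example, a Python script can hit the same endpoint with nothing but the standard library. The URL, body fields, and `X-Voicebox-Client-Id` header come from the README; actually sending the request assumes a Voicebox instance is running on the default port:

```python
import json
import urllib.request

def build_speak_request(text, profile, client_id="my-script"):
    # Build the POST /speak request described above. The client id header
    # lets Voicebox apply per-client voice bindings.
    body = json.dumps({"text": text, "profile": profile}).encode("utf-8")
    return urllib.request.Request(
        "http://127.0.0.1:17493/speak",
        data=body,
        headers={
            "Content-Type": "application/json",
            "X-Voicebox-Client-Id": client_id,
        },
        method="POST",
    )

req = build_speak_request("Deploy complete.", "Morgan")
# urllib.request.urlopen(req)  # fire it once Voicebox is running locally
```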
- recording, transcribing, refining, and speaking are all states of the same OS-level overlay, so dictation and agent speech share one surface
- a `last_seen_at` timestamp confirms the install actually took
- stdio clients point at the bundled `voicebox-mcp` binary

Attach a free-form personality to any voice profile — who this voice is, how they speak, what they care about. Two actions appear on the generate box when a personality is set, powered by a bundled Qwen3 LLM running entirely locally.
Agents can reach the same rewrite path over MCP by passing personality: true to voicebox.speak, turning the tool into a text-in → personality-LLM → TTS pipeline. The same LLM backs dictation's refinement step — one LLM in the app, one model cache, one GPU-memory footprint.
Local LLM options: Qwen3 0.6B / 1.7B / 4B, sharing the TTS runtime (MLX on Apple Silicon, PyTorch elsewhere).
Use cases: agent dev loops (dictate a question, hear the answer in a cloned voice), interactive characters for games and narrative tools, speech assistance for people who can't speak in their original voice.
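The text-in → personality-LLM → TTS path can be pictured as a small pipeline. The functions and profile fields below are stand-ins for illustration, not Voicebox's internal API:

```python
def speak(text, profile, rewrite_with_llm, synthesize, use_personality=False):
    # Mirrors the speak flow: optionally rewrite the text through the
    # profile's personality LLM, then hand the result to the TTS engine.
    if use_personality and profile.get("personality"):
        text = rewrite_with_llm(profile["personality"], text)
    return synthesize(text, profile["voice_id"])

# Stand-in components, just to show the data flow:
profile = {"voice_id": "morgan", "personality": "dry, laconic ops engineer"}
rewrite = lambda persona, text: f"[{persona}] {text}"
tts = lambda text, voice: (voice, text)

audio = speak("Deploy complete.", profile, rewrite, tts, use_personality=True)
```

Because the rewrite step is just a function in the middle of the pipeline, the same LLM instance can serve both agent speech and dictation refinement, which is how the app keeps one model cache and one GPU-memory footprint.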
The `VOICEBOX_MODELS_DIR` environment variable overrides where downloaded models are stored.

| Platform | Backend | Notes |
|---|---|---|
| macOS (Apple Silicon) | MLX (Metal) | 4-5x faster via Neural Engine |
| Windows / Linux (NVIDIA) | PyTorch (CUDA) | Auto-downloads CUDA binary from within the app |
| Linux (AMD) | PyTorch (ROCm) | Auto-configures HSA_OVERRIDE_GFX_VERSION |
| Windows (any GPU) | DirectML | Universal Windows GPU support |
| Intel Arc | IPEX/XPU | Intel discrete GPU acceleration |
| Any | CPU | Works everywhere, just slower |
Voicebox exposes a REST API for integrating voice I/O into your own apps and agents.
# Generate speech
curl -X POST http://127.0.0.1:17493/generate \
-H "Content-Type: application/json" \
-d '{"text": "Hello world", "profile_id": "abc123", "language": "en"}'
# Agent voice output — any app or script can speak in a cloned voice
curl -X POST http://127.0.0.1:17493/speak \
-H "Content-Type: application/json" \
-H "X-Voicebox-Client-Id: my-script" \
-d '{"text": "Deploy complete.", "profile": "Morgan"}'
# Transcribe an audio file
curl -X POST http://127.0.0.1:17493/transcribe \
-F "[email protected]" \
-F "model=whisper-turbo"
# List voice profiles
curl http://127.0.0.1:17493/profiles
POST /speak accepts profile as a name (case-insensitive) or id, and resolves via the same precedence as the MCP tool: explicit arg → per-client binding → capture_settings.default_playback_voice_id.
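That precedence can be sketched as a small resolver. Only the precedence order and the `default_playback_voice_id` key come from the README; the function signature and data shapes are assumptions for illustration:

```python
def resolve_profile(explicit, client_bindings, client_id, settings, profiles):
    # Precedence: explicit arg -> per-client binding -> global default.
    candidate = (
        explicit
        or client_bindings.get(client_id)
        or settings.get("default_playback_voice_id")
    )
    if candidate is None:
        return None
    # `profile` may be an id or a case-insensitive name.
    for p in profiles:
        if p["id"] == candidate or p["name"].lower() == str(candidate).lower():
            return p
    return None

profiles = [{"id": "abc123", "name": "Morgan"}]
assert resolve_profile("morgan", {}, "cursor", {}, profiles)["id"] == "abc123"
```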
Voicebox ships a built-in Model Context Protocol server so any MCP-aware agent (Claude Code, Cursor, Windsurf, Cline, VS Code MCP extensions) can speak, transcribe, and browse captures and profiles.
Claude Code one-liner:
claude mcp add voicebox \
--transport http \
--url http://127.0.0.1:17493/mcp \
--header "X-Voicebox-Client-Id: claude-code"
Any HTTP MCP client (Cursor, Windsurf, VS Code, etc.):
{
"mcpServers": {
"voicebox": {
"url": "http://127.0.0.1:17493/mcp",
"headers": { "X-Voicebox-Client-Id": "cursor" }
}
}
}
Stdio fallback for clients that don't speak HTTP MCP — point at the bundled voicebox-mcp binary inside the app:
{
"mcpServers": {
"voicebox": {
"command": "/Applications/Voicebox.app/Contents/MacOS/voicebox-mcp",
"env": { "VOICEBOX_CLIENT_ID": "claude-desktop" }
}
}
}
Four tools ship: voicebox.speak, voicebox.transcribe, voicebox.list_captures, voicebox.list_profiles. Per-client voice bindings are managed in Voicebox → Settings → MCP. See the full MCP guide for tool signatures, resolution precedence, the speaking-pill contract, and security notes.
// In any MCP-aware agent:
await voicebox.speak({
text: "Tests passing. Ready to merge.",
profile: "Morgan", // optional — falls back to the per-client binding
personality: true, // optional — rewrites text through the profile's personality LLM first
});
Use cases: agent dev loops (voice in, voice out), game dialogue, podcast production, accessibility tools, voice assistants, content automation.
Full API documentation available at http://127.0.0.1:17493/docs.
| Layer | Technology |
|---|---|
| Desktop App | Tauri (Rust) |
| Frontend | React, TypeScript, Tailwind CSS |
| State | Zustand, React Query |
| Backend | FastAPI (Python) |
| TTS Engines | Qwen3-TTS, Qwen CustomVoice, LuxTTS, Chatterbox, Chatterbox Turbo, TADA, Kokoro |
| STT | Whisper / Whisper Turbo (PyTorch or MLX) |
| Local LLM | Qwen3 (0.6B / 1.7B / 4B), shared runtime with TTS / STT |
| MCP Server | FastMCP mounted at /mcp (Streamable HTTP) + bundled stdio shim binary |
| Native Shim | Rust (inside Tauri) for global hotkey, paste injection, focus introspection |
| Effects | Pedalboard (Spotify) |
| Inference | MLX (Apple Silicon) / PyTorch (CUDA/ROCm/XPU/CPU) |
| Database | SQLite |
| Audio | WaveSurfer.js, librosa |
| Feature | Description |
|---|---|
| Windows / Linux auto-paste | Dictation paste parity — SendInput on Windows, uinput / AT-SPI on Linux |
| STT engine expansion | Parakeet v3 and Qwen3-ASR joining Whisper — 50+ languages, better non-English quality |
| Pipeline routing | Configurable source → transform → sink chains with webhook + MCP sinks and a preset editor |
| Streaming transcription | WebSocket /transcribe/stream for partial transcripts as you speak |
| End-to-end speech LLMs | Moshi, GLM-4-Voice, Qwen2.5 Omni — real voice-to-voice, no text between |
| Voice Design | Create new voices from text descriptions |
| Long-form capture | Dual-stream recorder (mic + system audio) with summary LLM transform |
| Platform sinks | Apple Notes, Obsidian, and other opt-in integrations |
| Plugin architecture | Extend with custom models, transforms, and sinks |
| Mobile companion | Control Voicebox from your phone |
For the full engineering status, open-issue triage, and prioritized work queue, see docs/PROJECT_STATUS.md — a living document that tracks what's shipped, what's in-flight, candidate TTS engines under evaluation, and why we've accepted or backlogged specific integrations.
See CONTRIBUTING.md for detailed setup and contribution guidelines.
git clone https://github.com/jamiepine/voicebox.git
cd voicebox
just setup # creates Python venv, installs all deps
just dev # starts backend + desktop app
Install just: brew install just or cargo install just. Run just --list to see all commands.
Prerequisites: Bun, Rust, Python 3.11+, Tauri Prerequisites, and Xcode on macOS.
The repo ships a pre-wired .mcp.json at the root — running Claude Code inside this checkout picks up the Voicebox MCP tools automatically once the dev app is running.
just build # Build CPU server binary + Tauri app
just build-local # (Windows) Build CPU + CUDA server binaries + Tauri app
The multi-engine architecture makes adding new TTS engines straightforward. A step-by-step guide covers the full process: dependency research, backend protocol implementation, frontend wiring, and PyInstaller bundling.
The guide is optimized for AI coding agents. An agent skill can pick up a model name and handle the entire integration autonomously — you just test the build locally.
voicebox/
├── app/ # Shared React frontend
├── tauri/ # Desktop app (Tauri + Rust)
├── web/ # Web deployment
├── backend/ # Python FastAPI server
├── landing/ # Marketing website
└── scripts/ # Build & release scripts
Contributions welcome! See CONTRIBUTING.md for guidelines.
Found a security vulnerability? Please report it responsibly. See SECURITY.md for details.
MIT License — see LICENSE for details.