Voicebox Project Status & Roadmap

Last updated: 2026-04-18 | Current version: v0.4.1 | 232 open issues | 12 open PRs

Architecture Overview
Current State
Open PRs — Triage & Analysis
Open Issues — Categorized
Existing Plan Documents — Status
New Model Integration — Landscape
Architectural Bottlenecks
Recommended Priorities

Architecture Overview

A Tauri shell (Rust) hosts a React frontend (app/) that talks over HTTP on localhost:17493 to a FastAPI backend (backend/).

The backend exposes:

TTSBackend Protocol with seven concrete engine implementations:
- Qwen3-TTS (PyTorch or MLX depending on platform)
- Qwen CustomVoice (predefined speakers with instruct)
- LuxTTS (fast, CPU-friendly)
- Chatterbox Multilingual (23 languages)
- Chatterbox Turbo (English, paralinguistic tags)
- TADA (1B English, 3B multilingual via HumeAI)
- Kokoro 82M (pre-built voices, CPU realtime)
STTBackend Protocol for Whisper (PyTorch or MLX-Whisper)
Profiles / History / Stories services for persistence and timeline editing

Key Files

Layer	File	Purpose
Backend entry	`backend/main.py`	FastAPI app, all API routes (~2850 lines)
TTS protocol	`backend/backends/__init__.py:32-101`	`TTSBackend` Protocol definition
Model registry	`backend/backends/__init__.py:17-29,153-366`	`ModelConfig` dataclass + registry helpers
TTS factory	`backend/backends/__init__.py:382-426`	Thread-safe engine registry (double-checked locking)
PyTorch TTS	`backend/backends/pytorch_backend.py`	Qwen3-TTS via `qwen_tts` package
MLX TTS	`backend/backends/mlx_backend.py`	Qwen3-TTS via `mlx_audio.tts`
LuxTTS	`backend/backends/luxtts_backend.py`	LuxTTS — fast, CPU-friendly
Chatterbox MTL	`backend/backends/chatterbox_backend.py`	Chatterbox Multilingual — 23 languages
Chatterbox Turbo	`backend/backends/chatterbox_turbo_backend.py`	Chatterbox Turbo — English, paralinguistic tags
TADA	`backend/backends/hume_backend.py`	HumeAI TADA — 1B English + 3B Multilingual
Kokoro	`backend/backends/kokoro_backend.py`	Kokoro 82M — CPU realtime, pre-built voices
Qwen CustomVoice	`backend/backends/qwen_custom_voice_backend.py`	Qwen CustomVoice — predefined speakers with instruct
Platform detect	`backend/platform_detect.py`	Apple Silicon → MLX, else → PyTorch
API types	`backend/models.py`	Pydantic request/response models
HF progress	`backend/utils/hf_progress.py`	HFProgressTracker (tqdm patching for download progress)
Audio utils	`backend/utils/audio.py`	`trim_tts_output()`, normalize, load/save audio
Frontend API	`app/src/lib/api/client.ts`	Hand-written fetch wrapper
Frontend types	`app/src/lib/api/types.ts`	TypeScript API types
Engine selector	`app/src/components/Generation/EngineModelSelector.tsx`	Shared engine/model dropdown
Generation form	`app/src/components/Generation/GenerationForm.tsx`	TTS generation UI
Floating gen box	`app/src/components/Generation/FloatingGenerateBox.tsx`	Compact generation UI
Model manager	`app/src/components/ServerSettings/ModelManagement.tsx`	Model download/status/progress UI
GPU acceleration	`app/src/components/ServerSettings/GpuAcceleration.tsx`	CUDA backend swap UI
Gen form hook	`app/src/lib/hooks/useGenerationForm.ts`	Form validation + submission
Language constants	`app/src/lib/constants/languages.ts`	Per-engine language maps
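The "Platform detect" row above describes a simple rule: Apple Silicon selects MLX, everything else selects PyTorch. A minimal sketch of that rule, assuming detection can rely on the standard-library platform module; the function name pick_tts_runtime and its parameters are hypothetical and not taken from backend/platform_detect.py:

```python
import platform
from typing import Optional


def pick_tts_runtime(system: Optional[str] = None, machine: Optional[str] = None) -> str:
    """Return "mlx" on Apple Silicon and "pytorch" everywhere else.

    Both arguments are injectable so the rule can be tested on any host;
    when omitted they are read from the running interpreter.
    """
    system = system if system is not None else platform.system()
    machine = machine if machine is not None else platform.machine()
    if system == "Darwin" and machine == "arm64":
        return "mlx"
    return "pytorch"
```

In this sketch a Python process running under Rosetta reports x86_64 and would therefore take the PyTorch path, which may or may not match what the real module does.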

How TTS Generation Works (Current Flow)

POST /generate
  1. Look up voice profile from DB
  2. Resolve engine from request (qwen | qwen_custom_voice | luxtts | chatterbox | chatterbox_turbo | tada | kokoro)
  3. Get backend: get_tts_backend_for_engine(engine)  # thread-safe singleton per engine
  4. Check model cache → if missing, trigger background download, return HTTP 202
  5. Load model (lazy): tts_backend.load_model(model_size)
  6. Create voice prompt: profiles.create_voice_prompt_for_profile(engine=engine)
       → tts_backend.create_voice_prompt(audio_path, reference_text)
  7. Generate: tts_backend.generate(text, voice_prompt, language, seed, instruct)
  8. Post-process: trim_tts_output() for Chatterbox engines
  9. Save WAV → data/generations/{id}.wav
  10. Insert history record in SQLite
  11. Return GenerationResponse
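
The control flow above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in backend/main.py: StubBackend, is_cached, trim_trailing_silence, and the request dict shape are hypothetical stand-ins, and steps 1, 3, 9, and 10 (database, registry lookup, file save) are left out.

```python
from dataclasses import dataclass

# Step 8 only applies to the Chatterbox engines.
TRIMMED_ENGINES = {"chatterbox", "chatterbox_turbo"}


@dataclass
class StubBackend:
    """Stand-in for one engine backend; every method is a placeholder."""
    cached: bool = True
    loaded: bool = False

    def is_cached(self, model_size: str) -> bool:  # hypothetical helper
        return self.cached

    def load_model(self, model_size: str) -> None:
        self.loaded = True

    def create_voice_prompt(self, audio_path: str, reference_text: str) -> dict:
        return {"ref_audio": audio_path, "ref_text": reference_text}

    def generate(self, text, voice_prompt, language, seed, instruct):
        # One fake sample per character, then four zeros to mimic trailing silence.
        return [0.5] * len(text) + [0.0] * 4, 24000


def trim_trailing_silence(samples: list) -> list:
    """Toy replacement for trim_tts_output(): drop exact-zero samples at the end."""
    end = len(samples)
    while end > 0 and samples[end - 1] == 0.0:
        end -= 1
    return samples[:end]


def handle_generate(request: dict, backend: StubBackend) -> tuple:
    engine = request["engine"]                                  # step 2
    if not backend.is_cached(request["model_size"]):            # step 4
        return 202, {"status": "downloading"}
    if not backend.loaded:                                      # step 5 (lazy load)
        backend.load_model(request["model_size"])
    prompt = backend.create_voice_prompt(                       # step 6
        request["audio_path"], request["reference_text"])
    samples, sample_rate = backend.generate(                    # step 7
        request["text"], prompt, request["language"],
        request.get("seed"), request.get("instruct"))
    if engine in TRIMMED_ENGINES:                               # step 8
        samples = trim_trailing_silence(samples)
    return 200, {"num_samples": len(samples), "sample_rate": sample_rate}
```

Returning early with 202 keeps the request from blocking on a multi-gigabyte download; the caller is expected to retry once the model is cached.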

Current State

What's Shipped (v0.4.x)

New since v0.3.0:

Kokoro 82M TTS engine + voice profile type system (PR #325)
Qwen CustomVoice preset engine — predefined speakers with instruct support (PR #328)
Intel Arc (XPU) GPU support (PR #320)
Blackwell GPU (sm_120) CUDA support (PR #401)
Generation cancellation flow (PR #444)
Frontend quality gates + TypeScript hardening (PR #418)
macOS Intel (x86_64) PyTorch compatibility (PR #416)
Frozen-binary import fixes for Kokoro / Chatterbox Multilingual / scipy / transformers (PR #438)
Linux PipeWire/PulseAudio monitor detection (PR #457)
Server survives GUI close on Windows (PR #402)
GPU arch compatibility warning on startup (catches unsupported PyTorch builds)
cpal Stream playback reliability (PR #405), clip-splitting stability (PR #403)
Fix for torch.from_numpy crash with numpy 2.x in frozen binary (PR #361)
Async CUDA download lock (PR #428), NUMBA_CACHE_DIR env var (PR #425)
"Clear failed" history button (PR #412)
External server GUI startup + data refresh (PR #319)
Force offline mode for cached Qwen/Whisper models (PR #318)
macOS 11 ScreenCaptureKit launch crash fix (PR #424)

Core TTS (cumulative):

Qwen3-TTS voice cloning (1.7B and 0.6B models, MLX + PyTorch)
Qwen CustomVoice (preset speakers, instruct)
LuxTTS — fast, CPU-friendly English TTS (PR #254)
Chatterbox Multilingual — 23 languages including Hebrew (PR #257)
Chatterbox Turbo — paralinguistic tags, low latency English (PR #258)
HumeAI TADA — 1B English + 3B Multilingual (PR #296)
Kokoro 82M — CPU-realtime, 8 languages, Apache 2.0 (PR #325)
Multi-engine architecture with thread-safe backend registry (PR #254)
Chunked TTS generation — engine-agnostic, removes ~500 char limit (PR #266)
Async generation queue (PR #269)
Post-processing audio effects system (PR #271)
Voice profile type system (preset vs cloned, engine compatibility gating)
Centralized ModelConfig registry — no per-engine dispatch maps
Shared EngineModelSelector component

Infrastructure (cumulative):

CUDA backend swap via binary download (PR #252), cu128 upgrade (PR #316), Blackwell/sm_120 (PR #401)
CUDA backend split into independently versioned server + libs archives (PR #298)
Intel Arc XPU support (PR #320)
Docker + web deployment (PR #161)
Backend refactor: modular architecture, style guide, tooling (PR #285)
Settings overhaul: routed sub-tabs, server logs, changelog, about page (PR #294)
Windows support: CUDA detection, cross-platform justfile, server lifecycle (PR #272, #402)
Linux audio capture via pactl monitor detection (PR #457)
macOS Intel x86_64 compatibility (PR #416)
Voice profiles with multi-sample support
Stories editor (multi-track DAW timeline)
Whisper transcription (base, small, medium, large, turbo variants)
Model management UI with inline download progress + folder migration (PR #268)
Download cancel/clear UI with error panel (PR #238)
Generation history with caching and cancellation (PR #444)
Streaming generation endpoint (MLX only)
Audio player freeze fix + UX improvements (PR #293)
CORS restriction to known local origins (PR #88)

Abandoned / Backlogged Integrations

Model	PR / Branch	Reason
CosyVoice2/3	PR #311	Output quality too poor. Heavy deps, no PyPI, needed 5+ shims. PR should be closed.
VoxCPM 1.5 / VoxCPM2	`voicebox-new-models` research (2026-04-18)	Backlogged. See detailed analysis below.

VoxCPM — Evaluation Notes (2026-04-18)

Project: OpenBMB/VoxCPM — tokenizer-free TTS, 2B params (VoxCPM2), end-to-end diffusion autoregressive architecture, 30 languages, 48 kHz output, Apache 2.0, pip install voxcpm.

Why it looked interesting:

Clean PyPI install (pip install voxcpm)
Apache 2.0 — commercially safe
Voice cloning via reference_wav_path with optional prompt_wav_path + prompt_text for "ultimate" cloning
Streaming API via generate_streaming()
Zero-shot cloning + style control via parenthetical prefixes in text ((slightly faster, cheerful tone)...)
Relatively high-quality output per demos

Why we backlogged it:

Effectively CUDA-only. The README states CUDA ≥ 12.0 as a hard requirement. The source code's from_pretrained(device=None|"auto") claims "preferring CUDA, then MPS, then CPU," but in practice:
- MPS (Apple Silicon) broken upstream — OpenBMB/VoxCPM issues #232 (NotImplementedError: Output channels > 65536 not supported at the MPS device) and #248 (IndexError on M3 Mac) are both open with no resolution.
- CPU unsupported in the Python package — issue #256 shows voxcpm --device cpu rejected with unrecognized arguments. The only CPU path is the third-party VoxCPM.cpp GGML engine, which is a separate ecosystem project, not pip install voxcpm.
- macOS source install fails — issue #233 open with no resolution.
Would require CUDA-only gating in UI (new requires_cuda flag on ModelConfig, lock icon + "Requires NVIDIA GPU" in ModelManagement.tsx / EngineModelSelector.tsx) plus a hard error at load_model() as safety net. Doable but adds first-class platform gating that doesn't exist for any other engine today.
Voicebox's user base skews Apple Silicon (MLX is a primary backend). Shipping a CUDA-only model sets a precedent worth a separate scoping discussion (see issues #419 engine sprawl, #420 platform tiers, PR #465).

What would change the decision:

Upstream fixes MPS crashes (watch issues #232, #248).
We define an "experimental / CUDA-only" engine tier as part of issue #419 / PR #465, and decide it's acceptable to ship engines that are hidden on non-NVIDIA platforms.
VoxCPM.cpp matures into a viable CPU path we can wrap (currently separate project, C++/GGML, unclear ergonomics).

Integration shape if we revive it: Zero-shot cloning maps naturally to the Chatterbox-style backend (store ref_audio + ref_text paths in the voice prompt dict, process at generate time). Est. ~250 lines for voxcpm_backend.py + one ModelConfig entry + engine registration in backends/__init__.py. Frontend UI gating is the bigger lift.

What's In-Flight

Feature	Branch/PR	Status
Platform support tiers	PR #465, issue #420	Defining tier-1 (supported) vs tier-2 (community) platforms
Engine sprawl cleanup	issue #419	First-class vs experimental TTS backends distinction
Frontend tech-debt burn-down	issue #421	Biome + a11y debt before gating CI
Docker registry auto-publish	PR #463, issue #453	ghcr.io image on tag push
New model research	`voicebox-new-models` branch	Evaluating Fish Speech, XTTS-v2, Pocket TTS, VibeVoice, Fish Audio S2, index-tts2

TTS Engine Comparison

Engine	Model Name	Profile Type	Languages	Size	Key Features	Instruct Support
Qwen3-TTS 1.7B	`qwen-tts-1.7B`	Cloned	10 (zh, en, ja, ko, de, fr, ru, pt, es, it)	~3.5 GB	Highest quality, voice cloning	None (Base model has no instruct path)
Qwen3-TTS 0.6B	`qwen-tts-0.6B`	Cloned	10	~1.2 GB	Lighter, faster	None
Qwen CustomVoice 1.7B	`qwen-custom-voice-1.7B`	Preset	10	~3.5 GB	Predefined speakers, instruct support	Yes
Qwen CustomVoice 0.6B	`qwen-custom-voice-0.6B`	Preset	10	~1.2 GB	Predefined speakers, instruct support	Yes
LuxTTS	`luxtts`	Cloned	English	~300 MB	CPU-friendly, 48 kHz, fast	None
Chatterbox	`chatterbox-tts`	Cloned	23 (incl. Hebrew, Arabic, Hindi, etc.)	~3.2 GB	Zero-shot cloning, multilingual	Partial — `exaggeration` float (0-1)
Chatterbox Turbo	`chatterbox-turbo`	Cloned	English	~1.5 GB	Paralinguistic tags ([laugh], [cough]), 350M params, low latency	Partial — inline tags only
TADA 1B	`tada-1b`	Cloned	English	~4 GB	HumeAI speech-language model, 700s+ coherent audio	None
TADA 3B Multilingual	`tada-3b-ml`	Cloned	10 (en, ar, zh, de, es, fr, it, ja, pl, pt)	~8 GB	Multilingual, text-acoustic dual alignment	None
Kokoro 82M	`kokoro`	Preset	8 (en, es, fr, hi, it, pt, ja, zh)	~350 MB	82M params, CPU realtime, Apache 2.0, pre-built voices	None
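
The Profile Type column above implies a compatibility rule between voice profiles and engines. A minimal sketch of that rule, using the engine identifiers from the /generate flow and the profile types from the table; the function names and the "fall back to the first compatible engine" behaviour are assumptions, not the shipped UI logic:

```python
# Engine id -> profile type, transcribed from the comparison table.
ENGINE_PROFILE_TYPE = {
    "qwen": "cloned",
    "qwen_custom_voice": "preset",
    "luxtts": "cloned",
    "chatterbox": "cloned",
    "chatterbox_turbo": "cloned",
    "tada": "cloned",
    "kokoro": "preset",
}


def compatible_engines(profile_type: str) -> list:
    """Engines that accept the given profile type, in table order."""
    engines = [e for e, t in ENGINE_PROFILE_TYPE.items() if t == profile_type]
    if not engines:
        raise ValueError(f"unknown profile type: {profile_type!r}")
    return engines


def resolve_engine(selected: str, profile_type: str) -> str:
    """Keep the selected engine if it fits the profile, else switch to the first that does."""
    engines = compatible_engines(profile_type)
    return selected if selected in engines else engines[0]
```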

Multi-Engine Architecture (Shipped)

Thread-safe backend registry (_tts_backends dict + _tts_backends_lock) with double-checked locking
Per-engine backend instances — each engine gets its own singleton, loaded lazily
Engine field on GenerationRequest — frontend sends engine: 'qwen' | 'qwen_custom_voice' | 'luxtts' | 'chatterbox' | 'chatterbox_turbo' | 'tada' | 'kokoro'
Per-engine language filtering — ENGINE_LANGUAGES map in frontend, backend regex accepts all languages
Per-engine voice prompts — create_voice_prompt_for_profile() dispatches to the correct backend
Profile type system — preset vs cloned profiles, UI grays out incompatible engines and auto-switches on selection
Trim post-processing — trim_tts_output() for Chatterbox engines (cuts trailing silence/hallucination)
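
The registry's double-checked locking can be sketched as follows. The names _tts_backends, _tts_backends_lock, and get_tts_backend_for_engine come from this document; the function body and the DummyBackend class are assumptions written for illustration, not a copy of backend/backends/__init__.py.

```python
import threading

_tts_backends: dict = {}
_tts_backends_lock = threading.Lock()


class DummyBackend:
    """Placeholder engine that counts how many instances were ever built."""
    instances_created = 0

    def __init__(self, engine: str):
        self.engine = engine
        # Only ever incremented while the registry lock is held.
        DummyBackend.instances_created += 1


def get_tts_backend_for_engine(engine: str) -> DummyBackend:
    backend = _tts_backends.get(engine)            # first check, lock-free fast path
    if backend is None:
        with _tts_backends_lock:
            backend = _tts_backends.get(engine)    # second check, under the lock
            if backend is None:
                backend = DummyBackend(engine)
                _tts_backends[engine] = backend
    return backend
```

The lock-free first check keeps the common case (backend already created) cheap, while the second check under the lock prevents two threads that both missed the fast path from constructing two instances.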

Known Limitations

HF XET progress: Large files downloaded via hf-xet (HuggingFace's new transfer backend) report n=0 in tqdm updates. Progress bars may appear stuck for large .safetensors files even though the download is proceeding. This is a known upstream limitation.
Chatterbox Turbo upstream token bug: from_pretrained() passes token=os.getenv("HF_TOKEN") or True which fails without a stored HF token. Our backend works around this by calling snapshot_download(token=None) + from_local().
chatterbox-tts must install with --no-deps: It pins numpy<1.26, torch==2.6.0, transformers==4.46.3 — all incompatible with our stack (Python 3.12, torch 2.10, transformers 4.57.3). Sub-deps listed explicitly in requirements.txt.
Instruct parameter partially shipped (#224, #303): Qwen CustomVoice (PR #328) now provides real instruct support via predefined speakers. Other backends still silently drop the instruct field — the UI exposes the field broadly but most engines ignore it. The floating generate box was patched to restore instruct for CustomVoice (commit 106aec4).
Streaming generation only works for Qwen on MLX. Other engines use the non-streaming /generate endpoint.
dicta-onnx (Hebrew diacritization) not included — upstream Chatterbox bug requires model_path arg but calls Dicta() with none. Hebrew works fine without it.
Blackwell (RTX 50-series) CUDA: cu128 + sm_120 kernel support shipped (PR #401, #316), but users still report cudaErrorNoKernelImageForDevice (#417, #400, #396, #395, #390, #362) — likely a stale CUDA binary on upgraded installs. Needs a follow-up diagnostic / forced re-download path.
Long text 50k character limit (#464, #365, #354): Still hit on GPU despite chunking (PR #266). Chunking reliability needs another pass.
ROCm on RDNA 3/4 (#469): HSA_OVERRIDE_GFX_VERSION is hardcoded and harms newer cards.
flash-attn is not installed warning on every platform (cosmetic, common user complaint): Our transformer-based engines (Chatterbox / Qwen) emit Warning: flash-attn is not installed. Will only run the manual PyTorch version. Please install flash-attn for faster inference. on every startup, on every platform. We don't pin flash-attn in requirements because installing it is fragile and version-sensitive. The fallback is PyTorch SDPA, which is near-FA2 throughput on Ampere+ and is what actually runs. Per-platform reality:
- macOS/Apple Silicon: FlashAttention is CUDA-only, so it is irrelevant here; MLX has its own attention kernels.
- Linux: pip install flash-attn --no-build-isolation works but takes 20+ min to compile.
- Windows: no official support (Dao-AILab README still says only "Might work"; source builds routinely fail on recent CUDA/MSVC, issues #1715, #1828, #2395). Windows users can install community prebuilt wheels from kingbri1/flash-attention or bdashore3/flash-attention (latest v2.8.3, Aug 2025; win_amd64 wheels for CUDA 12.4/12.8, Torch 2.6–2.9, Python 3.10–3.13) matching their exact CUDA/Torch/Python, or use WSL2. Native-Windows alternatives worth considering as a build-time swap: SageAttention (thu-ml, Apache 2.0, claims 2–5× over FA2) and xformers (official Windows wheels).
Action for us: the troubleshooting doc now covers it (see docs/content/docs/overview/troubleshooting.mdx), and we should optionally suppress the warning via logging.getLogger(...).setLevel(ERROR) at backend import since the fallback is functionally fine.
WebAudio playback dies after audio-session interruption (#41, plus an internal repro where the app is backgrounded long enough): WaveSurfer's AudioContext gets suspended by macOS, either because another app grabs the audio output or because the WKWebView throttles when backgrounded. play() resolves and timeupdate can still fire, but no audio reaches the output. Only an app restart fixes it. PR #293 was a prior partial fix that doesn't cover this path. Things already tried that didn't work:
- Swapping the WaveSurfer backend away from WebAudio: introduced more bugs, not an option.
- Remount hook on the player: doesn't help, because a freshly created AudioContext is born suspended and only resumes on a user gesture.
Next thing to try (not yet attempted, confirmed via grep of AudioPlayer.tsx): call wavesurfer.getMediaElement().getGainNode().context.resume() on the play button click (the click itself is a valid user gesture), plus a visibilitychange + statechange listener as belt-and-suspenders. The ctx.resume() pattern already exists in the codebase at useStoryPlayback.ts:52; it is just not wired into the main player.
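
For the flash-attn warning item above, the proposed suppression can be sketched with the standard logging module. This assumes the warning is emitted through logging rather than print; the logger name below is hypothetical because this document does not say which logger produces the message. The filter variant is an alternative we add for comparison, not something the action item asks for.

```python
import logging

# Hypothetical name; substitute whichever logger actually emits the warning.
NOISY_LOGGER_NAME = "example_tts_lib.attention"


def quiet_logger(name: str) -> None:
    """Blunt approach from the action item: hide everything below ERROR."""
    logging.getLogger(name).setLevel(logging.ERROR)


class DropFlashAttnWarning(logging.Filter):
    """Targeted approach: drop only the flash-attn message, keep other warnings."""

    def filter(self, record: logging.LogRecord) -> bool:
        return "flash-attn is not installed" not in record.getMessage()
```

quiet_logger(NOISY_LOGGER_NAME) would run once at backend import. Attaching DropFlashAttnWarning() with logger.addFilter(...) is narrower, since it leaves unrelated warnings from the same logger visible.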

Open PRs — Triage & Analysis

Recently Merged (Since Last Update — 2026-03-18 → 2026-04-18)

PR	Title	Merged
#481	fix(build): pin transformers in MLX requirements to prevent 5.x upgrade	2026-04-19
#470	fix(api-client): declare moved + errors on migrateModels response type	2026-04-18
#457	fix(linux): use pactl to detect PipeWire/PulseAudio monitor	2026-04-18
#450	docs: clarify paralinguistic tag support in quick start	2026-04-18
#447	fix: delete version rows and files in delete_generations_by_profile	2026-04-18
#444	Fix generation cancellation flow	2026-04-18
#440	fix(paths): strip legacy "data/" prefix when resolving stored paths	2026-04-18
#439	Fix migration dialog hanging when no models are present	2026-04-18
#438	fix(build): repair frozen-binary imports for kokoro/chatterbox-multilingual/scipy/transformers	2026-04-18
#433	fix: warn user when no models to migrate during storage change	2026-04-18
#425	Add NUMBA_CACHE_DIR environment variable	2026-04-16
#424	fix: avoid ScreenCaptureKit launch crash on macOS 11	2026-04-16
#418	Frontend quality gates + TypeScript hardening	2026-04-18
#416	fix(deps): relax PyTorch requirement for macOS Intel (x86_64)	2026-04-16
#412	feat(history): add "Clear failed" button	2026-04-16
#405	fix: keep cpal Stream alive until playback completes	2026-04-16
#403	fix: prevent intermittent clip splitting failures	2026-04-16
#402	fix: reliably keep server alive after GUI close on Windows	2026-04-16
#401	feat: add Blackwell GPU (sm_120) CUDA support	2026-04-16
#394	fix(history): populate status/error/engine fields from DB row	2026-04-16
#384	Fix: Resolve ModuleNotFoundError in effects service	2026-04-16
#361	fix: torch.from_numpy crash with numpy 2.x in frozen binary	2026-04-16
#345	Fix: "Failed to Save" preset error by resolving backend import path	2026-03-22
#344	fix: include changelog in docker web build	2026-03-27
#332	Fix links in Get Started section of index.mdx	2026-03-21
#328	feat: add Qwen CustomVoice preset engine	2026-03-27
#325	feat: Kokoro 82M TTS engine + voice profile type system	2026-03-20
#321	fix: allows deletion of failed generations	2026-03-19
#320	feat: Intel Arc (XPU) GPU support	2026-03-21
#319	fix: GUI startup with external server + data refresh on server switch	2026-03-27
#318	fix: force offline mode when loading cached models (Qwen TTS & Whisper)	2026-03-21
#316	Upgrade CUDA backend from cu126 to cu128, fix GPU settings UI	2026-03-18

Currently Open (12 PRs)

PR	Title	Status	Notes
#465	docs: define tier-1 and tier-2 platform support targets	Community PR	Pairs with issue #420. Important for scoping.
#463	feat(actions): add docker-registry.yml for automatic ghcr.io publishing	Community PR	Pairs with issue #453. Low risk.
#443	fix: prevent infinite retry loop in offline mode (#434)	Community PR	Fixes reported bug.
#430	feat: add MiniMax TTS provider support	Community PR	Cloud TTS provider — new direction (external API). Superset of #331?
#331	feat: add MiniMax Cloud TTS as a built-in engine	Community PR	Likely superseded by #430. Dedupe.
#311	feat: add CosyVoice2/3 TTS engine	Close	Abandoned — output quality too poor.
#253	Enhance speech tokenizer with 48kHz version	Community PR	Qwen tokenizer upgrade. Still worth reviewing.
#227	fix: harden input validation & file safety	Community PR	Coupled to #225 (custom models).
#225	feat: custom HuggingFace voice model support	Community PR	Needs rework for multi-engine arch.
#195	feat: per-profile LoRA fine-tuning	Draft	Complex. 15 new endpoints.
#154	feat: Audiobook tab	Community PR	Chunked generation now shipped (#266).
#91	fix: CoreAudio device enumeration	Draft	macOS audio device handling.

Open Issues — Categorized

GPU / Hardware Detection — still the top category

RTX 50-series (Blackwell / sm_120) cluster — NEW: #417, #400, #396, #395, #390, #362 all report cudaErrorNoKernelImageForDevice / "no kernel image available." sm_120 support shipped in PR #401 + cu128 in PR #316, but users on upgraded installs still hit it — likely stale CUDA binary. Needs a diagnostic that detects binary/GPU-arch mismatch and prompts re-download.

AMD / ROCm — NEW: #469 HSA_OVERRIDE_GFX_VERSION is hardcoded and breaks RDNA 3/4 cards. #313 DirectML on AMD Ryzen AI Max+ 395 not working.

Intel Arc: PR #320 shipped XPU support — may resolve #119.

General GPU-not-detected (older): #368, #310, #330, #324, #326, #355 (multi-GPU / eGPU).

Fix path: CUDA backend swap (PR #252) + cu128 (PR #316) + sm_120 (PR #401) + GPU-arch warning (73170d0) are all in. Remaining work is diagnostics + re-download prompts for users whose binary predates the kernel updates.
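
One possible shape for the binary/GPU-arch mismatch diagnostic is a pure function that compares the GPU's compute capability with the architectures a build was compiled for. This is a sketch, not existing Voicebox code: binary_supports_gpu is a hypothetical name, and it assumes the usual CUDA rules that an sm_XY kernel runs on devices with the same major version and an equal or higher minor version, while compute_XY (PTX) can be JIT-compiled for any newer device. In a real build the inputs could come from torch.cuda.get_device_capability() and torch.cuda.get_arch_list().

```python
import re

# Matches plain tags such as "sm_86", "sm_120", "compute_90".
_ARCH_RE = re.compile(r"(sm|compute)_(\d+)(\d)")


def binary_supports_gpu(capability: tuple, arch_list: list) -> bool:
    """Return True if any compiled architecture can run on the given GPU."""
    major, minor = capability
    for tag in arch_list:
        match = _ARCH_RE.fullmatch(tag)
        if match is None:
            continue  # suffixed or unknown tags (e.g. "sm_90a") are ignored here
        kind = match.group(1)
        tag_major, tag_minor = int(match.group(2)), int(match.group(3))
        if kind == "sm" and tag_major == major and tag_minor <= minor:
            return True   # binary-compatible within the same major version
        if kind == "compute" and (tag_major, tag_minor) <= (major, minor):
            return True   # PTX can be JIT-compiled forward
    return False
```

A False result on startup would be the trigger for the "your CUDA binary predates this GPU, re-download?" prompt described above.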

Model Downloads

Still reported. Users hit stuck downloads, downloads that can't resume, and offline-mode edge cases.

Key issues: #475 (MAC CustomVoice install error), #449 (infinite loading macOS), #445 (can't download CustomVoice), #462 (Qwen requires internet even when loaded — regression from #150), #434 (infinite retry loop offline — PR #443 open), #432 (storage location change hangs when empty — partly fixed by PR #439/#433), #348 (TADA 3B Multilingual download fails), #336 (TADA model not listed in app), #275 (No module named 'chatterbox' on download), #304 (whisper-base feature extractor load error), #287 (macOS ARM check_model_inputs ImportError on new version), #181, #180.

Fix path: PR #443 addresses infinite offline retry. CustomVoice-specific download failures (#475, #445) need triage — likely related to frozen-binary import fixes in PR #438. TADA cluster (#336, #348) and macOS ARM import regressions (#287, #275, #304) need a dedicated triage pass.

Qwen 0.6B-downloads-1.7B reports: #485 (2026-04-19), #423 (macOS M1), #329. Originally a stale-fallback bug: mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16 wasn't published when MLX support shipped, so the 0.6B slot was aliased to the 1.7B repo. The 0.6B bf16 conversion is live now and both backend/backends/mlx_backend.py and backend/backends/__init__.py point at their correct repos. Qwen CustomVoice is unaffected — it runs via PyTorch on all platforms, both sizes always have dedicated repos.

Language Requests (ongoing)

Strong demand: Hungarian (#479), Indonesian (#458, #247), Thai (#455), Bangla (#454), Arabic (#379), Persian (#162), IndicF5 (#339 — Indian languages), Ukrainian (#109), Chinese UI (#392, #261).

Fix path: Chatterbox Multilingual (PR #257) covers Arabic, Danish, German, Greek, Finnish, Hebrew, Hindi, Dutch, Norwegian, Polish, Swedish, Swahili, Turkish. Still missing: Hungarian, Indonesian, Thai, Bangla, Persian, Ukrainian. Issue #411 offers a PR for a UI i18n foundation.

New Model Requests (growing)

Issue	Model Requested
#478	CosyVoice3 (we tried & abandoned CosyVoice2/3 — see #311)
#407, #347	RVC-style voice-to-voice / seed voice conversion (STS)
#385	Fish Audio S2
#380	OmniVoice
#370	index-tts2
#364	Voxtral-TTS
#335	Faster-Qwen-TTS
#346	Multi-model batch request
#381	Microsoft MAI models
#339	IndicF5
#226	GGUF support
#172	VibeVoice
#138	Export to ONNX/Piper format
#132	LavaSR (transcription)
#147	Facebook Omnilingual ASR
#338	Default voices

The multi-engine architecture makes integration straightforward — see content/docs/developer/tts-engines.mdx. Platform-specific gating (e.g. VoxCPM CUDA-only) doesn't exist yet and would need design.

Platform Scope & Quality Debt — NEW category

Awareness issues filed this cycle. They tie into the engine sprawl and platform tier work.

#419 — Engine sprawl: define first-class vs experimental TTS backends
#420 — Formalize tier-1 vs tier-2 platform support targets (PR #465 open)
#421 — Track & burn down frontend Biome + a11y debt before gating CI
#422 — Code-split web build (main bundle > 1 MB)

Long-Form / Chunking

Still reported despite chunking + queue being merged.

Key issues: #464 (50k char limit on GPU despite 16 GB VRAM — v0.4.0), #365 (FR: >50k chars), #363 (smart chunking to prevent robotic artifacts), #354 (50k limit v0.3.0).

Fix path: Chunking (#266) and queue (#269) shipped. Remaining work is raising/removing the 50k guard and tuning chunk boundaries for prosody.
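
To make "tuning chunk boundaries" concrete, here is a minimal sentence-boundary chunker. It is an illustrative sketch, not the chunker shipped in PR #266: chunk_text and max_chars are hypothetical names, and the 500-character default simply mirrors the per-call limit mentioned earlier in this document.

```python
import re


def chunk_text(text: str, max_chars: int = 500) -> list:
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the limit is hard-split as a last resort.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Splitting only at sentence ends keeps each chunk prosodically complete, which is the property the robotic-artifact reports (#363) are asking for. Nothing in this approach needs a global character cap, so the 50k guard could in principle be handled separately.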

Feature Requests (ongoing)

Notable:

#480 — Noise removal on uploaded recordings
#448 — API for non-Qwen models (external integrations)
#427 — Task status control
#407, #347 — Voice-to-voice / audio-to-audio conversion
#387 — Location of downloaded generated voices
#383 — Concatenate partial reference audio into generated audio
#382 — Lightning.ai support
#376 — Remote mode
#353 — Audio transcoding
#317 — Voice pitch control
#189 — "Auto" language option
#173 — Vocal intonation/inflection control
#165, #270 — Audiobook mode (PR #154 open)
#242 — Seed value pinning
#228 — Always use 0.6B option
#235 — Finetuned Qwen3-TTS tokenizer (PR #253 open)
#144 — Copy text to clipboard

Housekeeping / Triage Needed

Issue	Reason
#431, #408	Spam — Chinese "free Claude API" promos. Close.
#398 ("Excelente")	Non-issue. Close.
#357	Informational — project featured in Awesome MLX. Close after acknowledgement.
#374, #377	Version-release questions, no bug. Close.
#306 ("voice model"), #389 ("New model"), #473 ("New functionality")	Title-only issues, no content. Request details or close.
#309	Uninstall/cleanup question. Answer and close.
#241	"How to use in Colab" — support question, not a bug.
#423 / #485 / #329	Stale MLX fallback to 1.7B repo — fixed; 0.6B bf16 conversion now live on `mlx-community`, registry points at correct repo on both backends.
#336 / #348	TADA download/registration cluster — triage together.
#287 / #275 / #304	macOS ARM import regressions on new version — likely one root cause.
#292, #349	Possibly already fixed by merged PRs (#321/#412 and #345). Verify + close.

~70 older issues (pre-#170) not individually categorized above. Most are long-tail support questions or duplicates of problems now addressed by the multi-engine / model-registry work. A dedicated backlog-sweep pass is overdue.

Bugs (ongoing)

Category	Issues
Generation failures	#476, #467, #452, #459 (voice clone fetch error), #468 (tada-1b marked error), #437, #300, #301, #282
Audio quality	#456 (clipping errors v0.4.0), #436 (emotion labels), #333 (pitch/echo), #307 (by-model breakdown), #340 (all generations say "www...")
Transcription	#371 (fails every time), #291 (extract transcription from generated audio)
Effects / presets	#349 ("Failed to save" when creating effects presets — possibly fixed by merged #345)
File ops	#477 (spacy_pkuseg dict missing on frozen Windows build), #472 (storage location change), #283 (allow longer files for voice creation + in-app trim), #350 (failed to add sample)
History	#292 (can't delete failed generations — possibly fixed by merged #321/#412)
Windows	#466 (install problem), #375 (WinError 5 access denied), #273 (port 8000 conflict), #201 (model doesn't stay loaded)
Linux	#471 (thread-safe PULSE_SOURCE), #413 (Arch build), #409 (Kubuntu build), #351, #341
macOS	#441 (older macOS), #369 (malware flag), #334 (microphone permission), #287 (`check_model_inputs` ImportError — regression), #171 (ARM64 binary won't open)
Profile/UI	#360 (Kokoro profile hides others — partly addressed by auto-switch), #299 (drag-drop on Win11), #329 (size selector state bug), #393 (stuck loading screen after reinstall to new dir)
Integrations	#397 (SAMMI-bot 422 Unprocessable Entity)
Audio playback / session	#41 (macOS: Voicebox goes silent after another app takes audio output; restart restores it) — see deep-dive below
Database	#174 (sqlite3 IntegrityError)

Existing Plan Documents — Status

Document	Target Version	Status	Relevance
`TTS_PROVIDER_ARCHITECTURE.md`	v0.1.13	Partially superseded by multi-engine arch + CUDA swap	Core concepts implemented differently than planned
`CUDA_BACKEND_SWAP.md`	—	Shipped (PR #252)	CUDA binary download + backend restart
`CUDA_BACKEND_SWAP_FINAL.md`	—	Shipped (PR #252)	Final implementation plan
`EXTERNAL_PROVIDERS.md`	v0.2.0	Not started	Remote server support
`MLX_AUDIO.md`	—	Shipped	MLX backend is live
`DOCKER_DEPLOYMENT.md`	v0.2.0	Shipped (PR #161)	Docker + web deployment
`OPENAI_SUPPORT.md`	v0.2.0	Not started	OpenAI-compatible API layer
`PR33_CUDA_PROVIDER_REVIEW.md`	—	Reference	Analysis of the original provider approach

New Model Integration — Landscape

Status Snapshot (2026-04-18)

Model	Cloning	Speed	Sample Rate	Languages	VRAM / Size	Instruct	Cross-platform?	Status
Qwen3-TTS	10s zero-shot	Medium	24 kHz	10	Medium	None	MLX + PyTorch	Shipped
Qwen CustomVoice	Preset speakers	Medium	24 kHz	10	Medium	Yes	PyTorch	Shipped (PR #328)
LuxTTS	3s zero-shot	150x RT, CPU ok	48 kHz	English	<1 GB	None	All	Shipped (PR #254)
Chatterbox MTL	5s zero-shot	Medium	24 kHz	23	Medium	Partial — `exaggeration`	CPU/CUDA	Shipped (PR #257)
Chatterbox Turbo	5s zero-shot	Fast	24 kHz	English	Low	Partial — inline tags	CPU/CUDA	Shipped (PR #258)
HumeAI TADA 1B/3B	Zero-shot	5x faster than LLM-TTS	24 kHz	EN (1B), 10 (3B)	Medium	Partial — prosody	PyTorch	Shipped (PR #296)
Kokoro-82M	Preset voices	CPU realtime	24 kHz	8	Tiny (82M)	None	All	Shipped (PR #325)
~~CosyVoice2-0.5B~~	3-10s zero-shot	Very fast	24 kHz	Multilingual	Low	Yes	—	Abandoned (PR #311) — poor output quality
~~VoxCPM2~~	Zero-shot	~0.15 RTF streaming	48 kHz	30	Medium	Partial — parenthetical style	CUDA-only in practice	Backlogged (2026-04-18) — see notes above
Fish Speech	10-30s few-shot	Real-time	24-44 kHz	50+	Medium	Yes — word-level inline	All	Candidate — license TBD
Fish Audio S2	—	—	—	—	—	—	—	Candidate (#385)
XTTS-v2	6s zero-shot	Mid-GPU	24 kHz	17+	Medium	Partial — style transfer from ref	All	Candidate — CPML license likely blocker
Pocket TTS (Kyutai)	Zero-shot + streaming	>1x RT on CPU	—	English + several European (FR/DE/PT/IT/ES added by Feb 2026)	~100M	None	CPU-first	Candidate — MIT
MOSS-TTS-Nano	Zero-shot	Realtime on 4 CPU cores	48 kHz stereo	20	0.1B	Partial — MOSS-VoiceGenerator companion does text-to-voice design	All (ONNX CPU path released 2026-04-17)	Top candidate — Apache 2.0, released 2026-04-13, streaming
VibeVoice (Microsoft)	—	—	—	Multi-speaker long-form (up to 90 min, 4 speakers)	1.5B	—	—	Candidate (#172) — Stories-editor fit
index-tts2	—	—	—	—	—	—	—	Candidate (#370)
Voxtral TTS (Mistral)	Zero-shot (short clips) + 20 preset voices	Single-GPU	—	—	4B (`Voxtral-4B-TTS-2603`)	Presets + cloning	CUDA (16 GB+ VRAM)	Candidate (#364) — frontier quality claim, open-weight
Dia / Dia2	—	—	—	—	—	—	—	Watch — emotion-forward, but "rough edges" / artifacts per April reviews
IndicF5	—	—	—	Indian languages	—	—	—	Candidate (#339) — fills Indic gap
MiniMax Cloud TTS	—	Cloud	—	—	N/A (API)	—	N/A	Community PR #430, #331 — new direction (external API)
OmniVoice	—	—	—	—	—	—	—	Candidate (#380)
RVC voice conversion	N/A (STS)	—	—	—	—	N/A	All	New modality, not TTS (#407, #347)

Watch list: MioTTS-2.6B (fast LLM-based EN/JP, vLLM compatible), Oolel-Voices (Soynade Research, expressive modular control), Faster-Qwen-TTS (#335), Orpheus / Sesame CSM (on-device fine-tuning discussions), Fish Audio S2 Pro / Fish Speech V1.5 (benchmark leader but research/non-commercial license — same blocker as Fish Speech).

Deep-research pass (2026-04-18): MOSS-TTS-Nano identified as the freshest high-alignment candidate — verified via OpenMOSS/MOSS-TTS README (0.1B params, Apache 2.0, 48 kHz stereo, 4-core CPU realtime, streaming, released 2026-04-13). Dedicated repo: OpenMOSS/MOSS-TTS-Nano. Voxtral TTS verified on HF as mistralai/Voxtral-4B-TTS-2603.

Active Evaluation Criteria (learned from cycle)

Cross-platform first. MLX is a primary backend for our Apple Silicon user base. CUDA-only models require platform gating that doesn't exist yet — shipping one sets a precedent (see VoxCPM notes, issues #419/#420).
PyPI + Apache/MIT licensing preferred. Heavy deps, git-only installs, and --no-deps workarounds are expensive to maintain (Chatterbox taught us this).
Output quality is non-negotiable. CosyVoice was abandoned despite the best instruct API.
Instruct support fills a real gap (#173, #224, #303). Qwen CustomVoice partially addresses it with preset speakers; zero-shot clone-with-instruct is still unmet.
Long-form + streaming are user-requested (#363, #365, #464). Candidates with native streaming (Pocket TTS, Fish Speech) get extra weight.

Adding a New Engine (Now Straightforward)

With the model config registry and shared EngineModelSelector component, adding a new TTS engine requires:

Create backend/backends/<engine>_backend.py — implement TTSBackend protocol (~200-300 lines)
Register in backend/backends/__init__.py — add ModelConfig entry + TTS_ENGINES entry + factory elif
Update backend/models.py — add engine name to regex
Update frontend — add to engine union type, EngineModelSelector options, form schema, language map, profile type gating (icons/labels ~9 files per grep of kokoro)

main.py requires zero changes — the registry handles all dispatch automatically.
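
A minimal sketch of steps 1 and 2, for orientation only. The method names on `TTSBackend`, the fields on `ModelConfig`, the `create_backend` helper, and the `example` engine are all assumptions made for illustration; the real definitions live in backend/backends/__init__.py and may differ.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, Protocol


class TTSBackend(Protocol):
    """Stand-in for the real protocol; these method names are assumed."""

    def load_model(self, model_size: Optional[str] = None) -> None: ...
    def is_loaded(self) -> bool: ...
    def generate(self, text: str, voice_prompt: dict, language: str) -> tuple[bytes, int]: ...


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical subset of the registry entry fields."""

    model_name: str
    display_name: str
    engine: str


class ExampleTTSBackend:
    """Step 1: a hypothetical engine that satisfies the protocol above."""

    def __init__(self) -> None:
        self._loaded = False

    def load_model(self, model_size: Optional[str] = None) -> None:
        # A real engine would download and initialise its weights here.
        self._loaded = True

    def is_loaded(self) -> bool:
        return self._loaded

    def generate(self, text: str, voice_prompt: dict, language: str = "en") -> tuple[bytes, int]:
        if not self._loaded:
            raise RuntimeError("call load_model() before generate()")
        sample_rate = 24000
        # Placeholder output: 0.1 s of 16-bit PCM silence.
        return b"\x00\x00" * (sample_rate // 10), sample_rate


# Step 2: registry entries plus the factory branch (all names illustrative).
MODEL_CONFIGS = {"example": ModelConfig("example-tts", "Example TTS", "example")}
TTS_ENGINES = {"example": "Example TTS"}


def create_backend(engine: str) -> TTSBackend:
    if engine == "example":
        return ExampleTTSBackend()
    raise ValueError(f"unknown engine: {engine}")
```

Because the protocol is structural, `ExampleTTSBackend` does not need to inherit from `TTSBackend`; it only needs matching methods.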

Platform gating doesn't exist yet. If we add a CUDA-only model (e.g. VoxCPM), we need a new requires_cuda (or more generally requires: list[device]) flag on ModelConfig, plumbed through /models API and surfaced in ModelManagement.tsx and EngineModelSelector.tsx as a lock icon + "Requires NVIDIA GPU" state. Backend should hard-error at load_model() as a safety net.
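
One possible shape for the backend safety net, as a sketch. The `requires` field, `UnsupportedPlatformError`, and `assert_platform_supported` are hypothetical names; the set of available devices is passed in rather than detected, so the sketch does not depend on any particular runtime. An empty `requires` is taken to mean "runs anywhere", and any one listed device is assumed to be sufficient.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical subset of ModelConfig with the proposed flag."""

    model_name: str
    engine: str
    requires: tuple[str, ...] = ()  # e.g. ("cuda",); empty means no restriction


class UnsupportedPlatformError(RuntimeError):
    """Raised before any weights are loaded on an unsupported device."""


def assert_platform_supported(config: ModelConfig, available_devices: set[str]) -> None:
    # Unrestricted models always pass.
    if not config.requires:
        return
    # Any one of the listed devices is enough.
    if set(config.requires) & available_devices:
        return
    needed = ", ".join(config.requires)
    raise UnsupportedPlatformError(
        f"{config.model_name} requires one of: {needed}"
    )
```

Calling this at the top of `load_model()` would turn a silent CPU fallback into an immediate, explainable error.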

Total effort: ~1 day for a well-documented, cross-platform model with a PyPI package; ~2 days if platform gating is required. See content/docs/developer/tts-engines.mdx for the full guide.

Architectural Bottlenecks

1. Single Backend Singleton — RESOLVED

The singleton TTS backend was replaced with a thread-safe per-engine registry in PR #254. Multiple engines can now be loaded simultaneously.
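
For readers unfamiliar with the pattern, a per-engine registry with double-checked locking can be sketched as below. This is an illustration of the technique, not the code from PR #254; `get_backend` and its `factory` parameter are assumed names.

```python
import threading
from typing import Callable, Dict

_backends: Dict[str, object] = {}
_lock = threading.Lock()


def get_backend(engine: str, factory: Callable[[], object]) -> object:
    """Return the backend for `engine`, creating it at most once."""
    # First check without the lock: the common case once an engine exists.
    backend = _backends.get(engine)
    if backend is None:
        with _lock:
            # Second check under the lock: another thread may have
            # created the backend while this one was waiting.
            backend = _backends.get(engine)
            if backend is None:
                backend = factory()
                _backends[engine] = backend
    return backend
```

Each engine name maps to its own instance, so several engines can stay loaded at once while construction of any single engine still happens only once.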

2. `main.py` Dispatch Point Duplication — RESOLVED

Previously, each engine required updates to 6+ hardcoded dispatch maps across main.py (~320 lines of if/elif chains). A model config registry in backend/backends/__init__.py now centralizes all model metadata (ModelConfig dataclass) with helper functions (load_engine_model(), check_model_loaded(), engine_needs_trim(), etc.). Adding a new engine requires zero changes to main.py.

3. Model Config is Scattered — RESOLVED

Model identifiers, HF repo IDs, display names, and engine metadata are now consolidated in the ModelConfig registry. Backend-aware branching (e.g. MLX vs PyTorch Qwen repo IDs) happens inside the registry. Frontend model options are centralized in EngineModelSelector.tsx.
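
The backend-aware branching might look roughly like the sketch below. The field names `hf_repo_id` and `mlx_repo_id`, the `resolve_repo_id` helper, and the repo IDs are placeholders, not the registry's actual contents.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical subset: one entry carries both repo variants."""

    model_name: str
    hf_repo_id: str
    mlx_repo_id: Optional[str] = None  # only set for engines with an MLX port


def resolve_repo_id(config: ModelConfig, backend_type: str) -> str:
    # Prefer the MLX repo when running on MLX and a port exists;
    # otherwise fall back to the default (PyTorch) repo.
    if backend_type == "mlx" and config.mlx_repo_id:
        return config.mlx_repo_id
    return config.hf_repo_id
```

Keeping the choice inside the registry means callers ask for a model by name and never see the platform split.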

4. Voice Prompt Cache Assumes PyTorch Tensors

backend/utils/cache.py uses torch.save() / torch.load(). LuxTTS, Chatterbox, and Kokoro backends work around this by storing reference audio paths (or preset voice IDs) instead of tensors in their voice prompt dicts. Not ideal but functional.
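
To illustrate why the path-based workaround is serialization-friendly: a prompt dict that holds only paths or preset IDs is plain data and could be cached without torch at all. The sketch below uses JSON; the function names and dict keys are hypothetical, and this is one possible direction rather than what backend/utils/cache.py does today.

```python
import json
from pathlib import Path
from typing import Optional


def save_voice_prompt(cache_dir: Path, key: str, prompt: dict) -> Path:
    """Write a path-based voice prompt as JSON (no tensors allowed)."""
    path = cache_dir / f"{key}.json"
    path.write_text(json.dumps(prompt), encoding="utf-8")
    return path


def load_voice_prompt(cache_dir: Path, key: str) -> Optional[dict]:
    """Return the cached prompt, or None on a cache miss."""
    path = cache_dir / f"{key}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```

Tensor-based prompts would still need `torch.save()` or an equivalent, so a real fix would have to dispatch on prompt type.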

5. Frontend Assumes Qwen Model Sizes — RESOLVED

The generation form now uses a flat model dropdown with engine-based routing. Per-engine language filtering is in place. Model size is only sent for Qwen / Qwen CustomVoice.

6. No Platform Gating on Models — NEW

ModelConfig has no way to express hardware requirements. Every engine is shown to every user, regardless of whether it will actually load. Users on non-CUDA platforms discover the failure at load time (or not at all: some engines fall back silently to CPU and never complete). This gap blocks shipping CUDA-only engines (VoxCPM); closing it would also improve the Intel Arc / ROCm / CPU-only UX today. Proposed fix (see the ModelConfig TODO): add requires: list[Literal["cuda", "mps", "xpu", "cpu", "rocm"]] or equivalent, plumb it through the /models API, and render it in ModelManagement.tsx + EngineModelSelector.tsx.
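
On the API side, the plumbing could be as small as annotating each model entry before it is returned. The sketch below is an assumption about shape only: the `available` and `unavailable_reason` keys, the `annotate_availability` helper, and the label wording are all hypothetical.

```python
from __future__ import annotations

from typing import Iterable

# Illustrative user-facing labels for each device key.
DEVICE_LABELS = {
    "cuda": "NVIDIA GPU",
    "mps": "Apple Silicon GPU",
    "xpu": "Intel Arc GPU",
    "rocm": "AMD GPU (ROCm)",
    "cpu": "CPU",
}


def annotate_availability(models: Iterable[dict], available_devices: set[str]) -> list[dict]:
    """Add `available` and `unavailable_reason` to each model entry."""
    annotated = []
    for model in models:
        requires = model.get("requires", [])
        # No requirement means the model is offered everywhere.
        ok = not requires or any(d in available_devices for d in requires)
        reason = None
        if not ok:
            labels = [DEVICE_LABELS.get(d, d) for d in requires]
            reason = "Requires " + " or ".join(labels)
        annotated.append({**model, "available": ok, "unavailable_reason": reason})
    return annotated
```

The frontend could then render the lock icon whenever `available` is false and show `unavailable_reason` as the tooltip.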

7. Engine Sprawl — NEW

Seven TTS engines shipped, more candidates queued. Issue #419 asks for a first-class vs experimental distinction. Related: issue #420 asks for formalized platform support tiers. Combined, these would let us ship more engines more confidently with clearer expectations for users.

Recommended Priorities

Tier 1 — Ship Now

Priority	PR/Item	Impact	Effort
1	RTX 50-series / Blackwell diagnostic — detect stale CUDA binary vs GPU arch, prompt re-download (#417, #400, #396, #395, #390, #362)	Large cluster of user-blocking errors	Medium
2	CustomVoice download failures (#475, #445)	New engine blocked on macOS/Windows; needs regression triage	Medium
3	50k char limit on GPU (#464)	Regression — chunking should handle this	Medium
4	Close PR #311 (CosyVoice) and dedupe #331/#430 (MiniMax)	Housekeeping	None
5	PR #443 — infinite offline retry loop	Bug fix, reviewable	Low
6	PR #465 — define tier-1 / tier-2 platforms	Unblocks engine-sprawl decision (#419)	Low
7	PR #463 — docker registry auto-publish	Community PR, low risk	Low
8	#253 — 48kHz speech tokenizer	Quality improvement for Qwen	Medium
9	Kokoro profile UX (#360) — partially addressed by auto-switch	Polish	Low
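
For priority 1, the core check can be sketched as a pure function, assuming the CUDA binary bundles a PyTorch build. In that case the two inputs could come from `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()`; RTX 50-series cards report compute capability 12.0. The function name and the decision to ignore suffixed entries are assumptions, and the compatibility rules are simplified (native `sm_` code needs the same major version and an equal or lower minor; `compute_` PTX can be JIT-compiled for any newer GPU).

```python
from __future__ import annotations


def binary_supports_gpu(capability: tuple[int, int], arch_list: list[str]) -> bool:
    """Return True if any compiled arch in `arch_list` can run on the GPU."""
    device = tuple(capability)
    for arch in arch_list:
        kind, _, digits = arch.partition("_")
        if not digits.isdigit():
            continue  # skip suffixed or malformed entries such as "sm_90a"
        built = divmod(int(digits), 10)  # "sm_86" -> (8, 6), "sm_120" -> (12, 0)
        # Native code: same major version, minor no newer than the device.
        if kind == "sm" and built[0] == device[0] and built[1] <= device[1]:
            return True
        # PTX: forward compatible with any device at least as new.
        if kind == "compute" and built <= device:
            return True
    return False
```

A `False` result on a machine with a detected GPU is the signal to prompt a re-download of the CUDA binary rather than surfacing a raw kernel error.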

Tier 2 — Feature Work

Priority	Item	Impact	Effort
1	Engine tier system (#419) — first-class vs experimental, platform gating in `ModelConfig`	Unblocks CUDA-only engines (VoxCPM, etc.) and frontend polish	Medium
2	Frontend tech-debt burn-down (#421) + code-split (#422)	Before gating CI on Biome	Medium
3	#154 — Audiobook tab	Long-form users. Chunking + queue shipped.	Medium
4	UI i18n (#411 PR offer, #392, #261)	Chinese UI + general localization	Medium
5	#225 — Custom HuggingFace models	User-supplied models. Needs rework.	High
6	OpenAI-compatible API (plan doc exists) — see also #448 (API for non-Qwen)	Low effort once API is stable	Low
7	LoRA fine-tuning (PR #195)	Complex, needs rework for multi-engine	Very High
8	Streaming for non-MLX engines	Currently MLX-only	Medium
9	Voice-to-voice / RVC (#407, #347)	New modality — different arch shape	High

Tier 3 — Future Engines (cross-platform preferred)

Priority	Item	Notes
1	MOSS-TTS-Nano	0.1B, Apache 2.0, 4-core CPU realtime, 48 kHz stereo, streaming, 20 langs, released 2026-04-13. Best alignment with our criteria. Verify install ergonomics before committing.
2	Pocket TTS (Kyutai)	CPU-first 100M model. MIT. Fills streaming gap without CUDA dependency. Several European langs added by Feb 2026.
3	IndicF5	Fills Indian-language gap (#339). Closes many language-request issues.
4	VibeVoice (Microsoft, #172)	1.5B, long-form multi-speaker (up to 90 min, 4 speakers). Strong Stories-editor fit.
5	Voxtral TTS (Mistral, #364)	4B presets+cloning. Frontier quality claim, but 16 GB+ VRAM — would need the platform-tier work first.
6	Fish Speech / Fish Audio S2	50+ langs, word-level instruct. License clarification first. (#385)
7	XTTS-v2	17+ langs, mature pip. CPML likely kills commercial use — verify.
8	index-tts2 (#370)	Unvetted.
—	~~VoxCPM2~~	Backlogged — CUDA-only upstream. Revisit when tier system ships or MPS bugs are fixed upstream.

Previously Prioritized — Now Done

~~Kokoro 82M — finish integration~~ Shipped (PR #325)
~~Qwen CustomVoice~~ Shipped (PR #328)
~~Intel Arc (XPU) support~~ Shipped (PR #320)
~~Blackwell CUDA~~ Shipped (PR #401, follow-up work open)
~~Generation cancellation~~ Shipped (PR #444)
~~macOS Intel x86_64~~ Shipped (PR #416)

Branch Inventory

Branch	PR	Status	Notes
`voicebox-new-models`	—	Active	New model research (Fish Speech, Pocket TTS, VibeVoice, etc.); VoxCPM evaluated & backlogged
`fix/kokoro-pyinstaller-source-files`	—	Active	Kokoro frozen-build source bundling (parent of `voicebox-new-models`)
`feat/cosyvoice-engine`	#311	Open — closing	CosyVoice2/3 — abandoned, poor quality
`feat/kokoro`	#325	Merged	Kokoro 82M + voice profile type system
`feat/qwen-custom-voice`	#328	Merged	Qwen CustomVoice preset engine
`feat/chatterbox-turbo`	#258	Merged	Chatterbox Turbo + per-engine languages
`feat/chatterbox`	#257	Merged	Chatterbox Multilingual
`feat/luxtts`	#254	Merged	LuxTTS + multi-engine arch

Quick Reference: API Endpoints

<details> <summary>All current endpoints</summary>

Endpoint	Method	Purpose
`/health`	GET	Health check, model/GPU status
`/profiles`	POST, GET	Create/list voice profiles
`/profiles/{id}`	GET, PUT, DELETE	Profile CRUD
`/profiles/{id}/samples`	POST, GET	Add/list voice samples
`/profiles/{id}/avatar`	POST, GET, DELETE	Avatar management
`/profiles/{id}/export`	GET	Export profile as ZIP
`/profiles/import`	POST	Import profile from ZIP
`/generate`	POST	Generate speech (engine param selects TTS backend)
`/generate/stream`	POST	Stream speech (MLX only)
`/history`	GET	List generation history
`/history/{id}`	GET, DELETE	Get/delete generation
`/history/{id}/export`	GET	Export generation ZIP
`/history/{id}/export-audio`	GET	Export audio only
`/transcribe`	POST	Transcribe audio (Whisper)
`/models/status`	GET	All model statuses (Qwen, LuxTTS, Chatterbox, Chatterbox Turbo, TADA, Whisper)
`/models/download`	POST	Trigger model download
`/models/download/cancel`	POST	Cancel/dismiss download
`/models/{name}`	DELETE	Delete downloaded model
`/models/load`	POST	Load model into memory
`/models/unload`	POST	Unload model
`/models/progress/{name}`	GET	SSE download progress
`/tasks/active`	GET	Active downloads/generations (with inline progress)
`/stories`	POST, GET	Create/list stories
`/stories/{id}`	GET, PUT, DELETE	Story CRUD
`/stories/{id}/items`	POST, GET	Story items CRUD
`/stories/{id}/export`	GET	Export story audio
`/channels`	POST, GET	Audio channel CRUD
`/channels/{id}`	PUT, DELETE	Channel update/delete
`/cache/clear`	POST	Clear voice prompt cache
`/server/cuda/status`	GET	CUDA binary availability
`/server/cuda/download`	POST	Download CUDA binary
`/server/cuda/switch`	POST	Switch to CUDA backend

</details>
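
A hedged client sketch for the `/generate` endpoint, using only the standard library. The port comes from the architecture overview and `engine` is the parameter named in the table; the `text` and `profile_id` body fields and the `build_generate_request` helper are assumptions, so check the request schema in backend/models.py before relying on them.

```python
import json
import urllib.request

BASE_URL = "http://localhost:17493"


def build_generate_request(text: str, engine: str, profile_id: str) -> urllib.request.Request:
    """Build (but do not send) a POST /generate request."""
    # Field names other than "engine" are assumed, not taken from the schema.
    body = json.dumps(
        {"text": text, "engine": engine, "profile_id": profile_id}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With the backend running, the request can be sent with `urllib.request.urlopen(request)`; building it separately keeps the sketch testable without a live server.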
