Voicebox Project Status & Roadmap

docs/PROJECT_STATUS.md

Last updated: 2026-04-18 | Current version: v0.4.1 | 232 open issues | 12 open PRs


Table of Contents

  1. Architecture Overview
  2. Current State
  3. Open PRs — Triage & Analysis
  4. Open Issues — Categorized
  5. Existing Plan Documents — Status
  6. New Model Integration — Landscape
  7. Architectural Bottlenecks
  8. Recommended Priorities

Architecture Overview

Tauri shell (Rust) hosts a React frontend (app/) that talks over HTTP on localhost:17493 to a FastAPI backend (backend/).

The backend exposes:

  • TTSBackend Protocol with seven concrete engine implementations:
    • Qwen3-TTS (PyTorch or MLX depending on platform)
    • Qwen CustomVoice (predefined speakers with instruct)
    • LuxTTS (fast, CPU-friendly)
    • Chatterbox Multilingual (23 languages)
    • Chatterbox Turbo (English, paralinguistic tags)
    • TADA (1B English, 3B multilingual via HumeAI)
    • Kokoro 82M (pre-built voices, CPU realtime)
  • STTBackend Protocol for Whisper (PyTorch or MLX-Whisper)
  • Profiles / History / Stories services for persistence and timeline editing
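A minimal sketch of what a structural TTSBackend interface could look like. The method names (load_model, create_voice_prompt, generate) come from the generation flow described in this document; the exact signatures, return types, and the SilentBackend toy class are assumptions for illustration, not the definitions in backend/backends/__init__.py.

```python
from typing import Any, Dict, List, Optional, Protocol, runtime_checkable


@runtime_checkable
class TTSBackend(Protocol):
    """Structural interface each engine is assumed to satisfy (signatures are guesses)."""

    def load_model(self, model_size: str) -> None: ...

    def create_voice_prompt(self, audio_path: str, reference_text: str) -> Dict[str, Any]: ...

    def generate(
        self,
        text: str,
        voice_prompt: Dict[str, Any],
        language: str,
        seed: Optional[int] = None,
        instruct: Optional[str] = None,
    ) -> List[float]: ...


class SilentBackend:
    """Hypothetical toy engine: satisfies the protocol without inheriting from it."""

    def __init__(self) -> None:
        self.loaded: Optional[str] = None

    def load_model(self, model_size: str) -> None:
        self.loaded = model_size

    def create_voice_prompt(self, audio_path: str, reference_text: str) -> Dict[str, Any]:
        # Store the reference paths/text; a real engine may precompute embeddings here.
        return {"ref_audio": audio_path, "ref_text": reference_text}

    def generate(self, text, voice_prompt, language, seed=None, instruct=None) -> List[float]:
        # One zero-valued sample per input character; a real engine returns audio.
        return [0.0] * len(text)
```

Because the check is structural, a new engine only has to provide the three methods; `isinstance(SilentBackend(), TTSBackend)` holds without any shared base class.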

Key Files

| Layer | File | Purpose |
| --- | --- | --- |
| Backend entry | backend/main.py | FastAPI app, all API routes (~2850 lines) |
| TTS protocol | backend/backends/__init__.py:32-101 | TTSBackend Protocol definition |
| Model registry | backend/backends/__init__.py:17-29,153-366 | ModelConfig dataclass + registry helpers |
| TTS factory | backend/backends/__init__.py:382-426 | Thread-safe engine registry (double-checked locking) |
| PyTorch TTS | backend/backends/pytorch_backend.py | Qwen3-TTS via qwen_tts package |
| MLX TTS | backend/backends/mlx_backend.py | Qwen3-TTS via mlx_audio.tts |
| LuxTTS | backend/backends/luxtts_backend.py | LuxTTS: fast, CPU-friendly |
| Chatterbox MTL | backend/backends/chatterbox_backend.py | Chatterbox Multilingual: 23 languages |
| Chatterbox Turbo | backend/backends/chatterbox_turbo_backend.py | Chatterbox Turbo: English, paralinguistic tags |
| TADA | backend/backends/hume_backend.py | HumeAI TADA: 1B English + 3B Multilingual |
| Kokoro | backend/backends/kokoro_backend.py | Kokoro 82M: CPU realtime, pre-built voices |
| Qwen CustomVoice | backend/backends/qwen_custom_voice_backend.py | Qwen CustomVoice: predefined speakers with instruct |
| Platform detect | backend/platform_detect.py | Apple Silicon → MLX, else → PyTorch |
| API types | backend/models.py | Pydantic request/response models |
| HF progress | backend/utils/hf_progress.py | HFProgressTracker (tqdm patching for download progress) |
| Audio utils | backend/utils/audio.py | trim_tts_output(), normalize, load/save audio |
| Frontend API | app/src/lib/api/client.ts | Hand-written fetch wrapper |
| Frontend types | app/src/lib/api/types.ts | TypeScript API types |
| Engine selector | app/src/components/Generation/EngineModelSelector.tsx | Shared engine/model dropdown |
| Generation form | app/src/components/Generation/GenerationForm.tsx | TTS generation UI |
| Floating gen box | app/src/components/Generation/FloatingGenerateBox.tsx | Compact generation UI |
| Model manager | app/src/components/ServerSettings/ModelManagement.tsx | Model download/status/progress UI |
| GPU acceleration | app/src/components/ServerSettings/GpuAcceleration.tsx | CUDA backend swap UI |
| Gen form hook | app/src/lib/hooks/useGenerationForm.ts | Form validation + submission |
| Language constants | app/src/lib/constants/languages.ts | Per-engine language maps |

How TTS Generation Works (Current Flow)

POST /generate
  1. Look up voice profile from DB
  2. Resolve engine from request (qwen | qwen_custom_voice | luxtts | chatterbox | chatterbox_turbo | tada | kokoro)
  3. Get backend: get_tts_backend_for_engine(engine)  # thread-safe singleton per engine
  4. Check model cache → if missing, trigger background download, return HTTP 202
  5. Load model (lazy): tts_backend.load_model(model_size)
  6. Create voice prompt: profiles.create_voice_prompt_for_profile(engine=engine)
       → tts_backend.create_voice_prompt(audio_path, reference_text)
  7. Generate: tts_backend.generate(text, voice_prompt, language, seed, instruct)
  8. Post-process: trim_tts_output() for Chatterbox engines
  9. Save WAV → data/generations/{id}.wav
  10. Insert history record in SQLite
  11. Return GenerationResponse
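The cache-miss branch in step 4 is the part of this flow that most often surprises API clients, so here is a small sketch of it in isolation. Everything in it is hypothetical scaffolding (handle_generate, GenerateResult, the callable-based backends map, and the 422 for unknown engines are assumptions); it only illustrates the "missing model → start download → HTTP 202" shape described above, not the code in backend/main.py.

```python
from dataclasses import dataclass
from http import HTTPStatus
from typing import Callable, Dict, List, Set


@dataclass
class GenerateResult:
    status: HTTPStatus
    detail: str


def handle_generate(
    engine: str,
    text: str,
    backends: Dict[str, Callable[[str], List[float]]],
    cached_models: Set[str],
    start_download: Callable[[str], None],
) -> GenerateResult:
    """Toy version of steps 2-7: resolve engine, check cache, generate."""
    synthesize = backends.get(engine)
    if synthesize is None:
        # Assumption: unknown engines are rejected as a validation error.
        return GenerateResult(HTTPStatus.UNPROCESSABLE_ENTITY, f"unknown engine: {engine}")

    if engine not in cached_models:
        # Step 4: kick off the download and tell the client to retry later.
        start_download(engine)
        return GenerateResult(HTTPStatus.ACCEPTED, "model download started")

    samples = synthesize(text)
    return GenerateResult(HTTPStatus.OK, f"generated {len(samples)} samples")


# Usage: the first call triggers a download, the second one generates.
started: List[str] = []
toy_backends = {"kokoro": lambda t: [0.0] * len(t)}
first = handle_generate("kokoro", "hello", toy_backends, set(), started.append)
second = handle_generate("kokoro", "hello", toy_backends, {"kokoro"}, started.append)
```

A client that receives 202 is expected to poll model status and resubmit; the sketch leaves that retry policy out.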

Current State

What's Shipped (v0.4.x)

New since v0.3.0:

  • Kokoro 82M TTS engine + voice profile type system (PR #325)
  • Qwen CustomVoice preset engine — predefined speakers with instruct support (PR #328)
  • Intel Arc (XPU) GPU support (PR #320)
  • Blackwell GPU (sm_120) CUDA support (PR #401)
  • Generation cancellation flow (PR #444)
  • Frontend quality gates + TypeScript hardening (PR #418)
  • macOS Intel (x86_64) PyTorch compatibility (PR #416)
  • Frozen-binary import fixes for Kokoro / Chatterbox Multilingual / scipy / transformers (PR #438)
  • Linux PipeWire/PulseAudio monitor detection (PR #457)
  • Server survives GUI close on Windows (PR #402)
  • GPU arch compatibility warning on startup (catches unsupported PyTorch builds)
  • cpal Stream playback reliability (PR #405), clip-splitting stability (PR #403)
  • torch.from_numpy crash with numpy 2.x in frozen binary (PR #361)
  • Async CUDA download lock (PR #428), NUMBA_CACHE_DIR env var (PR #425)
  • "Clear failed" history button (PR #412)
  • External server GUI startup + data refresh (PR #319)
  • Force offline mode for cached Qwen/Whisper models (PR #318)
  • macOS 11 ScreenCaptureKit launch crash fix (PR #424)

Core TTS (cumulative):

  • Qwen3-TTS voice cloning (1.7B and 0.6B models, MLX + PyTorch)
  • Qwen CustomVoice (preset speakers, instruct)
  • LuxTTS — fast, CPU-friendly English TTS (PR #254)
  • Chatterbox Multilingual — 23 languages including Hebrew (PR #257)
  • Chatterbox Turbo — paralinguistic tags, low latency English (PR #258)
  • HumeAI TADA — 1B English + 3B Multilingual (PR #296)
  • Kokoro 82M — CPU-realtime, 8 languages, Apache 2.0 (PR #325)
  • Multi-engine architecture with thread-safe backend registry (PR #254)
  • Chunked TTS generation — engine-agnostic, removes ~500 char limit (PR #266)
  • Async generation queue (PR #269)
  • Post-processing audio effects system (PR #271)
  • Voice profile type system (preset vs cloned, engine compatibility gating)
  • Centralized ModelConfig registry — no per-engine dispatch maps
  • Shared EngineModelSelector component

Infrastructure (cumulative):

  • CUDA backend swap via binary download (PR #252), cu128 upgrade (PR #316), Blackwell/sm_120 (PR #401)
  • CUDA backend split into independently versioned server + libs archives (PR #298)
  • Intel Arc XPU support (PR #320)
  • Docker + web deployment (PR #161)
  • Backend refactor: modular architecture, style guide, tooling (PR #285)
  • Settings overhaul: routed sub-tabs, server logs, changelog, about page (PR #294)
  • Windows support: CUDA detection, cross-platform justfile, server lifecycle (PR #272, #402)
  • Linux audio capture via pactl monitor detection (PR #457)
  • macOS Intel x86_64 compatibility (PR #416)
  • Voice profiles with multi-sample support
  • Stories editor (multi-track DAW timeline)
  • Whisper transcription (base, small, medium, large, turbo variants)
  • Model management UI with inline download progress + folder migration (PR #268)
  • Download cancel/clear UI with error panel (PR #238)
  • Generation history with caching and cancellation (PR #444)
  • Streaming generation endpoint (MLX only)
  • Audio player freeze fix + UX improvements (PR #293)
  • CORS restriction to known local origins (PR #88)

Abandoned / Backlogged Integrations

| Model | PR / Branch | Reason |
| --- | --- | --- |
| CosyVoice2/3 | PR #311 | Output quality too poor. Heavy deps, no PyPI, needed 5+ shims. PR should be closed. |
| VoxCPM 1.5 / VoxCPM2 | voicebox-new-models research (2026-04-18) | Backlogged. See detailed analysis below. |

VoxCPM — Evaluation Notes (2026-04-18)

Project: OpenBMB/VoxCPM — tokenizer-free TTS, 2B params (VoxCPM2), end-to-end diffusion autoregressive architecture, 30 languages, 48 kHz output, Apache 2.0, pip install voxcpm.

Why it looked interesting:

  • Clean PyPI install (pip install voxcpm)
  • Apache 2.0 — commercially safe
  • Voice cloning via reference_wav_path with optional prompt_wav_path + prompt_text for "ultimate" cloning
  • Streaming API via generate_streaming()
  • Zero-shot cloning + style control via parenthetical prefixes in text ((slightly faster, cheerful tone)...)
  • Relatively high-quality output per demos

Why we backlogged it:

  • Effectively CUDA-only. README states CUDA ≥ 12.0 as hard requirement. Source code's from_pretrained(device=None|"auto") claims "preferring CUDA, then MPS, then CPU," but in practice:
    • MPS (Apple Silicon) broken upstream — OpenBMB/VoxCPM issues #232 (NotImplementedError: Output channels > 65536 not supported at the MPS device) and #248 (IndexError on M3 Mac) are both open with no resolution.
    • CPU unsupported in the Python package — issue #256 shows voxcpm --device cpu rejected with unrecognized arguments. The only CPU path is the third-party VoxCPM.cpp GGML engine, which is a separate ecosystem project, not pip install voxcpm.
    • macOS source install fails — issue #233 open with no resolution.
  • Would require CUDA-only gating in UI (new requires_cuda flag on ModelConfig, lock icon + "Requires NVIDIA GPU" in ModelManagement.tsx / EngineModelSelector.tsx) plus a hard error at load_model() as safety net. Doable but adds first-class platform gating that doesn't exist for any other engine today.
  • Voicebox's user base skews Apple Silicon (MLX is a primary backend). Shipping a CUDA-only model sets a precedent worth a separate scoping discussion (see issues #419 engine sprawl, #420 platform tiers, PR #465).

What would change the decision:

  • Upstream fixes MPS crashes (watch issues #232, #248).
  • We define an "experimental / CUDA-only" engine tier as part of issue #419 / PR #465, and decide it's acceptable to ship engines that are hidden on non-NVIDIA platforms.
  • VoxCPM.cpp matures into a viable CPU path we can wrap (currently separate project, C++/GGML, unclear ergonomics).

Integration shape if we revive it: Zero-shot cloning maps naturally to the Chatterbox-style backend (store ref_audio + ref_text paths in the voice prompt dict, process at generate time). Est. ~250 lines for voxcpm_backend.py + one ModelConfig entry + engine registration in backends/__init__.py. Frontend UI gating is the bigger lift.

What's In-Flight

| Feature | Branch/PR | Status |
| --- | --- | --- |
| Platform support tiers | PR #465, issue #420 | Defining tier-1 (supported) vs tier-2 (community) platforms |
| Engine sprawl cleanup | issue #419 | First-class vs experimental TTS backends distinction |
| Frontend tech-debt burn-down | issue #421 | Biome + a11y debt before gating CI |
| Docker registry auto-publish | PR #463, issue #453 | ghcr.io image on tag push |
| New model research | voicebox-new-models branch | Evaluating Fish Speech, XTTS-v2, Pocket TTS, VibeVoice, Fish Audio S2, index-tts2 |

TTS Engine Comparison

| Engine | Model Name | Profile Type | Languages | Size | Key Features | Instruct Support |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3-TTS 1.7B | qwen-tts-1.7B | Cloned | 10 (zh, en, ja, ko, de, fr, ru, pt, es, it) | ~3.5 GB | Highest quality, voice cloning | None (Base model has no instruct path) |
| Qwen3-TTS 0.6B | qwen-tts-0.6B | Cloned | 10 | ~1.2 GB | Lighter, faster | None |
| Qwen CustomVoice 1.7B | qwen-custom-voice-1.7B | Preset | 10 | ~3.5 GB | Predefined speakers, instruct support | Yes |
| Qwen CustomVoice 0.6B | qwen-custom-voice-0.6B | Preset | 10 | ~1.2 GB | Predefined speakers, instruct support | Yes |
| LuxTTS | luxtts | Cloned | English | ~300 MB | CPU-friendly, 48 kHz, fast | None |
| Chatterbox | chatterbox-tts | Cloned | 23 (incl. Hebrew, Arabic, Hindi, etc.) | ~3.2 GB | Zero-shot cloning, multilingual | Partial: exaggeration float (0-1) |
| Chatterbox Turbo | chatterbox-turbo | Cloned | English | ~1.5 GB | Paralinguistic tags ([laugh], [cough]), 350M params, low latency | Partial: inline tags only |
| TADA 1B | tada-1b | Cloned | English | ~4 GB | HumeAI speech-language model, 700s+ coherent audio | None |
| TADA 3B Multilingual | tada-3b-ml | Cloned | 10 (en, ar, zh, de, es, fr, it, ja, pl, pt) | ~8 GB | Multilingual, text-acoustic dual alignment | None |
| Kokoro 82M | kokoro | Preset | 8 (en, es, fr, hi, it, pt, ja, zh) | ~350 MB | 82M params, CPU realtime, Apache 2.0, pre-built voices | None |

Multi-Engine Architecture (Shipped)

  • Thread-safe backend registry (_tts_backends dict + _tts_backends_lock) with double-checked locking
  • Per-engine backend instances — each engine gets its own singleton, loaded lazily
  • Engine field on GenerationRequest — frontend sends engine: 'qwen' | 'qwen_custom_voice' | 'luxtts' | 'chatterbox' | 'chatterbox_turbo' | 'tada' | 'kokoro'
  • Per-engine language filtering: ENGINE_LANGUAGES map in frontend, backend regex accepts all languages
  • Per-engine voice prompts: create_voice_prompt_for_profile() dispatches to the correct backend
  • Profile type system — preset vs cloned profiles, UI grays out incompatible engines and auto-switches on selection
  • Trim post-processing: trim_tts_output() for Chatterbox engines (cuts trailing silence/hallucination)
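For readers unfamiliar with the pattern, the registry bullets above can be sketched as follows. The names _tts_backends, _tts_backends_lock, and get_tts_backend_for_engine appear in this document; the _FACTORIES map and the use of plain object instances as stand-in backends are assumptions made so the example runs on its own.

```python
import threading
from typing import Callable, Dict

_tts_backends: Dict[str, object] = {}
_tts_backends_lock = threading.Lock()

# Hypothetical factories; the real registry constructs engine backend classes.
_FACTORIES: Dict[str, Callable[[], object]] = {
    "kokoro": object,
    "luxtts": object,
}


def get_tts_backend_for_engine(engine: str) -> object:
    """Return the per-engine singleton, creating it lazily on first use."""
    backend = _tts_backends.get(engine)  # first check: no lock on the hot path
    if backend is not None:
        return backend

    with _tts_backends_lock:
        backend = _tts_backends.get(engine)  # second check: another thread may have won
        if backend is None:
            factory = _FACTORIES.get(engine)
            if factory is None:
                raise ValueError(f"unknown engine: {engine}")
            backend = factory()
            _tts_backends[engine] = backend
        return backend
```

The unlocked first read keeps repeat lookups cheap, while the second read under the lock guarantees that concurrent first requests for the same engine still end up sharing one instance.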

Known Limitations

  • HF XET progress: Large files downloaded via hf-xet (HuggingFace's new transfer backend) report n=0 in tqdm updates. Progress bars may appear stuck for large .safetensors files even though the download is proceeding. This is a known upstream limitation.
  • Chatterbox Turbo upstream token bug: from_pretrained() passes token=os.getenv("HF_TOKEN") or True which fails without a stored HF token. Our backend works around this by calling snapshot_download(token=None) + from_local().
  • chatterbox-tts must install with --no-deps: It pins numpy<1.26, torch==2.6.0, transformers==4.46.3 — all incompatible with our stack (Python 3.12, torch 2.10, transformers 4.57.3). Sub-deps listed explicitly in requirements.txt.
  • Instruct parameter partially shipped (#224, #303): Qwen CustomVoice (PR #328) now provides real instruct support via predefined speakers. Other backends still silently drop the instruct field — the UI exposes the field broadly but most engines ignore it. The floating generate box was patched to restore instruct for CustomVoice (commit 106aec4).
  • Streaming generation only works for Qwen on MLX. Other engines use the non-streaming /generate endpoint.
  • dicta-onnx (Hebrew diacritization) not included — upstream Chatterbox bug requires model_path arg but calls Dicta() with none. Hebrew works fine without it.
  • Blackwell (RTX 50-series) CUDA: cu128 + sm_120 kernel support shipped (PR #401, #316), but users still report cudaErrorNoKernelImageForDevice (#417, #400, #396, #395, #390, #362) — likely a stale CUDA binary on upgraded installs. Needs a follow-up diagnostic / forced re-download path.
  • Long text 50k character limit (#464, #365, #354): Still hit on GPU despite chunking (PR #266). Chunking reliability needs another pass.
  • ROCm on RDNA 3/4 (#469): HSA_OVERRIDE_GFX_VERSION is hardcoded and harms newer cards.
  • "flash-attn is not installed" warning on every platform (cosmetic, common user complaint): Our transformer-based engines (Chatterbox / Qwen) emit "Warning: flash-attn is not installed. Will only run the manual PyTorch version. Please install flash-attn for faster inference." on every startup, on every platform. We don't pin flash-attn in requirements because installing it is fragile and version-sensitive. The fallback is PyTorch SDPA, which is near-FA2 throughput on Ampere+ and is what actually runs. Per-platform reality:
    • macOS/Apple Silicon: FlashAttention is CUDA-only, so it is irrelevant here; MLX has its own attention kernels.
    • Linux: pip install flash-attn --no-build-isolation works but takes 20+ min to compile.
    • Windows: no official support (Dao-AILab README still says only "Might work"; source builds routinely fail on recent CUDA/MSVC, issues #1715, #1828, #2395). Windows users can install community prebuilt wheels from kingbri1/flash-attention or bdashore3/flash-attention (latest v2.8.3, Aug 2025; win_amd64 wheels for CUDA 12.4/12.8, Torch 2.6–2.9, Python 3.10–3.13) matching their exact CUDA/Torch/Python, or use WSL2. Native-Windows alternatives worth considering as a build-time swap: SageAttention (thu-ml, Apache 2.0, claims 2–5× over FA2) and xformers (official Windows wheels).
    • Action for us: the troubleshooting doc now covers it (see docs/content/docs/overview/troubleshooting.mdx), and we should optionally suppress the warning via logging.getLogger(...).setLevel(ERROR) at backend import, since the fallback is functionally fine.
  • WebAudio playback dies after audio-session interruption (#41, plus an internal repro where the app is backgrounded long enough): WaveSurfer's AudioContext gets suspended by macOS — either because another app grabs the audio output, or because the WKWebView throttles when backgrounded. play() resolves and timeupdate can still fire, but no audio reaches the output. Only app restart fixes it. Things already tried that didn't work: (a) swapping WaveSurfer backend away from WebAudio — introduced more bugs, not an option; (b) remount hook on the player — doesn't help because a freshly-created AudioContext is born suspended and only resumes on a user gesture. PR #293 was a prior partial fix that doesn't cover this path. Next thing to try (not yet attempted — confirmed via grep of AudioPlayer.tsx): call wavesurfer.getMediaElement().getGainNode().context.resume() on the play button click (the click itself is a valid user gesture), plus a visibilitychange + statechange listener as belt-and-suspenders. The ctx.resume() pattern already exists in the codebase at useStoryPlayback.ts:52 — just not wired into the main player.
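For the flash-attn bullet above, the proposed suppression could look roughly like this. The logger name "example_tts_lib" is a placeholder, since this document elides the real one, and quiet_logger is a hypothetical helper. Note the assumption baked into the approach: it only has an effect if the upstream message goes through Python's logging module. If the library prints the text directly, raising a logger level will not hide it.

```python
import logging


def quiet_logger(name: str, level: int = logging.ERROR) -> logging.Logger:
    """Raise a third-party logger's threshold so WARNING-level noise is dropped."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    return logger


# Placeholder name: substitute the logger the upstream package actually uses.
noisy = quiet_logger("example_tts_lib")

noisy.warning("flash-attn is not installed")  # filtered out, nothing is emitted
noisy.error("model failed to load")           # still passes the threshold
```

Doing this once at backend import keeps real errors visible while hiding the cosmetic warning.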

Open PRs — Triage & Analysis

Recently Merged (Since Last Update — 2026-03-18 → 2026-04-18)

| PR | Title | Merged |
| --- | --- | --- |
| #481 | fix(build): pin transformers in MLX requirements to prevent 5.x upgrade | 2026-04-19 |
| #470 | fix(api-client): declare moved + errors on migrateModels response type | 2026-04-18 |
| #457 | fix(linux): use pactl to detect PipeWire/PulseAudio monitor | 2026-04-18 |
| #450 | docs: clarify paralinguistic tag support in quick start | 2026-04-18 |
| #447 | fix: delete version rows and files in delete_generations_by_profile | 2026-04-18 |
| #444 | Fix generation cancellation flow | 2026-04-18 |
| #440 | fix(paths): strip legacy "data/" prefix when resolving stored paths | 2026-04-18 |
| #439 | Fix migration dialog hanging when no models are present | 2026-04-18 |
| #438 | fix(build): repair frozen-binary imports for kokoro/chatterbox-multilingual/scipy/transformers | 2026-04-18 |
| #433 | fix: warn user when no models to migrate during storage change | 2026-04-18 |
| #425 | Add NUMBA_CACHE_DIR environment variable | 2026-04-16 |
| #424 | fix: avoid ScreenCaptureKit launch crash on macOS 11 | 2026-04-16 |
| #418 | Frontend quality gates + TypeScript hardening | 2026-04-18 |
| #416 | fix(deps): relax PyTorch requirement for macOS Intel (x86_64) | 2026-04-16 |
| #412 | feat(history): add "Clear failed" button | 2026-04-16 |
| #405 | fix: keep cpal Stream alive until playback completes | 2026-04-16 |
| #403 | fix: prevent intermittent clip splitting failures | 2026-04-16 |
| #402 | fix: reliably keep server alive after GUI close on Windows | 2026-04-16 |
| #401 | feat: add Blackwell GPU (sm_120) CUDA support | 2026-04-16 |
| #394 | fix(history): populate status/error/engine fields from DB row | 2026-04-16 |
| #384 | Fix: Resolve ModuleNotFoundError in effects service | 2026-04-16 |
| #361 | fix: torch.from_numpy crash with numpy 2.x in frozen binary | 2026-04-16 |
| #345 | Fix: "Failed to Save" preset error by resolving backend import path | 2026-03-22 |
| #344 | fix: include changelog in docker web build | 2026-03-27 |
| #332 | Fix links in Get Started section of index.mdx | 2026-03-21 |
| #328 | feat: add Qwen CustomVoice preset engine | 2026-03-27 |
| #325 | feat: Kokoro 82M TTS engine + voice profile type system | 2026-03-20 |
| #321 | fix: allows deletion of failed generations | 2026-03-19 |
| #320 | feat: Intel Arc (XPU) GPU support | 2026-03-21 |
| #319 | fix: GUI startup with external server + data refresh on server switch | 2026-03-27 |
| #318 | fix: force offline mode when loading cached models (Qwen TTS & Whisper) | 2026-03-21 |
| #316 | Upgrade CUDA backend from cu126 to cu128, fix GPU settings UI | 2026-03-18 |

Currently Open (12 PRs)

| PR | Title | Status | Notes |
| --- | --- | --- | --- |
| #465 | docs: define tier-1 and tier-2 platform support targets | Community PR | Pairs with issue #420. Important for scoping. |
| #463 | feat(actions): add docker-registry.yml for automatic ghcr.io publishing | Community PR | Pairs with issue #453. Low risk. |
| #443 | fix: prevent infinite retry loop in offline mode (#434) | Community PR | Fixes reported bug. |
| #430 | feat: add MiniMax TTS provider support | Community PR | Cloud TTS provider: new direction (external API). Superset of #331? |
| #331 | feat: add MiniMax Cloud TTS as a built-in engine | Community PR | Likely superseded by #430. Dedupe. |
| #311 | feat: add CosyVoice2/3 TTS engine | Close | Abandoned: output quality too poor. |
| #253 | Enhance speech tokenizer with 48kHz version | Community PR | Qwen tokenizer upgrade. Still worth reviewing. |
| #227 | fix: harden input validation & file safety | Community PR | Coupled to #225 (custom models). |
| #225 | feat: custom HuggingFace voice model support | Community PR | Needs rework for multi-engine arch. |
| #195 | feat: per-profile LoRA fine-tuning | Draft | Complex. 15 new endpoints. |
| #154 | feat: Audiobook tab | Community PR | Chunked generation now shipped (#266). |
| #91 | fix: CoreAudio device enumeration | Draft | macOS audio device handling. |

Open Issues — Categorized

GPU / Hardware Detection — still the top category

RTX 50-series (Blackwell / sm_120) cluster — NEW: #417, #400, #396, #395, #390, #362 all report cudaErrorNoKernelImageForDevice / "no kernel image available." sm_120 support shipped in PR #401 + cu128 in PR #316, but users on upgraded installs still hit it — likely stale CUDA binary. Needs a diagnostic that detects binary/GPU-arch mismatch and prompts re-download.

AMD / ROCm — NEW: #469 HSA_OVERRIDE_GFX_VERSION is hardcoded and breaks RDNA 3/4 cards. #313 DirectML on AMD Ryzen AI Max+ 395 not working.

Intel Arc: PR #320 shipped XPU support — may resolve #119.

General GPU-not-detected (older): #368, #310, #330, #324, #326, #355 (multi-GPU / eGPU).

Fix path: CUDA backend swap (PR #252) + cu128 (PR #316) + sm_120 (PR #401) + GPU-arch warning (73170d0) are all in. Remaining work is diagnostics + re-download prompts for users whose binary predates the kernel updates.

Model Downloads

Still reported. Users get stuck downloads, can't resume, offline mode edge cases.

Key issues: #475 (MAC CustomVoice install error), #449 (infinite loading macOS), #445 (can't download CustomVoice), #462 (Qwen requires internet even when loaded — regression from #150), #434 (infinite retry loop offline — PR #443 open), #432 (storage location change hangs when empty — partly fixed by PR #439/#433), #348 (TADA 3B Multilingual download fails), #336 (TADA model not listed in app), #275 (No module named 'chatterbox' on download), #304 (whisper-base feature extractor load error), #287 (macOS ARM check_model_inputs ImportError on new version), #181, #180.

Fix path: PR #443 addresses infinite offline retry. CustomVoice-specific download failures (#475, #445) need triage — likely related to frozen-binary import fixes in PR #438. TADA cluster (#336, #348) and macOS ARM import regressions (#287, #275, #304) need a dedicated triage pass.

Qwen 0.6B-downloads-1.7B reports: #485 (2026-04-19), #423 (macOS M1), #329. Originally a stale-fallback bug: mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16 wasn't published when MLX support shipped, so the 0.6B slot was aliased to the 1.7B repo. The 0.6B bf16 conversion is live now and both backend/backends/mlx_backend.py and backend/backends/__init__.py point at their correct repos. Qwen CustomVoice is unaffected — it runs via PyTorch on all platforms, both sizes always have dedicated repos.

Language Requests (ongoing)

Strong demand: Hungarian (#479), Indonesian (#458, #247), Thai (#455), Bangla (#454), Arabic (#379), Persian (#162), IndicF5 (#339 — Indian languages), Ukrainian (#109), Chinese UI (#392, #261).

Fix path: Chatterbox Multilingual (PR #257) covers Arabic, Danish, German, Greek, Finnish, Hebrew, Hindi, Dutch, Norwegian, Polish, Swedish, Swahili, Turkish. Still missing: Hungarian, Indonesian, Thai, Bangla, Ukrainian. Issue #411 offers a PR for UI i18n foundation.

New Model Requests (growing)

| Issue | Model Requested |
| --- | --- |
| #478 | CosyVoice3 (we tried & abandoned CosyVoice2/3; see #311) |
| #407, #347 | RVC-style voice-to-voice / seed voice conversion (STS) |
| #385 | Fish Audio S2 |
| #380 | OmniVoice |
| #370 | index-tts2 |
| #364 | Voxtral-TTS |
| #335 | Faster-Qwen-TTS |
| #346 | Multi-model batch request |
| #381 | Microsoft MAI models |
| #339 | IndicF5 |
| #226 | GGUF support |
| #172 | VibeVoice |
| #138 | Export to ONNX/Piper format |
| #132 | LavaSR (transcription) |
| #147 | Facebook Omnilingual ASR |
| #338 | Default voices |

The multi-engine architecture makes integration straightforward — see content/docs/developer/tts-engines.mdx. Platform-specific gating (e.g. VoxCPM CUDA-only) doesn't exist yet and would need design.

Platform Scope & Quality Debt — NEW category

Awareness issues filed this cycle — ties into engine sprawl and platform tier work.

  • #419 — Engine sprawl: define first-class vs experimental TTS backends
  • #420 — Formalize tier-1 vs tier-2 platform support targets (PR #465 open)
  • #421 — Track & burn down frontend Biome + a11y debt before gating CI
  • #422 — Code-split web build (main bundle > 1 MB)

Long-Form / Chunking

Still reported despite chunking + queue being merged.

Key issues: #464 (50k char limit on GPU despite 16 GB VRAM — v0.4.0), #365 (FR: >50k chars), #363 (smart chunking to prevent robotic artifacts), #354 (50k limit v0.3.0).

Fix path: Chunking (#266) and queue (#269) shipped. Remaining work is raising/removing the 50k guard and tuning chunk boundaries for prosody.
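To make "tuning chunk boundaries for prosody" concrete, here is a minimal sentence-aware chunker. It is not the implementation from PR #266; the function name, the regex, and the 500-character default (borrowed from the old limit mentioned earlier in this document) are assumptions. The idea is simply to prefer splitting at sentence ends and to fall back to a hard cut only when a single sentence exceeds the limit.

```python
import re
from typing import List

# Split after ., ! or ? when followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def chunk_text(text: str, max_chars: int = 500) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    chunks: List[str] = []
    current = ""
    for sentence in _SENTENCE_END.split(text.strip()):
        if not sentence:
            continue
        # A single over-long sentence is hard-split as a last resort.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

For example, `chunk_text("One. Two. Three.", max_chars=10)` keeps the first two sentences together and starts a new chunk for the third. A production version would likely also consider commas, paragraph breaks, and language-specific punctuation.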

Feature Requests (ongoing)

Notable:

  • #480 — Noise removal on uploaded recordings
  • #448 — API for non-Qwen models (external integrations)
  • #427 — Task status control
  • #407, #347 — Voice-to-voice / audio-to-audio conversion
  • #387 — Location of downloaded generated voices
  • #383 — Concatenate partial reference audio into generated audio
  • #382 — Lightning.ai support
  • #376 — Remote mode
  • #353 — Audio transcoding
  • #317 — Voice pitch control
  • #189 — "Auto" language option
  • #173 — Vocal intonation/inflection control
  • #165, #270 — Audiobook mode (PR #154 open)
  • #242 — Seed value pinning
  • #228 — Always use 0.6B option
  • #235 — Finetuned Qwen3-TTS tokenizer (PR #253 open)
  • #144 — Copy text to clipboard

Housekeeping / Triage Needed

| Issue | Reason |
| --- | --- |
| #431, #408 | Spam: Chinese "free Claude API" promos. Close. |
| #398 ("Excelente") | Non-issue. Close. |
| #357 | Informational: project featured in Awesome MLX. Close after acknowledgement. |
| #374, #377 | Version-release questions, no bug. Close. |
| #306 ("voice model"), #389 ("New model"), #473 ("New functionality") | Title-only issues, no content. Request details or close. |
| #309 | Uninstall/cleanup question. Answer and close. |
| #241 | "How to use in Colab": support question, not a bug. |
| #423 / #485 / #329 | Stale MLX fallback to 1.7B repo. Fixed: 0.6B bf16 conversion now live on mlx-community, registry points at correct repo on both backends. |
| #336 / #348 | TADA download/registration cluster; triage together. |
| #287 / #275 / #304 | macOS ARM import regressions on new version; likely one root cause. |
| #292, #349 | Possibly already fixed by merged PRs (#321/#412 and #345). Verify + close. |

~70 older issues (pre-#170) not individually categorized above. Most are long-tail support questions or duplicates of problems now addressed by the multi-engine / model-registry work. A dedicated backlog-sweep pass is overdue.

Bugs (ongoing)

| Category | Issues |
| --- | --- |
| Generation failures | #476, #467, #452, #459 (voice clone fetch error), #468 (tada-1b marked error), #437, #300, #301, #282 |
| Audio quality | #456 (clipping errors v0.4.0), #436 (emotion labels), #333 (pitch/echo), #307 (by-model breakdown), #340 (all generations say "www...") |
| Transcription | #371 (fails every time), #291 (extract transcription from generated audio) |
| Effects / presets | #349 ("Failed to save" when creating effects presets; possibly fixed by merged #345) |
| File ops | #477 (spacy_pkuseg dict missing on frozen Windows build), #472 (storage location change), #283 (allow longer files for voice creation + in-app trim), #350 (failed to add sample) |
| History | #292 (can't delete failed generations; possibly fixed by merged #321/#412) |
| Windows | #466 (install problem), #375 (WinError 5 access denied), #273 (port 8000 conflict), #201 (model doesn't stay loaded) |
| Linux | #471 (thread-safe PULSE_SOURCE), #413 (Arch build), #409 (Kubuntu build), #351, #341 |
| macOS | #441 (older macOS), #369 (malware flag), #334 (microphone permission), #287 (check_model_inputs ImportError, regression), #171 (ARM64 binary won't open) |
| Profile/UI | #360 (Kokoro profile hides others; partly addressed by auto-switch), #299 (drag-drop on Win11), #329 (size selector state bug), #393 (stuck loading screen after reinstall to new dir) |
| Integrations | #397 (SAMMI-bot 422 Unprocessable Entity) |
| Audio playback / session | #41 (macOS: Voicebox goes silent after another app takes audio output; restart restores it); see deep-dive below |
| Database | #174 (sqlite3 IntegrityError) |

Existing Plan Documents — Status

| Document | Target Version | Status | Relevance |
| --- | --- | --- | --- |
| TTS_PROVIDER_ARCHITECTURE.md | v0.1.13 | Partially superseded by multi-engine arch + CUDA swap | Core concepts implemented differently than planned |
| CUDA_BACKEND_SWAP.md | | Shipped (PR #252) | CUDA binary download + backend restart |
| CUDA_BACKEND_SWAP_FINAL.md | | Shipped (PR #252) | Final implementation plan |
| EXTERNAL_PROVIDERS.md | v0.2.0 | Not started | Remote server support |
| MLX_AUDIO.md | | Shipped | MLX backend is live |
| DOCKER_DEPLOYMENT.md | v0.2.0 | Shipped (PR #161) | Docker + web deployment |
| OPENAI_SUPPORT.md | v0.2.0 | Not started | OpenAI-compatible API layer |
| PR33_CUDA_PROVIDER_REVIEW.md | | Reference | Analysis of the original provider approach |

New Model Integration — Landscape

Status Snapshot (2026-04-18)

| Model | Cloning | Speed | Sample Rate | Languages | VRAM | Instruct | Cross-platform? | Status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-TTS | 10s zero-shot | Medium | 24 kHz | 10 | Medium | None | MLX + PyTorch | Shipped |
| Qwen CustomVoice | Preset speakers | Medium | 24 kHz | 10 | Medium | Yes | PyTorch | Shipped (PR #328) |
| LuxTTS | 3s zero-shot | 150x RT, CPU ok | 48 kHz | English | <1 GB | None | All | Shipped (PR #254) |
| Chatterbox MTL | 5s zero-shot | Medium | 24 kHz | 23 | Medium | Partial: exaggeration | CPU/CUDA | Shipped (PR #257) |
| Chatterbox Turbo | 5s zero-shot | Fast | 24 kHz | English | Low | Partial: inline tags | CPU/CUDA | Shipped (PR #258) |
| HumeAI TADA 1B/3B | Zero-shot | 5x faster than LLM-TTS | 24 kHz | EN (1B), 10 (3B) | Medium | Partial: prosody | PyTorch | Shipped (PR #296) |
| Kokoro-82M | Preset voices | CPU realtime | 24 kHz | 8 | Tiny (82M) | None | All | Shipped (PR #325) |
| CosyVoice2-0.5B | 3-10s zero-shot | Very fast | 24 kHz | Multilingual | Low | Yes | | Abandoned (PR #311): poor output quality |
| VoxCPM2 | Zero-shot | ~0.15 RTF streaming | 48 kHz | 30 | Medium | Partial: parenthetical style | CUDA-only in practice | Backlogged (2026-04-18); see notes above |
| Fish Speech | 10-30s few-shot | Real-time | 24-44 kHz | 50+ | Medium | Yes: word-level inline | All | Candidate; license TBD |
| Fish Audio S2 | | | | | | | | Candidate (#385) |
| XTTS-v2 | 6s zero-shot | Mid-GPU | 24 kHz | 17+ | Medium | Partial: style transfer from ref | All | Candidate; CPML license likely blocker |
| Pocket TTS (Kyutai) | Zero-shot + streaming | >1x RT on CPU | | English + several European (FR/DE/PT/IT/ES added by Feb 2026) | ~100M | None | CPU-first | Candidate; MIT |
| MOSS-TTS-Nano | Zero-shot | Realtime on 4 CPU cores | 48 kHz stereo | 20 | 0.1B | Partial: MOSS-VoiceGenerator companion does text-to-voice design | All (ONNX CPU path dropped 2026-04-17) | Top candidate: Apache 2.0, released 2026-04-13, streaming |
| VibeVoice (Microsoft) | Multi-speaker long-form (up to 90 min, 4 speakers) | | | | 1.5B | | | Candidate (#172); Stories-editor fit |
| index-tts2 | | | | | | | | Candidate (#370) |
| Voxtral TTS (Mistral) | Zero-shot (short clips) + 20 preset voices | Single-GPU | | | 4B (Voxtral-4B-TTS-2603) | Presets + cloning | CUDA (16 GB+ VRAM) | Candidate (#364); frontier quality claim, open-weight |
| Dia / Dia2 | | | | | | | | Watch: emotion-forward, but "rough edges" / artifacts per April reviews |
| IndicF5 | | | | Indian languages | | | | Candidate (#339); fills Indic gap |
| MiniMax Cloud TTS | | Cloud | | | N/A (API) | N/A | | Community PR #430, #331; new direction (external API) |
| OmniVoice | | | | | | | | Candidate (#380) |
| RVC voice conversion | N/A (STS) | | | | | N/A | All | New modality, not TTS (#407, #347) |

Watch list: MioTTS-2.6B (fast LLM-based EN/JP, vLLM compatible), Oolel-Voices (Soynade Research, expressive modular control), Faster-Qwen-TTS (#335), Orpheus / Sesame CSM (on-device fine-tuning discussions), Fish Audio S2 Pro / Fish Speech V1.5 (benchmark leader but research/non-commercial license — same blocker as Fish Speech).

Deep-research pass (2026-04-18): MOSS-TTS-Nano identified as the freshest high-alignment candidate — verified via OpenMOSS/MOSS-TTS README (0.1B params, Apache 2.0, 48 kHz stereo, 4-core CPU realtime, streaming, released 2026-04-13). Dedicated repo: OpenMOSS/MOSS-TTS-Nano. Voxtral TTS verified on HF as mistralai/Voxtral-4B-TTS-2603.

Active Evaluation Criteria (learned from cycle)

  1. Cross-platform first. MLX is a primary backend for our Apple Silicon user base. CUDA-only models require platform gating that doesn't exist yet — shipping one sets a precedent (see VoxCPM notes, issues #419/#420).
  2. PyPI + Apache/MIT licensing preferred. Heavy deps, git-only installs, and --no-deps workarounds are expensive to maintain (Chatterbox taught us this).
  3. Output quality is non-negotiable. CosyVoice was abandoned despite the best instruct API.
  4. Instruct support fills a real gap (#173, #224, #303). Qwen CustomVoice partially addresses it with preset speakers; zero-shot clone-with-instruct is still unmet.
  5. Long-form + streaming are user-requested (#363, #365, #464). Candidates with native streaming (Pocket TTS, Fish Speech) get extra weight.

Adding a New Engine (Now Straightforward)

With the model config registry and shared EngineModelSelector component, adding a new TTS engine requires:

  1. Create backend/backends/<engine>_backend.py: implement the TTSBackend protocol (~200-300 lines)
  2. Register in backend/backends/__init__.py: add a ModelConfig entry, a TTS_ENGINES entry, and a factory elif branch
  3. Update backend/models.py: add the engine name to the regex
  4. Update the frontend: add the engine to the engine union type, EngineModelSelector options, form schema, language map, and profile type gating, plus icons and labels (~9 files, going by a grep for kokoro)

main.py requires zero changes — the registry handles all dispatch automatically.
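The backend half of the steps above (steps 1 to 3) can be sketched roughly as follows. This is a minimal illustration, not the project's code: the protocol methods, the ModelConfig fields, the `example` engine, the `org/example-tts` repo ID, and the dict-of-factories shape of TTS_ENGINES are all assumptions made for the example. The real definitions live in backend/backends/__init__.py and backend/models.py.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Protocol


class TTSBackend(Protocol):
    # Stand-in for the real protocol; the actual method set may differ.
    def load_model(self) -> None: ...
    def is_loaded(self) -> bool: ...
    def generate(self, text: str, language: str) -> bytes: ...


@dataclass(frozen=True)
class ModelConfig:
    # Field names are assumptions for illustration only.
    model_name: str
    engine: str
    display_name: str
    hf_repo_id: str


class ExampleBackend:
    """Step 1: a hypothetical engine that satisfies the protocol."""

    def __init__(self) -> None:
        self._loaded = False

    def load_model(self) -> None:
        self._loaded = True  # a real backend would load weights here

    def is_loaded(self) -> bool:
        return self._loaded

    def generate(self, text: str, language: str) -> bytes:
        if not self._loaded:
            raise RuntimeError("model not loaded")
        return b""  # placeholder: a real backend returns audio data


# Step 2: one registry entry plus one factory entry (hypothetical values).
MODEL_CONFIGS: Dict[str, ModelConfig] = {
    "example": ModelConfig("example", "example", "Example TTS", "org/example-tts"),
}
TTS_ENGINES: Dict[str, Callable[[], TTSBackend]] = {"example": ExampleBackend}

# Step 3: the engine-name regex gains the new name (pattern is illustrative).
ENGINE_PATTERN = re.compile(r"^(qwen|luxtts|example)$")


def create_backend(engine: str) -> TTSBackend:
    """Validate the engine name, then build the backend from the registry."""
    if not ENGINE_PATTERN.match(engine):
        raise ValueError(f"unknown engine: {engine}")
    return TTS_ENGINES[engine]()
```

In this sketch nothing outside the registry and the regex mentions `example`, which is the property that keeps main.py untouched.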

Platform gating doesn't exist yet. If we add a CUDA-only model (e.g. VoxCPM), ModelConfig needs a new requires_cuda flag (or, more generally, requires: list[device]), plumbed through the /models API and surfaced in ModelManagement.tsx and EngineModelSelector.tsx as a lock icon plus a "Requires NVIDIA GPU" state. The backend should also raise a hard error in load_model() as a safety net.

Total effort: ~1 day for a well-documented model with a PyPI package, cross-platform. ~2 days if platform gating is required. See content/docs/developer/tts-engines.mdx for the full guide.


Architectural Bottlenecks

1. Single Backend Singleton — RESOLVED

The singleton TTS backend was replaced with a thread-safe per-engine registry in PR #254. Multiple engines can now be loaded simultaneously.
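For readers unfamiliar with the pattern, a per-engine registry with double-checked locking can be sketched as below. The names `_ENGINES`, `_ENGINES_LOCK`, and `get_engine` are hypothetical and the real implementation in backend/backends/__init__.py may be structured differently.

```python
import threading
from typing import Callable, Dict

# Hypothetical module-level state: one cached backend instance per engine name.
_ENGINES: Dict[str, object] = {}
_ENGINES_LOCK = threading.Lock()


def get_engine(name: str, factory: Callable[[], object]) -> object:
    """Return the cached backend for `name`, creating it at most once."""
    engine = _ENGINES.get(name)  # first check, without taking the lock
    if engine is None:
        with _ENGINES_LOCK:
            engine = _ENGINES.get(name)  # second check, under the lock
            if engine is None:
                engine = factory()
                _ENGINES[name] = engine
    return engine
```

The unlocked first check keeps the common path (engine already created) cheap, while the second check under the lock prevents two threads from both constructing the same engine.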

2. main.py Dispatch Point Duplication — RESOLVED

Previously, each engine required updates to 6+ hardcoded dispatch maps across main.py (~320 lines of if/elif chains). A model config registry in backend/backends/__init__.py now centralizes all model metadata (ModelConfig dataclass) with helper functions (load_engine_model(), check_model_loaded(), engine_needs_trim(), etc.). Adding a new engine requires zero changes to main.py.

3. Model Config is Scattered — RESOLVED

Model identifiers, HF repo IDs, display names, and engine metadata are now consolidated in the ModelConfig registry. Backend-aware branching (e.g. MLX vs PyTorch Qwen repo IDs) happens inside the registry. Frontend model options are centralized in EngineModelSelector.tsx.

4. Voice Prompt Cache Assumes PyTorch Tensors

backend/utils/cache.py uses torch.save() / torch.load(). LuxTTS, Chatterbox, and Kokoro backends work around this by storing reference audio paths (or preset voice IDs) instead of tensors in their voice prompt dicts. Not ideal but functional.
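One possible direction, shown only as a sketch, is a small cache for prompts that contain plain paths or preset IDs, serialized as JSON so it does not depend on torch. This is not how backend/utils/cache.py works today; the `VoicePrompt` shape, the key scheme, and the function names are assumptions for illustration.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional, TypedDict


class VoicePrompt(TypedDict, total=False):
    # Hypothetical shape; real voice prompt dicts are engine-specific.
    ref_audio_path: str
    ref_text: str
    preset_voice_id: str


def cache_key(engine: str, profile_id: str) -> str:
    """Derive a stable, filesystem-safe key from engine and profile."""
    return hashlib.sha256(f"{engine}:{profile_id}".encode()).hexdigest()[:16]


def save_prompt(cache_dir: Path, key: str, prompt: VoicePrompt) -> Path:
    """Write the prompt as JSON; no tensor serialization involved."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{key}.json"
    path.write_text(json.dumps(prompt), encoding="utf-8")
    return path


def load_prompt(cache_dir: Path, key: str) -> Optional[VoicePrompt]:
    """Return the cached prompt, or None on a cache miss."""
    path = cache_dir / f"{key}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```

Engines that do need tensors could keep using the torch-based path; the sketch only covers the path-or-preset case described above.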

5. Frontend Assumes Qwen Model Sizes — RESOLVED

The generation form now uses a flat model dropdown with engine-based routing. Per-engine language filtering is in place. Model size is only sent for Qwen / Qwen CustomVoice.

6. No Platform Gating on Models — NEW

ModelConfig has no way to express hardware requirements. Every engine is shown to every user, regardless of whether it will actually load. Users on non-CUDA platforms discover the failure at load time, or not at all, since some fall back silently to CPU and never complete. This blocks shipping CUDA-only engines (VoxCPM), and fixing it would improve the Intel Arc / ROCm / CPU-only UX today. TODO for ModelConfig: add requires: list[Literal["cuda", "mps", "xpu", "cpu", "rocm"]] or equivalent, plumb it through the /models API, and render it in ModelManagement.tsx and EngineModelSelector.tsx.
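A minimal sketch of what such a flag could look like is shown below. Only the `requires` idea comes from the notes above; the other ModelConfig fields, the payload shape, and the helper names are hypothetical. A tuple is used instead of a list here so the frozen dataclass stays hashable.

```python
from dataclasses import dataclass
from typing import Iterable, List, Literal, Set, Tuple

Device = Literal["cuda", "mps", "xpu", "cpu", "rocm"]


@dataclass(frozen=True)
class ModelConfig:
    # Only `requires` is the proposed addition; other fields are placeholders.
    model_name: str
    display_name: str
    requires: Tuple[Device, ...] = ()  # empty tuple means "runs anywhere"


def is_supported(config: ModelConfig, available: Set[str]) -> bool:
    """A model is supported if it has no requirement or any required device is present."""
    return not config.requires or any(d in available for d in config.requires)


def models_payload(configs: Iterable[ModelConfig], available: Set[str]) -> List[dict]:
    """Shape a hypothetical /models response so the UI can render a lock state."""
    return [
        {
            "model_name": c.model_name,
            "display_name": c.display_name,
            "requires": list(c.requires),
            "supported": is_supported(c, available),
        }
        for c in configs
    ]


def load_model(config: ModelConfig, available: Set[str]) -> None:
    """Backend safety net: refuse to load on unsupported hardware."""
    if not is_supported(config, available):
        needed = ", ".join(config.requires)
        raise RuntimeError(f"{config.display_name} requires one of: {needed}")
    # A real implementation would load the model weights here.
```

With this shape, the frontend only needs the `supported` and `requires` fields to decide whether to show the lock icon.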

7. Engine Sprawl — NEW

Seven TTS engines shipped, more candidates queued. Issue #419 asks for a first-class vs experimental distinction. Related: issue #420 asks for formalized platform support tiers. Combined, these would let us ship more engines more confidently with clearer expectations for users.


Tier 1 — Ship Now

| Priority | PR/Item | Impact | Effort |
|---|---|---|---|
| 1 | RTX 50-series / Blackwell diagnostic: detect stale CUDA binary vs GPU arch, prompt re-download (#417, #400, #396, #395, #390, #362) | Large cluster of user-blocking errors | Medium |
| 2 | CustomVoice download failures (#475, #445) | New engine blocked on macOS/Windows; regression triage | Medium |
| 3 | 50k char limit on GPU (#464) | Regression; chunking should handle this | Medium |
| 4 | Close PR #311 (CosyVoice) and dedupe #331/#430 (MiniMax) | Housekeeping | None |
| 5 | PR #443: infinite offline retry loop | Bug fix, reviewable | Low |
| 6 | PR #465: define tier-1 / tier-2 platforms | Unblocks engine-sprawl decision (#419) | Low |
| 7 | PR #463: docker registry auto-publish | Community PR, low risk | Low |
| 8 | #253: 48 kHz speech tokenizer | Quality improvement for Qwen | Medium |
| 9 | Kokoro profile UX (#360), partially addressed by auto-switch | Polish | Low |
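For the Blackwell diagnostic (priority 1), the core check could be a comparison between the GPU's compute capability and the architectures the installed CUDA binary was built for. The sketch below is a simplified illustration with hypothetical function names. It assumes RTX 50-series cards report capability 12.0 (`sm_120`) and ignores architecture-specific variants such as `sm_90a`. In a PyTorch-based build the inputs could come from `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()`.

```python
import re
from typing import Iterable, Tuple

# Matches tags such as "sm_86", "sm_120", or "compute_90".
_ARCH_RE = re.compile(r"^(sm|compute)_(\d+)(\d)[a-z]?$")


def parse_arch(tag: str) -> Tuple[str, int, int]:
    """Split an arch tag into (kind, major, minor)."""
    match = _ARCH_RE.match(tag)
    if match is None:
        raise ValueError(f"unrecognized arch tag: {tag}")
    kind, major, minor = match.groups()
    return kind, int(major), int(minor)


def binary_supports_gpu(capability: Tuple[int, int], arch_list: Iterable[str]) -> bool:
    """Simplified compatibility rule: native code within the same major
    version, or PTX (compute_*) that can be JIT-compiled for a newer GPU."""
    gpu_major, gpu_minor = capability
    for tag in arch_list:
        kind, major, minor = parse_arch(tag)
        if kind == "sm" and major == gpu_major and minor <= gpu_minor:
            return True
        if kind == "compute" and (major, minor) <= (gpu_major, gpu_minor):
            return True
    return False


def needs_redownload(capability: Tuple[int, int], arch_list: Iterable[str]) -> bool:
    """True when the installed binary looks stale for this GPU."""
    return not binary_supports_gpu(capability, arch_list)
```

When `needs_redownload` returns true, the app could prompt the user to fetch a newer binary through the existing CUDA download flow.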

Tier 2 — Feature Work

| Priority | Item | Impact | Effort |
|---|---|---|---|
| 1 | Engine tier system (#419): first-class vs experimental, platform gating in ModelConfig | Unblocks CUDA-only engines (VoxCPM, etc.) and frontend polish | Medium |
| 2 | Frontend tech-debt burn-down (#421) + code-split (#422) | Before gating CI on Biome | Medium |
| 3 | #154: Audiobook tab | Long-form users. Chunking + queue shipped. | Medium |
| 4 | UI i18n (#411 PR offer, #392, #261) | Chinese UI + general localization | Medium |
| 5 | #225: Custom HuggingFace models | User-supplied models. Needs rework. | High |
| 6 | OpenAI-compatible API (plan doc exists); see also #448 (API for non-Qwen) | Low effort once API is stable | Low |
| 7 | LoRA fine-tuning (PR #195) | Complex, needs rework for multi-engine | Very High |
| 8 | Streaming for non-MLX engines | Currently MLX-only | Medium |
| 9 | Voice-to-voice / RVC (#407, #347) | New modality, different arch shape | High |

Tier 3 — Future Engines (cross-platform preferred)

| Priority | Item | Notes |
|---|---|---|
| 1 | MOSS-TTS-Nano | 0.1B, Apache 2.0, 4-core CPU realtime, 48 kHz stereo, streaming, 20 langs, released 2026-04-13. Best alignment with our criteria. Verify install ergonomics before committing. |
| 2 | Pocket TTS (Kyutai) | CPU-first 100M model. MIT. Fills streaming gap without CUDA dependency. Several European langs added by Feb 2026. |
| 3 | IndicF5 | Fills Indian-language gap (#339). Closes many language-request issues. |
| 4 | VibeVoice (Microsoft, #172) | 1.5B, long-form multi-speaker (up to 90 min, 4 speakers). Strong Stories-editor fit. |
| 5 | Voxtral TTS (Mistral, #364) | 4B presets+cloning. Frontier quality claim, but 16 GB+ VRAM, so it would need the platform-tier work first. |
| 6 | Fish Speech / Fish Audio S2 | 50+ langs, word-level instruct. License clarification first. (#385) |
| 7 | XTTS-v2 | 17+ langs, mature pip. CPML likely kills commercial use; verify. |
| 8 | index-tts2 (#370) | Unvetted. |
| | VoxCPM2 | Backlogged: CUDA-only upstream. Revisit when tier system ships or MPS bugs are fixed upstream. |

Previously Prioritized — Now Done

  • Kokoro 82M (finish integration): Shipped (PR #325)
  • Qwen CustomVoice: Shipped (PR #328)
  • Intel Arc (XPU) support: Shipped (PR #320)
  • Blackwell CUDA: Shipped (PR #401, follow-up work open)
  • Generation cancellation: Shipped (PR #444)
  • macOS Intel x86_64: Shipped (PR #416)

Branch Inventory

| Branch | PR | Status | Notes |
|---|---|---|---|
| voicebox-new-models | | Active | New model research (Fish Speech, Pocket TTS, VibeVoice, etc.); VoxCPM evaluated & backlogged |
| fix/kokoro-pyinstaller-source-files | | Active | Kokoro frozen-build source bundling (parent of voicebox-new-models) |
| feat/cosyvoice-engine | #311 | Open (closing) | CosyVoice2/3: abandoned, poor quality |
| feat/kokoro | #325 | Merged | Kokoro 82M + voice profile type system |
| feat/qwen-custom-voice | #328 | Merged | Qwen CustomVoice preset engine |
| feat/chatterbox-turbo | #258 | Merged | Chatterbox Turbo + per-engine languages |
| feat/chatterbox | #257 | Merged | Chatterbox Multilingual |
| feat/luxtts | #254 | Merged | LuxTTS + multi-engine arch |

Quick Reference: API Endpoints

<details> <summary>All current endpoints</summary>

| Endpoint | Method | Purpose |
|---|---|---|
| /health | GET | Health check, model/GPU status |
| /profiles | POST, GET | Create/list voice profiles |
| /profiles/{id} | GET, PUT, DELETE | Profile CRUD |
| /profiles/{id}/samples | POST, GET | Add/list voice samples |
| /profiles/{id}/avatar | POST, GET, DELETE | Avatar management |
| /profiles/{id}/export | GET | Export profile as ZIP |
| /profiles/import | POST | Import profile from ZIP |
| /generate | POST | Generate speech (engine param selects TTS backend) |
| /generate/stream | POST | Stream speech (MLX only) |
| /history | GET | List generation history |
| /history/{id} | GET, DELETE | Get/delete generation |
| /history/{id}/export | GET | Export generation ZIP |
| /history/{id}/export-audio | GET | Export audio only |
| /transcribe | POST | Transcribe audio (Whisper) |
| /models/status | GET | All model statuses (Qwen, LuxTTS, Chatterbox, Chatterbox Turbo, TADA, Whisper) |
| /models/download | POST | Trigger model download |
| /models/download/cancel | POST | Cancel/dismiss download |
| /models/{name} | DELETE | Delete downloaded model |
| /models/load | POST | Load model into memory |
| /models/unload | POST | Unload model |
| /models/progress/{name} | GET | SSE download progress |
| /tasks/active | GET | Active downloads/generations (with inline progress) |
| /stories | POST, GET | Create/list stories |
| /stories/{id} | GET, PUT, DELETE | Story CRUD |
| /stories/{id}/items | POST, GET | Story items CRUD |
| /stories/{id}/export | GET | Export story audio |
| /channels | POST, GET | Audio channel CRUD |
| /channels/{id} | PUT, DELETE | Channel update/delete |
| /cache/clear | POST | Clear voice prompt cache |
| /server/cuda/status | GET | CUDA binary availability |
| /server/cuda/download | POST | Download CUDA binary |
| /server/cuda/switch | POST | Switch to CUDA backend |

</details>
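As a rough illustration of calling these endpoints from a script, the sketch below builds URLs against the localhost:17493 port mentioned in the Architecture Overview, fetches /health, and parses server-sent events such as those from /models/progress/{name}. The helper names are hypothetical, the response schemas are not documented here, and the sketch assumes /health returns JSON.

```python
import json
from typing import Iterable, Iterator, List
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://localhost:17493"  # port from the Architecture Overview


def api_url(path: str) -> str:
    """Join the base URL with an endpoint path."""
    return f"{BASE_URL}/{path.lstrip('/')}"


def progress_url(model_name: str) -> str:
    """URL of the SSE progress stream for one model (name is URL-encoded)."""
    return api_url(f"/models/progress/{quote(model_name, safe='')}")


def get_health() -> dict:
    """GET /health; assumes a JSON body, schema not documented here."""
    with urlopen(api_url("/health"), timeout=5) as resp:
        return json.load(resp)


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each server-sent event.

    Events are separated by blank lines; multiple data lines in one
    event are joined with newlines, as in the SSE format.
    """
    buffer: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[5:]
            buffer.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and buffer:
            yield "\n".join(buffer)
            buffer = []
    if buffer:  # flush a trailing event without a closing blank line
        yield "\n".join(buffer)
```

A caller could decode the lines of the response from `progress_url(...)` and pass them to `iter_sse_data`, then interpret each payload however the backend formats it.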