docs/vibevoice-tts.md
VibeVoice-TTS is a long-form, multi-speaker text-to-speech model designed to generate expressive conversational audio such as podcasts from text. It can synthesize speech up to 90 minutes long with up to 4 distinct speakers, surpassing the typical 1β2 speaker limits of many prior models.
Model: VibeVoice-1.5B
Report: Technical Report
<div align="center">| Model | Context Length | Generation Length | Weight |
|---|---|---|---|
| VibeVoice-1.5B | 64K | ~90 min | HF link |
| VibeVoice-Large | 32K | ~45 min | Disabled |
β±οΈ 90-minute Long-form Generation: Synthesizes conversational/single-speaker speech up to 90 minutes in a single pass, maintaining speaker consistency and semantic coherence throughout.
π₯ Multi-speaker Support: Supports up to 4 distinct speakers in a single conversation, with natural turn-taking and speaker consistency across long dialogues.
π Expressive Speech: Generates expressive, natural-sounding speech that captures conversational dynamics and emotional nuances.
π Multi-lingual Support: Supports English, Chinese and other languages.
VibeVoice-TTS employs a next-token diffusion framework that combines:
English
<div align="center">https://github.com/user-attachments/assets/0967027c-141e-4909-bec8-091558b1b784
</div>Chinese
<div align="center">https://github.com/user-attachments/assets/322280b7-3093-4c67-86e3-10be4746c88f
</div>Cross-Lingual
<div align="center">https://github.com/user-attachments/assets/838d8ad9-a201-4dde-bb45-8cd3f59ce722
</div>Spontaneous Singing
<div align="center">https://github.com/user-attachments/assets/6f27a8a5-0c60-4f57-87f3-7dea2e11c730
</div>Long Conversation with 4 people
<div align="center">https://github.com/user-attachments/assets/a357c4b6-9768-495c-a576-1618f6275727
</div>For more examples, see the Project Page.
Disabled due to widespread misuse.
The model achieves state-of-the-art performance on long-form multi-speaker speech generation tasks. For detailed evaluation results, please refer to the paper.
<div align="center"> </div>We observed users may encounter occasional instability when synthesizing Chinese speech. We recommend:
We'd like to thank PsiPi for sharing an interesting way for emotion control. Detials can be found via discussion12.
A: Yes, it's a pretrained model without any post-training or benchmark-specific optimizations. In a way, this makes VibeVoice very versatile and fun to use.
A: As you can see from our demo page, the background music or sounds are spontaneous. This means we can't directly control whether they are generated or not. The model is content-aware, and these sounds are triggered based on the input text and the chosen voice prompt.
Here are a few things we've noticed:
In fact, we intentionally decided not to denoise our training data because we think it's an interesting feature for BGM to show up at just the right moment. You can think of it as a little easter egg we left for you.
A: We don't perform any text normalization during training or inference. Our philosophy is that a large language model should be able to handle complex user inputs on its own. However, due to the nature of the training data, you might still run into some corner cases.
A: Our training data doesn't contain any music data. The ability to sing is an emergent capability of the model (which is why it might sound off-key, even on a famous song like 'See You Again'). (The Large model is more likely to exhibit this than the 1.5B).
A: The volume of Chinese data in our training set is significantly smaller than the English data. Additionally, certain special characters (e.g., Chinese quotation marks) may occasionally cause pronunciation issues.
A: The model does exhibit strong cross-lingual transfer capabilities, including the preservation of accents, but its performance can be unstable. This is an emergent ability of the model that we have not specifically optimized. It's possible that a satisfactory result can be achieved through repeated sampling.
While efforts have been made to optimize it through various techniques, it may still produce outputs that are unexpected, biased, or inaccurate. VibeVoice inherits any biases, errors, or omissions produced by its base model (specifically, Qwen2.5 1.5b in this release). Potential for Deepfakes and Disinformation: High-quality synthetic speech can be misused to create convincing fake audio content for impersonation, fraud, or spreading disinformation. Users must ensure transcripts are reliable, check content accuracy, and avoid using generated content in misleading ways. Users are expected to use the generated content and to deploy the models in a lawful manner, in full compliance with all applicable laws and regulations in the relevant jurisdictions. It is best practice to disclose the use of AI when sharing AI-generated content.
English and Chinese only: Transcripts in languages other than English or Chinese may result in unexpected audio outputs.
Non-Speech Audio: The model focuses solely on speech synthesis and does not handle background noise, music, or other sound effects.
Overlapping Speech: The current model does not explicitly model or generate overlapping speech segments in conversations.
We do not recommend using VibeVoice in commercial or real-world applications without further testing and development. This model is intended for research and development purposes only. Please use responsibly.