docs/content/features/text-to-audio.md
+++
disableToc = false
title = "Text to Audio (TTS)"
weight = 11
url = "/features/text-to-audio/"
+++
The LocalAI TTS API is compatible with the OpenAI TTS API and the ElevenLabs API.

## Usage

The `/tts` endpoint can be used to generate speech from text.

Input: `input`, `model`
For example, to generate an audio file, you can send a POST request to the `/tts` endpoint with the input text in the request body:
```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "input": "Hello world",
  "model": "tts"
}'
```
Returns an audio/wav file.
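If you prefer a file on disk instead of piping the response to a player, you can let curl write it out directly:

```bash
# Save the generated speech to hello.wav instead of piping it elsewhere
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "input": "Hello world",
  "model": "tts"
}' --output hello.wav
```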
## Streaming

LocalAI supports streaming TTS generation, allowing audio to be played as it's generated. This is useful for real-time applications and reduces latency.

To enable streaming, add `"stream": true` to your request:
```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "input": "Hello world, this is a streaming test",
  "model": "voxcpm",
  "stream": true
}' | aplay
```
The audio will be streamed chunk-by-chunk as it's generated, allowing playback to start before generation completes. This is particularly useful for long texts or when you want to minimize perceived latency.
You can also pipe the streamed audio directly to audio players like aplay (Linux) or save it to a file:
```bash
# Stream to aplay (Linux)
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "input": "This is a longer text that will be streamed as it is generated",
  "model": "voxcpm",
  "stream": true
}' | aplay

# Stream to a file
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "input": "Streaming audio to file",
  "model": "voxcpm",
  "stream": true
}' > output.wav
```
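If `aplay` is not available (for example on macOS), any player that can read a wav stream from stdin works as well. Here is a sketch using ffmpeg's `ffplay`, assuming ffmpeg is installed:

```bash
# Play the streamed audio with ffplay, which reads the wav stream from stdin
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "input": "Hello world, this is a streaming test",
  "model": "voxcpm",
  "stream": true
}' | ffplay -autoexit -nodisp -
```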
Note: Streaming TTS is currently supported by the `voxcpm` backend. Backends that do not support streaming will fall back to non-streaming mode.
## Coqui

Required: Don't use LocalAI images ending with the `-core` tag. Python dependencies are required in order to use this backend.

Coqui works without any configuration; to test it, you can run the following curl command:
```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "backend": "coqui",
  "model": "tts_models/en/ljspeech/glow-tts",
  "input": "Hello, this is a test!"
}'
```
You can use the environment variable `COQUI_LANGUAGE` to set the language used by the coqui backend.
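For example, if you run LocalAI with Docker, the variable can be passed when starting the container (a minimal sketch; adjust the image tag and ports to your setup):

```bash
# Start LocalAI with the Coqui backend language set to English
docker run -p 8080:8080 -e COQUI_LANGUAGE=en localai/localai:latest
```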
You can also use config files to configure tts models (see section below on how to use config files).
## Piper

To install the piper audio models manually, download the `.tar.tgz` archives and extract the `.onnx` and `.json` files inside the `models` directory.
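For example, a manual installation might look like the following sketch (the download URL is a placeholder; use the actual archive for your voice):

```bash
# Download a piper voice archive and extract the .onnx and .json files
# into the LocalAI models directory (the URL below is illustrative)
wget https://example.com/voices/it-riccardo_fasol-x-low.tar.tgz
tar -xzf it-riccardo_fasol-x-low.tar.tgz -C models/
```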
To use the `tts` endpoint, run the following command. You can specify a backend with the `backend` parameter. For example, to use the piper backend:
```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "model": "it-riccardo_fasol-x-low.onnx",
  "backend": "piper",
  "input": "Ciao, sono Ettore"
}' | aplay
```
Note:

- `aplay` is a Linux command. You can use other tools to play the audio file.
- To build LocalAI locally with this backend, you need the `GO_TAGS=tts` flag.
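For instance, a local build with the flag enabled might look like this (a sketch, assuming a source checkout built with make):

```bash
# Compile LocalAI from source with the TTS backends enabled
make GO_TAGS=tts build
```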
## Transformers-musicgen

LocalAI also has experimental support for `transformers-musicgen` for the generation of short musical compositions. Currently, this is implemented via the same requests used for text to speech:
```bash
curl --request POST \
  --url http://localhost:8080/tts \
  --header 'Content-Type: application/json' \
  --data '{
    "backend": "transformers-musicgen",
    "model": "facebook/musicgen-medium",
    "input": "Cello Rave"
  }' | aplay
```
Future versions of LocalAI will expose additional control over audio generation beyond the text prompt.
## ACE-Step

ACE-Step 1.5 is a music generation model that can create music from text descriptions, lyrics, or audio samples. It supports both simple text-to-music and advanced music generation with metadata such as BPM, key scale, and time signature.
Install the `ace-step-turbo` model from the Model gallery or run `local-ai models install ace-step-turbo`.
ACE-Step supports two modes: Simple mode (text description + vocal language) and Advanced mode (caption, lyrics, BPM, key, and more).
Simple mode:
```bash
curl http://localhost:8080/v1/audio/speech -H "Content-Type: application/json" -d '{
  "model": "ace-step-turbo",
  "input": "A soft Bengali love song for a quiet evening",
  "vocal_language": "bn"
}' --output music.flac
```
Advanced mode (using the /sound endpoint):
```bash
curl http://localhost:8080/sound -H "Content-Type: application/json" -d '{
  "model": "ace-step-turbo",
  "caption": "A funky Japanese disco track",
  "lyrics": "[Verse 1]\n...",
  "bpm": 120,
  "keyscale": "Ab major",
  "language": "ja",
  "duration_seconds": 225
}' --output music.flac
```
You can configure ACE-Step models with various options:
```yaml
name: ace-step-turbo
backend: ace-step
parameters:
  model: acestep-v15-turbo
known_usecases:
- sound_generation
- tts
options:
- "device:auto"
- "use_flash_attention:true"
- "init_lm:true" # Enable LLM for enhanced generation
- "lm_model_path:acestep-5Hz-lm-0.6B" # or acestep-5Hz-lm-4B
- "lm_backend:pt" # or vllm
- "temperature:0.85"
- "top_p:0.9"
- "inference_steps:8"
- "guidance_scale:7.0"
```
## VibeVoice

VibeVoice-Realtime is a real-time text-to-speech model that generates natural-sounding speech with voice cloning capabilities.
Install the `vibevoice` model from the Model gallery or run `local-ai models install vibevoice`.
Use the tts endpoint by specifying the vibevoice backend:
```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "model": "vibevoice",
  "input": "Hello!"
}' | aplay
```
VibeVoice supports voice cloning through voice preset files. You can configure a model with a specific voice:
```yaml
name: vibevoice
backend: vibevoice
parameters:
  model: microsoft/VibeVoice-Realtime-0.5B
tts:
  voice: "Frank" # or use audio_path to specify a .pt file path
  # Available English voices: Carter, Davis, Emma, Frank, Grace, Mike
```
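The `audio_path` field mentioned in the comment above can point at a voice preset file instead of a named voice. A minimal sketch, where the config name and file path are illustrative:

```yaml
name: vibevoice-clone
backend: vibevoice
parameters:
  model: microsoft/VibeVoice-Realtime-0.5B
tts:
  audio_path: "voices/my-voice.pt" # illustrative path to a .pt voice preset
```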
Then you can use the model:
```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "model": "vibevoice",
  "input": "Hello!"
}' | aplay
```
## Pocket TTS

Pocket TTS is a lightweight text-to-speech model designed to run efficiently on CPUs. It supports voice cloning through HuggingFace voice URLs or local audio files.
Install the `pocket-tts` model from the Model gallery or run `local-ai models install pocket-tts`.
Use the tts endpoint by specifying the pocket-tts backend:
```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "model": "pocket-tts",
  "input": "Hello world, this is a test."
}' | aplay
```
Pocket TTS supports voice cloning through built-in voice names, HuggingFace URLs, or local audio files. You can configure a model with a specific voice:
```yaml
name: pocket-tts
backend: pocket-tts
tts:
  voice: "azelma" # Built-in voice name
  # Or use a HuggingFace URL: "hf://kyutai/tts-voices/alba-mackenna/casual.wav"
  # Or use a local file path: "path/to/voice.wav"
  # Available built-in voices: alba, marius, javert, jean, fantine, cosette, eponine, azelma
```
You can also pre-load a default voice for faster first generation:
```yaml
name: pocket-tts
backend: pocket-tts
options:
- "default_voice:azelma" # Pre-load this voice when the model loads
```
Then you can use the model:
```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "model": "pocket-tts",
  "input": "Hello world, this is a test."
}' | aplay
```
## Qwen3-TTS

Qwen3-TTS is a high-quality text-to-speech model that supports three modes: custom voice (predefined speakers), voice design (natural language instructions), and voice cloning (from reference audio).
Install the `qwen-tts` model from the Model gallery or run `local-ai models install qwen-tts`.
Use the tts endpoint by specifying the qwen-tts backend:
```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "model": "qwen-tts",
  "input": "Hello world, this is a test."
}' | aplay
```
### Custom Voice

Qwen3-TTS supports predefined speakers. You can specify a speaker using the `voice` parameter:
```yaml
name: qwen-tts
backend: qwen-tts
parameters:
  model: Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
tts:
  voice: "Vivian"
```

Available speakers: Vivian, Serena, Uncle_Fu, Dylan, Eric, Ryan, Aiden, Ono_Anna, Sohee.
### Voice Design

Voice Design allows you to create custom voices using natural language instructions. Configure the model with an `instruct` option:
```yaml
name: qwen-tts-design
backend: qwen-tts
parameters:
  model: Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign
options:
# The instruction below roughly translates to: "a cutesy, childlike girl voice,
# high-pitched with pronounced inflection, creating a clingy, affectedly cute effect"
- "instruct:体现撒娇稚嫩的萝莉女声，音调偏高且起伏明显，营造出黏人、做作和刻意卖萌的听觉效果。"
```
Then use the model:
```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "model": "qwen-tts-design",
  "input": "Hello world, this is a test."
}' | aplay
```
### Voice Clone

Voice Clone allows you to clone a voice from reference audio. Configure the model with an `audio_path` and an optional `ref_text`:
```yaml
name: qwen-tts-clone
backend: qwen-tts
parameters:
  model: Qwen/Qwen3-TTS-12Hz-1.7B-Base
tts:
  audio_path: "path/to/reference_audio.wav" # Reference audio file
options:
- "ref_text:This is the transcript of the reference audio."
- "x_vector_only_mode:false" # Set to true to use only the speaker embedding (ref_text not required)
```
You can also use URLs or base64 strings for the reference audio. The backend automatically detects the mode based on the available parameters (`audio_path` → Voice Clone, `instruct` option → Voice Design, `voice` parameter → Custom Voice).
Then use the model:
```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "model": "qwen-tts-clone",
  "input": "Hello world, this is a test."
}' | aplay
```
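As noted above, the reference audio does not have to be a local file. A sketch of the same configuration pointing `audio_path` at a URL (the config name and URL are illustrative):

```yaml
name: qwen-tts-clone-url
backend: qwen-tts
parameters:
  model: Qwen/Qwen3-TTS-12Hz-1.7B-Base
tts:
  audio_path: "https://example.com/reference_audio.wav" # remote reference audio
options:
- "ref_text:This is the transcript of the reference audio."
```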
### Multi-voice

Qwen3-TTS also supports loading multiple voices for voice cloning, allowing you to select different voices at request time. Configure multiple voices using the `voices` option:
```yaml
name: qwen-tts-multi-voice
backend: qwen-tts
parameters:
  model: Qwen/Qwen3-TTS-12Hz-1.7B-Base
options:
- 'voices:[{"name":"jane","audio":"voices/jane.wav","ref_text":"voices/jane-ref.txt"},{"name":"john","audio":"voices/john.wav","ref_text":"voices/john-ref.txt"}]'
```
The `voices` option accepts a JSON array where each voice entry must have:

- `name`: the voice identifier (used in API requests)
- `audio`: path to the reference audio file (relative to the model directory, or absolute)
- `ref_text`: path to the reference text file paired with that audio

Then use the model with voice selection:
```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "model": "qwen-tts-multi-voice",
  "input": "Hello world, this is Jane speaking.",
  "voice": "jane"
}' | aplay

# Switch to a different voice
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "model": "qwen-tts-multi-voice",
  "input": "Hello world, this is John speaking.",
  "voice": "john"
}' | aplay
```
Voice selection priority:

1. The `voice` parameter in the API request (highest priority)
2. The `voice` option in the model configuration

Error handling: if you request a voice that doesn't exist in the `voices` list, the API returns an error with the list of available voices:
{"error": "Voice 'unknown' not found. Available voices: jane, john"}
Backward compatibility: the multi-voice mode is backward compatible with existing single-voice configurations. Models using `audio_path` in the `tts` section will continue to work as before.
## Using config files

You can also use a config file to specify TTS models and their parameters.

In the following example we define a custom config that loads the xtts_v2 model and specifies a voice and language:
```yaml
name: xtts_v2
backend: coqui
parameters:
  language: fr
  model: tts_models/multilingual/multi-dataset/xtts_v2
tts:
  voice: Ana Florence
```
With this config, you can now use the following curl command to generate a text-to-speech audio file:
```bash
curl -L http://localhost:8080/tts \
  -H "Content-Type: application/json" \
  -d '{
    "model": "xtts_v2",
    "input": "Bonjour, je suis Ana Florence. Comment puis-je vous aider?"
  }' | aplay
```
## Response format

To provide compatibility with the OpenAI API's `response_format` parameter, ffmpeg must be installed (or a Docker image that includes ffmpeg must be used) so that the generated wav file can be converted before the API returns its response.

Warning: this is a change in behaviour. Previously, the parameter was ignored and a wav file was always returned, which could lead to codec errors later in the integration (such as trying to decode as mp3, the default format used by OpenAI, a file that is actually wav).

The formats supported thanks to ffmpeg are wav, mp3, aac, flac and opus, defaulting to wav if an unknown format, or no format, is provided.
```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "input": "Hello world",
  "model": "tts",
  "response_format": "mp3"
}'
```
If a `response_format` other than wav is requested and ffmpeg is not available, the call will fail.
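If ffmpeg is missing from your environment, you can install it yourself; for example, on a Debian-based image or host (a sketch, adjust to your package manager):

```bash
# Install ffmpeg so non-wav response formats can be produced
apt-get update && apt-get install -y ffmpeg
```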