OpenClaw can convert outbound replies into audio across 15 speech providers, delivering native voice messages on Feishu, Matrix, Telegram, and WhatsApp, audio attachments everywhere else, and PCM/Ulaw streams for telephony and Talk.
```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "elevenlabs",
    },
  },
}
```
| Provider | Auth | Notes |
|---|---|---|
| Azure Speech | AZURE_SPEECH_KEY + AZURE_SPEECH_REGION (also AZURE_SPEECH_API_KEY, SPEECH_KEY, SPEECH_REGION) | Native Ogg/Opus voice-note output and telephony. |
| DeepInfra | DEEPINFRA_API_KEY | OpenAI-compatible TTS. Defaults to hexgrad/Kokoro-82M. |
| ElevenLabs | ELEVENLABS_API_KEY or XI_API_KEY | Voice cloning, multilingual, deterministic via seed. |
| Google Gemini | GEMINI_API_KEY or GOOGLE_API_KEY | Gemini API TTS; persona-aware via promptTemplate: "audio-profile-v1". |
| Gradium | GRADIUM_API_KEY | Voice-note and telephony output. |
| Inworld | INWORLD_API_KEY | Streaming TTS API. Native Opus voice-note and PCM telephony. |
| Local CLI | none | Runs a configured local TTS command. |
| Microsoft | none | Public Edge neural TTS via node-edge-tts. Best-effort, no SLA. |
| MiniMax | MINIMAX_API_KEY (or Token Plan: MINIMAX_OAUTH_TOKEN, MINIMAX_CODE_PLAN_KEY, MINIMAX_CODING_API_KEY) | T2A v2 API. Defaults to speech-2.8-hd. |
| OpenAI | OPENAI_API_KEY | Also used for auto-summary; supports persona instructions. |
| OpenRouter | OPENROUTER_API_KEY (can reuse models.providers.openrouter.apiKey) | Default model hexgrad/kokoro-82m. |
| Volcengine | VOLCENGINE_TTS_API_KEY or BYTEPLUS_SEED_SPEECH_API_KEY (legacy AppID/token: VOLCENGINE_TTS_APPID/_TOKEN) | BytePlus Seed Speech HTTP API. |
| Vydra | VYDRA_API_KEY | Shared image, video, and speech provider. |
| xAI | XAI_API_KEY | xAI batch TTS. Native Opus voice-note is not supported. |
| Xiaomi MiMo | XIAOMI_API_KEY | MiMo TTS through Xiaomi chat completions. |
If multiple providers are configured, the selected one is used first and the
others are fallback options. Auto-summary uses summaryModel (or
agents.defaults.model.primary), so that provider must also be authenticated
if you keep summaries enabled.
TTS config lives under messages.tts in ~/.openclaw/openclaw.json. Pick a
preset and adapt the provider block.
Use agents.list[].tts when one agent should speak with a different provider,
voice, model, persona, or auto-TTS mode. The agent block deep-merges over
messages.tts, so provider credentials can stay in the global provider config:
```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "elevenlabs",
      providers: {
        elevenlabs: { apiKey: "${ELEVENLABS_API_KEY}", model: "eleven_multilingual_v2" },
      },
    },
  },
  agents: {
    list: [
      {
        id: "reader",
        tts: {
          providers: {
            elevenlabs: { voiceId: "EXAVITQu4vr4xnSDxMaL" },
          },
        },
      },
    ],
  },
}
```
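The agent-over-global deep-merge above can be sketched as follows. This is an illustrative helper, not OpenClaw's internal implementation; the config literals mirror the example.

```typescript
// Minimal sketch of deep-merging an agent TTS block over the global one.
// Later layers win; nested objects merge key by key instead of replacing.
type Json = { [key: string]: unknown };

function isPlainObject(value: unknown): value is Json {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function deepMerge(base: Json, override: Json): Json {
  const out: Json = { ...base };
  for (const [key, value] of Object.entries(override)) {
    out[key] = isPlainObject(out[key]) && isPlainObject(value)
      ? deepMerge(out[key] as Json, value)
      : value;
  }
  return out;
}

// Global config keeps the credential; the agent block only adds a voice.
const globalTts: Json = {
  provider: "elevenlabs",
  providers: { elevenlabs: { apiKey: "${ELEVENLABS_API_KEY}", model: "eleven_multilingual_v2" } },
};
const agentTts: Json = { providers: { elevenlabs: { voiceId: "EXAVITQu4vr4xnSDxMaL" } } };
const effective = deepMerge(globalTts, agentTts);
```

The key property is that the agent block never needs to repeat the apiKey: it survives the merge from the global layer.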
To pin a per-agent persona, set agents.list[].tts.persona alongside provider
config — it overrides the global messages.tts.persona for that agent only.
Precedence order for automatic replies, /tts audio, /tts status, and the
tts agent tool:

1. messages.tts
2. agents.list[].tts
3. channels.<channel>.tts
4. channels.<channel>.accounts.<id>.tts
5. /tts preferences for this host
6. [[tts:...]] directives when model overrides are enabled

Channel and account overrides use the same shape as messages.tts and
deep-merge over the earlier layers, so shared provider credentials can stay in
messages.tts while a channel or bot account changes only voice, model, persona,
or auto mode:
```json5
{
  messages: {
    tts: {
      provider: "openai",
      providers: {
        openai: { apiKey: "${OPENAI_API_KEY}", model: "gpt-4o-mini-tts" },
      },
    },
  },
  channels: {
    feishu: {
      accounts: {
        english: {
          tts: {
            providers: {
              openai: { voice: "shimmer" },
            },
          },
        },
      },
    },
  },
}
```
A persona is a stable spoken identity that can be applied deterministically across providers. It can prefer one provider, define provider-neutral prompt intent, and carry provider-specific bindings for voices, models, prompt templates, seeds, and voice settings.
```json5
{
  messages: {
    tts: {
      auto: "always",
      persona: "narrator",
      personas: {
        narrator: {
          label: "Narrator",
          provider: "elevenlabs",
          providers: {
            elevenlabs: { voiceId: "EXAVITQu4vr4xnSDxMaL", modelId: "eleven_multilingual_v2" },
          },
        },
      },
    },
  },
}
```
```json5
{
  messages: {
    tts: {
      auto: "always",
      persona: "alfred",
      personas: {
        alfred: {
          label: "Alfred",
          description: "Dry, warm British butler narrator.",
          provider: "google",
          fallbackPolicy: "preserve-persona",
          prompt: {
            profile: "A brilliant British butler. Dry, witty, warm, charming, emotionally expressive, never generic.",
            scene: "A quiet late-night study. Close-mic narration for a trusted operator.",
            sampleContext: "The speaker is answering a private technical request with concise confidence and dry warmth.",
            style: "Refined, understated, lightly amused.",
            accent: "British English.",
            pacing: "Measured, with short dramatic pauses.",
            constraints: ["Do not read configuration values aloud.", "Do not explain the persona."],
          },
          providers: {
            google: {
              model: "gemini-3.1-flash-tts-preview",
              voiceName: "Algieba",
              promptTemplate: "audio-profile-v1",
            },
            openai: { model: "gpt-4o-mini-tts", voice: "cedar" },
            elevenlabs: {
              voiceId: "voice_id",
              modelId: "eleven_multilingual_v2",
              seed: 42,
              voiceSettings: {
                stability: 0.65,
                similarityBoost: 0.8,
                style: 0.25,
                useSpeakerBoost: true,
                speed: 0.95,
              },
            },
          },
        },
      },
    },
  },
}
```
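How a provider might fold the provider-neutral prompt fields into a single instruction string can be sketched as follows. renderPersonaInstructions is a hypothetical helper for illustration; each real provider applies these fields in its own way.

```typescript
// Hypothetical rendering of provider-neutral persona prompt fields into one
// instruction string (e.g. for a provider that accepts free-text speaking
// instructions). Field names mirror messages.tts.personas.*.prompt.
interface PersonaPrompt {
  profile?: string;
  scene?: string;
  sampleContext?: string;
  style?: string;
  accent?: string;
  pacing?: string;
  constraints?: string[];
}

function renderPersonaInstructions(p: PersonaPrompt): string {
  const parts = [
    p.profile,
    p.scene && `Scene: ${p.scene}`,
    p.sampleContext && `Context: ${p.sampleContext}`,
    p.style && `Style: ${p.style}`,
    p.accent && `Accent: ${p.accent}`,
    p.pacing && `Pacing: ${p.pacing}`,
    ...(p.constraints ?? []),
  ];
  // Drop unset fields and join into a single instruction line.
  return parts.filter(Boolean).join(" ");
}

const instructions = renderPersonaInstructions({
  profile: "A brilliant British butler.",
  accent: "British English.",
  constraints: ["Do not explain the persona."],
});
```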
The active persona is selected deterministically:

1. /tts persona <id> local preference, if set.
2. messages.tts.persona, if set.

Provider selection runs explicit-first:

1. /tts provider <id> local preference.
2. The persona's provider.
3. messages.tts.provider.

For each provider attempt, OpenClaw merges configs in this order:

1. messages.tts.providers.<id>
2. messages.tts.personas.<persona>.providers.<id>

Persona prompt fields (profile, scene, sampleContext, style, accent,
pacing, constraints) are provider-neutral; each provider decides how
to use them.

fallbackPolicy controls behavior when a persona has no binding for the
attempted provider:
| Policy | Behavior |
|---|---|
| preserve-persona | Default. Provider-neutral prompt fields stay available; the provider may use them or ignore them. |
| provider-defaults | Persona is omitted from prompt preparation for that attempt; the provider uses its neutral defaults while fallback to other providers continues. |
| fail | Skip that provider attempt with reasonCode: "not_configured" and personaBinding: "missing". Fallback providers are still tried. |
The whole TTS request only fails when every attempted provider is skipped or fails.
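The fallback semantics can be sketched as a loop over candidate providers. Everything here is illustrative: the attempt record shape, the synth callback, and the loop itself stand in for OpenClaw's real attempt pipeline, though the reason code and policy names come from the docs.

```typescript
// Sketch of persona-aware provider fallback (illustrative, not OpenClaw code).
type FallbackPolicy = "preserve-persona" | "provider-defaults" | "fail";

interface Attempt {
  provider: string;
  outcome: "ok" | "skipped" | "failed";
  reasonCode?: string;
}

function attemptProviders(
  providers: string[],
  personaBindings: Set<string>, // providers the persona has a binding for
  policy: FallbackPolicy,
  synth: (provider: string, usePersona: boolean) => boolean, // true on success
): { used?: string; attempts: Attempt[] } {
  const attempts: Attempt[] = [];
  for (const provider of providers) {
    const bound = personaBindings.has(provider);
    if (!bound && policy === "fail") {
      // Skip this attempt but keep falling back to later providers.
      attempts.push({ provider, outcome: "skipped", reasonCode: "not_configured" });
      continue;
    }
    // preserve-persona keeps neutral prompt fields available even without a
    // binding; provider-defaults drops the persona for this attempt.
    const usePersona = bound || policy === "preserve-persona";
    if (synth(provider, usePersona)) {
      attempts.push({ provider, outcome: "ok" });
      return { used: provider, attempts };
    }
    attempts.push({ provider, outcome: "failed" });
  }
  // The whole request fails only when every attempt was skipped or failed.
  return { attempts };
}
```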
By default, the assistant can emit [[tts:...]] directives to override
voice, model, or speed for a single reply, plus an optional
[[tts:text]]...[[/tts:text]] block for expressive cues that should appear in
audio only:
```
Here you go.
[[tts:voiceId=pMsXgVXv3BLzUgSXRplE model=eleven_v3 speed=1.1]]
[[tts:text]](laughs) Read the song once more.[[/tts:text]]
```
When messages.tts.auto is "tagged", directives are required to trigger
audio. Streaming block delivery strips directives from visible text before the
channel sees them, even when split across adjacent blocks.
provider=... is ignored unless modelOverrides.allowProvider: true. When a
reply declares provider=..., the other keys in that directive are parsed
only by that provider; unsupported keys are stripped and reported as TTS
directive warnings.
Available directive keys:

- provider (registered provider id; requires allowProvider: true)
- voice / voiceName / voice_name / google_voice / voiceId
- model / google_model
- stability, similarityBoost, style, speed, useSpeakerBoost
- vol / volume (MiniMax volume, 0–10)
- pitch (MiniMax integer pitch, −12 to 12; fractional values are truncated)
- emotion (Volcengine emotion tag)
- applyTextNormalization (auto|on|off)
- languageCode (ISO 639-1)
- seed

Disable model overrides entirely:
```json5
{ messages: { tts: { modelOverrides: { enabled: false } } } }
```
Allow provider switching while keeping other knobs configurable:
```json5
{ messages: { tts: { modelOverrides: { enabled: true, allowProvider: true, allowSeed: false } } } }
```
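A minimal sketch of directive extraction is shown below. The regex and function are illustrative only; OpenClaw's real parser additionally handles directives split across adjacent streamed blocks.

```typescript
// Illustrative parser for [[tts:key=value ...]] directives and the
// audio-only [[tts:text]]...[[/tts:text]] block. Not OpenClaw's parser.
interface ParsedReply {
  visibleText: string;               // directive-free text for the channel
  overrides: Record<string, string>; // e.g. { voiceId: "...", speed: "1.1" }
  audioOnlyText?: string;            // expressive cues spoken but never shown
}

function parseTtsDirectives(reply: string): ParsedReply {
  const overrides: Record<string, string> = {};
  let audioOnlyText: string | undefined;

  // Pull out audio-only text blocks first so the generic pattern below
  // never touches their contents.
  let text = reply.replace(/\[\[tts:text\]\]([\s\S]*?)\[\[\/tts:text\]\]/g, (_, body) => {
    audioOnlyText = (audioOnlyText ?? "") + body;
    return "";
  });
  // Then strip key=value directives, collecting the overrides.
  text = text.replace(/\[\[tts:([^\]]+)\]\]/g, (_, body: string) => {
    for (const pair of body.trim().split(/\s+/)) {
      const eq = pair.indexOf("=");
      if (eq > 0) overrides[pair.slice(0, eq)] = pair.slice(eq + 1);
    }
    return "";
  });
  return { visibleText: text.trim(), overrides, audioOnlyText };
}
```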
Single command /tts. On Discord, OpenClaw also registers /voice because
/tts is a built-in Discord command — text /tts ... still works.
```
/tts off | on | status
/tts chat on | off | default
/tts latest
/tts provider <id>
/tts persona <id> | off
/tts limit <chars>
/tts summary off
/tts audio <text>
```
Behavior notes:
- /tts on writes the local TTS preference to always; /tts off writes it to off.
- /tts chat on|off|default writes a session-scoped auto-TTS override for the current chat.
- /tts persona <id> writes the local persona preference; /tts persona off clears it.
- /tts latest reads the latest assistant reply from the current session transcript and sends it as audio once. It stores only a hash of that reply on the session entry to suppress duplicate voice sends.
- /tts audio generates a one-off audio reply (does not toggle TTS on).
- limit and summary are stored in local prefs, not the main config.
- /tts status includes fallback diagnostics for the latest attempt: Fallback: <primary> -> <used>, Attempts: ..., and per-attempt detail (provider:outcome(reasonCode) latency).
- /status shows the active TTS mode plus configured provider, model, voice, and sanitized custom endpoint metadata when TTS is enabled.

Slash commands write local overrides to prefsPath. The default is
~/.openclaw/settings/tts.json; override with the OPENCLAW_TTS_PREFS env var
or messages.tts.prefsPath.
| Stored field | Effect |
|---|---|
| auto | Local auto-TTS override (always, off, …) |
| provider | Local primary provider override |
| persona | Local persona override |
| maxLength | Summary threshold (default 1500 chars) |
| summarize | Summary toggle (default true) |
These override the effective config from messages.tts plus the active
agents.list[].tts block for that host.
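Loading these prefs can be sketched as below. The path names and field names come from the docs; the loader itself, and the exact precedence between the env var and messages.tts.prefsPath, are assumptions for illustration.

```typescript
// Sketch of local TTS prefs resolution (illustrative, not OpenClaw code).
import { existsSync, readFileSync } from "node:fs";
import { homedir } from "node:os";
import { join } from "node:path";

interface TtsPrefs {
  auto?: string;       // local auto-TTS override ("always", "off", ...)
  provider?: string;   // local primary provider override
  persona?: string;    // local persona override
  maxLength?: number;  // summary threshold (default 1500)
  summarize?: boolean; // summary toggle (default true)
}

// One plausible precedence: env var, then messages.tts.prefsPath,
// then the documented default location.
function resolvePrefsPath(configPrefsPath?: string): string {
  return (
    process.env.OPENCLAW_TTS_PREFS ??
    configPrefsPath ??
    join(homedir(), ".openclaw", "settings", "tts.json")
  );
}

function loadPrefs(path: string): TtsPrefs {
  if (!existsSync(path)) return {}; // no local overrides stored yet
  const stored = JSON.parse(readFileSync(path, "utf8")) as TtsPrefs;
  // Apply documented defaults beneath whatever the file stores.
  return { maxLength: 1500, summarize: true, ...stored };
}
```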
TTS voice delivery is channel-capability driven. Channel plugins advertise
whether voice-style TTS should ask providers for a native voice-note target or
keep normal audio-file synthesis and only mark compatible output for voice
delivery.
When messages.tts.auto is enabled, OpenClaw:

- skips audio when the reply already contains media or a MEDIA: directive, or is too short;
- summarizes replies longer than maxLength with summaryModel (or agents.defaults.model.primary) before synthesis;
- with mode: "final", still sends audio-only TTS for streamed final replies after the text stream completes; the generated media goes through the same channel media normalization as normal reply attachments;
- skips audio and sends the normal text reply if the reply exceeds maxLength and summary is off (or no API key is available for the summary model).
```
Reply -> TTS enabled?
  no  -> send text
  yes -> has media / MEDIA: / short?
    yes -> send text
    no  -> length > limit?
      no  -> TTS -> attach audio
      yes -> summary enabled?
        no  -> send text
        yes -> summarize -> TTS -> attach audio
```
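The decision tree can be encoded as a small function. This is a sketch: the Reply and config shapes, and the minLength "short" threshold, are assumptions for illustration; only maxLength and the summarize toggle are documented knobs.

```typescript
// Illustrative encoding of the auto-TTS decision tree (not OpenClaw code).
interface Reply { text: string; hasMedia: boolean }
interface AutoTtsConfig {
  enabled: boolean;
  maxLength: number;   // summary threshold, e.g. 1500
  summarize: boolean;  // summary toggle
  minLength?: number;  // "too short" threshold; assumed knob for this sketch
}

type Decision = "text" | "audio" | "summarize-then-audio";

function decide(reply: Reply, cfg: AutoTtsConfig): Decision {
  if (!cfg.enabled) return "text";
  // Replies that already carry media, or are too short, stay text-only.
  if (reply.hasMedia || reply.text.length < (cfg.minLength ?? 1)) return "text";
  if (reply.text.length <= cfg.maxLength) return "audio";
  // Over the limit: summarize first if enabled, otherwise fall back to text.
  return cfg.summarize ? "summarize-then-audio" : "text";
}
```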
| Target | Format |
|---|---|
| Feishu / Matrix / Telegram / WhatsApp | Voice-note replies prefer Opus (opus_48000_64 from ElevenLabs, opus from OpenAI). 48 kHz / 64 kbps balances clarity and size. |
| Other channels | MP3 (mp3_44100_128 from ElevenLabs, mp3 from OpenAI). 44.1 kHz / 128 kbps default for speech. |
| Talk / telephony | Provider-native PCM (Inworld 22050 Hz, Google 24 kHz), or ulaw_8000 from Gradium for telephony. |
Per-provider notes:

- Feishu and WhatsApp transcode non-Opus output to Ogg/Opus with ffmpeg. WhatsApp sends through Baileys with ptt: true and audio/ogg; codecs=opus. If conversion fails, Feishu falls back to attaching the original file; the WhatsApp send fails rather than posting an incompatible PTT payload.
- MiniMax: MP3 output (speech-2.8-hd, 32 kHz); transcoded to 48 kHz Opus for voice-note targets via ffmpeg.
- Azure Speech: configurable outputFormat. Voice-note targets are converted to Ogg/Opus and telephony output to raw 16 kHz mono PCM with ffmpeg.
- Inworld: OGG_OPUS for voice-note targets, raw PCM at 22050 Hz for Talk/telephony.
- Gradium: ulaw_8000 at 8 kHz for telephony.
- xAI: responseFormat may be mp3|wav|pcm|mulaw|alaw. Uses xAI's batch REST endpoint; streaming WebSocket TTS is not used. Native Opus voice-note format is not supported.
- Microsoft: microsoft.outputFormat (default audio-24khz-48kbitrate-mono-mp3). Telegram sendVoice accepts OGG/MP3/M4A; use OpenAI/ElevenLabs if you need guaranteed Opus voice messages. If the configured Microsoft format fails, OpenClaw retries with MP3.
- OpenAI and ElevenLabs output formats are fixed per channel as listed above.
Inworld provider options:

<ParamField path="apiKey" type="string">Env: `INWORLD_API_KEY`.</ParamField>
<ParamField path="baseUrl" type="string">Default `https://api.inworld.ai`.</ParamField>
<ParamField path="modelId" type="string">Default `inworld-tts-1.5-max`. Also: `inworld-tts-1.5-mini`, `inworld-tts-1-max`, `inworld-tts-1`.</ParamField>
<ParamField path="voiceId" type="string">Default `Sarah`.</ParamField>
<ParamField path="temperature" type="number">Sampling temperature `0..2`.</ParamField>
The tts tool converts text to speech and returns an audio attachment for
reply delivery. On Feishu, Matrix, Telegram, and WhatsApp, the audio is
delivered as a voice message rather than a file attachment. Feishu and
WhatsApp can transcode non-Opus TTS output on this path when ffmpeg is
available.
WhatsApp sends audio through Baileys as a PTT voice note (audio with
ptt: true) and sends visible text separately from PTT audio because
clients do not consistently render captions on voice notes.
The tool accepts optional channel and timeoutMs fields; timeoutMs is a
per-call provider request timeout in milliseconds.
| Method | Purpose |
|---|---|
| tts.status | Read current TTS state and last attempt. |
| tts.enable | Set local auto preference to always. |
| tts.disable | Set local auto preference to off. |
| tts.convert | One-off text → audio. |
| tts.setProvider | Set local provider preference. |
| tts.setPersona | Set local persona preference. |
| tts.providers | List configured providers and status. |
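Calling these methods over a gateway can be sketched with a request builder. The JSON-RPC 2.0 envelope and the timeoutMs parameter placement are assumptions for illustration; only the method names come from the table above.

```typescript
// Illustrative request builder for the gateway TTS methods.
// The JSON-RPC 2.0 envelope is an assumption; only the method names
// (tts.status, tts.convert, ...) come from the docs.
type TtsMethod =
  | "tts.status" | "tts.enable" | "tts.disable" | "tts.convert"
  | "tts.setProvider" | "tts.setPersona" | "tts.providers";

let nextId = 1;

function buildTtsRequest(method: TtsMethod, params?: Record<string, unknown>) {
  return { jsonrpc: "2.0" as const, id: nextId++, method, params: params ?? {} };
}

// One-off conversion request with an assumed per-call timeout parameter.
const req = buildTtsRequest("tts.convert", { text: "Hello!", timeoutMs: 15000 });
```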