docs/nodes/audio.md
- Audio attachments are checked against `maxBytes` before being sent to each model entry.
- On success, transcription replaces the message body with an `[Audio]` block and sets `{{Transcript}}`.
- `CommandBody`/`RawBody` are set to the transcript so slash commands still work.
- With `--verbose`, OpenClaw logs when transcription runs and when it replaces the body.

If you don't configure models and `tools.media.audio.enabled` is not set to `false`,
OpenClaw auto-detects in this order and stops at the first working option:
1. `sherpa-onnx-offline` (requires `SHERPA_ONNX_MODEL_DIR` with encoder/decoder/joiner/tokens)
2. `whisper-cli` (from whisper-cpp; uses `WHISPER_CPP_MODEL` or the bundled tiny model)
3. `whisper` (Python CLI; downloads models automatically)
4. `gemini` (using `read_many_files`)

Configured `models.providers.*` entries that support audio are tried first. To disable auto-detection, set `tools.media.audio.enabled: false`.
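To see which of the CLIs above could be auto-detected on your machine, a quick PATH check helps (a sketch only; the order comes from the list above, and OpenClaw's actual detection logic may differ):

```shell
# List which transcription CLIs are present on PATH, in the documented order
for cli in sherpa-onnx-offline whisper-cli whisper gemini; do
  if command -v "$cli" >/dev/null 2>&1; then
    echo "found: $cli"
  else
    echo "missing: $cli"
  fi
done
```

If none are found, install one of them or configure a provider-based model instead.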
To customize, set tools.media.audio.models.
Note: Binary detection is best-effort across macOS/Linux/Windows; ensure the CLI is on PATH (we expand ~), or set an explicit CLI model with a full command path.
```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        maxBytes: 20971520,
        models: [
          { provider: "openai", model: "gpt-4o-mini-transcribe" },
          {
            type: "cli",
            command: "whisper",
            args: ["--model", "base", "{{MediaPath}}"],
            timeoutSeconds: 45,
          },
        ],
      },
    },
  },
}
```
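The `maxBytes` value in the example above is simply 20 MiB expressed in bytes:

```shell
# 20 MiB in bytes, matching the maxBytes value above
echo $((20 * 1024 * 1024))   # prints 20971520
```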
To restrict transcription by chat type, use `scope` rules (here, transcription is denied in group chats):

```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        scope: {
          default: "allow",
          rules: [{ action: "deny", match: { chatType: "group" } }],
        },
        models: [{ provider: "openai", model: "gpt-4o-mini-transcribe" }],
      },
    },
  },
}
```
Deepgram:

```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [{ provider: "deepgram", model: "nova-3" }],
      },
    },
  },
}
```
Mistral (Voxtral):

```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [{ provider: "mistral", model: "voxtral-mini-latest" }],
      },
    },
  },
}
```
SenseAudio:

```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [{ provider: "senseaudio", model: "senseaudio-asr-pro-1.5-260319" }],
      },
    },
  },
}
```
To echo the transcript back to the originating chat:

```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        echoTranscript: true, // default is false
        echoFormat: '📝 "{transcript}"', // optional, supports {transcript}
        models: [{ provider: "openai", model: "gpt-4o-mini-transcribe" }],
      },
    },
  },
}
```
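A rough sketch of how the `{transcript}` placeholder in `echoFormat` expands; `sed` stands in for OpenClaw's internal formatting, and the transcript text is made up:

```shell
# Substitute {transcript} into the echoFormat template from the example above
fmt='📝 "{transcript}"'
transcript='What time is the meeting?'
printf '%s\n' "$fmt" | sed "s/{transcript}/$transcript/"
# prints: 📝 "What time is the meeting?"
```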
- API keys come from the provider config (`models.providers.*.apiKey`).
- `DEEPGRAM_API_KEY` is used when `provider: "deepgram"` is set.
- `SENSEAUDIO_API_KEY` is used when `provider: "senseaudio"` is set.
- `baseUrl`, `headers`, and `providerOptions` can be set via `tools.media.audio`.
- Size limit: `tools.media.audio.maxBytes`. Oversize audio is skipped for that model and the next entry is tried.
- `maxChars` for audio is unset by default (full transcript). Set `tools.media.audio.maxChars` or per-entry `maxChars` to trim output.
- The default OpenAI model is `gpt-4o-mini-transcribe`; set `model: "gpt-4o-transcribe"` for higher accuracy.
- Use `tools.media.audio.attachments` to process multiple voice notes (`mode: "all"` + `maxAttachments`).
- The transcript is exposed as `{{Transcript}}`.
- `tools.media.audio.echoTranscript` is off by default; enable it to send a transcript confirmation back to the originating chat before agent processing.
- `tools.media.audio.echoFormat` customizes the echo text (placeholder: `{transcript}`).
- CLI `args` should use `{{MediaPath}}` for the local audio file path. Run `openclaw doctor --fix` to migrate deprecated `{input}` placeholders from older `audio.transcription.command` configs.

Provider-based audio transcription honors standard outbound proxy env vars:
`HTTPS_PROXY`, `HTTP_PROXY`, `ALL_PROXY`, `https_proxy`, `http_proxy`, `all_proxy`

If no proxy env vars are set, direct egress is used. If the proxy config is malformed, OpenClaw logs a warning and falls back to direct fetch.
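For example, to route provider transcription traffic through a proxy (the proxy URL below is hypothetical):

```shell
# Export a proxy for outbound HTTPS requests before starting OpenClaw
export HTTPS_PROXY="http://proxy.example.com:8080"
echo "$HTTPS_PROXY"
```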
When `requireMention: true` is set for a group chat, OpenClaw now transcribes audio before checking for mentions. This allows voice notes to be processed even when the mention appears only in the spoken audio.
How it works:

- The voice note is transcribed before mention gating runs.
- The transcript is checked against the group's mention triggers (e.g. @BotName, emoji triggers).

Fallback behavior:
Opt-out per Telegram group/topic:
- Set `channels.telegram.groups.<chatId>.disableAudioPreflight: true` to skip preflight transcript mention checks for that group.
- Set `channels.telegram.groups.<chatId>.topics.<threadId>.disableAudioPreflight` to override per-topic (`true` to skip, `false` to force-enable).
- Default: `false` (preflight enabled when mention-gated conditions match).

Example: A user sends a voice note saying "Hey @Claude, what's the weather?" in a Telegram group with `requireMention: true`. The voice note is transcribed, the mention is detected, and the agent replies.
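A sketch of the opt-out config shape (the chat ID and topic ID below are hypothetical):

```json5
{
  channels: {
    telegram: {
      groups: {
        "-1001234567890": {            // hypothetical group chat ID
          disableAudioPreflight: true, // skip preflight for this group
          topics: {
            "42": {                    // hypothetical topic/thread ID
              disableAudioPreflight: false, // force-enable for this topic
            },
          },
        },
      },
    },
  },
}
```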
Additional notes:

- `chatType` (as used by `scope` rules) is normalized to `direct`, `group`, or `room`.
- CLIs that print JSON can have the transcript extracted in post-processing (e.g. `jq -r .text`).
- For `parakeet-mlx`: if you pass `--output-dir`, OpenClaw reads `<output-dir>/<media-basename>.txt` when `--output-format` is `txt` (or omitted); non-txt output formats fall back to stdout parsing.
- CLI transcription runs with a per-entry timeout (`timeoutSeconds`, default 60s) to avoid blocking the reply queue.
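The `--output-dir` lookup above can be sketched as simple path construction (file names are hypothetical, and this assumes `<media-basename>` means the media file name with its extension stripped):

```shell
# Where OpenClaw would look for parakeet-mlx txt output, per the rule above
media="/tmp/voice-note.ogg"
outdir="/tmp/transcripts"
base="$(basename "${media%.*}")"   # strip extension, then directory
echo "$outdir/$base.txt"           # prints /tmp/transcripts/voice-note.txt
```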