docs/nodes/media-understanding.md
OpenClaw can summarize inbound media (image/audio/video) before the reply pipeline runs. It auto-detects when local tools or provider keys are available, and can be disabled or customized. If understanding is off, models still receive the original files/URLs as usual.
Vendor-specific media behavior is registered by vendor plugins, while OpenClaw core owns the shared `tools.media` config, fallback order, and reply-pipeline integration.
- `Body` becomes an `[Image]`, `[Audio]`, or `[Video]` block.
- Audio sets `{{Transcript}}`; command parsing uses caption text when present, otherwise the transcript.
- Captions are preserved as `User text:` inside the block.
If understanding fails or is disabled, the reply flow continues with the original body + attachments.
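For example, an inbound image with a caption might reach the model roughly like this (only the `[Image]` header and the `User text:` line are documented behaviors; the summary line is illustrative):

```
[Image]
User text: can you read the totals on this receipt?
<model-generated summary of the image>
```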
`tools.media` supports a shared `models` list plus per-capability overrides:
```json5
{
  tools: {
    media: {
      models: [
        /* shared list */
      ],
      image: {
        /* optional overrides */
      },
      audio: {
        /* optional overrides */
        echoTranscript: true,
        echoFormat: '🎙 "{transcript}"',
      },
      video: {
        /* optional overrides */
      },
    },
  },
}
```
Each `models[]` entry can be a provider reference or a CLI command.
CLI templates can also use:
- `{{MediaDir}}` (directory containing the media file)
- `{{OutputDir}}` (scratch dir created for this run)
- `{{OutputBase}}` (scratch file base path, no extension)
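A rough sketch of the two entry shapes, assuming provider entries can be written as provider/model strings and CLI entries as objects with a command template (the field names here are illustrative, not a confirmed schema):

```json5
{
  tools: {
    media: {
      models: [
        // Provider entry (assumed "provider/model" string form; gpt-5.4 is the model shown in the /status example below)
        'openai/gpt-5.4',
        // CLI entry: the command is a template; {{MediaDir}}, {{OutputDir}}, and {{OutputBase}} are the documented placeholders
        {
          command: 'parakeet-mlx ... --output-dir {{OutputDir}}',
        },
      ],
    },
  },
}
```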
Recommended defaults:
- `maxChars`: 500 for image/video (short, command-friendly)
- `maxChars`: unset for audio (full transcript unless you set a limit)
- `maxBytes`:
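A minimal sketch applying these defaults through the per-capability overrides shown above (placement assumed from the `tools.media` example):

```json5
{
  tools: {
    media: {
      image: { maxChars: 500 }, // short, command-friendly summaries
      video: { maxChars: 500 },
      audio: {},                // no maxChars: keep the full transcript
    },
  },
}
```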
If `tools.media.<capability>.enabled` is not set to `false` and you haven't configured `models`, OpenClaw auto-detects in this order and stops at the first working option:
- `sherpa-onnx-offline` (requires `SHERPA_ONNX_MODEL_DIR` with encoder/decoder/joiner/tokens)
- `whisper-cli` (`whisper-cpp`; uses `WHISPER_CPP_MODEL` or the bundled tiny model)
- `whisper` (Python CLI; downloads models automatically)
Bundled fallback order:
- Audio: OpenAI → Groq → xAI → Deepgram → Google → SenseAudio → ElevenLabs → Mistral
- Image: OpenAI → Anthropic → Google → MiniMax → MiniMax Portal → Z.AI
- Video: Google → Qwen → Moonshot
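To prefer a specific order instead of the bundled fallback, you can list models explicitly for that capability. A sketch, assuming provider entries can be written as provider/model strings (the exact entry shape may differ):

```json5
{
  tools: {
    media: {
      audio: {
        // Tried in the order given; the bundled fallback order no longer applies
        models: ['groq/whisper-large-v3', 'openai/whisper-1'],
      },
    },
  },
}
```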
To disable auto-detection, set:
```json5
{
  tools: {
    media: {
      audio: {
        enabled: false,
      },
    },
  },
}
```
When provider-based audio and video media understanding is enabled, OpenClaw honors standard outbound proxy environment variables for provider HTTP calls:
- `HTTPS_PROXY`
- `HTTP_PROXY`
- `ALL_PROXY`
- `https_proxy`
- `http_proxy`
- `all_proxy`

If no proxy env vars are set, media understanding uses direct egress. If the proxy value is malformed, OpenClaw logs a warning and falls back to direct fetch.
If you set `capabilities`, the entry only runs for those media types. For shared lists, OpenClaw can infer defaults:
- `openai`, `anthropic`, `minimax`: image
- `minimax-portal`: image
- `moonshot`: image + video
- `openrouter`: image
- `google` (Gemini API): image + audio + video
- `qwen`: image + video
- `mistral`: audio
- `zai`: image
- `groq`: audio
- `xai`: audio
- `deepgram`: audio
- `models.providers.<id>.models[]` catalog with an image-capable model: image

For CLI entries, set `capabilities` explicitly to avoid surprising matches. If you omit `capabilities`, the entry is eligible for the list it appears in.
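For example, a shared-list CLI entry pinned to audio only (the entry shape is again a sketch; only the `capabilities` setting itself is documented):

```json5
{
  tools: {
    media: {
      models: [
        {
          command: 'parakeet-mlx ... --output-dir {{OutputDir}}',
          // Explicit, so this entry never matches image or video
          capabilities: ['audio'],
        },
      ],
    },
  },
}
```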
| Capability | Provider integration | Notes |
|---|---|---|
| Image | OpenAI, OpenAI Codex OAuth, Codex app-server, OpenRouter, Anthropic, Google, MiniMax, Moonshot, Qwen, Z.AI, config providers | Vendor plugins register image support; openai-codex/* uses OAuth provider plumbing; codex/* uses a bounded Codex app-server turn; MiniMax and MiniMax OAuth both use MiniMax-VL-01; image-capable config providers auto-register. |
| Audio | OpenAI, Groq, xAI, Deepgram, Google, SenseAudio, ElevenLabs, Mistral | Provider transcription (Whisper/Groq/xAI/Deepgram/Gemini/SenseAudio/Scribe/Voxtral). |
| Video | Google, Qwen, Moonshot | Provider video understanding via vendor plugins; Qwen video understanding uses the Standard DashScope endpoints. |
- `minimax` and `minimax-portal` image understanding comes from the plugin-owned MiniMax-VL-01 media provider.
- `models.providers.minimax` entries materialize image-capable M2.7 chat refs.
- Local CLI tools (`whisper-cli`, `whisper`, `gemini`) are useful when provider APIs are unavailable.
- `parakeet-mlx` note: with `--output-dir`, OpenClaw reads `<output-dir>/<media-basename>.txt` when the output format is txt (or unspecified); non-txt formats fall back to stdout.

Per-capability `attachments` controls which attachments are processed:
When `mode: "all"`, outputs are labeled `[Image 1/2]`, `[Audio 2/2]`, etc.
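A sketch of the per-capability `attachments` setting, assuming it accepts a `mode` field (the exact shape may differ in your version):

```json5
{
  tools: {
    media: {
      image: {
        // Assumed shape: process every inbound image and label each output [Image 1/2], [Image 2/2], ...
        attachments: { mode: 'all' },
      },
    },
  },
}
```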
When media understanding runs, `/status` includes a short summary line:

```
🖼 Media: image ok (openai/gpt-5.4) · audio skipped (maxBytes)
```
This shows per-capability outcomes and the chosen provider/model when applicable.
Use `scope` to limit where understanding runs (e.g. only DMs).
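A sketch of limiting understanding to direct messages, assuming `scope` lives under `tools.media` and takes a value like `'dm'` (both the placement and the value are assumptions):

```json5
{
  tools: {
    media: {
      // Hypothetical value; check the accepted scope values for your build
      scope: 'dm',
    },
  },
}
```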