docs/tools/video-generation.md
OpenClaw agents can generate videos from text prompts, reference images, or existing videos. Sixteen provider backends are supported, each with different model options, input modes, and feature sets. The agent picks the right provider automatically based on your configuration and available API keys.
<Note> The `video_generate` tool only appears when at least one video-generation provider is available. If you do not see it in your agent tools, set a provider API key or configure `agents.defaults.videoGenerationModel`. </Note>

OpenClaw treats video generation as three runtime modes:

- `generate`: text-to-video requests with no reference media.
- `imageToVideo`: the request includes one or more reference images.
- `videoToVideo`: the request includes one or more reference videos.

Providers can support any subset of those modes. The tool validates the
active mode before submission and reports supported modes in `action=list`.
```bash
export GEMINI_API_KEY="your-key"
```
The agent calls `video_generate` automatically. No tool allowlisting
is needed.
Video generation is asynchronous. When the agent calls video_generate in a
session:
While a job is in flight, duplicate video_generate calls in the same
session return the current task status instead of starting another
generation. Use openclaw tasks list or openclaw tasks show <taskId> to
check progress from the CLI.
Outside of session-backed agent runs (for example, direct tool invocations), the tool falls back to inline generation and returns the final media path in the same turn.
Generated video files are saved under OpenClaw-managed media storage when
the provider returns bytes. The default generated-video save cap follows
the video media limit, and agents.defaults.mediaMaxMb raises it for
larger renders. When a provider also returns a hosted output URL, OpenClaw
can deliver that URL instead of failing the task if local persistence
rejects an oversized file.
| State | Meaning |
|---|---|
| `queued` | Task created, waiting for the provider to accept it. |
| `running` | Provider is processing (typically 30 seconds to 5 minutes depending on provider and resolution). |
| `succeeded` | Video ready; the agent wakes and posts it to the conversation. |
| `failed` | Provider error or timeout; the agent wakes with error details. |

Check status from the CLI:
```bash
openclaw tasks list
openclaw tasks show <taskId>
openclaw tasks cancel <taskId>
```

If a video task is already queued or running for the current session,
video_generate returns the existing task status instead of starting a new
one. Use action: "status" to check explicitly without triggering a new
generation.
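
The duplicate-call behavior described above can be pictured as a per-session guard. The following is a minimal sketch, not OpenClaw's implementation: `TaskRegistry`, `startOrGetTask`, `update`, and the task ID format are hypothetical names used for illustration. Only the four state names come from the table above.

```typescript
// Hypothetical sketch of a per-session in-flight guard; not OpenClaw's actual code.
type TaskState = "queued" | "running" | "succeeded" | "failed";

interface VideoTask {
  taskId: string;
  sessionId: string;
  state: TaskState;
}

class TaskRegistry {
  private bySession = new Map<string, VideoTask>();
  private nextId = 1;

  // Returns the existing task when one is queued or running for this session;
  // otherwise creates a new queued task.
  startOrGetTask(sessionId: string): { task: VideoTask; created: boolean } {
    const existing = this.bySession.get(sessionId);
    if (existing && (existing.state === "queued" || existing.state === "running")) {
      return { task: existing, created: false };
    }
    const task: VideoTask = {
      taskId: `task-${this.nextId++}`, // assumed ID format, for illustration only
      sessionId,
      state: "queued",
    };
    this.bySession.set(sessionId, task);
    return { task, created: true };
  }

  // Records a state transition for the session's current task.
  update(sessionId: string, state: TaskState): void {
    const task = this.bySession.get(sessionId);
    if (task) task.state = state;
  }
}
```

In this sketch a second call for the same session returns the first task until that task reaches `succeeded` or `failed`, after which a new task can start.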
| Provider | Default model | Text | Image ref | Video ref | Auth |
|---|---|---|---|---|---|
| Alibaba | wan2.6-t2v | ✓ | Yes (remote URL) | Yes (remote URL) | MODELSTUDIO_API_KEY |
| BytePlus (1.0) | seedance-1-0-pro-250528 | ✓ | Up to 2 images (I2V models only; first + last frame) | — | BYTEPLUS_API_KEY |
| BytePlus Seedance 1.5 | seedance-1-5-pro-251215 | ✓ | Up to 2 images (first + last frame via role) | — | BYTEPLUS_API_KEY |
| BytePlus Seedance 2.0 | dreamina-seedance-2-0-260128 | ✓ | Up to 9 reference images | Up to 3 videos | BYTEPLUS_API_KEY |
| ComfyUI | workflow | ✓ | 1 image | — | COMFY_API_KEY or COMFY_CLOUD_API_KEY |
| DeepInfra | Pixverse/Pixverse-T2V | ✓ | — | — | DEEPINFRA_API_KEY |
| fal | fal-ai/minimax/video-01-live | ✓ | 1 image; up to 9 with Seedance reference-to-video | Up to 3 videos with Seedance reference-to-video | FAL_KEY |
| Google | veo-3.1-fast-generate-preview | ✓ | 1 image | 1 video | GEMINI_API_KEY |
| MiniMax | MiniMax-Hailuo-2.3 | ✓ | 1 image | — | MINIMAX_API_KEY or MiniMax OAuth |
| OpenAI | sora-2 | ✓ | 1 image | 1 video | OPENAI_API_KEY |
| OpenRouter | google/veo-3.1-fast | ✓ | Up to 4 images (first/last frame or references) | — | OPENROUTER_API_KEY |
| Qwen | wan2.6-t2v | ✓ | Yes (remote URL) | Yes (remote URL) | QWEN_API_KEY |
| Runway | gen4.5 | ✓ | 1 image | 1 video | RUNWAYML_API_SECRET |
| Together | Wan-AI/Wan2.2-T2V-A14B | ✓ | 1 image | — | TOGETHER_API_KEY |
| Vydra | veo3 | ✓ | 1 image (kling) | — | VYDRA_API_KEY |
| xAI | grok-imagine-video | ✓ | 1 first-frame image or up to 7 reference_images | 1 video | XAI_API_KEY |
Some providers accept additional or alternate API key env vars. See individual provider pages for details.
Run video_generate action=list to inspect available providers, models, and
runtime modes at runtime.
The explicit mode contract used by video_generate, contract tests, and
the shared live sweep:
| Provider | generate | imageToVideo | videoToVideo | Shared live lanes today |
|---|---|---|---|---|
| Alibaba | ✓ | ✓ | ✓ | generate, imageToVideo; videoToVideo skipped because this provider needs remote http(s) video URLs |
| BytePlus | ✓ | ✓ | — | generate, imageToVideo |
| ComfyUI | ✓ | ✓ | — | Not in the shared sweep; workflow-specific coverage lives with Comfy tests |
| DeepInfra | ✓ | — | — | generate; native DeepInfra video schemas are text-to-video in the bundled contract |
| fal | ✓ | ✓ | ✓ | generate, imageToVideo; videoToVideo only when using Seedance reference-to-video |
| Google | ✓ | ✓ | ✓ | generate, imageToVideo; shared videoToVideo skipped because the current buffer-backed Gemini/Veo sweep does not accept that input |
| MiniMax | ✓ | ✓ | — | generate, imageToVideo |
| OpenAI | ✓ | ✓ | ✓ | generate, imageToVideo; shared videoToVideo skipped because this org/input path currently needs provider-side inpaint/remix access |
| OpenRouter | ✓ | ✓ | — | generate, imageToVideo |
| Qwen | ✓ | ✓ | ✓ | generate, imageToVideo; videoToVideo skipped because this provider needs remote http(s) video URLs |
| Runway | ✓ | ✓ | ✓ | generate, imageToVideo; videoToVideo runs only when the selected model is runway/gen4_aleph |
| Together | ✓ | ✓ | — | generate, imageToVideo |
| Vydra | ✓ | ✓ | — | generate; shared imageToVideo skipped because bundled veo3 is text-only and bundled kling requires a remote image URL |
| xAI | ✓ | ✓ | ✓ | generate, imageToVideo; videoToVideo skipped because this provider currently needs a remote MP4 URL |
<ParamField path="image" type="string">Single reference image (path or URL).</ParamField>
<ParamField path="images" type="string[]">Multiple reference images (up to 9).</ParamField>
<ParamField path="imageRoles" type="string[]">
Optional per-position role hints parallel to the combined image list.
Canonical values: first_frame, last_frame, reference_image.
</ParamField>
<ParamField path="video" type="string">Single reference video (path or URL).</ParamField>
<ParamField path="videos" type="string[]">Multiple reference videos (up to 4).</ParamField>
<ParamField path="videoRoles" type="string[]">
Optional per-position role hints parallel to the combined video list.
Canonical value: reference_video.
</ParamField>
<ParamField path="audioRef" type="string">
Single reference audio (path or URL). Used for background music or voice
reference when the provider supports audio inputs.
</ParamField>
<ParamField path="audioRefs" type="string[]">Multiple reference audios (up to 3).</ParamField>
<ParamField path="audioRoles" type="string[]">
Optional per-position role hints parallel to the combined audio list.
Canonical value: reference_audio.
</ParamField>
adaptive is a provider-specific sentinel: it is forwarded as-is to
providers that declare adaptive in their capabilities (e.g. BytePlus
Seedance uses it to auto-detect the ratio from the input image
dimensions). Providers that do not declare it surface the value via
details.ignoredOverrides in the tool result so the drop is visible.
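Reference inputs select the runtime mode:

- No reference media: `generate`.
- One or more reference images: `imageToVideo`.
- One or more reference videos: `videoToVideo`.
- Reference audio requires a provider that declares `maxInputAudios`.

Mixed image and video references are not a stable shared capability surface. Prefer one reference type per request.
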
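
As an illustration of how reference inputs map to a runtime mode, here is a small sketch. The `ReferenceInputs` shape mirrors the `image`, `images`, `video`, and `videos` parameters documented above, but `resolveMode` is a hypothetical helper, and rejecting mixed image and video input is an assumption made for this example rather than documented tool behavior.

```typescript
// Hypothetical sketch; resolveMode is not part of OpenClaw's public API.
type VideoMode = "generate" | "imageToVideo" | "videoToVideo";

interface ReferenceInputs {
  image?: string;
  images?: string[];
  video?: string;
  videos?: string[];
}

function resolveMode(refs: ReferenceInputs): VideoMode {
  // Count the single and list parameters together, as one combined list.
  const imageCount = (refs.image ? 1 : 0) + (refs.images?.length ?? 0);
  const videoCount = (refs.video ? 1 : 0) + (refs.videos?.length ?? 0);

  // Assumption: this sketch refuses mixed references outright.
  if (imageCount > 0 && videoCount > 0) {
    throw new Error("Use one reference type per request");
  }
  if (videoCount > 0) return "videoToVideo";
  if (imageCount > 0) return "imageToVideo";
  return "generate";
}
```

For example, `resolveMode({ images: ["a.png", "b.png"] })` yields `imageToVideo`, and an empty object yields `generate`.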
Some capability checks are applied at the fallback layer rather than the tool boundary, so a request that exceeds the primary provider's limits can still run on a capable fallback:

- A candidate that declares no `maxInputAudios` (or `0`) is skipped when
  the request contains audio references; the next candidate is tried.
- A candidate whose `maxDurationSeconds` is below the requested
  `durationSeconds`, with no declared `supportedDurationSeconds` list, is
  skipped.
- When the request contains `providerOptions` and the active candidate
  explicitly declares a typed `providerOptions` schema, the candidate is
  skipped if supplied keys are not in the schema or value types do not
  match. Providers without a declared schema receive options as-is
  (backward-compatible pass-through). A provider can opt out of all
  provider options by declaring an empty schema
  (`capabilities.providerOptions: {}`), which causes the same skip as a
  type mismatch.

The first skip reason in a request logs at `warn` so operators see when
their primary provider was passed over; subsequent skips log at `debug` to
keep long fallback chains quiet. If every candidate is skipped, the
aggregated error includes the skip reason for each.

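
The skip rules above can be sketched as a filter over the candidate chain. This is an illustrative model only: `CandidateCaps`, `VideoRequest`, `skipReason`, and `pickCandidate` are hypothetical names, and representing the typed `providerOptions` schema as a map from key to `typeof` name is an assumption made to keep the example short.

```typescript
// Hypothetical sketch of fallback-layer capability checks; not OpenClaw's code.
interface CandidateCaps {
  id: string;
  maxInputAudios?: number;
  maxDurationSeconds?: number;
  supportedDurationSeconds?: number[];
  // Assumed schema shape: option key -> expected typeof name.
  providerOptions?: Record<string, "number" | "boolean" | "string">;
}

interface VideoRequest {
  audioRefCount: number;
  durationSeconds?: number;
  providerOptions?: Record<string, unknown>;
}

// Returns a skip reason, or null when the candidate can take the request.
function skipReason(c: CandidateCaps, req: VideoRequest): string | null {
  if (req.audioRefCount > 0 && !c.maxInputAudios) {
    return "no audio reference support";
  }
  if (
    req.durationSeconds !== undefined &&
    c.maxDurationSeconds !== undefined &&
    c.maxDurationSeconds < req.durationSeconds &&
    !c.supportedDurationSeconds
  ) {
    return "requested duration exceeds maxDurationSeconds";
  }
  // Only candidates that declare a schema are validated; others pass options through.
  if (req.providerOptions && c.providerOptions) {
    for (const [key, value] of Object.entries(req.providerOptions)) {
      const expected = c.providerOptions[key];
      if (expected === undefined) return `unknown providerOptions key: ${key}`;
      if (typeof value !== expected) return `providerOptions.${key} must be ${expected}`;
    }
  }
  return null;
}

// Walks the chain in order and collects the skip reason for each rejected candidate.
function pickCandidate(
  candidates: CandidateCaps[],
  req: VideoRequest,
): { chosen: string | null; skipped: Record<string, string> } {
  const skipped: Record<string, string> = {};
  for (const c of candidates) {
    const reason = skipReason(c, req);
    if (reason === null) return { chosen: c.id, skipped };
    skipped[c.id] = reason;
  }
  return { chosen: null, skipped };
}
```

An empty schema (`providerOptions: {}`) rejects any supplied key in this sketch, which matches the opt-out behavior described above.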
| Action | What it does |
|---|---|
| `generate` | Default. Create a video from the given prompt and optional reference inputs. |
| `status` | Check the state of the in-flight video task for the current session without starting another generation. |
| `list` | Show available providers, models, and their capabilities. |

OpenClaw resolves the model in this order:

1. `model` tool parameter, if the agent specifies one in the call.
2. `videoGenerationModel.primary` from config.
3. `videoGenerationModel.fallbacks`, in order.

If a provider fails, the next candidate is tried automatically. If all candidates fail, the error includes details from each attempt.

Set `agents.defaults.mediaGenerationAutoProviderFallback: false` to use
only the explicit `model`, `primary`, and `fallbacks` entries.

```json5
{
  agents: {
    defaults: {
      videoGenerationModel: {
        primary: "google/veo-3.1-fast-generate-preview",
        fallbacks: ["runway/gen4.5", "qwen/wan2.6-t2v"],
      },
    },
  },
}
```
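
To make the resolution order concrete, the sketch below builds the ordered candidate list from a tool-level `model` and the `videoGenerationModel` config. `resolveCandidates` is a hypothetical helper, and dropping repeated entries is an assumption for this example; it does not model auto provider fallback.

```typescript
// Hypothetical sketch of the candidate order; not OpenClaw's resolver.
interface VideoGenerationModelConfig {
  primary?: string;
  fallbacks?: string[];
}

function resolveCandidates(
  toolModel: string | undefined,
  config: VideoGenerationModelConfig,
): string[] {
  // Order: tool parameter, then config primary, then config fallbacks.
  const ordered = [toolModel, config.primary, ...(config.fallbacks ?? [])];
  const seen = new Set<string>();
  const result: string[] = [];
  for (const ref of ordered) {
    // Assumption: skip empty values and keep only the first occurrence of each model.
    if (ref && !seen.has(ref)) {
      seen.add(ref);
      result.push(ref);
    }
  }
  return result;
}
```

With the config above and no tool-level `model`, the candidates would be tried as `google/veo-3.1-fast-generate-preview`, then `runway/gen4.5`, then `qwen/wan2.6-t2v`.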
BytePlus (1.0) models: `seedance-1-0-pro-250528` (default),
`seedance-1-0-pro-t2v-250528`, `seedance-1-0-pro-fast-251015`,
`seedance-1-0-lite-t2v-250428`, `seedance-1-0-lite-i2v-250428`.
T2V models (`*-t2v-*`) do not accept image inputs; I2V models and
general `*-pro-*` models support a single reference image (first
frame). Pass the image positionally or set `role: "first_frame"`.
T2V model IDs are automatically switched to the corresponding I2V
variant when an image is provided.
Supported `providerOptions` keys: `seed` (number), `draft` (boolean —
forces 480p), `camera_fixed` (boolean).

BytePlus Seedance 1.5 uses the unified `content[]` API. It supports at most 2 input images
(`first_frame` + `last_frame`). All inputs must be remote `https://`
URLs. Set `role: "first_frame"` / `"last_frame"` on each image, or
pass images positionally.
`aspectRatio: "adaptive"` auto-detects ratio from the input image.
`audio: true` maps to `generate_audio`. `providerOptions.seed`
(number) is forwarded.

BytePlus Seedance 2.0 uses the unified `content[]` API. It supports up to 9 reference images,
3 reference videos, and 3 reference audios. All inputs must be remote
`https://` URLs. Set `role` on each asset — supported values:
`"first_frame"`, `"last_frame"`, `"reference_image"`,
`"reference_video"`, `"reference_audio"`.
`aspectRatio: "adaptive"` auto-detects ratio from the input image.
`audio: true` maps to `generate_audio`. `providerOptions.seed`
(number) is forwarded.
The shared video-generation contract supports mode-specific capabilities instead of only flat aggregate limits. New provider implementations should prefer explicit mode blocks:
```ts
capabilities: {
  generate: {
    maxVideos: 1,
    maxDurationSeconds: 10,
    supportsResolution: true,
  },
  imageToVideo: {
    enabled: true,
    maxVideos: 1,
    maxInputImages: 1,
    maxInputImagesByModel: { "provider/reference-to-video": 9 },
    maxDurationSeconds: 5,
  },
  videoToVideo: {
    enabled: true,
    maxVideos: 1,
    maxInputVideos: 1,
    maxDurationSeconds: 5,
  },
}
```

Flat aggregate fields such as maxInputImages and maxInputVideos are
not enough to advertise transform-mode support. Providers should
declare generate, imageToVideo, and videoToVideo explicitly so live
tests, contract tests, and the shared video_generate tool can validate
mode support deterministically.
When one model in a provider has wider reference-input support than the
rest, use maxInputImagesByModel, maxInputVideosByModel, or
maxInputAudiosByModel instead of raising the mode-wide limit.
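
A consumer of these capability blocks might resolve the effective image limit for a given model along the following lines. This is a sketch under stated assumptions: `effectiveMaxInputImages` is a hypothetical helper, and both the lookup key format and the fallback from the per-model entry to the mode-wide limit are assumed rather than taken from the contract.

```typescript
// Hypothetical sketch; field names follow the capabilities example above.
interface ImageToVideoCaps {
  enabled: boolean;
  maxInputImages?: number;
  maxInputImagesByModel?: Record<string, number>;
}

function effectiveMaxInputImages(caps: ImageToVideoCaps, model: string): number {
  // A disabled mode accepts no reference images.
  if (!caps.enabled) return 0;
  // Assumption: a per-model entry overrides the mode-wide limit when present.
  const perModel = caps.maxInputImagesByModel?.[model];
  return perModel ?? caps.maxInputImages ?? 0;
}
```

With the example capabilities, `provider/reference-to-video` would resolve to 9 images while every other model stays at the mode-wide limit of 1.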
Opt-in live coverage for the shared bundled providers:

```bash
OPENCLAW_LIVE_TEST=1 pnpm test:live -- extensions/video-generation-providers.live.test.ts
```

Repo wrapper:

```bash
pnpm test:live:media video
```

This live file loads missing provider env vars from `~/.profile`, prefers
live/env API keys ahead of stored auth profiles by default, and runs a
release-safe smoke by default:

- `generate` for every non-FAL provider in the sweep.
- A timeout set by `OPENCLAW_LIVE_VIDEO_GENERATION_TIMEOUT_MS` (`180000` by default).

FAL is opt-in because provider-side queue latency can dominate release time:

```bash
pnpm test:live:media video --video-providers fal
```

Set `OPENCLAW_LIVE_VIDEO_GENERATION_FULL_MODES=1` to also run declared
transform modes the shared sweep can exercise safely with local media:

- `imageToVideo` when `capabilities.imageToVideo.enabled`.
- `videoToVideo` when `capabilities.videoToVideo.enabled` and the
  provider/model accepts buffer-backed local video input in the shared
  sweep.

Today the shared `videoToVideo` live lane covers `runway` only when you
select `runway/gen4_aleph`.

Set the default video-generation model in your OpenClaw config:
```json5
{
  agents: {
    defaults: {
      videoGenerationModel: {
        primary: "qwen/wan2.6-t2v",
        fallbacks: ["qwen/wan2.6-r2v-flash"],
      },
    },
  },
}
```

Or via the CLI:
```bash
openclaw config set agents.defaults.videoGenerationModel.primary "qwen/wan2.6-t2v"
```