Back to Openclaw

Video generation

docs/tools/video-generation.md

2026.5.526.6 KB
Original Source

OpenClaw agents can generate videos from text prompts, reference images, or existing videos. Sixteen provider backends are supported, each with different model options, input modes, and feature sets. The agent picks the right provider automatically based on your configuration and available API keys.

<Note> The `video_generate` tool only appears when at least one video-generation provider is available. If you do not see it in your agent tools, set a provider API key or configure `agents.defaults.videoGenerationModel`. </Note>

OpenClaw treats video generation as three runtime modes:

  • generate — text-to-video requests with no reference media.
  • imageToVideo — request includes one or more reference images.
  • videoToVideo — request includes one or more reference videos.

Providers can support any subset of those modes. The tool validates the active mode before submission and reports supported modes in action=list.

Quick start

<Steps> <Step title="Configure auth"> Set an API key for any supported provider:
```bash
export GEMINI_API_KEY="your-key"
```
</Step> <Step title="Pick a default model (optional)"> ```bash openclaw config set agents.defaults.videoGenerationModel.primary "google/veo-3.1-fast-generate-preview" ``` </Step> <Step title="Ask the agent"> > Generate a 5-second cinematic video of a friendly lobster surfing at sunset.
The agent calls `video_generate` automatically. No tool allowlisting
is needed.
</Step> </Steps>

How async generation works

Video generation is asynchronous. When the agent calls video_generate in a session:

  1. OpenClaw submits the request to the provider and immediately returns a task id.
  2. The provider processes the job in the background (typically 30 seconds to 5 minutes depending on the provider and resolution).
  3. When the video is ready, OpenClaw wakes the same session with an internal completion event.
  4. The agent tells the user and attaches the finished video. In group/channel chats that use message-tool-only visible delivery, the agent relays the result through the message tool instead of OpenClaw posting it directly.

While a job is in flight, duplicate video_generate calls in the same session return the current task status instead of starting another generation. Use openclaw tasks list or openclaw tasks show <taskId> to check progress from the CLI.

Outside of session-backed agent runs (for example, direct tool invocations), the tool falls back to inline generation and returns the final media path in the same turn.

Generated video files are saved under OpenClaw-managed media storage when the provider returns bytes. The default generated-video save cap follows the video media limit, and agents.defaults.mediaMaxMb raises it for larger renders. When a provider also returns a hosted output URL, OpenClaw can deliver that URL instead of failing the task if local persistence rejects an oversized file.

Task lifecycle

StateMeaning
queuedTask created, waiting for the provider to accept it.
runningProvider is processing (typically 30 seconds to 5 minutes depending on provider and resolution).
succeededVideo ready; the agent wakes and posts it to the conversation.
failedProvider error or timeout; the agent wakes with error details.

Check status from the CLI:

bash
openclaw tasks list
openclaw tasks show <taskId>
openclaw tasks cancel <taskId>

If a video task is already queued or running for the current session, video_generate returns the existing task status instead of starting a new one. Use action: "status" to check explicitly without triggering a new generation.

Supported providers

ProviderDefault modelTextImage refVideo refAuth
Alibabawan2.6-t2vYes (remote URL)Yes (remote URL)MODELSTUDIO_API_KEY
BytePlus (1.0)seedance-1-0-pro-250528Up to 2 images (I2V models only; first + last frame)BYTEPLUS_API_KEY
BytePlus Seedance 1.5seedance-1-5-pro-251215Up to 2 images (first + last frame via role)BYTEPLUS_API_KEY
BytePlus Seedance 2.0dreamina-seedance-2-0-260128Up to 9 reference imagesUp to 3 videosBYTEPLUS_API_KEY
ComfyUIworkflow1 imageCOMFY_API_KEY or COMFY_CLOUD_API_KEY
DeepInfraPixverse/Pixverse-T2VDEEPINFRA_API_KEY
falfal-ai/minimax/video-01-live1 image; up to 9 with Seedance reference-to-videoUp to 3 videos with Seedance reference-to-videoFAL_KEY
Googleveo-3.1-fast-generate-preview1 image1 videoGEMINI_API_KEY
MiniMaxMiniMax-Hailuo-2.31 imageMINIMAX_API_KEY or MiniMax OAuth
OpenAIsora-21 image1 videoOPENAI_API_KEY
OpenRoutergoogle/veo-3.1-fastUp to 4 images (first/last frame or references)OPENROUTER_API_KEY
Qwenwan2.6-t2vYes (remote URL)Yes (remote URL)QWEN_API_KEY
Runwaygen4.51 image1 videoRUNWAYML_API_SECRET
TogetherWan-AI/Wan2.2-T2V-A14B1 imageTOGETHER_API_KEY
Vydraveo31 image (kling)VYDRA_API_KEY
xAIgrok-imagine-video1 first-frame image or up to 7 reference_images1 videoXAI_API_KEY

Some providers accept additional or alternate API key env vars. See individual provider pages for details.

Run video_generate action=list to inspect available providers, models, and runtime modes at runtime.

Capability matrix

The explicit mode contract used by video_generate, contract tests, and the shared live sweep:

ProvidergenerateimageToVideovideoToVideoShared live lanes today
Alibabagenerate, imageToVideo; videoToVideo skipped because this provider needs remote http(s) video URLs
BytePlusgenerate, imageToVideo
ComfyUINot in the shared sweep; workflow-specific coverage lives with Comfy tests
DeepInfragenerate; native DeepInfra video schemas are text-to-video in the bundled contract
falgenerate, imageToVideo; videoToVideo only when using Seedance reference-to-video
Googlegenerate, imageToVideo; shared videoToVideo skipped because the current buffer-backed Gemini/Veo sweep does not accept that input
MiniMaxgenerate, imageToVideo
OpenAIgenerate, imageToVideo; shared videoToVideo skipped because this org/input path currently needs provider-side inpaint/remix access
OpenRoutergenerate, imageToVideo
Qwengenerate, imageToVideo; videoToVideo skipped because this provider needs remote http(s) video URLs
Runwaygenerate, imageToVideo; videoToVideo runs only when the selected model is runway/gen4_aleph
Togethergenerate, imageToVideo
Vydragenerate; shared imageToVideo skipped because bundled veo3 is text-only and bundled kling requires a remote image URL
xAIgenerate, imageToVideo; videoToVideo skipped because this provider currently needs a remote MP4 URL

Tool parameters

Required

<ParamField path="prompt" type="string" required> Text description of the video to generate. Required for `action: "generate"`. </ParamField>

Content inputs

<ParamField path="image" type="string">Single reference image (path or URL).</ParamField> <ParamField path="images" type="string[]">Multiple reference images (up to 9).</ParamField> <ParamField path="imageRoles" type="string[]"> Optional per-position role hints parallel to the combined image list. Canonical values: first_frame, last_frame, reference_image. </ParamField> <ParamField path="video" type="string">Single reference video (path or URL).</ParamField> <ParamField path="videos" type="string[]">Multiple reference videos (up to 4).</ParamField> <ParamField path="videoRoles" type="string[]"> Optional per-position role hints parallel to the combined video list. Canonical value: reference_video. </ParamField> <ParamField path="audioRef" type="string"> Single reference audio (path or URL). Used for background music or voice reference when the provider supports audio inputs. </ParamField> <ParamField path="audioRefs" type="string[]">Multiple reference audios (up to 3).</ParamField> <ParamField path="audioRoles" type="string[]"> Optional per-position role hints parallel to the combined audio list. Canonical value: reference_audio. </ParamField>

<Note> Role hints are forwarded to the provider as-is. Canonical values come from the `VideoGenerationAssetRole` union but providers may accept additional role strings. `*Roles` arrays must not have more entries than the corresponding reference list; off-by-one mistakes fail with a clear error. Use an empty string to leave a slot unset. For xAI, set every image role to `reference_image` to use its `reference_images` generation mode; omit the role or use `first_frame` for single-image image-to-video. </Note>

Style controls

<ParamField path="aspectRatio" type="string"> Aspect-ratio hint such as `1:1`, `16:9`, `9:16`, `adaptive`, or a provider-specific value. OpenClaw normalizes or ignores unsupported values per provider. </ParamField> <ParamField path="resolution" type="string">Resolution hint such as `480P`, `720P`, `768P`, `1080P`, `4K`, or a provider-specific value. OpenClaw normalizes or ignores unsupported values per provider.</ParamField> <ParamField path="durationSeconds" type="number"> Target duration in seconds (rounded to nearest provider-supported value). </ParamField> <ParamField path="size" type="string">Size hint when the provider supports it.</ParamField> <ParamField path="audio" type="boolean"> Enable generated audio in the output when supported. Distinct from `audioRef*` (inputs). </ParamField> <ParamField path="watermark" type="boolean">Toggle provider watermarking when supported.</ParamField>

adaptive is a provider-specific sentinel: it is forwarded as-is to providers that declare adaptive in their capabilities (e.g. BytePlus Seedance uses it to auto-detect the ratio from the input image dimensions). Providers that do not declare it surface the value via details.ignoredOverrides in the tool result so the drop is visible.

Advanced

<ParamField path="action" type='"generate" | "status" | "list"' default="generate"> `"status"` returns the current session task; `"list"` inspects providers. </ParamField> <ParamField path="model" type="string">Provider/model override (e.g. `runway/gen4.5`).</ParamField> <ParamField path="filename" type="string">Output filename hint.</ParamField> <ParamField path="timeoutMs" type="number">Optional provider request timeout in milliseconds.</ParamField> <ParamField path="providerOptions" type="object"> Provider-specific options as a JSON object (e.g. `{"seed": 42, "draft": true}`). Providers that declare a typed schema validate the keys and types; unknown keys or mismatches skip the candidate during fallback. Providers without a declared schema receive the options as-is. Run `video_generate action=list` to see what each provider accepts. </ParamField> <Note> Not all providers support all parameters. OpenClaw normalizes duration to the closest provider-supported value, and remaps translated geometry hints such as size-to-aspect-ratio when a fallback provider exposes a different control surface. Truly unsupported overrides are ignored on a best-effort basis and reported as warnings in the tool result. Hard capability limits (such as too many reference inputs) fail before submission. Tool results report applied settings; `details.normalization` captures any requested-to-applied translation. </Note>

Reference inputs select the runtime mode:

  • No reference media → generate
  • Any image reference → imageToVideo
  • Any video reference → videoToVideo
  • Reference audio inputs do not change the resolved mode; they apply on top of whatever mode the image/video references select, and only work with providers that declare maxInputAudios.

Mixed image and video references are not a stable shared capability surface. Prefer one reference type per request.

Fallback and typed options

Some capability checks are applied at the fallback layer rather than the tool boundary, so a request that exceeds the primary provider's limits can still run on a capable fallback:

  • Active candidate declaring no maxInputAudios (or 0) is skipped when the request contains audio references; next candidate is tried.
  • Active candidate's maxDurationSeconds below the requested durationSeconds with no declared supportedDurationSeconds list → skipped.
  • Request contains providerOptions and the active candidate explicitly declares a typed providerOptions schema → skipped if supplied keys are not in the schema or value types do not match. Providers without a declared schema receive options as-is (backward-compatible pass-through). A provider can opt out of all provider options by declaring an empty schema (capabilities.providerOptions: {}), which causes the same skip as a type mismatch.

The first skip reason in a request logs at warn so operators see when their primary provider was passed over; subsequent skips log at debug to keep long fallback chains quiet. If every candidate is skipped, the aggregated error includes the skip reason for each.

Actions

ActionWhat it does
generateDefault. Create a video from the given prompt and optional reference inputs.
statusCheck the state of the in-flight video task for the current session without starting another generation.
listShow available providers, models, and their capabilities.

Model selection

OpenClaw resolves the model in this order:

  1. model tool parameter — if the agent specifies one in the call.
  2. videoGenerationModel.primary from config.
  3. videoGenerationModel.fallbacks in order.
  4. Auto-detection — providers that have valid auth, starting with the current default provider, then remaining providers in alphabetical order.

If a provider fails, the next candidate is tried automatically. If all candidates fail, the error includes details from each attempt.

Set agents.defaults.mediaGenerationAutoProviderFallback: false to use only the explicit model, primary, and fallbacks entries.

json5
{
  agents: {
    defaults: {
      videoGenerationModel: {
        primary: "google/veo-3.1-fast-generate-preview",
        fallbacks: ["runway/gen4.5", "qwen/wan2.6-t2v"],
      },
    },
  },
}

Provider notes

<AccordionGroup> <Accordion title="Alibaba"> Uses DashScope / Model Studio async endpoint. Reference images and videos must be remote `http(s)` URLs. </Accordion> <Accordion title="BytePlus (1.0)"> Provider id: `byteplus`.
Models: `seedance-1-0-pro-250528` (default),
`seedance-1-0-pro-t2v-250528`, `seedance-1-0-pro-fast-251015`,
`seedance-1-0-lite-t2v-250428`, `seedance-1-0-lite-i2v-250428`.

T2V models (`*-t2v-*`) do not accept image inputs; I2V models and
general `*-pro-*` models support a single reference image (first
frame). Pass the image positionally or set `role: "first_frame"`.
T2V model IDs are automatically switched to the corresponding I2V
variant when an image is provided.

Supported `providerOptions` keys: `seed` (number), `draft` (boolean —
forces 480p), `camera_fixed` (boolean).
</Accordion> <Accordion title="BytePlus Seedance 1.5"> Requires the [`@openclaw/byteplus-modelark`](https://www.npmjs.com/package/@openclaw/byteplus-modelark) plugin. Provider id: `byteplus-seedance15`. Model: `seedance-1-5-pro-251215`.
Uses the unified `content[]` API. Supports at most 2 input images
(`first_frame` + `last_frame`). All inputs must be remote `https://`
URLs. Set `role: "first_frame"` / `"last_frame"` on each image, or
pass images positionally.

`aspectRatio: "adaptive"` auto-detects ratio from the input image.
`audio: true` maps to `generate_audio`. `providerOptions.seed`
(number) is forwarded.
</Accordion> <Accordion title="BytePlus Seedance 2.0"> Requires the [`@openclaw/byteplus-modelark`](https://www.npmjs.com/package/@openclaw/byteplus-modelark) plugin. Provider id: `byteplus-seedance2`. Models: `dreamina-seedance-2-0-260128`, `dreamina-seedance-2-0-fast-260128`.
Uses the unified `content[]` API. Supports up to 9 reference images,
3 reference videos, and 3 reference audios. All inputs must be remote
`https://` URLs. Set `role` on each asset — supported values:
`"first_frame"`, `"last_frame"`, `"reference_image"`,
`"reference_video"`, `"reference_audio"`.

`aspectRatio: "adaptive"` auto-detects ratio from the input image.
`audio: true` maps to `generate_audio`. `providerOptions.seed`
(number) is forwarded.
</Accordion> <Accordion title="ComfyUI"> Workflow-driven local or cloud execution. Supports text-to-video and image-to-video through the configured graph. </Accordion> <Accordion title="fal"> Uses a queue-backed flow for long-running jobs. Most fal video models accept a single image reference. Seedance 2.0 reference-to-video models accept up to 9 images, 3 videos, and 3 audio references, with at most 12 total reference files. </Accordion> <Accordion title="Google (Gemini / Veo)"> Supports one image or one video reference. </Accordion> <Accordion title="MiniMax"> Single image reference only. </Accordion> <Accordion title="OpenAI"> Only `size` override is forwarded. Other style overrides (`aspectRatio`, `resolution`, `audio`, `watermark`) are ignored with a warning. </Accordion> <Accordion title="OpenRouter"> Uses OpenRouter's asynchronous `/videos` API. OpenClaw submits the job, polls `polling_url`, and downloads either `unsigned_urls` or the documented job content endpoint. The bundled `google/veo-3.1-fast` default advertises 4/6/8 second durations, `720P`/`1080P` resolutions, and `16:9`/`9:16` aspect ratios. </Accordion> <Accordion title="Qwen"> Same DashScope backend as Alibaba. Reference inputs must be remote `http(s)` URLs; local files are rejected upfront. </Accordion> <Accordion title="Runway"> Supports local files via data URIs. Video-to-video requires `runway/gen4_aleph`. Text-only runs expose `16:9` and `9:16` aspect ratios. </Accordion> <Accordion title="Together"> Single image reference only. </Accordion> <Accordion title="Vydra"> Uses `https://www.vydra.ai/api/v1` directly to avoid auth-dropping redirects. `veo3` is bundled as text-to-video only; `kling` requires a remote image URL. </Accordion> <Accordion title="xAI"> Supports text-to-video, single first-frame image-to-video, up to 7 `reference_image` inputs through xAI `reference_images`, and remote video edit/extend flows. </Accordion> </AccordionGroup>

Provider capability modes

The shared video-generation contract supports mode-specific capabilities instead of only flat aggregate limits. New provider implementations should prefer explicit mode blocks:

typescript
capabilities: {
  generate: {
    maxVideos: 1,
    maxDurationSeconds: 10,
    supportsResolution: true,
  },
  imageToVideo: {
    enabled: true,
    maxVideos: 1,
    maxInputImages: 1,
    maxInputImagesByModel: { "provider/reference-to-video": 9 },
    maxDurationSeconds: 5,
  },
  videoToVideo: {
    enabled: true,
    maxVideos: 1,
    maxInputVideos: 1,
    maxDurationSeconds: 5,
  },
}

Flat aggregate fields such as maxInputImages and maxInputVideos are not enough to advertise transform-mode support. Providers should declare generate, imageToVideo, and videoToVideo explicitly so live tests, contract tests, and the shared video_generate tool can validate mode support deterministically.

When one model in a provider has wider reference-input support than the rest, use maxInputImagesByModel, maxInputVideosByModel, or maxInputAudiosByModel instead of raising the mode-wide limit.

Live tests

Opt-in live coverage for the shared bundled providers:

bash
OPENCLAW_LIVE_TEST=1 pnpm test:live -- extensions/video-generation-providers.live.test.ts

Repo wrapper:

bash
pnpm test:live:media video

This live file loads missing provider env vars from ~/.profile, prefers live/env API keys ahead of stored auth profiles by default, and runs a release-safe smoke by default:

  • generate for every non-FAL provider in the sweep.
  • One-second lobster prompt.
  • Per-provider operation cap from OPENCLAW_LIVE_VIDEO_GENERATION_TIMEOUT_MS (180000 by default).

FAL is opt-in because provider-side queue latency can dominate release time:

bash
pnpm test:live:media video --video-providers fal

Set OPENCLAW_LIVE_VIDEO_GENERATION_FULL_MODES=1 to also run declared transform modes the shared sweep can exercise safely with local media:

  • imageToVideo when capabilities.imageToVideo.enabled.
  • videoToVideo when capabilities.videoToVideo.enabled and the provider/model accepts buffer-backed local video input in the shared sweep.

Today the shared videoToVideo live lane covers runway only when you select runway/gen4_aleph.

Configuration

Set the default video-generation model in your OpenClaw config:

json5
{
  agents: {
    defaults: {
      videoGenerationModel: {
        primary: "qwen/wan2.6-t2v",
        fallbacks: ["qwen/wan2.6-r2v-flash"],
      },
    },
  },
}

Or via the CLI:

bash
openclaw config set agents.defaults.videoGenerationModel.primary "qwen/wan2.6-t2v"