<!-- Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License. -->

# HunyuanVideo-1.5

HunyuanVideo-1.5 is a lightweight yet powerful video generation model that achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters, enabling efficient inference on consumer-grade GPUs. This achievement is built upon several key components, including meticulous data curation, an advanced DiT architecture with selective and sliding tile attention (SSTA), enhanced bilingual understanding through glyph-aware text encoding, progressive pre-training and post-training, and an efficient video super-resolution network. Leveraging these designs, we developed a unified framework capable of high-quality text-to-video and image-to-video generation across multiple durations and resolutions. Extensive experiments demonstrate that this compact and proficient model establishes a new state-of-the-art among open-source models.

You can find all the original HunyuanVideo-1.5 checkpoints under the Tencent organization.

> [!TIP]
> Click on the HunyuanVideo models in the right sidebar for more examples of video generation tasks.

The examples below use a checkpoint from hunyuanvideo-community because the weights are stored in a layout compatible with Diffusers.

The example below demonstrates how to generate a video optimized for memory or inference speed.

<hfoptions id="usage">
<hfoption id="memory">

Refer to the Reduce memory usage guide for more details about the various memory saving techniques.

```py
import torch
from diffusers import HunyuanVideo15Pipeline
from diffusers.utils import export_to_video

pipeline = HunyuanVideo15Pipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_t2v",
    torch_dtype=torch.bfloat16,
)

# offload idle components to the CPU and decode latents in tiles to reduce peak memory
pipeline.enable_model_cpu_offload()
pipeline.vae.enable_tiling()

prompt = "A fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys."
video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "output.mp4", fps=15)
```

</hfoption>
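
<hfoption id="inference speed">

Compiling the transformer takes extra time on the first call, but subsequent calls are faster. The example below is a minimal sketch of this approach with `torch.compile`; the repository name and compile settings are illustrative.

```py
import torch
from diffusers import HunyuanVideo15Pipeline
from diffusers.utils import export_to_video

pipeline = HunyuanVideo15Pipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_t2v",  # illustrative repository name
    torch_dtype=torch.bfloat16,
).to("cuda")

# compile the transformer; the first call is slower while kernels are generated
pipeline.transformer = torch.compile(pipeline.transformer, mode="max-autotune")

prompt = "A fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys."
video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "output.mp4", fps=15)
```

</hfoption>
</hfoptions>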

## Notes

- HunyuanVideo-1.5 uses attention masks with variable-length sequences. For best performance, we recommend using an attention backend that handles padding efficiently.

  - H100/H800: `_flash_3_hub` or `_flash_3_varlen_hub`
  - A100/A800/RTX 4090: `flash_hub` or `flash_varlen_hub`
  - Other GPUs: `sage_hub`

  Refer to the Attention backends guide for more details about using a different backend.

  ```py
  pipe.transformer.set_attention_backend("flash_hub")  # or your preferred backend
  ```
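
  For example, the backend can be picked from the detected GPU name. The sketch below simply mirrors the recommendations above and assumes the corresponding kernels are installed:

  ```py
  import torch

  # map the detected GPU to one of the recommended backends (illustrative, not exhaustive)
  gpu = torch.cuda.get_device_name(0)
  if "H100" in gpu or "H800" in gpu:
      backend = "_flash_3_hub"
  elif "A100" in gpu or "A800" in gpu or "4090" in gpu:
      backend = "flash_hub"
  else:
      backend = "sage_hub"

  pipe.transformer.set_attention_backend(backend)
  ```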

- [`HunyuanVideo15Pipeline`] uses a guider and does not take a `guidance_scale` parameter at runtime.

  You can check the default guider configuration using `pipe.guider`:

  ```py
  >>> pipe.guider
  ClassifierFreeGuidance {
    "_class_name": "ClassifierFreeGuidance",
    "_diffusers_version": "0.36.0.dev0",
    "enabled": true,
    "guidance_rescale": 0.0,
    "guidance_scale": 6.0,
    "start": 0.0,
    "stop": 1.0,
    "use_original_formulation": false
  }

  State:
    step: None
    num_inference_steps: None
    timestep: None
    count_prepared: 0
    enabled: True
    num_conditions: 2
  ```

  To update the guider configuration, run `pipe.guider = pipe.guider.new(...)`:

  ```py
  pipe.guider = pipe.guider.new(guidance_scale=5.0)
  ```

  Read more about guiders in the Guider documentation.
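
HunyuanVideo-1.5 also supports image-to-video generation through [`HunyuanVideo15ImageToVideoPipeline`], documented below. The example below is a minimal sketch; the repository name, input image, and call arguments are illustrative and may need to be adapted to the image-to-video checkpoint you use.

```py
import torch
from diffusers import HunyuanVideo15ImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipeline = HunyuanVideo15ImageToVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_i2v",  # illustrative repository name
    torch_dtype=torch.bfloat16,
)

# same memory-saving techniques as in the text-to-video example
pipeline.enable_model_cpu_offload()
pipeline.vae.enable_tiling()

image = load_image("path/to/first_frame.png")  # conditioning image for the first frame
prompt = "A fluffy teddy bear waves at the camera."
video = pipeline(image=image, prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "output_i2v.mp4", fps=15)
```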

## HunyuanVideo15Pipeline

[[autodoc]] HunyuanVideo15Pipeline
  - all
  - __call__

## HunyuanVideo15ImageToVideoPipeline

[[autodoc]] HunyuanVideo15ImageToVideoPipeline
  - all
  - __call__

## HunyuanVideo15PipelineOutput

[[autodoc]] pipelines.hunyuan_video1_5.pipeline_output.HunyuanVideo15PipelineOutput