EasyAnimate

EasyAnimate by Alibaba PAI.

The description from its GitHub page: EasyAnimate is a pipeline based on the transformer architecture, designed for generating AI images and videos, and for training baseline models and Lora models for Diffusion Transformer. We support direct prediction from pre-trained EasyAnimate models, allowing for the generation of videos with various resolutions, approximately 6 seconds in length, at 8fps (EasyAnimateV5.1, 1 to 49 frames). Additionally, users can train their own baseline and Lora models for specific style transformations.

This pipeline was contributed by bubbliiiing. The original codebase can be found here. The original weights can be found under hf.co/alibaba-pai.

There are two official EasyAnimate checkpoints for text-to-video and video-to-video.

checkpoints	recommended inference dtype
`alibaba-pai/EasyAnimateV5.1-12b-zh`	torch.float16
`alibaba-pai/EasyAnimateV5.1-12b-zh-InP`	torch.float16

There is one official EasyAnimate checkpoint available for image-to-video and video-to-video.

checkpoints	recommended inference dtype
`alibaba-pai/EasyAnimateV5.1-12b-zh-InP`	torch.float16

There are two official EasyAnimate checkpoints available for control-to-video.

checkpoints	recommended inference dtype
`alibaba-pai/EasyAnimateV5.1-12b-zh-Control`	torch.float16
`alibaba-pai/EasyAnimateV5.1-12b-zh-Control-Camera`	torch.float16

For the EasyAnimateV5.1 series:

Text-to-video (T2V) and image-to-video (I2V) work at multiple resolutions. The width and height can each vary from 256 to 1024.
Both T2V and I2V models support generation with 1 to 49 frames and work best within this range. Exporting videos at 8 FPS is recommended.
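The limits above can be checked before starting a long generation run. The following is a minimal sketch and not part of Diffusers: `check_generation_args` and `clip_seconds` are hypothetical helper names, and the constants are simply the values listed above for the EasyAnimateV5.1 series.

```python
# Hypothetical helpers; the limits are the values stated above for EasyAnimateV5.1.
MIN_SIDE, MAX_SIDE = 256, 1024  # documented range for width and height
MAX_FRAMES = 49                 # documented upper bound on num_frames
RECOMMENDED_FPS = 8             # recommended export frame rate


def check_generation_args(height: int, width: int, num_frames: int) -> None:
    """Raise ValueError if an argument falls outside the documented ranges."""
    for name, side in (("height", height), ("width", width)):
        if not MIN_SIDE <= side <= MAX_SIDE:
            raise ValueError(f"{name}={side} is outside [{MIN_SIDE}, {MAX_SIDE}]")
    if not 1 <= num_frames <= MAX_FRAMES:
        raise ValueError(f"num_frames={num_frames} is outside [1, {MAX_FRAMES}]")


def clip_seconds(num_frames: int, fps: int = RECOMMENDED_FPS) -> float:
    """Playback length in seconds of a clip exported at the given frame rate."""
    return num_frames / fps


check_generation_args(height=512, width=512, num_frames=49)
print(clip_seconds(49))  # 49 frames / 8 FPS = 6.125 seconds
```

The 6.125 second result matches the "approximately 6 seconds" quoted in the project description for 49 frames at 8 FPS.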

Quantization

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.
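As rough arithmetic for why lower precision helps, the sketch below estimates the memory taken by model weights alone. It assumes that the "12b" in the checkpoint names means about 12 billion parameters, which is an assumption and not a measured figure, and it ignores activations, the text encoder, the VAE, and any per-layer quantization overhead. `weight_gib` is a hypothetical helper name.

```python
# Bytes needed to store one parameter in each data type.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

# Assumption: "12b" in the checkpoint name is read as ~12 billion parameters.
ASSUMED_PARAMS = 12e9


def weight_gib(num_params: float, dtype: str) -> float:
    """Approximate size of the weights in GiB for the given storage dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_gib(ASSUMED_PARAMS, dtype):.1f} GiB")
```

Under these assumptions, storing the weights in 8-bit takes roughly 11 GiB instead of roughly 22 GiB in float16. Actual savings depend on which layers the quantization backend converts.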

Refer to the Quantization overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [EasyAnimatePipeline] for inference with bitsandbytes.

import torch
from diffusers import (
    BitsAndBytesConfig as DiffusersBitsAndBytesConfig,
    EasyAnimatePipeline,
    EasyAnimateTransformer3DModel,
)
from diffusers.utils import export_to_video

# Load only the transformer with 8-bit bitsandbytes quantization.
quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = EasyAnimateTransformer3DModel.from_pretrained(
    "alibaba-pai/EasyAnimateV5.1-12b-zh",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

# Build the pipeline around the quantized transformer; the remaining
# components are loaded in float16 and distributed across available devices.
pipeline = EasyAnimatePipeline.from_pretrained(
    "alibaba-pai/EasyAnimateV5.1-12b-zh",
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = "A cat walks on the grass, realistic style."
negative_prompt = "bad detailed"

# Generate 49 frames and export them at the recommended 8 FPS.
video = pipeline(prompt=prompt, negative_prompt=negative_prompt, num_frames=49, num_inference_steps=30).frames[0]
export_to_video(video, "cat.mp4", fps=8)

EasyAnimatePipeline

[[autodoc]] EasyAnimatePipeline
  - all
  - __call__

EasyAnimatePipelineOutput

[[autodoc]] pipelines.easyanimate.pipeline_output.EasyAnimatePipelineOutput