docs/report_02.md
In the Open-Sora 1.1 release, we train a 700M-parameter model on 10M samples (Open-Sora 1.0 was trained on 400K samples) with an improved STDiT architecture. We implement the following features mentioned in Sora's technical report:
To achieve this goal, we use multi-task learning in the pretraining stage. For diffusion models, training with different sampled timesteps is already a form of multi-task learning. We further extend this idea to multiple resolutions, aspect ratios, frame lengths, fps, and different mask strategies for image- and video-conditioned generation. We train the model on videos ranging from 0s to 15s, from 144p to 720p, and with various aspect ratios. Although temporal consistency is not very high due to the limited training FLOPs, we can still see the potential of the model.
We made the following modifications to the original ST-DiT for better training stability and performance (ST-DiT-2):
As mentioned in Sora's report, training on videos at their original resolution, aspect ratio, and length increases sampling flexibility and improves framing and composition. We found three ways to achieve this goal:
For simplicity of implementation, we choose the bucket method. We pre-define a set of fixed resolutions and allocate samples to different buckets. The concerns about bucketing are listed below, but as we show, they are not a big issue in our case.
<details>
<summary>View the concerns</summary>
We set the batch size of each bucket via `bucket_config` to ensure the processing speed of different buckets is similar.
</details>

As shown in the figure, a bucket is a triplet of (resolution, num_frame, aspect_ratio). We provide pre-defined aspect ratios for each resolution that cover most common video aspect ratios. Before each epoch, we shuffle the dataset and allocate the samples to buckets as shown in the figure. A sample is put into the bucket with the largest resolution and frame length that do not exceed those of the video.
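For intuition, the allocation rule above can be sketched as follows; the bucket list and its values are illustrative, not the actual Open-Sora configuration or implementation.

```python
# Illustrative sketch of the bucket allocation rule described above: a video
# goes into the bucket with the largest (resolution, num_frame) that it can
# still fill. Bucket definitions here are hypothetical placeholders.
BUCKETS = [  # (name, short-side pixels, num_frame), sorted from largest to smallest
    ("720p-64f", 720, 64),
    ("480p-32f", 480, 32),
    ("240p-16f", 240, 16),
    ("144p-16f", 144, 16),
]

def find_bucket(short_side: int, num_frames: int):
    """Return the first (largest) bucket whose resolution and length the video covers."""
    for name, res, frames in BUCKETS:
        if short_side >= res and num_frames >= frames:
            return name
    return None  # too small/short for any bucket; such samples would be skipped

# Example: a 1080p, 48-frame clip lands in the 480p-32f bucket.
print(find_bucket(1080, 48))  # -> "480p-32f"
```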
Considering our computational resources are limited, we further introduce two attributes, keep_prob and batch_size, for each (resolution, num_frame) pair to reduce the computational cost and enable multi-stage training. Specifically, a high-resolution video is downsampled to a lower resolution with probability 1 - keep_prob, and each bucket processes batch_size samples per batch. In this way, we can control the number of samples in different buckets and balance the GPU load by searching for a good batch size for each bucket.
A detailed explanation of the bucket usage in training is available in docs/config.md.
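As a rough illustration of the schema (the numbers below are made up; docs/config.md remains the authoritative reference), a bucket_config maps each resolution and frame length to its (keep_prob, batch_size) pair:

```python
# Illustrative bucket_config: {resolution: {num_frame: (keep_prob, batch_size)}}.
# keep_prob controls how often a video stays at this resolution instead of
# being downsampled; batch_size is tuned per bucket to balance GPU load.
# The values here are placeholders, not the ones used for Open-Sora 1.1.
bucket_config = {
    "144p": {16: (1.0, 48), 64: (1.0, 16)},
    "240p": {16: (0.8, 24), 64: (0.8, 8)},
    "480p": {16: (0.5, 8)},
    "720p": {16: (0.2, 4)},
}
```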
Transformers can be easily extended to support image-to-image and video-to-video tasks. We propose a masking strategy to support image and video conditioning. The masking strategy is shown in the figure below.
Typically, we unmask the frames to be conditioned on for image/video-to-video conditioning. During the ST-DiT forward pass, unmasked frames are assigned timestep 0, while the others keep the sampled timestep t. We find that directly applying this strategy to a trained model yields poor results, since the diffusion model never learned to handle different timesteps within one sample during training.
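A minimal sketch of this per-frame timestep rule is shown below; the names and shapes are illustrative, not the actual ST-DiT forward code.

```python
import torch

# Minimal sketch of the per-frame timestep rule described above: conditioned
# (unmasked) frames get timestep 0, all other frames keep the sampled t.
def frame_timesteps(t: int, cond_mask: torch.Tensor) -> torch.Tensor:
    """cond_mask: bool tensor (num_frames,), True where the frame is a condition."""
    timesteps = torch.full_like(cond_mask, t, dtype=torch.long)
    timesteps[cond_mask] = 0
    return timesteps

# Example: condition on the first 2 of 8 frames at diffusion step t=500.
mask = torch.tensor([True, True, False, False, False, False, False, False])
print(frame_timesteps(500, mask))  # tensor([  0,   0, 500, 500, 500, 500, 500, 500])
```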
Inspired by UL2, we introduce a random masking strategy during training. Specifically, we randomly unmask frames during training: the first frame, the first k frames, the last frame, the last k frames, both the first and last k frames, random frames, etc. Based on Open-Sora 1.0, applying masking with 50% probability lets the model learn image conditioning within 10k steps (a 30% probability yields weaker ability), with a small drop in text-to-video performance. Thus, for Open-Sora 1.1, we pretrain the model from scratch with the masking strategy.
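A rough sketch of such random unmasking during training might look like the following; the mask types follow the list above, while the sampling probabilities and details are illustrative and may differ from the actual Open-Sora code.

```python
import random
import torch

# Rough sketch of random frame unmasking during training. True marks a
# conditioned (unmasked) frame; probabilities and mask types are illustrative.
def sample_random_mask(num_frames: int, p_mask: float = 0.5) -> torch.Tensor:
    mask = torch.zeros(num_frames, dtype=torch.bool)
    if random.random() > p_mask:
        return mask  # plain text-to-video sample: nothing is conditioned
    k = random.randint(1, max(1, num_frames // 4))
    mode = random.choice(["first", "first_k", "last", "last_k", "first_last_k", "random"])
    if mode == "first":
        mask[0] = True
    elif mode == "first_k":
        mask[:k] = True
    elif mode == "last":
        mask[-1] = True
    elif mode == "last_k":
        mask[-k:] = True
    elif mode == "first_last_k":
        mask[:k] = True
        mask[-k:] = True
    else:  # a random subset of frames
        mask[torch.randperm(num_frames)[:k]] = True
    return mask
```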
An illustration of the masking strategy config used in inference is given below. A five-number tuple provides great flexibility in defining the mask strategy. By conditioning on generated frames, we can autoregressively generate an arbitrary number of frames (although errors propagate).
A detailed explanation of the mask strategy usage is available in docs/config.md.
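To make the autoregressive idea above concrete, here is a toy sketch of the bookkeeping; `generate_chunk` is a hypothetical stand-in for the model's conditioned sampling call, not the real inference API.

```python
# Toy sketch of autoregressive long-video generation: each new chunk is
# conditioned on the last few frames of the video generated so far.
# `generate_chunk` is a hypothetical placeholder for the model call.
def generate_long_video(generate_chunk, num_chunks: int,
                        chunk_frames: int = 16, cond_frames: int = 4):
    video = generate_chunk(cond=None, num_frames=chunk_frames)  # first chunk: text-to-video
    for _ in range(num_chunks - 1):
        cond = video[-cond_frames:]                              # trailing frames become the condition
        chunk = generate_chunk(cond=cond, num_frames=chunk_frames)
        video += chunk[cond_frames:]                             # keep only the newly generated frames
    return video

# Bookkeeping check with dummy "frames" (in practice, errors accumulate chunk by chunk):
fake = lambda cond, num_frames: ["frame"] * num_frames
assert len(generate_long_video(fake, num_chunks=3)) == 16 + 12 + 12
```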
As we found in Open-Sora 1.0, both the amount and the quality of data are crucial for training a good model, so we work hard on scaling the dataset. First, we build an automatic pipeline following SVD, including scene cutting, captioning, various scoring and filtering, and dataset management scripts and conventions. More information can be found in docs/data_processing.md.
We plan to use Panda-70M and other data to train the model, approximately 30M+ samples in total. However, we find disk IO becomes a bottleneck when training and data processing run at the same time. Thus, we could only prepare a 10M dataset and did not run all of it through the full processing pipeline we built. Finally, we use a dataset of 9.7M videos + 2.6M images for pre-training, and 560k videos + 1.6M images for fine-tuning. The pretraining dataset statistics are shown below. More information about the dataset can be found in docs/datasets.md.
Image text tokens (by T5 tokenizer):
Video text tokens (by T5 tokenizer). We directly use Panda-70M's short captions for training and caption the other datasets ourselves. The generated captions are usually under 200 tokens.
Video duration:
With limited computational resources, we have to carefully monitor the training process and change the training strategy whenever we suspect the model is not learning well, since there is no compute budget for ablation studies. Thus, Open-Sora 1.1's training includes multiple strategy changes, and as a result, EMA is not applied.
Training starts from the Pixart-alpha-1024 checkpoints, and we find the model easily adapts to generating images at different resolutions. We use SpeeDiT (iddpm-speed) to accelerate diffusion training. To summarize, training Open-Sora 1.1 requires approximately 9 days on 64 H800 GPUs.
As we get one step closer to replicating Sora, we find many limitations in the current model, and these limitations point to future work.
- Algorithm & Acceleration: Zangwei Zheng, Xiangyu Peng, Shenggui Li, Hongxing Liu, Yukun Zhou, Tianyi Li
- Data Collection & Pipeline: Xiangyu Peng, Zangwei Zheng, Chenhui Shen, Tom Young, Junjie Wang, Chenfeng Yu