Multitask Diffusion Transformer (DiT) Policy is an evolution of the original Diffusion Policy architecture, which leverages a large DiT with text and vision conditioning for multitask robot learning. This implementation supports both diffusion and flow matching objectives for action generation, enabling robots to perform diverse manipulation tasks conditioned on language instructions.
The model uses:
This model is exciting because you can achieve extremely high dexterity, competitive with multi-billion parameter VLAs, with only ~450M parameters and significantly less training.
Multitask DiT Policy has additional dependencies. Install it with:
pip install lerobot[multi_task_dit]
This will install all necessary dependencies including the HuggingFace Transformers library for CLIP models.
To use Multitask DiT in your LeRobot configuration, specify the policy type as:
policy.type=multi_task_dit
Here's a complete command for training Multitask DiT on your dataset:
lerobot-train \
--dataset.repo_id=YOUR_DATASET \
--output_dir=./outputs/multitask_dit_training \
--batch_size=32 \
--steps=5000 \
--save_freq=500 \
--log_freq=100 \
--policy.type=multi_task_dit \
--policy.device=cuda \
--policy.repo_id="HF_USER/multitask-dit-your-robot" \
--wandb.enable=true
For reliable performance, start with these suggested default hyperparameters:
lerobot-train \
--dataset.repo_id=YOUR_DATASET \
--output_dir=./outputs/multitask_dit_training \
--batch_size=320 \
--steps=30000 \
--policy.type=multi_task_dit \
--policy.device=cuda \
--policy.horizon=32 \
--policy.n_action_steps=24 \
--policy.objective=diffusion \
--policy.noise_scheduler_type=DDPM \
--policy.num_train_timesteps=100 \
--policy.repo_id="HF_USER/multitask-dit-your-robot" \
--wandb.enable=true
Key Parameters:
Objective: diffusion - start with diffusion and experiment with flow matching if generation quality is poor
Choose between diffusion and flow matching:
# Diffusion objective (default)
# noise_scheduler_type: "DDPM" or "DDIM"
# num_inference_steps: use fewer steps for faster inference
# beta_schedule: noise schedule type
# prediction_type: "epsilon" (predict noise) or "sample" (predict the clean sample)
# clip_sample: clip samples during denoising
# clip_sample_range: clipping range [-x, x]
--policy.objective=diffusion \
--policy.noise_scheduler_type=DDPM \
--policy.num_train_timesteps=100 \
--policy.num_inference_steps=10 \
--policy.beta_schedule=squaredcos_cap_v2 \
--policy.prediction_type=epsilon \
--policy.clip_sample=true \
--policy.clip_sample_range=1.0
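For intuition about what a squared-cosine noise schedule such as squaredcos_cap_v2 looks like, the sketch below computes one in plain Python. It follows the widely used cosine formulation with a cap on each beta and is an illustration only: the function name `squaredcos_betas` and the `max_beta=0.999` cap are assumptions, not code taken from this policy.

```python
import math


def squaredcos_betas(num_train_timesteps: int = 100, max_beta: float = 0.999) -> list[float]:
    """Per-step noise variances for a squared-cosine schedule (illustrative)."""

    def alpha_bar(t: float) -> float:
        # Cumulative signal level at normalized time t in [0, 1].
        return math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2

    betas = []
    for i in range(num_train_timesteps):
        t1 = i / num_train_timesteps
        t2 = (i + 1) / num_train_timesteps
        # Cap each beta so the last steps stay numerically well behaved.
        betas.append(min(1.0 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return betas


betas = squaredcos_betas(100)
print(f"first beta: {betas[0]:.5f}, last beta: {betas[-1]:.3f}")
```

The schedule adds very little noise in the early timesteps and ramps up toward the end, where the cap takes effect.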
# Flow matching objective
# timestep_sampling_strategy: "beta" or "uniform"; beta sampling appears to perform much better in practice
# integration_method: "euler" or "rk4"
# sigma_min: minimum noise in the flow interpolation path
--policy.objective=flow_matching \
--policy.timestep_sampling_strategy=beta \
--policy.num_integration_steps=100 \
--policy.integration_method=euler \
--policy.sigma_min=0.0
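To make the integration options more concrete, here is a toy sketch of fixed-step Euler integration over a velocity field, which is conceptually what integration_method=euler with num_integration_steps refers to. The names `euler_integrate` and `toy_velocity` are hypothetical, the constant velocity field is a stand-in for the trained network, and integrating from noise at t=0 to actions at t=1 is an assumed convention rather than something confirmed by this policy's code.

```python
from typing import Callable, Sequence

Vector = list[float]


def euler_integrate(
    x0: Sequence[float],
    velocity_fn: Callable[[Vector, float], Vector],
    num_integration_steps: int = 100,
) -> Vector:
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_integration_steps
    for step in range(num_integration_steps):
        t = step * dt
        v = velocity_fn(x, t)
        # One explicit Euler update: move along the current velocity for dt.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for the learned model: a constant velocity pointing from the
# initial noise sample to a fixed target action.
noise = [0.5, -1.0]
target = [0.2, 0.3]


def toy_velocity(x: Vector, t: float) -> Vector:
    return [tg - n for tg, n in zip(target, noise)]


print(euler_integrate(noise, toy_velocity, num_integration_steps=100))
```

With a constant velocity field the result lands on the target regardless of step count; with a learned, time-varying field, more steps (or a higher-order method such as rk4) generally trade inference time for accuracy.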
Adjust model capacity based on dataset size:
# Small datasets (< 100 examples)
--policy.num_layers=4 \
--policy.hidden_dim=512 \
--policy.num_heads=8 # should ideally be hidden_dim // 64
# Medium datasets (100-5k examples) - default
--policy.num_layers=6 \
--policy.hidden_dim=512 \
--policy.num_heads=8 # should ideally be hidden_dim // 64
# Large datasets (> 5k examples)
--policy.num_layers=8 \
--policy.hidden_dim=512 \
--policy.num_heads=8 # should ideally be hidden_dim // 64
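The "hidden_dim // 64" rule of thumb from the comments above can be checked with a few lines of Python. This is a minimal sketch; the helper name `suggest_num_heads` and the `head_dim` parameter are hypothetical and not part of LeRobot.

```python
def suggest_num_heads(hidden_dim: int, head_dim: int = 64) -> int:
    """Return hidden_dim // head_dim, checking that the split is exact."""
    if hidden_dim % head_dim != 0:
        raise ValueError(
            f"hidden_dim={hidden_dim} is not divisible by head_dim={head_dim}"
        )
    return hidden_dim // head_dim


for hidden_dim in (512, 768):
    heads = suggest_num_heads(hidden_dim)
    print(f"--policy.hidden_dim={hidden_dim} --policy.num_heads={heads}")
```

For hidden_dim=512 this gives 8 heads, and for hidden_dim=768 it gives 12.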
Positional Encoding Options:
The model supports two positional encoding methods for action sequences:
# Rotary Position Embedding (RoPE) - default, recommended
--policy.use_rope=true \
--policy.rope_base=10000.0 # Base frequency for RoPE
# Absolute positional encoding
--policy.use_positional_encoding=true # Disables RoPE when true
Other Transformer Parameters:
--policy.dropout=0.1 # Dropout rate for DiT blocks (0.0-1.0)
--policy.timestep_embed_dim=256 # Timestep embedding dimension
# Use different CLIP model for more expressivity at the cost of inference time
# experiment with larger or smaller models depending on the complexity of your tasks and size of dataset
--policy.vision_encoder_name=openai/clip-vit-large-patch14
# Use separate vision encoder per camera
# This may be useful when cameras have significantly different characteristics, but
# be wary of increased VRAM footprint.
--policy.use_separate_rgb_encoder_per_camera=true
# Image preprocessing
# image_resize_shape: you may need to resize your images to speed up inference
# image_crop_is_random: random crop during training, center crop at inference
--policy.image_resize_shape=[XXX,YYY] \
--policy.image_crop_shape=[224,224] \
--policy.image_crop_is_random=true
# Use different CLIP text encoder model
# same as vision: experiment with larger or smaller models depending on the
# complexity of your tasks and size of dataset
--policy.text_encoder_name=openai/clip-vit-large-patch14
The vision encoder uses a separate learning rate multiplier; 0.1 (one tenth of the base learning rate) is the suggested starting point:
--policy.optimizer_lr=2e-5 \
--policy.vision_encoder_lr_multiplier=0.1 # Vision encoder LR = 0.1 * optimizer_lr
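As a rough illustration of what the multiplier does, the sketch below splits parameter names into two optimizer groups and scales the vision encoder's learning rate. The function `build_param_groups` and the `vision_encoder.` name prefix are assumptions made for this example; the policy's actual parameter naming and grouping may differ.

```python
def build_param_groups(
    param_names: list[str],
    optimizer_lr: float = 2e-5,
    vision_encoder_lr_multiplier: float = 0.1,
    vision_prefix: str = "vision_encoder.",  # hypothetical naming convention
) -> list[dict]:
    """Split parameters into a base group and a slower vision-encoder group."""
    vision = [n for n in param_names if n.startswith(vision_prefix)]
    other = [n for n in param_names if not n.startswith(vision_prefix)]
    return [
        {"params": other, "lr": optimizer_lr},
        {"params": vision, "lr": optimizer_lr * vision_encoder_lr_multiplier},
    ]


names = ["vision_encoder.proj.weight", "dit.blocks.0.attn.weight"]
for group in build_param_groups(names):
    print(group)
```

With the values shown above, the vision encoder trains at roughly 2e-6 while the rest of the model uses 2e-5.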
The original diffusion implementation here is based on the work described in TRI's LBM paper.
Additionally, we have implemented a flow-matching objective, which is described at a high level in a Boston Dynamics blog post.
Consider testing the flow-matching objective and evaluating performance differences for your task:
--policy.objective=flow_matching \
--policy.timestep_sampling_strategy=beta \
--policy.timestep_sampling_alpha=1.5 \
--policy.timestep_sampling_beta=1.0 \
--policy.timestep_sampling_s=0.999
This hasn't been shown to be a silver bullet across every use case, but it occasionally results in smoother and more consistent actions.
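To get a feel for how beta timestep sampling differs from uniform sampling, here is a small sketch using Python's standard library. It assumes one plausible reading of the parameters: draw u from Beta(alpha, beta) and scale it by s so that timesteps fall in [0, s]. The function name `sample_timestep` is hypothetical, and the policy's real mapping (for example, whether it uses u or 1 - u) may differ.

```python
import random


def sample_timestep(alpha: float = 1.5, beta: float = 1.0, s: float = 0.999, rng=random) -> float:
    """Draw one training timestep in [0, s] from a scaled Beta(alpha, beta)."""
    return s * rng.betavariate(alpha, beta)


rng = random.Random(0)  # seeded for reproducibility
samples = [sample_timestep(rng=rng) for _ in range(20000)]
print(f"mean timestep: {sum(samples) / len(samples):.3f}")
```

Under this assumed mapping, the samples skew toward one end of the interval instead of being spread evenly, which changes which noise levels the model sees most often during training.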
Match model capacity to your dataset size:
horizon Tuning
The model can be sensitive to the horizon you choose. Start with around a 1 second horizon based on your control frequency:
30 Hz control frequency: horizon=30
10 Hz control frequency: horizon=10
Then experiment with increasing from there. The horizon determines how far into the future the model predicts actions.
n_action_steps Sensitivity
The model can also be very sensitive to n_action_steps. Start with around 0.8 seconds of actions based on your control frequency (for example, n_action_steps=24 at 30 Hz or n_action_steps=8 at 10 Hz) and tune from there.
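The two rules of thumb above (about 1 second for horizon, about 0.8 seconds for n_action_steps) reduce to simple arithmetic on the control frequency. The helper below is a minimal sketch; `suggest_chunking` is a hypothetical name, not a LeRobot function.

```python
def suggest_chunking(
    control_hz: float, horizon_s: float = 1.0, action_s: float = 0.8
) -> dict[str, int]:
    """Convert time-based rules of thumb into step counts for a control rate."""
    horizon = round(control_hz * horizon_s)
    n_action_steps = round(control_hz * action_s)
    # Never execute more steps than the model predicts.
    return {"horizon": horizon, "n_action_steps": min(n_action_steps, horizon)}


for hz in (10, 30):
    print(hz, suggest_chunking(hz))
```

These are starting points only. The suggested defaults earlier in this guide use horizon=32 with n_action_steps=24, which is close to what this gives at 30 Hz.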
For faster inference, use DDIM with fewer sampling steps:
--policy.noise_scheduler_type=DDIM \
--policy.num_inference_steps=10
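The speed-up comes from denoising on a subset of the training timesteps. The sketch below shows one common way to pick that subset (evenly spaced, from noisiest to cleanest). It is an illustration under assumptions: `inference_timesteps` is a hypothetical helper, and the exact spacing used by the scheduler in this policy may differ.

```python
def inference_timesteps(
    num_train_timesteps: int = 100, num_inference_steps: int = 10
) -> list[int]:
    """Evenly spaced subset of training timesteps, from noisiest to cleanest."""
    if not 0 < num_inference_steps <= num_train_timesteps:
        raise ValueError("num_inference_steps must be in (0, num_train_timesteps]")
    step_ratio = num_train_timesteps // num_inference_steps
    return [i * step_ratio for i in reversed(range(num_inference_steps))]


print(inference_timesteps(100, 10))  # [90, 80, 70, 60, 50, 40, 30, 20, 10, 0]
```

With 100 training timesteps and 10 inference steps, the model is called 10 times per action chunk instead of 100.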
To resume training from a checkpoint:
lerobot-train \
--config_path=./outputs/multitask_dit_training/checkpoints/last/pretrained_model/train_config.json \
--resume=true
The checkpoint directory should contain model.safetensors and config.json files (saved automatically during training). When resuming, the configuration is loaded from the checkpoint, so you don't need to specify other parameters.
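Before resuming, it can be useful to confirm that the expected files are present. The following is a minimal sketch using only the standard library; the helper name `missing_checkpoint_files` and the example path are placeholders, not part of LeRobot.

```python
from pathlib import Path

# Files the checkpoint directory is expected to contain, per the note above.
REQUIRED_FILES = ("model.safetensors", "config.json")


def missing_checkpoint_files(pretrained_dir: str) -> list[str]:
    """Return the required file names that are absent from pretrained_dir."""
    root = Path(pretrained_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


# Placeholder path: point this at your own checkpoint directory.
missing = missing_checkpoint_files("path/to/checkpoints/last/pretrained_model")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("Checkpoint looks complete.")
```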
Training these models can be finicky. Here are common failure modes and debugging approaches:
The model may "collapse" during inference, resulting in static or no motion. This can occur when:
Insufficient training data: If you only have 20-50 examples, try to roughly double your dataset size. If you still see this with more than 300 examples, the task may be too complex.
Multiple similar tasks: When your dataset contains multiple similar tasks (e.g., picking up 2 different objects), the model may rely too heavily on language conditioning which might not be rich enough.
Debugging tips:
Sometimes the robot will completely ignore your instruction and perform some other task. This generally only happens if you have trained on multiple tasks.
Potential causes:
Debugging tips:
If training loss is unstable or diverging:
Try adjusting the learning rate between 1e-5 and 3e-4
Here's a complete example of training on a custom dataset:
lerobot-train \
--dataset.repo_id=YOUR_DATASET \
--output_dir=./outputs/multitask_dit_training \
--batch_size=320 \
--steps=30000 \
--save_freq=1000 \
--log_freq=100 \
--eval_freq=1000 \
--policy.type=multi_task_dit \
--policy.device=cuda \
--policy.horizon=32 \
--policy.n_action_steps=24 \
--policy.objective=diffusion \
--policy.noise_scheduler_type=DDPM \
--policy.num_layers=6 \
--policy.hidden_dim=512 \
--policy.vision_encoder_name=openai/clip-vit-base-patch16 \
--policy.image_resize_shape=[320,240] \
--policy.image_crop_shape=[224,224] \
--policy.repo_id="HF_USER/multitask-dit-your-robot" \
--wandb.enable=true \
--wandb.project=multitask_dit
python -m lerobot.scripts.lerobot_train \
--dataset.repo_id=HuggingFaceVLA/libero \
--policy.type=multi_task_dit \
--policy.push_to_hub=false \
--output_dir="./outputs/multitask_dit_libero" \
--job_name="multitask-dit-libero" \
--wandb.enable=true \
--wandb.project=multitask_dit_libero \
--dataset.image_transforms.enable=true \
--dataset.image_transforms.max_num_transforms=4 \
--dataset.image_transforms.tfs='{"brightness":{"type":"ColorJitter","kwargs":{"brightness":[0.75,1.25]}},"contrast":{"type":"ColorJitter","kwargs":{"contrast":[0.6,1.4]}},"saturation":{"type":"ColorJitter","kwargs":{"saturation":[0.8,1.2]}},"hue":{"type":"ColorJitter","kwargs":{"hue":[-0.05,0.05]}},"sharpness":{"type":"SharpnessJitter","kwargs":{"sharpness":[0.6,1.4]}},"rotation":{"type":"RandomRotation","kwargs":{"degrees":[-5,5]}},"translation":{"type":"RandomAffine","kwargs":{"degrees":0,"translate":[0.1,0.1]}}}' \
--dataset.video_backend=torchcodec \
--policy.use_amp=true \
--policy.horizon=48 \
--policy.n_obs_steps=2 \
--policy.use_rope=true \
--policy.use_positional_encoding=false \
--policy.hidden_dim=768 \
--policy.num_layers=8 \
--policy.num_heads=12 \
--policy.dropout=0.1 \
--policy.timestep_embed_dim=256 \
--policy.objective=diffusion \
--policy.optimizer_lr=3e-4 \
--policy.optimizer_weight_decay=0 \
--policy.scheduler_warmup_steps=0 \
--policy.vision_encoder_name=openai/clip-vit-base-patch16 \
--policy.image_resize_shape=[256,256] \
--policy.image_crop_is_random=true \
--policy.text_encoder_name=openai/clip-vit-base-patch16 \
--policy.vision_encoder_lr_multiplier=0.1 \
--policy.device=cuda \
--num_workers=8 \
--save_freq=4000 \
--log_freq=100 \
--steps=100000 \
--batch_size=320
Results:
| LIBERO Spatial | LIBERO Object | LIBERO Goal | LIBERO 10 | Average |
|---|---|---|---|---|
| 87.0 | 98.2 | 93.8 | 83.2 | 90.6 |
For more details on the technical implementation and architecture, see: