π₀-FAST is a Vision-Language-Action model for general robot control that uses autoregressive next-token prediction to model continuous robot actions.
π₀-FAST combines a Vision-Language Model with a novel action tokenization approach called FAST (Frequency-space Action Sequence Tokenization). This enables training autoregressive VLAs on highly dexterous tasks where standard binning-based discretization fails, while training up to 5x faster than diffusion-based approaches such as π₀.
Standard approaches for robot action tokenization use simple per-dimension, per-timestep binning schemes. While passable for simple behaviors, this rapidly breaks down for complex and dexterous skills that require precision and high-frequency control.
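To make the cost of the binning scheme concrete, here is a minimal sketch of per-dimension, per-timestep binning. The function name `bin_tokenize`, the 256 bins, and the [-1, 1] value range are illustrative assumptions, not taken from any specific implementation.

```python
def bin_tokenize(chunk, num_bins=256, low=-1.0, high=1.0):
    """Map every value of an (H, D) action chunk to one bin index (one token per value)."""
    tokens = []
    width = (high - low) / num_bins
    for step in chunk:          # H timesteps
        for value in step:      # D action dimensions
            clipped = min(max(value, low), high)
            index = min(int((clipped - low) / width), num_bins - 1)
            tokens.append(index)
    return tokens


# A 1-second chunk at 50 Hz for a 7-dimensional action space (illustrative numbers).
horizon, dims = 50, 7
chunk = [[0.0] * dims for _ in range(horizon)]
tokens = bin_tokenize(chunk)
print(len(tokens))  # 350: one token per dimension per timestep
```

The token count is always H × D, so it grows linearly with the control frequency regardless of how smooth the motion is.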
FAST solves this by compressing action sequences using signal processing techniques, resulting in a dense sequence of action tokens that can be predicted autoregressively, just like language tokens.
The FAST tokenizer compresses action sequences through the following steps:
1. Normalize: Take a continuous action chunk of shape (H, D), where H is the horizon and D is the action dimension, and normalize it using one of the supported normalization methods (quantile normalization is recommended because it handles outliers).
2. Discrete Cosine Transform (DCT): Apply a DCT (via scipy) to each action dimension separately. The DCT is a frequency transform widely used in image and audio compression codecs such as JPEG and MP3.
3. Quantization: Round the coefficients and drop the insignificant ones for each action dimension, producing a sparse frequency matrix.
4. Flatten: Flatten the matrix into a 1D vector, with low-frequency components first.
5. Byte Pair Encoding (BPE): Train a BPE tokenizer to compress the DCT coefficients into dense action tokens, typically achieving 10x compression over prior tokenization approaches.
This approach can transform any existing VLM into a VLA by training it to predict these FAST tokens.
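The DCT, quantization, and flatten steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the tokenizer's actual code: `dct_ortho`, `idct_ortho`, `fast_encode`, and `fast_decode` are hypothetical names, the hand-written orthonormal DCT stands in for the scipy call, the input is assumed to be normalized already, quantization is assumed to be "multiply by a scale, then round", and the BPE step is omitted.

```python
import math


def dct_ortho(signal):
    """Orthonormal DCT-II of a 1D list (the convention of scipy.fft.dct(x, norm="ortho"))."""
    n = len(signal)
    out = []
    for k in range(n):
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
        out.append(total * (math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)))
    return out


def idct_ortho(coeffs):
    """Inverse of dct_ortho (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = coeffs[0] * math.sqrt(1.0 / n)
        total += sum(
            c * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k, c in enumerate(coeffs) if k > 0
        )
        out.append(total)
    return out


def fast_encode(chunk, scale=10.0):
    """Per-dimension DCT, scale-and-round quantization, low-frequency-first flatten."""
    horizon, dims = len(chunk), len(chunk[0])
    # DCT along time for each action dimension: coeffs[d][k].
    coeffs = [dct_ortho([chunk[t][d] for t in range(horizon)]) for d in range(dims)]
    # Quantize: coefficients smaller than 0.5 / scale in magnitude round to 0.
    quantized = [[round(scale * c) for c in row] for row in coeffs]
    # Flatten frequency-major, so every dimension's lowest frequency comes first.
    return [quantized[d][k] for k in range(horizon) for d in range(dims)]


def fast_decode(flat, horizon, dims, scale=10.0):
    """Undo the flatten and the scaling, then invert the DCT per dimension."""
    coeffs = [[flat[k * dims + d] / scale for k in range(horizon)] for d in range(dims)]
    columns = [idct_ortho(row) for row in coeffs]
    return [[columns[d][t] for d in range(dims)] for t in range(horizon)]


# A chunk in which the robot holds a constant (normalized) action for 10 steps.
hold = [[0.5, -0.5] for _ in range(10)]
tokens = fast_encode(hold)
print(tokens[:4])                     # [16, -16, 0, 0]
print(sum(1 for v in tokens if v))    # 2 non-zero coefficients out of 20
```

Smooth trajectories concentrate their energy in the low-frequency coefficients, so most quantized entries are zero. That sparsity is what the BPE step then compresses into a short token sequence.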
Install LeRobot by following our Installation Guide.
Install π₀-FAST dependencies by running:
pip install -e ".[pi]"
You have two options for the FAST tokenizer:
Use the pre-trained tokenizer: The lerobot/fast-action-tokenizer tokenizer was trained on 1M+ real robot action sequences and works as a general-purpose tokenizer.
Train your own tokenizer: For maximum performance on your specific dataset, you can finetune the tokenizer on your own data.
lerobot-train-tokenizer \
--repo_id "user/my-lerobot-dataset" \
--action_horizon 10 \
--encoded_dims "0:6" \
--vocab_size 1024 \
--scale 10.0 \
--normalization_mode QUANTILES \
--output_dir "./my_fast_tokenizer" \
--push_to_hub \
--hub_repo_id "username/my-action-tokenizer"
| Parameter | Description | Default |
|---|---|---|
| --repo_id | LeRobot dataset repository ID | Required |
| --action_horizon | Number of future actions in each chunk | 10 |
| --encoded_dims | Comma-separated dimension ranges to encode (e.g., "0:6,7:23") | "0:6,7:23" |
| --vocab_size | BPE vocabulary size | 1024 |
| --scale | DCT scaling factor for quantization | 10.0 |
| --normalization_mode | Normalization mode (MEAN_STD, MIN_MAX, QUANTILES, QUANTILE10, IDENTITY) | QUANTILES |
| --sample_fraction | Fraction of chunks to sample per episode | 0.1 |
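The sketch below illustrates how `--encoded_dims`, `--action_horizon`, and `--sample_fraction` could interact when building the tokenizer's training data. It is an assumption-laden illustration, not the script's implementation: `parse_encoded_dims` and `sample_chunks` are hypothetical helpers, ranges are assumed to be end-exclusive like Python slices, and chunks are assumed to be sliding windows over each episode.

```python
import random


def parse_encoded_dims(spec):
    """Turn "0:6,7:23" into a flat list of dimension indices (assumed end-exclusive)."""
    indices = []
    for part in spec.split(","):
        start, stop = (int(x) for x in part.split(":"))
        indices.extend(range(start, stop))
    return indices


def sample_chunks(episode, action_horizon, dims, sample_fraction, seed=0):
    """Cut an episode into sliding windows of selected dims and keep a random fraction."""
    windows = [
        [[step[d] for d in dims] for step in episode[t:t + action_horizon]]
        for t in range(len(episode) - action_horizon + 1)
    ]
    keep = max(1, int(len(windows) * sample_fraction))
    return random.Random(seed).sample(windows, keep)


dims = parse_encoded_dims("0:6")
episode = [[float(t)] * 7 for t in range(100)]  # 100 steps of 7-dimensional actions
chunks = sample_chunks(episode, action_horizon=10, dims=dims, sample_fraction=0.1)
print(len(dims), len(chunks), len(chunks[0]), len(chunks[0][0]))  # 6 9 10 6
```

Under these assumptions, a 100-step episode yields 91 candidate windows, of which 9 are kept, each with shape (10, 6).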
To use π₀-FAST in LeRobot, specify the policy type as:
policy.type=pi0_fast
For training π₀-FAST, you can use the LeRobot training script:
lerobot-train \
--dataset.repo_id=your_dataset \
--policy.type=pi0_fast \
--output_dir=./outputs/pi0fast_training \
--job_name=pi0fast_training \
--policy.pretrained_path=lerobot/pi0_fast_base \
--policy.dtype=bfloat16 \
--policy.gradient_checkpointing=true \
--policy.chunk_size=10 \
--policy.n_action_steps=10 \
--policy.max_action_tokens=256 \
--steps=100000 \
--batch_size=4 \
--policy.device=cuda
| Parameter | Description | Default |
|---|---|---|
| --policy.gradient_checkpointing=true | Reduces memory usage significantly during training | false |
| --policy.dtype=bfloat16 | Use mixed precision training for efficiency | float32 |
| --policy.chunk_size | Number of action steps to predict (action horizon) | 50 |
| --policy.n_action_steps | Number of action steps to execute | 50 |
| --policy.max_action_tokens | Maximum number of FAST tokens per action chunk | 256 |
| --policy.action_tokenizer_name | FAST tokenizer to use | lerobot/fast-action-tokenizer |
| --policy.compile_model=true | Enable torch.compile for faster training | false |
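The relationship between `chunk_size` (actions predicted per model call) and `n_action_steps` (actions executed before re-planning) can be illustrated with a small control loop. This is a hedged sketch: `run_episode` and `fake_predict` are hypothetical stand-ins, and the real policy's action queue may be organized differently.

```python
from collections import deque


def run_episode(predict_chunk, total_steps, chunk_size, n_action_steps):
    """Execute n_action_steps actions from each predicted chunk, then re-plan."""
    assert n_action_steps <= chunk_size
    queue = deque()
    executed, model_calls = [], 0
    for step in range(total_steps):
        if not queue:
            chunk = predict_chunk(step, chunk_size)  # one (expensive) model call
            model_calls += 1
            queue.extend(chunk[:n_action_steps])     # the unexecuted tail is discarded
        executed.append(queue.popleft())
    return executed, model_calls


def fake_predict(step, chunk_size):
    """Stand-in for the policy: action i is simply the timestep it was planned for."""
    return [step + i for i in range(chunk_size)]


_, calls_full = run_episode(fake_predict, total_steps=100, chunk_size=10, n_action_steps=10)
_, calls_half = run_episode(fake_predict, total_steps=100, chunk_size=10, n_action_steps=5)
print(calls_full, calls_half)  # 10 20
```

Executing fewer steps per chunk re-plans more often, at the cost of more model calls per episode.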
π₀-FAST supports KV-caching, a widely used optimization in LLM inference. This caches the key-value pairs from the attention mechanism, avoiding redundant computation during autoregressive decoding.
# KV-caching is enabled by default
policy.use_kv_cache=true
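As a rough illustration of why KV-caching helps, the toy single-head attention below decodes a sequence with and without a cache and counts the key/value projections. Everything here (`attend`, `project`, the two decode functions, and the fixed input vectors) is a simplified assumption for illustration, not the policy's attention code. In real decoding, each new token is also sampled from the previous step's output.

```python
import math


def attend(query, keys, values):
    """Softmax attention of one query over all keys/values (single head)."""
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query)) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    norm = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / norm for i in range(len(values[0]))]


def project(token_vec):
    """Stand-in for the learned key/value projections."""
    return [2.0 * x for x in token_vec], [x + 1.0 for x in token_vec]


def decode_without_cache(tokens):
    outputs, projections = [], 0
    for t in range(len(tokens)):
        kv = [project(tok) for tok in tokens[: t + 1]]  # recompute the whole prefix
        projections += len(kv)
        outputs.append(attend(tokens[t], [k for k, _ in kv], [v for _, v in kv]))
    return outputs, projections


def decode_with_cache(tokens):
    keys, values, outputs, projections = [], [], [], 0
    for tok in tokens:
        key, value = project(tok)  # only the newest token is projected
        projections += 1
        keys.append(key)
        values.append(value)
        outputs.append(attend(tok, keys, values))
    return outputs, projections


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
out_slow, n_slow = decode_without_cache(tokens)
out_fast, n_fast = decode_with_cache(tokens)
print(n_slow, n_fast, out_slow == out_fast)  # 10 4 True
```

The outputs are identical, but the uncached version performs a number of projections that grows quadratically with sequence length, while the cached version grows linearly.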
from lerobot.policies.pi0_fast import PI0FastPolicy, PI0FastConfig
# Load the policy
policy = PI0FastPolicy.from_pretrained("your-model-path")
# During inference
actions = policy.predict_action_chunk(batch)
π₀-FAST uses a PaliGemma-based architecture:
The model takes images, text instructions, and robot state as input, and outputs discrete FAST tokens that are decoded back to continuous actions.
| Parameter | Description | Default |
|---|---|---|
| paligemma_variant | VLM backbone variant (gemma_300m, gemma_2b) | gemma_2b |
| max_state_dim | Maximum state vector dimension (padded) | 32 |
| max_action_dim | Maximum action vector dimension (padded) | 32 |
| temperature | Sampling temperature (0.0 for greedy) | 0.0 |
| max_decoding_steps | Maximum decoding steps | 256 |
| use_kv_cache | Enable KV caching for faster inference | true |
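The `temperature` setting in the table above can be understood through a generic sampling routine. The sketch below is a common formulation, assumed here for illustration. `sample_token` is a hypothetical helper and not necessarily how the policy implements decoding.

```python
import math
import random


def sample_token(logits, temperature=0.0, rng=random):
    """Pick a token id: argmax when temperature is 0, else sample from softmax(logits / T)."""
    if temperature <= 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [0.1, 2.5, -1.0, 2.4]
print(sample_token(logits))  # 1 (greedy decoding always picks the largest logit)
sampled = sample_token(logits, temperature=1.0, rng=random.Random(0))  # varies with the seed
```

With the default of 0.0, decoding is deterministic, which makes evaluation runs reproducible.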
| Feature | π₀ | π₀-FAST |
|---|---|---|
| Action Representation | Flow Matching (Diffusion) | Autoregressive Tokens (FAST) |
| Training Speed | 1x | Up to 5x faster |
| Dexterity | High | High |
| Inference Method | Iterative Denoising | Autoregressive Decoding |
| KV-Caching | N/A | Supported |
We reproduce the results of π₀-FAST on the LIBERO benchmark using the LeRobot implementation. We take the LeRobot π₀-FAST base model lerobot/pi0fast-base and finetune it for an additional 40k steps in bfloat16, with a batch size of 256 on 8 H100 GPUs, using the HuggingFace LIBERO dataset.
The finetuned model can be found here:
It was trained with the following command:
lerobot-train \
--dataset.repo_id=lerobot/libero \
--output_dir=outputs/libero_pi0fast \
--job_name=libero_pi0fast \
--policy.path=lerobot/pi0fast_base \
--policy.dtype=bfloat16 \
--steps=100000 \
--save_freq=20000 \
--batch_size=4 \
--policy.device=cuda \
--policy.scheduler_warmup_steps=4000 \
--policy.scheduler_decay_steps=100000 \
--policy.scheduler_decay_lr=1e-5 \
--policy.gradient_checkpointing=true \
--policy.chunk_size=10 \
--policy.n_action_steps=10 \
--policy.max_action_tokens=256 \
--policy.empty_cameras=1
We then evaluate the finetuned model using the LeRobot LIBERO implementation, by running the following command:
tasks="libero_object,libero_spatial,libero_goal,libero_10"
lerobot-eval \
--policy.path=lerobot/pi0fast-libero \
--policy.max_action_tokens=256 \
--env.type=libero \
--policy.gradient_checkpointing=false \
--env.task=${tasks} \
--eval.batch_size=1 \
--eval.n_episodes=1 \
--rename_map='{"observation.images.image":"observation.images.base_0_rgb","observation.images.image2":"observation.images.left_wrist_0_rgb"}'
Note: We set n_action_steps=10, similar to the original OpenPI implementation.
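The `--rename_map` argument in the evaluation command maps the camera keys produced by the LIBERO environment to the keys the policy expects. A minimal sketch of that renaming follows. `apply_rename_map` is a hypothetical helper, and the placeholder strings stand in for real image tensors. The actual evaluation script may apply the map differently.

```python
import json


def apply_rename_map(observation, rename_map):
    """Return a copy of the observation dict with keys renamed according to rename_map."""
    return {rename_map.get(key, key): value for key, value in observation.items()}


rename_map = json.loads(
    '{"observation.images.image":"observation.images.base_0_rgb",'
    '"observation.images.image2":"observation.images.left_wrist_0_rgb"}'
)
env_observation = {
    "observation.images.image": "<front camera frame>",
    "observation.images.image2": "<wrist camera frame>",
    "observation.state": "<robot state>",
}
renamed = apply_rename_map(env_observation, rename_map)
print(sorted(renamed))
# ['observation.images.base_0_rgb', 'observation.images.left_wrist_0_rgb', 'observation.state']
```

Keys that do not appear in the map, such as the robot state, pass through unchanged.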
We obtain the following results on the LIBERO benchmark:
| Model | LIBERO Spatial | LIBERO Object | LIBERO Goal | LIBERO 10 | Average |
|---|---|---|---|---|---|
| π₀-FAST | 70.0 | 100.0 | 100.0 | 60.0 | 82.5 |
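The Average column is the unweighted mean of the four suite scores, which can be checked directly:

```python
scores = {
    "LIBERO Spatial": 70.0,
    "LIBERO Object": 100.0,
    "LIBERO Goal": 100.0,
    "LIBERO 10": 60.0,
}
average = sum(scores.values()) / len(scores)
print(average)  # 82.5
```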
The full evaluation output folder, including videos, is available here
This model follows the Apache 2.0 License, consistent with the original OpenPI repository.