docs/source/porting_datasets_v3.mdx
This tutorial explains how to port large-scale robotic datasets to the LeRobot Dataset v3.0 format. We'll use the DROID 1.0.1 dataset as our primary example, which demonstrates handling multi-terabyte datasets with thousands of shards across SLURM clusters.
Dataset v3.0 fundamentally changes how data is organized and stored:
v2.1 Structure (Episode-based):
dataset/
├── data/chunk-000/episode_000000.parquet
├── data/chunk-000/episode_000001.parquet
├── videos/chunk-000/camera/episode_000000.mp4
└── meta/episodes.jsonl
v3.0 Structure (File-based):
dataset/
├── data/chunk-000/file-000.parquet # Multiple episodes per file
├── videos/camera/chunk-000/file-000.mp4 # Consolidated video chunks
└── meta/episodes/chunk-000/file-000.parquet # Structured metadata
This transition from individual episode files to file-based chunks dramatically improves performance and reduces storage overhead.
Dataset v3.0 introduces significant improvements for handling large datasets: far fewer, larger files and structured episode metadata exposed through `dataset.meta.episodes`. Before porting a large dataset, make sure you have the prerequisites covered in the sections below: a reader for your source format, download tooling such as `gsutil`, and, for multi-terabyte datasets, access to a SLURM cluster.
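In practice this means episode metadata can be queried without touching the data files. A minimal sketch, assuming `LeRobotDataset` is importable from `lerobot.datasets.lerobot_dataset` (the module path may differ between LeRobot versions) and using a placeholder repo id:

```python
from lerobot.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("your_id/droid_1.0.1")  # placeholder repo id

# v3.0 stores episode metadata as structured parquet files, exposed here
# instead of the old meta/episodes.jsonl.
print(dataset.meta.total_episodes)
print(dataset.meta.episodes)

# Frames are still addressed by a global index, whichever chunked file they live in.
frame = dataset[0]
print(frame["observation.state"])
```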
DROID 1.0.1 is an excellent example of a large-scale robotic dataset: the raw RLDS release is 1.7TB, split into 2048 shards, and distributed from Google Cloud Storage via `gsutil`. It contains diverse manipulation demonstrations with multiple camera views, robot state, actions, and natural-language instructions, which map to the following LeRobot features:
DROID_FEATURES = {
    # Episode markers
    "is_first": {"dtype": "bool", "shape": (1,)},
    "is_last": {"dtype": "bool", "shape": (1,)},
    "is_terminal": {"dtype": "bool", "shape": (1,)},
    # Language instructions
    "language_instruction": {"dtype": "string", "shape": (1,)},
    "language_instruction_2": {"dtype": "string", "shape": (1,)},
    "language_instruction_3": {"dtype": "string", "shape": (1,)},
    # Robot state
    "observation.state.gripper_position": {"dtype": "float32", "shape": (1,)},
    "observation.state.cartesian_position": {"dtype": "float32", "shape": (6,)},
    "observation.state.joint_position": {"dtype": "float32", "shape": (7,)},
    # Camera observations
    "observation.images.wrist_left": {"dtype": "image"},
    "observation.images.exterior_1_left": {"dtype": "image"},
    "observation.images.exterior_2_left": {"dtype": "image"},
    # Actions
    "action.gripper_position": {"dtype": "float32", "shape": (1,)},
    "action.cartesian_position": {"dtype": "float32", "shape": (6,)},
    "action.joint_position": {"dtype": "float32", "shape": (7,)},
    # Standard LeRobot format
    "observation.state": {"dtype": "float32", "shape": (8,)},  # joints + gripper
    "action": {"dtype": "float32", "shape": (8,)},  # joints + gripper
}
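These feature definitions are what gets passed to LeRobot when the target dataset is created. A hedged sketch of the pattern the porting script follows (`raw_episodes`, the frame-mapping helper, and the `fps` value are placeholders, and the exact `create`/`add_frame` signatures may vary between LeRobot versions):

```python
from lerobot.datasets.lerobot_dataset import LeRobotDataset

# Create an empty dataset whose schema matches the DROID features above.
dataset = LeRobotDataset.create(
    repo_id="your_id/droid_1.0.1",
    fps=15,  # placeholder: use the control frequency of the source data
    features=DROID_FEATURES,
)

# Walk the raw episodes, append frames one by one, then close each episode.
for episode in raw_episodes:                                # hypothetical iterable of RLDS episodes
    for step in episode["steps"]:
        frame = rlds_step_to_lerobot_frame(step)            # hypothetical mapping to DROID_FEATURES keys
        frame["task"] = str(step["language_instruction"])   # task handling may differ by LeRobot version
        dataset.add_frame(frame)
    dataset.save_episode()
```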
For DROID specifically:
pip install tensorflow
pip install tensorflow_datasets
For other datasets, install the appropriate readers for your source format.
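For DROID, `tensorflow_datasets` reads the raw RLDS shards directly. A short sketch of iterating one episode once the raw data is downloaded (next section); the path is a placeholder:

```python
import tensorflow_datasets as tfds

# Point tfds at the raw RLDS directory.
builder = tfds.builder_from_directory("/your/data/droid/1.0.1")
raw_dataset = builder.as_dataset(split="train")

# RLDS stores one episode per record; each episode holds a nested "steps"
# dataset with observations, actions, and language instructions.
for episode in raw_dataset.take(1):
    for step in episode["steps"]:
        print(step["language_instruction"].numpy().decode())
        break
```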
Download DROID from Google Cloud Storage using gsutil:
# Install Google Cloud SDK if not already installed
# https://cloud.google.com/sdk/docs/install
# Download the full RLDS dataset (1.7TB)
gsutil -m cp -r gs://gresearch/robotics/droid/1.0.1 /your/data/
# Or download just the 100-episode sample (2GB) for testing
gsutil -m cp -r gs://gresearch/robotics/droid_100 /your/data/
[!WARNING] Large datasets require substantial time and storage:
- Full DROID (1.7TB): Several days to download depending on bandwidth
- Processing time: 7+ days for local porting of full dataset
- Upload time: 3+ days to push to Hugging Face Hub
- Local storage: ~400GB for processed LeRobot format
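Given the storage estimate above, it is worth verifying free disk space before launching a multi-day port. A small sketch using only the Python standard library (the path is a placeholder):

```python
import shutil

# Placeholder path: the drive where the processed LeRobot dataset will be written.
free_gb = shutil.disk_usage("/your/data").free / 1024**3

# ~400GB is the rough footprint of the processed DROID dataset quoted above.
if free_gb < 400:
    raise RuntimeError(f"Only {free_gb:.0f}GB free; the processed dataset needs ~400GB.")
```

With enough space available, launch the port: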
python examples/port_datasets/port_droid.py \
--raw-dir /your/data/droid/1.0.1 \
--repo-id your_id/droid_1.0.1 \
--push-to-hub
For development, you can port a single shard:
python examples/port_datasets/port_droid.py \
--raw-dir /your/data/droid/1.0.1 \
--repo-id your_id/droid_1.0.1_test \
--num-shards 2048 \
--shard-index 0
This approach works for smaller datasets or testing, but large datasets require cluster computing.
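Under the hood, sharded porting simply gives each worker a disjoint slice of the source episodes. A hedged illustration of how `--num-shards`/`--shard-index` can map onto `tensorflow_datasets` sub-splits (the actual helper code in the porting script may differ):

```python
import tensorflow_datasets as tfds

num_shards = 2048  # matches --num-shards
shard_index = 0    # matches --shard-index

# Split "train" into 2048 equally sized sub-splits and read only one of them.
sub_splits = tfds.even_splits("train", n=num_shards)
builder = tfds.builder_from_directory("/your/data/droid/1.0.1")
shard = builder.as_dataset(split=sub_splits[shard_index])
```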
For large datasets like DROID, parallel processing across multiple nodes dramatically reduces processing time.
pip install datatrove # Hugging Face's distributed processing library
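The SLURM scripts below are thin wrappers around datatrove's pipeline executors. A rough sketch of the pattern, with a hypothetical `PortDroidShards` step standing in for the real logic in `slurm_port_shards.py`:

```python
from datatrove.executor.slurm import SlurmPipelineExecutor
from datatrove.pipeline.base import PipelineStep


class PortDroidShards(PipelineStep):
    """Hypothetical step: each SLURM task ports one shard of the raw dataset."""

    def run(self, data=None, rank: int = 0, world_size: int = 1):
        port_one_shard(shard_index=rank, num_shards=world_size)  # hypothetical helper


executor = SlurmPipelineExecutor(
    pipeline=[PortDroidShards()],
    tasks=2048,                      # one task per DROID shard
    workers=2048,                    # how many tasks may run at once
    partition="your_partition",
    cpus_per_task=8,
    mem_per_cpu_gb=2,
    job_name="port_droid",
    logging_dir="/your/logs/port_droid",
    time="24:00:00",
)
executor.run()
```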
Find your partition information:
sinfo --format="%R" # List available partitions
sinfo -N -p your_partition -h -o "%N cpus=%c mem=%m" # Check resources
Choose a CPU partition - no GPU needed for dataset porting.
python examples/port_datasets/slurm_port_shards.py \
--raw-dir /your/data/droid/1.0.1 \
--repo-id your_id/droid_1.0.1 \
--logs-dir /your/logs \
--job-name port_droid \
--partition your_partition \
--workers 2048 \
--cpus-per-task 8 \
--mem-per-cpu 1950M
- `--workers`: Number of parallel jobs (max 2048 for DROID's shard count)
- `--cpus-per-task`: 8 CPUs recommended for frame encoding parallelization
- `--mem-per-cpu`: ~16GB total RAM (8×1950M) for loading raw frames

[!TIP] Start with fewer workers (e.g., 100) to test your cluster configuration before launching thousands of jobs.
Check running jobs:
squeue -u $USER
Monitor overall progress:
jobs_status /your/logs
Inspect individual job logs:
less /your/logs/port_droid/slurm_jobs/JOB_ID_WORKER_ID.out
Debug failed jobs:
failed_logs /your/logs/port_droid
Once all porting jobs complete:
python examples/port_datasets/slurm_aggregate_shards.py \
--repo-id your_id/droid_1.0.1 \
--logs-dir /your/logs \
--job-name aggr_droid \
--partition your_partition \
--workers 2048 \
--cpus-per-task 8 \
--mem-per-cpu 1950M
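Conceptually, aggregation merges the per-shard outputs into the consolidated file-based layout (the real script also re-indexes episodes and handles videos and stats). A rough illustration of the data side only, using pyarrow and placeholder paths:

```python
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder layout: one LeRobot dataset per ported shard.
shard_files = sorted(Path("/your/data/shards").glob("*/data/chunk-000/file-000.parquet"))
tables = [pq.read_table(f) for f in shard_files]

# Write one consolidated chunk containing every shard's episodes.
out_path = Path("/your/data/aggregated/data/chunk-000/file-000.parquet")
out_path.parent.mkdir(parents=True, exist_ok=True)
pq.write_table(pa.concat_tables(tables), out_path)
```

Once aggregation finishes, upload the consolidated dataset to the Hub: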
python examples/port_datasets/slurm_upload.py \
--repo-id your_id/droid_1.0.1 \
--logs-dir /your/logs \
--job-name upload_droid \
--partition your_partition \
--workers 50 \
--cpus-per-task 4 \
--mem-per-cpu 1950M
[!NOTE] Upload uses fewer workers (50) since it's network-bound rather than compute-bound.
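If you are not on a SLURM cluster, the aggregated dataset can also be pushed directly with the `huggingface_hub` client. A minimal sketch, assuming a recent `huggingface_hub` release that provides `upload_large_folder` and using placeholder paths:

```python
from huggingface_hub import HfApi

api = HfApi()
api.create_repo("your_id/droid_1.0.1", repo_type="dataset", exist_ok=True)

# upload_large_folder resumes interrupted transfers and parallelizes uploads,
# which matters for a ~400GB dataset.
api.upload_large_folder(
    repo_id="your_id/droid_1.0.1",
    folder_path="/your/data/aggregated",
    repo_type="dataset",
)
```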
Your completed dataset will have this modern structure:
dataset/
├── meta/
│   ├── episodes/
│   │   └── chunk-000/
│   │       └── file-000.parquet  # Episode metadata
│   ├── tasks.parquet             # Task definitions
│   ├── stats.json                # Aggregated statistics
│   └── info.json                 # Dataset information
├── data/
│   └── chunk-000/
│       └── file-000.parquet      # Consolidated episode data
└── videos/
    └── camera_key/
        └── chunk-000/
            └── file-000.mp4      # Consolidated video files
This replaces the old episode-per-file structure with efficient, optimally-sized chunks.
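A quick sanity check on the ported dataset is to load it and pull a training batch. A sketch assuming the same `LeRobotDataset` import as above and that PyTorch is installed:

```python
import torch
from lerobot.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("your_id/droid_1.0.1")
print(f"{dataset.meta.total_episodes} episodes, {dataset.meta.total_frames} frames")

# LeRobotDataset is a regular torch Dataset, so it plugs into a DataLoader directly.
loader = torch.utils.data.DataLoader(dataset, batch_size=8, shuffle=True, num_workers=4)
batch = next(iter(loader))
print(batch["observation.state"].shape)  # e.g. torch.Size([8, 8]) for DROID
```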
If you have existing datasets in v2.1 format, use the migration tool:
python src/lerobot/datasets/v30/convert_dataset_v21_to_v30.py \
--repo-id your_id/existing_dataset
This automatically consolidates the per-episode data and video files into file-based chunks and rewrites the episode metadata in the new parquet-based format.
Dataset v3.0 provides significant improvements for large datasets: far fewer files, consolidated data and video chunks, and structured parquet metadata, which together improve loading performance and reduce storage overhead.