Meta-World is an open-source simulation benchmark for multi-task and meta reinforcement learning in continuous-control robotic manipulation. It bundles 50 diverse manipulation tasks using everyday objects and a common tabletop Sawyer arm, providing a standardized playground to test whether algorithms can learn many different tasks and generalize quickly to new ones.
Meta-World provides 50 tasks organized into difficulty groups. In LeRobot, you can evaluate on individual tasks, difficulty groups, or the full MT50 suite:
| Group | CLI name | Tasks | Description |
|---|---|---|---|
| Easy | easy | 28 | Tasks with simple dynamics and single-step goals |
| Medium | medium | 11 | Tasks requiring multi-step reasoning |
| Hard | hard | 6 | Tasks with complex contacts and precise manipulation |
| Very Hard | very_hard | 5 | The most challenging tasks in the suite |
| MT50 (all) | Comma-separated list | 50 | All 50 tasks — the most challenging multi-task setting |
You can also pass individual task names directly (e.g., assembly-v3, dial-turn-v3).
We provide a LeRobot-ready dataset for Meta-World MT50 on the HF Hub: lerobot/metaworld_mt50. This dataset is formatted for the MT50 evaluation that uses all 50 tasks with fixed object/goal positions and one-hot task vectors for consistency.
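To make the "one-hot task vector" idea concrete, here is a minimal sketch in plain Python. The helper name `one_hot` and the task ordering are assumptions for illustration only; the index-to-task mapping actually used by the dataset is not specified here.

```python
def one_hot(task_index: int, num_tasks: int = 50) -> list[float]:
    """Return a vector of length num_tasks with 1.0 at task_index, 0.0 elsewhere."""
    if not 0 <= task_index < num_tasks:
        raise ValueError(f"task_index must be in [0, {num_tasks}), got {task_index}")
    vec = [0.0] * num_tasks
    vec[task_index] = 1.0
    return vec


# Hypothetical ordering of three tasks, used only for this example.
task_names = ["assembly-v3", "dial-turn-v3", "handle-press-side-v3"]
task_vec = one_hot(task_names.index("dial-turn-v3"), num_tasks=len(task_names))
print(task_vec)  # [0.0, 1.0, 0.0]
```

For MT50 the vector would have 50 entries, one per task, so a policy can tell tasks apart even when the observations look similar.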
After following the LeRobot installation instructions:
pip install -e ".[metaworld]"
pip install "gymnasium==1.1.0"
Evaluate on the medium difficulty split (a good balance of coverage and compute):
lerobot-eval \
--policy.path="your-policy-id" \
--env.type=metaworld \
--env.task=medium \
--eval.batch_size=1 \
--eval.n_episodes=10
Evaluate on a specific task:
lerobot-eval \
--policy.path="your-policy-id" \
--env.type=metaworld \
--env.task=assembly-v3 \
--eval.batch_size=1 \
--eval.n_episodes=10
Evaluate across multiple tasks or difficulty groups:
lerobot-eval \
--policy.path="your-policy-id" \
--env.type=metaworld \
--env.task=assembly-v3,dial-turn-v3,handle-press-side-v3 \
--eval.batch_size=1 \
--eval.n_episodes=10
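If you launch evaluations from a script, the command above can be assembled programmatically. The following is a sketch only: `build_eval_command` is a hypothetical helper, and it uses just the flags shown in the commands on this page.

```python
import shlex


def build_eval_command(policy_path, tasks, n_episodes=10, batch_size=1):
    """Build the lerobot-eval argument list for a list of tasks or groups."""
    task_arg = ",".join(tasks)  # --env.task takes a comma-separated list
    return [
        "lerobot-eval",
        f"--policy.path={policy_path}",
        "--env.type=metaworld",
        f"--env.task={task_arg}",
        f"--eval.batch_size={batch_size}",
        f"--eval.n_episodes={n_episodes}",
    ]


cmd = build_eval_command("your-policy-id", ["assembly-v3", "dial-turn-v3"])
print(shlex.join(cmd))  # shell-safe, copy-pastable command line
```

The returned list can be passed as-is to `subprocess.run(cmd, check=True)` once LeRobot is installed.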
- --env.task accepts explicit task lists (comma-separated) or difficulty groups (e.g., easy, medium, hard, very_hard).
- --eval.batch_size controls how many environments run in parallel.
- --eval.n_episodes sets how many episodes to run per task.

Observations:

- observation.image: single camera view (corner2), 480x480 HWC uint8
- observation.state: 4-dim proprioceptive state (end-effector position + gripper)

Actions:

- Box(-1, 1, shape=(4,)): 3D end-effector delta + 1D gripper

For reproducible benchmarking, use 10 episodes per task. For the full MT50 suite this gives 500 total episodes. If you care about generalization, run on the full MT50: it is intentionally challenging and reveals strengths and weaknesses better than a few narrow tasks.
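The episode budget follows directly from the group sizes listed in the table. A small sketch of the arithmetic, where `GROUP_SIZES` and `total_episodes` are illustrative names rather than part of LeRobot:

```python
# Task counts per difficulty group, copied from the table on this page.
GROUP_SIZES = {"easy": 28, "medium": 11, "hard": 6, "very_hard": 5}


def total_episodes(groups, episodes_per_task=10):
    """Total episodes when every task in the given groups runs episodes_per_task times."""
    return sum(GROUP_SIZES[g] for g in groups) * episodes_per_task


print(total_episodes(GROUP_SIZES))  # 500 (all four groups, i.e. MT50)
print(total_episodes(["medium"]))   # 110
```

This is useful for estimating wall-clock cost before committing to a full MT50 run.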
Train a SmolVLA policy on a subset of Meta-World tasks:
lerobot-train \
--policy.type=smolvla \
--policy.repo_id=${HF_USER}/metaworld-test \
--policy.load_vlm_weights=true \
--dataset.repo_id=lerobot/metaworld_mt50 \
--env.type=metaworld \
--env.task=assembly-v3,dial-turn-v3,handle-press-side-v3 \
--output_dir=./outputs/ \
--steps=100000 \
--batch_size=4 \
--eval.batch_size=1 \
--eval.n_episodes=1 \
--eval_freq=1000
- Use the info["is_success"] keys when writing post-processing or logging so your success metrics line up with the benchmark.
- Adjust batch_size, steps, and eval_freq to match your compute budget.
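As an illustration of success-based post-processing, the sketch below aggregates per-task success rates from the final-step `info` dict of each episode. The `(task_name, info)` record format and the `success_rates` helper are assumptions made for this example, not a LeRobot API.

```python
from collections import defaultdict


def success_rates(episodes):
    """Compute per-task success rate from (task_name, final_info) pairs."""
    counts = defaultdict(lambda: [0, 0])  # task -> [successes, episodes]
    for task, info in episodes:
        # A missing key is counted as a failure.
        counts[task][0] += int(bool(info.get("is_success", False)))
        counts[task][1] += 1
    return {task: s / n for task, (s, n) in counts.items()}


# Hypothetical evaluation records, for illustration only.
records = [
    ("assembly-v3", {"is_success": True}),
    ("assembly-v3", {"is_success": False}),
    ("dial-turn-v3", {"is_success": True}),
]
print(success_rates(records))  # {'assembly-v3': 0.5, 'dial-turn-v3': 1.0}
```

Reporting success rate per task, rather than a single pooled number, makes it easier to see which difficulty groups a policy struggles with.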