docs/content/concepts/train.md
A Rerun catalog can feed training pipelines two ways: export recordings to a standard format, or stream them directly into a PyTorch DataLoader.
The catalog exposes recordings as queryable DataFrames via DataFusion. Multi-rate sensor streams can be time-aligned and columns of interest extracted, with the result written to whatever format a training pipeline expects.
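As a rough illustration of what time-aligning multi-rate streams means, the sketch below joins a slow stream against a fast one with a "latest value at or before" lookup. It is plain Python rather than the catalog's DataFusion API; the `latest_at` helper, the stream names, and the millisecond timestamps are all made up for the example.

```python
from bisect import bisect_right


def latest_at(timestamps, values, query_time):
    """Return the most recent value at or before query_time, or None if there is none.

    `timestamps` must be sorted ascending and parallel to `values`.
    """
    pos = bisect_right(timestamps, query_time)
    return values[pos - 1] if pos > 0 else None


# Hypothetical streams, timestamps in milliseconds:
# camera frames at roughly 30 Hz, joint states at 100 Hz.
camera_t = [0, 33, 66, 100]
camera_v = ["img0", "img1", "img2", "img3"]
joint_t = list(range(0, 101, 10))
joint_v = list(range(len(joint_t)))  # stand-in for real joint readings

# One row per camera frame, paired with the joint state current at that time.
aligned = [
    (t, img, latest_at(joint_t, joint_v, t))
    for t, img in zip(camera_t, camera_v)
]
```

Each row of `aligned` is keyed on the slower stream, so the output has one sample per camera frame regardless of how many joint readings fall in between.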
See Export recordings to LeRobot datasets for a worked example.
The experimental rerun.experimental.dataloader module wraps a catalog as iterable or map-style PyTorch datasets, with no intermediate export step.
Three things describe a dataset (see reference):
- DataSource: a catalog DatasetEntry with an optional segment filter; each registered RRD is one segment, typically one episode or trajectory
- index: the timeline that defines what "one sample" means (e.g. "frame_index" or "real_time")
- fields: a dict of Fields, each mapping a source column (an entity:Archetype:component triple) to a decoder

SampleIndex pre-computes the full sample space from lightweight per-segment index-range metadata: one query per segment, not a scan of the data.
For timestamp timelines, FixedRateSampling defines the sampling grid and the server handles drift between grid and real row positions via fill_latest_at.
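To make the idea of a sample space built from index ranges concrete, here is a minimal sketch. `ToySampleIndex` is a hypothetical stand-in, not the Rerun `SampleIndex` class, and it assumes nanosecond timestamps, a grid that starts at each segment's first timestamp, and a rate that divides one second evenly. It needs only each segment's `(start, end)` range to count samples and to map a global sample number back to a segment and a grid timestamp.

```python
from bisect import bisect_right
from itertools import accumulate

NS_PER_SECOND = 1_000_000_000


class ToySampleIndex:
    """Hypothetical illustration: global sample index built from per-segment ranges."""

    def __init__(self, segment_ranges, rate_hz):
        # segment_ranges: {segment_id: (start_ns, end_ns)}, metadata only, no row data.
        self.step_ns = NS_PER_SECOND // rate_hz
        self.segments = list(segment_ranges.items())
        counts = [
            (end_ns - start_ns) // self.step_ns + 1  # grid points in [start, end]
            for _, (start_ns, end_ns) in self.segments
        ]
        self.offsets = list(accumulate(counts))  # running totals per segment

    def __len__(self):
        return self.offsets[-1] if self.offsets else 0

    def locate(self, global_index):
        """Map a global sample number to (segment_id, grid_timestamp_ns)."""
        if not 0 <= global_index < len(self):
            raise IndexError(global_index)
        seg_pos = bisect_right(self.offsets, global_index)
        first = self.offsets[seg_pos - 1] if seg_pos > 0 else 0
        segment_id, (start_ns, _) = self.segments[seg_pos]
        return segment_id, start_ns + (global_index - first) * self.step_ns


index = ToySampleIndex(
    {
        "episode_0": (0, 1_000_000_000),
        "episode_1": (500_000_000, 950_000_000),
    },
    rate_hz=10,
)
```

A grid timestamp produced this way need not coincide with a logged row; per the text above, that gap is what the server resolves with fill_latest_at.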
Each Field has a ColumnDecoder (_decoders.py) that converts a raw Arrow column to a torch.Tensor:
- NumericDecoder: scalars and numeric lists
- ImageDecoder: JPEG/PNG blobs
- VideoFrameDecoder: compressed video (h264/h265/av1)

Field(window=(start, end)) returns a slice of values across an inclusive range relative to the current sample rather than a single value.
This is how action chunks and observation history are expressed.
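The window semantics can be sketched in a few lines. The helper below is hypothetical and assumes an integer index timeline; clamping at the segment boundaries is an assumption made for this example, not documented behavior of the module.

```python
def window_indices(current, window, num_rows):
    """Row indices covered by an inclusive (start, end) window around `current`.

    Offsets are relative to the current sample. Out-of-range indices are
    clamped to the first or last row (an assumption for this sketch).
    """
    start, end = window
    if start > end:
        raise ValueError("window start must not exceed window end")
    return [
        min(max(current + offset, 0), num_rows - 1)
        for offset in range(start, end + 1)  # end is inclusive
    ]


# Action chunk: the current step plus the next three.
action_chunk = window_indices(current=5, window=(0, 3), num_rows=8)

# Observation history: the two previous steps plus the current one.
history = window_indices(current=1, window=(-2, 0), num_rows=8)
```

A window of `(0, 3)` yields four values and `(-2, 0)` yields three, because both ends of the range are included.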
- RerunIterableDataset: streaming with automatic shuffling and cross-worker and DDP partitioning
- RerunMapDataset: random access by global index; works with PyTorch samplers like DistributedSampler and WeightedRandomSampler

See Train PyTorch models with Rerun for usage.
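For intuition about cross-worker and DDP partitioning, the sketch below shows one common scheme: a strided split over every (rank, worker) pair. It is a generic pattern with a hypothetical `partition` helper, not a description of how RerunIterableDataset divides work internally.

```python
def partition(items, rank, world_size, worker_id, num_workers):
    """Strided split so each (rank, worker) pair sees a disjoint subset of items."""
    consumer = rank * num_workers + worker_id  # unique id across all processes
    total_consumers = world_size * num_workers
    return items[consumer::total_consumers]


# Hypothetical setup: 10 segments, 2 DDP ranks, 2 DataLoader workers per rank.
segments = list(range(10))
shards = {
    (rank, worker): partition(segments, rank, 2, worker, 2)
    for rank in range(2)
    for worker in range(2)
}
```

In a real PyTorch job the worker id would typically come from `torch.utils.data.get_worker_info()` and the rank from `torch.distributed.get_rank()`. Shard sizes differ whenever the item count is not a multiple of the number of consumers, which DDP training loops usually have to account for.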