# Megatron Energon

An advanced multimodal data loader for efficient loading of text, images, video, and audio at scale.
Megatron Energon is purpose-built for large-scale multimodal training.
## Installation

```bash
pip install megatron-energon
```
## Basic Usage

```python
from megatron.energon import get_train_dataset, get_loader, WorkerConfig

# Create the dataset
ds = get_train_dataset(
    '/path/to/dataset',
    batch_size=32,
    shuffle_buffer_size=1000,
    worker_config=WorkerConfig.default_worker_config(),
)

# Create the loader and iterate over batches
for batch in get_loader(ds):
    # Training step
    pass
```
## Loading Multimodal Data

```python
# Load an image-text dataset
ds = get_train_dataset(
    '/path/to/multimodal/dataset',
    batch_size=32,
    worker_config=WorkerConfig(num_workers=8, prefetch_factor=2),
)

for batch in get_loader(ds):
    images = batch['image']  # Image tensors
    texts = batch['text']    # Text captions
    # Process the batch
```
## Dataset Blending

Mix multiple datasets with custom weights:

```python
from megatron.energon import Blender

blended_ds = Blender([
    ('/path/to/dataset1', 0.6),  # 60%
    ('/path/to/dataset2', 0.3),  # 30%
    ('/path/to/dataset3', 0.1),  # 10%
])
```
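Conceptually, weighted blending draws each next sample from a dataset chosen in proportion to its weight. A minimal, library-independent sketch of that selection logic (the `pick_dataset` and `blend` helpers are hypothetical illustrations, not part of the Energon API):

```python
import itertools
import random

def pick_dataset(weights, u):
    """Map a uniform draw u in [0, 1) to a dataset index via cumulative weights."""
    total = sum(weights)
    cumulative = 0.0
    for index, weight in enumerate(weights):
        cumulative += weight / total
        if u < cumulative:
            return index
    return len(weights) - 1  # Guard against floating-point rounding

def blend(iterators, weights, rng):
    """Yield samples, choosing the source iterator in proportion to its weight."""
    while True:
        yield next(iterators[pick_dataset(weights, rng.random())])

# Three toy "datasets" with the 60/30/10 weights used above
sources = [itertools.repeat(name) for name in ('a', 'b', 'c')]
stream = blend(sources, [0.6, 0.3, 0.1], random.Random(0))
samples = [next(stream) for _ in range(1000)]
```

Over many draws the sample counts approach the 60/30/10 ratio, which is the behavior the weights request.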
## Worker Configuration

Tune parallel loading through `WorkerConfig`:

```python
WorkerConfig(
    num_workers=8,            # Number of parallel worker processes
    prefetch_factor=2,        # Batches to prefetch per worker
    persistent_workers=True,  # Keep workers alive between epochs
)
```
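The idea behind `prefetch_factor` can be illustrated with a generic background-thread prefetcher that keeps a bounded queue of ready batches while the training loop consumes them. This is a simplified sketch of the technique, not Energon's actual worker implementation:

```python
import queue
import threading

def prefetching_loader(batches, prefetch_factor=2):
    """Yield batches while a background thread keeps up to
    `prefetch_factor` of them ready in a bounded queue."""
    buffer = queue.Queue(maxsize=prefetch_factor)
    sentinel = object()

    def producer():
        for batch in batches:
            buffer.put(batch)  # Blocks once `prefetch_factor` batches are queued
        buffer.put(sentinel)   # Signal that the source is exhausted

    threading.Thread(target=producer, daemon=True).start()
    while (item := buffer.get()) is not sentinel:
        yield item

# Consume ten toy batches; order is preserved while loading overlaps consumption
out = list(prefetching_loader(iter(range(10)), prefetch_factor=2))
```

A larger `prefetch_factor` hides more loading latency at the cost of holding more batches in memory.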
## Key Parameters

| Parameter | Description |
|---|---|
| `batch_size` | Samples per batch |
| `shuffle_buffer_size` | Buffer size for shuffle randomization |
| `max_samples_per_sequence` | Maximum samples packed into one sequence |
| `worker_config` | Worker configuration for parallel loading |
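`shuffle_buffer_size` trades memory for randomness: samples stream into a fixed-size buffer and each output is drawn at random from it, so a larger buffer approximates a fuller shuffle. A minimal sketch of this streaming-shuffle technique (not Energon's internal code):

```python
import random

def shuffle_buffer(samples, buffer_size, rng):
    """Streaming shuffle: once the fixed-size buffer fills, emit a
    randomly chosen buffered element for each incoming sample."""
    buffer = []
    for sample in samples:
        buffer.append(sample)
        if len(buffer) >= buffer_size:
            index = rng.randrange(len(buffer))
            # Swap-and-pop: emit a random element, keep the buffer compact
            buffer[index], buffer[-1] = buffer[-1], buffer[index]
            yield buffer.pop()
    rng.shuffle(buffer)  # Drain the remainder in random order
    yield from buffer

shuffled = list(shuffle_buffer(range(100), buffer_size=10, rng=random.Random(0)))
```

Every input sample is emitted exactly once, but an element can only move about `buffer_size` positions, which is why small buffers give weaker shuffling.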
## Integration with Megatron-LM

```python
from megatron.energon import get_train_dataset, get_loader
from megatron.training import get_args

args = get_args()

train_ds = get_train_dataset(
    args.data_path,
    batch_size=args.micro_batch_size,
)

for iteration, batch in enumerate(get_loader(train_ds)):
    loss = train_step(batch)  # train_step is your training-loop function
```