# Training Examples
Get started with Megatron Core training using these practical examples.
## Quick Start

The simplest way to get started is with the basic training loop using mock data:
```bash
# Distributed training on 2 GPUs with mock data
torchrun --nproc_per_node=2 examples/run_simple_mcore_train_loop.py
```
This example runs a minimal Megatron Core training loop across 2 GPUs on synthetic (mock) data, so no dataset preparation is required.
## LLaMA-3 8B with FP8

Train a LLaMA-3 8B model with FP8 mixed precision on 8 GPUs:
```bash
./examples/llama/train_llama3_8b_h100_fp8.sh
```
Configuration: 8B-parameter LLaMA-3 architecture, FP8 mixed precision, 8× H100 GPUs.
## Training with Your Own Data

To train on your own preprocessed data (see the note after the command on preparing it), launch `pretrain_gpt.py` directly:
```bash
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --tensor-model-parallel-size 1 \
    --pipeline-model-parallel-size 1 \
    --num-layers 32 \
    --hidden-size 4096 \
    --num-attention-heads 32 \
    --seq-length 2048 \
    --max-position-embeddings 2048 \
    --micro-batch-size 4 \
    --global-batch-size 32 \
    --train-iters 100000 \
    --lr 3.0e-4 \
    --min-lr 3.0e-5 \
    --lr-decay-style cosine \
    --lr-warmup-iters 2000 \
    --weight-decay 0.1 \
    --clip-grad 1.0 \
    --bf16 \
    --data-path /path/to/your/preprocessed_data \
    --split 949,50,1 \
    --save /path/to/checkpoints \
    --load /path/to/checkpoints \
    --log-interval 10 \
    --save-interval 1000 \
    --eval-interval 1000
```
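The `--data-path` argument expects data in Megatron's indexed binary format. If you have not prepared data yet, Megatron-LM's `tools/preprocess_data.py` converts loose-JSON text into that format. Below is a minimal sketch assuming a GPT-2 BPE tokenizer; the paths, tokenizer type, and vocabulary files are placeholders to adapt to your setup (exact flags can vary between Megatron-LM versions):

```bash
# Sketch: convert a loose-JSON corpus (one {"text": ...} object per line)
# into Megatron's indexed binary format. All paths and the tokenizer choice
# are placeholders -- adjust them for your dataset.
python tools/preprocess_data.py \
    --input /path/to/corpus.jsonl \
    --output-prefix /path/to/your/preprocessed_data \
    --tokenizer-type GPT2BPETokenizer \
    --vocab-file /path/to/gpt2-vocab.json \
    --merge-file /path/to/gpt2-merges.txt \
    --append-eod \
    --workers 8
```

This writes `.bin`/`.idx` files; pass their common prefix (without the extension) as `--data-path` when launching training.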
## GPT-3 Style Model

Train a GPT-3 style model with tensor and pipeline parallelism:
```bash
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --tensor-model-parallel-size 2 \
    --pipeline-model-parallel-size 2 \
    --num-layers 24 \
    --hidden-size 2048 \
    --num-attention-heads 16 \
    --seq-length 1024 \
    --max-position-embeddings 1024 \
    --micro-batch-size 2 \
    --global-batch-size 16 \
    --train-iters 100000 \
    --lr 1.5e-4 \
    --min-lr 1.0e-5 \
    --lr-decay-style cosine \
    --lr-warmup-iters 1000 \
    --weight-decay 0.1 \
    --clip-grad 1.0 \
    --fp16 \
    --data-path /path/to/preprocessed_data \
    --split 949,50,1 \
    --save /path/to/checkpoints \
    --load /path/to/checkpoints
```
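In this configuration the 8 GPUs are split into tensor-parallel groups of 2 and pipeline-parallel groups of 2, leaving a data-parallel size of 8 / (2 × 2) = 2. Each iteration therefore accumulates 16 / (2 × 2) = 4 micro-batches of 2 samples per data-parallel rank to reach the global batch size of 16.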
## Key Arguments

### Model Architecture

| Argument | Description |
|---|---|
| `--num-layers` | Number of transformer layers |
| `--hidden-size` | Hidden dimension size |
| `--num-attention-heads` | Number of attention heads |
| `--seq-length` | Sequence length for training |
### Batch Size and Training Schedule

| Argument | Description |
|---|---|
| `--micro-batch-size` | Batch size per GPU |
| `--global-batch-size` | Total batch size across all GPUs |
| `--train-iters` | Number of training iterations |
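These values are linked: the global batch size equals micro-batch size × data-parallel size × gradient-accumulation steps. In the first example above (8 GPUs with no tensor or pipeline parallelism), the data-parallel size is 8, so a global batch of 32 with micro-batches of 4 means 32 / (4 × 8) = 1 accumulation step per iteration.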
### Learning Rate Schedule

| Argument | Description |
|---|---|
| `--lr` | Peak learning rate |
| `--min-lr` | Minimum learning rate |
| `--lr-decay-style` | LR schedule (cosine, linear, constant) |
| `--lr-warmup-iters` | Warmup iterations |
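With the settings used above (`--lr 3.0e-4`, `--min-lr 3.0e-5`, `--lr-warmup-iters 2000`, `--lr-decay-style cosine`), the learning rate ramps up linearly over the first 2,000 iterations to the peak of 3.0e-4 and then follows a cosine curve down toward 3.0e-5 over the remaining iterations.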
### Precision

| Argument | Description |
|---|---|
| `--fp16` | FP16 mixed precision |
| `--bf16` | BF16 mixed precision (recommended) |
| `--fp8-hybrid` | FP8 mixed precision (Hopper/Ada/Blackwell) |
### Data and Checkpointing

| Argument | Description |
|---|---|
| `--data-path` | Path to preprocessed data |
| `--split` | Train/validation/test split ratio (e.g., 949,50,1) |
| `--save` | Checkpoint save directory |
| `--load` | Checkpoint load directory |
| `--save-interval` | Save checkpoint every N iterations |