# Llama Models with FP8 Training

<a id="overview" name="overview"></a>
## Overview

Train Llama models using FP8 precision with Megatron-Core.
<a id="prerequisites" name="prerequisites"></a>
# Clone repository
export HOST_MEGATRON_LM_DIR="/path/to/your/host/megatron-lm"
git clone https://github.com/NVIDIA/Megatron-LM.git "$HOST_MEGATRON_LM_DIR"
cd "$HOST_MEGATRON_LM_DIR"
git checkout "core_r0.12.0"
# Set paths
export HOST_CHECKPOINT_PATH="./checkpoints/llama3_8b_fp8"
export HOST_TENSORBOARD_LOGS_PATH="./tensorboard_logs/llama3_8b_fp8"
# Optional: For real data
# export HOST_TOKENIZER_MODEL_PATH="/path/to/host/tokenizer.model"
# export HOST_DATA_PREFIX="/path/to/host/mydata_prefix"
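If the checkpoint and TensorBoard directories do not already exist on the host, create them before they are bind-mounted below; a minimal sketch, assuming the paths set above:

```bash
# Docker creates missing bind-mount sources as root-owned directories;
# creating them up front keeps them owned by the current user.
mkdir -p "${HOST_CHECKPOINT_PATH}" "${HOST_TENSORBOARD_LOGS_PATH}"
```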
<a id="training-setup" name="training-setup"></a>
PYTORCH_IMAGE="nvcr.io/nvidia/pytorch:25.03-py3"
docker run --rm --gpus all --ipc=host --ulimit memlock=-1 \
-v "${HOST_MEGATRON_LM_DIR}:/workspace/megatron-lm" \
-v "${HOST_CHECKPOINT_PATH}:/workspace/checkpoints" \
-v "${HOST_TENSORBOARD_LOGS_PATH}:/workspace/tensorboard_logs" \
--workdir /workspace/megatron-lm \
$PYTORCH_IMAGE \
bash examples/llama/train_llama3_8b_h100_fp8.sh \
/workspace/checkpoints \
/workspace/tensorboard_logs \
2>&1 | tee "${HOST_TENSORBOARD_LOGS_PATH}/training_mock_$(date +'%y-%m-%d_%H-%M-%S').log"
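To monitor progress, you can point TensorBoard on the host at the mounted log directory (assuming TensorBoard is installed on the host):

```bash
# Serves the event files the container writes into the mounted directory
tensorboard --logdir "${HOST_TENSORBOARD_LOGS_PATH}" --port 6006
```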
Run training with a custom tokenizer and dataset (requires the optional variables from the prerequisites):

```bash
PYTORCH_IMAGE="nvcr.io/nvidia/pytorch:25.03-py3"

docker run --rm --gpus all --ipc=host --ulimit memlock=-1 \
  -v "${HOST_MEGATRON_LM_DIR}:/workspace/megatron-lm" \
  -v "${HOST_CHECKPOINT_PATH}:/workspace/checkpoints" \
  -v "${HOST_TENSORBOARD_LOGS_PATH}:/workspace/tensorboard_logs" \
  -v "${HOST_TOKENIZER_MODEL_PATH}:/workspace/tokenizer_model" \
  -v "$(dirname "${HOST_DATA_PREFIX}"):/workspace/data_dir" \
  --workdir /workspace/megatron-lm \
  "$PYTORCH_IMAGE" \
  bash examples/llama/train_llama3_8b_h100_fp8.sh \
    /workspace/checkpoints \
    /workspace/tensorboard_logs \
    /workspace/tokenizer_model \
    "/workspace/data_dir/$(basename "${HOST_DATA_PREFIX}")" \
  2>&1 | tee "${HOST_TENSORBOARD_LOGS_PATH}/training_custom_$(date +'%y-%m-%d_%H-%M-%S').log"
```
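Megatron's indexed datasets are a `.bin`/`.idx` pair sharing the data prefix; a quick sanity check before launching, assuming the dataset has already been preprocessed as described under Test Datasets:

```bash
# Both files must exist for the prefix the training script receives
ls -lh "${HOST_DATA_PREFIX}.bin" "${HOST_DATA_PREFIX}.idx"
```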
<a id="configuration" name="configuration"></a>
Default parallelism strategy:
Llama-3-8B architecture:
Key training parameters:
You can modify these parameters directly in the train_llama3_8b_h100_fp8.sh script.
This configuration follows those defined in NeMo Framework's performance scripts, which can be found at https://github.com/NVIDIA/NeMo/tree/main/scripts/performance.
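For orientation, the script ultimately passes standard Megatron-LM flags to the training entrypoint; a hedged sketch of the relevant subset, with values mirroring the LLAMA3-8B row in the table below (the actual script may name and group these differently):

```bash
# Illustrative flag subset, not a verbatim copy of the script
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
--context-parallel-size 2 \
--micro-batch-size 1 \
--global-batch-size 128 \
--seq-length 8192 \
--fp8-format hybrid
```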
| Model | #-GPUs | GBS | MBS | Seq Length | TP | PP | CP | VP | EP | GA | Tokens/sec/GPU | TFLOP/sec/GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLAMA3-8B | 8 | 128 | 1 | 8192 | 1 | 1 | 2 | 1 | 1 | 32 | 13812 | 800 |
| LLAMA3-70B | 64 | 128 | 1 | 8192 | 4 | 8 | 1 | 5 | 1 | 64 | 1621 | 780 |
| LLAMA3-405B | 1024 | 512 | 1 | 8192 | 8 | 8 | 2 | 8 | 1 | 64 | 315 | 834 |
Legend:
- GBS / MBS: global / micro batch size
- TP / PP / CP / VP / EP: tensor, pipeline, context, virtual-pipeline, and expert parallel sizes
- GA: gradient accumulation steps
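The GA column follows from the others: the data-parallel size is DP = #GPUs / (TP × PP × CP), and GA = GBS / (MBS × DP). For LLAMA3-8B, DP = 8 / (1 × 1 × 2) = 4, so GA = 128 / (1 × 4) = 32, matching the table.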
Since NeMo is built on Megatron-Core, refer to the official NeMo documentation for the latest performance benchmarks.
<a id="test-datasets" name="test-datasets"></a>
Recommended datasets:
Preprocess your dataset into Megatron's indexed format:

```bash
python "${HOST_MEGATRON_LM_DIR}/tools/preprocess_data.py" \
    --input your_dataset.json \
    --output-prefix test_dataset \
    --tokenizer-type HuggingFaceTokenizer \
    --tokenizer-model /path/to/tokenizer.model \
    --append-eod
```
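The tool writes a `.bin`/`.idx` pair named from the output prefix plus the JSON key (`text` by default); assuming those defaults, the matching data prefix for training would be:

```bash
# Produced files: test_dataset_text_document.bin / test_dataset_text_document.idx
export HOST_DATA_PREFIX="$(pwd)/test_dataset_text_document"
```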
<a id="fp8-training-considerations" name="fp8-training-considerations"></a>
Hardware: Requires NVIDIA Hopper, Ada, or Blackwell GPUs for FP8 support
Troubleshooting: If you encounter NaN values or instability with FP8 training, please refer to Transformer Engine.
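A common first lever for FP8 stability is the scaling recipe, which Megatron-LM exposes through Transformer Engine-backed flags; a hedged sketch, with illustrative values rather than a tuning recommendation:

```bash
# Hedged sketch of FP8 stability flags (illustrative values):
#   hybrid format -> E4M3 for forward tensors, E5M2 for gradients
#   longer amax history -> smoother scaling-factor updates
--fp8-format hybrid \
--fp8-amax-history-len 1024 \
--fp8-amax-compute-algo max
```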