Megatron Bridge provides a quick and convenient way to convert HuggingFace checkpoints to the Megatron format used by Megatron-LM. Follow the instructions in the Megatron-Bridge installation guide to run the NeMo Docker container and convert checkpoints via mounted volumes. Make sure that both the HuggingFace cache location and the Megatron checkpoint location are mounted into the container; otherwise the converted model may not be saved to disk correctly.
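For example, a container invocation with both locations mounted might look like the following (a minimal sketch; the image tag and host paths are placeholders to adapt to your setup):

```bash
# Sketch: start the PyTorch container with the HuggingFace cache and a host
# checkpoint directory mounted, so downloads and converted models persist.
docker run --rm -it --gpus all --ipc=host \
  -v "${HOME}/.cache/huggingface:/root/.cache/huggingface" \
  -v "$(pwd)/megatron_checkpoints:/workspace/megatron_checkpoints" \
  nvcr.io/nvidia/pytorch:25.12-py3 bash
```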
Below is an example of how to use Megatron-Bridge inside the PyTorch container to convert a HuggingFace model checkpoint to Megatron format.

Reference: Megatron-Bridge Dockerfile

Inside the PyTorch container, run the following commands to install Megatron-Bridge:
```bash
cd /opt
git clone --recursive https://github.com/NVIDIA-NeMo/Megatron-Bridge.git
cd Megatron-Bridge

# Make sure submodules are initialized (for 3rdparty/Megatron-LM)
git submodule update --init --recursive

export PATH="/root/.local/bin:$PATH"
export UV_PROJECT_ENVIRONMENT=/opt/venv
export VIRTUAL_ENV=/opt/venv
export PATH="$UV_PROJECT_ENVIRONMENT/bin:$PATH"
export UV_LINK_MODE=copy
export UV_VERSION="0.7.2"

# Install uv
curl -LsSf https://astral.sh/uv/${UV_VERSION}/install.sh | sh

# Create the virtual environment and build the package
uv venv ${UV_PROJECT_ENVIRONMENT} --system-site-packages
uv sync --locked --only-group build
uv sync --locked --link-mode copy --all-extras --all-groups
uv pip install --no-deps -e .

source ${UV_PROJECT_ENVIRONMENT}/bin/activate
```
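After activation, a quick smoke test can confirm that the editable install is importable (assuming the package exposes the `megatron.bridge` namespace, as in the project README):

```bash
# Smoke test: should print the installed package location without raising.
python3 -c "import megatron.bridge; print(megatron.bridge.__file__)"
```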
Next, clone Megatron-LM; this directory will also be mounted into the training container later:

```bash
export HOST_MEGATRON_LM_DIR="/path/to/your/host/megatron-lm"
git clone https://github.com/NVIDIA/Megatron-LM.git "$HOST_MEGATRON_LM_DIR"
cd "$HOST_MEGATRON_LM_DIR"
```

Set your HuggingFace token so gated model weights can be downloaded:

```bash
export HF_TOKEN={your_hf_token_here}
```
Set `--nproc-per-node` to the number of GPUs per node, and set `--hf-model` to the HuggingFace model name, e.g. `openai/gpt-oss-20b`:
```bash
python3 -m torch.distributed.launch --nproc-per-node=8 examples/gptoss/01_convert_from_hf.py --hf-model openai/gpt-oss-20b
```
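If the conversion succeeds, the Megatron-format checkpoint should land in a model-specific directory such as `./megatron_checkpoints/openai_gpt-oss_20b`; the training and export steps below assume this layout.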
To train from scratch, first follow the steps below to set up the environment before running the training script in Docker. Even though we are running the same container as before, it is better to restart the container to ensure a clean environment and that all environment variables and Docker settings are applied correctly. The following example uses 8x GB300 GPUs; change the number of GPUs and nodes as needed.
```bash
# Change these based on the model and directory from the previous conversion step.
# Use absolute paths here: docker's -v flag does not accept relative host paths.
export MODEL_DIR_NAME="openai_gpt-oss_20b"
export HOST_CHECKPOINT_PATH="$(pwd)/megatron_checkpoints/${MODEL_DIR_NAME}"
export HOST_TENSORBOARD_LOGS_PATH="$(pwd)/tensorboard_logs/${MODEL_DIR_NAME}"
```
By default, the example below trains the model on mock data. To use your own data, set the following environment variables:
```bash
# Optional: for real data
export HOST_TOKENIZER_MODEL_PATH="/path/to/host/tokenizer.model"
export HOST_DATA_PREFIX="/path/to/host/mydata_prefix"
```
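If you do not already have data in Megatron's indexed format, Megatron-LM's `tools/preprocess_data.py` can build the `.bin`/`.idx` files behind such a prefix. A minimal sketch, assuming a JSONL corpus with one `{"text": ...}` record per line (adjust the tokenizer flags to your setup, and note that the tool appends a suffix such as `_text_document` to the output prefix):

```bash
# Sketch: tokenize a JSONL corpus into Megatron's indexed dataset format.
python3 tools/preprocess_data.py \
  --input /path/to/corpus.jsonl \
  --output-prefix /path/to/host/mydata \
  --tokenizer-type HuggingFaceTokenizer \
  --tokenizer-model openai/gpt-oss-20b \
  --workers 8 \
  --append-eod
# HOST_DATA_PREFIX would then point at /path/to/host/mydata_text_document
```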
Run the following to create a `distributed_config.env` file with the appropriate distributed training configuration. Change the values as needed for your setup; this file overrides the default values in `02_train.sh`.
```bash
cat > ./distributed_config.env << 'EOF'
GPUS_PER_NODE=8
NUM_NODES=1
MASTER_ADDR=localhost
MASTER_PORT=6000
NODE_RANK=0
EOF
```
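For a multi-node run, every node would use the same file contents except for `NODE_RANK`, with `MASTER_ADDR` pointing at a host reachable from all nodes. A sketch for the first of two nodes (the hostname is a placeholder):

```bash
# Hypothetical two-node setup: on the second node, set NODE_RANK=1.
cat > ./distributed_config.env << 'EOF'
GPUS_PER_NODE=8
NUM_NODES=2
MASTER_ADDR=node0.example.com
MASTER_PORT=6000
NODE_RANK=0
EOF
```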
NOTE: This container runs the example training script `02_train.sh` located in the `examples/gptoss` directory. By default, only pipeline parallelism is enabled, set to the number of GPUs. Adjust `TP_SIZE`, `EP_SIZE`, `PP_SIZE`, etc. in `02_train.sh` as needed. You can also adjust model settings such as `--hidden-size`, `--ffn-hidden-size`, `--num-attention-heads`, and `NUM_LAYERS`. See the sketch below for how the parallelism sizes relate to the GPU count.
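As a rule of thumb, the product of tensor, pipeline, and data parallel sizes must equal the world size (`GPUS_PER_NODE * NUM_NODES`). A sketch of what the knobs might look like for a single 8-GPU node (illustrative values only; the actual defaults live in `02_train.sh`):

```bash
# Illustrative parallelism settings for one 8-GPU node.
# Constraint: TP_SIZE * PP_SIZE * data-parallel size = GPUS_PER_NODE * NUM_NODES,
# and EP_SIZE (expert parallelism for the MoE layers) typically must divide
# the data-parallel size.
TP_SIZE=1   # tensor model parallelism
PP_SIZE=8   # pipeline model parallelism (the default noted above)
EP_SIZE=1   # expert parallelism
```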
To train using mock data, run the following command:
```bash
PYTORCH_IMAGE="nvcr.io/nvidia/pytorch:25.12-py3"
# Create the logs directory up front so tee can open its log file on the host.
mkdir -p "${HOST_TENSORBOARD_LOGS_PATH}"
docker run --rm --gpus all --ipc=host --ulimit memlock=-1 \
  -v "${HOST_MEGATRON_LM_DIR}:/workspace/megatron-lm" \
  -v "${HOST_CHECKPOINT_PATH}:/workspace/checkpoints" \
  -v "${HOST_TENSORBOARD_LOGS_PATH}:/workspace/tensorboard_logs" \
  -v "$(pwd)/distributed_config.env:/workspace/megatron-lm/examples/gptoss/distributed_config.env" \
  --workdir /workspace/megatron-lm \
  $PYTORCH_IMAGE \
  bash examples/gptoss/02_train.sh \
    --checkpoint-path /workspace/checkpoints \
    --tensorboard-logs-path /workspace/tensorboard_logs \
    --distributed-config-file /workspace/megatron-lm/examples/gptoss/distributed_config.env \
  2>&1 | tee "${HOST_TENSORBOARD_LOGS_PATH}/training_mock_$(date +'%y-%m-%d_%H-%M-%S').log"
```
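While training runs, you can point TensorBoard at the mounted logs directory on the host to follow progress:

```bash
# The logs directory is populated by the container via the volume mount above.
tensorboard --logdir "${HOST_TENSORBOARD_LOGS_PATH}"
```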
Note: If you run into issues generating mock data, one workaround is to reduce the number of GPUs to 1 and generate the data again.
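For example, you could temporarily switch the config file created above to a single GPU, rerun the training command to generate the data, and then restore the original value:

```bash
# Temporarily generate mock data on a single GPU; restore GPUS_PER_NODE afterwards.
sed -i 's/^GPUS_PER_NODE=.*/GPUS_PER_NODE=1/' ./distributed_config.env
```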
If using real data, with the HOST_TOKENIZER_MODEL_PATH and HOST_DATA_PREFIX environment variables set, run the following command instead:
```bash
PYTORCH_IMAGE="nvcr.io/nvidia/pytorch:25.12-py3"
# Create the logs directory up front so tee can open its log file on the host.
mkdir -p "${HOST_TENSORBOARD_LOGS_PATH}"
docker run --rm --gpus all --ipc=host --ulimit memlock=-1 \
  -v "${HOST_MEGATRON_LM_DIR}:/workspace/megatron-lm" \
  -v "${HOST_CHECKPOINT_PATH}:/workspace/checkpoints" \
  -v "${HOST_TENSORBOARD_LOGS_PATH}:/workspace/tensorboard_logs" \
  -v "${HOST_TOKENIZER_MODEL_PATH}:/workspace/tokenizer_model" \
  -v "$(dirname "${HOST_DATA_PREFIX}"):/workspace/data_dir" \
  -v "$(pwd)/distributed_config.env:/workspace/megatron-lm/examples/gptoss/distributed_config.env" \
  --workdir /workspace/megatron-lm \
  $PYTORCH_IMAGE \
  bash examples/gptoss/02_train.sh \
    --checkpoint-path /workspace/checkpoints \
    --tensorboard-logs-path /workspace/tensorboard_logs \
    --tokenizer /workspace/tokenizer_model \
    --data "/workspace/data_dir/$(basename "${HOST_DATA_PREFIX}")" \
    --distributed-config-file /workspace/megatron-lm/examples/gptoss/distributed_config.env \
  2>&1 | tee "${HOST_TENSORBOARD_LOGS_PATH}/training_custom_$(date +'%y-%m-%d_%H-%M-%S').log"
```
To convert the Megatron checkpoint produced by training back to the HuggingFace format so you can share it with others, run the following command (make sure you have the same virtual environment set up as in Step 0):
```bash
python3 -m torch.distributed.launch --nproc-per-node=8 examples/gptoss/03_convert_to_hf.py --hf-model openai/gpt-oss-20b --megatron-model ./megatron_checkpoints/openai_gpt-oss_20b
```