examples/bert/README.md
<a id="markdown-training-setup" name="training-setup"></a>
To run the model using a docker container run it as follows
PYTORCH_IMAGE=nvcr.io/nvidia/pytorch:24.01-py3
CHECKPOINT_PATH="" #<Specify path>
TENSORBOARD_LOGS_PATH=""#<Specify path>
VOCAB_FILE="" #<Specify path to file>//bert-vocab.txt
DATA_PATH="" #<Specify path and file prefix>_text_document
docker run \
--gpus=all \
--ipc=host \
--workdir /workspace/megatron-lm \
-v /path/to/data:/path/to/data \
-v /path/to/megatron-lm:/workspace/megatron-lm \
megatron-lm nvcr.io/nvidia/pytorch:24.01-py3 \
bash examples/bert/train_bert_340m_distributed.sh $CHECKPOINT_PATH $TENSORBOARD_LOGS_PATH $VOCAB_FILE $DATA_PATH "
NOTE: Depending on the environment you are running it the above command might like slightly different.
<a id="markdown-configurations" name="configurations"></a> The example in this folder shows you how to run 340m large model. There are other configs you could run as well
--num-layers 48 \
--hidden-size 2560 \
--num-attention-heads 32 \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
--num-layers 48 \
--hidden-size 6144 \
--num-attention-heads 96 \
--tensor-model-parallel-size 4 \
--pipeline-model-parallel-size 4 \