NVLM
====
Please refer to the [NVLM paper](https://arxiv.org/abs/2409.11402) for details.
*NOTE: VLMs in Megatron are under active development and are expected to change.*
# Checkpoints

NVLM 1.0 model weights are publicly available in HuggingFace and Megatron format.
# Setup

## Docker image

Please use `examples/multimodal/Dockerfile`.
## Dataset preparation

Please refer to Tables 4 and 6 in the NVLM paper for the full list of pretraining and SFT datasets. Please refer to https://nvidia.github.io/Megatron-Energon/data_prep.html for preparing datasets in the Megatron Energon format.
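As a rough sketch of that workflow: Energon datasets are typically built from WebDataset-style tar shards and then annotated with metadata using the `energon prepare` tool. The path below is a placeholder, and the interactive prompts may differ between Energon versions.

```
# Install the Energon tooling (assumption: the pip package is megatron-energon).
pip install megatron-energon

# Point the interactive preparation tool at a directory of WebDataset shards;
# it asks for split ratios and a sample type, then writes the metadata folder
# that Megatron's Energon dataloader expects.
energon prepare /data/my_multimodal_dataset
```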
## Model conversion

### Vision model

NVLM 1.0 models use [OpenGVLab/InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) from HuggingFace. Please download it and run the following command to convert it to Megatron format:
```
python examples/multimodal/model_converter/internvit_converter.py --output-dir <some output dir> --use-te --tensor-parallel-size 8
```
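If you need to fetch the weights first, one option is the `huggingface_hub` CLI (a sketch; the local directory name is an arbitrary placeholder):

```
# Install the HuggingFace Hub CLI and download the vision model weights.
pip install -U "huggingface_hub[cli]"
huggingface-cli download OpenGVLab/InternViT-6B-448px-V1-5 --local-dir ./InternViT-6B-448px-V1-5
```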
### 34B language model

NVLM 1.0 34B starts from [NousResearch/Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B) from HuggingFace. Please download it and run the following command to convert it to Megatron format:
```
python tools/checkpoint/convert.py --bf16 --model-type GPT --loader llama_mistral --saver mcore --target-tensor-parallel-size 8 --checkpoint-type hf \
    --load-dir <hf model directory> --save-dir <output dir> --tokenizer-model <hf model name/directory> \
    --saver-transformer-impl transformer_engine --model-size yi-34B --make-vocab-size-divisible-by 1
```
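For concreteness, a hypothetical filled-in run (all directory names below are placeholders, not canonical paths). `--make-vocab-size-divisible-by 1` minimizes vocabulary padding so the Megatron vocabulary stays as close as possible to the HuggingFace one:

```
# Placeholder paths: the HF download directory doubles as the tokenizer source.
python tools/checkpoint/convert.py --bf16 --model-type GPT --loader llama_mistral --saver mcore \
    --target-tensor-parallel-size 8 --checkpoint-type hf \
    --load-dir ./Nous-Hermes-2-Yi-34B --save-dir ./megatron/yi-34b-tp8 \
    --tokenizer-model ./Nous-Hermes-2-Yi-34B \
    --saver-transformer-impl transformer_engine --model-size yi-34B --make-vocab-size-divisible-by 1
```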
### 72B language model

NVLM 1.0 72B starts from [Qwen/Qwen2-72B-Instruct](https://huggingface.co/Qwen/Qwen2-72B-Instruct) from HuggingFace. Please download it and run the following command to convert it to Megatron format:
```
python tools/checkpoint/convert.py --bf16 --model-type GPT --loader llama_mistral --saver mcore --target-tensor-parallel-size 8 --checkpoint-type hf \
    --load-dir <hf model directory> --save-dir <output directory> --tokenizer-model <hf model name/directory> \
    --saver-transformer-impl transformer_engine --model-size qwen2.5-72Bf
```
### Combined checkpoint

Combine the vision model checkpoint from InternViT with the 34B or 72B language model by running:
```
examples/multimodal/combine_lm_vision_checkpoints.sh <language model directory> <vision model directory> <output directory> nvlm
```
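For example, a hypothetical 34B combination using the placeholder output directories from the conversion steps above:

```
# All three directories are placeholders from the earlier conversion commands.
examples/multimodal/combine_lm_vision_checkpoints.sh \
    ./megatron/yi-34b-tp8 ./megatron/internvit-tp8 ./megatron/nvlm-34b-combined nvlm
```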
# Training

## 34B

1. Pretraining: run `examples/multimodal/nvlm/pretrain_yi_34b_internvit_6b.sh`. Please use the InternViT + 34B combined checkpoint from the step above and the tokenizer from HuggingFace.
2. SFT: run `examples/multimodal/nvlm/sft_34b_internvit.sh` using the checkpoint from 1.

## 72B

1. Pretraining: run `examples/multimodal/nvlm/pretrain_qwen20_72b_internvit_6b.sh`. Please use the InternViT + 72B combined checkpoint from the step above and the tokenizer from HuggingFace.
2. Convert the pretrained checkpoint from pipeline parallel size 1 to pipeline parallel size 4 for SFT:
   ```
   python examples/multimodal/nvlm/pp_checkpoint_converter.py --input <pretrained checkpoint directory> \
       --input-pipeline-parallel 1 --output <some output dir> --output-pipeline-parallel 4 \
       --tensor-parallel 8
   ```
3. SFT: run `examples/multimodal/nvlm/sft_qwen20_72b_internvit_6b.sh` using the checkpoint from 2.
4. To run text generation with pipeline parallel size 1, convert the SFT checkpoint back:
   ```
   python examples/multimodal/nvlm/pp_checkpoint_converter.py --input <sft checkpoint directory> \
       --input-pipeline-parallel 4 --output <some output dir> --output-pipeline-parallel 1 \
       --tensor-parallel 8
   ```
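A quick sanity check after either conversion, assuming the standard Megatron checkpoint layout (a `latest_checkpointed_iteration.txt` file plus per-rank `mp_rank_*` folders under each `iter_*` directory); the path is a placeholder:

```
# Placeholder for the converter's --output directory.
CKPT_DIR=./megatron/nvlm-72b-sft-pp1
cat "$CKPT_DIR/latest_checkpointed_iteration.txt"
# With tensor parallel 8 and pipeline parallel 1 there should be 8 rank folders
# (mp_rank_00 ... mp_rank_07); with pipeline parallel 4, 32 folders
# (mp_rank_00_000 ... mp_rank_07_003).
ls "$CKPT_DIR"/iter_*/
```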
# Evaluation

## Generation

Run the text generation script.
- 34B:
```
examples/multimodal/nvlm/run_text_generation_yi_34b_internvit_6b.sh --input-image-path /path/to/input/images --output-path /some/output/directory \
    --model-path /path/to/model.pt --gt-path /path/to/groundtruth/file --task generation-task-name --use-tiling
```
- 72B:
```
examples/multimodal/nvlm/run_text_generation_qwen20_72b_internvit_6b.sh --input-image-path /path/to/input/images --output-path /some/output/directory \
    --model-path /path/to/model.pt --gt-path /path/to/groundtruth/file --task generation-task-name --use-tiling
```
where `--task generation-task-name` is the name of the evaluation benchmark, such as `captioning`, `MMMU` or `TextVQA`.
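For example, a hypothetical MMMU run for the 34B model, with every path a placeholder:

```
# Placeholder paths: point these at your images, output directory,
# SFT checkpoint, and ground-truth file.
examples/multimodal/nvlm/run_text_generation_yi_34b_internvit_6b.sh \
    --input-image-path ./data/mmmu/images --output-path ./results/mmmu \
    --model-path ./megatron/nvlm-34b-sft/model.pt --gt-path ./data/mmmu/gt.json \
    --task MMMU --use-tiling
```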
Then, run one of the evaluation scripts from `examples/multimodal`. For example:
```
python examples/multimodal/evaluate_mmmu.py --input-path /output/directory/from/generation
```
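Other benchmarks follow the same pattern, with one `evaluate_*.py` script per task; for instance, assuming a TextVQA generation run was written to its own output directory:

```
# Placeholder path: the --output-path used for the TextVQA generation run.
python examples/multimodal/evaluate_textvqa.py --input-path ./results/textvqa
```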