Multimodal Example

NOTE: This example is under active development and is expected change.

The following walks through all the steps required to pretrain and instruction tune a llava architecture vision-language model (VLM). It is important to precisely follow all steps to obtain the benchmark scores at the end.

This example has been tested on an A100 based DGX cluster. Pretraining and instruction tuning took approximately 1 day and 11 hours respectively on 64 GPUs using four way tensor parallelism (tp=4). Training speed will scale approximately linearly with number of GPUs available.

Multimodal support in megatron is still under active development. This example is not intended to produce state-of-the-art model quality (that would require more data and model refinements), it is merely intended to demonstrate the multimodal functionality in megatron. If you hit any problems, please open a github issue.

Setup

Docker container

You can build a docker container using examples/multimodal/Dockerfile to run this example.

# At the Megatron-LM root directory, execute the following
docker build -t megatron-multimodal -f examples/multimodal/Dockerfile .

Language model

Follow the instructions in Mistral to download weights for Mistral-7B-Instruct-v0.3 from HuggingFace and convert to mcore format with tensor parallel size 4. Please use the tokenizer from HuggingFace.

Vision model

This example uses the OpenAI CLIP ViT-L/14@336px Vision model. To download the weights from OpenAI and convert them to a format that can be loaded in megatron, please run the following:

python examples/multimodal/model_converter/clip_converter.py --download-root /some/download/folder --output /some/output/folder --tensor-parallel-size 4 --use-te

Combined model checkpoint

Update the paths to point to the mcore converted CLIP and Mistral models and run the following script to combine the Mistral and CLIP models into a single multimodal checkpoint folder:

examples/multimodal/combine_lm_vision_checkpoints.sh /path/to/mistral/model /path/to/clip/model /output/dir

Training

Pretraining

Download the LLavA-Pretrain dataset from Hugging Face and unzip the images folder (NOTE: 79GB of disk space required):
```
git clone https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain
cd LLaVA-Pretrain
unzip images.zip
```

Run the following script to convert the data to webdataset format:

cd <megatron-lm dir>
python examples/multimodal/convert_llava_pretrain_to_wds.py

Run the following command to convert to megatron-energon format:

cd <LLaVA-Pretrain dir>/wds
energon prepare ./

select the following values for the presented options:

> Please enter a desired train/val/test split like "0.5, 0.2, 0.3" or "8,1,1": 9,1,0
> Do you want to create a dataset.yaml interactively? [Y/n]: Y
> Please enter a number to choose a class: 9 (VQASample)
> Do you want to set a simple field_map[Y] (or write your own sample_loader [n])? [Y/n]: Y
> Please enter a webdataset field name for 'image' (<class 'torch.Tensor'>): jpg
> Please enter a webdataset field name for 'context' (<class 'str'>): json[0][value]
> Please enter a webdataset field name for 'answers' (typing.Optional[typing.List[str]], default: None): json[1][value]
> Please enter a webdataset field name for 'answer_weights' (typing.Optional[torch.Tensor], default: None):

Update pretrain_dataset.yaml so that both path variables point to LLaVA-Pretrain/wds
Run the following script to pretrain a llava model for image captioning:
```
cd <megatron-lm dir>
examples/multimodal/pretrain_mistral_clip.sh
```

All being well you should observe training and validation loss curves similar to the following:

These curves were obtained with global batch size of 256. Changing this value will likely change the curves. For pretraining and instruction tuning llava models we have found that loss curves are an unreliable predictor of downstream task performance. Therefore it is necessary to run test generation and evaluation on a range of metrics to understand model quality. We intend to add training time zero-shot evaluation in a future update.

You can execute the pretraining script multiple times to resume training. On resuming, the latest model, optimizer, and dataloader state are loaded.

SFT

Prepare an instruction tuning dataset such in megatron-energon format. NOTE: we do not provide instructions for this.
Update sft_dataset.yaml so that both path variables point to the train and val splits of your instruction tuning dataset.

Run the following script to instruction tune the pre-trained llava model:

```
examples/multimodal/sft_mistral_clip.sh
```

You can execute the SFT script multiple times to resume training. On resuming, the latest model, optimizer, and dataloader state are loaded.

Evaluation

Generation

Run the following script:

examples/multimodal/text_generation_mistral_clip.sh --input-image-path /path/to/input/images --output-path /some/output/directory \
    --model-path /path/to/model.pt --gt-path /path/to/groundtruth/file --task generation-task-name

where --task generation-task-name is the name of the evaluation benchmark such as captioning or MMMU.

After pretraining

COCO captioning

Download the COCO 2014 test image set:

wget http://images.cocodataset.org/zips/test2014.zip
Download COCO test image annotations:

https://storage.googleapis.com/sfr-vision-language-research/datasets/coco_karpathy_test.json
First, run text generation using --task captioning.

Run the following command:

python examples/multimodal/evaluate_coco.py --input-path /output/directory/from/generation --groundtruth-path /path/to/groundtruth/file

For the mistral-7b-instruct plus clip llava model you should obtain a COCO CIDer score of approximately 94.

After SFT

MMMU

The official MMMU repository is not pip installable currently so please clone their code in examples/multimodal by running git clone https://github.com/MMMU-Benchmark/MMMU.git.

The MMMU dataset is loaded from HuggingFace automatically as part of the code.

Run text generation using --task MMMU. Then, run the following command:

python examples/multimodal/evaluate_mmmu.py --input-path /output/directory/from/generation

For the mistral-7b-instruct plus clip instruction tuned llava model you should obtain a MMMU score of approximately 38.