# Multimodal Models
Megatron Core supports multimodal models that combine language with vision, audio, and other modalities.
## MIMO Framework

MIMO (Multimodal In/Out Model) is an experimental framework in Megatron Core that supports arbitrary combinations of modalities, including vision, audio, and text. MIMO provides a flexible architecture for building custom multimodal models.

Note: MIMO is experimental and under active development. The API may change in future releases.

Key features:

- Support for arbitrary combinations of modalities (vision, audio, text)
- A flexible architecture for building custom multimodal models
See `examples/mimo/` for training scripts and examples.
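The core pattern that MIMO generalizes is: per-modality encoders project their outputs into the language model's embedding space, and the language model attends over the combined token sequence. The plain-PyTorch sketch below illustrates that pattern only; the class and parameter names are illustrative and are not MIMO's actual API (see `examples/mimo` for real usage).

```python
import torch
from torch import nn


class ToyMultimodalModel(nn.Module):
    """Illustrative encoder -> projection -> language-model pattern.
    Not MIMO's API; see examples/mimo for working code."""

    def __init__(self, vision_encoder: nn.Module, audio_encoder: nn.Module,
                 language_model: nn.Module, vision_dim: int, audio_dim: int,
                 hidden_size: int):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.audio_encoder = audio_encoder
        # One projection per modality maps encoder features into the
        # language model's hidden size.
        self.vision_proj = nn.Linear(vision_dim, hidden_size)
        self.audio_proj = nn.Linear(audio_dim, hidden_size)
        self.language_model = language_model

    def forward(self, images, audio, text_embeddings):
        # Encode each modality and project into the shared embedding space.
        vision_tokens = self.vision_proj(self.vision_encoder(images))  # [b, n_v, h]
        audio_tokens = self.audio_proj(self.audio_encoder(audio))      # [b, n_a, h]
        # Concatenate modality tokens with the text embeddings and let the
        # language model attend across the combined sequence.
        sequence = torch.cat([vision_tokens, audio_tokens, text_embeddings], dim=1)
        return self.language_model(sequence)
```

As the "In/Out" name suggests, MIMO extends this pattern to arbitrary modality combinations, with each encoder, projection, and the language model configured independently.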

## Supported Models

Megatron Core supports the following vision-language models:

| Model | Description | Vision Encoder | Language Model |
|---|---|---|---|
| LLaVA | Visual instruction tuning | CLIP ViT-L/14 | Mistral-7B / LLaMA |
| NVLM | NVIDIA Vision-Language Model | CLIP / Custom ViT | LLaMA-based |
| LLaMA 3.1 Nemotron Nano VL | Efficient multimodal model | Vision Transformer | LLaMA 3.1 8B |
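A defining detail of LLaVA-style visual instruction tuning is how image features enter the language model: the projected vision tokens are spliced into the text sequence at the image placeholder position. The sketch below shows that step in plain PyTorch; it is a simplified illustration, not Megatron Core's implementation, and the function name is hypothetical.

```python
import torch


def splice_image_tokens(text_embeds: torch.Tensor,
                        image_token_mask: torch.Tensor,
                        image_embeds: torch.Tensor) -> torch.Tensor:
    """Replace the <image> placeholder position in a text sequence with the
    projected vision tokens (simplified LLaVA-style input construction).

    text_embeds:      [seq_len, hidden]
    image_token_mask: [seq_len] bool, True at the single <image> placeholder
    image_embeds:     [num_image_tokens, hidden], already projected
    """
    pos = int(torch.nonzero(image_token_mask)[0])
    return torch.cat([text_embeds[:pos], image_embeds, text_embeds[pos + 1:]], dim=0)


# A 16-token prompt with the placeholder at position 3, expanded with the
# 576 patch tokens a CLIP ViT-L/14 encoder produces at 336 px.
hidden = 4096
text = torch.randn(16, hidden)
mask = torch.zeros(16, dtype=torch.bool)
mask[3] = True
vision = torch.randn(576, hidden)
print(splice_image_tokens(text, mask, vision).shape)  # torch.Size([591, 4096])
```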

## Vision Encoders

| Encoder | Description | Key Features |
|---|---|---|
| CLIP ViT | OpenAI's CLIP Vision Transformer | Image-text alignment, multiple scales (L/14@336px) |
| RADIO | NVIDIA's RADIO vision foundation model | Flexible resolution handling, efficient vision encoding |
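The encoder's patch size and input resolution determine how many vision tokens each image contributes to the language model's sequence, which directly affects the context budget. A quick check for the CLIP ViT-L/14 @ 336px configuration listed above:

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Patch tokens a ViT produces for a square image of the given size."""
    assert image_size % patch_size == 0, "image size must be divisible by patch size"
    return (image_size // patch_size) ** 2


# CLIP ViT-L/14 at 336 px: (336 / 14)^2 = 24^2 = 576 patch tokens,
# plus a class token that downstream models may keep or drop.
print(num_patch_tokens(336, 14))  # 576
```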

## Diffusion Models

For multimodal diffusion models (image generation, text-to-image, etc.), see NeMo Diffusion Models, which provides production-ready implementations.

## Examples

Multimodal training examples can be found in the following directories:

MIMO Framework:

- `examples/mimo/` - Multimodal In/Out training with support for vision-language and audio-vision-language models

Specific Multimodal Models:

- `examples/multimodal/` - LLaVA-style training with Mistral + CLIP
- `examples/multimodal/nvlm/` - NVLM training scripts
- `examples/multimodal/llama_3p1_nemotron_nano_vl_8b_v1/` - Nemotron VL training
- `examples/multimodal/radio/` - RADIO vision encoder integration