<!--- Copyright (c) 2022-2026, NVIDIA CORPORATION. All rights reserved. NVIDIA CORPORATION and its licensors retain all intellectual property and proprietary rights in and to this software, related documentation and any modifications thereto. Any use, reproduction, disclosure or distribution of this software and related documentation without an express license agreement from NVIDIA CORPORATION is strictly prohibited. -->

Multimodal Models

Megatron Core supports multimodal models that combine a language model with vision, audio, and other input modalities.

MIMO: Multimodal In/Out Framework

MIMO (Multimodal In/Out Model) is an experimental framework in Megatron Core that supports arbitrary combinations of modalities including vision, audio, and text. MIMO provides a flexible architecture for building custom multimodal models.

Note: MIMO is experimental and under active development. The API may change in future releases.

Key Features:

  • Arbitrary modality combinations (vision, audio, text, etc.)
  • Flexible encoder architecture for different input modalities
  • Unified embedding space across modalities (sketched in the example below)
  • Support for both vision-language and audio-vision-language models

See examples/mimo for training scripts and examples.
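
The core pattern behind MIMO is a separate encoder per input modality, each projected into the language model's hidden size so that all modalities share one embedding space. The following PyTorch sketch illustrates that pattern; all class and method names here are hypothetical illustrations, not the MIMO API. See examples/mimo for the real entry points.

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Hypothetical: wraps one modality's encoder and projects its
    features into the language model's hidden size."""

    def __init__(self, encoder: nn.Module, encoder_dim: int, lm_hidden: int):
        super().__init__()
        self.encoder = encoder
        self.proj = nn.Sequential(
            nn.Linear(encoder_dim, lm_hidden),
            nn.GELU(),
            nn.Linear(lm_hidden, lm_hidden),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # [batch, tokens, encoder_dim] -> [batch, tokens, lm_hidden]
        return self.proj(self.encoder(x))

class ToyMultimodalEmbedder(nn.Module):
    """Hypothetical: one projector per input modality, all feeding
    a single shared language-model embedding space."""

    def __init__(self, encoders: dict[str, tuple[nn.Module, int]], lm_hidden: int):
        super().__init__()
        self.projectors = nn.ModuleDict(
            {name: ModalityProjector(enc, dim, lm_hidden)
             for name, (enc, dim) in encoders.items()}
        )

    def forward(self, inputs: dict[str, torch.Tensor]) -> torch.Tensor:
        # Concatenate modality embeddings along the sequence axis; a real
        # model interleaves them at placeholder-token positions instead.
        parts = [self.projectors[name](x) for name, x in inputs.items()]
        return torch.cat(parts, dim=1)
```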

Vision-Language Models

| Model | Description | Vision Encoder | Language Model |
|-------|-------------|----------------|----------------|
| LLaVA | Visual instruction tuning | CLIP ViT-L/14 | Mistral-7B / LLaMA |
| NVLM | NVIDIA Vision-Language Model | CLIP / Custom ViT | LLaMA-based |
| LLaMA 3.1 Nemotron Nano VL | Efficient multimodal model | Vision Transformer | LLaMA 3.1 8B |
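
LLaVA-style models inject an image into the text stream by replacing a reserved placeholder token with projected vision features. Below is a minimal sketch of that splice step, assuming a single <image> token per sequence; the token id and function name are hypothetical, and Megatron's LLaVA implementation performs this step inside the model.

```python
import torch

IMAGE_TOKEN_ID = 32000  # hypothetical id reserved for the "<image>" placeholder

def splice_image_embeddings(
    text_embeds: torch.Tensor,   # [seq, hidden] embedded text tokens
    token_ids: torch.Tensor,     # [seq] token ids containing one IMAGE_TOKEN_ID
    image_embeds: torch.Tensor,  # [patches, hidden] projected vision features
) -> torch.Tensor:
    """Replace the single <image> placeholder with the image patch embeddings."""
    pos = int((token_ids == IMAGE_TOKEN_ID).nonzero()[0])
    return torch.cat(
        [text_embeds[:pos], image_embeds, text_embeds[pos + 1:]], dim=0
    )

# CLIP ViT-L/14 at 336px yields (336 / 14) ** 2 = 576 patches per image.
text = torch.randn(16, 4096)
ids = torch.full((16,), 1)
ids[3] = IMAGE_TOKEN_ID
image = torch.randn(576, 4096)
print(splice_image_embeddings(text, ids, image).shape)  # torch.Size([591, 4096])
```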

Vision Encoders

| Model | Description | Key Features |
|-------|-------------|--------------|
| CLIP ViT | OpenAI's CLIP Vision Transformer | Image-text alignment, multiple scales (L/14@336px) |
| RADIO | Resolution-Agnostic Dynamic Image Optimization | Flexible resolution handling, efficient vision encoding |
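
To make the table concrete, the sketch below extracts patch features from CLIP ViT-L/14 at 336px using the HuggingFace transformers library, which is used here purely for illustration; Megatron Core ships its own CLIP ViT implementation.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

checkpoint = "openai/clip-vit-large-patch14-336"  # CLIP ViT-L/14 @ 336px
processor = CLIPImageProcessor.from_pretrained(checkpoint)
encoder = CLIPVisionModel.from_pretrained(checkpoint)

image = Image.new("RGB", (336, 336))  # stand-in for a real image
pixels = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    features = encoder(pixel_values=pixels).last_hidden_state

# 24 x 24 = 576 patch tokens plus one [CLS] token, hidden size 1024.
print(features.shape)  # torch.Size([1, 577, 1024])
```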

Diffusion Models

For multimodal diffusion models (image generation, text-to-image, etc.), see NeMo Diffusion Models. NeMo provides production-ready implementations of:

  • Stable Diffusion variants
  • Text-to-image generation
  • Image-to-image translation
  • ControlNet and other conditioning mechanisms

Multimodal Features

  • Image-Text Alignment: Pre-training on image-caption pairs
  • Visual Instruction Tuning: Fine-tuning on instruction-following datasets (a sample record is sketched after this list)
  • Flexible Vision Encoders: Support for different ViT architectures and resolutions
  • Combined Checkpointing: Unified checkpoints combining vision and language models
  • Efficient Training: Full parallelism support (TP, PP, DP) for both vision and language components
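
As a concrete example of visual instruction tuning data, a LLaVA-style sample pairs an image reference with a conversation containing an image placeholder. The field names and file path below are illustrative, not a fixed Megatron schema.

```python
# Illustrative LLaVA-style instruction-tuning record; the exact schema
# varies by dataset and preprocessing pipeline.
sample = {
    "image": "coco/train2017/000000123456.jpg",  # hypothetical path
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is shown in this picture?"},
        {"from": "gpt", "value": "A dog catching a frisbee in a park."},
    ],
}
```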

Example Scripts

Multimodal training examples can be found in the following directories:

MIMO Framework:

  • examples/mimo/ - Multimodal In/Out training with support for vision-language and audio-vision-language models

Specific Multimodal Models:

  • examples/multimodal/ - LLaVA-style training with Mistral + CLIP
  • examples/multimodal/nvlm/ - NVLM training scripts
  • examples/multimodal/llama_3p1_nemotron_nano_vl_8b_v1/ - Nemotron VL training
  • examples/multimodal/radio/ - RADIO vision encoder integration