<!--- Copyright (c) 2022-2026, NVIDIA CORPORATION. All rights reserved. NVIDIA CORPORATION and its licensors retain all intellectual property and proprietary rights in and to this software, related documentation and any modifications thereto. Any use, reproduction, disclosure or distribution of this software and related documentation without an express license agreement from NVIDIA CORPORATION is strictly prohibited. -->

# Language Models

Megatron Core supports the following language model architectures for large-scale training.

## Converting HuggingFace Models

Use Megatron Bridge to convert HuggingFace models to Megatron format. Megatron Bridge is the official standalone converter with support for an extensive list of models including LLaMA, Mistral, Mixtral, Qwen, DeepSeek, Gemma, Phi, Nemotron, and many more.

See the Megatron Bridge supported models list for the complete and up-to-date list.
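
As a minimal sketch of the conversion flow, the snippet below loads a HuggingFace checkpoint and materializes the corresponding Megatron Core model. The class and method names (`AutoBridge`, `from_hf_pretrained`, `to_megatron_provider`, `provide_distributed_model`) are assumptions modeled on the Megatron Bridge README; verify them against the current Megatron Bridge documentation before relying on them.

```python
# Hypothetical sketch: convert a HuggingFace checkpoint with Megatron Bridge.
# The names AutoBridge, from_hf_pretrained, to_megatron_provider, and
# provide_distributed_model are assumptions modeled on the Megatron Bridge
# README; check the project's current docs for the exact API.
from megatron.bridge import AutoBridge

# Load the HuggingFace model and pair it with its Megatron weight mapping.
bridge = AutoBridge.from_hf_pretrained("meta-llama/Llama-3.2-1B")

# Materialize Megatron Core model instances with the imported weights.
provider = bridge.to_megatron_provider()
megatron_model = provider.provide_distributed_model(wrap_with_ddp=False)
```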

## Decoder-Only Models

| Model | Description | Key Features |
|-------|-------------|--------------|
| GPT | Generative Pre-trained Transformer | Standard autoregressive LM, foundational architecture |
| LLaMA | Meta's LLaMA family | Efficient architecture with RoPE, SwiGLU, RMSNorm |
| Mistral | Mistral AI models | Sliding window attention, efficient inference |
| Mixtral | Sparse Mixture-of-Experts | 8x7B MoE architecture for efficient scaling |
| Qwen | Alibaba's Qwen series | HuggingFace integration, multilingual support |
| Mamba | State space model | Subquadratic sequence-length scaling, efficient long context |

## Encoder-Only Models

| Model | Description | Key Features |
|-------|-------------|--------------|
| BERT | Bidirectional Encoder Representations | Masked language modeling, classification tasks |

## Encoder-Decoder Models

| Model | Description | Key Features |
|-------|-------------|--------------|
| T5 | Text-to-Text Transfer Transformer | Unified text-to-text framework, sequence-to-sequence |

## Example Scripts

Training examples for these models can be found in the `examples/` directory (a sample launch command follows the list):

- `examples/gpt3/` - GPT-3 training scripts
- `examples/llama/` - LLaMA training scripts
- `examples/mixtral/` - Mixtral MoE training scripts
- `examples/mamba/` - Mamba training scripts
- `examples/bert/` - BERT training scripts
- `examples/t5/` - T5 training scripts
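
For orientation, a typical multi-GPU launch looks like the sketch below. `pretrain_gpt.py` and the flags shown are standard Megatron-LM entry points and arguments, but every path, size, and parallelism degree here is a placeholder; the shipped scripts in `examples/gpt3/` contain complete, tested configurations.

```bash
# Illustrative single-node, 8-GPU launch of GPT pretraining.
# All paths, model sizes, and parallelism degrees below are placeholders;
# see examples/gpt3/ for complete, tested configurations.
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --num-layers 24 \
    --hidden-size 2048 \
    --num-attention-heads 16 \
    --seq-length 2048 \
    --max-position-embeddings 2048 \
    --micro-batch-size 2 \
    --global-batch-size 256 \
    --tensor-model-parallel-size 2 \
    --pipeline-model-parallel-size 1 \
    --train-iters 100000 \
    --lr 3.0e-4 \
    --bf16 \
    --data-path /path/to/dataset_prefix \
    --tokenizer-type GPT2BPETokenizer \
    --vocab-file /path/to/gpt2-vocab.json \
    --merge-file /path/to/gpt2-merges.txt
```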

## Model Implementation

All language models are built from Megatron Core's composable transformer blocks (a minimal construction sketch appears after the list), enabling:

- Flexible parallelism strategies: tensor (TP), pipeline (PP), data (DP), expert (EP), and context (CP) parallelism
- Mixed precision training (FP16, BF16, FP8)
- Distributed checkpointing
- Efficient memory management
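
The sketch below builds a deliberately tiny GPT model from these blocks on a single process, following the Megatron Core quickstart pattern. The argument names match recent Megatron Core releases, but treat them as assumptions and check the API reference for your installed version; real runs would also set the parallelism and precision options on `TransformerConfig`.

```python
# Minimal sketch: construct a toy GPT model from Megatron Core's composable
# blocks, following the Megatron Core quickstart pattern. Argument names are
# believed current but should be checked against the installed version.
import os
import torch
from megatron.core import parallel_state
from megatron.core.models.gpt.gpt_model import GPTModel
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_local_spec
from megatron.core.transformer.transformer_config import TransformerConfig

# Megatron Core expects an initialized torch.distributed process group,
# even when running a single process.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
torch.distributed.init_process_group(world_size=1, rank=0)
parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=1,
    pipeline_model_parallel_size=1,
)

# A tiny configuration; on real runs this same config object carries the
# parallelism and precision knobs listed above.
config = TransformerConfig(
    num_layers=2,
    hidden_size=128,
    num_attention_heads=4,
    use_cpu_initialization=True,
    pipeline_dtype=torch.float32,
)

model = GPTModel(
    config=config,
    transformer_layer_spec=get_gpt_layer_local_spec(),
    vocab_size=1024,
    max_sequence_length=64,
)
print(sum(p.numel() for p in model.parameters()), "parameters")
```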