verl/models/mcore/readme.md
updated 20251222
There have been 3 ways that verl integrates megatron-core as its training backend: the legacy integration without a bridge (way#1), mbridge (way#2), and megatron-bridge (way#3).
The configuration option megatron.use_mbridge chooses between way#1 (false) and way#2 (true). After megatron-bridge was integrated, a new option megatron.vanilla_mbridge was added to choose between way#2 (true) and way#3 (false).
Since way#1 is now deprecated, use_mbridge is asserted to be true and will be removed after v0.7. vanilla_mbridge defaults to true for now and will default to false once the megatron-bridge backend becomes the default.
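The way the two flags combine can be sketched in a few lines. This is only an illustration of the rules stated above, not verl's actual code; the function name select_backend and the returned labels are hypothetical.

```python
def select_backend(use_mbridge: bool = True, vanilla_mbridge: bool = True) -> str:
    """Return which integration way the two flags select (illustrative only)."""
    # way#1 is deprecated, so use_mbridge is required to be true.
    if not use_mbridge:
        raise AssertionError("way#1 is deprecated; megatron.use_mbridge must be true")
    # vanilla_mbridge=True keeps mbridge (way#2); False selects megatron-bridge (way#3).
    return "way#2 (mbridge)" if vanilla_mbridge else "way#3 (megatron-bridge)"


print(select_backend())                       # current defaults
print(select_backend(vanilla_mbridge=False))  # opt in to megatron-bridge
```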
With the bridge approach (#2 or #3), we can directly load and save Megatron model weights in HuggingFace format, and we can use any Megatron version >= 0.13 to adopt new Megatron optimization features as easily as possible by directly adding overridden Megatron configs such as +actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method=uniform.
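The dotted key in such an override reads as a path into the nested config. The sketch below only illustrates where the value lands; it is not the real config loader, and apply_override is a hypothetical helper.

```python
def apply_override(config: dict, override: str) -> dict:
    """Write one '+a.b.c=value' style override into a nested dict (illustrative)."""
    # Drop the leading '+' and split the key path from the value.
    key, value = override.lstrip("+").split("=", 1)
    *parents, leaf = key.split(".")
    node = config
    for name in parents:
        # Create intermediate levels on demand.
        node = node.setdefault(name, {})
    node[leaf] = value
    return config


cfg = apply_override(
    {},
    "+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method=uniform",
)
print(cfg["actor_rollout_ref"]["actor"]["megatron"]["override_transformer_config"])
```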
1. Make sure the model is supported by your inference engine (vLLM, SGLang, or TensorRT-LLM) with the correct version.
2. Make sure the model is supported by the bridge:
   - megatron-bridge, or contribute your implementation to megatron-bridge. Be careful to have matching versions of Megatron and TransformerEngine.
3. Use mbridge or megatron-bridge (preferred).
4. Now that the model is supported, just change the model path in your scripts and run the scripts.
Now we use mbridge to support megatron models. And we will migrate to megatron-bridge in the future.
With mbridge, we can use almost all of the Megatron-Core features to support new models with little effort. No offline weights conversion is needed; all weights conversion is done online. We can directly save the mcore model to huggingface format during training.
We can also easily upgrade mcore to the latest version. In most cases the upgrade is seamless, except when the mcore API changes and we need to update the verl code accordingly.
verl/verl/models/mcore/registry.py.

Earlier versions of verl used Megatron-LM 0.4 with workaround huggingface model classes. To make better use of the latest features and speedups of modern Megatron, we are migrating to Megatron-Core (mcore) and using the recommended GPTModel class for all language models. With mcore GPTModel, we can use the latest features such as context parallel, expert parallel, and dist_checkpointing, and we can update mcore with little effort in the future to pick up new features.
The migration has been successful with the help of the mcore team and the community. What we have done is:
1. Updated the Megatron version to 0.14.0.
2. Migrated LlamaForCausalLM and Qwen2ForCausalLM to mcore GPTModel.
3. Supported tensor parallel, pipeline parallel, sequence parallel, virtual pipeline parallel, and context parallel.
4. Supported the mcore dist_checkpointing feature and added a basic offline weights conversion script from huggingface to mcore dist_checkpointing format.

We are working on the following features:
- Qwen2MoeForCausalLM
- MixtralForCausalLM
- DeepseekV3ForCausalLM
- expert parallel

Features we invite the community to contribute:
- dist_checkpointing format.
- Refactor megatron_checkpoint_manager.py to use the dist_checkpointing format.

To track the progress of verl mcore integration, please refer to the mcore integration issue.
To engage the community in contributing, here are the key steps in our mcore integration process and features under development.
Huggingface transformers is the de facto standard model zoo, while mcore excels at computational efficiency. The main challenge is converting between the two.
Main steps:
1. Modelling the huggingface model with mcore GPTModel
   - convert the huggingface config to mcore TransformerConfig
   - init the mcore GPTModel with the converted config
   - load the huggingface model weights to the GPTModel
2. Online weight conversion from mcore to huggingface (because the rollout engine vLLM uses huggingface format)
3. Support the mcore features in verl
   - tensor parallel, pipeline parallel, sequence parallel, virtual pipeline parallel, context parallel
4. Checkpointing
The first step, modelling the huggingface model with mcore GPTModel, is to convert the huggingface config to mcore TransformerConfig and init the mcore GPTModel with the converted config. See the code in verl/models/mcore/config_converter.py and verl/verl/models/mcore/models/model_initializer.py. The corresponding model forward code is in verl/verl/models/mcore/models/model_forward.py.
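To make the config conversion concrete, here is a minimal sketch of the kind of field renaming involved. It does not reproduce config_converter.py: SketchTransformerConfig is a hypothetical stand-in for the mcore TransformerConfig (only a few field names are borrowed from it), the huggingface keys are the ones used by Llama-style configs, and the example values are made up.

```python
from dataclasses import dataclass


@dataclass
class SketchTransformerConfig:
    """Hypothetical stand-in for a handful of mcore TransformerConfig fields."""
    num_layers: int
    hidden_size: int
    num_attention_heads: int
    num_query_groups: int
    ffn_hidden_size: int
    layernorm_epsilon: float


def hf_to_sketch_config(hf: dict) -> SketchTransformerConfig:
    """Rename Llama-style huggingface config keys to mcore-style field names."""
    return SketchTransformerConfig(
        num_layers=hf["num_hidden_layers"],
        hidden_size=hf["hidden_size"],
        num_attention_heads=hf["num_attention_heads"],
        # Without grouped-query attention, every head is its own query group.
        num_query_groups=hf.get("num_key_value_heads", hf["num_attention_heads"]),
        ffn_hidden_size=hf["intermediate_size"],
        layernorm_epsilon=hf["rms_norm_eps"],
    )


# Made-up values for illustration only.
hf_config = {
    "num_hidden_layers": 2,
    "hidden_size": 64,
    "num_attention_heads": 8,
    "num_key_value_heads": 2,
    "intermediate_size": 256,
    "rms_norm_eps": 1e-6,
}
sketch_config = hf_to_sketch_config(hf_config)
print(sketch_config)
```

The real converter has to handle many more fields and model-specific cases, so treat this only as a picture of the mapping step.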
There are two ways of loading the huggingface model weights to the GPTModel:
- Offline conversion of the huggingface weights to the mcore dist_checkpointing format, using verl/scripts/converter_hf_to_mcore.py. With the dist_checkpointing format, the speed is fast and memory consumption is low.

For the online weight conversion from mcore to huggingface, see the function convert_megatron_model_to_transformers_model in verl/utils/megatron_utils.py for the details.
It should be refactored for extensibility and better performance.
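One part of this conversion is renaming parameters. The sketch below covers only a few one-to-one names and is not taken from convert_megatron_model_to_transformers_model: the parameter names are assumed from Llama-style checkpoints, and mcore_to_hf_name is a hypothetical helper. Fused tensors such as linear_qkv cannot be handled by renaming alone and are left out.

```python
import re

# Assumed one-to-one names outside the decoder layers (Llama-style).
GLOBAL_NAMES = {
    "embedding.word_embeddings.weight": "model.embed_tokens.weight",
    "decoder.final_layernorm.weight": "model.norm.weight",
    "output_layer.weight": "lm_head.weight",
}

# Assumed one-to-one names inside each decoder layer.
LAYER_SUFFIXES = {
    "self_attention.linear_proj.weight": "self_attn.o_proj.weight",
    "mlp.linear_fc2.weight": "mlp.down_proj.weight",
}

LAYER_RE = re.compile(r"^decoder\.layers\.(\d+)\.(.+)$")


def mcore_to_hf_name(name: str) -> str:
    """Map an mcore parameter name to its huggingface name (one-to-one cases only)."""
    if name in GLOBAL_NAMES:
        return GLOBAL_NAMES[name]
    match = LAYER_RE.match(name)
    if match and match.group(2) in LAYER_SUFFIXES:
        layer, suffix = match.groups()
        return f"model.layers.{layer}.{LAYER_SUFFIXES[suffix]}"
    # Fused or unknown tensors need splitting logic, not just a rename.
    raise KeyError(f"no one-to-one mapping for {name}")


print(mcore_to_hf_name("decoder.layers.3.mlp.linear_fc2.weight"))
```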
Most of the features of GPTModel are supported out of the box in verl by changing the TransformerConfig, except those related to parallel strategies, such as expert parallel.
Supporting features related to parallel strategies requires changes to the online weights conversion (especially the resharding part) and to verl's work dispatching.
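To give a feel for why resharding depends on the parallel strategy, here is a minimal sketch that merges tensor-parallel shards of a single weight matrix back into the full matrix. It uses plain nested lists instead of tensors and ignores pipeline, expert, and fused-QKV layouts; merge_tp_shards is a hypothetical helper, not verl code. It assumes the usual Megatron layout where a column-parallel weight is split along the output dimension (dim 0) and a row-parallel weight along the input dimension (dim 1).

```python
def merge_tp_shards(shards: list, parallel_dim: int) -> list:
    """Concatenate per-rank 2-D shards (lists of rows) along parallel_dim."""
    if parallel_dim == 0:
        # Column-parallel: each rank holds a block of output rows.
        return [row for shard in shards for row in shard]
    if parallel_dim == 1:
        # Row-parallel: each rank holds a slice of every row.
        num_rows = len(shards[0])
        return [sum((shard[r] for shard in shards), []) for r in range(num_rows)]
    raise ValueError("parallel_dim must be 0 or 1")


# The same 2x2 matrix split across two ranks in both layouts.
column_parallel = [[[1, 2]], [[3, 4]]]
row_parallel = [[[1], [3]], [[2], [4]]]
print(merge_tp_shards(column_parallel, 0))
print(merge_tp_shards(row_parallel, 1))
```

Each additional parallel strategy adds another way a tensor can be split, which is why the resharding logic has to change alongside it.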
The existing checkpointing code is in verl/utils/checkpoint/megatron_checkpoint_manager.py. The script to convert a checkpoint to huggingface format is in verl/scripts/model_merger.
The existing checkpoint format simply saves every rank's weights and optimizer states. It should be refactored to use the dist_checkpointing format.
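- Modelling the huggingface model with mcore GPTModel (Pai-Megatron-Patch is a good reference)
  - convert the huggingface config to mcore TransformerConfig
  - init the mcore GPTModel with the converted config
  - load the huggingface model weights to the GPTModel
- Offline weights conversion from huggingface to mcore dist_checkpointing format

The greatest challenge for scaling up to larger models is memory consumption.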
The necessary features under development for scaling up are