Model Engine
============

Author: `Chi Zhang <https://github.com/vermouth1992>`_

Last updated: 09/25/2025.
+----------+-----------+--------------+-------------+--------------------------+
| Backends | Model     | Scalability  | Model       | Pain points              |
|          | Supported |              | Definition  |                          |
+==========+===========+==============+=============+==========================+
| FSDP     | Day 1     | - Dense is OK| Huggingface | Monkey patch can be      |
| +        | support   |              | + monkey    | easily impacted by       |
| ulysses  | HF model  | - MoE is bad | patch       | transformers version     |
+----------+-----------+--------------+-------------+--------------------------+
| MCore    | Limited   | Best         | GPTModel    | Supporting new models is |
|          |           |              | (One model  | difficult                |
|          |           |              | for all)    |                          |
+----------+-----------+--------------+-------------+--------------------------+
Note that all the workers and trainers run in SPMD mode. The SFT/DPO/RM
trainers are invoked directly by torchrun. The Actor/Critic workers can
also be invoked by a RayWorkerGroup and provide APIs to a single
controller.
The RL trainer utilizes these workers to construct a HybridFlow program; this is out of the scope of the model engine.
========== ====================== ======================
Model type Language model         Value model
========== ====================== ======================
Input      text/image/video/audio text/image/video/audio
Output     logits for next token  logits as value
========== ====================== ======================
Currently, we have two model types: language model and value model. We expect to expand this category to include the Qwen-Omni family (which outputs both text and audio) and VLA models.
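As an illustrative sketch (the names below are hypothetical, not verl's actual classes), the two model types differ only in the trailing dimension of the logits they produce:

```python
from dataclasses import dataclass

# Hypothetical sketch of the two current model types; verl's real model
# definitions live under verl/workers and differ from this.
@dataclass
class ModelSpec:
    name: str
    output_desc: str
    # Trailing dimension of the logits: vocab_size for a language
    # model, 1 for a value model.
    logits_last_dim: int

def output_shape(spec: ModelSpec, batch: int, seqlen: int) -> tuple:
    """Shape of the logits produced for a (batch, seqlen) input."""
    return (batch, seqlen, spec.logits_last_dim)

language_model = ModelSpec("language", "logits for next token", 151936)  # e.g. a Qwen2-sized vocab
value_model = ModelSpec("value", "logits as value", 1)

print(output_shape(language_model, 4, 1024))  # (4, 1024, 151936)
print(output_shape(value_model, 4, 1024))     # (4, 1024, 1)
```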
Currently, verl adopts a left-right padding data format in the RL trainer. This creates massive padding when response lengths vary widely across a batch. We will start to implement a no-padding format throughout the whole system.
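A minimal back-of-the-envelope sketch of why padding is wasteful (the lengths below are made up for illustration):

```python
# Compare the token budget of a padded layout vs. a packed
# (no-padding) layout for a batch of variable-length responses.
# Numbers are illustrative, not measured from verl.
response_lengths = [17, 512, 43, 980, 5, 256, 31, 1024]

batch = len(response_lengths)
max_len = max(response_lengths)

padded_tokens = batch * max_len          # every row padded to the longest
packed_tokens = sum(response_lengths)    # no-padding: concatenate sequences

waste = 1 - packed_tokens / padded_tokens
print(f"padded={padded_tokens}, packed={packed_tokens}, wasted={waste:.0%}")
```

The wider the spread of response lengths, the larger the fraction of compute spent on padding tokens.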
.. image:: https://github.com/vermouth1992/verl-data/blob/master/images/data_format.png?raw=true
   :alt: Data Format
Here is the migration plan:
.. image:: https://github.com/vermouth1992/verl-data/blob/master/images/verl-ckpt.png?raw=true
   :alt: Model Engine Checkpoint System
The engine constructs the model from a huggingface config, then loads
weights from a huggingface checkpoint. If the engine directly uses the
huggingface model definition, it can rely on the loading utilities
provided by transformers. Otherwise, each engine has to write its own
checkpoint loading logic (e.g.,
`mbridge <https://github.com/ISEEKYAN/mbridge>`__). During model
training, each engine has to implement ``save_checkpoint`` and
``load_checkpoint``, which save/load intermediate sharded checkpoints
including model, optimizer, and lr scheduler states. Each engine also
has to provide a checkpoint merge script that merges the intermediate
sharded checkpoints back into huggingface format.
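The merge step can be sketched as follows. This is a toy illustration in which every parameter is a flat list sharded along its only axis; real merge scripts operate on torch tensors and must respect each parameter's sharding dimension and dtype:

```python
# Toy sketch of merging per-rank sharded checkpoints back into a
# single huggingface-style state dict. Hypothetical, not verl's
# actual merge script.
def merge_sharded_checkpoints(shards: list) -> dict:
    merged = {}
    for name in shards[0]:
        # Concatenate the per-rank pieces of each parameter in
        # rank order.
        merged[name] = [x for shard in shards for x in shard[name]]
    return merged

rank0 = {"embed.weight": [1, 2], "lm_head.weight": [5, 6]}
rank1 = {"embed.weight": [3, 4], "lm_head.weight": [7, 8]}
full = merge_sharded_checkpoints([rank0, rank1])
print(full["embed.weight"])  # [1, 2, 3, 4]
```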
A tentative model engine API can be found at https://github.com/volcengine/verl/blob/main/verl/workers/engine/base.py#L24
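For orientation, the shape of such an API might look like the sketch below. The method names here are illustrative assumptions; the authoritative interface is the ``base.py`` linked above:

```python
from abc import ABC, abstractmethod
from typing import Any

class BaseEngine(ABC):
    """Illustrative sketch of a model engine interface; see
    verl/workers/engine/base.py for the actual tentative API."""

    @abstractmethod
    def init_model(self) -> None:
        """Build the model from a huggingface config and load weights."""

    @abstractmethod
    def forward_backward(self, batch: Any) -> dict:
        """Run one training step and return metrics (loss, grad norm)."""

    @abstractmethod
    def save_checkpoint(self, path: str) -> None:
        """Save sharded model/optimizer/lr scheduler states."""

    @abstractmethod
    def load_checkpoint(self, path: str) -> None:
        """Restore sharded states saved by save_checkpoint."""

class DummyEngine(BaseEngine):
    # Minimal concrete backend used only to show the call sequence.
    def init_model(self):
        self.ready = True
    def forward_backward(self, batch):
        return {"loss": 0.0, "grad_norm": 0.0}
    def save_checkpoint(self, path):
        self.saved = path
    def load_checkpoint(self, path):
        self.loaded = path

engine = DummyEngine()
engine.init_model()
metrics = engine.forward_backward(batch=None)
print(metrics)  # {'loss': 0.0, 'grad_norm': 0.0}
```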
Add a new backend
-----------------
- Start a new folder under ``verl/workers/engine``. Then, implement
``transformer_impl.py``. If you want to implement a non-transformer
model, please contact us in advance.
- Add the engine config to the GSM8k SFT trainer script:
https://github.com/volcengine/verl/blob/main/tests/special_e2e/sft/run_sft_engine_gsm8k.sh
- Invoke the tests with your backend:
https://github.com/volcengine/verl/blob/main/tests/special_e2e/sft/test_sft_engine_all.sh.
  This test script runs various backends under various configurations
  and compares the loss and grad norm of the first step to make sure
  they are close.
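The closeness check in the last step amounts to something like the following (metric values and tolerances here are hypothetical; the real script collects them from the trainer logs and defines its own thresholds):

```python
import math

# Hypothetical first-step metrics reported by two backends.
fsdp_metrics = {"loss": 1.2345, "grad_norm": 3.210}
mcore_metrics = {"loss": 1.2346, "grad_norm": 3.211}

# Two backends are considered consistent when their first-step loss
# and grad norm agree within a small relative tolerance.
for key in ("loss", "grad_norm"):
    assert math.isclose(fsdp_metrics[key], mcore_metrics[key], rel_tol=1e-3), key
print("backends agree on first-step loss and grad norm")
```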
Add a new model type
--------------------