Algorithm Baselines

Last updated: 06/18/2025.

GSM8k

Assuming the GSM8k/math dataset is preprocessed via:

```bash
python3 examples/data_preprocess/*.py
```

Refer to the table below to reproduce RL training from different pre-trained checkpoints. Scores are on the GSM8k dataset unless specified otherwise. More comprehensive benchmark results are available in the recipe folder.

| Hardware | Model | Method | Test score | Details |
|----------|-------|--------|------------|---------|
| NVIDIA GPU | google/gemma-2-2b-it | hf checkpoint | 23.9 | Huggingface |
| NVIDIA GPU | google/gemma-2-2b-it | SFT | 52.06 | command and logs |
| NVIDIA GPU | google/gemma-2-2b-it | SFT + PPO | 64.02 | command and logs, wandb |
| NVIDIA GPU | Qwen/Qwen2.5-0.5B-Instruct | hf checkpoint | 49.6 | Qwen blog |
| NVIDIA GPU | Qwen/Qwen2.5-0.5B-Instruct | PPO | 56.7 | command and log |
| NVIDIA GPU | Qwen/Qwen2.5-0.5B-Instruct | PRIME | 58.7 | script, wandb |
| NVIDIA GPU | Qwen/Qwen2.5-0.5B-Instruct | GRPO-LoRA | 54.3 | command and logs |
| NVIDIA GPU | Qwen/Qwen2.5-1.5B-Instruct | GRPO-LoRA | 77.9 | command and logs |
| NVIDIA GPU | Qwen/Qwen2.5-3B-Instruct | GRPO-LoRA | 86.1 | command and logs |
| NVIDIA GPU | deepseek-ai/deepseek-llm-7b-chat | PPO (Megatron) | 69.5 [1] | log, wandb |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GRPO | 89 | script |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GRPO (FSDP2) | 89.8 | log |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GRPO (Megatron) | 89.6 | log |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | ReMax | 97 | script, wandb |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | SPPO | 65.6 (MATH) | SPPO script |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | GRPO-LoRA | 93.4 | command and logs |
| NVIDIA GPU | Mixtral-8x22B-Instruct-v0.1 | Instruct model | 83.7 | Qwen Blog |
| NVIDIA GPU | Mixtral-8x22B-Instruct-v0.1 | RLOO (Megatron) | 92.3 | wandb |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | SPIN | 92 | script |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GPG | 88 | log, wandb |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GPG (Megatron) | 88 | log, wandb |
| NVIDIA GPU | Qwen/Qwen2.5-VL-7B-Instruct | GRPO (Megatron) | 65.4 (GEO3k) | script, wandb |
| AMD MI300 | deepseek-ai/deepseek-llm-7b-chat | PPO | 70.5 [1] | log |
| AMD MI300 | deepseek-ai/deepseek-llm-7b-chat | GRPO | 71.4 [1] | log |
| NVIDIA GPU | Qwen/Qwen2.5-14B-Instruct | GRPO-LoRA | 94.6 | command and logs |
| NVIDIA GPU | Qwen/Qwen2.5-32B-Instruct | GRPO-LoRA | 95.8 | command and logs |
| NVIDIA GPU | Qwen/Qwen2.5-72B-Instruct | GRPO-LoRA | 96.0 | command and logs |

DAPO math-17k

Note:

  • For Qwen/Qwen2.5-Math-7B, we directly modify max_position_embeddings to 32768 in order to train with longer response lengths, without observing performance degradation.
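
One way to apply such a change to a local checkpoint is to patch its `config.json` before training (a minimal sketch; the helper name and directory layout are assumptions, and `transformers.AutoConfig` would work equally well):

```python
import json
from pathlib import Path

def extend_context_window(checkpoint_dir: str, new_len: int = 32768) -> int:
    """Overwrite max_position_embeddings in a checkpoint's config.json."""
    cfg_path = Path(checkpoint_dir) / "config.json"
    cfg = json.loads(cfg_path.read_text())
    cfg["max_position_embeddings"] = new_len
    cfg_path.write_text(json.dumps(cfg, indent=2))
    return cfg["max_position_embeddings"]
```

Note that for models using RoPE, raising this limit only lets the model emit longer sequences; quality beyond the original training context is not guaranteed in general.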
| Hardware | Model | Method | Test score | Details |
|----------|-------|--------|------------|---------|
| NVIDIA GPU | Qwen/Qwen2.5-Math-7B (32k) | DAPO | 36.3 | command, logs |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | DAPO + Code Interpreter | 40.0 | command |

Below are the results on the LeetCode dataset unless specified otherwise.

| Hardware | Model | Method | Test score | Details |
|----------|-------|--------|------------|---------|
| NVIDIA GPU | PRIME-RL/Eurus-2-7B-SFT | PRIME | 36.1 | script, swanlab |

Notes

[1] During evaluation, we only extracted answers following the format "####". More flexible answer extraction, longer response lengths, and better prompt engineering may lead to higher scores.

[2] The default value of actor_rollout_ref.actor.entropy_coeff has been 0.0 since verl 0.3.x (2025-05-30), which differs from previous versions.
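
If you want a nonzero entropy bonus, the coefficient can be passed as a command-line override in the usual Hydra/OmegaConf style used by verl's example scripts (a sketch only; the value shown is illustrative, not the previous default, and the remaining data/model/trainer settings are elided):

```shell
# Illustrative override; supply your usual data/model/trainer settings as well.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.actor.entropy_coeff=0.001 \
    ...
```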