Ascend ReTool Best Practice
===========================

Last updated: 03/01/2026.

Introduction
------------

ReTool (see the ReTool paper) integrates a code-interpreter tool: the policy is rolled out with multi-turn, real-time code execution, and the model is taught, through feedback on the execution results, when and how to invoke the tool.

This best practice covers:

1. Environment setup
2. Model training

The model used in this example and the hardware it requires are listed below:

========== ============ ===== ==========================
Model      NPU Model    Nodes Training/Inference Backend
========== ============ ===== ==========================
Qwen2.5-7B Atlas 900 A2 1     vllm + FSDP
========== ============ ===== ==========================

Environment Setup
-----------------

1. Build from a custom Conda environment

============ ============================================================
software     version
============ ============================================================
Python       >= 3.10, <3.12
CANN         == 8.3.RC1
torch        == 2.7.1
torch_npu    == 2.7.1
verl         v0.6.1 commitId=d62da4950573d7a4b7ef2362337952e7ab59e78d
vllm         v0.11.0
vllm-ascend  v0.11.0-dev
transformers 4.57.6
============ ============================================================
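
For reference, a minimal bootstrap of such an environment is sketched below. It assumes CANN 8.3.RC1 is already installed and its environment sourced; the repository URLs, branch name, and install commands are assumptions to adapt from the official verl and vllm-ascend installation guides.

.. code-block:: bash

    # minimal sketch, assuming the CANN 8.3.RC1 toolkit is installed and sourced
    conda create -n verl-ascend -y python=3.10
    conda activate verl-ascend

    pip install torch==2.7.1 torch_npu==2.7.1 transformers==4.57.6

    # verl at the pinned commit from the table above
    git clone https://github.com/volcengine/verl.git
    cd verl
    git checkout d62da4950573d7a4b7ef2362337952e7ab59e78d
    pip install -e .
    cd ..

    # vllm + vllm-ascend (versions from the table; NPU builds often need
    # extra steps, so consult the vllm-ascend installation guide)
    pip install vllm==0.11.0
    git clone -b v0.11.0-dev https://github.com/vllm-project/vllm-ascend.git
    pip install -e vllm-ascend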

Model Training and Evaluation
-----------------------------

1. Model and data preparation

^^^^^^^^^^^
Qwen2.5-7B
^^^^^^^^^^^

Download the model weights

``--local-dir``: the local path where the model weights are saved

.. code-block:: bash

    git clone https://huggingface.co/Qwen/Qwen2.5-7B-Instruct
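
The ``--local-dir`` note above applies if you download with ``huggingface-cli`` instead of ``git``; a minimal sketch (the target directory is an arbitrary example):

.. code-block:: bash

    huggingface-cli download Qwen/Qwen2.5-7B-Instruct --local-dir ./models/Qwen2.5-7B-Instruct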

Download the training dataset

.. code-block:: bash

    git clone https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k

Download the evaluation dataset

.. code-block:: bash

    git clone https://huggingface.co/datasets/Maxwell-Jia/AIME_2024
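
To peek at the downloaded data, a hedged sketch run from the directory where the two repos were cloned (it assumes ``pandas`` and ``pyarrow`` are available and that both repos ship parquet files; adjust the globs to the actual layout):

.. code-block:: bash

    python3 - <<'EOF'
    import glob
    import pandas as pd

    # inspect whatever parquet files the two dataset repos contain
    for pattern in ("DAPO-Math-17k/**/*.parquet", "AIME_2024/**/*.parquet"):
        for path in glob.glob(pattern, recursive=True):
            df = pd.read_parquet(path)
            print(path, df.shape, list(df.columns))
    EOF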

Download and preprocess the SFT dataset

.. code-block:: bash

    python3 recipe/retool/retool_sft_preprocess.py

Note: this automatically downloads ReTool-SFT; by default, the generated data is saved under ``~/ReTool-SFT/data``.
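
A quick sanity check that preprocessing produced output where expected:

.. code-block:: bash

    ls -lh ~/ReTool-SFT/data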

Run the SFT training script

.. code-block:: bash

    bash recipe/retool/run_qwen2_7b_sft_npu.sh  # adjust the paths in the script to your environment

Merge the SFT checkpoint shards into a HuggingFace-format checkpoint

.. code-block:: bash

    python3 -m verl.model_merger merge --backend fsdp \
        --local_dir ${DATASETS}/checkpoint/multiturn-sft-qwen-2.5-7b-instruct/global_step_372 \
        --target_dir ${DATASETS}/checkpoint/multiturn-sft-qwen-2.5-7b-instruct/global_step_372/huggingface
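
To confirm the merged checkpoint is loadable before RL training, a small sanity check (``$DATASETS`` is the same variable used in the merge command above):

.. code-block:: bash

    hf_ckpt=$DATASETS/checkpoint/multiturn-sft-qwen-2.5-7b-instruct/global_step_372/huggingface
    ls $hf_ckpt   # expect config.json, tokenizer files, and *.safetensors shards
    python3 -c "from transformers import AutoConfig; print(AutoConfig.from_pretrained('$hf_ckpt'))"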

2. Code sandbox preparation

For the open-source sandbox code and its deployment guide, see https://github.com/bytedance/SandboxFusion

Download the sandbox code

.. code-block:: bash

    git clone -b main https://github.com/bytedance/SandboxFusion.git

Install and start the sandbox

.. code-block:: bash

    cd SandboxFusion
    conda create -n sandbox -y python=3.11
    conda activate sandbox
    pip install poetry
    poetry lock
    poetry install
    mkdir -p docs/build
    cd runtime/python
    bash install-python-runtime.sh
    cd ../../
    make run-online
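
Once ``make run-online`` is serving, the sandbox can be smoke-tested from another shell. The sketch below assumes SandboxFusion's default ``/run_code`` endpoint on port 8080; adjust host and port to your deployment:

.. code-block:: bash

    curl -s http://localhost:8080/run_code \
        -H 'Content-Type: application/json' \
        -d '{"code": "print(1 + 1)", "language": "python"}'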

3. Training

An example configuration is shown below. Create a ``run_qwen2.5_7b_dapo_npu.sh`` script under the ``recipe/retool`` directory, and adjust the following parameters in it (paths in particular) to match your actual environment.

.. code-block:: bash

    set -x

    export VLLM_USE_V1=1
    export TORCHDYNAMO_DISABLE=1
    export VLLM_ASCEND_ENABLE_NZ=0
    export TASK_QUEUE_ENABLE=1
    export VLLM_ENABLE_GRAPH_MODE=1
    export HCCL_OP_EXPANSION_MODE="AIV"
    export VLLM_ASCEND_ENABLE_MLP_OPTIMIZE=1
    export LD_PRELOAD=/usr/local/lib/libjemalloc.so.2

    # ================= data/model/tool =================
    HDFS_ROOT=${HDFS_ROOT:-"${PWD}"}
    DATA_ROOT=${DATA_ROOT:-"${PWD}"}

    dapo_math_17k=$DATA_ROOT/dataset/BytedTsinghua-SIA/DAPO-Math-17k
    aime_2024=$DATA_ROOT/dataset/Maxwell-Jia/AIME_2024
    #aime_2025=$DATA_ROOT/dataset/yentinglin/aime_2025
    model_path=$DATA_ROOT/dataset/checkpoint/multiturn-sft-qwen-2.5-7b-instruct/global_step_372/huggingface

    train_files="['$dapo_math_17k']"
    test_files="['$aime_2024']"

    # tool
    tool_config_path=recipe/retool/sandbox_fusion_tool_config.yaml

    # wandb
    project_name=retool
    experiment_name=qwen2.5-7b_dapo
    default_local_dir=$DATA_ROOT/checkpoint/$experiment_name

    # create the log file
    export TIMESTAMP=$(date +%Y%m%d_%H%M%S)
    LOG_DIR="$HDFS_ROOT/verl/logs/$project_name/$experiment_name"

    # check whether the log directory exists
    if [ ! -d "$LOG_DIR" ]; then
        # the path does not exist, create it
        mkdir -p "$LOG_DIR"
        echo "Directory $LOG_DIR created."
    else
        echo "Directory $LOG_DIR already exists."
    fi

    LOG_FILE="${LOG_DIR}/${TIMESTAMP}.log"
    touch "$LOG_FILE"
    echo "Log file $LOG_FILE created."

    # ================= algorithm =================
    adv_estimator=grpo

    use_kl_in_reward=False
    kl_coef=0.0
    use_kl_loss=False
    kl_loss_coef=0.0

    clip_ratio_low=0.2
    clip_ratio_high=0.28

    max_turns=16
    max_prompt_length=2048
    max_response_length=20480
    actor_lr=1e-6

    train_batch_size=32
    ppo_mini_batch_size=16

    n_resp_per_prompt=16
    n_resp_per_prompt_val=30

    # ================= performance =================
    infer_tp=2  # vllm
    train_sp=4  # train
    offload=True

    actor_max_token_len_per_gpu=$(( (max_prompt_length + max_response_length) * 1 ))
    log_prob_max_token_len_per_gpu=$(( actor_max_token_len_per_gpu * 4 ))

    PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
        algorithm.adv_estimator=$adv_estimator \
        algorithm.use_kl_in_reward=$use_kl_in_reward \
        algorithm.kl_ctrl.kl_coef=$kl_coef \
        data.train_files="$train_files" \
        data.val_files="$test_files" \
        data.return_raw_chat=True \
        data.train_batch_size=$train_batch_size \
        data.max_prompt_length=$max_prompt_length \
        data.max_response_length=$max_response_length \
        data.filter_overlong_prompts=True \
        data.truncation='error' \
        data.custom_cls.path=recipe/retool/retool.py \
        data.custom_cls.name=CustomRLHFDataset \
        custom_reward_function.path=recipe/retool/retool.py \
        custom_reward_function.name=compute_score \
        actor_rollout_ref.model.path=$model_path \
        actor_rollout_ref.model.use_remove_padding=True \
        actor_rollout_ref.model.enable_gradient_checkpointing=True \
        actor_rollout_ref.actor.use_kl_loss=$use_kl_loss \
        actor_rollout_ref.actor.kl_loss_coef=$kl_loss_coef \
        actor_rollout_ref.actor.clip_ratio_low=$clip_ratio_low \
        actor_rollout_ref.actor.clip_ratio_high=$clip_ratio_high \
        actor_rollout_ref.actor.clip_ratio_c=10.0 \
        actor_rollout_ref.actor.optim.lr=$actor_lr \
        actor_rollout_ref.actor.use_dynamic_bsz=True \
        actor_rollout_ref.actor.ppo_mini_batch_size=$ppo_mini_batch_size \
        actor_rollout_ref.actor.ppo_max_token_len_per_gpu=$actor_max_token_len_per_gpu \
        actor_rollout_ref.actor.ulysses_sequence_parallel_size=$train_sp \
        actor_rollout_ref.actor.fsdp_config.param_offload=$offload \
        actor_rollout_ref.actor.fsdp_config.optimizer_offload=$offload \
        actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=$log_prob_max_token_len_per_gpu \
        actor_rollout_ref.rollout.max_num_batched_tokens=$actor_max_token_len_per_gpu \
        actor_rollout_ref.rollout.name=vllm \
        actor_rollout_ref.rollout.mode=async \
        actor_rollout_ref.rollout.max_num_seqs=1024 \
        actor_rollout_ref.rollout.tensor_model_parallel_size=$infer_tp \
        actor_rollout_ref.rollout.multi_turn.enable=True \
        actor_rollout_ref.rollout.multi_turn.max_user_turns=$max_turns \
        actor_rollout_ref.rollout.multi_turn.max_assistant_turns=$max_turns \
        actor_rollout_ref.rollout.multi_turn.tool_config_path=$tool_config_path \
        actor_rollout_ref.rollout.multi_turn.format=hermes \
        actor_rollout_ref.rollout.gpu_memory_utilization=0.9 \
        actor_rollout_ref.rollout.n=$n_resp_per_prompt \
        actor_rollout_ref.rollout.val_kwargs.top_p=0.6 \
        actor_rollout_ref.rollout.val_kwargs.temperature=1.0 \
        actor_rollout_ref.rollout.val_kwargs.n=$n_resp_per_prompt_val \
        actor_rollout_ref.rollout.enable_chunked_prefill=True \
        actor_rollout_ref.rollout.enforce_eager=False \
        trainer.logger=['console'] \
        trainer.project_name=$project_name \
        trainer.experiment_name=$experiment_name \
        trainer.n_gpus_per_node=8 \
        trainer.val_before_train=False \
        trainer.log_val_generations=20 \
        trainer.nnodes=1 \
        trainer.save_freq=100 \
        trainer.default_local_dir=$default_local_dir \
        trainer.test_freq=20 \
        trainer.device=npu \
        actor_rollout_ref.actor.entropy_from_logits_with_chunking=True \
        actor_rollout_ref.ref.entropy_from_logits_with_chunking=True \
        actor_rollout_ref.actor.use_torch_compile=False \
        actor_rollout_ref.ref.use_torch_compile=False \
        actor_rollout_ref.actor.entropy_checkpointing=True \
        actor_rollout_ref.ref.entropy_checkpointing=True \
        trainer.total_epochs=1 $@ > $LOG_FILE 2>&1 &
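
With the paths adapted, launch the run from the verl repository root. Before launching, make sure the sandbox address configured in ``recipe/retool/sandbox_fusion_tool_config.yaml`` points at your running SandboxFusion service. A usage sketch:

.. code-block:: bash

    bash recipe/retool/run_qwen2.5_7b_dapo_npu.sh
    # the script detaches into the background; follow progress via the timestamped log
    tail -f ${HDFS_ROOT:-$PWD}/verl/logs/retool/qwen2.5-7b_dapo/*.log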