docs/ascend_tutorial/examples/ascend_retool_best_pratice.rst
Last updated: 03/01/2026.
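
ReTool (see the ReTool paper) integrates a code interpreter tool into training: the policy is rolled out with multi-turn, real-time code execution, and the model learns from outcome feedback when and how to call the tool.

The example model, its script, and the hardware it requires are listed below:
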
=============== ============ =============== ==========================
Model           NPU model    Number of nodes Training/inference backend
=============== ============ =============== ==========================
Qwen2.5-7B      Atlas 900 A2 1               vllm + FSDP
=============== ============ =============== ==========================

1. Build from a custom Conda environment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

============ ============================================================
software     version
============ ============================================================
Python       >= 3.10, <3.12
CANN         == 8.3.RC1
torch        == 2.7.1
torch_npu    == 2.7.1
verl         v0.6.1 commitId=d62da4950573d7a4b7ef2362337952e7ab59e78d
vllm         v0.11.0
vllm-ascend  v0.11.0-dev
transformers 4.57.6
============ ============================================================

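The Python constraint in the table above can be checked before anything else is installed. The following is a minimal sketch and not part of verl: ``python_version_ok`` is a hypothetical helper name, and it covers only the Python row of the table.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when a version string such as "3.11.4"
# satisfies the ">= 3.10, < 3.12" constraint from the table.
python_version_ok() {
    local version="$1" major minor
    major="${version%%.*}"      # text before the first dot
    minor="${version#*.}"       # text after the first dot
    minor="${minor%%.*}"        # keep only the minor component
    [ "$major" -eq 3 ] && [ "$minor" -ge 10 ] && [ "$minor" -lt 12 ]
}

# Example: check the interpreter in the active Conda environment.
# current="$(python3 -c 'import platform; print(platform.python_version())')"
# python_version_ok "$current" || echo "Unsupported Python: $current"
```

The check compares only the major and minor components, so any 3.10.x or 3.11.x patch release passes.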
1. Model and data preparation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Qwen2.5-7B
^^^^^^^^^^^
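Download the model weights. If you download with the Hugging Face CLI instead of ``git clone``, ``--local-dir`` sets the path where the model is saved.

.. code-block:: bash

    git clone https://huggingface.co/Qwen/Qwen2.5-7B-Instruct
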
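Hugging Face repositories store large weight files with Git LFS. If ``git-lfs`` is not installed, ``git clone`` leaves small text pointer files in place of the real weights. The sketch below detects that case. The helper names are hypothetical, and the ``*.safetensors`` pattern is an assumption about how the weights are stored.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the file is still a Git LFS pointer
# (a small text stub) rather than the real weight file.
is_lfs_pointer() {
    local marker="version https://git-lfs.github.com/spec/v1" prefix
    # Compare only the first bytes; strip NULs so binary files are handled quietly.
    prefix="$(head -c "${#marker}" "$1" | tr -d '\0')"
    [ "$prefix" = "$marker" ]
}

# Hypothetical helper: print every *.safetensors file in a directory
# that is still a pointer.
list_lfs_pointers() {
    local file
    for file in "$1"/*.safetensors; do
        [ -e "$file" ] || continue      # the glob matched nothing
        if is_lfs_pointer "$file"; then
            printf '%s\n' "$file"
        fi
    done
}

# Usage:
# list_lfs_pointers ./Qwen2.5-7B-Instruct
```

If the function prints any file names, installing ``git-lfs`` and running ``git lfs pull`` inside the cloned repository fetches the real files.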
Download the training dataset

.. code-block:: bash

    git clone https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k

Download the evaluation dataset

.. code-block:: bash

    git clone https://huggingface.co/datasets/Maxwell-Jia/AIME_2024

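The training script in this guide reads both datasets from fixed locations under ``$DATA_ROOT/dataset``. The following sketch reports which of those directories are missing. ``missing_datasets`` is a hypothetical helper name; the relative paths are copied from the training script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each expected dataset directory that does not
# exist under the given DATA_ROOT. Prints nothing when the layout is complete.
missing_datasets() {
    local data_root="$1" rel
    for rel in \
        "dataset/BytedTsinghua-SIA/DAPO-Math-17k" \
        "dataset/Maxwell-Jia/AIME_2024"; do
        [ -d "$data_root/$rel" ] || printf '%s\n' "$data_root/$rel"
    done
}

# Usage: the training script falls back to the current directory when
# DATA_ROOT is unset, so the same default is used here.
# missing_datasets "${DATA_ROOT:-$PWD}"
```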
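Download the pre-training (SFT) dataset

.. code-block:: bash

    python3 recipe/retool/retool_sft_preprocess.py

Note: the script downloads ReTool-SFT automatically and saves the generated data under ``~/ReTool-SFT/data`` by default.
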
Run the pre-training (SFT) script

.. code-block:: bash

    bash recipe/retool/run_qwen2_7b_sft_npu.sh  # adapt the paths in the script first

Merge the pre-trained weights into a checkpoint

.. code-block:: bash

    python3 -m verl.model_merger merge --backend fsdp \
        --local_dir ${DATASETS}/checkpoint/multiturn-sft-qwen-2.5-7b-instruct/global_step_372 \
        --target_dir ${DATASETS}/checkpoint/multiturn-sft-qwen-2.5-7b-instruct/global_step_372/huggingface

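Before pointing the training script at the merged directory, it can help to confirm that the merge produced something loadable. The sketch below assumes the target directory follows the usual Hugging Face layout (a ``config.json`` plus at least one ``*.safetensors`` or ``*.bin`` weight file). That layout is an assumption, not something this guide states, and ``looks_like_hf_checkpoint`` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when a directory contains config.json and
# at least one weight file, which is the assumed Hugging Face layout.
looks_like_hf_checkpoint() {
    local dir="$1" shard
    [ -f "$dir/config.json" ] || return 1
    for shard in "$dir"/*.safetensors "$dir"/*.bin; do
        [ -f "$shard" ] && return 0
    done
    return 1
}

# Usage:
# target="${DATASETS}/checkpoint/multiturn-sft-qwen-2.5-7b-instruct/global_step_372/huggingface"
# looks_like_hf_checkpoint "$target" || echo "Merged checkpoint looks incomplete: $target"
```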
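2. Code sandbox preparation
^^^^^^^^^^^^^^^^^^^^^^^^^^^

For the open-source sandbox code and deployment instructions, see https://github.com/bytedance/SandboxFusion
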
Download the sandbox code

.. code-block:: bash

    git clone -b main https://github.com/bytedance/SandboxFusion.git

Install the sandbox

.. code-block:: bash

    cd SandboxFusion
    conda create -n sandbox -y python=3.11
    conda activate sandbox
    pip install poetry
    poetry lock
    poetry install
    mkdir -p docs/build
    cd runtime/python
    bash install-python-runtime.sh
    cd ../../
    make run-online

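Once ``make run-online`` is running, a quick request can confirm that the sandbox executes code. The sketch below rests on two assumptions that should be checked against your deployment and the SandboxFusion documentation: that the service listens at ``http://localhost:8080/run_code``, and that it accepts a JSON body with ``code`` and ``language`` fields. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Assumed default endpoint; adjust host and port to your deployment.
SANDBOX_URL="${SANDBOX_URL:-http://localhost:8080/run_code}"

# Build the JSON request body for a Python snippet.
# Limitation of this sketch: only backslashes and double quotes are escaped,
# so the snippet must fit on a single line.
build_run_code_payload() {
    local code="$1"
    code="${code//\\/\\\\}"     # escape backslashes first
    code="${code//\"/\\\"}"     # then escape double quotes
    printf '{"code": "%s", "language": "python"}' "$code"
}

# Send the snippet to the sandbox and print the raw response.
sandbox_smoke_test() {
    curl -sS -X POST "$SANDBOX_URL" \
        -H 'Content-Type: application/json' \
        --data "$(build_run_code_payload "$1")"
}

# Usage (requires the running sandbox):
# sandbox_smoke_test 'print(1 + 1)'
```

The same URL is what the tool configuration referenced by the training script needs to reach, so a failing request here usually means training rollouts would fail to execute code as well.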
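3. Training
^^^^^^^^^^^

An example configuration is shown below. Create ``run_qwen2.5_7b_dapo_npu.sh`` in the ``recipe/retool`` directory, then adjust the following parameters in the training script to match your actual paths.
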
.. code-block:: bash

    set -x

    export VLLM_USE_V1=1
    export TORCHDYNAMO_DISABLE=1
    export VLLM_ASCEND_ENABLE_NZ=0
    export TASK_QUEUE_ENABLE=1
    export VLLM_ENABLE_GRAPH_MODE=1
    export HCCL_OP_EXPANSION_MODE="AIV"
    export VLLM_ASCEND_ENABLE_MLP_OPTIMIZE=1
    export LD_PRELOAD=/usr/local/lib/libjemalloc.so.2

    HDFS_ROOT=${HDFS_ROOT:-"${PWD}"}
    DATA_ROOT=${DATA_ROOT:-"${PWD}"}

    dapo_math_17k=$DATA_ROOT/dataset/BytedTsinghua-SIA/DAPO-Math-17k
    aime_2024=$DATA_ROOT/dataset/Maxwell-Jia/AIME_2024
    #aime_2025=$DATA_ROOT/dataset/yentinglin/aime_2025
    model_path=$DATA_ROOT/dataset/checkpoint/multiturn-sft-qwen-2.5-7b-instruct/global_step_372/huggingface

    train_files="['$dapo_math_17k']"
    test_files="['$aime_2024']"

    tool_config_path=recipe/retool/sandbox_fusion_tool_config.yaml

    project_name=retool
    experiment_name=qwen2.5-7b_dapo
    default_local_dir=$DATA_ROOT/checkpoint/$experiment_name

    export TIMESTAMP=$(date +%Y%m%d_%H%M%S)
    LOG_DIR="$HDFS_ROOT/verl/logs/$project_name/$experiment_name"

    if [ ! -d "$LOG_DIR" ]; then
        # Path does not exist; create it
        mkdir -p "$LOG_DIR"
        echo "Directory $LOG_DIR created."
    else
        echo "Directory $LOG_DIR already exists."
    fi

    LOG_FILE="${LOG_DIR}/${TIMESTAMP}.log"
    touch "$LOG_FILE"
    echo "Log file $LOG_FILE created."

    adv_estimator=grpo

    use_kl_in_reward=False
    kl_coef=0.0
    use_kl_loss=False
    kl_loss_coef=0.0

    clip_ratio_low=0.2
    clip_ratio_high=0.28

    max_turns=16
    max_prompt_length=2048
    max_response_length=20480
    actor_lr=1e-6

    train_batch_size=32
    ppo_mini_batch_size=16

    n_resp_per_prompt=16
    n_resp_per_prompt_val=30

    infer_tp=2   # vllm
    train_sp=4   # train
    offload=True

    actor_max_token_len_per_gpu=$(( (max_prompt_length + max_response_length) * 1 ))
    log_prob_max_token_len_per_gpu=$(( actor_max_token_len_per_gpu * 4 ))

    PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
        algorithm.adv_estimator=$adv_estimator \
        algorithm.use_kl_in_reward=$use_kl_in_reward \
        algorithm.kl_ctrl.kl_coef=$kl_coef \
        data.train_files="$train_files" \
        data.val_files="$test_files" \
        data.return_raw_chat=True \
        data.train_batch_size=$train_batch_size \
        data.max_prompt_length=$max_prompt_length \
        data.max_response_length=$max_response_length \
        data.filter_overlong_prompts=True \
        data.truncation='error' \
        data.custom_cls.path=recipe/retool/retool.py \
        data.custom_cls.name=CustomRLHFDataset \
        custom_reward_function.path=recipe/retool/retool.py \
        custom_reward_function.name=compute_score \
        actor_rollout_ref.model.path=$model_path \
        actor_rollout_ref.model.use_remove_padding=True \
        actor_rollout_ref.model.enable_gradient_checkpointing=True \
        actor_rollout_ref.actor.use_kl_loss=$use_kl_loss \
        actor_rollout_ref.actor.kl_loss_coef=$kl_loss_coef \
        actor_rollout_ref.actor.clip_ratio_low=$clip_ratio_low \
        actor_rollout_ref.actor.clip_ratio_high=$clip_ratio_high \
        actor_rollout_ref.actor.clip_ratio_c=10.0 \
        actor_rollout_ref.actor.optim.lr=$actor_lr \
        actor_rollout_ref.actor.use_dynamic_bsz=True \
        actor_rollout_ref.actor.ppo_mini_batch_size=$ppo_mini_batch_size \
        actor_rollout_ref.actor.ppo_max_token_len_per_gpu=$actor_max_token_len_per_gpu \
        actor_rollout_ref.actor.ulysses_sequence_parallel_size=$train_sp \
        actor_rollout_ref.actor.fsdp_config.param_offload=$offload \
        actor_rollout_ref.actor.fsdp_config.optimizer_offload=$offload \
        actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=$log_prob_max_token_len_per_gpu \
        actor_rollout_ref.rollout.max_num_batched_tokens=$actor_max_token_len_per_gpu \
        actor_rollout_ref.rollout.name=vllm \
        actor_rollout_ref.rollout.mode=async \
        actor_rollout_ref.rollout.max_num_seqs=1024 \
        actor_rollout_ref.rollout.tensor_model_parallel_size=$infer_tp \
        actor_rollout_ref.rollout.multi_turn.enable=True \
        actor_rollout_ref.rollout.multi_turn.max_user_turns=$max_turns \
        actor_rollout_ref.rollout.multi_turn.max_assistant_turns=$max_turns \
        actor_rollout_ref.rollout.multi_turn.tool_config_path=$tool_config_path \
        actor_rollout_ref.rollout.multi_turn.format=hermes \
        actor_rollout_ref.rollout.gpu_memory_utilization=0.9 \
        actor_rollout_ref.rollout.n=$n_resp_per_prompt \
        actor_rollout_ref.rollout.val_kwargs.top_p=0.6 \
        actor_rollout_ref.rollout.val_kwargs.temperature=1.0 \
        actor_rollout_ref.rollout.val_kwargs.n=$n_resp_per_prompt_val \
        actor_rollout_ref.rollout.enable_chunked_prefill=True \
        actor_rollout_ref.rollout.enforce_eager=False \
        trainer.logger=['console'] \
        trainer.project_name=$project_name \
        trainer.experiment_name=$experiment_name \
        trainer.n_gpus_per_node=8 \
        trainer.val_before_train=False \
        trainer.log_val_generations=20 \
        trainer.nnodes=1 \
        trainer.save_freq=100 \
        trainer.default_local_dir=$default_local_dir \
        trainer.test_freq=20 \
        trainer.device=npu \
        actor_rollout_ref.actor.entropy_from_logits_with_chunking=True \
        actor_rollout_ref.ref.entropy_from_logits_with_chunking=True \
        actor_rollout_ref.actor.use_torch_compile=False \
        actor_rollout_ref.ref.use_torch_compile=False \
        actor_rollout_ref.actor.entropy_checkpointing=True \
        actor_rollout_ref.ref.entropy_checkpointing=True \
        trainer.total_epochs=1 "$@" > "$LOG_FILE" 2>&1 &

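The derived quantities in the script are easier to reason about with the numbers filled in. The worked example below repeats the script's two formulas with its default values and adds three further quantities for illustration. Reading ``n_gpus_per_node / infer_tp`` as the number of vLLM replicas and ``n_gpus_per_node / train_sp`` as the number of sequence-parallel groups is an assumption of this sketch, as are the variable names ``rollouts_per_step``, ``vllm_replicas``, and ``sp_groups``.

```shell
#!/usr/bin/env bash
# Values copied from the training script.
max_prompt_length=2048
max_response_length=20480
train_batch_size=32
n_resp_per_prompt=16
n_gpus_per_node=8
infer_tp=2
train_sp=4

# Same formulas as in the script: per-device token budget for actor updates,
# and a budget four times larger for log-probability passes.
actor_max_token_len_per_gpu=$(( (max_prompt_length + max_response_length) * 1 ))
log_prob_max_token_len_per_gpu=$(( actor_max_token_len_per_gpu * 4 ))

# Illustrative extras (assumed interpretation, see the paragraph above).
rollouts_per_step=$(( train_batch_size * n_resp_per_prompt ))
vllm_replicas=$(( n_gpus_per_node / infer_tp ))
sp_groups=$(( n_gpus_per_node / train_sp ))

echo "actor_max_token_len_per_gpu=$actor_max_token_len_per_gpu"        # prints 22528
echo "log_prob_max_token_len_per_gpu=$log_prob_max_token_len_per_gpu"  # prints 90112
echo "rollouts_per_step=$rollouts_per_step"                            # prints 512
echo "vllm_replicas=$vllm_replicas"                                    # prints 4
echo "sp_groups=$sp_groups"                                            # prints 2
```

When changing ``max_response_length``, ``infer_tp``, or ``train_sp``, rerunning this arithmetic shows how the token budgets change and whether the device count still divides evenly.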