Recipe: CollabLLM

Last updated: 09/22/2025.

Open-Source Algorithm Implementation & Experiments: Haiquan Chen, Shirley Wu

šŸ  Homepage | šŸ“ PaperĀ |Ā šŸ¤— Datasets & Models | ā­ļø Original Implementation

verl provides a recipe for the Outstanding Paper at ICML 2025, "CollabLLM: From Passive Responders to Active Collaborators". CollabLLM is a unified fine-tuning framework that optimizes LLMs for effective and efficient multiturn collaboration with users.

Core Idea: Models are rewarded based on how well their responses enable effective future collaboration with users.

Paper Authors: Shirley Wu, Michel Galley, Baolin Peng, Hao Cheng, Gavin Li, Yao Dou, Weixin Cai, James Zou, Jure Leskovec, Jianfeng Gao


Quick Start

0. Environment

Make sure the required packages for verl are installed. Additionally, install litellm and export the required API keys. The API model will be used for user simulators and, optionally, LLM Judges (see the Configuration section below).

1. Prepare Your Dataset

First, process your dataset using the provided script (see example commands and usage in process_dataset.py):

```bash
python process_dataset.py --dataset <> ... --dataset_type <sft or rl>
```

Requirements:

  • Input: A Hugging Face multiturn dataset. Existing datasets: collabllm/collabllm-multiturn-$DATASET, with DATASET in one of [math-hard(-large), medium(-large), bigcodebench(-large)] (*-large are the datasets used in the CollabLLM paper)
  • Example format: See collabllm-multiturn-math-hard
  • To generate your own dataset: Use build_dataset.py from the original CollabLLM repository
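To illustrate the expected input, here is a minimal sketch of what a chat-format multiturn record might look like. The field names here are illustrative assumptions only; consult collabllm-multiturn-math-hard for the actual schema.

```python
# Hypothetical multiturn record in chat format (field names are assumptions,
# not the verified schema of the collabllm datasets).
example_record = {
    "prompt": "Solve x^2 - 5x + 6 = 0.",
    "messages": [
        {"role": "user", "content": "Solve x^2 - 5x + 6 = 0."},
        {"role": "assistant", "content": "Do you want both roots or just one?"},
        {"role": "user", "content": "Both, please."},
        {"role": "assistant", "content": "The roots are x = 2 and x = 3."},
    ],
}

def is_valid_multiturn(record):
    """Check that messages alternate user/assistant, starting with a user turn."""
    roles = [m["role"] for m in record["messages"]]
    expected = ["user", "assistant"] * (len(roles) // 2 + 1)
    return roles == expected[: len(roles)]

print(is_valid_multiturn(example_record))  # True
```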

2. Train Your Model

(Optional) For Supervised Fine-Tuning (SFT):

```bash
bash train_sft_collabllm.sh
```

For Reinforcement Learning (RL):

```bash
bash train_rl_collabllm.sh
```

The RL script shows an example of training CollabLLM on math-hard-large.

  • The configuration for sampling future conversations is in recipe/collabllm/config/collabllm_interaction_config.yaml.

  • The Multiturn-aware Reward is aggregated from these three conversational-level rewards:

    ```bash
    +reward_model.reward_kwargs.metric_weights.accuracy=1 \
    +reward_model.reward_kwargs.metric_weights.interactivity=1 \
    +reward_model.reward_kwargs.metric_weights.token_amount=-0.0001 \
    ```

    You can remove, add, or modify the weights depending on your task. The implemented metrics you can add are listed under recipe/collabllm/metrics. For example, on medium-large, you can replace accuracy with bleu_score via

    ```bash
    +reward_model.reward_kwargs.metric_weights.bleu_score=1
    ```

    which instead applies the BLEU score to the sampled future conversations.
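The weighted aggregation of these metrics can be sketched as a simple weighted sum. The function below and its score values are purely illustrative; the actual reward logic lives in recipe/collabllm/reward_function.py.

```python
def aggregate_multiturn_reward(metric_scores, metric_weights):
    """Weighted sum of conversation-level metric scores (illustrative sketch).

    metric_scores:  {metric_name: score} for one sampled future conversation.
    metric_weights: {metric_name: weight}, e.g. the values passed via the
                    +reward_model.reward_kwargs.metric_weights.* overrides above.
    """
    return sum(metric_weights[name] * metric_scores.get(name, 0.0)
               for name in metric_weights)

# Mirrors the example weights: accuracy=1, interactivity=1, token_amount=-0.0001
weights = {"accuracy": 1.0, "interactivity": 1.0, "token_amount": -0.0001}
scores = {"accuracy": 1.0, "interactivity": 0.8, "token_amount": 1200.0}
reward = aggregate_multiturn_reward(scores, weights)
print(round(reward, 4))  # 1.68
```

Note how the negative token_amount weight acts as a brevity penalty: longer sampled conversations reduce the aggregated reward.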

Algorithm

| Step | Name | Description |
|------|------|-------------|
| 1 | Model response generation | The model generates multiple responses for each prompt in a batch. |
| 2 | Collaborative simulation | A user simulator (e.g., GPT or Claude) samples num_repeat_rollouts conversations for up to max_user_turns additional turns. |
| 3 | Compute Multiturn-aware Reward | Customized conversation-level reward functions are applied to the sampled conversations. Rewards are aggregated, then averaged across rollouts. |
| 4 | Update model | The model weights are updated using the computed multiturn-aware rewards. |
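Steps 2 and 3 can be sketched as a small function over two callables: a user simulator that samples a future conversation, and a scorer that returns the aggregated conversational reward. Everything here is an illustrative stub, not verl's API; the real implementation is in recipe/collabllm/collabllm_agent_loop.py.

```python
def multiturn_aware_reward(response, sample_future, score, num_repeat_rollouts):
    """Steps 2-3: average conversational reward over sampled future dialogues.

    sample_future(response) -> one simulated future conversation (user simulator).
    score(conversation)     -> aggregated conversation-level reward.
    Illustrative sketch only, not verl's actual interface.
    """
    rollouts = [sample_future(response) for _ in range(num_repeat_rollouts)]
    return sum(score(conv) for conv in rollouts) / num_repeat_rollouts

# Stub simulator and scorer, just to show the control flow:
precomputed = iter([0.9, 0.7, 0.8])
reward = multiturn_aware_reward(
    response="Do you want both roots?",
    sample_future=lambda r: r,        # stub: the "conversation" is the response
    score=lambda conv: next(precomputed),  # stub: precomputed rollout scores
    num_repeat_rollouts=3,
)
print(round(reward, 2))  # 0.8
```

Averaging across rollouts (step 3) reduces the variance introduced by the stochastic user simulator before the reward is used in the policy update (step 4).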

Configuration

The primary configuration is managed through the launch script train_rl_collabllm.sh and the YAML file recipe/collabllm/config/collabllm_interaction_config.yaml. Key configuration sections:

| Section | Key Parameters / Notes |
|---------|------------------------|
| data | Paths to training/validation files, batch sizes, sequence lengths. |
| actor_rollout_ref (common) | Base model path (used for actor + initial reference), FSDP settings, optimization (LR, scheduler). |
| actor_rollout_ref (CollabLLM-specific) | Hyperparameters under actor_rollout_ref.rollout.multi_turn: max_user_turns, max_assistant_turns, num_repeat_rollouts. |
| interaction | Defined in collabllm_interaction_config.yaml. Specifies the user simulator and its hyperparameters. Requires exported API keys. |
| reward_model | Manager set to collabllm by default. Modify reward_model.reward_kwargs.metric_weights for conversational rewards and weights. LLM Judge hyperparameters (e.g., model, temperature) go under reward_model.reward_kwargs.llm_judge_kwargs. |
| algorithm | GRPO-specific hyperparameters such as actor_rollout_ref.rollout.n. |
| trainer | Distributed training (nodes, GPUs per node), logging (WandB), checkpointing frequency. |

Key Files

| File Path | Purpose |
|-----------|---------|
| recipe/collabllm/collabllm_agent_loop.py | Main logic to sample future conversations, using CollabLLMInteraction from verl/interactions/collabllm_interaction.py. |
| verl/workers/reward_manager/collabllm.py | Computes rewards for future conversations, leveraging recipe/collabllm/reward_function.py to apply each metric. |

Acknowledgement

We sincerely thank the verl community and advisors for their contributions and guidance!