# Reward Usages

A reward model (RM) is designed to evaluate and score the quality of outputs generated by a language model, acting as a proxy for human preferences.

## Summary

- Model Usage: reward
- Pooling Task:

    | Model Types | Pooling Tasks |
    |-------------|---------------|
    | (sequence) (outcome) reward models | `classify` |
    | token (outcome) reward models | `token_classify` |
    | process reward models | `token_classify` |

- Offline APIs:
    - `LLM.encode(..., pooling_task="...")`
- Online APIs:
    - Pooling API (`/pooling`)

## Supported Models

### Reward Models

Sequence classification models can be used as (sequence) (outcome) reward models; the usage and supported features are the same as for normal classification models.

--8<-- [start:supported-sequence-reward-models]

| Architecture | Models | Example HF Models | LoRA | PP |
|--------------|--------|-------------------|------|----|
| `JambaForSequenceClassification` | Jamba | `ai21labs/Jamba-tiny-reward-dev`, etc. | ✅︎ | ✅︎ |
| `Qwen3ForSequenceClassification`<sup>C</sup> | Qwen3-based | `Skywork/Skywork-Reward-V2-Qwen3-0.6B`, etc. | ✅︎ | ✅︎ |
| `LlamaForSequenceClassification`<sup>C</sup> | Llama-based | `Skywork/Skywork-Reward-V2-Llama-3.2-1B`, etc. | ✅︎ | ✅︎ |
| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* |

<sup>C</sup> Automatically converted into a classification model via `--convert classify`. (details)

If your model is not in the above list, we will try to automatically convert the model using [as_seq_cls_model][vllm.model_executor.models.adapters.as_seq_cls_model]. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.

--8<-- [end:supported-sequence-reward-models]
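
For example, here is a minimal sketch of requesting the conversion explicitly from the offline API, assuming the `--convert classify` CLI flag corresponds to the `convert` argument of `LLM`:

```python
from vllm import LLM

# Sketch: request the classify conversion explicitly. For checkpoints marked
# with <sup>C</sup> in the table above, vLLM applies this conversion automatically.
llm = LLM(
    model="Skywork/Skywork-Reward-V2-Llama-3.2-1B",  # example from the table above
    runner="pooling",
    convert="classify",  # assumption: Python counterpart of --convert classify
)
```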

### Token Reward Models

The key distinction between (sequence) classification and token classification lies in their output granularity: (sequence) classification produces a single result for an entire input sequence, whereas token classification yields a result for each individual token within the sequence.
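
To make the distinction concrete, here is a rough sketch of the resulting output shapes (illustrative only; actual shapes depend on the model and its number of labels):

```python
# Suppose a prompt tokenizes into N tokens and the model has C labels.
#
# (out,) = llm.encode(prompt, pooling_task="classify")
# out.outputs.data.shape   # (C,)   -> one result for the whole sequence
#
# (out,) = llm.encode(prompt, pooling_task="token_classify")
# out.outputs.data.shape   # (N, C) -> one result per token
```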

Token classification models can be used as token (outcome) reward models; the usage and supported features are the same as for normal token classification models.

--8<-- [start:supported-token-reward-models]

| Architecture | Models | Example HF Models | LoRA | PP |
|--------------|--------|-------------------|------|----|
| `InternLM2ForRewardModel` | InternLM2-based | `internlm/internlm2-1_8b-reward`, `internlm/internlm2-7b-reward`, etc. | ✅︎ | ✅︎ |
| `Qwen2ForRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-RM-72B`, etc. | ✅︎ | ✅︎ |
| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* |

<sup>C</sup> Automatically converted into a classification model via `--convert classify`. (details)

If your model is not in the above list, we will try to automatically convert the model using [as_seq_cls_model][vllm.model_executor.models.adapters.as_seq_cls_model].

--8<-- [end:supported-token-reward-models]

### Process Reward Models

Process reward models score the intermediate steps of a response; evaluating these steps is crucial to achieving the desired final outcome.

| Architecture | Models | Example HF Models | LoRA | PP |
|--------------|--------|-------------------|------|----|
| `LlamaForCausalLM` | Llama-based | `peiyi9979/math-shepherd-mistral-7b-prm`, etc. | ✅︎ | ✅︎ |
| `Qwen2ForProcessRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-PRM-7B`, etc. | ✅︎ | ✅︎ |

!!! important
    For process-supervised reward models such as `peiyi9979/math-shepherd-mistral-7b-prm`, the pooling config should be set explicitly, e.g.: `--pooler-config '{"pooling_type": "STEP", "step_tag_id": 123, "returned_token_ids": [456, 789]}'`.
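
The ids above (123, 456, 789) are placeholders. A minimal sketch of deriving the real values for `peiyi9979/math-shepherd-mistral-7b-prm`, following the recipe from that model's card (each reasoning step ends with the tag "ки" and is scored by the logits of the "+"/"-" tokens):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("peiyi9979/math-shepherd-mistral-7b-prm")

# "ки" tags the end of each reasoning step; "+" / "-" are the good/bad labels
# whose logits score that step.
step_tag_id = tokenizer.encode("ки")[-1]
good_id, bad_id = tokenizer.encode("+ -")[1:]  # drop the leading BOS token

print(
    '--pooler-config \'{"pooling_type": "STEP", '
    f'"step_tag_id": {step_tag_id}, '
    f'"returned_token_ids": [{good_id}, {bad_id}]}}\''
)
```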

## Offline Inference

### Pooling Parameters

The following [pooling parameters][vllm.PoolingParams] are supported.

```python
--8<-- "vllm/pooling_params.py:common-pooling-params"
--8<-- "vllm/pooling_params.py:classify-pooling-params"
```
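
A hedged sketch of passing these options per request via [PoolingParams][vllm.PoolingParams] (a default instance is used here; set the fields listed in the snippet above as needed):

```python
from vllm import LLM, PoolingParams

llm = LLM(model="Skywork/Skywork-Reward-V2-Qwen3-0.6B", runner="pooling")

# A default PoolingParams instance; the fields available for classify-style
# pooling are those listed in the snippet above.
(output,) = llm.encode(
    "Hello, my name is",
    pooling_params=PoolingParams(),
    pooling_task="classify",
)
print(output.outputs.data)
```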

### LLM.encode

The [encode][vllm.LLM.encode] method is available to all pooling models in vLLM.

#### Reward Models

Set `pooling_task="classify"` when using `LLM.encode` for (sequence) (outcome) reward models:

```python
from vllm import LLM

llm = LLM(model="Skywork/Skywork-Reward-V2-Qwen3-0.6B", runner="pooling")
(output,) = llm.encode("Hello, my name is", pooling_task="classify")

data = output.outputs.data
print(f"Data: {data!r}")
```
#### Token Reward Models

Set `pooling_task="token_classify"` when using `LLM.encode` for token (outcome) reward models:

```python
from vllm import LLM

llm = LLM(model="internlm/internlm2-1_8b-reward", runner="pooling", trust_remote_code=True)
(output,) = llm.encode("Hello, my name is", pooling_task="token_classify")

data = output.outputs.data
print(f"Data: {data!r}")
```
#### Process Reward Models

Set `pooling_task="token_classify"` when using `LLM.encode` for process reward models:

```python
from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-Math-PRM-7B", runner="pooling")
(output,) = llm.encode("Hello, my name is<extra_0><extra_0><extra_0>", pooling_task="token_classify")

data = output.outputs.data
print(f"Data: {data!r}")
```

## Online Serving

Please refer to the Pooling API (`/pooling`). For the pooling task corresponding to each reward model type, see the table in the Summary above.
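
As a hedged illustration (the request shape is assumed to follow the Pooling API; field names may vary across versions), querying a token reward model served with `vllm serve internlm/internlm2-1_8b-reward --runner pooling --trust-remote-code`:

```python
import requests

# Assumes a local server started with:
#   vllm serve internlm/internlm2-1_8b-reward --runner pooling --trust-remote-code
response = requests.post(
    "http://localhost:8000/pooling",
    json={
        "model": "internlm/internlm2-1_8b-reward",
        "input": "Hello, my name is",
    },
)
response.raise_for_status()
print(response.json())
```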

## More examples

More examples can be found here: `examples/pooling/reward`

## Deprecated Features

### LLM.reward

The `LLM.reward` API is deprecated and will be removed in v0.23. Please use `LLM.encode` with `pooling_task="classify"` or `pooling_task="token_classify"` instead.
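
A minimal migration sketch (signatures taken from the examples above; the deprecated call is shown commented out):

```python
from vllm import LLM

llm = LLM(model="internlm/internlm2-1_8b-reward", runner="pooling", trust_remote_code=True)

# Deprecated, removed in v0.23:
# (output,) = llm.reward("Hello, my name is")

# Replacement:
(output,) = llm.encode("Hello, my name is", pooling_task="token_classify")
print(output.outputs.data)
```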