# Reward Usages

A reward model (RM) is designed to evaluate and score the quality of outputs generated by a language model, acting as a proxy for human preferences.

## Summary

- Model Usage: reward
- Pooling Task:

    | Model Types | Pooling Tasks |
    |-------------|---------------|
    | (sequence) (outcome) reward models | `classify` |
    | token (outcome) reward models | `token_classify` |
    | process reward models | `token_classify` |

- Offline APIs:
    - `LLM.encode(..., pooling_task="...")`
- Online APIs:
    - Pooling API (`/pooling`)

## Supported Models

### Reward Models

Sequence classification models can be used as (sequence) (outcome) reward models; the usage and supported features are the same as for normal classification models.

--8<-- [start:supported-sequence-reward-models]

| Architecture | Models | Example HF Models | LoRA | PP |
|--------------|--------|-------------------|------|----|
| `JambaForSequenceClassification` | Jamba | `ai21labs/Jamba-tiny-reward-dev`, etc. | ✅︎ | ✅︎ |
| `Qwen3ForSequenceClassification`<sup>C</sup> | Qwen3-based | `Skywork/Skywork-Reward-V2-Qwen3-0.6B`, etc. | ✅︎ | ✅︎ |
| `LlamaForSequenceClassification`<sup>C</sup> | Llama-based | `Skywork/Skywork-Reward-V2-Llama-3.2-1B`, etc. | ✅︎ | ✅︎ |
| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* |

<sup>C</sup> Automatically converted into a classification model via `--convert classify`. (details)

If your model is not in the above list, we will try to automatically convert the model using [as_seq_cls_model][vllm.model_executor.models.adapters.as_seq_cls_model]. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.

--8<-- [end:supported-sequence-reward-models]
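
For example, here is a minimal sketch of requesting the conversion explicitly from the offline API, assuming the `--convert classify` CLI flag corresponds to the `convert` argument of `LLM`:

```python
from vllm import LLM

# Sketch: request the classify conversion explicitly. For checkpoints marked
# with <sup>C</sup> in the table above, vLLM applies this conversion automatically.
llm = LLM(
    model="Skywork/Skywork-Reward-V2-Llama-3.2-1B",  # example from the table above
    runner="pooling",
    convert="classify",  # assumption: Python counterpart of --convert classify
)
```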

### Token Reward Models

The key distinction between (sequence) classification and token classification lies in their output granularity: (sequence) classification produces a single result for an entire input sequence, whereas token classification yields a result for each individual token within the sequence.
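
To make the distinction concrete, here is a rough sketch of the resulting output shapes (illustrative only; actual shapes depend on the model and its number of labels):

```python
# Suppose a prompt tokenizes into N tokens and the model has C labels.
#
# (out,) = llm.encode(prompt, pooling_task="classify")
# out.outputs.data.shape   # (C,)   -> one result for the whole sequence
#
# (out,) = llm.encode(prompt, pooling_task="token_classify")
# out.outputs.data.shape   # (N, C) -> one result per token
```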

Token classification models can be used as token (outcome) reward models; the usage and supported features are the same as for normal token classification models.

--8<-- [start:supported-token-reward-models]

| Architecture | Models | Example HF Models | LoRA | PP |
|--------------|--------|-------------------|------|----|
| `InternLM2ForRewardModel` | InternLM2-based | `internlm/internlm2-1_8b-reward`, `internlm/internlm2-7b-reward`, etc. | ✅︎ | ✅︎ |
| `Qwen2ForRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-RM-72B`, etc. | ✅︎ | ✅︎ |
| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* |

<sup>C</sup> Automatically converted into a classification model via `--convert classify`. (details)

If your model is not in the above list, we will try to automatically convert the model using [as_seq_cls_model][vllm.model_executor.models.adapters.as_seq_cls_model].

--8<-- [end:supported-token-reward-models]

### Process Reward Models

Process reward models score the intermediate steps of a response; evaluating these steps is crucial to achieving the desired final outcome.

| Architecture | Models | Example HF Models | LoRA | PP |
|--------------|--------|-------------------|------|----|
| `LlamaForCausalLM` | Llama-based | `peiyi9979/math-shepherd-mistral-7b-prm`, etc. | ✅︎ | ✅︎ |
| `Qwen2ForProcessRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-PRM-7B`, etc. | ✅︎ | ✅︎ |

!!! important
    For process-supervised reward models such as `peiyi9979/math-shepherd-mistral-7b-prm`, the pooling config should be set explicitly, e.g.: `--pooler-config '{"pooling_type": "STEP", "step_tag_id": 123, "returned_token_ids": [456, 789]}'`.
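
The ids above (123, 456, 789) are placeholders. A minimal sketch of deriving the real values for `peiyi9979/math-shepherd-mistral-7b-prm`, following the recipe from that model's card (each reasoning step ends with the tag "ки" and is scored by the logits of the "+"/"-" tokens):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("peiyi9979/math-shepherd-mistral-7b-prm")

# "ки" tags the end of each reasoning step; "+" / "-" are the good/bad labels
# whose logits score that step.
step_tag_id = tokenizer.encode("ки")[-1]
good_id, bad_id = tokenizer.encode("+ -")[1:]  # drop the leading BOS token

print(
    '--pooler-config \'{"pooling_type": "STEP", '
    f'"step_tag_id": {step_tag_id}, '
    f'"returned_token_ids": [{good_id}, {bad_id}]}}\''
)
```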

## Offline Inference

### Pooling Parameters

The following [pooling parameters][vllm.PoolingParams] are supported.

```python
--8<-- "vllm/pooling_params.py:common-pooling-params"
--8<-- "vllm/pooling_params.py:classify-pooling-params"
```
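
A hedged sketch of passing these options per request via [PoolingParams][vllm.PoolingParams] (a default instance is used here; set the fields listed in the snippet above as needed):

```python
from vllm import LLM, PoolingParams

llm = LLM(model="Skywork/Skywork-Reward-V2-Qwen3-0.6B", runner="pooling")

# A default PoolingParams instance; the fields available for classify-style
# pooling are those listed in the snippet above.
(output,) = llm.encode(
    "Hello, my name is",
    pooling_params=PoolingParams(),
    pooling_task="classify",
)
print(output.outputs.data)
```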

### LLM.encode

The [encode][vllm.LLM.encode] method is available to all pooling models in vLLM.

#### Reward Models

Set `pooling_task="classify"` when using `LLM.encode` for (sequence) (outcome) reward models:

```python
from vllm import LLM

llm = LLM(model="Skywork/Skywork-Reward-V2-Qwen3-0.6B", runner="pooling")
(output,) = llm.encode("Hello, my name is", pooling_task="classify")

data = output.outputs.data
print(f"Data: {data!r}")
```
#### Token Reward Models

Set `pooling_task="token_classify"` when using `LLM.encode` for token (outcome) reward models:

```python
from vllm import LLM

llm = LLM(model="internlm/internlm2-1_8b-reward", runner="pooling", trust_remote_code=True)
(output,) = llm.encode("Hello, my name is", pooling_task="token_classify")

data = output.outputs.data
print(f"Data: {data!r}")
```
#### Process Reward Models

Set `pooling_task="token_classify"` when using `LLM.encode` for process reward models:

```python
from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-Math-PRM-7B", runner="pooling")
(output,) = llm.encode("Hello, my name is<extra_0><extra_0><extra_0>", pooling_task="token_classify")

data = output.outputs.data
print(f"Data: {data!r}")
```

## Online Serving

Please refer to the Pooling API (`/pooling`). For the pooling task corresponding to each reward model type, see the table in the Summary above.
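
As a hedged illustration (the request shape is assumed to follow the Pooling API; field names may vary across versions), querying a token reward model served with `vllm serve internlm/internlm2-1_8b-reward --runner pooling --trust-remote-code`:

```python
import requests

# Assumes a local server started with:
#   vllm serve internlm/internlm2-1_8b-reward --runner pooling --trust-remote-code
response = requests.post(
    "http://localhost:8000/pooling",
    json={
        "model": "internlm/internlm2-1_8b-reward",
        "input": "Hello, my name is",
    },
)
response.raise_for_status()
print(response.json())
```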

## More examples

More examples can be found here: `examples/pooling/reward`

## Deprecated Features

### LLM.reward

The `LLM.reward` API is deprecated and will be removed in v0.23. Please use `LLM.encode` with `pooling_task="classify"` or `pooling_task="token_classify"` instead.
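
A minimal migration sketch (signatures taken from the examples above; the deprecated call is shown commented out):

```python
from vllm import LLM

llm = LLM(model="internlm/internlm2-1_8b-reward", runner="pooling", trust_remote_code=True)

# Deprecated, removed in v0.23:
# (output,) = llm.reward("Hello, my name is")

# Replacement:
(output,) = llm.encode("Hello, my name is", pooling_task="token_classify")
print(output.outputs.data)
```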