docs/models/pooling_models/scoring.md
Score models compute similarity scores between two input prompts. Three model types (aka `score_type`) are supported: cross-encoder, late-interaction, and bi-encoder.
!!! note
    vLLM handles only the model inference component of RAG pipelines (such as embedding generation and reranking). For higher-level RAG orchestration, you should leverage integration frameworks such as LangChain.
| Score Types | Pooling Tasks | Scoring Function |
|---|---|---|
| cross-encoder | classify (see note) | linear classifier |
| late-interaction | token_embed | late interaction (MaxSim) |
| bi-encoder | embed | cosine similarity |
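The two vector-based scoring functions in the table can be illustrated with a minimal stdlib sketch (the cross-encoder's linear classifier is part of the model itself, so it is omitted; the toy embeddings below are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    # Bi-encoder scoring: one embedding per prompt, compared directly.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def maxsim(query_tokens, doc_tokens):
    # Late interaction: for each query token embedding, take the maximum
    # similarity over all document token embeddings, then sum.
    return sum(
        max(cosine_similarity(q, d) for d in doc_tokens)
        for q in query_tokens
    )

q = [[1.0, 0.0], [0.0, 1.0]]              # toy per-token query embeddings
d = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]  # toy per-token document embeddings
print(cosine_similarity(q[0], d[0]))  # 1.0
print(maxsim(q, d))                   # 2.0
```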
Supported APIs: `LLM.score`, `/score`, `/rerank`, `/v1/rerank`, `/v2/rerank`.

!!! note
    Only when a classification model outputs `num_labels` equal to 1 can it be used as a scoring model and have its scoring API enabled.
Cross-encoder (aka reranker) models are a subset of classification models that accept two prompts as input and output `num_labels` equal to 1.
--8<-- [start:supported-cross-encoder-models]
| Architecture | Models | Example HF Models | Score template (see note) | LoRA | PP |
|---|---|---|---|---|---|
| BertForSequenceClassification | BERT-based | cross-encoder/ms-marco-MiniLM-L-6-v2, etc. | N/A | | |
| GemmaForSequenceClassification | Gemma-based | BAAI/bge-reranker-v2-gemma (see note), etc. | bge-reranker-v2-gemma.jinja | ✅︎ | ✅︎ |
| GteNewForSequenceClassification | mGTE-TRM (see note) | Alibaba-NLP/gte-multilingual-reranker-base, etc. | N/A | | |
| LlamaBidirectionalForSequenceClassification<sup>C</sup> | Llama-based with bidirectional attention | nvidia/llama-nemotron-rerank-1b-v2, etc. | nemotron-rerank.jinja | ✅︎ | ✅︎ |
| Qwen2ForSequenceClassification<sup>C</sup> | Qwen2-based | mixedbread-ai/mxbai-rerank-base-v2 (see note), etc. | mxbai_rerank_v2.jinja | ✅︎ | ✅︎ |
| Qwen3ForSequenceClassification<sup>C</sup> | Qwen3-based | tomaarsen/Qwen3-Reranker-0.6B-seq-cls, Qwen/Qwen3-Reranker-0.6B (see note), etc. | qwen3_reranker.jinja | ✅︎ | ✅︎ |
| RobertaForSequenceClassification | RoBERTa-based | cross-encoder/quora-roberta-base, etc. | N/A | | |
| XLMRobertaForSequenceClassification | XLM-RoBERTa-based | BAAI/bge-reranker-v2-m3, etc. | N/A | | |
| *Model<sup>C</sup>, *ForCausalLM<sup>C</sup>, etc. | Generative models | N/A | N/A | * | * |
<sup>C</sup> Automatically converted into a classification model via --convert classify. (details)
* Feature support is the same as that of the original model.
!!! note
    Some models require a specific prompt format to work correctly.
    You can find the corresponding score template for each Example HF Model in [examples/pooling/score/template/](../../../examples/pooling/score/template).

    Examples: [examples/pooling/score/using_template_offline.py](../../../examples/pooling/score/using_template_offline.py) and [examples/pooling/score/using_template_online.py](../../../examples/pooling/score/using_template_online.py)
!!! note
    Load the official original BAAI/bge-reranker-v2-gemma using the following command:

    ```bash
    vllm serve BAAI/bge-reranker-v2-gemma --hf_overrides '{"architectures": ["GemmaForSequenceClassification"],"classifier_from_token": ["Yes"],"method": "no_post_processing"}'
    ```
!!! note
    The second-generation GTE model (mGTE-TRM) is named NewForSequenceClassification. Because this name is too generic, you should set --hf-overrides '{"architectures": ["GteNewForSequenceClassification"]}' to specify the use of the GteNewForSequenceClassification architecture.
!!! note
    Load the official original mxbai-rerank-v2 using the following command:

    ```bash
    vllm serve mixedbread-ai/mxbai-rerank-base-v2 --hf_overrides '{"architectures": ["Qwen2ForSequenceClassification"],"classifier_from_token": ["0", "1"], "method": "from_2_way_softmax"}'
    ```
!!! note
    Load the official original Qwen3 Reranker using the following command. More information can be found at examples/pooling/score/qwen3_reranker_offline.py and examples/pooling/score/qwen3_reranker_online.py.

    ```bash
    vllm serve Qwen/Qwen3-Reranker-0.6B --hf_overrides '{"architectures": ["Qwen3ForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}'
    ```
!!! note
    For more information about multimodal model inputs, see this page.
| Architecture | Models | Inputs | Example HF Models | LoRA | PP |
|---|---|---|---|---|---|
| JinaVLForSequenceClassification | JinaVL-based | T + I<sup>E+</sup> | jinaai/jina-reranker-m0, etc. | ✅︎ | ✅︎ |
| LlamaNemotronVLForSequenceClassification | Llama Nemotron Reranker + SigLIP | T + I<sup>E+</sup> | nvidia/llama-nemotron-rerank-vl-1b-v2 | | |
| Qwen3VLForSequenceClassification | Qwen3-VL-Reranker | T + I<sup>E+</sup> + V<sup>E+</sup> | Qwen/Qwen3-VL-Reranker-2B (see note), etc. | ✅︎ | ✅︎ |
!!! note
    Similar to Qwen3-Reranker, you need to use the following --hf_overrides to load the official original Qwen3-VL-Reranker:

    ```bash
    vllm serve Qwen/Qwen3-VL-Reranker-2B --hf_overrides '{"architectures": ["Qwen3VLForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}'
    ```
--8<-- [end:supported-cross-encoder-models]
All models that support the token embedding task also support using the score API to compute similarity scores via the late interaction of the two input prompts. See this page for more information about token embedding models.
--8<-- "docs/models/pooling_models/token_embed.md:supported-token-embed-models"
All models that support the embedding task also support using the score API to compute similarity scores via the cosine similarity of the two input prompts' embeddings. See this page for more information about embedding models.
--8<-- "docs/models/pooling_models/embed.md:supported-embed-models"
The following [pooling parameters][vllm.PoolingParams] are only supported by cross-encoder models and do not work for late-interaction or bi-encoder models.
--8<-- "vllm/pooling_params.py:common-pooling-params"
--8<-- "vllm/pooling_params.py:classify-pooling-params"
The [score][vllm.LLM.score] method outputs similarity scores between sentence pairs.
```python
from vllm import LLM

llm = LLM(model="BAAI/bge-reranker-v2-m3", runner="pooling")

(output,) = llm.score(
    "What is the capital of France?",
    "The capital of Brazil is Brasilia.",
)

score = output.outputs.score
print(f"Score: {score}")
```
A code example can be found here: examples/basic/offline_inference/score.py
Our Score API (`/score`) is similar to `LLM.score`: it computes similarity scores between two input prompts.
The following Score API parameters are supported:
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-extra-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:classify-extra-params"
--8<-- "vllm/entrypoints/pooling/scoring/protocol.py:scoring-common-params"
--8<-- "vllm/entrypoints/pooling/scoring/protocol.py:score-request-params"
You can pass a string to both queries and documents, forming a single sentence pair.
??? console "Request"

    ```bash
    curl -X 'POST' \
      'http://127.0.0.1:8000/score' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
        "model": "BAAI/bge-reranker-v2-m3",
        "encoding_format": "float",
        "queries": "What is the capital of France?",
        "documents": "The capital of France is Paris."
      }'
    ```
??? console "Response"
```json
{
"id": "score-request-id",
"object": "list",
"created": 693447,
"model": "BAAI/bge-reranker-v2-m3",
"data": [
{
"index": 0,
"object": "score",
"score": 1
}
],
"usage": {}
}
```
You can pass a string to queries and a list to documents, forming multiple sentence pairs
where each pair is built from queries and a string in documents.
The total number of pairs is len(documents).
??? console "Request"
```bash
curl -X 'POST' \
'http://127.0.0.1:8000/score' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "BAAI/bge-reranker-v2-m3",
"queries": "What is the capital of France?",
"documents": [
"The capital of Brazil is Brasilia.",
"The capital of France is Paris."
]
}'
```
??? console "Response"
```json
{
"id": "score-request-id",
"object": "list",
"created": 693570,
"model": "BAAI/bge-reranker-v2-m3",
"data": [
{
"index": 0,
"object": "score",
"score": 0.001094818115234375
},
{
"index": 1,
"object": "score",
"score": 1
}
],
"usage": {}
}
```
You can pass a list to both queries and documents, forming multiple sentence pairs
where each pair is built from a string in queries and the corresponding string in documents (similar to zip()).
The total number of pairs is len(documents).
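The pairing rules above can be summarized in a short Python sketch. This mirrors the documented behavior only; it is not vLLM's actual implementation:

```python
def build_pairs(queries, documents):
    # str + str -> a single sentence pair
    if isinstance(queries, str) and isinstance(documents, str):
        return [(queries, documents)]
    # str + list -> one pair per document, len(documents) pairs in total
    if isinstance(queries, str):
        return [(queries, d) for d in documents]
    # list + list -> element-wise pairing, like zip()
    return list(zip(queries, documents))

print(build_pairs("q", "d"))                    # [('q', 'd')]
print(build_pairs("q", ["d1", "d2"]))           # [('q', 'd1'), ('q', 'd2')]
print(build_pairs(["q1", "q2"], ["d1", "d2"]))  # [('q1', 'd1'), ('q2', 'd2')]
```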
??? console "Request"
```bash
curl -X 'POST' \
'http://127.0.0.1:8000/score' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "BAAI/bge-reranker-v2-m3",
"encoding_format": "float",
"queries": [
"What is the capital of Brazil?",
"What is the capital of France?"
],
"documents": [
"The capital of Brazil is Brasilia.",
"The capital of France is Paris."
]
}'
```
??? console "Response"
```json
{
"id": "score-request-id",
"object": "list",
"created": 693447,
"model": "BAAI/bge-reranker-v2-m3",
"data": [
{
"index": 0,
"object": "score",
"score": 1
},
{
"index": 1,
"object": "score",
"score": 1
}
],
"usage": {}
}
```
You can pass multi-modal inputs to scoring models by including a list of multi-modal content (image, etc.) in the request. Refer to the examples below for illustration.
=== "JinaVL-Reranker"
To serve the model:
```bash
vllm serve jinaai/jina-reranker-m0
```
Since the request schema is not defined by the OpenAI client, we post a request to the server using the lower-level `requests` library:
??? Code
```python
import requests
response = requests.post(
"http://localhost:8000/v1/score",
json={
"model": "jinaai/jina-reranker-m0",
"queries": "slm markdown",
"documents": [
{
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/handelsblatt-preview.png"
},
}
],
},
{
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/handelsblatt-preview.png"
},
}
]
},
],
},
)
response.raise_for_status()
response_json = response.json()
print("Scoring output:", response_json["data"][0]["score"])
print("Scoring output:", response_json["data"][1]["score"])
```
Full example:
The /rerank, /v1/rerank, and /v2/rerank APIs are compatible with both Jina AI's rerank API interface and
Cohere's rerank API interface to ensure compatibility with
popular open-source tools.
Code example: examples/pooling/score/rerank_api_online.py
The following rerank API parameters are supported:
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-extra-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:classify-extra-params"
--8<-- "vllm/entrypoints/pooling/scoring/protocol.py:scoring-common-params"
--8<-- "vllm/entrypoints/pooling/scoring/protocol.py:rerank-request-params"
Note that the `top_n` request parameter is optional and defaults to the length of the `documents` field.
Result documents are sorted by relevance, and the `index` property can be used to determine the original order.
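Because results come back sorted by relevance, a client that needs the original submission order can re-sort on the index property. A small sketch with a made-up response payload:

```python
# Hypothetical /v1/rerank results, already sorted by relevance_score:
results = [
    {"index": 1, "relevance_score": 0.9985, "document": {"text": "Paris"}},
    {"index": 0, "relevance_score": 0.0006, "document": {"text": "Brasilia"}},
]

# Restore the order in which the documents were submitted:
in_original_order = sorted(results, key=lambda r: r["index"])
print([r["document"]["text"] for r in in_original_order])  # ['Brasilia', 'Paris']
```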
??? console "Request"
```bash
curl -X 'POST' \
'http://127.0.0.1:8000/v1/rerank' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "BAAI/bge-reranker-base",
"query": "What is the capital of France?",
"documents": [
"The capital of Brazil is Brasilia.",
"The capital of France is Paris.",
"Horses and cows are both animals"
]
}'
```
??? console "Response"
```json
{
"id": "rerank-fae51b2b664d4ed38f5969b612edff77",
"model": "BAAI/bge-reranker-base",
"usage": {
"total_tokens": 56
},
"results": [
{
"index": 1,
"document": {
"text": "The capital of France is Paris."
},
"relevance_score": 0.99853515625
},
{
"index": 0,
"document": {
"text": "The capital of Brazil is Brasilia."
},
"relevance_score": 0.0005860328674316406
}
]
}
```
More examples can be found here: examples/pooling/score
As cross-encoder models are a subset of classification models that accept two prompts as input and output `num_labels` equal to 1, cross-encoder features should be consistent with (sequence) classification. For more information, see this page.
Score templates are supported for cross-encoder models only. If you are using an embedding model for scoring, vLLM does not apply a score template.
Some scoring models require a specific prompt format to work correctly. You can specify a custom score template using the --chat-template parameter (see Chat Template).
Like chat templates, the score template receives a messages list. For scoring, each message has a role attribute of either "query" or "document". For the usual point-wise cross-encoder, you can expect exactly two messages: one query and one document. To access the query and document content, use Jinja's selectattr filter:
```jinja
{{ (messages | selectattr("role", "eq", "query") | first).content }}
{{ (messages | selectattr("role", "eq", "document") | first).content }}
```

This approach is more robust than index-based access (`messages[0]`, `messages[1]`) because it selects messages by their semantic role. It also avoids assumptions about message ordering if additional message types are added to messages in the future.
Example template file: examples/pooling/score/template/nemotron-rerank.jinja
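The selectattr pattern is equivalent to role-based lookup in plain Python. A sketch of what such a template produces; the rendered prompt format here is hypothetical, as each real model defines its own:

```python
messages = [
    {"role": "query", "content": "What is the capital of France?"},
    {"role": "document", "content": "The capital of France is Paris."},
]

# Equivalent of: messages | selectattr("role", "eq", "query") | first
query = next(m["content"] for m in messages if m["role"] == "query")
document = next(m["content"] for m in messages if m["role"] == "document")

# Hypothetical prompt layout for illustration only.
prompt = f"Query: {query}\nDocument: {document}"
print(prompt)
```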
The `use_activation` parameter, which enables or disables the activation applied to the output score, only works for cross-encoder models.
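For a cross-encoder with `num_labels` equal to 1, disabling the activation typically means returning the raw logit instead of a sigmoid-squashed score. A sketch of the difference, with a made-up logit value:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

raw_logit = 7.2  # hypothetical classifier output for one (query, document) pair

score_with_activation = sigmoid(raw_logit)  # bounded to (0, 1)
score_without_activation = raw_logit        # raw logit, unbounded

print(score_with_activation)
print(score_without_activation)
```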