docs/serving/openai_compatible_server.md
vLLM provides an HTTP server that implements OpenAI's Completions API, Chat API, and more! This functionality lets you serve models and interact with them using an HTTP client.
In your terminal, you can install vLLM, then start the server with the `vllm serve` command. (You can also use our Docker image.)
```bash
vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
  --dtype auto \
  --api-key token-abc123
```
To call the server, in your preferred text editor, create a script that uses an HTTP client. Include any messages that you want to send to the model. Then run that script. Below is an example script using the official OpenAI Python client.
??? code
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"},
    ],
)

print(completion.choices[0].message)
```
!!! tip
vLLM supports some parameters that are not supported by OpenAI, such as `top_k`.
You can pass these parameters to vLLM using the OpenAI client in the `extra_body` parameter of your requests, e.g., `extra_body={"top_k": 50}` for `top_k`.
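As an illustration, `extra_body` fields are merged into the top level of the standard Chat Completions JSON payload. This sketch shows the resulting request body as a plain dictionary (no server required; the exact wire format depends on your client version):

```python
import json

# Standard Chat Completions payload with a vLLM-specific parameter
# merged at the top level -- this is what extra_body={"top_k": 50}
# produces on the wire.
payload = {
    "model": "NousResearch/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "top_k": 50,
}
print(json.dumps(payload, indent=2))
```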
!!! important
By default, the server applies `generation_config.json` from the Hugging Face model repository, if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.
To disable this behavior, please pass `--generation-config vllm` when launching the server.
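As a rough illustration of the precedence (not vLLM's actual internals, and the numbers are hypothetical), model-recommended values take priority over built-in defaults:

```python
# Illustrative only: values from a model's generation_config.json
# (hypothetical numbers) override built-in sampling defaults.
builtin_defaults = {"temperature": 1.0, "top_p": 1.0, "top_k": -1}
generation_config = {"temperature": 0.6, "top_p": 0.9}

effective = {**builtin_defaults, **generation_config}
print(effective)  # temperature and top_p now come from the model config
```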
We currently support the following OpenAI APIs:

- Completions API (`/v1/completions`)
    - Note: the `suffix` parameter is not supported.
- Responses API (`/v1/responses`)
- Chat Completions API (`/v1/chat/completions`)
    - Note: the `user` parameter is ignored.
    - Note: setting the `parallel_tool_calls` parameter to `false` ensures vLLM only returns zero or one tool call per request. Setting it to `true` (the default) allows returning more than one tool call per request. There is no guarantee more than one tool call will be returned if this is set to `true`, as that behavior is model dependent and not all models are designed to support parallel tool calls.
- Embeddings API (`/v1/embeddings`)
- Transcriptions API (`/v1/audio/transcriptions`)
- Translations API (`/v1/audio/translations`)
- Realtime API (`/v1/realtime`)
In addition, we have the following custom APIs:

- Tokenizer API (`/tokenize`, `/detokenize`)
- Pooling API (`/pooling`)
- Classification API (`/classify`)
- Embeddings API (`/v2/embed`)
- Score API (`/score`, `/v1/score`)
- Generative Scoring API (`/generative_scoring`)
    - Only available when the server is started with a generative model (task `"generate"`).
    - Requires the `label_token_ids` parameter.
- Re-rank API (`/rerank`, `/v1/rerank`, `/v2/rerank`)
In order for the language model to support the chat protocol, vLLM requires the model to include a chat template in its tokenizer configuration. The chat template is a Jinja2 template that specifies how roles, messages, and other chat-specific tokens are encoded in the input.
An example chat template for NousResearch/Meta-Llama-3-8B-Instruct can be found here
Some models do not provide a chat template even though they are instruction/chat fine-tuned. For those models,
you can manually specify their chat template via the `--chat-template` parameter, passing either the file path to the
chat template or the template itself in string form. Without a chat template, the server will not be able to process
chat requests, and all such requests will error.
```bash
vllm serve <model> --chat-template ./path-to-chat-template.jinja
```
The vLLM community provides a set of chat templates for popular models. You can find them under the examples directory.
With the inclusion of multi-modal chat APIs, the OpenAI spec now accepts chat messages in a new format which specifies
both a type and a text field. An example is provided below:
```python
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Classify this sentiment: vLLM is wonderful!"},
            ],
        },
    ],
)
```
Most chat templates for LLMs expect the content field to be a string, but there are some newer models like
meta-llama/Llama-Guard-3-1B that expect the content to be formatted according to the OpenAI schema in the
request. vLLM provides best-effort support to detect this automatically, which is logged as a string like
"Detected the chat template content format to be...", and internally converts incoming requests to match
the detected format, which can be one of:
- `"string"`: A string. Example: `"Hello world"`
- `"openai"`: A list of dictionaries, similar to the OpenAI schema. Example: `[{"type": "text", "text": "Hello world!"}]`

If the result is not what you expect, you can set the `--chat-template-content-format` CLI argument
to override which format to use.
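A minimal sketch of what such a conversion looks like (hypothetical helpers, not vLLM's implementation): a `"string"`-format template needs plain text, so `"openai"`-style content parts are flattened, and vice versa:

```python
def to_string_content(content):
    """Flatten "openai"-style content parts into a plain string."""
    if isinstance(content, str):
        return content
    return "".join(part["text"] for part in content if part["type"] == "text")

def to_openai_content(content):
    """Wrap a plain string into "openai"-style content parts."""
    if isinstance(content, str):
        return [{"type": "text", "text": content}]
    return content

print(to_string_content([{"type": "text", "text": "Hello world!"}]))  # Hello world!
print(to_openai_content("Hello world!"))
```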
vLLM supports a set of parameters that are not part of the OpenAI API. To use them, pass them as extra parameters in the OpenAI client, or merge them directly into the JSON payload if you are calling the HTTP endpoint directly.
```python
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"},
    ],
    extra_body={
        "structured_outputs": {"choice": ["positive", "negative"]},
    },
)
```
Only the `X-Request-Id` HTTP request header is supported for now. It can be enabled
with `--enable-request-id-headers`.
??? code
```python
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"},
    ],
    extra_headers={
        "x-request-id": "sentiment-classification-00001",
    },
)
print(completion._request_id)

completion = client.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    prompt="A robot may not injure a human being",
    extra_headers={
        "x-request-id": "completion-test",
    },
)
print(completion._request_id)
```
The FastAPI `/docs` endpoint requires an internet connection by default. To enable offline access in air-gapped environments, use the `--enable-offline-docs` flag:

```bash
vllm serve NousResearch/Meta-Llama-3-8B-Instruct --enable-offline-docs
```
Our Completions API is compatible with OpenAI's Completions API; you can use the official OpenAI Python client to interact with it.
Code example: examples/basic/online_serving/openai_completion_client.py
The following sampling parameters are supported.
??? code
```python
--8<-- "vllm/entrypoints/openai/completion/protocol.py:completion-sampling-params"
```
The following extra parameters are supported:
??? code
```python
--8<-- "vllm/entrypoints/openai/completion/protocol.py:completion-extra-params"
```
Our Chat API is compatible with OpenAI's Chat Completions API; you can use the official OpenAI Python client to interact with it.
We support both Vision- and Audio-related parameters; see our Multimodal Inputs guide for more information.
Note: the `image_url.detail` parameter is not supported.
Code example: examples/basic/online_serving/openai_chat_completion_client.py
The following sampling parameters are supported.
??? code
```python
--8<-- "vllm/entrypoints/openai/chat_completion/protocol.py:chat-completion-sampling-params"
```
The following extra parameters are supported:
??? code
```python
--8<-- "vllm/entrypoints/openai/chat_completion/protocol.py:chat-completion-extra-params"
```
Our Responses API is compatible with OpenAI's Responses API; you can use the official OpenAI Python client to interact with it.
Code example: examples/online_serving/openai_responses_client_with_tools.py
The following extra parameters in the request object are supported:
??? code
```python
--8<-- "vllm/entrypoints/openai/responses/protocol.py:responses-extra-params"
```
The following extra parameters in the response object are supported:
??? code
```python
--8<-- "vllm/entrypoints/openai/responses/protocol.py:responses-response-extra-params"
```
Our Transcriptions API is compatible with OpenAI's Transcriptions API; you can use the official OpenAI Python client to interact with it.
!!! note
To use the Transcriptions API, please install with extra audio dependencies using `pip install vllm[audio]`.
Code example: examples/online_serving/openai_transcription_client.py
NOTE: Beam search is currently supported in the transcriptions endpoint for encoder-decoder multimodal models (e.g., Whisper), but it is highly inefficient because work on handling the encoder/decoder cache is still ongoing; this is an active area of optimization.
You can set the maximum audio file size (in MB) that vLLM will accept via the
`VLLM_MAX_AUDIO_CLIP_FILESIZE_MB` environment variable. The default is 25 MB.
The Transcriptions API supports uploading audio files in various formats including FLAC, MP3, MP4, MPEG, MPGA, M4A, OGG, WAV, and WEBM.
Using OpenAI Python Client:
??? code
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

# Upload audio file from disk
with open("audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3-turbo",
        file=audio_file,
        language="en",
        response_format="verbose_json",
    )

print(transcription.text)
```
Using curl with multipart/form-data:
??? code
```bash
curl -X POST "http://localhost:8000/v1/audio/transcriptions" \
-H "Authorization: Bearer token-abc123" \
-F "[email protected]" \
-F "model=openai/whisper-large-v3-turbo" \
-F "language=en" \
-F "response_format=verbose_json"
```
Supported Parameters:
- `file`: The audio file to transcribe (required)
- `model`: The model to use for transcription (required)
- `language`: The language code, e.g., `"en"`, `"zh"` (optional)
- `prompt`: Text to guide the transcription style (optional)
- `response_format`: Format of the response, `"json"` or `"text"` (optional)
- `temperature`: Sampling temperature between 0 and 1 (optional)

For the complete list of supported parameters, including sampling parameters and vLLM extensions, see the protocol definitions.
Response Format:
For verbose_json response format:
??? code
```json
{
  "text": "Hello, this is a transcription of the audio file.",
  "language": "en",
  "duration": 5.42,
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 2.5,
      "text": "Hello, this is a transcription",
      "tokens": [50364, 938, 428, 307, 275, 28347],
      "temperature": 0.0,
      "avg_logprob": -0.245,
      "compression_ratio": 1.235,
      "no_speech_prob": 0.012
    }
  ]
}
```
Currently the `verbose_json` response format doesn't support `no_speech_prob`.
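The segment-level fields lend themselves to simple client-side post-processing. This sketch works over the response shape shown above (hardcoded here so it stands alone):

```python
# Compute per-segment durations from a verbose_json transcription
# response (same shape as the example above).
response = {
    "text": "Hello, this is a transcription of the audio file.",
    "language": "en",
    "duration": 5.42,
    "segments": [
        {"id": 0, "start": 0.0, "end": 2.5,
         "text": "Hello, this is a transcription", "avg_logprob": -0.245},
    ],
}

for seg in response["segments"]:
    length = seg["end"] - seg["start"]
    print(f"[{seg['start']:.1f}-{seg['end']:.1f}s] ({length:.1f}s) {seg['text']}")
```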
The following sampling parameters are supported.
??? code
```python
--8<-- "vllm/entrypoints/openai/speech_to_text/protocol.py:transcription-sampling-params"
```
The following extra parameters are supported:
??? code
```python
--8<-- "vllm/entrypoints/openai/speech_to_text/protocol.py:transcription-extra-params"
```
Our Translation API is compatible with OpenAI's Translations API;
you can use the official OpenAI Python client to interact with it.
Whisper models can translate audio from one of the 55 supported non-English languages into English.
Note that the popular openai/whisper-large-v3-turbo model does not support translation.
!!! note
To use the Translation API, please install with extra audio dependencies using `pip install vllm[audio]`.
Code example: examples/online_serving/openai_translation_client.py
The following sampling parameters are supported.

??? code

```python
--8<-- "vllm/entrypoints/openai/speech_to_text/protocol.py:translation-sampling-params"
```

The following extra parameters are supported:

??? code

```python
--8<-- "vllm/entrypoints/openai/speech_to_text/protocol.py:translation-extra-params"
```
The Realtime API provides WebSocket-based streaming audio transcription, allowing real-time speech-to-text as audio is being recorded.
!!! note
To use the Realtime API, please install with extra audio dependencies using `uv pip install vllm[audio]`.
Audio must be sent as base64-encoded PCM16 audio at 16kHz sample rate, mono channel.
The connection flow is:

1. Connect to `ws://host/v1/realtime`.
2. Receive a `session.created` event.
3. Send `session.update` with model/params.
4. Send `input_audio_buffer.append` events with base64 PCM16 chunks.
5. Send `input_audio_buffer.commit` when ready.
6. Receive `transcription.delta` events with incremental text.
7. Receive `transcription.done` with the final text + usage.

Client events:

| Event | Description |
|---|---|
| `input_audio_buffer.append` | Send a base64-encoded audio chunk: `{"type": "input_audio_buffer.append", "audio": "<base64>"}` |
| `input_audio_buffer.commit` | Trigger transcription processing or end: `{"type": "input_audio_buffer.commit", "final": bool}` |
| `session.update` | Configure the session: `{"type": "session.update", "model": "model-name"}` |
Server events:

| Event | Description |
|---|---|
| `session.created` | Connection established with session ID and timestamp |
| `transcription.delta` | Incremental transcription text: `{"type": "transcription.delta", "delta": "text"}` |
| `transcription.done` | Final transcription with usage stats |
| `error` | Error notification with message and optional code |
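Preparing the client events can be sketched with the standard library alone, assuming raw 16 kHz mono PCM16 samples (the audio here is hypothetical silence; the event shapes follow the tables above):

```python
import base64
import json
import struct

# Hypothetical audio: 160 samples of silence (10 ms at 16 kHz, mono PCM16).
samples = [0] * 160
pcm16_bytes = struct.pack(f"<{len(samples)}h", *samples)

append_event = {
    "type": "input_audio_buffer.append",
    "audio": base64.b64encode(pcm16_bytes).decode("ascii"),
}
commit_event = {"type": "input_audio_buffer.commit", "final": True}

# These JSON strings would be sent over the WebSocket connection.
print(json.dumps(append_event)[:60], "...")
print(json.dumps(commit_event))
```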
Our Tokenizer API is a simple wrapper over HuggingFace-style tokenizers. It consists of two endpoints:
- `/tokenize` corresponds to calling `tokenizer.encode()`.
- `/detokenize` corresponds to calling `tokenizer.decode()`.

The `/generative_scoring` endpoint uses a CausalLM model (e.g., Llama, Qwen, Mistral) to compute the probability of specified token IDs appearing as the next token. Each item (document) is concatenated with the query to form a prompt, and the model predicts how likely each label token is as the next token after that prompt. This lets you score items against a query; for example, asking "Is this the capital of France?" and scoring each city by how likely the model is to answer "Yes".
This endpoint is automatically available when the server is started with a generative model (task "generate"). It is separate from the pooling-based Score API, which uses cross-encoder, bi-encoder, or late-interaction models.
Requirements:

- The `label_token_ids` parameter is required and must contain at least 1 token ID.
- With two label token IDs, the score is computed as `P(label_token_ids[0]) / (P(label_token_ids[0]) + P(label_token_ids[1]))` (softmax over the two labels).

```bash
curl -X POST http://localhost:8000/generative_scoring \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "query": "Is this city the capital of France?",
    "items": ["Paris", "London", "Berlin"],
    "label_token_ids": [9454, 2753]
  }'
```
Here, each item is appended to the query to form prompts like "Is this city the capital of France? Paris", "... London", etc. The model then predicts the next token, and the score reflects the probability of "Yes" (token 9454) vs "No" (token 2753).
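The two-label normalization can be sketched in plain Python, computing the softmax over the two label logits (the logit values below are illustrative, not real model outputs):

```python
import math

def binary_label_score(logit_yes: float, logit_no: float) -> float:
    """Softmax over two label tokens:
    P(yes) / (P(yes) + P(no)), computed stably from raw logits."""
    m = max(logit_yes, logit_no)
    p_yes = math.exp(logit_yes - m)
    p_no = math.exp(logit_no - m)
    return p_yes / (p_yes + p_no)

# Illustrative logits: the model strongly prefers "Yes" for the first item.
print(binary_label_score(5.0, 1.0))   # close to 1
print(binary_label_score(-2.0, 3.0))  # close to 0
```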
??? console "Response"
```json
{
  "id": "generative-scoring-abc123",
  "object": "list",
  "created": 1234567890,
  "model": "Qwen/Qwen3-0.6B",
  "data": [
    {"index": 0, "object": "score", "score": 0.95},
    {"index": 1, "object": "score", "score": 0.12},
    {"index": 2, "object": "score", "score": 0.08}
  ],
  "usage": {"prompt_tokens": 45, "total_tokens": 48, "completion_tokens": 3}
}
```
How scoring works:

- Each prompt is formed as `query + item` (or `item + query` if `item_first=true`).
- The model's next-token probabilities are read off at the `label_token_ids`.
- Scores are normalized with a softmax over the label tokens (`apply_softmax=true`).

To find the token IDs for your labels, use the tokenizer:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
yes_id = tokenizer.encode("Yes", add_special_tokens=False)[0]
no_id = tokenizer.encode("No", add_special_tokens=False)[0]
print(f"Yes: {yes_id}, No: {no_id}")
```
Ray Serve LLM enables scalable, production-grade serving of the vLLM engine. It integrates tightly with vLLM and extends it with features such as auto-scaling, load balancing, and back-pressure.
Key capabilities:
The following example shows how to deploy a large model like DeepSeek R1 with Ray Serve LLM: examples/online_serving/ray_serve_deepseek.py.
Learn more about Ray Serve LLM with the official Ray Serve LLM documentation.