docs/serving/online_serving/openai_compatible_server.md
vLLM provides an HTTP server that implements OpenAI's Completions API, Chat API, and more! This functionality lets you serve models and interact with them using an HTTP client.
We currently support the following OpenAI APIs:

- Completions API (`/v1/completions`)
    - Note: The `suffix` parameter is not supported.
- Responses API (`/v1/responses`)
- Chat Completions API (`/v1/chat/completions`)
    - Note: The `user` parameter is ignored.
    - Note: Setting the `parallel_tool_calls` parameter to `false` ensures vLLM only returns zero or one tool call per request. Setting it to `true` (the default) allows returning more than one tool call per request. There is no guarantee that more than one tool call will be returned when it is set to `true`, as that behavior is model dependent and not all models are designed to support parallel tool calls.
- Embeddings API (`/v1/embeddings`)
- Transcriptions API (`/v1/audio/transcriptions`)
- Translations API (`/v1/audio/translations`)

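
The `parallel_tool_calls` note above can be illustrated with a plain request payload. This is a minimal sketch rather than vLLM's own code: the `get_weather` tool, the helper names `build_tool_request` and `extract_tool_calls`, and the mock response are all hypothetical, and the payload layout is assumed to follow the OpenAI Chat Completions tool-calling schema. Tool calling may also need additional server configuration that is not covered here.

```python
import json

# Hypothetical tool definition in the OpenAI function-calling schema.
GET_WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_request(model, user_message, parallel_tool_calls=True):
    """Build a /v1/chat/completions payload that offers one tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "tools": [GET_WEATHER_TOOL],
        "parallel_tool_calls": parallel_tool_calls,
    }


def extract_tool_calls(response):
    """Return the tool calls of the first choice, or an empty list."""
    message = response["choices"][0]["message"]
    return message.get("tool_calls") or []


payload = build_tool_request(
    "NousResearch/Meta-Llama-3-8B-Instruct",
    "What is the weather in Paris and in Rome?",
    parallel_tool_calls=False,  # ask for zero or one tool call
)
print(json.dumps(payload, indent=2))

# Mock response for illustration only (not real server output).
mock_response = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": None,
                "tool_calls": [
                    {
                        "id": "call_0",
                        "type": "function",
                        "function": {
                            "name": "get_weather",
                            "arguments": json.dumps({"city": "Paris"}),
                        },
                    }
                ],
            }
        }
    ]
}
calls = extract_tool_calls(mock_response)
print(len(calls), calls[0]["function"]["name"])
```

Client code should treat the result as a list in both modes, since even with `parallel_tool_calls` set to `true` a model may return a single call or none.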
In your terminal, you can install vLLM, then start the server with the `vllm serve` command. (You can also use our Docker image.)

```bash
vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
  --dtype auto \
  --api-key token-abc123
```

To call the server, in your preferred text editor, create a script that uses an HTTP client. Include any messages that you want to send to the model. Then run that script. Below is an example script using the official OpenAI Python client.
??? code
```python
from openai import OpenAI

# Point the client at the local vLLM server instead of api.openai.com.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",  # must match the --api-key passed to `vllm serve`
)

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"},
    ],
)

print(completion.choices[0].message)
```
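
Because the server speaks plain HTTP, the same call can be made without the OpenAI client. The following is a minimal sketch using only the Python standard library. The helper names `build_chat_request` and `send` are hypothetical, and it assumes the API key is sent as a bearer token in the `Authorization` header, which is how the OpenAI client transmits it.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, messages):
    """Build (but do not send) a POST request to /v1/chat/completions."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            # Assumption: the key given to `vllm serve --api-key` is a bearer token.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_chat_request(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(request.get_method(), request.full_url)

# With the server running, the reply text could be read like this:
# print(send(request)["choices"][0]["message"]["content"])
```

Building the request separately from sending it keeps the payload easy to inspect before any network traffic happens.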
!!! tip
vLLM supports some parameters that are not supported by OpenAI, `top_k` for example.
You can pass these parameters to vLLM using the OpenAI client in the `extra_body` parameter of your requests, e.g. `extra_body={"top_k": 50}` for `top_k`.
!!! important
By default, the server applies `generation_config.json` from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.
To disable this behavior, please pass `--generation-config vllm` when launching the server.
vLLM supports a set of parameters that are not part of the OpenAI API. To use them, pass them as extra parameters in the OpenAI client, or merge them directly into the JSON payload if you are making the HTTP call yourself.

```python
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"},
    ],
    extra_body={
        "structured_outputs": {"choice": ["positive", "negative"]},
    },
)
```

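
For direct HTTP calls, merging the extra parameters into the JSON payload means placing them at the top level of the request body, next to `model` and `messages`, with no `extra_body` wrapper. A minimal sketch follows. The helper `merge_extra_params` is hypothetical, and its collision check is an illustrative design choice, not something the server requires.

```python
import json


def merge_extra_params(payload, extra_params):
    """Return a new payload with vLLM-specific parameters at the top level."""
    overlap = set(payload) & set(extra_params)
    if overlap:
        # Refuse to silently overwrite a standard field with an extra one.
        raise ValueError(f"extra parameters would overwrite: {sorted(overlap)}")
    return {**payload, **extra_params}


base_payload = {
    "model": "NousResearch/Meta-Llama-3-8B-Instruct",
    "messages": [
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"},
    ],
}

body = merge_extra_params(
    base_payload,
    {"structured_outputs": {"choice": ["positive", "negative"]}},
)

# This JSON string is what would be sent as the POST body.
print(json.dumps(body, indent=2))
```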
Only the `X-Request-Id` HTTP request header is supported for now. It can be enabled
with `--enable-request-id-headers`.
??? code
```python
# Chat Completions request tagged with a caller-chosen request ID.
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"},
    ],
    extra_headers={
        "x-request-id": "sentiment-classification-00001",
    },
)
print(completion._request_id)

# The same header works for the Completions API.
completion = client.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    prompt="A robot may not injure a human being",
    extra_headers={
        "x-request-id": "completion-test",
    },
)
print(completion._request_id)
```
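
When many requests are in flight, a unique ID per request makes log correlation easier. The sketch below generates such IDs and reads the ID back from response headers. The helper names are hypothetical, and it assumes the server returns the ID in an `X-Request-Id` response header when `--enable-request-id-headers` is set. The mock headers object stands in for a real HTTP response.

```python
import uuid
from email.message import Message


def new_request_id(prefix):
    """Create a unique, human-readable request ID such as 'sentiment-<hex>'."""
    return f"{prefix}-{uuid.uuid4().hex}"


def request_id_headers(request_id):
    """Headers to attach to an outgoing request."""
    return {"X-Request-Id": request_id}


def read_request_id(response_headers):
    """Return the X-Request-Id response header, or None if it is absent.

    Works with case-insensitive header containers such as the `headers`
    attribute of a standard-library HTTP response.
    """
    return response_headers.get("X-Request-Id")


request_id = new_request_id("sentiment")
headers = request_id_headers(request_id)
print(headers)

# Mock response headers for illustration only (not real server output).
mock_response_headers = Message()
mock_response_headers["x-request-id"] = request_id
print(read_request_id(mock_response_headers) == request_id)
```

With the OpenAI client, the `headers` dictionary can be passed as `extra_headers`.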
Our Completions API is compatible with OpenAI's Completions API; you can use the official OpenAI Python client to interact with it.
Code example: examples/basic/online_serving/openai_completion_client.py
The following sampling parameters are supported.
??? code
```python
--8<-- "vllm/entrypoints/openai/completion/protocol.py:completion-sampling-params"
```
The following extra parameters are supported:
??? code
```python
--8<-- "vllm/entrypoints/openai/completion/protocol.py:completion-extra-params"
```
Our Chat API is compatible with OpenAI's Chat Completions API; you can use the official OpenAI Python client to interact with it.
We support both Vision- and Audio-related parameters; see our Multimodal Inputs guide for more information.
Note: The `image_url.detail` parameter is not supported.

Code example: examples/basic/online_serving/openai_chat_completion_client.py
The following sampling parameters are supported.
??? code
```python
--8<-- "vllm/entrypoints/openai/chat_completion/protocol.py:chat-completion-sampling-params"
```
The following extra parameters are supported:
??? code
```python
--8<-- "vllm/entrypoints/openai/chat_completion/protocol.py:chat-completion-extra-params"
```
Our Responses API is compatible with OpenAI's Responses API; you can use the official OpenAI Python client to interact with it.
Code example: examples/tool_calling/openai_responses_client_with_tools.py
The following extra parameters in the request object are supported:
??? code
```python
--8<-- "vllm/entrypoints/openai/responses/protocol.py:responses-extra-params"
```
The following extra parameters in the response object are supported:
??? code
```python
--8<-- "vllm/entrypoints/openai/responses/protocol.py:responses-response-extra-params"
```