OpenAI-Compatible Server

vLLM provides an HTTP server that implements OpenAI's Completions API, Chat API, and more! This functionality lets you serve models and interact with them using an HTTP client.

Supported APIs

We currently support the following OpenAI APIs:

Completions API (/v1/completions)
- Only applicable to text generation models.
- Note: suffix parameter is not supported.
Responses API (/v1/responses)
- Only applicable to text generation models.
Chat Completions API (/v1/chat/completions)
- Only applicable to text generation models with a chat template.
- Note: user parameter is ignored.
- Note: Setting the parallel_tool_calls parameter to false ensures vLLM only returns zero or one tool call per request. Setting it to true (the default) allows returning more than one tool call per request. There is no guarantee more than one tool call will be returned if this is set to true, as that behavior is model dependent and not all models are designed to support parallel tool calls.
Embeddings API (/v1/embeddings)
- Only applicable to embedding models.
Transcriptions API (/v1/audio/transcriptions)
- Only applicable to Automatic Speech Recognition (ASR) models.
Translation API (/v1/audio/translations)
- Only applicable to Automatic Speech Recognition (ASR) models.

Completions API

In your terminal, you can install vLLM, then start the server with the vllm serve command. (You can also use our Docker image.)

bash

vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
  --dtype auto \
  --api-key token-abc123

To call the server, in your preferred text editor, create a script that uses an HTTP client. Include any messages that you want to send to the model. Then run that script. Below is an example script using the official OpenAI Python client.

??? code

```python
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"},
    ],
)

print(completion.choices[0].message)
```

!!! tip vLLM supports some parameters that are not supported by OpenAI, top_k for example. You can pass these parameters to vLLM using the OpenAI client in the extra_body parameter of your requests, i.e. extra_body={"top_k": 50} for top_k.

!!! important By default, the server applies generation_config.json from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.

To disable this behavior, please pass `--generation-config vllm` when launching the server.

Extra Parameters

vLLM supports a set of parameters that are not part of the OpenAI API. In order to use them, you can pass them as extra parameters in the OpenAI client. Or directly merge them into the JSON payload if you are using HTTP call directly.

python

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"},
    ],
    extra_body={
        "structured_outputs": {"choice": ["positive", "negative"]},
    },
)

Extra HTTP Headers

Only X-Request-Id HTTP request header is supported for now. It can be enabled with --enable-request-id-headers.

??? code

```python
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"},
    ],
    extra_headers={
        "x-request-id": "sentiment-classification-00001",
    },
)
print(completion._request_id)

completion = client.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    prompt="A robot may not injure a human being",
    extra_headers={
        "x-request-id": "completion-test",
    },
)
print(completion._request_id)
```

API Reference

Completions API

Our Completions API is compatible with OpenAI's Completions API; you can use the official OpenAI Python client to interact with it.

Code example: examples/basic/online_serving/openai_completion_client.py

Extra parameters

The following sampling parameters are supported.

??? code

```python
--8<-- "vllm/entrypoints/openai/completion/protocol.py:completion-sampling-params"
```

The following extra parameters are supported:

??? code

```python
--8<-- "vllm/entrypoints/openai/completion/protocol.py:completion-extra-params"
```

Chat API

Our Chat API is compatible with OpenAI's Chat Completions API; you can use the official OpenAI Python client to interact with it.

We support both Vision- and Audio-related parameters; see our Multimodal Inputs guide for more information.

Note: image_url.detail parameter is not supported.

Code example: examples/basic/online_serving/openai_chat_completion_client.py

Extra parameters

The following sampling parameters are supported.

??? code

```python
--8<-- "vllm/entrypoints/openai/chat_completion/protocol.py:chat-completion-sampling-params"
```

The following extra parameters are supported:

??? code

```python
--8<-- "vllm/entrypoints/openai/chat_completion/protocol.py:chat-completion-extra-params"
```

Responses API

Our Responses API is compatible with OpenAI's Responses API; you can use the official OpenAI Python client to interact with it.

Code example: examples/tool_calling/openai_responses_client_with_tools.py

Extra parameters

The following extra parameters in the request object are supported:

??? code

```python
--8<-- "vllm/entrypoints/openai/responses/protocol.py:responses-extra-params"
```

The following extra parameters in the response object are supported:

??? code

```python
--8<-- "vllm/entrypoints/openai/responses/protocol.py:responses-response-extra-params"
```