
vLLM

There are two modes of using vLLM: local and remote. Let's start with the former, which requires a CUDA environment available locally.

Install vLLM

pip install vllm

or you can compile it from source.
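
A minimal sketch of a source build, assuming the upstream vllm-project/vllm repository (an editable install needs a working CUDA toolchain):

git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .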

Orca-2-7b Completion Example

python
%pip install llama-index-llms-vllm
python
import os

# Cache Hugging Face model downloads under ./model/
os.environ["HF_HOME"] = "model/"
python
from llama_index.llms.vllm import Vllm, VllmServer
python
llm = Vllm(
    model="microsoft/Orca-2-7b",
    tensor_parallel_size=4,  # shard the model across 4 GPUs
    max_new_tokens=100,
    vllm_kwargs={"swap_space": 1, "gpu_memory_utilization": 0.5},
)
python
llm.complete("[INST]You are a helpful assistant[/INST] What is a black hole ?")

CodeLlama-7b Completion Example

python
llm = Vllm(
    model="codellama/CodeLlama-7b-hf",
    dtype="float16",
    tensor_parallel_size=4,
    temperature=0,
    max_new_tokens=100,
    vllm_kwargs={
        "swap_space": 1,
        "gpu_memory_utilization": 0.5,
        "max_model_len": 4096,
    },
)
python
llm.complete("import socket\n\ndef ping_exponential_backoff(host: str):")

Mistral-7B-Instruct Completion Example

python
llm = Vllm(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    dtype="float16",
    tensor_parallel_size=4,
    temperature=0,
    max_new_tokens=100,
    vllm_kwargs={
        "swap_space": 1,
        "gpu_memory_utilization": 0.5,
        "max_model_len": 4096,
    },
)
python
llm.complete(" What is a black hole ?")

Calling vLLM via HTTP

In this mode there is no need to install vLLM locally, nor to have CUDA available. To set up the vLLM API server, follow the server deployment guide in the vLLM documentation. Note: the llama-index-llms-vllm module is a client for vllm.entrypoints.api_server, which vLLM ships only as a demo.
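
As a minimal sketch, assuming the demo entrypoint's standard flags (the model name is just an illustration), the server can be started with:

python -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-Instruct-v0.1 --port 8000

This exposes the /generate endpoint that VllmServer targets below.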

If the vLLM server is launched with vllm.entrypoints.openai.api_server as an OpenAI-compatible server, or via Docker, you need the OpenAILike class from the llama-index-llms-openai-like module instead.
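
In that case a minimal sketch looks like the following; the served model name and local URL are assumptions for illustration:

python
from llama_index.llms.openai_like import OpenAILike

# Assumes an OpenAI-compatible vLLM server at localhost:8000 serving this model
llm = OpenAILike(
    model="mistralai/Mistral-7B-Instruct-v0.1",  # must match the model the server loads
    api_base="http://localhost:8000/v1",
    api_key="none",  # vLLM ignores the key unless configured otherwise
    max_tokens=100,
)
print(llm.complete("What is a black hole?"))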

Completion Response

python
from llama_index.core.llms import ChatMessage
python
from llama_index.llms.vllm import VllmServer

llm = VllmServer(
    api_url="http://localhost:8000/generate", max_new_tokens=100, temperature=0
)
python
llm.complete("what is a black hole ?")
python
message = [ChatMessage(content="hello", role="user")]
llm.chat(message)

Streaming Response

python
list(llm.stream_complete("what is a black hole"))[-1]
python
message = [ChatMessage(content="what is a black hole", role="user")]
[x for x in llm.stream_chat(message)][-1]

Async Response

python
import asyncio  # not strictly needed here; notebooks support top-level await

await llm.acomplete("What is a black hole")
python
await llm.achat(message)
python
[x async for x in await llm.astream_complete("what is a black hole")][-1]
python
[x async for x in await llm.astream_chat(message)][-1]