
vLLM

There are two modes of using vLLM: local and remote. Let's start with the former, which requires a CUDA environment available locally.

Install vLLM

pip install vllm

or you can compile it from source.
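
A minimal sketch of a source build, assuming the upstream vllm-project/vllm repository (an editable install needs a working CUDA toolchain):

git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .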

Orca-2-7b Completion Example

python
%pip install llama-index-llms-vllm
python
import os

# Cache Hugging Face model downloads under ./model/
os.environ["HF_HOME"] = "model/"
python
from llama_index.llms.vllm import Vllm, VllmServer
python
llm = Vllm(
    model="microsoft/Orca-2-7b",
    tensor_parallel_size=4,  # shard the model across 4 GPUs
    max_new_tokens=100,
    vllm_kwargs={"swap_space": 1, "gpu_memory_utilization": 0.5},
)
python
llm.complete("[INST]You are a helpful assistant[/INST] What is a black hole ?")

CodeLlama-7b Completion Example

python
llm = Vllm(
    model="codellama/CodeLlama-7b-hf",
    dtype="float16",
    tensor_parallel_size=4,
    temperature=0,
    max_new_tokens=100,
    vllm_kwargs={
        "swap_space": 1,
        "gpu_memory_utilization": 0.5,
        "max_model_len": 4096,
    },
)
python
llm.complete("import socket\n\ndef ping_exponential_backoff(host: str):")

Mistral-7B-Instruct Completion Example

python
llm = Vllm(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    dtype="float16",
    tensor_parallel_size=4,
    temperature=0,
    max_new_tokens=100,
    vllm_kwargs={
        "swap_space": 1,
        "gpu_memory_utilization": 0.5,
        "max_model_len": 4096,
    },
)
python
llm.complete(" What is a black hole ?")

Calling vLLM via HTTP

In this mode there is no need to install vLLM locally, nor to have CUDA available. To set up the vLLM API server, follow the server deployment guide in the vLLM documentation. Note: the llama-index-llms-vllm module is a client for vllm.entrypoints.api_server, which vLLM ships only as a demo.
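
As a minimal sketch, assuming the demo entrypoint's standard flags (the model name is just an illustration), the server can be started with:

python -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-Instruct-v0.1 --port 8000

This exposes the /generate endpoint that VllmServer targets below.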

If the vLLM server is launched with vllm.entrypoints.openai.api_server as an OpenAI-compatible server, or via Docker, you need the OpenAILike class from the llama-index-llms-openai-like module instead.
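
In that case a minimal sketch looks like the following; the served model name and local URL are assumptions for illustration:

python
from llama_index.llms.openai_like import OpenAILike

# Assumes an OpenAI-compatible vLLM server at localhost:8000 serving this model
llm = OpenAILike(
    model="mistralai/Mistral-7B-Instruct-v0.1",  # must match the model the server loads
    api_base="http://localhost:8000/v1",
    api_key="none",  # vLLM ignores the key unless configured otherwise
    max_tokens=100,
)
print(llm.complete("What is a black hole?"))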

Completion Response

python
from llama_index.core.llms import ChatMessage
python
from llama_index.llms.vllm import VllmServer

llm = VllmServer(
    api_url="http://localhost:8000/generate", max_new_tokens=100, temperature=0
)
python
llm.complete("what is a black hole ?")
python
message = [ChatMessage(content="hello", role="user")]
llm.chat(message)

Streaming Response

python
list(llm.stream_complete("what is a black hole"))[-1]
python
message = [ChatMessage(content="what is a black hole", role="user")]
[x for x in llm.stream_chat(message)][-1]

Async Response

python
import asyncio  # not strictly needed here; notebooks support top-level await

await llm.acomplete("What is a black hole")
python
await llm.achat(message)
python
[x async for x in await llm.astream_complete("what is a black hole")][-1]
python
[x async for x in await llm.astream_chat(message)][-1]