docs/examples/llm/openvino-genai.ipynb
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/llm/openvino-genai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference. The OpenVINO™ Runtime can run the same optimized model across various hardware devices, accelerating deep learning performance across use cases such as language and LLMs, computer vision, automatic speech recognition, and more.
`OpenVINOGenAILLM` is a wrapper around the OpenVINO GenAI API. OpenVINO models can be run locally through this entity, wrapped by LlamaIndex:
In the below line, we install the packages necessary for this demo:
%pip install llama-index-llms-openvino-genai
%pip install optimum[openvino]
Now that we're set up, let's play around:
If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.
!pip install llama-index
from llama_index.llms.openvino_genai import OpenVINOGenAILLM
It is possible to export your model to the OpenVINO IR format with the CLI and load the model from a local folder.
!optimum-cli export openvino --model microsoft/Phi-3-mini-4k-instruct --task text-generation-with-past --weight-format int4 model_path
Alternatively, you can download an optimized IR model from the OpenVINO model hub on Hugging Face.
import huggingface_hub as hf_hub
model_id = "OpenVINO/Phi-3-mini-4k-instruct-int4-ov"
model_path = "Phi-3-mini-4k-instruct-int4-ov"
hf_hub.snapshot_download(model_id, local_dir=model_path)
Models can be loaded by passing the model path to the `OpenVINOGenAILLM` constructor.
If you have an Intel GPU, you can specify `device="GPU"` to run inference on it.
ov_llm = OpenVINOGenAILLM(
model_path=model_path,
device="CPU",
)
You can pass generation config parameters through `ov_llm.config`. The supported parameters are listed in `openvino_genai.GenerationConfig`.
ov_llm.config.max_new_tokens = 100
response = ov_llm.complete("What is the meaning of life?")
print(str(response))
Using the `stream_complete` endpoint
response = ov_llm.stream_complete("Who is Paul Graham?")
for r in response:
print(r.delta, end="")
Using the `stream_chat` endpoint
from llama_index.core.llms import ChatMessage
messages = [
ChatMessage(
role="system", content="You are a pirate with a colorful personality"
),
ChatMessage(role="user", content="What is your name"),
]
resp = ov_llm.stream_chat(messages)
for r in resp:
print(r.delta, end="")
For more information, refer to the OpenVINO and OpenVINO GenAI documentation.