docs/examples/llm/openvino-genai.ipynb
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/llm/openvino-genai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference. The OpenVINO™ Runtime can run the same optimized model across various hardware devices, accelerating deep learning performance across use cases such as language and LLMs, computer vision, automatic speech recognition, and more.
`OpenVINOGenAILLM` is a wrapper around the OpenVINO GenAI API. OpenVINO models can be run locally through this entity, wrapped by LlamaIndex:
In the below line, we install the packages necessary for this demo:
%pip install llama-index-llms-openvino-genai
%pip install optimum[openvino]
Now that we're set up, let's play around:
If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.
!pip install llama-index
from llama_index.llms.openvino_genai import OpenVINOGenAILLM
It is possible to export your model to the OpenVINO IR format with the CLI and load the model from a local folder.
!optimum-cli export openvino --model microsoft/Phi-3-mini-4k-instruct --task text-generation-with-past --weight-format int4 model_path
Alternatively, you can download an optimized IR model from the OpenVINO model hub on Hugging Face.
import huggingface_hub as hf_hub
model_id = "OpenVINO/Phi-3-mini-4k-instruct-int4-ov"
model_path = "Phi-3-mini-4k-instruct-int4-ov"
hf_hub.snapshot_download(model_id, local_dir=model_path)
Models can be loaded by passing the model path to the `OpenVINOGenAILLM` constructor.
If you have an Intel GPU, you can specify `device="GPU"` to run inference on it.
ov_llm = OpenVINOGenAILLM(
model_path=model_path,
device="CPU",
)
You can pass generation config parameters through `ov_llm.config`. The supported parameters are listed in `openvino_genai.GenerationConfig`.
ov_llm.config.max_new_tokens = 100
response = ov_llm.complete("What is the meaning of life?")
print(str(response))
Using the `stream_complete` endpoint
response = ov_llm.stream_complete("Who is Paul Graham?")
for r in response:
print(r.delta, end="")
Using the `stream_chat` endpoint
from llama_index.core.llms import ChatMessage
messages = [
ChatMessage(
role="system", content="You are a pirate with a colorful personality"
),
ChatMessage(role="user", content="What is your name"),
]
resp = ov_llm.stream_chat(messages)
for r in resp:
print(r.delta, end="")
For more information, refer to the OpenVINO and OpenVINO GenAI documentation.