
Local Multimodal pipeline with OpenVINO

OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference. The OpenVINO™ Runtime supports various hardware devices including x86 and ARM CPUs, and Intel GPUs. It can help to boost deep learning performance in Computer Vision, Automatic Speech Recognition, Natural Language Processing and other common tasks.

Hugging Face multimodal models can be supported by OpenVINO through the OpenVINOMultiModal class.

python
%pip install llama-index-multi-modal-llms-openvino -q
python
%pip install llama-index llama-index-readers-file -q

Export and compress multimodal model

It is possible to export your model to the OpenVINO IR format with the CLI and then load it from the local folder.

python
from pathlib import Path

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
model_path = Path(model_id.split("/")[-1]) / "FP16"

if not model_path.exists():
    !optimum-cli export openvino --model {model_id} --weight-format fp16 {model_path}
python
import shutil
import nncf
import openvino as ov
import gc

core = ov.Core()

compression_config = {
    "mode": nncf.CompressWeightsMode.INT4_SYM,  # 4-bit symmetric weight quantization
    "group_size": 64,  # number of weights sharing one quantization scale
    "ratio": 0.6,  # compress ~60% of the weights to INT4, keep the rest in INT8
}

compressed_model_path = model_path.parent / "INT4"
if not compressed_model_path.exists():
    # Compress the weights of the exported language model with NNCF
    ov_model = core.read_model(model_path / "openvino_language_model.xml")
    compressed_ov_model = nncf.compress_weights(ov_model, **compression_config)
    ov.save_model(
        compressed_ov_model,
        compressed_model_path / "openvino_language_model.xml",
    )
    del compressed_ov_model
    del ov_model
    gc.collect()
    # Copy the remaining exported files (vision encoder, tokenizer, configs, ...)
    for file_name in model_path.glob("*"):
        if file_name.name in [
            "openvino_language_model.xml",
            "openvino_language_model.bin",
        ]:
            continue
        shutil.copy(file_name, compressed_model_path)
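
As a quick sanity check, you can compare the on-disk size of the language model weights before and after compression. This is a minimal sketch assuming both the FP16 export and the INT4 copy created above are present:

python
fp16_weights = model_path / "openvino_language_model.bin"
int4_weights = compressed_model_path / "openvino_language_model.bin"

# Report the weight file sizes in megabytes; the INT4 file should be noticeably smaller.
for label, path in [("FP16", fp16_weights), ("INT4", int4_weights)]:
    print(f"{label} language model weights: {path.stat().st_size / 1024**2:.1f} MB")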

Prepare the input data

python
import os

import requests
from PIL import Image

os.makedirs("./input_images", exist_ok=True)

url = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

image
python
from llama_index.multi_modal_llms.openvino import OpenVINOMultiModal
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "llava-v1.6-mistral-7b-hf/INT4", trust_remote_code=True
)


def messages_to_prompt(messages, image_documents):
    """
    Prepares the input messages and images.
    """
    conversation = [{"type": "text", "text": messages[0].content}]
    images = []
    for img_doc in image_documents:
        images.append(img_doc)
        conversation.append({"type": "image"})
    messages = [
        {"role": "user", "content": conversation}
    ]  # Wrap conversation in a user role

    print(messages)

    # Apply a chat template to format the message with the processor
    text_prompt = processor.apply_chat_template(
        messages, add_generation_prompt=True
    )

    # Prepare the model inputs (text + images) and convert to tensor
    inputs = processor(text=text_prompt, images=images, return_tensors="pt")
    return inputs
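
To see what the helper produces, here is a hypothetical smoke test that wraps a single question in a ChatMessage and passes it through messages_to_prompt together with the PIL image downloaded above (the exact tensor names and shapes depend on the processor):

python
from llama_index.core.llms import ChatMessage

# Hypothetical check: run one user message and one image through the helper
# and inspect the shapes of the resulting model inputs.
sample_inputs = messages_to_prompt(
    [ChatMessage(role="user", content="Describe the images")], [image]
)
print({name: tuple(tensor.shape) for name, tensor in sample_inputs.items()})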

Model Loading

Models can be loaded by specifying the model parameters using the OpenVINOMultiModal class.

If you have an Intel GPU, you can specify device="gpu" to run inference on it.
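
To check which devices the OpenVINO runtime can see on your machine before choosing one, you can list them (a quick sketch; the returned names typically include "CPU" and, when an Intel GPU driver is available, "GPU"):

python
import openvino as ov

# List the inference devices available to the OpenVINO runtime on this machine.
print(ov.Core().available_devices)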

python
vlm = OpenVINOMultiModal(
    model_id_or_path="llava-v1.6-mistral-7b-hf/INT4",
    device="cpu",
    messages_to_prompt=messages_to_prompt,
    generate_kwargs={"do_sample": False},
)

Inference with local OpenVINO model

python
response = vlm.complete("Describe the images", image_documents=[image])
print(response.text)

Streaming

python
response = vlm.stream_complete("Describe the images", image_documents=[image])
for r in response:
    print(r.delta, end="")
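
The input_images folder created earlier can also hold local files. As an optional variant of the call above, the downloaded image can be saved there and reused without fetching it again (a sketch; the file name is arbitrary):

python
# Save the previously downloaded PIL image locally and reuse it for inference.
local_path = "./input_images/dog_and_girl.jpeg"
image.save(local_path)

local_image = Image.open(local_path)
response = vlm.complete("Describe the images", image_documents=[local_image])
print(response.text)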