docs/examples/llm/ipex_llm.ipynb
IPEX-LLM is a PyTorch library for running LLMs on Intel CPU and GPU (e.g., a local PC with iGPU, or discrete GPUs such as Arc, Flex and Max) with very low latency.
This example goes over how to use LlamaIndex to interact with `ipex-llm` for text generation and chat on CPU.
Note
You can refer to here for full examples of `IpexLLM`. Please note that for running on Intel CPU, you should specify `-d 'cpu'` in the command argument when running the examples.
Install `llama-index-llms-ipex-llm`. This will also install `ipex-llm` and its dependencies.
%pip install llama-index-llms-ipex-llm
In this example we'll use the HuggingFaceH4/zephyr-7b-alpha model for demonstration. It requires updating the `transformers` and `tokenizers` packages.
%pip install -U transformers==4.37.0 tokenizers==0.15.2
Before loading the Zephyr model, you'll need to define `completion_to_prompt` and `messages_to_prompt` for formatting prompts. This is essential for preparing inputs that the model can interpret accurately.
# Transform a string into zephyr-specific input
def completion_to_prompt(completion):
    return f"<|system|>\n</s>\n<|user|>\n{completion}</s>\n<|assistant|>\n"


# Transform a list of chat messages into zephyr-specific input
def messages_to_prompt(messages):
    prompt = ""
    for message in messages:
        if message.role == "system":
            prompt += f"<|system|>\n{message.content}</s>\n"
        elif message.role == "user":
            prompt += f"<|user|>\n{message.content}</s>\n"
        elif message.role == "assistant":
            prompt += f"<|assistant|>\n{message.content}</s>\n"

    # ensure we start with a system prompt, insert blank if needed
    if not prompt.startswith("<|system|>\n"):
        prompt = "<|system|>\n</s>\n" + prompt

    # add final assistant prompt
    prompt = prompt + "<|assistant|>\n"

    return prompt
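As a quick optional check (not part of the original flow), you can print what these helpers produce to confirm the Zephyr prompt template is assembled as expected. `ChatMessage` is the LlamaIndex chat message class used later in this notebook.
# Optional: inspect the formatted prompts produced by the helpers above
from llama_index.core.llms import ChatMessage

print(completion_to_prompt("Tell me a joke."))
print(messages_to_prompt([ChatMessage(role="user", content="Tell me a joke.")]))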
Load the Zephyr model locally with `IpexLLM.from_model_id`. It will load the model directly in its Hugging Face format and convert it automatically to low-bit format for inference.
import warnings

warnings.filterwarnings(
    "ignore", category=UserWarning, message=".*padding_mask.*"
)

from llama_index.llms.ipex_llm import IpexLLM

llm = IpexLLM.from_model_id(
    model_name="HuggingFaceH4/zephyr-7b-alpha",
    tokenizer_name="HuggingFaceH4/zephyr-7b-alpha",
    context_window=512,
    max_new_tokens=128,
    generate_kwargs={"do_sample": False},
    completion_to_prompt=completion_to_prompt,
    messages_to_prompt=messages_to_prompt,
)
Now you can proceed to use the loaded model for text completion and interactive chat.
completion_response = llm.complete("Once upon a time, ")
print(completion_response.text)
response_iter = llm.stream_complete("Once upon a time, there's a little girl")
for response in response_iter:
    print(response.delta, end="", flush=True)
from llama_index.core.llms import ChatMessage
message = ChatMessage(role="user", content="Explain Big Bang Theory briefly")
resp = llm.chat([message])
print(resp)
message = ChatMessage(role="user", content="What is AI?")
resp = llm.stream_chat([message], max_tokens=256)
for r in resp:
    print(r.delta, end="")
Alternatively, you might save the low-bit model to disk once and use `from_model_id_low_bit` instead of `from_model_id` to reload it for later use, even across different machines. It is space-efficient, as the low-bit model demands significantly less disk space than the original model. And `from_model_id_low_bit` is also more efficient than `from_model_id` in terms of speed and memory usage, since it skips the model conversion step.
To save the low-bit model, use `save_low_bit` as follows.
saved_lowbit_model_path = (
    "./zephyr-7b-alpha-low-bit"  # path to save low-bit model
)

llm._model.save_low_bit(saved_lowbit_model_path)
del llm
Load the model from the saved low-bit model path as follows.
Note that the saved path for the low-bit model only includes the model itself, not the tokenizer. If you wish to have everything in one place, you will need to manually download or copy the tokenizer files from the original model's directory to the location where the low-bit model is saved (a sketch of this follows the next cell).
llm_lowbit = IpexLLM.from_model_id_low_bit(
    model_name=saved_lowbit_model_path,
    tokenizer_name="HuggingFaceH4/zephyr-7b-alpha",
    # tokenizer_name=saved_lowbit_model_path,  # copy the tokenizer files to the saved path if you want to use it this way
    context_window=512,
    max_new_tokens=64,
    completion_to_prompt=completion_to_prompt,
    generate_kwargs={"do_sample": False},
)
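If you would rather keep everything under the saved path (so that `tokenizer_name=saved_lowbit_model_path` works), one minimal sketch, using the standard `transformers` API installed earlier, is to save the tokenizer into the same directory:
# Optional sketch: copy the tokenizer files into the low-bit model directory,
# so the saved path can serve both the model and the tokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-alpha")
tokenizer.save_pretrained(saved_lowbit_model_path)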
Try stream completion using the loaded low-bit model.
response_iter = llm_lowbit.stream_complete("What is Large Language Model?")
for response in response_iter:
    print(response.delta, end="", flush=True)