docs/examples/llm/llama_cpp.ipynb
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/llm/llama_2_llama_cpp.ipynb" target="_parent">Open In Colab</a>
In this short notebook, we show how to use the llama-cpp-python library with LlamaIndex.
We use the Qwen/Qwen2.5-7B-Instruct-GGUF model, along with the proper prompt formatting.
By default, if `model_path` and `model_url` are blank, the `LlamaCPP` module will load llama2-chat-13B.
To get the best performance out of LlamaCPP, it is recommended to install the package so that it is compiled with GPU support. A full guide for installing this way is here.
Full macOS instructions are also here.
In general:
- CuBLAS if you have CUDA and an NVidia GPU
- METAL if you are running on an M1/M2 MacBook
- CLBLAST if you are running on an AMD/Intel GPU

For me, on a Mac, I need to install the Metal backend.
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
Then you can install the required llama-index packages:
%pip install llama-index-embeddings-huggingface
%pip install llama-index-llms-llama-cpp
The `LlamaCPP` LLM is highly configurable. Depending on the model being used, you'll want to pass in `messages_to_prompt` and `completion_to_prompt` functions to help format the model inputs.
For any kwargs that need to be passed in during initialization, set them in `model_kwargs`. A full list of available model kwargs is available in the LlamaCPP docs.
For any kwargs that need to be passed in during inference, you can set them in `generate_kwargs`. See the full list of generate kwargs here.
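If you do want to override specific options, here is a minimal sketch of what that might look like (the parameter names `n_batch`, `top_p`, and `repeat_penalty` are llama-cpp-python options shown purely for illustration, and the model path is a placeholder):
```python
from llama_index.llms.llama_cpp import LlamaCPP

llm = LlamaCPP(
    model_path="/path/to/your-model.gguf",  # placeholder path to a local GGUF file
    model_kwargs={"n_gpu_layers": -1, "n_batch": 512},  # forwarded to the Llama() constructor
    generate_kwargs={"top_p": 0.95, "repeat_penalty": 1.1},  # forwarded on each completion call
)
```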
In general, the defaults are a great starting point. The example below shows configuration with all defaults.
If you're opening this Notebook on Colab, you will probably need to install LlamaIndex 🦙.
model_url = "https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q3_k_m.gguf"
from llama_index.llms.llama_cpp import LlamaCPP
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
def messages_to_prompt(messages):
    # Convert LlamaIndex ChatMessage objects into the dict format the tokenizer expects
    messages = [{"role": m.role.value, "content": m.content} for m in messages]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    return prompt


def completion_to_prompt(completion):
    # Wrap a raw completion string as a single user message and apply the chat template
    messages = [{"role": "user", "content": completion}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    return prompt
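As a quick sanity check, you can print what `completion_to_prompt` produces; the exact text comes from the Qwen chat template bundled with the tokenizer:
```python
# Inspect the formatted prompt for a simple user message
print(completion_to_prompt("Tell me a joke about llamas."))
```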
llm = LlamaCPP(
    # You can pass in the URL to a GGUF model to download it automatically
    model_url=model_url,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=None,
    temperature=0.1,
    max_new_tokens=256,
    # Qwen2.5 supports a large context window; 16384 leaves plenty of room for prompts and responses
    context_window=16384,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # set to at least 1 to use GPU; -1 offloads all layers
    model_kwargs={"n_gpu_layers": -1},
    # transform inputs into the model's chat format via the tokenizer's chat template
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)
We can tell that the model is using Metal and our GPU from the logging!
offloaded 29/29 layers to GPU
## Start using our `LlamaCPP` LLM abstraction!
We can simply use the `complete` method of our `LlamaCPP` LLM abstraction to generate completions given a prompt.
```python
response = llm.complete("Hello! Can you tell me a poem about cats and dogs?")
print(response.text)
```
We can use the `stream_complete` endpoint to stream the response as it's being generated, rather than waiting for the entire response to finish.
response_iter = llm.stream_complete("Can you write me a poem about fast cars?")
for response in response_iter:
    print(response.delta, end="", flush=True)
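The same LLM abstraction also exposes a chat interface. Here is a small sketch using LlamaIndex's `ChatMessage` (the message contents are just examples):
```python
from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(role="system", content="You are a helpful assistant that answers concisely."),
    ChatMessage(role="user", content="What are llamas typically used for?"),
]

# chat() runs the messages through messages_to_prompt and returns a ChatResponse
response = llm.chat(messages)
print(response.message.content)
```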
We can simply pass in the LlamaCPP LLM abstraction to the LlamaIndex query engine as usual.
But first, let's change the global tokenizer to match our LLM.
from llama_index.core import set_global_tokenizer
from transformers import AutoTokenizer
set_global_tokenizer(
    AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct").encode
)
# use Huggingface embeddings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
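To confirm the embedding model loaded correctly, you can embed a short string with `get_text_embedding` and inspect the result (the sample text is arbitrary):
```python
# Embed a short string and check the vector dimensionality
sample_embedding = embed_model.get_text_embedding("Hello, llamas!")
print(len(sample_embedding))
```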
from llama_index.core import SimpleDirectoryReader
# load documents
documents = SimpleDirectoryReader("../data/paul_graham/").load_data()
from llama_index.core import VectorStoreIndex
# create vector store index
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
# set up query engine
query_engine = index.as_query_engine(llm=llm)
response = query_engine.query("What did the author do growing up?")
print(response)
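If you also want to stream the final answer from the query engine, here is a short sketch assuming the standard `streaming=True` flag on `as_query_engine` (the question is just an example):
```python
# Create a streaming query engine backed by the same LlamaCPP LLM
streaming_query_engine = index.as_query_engine(llm=llm, streaming=True)

streaming_response = streaming_query_engine.query(
    "What did the author do after college?"
)
streaming_response.print_response_stream()
```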