docs/examples/llm/llama_cpp.ipynb
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/llm/llama_2_llama_cpp.ipynb" target="_parent">Open In Colab</a>
In this short notebook, we show how to use the llama-cpp-python library with LlamaIndex.
We use the Qwen/Qwen2.5-7B-Instruct-GGUF model, along with the proper prompt formatting.
By default, if `model_path` and `model_url` are blank, the `LlamaCPP` module will load llama2-chat-13B.
To get the best performance out of LlamaCPP, it is recommended to install the package so that it is compiled with GPU support. A full guide for installing this way is here.
Full macOS instructions are also here.
In general:
- CuBLAS if you have CUDA and an NVidia GPU
- METAL if you are running on an M1/M2 MacBook
- CLBLAST if you are running on an AMD/Intel GPU

For me, on a Mac, I need to install the Metal backend.
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
Then you can install the required llama-index packages:
%pip install llama-index-embeddings-huggingface
%pip install llama-index-llms-llama-cpp
The `LlamaCPP` LLM is highly configurable. Depending on the model being used, you'll want to pass in `messages_to_prompt` and `completion_to_prompt` functions to help format the model inputs.
For any kwargs that need to be passed in during initialization, set them in `model_kwargs`. A full list of available model kwargs is available in the LlamaCPP docs.
For any kwargs that need to be passed in during inference, you can set them in `generate_kwargs`. See the full list of generate kwargs here.
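If you do want to override specific options, here is a minimal sketch of what that might look like (the parameter names `n_batch`, `top_p`, and `repeat_penalty` are llama-cpp-python options shown purely for illustration, and the model path is a placeholder):
```python
from llama_index.llms.llama_cpp import LlamaCPP

llm = LlamaCPP(
    model_path="/path/to/your-model.gguf",  # placeholder path to a local GGUF file
    model_kwargs={"n_gpu_layers": -1, "n_batch": 512},  # forwarded to the Llama() constructor
    generate_kwargs={"top_p": 0.95, "repeat_penalty": 1.1},  # forwarded on each completion call
)
```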
In general, the defaults are a great starting point. The example below shows configuration with all defaults.
If you're opening this Notebook on Colab, you will probably need to install LlamaIndex 🦙.
model_url = "https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q3_k_m.gguf"
from llama_index.llms.llama_cpp import LlamaCPP
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
def messages_to_prompt(messages):
    # Convert LlamaIndex ChatMessage objects into the dict format the tokenizer expects
    messages = [{"role": m.role.value, "content": m.content} for m in messages]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    return prompt


def completion_to_prompt(completion):
    # Wrap a raw completion string as a single user message and apply the chat template
    messages = [{"role": "user", "content": completion}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    return prompt
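As a quick sanity check, you can print what `completion_to_prompt` produces; the exact text comes from the Qwen chat template bundled with the tokenizer:
```python
# Inspect the formatted prompt for a simple user message
print(completion_to_prompt("Tell me a joke about llamas."))
```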
llm = LlamaCPP(
    # You can pass in the URL to a GGUF model to download it automatically
    model_url=model_url,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=None,
    temperature=0.1,
    max_new_tokens=256,
    # Qwen2.5 supports a large context window; 16384 leaves plenty of room for prompts and responses
    context_window=16384,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # set to at least 1 to use GPU; -1 offloads all layers
    model_kwargs={"n_gpu_layers": -1},
    # transform inputs into the model's chat format via the tokenizer's chat template
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)
We can tell that the model is using Metal and our GPU from the logging!
offloaded 29/29 layers to GPU
## Start using our `LlamaCPP` LLM abstraction!
We can simply use the `complete` method of our `LlamaCPP` LLM abstraction to generate completions given a prompt.
```python
response = llm.complete("Hello! Can you tell me a poem about cats and dogs?")
print(response.text)
```
We can use the `stream_complete` endpoint to stream the response as it's being generated, rather than waiting for the entire response to finish.
response_iter = llm.stream_complete("Can you write me a poem about fast cars?")
for response in response_iter:
    print(response.delta, end="", flush=True)
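The same LLM abstraction also exposes a chat interface. Here is a small sketch using LlamaIndex's `ChatMessage` (the message contents are just examples):
```python
from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(role="system", content="You are a helpful assistant that answers concisely."),
    ChatMessage(role="user", content="What are llamas typically used for?"),
]

# chat() runs the messages through messages_to_prompt and returns a ChatResponse
response = llm.chat(messages)
print(response.message.content)
```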
We can simply pass in the LlamaCPP LLM abstraction to the LlamaIndex query engine as usual.
But first, let's change the global tokenizer to match our LLM.
from llama_index.core import set_global_tokenizer
from transformers import AutoTokenizer
set_global_tokenizer(
    AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct").encode
)
# use Huggingface embeddings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
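To confirm the embedding model loaded correctly, you can embed a short string with `get_text_embedding` and inspect the result (the sample text is arbitrary):
```python
# Embed a short string and check the vector dimensionality
sample_embedding = embed_model.get_text_embedding("Hello, llamas!")
print(len(sample_embedding))
```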
from llama_index.core import SimpleDirectoryReader
# load documents
documents = SimpleDirectoryReader("../data/paul_graham/").load_data()
from llama_index.core import VectorStoreIndex
# create vector store index
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
# set up query engine
query_engine = index.as_query_engine(llm=llm)
response = query_engine.query("What did the author do growing up?")
print(response)
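If you also want to stream the final answer from the query engine, here is a short sketch assuming the standard `streaming=True` flag on `as_query_engine` (the question is just an example):
```python
# Create a streaming query engine backed by the same LlamaCPP LLM
streaming_query_engine = index.as_query_engine(llm=llm, streaming=True)

streaming_response = streaming_query_engine.query(
    "What did the author do after college?"
)
streaming_response.print_response_stream()
```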