# LlamaIndex LLMs Integration: llamafile
## Setup

Download a llamafile from Hugging Face:

```bash
wget https://huggingface.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile
```
On Unix-like systems, make the file executable:

```bash
chmod +x TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile
```
For Windows, simply rename the file to end with .exe.
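From a Windows Command Prompt, the built-in `ren` command does this; the target filename below is simply the same name with an `.exe` suffix:

```bat
ren TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile TinyLlama-1.1B-Chat-v1.0.Q5_K_M.exe
```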
Start the model server, which listens on http://localhost:8080 by default:

```bash
./TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile --server --nobrowser --embedding
```
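Before connecting from LlamaIndex, you can sanity-check that the server is up. This assumes your llamafile build exposes the llama.cpp server's `/health` endpoint (newer builds do; on older builds, fetching the root page works as well):

```bash
curl http://localhost:8080/health
```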
## Installation

If you are running in Google Colab or another fresh environment, install the llamafile integration along with LlamaIndex itself:

```bash
%pip install llama-index-llms-llamafile
!pip install llama-index
```
## Usage

Import the llamafile LLM and the chat message type:

```python
from llama_index.llms.llamafile import Llamafile
from llama_index.core.llms import ChatMessage
```
Create an instance of the llamafile LLM:

```python
llm = Llamafile(temperature=0, seed=0)
```
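If your server is not listening on the default address, you can point the client at it explicitly. This sketch assumes the `base_url` parameter, whose default is `http://localhost:8080`; the port below is just an example:

```python
# Hypothetical non-default address; adjust to wherever your server listens.
llm = Llamafile(base_url="http://localhost:9090", temperature=0, seed=0)
```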
To generate a completion for a prompt, use the `complete` method:

```python
resp = llm.complete("Who is Octavia Butler?")
print(resp)
```
You can also chat with the model using a list of messages:

```python
messages = [
    ChatMessage(
        role="system",
        content="Pretend you are a pirate with a colorful personality.",
    ),
    ChatMessage(role="user", content="What is your name?"),
]
resp = llm.chat(messages)
print(resp)
```
To stream a completion, call the `stream_complete` method and iterate over the response:

```python
response = llm.stream_complete("Who is Octavia Butler?")
for r in response:
    print(r.delta, end="")
```
You can also stream chat responses with `stream_chat`:

```python
messages = [
    ChatMessage(
        role="system",
        content="Pretend you are a pirate with a colorful personality.",
    ),
    ChatMessage(role="user", content="What is your name?"),
]
resp = llm.stream_chat(messages)
for r in resp:
    print(r.delta, end="")
```
For a complete notebook walkthrough, see the [llamafile LLM example](https://docs.llamaindex.ai/en/stable/examples/llm/llamafile/).