docs/examples/llm/llamafile.ipynb
One of the simplest ways to run an LLM locally is using a llamafile. llamafiles bundle model weights and a specially-compiled version of llama.cpp into a single file that can run on most computers without any additional dependencies. They also come with an embedded inference server that provides an API for interacting with your model.
Here's a simple bash script that shows all 3 setup steps:
# Download a llamafile from HuggingFace
wget https://huggingface.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile
# Make the file executable. On Windows, instead just rename the file to end in ".exe".
chmod +x TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile
# Start the model server. Listens at http://localhost:8080 by default.
./TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile --server --nobrowser --embedding
Your model's inference server listens at localhost:8080 by default.
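Before moving on, it can help to confirm the server is actually up. The sketch below is one way to do that from Python; it assumes the default http://localhost:8080 address and the /completion route exposed by llama.cpp's embedded server.
# Sanity check: send a raw completion request to the llamafile server.
# (A sketch; assumes the default address and llama.cpp's /completion endpoint.)
import json
import urllib.request

payload = {"prompt": "Hello!", "n_predict": 16}  # n_predict caps output tokens
req = urllib.request.Request(
    "http://localhost:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as reply:
    print(json.loads(reply.read())["content"])
If this prints a short continuation of the prompt, the server is ready and you can switch to the LlamaIndex client below.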
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-llms-llamafile
!pip install llama-index
from llama_index.llms.llamafile import Llamafile
llm = Llamafile(temperature=0, seed=0)
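If your llamafile server is listening somewhere other than the default address, you can point the client at it explicitly. A minimal sketch, assuming the integration's base_url parameter and a hypothetical server on port 8081:
# A sketch: connect to a non-default server address.
# (Assumes the base_url parameter; the default is http://localhost:8080.)
llm = Llamafile(base_url="http://localhost:8081", temperature=0, seed=0)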
resp = llm.complete("Who is Octavia Butler?")
print(resp)
WARNING: TinyLlama's description of Octavia Butler above contains many falsehoods. For example, she was born in California, not Pennsylvania. The information about her family and her education is a hallucination. She did not work as an elementary school teacher. Instead, she took a series of temporary jobs that would allow her to focus her energy on writing. Her work did not "quickly gain recognition": she sold her first short story around 1970, but did not gain prominence for another 14 years, when her short story "Speech Sounds" won the Hugo Award in 1984. Please refer to Wikipedia for a real biography of Octavia Butler.
We use the TinyLlama model in this example notebook primarily because it's small and therefore quick to download. A larger model might hallucinate less. However, this should serve as a reminder that LLMs often do lie, even about topics that are well-known enough to have a Wikipedia page. It's important to verify their outputs with your own research.
Call chat with a list of messages
from llama_index.core.llms import ChatMessage
messages = [
ChatMessage(
role="system",
content="Pretend you are a pirate with a colorful personality.",
),
ChatMessage(role="user", content="What is your name?"),
]
resp = llm.chat(messages)
print(resp)
Using stream_complete endpoint
response = llm.stream_complete("Who is Octavia Butler?")
for r in response:
print(r.delta, end="")
Using stream_chat endpoint
from llama_index.core.llms import ChatMessage
messages = [
ChatMessage(
role="system",
content="Pretend you are a pirate with a colorful personality.",
),
ChatMessage(role="user", content="What is your name?"),
]
resp = llm.stream_chat(messages)
for r in resp:
print(r.delta, end="")
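Once the model responds as expected, you can make it the default LLM for the rest of your LlamaIndex application. A minimal sketch, using the Settings singleton from llama_index.core:
# A sketch: set the llamafile-backed model as the global default LLM.
# Settings.llm is picked up by query engines, chat engines, etc.
from llama_index.core import Settings

Settings.llm = Llamafile(temperature=0, seed=0)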