docs/examples/llm/llamafile.ipynb

llamafile

One of the simplest ways to run an LLM locally is using a llamafile. llamafiles bundle model weights and a specially-compiled version of llama.cpp into a single file that can run on most computers without any additional dependencies. They also come with an embedded inference server that provides an API for interacting with your model.

Setup

  1. Download a llamafile from HuggingFace
  2. Make the file executable
  3. Run the file

Here's a simple bash script that shows all 3 setup steps:

bash
# Download a llamafile from HuggingFace
wget https://huggingface.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile

# Make the file executable. On Windows, instead just rename the file to end in ".exe".
chmod +x TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile

# Start the model server. Listens at http://localhost:8080 by default.
./TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile --server --nobrowser --embedding

Your model's inference server listens at localhost:8080 by default.
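Because the embedded server is a llama.cpp server, you can also talk to it directly over HTTP with nothing but the standard library. Below is a minimal sketch, assuming the default port and llama.cpp's `/completion` endpoint; the helper names (`build_payload`, `complete`) are ours, not part of llamafile:

```python
import json
from urllib import request

BASE_URL = "http://localhost:8080"  # the server's default listen address

def build_payload(prompt: str, n_predict: int = 128) -> bytes:
    # llama.cpp's /completion endpoint accepts a JSON body with the prompt
    # and generation parameters such as n_predict (max tokens to generate).
    return json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()

def complete(prompt: str) -> str:
    req = request.Request(
        f"{BASE_URL}/completion",
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]

# With the server from the setup script running, you could call:
# print(complete("Who is Octavia Butler?"))
```

The `Llamafile` class shown below wraps this same server for you, so raw HTTP is only needed if you want to skip the integration entirely.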

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

python
%pip install llama-index-llms-llamafile
python
!pip install llama-index
python
from llama_index.llms.llamafile import Llamafile
python
llm = Llamafile(temperature=0, seed=0)
python
resp = llm.complete("Who is Octavia Butler?")
python
print(resp)

WARNING: TinyLlama's description of Octavia Butler above contains many falsehoods. For example, she was born in California, not Pennsylvania. The information about her family and her education is a hallucination. She did not work as an elementary school teacher. Instead, she took a series of temporary jobs that would allow her to focus her energy on writing. Her work did not "quickly gain recognition": she sold her first short story around 1970, but did not gain prominence for another 14 years, when her short story "Speech Sounds" won the Hugo Award in 1984. Please refer to Wikipedia for a real biography of Octavia Butler.

We use the TinyLlama model in this notebook primarily because it's small and therefore quick to download. A larger model might hallucinate less. However, this should serve as a reminder that LLMs often do lie, even about topics that are well-known enough to have a Wikipedia page. It's important to verify their outputs with your own research.

Call chat with a list of messages

python
from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(
        role="system",
        content="Pretend you are a pirate with a colorful personality.",
    ),
    ChatMessage(role="user", content="What is your name?"),
]
resp = llm.chat(messages)
python
print(resp)
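For reference, llamafile's server also exposes an OpenAI-compatible `/v1/chat/completions` endpoint, so the same chat can be reproduced with the standard library alone. A sketch under that assumption (the helper names here are ours):

```python
import json
from urllib import request

BASE_URL = "http://localhost:8080"  # the server's default listen address

def build_chat_payload(messages: list[dict]) -> bytes:
    # The OpenAI-style chat endpoint takes a list of
    # {"role": ..., "content": ...} dicts, mirroring the ChatMessage
    # objects passed to llm.chat above.
    return json.dumps({"messages": messages}).encode()

def chat(messages: list[dict]) -> str:
    req = request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=build_chat_payload(messages),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# With the server running, you could call:
# print(chat([
#     {"role": "system", "content": "Pretend you are a pirate."},
#     {"role": "user", "content": "What is your name?"},
# ]))
```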

Streaming

Using stream_complete endpoint

python
response = llm.stream_complete("Who is Octavia Butler?")
python
for r in response:
    print(r.delta, end="")

Using stream_chat endpoint

python
from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(
        role="system",
        content="Pretend you are a pirate with a colorful personality.",
    ),
    ChatMessage(role="user", content="What is your name?"),
]
resp = llm.stream_chat(messages)
python
for r in resp:
    print(r.delta, end="")