docs/examples/llm/llamafile.ipynb
One of the simplest ways to run an LLM locally is using a llamafile. llamafiles bundle model weights and a specially-compiled version of llama.cpp into a single file that can run on most computers without any additional dependencies. They also come with an embedded inference server that provides an API for interacting with your model.
Here's a simple bash script that shows all 3 setup steps:
# Download a llamafile from HuggingFace
wget https://huggingface.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile
# Make the file executable. On Windows, instead just rename the file to end in ".exe".
chmod +x TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile
# Start the model server. Listens at http://localhost:8080 by default.
./TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile --server --nobrowser --embedding
Your model's inference server listens at localhost:8080 by default.
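Before moving on, it can help to confirm the server is actually up. The sketch below is one way to do that from Python; it assumes the default http://localhost:8080 address and the /completion route exposed by llama.cpp's embedded server.
# Sanity check: send a raw completion request to the llamafile server.
# (A sketch; assumes the default address and llama.cpp's /completion endpoint.)
import json
import urllib.request

payload = {"prompt": "Hello!", "n_predict": 16}  # n_predict caps output tokens
req = urllib.request.Request(
    "http://localhost:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as reply:
    print(json.loads(reply.read())["content"])
If this prints a short continuation of the prompt, the server is ready and you can switch to the LlamaIndex client below.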
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-llms-llamafile
!pip install llama-index
from llama_index.llms.llamafile import Llamafile
llm = Llamafile(temperature=0, seed=0)
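If your llamafile server is listening somewhere other than the default address, you can point the client at it explicitly. A minimal sketch, assuming the integration's base_url parameter and a hypothetical server on port 8081:
# A sketch: connect to a non-default server address.
# (Assumes the base_url parameter; the default is http://localhost:8080.)
llm = Llamafile(base_url="http://localhost:8081", temperature=0, seed=0)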
resp = llm.complete("Who is Octavia Butler?")
print(resp)
WARNING: TinyLlama's description of Octavia Butler above contains many falsehoods. For example, she was born in California, not Pennsylvania. The information about her family and her education is a hallucination. She did not work as an elementary school teacher. Instead, she took a series of temporary jobs that would allow her to focus her energy on writing. Her work did not "quickly gain recognition": she sold her first short story around 1970, but did not gain prominence for another 14 years, when her short story "Speech Sounds" won the Hugo Award in 1984. Please refer to Wikipedia for a real biography of Octavia Butler.
We use the TinyLlama model in this example notebook primarily because it's small and therefore quick to download. A larger model might hallucinate less. However, this should serve as a reminder that LLMs often do lie, even about topics that are well-known enough to have a Wikipedia page. It's important to verify their outputs with your own research.
Call chat with a list of messages
from llama_index.core.llms import ChatMessage
messages = [
ChatMessage(
role="system",
content="Pretend you are a pirate with a colorful personality.",
),
ChatMessage(role="user", content="What is your name?"),
]
resp = llm.chat(messages)
print(resp)
Using stream_complete endpoint
response = llm.stream_complete("Who is Octavia Butler?")
for r in response:
print(r.delta, end="")
Using stream_chat endpoint
from llama_index.core.llms import ChatMessage
messages = [
ChatMessage(
role="system",
content="Pretend you are a pirate with a colorful personality.",
),
ChatMessage(role="user", content="What is your name?"),
]
resp = llm.stream_chat(messages)
for r in resp:
print(r.delta, end="")
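Once the model responds as expected, you can make it the default LLM for the rest of your LlamaIndex application. A minimal sketch, using the Settings singleton from llama_index.core:
# A sketch: set the llamafile-backed model as the global default LLM.
# Settings.llm is picked up by query engines, chat engines, etc.
from llama_index.core import Settings

Settings.llm = Llamafile(temperature=0, seed=0)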