LLM Cookbook with Intel Gaudi


[Open In Colab](https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/cookbooks/llama3_cookbook_gaudi.ipynb)

Meta developed and released the Meta Llama 3 family of large language models (LLMs), a collection of pretrained and instruction-tuned generative text models in 8B and 70B parameter sizes. The Llama 3 instruction-tuned models are optimized for dialogue use cases and outperform many of the available open-source chat models on common industry benchmarks.

In this notebook, we will demonstrate how to use Llama 3 with LlamaIndex.

We use Llama-3-8B-Instruct, running on Intel Gaudi, for the demonstration.

Installation and Setup

python
!pip -q install llama-parse
!pip -q install python-dotenv==1.0.0
!pip -q install llama_index
!pip -q install llama-index-llms-gaudi
!pip -q install llama-index-embeddings-gaudi
!pip -q install llama-index-graph-stores-neo4j
!pip -q install llama-index-readers-wikipedia
!pip -q install wikipedia
!pip -q install InstructorEmbedding==1.0.1
!pip -q install sentence-transformers
!pip -q install --upgrade-strategy eager optimum[habana]
!pip -q install optimum-habana==1.14.1
!pip -q install huggingface-hub==0.23.2
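
Note that the Gaudi integrations assume a host with Intel Gaudi accelerators and the Habana SynapseAI software stack (including the Gaudi PyTorch bridge) already installed; the pip installs above only add the Python-level packages.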
python
import nest_asyncio

nest_asyncio.apply()

import os
import logging

from llama_index.readers.wikipedia import WikipediaReader
from llama_index.llms.gaudi import GaudiLLM
from llama_index.embeddings.gaudi import GaudiEmbedding
from llama_index.core.prompts import PromptTemplate

from llama_index.core import (
    SimpleDirectoryReader,
    KnowledgeGraphIndex,
    Settings,
    StorageContext,
)

logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    level=logging.INFO,
)
logger = logging.getLogger(__name__)
python
class AttributeContainer:
    def __init__(self, **kwargs):
        # Set attributes dynamically based on keyword arguments
        for key, value in kwargs.items():
            setattr(self, key, value)


args = AttributeContainer(
    device="hpu",
    model_name_or_path="meta-llama/Meta-Llama-3-8B-Instruct",
    bf16=True,
    max_new_tokens=100,
    max_input_tokens=0,
    batch_size=1,
    warmup=3,
    n_iterations=5,
    local_rank=0,
    use_kv_cache=True,
    use_hpu_graphs=True,
    dataset_name=None,
    column_name=None,
    do_sample=False,
    num_beams=1,
    trim_logits=False,
    seed=27,
    profiling_warmup_steps=0,
    profiling_steps=0,
    profiling_record_shapes=False,
    prompt=None,
    bad_words=None,
    force_words=None,
    assistant_model=None,
    peft_model=None,
    token=None,
    model_revision="main",
    attn_softmax_bf16=False,
    output_dir=None,
    bucket_size=-1,
    dataset_max_samples=-1,
    limit_hpu_graphs=False,
    reuse_cache=False,
    verbose_workers=False,
    simulate_dyn_prompt=None,
    reduce_recompile=False,
    use_flash_attention=False,
    flash_attention_recompute=False,
    flash_attention_causal_mask=False,
    flash_attention_fast_softmax=False,
    book_source=False,
    torch_compile=False,
    ignore_eos=True,
    temperature=1.0,
    top_p=1.0,
    const_serialization_path=None,
    csp=None,
    disk_offload=False,
    trust_remote_code=False,
    quant_config=os.getenv("QUANT_CONFIG", ""),
    num_return_sequences=1,
    bucket_internal=False,
)
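
These flags roughly mirror the command-line options of the optimum-habana text-generation example script; AttributeContainer simply stands in for an argparse.Namespace so the notebook can skip command-line parsing.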
python
# Transform a completion string into zephyr-style input
def completion_to_prompt(completion):
    return f"<|system|>\n</s>\n<|user|>\n{completion}</s>\n<|assistant|>\n"
python
# Transform a list of chat messages into zephyr-specific input
def messages_to_prompt(messages):
    prompt = ""
    for message in messages:
        if message.role == "system":
            prompt += f"<|system|>\n{message.content}</s>\n"
        elif message.role == "user":
            prompt += f"<|user|>\n{message.content}</s>\n"
        elif message.role == "assistant":
            prompt += f"<|assistant|>\n{message.content}</s>\n"

    # ensure we start with a system prompt, insert blank if needed
    if not prompt.startswith("<|system|>\n"):
        prompt = "<|system|>\n</s>\n" + prompt

    # add final assistant prompt
    prompt = prompt + "<|assistant|>\n"

    return prompt
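
As a quick sanity check, you can print what this function produces for a short, hypothetical conversation:

python
from llama_index.core.llms import ChatMessage

print(
    messages_to_prompt(
        [
            ChatMessage(role="system", content="You are a helpful assistant."),
            ChatMessage(role="user", content="Hello!"),
        ]
    )
)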

Set Up the LLM on Intel Gaudi
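
Meta Llama 3 is a gated model on Hugging Face, so log in with an account that has been granted access to the checkpoint: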

python
from huggingface_hub import notebook_login

notebook_login()
python
from llama_index.llms.gaudi import GaudiLLM

llm = GaudiLLM(
    args=args,
    logger=logger,
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",
    tokenizer_name="meta-llama/Meta-Llama-3-8B-Instruct",
    query_wrapper_prompt=PromptTemplate(
        "<|system|>\n</s>\n<|user|>\n{query_str}</s>\n<|assistant|>\n"
    ),
    context_window=3900,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
    messages_to_prompt=messages_to_prompt,
    device_map="auto",
)

Set Up the Embedding Model

python
from llama_index.embeddings.gaudi import GaudiEmbedding

embed_model = GaudiEmbedding(
    embedding_input_size=-1, model_name="BAAI/bge-small-en-v1.5"
)

Define Global Settings Configuration

In LlamaIndex, you can define global settings so you don't have to pass the LLM / embedding model objects everywhere.

python
from llama_index.core import Settings

Settings.llm = llm
Settings.embed_model = embed_model
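
Individual modules can still override these globals; some later cells pass llm=llm explicitly where that makes the intent clearer.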

Download Data

Here you'll download data that's used in section 2 and onwards.

python
!wget "https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt" -O "paul_graham_essay.txt"

Load Data

Here we load the downloaded essay with SimpleDirectoryReader. For PDFs and other complex documents, there are two options:

  1. LlamaParse: Sign up for an account here: cloud.llamaindex.ai. You get 1k free pages a day, and the paid plan is 7k free pages + 0.3c per additional page. LlamaParse is a good option if you want to parse complex documents, like PDFs with charts, tables, and more (a minimal sketch follows after this list).

  2. Default PDF parser (in SimpleDirectoryReader): If you don't want to sign up for an account or use a PDF service, just use the default pypdf reader bundled in our file loader. It's a good choice for getting started!
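
If you would rather try LlamaParse, here is a minimal sketch. It assumes a LLAMA_CLOUD_API_KEY environment variable and a hypothetical local PDF named attention.pdf; adapt both to your setup.

python
# Hedged sketch: parse a PDF with LlamaParse instead of the default reader.
# Assumes LLAMA_CLOUD_API_KEY is set and ./attention.pdf exists (hypothetical file).
from llama_parse import LlamaParse

parser = LlamaParse(result_type="markdown")  # or "text"
pdf_documents = parser.load_data("./attention.pdf")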

python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader(
    input_files=["paul_graham_essay.txt"]
).load_data()

1. Basic Completion and Chat

Call complete with a prompt

python
response = llm.complete("Who is Paul Graham?")

print(response)
python
stream_response = llm.stream_complete(
    "you're a Paul Graham fan. tell me why you like Paul Graham"
)

for t in stream_response:
    print(t.delta, end="")

Call chat with a list of messages

python
from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(role="system", content="You are Paul Graham."),
    ChatMessage(role="user", content="Write a paragraph about politics."),
]
response = llm.chat(messages)
python
print(response)

2. Basic RAG (Vector Search, Summarization)

Basic RAG (Vector Search)

python
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=3)
python
response = query_engine.query("Tell me about family matters")
python
print(str(response))
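
To see what the answer was grounded on, you can inspect the retrieved chunks; source_nodes is a standard attribute on LlamaIndex responses:

python
# Print the score and a snippet of each retrieved chunk
for node in response.source_nodes:
    print(f"score: {node.score}")
    print(node.node.get_content()[:200])
    print("---")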

Basic RAG (Summarization)

python
from llama_index.core import SummaryIndex

summary_index = SummaryIndex.from_documents(documents)
summary_engine = summary_index.as_query_engine()
python
response = summary_engine.query(
    "Given your assessment of this article, what is Paul Graham best known for?"
)
python
print(str(response))

3. Advanced RAG (Routing)

Build a router-style engine that can choose whether to do vector search or summarization. Here the two tools are wired into a SubQuestionQueryEngine, which decomposes a query and sends each sub-question to the most appropriate tool.

python
from llama_index.core.tools import QueryEngineTool, ToolMetadata

vector_tool = QueryEngineTool(
    index.as_query_engine(llm=llm),
    metadata=ToolMetadata(
        name="vector_search",
        description="Useful for searching for specific facts.",
    ),
)

summary_tool = QueryEngineTool(
    summary_index.as_query_engine(response_mode="tree_summarize", llm=llm),
    metadata=ToolMetadata(
        name="summary",
        description="Useful for summarizing an entire document.",
    ),
)
python
from llama_index.core.query_engine import SubQuestionQueryEngine

query_engine = SubQuestionQueryEngine.from_defaults(
    [vector_tool, summary_tool],
    llm=llm,
    verbose=True,
)
response = query_engine.query("tell me something about paul graham?")
python
print(response)

4. Text-to-SQL

Here, we download and use a sample SQLite database with 11 tables containing various information about music, playlists, and customers. We will limit ourselves to a few of those tables for this test.
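
Note that the query engine executes model-generated SQL directly against the database, so only point it at data you are comfortable querying this way.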

python
!mkdir -p ./data
!wget "https://www.sqlitetutorial.net/wp-content/uploads/2018/03/chinook.zip" -O "./data/chinook.zip"
!unzip -o "./data/chinook.zip"
python
from sqlalchemy import create_engine

engine = create_engine("sqlite:///chinook.db")
python
from llama_index.core import SQLDatabase

sql_database = SQLDatabase(engine)
python
from llama_index.core.indices.struct_store import NLSQLTableQueryEngine

query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,
    tables=["albums", "tracks", "artists"],
    llm=llm,
)
python
response = query_engine.query("What are some albums?")

print(response)
python
response = query_engine.query("What are some artists? Limit it to 5.")

print(response)

This last query should require a more complex join.

python
response = query_engine.query(
    "What are some tracks from the artist AC/DC? Limit it to 3"
)

print(response)
python
print(response.metadata["sql_query"])

5. Structured Data Extraction - Graph RAG with a Local Neo4j Database
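
This section assumes a Neo4j server is already running and reachable; fill in the username, password, and connection URL placeholders below with your server's credentials.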

python
from llama_index.graph_stores.neo4j import Neo4jGraphStore
from llama_index.core import (
    KnowledgeGraphIndex,
    StorageContext,
)

graph_store = Neo4jGraphStore(
    username="<username for the Neo4j server>",
    password="<password for the Neo4j server>",
    url="<URL for the Neo4j server>",
    database="neo4j",
)

storage_context = StorageContext.from_defaults(graph_store=graph_store)
neo4j_index = KnowledgeGraphIndex.from_documents(
    documents=documents,
    max_triplets_per_chunk=3,
    storage_context=storage_context,
    embed_model=embed_model,
    include_embeddings=True,
)
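
Building the index calls the LLM to extract up to max_triplets_per_chunk subject-predicate-object triplets from each chunk and writes them to Neo4j, so this step can take a while on a long document.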
python
struct_query_engine = neo4j_index.as_query_engine(
    include_text=True,
    response_mode="tree_summarize",
    embedding_mode="hybrid",
    similarity_top_k=5,
)

response = struct_query_engine.query("who is paul graham?")
python
print(response)

6. Adding Chat History to RAG (Chat Engine)

In this section we create a stateful chatbot from a RAG pipeline, with our chat engine abstraction.

Unlike a stateless query engine, the chat engine maintains conversation history (through a memory module like buffer memory). It performs retrieval given a condensed question, and feeds the condensed question + context + chat history into the final LLM prompt.

Related resource: https://docs.llamaindex.ai/en/stable/examples/chat_engine/chat_engine_condense_plus_context/

python
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.core.chat_engine import CondensePlusContextChatEngine

memory = ChatMemoryBuffer.from_defaults(token_limit=3900)

chat_engine = CondensePlusContextChatEngine.from_defaults(
    index.as_retriever(),
    memory=memory,
    llm=llm,
    context_prompt=(
        "You are a chatbot, able to have normal interactions, as well as talk"
        " about Paul Graham.\n"
        "Here are the relevant documents for the context:\n"
        "{context_str}"
        "\nInstruction: Use the previous chat history, or the context above, to interact and help the user."
    ),
    verbose=True,
)
python
response = chat_engine.chat(
    "Tell me about the essay Paul Graham wrote on the topic of programming."
)
print(str(response))
python
response = chat_engine.chat(
    "What about the essays Paul Graham wrote on other topics?"
)
print(str(response))