docs/examples/cookbooks/llama3_cookbook_ollama_replicate.ipynb
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/cookbooks/llama3_cookbook_ollama_replicate.ipynb" target="_parent">Open In Colab</a>
Meta developed and released the Meta Llama 3 family of large language models (LLMs), a collection of pretrained and instruction-tuned generative text models in 8B and 70B sizes. The Llama 3 instruction-tuned models are optimized for dialogue use cases and outperform many of the available open-source chat models on common industry benchmarks.
In this notebook, we demonstrate how to use Llama 3 with LlamaIndex for a comprehensive set of use cases.
We use Llama3-8B through Ollama and Llama3-70B through Replicate.
!pip install llama-index
!pip install llama-index-llms-ollama
!pip install llama-index-llms-replicate
!pip install llama-index-embeddings-huggingface
!pip install llama-parse
!pip install replicate
import nest_asyncio
nest_asyncio.apply()
from llama_index.llms.ollama import Ollama
llm = Ollama(
model="llama3",
request_timeout=120.0,
# Manually set the context window to limit memory usage
context_window=8000,
)
Make sure you have your REPLICATE_API_TOKEN environment variable set!
# import os
# os.environ["REPLICATE_API_TOKEN"] = "<YOUR_API_KEY>"
from llama_index.llms.replicate import Replicate
llm_replicate = Replicate(model="meta/meta-llama-3-70b-instruct")
# llm_replicate = Replicate(model="meta/meta-llama-3-8b-instruct")
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
In LlamaIndex, you can define global settings so you don't have to pass the LLM / embedding model objects everywhere.
from llama_index.core import Settings
Settings.llm = llm
Settings.embed_model = embed_model
Here you'll download data that's used in section 2 and onwards.
We'll download some articles on Kendrick, Drake, and their beef (as of May 2024).
!mkdir data
!wget "https://www.dropbox.com/scl/fi/t1soxfjdp0v44an6sdymd/drake_kendrick_beef.pdf?rlkey=u9546ymb7fj8lk2v64r6p5r5k&st=wjzzrgil&dl=1" -O data/drake_kendrick_beef.pdf
!wget "https://www.dropbox.com/scl/fi/nts3n64s6kymner2jppd6/drake.pdf?rlkey=hksirpqwzlzqoejn55zemk6ld&st=mohyfyh4&dl=1" -O data/drake.pdf
!wget "https://www.dropbox.com/scl/fi/8ax2vnoebhmy44bes2n1d/kendrick.pdf?rlkey=fhxvn94t5amdqcv9vshifd3hj&st=dxdtytn6&dl=1" -O data/kendrick.pdf
We load data using LlamaParse by default, but you can also opt for the free PyPDF reader (the default in SimpleDirectoryReader) if you don't have an account!
LlamaParse: Sign up for an account here: cloud.llamaindex.ai. You get 1k free pages a day, and the paid plan is 7k free pages + 0.3c per additional page. LlamaParse is a good option if you want to parse complex documents, like PDFs with charts, tables, and more.
Default PDF parser (in SimpleDirectoryReader): if you don't want to sign up for an account or use a PDF parsing service, just use the default PyPDF reader bundled in our file loader. It's a good choice for getting started!
from llama_parse import LlamaParse
docs_kendrick = LlamaParse(result_type="text").load_data("./data/kendrick.pdf")
docs_drake = LlamaParse(result_type="text").load_data("./data/drake.pdf")
docs_both = LlamaParse(result_type="text").load_data(
"./data/drake_kendrick_beef.pdf"
)
# from llama_index.core import SimpleDirectoryReader
# docs_kendrick = SimpleDirectoryReader(input_files=["data/kendrick.pdf"]).load_data()
# docs_drake = SimpleDirectoryReader(input_files=["data/drake.pdf"]).load_data()
# docs_both = SimpleDirectoryReader(input_files=["data/drake_kendrick_beef.pdf"]).load_data()
response = llm.complete("do you like drake or kendrick better?")
print(response)
stream_response = llm.stream_complete(
"you're a drake fan. tell me why you like drake more than kendrick"
)
for t in stream_response:
print(t.delta, end="")
from llama_index.core.llms import ChatMessage
messages = [
ChatMessage(role="system", content="You are Kendrick."),
ChatMessage(role="user", content="Write a verse."),
]
response = llm.chat(messages)
print(response)
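Chat responses can also be streamed token by token; here's a minimal sketch reusing the same messages list:
# Stream the chat response token by token
for chunk in llm.stream_chat(messages):
    print(chunk.delta, end="")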
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(docs_both)
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("Tell me about family matters")
print(str(response))
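If you want to see which retrieved chunks the answer was grounded in, the response object also exposes the source nodes; a minimal sketch:
# Inspect the retrieved source chunks and their similarity scores
for node_with_score in response.source_nodes:
    print(node_with_score.score, node_with_score.node.get_content()[:200])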
from llama_index.core import SummaryIndex
summary_index = SummaryIndex.from_documents(docs_both)
summary_engine = summary_index.as_query_engine()
response = summary_engine.query(
"Given your assessment of this article, who won the beef?"
)
print(str(response))
from llama_index.core.tools import QueryEngineTool, ToolMetadata
vector_tool = QueryEngineTool(
index.as_query_engine(),
metadata=ToolMetadata(
name="vector_search",
description="Useful for searching for specific facts.",
),
)
summary_tool = QueryEngineTool(
index.as_query_engine(response_mode="tree_summarize"),
metadata=ToolMetadata(
name="summary",
description="Useful for summarizing an entire document.",
),
)
from llama_index.core.query_engine import RouterQueryEngine
query_engine = RouterQueryEngine.from_defaults(
[vector_tool, summary_tool], select_multi=False, verbose=True
)
response = query_engine.query(
"Tell me about the song meet the grahams - why is it significant"
)
print(response)
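With verbose=True the router prints which tool it picked; depending on your version, the selection is also recorded in the response metadata. A minimal sketch:
# Inspect which tool the router selected (metadata key may vary by version)
print(response.metadata.get("selector_result"))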
Our Sub-Question Query Engine breaks a complex question down into sub-questions, routes each sub-question to the relevant tool, and synthesizes the sub-answers into a final response.
drake_index = VectorStoreIndex.from_documents(docs_drake)
drake_query_engine = drake_index.as_query_engine(similarity_top_k=3)
kendrick_index = VectorStoreIndex.from_documents(docs_kendrick)
kendrick_query_engine = kendrick_index.as_query_engine(similarity_top_k=3)
from llama_index.core.tools import QueryEngineTool, ToolMetadata
drake_tool = QueryEngineTool(
drake_index.as_query_engine(),
metadata=ToolMetadata(
name="drake_search",
description="Useful for searching over Drake's life.",
),
)
kendrick_tool = QueryEngineTool(
kendrick_index.as_query_engine(),
metadata=ToolMetadata(
name="kendrick_summary",
description="Useful for searching over Kendrick's life.",
),
)
from llama_index.core.query_engine import SubQuestionQueryEngine
query_engine = SubQuestionQueryEngine.from_defaults(
[drake_tool, kendrick_tool],
llm=llm_replicate, # llama3-70b
verbose=True,
)
response = query_engine.query("Which albums did Drake release in his career?")
print(response)
Here, we download and use a sample SQLite database with 11 tables, containing various info about music, playlists, and customers. We will limit ourselves to a select few tables for this test.
!wget "https://www.sqlitetutorial.net/wp-content/uploads/2018/03/chinook.zip" -O "./data/chinook.zip"
!unzip "./data/chinook.zip"
from sqlalchemy import (
create_engine,
MetaData,
Table,
Column,
String,
Integer,
select,
column,
)
engine = create_engine("sqlite:///chinook.db")
from llama_index.core import SQLDatabase
sql_database = SQLDatabase(engine)
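As a quick sanity check, SQLDatabase can list the tables it discovered; a minimal sketch:
# List the tables available in the Chinook database
print(sql_database.get_usable_table_names())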
from llama_index.core.indices.struct_store import NLSQLTableQueryEngine
query_engine = NLSQLTableQueryEngine(
sql_database=sql_database,
tables=["albums", "tracks", "artists"],
llm=llm_replicate,
)
response = query_engine.query("What are some albums?")
print(response)
response = query_engine.query("What are some artists? Limit it to 5.")
print(response)
This last query requires a more complex join.
response = query_engine.query(
"What are some tracks from the artist AC/DC? Limit it to 3"
)
print(response)
print(response.metadata["sql_query"])
An important use case for function calling is extracting structured objects. LlamaIndex provides an intuitive interface for this through structured_predict - simply define the target Pydantic class (it can be nested), and given a prompt, we extract the desired object.
NOTE: Since there's no native function calling support with Llama3 / Ollama, the structured extraction is performed by prompting the LLM + output parsing.
from llama_index.llms.ollama import Ollama
from llama_index.core.prompts import PromptTemplate
from pydantic import BaseModel
class Restaurant(BaseModel):
"""A restaurant with name, city, and cuisine."""
name: str
city: str
cuisine: str
llm = Ollama(
model="llama3",
request_timeout=120.0,
# Manually set the context window to limit memory usage
context_window=8000,
)
prompt_tmpl = PromptTemplate(
"Generate a restaurant in a given city {city_name}"
)
restaurant_obj = llm.structured_predict(
Restaurant, prompt_tmpl, city_name="Miami"
)
print(restaurant_obj)
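Since Restaurant is a plain Pydantic model, the extracted object can be used like any other; a minimal sketch (assuming Pydantic v2):
# Access individual fields, or serialize the whole object (Pydantic v2 API)
print(restaurant_obj.name, restaurant_obj.city, restaurant_obj.cuisine)
print(restaurant_obj.model_dump_json())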
In this section we create a stateful chatbot from a RAG pipeline, with our chat engine abstraction.
Unlike a stateless query engine, the chat engine maintains conversation history (through a memory module like buffer memory). It performs retrieval given a condensed question, and feeds the condensed question + context + chat history into the final LLM prompt.
Related resource: https://docs.llamaindex.ai/en/stable/examples/chat_engine/chat_engine_condense_plus_context/
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.core.chat_engine import CondensePlusContextChatEngine
memory = ChatMemoryBuffer.from_defaults(token_limit=3900)
chat_engine = CondensePlusContextChatEngine.from_defaults(
index.as_retriever(),
memory=memory,
llm=llm,
context_prompt=(
"You are a chatbot, able to have normal interactions, as well as talk"
" about the Kendrick and Drake beef."
"Here are the relevant documents for the context:\n"
"{context_str}"
"\nInstruction: Use the previous chat history, or the context above, to interact and help the user."
),
verbose=True,
)
response = chat_engine.chat(
"Tell me about the songs Drake released in the beef."
)
print(str(response))
response = chat_engine.chat("What about Kendrick?")
print(str(response))
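To start a fresh conversation, you can clear the chat engine's memory; a minimal sketch:
# Reset the conversation memory to start a new session
chat_engine.reset()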
Here we build agents with Llama 3. We give the agent simple math functions as tools, as well as RAG query tools over the documents above.
import json
from typing import Sequence, List
from llama_index.core.llms import ChatMessage
from llama_index.core.tools import BaseTool, FunctionTool
from llama_index.core.agent import ReActAgent
import nest_asyncio
nest_asyncio.apply()
def multiply(a: int, b: int) -> int:
    """Multiply two integers and return the resulting integer."""
    return a * b


def add(a: int, b: int) -> int:
    """Add two integers and return the resulting integer."""
    return a + b


def subtract(a: int, b: int) -> int:
    """Subtract two integers and return the resulting integer."""
    return a - b


def divide(a: int, b: int) -> float:
    """Divide two integers and return the quotient as a float."""
    return a / b
multiply_tool = FunctionTool.from_defaults(fn=multiply)
add_tool = FunctionTool.from_defaults(fn=add)
subtract_tool = FunctionTool.from_defaults(fn=subtract)
divide_tool = FunctionTool.from_defaults(fn=divide)
agent = ReActAgent.from_tools(
[multiply_tool, add_tool, subtract_tool, divide_tool],
llm=llm_replicate,
verbose=True,
)
response = agent.chat("What is (121 + 2) * 5?")
print(str(response))
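As a quick sanity check on the agent's arithmetic, (121 + 2) * 5 = 123 * 5 = 615:
# Verify the agent's answer against plain Python arithmetic
assert (121 + 2) * 5 == 615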
from llama_index.core import (
SimpleDirectoryReader,
VectorStoreIndex,
StorageContext,
load_index_from_storage,
)
from llama_index.core.tools import QueryEngineTool, ToolMetadata
drake_tool = QueryEngineTool(
drake_index.as_query_engine(),
metadata=ToolMetadata(
name="drake_search",
description="Useful for searching over Drake's life.",
),
)
kendrick_tool = QueryEngineTool(
kendrick_index.as_query_engine(),
metadata=ToolMetadata(
name="kendrick_search",
description="Useful for searching over Kendrick's life.",
),
)
query_engine_tools = [drake_tool, kendrick_tool]
agent = ReActAgent.from_tools(
    query_engine_tools,
llm=llm_replicate,
verbose=True,
)
response = agent.chat("Tell me about how Kendrick and Drake grew up")
print(str(response))