docs/examples/query_engine/citation_query_engine.ipynb
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/query_engine/citation_query_engine.ipynb" target="_parent">Open In Colab</a>
This notebook walks through how to use the `CitationQueryEngine`.
The `CitationQueryEngine` can be used with any existing index.
If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-embeddings-openai
%pip install llama-index-llms-openai
!pip install llama-index
import os
from llama_index.llms.openai import OpenAI
from llama_index.core.query_engine import CitationQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings
Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
if not os.path.exists("./citation"):
    documents = SimpleDirectoryReader("./data/paul_graham").load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir="./citation")
else:
    index = load_index_from_storage(
        StorageContext.from_defaults(persist_dir="./citation"),
    )
query_engine = CitationQueryEngine.from_args(
    index,
    similarity_top_k=3,
    # controls how granular citation sources are; the default is 512
    citation_chunk_size=512,
)
response = query_engine.query("What did the author do growing up?")
print(response)
# there are 6 source nodes, because the original 1024-token nodes were split into more granular 512-token citation chunks
print(len(response.source_nodes))
Sources start counting at 1, but Python lists start counting at zero!
Let's confirm the sources make sense.
print(response.source_nodes[0].node.get_text())
print(response.source_nodes[1].node.get_text())
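Because the citation markers in the response text are 1-based while `response.source_nodes` is 0-based, mapping a marker like `[2]` back to its node means subtracting one. A minimal sketch of that mapping (the response text below is hypothetical, not actual model output):

```python
import re

# Hypothetical response text in the bracketed-citation style that
# CitationQueryEngine produces; not actual model output.
response_text = (
    "The author wrote short stories and programmed on an IBM 1401 [1]. "
    "Later, he got a microcomputer and programmed more seriously [2]."
)

# Citation markers are 1-based; subtract 1 to index into response.source_nodes.
cited_indices = [int(m) - 1 for m in re.findall(r"\[(\d+)\]", response_text)]
print(cited_indices)  # -> [0, 1]
```

Each index in `cited_indices` can then be used directly, e.g. `response.source_nodes[0]` for the text cited as `[1]`.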
Note that setting `citation_chunk_size` larger than the original chunk size of the nodes has no effect.
The default node chunk size is 1024, so here we are not making our citation nodes any more granular.
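To see why a smaller `citation_chunk_size` yields more source nodes, here is a back-of-the-envelope sketch. It is a simplified model that ignores chunk overlap and sentence-boundary splitting, and the token counts are illustrative, not measured from the essay:

```python
import math

def num_citation_chunks(node_tokens: int, citation_chunk_size: int) -> int:
    # Each retrieved node is re-split into citation chunks of at most this size.
    return math.ceil(node_tokens / citation_chunk_size)

# similarity_top_k=3 retrieved nodes, each at the default 1024-token chunk size
retrieved_node_sizes = [1024, 1024, 1024]

for size in (512, 1024):
    total = sum(num_citation_chunks(n, size) for n in retrieved_node_sizes)
    print(f"citation_chunk_size={size}: {total} source nodes")
# citation_chunk_size=512: 6 source nodes
# citation_chunk_size=1024: 3 source nodes
```

Under this model, a `citation_chunk_size` of 512 splits each 1024-token node in two (6 source nodes total), while 1024 leaves them intact (3 source nodes), matching the counts seen above and below.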
query_engine = CitationQueryEngine.from_args(
    index,
    # increase the citation chunk size!
    citation_chunk_size=1024,
    similarity_top_k=3,
)
response = query_engine.query("What did the author do growing up?")
print(response)
# should be fewer source nodes now!
print(len(response.source_nodes))
Sources start counting at 1, but Python lists start counting at zero!
Let's confirm the source makes sense.
print(response.source_nodes[0].node.get_text())