Get References from PDFs

This guide shows you how to use LlamaIndex to get in-line page number citations in the response (and the response is streamed).

This is a simple combination of using the page number metadata in our PDF loader along with our indexing/query abstractions to use this information.

If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.

python

%pip install llama-index-llms-openai

python

!pip install llama-index

python

from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    download_loader,
    RAKEKeywordTableIndex,
)

python

from llama_index.llms.openai import OpenAI

llm = OpenAI(temperature=0, model="gpt-3.5-turbo")

Download Data

python

!mkdir -p 'data/10k/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/10k/lyft_2021.pdf' -O 'data/10k/lyft_2021.pdf'

Load document and build index

python

reader = SimpleDirectoryReader(input_files=["./data/10k/lyft_2021.pdf"])
data = reader.load_data()

python

index = VectorStoreIndex.from_documents(data)

python

query_engine = index.as_query_engine(streaming=True, similarity_top_k=3)

Stream response with page citation

python

response = query_engine.query(
    "What was the impact of COVID? Show statements in bullet form and show"
    " page reference after each statement."
)
response.print_response_stream()

Inspect source nodes

python

for node in response.source_nodes:
    print("-----")
    text_fmt = node.node.get_content().strip().replace("\n", " ")[:1000]
    print(f"Text:\t {text_fmt} ...")
    print(f"Metadata:\t {node.node.metadata}")
    print(f"Score:\t {node.score:.3f}")