Get References from PDFs

This guide shows how to use LlamaIndex to get inline page-number citations in a streamed response.

It combines the page-number metadata that the PDF loader attaches to each loaded page with the indexing and query abstractions that consume that metadata.
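To make the idea concrete, here is a toy sketch in plain Python. It is not LlamaIndex's internal code: `format_chunk` and `build_context` are hypothetical helpers, the chunk texts are placeholders, and the `page_label` key is an assumption about what the PDF loader records. It only illustrates why a model can cite a page when the page metadata travels with each chunk of text.

```python
from typing import Dict, List


def format_chunk(text: str, metadata: Dict[str, str]) -> str:
    """Prepend 'key: value' metadata lines to the chunk text."""
    header = "\n".join(f"{key}: {value}" for key, value in metadata.items())
    return f"{header}\n\n{text}" if header else text


def build_context(chunks: List[dict]) -> str:
    """Join the formatted chunks into one context string for a prompt."""
    return "\n\n---\n\n".join(
        format_chunk(chunk["text"], chunk["metadata"]) for chunk in chunks
    )


# Placeholder chunks; "page_label" is an assumed metadata key.
chunks = [
    {"text": "Placeholder text from one page.", "metadata": {"page_label": "6"}},
    {"text": "Placeholder text from another page.", "metadata": {"page_label": "17"}},
]
context = build_context(chunks)
print(context)
```

Because each chunk in the context is introduced by its page label, a prompt that asks for a page reference after each statement gives the model something to point at.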

<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/citation/pdf_page_reference.ipynb" target="_parent">Open in Colab</a>

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

python
%pip install llama-index-llms-openai
python
!pip install llama-index
python
# Only the reader and the vector index are used in this notebook.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
python
from llama_index.llms.openai import OpenAI

llm = OpenAI(temperature=0, model="gpt-3.5-turbo")

Download Data

python
!mkdir -p 'data/10k/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/10k/lyft_2021.pdf' -O 'data/10k/lyft_2021.pdf'

Load document and build index

python
# Load the PDF; the page number of each loaded page is kept in its metadata.
reader = SimpleDirectoryReader(input_files=["./data/10k/lyft_2021.pdf"])
data = reader.load_data()
python
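# Split the pages into chunks, embed them, and store them in a vector index.
index = VectorStoreIndex.from_documents(data)
python
# Stream the answer from the LLM configured above, using the 3 most similar chunks.
query_engine = index.as_query_engine(llm=llm, streaming=True, similarity_top_k=3)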

Stream response with page citation

python
response = query_engine.query(
    "What was the impact of COVID? Show statements in bullet form and show"
    " page reference after each statement."
)
response.print_response_stream()
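The page references are requested only through the prompt, so their format in the streamed text is not guaranteed. The following is a minimal sketch, assuming the answer writes citations such as "(Page 6)" or "page 12". `collect_stream` and `extract_cited_pages` are hypothetical helpers, and the sample tokens are made up. With a real response, you could pass `response.response_gen` to `collect_stream` instead of calling `print_response_stream()`, since the stream can only be consumed once.

```python
import re
from typing import Iterable, List


def collect_stream(tokens: Iterable[str]) -> str:
    """Print tokens as they arrive and return the full text."""
    parts = []
    for token in tokens:
        print(token, end="", flush=True)
        parts.append(token)
    print()
    return "".join(parts)


# Assumed citation style: "page 6", "Page 6", or "pages 6" (case-insensitive).
PAGE_PATTERN = re.compile(r"\bpages?\s+(\d+)", re.IGNORECASE)


def extract_cited_pages(text: str) -> List[int]:
    """Return the distinct page numbers cited in the text, in ascending order."""
    return sorted({int(number) for number in PAGE_PATTERN.findall(text)})


# Made-up tokens standing in for a streamed answer.
sample_tokens = [
    "- First point ",
    "(Page 6)\n",
    "- Second point ",
    "(page 12)\n",
    "- Third point (Page 6)",
]
answer = collect_stream(sample_tokens)
cited = extract_cited_pages(answer)
print(cited)
```

If the model uses a different citation style, such as "p. 6" or "[6]", the pattern needs to be adjusted to match.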

Inspect source nodes

python
for node in response.source_nodes:
    print("-----")
    text_fmt = node.node.get_content().strip().replace("\n", " ")[:1000]
    print(f"Text:\t {text_fmt} ...")
    print(f"Metadata:\t {node.node.metadata}")
    print(f"Score:\t {node.score:.3f}")
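One way to sanity-check the citations is to compare them with the pages that were actually retrieved. The sketch below assumes the page number is stored under a `page_label` metadata key; confirm this against the metadata printed above. `pages_from_metadata` and `unsupported_citations` are hypothetical helpers, and the sample metadata is made up. With a real response, you could pass `[n.node.metadata for n in response.source_nodes]`.

```python
from typing import Dict, Iterable, List


def pages_from_metadata(
    metadatas: Iterable[Dict], key: str = "page_label"
) -> List[str]:
    """Return distinct page labels in first-seen order, skipping missing keys."""
    seen: List[str] = []
    for metadata in metadatas:
        label = metadata.get(key)
        if label is not None and str(label) not in seen:
            seen.append(str(label))
    return seen


def unsupported_citations(
    cited: Iterable[str], retrieved: Iterable[str]
) -> List[str]:
    """Return cited page labels that do not appear among the retrieved pages."""
    retrieved_set = set(retrieved)
    return [page for page in cited if page not in retrieved_set]


# Made-up metadata standing in for the source nodes' metadata dicts.
sample_metadata = [
    {"page_label": "6", "file_name": "lyft_2021.pdf"},
    {"page_label": "17", "file_name": "lyft_2021.pdf"},
    {"page_label": "6", "file_name": "lyft_2021.pdf"},
    {"file_name": "lyft_2021.pdf"},
]
retrieved_pages = pages_from_metadata(sample_metadata)
print(retrieved_pages)
print(unsupported_citations(["6", "42"], retrieved_pages))
```

Page labels are kept as strings here because PDF page labels are not always plain integers. A cited page that is missing from the retrieved set suggests the model may have invented the reference.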