Automated Metadata Extraction for Better Retrieval + Synthesis

<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/metadata_extraction/MetadataExtraction_LLMSurvey.ipynb" target="_parent"></a>

In this tutorial, we show you how to perform automated metadata extraction for better retrieval results. We use two extractors: a QuestionsAnsweredExtractor, which generates question/answer pairs from a piece of text, and a SummaryExtractor, which extracts summaries not only of the current text but also of adjacent texts.

We show that this allows for "chunk dreaming" - each individual chunk can have more "holistic" details, leading to higher answer quality given retrieved results.
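
To make the idea concrete, here is a rough sketch (illustrative values, not actual output from this notebook) of the kind of metadata these extractors attach to each node; the exact key names come from the extractors and may vary slightly by LlamaIndex version:

python
# illustrative only -- the values below are made up for exposition
example_metadata = {
    "prev_section_summary": "Summary of the preceding chunk ...",
    "section_summary": "Summary of this chunk ...",
    "next_section_summary": "Summary of the following chunk ...",
    "questions_this_excerpt_can_answer": "1. ...\n2. ...\n3. ...",
}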

Our data source is taken from Eugene Yan's popular article on LLM Patterns: https://eugeneyan.com/writing/llm-patterns/

Setup

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

python
%pip install llama-index-llms-openai
%pip install llama-index-readers-web
python
!pip install llama-index
python
import nest_asyncio

nest_asyncio.apply()

import os
import openai
python
# OPTIONAL: setup W&B callback handling for tracing
from llama_index.core import set_global_handler

set_global_handler("wandb", run_args={"project": "llamaindex"})
python
os.environ["OPENAI_API_KEY"] = "sk-..."
openai.api_key = os.environ["OPENAI_API_KEY"]
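
If you prefer not to hard-code the key, one option (a small sketch, not required for this notebook) is to read it from the environment or prompt for it interactively:

python
# optional alternative (sketch): avoid hard-coding the API key
from getpass import getpass

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")
openai.api_key = os.environ["OPENAI_API_KEY"]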

Define Metadata Extractors

Here we define the metadata extractors. We define two variants:

  • extractors_1 only contains the QuestionsAnsweredExtractor
  • extractors_2 contains both the QuestionsAnsweredExtractor as well as the SummaryExtractor
python
from llama_index.llms.openai import OpenAI
from llama_index.core.schema import MetadataMode
python
llm = OpenAI(temperature=0.1, model="gpt-3.5-turbo", max_tokens=512)

We also show how to instantiate the SummaryExtractor and QuestionsAnsweredExtractor.

python
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.core.extractors import (
    SummaryExtractor,
    QuestionsAnsweredExtractor,
)

node_parser = TokenTextSplitter(
    separator=" ", chunk_size=256, chunk_overlap=128
)


extractors_1 = [
    QuestionsAnsweredExtractor(
        questions=3, llm=llm, metadata_mode=MetadataMode.EMBED
    ),
]

extractors_2 = [
    SummaryExtractor(summaries=["prev", "self", "next"], llm=llm),
    QuestionsAnsweredExtractor(
        questions=3, llm=llm, metadata_mode=MetadataMode.EMBED
    ),
]

Load in Data, Run Extractors

We load in Eugene's essay (https://eugeneyan.com/writing/llm-patterns/) using our LlamaHub SimpleWebPageReader.

We then run our extractors.

python
from llama_index.core import SimpleDirectoryReader
python
# load in blog

from llama_index.readers.web import SimpleWebPageReader

reader = SimpleWebPageReader(html_to_text=True)
docs = reader.load_data(urls=["https://eugeneyan.com/writing/llm-patterns/"])
python
print(docs[0].get_content())
python
orig_nodes = node_parser.get_nodes_from_documents(docs)
python
# take a slice of 8 nodes for testing
nodes = orig_nodes[20:28]
python
print(nodes[3].get_content(metadata_mode="all"))

Run metadata extractors

python
from llama_index.core.ingestion import IngestionPipeline

# process nodes with metadata extractors
pipeline = IngestionPipeline(transformations=[node_parser, *extractors_1])

nodes_1 = pipeline.run(nodes=nodes, in_place=False, show_progress=True)
python
print(nodes_1[3].get_content(metadata_mode="all"))
python
# 2nd pass: run the summary extractor as well as the questions-answered extractor

# process nodes with metadata extractor
pipeline = IngestionPipeline(transformations=[node_parser, *extractors_2])

nodes_2 = pipeline.run(nodes=nodes, in_place=False, show_progress=True)

Visualize some sample data

python
print(nodes_2[3].get_content(metadata_mode="all"))
python
print(nodes_2[1].get_content(metadata_mode="all"))

Setup RAG Query Engines, Compare Results!

We set up three indexes/query engines on top of the three node variants.

python
from llama_index.core import VectorStoreIndex
from llama_index.core.response.notebook_utils import (
    display_source_node,
    display_response,
)
python
# try out different query engines

# index0 = VectorStoreIndex(orig_nodes)
# index1 = VectorStoreIndex(nodes_1 + orig_nodes[8:])
# index2 = VectorStoreIndex(nodes_2 + orig_nodes[8:])

index0 = VectorStoreIndex(orig_nodes)
index1 = VectorStoreIndex(orig_nodes[:20] + nodes_1 + orig_nodes[28:])
index2 = VectorStoreIndex(orig_nodes[:20] + nodes_2 + orig_nodes[28:])
python
query_engine0 = index0.as_query_engine(similarity_top_k=1)
query_engine1 = index1.as_query_engine(similarity_top_k=1)
query_engine2 = index2.as_query_engine(similarity_top_k=1)

Try out some questions

In this question, we see that the naive response (response0) only mentions BLEU and ROUGE, and lacks context about other metrics.

response2 on the other hand has all metrics within its context.

python
# query_str = "In the original RAG paper, can you describe the two main approaches for generation and compare them?"
query_str = (
    "Can you describe metrics for evaluating text generation quality, compare"
    " them, and tell me about their downsides"
)

response0 = query_engine0.query(query_str)
response1 = query_engine1.query(query_str)
response2 = query_engine2.query(query_str)
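
To see which chunk each index retrieves before response synthesis, you can also query the retrievers directly (an optional check; this is a small sketch using the standard as_retriever API):

python
# optional: inspect retrieval only, before any synthesis, for each index
for name, index in [("index0", index0), ("index1", index1), ("index2", index2)]:
    retrieved = index.as_retriever(similarity_top_k=1).retrieve(query_str)
    print(name, retrieved[0].score, list(retrieved[0].node.metadata.keys()))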
python
display_response(
    response0, source_length=1000, show_source=True, show_source_metadata=True
)
python
print(response0.source_nodes[0].node.get_content())
python
display_response(
    response1, source_length=1000, show_source=True, show_source_metadata=True
)
python
display_response(
    response2, source_length=1000, show_source=True, show_source_metadata=True
)

In this next question, we ask about BERTScore/MoverScore.

The responses are similar. But response2 gives slightly more detail than response0, since it has more information about MoverScore contained in its metadata.

python
# query_str = "What are some reproducibility issues with the ROUGE metric? Give some details related to benchmarks and also describe other ROUGE issues. "
query_str = (
    "Can you give a high-level overview of BERTScore/MoverScore + formulas if"
    " available?"
)

response0 = query_engine0.query(query_str)
response1 = query_engine1.query(query_str)
response2 = query_engine2.query(query_str)
python
display_response(
    response0, source_length=1000, show_source=True, show_source_metadata=True
)
python
display_response(
    response1, source_length=1000, show_source=True, show_source_metadata=True
)
python
display_response(
    response2, source_length=1000, show_source=True, show_source_metadata=True
)
python
response1.source_nodes[0].node.metadata
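
As a final sanity check (an optional sketch), you can compare which metadata keys each node variant carries; the extra summary/question fields are what the extractor pipelines added:

python
# optional: compare metadata keys across the three node variants
for label, ns in [("orig", orig_nodes[20:28]), ("nodes_1", nodes_1), ("nodes_2", nodes_2)]:
    print(label, sorted(ns[3].metadata.keys()))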