docs/examples/metadata_extraction/EntityExtractionClimate.ipynb
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/metadata_extraction/EntityExtractionClimate.ipynb" target="_parent"></a>
In this demo, we use the new EntityExtractor to extract entities from each node, stored in metadata. The default model is tomaarsen/span-marker-mbert-base-multinerd, which is downloaded an run locally from HuggingFace.
For more information on metadata extraction in LlamaIndex, see our documentation.
If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-llms-openai
%pip install llama-index-extractors-entity
!pip install llama-index
# Needed to run the entity extractor
# !pip install span_marker
import os
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"
from llama_index.extractors.entity import EntityExtractor
from llama_index.core.node_parser import SentenceSplitter
entity_extractor = EntityExtractor(
prediction_threshold=0.5,
label_entities=False, # include the entity label in the metadata (can be erroneous)
device="cpu", # set to "cuda" if you have a GPU
)
node_parser = SentenceSplitter()
transformations = [node_parser, entity_extractor]
Here, we will download the 2023 IPPC Climate Report - Chapter 3 on Oceans and Coastal Ecosystems (172 Pages)
!curl https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf --output IPCC_AR6_WGII_Chapter03.pdf
Next, load the documents.
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader(
input_files=["./IPCC_AR6_WGII_Chapter03.pdf"]
).load_data()
Now, this is a pretty long document. Since we are not running on CPU, for now, we will only run on a subset of documents. Feel free to run it on all documents on your own though!
from llama_index.core.ingestion import IngestionPipeline
import random
random.seed(42)
# comment out to run on all documents
# 100 documents takes about 5 minutes on CPU
documents = random.sample(documents, 100)
pipeline = IngestionPipeline(transformations=transformations)
nodes = pipeline.run(documents=documents)
samples = random.sample(nodes, 5)
for node in samples:
print(node.metadata)
from llama_index.core import VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings
Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0.2)
index = VectorStoreIndex(nodes=nodes)
query_engine = index.as_query_engine()
response = query_engine.query("What is said by Fox-Kemper?")
print(response)
Here, we re-construct the index, but without metadata
for node in nodes:
node.metadata.pop("entities", None)
print(nodes[0].metadata)
index = VectorStoreIndex(nodes=nodes)
query_engine = index.as_query_engine()
response = query_engine.query("What is said by Fox-Kemper?")
print(response)
As we can see, our metadata-enriched index is able to fetch more relevant information.