docs/examples/property_graph/Dynamic_KG_Extraction.ipynb
In this notebook, we'll compare three different LLM Path Extractors from llama_index:
1. SimpleLLMPathExtractor
2. DynamicLLMPathExtractor
3. SchemaLLMPathExtractor
We'll use a Wikipedia page as our test data and visualize the resulting knowledge graphs using Pyvis.
!pip install llama_index pyvis wikipedia
from llama_index.core import Document, PropertyGraphIndex
from llama_index.core.indices.property_graph import (
    SimpleLLMPathExtractor,
    SchemaLLMPathExtractor,
    DynamicLLMPathExtractor,
)
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings
import wikipedia
import os
import nest_asyncio
nest_asyncio.apply()
os.environ["OPENAI_API_KEY"] = "sk-proj-..."
# Set up global configurations
llm = OpenAI(temperature=0.0, model="gpt-3.5-turbo")
Settings.llm = llm
Settings.chunk_size = 2048
Settings.chunk_overlap = 20
def get_wikipedia_content(title):
    try:
        page = wikipedia.page(title)
        return page.content
    except wikipedia.exceptions.DisambiguationError as e:
        print(f"Disambiguation page. Options: {e.options}")
    except wikipedia.exceptions.PageError:
        print(f"Page '{title}' does not exist.")
    return None
wiki_title = "Barack Obama"
content = get_wikipedia_content(wiki_title)
if content:
    document = Document(text=content, metadata={"title": wiki_title})
    print(
        f"Fetched content for '{wiki_title}' (length: {len(content)} characters)"
    )
else:
    print("Failed to fetch Wikipedia content.")
kg_extractor = SimpleLLMPathExtractor(
    llm=llm, max_paths_per_chunk=20, num_workers=4
)
simple_index = PropertyGraphIndex.from_documents(
    [document],
    llm=llm,
    embed_kg_nodes=False,
    kg_extractors=[kg_extractor],
    show_progress=True,
)
simple_index.property_graph_store.save_networkx_graph(
    name="./SimpleGraph.html"
)
simple_index.property_graph_store.get_triplets(
    entity_names=["Barack Obama", "Obama"]
)[:5]
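Notice that the same person can surface under several entity names ("Barack Obama", "Obama"), which is why the query above asks for both. Merging such aliases after extraction is a common cleanup step. Below is a minimal, hypothetical sketch that operates on plain `(subject, relation, object)` string triplets; the graph store actually returns richer node and relation objects, so you would first reduce those to names:

```python
def merge_aliases(triplets, aliases):
    """Rewrite alias entity names to a canonical name in string triplets.

    `aliases` maps alias -> canonical name, e.g. {"Obama": "Barack Obama"}.
    """
    canon = lambda name: aliases.get(name, name)
    # Deduplicate triplets that become identical after merging
    merged = {(canon(s), r, canon(o)) for s, r, o in triplets}
    return sorted(merged)


triplets = [
    ("Obama", "PRESIDENT_OF", "United States"),
    ("Barack Obama", "PRESIDENT_OF", "United States"),
    ("Barack Obama", "MEMBER_OF", "Democratic Party"),
]
print(merge_aliases(triplets, {"Obama": "Barack Obama"}))
# The first two triplets collapse into one after merging
```

Building the alias map itself (e.g. via fuzzy matching or another LLM pass) is left out here; this only shows the mechanical merge.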
Here, we let the LLM define the ontology on the fly, giving it full freedom to label the nodes as it best sees fit.
kg_extractor = DynamicLLMPathExtractor(
    llm=llm,
    max_triplets_per_chunk=20,
    num_workers=4,
    # Let the LLM infer entities and their labels (types) on the fly
    allowed_entity_types=None,
    # Let the LLM infer relationships on the fly
    allowed_relation_types=None,
    # LLM will generate any relation properties; set to `None` to skip property generation (faster without)
    allowed_relation_props=[],
    # LLM will generate any entity properties; set to `None` to skip property generation (faster without)
    allowed_entity_props=[],
)
dynamic_index = PropertyGraphIndex.from_documents(
    [document],
    llm=llm,
    embed_kg_nodes=False,
    kg_extractors=[kg_extractor],
    show_progress=True,
)
dynamic_index.property_graph_store.save_networkx_graph(
    name="./DynamicGraph.html"
)
dynamic_index.property_graph_store.get_triplets(
    entity_names=["Barack Obama", "Obama"]
)[:5]
Here, we have partial knowledge of what we want to detect: we know the article is about Barack Obama, so we define some entity and relation types that can guide the LLM as it labels the entities and relations it detects. This doesn't guarantee that the LLM will use them; it simply steers the labeling and gives the LLM some ideas, and it remains free to ignore the types we provide.
kg_extractor = DynamicLLMPathExtractor(
    llm=llm,
    max_triplets_per_chunk=20,
    num_workers=4,
    allowed_entity_types=["POLITICIAN", "POLITICAL_PARTY"],
    allowed_relation_types=["PRESIDENT_OF", "MEMBER_OF"],
    allowed_relation_props=["description"],
    allowed_entity_props=["description"],
)
dynamic_index_2 = PropertyGraphIndex.from_documents(
    [document],
    llm=llm,
    embed_kg_nodes=False,
    kg_extractors=[kg_extractor],
    show_progress=True,
)
dynamic_index_2.property_graph_store.save_networkx_graph(
    name="./DynamicGraph_2.html"
)
dynamic_index_2.property_graph_store.get_triplets(
    entity_names=["Barack Obama", "Obama"]
)[:5]
kg_extractor = SchemaLLMPathExtractor(
    llm=llm,
    max_triplets_per_chunk=20,
    strict=False,  # Set to False to showcase why it's not going to be the same as DynamicLLMPathExtractor
    possible_entities=None,  # Use default entity types (PERSON, ORGANIZATION, etc.)
    possible_relations=None,  # Use default relation types
    possible_relation_props=[
        "extra_description"
    ],  # Set to `None` to skip property generation
    possible_entity_props=[
        "extra_description"
    ],  # Set to `None` to skip property generation
    num_workers=4,
)
schema_index = PropertyGraphIndex.from_documents(
    [document],
    llm=llm,
    embed_kg_nodes=False,
    kg_extractors=[kg_extractor],
    show_progress=True,
)
schema_index.property_graph_store.save_networkx_graph(
    name="./SchemaGraph.html"
)
schema_index.property_graph_store.get_triplets(
    entity_names=["Barack Obama", "Obama"]
)[:5]
Let's compare the results of the three extractors:
SimpleLLMPathExtractor: This extractor creates a basic knowledge graph without any predefined schema. It may produce a larger number of diverse relationships but can lack consistency in entity and relation naming.
DynamicLLMPathExtractor: This extractor sits between the other two. The LLM infers entity and relation labels on the fly, optionally seeded with initial types, which tends to yield a graph that is both expansive and more consistently labeled than the simple extractor's output.
SchemaLLMPathExtractor: With a predefined schema, this extractor produces a more structured graph. The entities and relations are limited to those specified in the schema, which can lead to a more consistent but potentially less comprehensive graph. Even with "strict" set to False, the extracted graph doesn't show the LLM trying to find new entity and relation types that fall outside the input schema's scope.
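To make this comparison more quantitative than eyeballing the Pyvis HTML files, you can compute simple statistics over each graph's triplets. A minimal sketch, assuming the triplets have been reduced to `(subject, relation, object)` strings (the `get_triplets` calls above return node and relation objects you would map to names and labels first):

```python
from collections import Counter


def graph_stats(triplets):
    """Basic size/diversity statistics for a list of string triplets."""
    entities = {s for s, _, _ in triplets} | {o for _, _, o in triplets}
    relation_types = Counter(rel for _, rel, _ in triplets)
    return {
        "triplets": len(triplets),
        "entities": len(entities),
        "relation_types": len(relation_types),
    }


sample = [
    ("Barack Obama", "PRESIDENT_OF", "United States"),
    ("Barack Obama", "MEMBER_OF", "Democratic Party"),
    ("Joe Biden", "MEMBER_OF", "Democratic Party"),
]
print(graph_stats(sample))
# → {'triplets': 3, 'entities': 4, 'relation_types': 2}
```

A low ratio of relation types to triplets suggests more consistent labeling (what the schema extractor optimizes for), while a high ratio suggests the freer naming typical of the simple extractor.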