docs/examples/node_parsers/semantic_chunking.ipynb
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/node_parsers/semantic_chunking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
"Semantic chunking" is a new concept proposed Greg Kamradt in his video tutorial on 5 levels of embedding chunking: https://youtu.be/8OJC21T2SL4?t=1933.
Instead of chunking text with a fixed chunk size, the semantic splitter adaptively picks the breakpoint in-between sentences using embedding similarity. This ensures that a "chunk" contains sentences that are semantically related to each other.
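To make the mechanism concrete, here is a minimal sketch of the breakpoint logic. This is illustrative only, not the actual LlamaIndex implementation; the function name and arguments are made up for this example.
import numpy as np

def semantic_breakpoints(sentence_embeddings, percentile=95.0):
    # Cosine similarity between each sentence embedding and the next one.
    a, b = sentence_embeddings[:-1], sentence_embeddings[1:]
    sims = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    # A high cosine distance between neighbors suggests a topic shift.
    dists = 1.0 - sims
    # Split wherever the distance exceeds the chosen percentile of all distances.
    threshold = np.percentile(dists, percentile)
    # Return the indices of sentences that should start a new chunk.
    return [i + 1 for i, d in enumerate(dists) if d > threshold]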
We adapted it into a LlamaIndex module.
Check out our notebook below!
Caveats:

- The regex used for sentence splitting primarily works for English sentences.
- You may have to tune the breakpoint percentile threshold for your data.
%pip install llama-index-embeddings-openai
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'pg_essay.txt'
from llama_index.core import SimpleDirectoryReader
# load documents
documents = SimpleDirectoryReader(input_files=["pg_essay.txt"]).load_data()
from llama_index.core.node_parser import (
    SentenceSplitter,
    SemanticSplitterNodeParser,
)
from llama_index.embeddings.openai import OpenAIEmbedding
import os
os.environ["OPENAI_API_KEY"] = "sk-..."
embed_model = OpenAIEmbedding()
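By default, OpenAIEmbedding() uses the library's default OpenAI embedding model. If you want to pin a specific model instead, recent llama-index versions accept a model argument (check what your installed version supports):
# Optional: pin a specific embedding model instead of the default
# (the model argument is available in recent llama-index versions).
# embed_model = OpenAIEmbedding(model="text-embedding-3-small")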
splitter = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model
)
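Here buffer_size controls how many sentences are grouped together when each embedding is computed (1 means every sentence is embedded on its own), and breakpoint_percentile_threshold sets how dissimilar adjacent groups must be before a split is made. A lower threshold should yield more breakpoints and thus smaller chunks; an illustrative variant (not used in the rest of this notebook):
# Illustrative only: a lower percentile threshold produces more
# breakpoints, and therefore more (smaller) chunks.
finer_splitter = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=80, embed_model=embed_model
)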
# baseline splitter with a fixed chunk size, for comparison
base_splitter = SentenceSplitter(chunk_size=512)
nodes = splitter.get_nodes_from_documents(documents)
Let's take a look at the chunks produced by the semantic splitter.
print(nodes[1].get_content())
print(nodes[2].get_content())
print(nodes[3].get_content())
In contrast, let's compare against the baseline with a fixed chunk size.
base_nodes = base_splitter.get_nodes_from_documents(documents)
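Because the semantic splitter has no fixed chunk size, it typically produces a different number of chunks than the baseline. A quick sanity check:
# Compare how many chunks each strategy produced (counts will vary
# with the embedding model and threshold).
print(f"semantic chunks: {len(nodes)}, baseline chunks: {len(base_nodes)}")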
print(base_nodes[2].get_content())
from llama_index.core import VectorStoreIndex
from llama_index.core.response.notebook_utils import display_source_node
vector_index = VectorStoreIndex(nodes)
query_engine = vector_index.as_query_engine()
base_vector_index = VectorStoreIndex(base_nodes)
base_query_engine = base_vector_index.as_query_engine()
response = query_engine.query(
    "Tell me about the author's programming journey through childhood to college"
)
print(str(response))
for n in response.source_nodes:
    display_source_node(n, source_length=20000)
base_response = base_query_engine.query(
    "Tell me about the author's programming journey through childhood to college"
)
print(str(base_response))
for n in base_response.source_nodes:
    display_source_node(n, source_length=20000)
response = query_engine.query("Tell me about the author's experience in YC")
print(str(response))
base_response = base_query_engine.query(
    "Tell me about the author's experience in YC"
)
print(str(base_response))