Semantic Chunker

Open this notebook in Colab: https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/node_parsers/semantic_chunking.ipynb

"Semantic chunking" is a new concept proposed Greg Kamradt in his video tutorial on 5 levels of embedding chunking: https://youtu.be/8OJC21T2SL4?t=1933.

Instead of chunking text with a fixed chunk size, the semantic splitter adaptively picks breakpoints between sentences using embedding similarity. This ensures that a "chunk" contains sentences that are semantically related to each other.
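At a high level, the splitter embeds each sentence (or small group of sentences), measures how dissimilar each group is from the next, and starts a new chunk wherever that dissimilarity exceeds a chosen percentile of all adjacent distances. The following is only a rough sketch of the idea, not the library's implementation; embed() is a hypothetical function that returns an embedding vector for a sentence.

python
# Illustrative sketch only -- not the library's implementation.
# `embed(sentence)` is a hypothetical function returning an embedding vector.
import numpy as np


def cosine_distance(a, b):
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def semantic_breakpoints(sentences, percentile=95):
    vectors = [np.asarray(embed(s)) for s in sentences]
    # dissimilarity between each pair of adjacent sentences
    distances = [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
    threshold = np.percentile(distances, percentile)
    # start a new chunk after sentence i wherever the distance spikes
    return [i for i, d in enumerate(distances) if d > threshold]

In the actual SemanticSplitterNodeParser, the buffer_size parameter controls how many neighboring sentences are grouped together when computing each embedding, which smooths out noise from very short sentences.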

We adapted it into a LlamaIndex module.

Check out our notebook below!

Caveats:

  • The regex primarily works for English sentences.
  • You may have to tune the breakpoint percentile threshold; a lower value produces more, smaller chunks (see the short tuning example after the splitter is defined below).

Setup Data

python
%pip install llama-index-embeddings-openai
python
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'pg_essay.txt'
python
from llama_index.core import SimpleDirectoryReader

# load documents
documents = SimpleDirectoryReader(input_files=["pg_essay.txt"]).load_data()
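
Optionally, confirm that the essay loaded as a single document:

python
# optional sanity check: the essay should load as a single document
print(len(documents))
print(documents[0].text[:200])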

Define Semantic Splitter

python
from llama_index.core.node_parser import (
    SentenceSplitter,
    SemanticSplitterNodeParser,
)
from llama_index.embeddings.openai import OpenAIEmbedding

import os

os.environ["OPENAI_API_KEY"] = "sk-..."
python
embed_model = OpenAIEmbedding()
splitter = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model
)

# baseline splitter with a fixed chunk size, for comparison
base_splitter = SentenceSplitter(chunk_size=512)
python
nodes = splitter.get_nodes_from_documents(documents)
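
If the resulting chunks are too coarse or too fine for your data, adjust breakpoint_percentile_threshold: a lower value flags more breakpoints and produces more, smaller chunks. For example (this re-embeds the document, so it makes additional embedding calls):

python
# a lower percentile threshold generally yields more (smaller) chunks
splitter_90 = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=90, embed_model=embed_model
)
nodes_90 = splitter_90.get_nodes_from_documents(documents)
print(len(nodes), len(nodes_90))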

Inspecting the Chunks

Let's take a look at the chunks produced by the semantic splitter.

Chunk 1: IBM 1401

python
print(nodes[1].get_content())

Chunk 2: Personal Computer + College

python
print(nodes[2].get_content())

Chunk 3: Finishing up College + Grad School

python
print(nodes[3].get_content())

Compare against Baseline

In contrast, let's compare against the baseline splitter with a fixed chunk size.

python
base_nodes = base_splitter.get_nodes_from_documents(documents)
python
print(base_nodes[2].get_content())
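
To make the comparison concrete, you can also check how many chunks each splitter produced and how much their sizes vary:

python
# compare chunk counts and character-length ranges of the two splitters
semantic_sizes = [len(n.get_content()) for n in nodes]
base_sizes = [len(n.get_content()) for n in base_nodes]
print("semantic:", len(nodes), "chunks,", min(semantic_sizes), "-", max(semantic_sizes), "chars")
print("baseline:", len(base_nodes), "chunks,", min(base_sizes), "-", max(base_sizes), "chars")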

Setup Query Engine

python
from llama_index.core import VectorStoreIndex
from llama_index.core.response.notebook_utils import display_source_node
python
vector_index = VectorStoreIndex(nodes)
query_engine = vector_index.as_query_engine()
python
base_vector_index = VectorStoreIndex(base_nodes)
base_query_engine = base_vector_index.as_query_engine()
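
You can also inspect raw retrieval (without LLM synthesis) through a retriever; the query string below is just an illustration:

python
# illustrative: see which chunks the semantic index retrieves for a query
retriever = vector_index.as_retriever(similarity_top_k=2)
for n in retriever.retrieve("What did the author work on before college?"):
    display_source_node(n, source_length=500)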

Run some Queries

python
response = query_engine.query(
    "Tell me about the author's programming journey through childhood to college"
)
python
print(str(response))
python
for n in response.source_nodes:
    display_source_node(n, source_length=20000)
python
base_response = base_query_engine.query(
    "Tell me about the author's programming journey through childhood to college"
)
python
print(str(base_response))
python
for n in base_response.source_nodes:
    display_source_node(n, source_length=20000)
python
response = query_engine.query("Tell me about the author's experience in YC")
python
print(str(response))
python
base_response = base_query_engine.query(
    "Tell me about the author's experience in YC"
)
python
print(str(base_response))