docs/examples/node_parsers/semantic_chunking.ipynb
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/node_parsers/semantic_chunking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
"Semantic chunking" is a new concept proposed Greg Kamradt in his video tutorial on 5 levels of embedding chunking: https://youtu.be/8OJC21T2SL4?t=1933.
Instead of chunking text with a fixed chunk size, the semantic splitter adaptively picks the breakpoint in-between sentences using embedding similarity. This ensures that a "chunk" contains sentences that are semantically related to each other.
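To make the mechanism concrete, here is a minimal sketch of the breakpoint logic. This is illustrative only, not the actual LlamaIndex implementation; the function name and arguments are made up for this example.
import numpy as np

def semantic_breakpoints(sentence_embeddings, percentile=95.0):
    # Cosine similarity between each sentence embedding and the next one.
    a, b = sentence_embeddings[:-1], sentence_embeddings[1:]
    sims = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    # A high cosine distance between neighbors suggests a topic shift.
    dists = 1.0 - sims
    # Split wherever the distance exceeds the chosen percentile of all distances.
    threshold = np.percentile(dists, percentile)
    # Return the indices of sentences that should start a new chunk.
    return [i + 1 for i, d in enumerate(dists) if d > threshold]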
We adapted it into a LlamaIndex module.
Check out our notebook below!
Caveats:

- The regex used for sentence splitting primarily works for English sentences.
- You may have to tune the breakpoint percentile threshold for your data.
%pip install llama-index-embeddings-openai
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'pg_essay.txt'
from llama_index.core import SimpleDirectoryReader
# load documents
documents = SimpleDirectoryReader(input_files=["pg_essay.txt"]).load_data()
from llama_index.core.node_parser import (
    SentenceSplitter,
    SemanticSplitterNodeParser,
)
from llama_index.embeddings.openai import OpenAIEmbedding
import os
os.environ["OPENAI_API_KEY"] = "sk-..."
embed_model = OpenAIEmbedding()
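By default, OpenAIEmbedding() uses the library's default OpenAI embedding model. If you want to pin a specific model instead, recent llama-index versions accept a model argument (check what your installed version supports):
# Optional: pin a specific embedding model instead of the default
# (the model argument is available in recent llama-index versions).
# embed_model = OpenAIEmbedding(model="text-embedding-3-small")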
splitter = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model
)
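Here buffer_size controls how many sentences are grouped together when each embedding is computed (1 means every sentence is embedded on its own), and breakpoint_percentile_threshold sets how dissimilar adjacent groups must be before a split is made. A lower threshold should yield more breakpoints and thus smaller chunks; an illustrative variant (not used in the rest of this notebook):
# Illustrative only: a lower percentile threshold produces more
# breakpoints, and therefore more (smaller) chunks.
finer_splitter = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=80, embed_model=embed_model
)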
# baseline splitter with a fixed chunk size, for comparison
base_splitter = SentenceSplitter(chunk_size=512)
nodes = splitter.get_nodes_from_documents(documents)
Let's take a look at the chunks produced by the semantic splitter.
print(nodes[1].get_content())
print(nodes[2].get_content())
print(nodes[3].get_content())
In contrast, let's compare against the baseline with a fixed chunk size.
base_nodes = base_splitter.get_nodes_from_documents(documents)
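Because the semantic splitter has no fixed chunk size, it typically produces a different number of chunks than the baseline. A quick sanity check:
# Compare how many chunks each strategy produced (counts will vary
# with the embedding model and threshold).
print(f"semantic chunks: {len(nodes)}, baseline chunks: {len(base_nodes)}")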
print(base_nodes[2].get_content())
from llama_index.core import VectorStoreIndex
from llama_index.core.response.notebook_utils import display_source_node
vector_index = VectorStoreIndex(nodes)
query_engine = vector_index.as_query_engine()
base_vector_index = VectorStoreIndex(base_nodes)
base_query_engine = base_vector_index.as_query_engine()
response = query_engine.query(
    "Tell me about the author's programming journey through childhood to college"
)
print(str(response))
for n in response.source_nodes:
    display_source_node(n, source_length=20000)
base_response = base_query_engine.query(
    "Tell me about the author's programming journey through childhood to college"
)
print(str(base_response))
for n in base_response.source_nodes:
    display_source_node(n, source_length=20000)
response = query_engine.query("Tell me about the author's experience in YC")
print(str(response))
base_response = base_query_engine.query(
    "Tell me about the author's experience in YC"
)
print(str(base_response))