Semantic Double Merging and Chunking

Download spacy

python

!pip install spacy

Download required spacy model

python

!python3 -m spacy download en_core_web_md

Download sample data:

python

!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'pg_essay.txt'

python

from llama_index.core.node_parser import (
    SemanticDoubleMergingSplitterNodeParser,
    LanguageConfig,
)
from llama_index.core import SimpleDirectoryReader

Load document and create sample splitter:

python

documents = SimpleDirectoryReader(input_files=["pg_essay.txt"]).load_data()

config = LanguageConfig(language="english", spacy_model="en_core_web_md")
splitter = SemanticDoubleMergingSplitterNodeParser(
    language_config=config,
    initial_threshold=0.4,
    appending_threshold=0.5,
    merging_threshold=0.5,
    max_chunk_size=5000,
)

Get the nodes:

python

nodes = splitter.get_nodes_from_documents(documents)

Sample nodes:

python

print(nodes[0].get_content())

python

print(nodes[5].get_content())

Remember that different spaCy models and various parameter values can perform differently on specific texts. A text that clearly changes its subject matter should have lower threshold values to easily detect these changes. Conversely, a text with a very uniform subject matter should have high threshold values to help split the text into a greater number of chunks. For more information and comparison with different chunking methods check https://bitpeak.pl/chunking-methods-in-rag-methods-comparison/