# LlamaIndex Node Parser Integration: Chonkie
This package provides an integration between LlamaIndex and Chonkie, a powerful and flexible chunking library.
## Installation

```bash
pip install llama-index-node-parser-chonkie
```
## Quick Start

```python
from llama_index.core import Document
from llama_index.node_parser.chonkie import Chunker

# Create a chunker (defaults to the 'recursive' strategy)
chunker = Chunker(chunk_size=512)

# Create a document
doc = Document(text="Your long text here...")

# Get nodes
nodes = chunker.get_nodes_from_documents([doc])
```
## Chunking Strategies

The `Chunker` acts as a wrapper around the various Chonkie chunking strategies. You can select a strategy with the `chunker` parameter (see the example below the table):
| `chunker`   | Description                                                          |
| ----------- | -------------------------------------------------------------------- |
| `recursive` | (Default) Recursively splits text based on a hierarchy of separators. |
| `sentence`  | Splits text into sentences.                                          |
| `token`     | Splits text into chunks based on token counts.                       |
| `word`      | Splits text based on word counts.                                    |
| `semantic`  | Splits text based on semantic similarity.                            |
| `late`      | Late chunking strategy.                                              |
| `neural`    | Neural-based chunking.                                               |
| `code`      | Optimized for source code.                                           |
| `fast`      | High-performance basic chunking.                                     |
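
For example, a minimal sketch that selects the `sentence` strategy instead of the default (the alias comes from the table above; `chunk_size` follows the earlier examples):

```python
from llama_index.node_parser.chonkie import Chunker

# Select the 'sentence' strategy instead of the default 'recursive'
chunker = Chunker(chunker="sentence", chunk_size=512)
```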
Run the following code to see the full list of valid aliases:

```python
from llama_index.node_parser.chonkie import Chunker

print(Chunker.valid_chunkers)
```
You can pass any keyword arguments accepted by the underlying Chonkie chunker directly to `Chunker`:

```python
chunker = Chunker(
    chunker="semantic",
    chunk_size=512,
    embedding_model="all-MiniLM-L6-v2",
    threshold=0.5,
)
```
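
The same pattern applies to the other strategies. As a sketch, the `code` strategy might be configured like this; the `language` keyword is an assumption about the underlying Chonkie code chunker, so check the Chonkie documentation for the exact parameter names:

```python
# Hypothetical configuration for the 'code' strategy; the `language`
# keyword is an assumption about the underlying Chonkie code chunker.
code_chunker = Chunker(
    chunker="code",
    chunk_size=512,
    language="python",
)
```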
## Usage

You can use `Chunker` directly to parse documents into nodes:

```python
from llama_index.core import Document
from llama_index.node_parser.chonkie import Chunker

chunker = Chunker(chunk_size=512)
doc = Document(text="Your long text here...")
nodes = chunker.get_nodes_from_documents([doc])
```
Or use it as a component within an `IngestionPipeline`:

```python
from llama_index.core import Document
from llama_index.core.ingestion import IngestionPipeline
from llama_index.node_parser.chonkie import Chunker

pipeline = IngestionPipeline(
    transformations=[
        Chunker("recursive", chunk_size=512),
        # ... other transformations
    ]
)

nodes = pipeline.run(documents=[Document.example()])
```
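
The resulting nodes plug into the rest of LlamaIndex as usual. For example, a minimal sketch that builds a vector index from them (this assumes an embedding model is configured; LlamaIndex defaults to OpenAI embeddings, which require an API key):

```python
from llama_index.core import VectorStoreIndex

# Build an index over the chunked nodes (uses the globally configured
# embedding model; OpenAI by default, which requires OPENAI_API_KEY)
index = VectorStoreIndex(nodes)
query_engine = index.as_query_engine()
```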