
Hybrid chunking



Overview

Hybrid chunking applies tokenization-aware refinements on top of document-based hierarchical chunking.

For more details, see here.

Setup

python
%pip install -qU pip docling transformers
python
DOC_SOURCE = "../../tests/data/md/wiki.md"

Basic usage

We first convert the document:

python
from docling.document_converter import DocumentConverter

doc = DocumentConverter().convert(source=DOC_SOURCE).document

For a basic chunking scenario, we can just instantiate a HybridChunker, which will use the default parameters.

python
from docling.chunking import HybridChunker

chunker = HybridChunker()
chunk_iter = chunker.chunk(dl_doc=doc)

šŸ‘‰ NOTE: As seen above, using the HybridChunker can sometimes trigger a warning from the transformers library; this is a false alarm. For details, check here.

Note that the text you would typically want to embed is the context-enriched text returned by the contextualize() method:

python
for i, chunk in enumerate(chunk_iter):
    print(f"=== {i} ===")
    print(f"chunk.text:\n{f'{chunk.text[:300]}…'!r}")

    enriched_text = chunker.contextualize(chunk=chunk)
    print(f"chunker.contextualize(chunk):\n{f'{enriched_text[:300]}…'!r}")

    print()

Configuring tokenization

For more control over the chunking, we can parametrize the tokenization as shown below.

In a RAG / retrieval context, it is important to ensure that the chunker and the embedding model use the same tokenizer.
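To see why this matters, here is a toy illustration (plain Python, not a real tokenizer): two different ways of counting "tokens" in the same text produce quite different budgets, so chunk sizing is only meaningful against the embedding model's own tokenizer.

```python
# Toy illustration: different tokenization schemes disagree on token counts.
text = (
    "Hybrid chunking applies tokenization-aware refinements "
    "on top of hierarchical chunking."
)
word_tokens = len(text.split())  # naive word-level count
char_estimate = len(text) // 4   # rough subword heuristic (~4 chars per token)
print(word_tokens, char_estimate)
```

A chunk that fits one budget may overflow the other, which is why the examples below pass the embedding model's tokenizer to the chunker explicitly.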

šŸ‘‰ HuggingFace transformers tokenizers can be used as shown in the following example:

python
from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer
from transformers import AutoTokenizer

from docling.chunking import HybridChunker

EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
MAX_TOKENS = 64  # set to a small number for illustrative purposes

tokenizer = HuggingFaceTokenizer(
    tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),
    max_tokens=MAX_TOKENS,  # optional, by default derived from `tokenizer` for HF case
)

šŸ‘‰ Alternatively, OpenAI tokenizers can be used as shown in the example below (uncomment to use — requires installing docling-core[chunking-openai]):

python
# import tiktoken

# from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer

# tokenizer = OpenAITokenizer(
#     tokenizer=tiktoken.encoding_for_model("gpt-4o"),
#     max_tokens=128 * 1024,  # context window length required for OpenAI tokenizers
# )

We can now instantiate our chunker:

python
chunker = HybridChunker(
    tokenizer=tokenizer,
    merge_peers=True,  # optional, defaults to True
)
chunk_iter = chunker.chunk(dl_doc=doc)
chunks = list(chunk_iter)

Points to notice looking at the output chunks below:

  • Where possible, we fit within the limit of 64 tokens for the metadata-enriched serialization form (see chunk 2)
  • Where needed, we stop before the limit, e.g. at 63 tokens where the next token would be a comma (see chunk 6)
  • Where possible, we merge undersized peer chunks (see chunk 0)
  • "Tail" chunks trailing right after merges may still be undersized (see chunk 8)
python
for i, chunk in enumerate(chunks):
    print(f"=== {i} ===")
    txt_tokens = tokenizer.count_tokens(chunk.text)
    print(f"chunk.text ({txt_tokens} tokens):\n{chunk.text!r}")

    ser_txt = chunker.contextualize(chunk=chunk)
    ser_tokens = tokenizer.count_tokens(ser_txt)
    print(f"chunker.contextualize(chunk) ({ser_tokens} tokens):\n{ser_txt!r}")

    print()
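The peer-merging behavior noted above can also be sketched in isolation. The following is a simplified illustration, not the actual HybridChunker implementation: it greedily joins consecutive chunks while the combined text stays within the token budget, here using a toy whitespace token count.

```python
def merge_peers(chunk_texts, count_tokens, max_tokens):
    # Simplified sketch of peer merging: greedily join consecutive chunks
    # while the combined text stays within max_tokens.
    merged = []
    for text in chunk_texts:
        if merged and count_tokens(merged[-1] + "\n" + text) <= max_tokens:
            merged[-1] = merged[-1] + "\n" + text
        else:
            merged.append(text)
    return merged


count = lambda t: len(t.split())  # toy whitespace "token" count
parts = ["alpha beta", "gamma", "delta epsilon zeta", "eta"]
print(merge_peers(parts, count, max_tokens=4))
```

Note that the real chunker merges only chunks sharing the same headings and metadata ("peers"), which this sketch ignores for brevity.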

Table chunking with header repetition

When chunking documents with tables, the HybridChunker can repeat table headers in each chunk to maintain context. This is particularly useful for wide tables where the content spans multiple chunks.

Let's demonstrate this with a CSV file containing customer data.

python
# Convert a CSV file with a wide table
CSV_SOURCE = "../../tests/data/csv/csv-comma.csv"

csv_result = DocumentConverter().convert(source=CSV_SOURCE)
csv_doc = csv_result.document

print(f"Document has {len(list(csv_doc.iterate_items()))} items")
print("\nFirst few lines of the CSV table:")
print(csv_doc.export_to_markdown()[:500])

Now let's chunk this table with header repetition enabled. We'll use a small token limit to force the table to be split across multiple chunks.

python
from docling_core.transforms.chunker.hierarchical_chunker import (
    ChunkingDocSerializer,
    ChunkingSerializerProvider,
)
from docling_core.transforms.serializer.markdown import (
    MarkdownParams,
    MarkdownTableSerializer,
)


# Create a custom serializer provider that uses Markdown for tables
class MDTableSerializerProvider(ChunkingSerializerProvider):
    def get_serializer(self, doc):
        return ChunkingDocSerializer(
            doc=doc,
            table_serializer=MarkdownTableSerializer(),
            params=MarkdownParams(compact_tables=True),
        )


small_tokenizer = HuggingFaceTokenizer(
    tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),
    max_tokens=200,
)

chunker_with_headers = HybridChunker(
    tokenizer=small_tokenizer,
    repeat_table_header=True,  # Repeat headers in each chunk
    serializer_provider=MDTableSerializerProvider(),  # Use Markdown table format
)

csv_chunks = list(chunker_with_headers.chunk(csv_doc))

print(f"Total chunks created: {len(csv_chunks)}\n")

# Display the first few chunks to show header repetition
for i, chunk in enumerate(csv_chunks[:3], 1):
    print(f"{'=' * 60}")
    print(f"Chunk {i}:")
    print(f"{'=' * 60}")
    chunk_text = chunk.text
    # Show first 300 characters of each chunk
    preview = chunk_text[:300] + "..." if len(chunk_text) > 300 else chunk_text
    print(preview)
    print(f"\nTokens: {small_tokenizer.count_tokens(chunk_text)}")
    print(f"Has table header: {chunk_text.startswith('|')}\n")

Each chunk starts with the table header row, ensuring that every chunk maintains the context of what each column represents. This is especially important when:

  • Feeding chunks to an embedding model for semantic search
  • Processing chunks independently in downstream tasks
  • Working with wide tables that naturally span multiple chunks
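The header-repetition idea can be illustrated standalone. Below is a minimal sketch (a hypothetical helper, not the Docling implementation) that splits a Markdown table into row groups and repeats the header and separator rows at the top of each group:

```python
def split_table_rows(table_md: str, rows_per_chunk: int) -> list[str]:
    # Illustrative sketch: split a Markdown table into row groups,
    # repeating the header and separator rows in every group.
    lines = table_md.strip().splitlines()
    header, separator, body = lines[0], lines[1], lines[2:]
    chunks = []
    for i in range(0, len(body), rows_per_chunk):
        chunks.append(
            "\n".join([header, separator] + body[i : i + rows_per_chunk])
        )
    return chunks


table = (
    "| id | name |\n"
    "|----|------|\n"
    "| 1  | Ada  |\n"
    "| 2  | Bob  |\n"
    "| 3  | Eve  |"
)
for part in split_table_rows(table, rows_per_chunk=2):
    print(part)
    print()
```

Each emitted group is a self-contained Markdown table, which is what makes the chunks useful in isolation.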

For more advanced control over header handling in wide tables, including the omit_header_on_overflow parameter, see the Line-based chunking example.