Chunking & tokenization with Data Prep Kit

docs/examples/dpk-ingest-chunk-tokenize.ipynb

This notebook demonstrates how to build a sequence of <a href=https://github.com/data-prep-kit/data-prep-kit> <b>DPK transforms</b> </a> for ingesting HTML documents using the Docling2Parquet transform and chunking them using the Doc_Chunk transform. Both transforms are based on the <a href=https://docling-project.github.io/docling/> Docling library</a>.

In this example, we will use the <i>Wikimedia API</i> to retrieve the HTML articles that will be used as a seed for our LLM application. Once the articles are loaded into a local cache, we will construct and invoke the sequence of transforms to ingest the content and produce tokens for the chunked content.

🔍 Why DPK Pipelines

DPK transform pipelines simplify how any number of transforms can be executed in sequence to ingest, annotate, filter, and tokenize content used for LLM post-training and RAG applications.

🧰 Key Transforms in This Recipe

We will use the following transforms from DPK:

  • Docling2Parquet: Ingests one or more HTML documents and turns them into a Parquet file.
  • Doc_Chunk: Creates chunks from one or more documents.
  • Tokenization: Tokenizes document chunks using a Hugging Face tokenizer.
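The three transforms chain together through folders: each stage reads the previous stage's output folder as its input. A minimal sketch of this wiring (the helper function is ours, not part of DPK; the stage names match the output folders used later in this notebook):

```python
import os


def chain_stages(base: str, stages: list) -> list:
    """Pair each stage with an (input_folder, output_folder) tuple,
    where each stage's input is the previous stage's output."""
    pairs, current = [], base
    for stage in stages:
        out = os.path.join(base, stage)
        pairs.append((current, out))
        current = out
    return pairs


# Stage names as used in this notebook
pairs = chain_stages("/data", ["docling2parquet", "doc_chunk", "tkn"])
```

This is exactly the pattern you will see below, where each transform's `input_folder` is the previous transform's `output_folder`.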

Prerequisites

1- This notebook uses the Wikimedia API for retrieving the initial HTML documents and the llama-tokenizer from Hugging Face.

2- In order to use the notebook, users must provide a <b>.env</b> file with valid access tokens: one for accessing the Wikimedia endpoint (<a href=https://enterprise.wikimedia.com/docs/> instructions can be found here </a>) and a Hugging Face token for loading the model (<a href=https://huggingface.co/docs/hub/en/security-tokens> instructions can be found here</a>). The .env file will look something like this:

WIKI_ACCESS_TOKEN='eyxxx'
HF_READ_ACCESS_TOKEN='hf_xxx'

3- Install the DPK library into the environment:

python
%%capture
%pip install "data-prep-toolkit-transforms[docling2parquet,doc_chunk,tokenization]"
%pip install pandas
%pip install "numpy<2.0"
from dotenv import load_dotenv

load_dotenv(".env", override=True)
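A missing token only surfaces later as an opaque HTTP or model-loading error, so it can help to check the environment up front. A small sketch (the helper name is ours, not part of DPK):

```python
import os


def missing_env(names):
    """Return the required variable names that are not set in the environment."""
    return [n for n in names if not os.getenv(n)]


# Fail fast with a clear message instead of an opaque HTTP 401 later
absent = missing_env(("WIKI_ACCESS_TOKEN", "HF_READ_ACCESS_TOKEN"))
if absent:
    print(f"Missing in .env: {absent}")
```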

We will define and use a utility function for downloading the articles and saving them to the local disk:

<b>load_corpus</b>: Uses an HTTP request with the Wikimedia API token to connect to a Wikimedia endpoint and retrieve the HTML articles that will be used as a seed for our LLM application. The articles are then saved to a local cache folder for further processing.

python
def load_corpus(articles: list, folder: str) -> int:
    import os
    import re

    import requests

    headers = {"Authorization": f"Bearer {os.getenv('WIKI_ACCESS_TOKEN')}"}
    count = 0
    for article in articles:
        try:
            endpoint = f"https://api.enterprise.wikimedia.com/v2/articles/{article}"
            response = requests.get(endpoint, headers=headers)
            response.raise_for_status()
            doc = response.json()
            for item in doc:
                # sanitize the article name so it can be used as a filename
                filename = re.sub(r"[^a-zA-Z0-9_]", "_", item["name"])
                with open(f"{folder}/{filename}.html", "w") as f:
                    f.write(item["article_body"]["html"])
                    count = count + 1
        except Exception as e:
            print(f"Failed to retrieve content: {e}")
    return count

🔗 Setup the experiment

DPK requires that we define a source/input folder from which the transform sequence will ingest the documents and a destination/output folder where the tokenized output will be stored. We will also initialize the list of articles we want to use in our application.

python
import os
import tempfile

datafolder = tempfile.mkdtemp(dir=os.getcwd())
articles = ["Science,_technology,_engineering,_and_mathematics"]
assert load_corpus(articles, datafolder) > 0, "Failed to download any documents"

🔗 Ingest

Invoke the Docling2Parquet transform to parse the HTML documents and create a Markdown representation stored in a Parquet file:

python
%%capture
from dpk_docling2parquet import Docling2Parquet, docling2parquet_contents_types

result = Docling2Parquet(
    input_folder=datafolder,
    output_folder=f"{datafolder}/docling2parquet",
    data_files_to_use=[".html"],
    docling2parquet_contents_type=docling2parquet_contents_types.MARKDOWN,  # markdown
).transform()

🔗 Chunk

Invoke the DocChunk transform to break the ingested documents into chunks:

python
%%capture
from dpk_doc_chunk import DocChunk

result = DocChunk(
    input_folder=f"{datafolder}/docling2parquet",
    output_folder=f"{datafolder}/doc_chunk",
    doc_chunk_chunking_type="li_markdown",
    doc_chunk_chunk_size_tokens=128,  # default 128
    doc_chunk_chunk_overlap_tokens=30,  # default 30
).transform()
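The two chunking parameters follow the usual sliding-window scheme: consecutive chunks of up to `doc_chunk_chunk_size_tokens` tokens share `doc_chunk_chunk_overlap_tokens` tokens, so no sentence is split without context. A pure-Python sketch of the idea (not DPK's actual implementation):

```python
def sliding_chunks(tokens, size=128, overlap=30):
    """Cut a token list into windows of `size` tokens;
    consecutive windows share `overlap` tokens."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]


# 300 tokens with the notebook's defaults yield three overlapping windows
chunks = sliding_chunks(list(range(300)))
```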

🔗 Tokenization

Invoke the Tokenization transform to tokenize the chunks with the llama tokenizer:

python
%%capture
from dpk_tokenization import Tokenization

Tokenization(
    input_folder=f"{datafolder}/doc_chunk",
    output_folder=f"{datafolder}/tkn",
    tkn_tokenizer="hf-internal-testing/llama-tokenizer",
    tkn_chunk_size=20_000,
).transform()

✅ Summary

This notebook demonstrated how to run a DPK pipeline using IBM's Data Prep Kit and the Docling library. Each transform creates one or more Parquet files that users can explore to better understand what each stage of the pipeline produces. To see the output of the final stage, we use Pandas to read the final Parquet files and display their content:

python
from pathlib import Path

import pandas as pd

parquet_files = list(Path(f"{datafolder}/tkn/").glob("*.parquet"))
pd.concat(pd.read_parquet(file) for file in parquet_files)