Conversion of custom XML

docs/examples/backend_xml_rag.ipynb


Step           Tech                                   Execution
Embedding      Hugging Face / Sentence Transformers   💻 Local
Vector store   Milvus                                 💻 Local
Gen AI         Hugging Face Inference API             🌐 Remote

Overview

This is an example of using Docling to convert structured data (XML) into a unified document representation format, DoclingDocument, and to leverage its rich structured content for RAG applications.

The data used in this example consists of patents from the United States Patent and Trademark Office (USPTO) and medical articles from PubMed Central® (PMC).

In this notebook, we accomplish the following:

  • Simple conversion of supported XML files
  • An end-to-end application: fetching the data, converting, chunking, and indexing the documents, and running question-answering with RAG

For more details on document chunking with Docling, refer to the Chunking documentation. For RAG with Docling and LlamaIndex, also check the example RAG with LlamaIndex.

Simple conversion

The XML file format defines and stores data in a format that is both human-readable and machine-readable. Because of this flexibility, Docling requires custom backend processors to interpret XML definitions and convert them into DoclingDocument objects.
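As a conceptual illustration only (not Docling's actual dispatch logic), backend selection could key off the document's DOCTYPE; the backend names echo those used later in this notebook, but the dispatch table itself is hypothetical:

```python
import re

# Hypothetical dispatch table mapping a DOCTYPE name to a backend label.
# This is a sketch; Docling's real backend selection is more involved.
BACKENDS = {
    "us-patent-grant": "PatentUsptoDocumentBackend",
    "article": "JatsDocumentBackend",
}


def pick_backend(xml_bytes: bytes):
    """Return a backend label based on the document's DOCTYPE, if any."""
    match = re.search(rb"<!DOCTYPE\s+([\w-]+)", xml_bytes)
    if match is None:
        return None
    return BACKENDS.get(match.group(1).decode())


sample = b'<?xml version="1.0"?><!DOCTYPE us-patent-grant SYSTEM "g.dtd"><us-patent-grant/>'
print(pick_backend(sample))  # -> PatentUsptoDocumentBackend
```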

Some public data collections in XML format are already supported by Docling (USPTO patents and PMC articles). In these cases, document conversion is straightforward and works the same as with any other supported format, such as PDF or HTML. The following example shows the recommended usage of Docling for a single file:

python
from docling.document_converter import DocumentConverter

# a sample PMC article:
source = "../../tests/data/jats/elife-56337.nxml"
converter = DocumentConverter()
result = converter.convert(source)
print(result.status)

Once the document is converted, it can be exported to any format supported by Docling. For instance, to Markdown (showing only the first lines here):

python
md_doc = result.document.export_to_markdown()

delim = "\n"
print(delim.join(md_doc.split(delim)[:8]))

If the XML file is not supported, a ConversionError will be raised.

python
from io import BytesIO

from docling.datamodel.base_models import DocumentStream
from docling.exceptions import ConversionError

xml_content = (
    b'<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE docling_test SYSTEM '
    b'"test.dtd"><docling>Random content</docling>'
)
stream = DocumentStream(name="docling_test.xml", stream=BytesIO(xml_content))
try:
    result = converter.convert(stream)
except ConversionError as ce:
    print(ce)

You can always refer to the Usage documentation page for a list of supported formats.

End-to-end application

This section describes a step-by-step application that processes XML files from supported public collections and uses them for question-answering.

Setup

Requirements can be installed as shown below. The --no-warn-conflicts argument is meant for Colab's pre-populated Python environment; feel free to remove it for stricter usage.

python
%pip install -q --progress-bar off --no-warn-conflicts llama-index-core llama-index-readers-docling llama-index-node-parser-docling llama-index-embeddings-huggingface llama-index-llms-huggingface-api llama-index-vector-stores-milvus llama-index-readers-file python-dotenv

This notebook uses Hugging Face's Inference API. For an increased LLM quota, a token can be provided via the environment variable HF_TOKEN.

If you're running this notebook in Google Colab, make sure you add your API key as a secret.

python
import os
from warnings import filterwarnings

from dotenv import load_dotenv


def _get_env_from_colab_or_os(key):
    try:
        from google.colab import userdata

        try:
            return userdata.get(key)
        except userdata.SecretNotFoundError:
            pass
    except ImportError:
        pass
    return os.getenv(key)


load_dotenv()

filterwarnings(action="ignore", category=UserWarning, module="pydantic")

We can now define the main parameters:

python
from pathlib import Path
from tempfile import mkdtemp

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI

EMBED_MODEL_ID = "BAAI/bge-small-en-v1.5"
EMBED_MODEL = HuggingFaceEmbedding(model_name=EMBED_MODEL_ID)
TEMP_DIR = Path(mkdtemp())
MILVUS_URI = str(TEMP_DIR / "docling.db")
GEN_MODEL = HuggingFaceInferenceAPI(
    token=_get_env_from_colab_or_os("HF_TOKEN"),
    model_name="mistralai/Mixtral-8x7B-Instruct-v0.1",
)
embed_dim = len(EMBED_MODEL.get_text_embedding("hi"))
# https://github.com/huggingface/transformers/issues/5486:
os.environ["TOKENIZERS_PARALLELISM"] = "false"

Fetch the data

In this notebook we will use XML data from collections supported by Docling:

  • Medical articles from PubMed Central® (PMC)
  • Patent grants from the United States Patent and Trademark Office (USPTO)

The raw files will be downloaded from the source and saved in a temporary directory.

PMC articles

The OA file is a manifest of all the PMC articles, including the URL paths to download the source files. In this notebook, we will use as an example the article Pathogens spread by high-altitude windborne mosquitoes, which is available in the archive file PMC11703268.tar.gz.

python
import tarfile
from io import BytesIO

import requests

# PMC article PMC11703268
url: str = "https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/e3/6b/PMC11703268.tar.gz"

print(f"Downloading {url}...")
buf = BytesIO(requests.get(url).content)
print("Extracting and storing the XML file containing the article text...")
with tarfile.open(fileobj=buf, mode="r:gz") as tar_file:
    for tarinfo in tar_file:
        if tarinfo.isreg():
            file_path = Path(tarinfo.name)
            if file_path.suffix == ".nxml":
                with open(TEMP_DIR / file_path.name, "wb") as file_obj:
                    file_obj.write(tar_file.extractfile(tarinfo).read())
                print(f"Stored XML file {file_path.name}")

USPTO patents

Since each USPTO file is a concatenation of several patents, we need to split its content into valid XML pieces. The following code downloads a sample zip file, splits its content into sections, and dumps each section as an XML file. For simplicity, this pipeline is shown here in a sequential manner, but it could be parallelized.

python
import zipfile

# Patent grants from December 17-23, 2024
url: str = (
    "https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2024/ipg241217.zip"
)
XML_SPLITTER: str = '<?xml version="1.0"'
doc_num: int = 0

print(f"Downloading {url}...")
buf = BytesIO(requests.get(url).content)
print("Parsing zip file, splitting into XML sections, and exporting to files...")
with zipfile.ZipFile(buf) as zf:
    res = zf.testzip()
    if res:
        print("Error validating zip file")
    else:
        with zf.open(zf.namelist()[0]) as xf:
            is_patent = False
            patent_buffer = BytesIO()
            for xf_line in xf:
                decoded_line = xf_line.decode(errors="ignore").rstrip()
                xml_index = decoded_line.find(XML_SPLITTER)
                if xml_index != -1:
                    if (
                        xml_index > 0
                    ):  # cases like </sequence-cwu><?xml version="1.0"...
                        patent_buffer.write(xf_line[:xml_index])
                        patent_buffer.write(b"\r\n")
                        xf_line = xf_line[xml_index:]
                    if patent_buffer.getbuffer().nbytes > 0 and is_patent:
                        doc_num += 1
                        patent_id = f"ipg241217-{doc_num}"
                        with open(TEMP_DIR / f"{patent_id}.xml", "wb") as file_obj:
                            file_obj.write(patent_buffer.getbuffer())
                    is_patent = False
                    patent_buffer = BytesIO()
                elif decoded_line.startswith("<!DOCTYPE"):
                    is_patent = True
                patent_buffer.write(xf_line)
python
print(f"Fetched and exported {doc_num} documents.")
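The splitting logic can be boiled down to a much-simplified sketch on synthetic input (the content below is made up; the real code above works line by line on the raw bytes and also handles delimiters appearing mid-line):

```python
# Tiny synthetic stand-in for a concatenated USPTO bulk file:
# two XML documents glued together, as in the real archives.
XML_SPLITTER = '<?xml version="1.0"'

concatenated = (
    '<?xml version="1.0"?><!DOCTYPE patent><patent>first</patent>'
    '<?xml version="1.0"?><!DOCTYPE patent><patent>second</patent>'
)

# Every non-empty split piece is one document; re-attach the delimiter
# that str.split removed.
pieces = [XML_SPLITTER + piece for piece in concatenated.split(XML_SPLITTER) if piece]
print(len(pieces))  # -> 2
```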

Parse, chunk, and index

The DoclingDocument format of the converted patents has a rich hierarchical structure, inherited from the original XML document and preserved by the Docling custom backend. In this notebook, we will leverage:

  • The SimpleDirectoryReader pattern to iterate over the exported XML files created in section Fetch the data.
  • The LlamaIndex extensions, DoclingReader and DoclingNodeParser, to ingest the patent chunks into a Milvus vector store.
  • The HierarchicalChunker implementation, which applies a document-based hierarchical chunking, to leverage the patent structures like sections and paragraphs within sections.
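As a rough conceptual sketch of document-based hierarchical chunking (not the actual HierarchicalChunker implementation), each text chunk can carry its enclosing section heading as context; the mini-document below is hypothetical:

```python
# Hypothetical mini-document: sections with paragraphs, mirroring the
# hierarchy a DoclingDocument preserves from the XML source.
doc = {
    "Abstract": ["A wearable fitness device is described."],
    "Claims": ["1. A device comprising a sensor.", "2. The device of claim 1."],
}

# One chunk per paragraph, prefixed with its section heading so each
# embedded chunk retains its place in the hierarchy.
chunks = [
    f"{heading}: {paragraph}"
    for heading, paragraphs in doc.items()
    for paragraph in paragraphs
]
print(len(chunks))  # -> 3
```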

Refer to other possible implementations and usage patterns in the Chunking documentation and the RAG with LlamaIndex notebook.

Set the Docling reader and the directory reader

Note that DoclingReader uses Docling's DocumentConverter by default, so it will recognize the format of the XML files and leverage the PatentUsptoDocumentBackend automatically.

For demonstration purposes, we limit the scope of the analysis to the first 100 patents.

python
from llama_index.core import SimpleDirectoryReader
from llama_index.readers.docling import DoclingReader

reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)
dir_reader = SimpleDirectoryReader(
    input_dir=TEMP_DIR,
    exclude=["docling.db", "*.nxml"],
    file_extractor={".xml": reader},
    filename_as_id=True,
    num_files_limit=100,
)
Set the node parser

Note that the HierarchicalChunker is the default chunking implementation of the DoclingNodeParser.

python
from llama_index.node_parser.docling import DoclingNodeParser

node_parser = DoclingNodeParser()
Set a local Milvus database and run the ingestion
python
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.milvus import MilvusVectorStore

vector_store = MilvusVectorStore(
    uri=MILVUS_URI,
    dim=embed_dim,
    overwrite=True,
)

index = VectorStoreIndex.from_documents(
    documents=dir_reader.load_data(show_progress=True),
    transformations=[node_parser],
    storage_context=StorageContext.from_defaults(vector_store=vector_store),
    embed_model=EMBED_MODEL,
    show_progress=True,
)

Finally, add the PMC article to the vector store directly from the reader.

python
index.from_documents(
    documents=reader.load_data(TEMP_DIR / "nihpp-2024.12.26.630351v1.nxml"),
    transformations=[node_parser],
    storage_context=StorageContext.from_defaults(vector_store=vector_store),
    embed_model=EMBED_MODEL,
)

Question-answering with RAG

The retriever can be used to identify highly relevant documents:

python
retriever = index.as_retriever(similarity_top_k=3)
results = retriever.retrieve("What patents are related to fitness devices?")

for item in results:
    print(item)

With the query engine, we can run question-answering with the RAG pattern on the set of indexed documents.

First, we can prompt the LLM directly:

python
from llama_index.core.base.llms.types import ChatMessage, MessageRole
from rich.console import Console
from rich.panel import Panel

console = Console()
query = "Do mosquitoes in high altitude expand viruses over large distances?"

usr_msg = ChatMessage(role=MessageRole.USER, content=query)
response = GEN_MODEL.chat(messages=[usr_msg])

console.print(Panel(query, title="Prompt", border_style="bold red"))
console.print(
    Panel(
        response.message.content.strip(),
        title="Generated Content",
        border_style="bold green",
    )
)

Now, we can compare the response when the model is prompted with the indexed PMC article as supporting context:

python
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

filters = MetadataFilters(
    filters=[
        ExactMatchFilter(key="filename", value="nihpp-2024.12.26.630351v1.nxml"),
    ]
)

query_engine = index.as_query_engine(llm=GEN_MODEL, filters=filters, similarity_top_k=3)
result = query_engine.query(query)

console.print(
    Panel(
        result.response.strip(),
        title="Generated Content with RAG",
        border_style="bold green",
    )
)