XBRL Document Conversion

docs/examples/xbrl_conversion.ipynb

This example demonstrates how to parse XBRL (eXtensible Business Reporting Language) documents using Docling, completely offline.

XBRL is a standard XML-based format used globally by companies, regulators, and financial institutions for exchanging business and financial information in a structured, machine-readable format. It's widely adopted for regulatory filings (e.g., SEC filings in the US).

What you'll learn

  • How to configure Docling to parse XBRL documents offline
  • How to provide a local taxonomy package for XBRL validation
  • How to extract structured data from XBRL instance documents
  • How to export XBRL content to various formats (Markdown, JSON, etc.)

The data to run this notebook has been fetched from the SEC's Electronic Data Gathering, Analysis, and Retrieval (EDGAR) system.

Setup

Install Docling with XBRL support:

python
%pip install -q docling

Download Sample XBRL Data

For this example, we'll use a sample XBRL instance document and its taxonomy. In a real scenario, you would have your own XBRL files and taxonomy packages.

We'll download the test data from the Docling repository:

python
import urllib.request
from pathlib import Path

# Create directories for XBRL data
data_dir = Path("xbrl_data")
taxonomy_dir = data_dir / "taxonomy"
taxonomy_dir.mkdir(parents=True, exist_ok=True)

# Base URL for test data
base_url = (
    "https://raw.githubusercontent.com/docling-project/docling/main/tests/data/xbrl/"
)

# Download XBRL instance file
instance_file = data_dir / "mlac-20251231.xml"
if not instance_file.exists():
    print("Downloading XBRL instance file...")
    urllib.request.urlretrieve(f"{base_url}mlac-20251231.xml", instance_file)
    print(f"Downloaded: {instance_file}")

# Download taxonomy files
taxonomy_files = [
    "mlac-20251231.xsd",
    "mlac-20251231_cal.xml",
    "mlac-20251231_def.xml",
    "mlac-20251231_lab.xml",
    "mlac-20251231_pre.xml",
]

print("Downloading taxonomy files...")
for filename in taxonomy_files:
    target_file = taxonomy_dir / filename
    if not target_file.exists():
        urllib.request.urlretrieve(f"{base_url}mlac-taxonomy/{filename}", target_file)
        print(f"  Downloaded: {filename}")

# Download taxonomy package (contains URL mappings for offline parsing)
taxonomy_package = taxonomy_dir / "taxonomy_package.zip"
if not taxonomy_package.exists():
    print("Downloading taxonomy package...")
    urllib.request.urlretrieve(
        f"{base_url}mlac-taxonomy/taxonomy_package.zip", taxonomy_package
    )
    print("  Downloaded: taxonomy_package.zip")

print("\nAll files downloaded successfully!")

Configure XBRL Backend

To parse XBRL documents offline, we need to:

  1. Enable local resource fetching (for taxonomy files)
  2. Disable remote resource fetching (for offline operation)
  3. Provide the path to the local taxonomy directory

python
from docling.datamodel.backend_options import XBRLBackendOptions
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, XBRLFormatOption

# Configure XBRL backend options
backend_options = XBRLBackendOptions(
    enable_local_fetch=True,  # Allow reading local taxonomy files
    enable_remote_fetch=False,  # Disable remote fetching for offline operation
    taxonomy=taxonomy_dir,  # Path to local taxonomy directory
)

# Create document converter with XBRL support
converter = DocumentConverter(
    allowed_formats=[InputFormat.XML_XBRL],
    format_options={
        InputFormat.XML_XBRL: XBRLFormatOption(backend_options=backend_options)
    },
)

print("XBRL converter configured successfully!")

💡 Because the converter must read the supporting taxonomy files, set the enable_local_fetch option to True in the XBRL backend settings.
💡 In addition to the XBRL report's own taxonomy files, you need a taxonomy package: a bundle containing URL remappings that enables fully offline parsing. If you prefer not to supply a taxonomy package, omit it and set enable_remote_fetch to True in the XBRL backend settings; the backend will then fetch the web-referenced files from their remote publishers and cache them locally for reuse.
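As a sketch, the online alternative described in the tip above could be configured like this, using the same XBRLBackendOptions fields shown in this example but omitting the taxonomy path (the backend manages the download cache itself):

```python
from docling.datamodel.backend_options import XBRLBackendOptions
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, XBRLFormatOption

# Online variant: no taxonomy package is supplied, so the backend
# fetches web-referenced schema/linkbase files and caches them locally.
online_backend_options = XBRLBackendOptions(
    enable_local_fetch=True,   # still read the report's own taxonomy files
    enable_remote_fetch=True,  # allow fetching remote taxonomy resources
)

online_converter = DocumentConverter(
    allowed_formats=[InputFormat.XML_XBRL],
    format_options={
        InputFormat.XML_XBRL: XBRLFormatOption(backend_options=online_backend_options)
    },
)
```

Note that the first conversion in this mode requires network access; subsequent runs reuse the cached files.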

Convert XBRL Document

Now we can convert the XBRL instance document. The converter will:

  • Parse the XBRL instance file
  • Validate it against the local taxonomy
  • Extract metadata, text blocks, and numeric facts
  • Convert everything to a unified DoclingDocument representation

python
# Convert the XBRL document
print(f"Converting XBRL document: {instance_file}")
result = converter.convert(instance_file)
doc = result.document

print("\nConversion successful!")
print(f"Document name: {doc.name}")
print(f"Number of items: {len(list(doc.iterate_items()))}")

Inspect Document Structure

Let's examine the structure of the converted document:

python
from docling_core.types.doc import DocItemLabel

# Count items by type
item_counts = {}
for item, _ in doc.iterate_items():
    label = item.label
    item_counts[label] = item_counts.get(label, 0) + 1

print("Document structure:")
for label, count in sorted(item_counts.items(), key=lambda x: x[1], reverse=True):
    print(f"  {label.value}: {count}")

View Sample Content

Let's look at some of the extracted content:

python
# Display first few text items
print("Sample text content:\n")
text_count = 0
for item, _ in doc.iterate_items():
    if item.label == DocItemLabel.TEXT and text_count < 3:
        print(f"- {item.text[:200]}..." if len(item.text) > 200 else f"- {item.text}")
        print()
        text_count += 1

View Key-Value Pairs

XBRL numeric facts are extracted as key-value pairs:

python
# Display sample key-value pairs
graph_data = doc.key_value_items[0].graph
print(f"Total key-value pairs extracted: {len(graph_data.links)}\n")
for link in graph_data.links[:10]:
    source = next(
        item for item in graph_data.cells if item.cell_id == link.source_cell_id
    )
    target = next(
        item for item in graph_data.cells if item.cell_id == link.target_cell_id
    )
    print(f"{source.text} -> {target.text}")

💡 The current backend implementation flattens all key‑value pairs in an XBRL report. Future improvements will preserve the rich taxonomy of those data points.
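Because the pairs are flat, the link-resolution loop above can be wrapped in a small helper that produces a plain dict for downstream use. The sketch below uses minimal stand-in classes that mirror the fields accessed earlier (cell_id, text, source_cell_id, target_cell_id); with a real document you would pass doc.key_value_items[0].graph.cells and .links instead of the sample data:

```python
from dataclasses import dataclass


# Stand-ins for the graph cells and links used in the loop above;
# hypothetical sample data, not taken from the actual XBRL report.
@dataclass
class Cell:
    cell_id: int
    text: str


@dataclass
class Link:
    source_cell_id: int
    target_cell_id: int


def facts_to_dict(cells, links):
    """Resolve each link's source/target cell to its text and collect key -> value pairs."""
    by_id = {cell.cell_id: cell.text for cell in cells}
    return {by_id[link.source_cell_id]: by_id[link.target_cell_id] for link in links}


cells = [Cell(0, "Revenues"), Cell(1, "1200000"), Cell(2, "NetIncomeLoss"), Cell(3, "150000")]
links = [Link(0, 1), Link(2, 3)]
print(facts_to_dict(cells, links))  # {'Revenues': '1200000', 'NetIncomeLoss': '150000'}
```

Since the structure is flat, facts that share a concept name would overwrite one another in the dict; a real consumer might collect the values into lists instead.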

Export to Markdown

Export the document to Markdown format for easy reading:

python
# Export to Markdown
markdown_content = doc.export_to_markdown()

# Display first 2000 characters
print("Markdown export (first 2000 characters):\n")
print(markdown_content[:2000])
print("\n...")

# Save to file
output_md = data_dir / "output.md"
output_md.write_text(markdown_content)
print(f"\nFull markdown saved to: {output_md}")

Export to JSON

Export the complete document structure to JSON:

python
import json

# Export to JSON
output_json = data_dir / "output.json"
doc.save_as_json(output_json)

print(f"Document exported to JSON: {output_json}")
print(f"File size: {output_json.stat().st_size / 1024:.2f} KB")

Summary

In this example, we demonstrated:

✅ How to configure Docling for offline XBRL parsing
✅ How to provide a local taxonomy for XBRL validation
✅ How to convert XBRL instance documents to DoclingDocument
✅ How to extract metadata, text blocks, and numeric facts
✅ How to export XBRL content to Markdown and JSON formats

Key Points

  • Offline operation: By setting enable_remote_fetch=False, all processing happens locally
  • Taxonomy support: The local taxonomy directory should contain all necessary schema and linkbase files
  • Structured extraction: XBRL numeric facts are extracted as key-value pairs with graph representation
  • Text blocks: HTML text blocks in XBRL are converted to structured content

Note on Future Changes

⚠️ The current implementation uses DoclingDocument's GraphData object to represent key-value pairs. This design will change in a future release of the docling-core library.