DiT: Self-Supervised Pre-Training for Document Image Transformer

DiT (Document Image Transformer) is a Transformer model pre-trained in a self-supervised manner on large-scale unlabeled text images for Document AI tasks. Self-supervision is essential here because no supervised counterparts exist, owing to the lack of human-labeled document images.

<div align="center"> Model outputs with PubLayNet (left) and ICDAR 2019 cTDaR (right) </div>

What's New

Demos on HuggingFace: Document Layout Analysis, Document Image Classification
March 2022: released pre-trained checkpoints, fine-tuned checkpoints, and code (DiT-base and DiT-large)
March 2022: released preprint on arXiv

Pretrained models

We provide two sets of DiT weights pretrained on IIT-CDIP Test Collection 1.0. Both models were pretrained at 224x224 resolution.

DiT-base: #layer=12; hidden=768; FFN factor=4x; #head=12; patch=16x16 (#parameters: 86M)
DiT-large: #layer=24; hidden=1024; FFN factor=4x; #head=16; patch=16x16 (#parameters: 304M)
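
As a rough sanity check, the parameter counts above can be approximated from the listed hyperparameters. The sketch below assumes a plain ViT-style encoder (learned absolute position embeddings, biases on all linear layers, no task head) with 3-channel 224x224 input. DiT's actual backbone may differ in small details, so the totals are approximate. The function name `vit_param_count` is illustrative and not part of this repository.

```python
def vit_param_count(layers, hidden, ffn_factor=4, patch=16, image=224, channels=3):
    """Approximate parameter count of a plain ViT-style encoder (no task head)."""
    num_patches = (image // patch) ** 2               # 224 / 16 = 14 per side -> 196 patches
    patch_embed = patch * patch * channels * hidden + hidden
    cls_token = hidden
    pos_embed = (num_patches + 1) * hidden            # +1 for the [CLS] token
    attn = 3 * (hidden * hidden + hidden)             # query, key, value projections
    attn += hidden * hidden + hidden                  # attention output projection
    ffn_dim = ffn_factor * hidden
    ffn = hidden * ffn_dim + ffn_dim + ffn_dim * hidden + hidden
    norms = 2 * 2 * hidden                            # two LayerNorms per block
    block = attn + ffn + norms
    final_norm = 2 * hidden
    return patch_embed + cls_token + pos_embed + layers * block + final_norm


base = vit_param_count(layers=12, hidden=768)
large = vit_param_count(layers=24, hidden=1024)
print(f"DiT-base  ~ {base / 1e6:.1f}M parameters")
print(f"DiT-large ~ {large / 1e6:.1f}M parameters")
```

Under these assumptions the estimates land close to the 86M and 304M figures listed above. The number of heads does not enter the count, because heads only partition the hidden dimension.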

Download the checkpoints pretrained with self-supervision on IIT-CDIP Test Collection 1.0:

DiT-base: dit_base_patch16_224
DiT-large: dit_large_patch16_224

Setup

First, clone the repo and install required packages:

git clone https://github.com/microsoft/unilm.git
cd unilm/dit
pip install -r requirements.txt

The required packages include PyTorch 1.9.0, torchvision 0.10.0, and timm 0.5.4, among others.
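
To confirm that an environment matches these versions, a small check along the following lines may help. This is a minimal sketch using only the Python standard library (Python 3.8+). The `REQUIRED` mapping and the `check_versions` helper are illustrative and are not shipped with this repository.

```python
from importlib import metadata

# Versions quoted in this README; keys are PyPI distribution names.
REQUIRED = {"torch": "1.9.0", "torchvision": "0.10.0", "timm": "0.5.4"}


def check_versions(required):
    """Return {package: (installed_version_or_None, matches_required)}."""
    report = {}
    for name, wanted in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Local build tags such as "1.9.0+cu111" still count as a match.
        ok = installed is not None and installed.split("+")[0] == wanted
        report[name] = (installed, ok)
    return report


for name, (installed, ok) in check_versions(REQUIRED).items():
    status = "ok" if ok else "mismatch or missing"
    print(f"{name}: required {REQUIRED[name]}, found {installed} ({status})")
```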

For mixed-precision training, please install apex:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

For object detection, please additionally install the detectron2 library and shapely. Refer to Detectron2's INSTALL.md for details.

# Install `detectron2`
python -m pip install detectron2 -f \
  https://dl.fbaipublicfiles.com/detectron2/wheels/cu111/torch1.9/index.html

# Install `shapely`
pip install shapely
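
The wheel index URL in the command above encodes a CUDA version (`cu111`) and a PyTorch version (`torch1.9`). For a different setup, the URL can be assembled as in the sketch below. The helper `detectron2_wheel_index` is hypothetical, and prebuilt wheels exist only for some CUDA/PyTorch combinations, so check Detectron2's INSTALL.md before relying on a generated URL.

```python
def detectron2_wheel_index(cuda_version, torch_version):
    """Build a wheel index URL following the pattern of the install command above."""
    cuda_tag = "cu" + cuda_version.replace(".", "")                 # "11.1"  -> "cu111"
    torch_tag = "torch" + ".".join(torch_version.split(".")[:2])    # "1.9.0" -> "torch1.9"
    return (
        "https://dl.fbaipublicfiles.com/detectron2/wheels/"
        f"{cuda_tag}/{torch_tag}/index.html"
    )


# Reproduces the URL used in this README for CUDA 11.1 and PyTorch 1.9.
print(detectron2_wheel_index("11.1", "1.9.0"))
```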

Fine-tuning on RVL-CDIP (Document Image Classification)

We summarize the validation results as follows. We also provide the fine-tuned weights. The detailed instructions to reproduce the results can be found at classification/README.md.

name	initialized checkpoint	resolution	accuracy	weight
DiT-base	dit_base_patch16_224	224x224	92.11	link
DiT-large	dit_large_patch16_224	224x224	92.69	link

Fine-tuning on PubLayNet (Document Layout Analysis)

We summarize the validation results as follows. We also provide the fine-tuned weights. The detailed instructions to reproduce the results can be found at object_detection/README.md.

name	initialized checkpoint	detection algorithm	mAP	weight
DiT-base	dit_base_patch16_224	Mask R-CNN	0.935	link
DiT-large	dit_large_patch16_224	Mask R-CNN	0.941	link
DiT-base	dit_base_patch16_224	Cascade R-CNN	0.945	link
DiT-large	dit_large_patch16_224	Cascade R-CNN	0.949	link
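
Assuming the mAP above is the COCO-style metric (average precision averaged over IoU thresholds from 0.50 to 0.95 in steps of 0.05), the overlap measure behind it can be sketched as follows. The helper `box_iou`, the `(x0, y0, x1, y1)` box format, and the example boxes are illustrative choices, not the repository's evaluation code.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)   # zero if the boxes do not overlap
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# COCO-style thresholds: 0.50, 0.55, ..., 0.95
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]

# Made-up boxes for illustration only.
predicted = (10, 10, 110, 60)
ground_truth = (20, 10, 120, 60)
iou = box_iou(predicted, ground_truth)
matched_at = [t for t in IOU_THRESHOLDS if iou >= t]
print(f"IoU = {iou:.3f}; counts as a match at thresholds {matched_at}")
```

A predicted region counts as correct at a given threshold only if its IoU with a ground-truth region of the same class reaches that threshold, so tighter localization is rewarded at the higher thresholds.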

Fine-tuning on ICDAR 2019 cTDaR (Table Detection)

We summarize the validation results as follows. We also provide the fine-tuned weights. The detailed instructions to reproduce the results can be found at object_detection/README.md.

Modern

name	initialized checkpoint	detection algorithm	Weighted Average F1	weight
DiT-base	dit_base_patch16_224	Mask R-CNN	94.74	link
DiT-large	dit_large_patch16_224	Mask R-CNN	95.50	link
DiT-base	dit_base_patch16_224	Cascade R-CNN	95.85	link
DiT-large	dit_large_patch16_224	Cascade R-CNN	96.29	link

Archival

name	initialized checkpoint	detection algorithm	Weighted Average F1	weight
DiT-base	dit_base_patch16_224	Mask R-CNN	96.24	link
DiT-large	dit_large_patch16_224	Mask R-CNN	96.46	link
DiT-base	dit_base_patch16_224	Cascade R-CNN	96.63	link
DiT-large	dit_large_patch16_224	Cascade R-CNN	97.00	link

Combined (Combine the inference results of Modern and Archival)

name	initialized checkpoint	detection algorithm	Weighted Average F1	weight
DiT-base	dit_base_patch16_224	Mask R-CNN	95.30	-
DiT-large	dit_large_patch16_224	Mask R-CNN	95.85	-
DiT-base	dit_base_patch16_224	Cascade R-CNN	96.14	-
DiT-large	dit_large_patch16_224	Cascade R-CNN	96.55	-
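
Assuming the tables above follow the ICDAR 2019 cTDaR competition definition, the Weighted Average F1 combines the F1 scores measured at IoU thresholds 0.6, 0.7, 0.8, and 0.9, weighting each score by its threshold. The sketch below illustrates that arithmetic. The function name and the example F1 values are made up for illustration and are not results from this repository.

```python
def weighted_average_f1(f1_by_iou):
    """Average F1 scores, weighting each one by its IoU threshold."""
    total_weight = sum(f1_by_iou.keys())
    return sum(iou * f1 for iou, f1 in f1_by_iou.items()) / total_weight


# Hypothetical F1 scores at each IoU threshold (illustration only).
example = {0.6: 0.98, 0.7: 0.97, 0.8: 0.95, 0.9: 0.90}
print(f"Weighted Average F1 = {100 * weighted_average_f1(example):.2f}")
```

Because the weights grow with the threshold, a model that localizes tables precisely at IoU 0.9 gains more than one that is only accurate at IoU 0.6.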

Citation

If you find this repository useful, please consider citing our work:

@misc{li2022dit,
    title={DiT: Self-supervised Pre-training for Document Image Transformer},
    author={Junlong Li and Yiheng Xu and Tengchao Lv and Lei Cui and Cha Zhang and Furu Wei},
    year={2022},
    eprint={2203.02378},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

Acknowledgement

This repository is built using the timm library, the detectron2 library, the DeiT repository, the Dino repository, the BEiT repository and the MPViT repository.

Contact Information

For help or issues using DiT models, please submit a GitHub issue.

For other communications related to DiT, please contact Lei Cui ([email protected]), Furu Wei ([email protected]).