<!--Copyright 2021 The HuggingFace Team. All rights reserved. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer. -->

This model was released on 2021-04-18 and added to Hugging Face Transformers on 2021-11-03.

# LayoutXLM


## Overview

LayoutXLM was proposed in [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/abs/2104.08836) by Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei. It's a multilingual extension of the [LayoutLMv2](layoutlmv2) model trained on 53 languages.

The abstract from the paper is the following:

*Multimodal pre-training with text, layout, and image has achieved SOTA performance for visually-rich document understanding tasks recently, which demonstrates the great potential for joint learning across different modalities. In this paper, we present LayoutXLM, a multimodal pre-trained model for multilingual document understanding, which aims to bridge the language barriers for visually-rich document understanding. To accurately evaluate LayoutXLM, we also introduce a multilingual form understanding benchmark dataset named XFUN, which includes form understanding samples in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese), and key-value pairs are manually labeled for each language. Experiment results show that the LayoutXLM model has significantly outperformed the existing SOTA cross-lingual pre-trained models on the XFUN dataset.*

This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/microsoft/unilm).

## Usage tips and examples

One can directly plug in the weights of LayoutXLM into a LayoutLMv2 model, like so:

```python
from transformers import LayoutLMv2Model

model = LayoutLMv2Model.from_pretrained("microsoft/layoutxlm-base", device_map="auto")
```
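Since the architectures are identical, the same checkpoint also loads into LayoutLMv2's task-specific classes. A minimal sketch for token classification, where the classification head is newly initialized and `num_labels=7` is an arbitrary example:

```python
from transformers import LayoutLMv2ForTokenClassification

# The encoder weights come from the LayoutXLM checkpoint; the token
# classification head is randomly initialized and requires fine-tuning.
model = LayoutLMv2ForTokenClassification.from_pretrained(
    "microsoft/layoutxlm-base", num_labels=7  # label count is arbitrary here
)
```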

Note that LayoutXLM has its own tokenizer, [LayoutXLMTokenizer]/[LayoutXLMTokenizerFast]. You can initialize it as follows:

```python
from transformers import LayoutXLMTokenizer

tokenizer = LayoutXLMTokenizer.from_pretrained("microsoft/layoutxlm-base")
```
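Like the LayoutLMv2 tokenizers, it expects words together with their bounding boxes, normalized to a 0-1000 scale. A minimal sketch using the tokenizer initialized above, with made-up words and boxes:

```python
# Hypothetical OCR output: words plus (x0, y0, x1, y1) boxes on a 0-1000 scale.
words = ["Hello", "world"]
boxes = [[48, 84, 156, 108], [160, 84, 260, 108]]

# Each word's box is repeated for every subword token it is split into.
encoding = tokenizer(words, boxes=boxes, return_tensors="pt")
print(encoding.keys())  # input_ids, attention_mask, bbox, ...
```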

Similar to LayoutLMv2, you can use [LayoutXLMProcessor] (which internally applies [LayoutLMv2ImageProcessor] and [LayoutXLMTokenizer]/[LayoutXLMTokenizerFast] in sequence) to prepare all data for the model.
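For instance, here is a minimal sketch that prepares a scanned page end to end; `document.png` is a placeholder file name, and the default OCR path assumes Tesseract (via pytesseract) is installed:

```python
from PIL import Image
from transformers import LayoutXLMProcessor

processor = LayoutXLMProcessor.from_pretrained("microsoft/layoutxlm-base")

# By default the image processor applies OCR to extract words and boxes;
# set apply_ocr=False on the image processor to supply your own instead.
image = Image.open("document.png").convert("RGB")
encoding = processor(image, return_tensors="pt")

# encoding contains input_ids, attention_mask, bbox and image, which can be
# fed directly to the LayoutLMv2 model loaded earlier.
outputs = model(**encoding)
```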

<Tip>

As LayoutXLM's architecture is equivalent to that of LayoutLMv2, one can refer to [LayoutLMv2's documentation page](layoutlmv2) for all tips, code examples and notebooks.

</Tip>


## LayoutXLMTokenizer

[[autodoc]] LayoutXLMTokenizer
    - __call__
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary

## LayoutXLMTokenizerFast

[[autodoc]] LayoutXLMTokenizerFast
    - __call__

## LayoutXLMProcessor

[[autodoc]] LayoutXLMProcessor
    - __call__