LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding
Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou
Work in progress

Harbin Institute of Technology / Microsoft Research Asia / Microsoft Cloud & AI / Soochow University
{yxu,car}@ir.hit.edu.cn
{v-yixu,v-telv,lecu,fuwei,lidongz}@microsoft.com
{guow,yijlu,dinei,chazhang}@microsoft.com
[email protected]

Abstract
Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. In this paper, we present LayoutLMv2 by pre-training text, layout and image in a multi-modal framework, where new model architectures and pre-training tasks are leveraged. Specifically, LayoutLMv2 not only uses the existing masked visual-language modeling task but also the new text-image alignment and text-image matching tasks in the pre-training stage, where cross-modality interaction is better learned. Meanwhile, it also integrates a spatial-aware self-attention mechanism into the Transformer architecture, so that the model can fully understand the relative positional relationship among different text blocks. Experiment results show that LayoutLMv2 outperforms strong baselines and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD (0.7895 → 0.8420), CORD (0.9493 → 0.9601), SROIE (0.9524 → 0.9781), Kleister-NDA (0.834 → 0.852), RVL-CDIP (0.9443 → 0.9564), and DocVQA (0.7295 → 0.8672).

1 Introduction
Visually-rich Document Understanding (VrDU) aims to analyze scanned/digital-born business documents (images, PDFs, etc.) where structured information can be automatically extracted and organized for many business applications. Distinct from conventional information extraction tasks, the VrDU task not only relies on textual information, but also on visual and layout information that is vital for visually-rich documents. For instance, the documents in Figure 1 include a variety of types such as digital forms, receipts, invoices and financial reports. Different types of documents indicate that the text fields of interest are located at different positions within the document, which is often determined by the style and format of each type as well as the document content. Therefore, to accurately recognize the text fields of interest, it is inevitable to take advantage of the cross-modality nature of visually-rich documents, where the textual, visual and layout information should be jointly modeled and learned end-to-end in a single framework.

Figure 1: Visually-rich business documents with different layouts and formats: (a) Form, (b) Receipt, (c) Invoice, (d) Report

The recent progress of VrDU lies primarily in two directions. The first direction is usually built on the shallow fusion between textual and visual/layout/style information (Yang et al., 2017a; Liu et al., 2019; Sarkhel & Nandi, 2019; Yu et al., 2020; Majumder et al., 2020; Wei et al., 2020; Zhang et al., 2020). These approaches leverage pre-trained NLP and CV models individually and combine the information from multiple modalities for supervised learning. Although good performance has been achieved, these models often need to be re-trained from scratch once the document type is changed. In addition, the domain knowledge of one document type cannot be easily transferred into another document type, so the local invariance in general document layout (e.g. key-value pairs in a left-right layout, tables in a grid layout, etc.) cannot be fully exploited. To this end, the second direction relies on the deep fusion among textual, visual and layout information from a great number of unlabeled documents in different domains, where pre-training techniques play an important role in learning the cross-modality interaction in an end-to-end fashion (Lockard et al., 2020; Xu et al., 2020). In this way, the pre-trained models absorb cross-modal knowledge from different document types, where the local invariance among these layouts and styles is preserved. Furthermore, when the model needs to be transferred into another domain with different document formats, only a few labeled samples are sufficient to fine-tune the generic model to achieve state-of-the-art accuracy. Therefore, the proposed model in this paper follows the second direction, and we explore how to further improve the pre-training strategies for the VrDU task.

∗ Equal contributions during internship at MSRA. Corresponding authors: Lei Cui and Furu Wei.

In this paper, we present an improved version of LayoutLM (Xu et al., 2020), aka
LayoutLMv2. LayoutLM is a simple but effective pre-training method of text and layout for the VrDU task. Distinct from previous text-based pre-trained models, LayoutLM uses 2-D position embeddings and image embeddings in addition to the conventional text embeddings. During the pre-training stage, two training objectives are used, which are 1) a masked visual-language model and 2) multi-label document classification. The model is pre-trained with a great number of unlabeled scanned document images from the IIT-CDIP dataset (Lewis et al., 2006), and achieves very promising results on several downstream tasks. Extending the existing research work, we propose new model architectures and pre-training objectives in the LayoutLMv2 model. Different from the vanilla LayoutLM model, where image embeddings are combined in the fine-tuning stage, we integrate the image information in the pre-training stage in LayoutLMv2 by taking advantage of the Transformer architecture to learn the cross-modality interaction between visual and textual information. In addition, inspired by the 1-D relative position representations (Shaw et al., 2018; Raffel et al., 2020; Bao et al., 2020), we propose a spatial-aware self-attention mechanism for LayoutLMv2, which involves a 2-D relative position representation for token pairs. Different from the absolute 2-D position embeddings, the relative position embeddings explicitly provide a broader view for contextual spatial modeling. For the pre-training strategies, we use two new training objectives for LayoutLMv2 in addition to the masked visual-language model. The first is the proposed text-image alignment strategy, which covers text lines in the image and makes predictions on the text side to classify whether the token is covered or not on the image side. The second is the text-image matching strategy that is popular in previous vision-language pre-training models (Tan & Bansal, 2019; Lu et al., 2019; Su et al., 2020; Chen et al., 2020; Sun et al., 2019), where some images in the text-image pairs are randomly replaced with another document image to make the model learn whether the image and OCR texts are correlated or not. In this way, LayoutLMv2 is more capable of learning contextual textual and visual information and the cross-modal correlation in a single framework, which leads to better VrDU performance. We select 6 publicly available benchmark datasets as the downstream tasks to evaluate the performance of the pre-trained LayoutLMv2 model: the FUNSD dataset (Jaume et al., 2019) for form understanding, the CORD dataset (Park et al., 2019) and the SROIE dataset (Huang et al., 2019) for receipt understanding, the Kleister-NDA dataset (Graliński et al., 2020) for long document understanding with complex layout, the RVL-CDIP dataset (Harley et al., 2015) for document image classification, and the DocVQA dataset (Mathew et al., 2020) for visual question answering on document images.
Figure 2: An illustration of the model architecture and pre-training strategies for LayoutLMv2

Experiment results show that the LayoutLMv2 model outperforms strong baselines including the vanilla LayoutLM and achieves new state-of-the-art results in these downstream VrDU tasks, which substantially benefits a great number of real-world document understanding tasks.

The contributions of this paper are summarized as follows:

• We propose a multi-modal Transformer model to integrate the document text, layout and image information in the pre-training stage, which learns the cross-modal interaction end-to-end in a single framework.

• In addition to the masked visual-language model, we also add text-image matching and text-image alignment as new pre-training strategies to enforce the alignment among different modalities. Meanwhile, a spatial-aware self-attention mechanism is also integrated into the Transformer architecture.

• LayoutLMv2 not only outperforms the baseline models on the conventional VrDU tasks, but also achieves new SOTA results on the VQA task for document images, which demonstrates the great potential of multi-modal pre-training for VrDU.
2 Approach
The overall illustration of the proposed LayoutLMv2 is shown in Figure 2. In this section, we will introduce the model architecture and pre-training tasks of LayoutLMv2.

2.1 Model Architecture
We build an enhanced Transformer architecture for the VrDU tasks, i.e. a multi-modal Transformer, as the backbone of LayoutLMv2. The multi-modal Transformer accepts inputs of three modalities: text, image, and layout. The input of each modality is converted to an embedding sequence and fused by the encoder. The model establishes deep interactions within and between modalities by leveraging the powerful Transformer layers. The model details are introduced as follows, where some dropout and normalization layers are omitted.
Text Embedding
We recognize text and serialize it in a reasonable reading order using off-the-shelf OCR tools and PDF parsers. Following the common practice, we use WordPiece (Wu et al., 2016) to tokenize the text sequence and assign each token to a certain segment $s_i \in \{\texttt{[A]}, \texttt{[B]}\}$. Then, we add a [CLS] at the beginning of the token sequence and a [SEP] at the end of each text segment. The length of the text sequence is limited to ensure that the length of the final sequence is not greater than the maximum sequence length $L$. Extra [PAD] tokens are appended after the last [SEP] token to fill the gap if the token sequence is still shorter than $L$ tokens. In this way, we get an input token sequence of the form $S = \{\texttt{[CLS]}, w_1, w_2, \ldots, \texttt{[SEP]}, \texttt{[PAD]}, \texttt{[PAD]}, \ldots\}$ with $|S| = L$.

The final text embedding is the sum of three embeddings. The token embedding represents the token itself, the 1D positional embedding represents the token index, and the segment embedding is used to distinguish different text segments. Formally, the $i$-th text embedding is

$$t_i = \mathrm{TokEmb}(w_i) + \mathrm{PosEmb1D}(i) + \mathrm{SegEmb}(s_i), \quad 0 \le i < L$$
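As a concrete illustration, here is a minimal PyTorch sketch of this three-way sum. The module layout and the default sizes (e.g., the BERT-style vocabulary size) are our assumptions for illustration, not the released implementation:

```python
import torch
import torch.nn as nn

class TextEmbedding(nn.Module):
    """Sum of token, 1D position, and segment embeddings (illustrative sketch)."""
    def __init__(self, vocab_size=30522, hidden=768, max_len=512, num_segments=3):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.pos1d = nn.Embedding(max_len, hidden)     # shared with the visual side
        self.seg = nn.Embedding(num_segments, hidden)  # segments [A], [B], [C]

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, L) integer tensors
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.pos1d(positions) + self.seg(segment_ids)
```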
Visual Embedding
We use the ResNeXt-FPN (Xie et al., 2016; Lin et al., 2017) architecture as the backbone of the visual encoder. Given a document page image $I$, it is resized to $224 \times 224$ and then fed into the visual backbone. After that, the output feature map is average-pooled to a fixed size with width $W$ and height $H$. Next, it is flattened into a visual embedding sequence of length $WH$. A linear projection layer is then applied to each visual token embedding in order to unify the dimensions. Since the CNN-based visual backbone cannot capture positional information, we also add a 1D positional embedding to these image token embeddings. The 1D positional embedding is shared with the text embedding layer. For the segment embedding, we attach all visual tokens to the visual segment [C]. The $i$-th visual embedding can be represented as

$$v_i = \mathrm{Proj}\big(\mathrm{VisTokEmb}(I)_i\big) + \mathrm{PosEmb1D}(i) + \mathrm{SegEmb}(\texttt{[C]}), \quad 0 \le i < WH$$
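A hedged sketch of this visual pipeline follows; the FPN output channel count is an assumption, since the paper only specifies average pooling to a $W \times H$ grid followed by a linear projection:

```python
import torch
import torch.nn as nn

class VisualEmbedding(nn.Module):
    """Pool a CNN feature map to W x H tokens and project to the hidden size."""
    def __init__(self, cnn_channels=256, hidden=768, W=7, H=7, max_len=512):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d((H, W))
        self.proj = nn.Linear(cnn_channels, hidden)
        self.pos1d = nn.Embedding(max_len, hidden)  # shared with the text side
        self.seg = nn.Embedding(3, hidden)

    def forward(self, feature_map):
        # feature_map: (batch, C, H', W') from the ResNeXt-FPN backbone
        x = self.pool(feature_map)                # (batch, C, H, W)
        x = x.flatten(2).transpose(1, 2)          # (batch, W*H, C)
        x = self.proj(x)                          # (batch, W*H, hidden)
        positions = torch.arange(x.size(1), device=x.device)
        seg_c = torch.full_like(positions, 2)     # segment [C] for all visual tokens
        return x + self.pos1d(positions) + self.seg(seg_c)
```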
Layout Embedding
The layout embedding layer aims to embed the spatial layout information represented by token bounding boxes, in which corner coordinates and box shapes are identified explicitly. Following the vanilla LayoutLM, we normalize and discretize all coordinates to integers in the range [0, 1000], and use two embedding layers to embed x-axis features and y-axis features separately. Given the normalized bounding box of the $i$-th text/visual token $\mathrm{box}_i = (x_{\min}, x_{\max}, y_{\min}, y_{\max}, w, h)$, the layout embedding layer concatenates six bounding box features to construct a token-level layout embedding, aka the 2D positional embedding

$$l_i = \mathrm{Concat}\big(\mathrm{PosEmb2D}_x(x_{\min}, x_{\max}, w), \mathrm{PosEmb2D}_y(y_{\min}, y_{\max}, h)\big), \quad 0 \le i < WH + L$$

Note that CNNs perform local transformation, thus the visual token embeddings can be mapped back to image regions one by one with neither overlap nor omission. From the point of view of the layout embedding layer, the visual tokens can be treated as evenly divided grids, so their bounding box coordinates are easy to calculate. An empty bounding box $\mathrm{box}_{\texttt{PAD}} = (0, 0, 0, 0, 0, 0)$ is attached to the special tokens [CLS], [SEP] and [PAD].
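A minimal sketch of this concatenation, assuming the [0, 1000] coordinate range above and an even split of the hidden size over the six features (768 / 6 = 128, our assumption):

```python
import torch
import torch.nn as nn

class LayoutEmbedding(nn.Module):
    """Concatenate six coordinate embeddings into one 2D position embedding."""
    def __init__(self, hidden=768, n_bins=1001):
        super().__init__()
        assert hidden % 6 == 0               # each feature gets hidden // 6 dims
        self.x_emb = nn.Embedding(n_bins, hidden // 6)  # x_min, x_max, width
        self.y_emb = nn.Embedding(n_bins, hidden // 6)  # y_min, y_max, height

    def forward(self, boxes):
        # boxes: (batch, seq, 6) integers (x_min, x_max, y_min, y_max, w, h)
        x0, x1, y0, y1, w, h = boxes.unbind(dim=-1)
        parts = [self.x_emb(x0), self.x_emb(x1), self.x_emb(w),
                 self.y_emb(y0), self.y_emb(y1), self.y_emb(h)]
        return torch.cat(parts, dim=-1)      # (batch, seq, hidden)
```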
Multi-modal Encoder with Spatial-Aware Self-Attention Mechanism
The encoder concatenates the visual embeddings $\{v_0, \ldots, v_{WH-1}\}$ and the text embeddings $\{t_0, \ldots, t_{L-1}\}$ into a unified sequence $X$ and fuses spatial information by adding the layout embeddings to get the first-layer input

$$x_i^{(0)} = X_i + l_i, \quad \text{where } X = \{v_0, \ldots, v_{WH-1}, t_0, \ldots, t_{L-1}\}$$

Following the architecture of the Transformer, we build our multi-modal encoder with a stack of multi-head self-attention layers followed by a feed-forward network. However, the original self-attention mechanism can only implicitly capture the relationship between the input tokens through absolute position hints. In order to efficiently model local invariance in the document layout, it is necessary to insert relative position information explicitly. Therefore, we introduce the spatial-aware self-attention mechanism into the self-attention layers. The original self-attention mechanism captures the correlation between query $x_i$ and key $x_j$ by projecting the two vectors and calculating the attention score

$$\alpha_{ij} = \frac{1}{\sqrt{d_{\mathrm{head}}}} \big(x_i W^Q\big) \big(x_j W^K\big)^{\top}$$

We jointly model the semantic relative position and the spatial relative position as bias terms and explicitly add them to the attention score. Let $b^{(\mathrm{1D})}$, $b^{(\mathrm{2D}_x)}$ and $b^{(\mathrm{2D}_y)}$ denote the learnable 1D and 2D relative position biases respectively. The biases are different among attention heads but shared in all encoder layers. Assuming $(x_i, y_i)$ anchors the top-left corner coordinates of the $i$-th bounding box, we obtain the spatial-aware attention score

$$\alpha'_{ij} = \alpha_{ij} + b^{(\mathrm{1D})}_{j-i} + b^{(\mathrm{2D}_x)}_{x_j - x_i} + b^{(\mathrm{2D}_y)}_{y_j - y_i}$$

Finally, the output vectors are represented as the weighted average of all the projected value vectors with respect to the normalized spatial-aware attention scores

$$h_i = \sum_{j} \frac{\exp\big(\alpha'_{ij}\big)}{\sum_{k} \exp\big(\alpha'_{ik}\big)} \, x_j W^V$$
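A single-head, batch-free sketch of the biased score computation is given below. Clamping relative distances into a fixed-size bias table is our simplification; the paper does not spell out how out-of-range offsets are handled:

```python
import torch

def spatial_aware_scores(x, w_q, w_k, rel_1d, rel_2d_x, rel_2d_y,
                         pos_1d, pos_x, pos_y, max_rel):
    """Attention scores with 1D and 2D relative position biases (one head).

    x: (seq, d) input states; w_q, w_k: (d, d_head) projections.
    rel_1d, rel_2d_x, rel_2d_y: learnable bias tables of size 2*max_rel + 1.
    pos_1d / pos_x / pos_y: sequence index and top-left box corner per token.
    """
    d_head = w_q.size(-1)
    q, k = x @ w_q, x @ w_k                         # (seq, d_head)
    alpha = q @ k.transpose(-1, -2) / d_head ** 0.5  # alpha_ij

    def index(delta):
        # clamp relative distances into the bias-table range (our assumption)
        return delta.clamp(-max_rel, max_rel) + max_rel

    b1d = rel_1d[index(pos_1d[None, :] - pos_1d[:, None])]
    b2x = rel_2d_x[index(pos_x[None, :] - pos_x[:, None])]
    b2y = rel_2d_y[index(pos_y[None, :] - pos_y[:, None])]
    return alpha + b1d + b2x + b2y                  # alpha'_ij

# The layer output is then softmax(scores, dim=-1) @ (x @ w_v), per the equation above.
```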
2.2 Pre-training Tasks

We adopt three self-supervised tasks simultaneously during the pre-training stage, which are described as follows.
Masked Visual-Language Modeling
Similar to the vanilla LayoutLM, we use Masked Visual-Language Modeling (MVLM) to make the model learn better on the language side with cross-modality clues. We randomly mask some text tokens and ask the model to recover the masked tokens. Meanwhile, the layout information remains unchanged, which means the model knows each masked token's location on the page. The output representations of the masked tokens from the encoder are fed into a classifier over the whole vocabulary, driven by a cross-entropy loss. To avoid visual clue leakage, we mask the image regions corresponding to masked tokens on the raw page image input before feeding it into the visual encoder. MVLM helps the model capture the features of nearby tokens. For instance, a masked blank in a table surrounded by lots of numbers is more likely to be a number. Moreover, given the spatial position of a blank, the model is capable of using the surrounding visual information to help predict the token.
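The token-side masking policy is the BERT-style 80/10/10 split stated in Section 3.2; a minimal sketch (the helper name and list-based interface are ours):

```python
import random

def mvlm_mask(tokens, vocab, mask_token="[MASK]", mask_prob=0.15):
    """BERT-style masking used by MVLM: 80% [MASK], 10% random, 10% unchanged.

    Layout (bounding boxes) is left untouched; the matching image regions are
    blanked separately before the visual encoder sees the page.
    """
    out, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                      # the model must recover this token
            r = random.random()
            if r < 0.8:
                out[i] = mask_token
            elif r < 0.9:
                out[i] = random.choice(vocab)    # random token from the vocabulary
            # else: keep the original token
    return out, labels
```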
Text-Image Alignment
In addition to MVLM, we propose Text-Image Alignment (TIA) as a fine-grained cross-modality alignment task. In the TIA task, some text tokens are randomly selected, and their image regions are covered on the document image. We call this operation covering to avoid confusion with the masking operation in MVLM. During the pre-training, a classification layer is built above the encoder outputs. This layer predicts a label for each text token depending on whether it is covered, i.e., [Covered] or [Not Covered], and computes the binary cross-entropy loss. Considering that the input image's resolution is limited, the covering operation is performed at the line level. When MVLM and TIA are performed simultaneously, the TIA losses of the tokens masked in MVLM are not taken into account. This prevents the model from learning the useless but straightforward correspondence from [MASK] to [Covered].
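A sketch of the line-level covering and label construction, using the 15% line rate from Section 3.2 (the data structures are our assumptions):

```python
import random

def tia_cover(lines, cover_prob=0.15):
    """Select whole OCR lines to cover on the image and label their tokens.

    lines: list of token lists, one per OCR line. Returns the covered line
    indices and one binary label per token (1 = [Covered], 0 = [Not Covered]);
    tokens already masked by MVLM are excluded from the TIA loss by the caller.
    """
    covered = {i for i in range(len(lines)) if random.random() < cover_prob}
    labels = []
    for i, line in enumerate(lines):
        labels.extend([1 if i in covered else 0] * len(line))
    return covered, labels
```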
Text-Image Matching
Furthermore, a coarse-grained cross-modality alignment task, Text-Image Matching (TIM), is applied during the pre-training stage. We feed the output representation at [CLS] into a classifier to predict whether the image and the text are from the same document page. Regular inputs are positive samples. To construct a negative sample, the image is either replaced by a page image from another document or dropped. To prevent the model from cheating by finding task features, we perform the same masking and covering operations on the images in negative samples. The TIA target labels are all set to [Covered] in negative samples. We apply the binary cross-entropy loss in the optimization process.
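The negative-sample construction, with the 15%/5% rates from Section 3.2, can be sketched as:

```python
import random

def tim_negative(image, other_images, replace_prob=0.15, drop_prob=0.05):
    """Build the TIM input: keep, replace, or drop the page image.

    Returns (image_or_None, tim_label) where tim_label is 1 for a matched
    text-image pair and 0 for a negative sample.
    """
    r = random.random()
    if r < replace_prob:
        return random.choice(other_images), 0   # mismatched page image
    if r < replace_prob + drop_prob:
        return None, 0                          # image dropped
    return image, 1                             # regular (positive) pair
```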
2.3 Fine-tuning
LayoutLMv2 produces representations with fused cross-modality information, which benefits a variety of VrDU tasks. Its output sequence provides representations at the token level. Specifically, the output at [CLS] can be used as the global feature. For many downstream tasks, we only need to build a task-specific head layer over the LayoutLMv2 outputs and fine-tune the whole model using an appropriate loss. In this way, LayoutLMv2 leads to much better VrDU performance by integrating the text, layout, and image information in a single multi-modal framework, which significantly improves the cross-modal correlation compared to the vanilla LayoutLM model.
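A generic head of this kind is a single linear layer over the encoder states; the sketch below (with dropout rate and label count as illustrative assumptions) covers both the token-level and [CLS]-based uses:

```python
import torch.nn as nn

class TokenClassifierHead(nn.Module):
    """Generic fine-tuning head over LayoutLMv2 output states (sketch)."""
    def __init__(self, hidden=768, num_labels=7, dropout=0.1):
        super().__init__()
        self.drop = nn.Dropout(dropout)
        self.cls = nn.Linear(hidden, num_labels)

    def forward(self, hidden_states):
        # hidden_states: (batch, seq, hidden); use position 0 ([CLS]) for
        # sequence-level tasks and every text position for token-level tasks
        return self.cls(self.drop(hidden_states))
```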
3 Experiments
3.1 Data
In order to pre-train and evaluate LayoutLMv2 models, we select datasets from a wide range of the visually-rich document understanding area. An introduction to each dataset and its task definition, along with a description of the required data pre-processing, is presented as follows.
Pre-training Dataset
Following LayoutLM, we pre-train LayoutLMv2 on the IIT-CDIP Test Collection (Lewis et al., 2006), which contains over 11 million scanned document pages. We extract text and the corresponding word-level bounding boxes from the document page images with the Microsoft Read API (https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-recognizing-text).

FUNSD
FUNSD (Jaume et al., 2019) is a dataset for form understanding in noisy scanned documents. It contains 199 real, fully annotated, scanned forms where 9,707 semantic entities are annotated over 31,485 words. The 199 samples are split into 149 for training and 50 for testing. The official OCR annotation is directly used with the layout information. The FUNSD dataset is suitable for a variety of tasks; in this paper we focus on semantic entity labeling. Specifically, the task is to assign to each word a semantic entity label from a set of four predefined categories: question, answer, header or other. The entity-level F1 score is used as the evaluation metric.
CORD
We also evaluate our model on the receipt key information extraction dataset, i.e. the publicly available subset of CORD (Park et al., 2019). The dataset includes 800 receipts for the training set, 100 for the validation set and 100 for the test set. A photo and a list of OCR annotations are provided for each receipt. An ROI that encompasses the receipt region is provided along with each photo, because there can be irrelevant things in the background. We only use the ROI as input instead of the raw photo. The dataset defines 30 fields under 4 categories, and the task aims to label each word with the right field. The evaluation metric is entity-level F1. We use the official OCR annotations.
SROIE
The SROIE dataset (Task 3) (Huang et al., 2019) aims to extract information from scanned receipts. There are 626 samples for training and 347 samples for testing in the dataset. The task is to extract values from each receipt for up to four predefined keys: company, date, address or total. The evaluation metric is entity-level F1. We use the official OCR annotations, and results on the test set are provided by the official evaluation site.
Kleister-NDA
Kleister-NDA (Graliński et al., 2020) contains non-disclosure agreements collected from the EDGAR database, including 254 documents for training, 83 documents for validation, and 203 documents for testing. The task is defined as extracting the values of four fixed keys. We get the entity-level F1 score from the official evaluation tools (https://gitlab.com/filipg/geval). Words and bounding boxes are extracted from the raw PDF files. We use heuristics to locate entity spans, because the normalized standard answers may not appear verbatim in the text.

RVL-CDIP
RVL-CDIP (Harley et al., 2015) consists of 400,000 grayscale images, split 8:1:1 into the training set, validation set, and test set. A multi-class single-label classification task is defined on RVL-CDIP. The images are categorized into 16 classes, with 25,000 images per class. The evaluation metric is the overall classification accuracy. Text and layout information is extracted by Microsoft OCR.
DocVQA
As a VQA dataset in the document understanding field, DocVQA (Mathew et al., 2020) consists of 50,000 questions defined over more than 12,000 pages from a variety of documents. Pages are split into the training set, validation set and test set with a ratio of about 8:1:1. The dataset is organized as a set of ⟨page image, questions, answers⟩ triples. Thus, we use the Microsoft Read API to extract text and bounding boxes from the images. Heuristics are used to find the given answers in the extracted text. The task is evaluated using an edit distance based metric, ANLS (aka average normalized Levenshtein similarity). Given that human performance is about 98% ANLS on the test set, it is reasonable to assume that the found ground truth, which reaches over 97% ANLS on the training and validation sets, is good enough to train a model. Results on the test set are provided by the official evaluation site.
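For reference, ANLS as defined by the DocVQA benchmark (the 0.5 threshold comes from the benchmark, not from this paper) can be computed as:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def anls(predictions, gold_answers, tau=0.5):
    """Average Normalized Levenshtein Similarity over all questions.

    predictions: list of strings; gold_answers: list of lists of strings
    (each question may have several acceptable answers).
    """
    total = 0.0
    for pred, golds in zip(predictions, gold_answers):
        best = 0.0
        for g in golds:
            nl = levenshtein(pred.lower(), g.lower()) / max(len(pred), len(g), 1)
            best = max(best, 1 - nl if nl < tau else 0.0)
        total += best
    return total / len(predictions)
```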
3.2 Settings
Following the typical pre-training and fine-tuning strategy, we update all parameters and train the whole models end-to-end for all the settings.
Pre-training LayoutLMv2
We train LayoutLMv2 models with two different parameter sizes. We set the hidden size d = 768 in LayoutLMv2 BASE and use a 12-layer, 12-head Transformer encoder, while in LayoutLMv2 LARGE, d = 1024 and its encoder has 24 Transformer layers with 16 heads. The visual backbones in the two models use the same ResNeXt101-FPN architecture. The numbers of parameters are approximately 200M and 426M for LayoutLMv2 BASE and LayoutLMv2 LARGE, respectively.

The model is initialized from existing pre-trained model checkpoints. For the encoder along with the text embedding layer, LayoutLMv2 uses the same architecture as UniLMv2 (Bao et al., 2020), thus it is initialized from UniLMv2. For the ResNeXt-FPN part in the visual embedding layer, the backbone of a Mask-RCNN (He et al., 2017) model trained on PubLayNet (Zhong et al., 2019) is leveraged (the "MaskRCNN ResNeXt101 32x8d FPN 3X" setting in https://github.com/hpanwar08/detectron2). The rest of the parameters in the model are randomly initialized. We pre-train LayoutLMv2 models using the Adam optimizer (Kingma & Ba, 2017; Loshchilov & Hutter, 2019) with weight decay and (β₁, β₂) = (0.9, 0.999). The learning rate is linearly warmed up over the first training steps and then linearly decayed. Both LayoutLMv2 BASE and LayoutLMv2 LARGE are trained on the IIT-CDIP dataset.

During the pre-training, we sample pages from the IIT-CDIP dataset and select a random sliding window of the text sequence if the sample is too long. We set the maximum sequence length L = 512 and assign all text tokens to the segment [A]. The output shape of the adaptive pooling layer is set to W = H = 7, so that it transforms the feature map into 49 image tokens. In MVLM, 15% of text tokens are masked, among which 80% are replaced by the special token [MASK], 10% are replaced by a random token sampled from the whole vocabulary, and 10% remain the same. In TIA, 15% of the lines are covered. In TIM, 15% of images are replaced and 5% are dropped.

Fine-tuning LayoutLMv2 for Visual Question Answering
We treat DocVQA as an extractive QA task and build a token-level classifier on top of the text part of the LayoutLMv2 output representations. Question tokens, context tokens and visual tokens are assigned to segments [A], [B] and [C], respectively. In the DocVQA paper, experiment results show that a BERT model fine-tuned on the SQuAD dataset (Rajpurkar et al., 2016) outperforms the original BERT model. Inspired by this, we add an extra setting in which we first fine-tune LayoutLMv2 on a Question Generation (QG) dataset and then on the DocVQA dataset. The QG dataset contains almost one million question-answer pairs generated by a generation model trained on the SQuAD dataset.
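The paper only says "token-level classifier"; a SQuAD-style start/end span head is one common realization, sketched here under that assumption:

```python
import torch.nn as nn

class ExtractiveQAHead(nn.Module):
    """Predict answer start/end positions over the context tokens (sketch)."""
    def __init__(self, hidden=768):
        super().__init__()
        self.span = nn.Linear(hidden, 2)  # start and end logits per token

    def forward(self, hidden_states):
        # hidden_states: (batch, seq, hidden), text part of the output only
        logits = self.span(hidden_states)           # (batch, seq, 2)
        start, end = logits.split(1, dim=-1)
        return start.squeeze(-1), end.squeeze(-1)   # (batch, seq) each
```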
Fine-tuning LayoutLMv2 for Document Image Classification
This task depends on high-level visual information, so we leverage the image features explicitly during fine-tuning. We pool the visual embeddings into a global pre-encoder feature, and pool the visual part of the LayoutLMv2 output representations into a global post-encoder feature. The pre- and post-encoder features, along with the [CLS] output feature, are concatenated and fed into the final classification layer.
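A sketch of this three-way concatenation follows; the paper does not specify the pooling operator, so mean pooling here is our assumption:

```python
import torch
import torch.nn as nn

class DocClassifierHead(nn.Module):
    """Concatenate [CLS], pooled pre-encoder and pooled post-encoder visual
    features, then classify (16 classes for RVL-CDIP)."""
    def __init__(self, hidden=768, num_classes=16):
        super().__init__()
        self.cls = nn.Linear(hidden * 3, num_classes)

    def forward(self, cls_output, visual_embeds, visual_outputs):
        # visual_embeds / visual_outputs: (batch, W*H, hidden)
        pre = visual_embeds.mean(dim=1)    # global pre-encoder feature
        post = visual_outputs.mean(dim=1)  # global post-encoder feature
        return self.cls(torch.cat([cls_output, pre, post], dim=-1))
```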
Fine-tuning LayoutLMv2 for Sequence Labeling
We formalize FUNSD, SROIE, CORD and Kleister-NDA as sequence labeling tasks. To fine-tune LayoutLMv2 models on these tasks, we build a token-level classification layer above the text part of the output representations to predict the BIO tags for each entity field.
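Evaluation at the entity level then requires grouping the per-token BIO predictions into spans; a standard decoder (not specific to this paper) looks like:

```python
def bio_to_spans(tags):
    """Collect (label, start, end) entity spans from per-token BIO tags."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):   # sentinel flushes the last span
        boundary = tag.startswith("B-") or tag == "O" or \
                   (tag.startswith("I-") and tag[2:] != label)
        if boundary and start is not None:
            spans.append((label, start, i))  # end index is exclusive
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            start, label = i, tag[2:]        # tolerate I- without a preceding B-
    return spans
```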
Baselines
We select 3 baseline models in the experiments to compare LayoutLMv2 with the SOTA text-only pre-trained models as well as the vanilla LayoutLM model. Specifically, we compare LayoutLMv2 with BERT (Devlin et al., 2019), UniLMv2 (Bao et al., 2020) and LayoutLM (Xu et al., 2020) for all the experiment settings. We use the publicly available PyTorch models for BERT (Wolf et al., 2020) and LayoutLM (https://github.com/microsoft/unilm/tree/master/layoutlm), and use our in-house implementation for the UniLMv2 models. For each baseline approach, experiments are conducted using both the BASE and LARGE parameter settings.

3.3 Results
FUNSD
Table 1 shows the model accuracy on the FUNSD dataset, which is evaluated using entity-level precision, recall and F1 score. For the text-only models, the UniLMv2 models outperform the BERT models by a large margin in both the BASE and LARGE settings. For the text+layout models, the LayoutLM family brings significant performance improvement over the text-only baselines, especially the LayoutLMv2 models. The best performance is achieved by LayoutLMv2 LARGE, with an improvement of about 3 F1 points over the current SOTA results. This illustrates that the multi-modal pre-training in LayoutLMv2 learns better from the interactions between different modalities, thereby leading to the new SOTA on the form understanding task.
Table 1: Model accuracy (entity-level Precision, Recall, F1) on the FUNSD dataset
CORD
Table 2 gives the entity-level precision, recall and F1 scores on the CORD dataset. The LayoutLM family significantly outperforms the text-only pre-trained models including BERT and UniLMv2, especially the LayoutLMv2 models. Compared to the baselines, the LayoutLMv2 models are also superior to the "SPADE" decoder method, as well as the "BROS" approach that is built on the "SPADE" decoder, which confirms the effectiveness of pre-training the text, layout and image information.
SROIE
Table 3 lists the entity-level precision, recall, and F1 score on Task 3 of the SROIE challenge. Compared to the text-only pre-trained language models, our LayoutLM family models achieve significant improvement by integrating cross-modal interactions. Moreover, with the same modal information, our LayoutLMv2 models also outperform existing multi-modal approaches (Anonymous, 2021; Yu et al., 2020; Zhang et al., 2020), which demonstrates the model effectiveness. Eventually, the LayoutLMv2 LARGE single model can even beat the top-1 submission on the SROIE leaderboard (unpublished results; the leaderboard is available at https://rrc.cvc.uab.es/?ch=13&com=evaluation&task=3).
Table 2: Model accuracy (entity-level Precision, Recall, F1) on the CORD dataset
Table 3: Model accuracy (entity-level Precision, Recall, F1) on the SROIE dataset (until 2020-12-24)
Kleister-NDA
Table 4 gives the entity-level F1 score on the Kleister-NDA dataset. As the labeled answers are normalized into a canonical form, we apply post-processing heuristics to convert the extracted date information into the "YYYY-MM-DD" format, and company names into abbreviations such as "LLC" and "Inc.". We report the evaluation results on the validation set, because the ground-truth labels and the submission website for the test set are not available right now. The experiment results show that the LayoutLMv2 models improve over the text-only and vanilla LayoutLM models by a large margin for the lengthy NDA documents, which also demonstrates that LayoutLMv2 can handle complex layout information much better than previous models.
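A heuristic of this kind can be as simple as the sketch below; the paper states only that heuristic post-processing is applied, so the specific formats and the helper name are illustrative assumptions:

```python
import re
from datetime import datetime

def normalize_date(text):
    """Best-effort conversion of an extracted date string to YYYY-MM-DD."""
    # strip ordinal suffixes, e.g. "June 3rd, 2020" -> "June 3, 2020"
    cleaned = re.sub(r"(\d)(st|nd|rd|th)", r"\1", text.strip())
    for fmt in ("%B %d, %Y", "%d %B %Y", "%m/%d/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(cleaned, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # leave unmatched strings for further rules
```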
RVL-CDIP
Table 5 shows the classification accuracy on the RVL-CDIP dataset, including text-only pre-trained models, the LayoutLM family, as well as several image-based baseline models. As shown in the table, both the text and the image information is important to the document image classification task, because document images are text-intensive and appear in a variety of layouts and formats. Therefore, the LayoutLM family outperforms the text-only or image-only models, as it leverages the multi-modal information within the documents. Specifically, the LayoutLMv2 LARGE model improves the classification accuracy by more than 1.2 points over the previous SOTA results, reaching an accuracy of 95.64%. This also verifies that the pre-trained LayoutLMv2 model benefits not only the information extraction tasks in document understanding but also the document image classification task through effective model training across different modalities.
Model                               F1      #Parameters
BASE in (Graliński et al., 2020)    0.793   125M

Table 4: Model accuracy (entity-level F1) on the validation set of the Kleister-NDA dataset using the official evaluation toolkit
Model                                  Accuracy   #Parameters
LayoutLM BASE (w/ image)               94.42%     160M
LayoutLM LARGE (w/ image)              94.43%     390M
LayoutLMv2 LARGE                       95.64%     426M
(Szegedy et al., 2016)                 92.63%     -
LadderNet (Sarkhel & Nandi, 2019)      92.77%     -
Single model (Dauphinee et al., 2019)  93.03%     -
Ensemble (Dauphinee et al., 2019)      93.07%     -

Table 5: Classification accuracy on the RVL-CDIP dataset (image-based baseline numbers from https://medium.com/@jdegange85/benchmarking-modern-cnn-architectures-to-rvl-cdip-9dd0b7ec2955)
DocVQA
Table 6 lists the Average Normalized Levenshtein Similarity (ANLS) scores on the DocVQA dataset for the text-only baselines, the LayoutLM family models, and the previous top-1 result on the leaderboard (unpublished results; the leaderboard is available at https://rrc.cvc.uab.es/?ch=17&com=evaluation&task=1). With multi-modal pre-training, LayoutLMv2 models outperform LayoutLM models and text-only baselines by a large margin when fine-tuned on the train set. By using all data (train + dev) as the fine-tuning dataset, the LayoutLMv2 LARGE single model outperforms the previous top-1 on the leaderboard, which ensembles 30 models. Under the setting of fine-tuning LayoutLMv2 LARGE on a question generation dataset (QG) and the DocVQA dataset successively, the single-model performance increases by more than 1.6% ANLS and achieves the new SOTA.

3.4 Ablation Study
To fully understand the underlying impact of the different components, we conduct an ablation study to explore the effect of visual information, the pre-training tasks, the spatial-aware self-attention mechanism, as well as different initialization models. Table 7 shows model performance on the DocVQA validation set. Under all the settings, we pre-train the models using all IIT-CDIP data for one epoch. The hyper-parameters are the same as those used to pre-train LayoutLMv2 BASE in Section 3.2.
Model                                         Fine-tuning set   ANLS     #Parameters
BERT BASE                                     train             0.6354   110M
UniLMv2 BASE                                  train             0.7134   125M
BERT LARGE                                    train             0.6768   340M
UniLMv2 LARGE                                 train             0.7709   355M
LayoutLM BASE                                 train             0.6979   113M
LayoutLM LARGE                                train             0.7259   343M
LayoutLMv2 BASE                               train             0.7808   200M
LayoutLMv2 LARGE                              train             0.8348   426M
LayoutLMv2 LARGE                              train + dev       0.8529   426M
LayoutLMv2 LARGE + QG                         train + dev       0.8672   426M
Top-1 on the leaderboard (30-model ensemble)  -                 0.8506   -

Table 6: Average Normalized Levenshtein Similarity (ANLS) score on the DocVQA dataset (until 2020-12-24); "QG" denotes the data augmentation with the question generation dataset.
Table 7: Ablation study on the DocVQA dataset, where ANLS scores on the validation set are reported. "SASAM" means the spatial-aware self-attention mechanism. "MVLM", "TIA" and "TIM" are the three proposed pre-training tasks. All the models are trained using all IIT-CDIP data for 1 epoch with the BASE model size.

"LayoutLM" denotes the vanilla LayoutLM architecture in (Xu et al., 2020), which can be regarded as a LayoutLMv2 architecture without the visual module and the spatial-aware self-attention mechanism. "X101-FPN" denotes the ResNeXt101-FPN visual backbone described in Section 3.2. We first evaluate the effect of introducing visual information. By comparing
4 Related Work
With the development of conventional machine learning, statistical machine learning approaches (Shilman et al., 2005; Marinai et al., 2005) became the mainstream for document segmentation tasks during the past decade. Shilman et al. (2005) consider the layout of a document as a parsing problem, and globally search for the optimal parsing tree based on a grammar-based loss function. They utilize a machine learning approach to select features and train all parameters during the parsing process. Meanwhile, artificial neural networks (Marinai et al., 2005) have been extensively applied to document analysis and recognition. Most efforts have been devoted to the recognition of isolated handwritten and printed characters, with widely recognized successful results. In addition to ANN models, SVMs and GMMs (Wei et al., 2013) have been used in document layout analysis tasks. Machine learning approaches are usually time-consuming, as they require manually crafted features, and they have difficulty capturing highly abstract semantic context. In addition, these methods usually relied on visual cues while ignoring textual information.

Deep learning methods have become the mainstream and de facto standard for many machine learning problems. Theoretically, they can fit arbitrary functions through the stacking of multi-layer neural networks and have been verified to be effective in many research areas. Yang et al. (2017b) treat the document semantic structure extraction task as a pixel-by-pixel classification problem. They propose a multi-modal neural network that considers visual and textual information, while the limitation of this work is that the network is only used to assist heuristic algorithms in classifying candidate bounding boxes, rather than being an end-to-end approach. Viana & Oliveira (2017) propose a lightweight model of document layout analysis for mobile and cloud services. The model uses one-dimensional information of images for inference and compares it with a model using two-dimensional information, achieving comparable accuracy in the experiments. Katti et al. (2018) make use of a fully convolutional encoder-decoder network that predicts a segmentation mask and bounding boxes, and the model significantly outperforms approaches based on sequential text or document images. Soto & Yoo (2019) incorporate contextual information into the Faster R-CNN model, exploiting the inherently localized nature of article contents to improve region detection performance.

In recent years, pre-training techniques have become more and more popular in both the NLP and CV areas, and have also been leveraged in VrDU tasks. Devlin et al. (2019) introduced a new language representation model called BERT, which is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks. Bao et al. (2020) propose to pre-train a unified language model for both autoencoding and partially autoregressive language modeling tasks using a novel training procedure referred to as a pseudo-masked language model. In addition, the two tasks pre-train a unified language model as a bidirectional encoder and a sequence-to-sequence decoder, respectively. Lu et al. (2019) proposed ViLBERT for learning task-agnostic joint representations of image content and natural language by extending the popular BERT architecture to a multi-modal two-stream model. Su et al. (2020) proposed VL-BERT, which adopts the Transformer model as the backbone and extends it to take both visual and linguistic embedded features as input. Xu et al. (2020) proposed LayoutLM to jointly model interactions between text and layout information across scanned document images, which is beneficial for a great number of real-world document image understanding tasks such as information extraction from scanned documents. This work is a natural extension of the vanilla LayoutLM, which takes advantage of textual, layout and visual information in a single multi-modal pre-training framework.
5 Conclusion
In this paper, we present a multi-modal pre-training approach for visually-rich document understanding tasks, aka LayoutLMv2. Distinct from existing methods for VrDU, the LayoutLMv2 model not only considers the text and layout information but also integrates the image information in the pre-training stage within a single multi-modal framework. Meanwhile, the spatial-aware self-attention mechanism is integrated into the Transformer architecture to capture the relative relationships among different bounding boxes. Furthermore, new pre-training objectives are leveraged to enforce the learning of cross-modal interaction among different modalities. Experiment results on 6 different VrDU tasks illustrate that the pre-trained LayoutLMv2 model substantially outperforms the SOTA baselines in the document intelligence area, which greatly benefits a number of real-world document understanding tasks.

For future research, we will further explore the network architecture as well as the pre-training strategies for the LayoutLM family, so that we can push the SOTA results in VrDU to a new height. Meanwhile, we will also investigate language expansion to make a multi-lingual LayoutLMv2 model available for different languages, especially in non-English areas around the world.

References
Muhammad Zeshan Afzal, Andreas Kölsch, Sheraz Ahmed, and Marcus Liwicki. Cutting the error by half: Investigation of very deep CNN and advanced training strategies for document image classification. 01:883–888, 2017.

Anonymous. BROS: A pre-trained language model for understanding texts in document. In Submitted to International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=punMXQEsPr0. Under review.

Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Songhao Piao, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. UniLMv2: Pseudo-masked language models for unified language model pre-training, 2020.

Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: Universal image-text representation learning. In ECCV, 2020.

Arindam Das, Saikat Roy, and Ujjwal Bhattacharya. Document image classification with intra-domain transfer learning and stacked generalization of deep convolutional neural networks. pp. 3180–3185, 2018.

Tyler Dauphinee, Nikunj Patel, and Mohammad Mehdi Rashidi. Modular multimodal architecture for document classification. ArXiv, abs/1912.04376, 2019.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423.

Filip Graliński, Tomasz Stanisławek, Anna Wróblewska, Dawid Lipiński, Agnieszka Kaliska, Paulina Rosalska, Bartosz Topolski, and Przemysław Biecek. Kleister: A novel task for information extraction involving long documents with complex layout, 2020.

Adam W Harley, Alex Ufkes, and Konstantinos G Derpanis. Evaluation of deep convolutional nets for document image classification and retrieval. In International Conference on Document Analysis and Recognition (ICDAR), 2015.

Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017.

Z. Huang, K. Chen, J. He, X. Bai, D. Karatzas, S. Lu, and C. V. Jawahar. ICDAR2019 competition on scanned receipt OCR and information extraction. pp. 1516–1520, 2019. doi: 10.1109/ICDAR.2019.00244.

Wonseok Hwang, Jinyeong Yim, Seunghyun Park, Sohee Yang, and Minjoon Seo. Spatial dependency parsing for semi-structured document information extraction, 2020.

Guillaume Jaume, Hazim Kemal Ekenel, and Jean-Philippe Thiran. FUNSD: A dataset for form understanding in noisy scanned documents. Sep 2019. doi: 10.1109/icdarw.2019.10029. URL http://dx.doi.org/10.1109/ICDARW.2019.10029.

Anoop R Katti, Christian Reisswig, Cordula Guder, Sebastian Brarda, Steffen Bickel, Johannes Höhne, and Jean Baptiste Faddoul. Chargrid: Towards understanding 2D documents. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4459–4469, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1476.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.

D. Lewis, G. Agam, S. Argamon, O. Frieder, D. Grossman, and J. Heard. Building a test collection for complex document information processing. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '06, pp. 665–666, New York, NY, USA, 2006. Association for Computing Machinery. ISBN 1595933697. doi: 10.1145/1148170.1148307. URL https://doi.org/10.1145/1148170.1148307.

Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

Xiaojing Liu, Feiyu Gao, Qiong Zhang, and Huasha Zhao. Graph convolution for multimodal information extraction from visually rich documents. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Industry Papers), pp. 32–39, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-2005.

Colin Lockard, Prashant Shiralkar, Xin Luna Dong, and Hannaneh Hajishirzi. ZeroShotCeres: Zero-shot relation extraction from semi-structured webpages. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.acl-main.721. URL http://dx.doi.org/10.18653/v1/2020.acl-main.721.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks, 2019.

Bodhisattwa Prasad Majumder, Navneet Potti, Sandeep Tata, James Bradley Wendt, Qi Zhao, and Marc Najork. Representation learning for information extraction from form-like documents. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6495–6504, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.580.

S. Marinai, M. Gori, and G. Soda. Artificial neural networks for document analysis and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(1):23–35, Jan 2005. ISSN 1939-3539. doi: 10.1109/TPAMI.2005.4.

Minesh Mathew, Dimosthenis Karatzas, R. Manmatha, and C. V. Jawahar. DocVQA: A dataset for VQA on document images, 2020.

Seunghyun Park, Seung Shin, Bado Lee, Junyeop Lee, Jaeheung Surh, Minjoon Seo, and Hwalsuk Lee. CORD: A consolidated receipt dataset for post-OCR parsing. 2019.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392, Austin, Texas, November 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1264.

Ritesh Sarkhel and Arnab Nandi. Deterministic routing between layout abstractions for multi-scale classification of visually rich documents. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 3360–3366. International Joint Conferences on Artificial Intelligence Organization, 7 2019. doi: 10.24963/ijcai.2019/466. URL https://doi.org/10.24963/ijcai.2019/466.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 2018. doi: 10.18653/v1/n18-2074. URL http://dx.doi.org/10.18653/v1/N18-2074.

Michael Shilman, Percy Liang, and Paul Viola. Learning nongenerative grammatical models for document analysis. In Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1, volume 2, pp. 962–969. IEEE, 2005.

Carlos Soto and Shinjae Yoo. Visual detection with context for document layout analysis. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3462–3468, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1348.

Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. VL-BERT: Pre-training of generic visual-linguistic representations, 2020.

Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. VideoBERT: A joint model for video and language representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.

Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, 2016.

Hao Tan and Mohit Bansal. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 2019.

Matheus Palhares Viana and Dário Augusto Borges Oliveira. Fast CNN-based document layout analysis. pp. 1173–1180, 2017.

H. Wei, M. Baechler, F. Slimane, and R. Ingold. Evaluation of SVM, MLP and GMM classifiers for layout analysis of historical documents. pp. 1220–1224, Aug 2013. doi: 10.1109/ICDAR.2013.247.

Mengxi Wei, Yifan He, and Qiong Zhang. Robust layout-aware IE for visually rich documents with pre-trained language models. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Jul 2020. doi: 10.1145/3397271.3401442. URL http://dx.doi.org/10.1145/3397271.3401442.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45, Online, October 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-demos.6.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. pp. 5987–5995, 2016.

Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. LayoutLM: Pre-training of text and layout for document image understanding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '20, pp. 1192–1200, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450379984. doi: 10.1145/3394486.3403172. URL https://doi.org/10.1145/3394486.3403172.

Xiao Yang, Ersin Yumer, Paul Asente, Mike Kraley, Daniel Kifer, and C. Lee Giles. Learning to extract semantic structure from documents using multimodal fully convolutional neural networks. Jul 2017a. doi: 10.1109/cvpr.2017.462. URL http://dx.doi.org/10.1109/CVPR.2017.462.

Xiaowei Yang, Ersin Yumer, Paul Asente, Mike Kraley, Daniel Kifer, and C. Lee Giles. Learning to extract semantic structure from documents using multimodal fully convolutional neural networks. pp. 4342–4351, 2017b.

Wenwen Yu, Ning Lu, Xianbiao Qi, Ping Gong, and Rong Xiao. PICK: Processing key information extraction from documents using improved graph learning-convolutional networks, 2020.

Peng Zhang, Yunlu Xu, Zhanzhan Cheng, Shiliang Pu, Jing Lu, Liang Qiao, Yi Niu, and Fei Wu. TRIE: End-to-end text reading and information extraction for document understanding, 2020.

Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. PubLayNet: largest dataset ever for document layout analysis. In 2019 International Conference on Document Analysis and Recognition (ICDAR).