Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer
Rafał Powalski*, Łukasz Borchmann*, Dawid Jurkiewicz†, Tomasz Dwojak†, Michał Pietruszka†, and Gabriela Pałka
Applica.ai, Warsaw, Poland; Poznan University of Technology, Poznań, Poland; Adam Mickiewicz University in Poznań, Poland; Jagiellonian University, Cracow, Poland
[email protected]
Abstract.
We address the challenging problem of Natural Language Comprehension beyond plain-text documents by introducing the TILT neural network architecture, which simultaneously learns layout information, visual features, and textual semantics. Contrary to previous approaches, we rely on a decoder capable of unifying a variety of problems involving natural language. The layout is represented as an attention bias and complemented with contextualized visual information, while the core of our model is a pretrained encoder-decoder Transformer. Our novel approach achieves state-of-the-art results in extracting information from documents and answering questions which demand layout understanding (DocVQA, CORD, WikiOps, SROIE). At the same time, we simplify the process by employing an end-to-end model.
Keywords:
Natural Language Processing · Transfer learning · Document understanding · Layout analysis · Deep learning · Transformer.
Introduction

Most tasks in Natural Language Processing (NLP) can be unified under one framework by casting them as triplets of question, context, and answer [30, 40, 27]. We consider such a unification of Document Classification, Key Information Extraction, and Question Answering in a demanding scenario where the context extends beyond the text layer. This challenge is prevalent in business cases, since contracts, forms, applications, and invoices cover a wide selection of document types and complex spatial layouts.

* RP and ŁB contributed equally. † DJ, TD, and MP contributed equally.
The most remarkable successes achieved in NLP involved models that map raw textual input into raw textual output, which usually were provided in a digital form. An important aspect of real-world oriented problems is the presence of scanned paper records and other analog materials that became digital. Consequently, there is no easily accessible information regarding the document layout or reading order, and these are to be determined as part of the process. Furthermore, interpretation of shapes and charts beyond the layout may help answer the stated questions. A system cannot rely solely on text but requires incorporating information from the structure and image.
Fig. 1.
The same document perceived differently depending on modalities. Respectively: its visual aspect, spatial relationships between the bounding boxes of detected words, and unstructured text returned by OCR under the detected reading order.
Thus, it takes three to solve this fundamental challenge: the extraction of key information from richly formatted documents lies precisely at the intersection of NLP, Computer Vision, and Layout Analysis (Figure 1). These challenges impose extra conditions beyond NLP that we sidestep by formulating layout-aware models within an encoder-decoder framework.
Sequence labeling models can be trained in all cases where token-level annotation is available or can be easily obtained. Limitations of this approach are strikingly visible in tasks framed in either the key information extraction or property extraction paradigm [20, 9]. Here, no annotated spans are available, and only property-value pairs are assigned to the document. Occasionally, it is expected from the model to mark some particular subsequence of the document. However, problems where the expected value is not a substring of the considered text are unsolvable assuming sequence labeling methods (Table 1). As a result, authors applying state-of-the-art entity recognition models were forced to rely on human-made heuristics and time-consuming rule engineering.

Particular problems one has to solve when employing a sequence-labeling method can be divided into three groups. We investigate them below to precisely point out the limitations of this approach.

Task          Annotation       Exact match   Layout
CoNLL 2003    word-level       100%          −
SROIE         document-level   93%           +
WikiReading   document-level   20%           −
Kleister      document-level   27%           +
Table 1.
Comparison of extraction tasks. Expected values are always present as a substring of a document in NER, but not elsewhere. Our estimation.
Take, for example, the total amount assigned to a receipt in the SROIE dataset [20]. Suppose there is no exact match for the expected value in the document, e.g., due to an OCR error, incorrect reading order, or the use of a different decimal separator. Unfortunately, a sequence labeling model cannot be applied off-the-shelf. Authors dealing with property extraction rely on either manual annotation or a heuristic-based tagging procedure that impacts the overall end-to-end results [56, 12, 11, 19, 55, 37]. Moreover, when receipts with one item listed are considered, the total amount is equal to a single item price, which is the source of yet another problem. Precisely, if there are multiple matches for the value in the document, it is ambiguous whether to tag all of them, part, or none.

Another problem one has to solve is which and how many of the detected entities to return, and whether to normalize the output somehow. Consequently, the authors of Kleister proposed a set of handcrafted rules for the final selection of the entity values [12]. These and similar rules are either labor-intensive or prone to errors [41].

Finally, the property extraction paradigm does not assume the requested value appeared in the article in any form, since it is sufficient for it to be inferable from the content, as in document classification or non-extractive question answering [9].
Since sequence labeling-based extraction is disconnected from the final purpose the detected information is used for, a typical real-world scenario demands the setting of Key Information Extraction.

To address this issue, we focus on the applicability of the encoder-decoder architecture, since it can generate values not included in the input text explicitly [17] and performs reasonably well on all text-based problems involving natural language [45]. Additionally, it eliminates the limitation prevalent in sequence labeling, where the model output is restricted by the detected word order, previously addressed by complex architectural changes (Section 2).

Furthermore, this approach potentially solves all identified problems of sequence labeling architectures and ties various tasks, such as Question Answering or Text Classification, into the same framework. For example, the model may deduce that the answer is yes or no depending on the question form only. Its end-to-end elegance and ease of use allows one not to rely on human-made heuristics and to get rid of the time-consuming rule engineering required in the sequence labeling paradigm.

Obviously, employing a decoder instead of a classification head comes with some known drawbacks related to the autoregressive nature of answer generation. This is currently investigated, e.g., in the Neural Machine Translation context, and can be alleviated by methods such as lowering the depth of the decoder [48, 25]. However, the datasets we consider have target sequences of low length; thus, the mentioned decoding overhead is mitigated.
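To make the unified framing concrete, the sketch below shows one plausible way to cast Key Information Extraction, Question Answering, and Document Classification as the same text-to-text problem. The prompt format and the helper name are illustrative assumptions, not the exact serialization used by our model.

```python
# Illustrative only: the prompt templates below are assumptions,
# not the serialization used by TILT.
def build_training_pair(task, context_text, query, target):
    """Return (input_text, target_text) for a generative encoder-decoder."""
    if task == "kie":
        # query is a property name, e.g. "total amount"
        source = f"extract {query}: {context_text}"
    elif task == "qa":
        source = f"question: {query} context: {context_text}"
    elif task == "classification":
        source = f"classify document: {context_text}"
    else:
        raise ValueError(f"unknown task: {task}")
    return source, target


# Example usage: the target need not be a substring of the document.
pair = build_training_pair(
    "kie", "TOTAL 100,00 THANK YOU", "total amount", "100.00")
print(pair)
```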
[Figure 2: diagram relating encoder-decoder, spatial-aware, and multi-modal model families (T5, BART, LAMBERT, BERTgrid, LayoutLM, VisualBERT, VL-BERT), with our work at their intersection.]
Fig. 2.
Our work in relation to encoder-decoder models, multi-modal transformers, and models for text that are able to comprehend spatial relationships between words.

Related Works

We aim to bridge several fields, each of them having long-lasting research programs; thus, there is a large and varied body of related works. We restrict ourselves to approaches rooted in the Transformer architecture [54] and focus on the inclusion of spatial information or different modalities in text-processing systems, as well as on the applicability of encoder-decoder models to Information Extraction and Question Answering.
Spatial-aware Transformers.
Several authors have shown that, when tasks involving 2D documents are considered, sequential models can be outperformed by considering layout information either directly as positional embeddings [18, 11, 56] or indirectly by allowing them to be contextualized on their spatial neighborhood [6, 57, 16]. Further improvements focused on the training and inference aspects by the inclusion of an area masking loss function or by achieving independence from sequential order in decoding, respectively [19, 21]. In contrast to the mentioned methods, we rely on a bias added to self-attention instead of positional embeddings and propose its generalization to distances on the 2D plane. Additionally, we introduce a novel word-centric masking method concerning both images and text. Moreover, by resorting to an encoder-decoder, independence from sequential order in decoding is granted without dedicated architectural changes.
Encoder-decoder for IE and QA.
Most NLP tasks can be unified under one framework by casting them as Language Modeling, Sequence Labeling or Question Answering [44, 26]. The QA program of unifying NLP frames all the problems as triplets of question, context and answer [30, 40, 27] or item, property name and answer [17]. Although this does not necessarily lead to the use of encoder-decoder models, several successful solutions relied on variants of the Transformer architecture [54, 35, 9, 45]. The T5 is a prominent example of a large-scale Transformer achieving state-of-the-art results on varied NLP benchmarks [45]. We extend this approach beyond the text-to-text scenario by making it possible to consume a multimodal input.
Multimodal Transformers.
The relationships between text and other media have been previously studied in Visual Commonsense Reasoning, Video-Grounded Dialogue, Speech, and Visual Question Answering [14, 33, 3]. In the context of images, this niche was previously approached with an image-to-text cross-attention mechanism or, alternatively, by adding visual features to word embeddings or concatenating them [38, 34, 36, 53, 56]. We differ from the mentioned approaches, as in our model, the visual features added to word embeddings are already contextualized on an image's multiple resolution levels (see Section 3.2).
Our starting point is the architecture of the Transformer, initially proposed for Neural Machine Translation, which has proven to be a solid baseline for all generative tasks involving natural language [54].

Let us begin from the general view on attention in the first layer of the Transformer. If $n$ denotes the number of input tokens, resulting in a matrix of embeddings $X$, then self-attention can be seen as

$$\mathrm{softmax}\left(\frac{Q_X K_X^\top}{\sqrt{n}} + B\right) V_X \qquad (1)$$

where $Q_X$, $K_X$ and $V_X$ are projections of $X$ onto the query, key, and value spaces, whereas $B$ stands for an optional attention bias. There is no $B$ term in the original Transformer, and information about the order of tokens is provided explicitly to the model, that is,

$$X = S + P, \qquad B = \mathbf{0}_{n \times n},$$

where $S$ and $P$ are respectively the semantic embeddings of tokens and the positional embeddings resulting from their positions [54], and $\mathbf{0}_{n \times n}$ denotes a zero matrix.
Fig. 3. (A) In the original Transformer, information about the order of tokens is provided explicitly to the model by positional embeddings added to semantic embeddings. (B) T5 introduces a sequential bias, thus separating semantics from sequential distances. (C) We maintain this clear distinction, extending biases with spatial relationships and providing additional image semantics at the input.

In contrast to the original formulation, we rely on relative attention biases instead of positional embeddings. These are further extended to take into account the spatial relationships between tokens (Figure 3).

The authors of the T5 architecture disregarded positional embeddings [45] by setting $X = S$. They used a relative bias, extending the self-attention equation with the sequential bias term $B = B^{1\mathrm{D}}$, a simplified form of positional signal inclusion. Here, each logit used for computing the attention head weights has some learned scalar added, resulting from the corresponding token-to-token offset.

We extend this approach to spatial dimensions. In our approach, biases for relative horizontal and vertical distances between each pair of tokens are calculated and added to the original sequential bias, i.e.,

$$B = B^{1\mathrm{D}} + B^{H} + B^{V}.$$

Each relative distance is mapped to one of 32 buckets, which group similarly-distanced token pairs. The size of the buckets grows logarithmically, so that greater token-pair distances are grouped into larger buckets.
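The sketch below illustrates the bucketed relative bias in PyTorch. It assumes T5-style logarithmic bucketing; the function and class names, the max_distance cut-off, and the use of token box centers are our illustrative choices, not details taken from the paper. The resulting per-head matrix is simply added to the attention logits before the softmax, as in Equation (1).

```python
import torch
import torch.nn as nn


def relative_bucket(offsets, num_buckets=32, max_distance=128):
    """Map signed pairwise offsets to bucket ids: half of the buckets per
    sign, exact ids for small offsets, logarithmically growing ranges for
    larger distances."""
    num_buckets //= 2
    bucket = (offsets > 0).long() * num_buckets  # separate the two signs
    offsets = offsets.abs()
    max_exact = num_buckets // 2
    is_small = offsets < max_exact
    # logarithmic bucketing for larger distances
    log_bucket = max_exact + (
        torch.log(offsets.float().clamp(min=1) / max_exact)
        / torch.log(torch.tensor(max_distance / max_exact))
        * (num_buckets - max_exact)
    ).long()
    log_bucket = log_bucket.clamp(max=num_buckets - 1)
    return bucket + torch.where(is_small, offsets, log_bucket)


class SpatialBias(nn.Module):
    """Learned per-head scalars for sequential (1D), horizontal, and vertical
    token-pair offsets, summed into a single attention bias B."""

    def __init__(self, num_heads, num_buckets=32):
        super().__init__()
        self.seq = nn.Embedding(num_buckets, num_heads)
        self.hor = nn.Embedding(num_buckets, num_heads)
        self.ver = nn.Embedding(num_buckets, num_heads)

    def forward(self, token_idx, x_centers, y_centers):
        # pairwise offsets, each of shape (n, n)
        d_seq = token_idx[None, :] - token_idx[:, None]
        d_x = (x_centers[None, :] - x_centers[:, None]).long()
        d_y = (y_centers[None, :] - y_centers[:, None]).long()
        bias = (self.seq(relative_bucket(d_seq))
                + self.hor(relative_bucket(d_x))
                + self.ver(relative_bucket(d_y)))
        # (n, n, heads) -> (heads, n, n); added to attention logits
        return bias.permute(2, 0, 1)
```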
Fig. 4.
Document excerpt with distinguished vertical buckets for the Amount token.
Contextualized Word Embeddings are expected to capture context-dependent semantics and return a sequence of vectors associated with an entire input sequence [10]. We designed Contextualized Image Embeddings with the same objective, i.e., they cover the image region semantics in the context of its entire visual neighborhood.
Visual features.
To produce image embeddings, we use a convolutional network that consumes the whole page image of size 512 × 384 and produces a feature map of 64 × … × ….

Fig. 5.
Truncated U-Net network. Legend: conv, max-pool, up-conv, residual.
Embeddings.
In order to inject visual information into the Transformer, a matrix of contextualized image-region embeddings U is added to the semantic embeddings, i.e., we define X = S + U, in line with the convention from Section 3 (see Figure 3).
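A minimal sketch of how such image-region embeddings could be gathered and combined with the word embeddings follows. Pooling a region of the convolutional feature map for each token bounding box via roi_align is an assumption made for illustration, as are the feature-map channel count and hidden dimension; the paper only states that a matrix U of contextualized image-region embeddings is added to S.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align


class ImageRegionEmbeddings(nn.Module):
    """Pool one feature vector per token box from the U-Net feature map and
    project it to the model dimension, producing the matrix U."""

    def __init__(self, feature_channels=128, d_model=768):
        super().__init__()
        self.proj = nn.Linear(feature_channels, d_model)

    def forward(self, feature_map, token_boxes, image_size):
        # feature_map: (1, C, H', W'); token_boxes: (n, 4) in image pixels
        h, w = image_size
        scale = feature_map.shape[-1] / w  # assume isotropic down-scaling
        regions = roi_align(
            feature_map,
            [token_boxes],        # one list of boxes per image
            output_size=(1, 1),
            spatial_scale=scale,
        )                         # (n, C, 1, 1)
        u = regions.flatten(1)    # (n, C)
        return self.proj(u)       # (n, d_model), the matrix U


# Usage: semantic token embeddings S plus contextualized image embeddings U.
# x = s + image_embedder(unet_features, boxes, (512, 384))
```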
In the sequence labeling scenario, each document leads to multiple training instances (token classification), whereas in Transformer sequence-to-sequence models, the same document results in one training instance with a feature space of higher dimension (decoding from multiple tokens). Since most of the tokens are irrelevant in the case of Key Information Extraction, and contextualized word embeddings are correlated by design, one can suspect our approach to overfit more easily than its sequence labeling counterparts. To improve the model's robustness, we introduced a regularization technique for each modality.

Case Augmentation.
Subword tokenization [50, 29] was proposed to solve the word sparsity problem and keep the vocabulary at a reasonable size. Although the algorithm proved its efficiency in many NLP fields, recent work showed that it performs poorly in the case of an unusual casing of text [43], for instance, when all words are uppercased. The problem occurs more frequently in formatted documents (FUNSD, CORD, DocVQA), where the casing is an important visual aspect. We overcome both problems with a straightforward regularization strategy, i.e., we produce augmented copies of data instances by lower-casing or upper-casing both the document and the target text simultaneously.
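A minimal sketch of this augmentation, assuming a simple dictionary-based instance structure of our own choosing: the OCR words and the target are lower- or upper-cased together, while the layout is left untouched.

```python
def case_augment(instance):
    """Yield the original instance plus its lower- and upper-cased copies."""
    yield instance
    for transform in (str.lower, str.upper):
        yield {
            "words": [transform(w) for w in instance["words"]],
            "boxes": instance["boxes"],          # layout is unchanged
            "target": transform(instance["target"]),
        }


example = {"words": ["Total", "100.00"],
           "boxes": [[10, 700, 60, 715], [70, 700, 120, 715]],
           "target": "100.00"}
augmented = list(case_augment(example))
```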
Spatial Bias Augmentation.
Analogously to Computer Vision practices of randomly transforming training images, we augment spatial biases by multiplying the horizontal and vertical distances between tokens by a random factor. Such a transformation resembles stretching or squeezing document pages in the horizontal and vertical dimensions. Factors used for scaling each dimension were sampled uniformly from the range [0.…, …].

Affine Vision Augmentation.

To account for visual deformations of real-world documents, we augment images with an affine transformation, preserving parallel lines within an image but modifying its position, angle, size, and shear. When we perform such a modification to the image, the bounding box of every token is updated accordingly. The exact hyperparameters were subject to optimization. We use a 0.9 probability of augmenting and report that the following boundaries for uniform sampling work best: [−…, 5] degrees for the rotation angle, […, …1] for the scaling multiplier, and [−…, 5] degrees for the shearing angle.
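The sketch below illustrates both geometric augmentations. The sampling ranges are illustrative assumptions (the exact boundaries are not fully legible in the source), and the corresponding bounding-box update for the affine case is only indicated in a comment rather than implemented.

```python
import random
import torchvision.transforms.functional as TF


def stretch_boxes(boxes, x_range=(0.8, 1.25), y_range=(0.8, 1.25)):
    """Spatial bias augmentation: multiply horizontal and vertical
    coordinates by random factors, which rescales all pairwise distances
    fed to the spatial bias. Ranges here are assumptions."""
    fx, fy = random.uniform(*x_range), random.uniform(*y_range)
    return [[x0 * fx, y0 * fy, x1 * fx, y1 * fy] for x0, y0, x1, y1 in boxes]


def affine_augment(image, boxes, p=0.9, max_angle=5.0, max_shear=5.0):
    """Affine vision augmentation: randomly rotate and shear the page image.
    A full implementation would apply the same affine matrix to every
    bounding box; that update is omitted from this sketch."""
    if random.random() > p:
        return image, boxes
    angle = random.uniform(-max_angle, max_angle)
    shear = random.uniform(-max_shear, max_shear)
    image = TF.affine(image, angle=angle, translate=(0, 0),
                      scale=1.0, shear=[shear])
    return image, boxes
```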
Our model was validated on a series of experiments involving Key Information Extraction, Visual Question Answering, classification of rich documents, and Question Answering from layout-rich texts. The following datasets represent a broad spectrum of tasks and were selected for the evaluation process (see Table 2 for additional statistics).
Datasets.
The CORD dataset [42] includes images of Indonesian receipts collected from shops and restaurants. The dataset is prepared for the information extraction task and consists of four categories, which fall into thirty subclasses. The main goal of the SROIE dataset [20] is to extract values for four categories (company, date, address, total) from scanned receipts. The DocVQA dataset [39] is focused on the visual question answering task. The RVL-CDIP dataset [15] contains gray-scale images and assumes classification into 16 categories such as letter, form, invoice, news article, and scientific publication. The WikiOps dataset [1] consists of tables extracted from Wikipedia and natural language questions corresponding to them; each has operand information assigned. For DocVQA, we relied on Amazon Textract OCR; for RVL-CDIP, we used Microsoft Azure OCR; and for WikiOps, SROIE and CORD, we depended on the original OCR.
Dataset            Data type              Image   Docs (k)   Questions (k)
CORD [42]          receipts               +       1.0        —
SROIE [20]         receipts               +       0.9        —
DocVQA [39]        industry documents     +       12.7       50.0
RVL-CDIP [15]      industry documents     +       400.0      —
WikiOps [1]        Wikipedia tables       −       …          …
…                  Wikipedia pages, …     −       …          10.0
FUNSD [22]         RVL-CDIP forms         +       0.1        —
Infographics VQA   infographics           +       4.4        23.9
TextCaps [51]      Open Images            +       28.4       —
DVQA [23]          synthetic bar charts   +       300.0      3487.2
FigureQA [24]      synthetic, scientific  +       140.0      1800.0
TextVQA [52]       Open Images            +       28.4       45.3

Table 2.
Comparison of datasets considered for the supervised pretraining and evaluation process. Statistics are given in thousands of documents or questions.
The training procedure consists of three steps. First, the model is initialized with vanilla T5 model weights and pretrained on numerous documents in an unsupervised manner. It is followed by training on a set of selected supervised tasks. Finally, the model is finetuned solely on the dataset of interest. We trained two size variants of TILT models, starting from the T5-Base and T5-Large models. Our models grew to 230M and 780M parameters due to the addition of the Visual Encoder weights.
Unsupervised Pretraining.
We constructed a corpus of documents with rich structure, based on RVL-CDIP (275k docs), the UCSF Industry Documents Library (480k), and PDF files from Common Crawl (350k). The latter were filtered according to the score obtained from a simple SVM business document classifier. Then, a T5-like masked language model pretraining objective is used, but in a salient span masking scheme, i.e., named entities are preferred rather than random tokens [45, 13]. Additionally, regions in the image corresponding to the randomly selected text tokens are masked with a probability of 80%. Models are trained for 100,000 steps with a batch size of 64, the AdamW optimizer, and a linear scheduler with an initial learning rate of 2e−….

Fig. 6.
Scores on CORD, DocVQA, SROIE, WikiOps and RVL-CDIP compared to the baseline without supervised pretraining. The numbers represent the differences in the metrics; orange text denotes datasets chosen for the final supervised pretraining run.
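A minimal sketch of the word-centric image masking used during the unsupervised pretraining described above: for tokens already masked by the (salient-span) language-modeling objective, the corresponding image regions are blanked out with probability 0.8. The fill value and the function name are assumptions for illustration.

```python
import random
import torch


def mask_image_regions(image, boxes, masked_token_ids, p=0.8, fill=0.0):
    """image: (C, H, W) tensor; boxes: list of [x0, y0, x1, y1] in pixels;
    masked_token_ids: indices of tokens masked in the text stream."""
    image = image.clone()
    for i in masked_token_ids:
        if random.random() < p:
            x0, y0, x1, y1 = (int(round(v)) for v in boxes[i])
            image[:, y0:y1, x0:x1] = fill   # hide the word's visual region
    return image
```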
Supervised Training.
To obtain a general-purpose model which can reason about documents with rich layout features, we constructed a dataset relying on a large group of tasks, representing diverse types of information conveyed by a document (see Table 2 for a comparison of the datasets). Datasets which initially had been plain text had their layout produced, assuming some arbitrary font size and document dimensions. Some datasets, such as
WikiTable Questions, come with the original HTML code; for the others, we render the text alike. Finally, an image and computed bounding boxes of all words are used.

At this stage, the model is trained on each dataset for 10,000 steps or 5 epochs, depending on the dataset size; the goal of the latter condition was to avoid quick overfitting.

We estimated each dataset's value concerning a downstream task, assuming a fixed number of pretraining steps followed by finetuning. The results of this investigation are demonstrated in Figure 6, where the group of WikiTable, WikiOps, SQuAD, and infographicsVQA performed robustly, convincing us to rely on them as a solid foundation for further experiments.

The model, pretrained first in an unsupervised and then in a supervised manner, is finetuned at the end for two epochs on a downstream task with the AdamW optimizer and the hyperparameters presented in Table 3.
Dataset     Batch size   Steps     Learning rate   Scheduler
SROIE       8            6,200     1e-4            constant
WikiOps     64           4,200     1e-4            constant
DocVQA      64           100,000   2e-4            linear
CORD        8            36,000    2e-4            linear
RVL-CDIP    1,024        12,000    1e-3            linear
Table 3.
Parameters used during the finetuning on a downstream task.
The TILT model achieved state-of-the-art results on four out of five considered tasks (Table 4). We have confirmed that unsupervised layout- and vision-aware pretraining leads to good performance on downstream tasks that require comprehension of tables and other structures within the documents. Additionally, we successfully leveraged supervised training from both plain-text datasets and those involving layout information.
DocVQA.
We improved the SOTA results on this dataset by 0.33 points. Moreover, detailed results show that the model gained the most in table-like categories, i.e., forms (89.… → ….6) and tables (87.… → ….…), as well as in the yes/no category (….… → ….…).† In such a case, our architecture simply generates a yes or no answer, while sequence labeling based models require additional components such as an extra classification head. We noticed that the model achieved lower results in the image/photo category, which can be explained by the low presence of image-rich documents in our datasets.

† Per-category test set scores are available after submission on the competition web page: https://rrc.cvc.uab.es/?ch=17&com=evaluation&task=1.

Model             CORD (F1)   SROIE (F1)   DocVQA (ANLS)   WikiOps (Accuracy)   RVL-CDIP (Accuracy)
LayoutLMv2 [55]   96.01       97.81        86.72           —                    ….…
LAMBERT [11]      96.…        ….…          —               —                    —
NeOp [1]          —           —            —               59.50                —
TILT-Base         95.11       97.65        83.92           69.16                95.…
TILT-Large        ….33        98.10        87.05           73.…                 ….…

Table 4.
Results of previous state-of-the-art methods in relation to our base and large models. Bold indicates the best score in each category. All results are on the test set.
RVL-CDIP.
Part of the documents to classify does not contain any readable text. Because of this shortcoming, we decided to guarantee that there are at least 16 image tokens that carry general image information. Precisely, we act as if there were tokens with bounding boxes covering 16 adjacent parts of the document (as sketched below). These have representations from the U-Net, exactly as if they were regular text tokens. Our model places second, 0.12 below the best model, achieving a similar accuracy of 95.….
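A sketch of these document-level image tokens: 16 bounding boxes tiling the page, assumed here to form a 4×4 grid, which the text does not state explicitly. Each box is later assigned a U-Net representation exactly like a regular text token.

```python
def make_image_token_boxes(page_width, page_height, rows=4, cols=4):
    """Return rows*cols bounding boxes covering adjacent parts of the page."""
    tile_w, tile_h = page_width / cols, page_height / rows
    return [
        [c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h]
        for r in range(rows) for c in range(cols)
    ]


boxes = make_image_token_boxes(512, 384)   # 16 boxes covering the page
```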
CORD.

Since the complete inventory of entities is not present in all examples, we force the model to generate a None output for missing entities. Our model achieved SOTA results on this challenge and improved the previous best score by 0.….

SROIE.

Following the same evaluation procedure as the top submission (LAMBERT), we excluded OCR mismatches and fixed discrepancies in the total entity annotations. We achieved results indistinguishable from the SOTA (98.10 vs. 98.…).
10 vs. 98 . In the following section, we analyze the design choices in our architecture, con-sidering the base model pretrained in an unsupervised manner and the samehyperparameters for each run. The DocVQA was used as the most representa-tive and challenging for Document Intelligence since its leaderboard reveals alarge gap to human performance. We report average results over two runs ofeach model varying only in the initial random seed to account for the impact ofdifferent initialization and data order [7]. . ± . . ± . − .
8– Visual Embeddings 81 . ± . − .
7– Case Augmentation 82 . ± . − .
7– Spatial Augmentation 82 . ± . − .
3– Vision Augmentation 82 . ± . − . Table 5.
Results of the ablation study. The minus sign indicates removal of the mentioned part from the base model.
Significance of Modalities.
We start with the removal of the 2D layout positional bias. Table 5 demonstrates that information that allows models to recognize spatial relations between tokens is a crucial part of our architecture. This is consistent with previous works on layout understanding [55, 11]. Removal of the U-Net-based convolutional feature extractor results in a less significant ANLS decrease than the 2D bias. This permits the conclusion that contextualized image embeddings are beneficial to the encoder-decoder.
Justifying Regularization.
Aside from removing modalities from the network, we can also exclude regularization techniques. To our surprise, the results suggest that the removal of case augmentation decreases performance most severely. Our baseline is almost one point better than the equivalent non-augmented model. Simultaneously, model performance tends to be reasonably insensitive to the bounding box and image alterations. It was confirmed that the other modalities are essential for the model's success on real-world data, whereas the regularization techniques we propose slightly improve the results, as they prevent overfitting.
Conclusions

In this paper, we introduced a novel encoder-decoder framework for layout-aware models. Compared to the sequence labeling approach, the proposed method achieved better results while operating in an end-to-end manner. Moreover, the framework can handle various tasks such as Key Information Extraction, Question Answering or Document Classification, while the need for complicated preprocessing and postprocessing steps is eliminated. We established state-of-the-art results on three datasets (DocVQA, CORD, WikiOps) and performed on par with the previous best scores on SROIE and RVL-CDIP, albeit with a much simpler workflow.

Spatial and image enrichment of the Transformer model allowed TILT to combine information from the text, layout, and image modalities. We showed that the proposed regularization methods significantly improve the results.
Acknowledgments
The authors would like to thank Filip Graliński, Tomasz Stanisławek, and Łukasz Garncarek for fruitful discussions regarding the paper, as well as our managing directors at Applica.ai. Moreover, Dawid Jurkiewicz pays due thanks to his son for minding the deadline and generously coming into the world a day after.