Document Visual Question Answering Challenge 2020

Minesh Mathew, Ruben Tito, Dimosthenis Karatzas, R. Manmatha, and C.V. Jawahar

CVIT, IIIT Hyderabad, India; Computer Vision Center, UAB, Spain; University of Massachusetts Amherst, USA
Abstract.
This paper presents the results of the Document Visual Question Answering Challenge organized as part of the “Text and Documents in the Deep Learning Era” workshop at CVPR 2020. The challenge introduces a new problem: Visual Question Answering on document images. The challenge comprised two tasks. The first task concerns asking questions on a single document image, while the second task is set as a retrieval task where the question is posed over a collection of images. For task 1 a new dataset is introduced comprising 50,000 questions defined over 12,767 document images. For task 2 another dataset has been created comprising 20 questions posed over a collection of 14,362 document images.

Keywords: visual question answering · document understanding

Visual Question Answering (VQA) has attracted an intense research effort over the past few years, as one of the most important tasks at the frontier between vision and language. Notably, at the same time as reading systems research considered the field of scene text understanding mature enough to build scene-text-based VQA systems on, VQA researchers realised that the capacity to read is actually important for any VQA agent. As a result, ST-VQA [2] and TextVQA [6] were introduced in parallel in 2019. In this year’s Visual Question Answering workshop at CVPR 2020, three out of the five VQA challenges actually revolve around text [3, 4]. The time seemed right for the introduction of a large-scale scanned Document VQA task.

In this short paper we introduce the DocVQA dataset and the challenge organized as part of the “Text and Documents in the Deep Learning Era” workshop at CVPR 2020. The paper offers a quick description of the dataset, the challenge, and the results of the methods submitted to date. The challenge is open for continuous submission at the Robust Reading Competition (RRC) portal: https://rrc.cvc.uab.es/?ch=17
Fig. 1: (Left) A sample document image and questions defined on it from the dataset for task 1 (e.g. “Who is the message for?”, “What is the date of the message?”). (Right) A sample document image from the dataset for task 2 and a sample question posed over the whole task 2 document collection (“In which legislative counties did Gary L. Schoessler run for County Commissioner?”).
The challenge ran between March and May 2020. The rankings of submitted methods presented in this report reflect the state of submissions at the close of the official challenge period. The challenge comprised two separate tasks.
Task 1 of the challenge is similar to the typical VQA setting, i.e., answering a question asked on an image; here, a document image. The answer to the question is always text present in the image; in other words, it is an extractive QA task. Participants are required to submit their results on the test split of the dataset, which comprises 5,188 questions defined on 1,287 document images. For evaluation of the submissions we use the same metric as the ST-VQA challenge [1], Average Normalized Levenshtein Similarity (ANLS).
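For reference, the sketch below shows one way ANLS can be computed, following the ST-VQA definition [1]: the prediction for each question is compared against every ground-truth answer using a normalized Levenshtein distance, matches whose normalized distance reaches the 0.5 threshold are scored 0, the best match per question is kept, and the per-question scores are averaged. Function and variable names are illustrative; this is a minimal sketch, not the official evaluation code.

```python
# A minimal ANLS sketch (not the official evaluation script).

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def anls(predictions, ground_truths, threshold=0.5):
    """predictions: list of strings; ground_truths: list of lists of strings."""
    scores = []
    for pred, answers in zip(predictions, ground_truths):
        best = 0.0
        for ans in answers:
            p, a = pred.strip().lower(), ans.strip().lower()
            nl = levenshtein(p, a) / max(len(p), len(a), 1)
            # Similarity is 1 - NL, zeroed when the normalized distance
            # reaches the threshold (0.5 in the challenge).
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        scores.append(best)
    return sum(scores) / len(scores)
```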
In task 2 the question is posed over a collection of documents instead of a single image. Hence, the task requires one to retrieve the evidence as well as output the answer. For this first edition we focused on the retrieval part to rank the methods, and left the answering part as an optional response, which is nevertheless evaluated. Performance of the methods on the retrieval part is scored by Mean Average Precision (MAP), the standard metric in retrieval scenarios. It is important to note that positive evidences which have been assigned equal scores by a method are forced to the end of the ranking among those equally scored documents. This ensures that the ranking is consistent and does not depend on the default order or on the way the score is evaluated. Finally, although the answer score is not used to rank the methods, precision and recall are reported to show the performance of the methods on the answering part of this task.
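The sketch below illustrates how such a ranking could be scored: Average Precision per question, with ties broken pessimistically so that equally scored positive documents fall at the end of their tied block, then averaged over all questions to obtain MAP. Names are illustrative; this is a sketch of the idea, not the official evaluation script.

```python
# A minimal MAP sketch with the pessimistic tie-break described above
# (not the official evaluation script).

def average_precision(scores, relevant):
    """scores: {doc_id: score}; relevant: set of positive doc_ids."""
    # Sort by score descending; among equal scores, positives go last.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0] in relevant))
    hits, precision_sum = 0, 0.0
    for rank, (doc_id, _) in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / max(len(relevant), 1)

def mean_average_precision(per_question):
    """per_question: list of (scores, relevant) pairs, one per question."""
    return sum(average_precision(s, r) for s, r in per_question) / len(per_question)

# Example: documents d1..d3 scored by a method, with d2 and d3 relevant.
# d1 and d2 tie at 0.9, so the positive d2 is ranked after d1.
# average_precision({"d1": 0.9, "d2": 0.9, "d3": 0.2}, {"d2", "d3"})  # ≈ 0.583
```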
The dataset for task 1 [5] comprises 50,000 questions over 12,767 document images. The images are pages extracted from 6,071 scanned documents sourced from the Industry Documents Library. We manually selected the documents from the library to ensure document variety. Each question-answer(s) pair in the dataset is also qualified with a question type among 9 types that denote the kind of analysis required to arrive at the answer. The 9 question types are ‘figure’, ‘form’, ‘table/list’, ‘layout’, ‘running text’, ‘photograph’, ‘handwritten’, ‘yes/no’ and ‘other’. We also provide commercial OCR transcriptions of all the documents.

The dataset for task 2 consists of 14,362 document images sharing the same document template, a US Candidate Registration form. Some of these documents are filled in by hand, like the one in Figure 1 (right), while others are typewritten. The images were sourced from the Open Data portal of the Public Disclosure Commission (PDC), and over this collection a set of 20 different questions is posed.
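To make the annotation format concrete, the sketch below shows one way the task 1 question annotations could be loaded and tallied by question type. The file path and field names ("data", "question_types") are assumptions modelled on a typical VQA-style JSON release, not the documented schema; check the downloaded annotations for the exact keys.

```python
# A hypothetical loader for the task 1 annotations. Field names are
# assumptions, not the documented schema of the DocVQA release.
import json
from collections import Counter

def count_question_types(annotation_path: str) -> Counter:
    with open(annotation_path, "r", encoding="utf-8") as f:
        annotations = json.load(f)
    counts = Counter()
    for item in annotations.get("data", []):
        qtypes = item.get("question_types", ["other"])
        if isinstance(qtypes, str):  # tolerate a single-string field
            qtypes = [qtypes]
        for qtype in qtypes:
            counts[qtype] += 1
    return counts

# Example usage (path is illustrative):
# print(count_question_types("docvqa_train.json").most_common())
```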
Task 1: We received submissions from 9 different teams for task 1. The final ranking is shown in Table 1. Performance of the 9 submissions across the different question types is shown in Figure 2.
Task 2: We received submissions from two different teams. The winning method is DQA from the PingAn team, followed by a small margin by DOCR from the iFLYTEK team (see Table 2). None of the submitted methods provided answers to the questions.
Table 1: Final ranking for task 1.

Method                           ANLS
PingAn-OneConnect-Gammalab-DQA   0.85
Structural LM-v2                 0.75
QA_Base_MRC_1                    0.74
HyperDQA_V4                      0.69
bert fulldata fintuned           0.59
Plain BERT QA                    0.35
HDNet                            0.34
CLOVA OCR                        0.33
docVQAQV_V0.1                    0.30

Fig. 2: Performance of the submitted methods for different question types in task 1 (Overall, Figure/Diagram, Form, Table/List, Layout, Running text, Image/Photo, Handwritten, Yes/No, Others).

Table 2: Final ranking for task 2.

Method   Retrieval (MAP)   Answer precision   Answer recall
DQA      0.8090            -                  -
DOCR     0.7915            -                  -

Fig. 3: Average precision of the submitted methods for each question in the test set.

Acknowledgements. This work has been supported by Amazon through an AWS Machine Learning Research Award.
References
1. Biten, A.F., Tito, R., Mafla, A., Gómez, L., Rusiñol, M., Mathew, M., Jawahar, C.V., Valveny, E., Karatzas, D.: ICDAR 2019 competition on scene text visual question answering. CoRR abs/1907.00490 (2019)
2. Biten, A.F., Tito, R., Mafla, A., Gómez, L., Rusiñol, M., Valveny, E., Jawahar, C., Karatzas, D.: Scene text visual question answering. In: ICCV (2019)
3. Gurari, D., Li, Q., Stangl, A.J., Guo, A., Lin, C., Grauman, K., Luo, J., Bigham, J.P.: VizWiz grand challenge: Answering visual questions from blind people. In: CVPR. pp. 3608–3617 (2018)
4. Jayasundara, V., Jayasekara, S., Jayasekara, H., Rajasegaran, J., Seneviratne, S., Rodrigo, R.: TextCaps: Handwritten character recognition with very small datasets. In: WACV. pp. 254–262. IEEE (2019)
5. Mathew, M., Karatzas, D., Manmatha, R., Jawahar, C.V.: DocVQA: A dataset for VQA on document images. CoRR abs/2007.00398 (2020)
6. Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., Rohrbach, M.: Towards VQA models that can read. In: CVPR (2019)