Document Visual Question Answering Challenge 2020

Minesh Mathew, Ruben Tito, Dimosthenis Karatzas, R. Manmatha, and C.V. Jawahar

CVIT, IIIT Hyderabad, India; Computer Vision Center, UAB, Spain; University of Massachusetts Amherst, USA
Abstract.
This paper presents the results of the Document Visual Question Answering Challenge organized as part of the “Text and Documents in the Deep Learning Era” workshop at CVPR 2020. The challenge introduces a new problem: Visual Question Answering on document images. The challenge comprised two tasks. The first task concerns asking questions on a single document image, while the second task is set as a retrieval task where the question is posed over a collection of images. For task 1 a new dataset is introduced comprising 50,000 questions defined over 12,767 document images. For task 2 another dataset has been created comprising 20 questions posed over a collection of 14,362 document images.

Keywords: visual question answering · document understanding

Visual Question Answering (VQA) has attracted an intense research effort over the past few years, as one of the most important tasks at the frontier between vision and language. Notably, at the same time as reading systems research considered the field of scene text understanding mature enough to build scene-text-based VQA systems on, VQA researchers realised that the capacity to read is actually important for any VQA agent. As a result, ST-VQA [2] and TextVQA [6] were introduced in parallel in 2019. In this year’s Visual Question Answering workshop at CVPR 2020, three out of the five VQA challenges actually revolve around text [3, 4]. The time seemed right for the introduction of a large-scale scanned Document VQA task.

In this short paper we introduce the DocVQA dataset and the challenge organized as part of the “Text and Documents in the Deep Learning Era” workshop at CVPR 2020. The paper offers a quick description of the dataset, the challenge, and the results of the methods submitted to date. The challenge is open for continuous submission at the Robust Reading Competition (RRC) portal: https://rrc.cvc.uab.es/?ch=17
Fig. 1: (Left) A sample document image and questions defined on it from the dataset for task 1 (e.g. “Who is the message for?”, “What is the date of the message?”). (Right) A sample document image from the dataset for task 2 and a sample question posed over the whole task 2 document collection (“In which legislative counties did Gary L. Schoessler run for County Commissioner?”).
The challenge ran between March and May 2020. The rankings of submitted methods presented in this report reflect the state of submissions at the close of the official challenge period. The challenge comprised two separate tasks.
Task 1 of the challenge is similar to the typical VQA setting, i.e., answering a question asked on an image; here, a document image. The answer to the question is always text present in the image; in other words, it is an extractive QA task. Participants are required to submit their results on the test split of the dataset, which comprises 5,188 questions defined on 1,287 document images. For evaluation of the submissions we use the same metric as the ST-VQA challenge [1], Average Normalized Levenshtein Similarity (ANLS).
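For reference, the sketch below shows one way ANLS can be computed, following the ST-VQA definition [1]: the prediction for each question is compared against every ground-truth answer using a normalized Levenshtein distance, matches whose normalized distance reaches the 0.5 threshold are scored 0, the best match per question is kept, and the per-question scores are averaged. Function and variable names are illustrative; this is a minimal sketch, not the official evaluation code.

```python
# A minimal ANLS sketch (not the official evaluation script).

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def anls(predictions, ground_truths, threshold=0.5):
    """predictions: list of strings; ground_truths: list of lists of strings."""
    scores = []
    for pred, answers in zip(predictions, ground_truths):
        best = 0.0
        for ans in answers:
            p, a = pred.strip().lower(), ans.strip().lower()
            nl = levenshtein(p, a) / max(len(p), len(a), 1)
            # Similarity is 1 - NL, zeroed when the normalized distance
            # reaches the threshold (0.5 in the challenge).
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        scores.append(best)
    return sum(scores) / len(scores)
```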
In task 2 the question is posed over a collection of documents instead of a single image. Hence, the task requires one to retrieve the evidence as well as output the answer. For this first edition we focused on the retrieval part to rank the methods, and left the answering part as an optional response, which is nevertheless evaluated. Performance of the methods on the retrieval part is scored by Mean Average Precision (MAP), the standard metric in retrieval scenarios. It is important to note that positive evidences which have been assigned equal scores by a method are forced to the end of the ranking among those equally scored documents. This ensures that the ranking is consistent and does not depend on the default order or on the way the score is evaluated. Finally, although the answer score is not used to rank the methods, precision and recall are reported to show the performance of the methods on the answering part of this task.
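The sketch below illustrates how such a ranking could be scored: Average Precision per question, with ties broken pessimistically so that equally scored positive documents fall at the end of their tied block, then averaged over all questions to obtain MAP. Names are illustrative; this is a sketch of the idea, not the official evaluation script.

```python
# A minimal MAP sketch with the pessimistic tie-break described above
# (not the official evaluation script).

def average_precision(scores, relevant):
    """scores: {doc_id: score}; relevant: set of positive doc_ids."""
    # Sort by score descending; among equal scores, positives go last.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0] in relevant))
    hits, precision_sum = 0, 0.0
    for rank, (doc_id, _) in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / max(len(relevant), 1)

def mean_average_precision(per_question):
    """per_question: list of (scores, relevant) pairs, one per question."""
    return sum(average_precision(s, r) for s, r in per_question) / len(per_question)

# Example: documents d1..d3 scored by a method, with d2 and d3 relevant.
# d1 and d2 tie at 0.9, so the positive d2 is ranked after d1.
# average_precision({"d1": 0.9, "d2": 0.9, "d3": 0.2}, {"d2", "d3"})  # ≈ 0.583
```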
The dataset for task 1 [5] comprises 50,000 questions over 12,767 document images. The images are pages extracted from 6,071 scanned documents sourced from the Industry Documents Library. We manually selected the documents from the library to ensure document variety. Each question-answer(s) pair in the dataset is also qualified with a question type among 9 types that denote the kind of analysis required to arrive at the answer. The 9 question types are ‘figure’, ‘form’, ‘table/list’, ‘layout’, ‘running text’, ‘photograph’, ‘handwritten’, ‘yes/no’ and ‘other’. We also provide commercial OCR transcriptions of all the documents.

The dataset for task 2 consists of 14,362 document images sharing the same document template, a US Candidate Registration form. Some of these documents are filled in by hand, like the one in Figure 1 (right), while others are typewritten. The images were sourced from the Open Data portal of the Public Disclosure Commission (PDC), and over this collection a set of 20 different questions is posed.
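To make the annotation format concrete, the sketch below shows one way the task 1 question annotations could be loaded and tallied by question type. The file path and field names ("data", "question_types") are assumptions modelled on a typical VQA-style JSON release, not the documented schema; check the downloaded annotations for the exact keys.

```python
# A hypothetical loader for the task 1 annotations. Field names are
# assumptions, not the documented schema of the DocVQA release.
import json
from collections import Counter

def count_question_types(annotation_path: str) -> Counter:
    with open(annotation_path, "r", encoding="utf-8") as f:
        annotations = json.load(f)
    counts = Counter()
    for item in annotations.get("data", []):
        qtypes = item.get("question_types", ["other"])
        if isinstance(qtypes, str):  # tolerate a single-string field
            qtypes = [qtypes]
        for qtype in qtypes:
            counts[qtype] += 1
    return counts

# Example usage (path is illustrative):
# print(count_question_types("docvqa_train.json").most_common())
```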
Task 1: We received submissions from 9 different teams for task 1. The final ranking is shown in Table 1. Performance of the 9 submissions across the different question types is shown in Figure 2.
Task 2: We received submissions from two different teams. The winning method is DQA from the PingAn team, followed by a small margin by DOCR from the iFLYTEK team (see Table 2). None of the submitted methods provided answers to the questions.
Table 1: Final ranking for task 1.

Method                           ANLS
PingAn-OneConnect-Gammalab-DQA   0.85
Structural LM-v2                 0.75
QA_Base_MRC_1                    0.74
HyperDQA_V4                      0.69
bert fulldata fintuned           0.59
Plain BERT QA                    0.35
HDNet                            0.34
CLOVA OCR                        0.33
docVQAQV_V0.1                    0.30

Fig. 2: Performance of the submitted methods for different question types in task 1 (Overall, Figure/Diagram, Form, Table/List, Layout, Running text, Image/Photo, Handwritten, Yes/No, Others).

Table 2: Final ranking for task 2.

Method   Retrieval (MAP)   Answer precision   Answer recall
DQA      0.8090            -                  -
DOCR     0.7915            -                  -

Fig. 3: Average precision of the submitted methods for each question in the test set.

Acknowledgements. This work has been supported by Amazon through an AWS Machine Learning Research Award.
References
1. Biten, A.F., Tito, R., Mafla, A., Gómez, L., Rusiñol, M., Mathew, M., Jawahar, C.V., Valveny, E., Karatzas, D.: ICDAR 2019 competition on scene text visual question answering. CoRR abs/1907.00490 (2019)
2. Biten, A.F., Tito, R., Mafla, A., Gómez, L., Rusiñol, M., Valveny, E., Jawahar, C., Karatzas, D.: Scene text visual question answering. In: ICCV (2019)
3. Gurari, D., Li, Q., Stangl, A.J., Guo, A., Lin, C., Grauman, K., Luo, J., Bigham, J.P.: VizWiz grand challenge: Answering visual questions from blind people. In: CVPR. pp. 3608–3617 (2018)
4. Jayasundara, V., Jayasekara, S., Jayasekara, H., Rajasegaran, J., Seneviratne, S., Rodrigo, R.: TextCaps: Handwritten character recognition with very small datasets. In: WACV. pp. 254–262. IEEE (2019)
5. Mathew, M., Karatzas, D., Manmatha, R., Jawahar, C.V.: DocVQA: A dataset for VQA on document images. CoRR abs/2007.00398 (2020)
6. Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., Rohrbach, M.: Towards VQA models that can read. In: CVPR (2019)