English Machine Reading Comprehension Datasets: A Survey
Daria Dzendzik
ADAPT Centre, Dublin City University, Dublin, Ireland — [email protected]
Carl Vogel
School of Computer Science and Statistics, Trinity College Dublin, the University of Dublin, Dublin, Ireland — [email protected]
Jennifer Foster
School of Computing, Dublin City University, Dublin, Ireland — [email protected]
Abstract
This paper surveys 54 English Machine Reading Comprehension datasets, with a view to providing a convenient resource for other researchers interested in this problem. We categorize the datasets according to their question and answer form and compare them across various dimensions including size, vocabulary, data source, method of creation, human performance level, and first question word. Our analysis reveals that Wikipedia is by far the most common data source and that there is a relative lack of why, when, and where questions across datasets.

Reading comprehension is often tested by measuring a person or system's ability to answer questions on a given text. Machine reading comprehension (MRC) datasets have proliferated in recent years, particularly for the English language – see Fig. 1. The aim of this paper is to make sense of this landscape by providing as extensive as possible a survey of English MRC datasets. Similar surveys have been carried out previously (Liu et al., 2019; Zhang et al., 2019; Wang, 2020; Ingale et al., 2019; Qiu et al., 2019; Lakshmi and Arivuchelvan, 2019; Baradaran et al., 2020; Zeng et al., 2020), but ours differs in its breadth – 54 datasets compared to the two next largest, 47 (Zeng et al., 2020) and 29 (Baradaran et al., 2020) – and in its focus on MRC datasets rather than on MRC systems. Our survey takes a mostly structured form, with the following information presented for each dataset: size, data source, creation method, human performance level, whether the dataset has been "solved", availability of a leaderboard, the most frequent first question token, and whether the dataset is publicly available. We also categorise each dataset by its question/answer type.

The study contributes to the field as follows:
1. it describes and teases apart the ways in which MRC datasets can vary according to their question and answer types;
2. it provides analysis in a structured and visual form (tables and figures) to facilitate easy comparison between datasets;
3. by providing a systematic comparison, and by reporting the "solvedness" status of a dataset, it brings the attention of the community to less popular and relatively understudied datasets;
4. it contains per-dataset statistics such as number of instances, average question/passage/answer length, vocabulary size and text domain, which can be used to estimate the computational requirements for training an MRC system.

This paper has been written with the following readers in mind: (1) those who are new to the field and would like to get a quick yet informative overview of English MRC datasets; (2) those who are planning to create a new MRC dataset; (3) MRC system developers interested in designing the appropriate architecture for a particular dataset, choosing appropriate datasets for a particular architecture, or finding compatible datasets for use in transfer or joint learning.
All MRC datasets in this survey have three components: passage, question, and answer. (We mention the datasets which do not meet this criterion in the supplementary materials, Section B, and explain why we exclude them.) We begin with a categorisation based on the types of answers and the way the question is formulated. We divide questions into three main categories: Statement, Query, and Question. Answers are divided into the following categories: Cloze, Multiple Choice, Boolean, Extractive, and Generative. The relationships between question and answer types are illustrated in Fig. 2. In what follows we briefly describe each question and answer category, followed by a discussion of passage types and dialog-based datasets.

Figure 1: English MRC datasets released per year.

Cloze
The question is formulated as a sentence with a missing word (or words), and the correct entity should be inserted according to the context. We consider the Cloze task in a broader sense, as it is not only a word-insertion task but also sentence completion. The answer candidates may be included, as in (1) from RecipeQA (Yagcioglu et al., 2018), or not, as in (2) from CliCR (Šuster and Daelemans, 2018).

(1)
Passage (P):
You will need 3/4 cup of blackberries ... Pour the mixture into cups and insert a popsicle stick in it or pour it in a popsicle maker. Place the cup ... in the freezer. ...
Question (Q):
Choose the best title for the missing blank to correctly complete the recipe. Ingredients, ____, Freeze, Enjoying
Answer Candidates (AC): (A) Cereal Milk Ice Cream (B) Ingredients (C) Pouring (D) Oven
Answer (A):
C

(2) P: ... intestinal perforation in dengue is very rare and has been reported only in eight patients until today. ... Q: Perforation peritonitis is a ____.
Possible A: very rare complication of dengue
Selective or Multiple Choice (MC)
A number of options is given for each question, and the correct one (or a number of correct answers) should be selected, e.g. (3) from MCTest (Richardson et al., 2013).

(3) P: It was Jessie Bear's birthday. She ... Q: Who was having a birthday?
AC: (A) Jessie Bear (B) no one (C) Lion (D) Tiger A: A

We distinguish cloze multiple choice datasets from other multiple choice datasets. The difference is the form of the question: in the cloze datasets, the answer is a missing part of the question context and, combined together, they form a grammatically correct sentence, whereas for other multiple choice datasets, the question has no missing words.
Boolean
A "Yes/No" answer is expected, e.g. (4) from the BoolQ dataset (Clark et al., 2019). Some datasets which we put in this category have a third option, "Cannot be answered" or "Maybe", e.g. (5) from PubMedQA (Jin et al., 2019).

(4) P: The series is filmed partially in Prince Edward Island as well as locations in ... Q: Is anne with an e filmed on pei? A: Yes

(5) P: ... Young adults whose families were abstainers in 2000 drank substantially less across quintiles in 2010 than offspring of non-abstaining families. The difference, however, was not statistically significant between quintiles of the conditional distribution. Actual drinking levels in drinking families were not at all or weakly associated with drinking in offspring. ... Q: Does the familial transmission of drinking patterns persist into young adulthood? A: Maybe
Extractive or Span Extractive
The answer is a substring of the passage. In other words, the task is to determine the start and end index of the characters in the original passage. The string between those two indexes is the answer, as shown in (6) from SQuAD (Rajpurkar et al., 2016).

Figure 2: Hierarchy of types of question and answer and the relationships between them. → indicates a subtype whereas ⇢ indicates inclusion.

(6) P: With Rivera having been a linebacker with the Chicago Bears in Super Bowl XX, .... Q: What team did Rivera play for in Super Bowl XX? A: Chicago Bears
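To make the span-extraction format concrete, here is a minimal sketch of how a predicted start/end character pair maps back to an answer string; the passage variable and the way the offsets are obtained here are our own illustration (real datasets ship gold start offsets with each question):

```python
# Minimal illustration of the extractive (span) answer format.
passage = ("With Rivera having been a linebacker with the Chicago Bears "
           "in Super Bowl XX, ....")

answer_start = passage.index("Chicago Bears")        # start character index
answer_end = answer_start + len("Chicago Bears")     # end character index (exclusive)

# An extractive MRC system predicts (answer_start, answer_end);
# the answer is simply the substring between the two indexes.
print(passage[answer_start:answer_end])  # -> Chicago Bears
```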
Generative or Free Form Answer
The answer must be generated based on information presented in the passage. Although the answer might be in the text, as illustrated in (7) from NarrativeQA (Kočiský et al., 2018), no passage index connections are provided.

(7) P: ... Mark decides to broadcast his final message as himself. They finally drive up to the crowd of protesting students, .... The police step in and arrest Mark and Nora. ... Q: What are the students doing when Mark and Nora drive up? A: Protesting.
Statement
The question is an affirmative sentence, used in cloze questions, e.g. (1)-(2), and quiz questions, e.g. (8) from SearchQA (Dunn et al., 2017).

(8) P: Jumbuck (noun) is an Australian English term for sheep, ... Q: Australians call this animal a jumbuck or a monkey A: Sheep
Question
An actual question in the standard sense of the word, e.g. (3)-(7). Usually questions are divided into Factoid (Who? Where? What? When?), Non-Factoid (How? Why?), and Yes/No.
Query
The question is formulated to obtain a particular property of a particular object. It is similar to a knowledge graph query, and, in order to be answered, it might involve additional sources beyond the passage, such as a knowledge graph, or the dataset may have been created using a knowledge graph, e.g. (9) from WikiReading (Hewlett et al., 2016).

(9) P: Cecily Bulstrode (1584 - 4 August 1609), was a courtier and ... She was the daughter ... Q: sex or gender A: female

We put datasets with more than one type of question into a separate Mixed category.
Passages can take the form of a one-document or multi-document passage. They can be categorised based on the type of reasoning required to answer a question: Simple Evidence, where the answer to a question is clearly presented in the passage, e.g. (3) and (6); Multihop Reasoning, with questions requiring that several facts from different parts of the passage or different documents are combined to obtain the answer, e.g. (10) from HotpotQA (Yang et al., 2018); and Extended Reasoning, where general knowledge or common sense reasoning is required, e.g. (11) from the Cosmos dataset (Huang et al., 2019):

(10) P: ... 2014 S/S is the debut album of South Korean group WINNER. ... WINNER, is a South Korean boy group formed in 2013 by YG Entertainment and debuted in 2014. ... Q: 2014 S/S is the debut album of a South Korean boy group that was formed by who? A: YG Entertainment

(11) P: I was a little nervous about this today, but they felt fantastic. I think they'll be a very good pair of shoes. This time I'm going to keep track of the miles on them. Q: Why did the writer feel nervous?
AC: (A) None of the above choices. (B) Because the shoes felt fantastic. (C) Because they were unsure if the shoes would be good quality. (D) Because the writer thinks the shoes will be very good. A: C
| Dataset | Size (questions) | Data Source | Q/A Source | LB | Human Performance | Solved | TMFW | PAD |

Cloze Datasets:
| CNN/Daily Mail (Hermann et al., 2015) | 387k/997k | CNN/DailyMail | AG | ♯ | - | ✗ | - | ✓ |
| Children Book Test (Hill et al., 2016) | 687k | Project Gutenberg | AG | ♯ |  | ✓ | - | ✓ |
| Who Did What (Onishi et al., 2016) | 186k | Gigaword | AG | ✓ |  | ✗ | - | req. |
| BookTest (Bajgar et al., 2017) | 14M | Project Gutenberg | AG | ✗ | - | ✗ | - | ✗ |
| Quasar-S (Dhingra et al., 2017) | 37k | Stack Overflow | AG | ✗ |  | ✗ | - | ✓ |
| RecipeQA (Yagcioglu et al., 2018) | 9.8k | instructables.com | AG | ✓ |  | ✗ | - | ✓ |
| CliCR (Šuster and Daelemans, 2018) | 105k | Clinical Reports | AG | ♯ |  | ✗ | - | req. |
| ReCoRD (Zhang et al., 2018a) | 121k | CNN | AG | ✓♯ |  | ✓ | - | ✓ |
| Shmoop (Chaudhury et al., 2019) | 7.2k | Project Gutenberg | ER, AG | ✗ | - | ✗ | - | req. |

Multiple Choice Datasets:
| MCTest (Richardson et al., 2013) | 2k/640 | Stories | CRW | ✓♯ |  | ✗ | what | ✓ |
| WikiQA (Yang et al., 2015) | 3k | Wikipedia | UG, CRW | ♯ | - | ✗ | what | ✓ |
| bAbI (Weston et al., 2016) | 40k | AG | AG | ♯ |  | ✓ | what | ✓ |
| MovieQA (Tapaswi et al., 2016) | 15k | Wikipedia | annotators | ✓ | - | ✗ | what | req. |
| RACE (Lai et al., 2017) | 98k | ER | experts | ✓♯ |  | ✗ | what | ✓ |
| SciQ (Welbl et al., 2017) | 12k | Science Books | CRW | ✗ |  | ✗ | what | ✓ |
| MultiRC (Khashabi et al., 2018) | 10k | reports, News, Wikipedia, ... | CRW | ✓♯ |  | ✓ | what | ✓ |
| MedQA (Zhang et al., 2018b) | 235k | Medical Books | expert | ✗ | - | ✓ | - | ✗ |
| MCScript (Ostermann et al., 2018) | 14k | Scripts, CRW | CRW | ✗ |  | ✗ | how | ✓ |
| MCScript2.0 (Ostermann et al., 2019) | 20k | Scripts, CRW | CRW | ✗ |  | ✗ | what | ✓ |
| RACE-C (Liang et al., 2019) | 14k | ER | experts | ✗ | - | ✗ | the | ✓ |
| DREAM (Sun et al., 2019) | 10k | ER | experts | ✓ |  | ✗ | what | ✓ |
| Cosmos QA (Huang et al., 2019) | 36k | Blogs | CRW | ✓ |  | ✗ | what | ✓ |
| ReClor (Yu et al., 2020) | 6k | ER | experts | ✓ |  | ✗ | which | ✓ |
| QuAIL (Rogers et al., 2020) | 15k | News, Stories, Fiction, Blogs, UG | CRW, experts | ✓ |  | ✗ | - | ✓ |

Boolean Questions:
| BoolQ (Clark et al., 2019) | 16k | Wikipedia | UG, CRW | ✓♯ |  | ✓ | is | ✓ |
| AmazonYesNo (Dzendzik et al., 2019) | 80k | Reviews | UG | ✗ | - | ✗ | does | req. |
| PubMedQA (Jin et al., 2019) | 211k | PubMed | CRW | ✓ |  | ✗ | does | ✓ |

Extractive Datasets:
| SQuAD (Rajpurkar et al., 2016) | 108k | Wikipedia | CRW | ✓♯ |  | ✓ | what | ✓ |
| SQuAD2.0 (Rajpurkar et al., 2018) | 151k | Wikipedia | CRW | ✓♯ |  | ✓ | what | ✓ |
| NewsQA (Trischler et al., 2017) | 120k | CNN | CRW | ♯ |  | ✓ | what | ✓ |
| SearchQA (Dunn et al., 2017) | 140k | CRW, AG | J!Archive | ♯ |  | ✓ | this | ✓ |

Generative Datasets:
| MS MARCO (Nguyen et al., 2016) | 100k | Web documents | UG, HG | ✓♯ | - | ✗ | what | ✓ |
| LAMBADA (Paperno et al., 2016) | 10k | BookCorpus | CRW, AG | ✗ | - | ✗ | - | ✓ |
| WikiMovies (Miller et al., 2016; Watanabe et al., 2017) | 116k | Wikipedia, KG | CRW, AG, KG | ✗ |  | ✗ | what | ✓ |
| WikiSuggest (Choi et al., 2017) | 3.47M | Wikipedia | CRW, AG | ✗ | - | ✗ | - | ✗ |
| TriviaQA (Joshi et al., 2017) | 96k | Wikipedia, Web docs | Trivia, CRW | ✓♯ |  | ✗ | which | ✓ |
| NarrativeQA (Kočiský et al., 2018) | 47k | Wikipedia, Project Gutenberg, movie, HG | HG | ♯ |  | ✓ | what | ✓ |
| TweetQA (Xiong et al., 2019) | 14k | News, Twitter, HG | CRW | ✓ |  | ✓ | what | ✓ |

Conversational Datasets:
| ShARC (Saeidi et al., 2018) | 32k | Legal web sites | CRW | ✓ |  | ✗ | can | ✓ |
| CoQA (Reddy et al., 2019) | 127k | Books, News, Wikipedia, ER | CRW | ✓♯ |  | ✓ | what | ✓ |

Mixed Datasets:
| TurkQA (Malon and Bai, 2013) | 54k | Wikipedia | CRW | ✗ | - | ✗ | what | ✓ |
| WikiReading (Hewlett et al., 2016) | 18.9M | Wikipedia | AG, KG | ✗ | - | ✗ | - | ✓ |
| Quasar-T (Dhingra et al., 2017) | 43k | Trivia, ClueWeb09 | AG | ♯ |  | ✗ | what | ✓ |
| HotpotQA (Yang et al., 2018) | 113k | Wikipedia | CRW | ✓♯ |  | ✗ | what | ✓ |
| QAngaroo WikiHop (Welbl et al., 2018) | 51k | Wikipedia | CRW, KG | ✓♯ |  | ✗ | - | ✓ |
| QAngaroo MedHop (Welbl et al., 2018) | 2.5k | Medline abstracts | CRW, KG | ✓ | - | ✗ | - | ✓ |
| QuAC (Choi et al., 2018) | 98k | Wikipedia | CRW | ✓♯ |  | ✗ | what | ✓ |
| DuoRC (Saha et al., 2018) | 86k | Wikipedia + IMDB | CRW | ✓ | - | ✗ | who | ✓ |
| emrQA (Pampari et al., 2018) | 456k | Clinic Records | expert, AG | ✗ | - | ✗ | does | req. |
| DROP (Dua et al., 2019) | 97k | Wikipedia | CRW | ✓ |  | ✗ | how | ✓ |
| NaturalQuestions (Kwiatkowski et al., 2019) | 323k | Wikipedia | UG, CRW | ✓♯ |  | ✗ | who | ✓ |
| AmazonQA (Gupta et al., 2019b) | 570k | UG Review | UG | ✗ |  | ✗ | does | ✓ |
| TyDi (Clark et al., 2020) | 11k | Wikipedia | CRW | ✓ |  | ✗ | what | ✓ |
| R³ (Wang et al., 2020b) | 60k | Wikipedia | CRW | ✗ | - | ✗ | - | ✗ |

Table 1: Reading comprehension datasets comparison. LB – leaderboard available; Human Performance – expert/non-expert unless otherwise specified, accuracy unless otherwise specified; TMFW – the most frequent first word; PAD – publicly available data; k/M – thousands/millions; CRW – crowdsourcing; AG – automatically generated; KG – knowledge graph; ER – educational resources; UG – user generated; HG – human generated (UG + annotators, CRW, experts); L/S – long/short answer; ✓ – available/"solved"; ✗ – unavailable/not "solved"; ♯ – the leaderboard is presented at https://paperswithcode.com/; req. – the dataset is available by request. The information was verified in June 2020.

Conversational MRC
We put Conversational or Dialog datasets in a separate category as they have a unique combination of passage, question, and answer. The passage sets a particular context and is then completed by a number of follow-up questions and answers. The full passage is presented as a conversation, and the question should be answered based on previous utterances, as illustrated in (12) from ShARC (Saeidi et al., 2018), where the scenario is an additional part of the passage unique to each dialog. Questions asked before, and their answers, become part of the passage for the following question. (We do not include DREAM (Sun et al., 2019) in this category: even though the passages are in dialog form, the questions are about the dialog and not a part of it, which is why DREAM is in the Multiple Choice category.)

(12) P: Eligibility. You'll be able to claim the new State Pension if you're: a man born on or after 6 April 1951, a woman born on or after 6 April 1953
Scenario: I'm female and I was born in 1966
Q: Am I able to claim the new State Pension?
Follow ups: (1) Are you a man born on or after 6 April 1951? – No (2) Are you a woman born on or after 6 April 1953? – Yes
A: Yes
All datasets and their properties of interest are listed in Table 1. We indicate the number of questions per dataset (size), the text sources, the method of creation, whether a leaderboard is available and the data are publicly available, and whether the dataset is solved, i.e. whether the performance of an MRC system exceeds the reported human performance (also shown). We discuss each of these aspects below. (Extra features are listed in the supplementary materials, Table 3.)
A significant proportion of datasets (21 out of 54) use Wikipedia as a passage source. Six of those use Wikipedia along with additional sources. Other popular sources of text data are news (CNN/DailyMail, WhoDidWhat, NewsQA, CoQA, MultiRC, ReCoRD, QuAIL), books, including Project Gutenberg and BookCorpus (Zhu et al., 2015) (ChildrenBookTest, BookTest, LAMBADA, partly CoQA, Shmoop, SciQ), movie scripts (MovieQA, WikiMovies, DuoRC), and a combination of these (MultiRC and NarrativeQA). (BookCorpus: yknzhu.wixsite.com/mbweb – all links last verified (l.v.) 02/2020.) Five datasets (CliCR, PubMedQA, MedQA, emrQA, QAngaroo MedHop) were created in the medical domain based on clinical reports, medical books, MEDLINE abstracts, and PubMed. ShARC is based on legal resource websites.

Some datasets make use of exam questions. RACE, RACE-C, and DREAM use data from English as a Foreign Language examinations, ReClor from the Graduate Management Admission Test (GMAT) and the Law School Admission Test (LSAT), and MedQA from medical exams. Other sources of data include personal narratives from the Spinn3r Blog Dataset (Burton et al., 2009) (MCScript, MCScript2.0, CosmosQA), StackOverflow.com (Quasar-S), Quora.com (QuAIL), Twitter.com (TweetQA, for which Xiong et al. (2019) selected tweets featured in the news), Amazon.com user reviews and questions (AmazonQA, AmazonYesNo), and a cooking website (RecipeQA).

Fig. 3 shows the domains used by datasets as well as any overlaps between datasets. Some datasets share not only text sources but the actual samples. SQuAD2.0 extends SQuAD with unanswerable questions. AmazonQA and AmazonYesNo overlap in questions and passages with slightly different processing. BoolQ shares 3k questions and passages with the NaturalQuestions dataset. The R³ dataset is fully based on DROP with a focus on reasoning.

Figure 3: Question Answering Reading Comprehension datasets overview.

Rule-based approaches have been used to automatically obtain questions and passages for the MRC task by generating the sentences (e.g. bAbI) or, in the case of cloze-type questions, excluding a word from the context. We call those methods automatically generated (AG). Most dataset creation, however, involves a human in the loop. We distinguish three types of people: experts are professionals in a specific domain; crowdworkers (CRW) are casual workers who normally meet certain criteria (for example a particular level of proficiency in the dataset language) but are not experts in the subject area; users voluntarily create content based on their personal needs and interests.

More than half of the datasets (33 out of 54) were created using crowdworkers. In one scenario, crowdworkers have access to the passage and must formulate questions based on it. For example, MovieQA, ShARC, SQuAD, and SQuAD2.0 were created in this way. In contrast, another scenario involves finding a passage containing the answer to a given question. That works well for datasets where questions are taken from already existing resources such as trivia and quiz questions (TriviaQA, Quasar-T, and SearchQA), or where web search queries and results from Google and Bing are used as a source of questions and passages (BoolQ, NaturalQuestions, MS MARCO).

In an attempt to avoid word repetition between passages and questions, some datasets used different texts about the same topic as a passage and as a source of questions. For example, DuoRC takes descriptions of the same movie from Wikipedia and IMDB. One description is used as a passage while the other is used for creating the questions. NewsQA uses only the title and a short news article summary as a source for questions while the whole text becomes the passage. Similarly, in NarrativeQA, only the abstracts of the stories were used for question creation. For MCScript and MCScript2.0, questions and passages were created by different sets of crowdworkers given the same script.
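As a rough sketch of the automatically generated (AG) cloze recipe described above — removing a word or entity from a sentence and keeping it as the answer — one might write something like the following. This is our own illustration, not the exact pipeline of any cited dataset, and it assumes the spaCy en_core_web_sm model is installed:

```python
# Hypothetical AG-style cloze generation: mask the first named entity in a
# sentence and keep it as the gold answer.  Purely illustrative.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model has been downloaded

def make_cloze(sentence: str):
    doc = nlp(sentence)
    if not doc.ents:
        return None  # no entity to mask, skip this sentence
    answer = doc.ents[0]
    question = sentence[:answer.start_char] + "@placeholder" + sentence[answer.end_char:]
    return question, answer.text

print(make_cloze("WINNER is a South Korean boy group formed in 2013 by YG Entertainment."))
```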
Each dataset's size is shown in Table 1. The majority of datasets contain 10k+ questions, which makes them suitable for training and/or fine-tuning a deep learning model. A few datasets contain fewer than 10k samples: MultiRC (9.9k), Shmoop (7.2k), ReClor (6.1k), QAngaroo MedHop (2.5k), WikiQA (2k). Every dataset has its own structure and data format, but we processed all datasets the same way, extracting lists of questions, passages, and answers, including answer candidates, using the spaCy tokenizer (spacy.io/api/tokenizer – l.v. 03/2020).
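For illustration, the uniform preprocessing described above can be approximated with a few lines of Python; this is only a sketch under our own assumptions (toy lists instead of real dataset loaders, and a blank spaCy English pipeline used purely as a tokenizer):

```python
# Tokenise questions, passages and answers with spaCy and report token counts.
import statistics
import spacy

nlp = spacy.blank("en")  # tokenizer-only pipeline; no tagger/parser required

def token_lengths(texts):
    return [len(nlp(text)) for text in texts]

questions = ["What team did Rivera play for in Super Bowl XX?"]
passages = ["With Rivera having been a linebacker with the Chicago Bears in Super Bowl XX, ..."]
answers = ["Chicago Bears"]

for name, texts in (("question", questions), ("passage", passages), ("answer", answers)):
    lengths = token_lengths(texts)
    print(f"{name}: mean={statistics.mean(lengths):.1f}, max={max(lengths)}")
```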
Question/Passage/Answer Length
The graphs in Fig. 4 provide more insight into the differences between the datasets in terms of answer, question, and passage length, as well as vocabulary size (see supplementary materials, Table 4, for details; we use matplotlib for calculation and visualisation: https://matplotlib.org/ – l.v. 10/2020). The outliers are highlighted. The majority of datasets have a passage length under 1,500 tokens, with the median being 329 tokens, but due to seven outliers the average number of tokens is 1,250 (Fig. 4 (a)). Some datasets (MS MARCO, SearchQA, AmazonYesNo, AmazonQA, MedQA) have a collection of documents as a passage, but others contain just a few sentences. The number of tokens in a question lies mostly between 5 and 20. Two datasets, ChildrenBookTest and WhoDidWhat, have on average more than 30 tokens per question, while WikiReading, QAngaroo MedHop, and WikiHop have only 2–3.5 tokens per question on average (Fig. 4 (b)). The majority of datasets contain fewer than 8 tokens per answer, with the average being 3.5 tokens per answer. NaturalQuestions is an outlier, with an average of 164 tokens per answer (we focus on short answers, considering long ones only if a short answer is not available) (Fig. 4 (c)).

Figure 4: The average length in tokens of (a) passages, (b) questions, (c) answers, and (d) vocabulary size in unique lower-cased lemmas of datasets, with the median, mean value, and standard deviation (St. dev). Outliers are highlighted.
Vocabulary Size
To obtain a vocabulary size we calculate the number of unique lower-cased lemmas of tokens. The vocabulary size distribution is presented in Fig. 4 (d). There is a moderate correlation between the number of questions in a dataset and its vocabulary size (see Fig. 5); as the data has a non-normal distribution we calculated the Spearman correlation coefficient, which is 0.58 with a p-value of 1.3e-05. WikiReading has the largest number of questions as well as the richest vocabulary. bAbI is a synthetic dataset with 40k questions but only 152 lemmas in its vocabulary. (See supplementary materials, Section A.2 and Table 5.)

Figure 5: The number of questions and vocabulary size (unique lower-cased lemmas). The values for BookTest (Bajgar et al., 2017) and WhoDidWhat (Onishi et al., 2016) are borrowed.
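A small sketch of how the vocabulary measure and the correlation can be computed (toy inputs and invented dataset counts, assuming the spaCy en_core_web_sm model and SciPy are available):

```python
# Vocabulary size as the number of unique lower-cased lemmas, plus a Spearman
# correlation between question counts and vocabulary sizes (toy values only).
import spacy
from scipy.stats import spearmanr

nlp = spacy.load("en_core_web_sm")  # a full pipeline is needed for lemmas

def vocab_size(texts):
    lemmas = {tok.lemma_.lower()
              for doc in nlp.pipe(texts)
              for tok in doc if not tok.is_space and not tok.is_punct}
    return len(lemmas)

print(vocab_size(["Where is the kitchen?", "Where are the kitchens?"]))

# Spearman is used because the data is not normally distributed.
n_questions = [2_000, 40_000, 108_000, 18_900_000]   # hypothetical dataset sizes
vocab_sizes = [4_000, 152, 60_000, 2_000_000]        # hypothetical vocabulary sizes
rho, p_value = spearmanr(n_questions, vocab_sizes)
print(rho, p_value)
```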
Language Detection
We ran a language detector over all datasets, using the pyenchant (for American and British English) and langid libraries (pypi.org/project/pyenchant/ and github.com/saffsd/langid.py – all links l.v. 10/2020). In 36 of the 54 datasets, more than 10% of the words are reported to be non-English. We inspected 200 randomly chosen samples from a subset of these. For Wikipedia datasets (HotpotQA, QAngaroo WikiHop), around 70–75% of those words are named entities; 10–12% are specific terms borrowed from other languages, such as names of plants, animals, etc.; another 8–10% are foreign words, e.g. the word "dialetto" from the HotpotQA passage "Bari dialect (dialetto barese) is a dialect of Neapolitan ..."; about 1.5–3% are misspelled words and tokenization errors. In contrast, for the user-generated dataset AmazonQA, 67% are tokenization and spelling errors. This aspect of a dataset's vocabulary is useful to bear in mind when, for example, fine-tuning a pre-trained language model which has been trained on less noisy text.
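The language check can be approximated as below; this is a simplified sketch (word-level dictionary lookup plus a langid call) and assumes the pyenchant backend with US/UK English dictionaries and the langid package are installed:

```python
# Estimate the share of words not recognised as American or British English,
# and classify a snippet with langid.  Counting details are our simplification.
import enchant
import langid

en_us = enchant.Dict("en_US")
en_gb = enchant.Dict("en_GB")

def non_english_ratio(words):
    alpha = [w for w in words if w.isalpha()]
    unknown = [w for w in alpha if not (en_us.check(w) or en_gb.check(w))]
    return len(unknown) / max(len(alpha), 1)

sample = "Bari dialect ( dialetto barese ) is a dialect of Neapolitan".split()
print(non_english_ratio(sample))
print(langid.classify("dialetto barese"))  # -> (language code, score)
```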
| Question | All Questions: Count | All Questions: % | Unique Questions: Count | Unique Questions: % |
| what | 1497009 | 22.39 | 1069275 | 24.23 |
| when | 137865 | 2.06 | 116158 | 2.63 |
| where | 154990 | 2.32 | 119250 | 2.70 |
| which | 275454 | 4.12 | 123731 | 2.80 |
| why | 95493 | 1.43 | 68217 | 1.55 |
| how | 456559 | 6.83 | 230948 | 5.23 |
| who/whose | 392166 | 5.87 | 293130 | 6.64 |
| boolean | 2236356 | 33.45 | 1259287 | 28.53 |
| other | 1439241 | 21.53 | 975681 | 22.11 |

Table 2: Frequency of the first token of the questions across datasets.
First Question Word
A number of datasets come with a breakdown of question types based on the first token (Nguyen et al., 2016; Ostermann et al., 2018, 2019; Kočiský et al., 2018; Clark et al., 2019; Xiong et al., 2019). We inspect the most frequent first word in a dataset's questions, excluding cloze-style questions. Table 1 shows the most frequent first word per dataset and Table 2 shows the same information over all datasets. (See the supplementary materials, Section A.3.)

The most popular first word is what – 22% of all questions analysed, and over half of the questions in WikiQA, WikiMovies, MCTest, CosmosQA, and DREAM start with what. The majority of questions in ReClor (56.5%) start with the word which, and RACE has 23.1%. DROP is mostly focused on how much/many and how old questions (60.4%). DuoRC has a significant proportion of who/whose questions (39.5%). Why, when, and where questions are under-represented – only 1.4%, 2%, and 2.3% of all questions respectively. Only CosmosQA has a significant proportion (34.2%) of why questions, MCScript2.0 (27.9%) and TyDi (20.5%) of when questions, and bAbI (36.9%) of where questions.
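The first-word breakdown is straightforward to reproduce; the sketch below uses our own approximate bucketing of boolean-style first tokens, so it mirrors Table 2 only loosely:

```python
# Count the first token of each (non-cloze) question, lower-cased.
from collections import Counter

WH_WORDS = {"what", "when", "where", "which", "why", "how"}
BOOL_WORDS = {"is", "are", "was", "were", "do", "does", "did",
              "can", "could", "has", "have", "will", "would"}

def first_word_bucket(question: str) -> str:
    tokens = question.lower().split()
    if not tokens:
        return "other"
    first = tokens[0]
    if first in {"who", "whose"}:
        return "who/whose"
    if first in WH_WORDS:
        return first
    if first in BOOL_WORDS:
        return "boolean"
    return "other"

questions = ["What team did Rivera play for?",
             "Is anne with an e filmed on pei?",
             "Am I able to claim the new State Pension?"]
print(Counter(first_word_bucket(q) for q in questions))
```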
Human performance figures have been reported for some datasets – see Table 1. This is useful in two ways. Firstly, it gives some indication of the difficulty of the questions in the dataset. Contrast, for example, the low human performance scores reported for the Quasar and CliCR datasets with the very high scores for DREAM, DROP, and MCScript. Secondly, it provides a comparison point for automatic systems, which may serve to direct researchers to under-studied datasets where the gap between state-of-the-art machine performance and human performance is large, e.g. CliCR (33.9 vs. 53.7), RecipeQA (29.07 vs. 73.63), ShARC (78.3 vs. 93.9) and HotpotQA (82.20 vs. 96.37).

Although useful, the notion of human performance is problematic and has to be interpreted with caution. It is usually an average over individual humans, whose reading comprehension abilities will vary depending on age, ability to concentrate, and interest in and knowledge of the subject area. Some datasets (CliCR, Quasar) take the latter into account by distinguishing between expert and non-expert human performance, while RACE distinguishes between crowdworker and author annotations. The authors of MedQA, which is based on medical examinations, use a passing mark (of 60%) as a proxy for human performance. It is important to know this when looking at its "solved" status, since state-of-the-art accuracy on this dataset is only 75.3% (Zhang et al., 2018b).

Finally, Dunietz et al. (2020) call into question the importance of comparing human and machine performance on the MRC task and argue that the questions that MRC systems need to be able to answer are not necessarily the questions that people find difficult to answer.
This paper presents an up-to-date, one-stop-shop picture of 54 English MRC datasets. We compare the datasets by question and answer type, size, data source, creation method, vocabulary, question type, "solvedness", and human performance level. Looking at the history of dataset creation, we observe a tendency to move from smaller datasets towards large collections of questions, and from synthetically generated data, through crowdsourcing, towards spontaneously created data. We also observe a scarcity of why, when, and where questions.

Gathering and processing the data for this survey was a painstaking task, from which we emerge with some very practical recommendations for future MRC dataset creators. In order to 1) enable comparison with existing datasets, 2) highlight possible limitations for applicable methods, and 3) indicate the computational resources required to process the data, some basic statistics such as average passage/question/answer length, vocabulary size and frequency of question words should be reported; the data itself should be stored in a consistent, easy-to-process fashion, ideally with an API provided; any data overlap with existing datasets should be reported; human performance on the dataset should be measured and what it means clearly explained; and finally, if the dataset is for the English language and its design does not differ radically from those surveyed here, e.g. the recent Template of Understanding approach (Dunietz et al., 2020), it is crucial to explain why this new dataset is needed.

For any future datasets, we suggest a move away from Wikipedia, given the volume of existing datasets that are based on it and its use in pre-trained language models such as BERT (Devlin et al., 2019). As shown by Petroni et al. (2019), its use in both MRC datasets and pre-training data brings with it the problem that we cannot always determine whether a system's ability to answer a question comes from its comprehension of the relevant passage or from the underlying language model.

The medical domain is well represented in the collection of English MRC datasets, indicating a demand for understanding of this type of text. Datasets may be required for other domains, such as retail, law and government.

Some datasets are designed to test the ability of systems to tell if a question cannot be answered, by including a "no answer" label. Building upon this, we suggest that datasets be created for the more complex task of answering questions differently based on different possible interpretations of the question, or determining whether contradictory information is given, i.e. similar to dialog datasets such as ShARC but in a non-dialog scenario.

Acknowledgements
This research is supported by Science Foundation Ireland in the ADAPT Centre for Digital Content Technology, funded under the SFI Research Centres Programme (Grant 13/RC/2106) and the European Regional Development Fund. We also thank Andrew Dunne, Koel Dutta Chowdhury, Valeriia Filimonova, Victoria Serga, Marina Lisuk, Ke Hu, Joachim Wagner, and Alberto Poncelas.
References
Akari Asai, Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2018. Multilingual extractive reading comprehension by runtime machine translation. arXiv:1809.03275.
Ondrej Bajgar, Rudolf Kadlec, and Jan Kleindienst. 2017. Embracing data abundance.
Razieh Baradaran, Razieh Ghiasi, and Hossein Amirkhani. 2020. A survey on machine reading comprehension systems. CoRR, abs/2001.01582.
Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544, Seattle, Washington, USA. Association for Computational Linguistics.
Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. arXiv:1506.02075.
Kevin Burton, Akshay Java, and Ian Soboroff. 2009. The ICWSM 2009 Spinn3r Dataset. In Third Annual Conference on Weblogs and Social Media (ICWSM 2009), San Jose, CA. AAAI.
Casimiro Pio Carrino, Marta R. Costa-jussà, and José A. R. Fonollosa. 2020. Automatic Spanish translation of SQuAD dataset for multi-lingual question answering. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 5515–5523, Marseille, France. European Language Resources Association.
Atef Chaudhury, Makarand Tapaswi, Seung Wook Kim, and Sanja Fidler. 2019. The Shmoop Corpus: A Dataset of Stories with Loosely Aligned Summaries. arXiv:1912.13082.
Michael Chen, Mike D'Arcy, Alisa Liu, Jared Fernandez, and Doug Downey. 2019. CODAH: An adversarially-authored question answering dataset for common sense. In Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP, pages 63–69, Minneapolis, USA. Association for Computational Linguistics.
Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2174–2184, Brussels, Belgium. Association for Computational Linguistics.
Eunsol Choi, Daniel Hewlett, Jakob Uszkoreit, Illia Polosukhin, Alexandre Lacoste, and Jonathan Berant. 2017. Coarse-to-fine question answering for long documents. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 209–220, Vancouver, Canada. Association for Computational Linguistics.
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936, Minneapolis, Minnesota. Association for Computational Linguistics.
Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. To appear in Transactions of the Association for Computational Linguistics.
Danilo Croce, Alexandra Zelenanska, and Roberto Basili. 2018. Neural learning for question answering in Italian. In AI*IA 2018 – Advances in Artificial Intelligence, pages 389–402, Cham. Springer International Publishing.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Bhuwan Dhingra, Kathryn Mazaitis, and William W Cohen. 2017. Quasar: Datasets for question answering by search and reading. arXiv:1707.03904.
Martin d'Hoffschmidt, Maxime Vidal, Wacim Belblidia, and Tom Brendlé. 2020. FQuAD: French Question Answering Dataset. arXiv:2002.06071.
Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368–2378, Minneapolis, Minnesota. Association for Computational Linguistics.
Jesse Dunietz, Greg Burnham, Akash Bharadwaj, Owen Rambow, Jennifer Chu-Carroll, and Dave Ferrucci. 2020. To test machine comprehension, start by defining comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7839–7859, Online. Association for Computational Linguistics.
Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Güney, Volkan Cirik, and Kyunghyun Cho. 2017. SearchQA: A new Q&A dataset augmented with context from a search engine. CoRR, abs/1704.05179.
Daria Dzendzik, Carl Vogel, and Jennifer Foster. 2019. Is it dish washer safe? Automatically answering "Yes/No" questions using customer reviews. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 1–6, Minneapolis, Minnesota. Association for Computational Linguistics.
Pavel Efimov, Leonid Boytsov, and Pavel Braslavski. 2019. SberQuAD – Russian reading comprehension dataset: Description and analysis. arXiv:1912.09723.
Ahmed Elgohary, Chen Zhao, and Jordan Boyd-Graber. 2018. A dataset and baselines for sequential open-domain question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1077–1083, Brussels, Belgium. Association for Computational Linguistics.
Deepak Gupta, Asif Ekbal, and Pushpak Bhattacharyya. 2019a. A deep neural network framework for English Hindi question answering. ACM Trans. Asian Low-Resour. Lang. Inf. Process., 19(2).
Mansi Gupta, Nitish Kulkarni, Raghuveer Chanda, Anirudha Rayasam, and Zachary C. Lipton. 2019b. AmazonQA: A review-based question answering task. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 4996–5002. International Joint Conferences on Artificial Intelligence Organization.
Wei He, Kai Liu, Jing Liu, Yajuan Lyu, Shiqi Zhao, Xinyan Xiao, Yuan Liu, Yizhong Wang, Hua Wu, Qiaoqiao She, Xuan Liu, Tian Wu, and Haifeng Wang. 2018. DuReader: a Chinese machine reading comprehension dataset from real-world applications. In Proceedings of the Workshop on Machine Reading for Question Answering, pages 37–46, Melbourne, Australia. Association for Computational Linguistics.
Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pages 1693–1701, Cambridge, MA, USA. MIT Press.
Daniel Hewlett, Alexandre Lacoste, Llion Jones, Illia Polosukhin, Andrew Fandrianto, Jay Han, Matthew Kelcey, and David Berthelot. 2016. WikiReading: A novel large-scale language understanding task over Wikipedia. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1535–1545, Berlin, Germany. Association for Computational Linguistics.
Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2016. The Goldilocks principle: Reading children's books with explicit memory representations. In Proceedings of the 4th International Conference on Learning Representations (ICLR 2016), San Juan, Puerto Rico.
Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. Cosmos QA: Machine reading comprehension with contextual commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2391–2401, Hong Kong, China. Association for Computational Linguistics.
Vaishali Ingale, Pushpender Singh, and Aditi Bhardwaj. 2019. Datasets for machine reading comprehension: A literature review. SSRN Electronic Journal.
Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. 2019. PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2567–2577, Hong Kong, China. Association for Computational Linguistics.
Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada. Association for Computational Linguistics.
A. Kembhavi, M. Seo, D. Schwenk, J. Choi, A. Farhadi, and H. Hajishirzi. 2017. Are you smarter than a sixth grader? Textbook question answering for multimodal machine comprehension. Pages 5376–5384.
Tom Kenter, Llion Jones, and Daniel Hewlett. 2018. Byte-level machine reading across morphologically varied languages. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), pages 5820–5827.
Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 252–262, New Orleans, Louisiana. Association for Computational Linguistics.
Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2018. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466.
Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark. Association for Computational Linguistics.
K Lakshmi and K.M. Arivuchelvan. 2019. A survey on datasets for machine reading comprehension. In Proceedings of the International Conference on Recent Trends in Computing, Communication & Networking Technologies (ICRTCCNT) 2019. SSRN Electronic Journal (2019).
Kyungjae Lee, Kyoungho Yoon, Sunghyun Park, and Seung-won Hwang. 2018. Semi-supervised training data generation for multilingual question answering. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
Yichan Liang, Jianheng Li, and Jian Yin. 2019. A new multi-choice reading comprehension dataset for curriculum learning. In Proceedings of The Eleventh Asian Conference on Machine Learning, volume 101 of Proceedings of Machine Learning Research, pages 742–757, Nagoya, Japan. PMLR.
Seungyoung Lim, Myungji Kim, and Jooyoul Lee. 2019. KorQuAD1.0: Korean QA dataset for machine reading comprehension. arXiv:1909.07005.
Shanshan Liu, Xin Zhang, Sheng Zhang, Hui Wang, and Weiming Zhang. 2019. Neural machine reading comprehension: Methods and trends. Applied Sciences, 9(18):3698.
Kateřina Macková and Milan Straka. 2020. Reading comprehension in Czech via machine translation and cross-lingual transfer. In Text, Speech, and Dialogue, pages 171–179, Cham. Springer International Publishing.
Christopher Malon and Bing Bai. 2013. Answer extraction by recursive parse tree descent. In Proceedings of the Workshop on Continuous Vector Space Models and their Compositionality, pages 110–118, Sofia, Bulgaria. Association for Computational Linguistics.
Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-value memory networks for directly reading documents. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1400–1409, Austin, Texas. Association for Computational Linguistics.
Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 839–849, San Diego, California. Association for Computational Linguistics.
Nasrin Mostafazadeh, Michael Roth, Annie Louis, Nathanael Chambers, and James Allen. 2017. LSDSem 2017 shared task: The story cloze test. In Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics, LSDSem@EACL 2017, Valencia, Spain, April 3, 2017, pages 46–51.
Kiet Van Nguyen, Khiem Vinh Tran, Son T. Luu, Anh Gia-Tuan Nguyen, and Ngan Luu-Thuy Nguyen. 2020. A pilot study on multiple choice machine reading comprehension for Vietnamese texts. arXiv:2001.05687.
Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated MAchine Reading COmprehension dataset. In Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 2016, co-located with the 30th Annual Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
Takeshi Onishi, Hai Wang, Mohit Bansal, Kevin Gimpel, and David McAllester. 2016. Who did what: A large-scale person-centered cloze dataset. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2230–2235, Austin, Texas. Association for Computational Linguistics.
Simon Ostermann, Ashutosh Modi, Michael Roth, Stefan Thater, and Manfred Pinkal. 2018. MCScript: A novel dataset for assessing machine comprehension using script knowledge. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
Simon Ostermann, Michael Roth, and Manfred Pinkal. 2019. MCScript2.0: A machine comprehension corpus focused on script events and participants. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019), pages 103–117, Minneapolis, Minnesota. Association for Computational Linguistics.
Anusri Pampari, Preethi Raghavan, Jennifer Liang, and Jian Peng. 2018. emrQA: A large corpus for question answering on electronic medical records. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2357–2368, Brussels, Belgium. Association for Computational Linguistics.
Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1525–1534, Berlin, Germany. Association for Computational Linguistics.
Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China. Association for Computational Linguistics.
Boyu Qiu, Xu Chen, Jungang Xu, and Yingfei Sun. 2019. A survey on neural machine reading comprehension. arXiv:1906.03824.
Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.
Siva Reddy, Danqi Chen, and Christopher D. Manning. 2019. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266.
Matthew Richardson, Christopher J.C. Burges, and Erin Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 193–203, Seattle, Washington, USA. Association for Computational Linguistics.
Anna Rogers, Olga Kovaleva, Matthew Downey, and Anna Rumshisky. 2020. Getting closer to AI complete question answering: A set of prerequisite real tasks. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, New York, NY, USA, February 7-12, 2020, pages 8722–8731. AAAI Press.
Marzieh Saeidi, Max Bartolo, Patrick Lewis, Sameer Singh, Tim Rocktäschel, Mike Sheldon, Guillaume Bouchard, and Sebastian Riedel. 2018. Interpretation of natural language rules in conversational machine reading. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2087–2097, Brussels, Belgium. Association for Computational Linguistics.
Amrita Saha, Rahul Aralikatte, Mitesh M. Khapra, and Karthik Sankaranarayanan. 2018. DuoRC: Towards complex language understanding with paraphrased reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1683–1693, Melbourne, Australia. Association for Computational Linguistics.
Guo Shangmin, Liu Kang, He Shizhu, Liu Cao, Zhao Jun, and Wei Zhuoyu. 2017. IJCNLP-2017 Task 5: Multi-choice question answering in examinations. In Proceedings of the IJCNLP 2017, Shared Tasks, pages 34–40. Asian Federation of Natural Language Processing.
Kai Sun, Dian Yu, Jianshu Chen, Dong Yu, Yejin Choi, and Claire Cardie. 2019. DREAM: A challenge dataset and models for dialogue-based reading comprehension. Transactions of the Association for Computational Linguistics, 7(0):217–231.
Simon Šuster and Walter Daelemans. 2018. CliCR: a dataset of clinical case reports for machine reading comprehension. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1551–1563, New Orleans, Louisiana. Association for Computational Linguistics.
Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis, Minnesota. Association for Computational Linguistics.
Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2016. MovieQA: Understanding stories in movies through question-answering. Pages 4631–4640. IEEE Computer Society.
Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. NewsQA: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 191–200, Vancouver, Canada. Association for Computational Linguistics.
Bingning Wang, Ting Yao, Qi Zhang, Jingfang Xu, and Xiaochuan Wang. 2020a. ReCO: A large scale Chinese reading comprehension dataset on opinion. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):9146–9153.
Chao Wang. 2020. A study of the tasks and models in machine reading comprehension. arXiv:2001.08635.
Ran Wang, Kun Tao, Dingjie Song, Zhilong Zhang, Xiao Ma, Xi'ao Su, and Xinyu Dai. 2020b. R3: A reading comprehension benchmark requiring reasoning processes. arXiv:2004.01251.
Yusuke Watanabe, Bhuwan Dhingra, and Ruslan Salakhutdinov. 2017. Question answering from unstructured text by retrieval and comprehension. arXiv:1703.08885.
Johannes Welbl, Nelson F. Liu, and Matt Gardner. 2017. Crowdsourcing multiple choice science questions. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 94–106, Copenhagen, Denmark. Association for Computational Linguistics.
Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics, 6:287–302.
Jason Weston, Antoine Bordes, Sumit Chopra, and Tomas Mikolov. 2016. Towards AI-complete question answering: A set of prerequisite toy tasks.
Qizhe Xie, Guokun Lai, Zihang Dai, and Eduard Hovy. 2018. Large-scale cloze test dataset created by teachers. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2344–2356, Brussels, Belgium. Association for Computational Linguistics.
Wenhan Xiong, Jiawei Wu, Hong Wang, Vivek Kulkarni, Mo Yu, Shiyu Chang, Xiaoxiao Guo, and William Yang Wang. 2019. TWEETQA: A social media focused question answering dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5020–5031, Florence, Italy. Association for Computational Linguistics.
Semih Yagcioglu, Aykut Erdem, Erkut Erdem, and Nazli Ikizler-Cinbis. 2018. RecipeQA: A challenge dataset for multimodal comprehension of cooking recipes. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1358–1368, Brussels, Belgium. Association for Computational Linguistics.
Yi Yang, Wen-tau Yih, and Christopher Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2013–2018, Lisbon, Portugal. Association for Computational Linguistics.
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.
Weihao Yu, Zihang Jiang, Yanfei Dong, and Jiashi Feng. 2020. ReClor: A reading comprehension dataset requiring logical reasoning.
Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 93–104, Brussels, Belgium. Association for Computational Linguistics.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy. Association for Computational Linguistics.
Chengchang Zeng, Shaobo Li, Qin Li, Jie Hu, and Jianjun Hu. 2020. A survey on machine reading comprehension: Tasks, evaluation metrics, and benchmark datasets. arXiv:2006.11880.
Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018a. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. arXiv:1810.12885.
Xiao Zhang, Ji Wu, Zhiyang He, Xien Liu, and Ying Su. 2018b. Medical exam question answering with large-scale reading comprehension. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), pages 5706–5713.
Xin Zhang, An Yang, Sujian Li, and Yizhong Wang. 2019. Machine reading comprehension: a literature review. CoRR, abs/1907.01686.
Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV '15, pages 19–27, USA. IEEE Computer Society.
Supplementary Materials
A.1 Extra Features and Statistics
For some datasets it is not straightforward to collect these numerical characteristics. Here we explain how we calculated the numbers and describe some particular counting decisions we made for certain datasets.
A.1.1 Instances
The concept of an instance is not always straightforward, so we explain it here. Some datasets based on books, movies, or Wikipedia divide the original data source into different passages. For example, in SQuAD, SQuAD2.0, BoolQ, and NaturalQuestions an instance is a Wikipedia article; in AmazonQA and AmazonYesNo an instance is a product; in RecipeQA an instance is a type of recipe; and so on. In other words, an instance is an item one level above the passage which may contain multiple passages. For some datasets the number of instances equals the number of passages: in MovieQA, for example, the instance is a movie and the passage is a movie plot. For the bAbI dataset we consider each task an instance. Not all datasets have a concept of instance, which is why some entries in the table are marked with “-”.
A.1.2 Statistics Calculation
Some figures we took from the original and related papers; those datasets are BookTest (Bajgar et al., 2017), MedQA (Zhang et al., 2018b), and R (Wang et al., 2020b). All other characteristics were calculated from the publicly available data. Some datasets do not have a publicly available test set, so for those we based our calculations on the training and development sets only: AmazonQA, CoQA, DROP, MovieQA, QAngaroo (WikiHop and MedHop), QuAC, ShARC, SQuAD, SQuAD2.0, and TyDi.
As mentioned in Section 3.3, we processed all datasets in the same way with spaCy (spacy.io, last verified 06/2020). If the data is distributed already tokenized or split into sentences, we simply join it back with a space, e.g. " ".join(tokens), and re-tokenize it. Given the spaCy implementation, we would not expect significant differences between the originally provided tokenization and the tokenization of the joined tokens; this ensures that all datasets are processed consistently (a minimal sketch of this step is given at the end of this subsection).
There are a few particular features of certain datasets we would like to mention:
• ShARC: here the instance is a tree id. There are several scenarios for the same snippet; we consider the concatenation of the snippet, the scenario, and the follow-up questions with their answers as a passage;
• HotpotQA: we consider the passage to be the concatenation of all supporting facts, and the instance is the title of a supporting fact. In this case there are multiple instances for the same question;
• WikiQA: we consider the passage to be the concatenation of all sentences; calculations are based on the publicly available data and the code from the GitHub page;
• TriviaQA: to obtain the data we modified the script provided by the authors on their GitHub page (github.com/mandarjoshi90/triviaqa, last verified 06/2020);
• MSMARCO: we consider every passage separately, so in this case there are multiple passages for one question;
• TyDi: we calculated the statistics on the joined English data from both the Minimal Answer Span task and the Gold Passage task, for the training and development sets;
• Who Did What: we looked at the relaxed setting. We do not have a licence for the Gigaword data, so we calculated only the average length of the questions and answers; the vocabulary size is taken from the original paper (Onishi et al., 2016).
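The following is a minimal sketch of the normalisation and counting step described above, assuming spaCy and its en_core_web_sm model are installed; the helper names are illustrative, not the exact code used for the survey.

```python
import spacy

# Any English spaCy pipeline works for tokenisation and lemmatisation;
# en_core_web_sm is an assumption, not necessarily the model used for the survey.
nlp = spacy.load("en_core_web_sm")

def normalise(pre_split):
    """Join pre-tokenised (or pre-sentence-split) text back into one string
    and re-tokenise it with spaCy, so every dataset goes through one pipeline."""
    return nlp(" ".join(pre_split))

def passage_stats(pre_split):
    """Return the token count and the set of lower-cased lemmas for a passage,
    the quantities behind the length and vocabulary figures in Table 4."""
    doc = normalise(pre_split)
    n_tokens = len(doc)
    lemmas = {tok.lemma_.lower() for tok in doc if not tok.is_space}
    return n_tokens, lemmas

# Example: a passage distributed as a list of pre-split sentences.
n, vocab = passage_stats(["Daniel went to the kitchen .", "Where is Daniel ?"])
print(n, sorted(vocab))
```

Per-dataset vocabulary sizes can then be obtained by taking the union of such lemma sets over all passages, questions, and answers of a dataset.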
A.2 Vocabulary
See Table 5 for a detailed vocabulary analysis per dataset.
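As a purely illustrative sketch of the lemma categories reported in Table 5 (English words, numbers, non-English words, non-ASCII tokens, and web links), the snippet below shows one way such a classification can be implemented; the regular expression, the toy ENGLISH_LEXICON set, and the category names are our assumptions rather than the exact rules used to build the table.

```python
import re
from collections import Counter

# Toy English lexicon; in practice a full English wordlist would be used (assumption).
ENGLISH_LEXICON = {"where", "be", "daniel", "kitchen", "go", "the", "to"}

# Very rough web-link pattern (assumption, not the survey's exact rule).
URL_RE = re.compile(r"(https?://|www\.)\S+|\S+\.(com|org|net|io)(/\S*)?$")

def lemma_type(lemma: str) -> str:
    """Assign a lower-cased lemma to one of the five categories of Table 5."""
    if URL_RE.match(lemma):
        return "web link"
    if not lemma.isascii():
        return "not ASCII"
    if lemma.replace(",", "").replace(".", "").isdigit():
        return "number"
    if lemma in ENGLISH_LEXICON:
        return "English word"
    return "not English word"

print(Counter(lemma_type(l) for l in ["daniel", "1,395", "café", "example.com", "qwertz"]))
# Counter({'English word': 1, 'number': 1, 'not ASCII': 1, 'web link': 1, 'not English word': 1})
```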
Dataset             Extra Data  YesNo  Non-Factoid  Query  MultiHop  MultiDoc  Dialogs  No Answer
AmazonQA            ✓ ✓ ✗ ✗ ◐ ✗ ✗ ✗
AmazonYesNo         ✓ ✗ ✗ ◐ ✓ ✗ ◐ ✗
bAbI                ✓ ✗ ✗ ✓ ✗ ✗ ✗ ✗
BookTest            ✗ ✗ ✗ ◐ ✗ ◐ ✗ ✗
BoolQ               ✓ ✗ ✗ ◐ ✗ ✗ ✗ ✗
CBT                 ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗
CliCR               ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗
CNN/Daily Mail      ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗
Cosmos QA           ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✗
CoQA                ✓ ✗ ✗ ◐ ✗ ✓ ✓ ✗
DREAM               ◐ ✓ ✗ ✓ ✗ ✓ ✗ ✗
DROP                ◐ ✓ ✗ ✓ ✓ ✗ ✗ ✗
DuoRC               ◐ ✓ ✗ ✓ ✗ ✗ ✓ ✗
emrQA               ✓ ✓ ◐ ✓ ✗ ✗ ✓ ✓
HotpotQA            ✓ ✓ ✗ ✓ ✓ ✗ ✓ ✗
LAMBADA             ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗
MCScript            ✓ ✓ ✗ ✗ ✗ ✗ ✗ ✗
MCScript2.0         ✗ ✓ ✗ ✗ ✗ ✗ ◐ ✗
MCTest              ✓ ◐ ✗ ◐ ✗ ✗ ✗ ✗
MedQA               ✗ ✓ ✗ ✓ ✓ ✗ ✗ ✗
MovieQA             ◐ ✓ ✗ ◐ ✗ ✗ ✗ ✓
MSMARCO             ✓ ✓ ✗ ✓ ✓ ✗ ✓ ✓
MultiRC             ◐ ✓ ✗ ✓ ✓ ✗ ✗ ✗
NarrativeQA         ✗ ✓ ✗ ✓ ◐ ✓ ✗ ✗
NaturalQuestions    ✓ ✓ ✗ ✗ ✗ ✗ ✓ ✗
NewsQA              ◐ ✓ ✗ ✓ ✗ ✗ ✓ ✗
PubMedQA            ✓ ✗ ✗ ✓ ✗ ✗ ◐ ✗
RACE                ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✗
RACE-C              ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗
RecipeQA            ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✓
ReClor              ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✗
ReCoRD              ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗
SciQ                ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✓
SQuAD               ✗ ✓ ✗ ✗ ✗ ✗ ✗ ✗
SQuAD2.0            ✗ ✓ ✗ ✗ ✗ ✗ ✓ ✗
SearchQA            ✗ ✓ ✗ ✓ ✓ ✗ ✗ ✗
ShARC               ✓ ✗ ✗ ◐ ✗ ✓ ✗ ✗
TriviaQA            ✗ ◐ ✗ ✓ ✓ ✗ ✓ ✗
TurkQA              ✓ ✓ ✗ ✗ ✗ ✗ ✗ ✗
TweetQA             ✗ ✓ ✗ ✗ ✗ ✗ ✗ ✗
TyDi                ✓ ✗ ✗ ✗ ✗ ✗ ✓ ✓
QAngaroo WikiHop    ✗ ✗ ✓ ✓ ✓ ✗ ✗ ✗
QAngaroo MedHop     ✗ ✗ ✓ ✓ ✓ ✗ ✗ ✗
QuAC                ✓ ✓ ✗ ✓ ✗ ✓ ✓ ✗
QuAIL               ✗ ✓ ✗ ✓ ✗ ✗ ✓ ✗
Quasar-S            ◐ ✗ ✗ ✗ ✗ ✗ ✗ ✗
Quasar-T            ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗
Who Did What        ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗
WikiMovies          ✗ ✓ ✓ ◐ ✓ ✗ ✗ ✓
WikiReading         ✗ ✗ ✓ ✓ ✗ ✗ ✓ ✗
WikiQA              ✗ ✗ ✗ ✗ ✗ ✗ ✓ ✗
Table 3: Datasets in alphabetical order and additional properties. Extra data means that the English RC task is only one part of a bigger dataset with additional resources such as images or video, or that resources in other languages are available. ✓ – presented; ◐ – presented in a limited form; ✗ – not presented.
A.3 Questions
Some questions are formulated with the question word inside, for example: "About how much does each box of folders weigh?" or "According to the narrator, what may be true about their employer?". We analyse 6.7M questions, excluding all cloze datasets (ChildrenBookTest, CNN/DailyMail, WhoDidWhat, CliCR, LAMBADA, RecipeQA, Quasar-S, and the cloze-style questions from MSMARCO, DREAM, Quasar-T, RACE, RACE-C, SearchQA, TriviaQA, and emrQA; altogether approximately 2.5M cloze questions), as well as WikiReading, WikiHop, and MedHop (almost 19 million question-queries), since their queries are not formulated in question form. As mentioned in Section 3.1, some datasets share questions, and some datasets ask the same question more than once with a different context (for example, the question "Where is Daniel?" is asked 2,007 times in bAbI) or with different answer options (for example, in the Cosmos QA dataset). We therefore calculated the frequency of question words for both scenarios: all questions and unique questions (see Table 6).
To separate boolean questions we used the same list of words as Clark et al. (2019): "did", "do", "does", "is", "are", "was", "were", "have", "has", "can", "could", "will", "would". Apart from the datasets which contain only yes/no/maybe questions, a significant portion of boolean questions is found in ShARC (85.4%), emrQA (74.0%), AmazonQA (55.3%), QuAC (36.6%), MCScript (28.6%), TurkQA (25.7%), bAbI (25.0%), and CoQA (20.7%).
Almost a third of all questions and more than a quarter of unique questions are boolean. Another quarter of unique questions (26.57%) contain the word "what", 6.64% ask "who" or "whose", 4.49% "which", and about 3% each "when" and "where". Only 5.95% ask "how" (excluding "how many/much" and "how old"). Questions which do not contain any of these question words constitute 16.73% of unique questions. There are datasets where more than 20% of questions are formulated in such a way that the first token is not one of the considered words: Quasar-S (98.8%), SearchQA (98.3%), RACE-C (64.1%), TriviaQA (49.6%), HotpotQA (42.0%), Quasar-T (40.7%), MSMARCO (26.6%), NaturalQuestions (23.4%), AmazonQA (22.8%), and SQuAD (21.1%). See Table 7 for more detailed information; a minimal sketch of the first-token analysis is given below.
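The sketch below illustrates, under the simplifying assumption of whitespace tokenisation, how the first-token and "contains" statistics can be computed; the function names and the "other" label are ours, and this is not the exact script used to produce Tables 6 and 7.

```python
from collections import Counter

# Boolean question markers from Clark et al. (2019).
BOOLEAN_WORDS = {"did", "do", "does", "is", "are", "was", "were",
                 "have", "has", "can", "could", "will", "would"}
WH_WORDS = {"what", "who", "whose", "which", "when", "where", "why", "how"}

def first_token_class(question: str) -> str:
    """Classify a question by its first token: boolean marker, wh-word, or 'other'."""
    tokens = question.lower().split()
    if not tokens:
        return "other"
    first = tokens[0]
    if first in BOOLEAN_WORDS:
        return "boolean"
    if first in WH_WORDS:
        return first
    return "other"

def contains_word(question: str, word: str) -> bool:
    """'Contains' variant: the question word appears anywhere in the question."""
    return word in question.lower().split()

questions = ["Where is Daniel?",
             "Did the employer approve the request?",
             "About how much does each box of folders weigh?"]
print(Counter(first_token_class(q) for q in questions))
# Counter({'where': 1, 'boolean': 1, 'other': 1})
```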
B Other Datasets
There are a number of datasets we did not include in our analysis because we wanted to stay focused on the question answering machine reading comprehension task. In this section we mention those works and explain why they are excluded.
B.1 Question Answering Datasets
CLOTH (Xie et al., 2018) and Story Cloze Test (Mostafazadeh et al., 2016, 2017) are cloze-style datasets in which a word missing from the context must be predicted without a specific query. Likewise, we did not include a number of RC datasets in which a story should be completed, such as ROCStories (Mostafazadeh et al., 2016), CODAH (Chen et al., 2019), SWAG (Zellers et al., 2018), and HellaSwag (Zellers et al., 2019), because they contain no questions and so are not question answering datasets. In contrast, the cloze question answering datasets considered in this work have a separate coherent text (passage) and a separate sentence which can be treated as a question with a missing word.
QBLink (Elgohary et al., 2018) is technically an RC QA dataset, but for every question only the name of a wiki page is available. This "lead in" information is not enough to answer the question without additional resources. In other words, QBLink is a more general QA dataset, like CommonsenseQA (Talmor et al., 2019), rather than an RC dataset.
Textbook Question Answering (TQA) (Kembhavi et al., 2017) is a multi-modal dataset requiring not only text understanding but also image processing.
MCQA is a multiple-choice question answering dataset in English and Chinese based on examination questions, introduced as a shared task at IJCNLP 2017 by Shangmin et al. (2017). The authors do not provide any supporting documents which could be treated as a passage, so it is not a reading comprehension task.
A number of datasets, such as SimpleQuestions (Bordes et al., 2015) and WebQuestions (Berant et al., 2013), were created with the idea of extracting answers from a knowledge graph. Even though additional resources are involved, they are presented in structured form rather than as free natural text, so we do not consider those datasets in the current survey.
B.2 Non-English Datasets
In this paper we focus on English datasets, but there are a number of RC datasets in other languages; in this section we briefly mention some of them.
B.2.1 Chinese Datasets
DuReader (He et al., 2018) is a Chinese RC dataset. It contains mixed types of questions based on Baidu Search and Baidu Zhidao (zhidao.baidu.com, last verified February 2020). ReCO (Wang et al., 2020a), the Reading Comprehension dataset on Opinion, is the largest human-curated Chinese reading comprehension dataset, containing 300k questions with yes/no/unclear answers.
B.2.2 Other Languages
The extended version of WikiReading (Kenter et al., 2018), apart from 18M English questions, also contains 5M Russian and about 600K Turkish examples.
TyDi (Clark et al., 2020) is a question answering corpus of 11 typologically diverse languages (Arabic, Bengali, Korean, Russian, Telugu, Thai, Finnish, Indonesian, Kiswahili, Japanese, and English). It contains 200k+ question–answer pairs based on Wikipedia articles in those languages.
ViMMRC (Nguyen et al., 2020) is a multiple-choice RC dataset in Vietnamese. It contains 2,783 questions based on a set of 417 texts. Following the approach of SQuAD dataset construction, a few more datasets have been created: FQuAD (d'Hoffschmidt et al., 2020) is a French native reading comprehension dataset with 25,000+ questions; KorQuAD (Lim et al., 2019) has 70,000 original questions in Korean. Both datasets are based on Wikipedia.
B.2.3 Datasets Translation
SQuAD has been semi-automatically translated into several other languages: Korean K-QuAD (Lee et al., 2018); Italian SQuAD-it (Croce et al., 2018); Japanese and French (Asai et al., 2018); Spanish SQuAD-es (Carrino et al., 2020); Hindi (Gupta et al., 2019a); Russian
SberQuAD (Efimov et al., 2019); and Czech (Macková and Straka, 2020).
Dataset             # Instances   # Passages   # Options   Avg Q length   Avg P length   Avg A length   Vocabulary
AmazonQA 139,905 830,959 - 16.6 558.2 32.8 1,395,460
AmazonYesNo 40,806 40,806 - 13.2 4398.2 - 864,929
bAbI 20 1,2534 - 6.3 67.2 1.1 152
BookTest 14,062 14,140,825 10 522 1 1,860,394
BoolQ 8208 12,697 2 8.8 109.4 - 49,117
CBT 108 687,343 10 30 440 1 53,628
CliCR 11,846 11,846 - 22.6 1411.7 3.4 122,568
CosmosQA 35,210 35,210 4 10.6 70.4 8.1 40,067
CoQA - 7,699 - 6.5 328.0 2.9 59,840
CNN - 107,122 - 12.8 708.4 1.4 111,198
DailyMail - 218,017 - 14.8 854.4 1.5 197,388
DREAM 6,138 6,444 3 8.8 86.4 5.3 9,850
DROP - 6147 - 12.2 246.2 4 44,430
DuoRC 7,477 7,477 - 8.6 1,260.9 3.1 119,547
emrQA 2427 2,427 - 7.9 1328.4 2.0 70,837
HotpotQA 534,433 105,257 - 20.0 1100.7 2.4 741,974
LAMBADA 5,325 10,022 - 15.4 58.5 1 203,918
MedQA 5 243,712 5 27.4 4.2 43.2 -
MCScript 110 2,119 2 6.7 196.0 3.6 7,867
MCScript2.0 200 3,487 2 8.2 164.4 3.4 11,890
MCTest 160 160 160 4 9.2 241.8 3.7 2,246
MCTest 500 500 500 4 8.9 251.6 3.8 3,334
MovieQA 408 408 3-5 9.34 727.91 5.6 21,322
MSMARCO - 10,087,677 - 6.5 65.9 11.1 3,324,030
MultiRC 871 871 5.4 4.8 92.4 5.5 23,331
NarrativeQA 1,572 1,572 2 9.9 673.9 4.8 38,870
NaturalQuestions 109,715 315,203 - 9.36 7312.13 164.56 3,635,821
NewsQA 12,744 12,744 - 7.8 749.2 5.0 90,854
PubMedQA - 3,358 3 15.1 73.8 - 14,751
QAngaroo WikiHop - 48,867 - 3.5 1381 1.8 304,322
QAngaroo MedHop - 1962 - 3 9366.7 1 76,954
QuAC 8853 13,594 - 5.6 401 14.1 99,912
QuAIL 680 680 4 9.70 388.29 4.36 17271
Quasar-S - 37,362 - 24.3 (S) 1995.9 (L) 5210.1 1.5 (S) 660,425 (L) 987,380
Quasar-T - 43,012 - 11.1 (S) 2256.2 (L) 7372.6 1.9 (S) 1,021,823 (L) 2,019,336
RACE - 27,933 4 12.0 329.5 6.3 98,482
RACE-C - 2,708 4 13.8 423.8 7.4 38,399
RecipeQA - 9,761 4 10.8 580.0 3.3 62,938
ReClor - 6,138 4 17.0 73.6 20.6 17,865
ReCoRD - 73190 - 24.72 193.64 1.5 139724
SciQ - 12,252 - 14.6 87.1 1.5 23,320
SearchQA 27,995 13,796,295 - 16.7 58.7 2.1 3,506,501
ShARC 697 24,160 - 8.6 87.2 4.0 5,231
SQuAD 490 20,963 - 11.4 137.1 3.5 87,765
SQuAD2.0 477 20,239 - 11.2 137.0 3.5 88,081
TyDi - 14,378 - 8.3 3,694.2 4.6 848,524
TriviaQA - 801,194 - 16.4 3867.6 2.3 7,366,586
TurkQA - 13,425 - 10.3 41.6 2.9 44,677
TweetQA - 13757 - 8.02 31.93 2.70 32542
WhoDidWhat - 205,978 3.5 31.2 N/A 2.1 347,406
WikiMovies - 186,444 - 8.7 77.9 6.8 56,893
WikiQA - 1,242 - 6.5 252.6 - 20,686
WikiReading 4,313,786 18,807,888 - 2.35 569.0 2.2 8,928,645
Table 4: Basic statistics of RC datasets. Q – question; P – passage; A – answer; S – short passages; L – long passages. Average lengths of Q/P/A are measured in tokens. Vocabulary size is measured in lower-cased unique lemmas. Note: some numbers are aggregated and might differ slightly from other sources; the aim of this paper is to give an estimate rather than an exact value. For PubMedQA we compute statistics for the labelled data (1,000 questions).
Dataset             English Words      Numbers         Not English Words   Not ASCII        Web Links
AmazonQA 1065795 (76.4%) 38323 (2.7%) 235019 (16.8%) 6240 (0.4%) 49765 (3.6%)
AmazonYesNo 736037 (81.3%) 17931 (2.0%) 144761 (16.0%) 45 (0.0%) 6345 (0.7%)
bAbI 145 (95.4%) 0 (0%) 7 (4.6%) 0 (0%) 0 (0%)
BoolQ 36940 (75.2%) 3050 (6.2%) 7081 (14.4%) 2007 (4.1%) 41 (0.1%)
CBTest 29630 (88.4%) 167 (0.5%) 3651 (10.9%) 58 (0.2%) 0 (0%)
CNN 75523 (67.9%) 6290 (5.7%) 27250 (24.5%) 726 (0.7%) 1408 (1.3%)
CliCR 82981 (67.7%) 7798 (6.4%) 30809 (25.1%) 890 (0.7%) 85 (0.1%)
CoQA 45112 (75.4%) 2605 (4.4%) 10270 (17.2%) 1748 (2.9%) 93 (0.2%)
CosmosQA 34466 (86.0%) 934 (2.3%) 4617 (11.5%) 6 (0.0%) 42 (0.1%)
DREAM 8653 (87.8%) 711 (7.2%) 469 (4.8%) 11 (0.1%) 2 (0.0%)
DROP 27458 (61.8%) 7564 (17.0%) 7545 (17.0%) 1840 (4.1%) 13 (0.0%)
DailyMail 130062 (65.9%) 13919 (7.1%) 49752 (25.2%) 1457 (0.7%) 2197 (1.1%)
DuoRC 73800 (72.5%) 1235 (1.2%) 22937 (22.5%) 3715 (3.6%) 33 (0.0%)
emrQA 48174 (68.0%) 12287 (17.3%) 10060 (14.2%) 2 (0.0%) 0 (0%)
HotpotQA 341142 (50.2%) 29140 (4.3%) 199911 (29.4%) 107605 (15.8%) 1901 (0.3%)
LAMBADA 144310 (70.8%) 4828 (2.4%) 49745 (24.4%) 2846 (1.4%) 2186 (1.1%)
MCScript 7544 (95.9%) 101 (1.3%) 198 (2.5%) 15 (0.2%) 6 (0.1%)
MCScript2 9467 (94.4%) 138 (1.4%) 395 (3.9%) 17 (0.2%) 12 (0.1%)
MCTest 160 2135 (95.1%) 31 (1.4%) 74 (3.3%) 1 (0.0%) 0 (0%)
MCTest 500 3145 (94.3%) 35 (1.0%) 147 (4.4%) 1 (0.0%) 0 (0%)
MSMARCO 2046615 (61.6%) 261290 (7.9%) 703298 (21.2%) 246936 (7.4%) 65825 (2.0%)
MovieQA 18166 (85.2%) 385 (1.8%) 2768 (13.0%) 1 (0.0%) 0 (0%)
MultiRC 16034 (84.9%) 896 (4.7%) 1821 (9.6%) 106 (0.6%) 14 (0.1%)
NarrativeQA 31058 (79.9%) 631 (1.6%) 6213 (16.0%) 927 (2.4%) 1 (0.0%)
NaturalQuestions 1177894 (32.4%) 891487 (24.5%) 757428 (20.8%) 364341 (10.0%) 444670 (12.2%)
NewsQA 65487 (72.1%) 4316 (4.7%) 19370 (21.3%) 716 (0.8%) 950 (1.0%)
PubMedQA 11139 (75.4%) 2531 (17.1%) 941 (6.4%) 148 (1.0%) 1 (0.0%)
QAngaroo MedHop 59186 (77.2%) 4858 (6.3%) 10877 (14.2%) 1722 (2.2%) 26 (0.0%)
QAngaroo WikiHop 173858 (57.1%) 22415 (7.4%) 93948 (30.9%) 13860 (4.6%) 345 (0.1%)
QuAC 63683 (72.6%) 3499 (4.0%) 20315 (23.2%) 101 (0.1%) 107 (0.1%)
Quasar-S 622534 (63.0%) 109403 (11.1%) 210401 (21.3%) 1 (0.0%) 45475 (4.6%)
Quasar-T 941480 (55.5%) 167738 (9.9%) 479864 (28.3%) 1 (0.0%) 107374 (6.3%)
RACE-C 30697 (79.9%) 1248 (3.3%) 3988 (10.4%) 2420 (6.3%) 30 (0.1%)
RACE 75342 (76.5%) 6277 (6.4%) 15863 (16.1%) 1 (0.0%) 889 (0.9%)
ReClor 16364 (91.6%) 326 (1.8%) 1174 (6.6%) 1 (0.0%) 0 (0%)
RecipeQA 48929 (77.0%) 1031 (1.6%) 10560 (16.6%) 1181 (1.9%) 835 (1.3%)
SQuAD 58444 (66.6%) 5708 (6.5%) 16827 (19.2%) 6706 (7.6%) 55 (0.1%)
SQuAD2 58793 (66.8%) 5724 (6.5%) 16935 (19.2%) 6548 (7.4%) 54 (0.1%)
SearchQA 2129356 (60.7%) 313517 (8.9%) 957977 (27.3%) 392 (0.0%) 105207 (3.0%)
ShARC 4703 (90.6%) 303 (5.8%) 161 (3.1%) 15 (0.3%) 1 (0.0%)
TriviaQA 3269469 (44.3%) 421543 (5.7%) 1566735 (21.2%) 1806003 (24.5%) 293086 (4.0%)
TurkQA 32225 (72.1%) 1660 (3.7%) 10778 (24.1%) 1 (0.0%) 25 (0.1%)
TyDi 532336 (61.8%) 31785 (3.7%) 184915 (21.5%) 83113 (9.6%) 828 (0.1%)
WhoDidWhat 79056 (63.5%) 2658 (2.1%) 42670 (34.3%) 40 (0.0%) 52 (0.0%)
WikiMovies 39249 (69.0%) 447 (0.8%) 15310 (26.9%) 1880 (3.3%) 3 (0.0%)
WikiQA 17074 (82.5%) 1081 (5.2%) 2041 (9.9%) 477 (2.3%) 12 (0.1%)
WikiReading 3431134 (38.4%) 823603 (9.2%) 2777734 (31.1%) 1801580 (20.2%) 94594 (1.1%)
Table 5: Types of lemmas in dataset vocabularies, in counts and percentages, listed in decreasing order according to the vocabulary size.
Question word    All Questions: First token / Contains    Unique Questions: First token / Contains
Table 6: Frequency of first token of the questions and question words inside the question across datasets.