Think you have Solved Direct-Answer Question Answering? Try ARC-DA, the Direct-Answer AI2 Reasoning Challenge
Sumithra Bhakthavatsalam, Daniel Khashabi, Tushar Khot, Bhavana Dalvi Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, Peter Clark
Allen Institute for Artificial Intelligence, Seattle, WA, U.S.A.
{sumithrab, danielk, tushark, bhavanad, kyler, ashishs, carissas, oyvindt, peterc}@allenai.org

Abstract
We present the ARC-DA dataset, a direct-answer (“open response”, “freeform”) version of the ARC (AI2 Reasoning Challenge) multiple-choice dataset. While ARC has been influential in the community, its multiple-choice format is unrepresentative of real-world questions, and multiple-choice formats can be particularly susceptible to artifacts. The ARC-DA dataset addresses these concerns by converting questions to direct-answer format using a combination of crowdsourcing and expert review. The resulting dataset contains 2985 questions with a total of 8436 valid answers (questions typically have more than one valid answer). ARC-DA is one of the first DA datasets of natural questions that often require reasoning, and where appropriate question decompositions are not evident from the questions themselves. We describe the conversion approach taken, appropriate evaluation metrics, and several strong models. Although high, the best scores (81% GENIE, 61.4% F1, 63.2% ROUGE-L) still leave considerable room for improvement. In addition, the dataset provides a natural setting for new research on explanation, as many questions require reasoning to construct answers. We hope the dataset spurs further advances in complex question-answering by the community.

Introduction
Multiple-choice (MC) datasets are popular and common in the NLP community, e.g., CommonsenseQA (Talmor et al., 2019), OpenbookQA (Mihaylov et al., 2018), and VCR (Zellers et al., 2019), in particular because of the ease of automatic evaluation. However, they have two notable drawbacks: First, they are unnatural (real-world questions rarely come with answer options). Second, the multiple-choice format is particularly susceptible to artifacts, where systems learn short-cuts to obtain a high score (Gururangan et al., 2018).

(ARC-DA is available at https://allenai.org/data/arc-da)

Similarly, while there are many NLP datasets of direct-answer questions (also called “open response” or “freeform” questions), e.g., SQuAD (Rajpurkar et al., 2016), TriviaQA (Joshi et al., 2017), and NaturalQuestions (Kwiatkowski et al., 2019), the majority of these are span-retrieval (“lookup”) tasks where a question is matched against a given/retrieved sentence or paragraph to identify an answer span. The few DA datasets that do target reasoning, e.g.,
HotpotQA (Yang et al., 2018), DROP (Dua et al., 2019), and ROPES (Lin et al., 2019), are crowdsourced, and thus tend to explore a single, specific style of reasoning in a controlled setting.

What is missing, still, are direct-answer (DA) datasets of natural questions exploring a wide variety of problem types and reasoning styles, and where answers are not constrained to be spans of a source text. This work alleviates this gap by supplying such a dataset, namely ARC-DA, a direct-answer version of the ARC (AI2 Reasoning Challenge) multiple-choice dataset (Clark et al., 2018). Note that ARC-DA questions are not necessarily more difficult than the original ARC questions (we find scores on ARC-DA are roughly similar to those on ARC); rather, they are more natural, avoiding the multiple-choice format.

The original ARC dataset contained questions collected from a large number of science exam and quiz sources. It has proven useful for the community, stimulating new research in reasoning-based QA, e.g., (Musa et al., 2019; Boratko et al., 2018; Ni et al., 2019; Xie et al., 2020), and as of January 2021 has 35 entries on its leaderboard (https://leaderboard.allenai.org/arc/submissions/public). ARC is particularly interesting from an NLP perspective: the questions were authored by human experts (e.g., examination boards), they are sensible and high quality, they avoid the repetition common to crowdsourced datasets, they are highly varied in both the language they use and the reasoning skills they are designed to probe, and they are practical, understandable, and motivating. Arguably, the combination of these factors makes the dataset a useful “Grand Challenge” for the field (Clark and Etzioni, 2016). (The current top score on ARC-Challenge is 81.1%, thus still with room for improvement.) The work here, ARC-DA, thus builds on this, providing a direct-answer version of part of the ARC dataset. Several examples of original ARC questions and the ARC-DA versions are shown in Figure 1.

We first describe the method used for the conversion, and then present baseline scores using strong T5-based models. Evaluating DA questions poses an additional challenge, compared with scoring MC questions. To address this challenge, we use both human judgements (obtained with GENIE, an automated crowdsourcing pipeline (Khashabi et al., 2021)) and automated metrics. Although high, the best scores (81% GENIE, 61.4% F1, 63.2% ROUGE-L) still leave considerable room for improvement. In addition, the dataset provides a natural setting for new research on explanation, as many questions require reasoning to construct answers. We encourage the community to make use of this dataset to make further progress in advanced question-answering.

MC: Many animals depend on plants for (A) shelter [correct] (B) pollination (C) seed dispersal (D) sunlight
DA: Many animals depend on plants for what? food | shelter

MC: A solution with a pH of 2 can be increased to a pH above 7 by adding (A) an acid. (B) water. (C) a base. [correct] (D) hydrogen.
DA: A solution with a pH of 2 can be increased to a pH above 7 by adding what? a base

MC: What best describes skin? (A) stiff (B) flexible [correct] (C) brittle (D) hard
DA: [Rejected: Too ambiguous as a DA question]

MC: Water freezing is an example of a (A) liquid changing to a solid [correct] (B) solid changing to a liquid (C) gas changing to a solid (D) gas changing to a liquid
DA: Water freezing is an example of what? liquid changing to a solid | phase transition | change of state of matter | a change in state | state change

MC: How are the stem of a tree and the stem of a flower most similar? (A) Both are soft. (B) Both have thorns. (C) Both support the plant. [correct] (D) Both have woody bark.
DA: How are the stem of a tree and the stem of a flower most similar? both support the plant | support leaves | both carry water | both carry nutrients | they support the plant

Figure 1: Multiple-choice (MC) questions from ARC, and their direct answer (DA) equivalents in the new ARC-DA dataset. Alternative DA answers are separated by a |.

ARC-DA Dataset
Naïvely, one can convert MC to DA simply by removing the answer choices and using the correct answer choice as the target answer. (Indeed, this is the approach taken by Lin et al. (2020) to use a filtered subset of ARC in a direct-answer setting.) However, several problems can arise:
• There may be multiple ways of wording the correct answer.
• There may be multiple possible correct answers, and in some cases too many to enumerate all of them.
• The question itself may be ill-defined without answer options.
To address these problems, we convert the 7787 ARC MC questions to DA using the process described below.
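For reference, the naive conversion mentioned above amounts to a one-line transformation. The sketch below illustrates it in Python; the record layout (stem, choices, answerKey) follows the original ARC release but should be treated as an assumption here, and the result inherits all of the problems listed above.

```python
# A minimal sketch of the naive MC -> DA conversion. The field names
# ("stem", "choices", "answerKey") mirror the original ARC release, but
# treat them as assumptions rather than a specification.

def naive_mc_to_da(mc_question: dict) -> dict:
    """Strip the answer options and keep the correct choice as the target."""
    correct = next(
        c["text"]
        for c in mc_question["choices"]
        if c["label"] == mc_question["answerKey"]
    )
    return {
        "question": mc_question["stem"],  # question text without options
        "answers": [correct],             # single gold answer (often too restrictive)
    }

example = {
    "stem": "Many animals depend on plants for",
    "choices": [
        {"label": "A", "text": "shelter"},
        {"label": "B", "text": "pollination"},
        {"label": "C", "text": "seed dispersal"},
        {"label": "D", "text": "sunlight"},
    ],
    "answerKey": "A",
}
print(naive_mc_to_da(example))
# {'question': 'Many animals depend on plants for', 'answers': ['shelter']}
```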
Crowdworker Annotation
We start with a large-scale crowdsourcing process to filter questions to those suitable for the DA setting and to collect alternative correct answers for them:

1. Initial Question Filtering: Remove questions where the question sentence contains one of several empirically-chosen filter phrases, e.g., “Which of”. (Many questions are multi-sentence, with a preamble before the actual question sentence. The filter phrases are: which of, most, best, least, est, order, supports, characteristic, trait, which object, which statement, below, which is, which are, example, which term, conclusion, which would, which item, which action, which two, which sentence, which one, sequence, which fact, which <VERB>.) Questions containing these phrases were observed to usually be ill-formed without the answer options, e.g., “Which of these items contains only a liquid?”.

2. Collecting Answers: Each question was then posed to five independent crowdworkers as a DA question, and the workers were asked to:
• Answer the question (enter a free-form answer). If there were multiple answers, they were asked to enter two or three.
• Identify if the question had one, several, or many answers, or if the question was nonsensical.
If the question was too ambiguous or nonsensical, the crowdworker had the option of not providing an answer. The crowdworker interface is shown in Appendix A.

3. Additional Filtering: The questions were further filtered, only retaining:
• questions that had answers from at least two workers.
• questions where at least two worker-provided answers had some non-stop-word overlap.
Otherwise the question was deemed too open-ended and rejected. A minimal sketch of these filters is given below.
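The following sketch illustrates the two automatic filters above, assuming a simple whitespace tokenizer and a small stop-word list; the paper does not specify the exact tokenization or stop-word handling, so treat those details as assumptions.

```python
# Sketch of the filter-phrase check (step 1) and the answer-overlap check
# (step 3). The phrase list here is only a subset of the full list above.

FILTER_PHRASES = ["which of", "most", "best", "least", "which is", "which are"]
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "is", "are", "and", "or", "for"}

def passes_phrase_filter(question_sentence: str) -> bool:
    """Step 1: drop questions whose question sentence contains a filter phrase."""
    padded = f" {question_sentence.lower()} "
    return not any(f" {phrase} " in padded for phrase in FILTER_PHRASES)

def content_words(answer: str) -> set:
    return {w for w in answer.lower().split() if w not in STOP_WORDS}

def passes_answer_filter(worker_answers: list) -> bool:
    """Step 3: keep a question only if at least two workers answered and at
    least two of their answers share a non-stop-word."""
    if len(worker_answers) < 2:
        return False
    for i, a in enumerate(worker_answers):
        for b in worker_answers[i + 1:]:
            if content_words(a) & content_words(b):
                return True
    return False

print(passes_phrase_filter("Which of these items contains only a liquid?"))  # False
print(passes_answer_filter(["the force of gravity", "gravity"]))             # True
```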
In-House Review
The resulting questions were then reviewed by in-house (“expert”) workers, who performed the following operations:

1. Question Filtering: Rejected questions that still appeared too open-ended (e.g., “Name an insect.”).

2. Answer Verification: Reviewed crowdworker answers to remove incorrect answers, and add additional missed answers.

3. Question Rewording: Reworded questions that were poorly phrased or incomplete as standalone questions, e.g., “The cell structure that makes a plant cell more rigid than an animal cell is the” becomes “The cell structure that makes a plant cell more rigid than an animal cell is called what?”

4. Answer Modification: For long (wordy) answers, ensure that a shorter version including just the salient terms is also present. For example, for the question “In what form does water vapor exist in the atmosphere?”, the crowdworkers gave two answers: “An invisible gas in the air” and “An invisible gas”. As the simple answer “gas” is sufficient for this question, the expert would add “gas” as an additional answer option.

                              Train   Dev    Test
num. questions                1250    338    1397
num. answers per qn (avg)     2.75    2.72   2.92
num. words per answer (avg)   2.11    1.94   2.27

Table 1: Statistics of ARC-DA, with 2985 total questions.
Rating              Score
strongly agree      1.00
agree               0.75
neutral             0.50
disagree            0.25
strongly disagree   0.00
Table 2: GENIE’s crowdworker ratings of a model’s answers are mapped to real-valued scores as shown.

This process was run over the entire ARC question set. Approximately 60% of the original questions were removed during crowdworker annotation (50% in the initial question filtering, 10% more in the additional filtering), followed by another 10% during in-house review, resulting in 2985 questions in the final ARC-DA dataset. Although the final dataset is less than half the size of ARC, it is still large enough for models to learn the style of the task (e.g., see Table 3 later), without simply memorizing the task itself, thus avoiding large-scale supervised training pitfalls. This trend towards more realistically sized datasets is seen elsewhere also, e.g., OBQA (Mihaylov et al., 2018), QASC (Khot et al., 2019), TRACIE (Zhou et al., 2020).
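As a concrete reading of the Table 2 rubric, the snippet below maps crowdworker ratings to scores and averages them into a single number. This is only an illustration of the mapping; GENIE’s actual pipeline additionally performs quality control and reports confidence intervals.

```python
# Illustration of the Table 2 rubric: map crowdworker ratings to scores
# and average them. This is not GENIE's real aggregation code.

RATING_TO_SCORE = {
    "strongly agree": 1.00,
    "agree": 0.75,
    "neutral": 0.50,
    "disagree": 0.25,
    "strongly disagree": 0.00,
}

def genie_style_score(ratings: list) -> float:
    """Average the mapped scores over all collected ratings."""
    return sum(RATING_TO_SCORE[r] for r in ratings) / len(ratings)

print(genie_style_score(["strongly agree", "agree", "neutral"]))  # 0.75
```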
Train/Dev/Test Split
We retain the same train/dev/test labels for questions as in the original ARC dataset, resulting in approximately similar proportions as ARC. We also do not separate the original ARC-Easy and ARC-Challenge questions, but instead merge them into a single dataset. We do this because the labels “Easy” and “Challenge” were based on the MC choices. (Switching from MC to DA can result in a “Hard” question becoming conceptually easy, and vice versa.) However, we do retain the original Easy/Challenge labels as metadata in the ARC-DA dataset. The resulting dataset statistics are summarized in Table 1.
Knowledge and Reasoning Types
We found the distribution of knowledge and reasoning types required by ARC-DA questions, as classified by Boratko et al. (2018), to be roughly the same as in ARC; see Figure 2 (created using Boratko et al.’s data). For a detailed description of these categories, see (Boratko et al., 2018).
Evaluation Metrics
It’s not immediately clear how one should score answers to DA questions. Doing this is more difficult than for MC questions, as (usually) the set of gold DA answers is incomplete. Further, even if the answer is unique conceptually (e.g., the answer “gravity”), it may be phrased in multiple ways (“the force of gravity”, “gravitational force”, “gravitation”, ...).
Figure 2: Comparison of the distribution of questions among different knowledge (top) and reasoning types (bottom), comparing ARC with ARC-DA. Overall, the distributions are roughly similar. Data is from sampled annotations created by (Boratko et al., 2018). For a detailed description of the categories, see (Boratko et al., 2018).

As a result, scoring is necessarily approximate. However, this should not be a reason to shy away from such problems; valid comparisons can still be made, and there are obvious benefits to working in the more realistic DA setting.

We propose two ways to score answers to ARC-DA. The first is human scoring via GENIE (available at https://genie.apps.allenai.org/), a human-in-the-loop leaderboard framework that scores answers using an automated crowdsourced pipeline (Khashabi et al., 2021). GENIE streamlines the human scoring of machine-generated answers by automatically posting them on crowdsourcing platforms, collecting qualitative human judgements (converted to numeric scores using the rubric in Table 2), then performing statistical analyses to quantify uncertainty. It also includes various constraints to ensure quality control. To use GENIE, we submit our answers to the leaderboard, then wait for the task to complete (which follows a fixed, periodic schedule). Note that GENIE is publicly available for other researchers interested in this dataset.

Second, we consider two popular automated metrics to compare predicted answers with the gold answers: ROUGE-L (we use the implementation from https://github.com/google-research/google-research/tree/master/rouge, with stemming turned on) and F1 word overlap. For the simple F1 word-overlap measure, we adopt the conventions from the SQuAD dataset (Rajpurkar et al., 2016) in terms of ignoring punctuation and a few stop words. For both ROUGE and F1, we take the maximum score over all of the gold answers for a given question (i.e., an answer is scored against its best-matching gold answer), and then average over all the questions.

We note that both ROUGE and F1 have known intrinsic pitfalls. For example, as F1 ignores word order, the prediction “from solid to liquid” would be considered a perfect match for the gold answer “from liquid to solid”. For these reasons, our preferred metric for ARC-DA is GENIE (despite the turnaround time), which also alleviates the problem of missing gold answers.
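To make the scoring convention concrete, here is a minimal Python sketch of the word-overlap F1 with the max-over-gold-answers and average-over-questions conventions described above. The normalization (lowercasing, stripping punctuation, dropping articles) only approximates the SQuAD script’s behavior, and the helper names are ours, not part of any released evaluation code.

```python
# Sketch of word-overlap F1 with the max-over-gold convention. Normalization
# here only approximates the SQuAD evaluation conventions.

import re
import string
from collections import Counter

def normalize(text: str) -> list:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def f1(prediction: str, gold: str) -> float:
    pred_toks, gold_toks = normalize(prediction), normalize(gold)
    common = Counter(pred_toks) & Counter(gold_toks)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def question_f1(prediction: str, gold_answers: list) -> float:
    """Score a prediction against its best-matching gold answer."""
    return max(f1(prediction, g) for g in gold_answers)

def dataset_f1(predictions: list, gold_answer_sets: list) -> float:
    """Average the per-question (max-over-gold) F1 over all questions."""
    scores = [question_f1(p, golds) for p, golds in zip(predictions, gold_answer_sets)]
    return sum(scores) / len(scores)

print(question_f1("the force of gravity", ["gravity", "gravitation"]))  # 0.5 ("force" and "of" lower precision)
```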
Empirical Evaluation
We next describe a few strong baseline systems for ARC-DA and report their performance.
Baseline Models
To build a strong baseline model, we start with (a reimplementation of) UnifiedQA (Khashabi et al., 2020), a QA system trained on multiple QA datasets using the text-to-text pretrained T5 transformer (Raffel et al., 2020) (we use the 11B version). We then fine-tune two models on ARC-DA, one using sentences retrieved from a general corpus of text K, and one without. The input to these models is the question Q (plus retrieved sentences, for the first model). The desired output is a correct answer to Q. We call the resulting models UnifiedQA + ARC-DA.

For the “with IR” (Information Retrieval) variant of UnifiedQA + ARC-DA, given a question Q, we retrieve 10 sentences K_1, ..., K_10 from the corpus K using Q as the search query (here, using ElasticSearch). For K, we use the Aristo Corpus, a Web-crawled corpus containing 280GB of general and science-related sentences augmented with ≈80k additional science textbook sentences (Clark et al., 2016). The input to the model is then:

$question$ = Q ; $context$ = K_1 ... K_10

The desired output of the model is a correct answer to the question. To train the model, since we (typically) have multiple, alternative gold target answers A_1, ..., A_n in the training data, we generate N_a training examples for each question, where each example uses a randomly sampled answer from the A_i. In other words, each individual gold answer (of which there are a few per question) and unique question are used to construct an individual training example, capped at a maximum of N_a training examples per question. A sketch of this input construction and answer sampling is given below.
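The following is a minimal sketch of the input construction and per-question answer sampling, under stated assumptions: the retrieval step is stubbed out (the paper uses ElasticSearch over the Aristo Corpus), and all function and field names are illustrative rather than the authors’ actual training code.

```python
# Sketch of input construction and per-question training-example sampling.
# Retrieval is a placeholder; names are illustrative, not the authors' code.

import random

N_A = 4  # cap on training examples per question (the paper uses N_a = 4)

def retrieve_sentences(question: str, k: int = 10) -> list:
    """Placeholder for the ElasticSearch retrieval of K_1..K_10."""
    return [f"retrieved sentence {i}" for i in range(1, k + 1)]

def build_input(question: str, with_ir: bool) -> str:
    if with_ir:
        context = " ".join(retrieve_sentences(question))
        return f"$question$ = {question} ; $context$ = {context}"
    return f"$question$ = {question}"

def training_examples(question: str, gold_answers: list, with_ir: bool) -> list:
    """One (input, target) pair per sampled gold answer, capped at N_A."""
    sampled = random.sample(gold_answers, min(len(gold_answers), N_A))
    return [(build_input(question, with_ir), answer) for answer in sampled]

for inp, tgt in training_examples(
    "Water freezing is an example of what?",
    ["liquid changing to a solid", "phase transition", "state change"],
    with_ir=False,
):
    print(inp, "->", tgt)
```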
Model (Test Set)                GENIE   F1     ROUGE-L
UnifiedQA + ARC-DA (no IR)      66      –      –
UnifiedQA + ARC-DA (w/ IR)      –       –      –
UnifiedQA + ARC-DA/MC (no IR)   –       –      –
UnifiedQA + ARC-DA/MC (w/ IR)   –       –      –

Table 3: Results on ARC-DA test set (1397 questions), both without and with IR, according to different metrics. GENIE is a human (crowdsourced) metric; F1 and ROUGE-L are automated metrics. The GENIE scores include confidence intervals (+/-). (GENIE is our preferred measure.)
Model (Dev Set)                 EXPERT   F1     ROUGE-L
UnifiedQA + ARC-DA (no IR)      78.8     53.9   55.4
UnifiedQA + ARC-DA (w/ IR)      84.0     63.0   65.2
UnifiedQA + ARC-DA/MC (no IR)   78.7     55.5   59.5
UnifiedQA + ARC-DA/MC (w/ IR)   –        –      –
Table 4: Results on ARC-DA dev set (338 questions). Here we show human evaluation by one of the authors (EXPERT), rather than GENIE scores.

In our experiments, we used N_a = 4. Each training instance thus has a single gold answer, and the fine-tuning otherwise follows the T5 procedure of using teacher forcing (Williams and Zipser, 1989). Note there is a (deliberate) asymmetry in train/test: each training instance encourages the system to predict a particular gold answer, while each test output is considered correct if it predicts any of the gold answers. This style of teaching for questions with multiple answers has been found effective in previous work, e.g., (Bosselut et al., 2019; Rashkin et al., 2018).

For the “without IR” variant, the same process is applied except the input to the model is simply:

$question$ = Q

Since UnifiedQA is question-format agnostic, we also create variants of the above models (again with and without retrieval) by fine-tuning them jointly on ARC-DA as described above as well as on the original multiple-choice questions of ARC. (That is, given an MC question, UnifiedQA will output an answer choice label, while given a DA question, UnifiedQA will generate an answer directly.) The resulting models are referred to as UnifiedQA + ARC-DA/MC.

Results
The results for the models are shown in Table 3. To help interpret the GENIE scores, note that crowdworkers label answers according to the rubric and corresponding real values shown in Table 2. For comparison, one of the authors manually scored the answers on the development set, using a principle of partial credit for non-ideal answers; this is shown under the EXPERT column of Table 4.

Several observations can be made. First, the scores are high in absolute terms, with the human-scored GENIE/EXPERT numbers being roughly comparable to scores on the original MC questions, found to be 86.8%/92.6% without/with IR. (To obtain these MC scores, we ran the same UnifiedQA model, before fine-tuning on ARC-DA, on the original ARC multiple-choice versions of the 1397 ARC-DA test questions.) This suggests that the DA questions are not necessarily harder than the MC versions, despite the format change, although they are more natural (non-multiple-choice). While intuitively one might expect DA questions to be more difficult to answer, as the number of potential answers changes from 4 to a potentially infinite number, some may also be easier, as any correct answer is valid, allowing the model to sidestep subtle distinctions that may be used in the MC choices.
Second, the GENIE scores slightly underestimate the “true” score, which we take as the EXPERT score (Table 4), namely the score one might expect to receive in an examination setting with a professional grader. This may be due to occasional annotation errors and/or unreliable annotators that slip through GENIE’s quality controls. (Also note the GENIE score in Table 3 is on the test set, while the EXPERT score in Table 4 is on dev, which may account for some of the difference; test performance is typically slightly worse than dev.) While in principle the upper bound on the EXPERT score is 100%, namely for a perfect set of answers, our preliminary tests suggest the GENIE upper bound (for ARC-DA) may be around 90% for a perfect set of answers due to this noise, given GENIE’s current pipeline (additional improvements to GENIE are under consideration).

Third, the automated metrics are only a loose approximation of the true target. In absolute terms, there is a significant gap between the automated metrics (F1 and ROUGE-L) and the human evaluations (GENIE and EXPERT), suggesting that there are indeed additional answers and answer phrasings missing from the ARC-DA gold answers. We also see that the rank-ordering of models based on human vs. automated metrics is not identical (although it is generally similar). Assuming that the human-based scores are the most accurate (although expensive), this indicates that automatic metrics should be used with caution: while they can be a useful proxy, it is not appropriate to draw conclusions from them based on small (e.g., 1%) differences.
Impact on MC Question-Answering
As an unexpected corollary, we ran the UnifiedQA + ARC-DA/MC model on the original ARC MC dataset, and obtained new state-of-the-art results (81.4% on ARC-Challenge and 92.7% on ARC-Easy; see https://leaderboard.allenai.org/arc/submissions/public). (As before, note that UnifiedQA is format-agnostic, outputting an answer option label given an MC question, or a direct answer given a DA question.) Note also that this model has the highest score on ARC-DA (GENIE score of 81%, Table 3). This suggests that there is some additional training signal provided by the DA training questions that is assisting in MC QA, and likewise that the additional MC training is helping answer DA questions. This phenomenon is reminiscent of the discovery in the original UnifiedQA paper that multi-format training can provide an overall boost in individual scores (Khashabi et al., 2020).

Summary
Progress in QA requires new datasets in more realistic settings, for example using natural questions that require more than a “lookup” answer. The ARC-DA dataset addresses this need, containing a direct-answer version of (a subset of) the ARC multiple-choice questions. These questions are expert (examination board) authored, high quality, sensible, and avoid the repetition common to crowdsourced datasets, making them of particular interest to NLP. We have also shown that baseline scores, although strong, are far from perfect, offering a new challenge to the NLP community, as well as a new setting to study explanation in the context of questions requiring reasoning. We invite readers to take up this challenge!

The ARC-DA dataset is available at https://allenai.org/data/arc-da, and the GENIE human evaluation framework is publicly available at https://genie.apps.allenai.org.
Acknowledgements
Thanks to all in the Aristo team and the additional expert reviewers Kirsten Barber, Rosann Morrow-Clark, Tao Li, and Anjali Tandon who contributed to this dataset. The TPU machines for conducting experiments were provided by Google.
References
M. Boratko, H. Padigela, D. Mikkilineni, P. Yuvraj, R. Das, A. McCallum, M. Chang, A. Fokoue, P. Kapanipathi, N. Mattei, R. Musa, K. Talamadupula, and M. Witbrock. A systematic classification of knowledge, reasoning, and context within the ARC dataset. In QA@ACL, 2018.

A. Bosselut, H. Rashkin, M. Sap, C. Malaviya, A. Celikyilmaz, and Y. Choi. COMET: Commonsense transformers for automatic knowledge graph construction. In ACL, 2019.

P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. ArXiv, abs/1803.05457, 2018.

P. Clark and O. Etzioni. My computer is an honor student – but how intelligent is it? Standardized tests as a measure of AI. AI Magazine, 37:5–12, 2016.

P. Clark, O. Etzioni, T. Khot, A. Sabharwal, O. Tafjord, P. D. Turney, and D. Khashabi. Combining retrieval, statistics, and inference to answer elementary science questions. In AAAI, 2016.

D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In NAACL-HLT, 2019.

S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. R. Bowman, and N. A. Smith. Annotation artifacts in natural language inference data. In NAACL-HLT, 2018.

M. Joshi, E. Choi, D. S. Weld, and L. S. Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In ACL, 2017.

D. Khashabi, S. Min, T. Khot, A. Sabharwal, O. Tafjord, P. Clark, and H. Hajishirzi. UnifiedQA: Crossing format boundaries with a single QA system. In EMNLP, 2020.

D. Khashabi, G. Stanovsky, J. Bragg, N. Lourie, J. Kasai, Y. Choi, N. A. Smith, and D. S. Weld. GENIE: A leaderboard for human-in-the-loop evaluation of text generation. Preprint arXiv:2101.06561, 2021.

T. Khot, P. Clark, M. Guerquin, P. Jansen, and A. Sabharwal. QASC: A dataset for question answering via sentence composition. arXiv preprint arXiv:1910.11473, 2019.

T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. P. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M.-W. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov. Natural Questions: A benchmark for question answering research. TACL, 7:453–466, 2019.

B. Y. Lin, H. Sun, B. Dhingra, M. Zaheer, X. Ren, and W. W. Cohen. Differentiable open-ended commonsense reasoning. ArXiv, abs/2010.14439, 2020.

C.-Y. Lin, G. Cao, J. Gao, and J.-Y. Nie. An information-theoretic approach to automatic evaluation of summaries. In HLT-NAACL, 2006.

K. Lin, O. Tafjord, P. Clark, and M. Gardner. Reasoning over paragraph effects in situations. In Proc. MRQA Workshop (EMNLP'19), 2019. Also arXiv:1908.05852.

T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. In EMNLP, 2018.

R. Musa, X. Wang, A. Fokoue, N. Mattei, M. Chang, P. Kapanipathi, B. Makni, K. Talamadupula, and M. Witbrock. Answering science exam questions using query reformulation with background knowledge. In AKBC, 2019.

J. Ni, C. Zhu, W. Chen, and J. McAuley. Learning to attend on essential terms: An enhanced retriever-reader model for open-domain question answering. In NAACL-HLT, 2019.

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67, 2020.

P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, 2016.

H. Rashkin, A. Bosselut, M. Sap, K. Knight, and Y. Choi. Modeling naive psychology of characters in simple commonsense stories. In ACL, 2018.

A. Talmor, J. Herzig, N. Lourie, and J. Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In NAACL-HLT, 2019.

R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1:270–280, 1989.

Z. Xie, S. Thiem, J. Martin, E. Wainwright, S. Marmorstein, and P. A. Jansen. WorldTree V2: A corpus of science-domain structured explanations and inference patterns supporting multi-hop inference. In LREC, 2020.

Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In EMNLP, 2018.

R. Zellers, Y. Bisk, A. Farhadi, and Y. Choi. From recognition to cognition: Visual commonsense reasoning. In CVPR, pp. 6713–6724, 2019.

B. Zhou, K. Richardson, Q. Ning, T. Khot, A. Sabharwal, and D. Roth. Temporal reasoning on implicit events from distant supervision. ArXiv, abs/2010.12753, 2020.

Appendix A. Instructions to Crowdworkers
Below are the instructions provided to the (Amazon Mechanical Turk) crowdworkers for answering DA questions:
Instructions
This HIT is to write down some answers to 5 science questions, so that we can test an AI system (Aristo) that we are developing. The questions were originally taken from multiple choice exams, but we are wanting to convert them to "direct answer" format. Your task is to write down one or more answers to the questions. As the questions originally came from multiple choice exams, there may often be more than one answer. In those cases, please enter two or three possible answers separated by a ";", e.g., for Q: Which is an animal? you might enter three answers "dog; cat; elephant".

Here is an example:

Question: A ball is tossed up in the air and it comes back down. The ball comes back down because of
Enter your answer(s): gravity
(If you see more than one answer, enter two or three separated by ";", e.g. "flower; tree; plant".)

Now select the appropriate option below about this question:
• There is a clear, single answer
• There is conceptually just one answer, but it could be expressed in different ways (enter 1-3 examples above)
• There are several (2-4) different, correct answers to this question (enter 2-3 examples above)
• There are many different, correct answers to this question (enter 2-3 examples)
• The question makes sense, but I don't know the answer (enter "don't know" as the answer)
• This question doesn't make sense or is unanswerable (enter "?" as the answer)

Comment: In this case, there's one clear answer ("gravity"), hence the worker has entered it and checked the first box.

Some more examples are below, please read them carefully!
Some important notes:
• Some questions might sound a little strange. This is because they were originally a multiple choice question. Try and answer it as best you can.
• For "Which..." questions, think of these as asking a "What..." question, for example:
  Question: What is an example of an animal?
  Your answer (for example): dog; cat; mouse
• Put down two or three example answers separated by a ";", e.g., "dog; cat; elephant".
• If you can see a couple of ways of answering a question, put them down separated by a ";". For example:
  Question: Sleet, rain, snow, and hail are forms of:
  Your answer (for example): weather; bad weather; precipitation
  Question: Which type of energy does a person use to pedal a bicycle?
  Your answer (for example): motion; kinetic energy
• Some answers might be a phrase or sentence.
• Feel free to use the internet to help get information. BUT:
  If you happen to find exactly this question on the internet (e.g., as part of a multiple-choice exam), please don't read the answer and in particular don't copy in the multiple-choice answer! We are wanting "natural" answers to this question rather than the original multiple choice answer, so copying in the multiple-choice answer defeats the point.
• If you're unsure, or it's taking too long to work out the answer, enter "don't know" and select the "I don't know the answer" choice.
• If the question doesn't make sense or is unanswerable, enter "?".
• For categorizing the question, just use your best judgement.

Thank you for your help! You rock!

1. Examples of questions where there is a clear, single answer
Q: In New York State, the longest period of daylight occurs during which month?
Your Answer: June
Q: Which form of energy is needed to change water from a liquid to a gas?
A: heat
Comment: In these cases, there's one clear answer.

2. Examples of questions where there is conceptually just one answer, but it could be expressed in different ways
Q: A dog opens its mouth and lets its tongue hang out. A human's body produces sweat. These are two ways that organisms may adjust to
Your Answer (for example): warm weather; hot temperatures; hot weather; heat
Q: What is the main source of energy for the water cycle?
A: sun; sunlight; sunshine
Comment: As there are several different ways of describing the answer, they are listed above separated by ";". Aim to enter two or three such variations. The above answers are just examples, others are possible.

3. Examples of questions where there are several different answers to this question
Q: Water freezing is an example of
Your answer (for example): a phase change; something solidifying
Q: Which tool is used to measure the volume of a liquid?
A: graduated cylinder; measuring cup; volumetric cylinder
Q: Which characteristic is inherited rather than learned?
A: eye color; skin color
Comment: The above answers are just examples, others are possible.

4. Examples of questions where there are many different answers to this question
Q: Which food is a fruit?
Your answer (for example): apple; banana; cherry
Q: An example of a poor health habit is:
A: sitting around all day; eating candy; smoking
Comment: The above answers are just examples, others are possible.

6. Examples of questions where the question doesn't make sense or is unanswerable (enter "?" as the answer)
Q: Which is the largest?
Your Answer: ?
Q: Which animal is preparing for a seasonal change in the environment?
A: ?
Q: Which object is the best conductor of electricity?
A: ?
Comment: Enter a "?" if the question doesn't make sense or is unanswerable.

Thank you for your help! You rock!