Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies
Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, Jonathan Berant
Tel Aviv University, Allen Institute for AI, University of Pennsylvania

Abstract
A key limitation in current datasets for multi-hop reasoning is that the required steps for answering the question are mentioned in it explicitly. In this work, we introduce StrategyQA, a question answering (QA) benchmark where the required reasoning steps are implicit in the question, and should be inferred using a strategy. A fundamental challenge in this setup is how to elicit such creative questions from crowdsourcing workers, while covering a broad range of potential strategies. We propose a data collection procedure that combines term-based priming to inspire annotators, careful control over the annotator population, and adversarial filtering for eliminating reasoning shortcuts. Moreover, we annotate each question with (1) a decomposition into reasoning steps for answering it, and (2) Wikipedia paragraphs that contain the answers to each step. Overall, StrategyQA includes 2,780 examples, each consisting of a strategy question, its decomposition, and evidence paragraphs. Analysis shows that questions in StrategyQA are short, topic-diverse, and cover a wide range of strategies. Empirically, we show that humans perform well (87%) on this task, while our best baseline reaches an accuracy of ∼66%.

1 Introduction

Developing models that successfully reason over multiple parts of their input has attracted substantial attention recently, leading to the creation of many multi-step reasoning Question Answering (QA) benchmarks (Welbl et al., 2018; Talmor and Berant, 2018; Khashabi et al., 2018; Yang et al., 2018; Dua et al., 2019; Suhr et al., 2019). Commonly, the language of questions in such benchmarks explicitly describes the process for deriving the answer. For instance (Figure 1, Q2), the question "Was Aristotle alive when the laptop was invented?" explicitly specifies the required reasoning steps. However, in real-life questions, reasoning is often implicit. For example, the question "Did Aristotle use a laptop?" (Q1) can be answered using the same steps, but the model must infer the strategy for answering the question – temporal comparison, in this case.

[Figure 1: Questions in StrategyQA (Q1) require implicit decomposition into reasoning steps (D), for which we annotate supporting evidence from Wikipedia (E). This is in contrast to multi-step questions that explicitly specify the reasoning process (Q2). Q1: "Did Aristotle use a laptop?" (implicit); Q2: "Was Aristotle alive when the laptop was invented?" (explicit); D: 1. When did Aristotle live? 2. When was the laptop invented? 3. Is ...; A: No; E: "Aristotle (384-322 BC) was a philosopher...", "The first laptop was... in 1980".]

Answering implicit questions poses several challenges compared to answering their explicit counterparts. First, retrieving the context is difficult, as there is little overlap between the question and its context (Figure 1, Q1 and 'E'). Moreover, questions tend to be short, lowering the possibility of the model exploiting shortcuts in the language of the question.

In this work, we introduce StrategyQA, a Boolean QA benchmark focusing on implicit multi-hop reasoning for strategy questions, where a strategy is the ability to infer from a question its atomic sub-questions. In contrast to previous benchmarks (Khot et al., 2020a; Yang et al., 2018), questions in StrategyQA are not limited to predefined decomposition patterns and cover a wide range of strategies that humans apply when answering questions.

Eliciting strategy questions using crowdsourcing is non-trivial. First, authoring such questions requires creativity. Past work often collected multi-hop questions by showing workers an entire context, which led to limited creativity and high lexical overlap between questions and contexts, and consequently to reasoning shortcuts (Khot et al., 2020a; Yang et al., 2018). An alternative approach, applied in Natural Questions (Kwiatkowski et al., 2019) and MS-MARCO (Nguyen et al., 2016), overcomes this by collecting real user questions. However, can we elicit creative questions independently of the context and without access to users?

Second, an important property in StrategyQA is that questions entail diverse strategies. While the example in Figure 1 necessitates temporal reasoning, there are many possible strategies for answering questions (Table 1). We want a benchmark that exposes a broad range of strategies. But crowdsourcing workers often use repetitive patterns, which may limit question diversity.

To overcome these difficulties, we use the following techniques in our pipeline for eliciting strategy questions: (a) we prime crowd workers with random Wikipedia terms that serve as a minimal context to inspire their imagination and increase their creativity; (b) we use a large set of annotators to increase question diversity, limiting the number of questions a single annotator can write; and (c) we continuously train adversarial models during data collection, slowly increasing the difficulty of question writing and preventing recurring patterns (Bartolo et al., 2020).

Beyond the questions, as part of StrategyQA, we annotated: (a) question decompositions: a sequence of steps sufficient for answering the question ('D' in Figure 1), and (b) evidence paragraphs: Wikipedia paragraphs that contain the answer to each decomposition step ('E' in Figure 1). StrategyQA is the first QA dataset to provide decompositions and evidence annotations for each individual step of the reasoning process.

Our analysis shows that StrategyQA necessitates reasoning over a wide variety of knowledge domains (physics, geography, etc.) and logical operations (e.g., number comparison). Moreover, experiments show that StrategyQA poses a combined challenge of retrieval and QA, and while humans perform well on these questions, even strong systems struggle to answer them.

In summary, the contributions of this work are:
1. Defining strategy questions – a class of questions requiring implicit multi-step reasoning.
2. StrategyQA, the first benchmark for implicit multi-step QA, covering a diverse set of reasoning skills. StrategyQA consists of 2,780 questions, annotated with their decomposition and per-step evidence.
3. A novel annotation pipeline designed to elicit quality strategy questions, with minimal context for priming workers.

The dataset and codebase are publicly available at https://allenai.org/data/strategyqa.

2 Strategy Questions

We define strategy questions by characterizing their desired properties. Some properties, such as whether the question is answerable, also depend on the context used for answering the question. In this work, we assume this context is a corpus of documents, specifically Wikipedia, which we assume provides correct content.
Multi-step
Strategy questions are multi-step questions, that is, they comprise a sequence of single-step questions. A single-step question is either (a) a question that can be answered from a short text fragment in the corpus (e.g., steps 1 and 2 in Figure 1), or (b) a logical operation over answers from previous steps (e.g., step 3 in Figure 1). A strategy question should have at least two steps for deriving the answer. Example multi- and single-step questions are provided in Table 2. We define the reasoning process structure in §2.2.
Feasible
Questions should be answerable from paragraphs in the corpus. Specifically, for each reasoning step in the sequence, there should be sufficient evidence from the corpus to answer the question. For example, the answer to the question "Would a monocle be appropriate for a cyclop?" can be derived from paragraphs stating that cyclops have one eye and that a monocle is used by one eye at a time. This information is found in our corpus, Wikipedia, and thus the question is feasible. In contrast, the question "Does Justin Bieber own a Zune?" is not feasible, because answering it requires going through Bieber's belongings, and this information is unlikely to be found in Wikipedia.

Implicit

A key property distinguishing strategy questions from prior multi-hop questions is their implicit nature. In explicit questions, each step in the reasoning process can be inferred from the language of the question directly. For example, in Figure 1, the first two questions are explicitly stated, one in the main clause and one in the adverbial clause. Conversely, reasoning steps in strategy questions require going beyond the language of the question. Due to language variability, a precise definition of implicit questions based on lexical overlap is elusive, but a good rule of thumb is the following: if the question decomposition can be written with a vocabulary limited to words from the question, their inflections, and function words, then it is an explicit question. If new content words must be introduced to describe the reasoning process, the question is implicit. Examples of implicit and explicit questions are in Table 2.

Table 1: Example strategy questions and the implicit facts needed for answering them.

"Can one spot helium?" (No) | Helium is a gas. Helium is odorless. Helium is tasteless. Helium has no color.
"Would Hades and Osiris hypothetically compete for real estate in the Underworld?" (Yes) | Hades was the Greek god of death and the Underworld. Osiris was the Egyptian god of the Underworld.
"Would a monocle be appropriate for a cyclop?" (Yes) | Cyclops have one eye. A monocle helps one eye at a time.
"Should a finished website have lorem ipsum paragraphs?" (No) | Lorem ipsum paragraphs are meant to be temporary. Web designers always remove lorem ipsum paragraphs before launch.
"Is it normal to find parsley in multiple sections of the grocery store?" (Yes) | Parsley is available in both fresh and dry forms. Fresh parsley must be kept cool. Dry parsley is a shelf-stable product.

Table 2: Example questions demonstrating the multi-step (MS) and implicit (IM) properties of strategy questions.

"Was Barack Obama born in the United States?" (Yes) | MS: – | IM: – | The question explicitly states the required information for the answer – the birth place of Barack Obama. The answer is likely to be found in a single text fragment in Wikipedia.
"Do cars use drinking water to power their engine?" (No) | MS: – | IM: – | The question explicitly states the required information for the answer – the liquid used to power car engines. The answer is likely to be found in a single text fragment in Wikipedia.
"Are sharks faster than crabs?" (Yes) | MS: ✓ | IM: – | The question explicitly states the required reasoning steps: 1) How fast are sharks? 2) How fast are crabs? 3) Is ...
... | MS: ✓ | IM: – | The question explicitly states the required reasoning steps: 1) Who is the female star of Inland Empire? 2) Was Tom Cruise married to ...
... | MS: ✓ | IM: ✓ | The answer can be derived through geographical/botanical reasoning that the climate in Antarctica does not support growth of watermelons.
"Would someone with a nosebleed benefit from Coca?" (Yes) | MS: ✓ | IM: ✓ | The answer can be derived through biological reasoning that Coca constricts blood vessels, and therefore, serves to stop bleeding.
Definite
A type of question we wish to avoid is the non-definitive question, such as "Are hamburgers considered a sandwich?" and "Does chocolate taste better than vanilla?", for which there is no clear answer. We would like to collect questions where the answer is definitive or, at least, very likely, based on the corpus. E.g., consider the question "Does wood conduct electricity?". Although it is possible that damp wood will conduct electricity, the answer is generally no.

To summarize, strategy questions are multi-step questions with implicit reasoning (a strategy) and a definitive answer that can be reached given a corpus. We limit ourselves to Boolean yes/no questions, which limits the output space, but lets us focus on the complexity of the questions, which is the key contribution. Example strategy questions are in Table 1, and examples that demonstrate the mentioned properties are in Table 2. Next (§2.2), we describe additional structures annotated during data collection.

Strategy questions involve complex reasoning that leads to a yes/no answer. To guide and evaluate the QA process, we annotate every example with a description of the expected reasoning process. Prior work used rationales or supporting facts, i.e., text snippets extracted from the context (DeYoung et al., 2020; Yang et al., 2018; Kwiatkowski et al., 2019; Khot et al., 2020a), as evidence for an answer. However, reasoning can rely on elements that are not explicitly expressed in the context. Moreover, answering a question based on relevant context does not imply that the model performs reasoning properly (Jiang and Bansal, 2019).

Inspired by recent work (Wolfson et al., 2020), we associate every question-answer pair with a strategy question decomposition. A decomposition of a question q is a sequence of n steps ⟨s(1), s(2), ..., s(n)⟩ required for computing the answer to q.
Each step s(i) corresponds to a single-step question and may include special references, which are placeholders referring to the result of a previous step s(j). The last decomposition step (i.e., s(n)) returns the final answer to the question. Table 3 shows decomposition examples.

Wolfson et al. (2020) targeted explicit multi-step questions (first row in Table 3), where the decomposition is restricted to a small vocabulary derived almost entirely from the original question. Conversely, decomposing strategy questions requires using implicit knowledge, and thus decompositions can include any token that is needed for describing the implicit reasoning (rows 2-4 in Table 3). This makes the decomposition task significantly harder for strategy questions.

In this work, we distinguish between two types of actions required for executing a step. Retrieval: a step that requires retrieval from the corpus; and operation: a logical function over answers to previous steps. In the second row of Table 3, the first two steps are retrieval steps, and the last step is an operation. A decomposition step can require both retrieval and an operation (see last row in Table 3).

To verify that steps are valid single-step questions that can be answered using the corpus (Wikipedia), we collect supporting evidence for each retrieval step and annotate operation steps. A supporting evidence is one or more paragraphs that provide an answer to the retrieval step.

In summary, each example in our dataset contains a) a strategy question, b) the strategy question decomposition, and c) supporting evidence per decomposition step. Collecting strategy questions and their annotations is the main challenge of this work, and we turn to this next.
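To make the decomposition structure concrete, the following is a minimal Python sketch. The "#k" placeholder syntax, the field names, and the example operation step are illustrative assumptions, not an official dataset schema:

```python
import re
from dataclasses import dataclass

@dataclass
class Step:
    text: str   # a single-step question, possibly containing "#k" references
    kind: str   # "retrieval" or "operation"

def resolve(step_text: str, previous_answers: list) -> str:
    """Substitute each '#k' placeholder with the answer of step k."""
    return re.sub(r"#(\d+)",
                  lambda m: str(previous_answers[int(m.group(1)) - 1]),
                  step_text)

# Hypothetical decomposition of "Was Aristotle alive when the laptop was invented?"
steps = [
    Step("When did Aristotle live?", "retrieval"),
    Step("When was the laptop invented?", "retrieval"),
    Step("Is #2 before or during #1?", "operation"),
]

answers = ["384-322 BC", "1980"]  # answers to the two retrieval steps
print(resolve(steps[2].text, answers))  # Is 1980 before or during 384-322 BC?
```

The last step is an operation over the answers of the two retrieval steps, mirroring the retrieval/operation distinction drawn above.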
Table 3: Explicit (row 1) and strategy (rows 2-4) question decompositions. We mark words that are explicit (italic) or implicit in the input (bold).

Row 1 – Question: "Did the Battle of Peleliu or the Seven Days Battles last longer?" Decomposition: (1) How long did the Battle of Peleliu last? (2) How long did the Seven Days Battle last? (3) Which is longer of ...?

Row 2 – Question: ... Decomposition: (1) What is the citizenship requirement for voting in New Mexico? (2) What is the citizenship requirement of any President of Mexico? (3) Is ...?

Row 3 – Question: ... Decomposition: (1) What kind of battery does a Toyota Prius use? (2) What type of material is ...? (3) Does the melting point of ... reach at least the microwave's temperature ...?

Row 4 – Question: ... Decomposition: (1) ... penguin's natural habitat? (2) What conditions make penguins ...? (3) Are all of ... Miami?
3 Data Collection Pipeline

Our goal is to establish a procedure for collecting strategy questions and their annotations at scale. To this end, we build a multi-step crowdsourcing pipeline designed to encourage worker creativity, while preventing biases in the data. (We use Amazon Mechanical Turk as our framework.) We break the data collection into three tasks: question writing (§3.1), question decomposition (§3.2), and evidence matching (§3.3). In addition, we implement mechanisms for quality assurance (§3.4). An overview of the data collection pipeline is in Figure 2.

[Figure 2: Overview of the data collection pipeline. First (CQW, §3.1), a worker is presented with a term (T) and an expected answer (A) and writes a question (Q) and the facts (F1, F2) required to answer it. Next, the question is decomposed (SQD, §3.2) into steps (S1, S2) along with Wikipedia page titles (P1, P2) that the worker expects to find the answer in. Last (EVM, §3.3), decomposition steps are matched with evidence from Wikipedia (E1, E2). Running example: T = Silk, A = Yes, Q = "Is silk denatured by heat?", F1 = "Silk is a natural protein fiber", F2 = "Protein is denatured by heat", S1 = "What kind of fiber is silk made of?", P1 = Silk, P2 = Denaturation.]

3.1 Question Writing (CQW)

Generating natural language annotations through crowdsourcing (e.g., question generation) is known to suffer from several shortcomings. First, when annotators generate many instances, they use recurring patterns that lead to biases in the data (Gururangan et al., 2018; Geva et al., 2019). Second, when language is generated conditioned on a long context, such as a paragraph, annotators use similar language (Kwiatkowski et al., 2019), leading to high lexical overlap and hence, inadvertently, to an easier problem. Moreover, a unique property of our setup is that we wish to cover a broad and diverse set of strategies. Thus, we must discourage repeated use of the same strategy.

We tackle these challenges on multiple fronts. First, rather than using a long paragraph as context, we prime workers to write questions given single terms from Wikipedia, reducing the overlap with the context to a minimum. Second, to encourage diversity, we control the population of annotators, making sure a large number of annotators contribute to the dataset. Third, we use model-in-the-loop adversarial annotation (Dua et al., 2019; Khot et al., 2020a; Bartolo et al., 2020) to filter our questions, and only accept questions that fool our models. While some model-in-the-loop approaches use fixed pre-trained models to eliminate "easy" questions, we continuously update the models during data collection to combat the use of repeated patterns or strategies.

We now provide a description of the task, and elaborate on these methods (Figure 2, upper row).
Task description
Given a term (e.g., silk), a description of the term, and an expected answer (yes or no), the task is to write a strategy question about the term with the expected answer, and the facts required to answer the question.
Priming with Wikipedia terms
Writing strategy questions from scratch is difficult. To inspire worker creativity, we ask workers to write questions about terms they are familiar with or can easily understand. The terms are titles of "popular" Wikipedia pages. (We filter pages based on the number of contributors and the number of backward links from other pages.) We provide workers only with a short description of the given term. Then, workers use their background knowledge and web search skills to form a strategy question.
Controlling the answer distribution
We ask workers to write questions where the answer is set to be 'yes' or 'no'. To balance the answer distribution, the expected answer is dynamically sampled inversely proportional to the ratio of 'yes' and 'no' questions collected up to that point.
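A minimal sketch of this inverse-proportional sampling (illustrative; the exact sampling rule is not specified beyond "inversely proportional to the ratio collected so far"):

```python
import random

def sample_expected_answer(num_yes: int, num_no: int) -> str:
    """Request 'yes' or 'no' with probability inversely proportional to how
    often each label has been collected, pushing the distribution to 50/50."""
    total = num_yes + num_no
    if total == 0:
        return random.choice(["yes", "no"])
    # The probability of requesting 'yes' shrinks as 'yes' questions accumulate.
    p_yes = num_no / total
    return "yes" if random.random() < p_yes else "no"
```

For example, after collecting 700 'yes' and 300 'no' questions, a new 'no' question would be requested with probability 0.7.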
Model-in-the-loop filtering
To ensure questions are challenging, and to reduce recurring language and reasoning patterns, questions are only accepted when verified by two sets of online solvers. First, we deploy a set of 5 pre-trained models (termed PTD) that check whether the question is too easy; if at least 4 out of 5 answer the question correctly, it is rejected. Second, we use a set of 3 models (called FNTD) that are continuously fine-tuned on our collected data and are meant to detect biases in the current question set; a question is rejected if all 3 solvers answer it correctly. The solvers are RoBERTa (Liu et al., 2019) models fine-tuned on different auxiliary datasets; details in §5.1.

Auxiliary sub-task
We ask workers to provide the facts required to answer the question they have written, for several reasons: 1) it helps workers frame the question writing task and describe the reasoning process they have in mind, 2) it helps in reviewing their work, and 3) it provides useful information for the decomposition step (§3.2).
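The acceptance rule of the solver-based filtering above can be written as a small predicate (a sketch; running the actual RoBERTa solvers is abstracted into boolean correctness flags):

```python
def is_rejected(pretrained_correct, finetuned_correct) -> bool:
    """Reject a question if the fixed pre-trained solvers find it too easy
    (at least 4 of the 5 answer correctly), or if all 3 continuously
    fine-tuned solvers answer correctly (suggesting a recurring pattern)."""
    assert len(pretrained_correct) == 5 and len(finetuned_correct) == 3
    too_easy = sum(pretrained_correct) >= 4
    recurring = all(finetuned_correct)
    return too_easy or recurring

# A question that fools most solvers is accepted (not rejected):
print(is_rejected([True, False, False, True, False], [True, True, False]))  # False
```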
3.2 Question Decomposition (SQD)

Once a question and the corresponding facts are written, we generate the strategy question decomposition (Figure 2, middle row). We annotate decompositions before matching evidence in order to avoid biases stemming from seeing the context.

The decomposition strategy for a question is not always obvious, which can lead to undesirable explicit decompositions. For example, a possible explicit decomposition for Q1 (Figure 1) might be (1) What items did Aristotle use? (2) Is laptop in ...?; but the first step is not feasible. To guide the decomposition, we provide workers with the facts written in the CQW task to show the strategy of the question author. Evidently, there can be many valid strategies, and the same strategy can be phrased in multiple ways – the facts only serve as soft guidance.
Task description
Given a strategy question, a yes/no answer, and a set of facts, the task is to write the steps needed to answer the question.
Auxiliary sub-task
We observe that in some cases, annotators write explicit decompositions, which often lead to infeasible steps that cannot be answered from the corpus. To help workers avoid explicit decompositions, we ask them to specify, for each decomposition step, a Wikipedia page where they expect to find the answer. This encourages workers to write decomposition steps for which it is possible to find answers in Wikipedia, and leads to feasible strategy decompositions, with only a small overhead (the workers are not required to read the proposed Wikipedia page).
3.3 Evidence Matching (EVM)

We now have a question and its decomposition. To ground them in context, we add a third task of evidence matching (Figure 2, bottom row).
Task description
Given a question and its decomposition (a list of single-step questions), the task is to find evidence paragraphs on Wikipedia for each retrieval step. Operation steps that do not require retrieval (§2.2) are marked as operation.

Controlling the matched context
Workers search for evidence on Wikipedia. We index Wikipedia and provide a search interface where workers can drag-and-drop paragraphs from the results shown on the search interface. (We use the Wikipedia Cirrus dump from 11/05/2020.) This guarantees that annotators choose paragraphs we included in our index, at a pre-determined paragraph-level granularity.

3.4 Quality Assurance

For each task, we hold qualifications that test understanding of the task, and manually review several examples. Workers who follow the requirements are granted access to our tasks. Our qualifications are open to workers from English-speaking countries who have high reputation scores. Additionally, the authors regularly review annotations to give feedback and prevent noisy annotations.
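As a toy illustration of the fixed paragraph-level granularity described for the evidence search interface (the real pipeline indexes a Wikipedia Cirrus dump with a full-text search engine; this sketch only mimics the behavior):

```python
from collections import defaultdict

def tokenize(text):
    return [t.strip(".,;:!?\"'").lower() for t in text.split()]

class ParagraphIndex:
    """Toy inverted index over paragraphs at a pre-determined granularity."""

    def __init__(self):
        self.paragraphs = []              # (page_title, paragraph_text)
        self.inverted = defaultdict(set)  # token -> set of paragraph ids

    def add_page(self, title, paragraphs):
        for text in paragraphs:
            pid = len(self.paragraphs)
            self.paragraphs.append((title, text))
            for token in tokenize(text):
                self.inverted[token].add(pid)

    def search(self, query):
        """Return all paragraphs containing every query token."""
        ids = set.intersection(*(self.inverted[t] for t in tokenize(query)))
        return [self.paragraphs[i] for i in sorted(ids)]

index = ParagraphIndex()
index.add_page("Silk", ["Silk is a natural protein fiber.",
                        "Silk is produced by certain insect larvae."])
print(index.search("protein fiber"))  # [('Silk', 'Silk is a natural protein fiber.')]
```

Because results are whole indexed paragraphs, whatever a worker selects is guaranteed to exist in the index at the chosen granularity.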
Real-time automatic checks
For CQW, we use heuristics to check question validity, e.g., whether the question ends with a question mark, and that it does not use language that characterizes explicit multi-hop questions (for instance, having multiple verbs). For SQD, we check that the decomposition structure forms a directed acyclic graph, i.e., (i) each decomposition step is referenced by (at least) one of the following steps, such that all steps are reachable from the last step; and (ii) steps do not form a cycle. In the EVM task, a warning message is shown when the worker marks an intermediate step as an operation (an unlikely scenario).
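The SQD structure check can be implemented as follows (a sketch; it assumes references are written as "#k" placeholders, which is an assumption rather than a detail given in this section):

```python
import re

def valid_decomposition(steps: list) -> bool:
    """Check (i) all steps are reachable from the last step via references,
    and (ii) references only point backwards, so no cycles are possible."""
    n = len(steps)
    refs = [set(int(m) for m in re.findall(r"#(\d+)", s)) for s in steps]
    # (ii) a reference must point to an earlier, existing step
    for i, r in enumerate(refs, start=1):
        if any(j < 1 or j >= i for j in r):
            return False
    # (i) walk backwards from the last step and require full reachability
    reachable, frontier = {n}, [n]
    while frontier:
        i = frontier.pop()
        for j in refs[i - 1]:
            if j not in reachable:
                reachable.add(j)
                frontier.append(j)
    return len(reachable) == n

print(valid_decomposition([
    "When did Aristotle live?",
    "When was the laptop invented?",
    "Is #2 before or during #1?",
]))  # True: both retrieval steps feed the final operation step
```

A decomposition whose first step is never referenced, or whose references point forward, fails the check.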
Inter-task feedback
At each step of the pipeline, we collect feedback about previous steps. To verify results from the CQW task, we ask workers to indicate whether the given answer is incorrect (in the SQD and EVM tasks), or whether the question is not definitive (in the SQD task) (§2.1). Similarly, to identify non-feasible questions or decompositions, we ask workers to indicate if there is no evidence for a decomposition step (in the EVM task).
Evidence verification task
After the EVM step, each example comprises a question, its answer, decomposition, and supporting evidence. To verify that a question can be answered by executing the decomposition steps against the matched evidence paragraphs, we construct an additional evidence verification task (EVV). In this task, workers are given a question, its decomposition, and matched paragraphs, and are asked to answer the question in each decomposition step purely based on the provided paragraphs. Running EVV on a subset of examples during data collection helps identify issues in the pipeline and in worker performance.
4 The StrategyQA Dataset
We run our pipeline on 1,799 Wikipedia terms, allowing a maximum of 5 questions per term. We update our online fine-tuned solvers (FNTD) every 1K questions. Every question is decomposed once, and evidence is matched for each decomposition by 3 different workers. The cost of annotating a full example is $4.

[Table 4: StrategyQA statistics for the train and test sets. Filtered questions were rejected by the solvers (§3.1). The train and test sets of question writers are disjoint. The "top trigram" is the most common trigram.]
To encourage diversity in the strategies used in the questions, we recruited new workers throughout data collection. Moreover, periodic updates of the online solvers prevent workers from exploiting shortcuts, since the solvers adapt to the training distribution. Overall, 29 question writers, 19 decomposers, and 54 evidence matchers participated in the data collection.

We collected 2,835 questions, out of which 55 were marked as having an incorrect answer during SQD (§3.2). This results in a collection of 2,780 verified strategy questions, for which we create an annotator-based data split (Geva et al., 2019). We now describe the dataset statistics (§4.1), analyze the quality of the examples (§4.2), and explore the reasoning skills in StrategyQA (§4.3).
4.1 Dataset Statistics

We observe (Table 4) that the answer distribution is roughly balanced (yes/no). Moreover, questions are short (< ... words), and the most common trigram occurs in roughly ...% of the examples. This indicates that the language of the questions is both simple and diverse. For comparison, the average question length in the multi-hop datasets HotpotQA (Yang et al., 2018) and ComplexWebQuestions (Talmor and Berant, 2018) is ... words and ... words, respectively. Likewise, the top trigram in these datasets occurs in 9.2% and 4.8% of their examples, respectively.

More than half of the generated questions are filtered by our solvers, pointing to the difficulty of generating good strategy questions. We release all 3,305 filtered questions as well.

To characterize the reasoning complexity required to answer questions in StrategyQA, we examine the decomposition length and the number of evidence paragraphs. Figure 3 and Table 4 (bottom) show that the distributions of these properties are centered around 3-step decompositions and 2 evidence paragraphs, but a considerable portion of the dataset requires more steps and paragraphs.

[Figure 3: The distributions of decomposition length (left) and the number of evidence paragraphs (right). The majority of the questions in StrategyQA require a reasoning process comprised of ≥ ... steps, of which about 2 steps involve retrieving external knowledge.]

Table 5: Distribution (%) over the implicit and multi-step properties (§2) in a sample of 100 StrategyQA questions, annotated by two experts (we average the expert decisions). Most questions are multi-step and implicit. Annotator agreement is substantial for both the implicit (κ = 0.…) and multi-step (κ = 0.…) properties.

             multi-step   single-step   total
  implicit       81            1          82
  explicit      14.5          3.5         18
  total         95.5          4.5        100

4.2 Quality of the Examples

Do questions in StrategyQA require multi-step implicit reasoning? To assess the quality of the questions, we sampled 100 random examples from the training set, and had two experts (authors) independently annotate whether the questions satisfy the desired properties of strategy questions (§2.1). We find that most of the examples (81%) are valid multi-step implicit questions: 82% of the questions are implicit, and 95.5% are multi-step (Table 5).
Do questions in StrategyQA have a definitive answer?
We let experts review the answers to 100 random questions, allowing access to the Web. We then asked them to state, for every question, whether they agree or disagree with the provided answer. We find that the experts agree with the answer in 94% of the cases, and disagree in only 2%. For the remaining 4%, either the question was ambiguous, or the annotators could not find a definite answer on the Web. Overall, this suggests that questions in StrategyQA have clear answers.
What is the quality of the decompositions?
We randomly sampled 100 decompositions and asked experts to judge their quality. First, experts judged whether each decomposition is explicit or utilizes a strategy. We find that 83% of the decompositions validly use a strategy to break down the question. The remaining 17% of the decompositions are explicit; however, in 14% of the cases the original question is already explicit. Second, experts checked whether the phrasing of the decomposition is "natural", i.e., whether it reflects the decomposition of a person who does not already know the answer. We find that 89% of the decompositions express a "natural" reasoning process, while 11% may depend on the answer. Last, we asked experts to indicate any potential logical flaws in the decompositions, but no such cases occurred in the sample.
Would different annotators use the same decomposition strategy?
We sampled 50 examples and let two different workers decompose each question. Comparing the decomposition pairs, we find that a) for all pairs, the last step returns the same answer; b) in 44 out of 50 pairs, the decompositions follow the same reasoning path; and c) in the other 6 pairs, the decompositions either follow a different reasoning process (5 pairs) or one of the decompositions is explicit (1 pair). This shows that different workers usually use the same strategy when decomposing questions.
Is the evidence for strategy questions in Wikipedia?
Another important property is whether questions in StrategyQA can be answered based on context from our corpus, Wikipedia, given that the questions are written independently of the context. To measure evidence coverage, in the EVM task (§3.3), we provide workers with a checkbox for every decomposition step, indicating whether only partial or no evidence could be found for that step. Recall that three different workers match evidence for each decomposition step. We find that 88.3% of the questions are fully covered: evidence was matched for each step by some worker. Moreover, in 86.9% of the questions, at least one worker found evidence for all steps. Last, in only 0.5% of the examples could all three annotators not match evidence for any of the steps. This suggests that, overall, Wikipedia is a good corpus for questions in StrategyQA, which were written independently of the context.
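The three coverage notions above can be computed from per-worker, per-step match indicators (a sketch with a hypothetical data layout):

```python
def coverage_stats(matches: list) -> dict:
    """`matches[w][s]` is True if worker w found evidence for step s.
    Returns the three coverage notions reported in the text for one question."""
    num_steps = len(matches[0])
    # every step covered by at least one worker (the 88.3% notion)
    fully_covered = all(any(worker[s] for worker in matches) for s in range(num_steps))
    # at least one worker covered all steps on their own (the 86.9% notion)
    single_worker = any(all(worker) for worker in matches)
    # no worker found evidence for any step (the 0.5% notion)
    no_evidence = all(not any(worker) for worker in matches)
    return {"fully_covered": fully_covered,
            "single_worker_full": single_worker,
            "no_evidence": no_evidence}

# Three workers, two steps: coverage only when the workers are combined.
print(coverage_stats([[True, False], [False, True], [False, False]]))
# {'fully_covered': True, 'single_worker_full': False, 'no_evidence': False}
```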
Do matched paragraphs provide evidence?
We assess the quality of matched paragraphs by analyzing both example-level and step-level annotations. First, we sample 217 decomposition steps with their corresponding paragraphs, matched by one of the three workers. We let 3 different crowdworkers decide whether the paragraphs provide evidence for the answer to that step. We find that in 93% of the cases, the majority vote is that the evidence is valid.

Next, we analyze annotations of the verification task (§3.4), where workers are asked to answer all decomposition steps based only on the matched paragraphs. We find that the workers could answer the sub-questions and derive the correct answer in 82 out of 100 annotations. Moreover, in 6 questions there was indeed an error in evidence matching, but another worker who annotated the example was able to compensate for the error, leading to 88% of the questions where evidence matching succeeds. In the remaining 12 cases, evidence is indeed missing, and is possibly absent from Wikipedia.

Lastly, we let experts review the paragraphs matched by one of the three workers to all the decomposition steps of a question, for 100 random questions. We find that for 79 of the questions the matched paragraphs provide sufficient evidence for answering the question. For 12 of the 21 questions without sufficient evidence, the experts indicated they would expect to find evidence in Wikipedia, and the worker probably could not find it. For the remaining 9 questions, they estimated that evidence is probably absent from Wikipedia.

In conclusion, 93% of the paragraphs matched at the step level were found to be valid. Moreover, when considering single-worker annotations, ∼80% of the questions are matched with paragraphs that provide sufficient evidence for all retrieval steps. This number increases to 88% when aggregating the annotations of three workers.
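The step-level validation above reduces to a simple majority vote over the three validation workers. A minimal sketch, with invented vote data:

```python
def majority_valid(votes):
    """True if a strict majority of validation workers judged the evidence valid."""
    return sum(votes) * 2 > len(votes)

# Hypothetical votes from three workers for three decomposition steps:
step_votes = [[True, True, False], [True, False, False], [True, True, True]]
valid_fraction = sum(majority_valid(v) for v in step_votes) / len(step_votes)
print(round(valid_fraction, 2))  # 0.67
```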
Do different annotators match the same evidence paragraphs?
To compare the evidence paragraphs matched by different workers, we check whether, for a given decomposition step, the same paragraph IDs are retrieved by different annotators. Given two non-empty sets of paragraph IDs P1, P2 annotated by two workers, we compute the Jaccard coefficient J(P1, P2) = |P1 ∩ P2| / |P1 ∪ P2|. In addition, we take the sets of corresponding Wikipedia page IDs T1, T2 for the matched paragraphs, and compute J(T1, T2). Note that a score of 1 is given to two identical sets, while a score of 0 corresponds to sets that are disjoint. The average similarity score is 0.43 for paragraphs and 0.69 for pages. This suggests that evidence for a decomposition step can be found in more than one paragraph in the same page, or in different pages.

(With moderate annotator agreement, Cohen's κ.)

Table 6: Top strategies in STRATEGYQA and their frequency in a 100-example subset (accounting for 70% of the analyzed examples). Example questions: "Can human nails carve a statue out of quartz?" (Physical); "Is a platypus immune from cholera?"; "Were mollusks an ingredient in the color purple?"; "Did the 40th president of the United States forward lolcats to his friends?"; "Are quadrupeds represented on the Chinese calendar?"; "Would a compass attuned to Earth's magnetic field be a bad gift for a Christmas elf?"; "Was Hillary Clinton's deputy chief of staff in 2009 baptised?"; "Would Garfield enjoy a trip to Italy?"; "Can Larry King's ex-wives form a water polo team?"

We aim to generate creative and diverse questions. We now analyze diversity in terms of the required reasoning skills and question topic.
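The annotator-agreement computation described above (Jaccard over paragraph-ID sets, and over the corresponding page-ID sets) can be sketched as follows; the ID format "page-paragraph index" is an assumption made for illustration:

```python
def jaccard(a, b):
    """Jaccard coefficient J(A, B) = |A ∩ B| / |A ∪ B| for non-empty sets."""
    return len(a & b) / len(a | b)

# Hypothetical paragraph IDs ("<page>-<paragraph index>") matched by two workers:
p1 = {"Quartz-3", "Mohs_scale-1"}
p2 = {"Quartz-5", "Nail_(anatomy)-2"}
print(jaccard(p1, p2))  # 0.0, since no paragraph is shared

# Page-level agreement: strip the paragraph index to get page IDs.
t1 = {pid.rsplit("-", 1)[0] for pid in p1}
t2 = {pid.rsplit("-", 1)[0] for pid in p2}
print(round(jaccard(t1, t2), 2))  # 0.33, since both workers used the Quartz page
```

The gap between the two scores mirrors the finding above: workers often select different paragraphs from the same page.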
Reasoning skills
To explore the required reasoning skills in STRATEGYQA, we sampled 100 examples and let two experts (authors) discuss and annotate each example with (a) the type of strategy for decomposing the question, and (b) the required reasoning and knowledge skills per decomposition step. We then aggregate similar labels (e.g., botanical → biological) and compute the proportion of examples each strategy/reasoning skill is required for (an example can have multiple strategy labels). Table 6 demonstrates the top strategies, showing that STRATEGYQA contains a broad set of strategies. Moreover, diversity is apparent (Figure 4) in terms of both domain-related reasoning (e.g., biological and technological) and logical functions (e.g., set inclusion and "is member of"). While the reasoning skills sampled from questions in STRATEGYQA do not necessarily reflect their prevalence in a "natural" distribution, we argue that promoting research on methods for inferring strategies is an important research direction.

Figure 4: Reasoning skills in STRATEGYQA; each skill is associated with the proportion of examples it is required for. Domain-related and logical reasoning skills are marked in blue and orange (italic), respectively. Skills and proportions: historical 34%, biological 25%, set inclusion 19%, number comparison 18%, technological 14%, geographical 13%, physical 12%, entertainment 10%, is member of 10%, sports 9%, temporal 9%, definition 8%, food 8%, cultural 7%, has parts 7%, temporal comparison 7%, entity comparison 6%, religious 6%, music 5%, political 5%, botanical 4%, cause & effect 4%, education 4%, literature 4%, medical 4%, pop-culture 4%, preconditions 4%, set intersection 4%, spatial 4%, lifestyle 3%, mechanical 3%, semantic similarity 3%, semantic equivalence 3%, activity 2%, common sense 2%, law 2%, astrology 1%, chemistry 1%, contradiction 1%, language 1%, location comparison 1%, set comparison 1%.

Figure 5: The top 15 categories of terms used to prime workers for question writing and their proportion. Categories: human, taxon, profession, disease, type of sport, food ingredient, anatomical structure, group of organisms, chemical element, business, television series, war, chemical compound, film, ethnic group.
Question topics
As questions in STRATEGYQA were triggered by Wikipedia terms, we use the "instance of" Wikipedia property to characterize the topics of questions. Figure 5 shows the distribution of topic categories in STRATEGYQA. The distribution shows STRATEGYQA is very diverse, with the top two categories ("human" and "taxon", i.e., a group of organisms) covering only a quarter of the data, and a total of 609 topic categories. (It is usually a 1-to-1 mapping from a term to a Wikipedia category; in cases of 1-to-many, we take the first category.)

Table 7: Human performance in answering questions. Strategy match is computed by comparing the explanation provided by the expert with the decomposition. Decomposition usage and the number of searches are computed based on information provided by the expert.
Answer accuracy: 87%
Strategy match: 86%
Decomposition usage: 14%
Average number of searches: –

We further compare the diversity of STRATEGYQA to HOTPOTQA, a multi-hop QA dataset over Wikipedia paragraphs. To this end, we sample 739 pairs of evidence paragraphs associated with a single question in both datasets, and map each pair of paragraphs to a pair of Wikipedia categories using the "instance of" property. We find that there are 571 unique category pairs in STRATEGYQA, but only 356 unique category pairs in HOTPOTQA. Moreover, the top two category pairs in both datasets ("human-human", "taxon-taxon") constitute 8% and 27% of the cases in STRATEGYQA and HOTPOTQA, respectively. This demonstrates the creativity and breadth of category combinations in STRATEGYQA.
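The category-pair diversity comparison above can be sketched with a simple counter; the category pairs below are invented for illustration:

```python
from collections import Counter

def pair_diversity(category_pairs):
    """Number of unique category pairs and the share of the most frequent pair."""
    # Order within a pair is irrelevant, so normalize by sorting.
    normalized = [tuple(sorted(p)) for p in category_pairs]
    counts = Counter(normalized)
    top_pair, top_count = counts.most_common(1)[0]
    return len(counts), top_pair, top_count / len(normalized)

# Hypothetical "instance of" category pairs for four questions:
pairs = [
    ("human", "human"),
    ("human", "taxon"),
    ("taxon", "human"),  # same pair as above after normalization
    ("war", "chemical compound"),
]
print(pair_diversity(pairs))  # (3, ('human', 'taxon'), 0.5)
```

A higher unique-pair count and a lower top-pair share both indicate broader category combinations, which is the sense in which STRATEGYQA is compared to HOTPOTQA above.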
To see how well humans answer strategy questions, we sample a subset of 100 questions from STRATEGYQA and have experts (authors) answer the questions, given access to Wikipedia articles and an option to reveal the decomposition for every question. In addition, we ask them to provide a short explanation for the answer, to report the number of searches they conducted to derive the answer, and to indicate whether they used the decomposition. We expect humans to excel at coming up with strategies for answering questions. Yet, humans are not necessarily an upper bound, because finding the relevant paragraphs is difficult and could potentially be performed better by machines.

Table 7 summarizes the results. Overall, humans infer the required strategy and answer the questions with high accuracy. Moreover, the low number of searches shows that humans leverage background knowledge, as they can answer some of the intermediate steps without search. An error analysis shows that the main reason for failure (10%) is difficulty in finding evidence; the rest of the cases (3%) are due to ambiguity in the question that could lead to the opposite answer.
In this section, we conduct experiments to answer the following questions: (a) How well do pre-trained language models (LMs) answer strategy questions? (b) Is retrieval of relevant context helpful? (c) Are decompositions useful for answering questions that require implicit knowledge?
Answering strategy questions requires external knowledge that cannot be obtained by training on STRATEGYQA alone. Therefore, our models and online solvers (§3.1) are based on pre-trained LMs, fine-tuned on auxiliary datasets that require reasoning. Specifically, in all models we fine-tune RoBERTa (Liu et al., 2019) on a subset of:
• BoolQ (Clark et al., 2019): a dataset for boolean question answering.
• MNLI (Williams et al., 2018): a large natural language inference (NLI) dataset. The task is to predict whether a textual premise entails, contradicts, or is neutral with respect to the hypothesis.
• Twenty Questions (20Q): a collection of 50K short commonsense boolean questions.
• DROP (Dua et al., 2019): a large dataset for numerical reasoning over paragraphs.

Models are trained in two configurations:
• No context: the model is fed with the question only, and outputs a binary prediction using the special CLS token.
• With context: we use BM25 (Robertson et al., 1995) to retrieve context from our corpus, removing stop words from all queries. We examine two retrieval methods: (a) question-based retrieval, using the question as a query and taking the top k = 10 results; and (b) decomposition-based retrieval, issuing a separate query for each (gold or predicted) decomposition step and concatenating the top k = 10 results of all steps (sorted by retrieval score). In both cases, the model is fed the question concatenated with the retrieved context, truncated to 512 tokens (the maximum input length of RoBERTa), and outputs a binary prediction.

Predicting decompositions
We train a sequence-to-sequence model, termed BARTDecomp, that, given a question, generates its decomposition token-by-token. Specifically, we fine-tune BART (Lewis et al., 2020) on STRATEGYQA decompositions.

(The 20Q data is available at https://github.com/allenai/twentyquestions.)

Table 8: QA models used as online solvers during data collection (§3.1). Each model was fine-tuned on the datasets mentioned in its name.
Model | Solver group(s)
RoBERTa∅ (20Q) | PTD, FNTD
RoBERTa∅ (20Q+BoolQ) | PTD, FNTD
RoBERTa∅ (BoolQ) | PTD, FNTD
RoBERTa IR-Q (BoolQ) | PTD
RoBERTa IR-Q (MNLI+BoolQ) | PTD
Baseline models
As our base model, we train a model as follows: we take a RoBERTa (Liu et al., 2019) model and fine-tune it on DROP, 20Q, and BoolQ (in this order). The model is trained on DROP with multiple output heads, as in Segal et al. (2020), which are then replaced with a single Boolean output. We call this model RoBERTa*. We use RoBERTa* and RoBERTa to train the following models on STRATEGYQA: without context (RoBERTa*∅), with question-based retrieval (RoBERTa*IR-Q, RoBERTa IR-Q), and with predicted decomposition-based retrieval (RoBERTa*IR-D).

We also present four oracle models:
• RoBERTa*ORA-P: uses the gold paragraphs (no retrieval).
• RoBERTa*IR-ORA-D: performs retrieval with the gold decomposition.
• RoBERTa*last-step ORA-P-D: exploits both the gold decomposition and the gold paragraphs. We fine-tune RoBERTa on BoolQ and SQuAD (Rajpurkar et al., 2016) to obtain a model that can answer single-step questions. We then run this model on STRATEGYQA to obtain answers for all decomposition sub-questions, and replace all placeholder references with the predicted answers. Last, we fine-tune RoBERTa* to answer the last decomposition step of STRATEGYQA, for which we have supervision.
• RoBERTa*last-step-raw ORA-P-D: RoBERTa* fine-tuned to predict the answer from the gold paragraphs and the last step of the gold decomposition, without replacing placeholder references.

Online solvers
For the solvers integrated in the data collection process (§3.1), we use three no-context models and two question-based retrieval models. The solvers are listed in Table 8. For brevity, exact details on model training and hyperparameters will be released as part of our codebase.

Table 9: QA accuracy (with standard deviation across 7 experiments), and retrieval performance, measured by Recall@10, of baseline models on the test set. Models: Majority, RoBERTa*∅, RoBERTa IR-Q, RoBERTa* IR-Q, RoBERTa* IR-D, RoBERTa* IR-ORA-D, RoBERTa* ORA-P, RoBERTa* last-step-raw ORA-P-D, RoBERTa* last-step ORA-P-D.
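The placeholder-replacement step used by the RoBERTa*last-step ORA-P-D oracle (answer each sub-question with a single-step QA model, then substitute each "#i" reference with the answer to step i before answering the last step) can be sketched as follows; the decomposition, answers, and regex convention here are illustrative, not the paper's code:

```python
import re

def resolve_placeholders(steps, answers):
    """Substitute each '#i' reference in a step with the answer to step i."""
    return [re.sub(r"#(\d+)", lambda m: answers[int(m.group(1)) - 1], s)
            for s in steps]

# Illustrative decomposition (after Figure 1) and hypothetical sub-answers:
steps = [
    "When did Aristotle live?",
    "When was the laptop invented?",
    "Is #2 before or overlapping with #1?",
]
answers = ["384-322 BC", "1980", ""]  # the last answer is left to the model
print(resolve_placeholders(steps, answers)[2])
# Is 1980 before or overlapping with 384-322 BC?
```

After resolution, the last step is a self-contained question that a single-step model can answer against the gold paragraphs.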
Table 9 summarizes the results of all models (§5.1). RoBERTa*IR-Q substantially outperforms RoBERTa IR-Q, indicating that fine-tuning on related auxiliary datasets before STRATEGYQA is crucial. Hence, we focus on RoBERTa* for all other results and analysis.

Strategy questions pose a combined challenge of retrieving the relevant context and deriving the answer based on that context. Training without context yields a large accuracy gain over the majority baseline. This is far from human performance, but shows that some questions can be answered by a large LM fine-tuned on related datasets, without retrieval. On the other end, training with gold paragraphs raises performance substantially, showing that high-quality retrieval lets the model effectively reason over the given paragraphs. Last, additionally using gold decompositions further increases performance, showing the utility of decompositions.

Focusing on retrieval-based methods, we observe that the quality of retrieval, even with gold decompositions, is not high enough to improve over the accuracy obtained by RoBERTa*∅, a model that uses no context. Retrieval with predicted decompositions results in an even lower accuracy. We analyze predicted decompositions below.

Retrieval evaluation
A question decomposition describes the reasoning steps for answering the question. Therefore, using the decomposition for retrieval may help obtain the relevant context and improve performance. To test this, we directly compare the performance of question- and decomposition-based retrieval with respect to the annotated gold paragraphs. We compute Recall@10, i.e., the fraction of the gold paragraphs retrieved in the top-10 results of each method. Since there are 3 annotations per question, we compute Recall@10 for each annotation and take the maximum as the final score. For a fair comparison, in decomposition-based retrieval, we use the top-10 results across all steps.

Results (Table 9) show that retrieval performance is low, partially explaining why retrieval models do not improve performance compared to RoBERTa*∅, and demonstrating the retrieval challenge in our setup. Gold decomposition-based retrieval substantially outperforms question-based retrieval, showing that using the decomposition for retrieval is a promising direction for answering multi-step questions. Still, predicted decomposition-based retrieval does not improve retrieval compared to question-based retrieval, showing that better decomposition models are needed.

To understand the low retrieval scores, we analyzed the query results of 50 random decomposition steps. Most failure cases are due to the shallow pattern matching done by BM25, e.g., failure to match synonyms. This shows that there is indeed little word overlap between decomposition steps and the evidence, as intended by our pipeline design. In other examples, either a key question entity was missing because it was represented by a reference token, or the decomposition step had complex language, leading to failed retrieval. This analysis suggests that advances in neural retrieval might be beneficial for STRATEGYQA.
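To make the retrieval setup and metric concrete, here is a toy sketch of decomposition-based retrieval and Recall@k with the maximum over annotations. The corpus, queries, stop-word list, and the from-scratch Okapi BM25 are all invented for illustration and are not the paper's implementation:

```python
import math
from collections import Counter

STOP = {"the", "a", "is", "of", "in", "was", "when", "did", "how", "can"}

def tokenize(text):
    toks = [w.strip("?.,!").lower() for w in text.split()]
    return [t for t in toks if t and t not in STOP]

class BM25:
    """Minimal Okapi BM25 (k1=1.5, b=0.75) over a toy corpus."""
    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = [tokenize(d) for d in docs]
        self.k1, self.b = k1, b
        self.N = len(self.docs)
        self.avgdl = sum(len(d) for d in self.docs) / self.N
        self.df = Counter(t for d in self.docs for t in set(d))

    def score(self, query, i):
        d, tf = self.docs[i], Counter(self.docs[i])
        s = 0.0
        for t in tokenize(query):
            if tf[t] == 0:
                continue
            idf = math.log((self.N - self.df[t] + 0.5) / (self.df[t] + 0.5) + 1)
            norm = tf[t] + self.k1 * (1 - self.b + self.b * len(d) / self.avgdl)
            s += idf * tf[t] * (self.k1 + 1) / norm
        return s

    def top_k(self, query, k=10):
        return sorted(range(self.N), key=lambda i: -self.score(query, i))[:k]

corpus = [
    "Aristotle lived from 384 BC to 322 BC",             # doc 0
    "The laptop was invented in the twentieth century",  # doc 1
    "Quartz is a hard mineral",                          # doc 2
]
bm25 = BM25(corpus)

# Decomposition-based retrieval: one query per step, results pooled.
steps = ["When did Aristotle live?", "When was the laptop invented?"]
retrieved = {doc for s in steps for doc in bm25.top_k(s, k=1)}
print(retrieved)  # {0, 1}
# Note the shallow matching: "live" does not match "lived", so doc 0 is
# found only via the entity term "aristotle" (no stemming, no synonyms).

def recall_at_k(gold, retrieved):
    return len(gold & retrieved) / len(gold)

# Three gold annotations per question: take the maximum recall.
annotations = [{0, 1}, {0, 2}, {2}]
print(max(recall_at_k(g, retrieved) for g in annotations))  # 1.0
```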
Human retrieval performance
To quantify human performance in finding gold paragraphs, we ask experts to find evidence paragraphs for 100 random questions. For half of the questions we also provide the decomposition. Average Recall@10 is higher with the decomposition than without it, and in both settings humans significantly outperform our IR baselines. However, humans are still far from covering the gold paragraphs, since there are multiple valid evidence paragraphs (§4.2), and retrieval can be difficult even for humans. Lastly, the improvement from using decompositions shows that decompositions are indeed useful for finding evidence.

Predicted decompositions
Analysis shows that BARTDecomp's decompositions are grammatical and well-structured. Interestingly, the model generates strategies, but often applies them to questions incorrectly. For example, the question "Can a lifeboat rescue people in the Hooke Sea?" is decomposed into "1) What is the maximum depth of the Hooke Sea? 2) How deep can a lifeboat dive? 3) Is …". While the decomposition is well-structured, it uses a wrong strategy (lifeboats do not dive).
Prior work has typically let annotators write questions based on an entire context (Khot et al., 2020a; Yang et al., 2018; Dua et al., 2019; Mihaylov et al., 2018; Khashabi et al., 2018). In this work, we prime annotators with minimal information (a few tokens) and let them use their imagination and own wording to create questions. A related priming method was recently proposed by Clark et al. (2020), who used the first 100 characters of a Wikipedia page.

Among multi-hop reasoning datasets, our dataset stands out in that it requires implicit decompositions. Two recent datasets (Khot et al., 2020a; Mihaylov et al., 2018) have considered questions requiring implicit facts. However, they are limited to specific domain strategies, while in our work we seek diversity in this aspect.

Most multi-hop reasoning datasets do not fully annotate question decompositions (Yang et al., 2018; Khot et al., 2020a; Mihaylov et al., 2018). This issue has prompted recent work to create question decompositions for existing datasets (Wolfson et al., 2020), and to train models that generate question decompositions (Perez et al., 2020; Khot et al., 2020b; Min et al., 2019). In this work, we annotate question decompositions as part of the data collection.
We present STRATEGYQA, the first dataset of implicit multi-step questions requiring a wide range of reasoning skills. To build STRATEGYQA, we introduced a novel annotation pipeline for eliciting creative questions that use simple language, but cover a challenging range of diverse strategies. Questions in STRATEGYQA are annotated with a decomposition into reasoning steps and evidence paragraphs, to guide the ongoing research towards addressing implicit multi-hop reasoning.

Acknowledgements
We thank Tomer Wolfson for helpful feedback and the REVIZ team at the Allen Institute for AI, particularly Michal Guerquin and Sam Skjonsberg. This research was supported in part by the Yandex Initiative for Machine Learning, and the European Research Council (ERC) under the European Union Horizons 2020 research and innovation programme (grant ERC DELPHI 802800). Dan Roth is partly supported by ONR contract N00014-19-1-2620 and DARPA contract FA8750-19-2-1004, under the Kairos program. This work was completed in partial fulfillment for the Ph.D. degree of Mor Geva.
References
Max Bartolo, Alastair Roberts, Johannes Welbl, Sebastian Riedel, and Pontus Stenetorp. 2020. Beat the AI: Investigating adversarial human annotation for reading comprehension. Transactions of the Association for Computational Linguistics, 8:662–678.

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936, Minneapolis, Minnesota. Association for Computational Linguistics.

Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics (TACL), 8:454–470.

Jay DeYoung, Sarthak Jain, Nazneen Fatema Rajani, Eric Lehman, Caiming Xiong, Richard Socher, and Byron C. Wallace. 2020. ERASER: A benchmark to evaluate rationalized NLP models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4443–4458. Association for Computational Linguistics.

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In North American Chapter of the Association for Computational Linguistics (NAACL).

Mor Geva, Yoav Goldberg, and Jonathan Berant. 2019. Are we modeling the task or the annotator? An investigation of annotator bias in natural language understanding datasets. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1161–1166, Hong Kong, China. Association for Computational Linguistics.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 107–112, New Orleans, Louisiana. Association for Computational Linguistics.

Yichen Jiang and Mohit Bansal. 2019. Avoiding reasoning shortcuts: Adversarial evaluation, training, and model development for multi-hop QA. In Association for Computational Linguistics (ACL).

Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 252–262, New Orleans, Louisiana. Association for Computational Linguistics.

Tushar Khot, Peter Clark, Michal Guerquin, Peter Jansen, and Ashish Sabharwal. 2020a. QASC: A dataset for question answering via sentence composition. In AAAI.

Tushar Khot, Daniel Khashabi, Kyle Richardson, Peter Clark, and Ashish Sabharwal. 2020b. Text modular networks: Learning to decompose tasks in the language of existing models. arXiv preprint arXiv:2009.00751.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics (TACL), 7:453–466.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, Brussels, Belgium. Association for Computational Linguistics.

Sewon Min, Victor Zhong, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2019. Multi-hop reading comprehension through question decomposition and rescoring. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6097–6109, Florence, Italy. Association for Computational Linguistics.

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. In Workshop on Cognitive Computing at NIPS.

Ethan Perez, Patrick Lewis, Wen-tau Yih, Kyunghyun Cho, and Douwe Kiela. 2020. Unsupervised question decomposition for question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8864–8880.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Empirical Methods in Natural Language Processing (EMNLP).

Stephen Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, and M. Gatford. 1995. Okapi at TREC-3. In Overview of the Third Text REtrieval Conference (TREC-3), pages 109–126. Gaithersburg, MD: NIST.

Elad Segal, Avia Efrat, Mor Shoham, Amir Globerson, and Jonathan Berant. 2020. A simple and effective model for answering multi-span questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3074–3080.

Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. 2019. A corpus for reasoning about natural language grounded in photographs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6418–6428, Florence, Italy. Association for Computational Linguistics.

Alon Talmor and Jonathan Berant. 2018. The web as a knowledge-base for answering complex questions. In North American Chapter of the Association for Computational Linguistics (NAACL).

Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics (TACL), 6:287–302.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.

Tomer Wolfson, Mor Geva, Ankit Gupta, Matt Gardner, Yoav Goldberg, Daniel Deutch, and Jonathan Berant. 2020. Break it down: A question understanding benchmark. Transactions of the Association for Computational Linguistics (TACL).

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In