A Simple Yet Strong Pipeline for HotpotQA
Dirk Groeneveld† and Tushar Khot† and Mausam‡ and Ashish Sabharwal†
† Allen Institute for AI, Seattle, WA, U.S.A.
{dirkg,tushark,ashishs}@allenai.org
‡ Indian Institute of Technology, Delhi, India
[email protected]
Abstract
State-of-the-art models for multi-hop question answering typically augment large-scale language models like BERT with additional, intuitively useful capabilities such as named entity recognition, graph-based reasoning, and question decomposition. However, does their strong performance on popular multi-hop datasets really justify this added design complexity? Our results suggest that the answer may be no, because even our simple pipeline based on BERT, named Quark, performs surprisingly well. Specifically, on HotpotQA, Quark outperforms these models on both question answering and support identification (and achieves performance very close to a RoBERTa model). Our pipeline has three steps: 1) use BERT to identify potentially relevant sentences independently of each other; 2) feed the set of selected sentences as context into a standard BERT span prediction model to choose an answer; and 3) use the sentence selection model, now with the chosen answer, to produce supporting sentences. The strong performance of Quark resurfaces the importance of carefully exploring simple model designs before using popular benchmarks to justify the value of complex techniques.
1 Introduction

Textual Multi-hop Question Answering (QA) is the task of answering questions by combining information from multiple sentences or documents. This is a challenging reasoning task that requires QA systems to identify relevant pieces of information in the given text and learn to compose them to answer a question. To enable progress in this area, many datasets (Welbl et al., 2018; Talmor and Berant, 2018; Yang et al., 2018; Khot et al., 2020) and models (Min et al., 2019b; Xiao et al., 2019; Tu et al., 2019) with varying complexities have been proposed over the past few years.

Figure 1: Overview of the Quark model, with a question and context paragraphs as input. In both blue boxes, sentences are scored independently from one another. r_na(s) and r_a(s) use the same model architecture with different weights.

Our work focuses on HotpotQA (Yang et al., 2018), which contains 105,257 multi-hop questions derived from two Wikipedia paragraphs, where the correct answer is a span in these paragraphs or yes/no.

Due to the multi-hop nature of this dataset, it is natural to assume that the relevance of a sentence for a question would depend on the other sentences considered to be relevant. E.g., the relevance of "Obama was born in Hawaii." to the question "Where was the 44th President of USA born?" depends on the other relevant sentence: "Obama was the 44th President of the US." As a result, many approaches designed for this task focus on jointly identifying the relevant sentences (or paragraphs) via mechanisms such as cross-document attention, graph networks, and entity linking.

Our results question this basic assumption. We show that a simple model, Quark (see Fig. 1), that first identifies relevant sentences from each paragraph independent of other paragraphs, is surprisingly powerful on this task: in 90% of the questions, Quark's relevance module recovers all gold supporting sentences within the top-5 sentences. For QA, it uses a standard BERT (Devlin et al., 2019) span prediction model (similar to current published models) on the output of this module. Additionally, Quark exploits the inherent similarity between the relevant sentence identification task and the task of generating an explanation given an answer produced by the QA module: it uses the same architecture for both tasks.

We show that this independent sentence scoring model results in a simple QA pipeline that outperforms all other BERT models in both 'distractor' and 'fullwiki' settings of HotpotQA. In the distractor setting (10 paragraphs, including two gold, provided as context), Quark achieves joint scores (answer and support prediction) within 0.75% of the current state of the art. Even in the fullwiki setting (all 5M Wikipedia paragraphs as context), by combining our sentence selection approach with a commonly used paragraph selection approach (Nie et al., 2019), we outperform all previously published BERT models. In both settings, the only models scoring higher use RoBERTa (Liu et al., 2019), a more robustly trained language model that is known to outperform BERT across various tasks. While our design uses multiple transformer models (now considered a standard starting point in NLP), our contribution is a simple pipeline without any bells and whistles, such as NER, graph networks, entity linking, etc.

The closest effort to Quark is by Min et al. (2019a), who also propose a simple QA model for HotpotQA. Their approach selects answers independently from each paragraph to achieve competitive performance on the question-answering subtask of HotpotQA (they do not address the support identification subtask). We show that while relevant sentences can be selected independently, operating jointly over these sentences chosen from multiple paragraphs can lead to state-of-the-art question-answering results, outperforming independent answer selection by several points.

Finally, our ablation study demonstrates that the sentence selection module benefits substantially from using context from the corresponding paragraph. It also shows that running this module a second time, with the chosen answer as input, results in more accurate support identification.
2 Related Work

Most approaches for HotpotQA attempt to capture the interactions between the paragraphs by either relying on cross-attention between documents or sequentially selecting paragraphs based on the previously selected paragraphs.

While Nishida et al. (2019) also use a standard Reading Comprehension (RC) model, they combine it with a special Query Focused Extractor (QFE) module to select relevant sentences for QA and explanation. The QFE module sequentially identifies relevant sentences by updating an RNN state representation in each step, allowing the model to capture the dependency between sentences across time-steps. Xiao et al. (2019) propose a Dynamically Fused Graph Network (DFGN) model that first extracts entities from paragraphs to create an entity graph, then dynamically extracts sub-graphs and fuses them with the paragraph representation. The Select, Answer, Explain (SAE) model (Tu et al., 2019) is similar to our approach in that it also first selects relevant documents and uses them to produce answers and explanations. However, it relies on a self-attention over all document representations to capture potential interactions. Additionally, they rely on a Graph Neural Network (GNN) to answer the questions. The Hierarchical Graph Network (HGN) model (Fang et al., 2019) builds a hierarchical graph with three levels, entities, sentences, and paragraphs, to allow for joint reasoning. DecompRC (Min et al., 2019b) takes a completely different approach of learning to decompose the question (using additional annotations) and then answering the decomposed questions using a standard single-hop RC system.

Others such as Min et al. (2019a) have also noticed that many HotpotQA questions can be answered based on just a single paragraph. Our findings are both qualitatively and quantitatively different. They did not consider the support identification task, and showed strong (but not quite state-of-the-art) QA performance by running a QA model independently on each paragraph. We, on the other hand, show that interaction is not essential for selecting relevant sentences but is actually valuable for QA! Specifically, by using a context of relevant sentences spread across multiple paragraphs in steps 2 and 3, our simple BERT model outperforms previous models with complex entity- and graph-based interactions on top of BERT. We thus view Quark as a different, stronger baseline for multi-hop QA.

In the fullwiki setting, each question has no associated context and models are expected to select paragraphs from Wikipedia. To be able to scale to such a large corpus, the proposed systems often select the paragraphs independent of each other. A recent retrieval method in this setting is Semantic Retrieval (Nie et al., 2019), where first the paragraphs are selected based on the question, followed by individual sentences from these paragraphs. However, unlike our approach, they do not use the paragraph context to select the sentences, missing key context needed to identify relevance.
3 Quark
Our model works in three steps. First, we score individual sentences from an input set of paragraphs D based on their relevance to the question. Second, we feed the highest-scoring sentences to a span prediction model to produce an answer to the question. Third, we score sentences from D a second time to identify the supporting sentences using the answer. These three steps are implemented using the two modules described next in Sections 3.1 and 3.2.

3.1 Sentence scoring model

In the distractor setting, HotpotQA provides 10 context paragraphs that have an average length of 41.4 sentences and 1,106 tokens. This is too long for standard language-model based span prediction: most models scale quadratically with the number of tokens, and some are limited to 512 tokens. This motivates selecting a few relevant sentences E to reduce the size of the input to the span-prediction model without losing important context. In a similar vein, the support identification subtask of HotpotQA also involves selecting a few sentences that best explain the chosen answer. We solve both of these problems with the same transformer-based sentence scoring module, with slight variation in its input.

Our sentence scorer uses the BERT-Large-Cased model (Devlin et al., 2019) trained with whole-word masking, with an additional linear layer over the [CLS] token. Here, whole-word masking refers to a BERT variant that masks entire words instead of word pieces during pre-training.

We score every sentence s from every paragraph p ∈ D independently by feeding the following sequence to the model: [CLS] question [SEP] p [SEP] answer [SEP]. This sequence is the same for every sentence in the paragraph, but the sentence being classified is indicated using segment IDs: the segment ID is set to 1 for tokens from the sentence and to 0 for the rest. If a paragraph has more than 512 tokens, we restrict the input to the first 512. Each annotated support sentence forms a positive example, and all other sentences from D form the negative examples. Note that our classifier scores each sentence independently and never sees sentences from two paragraphs at the same time. (See Appendix A.1 for further detail.)

We train two variants of this model: (1) r_na(s) is trained to score sentences given a question but no answer (answer is replaced with a [MASK] token); and (2) r_a(s) is trained to score sentences given a question and its gold answer. We use r_na(s) for relevant sentence selection and r_a(s) for support identification (Sec. 3.3).
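For concreteness, the following Python sketch shows how such an independent sentence scorer can be built with the HuggingFace transformers library. It is illustrative only: the names SentenceScorer and encode are ours, and details such as the exact truncation follow the description above and Appendix A.1 rather than any released code.

import torch
from transformers import BertModel, BertTokenizer

MODEL = "bert-large-cased-whole-word-masking"
tokenizer = BertTokenizer.from_pretrained(MODEL)


class SentenceScorer(torch.nn.Module):
    """BERT with a single linear layer over the [CLS] token. Trained with a
    2-class cross-entropy (Appendix A.1); the positive-class logit (index 1)
    serves as the sentence's relevance score."""

    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained(MODEL)
        self.classifier = torch.nn.Linear(self.bert.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask, token_type_ids):
        out = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        return self.classifier(out.last_hidden_state[:, 0])  # (batch, 2) logits


def encode(question, paragraph_sents, sent_idx, answer="[MASK]", max_len=512):
    """Build [CLS] question [SEP] paragraph [SEP] answer [SEP], with segment
    IDs set to 1 for the sentence being scored and 0 everywhere else.
    answer="[MASK]" gives the r_na variant; a real answer string gives r_a."""
    cls_id, sep_id = tokenizer.cls_token_id, tokenizer.sep_token_id
    ids = [cls_id] + tokenizer.encode(question, add_special_tokens=False) + [sep_id]
    segs = [0] * len(ids)
    for i, sent in enumerate(paragraph_sents):
        sent_ids = tokenizer.encode(sent, add_special_tokens=False)
        ids += sent_ids
        segs += [1 if i == sent_idx else 0] * len(sent_ids)  # mark scored sentence
    ans_ids = tokenizer.encode(answer, add_special_tokens=False)
    ids += [sep_id] + ans_ids + [sep_id]
    segs += [0] * (len(ans_ids) + 2)
    ids, segs = ids[:max_len], segs[:max_len]  # restrict long paragraphs to 512
    t = lambda x: torch.tensor([x])
    return t(ids), t([1] * len(ids)), t(segs)


scorer = SentenceScorer().eval()
ids, mask, segs = encode(
    "Where was the 44th President of the USA born?",
    ["Obama was the 44th President of the US.", "Obama was born in Hawaii."],
    sent_idx=1)
with torch.no_grad():
    print(scorer(ids, mask, segs)[0, 1])  # relevance logit (untrained here)

Because each call sees only one paragraph, the scorer never attends across documents, which is exactly the independence property the pipeline relies on.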
3.2 Span prediction model

To find answers to questions, we use Wolf et al. (2019)'s implementation of Devlin et al. (2019)'s span prediction model. To achieve our best score, we use their BERT-Large-Cased model with whole-word masking and SQuAD (Rajpurkar et al., 2016) fine-tuning. We fine-tune this model on the HotpotQA dataset with input QA context E from r_na(s). Since BERT models have a hard limit of 512 word-pieces, we use r_na(s) to select the most relevant sentences that can fit within this limit, as described next. (See Appendix A.2 for training details.)

To accomplish this, we compute the score r_na(s) for each sentence in the input D. Then we add sentences in decreasing order of their scores to the QA context E, until we have filled no more than 508 word-pieces (including the question word-pieces). For every new paragraph considered, we also add its first sentence and the title of the article (enclosed in special tokens).
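The greedy context construction just described can be sketched as follows. The data layout (pre-tokenized word-pieces per title and sentence) and the decision to stop at the first sentence that no longer fits are our assumptions, not the authors' stated implementation.

def build_context(question_pieces, paragraphs, scores, budget=508):
    """paragraphs: {pid: {"title": [...], "sents": [[...], ...]}}, where every
    value is a list of word-pieces; scores: {(pid, i): r_na score for sentence
    i of paragraph pid}. Returns the set of (pid, i) pairs forming context E."""
    used = len(question_pieces)            # the question counts against the budget
    chosen, opened = set(), set()
    for pid, i in sorted(scores, key=scores.get, reverse=True):
        cost = len(paragraphs[pid]["sents"][i])
        if pid not in opened:              # first sentence taken from this paragraph:
            cost += len(paragraphs[pid]["title"])         # add the article title
            if i != 0:                                    # and the first sentence
                cost += len(paragraphs[pid]["sents"][0])
        if used + cost > budget:
            break    # assumption: stop once a sentence no longer fits
        used += cost
        chosen.add((pid, i))
        if pid not in opened:
            opened.add(pid)
            chosen.add((pid, 0))           # keep the paragraph's first sentence
    return chosen


# Toy example with pre-tokenized word-pieces:
paras = {"p1": {"title": ["Barack", "Obama"],
                "sents": [["Obama", "was", "the", "44th", "President", "."],
                          ["Obama", "was", "born", "in", "Hawaii", "."]]}}
print(build_context(["Where", "was", "he", "born", "?"], paras,
                    {("p1", 0): 0.3, ("p1", 1): 2.1}))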
QA Model                             Answer EM   Answer F1
Quark (Ours)
SAE (RoBERTa) (Tu et al., 2019)        67.70       80.75

Table 1: HotpotQA's distractor setting, Dev set. The bottom two models use larger language models than Quark.
QA Model                                       Answer          Support         Joint
                                               EM      F1      EM      F1      EM      F1
QFE (Nishida et al., 2019)                     28.66   38.06   14.20   44.35    8.69   23.10
SR-MRS (Nie et al., 2019)                      45.32   57.34   38.67   70.83   25.14   47.60
Quark + SR-MRS (Ours)
HGN (RoBERTa) + SR-MRS (Fang et al., 2019)

Table 2: HotpotQA's fullwiki setting, Test set. The bottom-most model uses a larger language model than Quark.
3.3 Distractor setting

Given a question along with 10 distractor paragraphs D, we use the r_na(s) variant of our sentence scoring module to score each sentence s in D, again without looking at other paragraphs. In the second step, the selected sentences are fed as context E into the QA module (as described in Section 3.2) to choose an answer. In the final step, to find sentences supporting the chosen answer, we use r_a(s) to score each sentence in D, this time with the chosen answer as part of the input. (We simply append the answer string to the question, even if it is "yes" or "no".)

We define the score n(S) of a set of sentences S ⊂ D to be the sum of the individual sentence scores; that is, n(S) = Σ_{s∈S} r_a(s). (Note that r_a(s) is a logit score and can be negative, so adding a sentence may not always improve n(S).) In HotpotQA, supporting sentences always come from exactly two paragraphs. We compute this score for all possible S satisfying this constraint and take the highest-scoring set of sentences as our support.
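Since n(S) decomposes over sentences, maximizing over all valid S reduces to: for each pair of paragraphs, keep every sentence with a positive logit (negative logits only lower the sum), then take the best pair. The sketch below assumes each of the two paragraphs must contribute at least one sentence, falling back to its single best sentence if none has a positive logit; that constraint is our reading of the annotation format, not something stated explicitly above.

from itertools import combinations

def best_support(sent_scores):
    """sent_scores: {pid: [(sent_idx, r_a logit), ...]}.
    Returns the highest-scoring support set as a list of (pid, sent_idx)."""
    def best_subset(pid):
        sents = sent_scores[pid]
        positive = [(i, sc) for i, sc in sents if sc > 0]
        if not positive:                     # paragraph must contribute something
            positive = [max(sents, key=lambda t: t[1])]
        return sum(sc for _, sc in positive), [(pid, i) for i, _ in positive]

    best = (float("-inf"), [])
    for p1, p2 in combinations(sent_scores, 2):   # exactly two paragraphs
        s1, set1 = best_subset(p1)
        s2, set2 = best_subset(p2)
        if s1 + s2 > best[0]:
            best = (s1 + s2, set1 + set2)
    return best[1]


print(best_support({"p1": [(0, 1.2), (1, -0.3)],
                    "p2": [(0, 0.8)],
                    "p3": [(0, -2.0)]}))   # -> [('p1', 0), ('p2', 0)]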
3.4 Fullwiki setting

Since there are too many paragraphs in the fullwiki setting, we use paragraphs from the SR-MRS system (Nie et al., 2019) as our context D for each question. On the Dev set, we found Quark to perform best with a negative paragraph score threshold in MRS. Neither the sentence scorers r_na(s), r_a(s) nor the QA module were retrained in this setting.

4 Experiments

We evaluate on both the distractor and fullwiki settings of HotpotQA with the following goal: Can a simple pipeline model outperform previous, more complex, approaches? We present the EM (Exact Match) and F1 scores on the evaluation metrics proposed for HotpotQA: (1) answer selection, (2) support selection, and (3) joint score.

Table 1 shows that on the distractor setting, Quark outperforms all previous models based on BERT, including HGN, which like us also uses whole-word masking for contextual embeddings. Moreover, we are within 1 point of models that use RoBERTa embeddings, a much stronger language model that has shown improvements of 1.5 to 6 points in previous HotpotQA models.

Quark also performs better than the recent single-paragraph approach for the QA subtask (Min et al., 2019a) by 14 points F1. While most of this gain comes from using a larger language model, Quark scores 2 points higher even with a language model of the same size (BERT-Base).

We observe a similar trend in the fullwiki setting (Table 2), where Quark again outperforms previous approaches (except HGN with RoBERTa). While we rely on retrieval from SR-MRS (Nie et al., 2019) for our initial paragraphs, we outperform the original work. We attribute this improvement to two factors: our sentence selection capitalizing on the sentence's paragraph context, leading to better support selection, and a better span selection model, leading to improved QA.

                                  top-n   Sup F1   Ans F1
BERT-Base w/o context             10      74.45    78.59
BERT-Base w/ context               6      83.15    80.92
+ BERT-Large (r_na(s))             5
  w/ answers (r_a(s))              –
Oracle                             3      –        –

Table 3: Ablation study on sentence selection in the distractor setting. top-n indicates the number of sentences required to cover the annotated support sentences in 90% of the questions.

To evaluate the impact of context on our sentence selection model in isolation, we look at the number of sentences that score at least as high as the lowest-scoring annotated support sentence. In other words, this is the number of sentences we must send to the QA model to ensure all annotated support is included. Table 3 shows that providing the model with the context from the paragraph gives a substantial boost on this metric, bringing it down from 10 to only 6 when using BERT-Base (an oracle would need 3 sentences). It further shows that this boost carries over to the downstream tasks of span selection and choosing support sentences (improving the latter by 9 points to 83%). Finally, the table shows the value of running the sentence selection model a second time: with BERT-Large, r_a(s) outperforms r_na(s) by 1.62% on the Support F1 metric.

Looking deeper, we analyzed the accuracy of our third stage, r_a(s), as a function of the correctness of the QA stage. When QA finds the correct gold answer, r_a(s) obtains the right support in 65.9% of the cases. If the answer from QA is incorrect, the success rate of r_a(s) is only 50.9%.

5 Conclusion

Our work shows that on the HotpotQA tasks, a simple pipeline model can do as well as or better than more complex solutions. Powerful pre-trained models allow us to score sentences one at a time, without looking at other paragraphs. By operating jointly over these sentences chosen from multiple paragraphs, we arrive at answers and supporting sentences on par with state-of-the-art approaches. This result shows that retrieval in HotpotQA is not itself a multi-hop problem, and suggests focusing on other multi-hop datasets to demonstrate the value of more complex techniques.
References
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.

Yuwei Fang, Siqi Sun, Zhe Gan, Rohit Pillai, Shuohang Wang, and Jingjing Liu. 2019. Hierarchical graph network for multi-hop question answering. ArXiv, abs/1911.03631.

Tushar Khot, Peter Clark, Michal Guerquin, Peter Jansen, and Ashish Sabharwal. 2020. QASC: A dataset for question answering via sentence composition. In AAAI.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692.

Sewon Min, Eric Wallace, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019a. Compositional questions do not necessitate multi-hop reasoning. In ACL.

Sewon Min, Victor Zhong, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2019b. Multi-hop reading comprehension through question decomposition and rescoring. In ACL.

Yixin Nie, Songhe Wang, and Mohit Bansal. 2019. Revealing the importance of semantic retrieval for machine reading at scale. In EMNLP-IJCNLP.

Kosuke Nishida, Kyosuke Nishida, Masaaki Nagata, Atsushi Otsuka, Itsumi Saito, Hisako Asano, and Junji Tomita. 2019. Answering while summarizing: Multi-task learning for multi-hop QA with evidence extraction. In ACL.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.

Alon Talmor and Jonathan Berant. 2018. The web as a knowledge-base for answering complex questions. In NAACL-HLT.

Ming Tu, Kevin Huang, Guangtao Wang, Jui-Ting Huang, Xiaodong He, and Bufang Zhou. 2019. Select, answer and explain: Interpretable multi-hop reading comprehension over multiple documents. ArXiv, abs/1911.00484.

Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. TACL, 6:287–302.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Yunxuan Xiao, Yanru Qu, Lin Qiu, Hao Zhou, Lei Li, Weinan Zhang, and Yong Yu. 2019. Dynamically fused graph network for multi-hop reasoning. In ACL.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In EMNLP.

A Appendix
A.1 Training the sentence scoring model
Both r_na(s) and r_a(s) are trained the same way. We use the 90,447 questions from the HotpotQA training set, shuffle them, and train for 4 epochs. Both models are trained in the distractor setting only, but evaluated in both settings. We construct positive and negative examples by choosing the two paragraphs containing the annotated support sentences, plus two more randomly chosen paragraphs. All sentences from the chosen paragraphs become instances for the model.

During training, we follow the fine-tuning advice from Devlin et al. (2019), with two exceptions. We ramp up the learning rate from 0 to its peak value over the first 10% of the batches, and then linearly decrease it back to 0.

To avoid biasing the training towards questions with many context sentences, we create batches at the question level. Three questions make up one batch, regardless of how many sentences they contain. We cap the batch size at 5625 tokens for practical purposes. If a batch exceeds this size, we drop sentences at random until the batch is small enough. As is standard for BERT classifiers, we use a cross-entropy loss with two classes, one for positive examples and one for negative examples.
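A sketch of this question-level batching follows, under the assumption that each sentence instance carries a precomputed "n_tokens" count; the data layout and the fixed RNG seed are ours.

import random

QUESTIONS_PER_BATCH = 3
MAX_BATCH_TOKENS = 5625

def make_batches(questions, rng=random.Random(0)):
    """questions: list of lists of sentence instances, one inner list per
    question; each instance is a dict with an "n_tokens" field. Yields
    batches of three questions' worth of instances, capped by token count."""
    rng.shuffle(questions)                 # note: shuffles the caller's list
    for start in range(0, len(questions), QUESTIONS_PER_BATCH):
        batch = [inst
                 for q in questions[start:start + QUESTIONS_PER_BATCH]
                 for inst in q]
        # Drop random sentence instances until the batch fits the token cap.
        while sum(inst["n_tokens"] for inst in batch) > MAX_BATCH_TOKENS:
            batch.pop(rng.randrange(len(batch)))
        yield batch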
A.2 Training the span prediction model

We train the BERT span prediction model on the output paragraphs from r_na(s). We use a batch size of 16 questions and a maximum sequence length of 512 word-pieces. We use the same optimizer settings as the sentence selection model, with an additional weight decay. The model is trained for a fixed number of epochs (set to 3) and the final model is used for evaluation. Under the hood, this model consists of two classifiers that run at the same time. One finds the first token of potential spans, and one finds the last token of potential spans. Each classifier uses a cross-entropy loss. The final loss is the average loss of the two classifiers. We train one model on the output from our best r_na(s) model.
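The combined loss can be sketched as follows; it mirrors the standard two-headed BERT QA setup (as in transformers' BertForQuestionAnswering), with the two cross-entropies averaged. The function name span_loss is ours.

import torch
import torch.nn.functional as F

def span_loss(hidden_states, qa_head, start_positions, end_positions):
    """hidden_states: (batch, seq_len, hidden) BERT outputs.
    qa_head: torch.nn.Linear(hidden, 2) producing start and end logits.
    start_positions, end_positions: (batch,) gold token indices."""
    logits = qa_head(hidden_states)                   # (batch, seq_len, 2)
    start_logits, end_logits = logits.unbind(dim=-1)  # two (batch, seq_len)
    loss_start = F.cross_entropy(start_logits, start_positions)  # span start
    loss_end = F.cross_entropy(end_logits, end_positions)        # span end
    return (loss_start + loss_end) / 2                # average of the two


# Toy check with random tensors:
B, L, H = 2, 16, 8
head = torch.nn.Linear(H, 2)
loss = span_loss(torch.randn(B, L, H), head,
                 torch.tensor([3, 5]), torch.tensor([7, 9]))
loss.backward()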