Memory Augmented Sequential Paragraph Retrieval for Multi-hop Question Answering
Nan Shao†, Yiming Cui‡†, Ting Liu‡, Shijin Wang†§, Guoping Hu†
† State Key Laboratory of Cognitive Intelligence, iFLYTEK Research, China
‡ Research Center for Social Computing and Information Retrieval (SCIR), Harbin Institute of Technology, Harbin, China
§ iFLYTEK AI Research (Hebei), Langfang, China
†§ {nanshao,ymcui,sjwang3,gphu}@iflytek.com
‡ {ymcui,tliu}@ir.hit.edu.cn

Abstract
Retrieving information from correlative paragraphs or documents to answer open-domain multi-hop questions is very challenging. To deal with this challenge, most existing works consider paragraphs as nodes in a graph and propose graph-based methods to retrieve them. In this paper, however, we point out an intrinsic defect of such methods. Instead, we propose a new architecture that models paragraphs as sequential data and treats multi-hop information retrieval as a kind of sequence labeling task. Specifically, we design a rewritable external memory to model the dependency among paragraphs. Moreover, a threshold gate mechanism is proposed to eliminate the distraction of noise paragraphs. We evaluate our method on both the full wiki and distractor subtasks of HotpotQA, a public textual multi-hop QA dataset requiring multi-hop information retrieval. Experiments show that our method achieves significant improvement over the published state-of-the-art method in retrieval and downstream QA task performance.
Open-domain Question Answering (QA) is a popular topic in natural language processing that requires models to answer questions given a large collection of text paragraphs (e.g., Wikipedia). Most previous works leverage a retrieval model to calculate the semantic similarity between a question and each paragraph to retrieve a set of paragraphs; a reading comprehension model then extracts an answer from one of the paragraphs. These pipeline methods work well in single-hop question answering, where the answer can be derived from only one paragraph. However, many real-world questions produced by users are multi-hop questions that require reasoning across multiple documents or paragraphs.

Paragraph A: Rand Paul presidential campaign, 2016
The 2016 presidential campaign of Rand Paul, the junior United States Senator from Kentucky, was announced on April 7, 2015 at an event at the Galt House in Louisville, Kentucky. ……

Paragraph B: Galt House
……The Galt House is the city's only hotel on the Ohio River. ……

Q: The Rand Paul presidential campaign, 2016 event was held at a hotel on what river?
A: Ohio River

Figure 1: An example of an open-domain multi-hop question from HotpotQA. The model needs to retrieve evidence paragraphs from the entire Wikipedia and derive the answer.

In this paper, we study the problem of textual multi-hop question answering at scale, which requires multi-hop information retrieval. An example from HotpotQA is illustrated in Figure 1. To answer the question "The Rand Paul presidential campaign, 2016 event was held at a hotel on what river?", the retrieval model needs to first identify Paragraph A as an evidence paragraph according to the similarity of terms. From Paragraph A, the model learns that the event was held at a hotel named "Galt House", which leads to the next hop, Paragraph B. Existing neural and non-neural methods cannot perform well on such questions because there is little lexical overlap or semantic relation between the question and Paragraph B.

To tackle this challenge, mainstream studies model all paragraphs as a graph, where paragraphs are connected if they share entity mentions or a hyperlink relation (Ding et al., 2019; Zhao et al., 2020; Asai et al., 2020). For example, the Graph-based Recurrent Retriever (Asai et al., 2020) first leverages non-parameterized methods (e.g., TF-IDF or BM25) to retrieve a set of initial paragraphs as starting points; a neural retrieval model then determines whether the paragraphs linked to an initial paragraph are the next step of the reasoning path. However, such methods suffer from several intrinsic drawbacks: (1) Graph-based methods rest on the hypothesis that evidence paragraphs share entity mentions or are linked by a hyperlink, so they perform well only on bridge questions; for comparison questions (e.g., do A and B have the same attribute?), the evidence paragraphs about A and B are independent. (2) Once a paragraph is identified as an evidence paragraph, a graph-based retriever like the Graph-based Recurrent Retriever (Asai et al., 2020) will continue to determine whether the paragraphs linked from that evidence paragraph are relevant to the question, which implies that these models have to assume the relation between paragraphs is directional. (3) Several graph-based methods have to process all adjacent paragraphs simultaneously, leading to inefficient memory usage.

In this paper, we discard the mainstream practice. Instead, we propose to treat all candidate paragraphs as sequential data and identify them iteratively. The newly proposed framework for solving multi-hop information retrieval has several desirable advantages: (1) The framework does not assume entity mentions or hyperlinks between two evidence paragraphs, so it is suitable for both bridge and comparison questions. (2) It is natural for the framework to take all paragraphs connected to or from a certain paragraph into consideration.
(3) As only one paragraph is located in GPU memory at a time, the framework is memory efficient.

We implement the framework and propose the Gated Memory Flow model. Inspired by the Neural Turing Machine (Graves et al., 2014), we leverage an external rewritable memory to memorize information extracted from previous paragraphs. The model iteratively reads a paragraph and the memory to identify whether the current paragraph is an evidence paragraph. Because there are many noise paragraphs in the candidate set, we design a threshold gating mechanism to control the writing procedure. In practice, we find that different negative examples in the training set affect retrieval performance significantly, and we propose a simple method to find more valuable negatives for the retrieval model. Our method significantly outperforms previous methods on HotpotQA under both the full wiki and the distractor settings. In analysis experiments, we demonstrate the effectiveness of our method and how different settings in the model affect downstream performance.

Our contributions are summarized as follows:
• We propose a new framework that treats paragraphs as sequential data and iteratively retrieves them for solving multi-hop question answering at scale.
• We implement the framework and propose the Gated Memory Flow model. Besides, we introduce a simple method to find more valuable negatives for open-domain multi-hop QA.
• Our method significantly outperforms all published methods on HotpotQA by a large margin. Extensive experiments demonstrate the effectiveness of our method.

Textual Multi-hop Question Answering.
In contrast to single-hop question answering, multi-hop question answering requires models to reason across multiple documents or paragraphs to derive the answer. In recent years, this topic has attracted many researchers' attention, and many datasets have been released (Welbl et al., 2018; Talmor and Berant, 2018; Yang et al., 2018; Khot et al., 2020; Xie et al., 2020). In this paper, we focus our study on the textual multi-hop QA dataset HotpotQA.

Methods for solving multi-hop QA can be roughly divided into two categories: graph-based methods and query-reformulation-based methods. Graph-based methods often construct an entity graph based on co-reference resolution or co-occurrence (Dhingra et al., 2018; Song et al., 2018; De Cao et al., 2019). These works show that GNN frameworks such as the graph convolutional network (Kipf and Welling, 2017) and the graph attention network (Veličković et al., 2018), applied over an entity graph, can achieve promising results on WikiHop. Tu (2019) propose to model paragraphs and candidate answers as different kinds of nodes in a graph, extending an entity graph to a heterogeneous graph. To adapt such methods to span-based QA tasks like HotpotQA, DFGN (Qiu et al., 2019) leverages a dynamic fusion layer to combine graph representations and sequential text representations.

Query-reformulation-based methods solve multi-hop QA tasks by implicitly or explicitly reformulating the query at different reasoning hops. For example, QFE (Nishida et al., 2019) updates query representations at each hop, and DecompRC (Min et al., 2019) decomposes a compositional question into several single-hop questions.
Figure 2: Overview of our architecture.
Multi-hop Information Retrieval.
Chen et al. (2017) first propose to leverage the entire Wikipedia paragraphs to answer open-domain questions. Most existing open-domain QA methods use a pipeline approach that includes a retriever model and a reader model. For open-domain multi-hop QA, graph-based methods are also the mainstream practice. These studies organize paragraphs as nodes in a graph structure and connect them according to hyperlinks. Cognitive Graph (Ding et al., 2019) employs a reading comprehension model to predict answer spans and next-hop answers to construct a cognitive graph. Transformer-XH (Zhao et al., 2020) introduces eXtra Hop attention to reason over a paragraph graph and produces a global representation of all paragraphs to predict the answers. Because it encodes paragraphs independently, this method lacks fine-grained interaction between evidence paragraphs and is memory inefficient. Asai et al. (2020) propose a graph-based recurrent model to retrieve a reasoning chain through a path in the paragraph graph. As described in Section 1, this method may not perform well on comparison questions and suffers from low recall caused by directional edges.

There are also several studies that propose query-reformulation-based methods for multi-hop information retrieval. Multi-step Reasoner (Das et al., 2019) employs a retriever that interacts with a reader model and reformulates the query in a latent space for the next retrieval hop. GoldEn Retriever (Qi et al., 2019) generates several single-hop questions at different hops. These methods often perform poorly as a result of error accumulation across hops.
In this section, we introduce the pipeline we designed for multi-hop QA; an overview of our system is shown in Figure 2. We first leverage a heuristic term-based retrieval method to find as many candidate paragraphs for the question Q as possible. Our Gated Memory Flow model takes all candidate paragraphs as input and processes them one by one. Finally, the paragraphs that gain the highest relevance scores for a question are selected and fed into a neural question answering model. The reader model uses a cascade prediction layer to predict all targets in a multi-task way.

We leverage a heuristic term-based retrieval method to construct the candidate paragraph set that contains all possible paragraphs. Paragraphs in the candidate set come from three sources.
Key Word Matching.
We retrieve all paragraphs whose titles exactly match terms in the query. The top N_kwm paragraphs with the highest TF-IDF scores are collected into the set P_kwm.

TF-IDF.
The top N_tfidf paragraphs retrieved by DrQA's TF-IDF on the question (Chen et al., 2017).

Hyperlink.
Paragraphs that are necessary for deriving the answer but have little lexical overlap with the question often have a hyperlink relation to the paragraphs retrieved by keyword matching or TF-IDF. Therefore, for each paragraph P_i in the sets P_kwm and P_tfidf, we extract all the hyperlinked paragraphs of P_i to construct the set P_hyperlink. Unlike previous works (Nie et al., 2019; Asai et al., 2020), both paragraphs that have a hyperlink pointing to P_i and paragraphs linked from P_i are taken into consideration. Finally, the three collections are merged into a candidate set P_cand.

After retrieving the candidate set, we propose a neural model named Gated Memory Flow that accurately selects evidence paragraphs from the candidate set.
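The three-source merge above can be sketched as follows. This is a minimal illustration only: the toy TF-IDF scorer stands in for DrQA's retriever, and the function names, tiny corpus, and data layout (title list, paragraph list, adjacency dict of hyperlinks) are our own assumptions, not the paper's implementation.

```python
import math
from collections import Counter

def tfidf_scores(query, docs):
    """Toy TF-IDF scorer (a stand-in for DrQA's TF-IDF retriever)."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        scores.append(sum(tf[t] * math.log((n + 1) / (df[t] + 1))
                          for t in query.lower().split()))
    return scores

def build_candidate_set(query, titles, docs, hyperlinks, n_kwm=2, n_tfidf=2):
    """Merge the three sources: keyword matching, TF-IDF, and hyperlinks
    followed in BOTH directions from the paragraphs found so far."""
    scores = tfidf_scores(query, docs)
    q = query.lower()
    # Key word matching: paragraphs whose title appears verbatim in the
    # query, kept in TF-IDF order (P_kwm).
    kwm = sorted((i for i in range(len(docs)) if titles[i].lower() in q),
                 key=lambda i: -scores[i])[:n_kwm]
    # TF-IDF: top-N paragraphs by query similarity (P_tfidf).
    tfidf = sorted(range(len(docs)), key=lambda i: -scores[i])[:n_tfidf]
    # Hyperlinks to AND from the seed paragraphs (P_hyperlink).
    seeds = set(kwm) | set(tfidf)
    hyper = {j for i in seeds for j in hyperlinks.get(i, [])}
    hyper |= {i for i, outs in hyperlinks.items() if seeds & set(outs)}
    return sorted(seeds | hyper)  # P_cand
```

Following both hyperlink directions is what distinguishes this step from prior works that only follow outgoing links.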
Model.
As described above, for a compositional question, whether a paragraph contains evidence information depends not only on the query but also on other paragraphs. The task can be formulated as:

\arg\max_\theta P(P_t \in E \mid Q, P_{1:t-1}, \theta)   (1)

where E denotes the evidence set, \theta denotes the parameters of the model, P_t is the t-th paragraph in the candidate set P_{cand} = \{P_1, \ldots, P_t, \ldots, P_n\}, and P_{1:t-1} represents the t-1 paragraphs processed so far.

At the t-th time step, the model takes paragraph P_t as input and determines whether it is an evidence paragraph depending on the paragraphs already identified. The question Q and paragraph P_t are concatenated and fed into a pre-trained model to obtain the query-aware paragraph representations H_t \in \mathbb{R}^{l \times d}. To reduce the number of parameters, we compress this matrix into a vector representation, which we then use as a query vector to address the desired information in the external memory module:

x_t = \mathrm{MeanPooling}(H_t) \in \mathbb{R}^d   (2)

We denote the external memory at step t as M_t \in \mathbb{R}^{l_m \times d}, where l_m is the number of memory slots that have stored information at step t. We denote the readout vector representation at step t as o_t, which contains the missing information needed to identify the current paragraph. The vectors o_t and x_t are used to compute the relevance:

h_t = \tanh(W_o \cdot o_t + W_x \cdot x_t) \in \mathbb{R}^d   (3)
s_t = \sigma(W_s \cdot h_t)   (4)

where s_t represents the relevance score between paragraph P_t and question Q.

Now we describe the memory module and the read/write procedure in detail. We follow the KV-MemNN (Miller et al., 2016) architecture, which defines the memory slots as pairs of vectors \{(k_1, v_1), \ldots, (k_{l_m}, v_{l_m})\}. In our implementation, the key vectors and value vectors are the same. Only a subset of the stored vectors is necessary for identifying P_t, so we use the key vectors and the query vector to address the memory.
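To make the memory mechanics concrete, here is a minimal NumPy sketch of the module: multi-head attention readout over the stored slots plus a threshold-gated concatenation write. Random matrices stand in for the learned projections W_q, W_k, W_v, and the class and variable names are our own; this is an illustration under those assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_heads = 8, 2
GATE = 0.5  # pre-defined scalar threshold (a placeholder value)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

class GatedMemory:
    """External rewritable memory: multi-head attention readout plus a
    threshold-gated concatenation write. Keys and values share the same
    stored vectors, as in the paper's KV-MemNN variant."""

    def __init__(self, d, n_heads):
        self.mem = np.zeros((0, d))  # l_m x d, starts empty
        # Random stand-ins for the learned per-head projections.
        self.Wq = rng.normal(size=(n_heads, d, d)) / np.sqrt(d)
        self.Wk = rng.normal(size=(n_heads, d, d)) / np.sqrt(d)
        self.Wv = rng.normal(size=(n_heads, d // n_heads, d)) / np.sqrt(d)

    def read(self, x):
        """Address the memory with query vector x (soft attention)."""
        if len(self.mem) == 0:
            return np.zeros(self.Wv.shape[0] * self.Wv.shape[1])
        heads = []
        for Wq, Wk, Wv in zip(self.Wq, self.Wk, self.Wv):
            p = softmax((Wq @ x) @ (self.mem @ Wk.T).T)  # readout probabilities
            heads.append(p @ (self.mem @ Wv.T))          # weighted value sum
        return np.concatenate(heads)                     # multi-head concat

    def maybe_write(self, x, s):
        """Append x only if the relevance score s clears the gate."""
        if s >= GATE:
            self.mem = np.vstack([self.mem, x])
```

A low-scoring (noise) paragraph is simply never written, so it cannot distract later retrieval steps.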
The addressing and reading procedure is similar to soft attention:

p_i = \mathrm{Softmax}\big((W_q x_t) \cdot (W_k k_i)\big)   (5)
o_t = \sum_i p_i W_v v_i   (6)

where p_i denotes the readout probability of the i-th memory slot. In our implementation, we extend the attention readout mechanism to a multi-head form (Vaswani et al., 2017):

o_t = \mathrm{Concat}(o_t^{(1)}, \ldots, o_t^{(h)})   (7)

where o_t^{(h)} denotes the output of the h-th readout head. Because there are many irrelevant paragraphs in the candidate set, we leverage a threshold gating mechanism to control write permission. The write operation is concatenation, which keeps the model concise:

M_{t+1} = \begin{cases} M_t & s_t < gate \\ \mathrm{Concat}(M_t, x_t) & s_t \ge gate \end{cases}   (8)

The value of the scalar gate is a pre-defined hyper-parameter.

Training.
The parameters of GMF are updated with a binary cross-entropy loss:

L_{retri} = -\sum_{P_t \in P_{pos}} \log(s_t) - \sum_{P_t \in P_{neg}} \log(1 - s_t)   (9)

where P_neg contains eight negative paragraphs sampled by our negative sampling strategy for each question. At the beginning of training, the model cannot give each paragraph an accurate relevance score, leading to unexpected behavior of the gate. For this reason, we only activate the gate after the first epoch of training.

Training with Harder Negative Examples.
For each question in the training stage, we pass two gold paragraphs and eight negative examples from the term-based retrieval results to Gated Memory Flow. Intuitively, different negative sampling strategies may influence the training result, but previous works do not explore this effect. We empirically demonstrate that harder negative examples lead to better training results and propose a simple but effective strategy to find them (see Sec. 4.4). Specifically, we first train a BERT-based binary classifier to calculate the relevance score between questions and paragraphs; for each question, eight paragraphs randomly selected from the term-based retrieval results are combined with the two evidence paragraphs. We then employ the model as a ranker to select the top-8 paragraphs as augmented negative examples and train our GMF model with them.
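The ranking step can be sketched as below. The toy word-overlap ranker stands in for the BERT-based binary classifier, and both function names are our own; the only structural point illustrated is "score non-gold candidates, keep the top-k as hard negatives".

```python
def harder_negatives(question, gold, candidates, ranker, k=8):
    """Rank non-gold candidates with a relevance model and keep the
    top-k as hard negatives (ranker stands in for the BERT classifier)."""
    pool = [p for p in candidates if p not in gold]
    pool.sort(key=lambda p: ranker(question, p), reverse=True)
    return pool[:k]

def overlap_ranker(question, paragraph):
    """Toy relevance score: word overlap with the question."""
    return len(set(question.lower().split()) & set(paragraph.lower().split()))
```

With a stronger ranker, the selected negatives are the paragraphs the retriever is most likely to confuse with evidence, which is exactly what makes them valuable.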
Inference.
At the inference stage, there may be hundreds of candidate paragraphs for a question. Nevertheless, it is unnecessary to keep all of them in GPU memory besides the parameters of the model and the current paragraph P_t. To balance inference speed and memory usage, we load a chunk of paragraphs into GPU memory at once. After the model gives all paragraphs a relevance score, we sort them by score and retrieve, for each question, the top N_retri paragraphs whose scores exceed a threshold h_d. We take these paragraphs as the input for the reader model.

There are many existing reader models for multi-hop QA, most of which use a graph-based approach. However, a recent study (Shao et al., 2020) shows that self-attention in pre-trained models can perform well in multi-hop QA. Therefore, we employ a pre-trained model as a reader for multi-hop reasoning.

We combine all the paragraphs selected by Gated Memory Flow into a context C. For each example, we concatenate the question Q and context C, and feed them into a pre-trained model followed by a projection layer.

We follow a structure similar to the prediction layer of Yang et al. (2018). There are four sub-tasks for the prediction layer: (1) evidence paragraph prediction as an auxiliary task; (2) supporting facts prediction based on evidence paragraphs; (3) the start and end positions of the answer span; (4) answer type prediction. We use a cascade structure due to the dependency between different outputs. Five isomorphic BiLSTMs are stacked layer by layer to obtain different representations for each sub-task. To predict evidence paragraphs, we take the first and last tokens of the i-th paragraph as its representations:

M_p = \mathrm{BiLSTM}(C) \in \mathbb{R}^{m \times d}   (10)
O_{para} = \sigma(W_p [M_p[P_s^{(i)}]; M_p[P_e^{(i)}]])   (11)

where P_s^{(i)} and P_e^{(i)} denote the start and end token indices of the i-th paragraph. We can then predict whether the i-th sentence is a supporting fact depending on the outputs of evidence paragraph prediction.
M_s = \mathrm{BiLSTM}([C, O_{para}]) \in \mathbb{R}^{m \times d}   (12)
O_{sent} = \sigma(W_s [M_s[S_s^{(i)}]; M_s[S_e^{(i)}]])   (13)

where S_s^{(i)} and S_e^{(i)} denote the start and end token indices of the i-th sentence. The answer span and type can be predicted in a similar way:

O_{start} = \mathrm{SubLayer}([C, O_{sup}])   (14)
O_{end} = \mathrm{SubLayer}([C, O_{sup}, O_{start}])   (15)
O_{type} = \mathrm{SubLayer}([C, O_{sup}, O_{end}])   (16)

where SubLayer(·) denotes the same BiLSTM layer and linear projection as in eqs. 12-13. We compute a cross-entropy loss over each output and optimize them jointly, assigning each loss term a coefficient:

L_{reader} = \lambda_a L_{start} + \lambda_a L_{end} + \lambda_p L_{para} + \lambda_s L_{sup} + \lambda_t L_{type}   (17)

In this section, we describe the setup and results of our experiments on the HotpotQA dataset. We compare our method with the published state-of-the-art methods (Asai et al., 2020) in retrieval and downstream QA performance to demonstrate the superiority of our proposed architecture.
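The retrieval-stage inference described earlier (scoring paragraphs chunk by chunk, then keeping the top-N_retri whose scores clear the threshold h_d) can be sketched as follows; the function name and score callback are our own placeholders.

```python
def select_paragraphs(paragraphs, score_fn, chunk_size=16, n_retri=4, h_d=0.3):
    """Score paragraphs in chunks (only one chunk would sit in GPU memory
    at a time), then keep the top-n_retri whose scores are at least h_d."""
    scored = []
    for i in range(0, len(paragraphs), chunk_size):
        for p in paragraphs[i:i + chunk_size]:
            scored.append((score_fn(p), p))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [p for s, p in scored[:n_retri] if s >= h_d]
```

The threshold keeps low-confidence paragraphs out of the reader input even when fewer than n_retri paragraphs qualify.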
Our experiments are conducted on HotpotQA (Yang et al., 2018), a widely used textual multi-hop question answering dataset. For each question, models are required to extract a span of text as the answer and the corresponding sentences as supporting facts. There are two different settings in HotpotQA: the distractor setting and the full wiki setting. In the distractor setting, two evidence paragraphs and eight distractor paragraphs collected by TF-IDF are provided for each question. In the full wiki setting, each answer and its supporting facts must be extracted from the entire Wikipedia. HotpotQA contains two question types: bridge questions and comparison questions. The former requires models to reason from one piece of evidence to another; to answer a comparison question, models need to compare two entities described in two paragraphs.

                                                 Full wiki                    Distractor
Model                                    Ans EM  Ans F1  Sup EM  Sup F1   Ans EM  Ans F1  Sup EM  Sup F1
Baseline (Yang et al., 2018)              24.7    34.4     5.3    41.0     44.4    58.3    22.0    66.7
QFE (Nishida et al., 2019)                 -       -       -       -       53.7    68.7    58.8    84.7
DFGN (Qiu et al., 2019)                    -       -       -       -       55.4    69.2     -       -
Cognitive Graph QA (Ding et al., 2019)    37.6    49.4    23.1    58.5      -       -       -       -
GoldEn Retriever (Qi et al., 2019)         -      49.8     -      64.6      -       -       -       -
SemanticRetrievalMRS (Nie et al., 2019)   46.5    58.8    39.9    71.5      -       -       -       -
Transformer-XH (Zhao et al., 2020)        50.2    62.4    42.2    71.6      -       -       -       -
GRR w. BERT (Asai et al., 2020)           60.5    73.3    49.3    76.1     68.0    81.2    58.6    85.2
DDRQA w. ALBERT-xxlarge ♣ (Zhang et al., 2020)  62.5  75.9  51.0  78.8      -       -       -       -
Gated Memory Flow (ours)

Table 1: Results on the HotpotQA development set in the full wiki and distractor settings. "-" denotes no results are available. "♣" indicates the work was recently presented on arXiv.

Metrics.
We evaluate our pipeline system not only on downstream QA performance but also on the intermediate retrieval results. We report EM and F1 scores for QA performance, and Supporting Fact EM (Sup EM) and Supporting Fact F1 (Sup F1) for supporting-fact sentence retrieval. In addition, joint EM and F1 scores measure the joint performance of the whole system. To measure intermediate retrieval performance, we follow Asai et al. (2020) and use three metrics: Answer Recall (AR), which evaluates whether the answer is located in the selected paragraphs; Paragraph Recall (PR), which evaluates whether at least one of the evidence paragraphs is in the retrieved set; and Paragraph Exact Match (P EM), which evaluates whether all evidence paragraphs are retrieved. Since the number of paragraphs selected for the reader can vary, we also report the precision score to evaluate retrieval performance more appropriately.
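Under these definitions, the three retrieval metrics can be computed as below; the data layout (one `(answer, evidence set, retrieved list)` triple per question) is our own assumption for illustration.

```python
def retrieval_metrics(examples):
    """Answer Recall, Paragraph Recall, and Paragraph EM averaged over
    (answer, evidence_paragraphs, retrieved_paragraphs) triples."""
    ar = pr = pem = 0
    for answer, evidence, retrieved in examples:
        ar += any(answer in p for p in retrieved)    # answer text in a retrieved paragraph
        pr += any(p in evidence for p in retrieved)  # at least one evidence paragraph found
        pem += evidence.issubset(set(retrieved))     # ALL evidence paragraphs found
    n = len(examples)
    return ar / n, pr / n, pem / n
```

Note that P EM is much stricter than PR: a system can have near-perfect PR while missing one of the two hops on most questions.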
Implementation Details.
Considering accuracy and computational cost, we use different pre-trained models for different components of the system. We report the whole pipeline system results in Section 4.2, where the GMF is based on a RoBERTa-large model (Liu et al., 2019). In the analysis experiments, the GMF is based on a RoBERTa-base model to save computation cost. We use
Table 2: List of hyper-parameters used in our model (retrieval model: N_kwm, N_tfidf, N_retri, h_d; reader model: λ_a, λ_p, λ_s, λ_t, training epochs).
Methods              AR     PR     P EM
Entity-centric IR    63.4   87.3   34.9
Cognitive Graph      76.0   87.6   57.8
Semantic Retrieval   77.9   93.2   63.9
GRR w. BERT          87.0   93.3   72.7
GMF w. BERT          90.8   94.1   85.7
GMF w. RoBERTa

Table 3: Comparison of our retrieval method with other published methods on the Answer Recall, Paragraph Recall, and Paragraph EM metrics.
ALBERT-xxlarge (Lan et al., 2019) for the reader model. All hyper-parameters used in our system are listed in Table 2.
We first evaluate our pipeline system on the downstream task on the HotpotQA development set. (We will also submit our model to the blind test set soon.) Table 1 shows the results in the full wiki setting. The proposed GMF outperforms all published works on every metric by a large margin: it achieves 3.1/3.2 points of absolute improvement on Answer EM/F1 and 5.2/4.1 points on Sup EM/F1 over the published state-of-the-art results, which implies the superiority of the whole pipeline design. We also report our results in the distractor setting, where GMF likewise achieves strong results.

                       Retrieval                      QA
Settings               AR     PR     P EM   Prec.    Joint EM   Joint F1
Gated Memory Flow      91.7   94.7   86.3   49.1     39.6       65.8
- Bi-direct. Doc.      87.2   95.6   78.8   51.3     37.1       62.3
- Threshold Gate       91.2   94.3   85.6   33.8     39.0       65.3
- Rewritable Memory    91.4   94.5   85.8   42.5     39.1       65.1
- Harder Negatives     91.5   94.5   86.0   39.2     38.4       64.8

Table 4: Ablation study on the effectiveness of the GMF model on the dev set in the full wiki setting.
Retrieval.
The retrieval results in the full wiki setting are presented in Table 3. To compare fairly with the published state-of-the-art results, we also report the results of our model based on BERT. Our GMF achieves gains of 3.4 points in AR and 13.0 points in P EM over the previous state-of-the-art model. The significant improvement comes from the proposed architecture for solving multi-hop retrieval: the new framework abandons many unnecessary presuppositions made by graph-based methods and does not retrieve a reasoning path explicitly. Our GMF benefits from the generalization and flexibility of the proposed framework, leading to a high recall of evidence paragraphs. Note that we achieve this high recall with very few retrieved paragraphs: the average number of retrieved paragraphs per question is less than 4.
To evaluate the effectiveness of our proposed method, we perform an ablation study on our GMF model. Table 4 reports the ablation results on the development set. From the table, we can see that retrieving only the paragraphs linked from paragraphs in P_kwm and P_tfidf improves precision but significantly hurts recall and downstream performance. We find that the other components do not influence recall but lead to higher precision. In particular, the external rewritable memory provides a 6.6-point precision improvement, which implies the effectiveness of modeling the history of retrieval. The gating mechanism helps the model avoid memorizing and retrieving irrelevant information; precision decreases by about 16 points after removing the gate. In addition, harder negatives also improve retrieval performance. The QA performance also consistently decreases after ablating each component.

Model      Source      AR     PR     P EM   Prec.
baseline   K.W.M.      92.2   94.6   86.4   37.3
baseline   TF-IDF      91.8   94.5   86.3   39.2
baseline   Hyperlink   91.3   94.6   86.4   32.2
baseline   BERT
GMF        K.W.M.      91.6   94.4   86.1   36.8
GMF        TF-IDF      91.5   94.2   86.0   39.2
GMF        Hyperlink   91.2   94.5   85.9   34.6
GMF        BERT

Table 5: Effectiveness of training data with different data distributions.
We find that training data sampled from different sources significantly affects the final retrieval performance. In Table 5, we report our GMF retrieval performance compared with a RoBERTa-based re-rank baseline model, where the training data is sampled from P_kwm, P_tfidf, P_hyperlink, or paragraphs selected by a pre-trained model. The experiments show that the models achieve comparable recall in the different settings. However, compared with data sampled from P_kwm and P_hyperlink, training data sampled from P_tfidf gives GMF about 2.4 and 4.6 points of precision improvement, respectively. The results imply that, in the training phase, examples sampled from P_tfidf or P_kwm are more confusing than those from P_hyperlink because they have more lexical overlap with the question. Therefore, we further leverage a neural model to rank the candidate paragraphs and sample the most challenging examples, leading to a 9.9-point absolute precision improvement. In addition, our GMF achieves better retrieval performance in each setting, which implies the effectiveness of our proposed model.

Analysis on Candidate Set.

We also evaluate how different sizes of the candidate set influence the downstream QA performance. Previous work (Asai et al., 2020) uses recall as a metric to measure the quality of retrieval. Intuitively, more retrieved paragraphs cause a higher recall. However, does higher recall lead to better downstream task performance? We increase the number of paragraphs retrieved by the TF-IDF method in the term-based retrieval phase. As expected, Table 6 shows that more retrieved paragraphs lead to a higher recall, and the number of candidate paragraphs |P_cand| increases significantly with N_tfidf. Nevertheless, the Joint EM and Joint F1 scores do not increase with recall: with too many noise paragraphs in the candidate set, some will be retrieved and mislead the reader model. This analysis suggests that recall alone is not enough to evaluate retrieval quality. We hope future works introduce additional metrics (e.g., precision) to evaluate their retrieval models more comprehensively.

Top N_tfidf   Recall   Num.   Joint EM   Joint F1
Top 5
Top 10        95.0     130    39.8       65.8
Top 15        95.7     167    39.5       65.6
Top 20        96.2     203    39.1       65.1

Table 6: Comparing the performance with different recall of the candidate set.

Figure 3: Examples of evidence paragraph prediction in the full wiki setting of HotpotQA.

We provide two examples to illustrate how our model works. In Figure 3, the first question is a bridge question. To answer it, our model first retrieves the paragraph "Walchelin de Ferrieres" with high confidence. The model then takes the paragraph "Marie de Coucy" and gives it a relevance score of 0.15; the score is lower than the threshold value of the gate, so the paragraph is not written to the memory. Two paragraphs are stored in the memory when the model calculates the relevance between the paragraph "Richard I of England" and the question. The model takes the representation of the paragraph as input to address the necessary information in the memory module, and the memorized paragraph "Walchelin de Ferrieres" is read out.

The second case is a comparison question, which asks to compare the locations of "Sheridan County, Montana" and "Chandra Taal". The two evidence paragraphs gain very high relevance scores, while the others score approximately zero. In fact, because the keywords "Sheridan County, Montana" and "Chandra Taal" appear in the question, such comparison questions are comparatively easy not only for GMF but also for a RoBERTa-based re-rank baseline. However, such questions may be tricky for graph-based retrievers due to the lack of hyperlink connections or shared entity mentions between the evidence paragraphs. In contrast, our architecture does not require such presuppositions, leading to better generalization and flexibility.
In this paper, to tackle the multi-hop information retrieval challenge, we introduce an architecture that models a set of paragraphs as sequential data and iteratively identifies them. Specifically, we propose Gated Memory Flow to iteratively read and memorize the information required for reasoning without interference from noise. We evaluate our method under both the full wiki and distractor settings of the HotpotQA dataset, and it outperforms previous works by a large margin. In the future, we will attempt to design a more sophisticated model to improve retrieval performance and further explore the effect of training data with different distributions on multi-hop information retrieval.

References
Akari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi, Richard Socher, and Caiming Xiong. 2020. Learning to retrieve reasoning paths over Wikipedia graph for question answering. In International Conference on Learning Representations.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870-1879, Vancouver, Canada. Association for Computational Linguistics.

Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, and Andrew McCallum. 2019. Multi-step retriever-reader interaction for scalable open-domain question answering. In International Conference on Learning Representations.

Nicola De Cao, Wilker Aziz, and Ivan Titov. 2019. Question answering by reasoning across documents with graph convolutional networks. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2306-2317, Minneapolis, Minnesota. Association for Computational Linguistics.

Bhuwan Dhingra, Qiao Jin, Zhilin Yang, William Cohen, and Ruslan Salakhutdinov. 2018. Neural models for reasoning over multiple mentions using coreference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 42-48, New Orleans, Louisiana. Association for Computational Linguistics.

Ming Ding, Chang Zhou, Qibin Chen, Hongxia Yang, and Jie Tang. 2019. Cognitive graph for multi-hop reading comprehension at scale. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2694-2703, Florence, Italy. Association for Computational Linguistics.

Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing machines. arXiv preprint arXiv:1410.5401.

Tushar Khot, Peter Clark, Michal Guerquin, P. Jansen, and A. Sabharwal. 2020. QASC: A dataset for question answering via sentence composition. In AAAI.

Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR).

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-value memory networks for directly reading documents. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1400-1409, Austin, Texas. Association for Computational Linguistics.

Sewon Min, Victor Zhong, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2019. Multi-hop reading comprehension through question decomposition and rescoring. In
Proceedings of the 57th Annual Meetingof the Association for Computational Linguistics ,pages 6097–6109, Florence, Italy. Association forComputational Linguistics.Yixin Nie, Songhe Wang, and Mohit Bansal. 2019.Revealing the importance of semantic retrieval formachine reading at scale. In
Proceedings of the2019 Conference on Empirical Methods in Natu-ral Language Processing and the 9th InternationalJoint Conference on Natural Language Processing(EMNLP-IJCNLP) , pages 2553–2566, Hong Kong,China. Association for Computational Linguistics.Kosuke Nishida, Kyosuke Nishida, Masaaki Nagata,Atsushi Otsuka, Itsumi Saito, Hisako Asano, andJunji Tomita. 2019. Answering while summarizing:Multi-task learning for multi-hop QA with evidenceextraction. In
Proceedings of the 57th Annual Meet-ing of the Association for Computational Linguistics ,pages 2335–2345, Florence, Italy. Association forComputational Linguistics.Peng Qi, Xiaowen Lin, Leo Mehr, Zijian Wang, andChristopher D. Manning. 2019. Answering complexopen-domain questions through iterative query gen-eration. In
Proceedings of the 2019 Conference onEmpirical Methods in Natural Language Processingand the 9th International Joint Conference on Natu-ral Language Processing (EMNLP-IJCNLP) , pages2590–2602, Hong Kong, China. Association forComputational Linguistics.Lin Qiu, Yunxuan Xiao, Yanru Qu, Hao Zhou, LeiLi, Weinan Zhang, and Yong Yu. 2019. Dynami-cally fused graph network for multi-hop reasoning.In
Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics , pages6140–6150, Florence, Italy. Association for Compu-tational Linguistics.Nan Shao, Yiming Cui, Ting Liu, Shijin Wang, andGuoping Hu. 2020. Is graph structure neces-sary for multi-hop reasoning? arXiv preprintarXiv:2004.03096 .Linfeng Song, Zhiguo Wang, Mo Yu, Yue Zhang,Radu Florian, and Daniel Gildea. 2018. Exploringraph-structured passage representation for multi-hop reading comprehension with graph neural net-works. arXiv preprint arXiv:1809.02040 .Alon Talmor and Jonathan Berant. 2018. The webas a knowledge-base for answering complex ques-tions. In
Proceedings of the 2018 Conference of theNorth American Chapter of the Association for Com-putational Linguistics: Human Language Technolo-gies, Volume 1 (Long Papers) , pages 641–651, NewOrleans, Louisiana. Association for ComputationalLinguistics.Ming Tu, Guangtao Wang, Jing Huang, Yun Tang, Xi-aodong He, and Bowen Zhou. 2019. Multi-hop read-ing comprehension across multiple documents byreasoning over heterogeneous graphs. In
Proceed-ings of the 57th Annual Meeting of the Associationfor Computational Linguistics , pages 2704–2713,Florence, Italy. Association for Computational Lin-guistics.Ashish Vaswani, Noam Shazeer, Niki Parmar, JakobUszkoreit, Llion Jones, Aidan N Gomez, ŁukaszKaiser, and Illia Polosukhin. 2017. Attention is allyou need. In
Advances in neural information pro-cessing systems , pages 5998–6008.Petar Veliˇckovi´c, Guillem Cucurull, Arantxa Casanova,Adriana Romero, Pietro Liò, and Yoshua Bengio.2018. Graph attention networks. In
InternationalConference on Learning Representations .Johannes Welbl, Pontus Stenetorp, and SebastianRiedel. 2018. Constructing datasets for multi-hopreading comprehension across documents.
Transac-tions of the Association for Computational Linguis-tics , 6:287–302.Zhengnan Xie, Sebastian Thiem, Jaycie Martin, Eliz-abeth Wainwright, Steven Marmorstein, and PeterJansen. 2020. WorldTree v2: A corpus of science-domain structured explanations and inference pat-terns supporting multi-hop inference. In
Proceed-ings of the 12th Language Resources and EvaluationConference , pages 5456–5473, Marseille, France.European Language Resources Association.Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio,William Cohen, Ruslan Salakhutdinov, and Christo-pher D. Manning. 2018. HotpotQA: A datasetfor diverse, explainable multi-hop question answer-ing. In
Proceedings of the 2018 Conference on Em-pirical Methods in Natural Language Processing ,pages 2369–2380, Brussels, Belgium. Associationfor Computational Linguistics.Yuyu Zhang, Ping Nie, Arun Ramamurthy, andLe Song. 2020. Ddrqa: Dynamic document rerank-ing for open-domain multi-hop question answering. arXiv preprint arXiv:2009.07465 .Chen Zhao, Chenyan Xiong, Corby Rosset, XiaSong, Paul Bennett, and Saurabh Tiwary. 2020. Transformer-xh: Multi-evidence reasoning with ex-tra hop attention. In