Iterative Relevance Feedback for Answer Passage Retrieval with Passage-level Semantic Match
Keping Bi, Qingyao Ai, and W. Bruce Croft
College of Information and Computer Sciences, University of Massachusetts Amherst, Amherst, MA, USA
{kbi,aiqy,croft}@cs.umass.edu

Abstract.
Relevance feedback techniques assume that users provide relevance judgments for the top k (usually 10) documents and then re-rank using a new query model based on those judgments. Even though this is effective, there has been little research recently on this topic because requiring users to provide substantial feedback on a result list is impractical in a typical web search scenario. In new environments such as voice-based search with smart home devices, however, feedback about result quality can potentially be obtained during users' interactions with the system. Since there are severe limitations on the length and number of results that can be presented in a single interaction in this environment, the focus should move from browsing result lists to iterative retrieval and from retrieving documents to retrieving answers. In this paper, we study iterative relevance feedback techniques with a focus on retrieving answer passages. We first show that iterative feedback is more effective than the top-k approach for answer retrieval. Then we propose an iterative feedback model based on passage-level semantic match and show that it can produce significant improvements compared to both word-based iterative feedback models and those based on term-level semantic similarity.
Keywords:
Iterative Relevance Feedback; Answer Passage Retrieval; Passage Embeddings
In typical relevance feedback (RF) techniques, users are provided with a list of top-ranked documents and asked to assess their relevance. The judged documents, together with the original query, are used to estimate a new query model using an RF model, which further acts as a basis for re-ranking. There were extensive studies of RF [1,4,9,27,25,29,13,2,28,16,37] based on the vector space model (VSM) [30], the probabilistic model [18] and, more recently, on the language model (LM) for Information Retrieval (IR) approach [23]. Despite the effectiveness of RF, the overhead involved in obtaining user relevance judgments has meant that it is not used in typical search scenarios.

With mobile and voice-based search becoming more popular, it becomes feasible to obtain feedback about result quality during users' interactions with the system. In these scenarios, the display space or voice bandwidth leads to severe limitations on the length and number of results shown in a single interaction. Thus, instead of providing a list of results, an iterative approach to feedback may be more effective. There has been some work in the past on iterative relevance feedback (IRF) with only a few results in each interaction using the VSM [1,13,2], but this has not been looked at for a long time. In addition, the space and bandwidth limitations make the retrieval of longer documents less desirable than shorter answer passages. Motivated by these reasons, in this paper, we present a detailed study of methods for IRF focused on answer passage retrieval.

Although they could be applied to any text retrieval scenario, most existing RF algorithms use word-based models originally designed for document retrieval. Answer passages, however, are much shorter than documents, which could potentially present problems for accurate estimation of word weights in the existing word-based RF methods.
Moreover, the limitations on the length and number of results in IRF mean that there is even less relevant text available at every iteration. Given these issues, introducing complementary information from the semantic space may help to estimate a more accurate RF model. Dense vector representations of words and paragraphs in distributed semantic space, called embeddings [21,17,32,8,5], have been effectively applied to many natural language processing (NLP) tasks. Embeddings have also been used in pseudo relevance feedback based on documents [35,24], but their impact in iterative and passage-based feedback is not known. Moreover, these previous works use semantic similarity at the term level and do not consider semantic match at a larger granularity. This has led us to incorporate passage-level semantic match in IRF for answer passage retrieval to improve upon word-based IRF and other embedding-based IRF using term-level semantic similarity.

In this paper, we first investigate whether iterative feedback based on different frameworks is effective relative to RF with a list of top k (k=10) results on answer passage retrieval. The results indicate that IRF is significantly more effective on answer passage collections. In addition, we propose an embedding-based IRF method using passage-level similarity for answer passage retrieval. This method incorporates the similarity scores computed with different types of answer passage embeddings and fuses them with other types of IRF models. The model we propose significantly outperforms IRF baselines based on words or semantic matches between terms. Combining both term-level and passage-level semantic match information leads to additional gains in performance.
Related Work.

In this section, we first review previous approaches to RF and IRF. We then discuss related work on embeddings of words and paragraphs applied to IR and some previous studies on answer passage retrieval.
Relevance Feedback.
In general, there are three main types of relevance feedback methods for ad-hoc retrieval, based on the vector space model (VSM) [30], the probabilistic model [18] and the language model (LM) [23]. Basically, they all extract expansion terms from annotated relevant documents
and re-weight the original query terms so as to estimate a more accurate query model to retrieve better results.

Rocchio [27] is generally credited as the first RF technique, developed on the VSM. It refines the vector of a user query by bringing it closer to the average vector of relevant documents and further from the average vector of non-relevant documents. In the probabilistic model, expansion terms are scored according to the probability that they occur in relevant documents compared to non-relevant documents [25,12]. Salton et al. [29] studied various RF techniques based on the VSM and the probabilistic model and showed that the probabilistic RF models are in general not as effective as the methods in the VSM.

More recently, feedback techniques have been investigated extensively based on the LM, among which the relevance model [16] and the mixture model [37] are two well-known examples that empirically perform well. In the third version of the relevance model (RM3) [16], the probabilities of expansion terms are estimated from occurrences of the terms in feedback documents. The mixture model [37] considers a feedback document to be generated from a mixture of a corpus language model and a query topic model, which is estimated with the EM algorithm. Some recent work [4,9] extends the mixture model by considering additional or different language models as components of the mixture.
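The Rocchio update described above can be sketched in a few lines (a minimal numpy sketch; the weights alpha, beta, gamma are common illustrative defaults, not values tuned in this paper):

```python
import numpy as np

def rocchio_update(query_vec, rel_docs, nonrel_docs,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward the centroid of relevant documents
    and away from the centroid of non-relevant ones (Rocchio, VSM)."""
    new_q = alpha * query_vec
    if len(rel_docs):
        new_q = new_q + beta * np.mean(rel_docs, axis=0)
    if len(nonrel_docs):
        new_q = new_q - gamma * np.mean(nonrel_docs, axis=0)
    return new_q
```

The signs mirror the prose: positive weight on the relevant centroid pulls the query closer, negative weight on the non-relevant centroid pushes it away.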
Iterative Relevance Feedback.
In contrast to most RF systems that ask users to give relevance assessments on a batch of documents, Aalbersberg [1] proposed the alternative technique of incremental RF based on Rocchio. Users are asked to judge a single result shown in each interaction, and the query model is then modified iteratively through feedback. This approach showed higher retrieval quality compared with standard batch feedback. Later, Iwayama [13] showed that the incremental relevance feedback used by Aalbersberg works better for documents with similar topics, but not as well for documents spanning several topics. In this paper, we investigate how IRF performs on retrieval of answer passages instead of documents, using more recent retrieval models.

Some recent TREC tracks [33,10] have made use of iterative and passage-level feedback, but they focus on document retrieval with different objectives and require a large amount of user feedback. The Total Recall track [33] aims at high recall, where the goal is to promote all of the relevant documents before non-relevant ones. The target of the Dynamic Domain track [10] is to identify documents satisfying all the aspects of the users' information need with passage-level feedback. In contrast, we focus on iterative feedback for the task of answer passage retrieval and investigate IRF with a fixed small amount of feedback.
Word and Paragraph Embeddings for RF.
Dense representations of words and paragraphs, called embeddings, have become popular and have been used [21,36,32,5] to abstract the meaning from a piece of text in semantic space. Two well-known techniques to train word and paragraph embeddings are Word2Vec [20] and Paragraph Vectors (PV) [17], respectively. The similarity of word embeddings can be used to compute translation probabilities between words [24,35] and incorporated into the VSM or the relevance model [16] to address term mismatch. Basically, these approaches use semantic match at the word level
Keping Bi, Qingyao Ai, and W. Bruce Croft and are in the form of query expansion. In contrast, our approach uses semanticmatch at passage level and is not based on query expansion.
In IRF, the topic model of the user's intent can be refined each iteration after a small number of results are assessed. Therefore, re-ranking is triggered earlier in IRF than in standard top-k RF methods. On the one hand, earlier re-ranking may produce better results with fewer iterations, which essentially reduces user effort in search interactions. On the other hand, having only a small amount of feedback information in each iteration may hurt the accuracy of model estimation and cause topic drift in the iterative process.

We convert several representative models to iterative versions and investigate the performance of the IRF models on answer passage retrieval. Since LM and VSM are the two most effective frameworks for RF, we study iterative feedback under these two frameworks. We use RM3 [16] and the Distillation (or Distill) model [4] to represent the LM framework, and Rocchio [27] for the VSM. RM3 is a common baseline for pseudo RF that has also been used for RF. Distillation is one of the most recent RF methods, an extension of the mixture model that incorporates a query-specific non-relevant topic model. Rocchio [27] is the standard feedback model in the VSM. As the retrieval models for the initial ranking, we use Query Likelihood (QL) for the LM framework and BM25 [26] for the VSM.

To keep the query model from diverging to non-relevant topics, we maintain two pools, for relevant and non-relevant results, which accumulate all the judgments up to the i-th iteration. During the i-th iteration, judged relevant and non-relevant results are added to the corresponding pool. Expanded query models are then estimated from the relevant pool by RM3, and from both the relevant and non-relevant pools by Distillation and Rocchio. A detailed introduction to the IRF models and the experiments can be found in [3].

Word-based RF methods were initially designed for document retrieval and are usually based on query expansion.
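The judgment-pooling protocol described above can be sketched as follows. This is a schematic, not the paper's implementation: `rank_fn`, `judge_fn`, and `rf_fn` are assumed stand-ins for the retrieval model, the (simulated) user judgment, and the RF model estimator.

```python
def iterative_feedback(query, rank_fn, judge_fn, rf_fn,
                       n_per_iter=2, n_iters=5):
    """Accumulate relevant/non-relevant pools across iterations and
    re-estimate the query model after each batch of judgments.
    Returns the frozen list of shown results followed by the
    remaining candidates from the final ranking."""
    rel_pool, nonrel_pool = [], []
    shown, frozen = set(), []
    model = query
    for _ in range(n_iters):
        # previously shown results are filtered out to avoid duplicates
        ranking = [p for p in rank_fn(model) if p not in shown]
        batch = ranking[:n_per_iter]
        frozen.extend(batch)              # ranks of shown results are frozen
        shown.update(batch)
        for p in batch:
            (rel_pool if judge_fn(p) else nonrel_pool).append(p)
        model = rf_fn(query, rel_pool, nonrel_pool)
    tail = [p for p in rank_fn(model) if p not in shown]
    return frozen + tail
```

With one result per iteration this reduces to incremental feedback; with all ten in one iteration it reduces to standard top-k feedback.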
In contrast to documents, answer passages do not have sufficient text to estimate the probabilities or weights of the expansion terms accurately, especially in IRF when fewer results are available in each iteration. To alleviate the problem of text insufficiency in IRF, we incorporate semantic information about paragraphs into the IRF models. Paragraph embeddings have been shown to be capable of capturing the semantic meanings of passages [17,32,5], which could potentially help us build more robust IRF models by supporting semantic matching between passages.

In this section, we propose to use paragraph embeddings to improve the performance of IRF for answer passage retrieval. In contrast to existing word-based and embedding-based RF methods, this approach does not extract expansion terms to update the query model. Instead, it represents the relevance topic from feedback passages with embeddings. Similar to Rocchio, we assume a relevant passage should be near the centroid of other relevant passages in the embedding
space. Also, we focus only on positive feedback, as previous studies [1] have shown negative feedback to have little benefit for RF when positive feedback is available. Therefore, our model can be viewed as an embedding version of Rocchio with only positive feedback.

We first describe the methods we use to obtain the semantic representations for answer passages. We then introduce the passage-embedding-based iterative feedback model.
Average Word Embeddings. One common way of representing passages is to use aggregated embeddings of the words in the paragraph. Word2Vec is a well-known method of training word embeddings [20,21]. It projects words into a dense vector space and uses a word to predict its context, or predicts a word from its context. In our experiments, we use average word embeddings trained with Word2Vec, both with and without IDF weighting, as passage representations.

Fig. 1: HDC models used in our experiments. Red words are local context, and blue words are global context.
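The averaged-word-vector representation can be computed as below (a minimal sketch; `word_vecs` and `idf` are hypothetical lookup tables, and skipping out-of-vocabulary words is a simplifying assumption):

```python
import numpy as np

def passage_embedding(tokens, word_vecs, idf=None):
    """Passage vector as the average of its word embeddings, optionally
    IDF-weighted (the W2V vs. idfW2V variants in the experiments)."""
    vecs, weights = [], []
    for t in tokens:
        if t in word_vecs:                 # skip out-of-vocabulary words
            vecs.append(word_vecs[t])
            weights.append(idf.get(t, 1.0) if idf is not None else 1.0)
    if not vecs:
        return None
    return np.average(np.array(vecs), axis=0, weights=weights)
```

With `idf=None` this is a uniform average; passing IDF weights emphasizes rarer, more discriminative words.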
Paragraph Vectors.
The other way of representing passages is to use specially designed paragraph vector models, as in [17,5,32]. The model we use is PV-HDC [32], with or without corruption, shown in Figure 1. PV-HDC is an extension of the initially proposed paragraph vector model [17], where a document vector is first used to predict an observed word, and afterward the observed word is used to predict its context words. The recent work on training paragraph representations through corruption [5] shows advantages in many tasks such as sentiment analysis. It replaces the original paragraph representation with a corruption module, where a global context ũ is generated through an unbiased dropout corruption at each update, and the paragraph representation is calculated as the average of the embeddings of the words in ũ. The final representation is simply the average of the embeddings of all the words in the paragraph. We also investigated other models, such as the original PV models DM and DBOW [17] and the Parallel Document Context model (PDC) [32], both with and without corruption, but HDC is better in most cases, so we exclude the other models from the paper.

As an alternative to query-expansion-based RF methods, we propose to represent the whole semantic meaning of a passage and of a passage set with vectors in the embedding space and measure the similarity between them, without explicitly extracting any expansion terms. Specifically, we represent the relevance topic in the i-th iteration as the embedding of the relevant passage pool and fuse the similarity between a passage and the relevance topic with other RF methods.
The score function is as follows:

score(Q^(i), d) = score_rf(Q^(i), d) + λ_sf · score_sem(RP^(i), d)    (1)

where Q^(i) is the expanded query model estimated by the iterative version of an RF model such as RM3, Distillation or Rocchio; d is the candidate passage; RP^(i) denotes the relevant passage pool in the i-th iteration; score_rf is the score calculated by the base RF model; score_sem is the semantic match score between passages, for which we use the common cosine similarity in this paper; and λ_sf is the coefficient for incorporating the passage-embedding-based similarity. Similar to Rocchio, we assume the topic of a passage set is the centroid of these passages, so a relevant passage pool can be represented by the average vector of the passages in it. Thus the similarity between a passage and the pool is

score_sem(RP^(i), d) = cos( (1 / |RP^(i)|) Σ_{d_r ∈ RP^(i)} d_r, d )    (2)

where d_r and d in the cosine denote the vector representations of the corresponding passages in the embedding space.

Our method has two advantages over existing RF methods. One is that, compared to expansion-term-based methods that only alleviate word-level mismatch, semantic similarity of larger granularity is captured in our method. The other is the flexibility of combining this semantic match signal with different types of approaches, such as RM3, Distillation, the mixture model, Rocchio, and other embedding-based feedback approaches.

In this section, we introduce the experimental setup and results of word-based IRF on answer passage retrieval.
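Equations (1) and (2) can be sketched directly (a minimal numpy sketch; the function and variable names are ours, and λ_sf would be tuned by cross-validation as described later):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def semantic_score(rel_pool_vecs, d_vec):
    """Eq. (2): cosine between the candidate passage vector and the
    centroid of the relevant-passage pool."""
    centroid = np.mean(np.asarray(rel_pool_vecs), axis=0)
    return cosine(centroid, d_vec)

def fused_score(rf_score, rel_pool_vecs, d_vec, lambda_sf=1.0):
    """Eq. (1): base RF score plus the weighted passage-level match."""
    return rf_score + lambda_sf * semantic_score(rel_pool_vecs, d_vec)
```

Because the fusion is a simple linear combination, the same `fused_score` can sit on top of any base RF scorer (RM3, Distillation, Rocchio, ERM), which is the flexibility noted above.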
In our experiments, we used WebAP and PsgRobust for answer passage retrieval. Statistics of the datasets are summarized in Table 1. WebAP [34] is a web answer passage collection built on Gov2. It uses a subset of Gov2 queries that are likely to have passage-level answers and retrieves the top 50 documents with the Sequential Dependency Model (SDM) [19]. After that, relevant documents were annotated for relevant answer passages. Overall, 3843 passages from 1200 documents are annotated as relevant. In our experiments, we split the rest of the documents into non-overlapping passages of 2 or 3 (randomly chosen) contiguous sentences as non-relevant passages, and used topic descriptions as questions.

PsgRobust is a new collection we created for answer passage retrieval. (The dataset is publicly available at https://ciir.cs.umass.edu/downloads/PsgRobust/.) It is based on the TREC Robust collection, following a similar approach to WebAP but without manual annotation. In PsgRobust, we assume that top-ranked passages in relevant documents can be considered relevant and that all passages in non-relevant documents are irrelevant. We first retrieved the top 100 documents for each title query in Robust with SDM [19] and generated answer passages from them with a sliding window of random lengths (2 or 3 sentences) and no overlap. After that, we retrieved the top 100 passages with SDM again and treated those from relevant documents as the relevant passages. Similar to WebAP, we used the descriptions of Robust topics as questions and have 246 queries with non-zero relevant answer passages in total. Recall@100 in the initial retrieval is 0.43, meaning that on average 43% of relevant documents for all queries were included in the passage collection. By manually checking some randomly sampled passages marked as relevant, we found most of them are indeed relevant passages for the questions. There are 6589 relevant passages from 3544 documents for the 246 queries in total.

Table 1: Statistics of experimental datasets. [Table body not recoverable.]

We also considered other collections that have passage-level annotation, such as the DIP2016Corpus [11] and the dataset from the Dynamic Domain track [33]. However, they are either trivial for RF tasks (almost all top-10 results retrieved by BM25 are relevant in DIP2016Corpus) or have few queries (only 26 and 27 for the two domains of the Dynamic Domain track). Other popular question answering datasets usually have only one relevant answer per query and thus are not suitable for our RF task either. Therefore, we only report results on WebAP and PsgRobust in this paper.
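The passage-generation step used for both collections (non-overlapping windows of 2 or 3 randomly chosen contiguous sentences) can be sketched as follows; the function name and the use of `random.Random` are our own choices, not from the paper:

```python
import random

def split_into_passages(sentences, rng=None):
    """Split a document's sentences into non-overlapping passages of
    2 or 3 (randomly chosen) contiguous sentences; a trailing passage
    may be shorter if fewer sentences remain."""
    rng = rng or random.Random()
    passages, i = [], 0
    while i < len(sentences):
        n = rng.choice([2, 3])
        passages.append(sentences[i:i + n])
        i += n
    return passages
```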
System Settings. All the methods we implemented are based on the Galago toolkit [7]. Stopwords were removed from all collections using the standard INQUERY stopword list, and words were stemmed with the Krovetz stemmer [15]. To compare iterative feedback with typical top-k feedback in a fair manner, we fixed the total number of judged results at 10 and experimented with 1, 2, 5, and 10 iterations, where 10, 5, 2, and 1 results were judged during each iteration respectively. Thus 10Doc-1Iter is exactly top-k feedback. We pay more attention to the settings of one or two results per iteration, which are more likely in a real interactive search scenario given the limitations on presenting results. True labels of results were used to simulate users' judgments.

All the parameters were set using 5-fold cross-validation over all the queries in each collection with grid search. For WebAP and PsgRobust, we tuned µ of QL, k1 of BM25 (with b set to 0.75 as suggested by [22]), and the number of expansion terms m. The ranges scanned for the parameters of RM3, Distillation and Rocchio are similar to those in the corresponding original papers; they are not shown here due to space limits.
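For reference, the two initial rankers whose parameters are tuned above are Dirichlet-smoothed query likelihood (parameter µ) and Okapi BM25 (parameters k1, b). Their per-term scores can be sketched as below; the default values are common illustrations, not the cross-validated settings:

```python
import math

def ql_term(tf, doc_len, coll_tf, coll_len, mu=1500.0):
    """Dirichlet-smoothed query-likelihood log-probability of one query term."""
    p_coll = coll_tf / coll_len          # collection language model estimate
    return math.log((tf + mu * p_coll) / (doc_len + mu))

def bm25_term(tf, doc_len, avg_doc_len, df, n_docs, k1=1.2, b=0.75):
    """Okapi BM25 contribution of one query term."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    tf_norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_norm
```

A document's score is the sum of these per-term contributions over the query terms; larger µ and k1 change how quickly term frequency saturates.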
Evaluation. The evaluation should focus only on the ranking of unseen results, so we use freezing ranking [6,28], as in [1,14], to evaluate the performance of IRF. The freezing ranking paradigm freezes the ranks of all results presented to the user in earlier feedback iterations and assigns the first result retrieved in the i-th iteration rank iN + 1, where N is the number of results shown in each iteration. Note that all previously shown results are filtered out in the following retrievals to remove duplicates, and the final result list concatenates the (Iter − 1) · N frozen results with the remaining candidates ranked in the last iteration, where Iter is the total number of iterations. We then use mean average precision at cutoff 100 (MAP) and NDCG@20 to measure the performance of results overall and at the top. As suggested by Smucker et al. [31], statistical significance is calculated with the Fisher randomization test with threshold 0.05.

Table 2: Performance of IRF on answer passage collections: MAP and NDCG@20 of freezing rank lists, where D × I stands for Doc × Iter (Initial, 10 × 1, 5 × 2, 2 × 5, 1 × 10). '*' denotes significant improvements over the standard top-10 feedback model (10 × 1). [Most numeric cells not recoverable.]
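Per query, MAP at cutoff 100 over the frozen lists is the average precision at the cutoff; a minimal sketch (normalizing by the number of relevant items, one common convention):

```python
def average_precision(ranking, relevant, cutoff=100):
    """AP@cutoff for one query: mean of the precision values at the
    ranks where a relevant result appears within the cutoff."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking[:cutoff], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0
```

MAP is then the mean of this value over all queries, computed on the frozen final lists rather than the raw per-iteration rankings.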
The performance of the initial retrieval with QL and BM25 and the IRF experimental results are shown in Table 2. All the feedback methods are significantly better than their retrieval baselines (RM3 and Distillation compared with QL; Rocchio compared with BM25) in terms of both MAP and NDCG@20.

In addition, on both WebAP and PsgRobust, the MAP and NDCG@20 of RM3, Distillation and Rocchio increase as the ten results are judged in more iterations. In other words, IRF is much more effective for answer passage retrieval than top-k feedback. Performance goes up when re-ranking is done earlier, even when we have only a small number of passages, probably because answer passages are usually focused on a single topic and are less likely to cause topic drift. Since MAP and NDCG@20 show similar trends for IRF under different settings, we only show MAP in Section 6.2 due to space limitations.
We compare our method with word-based and embedding-based RF baselines in two groups of experiments. One is the same as in Section 5, i.e., retrieval with different numbers of iterations and 10 results judged in total. The other focuses on identifying more relevant passages given only one relevant answer passage. We first describe the experimental setup and then introduce the two groups of experiments in Section 6.2 and Section 6.3.
In this part, we again use WebAP and PsgRobust for experiments. All comparisons are based on the LM framework (RM3, Distillation) and the VSM (Rocchio), to see whether the complementary semantic match helps in both frameworks. We also include the Embedding-based Relevance Model (ERM) [35] as a baseline. (On PsgRobust, BM25 and Rocchio underperform QL, RM3 and Distillation respectively by a large margin; because its labels are generated based on retrieval with SDM, this collection favors approaches in the LM framework over the VSM.)
ERM revises P(Q|D) in the original RM3 as a linear combination of P(Q|D) computed from exact term match and P(Q|w, D), which takes the semantic relationship between words into account. The translation probability between words is computed with the cosine similarity of their embeddings, transformed with the sigmoid function. Statistical significance in all the result tables is calculated with the Fisher randomization test with threshold 0.05.
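The sigmoid-transformed cosine used as a translation probability can be sketched as below; the slope and shift parameters here are illustrative placeholders, not the values tuned in [35]:

```python
import math

def translation_prob(w_vec, u_vec, slope=10.0, shift=0.8):
    """Word-to-word translation weight: a sigmoid of the cosine
    similarity between the two word embeddings."""
    dot = sum(x * y for x, y in zip(w_vec, u_vec))
    norm = (math.sqrt(sum(x * x for x in w_vec)) *
            math.sqrt(sum(y * y for y in u_vec)))
    cos = dot / norm
    return 1.0 / (1.0 + math.exp(-slope * (cos - shift)))
```

The sigmoid sharpens the cosine so that only strongly similar word pairs receive substantial translation weight.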
Embeddings Training. Four paragraph representations are tested in the four groups of experiments, where the base model (BM) can be RM3, ERM, Distillation (or Distill) or Rocchio:

BM+W2V / BM+idfW2V: uniformly or IDF-weighted average word vectors trained with the skip-gram model [20].

BM+PVC / BM+PV: paragraph vectors trained with the HDC structure with or without corruption [32,5].

Embeddings of words and paragraphs were trained on each local corpus respectively. Words with frequency less than 5 were removed. No stemming was done across the collections. 10 negative samples were used for each target word. The learning rate and batch size were 0.05 and 256. The dimension of the embedding vectors was set to 100. We also tried other hyper-parameters for training embeddings, and the results were similar under different settings. For PVC, the corruption rate q [5] was set to 0.9. All the neural networks for training embeddings were implemented using TensorFlow.

Parameter Settings.
We used the best settings of the baseline models and tuned the parameters of the semantic signals with 5-fold cross-validation for the different paragraph embeddings. All the parameters of ERM are tuned in the same ranges as [35] suggests. λ_sf in Equation 1 was selected by grid search over a range of values for each of WebAP and PsgRobust.

First, we conducted IRF experiments with different numbers of iterations and 10 results judged in total, as described in Section 5. We use MAP at cutoff 100 of the freezing rank lists as the evaluation metric, as described in Section 5.1.
Results and Discussion.
We show the experimental results using the language model baselines (RM3, ERM, Distillation) in Table 3 and include Rocchio as a baseline in Figure 2. In general, all four paragraph representations can improve performance significantly over the word-based and embedding-based baselines under most iteration settings. ERM performs similarly to RM3 on WebAP, and our method based on RM3 and on ERM also performs similarly. On PsgRobust, ERM performs slightly better than RM3, and our method also performs slightly better combined with ERM than with RM3. This shows that incorporating passage-level semantic similarity in embedding space produces improvements to both the word-based RF models and the embedding-based RF model that uses semantic similarity at the term level. (We also tried the true RF version of BM25-PRF-GT [24], a generalized translation model of BM25 based on word embeddings and Rocchio; due to its inferior performance on our datasets, we do not include those experiments here. The reason why ERM does not perform well is discussed in Section 6.3, where we examine the performance difference of ERM on the two tasks.)

Table 3: Performance of different IRF models: MAP on WebAP and on PsgRobust at (Doc × Iter) settings 10 × 1, 5 × 2, 2 × 5 and 1 × 10, where (10 Doc × 1 Iter) represents standard top-k feedback. '∗' and '†' denote significant improvements over the word-based (RM3, Distillation) and embedding-based (ERM) baselines respectively. [Partially recoverable row: Distillation 0.099, 0.104, 0.109, 0.111 on WebAP and 0.292, 0.299, 0.311, 0.313 on PsgRobust; most other cells not recoverable.]

The conclusion that IRF shows advantages over top-k feedback still holds when we combine word-based RF models with passage-level semantic match. In addition, no single representation is better than the others in all cases, which implies that for different datasets and baselines, different representations fit the specific properties of the setting.

As we mentioned in Section 4, the small amount of answer passage text available in each iteration may not be enough to build word-based RF models. The extreme case is when we have only one short passage as positive feedback. Effective re-ranking after the first positive feedback will show the user a second relevant answer in fewer iterations and make users less likely to leave after several interactions. It is therefore particularly important to perform well given the first positive feedback from users. We designed the second type of experiment to be answer retrieval given one relevant passage.

For each query, we randomly assign a relevant passage to the model as positive feedback and then retrieve from the remaining results. To make the results more reliable, we randomly draw a relevant passage for each query ten times and do ten retrievals. We then evaluate the performance of each model based on the overall ranked lists from the ten retrievals. We take QL and BM25 as baseline retrieval models that do not consider feedback.
Similar to the first group of experiments, we use RM3, Distillation and Rocchio as word-based RF baselines in the LM and VSM frameworks, and ERM as the embedding-based RF baseline. We use P@1 (precision at 1) and MRR (mean reciprocal rank) to evaluate the ability of a model to identify a second relevant passage in the next interaction given only one positive feedback passage. MAP at cutoff 100 measures the ability of the model to identify all the other relevant answers.

Fig. 2: Performance of our method with different paragraph representations compared with Rocchio on WebAP and PsgRobust (MAP of freezing rank lists at 10 × 1, 5 × 2, 2 × 5, 1 × 10). '+' denotes a significant difference.
Results and Discussion.
In Table 4, feedback methods are always bet-ter than their base retrieval models, i.e. QL, BM25. In general, with the fourparagraph representations, the improvements of
M AP over the baselines are al-ways significant; P @1, M RR can also be improved significantly in many cases.This shows that incorporating the passage semantic similarity can improve sig-nificantly over both the word-based RF baselines and the embedding-based RFbaseline with only term-level semantic match information.In contrast to the IRF experiments, ERM performs much better than RM3in this task. The reason may be that in the IRF experiments, there are morerelevant passages for RM3 to extract expansion terms and alleviate the termmismatch problem, which makes the term-level semantic match from ERM lesshelpful. In this task, the text for RM3 is not enough to estimate an accuratemodel and ERM is effective with semantic match. Since our method considerssemantic match at passage level, its benefit does not overlap with that fromterm-level semantic match.On WebAP, our method combined with RM3 performs similarly to ERMwhen using PV and PVC and worse than ERM using W2V and idfW2V. OnPsgRobust, incorporating our method to RM3 performs better than ERM interms of P @1 and M RR , but worse than ERM with
MAP. This shows that, with little feedback information, incorporating embedding similarity into RF at the passage level alone or at the term level alone leads to comparable performance. When we combine the two, performance can be improved further, as shown by the significant improvements over ERM when we add the passage-similarity signal to ERM on both datasets. This is consistent with our claim that passage-level semantic similarity is complementary to term-level similarity when combined with word-based RF models, since they capture two different granularities of semantic match.

Unlike in the IRF experiments, the paragraph-vector representations perform better than W2V and idfW2V. This indicates that, when there is little feedback information, more accurate representations lead to better performance. In addition, with scarce user feedback, PVC is more effective than PV, probably because it is less susceptible to overfitting a small dataset, as it has many fewer parameters (vocabulary size versus corpus size).

Table 4: Performance of different RF methods on finding other relevant answers given one relevant answer. '∗' and '†' denote significant improvements over the word-based (RM3, Distillation, Rocchio) and embedding-based (ERM) baselines, respectively. (Cells left blank could not be recovered from the source.)

                        WebAP                  PsgRobust
Model             P@1    MRR    MAP      P@1    MRR    MAP
QL                0.259  0.373  0.071    0.367  0.486  0.231
RM3               0.498  0.602  0.116    0.515  0.634  0.299
ERM               0.516  0.615  0.125    0.513  0.634  0.307
RM3+W2V           0.488  0.598  0.120
RM3+idfW2V        0.488  0.597  0.120
RM3+PV
RM3+PVC           0.524
ERM+W2V           0.513  0.622
ERM+idfW2V        0.525
ERM+PV
ERM+PVC
Distillation      0.494  0.597  0.113    0.516  0.635  0.299
Distill+W2V       0.489  0.593  0.117
Distill+idfW2V    0.489  0.595  0.117
Distill+PV        0.519
Rocchio+idfW2V    0.536  0.642
Rocchio+PV                                             0.281

Conclusion

We first showed that IRF is effective for answer passage retrieval. Then we showed that, with passage-level semantic match, iterative feedback and retrieval given one relevant passage can produce significant improvements compared with word-based RF models in the frameworks of both LM and VSM. The IRF experiments also show that our method is better than the embedding-based baseline using term-level similarity. The retrieval experiment based on one relevant passage shows that combining the word-level and passage-level granularities leads to the best performance.

Our method focuses on user requests of the "more like this" type. Diversity is also important for providing users with more informative results, and we will take it into account in future work. In addition, we will consider IRF for answer passage retrieval with end-to-end neural models.
Acknowledgments

This work was supported in part by the Center for Intelligent Information Retrieval and in part by NSF IIS-1715095. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.
References
1. Aalbersberg, I.J.: Incremental relevance feedback. In: Proceedings of the 15th annual international ACM SIGIR conference. pp. 11–22. ACM (1992)
2. Allan, J.: Incremental relevance feedback for information filtering. In: Proceedings of the 19th annual international ACM SIGIR conference. pp. 270–278. ACM (1996)
3. Bi, K., Ai, Q., Croft, W.B.: Revisiting iterative relevance feedback for document and passage retrieval. arXiv preprint arXiv:1812.05731 (2018)
4. Brondwine, E., Shtok, A., Kurland, O.: Utilizing focused relevance feedback. In: Proceedings of the 39th International ACM SIGIR conference. pp. 1061–1064. ACM (2016)
5. Chen, M.: Efficient vector representation for documents through corruption. arXiv preprint arXiv:1707.02377 (2017)
6. Cirillo, C., Chang, Y., Razon, J.: Evaluation of feedback retrieval using modified freezing, residual collection, and test and control groups. Scientific Report No. ISR-16 to the National Science Foundation (1969)
7. Croft, W.B., Metzler, D., Strohman, T.: Search engines: Information retrieval in practice, vol. 283. Addison-Wesley Reading (2010)
8. Dai, A.M., Olah, C., Le, Q.V.: Document embedding with paragraph vectors. In: NIPS Deep Learning Workshop (2015)
9. Dehghani, M., Azarbonyad, H., Kamps, J., Hiemstra, D., Marx, M.: Luhn revisited: Significant words language models. In: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. pp. 1301–1310. ACM (2016)
10. Grossman, M.R., Cormack, G.V., Roegiest, A.: TREC 2016 total recall track overview. In: TREC (2016)
11. Habernal, I., Sukhareva, M., Raiber, F., Shtok, A., Kurland, O., Ronen, H., Bar-Ilan, J., Gurevych, I.: New collection announcement: Focused retrieval over the web. In: Proceedings of the 39th International ACM SIGIR conference. pp. 701–704. ACM (2016)
12. Harman, D.: Relevance feedback revisited. In: Proceedings of the 15th annual international ACM SIGIR conference. pp. 1–10. ACM (1992)
13. Iwayama, M.: Relevance feedback with a small number of relevance judgements: incremental relevance feedback vs. document clustering. In: Proceedings of the 23rd annual international ACM SIGIR conference. pp. 10–16. ACM (2000)
14. Jones, G., Sakai, T., Kajiura, M., Sumita, K.: Incremental relevance feedback in Japanese text retrieval. Information Retrieval (4), 361–384 (2000)
15. Krovetz, R.: Viewing morphology as an inference process. In: Proceedings of the 16th annual international ACM SIGIR conference. pp. 191–202. ACM (1993)
16. Lavrenko, V., Croft, W.B.: Relevance-based language models. In: ACM SIGIR Forum. vol. 51, pp. 260–267. ACM (2017)
17. Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: Proceedings of the 31st International Conference on Machine Learning (ICML-14). pp. 1188–1196 (2014)
18. Maron, M.E., Kuhns, J.L.: On relevance, probabilistic indexing and information retrieval. Journal of the ACM (JACM) (3), 216–244 (1960)
19. Metzler, D., Croft, W.B.: A Markov random field model for term dependencies. In: Proceedings of the 28th annual international ACM SIGIR conference. pp. 472–479. ACM (2005)
20. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
21. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems. pp. 3111–3119 (2013)
22. Mogotsi, I.: Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze: Introduction to information retrieval (2010)
23. Ponte, J.M., Croft, W.B.: A language modeling approach to information retrieval. In: Proceedings of the 21st annual international ACM SIGIR conference. pp. 275–281. ACM (1998)
24. Rekabsaz, N., Lupu, M., Hanbury, A., Zuccon, G.: Generalizing translation models in the probabilistic relevance framework. In: Proceedings of the 25th ACM CIKM conference. pp. 711–720. ACM (2016)
25. Robertson, S.E., Jones, K.S.: Relevance weighting of search terms. Journal of the Association for Information Science and Technology (3), 129–146 (1976)
26. Robertson, S.E., Walker, S., Jones, S., Hancock-Beaulieu, M.M., Gatford, M., et al.: Okapi at TREC-3. NIST Special Publication Sp, 109 (1995)
27. Rocchio, J.J.: Relevance feedback in information retrieval. The SMART retrieval system: experiments in automatic document processing (1971)
28. Ruthven, I., Lalmas, M.: A survey on the use of relevance feedback for information access systems. The Knowledge Engineering Review (2), 95–145 (2003)
29. Salton, G., Buckley, C.: Improving retrieval performance by relevance feedback. Journal of the American Society for Information Science, 288–297 (1990)
30. Salton, G., Wong, A., Yang, C.S.: A vector space model for automatic indexing. Communications of the ACM 18(11), 613–620 (1975)