A Comparison of Question Rewriting Methods for Conversational Passage Retrieval
Svitlana Vakulenko, Nikos Voskarides, Zhucheng Tu, and Shayne Longpre
University of Amsterdam · Apple Inc.
Abstract.
Conversational passage retrieval relies on question rewriting to modify the original question so that it no longer depends on the conversation history. Several methods for question rewriting have recently been proposed, but they were compared under different retrieval pipelines. We bridge this gap by thoroughly evaluating those question rewriting methods on the TREC CAsT 2019 and 2020 datasets under the same retrieval pipeline. We analyze the effect of different types of question rewriting methods on retrieval performance and show that by combining question rewriting methods of different types we can achieve state-of-the-art performance on both datasets.

1 Introduction

Conversational search aims to provide automated support for natural and effective human–information interaction [1]. The TREC Conversational Assistance Track (CAsT) introduced the task of conversational (multi-turn) passage retrieval (PR) [3], where the goal is to retrieve short passages of text from a large passage collection that answer the information need at the current turn.

One prominent challenge in conversational PR is that the question at the current turn often requires information from the conversation history (questions and passages retrieved in previous turns) to be interpreted correctly. A proposed solution to this challenge is question rewriting (or resolution, QR), i.e., modifying the question such that it no longer depends on the conversation history. For instance, the question "What did he work on?" can be rewritten into "What did Bruce Croft work on?" based on the conversation history (see Table 4 for the complete example).

Recently proposed methods for QR in conversational PR can be categorized into two types, namely sequence generation and term classification. Sequence generation QR methods generate natural language sequences using the conversation history [7,9], while term classification QR methods add terms from the conversation history to the current turn question [5,8]. The former can be trained using human-generated rewrites or data obtained from search sessions and heuristics [7,9], while the latter are either heuristic-based [5] or trained using human-generated rewrites or distant supervision [8].

Resources can be found at https://github.com/svakulenk0/cast_evaluation.
In this paper, we conduct a systematic evaluation of the state-of-the-art QR methods under the same retrieval pipeline on the CAsT 2019 and 2020 datasets. While CAsT 2019 only depends on the previous questions in the conversation, CAsT 2020 also includes questions that depend on previously retrieved passages. Our results provide insights on the ability of the QR methods to account for the conversation history, as well as on the potential of combining QR methods of different types for improving retrieval effectiveness.
2 Task Definition

We model the conversational PR task as a sequence of two subtasks: (1) question rewriting (QR) and (2) passage retrieval (PR) [7,8,9]. In this paper, we focus on the QR subtask and investigate the impact of QR on PR performance.

In the QR subtask, we are given the current turn question Q_i and a sequence of question–answer pairs H = (Q_1, A_1, ..., Q_{i-1}, A_{i-1}), the conversation history. The current turn question Q_i may depend on the conversation history H, and thus some information in H is required to correctly interpret Q_i. The goal of QR is to generate a question rewrite Q'_i that no longer depends on H.

In the PR subtask, we are given the question rewrite Q'_i and a passage collection C, and the goal is to retrieve from C a list of passages R sorted by their relevance to Q'_i. If Q'_i is semantically equivalent to ⟨Q_i, H⟩, we expect R to constitute relevant passages for ⟨Q_i, H⟩.
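To make the two-subtask decomposition concrete, the sketch below shows the interface we assume between QR and PR. The function names and the history representation are illustrative, not part of any of the evaluated systems.

```python
from typing import List, Tuple

History = List[Tuple[str, str]]  # (question, answer-passage) pairs

def rewrite(question: str, history: History) -> str:
    """QR subtask: produce a self-contained rewrite Q'_i of Q_i given H."""
    # Placeholder: a real QR model (sequence generation or term
    # classification) goes here. Returning the question unchanged
    # corresponds to the "Original" baseline in Section 3.1.
    return question

def retrieve(rewritten_question: str, k: int = 1000) -> List[str]:
    """PR subtask: return a ranked list of passage ids from collection C."""
    # Placeholder for a retrieval pipeline (see Section 3.3).
    return []

def conversational_pr(question: str, history: History) -> List[str]:
    # QR and PR are applied in sequence: PR only sees the rewrite.
    return retrieve(rewrite(question, history))
```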
We aim to answer the following research questions:

RQ1: How do different QR methods perform on the two datasets we consider (CAsT 2019 and CAsT 2020)?

RQ2: Can we combine different QR models to improve retrieval performance?

3 Experimental Setup

Following previous work, we perform both intrinsic and extrinsic evaluation [2,8]. In intrinsic evaluation, we compare rewrites produced by the QR methods with manual rewrites produced by human annotators, using ROUGE-1 Precision (P), Recall (R) and F-measure (F) [2]. We use ROUGE-1 to measure unigram overlap after punctuation removal, lowercasing and Porter stemming, using the implementation at https://github.com/google-research/google-research/tree/master/rouge. In extrinsic evaluation, we measure PR performance when using different QR methods, using standard ranking metrics: NDCG@3, MRR and Recall@1000.
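As an illustration of the intrinsic evaluation setup, the snippet below computes ROUGE-1 P/R/F between a model rewrite and a human rewrite with the rouge-score package linked above; the example strings are taken from Table 4, and use_stemmer=True applies Porter stemming while the package's default tokenizer lowercases and strips punctuation, matching the preprocessing described above.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

human_rewrite = "What did Bruce Croft work on?"      # reference rewrite
model_rewrite = "What did he work on? croft bruce"   # QuReTeC-style output

# score(target, prediction) returns a dict of Score tuples per ROUGE type.
score = scorer.score(human_rewrite, model_rewrite)["rouge1"]
print(f"P={score.precision:.2f} R={score.recall:.2f} F={score.fmeasure:.2f}")
```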
3.1 Question Rewriting Methods

We compare the following question rewriting methods:

– Original: The original current turn question without any modification.
– Human: The gold-standard rewrite of the current turn question produced by a human annotator.

– Rule-Based and Self-Learn model question rewriting as a sequence generation task and use GPT-2 to perform generation [9]. In order to gather training data, these methods convert ad-hoc search sessions to conversational search sessions, either by using heuristic rules (Rule-Based) or by using self-supervised learning (Self-Learn).

– Transformer++ [7] is a GPT-2 sequence generation model. It was trained on CANARD, a conversational question rewriting dataset [4].

– QuReTeC [8] models question rewriting as term classification, i.e., predicting which terms from the conversation history to add to the current turn question. It uses BERT to perform term classification and can be trained using human rewrites or distant supervision obtained from query–passage relevance labels. In this paper, we use the model trained on CANARD [4] to be comparable with Transformer++. Since QuReTeC does not generate natural language text but rather appends a bag of words (BoW) to the original question, we also introduce an oracle, Human-BoW, as an upper bound for QuReTeC performance (see the sketch after this list).
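To illustrate the term classification formulation, the hedged sketch below runs a BERT token classifier over the concatenated history and current question and appends the terms predicted as relevant. This is not QuReTeC's released code: the checkpoint name and the binary labelling scheme are our own illustrative assumptions.

```python
import torch
from transformers import BertForTokenClassification, BertTokenizerFast

# "bert-base-uncased" is an untrained stand-in; QuReTeC's trained
# checkpoint would be loaded here instead, so predictions below are
# arbitrary without those weights.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForTokenClassification.from_pretrained("bert-base-uncased",
                                                   num_labels=2)
model.eval()

history = "Who are some of the well-known Information Retrieval researchers?"
question = "What did he work on?"

# Encode the history and the current question as one sequence pair.
enc = tokenizer(history, question, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**enc).logits        # shape: (1, seq_len, 2)
pred = logits.argmax(dim=-1)[0]         # label 1 = add this term

tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
# Keep whole-word alphabetic tokens only (drops [CLS]/[SEP]/##subwords).
added = {t for t, p in zip(tokens, pred) if p == 1 and t.isalpha()}

# The rewrite is the original question plus the predicted bag of words.
rewrite = question + " " + " ".join(sorted(added))
```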
3.2 Datasets

We use the recently constructed TREC CAsT 2019 and CAsT 2020 datasets [3]. Table 1 shows basic statistics of the datasets.

Table 1. Dataset statistics.

Copy indicates the number of questions for which the human rewrite is exactly the same as the corresponding original question. This statistic shows that, in contrast to CAsT 2019, only very few questions in CAsT 2020 can be copied verbatim, and the majority of questions require extra terms.

Another major difference between the two datasets is that the current turn question in CAsT 2020 may also depend on the answer passage to the previous turn question (A_{i-1}), while in CAsT 2019 the current turn question depends only on the questions of the previous turns in the conversation history (Q_1, Q_2, ..., Q_{i-1}). Therefore, we experiment with two variations of input to the QR models: (1) all previous questions (indicated as Q), and (2) all previous questions plus the answer passage to the previous turn question (indicated as Q&A), as sketched below. We use the answer passage to the previous turn question retrieved by the automatic rewriting system provided by the TREC CAsT 2020 organizers.
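A minimal sketch of how the two input variants can be assembled; the separator and the ordering are our own assumptions, and each QR model may use its own input format.

```python
from typing import List, Tuple

def build_qr_input(history: List[Tuple[str, str]], question: str,
                   use_answer: bool) -> str:
    """Assemble QR model input: Q uses previous questions only; Q&A
    additionally appends the answer passage of the previous turn."""
    parts = [q for q, _ in history]
    if use_answer and history:
        parts.append(history[-1][1])  # answer passage of the previous turn
    parts.append(question)
    return " [SEP] ".join(parts)      # separator is an illustrative choice
```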
Table 2. Evaluation of question rewriting methods on CAsT 2019.

QR Method                   Recall@1000   NDCG@3              ROUGE-1
                            (Initial)     Initial  Reranked   P     R     F
Original                    0.417         0.131    0.266      0.92  0.76  0.82
Transformer++ Q             0.743         0.265    —          —     —     —
Self-Learn Q                0.725         0.261    0.513      0.93  0.89  0.90
Rule-Based Q                0.717         0.248    0.487      0.94  0.89  0.91
QuReTeC Q                   —             —        —          —     —     —
Self-Learn Q + QuReTeC Q    0.785         0.293    0.519      0.90  —     —
Rule-Based Q + QuReTeC Q    0.783         —        —          —     —     —
Human-BoW Q                 0.769         0.297    0.524      0.91  0.90  0.90
Human                       0.803         0.309    0.577      1.00  1.00  1.00
3.3 Retrieval Pipeline

All QR methods described in Section 3.1 were previously evaluated on CAsT 2019 using different retrieval pipelines. For a fair comparison, we evaluate the QR methods on both CAsT 2019 and CAsT 2020 using the same passage retrieval pipeline.

We use a standard two-stage pipeline for passage retrieval, consisting of an unsupervised ranker for initial retrieval that performs efficient lexical matching (BM25) and a supervised reranker (BERT) over the top-1000 passages returned by initial retrieval [6]. Both components were tuned on a subset of the MS MARCO dataset (including the BM25 parameters k1 and b); for the reranker we use the implementation at https://github.com/nyu-dl/dl4marco-bert. Note that our pipeline outperforms the official baseline provided by the TREC CAsT organizers on both the 2019 and 2020 datasets for all query rewriting methods they considered; since our focus is on comparing query rewriting methods, we do not report those results for brevity.
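A hedged sketch of such a two-stage pipeline follows. It is not the exact setup used in the paper: we use Pyserini's prebuilt MS MARCO passage index and a publicly available MiniLM cross-encoder as a stand-in for the BERT reranker, and the BM25 parameter values shown are illustrative.

```python
# pip install pyserini sentence-transformers
import json
from pyserini.search.lucene import LuceneSearcher
from sentence_transformers import CrossEncoder

# Stage 1: BM25 initial retrieval over a prebuilt MS MARCO passage index.
searcher = LuceneSearcher.from_prebuilt_index("msmarco-v1-passage")
searcher.set_bm25(k1=0.82, b=0.68)  # illustrative MS MARCO-style values

query = "What did Bruce Croft work on?"  # a rewritten question Q'_i
hits = searcher.search(query, k=1000)
# Stored documents are JSON records with a "contents" field.
passages = [json.loads(searcher.doc(h.docid).raw())["contents"] for h in hits]

# Stage 2: rerank the top-1000 passages with a cross-encoder.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, p) for p in passages])
reranked = sorted(zip(hits, scores), key=lambda x: -x[1])

for hit, score in reranked[:3]:  # top-3, as scored by NDCG@3
    print(hit.docid, f"{score:.3f}")
```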
4 Results

Here we answer RQ1: How do different QR methods perform on the two datasets we consider?
CAsT 2019.
In Table 2, we observe that QuReTeC outperforms all other methods in initial retrieval (Recall@1000 and NDCG@3). However, we see that Transformer++ Q outperforms QuReTeC in reranking (NDCG@3). This may indicate that the reranking component (BERT) is more sensitive to rewritten questions that do not resemble natural language text (produced by QuReTeC) than the initial retrieval component (BM25). This is also reflected in the ROUGE-1 metric variations: ROUGE-1 R is generally in agreement with initial retrieval performance. This is expected since our initial retrieval component is BoW and does not get substantially affected by missing or incorrect terms such as pronouns and stopwords, which are usually insignificant for lexical matching (see Human-BoW in Table 2). ROUGE-1 P, however, favours the sequence generation methods and penalizes QuReTeC, since QuReTeC does not have a mechanism to delete or replace such terms from the original question.
Table 3. Evaluation of question rewriting methods on CAsT 2020.

QR Method                       Recall@1000   NDCG@3              ROUGE-1
                                (Initial)     Initial  Reranked   P     R     F
Original                        0.251         0.068    0.193      —     —     —
QuReTeC Q&A                     —             —        —          —     —     —
Transformer++ Q + QuReTeC Q&A   0.525         0.160    0.351      —     —     —
Rule-Based Q&A + QuReTeC Q&A    0.519         —        —          —     —     —
Table 4. Example question rewrites for the topic in CAsT 2020 starting with "Who are some of the well-known Information Retrieval researchers?".

Answer passage: Bruce Croft formed the Center ...
  Original:        What did he work on?
  Rule-Based Q&A:  What did Bruce Croft work on?
  QuReTeC Q&A:     What did he work on? croft bruce

Answer passage: Karpicke and Janell R. Blunt (2011) followed up ...
  Original:        Who are some important British ones?
  Rule-Based Q&A:  Who are some important British ones?
  QuReTeC Q&A:     Who are some important British ones? information retrieval
CAsT 2020.
In Table 3, we observe that the retrieval performance of Original and Human is much lower than in Table 2, which indicates that CAsT 2020 is more challenging than CAsT 2019. We observe that QuReTeC outperforms all other methods in all ranking metrics. This indicates that QuReTeC better captures relevant terms, both from the previous turn questions and from the answer passage to the previous turn question, than the other QR methods. (Recall that questions in CAsT 2020 may depend on the answer to the previous turn question, which is not the case in CAsT 2019.) Similarly to Table 2, ROUGE-1 R is in agreement with initial retrieval performance. As for ROUGE-1 P, we observe that it is not as important for retrieval as in Table 2.

Next, we assess the contribution of the answer passage to the previous turn question on QR performance. In Figure 1, we observe that most QR methods (except Transformer++) do benefit from using the answer passage, with QuReTeC having the biggest gain in initial retrieval. Table 4 shows examples of question rewrites produced by Rule-Based and QuReTeC.
Fig. 1. Initial retrieval (left) and reranking (right) performance on CAsT 2020 when the answer passage to the previous turn question is used (Q&A) or not used (Q) as input to the QR methods.
Next, we answer RQ2: Can we combine different QR models to improve performance?

In order to explore whether combining QR methods of different types (sequence generation or term classification) can be beneficial, we simply append terms from the conversation history predicted as relevant by QuReTeC to the rewrite produced by one of the sequence generation methods. We found that by doing this we can improve upon the individual QR methods and achieve state-of-the-art retrieval performance on CAsT 2019 by combining Transformer++ Q with QuReTeC Q (see Table 2), and on CAsT 2020 by combining Self-Learn Q and QuReTeC Q&A (see Table 3); however, the gains on CAsT 2020 are smaller.
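A minimal sketch of this combination step, assuming we already have a sequence-generation rewrite and the set of terms predicted by QuReTeC; the deduplication detail is our own illustrative choice.

```python
def combine_rewrites(generated_rewrite: str, predicted_terms: set[str]) -> str:
    """Append QuReTeC-predicted history terms to a sequence-generation
    rewrite, skipping terms the rewrite already contains."""
    present = {t.lower() for t in generated_rewrite.split()}
    extra = [t for t in sorted(predicted_terms) if t.lower() not in present]
    if not extra:
        return generated_rewrite
    return generated_rewrite + " " + " ".join(extra)

# e.g., a Transformer++ rewrite combined with QuReTeC-predicted terms:
print(combine_rewrites("What did Bruce Croft work on?",
                       {"croft", "bruce", "work"}))
# -> "What did Bruce Croft work on?" (all predicted terms already present)
```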
5 Conclusion

We evaluated alternative question rewriting methods for conversational passage retrieval on the CAsT 2019 and CAsT 2020 datasets. On CAsT 2019, we found that QuReTeC performs best in terms of initial retrieval, while Transformer++ performs best in terms of reranking. On CAsT 2020, we found that QuReTeC performs best in terms of both initial retrieval and reranking. Moreover, we achieved state-of-the-art ranking performance on both datasets using a simple method that combines the output of QuReTeC (a term classification method) with the output of a sequence generation method. Future work should focus on developing more advanced methods for combining term classification and sequence generation question rewriting methods.
Acknowledgements
We thank Raviteja Anantha for providing the rewrites of the Transformer++ model.