Leveraging Query Resolution and Reading Comprehension for Conversational Passage Retrieval
Svitlana Vakulenko, Nikos Voskarides, Zhucheng Tu, Shayne Longpre
University of Amsterdam {s.vakulenko,n.voskarides}@uva.nl
Apple Inc.
Abstract.
This paper describes the participation of the UvA.ILPS group at the TREC CAsT 2020 track. Our passage retrieval pipeline consists of (i) an initial retrieval module that uses BM25, and (ii) a re-ranking module that combines the score of a BERT ranking model with the score of a machine comprehension model adjusted for passage retrieval. An important challenge in conversational passage retrieval is that queries are often under-specified. Thus, we perform query resolution, that is, we add missing context from the conversation history to the current turn query using QuReTeC, a term classification query resolution model. We show that our best automatic and manual runs outperform the corresponding median runs by a large margin.
1 Passage Retrieval Pipeline

Our passage retrieval pipeline is shown schematically in Figure 1 and works as follows. Given the original current turn query Q and the conversation history H, we first perform query resolution, that is, we add missing context from H to Q to arrive at the resolved query Q′ [8]. Next, we perform initial retrieval using Q′ to get a list of top-k passages P. Finally, for each passage in P, we combine the scores of a re-ranking module and a reading comprehension module to obtain the final ranked list R. We describe each module of the pipeline below.

Fig. 1. Our passage retrieval pipeline.

1.1 Query Resolution

One important challenge in conversational passage retrieval is that the current turn query is often under-specified. In order to address this challenge, we perform query resolution, that is, we add missing context from the conversation history to the current turn query [8]. We use QuReTeC, a binary term classification query resolution model, which uses BERT to classify each term in the conversation history as relevant or not, and adds the relevant terms to the original current turn query. We refer the interested reader to the original paper for more details [8]. Due to BERT's restrictions on the number of tokens, we cannot include the responses to all the previous turn queries in the conversation history. Thus, we include (i) all the previous turn queries and (ii) the automatic canonical response to the previous turn query only (provided by the track organizers). We use the QuReTeC model described in [8] that was trained on gold standard query resolutions derived from the CANARD dataset [2].
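To make the term-classification formulation concrete, the following is a minimal sketch of how such a resolver can be applied at inference time. It is an illustration rather than our exact implementation: the checkpoint name quretec-checkpoint is a placeholder, and the wordpiece handling is deliberately simplistic (see [8] for the actual model).

```python
# Minimal sketch of QuReTeC-style query resolution (binary term classification).
# "quretec-checkpoint" is a placeholder name, not a released artifact.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("quretec-checkpoint")
model = AutoModelForTokenClassification.from_pretrained("quretec-checkpoint").eval()

def resolve(history: list[str], query: str) -> str:
    # Pack the conversation history and the current turn into one sequence.
    text = " ".join(history) + " [SEP] " + query
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits[0]        # (seq_len, 2)
    labels = logits.argmax(-1)                 # label 1 = relevant history term
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    relevant = {t.lstrip("#") for t, l in zip(tokens, labels)
                if l == 1 and t not in tokenizer.all_special_tokens}
    # Append relevant history terms that are missing from the original query.
    extra = [t for t in relevant if t.lower() not in query.lower()]
    return query + " " + " ".join(extra) if extra else query
```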
1.2 Initial Retrieval

We perform initial retrieval using BM25. We tuned the parameters k1 and b on the MS MARCO passage retrieval dataset.

1.3 Re-ranking

Here, we re-rank the ranked list obtained in the initial retrieval step. The final ranking score is a weighted average of the Re-ranking (BERT) and Reading Comprehension scores, which we describe below. The interpolation weight w is optimized on the TREC CAsT 2019 dataset [1].
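Before turning to the two re-ranking signals, here is a toy sketch of the initial retrieval step (Section 1.2) using the rank_bm25 package. The actual runs index the full CAsT passage collection with an IR toolkit; the two in-memory passages below are placeholders.

```python
# Toy sketch of BM25 initial retrieval with the rank_bm25 package.
import numpy as np
from rank_bm25 import BM25Okapi

passages = [
    "The First Lady of the United States receives no salary.",
    "Social Security is funded through payroll taxes.",
]
# k1 and b are tunable constructor arguments (defaults: k1=1.5, b=0.75).
bm25 = BM25Okapi([p.lower().split() for p in passages])

def initial_retrieval(resolved_query: str, k: int = 1000):
    scores = bm25.get_scores(resolved_query.lower().split())
    top = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in top]   # top-k (passage id, score)
```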
Re-ranking (BERT). We use a BERT model to get a ranking score for each passage as described in [4]. We initialize BERT with bert-large and fine-tune it on the MS MARCO passage retrieval dataset as described in [6].
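A sketch of how such a pointwise BERT ranking score can be computed with the transformers library. The public monoBERT checkpoint named below stands in for our fine-tuned model and is an assumption, not the exact artifact used in our runs; taking the positive-class log-probability as the score is one common choice.

```python
# Sketch of BERT re-ranking in the style of [4], with a stand-in checkpoint.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "castorini/monobert-large-msmarco"   # assumed public stand-in checkpoint
rerank_tok = AutoTokenizer.from_pretrained(name)
rerank_model = AutoModelForSequenceClassification.from_pretrained(name).eval()

def bert_score(query: str, passage: str) -> float:
    enc = rerank_tok(query, passage, return_tensors="pt",
                     truncation=True, max_length=512)
    with torch.no_grad():
        logits = rerank_model(**enc).logits
    # Log-probability of the "relevant" class as the ranking score.
    return torch.log_softmax(logits, dim=-1)[0, 1].item()
```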
Reading Comprehension.
As an additional signal to rank passages, we use a reading comprehension model. The model is a RoBERTa-Large model trained to predict an answer as a text span in a given passage, or "No Answer" if the passage does not contain the answer. It is fine-tuned on the MRQA dataset [3]. We compute the reading comprehension score as the sum of the predicted start and end span logits: (l_start + l_end).
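A sketch of this scoring step, together with the final interpolation with bert_score from the previous sketch. The MRQA-tuned checkpoint name and the value of w are placeholders (w is tuned on TREC CAsT 2019 [1]), and taking the maximum start and end logits is a simplification of scoring the predicted span.

```python
# Sketch of the reading-comprehension score (l_start + l_end) and the final
# interpolated ranking score. "roberta-large-mrqa" is a placeholder name.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

rc_name = "roberta-large-mrqa"              # placeholder checkpoint name
rc_tok = AutoTokenizer.from_pretrained(rc_name)
rc_model = AutoModelForQuestionAnswering.from_pretrained(rc_name).eval()

def rc_score(query: str, passage: str) -> float:
    enc = rc_tok(query, passage, return_tensors="pt",
                 truncation=True, max_length=512)
    with torch.no_grad():
        out = rc_model(**enc)
    # Best-span approximation: max start logit plus max end logit.
    return (out.start_logits.max() + out.end_logits.max()).item()

def final_score(query: str, passage: str, w: float = 0.5) -> float:
    # Weighted average of the two signals; w = 0.5 is a placeholder value.
    return w * bert_score(query, passage) + (1 - w) * rc_score(query, passage)
```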
Table 1. Experimental results on the TREC CAsT 2020 dataset. Note that apart from our submitted runs, we also report the performance of the median runs for reference (Median-auto and Median-manual).

Run              Type       NDCG@3  NDCG@5  MAP    MRR    Recall@100
quretecNoRerank  Automatic  0.171   0.170   0.107  0.406  0.285
Median-auto      Automatic  0.225   0.220   0.145  -      -
baselineQR       Automatic  0.319   0.302   0.158  0.556  0.266
quretecQR        Automatic
Median-manual    Manual     0.317   0.303   0.201  -      -
HumanQR          Manual
2 Runs

We submitted 3 automatic runs and 1 manual run. Automatic runs use the raw current turn query, while the manual run uses the manually rewritten current turn query. For all runs, we keep the top-ranked passages per query.

– quretecNoRerank: Uses QuReTeC for query resolution (Section 1.1) and the initial retrieval module (Section 1.2), but does not use re-ranking (Section 1.3).
– quretecQR: Uses the whole retrieval pipeline described in Section 1.
– baselineQR: Uses the whole retrieval pipeline, but uses the automatically rewritten version of the current turn query provided by the track organizers, instead of QuReTeC.
– HumanQR: Uses the whole retrieval pipeline, but uses the manually rewritten version of the current turn query provided by the track organizers, instead of QuReTeC.
3 Results

Table 1 shows our experimental results. First, we observe that quretecNoRerank underperforms Median-auto, thus highlighting the importance of the re-ranking module. Also, we observe that quretecQR, the run that uses the whole pipeline, outperforms Median-auto by a large margin and also outperforms baselineQR on all reported metrics. This shows the effectiveness of QuReTeC for query resolution [8]. Moreover, we see that quretecQR is outperformed by humanQR by a large margin, which highlights the need for future work on the task of query resolution [7]. Lastly, we observe that our manual run (humanQR) outperforms Median-manual, likely because of better (tuned) retrieval modules.
Table 2. Error analysis when using Original, QuReTeC-resolved or Human queries. For a given query group, ✓ marks NDCG@3 > 0, otherwise we mark it with × (NDCG@3 = 0).

Error type              Original  QuReTeC-resolved  Human   #    %     % (type)
Ranking error           ×         ×                 ×       20   9.6   13.5
                        ✓         ×                 ×       0    0.0
                        ×         ✓                 ×       7    3.4
                        ✓         ✓                 ×       1    0.5
Query resolution error  ×         ×                 ✓       51   24.5  25.5
                        ✓         ×                 ✓       2    1.0
No error                ×         ✓                 ✓       88   42.2  61.0
                        ✓         ✓                 ✓       39   18.8
4 Analysis

In this section, we analyze our results using the approach introduced in [5].

In our pipeline, passage retrieval performance depends on the performance of the query resolution module. Thus, we try to estimate the proportion of ranking and query resolution errors separately. Specifically, we compare passage retrieval performance when using the Original queries, the QuReTeC-resolved queries or the Human rewritten queries, and group queries into different types: (i) ranking error, (ii) query resolution error and (iii) no error. In order to simplify our analysis, we first choose a ranking metric m (e.g., NDCG@3) and a threshold t. We define ranking errors as follows: we assume that Human rewritten queries are always well specified (i.e., they do not need query resolution), and thus poor ranking performance (m <= t) when using the Human rewritten queries can be attributed to the ranking modules. A query resolution error is one for which the Human rewritten query has performance m > t, but for which the QuReTeC-resolved query has performance m <= t.

Table 2 shows the results of this analysis when using NDCG@3 as the ranking metric m and setting the threshold to t = 0. Since we assume that human rewrites are always well specified, all queries with NDCG@3 = 0 (× in column Human) are due to errors in retrieval (13.5%). Among the queries for which at least one relevant passage was retrieved in the top-3 (✓ in column Human), we see that 61.0% were correctly resolved by QuReTeC, and 25.5% were not. This shows that query resolution for conversational passage retrieval still has considerable room for improvement. In addition, we observe that (0 + 1 + 2 + 39)/208 ≈ 20% of the queries in the dataset do not need resolution, since when using those we can retrieve at least one relevant passage in the top-3 (✓ in column Original).
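A minimal sketch of this grouping, assuming a dictionary ndcg that maps each query id to its per-run NDCG@3 scores (run-file parsing and metric computation are omitted):

```python
# Sketch of the error-type grouping from [5]: bucket each query by which
# runs exceed the threshold t. `ndcg` maps qid -> (original, quretec, human).
def classify(ndcg: dict, t: float = 0.0) -> dict:
    groups = {"ranking_error": [], "resolution_error": [], "no_error": []}
    for qid, (orig, quretec, human) in ndcg.items():
        if human <= t:              # even the human rewrite fails: ranking error
            groups["ranking_error"].append(qid)
        elif quretec <= t:          # human succeeds, QuReTeC fails: resolution error
            groups["resolution_error"].append(qid)
        else:                       # the QuReTeC-resolved query succeeds
            groups["no_error"].append(qid)
    return groups
```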
Table 3. Error analysis when using Original, QuReTeC-resolved or Human queries. ✓ indicates that the retrieval performance (NDCG@3 or NDCG@5) reached the threshold indicated in the right columns, and × indicates that it did not reach the threshold. The numbers correspond to the number of queries in each group.

Original  QuReTeC-resolved  Human   NDCG@3              NDCG@5
                                    >0   >=0.5  =1      >0   >=0.5  =1
×         ×                 ×       20   88     185     17   87     196
✓         ×                 ×
×         ✓                 ×
✓         ✓                 ×
×         ×                 ✓       51   42     10      50   48     4
✓         ×                 ✓
×         ✓                 ✓       88   65     10      87   59     4
✓         ✓                 ✓       39   6      2       50   9      1
Table 3 shows the same error analysis performed for different thresholds of NDCG@3 and NDCG@5. We observe that, as the performance threshold increases, the number of ranking errors increases, which indicates that the passage ranking modules have a lot of room for improvement. Figure 2 shows the same analysis for NDCG@3, for more thresholds.

In order to gain further insights, we sample cases where using the QuReTeC-resolved queries results in better or worse retrieval performance than using the Human rewrites. In Table 4 we show examples where QuReTeC performs worse than Human rewrites. In these cases, QuReTeC either misses relevant tokens or introduces redundant tokens. Interestingly, there are also cases in which QuReTeC performs better than Human rewrites (see Table 5 for examples). In these examples, QuReTeC introduced tokens from the conversation history that were absent from the manually rewritten queries but which helped to retrieve relevant passages.
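The sweep behind Figure 2 can be sketched as follows, reusing classify() from the previous sketch; treating "reached the threshold" as a score of at least t is an assumption consistent with Table 3.

```python
# Sketch of the threshold sweep behind Figure 2: re-run the grouping for
# NDCG@3 thresholds in (0, 1] with a step of 0.02 and collect group sizes.
def sweep(ndcg: dict, step: float = 0.02) -> dict:
    thresholds = [round(step * i, 2) for i in range(1, int(1 / step) + 1)]
    # classify() treats score <= t as a failure, so pass t - epsilon to count
    # "score >= t" as having reached the threshold t.
    return {t: {k: len(v) for k, v in classify(ndcg, t=t - 1e-9).items()}
            for t in thresholds}
```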
Fig. 2. Error analysis results using Original, QuReTeC-resolved and Human queries for all thresholds of NDCG@3 in (0, 1] with intervals of 0.02. Passage ranking errors increase as the NDCG threshold increases (blue). The proportion of correct query resolutions (turquoise) is higher than the number of errors produced by QuReTeC (orange). Best seen in color.

5 Conclusions

We presented our participation in the TREC CAsT 2020 track. We found that our best automatic run, which uses QuReTeC for query resolution (quretecQR), outperforms both the automatic median run and the run that uses the rewritten queries provided by the organizers (baselineQR). In addition, we found that our manual run that uses the human rewrites (humanQR) outperforms our best automatic run (quretecQR), which, together with our analysis, highlights the need for further work on the task of query resolution for conversational passage retrieval. The source code for this analysis, which produces the visualisation in Figure 2 from the run files, is available at https://github.com/svakulenk0/QRQA.

References
1. Dalton, J., Xiong, C., Kumar, V., Callan, J.: CAsT-19: A dataset for conversational information seeking. In: SIGIR (2020)
2. Elgohary, A., Peskov, D., Boyd-Graber, J.: Can you unpack that? Learning to rewrite questions-in-context. In: EMNLP (2019)
3. Fisch, A., Talmor, A., Jia, R., Seo, M., Choi, E., Chen, D.: MRQA 2019 shared task: Evaluating generalization in reading comprehension. In: MRQA (2019)
4. Nogueira, R., Cho, K.: Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085 (2019)
5. Vakulenko, S., Longpre, S., Tu, Z., Anantha, R.: A wrong answer or a wrong question? An intricate relationship between question reformulation and answer selection in conversational question answering. In: SCAI (2020)
6. Vakulenko, S., Longpre, S., Tu, Z., Anantha, R.: Question rewriting for conversational question answering. In: WSDM (2021)
7. Vakulenko, S., Voskarides, N., Tu, Z., Longpre, S.: A comparison of question rewriting methods for conversational passage retrieval. In: ECIR (2021)
8. Voskarides, N., Li, D., Ren, P., Kanoulas, E., de Rijke, M.: Query resolution for conversational search with limited supervision. In: SIGIR (2020)

Table 4. Examples where QuReTeC performs worse than Human rewrites.

qid                                                                          NDCG@3
101_7  Human    Does the public pay the First Lady of the United States?     0.864
       QuReTeC  Do we pay the First Lady? melania trump                      0
101_8  Human    Does the public pay Ivanka Trump?                            0.883
       QuReTeC  What about Ivanka? melania melanija trump                    0
102_5  Human    How much money is owed to social security?                   0.704
       QuReTeC  How much is owed? program social security                    0
102_8  Human    Can social security be fixed?                                0.413
       QuReTeC  Can it be fixed? checks social check security                0
102_9  Human    How much of a tax increase will keep social security solvent?  1.000
       QuReTeC  How much of an increase? social security                     0
Table 5. Examples where QuReTeC performs better than Human rewrites.

qid                                                                          NDCG@3
101_9  Human    Does the public pay Jared Kushner?                           0
       QuReTeC  And Jared? ivana donald trump trayvon martin zimmerman       0.202
93_6   Human    What support does the franchise provide?                     0
       QuReTeC  What support does it provide? king franchise agreement burger
       Human    ... vegetarian recipes with almonds?                         0
       QuReTeC  Oh almonds? Can you show me recipes with it? almonds