Leveraging Query Resolution and Reading Comprehension for Conversational Passage Retrieval
Svitlana Vakulenko, Nikos Voskarides, Zhucheng Tu, Shayne Longpre
University of Amsterdam {s.vakulenko,n.voskarides}@uva.nl
Apple Inc.
Abstract.
This paper describes the participation of the UvA.ILPS group at the TREC CAsT 2020 track. Our passage retrieval pipeline consists of (i) an initial retrieval module that uses BM25, and (ii) a re-ranking module that combines the score of a BERT ranking model with the score of a machine comprehension model adjusted for passage retrieval. An important challenge in conversational passage retrieval is that queries are often under-specified. Thus, we perform query resolution, that is, we add missing context from the conversation history to the current turn query using QuReTeC, a term classification query resolution model. We show that our best automatic and manual runs outperform the corresponding median runs by a large margin.
1 Passage Retrieval Pipeline

Our passage retrieval pipeline is shown schematically in Figure 1 and works as follows. Given the original current turn query Q and the conversation history H, we first perform query resolution, that is, we add missing context from H to Q to arrive at the resolved query Q′ [8]. Next, we perform initial retrieval using Q′ to get a list of top-k passages P. Finally, for each passage in P, we combine the scores of a re-ranking module and a reading comprehension module to obtain the final ranked list R. We describe each module of the pipeline below.

Fig. 1. Our passage retrieval pipeline.

1.1 Query Resolution

One important challenge in conversational passage retrieval is that the current turn query is often under-specified. In order to address this challenge, we perform query resolution, that is, we add missing context from the conversation history to the current turn query [8]. We use QuReTeC, a binary term classification query resolution model, which uses BERT to classify each term in the conversation history as relevant or not, and adds the relevant terms to the original current turn query. We refer the interested reader to the original paper for more details [8]. Due to BERT's restrictions on the number of tokens, we cannot include the responses to all the previous turn queries in the conversation history. Thus, we include (i) all the previous turn queries and (ii) the automatic canonical response to the previous turn query only (provided by the track organizers). We use the QuReTeC model described in [8] that was trained on gold standard query resolutions derived from the CANARD dataset [2].
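To make the term-classification formulation concrete, the following is a minimal sketch of how such a resolver can be applied at inference time. It is an illustration rather than our exact implementation: the checkpoint name quretec-checkpoint is a placeholder, and the wordpiece handling is deliberately simplistic (see [8] for the actual model).

```python
# Minimal sketch of QuReTeC-style query resolution (binary term classification).
# "quretec-checkpoint" is a placeholder name, not a released artifact.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("quretec-checkpoint")
model = AutoModelForTokenClassification.from_pretrained("quretec-checkpoint").eval()

def resolve(history: list[str], query: str) -> str:
    # Pack the conversation history and the current turn into one sequence.
    text = " ".join(history) + " [SEP] " + query
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits[0]        # (seq_len, 2)
    labels = logits.argmax(-1)                 # label 1 = relevant history term
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    relevant = {t.lstrip("#") for t, l in zip(tokens, labels)
                if l == 1 and t not in tokenizer.all_special_tokens}
    # Append relevant history terms that are missing from the original query.
    extra = [t for t in relevant if t.lower() not in query.lower()]
    return query + " " + " ".join(extra) if extra else query
```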
1.2 Initial Retrieval

We perform initial retrieval using BM25. We tuned the parameters k1 and b on the MS MARCO passage retrieval dataset.

1.3 Re-ranking

Here, we re-rank the ranked list obtained in the initial retrieval step. The final ranking score is a weighted average of the Re-ranking (BERT) and Reading Comprehension scores, which we describe below. The interpolation weight w is optimized on the TREC CAsT 2019 dataset [1].
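Before turning to the two re-ranking signals, here is a toy sketch of the initial retrieval step (Section 1.2) using the rank_bm25 package. The actual runs index the full CAsT passage collection with an IR toolkit; the two in-memory passages below are placeholders.

```python
# Toy sketch of BM25 initial retrieval with the rank_bm25 package.
import numpy as np
from rank_bm25 import BM25Okapi

passages = [
    "The First Lady of the United States receives no salary.",
    "Social Security is funded through payroll taxes.",
]
# k1 and b are tunable constructor arguments (defaults: k1=1.5, b=0.75).
bm25 = BM25Okapi([p.lower().split() for p in passages])

def initial_retrieval(resolved_query: str, k: int = 1000):
    scores = bm25.get_scores(resolved_query.lower().split())
    top = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in top]   # top-k (passage id, score)
```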
Re-ranking (BERT). We use a BERT model to get a ranking score for each passage as described in [4]. We initialize BERT with bert-large and fine-tune it on the MS MARCO passage retrieval dataset as described in [6].
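A sketch of how such a pointwise BERT ranking score can be computed with the transformers library. The public monoBERT checkpoint named below stands in for our fine-tuned model and is an assumption, not the exact artifact used in our runs; taking the positive-class log-probability as the score is one common choice.

```python
# Sketch of BERT re-ranking in the style of [4], with a stand-in checkpoint.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "castorini/monobert-large-msmarco"   # assumed public stand-in checkpoint
rerank_tok = AutoTokenizer.from_pretrained(name)
rerank_model = AutoModelForSequenceClassification.from_pretrained(name).eval()

def bert_score(query: str, passage: str) -> float:
    enc = rerank_tok(query, passage, return_tensors="pt",
                     truncation=True, max_length=512)
    with torch.no_grad():
        logits = rerank_model(**enc).logits
    # Log-probability of the "relevant" class as the ranking score.
    return torch.log_softmax(logits, dim=-1)[0, 1].item()
```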
Reading Comprehension.
As an additional signal to rank passages, we use a reading comprehension model. The model is a RoBERTa-Large model trained to predict an answer as a text span in a given passage, or "No Answer" if the passage does not contain the answer. It is fine-tuned on the MRQA dataset [3]. We compute the reading comprehension score as the sum of the predicted start and end span logits: (l_start + l_end).
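A sketch of this scoring step, together with the final interpolation with bert_score from the previous sketch. The MRQA-tuned checkpoint name and the value of w are placeholders (w is tuned on TREC CAsT 2019 [1]), and taking the maximum start and end logits is a simplification of scoring the predicted span.

```python
# Sketch of the reading-comprehension score (l_start + l_end) and the final
# interpolated ranking score. "roberta-large-mrqa" is a placeholder name.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

rc_name = "roberta-large-mrqa"              # placeholder checkpoint name
rc_tok = AutoTokenizer.from_pretrained(rc_name)
rc_model = AutoModelForQuestionAnswering.from_pretrained(rc_name).eval()

def rc_score(query: str, passage: str) -> float:
    enc = rc_tok(query, passage, return_tensors="pt",
                 truncation=True, max_length=512)
    with torch.no_grad():
        out = rc_model(**enc)
    # Best-span approximation: max start logit plus max end logit.
    return (out.start_logits.max() + out.end_logits.max()).item()

def final_score(query: str, passage: str, w: float = 0.5) -> float:
    # Weighted average of the two signals; w = 0.5 is a placeholder value.
    return w * bert_score(query, passage) + (1 - w) * rc_score(query, passage)
```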
Table 1. Experimental results on the TREC CAsT 2020 dataset. Note that apart from our submitted runs, we also report the performance of the median runs for reference (Median-auto and Median-manual).

Run              Type       NDCG@3  NDCG@5  MAP    MRR    Recall@100
quretecNoRerank  Automatic  0.171   0.170   0.107  0.406  0.285
Median-auto      Automatic  0.225   0.220   0.145  -      -
baselineQR       Automatic  0.319   0.302   0.158  0.556  0.266
quretecQR        Automatic
Median-manual    Manual     0.317   0.303   0.201  -      -
HumanQR          Manual
2 Runs

We submitted 3 automatic runs and 1 manual run. Automatic runs use the raw current turn query, while the manual run uses the manually rewritten current turn query. For all runs, we keep the top-ranked passages per query.

– quretecNoRerank: Uses QuReTeC for query resolution (Section 1.1) and the initial retrieval module (Section 1.2), but does not use re-ranking (Section 1.3).
– quretecQR: Uses the whole retrieval pipeline described in Section 1.
– baselineQR: Uses the whole retrieval pipeline, but uses the automatically rewritten version of the current turn query provided by the track organizers, instead of QuReTeC.
– HumanQR: Uses the whole retrieval pipeline, but uses the manually rewritten version of the current turn query provided by the track organizers, instead of QuReTeC.
3 Results

Table 1 shows our experimental results. First, we observe that quretecNoRerank underperforms Median-auto, thus highlighting the importance of the re-ranking module. Also, we observe that quretecQR, the run that uses the whole pipeline, outperforms Median-auto by a large margin and also outperforms baselineQR on all reported metrics. This shows the effectiveness of QuReTeC for query resolution [8]. Moreover, we see that quretecQR is outperformed by humanQR by a large margin, which highlights the need for future work on the task of query resolution [7]. Lastly, we observe that our manual run (humanQR) outperforms Median-manual, likely because of better (tuned) retrieval modules.
Table 2. Error analysis when using Original, QuReTeC-resolved or Human queries. For a given query group, ✓ marks NDCG@3 > 0, otherwise we mark it with × (NDCG@3 = 0).

Error type              Original  QuReTeC-resolved  Human   #    %     % (type)
Ranking error           ×         ×                 ×       20   9.6   13.5
                        ✓         ×                 ×       0    0.0
                        ×         ✓                 ×       7    3.4
                        ✓         ✓                 ×       1    0.5
Query resolution error  ×         ×                 ✓       51   24.5  25.5
                        ✓         ×                 ✓       2    1.0
No error                ×         ✓                 ✓       88   42.2  61.0
                        ✓         ✓                 ✓       39   18.8
4 Analysis

In this section, we analyze our results using the approach introduced in [5].

In our pipeline, passage retrieval performance depends on the performance of the query resolution module. Thus, we try to estimate the proportion of ranking and query resolution errors separately. Specifically, we compare passage retrieval performance when using the Original queries, the QuReTeC-resolved queries or the Human rewritten queries, and group queries into different types: (i) ranking error, (ii) query resolution error and (iii) no error. In order to simplify our analysis, we first choose a ranking metric m (e.g., NDCG@3) and a threshold t. We define ranking errors as follows: we assume that Human rewritten queries are always well specified (i.e., they do not need query resolution), and thus poor ranking performance (m <= t) when using the Human rewritten queries can be attributed to the ranking modules. A query resolution error is one for which the Human rewritten query has performance m > t, but for which the QuReTeC-resolved query has performance m <= t.

Table 2 shows the results of this analysis when using NDCG@3 as the ranking metric m and setting the threshold to t = 0. Since we assume that human rewrites are always well specified, all queries with NDCG@3 = 0 (× in column Human) are due to errors in retrieval (13.5%). Among the queries for which at least one relevant passage was retrieved in the top-3 (✓ in column Human), we see that 61.0% were correctly resolved by QuReTeC, and 25.5% were not. This shows that query resolution for conversational passage retrieval still has considerable room for improvement. In addition, we observe that (0 + 1 + 2 + 39)/208 ≈ 20% of the queries in the dataset do not need resolution, since when using those we can retrieve at least one relevant passage in the top-3 (✓ in column Original).
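A minimal sketch of this grouping, assuming a dictionary ndcg that maps each query id to its per-run NDCG@3 scores (run-file parsing and metric computation are omitted):

```python
# Sketch of the error-type grouping from [5]: bucket each query by which
# runs exceed the threshold t. `ndcg` maps qid -> (original, quretec, human).
def classify(ndcg: dict, t: float = 0.0) -> dict:
    groups = {"ranking_error": [], "resolution_error": [], "no_error": []}
    for qid, (orig, quretec, human) in ndcg.items():
        if human <= t:              # even the human rewrite fails: ranking error
            groups["ranking_error"].append(qid)
        elif quretec <= t:          # human succeeds, QuReTeC fails: resolution error
            groups["resolution_error"].append(qid)
        else:                       # the QuReTeC-resolved query succeeds
            groups["no_error"].append(qid)
    return groups
```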
Table 3. Error analysis when using Original, QuReTeC-resolved or Human queries. ✓ indicates that the retrieval performance (NDCG@3 or NDCG@5) reached the threshold indicated in the right columns, and × indicates that it did not reach the threshold. The numbers correspond to the number of queries in each group.

Original  QuReTeC-resolved  Human   NDCG@3              NDCG@5
                                    >0   >=0.5  =1      >0   >=0.5  =1
×         ×                 ×       20   88     185     17   87     196
✓         ×                 ×
×         ✓                 ×
✓         ✓                 ×
×         ×                 ✓       51   42     10      50   48     4
✓         ×                 ✓
×         ✓                 ✓       88   65     10      87   59     4
✓         ✓                 ✓       39   6      2       50   9      1
Table 3 shows the same error analysis performed for different thresholds of NDCG@3 and NDCG@5. We observe that, as the performance threshold increases, the number of ranking errors increases, which indicates that the passage ranking modules have a lot of room for improvement. Figure 2 shows the same analysis for NDCG@3, for more thresholds.

In order to gain further insights, we sample cases where using the QuReTeC-resolved queries results in better or worse retrieval performance than using the Human rewrites. In Table 4 we show examples where QuReTeC performs worse than Human rewrites. In these cases, QuReTeC either misses relevant tokens or introduces redundant tokens. Interestingly, there are also cases in which QuReTeC performs better than Human rewrites (see Table 5 for examples). In these examples, QuReTeC introduced tokens from the conversation history that were absent from the manually rewritten queries but which helped to retrieve relevant passages.
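The sweep behind Figure 2 can be sketched as follows, reusing classify() from the previous sketch; treating "reached the threshold" as a score of at least t is an assumption consistent with Table 3.

```python
# Sketch of the threshold sweep behind Figure 2: re-run the grouping for
# NDCG@3 thresholds in (0, 1] with a step of 0.02 and collect group sizes.
def sweep(ndcg: dict, step: float = 0.02) -> dict:
    thresholds = [round(step * i, 2) for i in range(1, int(1 / step) + 1)]
    # classify() treats score <= t as a failure, so pass t - epsilon to count
    # "score >= t" as having reached the threshold t.
    return {t: {k: len(v) for k, v in classify(ndcg, t=t - 1e-9).items()}
            for t in thresholds}
```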
Fig. 2. Error analysis results using Original, QuReTeC-resolved and Human queries for all thresholds of NDCG@3 in (0, 1] with intervals of 0.02. Passage ranking errors increase as the NDCG threshold increases (blue). The proportion of correct query resolutions (turquoise) is higher than the number of errors produced by QuReTeC (orange). Best seen in color.

5 Conclusions

We presented our participation in the TREC CAsT 2020 track. We found that our best automatic run, which uses QuReTeC for query resolution (quretecQR), outperforms both the automatic median run and the run that uses the rewritten queries provided by the organizers (baselineQR). In addition, we found that our manual run that uses the human rewrites (humanQR) outperforms our best automatic run (quretecQR), which, together with our analysis, highlights the need for further work on the task of query resolution for conversational passage retrieval. The source code for this analysis, which produces the visualisation in Figure 2 from the run files, is available at https://github.com/svakulenk0/QRQA.

References
1. Dalton, J., Xiong, C., Kumar, V., Callan, J.: CAsT-19: A dataset for conversational information seeking. In: SIGIR (2020)
2. Elgohary, A., Peskov, D., Boyd-Graber, J.: Can you unpack that? Learning to rewrite questions-in-context. In: EMNLP (2019)
3. Fisch, A., Talmor, A., Jia, R., Seo, M., Choi, E., Chen, D.: MRQA 2019 shared task: Evaluating generalization in reading comprehension. In: MRQA (2019)
4. Nogueira, R., Cho, K.: Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085 (2019)
5. Vakulenko, S., Longpre, S., Tu, Z., Anantha, R.: A wrong answer or a wrong question? An intricate relationship between question reformulation and answer selection in conversational question answering. In: SCAI (2020)
6. Vakulenko, S., Longpre, S., Tu, Z., Anantha, R.: Question rewriting for conversational question answering. In: WSDM (2021)
7. Vakulenko, S., Voskarides, N., Tu, Z., Longpre, S.: A comparison of question rewriting methods for conversational passage retrieval. In: ECIR (2021)
8. Voskarides, N., Li, D., Ren, P., Kanoulas, E., de Rijke, M.: Query resolution for conversational search with limited supervision. In: SIGIR (2020)

Table 4. Examples where QuReTeC performs worse than Human rewrites.

qid                                                                          NDCG@3
101_7  Human    Does the public pay the First Lady of the United States?     0.864
       QuReTeC  Do we pay the First Lady? melania trump                      0
101_8  Human    Does the public pay Ivanka Trump?                            0.883
       QuReTeC  What about Ivanka? melania melanija trump                    0
102_5  Human    How much money is owed to social security?                   0.704
       QuReTeC  How much is owed? program social security                    0
102_8  Human    Can social security be fixed?                                0.413
       QuReTeC  Can it be fixed? checks social check security                0
102_9  Human    How much of a tax increase will keep social security solvent?  1.000
       QuReTeC  How much of an increase? social security                     0
Table 5. Examples where QuReTeC performs better than Human rewrites.

qid                                                                          NDCG@3
101_9  Human    Does the public pay Jared Kushner?                           0
       QuReTeC  And Jared? ivana donald trump trayvon martin zimmerman       0.202
93_6   Human    What support does the franchise provide?                     0
       QuReTeC  What support does it provide? king franchise agreement burger
       Human    ... vegetarian recipes with almonds?                         0
       QuReTeC  Oh almonds? Can you show me recipes with it? almonds