Learning to Truncate Ranked Lists for Information Retrieval
Chen Wu, Ruqing Zhang, Jiafeng Guo, Yixing Fan, Yanyan Lan, Xueqi Cheng
CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
University of Chinese Academy of Sciences, Beijing, China
{wuchen17z, zhangruqing, guojiafeng, fanyixing, lanyanyan, cxq}@ict.ac.cn

Abstract
Ranked list truncation is of critical importance in a variety of professional information retrieval applications such as patent search or legal search. The goal is to dynamically determine the number of returned documents according to some user-defined objectives, in order to reach a balance between the overall utility of the results and user efforts. Existing methods formulate this task as a sequential decision problem and take some pre-defined loss as a proxy objective, which suffers from the limitations of local decisions and indirect optimization. In this work, we propose a global decision based truncation model named AttnCut, which directly optimizes user-defined objectives for ranked list truncation. Specifically, we adopt the successful Transformer architecture to capture the global dependency within the ranked list for the truncation decision, and employ reward augmented maximum likelihood (RAML) for direct optimization. We consider two types of user-defined objectives of practical usage. One is a widely adopted metric such as F1, which acts as a balanced objective; the other is the best F1 under some minimal recall constraint, which represents a typical objective in professional search. Empirical results on the Robust04 and MQ2007 datasets demonstrate the effectiveness of our approach compared with state-of-the-art baselines.
Introduction

Existing information retrieval (IR) systems mainly focus on relevance ranking, which returns a ranked list of documents according to their relevance scores. Recently, ranked list truncation has attracted much attention in the IR community (Arampatzis, Kamps, and Robertson 2009; Lien, Cohen, and Croft 2019; Culpepper, Diaz, and Smucker 2018). Generally, the task aims to dynamically determine the number of returned documents according to some user-defined objectives, so as to reach a balance between the overall relevance or utility of the returned results and user efforts. Such a truncation task is of critical importance in a variety of professional IR applications where user efforts cannot be neglected. For example, in patent search (Lupu, Hanbury et al. 2013), it is time-consuming for a user to investigate each returned patent. In paid legal search (Tomlinson et al. 2007), litigation support professionals are paid by the hour, so each additional returned document incurs some monetary penalty.

Without loss of generality, there are two typical truncation requirements in practical IR applications. Firstly, the truncation needs to reach a balance between the precision and recall of the returned results, leading to the optimization of a mixed metric of the two, e.g., the F1 score. In other words, the system needs to automatically determine the cut-off position by predicting the best F1 score. Secondly, in some scenarios, recall is critical and needs more attention. For example, in patent search, users often require the returned list of patents to reach a target recall, as they want to find whether there exist conflicting patents. In such scenarios, the system needs to determine the cut-off position with respect to the target metric, such as F1, under some minimal recall constraint.

The present state-of-the-art method for ranked list truncation is BiCut (Lien, Cohen, and Croft 2019).
BiCut casts this problem as a sequential decision process and adopts a Bi-directional Long Short-Term Memory (Bi-LSTM) (Graves 2013) model to solve it. Specifically, given a ranked list of documents with relevance scores and document statistical information, BiCut attempts to predict Continue or EOL (end of the list) at each rank position, and cuts the ranked list at the first occurrence of EOL. The model is learned towards some pre-defined loss as a proxy objective of some user-defined metric.

However, the BiCut model suffers from two drawbacks. Firstly, as the truncation problem is formulated as a sequential decision process, the final cut-off decision is made upon a sequence of local decisions, which may not be optimal from a global view. Secondly, although the work claims to optimize arbitrary user-defined metrics, the actual relationship between the defined loss function and the true F1 metric is not clear.

To address these problems, in this paper, we propose a global decision based truncation model named AttnCut, which directly optimizes user-defined objectives for ranked list truncation. Specifically, we adopt the successful Transformer architecture to capture the long-range dependency within the ranked list. In this way, the truncation decision can be made in a global way using the self-attention mechanism. Meanwhile, we employ reward augmented maximum likelihood (RAML) (Norouzi et al. 2016) for model learning, which can directly optimize user-defined metrics such as F1 and DCG. Besides direct optimization of the target metric, we also tackle the prediction task of the best metric score under some minimal recall constraint.

We conduct empirical experiments on two widely adopted ad-hoc retrieval datasets, the Robust04 dataset and the MQ2007 dataset (Qin et al. 2010). For evaluation, we compare with several state-of-the-art methods to verify the effectiveness of our model.
Empirical results demonstrate that our model can well determine the number of returned documents and significantly outperforms all the baselines on the two datasets.

Related Work

In this section, we briefly review two lines of related work: ranked list truncation and the reward augmented maximum likelihood method.
Ranked List Truncation

The goal of ranked list truncation is to determine the best truncation position for an input ranked list. Existing methods can be generally categorized into parametric methods and assumption-free methods. Parametric methods assume a prior distribution and find the best truncation position by fitting it. Early work mainly focuses on modeling score distributions by fitting parametric probability distributions (Manmatha, Rath, and Feng 2001). Arampatzis, Kamps, and Robertson (2009) find the best cut-off value over ranked lists that optimizes the F1-measure. By assuming that the score distributions of query-document pairs are normal for relevant pairs and exponential for non-relevant ones, they adopt the Expectation Maximization (EM) algorithm (Dempster, Laird, and Rubin 1977) to estimate the parameters. However, this method relies on the normal-exponential mixture score distribution assumption (Arampatzis et al. 2000; Arampatzis 2002; Arampatzis and van Hameran 2001), which does not always hold.

Assumption-free approaches, on the other hand, aim to learn from the score distribution of the retrieval model using machine learning methods and determine where to truncate (Wang, Lin, and Metzler 2011; Culpepper, Clarke, and Lin 2016; Lien, Cohen, and Croft 2019). They include cascade-style approaches and recent deep learning methods. Cascade-style approaches (Wang, Lin, and Metzler 2011) view retrieval as a multi-stage progressive refinement problem and decide the set of documents to prune at each stage, in order to achieve a balance between efficiency and effectiveness. In addition, Culpepper, Clarke, and Lin (2016) use the score information over a set of sampled documents for learning dynamic cut-offs within cascade-style ranking systems. Deep learning methods apply deep architectures to the truncation task.
For example, BiCut (Lien, Cohen, and Croft 2019) applies a Bi-LSTM model to take the sequential relations between documents' scores and statistical information into consideration. The authors mention that any user-defined metric could be maximized if there is an appropriate corresponding loss function for minimization; however, the process of constructing such a loss function is not specified. Besides, the weight hyper-parameters also increase the difficulty of application and interpretation. Thus, in this work, we employ RAML to directly optimize the user-defined metric. Recently, Choppy (Bahri et al. 2020) adopts the Transformer architecture to truncate the ranked list, optimizing the metric by maximizing the expected evaluation metric on the training samples. While this method is heuristic, we work under the theoretical framework of RAML, which makes our model distribution approach the metric distribution and optimizes the user-defined metric directly and smoothly.
Reward Augmented Maximum Likelihood

Reward Augmented Maximum Likelihood (RAML) (Norouzi et al. 2016; Dai, Xie, and Hovy 2018) is a method which incorporates task reward optimization into maximum likelihood estimation (MLE). By applying an exponentiated scale over the task reward and sampling outputs from the resulting distribution, RAML optimizes the log-likelihood on such output samples and the corresponding inputs. If we take the exponentiated pay-off distribution as the target distribution, RAML can be regarded as minimizing the KL divergence between the target distribution and the model distribution, while MLE does the same except that the target distribution is the Dirac distribution of the ground-truth label. In this sense, by elaborating a different target distribution, RAML alleviates the exposure bias (Ranzato et al. 2015) and creates exploration opportunities for the model to learn from sequences that are not exactly the same as the ground truth but have high rewards. On the other hand, compared with reinforcement learning (RL) algorithms such as policy gradient (Williams 1992), which optimize the task metric directly, RAML samples from a stationary reward distribution instead of the constantly changing model distribution. As a result, RAML avoids the high gradient variance that is a well-known drawback of RL and enjoys more stable optimization. We adopt RAML to directly optimize the user-defined metric, which can be regarded as a reward. Previous works have only applied RAML to image captioning, machine translation and sentence summarization (Dai, Xie, and Hovy 2018; Ma et al. 2017; Sperber, Niehues, and Waibel 2017; Li et al. 2018); to the best of our knowledge, AttnCut is the first model that applies RAML to ranked list truncation.
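The contrast with MLE described above is easy to make concrete. A minimal sketch in plain Python (the toy reward vector is our own illustrative choice): as the temperature τ shrinks, the exponentiated pay-off distribution collapses toward the Dirac distribution on the best output, recovering the MLE target; a larger τ spreads probability over near-optimal outputs.

```python
import math

def payoff_distribution(rewards, tau):
    """Exponentiated pay-off distribution: q_k proportional to exp(r_k / tau)."""
    z = [math.exp(r / tau) for r in rewards]
    s = sum(z)
    return [v / s for v in z]

rewards = [0.9, 0.8, 0.2]                        # toy task rewards for three outputs
sharp = payoff_distribution(rewards, tau=0.01)   # nearly a Dirac on the best output
smooth = payoff_distribution(rewards, tau=1.0)   # spreads mass to near-optimal outputs
```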
The AttnCut Model

In this section, we introduce the AttnCut model, a novel attention-based global decision model designed for the ranked list truncation task.
Formally, given a ranked document list D = {d_1, d_2, ..., d_N} for a query q, AttnCut aims to find the best truncation position k ∈ [1, N] that maximizes an external metric (Lien, Cohen, and Croft 2019).

Basically, our AttnCut model can be decomposed into three dependent components: 1) Encoding Layer: to obtain the representation of each document in the ranked list; 2) Attention Layer: to capture the long-range dependencies within the ranked document list through a direct connection between every pair of documents; 3) Decision Layer: to predict the final cut-off position based on the final representation of the ranked list. The overall architecture of AttnCut is depicted in Figure 1, and we detail each component below.

[Figure 1: The overall architecture of the AttnCut model: an Encoding Layer over input vectors x_1 ... x_N producing hidden states h_1 ... h_N, an Attention Layer (multi-head attention with Add & LayerNorm), and a Decision Layer (MLP with softmax over output probabilities).]
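The notion of a best truncation position can be made concrete: an oracle with access to the relevance labels can find it by brute force, scanning every candidate cut-off and keeping the one that maximizes the external metric. A minimal sketch in plain Python, using the signed-gain DCG that the paper adopts for evaluation (y*_n ∈ {−1, 1}); the toy labels are illustrative:

```python
import math

def dcg_at_k(labels, k):
    # signed-gain DCG: relevant docs contribute +1, irrelevant -1,
    # discounted by log2(rank + 1), where rank = n + 1 (0-indexed n)
    return sum(y / math.log2(n + 2) for n, y in enumerate(labels[:k]))

def oracle_cutoff(labels, metric=dcg_at_k):
    # scan every candidate position k in [1, N] and keep the best one
    return max(range(1, len(labels) + 1), key=lambda k: metric(labels, k))

labels = [1, 1, -1, 1, -1, -1]   # toy relevance labels of a ranked list
best_k = oracle_cutoff(labels)   # the oracle cut-off for these labels
```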
Encoding Layer

Generally, the encoding layer takes in the input ranked documents and encodes them into a series of hidden representations. Each document d_n in a ranked document list D is first represented by its feature vector x_n = [r_n || s_n], obtained by concatenating a relevance score r_n given by a retrieval function (e.g., BM25) and the corresponding document statistics s_n, including document length, number of unique tokens, and document similarity. Specifically, we follow (Lien, Cohen, and Croft 2019) to compute the document length and the number of unique tokens. The document similarity denotes some pre-defined cosine similarity (e.g., tf-idf and doc2vec) between a document and its neighborhood documents. Then, we use a two-layer bi-directional LSTM as the document encoder, which summarizes not only the preceding documents but also the following documents. The document encoder sequentially receives the feature vectors of documents {x_1, . . . , x_N}, and the hidden representation h_n of each document d_n ∈ D is obtained by concatenating the forward and backward hidden states of the second layer of the document encoder.

Attention Layer

The attention layer aims to capture long-term dependencies within the ranked document list. The key idea is that the final cut-off decision should be made in a global way. To achieve this purpose, we leverage a single-layer Transformer architecture (Vaswani et al. 2017) over the hidden representations h_n of the documents in the input ranked list. In particular, the multi-head attention mechanism in the Transformer allows every document to be directly connected to any other document in the ranked list. Specifically, in the multi-head attention mechanism, each document attends to all the documents and obtains a set of attention scores that are used to refine its representation. Given the current document representations H = {h_1, . . . , h_N}, the refined document representations M are calculated as:

M = MultiHeadAttention(H) = Concat(head_1, ..., head_h) W^H,
head_i = softmax( Q^(i) (K^(i))^T / sqrt(t) ) V^(i),    (1)

where h is the number of heads and Q^(i) = H W_Q^(i), K^(i) = H W_K^(i), V^(i) = H W_V^(i). W^H ∈ R^{t×t} and W_Q^(i), W_K^(i), W_V^(i) ∈ R^{t×(t/h)} are learnable parameters, with t the model dimension. The dimension scaling factor sqrt(t) is applied to counteract the fast growth of the dot-product attention.

After obtaining the refined representation of each document by the multi-head attention mechanism, we add a layer normalization (Ba, Kiros, and Hinton 2016) to obtain the final representation of the ranked document list M′ ∈ R^{N×t}:

M′ = LayerNorm(M + H).    (2)

Decision Layer

The goal of the decision layer is to identify an appropriate cut-off position k for each query q given the final representation M′ of the input ranked list. Specifically, we obtain the output probability of AttnCut by applying a multilayer perceptron (MLP) followed by a softmax over the positions in the ranked document list:

p = Softmax(MLP(M′)),    (3)

where p = {p_n}_{n=1}^{N} ∈ R^N stands for a probability distribution over the N candidate cut-off positions.

Model Training

In the training phase, we propose to take into account alternative outputs beyond the ground truth for better model learning, while keeping the optimization procedure simple and efficient. The key idea is that if we can derive a better target distribution (i.e., one based on the user-defined metric) which conveys the information of the output structure, we can directly use it to replace the Dirac distribution in the MLE objective. Specifically, we derive the new target distribution by employing Reward Augmented Maximum Likelihood (RAML) (Norouzi et al. 2016), which consists of the following two steps.
• Define Output Distribution. Without loss of generality, given the ground-truth relevance labels y* = {y*_1, . . . , y*_N} (y*_n = 1 if d_n is relevant, y*_n = −1 if d_n is non-relevant) of the ranked list D and a reward function r, we can compute the reward r_k(y*) obtained when the ranked list D is truncated at position k. Specifically, we take the user-defined metric (e.g., the evaluation metric F1 as defined in Equ.(8)) as the reward function. Following the idea in (Dai, Xie, and Hovy 2018), we normalize these reward scores to obtain the distribution over outputs:

q_k = exp(r_k(y*) / τ) / Σ_{n=1}^{N} exp(r_n(y*) / τ),    (4)

where τ is a hyper-parameter which controls the concentration of the distribution around y*. Obviously, this distribution reflects how the task rewards are distributed in the output space.

• Integrate into the MLE Criterion. Previous neural models rely on MLE for model learning. Specifically, the MLE criterion maximizes the log-likelihood of the ground-truth truncation position:

L_MLE(θ) = − log p(D; θ) = − Σ_k δ_{k*}(k) log p_k(D; θ),    (5)

where p_k(D; θ) is the output probability defined in Equ.(3), and δ_{k*} denotes the Dirac distribution of the ground-truth truncation position, i.e., δ_{k*}(k*) = 1 and δ_{k*}(k) = 0 for any other k.

As we can see, the MLE criterion ignores the structure of the output space by treating all outputs that do not match the ground truth as equally poor, and thus introduces a discrepancy between training and test. Here, we replace the Dirac distribution δ_{k*} in Equ.(5) with the derived distribution q_k, and obtain our learning criterion:

L_r(θ; τ) = − Σ_k q_k log p_k(D; θ) = − E_{q_k}[log p_k(D; θ)].    (6)

This loss makes our model distribution approach the normalized reward distribution. We can now directly optimize this target-distribution-augmented objective function for learning AttnCut.
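The attention-based forward pass (Equations (1)-(3)) and the RAML objective (Equations (4) and (6)) fit together in a short NumPy sketch. This is illustrative only: the inputs are random toy encoder outputs, LayerNorm is used without learned gain/bias, a single linear layer stands in for the MLP, and we use 2 heads rather than the paper's 4.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_forward(H, params, h=2):
    """Eqs.(1)-(3): multi-head self-attention, residual + LayerNorm,
    then a linear decision layer with a softmax over the N positions."""
    Wq, Wk, Wv, Wh, w_out = params
    N, t = H.shape
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    d = t // h
    heads = [softmax(Q[:, i*d:(i+1)*d] @ K[:, i*d:(i+1)*d].T / np.sqrt(t))
             @ V[:, i*d:(i+1)*d] for i in range(h)]
    M = np.concatenate(heads, axis=1) @ Wh
    x = M + H                                  # residual connection, Eq.(2)
    Mp = (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-6)
    return softmax(Mp @ w_out)                 # p over cut-off positions, Eq.(3)

def raml_loss(p, rewards, tau=0.95):
    """Eqs.(4) and (6): cross-entropy against the exponentiated pay-offs."""
    r = np.asarray(rewards, dtype=np.float64) / tau
    q = np.exp(r - r.max())
    q /= q.sum()
    return -np.sum(q * np.log(p + 1e-12))

N, t = 5, 8
H = rng.normal(size=(N, t))                    # toy encoder outputs
params = [rng.normal(size=(t, t)) for _ in range(4)] + [rng.normal(size=t)]
p = attention_forward(H, params)
loss = raml_loss(p, rewards=[0.2, 0.5, 0.4, 0.3, 0.1])  # e.g. F1 per position
```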
We can see that this learning criterion is easy to implement in practice. It is also a general learning criterion that can be adopted by almost all existing ranked list truncation models.

In the inference phase, given a ranked document list D = {d_1, . . . , d_N} with respect to a query q, we pick the position k with the highest output probability as defined in Equ.(3) to cut the ranked list.

Truncation under a Recall Constraint

In some scenarios, people have specific recall requirements for the ranked list. For example, in patent search, users often require the returned list of patents to reach a target recall, as they want to find whether there exist conflicting patents. Therefore, it is necessary for a ranked list truncation model to decide the cut-off position with respect to the target metric under some minimal recall constraint. To achieve this purpose, we extend AttnCut to achieve an optimal target metric (e.g., F1) and ensure a target minimal recall simultaneously for the final cut-off decision.

Firstly, we compute the recall of the remaining ranked documents truncated at each candidate cut-off position k ∈ [1, N], defined as

R@k = (1 / N_D) Σ_{n=1}^{k} δ(y*_n = 1),    (7)

where N_D denotes the number of relevant documents in the ranked list D, y*_n is the relevance label of document d_n in the truncated ranked list, and δ is the indicator function. Then, we employ the same encoding layer and attention layer as in AttnCut to obtain the final representation of the ranked list. Note that we split R@k ∈ [0, 1] into B ordered bins. Hence, we modify the decision layer to classify each candidate position into one of the B ordered recall bins, yielding the probability distribution p′ = {p′_n}_{n=1}^{N} ∈ R^{N×B}.
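Equation (7), the bin discretization, and the inference-time reconciliation of the metric-optimal and recall-constrained positions can be sketched in plain Python. The toy inputs and the tie-breaking order (try positions in decreasing order of predicted metric) are our own assumptions:

```python
def recall_at_k(labels, k):
    # Eq. (7): fraction of all relevant documents retained when cutting at k
    n_rel = sum(1 for y in labels if y == 1)
    return sum(1 for y in labels[:k] if y == 1) / n_rel if n_rel else 0.0

def recall_bin(labels, k, num_bins=5):
    # discretize R@k in [0, 1] into one of B ordered bins (classification target)
    return min(int(recall_at_k(labels, k) * num_bins), num_bins - 1)

def constrained_cutoff(metric_scores, j):
    """Reconcile the metric-optimal position with the recall-constrained
    position j: try positions in decreasing order of predicted metric
    until one satisfies the constraint."""
    order = sorted(range(1, len(metric_scores) + 1),
                   key=lambda k: metric_scores[k - 1], reverse=True)
    for m in order:
        if m >= j:
            return m
    return j   # no candidate satisfies the constraint; cut at j itself

labels = [1, -1, 1, -1]            # toy relevance labels
scores = [0.2, 0.7, 0.5, 0.6]      # toy per-position metric predictions
k = constrained_cutoff(scores, j=3)
```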
We learn the recall-constrained AttnCut with the MLE objective, since the B-dimensional probability distribution of each position is not suitable as a reward score.

In the testing phase, we use AttnCut to pick the position m with the highest target metric (e.g., F1), and use the recall-constrained AttnCut to pick the position j under the minimal recall requirement σ. If m ≥ j, then m is the eligible position. Otherwise, a sub-optimal position m′, i.e., a position with the next-highest target metric, is tried until m′ ≥ j.

Experiments

In this section, we conduct experiments to verify the effectiveness of our proposed model.
Datasets

We conduct experiments on two representative IR datasets:
• Robust04 contains 250 queries and 528k news articles, whose topics are collected from the TREC 2004 Robust Track (https://trec.nist.gov/data/robust.html). There are about 70 relevant documents (news articles) per query.
• Million Query Track 2007 (MQ2007) is a LETOR (Qin et al. 2010) benchmark dataset built on the Gov2 web collection. It contains 1692 queries and 65323 documents, where each query has an average of 10 relevant documents.

We leverage two widely adopted retrieval models, i.e., BM25 (Robertson and Walker 1994) and DRMM (Guo et al. 2016), to obtain the ranked lists. Specifically, we retrieve the top 300 and top 150 documents as the ranked list for Robust04 and MQ2007, respectively. The detailed statistics of these datasets are shown in Table 1.

Implementation Details

We implement our AttnCut model in PyTorch (https://pytorch.org/). For both datasets, we randomly divide the queries into a training set (80%) and a testing set (20%) following (Lien, Cohen, and Croft 2019) to achieve comparable performance. For the Encoding Layer, we first compute the tf-idf and doc2vec representation of each document with the gensim tool (https://radimrehurek.com/gensim/) over the whole corpus; the dimensions of tf-idf and doc2vec are 648730 and 200, respectively. The hidden unit size of the two-layer bi-directional LSTM is set to 128. For the Attention Layer, the hidden size t of the Transformer is 256 and the number h of self-attention heads is 4. For training, the mini-batch size is set to 20 and 128 for Robust04 and MQ2007, respectively. The parameter τ for RAML learning is set to 0.95. We apply the stochastic gradient descent method Adam (Kingma and Ba 2014) to learn the model parameters. For the recall-constrained model, we set the number of ordered bins B to 5.

Baselines

We adopt three types of baseline methods for comparison: traditional truncation methods, neural truncation models, and variants of our model. For the traditional truncation methods, we apply three representative methods with different policies:
• Oracle uses the ground-truth labels of the test queries to find the best truncation position k for each query, which represents an upper bound on the achievable metric performance.
• Fixed-k determines a fixed point k across test queries and returns the top k document results (Fan et al. 2018; Tay, Tuan, and Hui 2018; Wang and Nyberg 2015).
• Greedy-k chooses a fixed k over the training data to maximize the user-defined evaluation metric.

The neural truncation models include:
• BiCut (Lien, Cohen, and Croft 2019) is an RNN-based model combined with a flexible cost function; it predicts Continue or EOL (end-of-list) at each position and truncates the ranked list at the first instance of EOL.
• Choppy (Bahri et al. 2020) leverages a Transformer architecture for ranked list truncation and optimizes the expected metric value.

Furthermore, we implement several variants of our model with different learning objectives:
• AttnCut-MLE learns AttnCut by MLE as defined in Equ.(5) to maximize the log-likelihood of the ground-truth truncation positions.
• AttnCut-Bi learns our AttnCut model with the loss function used in BiCut (Lien, Cohen, and Croft 2019), formalized as:

L_BiCut = Σ_{d_n ∈ D_k} ( α I(y*_n = 0) p_n / r + (1 − α) I(y*_n = 1) (1 − p_n) / r ),

where y*_n is the relevance label of the document d_n, I is an indicator function, r is the normalization factor, and α is a hyper-parameter. D_k denotes the remaining ranked documents after truncating at the cut-off position k.
• AttnCut-RL learns AttnCut by reinforcement learning (RL) (Williams 1992). We define the user-defined metric as a reward, and the RL objective function is:

L_RL = − Σ_k p_k(D; θ) γ^{l−1} r_k(y*),

where p_k(D; θ) is the output probability defined in Equ.(3) and γ^{l−1} r_k(y*) is the discounted reward; γ is the decay rate and l denotes the trace in training.

Evaluation Metrics

Following previous works (Lien, Cohen, and Croft 2019; Bahri et al. 2020), two standard evaluation metrics are used in the experiments: F1 at rank k (F1@k) and discounted cumulative gain at rank k (DCG@k).
• F1@k is evaluated at the cut-off candidate position k:

F1@k = 2 · P@k · R@k / (P@k + R@k),
P@k = (1/k) Σ_{n=1}^{k} δ(y*_n = 1),
R@k = (1/N_D) Σ_{n=1}^{k} δ(y*_n = 1),    (8)

where y*_n ∈ {−1, 1} is the relevance label of the document d_n, and N_D denotes the number of relevant documents in the ranked list.
• DCG@k (Järvelin and Kekäläinen 2002) is also evaluated at the cut-off candidate position k:

DCG@k = Σ_{n=1}^{k} y*_n / log2(n + 1).    (9)

For methods that optimize F1@k or DCG@k, we report the performance of the model when it is optimized specifically for that metric.
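Equations (8) and (9) translate directly into code. A plain-Python sketch with illustrative toy labels:

```python
import math

def precision_at_k(labels, k):
    # P@k: fraction of the top k documents that are relevant
    return sum(1 for y in labels[:k] if y == 1) / k

def recall_at_k(labels, k):
    # R@k: fraction of all relevant documents that appear in the top k
    n_rel = sum(1 for y in labels if y == 1)
    return sum(1 for y in labels[:k] if y == 1) / n_rel if n_rel else 0.0

def f1_at_k(labels, k):
    p, r = precision_at_k(labels, k), recall_at_k(labels, k)
    return 2 * p * r / (p + r) if p + r else 0.0

def dcg_at_k(labels, k):
    # Eq.(9): y* in {-1, 1}, so irrelevant documents are penalized
    return sum(y / math.log2(n + 2) for n, y in enumerate(labels[:k]))

labels = [1, -1, 1, 1, -1]   # toy relevance labels of a ranked list
```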
Method       | Robust04 (BM25) | Robust04 (DRMM) | MQ2007 (BM25)   | MQ2007 (DRMM)
             | F1@k    DCG@k   | F1@k    DCG@k   | F1@k    DCG@k   | F1@k    DCG@k
AttnCut-MLE  | 0.2538  0.3338  | 0.2770  0.4416  | 0.3096  -0.0741 | 0.3536  -0.0241
AttnCut-Bi   | 0.2819  -       | 0.2870  -       | 0.3302  -       | 0.4008  -
AttnCut-RL   | 0.2733  0.3404  | 0.2808  0.6087  | 0.3248  -0.0716 | 0.3985  -0.0199
AttnCut      |

Table 2: Model analysis of our AttnCut using different learning objectives under F1@k and DCG@k.

Method       | Robust04 (BM25) | Robust04 (DRMM) | MQ2007 (BM25)   | MQ2007 (DRMM)
             | F1@k    DCG@k   | F1@k    DCG@k   | F1@k    DCG@k   | F1@k    DCG@k
Oracle       | 0.3591  1.3328  | 0.3863  1.5948  | 0.4767  1.0569  | 0.5570  1.5742
Fixed-k (5)  | 0.1550  0.1876  | 0.1601  0.3114  | 0.2175  -0.5966 | 0.2486  -0.3227
Fixed-k (10) | 0.2103  -0.2672 | 0.2172  -0.1137 | 0.2794  -1.1860 | 0.3135  -0.8152
Fixed-k (50) | 0.2499  -5.3966 | 0.2649  -4.9261 | 0.2704  -6.9066 | 0.3224  -4.1455
Greedy-k     | †       †       | †       -0.0659 | 0.4047  †       | -0.0144

Table 3: Comparisons between our AttnCut model and baselines on the Robust04 and MQ2007 datasets. † represents statistical significance against the BiCut model (p < 0.05, Wilcoxon test).

Note that the widely used version of DCG (Burges et al. 2005) always increases monotonically with the list length k, so its best solution would be no truncation. Here we adopt the definition of DCG from (Järvelin and Kekäläinen 2002) to penalize irrelevant documents, since we set y*_n = 1 for relevant documents and −1 for irrelevant ones. This monotonicity is also the reason why other commonly used ranking metrics such as MAP (Sanderson 2010) and MRR (Voorhees 1999) cannot be used in the truncation task. For the evaluation of the recall-constrained AttnCut, we also compute the recall defined in Equ.(7) at the candidate cut-off positions to verify whether the truncation results satisfy the recall constraint.

Model Analysis
We first analyze the AttnCut variants trained with different learning objectives to investigate which objective is better for ranked list truncation. As shown in Table 2, we find that: (1) AttnCut-MLE does not work well. This is mainly because the MLE learning criterion introduces a discrepancy between training and test, leading to overfitting on the ground-truth labels and reduced generalization ability. (2) AttnCut-Bi achieves better results than AttnCut-RL, indicating that a joint loss function which controls the impact of false positives and false negatives is better than reinforcement learning with the target metric as the reward. (3) AttnCut achieves the best performance on both datasets under all the metrics.
Baseline Comparison
The performance comparisons between our model and the baselines are shown in Table 3. The actual values of k learned by Greedy-k are as follows: for Robust04 with BM25, k = 44 and 2 for F1@k and DCG@k, respectively; for Robust04 with DRMM, k = 37 and 3; for MQ2007 with BM25, k = 23 and 1; for MQ2007 with DRMM, k = 28 and 1. We can observe that: (1) Among the traditional truncation methods, the Fixed-k methods perform poorly, indicating that simply returning the top-k results is not suitable for ranked list truncation. (2) Fixed-k at the fixed point 50 achieves results comparable to Greedy-k on the Robust04 dataset in terms of F1. However, the best fixed point may vary across datasets and evaluation metrics, which limits its flexibility in truncating different ranked lists. Note that the comparative results between Fixed-k and Greedy-k differ slightly from those reported in (Lien, Cohen, and Croft 2019); the reason is that we split the datasets into training and testing sets with different random seeds. (3) The neural truncation methods (i.e., BiCut and Choppy) achieve better results than the traditional truncation methods, since they apply deep architectures to learn from the score distribution and truncate dynamically. (4) By learning the conditional joint distribution over candidate cut positions that maximizes the expected evaluation metric on the training samples, Choppy achieves the best performance among all the baseline methods. However, compared with Oracle, there is still a large gap between Choppy and the upper bound. (5) The better results of AttnCut over BiCut demonstrate the effectiveness of directly optimizing user-defined objectives while capturing the long-range dependency within the ranked list. (6) AttnCut outperforms Choppy, demonstrating that RAML, which makes our model distribution approach the metric distribution, is a better learning objective than maximizing the expected evaluation metric. (7) Overall, AttnCut achieves the best performance. For example, on Robust04, the relative improvement of AttnCut over BiCut is about 12.26% in terms of F1 under the BM25 retrieval model.

Recall Threshold | Robust04 (BM25) | Robust04 (DRMM) | MQ2007 (BM25) | MQ2007 (DRMM)
                 | F1@k    R@k     | F1@k    R@k     | F1@k    R@k   | F1@k    R@k
σ = 0,   Oracle  | 0.3591  0.3777  | 0.3863  0.4352  | 0.4767  0.6356 | 0.5570  0.7422
σ = 0,   AttnCut | 0.2821  0.3527  | 0.2881  0.3674  | 0.3059  0.4125 | 0.3959  0.7267
σ = 0.3, Oracle  | 0.3696  0.4641  | 0.3963  0.4948  | 0.5652  0.7631 | 0.6441  0.8617
σ = 0.3, AttnCut | 0.2417  0.5726  | 0.2836  0.6558  | 0.4299  0.6442 | 0.4614  0.9049
σ = 0.5, Oracle  | 0.3600  0.5818  | 0.3967  0.5923  | 0.5688  0.8048 | 0.6412  0.8784
σ = 0.5, AttnCut | 0.2171  0.6837  | 0.2522  0.7870  | 0.4453  0.7668 | 0.4612  0.9124
σ = 0.7, Oracle  | 0.3478  0.7470  | 0.3645  0.7467  | 0.5438  0.8935 | 0.6175  0.9361
σ = 0.7, AttnCut | 0.1572  0.8323  | 0.1994  0.8794  | 0.4263  0.9074 | 0.4645  0.9467

Table 4: Performance of the extended recall-constrained model on the Robust04 and MQ2007 datasets. σ denotes the minimal recall threshold.
Cut-off Position Distribution Analysis
To better understand what is learned by AttnCut, we conduct a qualitative analysis of the distribution of cut-off positions over the testing queries, comparing with the best cut-offs given by Oracle. Specifically, we visualize the cut-off position distributions of AttnCut and Oracle over the ranked lists retrieved by BM25 on the Robust04 dataset in Figure 2. As we can see, AttnCut is able to approximate the optimal distribution of cut-off values, and it truncates the ranked list well before the 300-document limit. The best truncation positions of AttnCut fall into the range of [0, 50]; the reason might be that most relevant documents are ranked in the top 50 by BM25. However, the inability to properly produce the optimal cut-off position when it is greater than 250 suggests that the model learns a conservative approach to the truncation task (Lien, Cohen, and Croft 2019).

Cut-off Position Comparison
To show the difference between AttnCut and BiCut, and to better understand the advantages of global decisions, we conduct a case study on a specific ranked list. Specifically, we look at query 688 of Robust04, which is about "non-U.S. media bias". The ranked list is returned by the DRMM model and then truncated by BiCut and AttnCut, respectively. Since BiCut makes local decisions, it truncates the ranked list early, at position 71, after seeing five consecutive irrelevant documents. On the same ranked list, however, our AttnCut model captures the two highly relevant documents that appear after a few irrelevant ones and truncates at position 101. As a result, this truncation improves the F1 metric of the query by 11 percent compared with the truncation result of BiCut. This example shows that our AttnCut model can capture the global dependencies among documents and truncate the ranked list better than BiCut.

[Figure 2: The distribution of cut-off positions of testing queries from AttnCut and Oracle.]

Recall-Constraint Results
For the evaluation of our recall-constraint AttnCut model under different minimal recall requirements, we conduct a simulation experiment on Robust04 and MQ2007, and vary the user-given recall σ by setting it to four different values (i.e., 0, 0.3, 0.5, 0.7). Since there is no existing work on this task, we take a brute-force search over the ranked list to obtain the upper bound on the metric performance as a comparison, denoted as Oracle in Table 4. To reveal whether the recall of the truncated ranked list satisfies the specific recall requirement, we also show the R@k value defined in Eq. 7. Note that AttnCut might outperform Oracle in terms of R@k, since both Oracle and AttnCut are optimized towards F1 with recall as a constraint. Note also that the recall constraints are required for each query. The numbers of queries that meet the recall requirements are as follows: for Robust04 with BM25, the number of eligible queries is 49, 43, 36 and 18 as the minimal recall constraint varies from 0 to 0.7, while for the oracle the corresponding numbers are 49, 44, 36 and 19; for Robust04 with DRMM, the numbers of eligible queries are 49, 36, 26 and 21, while for the oracle they are 49, 45, 37 and 23; for MQ2007 with BM25, the numbers are 338, 224, 151 and 88, while for the oracle they are 338, 283, 274 and 267; for MQ2007 with DRMM, the numbers are 338, 281, 277 and 252, while for the oracle they are 338, 292, 290 and 276.

As shown in Table 4, we can observe that: (1) The F1 score of Oracle for Robust04 is much smaller than that for MQ2007, and the F1 of AttnCut drops significantly on Robust04 as the minimal recall requirement increases. The reason might be that a query in Robust04 is associated with more relevant documents than one in MQ2007 (i.e., 70 vs. 10), so the recall value may be smaller (e.g., R@k on Robust04 is worse than that on MQ2007). (2) The recall-constraint AttnCut can satisfy the recall requirements, demonstrating the effectiveness of determining the cut-off position w.r.t. the target metric under some minimal recall constraint.
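The brute-force Oracle described above can be sketched as follows. The relevance labels, the function name, and the example values are illustrative only; this is a sketch of the search procedure, not the paper's implementation:

```python
def oracle_cutoff(labels, total_relevant, sigma):
    """Brute-force search for the cut-off k (1-based) maximizing F1,
    subject to recall at k being at least sigma.
    labels: binary relevance labels of the ranked list (1 = relevant).
    Returns (best_k, best_f1); best_k is None if no k meets the constraint."""
    best_k, best_f1 = None, -1.0
    hits = 0  # relevant documents seen so far
    for k, rel in enumerate(labels, start=1):
        hits += rel
        recall = hits / total_relevant
        if recall < sigma:  # constraint not yet satisfied at this cut-off
            continue
        precision = hits / k
        f1 = 0.0 if hits == 0 else 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_k, best_f1 = k, f1
    return best_k, best_f1
```

For instance, on a toy list `[1, 1, 0, 0, 1]` with 3 relevant documents in total, the unconstrained search (σ = 0) cuts at position 2, while σ = 0.9 forces the cut out to position 5, illustrating how a stricter recall requirement trades F1 for coverage.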
Conclusion

In this paper, we proposed to directly optimize user-defined objectives for ranked list truncation, aiming to make the final cut-off decision from a global view. We leveraged the successful transformer architecture to capture the long-range dependency within the ranked list, and employed RAML for model learning. In this way, the user-defined metric, which conveys information about the output structure, can be directly optimized. Furthermore, we tackled the task of predicting the best target metric under some minimal recall constraint. Empirical results showed that our model can significantly outperform state-of-the-art methods. In future work, we would like to consider diversity-related document features to obtain better document representations. We could also extend our model to practical retrieval applications, e.g., mobile search (Yi, Maghoul, and Pedersen 2008).
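The RAML objective referred to above can be illustrated with a small sketch: candidate outputs (here, cut-off positions) are weighted by an exponentiated-payoff distribution q(k) ∝ exp(R(k)/τ), and the model's log-likelihood is maximized under q instead of a single ground-truth target. The function name, the use of F1 as the reward, and the temperature τ are assumptions for illustration, not the paper's exact formulation:

```python
import math

def raml_weights(rewards, tau=1.0):
    """Exponentiated-payoff distribution over candidate cut-offs:
    q(k) ∝ exp(R(k) / tau), where R(k) is a user-defined reward
    such as the F1 score at cut-off k (temperature tau is illustrative)."""
    exps = [math.exp(r / tau) for r in rewards]
    z = sum(exps)
    return [e / z for e in exps]

# The RAML loss would then be a q-weighted negative log-likelihood:
#   loss = -sum(q[k] * log_p[k] over candidate cut-offs k)
# so the user-defined metric enters the training objective directly through q.
```

Because q places the most mass on the highest-reward cut-offs while still spreading some mass over near-optimal ones, training with these weights smooths the otherwise non-differentiable metric.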
Acknowledgments

This work was supported by the Beijing Academy of Artificial Intelligence (BAAI) under Grants No. BAAI2019ZD0306 and BAAI2020ZJ0303, and funded by the National Natural Science Foundation of China (NSFC) under Grants No. 61722211, 61773362, 62006218, 61872338, and 61902381, the Youth Innovation Promotion Association CAS under Grants No. 20144310 and 2016102, the National Key R&D Program of China under Grant No. 2016QY02D0405, the Lenovo-CAS Joint Lab Youth Scientist Project, the K.C. Wong Education Foundation, and the Foundation and Frontier Research Key Program of Chongqing Science and Technology Commission (No. cstc2017jcyjBX0059).
References
Arampatzis, A. 2002. Unbiased s-d threshold optimization, initial query degradation, decay, and incrementality, for adaptive document filtering. NIST Special Publication SP (250): 596–603.

Arampatzis, A.; Beney, J.; Koster, C. H.; and van der Weide, T. P. 2000. Incrementality, half-life, and threshold optimization for adaptive document filtering.

Arampatzis, A.; Kamps, J.; and Robertson, S. 2009. Where to stop reading a ranked list? Threshold optimization using truncated score distributions. In SIGIR, 524–531.

Arampatzis, A.; and van Hameran, A. 2001. The score-distributional threshold optimization for adaptive binary classification tasks. In SIGIR, 285–293.

Ba, J. L.; Kiros, J. R.; and Hinton, G. E. 2016. Layer normalization. arXiv:1607.06450.

Bahri, D.; Tay, Y.; Zheng, C.; Metzler, D.; and Tomkins, A. 2020. Choppy: Cut Transformer for Ranked List Truncation. arXiv:2004.13012.

Bajaj, P.; Campos, D.; Craswell, N.; Deng, L.; Gao, J.; Liu, X.; Majumder, R.; McNamara, A.; Mitra, B.; Nguyen, T.; et al. 2016. MS MARCO: A human generated machine reading comprehension dataset. arXiv:1611.09268.

Burges, C.; Shaked, T.; Renshaw, E.; Lazier, A.; Deeds, M.; Hamilton, N.; and Hullender, G. 2005. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning, 89–96.

Culpepper, J. S.; Clarke, C. L.; and Lin, J. 2016. Dynamic cutoff prediction in multi-stage retrieval systems. In Proceedings of the 21st Australasian Document Computing Symposium, 17–24.

Culpepper, J. S.; Diaz, F.; and Smucker, M. D. 2018. Research frontiers in information retrieval: Report from the third strategic workshop on information retrieval in Lorne (SWIRL 2018). In ACM SIGIR Forum, 34–90.

Dai, Z.; Xie, Q.; and Hovy, E. 2018. From credit assignment to entropy regularization: Two new algorithms for neural sequence prediction. arXiv:1804.10974.

Dempster, A. P.; Laird, N. M.; and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society.

Fan, Y.; Guo, J.; Lan, Y.; Xu, J.; Zhai, C.; and Cheng, X. 2018. Modeling diverse relevance patterns in ad-hoc retrieval. In SIGIR, 375–384.

Graves, A. 2013. Generating sequences with recurrent neural networks. arXiv:1308.0850.

Guo, J.; Fan, Y.; Ai, Q.; and Croft, W. B. 2016. A deep relevance matching model for ad-hoc retrieval. In CIKM.

Järvelin, K.; and Kekäläinen, J. 2002. Cumulated gain-based evaluation of IR techniques. TOIS.

Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980.

Li, H.; Zhu, J.; Zhang, J.; and Zong, C. 2018. Ensure the correctness of the summary: Incorporate entailment knowledge into abstractive sentence summarization. In COLING, 1430–1441.

Lien, Y.-C.; Cohen, D.; and Croft, W. B. 2019. An Assumption-Free Approach to the Dynamic Truncation of Ranked Lists. In ICTIR, 79–82.

Lupu, M.; Hanbury, A.; et al. 2013. Patent retrieval. Foundations and Trends in Information Retrieval.

Ma, X.; Yin, P.; Liu, J.; Neubig, G.; and Hovy, E. 2017. Softmax Q-distribution estimation for structured prediction: A theoretical interpretation for RAML. arXiv:1705.07136.

Manmatha, R.; Rath, T.; and Feng, F. 2001. Modeling score distributions for combining the outputs of search engines. In SIGIR, 267–275.

Norouzi, M.; Bengio, S.; Jaitly, N.; Schuster, M.; Wu, Y.; Schuurmans, D.; et al. 2016. Reward augmented maximum likelihood for neural structured prediction. In Advances in Neural Information Processing Systems, 1723–1731.

Qin, T.; Liu, T.-Y.; Xu, J.; and Li, H. 2010. LETOR: A benchmark collection for research on learning to rank for information retrieval. Information Retrieval.

Ranzato, M.; Chopra, S.; Auli, M.; and Zaremba, W. 2015. Sequence level training with recurrent neural networks. arXiv:1511.06732.

Robertson, S. E.; and Walker, S. 1994. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In SIGIR'94, 232–241. Springer.

Sanderson, M. 2010. Review of: Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. ISBN-13 978-0-521-86571-5, xxi+482 pages. Natural Language Engineering.

In IWSLT.

Tay, Y.; Tuan, L. A.; and Hui, S. C. 2018. Cross temporal recurrent networks for ranking question answer pairs. In Thirty-Second AAAI Conference on Artificial Intelligence.

Tomlinson, S.; Oard, D. W.; Baron, J. R.; and Thompson, P. 2007. Overview of the TREC 2007 Legal Track. Citeseer.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.

Voorhees, E. 1999. The TREC-8 Question Answering Track Report. In Proceedings of the 8th Text REtrieval Conference.

In ACL, 707–712.

Wang, L.; Lin, J.; and Metzler, D. 2011. A cascade ranking model for efficient ranked retrieval. In SIGIR, 105–114.

Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning.