Plackett-Luce model for learning-to-rank task
A PREPRINT
Tian Xia
Wright State University
Shaodan Zhai
Wright State University
Shaojun Wang
Wright State University
ABSTRACT
List-wise learning-to-rank methods are generally expected to outperform point-wise and pair-wise methods. In real-world applications, however, the state-of-the-art systems do not come from the list-wise camp. In this paper, we propose a new non-linear algorithm within the list-wise framework ListMLE, which uses the Plackett-Luce (PL) loss. Our experiments are conducted on the two largest publicly available real-world datasets, Yahoo challenge 2010 and Microsoft 30K. This is the first time that a single-model list-wise system matches or surpasses state-of-the-art systems on real-world datasets.
The learning-to-rank task arises from real-world applications such as Google, Yahoo, and other search engines. A ranking system returns a set of documents and ranks them by their relevance to a user's query. Learning-to-rank techniques are also influencing traditional natural language processing applications, such as model parameter training [17] and non-linear feature extraction [33, 36].

Generally, ranking models fall into three methodologies based on how they model basic ranking objects. This categorization is not affected by how features are utilized, e.g., linear or non-linear features.

The first methodology, point-wise based, breaks the relationship between documents related to different queries [11, 12, 14, 23], and then uses traditional machine learning regression and classification techniques for training. For example, MART [14] uses regression trees to fit model outputs to relevance scores; McRank [23] converts the ranking procedure into a multi-class classification.

The second methodology, pair-wise based, considers the relationship among documents related to the same query [10, 13, 15, 16, 19, 30, 32, 37, 40], and then adopts mature classification techniques to minimize the number of inversions over document pairs. For example, RankBoost [13] plugs the exponential loss of document pairs into the AdaBoost framework; RankSVM [16, 19] uses an SVM to perform binary classification on document pairs; LambdaRank [30] and LambdaMART [40] take into account the influence of a correctly classified document pair on the objective measures, and have been very successful.

The third methodology, list-wise based, treats a permutation of a set of documents as the basic unit, and builds loss functions on permutations [6, 25, 31, 34, 41, 42, 43, 44]. Because the exact losses of performance measures are step-wise, non-differentiable, and non-convex with respect to model parameters, most work in this methodology resorts to suitable surrogate functions. These surrogate functions are either not directly related to ranking performance measures [6, 29, 41, 42], or are continuous and differentiable approximation bounds of ranking measures [7, 9, 20, 27, 35, 39, 40, 44, 43, 45]. To further narrow the gap between optimization objectives and performance measures, some work attempts to directly optimize the objective measures and shows promising results. For example, in [25, 34], the authors use a coordinate ascent framework to directly optimize performance measures, and DirectRank in [34] is much faster in practice. However, neither approach can match the state-of-the-art systems on large data sets when decision trees are used (footnote 1).
1. Tan et al. [34] use a mixed strategy, which borrows boosted trees generated from MART, to compete with LambdaMART. Their strategy should be treated as a system combination technique rather than a single ranking model.

arXiv [cs.IR], Sep.
Our work utilizes an elegant list-wise surrogate function called the Plackett-Luce (PL) loss, which was first proposed in 1975 [26] for horse gambling. Cao et al. [6] introduced it to the learning-to-rank task by using it to model the probability distribution of a set of documents given a query, where training minimizes the KL distance between the probability distribution of the ranking model and that of the ground truth. Later, Xia et al. [41, 42] provided a model called ListMLE, which instead maximizes the likelihood of ground-truth permutations defined by the PL loss. ListMLE can be viewed as a general framework for both linear and non-linear features; however, as its non-linear system has not been developed, we refer to its linear version as $l$-ListMLE hereafter. Because large-scale public datasets were not available until 2010, many properties of the PL loss were not revealed in $l$-ListMLE. Even though $l$-ListMLE performs well on some datasets, it is rather unstable in many other cases, especially when compared with direct optimization based models, e.g., DirectRank [34] and LambdaRank [30]. Although not necessarily the best, DirectRank and LambdaRank often show reasonably good performance, while $l$-ListMLE under some circumstances performs far below average. For example, on the Microsoft 30K data, the largest publicly available real-world dataset, $l$-ListMLE is approximately 7.6 points worse than the coordinate ascent based method [25] in terms of NDCG scores. Although Xia et al. [41] further proved that the PL loss is consistent with NDCG@K under certain assumptions, it is not guaranteed to achieve reasonable performance in practical applications that use data sets of limited size, and this unstable behavior greatly limits widespread real-world application of the ListMLE model.

Understanding why the PL loss fails on some datasets is important for designing more effective algorithms. We therefore conduct experiments to analyze these datasets, and identify one principle as the condition for the PL loss: compared to the average number of documents per query, the number of features should be large enough. Therefore, in order to gain better performance, we have to use more features for the PL loss. There are several ways to enrich the features of datasets: kernel mapping, neural network mapping, and gradient boosting. We select gradient boosting with decision trees as weak rankers in this work because it allows a convenient comparison with LambdaMART, and leave the others for further work. A merit of the PL loss is its concise formula for computing functional gradients, Eqn. (9), which results in our ranking system, called PLRank.

As suggested in [8], real-world datasets are closer to the scenario of search engine applications and have much smaller fluctuations in terms of performance. We conduct experiments on two publicly released real-world datasets. As far as we know, these datasets are larger than any used in previous research papers, except [40] (footnote 2). To compare with other list-wise methods, we also extend three additional consistent list-wise surrogate functions from [31] to the gradient boosting framework. We find that PLRank not only maintains the merits of the PL loss, but also greatly alleviates the instability problem of $l$-ListMLE. PLRank has the same time complexity as LambdaMART, and is $M$ times as fast as McRank (footnote 3).

Given a set of queries $Q = \{q_1, \dots, q_{|Q|}\}$, each query $q_i$ is associated with a set of candidate relevant documents $D_i = \{d^i_1, \dots, d^i_{|D_i|}\}$ and a corresponding vector of relevance scores $r_i = \{r^i_1, \dots, r^i_{|D_i|}\}$. The relevance score is usually an integer, and a greater value means the document is more related to the query. An $M$-dimensional feature vector $h(d) = [h_1(d|q), \dots, h_M(d|q)]^T$ is created for each query-document pair, where the $h_t(\cdot)$ are predefined real-valued feature functions.

A ranking function $f$ scores each query-document pair, and returns the documents associated with the same query in sorted order. Since these documents have a fixed ground-truth rank, our goal is to learn an optimal ranking function that returns results as close to the ground-truth rank as possible.

Generally, ranking functions use either only linear information of the original features $h(d|q)$ or their nonlinear information. The linear form is $f(d|q) = w^T \cdot h(d)$, where $w = [w_1, \dots, w_M]^T \in \mathbb{R}^M$ is the model parameter. The nonlinear form often adopts regression trees, kernel techniques, or neural networks.

Several measures have been used to quantify the quality of a rank, such as NDCG@K, ERR, and MAP. In this paper, we use the most popular ones, NDCG@K and ERR [8], as the performance measures.
2. They adopted a larger but proprietary one.
3. $M$ is the number of different relevance scores in measuring a document.
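The linear ranking function and the NDCG@K measure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the commonly used DCG form with gain $2^r - 1$ and a $\log_2$ position discount, and the weights, features, and relevance labels are hypothetical toy values.

```python
import math

def linear_score(w, h):
    """f(d|q) = w^T h(d) for one query-document feature vector h."""
    return sum(wi * hi for wi, hi in zip(w, h))

def dcg_at_k(relevances, k):
    """DCG@K with gain 2^r - 1 and a log2 position discount (assumed form)."""
    return sum((2 ** r - 1) / math.log2(pos + 2)
               for pos, r in enumerate(relevances[:k]))

def ndcg_at_k(scores, relevances, k):
    """NDCG@K of the ranking obtained by sorting documents by descending score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranked = [relevances[i] for i in order]
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical toy query: three documents with two features each.
w = [0.5, -1.0]
features = [[2.0, 0.5], [1.0, 1.0], [3.0, 0.0]]
relevances = [1, 0, 2]
scores = [linear_score(w, h) for h in features]
print(ndcg_at_k(scores, relevances, k=3))
```

In this toy case the scores order the documents exactly by relevance, so the ranking coincides with the ideal one; reversing the scores lowers NDCG@3.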
We review gradient boosting [14] as a general framework for function approximation using regression trees as the weak learners, which has been the most successful approach for learning-to-rank models.

Gradient boosting iteratively finds an additive predictor $f(\cdot) \in \mathcal{H}$ that minimizes a loss function $\mathcal{L}$. At the $t$-th iteration, a new weak learner $g_t(\cdot)$ is selected and added to the current predictor $f_t(\cdot)$ to construct a new predictor,

$$f_{t+1}(\cdot) = f_t(\cdot) + \alpha g_t(\cdot) \qquad (1)$$

where $\alpha$ is the learning rate.

In gradient boosting, according to the following squared loss, $g_t(\cdot)$ is chosen as the one most parallel to the pseudo-response $-\partial \mathcal{L} / \partial f_t(\cdot)$, which is the negative derivative of the loss function in functional space.

$$g_t(\cdot) = \arg\min_{g \in \mathcal{H}} \left\| -\frac{\partial \mathcal{L}}{\partial f_t(\cdot)} - g(\cdot) \right\|^2 \qquad (2)$$

To fit a regression tree, the data in each internal tree node is greedily split into two parts by minimizing Eqn. (2), and this procedure iterates recursively until a predefined condition is satisfied. This tree construction procedure is applicable to any differentiable loss function. The complexity of a regression tree is usually controlled by the tree height or the number of leaves. In learning to rank, the latter is more flexible, and is thus adopted in this work by default.

The Plackett-Luce model was first proposed by Plackett [26] to predict the ranks of horses in gambling. Consider a horse racing game with five horses. Suppose there is a probability distribution $P$ on their abilities to win a race; then a rank of these horses can be understood as a generative procedure. Suppose we want to know the probability of the top-3 rank $2, 3, 5$. The result can be computed as follows. For the 2nd horse to be the champion, the probability is $p_2$ among five candidates. For the 3rd horse to be the runner-up, the probability $p_3$ has to be normalized among the remaining four horses, which leads to $p_3 / (p_1 + p_3 + p_4 + p_5)$. For the 5th horse to finish third, its probability among the remaining three horses becomes $p_5 / (p_1 + p_4 + p_5)$. So the probability of the rank $2, 3, 5$ is the product of these three terms. It is not difficult to see that the most likely rank is the one in which all horses are ranked by their winning probability in descending order.

The key idea of the Plackett-Luce model is that the choice at the $i$-th position of a rank $\pi$ depends only on the candidates not chosen at previous positions.

In learning to rank, each training sample has been labeled with a relevance score, so the ground-truth permutation of documents related to the $i$-th query can be easily obtained and is denoted as $\pi_i$, where $\pi_i(j)$ denotes the index of the document in the $j$-th position of the ground-truth permutation. We note that $\pi_i$ need not be a full ranking, as we may only care about the top $K$ documents.

Consider a ranking function with linear features. The probability of a document $d^i_e$ in the set of candidate relevant documents $D_i$ associated with a query $q_i$ is defined as

$$p(d^i_e) = \frac{\exp\{h(d^i_e)^T \cdot w\}}{\sum_{d \in D_i} \exp\{h(d)^T \cdot w\}} \qquad (3)$$

The probability that the Plackett-Luce model generates a rank $\pi_i$ is given as

$$p(\pi_i, w) = \prod_{j=1}^{|\pi_i|} p(d^i_{\pi_i(j)} \mid C_{i,j}), \qquad p(d^i_e \mid C_{i,j}) = \frac{p(d^i_e)}{\sum_{d \in C_{i,j}} p(d)} \qquad (4)$$

where $C_{i,j} = D_i - \{d^i_{\pi_i(1)}, \dots, d^i_{\pi_i(j-1)}\}$.

The training objective is to maximize the log-likelihood of all expected ranks over all queries and retrieved documents with corresponding ranks in the training data, with a zero-mean and unit-variance Gaussian prior on $w$.

$$\mathcal{L} = \log\Big\{\prod_i p(\pi_i, w)\Big\} - \frac{1}{2} w^T w \qquad (5)$$
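The generative view of the Plackett-Luce model can be checked numerically. The sketch below computes the probability of a top-k ranking by normalizing over the items not yet placed, and the corresponding log-likelihood when $p(d)$ is a softmax of model scores as in Eqn. (3) and Eqn. (4). The function names and the horse abilities are hypothetical, chosen only for illustration.

```python
import math

def pl_top_k_probability(abilities, ranking):
    """Probability that the Plackett-Luce model generates `ranking`, a top-k
    prefix of item indices, choosing each position among items not yet placed."""
    remaining = set(range(len(abilities)))
    prob = 1.0
    for item in ranking:
        total = sum(abilities[i] for i in remaining)
        prob *= abilities[item] / total
        remaining.remove(item)
    return prob

def pl_log_likelihood(scores, ranking):
    """Log-probability of `ranking` when p(d) is the softmax of model scores."""
    remaining = set(range(len(scores)))
    ll = 0.0
    for item in ranking:
        m = max(scores[i] for i in remaining)  # shift for numerical stability
        log_norm = m + math.log(sum(math.exp(scores[i] - m) for i in remaining))
        ll += scores[item] - log_norm
        remaining.remove(item)
    return ll

# Hypothetical winning probabilities for five horses (0-based indices).
p = [0.1, 0.3, 0.2, 0.15, 0.25]
top3 = [1, 2, 4]  # the 2nd, 3rd and 5th horse
prob = pl_top_k_probability(p, top3)
print(prob)  # p2 * p3/(p1+p3+p4+p5) * p5/(p1+p4+p5)
```

Using scores equal to the logarithm of the abilities makes the two functions agree, since the softmax over the remaining items then reduces to the normalized abilities.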
The gradient can be calculated as follows,

$$\frac{\partial \mathcal{L}}{\partial w} = \sum_i \sum_j \Big\{ h(d^i_{\pi_i(j)}) - \sum_{d \in C_{i,j}} h(d) \cdot p(d \mid C_{i,j}) \Big\} - w$$

Since the log-likelihood function is smooth, differentiable, and concave in the weight vector $w$, a global optimum is guaranteed.

In this paper, we build ensemble regression trees for the Plackett-Luce loss in the gradient boosting framework; Alg. 1 summarizes the main procedure. We first describe how to compute the pseudo-response and the output value for fitting a regression tree, and then provide more analysis of this new model.

At the $t$-th iteration, all fitted regression trees constitute the current predictor $f_t(\cdot)$, and Eqn. (3) can be rewritten as

$$p(d^i_e) = \frac{\exp\{f_t(d^i_e)\}}{\sum_{k=1}^{|D_i|} \exp\{f_t(d^i_k)\}} \qquad (6)$$

We limit $|\pi| = K$, and adopt Eqn. (5) without the normalization term as our objective (footnote 4). Plugging Eqn. (6) into Eqn. (5), and taking the derivative with respect to $f_t(\cdot)$, we obtain

$$\mathcal{L}'(f_t(d)) = I(d \in \text{top } K \text{ ground-truth}) - \sum_{C \text{ s.t. } d \in C} p(d \mid C) \qquad (7)$$

where $I(\cdot)$ denotes the indicator function. When $I(\cdot)$ returns 0 for the current document, the size of $\{C\}$ equals $K$; otherwise it is at most $K$.

We follow Eqn. (2) to fit a regression tree $g_t(\cdot)$. Denote the documents falling in the leaf $U$ as $U_d$. We set the output of the leaf $U$ as $g_t(d \in U_d) = -v$, where $v$ is optimized independently of the other leaves. Following Eqn. (1), we construct $f_{t+1}(\cdot)$ for the documents in $U_d$.

We adjust $v$ to maximize the log-likelihood $\mathcal{L}$, so $\mathcal{L}$ is reinterpreted as a function of $v$. We rewrite Eqn. (6) as

$$p(d^i_e) = \frac{\exp\{f_t(d^i_e) - I(d^i_e \in U_d) \cdot \alpha v\}}{\sum_{k=1}^{|D_i|} \exp\{f_t(d^i_k) - I(d^i_k \in U_d) \cdot \alpha v\}} \qquad (8)$$

By the Newton method, we have $v = \mathcal{L}'(v=0) / \mathcal{L}''(v=0)$, where

$$\mathcal{L}'(v=0) = \sum_{d \in U_d} \mathcal{L}'(f_t(d)), \qquad \mathcal{L}''(v=0) = \sum_C p' \cdot (p' - 1), \qquad p' = \sum_{d \in U_d \cap C} p(d \mid C) \qquad (9)$$

To clarify this procedure, we take one query with four related documents as an example. Suppose the four documents $d_1, d_2, d_3, d_4$ are sorted in descending order of their relevance scores. In other words, the ground-truth permutation is $d_1, d_2, d_3, d_4$. For brevity, let their scores from the current predictor $f_t(\cdot)$ after some iterations be $s_1, s_2, s_3, s_4$ respectively. Considering the top 2 documents of the ground-truth permutation, the log-likelihood is

$$\mathcal{L} = s_1 - \log\{\exp s_1 + \exp s_2 + \exp s_3 + \exp s_4\} + s_2 - \log\{\exp s_2 + \exp s_3 + \exp s_4\}$$

Taking derivatives with respect to the scores, we obtain

$$\mathcal{L}'(s_1) = 1 - p(s_1 \mid s_1, s_2, s_3, s_4)$$
$$\mathcal{L}'(s_2) = 1 - p(s_2 \mid s_1, s_2, s_3, s_4) - p(s_2 \mid s_2, s_3, s_4)$$
$$\mathcal{L}'(s_3) = 0 - p(s_3 \mid s_1, s_2, s_3, s_4) - p(s_3 \mid s_2, s_3, s_4)$$
$$\mathcal{L}'(s_4) = 0 - p(s_4 \mid s_1, s_2, s_3, s_4) - p(s_4 \mid s_2, s_3, s_4)$$

In this toy example, the samples $s_3, s_4$ have $K = 2$ contextual probabilities.

Suppose $s_1, s_3$ fall into the same leaf of a regression tree; then

$$\mathcal{L}'(v=0) = 1 - p(s_1 \mid C_1) + 0 - \{p(s_3 \mid C_1) + p(s_3 \mid C_2)\}$$
$$\mathcal{L}''(v=0) = (p(s_1 \mid C_1) + p(s_3 \mid C_1)) \cdot (p(s_1 \mid C_1) + p(s_3 \mid C_1) - 1) + p(s_3 \mid C_2) \cdot (p(s_3 \mid C_2) - 1)$$

where $C_1 = \{s_1, s_2, s_3, s_4\}$ and $C_2 = \{s_2, s_3, s_4\}$.

In the following, we describe more details of Alg. 1 that relate to the initialization of models (line 1) and the selection of ground-truth permutations (lines 3-4).
4. The model complexity of regression trees is often controlled by the learning rate $\alpha$, different from the normalization factor used in a linear model.
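The functional gradient of Eqn. (7) and the Newton leaf value of Eqn. (9) can be traced on a four-document toy query. The sketch below is an illustration under stated assumptions rather than the paper's code: all current scores are set to zero, the leaf is assumed to contain documents 1 and 3, and the helper names are hypothetical.

```python
import math

def contexts(perm, k):
    """Candidate sets C_1..C_k: the documents not chosen at earlier positions."""
    return [perm[j:] for j in range(k)]

def cond_prob(scores, d, ctx):
    """p(d | C): softmax of the current scores restricted to the context C."""
    return math.exp(scores[d]) / sum(math.exp(scores[c]) for c in ctx)

def functional_gradient(scores, perm, k):
    """Eqn. (7): indicator(d in top K) minus the sum of p(d | C) over C containing d."""
    grad = {}
    for d in perm:
        in_top_k = 1.0 if d in perm[:k] else 0.0
        grad[d] = in_top_k - sum(cond_prob(scores, d, c)
                                 for c in contexts(perm, k) if d in c)
    return grad

def leaf_value(scores, perm, k, leaf_docs):
    """Eqn. (9): v = L'(v=0) / L''(v=0) for the documents falling in one leaf."""
    grad = functional_gradient(scores, perm, k)
    first = sum(grad[d] for d in leaf_docs)
    second = 0.0
    for c in contexts(perm, k):
        p_leaf = sum(cond_prob(scores, d, c) for d in leaf_docs if d in c)
        second += p_leaf * (p_leaf - 1.0)
    return first / second  # assumes the leaf does not cover every context entirely

scores = {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}  # hypothetical current scores
perm = [1, 2, 3, 4]                         # ground-truth permutation
grad = functional_gradient(scores, perm, k=2)
v = leaf_value(scores, perm, k=2, leaf_docs=[1, 3])
print(grad, v)
```

With equal scores the gradients sum to zero over the query, and the leaf output $-v$ is positive, which raises the scores of the leaf containing the top-ranked document.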
Algorithm 1 PLRank

Require: Documents $D = \{D_1, D_2, \dots\}$; $K$ defines the top $K$ documents of a ground-truth rank; $T$ defines the number of regression trees; $L$ defines the number of leaves; $\alpha$ defines the learning rate.

 1: $f_1(\cdot) \leftarrow$ BackGroundModel$(\cdot)$    ▷ Initialization for model adaptation. None by default.
 2: for $D_i$ in $D$ do
 3:     Randomly shuffle $D_i$
 4:     Sort $D_i$ by relevance.    ▷ We could build several ground-truth permutations.
 5: end for
 6: for $t = 1$ to $T$ do
 7:     Resp$(d \in \bigcup D_i) \leftarrow -\mathcal{L}'(f_t(d))$    ▷ Compute the pseudo-response following Eqn. (7).
 8:     Fit an $L$-leaf tree $g_t$ on Resp.    ▷ By Eqn. (2) by default.
 9:     for leaf $U$ in $g_t$ do
10:         $v \leftarrow \mathcal{L}'(v=0) / \mathcal{L}''(v=0)$    ▷ Set the output of the current leaf by Eqn. (9).
11:         $g_t(d \in U_d) \leftarrow -v$
12:     end for
13:     $f_{t+1} \leftarrow f_t + \alpha g_t$    ▷ Eqn. (1)
14: end for
15: return $f_{T+1}$

As a statistical model is sensitive to data genres, a trivial yet effective remedy is to use more data for training. Borrowing the idea from adaptive LambdaMART [40], our model could also first train a background model on plenty of general-genre data. We would then use the resulting model to initialize Alg. 1 (line 1), and continue to train our model on the target-genre data. In this paper, we do not focus on adaptation experiments, and we initialize to zero.
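The boosting loop of Alg. 1 can be sketched in a few dozen lines. This is a simplified illustration, not the authors' C++ system: it assumes two-leaf stumps in place of $L$-leaf trees, a single ground-truth permutation with ties kept in input order (no random shuffle), and a zero initial model. The stump is fit to $\mathcal{L}'$ rather than $-\mathcal{L}'$, which yields the same squared-error splits. All function names and the toy data are hypothetical.

```python
import math

def top_k_contexts(relevances, k):
    """Ground-truth permutation (ties in input order) and its first k candidate sets."""
    perm = sorted(range(len(relevances)), key=lambda i: -relevances[i])
    k = min(k, len(perm))
    return perm[:k], [perm[j:] for j in range(k)]

def gradients(scores, top, ctxs):
    """Eqn. (7) for one query: L'(f(d)) for every document d."""
    grad = [1.0 if d in top else 0.0 for d in range(len(scores))]
    for ctx in ctxs:
        z = sum(math.exp(scores[d]) for d in ctx)
        for d in ctx:
            grad[d] -= math.exp(scores[d]) / z
    return grad

def fit_stump(rows, targets):
    """Two-leaf regression tree: best (feature, threshold) under squared error.
    Assumes at least one feature takes two distinct values."""
    best = None
    for f in range(len(rows[0])):
        for thr in sorted({r[f] for r in rows})[:-1]:
            left = [t for r, t in zip(rows, targets) if r[f] <= thr]
            right = [t for r, t in zip(rows, targets) if r[f] > thr]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            err = sum((t - ml) ** 2 for t in left) + sum((t - mr) ** 2 for t in right)
            if best is None or err < best[0]:
                best = (err, f, thr)
    return best[1], best[2]

def plrank(queries, k=2, trees=20, alpha=0.1):
    """queries: list of (feature_rows, relevances). Returns per-query scores and stumps."""
    scores = [[0.0] * len(rel) for _feats, rel in queries]
    truth = [top_k_contexts(rel, k) for _feats, rel in queries]
    rows = [r for feats, _rel in queries for r in feats]
    model = []
    for _round in range(trees):
        grads = [gradients(s, top, ctxs) for s, (top, ctxs) in zip(scores, truth)]
        f, thr = fit_stump(rows, [g for gq in grads for g in gq])
        outputs = []
        for is_left in (True, False):  # left leaf first, then right leaf
            first = second = 0.0
            for (feats, _rel), s, g, (_top, ctxs) in zip(queries, scores, grads, truth):
                leaf = {d for d, r in enumerate(feats) if (r[f] <= thr) == is_left}
                first += sum(g[d] for d in leaf)
                for ctx in ctxs:
                    z = sum(math.exp(s[d]) for d in ctx)
                    p = sum(math.exp(s[d]) for d in ctx if d in leaf) / z
                    second += p * (p - 1.0)
            v = first / second if second < -1e-12 else 0.0  # Eqn. (9)
            outputs.append(-v)                              # g_t(d in U_d) = -v
        model.append((f, thr, outputs[0], outputs[1]))
        for (feats, _rel), s in zip(queries, scores):       # Eqn. (1)
            for d, r in enumerate(feats):
                s[d] += alpha * (outputs[0] if r[f] <= thr else outputs[1])
    return scores, model

# Hypothetical toy data: one feature that grows with relevance.
queries = [([[0.1], [0.9], [0.5]], [0, 2, 1]),
           ([[0.8], [0.2], [0.4]], [2, 0, 1])]
scores, model = plrank(queries, k=2, trees=20, alpha=0.1)
print(scores)
```

On this toy data the learned scores order each query's documents by relevance. A realistic system would replace the stump with an $L$-leaf tree and a much faster split search.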
In learning to rank, the relevance scores are scattered among a limited set of integers, e.g., 0 to 10 inclusive, so there are many ties in the scores. This affects the determination of ideal permutations and our training objective.

We consider multiple ground-truth permutations (looping lines 2-5 in Alg. 1). Let the toy documents be $d_1, d_2, d_3, d_4$ with their relevance scores, and consider the top 3 ground-truth documents. As the number of all possible permutations is huge, we randomly select several ground-truth ranks and store them compactly in terms of data structure. For instance, the ground-truth permutation $d_1, d_2, d_3, d_4$ consists of three contextual terms, $C_1 = \{d_1, d_2, d_3, d_4\}$, $C_2 = \{d_2, d_3, d_4\}$, $C_3 = \{d_3, d_4\}$, while adding a second permutation $d_1, d_3, d_2, d_4$ leads to merely one extra term $C_4 = \{d_2, d_4\}$, rather than three new terms. The statistics about this issue are in Table 3. We use PLRank(obj=num) to denote the number of objectives used.

Regarding linear features, Xia et al. [41, 42] adopt a neural network to maximize the log-likelihood of expected ranks. The neural network works well on small datasets, e.g., LETOR, but it also requires suitable settings for the hidden layer structure and the number of hidden neurons.

As our experiments are conducted on real-world datasets, we instead use L-BFGS [4] for parameter tuning to gain faster convergence. We observe that overfitting often occurs on small data sets, while on large datasets the log-likelihood correlates with ranking measures very well.

Regarding non-linear features, kernel techniques could map them into a linear form in a high-dimensional space, and then the neural network based training of Xia et al. or L-BFGS would be applicable, provided that the new dimension is acceptable in practice. However, in the case of regression trees, it is impractical to expand all dimensions, which is why we propose our new algorithm. We follow the boosting framework, which iteratively fits high-quality decision trees, to maximize the objective log-likelihood.
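The compact storage of several ground-truth permutations mentioned above can be sketched by counting distinct candidate sets. This is only one possible data structure, assumed here for illustration: each context is stored as a frozen set with a multiplicity, so a context shared by several permutations is kept once. The second permutation is hypothetical and assumes that $d_2$ and $d_3$ are tied.

```python
from collections import Counter

def context_terms(permutations, k):
    """Count each distinct candidate set C over the top-k positions of the permutations."""
    terms = Counter()
    for perm in permutations:
        for j in range(k):
            terms[frozenset(perm[j:])] += 1
    return terms

first = ["d1", "d2", "d3", "d4"]
second = ["d1", "d3", "d2", "d4"]  # hypothetical: d2 and d3 assumed tied
one = context_terms([first], k=3)
both = context_terms([first, second], k=3)
print(len(one), len(both))
```

Because the normalizer of each Plackett-Luce factor depends only on the set of remaining documents, a shared context needs to be evaluated once and weighted by its count.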
Calauzènes et al. [5] have proven that no consistent surrogate function exists for ERR and MAP. However, regarding NDCG, Xia et al. [41] proved that the ListMLE model is consistent with NDCG@K. They also modified two other losses, cosine and KL divergence, to make them NDCG@K consistent. As Xia et al. have compared them in their work, we compare the PL loss with three other consistent versions proposed in [31], squared loss, cosine, and KL divergence, which were proved to be consistent with the whole list, in the case of boosted trees.

System            Yahoo 2010                            Microsoft 30K
                  NDCG@1   @3       @10      ERR        NDCG@1   @3       @10      ERR
PLRank(obj=1)     0.7210   0.7267   0.7885   0.4598     0.4947   0.4814   0.5045   0.3770
PLRank(obj=3)     0.7228   0.7290   0.7895              -        -        -        -
PLRank(obj=15)    0.7205   0.7291   0.7896   0.4601     -        -        -        -
1 LambdaMART      0.7160   0.7187   0.7809   0.4589     0.4942   0.4793   0.4995
l-ListMLE         0.7017   0.7014   0.7673   0.4520     0.3838   0.3880   0.4230   0.3234

Table 1: … $l$-ListMLE, thus it is less meaningful to list them.

We pay special attention to the squared loss since it has three different implementations. Let $s$ denote the score vector of all documents, $r$ denote the corresponding relevance vector, and $G(r) = 2^r - 1$. The consistent and inconsistent equations in terms of squared loss in [31] are

$$\phi^{\text{consistent}}_{sq}(s, r) = \left\| s - \frac{G(r)}{\|G(r)\|_D} \right\|^2 \qquad (10)$$

and

$$\phi^{\text{inconsistent}}_{sq}(s, r) = \| s - G(r) \|^2 \qquad (11)$$

where the norm $\|\cdot\|_D$ defines the DCG value of a ground-truth permutation per query.

A third equation, from [11], is also inconsistent with NDCG.

$$\phi^{\text{inconsistent}}_{sq}(s, r) = \| s - r \|^2 \qquad (12)$$

All boosting systems with the least-squares loss are called MART in this paper. The two inconsistent versions are point-wise based, and the consistent one is list-wise based since the norm $\|\cdot\|_D$ is computed per query. We omit a detailed discussion of the functional gradients for all surrogates above due to space limitations.

We studied the performance of the proposed algorithm on two real-world datasets, Yahoo challenge 2010 and Microsoft 30K. We implemented 9 baseline ranking systems in C++, which use boosted trees as features. System 1 is LambdaMART. System 2 is McRank. System 3 is MART-1, the first inconsistent version of MART (Eqn. (11)). System 4 is MART-2, the second inconsistent version of MART (Eqn. (12)). System 5 is c-MART-1, a consistent version of MART-1 (Eqn. (10)). System 6 is CosMART, an inconsistent version of the cosine distance loss with boosted trees. System 7 is c-CosMART, a consistent version of CosMART. System 8 is KLMART, a MART using the KL distance. System 9 is c-KLMART, a consistent version of KLMART.
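The three squared-loss variants of Eqn. (10) to Eqn. (12) differ only in the regression target. The sketch below makes that concrete for a single query. It assumes that $\|G(r)\|_D$ is the DCG of the ideal ordering with a $\log_2$ position discount, which is an interpretation of the text rather than a formula taken from [31]; the function names are hypothetical.

```python
import math

def gain(r):
    """G(r) = 2^r - 1."""
    return 2 ** r - 1

def dcg_norm(relevances):
    """||G(r)||_D, assumed here to be the DCG of the ideal (descending) ordering."""
    ideal = sorted(relevances, reverse=True)
    return sum(gain(r) / math.log2(pos + 2) for pos, r in enumerate(ideal))

def sq_consistent(scores, relevances):
    """Eqn. (10): targets are gains normalized per query, hence list-wise."""
    norm = dcg_norm(relevances) or 1.0  # avoid dividing by zero when all gains are 0
    return sum((s - gain(r) / norm) ** 2 for s, r in zip(scores, relevances))

def sq_inconsistent_gain(scores, relevances):
    """Eqn. (11): targets are the raw gains G(r), point-wise."""
    return sum((s - gain(r)) ** 2 for s, r in zip(scores, relevances))

def sq_inconsistent_raw(scores, relevances):
    """Eqn. (12): targets are the relevance labels themselves, point-wise."""
    return sum((s - r) ** 2 for s, r in zip(scores, relevances))

rel = [2, 0, 1]  # hypothetical relevance labels for one query
norm = dcg_norm(rel)
print(norm, sq_consistent([0.5, 0.0, 0.2], rel))
```

Only the first variant needs the whole query to compute its target, which is why it is the list-wise one of the three.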
Moreover, in order to compare tree features and linear features, we add two linear systems. System 10 is based on a heuristic coordinate ascent (CA) optimization [25], which uses linear features and optimizes NDCG directly. CA is used as a reference system to represent the average performance of linear systems because of its relatively stable and good performance among a variety of linear models on different datasets, including the datasets used in this work, as shown in the experiments of Tan et al. [34]. This system is akin to the one proposed by Tan et al., but the latter is an exact coordinate ascent optimization ranking method. We also use the experimental results in [34] as a reference here. System 11 is $l$-ListMLE, which optimizes the top retrieved documents.

We set up the same parameters as in [40] for all systems. The learning rate $\alpha$ in the update of Eqn. (1) is 0.1. We set the number of decision tree leaves to 30, which is a classic setting. As McRank requires more iterations to converge on real-world datasets, we use 2500 boosted trees as its final model, and 1000 boosted trees for the other systems. Regarding PLRank, as we mainly concentrate on NDCG@10, we set $K$ to 10 to optimize the top documents of ground-truth permutations. All results are reported with NDCG@(1, 3, 10) and ERR scores.

In order to examine the industry-level performance of our system, we exhaustively search parameters to compare with the Yahoo Challenge results [8] in Table 5.

The LETOR benchmark datasets released in 2007 [28] have significantly boosted the development of learning-to-rank algorithms, since researchers could compare their algorithms on the same datasets for the first time. Unfortunately, the sizes of the datasets in LETOR are several orders of magnitude smaller than the ones used by search engine companies. Several researchers have noticed that the conclusions drawn from experiments based on LETOR datasets are unstable and quite different from the ones based on large real datasets [8].
Thus in this work, we attempt to make stable system comparisons by using datasets that are as large as possible, and we use two real-world datasets, Yahoo challenge 2010 and Microsoft 30K. The statistics of these data sets are reported in Table 2; they might be a bit different from those in [8], as we only give the statistics of the training datasets.

[Table 2: statistics of the training datasets]

$l$-ListMLE vs. Other Linear Systems

We first examine the performance of $l$-ListMLE (System 11) compared to another linear system, CA (System 10). Their results are shown in Table 1 and Figure 1. $l$-ListMLE obtains 0.7673 in NDCG@10 on the Yahoo 2010 dataset after 100 iterations of quasi-Newton optimization, but performs unsatisfactorily on Microsoft 30K even after 1000 iterations, approximately 8 percent lower in NDCG@1, and several percent lower in other measures. Tan et al. [34] also compared several linear systems on these two datasets, but not $l$-ListMLE. Our implementation of $l$-ListMLE outperforms their best result, 0.760 from DirectRank, on the Yahoo dataset, while it performs significantly worse on Microsoft 30K.

[Figure 1: $l$-ListMLE and the selected linear reference system Coordinate Ascent (CA) on Yahoo data (left) and Microsoft 30K (right). CA is capable of representing the mainstream linear systems on these datasets [34].]

The unexpectedly bad performance of ListMLE on the larger dataset appears to contradict the proof from [41] that ListMLE is consistent with NDCG. In other words, ListMLE theoretically should perform better with more available data. The main reason may be that the features of Microsoft 30K are not rich enough to ensure the consistency of ListMLE. To verify this, we note that the features of the Yahoo 2010 data set are richer than those of Microsoft 30K, so we conduct experiments on the Yahoo 2010 dataset by adjusting the number of features and comparing the performance of $l$-ListMLE and CA. The results are shown in Figure 2. Since the features might not be independent of each other, the NDCG performance curves are not monotonic in the number of features. However, both figures have their own critical points, 200 for NDCG@1 and 100 for NDCG@10: when the number of features is beyond this point, $l$-ListMLE beats CA; otherwise it performs worse than CA.

[Figure 2: NDCG@1 and NDCG@10 versus the number of features for CA and $l$-ListMLE (PL-linear).]

Instead of using a linear feature model as in $l$-ListMLE, we need to increase the model capacity to gain more expressive power. Thus we decide to use decision trees as our basic weak learners, and we grow our model through gradient boosting that maximizes the likelihood of ground-truth ranks. The PL loss is not the only one that is consistent with NDCG; there are three other models proposed in [31] that are also consistent with it, so we extend these three models to boosted-tree versions for a full comparison.

[Table 3]

PLRank vs. $l$-ListMLE, MART, McRank and LambdaMART

Currently, the state-of-the-art learning-to-rank systems use boosted trees, which have been shown to be more powerful than systems using linear features on real-world datasets. The champion of Yahoo challenge 2010 is a system that combines approximately 12 models, most of which are trained with LambdaMART [3]. The other two state-of-the-art systems using trees are MART and McRank; one optimizes the least-squares loss and the other treats ranking as a multi-class classification.
[Figure 3: NDCG@1, NDCG@3, NDCG@10 and ERR on the Yahoo challenge and Microsoft 30K datasets for PLRank(obj=n), 1 LambdaMART, 2 McRank, 4 MART-2 and 11 $l$-ListMLE (PL-linear).]

PLRank outperforms $l$-ListMLE, which is a natural result as PLRank works in a more complex function space than the linear space. However, what surprises us is that on the Yahoo dataset the improvements are moderate, approximately 2 points in NDCG@(1, 3, 10), while on the Microsoft dataset they are significant, 8 to 10 points in NDCG@(1, 3, 10). On the one hand, boosted trees can indeed capture the dependency between features; on the other hand, they are especially effective for the PL loss when the features are not rich.

As shown in Figure 3 and Table 1, the tree-based systems clearly outperform the linear feature systems. Among tree-based systems, PLRank demonstrates moderate improvements over MART, McRank and LambdaMART on the Yahoo dataset, and on the Microsoft dataset all tree-based systems perform closely to each other.

McRank and PLRank are closer in six NDCG scores, except NDCG@1 on the Microsoft dataset. LambdaMART performs well in ERR; it is significantly better than McRank and MART, and close to PLRank(obj=1). Comparatively, the three PLRank variants behave more stably. As shown in Table 1, PLRank(obj=1) is always among the best two systems on all measures when it is compared with McRank, LambdaMART and MART. PLRank(obj=2) is considered to be the best in balancing performance and running time.

Two-tailed t-test results show that PLRank(obj=*) systems have significant improvements over others when their differences are greater than about 0.5 point at 95% confidence. Unfortunately, in Table 1, most of the improvements of PLRank(obj=*) are not significant, and are merely comparable to these state-of-the-art systems.

Our MART baseline results are close to those reported in [38]. Tan et al. [34] also used the same datasets to compare LambdaMART and MART, and their baselines are about 1 point lower in NDCG than our reported results. We note that their baselines are from RankLib, which is written in Java, while DirectRank is implemented in C++. In comparison, our 10 tree-based systems are re-implemented in C++ with an identical code template, so our systems should better reflect differences in models rather than differences in coding.

The list-wise methods discussed in Section 3.4 perform better than their inconsistent counterparts on the Yahoo dataset, although the differences are small. In contrast, it is reported in [31] that for all linear systems, the consistent versions improve the NDCG scores of the inconsistent counterparts by several points.

As shown in Table 1, these consistent methods, after being extended to boosted-tree versions, unfortunately have not shown competitive performance when compared with LambdaMART, McRank and PLRank, so we did not run them on the larger Microsoft dataset.
LambdaMART is a method that considers the NDCG loss in optimization, and McRank optimizes unnormalized NDCG, so we only need to further analyze the surrogate functions of PLRank and the three consistent versions, which are not directly related to NDCG. A plausible explanation is that the PL loss is consistent with NDCG@$K$, with $K$ set to 10, while the losses of c-MART-1, c-CosMART and c-KLMART are consistent with NDCG over the whole list. We conjecture that when we let $K$ go to the whole list, these systems would show advantages.

         MT    PR1   LM    PR3     PR6   MR
Hours    83    91    101   106.2   126   250+

Table 4: Running times on the Microsoft dataset. MT: MART. LM: LambdaMART. PR$n$: PLRank(obj=$n$). MR: McRank.

The computational costs of tree-based systems are mainly incurred at the stage of tree construction, so these systems have the same time complexity, except that McRank requires more iterations to reach reasonable performance. The running times of PLRank, MART, LambdaMART and McRank on the Microsoft dataset are shown in Table 4. Their differences are mainly due to the computation of functional gradients.

Last, in Table 5, we examine our PLRank system on the Yahoo Challenge set 1 data at an industry level. To save time, we use PLRank(obj=1) and search its parameters to obtain the best performance regardless of cost. We sweep the number of tree leaves from 100 to 1000 in steps of 100, and the learning rate $\alpha$ from 0.01 to 0.1 in steps of 0.03. We note that [3] did not actually release results of single LambdaMART systems on the standard test set, but on a self-defined test set. Since the final result of a LambdaMART-based system combination on the standard set is available, we reasonably estimate the performance of their single LambdaMART systems on the standard test set.

[Table 5]
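The parameter sweep described above can be enumerated as a small grid. The sketch below only builds the grid and selects the best configuration; `validate` is a hypothetical callable standing in for training a PLRank model and scoring it on held-out data, which the source does not specify.

```python
from itertools import product

# Leaves from 100 to 1000 in steps of 100; alpha from 0.01 to 0.1 in steps of 0.03.
leaf_counts = list(range(100, 1001, 100))
learning_rates = [round(0.01 + 0.03 * i, 2) for i in range(4)]
grid = list(product(leaf_counts, learning_rates))

def best_config(grid, validate):
    """Return the (leaves, alpha) pair with the highest validation score.
    `validate` is a hypothetical callable that trains and scores one configuration."""
    return max(grid, key=lambda cfg: validate(*cfg))

print(len(grid), learning_rates)
```

Rounding each learning rate to two decimals avoids accumulated floating-point drift in the step values.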
As a non-linear algorithm in the boosting framework, our proposed PLRank enriches the ListMLE framework. As far as we know, PLRank is the first list-wise ranking system that can match or outperform the well-known LambdaMART and McRank on real-world datasets in terms of NDCG and ERR.
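For readers who want a concrete picture of the PL loss that underlies ListMLE, the following is a minimal sketch of the standard Plackett-Luce negative log-likelihood of one ordering, with an optional top-k truncation. It is an illustration under stated assumptions, not the PLRank implementation: the function name is hypothetical, scores are plain Python floats listed in the target order (best first), and ties among relevance labels are not handled.

```python
import math

def pl_neg_log_likelihood(scores, k=None):
    """Negative log-probability, under the Plackett-Luce model, of the ordering
    in which `scores` are listed. With k set, only the first k positions count."""
    n = len(scores)
    k = n if k is None else min(k, n)
    loss = 0.0
    for i in range(k):
        tail = scores[i:]                 # documents not yet placed
        m = max(tail)                     # shift for a stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(s - m) for s in tail))
        loss += log_norm - scores[i]      # -log P(document i is chosen next)
    return loss

# Scores that agree with the target order give a lower loss than reversed scores.
loss_good = pl_neg_log_likelihood([2.0, 1.0, 0.0])
loss_bad = pl_neg_log_likelihood([0.0, 1.0, 2.0])
```

In a boosting setting, the functional gradient of such a loss with respect to each document score would be what the next tree is fitted to; that step is omitted here.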
References
[1] Mokhtar S Bazaraa, Hanif D Sherali, and CM Shetty. Nonlinear Programming: Theory and Algorithms, 3rd Edition. 2006.
[2] Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005.
[3] Christopher JC Burges, Krysta Marie Svore, Paul N Bennett, Andrzej Pastusiak, and Qiang Wu. Learning to rank using an ensemble of lambda-gradient models. JMLR-Proceedings Track, 2011.
[4] Richard H Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, (5), 1995.
[5] Clément Calauzènes, Nicolas Usunier, Patrick Gallinari, et al. On the (non-)existence of convex, calibrated surrogate losses for ranking. In NIPS, 2012.
[6] Zhe Cao, Tao Qin, Tie Yan Liu, Ming Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In ICML, 2007.
[7] Soumen Chakrabarti, Rajiv Khanna, Uma Sawant, and Chiru Bhattacharyya. Structured learning for non-smooth ranking losses. In SIGKDD, 2008.
[8] Olivier Chapelle and Yi Chang. Yahoo! learning to rank challenge overview. JMLR-Proceedings Track, 2011.
[9] Olivier Chapelle and Mingrui Wu. Gradient descent optimization of smoothed information retrieval metrics. Information Retrieval, (3), 2010.
[10] William W Cohen, Robert E Schapire, and Yoram Singer. Learning to order things. JAIR, 1999.
[11] David Cossock and Tong Zhang. Subset ranking using regression. In COLT. Springer, 2006.
[12] Koby Crammer, Yoram Singer, et al. Pranking with ranking. In NIPS, 2001.
[13] Yoav Freund, Raj Iyer, Robert E Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. JMLR, 2003.
[14] Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 2001.
[15] Tamir Hazan, Joseph Keshet, and David A Mcallester. Direct loss minimization for structured prediction. In NIPS, 2010.
[16] Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Support vector learning for ordinal regression. In ICANN, 1999.
[17] Mark Hopkins and Jonathan May. Tuning as ranking. In EMNLP, 2011.
[18] Jim C Huang and Brendan J Frey. Structured ranking learning using cumulative distribution networks. In NIPS, 2008.
[19] Thorsten Joachims. Optimizing search engines using clickthrough data. In SIGKDD, 2002.
[20] Quoc Le and Alexander Smola. Direct optimization of ranking measures. CoRR abs/0704.3359. Informal publication, 2007.
[21] Ping Li. Abc-boost: adaptive base class boost for multi-class classification. In ICML, 2009.
[22] Ping Li. Robust logitboost and adaptive base class (abc) logitboost. UAI, 2010.
[23] Ping Li, Qiang Wu, and Christopher J Burges. Mcrank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2007.
[24] Tie Yan Liu. Learning to rank for information retrieval. Foundations and Trends in IR, (3), 2009.
[25] Donald Metzler and W Bruce Croft. Linear feature-based models for information retrieval. Information Retrieval, 2007.
[26] Robin L Plackett. The analysis of permutations. Applied Statistics, 1975.
[27] Tao Qin, Tie Yan Liu, and Hang Li. A general approximation framework for direct optimization of information retrieval measures. Information Retrieval, (4), 2010.
[28] Tao Qin, Tie Yan Liu, Jun Xu, and Hang Li. Letor: A benchmark collection for research on learning to rank for information retrieval. Information Retrieval, (4), 2010.
[29] Tao Qin, Xu Dong Zhang, Ming Feng Tsai, De Sheng Wang, Tie Yan Liu, and Hang Li. Query-level loss functions for information retrieval. Information Processing & Management, (2), 2008.
[30] C Quoc and Viet Le. Learning to rank with nonsmooth cost functions. NIPS, 2007.
[31] Pradeep D Ravikumar, Ambuj Tewari, and Eunho Yang. On ndcg consistency of listwise ranking methods. In AISTATS, 2011.
[32] Cynthia Rudin. The p-norm push: A simple convex ranking algorithm that concentrates at the top of the list. JMLR, 2009.
[33] Artem Sokolov, Guillaume Wisniewski, and Francois Yvon. Non-linear n-best list reranking with few features. In AMTA, 2012.
[34] Ming Tan, Tian Xia, Lily Guo, and Shao Jun Wang. Direct optimization of ranking measures for learning to rank models. In SIGKDD, 2013.
[35] Michael Taylor, John Guiver, Stephen Robertson, and Tom Minka. Softrank: optimizing non-smooth rank metrics. In WSDM, 2008.
[36] Kristina Toutanova and Byung-Gyu Ahn. Learning non-linear features for machine translation using gradient boosting machines. In ACL, 2013.
[37] Ming Feng Tsai, Tie Yan Liu, Tao Qin, Hsin Hsi Chen, and Wei Ying Ma. Frank: a ranking method with fidelity loss. In SIGIR, 2007.
[38] Stephen Tyree, Kilian Q Weinberger, Kunal Agrawal, and Jennifer Paykin. Parallel boosted regression trees for web search ranking. In WWW, 2011.
[39] Hamed Valizadegan, Rong Jin, Ruofei Zhang, and Jianchang Mao. Learning to rank by optimizing ndcg measure. In NIPS, 2009.
[40] Qiang Wu, Christopher JC Burges, Krysta M Svore, and Jianfeng Gao. Adapting boosting for information retrieval measures. Information Retrieval, (3), 2010.
[41] Fen Xia, Tie Yan Liu, and Hang Li. Statistical consistency of top-k ranking. In NIPS, 2009.
[42] Fen Xia, Tie Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. Listwise approach to learning to rank: theory and algorithm. In ICML, 2008.
[43] Jun Xu and Hang Li. Adarank: a boosting algorithm for information retrieval. In SIGIR, 2007.
[44] Jun Xu, Tie Yan Liu, Min Lu, Hang Li, and Wei Ying Ma. Directly optimizing evaluation measures in learning to rank. In SIGIR, 2008.
[45] Yisong Yue, Thomas Finley, Filip Radlinski, and Thorsten Joachims. A support vector method for optimizing average precision. In