NeuralNDCG: Direct Optimisation of a Ranking Metric via Differentiable Relaxation of Sorting
Przemysław Pobrotyn and Radosław Białobrzeski
ML Research at Allegro.pl [email protected]
Abstract.
Learning to Rank (LTR) algorithms are usually evaluated using Information Retrieval metrics like Normalised Discounted Cumulative Gain (NDCG) or Mean Average Precision. As these metrics rely on sorting predicted items' scores (and thus, on items' ranks), their derivatives are either undefined or zero everywhere. This makes them unsuitable for gradient-based optimisation, which is the usual method of learning appropriate scoring functions. Commonly used LTR loss functions are only loosely related to the evaluation metrics, causing a mismatch between the optimisation objective and the evaluation criterion. In this paper, we address this mismatch by proposing NeuralNDCG, a novel differentiable approximation to NDCG. Since NDCG relies on the non-differentiable sorting operator, we obtain NeuralNDCG by relaxing that operator using NeuralSort, a differentiable approximation of sorting. As a result, we obtain a new ranking loss function which is an arbitrarily accurate approximation to the evaluation metric, thus closing the gap between the training and the evaluation of LTR models. We introduce two variants of the proposed loss function. Finally, the empirical evaluation shows that our proposed method outperforms previous work aimed at direct optimisation of NDCG and is competitive with the state-of-the-art methods.
Keywords:
Learning to Rank · ranking metric optimisation · NDCG approximation
1 Introduction

Ranking is the problem of optimising, conditioned on some context, the ordering of a set of items in order to maximise a given metric. The metric is usually an Information Retrieval (IR) criterion chosen to correlate with user satisfaction. Learning to Rank (LTR) is a machine learning approach to ranking, concerned with learning the function which optimises the items' order from supervised data. In this work, without loss of generality, we assume our set of items are search results and the context in which we want to optimise their order is the user query.

Essentially, one would like to learn a function from search results into permutations. Since the space of all permutations grows factorially in the size of the search results set, the task of learning such a function directly becomes intractable. Thus, most common LTR algorithms resort to the approach known as score & sort. That is, instead of directly learning the correct permutation of the search results, one learns a scoring function which predicts relevancies of individual items, in the form of real-valued scores. Items are then sorted in the descending order of the scores and the ordering thus produced is evaluated using an IR metric of choice. Typically, scoring functions are implemented as either gradient boosted trees [14] or Multilayer Perceptrons (MLP) [25]. Recently, there has been work in using the Transformer [28] architecture as a scoring function [21].

In order to learn a good scoring function, one needs a tagged dataset of query-search results pairs together with the ground truth relevancy of each search result (in the context of a given query), as well as a loss function. There has been extensive research into constructing appropriate loss functions for LTR (see [19] for an overview of the field). Such loss functions fall into one of three categories: pointwise, pairwise or listwise. Pointwise approaches treat the problem as a simple regression or classification of the ground truth relevancy for each individual search result, foregoing possible interactions between items. In pairwise approaches, pairs of items are considered as independent variables and the function is learned to correctly indicate the preference among the pair. Examples include RankNet [5], LambdaRank [6] or LambdaMART [7]. However, IR metrics consider entire search results lists at once, unlike pointwise and pairwise algorithms. This mismatch has motivated listwise approaches, which compute the loss based on the scores of the entire list of search results. Two popular listwise approaches are ListNet [8] and ListMLE [29].

What these loss functions have in common is that they are either not connected or only loosely connected to the IR metrics used in the evaluation. The performance of LTR models is usually assessed using Normalised Discounted Cumulative Gain (NDCG) [16] or Mean Average Precision (MAP) [1]. Since such metrics rely on sorting the ground truth labels according to the scores predicted by the scoring function, they are either not differentiable or flat everywhere and thus cannot be used for gradient-based optimisation of the scoring function.
As a result, there is a mismatch between the objectives optimised by the aforementioned pairwise or listwise losses and the metrics used for the evaluation, even though it can be shown that some of such losses provide upper bounds of IR measures [31], [30]. On the other hand, as demonstrated in [23], under certain assumptions on the class of the scoring functions, direct optimisation of IR measures on a large training set is guaranteed to achieve high test performance on the same IR measure. Thus, attempts to bridge the gap between LTR optimisation objectives and discontinuous evaluation metrics are an important research direction.

In this work, we propose a novel approach to directly optimise NDCG by approximating the sorting operator with NeuralSort [15]. Since the sorting operator is the source of discontinuity in NDCG (and other IR metrics), by substituting it with a differentiable approximation we obtain a smooth variant of the metric.

The main contributions of the paper are:
– We introduce NeuralNDCG, a novel smooth approximation of NDCG based on a differentiable relaxation of the sorting operator. The variants of the proposed loss are discussed in Section 4.
– We evaluate a Context-Aware Ranker [21] trained with the NeuralNDCG loss on the Web30K [22] and Istella [12] datasets. We demonstrate favourable performance of NeuralNDCG as compared to baselines. In particular, NeuralNDCG outperforms ApproxNDCG [23], a competing method for direct optimisation of NDCG.
– We provide an open-source PyTorch [20] implementation allowing for the reproduction of our results, available at url removed.

The rest of the paper is organised as follows. In Section 2, we review the related literature. In Section 3, we formalise the problem of LTR. In Section 4, we recap NeuralSort and demonstrate how it can be used to construct a novel loss function, NeuralNDCG. In Section 5, we discuss our experimental setup and results. In the final Section 6, we summarise our findings and discuss possible future work.
2 Related Work

As already mentioned in the introduction, most LTR approaches can be classified into one of three categories: pointwise, pairwise or listwise. For a comprehensive overview of the field and the most common approaches, we refer the reader to [19].

In this work, we are concerned with the direct optimisation of non-smooth IR measures. Methods for optimisation of such metrics can be broadly grouped into two categories. The methods in the first category try to optimise upper bounds of IR metrics as surrogate loss functions. Examples include SVM-MAP [31] and SVM-NDCG [9], which optimise upper bounds on 1 − MAP and 1 − NDCG, respectively. On the other hand, ListNet was originally designed to minimise the cross-entropy between predicted and ground truth top-one probability distributions, and as such its relation to NDCG was ill-understood. Only recently was it shown to bound NDCG and Mean Reciprocal Rank (MRR) for binary labels [3]. Further, a modification to ListNet was proposed in [2] for which it can be shown that it bounds NDCG also for graded relevance labels. Popular methods like LambdaRank and LambdaMART forgo an explicit formulation of the loss function and instead heuristically formulate the gradients based on NDCG considerations. Since the exact loss function is unknown, its theoretical relation to NDCG is difficult to analyse.

The second category of methods aims to approximate an IR measure with a smooth function and directly optimise the resulting surrogate. Our method falls into this category. We propose to smooth out NDCG by approximating the non-continuous sorting operator used in the computation of that measure. Recent works proposing continuous approximations to sorting include
NeuralSort, SoDeep [13] and smooth sorting as an Optimal Transport problem [11]. We use NeuralSort for its firm mathematical foundation, the possibility to control the degree of approximation and the ability to generalise beyond the maximum list length seen in training. SoDeep uses a deep neural network (DNN) and synthetic data to learn to approximate the sorting operator and as such lacks the aforementioned properties. Smooth sorting as Optimal Transport reports similar performance to NeuralSort at benchmark tasks and we aim to explore its use in NeuralNDCG in the future. By replacing the sorting operator with its continuous approximation, we obtain NeuralNDCG, a differentiable approximation of the IR measure. Other notable methods for direct optimisation of NDCG include:
– ApproxNDCG, in which the authors reformulated the NDCG formula to involve summation over documents, not their ranks. As a result, they introduce a non-differentiable position function, which they approximate using a sigmoid. This loss has been recently revisited in a DNN setting in [4].
– SoftRank [27], where the authors propose to smooth the scores returned by the scoring function with equal variance Gaussian distributions: thus deterministic scores become means of Gaussian score distributions. Subsequently, they derive an O(n³) algorithm to compute Rank-Binomial rank distributions using the smooth scores. Finally, NDCG is approximated by taking its expectation w.r.t. the rank distribution.

3 Problem Formulation

In this section, we formalise the problem and introduce the notation used throughout the paper. Let \( (x, y) \in X^n \times \mathbb{Z}^n_{\geq 0} \) be a training example consisting of a vector x of n items \( x_i \), 1 ≤ i ≤ n, and a vector y of corresponding non-negative integer relevance labels. Note that each item \( x_i \) is itself a d-dimensional vector of numerical features, and should be thought of as representing a query-document pair. The set X is the space of all such vectors \( x_i \). Thus, a pair (x, y) represents a list of vectorised search results for a given query together with the corresponding ground truth relevancies. The dataset of all such pairs is denoted Ψ. The goal of LTR is to find a scoring function \( f : X^n \to \mathbb{R}^n \) that maximises the chosen IR metric on Ψ. The scoring function is learned by minimising the empirical risk

\[ \mathcal{L}(f) = \frac{1}{|\Psi|} \sum_{(x, y) \in \Psi} \ell(y, s) \]

where ℓ(·) is a loss function and s = f(x) is the vector of predicted scores. As discussed earlier, in most LTR approaches there is a mismatch between the loss function ℓ and the evaluation metric, causing a discrepancy between the learning procedure and its assessment. In this work, we focus on NDCG as our metric of choice and propose a new loss called NeuralNDCG, which bridges the gap between the training and the evaluation. Before we introduce NeuralNDCG, we recall the definition of NDCG.
Definition 1. Let \( (x, y) \in X^n \times \mathbb{Z}^n_{\geq 0} \) be a training example and assume the documents in x have been ranked in the descending order of the scores computed using some scoring function f. Let \( r_j \) denote the relevance of the document ranked at the j-th position, g(·) denote a gain function and d(·) denote a discount function. Then, the Discounted Cumulative Gain at the k-th position (k ≤ n) is defined as

\[ \mathrm{DCG}@k = \sum_{j=1}^{k} g(r_j)\, d(j) \qquad (1) \]

and Normalised Discounted Cumulative Gain at k is defined as

\[ \mathrm{NDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{maxDCG}@k} \qquad (2) \]

where maxDCG@k is the maximum possible value of DCG@k, computed by ordering the documents in x by their decreasing ground truth relevancy.

Note that, typically, the discount function d(·) and the gain function g(·) are given by \( d(j) = 1/\log_2(j+1) \) and \( g(r_j) = 2^{r_j} - 1 \), respectively.
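To make Definition 1 concrete, here is a minimal NumPy sketch of DCG@k and NDCG@k under the typical gain and discount functions above; the function names are ours and are not taken from any particular library.

```python
import numpy as np

def dcg_at_k(relevances, k):
    """DCG@k with gain g(r) = 2^r - 1 and discount d(j) = 1 / log2(j + 1)."""
    rel = np.asarray(relevances, dtype=float)[:k]
    gains = 2.0 ** rel - 1.0
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))  # j = 1..k gives log2(j + 1)
    return float(np.sum(gains * discounts))

def ndcg_at_k(scores, labels, k):
    """Sort labels by descending predicted score and normalise by maxDCG@k."""
    order = np.argsort(scores)[::-1]
    ranked = np.asarray(labels)[order]
    max_dcg = dcg_at_k(np.sort(labels)[::-1], k)
    return dcg_at_k(ranked, k) / max_dcg if max_dcg > 0 else 0.0
```

A ranking that places the documents in decreasing order of their true labels yields NDCG@k = 1; the convention for lists with no relevant documents is discussed separately in Section 5.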
4 NeuralNDCG

In this section, we define NeuralNDCG, a novel differentiable approximation to NDCG. It relies on NeuralSort, a smooth relaxation of the sorting operator. We begin by recalling NeuralSort, then proceed to define NeuralNDCG and discuss its possible variants.

4.1 NeuralSort
Recall that sorting a list of scores s is equivalent to left-multiplying a column vector of scores by the permutation matrix \( P_{\mathrm{sort}(s)} \) induced by the permutation sort(s) sorting the scores. Thus, in order to approximate the sorting operator, it is enough to approximate the induced permutation matrix. In [15], the permutation matrix is approximated via a unimodal row-stochastic matrix \( \widehat{P}_{\mathrm{sort}(s)}(\tau) \) given by:

\[ \widehat{P}_{\mathrm{sort}(s)}[i, :](\tau) = \operatorname{softmax}\left[ \left( (n + 1 - 2i)\, s - A_s \mathbb{1} \right) / \tau \right] \qquad (3) \]

where \( A_s \) is the matrix of absolute pairwise differences of elements of s such that \( A_s[i, j] = |s_i - s_j| \), \( \mathbb{1} \) denotes the column vector of all ones and τ > 0 is a temperature parameter. To simplify the notation, we will write \( \widehat{P}_{\mathrm{sort}(s)}(\tau) \) simply as \( \widehat{P} \).

Note that the temperature parameter τ allows one to control the trade-off between the accuracy of the approximation and the variance of the gradients. Generally speaking, the lower the temperature, the better the approximation, at the cost of a larger variance in the gradients. In fact, it is not difficult to demonstrate that:

\[ \lim_{\tau \to 0^+} \widehat{P}_{\mathrm{sort}(s)}(\tau) = P_{\mathrm{sort}(s)} \qquad (4) \]

(see [15] for proof). This fact will come in handy once we define NeuralNDCG.
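As an illustration, here is a PyTorch sketch of Equation 3; this is our own transcription of the relaxation, not the authors' released implementation.

```python
import torch

def neural_sort(s: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Deterministic NeuralSort relaxation (Eq. 3).

    s: 1-D tensor of n predicted scores. Returns an (n, n) unimodal
    row-stochastic matrix P_hat; row i approximates the one-hot row of the
    permutation matrix placing the i-th largest score at rank i.
    """
    n = s.size(0)
    A_s = (s.unsqueeze(0) - s.unsqueeze(1)).abs()      # A_s[i, j] = |s_i - s_j|
    a_ones = A_s.sum(dim=1)                            # (A_s @ 1), shape (n,)
    ranks = torch.arange(1, n + 1, dtype=s.dtype)      # i = 1..n
    scaling = (n + 1 - 2 * ranks).unsqueeze(1)         # (n, 1): the (n + 1 - 2i) factor
    logits = (scaling * s.unsqueeze(0) - a_ones.unsqueeze(0)) / tau  # (n, n)
    return torch.softmax(logits, dim=-1)
```

With a small temperature, e.g. neural_sort(torch.tensor([0.1, 0.9, 0.4]), tau=0.01), the result is close to the exact permutation matrix sorting the scores in descending order.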
An approximation of a permutation matrix by Equation 3 is a deterministic function of the predicted scores. The authors of NeuralSort also proposed a stochastic version, by deriving a reparametrised sampler for the Plackett-Luce family of distributions. Essentially, they propose to perturb the scores s with a vector g of i.i.d. Gumbel perturbations with zero mean and a fixed scale β to obtain perturbed scores \( \tilde{s} = \beta \log s + g \). The perturbed scores are then used in place of the deterministic scores in the formula for \( \widehat{P} \).

We experimented with both deterministic and stochastic approximations to sorting and found them to yield similar results. Thus, for brevity, in this work we focus on the deterministic variant.

Table 1: Approximate sorting with NeuralSort. Given ground truth y = [4, ·, ·, ·, ·, 3] and predicted scores s = [0.·, 0.·, 0.·, 0.·, 0.·, 0.·], y is sorted by \( \widehat{P} \) for different values of τ. Exact sorting is shown in the first row.

|             | Quasi-sorted ground truth                      | Sum after sorting |
| lim τ → 0⁺  | 4  4  3  2  1  0                               | 14                |
| τ = 0.01    | 4  4  3  2  0.99992  0.00012339                | 14.00004339       |
| τ = 0.1     | ·                                              | ·                 |
| τ = 1       | 3.3893  2.9820  2.4965  2.0191  1.6097  1.2815 | 10.388            |

4.2 NeuralNDCG

If the ground truth permutation is known, one could minimise the cross-entropy loss between the ground truth permutation matrix and its approximation given by \( \widehat{P} \), as done in the experiments section of [15]. However, for many applications, including ranking, the exact ground truth permutation is not known. Relevance labels of individual items produce many possible valid ground truth permutation matrices. Thus, instead of optimising the cross-entropy loss, we use NeuralSort to introduce NeuralNDCG, a novel loss function appropriate for LTR.

Given a list of documents x, its corresponding vector of scores s = f(x) and the ground truth labels y, we first find the approximate permutation matrix \( \widehat{P} \) induced by the scores s using Equation 3. We then apply the gain function g to the vector of ground truths y and obtain a vector g(y) of gains per document. We then left-multiply the column vector g(y) of gains by \( \widehat{P} \) and obtain an "approximately" sorted version of the gains, \( \widehat{g(y)} \). Another way to think of that approximate sorting is that the k-th row of \( \widehat{P} \) gives the weights of documents \( x_i \) in the computation of the gain at rank k after sorting. The gain at rank k is then the weighted sum of ground truth gains, weighted by the entries in the k-th row of \( \widehat{P} \).

Note that after the approximate sorting the actual integer values of ground truth gains become "distorted" and are not necessarily integers anymore (see Table 1 for an example). In particular, the sum of quasi-sorted gains \( \widehat{g(y)} \) may differ from that of the original vector g(y). This leads to a peculiar behaviour of NeuralNDCG near the discontinuities of true NDCG (Figure 1), which may be potentially harmful for optimisation using Stochastic Gradient Descent [24]. Since \( \widehat{P} \) is row-stochastic but not necessarily column-stochastic (i.e. each column does not necessarily sum to one), an individual ground truth gain \( g(y)_j \) may have corresponding weights in the rows of \( \widehat{P} \) that do not sum to one (and, in particular, may sum to more than one), so it will overcontribute to the total sum of \( \widehat{g(y)} \). To alleviate that problem, we additionally perform Sinkhorn scaling [26] on \( \widehat{P} \) (i.e. we iteratively normalise all rows and columns until convergence¹) before using it for quasi-sorting. This way, the columns also sum to one and the approximate sorting is smoothed out (again, see Figure 1).

The remaining steps are identical to the computation of NDCG@k, with the exception that the gain of the relevance function \( r_j \) is replaced with the j-th coordinate of the quasi-sorted gains \( \widehat{g(y)} \). For the discount function d, we use the usual inverse logarithmic discount, and for the gain function g we use the usual power function. For the computation of maxDCG, we use the original ground truth labels y. To find NeuralNDCG at rank k, we simply truncate the quasi-sorted gains to the k-th position and compute maxDCG at the k-th rank. We thus obtain the following formula for NeuralNDCG:

\[ \mathrm{NeuralNDCG}_k(\tau)(s, y) = N_k^{-1} \sum_{j=1}^{k} \left( \operatorname{scale}(\widehat{P}) \cdot g(y) \right)_j \cdot d(j) \qquad (5) \]

where \( N_k \) is maxDCG at the k-th rank, scale(·) is Sinkhorn scaling, and g(·) and d(·) are the gain and discount functions, respectively.

¹ We stop the procedure after 30 iterations or when the maximum difference between a row or column sum and one is less than 10⁻…, whichever happens first.
Note that the summation is over the first k ranks. Finally, since the popular autograd libraries provide means to minimise a given loss function, we use (−1) × NeuralNDCG for optimisation.
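Putting the pieces together, the following is a hedged sketch of Equation 5; it is our reconstruction, and the released implementation may differ in details such as batching, masking of padded documents and the exact Sinkhorn stopping rule. It reuses neural_sort from the earlier sketch.

```python
import torch

def sinkhorn_scale(mat: torch.Tensor, n_iter: int = 30) -> torch.Tensor:
    """Alternately normalise rows and columns towards a doubly stochastic matrix."""
    for _ in range(n_iter):
        mat = mat / mat.sum(dim=1, keepdim=True)   # rows sum to one
        mat = mat / mat.sum(dim=0, keepdim=True)   # columns sum to one
    return mat

def neural_ndcg_loss(s, y, k=None, tau=1.0):
    """(-1) x NeuralNDCG@k (Eq. 5), suitable for minimisation with autograd.

    Assumes at least one relevant document; empty queries (all labels 0)
    need the special evaluation handling described in Section 5.
    """
    n = y.size(0)
    k = n if k is None else min(k, n)
    P_hat = sinkhorn_scale(neural_sort(s, tau))             # scale(P_hat)
    gains = 2.0 ** y.float() - 1.0                          # g(y)
    quasi_sorted = P_hat @ gains                            # approximately sorted gains
    discounts = 1.0 / torch.log2(torch.arange(2, n + 2, dtype=torch.float32))
    dcg = (quasi_sorted[:k] * discounts[:k]).sum()
    ideal = torch.sort(gains, descending=True).values       # maxDCG@k uses true labels
    max_dcg = (ideal[:k] * discounts[:k]).sum()
    return -dcg / max_dcg
```

The gradient flows through the scores s via the softmax in neural_sort and the divisions in sinkhorn_scale, which is exactly what makes the surrogate trainable by gradient descent.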
4.3 NeuralNDCG Transposed

In the above formulation of NeuralNDCG, the summation is done over the ranks and the gain at each rank is computed as a weighted sum of all gains, with weights given by the rows of \( \widehat{P} \). We now provide an alternative formulation of NeuralNDCG, called NeuralNDCG Transposed (NeuralNDCG_T for short), where the summation is done over the documents, not their ranks.

As previously, let x be a list of documents with corresponding scores s and ground truth relevancies y. We begin by finding the approximate permutation matrix \( \widehat{P} \). Since we want to sum over the documents and not their ranks, we need to find the weighted average of discounts per document, not the weighted average of gains per rank as before. To this end, we transpose \( \widehat{P} \) to obtain an approximation \( \widehat{P}^T \) of the inverse of the permutation matrix corresponding to sorting the documents x by their corresponding scores s. Thus, \( \widehat{P}^T \) can be thought of as an approximate unsorting matrix: when applied to sorted documents (ranks), it will (approximately) recover their original ordering. Since \( \widehat{P} \) is row-stochastic, \( \widehat{P}^T \) is column-stochastic. As we want to apply it by left-multiplication, we want it to be row-stochastic. Thus, similarly to before, we perform Sinkhorn scaling of \( \widehat{P}^T \). After Sinkhorn scaling, the k-th row of \( \widehat{P}^T \) can be thought of as giving the weights of different ranks when computing the discount of the k-th document. We can now find the vector of the weighted averages of discounts per document by computing \( \widehat{P}^T d \), where d is the vector of logarithmic discounts per rank (\( d_j = d(j) \)).

Note that since we want to perform the summation over the documents, not the ranks, it is not enough to sum the first k elements to truncate NDCG to the k-th position. Instead, the entries of the discounts vector d corresponding to ranks j > k are set to 0. This way, the documents which would end up at ranks j > k after sorting end up having weighted discounts close to 0, and equal to 0 in the limit of the temperature τ. Thus, even though the summation is done over all documents, we still recover NDCG@k. Hence, NeuralNDCG_T is given by the following formula:

\[ \mathrm{NeuralNDCG}^{T}_k(\tau)(s, y) = N_k^{-1} \sum_{i=1}^{n} g(y_i) \cdot \left( \operatorname{scale}(\widehat{P}^T) \cdot d \right)_i \qquad (6) \]

where \( N_k \) is maxDCG at the k-th rank, scale(·) is Sinkhorn scaling, g(·) is the gain function, d is the vector of logarithmic discounts per rank set to 0 for ranks j > k, and the summation is done over all n documents.

Fig. 1: Given ground truth y = [2, ·, ·, ·, 0] and a list of scores s = [4, ·, ·, ·, x], we vary the value of the score x and plot the resulting NDCG induced by the scores along with NeuralNDCG (τ = 1.0) with and without Sinkhorn scaling of \( \widehat{P} \).

By Equation 4, in the limit of the temperature, the approximate permutation matrix \( \widehat{P} \) becomes the true permutation matrix P. Thus, as the temperature approaches zero, NeuralNDCG approaches true NDCG in both its variants. See Figure 2 for examples of the effect of the temperature on the accuracy of the approximation.

Compared to ApproxNDCG, our proposed approximation to NDCG showcases more favourable properties. We can easily compute NDCG at any rank position k, whereas in ApproxNDCG one needs to further approximate the truncation function using an approximation of the position function. This approximation of an approximation leads to a compounding of errors. We do away with that problem by using a single approximation of the permutation matrix. Furthermore, the approximation of the position function in ApproxNDCG is done using a sigmoid function, which may lead to the vanishing gradient problem.

SoftRank suffers from a high computational complexity of O(n³): in order to compute all the derivatives required by the algorithm, a recursive computation is necessary. The authors relieve that cost by approximating all but a few of the Rank-Binomial distributions used, but at the cost of the accuracy of their solution. In contrast, the computation of \( \widehat{P} \) is of O(n²) complexity.
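For completeness, the transposed variant of Equation 6 admits an analogous sketch under the same caveats, reusing neural_sort and sinkhorn_scale from the sketches above.

```python
import torch

def neural_ndcg_transposed_loss(s, y, k=None, tau=1.0):
    """(-1) x NeuralNDCG_T@k (Eq. 6): summation over documents rather than ranks."""
    n = y.size(0)
    k = n if k is None else min(k, n)
    # Transpose first, then Sinkhorn-rescale so that rows (documents) sum to one.
    P_hat_T = sinkhorn_scale(neural_sort(s, tau).t())
    gains = 2.0 ** y.float() - 1.0
    discounts = 1.0 / torch.log2(torch.arange(2, n + 2, dtype=torch.float32))
    truncated = discounts.clone()
    truncated[k:] = 0.0                        # zero out discounts beyond rank k
    per_doc_discount = P_hat_T @ truncated     # weighted average discount per document
    dcg = (gains * per_doc_discount).sum()
    ideal = torch.sort(gains, descending=True).values
    max_dcg = (ideal[:k] * discounts[:k]).sum()
    return -dcg / max_dcg
```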
Fig. 2: Given ground truth y = [1, ·, ·, ·, 5] and a list of scores s = [1, ·, ·, ·, x], we vary the value of the score x and plot the resulting NDCG induced by the scores along with deterministic NeuralNDCG for different temperatures τ (0.01, 0.1, 1.0, 10.0).

5 Experiments

This section describes the experimental setup used to empirically verify the proposed loss functions.
5.1 Datasets

Experiments were conducted on two datasets: Web30K and Istella². Both datasets consist of queries and associated search results. Each query-document pair is represented by a real-valued feature vector and has an associated graded relevance on a scale from 0 (irrelevant) to 4 (highly relevant). For both datasets, we standardise the features before feeding them to the learning algorithm. Since the lengths of the search results lists in the datasets are unequal, we padded or subsampled to equal length for training, but used the full list length for evaluation. Web30K comes split into five folds. However, following the common practice in the field, we report results obtained on Fold 1 of the data. We used 60% of the data for training, 20% for validation and hyperparameter tuning and the remaining 20% for testing. The Istella dataset comes partitioned into a training and a test fold according to an 80%-20% schema. We additionally split the training data into training and validation data to obtain a 60%/20%/20% split, similarly to Web30K. We tune the hyperparameters of our models on the validation data and report performance on the test set, having trained the best models on the full training fold.

In both datasets there are a number of queries for which the associated search results list contains no relevant documents (i.e. all documents have label 0). We refer to these queries as empty queries. For such queries, the NDCG of their list of results can be arbitrarily set to either 0 or 1. To allow for a fair comparison with the current state of the art, we followed the LightGBM [17] implementation in setting the NDCG of such lists to 1 during evaluation. Table 2 summarises the characteristics of the datasets used.

² There are a few variants of this dataset; we used the Istella full dataset.
Table 2: Dataset statistics

| Dataset | Features | Queries in training | Queries in test | Empty queries |
| Web30K  | 136      | 18919               | 6306            | 982           |
| Istella | 220      | 23219               | 9799            | 50            |
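The empty-query convention above amounts to a one-line guard around any NDCG routine, e.g. around the ndcg_at_k sketch from Section 3 (again our own helper, not LightGBM's code):

```python
def eval_ndcg_at_k(scores, labels, k):
    """Evaluation-time NDCG@k: lists with no relevant documents score 1, as in LightGBM."""
    if max(labels) == 0:   # empty query: every document has label 0
        return 1.0
    return ndcg_at_k(scores, labels, k)
```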
5.2 Model

For the scoring function f, we used the Context-Aware Ranker, a ranking model based on the Transformer architecture. The model can be thought of as the encoder part of the Transformer, taking the raw features of items present in the same list as input and outputting a real-valued score for each item. Given the ubiquity of Transformer-based models in the literature, we refer the reader to [21] for the details of the architecture used. Compared to the original network described in [21], we used smaller architectures. For both datasets, we used an architecture consisting of 2 encoder blocks of a single attention head each, with a hidden dimension of 384. The dimensionality of the initial fully-connected layer was set to 96 for models trained on Web30K and 128 for models trained on Istella. We did not apply any activation on the output, except for the ApproxNDCG and NeuralNDCG losses. They exhibited suboptimal performance without any nonlinear output activation function and, in their cases, we applied Tanh to the output. For both datasets, the same architectures were used across all loss functions.
In all cases, we used the Adam [18] optimiser and set the learning rate to 0.… .

5.3 Baselines

We compared the performance of variants of NeuralNDCG against the following loss functions. For a pointwise baseline, we used a simple RMSE of the predicted relevancy. Specifically, the output of the network f is passed through a sigmoid function and then multiplied by the number of relevance levels. The root mean squared difference of this score and the ground truth relevance is the loss. The pairwise losses we compared with consist of RankNet and LambdaRank. Similarly to NeuralNDCG, these losses support training with a specific rank cutoff. We thus train models with these losses at ranks 5, 10 and at the maximum rank. The two most popular listwise losses are ListNet and ListMLE, and we, too, included them in our evaluation. Finally, the other method of direct optimisation of NDCG which we compared with was ApproxNDCG. We did not compare with SoftRank, as its O(n³) complexity proved prohibitive. We tuned the ApproxNDCG and NeuralNDCG smoothness hyperparameters for optimal performance on the validation set. Similarly to [4], we set ApproxNDCG's α parameter to 10. The temperature τ in NeuralNDCG was set to 1.
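Schematically, the training described here reduces to ordinary gradient descent on the negated metric. The sketch below is illustrative only: the simple MLP is a hypothetical stand-in for the Context-Aware Ranker, the feature width 136 matches Web30K, and the learning rate is a placeholder since the exact value is not recoverable from the text above.

```python
import torch

# Hypothetical stand-in for the Context-Aware Ranker of [21].
model = torch.nn.Sequential(
    torch.nn.Linear(136, 96),   # 136 input features, as in Web30K
    torch.nn.ReLU(),
    torch.nn.Linear(96, 1),     # one real-valued score per document
)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)  # placeholder learning rate

def train_step(X, y, k=10, tau=1.0):
    """One optimisation step on a single query's list of n documents (X: n x 136)."""
    optimiser.zero_grad()
    scores = model(X).squeeze(-1)                     # shape (n,)
    loss = neural_ndcg_loss(scores, y, k=k, tau=tau)  # from the sketch in Section 4
    loss.backward()
    optimiser.step()
    return loss.item()
```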
5.4 Results

For both datasets, we report model performance in terms of NDCG@5 and NDCG@10. Results are collected in Table 3.

Table 3: Test NDCG on Web30K and Istella. Boldface is the best performing loss column-wise and † next to a result means there is no statistically significant difference between that result and the best result in the column, according to a t-test at the 0.05 level.
| Loss             | WEB30K NDCG@5 | WEB30K NDCG@10 | Istella NDCG@5 | Istella NDCG@10 |
| NeuralNDCG@5     | 49.93         | 51.87          | 65.··†         | ·               |
| NeuralNDCG@10    | ·             | ·              | ·              | ·               |
| NeuralNDCG@max   | 50.··†        | ·              | ·              | ·               |
| NeuralNDCG_T@5   | 50.··†        | ·              | ·              | ·               |
| NeuralNDCG_T@10  | 50.··†        | ·              | ·              | ·               |
| NeuralNDCG_T@max | 50.··†        | ·              | ·              | ·               |
| ListMLE          | 49.19         | 51.36          | 60.25          | 66.42           |
| RankNet@5        | 48.53         | 50.18          | 64.70          | 69.00           |
| RankNet@10       | 50.02         | 51.88          | 65.··†         | ·               |
| RankNet@max      | 49.30         | 51.62          | 64.59          | 70.40           |
| LambdaRank@5     | 48.20         | 49.77          | 63.74          | 67.82           |
| LambdaRank@10    | 48.98         | 50.93          | 65.05          | 69.54           |
| LambdaRank@max   | ·             | ·              | ·              | ·               |
| RMSE             | 50.··†        | ·              | ·              | ·               |
| LambdaMART       | 46.80         | 49.17          | 61.04          | 65.74           |

Both variants of the proposed loss outperform ApproxNDCG on both datasets in all metrics reported. Moreover, they provide competitive performance when compared to models trained to minimise other, well-established losses. In fact, the differences in performance between the top performing NeuralNDCG variant and the overall top performing loss are not statistically significant, again on both datasets and in all metrics reported. Curiously, the best performing loss on Istella is RMSE, a pointwise loss. This strong performance of pointwise losses when paired with an attention-based scoring function was noted in [21]. For reference, we also report the results of a LambdaMART model trained using XGBoost [10] in the last row.
6 Conclusions

In this work we introduced NeuralNDCG, a novel differentiable approximation of NDCG. By substituting the discontinuous sorting operator with NeuralSort, we obtain a robust, efficient and arbitrarily accurate approximation to NDCG. Not only does it enjoy favourable theoretical properties, but it also proves to be effective in empirical evaluation, yielding performance competitive with other LTR losses. This work can easily be extended to other rank-based metrics like MAP; a possibility we aim to explore in the future. Another interesting extension of this work would be the substitution of NeuralSort with another method of approximating the sorting operator, most notably the method treating sorting as an Optimal Transport problem [11].
References
1. Baeza-Yates, R.A., Ribeiro-Neto, B.: Modern Information Retrieval. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA (1999)
2. Bruch, S.: An alternative cross entropy loss for learning-to-rank (2019)
3. Bruch, S., Wang, X., Bendersky, M., Najork, M.: An analysis of the softmax cross entropy loss for learning-to-rank with binary relevance. In: Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval, pp. 75–78. ICTIR '19, ACM, New York, NY, USA (2019). https://doi.org/10.1145/3341981.3344221
4. Bruch, S., Zoghi, M., Bendersky, M., Najork, M.: Revisiting approximate metric optimization in the age of deep neural networks. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1241–1244. SIGIR '19, ACM, New York, NY, USA (2019). https://doi.org/10.1145/3331184.3331347
5. Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., Hullender, G.: Learning to rank using gradient descent. In: Proceedings of the 22nd International Conference on Machine Learning, pp. 89–96. ICML '05, ACM, New York, NY, USA (2005). https://doi.org/10.1145/1102351.1102363
6. Burges, C.J., Ragno, R., Le, Q.V.: Learning to rank with nonsmooth cost functions. In: Schölkopf, B., Platt, J.C., Hoffman, T. (eds.) Advances in Neural Information Processing Systems 19, pp. 193–200. MIT Press (2007)
7. Burges, C.J.C.: From RankNet to LambdaRank to LambdaMART: An overview. Tech. rep., Microsoft Research (2010)
8. Cao, Z., Qin, T., Liu, T.Y., Tsai, M.F., Li, H.: Learning to rank: From pairwise approach to listwise approach. In: Proceedings of the 24th International Conference on Machine Learning, pp. 129–136. ICML '07, ACM, New York, NY, USA (2007). https://doi.org/10.1145/1273496.1273513
9. Chakrabarti, S., Khanna, R., Sawant, U., Bhattacharyya, C.: Structured learning for non-smooth ranking losses. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 88–96. KDD '08, ACM, New York, NY, USA (2008). https://doi.org/10.1145/1401890.1401906
10. Chen, T., Guestrin, C.: XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. KDD '16, ACM, New York, NY, USA (2016). https://doi.org/10.1145/2939672.2939785
11. Cuturi, M., Teboul, O., Vert, J.P.: Differentiable ranking and sorting using optimal transport. In: Advances in Neural Information Processing Systems 32, pp. 6858–6868. Curran Associates, Inc. (2019)
12. Dato, D., Lucchese, C., Nardini, F.M., Orlando, S., Perego, R., Tonellotto, N., Venturini, R.: Fast ranking with additive ensembles of oblivious and non-oblivious regression trees. ACM Trans. Inf. Syst. 35(2) (Dec 2016). https://doi.org/10.1145/2987380
13. Engilberge, M., Chevallier, L., Pérez, P., Cord, M.: SoDeep: A sorting deep net to learn ranking loss surrogates. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)
14. Friedman, J.H.: Greedy function approximation: A gradient boosting machine. Annals of Statistics, 1189–1232 (2000)
15. Grover, A., Wang, E., Zweig, A., Ermon, S.: Stochastic optimization of sorting networks via continuous relaxations. In: International Conference on Learning Representations (2019), https://openreview.net/forum?id=H1eSS3CcKX
16. Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20(4), 422–446 (Oct 2002). https://doi.org/10.1145/582415.582418
17. Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., Liu, T.Y.: LightGBM: A highly efficient gradient boosting decision tree. In: Advances in Neural Information Processing Systems 30, pp. 3146–3154. Curran Associates, Inc. (2017)
18. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization (2014), http://arxiv.org/abs/1412.6980. Published as a conference paper at the 3rd International Conference for Learning Representations, San Diego, 2015
19. Liu, T.Y.: Learning to rank for information retrieval. Found. Trends Inf. Retr. 3(3), 225–331 (Mar 2009). https://doi.org/10.1561/1500000016
20. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in PyTorch. In: NIPS Autodiff Workshop (2017)
21. Pobrotyn, P., Bartczak, T., Synowiec, M., Białobrzeski, R., Bojar, J.: Context-aware learning to rank with self-attention. In: SIGIR eCom '20, Virtual Event, China (2020)
22. Qin, T., Liu, T.Y.: Introducing LETOR 4.0 datasets. ArXiv abs/1306.2597 (2013)
23. Qin, T., Liu, T.Y., Li, H.: A general approximation framework for direct optimization of information retrieval measures. Inf. Retr. 13(4), 375–397 (2010). https://doi.org/10.1007/s10791-009-9124-x
24. Robbins, H., Monro, S.: A stochastic approximation method. Annals of Mathematical Statistics 22, 400–407 (1951)
25. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. Tech. rep., California Univ San Diego La Jolla Inst for Cognitive Science (1985)
26. Sinkhorn, R.: A relationship between arbitrary positive matrices and doubly stochastic matrices. Ann. Math. Statist. 35(2), 876–879 (1964)
27. Taylor, M., Guiver, J., Robertson, S., Minka, T.: SoftRank: Optimizing non-smooth rank metrics. In: Proceedings of the 2008 International Conference on Web Search and Data Mining, pp. 77–86. WSDM '08, ACM, New York, NY, USA (2008)
28. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems 30, pp. 5998–6008. Curran Associates, Inc. (2017)
29. Xia, F., Liu, T.Y., Wang, J., Zhang, W., Li, H.: Listwise approach to learning to rank: Theory and algorithm. In: Proceedings of the 25th International Conference on Machine Learning, pp. 1192–1199. ICML '08, ACM, New York, NY, USA (2008)
30. Xu, J., Liu, T.Y., Lu, M., Li, H., Ma, W.Y.: Directly optimizing evaluation measures in learning to rank. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 107–114. SIGIR '08, ACM, New York, NY, USA (2008)
31. Yue, Y., Finley, T., Radlinski, F., Joachims, T.: A support vector method for optimizing average precision. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 271–278. SIGIR '07, ACM, New York, NY, USA (2007)