A Differentiable Ranking Metric Using Relaxed Sorting Operation for Top-K Recommender Systems
Hyunsung Lee, Yeongjae Jang, Jaekwang Kim, Honguk Woo
{hyunsung.lee, ja3156, linux, hwoo}@skku.edu

Abstract
A recommender system generates personalized recommendations for a user by computing the preference scores of items, sorting the items according to the scores, and filtering the top-K items with the highest scores. While sorting and ranking items are integral to this recommendation procedure, it is nontrivial to incorporate them in the process of end-to-end model training, since sorting is non-differentiable and hard to optimize with gradient-based updates. This incurs an inconsistency between the existing learning objectives and the ranking-based evaluation metrics of recommendation models. In this work, we present DRM (differentiable ranking metric), which mitigates this inconsistency and improves recommendation performance by employing a differentiable relaxation of ranking-based evaluation metrics. Via experiments with several real-world datasets, we demonstrate that jointly learning the DRM cost function on top of existing factor based recommendation models significantly improves the quality of recommendations, in comparison with other state-of-the-art recommendation methods.

With the massive growth of online content, it has become common for online content platforms to operate recommender systems that provide personalized recommendations, aiming to facilitate better user experiences and to alleviate the dilemma of choices [1]. In general, recommender systems generate relevance scores of items with respect to a user and recommend the top-K items with the highest scores. Thus, sorting (or ranking) items plays an important role in such top-K recommendation tasks. While learning-based recommenders are popular, they are generally trained with objectives that are limited in accurately reflecting the ranking nature of top-K recommendation tasks.
This is because the sorting operation is non-differentiable, so incorporating it into end-to-end model training driven by gradient-based updates is challenging when commonly used objectives such as mean squared error or log likelihood are employed. As noted in several research works [2, 3, 4], optimizing objectives that are not aware of the ranking nature of top-K recommendation tasks does not always guarantee the best performance. Although ranking-oriented objectives exist, namely pairwise objectives such as the Bayesian personalized pairwise loss [5] and listwise objectives based on the Plackett-Luce distribution [6, 4], neither fits top-K recommendation tasks well. Pairwise objectives consider only the relative ranking within a pair of items, whereas top-K recommendation tasks aim at generating recommendation lists of size K. On the other hand, listwise objectives consider all items but with equal importance regardless of their ranks, whereas top-K recommendation tasks naturally need to put more weight on higher-ranked items, which are more likely to appear in a recommendation list.

To bridge this inconsistency between the learning objectives commonly used for training recommendation models and the ranking nature of top-K recommendation tasks, we present DRM (differentiable ranking metric), a differentiable relaxation scheme for ranking-based evaluation metrics such as Precision@K and Recall@K.

* Department of Electrical and Computer Engineering, Sungkyunkwan University, South Korea
† Department of Mathematics, Sungkyunkwan University, South Korea
‡ Institute of Convergence, Sungkyunkwan University, South Korea
§ Department of Computer Science and Engineering, Sungkyunkwan University, South Korea
By employing the differentiable relaxation scheme for the sorting operation [7], DRM enables direct optimization of ranking-based evaluation metrics for recommendation models. We first reformulate the ranking-based evaluation metrics in terms of permutation matrix arithmetic, and then relax the non-differentiable permutation matrix in these arithmetic forms to a differentiable row-stochastic matrix. This reformulation and relaxation allows us to represent non-differentiable ranking metrics in the differentiable form of DRM. Using DRM as an optimization objective renders end-to-end recommendation model training highly consistent with ranking-based evaluation metrics. Moreover, DRM can be readily incorporated on top of existing recommendation models via joint learning with their own objectives, without modifying their structure. To evaluate the effect of DRM on existing models, we adopt two state-of-the-art factor based recommendation models, WARP [8] and CML [9]. Our experiments demonstrate that the DRM objective significantly improves the performance of top-K recommendations on several real-world datasets in terms of ranking-based evaluation metrics, in comparison with several other recommendation models.

Given a set of users U = {1, 2, ..., M}, a set of items I = {1, 2, ..., N}, and a set of interactions y_{u,i} for all users u in U and all items i in I, a recommendation model aims to learn to predict the preference, or score, ŷ_{u,i} ∈ R of user u for item i. We use binary implicit feedback y_{u,i} such that y_{u,i} = 1 if user u has interacted with item i, and 0 otherwise. Note that in this work we only consider this binary feedback format, although our approach can easily be generalized to various implicit feedback settings.
We use u to index a user and i, j to index items; usually i denotes an item that user u has interacted with, and j an item that user u has not interacted with. We denote the set of items with which user u has interacted as I_u. We also use y_u to represent the items that user u has interacted with in bag-of-words notation, meaning the column vector [y_{u,1}, y_{u,2}, ..., y_{u,n}]^T. Similarly, we use ŷ_u for the vector of predicted scores of items, meaning [ŷ_{u,1}, ŷ_{u,2}, ..., ŷ_{u,n}]^T.

Objectives for recommendation models are grouped into three categories: pointwise, pairwise, and listwise. Pointwise objectives maximize the accuracy of each prediction independently. Mean squared error and cross entropy are commonly used pointwise objectives for training machine learning models for recommenders. Pointwise objectives have a known limitation in that high predictive accuracy does not always lead to high-quality recommendations [10].

Pairwise objectives gained popularity because they are more closely related to top-K recommendation tasks than pointwise objectives. They enable a recommendation model to learn users' preferences by viewing the problem as binary classification, predicting whether user u prefers item i to item j. As noted in [11], one of the main concerns with pairwise approaches is that they are formalized to minimize classification errors on item pairs, rather than errors in item rankings.

Listwise objectives minimize errors in the list of sorted items or item scores. They have been explored by a few prior works [12, 4, 6], yet they remain under-investigated in the recommender systems domain, because list operations such as permutation or sorting are hard to differentiate. One significant drawback of listwise objectives is that they treat all items in a ranking with equal importance.
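As an illustration, a listwise objective based on the Plackett-Luce distribution mentioned above can be sketched as follows. This is a minimal NumPy sketch under our own naming (`plackett_luce_nll` and the toy scores are not from the paper): the item at each rank is drawn from a softmax over the items not yet placed.

```python
import numpy as np

def plackett_luce_nll(scores, ranking):
    """Negative log-likelihood of an item ranking under the
    Plackett-Luce model: the item at rank k is chosen by a softmax
    over the scores of the items not yet placed."""
    scores = np.asarray(scores, dtype=float)
    nll = 0.0
    remaining = list(range(len(scores)))
    for item in ranking:
        # log-probability that `item` wins among the remaining items
        log_p = scores[item] - np.log(np.sum(np.exp(scores[remaining])))
        nll -= log_p
        remaining.remove(item)
    return nll
```

Note that every rank position enters this objective symmetrically; nothing focuses the loss on the top K positions, which is exactly the drawback discussed above.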
However, items at higher ranks can actually be recommended and are more important to top-K recommendation. Our objective overcomes the limitations of pairwise objectives and of current listwise objectives, exploiting both the ranking nature of top-K personalized recommendation and its emphasis on items at top ranks.

In practice, the performance of trained recommendation models should be validated with respect to the given objectives before deployment in target services. In our notation, we represent the list of items ordered by the predicted scores with respect to user u as π_u, and the item at rank k as π_u(k). In addition, we define the Hit(k) function that specifies whether the k-th highest scored item for user u in the recommendation list is in the validation dataset I_u that contains all the items interacted with by u, i.e.,

Hit(k) = I[π_u(k) ∈ I_u]    (1)

where I[statement] is the indicator function, yielding 1 if the statement is true and 0 otherwise.

Precision and Recall are two of the most widely used evaluation metrics for top-K recommendation tasks [13]. For each user u, both metrics are based on how many items in the top-K recommendation are in the validation dataset I_u. The Precision metric specifies the fraction of hit items among the items in the top-K recommended list, while the Recall metric specifies the fraction of hit items among the items in the validation dataset I_u.
Notice that both metrics emphasize items at high ranks by counting only items with rank K or smaller; they do not distinguish the relative ranking among the items within the top K.

Precision@K(u, π_u) = (1/K) Σ_{k=1}^{K} Hit(k)
Recall@K(u, π_u) = (1/|I_u|) Σ_{k=1}^{K} Hit(k)

On the other hand, truncated Discounted Cumulative Gain (DCG) [14] and truncated Average Precision (AP) [15] take into account the relative ranking of items by weighting the impact of Hit(k) according to its rank k. Furthermore, Normalized DCG (NDCG) is DCG@K divided by the ideal discounted cumulative gain IDCG@K = max_{π_u} DCG@K(u, π_u):

NDCG@K(u, π_u) = DCG@K(u, π_u) / IDCG@K, where DCG@K(u, π_u) = Σ_{k=1}^{K} Hit(k) / log2(k + 1)

The truncated AP is defined as

AP@K(u, π_u) = (1/|I_u|) Σ_{k=1}^{K} Precision@k(u, π_u) Hit(k).

AP can be viewed as a weighted sum of Hit at each rank k, weighted by Precision@k. Notice that all the metrics above share a common form: a weighted sum of Hit. Accordingly, we formulate these metrics in a unified way as O(K), conditioned on a weight function w(k, K). For simplicity, we omit the arguments (u, π_u) of these metrics without loss of generality.

O(K) = Σ_{k=1}^{K} w(k, K) Hit(k)
     = Precision@K if w(k, K) = 1/K
       Recall@K    if w(k, K) = 1/|I_u|
       NDCG@K      if w(k, K) = 1/(log2(k + 1) · IDCG@K)
       AP@K        if w(k, K) = Precision@k / |I_u|    (2)

In this section, we propose DRM. We begin by introducing the two building blocks of our method. The first part introduces matrix factorization with a weighted hinge loss. We then describe how to represent ranking-based evaluation metrics in terms of vector arithmetic and how to relax these metrics to be differentiable so that they can be optimized by gradient descent. We conclude this section with the training procedure of DRM.
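The evaluation metrics above, in the unified weighted-sum-of-Hit form of Eq. (2), can be sketched in NumPy as follows (a minimal sketch; the function names and the toy example are ours, not from the paper):

```python
import numpy as np

def hits(ranking, relevant, K):
    """Hit(k) for k = 1..K: 1 if the item at rank k is in the
    validation set `relevant`, else 0."""
    return np.array([1.0 if ranking[k - 1] in relevant else 0.0
                     for k in range(1, K + 1)])

def precision_at_k(ranking, relevant, K):
    return hits(ranking, relevant, K).sum() / K          # w(k, K) = 1/K

def recall_at_k(ranking, relevant, K):
    return hits(ranking, relevant, K).sum() / len(relevant)  # w = 1/|I_u|

def ndcg_at_k(ranking, relevant, K):
    h = hits(ranking, relevant, K)
    discounts = 1.0 / np.log2(np.arange(2, K + 2))       # 1/log2(k+1)
    dcg = (h * discounts).sum()
    idcg = discounts[:min(K, len(relevant))].sum()       # best possible DCG@K
    return dcg / idcg

def ap_at_k(ranking, relevant, K):
    h = hits(ranking, relevant, K)
    precisions = np.cumsum(h) / np.arange(1, K + 1)      # Precision@k
    return (precisions * h).sum() / len(relevant)
```

For example, with relevant items {0, 2} and ranking [2, 1, 0, 3], Precision@3 = 2/3 and AP@3 = 5/6; all four functions are weighted sums of the same Hit vector, differing only in w(k, K).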
Factor based recommenders represent users and items in a latent vector space R^d and formulate the preference score of user u for item i as a function of two d-dimensional vectors, α_u for the user and β_i for the item. The dot product is one common method for mapping a pair of user and item vectors to a preference score [5, 16, 17]. In [9], collaborative metric learning (CML) embeds users and items in a Euclidean metric space and defines its score function as the negative L2 distance between the two vectors:

ŷ_{u,i} = α_u^T β_i (dot product), or ŷ_{u,i} = −‖α_u − β_i‖ (L2 distance)

where ‖x‖ is the L2 norm of the vector x. Our model uses either the dot product or the negative L2 distance of user vector α_u and item vector β_i as its score function. Regardless of the score function, we update our model using a weighted hinge loss whose weight is computed from an approximated rank of the positive item i with respect to user u:

L_hinge = Σ_{u∈U} Σ_{i∈I_u} Σ_{j∈I−I_u} Φ_{ui} [μ − ŷ_{u,i} + ŷ_{u,j}]_+    (3)

where [x]_+ = max(x, 0) is a clamp function and μ is the margin of the clamp function; we tune the margin μ empirically. The weight Φ_{ui} is defined to have a larger value if the positive item i is estimated to sit at a lower rank. Similar to the sampling procedure in [9], it is defined to be parallelizable, allowing fast computation on a GPU. Explicitly,

Φ_{ui} = log( −ŷ_{u,i} + (1/|I_u^neg|) Σ_{j∈I_u^neg} ŷ_{u,j} + 1 )

where I_u^neg is a set of items sampled, for each update, from the items that user u has not interacted with. The number of negative samples |I_u^neg| is usually between ten and a few hundred.

An n-dimensional permutation p = [p_1, p_2, ..., p_n]^T is a vector of distinct indices from 1 to n.
Every permutation p can be represented by a permutation matrix P(p) ∈ {0, 1}^{n×n} whose elements are

P_{i,j} = 1 if j = p_i, and 0 otherwise.

For example, the permutation matrix P = [[0, 1], [1, 0]] maps a score vector v = [3, 5]^T to Pv = [5, 3]^T. We can represent sorting in decreasing order of a score vector s with the permutation matrix P(s) as follows (Corollary 3 in [7]):

P(s)_{i,j} = 1 if j = argmax[(n + 1 − 2i)s − A_s 1], and 0 otherwise    (4)

where 1 refers to the column vector with 1 for all elements and A_s is the matrix such that A_{i,j} = |s_i − s_j|. Note that the k-th row P_k of the permutation matrix P is equal to the one-hot vector representation of the item at rank k. Thus we can represent Hit (Eq. (1)) using the dot product of y_u and P(s)_k:

Hit(k) = y_u^T P(s)_k

This yields a representation of the ranking metrics of Eq. (2) in terms of vector arithmetic:

O(K, w) = Σ_{k=1}^{K} w(k, K) y_u^T P(s)_k    (5)

In [7], a differentiable generalization of sorting is proposed by relaxing the permutation matrix of Eq. (4) into a row-stochastic matrix, allowing differentiation of operations that involve sorting real values. This relaxed matrix P̃(s) is constructed as

P̃(s)_k = softmax[ ((n + 1 − 2k)s − A_s 1) / τ ]

where τ > 0 is a temperature parameter: the higher τ, the flatter each row of the relaxed matrix. This relaxation is continuous everywhere and differentiable almost everywhere with respect to the elements of s. As τ → 0, P̃(s) reduces to the permutation matrix P(s). We can obtain a differentiable relaxed objective, amenable to gradient-based updates, by simply replacing P(ŷ_u) in Eq. (5) with P̃(ŷ_u). Explicitly,

Õ = Σ_{k=1}^{K} w(k, K) y_u^T P̃(ŷ_u)_k.    (6)
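The relaxed sorting operator above can be sketched in NumPy (a minimal sketch of the relaxation in [7]; the function name and toy scores are ours):

```python
import numpy as np

def relaxed_sort_matrix(s, tau):
    """Row-stochastic relaxation of the descending-sort permutation
    matrix: row k is softmax(((n + 1 - 2k) s - A_s 1) / tau)."""
    s = np.asarray(s, dtype=float)
    n = len(s)
    A = np.abs(s[:, None] - s[None, :])       # A_s with entries |s_i - s_j|
    b = A.sum(axis=1)                         # A_s 1
    k = np.arange(1, n + 1)[:, None]          # ranks 1..n
    logits = ((n + 1 - 2 * k) * s[None, :] - b[None, :]) / tau
    e = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
    return e / e.sum(axis=1, keepdims=True)

scores = np.array([3.0, 1.0, 2.0])
P = relaxed_sort_matrix(scores, tau=0.01)     # near-hard permutation
print(np.round(P @ scores, 3))                # ~ [3. 2. 1.], sorted descending
```

With a small τ each row is nearly one-hot, recovering the hard sort; with a larger τ (say 1.0) the rows spread probability mass over neighboring ranks, which is what makes the operator differentiable.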
Since the softmax function is differentiable, this objective is now differentiable. We empirically find that it is slightly more stable to update the model with the following loss:

L_neu = ‖y_u − P̃_{[1:K]}‖²    (7)

where P̃_{[1:K]} = Σ_k w(k, K) P̃_k(ŷ_u). Note that minimizing Eq. (7) corresponds to maximizing a lower bound of Eq. (6):

Õ = y_u^T P̃_{[1:K]} = (1/2) [ y_u^T y_u + P̃_{[1:K]}^T P̃_{[1:K]} ] − (1/2) ‖y_u − P̃_{[1:K]}‖² ≥ −(1/2) ‖y_u − P̃_{[1:K]}‖² = −(1/2) L_neu

Having developed our objective, we incorporate the loss of Eq. (7) into the model learning structure of Eq. (3), learning from both worlds jointly:

L = L_hinge + λ L_neu
  = Σ_{u∈U} Σ_{i∈I_u} Σ_{j∈I−I_u} Φ_{ui} [μ − ŷ_{u,i} + ŷ_{u,j}]_+ + λ Σ_{u∈U} ‖y_u − P̃_{[1:K]}‖²    (8)

Eq. (8) is the objective of our model. We can also view this objective as regularizing the pairwise ranking objective L_hinge (Eq. (3)) with L_neu upon violations of correct rankings. The effect of L_neu is controlled by the scaling parameter λ.

As we consider factor based models with gradient updates using negative sampling [5, 9, 18, 19, 20], we follow a similar sampling procedure with additional positive item sampling. One training sample contains a user u, ρ positive items that user u has interacted with, and ν negative items that user u has not interacted with; we empirically set ν to a fixed multiple of ρ. We construct the list of items y_u from the ρ sampled positive items and ν sampled negative items, building an array of size ρ + ν whose first ρ elements are 1 and the rest 0. We construct ŷ_u similarly: ŷ_u = [ŷ_{u,i_1}, ŷ_{u,i_2}, ..., ŷ_{u,i_ρ}, ŷ_{u,j_1}, ..., ŷ_{u,j_ν}]^T. The learning procedure of our model is summarized in Alg. 1.
Algorithm 1:
Learning procedure for DRM

Initialize user factors α_u for u ∈ {1, 2, ..., M}
Initialize item factors β_i for i ∈ {1, 2, ..., N}
repeat
    Sample user u from U
    Sample ρ positive items i_1, i_2, ..., i_ρ from I_u
    Sample ν negative items j_1, j_2, ..., j_ν from I − I_u
    Δα_u ← 0; Δβ_i ← 0 for i ∈ {i_1, ..., i_ρ}; Δβ_j ← 0 for j ∈ {j_1, ..., j_ν}
    Construct y_u ← [1, ..., 1 (ρ times), 0, ..., 0 (ν times)]^T
    Construct ŷ_u ← [ŷ_{u,i_1}, ..., ŷ_{u,i_ρ}, ŷ_{u,j_1}, ..., ŷ_{u,j_ν}]^T
    Choose the positive item i with the smallest score ŷ_{u,i} among i_1, ..., i_ρ
    Choose the negative item j with the largest score ŷ_{u,j} among j_1, ..., j_ν
    for θ ∈ {α_u, β_i, β_j} do
        Δθ ← Δθ + ∇L_hinge
    end for
    for θ ∈ {α_u, β_{i_1}, ..., β_{i_ρ}, β_{j_1}, ..., β_{j_ν}} do
        Δθ ← Δθ + λ ∇L_neu
    end for
    for θ ∈ {α_u, β_{i_1}, ..., β_{i_ρ}, β_{j_1}, ..., β_{j_ν}} do
        Update θ with Δθ using the Adagrad optimizer
        θ ← θ / max(1, ‖θ‖)
    end for
until converged
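The DRM term of Eq. (7) used in Algorithm 1 can be sketched end-to-end in NumPy as follows. This is a forward-only sketch under our own naming, with the weights fixed to the Precision-style w(k, K) = 1/K; real training would compute gradients through an autodiff framework such as PyTorch rather than this sketch.

```python
import numpy as np

def relaxed_sort_matrix(s, tau):
    """Row k is softmax(((n + 1 - 2k) s - A_s 1) / tau), as in [7]."""
    s = np.asarray(s, dtype=float)
    n = len(s)
    b = np.abs(s[:, None] - s[None, :]).sum(axis=1)          # A_s 1
    logits = ((n + 1 - 2 * np.arange(1, n + 1)[:, None]) * s[None, :]
              - b[None, :]) / tau
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def drm_loss(y_true, y_score, K, tau=1.0):
    """L_neu = || y_u - sum_k w(k,K) P~_k(y^_u) ||^2 with w(k,K) = 1/K."""
    P = relaxed_sort_matrix(y_score, tau)
    p_topk = P[:K].sum(axis=0) / K                           # P~_[1:K]
    diff = np.asarray(y_true, dtype=float) - p_topk
    return float(diff @ diff)

y_true = np.array([1.0, 1.0, 0.0, 0.0])   # first two items are the positives
good = drm_loss(y_true, np.array([4.0, 3.0, 2.0, 1.0]), K=2, tau=0.1)
bad  = drm_loss(y_true, np.array([1.0, 2.0, 3.0, 4.0]), K=2, tau=0.1)
```

Scoring the positives above the negatives (`good`) gives a smaller loss than the reverse ordering (`bad`), and the softmax rows let that difference propagate back to the scores.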
Bayesian Personalized Ranking [5] proposed a pairwise cost function that maximizes the area under the ROC curve (AUC). This framework gives factor based recommenders and nearest-neighbor recommenders a method to learn personalized rankings. One significant drawback of this model is that AUC does not discriminate between items at higher ranks and those at lower ranks, unlike NDCG and MAP, so it does not fit top-K recommendation tasks well. Our model, unlike BPR, focuses on the few items at the highest ranks; this matches actual recommendation tasks, where only a small number of items can be recommended at a time, and thus yields better top-K recommendations.

Cofactor [21] proposed word2vec-like [22, 23] embedding techniques to embed item co-occurrence information into a matrix factorization model, achieved by adding an additional objective to the pointwise mean-squared-error matrix factorization objective. SRRMF [24] argues that merely treating missing ratings as zeros leads to suboptimal behavior, and proposes smoothing negative feedback to nonzero values according to approximated ranks. These two works are the closest to ours in that they propose a new objective, or a new view of the data, without requiring additional inputs such as context. However, they are limited in that their objectives cannot be applied to general gradient based models.

Listwise Collaborative Filtering [4] tackles the misalignment between cost and objective for memory based, K-nearest-neighbor recommenders [25] by proposing a method to calculate the similarity between two lists. Our work is complementary to it, as we propose a solution for factor based (model based) recommenders.

In this section, we evaluate our proposed method against various existing recommendation models.
We evaluate our approach and baseline models on four datasets of real-world user-item interactions. Their statistics and characteristics are summarized in Table 1.
Table 1: Dataset statistics (SketchFab, Epinion, ML-20M, Melon).

• SketchFab [26]: user click streams on 3D models. We consider only items with at least five interactions.
• Epinion [27]: product reviews and five-star ratings from a web commerce site. We view each rating as a user-item interaction signal.
• ML-20M [28]: five-star ratings (with half stars) of movies. We interpret each rating as a user-item interaction, exclude ratings lower than four, and treat the remainder as binary implicit feedback.
• Melon: playlists from a music streaming service. To be consistent with the implicit user feedback setting, we treat each playlist as a user and the songs in a playlist as the items the user has interacted with.

Evaluation Protocol
We randomly split the interaction data into training, validation, and test sets in 70%, 10%, and 20% proportions, respectively. We first train the models once on the training data to find the best hyperparameter setting for each model, evaluating the settings on the validation set. We then train the models five times with the best hyperparameter settings on both the training and validation data, evaluate them on the test data, and report the average of the evaluation metrics. We skip evaluating users with fewer than three interactions in the training dataset. We use Recall@50 for model validation. We conduct Welch's t-test [29] on the results and denote statistically significant best results in boldface and ties in italics. As results, we report mean AP@10 (MAP@10), NDCG@10, Recall@50, and NDCG@50.

Our Model
We use two variants of our model. One uses the dot product as its score function and is denoted DRM_dot; the other uses the negative L2 distance between user vector α_u and item vector β_i as its score and is denoted DRM_L2.

Baselines
We compare our method with the following baselines:

• SLIM [30] is a state-of-the-art item based collaborative filtering algorithm in which the item-item similarity matrix is represented as a sparse matrix. It generates the item preferences of a user as a weighted sum of similarities between the items the user has previously consumed.
• CDAE [20, 31] is a factor based recommender that represents user factors with an encoder (a multi-layer perceptron) whose input is the embeddings of the items the user has consumed.
• WMF [17] is a state-of-the-art matrix factorization model that uses a pointwise loss minimized by alternating least squares.
• BPR [5] is a matrix factorization model with a pairwise sigmoid objective designed to optimize the AUC of the ROC curve.
• WARP [8] is a matrix factorization model trained with a hinge loss using approximated-rank-based weights.
• CML [9] is a factor based recommendation model that models user-item preference as the negative distance between the user vector and the item vector.
• SQLRank-MF [6] is a matrix factorization model whose cost function is based on a permutation probability over lists of items.
• SRRMF [24] is a state-of-the-art factor based recommendation model that interpolates the scores of unobserved feedback to nonzero values, giving different importances to unobserved feedback.

The Melon dataset is available at https://arena.kakao.com/c/7. For WMF and BPR, we used an open-source implementation, Implicit [32]. For WARP, we used an open-source recommender, LightFM [33]. For SLIM, SQLRank-MF, and SRRMF, we used the implementations made publicly available by their authors. We implemented CDAE, CML, and DRM using PyTorch 1.5.0. We ran our experiments on a machine with an Intel(R) Xeon(R) CPU E5-2698 and an NVIDIA Tesla V100 GPU with CUDA 10.1.
Table 2: Recommendation performance of the different methods (MAP@10, NDCG@10, Recall@50, and NDCG@50 on SketchFab, Epinion, ML-20M, and Melon; parenthesized values are relative improvements of DRM_dot and DRM_L2 over the best baseline). The best performing models under a paired t-test are boldfaced; values are italicized when the performances of two or more models are not statistically distinguishable.

We conduct an illustrative experiment on the objective of DRM (Eq. (8)). Figure 1 plots the normalized costs and MAP@10 evaluated on the training data over training epochs. We do not observe that decreasing cost always increases MAP@10; we conjecture this is because both models impose a strong regularization forcing the L2 norms of the latent representations of users and items to be at most 1. However, we do observe that, as claimed, the loss under joint learning with DRM is more strongly (negatively) correlated with MAP@10 than the loss of WARP alone. This means that our objective is more closely related to the top-K recommendation task.
Figure 1: Normalized loss per sample and MAP@10 on training data versus training epochs. The negative correlation between loss and MAP@10 is stronger for WARP + DRM than for WARP alone.

Implementations: SLIM at https://github.com/KarypisLab/SLIM, SQL-Rank at https://github.com/wuliwei9278/SQL-Rank/, SRRMF at https://github.com/HERECJ/

We could not train SQLRank-MF on the large datasets, ML-20M and Melon, because of its huge training time: it took about a day to train on the Epinion dataset, and running on ML-20M and Melon was infeasible. Therefore, we report the performance of SQLRank-MF only on the SketchFab and Epinion datasets.

Table 2 shows the performance of the various models on the four datasets in terms of the ranking metrics. The proposed methods outperform state-of-the-art models by a large margin on all the datasets we use. Although the matrix factorization models (BPR, WMF, WARP, and DRM_dot) share the same model formulation, the differences among their performances are large. For example, WMF achieves the smallest training error using a pointwise loss, yet its prediction quality falls below pairwise models such as WARP, CML, and our models on many datasets. Note that WARP behaves poorly on the SketchFab dataset, whereas DRM_dot achieves the best prediction quality among the models we evaluate, even though the two differ only in the additional loss term L_neu. These trends hold across all datasets. We credit this performance gain to the proposed objective, which makes factor based models aware of the top-K recommendation nature.
Figure 2: Effect of the number of positive samples ρ in the DRM cost on NDCG of our models (panels: (a) SketchFab, (b) Melon), with the weight λ of the DRM cost held fixed.

Our objective samples items to sort and rank: we sample only ρ positive items and ν negative items to build a training sample, rather than using the entire item set. In Figure 2, we examine the effect of the number of positive items in the sample, with ν set to a fixed multiple of ρ. We observe that the number of positive items ρ is positively related to recommendation performance. However, the effect varies across datasets, and beyond some point increasing ρ does not improve performance further.

[Figure 3(a), SketchFab: user groups with <10 (7721 users), 10-29 (4922), and ≥30 (2764) interactions.]
[Figure 3(b), ML-20M: user groups with 3-19 (46254 users), 20-49 (43773), and ≥50 (46559) interactions.]
Figure 3: NDCG@10 across user groups partitioned by the number of interactions. The numbers in parentheses denote the number of users in each group.

Figure 3 shows NDCG@10 for user groups formed by the number of interactions in the training datasets. Our loss function consistently improves recommendation performance for all user groups, especially when the number of interactions of a user is small. Among the models using the negative L2 distance as the score function, DRM_L2 significantly improves the quality of recommendations over CML.

          CML     DRM-only  DRM_L2
MAP@10    0.0358  0.0310    0.0390
NDCG@10   0.1379  0.1234    0.1466
Recall@50 0.3040  0.2802    0.3028
NDCG@50   0.1778  0.1608    0.1836

Table 3: Comparison on the Melon dataset of joint learning (DRM_L2), the pairwise loss only (CML), and the DRM cost only.

In Table 3, we compare the model trained with the DRM cost only against two others: one trained with the pairwise loss only (CML) and one trained jointly (DRM_L2). The model trained only with the DRM loss performs worse than the other two; we conjecture this is because the listwise objective alone does not provide enough training samples (only one list exists per user).

While learning-based recommender systems are popular, their performance in terms of personalized ranking can be suboptimal because they are not directly optimized for top-K recommendation tasks. In this work, we have proposed DRM, a differentiable ranking metric that enables sorting-embedded end-to-end training for factor based recommenders. DRM relaxes sorting into a continuous operation, leading to a high-performance cost function that can directly maximize metrics such as Precision.
Via experiments, we demonstrate that DRM achieves higher recommendation quality than other state-of-the-art recommender methods on several real-world datasets. Our future work is to apply the DRM cost function to various recommendation models, including autoencoder and deep neural network models. It is also interesting to investigate differentiable ranking metrics other than the relaxed Precision explored in this work.

References

[1] Barry Schwartz.
The paradox of choice . 2018.[2] Christopher J. C. Burges, Robert Ragno, and Quoc Viet Le. Learning to rank with nonsmooth cost functions.In
Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems , pages 193–200,2006.[3] Jun Xu, Tie-Yan Liu, Min Lu, Hang Li, and Wei-Ying Ma. Directly optimizing evaluation measures in learning torank. In
Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development inInformation Retrieval, SIGIR , pages 107–114, 2008.[4] Shanshan Huang, Shuaiqiang Wang, Tie-Yan Liu, Jun Ma, Zhumin Chen, and Jari Veijalainen. Listwise collabora-tive filtering. In
Proceedings of the 38th International ACM SIGIR Conference on Research and Development inInformation Retrieval , pages 343–352, 2015.[5] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. BPR: bayesian personalizedranking from implicit feedback. In
UAI, Proceedings of the Twenty-Fifth Conference on Uncertainty in ArtificialIntelligence , pages 452–461, 2009.[6] Liwei Wu, Cho-Jui Hsieh, and James Sharpnack. Sql-rank: A listwise approach to collaborative ranking. In
Proceedings of the 35th International Conference on Machine Learning, , pages 5311–5320, 2018.[7] Aditya Grover, Eric Wang, Aaron Zweig, and Stefano Ermon. Stochastic optimization of sorting networks viacontinuous relaxations. In , 2019.[8] Jason Weston, Samy Bengio, and Nicolas Usunier. Wsabie: Scaling up to large vocabulary image annotation.In Toby Walsh, editor,
Proceedings of the 22nd International Joint Conference on Artificial Intelligence, pages 2764–2770, 2011.
[9] Cheng-Kang Hsieh, Longqi Yang, Yin Cui, Tsung-Yi Lin, Serge J. Belongie, and Deborah Estrin. Collaborative metric learning. In Proceedings of the 26th International Conference on World Wide Web, pages 193–201, 2017.
[10] Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. Performance of recommender algorithms on top-n recommendation tasks. In
Proceedings of the 2010 ACM Conference on Recommender Systems, pages 39–46, 2010.
[11] Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In
Machine Learning, Proceedings of the Twenty-Fourth International Conference, pages 129–136, 2007.
[12] Yue Shi, Martha A. Larson, and Alan Hanjalic. List-wise learning to rank with matrix factorization for collaborative filtering. In Proceedings of the 2010 ACM Conference on Recommender Systems, pages 269–272, 2010.
[13] David C. Blair and M. E. Maron. An evaluation of retrieval effectiveness for a full-text document-retrieval system. Communications of the ACM, 28(3):289–299, 1985.
[14] Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422–446, 2002.
[15] Ricardo A. Baeza-Yates and Berthier A. Ribeiro-Neto. Modern information retrieval, volume 463. 1999.
[16] Ruslan Salakhutdinov and Andriy Mnih. Probabilistic matrix factorization. In Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, pages 1257–1264, 2007.
[17] Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In Proceedings of the 8th IEEE International Conference on Data Mining, pages 263–272, 2008.
[18] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, pages 173–182, 2017.
[19] Christopher C. Johnson. Logistic matrix factorization for implicit feedback data. In Advances in Neural Information Processing Systems 27, 2014.
[20] Yao Wu, Christopher DuBois, Alice X. Zheng, and Martin Ester. Collaborative denoising auto-encoders for top-n recommender systems. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pages 153–162, 2016.
[21] Dawen Liang, Jaan Altosaar, Laurent Charlin, and David M. Blei. Factorization meets the item embedding: Regularizing matrix factorization with item co-occurrence. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 59–66, 2016.
[22] Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In Proceedings of the 27th International Conference on Neural Information Processing Systems, pages 2177–2185, 2014.
[23] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, pages 3111–3119, 2013.
[24] Jin Chen, Defu Lian, and Kai Zheng. Improving one-class collaborative filtering via ranking-based implicit regularizer. In The Thirty-Third AAAI Conference on Artificial Intelligence, pages 37–44, 2019.
[25] Badrul Munir Sarwar, George Karypis, Joseph A. Konstan, and John Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the Tenth International World Wide Web Conference, pages 285–295, 2001.
[26] Ethan Rosenthal. Likes out! Guerilla dataset!, 2016.
[27] Jiliang Tang, Huiji Gao, Huan Liu, and Atish Das Sarma. eTrust: understanding trust evolution in an online world. In The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 253–261, 2012.
[28] F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems, 5(4):19:1–19:19, 2016.
[29] Bernard L. Welch. The generalization of 'Student's' problem when several different population variances are involved.
Biometrika, 34(1/2):28–35, 1947.
[30] Xia Ning and George Karypis. SLIM: Sparse linear methods for top-n recommender systems. Pages 497–506, 2011.
[31] Dawen Liang, Rahul G. Krishnan, Matthew D. Hoffman, and Tony Jebara. Variational autoencoders for collaborative filtering. In
Proceedings of the 2018 World Wide Web Conference, 2018.
[32] Ben Fredrickson. Fast Python collaborative filtering for implicit datasets. https://github.com/benfred/implicit, 2017.
[33] Maciej Kula. Metadata embeddings for user and item cold-start recommendations. In