Flatter is better: Percentile Transformations for Recommender Systems
MASOUD MANSOURY,
DePaul University, USA
ROBIN BURKE,
University of Colorado, Boulder, USA
BAMSHAD MOBASHER,
DePaul University, USA

It is well known that explicit user ratings in recommender systems are biased towards high ratings, and that users differ significantly in their usage of the rating scale. Implementers usually compensate for these issues through rating normalization or the inclusion of a user bias term in factorization models. However, these methods adjust only for the central tendency of users' distributions. In this work, we demonstrate that lack of flatness in rating distributions is negatively correlated with recommendation performance. We propose a rating transformation model that compensates for skew in the rating distribution as well as its central tendency by converting ratings into percentile values as a pre-processing step before recommendation generation. This transformation flattens the rating distribution, better compensates for differences in rating distributions, and improves recommendation performance. We also show a smoothed version of this transformation designed to yield more intuitive results for users with very narrow rating distributions. A comprehensive set of experiments shows improved ranking performance for these percentile transformations with state-of-the-art recommendation algorithms in four real-world data sets.

Additional Key Words and Phrases: Recommender systems, Rating distribution, Percentile transformation, Flatness
Authors' addresses: Masoud Mansoury, DePaul University, School of Computing, Chicago, IL, USA; Robin Burke, University of Colorado, Boulder, Department of Information Science, Boulder, CO, USA, [email protected]; Bamshad Mobasher, DePaul University, School of Computing, Chicago, IL, USA, [email protected].

Recommender systems have become essential tools in e-commerce, helping users to find desired items in many contexts. These systems use information from user profiles to generate personalized recommendations. User profiles are either implicitly inferred by the system through user interaction, or explicitly provided by users [Adomavicius et al. 2005; Adomavicius and Tuzhilin 2015]. In the latter case, users are asked to rate different items based on their preferences and may have individual differences in how they use explicit rating scales: some users may tend to rate higher, while some users may tend to rate lower; some users may use the full extent of the rating scale, while others might use just a small subset [Herlocker et al. 1999].

When a user concentrates his or her ratings in only a small subset of the rating scale, this often results in rating distributions that are skewed – most often towards the high end of the scale. This is because items are not rated at random; rather, preferred items are more likely to be selected and therefore rated, due to selection bias [Marlin et al. 2007]. Figure 1 shows the overall rating distribution of two data sets that exhibit typically right-skewed distributions. Users in the CiaoDVD data set, for example, have assigned less than 10% of the ratings to values 1 and 2, and some 70% of ratings are values 4 and 5. We can assume this is not because there are so many more good movies than bad, but rather that users are selecting movies to view that they are likely to enjoy, and the ratings are concentrated among those selections. A drawback of this skew in the distribution is that we have more information about preferred items and less information about items that are not liked as well. It also means that a given rating value may be ambiguous in meaning.

Fig. 1. Rating distribution of CiaoDVD and MovieLens data sets.

As an example, assume that Alice and Bob both purchase an item X and rate it. Alice is a user who tends to rate lower and tends to use the whole rating scale, while Bob is a user who tends to rate higher and never uses ratings at the bottom of the scale. Their profiles, sorted by rating value, are shown in Table 1. After using item X, Alice is fully satisfied with it, but Bob is only partially satisfied. As a result, both rate item X as 4 out of 5, although they have different levels of satisfaction toward that item. These ratings, while identical, do not carry the same meaning. A transformation based on percentiles, shown in the bottom rows of the table, captures this distinction well: a rating of 4 for Alice is percentile 80, whereas for Bob the same rating has a score of 50. In addition, unlike the original profiles, where the users' ratings are distributed over different ranges, these profiles span the same numerical range from 20 to 90.

Table 1. Rating profiles with percentile transformation

Alice rating      ⟨1, 1, 2, 2, 3, 3, 4, 4, 5⟩
Bob rating        ⟨3, 3, 4, 4, 4, 5, 5, 5, 5⟩
Alice percentile  ⟨20, 20, 40, 40, 60, 60, 80, 80, 90⟩
Bob percentile    ⟨20, 20, 50, 50, 50, 90, 90, 90, 90⟩
Rating normalization in neighborhood models [Resnick et al. 1994] and the inclusion of a bias term in factorization models [Koren 2008; Koren et al. 2009] are two common techniques for managing rating variances among users. However, these techniques adjust only for the central tendency of users' rating distributions and do not fully compensate for the different patterns of rating behavior that users exhibit. The percentile transformation proposed in this paper, on the other hand, takes into account the whole shape of the distribution, not only its central tendency, and therefore retains more information from the original user profile.

Table 2 shows a hypothetical rating matrix exhibiting users with different rating patterns. Some users tend to rate lower (e.g., U1), some users tend to rate higher (e.g., U4 and U6), some users show a normal rating pattern (e.g., U2 and U5), and finally, some users do not show any pattern (e.g., U3). For illustration purposes, we show how different normalization methods affect the computation of user-user similarities (in this case, similarities to user U1).

Table 2. An example of a user-item matrix (users U1–U6, items I1–I11), with each user's similarity to U1 under the original rating, z-score, and percentile transformations.

U1 and U4 show different behavior when providing ratings to items. Based on their rating patterns, rating 3 provided by U1 can be mapped to rating 5 provided by U4, or rating 1 provided by U1 can be mapped to rating 3 provided by U4. Hence, a good transformation
technique assigns a high similarity value to U1 and U4. The same result can be observed between U1 and U6. However, original ratings and the z-score technique are unable to capture these differences when calculating similarity values. In other cases, where users have normal rating patterns or do not show a specific rating pattern, our percentile technique behaves similarly to other normalization techniques.

Moreover, the above example shows that original ratings and the z-score technique behave similarly in all cases, even in extreme cases when rating patterns are very different. Our percentile technique, however, properly takes those extreme cases into account, while behaving almost identically to the other techniques in normal cases.

One can imagine that the most informative rating distribution would be a flat, uniform distribution. Users would provide ratings for items sampled uniformly across all of the items, and the profiles would then represent their preferences across the whole inventory and across all possible rating values. One way to think about the difference between the typical, skewed distribution and a uniform one is in terms of information entropy. The worst case, a profile where every item is rated the same, carries no information that distinguishes the different items, and the assignment of a rating to an item has low entropy. A profile where the rating values are distributed across the items with equal frequency has maximum entropy.

In this paper, we formalize a rating transformation model, as described above, that converts users' ratings into percentile values as a pre-processing step before recommendation generation. Each value associated with an item therefore reflects its rank among all of the items that the user has rated. Thus, the percentile captures an item's position within a user's profile better than the raw rating value and compensates for differences in users' overall rating behavior. Also, the percentile, by definition, will span the whole range of rating values and, as we show, gives rise to a more uniform rating distribution. To handle cases in which users use only a small part of the rating range, we also introduce a smoothed variant of the percentile transformation that preserves distinctions among users with different rating baselines.

We show that these two properties of the percentile transformation – its ability to compensate for individual user biases and its ability to create a more uniform rating distribution – lead to enhanced recommender system performance across different algorithms, different data sets, and different performance measures. We also show that the percentile transformation creates flatter rating distributions and that this is correlated with improved recommendation performance.

Overall, our paper makes the following contributions:

1. We propose a rating transformation model that converts users' ratings into percentile values to compensate for skew in rating distributions and variances in users' rating behaviors.
2. We empirically evaluate the proposed percentile technique using state-of-the-art recommendation algorithms on four real-world data sets. Our experiments include both overall recommendation performance and recommendation performance on long-tail items.
3. We show the relationship between the uniformity of the rating distribution and the quality of recommendation, with flatter distributions being correlated with better recommendations.
4. We show that the smoothed version of the transformation overcomes the issue of identical ratings in percentile and z-score transformations, and provides further improvement over the percentile alone.

(Results are based on the first index percentile transformation; the same results are observed for the median and last index percentile transformations.)

It has long been noted that users differ in their application of explicit rating scales. Resnick's algorithm, perhaps the most well-known prediction method in recommendation, normalizes ratings by user mean when computing its predictions [Resnick et al. 1994]. Herlocker et al. [Herlocker et al. 1999] used z-scores instead of absolute rating values in recommender systems and investigated their effectiveness on the quality of recommendations. In this research, they compared the performance of three rating normalization techniques and showed that the bias-from-mean approach performs significantly better than a non-normalized rating approach and slightly better than the z-score approach in terms of mean absolute error. This result is consistent with our findings.

Kamishima [Kamishima and Akaho 2010] proposed a ranking-based method that replaces the existing rating scheme with a ranking scheme. In this method, instead of rating the items, users order the items based on their preferences. Based on order statistics theory, preference orders expressed by users are converted into scores, and then recommendation algorithms are applied to these scores to generate recommendations. This method proved effective, but it is not widely applicable because order-based input is rare in recommender system interfaces and requires more effort from users than rating assignment.

Jin et al. [Jin and Si 2004] compared the impact of two normalization techniques for user ratings, namely Gaussian and decoupling normalization, on the performance of recommender systems. This research found that decoupling normalization is more effective than Gaussian normalization. A more recent study by [Kim et al. 2016] proposed a normalization model that learns the differences in users' rating dispositions using two phases: clustering and normalization. At the clustering phase, users are clustered based on their rating disposition; then, at the normalization phase, users' ratings are normalized by predicting their rating disposition and adjusting their neighbors' ratings based on that disposition.

In the domain of trust relations in social networks, it has been shown that percentile values are more effective than absolute trust ratings. Hasan et al. [Hasan et al. 2009] showed that using percentile values instead of absolute trust ratings improves the accuracy of a trust propagation model. They applied a method introduced by NIST for converting predicted percentile values into trust ratings in social networks.

Besides the bias in users' rating distributions, popularity bias is another well-known problem in recommender systems. Depending on the recommendation algorithm applied, inherent popularity bias in the input data can cause algorithms to focus their recommendations on a small set of items. Often the items of greatest interest to users are the lesser-known ones [Brynjolfsson et al. 2006; Park and Tuzhilin 2008], but these items are less common in recommendation lists – a consequence of low quality recommendations for these items. Jannach et al. [Jannach et al.
2015] conducted a comprehensive analysis of the popularity bias of several recommendation algorithms. They analyzed the items recommended by different recommendation algorithms in terms of their average ratings and their popularity. While it is very dependent on the characteristics of the data sets, they found that some algorithms (e.g., SlopeOne, KNN techniques, and ALS-variants of factorization models) focus mostly on high-rated items, which biases them toward a small set of items (low coverage). Also, they found that some algorithms (e.g., ALS-variants of factorization models) tend to recommend popular items, while some other algorithms (e.g.,
UserKNN and
SlopeOne) tend to recommend less-popular items.

Abdollahpouri et al. [Abdollahpouri et al. 2017] addressed popularity bias in learning-to-rank algorithms by including a fairness-aware regularization term in the objective function. They showed that the fairness-aware regularization term controls how strongly the recommendations lean toward popular items. Also, Steck [Steck 2011] examined the trade-off between degrading accuracy and improving long-tail coverage. In a user study, they observed that adding a small bias toward long-tail items leads to better feedback from users.

Finally, Cremonesi et al. [Cremonesi et al. 2010] proposed a new evaluation criterion for measuring the effectiveness of recommendation algorithms at recommending long-tail items. They compared different recommendation algorithms in terms of how accurately they recommend long-tail items to users. In fact, in their experimental setup, they measured the ranking quality of recommendation outputs on long-tail items. We follow the same evaluation criterion in the present paper to show the effectiveness of our percentile technique on long-tail items.
In statistics, given a series of measurements, percentile (or quantile) methods are used to estimate the value corresponding to a certain percentile. Given the P-th percentile, these methods attempt to put P% of the data set below and (100−P)% of the data set above. There are a number of different definitions in the literature for computing percentiles [Hyndman and Fan 1996; Langford 2006]. Although they are apparently different, the answers produced by these methods are very similar and the slight differences are negligible [Langford 2006]. In this paper, we use a definition from [Hyndman and Fan 1996].

The percentile value, p, corresponding to a measurement, x, in a series of measurements, M, is computed with regard to the position of x in the ordered list of M, o(M), as follows:

$$p_z(x, M) = \frac{100 \times position_z(x, o(M))}{|M| + 1} \qquad (1)$$

where $position_z(x, o(M))$ returns the index of occurrence of x in o(M), or the position in the order where x would appear if it is not present, and |M| is the number of measurements in M. For more details see [Hyndman and Fan 1996].

This transformation assumes that values are distinct and there is no repetition in the series. However, with explicit rating data, we have a different situation. User profiles usually contain many repetitive ratings, and it is unclear how to specify the position of a rating. For example, in a series of ratings v = ⟨2, 3, 3, 3, 3, 3, 4, 4, 5⟩, it is not clear what the position of rating 3 should be. We could take the first occurrence, position 2, or the last occurrence, position 6, or something in between.

In this work, we explore the performance of our percentile technique by taking the index of the first, median, and last occurrence of repeated ratings in the ordered vector. Hence, the parameter z determines the index rule that we want to use for the percentile transformation and can take the values f, m, and l for the first, median, and last index assignments, respectively. Each of these index assignments signifies a particular meaning when transforming rating profiles. The index of the last occurrence, for example, is the highest rank (most preferred) position occupied by an item with the given rating. We experiment with all index assignments and show that the rule that yields a more uniform distribution will provide greater recommendation performance.

For our purposes, the entire set of ratings provided by a user u is considered a rating vector for u, denoted by R_u, with an individual rating for an item i denoted as r_ui. Let p_z(v, ℓ) be the percentile mapping in Equation 1 from a rating value v in a list of values ℓ, using the first, median, or last index method. Then, the percentile value of a rating provided by user u on an item i is computed by taking the rating r_ui and calculating its percentile rank within the whole profile of the user. For example, based on the last index rule, for the user Bob from Table 1, an item rated 3 would have percentile rank $100 \times 2/(9 + 1) = 20$. We define the user-percentile function, Per^z_u, as follows:

$$Per^z_u(u, i) = p_z(r_{ui}, R_u) \qquad (2)$$
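As a concrete illustration of Equations 1 and 2, the following Python sketch (our own minimal code, not the implementation from the paper's repository; the names `percentile` and `rule` are ours) computes the percentile value of a rating under the three index rules:

```python
import bisect

def percentile(rating, profile, rule="l"):
    """Percentile of `rating` within a user's rating `profile` (Eq. 1).

    rule: "f" (first), "m" (median), or "l" (last occurrence index).
    Assumes `rating` actually occurs in `profile`.
    """
    ordered = sorted(profile)                         # o(M)
    first = bisect.bisect_left(ordered, rating) + 1   # 1-based first position
    last = bisect.bisect_right(ordered, rating)       # 1-based last position
    if rule == "f":
        position = first
    elif rule == "l":
        position = last
    else:                                             # "m": median occurrence
        position = (first + last) / 2
    return 100.0 * position / (len(ordered) + 1)

# Bob's profile from Table 1 under the last index rule:
bob = [3, 3, 4, 4, 4, 5, 5, 5, 5]
print(percentile(3, bob, rule="l"))  # 20.0
print(percentile(4, bob, rule="l"))  # 50.0
```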
Analogously, we can consider profiles for an item, denoted by R_i, to be all of the ratings provided for that item by users, and we can define a similar transformation for item profiles in which the transformation takes into consideration the rank of the rating across all ratings for that item, an item-percentile function:

$$Per^z_i(u, i) = p_z(r_{ui}, R_i) \qquad (3)$$

Note that Per^z_u and Per^z_i might be quite different for the same user-item pair. For example, user x might be a strong outlier relative to the data set, liking an item y that no one else does. Per^z_i(x, y) would therefore be quite high. However, if user x has a strong tendency to high ratings in general, Per^z_u(x, y) might be significantly lower. This paper concentrates on the user-oriented transformation; we plan to explore the properties of the Per^z_i transformation in future work. (See https://github.com/masoudmansoury/percentile for the code for computing these and other transformations described in this paper.)

One of our claims in this paper is that the flatness of the rating distribution in a data set is an indicator of how well collaborative recommendation will perform, and that the percentile transformation achieves flatter distributions. In order to test this hypothesis, we need a measure of how close a rating distribution is to uniformity.

One common technique for measuring the shape of a distribution is kurtosis. Kurtosis is regularly used for determining the normality of a distribution. A normal distribution has a kurtosis value of 3, and a value below 3 indicates a distribution closer to uniform. (In some references, kurtosis is defined such that 0 reflects a normal distribution.) Although kurtosis can be used for measuring the uniformity of a distribution, it is not a robust measure and may be misleading. Therefore, to overcome this issue, we introduce a new technique for measuring the flatness of a distribution as an alternative alongside kurtosis.

To determine the flatness of a ratings distribution, we calculate the Kullback-Leibler divergence (KLD) between the observed rating distribution and a uniform distribution in which each rating value occurs the same number of times. If there is a discrete set of rating values V (for example, 1, 2, 3, 4, 5), then we define the flatness measure F as

$$F(D \parallel Q) = \sum_{v \in V} D(v) \log \frac{D(v)}{Q(v)} \qquad (4)$$

where V is the set of discrete rating values in the rating matrix R, and D is the observed probability distribution over those values. Q is a uniform distribution which associates a probability 1/|V| with each possible value in V (i.e., for each v ∈ V, Q(v) = 1/|V|). Therefore,

$$F(D) = \sum_{v \in V} D(v) \log(|V| \, D(v)) \qquad (5)$$

The F function measures the distance between the two distributions and hence how close the observed distribution is to the flat ideal, with a lower KLD value being indicative of a flatter distribution.

Table 3. Flatness calculation of the BookCrossing data set (columns: rating, frequency, D(v), log(|V| D(v)); F = 0.448).

Table 3 illustrates the flatness calculation of the BookCrossing data set for original ratings. In this data set, there are ten rating values, |V| = 10. D(v) is the probability distribution over each rating value, calculated as

$$D(v) = \frac{frequency(v)}{\sum_{v \in V} frequency(v)} \qquad (6)$$

By using Equation 5, the flatness of this data set is F = 0.448. This value, far from that of a flat ideal (F = 0), shows that the distribution of original ratings in the BookCrossing data set is far from uniform.
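For illustration, the flatness measure of Equation 5 can be computed directly from rating counts. This is a minimal sketch with our own naming (not code from the paper's repository), where `num_values` stands in for |V|:

```python
import math
from collections import Counter

def flatness(ratings, num_values=None):
    """Flatness F of Eq. 5: KL divergence between the observed rating
    distribution and a uniform distribution over the rating scale.
    Returns 0 for a perfectly flat distribution; larger is less flat.

    num_values: |V|, the size of the full rating scale (defaults to the
    number of distinct values observed, e.g. 10 for BookCrossing).
    """
    counts = Counter(ratings)
    total = sum(counts.values())
    v = num_values or len(counts)
    return sum((c / total) * math.log(v * (c / total)) for c in counts.values())

print(flatness([1, 2, 3, 4, 5]))           # 0.0 (flat)
print(flatness([4, 4, 4, 5, 5, 5, 5, 3]))  # > 0 (skewed)
```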
Fig. 2. Raw and binned percentile distributions for BookCrossing data set.
The percentile and z-score transformations yield real-valued ratings, unlike the original discrete ratings chosen by users in these data sets. Evaluating our flatness measure at every point in these distributions yields results that are not comparable to the original discrete distribution. In order to have comparable calculations of the F value across types of distributions, we created binned versions of the percentile and z-score distributions, using the same number of bins as present in the original ratings. In a 10-star rating system, such as found in the BookCrossing data, the rating distribution covers ten values; hence we created ten equal-length bins for percentile and z-score values and aggregated each bin by its mean. (As an example, we know that percentile values are between 0 and 100. Thus, we create ten bins, each of length 10, and aggregate each bin by the mean of its distribution.)

Figure 2 shows the percentile distribution (last index illustrated here) and its aggregated distribution for the BookCrossing data set. The black curve is the percentile distribution and the red line is its aggregated distribution with ten bins. It shows that aggregating by mean retains the shape of the percentile distribution, while being comparable to the original ratings for computing flatness.

Table 4. Statistical properties of data sets.

We evaluated the performance of the percentile transformation on four publicly available data sets: BookCrossing, CiaoDVD, FilmTrust, and MovieLens. The characteristics of the data sets are summarized in Table 4. These data sets are from various domains and have different degrees of sparsity. The ML1M is movie ratings data and was collected by the MovieLens research group (https://grouplens.org/datasets/movielens/). The CiaoDVD includes ratings of users for movies available on DVD. The FilmTrust is a small data set collected from the FilmTrust website [Guo et al. 2013]. It contains both movie ratings and explicit trust ratings. Finally, the BX data set is a subset extracted from the BookCrossing data set such that each user has rated at least 5 books and each book is rated by at least 5 users. The ML1M has the highest density and CiaoDVD has the lowest density.

Table 5. Flatness (F) and kurtosis (K) of the rating distribution for each data set and input (rating, z-score, Per^f_u, Per^m_u, Per^l_u). Flatness of a uniform distribution is F = 0; kurtosis of a uniform distribution is K < 3.

To evaluate the percentile transformation for its distributional properties, we evaluated its flatness and kurtosis compared to the original ratings distribution and a distribution based on the z-score transformation over four data sets: BX, CiaoDVD, FilmTrust, and ML1M. First, we converted the original ratings into percentile and z-score values. Then, we applied the binned flatness and kurtosis measures described above to these data sets to evaluate the transformations for their distributional properties. Table 5 shows the flatness (F) and kurtosis (K) values for each type of transformation on the four data sets (user profile transformation). As shown, the values for both measures are consistent across all three transformations and data sets. As anticipated, the proposed percentile model makes the rating values flatter than the original ratings and z-score values. Also, the original rating values show a flatter distribution than z-score values over all the data sets.

Thus, the proposed percentile transformation approach reduces skew in the rating distribution relative to the original ratings and to z-score values. Given these results, we expect to see better recommendation performance when we use percentile values as input for recommender systems. We also expect that in most cases, using the original ratings will result in better recommendation performance than z-score values since they have lower F and K values.
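The binning step described above can be sketched as follows; this is our own illustrative code, assuming percentile values in [0, 100], ten bins, and bin frequencies as the binned distribution:

```python
import math

def binned_flatness(values, num_bins=10, lo=0.0, hi=100.0):
    """Bin real-valued scores (e.g., percentiles in [0, 100]) into
    `num_bins` equal-length bins, then apply the flatness measure to
    the binned frequencies so F is comparable across rating scales."""
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        idx = min(int((v - lo) / width), num_bins - 1)  # clamp v == hi
        counts[idx] += 1
    total = sum(counts)
    return sum(
        (c / total) * math.log(num_bins * (c / total))
        for c in counts
        if c > 0  # 0 * log(0) is treated as 0
    )
```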
We performed a comprehensive evaluation of the effect of the percentile transformation on the ranking performance of a number of recommendation algorithms. Due to the nature of our percentile technique, we experimented only with algorithms that make use of rating magnitude. The percentile transformation rescales rating values without changing their relative ordering, so it will have no effect when applied to ranking-based algorithms (for example, ListRank [Shi et al. 2010]). Implicit feedback algorithms that use unary data, such as Bayesian Personalized Ranking [Rendle et al. 2009], would also be inappropriate to use with the percentile transform because they use binary interaction information and ignore rating values.

Our experiments included user-based collaborative filtering [Resnick et al. 1994], item-based collaborative filtering [Sarwar et al. 2001], biased matrix factorization [Koren et al. 2009], SVD++ [Koren 2008], and non-negative matrix factorization [Lee and Seung 2001]. However, in this paper, for the purpose of presentation, we only report results on biased matrix factorization (BiasedMF) and SVD++. (Results on all algorithms and datasets are available at https://github.com/masoudmansoury/percentile.) Results on other algorithms were in some cases similar – showing improved nDCG performance, although the details and significance vary – and in the cases where our percentile technique did not improve the performance of recommendations, the results for all three input values were the same. For instance, results produced by UserKNN on all datasets for all input values were the same; however, our percentile technique produced better ranking performance with ItemKNN on BX and ML1M. (The same result is also observed with NMF.)

We performed 5-fold cross-validation, and in the test condition, generated recommendation lists of size 10 for each user. Then, we evaluated nDCG at list size 10. (We also evaluated precision, recall, and F-measure, also finding significant improvement in these metrics.) Results were averaged over all users and then over all folds. A paired t-test was used to evaluate the significance of the results; results shown in bold are statistically significant with a p-value of less than 0.05.

Before reporting on the results here, we performed extensive experiments with different parameter configurations for each algorithm and data set combination. To determine sensible values for parameters, we followed the settings reported in the literature. In factorization models, for instance, we approximately set the number of factors and iterations based on the density of the data set and the convergence of the loss function, and we tested these parameters for sensitivity. We performed a grid search over the bias terms (user, item, implicit feedback, and overall), the number of factors, the number of iterations, and the learning rate.

Table 6. Performance of recommendation algorithms at nDCG@10 for each data set (BX, CiaoDVD, FilmTrust, ML1M), algorithm (BiasedMF, SVD++), and input (rating, z-score, Per^f_u, Per^m_u, Per^l_u).
Results of these extensive experiments show that, in general, across all settings, our percentile technique works significantly better than original ratings and z-score values. We include results for ten experimental conditions: two recommendation algorithms evaluated over five different inputs: the original ratings, the results of the three percentile transformations, and the results of the z-score transformation. Table 6 shows the results for all the data sets and both algorithms, reporting the best-performing configuration for each dataset, algorithm, and input value. Results in Table 6 show that the percentile technique produces recommendations that are consistently better than original rating and z-score values over all the recommendation algorithms and data sets, except for Per^l_u as input for BiasedMF on the CiaoDVD data set. On the densest data set (ML1M), the average improvement by our percentile technique on
BiasedMF is 33% and on
SVD++ is 7%. The improvement on the FilmTrust dataset is 268% and 66%, on the CiaoDVD dataset 182% and 95%, and on the BX dataset 58% and 48%, respectively. Also, our results show that, in most cases, the original ratings outperform the z-score transformation, which is consistent with our flatness hypothesis.
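For reference, the nDCG@10 measure used in these experiments can be sketched as follows (a minimal binary-relevance implementation of our own; the paper's experiments rely on LibRec's built-in evaluator):

```python
import math

def ndcg_at_k(ranked_items, relevant_items, k=10):
    """nDCG@k for one user: DCG of the top-k recommendation list,
    normalized by the ideal DCG (all relevant items ranked first)."""
    relevant = set(relevant_items)
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based
        for rank, item in enumerate(ranked_items[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(r + 2) for r in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```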
We hypothesize that a transformation that produces a flatter distribution will compensate for skew in the rating distribution and generate improved recommendation performance. As we have seen, the percentile transformation generally leads to better performance and to flatter distributions, and the less-flat z-score transformation has lower performance.

We examined this phenomenon using five types of inputs for the recommender system: original ratings; first, median, and last index percentile values; and z-score values. We examined the F and K values for the training data under the different transformations and computed the correlation against the recommendation performance measured by nDCG@10. (We used LibRec 2.0 for all experiments [Guo et al. 2015]. Because of a limitation in LibRec, z-score values are shifted to positive values by the addition of an offset.)

Table 7 shows the correlation between the F and K values of each input (i.e., original ratings, first index percentile values, median index percentile values, last index percentile values, and z-score values) and the nDCG@10 of the recommendation algorithms with those input values. It clearly shows a significant negative correlation between performance and divergence from uniformity. (Note that low F and K values correspond to a flatter distribution.) The flatter distributions (closer to zero for F and below 3 for K) yield better performance for both algorithms across all data sets.

Table 7. Correlation between F/K and nDCG@10 for each algorithm
dataset     F: BiasedMF   F: SVD++   K: BiasedMF   K: SVD++
BX             -0.95        -0.88       -0.94        -0.85
CiaoDVD        -0.73        -0.72       -0.62        -0.87
FilmTrust      -0.96        -0.51       -0.99        -0.74
ML1M           -0.70        -0.70       -0.97        -0.97
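The correlations in Table 7 are Pearson coefficients between the F (or K) values of the five inputs and the corresponding nDCG@10 scores; a sketch of the computation, with placeholder numbers rather than the paper's measurements:

```python
from scipy.stats import pearsonr

# One dataset/algorithm pair: F values of the five inputs
# (rating, z-score, Per_f, Per_m, Per_l) and their nDCG@10 scores.
# These numbers are placeholders, not the paper's measurements.
flatness_values = [0.45, 0.60, 0.10, 0.12, 0.08]
ndcg_values = [0.020, 0.015, 0.035, 0.034, 0.036]

r, p_value = pearsonr(flatness_values, ndcg_values)
print(f"correlation: {r:.2f}")  # negative r: flatter (lower F) -> higher nDCG
```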
Table 8. Performance of recommendation algorithms on long-tail items at nDCG@10 for each data set (BX, CiaoDVD, FilmTrust, ML1M), algorithm (BiasedMF, SVD++), and input (rating, z-score, Per^f_u, Per^m_u, Per^l_u).
Except for the F value of FilmTrust on SVD++ and the K value of CiaoDVD on BiasedMF, all of the observed correlations between F/K and nDCG are between -0.99 and -0.70, indicative of a strong inverse relationship: in general, flatter distributions give better algorithmic performance.

In this section, we examine the performance of recommendation algorithms at recommending long-tail items for different input values. To do this, we follow the methodology in [Cremonesi et al. 2010] for analyzing item popularity. In this methodology, for each user in the test set, a list of items is recommended, and ranking quality is then measured only on the long-tail items in the recommended list. The main goal of this methodology is to measure the effectiveness of a recommendation algorithm in recommending long-tail items.

For this evaluation, we need to determine the long-tail items from the training data. To do this, we create a cumulative popularity list of items sorted from most popular to least popular, and then we define a cutting point that divides the items into short-head and long-tail items. For the experiments in this paper, we used a cutting point of 20%, meaning that the 20% most popular items are considered short-head items and the remaining, less popular items are considered long-tail items.

Table 8 shows the performance of recommendation algorithms on long-tail items for different input values. As shown in this table, some version of the percentile transform significantly outperforms all other input values for each data set / algorithm combination in terms of nDCG@10. Only in three of the 24 conditions are the improvements not significant: on CiaoDVD when Per^l_u is used as input for BiasedMF, and on ML1M when Per^f_u is used as input.

A drawback of the percentile transform is the handling of a uniform user profile that consists entirely of identical ratings, for example, ⟨3, 3, 3, 3, 3, 3⟩. When a user rates every item with the same rating value, it is hard to determine the user's preferences and attitudes: if the user is generous (tends to rate highly), a rating value of 3 can be interpreted as dislike, while if the user is stingy (tends to rate low), the same value can be interpreted as like. But in the absence of a rating distribution for a given user, it is impossible to tell which assumption is correct. (Note that this issue can be even more problematic for some other transformation techniques: z-score, for example, is undefined for uniform profiles.)

Fig. 3. Percentage of users who provided identical ratings: (a) BX, (b) CiaoDVD, (c) FilmTrust.

Figure 3 shows the percentage of users with uniform profiles at different rating values in three data sets: BX, CiaoDVD, and FilmTrust. (There are no uniform user profiles in ML1M.) In CiaoDVD, the sparsest data set, more than half of the users have uniform profiles, with almost 40% rating all items at 5. These profiles provide little information for a recommendation algorithm beyond the implicit association of user and item.

To overcome the problem of uniform profiles, we introduce the notion of a smoothed percentile transformation. Our inspiration for this method is additive (Laplace) smoothing as commonly found in naive Bayes classification. The effect of additive smoothing is to shrink probability estimates based on counts towards a uniform probability; here our goal is to shrink the percentile estimate towards a uniform (flat) distribution across the rating values. To create a smoothed version of the percentile, we add a small number of artificial ratings, k, at each rating level. In a 5-star rating system, for example, the possible rating values are 1, 2, 3, 4, 5, so with k = 2 a profile ⟨3, 3, 3⟩ yields the smoothed profile ⟨1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 5, 5⟩.

After the smoothed profile is created, the percentile transformation is computed and then the artificial rating values are removed, leaving behind the altered percentiles for the original rating values. Thus, the profile consisting only of 3s, as in our example above, would have middling percentile scores, being transformed to approximately ⟨64, 64, 64⟩ using the last index method. If the profile had been ⟨5, 5, 5⟩ instead, the transformed version would be approximately ⟨93, 93, 93⟩. The effect of the smoothed transform is therefore to place the user profile in the context of the full rating scale.

We formalize our smoothed version of the percentile transformation for each index assignment as follows:

$$p_f(x, M) = \frac{100 \times (position_f(x, o(M)) + k \times (index(x) - 1))}{|M| + (|R| \times k) + 1} \qquad (7)$$

$$p_m(x, M) = \frac{100 \times (position_m(x, o(M)) + k \times (index(x) - 1) + k/2)}{|M| + (|R| \times k) + 1} \qquad (8)$$

$$p_l(x, M) = \frac{100 \times (position_l(x, o(M)) + k \times index(x))}{|M| + (|R| \times k) + 1} \qquad (9)$$

Here, index(x) returns the index of rating x in the full list of rating values available in the application. In a rating system such as {0.5, 1, 1.5, 2, 2.5, 3}, for example, index(2) = 4 and index(2.5) = 5. |R| is the number of rating values available to users (i.e., in a 5-star rating system, |R| = 5).

Table 9. Performance of recommendation algorithms with smoothed percentile as input at nDCG@10, for each data set (BX, CiaoDVD, FilmTrust, ML1M), algorithm (BiasedMF, SVD++), and input (SPer^f_u, SPer^m_u, SPer^l_u).
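To make Equation 9 concrete, here is a sketch of the smoothed transformation under the last index rule (our own illustrative code, not the reference implementation; `scale` and `k` follow the notation above):

```python
def smoothed_percentile_last(rating, profile, scale, k=2):
    """Smoothed percentile under the last index rule (Eq. 9).

    scale: ordered list of all rating values the system allows,
           e.g. [1, 2, 3, 4, 5]; k artificial ratings per level.
    """
    ordered = sorted(profile)
    # Last occurrence of `rating` in the user's ordered profile (1-based).
    position_l = len(ordered) - ordered[::-1].index(rating)
    idx = scale.index(rating) + 1  # index(x): 1-based position in the scale
    return (100.0 * (position_l + k * idx)) / (len(ordered) + len(scale) * k + 1)

# A uniform profile of 3s now gets middling scores instead of being ambiguous:
print(smoothed_percentile_last(3, [3, 3, 3], [1, 2, 3, 4, 5]))  # ~64.3
print(smoothed_percentile_last(5, [5, 5, 5], [1, 2, 3, 4, 5]))  # ~92.9
```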
We repeated our prior experiments using these smoothed transforms, achieving the results shown in Table 9. On the FilmTrust data set, the smoothed percentile showed significant improvement over the percentile technique, particularly with the BiasedMF algorithm. On the BX data set, results are only slightly better than percentile values. One might attribute this result to the fact that there are few uniform rating profiles in the BX data set. However, although ML1M does not have any users with uniform profiles, the smoothed percentile showed significant improvement over the percentile technique, indicating the effectiveness of smoothing even on non-uniform profiles.

On the CiaoDVD data set, we expected significantly better results due to the high number of users who provided identical ratings. However, the smoothed percentile is only slightly better than the percentile transform. One possible reason for this result is that most of the users who provided identical ratings are cold-start users with few items rated.
In this paper, we presented a rating transformation model that converts rating values to percentile values as a pre-processing step before model generation. This technique addresses two well-known problems in the ratings distributions of recommender systems: the problem of user rating bias, due to variation in rating behavior, and the problem of right-skew, due to the selection bias towards preferred items. This simple pre-processing step produces improved recommendation ranking performance across multiple data sets, multiple algorithms, and multiple evaluation metrics. In addition, we introduced the smoothed percentile transformation to overcome the problem of identical ratings in users' profiles. Experimental results showed that the smoothed percentile technique can improve recommendation performance beyond the percentile technique alone, even in cases where uniform profiles are not present.

In introducing these transformations and demonstrating their benefits for recommender system performance, we also introduced the concept of distribution flatness and produced suggestive evidence that distributional flatness may be a good indicator of the benefits of such rating transformations: flatter, indeed, seems to be better when it comes to rating value distributions for recommendation.

In future work, we plan to conduct additional experiments with the percentile transform, particularly the item-based version of the transform, which was introduced here but for which no results were presented. Early experiments indicate that on algorithms that are item-oriented (for example, the Sparse Linear Method [Ning and Karypis 2011]), the item-oriented version of the transform is more appropriate.

We also plan to explore alternative approaches to enhancing the flatness of user profiles, including negative sampling. Negative sampling has been shown to improve classification accuracy when the evidence is biased [Goldberg and Levy 2014]. For example, rather than adding artificial ratings just for the percentile computation and removing them afterwards, it may be useful to sample items with different average rating values and use them to augment uniform user profiles. This would have the effect of smoothing such low-information profiles both towards flatness and towards the population average for item preferences.

REFERENCES
Himan Abdollahpouri, Robin Burke, and Bamshad Mobasher. 2017. Controlling Popularity Bias in Learning-to-Rank Recommendation. In RecSys '17 Proceedings of the Eleventh ACM Conference on Recommender Systems. 42–46.

Gediminas Adomavicius, Ramesh Sankaranarayanan, Shahana Sen, and Alexander Tuzhilin. 2005. Incorporating contextual information in recommender systems using a multidimensional approach. ACM Transactions on Information Systems (TOIS) 23, 1 (2005), 103–145.

Gediminas Adomavicius and Alexander Tuzhilin. 2015. Context-aware recommender systems. In Recommender systems handbook. Springer US, 191–226.

Erik Brynjolfsson, Yu Jeffrey Hu, and Michael D. Smith. 2006. From niches to riches: Anatomy of the long tail. Sloan Management Review 47, 4 (2006), 67–71.

Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the fourth ACM conference on Recommender systems. ACM, 39–46.

Yoav Goldberg and Omer Levy. 2014. word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722 (2014).

Guibing Guo, Jie Zhang, Zhu Sun, and Neil Yorke-Smith. 2015. LibRec: A Java Library for Recommender Systems. In UMAP Workshops.

Guibing Guo, Jie Zhang, and Neil Yorke-Smith. 2013. A novel bayesian similarity measure for recommender systems. In Twenty-Third International Joint Conference on Artificial Intelligence.

Omar Hasan, Lionel Brunie, Jean-Marc Pierson, and Elisa Bertino. 2009. Elimination of subjectivity from trust recommendation. In IFIP International Conference on Trust Management. Springer Berlin Heidelberg, 65–80.

Jonathan L. Herlocker, Joseph A. Konstan, Al Borchers, and John Riedl. 1999. An algorithmic framework for performing collaborative filtering. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 230–237.

Rob J. Hyndman and Yanan Fan. 1996. Sample quantiles in statistical packages. The American Statistician 50, 4 (November 1996), 361–365.

Dietmar Jannach, Lukas Lerche, Iman Kamehkhosh, and Michael Jugovac. 2015. What recommenders recommend: an analysis of recommendation biases and possible countermeasures. User Modeling and User-Adapted Interaction 25, 5 (2015), 427–491.

Rong Jin and Luo Si. 2004. A study of methods for normalizing user ratings in collaborative filtering. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 568–569.

Toshihiro Kamishima and Shotaro Akaho. 2010. Nantonac collaborative filtering: A model-based approach. In Proceedings of the fourth ACM conference on Recommender systems. ACM, 273–276.

Soo-Cheol Kim, Kyoung-Jun Sung, Chan-Soo Park, and Sung Kwon Kim. 2016. Improvement of collaborative filtering using rating normalization. Multimedia Tools and Applications 75, 9 (May 2016), 4957–4968.

Yehuda Koren. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 426–434.

Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009).

Eric Langford. 2006. Quartiles in elementary statistics. Journal of Statistics Education 14, 3 (November 2006), 1–27.

Daniel D. Lee and H Sebastian Seung. 2001. Algorithms for non-negative matrix factorization. Advances in neural information processing systems (2001), 556–562.

Benjamin M Marlin, Richard S Zemel, Sam Roweis, and Malcolm Slaney. 2007. Collaborative filtering and the missing at random assumption. In Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence. AUAI Press, 267–275.

Xia Ning and George Karypis. 2011. SLIM: Sparse Linear Methods for Top-N Recommender Systems. In Data Mining (ICDM), 2011 IEEE 11th International Conference on. IEEE, 497–506.

Yoon-Joo Park and Alexander Tuzhilin. 2008. The Long Tail of Recommender Systems and How to Leverage It. In RecSys '08 Proceedings of the 2008 ACM Conference on Recommender Systems. 11–18.

Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence. AUAI Press, 452–461.

Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, and John Riedl. 1994. GroupLens: an open architecture for collaborative filtering of netnews. In Proceedings of the 1994 ACM conference on Computer supported cooperative work. ACM, 175–186.

Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In WWW'01 Proceedings of the 10th international conference on World Wide Web. 285–295.

Yue Shi, Martha Larson, and Alan Hanjalic. 2010. List-wise learning to rank with matrix factorization for collaborative filtering. In Proceedings of the fourth ACM conference on Recommender systems. ACM, 269–272.

Harald Steck. 2011. Item Popularity and Recommendation Accuracy. In RecSys '11 Proceedings of the fifth ACM Conference on Recommender Systems. 125–132.