Personal Recommendation via Modified Collaborative Filtering
arXiv [physics.data-an]
Run-Ran Liu a, Chun-Xiao Jia a, Tao Zhou a,b,∗, Duo Sun a, and Bing-Hong Wang a,c
a Department of Modern Physics and Nonlinear Science Center, University of Science and Technology of China, Hefei Anhui, 230026, PR China
b Department of Physics, University of Fribourg, Chemin du Musée 3, CH-1700 Fribourg, Switzerland
c Institute of Complex Adaptive System, Shanghai Academy of System Science, Shanghai, PR China
(Dated: October 31, 2018)

In this paper, we propose a novel method to compute the similarity between congeneric nodes in bipartite networks. Unlike the standard cosine similarity, it takes into account the influence of a node's degree. Substituting this new definition of similarity for the standard cosine similarity, we propose a modified collaborative filtering (MCF). Based on a benchmark database, we demonstrate a great improvement in algorithmic accuracy for both user-based MCF and object-based MCF.
PACS numbers: 89.75.Hc, 87.23.Ge, 05.70.Ln
Key words: recommendation system, bipartite network, similarity, collaborative filtering
I. INTRODUCTION
Recently, recommendation systems have been attracting more and more attention because they can help users deal with information overload, which is a great challenge in modern society, especially given the exponential growth of the Internet [1] and the World-Wide-Web [2]. Recommendation algorithms have been used to recommend books and CDs at Amazon.com, movies at Netflix.com, and news at VERSIFI Technologies (formerly AdaptiveInfo.com) [3]. The simplest algorithm we can use in these systems is the global ranking method (GRM) [4], which sorts all the objects in descending order of degree and recommends those with the highest degrees. GRM is not a personal algorithm, and its accuracy is not very high because it does not take into account personal preferences. Accordingly, various kinds of personal recommendation algorithms have been proposed, for example, collaborative filtering (CF) [5, 6], content-based methods [7, 8], spectral analysis [9, 10], principal component analysis [11], the diffusion approach [4, 12, 13, 14], and so on. However, the current generation of recommendation systems still requires further improvements to make recommendation methods more effective [3]. For example, content analysis is practical only if the items have well-defined attributes and those attributes can be extracted automatically; for some multimedia data, such as audio/video streams and graphical images, content analysis is hard to apply. Collaborative filtering usually provides very poor predictions/recommendations to new users who have very few collections. Spectral analysis has high computational complexity and is thus infeasible for huge systems.

Thus far, the most widely applied personal recommendation algorithm is CF [3, 15]. CF has two categories in general: one is user-based (U-CF), which recommends to the target user the objects collected by users sharing similar tastes; the other is object-based (O-CF), which recommends objects similar to the ones the target user preferred in the past. In this paper, we introduce a modified collaborative filtering (MCF), which can be implemented for both the object-based and user-based cases and achieves much higher recommendation accuracy.
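As a point of reference for the non-personal baseline, the global ranking method can be sketched in a few lines of Python. The dictionary `collections`, the function name `global_ranking`, the exclusion of objects the user already owns, and the alphabetical tie-break are assumptions made for this illustration only.

```python
from collections import Counter

def global_ranking(collections, user, top_n):
    """Recommend the top_n highest-degree objects the user has not collected.

    `collections` maps each user to the set of objects he/she collected
    (a hypothetical in-memory stand-in for the user-object network).
    """
    # Degree of an object = number of users who collected it.
    degree = Counter(obj for objs in collections.values() for obj in objs)
    owned = collections[user]
    # Descending degree; ties broken by name (an assumption, for determinism).
    ranked = sorted(degree, key=lambda o: (-degree[o], o))
    return [o for o in ranked if o not in owned][:top_n]

# Toy data invented for this example.
collections = {
    "u1": {"A", "B"},
    "u2": {"A", "C"},
    "u3": {"A", "B", "D"},
}
print(global_ranking(collections, "u2", 2))  # prints ['B', 'D']
```

Every user sees the same degree-ordered list apart from the objects he/she already owns, which is why the method cannot reflect personal preferences.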
II. METHOD
We assume that there is a recommendation system which consists of $m$ users and $n$ objects, and each user has collected some objects. The relationship between users and objects can be described by a bipartite network. A bipartite network is a particular class of networks [4, 16] whose nodes are divided into two sets, and connections within one set are not allowed. We use one set to represent users and the other to represent objects: if an object $o_i$ is collected by a user $u_j$, there is an edge between $o_i$ and $u_j$, and the corresponding element $a_{ij}$ in the adjacency matrix $A$ is set to 1; otherwise it is 0.

In U-CF, the predicted score $v_{ij}$ (to what extent $u_i$ likes $o_j$) is given as:

$$ v_{ij} = \sum_{l=1,\, l \neq i}^{m} s_{il}\, a_{jl}, \tag{1} $$

where $s_{il}$ denotes the similarity between $u_i$ and $u_l$. For any user $u_i$, all $v_{ij}$ are ranked from high to low, and the objects at the top that have not been collected by $u_i$ are recommended.

∗ Electronic address: [email protected]

How to determine the similarity between users? The most common approach taken in previous works focuses on the so-called structural equivalence. Two congeneric nodes (i.e. in the same set of a bipartite network) are considered structurally equivalent if they share many common neighbors. The number of common objects shared by users $u_i$ and $u_j$ is

$$ c_{ij} = \sum_{l=1}^{n} a_{li}\, a_{lj}, \tag{2} $$

which can be regarded as a rudimentary measure of $s_{ij}$. Generally, the similarity between $u_i$ and $u_j$ should be somewhat relative to their degrees [17]. At least three ways have previously been proposed to measure similarity:

$$ s_{ij} = \frac{2 c_{ij}}{k(u_i) + k(u_j)}, \tag{3} $$

$$ s_{ij} = \frac{c_{ij}}{\sqrt{k(u_i)\, k(u_j)}}, \tag{4} $$

$$ s_{ij} = \frac{c_{ij}}{\min\left(k(u_i), k(u_j)\right)}. \tag{5} $$

Eq. (3) is called Sorensen's index of similarity (SI) [18], which was proposed by Sorensen in 1948; Eq. (4), called the cosine similarity, was proposed by Salton in 1983 and has a long history in the study of citation networks [17]; Eq. (5) is called Pearson correlation.
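As a rough illustration of Eqs. (1)-(5), the sketch below computes the three similarity indices and the unnormalized U-CF score on a toy data set. The dictionary `collections`, the function names, and the `kind` labels ("sorensen", "cosine", "min") are hypothetical choices made for this example and do not come from the original text.

```python
import math

def similarity(collections, i, j, kind="cosine"):
    """User-user similarity following Eqs. (2)-(5)."""
    c = len(collections[i] & collections[j])   # Eq. (2): common objects c_ij
    ki, kj = len(collections[i]), len(collections[j])
    if c == 0:
        return 0.0
    if kind == "sorensen":                     # Eq. (3)
        return 2 * c / (ki + kj)
    if kind == "cosine":                       # Eq. (4)
        return c / math.sqrt(ki * kj)
    if kind == "min":                          # Eq. (5)
        return c / min(ki, kj)
    raise ValueError(f"unknown similarity kind: {kind}")

def ucf_scores(collections, i, kind="cosine"):
    """Eq. (1): v_ij = sum over users l != i of s_il * a_jl.

    Only objects not yet collected by user i are scored.
    """
    scores = {}
    for l, objs in collections.items():
        if l == i:
            continue
        s = similarity(collections, i, l, kind)
        for o in objs - collections[i]:
            scores[o] = scores.get(o, 0.0) + s
    return scores

# Toy data invented for this example.
collections = {
    "u1": {"A", "B"},
    "u2": {"A", "B", "C"},
    "u3": {"B", "D"},
}
scores = ucf_scores(collections, "u1")
print(sorted(scores, key=scores.get, reverse=True))  # prints ['C', 'D']
```

Here "C" outranks "D" for `u1` because `u2`, who shares two objects with `u1`, is more similar than `u3`, who shares only one.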
Both Eq. (4) and Eq. (5) are widely used in recommendation systems [3, 4].

A common blemish of Eqs. (3)-(5) is that they do not take into account the influence of an object's degree, so objects with different degrees make the same contribution to the similarity. If users $u_i$ and $u_j$ have both selected object $o_l$, they have a similar taste for the object $o_l$. Provided that object $o_l$ is very popular (the degree of $o_l$ is very large), this taste (the favor for $o_l$) is a very ordinary one and does not mean that $u_i$ and $u_j$ are very similar. Therefore, its contribution to $s_{ij}$ should be small. On the other hand, provided that object $o_l$ is very unpopular (the degree of $o_l$ is very small), this taste is a peculiar one, so its contribution to $s_{ij}$ should be large. In other words, it is not very meaningful if two users both select a popular object, while if a very unpopular object is simultaneously selected by two users, there must be some common tastes shared by these two users. Accordingly, the contribution of object $o_l$ to the similarity $s_{ij}$ (if $u_i$ and $u_j$ both collected $o_l$) should be negatively correlated with its degree $k(o_l)$. We suppose that the contribution of object $o_l$ to $s_{ij}$ is inversely proportional to $k^{\alpha}(o_l)$, with $\alpha$ a freely tunable parameter. The similarity $s_{ij}$, which sums the contributions of all commonly collected objects, is normalized as in the cosine similarity of Eq. (4). Therefore, the proposed similarity reads:

$$ s_{ij} = \frac{1}{\sqrt{k(u_i)\, k(u_j)}} \sum_{l=1}^{n} \frac{a_{li}\, a_{lj}}{k^{\alpha}(o_l)}. \tag{6} $$

Note that the influence of an object's degree can also be embedded into the other two forms, shown in Eq. (3) and Eq. (5), and the corresponding algorithmic accuracies will be improved too. In this paper, we only show the numerical results on cosine similarity as a typical example.

For any user-object pair $u_i$-$o_j$, if $u_i$ has not yet collected $o_j$, the predicted score can be obtained by using Eq. (1). Here we do not normalize Eq. (1), because normalization would not affect the recommendation list: for a given target user, we only need to sort all her uncollected objects, and only the relative magnitude is meaningful. Note that if two objects have exactly the same score, their order is randomly assigned. We call this method a modified collaborative filtering (U-MCF), for it belongs to the framework of U-CF.

III. NUMERICAL RESULTS
Using a benchmark data set, namely MovieLens [19], we can evaluate the accuracy of the current algorithm. The data consists of 1682 movies (objects) and 943 users. Actually, MovieLens is a rating system, where each user rates movies on five discrete levels, 1-5. Hence we applied the coarse-graining method used in Refs. [4, 12]: a movie has been collected by a user if and only if the given rating is at least 3 (i.e. the user at least likes this movie). The original data contains $10^5$ ratings, 85.25% of which are $\geq 3$; thus the data after the coarse graining contains 85250 user-object pairs. The degree distributions of users and objects are presented in Fig. 1. Clearly, the degree distributions of both users and objects obey an exponential form. To test the recommendation algorithms, the data set is randomly divided into two parts: the training set contains 90% of the data, and the remaining 10% constitutes the probe. Of course, we can divide it in other proportions, for example, 80% vs. 20%, 70% vs. 30%, and so on. The training set is treated as known information, while no information in the probe set is allowed to be used for prediction.

FIG. 1: The degree distributions of users (left panel) and objects (right panel) in linear-log plot, where $P(k)$ denotes the cumulative degree distribution.

FIG. 2: The effect of parameter $\alpha$ in U-MCF; panels show (a) ranking score, (b) precision, (c) recall. The ranking score has its minimum at about $\alpha = 1.85$; at almost the same point, the recall and precision achieve their maximums. Present results are obtained by averaging over four independent 90% vs. 10% divisions. The error bars denote the standard deviations.

A recommendation algorithm could provide each user with a recommendation list which contains all her/his uncollected objects. There are several measures for evaluating the quality of the recommendation lists generated by different algorithms. In this paper, we use ranking score, recall and precision to measure the effectiveness of a given recommendation approach. A good overview of these measures can be found in Ref. [6].

Ranking score.
For an arbitrary user $u_i$, if the relation $u_i$-$o_j$ is in the probe set (according to the training set, $o_j$ is an uncollected object for $u_i$), we measure the position of $o_j$ in the ordered queue. For example, if there are 1000 uncollected movies for $u_i$, and $o_j$ is the 10th from the top, we say the position of $o_j$ is the top 10/1000, denoted by $r_{ij} = 0.01$. Since the probe entries are actually collected by users, a good algorithm is expected to rank them highly, thus leading to small $r$. Therefore, the mean value of the position $\langle r \rangle$ (called ranking score [4]), averaged over all the entries in the probe, can be used to evaluate the algorithmic accuracy. The smaller the ranking score, the higher the algorithmic accuracy, and vice versa. The definition of ranking score here is slightly different from that of Ref. [4]: if a movie or user in the probe set has not yet appeared in the training set, we automatically remove it from the probe, and the total number of movies is counted only over those appearing in the training set, while Ref. [4] takes into account movies that appear only in the probe by assigning zero score to them. This slight difference in implementation does not affect the conclusion.

FIG. 3: (Color online) (a): The predicted position of each entry in the probe ranked in the ascending order. (b): The precision for different lengths of recommendation lists. (c): The recall for different lengths of recommendation lists.
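The ranking score, together with the U-MCF scoring of Eqs. (1) and (6) that it evaluates, can be sketched as follows. This is a minimal pure-Python illustration rather than the implementation used for the reported results: the names `umcf_scores` and `ranking_score`, the dictionary layout, and the toy data are assumptions, and ties are broken alphabetically for reproducibility whereas the text assigns tied objects a random order.

```python
import math

def umcf_scores(train, i, alpha):
    """Scores of user i's uncollected objects via Eqs. (1) and (6)."""
    # Object degrees k(o_l), measured on the training set only.
    k_obj = {}
    for objs in train.values():
        for o in objs:
            k_obj[o] = k_obj.get(o, 0) + 1
    # Every training-set object not collected by user i is a candidate.
    scores = {o: 0.0 for o in k_obj if o not in train[i]}
    for l, objs in train.items():
        if l == i or not objs or not train[i]:
            continue
        # Eq. (6): each common object contributes 1 / k(o)^alpha.
        s = sum(k_obj[o] ** -alpha for o in train[i] & objs)
        s /= math.sqrt(len(train[i]) * len(objs))
        for o in objs - train[i]:
            scores[o] += s                      # Eq. (1), unnormalized
    return scores

def ranking_score(train, probe, alpha):
    """Mean relative position <r> of the probe entries."""
    positions = []
    for user, obj in probe:
        if user not in train:
            continue                            # user unseen in training
        scores = umcf_scores(train, user, alpha)
        if obj not in scores:
            continue                            # object unseen in training
        # Alphabetical tie-break: an assumption made for determinism.
        ranked = sorted(scores, key=lambda o: (-scores[o], o))
        positions.append((ranked.index(obj) + 1) / len(ranked))
    if not positions:
        raise ValueError("no probe entry can be evaluated")
    return sum(positions) / len(positions)

# Toy data invented for this example: "pop" is popular, "rare" is not.
train = {
    "u1": {"pop", "rare"},
    "u2": {"pop", "x"},
    "u3": {"rare", "y"},
    "u4": {"pop", "z"},
}
probe = [("u1", "y")]
for alpha in (0, 1):
    print(alpha, ranking_score(train, probe, alpha))
```

On this toy data all three candidates tie at $\alpha = 0$, so "y" lands second of three only through the alphabetical tie-break. At $\alpha = 1$ the shared unpopular object "rare" weighs more than the shared popular object "pop", and "y" moves to the top.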
Recall is defined as the ratio of the number of recommended objects that appear in the probe to the total number of objects. A larger recall corresponds to better performance. Recall is also called hitting rate in the literature [4].
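A hit count of this kind can be sketched as below. The sketch assumes that the denominator, "the total number of objects", refers to the number of user-object pairs in the probe, which is the usual reading of a hitting rate; the text does not state this explicitly. The function name `recall_at_L`, the dictionary `recommended`, and the toy data are hypothetical.

```python
def recall_at_L(recommended, probe, L):
    """Fraction of probe entries found in the users' top-L lists.

    `recommended` maps each user to his/her ranked list of uncollected
    objects; `probe` is a list of (user, object) pairs.
    Assumption: the denominator is the number of probe entries.
    """
    hits = sum(
        1 for user, obj in probe
        if obj in recommended.get(user, [])[:L]
    )
    return hits / len(probe)

# Toy data invented for this example.
recommended = {"u1": ["y", "x", "z"], "u2": ["w", "z", "y"]}
probe = [("u1", "y"), ("u1", "z"), ("u2", "y")]
print(recall_at_L(recommended, probe, 2))
print(recall_at_L(recommended, probe, 3))
```

With lists of length 2 only the pair ("u1", "y") is hit, giving 1/3; with length 3 all three probe entries are hit, which shows how the measure depends on the length of the recommendation list.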
Precision is defined as the ratio of the number of recommended objects that appear in the probe to the total number of recommended objects. A larger precision corresponds to better performance. Recall and precision depend on the length of the recommendation list $L$; we set $L = 50$ in our numerical experiment (in real e-commerce systems, the length of the recommendation list usually ranges from 10 to 100 [20]), therefore the total number of recommended objects is $mL = 47150$.

Fig. 2 reports the algorithmic accuracy of U-MCF, which has a clear optimal case around $\alpha = 1.85$. Fig. 3(a) reports the distribution of all the position values, $r_{ij}$, which are sorted from the top position ($r_{ij} \to 0$) to the bottom position ($r_{ij} \to 1$). Fig. 4 compares the standard case ($\alpha = 0$) and the optimal case ($\alpha = 1.85$) for different sizes of training sets. All these numerical results strongly demonstrate that depressing the contribution of commonly selected popular objects can further improve the algorithmic accuracy.

Similar to U-CF, the recommendation list can also be obtained by object-based collaborative filtering (O-CF); that is to say, the user will be recommended objects similar to the ones he/she preferred in the past [21]. By using the cosine expression, the similarity between two objects, $o_i$ and $o_j$, can be written as:

$$ s_{ij} = \frac{1}{\sqrt{k(o_i)\, k(o_j)}} \sum_{l=1}^{m} a_{il}\, a_{jl}. \tag{7} $$

The predicted score, to what extent $u_i$ likes $o_j$, is given as:

$$ v_{ij} = \sum_{l=1,\, l \neq i}^{n} s_{jl}\, a_{li}. \tag{8} $$

Analogously, taking into account the influence of user degree, a modified expression of object-object similarity reads:

$$ s_{ij} = \frac{1}{\sqrt{k(o_i)\, k(o_j)}} \sum_{l=1}^{m} \frac{a_{il}\, a_{jl}}{k^{\alpha}(u_l)}, \tag{9} $$

where $\alpha$ is a free parameter. The modified object-based collaborative filtering (O-MCF for short) can be obtained by combining Eq. (8) and Eq. (9). Fig. 5 reports the algorithmic accuracy of O-MCF, which has a clear optimal case around $\alpha = 0.95$. Fig. 6(a) reports the distribution of all the position values, $r_{ij}$, which are sorted from the top position ($r_{ij} \to 0$) to the bottom position ($r_{ij} \to 1$). Fig. 7 compares the standard case ($\alpha = 0$) and the optimal case ($\alpha = 0.95$) for different sizes of training sets. All these results, again, demonstrate that depressing the contribution of users with high degrees to the object-object similarity can further improve the algorithmic accuracy of the object-based method.

FIG. 4: (Color online) The standard CF (SCF) (i.e. $\alpha = 0$) vs. the optimal case (MCF) for different sizes of training sets; panels show (a) ranking score, (b) precision, (c) recall.

FIG. 5: The effect of parameter $\alpha$ in O-MCF; panels show (a) ranking score, (b) precision, (c) recall. The ranking score has its minimum at about $\alpha = 0.95$; at almost the same point, the recall and precision achieve their maximums. Present results are obtained by averaging over four independent 90% vs. 10% divisions. The error bars denote the standard deviations.

TABLE I: Three measures for different algorithms with probe set containing 10% data. For precision and recall, $L = 50$. Present results are obtained by averaging over four independent divisions. The values corresponding to U-MCF and O-MCF are the optimal ones.

method   <Ranking score>   <Precision>   <Recall>
GRM      0.1502            0.3077        0.0540
O-CF     0.1173            0.4035        0.0706
U-CF     0.1252            0.3773        0.0660
O-MCF    0.1019            0.4443        0.0777
U-MCF    0.1101            0.4108        0.0719

FIG. 6: (Color online) Similar to Fig. 3, but for O-MCF.

FIG. 7: (Color online) Similar to Fig. 4, but for O-MCF.

IV. CONCLUSION
We compare the MCF, the standard CF and GRM in Tab. I. Clearly, MCF is the best method and GRM performs worst. Compared with the standard CF, the modified object-based algorithm and the modified user-based algorithm improve the accuracy to different extents in all three measures. Ignoring the degree-degree correlation in user-object relations, the algorithmic complexity of U-MCF is $O(m^2 \langle k_u \rangle + mn \langle k_o \rangle)$ and that of O-MCF is $O(n^2 \langle k_o \rangle + mn \langle k_u \rangle)$. Here $\langle k_u \rangle$ and $\langle k_o \rangle$ denote the average degrees of users and objects. Therefore, one can choose either O-MCF or U-MCF according to the specific properties of the data source. For example, if the number of users is much larger than the number of objects (i.e. $m \gg n$), O-MCF runs much faster. On the contrary, if $n \gg m$, U-MCF runs faster. Furthermore, the remarkable improvement of algorithmic accuracy also indicates that our definition of similarity is more reasonable than the traditional one.

ACKNOWLEDGMENTS
We acknowledge GroupLens Research Group for providing us the data set MovieLens. This work is funded by the National Basic Research Program of China (973 Program No. 2006CB705500), the National Natural Science Foundation of China (Grant Nos. 60744003, 10635040, 10532060 and 10472116), and the Specialized Research Fund for the Doctoral Program of Higher Education of China. T.Z. acknowledges financial support from SBF (Switzerland) through project C05.0148 (Physics of Risk), and the Swiss National Science Foundation (205120-113842).

[1] M. Faloutsos, P. Faloutsos, and C. Faloutsos, Comput. Commun. Rev. 29, 251 (1999).
[2] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, J. Wiener, Comput. Netw. 33, 309 (2000).
[3] G. Adomavicius, and A. Tuzhilin, IEEE Trans. Know. &&