Efficient Multicore Collaborative Filtering
Yao Wu
Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
[email protected]

Qiang Yan
Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
[email protected]

Danny Bickson
Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA
[email protected]

Yucheng Low
Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA
[email protected]

Qing Yang
Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
[email protected]
ABSTRACT
This paper describes the solution method taken by the LeBuSiShu team for Track 1 of the ACM KDD CUP 2011 contest (resulting in the 5th place). We identified two main challenges: the unique item taxonomy characteristics and the large size of the dataset. To handle the item taxonomy, we present a novel method called Matrix Factorization with Item Taxonomy Regularization (MFITR). MFITR obtained the second-best prediction result out of the more than ten algorithms we implemented. For rapidly computing solutions of the various algorithms, we implemented an open source parallel collaborative filtering library on top of the GraphLab machine learning framework. We report preliminary performance results obtained using the BlackLight supercomputer.
General Terms
Machine learning, data mining
Keywords
Collaborative filtering, matrix factorization, tensor factorization.
1. INTRODUCTION
The task in Track 1 of the ACM KDD CUP 2011 was to predict music ratings using a real dataset obtained from the Yahoo! Music service. A full description of the dataset is given in [3]. Two main factors make the prediction task challenging. First, the magnitude of the dataset is rather large: there are 1,000,990 users, 624,961 music items (songs) and 262,810,175 user ratings, spanning 6649 time bins. For data of this magnitude, commonly used mathematical software like Matlab cannot be efficiently deployed. Second, the data includes additional features, such as the time when each user rating was recorded, as well as a hierarchy relating rated items to genres (each rated song can belong to one or more genres), albums and artists.

In this paper we describe how we handled these two challenges. Section 2 outlines the theoretical algorithms used for computing the predictions. Section 3 explains how those algorithms were adapted to the KDD CUP contest, namely accounting for the hierarchy of data items. Section 4 discusses our efficient custom parallel implementation on top of the GraphLab machine learning framework, which was used to rapidly fine-tune multiple algorithm parameters, and reports performance results. We conclude in Section 5. As an additional contribution, we release open source code of many of the implemented algorithms as part of GraphLab's collaborative filtering library, available from http://graphlab.org/.
2. ALGORITHMS
Inspired by the BellKor team's algorithm, which won the Netflix contest [6], we deployed an ensemble method, combining a collection of collaborative filtering algorithms and blending the solutions together. The ensemble comprises the 12 methods listed in Table 1, of which the last two are novel. In the rest of this section we describe the implemented algorithms in more detail.

No | Algorithm | Description
1 | Item-kNN [8] | Neighborhood based
2 | ALS [16] | Alternating least squares
3 | wALS [11] | Weighted alternating least squares
4 | BPTF [15, 12] | Bayesian prob. tensor factorization
5 | SGD [8, 13] | Stochastic gradient descent
6 | SVD++ [5] | SVD++ algorithm
7 | time-kNN [7] | Time-aware neighborhood model
8 | time-SGD [7] | Time-aware SGD
9 | time-SVD++ [7] | Time-aware SVD++
10 | Random forest [2] | Random forest
11 | MFITR | MF with item taxonomy regularization
12 | time-MFITR | MF with item taxonomy regularization, time-aware

Table 1: Different algorithms implemented. The last two are our novel contribution.
2.1 Neighborhood models

An item-based neighborhood approach predicts the rating r_{ui} of a user u for a new item i using the ratings the user u gave to items similar to i. We chose the Adjusted Cosine (AC) similarity to measure the similarity w_{ij} between items i and j:

w_{ij} = \frac{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_u)(r_{uj} - \bar{r}_u)}{\sqrt{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_u)^2 \sum_{u \in U_{ij}} (r_{uj} - \bar{r}_u)^2}}.

Here, U_{ij} denotes the set of users who have rated both items i and j, and \bar{r}_u is the mean rating of user u. Based on the similarity, for every item i we can compute the neighborhood N_i, which contains the K items most similar to i. We can then predict \hat{r}_{ui} based on the items in both N_i and R_u, the set of items rated by user u:

\hat{r}_{ui} = \frac{\sum_{j \in R_u \cap N_i} w_{ij} r_{uj}}{\sum_{j \in R_u \cap N_i} |w_{ij}|},

where \cap denotes the intersection of two sets.

To address the computational challenges arising from the huge number of items, we split the items into N parts; in the i-th iteration we only need to compute the neighbors of the items in the i-th part. In our experiments we set N = 300, so that the I × J similarity matrix M fit into a computer with 8 GB of memory; here I is the number of items and J = I/N. This method can be easily parallelized.

Alternating least squares (ALS) [16] is a simple matrix factorization algorithm (a minimal sketch follows this subsection). The non-zero ratings form a matrix A of size M × N, where M is the number of users and N is the number of items. The matrix A is decomposed into two low-rank matrices, A ≈ UV, where U is of size M × D and V is of size D × N. Starting from an initial guess, each iteration first fixes U and computes V using a least squares procedure, then fixes V and computes U using the same least squares procedure. The rating is computed as an inner product of the matching user and item feature vectors:

\hat{r}_{ui} = \sum_{j=1}^{D} U_{u,j} V_{j,i}.

The ALS model can be extended to the tensor case, where time information is included with the rating. Bayesian probabilistic tensor factorization (BPTF) [15] is a Markov chain Monte Carlo method in which, on top of the least squares step, sampling from the hyperpriors of U, V is added.
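To make the alternation concrete, here is a minimal single-machine sketch of one ALS sweep using the regularized normal equations. The data layout (lists of (id, rating) pairs per user and per item) and all names are our own illustration, not the GraphLab implementation; we assume every user and item has at least one rating.

```python
import numpy as np

def als_iteration(A_by_user, A_by_item, U, V, lam=1.0):
    """One ALS sweep: solve a regularized least squares problem for each
    user row of U with V fixed, then for each item column of V with U
    fixed.  A_by_user[u] / A_by_item[i] hold (item, rating) / (user,
    rating) pairs; U is M x D, V is D x N.
    """
    D = U.shape[1]
    for u, rated in A_by_user.items():
        Vs = V[:, [i for i, _ in rated]].T          # |rated| x D
        r = np.array([rating for _, rating in rated])
        U[u] = np.linalg.solve(Vs.T @ Vs + lam * np.eye(D), Vs.T @ r)
    for i, raters in A_by_item.items():
        Us = U[[u for u, _ in raters]]              # |raters| x D
        r = np.array([rating for _, rating in raters])
        V[:, i] = np.linalg.solve(Us.T @ Us + lam * np.eye(D), Us.T @ r)
```

Each solve is a D × D linear system, which is why the per-iteration cost of ALS grows faster with the feature width D than SGD's (see Section 4).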
Matrix factorization methods have demonstrated superior performance compared with neighborhood-based models [5]. Matrix factorization models map both users and items to a joint latent factor space of dimension D, such that user-item interactions are modeled as inner products in that space. Each item i and user u is associated with a D-dimensional latent feature vector, q_i and p_u respectively. The predicted rating is thus computed by:

\hat{r}_{ui} = \mu + b_i + b_u + q_i^T p_u.

The parameters b_i, b_u, q_i and p_u are learned by minimizing a loss function based on the (u, i) pairs in the set of observed ratings O:

\min \sum_{(u,i) \in O} (r_{ui} - \hat{r}_{ui})^2 + \lambda (b_i^2 + b_u^2 + \|q_i\|^2 + \|p_u\|^2),   (1)

where \|\cdot\| denotes the Frobenius 2-norm and the positive constant \lambda controls the extent of regularization; it is determined by cross-validation. We used stochastic gradient descent optimization to minimize the loss function (1); a minimal sketch of the update appears at the end of this discussion. The complexity of each iteration is linear in the number of ratings.

Implicit feedback can improve prediction accuracy, since it provides an additional indication of user preferences. SVD++ [5] is an extension of the linear model (1). For each item i, we add an additional latent factor y_i; the latent factor vector of each user u can then be characterized by the set of items the user has rated. The exact model is as follows:

\hat{r}_{ui} = \mu + b_i + b_u + q_i^T \left( p_u + |R_u|^{-1/2} \sum_{j \in R_u} y_j \right).

Similarly, we can learn the parameters b_i, b_u, q_i, p_u and y_i using stochastic gradient descent optimization to minimize the quadratic loss function. Again, the complexity per iteration is linear in the number of ratings.

In the time-aware CF model, each rating r_{ui} is associated with a time stamp t_{ui}, which indicates the time when the rating was observed. A rating observed 3 years ago is less important than a rating taken 3 days ago when used to predict current ratings. Following [7], we define a time-decay function to model this effect:

f_{ui}(t) = e^{-\beta (t - t_{ui})},

where \beta \geq 0; when \beta = 0, temporal effects are not considered. We can incorporate the temporal effect into the neighborhood models as follows:

\hat{r}_{ui}(t) = \frac{\sum_{j \in R_u \cap N_i} f_{uj}(t) w_{ij} r_{uj}}{\sum_{j \in R_u \cap N_i} f_{uj}(t) |w_{ij}|}.

In our experiments, we found that setting \beta = 0.08 gave the best performance. Overall, the time-aware neighborhood model achieved significantly better results than the time-independent neighborhood model.
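The SGD update referenced above minimizes loss (1) with one gradient step per observed rating. This is a minimal sketch under our own naming and array-layout assumptions; the default learning rate and regularization values are merely the orders of magnitude reported in Table 3, not prescriptions.

```python
import numpy as np

def sgd_epoch(ratings, mu, b_u, b_i, P, Q, gamma=5e-4, lam=1e-4):
    """One SGD pass over observed (user, item, rating) triples.

    P[u], Q[i] are the D-dimensional latent factors p_u, q_i; b_u, b_i
    are the bias arrays.  Each step follows the gradient of loss (1):
    squared error plus L2 regularization on biases and factors.
    """
    for u, i, r in ratings:
        e = r - (mu + b_i[i] + b_u[u] + Q[i] @ P[u])
        b_u[u] += gamma * (e - lam * b_u[u])
        b_i[i] += gamma * (e - lam * b_i[i])
        p_old = P[u].copy()                 # use the pre-update p_u for q_i
        P[u] += gamma * (e * Q[i] - lam * P[u])
        Q[i] += gamma * (e * p_old - lam * Q[i])
```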
As in [6], we model the temporal effect in the matrix factorization by allowing the parameters to vary over time. In particular, we quantize the user-based time effect into day bins T_1, T_2, ..., T_N and add an item-time-bin bias to the equation. The predicted rating is computed as follows:

\hat{r}_{ui}(t) = \mu + b_i + b_u + b_{u,t} + b_{i,Bin(t)} + q_i^T \left( p_u + |R_u|^{-1/2} \sum_{j \in R_u} y_j \right).

Here, Bin(t) maps the time stamp t to one of the N bins; we chose N = 30 as the number of time bins. Since b_{u,t} is a two-dimensional array, we factorize it as b_{u,t} = x_u^T z_t to cut the memory cost:

\hat{r}_{ui}(t) = \mu + b_i + b_u + x_u^T z_t + b_{i,Bin(t)} + q_i^T \left( p_u + |R_u|^{-1/2} \sum_{j \in R_u} y_j \right).

Hence, we can extend the time-independent loss function to the following form:

\min \sum_{(u,i,t) \in O} (r_{uit} - \hat{r}_{ui}(t))^2 + \lambda_1 (b_i^2 + b_u^2 + b_{i,Bin(t)}^2) + \lambda_2 \left( \|q_i\|^2 + \|p_u\|^2 + \sum_{j \in R_u} \|y_j\|^2 \right) + \lambda_3 ( \|x_u\|^2 + \|z_t\|^2 ).

Similarly, the model described in Section 3 can be extended to a time-aware version:

\hat{r}_{ui}(t) = \mu + b_i + b_u + b_a + x_u^T z_t + b_{i,Bin(t)} + (q_i + q_a)^T p_u.

The above techniques largely make use of rating information only. To utilize the remaining information, such as album and artist, we used random forests to perform regression on all item features. We treat this prediction task as estimating a function x_{ui} \to r_{ui}, where x_{ui} is a D-dimensional vector of the features of a (u, i) pair and r_{ui} denotes the rating. In our experiments, the features included the user id, item id, artist, album and genres. We used random forests [2] to regress on the features. The single model did not perform as well as traditional collaborative filtering algorithms (it obtained an RMSE of 26.0 on the validation dataset), but we found it improved our final solution after blending with other models (yielding an improvement of 0.08 on the leaderboard).

A lesson learned from the Netflix contest is that a combination of different algorithms can lead to significant performance improvements over individual algorithms [14]. We blended our multiple predictors using a linear regression model, using the validation set to compute the optimal M-dimensional linear combination weights w, where M is the number of blended models. First, we trained each model m_i independently on the training set and obtained its predictions x_i of the N ratings in the validation set, where x_i is an N-dimensional vector. With y the target values for the N data points, the weights w can be obtained by solving a least squares problem. In our solution, we used ridge regression:

w = (X^T X + \lambda I)^{-1} X^T y,

where I denotes the identity matrix and the regularization \lambda is determined by cross-validation [4]. Next, every model is trained again using the same initialization parameters, but training is now performed on both the training set and the validation set. Finally, the test set predictions \hat{x}_i of each model m_i are computed and the final prediction is obtained as \hat{r} = \hat{X}^T w.
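A minimal sketch of the blending step follows: the closed-form ridge solution above, with X laid out as an N × M matrix whose columns are the models' validation predictions (our convention; the paper's \hat{X}^T w corresponds to the transposed layout).

```python
import numpy as np

def blend_weights(X, y, lam=1.0):
    """Ridge-regression blending weights w = (X^T X + lam I)^{-1} X^T y.

    X: N x M matrix whose columns are the validation-set predictions of
    the M individual models; y: the N true validation ratings.
    """
    M = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)

# Final test predictions: stack each model's test-set predictions as
# the columns of X_test and combine them linearly:
#   r_hat = X_test @ blend_weights(X_val, y_val, lam=best_lam)
```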
3. MFITR: MATRIX FACTORIZATION WITH ITEM TAXONOMY REGULARIZATION
A unique property of the KDD data is that tracks, albums, artists and genres form a hierarchy: each track belongs to an album, each album belongs to an artist, and both are tagged by genres [3]. We propose MFITR, a novel method that utilizes item taxonomy information to improve prediction accuracy.

To capture the hierarchy of items, we construct a graph between tracks, albums and artists. (We did not use genre information.) The method differs from traditional matrix factorization in that we model the item hierarchy as a regularization term constraining the matrix factorization computation, as explained next. A closely related approach was recently proposed in the social network domain [10].

We say an object i is a parent of another object j if there exists a hierarchical relationship between them and j belongs to i; conversely, j can be seen as a child of i. The root nodes of the hierarchy are artists, and their children nodes are albums. Tracks belonging to an album are children of that album. The item hierarchy is therefore a graph; we denote the parent set of item i as P_i and its child set as C_i.

If a user u has given an artist a_i a rating of 100 and another artist a_j a rating of 0, we can intuitively conclude that u will like the tracks and albums of a_i more than those of a_j. Based on this intuition, we propose a model based on matrix factorization. The prediction is computed by:

\hat{r}_{ui} = \mu + b_i + b_u + b_a + (q_i + q_a)^T p_u.

Here, b_a is the bias of artist a and q_a is the feature vector of artist a, the performer of music item i. Furthermore, to support different ratings of tracks within the same album, we propose a more advanced model:

\min \sum_{(u,i) \in O} \left( r_{ui} - \mu - b_i - b_u - b_a - (q_i + q_a)^T p_u \right)^2 + \lambda_1 (b_i^2 + b_u^2 + b_a^2) + \lambda_2 (\|q_i\|^2 + \|p_u\|^2 + \|q_a\|^2) + \lambda_3 \sum_i \sum_{j \in P_i} w_{ij} \|q_i - q_j\|^2 + \lambda_4 \sum_i \sum_{j \in C_i} w_{ij} \|q_i - q_j\|^2.

Here, w_{ij} is the similarity between i and j, computed as in the neighborhood model. If i and j are similar, the distance between q_i and q_j should not be large. Table 2 summarizes the notation used in MFITR.

Symbol | Meaning
r_{ui} | Observed rating for item i by user u
\hat{r}_{ui} | Predicted rating for item i by user u
b_i | Bias for item i
b_u | Bias for user u
b_a | Bias for artist a
\mu | Mean rating
q_i | Feature vector for item i
p_u | Feature vector for user u
q_a | Feature vector for artist a
\lambda_1 | Weighting factor for biases
\lambda_2 | Weighting factor for feature vectors
\lambda_3 | Weighting factor for parent similarity
\lambda_4 | Weighting factor for child similarity
w_{ij} | Neighborhood similarity between objects i and j
C_i | Graph children set of object i
P_i | Graph parent set of object i

Table 2: MFITR notation.
The MFITR cost function is composed of four terms. The first term minimizes the Euclidean distance between the observed and predicted ratings, where the biases of user, item and artist are taken into account. The second term is a regularization term on the biases to prevent over-fitting. The third and fourth terms enforce similarity between tracks in the same album and between albums of the same artist, when they are rated closely by the neighborhood model.

An advantage of this approach is that the similarity between parents and children can propagate indirectly during the learning phase [10]. For example, two tracks t_a and t_b from the same album l have no direct relationship between them. However, since they share the same parent l, the distance between q_{t_a} and q_{t_b} is actually minimized indirectly when the distances w_{l,t_a} \|q_l - q_{t_a}\|^2 and w_{l,t_b} \|q_l - q_{t_b}\|^2 are minimized.

An immediate extension we implemented is to add time information to the cost function, as done in time-SVD++. We call this variant time-MFITR. As shown in the next section, time-MFITR performs very well on the KDD data.
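To illustrate how the \lambda_3 and \lambda_4 terms act during learning, here is a minimal sketch of the gradient step they contribute for one item: each step pulls q_i toward the factors of its parents and children, with strength proportional to w_{ij}. The separate-step structure and all names are our own illustration; in the full algorithm this contribution would be folded into the SGD update of the complete cost function.

```python
import numpy as np

def taxonomy_step(Q, i, parents, children, w, lam3=1e-3, lam4=1e-3,
                  gamma=8e-5):
    """Gradient step on the MFITR taxonomy terms for item i.

    parents / children: lists of parent / child item ids of i;
    w[(i, j)]: neighborhood similarity between i and j.
    d/dq_i of w_ij * ||q_i - q_j||^2 is 2 * w_ij * (q_i - q_j).
    """
    grad = np.zeros_like(Q[i])
    for j in parents:
        grad += lam3 * w[(i, j)] * (Q[i] - Q[j])
    for j in children:
        grad += lam4 * w[(i, j)] * (Q[i] - Q[j])
    Q[i] -= gamma * 2.0 * grad
```

Because the same pull is applied for every edge of the taxonomy graph, two sibling tracks are drawn together through their shared parent even though no direct term connects them, which is exactly the propagation effect described above.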
4. EFFICIENT MULTICORE IMPLEMENTATION
The majority of the algorithms described were implemented on top of the GraphLab parallel machine learning framework [9]. We selected GraphLab since it allowed us to rapidly prototype and test multiple CF algorithms. The following algorithms were implemented: ALS, weighted ALS (wALS), SVD++, PMF, BPTF and SGD. Since we used multiple algorithms, each with multiple tunable parameters that needed to be adjusted, an efficient parallel solution was essential for rapidly improving our model. All of the above algorithms are open sourced as part of the GraphLab collaborative filtering library: http://graphlab.org/.

We utilized our own cluster (several AMD Opteron 8387 4-8 core machines, 2.7 GHz, 16-64 GB memory) as well as the BlackLight [1] supercomputer (an SGI UV 1000 NUMA shared-memory system comprising 256 blades, each holding two Intel Xeon X7560 Nehalem 2.27 GHz eight-core processors, for a total of 4096 cores). Overall, we estimate that we used around 10,000 CPU hours on our cluster and 10,000 CPU hours on BlackLight. Each algorithm was run in parallel using 8-32 cores, using line search for each tunable parameter (a sketch of this outer loop follows). Each of those runs was repeated twice: with and without the validation data used for training, as explained in Section 2.
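The line search over each tunable parameter amounts to a simple outer loop around training and validation scoring. In this sketch, the `train` and `validation_rmse` callables are placeholders standing in for any of the algorithms above, not GraphLab APIs.

```python
def line_search(values, train, validation_rmse):
    """Try each candidate value of one tunable parameter and keep the
    best setting by validation RMSE.  `train(v)` fits a model with the
    parameter set to v; `validation_rmse(model)` scores it.
    """
    best_val, best_rmse, best_model = None, float("inf"), None
    for v in values:
        model = train(v)
        rmse = validation_rmse(model)
        if rmse < best_rmse:
            best_val, best_rmse, best_model = v, rmse, model
    return best_val, best_rmse, best_model

# e.g. lam, rmse, model = line_search([1e-5, 1e-4, 1e-3], train, score)
```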
Table 3 lists the different tunable parameters we tested for each algorithm, together with the optimized settings we found. As a performance baseline, we measure the RMSE (root mean square error) on the validation dataset. The most effective single algorithm is time-SVD++, which obtained an RMSE of 20.90 on the validation data. The second most effective single algorithm is our novel time-MFITR, which obtained an RMSE of 21.10 on the validation data. Note that while wALS obtained the best validation performance, it overfit and gave worse performance on the actual test data. A summary of the results is given in Figure 1.

Figure 1: RMSE of the different algorithms (ALS, SGD, BPTF, SVD++, time-SVD, MFITR, time-MFITR, time-SVD++) on the validation data.

Regarding performance of the parallel implementation, Figure 2 shows the speedup of several algorithms on BlackLight. Similar results were obtained on the Linux cluster and are not repeated here. Speedup is defined relative to the baseline of a single-CPU run. For wALS and ALS we obtain an almost optimal speedup of 14x on 16 cores. BPTF performs slightly worse, since it has a serial sampling step after each iteration which slows the algorithm a little. SGD and SVD++ performance is worse, with speedups of about 6x and 3x, respectively. The reason is that we deploy a locking mechanism to prevent users from updating the same item feature vector concurrently (sketched after Figure 2).

Regarding accuracy of the parallel computation versus an equivalent serial result, Figure 3 examines the validation RMSE of 5 iterations of the SGD algorithm (D = 50) using 1-16 cores on BlackLight. Because of the parallel implementation there are slight variations in accuracy; however, the variations are no more than 0.1% of the serial result. Similar behavior was observed for the other algorithms.

Another interesting question we examined is how the computation scales with the length of the feature vector. Figure 4 shows the good scaling of the SGD algorithm; the same scaling was observed for SVD++. The running time of both algorithms is almost linear in the number of features. ALS, wALS and BPTF all perform a matrix inversion as part of the update rule, so their scaling is worse.

Figure 5 depicts the running time of a single iteration of several algorithms on 16 cores with D = 20. SGD and SVD++ (not shown, but with a running time similar to SGD) are the fastest algorithms per iteration. On the other hand, they are harder to parallelize efficiently.

Figure 2: Speedup of GraphLab using KDD data on BlackLight with up to 16 cores (ALS, wALS, SGD, BPTF, SVD++, against the optimal line).
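The locking scheme mentioned above can be sketched as one lock per item feature vector. This is our illustrative reconstruction in plain Python threads, not the GraphLab scheduler's actual mechanism.

```python
import threading

def make_locked_updater(Q):
    """One lock per item feature vector in Q (num_items x D)."""
    locks = [threading.Lock() for _ in range(Q.shape[0])]

    def update_item(i, grad, gamma=5e-4):
        # Serialize writes to item i, so two worker threads handling
        # different users never update Q[i] at the same time.
        with locks[i]:
            Q[i] -= gamma * grad

    return update_item

# Worker threads process disjoint slices of the ratings and call
# update_item(i, grad) for each observed (u, i) pair they touch; the
# contention on popular items is what limits SGD/SVD++ speedup.
```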
No | Algorithm | Tunable parameters (best setting) | RMSE on validation
1 | Neighborhood model | adjusted cosine (AC) similarity | 23.34
2 | ALS | \lambda = regularization (1), D = feature vector width (120), r = iterations (50) | 22.01
3 | wALS | \lambda = regularization (1), D = feature vector width (120), r = iterations (50) | 18.87
4 | BPTF | b = burn-in period (5), D = feature width (20), s = scaling of parameters (100), r = iterations (100) | 21.84
5 | SGD | q = learning rate decay (0.95), \gamma = learning rate (5e-4), \lambda = weight factor (1e-4) | r = 50: 21.92; r = 100: 21.88
6 | SVD++ | \gamma = learning rate (5e-4), q = learning rate decay (0.95), \lambda_1 = weight factor (1e-4), \lambda_2 = weight factor (1e-4) | D = 50: 21.65; D = 100: 21.59
7 | Time-aware neighborhood | \beta = time decay (0.08) | 22.7
8 | time-SVD | \lambda_1 = 1e-4, \lambda_2 = 5e-4, \lambda_3 = 5e-4, \gamma = learning rate (1e-4), q = learning rate decay (0.95) | D = 50: 21.45; D = 100: 21.41
9 | time-SVD++ | \lambda_1 = 1e-5, \lambda_2 = 1e-4, \lambda_3 = 3e-4, \gamma = learning rate (5e-5), q = learning rate decay (0.95) | D = 50: 20.97; D = 100: 20.90
10 | MFITR | \lambda_1 = 1e-5, \lambda_2 = 1e-4, \lambda_3 = 1e-3, \lambda_4 = 1e-3, \gamma = learning rate (8e-5), q = learning rate decay (0.95) | D = 50: 21.39; D = 100: 21.30
11 | time-MFITR | \lambda_1 = 1e-5, \lambda_2 = 1e-4, \lambda_3 = 1e-3, \lambda_4 = 1e-3, \lambda_5 = 1e-3, \gamma = learning rate (8e-5), q = learning rate decay (0.95) | D = 50: 21.10
12 | Random forest | | 26.0
- | Session-based post-processing | Neighborhood model 23.34 → 23.02; SGD 21.92 → 21.77; SVD++ 21.65 → 21.51 |
- | Blending all together | | 19.90

Table 3: Main results on the validation data obtained using the different algorithms.

Figure 3: Accuracy of SGD validation RMSE using different numbers of cores (1-16). Variations are up to 0.1%.
5. CONCLUSION
We have utilized the GraphLab parallel machine learning framework to efficiently and rapidly implement multiple collaborative filtering algorithms. Having a fast way of testing multiple model settings allowed us to efficiently blend multiple algorithms together. We have further introduced a novel algorithm, MFITR, which accounts for item taxonomy. Using a fast multicore implementation of multiple algorithms, as well as combining the solution of our MFITR algorithm, allowed us to achieve the 5th place in Track 1 of the ACM KDD CUP 2011 contest.
Figure 4: SGD time per iteration (sec) vs. feature vector width (D).
Acknowledgement
D. Bickson would like to thank Liang Xiong from Carnegie Mellon University for his extensive help in implementing the BPTF algorithm [15] in parallel. D. Bickson would also like to thank Joel Welling from the Pittsburgh Supercomputing Center for his great support in utilizing the BlackLight supercomputer, generously made available by the grant "A Very Large Shared Memory System for Science and Engineering", OCI-1041726 (PI Levine). Carnegie Mellon work was supported by grants ARO MURI W911NF0710287, ARO MURI W911NF0810242, NSF Mundo IIS-0803333, NSF Nets-NBD CNS-0721591 and ONR MURI N000140710747.
Figure 5: Running time of a single iteration (D = 20, 16 cores) for BPTF, wALS and SGD.
6. REFERENCES

[1] The BlackLight supercomputer, Pittsburgh Supercomputing Center.
[2] L. Breiman. Random forests. Machine Learning, 45:5–32, 2001.
[3] G. Dror, N. Koenigstein, Y. Koren, and M. Weimer. The Yahoo! Music Dataset and KDD-Cup'11. In KDD-Cup Workshop, 2011.
[4] M. Jahrer, A. Töscher, and R. Legenstein. Combining predictions for an accurate recommender system. In KDD '10, pages 693–701. ACM, 2010.
[5] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, pages 426–434, New York, NY, USA, 2008. ACM.
[6] Y. Koren. The BellKor solution to the Netflix Grand Prize. Technical report, August 2008.
[7] Y. Koren. Collaborative filtering with temporal dynamics. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, pages 447–456, New York, NY, USA, 2009. ACM.
[8] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42:30–37, August 2009.
[9] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. GraphLab: A new parallel framework for machine learning. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (UAI), Catalina Island, USA, July 2010.
[10] H. Ma, D. Zhou, C. Liu, M. R. Lyu, and I. King. Recommender systems with social regularization. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM '11, pages 287–296, New York, NY, USA, 2011. ACM.
[11] R. Pan, Y. Zhou, B. Cao, N. N. Liu, R. Lukose, M. Scholz, and Q. Yang. One-class collaborative filtering. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, pages 502–511, Washington, DC, USA, 2008. IEEE Computer Society.
[12] R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In ICML '08: Proceedings of the 25th International Conference on Machine Learning, pages 880–887, New York, NY, USA, 2008. ACM.
[13] G. Takács, I. Pilászy, B. Németh, and D. Tikk. Scalable collaborative filtering approaches for large recommender systems. J. Mach. Learn. Res., 10:623–656, June 2009.
[14] A. Töscher, M. Jahrer, and R. M. Bell. The BigChaos solution to the Netflix Grand Prize. Technical report, 2009.
[15] L. Xiong, X. Chen, T. K. Huang, J. Schneider, and J. G. Carbonell. Temporal collaborative filtering with Bayesian probabilistic tensor factorization. In Proceedings of SIAM Data Mining, 2010.
[16] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the Netflix Prize. In Proceedings of the 4th International Conference on Algorithmic Aspects in Information and Management (AAIM '08), pages 337–348, 2008.