Cold-start recommendations in Collective Matrix Factorization
David Cortes

March 17, 2020
Abstract
This work explores the ability of collective matrix factorization models in recommender systems to make predictions about users and items for which there is side information available but no feedback or interactions data, and proposes a new formulation with a faster cold-start prediction formula that can be used in real-time systems. While these cold-start recommendations are not as good as warm-start ones, they were found to be of better quality than non-personalized recommendations, and predictions about new users were found to be more reliable than those about new items. The formulation proposed here resulted in improved cold-start recommendations in many scenarios, at the expense of worse warm-start ones.
This work aims to explore the quality of cold-start recommendations derived from collective matrix factorization models [11] in collaborative filtering with explicit-feedback data in the form of ratings. Recommender systems based on collaborative filtering are typically constructed solely from data about user-item interactions [6], such as movies rated by different users. This results in domain-independent and easily-implementable models, but has the disadvantage of only being able to make recommendations for users and items for which there is interactions data available (known as warm-start recommendations in the literature).

In many settings, however, there is oftentimes additional side information available about users and/or items, which is not used in the most common models such as low-rank matrix factorization [6] or kNN-based formulas [10], but which can be used both to improve recommendation models that take interactions data, and to make recommendations in the absence of interactions data (so-called cold-start recommendations).

This work focuses on the second case: studying recommendations from matrix factorization models that are based on attributes data without interactions data.
Collective Matrix Factorization
Collective matrix factorization is an extension of the low-rank factorization model that tries to incorporate attributes about the users and/or items by also factorizing the matrices associated with their side information, sharing the latent factors between them.

More formally, recommendation models based on low-rank matrix factorization try to factorize a partially-observed matrix $X_{u \times i}$ of user-item interactions (e.g. movie ratings), where $u$ is the number of users and $i$ is the number of items, into the product of two lower-dimensional matrices $A_{u \times k}$ and $B_{i \times k}$, where $k \ll u, i$, which can be thought of as latent factors determined for each user and item, by minimizing some loss function, such as squared loss, defined only on the entries of $X$ that are known (hereafter denoted by the indicator function $I_x$), e.g.:

$$\min_{A, B} \ \lVert I_x \odot (X - A B^T) \rVert^2$$

Having obtained these matrices, it is then possible to predict the values of $X$ for entries that are not known through the dot product $\langle a_u, b_i \rangle$ for user $u$ and item $i$.
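As an illustration, the factorization, its masked squared loss, and the dot-product prediction rule can be sketched in NumPy (all sizes and values here are hypothetical placeholders, not fitted parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 100, 50, 5   # hypothetical sizes, with k << n_users, n_items

# Stand-ins for fitted latent-factor matrices A (users x k) and B (items x k)
A = rng.normal(size=(n_users, k))
B = rng.normal(size=(n_items, k))

# Predicted rating for user u and item i is the dot product <a_u, b_i>
u, i = 3, 7
pred = A[u] @ B[i]

# Squared loss evaluated only on the observed entries, via an indicator mask I_x
X = rng.normal(size=(n_users, n_items))     # ratings matrix (placeholder values)
I_x = rng.random((n_users, n_items)) < 0.1  # True where a rating is observed
loss = np.sum(I_x * (X - A @ B.T) ** 2)
```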
Recommendations are then made by sorting these predictions in descending order.

In most implementations, this model is improved by centering the data (subtracting the global mean $\mu$ from each entry), adding user and item biases $m_u$ (row vector) and $n_i$ (column vector), which might be treated as model parameters or obtained through a simple heuristic before attempting to obtain optimal values for $A$ and $B$, and by adding regularization on all the model parameters, resulting in the following problem:

$$\min_{A, B} \ \lVert I_x \odot (X - \mu - m - n - A B^T) \rVert^2 + \lambda (\lVert A \rVert^2 + \lVert B \rVert^2)$$

This is a non-convex optimization problem for which local minima can be found either by gradient-based methods or, more typically, by the ALS (alternating least-squares) algorithm [6][16]. ALS takes advantage of the fact that, when holding one of the low-rank matrices constant, the optimal values for the other can be obtained through a closed-form solution that involves solving linear systems; the algorithm then alternates between solving for one matrix while holding the other constant, until convergence.

The main idea behind collective matrix factorization is to jointly factorize the interactions matrix $X_{u \times i}$ along with the user attributes matrix $U_{u \times p}$ and the item attributes matrix $I_{i \times q}$, introducing new matrices $C_{p \times k}$ and $D_{q \times k}$ for the user and item attributes (assuming there is data about both user and item attributes), but sharing the $A_{u \times k}$ and $B_{i \times k}$ matrices between factorizations:

$$\min_{A, B, C, D} \ \lVert I_x \odot (X - A B^T) \rVert^2 + \lVert U - A C^T \rVert^2 + \lVert I - B D^T \rVert^2$$
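The ALS half-step for the biased factorization described above can be sketched as follows (an illustrative NumPy implementation; the function and variable names are not from the paper, and the per-user linear systems are solved naively for clarity):

```python
import numpy as np

def als_update_users(X, mask, B, mu, m, n, lam):
    """One ALS half-step: closed-form update of all user factors A,
    holding the item factors B (and all biases) fixed.
    X: (users x items) ratings; mask: boolean matrix of observed entries."""
    n_users, k = X.shape[0], B.shape[1]
    A = np.zeros((n_users, k))
    for u in range(n_users):
        idx = mask[u]                        # items rated by user u
        B_u = B[idx]                         # rows of B for those items
        r = X[u, idx] - mu - m[u] - n[idx]   # residual after removing biases
        # Solve the regularized normal equations (B_u^T B_u + lam*I) a_u = B_u^T r
        A[u] = np.linalg.solve(B_u.T @ B_u + lam * np.eye(k), B_u.T @ r)
    return A

# Tiny synthetic example of one half-step
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 8))
mask = np.ones((6, 8), dtype=bool)
B = rng.normal(size=(8, 3))
A = als_update_users(X, mask, B, mu=0.0, m=np.zeros(6), n=np.zeros(8), lam=0.1)
```

A full ALS pass would apply the symmetric update to the item factors and iterate until convergence.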
Up to this point, the problem is equivalent to factorizing an extended block matrix

$$X_{ext} = \begin{pmatrix} X & U \\ I^T & \cdot \end{pmatrix}$$

and can be solved using the same methods as before. The new matrices $C_{p \times k}$ and $D_{q \times k}$ are not used in the prediction formula, but their presence in the minimization objective allows obtaining better estimates for $A$ and $B$: informally, they now need to explain both the interactions and the side information, making them less prone to overfitting the observed interactions and forcing these latent factors to relate to the non-latent attributes, thereby generalizing better to new data.

There are many logical improvements upon this model: the matrices might not share all the latent factors, but have independent parts, e.g.

$$A = \begin{pmatrix} A_{attr} & A_{shared} & A_{main} \end{pmatrix}, \quad B = \begin{pmatrix} B_{attr} & B_{shared} & B_{main} \end{pmatrix}$$

each factorization might have a different weight, each matrix its own regularization hyperparameter, among others. Particularly, this work also applied a sigmoid transformation to all binary variables in the side information matrices, took the user and item biases as model parameters (for which regularization was also applied), and divided the sum of residuals from each matrix by its number of entries so that the contribution of each matrix is not driven by its relative size, resulting in an optimization problem as follows:

$$\min_{A, B, C, D, m, n} \ w_x L_x(X, A_x, B_x) + w_u L_u(U, A_u, C) + w_i L_i(I, B_i, D) + R$$

where:

$$L_x(X, A_x, B_x) = \frac{\lVert I_x \odot (X - \mu - m - n - A_x B_x^T) \rVert^2}{|X|}$$

$$L_u(U, A_u, C) = \frac{\lVert U - S(A_u C^T) \rVert^2}{|U|}, \quad L_i(I, B_i, D) = \frac{\lVert I - S(B_i D^T) \rVert^2}{|I|}$$

$$R(A, B, C, D, m, n) = \lambda (\lVert A \rVert^2 + \lVert B \rVert^2 + \lVert C \rVert^2 + \lVert D \rVert^2 + \lVert m \rVert^2 + \lVert n \rVert^2)$$

$$S(x) = \begin{cases} 1 / (1 + e^{-x}), & \text{if } x \text{ is in a binary column} \\ x, & \text{otherwise} \end{cases}$$
$$A_x = \begin{pmatrix} A_{shared} & A_{main} \end{pmatrix}, \quad A_u = \begin{pmatrix} A_{attr} & A_{shared} \end{pmatrix}$$

$$B_x = \begin{pmatrix} B_{shared} & B_{main} \end{pmatrix}, \quad B_i = \begin{pmatrix} B_{attr} & B_{shared} \end{pmatrix}$$

When having non-shared components in the factorizations, the closed-form minimizer for the $A$ or $B$ matrices becomes the solution of a linear system with block matrices. For the $A$ matrix, letting

$$\Phi = \begin{pmatrix} 0 & B_{shared} & B_{main} \\ C_{attr} & C_{shared} & 0 \end{pmatrix}$$

the solution would be:

$$\begin{pmatrix} A_{attr} & A_{shared} & A_{main} \end{pmatrix}^T = \left( \Phi^T \Phi + \operatorname{diag}(\lambda) \right)^{-1} \Phi^T \begin{pmatrix} X^T \\ U^T \end{pmatrix}$$

with the fixed matrices for a given row of $A$ consisting only of the rows in each respective matrix that are not missing for that user. As such, the model can be fit through the ALS strategy by updating the matrices in sequence, one row at a time, just like in the non-collective case, and in principle this strategy can also be used for fitting variations of this model such as the implicit-feedback variation described in [5].

When introducing sigmoid transformations, however, the problem is no longer solvable in closed form, but can still be solved using gradient-based methods. The matrices here were optimized using an L-BFGS optimizer (a limited-memory quasi-Newton method [17]). The implementation of the methods proposed here was made open-source and freely available at https://github.com/david-cortes/cmfrec.

In many implementations of low-rank matrix factorization, the regularization parameters are scaled by the number of ratings from each user and for each movie, but since this model adds the side information matrices, this idea was not incorporated in the final objective formula.

It can be seen that this optimization objective will produce values for $a_u$ and $b_i$ as long as there is either interactions data or side information about a given user $u$ or item $i$.
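The weighted objective above can be evaluated as in the following sketch, which for brevity assumes fully shared factors (no $A_{attr}$/$A_{main}$ split); the function and argument names are illustrative, not the reference implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def collective_objective(X, mask, U, I_attr, A, B, C, D, mu, m, n,
                         w_x, w_u, w_i, lam, binary_cols_U, binary_cols_I):
    """Weighted collective-MF objective with per-matrix normalization and a
    sigmoid transformation applied only to binary side-information columns."""
    # Interactions loss, averaged over the number of entries
    R_x = mask * (X - mu - m[:, None] - n[None, :] - A @ B.T)
    L_x = np.sum(R_x ** 2) / X.size
    # User-attributes loss: sigmoid only on binary columns
    P_u = A @ C.T
    P_u[:, binary_cols_U] = sigmoid(P_u[:, binary_cols_U])
    L_u = np.sum((U - P_u) ** 2) / U.size
    # Item-attributes loss
    P_i = B @ D.T
    P_i[:, binary_cols_I] = sigmoid(P_i[:, binary_cols_I])
    L_i = np.sum((I_attr - P_i) ** 2) / I_attr.size
    # L2 regularization on all model parameters
    reg = lam * sum(np.sum(M ** 2) for M in (A, B, C, D, m, n))
    return w_x * L_x + w_u * L_u + w_i * L_i + reg

# Evaluate on small random placeholder data
rng = np.random.default_rng(0)
un, it, p, q, k = 12, 15, 6, 7, 4
obj = collective_objective(
    rng.normal(size=(un, it)), rng.random((un, it)) < 0.5,
    rng.normal(size=(un, p)), rng.normal(size=(it, q)),
    rng.normal(size=(un, k)), rng.normal(size=(it, k)),
    rng.normal(size=(p, k)), rng.normal(size=(q, k)),
    0.0, rng.normal(size=un), rng.normal(size=it),
    w_x=1.0, w_u=0.5, w_i=0.5, lam=1e-2,
    binary_cols_U=[0, 1], binary_cols_I=[0])
```

In a gradient-based fit (e.g. with L-BFGS, as in the paper), this scalar would be minimized jointly over all the parameter matrices.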
If not applying sigmoid transformations, it is also possible to obtain values of $a_u$ and $b_i$ for new users and items based on side information alone, without refitting the model entirely, by using the same closed-form solution as in ALS while holding everything else constant:

$$a_u = (C^T C + \operatorname{diag}(\lambda))^{-1} C^T u_u$$

and similarly for items:

$$b_i = (D^T D + \operatorname{diag}(\lambda))^{-1} D^T i_i$$

If using sigmoid or other transformations, such values might still be obtained by solving smaller optimization problems through gradient-based methods. Calculating parameters for new users/items this way, while faster than refitting the entire model from scratch, is still a rather slow process and not fast enough to be used in live systems.

Other approaches similar in spirit have also been proposed, e.g. [15], but they are aimed at warm-start recommendations only.
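The closed-form cold-start factors above can be computed directly, as in this sketch (the fitted matrices and sizes here are hypothetical placeholders):

```python
import numpy as np

def cold_start_user_factors(C, u_attr, lam):
    """a_u = (C^T C + diag(lam))^{-1} C^T u_u: latent factors for a new
    user from side information alone (no sigmoid transformations)."""
    k = C.shape[1]
    return np.linalg.solve(C.T @ C + lam * np.eye(k), C.T @ u_attr)

# Stand-ins for fitted parameters and a previously-unseen user's attributes
rng = np.random.default_rng(0)
p, k, n_items = 20, 5, 30
C = rng.normal(size=(p, k))         # user-attribute factor matrix
B = rng.normal(size=(n_items, k))   # item factor matrix
n_bias = rng.normal(size=n_items)   # item biases
u_attr = rng.normal(size=p)         # new user's attribute vector

a_u = cold_start_user_factors(C, u_attr, lam=0.1)
scores = a_u @ B.T + n_bias         # predicted preference for each item
ranking = np.argsort(-scores)       # cold-start recommendation order
```

This solves one k-by-k linear system per new user, which, as noted above, is still slower than a plain matrix-vector product.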
Collective matrix factorization as presented here is not the only cold-start-capable model that has been proposed for integrating side information into low-rank matrix factorization models. For example, [13] proposed a Bayesian formulation which assumes a decomposable generative model $X \approx A(B + D)^T$, with the item attributes in turn generated from $D$, which, informally, can be thought of as calculating a base score for each factor derived from the item attributes, plus an offset based on the observed behavior.

While [13] assumed counts data for the attributes and proposed a Bayesian approach to this problem, the idea of decomposing the low-rank matrices into an attribute-determined base plus a free offset was taken here, resulting in the following optimization problem:

$$\min_{A, B, C, D, m, n} \ \lVert I_x \odot (X - \mu - m - n - (A + UC)(B + ID)^T) \rVert^2 + R$$

Just like before, optimization was done through L-BFGS. One could also think of solving this problem by first obtaining a solution for $A$ and $B$ without the side information, then obtaining $C$ and $D$ as the least-squares minimizers given the $U$ and $I$ matrices (i.e. solving $\min_C \lVert A^* - UC \rVert^2$ and setting $A = A^* - UC$), but this approach does not tend to reach the same final solutions. As well, a gradient-based approach allows other potential enhancements, such as setting higher regularization for the free offset.

Informally, this model tries to calculate a base matrix of latent factors as a linear combination of the user/item attributes, to which a free offset based on the observed interactions data is added in order to obtain the final latent factors. It will be referred to hereafter as the "offsets" model.

This alternative formulation presents a computational advantage for cold-start recommendations compared to the previous formulation, as the attribute-based latent factors can now be calculated through a simple vector-matrix product $u_u C$ instead of solving a larger linear system, while the offsets are zero in the absence of any interactions data, which makes it suitable for producing cold-start recommendations in real-time.
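The two-stage alternative mentioned above (fitting the factorization first, then regressing the factors on the attributes and keeping the residual as the free offset) can be sketched as follows; all names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, p, k = 50, 10, 4
U = rng.normal(size=(n_users, p))         # user attributes
A_star = rng.normal(size=(n_users, k))    # factors from a plain factorization

# Least-squares fit of C: min_C ||A* - U C||^2
C, *_ = np.linalg.lstsq(U, A_star, rcond=None)

# The free offset is the residual: A = A* - U C
A_offset = A_star - U @ C
```

As the text notes, this two-stage shortcut does not tend to reach the same solutions as jointly optimizing all parameters with a gradient-based method.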
Contrary to the models presented so far, here the low-rank matrices related to the attributes are also used in the prediction formula. Predictions for a new user on known items would be given by:

$$\hat{x}_u = \mu + u_u C (B + ID)^T + n$$

(For new items, $b_i$ and $n_i$ would be zero, while $i_i D$ would be calculated in the same way.)

This model also has the advantages of not requiring any special transformation for variables that are limited in range (e.g. binary or non-negative), and of having fewer hyperparameters to tune.

Compared to other approaches for cold-start recommendations such as [7], or approaches based on user-wise regressions, this model can work in the absence of side information for either users or items, thus being usable in all the different cold-start scenarios, and its parameters are optimized to recommend items based on both attributes and observed interactions.

A further decomposition, in which user latent factors are determined separately for combining with the item factors derived from the interactions data and with those derived from the item attributes ("decoupled" model, $X \approx A_1 B^T + A_2 D^T$), was also explored in [3] and was briefly attempted here, but the results, in line with [3], were far below every other model and were left out of the analysis.
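The real-time advantage of this prediction formula can be illustrated as follows: the product $C(B + ID)^T$ can be precomputed once at deployment time, leaving a single vector-matrix product per new-user request (a sketch with hypothetical names and sizes; the constant $\mu$ is omitted since it does not affect the ranking):

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, k, n_items = 12, 8, 4, 30
C = rng.normal(size=(p, k))              # user-attribute coefficients
D = rng.normal(size=(q, k))              # item-attribute coefficients
B = rng.normal(size=(n_items, k))        # item free offsets
I_attr = rng.normal(size=(n_items, q))   # item attributes
n_bias = rng.normal(size=n_items)        # item biases

# Precompute once: maps raw user attributes directly to item scores
W = C @ (B + I_attr @ D).T               # shape (p, n_items)

# Per-request work for a brand-new user: one vector-matrix product
u_attr = rng.normal(size=p)
scores = u_attr @ W + n_bias
```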
Empirical evaluation

Both of these models were evaluated using the MovieLens 1M dataset [4], complemented with the movie tag genome data [12] taken from the MovieLens latest dataset (last updated 08/2017 at the time these experiments were run), and with demographic and geographical information about users, the latter linked to them through their zip code. Unfortunately, later (and larger) releases of the MovieLens dataset no longer include user attributes, nor was the author aware of any larger public dataset with side information about both users and items, so it was not possible to evaluate the models on bigger datasets.

The MovieLens 1M dataset contains 1,000,209 ratings of 3,952 movies by 6,040 users, in a timespan from 2000 to 2003. Information from the tag genome dataset, consisting of 1,128 attributes under each of which movies are assigned a continuous value (which can also be negative), was available for only 3,028 of these movies, but demographic information was available for all users, including their age group (7 buckets), occupation (21 categories), gender, and zip code (not used directly), which were taken as binary variables.
Additionally, information about the US region of the user (or whether they were not from the US) was added as binary variables by linking users through their zip code, using free zip code databases (http://federalgovernmentzipcodes.us/) to determine the region.

Recommendations were evaluated by randomly splitting the ratings data into a training set and four test sets, in order to evaluate the different possible cold-start scenarios and compare them to warm-start recommendations, containing: a) only users and items that were in the training data; b) users that were not in the training set, but items that were; c) users that were in the training set, but items that were not and had tags available; d) users and items that were not in the training set (the exact same users as in b, and only items that were also in c). Each test set contained at least 5 ratings from each user included in it, with sizes as follows:

                                              Ratings   Users   Items
Train set
Test set 1: users ∈ train, items ∈ train
Test set 2: users ∉ train, items ∈ train
Test set 3: users ∈ train, items ∉ train
Test set 4: users ∉ train, items ∉ train

Models were fit with $k = 40$ and regularization $\lambda = 10^{-}$. The dimensionality of the tags data was reduced by taking only their first 50 principal components, as the number of columns is too large for the second model, and the first model also seems to benefit from reduced dimensionality. Intuitively, however, the first model should not need this type of dimensionality reduction, as it performs it implicitly, but taking advantage of this rich side information would require setting the regularization parameters differently for each matrix, which was not experimented with here. Some trial and error (not recorded here) suggests that also adding non-shared latent factors brings a slight improvement, particularly when using only item attributes.

Models were evaluated in terms of their RMSE (root mean squared error) and NDCG@5 (normalized discounted cumulative gain at 5), the latter calculated on a per-user basis and averaged across all users. The definition of DCG was taken as follows:

$$DCG@5 = \sum_{i=1}^{5} \frac{2^{x_i} - 1}{\log_2(i + 1)}$$

where $i$ indexes the item with the $i$-th highest predicted score for a user, and $x_i$ is the actual rating that the user gave to that item. NDCG is calculated as DCG divided by the maximum achievable DCG for that data, resulting in a value that is upper-bounded at 1. As this is explicit-feedback data, all the evaluation was done on the subset of ratings that were in one of the test sets, rather than comparing predictions for movies that the users rated vs. movies they didn't rate.

As a baseline, a non-personalized "most-popular" recommendation formula was also evaluated, ranking the items according to their average rating, computed from the ratings in the training set only; it shall be noted that all the items here had a reasonable minimum number of ratings.
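The NDCG@5 metric as defined above can be sketched as follows (an illustrative implementation, not the paper's evaluation code):

```python
import numpy as np

def ndcg_at_5(predicted_scores, true_ratings):
    """NDCG@5 for one user: DCG@5 = sum_{i=1..5} (2^{x_i} - 1) / log2(i + 1),
    where x_i is the true rating of the item ranked i-th by the model,
    normalized by the best achievable DCG@5 for the same ratings."""
    order = np.argsort(-np.asarray(predicted_scores))[:5]   # top-5 by model
    gains = 2.0 ** np.asarray(true_ratings)[order] - 1.0
    discounts = np.log2(np.arange(2, len(order) + 2))       # log2(i + 1)
    dcg = np.sum(gains / discounts)
    ideal = np.sort(np.asarray(true_ratings))[::-1][:5]     # best possible order
    idcg = np.sum((2.0 ** ideal - 1.0) / np.log2(np.arange(2, len(ideal) + 2)))
    return dcg / idcg
```

A model that ranks a user's test items in the exact order of their true ratings obtains an NDCG@5 of 1.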
Since this simple formula cannot recommend items that were not in the training set, a weaker baseline consisting of random predictions was also evaluated.

Conclusions

This work proposed an enhancement to the collective matrix factorization model in order to deal with binary data, and proposed an alternative formulation, the "offsets" model, that is able to make fast recommendations for new users and items and does not require any transformation for attributes data that is limited in range.

Cold-start recommendations are understandably not as good as warm-start ones, and the offsets model didn't manage to beat non-personalized recommendations for new users when fit to user side information only, although it is significantly better than random recommendations, and it did beat non-personalized recommendations when adding item attributes too. The coverage of these rec…

Table 1: Results on warm-start test set, MovieLens 1M
Model                               w_x   w_u   w_i   RMSE     NDCG@5
Random recommendations               -     -     -     -       0.606
Most-popular                         -     -     -     -       0.7862
MF, no attributes                    -     -     -    0.865    0.8166
CMF, user attributes*
CMF, user attributes
CMF, user attributes
Offsets, user attributes             -     -     -    0.8972   0.7968
CMF, item attributes*
CMF, item attributes
CMF, item attributes
Offsets, item attributes             -     -     -    0.9225   0.7848
CMF, user and item attributes
CMF, user and item attributes
Offsets, user and item attributes    -     -     -    0.8728   0.8072

* These models were fit only to the attributes data of the users/items with ratings in the same training data.
Table 2: Results on test set with new users only, MovieLens 1M
Model                               w_x   w_u   w_i   RMSE     NDCG@5
Random recommendations               -     -     -     -       0.5185
Most-popular                         -     -     -     -       0.7584
CMF, user attributes
CMF, user attributes
CMF, user attributes                 -     -     -    0.9804   0.7414
CMF, user and item attributes
CMF, user and item attributes
Offsets, user and item attributes    -     -     -
Table 3: Results on test set with new items only, MovieLens 1M

Model                               w_x   w_u   w_i   RMSE     NDCG@5
Random recommendations               -     -     -     -       0.5692
CMF, item attributes
CMF, item attributes
CMF, item attributes
Offsets, item attributes             -     -     -
CMF, user and item attributes
Offsets, user and item attributes    -     -     -
Table 4: Results on test set with new users and new items, MovieLens 1M

Model                               w_x   w_u   w_i   RMSE     NDCG@5
Random recommendations               -     -     -     -       0.5702
CMF, user and item attributes
CMF, user and item attributes
CMF, user and item attributes        -     -     -
References

[1] Ryan Prescott Adams, George E. Dahl, and Iain Murray. Incorporating side information in probabilistic matrix factorization with Gaussian processes. arXiv preprint arXiv:1003.4944, 2010.

[2] Ali Taylan Cemgil. Bayesian inference for nonnegative matrix factorisation models. Computational Intelligence and Neuroscience, 2009.

[3] Prem K. Gopalan, Laurent Charlin, and David Blei. Content-based recommendations with Poisson factorization. In Advances in Neural Information Processing Systems, pages 3176-3184, 2014.

[4] F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2016.

[5] Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In 2008 Eighth IEEE International Conference on Data Mining, pages 263-272. IEEE, 2008.

[6] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30-37, 2009.

[7] Seung-Taek Park and Wei Chu. Pairwise preference regression for cold-start recommendation. In Proceedings of the Third ACM Conference on Recommender Systems, pages 21-28. ACM, 2009.

[8] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 452-461. AUAI Press, 2009.

[9] Ruslan Salakhutdinov and Andriy Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning, pages 880-887. ACM, 2008.

[10] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web, pages 285-295. ACM, 2001.

[11] Ajit P. Singh and Geoffrey J. Gordon. Relational learning via collective matrix factorization. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 650-658. ACM, 2008.

[12] Jesse Vig, Shilad Sen, and John Riedl. The tag genome: Encoding community knowledge to support novel interaction. ACM Transactions on Interactive Intelligent Systems (TiiS), 2(3):13, 2012.

[13] Chong Wang and David M. Blei. Collaborative topic modeling for recommending scientific articles. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 448-456. ACM, 2011.

[14] Hao Wang, Naiyan Wang, and Dit-Yan Yeung. Collaborative deep learning for recommender systems. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1235-1244. ACM, 2015.

[15] Feipeng Zhao, Min Xiao, and Yuhong Guo. Predictive collaborative filtering with side information. In IJCAI, pages 2385-2391, 2016.

[16] Yunhong Zhou, Dennis Wilkinson, Robert Schreiber, and Rong Pan. Large-scale parallel collaborative filtering for the Netflix prize. In International Conference on Algorithmic Applications in Management, pages 337-348. Springer, 2008.

[17] Ciyou Zhu, Richard H. Byrd, Peihuang Lu, and Jorge Nocedal. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software, 23(4):550-560, 1997.