Cold-start recommendations in Collective Matrix Factorization
David Cortes

March 17, 2020
Abstract
This work explores the ability of collective matrix factorization models in recommender systems to make predictions about users and items for which there is side information available but no feedback or interactions data, and proposes a new formulation with a faster cold-start prediction formula that can be used in real-time systems. While these cold-start recommendations are not as good as warm-start ones, they were found to be of better quality than non-personalized recommendations, and predictions about new users were found to be more reliable than those about new items. The formulation proposed here resulted in improved cold-start recommendations in many scenarios, at the expense of worse warm-start ones.
This work aims to explore the quality of cold-start recommendations derived from collective matrix factorization models [11] in collaborative filtering with explicit-feedback data in the form of ratings. Recommender systems based on collaborative filtering are typically constructed solely from data about user-item interactions [6], such as movies rated by different users. This results in domain-independent and easily-implementable models, but has the disadvantage of only being able to make recommendations for users and items for which there is interactions data available (known as warm-start recommendations in the literature).

In many settings, however, there is oftentimes additional side information available about users and/or items, which is not used in the most common models such as low-rank matrix factorization [6] or kNN-based formulas [10], but which can be used both to improve recommendation models that take interactions data, and to make recommendations in the absence of interactions data (so-called cold-start recommendations).

This work focuses on the second case: studying recommendations from matrix factorization models that are based on attributes data without interactions data.
Collective Matrix Factorization
Collective matrix factorization is an extension of the low-rank factorization model that tries to incorporate attributes about the users and/or items by also factorizing the matrices associated with their side information, sharing the latent factors between them.

More formally, recommendation models based on low-rank matrix factorization try to factorize a partially-observed matrix $X_{u \times i}$ of user-item interactions (e.g. movie ratings), where $u$ is the number of users and $i$ is the number of items, into the product of two lower-dimensional matrices $A_{u \times k}$ and $B_{i \times k}$, where $k \ll u, i$, which can be thought of as latent factors determined for each user and item, by minimizing some loss function, such as squared loss, defined only on the entries of $X$ that are known (hereafter denoted by the indicator function $I_x$), e.g.:

$$\min_{A, B} \ \lVert I_x \odot (X - A B^T) \rVert^2$$

Having obtained these matrices, it is then possible to predict the values of $X$ for entries that are not known through the dot product $\langle a_u, b_i \rangle$ for user $u$ and item $i$.
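As an illustration, the factorization, its masked squared loss, and the dot-product prediction rule can be sketched in NumPy (all sizes and values here are hypothetical placeholders, not fitted parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 100, 50, 5   # hypothetical sizes, with k << n_users, n_items

# Stand-ins for fitted latent-factor matrices A (users x k) and B (items x k)
A = rng.normal(size=(n_users, k))
B = rng.normal(size=(n_items, k))

# Predicted rating for user u and item i is the dot product <a_u, b_i>
u, i = 3, 7
pred = A[u] @ B[i]

# Squared loss evaluated only on the observed entries, via an indicator mask I_x
X = rng.normal(size=(n_users, n_items))     # ratings matrix (placeholder values)
I_x = rng.random((n_users, n_items)) < 0.1  # True where a rating is observed
loss = np.sum(I_x * (X - A @ B.T) ** 2)
```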
Recommendations are then made by sorting these predictions in descending order.

In most implementations, this model is improved by centering the data (subtracting the global mean $\mu$ from each entry), adding user and item biases $m_u$ (row vector) and $n_i$ (column vector), which might be treated as model parameters or obtained through a simple heuristic before attempting to obtain optimal values for $A$ and $B$, and by adding regularization on all the model parameters, resulting in the following problem:

$$\min_{A, B} \ \lVert I_x \odot (X - \mu - m - n - A B^T) \rVert^2 + \lambda (\lVert A \rVert^2 + \lVert B \rVert^2)$$

This is a non-convex optimization problem for which local minima can be found either by gradient-based methods or, more typically, by the ALS (alternating least-squares) algorithm [6][16]. ALS takes advantage of the fact that, when holding one of the low-rank matrices constant, the optimal values for the other can be obtained through a closed-form solution that involves solving linear systems; the algorithm then alternates between solving for one matrix while holding the other constant, until convergence.

The main idea behind collective matrix factorization is to jointly factorize the interactions matrix $X_{u \times i}$ along with the user attributes matrix $U_{u \times p}$ and the item attributes matrix $I_{i \times q}$, introducing new matrices $C_{p \times k}$ and $D_{q \times k}$ for the user and item attributes (assuming there is data about both user and item attributes), but sharing the $A_{u \times k}$ and $B_{i \times k}$ matrices between factorizations:

$$\min_{A, B, C, D} \ \lVert I_x \odot (X - A B^T) \rVert^2 + \lVert U - A C^T \rVert^2 + \lVert I - B D^T \rVert^2$$
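The ALS half-step for the biased factorization described above can be sketched as follows (an illustrative NumPy implementation; the function and variable names are not from the paper, and the per-user linear systems are solved naively for clarity):

```python
import numpy as np

def als_update_users(X, mask, B, mu, m, n, lam):
    """One ALS half-step: closed-form update of all user factors A,
    holding the item factors B (and all biases) fixed.
    X: (users x items) ratings; mask: boolean matrix of observed entries."""
    n_users, k = X.shape[0], B.shape[1]
    A = np.zeros((n_users, k))
    for u in range(n_users):
        idx = mask[u]                        # items rated by user u
        B_u = B[idx]                         # rows of B for those items
        r = X[u, idx] - mu - m[u] - n[idx]   # residual after removing biases
        # Solve the regularized normal equations (B_u^T B_u + lam*I) a_u = B_u^T r
        A[u] = np.linalg.solve(B_u.T @ B_u + lam * np.eye(k), B_u.T @ r)
    return A

# Tiny synthetic example of one half-step
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 8))
mask = np.ones((6, 8), dtype=bool)
B = rng.normal(size=(8, 3))
A = als_update_users(X, mask, B, mu=0.0, m=np.zeros(6), n=np.zeros(8), lam=0.1)
```

A full ALS pass would apply the symmetric update to the item factors and iterate until convergence.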
Up to this point, the problem is equivalent to factorizing an extended block matrix

$$X_{ext} = \begin{pmatrix} X & U \\ I^T & \cdot \end{pmatrix}$$

and can be solved using the same methods as before. The new matrices $C_{p \times k}$ and $D_{q \times k}$ are not used in the prediction formula, but their presence in the minimization objective allows obtaining better estimates for $A$ and $B$: informally, they now need to explain both the interactions and the side information, making them less prone to overfitting the observed interactions and forcing these latent factors to relate to the non-latent attributes, thereby generalizing better to new data.

There are many logical improvements upon this model: the matrices might not share all the latent factors, but have independent parts, e.g.

$$A = \begin{pmatrix} A_{attr} & A_{shared} & A_{main} \end{pmatrix}, \quad B = \begin{pmatrix} B_{attr} & B_{shared} & B_{main} \end{pmatrix}$$

each factorization might have a different weight, each matrix its own regularization hyperparameter, among others. Particularly, this work also applied a sigmoid transformation to all binary variables in the side information matrices, took the user and item biases as model parameters (for which regularization was also applied), and divided the sum of residuals from each matrix by its number of entries so that the contribution of each matrix is not driven by its relative size, resulting in an optimization problem as follows:

$$\min_{A, B, C, D, m, n} \ w_x L_x(X, A_x, B_x) + w_u L_u(U, A_u, C) + w_i L_i(I, B_i, D) + R$$

where:

$$L_x(X, A_x, B_x) = \frac{\lVert I_x \odot (X - \mu - m - n - A_x B_x^T) \rVert^2}{|X|}$$

$$L_u(U, A_u, C) = \frac{\lVert U - S(A_u C^T) \rVert^2}{|U|}, \quad L_i(I, B_i, D) = \frac{\lVert I - S(B_i D^T) \rVert^2}{|I|}$$

$$R(A, B, C, D, m, n) = \lambda (\lVert A \rVert^2 + \lVert B \rVert^2 + \lVert C \rVert^2 + \lVert D \rVert^2 + \lVert m \rVert^2 + \lVert n \rVert^2)$$

$$S(x) = \begin{cases} 1 / (1 + e^{-x}), & \text{if } x \text{ is in a binary column} \\ x, & \text{otherwise} \end{cases}$$
$$A_x = \begin{pmatrix} A_{shared} & A_{main} \end{pmatrix}, \quad A_u = \begin{pmatrix} A_{attr} & A_{shared} \end{pmatrix}$$

$$B_x = \begin{pmatrix} B_{shared} & B_{main} \end{pmatrix}, \quad B_i = \begin{pmatrix} B_{attr} & B_{shared} \end{pmatrix}$$

When having non-shared components in the factorizations, the closed-form minimizer for the $A$ or $B$ matrices becomes the solution of a linear system with block matrices. For the $A$ matrix, letting

$$\Phi = \begin{pmatrix} 0 & B_{shared} & B_{main} \\ C_{attr} & C_{shared} & 0 \end{pmatrix}$$

the solution would be:

$$\begin{pmatrix} A_{attr} & A_{shared} & A_{main} \end{pmatrix}^T = \left( \Phi^T \Phi + \operatorname{diag}(\lambda) \right)^{-1} \Phi^T \begin{pmatrix} X^T \\ U^T \end{pmatrix}$$

with the fixed matrices for a given row of $A$ consisting only of the rows in each respective matrix that are not missing for that user. As such, the model can be fit through the ALS strategy by updating the matrices in sequence, one row at a time, just like in the non-collective case, and in principle this strategy can also be used for fitting variations of this model such as the implicit-feedback variation described in [5].

When introducing sigmoid transformations, however, the problem is no longer solvable in closed form, but can still be solved using gradient-based methods. The matrices here were optimized using an L-BFGS optimizer (a limited-memory quasi-Newton method [17]). The implementation of the methods proposed here was made open-source and freely available at https://github.com/david-cortes/cmfrec.

In many implementations of low-rank matrix factorization, the regularization parameters are scaled by the number of ratings from each user and for each movie, but since this model adds the side information matrices, this idea was not incorporated in the final objective formula.

It can be seen that this optimization objective will produce values for $a_u$ and $b_i$ as long as there is either interactions data or side information about a given user $u$ or item $i$.
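The weighted objective above can be evaluated as in the following sketch, which for brevity assumes fully shared factors (no $A_{attr}$/$A_{main}$ split); the function and argument names are illustrative, not the reference implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def collective_objective(X, mask, U, I_attr, A, B, C, D, mu, m, n,
                         w_x, w_u, w_i, lam, binary_cols_U, binary_cols_I):
    """Weighted collective-MF objective with per-matrix normalization and a
    sigmoid transformation applied only to binary side-information columns."""
    # Interactions loss, averaged over the number of entries
    R_x = mask * (X - mu - m[:, None] - n[None, :] - A @ B.T)
    L_x = np.sum(R_x ** 2) / X.size
    # User-attributes loss: sigmoid only on binary columns
    P_u = A @ C.T
    P_u[:, binary_cols_U] = sigmoid(P_u[:, binary_cols_U])
    L_u = np.sum((U - P_u) ** 2) / U.size
    # Item-attributes loss
    P_i = B @ D.T
    P_i[:, binary_cols_I] = sigmoid(P_i[:, binary_cols_I])
    L_i = np.sum((I_attr - P_i) ** 2) / I_attr.size
    # L2 regularization on all model parameters
    reg = lam * sum(np.sum(M ** 2) for M in (A, B, C, D, m, n))
    return w_x * L_x + w_u * L_u + w_i * L_i + reg

# Evaluate on small random placeholder data
rng = np.random.default_rng(0)
un, it, p, q, k = 12, 15, 6, 7, 4
obj = collective_objective(
    rng.normal(size=(un, it)), rng.random((un, it)) < 0.5,
    rng.normal(size=(un, p)), rng.normal(size=(it, q)),
    rng.normal(size=(un, k)), rng.normal(size=(it, k)),
    rng.normal(size=(p, k)), rng.normal(size=(q, k)),
    0.0, rng.normal(size=un), rng.normal(size=it),
    w_x=1.0, w_u=0.5, w_i=0.5, lam=1e-2,
    binary_cols_U=[0, 1], binary_cols_I=[0])
```

In a gradient-based fit (e.g. with L-BFGS, as in the paper), this scalar would be minimized jointly over all the parameter matrices.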
If not applying sigmoid transformations, it is also possible to obtain values of $a_u$ and $b_i$ for new users and items based on side information alone, without refitting the model entirely, by using the same closed-form solution as in ALS while holding everything else constant:

$$a_u = (C^T C + \operatorname{diag}(\lambda))^{-1} C^T u_u$$

and similarly for items:

$$b_i = (D^T D + \operatorname{diag}(\lambda))^{-1} D^T i_i$$

If using sigmoid or other transformations, such values might still be obtained by solving smaller optimization problems through gradient-based methods. Calculating parameters for new users/items this way, while faster than refitting the entire model from scratch, is still a rather slow process and not fast enough to be used in live systems.

Other approaches similar in spirit have also been proposed, e.g. [15], but they are aimed at warm-start recommendations only.
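The closed-form cold-start factors above can be computed directly, as in this sketch (the fitted matrices and sizes here are hypothetical placeholders):

```python
import numpy as np

def cold_start_user_factors(C, u_attr, lam):
    """a_u = (C^T C + diag(lam))^{-1} C^T u_u: latent factors for a new
    user from side information alone (no sigmoid transformations)."""
    k = C.shape[1]
    return np.linalg.solve(C.T @ C + lam * np.eye(k), C.T @ u_attr)

# Stand-ins for fitted parameters and a previously-unseen user's attributes
rng = np.random.default_rng(0)
p, k, n_items = 20, 5, 30
C = rng.normal(size=(p, k))         # user-attribute factor matrix
B = rng.normal(size=(n_items, k))   # item factor matrix
n_bias = rng.normal(size=n_items)   # item biases
u_attr = rng.normal(size=p)         # new user's attribute vector

a_u = cold_start_user_factors(C, u_attr, lam=0.1)
scores = a_u @ B.T + n_bias         # predicted preference for each item
ranking = np.argsort(-scores)       # cold-start recommendation order
```

This solves one k-by-k linear system per new user, which, as noted above, is still slower than a plain matrix-vector product.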
Collective matrix factorization as presented here is not the only cold-start-capable model that has been proposed for integrating side information into low-rank matrix factorization models. For example, [13] proposed a Bayesian formulation which assumes a decomposable generative model $X \approx A(B + D)^T$, with the item attributes in turn generated from $D$, which, informally, can be thought of as calculating a base score for each factor derived from the item attributes, plus an offset based on the observed behavior.

While [13] assumed counts data for the attributes and proposed a Bayesian approach to this problem, the idea of decomposing the low-rank matrices into an attribute-determined base plus a free offset was taken here, resulting in the following optimization problem:

$$\min_{A, B, C, D, m, n} \ \lVert I_x \odot (X - \mu - m - n - (A + UC)(B + ID)^T) \rVert^2 + R$$

Just like before, optimization was done through L-BFGS. One could also think of solving this problem by first obtaining a solution for $A$ and $B$ without the side information, then obtaining $C$ and $D$ as the least-squares minimizers given the $U$ and $I$ matrices (i.e. solving $\min_C \lVert A^* - UC \rVert^2$ and setting $A = A^* - UC$), but this approach does not tend to reach the same final solutions. As well, a gradient-based approach allows other potential enhancements, such as setting higher regularization for the free offset.

Informally, this model tries to calculate a base matrix of latent factors as a linear combination of the user/item attributes, to which a free offset based on the observed interactions data is added in order to obtain the final latent factors. It will be referred to hereafter as the "offsets" model.

This alternative formulation presents a computational advantage for cold-start recommendations compared to the previous formulation, as the attribute-based latent factors can now be calculated through a simple vector-matrix product $u_u C$ instead of solving a larger linear system, while the offsets are zero in the absence of any interactions data, which makes it suitable for producing cold-start recommendations in real-time.
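The two-stage alternative mentioned above (fitting the factorization first, then regressing the factors on the attributes and keeping the residual as the free offset) can be sketched as follows; all names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, p, k = 50, 10, 4
U = rng.normal(size=(n_users, p))         # user attributes
A_star = rng.normal(size=(n_users, k))    # factors from a plain factorization

# Least-squares fit of C: min_C ||A* - U C||^2
C, *_ = np.linalg.lstsq(U, A_star, rcond=None)

# The free offset is the residual: A = A* - U C
A_offset = A_star - U @ C
```

As the text notes, this two-stage shortcut does not tend to reach the same solutions as jointly optimizing all parameters with a gradient-based method.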
Contrary to the models presented so far, here the low-rank matrices related to the attributes are also used in the prediction formula. Predictions for a new user on known items would be given by:

$$\hat{x}_u = \mu + u_u C (B + ID)^T + n$$

(For new items, $b_i$ and $n_i$ would be zero, while $i_i D$ would be calculated in the same way.)

This model also has the advantages of not requiring any special transformation for variables that are limited in range (e.g. binary or non-negative), and of having fewer hyperparameters to tune.

Compared to other approaches for cold-start recommendations such as [7], or approaches based on user-wise regressions, this model can work in the absence of side information for either users or items, thus being usable in all the different cold-start scenarios, and its parameters are optimized to recommend items based on both attributes and observed interactions.

A further decomposition, in which user latent factors are determined separately for combining with the item factors derived from the interactions data and with those derived from the item attributes ("decoupled" model, $X \approx A_1 B^T + A_2 D^T$), was also explored in [3] and was briefly attempted here, but the results, in line with [3], were far below every other model and were left out of the analysis.
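The real-time advantage of this prediction formula can be illustrated as follows: the product $C(B + ID)^T$ can be precomputed once at deployment time, leaving a single vector-matrix product per new-user request (a sketch with hypothetical names and sizes; the constant $\mu$ is omitted since it does not affect the ranking):

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, k, n_items = 12, 8, 4, 30
C = rng.normal(size=(p, k))              # user-attribute coefficients
D = rng.normal(size=(q, k))              # item-attribute coefficients
B = rng.normal(size=(n_items, k))        # item free offsets
I_attr = rng.normal(size=(n_items, q))   # item attributes
n_bias = rng.normal(size=n_items)        # item biases

# Precompute once: maps raw user attributes directly to item scores
W = C @ (B + I_attr @ D).T               # shape (p, n_items)

# Per-request work for a brand-new user: one vector-matrix product
u_attr = rng.normal(size=p)
scores = u_attr @ W + n_bias
```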
Empirical evaluation

Both of these models were evaluated using the MovieLens 1M dataset [4], complemented with the movie tag genome data [12] taken from the MovieLens latest dataset (last updated 08/2017 at the time these experiments were run), and with demographic and geographical information about users, the latter linked to them through their zip code. Unfortunately, later (and larger) releases of the MovieLens dataset no longer include user attributes, nor was the author aware of any larger public dataset with side information about both users and items, so it was not possible to evaluate the models on bigger datasets.

The MovieLens 1M dataset contains 1,000,209 ratings of 3,952 movies by 6,040 users, in a timespan from 2000 to 2003. Information from the tag genome dataset, consisting of 1,128 attributes under each of which movies are assigned a continuous value (which can also be negative), was available for only 3,028 of these movies, but demographic information was available for all users, including their age group (7 buckets), occupation (21 categories), gender, and zip code (not used directly), which were taken as binary variables.
Additionally, information about the US region of the user (or whether they were not from the US) was added as binary variables by linking users through their zip code, using free zip code databases (http://federalgovernmentzipcodes.us/) to determine the region.

Recommendations were evaluated by randomly splitting the ratings data into a training set and four test sets, in order to evaluate the different possible cold-start scenarios and compare them to warm-start recommendations, containing: a) only users and items that were in the training data; b) users that were not in the training set, but items that were; c) users that were in the training set, but items that were not and had tags available; d) users and items that were not in the training set (the exact same users as in b, and only items that were also in c). Each test set contained at least 5 ratings from each user included in it, with sizes as follows:

                                              Ratings   Users   Items
Train set
Test set 1: users ∈ train, items ∈ train
Test set 2: users ∉ train, items ∈ train
Test set 3: users ∈ train, items ∉ train
Test set 4: users ∉ train, items ∉ train

Models were fit with $k = 40$ and regularization $\lambda = 10^{-}$. The dimensionality of the tags data was reduced by taking only their first 50 principal components, as the number of columns is too large for the second model, and the first model also seems to benefit from reduced dimensionality. Intuitively, however, the first model should not need this type of dimensionality reduction, as it performs it implicitly, but taking advantage of this rich side information would require setting the regularization parameters differently for each matrix, which was not experimented with here. Some trial and error (not recorded here) suggests that also adding non-shared latent factors brings a slight improvement, particularly when using only item attributes.

Models were evaluated in terms of their RMSE (root mean squared error) and NDCG@5 (normalized discounted cumulative gain at 5), the latter calculated on a per-user basis and averaged across all users. The definition of DCG was taken as follows:

$$DCG@5 = \sum_{i=1}^{5} \frac{2^{x_i} - 1}{\log_2(i + 1)}$$

where $i$ indexes the item with the $i$-th highest predicted score for a user, and $x_i$ is the actual rating that the user gave to that item. NDCG is calculated as DCG divided by the maximum achievable DCG for that data, resulting in a value that is upper-bounded at 1. As this is explicit-feedback data, all the evaluation was done on the subset of ratings that were in one of the test sets, rather than comparing predictions for movies that the users rated vs. movies they didn't rate.

As a baseline, a non-personalized "most-popular" recommendation formula was also evaluated, ranking the items according to their average rating, computed from the ratings in the training set only; it shall be noted that all the items here had a reasonable minimum number of ratings.
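The NDCG@5 metric as defined above can be sketched as follows (an illustrative implementation, not the paper's evaluation code):

```python
import numpy as np

def ndcg_at_5(predicted_scores, true_ratings):
    """NDCG@5 for one user: DCG@5 = sum_{i=1..5} (2^{x_i} - 1) / log2(i + 1),
    where x_i is the true rating of the item ranked i-th by the model,
    normalized by the best achievable DCG@5 for the same ratings."""
    order = np.argsort(-np.asarray(predicted_scores))[:5]   # top-5 by model
    gains = 2.0 ** np.asarray(true_ratings)[order] - 1.0
    discounts = np.log2(np.arange(2, len(order) + 2))       # log2(i + 1)
    dcg = np.sum(gains / discounts)
    ideal = np.sort(np.asarray(true_ratings))[::-1][:5]     # best possible order
    idcg = np.sum((2.0 ** ideal - 1.0) / np.log2(np.arange(2, len(ideal) + 2)))
    return dcg / idcg
```

A model that ranks a user's test items in the exact order of their true ratings obtains an NDCG@5 of 1.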
Since this simple formula cannot recommend items that were not in the training set, a weaker baseline consisting of random predictions was also evaluated.

Conclusions

This work proposed an enhancement to the collective matrix factorization model in order to deal with binary data, and proposed an alternative formulation, the "offsets" model, that is able to make fast recommendations for new users and items and does not require any transformation for attributes data that is limited in range.

Cold-start recommendations are understandably not as good as warm-start ones, and the offsets model didn't manage to beat non-personalized recommendations for new users when fit to user side information only, although it is significantly better than random recommendations, and it did beat non-personalized recommendations when adding item attributes too. The coverage of these rec…

Table 1: Results on warm-start test set, MovieLens 1M
Model                               w_x   w_u   w_i   RMSE     NDCG@5
Random recommendations               -     -     -     -       0.606
Most-popular                         -     -     -     -       0.7862
MF, no attributes                    -     -     -    0.865    0.8166
CMF, user attributes*
CMF, user attributes
CMF, user attributes
Offsets, user attributes             -     -     -    0.8972   0.7968
CMF, item attributes*
CMF, item attributes
CMF, item attributes
Offsets, item attributes             -     -     -    0.9225   0.7848
CMF, user and item attributes
CMF, user and item attributes
Offsets, user and item attributes    -     -     -    0.8728   0.8072

* These models were fit only to the attributes data of the users/items with ratings in the same training data.
Table 2: Results on test set with new users only, MovieLens 1M
Model                               w_x   w_u   w_i   RMSE     NDCG@5
Random recommendations               -     -     -     -       0.5185
Most-popular                         -     -     -     -       0.7584
CMF, user attributes
CMF, user attributes
CMF, user attributes                 -     -     -    0.9804   0.7414
CMF, user and item attributes
CMF, user and item attributes
Offsets, user and item attributes    -     -     -
Table 3: Results on test set with new items only, MovieLens 1M

Model                               w_x   w_u   w_i   RMSE     NDCG@5
Random recommendations               -     -     -     -       0.5692
CMF, item attributes
CMF, item attributes
CMF, item attributes
Offsets, item attributes             -     -     -
CMF, user and item attributes
Offsets, user and item attributes    -     -     -
Table 4: Results on test set with new users and new items, MovieLens 1M

Model                               w_x   w_u   w_i   RMSE     NDCG@5
Random recommendations               -     -     -     -       0.5702
CMF, user and item attributes
CMF, user and item attributes
CMF, user and item attributes        -     -     -
References

[1] Ryan Prescott Adams, George E. Dahl, and Iain Murray. Incorporating side information in probabilistic matrix factorization with Gaussian processes. arXiv preprint arXiv:1003.4944, 2010.

[2] Ali Taylan Cemgil. Bayesian inference for nonnegative matrix factorisation models. Computational Intelligence and Neuroscience, 2009.

[3] Prem K. Gopalan, Laurent Charlin, and David Blei. Content-based recommendations with Poisson factorization. In Advances in Neural Information Processing Systems, pages 3176-3184, 2014.

[4] F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2016.

[5] Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In 2008 Eighth IEEE International Conference on Data Mining, pages 263-272. IEEE, 2008.

[6] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30-37, 2009.

[7] Seung-Taek Park and Wei Chu. Pairwise preference regression for cold-start recommendation. In Proceedings of the Third ACM Conference on Recommender Systems, pages 21-28. ACM, 2009.

[8] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 452-461. AUAI Press, 2009.

[9] Ruslan Salakhutdinov and Andriy Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning, pages 880-887. ACM, 2008.

[10] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web, pages 285-295. ACM, 2001.

[11] Ajit P. Singh and Geoffrey J. Gordon. Relational learning via collective matrix factorization. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 650-658. ACM, 2008.

[12] Jesse Vig, Shilad Sen, and John Riedl. The tag genome: Encoding community knowledge to support novel interaction. ACM Transactions on Interactive Intelligent Systems (TiiS), 2(3):13, 2012.

[13] Chong Wang and David M. Blei. Collaborative topic modeling for recommending scientific articles. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 448-456. ACM, 2011.

[14] Hao Wang, Naiyan Wang, and Dit-Yan Yeung. Collaborative deep learning for recommender systems. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1235-1244. ACM, 2015.

[15] Feipeng Zhao, Min Xiao, and Yuhong Guo. Predictive collaborative filtering with side information. In IJCAI, pages 2385-2391, 2016.

[16] Yunhong Zhou, Dennis Wilkinson, Robert Schreiber, and Rong Pan. Large-scale parallel collaborative filtering for the Netflix prize. In International Conference on Algorithmic Applications in Management, pages 337-348. Springer, 2008.

[17] Ciyou Zhu, Richard H. Byrd, Peihuang Lu, and Jorge Nocedal. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software, 23(4):550-560, 1997.