Conditional Restricted Boltzmann Machines for Cold Start Recommendations
Jiankou Li and Wei Zhang
State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, P.O. Box 8718, Beijing, 100190, P.R. China
School of Information Science and Engineering, University of Chinese Academy of Sciences, P.R. China
Abstract.
Restricted Boltzmann Machines (RBMs) have been successfully used in recommender systems. However, like most other collaborative filtering techniques, they cannot solve the cold start problem, because a new item has no ratings. In this paper, we apply the conditional RBM (CRBM), which can take extra information into account, and show that the CRBM solves the cold start problem very well, especially for the rating prediction task. The CRBM naturally combines content and collaborative data in a single framework that can be fitted effectively. Experiments show that the CRBM compares favourably with matrix factorization models, while the hidden features it learns are easier to interpret.
Keywords: cold start, conditional RBM, rating prediction
1 Introduction

The cold start problem refers to making recommendations for new items or new users. It has attracted the attention of many researchers because of its importance. Collaborative filtering (CF) techniques predict unknown preferences from the known preferences of a group of users [22]. Preferences here may be explicit, such as movie ratings, or implicit, such as item purchases in online services. Although CF techniques have been used successfully in recommender systems, they become invalid when there are no ratings for new items or new users, so pure CF cannot solve cold start problems.

Unlike CF, content-based techniques do not suffer from the cold start problem, as they make recommendations based on the content of items, e.g. the actors and genres of a movie. However, a pure content-based technique only recommends items similar to those a user has previously consumed, so its results lack diversity [16].

A key step in solving the cold start problem is combining collaborative information with content information. Hybrid techniques retain the advantages of CF while not suffering from the cold start problem [1], [4], [6], [14], [19]. In this paper, we use conditional restricted Boltzmann machines, which combine collaborative and content information naturally, to solve the cold start problem.

RBMs are powerful undirected graphical models for rating prediction [18]. Unlike directed graphical models [1], [4], [6], [14], [19], whose exact inference is usually hard, inferring the latent features of an RBM is exact and efficient. An important property of undirected graphical models is that they do not suffer from the 'explaining-away' problem of directed graphical models [21]. More importantly, RBMs can be trained efficiently using Contrastive Divergence [7], [8]. Building on the RBM, the CRBM takes extra information into account [24], [25].
The content features of items, such as actors and genres, can be added to the CRBM naturally, with little extra fitting procedure.

The remainder of this paper is organized as follows. Section 2 is devoted to related work. Section 3 introduces the models we use, including the RBM and the CRBM. In Section 4 we present our experimental results. Finally, we conclude the paper with a summary and discuss future work in Section 5.
2 Related Work

Cold start problems have attracted many researchers. The key step in solving the cold start problem is combining content features with collaborative filtering techniques. On one hand, topic models can model content information well; on the other hand, matrix factorization is one of the most successful techniques for collaborative filtering. Accordingly, most previous work on the cold start problem is based on these two techniques.

Previous work [19] uses a probabilistic topic model to solve the cold start problem in three steps. First, users are mapped into a latent space via the features of their related items, so that each user is represented by some topics over features. Then a 'folding-in' algorithm folds new items into a user-feature aspect model. Finally, the probability of a user given a new movie (which can be seen as a similarity measure) is calculated in order to make recommendations.

Gantner et al. propose a high-level framework for solving the cold start problem. They train a factorization model to learn latent factors and then learn a mapping function from the features of entities to their latent factors [4]. This kind of model uses content features while retaining the advantages of matrix factorization [12], [15], [17], and it generalizes well to other latent factor models and mapping functions.

Another matrix factorization based model is the regression-based latent factor model (RLFM). RLFM incorporates user features and item features simultaneously in a single modeling framework. Unlike [4], where the latent factor learning phase and the mapping phase are separated, a Gaussian prior with a feature-based regression is added to the latent factors for regularization. RLFM provides a modeling framework based on hierarchical models, which adds flexibility to factorization methods.

In [26], collaborative filtering and probabilistic topic modeling are combined through latent factors.
Making recommendations is then a process of balancing the influence of the content of articles and the libraries of other users. Techniques for solving the cold start problem that fall into the class of directed graphical models include latent Dirichlet allocation [2], [14] and probabilistic latent semantic indexing [10]. Other techniques include semi-supervised learning [28] and decision trees [23].

Besides directed graphical models, undirected graphical models have also found application in recommender systems. The tied Boltzmann machine of [6] captures pairwise interactions between items through item features; the probabilities it assigns to new items can then be used to rank them for users. A more powerful undirected graphical model is the Restricted Boltzmann Machine, which has found wide application. The RBM was first introduced in [20] under the name harmonium. In [13], the discriminative RBM was used successfully for character recognition and text classification. The most important successes of RBMs are as an initial training phase for deep neural networks [9] and as feature extractors for text and images [5]. [18] first introduced the RBM for collaborative filtering. In this paper we show its good performance on the cold start problem.
Fig. 1.
Illustration of a restricted Boltzmann machine with binary hidden units and binary visible units. In this paper, each RBM represents a movie, each visible unit represents a user who has rated the movie, and each hidden unit represents a feature of movies.
3 Models

The RBM is a two-layer undirected graphical model, see Figure 1. It defines a joint distribution over visible variables $v$ and hidden variables $h$. In the cold start setting, we consider the case where both $v$ and $h$ are binary vectors. The RBM is an energy-based model, whose energy function is given by

$$E(v, h) = -\sum_{m=1}^{M} a_m v_m - \sum_{n=1}^{N} b_n h_n - \sum_{m=1}^{M} \sum_{n=1}^{N} v_m h_n W_{mn} \quad (1)$$

where $M$ is the number of visible units, $N$ is the number of hidden units, $v_m, h_n$ are the binary states of visible unit $m$ and hidden unit $n$, $a_m, b_n$ are their biases, and $W_{mn}$ is the weight between $v_m$ and $h_n$. Every joint configuration $(v, h)$ has probability of the form

$$p(v, h) = \frac{1}{Z} e^{-E(v,h)} \quad (2)$$

where $Z$ is the partition function, given by summing over all possible configurations $(v, h)$:

$$Z = \sum_{v,h} e^{-E(v,h)} \quad (3)$$

For a given visible vector $v$, the marginal probability $p(v)$ is

$$p(v) = \frac{1}{Z} \sum_{h} e^{-E(v,h)} \quad (4)$$

The conditional distribution of a visible unit given all the hidden units, and of a hidden unit given all the visible units, are as follows:

$$p(v_m = 1 \mid h) = \sigma\Big(a_m + \sum_{n=1}^{N} h_n W_{mn}\Big) \quad (5)$$

$$p(h_n = 1 \mid v) = \sigma\Big(b_n + \sum_{m=1}^{M} v_m W_{mn}\Big) \quad (6)$$

where $\sigma(x) = 1/(1 + e^{-x})$ is the logistic function.

In this paper, each RBM represents a movie and each visible unit represents a user who has rated the movie. This differs from [18], where each RBM represents a user and each visible unit represents a movie. All RBMs have the same number of hidden units but different numbers of visible units, because different movies are rated by different users. The corresponding weights and biases are tied together across all RBMs, so if two movies are rated by the same people, the two RBMs have the same weights and biases. In other words, all movies share the same RBM, while the missing values of users who did not rate a specific movie are ignored.
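The conditionals (5) and (6) can be sketched directly in code. The following is a minimal NumPy illustration (all variable names and sizes are toy assumptions, not from the paper's experiments):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def energy(v, h, a, b, W):
    # E(v, h) = -a.v - b.h - v^T W h, eq. (1)
    return -(a @ v) - (b @ h) - v @ W @ h

def p_h_given_v(v, b, W):
    # eq. (6): activation probability of each hidden unit given v
    return sigmoid(b + v @ W)

def p_v_given_h(h, a, W):
    # eq. (5): activation probability of each visible unit given h
    return sigmoid(a + W @ h)

rng = np.random.default_rng(0)
M, N = 6, 4                           # visible (users) and hidden units
a, b = np.zeros(M), np.zeros(N)
W = 0.01 * rng.standard_normal((M, N))
v = rng.integers(0, 2, size=M).astype(float)
print(p_h_given_v(v, b, W))           # N activation probabilities
```

Computing the partition function $Z$ exactly would require summing over all $2^{M+N}$ configurations; the conditionals above are all that training and prediction need.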
This idea is similar to matrix factorization based techniques [11], [15] that model only the observed ratings and ignore the missing values.

We learn the model by adjusting the weights and biases so as to lower the energy of the observed data while raising the energy of other configurations. This can be done by gradient ascent on the log-likelihood implied by eq. (4),

$$\sum_{t=1}^{T} \ln p(v^t \mid \theta) \quad (7)$$

where $T$ is the number of movies and $\theta = \{W, a, b\}$ are the parameters of the RBM. The gradients are as follows:

$$\frac{\partial \log p(v)}{\partial a_m} = \langle v_m \rangle_{data} - \langle v_m \rangle_{model} \quad (8)$$

$$\frac{\partial \log p(v)}{\partial b_n} = \langle h_n \rangle_{data} - \langle h_n \rangle_{model} \quad (9)$$

$$\frac{\partial \log p(v)}{\partial W_{mn}} = \langle v_m h_n \rangle_{data} - \langle v_m h_n \rangle_{model} \quad (10)$$

where $\langle \cdot \rangle$ denotes expectations under the data and model distributions respectively. $\langle \cdot \rangle_{data}$ is generally easy to compute, while $\langle \cdot \rangle_{model}$ cannot be computed in less than exponential time. We therefore use the Contrastive Divergence (CD) algorithm, which is much faster [8]. Instead of $\langle \cdot \rangle_{model}$, CD uses $\langle \cdot \rangle_{recon}$, the 'reconstruction' produced by setting each $v_m$ to 1 with the probability in eq. (5). Indeed, maximizing the log-likelihood of the data is equivalent to minimizing the Kullback-Leibler divergence between the data distribution and the model distribution; CD instead minimizes the divergence between the data distribution and the reconstruction distribution, and usually the one-step reconstruction works well [18]. The update rules are as follows:

$$a_m^{\tau+1} = a_m^{\tau} + \epsilon\,(\langle v_m \rangle_{data} - \langle v_m \rangle_{recon}) \quad (11)$$

$$b_n^{\tau+1} = b_n^{\tau} + \epsilon\,(\langle h_n \rangle_{data} - \langle h_n \rangle_{recon}) \quad (12)$$

$$W_{mn}^{\tau+1} = W_{mn}^{\tau} + \epsilon\,(\langle v_m h_n \rangle_{data} - \langle v_m h_n \rangle_{recon}) \quad (13)$$

where $\langle \cdot \rangle_{recon}$ denotes the one-step reconstruction and $\epsilon$ is the learning rate.

The above model can be used for collaborative filtering as in [18]. But it becomes invalid for new items: there are no ratings, so the corresponding hidden units cannot be activated. We must add extra information to the model to solve the cold start problem.
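The CD-1 updates (11)-(13) can be sketched as one batched NumPy step. This is a minimal illustration on toy data, not the paper's training code; batch size, learning rate, and initialization are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(V, a, b, W, eps=0.05, rng=None):
    """One CD-1 update on a batch V of binary rating vectors
    (one row per movie, one column per user)."""
    rng = rng or np.random.default_rng()
    ph = sigmoid(b + V @ W)                       # positive phase, eq. (6)
    h = (rng.random(ph.shape) < ph).astype(float) # sample hidden states
    pv = sigmoid(a + h @ W.T)                     # eq. (5)
    v1 = (rng.random(pv.shape) < pv).astype(float)  # one-step reconstruction
    ph1 = sigmoid(b + v1 @ W)
    T = V.shape[0]
    a += eps * (V - v1).mean(axis=0)              # eq. (11)
    b += eps * (ph - ph1).mean(axis=0)            # eq. (12)
    W += eps * (V.T @ ph - v1.T @ ph1) / T        # eq. (13)
    return a, b, W

rng = np.random.default_rng(0)
T, M, N = 8, 6, 4
V = (rng.random((T, M)) < 0.3).astype(float)      # toy observed ratings
a, b = np.zeros(M), np.zeros(N)
W = 0.01 * rng.standard_normal((M, N))
for _ in range(20):
    a, b, W = cd1_step(V, a, b, W, rng=rng)
```

Using the hidden probabilities (rather than sampled states) in the statistics follows the common practical recommendation of [7].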
The conditional RBM (CRBM) can easily take this extra information into account [24], [25].

For each movie we use a binary vector $f$ to denote its features, and add directed connections from $f$ to the hidden units $h$. The joint distribution over $(v, h)$ conditional on $f$ is then defined, see Figure 2. The energy function becomes

$$E(v, h, f) = -\sum_{m=1}^{M} a_m v_m - \sum_{n=1}^{N} b_n h_n - \sum_{m=1}^{M} \sum_{n=1}^{N} v_m h_n W_{mn} - \sum_{k=1}^{K} \sum_{n=1}^{N} f_k h_n U_{kn} \quad (14)$$

Fig. 2. Illustration of a conditional restricted Boltzmann machine with binary hidden units and binary visible units.
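The extra term in (14) means the hidden conditionals gain a feature-driven input $\sum_k f_k U_{kn}$. For a brand-new movie the visible vector is empty, so the hidden units are driven by $f$ alone and the user layer is then reconstructed as in a plain RBM. A minimal sketch (toy sizes; all names are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_given(v, f, b, W, U):
    # hidden activation probabilities implied by the energy (14):
    # ratings v contribute via W, content features f via U
    return sigmoid(b + v @ W + f @ U)

def cold_start_scores(f, a, b, W, U):
    """Predicted rating probabilities for a new movie with no ratings:
    v = 0, so the hidden units are activated by f alone (mean-field),
    then the visible (user) layer is reconstructed."""
    M = W.shape[0]
    h = hidden_given(np.zeros(M), f, b, W, U)
    return sigmoid(a + W @ h)          # one probability per user

rng = np.random.default_rng(0)
M, N, K = 6, 4, 5                      # users, hidden units, content features
a, b = np.zeros(M), np.zeros(N)
W = 0.01 * rng.standard_normal((M, N))
U = 0.01 * rng.standard_normal((K, N))
f = np.array([1., 0., 1., 0., 0.])     # toy content vector of a new movie
scores = cold_start_scores(f, a, b, W, U)
```

The mean-field step (using hidden probabilities instead of samples) is an assumption for simplicity; sampling $h$ would also work.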
The conditional distribution of $v_m$ given the hidden units is the same as eq. (5). The conditional distribution of $h_n$ becomes

$$p(h_n = 1 \mid f, v) = \sigma\Big(b_n + \sum_{m=1}^{M} v_m W_{mn} + \sum_{k=1}^{K} f_k U_{kn}\Big) \quad (15)$$

In the CRBM, the hidden states are affected by the feature vector $f$. For a new item there are no ratings by users, but the hidden states can still be activated by $f$, and the visible units can then be reconstructed as in the usual RBM. Moreover, this adds little extra computation.

Learning the weight matrix $U$ is just like learning the biases of the hidden units. In our experiments, we add a regularization term, known as weight-decay, to the normal gradient [7]. We use the L2 penalty function to avoid large weights. The update rule becomes

$$U_{kn}^{\tau+1} = U_{kn}^{\tau} + \epsilon\big((\langle h_n \rangle_{data} - \langle h_n \rangle_{recon}) f_k - \lambda U_{kn}\big) \quad (16)$$

Weight-decay for $U$ brings two advantages. On one hand, it improves the performance of the CRBM by avoiding overfitting during training. On the other hand, small weights are more interpretable than very large values, as will be shown in Section 4. More reasons for weight-decay can be found in [7].

4.2 Recommendation Tasks

In this section, we distinguish among different recommendation tasks in real-world applications. There are mainly three recommendation tasks, depending on the data we have: one-class explicit prediction, one-class implicit prediction, and rating prediction.

1. One-class explicit prediction. This task arises when the data contain only explicit positive feedback. Our goal is to predict whether a user likes an item or not. In this paper, we convert the original integer rating values from 1 to 5 into binary states 0 and 1. Concretely, if a rating is greater than 3, we take it as 1, otherwise as 0. As a result, all ratings not greater than 3, and all missing ratings, are taken to be 0.
2. One-class implicit prediction. Implicit ratings are forms of user feedback that do not express explicit preferences, e.g. purchase history, watching habits, and browsing activity. In this task we predict the probability that a user will rate a given movie. In this paper, all observed ratings are taken to be 1 and all missing ratings are taken to be 0.

3. Rating prediction. Rating prediction is to predict whether a user will like an item or not, conditioned on his implicit ratings. This is usually used to evaluate an algorithm's performance by holding out some ratings from the training data and then predicting them. In the training phase, all missing values are ignored and only the observed values are used. In the test phase, only the observed values in the test set are evaluated. This is very different from the former two tasks, where all missing values are taken to be 0.

In the first two recommendation tasks, all missing values are taken to be zeros, so the rating matrix becomes dense, while in the rating prediction task the rating matrix remains sparse. [19] also gives three recommendation tasks similar to ours. It is important to make this distinction, as models can give notably different performance on different tasks. In this paper we compare algorithms on these tasks and analyse their performance.
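The three target constructions above can be sketched as follows; the function name and the toy matrix are illustrative assumptions:

```python
import numpy as np

def task_targets(R):
    """Build targets for the three tasks from an integer rating matrix R
    (0 = missing, 1..5 = observed)."""
    observed = R > 0
    explicit = (R > 3).astype(float)   # one-class explicit: >3 -> 1, else (incl. missing) 0
    implicit = observed.astype(float)  # one-class implicit: any observed rating -> 1
    liked = (R > 3).astype(float)      # rating prediction: like/dislike, meaningful
                                       # only where `observed` is True
    return explicit, implicit, liked, observed

R = np.array([[5, 0, 2],
              [0, 4, 1]])
explicit, implicit, liked, mask = task_targets(R)
```

For rating prediction the mask is the essential output: training and evaluation touch only the entries where `mask` is true, so the matrix stays sparse.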
The precision-recall curve and the root mean square error (RMSE) are two popular evaluation metrics, but both can be misleading when the class distribution is skewed. As we know, rating data are usually sparse, and each user rates only a very small fraction of all items. A trivial predictor that takes every user-item pair as 0 would therefore obtain a not-so-bad result according to RMSE or the precision-recall curve. These metrics also suffer from the uncertainty of zero ratings: a missing rating may indicate either that the user does not like the item or that the user does not know the item at all.

In this paper, we use receiver operating characteristic (ROC) graphs to evaluate the performance of different models. ROC graphs have long been used for organizing classifiers and visualizing their performance. One attractive property of ROC curves is that they are insensitive to changes in the class distribution. We can also reduce ROC performance to a single scalar value to compare classifiers: the area under the ROC curve (AUC) [3].
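The AUC can be computed without plotting the curve, via its rank interpretation: the probability that a randomly chosen positive outscores a randomly chosen negative. A minimal sketch (ties between scores are not specially handled, which sklearn-style implementations do by averaging tied ranks):

```python
import numpy as np

def roc_auc(scores, labels):
    """AUC via the Mann-Whitney rank statistic."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)   # rank 1 = lowest score
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    # sum of positive ranks, minus the minimum possible sum, normalized
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

For example, `roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])` gives 0.75: of the four positive-negative pairs, three are ranked correctly.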
In this paper, we focus mainly on the cold start problem. As there are many techniques for recommender systems, we select three typical models as our baselines: the aspect model (AS) [19], the tied Boltzmann machine (TBM) [6], and the regression-based latent factor model (RLFM) [1]. We select these three models because they are, respectively, a discrete latent factor model, a continuous latent factor model, and an undirected graphical model; they represent three directions for solving cold start problems.

The former two models, both of which are latent factor models, belong to the directed graphical models. Their difference lies mainly in the type of latent factors. In the aspect model the latent factors are discrete, so the aspect model is in fact a mixture model. RLFM is strongly related to SVD-style matrix factorization methods, where the latent factors are continuous. TBM belongs to the undirected graphical models and directly models the relationships between items.

[Figure 3 shows ROC curves (true positive rate vs. false positive rate) for the three tasks. (a) Implicit rating prediction: RLFM (AUC 0.783), AS (AUC 0.745), TBM (AUC 0.806), CRBM (AUC 0.738). (b) Rating prediction: RLFM (AUC 0.763), AS (AUC 0.733), TBM (AUC 0.799), CRBM (AUC 0.721). (c) Rating imputation: RLFM (AUC 0.684), AS (AUC 0.54), TBM (AUC 0.552), CRBM (AUC 0.697).]
Fig. 3.
Performance of the different algorithms on the three recommendation tasks
We compare the performance of the four algorithms TBM, AS, RLFM, and CRBM on the three recommendation tasks, see Figure 3. As one-class implicit prediction and one-class explicit prediction give very similar results, we refer to both as one-class prediction for brevity. According to Figures 3(a) and 3(b), sorting the models by performance in descending order gives the result (TBM, RLFM, AS, CRBM). However, we get almost the opposite result in Figure 3(c). In the following, we discuss how this difference arises. A key difference between one-class prediction and rating prediction is how the missing values are treated.

On one hand, in the one-class prediction task every user-item pair has a definite value: all missing user-item values are set to 0 when we train the models, see Section 4.2. When predicting for cold start items, all users are considered, and the ROC curve is plotted over all pairs of users and new items. On the other hand, in rating prediction the missing state of a user-item pair is retained: when predicting for new items, we only test users who have rated them, and the ROC curve is plotted only over the user-item pairs in the test set. Concretely, in one-class prediction each user-item pair has one of two states, 0 or 1, and the rating matrix is dense, while in rating prediction the rating matrix remains sparse. This difference is significant and plays an important role in evaluating the models. In the former case we must predict ratings for all user-item pairs, and problems arise when we fill the missing pairs with 0, as the following example illustrates.

Suppose there are 1000 users and 1000 movies in our data set and, without loss of generality, that a user in the test set likes 20 movies. Now we need to evaluate an algorithm on this data set.
In the one-class prediction task, a perfect model should put these 20 movies at the top of the predicted ranking, which amounts to assuming that this user likes these 20 movies better than all other movies. This assumption does not seem rational: it is quite possible that the user likes some other movie better than these 20. Putting these 20 at the top of the list is not what we really want. We want to know the correct ranking of all 1000 movies, but this is generally impossible, as the ratings of the other 980 movies are missing. So the problem with one-class prediction is caused by filling in the missing values.

When it comes to rating prediction, the situation is different. We only need to predict whether a user will like each test movie, all of which are given in the test set. In other words, we only need to rank the test set, not the whole movie set. When we do not know the correct answer, it is better to ignore the missing values than to take them as 0. The above analysis covers the main problem of one-class prediction. We must realize, however, that we cannot ignore the missing values in one-class prediction as we do in rating prediction. Rating prediction cannot replace one-class prediction completely, because the latter is appropriate when there is only positive feedback and no negative feedback. Collaborative filtering techniques, such as user-based CF, item-based CF, or matrix factorization, become invalid in that situation: they always give the trivial result that every user likes every item.

It is easy to understand the performance of these four models once we understand the differences and characteristics of the recommendation tasks. Our data set contains both positive and negative feedback, so we should use the rating prediction task to evaluate performance. Though the AS model and TBM give good results in one-class prediction, they predict ratings in the dense way: they use the missing values, which carry uncertainty.
RLFM and CRBM can deal with the original sparse rating matrix. If we also sample some negative feedback from the missing pairs, they can give good performance in one-class prediction; in our experiments we sample about 10 times as much negative feedback as positive samples.

In rating prediction, we ignore the presumed-negative feedback and use less information in both the training and test phases. RLFM and CRBM can deal with this situation. In contrast, AS and TBM can only deal with the one-class problem; when we ignore the presumed-negative feedback in the training phase and ask for more informative predictions, they become invalid, as the performance of AS and TBM on rating prediction shows.
In this section, we show that the learned matrix U indeed makes sense and gives us more intuition. Each row of U can be seen as a feature of movies: a movie has a feature if the feature's corresponding state is active. Each column can be seen as the weight vector of an actor or genre, so every movie has a representation in terms of actors. If two actors are similar (similar meaning that users give similar ratings to their movies), then the distance between the corresponding columns should be small. We can therefore cluster the actors by U. In this paper we use the k-means algorithm to cluster the actors. Table 1 shows 8 clusters from our experiment; in each cluster, we show 4 actors or genres. We can see which actors are more similar to each other and in which type of movie an actor usually appears. For example, Brad Pitt and Kevin Bacon are more likely to appear in the same type of movie. Also, the main movie genres, such as comedy, crime, and war, are partitioned into different clusters.
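This clustering step can be sketched with a plain k-means on the actor/genre weight vectors. The toy U below and the orientation (one weight vector per actor/genre, with one component per hidden unit) are assumptions for illustration, not the paper's learned matrix:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means on the rows of X (one row per actor/genre)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # squared Euclidean distance of every point to every center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):        # keep old center if cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    return labels

rng = np.random.default_rng(1)
U = rng.standard_normal((12, 4))   # toy: 12 actors/genres x 4 hidden units
labels = kmeans(U, k=3)
```

Two actors whose movies receive similar ratings push their weight vectors toward each other during training, so they tend to land in the same cluster.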
Table 1. Clusters of actors and genres based on U

Topic Number  Actors and Genres
1             Rance Howard, Parker Posey, James Earl Jones, John Cusack
2             Children's, Brad Pitt, Kevin Bacon, Western
3             Comedy, Paul Herman, Michael Rapaport, John Diehl
4             R. Lee Ermey, Tim Roth, Dermot Mulroney, War
5             Wallace Shawn, James Gandolfini, Crime, Sci-Fi
6             Action, Bruce Willis, Sigourney Weaver, Xander Berkeley
7             Adventure, Drama, John Travolta, David Paymer
8             Thriller, Paul Calderon, Gene Hackman, Steve Buscemi
5 Conclusion

In this paper, we apply the CRBM to the new-item cold start problem; in fact, it can easily be extended to the new-user situation. We compare the CRBM with three other typical techniques and show that on the implicit rating prediction and rating prediction tasks the CRBM gives comparable performance, while on the rating imputation task it gives the best result. According to our analysis, the rating imputation task corresponds best to the real situation, so the CRBM shows its superiority over the other models. These results also give indications for future research. First, RBM-based models are good at extracting features, so they can give more easily explainable results. Second, this kind of model is easy to apply in online applications: when new items arrive, we just need to update the parameters via the reconstruction. Third, RBM-based models can easily be combined with deep models to extract more features for recommendation [21]. As in [27], we do not claim that CRBM models are superior to directed graphical models, but energy-based models give us additional information in different applications. In the future we will use deep models to solve the cold start problem.
References
1. Deepak Agarwal and Bee-Chung Chen. Regression-based latent factor models. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 19-28. ACM, 2009.
2. David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
3. Tom Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861-874, 2006.
4. Zeno Gantner, Lucas Drumond, Christoph Freudenthaler, Steffen Rendle, and Lars Schmidt-Thieme. Learning attribute-to-feature mappings for cold-start recommendations. In International Conference on Data Mining, pages 176-185. IEEE, 2010.
5. Peter V. Gehler, Alex D. Holub, and Max Welling. The rate adapting Poisson model for information retrieval and object recognition. In Proceedings of the 23rd International Conference on Machine Learning, pages 337-344. ACM, 2006.
6. Asela Gunawardana and Christopher Meek. Tied Boltzmann machines for cold start recommendations. In Proceedings of the 2008 ACM Conference on Recommender Systems, pages 19-26. ACM, 2008.
7. Geoffrey Hinton. A practical guide to training restricted Boltzmann machines. Momentum, 9(1):926, 2010.
8. Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800, 2002.
9. Geoffrey E. Hinton. To recognize shapes, first learn to generate images. Progress in Brain Research, 165:535-547, 2007.
10. Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50-57. ACM, 1999.
11. Yehuda Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 426-434. ACM, 2008.
12. Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30-37, 2009.
13. Hugo Larochelle and Yoshua Bengio. Classification using discriminative restricted Boltzmann machines. In Proceedings of the 25th International Conference on Machine Learning, pages 536-543. ACM, 2008.
14. Jovian Lin, Kazunari Sugiyama, Min-Yen Kan, and Tat-Seng Chua. Addressing cold-start in app recommendation: latent user models constructed from Twitter followers. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 283-292. ACM, 2013.
15. Andriy Mnih and Ruslan Salakhutdinov. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, pages 1257-1264, 2007.
16. Seung-Taek Park and Wei Chu. Pairwise preference regression for cold-start recommendation. In Proceedings of the Third ACM Conference on Recommender Systems, pages 21-28. ACM, 2009.
17. Ruslan Salakhutdinov and Andriy Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning, pages 880-887. ACM, 2008.
18. Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning, pages 791-798. ACM, 2007.
19. Andrew I. Schein, Alexandrin Popescul, Lyle H. Ungar, and David M. Pennock. Methods and metrics for cold-start recommendations. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 253-260. ACM, 2002.
20. Paul Smolensky. Information processing in dynamical systems: Foundations of harmony theory. 1986.
21. Nitish Srivastava, Ruslan R. Salakhutdinov, and Geoffrey E. Hinton. Modeling documents with deep Boltzmann machines. arXiv preprint arXiv:1309.6865, 2013.
22. Xiaoyuan Su and Taghi M. Khoshgoftaar. A survey of collaborative filtering techniques. Advances in Artificial Intelligence, page 4, 2009.
23. Mingxuan Sun, Fuxin Li, Joonseok Lee, Ke Zhou, Guy Lebanon, and Hongyuan Zha. Learning multiple-question decision trees for cold-start recommendation. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, WSDM '13, pages 445-454, New York, NY, USA, 2013. ACM.
24. Ilya Sutskever and Geoffrey E. Hinton. Learning multilevel distributed representations for high-dimensional sequences. In International Conference on Artificial Intelligence and Statistics, pages 548-555, 2007.
25. Graham W. Taylor, Geoffrey E. Hinton, and Sam T. Roweis. Modeling human motion using binary latent variables. Advances in Neural Information Processing Systems, 19:1345, 2007.
26. Chong Wang and David M. Blei. Collaborative topic modeling for recommending scientific articles. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 448-456. ACM, 2011.
27. Max Welling, Michal Rosen-Zvi, and Geoffrey E. Hinton. Exponential family harmoniums with an application to information retrieval. In Advances in Neural Information Processing Systems, volume 17, pages 1481-1488, 2004.
28. Mi Zhang, Jie Tang, Xuchen Zhang, and Xiangyang Xue. Addressing cold start in recommender systems: A semi-supervised co-training algorithm. In