An Item Recommendation Approach by Fusing Images based on Neural Networks
Weibin Lin
College of Computer Science and Technology, Wuhan University of Technology
Wuhan, China
[email protected]

Lin Li
College of Computer Science and Technology, Wuhan University of Technology
Wuhan, China
[email protected]
Abstract—There are rich formats of information on the network, such as ratings, text, and images, which represent different aspects of user preferences. In the field of recommendation, how to use such data effectively has become a difficult problem. With the rapid development of neural networks, research on multi-modal methods for recommendation has become one of the major directions. In existing recommender systems, numerical ratings, item descriptions, and reviews are the main information considered by researchers. However, the visual characteristics of an item may also affect a user's preferences, yet they are rarely used in recommendation models. In this work, we propose a novel model, named MF-VMLP, that incorporates visual factors into predictors of people's preferences, building on recent developments in neural collaborative filtering (NCF). First, we obtain visual representations via a pre-trained convolutional neural network (CNN) model. To capture the nonlinear interactions between latent vectors and visual vectors, we propose to learn them with a multi-layer perceptron (MLP). Moreover, the combination of MF and MLP achieves collaborative filtering between users and items. Our experiments use Amazon's public datasets for validation and the root-mean-square error (RMSE) as the evaluation metric. Experimental results on a real-world dataset demonstrate that, to some extent, our model can boost recommendation performance.
Index Terms—Recommender System, Neural Network, Matrix Factorization, MLP
I. INTRODUCTION
With the massive amount of data generated by online services, including e-commerce, social media applications, and online news, recommender systems are playing an increasingly important role. Matrix factorization (MF) [12], [14] is one of the most popular collaborative filtering (CF) [22] techniques, using numerical ratings to predict missing ratings. The MF method achieved great success in the Netflix Prize contest. It casts users and items into a shared latent space, using a latent vector to represent a user's interests or an item's features. Hence, the inner product of these two latent vectors estimates a user's preference for an item.

Because of the sparsity of user-item interactions, the MF method usually suffers from limited performance. Moreover, a rating only reflects a user's overall satisfaction with an item, without explaining the underlying reasons. To improve the accuracy of recommender systems, hybrid recommendation [5], [6] has been proposed, which includes two main research lines: the hybridization of algorithms and the hybridization of multi-modal data. In the first research line, researchers attempt to integrate different recommendation methods to improve performance. There are various strategies to combine two methods, such as weighting, switching, mixing, cascading, feature combination [6], [8], and so on. The combination of content-based [17] and CF-based algorithms is the most common in research and application. In the other research line, using more than one source of information can, to a certain extent, improve the prediction accuracy of recommender systems, for example ratings plus reviews or ratings plus images [9], [10], [16]. Researchers have proposed integrating numerical ratings with textual reviews for recommendation, including topic modeling [2], [23] and neural network approaches for review modeling [32], [33]. To fuse the characteristics of multi-source data, multi-view learning is a common solution [26]–[28].
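As a concrete illustration of the MF predictor described above, the following sketch (with hypothetical sizes; nothing here comes from the paper's experiments) estimates a rating as the inner product of two latent vectors:

```python
import numpy as np

# Hypothetical sizes: n users, m items, K latent dimensions.
n, m, K = 4, 5, 3
rng = np.random.default_rng(0)
P = rng.normal(size=(n, K))  # user latent vectors p_u
Q = rng.normal(size=(m, K))  # item latent vectors q_i

# MF predicts a user's preference for an item as the inner
# product of the corresponding latent vectors.
def predict(u, i):
    return P[u] @ Q[i]

# Predicting every entry at once is just a matrix product.
R_hat = P @ Q.T
```

In practice P and Q are learned from the observed ratings; the random values above only illustrate the shapes involved.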
Although researchers use a variety of data to improve recommendation performance or interpretability, most of them have ignored image information. Currently, images are mainly used in the fields of retrieval [25] and recognition. Yet the appearance of an item is one of the most intuitive and influential factors in a user's taste.

Different from predecessors' work, He et al. [10] integrate images with implicit feedback based on Bayesian personalized ranking [19]. To reduce the feature dimension of the image, the authors propose to learn an embedding kernel that linearly transforms the high-dimensional features into a much lower-dimensional space. Compared with the rapid development of neural networks, this method not only trains slowly but also yields unsatisfactory results. Zhang et al. [32] used deep neural networks (DNNs) [29], [30] to model auxiliary information, such as the visual content of images; however, when modeling the key collaborative filtering effect, they still use the MF method, applying an inner product to combine the latent vectors of users and items. In contrast, we use neural networks to train the model in a non-linear fashion. In our model, we use explicit feedback as input, which represents the user's overall satisfaction with an item. The RMSE is the evaluation index of our model for predicting the missing ratings and training the model. First, we extract visual representations with a convolutional neural network (CNN) [24] pre-trained on ImageNet. Second, the image features are combined with the traditional MF method, which can be regarded as a linear transformation model for learning features. However, He et al. [11] use an MLP model to replace the interaction function and show notable performance gains over traditional methods such as MF. Unlike their approach, we additionally consider visual features and incorporate them into the model.
The last step is to fuse the MF method and the MLP, taking the image characteristics into account. Experiments on an authoritative dataset, the clothing data from Amazon.com, show that our model outperforms the baselines in prediction accuracy. Specifically, our main contributions are listed as follows:
• We propose a recommendation method that fuses images and ratings.
• We propose a suitable model to incorporate images into predictors of people's preferences based on the NCF framework.
In the rest of the paper, we first review the related work in Section II and present the proposed model in Section III. Section IV gives the experimental results and analysis, and Section V concludes the paper.

II. RELATED WORK
As is well known, the collaborative filtering (CF) [22] algorithm is the most commonly used algorithm in recommender systems. It is divided into three categories: user-based, item-based, and model-based. Among these, the model-based recommendation algorithm has long been the mainstream of research. It mainly uses existing feedback data, such as clicks and ratings, to learn the parameters of a model, and then uses this model to predict and recommend. The MF approach is the most important of the model-based CF methods. Koren et al. [14] propose the MF approach based on the latent factor model (LFM). The algorithm showed excellent performance in the Netflix competition, providing a reference for later work.

Because of the sparsity of user-item interactions, CF methods usually suffer from limited performance. The user-based model is good at recommending popular items across a wide range of interests, but lacks personalization. The item-based model is able to discover long-tail items from a user's personal interests, but lacks diversity. From this point of view, each method has its own advantages and disadvantages. To combine the benefits of various methods, hybrid models have been proposed to make use of different information, including text, context, social data [2], [15], [18], and so on.

The recommendation tasks for CF methods are mainly divided into two types: one is to predict missing values, and the other, named top-k recommendation, is to recommend a short list of items to users. In the early days, the literature on recommendation primarily focused on explicit feedback [21]; recent attention has increasingly turned to implicit data [3], [12]. Depending on the data format used, approaches can be roughly divided into two kinds: using implicit feedback to predict a recommendation list, and using explicit feedback to predict missing rating values. In our model, we select the latter.
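The two data formats can be illustrated with a toy example (the matrices below are made up for illustration):

```python
import numpy as np

# A toy explicit-feedback matrix: entries are ratings, 0 marks "unobserved".
R = np.array([[5, 3, 0],
              [4, 0, 1]], dtype=float)

# The corresponding implicit-feedback matrix only records whether an
# interaction happened at all.
R_implicit = (R > 0).astype(int)

# Rating prediction (our setting) fills in the zeros of R; top-k
# recommendation would instead rank the zero entries of R_implicit per user.
```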
With the development of neural networks, many recent studies have developed non-linear neural network models for CF [1], [4], [7], [11], [31]. In particular, instead of using the fixed interaction function (i.e., the inner product) in MF, He et al. [11] propose a neural collaborative filtering (NCF) framework to learn the user-item interaction function from data. They then propose an NCF model named NeuMF, which fuses MF and an MLP to learn the interaction function. After that, based on the NCF framework, Bai et al. [1] incorporate the neighborhoods of users and items, Cheng et al. [7] model aspects in textual reviews, and so on.

The appearance of an item, such as a piece of clothing, is one of the most important factors influencing a user's opinion of it. In the past, however, image factors were ignored because feature extraction methods failed to achieve effective performance in visual machine learning. Now that deep learning technology has greatly improved, we can effectively extract a visual feature vector from a pre-trained convolutional neural network (CNN) to represent the latent features of an image. Based on the MF method, He et al. [10] propose visually-aware recommender systems. They embed the high-dimensional image features into a low-dimensional space via a pre-trained CNN model, which extracts high-quality visual features. Neural network models, however, seldom consider the joint incorporation of ratings and images. In comparison, our focus is on considering visual features for rating prediction based on the NCF framework.

III. METHODOLOGY

In this section, we introduce the three models used in this paper in turn and explain how they are generated. To fuse images and numerical ratings for recommendation, we propose a joint model. Table I summarizes the key notations used in the models.
TABLE I
NOTATIONS

Notation       | Explanation
U, V           | User set (|U| = n), item set (|V| = m)
R              | Rating matrix (n × m)
y_ui           | Rating assigned by user u to item i
K              | Number of latent features in MF
p_u, q_i, v_i  | Latent factors of user u, item i, and image i, respectively
λ_u, λ_v       | Regularization parameters
φ_x            | x-th neural network layer
z_i            | New item representation
A. Visual Matrix Factorization (VMF)
The MF method relates each user and item to a real-valued vector of latent features. Given n users and m products, the goal of MF is to decompose the rating matrix R ∈ R^{n×m} into two rank-K matrices U ∈ R^{K×n} and V ∈ R^{K×m}. Let p_u and q_i denote the latent vectors for user u and item i, respectively. MF estimates an interaction y_ui as the inner product of p_u and q_i:

y_ui = p_u^T q_i = Σ_{k=1}^{K} p_uk q_ik    (1)

where K denotes the dimension of the latent space. However, the basic MF approach has an over-fitting problem. The regularized MF method adds a normalization factor to the loss function as follows:

min_{U,V} Σ_{u=1}^{n} Σ_{i=1}^{m} (y_ui − p_u^T q_i)^2 + λ_u ||U||_F^2 + λ_v ||V||_F^2    (2)

where λ_u and λ_v are regularization parameters for the user embedding matrix U and the item embedding matrix V, respectively, and ||·||_F denotes the Frobenius norm. Gradient-descent-based optimization is generally applied to find a local minimum of Eq. (2).

Image characteristics can affect user preferences; for instance, users who buy clothes will care about style, color, and so on. To improve the accuracy of the recommendation model, we therefore add image parameters. The new loss function is as follows:

min_{U,V} Σ_{u=1}^{n} Σ_{i=1}^{m} (y_ui − p_u^T q_i − θ_u^T θ_i)^2 + λ_u ||U||_F^2 + λ_v ||V||_F^2    (3)

where θ_u and θ_i are newly introduced D-dimensional visual factors whose inner product models the visual interaction between u and i. To some extent, the user u pays attention to D visual dimensions. How to use image features effectively in a model is a research problem in itself. A simple method is to extract features directly from a deep CNN as the item features θ_i. However, because the extracted image features have too high a dimension, the results of such a model are unsatisfactory. Another approach is to reduce the dimensions of the image features.
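For concreteness, the objective in Eq. (3) can be sketched in NumPy as follows; the function name and all shapes are our own illustrative choices, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, K, D = 4, 5, 3, 2            # illustrative sizes

P = rng.normal(size=(n, K))        # user latent factors p_u
Q = rng.normal(size=(m, K))        # item latent factors q_i
Theta_u = rng.normal(size=(n, D))  # user visual factors θ_u
Theta_i = rng.normal(size=(m, D))  # item visual factors θ_i

def vmf_loss(R, observed, lam_u=0.01, lam_v=0.01):
    """Squared error on observed ratings plus Frobenius regularization,
    with the extra visual interaction term θ_u^T θ_i of Eq. (3)."""
    pred = P @ Q.T + Theta_u @ Theta_i.T
    err = (((R - pred) ** 2) * observed).sum()
    reg = lam_u * np.linalg.norm(P, "fro") ** 2 + lam_v * np.linalg.norm(Q, "fro") ** 2
    return err + reg

R = rng.integers(1, 6, size=(n, m)).astype(float)  # toy ratings in 1..5
observed = rng.random((n, m)) < 0.5                # mask of observed entries
loss = vmf_loss(R, observed)
```

Minimizing this loss by gradient descent over P, Q, Theta_u, and Theta_i would correspond to the training procedure described above.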
Although some dimensionality reduction techniques, such as PCA and CCA, could address this problem, experiments show that these methods lose too much useful information from the original features. Instead, we adopt an embedding kernel [10] that linearly transforms such high-dimensional features into a much lower-dimensional space:

θ_i = E f_i    (4)

Here E is a D × F matrix embedding the deep CNN feature space (F-dimensional) into the visual space (D-dimensional), and f_i is the original visual feature vector for item i.

B. Visual Multi-Layer Perceptron (VMLP)
Our VMLP model is based on the NCF model, which combines several pathways to model users and items. As shown in the first layer of the model in Fig. 1, the original visual features are extracted from a pre-trained CNN model. At the same time, we use a look-up layer to project the one-hot input of the user and item into low-dimensional embeddings.

Fig. 1. VMLP model

As in VMF, we need to reduce the dimensionality of the original image features. To address this issue, we propose to add hidden layers on the concatenated vector. To integrate the latent vector of the item with the image vector, we concatenate these vectors into an enhanced item factor. Then, to learn and predict users' preferences for items, we use a standard MLP to train the parameters. Unlike the VMF model, we can endow the model with a large degree of flexibility and nonlinearity to learn the interactions between users and items, rather than using a fixed element-wise method. Precisely, the VMLP model is defined as follows:

z_1 = a_1(p_u, q_i)
φ_2(z_1) = a_2(W_2^T z_1 + b_2)
...
φ_L(z_{L−1}) = a_L(W_L^T z_{L−1} + b_L)
ŷ_ui = δ(h^T φ_L(z_{L−1}))    (5)

where p_u, q_i, W_l, b_l, and a_l denote the user's factors, the item's enhanced factors, the weight matrix, the bias vector, and the activation function of the l-th perceptron layer, respectively. The latent vectors of the user and item are fed into a multi-layer neural architecture, which we term the neural network layers, to map the latent vectors to prediction scores. Each layer can be customized to discover certain latent structures of user-item interactions. The dimension of the last hidden layer X determines the model's capability. The final output layer produces the predicted score ŷ_ui, and training is performed by minimizing the point-wise loss between ŷ_ui and its target value y_ui.

C. Fusion of MF and VMLP (MF-VMLP)
So far, we have proposed two models, VMF and VMLP. As introduced above, VMF uses a linear kernel to model the latent feature interactions, and VMLP applies a nonlinear kernel to learn the interaction function from data. A natural question then arises: how can we fuse MF and VMLP under the NCF framework so as to learn the user-item interactions better? There are two possible ways to address this issue.

Fig. 2. MF-VMLP model

First, one of the easiest solutions is to share the same input and embedding layers between them and then combine the outputs of their interaction functions. However, the performance of the fused model might be limited by the shared embedding layers: once the embedding layers are shared, MF and VMLP must use the same embedding size. Since MF and VMLP reach their own optima separately, their embedding layers are most likely of different sizes, so this solution is not feasible.

Second, to resolve the size mismatch above, we let MF and VMLP have their own embedding layers and concatenate their last hidden layers to combine the two methods, adding one fully connected layer on top. The model we propose is named MF-VMLP, and its formulation is given as follows:

φ_MF = p_u ⊙ q_i
φ_VMLP = a_L(W_L^T (a_{L−1}(... a_2(W_2^T [p_u^V; q_i^V] + b_2) ...)) + b_L)
ŷ_ui = δ(h^T [φ_MF; φ_VMLP])    (6)

where p_u and p_u^V denote the user embeddings for MF and VMLP, respectively, and q_i and q_i^V denote the item embeddings. Among the available activation functions, we choose the ReLU for the MLP layers.

IV. EXPERIMENTS

In this section, we introduce the datasets and report the validation results of several models on them. Theoretically, visual appearance is expected to have an influence on users' decision-making processes.

TABLE II
DATASET STATISTICS (AFTER PREPROCESSING)
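Before describing the data, we note that the full MF-VMLP forward pass of Eq. (6) can be sketched in NumPy as follows. All layer sizes, the two-hidden-layer choice, and the identity output activation are our own illustrative assumptions, not the paper's tuned configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

# Illustrative sizes (assumptions, not the paper's tuned values).
K_MF, K_MLP, F, D, H1, H2 = 8, 8, 4096, 8, 32, 16

def init_params(n_users, n_items):
    """Separate embeddings per pathway, as Section III-C requires."""
    p = lambda *s: rng.normal(scale=0.1, size=s)
    return {
        "P_mf": p(n_users, K_MF), "Q_mf": p(n_items, K_MF),
        "P_mlp": p(n_users, K_MLP), "Q_mlp": p(n_items, K_MLP),
        "E": p(D, F),                          # embedding kernel for CNN features
        "W1": p(2 * K_MLP + D, H1), "b1": np.zeros(H1),
        "W2": p(H1, H2), "b2": np.zeros(H2),
        "h": p(K_MF + H2),
    }

def predict(params, u, i, f_i):
    """Forward pass of Eq. (6): MF branch, visual MLP branch, fusion.
    We use an identity output activation, since ratings are unbounded."""
    phi_mf = params["P_mf"][u] * params["Q_mf"][i]           # element-wise product
    z = np.concatenate([params["P_mlp"][u], params["Q_mlp"][i],
                        params["E"] @ f_i])                  # enhanced item factor
    h1 = relu(z @ params["W1"] + params["b1"])
    phi_mlp = relu(h1 @ params["W2"] + params["b2"])
    return np.concatenate([phi_mf, phi_mlp]) @ params["h"]

params = init_params(n_users=10, n_items=20)
score = predict(params, u=0, i=3, f_i=rng.normal(size=F))    # scalar rating estimate
```

In a real implementation the parameters would be learned by minimizing the point-wise loss between the predicted and observed ratings, for example with a deep learning framework's optimizer.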
A. Datasets
The datasets we use are from Amazon.com and were introduced by McAuley et al. [16]. We select two categories that have proven to benefit from visual features, namely Women's and Men's Clothing. In addition, we also consider Cell Phones, for which visual features are expected to play only a small role. Each dataset has been processed by extracting implicit feedback and visual features. We only use the data of users who have rated more than four items. Table II shows the datasets.
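The user filter described above ("more than four items") can be sketched as follows; the interaction tuples are made up for illustration:

```python
from collections import Counter

# Toy interaction log of (user, item, rating) triples; real data comes
# from the Amazon reviews corpus of McAuley et al. [16].
ratings = [("u1", f"i{k}", 5.0) for k in range(6)] + [("u2", "i0", 3.0)]

# Keep only users with at least five rated items, as in Section IV-A.
counts = Counter(u for u, _, _ in ratings)
filtered = [r for r in ratings if counts[r[0]] >= 5]
```

Here user "u2" is dropped because it has only one rating, while all six of "u1"'s ratings survive.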
B. Visual Features
Recall that each item i is associated with a visual feature vector, denoted by v_i. To set v_i, we use a pre-trained model to generate visual features from raw product images using the deep learning framework Caffe [13]. Following [10], we adopt the Caffe reference model, with five convolutional layers followed by three fully connected layers, which has been pre-trained on 1.2 million ImageNet [20] images. For item i, the output of the second fully connected layer is taken as the visual feature vector f_i, which has length 4096.

C. Evaluation Methodology
We split our dataset into three parts: training, validation, and test sets. This article uses the root-mean-square error (RMSE) to evaluate the models. The formula is as follows:

RMSE = sqrt( Σ_{(u,i)∈TestSet} (r_ui − r̂_ui)^2 / |TestSet| )    (7)

where r_ui, r̂_ui, and TestSet denote the real score, the predicted score, and the test set, respectively. The lower the RMSE, the higher the accuracy of the model.

TABLE III
RMSE ON THE TEST SET

Dataset       | MF     | VMF    | VMLP   | MF-VMLP | Improvement
Amazon Women  | 1.1303 | 1.0886 | 1.0639 | 1.0575  | 7.3%
Amazon Men    | 1.0579 | 1.0287 | 1.0193 | 1.0034  | 5.2%
Amazon Phones | 1.0883 | 1.0794 | 1.0854 | 1.0879  | 0%
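For reference, Eq. (7) amounts to the following computation (the rating pairs below are made up for illustration):

```python
import math

# RMSE over a toy test set of (true, predicted) rating pairs.
test_pairs = [(5.0, 4.5), (3.0, 3.5), (4.0, 4.0)]
rmse = math.sqrt(sum((r - r_hat) ** 2 for r, r_hat in test_pairs) / len(test_pairs))
# (0.25 + 0.25 + 0.0) / 3, then the square root: rmse ≈ 0.4082
```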
D. Result Analysis
Some details can be observed in Table III. Across these three datasets, the results show that image features affect users' preferences to some extent. The hybrid recommendation model MF-VMLP achieves the best results; its prediction performance is 7.3% higher than MF on the Amazon Women data. Visual features show greater benefits on the clothing datasets than on the cellphone dataset.

Fig. 3. Experimental comparison

We consider that functional items such as phones do not differ much in appearance, so images play only a small role in the model. More importantly, the neural network components have a large impact on the models.
V. CONCLUSION AND FUTURE WORK
Recommendation systems combined with deep learning have become a hot research topic. With the rapid development of deep learning, image information is becoming more and more important: visual features influence many of the choices people make. In this paper, we proposed a suitable model that fuses image features and ratings for recommendation. Since recommendation methods for multi-modal data fusion are themselves a research hotspot, in future work we will consider more side information to improve the accuracy of the recommendation model.

REFERENCES

[1] Ting Bai, Ji-Rong Wen, Jun Zhang, and Wayne Xin Zhao. A neural collaborative filtering model with interaction-based neighborhood. In
Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 1979–1982. ACM, 2017.
[2] Yang Bao, Hui Fang, and Jie Zhang. TopicMF: Simultaneously exploiting ratings and reviews for recommendation. In AAAI, volume 14, pages 2–8, 2014.
[3] Immanuel Bayer, Xiangnan He, Bhargav Kanagal, and Steffen Rendle. A generic coordinate descent framework for learning from implicit feedback. In Proceedings of the 26th International Conference on World Wide Web, pages 1341–1350. International World Wide Web Conferences Steering Committee, 2017.
[4] Alex Beutel, Paul Covington, Sagar Jain, Can Xu, Jia Li, Vince Gatto, and Ed H. Chi. Latent cross: Making use of context in recurrent recommender systems. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 46–54. ACM, 2018.
[5] Robin Burke. Hybrid recommender systems: Survey and experiments. User Modeling and User-Adapted Interaction, 12(4):331–370, 2002.
[6] Robin Burke. Hybrid web recommender systems. In The Adaptive Web, pages 377–408. Springer, 2007.
[7] Zhiyong Cheng, Ying Ding, Xiangnan He, Lei Zhu, Xuemeng Song, and Mohan S. Kankanhalli. A^3NCF: An adaptive aspect attention model for rating prediction. In IJCAI, pages 3748–3754, 2018.
[8] Asela Gunawardana and Christopher Meek. A unified approach to building hybrid recommender systems. In Proceedings of the Third ACM Conference on Recommender Systems, pages 117–124. ACM, 2009.
[9] Ruining He and Julian McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, pages 507–517. International World Wide Web Conferences Steering Committee, 2016.
[10] Ruining He and Julian McAuley. VBPR: Visual Bayesian personalized ranking from implicit feedback. In AAAI, pages 144–150, 2016.
[11] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, pages 173–182. International World Wide Web Conferences Steering Committee, 2017.
[12] Xiangnan He, Hanwang Zhang, Min-Yen Kan, and Tat-Seng Chua. Fast matrix factorization for online recommendation with implicit feedback. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 549–558. ACM, 2016.
[13] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 675–678. ACM, 2014.
[14] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, (8):30–37, 2009.
[15] Zhongqi Lu, Zhicheng Dou, Jianxun Lian, Xing Xie, and Qiang Yang. Content-based collaborative filtering for news topic recommendation. In AAAI, pages 217–223, 2015.
[16] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 43–52. ACM, 2015.
[17] Michael J. Pazzani and Daniel Billsus. Content-based recommendation systems. In The Adaptive Web, pages 325–341. Springer, 2007.
[18] Zhi Qiao, Peng Zhang, Yanan Cao, Chuan Zhou, Li Guo, and Binxing Fang. Combining heterogenous social and geographical information for event recommendation. In AAAI, volume 14, pages 145–151, 2014.
[19] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 452–461. AUAI Press, 2009.
[20] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[21] Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning, pages 791–798. ACM, 2007.
[22] Xiaoyuan Su and Taghi M. Khoshgoftaar. A survey of collaborative filtering techniques. Advances in Artificial Intelligence, 2009, 2009.
[23] Yunzhi Tan, Min Zhang, Yiqun Liu, and Shaoping Ma. Rating-boosted latent topics: Understanding users and items with ratings and reviews. In IJCAI, pages 2640–2646, 2016.
[24] Yang Wang, Xuemin Lin, Lin Wu, and Wenjie Zhang. Effective multi-query expansions: Robust landmark retrieval. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 79–88. ACM, 2015.
[25] Yang Wang, Xuemin Lin, Lin Wu, and Wenjie Zhang. Effective multi-query expansions: Collaborative deep networks for robust landmark retrieval. IEEE Transactions on Image Processing, 26(3):1393–1404, 2017.
[26] Yang Wang, Xuemin Lin, Lin Wu, Wenjie Zhang, Qing Zhang, and Xiaodi Huang. Robust subspace clustering for multi-view data by exploiting correlation consensus. IEEE Transactions on Image Processing, 24(11):3939–3949, 2015.
[27] Yang Wang, Lin Wu, Xuemin Lin, and Junbin Gao. Multiview spectral clustering via structured low-rank matrix factorization. IEEE Transactions on Neural Networks and Learning Systems, (99):1–11, 2018.
[28] Yang Wang, Wenjie Zhang, Lin Wu, Xuemin Lin, Meng Fang, and Shirui Pan. Iterative views agreement: An iterative low-rank based structured optimization method to multi-view spectral clustering. arXiv preprint arXiv:1608.05560, 2016.
[29] Lin Wu, Yang Wang, Xue Li, and Junbin Gao. Deep attention-based spatially recursive networks for fine-grained visual recognition. IEEE Transactions on Cybernetics, (99):1–12, 2018.
[30] Lin Wu, Yang Wang, Ling Shao, and Meng Wang. 3-D PersonVLAD: Learning deep global representations for video-based person reidentification. IEEE Transactions on Neural Networks and Learning Systems, 2019.
[31] Yao Wu, Christopher DuBois, Alice X. Zheng, and Martin Ester. Collaborative denoising auto-encoders for top-n recommender systems. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pages 153–162. ACM, 2016.
[32] Wei Zhang, Quan Yuan, Jiawei Han, and Jianyong Wang. Collaborative multi-level embedding learning from reviews for rating prediction. In IJCAI, pages 2986–2992, 2016.
[33] Lei Zheng, Vahid Noroozi, and Philip S. Yu. Joint deep modeling of users and items using reviews for recommendation. In