Dual-embedding based Neural Collaborative Filtering for Recommender Systems
Gongshan He, Dongxing Zhao, Lixin Ding∗
School of Computer Science, Wuhan University, Wuhan, Hubei Province 430072, PR China
∗ Corresponding author. Email addresses: [email protected] (Gongshan He), [email protected] (Dongxing Zhao), [email protected] (Lixin Ding)
Abstract
Among various recommender techniques, collaborative filtering (CF) is the most successful one, and a key problem in CF is how to represent users and items. Previous works usually represent a user (an item) as a vector of latent factors (aka. embedding) and then model the interactions between users and items based on these representations. Despite their effectiveness, we argue that such representations are insufficient to yield satisfactory embeddings for collaborative filtering. Inspired by the idea of SVD++, which represents users based on themselves and their interacted items, we propose a general collaborative filtering framework named DNCF, short for Dual-embedding based Neural Collaborative Filtering, which utilizes historical interactions to enhance the representations. In addition to learning the primitive embedding for a user (an item), we introduce an additional embedding from the perspective of the interacted items (users) to augment the user (item) representation. Extensive experiments on four publicly available datasets demonstrate the effectiveness of the proposed DNCF framework by comparing its performance with several traditional matrix factorization models and other state-of-the-art deep learning based recommender models.
Keywords: recommender systems, collaborative filtering, neural network, dual embeddings
1. Introduction
In the era of information explosion, users are often overwhelmed by the numerous choices available online, a problem known as information overload. Over the past decades, recommender systems, which utilize collective wisdom and experience to generate recommendations, have been intensively studied and extensively deployed in various scenarios, such as e-commerce and music platforms, to alleviate this problem. Collaborative Filtering (CF) [1, 2] is one of the most successful recommender techniques and has been widely used to build personalized recommender systems.

The key challenge in designing a CF model is how to represent a user and an item and how to model their interactions based on the representations [3]. As a dominant model in CF, Matrix Factorization (MF) [4, 5] characterizes users and items with latent vectors (aka. embedding) in a shared latent space, and each user-item interaction is then modeled as the inner product between the user embedding and the item embedding. Many extensions have been developed for MF from both the modeling perspective [6, 7] and the learning perspective [8]. For example, NSVD [6] characterizes users by the items that they have rated. Specifically, a user embedding is represented by combining the embeddings of all items rated by the user. Going a step further, SVD++ [7] represents a user by integrating the embeddings of the items interacted by the user with the primitive user embedding.

In recent years, deep learning methods have achieved tremendous success in many fields, such as computer vision [9] and natural language processing [10]. There are also many works applying deep learning to recommender systems. Neural matrix factorization (NeuMF) [11] represents a user or an item with an ID and learns the interactions by fusing the linear MF and the non-linear multi-layer perceptron (MLP) models. DeepMF [12] feeds related rating vectors into an MLP to learn users' (items') embeddings and then uses cosine similarity as the interaction function to predict the relevance score.

Inspired by NSVD and NeuMF, DELF [13] was proposed to represent users and items by their dual embeddings. To be more specific, in addition to modeling primitive embeddings, it obtains additional embeddings from the perspective of the interacted users or items and learns the interactions between users and items from four aspects: user-to-user, item-to-item, user-to-item (ID) and user-to-item (historical interactions). However, it only employs the dual embeddings to learn four kinds of interaction functions for each user-item pair and does not combine the two types of embeddings to obtain a better user (item) representation.

To tackle this problem, we propose a general dual-embedding based CF framework named DNCF, short for Dual-embedding based Neural Collaborative Filtering, to combine the strengths of the two types of embeddings. Specifically, we use the items interacted by a user to augment the user representation and use the users who once interacted with an item to enrich the item representation. We then employ a deep neural network architecture to model the user-item interactions.

The main contributions of this work are as follows:

1. We propose to combine users' (items') dual embeddings into final users' (items') representations, namely, integrating their primary embeddings with additional embeddings obtained from the perspective of historical interactions.

2. We devise a novel framework named Dual-embedding based Neural Collaborative Filtering (DNCF), which models the interactions between users and items based on their dual embeddings.

3. We conduct extensive experiments on four real-world datasets to demonstrate the effectiveness of our proposed DNCF approaches.

The remainder of this article is organized as follows: Section 2 introduces the preliminaries for top-N recommendation. Section 3 reviews related work. Section 4 presents our proposed DNCF framework in detail. Section 5 illustrates the experimental results on four public datasets. Finally, we conclude this work and point out future research directions in Section 6.
2. Preliminaries
Let $M$ and $N$ denote the total number of users and items in the system, respectively. Following [11, 12, 14, 15], we construct the user-item interaction matrix $\mathbf{Y} \in \mathbb{R}^{M \times N}$ from users' implicit feedback as follows:

$$y_{ui} = \begin{cases} 1, & \text{if interaction (user } u \text{, item } i\text{) is observed} \\ 0, & \text{otherwise} \end{cases} \quad (1)$$

For implicit feedback, all observed interactions are considered as noisy positive instances which reflect users' preferences to some extent. However, there are no negative instances. A simple solution is to treat all unobserved interactions (i.e., the value of $y_{ui}$ is equal to 0) as negative feedback. Nevertheless, not all unobserved interactions are true negative instances. To be specific, an unobserved interaction does not necessarily mean that user $u$ does not like item $i$; as a matter of fact, user $u$ may have never seen item $i$, since there are too many items in a system. Another approach is to sample negative instances from unobserved interactions [16, 11]. In this work, we choose the latter, i.e., we randomly sample negative instances from unobserved interactions without replacement.

The problem of recommendation with implicit feedback is to estimate the scores of unobserved entries in $\mathbf{Y}$, which are used for ranking the items. Model-based approaches [4, 5] generally assume that the data can be generated by an underlying model, which can be formulated as

$$\hat{y}_{ui} = f(u, i \mid \Theta) \quad (2)$$

where $\hat{y}_{ui}$ denotes the predicted score of interaction $y_{ui}$, $\Theta$ denotes the model parameters, and $f$ denotes the function that maps the model parameters to the predicted score.

Most existing approaches estimate the parameters $\Theta$ by optimizing an objective function. Three types of objective functions are most commonly used in recommender systems: point-wise loss [17, 18, 14], pair-wise loss [8, 19] and list-wise loss [20, 21]. In this paper, we explore the point-wise loss only and leave the pair-wise and list-wise losses as future work. Point-wise loss has been widely studied in collaborative filtering with explicit feedback under the regression framework. The most commonly used point-wise loss is the squared loss, which minimizes the difference between the predicted value $\hat{y}_{ui}$ and its target value $y_{ui}$:

$$L = \sum_{(u,i) \in \mathcal{Y}^+ \cup \mathcal{Y}^-} w_{ui} \, (y_{ui} - \hat{y}_{ui})^2 \quad (3)$$

where $\mathcal{Y}^+ = \{(u,i) \mid y_{ui} = 1\}$ denotes the set of observed interactions, $\mathcal{Y}^- = \{(u,i) \mid y_{ui} = 0\}$ denotes the sampled unobserved interactions, i.e., negative instances, and $w_{ui}$ denotes the weight of training instance $(u,i)$. However, the squared loss is not suitable for implicit feedback because the implicit data is discrete and binary: the target value $y_{ui}$ is 1 if $u$ has interacted with $i$, and 0 otherwise.

Following [11], we adopt the binary cross-entropy loss as the objective function, which views the top-N recommendation problem with implicit feedback as a binary classification problem:

$$L = -\sum_{(u,i) \in \mathcal{Y}^+ \cup \mathcal{Y}^-} \big[ y_{ui} \log \hat{y}_{ui} + (1 - y_{ui}) \log(1 - \hat{y}_{ui}) \big] \quad (4)$$
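To make the sampling protocol concrete, the following sketch shows one way the training instances behind Equation (4) could be assembled; it is our own minimal Python illustration (not the authors' released code), with all function and variable names hypothetical.

```python
import random

def build_training_instances(user_items, num_items, num_negatives=4, seed=42):
    """Pair every observed interaction with sampled unobserved ones.

    user_items: dict mapping user id -> set of item ids the user interacted with.
    Returns parallel lists (users, items, labels) with label 1 for observed
    interactions and 0 for the sampled negatives used in Equation (4).
    """
    rng = random.Random(seed)
    users, items, labels = [], [], []
    for u, pos in user_items.items():
        # negatives drawn without replacement from the user's unobserved items
        unobserved = [j for j in range(num_items) if j not in pos]
        negatives = rng.sample(unobserved, min(len(unobserved), num_negatives * len(pos)))
        for i in pos:
            users.append(u); items.append(i); labels.append(1)
        for j in negatives:
            users.append(u); items.append(j); labels.append(0)
    return users, items, labels
```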
3. Related Work
Matrix Factorization (MF) typically represents each user/item as a low-dimensional embedding vector. Let $\mathbf{p}_u$ and $\mathbf{q}_i$ denote the latent vectors for user $u$ and item $i$ in a shared embedding space, respectively. The relevance score $y_{ui}$ between user $u$ and item $i$ is estimated by the inner product of $\mathbf{p}_u$ and $\mathbf{q}_i$:

$$\hat{y}_{ui} = \langle \mathbf{p}_u, \mathbf{q}_i \rangle = \mathbf{p}_u^T \mathbf{q}_i \quad (5)$$

Different from traditional MF methods, NSVD [6] represents users based on the items that they have rated. Note that each item $i$ is associated with two latent vectors, $\mathbf{q}_i$ and $\mathbf{y}_i$. Formally, the preference score of user $u$ for item $i$ is predicted as:

$$\hat{y}_{ui} = b_u + b_i + \mathbf{q}_i^T \underbrace{|R(u)|^{-\frac{1}{2}} \sum_{j \in R(u)} \mathbf{y}_j}_{\text{user } u\text{'s representation}} \quad (6)$$

where $b_u$ and $b_i$ denote the bias terms of user $u$ and item $i$, respectively, and $R(u)$ is the set of items rated by user $u$. However, a main drawback of NSVD is that two different users who have rated the same set of items with different ratings have the same representation.

To address this problem, SVD++ [7] was proposed for recommendation with explicit ratings, which estimates the relevance score between user $u$ and item $i$ as follows:

$$\hat{y}_{ui} = \mu + b_u + b_i + \mathbf{q}_i^T \underbrace{\Big( \mathbf{p}_u + |N(u)|^{-\frac{1}{2}} \sum_{j \in N(u)} \mathbf{y}_j \Big)}_{\text{user } u\text{'s representation}} \quad (7)$$

where $\mu$ is the average rating over all items, $\mathbf{p}_u$ is the latent vector for user $u$, and $N(u)$ denotes the set of items for which $u$ provided an implicit preference. The user latent vector is complemented by the sum $|N(u)|^{-\frac{1}{2}} \sum_{j \in N(u)} \mathbf{y}_j$, which represents the perspective of implicit feedback. In other words, SVD++ leverages historical interactions to supplement the user latent factor rather than to directly represent the user.

For top-N recommendation, Kabbur et al. [22] proposed FISM (short for Factored Item Similarity Model), which learns the item-item similarity matrix as a product of two low-dimensional latent factor matrices. Formally, the predictive model of FISM is

$$\hat{y}_{ui} = b_u + b_i + \mathbf{p}_i^T \underbrace{|\mathcal{R}_u^+|^{-\alpha} \sum_{j \in \mathcal{R}_u^+ \setminus i} \mathbf{q}_j}_{\text{user } u\text{'s representation}} \quad (8)$$

where $\alpha$ is a hyper-parameter controlling the normalization effect, and $\mathbf{p}_i$ and $\mathbf{q}_j$ denote the embedding vectors for items $i$ and $j$, respectively. In Equation (8), the underbraced term can be viewed as user $u$'s representation, which is aggregated from the embeddings of the historical items of $u$.

Despite the effectiveness of the above approaches, they have an inherent limitation in their model design. Specifically, they use a simple and fixed inner product as the interaction function, which is insufficient to capture the complex user-item interactions in the low-dimensional latent space.

Neural collaborative filtering (NCF) [11] was thus proposed to learn the user-item interaction function via a multi-layer perceptron (MLP). DeepCF [14] fuses representation learning-based CF methods and interaction function learning-based CF methods, and applies multi-hot encoding on the ID features of user $u$'s interacted items $\mathcal{R}_u^+$ to represent user $u$ (and analogously for items). J-NCF [15] feeds the user's (item's) rating vector into an MLP to learn the user's (item's) feature vector and then concatenates them as the input to another MLP that learns the interaction function.

All the methods mentioned above build the embedding function with either IDs or historical interactions only. As reported in [23], these methods cannot yield satisfactory embeddings and have to rely on the interaction function to make up for the deficiency of suboptimal embeddings. For this reason, DELF [13] was proposed to jointly adopt both IDs and historical interactions to model users and items. However, it only employs the dual embeddings to learn four kinds of interaction functions for each user-item pair and does not combine the two types of embeddings to obtain a better user (item) representation. In fact, it can be seen as the fusion of four MLP models which take the concatenations of different types of embeddings as input.
4. Proposed Methods
In this section, we present the Dual-embedding based Neural Collaborative Filtering (DNCF) framework. Before diving into the technical details, we first introduce some basic notation. Throughout the paper, we use bold uppercase letters to denote matrices (e.g., $\mathbf{R}$), bold lowercase letters to denote vectors (e.g., $\mathbf{p}$) and lowercase letters to denote scalars (e.g., $y$). Figure 1 illustrates our proposed DNCF framework, which models the interactions between users and items based on dual embeddings. Next, we elaborate the architecture layer by layer.
Input Layer.
We take the ID and the historical interactions as the input features and transform them into binarized sparse vectors with one-hot encoding and multi-hot encoding, respectively. Hence, we obtain two kinds of feature vectors for both user $u$ and item $i$.
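As a concrete illustration of the two encodings (a hypothetical helper of our own, not part of the paper), the ID feature becomes a one-hot vector of length $M$ (or $N$), while the history feature is simply the corresponding row (or column) of the interaction matrix $\mathbf{Y}$:

```python
import numpy as np

def one_hot(idx, dim):
    """One-hot encoding of an ID feature (user or item)."""
    v = np.zeros(dim, dtype=np.float32)
    v[idx] = 1.0
    return v

def multi_hot(interacted_ids, dim):
    """Multi-hot encoding of historical interactions, i.e. the user's row
    (or the item's column) of the interaction matrix Y."""
    v = np.zeros(dim, dtype=np.float32)
    v[list(interacted_ids)] = 1.0
    return v

x_u = one_hot(3, 5)         # user 3 of M = 5 users -> [0, 0, 0, 1, 0]
y_u = multi_hot({0, 2}, 4)  # interacted with items {0, 2} of N = 4 -> [1, 0, 1, 0]
```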
Embedding Layer. The embedding layer projects each high-dimensional sparse feature vector from the input layer into a low-dimensional dense embedding vector. Let $\mathbf{P} \in \mathbb{R}^{M \times k}$, $\mathbf{Q} \in \mathbb{R}^{N \times k}$, $\mathbf{M} \in \mathbb{R}^{N \times k}$ and $\mathbf{N} \in \mathbb{R}^{M \times k}$ denote the latent factor matrices for users and items from the perspectives of ID and historical interactions, respectively, where $k$ is the dimension of the embedding. The primitive embeddings can be obtained as below:

$$\mathbf{p}_u = \mathbf{P}^T \mathbf{x}_u, \quad \mathbf{q}_i = \mathbf{Q}^T \mathbf{x}_i \quad (9)$$

where $\mathbf{p}_u$ and $\mathbf{q}_i$ denote the latent vectors for user $u$ and item $i$ from the perspective of ID, respectively. We term them ID embeddings. As for the historical interactions, we learn the embedding by an aggregation function which summarizes the embeddings of the historical items (users) into a vector.
Figure 1: The architecture of DNCF: an input layer (the user's and item's one-hot ID encodings and multi-hot history encodings), an embedding layer (user/item ID embeddings and history embeddings), an embedding combination layer (producing the user vector and item vector), neural collaborative filtering layers (Layer 1 through Layer X), and a prediction layer producing $\hat{y}_{ui}$.

We term it the history embedding, short for historical-interactions-based embedding. Take user $u$ as an example. The historical items are associated with another group of latent vectors $\mathbf{y}_j$:

$$\mathbf{m}_u = \mathrm{AGG}(\{\mathbf{y}_j, \forall j \in \mathcal{R}_u^+\}) \quad (10)$$

where $\mathcal{R}_u^+$ is the set of items which user $u$ has interacted with, $\mathbf{y}_j$ is the $j$-th row of $\mathbf{M}$, which represents item $j$, and $\mathrm{AGG}(\cdot)$ denotes any aggregation function.

A common aggregation function is summation with normalization. In this case, the embedding layer can be simplified as:

$$\mathbf{m}_u = |\mathcal{R}_u^+|^{-\frac{1}{2}} \, \mathbf{M}^T \mathbf{Y}_{u*}, \quad \mathbf{n}_i = |\mathcal{R}_i^+|^{-\frac{1}{2}} \, \mathbf{N}^T \mathbf{Y}_{*i} \quad (11)$$

where $\mathbf{m}_u$ and $\mathbf{n}_i$ denote the latent vectors for user $u$ and item $i$ from the perspective of historical interactions, respectively, and $\mathbf{Y}_{u*}$ and $\mathbf{Y}_{*i}$ are the $u$-th row and the $i$-th column of $\mathbf{Y}$, respectively.
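A minimal sketch of this aggregation (our own illustration, assuming the inverse-square-root normalization that makes Equation (19) below reduce exactly to SVD++):

```python
import numpy as np

def history_embedding(latent_matrix, interacted_ids):
    """Equation (11): normalized sum of the latent vectors of the
    historical items (users), i.e. m_u = |R_u+|^(-1/2) M^T Y_u*."""
    ids = sorted(interacted_ids)
    return latent_matrix[ids].sum(axis=0) / np.sqrt(len(ids))

# toy example: N = 4 items with k = 2 latent factors each
M = np.arange(8, dtype=np.float32).reshape(4, 2)  # row j is y_j
m_u = history_embedding(M, {0, 2})                # (y_0 + y_2) / sqrt(2)
```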
Embedding Combination Layer. The output of the previous embedding layer consists of two kinds of embeddings for users and items, respectively. The embedding combination layer integrates the ID embedding with the history embedding to get the final representation. Formally, it can be defined as:

$$\mathbf{v}_u = g(\mathbf{p}_u, \mathbf{m}_u) \quad (12)$$

where $g(\cdot)$ is any binary operator, which we term the embedding combination function. Similarly, item $i$'s final representation $\mathbf{v}_i$ can be obtained. Neural Collaborative Filtering Layers.
We feed the users' and items' representations into the neural collaborative filtering layers to learn the interactions between users and items. Formally, this process is formulated as:

$$\mathbf{z}_{out} = \mathrm{ncf}(\mathbf{v}_u, \mathbf{v}_i) \quad (13)$$

where $\mathbf{z}_{out}$ denotes the output vector of the neural collaborative filtering layers $\mathrm{ncf}(\cdot)$. Prediction Layer.
The prediction layer maps the output vector of the neural collaborative filtering layers into the predicted score $\hat{y}_{ui}$ of the interaction between user $u$ and item $i$.

Following the MLP model [11], we propose Dual-embedding based Multi-Layer Perceptron (DMLP), which employs vector concatenation as the embedding combination function $g(\cdot)$ and uses an MLP to learn the interaction between the user and item latent vectors. Formally, the formulation of DMLP is given as follows:

$$\begin{aligned} \mathbf{z}_1 &= (\mathbf{p}_u \oplus \mathbf{m}_u) \oplus (\mathbf{q}_i \oplus \mathbf{n}_i) \\ \mathbf{z}_2 &= a_2(\mathbf{W}_2^T \mathbf{z}_1 + \mathbf{b}_2) \\ &\cdots \\ \mathbf{z}_{out} = \mathbf{z}_L &= a_L(\mathbf{W}_L^T \mathbf{z}_{L-1} + \mathbf{b}_L) \\ \hat{y}_{ui} &= \sigma(\mathbf{h}^T \mathbf{z}_{out} + b_{out}) \end{aligned} \quad (14)$$

where $\oplus$ denotes the concatenation of two vectors; $\mathbf{W}_l$, $\mathbf{b}_l$, $a_l$ and $\mathbf{z}_l$ denote the weight matrix, bias vector, activation function, and output vector of the $l$-th hidden layer; $\mathbf{h}$ and $b_{out}$ denote the weight vector and bias term of the prediction layer; and $\sigma(\cdot)$ is the sigmoid function defined as $\sigma(x) = \frac{1}{1 + e^{-x}}$. In this work, we choose the Rectified Linear Unit (ReLU) as the activation function.
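To make the architecture tangible, here is a minimal Keras sketch of DMLP under the settings described later (embedding size 64, tower 256 → 128 → 64). It is an illustrative reconstruction rather than the authors' released implementation; in particular, the history embeddings are realized here as bias-free dense layers over the (pre-normalized) multi-hot history vectors, which is equivalent to the lookup-and-aggregate of Equation (11).

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_dmlp(num_users, num_items, k=64, hidden=(128, 64)):
    """Sketch of DMLP (Equation (14)): concatenate the four embeddings,
    pass them through an MLP tower and a sigmoid prediction layer."""
    uid = layers.Input(shape=(1,), dtype="int32")   # user ID
    iid = layers.Input(shape=(1,), dtype="int32")   # item ID
    u_hist = layers.Input(shape=(num_items,))       # user's (normalized) multi-hot history
    i_hist = layers.Input(shape=(num_users,))       # item's (normalized) multi-hot history

    p_u = layers.Flatten()(layers.Embedding(num_users, k)(uid))  # ID embeddings (Eq. (9))
    q_i = layers.Flatten()(layers.Embedding(num_items, k)(iid))
    m_u = layers.Dense(k, use_bias=False)(u_hist)   # history embeddings (Eq. (11)):
    n_i = layers.Dense(k, use_bias=False)(i_hist)   # a linear map of the multi-hot input

    z = layers.Concatenate()([p_u, m_u, q_i, n_i])  # z_1: 4k = 256 dimensions
    for width in hidden:                            # ReLU tower: 256 -> 128 -> 64
        z = layers.Dense(width, activation="relu")(z)
    y_hat = layers.Dense(1, activation="sigmoid")(z)  # predicted score
    return Model([uid, iid, u_hist, i_hist], y_hat)

model = build_dmlp(num_users=6040, num_items=3706)
model.compile(optimizer="adam", loss="binary_crossentropy")  # Equation (4)
```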
Following the GMF model [11], we propose Dual-embedding based Generalized Matrix Factorization (DGMF), which employs the element-wise product to combine the users' and items' representations:

$$\phi(\mathbf{v}_u, \mathbf{v}_i) = g(\mathbf{p}_u, \mathbf{m}_u) \odot g(\mathbf{q}_i, \mathbf{n}_i) \quad (15)$$

where $\odot$ denotes the element-wise product of vectors. Inspired by [24], we tried stacking non-linear layers after the element-wise product operation. However, this does not achieve better performance than the design of GMF [11], which has no hidden layer, probably because the simple element-wise product is already sufficient to capture the interactions between users and items in DGMF. For this reason, we directly project the vector into the predicted score:

$$\hat{y}_{ui} = \sigma(\mathbf{h}^T \phi(\mathbf{v}_u, \mathbf{v}_i) + b_{out}) \quad (16)$$

To test the impact of $g(\cdot)$ in Equation (15), we investigate four methods to combine the different embeddings into the final user (item) vector for DGMF: element-wise sum, element-wise mean, concatenation and the attention mechanism [25]. Take the attention mechanism as an example:

$$\begin{aligned} att(\mathbf{x}) &= \mathbf{h}_a^T \, \mathrm{ReLU}(\mathbf{W}_a^T \mathbf{x} + \mathbf{b}_a), \\ \alpha &= \frac{\exp(att(\mathbf{p}_u))}{\exp(att(\mathbf{p}_u)) + \exp(att(\mathbf{m}_u))}, \\ \mathbf{v}_u &= \alpha \, \mathbf{p}_u + (1 - \alpha) \, \mathbf{m}_u \end{aligned} \quad (17)$$

where $\mathbf{W}_a \in \mathbb{R}^{k \times k'}$ and $\mathbf{b}_a \in \mathbb{R}^{k'}$ denote the weight matrix and bias vector of the attention network, respectively, $k'$ denotes the size of the hidden layer, and $\mathbf{h}_a \in \mathbb{R}^{k'}$ denotes the weight vector of the output layer of the attention network. Likewise, we can also get the final item representation. Without special mention, we use the simple element-wise sum as the embedding combination function $g(\cdot)$. We compare the impact of these methods for DGMF and report the experimental results in Section 5.5.

Following the design of NeuMF [11], we allow DGMF and DMLP to learn separate embeddings and combine the two models by concatenating the output vectors of their neural CF layers, which we then feed into a fully connected layer. Specifically, it can be formulated as:

$$\begin{aligned} \phi^{DGMF} &= g(\mathbf{p}_u^G, \mathbf{m}_u^G) \odot g(\mathbf{q}_i^G, \mathbf{n}_i^G) \\ \phi^{DMLP} &= \mathrm{MLP}(\mathbf{p}_u^M \oplus \mathbf{m}_u^M \oplus \mathbf{q}_i^M \oplus \mathbf{n}_i^M) \\ \hat{y}_{ui} &= \sigma(\mathbf{h}^T (\phi^{DGMF} \oplus \phi^{DMLP}) + b_{out}) \end{aligned} \quad (18)$$

where $\mathbf{p}_u^G$ and $\mathbf{p}_u^M$ denote the user's ID embedding for DGMF and DMLP, respectively, with similar notations for the others. We refer to this model as DNMF, short for Dual-embedding based Neural Matrix Factorization.

As reported in [3], initialization plays a significant role in the convergence and performance of deep learning models. Since DNMF is an ensemble of DGMF and DMLP, we propose to initialize DNMF with the pre-trained models of DGMF and DMLP. First, we train DGMF and DMLP from scratch using Adam [26] until convergence. Then, we use their model parameters as the initialization for the corresponding parts of DNMF's parameters. Notice that DNMF with pre-training is optimized by vanilla SGD rather than Adam, because Adam requires momentum information to update the parameters, which is not saved in DNMF with pre-training.
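For concreteness, the attention-based combination function $g(\cdot)$ of Equation (17) can be sketched as follows (our own NumPy illustration with hypothetical parameter values):

```python
import numpy as np

def attentive_combine(p_u, m_u, W_a, b_a, h_a):
    """Equation (17): weigh the ID embedding p_u and the history
    embedding m_u with a two-way softmax from a small attention net."""
    att = lambda x: h_a @ np.maximum(W_a.T @ x + b_a, 0.0)  # scalar attention score
    alpha = np.exp(att(p_u)) / (np.exp(att(p_u)) + np.exp(att(m_u)))
    return alpha * p_u + (1.0 - alpha) * m_u                # final user vector v_u

# toy shapes: k = 4 embedding factors, attention hidden size k' = 3
rng = np.random.default_rng(0)
v_u = attentive_combine(rng.normal(size=4), rng.normal(size=4),
                        rng.normal(size=(4, 3)), rng.normal(size=3),
                        rng.normal(size=3))
```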
In this subsection, we first show how DNCF generalizes SVD++ [7] and FISM [22]. In what follows, we analyze the time complexity of DNMF.

Relationship with SVD++ & FISM.
Both SVD++ and FISM can be viewed as special cases of DNCF. In particular, we use the element-wise sum as the embedding combination function, and in the neural collaborative filtering layers we employ the inner product to model the interactions between users and items. We term this model DNCF-MF, which can be formulated as:

$$\hat{y}_{ui} = \Big( \mathbf{p}_u + |\mathcal{R}_u^+|^{-\frac{1}{2}} \sum_{j \in \mathcal{R}_u^+} \mathbf{y}_j \Big)^T \Big( \mathbf{q}_i + |\mathcal{R}_i^+|^{-\frac{1}{2}} \sum_{v \in \mathcal{R}_i^+} \mathbf{y}_v \Big) \quad (19)$$

Clearly, by disabling the additional embedding for items, which aggregates the embeddings of the historical users, we exactly recover the SVD++ model. Analogously, if we additionally disable the primitive user embedding in Equation (19), we recover FISM.

For the embedding layer, the matrix multiplication has computational complexity $O\big((k + d)(|\mathcal{R}_u^+| + |\mathcal{R}_i^+|)\big)$, where $k$ denotes the embedding size of the DGMF part, which is equal to the number of predictive factors, $d$ denotes the embedding size of the DMLP part, and $|\mathcal{R}_u^+|$ denotes the number of historical items interacted by user $u$, with a similar notation for $|\mathcal{R}_i^+|$. For the collaborative filtering layers, the time complexity is $O(\sum_{l=1}^{L} d_l d_{l-1})$, where $d_l$ denotes the size of the $l$-th hidden layer and $d_L = k$. The prediction layer only involves the inner product of two vectors, which can be done in $O(d_L)$. Therefore, the overall time complexity of evaluating a prediction with DNMF is $O\big((k + d)(|\mathcal{R}_u^+| + |\mathcal{R}_i^+|) + \sum_{l=1}^{L} d_l d_{l-1}\big)$.
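The reduction is easy to see in code. Below is our own sketch of the DNCF-MF predictor of Equation (19) (bias terms omitted, as in the equation); zeroing the item-side history recovers an SVD++-style score, and additionally zeroing $\mathbf{p}_u$ recovers a FISM-style score:

```python
import numpy as np

def dncf_mf_score(p_u, q_i, hist_item_vecs, hist_user_vecs):
    """Equation (19): inner product of the two history-augmented vectors.

    hist_item_vecs: array of latent vectors y_j of user u's historical items.
    hist_user_vecs: array of latent vectors y_v of item i's historical users.
    """
    user_repr = p_u + hist_item_vecs.sum(axis=0) / np.sqrt(len(hist_item_vecs))
    item_repr = q_i + hist_user_vecs.sum(axis=0) / np.sqrt(len(hist_user_vecs))
    return float(user_repr @ item_repr)

# with hist_user_vecs set to zeros this is an SVD++-style model;
# with p_u set to zeros as well, it is a FISM-style model
```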
5. Experiments
In this section, we conduct extensive experiments on four publicly accessible datasets to answer the following research questions:
RQ1
Do our proposed DNCF methods outperform the state-of-the-art collaborative filtering methods?
RQ2
How do the key hyper-parameter settings influence the performance of our DNCF approaches?
RQ3
Are deeper layers of hidden units helpful for the recommendation performance of DNCF?
RQ4
How is the performance of DNCF impacted by different embedding combination functions?

In what follows, we first describe the experimental settings and then answer the above questions one by one.
Dataset Description.
We evaluate our models on four real-world datasets: MovieLens 1M, Last.FM, AMusic and AToy. The statistics of the four datasets are summarized in Table 1. Following previous work [14], we use the processed versions of the datasets, which are available at https://github.com/familyld/DeepCF.

Table 1: Statistics of the Datasets

Dataset        #Users  #Items  #Interactions  Density
MovieLens 1M    6,040   3,706    1,000,209     4.47%
Last.FM         1,741   2,665       69,149     1.49%
AMusic          1,776  12,929       46,087     0.20%
AToy            3,137  33,953       84,642     0.08%
Evaluation Protocols.
Following [11, 12, 27, 13, 14], we adopt the leave-one-out evaluation, which holds out the latest interaction of each user as the test set and uses the remaining interactions for training. In terms of evaluation metrics, we use the Hit Ratio at rank k (HR@k) [22] and the Normalized Discounted Cumulative Gain at rank k (NDCG@k) [11, 27, 18, 13, 14] to evaluate the quality of the ranked lists generated by our models. HR@k is defined as

$$HR@k = \begin{cases} 1, & \text{if the test item is in the top-}k \text{ list} \\ 0, & \text{otherwise} \end{cases} \quad (20)$$

and NDCG@k is defined as

$$NDCG@k = \frac{1}{\log_2(pos_i + 1)} \quad (21)$$

where $pos_i$ denotes the position of the test item in the ranked recommendation list for the $i$-th hit. Unless otherwise stated, the ranked list is truncated at 10 for both metrics. HR@10 intuitively measures whether the test item is present in the top-10 ranked list, while NDCG@10 illustrates the quality of the ranking by assigning higher scores to hits at top positions [11]. We calculate both metrics for each test user and report the average score.
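A minimal sketch of the two metrics under this protocol (our own helper functions, assuming a 1-based hit position and log base 2 in Equation (21)):

```python
import math

def hit_ratio(ranked_items, test_item, k=10):
    """Equation (20): 1 if the held-out item appears in the top-k list."""
    return int(test_item in ranked_items[:k])

def ndcg(ranked_items, test_item, k=10):
    """Equation (21): 1 / log2(pos + 1), where pos is the 1-based rank
    of the held-out item; 0 if it is not in the top-k list."""
    if test_item in ranked_items[:k]:
        pos = ranked_items.index(test_item) + 1
        return 1.0 / math.log2(pos + 1)
    return 0.0

# example: a test item ranked 2nd gives HR@10 = 1, NDCG@10 = 1/log2(3)
assert hit_ratio([5, 9, 1], 9) == 1
assert abs(ndcg([5, 9, 1], 9) - 1.0 / math.log2(3)) < 1e-12
```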
Baselines. To evaluate the performance of our proposed models, we compare them with the following approaches:

• ItemPop. A non-personalized method that is often used as a benchmark for recommendation tasks. Items are ranked by their popularity, measured by the number of interactions.

• eALS [28]. A state-of-the-art MF method which learns the MF model by optimizing a point-wise regression loss that treats all missing data as negative feedback with a smaller weight (code: https://github.com/hexiangnan/sigir16-eals).

• BiasedMF [29]. This method optimizes a biased MF model with the binary cross-entropy loss to learn from implicit feedback data (code: https://github.com/google-research/google-research/tree/master/dot_vs_learned_similarity).

• GMF [11]. A generalized version of MF which extends MF by introducing a non-linear activation function and allowing varying importance of the latent dimensions (code: https://github.com/hexiangnan/neural_collaborative_filtering).

• MLP [11]. This approach applies the one-hot encoding of users' (items') IDs to represent users (items) and adopts a multi-layer perceptron instead of the fixed inner product to learn the non-linear interactions between users and items.

• NeuMF [11]. A state-of-the-art interaction function learning-based MF model which combines the last hidden layers of GMF and MLP to learn the interaction function based on the binary cross-entropy loss.

• CFNet-ml [14]. An interaction function learning-based CF method which employs historical interactions as the input of the model and feeds them into an MLP to learn the complex interactions between users and items (code: https://github.com/familyld/DeepCF).

• CFNet [14]. A state-of-the-art method which combines the strengths of representation learning-based and matching function learning-based CF methods.

• J-NCF [15]. A state-of-the-art method which applies a joint neural network that couples deep feature learning and deep interaction modeling with a rating matrix. For a fair comparison, we choose the binary cross-entropy loss as the objective function. We employ three layers in the DF network with sizes [256, 128, 64] and two layers in the DI network with sizes [128, 64].

As our proposed methods focus on modeling the relationship between users and items, we mainly compare with user-item models. We do not compare with DELF [13] because its performance is similar to or worse than that of NeuMF.
Parameter Settings.
We implemented our proposed models with Keras (https://keras.io) and TensorFlow; the code will be released publicly upon acceptance. To determine the hyper-parameters of the DNCF methods, we held out the latest interaction of each user in the training set as the validation data and tuned the hyper-parameters on it. We sampled 4 negative instances per positive instance. For DGMF and DMLP, we randomly initialized the model parameters with a Gaussian distribution (mean 0, standard deviation 0.01) and optimized the models with mini-batch Adam [26]. We used a batch size of 256 and a learning rate of 0.001, and the regularization coefficient λ is set to 1e−. The size of the last hidden layer is referred to as the number of predictive factors [11], and we evaluated the factors [8, 16, 32, 64]. Unless specified otherwise, we employed three hidden layers for the MLP. For instance, if the number of predictive factors is 64, the neural collaborative filtering layers follow the structure 256 → 128 → 64 and the embedding size is 64.
Table 2: Performance of HR@10 and NDCG@10 of different methods at predictive factor 64. The best and second-best results are highlighted in bold.

              MovieLens 1M      Last.FM         AMusic          AToy
Methods      HR@10  NDCG@10  HR@10  NDCG@10  HR@10  NDCG@10  HR@10  NDCG@10
ItemPop        –       –       –       –      –       –       –       –
eALS           –       –       –       –      –       –       –       –
BiasedMF     0.7295  0.4492  0.9041    –      –       –       –       –
GMF            –       –       –       –      –       –       –       –
MLP            –       –       –       –      –       –       –       –
NeuMF          –       –       –       –      –       –       –       –
CFNet-ml       –       –       –       –      –       –       –       –
CFNet          –       –       –       –      –       –       –       –
J-NCF          –       –       –       –      –       –       –       –
DGMF           –       –       –       –      –       –       –       –
DMLP           –       –       –       –      –       –       –       –
DNMF         0.7341  0.4531    –       –      –       –       –       –
Figure 2: Evaluation of Top-K item recommendation where K ranges from 1 to 10 on the MovieLens 1M and Last.FM datasets. Panels: (a) MovieLens 1M, HR@K; (b) MovieLens 1M, NDCG@K; (c) Last.FM, HR@K; (d) Last.FM, NDCG@K. Curves: GMF, MLP, CFNet-ml, DGMF, DMLP.
Table 2 shows the performance of HR@10 and NDCG@10 of all compared methods. The best and second-best results are highlighted in bold. For a fair comparison, the number of predictive factors is fixed to 64 for all methods. For eALS and BiasedMF, the number of predictive factors is equal to the number of latent factors. We have the following observations:

• DNMF yields the best performance on all datasets except Last.FM. Specifically, DNMF improves over the strongest baselines by 0.7%, 5.1% and 3.0% on MovieLens 1M, AMusic and AToy, respectively. On the Last.FM dataset, DNMF slightly underperforms BiasedMF and CFNet in terms of HR@10, while it outperforms all baseline methods except CFNet in terms of NDCG@10. This result justifies the effectiveness of our proposed DNCF framework, which models the interactions between users and items based on dual embeddings.

• DMLP outperforms MLP and CFNet-ml by a large margin. Besides, DGMF also demonstrates consistent improvements over GMF. These findings provide empirical evidence for the effectiveness of utilizing historical-interactions-based embeddings to augment the representation. Among the baseline methods, CFNet-ml achieves better performance than MLP on all datasets, which indicates that adopting historical interactions as input yields better representations than IDs. Between DMLP and DGMF, DMLP slightly underperforms DGMF, which is consistent with the results reported in [11].

• J-NCF underperforms CFNet on all datasets. One reason is that the superiority of J-NCF might mainly be attributed to its loss function: in the original paper [15], the authors proposed a hybrid loss function which combines point-wise and pair-wise losses.

We also evaluate the performance of the Top-K recommended lists where the ranking position K ranges from 1 to 10 on MovieLens 1M and Last.FM, as illustrated in Figure 2. To keep the figure clear, we only show MLP, GMF and their variants, i.e., CFNet-ml, DMLP and DGMF. We find that DGMF achieves consistent improvements over GMF across positions on both datasets. Likewise, DMLP outperforms CFNet-ml and MLP at all ranking positions. This demonstrates the advantage of dual embeddings again. Among the baseline methods, MLP underperforms GMF and CFNet-ml on both datasets. On MovieLens 1M, CFNet-ml achieves better performance than GMF on both metrics; however, it underperforms GMF on Last.FM in terms of NDCG.

5.2.1. Utility of Pre-training
To demonstrate the impact of pre-training on DNMF, we compare the performance of two versions of DNMF, with and without pre-training. Different from DNMF with pre-training, DNMF without pre-training is learned with mini-batch Adam from random initializations. The experimental results are provided in Table 3. We find that DNMF with pre-training outperforms DNMF without pre-training on all datasets. The relative improvements of DNMF with pre-training are 2.6%, 1.8%, 9.7% and 5.9% on MovieLens 1M, Last.FM, AMusic and AToy, respectively. This result verifies the utility of the pre-training process for DNMF.
Table 3: Performance of DNMF with/without pre-training at predictive factor 64.

Datasets       Without pre-training   With pre-training
               HR@10     NDCG@10      HR@10    NDCG@10
MovieLens 1M     –          –           –         –
Last.FM          –          –           –         –
AMusic           –          –           –         –
AToy             –          –           –         –
In this section, we study the impact of different hyper-parameter values on the performance of our proposed models.
Fixing the remaining parameter values, we performed a full parameter study for the number of predictive factors. Figure 3 shows the performance of HR@10 and NDCG@10 on MovieLens 1M and Last.FM with respect to the number of predictive factors. The proposed models offer the best performance with 64 predictive factors on both datasets. On MovieLens 1M, the performance of all models increases gradually with the number of predictive factors. On Last.FM, the NDCG@10 of DGMF and DMLP first increases and then decreases. It is worth noting that on MovieLens 1M with small numbers of predictive factors (8 and 16), DGMF underperforms DMLP, while it shows consistent improvements over DMLP on the Last.FM dataset. One possible reason is that DGMF can learn stronger representations than DMLP on the relatively sparse dataset.
To analyze the impact of negative sampling on the DNCF methods, we tested different negative sampling ratios, i.e., the number of negative samples per positive instance. Figure 4 reports the performance of the DNCF methods with respect to different negative sampling ratios on MovieLens 1M and Last.FM. As we can see, employing one negative instance is not enough, and sampling more negative instances is beneficial to recommendation performance. For MovieLens 1M, the best performance is obtained when the negative sampling ratio is set to 10. For Last.FM, the best HR@10 is obtained with a negative sampling ratio of 7, while the best NDCG@10 is obtained with a ratio of 6. To sum up, the optimal number of negative samples per positive instance is around 3 to 7, which is similar to the results reported in [11, 14]. Notice that sampling more negative instances is not always a good idea: it requires not only more time to train the model but also more powerful machines with large memory to store the training data, and it may even degrade performance.
To test the effect of the number of hidden layers in the DNCF approaches, we compared the performance of DMLP with different numbers of hidden layers at 64 predictive factors on the four datasets. The experimental results are provided in Table 4, where Layer-3 denotes the DMLP method with three hidden layers, with similar notations for the others. We find that, to some extent, stacking more non-linear hidden layers is beneficial to recommendation performance. This result demonstrates the effectiveness of using a deep architecture for complex user-item interactions, which is consistent with [11]. To verify the above findings, we further investigated DMLP with different numbers of hidden layers and predictive factors on MovieLens 1M and Last.FM. The results are shown in Figure 5. We observe that in most cases stacking more layers yields better performance, while the relative improvements decrease gradually as the number of predictive factors increases.

Figure 5: Performance of DMLP with different numbers of hidden layers (Layer-0 to Layer-4) on the MovieLens 1M and Last.FM datasets (HR@10 and NDCG@10).
Table 4: Performance of HR@10 and NDCG@10 with different numbers of hidden layers at predictive factor 64.

                 Layer-0  Layer-1  Layer-2  Layer-3  Layer-4
HR@10
MovieLens 1M        –        –        –        –        –
Last.FM             –        –        –        –        –
AMusic              –        –        –        –        –
AToy                –        –        –        –        –
NDCG@10
MovieLens 1M        –        –        –        –        –
Last.FM             –        –        –        –        –
AMusic              –        –        –        –        –
AToy                –        –        –        –        –
We compared four methods for combining the dual embeddings into one vector: element-wise sum, element-wise mean, concatenation and attention. Table 5 shows the experimental results on the four datasets with the number of predictive factors set to 64. We make the following observations: there is no one-size-fits-all method. Element-wise sum performs much better than the other methods on MovieLens 1M. On Last.FM, element-wise sum also outperforms the other methods in terms of HR while underperforming element-wise mean in terms of NDCG. However, concatenation achieves the best performance on both metrics on AMusic, and element-wise mean outperforms the other approaches on AToy. To verify the above conclusion, we further compared the four methods with respect to different numbers of predictive factors on MovieLens 1M and Last.FM.
Figure 3: Performance of HR@10 and NDCG@10 w.r.t. the number of predictive factors (8, 16, 32, 64) for DGMF, DMLP and DNMF on the MovieLens 1M and Last.FM datasets.

Figure 4: Performance of the DNCF methods (DGMF, DMLP, DNMF) w.r.t. the number of negative samples per positive instance on the MovieLens 1M and Last.FM datasets (predictive factor = 64).
Table 5: Performance of different variants of DGMF at predictive factor 64.

                    MovieLens 1M  Last.FM  AMusic  AToy
HR@10
element-wise sum       0.7232     0.8914     –      –
element-wise mean        –          –        –      –
concatenation            –          –        –      –
attention                –          –        –      –
NDCG@10
element-wise sum       0.4440       –        –      –
element-wise mean        –          –        –      –
concatenation            –          –        –      –
attention                –          –        –      –

The results are shown in Figure 6. It can be seen clearly that the optimal method varies with the number of predictive factors on both datasets. In general, the simple summation and mean perform as well as or even better than attention and concatenation; moreover, they do not involve any additional trainable parameters.

Figure 6: Performance of different variants of DGMF (element-wise sum, element-wise mean, concatenation, attention) w.r.t. the number of predictive factors on the MovieLens 1M and Last.FM datasets.
6. Conclusion and Future Work
In this work, we explored dual-embedding based collaborative filtering methods for top-N recommendation. In addition to the primitive user and item embeddings, we obtained additional embeddings for users and items based on their historical interactions from implicit feedback. In other words, we employed the items interacted by a user to enhance the user representation and used the users who once interacted with an item to enrich the item representation. Based on the dual embeddings mentioned above, we devised a general framework, DNCF, and proposed three instantiations: DMLP, DGMF and DNMF. We conducted comprehensive experiments on four real-world datasets, and the corresponding experimental results demonstrated the superior performance of our proposed models compared with other state-of-the-art approaches for the top-N item recommendation task.

In the future, we will study the following problems. First, all historical items (users) of a user (an item) contribute equally to the final history embedding in this work, which is an unrealistic assumption, as reported in [30, 13]. We would therefore like to employ the attention mechanism [25] to distinguish the importance of the interacted items (users) when constructing the historical-interactions-based embedding for each user (item). Second, auxiliary information, such as user reviews [31, 32], item information [33, 34], knowledge bases [35, 36] and social networks [37, 38], can be used to further improve the representations of users and items; richer information usually leads to better performance. Third, we will also try different types of loss functions, for example BPR [8], to learn our models. Finally, Graph Convolutional Networks (GCNs) [39] have attracted considerable research interest, and some recent works [40, 41, 42] have employed GCNs to improve the performance of top-N recommendation. We are also very interested in exploring them to enhance the quality of the embeddings.
Acknowledgment
We would like to thank our anonymous reviewers fortheir helpful comments and valuable suggestions.
References

[1] B. M. Sarwar, G. Karypis, J. A. Konstan, J. Riedl, Item-based collaborative filtering recommendation algorithms, in: Proceedings of the Tenth International World Wide Web Conference, WWW 10, Hong Kong, China, 2001, pp. 285–295. doi:10.1145/371920.372071.
[2] G. Linden, B. Smith, J. York, Amazon.com recommendations: Item-to-item collaborative filtering, IEEE Internet Comput. 7 (1) (2003) 76–80. doi:10.1109/MIC.2003.1167344.
[3] X. He, X. Du, X. Wang, F. Tian, J. Tang, T. Chua, Outer product-based neural collaborative filtering, in: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, Stockholm, Sweden, 2018, pp. 2227–2233. doi:10.24963/ijcai.2018/308.
[4] R. Salakhutdinov, A. Mnih, Probabilistic matrix factorization, in: Advances in Neural Information Processing Systems 20, NIPS 2007, Vancouver, Canada, 2007, pp. 1257–1264.
[5] Y. Koren, R. M. Bell, C. Volinsky, Matrix factorization techniques for recommender systems, Computer 42 (8) (2009) 30–37. doi:10.1109/MC.2009.263.
[6] A. Paterek, Improving regularized singular value decomposition for collaborative filtering, in: Proceedings of KDD Cup and Workshop, 2007.
[7] Y. Koren, Factorization meets the neighborhood: a multifaceted collaborative filtering model, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2008, Las Vegas, NV, USA, 2008, pp. 426–434. doi:10.1145/1401890.1401944.
[8] S. Rendle, C. Freudenthaler, Z. Gantner, L. Schmidt-Thieme, BPR: Bayesian personalized ranking from implicit feedback, in: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI 2009, Montreal, Canada, 2009, pp. 452–461.
[9] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 2016, pp. 770–778. doi:10.1109/CVPR.2016.90.
[10] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2019, pp. 4171–4186. doi:10.18653/v1/n19-1423.
[11] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, T. Chua, Neural collaborative filtering, in: Proceedings of the 26th International Conference on World Wide Web, WWW 2017, Perth, Australia, 2017, pp. 173–182. doi:10.1145/3038912.3052569.
[12] H. Xue, X. Dai, J. Zhang, S. Huang, J. Chen, Deep matrix factorization models for recommender systems, in: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, 2017, pp. 3203–3209. doi:10.24963/ijcai.2017/447.
[13] W. Cheng, Y. Shen, Y. Zhu, L. Huang, DELF: A dual-embedding based deep latent factor model for recommendation, in: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, Stockholm, Sweden, 2018, pp. 3329–3335. doi:10.24963/ijcai.2018/462.
[14] Z. Deng, L. Huang, C. Wang, J. Lai, P. S. Yu, DeepCF: A unified framework of representation learning and matching function learning in recommender system, in: The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, Honolulu, HI, USA, 2019, pp. 61–68. doi:10.1609/aaai.v33i01.330161.
[15] W. Chen, F. Cai, H. Chen, M. de Rijke, Joint neural collaborative filtering for recommender systems, ACM Trans. Inf. Syst. 37 (4) (2019) 39:1–39:30. doi:10.1145/3343117.
[16] R. Pan, Y. Zhou, B. Cao, N. N. Liu, R. M. Lukose, M. Scholz, Q. Yang, One-class collaborative filtering, in: Proceedings of the 8th IEEE International Conference on Data Mining, ICDM 2008, Pisa, Italy, 2008, pp. 502–511. doi:10.1109/ICDM.2008.16.
[17] Y. Hu, Y. Koren, C. Volinsky, Collaborative filtering for implicit feedback datasets, in: Proceedings of the 8th IEEE International Conference on Data Mining, ICDM 2008, Pisa, Italy, 2008, pp. 263–272. doi:10.1109/ICDM.2008.22.
[18] F. Xue, X. He, X. Wang, J. Xu, K. Liu, R. Hong, Deep item-based collaborative filtering for top-n recommendation, ACM Trans. Inf. Syst. 37 (3) (2019) 33:1–33:25. doi:10.1145/3314578.
[19] H. Liu, Z. Wu, X. Zhang, CPLR: Collaborative pairwise learning to rank for personalized recommendation, Knowl. Based Syst. 148 (2018) 31–40. doi:10.1016/j.knosys.2018.02.023.
[20] Y. Shi, M. A. Larson, A. Hanjalic, List-wise learning to rank with matrix factorization for collaborative filtering, in: Proceedings of the 2010 ACM Conference on Recommender Systems, RecSys 2010, Barcelona, Spain, 2010, pp. 269–272. doi:10.1145/1864708.1864764.
[21] Y. Shi, A. Karatzoglou, L. Baltrunas, M. A. Larson, N. Oliver, A. Hanjalic, CLiMF: Learning to maximize reciprocal rank with collaborative less-is-more filtering, in: Sixth ACM Conference on Recommender Systems, RecSys 2012, Dublin, Ireland, 2012, pp. 139–146. doi:10.1145/2365952.2365981.
[22] S. Kabbur, X. Ning, G. Karypis, FISM: Factored item similarity models for top-n recommender systems, in: The 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2013, Chicago, IL, USA, 2013, pp. 659–667. doi:10.1145/2487575.2487589.
[23] X. Wang, X. He, M. Wang, F. Feng, T. Chua, Neural graph collaborative filtering, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2019, Paris, France, 2019, pp. 165–174. doi:10.1145/3331184.3331267.
[24] Y. Zhang, Q. Ai, X. Chen, W. B. Croft, Joint representation learning for top-n recommendation with heterogeneous information sources, in: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM 2017, Singapore, 2017, pp. 1449–1458. doi:10.1145/3132847.3132892.
[25] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, in: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 2015. URL: http://arxiv.org/abs/1409.0473.
[26] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 2015. URL: http://arxiv.org/abs/1412.6980.
[27] T. Bai, J. Wen, J. Zhang, W. X. Zhao, A neural collaborative filtering model with interaction-based neighborhood, in: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM 2017, Singapore, 2017, pp. 1979–1982. doi:10.1145/3132847.3133083.
[28] X. He, H. Zhang, M. Kan, T. Chua, Fast matrix factorization for online recommendation with implicit feedback, in: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016, Pisa, Italy, 2016, pp. 549–558. doi:10.1145/2911451.2911489.
[29] S. Rendle, W. Krichene, L. Zhang, J. R. Anderson, Neural collaborative filtering vs. matrix factorization revisited, in: RecSys 2020: Fourteenth ACM Conference on Recommender Systems, Virtual Event, Brazil, 2020, pp. 240–248. doi:10.1145/3383313.3412488.
[30] X. He, Z. He, J. Song, Z. Liu, Y. Jiang, T. Chua, NAIS: Neural attentive item similarity model for recommendation, IEEE Trans. Knowl. Data Eng. 30 (12) (2018) 2354–2366. doi:10.1109/TKDE.2018.2831682.
[31] L. Zheng, V. Noroozi, P. S. Yu, Joint deep modeling of users and items using reviews for recommendation, in: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, WSDM 2017, Cambridge, United Kingdom, 2017, pp. 425–434. doi:10.1145/3018661.3018665.
[32] D. Liu, J. Li, B. Du, J. Chang, R. Gao, DAML: Dual attention mutual learning between ratings and reviews for item recommendation, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, 2019, pp. 344–352. doi:10.1145/3292500.3330906.
[33] H. Wang, N. Wang, D. Yeung, Collaborative deep learning for recommender systems, in: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2015, Sydney, Australia, 2015, pp. 1235–1244. doi:10.1145/2783258.2783273.
[34] X. Li, J. She, Collaborative variational autoencoder for recommender systems, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2017, Halifax, NS, Canada, 2017, pp. 305–314. doi:10.1145/3097983.3098077.
[35] F. Zhang, N. J. Yuan, D. Lian, X. Xie, W. Ma, Collaborative knowledge base embedding for recommender systems, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2016, San Francisco, CA, USA, 2016, pp. 353–362. doi:10.1145/2939672.2939673.
[36] X. Wang, Y. Xu, X. He, Y. Cao, M. Wang, T. Chua, Reinforced negative sampling over knowledge graph for recommendation, in: WWW '20: The Web Conference 2020, Taipei, Taiwan, 2020, pp. 99–109. doi:10.1145/3366423.3380098.
[37] G. Guo, J. Zhang, N. Yorke-Smith, TrustSVD: Collaborative filtering with both the explicit and implicit influence of user trust and of item ratings, in: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI 2015, Austin, TX, USA, 2015, pp. 123–129.
[38] M. Wang, X. Zheng, Y. Yang, K. Zhang, Collaborative filtering with social exposure: A modular approach to social recommendation, in: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, AAAI 2018, New Orleans, LA, USA, 2018, pp. 2516–2523.
[39] T. N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, in: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, 2017. URL: https://openreview.net/forum?id=SJU4ayYgl.
[40] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, J. Leskovec, Graph convolutional neural networks for web-scale recommender systems, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, 2018, pp. 974–983. doi:10.1145/3219819.3219890.
[41] L. Chen, L. Wu, R. Hong, K. Zhang, M. Wang, Revisiting graph based collaborative filtering: A linear residual graph convolutional network approach, in: The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, New York, NY, USA, 2020, pp. 27–34. URL: https://aaai.org/ojs/index.php/AAAI/article/view/5330.
[42] X. He, K. Deng, X. Wang, Y. Li, Y. Zhang, M. Wang, LightGCN: Simplifying and powering graph convolution network for recommendation, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2020, Virtual Event, China, 2020, pp. 639–648. doi:10.1145/3397271.3401063.