Graph Attention Collaborative Similarity Embedding for Recommender System
Jinbo Song, Chao Chang, Fei Sun, Zhenyang Chen, Guoyong Hu, Peng Jiang
The Institute of Computing Technology of the Chinese Academy of Sciences, China ([email protected])
Beijing Kuaishou Technology Co., Ltd., China ({changchao, chenzhenyang, huguoyong, jiangpeng}@kuaishou.com)
Alibaba Group, China ([email protected])
Abstract.
We present Graph Attention Collaborative Similarity Embedding (GACSE), a new recommendation framework that exploits collaborative information in the user-item bipartite graph for representation learning. Our framework consists of two parts: the first part learns explicit graph collaborative filtering information, such as user-item associations, through embedding propagation with an attention mechanism; the second part learns implicit graph collaborative information, such as user-user and item-item similarities, through an auxiliary loss. We design a new loss function that combines BPR loss with an adaptive margin and a similarity loss for the similarity learning. Extensive experiments on three benchmarks show that our model is consistently better than the latest state-of-the-art models.
Keywords:
Recommendation Systems · Collaborative Filtering · Graph Neural Networks.
Introduction

Personalized recommendation plays a pivotal role in many internet scenarios, such as e-commerce, short-video recommendation, and advertising. Its core task is to analyze a user's potential preferences based on the user's historical behavior, to estimate how likely the user is to select a certain item, and to tailor the recommendation results for that user.

One of the major topics in personalized recommendation is Collaborative Filtering (CF), which generates recommendations by taking advantage of the collective wisdom of all users. Matrix Factorization (MF) [12] is one of the most popular CF models; it decomposes the user-item interaction matrix into user and item vectors and then computes their inner products to predict the edges between users and items. Neural Collaborative Filtering (NCF) [19] predicts the future behavior of users by learning from the historical interactions between users and items; it employs a neural network instead of traditional matrix factorization to enhance the non-linearity of the model. In general, there are two key components in learnable CF models: 1) embeddings, which represent users and items as vectors, and 2) interaction modeling, which formulates historical interactions upon the embeddings.

Despite their prevalence and effectiveness, we argue that these models are not sufficient to learn optimal embeddings. The major limitation is that the embeddings do not explicitly encode the collaborative information propagated in the user-item interaction graph. Following the idea of representation learning in graph embedding, Graph Neural Networks (GNNs) have been proposed to aggregate information from the graph structure, and methods based on GraphSAGE [8] or GAT [23] have been applied to recommender systems.
For example, NGCF [17] generates user and item embeddings based on information propagation in the user-item bipartite graph; KGAT [24] adds a knowledge graph and entity attention on top of the bipartite graph, using entity information to model users and items more effectively. Inspired by the success of GNNs in recommendation, we build an embedding propagation and aggregation architecture based on an attention mechanism to learn a variable weight for each neighbor. The attention weight explicitly represents the relevance of the interaction between a user and an item in the bipartite graph.

Another limitation is that many existing model-based CF algorithms leverage only the user-item associations available in the user-item bipartite graph, so their effectiveness depends on the sparsity of those associations. Therefore, other types of collaborative relations, such as user-user similarity and item-item similarity, can also be considered for embedding learning. Some works [27,25] exploit higher-order proximity among users and items by taking random walks on the graph. A recent work [26] presents Collaborative Similarity Embedding (CSE) to model direct and indirect edges of user-item interactions. The effectiveness of these methods lies in sampling auxiliary information from the graph to augment the data for representation learning.

Based on the above limitations and inspirations, in this paper we propose a unified representation learning framework, called Graph Attention Collaborative Similarity Embedding (GACSE). In this framework, the embedding is learned from direct user-item associations through embedding propagation with an attention mechanism, and from indirect user-user and item-item similarities in the bipartite graph through an auxiliary loss.
Meanwhile, we combine BPR loss [2] with an adaptive margin and a similarity loss to optimize GACSE. The contributions of this work are as follows:

– We propose GACSE, a graph-based recommendation framework that combines attention propagation & aggregation on the graph with a similarity embedding learning process.
– To optimize GACSE, we introduce a new loss function which, to the best of our knowledge, is the first to combine BPR loss with an adaptive margin and a similarity loss for similarity embedding learning.
– We compare our model with state-of-the-art methods and demonstrate its effectiveness through quantitative analysis on three benchmark datasets.
– We conduct a comprehensive ablation study to analyze the contributions of the key components of our proposed model.
Related Work

In this section, we briefly review several lines of work closely related to ours, including general recommendation and graph embedding-based recommendation.
Recommender systems typically use Collaborative Filtering (CF) to model users' preferences based on their interaction histories [3,1]. Among the various CF methods, item-based neighborhood methods [5] estimate a user's preference for an item by measuring its similarity to the items in her/his interaction history using an item-to-item similarity matrix. User-based neighborhood methods find users similar to the current user via a user-to-user similarity matrix, and then recommend the items in those similar users' interaction histories. Matrix Factorization (MF) [12,13] is another popular approach; it projects users and items into a shared vector space and estimates a user's preference for an item by the inner product of their vectors. BPR-MF [2] optimizes matrix factorization on implicit feedback using a pairwise ranking loss. Recently, deep learning techniques have been revolutionizing recommender systems. One line of deep learning based models seeks to replace conventional matrix factorization [7,11,19]. For example, Neural Collaborative Filtering (NCF) estimates user preferences via Multi-Layer Perceptrons (MLPs) instead of the inner product.
Another line integrates distributed representation learning on the user-item interaction graph. GC-MC [10] employs a graph convolution auto-encoder on the user-item graph to solve the matrix completion task. HOP-Rec [15] employs label propagation and random walks on the interaction graph to compute similarity scores for user-item pairs. NGCF [19] explicitly encodes the collaborative information of high-order relations by embedding propagation on the user-item interaction graph. PinSage [16] utilizes efficient random walks and graph convolutions to generate embeddings that incorporate both graph structure and node feature information. Multi-GCCF [9] constructs two separate user-user and item-item graphs and employs a multi-graph encoding layer to integrate the information provided by the user-item, user-user, and item-item graphs.
The Proposed Model

In this section, we introduce our proposed model, GACSE. The overall framework is illustrated in Figure 1. There are three components in the model: (1) an embedding layer that maps users and items from one-hot vectors to initial embeddings; (2) an embedding propagation layer, which consists of two sub-layers: a warm-up layer that propagates and aggregates graph embeddings with equal weights, and an attention layer that uses an attention mechanism to aggregate the embeddings of neighboring nodes with non-equal weights; and (3) a prediction layer that concatenates the embeddings from the embedding layer and the attention layer, then outputs an affinity score between a user and an item. The following descriptions take a user as the central node; unless stated otherwise, they also apply when an item is the central node.
Fig. 1.
An illustration of the GACSE model architecture. The flow of embeddings is indicated by the arrowed lines. FC1 and FC2 share parameters between the user embedding side and the item embedding side. An illustration of attention aggregation is shown on the left: FC-a transforms the concatenated vector into a new vector, and FC-b transforms that vector into an attention score.
Embedding Layer

The embedding layer maps the ids of user $u$ and item $i$ into embedding vectors $e^{(0)}_u \in \mathbb{R}^d$ and $e^{(0)}_i \in \mathbb{R}^d$, where $d$ denotes the embedding dimension. We use a trainable embedding lookup table to build our embedding layer for embedding propagation:

$$E = [\underbrace{e^{(0)}_{u_1}, \cdots, e^{(0)}_{u_N}}_{\text{user embeddings}},\ \underbrace{e^{(0)}_{i_1}, \cdots, e^{(0)}_{i_M}}_{\text{item embeddings}}] \quad (1)$$

where $N$ is the number of users and $M$ is the number of items.

Embedding Propagation Layer

To establish the embedding propagation architecture for collaborative information in the graph, we define an embedding propagation layer. It consists of two parts: (1) a warm-up layer and (2) an attention layer. Both layers have two steps: embedding propagation and aggregation.
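As a minimal sketch of the lookup table in Eq. (1), assuming toy sizes (the experiments later fix the embedding size to 64) and random values standing in for trained parameters:

```python
import numpy as np

# Toy sizes; the real model trains these vectors and uses d = 64.
N, M, d = 4, 5, 8            # number of users, number of items, embedding dim
rng = np.random.default_rng(0)

# One lookup table E: the first N rows hold user embeddings e^(0)_u,
# the next M rows hold item embeddings e^(0)_i (Eq. 1).
E = rng.normal(scale=0.1, size=(N + M, d))

def user_emb(u):
    """Return e^(0)_u for user id u."""
    return E[u]

def item_emb(i):
    """Return e^(0)_i; item rows are stored after the N user rows."""
    return E[N + i]
```

In a real implementation, E would be a trainable parameter matrix (e.g. a TensorFlow variable), indexed by id exactly as above.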
Warm-up Layer
To make the model expand the receptive field and grasp the structure of the user-item bipartite graph, we set up a warm-up layer based on the GNN [8,18] embedding propagation architecture.
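Before detailing the two steps, here is a toy NumPy sketch of one warm-up pass for a single user, covering the equal-weight propagation and aggregation of Eqs. (2)-(4) below; the shapes, degrees, and random values are illustrative assumptions:

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

def warmup_user(e_u, neighbor_items, W, deg_u, item_degs):
    """One warm-up pass for a user node.

    e_u:            e^(0)_u, the central user's embedding
    neighbor_items: e^(0)_i for each first-hop item of the user
    W:              trainable transformation matrix
    deg_u:          |N_u|; item_degs: |N_i| for each neighbor item
    """
    m = W @ e_u                              # self-connection m_{u->u}
    for e_i, d_i in zip(neighbor_items, item_degs):
        pi = 1.0 / np.sqrt(deg_u * d_i)      # GCN-style weight pi_{(u,i)}
        m = m + pi * (W @ e_i)               # message m_{i->u}
    return leaky_relu(m)                     # e^(1)_u

rng = np.random.default_rng(1)
d = 4
W = rng.normal(size=(d, d))
e1_u = warmup_user(rng.normal(size=d), [rng.normal(size=d)] * 2,
                   W, deg_u=2, item_degs=[3, 1])
```

Stacking this over all nodes (with the neighbor sums batched as sparse matrix products) yields the layer output $e^{(1)}$ for every user and item.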
Warm-up Propagation
In the warm-up layer, we set all propagation weights to be equal. We define the warm-up embedding propagation function as:

$$m^{(0)}_{i \to u} = \pi^{(0)}_{(u,i)} W e^{(0)}_i, \qquad m^{(0)}_{u \to u} = W e^{(0)}_u \quad (2)$$

where $W \in \mathbb{R}^{d_1 \times d}$ is a trainable weight matrix that distills important information during embedding propagation, $d$ is the dimension of $e^{(0)}$, and $d_1$ is the dimension of the transformation. $m^{(0)}_{i \to u}$ is the message passed from item $i$ to user $u$, $m^{(0)}_{u \to u}$ is the self-connection of $u$, and $\pi^{(0)}_{(u,i)}$ is the weight of the embedding that $i$ passes to $u$. Inspired by GCN [8], we define the weight for each of a user's interacted items as:

$$\pi^{(0)}_{(u,i)} = \frac{1}{\sqrt{|\mathcal{N}_u||\mathcal{N}_i|}} \quad (3)$$

where $\mathcal{N}_u$ and $\mathcal{N}_i$ denote the first-hop neighboring nodes of user $u$ and item $i$.

Warm-up Aggregation
After receiving the messages from neighboring nodes, we aggregate them. We define the aggregation function in the warm-up layer as:

$$e^{(1)}_u = \sigma\Big(m^{(0)}_{u \to u} + \sum_{i \in \mathcal{N}_u} m^{(0)}_{i \to u}\Big) \quad (4)$$

where $\sigma$ is a nonlinear function such as LeakyReLU. Analogously, we can obtain item $i$'s embedding $e^{(1)}_i$.

Attention Layer
Next, in order to further encode the variable weights of neighbors, we build an embedding propagation and aggregation architecture based on the attention mechanism. The attention mechanism explicitly captures the relevance of the interaction between user and item in the bipartite graph.
Attention Propagation
Intuitively, the importance of each item that interacts with the user should be different, so we introduce an attention mechanism into the embedding passing function:

$$m^{(1)}_{i \to u} = \pi^{(1)}_{(u,i)} e^{(1)}_i, \qquad m^{(1)}_{u \to u} = e^{(1)}_u \quad (5)$$

where $\pi^{(1)}_{(u,i)}$ is the attention weight. Inspired by several kinds of attention score functions, we define the score function of our model as:

$$\mathrm{score}(e^{(1)}_u, e^{(1)}_i) = V^{\top} \tanh\big(P\,[e^{(1)}_u \,\|\, e^{(1)}_i]\big) \quad (6)$$

where $V \in \mathbb{R}^{d_2}$ and $P \in \mathbb{R}^{d_2 \times 2d_1}$ are trainable parameters, $d_2$ is the dimension of the attention transformation, and $\|$ denotes the concatenation operation. After calculating the attention scores, we normalize them into attention weights via the softmax function:

$$\pi^{(1)}_{(u,i)} = \frac{\exp\big(\mathrm{score}(e^{(1)}_u, e^{(1)}_i)\big)}{\sum_{j \in \mathcal{S}_u} \exp\big(\mathrm{score}(e^{(1)}_u, e^{(1)}_j)\big)} \quad (7)$$

where $\mathcal{S}_u$ is the set of user $u$'s one-hop neighboring items sampled in this mini-batch.

Attention Aggregation
After attention message passing, the attention aggregation function is defined as:

$$e^{(2)}_u = \sigma\Big(W_1\big(m^{(1)}_{u \to u} + \sum_{i \in \mathcal{S}_u} m^{(1)}_{i \to u}\big)\Big) + \sigma\Big(W_2\big(m^{(1)}_{u \to u} \odot \sum_{i \in \mathcal{S}_u} m^{(1)}_{i \to u}\big)\Big) \quad (8)$$

where $\sigma$ is the LeakyReLU non-linearity, $W_1, W_2 \in \mathbb{R}^{d_3 \times d_1}$ are trainable parameters, $d_3$ is the dimension of the attention aggregation, and $\odot$ denotes the element-wise product. Similar to NGCF, the aggregated embedding $e^{(2)}_u$ is not only related to $e^{(1)}_i$ but also encodes the interaction between $e^{(1)}_u$ and $e^{(1)}_i$; this interaction information is represented by the element-wise product between $m^{(1)}_{u \to u}$ and $\sum_{i \in \mathcal{S}_u} m^{(1)}_{i \to u}$. Analogously, item $i$'s attention-layer embedding $e^{(2)}_i$ can be obtained. In short, we incorporate the attention mechanism to learn a variable weight $\pi^{(1)}_{(u,i)}$ for each neighbor's propagated embedding $m^{(1)}_{i \to u}$.

Prediction Layer

After embedding passing and aggregation with the attention mechanism, we obtain two different representations $e^{(0)}_u$ and $e^{(2)}_u$ of user node $u$; analogously, for item node $i$ we obtain $e^{(0)}_i$ and $e^{(2)}_i$. We concatenate the two embeddings:

$$e^{*}_u = e^{(0)}_u \,\|\, e^{(2)}_u, \qquad e^{*}_i = e^{(0)}_i \,\|\, e^{(2)}_i \quad (9)$$

where $\|$ denotes the concatenation operation. In this way, we can predict the matching score between a user and an item by the inner product:

$$y_{ui} = {e^{*}_u}^{\top} e^{*}_i \quad (10)$$

More broadly, we can define the matching score between any two nodes $a$ and $b$:

$$y_{ab} = {e^{*}_a}^{\top} e^{*}_b \quad (11)$$

To optimize the GACSE model, we carefully designed our loss function. It consists of two parts: BPR loss with an adaptive margin, and a similarity loss.
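Putting the attention layer and the prediction step together, the following NumPy sketch mirrors Eqs. (5)-(10) for one user. All shapes are toy values with $d = d_1 = d_2 = d_3$, and the random matrices stand in for trained parameters:

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

def attention_weights(e_u, neighbor_embs, V, P):
    """Eqs. (6)-(7): score = V^T tanh(P [e_u || e_i]), softmax over S_u."""
    scores = np.array([V @ np.tanh(P @ np.concatenate([e_u, e_i]))
                       for e_i in neighbor_embs])
    exp = np.exp(scores - scores.max())        # numerically stable softmax
    return exp / exp.sum()

def attention_aggregate(e_u, neighbor_embs, V, P, W1, W2):
    """Eqs. (5) and (8): weighted messages, then sum branch + element-wise branch."""
    pi = attention_weights(e_u, neighbor_embs, V, P)
    m_sum = sum(p * e_i for p, e_i in zip(pi, neighbor_embs))
    return leaky_relu(W1 @ (e_u + m_sum)) + leaky_relu(W2 @ (e_u * m_sum))

def match_score(e0_a, e2_a, e0_b, e2_b):
    """Eqs. (9)-(10): concatenate the two representations, then inner product."""
    return np.concatenate([e0_a, e2_a]) @ np.concatenate([e0_b, e2_b])

rng = np.random.default_rng(2)
d = 4
e_u = rng.normal(size=d)
items = [rng.normal(size=d) for _ in range(3)]
V, P = rng.normal(size=d), rng.normal(size=(d, 2 * d))
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
e2_u = attention_aggregate(e_u, items, V, P, W1, W2)
e2_i = attention_aggregate(items[0], [e_u], V, P, W1, W2)
y = match_score(e_u, e2_u, items[0], e2_i)
```

The same routine with the roles swapped yields the item-side embedding, and concatenation plus inner product gives the score $y_{ui}$.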
BPR Loss with Adaptive Margin

We employ BPR loss for optimization, which considers the relative order between observed and unobserved interactions. To improve the model's discrimination between similar positive and negative samples, we define the BPR loss with adaptive margin as:

$$\mathcal{L}_{BPR} = \frac{1}{|\mathcal{B}|} \sum_{(u,i,j) \in \mathcal{B}} -\sigma\big(y_{ui} - y_{uj} - \max(0, y_{ij})\big) \quad (12)$$

where $\mathcal{B} \subseteq \{(u,i,j) \mid (u,i) \in \mathcal{R}^{+}, (u,j) \in \mathcal{R}^{-}\}$ denotes the sampled mini-batch data, $\mathcal{R}^{+}$ denotes the observed interactions, $\mathcal{R}^{-}$ the unobserved interactions, and $\sigma$ is the softplus function. The adaptive margin $\max(0, y_{ij})$ means that the more similar a node's positive and negative samples are, the larger the margin of the loss function becomes.

Similarity Loss

Other types of collaborative relations, such as user-user similarity and item-item similarity in the graph, can also be considered for embedding learning. Introducing a similarity loss for both user-user and item-item pairs can reduce the sparsity problem by augmenting the data for representation learning. In this paper, the 2-order neighborhood proximity of a pair of users (or items) is defined as their similarity.
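A NumPy sketch of the adaptive-margin BPR term in Eq. (12). One assumption is made explicit here: the $-\sigma(\cdot)$ with $\sigma$ = softplus is implemented in the standard pairwise form $\mathrm{softplus}(-(\cdot)) = -\ln \mathrm{sigmoid}(\cdot)$, which is the quantity pairwise ranking losses normally minimize. The score arrays are hypothetical outputs of the prediction layer:

```python
import numpy as np

def softplus(x):
    # numerically stable log(1 + exp(x))
    return np.maximum(x, 0) + np.log1p(np.exp(-np.abs(x)))

def bpr_adaptive_margin(y_ui, y_uj, y_ij):
    """BPR loss with adaptive margin, Eq. (12).

    y_ui: scores of positive (observed) pairs
    y_uj: scores of negative (unobserved) pairs
    y_ij: matching scores between each positive item i and negative item j;
          similar i and j imply a larger enforced margin max(0, y_ij).
    """
    margin = np.maximum(0.0, y_ij)
    return float(np.mean(softplus(-(y_ui - y_uj - margin))))

# A well-separated pair incurs almost no loss; a similar (i, j) pair raises it.
lo = bpr_adaptive_margin(np.array([5.0]), np.array([-5.0]), np.array([0.0]))
hi = bpr_adaptive_margin(np.array([1.0]), np.array([0.0]), np.array([2.0]))
```

The two sample calls show the intended behavior: a large positive-negative gap gives near-zero loss, while a large item-item similarity $y_{ij}$ enforces a larger margin and raises the loss.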
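As an illustration of 2-order neighborhood proximity, positive context pairs can be gathered with short random walks on the bipartite graph. This is a toy sketch: the adjacency-list format and the exact sampling scheme are assumptions for illustration, not the authors' sampler:

```python
import random

def two_order_positives(adj_u2i, adj_i2u, user, n_samples=5, seed=0):
    """Sample 2-order-proximity positives for `user` via 2-step walks
    user -> item -> user'; every user' reached this way shares at least
    one interacted item with `user`, i.e. is a 2-order neighbor."""
    rng = random.Random(seed)
    positives = []
    for _ in range(n_samples):
        items = adj_u2i.get(user, [])
        if not items:
            break
        item = rng.choice(items)                      # step 1: user -> item
        positives.append(rng.choice(adj_i2u[item]))   # step 2: item -> user'
    return positives
```

The same walk on the item side (item -> user -> item') yields item-item positives; negative pairs then come from ordinary negative sampling.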
In order to prevent the similarity loss from affecting the embeddings used in embedding propagation, we compute it only between $E$ and context mapping embedding matrices $E^{UC}$ and $E^{IC}$ for users and items, respectively. The context mapping embedding matrices are defined as:

$$E^{UC} = [e^{UC}_{u_1}, \cdots, e^{UC}_{u_N}], \qquad E^{IC} = [e^{IC}_{i_1}, \cdots, e^{IC}_{i_M}] \quad (13)$$

It should be noted that the dimensions of the embeddings in $E$, $E^{UC}$, and $E^{IC}$ are equal. The similarity loss for $e^{(0)}$ with context embeddings $e^{UC} \in E^{UC}$ and $e^{IC} \in E^{IC}$ is defined as:

$$\mathcal{L}_{similarity} = -\sum \log\sigma\big({e^{(0)}_u}^{\top} e^{UC}_{u\text{-}pos}\big) + \sum \log\sigma\big({e^{(0)}_u}^{\top} e^{UC}_{u\text{-}neg}\big) - \sum \log\sigma\big({e^{(0)}_i}^{\top} e^{IC}_{i\text{-}pos}\big) + \sum \log\sigma\big({e^{(0)}_i}^{\top} e^{IC}_{i\text{-}neg}\big) \quad (14)$$

where $\sigma$ is the sigmoid function, $e^{UC}_{u\text{-}pos}$ and $e^{UC}_{u\text{-}neg}$ are a positive and a negative sample of user $u$, respectively, and $e^{IC}_{i\text{-}pos}$ and $e^{IC}_{i\text{-}neg}$ are a positive and a negative sample of item $i$. We employ random walks and negative sampling to construct positive and negative sample pairs for the similarity loss.

Finally, we obtain the overall loss function:

$$\mathcal{L} = \mathcal{L}_{BPR} + \lambda_1 \mathcal{L}_{similarity} + \lambda_2 \|\Theta\|^2_2 \quad (15)$$

where $\Theta = \{E, E^{UC}, E^{IC}, W, W_1, W_2, V, P\}$, $\lambda_1$ controls the strength of the similarity loss relative to the BPR loss, and $\lambda_2$ controls the L2 regularization strength to prevent overfitting. We use mini-batch Adam [28] to optimize the model and update its parameters.

Experiments

We evaluate the proposed model on three real-world representative datasets: Gowalla (https://snap.stanford.edu/data/loc-gowalla.htm), Yelp2018, and Amazon-Book (http://jmcauley.ucsd.edu/data/amazon/). These datasets vary significantly in domain and sparsity. The statistics of the datasets are summarized in Table 1. For each dataset, the training set is constructed from 80% of each user's historical interactions, and the remainder is used as the test set. We randomly select
10% of the interactions in the training set as a validation set to tune hyper-parameters. We employ a negative sampling strategy to produce, for each observed user-item interaction (treated as a positive instance), one negative item that the user has not interacted with before. To ensure the quality of the datasets, we use the 10-core setting, i.e., we retain only users and items with at least ten interactions.
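The 10-core setting can be implemented as an iterative filter, since dropping one node can push another below the threshold; a small sketch (k is a parameter, with k = 10 in the paper):

```python
from collections import Counter

def k_core_filter(interactions, k=10):
    """Repeatedly drop users/items with fewer than k interactions until stable.

    interactions: iterable of (user, item) pairs; returns the surviving set.
    """
    pairs = set(interactions)
    while True:
        u_cnt = Counter(u for u, _ in pairs)
        i_cnt = Counter(i for _, i in pairs)
        kept = {(u, i) for u, i in pairs if u_cnt[u] >= k and i_cnt[i] >= k}
        if kept == pairs:
            return pairs
        pairs = kept
```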
Table 1. Statistics of the datasets (Gowalla, Yelp2018, Amazon-Book).
To evaluate the effectiveness on the top-K recommendation task, we adopt Recall@K and NDCG@K, which have been widely used in [19,17]. In this paper, 1) we set K = 20; 2) all items that a user has not interacted with are treated as negative items; 3) each method scores all items in descending order, excluding the positive items used in the training set. The metrics averaged over all users in the test set are used for evaluation.

To verify the effectiveness of our approach, we compare it with the following baselines:

– BPR-MF [2] optimizes matrix factorization on implicit feedback using a pairwise ranking loss.
– NCF [19] learns user and item embeddings from user-item interactions in a matrix factorization framework, replacing the inner product with an MLP.
– PinSage [16] combines efficient random walks and graph convolutions to generate node embeddings that incorporate both graph structure and node feature information.
– GC-MC [10] is a graph auto-encoder framework based on differentiable message passing on the bipartite interaction graph. The auto-encoder produces latent user and item representations, which are used to reconstruct the rating links through a bilinear decoder.
– NGCF [17] explicitly encodes the collaborative signal of high-order relations by embedding propagation on the user-item interaction graph.
– Multi-GCCF [9] constructs two separate user-user and item-item graphs and employs a multi-graph encoding layer to integrate the information provided by the user-item, user-user, and item-item graphs.
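For reference, Recall@K and NDCG@K for a single user can be computed as follows; this is a common formulation, and the cited papers may differ in minor details such as how the ideal DCG is truncated:

```python
import numpy as np

def recall_ndcg_at_k(ranked_items, test_items, k=20):
    """Per-user Recall@K and NDCG@K.

    ranked_items: all candidate items sorted by predicted score (descending),
                  with training positives already excluded
    test_items:   set of held-out ground-truth items for this user
    """
    topk = ranked_items[:k]
    hits = [1.0 if item in test_items else 0.0 for item in topk]
    recall = sum(hits) / len(test_items)
    dcg = sum(h / np.log2(rank + 2) for rank, h in enumerate(hits))
    idcg = sum(1.0 / np.log2(rank + 2) for rank in range(min(len(test_items), k)))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return recall, ndcg
```

The reported numbers are these per-user values averaged over all test users, with K = 20.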
We implement GACSE with TensorFlow. The embedding size is fixed to 64 for all models. All models are optimized with the Adam optimizer, and the batch size is fixed at 1024. The learning rate of our model was set to 0.0001; $\lambda_1$ was set to $1 \times 10^{-}$ and $\lambda_2$ to $1 \times 10^{-}$. The number of sampled neighbors was set to 64, and the number of positive and negative samples for the similarity loss was set to 5. All hyper-parameters of the above baselines either follow the suggestions of the methods' authors or were tuned on the validation sets. We report the results of each baseline under its optimal hyper-parameter settings.

             Gowalla          Yelp2018         Amazon-Book
             Recall   NDCG    Recall   NDCG    Recall   NDCG
BPR-MF       0.1291   0.1878  0.0494   0.0662  0.0250   0.0518
NCF          0.1326   0.1985  0.0513   0.0719  0.0253   0.0535
PinSage      0.1380   0.1947  0.0612   0.0750  0.0283   0.0545
GC-MC        0.1395   0.1960  0.0597   0.0741  0.0288   0.0551
NGCF         0.1547   0.2237  0.0581   0.0719  0.0344   0.0630
Multi-GCCF   0.1595   0.2126  0.0667   0.0810  0.0363   0.0656
GACSE        0.1654   0.2328  0.0672   0.0836  0.0386   0.0703
%Improv.     3.70%    4.06%   0.75%    3.21%   6.34%    7.16%

Table 2. Overall performance comparison. Bold scores are the best in each column, while underlined scores are the second best. Improvements are statistically significant.
Overall Comparison
Table 2 summarizes the best results of all models on the three benchmark datasets. The last row shows the improvement of GACSE relative to the best baseline.

BPR-MF gives the worst performance on all datasets, since the inner product cannot capture complex collaborative signals. NCF consistently outperforms BPR-MF on all datasets; its main improvement is that an MLP can model nonlinear feature interactions between user and item embeddings.

Among all the baseline methods, graph-based methods (e.g., PinSage, GC-MC, NGCF, Multi-GCCF) consistently outperform general methods (e.g., BPR-MF, NCF) on all datasets. The main improvement is that graph-based models explicitly exploit the graph structure in embedding learning.

Multi-GCCF is the strongest baseline; it outperforms the other baselines on all datasets except for NDCG on Gowalla, where NGCF performs best. Both employ embedding propagation to obtain neighbors' information and stack multiple embedding propagation layers to explore high-order connectivity, which verifies the importance of capturing the collaborative signal in the embedding function. Moreover, Multi-GCCF compared three different multi-grained representation fusion methods.

According to the results, GACSE performs best on all three datasets in terms of all evaluation metrics. It improves over the best baseline by 3.70%, 0.75%, and 6.34% in terms of Recall on Gowalla, Yelp2018, and Amazon-Book, and gains 4.06%, 3.21%, and 7.16% NDCG improvements on Gowalla, Yelp2018, and Amazon-Book, respectively. Compared with Multi-GCCF and NGCF, GACSE builds an embedding propagation and aggregation architecture based on the attention mechanism, which enables GACSE to learn variable embedding propagation weights for neighbors explicitly. Meanwhile, it obtains high-order implicit collaborative information between user-user and item-item pairs through the similarity loss.

For reproducibility, we share the source code of GACSE online: https://github.com/GACSE/GACSE.git
            Gowalla              Yelp2018
            Recall    NDCG       Recall    NDCG
GACSE       0.1654    0.2328     0.0672    0.0836
GACSE-sl    0.1632    0.2334     0.0641    0.0805
            (-1.33%)  (+0.26%)   (-4.61%)  (-3.71%)
GACSE-am    0.1468    0.2149     0.0571    0.0728
            (-11.25%) (-7.68%)   (-15.03%) (-12.92%)

Table 3. Ablation studies of GACSE. GACSE-sl denotes GACSE without the similarity loss; GACSE-am denotes GACSE without the adaptive margin.
Ablation Analysis
Table 3 reports the influence of the similarity loss and the adaptive margin of GACSE on the Gowalla and Yelp2018 datasets. As expected, the performance degrades considerably after removing the adaptive margin or the similarity loss, which confirms their importance for embedding learning. The adaptive margin improves the model's discrimination between positive and negative samples with similar embeddings. The similarity loss for both user-user and item-item pairs reduces the sparsity problem by augmenting the data for representation learning. Together, they enhance the effectiveness of the attention mechanism for embedding propagation and aggregation.
Test Performance w.r.t. Epoch
Figures 2-5 show the test performance w.r.t. Recall@20 and NDCG@20 at each epoch of MF and NGCF on Gowalla and Yelp. We can see that NGCF
Fig. 2. Recall@20 on Gowalla
Fig. 3. NDCG@20 on Gowalla
Fig. 4. Recall@20 on Yelp
Fig. 5. NDCG@20 on Yelp

exhibits faster convergence than MF on both datasets. This is reasonable, since indirectly connected users and items are involved when optimizing the interaction pairs in a mini-batch. This observation demonstrates the better model capacity of NGCF and the effectiveness of performing embedding propagation in the embedding space.
Conclusion

In this work, we explicitly incorporated the collaborative signal and indirect similarities into the embedding function. We proposed a unified representation learning framework, GACSE, in which the embedding is learned from direct user-item interactions through attention propagation, and from indirect user-user and item-item similarities in the bipartite graph through an auxiliary loss. In addition, we combine BPR loss with an adaptive margin and a similarity loss to optimize GACSE. Extensive experimental results on three real-world datasets show that our model outperforms state-of-the-art baselines. Several directions remain to be explored. A valuable direction is to incorporate rich side information into GACSE instead of modeling only user and item ids.
Another interesting direction for future work would be to explore multi-task and multi-objective embedding learning on heterogeneous graphs for recommender systems.