A Unified Model for Recommendation with Selective Neighborhood Modeling
Jingwei Ma (a), Jiahui Wen (b), Panpan Zhang (a), Mingyang Zhong (c,∗), Guangda Zhang (b), Xue Li (d)

(a) Shandong Normal University, Jinan, China
(b) National Innovative Institute of Defense Technology, Beijing, China
(c) The University of Queensland, Brisbane, Australia
(d) Neusoft Education Technology Group, Dalian, China

∗ Corresponding author.
Email addresses: [email protected] (Jingwei Ma), [email protected] (Jiahui Wen), [email protected] (Panpan Zhang), [email protected] (Mingyang Zhong), [email protected] (Guangda Zhang), [email protected] (Xue Li)
Abstract
Neighborhood-based recommenders are a major class of Collaborative Filtering (CF) models. The intuition is to exploit neighbors with similar preferences to bridge unseen user-item pairs and alleviate data sparseness. Many existing works propose neural attention networks that aggregate neighbors and place higher weights on specific subsets of users for recommendation. However, the neighborhood information is not always informative, and noise in the neighborhood can negatively affect model performance. To address this issue, we propose a novel neighborhood-based recommender, where a hybrid gated network is designed to automatically separate similar neighbors from dissimilar (noisy) ones, and to aggregate the similar neighbors into neighborhood representations. The confidence in the neighborhood is also addressed: higher weights are placed on the neighborhood representations when we are confident in the neighborhood information, and vice versa. In addition, a user-neighbor component is proposed to explicitly regularize user-neighbor proximity in the latent space. These two components are combined into a unified model and complement each other for the recommendation task. Extensive experiments on three publicly available datasets show that the proposed model consistently outperforms state-of-the-art neighborhood-based recommenders. We also study different variants of the proposed model to justify the underlying intuition of the proposed hybrid gated network and user-neighbor modeling components.
Keywords: recommendation; gated network; similar neighbors
1. Introduction
With the prevalence of the Internet, online services generate massive amounts of data on a daily basis, and users face the problem of information overload when looking for items of interest (e.g. books, movies). To address this problem, various recommendation techniques have been proposed to quickly surface relevant information for users. Model-based collaborative filtering is one of the most widely employed techniques for recommendation. However, it suffers from data sparseness [1], as most users give only a few ratings to the items, making it difficult to capture user preferences based mainly on the interaction data.
[email protected] (Panpan Zhang), [email protected] (Mingyang Zhong), [email protected] (Guangda Zhang), [email protected] (Xue Li)
To mitigate data sparseness, many works have proposed to incorporate auxiliary information [2] [3] [4] [5] [6], such as texts and images, for recommendation. Among these sources of extra knowledge, neighborhood information has gained increasing popularity, as neighborhood-based approaches [7] [8] [9] [10] [11] [12] are a major class of collaborative filtering models. The intuition of the neighborhood-based approaches is that neighboring users usually share similar preferences, and those neighbors can be exploited to bridge unseen user-item pairs and mitigate data sparseness [13] [14] [15]. Neighborhood methods capture localized semantics among users, which complements the overall global structure between users and items captured by latent factor models [12]. The complementary advantages of neighborhood models and latent factor models have led to hybrid models such as SVD++ [16], which joins neighborhood-based models and latent factor models to boost recommendation performance.

Recently, deep learning techniques such as the attention mechanism have found wide application in many research areas such as computer vision [17], question answering [18] and machine translation [19]. Previous works [12] [11] [20] have demonstrated noticeable advantages by integrating the attention mechanism into neighborhood models for identifying similar users. The basic idea behind the neural attention mechanism is to place higher weights on specific subsets of users in the neighborhood who share similar preferences, since not all neighbors are equally informative. For example, Ebesu et al. [12] propose a neural attention mechanism to learn a user-item specific neighborhood, and integrate the neighborhood component with a latent factor model to simultaneously capture global user-item relations and local neighborhood-based structure. In [20] and [11], the authors aggregate neighbors in a weighted manner with an attention mechanism. However, those models have several drawbacks. First, they simply aggregate all the neighbors and ignore noise in the neighborhood, which may negatively affect model performance; especially when the data is sparse and only a few neighbors are present, the aggregation of those neighbors has a great influence on the model performance. Second, they fail to model personalized neighborhood influence, and ignore how informative the neighborhood is for different users: some users may highly depend on their neighbors for capturing preferences, while others may not rely on their neighbors for recommendation. Finally, most of those works neglect the semantic proximity between the users and their neighbors, leading to inefficient learning of user-neighbor compatibility.

To this end, we propose a novel recommendation model (SNM for short) that seamlessly integrates selective neighborhood modeling and user-neighbor proximity preserving. One core component of SNM is a hybrid gated network that addresses neighborhood noise by identifying similar users in the neighborhood, and distilling these similar neighbors to produce a neighborhood representation for each user.
Then, the hybrid gated network further suppresses neighborhood noise by pooling a user representation and his/her neighborhood representation into a unified vector for predicting the ranking score. We also propose to explicitly model user-neighbor similarities, and to preserve user-neighbor proximity by predicting users from their selected similar neighbors. Therefore, we can align the most informative neighbors for learning compact user representations. We integrate these two components into a unified model, so that they can be jointly learned and complement each other for the recommendation task.

The rationale of the proposed model is to capture personalized neighborhood influence for recommendation, as neighborhood information may influence the decision-making process differently for different user-item pairs. With the hybrid gated network, we can control the information flow from the neighborhood based on the confidence that we have in the separation between similar and dissimilar neighbors. Therefore, we can not only select the most similar neighbors, but also model the credibility of the neighborhood information for recommendation. The contributions of this work can be summarized as follows:

• We propose a novel neighborhood-based recommendation model. The core component of the model is a hybrid gated network that is robust to neighborhood noise when exploiting neighborhood information for recommendation. It first automatically separates similar neighbors from dissimilar ones with a thresholding mechanism, and only aggregates the similar neighbors to produce the neighborhood representations. It then further filters out noise in the neighborhood by pooling representations of users and their neighborhoods while considering the confidence level of the neighborhood information. Therefore, we are able to select the most informative neighbors and encode the credibility of neighborhood information for recommendation.

• We explicitly preserve user-neighbor proximity for learning compact user representations. The user-neighbor similarities are captured by predicting users from their neighbors. Since the neighborhood representations are parameterized by the users, the neighbors and the target items, user representations are learned by attending to the informative neighbors and are specialized for the recommendation task. We integrate the hybrid gated network and user-neighbor proximity components into a unified model, where they mutually complement and reinforce each other to enhance recommendation performance.

• We validate the effectiveness of the proposed model on three publicly available datasets, and demonstrate its advantage over state-of-the-art models. We also study different variants of the proposed model to justify the intuitions underlying each of its components.
2. Preliminaries
In a recommendation problem, we have a user set U = {u_1, u_2, ..., u_M} and an item set V = {v_1, v_2, ..., v_N}, where M and N are the numbers of users and items respectively. The interactions between the users and items can be denoted by a rating matrix R ∈ {0, 1}^{M×N}, where an entry r_ij = 1 in R means that user u_i has positively rated item v_j. As we focus on implicit feedback recommendation in this work, the missing entries (i.e. r_ij = 0) are viewed as unobserved records, and they need to be predicted. Similar to [12], we denote N(v_j) as the set of all users (the neighborhood) that have provided implicit feedback for item v_j.

Given the rating matrix and the neighborhood information, the task of this work is to jointly learn user/item representations (e.g. u_i, v_j), predict the missing values in R, and recommend items with high predicted values (i.e. r̂_ij) for each user.

Recent works [21] [22] [23] employ deep neural networks to model user-item interactions in depth. Specifically, given the latent vectors u_i and v_j of a user-item pair, we concatenate them into a vector, pass it through a multi-layer perceptron, and produce a user-item interaction vector z_ij, which is later used for predicting the ranking score r̂_ij:

$$ \mathbf{z}_{ij} = \phi_L(\cdots\phi_2(\phi_1(\mathbf{z}_0))\cdots), \qquad \phi_l = \sigma_l(\mathbf{W}_l^T \mathbf{z}_{l-1} + \mathbf{b}_l),\; l \in [1, L], \qquad \mathbf{z}_0 = [\mathbf{u}_i; \mathbf{v}_j; \mathbf{u}_i \circ \mathbf{v}_j] \tag{1} $$

where [;] and ◦ denote the concatenation operation and the element-wise multiplication respectively, φ_l is the l-th neural network layer, and σ_l, W_l, b_l are the corresponding activation function, weight matrix and bias vector.
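To make the interaction module concrete, the following is a minimal NumPy sketch of Eq. (1). The layer sizes, the choice of ReLU for σ_l, and all names are illustrative assumptions; the paper only fixes the input z_0 and the layered form, and a final scalar projection of z_ij (not shown) would produce the ranking score r̂_ij.

    import numpy as np

    def mlp_interaction(u_i, v_j, weights, biases):
        """Sketch of Eq. (1): interaction vector z_ij for a user-item pair.

        u_i, v_j : latent vectors of user u_i and item v_j, shape (d,).
        weights, biases : per-layer parameters; sizes are illustrative.
        """
        z = np.concatenate([u_i, v_j, u_i * v_j])   # z_0 = [u_i; v_j; u_i ∘ v_j]
        for W, b in zip(weights, biases):
            z = np.maximum(0.0, W.T @ z + b)        # σ_l taken to be ReLU (assumption)
        return z                                    # interaction vector z_ij

    # Toy usage with random parameters (dimensions are assumptions, not the paper's).
    d = 8
    rng = np.random.default_rng(0)
    u_i, v_j = rng.normal(size=d), rng.normal(size=d)
    sizes = [3 * d, 16, 8]                          # pyramid of shrinking layers
    weights = [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
    biases = [np.zeros(n) for n in sizes[1:]]
    z_ij = mlp_interaction(u_i, v_j, weights, biases)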
3. The Proposed Model
3.1. Overview

The overview of the proposed model for estimating a ranking score r̂_ij is illustrated in Fig. 1. As shown in the figure, we are given the embeddings u_i and v_j of a user-item pair, and the corresponding embeddings {u_1, ..., u_t} of the neighbors N(v_j). We first calculate the relevance scores between the user and his/her neighbors. Based on the relevance scores, we propose a hybrid gated network [13] to filter out the dissimilar neighbors and aggregate the similar neighbors into a neighborhood representation p_i. After that, the hybrid gated network pools u_i and p_i into a unified neighborhood-based user representation h_i. The first gate g is learned from the data and parameterized by u_i and p_i, while the second gate f(θ_i) is based on the relevance scores and encodes domain knowledge. Finally, h_i and v_j are input to a multilayer neural network to derive the ranking score r̂_ij. In addition, we propose to learn a user-neighbor hierarchy for capturing the compatibility between users and their neighbors: the neighborhood representation p_i is used to predict the user embedding u_i, so that the relations between users and their neighbors are well preserved.

3.2. Attentive Neighbor Selection

The underlying rationale of neighborhood attention is that users with similar rating behaviors are more likely to provide similar implicit feedback to a given item. Due to the sparsity of recommendation data, those neighbors can be exploited to gain additional insight into the existing user-item relations.
Figure 1: Overview of the proposed model.

As not all the neighbors are equally informative, we propose an attention mechanism to capture the most informative neighbors for neighborhood modeling:

$$ \mathbf{p}_i = \sum_{u_t \in N(v_j)} \alpha_t \mathbf{u}_t \tag{2} $$

where p_i is the neighborhood representation of user u_i, and α_t is the attention score assigned to neighbor u_t for composing p_i; it is parameterized by the interactions among u_i, u_t and v_j:

$$ \beta_t = \mathbf{v}^T \tanh\big(\mathbf{W}_{ut}^T(\mathbf{u}_i \circ \mathbf{u}_t) + \mathbf{W}_{tj}^T(\mathbf{u}_t \circ \mathbf{v}_j) + \mathbf{b}_u\big), \qquad \alpha_t = \frac{\exp(\beta_t)}{\sum_{u_t \in N(v_j)} \exp(\beta_t)} \tag{3} $$

where the matrices W_ut, W_tj and the vectors v, b_u are model parameters. The rationale is as follows: u_i ◦ u_t models the similarity of rating behaviors between the user u_i and his/her neighbors who have rated item v_j, while u_t ◦ v_j captures the preferences of the neighbors over the target item v_j. Therefore, a neighbor receives a higher relevance score β_t if it has a rating history similar to that of the target user u_i and, at the same time, strongly supports the recommendation of item v_j.
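A vectorized sketch of Eqs. (2)-(3) may help. The parameter shapes (W_ut, W_tj of shape (d, h), b_u and v of shape (h,)) are our reading of the equations, and the max-subtraction in the softmax is a standard numerical-stability trick rather than part of the paper.

    import numpy as np

    def neighbor_attention(u_i, v_j, neighbors, W_ut, W_tj, b_u, v):
        """Sketch of Eqs. (2)-(3): attentive aggregation over N(v_j).

        neighbors : array of shape (t, d), one row per neighbor u_t in N(v_j).
        Returns the neighborhood representation p_i, the relevance scores β,
        and the attention weights α.
        """
        # Relevance score β_t for each neighbor (Eq. 3), computed row-wise.
        hidden = np.tanh((u_i * neighbors) @ W_ut + (neighbors * v_j) @ W_tj + b_u)
        beta = hidden @ v
        alpha = np.exp(beta - beta.max())       # max-subtraction for stability
        alpha = alpha / alpha.sum()             # softmax over the neighborhood
        p_i = alpha @ neighbors                 # weighted sum of neighbors (Eq. 2)
        return p_i, beta, alpha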
Thresholding Mechanism. Even though the attention mechanism places higher weights on specific users in the neighborhood, simply aggregating all the neighbors inevitably introduces noise and weakens the impact of informative neighbors. To address this problem, we employ a thresholding mechanism [13] to filter out noisy neighbors, and to control the information flow from the neighborhood representation into the calculation of the ranking score.

With the thresholding mechanism, the calculation of the attention scores can be reformulated as follows:

$$ \alpha_t = \frac{I(\beta_t, \theta_i)\exp(\beta_t)}{\sum_{u_t \in N(v_j)} I(\beta_t, \theta_i)\exp(\beta_t)} \tag{4} $$

where I(β_t, θ_i) is an indicator function used to filter out the neighbors whose relevance scores β_t with the target user u_i are lower than a threshold θ_i. The indicator function is defined as:

$$ I(\beta_t, \theta_i) = \begin{cases} 1 & \beta_t > \theta_i \\ 0 & \beta_t \le \theta_i \end{cases} \tag{5} $$

where the user-specific threshold θ_i is not a predefined constant in this work, but is left for the proposed model to learn. Therefore, a neighbor u_t is not selected for calculating the neighborhood representation if its relevance score with the target user is below the threshold θ_i.

3.3. Hybrid Gate Prediction

The neighborhood representation can be incorporated for predicting the ranking scores, as it captures localized user-item relations [12] and complements the global user-item interactions described in Section 2. In this work, for a given user u_i and item v_j, we propose a hybrid gated network to select between the user representation u_i and its neighborhood representation p_i, and to produce a unified neighborhood-based user representation h_i:

$$ \mathbf{g} = \sigma(\mathbf{W}_{g1}^T \mathbf{u}_i + \mathbf{W}_{g2}^T \mathbf{p}_i + \mathbf{b}_g), \qquad f(\theta_i) = \sigma\big((t_s - \theta_i)(\theta_i - t_d)\big) - 0.5, \qquad \mathbf{h}_i = (1 - f(\theta_i)) \cdot \mathbf{g} \circ \mathbf{u}_i + f(\theta_i) \cdot (1 - \mathbf{g}) \circ \mathbf{p}_i \tag{6} $$

where the matrices W_g1, W_g2 and the vector b_g are model parameters, and σ(x) = 1/(1 + exp(−x)) is the sigmoid function that limits the output to the range [0, 1]. The predictive vector h_i is a hybrid combination of the user representation u_i and its neighborhood representation p_i through two gates, g and f(θ_i). The first gate g is parameterized by u_i and p_i, and is automatically learned from the training data, while the second gate f(θ_i) encodes the domain knowledge described below.

The basic idea behind f(θ_i) is that it quantifies the degree of separation between similar and dissimilar neighbors. Specifically, t_s and t_d are the averages of the relevance scores of the similar neighbors (i.e. neighbors whose relevance scores exceed θ_i) and of the dissimilar neighbors (i.e. neighbors whose relevance scores fall below θ_i), respectively. If t_s and t_d are both close to θ_i, then f(θ_i) is close to 0, indicating a small difference between the similar and dissimilar neighbors. In this case, there is high uncertainty in the neighborhood and we have low confidence in the neighborhood preferences, hence a lower weight is given to the neighborhood representation p_i.
On the contrary, f(θ_i) is close to 0.5 if θ_i provides a large degree of separation between the similar and dissimilar neighbors; in that case we have high confidence in the neighborhood and distribute comparable weights to u_i and p_i when composing the unified representation.
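The following sketch ties Eqs. (4)-(6) together. It assumes β and θ_i come from the attention sketch above; the fallback behavior when every neighbor falls on one side of the threshold is our assumption, since the paper does not spell out that corner case.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def thresholded_neighborhood(beta, neighbors, theta_i):
        """Eqs. (4)-(5): softmax restricted to neighbors with β_t > θ_i."""
        keep = beta > theta_i                      # indicator I(β_t, θ_i)
        w = np.where(keep, np.exp(beta - beta.max()), 0.0)
        if w.sum() == 0.0:                         # no neighbor selected (assumed fallback)
            return np.zeros(neighbors.shape[1]), keep
        return (w / w.sum()) @ neighbors, keep     # p_i over the selected neighbors

    def hybrid_gate(u_i, p_i, beta, keep, theta_i, W_g1, W_g2, b_g):
        """Eq. (6): fuse u_i and p_i through the learned gate g and confidence gate f."""
        if keep.any() and (~keep).any():
            t_s, t_d = beta[keep].mean(), beta[~keep].mean()
        else:                                      # degenerate split: no separation signal
            t_s = t_d = theta_i
        f = sigmoid((t_s - theta_i) * (theta_i - t_d)) - 0.5
        g = sigmoid(W_g1.T @ u_i + W_g2.T @ p_i + b_g)
        return (1.0 - f) * g * u_i + f * (1.0 - g) * p_i   # unified representation h_i

Note how the degenerate case recovers the intended behavior: with t_s = t_d = θ_i, f collapses to 0 and h_i depends only on the gated user representation, so an uninformative neighborhood is blocked.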
Figure 2: Illustration of user-neighbor relation modeling. It is based on the skip-gram model, which uses a word to predict its context words or the other way around, and learns word embeddings by maximizing the predictive probabilities. In the proposed user-neighbor modeling, users in the neighborhood are viewed as the contexts of a user, and are aggregated for predicting the target user.
The unified representation h_i is then input to multilayer neural networks to estimate the ranking score:

$$ \hat{r}_{ij} = \phi_L(\cdots\phi_2(\phi_1(\mathbf{z}_0))\cdots), \qquad \phi_l = \sigma_l(\mathbf{W}_l^T \mathbf{z}_{l-1} + \mathbf{b}_l),\; l \in [1, L], \qquad \mathbf{z}_0 = [\mathbf{h}_i; \mathbf{v}_j; \mathbf{h}_i \circ \mathbf{v}_j] \tag{7} $$

The deeper insight of modeling neighborhood representations is that they can bridge the semantic gap between unseen user-item pairs and mitigate the data sparseness problem. In other words, an item can be ranked higher in the recommendation list of a user as long as the item has been positively rated by any one of the similar neighbors.

3.4. User-Neighbor Modeling

The intuition of modeling user-neighbor relations is to explicitly capture the compatibility between users and their neighbors, as users are supposed to be close to their informative neighbors in the latent space. Notice that modeling user-item interactions implicitly adjusts users with similar rating behaviors to have similar representations, while the proposed user-neighbor modeling explicitly captures the similarities between users and their informative neighbors. The implicit and explicit modeling of user similarities complement each other for learning compact user representations.

The proposed user-neighbor modeling is based on the skip-gram model [24], which uses a word to predict its contextual words or the other way around [25] [26], and learns word representations by maximizing the predictive probabilities over the words. Similarly, as shown in Fig. 2, in the proposed user-neighbor modeling the neighborhood representation of a user can be viewed as its context. The intuition is that users with similar neighborhood representations are more likely to have similar rating behaviors, and are supposed to have similar representations in the latent space. Therefore, the user-neighbor relations are preserved by maximizing the probability of observing a user given his/her neighbors:

$$ P(u_i \mid \mathbf{p}_i) = \frac{\exp(\mathbf{u}_i^T \mathbf{p}_i)}{\sum_{u \in U} \exp(\mathbf{u}^T \mathbf{p}_i)} \tag{8} $$

Notice that the neighborhood representation p_i is an attentive summation over the neighbors, which filters out dissimilar neighbors and places higher weights on specific users in the neighborhood for a given target item v_j. Therefore, instead of treating the neighbors indiscriminately, the proposed user-neighbor modeling attends differently to the neighbors and finds the most informative users in the neighborhood for learning user representations. Moreover, the neighbors of a given user vary with the target item, hence user representations are learned specifically for the recommendation task.

3.5. Model Learning

We integrate the hybrid gated network and the user-neighbor modeling into a unified model, so that the model parameters can be jointly learned toward the optimization of the recommendation task. In the case of implicit feedback, an entry in the rating matrix equals 1 if the item is observed and 0 otherwise. Due to the data sparseness of the recommendation problem [27], there is a large volume of unobserved items compared to the few items rated by the users. Therefore, in practice we randomly sample unobserved items as negative items, and define a binary cross-entropy loss over the estimated ranking scores and the ground truths:

$$ L_{uv} = -\sum_{(u_i, v_j) \in D} \big( r_{ij}\log(\hat{r}_{ij}) + (1 - r_{ij})\log(1 - \hat{r}_{ij}) \big) \tag{9} $$

where D is the training set that consists of observed user-item pairs and randomly sampled unobserved user-item pairs.
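For completeness, a one-function sketch of the ranking loss L_uv in Eq. (9); the clipping constant is our addition for numerical safety and is not part of the paper's formulation.

    import numpy as np

    def bce_loss(r, r_hat, eps=1e-8):
        """Eq. (9): binary cross-entropy over observed pairs and sampled negatives.

        r     : labels, 1 for observed pairs, 0 for sampled negatives.
        r_hat : predicted ranking scores in (0, 1).
        """
        r_hat = np.clip(r_hat, eps, 1.0 - eps)   # numerical safety (our addition)
        return -np.sum(r * np.log(r_hat) + (1.0 - r) * np.log(1.0 - r_hat))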
As for the user-neighbor modeling, the training objective can be obtained by taking the negative log-likelihood of the conditional probability of a user given his/her neighbors:

$$ L_u = -\sum_{(u_i, v_j)} \log \frac{\exp(\mathbf{u}_i^T \mathbf{p}_i)}{\sum_{u \in U} \exp(\mathbf{u}^T \mathbf{p}_i)} \tag{10} $$

As the equation shows, calculating the term Σ_{u∈U} exp(u^T p_i) requires a summation over all users, which incurs high computational overhead. To address this problem, we employ negative sampling [24] to approximate the log-probability:

$$ \log \frac{\exp(\mathbf{u}_i^T \mathbf{p}_i)}{\sum_{u \in U} \exp(\mathbf{u}^T \mathbf{p}_i)} \approx \log\sigma(\mathbf{u}_i^T \mathbf{p}_i) + \sum_{k=1}^{K} \log\sigma(-\mathbf{u}_k^T \mathbf{p}_i) \tag{11} $$

where K is the number of randomly sampled negative users. Therefore, the objective function of the unified model can be defined as:

$$ L = L_{uv} + \alpha L_u = -\sum_{(u_i, v_j) \in D} \Big\{ r_{ij}\log(\hat{r}_{ij}) + (1 - r_{ij})\log(1 - \hat{r}_{ij}) + \alpha \big[ \log\sigma(\mathbf{u}_i^T \mathbf{p}_i) + \sum_{k=1}^{K} \log\sigma(-\mathbf{u}_k^T \mathbf{p}_i) \big] \Big\} \tag{12} $$

where α is the trade-off between ranking score estimation and user-neighbor modeling. We optimize the objective function L with the Adam optimizer [28], a variant of Stochastic Gradient Descent with a dynamically tuned learning rate, which updates the parameters along the gradient direction at every step:

$$ \theta_t \leftarrow \theta_{t-1} - lr \frac{\partial L}{\partial \theta} \tag{13} $$

where lr is the learning rate, θ are the model parameters, and ∂L/∂θ are the partial derivatives of the objective function with respect to the model parameters; they can be computed automatically with typical deep learning libraries. The overall learning algorithm of the unified model is given in Algorithm 1.
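Before the full loop in Algorithm 1, here is a sketch of the negative-sampling term in Eq. (11). Sampling negative users uniformly at random is our assumption (the paper does not specify the proposal distribution); the returned value is the per-pair L_u contribution, which enters Eq. (12) scaled by α.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def user_neighbor_loss(u_i, p_i, user_table, K, rng):
        """Eq. (11): negative-sampling approximation of -log P(u_i | p_i).

        user_table : embedding matrix of all users, shape (M, d).
        """
        neg_ids = rng.integers(0, user_table.shape[0], size=K)   # K negative users
        pos = np.log(sigmoid(u_i @ p_i))                         # positive term
        neg = np.log(sigmoid(-user_table[neg_ids] @ p_i)).sum()  # negative terms
        return -(pos + neg)                                      # per-pair L_u term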
Algorithm 1: Learning algorithm of SNM.
Input: the training set D; the learning rate lr; the regularization parameter α.
Output: the latent factors u_i and v_j of each user and item; the model parameters θ.
Initialize all model parameters θ
while not converged do
    Randomly sample a tuple (u_i, v_j) ∈ D
    Calculate the objective loss as described in Eqn. (12)
    Calculate ∂L/∂θ and update θ by Eqn. (13)
end while

3.6. Complexity Analysis

For each user-item pair (u_i, v_j), the time complexity of selecting attentive neighbors (Section 3.2) is O(|N(v_j)|d), where d is the embedding size and |N(v_j)| is the number of neighbors considered. The time complexity of the hybrid gate prediction (Section 3.3) is O(d² + Σ_{l=1}^{L} d_{l-1}d_l), where O(d²) is the cost of computing the predictive vector h_i, O(Σ_{l=1}^{L} d_{l-1}d_l) is the cost of the multilayer neural networks, and d_l is the dimension of the l-th layer. For the user-neighbor modeling (Section 3.4), the time complexity is O((K+1)d), where K is the number of negative neighbors. Therefore, the overall time complexity of the proposed model is O(|N(v_j)|d + d² + Σ_{l=1}^{L} d_{l-1}d_l + (K+1)d). In practice, |N(v_j)| and K are much smaller than d, hence the time complexity of SNM mainly depends on the number of latent dimensions.
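As a back-of-the-envelope illustration (the numbers are our assumptions for exposition, not the paper's tuned configuration), take d = 64, |N(v_j)| = 20, K = 5, and a two-layer MLP with the halving scheme from Section 4, so (d_0, d_1, d_2) = (3d, 96, 48) = (192, 96, 48). The per-pair cost is then

$$ \underbrace{|N(v_j)|\,d}_{1280} + \underbrace{d^2}_{4096} + \underbrace{d_0 d_1 + d_1 d_2}_{23040} + \underbrace{(K+1)\,d}_{384} = 28800 $$

so the terms that scale with d², the gate and the MLP, dominate, consistent with the observation that SNM's cost is governed mainly by the latent dimensionality.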
4. Experiment
In this section, we validate the effectiveness of the proposed model on three publicly available datasets. The first dataset is movieLens-20m [11], where users provide explicit ratings and reviews of movies; the explicit ratings are transformed into implicit feedback.

Table 1: Statistics of the datasets used in the experiments (movielen-20m, Pinterest, citeulike-a).
The second dataset is Pinterest [29], where users can save or pin images that they are interested in; a positive feedback is recorded if a user saves or pins an image. The third dataset, citeulike-a [30], is collected from an online service that allows users to save and share academic papers; a user-item interaction is encoded as 1 if the user has saved the paper in his/her library. Following previous works [31] [32] [33], we filter out users with fewer than 5 positive items and items with fewer than 2 users. The statistics of the datasets are shown in Table 1. Dataset sources: https://grouplens.org/datasets/movielens/ and http://sites.google.com/site/xueatalphabeta/.

The baselines employed for performance comparison are listed as follows:

• SLIM [34] generates top-N recommendations by aggregating from user purchase/rating profiles.
• NeuMF [21] combines generalized Matrix Factorization (MF) and a Multi-Layer Perceptron (MLP) for modeling user-item latent structures.
• SVD++ [16] is a hybrid model that encodes a latent factor model and neighborhood similarity in a unified framework for recommendation.
• GATE [11] exploits neighboring relations to help infer users' preferences.
• SAMN [20] models aspect-level and friend-level influences in a hierarchical manner.
• CMN [12] identifies similar neighboring users with an attention mechanism based on the specific user-item pair, and jointly exploits the neighborhood state and the user-item interactions to derive recommendations.
• DELF [35] proposes an attention mechanism to aggregate an additional embedding for each user/item, and further introduces a neural network architecture to incorporate the dual embeddings for recommendation.

We implement the proposed model with the TensorFlow deep-learning library (https://tensorflow.google.cn/api_docs/python). As for the hyper-parameters, we perform a grid search over the number of latent factors in {16, 32, 64, 128}, and the trade-off weight between ranking score estimation and user-neighbor modeling is tuned by grid search as well. The number of layers for modeling the user-item interactions is also varied, with the dimensionality of each layer halved from the previous layer. As the initialization of the deep neural network has a crucial impact on recommendation performance, we employ the NeuMF model to pre-train the user/item embeddings, and use them to initialize the corresponding parameters when training the proposed model. The model is trained via stochastic gradient descent over shuffled mini-batches with a batch size of 256. The training objective is optimized with the Adam [28] optimizer using an initial learning rate of 0.001, decayed by a factor of 0.9 every 100 steps. We perform early stopping and fine-tune the hyper-parameters on the validation set. All experiments are run on an NVIDIA GeForce GTX 1070 graphics card with 8 GB of memory.

To evaluate the proposed model, we randomly split each dataset into a training set (70%), a validation set (10%) and a testing set (20%). We repeat the experiments 10 times to avoid splitting bias and report the average results over the 10 runs; we notice only minimal differences among the runs. All competing models are fine-tuned on the validation set. During training, for each positive item we randomly sample 5 items as negative samples. During testing, as it is time-consuming to rank all items for every user, for each (u_i, v_j) pair in the testing set we mix the testing item with 99 randomly sampled items, and rank the testing item along with those 99 items for the related user.

We measure the recommendation performance with the commonly used Hit Ratio (HR) and Normalized Discounted Cumulative Gain (NDCG), defined as follows:

$$ HR = \frac{\#hits}{\#test}, \qquad NDCG = \frac{1}{\#test}\sum_{i=1}^{\#hits} \frac{1}{\log_2(p_i + 1)} \tag{14} $$

where #hits is the number of testing items that appear in the recommendation lists of their related users, #test is the total number of (u_i, v_j) pairs in the testing set, and p_i is the position of the testing item in the recommendation list for the i-th hit. HR measures whether the testing item is present in the recommendation list, while NDCG assigns a higher score when the testing item is ranked higher. We truncate the ranking list at k ∈ [1, · · · , 10] for both metrics; for example, HR@5 measures the ratio of testing items that appear in the top-5 recommendation list.
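The following sketch computes both metrics of Eq. (14) under the 99-sampled-negatives protocol just described. All function and variable names are ours, and the per-user ranked lists are assumed to be given.

    import numpy as np

    def hr_ndcg_at_k(ranked_lists, test_items, k=10):
        """Sketch of Eq. (14): HR@k and NDCG@k over the testing set.

        ranked_lists : list of item-id lists, each of length 100 (1 test item
                       plus 99 sampled negatives), sorted by predicted score.
        test_items   : the held-out positive item of each (u_i, v_j) test pair.
        """
        hits, ndcg = 0, 0.0
        for ranked, target in zip(ranked_lists, test_items):
            top_k = ranked[:k]
            if target in top_k:
                hits += 1
                p = top_k.index(target) + 1       # 1-based rank position p_i
                ndcg += 1.0 / np.log2(p + 1)
        n = len(test_items)
        return hits / n, ndcg / n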
The performance comparison among the models is presented in Table 2. From the table, we make the following key observations. First, the proposed model, SNM, significantly outperforms the baselines, demonstrating the successful integration of the hybrid gated neighborhood selection and the user-neighbor modeling over existing attentive neighborhood modeling methods. Second, SNM yields better results than the dual attention model, DELF. Although DELF dually exploits user and item neighboring information for recommendation, it simply incorporates the aggregated neighborhood into the interactions, without considering the usefulness of the neighborhood information, leading to an incomplete exploration of neighborhood credibility. Third, SNM outperforms CMN, SAMN and GATE by a large margin. These three models all aggregate neighbors into a unified vector with an attention mechanism and incorporate it with different neural networks; however, they do not discriminate the effects of different neighborhoods, which may introduce noise and negatively affect model performance. Fourth, SNM achieves better results than SVD++, since SVD++ treats neighbors equally and neglects the fact that informative neighbors can better bridge the gaps between unseen user-item pairs. Finally, SNM demonstrates improved performance over NeuMF and SLIM, as those two models do not incorporate neighborhood information; they mainly depend on user-item interaction data and suffer from the problem of data sparseness.
Other observations.
First, all the attention-based models (i.e. DELF, CMN, SAMN, GATE) achieve competitive recommendation performance, since they all involve an attention mechanism to incorporate neighborhood information for recommendation. Second, among the attention-based models, DELF achieves overall better performance than CMN, SAMN and GATE. One possible explanation is that DELF simultaneously explores user-user and item-item relations for modeling dual interactions, while the other three models mainly depend on a single neighborhood for recommendation. Third, all the attention-based models outperform SVD++, demonstrating the benefit of placing higher weights on informative neighbors that can better infer users' preferences. Fourth, SLIM generally performs worst among the baselines, since it mainly depends on user-item interactions and suffers from data sparsity. Fifth, on the Pinterest dataset, SVD++ obtains the lowest HR and NDCG, due to its restricted ability to handle sparse data when only few neighbors are present. Finally, on the movielen-20m and Pinterest datasets, NeuMF outperforms the neighborhood-based SVD++, revealing the effectiveness of multi-layer non-linear transformations for capturing complex user-item interactions.

Table 2: Performance comparison on the three datasets. Best performance is in boldface and the second best is underlined. * indicates the results are significant at level 0.01.

movielen-20m
Models | HR@5   | HR@10  | NDCG@5 | NDCG@10
SLIM   | 0.4021 | 0.4754 | 0.3277 | 0.3631
NeuMF  | 0.4198 | 0.5311 | 0.3157 | 0.3515
SVD++  | 0.4086 | 0.5186 | 0.3004 | 0.3356
GATE   | 0.4523 | 0.5448 | 0.3591 | 0.3922
SAMN   | 0.4412 | 0.5217 | 0.3542 | 0.3873
CMN    | 0.4680 | 0.5687 | 0.3509 | 0.3832
DELF   | 0.4517 | 0.5499 | 0.3462 | 0.3779
SNM    | *      | *      | *      | *

Pinterest
Models | HR@5   | HR@10  | NDCG@5 | NDCG@10
SLIM   | 0.5115 | 0.6232 | 0.3797 | 0.4125
NeuMF  | 0.5206 | 0.6371 | 0.3819 | 0.4197
SVD++  | 0.4804 | 0.6075 | 0.3430 | 0.3842
GATE   | 0.5550 | 0.6500 | 0.4262 | 0.4580
SAMN   | 0.5637 | 0.6412 | 0.4243 | 0.4565
CMN    | 0.5454 | 0.6415 | 0.4170 | 0.4482
DELF   | 0.5699 | 0.6578 | 0.4283 | 0.4618
SNM    | *      | *      | *      | *

citeulike-a
Models | HR@5   | HR@10  | NDCG@5 | NDCG@10
SLIM   | 0.3932 | 0.5637 | 0.3018 | 0.3651
NeuMF  | 0.4230 | 0.5807 | 0.3395 | 0.3997
SVD++  | 0.4971 | 0.6094 | 0.3689 | 0.4017
GATE   | 0.4995 | 0.6134 | 0.3702 | 0.4079
SAMN   | 0.5094 | 0.6154 | 0.3710 | 0.4109
CMN    | 0.4377 | 0.5996 | 0.3481 | 0.4099
DELF   | 0.5095 | 0.6177 | 0.3785 | 0.4090
SNM    | *      | *      | *      | *

Table 3: Performance comparisons among different variants of SNM. SNM-Th excludes the hybrid gated network: the neighborhood representation is a weighted sum over all the neighbors, and it is weighted equally with the user representation to compose the neighborhood-based user representation h_i. SNM-Un excludes the user-neighbor modeling: users are not regularized to be close to their neighbors.

Datasets     | Metrics | SNM-Th | SNM-Un | SNM
movielen-20m | HR@5    | 0.4686 | 0.4828 | 0.4986
             | HR@10   | 0.5766 | 0.5789 | 0.5981
             | NDCG@5  | 0.3567 | 0.3635 | 0.3837
             | NDCG@10 | 0.3884 | 0.3929 | 0.4160
Pinterest    | HR@5    | 0.5702 | 0.5845 | 0.5882
             | HR@10   | 0.6653 | 0.6749 | 0.6787
             | NDCG@5  | 0.4109 | 0.4452 | 0.4480
             | NDCG@10 | 0.4417 | 0.4759 | 0.4762
citeulike-a  | HR@5    | 0.5191 | 0.5084 | 0.5224
             | HR@10   | 0.6228 | 0.6112 | 0.6305
             | NDCG@5  | 0.3866 | 0.3850 | 0.3927
             | NDCG@10 | 0.4216 | 0.4183 | 0.4263
Ablation Study. In this section, we further investigate the impact that the different components have on the recommendation performance. To this end, we introduce two variants of SNM, namely (a) the variant that excludes the thresholding mechanism in the hybrid gated network (SNM-Th for short) and (b) the variant that excludes the user-neighbor modeling (SNM-Un for short). For SNM-Th, the neighborhood representation is an attentive aggregation over the neighbors without filtering out dissimilar users, and we place equal weights on the user representation and the neighborhood representation to compose h_i.

As shown in Table 3, the unified model SNM generally outperforms SNM-Th, which excludes the hybrid gated network and attends to all the neighbors for composing the neighborhood representation. This further illustrates the effectiveness of the hybrid gated network, as it can not only filter out noisy neighbors for a given item, but also capture the uncertainty in the neighborhood. SNM-Un uniformly performs worse than SNM, hinting at the effectiveness of the user-neighbor modeling that explicitly captures the proximity between each user and its neighborhood in the latent space. The performance differences between SNM-Th and SNM-Un vary across the datasets. Specifically, SNM-Un yields better performance than SNM-Th on the movielen-20m and Pinterest datasets, while SNM-Th shows a performance improvement over SNM-Un on the citeulike-a dataset. This can be explained by the fact that items of the movielen-20m and Pinterest datasets have more ratings on average, hence the denser neighborhood allows SNM-Un to sufficiently explore the neighborhood and identify the most informative neighbors for bridging unseen user-item pairs.
Figure 3: Recommendation performance with respect to different embedding sizes on the three datasets.
Figure 4: Recommendation performance with respect to different regularization weights (i.e. α in Eqn. (12)) on the three datasets.

Embedding Size. The recommendation performance (i.e. HR@10 and NDCG@10) of SNM with respect to different embedding sizes is presented in Figure 3. As HR@10 and NDCG@10 show similar trends, we focus the analysis on HR@10 in this section. For the movielen-20m dataset, the general trend shows a steady improvement as the embedding size increases, indicating that a larger embedding size improves the model's expressiveness to encode the complex user-item interactions of the training data. For the Pinterest dataset, a similar trend with increasing embedding size can be observed. An exception occurs when the embedding size is set to 32, where the model experiences an unusual drop in performance; a possible explanation is that the model falls into a local optimum due to the non-convexity of neural networks. Further, increasing the embedding size from 64 to 128 does not provide significant benefits, indicating that the embedding size can be tuned to trade off computational overhead against model performance. Experimental results on the citeulike-a dataset show similar performance gains as the embedding size increases. However, unlike the previous datasets, where the model achieves the best performance with an embedding size of 128, on citeulike-a the performance drops suddenly at that embedding size, potentially due to overfitting.

Figure 5: HR@10 of SNM with respect to different ratios of negative items.
Trade-off Weight.
In this subsection, we investigate the impact that the trade-off weight (i.e. α in Eqn. (12)) has on the model performance. The trade-off weight introduces the user-neighbor modeling, and regularizes how close the users should be to their neighbors. Figure 4 shows the model performance with respect to different trade-off weights across the datasets. For the movielen-20m dataset, the model performance shows a gradual improvement as the weight is increased from 0.001 to 0.1, signifying the benefit of explicitly capturing the proximity between the users and their neighbors. However, the model performance degrades significantly when the weight is increased further, indicating that strong regularization of the users can confine the model's expressiveness to infer complex user-item relations. On the contrary, on the Pinterest dataset, the proposed model exhibits consistently stable performance across different trade-off weights. This is potentially due to the dense neighborhood of that dataset: sufficient neighborhood information makes the model insensitive to the regularization between users. The model performance on citeulike-a shows similar trends to that on movielen-20m; however, the model is more sensitive to the hyperparameter. This is probably because the modeling of user-neighbor similarities causes large variance in the user representations when only a few neighbors are present.
Negative Sampling Ratio.
In this subsection, we study the influence of the negative sampling ratio. Negative items are dominant in the training set, and they usually contain rich information for recommendation [36]. Figure 5 shows the HR@10 of the proposed model with respect to different negative sampling ratios. From the figure, we observe that the model performance first improves as the negative ratio increases. This is probably because, when the ratio is small, the informativeness of the negative items cannot be sufficiently exploited for boosting recommendation. However, when the ratio becomes too large, the model performance begins to degrade. One possible explanation is that with a large negative ratio the training set is dominated by negative items, and the model is biased toward those items, which leads to sub-optimal results.
Table 4: Running time in seconds of the competing models across the datasets. The training time is the time for training one epoch of the data.

Models | movielen-20m train | movielen-20m test | Pinterest train | Pinterest test | citeulike-a train | citeulike-a test
GATE   | 1006.1691 | 50.2672  | 63.748   | 3.2282 | 22.3168 | 1.1196
CMN    | 1328.8509 | 49.7335  | 86.7495  | 3.3231 | 30.7609 | 1.1525
SNM    | 1215.6646 | 61.0406  | 77.2744  | 3.8188 | 27.1653 | 1.2702
SAMN   | 1538.63   | 69.7845  | 98.7012  | 4.4937 | 34.772  | 1.535
DELF   | 2650.0845 | 143.348  | 167.1174 | 9.2574 | 60.343  | 3.1212

Figure 6: Relevance scores between a randomly sampled user and his/her neighbors across the datasets, where a dark color represents a higher value while lighter colors indicate lower values. The y-axis is the target item while the x-axis illustrates the gathered neighbors given the target item.
Efficiency Study. To study the time efficiency of the proposed model, we compare its running time with that of representative baselines. We record the running time of all models in the same computing environment, and set the hyperparameters according to the original works. Table 4 shows the training and prediction time of the neighborhood-based models; the training time is the time for training one epoch of the data, and the prediction time is the time required to complete the prediction for the whole testing set. From the table we observe that GATE takes the least time to finish the training and testing processes, as its component for calculating textual representations is excluded in this work. CMN and SNM yield comparable running times, since the time complexity of both models mainly depends on the embedding size. SAMN and DELF incur high computational overhead, as they involve hierarchical or dual attentions: SAMN captures aspect attentions and friend-level attentions for user modeling, while DELF simultaneously models user-user and item-item relations for multilevel interactions. This efficiency study shows that the proposed SNM model achieves the best recommendation accuracy without incurring noticeable overhead.
Visualization. To provide deeper insight into the attentive neighbor selection, we visualize the relevance scores (i.e. β_t in Eqn. (3)) between the users and their neighbors given different target items. Fig. 6 presents a heatmap of the scores for a randomly sampled user from each dataset. The color scale indicates the intensity of the relevance scores, where a dark color represents a higher value and lighter colors indicate lower values. Each row is a score distribution over the neighbors given a target item, as specified by the y-axis labels. The x-axis represents the users in the neighborhood, notated from "n1" and truncated at "n20"; notice that the notations do not necessarily reflect user ids from the datasets.

From the figures we have the following observations. First, not all the neighbors are equally informative. For example, user 1 of the movielen-20m dataset places a higher weight on the 5-th neighbor for estimating the ranking score of item 3560, and user 911 of citeulike-a depends uniquely on the 10-th neighbor to derive the recommendation of item 3943. This justifies the necessity of aligning the most informative neighbors for recommendation. Second, each user has different distributions of relevance scores over the neighbors given different target items, validating the intuition of involving the target item in calculating the relevance scores. The underlying reason is that, given items of different characteristics, users may refer to neighbors of different preferences to derive the recommendations. Third, in some cases the relevance scores are evenly distributed among the neighbors, and due to the large number of neighbors each of them receives a tiny weight for composing the neighborhood representation. For example, the relevance scores between user 911 (citeulike-a) and the neighbors are all close to 0 given item 5099; a similar situation can be observed between user 334 (Pinterest) and his/her neighbors for recommending item 816. In these cases, the neighborhood information is not informative and should be automatically blocked for the recommendation task. This motivates the proposed hybrid gated network, which filters out noisy users and selects between a user representation and the neighborhood representation based on the confidence level of the neighborhood information.
5. Related Work
Recently, deep learning has been widely applied in recommendation due to its immense success in many research areas such as computer vision, speech recognition and natural language processing [37]. Some works [38] propose to boost recommendation performance by exploiting different neural network structures. He et al. [21] combine generalized matrix factorization and multilayer perceptrons into a unified Neural Collaborative Filtering (NCF) framework for modeling user-item interactions; NCF is a state-of-the-art recommendation model based mainly on user-item historical records. To more comprehensively explore the user-item interactions, He et al. [39] propose Neural Factorization Machines (NFM) to model higher-order and non-linear feature interactions. Collaborative denoising autoencoders (CDAE) [40] incorporate user-specific bias into an autoencoder for recommendation, and have been shown to generalize many existing collaborative filtering methods.

As the aforementioned works rely mainly on user-item interactions, they suffer from data sparseness [1], since a user usually gives few ratings compared to the large item set. To this end, many studies [41, 42, 43] propose to exploit additional knowledge about the users/items to mitigate data sparseness. For example, Zheng et al. [44] employ Convolutional Neural Networks (CNNs) to extract local feature representations from reviews, and utilize Factorization Machines to capture high-order interactions between the representations of users and items. Wang et al. [45] use pre-trained CNN models to exploit visual content from images, and integrate those visual features for recommending points of interest. In [5], the authors leverage the effective representation learning of deep learning techniques, and propose a model that jointly performs user/item representation learning from side information and collaborative filtering from the rating matrix. Meng et al. [46] exploit positive and negative emotions in reviews for recommending products, inspired by the fact that emotions in reviews strongly indicate user preferences. One drawback of those works is that they require side information, which is not always effective for learning informative user/item latent factors.
Neighborhood-based approaches are another major class of collaborative filtering. The underlying reason is that users usually share somewhat similar preferences with their neighbors, and the semantic gaps between unseen user-item pairs can be bridged by neighbors with sufficient historical interaction records. For example, Ebesu et al. [12] propose a recommender that is a hybrid of a latent factor model and a neighborhood-based model: it leverages a memory component to encode complex user-item relations, a neural attention mechanism to learn a user-item specific neighborhood, and an output module to jointly exploit the memory component and the neighborhood to produce the ranking score. In [20], the authors address social-aware recommendation with a hierarchical attention mechanism that exploits aspect-level and friend-level attentions from the neighborhood. Ma et al. [11] address the sparse implicit feedback of recommendation by proposing a neighbor-level attention that learns the neighborhood representation of an item by considering its neighbors in a weighted manner.

The attention mechanism is an indispensable technique in those neighborhood-based approaches. It has wide application in many machine learning tasks such as image captioning and machine translation [19, 47]. Since not all users/items in the neighborhood are equally informative, it is natural to place higher weights on specific neighbors when aggregating their representations. However, the neighborhood is usually noisy, and simply aggregating all the neighbors has a negative impact on the recommendation performance. Moreover, most previous works tend to assign equal weights to the user and the neighborhood representations, ignoring the confidence of the neighbors in recommending the target item. Finally, previous neighborhood-based approaches ignore the semantic proximity between the users/items and their neighbors; the user-user similarities are only implicitly captured during the collaborative filtering process, which is inefficient for modeling localized neighborhood information.
6. Conclusion
In this paper, we propose a novel neighborhood-based recommendation model to deal with neighborhood noise and learn compact user-neighbor compatibility. We design a hybrid gated network to separate similar neighbors from dissimilar ones, and aggregate the similar neighbors to compose the neighborhood representations. We also propose to explicitly preserve user-neighbor proximity, and to learn specialized user representations for the recommendation task. Extensive experiments on three publicly available datasets have demonstrated the advantage of the proposed model over state-of-the-art neighborhood-based models, and justified the rationale underlying the two components of the proposed model.
7. Acknowledgement
This research was supported in part by the National Natural Science Foundation of China under Grant 61802427.
References

[1] X. Luo, M. Zhou, S. Li, M. Shang, An inherently nonnegative latent factor model for high-dimensional and sparse matrices from industrial applications, IEEE Transactions on Industrial Informatics 14 (5) (2017) 2011–2022.
[2] F. Zhang, N. J. Yuan, D. Lian, X. Xie, W.-Y. Ma, Collaborative knowledge base embedding for recommender systems, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2016, pp. 353–362.
[3] Y. Zhang, Q. Ai, X. Chen, W. B. Croft, Joint representation learning for top-n recommendation with heterogeneous information sources, in: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, ACM, 2017, pp. 1449–1458.
[4] J. Chen, H. Zhang, X. He, L. Nie, W. Liu, T.-S. Chua, Attentive collaborative filtering: Multimedia recommendation with item- and component-level attention, in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2017, pp. 335–344.
[5] X. Dong, L. Yu, Z. Wu, Y. Sun, L. Yuan, F. Zhang, A hybrid collaborative filtering model with deep structure for recommender systems, in: Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[6] T. Yamasaki, J. Hu, S. Sano, K. Aizawa, Folkpopularityrank: Tag recommendation for enhancing social popularity using text tags in content sharing services, in: IJCAI, 2017, pp. 3231–3237.
[7] C.-Y. Liu, C. Zhou, J. Wu, Y. Hu, L. Guo, Social recommendation with an essential preference space, in: Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[8] T. Zhao, J. McAuley, I. King, Leveraging social connections to improve personalized ranking for collaborative filtering, in: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, ACM, 2014, pp. 261–270.
[9] S. Sedhain, A. K. Menon, S. Sanner, L. Xie, D. Braziunas, Low-rank linear cold-start recommendation from social data, in: Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[10] Z. Ren, S. Liang, P. Li, S. Wang, M. de Rijke, Social collaborative viewpoint regression with explainable recommendations, in: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, ACM, 2017, pp. 485–494.
[11] C. Ma, P. Kang, B. Wu, Q. Wang, X. Liu, Gated attentive-autoencoder for content-aware recommendation, in: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, ACM, 2019, pp. 519–527.
[12] T. Ebesu, B. Shen, Y. Fang, Collaborative memory network for recommendation systems, in: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, ACM, 2018, pp. 515–524.
[13] X. Wang, W. Lu, M. Ester, C. Wang, C. Chen, Social recommendation with strong and weak ties, in: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, ACM, 2016, pp. 5–14.
[14] J. Tang, C. Aggarwal, H. Liu, Recommendations in signed social networks, in: Proceedings of the 25th International Conference on World Wide Web, International World Wide Web Conferences Steering Committee, 2016, pp. 31–40.
[15] L. Gao, H. Yang, J. Wu, C. Zhou, W. Lu, Y. Hu, Recommendation with multi-source heterogeneous information, in: IJCAI International Joint Conference on Artificial Intelligence, 2018.
[16] Y. Koren, Factorization meets the neighborhood: a multifaceted collaborative filtering model, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2008, pp. 426–434.
[17] K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
[18] A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, I. Gulrajani, V. Zhong, R. Paulus, R. Socher, Ask me anything: Dynamic memory networks for natural language processing, in: International Conference on Machine Learning, 2016, pp. 1378–1387.
[19] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473.
[20] C. Chen, M. Zhang, Y. Liu, S. Ma, Social attentional memory network: Modeling aspect- and friend-level differences in recommendation, in: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, ACM, 2019, pp. 177–185.
[21] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, T.-S. Chua, Neural collaborative filtering, in: Proceedings of the 26th International Conference on World Wide Web, International World Wide Web Conferences Steering Committee, 2017, pp. 173–182.
[22] J. Manotumruksa, C. Macdonald, I. Ounis, A deep recurrent collaborative filtering framework for venue recommendation, in: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, ACM, 2017, pp. 1429–1438.
[23] X. He, X. Du, X. Wang, F. Tian, J. Tang, T.-S. Chua, Outer product-based neural collaborative filtering, in: Proceedings of the 27th International Joint Conference on Artificial Intelligence, AAAI Press, 2018, pp. 2227–2233.
[24] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[25] C. Yang, L. Bai, C. Zhang, Q. Yuan, J. Han, Bridging collaborative filtering and semi-supervised learning: a neural approach for POI recommendation, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2017, pp. 1245–1254.
[26] X. Cai, J. Gao, K. Y. Ngiam, B. C. Ooi, Y. Zhang, X. Yuan, Medical concept embedding with time-aware attention, in: Proceedings of the 27th International Joint Conference on Artificial Intelligence, AAAI Press, 2018, pp. 3984–3990.
[27] X. Luo, M. Zhou, S. Li, L. Hu, M. Shang, Non-negativity constrained missing data estimation for high-dimensional and sparse matrices from industrial applications, IEEE Transactions on Cybernetics.
[28] D. Kingma, J. Ba, Adam: A method for stochastic optimization, Computer Science.
[29] X. Geng, H. Zhang, J. Bian, T.-S. Chua, Learning image and user features for recommendation in social networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4274–4282.
[30] C. Wang, D. M. Blei, Collaborative topic modeling for recommending scientific articles, in: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2011, pp. 448–456.
[31] D. Liang, L. Charlin, J. McInerney, D. M. Blei, Modeling user exposure in recommendation, in: Proceedings of the 25th International Conference on World Wide Web, International World Wide Web Conferences Steering Committee, 2016, pp. 951–961.
[32] D. Lian, C. Zhao, X. Xie, G. Sun, E. Chen, Y. Rui, GeoMF: joint geographical modeling and matrix factorization for point-of-interest recommendation, in: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2014, pp. 831–840.
[33] Q. Yuan, G. Cong, Z. Ma, A. Sun, N. M. Thalmann, Time-aware point-of-interest recommendation, in: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2013, pp. 363–372.
[34] X. Ning, G. Karypis, SLIM: Sparse linear methods for top-n recommender systems, in: 2011 IEEE 11th International Conference on Data Mining, IEEE, 2011, pp. 497–506.
[35] W. Cheng, Y. Shen, Y. Zhu, L. Huang, DELF: a dual-embedding based deep latent factor model for recommendation, in: Proceedings of the 27th International Joint Conference on Artificial Intelligence, AAAI Press, 2018, pp. 3329–3335.
[36] C. Wu, F. Wu, M. An, J. Huang, Y. Huang, X. Xie, NPA: Neural news recommendation with personalized attention, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ACM, 2019, pp. 2576–2584.
[37] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436.
[38] T. Nguyen, A. Takasu, NPE: neural personalized embedding for collaborative filtering, in: Proceedings of the 27th International Joint Conference on Artificial Intelligence, AAAI Press, 2018, pp. 1583–1589.
[39] X. He, T.-S. Chua, Neural factorization machines for sparse predictive analytics, in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2017, pp. 355–364.
[40] Y. Wu, C. DuBois, A. X. Zheng, M. Ester, Collaborative denoising auto-encoders for top-n recommender systems, in: Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, ACM, 2016, pp. 153–162.
[41] X. Wang, X. He, L. Nie, T.-S. Chua, Item silk road: Recommending items from information domains to social users, in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2017, pp. 185–194.
[42] H. Ma, H. Yang, M. R. Lyu, I. King, SoRec: social recommendation using probabilistic matrix factorization, in: Proceedings of the 17th ACM Conference on Information and Knowledge Management, ACM, 2008, pp. 931–940.
[43] G. Guo, J. Zhang, N. Yorke-Smith, TrustSVD: collaborative filtering with both the explicit and implicit influence of user trust and of item ratings, in: Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[44] L. Zheng, V. Noroozi, P. S. Yu, Joint deep modeling of users and items using reviews for recommendation, in: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, ACM, 2017, pp. 425–434.
[45] S. Wang, Y. Wang, J. Tang, K. Shu, S. Ranganath, H. Liu, What your images reveal: Exploiting visual contents for point-of-interest recommendation, in: Proceedings of the 26th International Conference on World Wide Web, International World Wide Web Conferences Steering Committee, 2017, pp. 391–400.
[46] X. Meng, S. Wang, H. Liu, Y. Zhang, Exploiting emotion on reviews for recommender systems, in: Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[47] A. M. Rush, S. Chopra, J. Weston, A neural attention model for abstractive sentence summarization, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 379–389.