Factor-level Attentive ICF for Recommendation
Zhiyong Cheng, Shenghan Mei, Yangyang Guo, Lei Zhu, Liqiang Nie
ZHIYONG CHENG,
Qilu University of Technology (Shandong Academy of Sciences)
SHENGHAN MEI,
Shandong University, China
YANGYANG GUO,
Shandong University, China
LEI ZHU,
Shandong Normal University, China
LIQIANG NIE,
Shandong University, China

Item-based collaborative filtering (ICF) enjoys the advantages of high recommendation accuracy and ease of online personalization, and is thus favored by industrial recommender systems. ICF recommends items to a target user based on their similarities to the items the user has previously interacted with. Great progress has been achieved for ICF in recent years by applying advanced machine learning techniques (e.g., deep neural networks) to learn the item similarity from data. Early methods simply treat all historical items equally, while recent ones distinguish the different importance of items for a prediction. Despite this progress, we argue that those ICF models neglect the diverse intents of users when adopting items (e.g., watching a movie because of its director, leading actors, or visual effects). As a result, they fail to estimate the item similarity at a finer-grained level when predicting the user's preference for an item, resulting in sub-optimal recommendation. In this work, we propose a general factor-level attention method for ICF models. The key of our method is to distinguish the importance of different factors when computing the item similarity for a prediction. To demonstrate the effectiveness of our method, we design a light attention neural network to integrate both item-level and factor-level attention for neural ICF models. It is model-agnostic and easy to implement. We apply it to two baseline ICF models and evaluate its effectiveness on six public datasets. Extensive experiments show that the factor-level attention enhanced models consistently outperform their counterparts, demonstrating the potential of differentiating user intents at the factor level for ICF recommendation models.

CCS Concepts: • Information systems → Personalization; Recommender systems; Collaborative filtering.

Additional Key Words and Phrases: Attention, Collaborative Filtering, Diverse preference, Item-based Recommendation
ACM Reference Format:
Zhiyong Cheng, Shenghan Mei, Yangyang Guo, Lei Zhu, and Liqiang Nie. 2018. Factor-level Attentive ICF for Recommendation. In Woodstock '18: ACM Symposium on Neural Gaze Detection, June 03–05, 2018, Woodstock, NY. ACM, New York, NY, USA, 26 pages. https://doi.org/10.1145/1122445.1122456
In the information age, we face overwhelming information in almost all aspects of our work and life. How to quickly find the desired information has thus become crucial in our daily lives. Recommendation, as an effective information filtering and seeking technique [28, 38], has been widely deployed in current online service platforms, including information/media providers, E-commerce, and social platforms. Among the many recommendation methods, collaborative filtering (CF) [29, 51, 74], which is one of the most dominant recommendation techniques, has attracted a
lot of attention from researchers and practitioners since its birth. In general, CF methods can be categorized into two paradigms: user-based CF (UCF) and item-based CF (ICF) [55]. The key idea of UCF is that users sharing close preferences often like the same items; that is, items previously consumed by one user will be recommended to another similar user with a large probability. In contrast, ICF methods represent a user with all his/her historically consumed items [51]. Specifically, the similarities between the target item and the previously interacted items are estimated first, and are then treated as the pivot for recommending similar items to the target user.

Compared to UCF, ICF has several advantages in practice. Firstly, representing a user based on previously interacted items provides more accurate user modeling and thus has more potential to improve the performance. The user preference on items is relatively stable unless the background (or context) changes dramatically, especially for the long-term preference. The aspects (or characteristics) that a user cared about in the past are likely to remain important for the user for a long time. ICF models represent a user by previously interacted items, which in effect profiles the user with the characteristics of those items. Several empirical studies in the literature provide evidence of the superiority of ICF over UCF in accuracy for top-N recommendation [27, 33]. Secondly, ICF enjoys better interpretability, because it can explain a recommendation with similar items that the user interacted with before. This is more acceptable to users than the "similar users"-based explanation, as those similar users might be strangers to the target user. In addition, ICF can flexibly incorporate new user-item interactions into the model, which makes it more suitable for online personalization [27, 72]. For new interactions, UCF methods need to re-train the model to update the user representations, which is very time-consuming and impractical in industrial applications. On the contrary, ICF can simply retrieve items similar to the newly interacted ones (i.e., leveraging item similarities) and recommend them to the current user. It does not need the model re-training process and is thus more time-efficient [14, 21, 54].

Early ICF approaches estimate the item similarities using statistical measures, such as the Pearson coefficient [35] or cosine similarity [51]. The main drawback of those heuristic approaches is that they often require heavy manual tuning of the similarity measure for good performance on a target dataset. As a result, such methods are hard to apply directly to a new dataset. To tackle this limitation, data-driven methods [33, 47] have been developed to learn item similarity from data. These methods define a recommendation-oriented objective function and then learn the parameters by fitting the model to the observed data. Theoretically, the richer the data, the more accurate the model can be. Data-driven methods save the time of manual parameter tuning: they not only improve efficiency but also enjoy higher accuracy, because the parameters are estimated from real data rather than relying on the experience of practitioners. Recently, He et al.
[27] pointed out that existing data-driven ICF methods assume all historical items of a user profile contribute equally when estimating the target item for the current user, which results in sub-optimal performance. They therefore developed a neural attentive item similarity model (NAIS) to distinguish the different importance of previously interacted items for the user preference to the target item.

Though NAIS has achieved superior performance over existing ICF methods, we argue that its performance is still limited because it neglects users' diverse intents in adopting items. More specifically, a user often pays attention to certain factors when selecting an item to consume. Accordingly, those factors will dominate the user's preference towards this item. In addition, for each user, the dominant factors usually differ from item to item. For example, a user may favor a movie because of its plot, and like another movie because he is a fan of its director. With this consideration, we deem that treating all factors equally is not optimal in recommendation. (In this paper, we regard different dimensions of the item embedding as different factors that reflect user intents.) However, it is not straightforward to explicitly model the impact of different factors in ICF, because ICF models rely on estimating the similarity between the target item and historical items for prediction. In addition, the item-level attention has been demonstrated to be important for ICF [27]. To further enhance the recommendation accuracy, it is necessary to consider both item-level and factor-level attention simultaneously in the model. How to combine them without complicating the model or increasing the computational burden much is another problem to be solved.

In this work, we make an effort to tackle the aforementioned problems and present a general factor-level attention method for ICF models to consider user diverse intents in recommendation. Our method models user diverse intents by distinguishing the impact of the different factors of a historical item on the target item for prediction. More concretely, our method computes a weight vector for each historical item to estimate the similarity between this historical item and the target item. This weight vector differentiates the contributions of different factors to the prediction by assigning a different weight to each factor of the embedding vector. Based on this idea, we further design an attention neural network to combine the item- and factor-level attention for neural ICF models. It is light and easy to implement in different ICF models. To evaluate its effectiveness, we apply it to two models, NAIS [27] and DeepICF [72], and conduct experiments on six Amazon datasets. Extensive experimental results show that the factor-level attention enhanced models consistently improve the performance over their counterpart models (i.e., NAIS and DeepICF) and achieve state-of-the-art performance.

In summary, the main contributions of this work are threefold:
• We highlight the importance of considering user diverse intents in ICF models and propose to model the intents at the factor level. In particular, we present a general factor-level attention method to measure the importance of different factors of a historical item to the target item for ICF models.
• We design a light and model-agnostic attention neural network which can effectively combine the item- and factor-level attention for neural ICF models. It is easy to implement in existing ICF models, and we apply it to NAIS and DeepICF to enhance their performance.
• We conduct extensive experiments on six publicly available datasets to demonstrate the effectiveness of our proposed method. Experimental results show that the enhanced NAIS and DeepICF achieve superior performance over their counterparts. We released our code and the parameter settings of the experiments (https://github.com/masonmsh/FLA) to facilitate others to reproduce this work.

The rest of this paper is organized as follows. We first introduce some background and the recent advancement of ICF models in Section 2. In Section 3, we elaborate our factor-level attention method, and then describe its application to NAIS and DeepICF in Section 4. Next, we report the experimental results in Section 5 and review related work in Section 6. Finally, we conclude the paper in Section 7.
This section first introduces the general framework of item-based collaborative filtering (ICF) and then recapitulates the recent advancement of ICF models. Table 1 lists the main notations used in this paper.

Table 1. The main notations used in this paper.

Notation   Definition
W          the trainable weight matrix
H^T        the matrix that projects the hidden layer into an output layer
h^T        the vector that projects the hidden layer into the output layer
b          the bias vector that projects the input layer into the hidden layer
p_j        the embedding of the historically interacted item j
q_i        the embedding of the target item i
a_ij       attention vector for items i and j
b_ij       item-level attention for items i and j
a_ijm      the m-th dimension of the factor-level attention vector for items i and j
s_ij       similarity score between items i and j
r_ui       user u's rating for item i
R+_u       user u's interacted item set
R+         positive instances set
R-         negative instances set

Item-based CF (ICF) predicts the preference of a user u for a target item i based on the similarity of i to all the items interacted by u in the past [54]. Formally, the prediction of ICF can be expressed as:

\hat{r}_{ui} = \sum_{j \in \mathcal{R}_u^+} s_{ij} \, r_{uj},   (1)

where \hat{r}_{ui} is the predicted rating, R+_u represents the set of items that user u has interacted with, s_ij measures the similarity between items i and j, and r_uj denotes the preference level of user u for item j. Notice that r_uj can be a real rating score (for explicit feedback) or a binary value 0 or 1 (for implicit feedback). The matrix form of Eq. 1 is:

\hat{R} = RS,   (2)

where R ∈ R^{U×I} is the original interaction matrix, and U and I represent the number of users and items, respectively. Each element r_ui ∈ R represents the rating score that user u gave to item i. \hat{R} ∈ R^{U×I} is the reconstructed interaction matrix based on the ICF model, and S ∈ R^{I×I} represents the item-item similarity matrix.

We can see that ICF can easily take the top similar or recently interacted items into the model for prediction [2, 15]. This nice property makes it suitable for online learning and real-time personalization. The key of ICF lies in how to accurately and efficiently compute the similarity score s_ij. Early ICF models usually adopt heuristic approaches, such as similarity measures like cosine similarity and the Pearson coefficient [51], or random walks on the user-item interaction bipartite graph [40]. Such approaches are designed in an intuitive way and lack tailored optimization for recommendation, resulting in suboptimal performance. In contrast, learning-based methods optimize ICF models with specially designed, recommendation-oriented objective functions to learn item similarities. In the next subsections, we sequentially introduce several learning-based ICF models, including SLIM [47], FISM [33], NAIS [27], and DeepICF [72].
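To make the basic ICF formulation concrete, the following minimal NumPy sketch scores items with Eq. 1 and Eq. 2 on a toy interaction matrix. The variable names and the hand-crafted similarity matrix are illustrative only; in the learning-based models introduced next, S (or the item embeddings) would be learned from data.

import numpy as np

# Toy setup: 4 users, 5 items, implicit feedback (1 = interacted, 0 = not observed).
R = np.array([[1, 0, 1, 0, 0],
              [0, 1, 1, 0, 1],
              [1, 1, 0, 0, 0],
              [0, 0, 1, 1, 0]], dtype=float)

# A placeholder symmetric item-item similarity matrix S (learned in SLIM/FISM/NAIS).
S = np.random.rand(5, 5)
S = (S + S.T) / 2.0
np.fill_diagonal(S, 0.0)          # an item should not recommend itself (diag(S) = 0)

# Eq. 2: reconstruct all predictions at once.
R_hat = R @ S

# Eq. 1 for a single user u and target item i: weighted sum over the interacted items.
u, i = 0, 4
r_ui = sum(S[i, j] * R[u, j] for j in np.flatnonzero(R[u]))
assert np.isclose(r_ui, R_hat[u, i])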
SLIM [47] is among the earliest attempts at learning-based ICF models, which learn the item-item similarity from the user-item interaction matrix. The basic idea is to minimize the reconstruction errors between the original user-item interaction matrix and the matrix reconstructed by the item-based CF model. Specifically, the objective function is formulated as:

\min_{S} \; \|R - RS\|_F^2 + \beta \|S\|_F^2 + \gamma \|S\|_1 \quad \text{subject to} \quad S \ge 0, \; \mathrm{diag}(S) = 0,   (3)

where ||·||_F is the matrix Frobenius norm, which is commonly used to prevent overfitting, and the ℓ1-norm regularization introduces sparsity to the model, i.e., enforcing that only a few items are similar to a given item in the solution. β and γ are constants to balance the Frobenius-norm and ℓ1-norm regularization, respectively. The non-negative constraint S ≥ 0 keeps the learned similarities non-negative, and diag(S) = 0 avoids the trivial solution (i.e., S being an identity matrix, which trivially minimizes ||R − RS||_F).

With the designed objective function optimized for recommendation, SLIM can achieve higher recommendation accuracy. However, with I×I elements in S, it is space- and time-consuming to learn the similarity matrix S, which makes SLIM unscalable and limits its application in real systems, considering the tens of millions of items in modern E-commerce platforms. Another limitation is that SLIM can only learn the similarity between items which have been co-interacted by users and thus fails to capture the transitive relations between items. To address the second limitation, HOSLIM [10] was designed to model high-order relations. In particular, it first mines itemsets which are frequently co-interacted by users, and then learns both item-item similarity and itemset-item similarity jointly. As it is a direct extension of SLIM and learns the similarity in the same manner, we omit the introduction of HOSLIM here. In the following, we introduce FISM, which adopts a different learning strategy.

SLIM directly learns the whole similarity matrix S, which causes unaffordable resource consumption in terms of both space and time. To address this limitation, FISM [33] first represents each item as a low-dimensional vector and then models the similarity between each pair of items by the inner product of their embedding vectors. Specifically, let p_i and q_j be the embedding vectors of items i and j, respectively; the similarity between i and j can then be computed by p_i^T q_j. In FISM, the prediction of a user u's preference for an item i is modeled as:

\hat{r}_{ui} = \frac{1}{(|\mathcal{R}_u^+| - 1)^{\alpha}} \sum_{j \in \mathcal{R}_u^+ \setminus \{i\}} \mathbf{p}_i^T \mathbf{q}_j,   (4)

where α ∈ [0, 1] is a predefined hyper-parameter to control the normalization effect, and R+_u \ {i} denotes all the items interacted by user u, excluding the current item i. Excluding i plays the same role as the constraint diag(S) = 0 in SLIM. Since the similarity is computed from the learned embeddings, FISM can estimate the similarity of two items even if they have never been co-interacted by users. To this end, FISM addresses the aforementioned limitations of SLIM and achieves better performance.

From Eq. 4, we can see that the preference of a user u towards a target item i depends on the aggregated similarities between i and all the items interacted by user u. In particular, when α = 0, the predicted rating is the summed similarity between i and all the items that u interacted with (R+_u \ {i}); and when α = 1, the predicted rating is the average similarity between i and the items in R+_u \ {i}.
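A minimal NumPy sketch of the FISM predictor in Eq. 4 is given below; the embedding matrices are random placeholders rather than learned parameters, and the function name is illustrative.

import numpy as np

d, n_items = 8, 100
P = np.random.randn(n_items, d) * 0.01   # target-side item embeddings p_i
Q = np.random.randn(n_items, d) * 0.01   # history-side item embeddings q_j

def fism_score(i, history, alpha=0.5):
    """Eq. 4: normalized sum of inner products between target i and the history items."""
    hist = [j for j in history if j != i]          # R_u^+ \ {i}
    if not hist:
        return 0.0
    coeff = len(hist) ** (-alpha)                  # (|R_u^+| - 1)^(-alpha) when i is in the history
    return coeff * sum(P[i] @ Q[j] for j in hist)

score = fism_score(i=3, history=[1, 3, 7, 42])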
The underlying assumption of FISM is that each item contributes equally to the preference prediction for the target item. However, this is often not true in practice, because some items are more relevant to the current target item than others. For example, to infer a user's preference for a "keyboard", the previously purchased "computer" plays a more important role in the prediction than a pair of "shoes" purchased by the user in the past. As different items are relevant to the current item at different levels, it is beneficial to assign different weights to the historically interacted items for more accurate prediction.

To capture the different contributions of historical items to the preference prediction of a user for the target item, NAIS [27] introduces the attention mechanism [4] to assign different weights to different items. Specifically, the prediction of NAIS is formulated as:

\hat{r}_{ui} = \sum_{j \in \mathcal{R}_u^+ \setminus \{i\}} a_{ij} \, \mathbf{p}_i^T \mathbf{q}_j,   (5)

where a_ij denotes the attentive weight assigned to the similarity s_ij, indicating the contribution of item j to the preference prediction for item i. An attentive neural network is used to automatically learn a_ij by taking p_i and q_j as input. Two different methods have been presented in NAIS to combine p_i and q_j, i.e., vector concatenation and element-wise product:

f_{concat}(\mathbf{p}_i, \mathbf{q}_j) = \mathbf{h}^T ReLU\Big(\mathbf{W} \big[\mathbf{p}_i ; \mathbf{q}_j\big] + \mathbf{b}\Big), \qquad f_{prod}(\mathbf{p}_i, \mathbf{q}_j) = \mathbf{h}^T ReLU(\mathbf{W} (\mathbf{p}_i \odot \mathbf{q}_j) + \mathbf{b}),   (6)

where W ∈ R^{d'×d} and b ∈ R^{d'} represent the weight matrix and bias vector of the attention network, respectively, d' denotes the size of the hidden layer, and h is the weight vector of the output layer of the attention network. ReLU [45] is used as the activation function. The a_ij is then normalized via a modified softmax function:

a_{ij} = \frac{\exp(f(\mathbf{p}_i, \mathbf{q}_j))}{\big[\sum_{j' \in \mathcal{R}_u^+ \setminus \{i\}} \exp(f(\mathbf{p}_i, \mathbf{q}_{j'}))\big]^{\beta}},   (7)

where β is a hyperparameter to smooth the denominator of the softmax function. The rationale lies in the fact that the number of a user's interacted items can vary in a wide range. As a result, the standard softmax normalization would overly punish the weights of active users, who have many more interacted items than inactive users. With a smaller β, the denominator is suppressed, which reduces the punishment on the attention weights of active users [27]. Notice that with the normalization of the modified softmax, the normalization term (|R+_u| − 1)^{−α} of FISM is discarded in NAIS.

With the item-level attention modeling (because the attentive weight, or contribution, is assigned to each historical item in NAIS, we call it item-level attention modeling), NAIS can distinguish the different importance of interacted items for the preference of a user towards the target item and thus achieves better performance.
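The following PyTorch sketch illustrates the NAIS-style item-level attention of Eq. 6-7 (product variant) and the prediction of Eq. 5. It is a simplified illustration, not the released NAIS implementation; layer sizes and the smoothing exponent are placeholder choices.

import torch
import torch.nn as nn

class ItemLevelAttention(nn.Module):
    """Sketch of the item-level attention (Eq. 6-7), element-wise product variant."""
    def __init__(self, d=16, d_hidden=8, beta=0.5):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d, d_hidden), nn.ReLU())
        self.h = nn.Linear(d_hidden, 1, bias=False)   # output weight vector h
        self.beta = beta

    def forward(self, p_i, Q_hist):
        # p_i: (d,) target item embedding; Q_hist: (n, d) embeddings of the user's history.
        f = self.h(self.mlp(p_i.unsqueeze(0) * Q_hist)).squeeze(-1)   # (n,) attention logits
        e = torch.exp(f)
        return e / (e.sum() ** self.beta)    # smoothed softmax over the history items (Eq. 7)

d = 16
att = ItemLevelAttention(d)
p_i, Q_hist = torch.randn(d), torch.randn(7, d)
a = att(p_i, Q_hist)
r_hat = (a * (Q_hist @ p_i)).sum()           # Eq. 5: weighted sum of inner products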
Both FISM and NAIS only focus on modeling second-order item relations; however, they ignore the higher-order item relations in the data and thus may yield suboptimal performance [10, 72]. In the next, we introduce the recently proposed DeepICF, which applies deep interaction layers to capture higher-order item relations.

There are two steps in the above introduced ICF models to predict the user preference: 1) item similarity estimation and 2) similarity aggregation. SLIM and FISM focus on the first step, i.e., proposing different learning methods to estimate the item similarity, and NAIS contributes to the second step by introducing weights on the similarities from different items in the aggregation. In [72], the authors argue that previous models fail to capture high-order item relations and propose the DeepICF model, which adopts a different strategy for preference prediction. Specifically, DeepICF first adopts a pairwise interaction layer to model the interaction between each historical item and the target item, and then introduces an attention-based pooling layer to assign different weights to the outputs (of the pairwise interaction layer) from different historical items. Next, the output of the previous two layers is fed into deep interaction layers, which consist of a multi-layer perceptron, to model the high-order interactions between items. Finally, a linear regression model is applied to predict the preference. In the following, we introduce each component in detail for a clear impression of the DeepICF model. The pairwise interaction layer and the attention-based pooling layer are expressed as:

\mathbf{e}_{ui} = \sum_{j \in \mathcal{R}_u^+ \setminus \{i\}} a_{ij} (\mathbf{p}_i \odot \mathbf{q}_j),   (8)

where ⊙ indicates the element-wise product and a_ij is the item-level attention, which denotes the contribution of item j to the user preference on item i. The attention is computed as:

a_{ij} = softmax'(\mathbf{h}^T ReLU(\mathbf{W} (\mathbf{p}_i \odot \mathbf{q}_j) + \mathbf{b})),   (9)

where W, b, and h are the weight matrix, bias vector, and output weight vector of the attention network, defined as in Eq. 6, and softmax' is the modified softmax function defined in NAIS (see Eq. 7). The deep interaction layers are stacked above the output of the interaction layer to model the higher-order item relations as follows:

\mathbf{e}_L = ReLU(\mathbf{W}_L (ReLU(\mathbf{W}_{L-1} \cdots ReLU(\mathbf{W}_1 \mathbf{e}_{ui} + \mathbf{b}_1) \cdots + \mathbf{b}_{L-1})) + \mathbf{b}_L),   (10)

where W_l, b_l, and e_l denote the weight matrix, bias vector, and output vector of the l-th hidden layer, respectively, and L is the total number of network layers. Finally, the prediction is obtained by a linear regression model:

\hat{r}_{ui} = \mathbf{V}^T \mathbf{e}_L + b_u + b_i,   (11)

where V is the weight vector for the prediction, and b_u and b_i are the user and item biases as in standard matrix factorization [38]. As we can see, different from previous models that predict the preference based on the aggregation of the similarities between the target item and the historical items, DeepICF first models the complicated interactions between the target item and the historical items (based on the attention-based pairwise interaction pooling and deep interaction modeling), and then uses a simple regression model on the interaction vector for preference prediction. Note that DeepICF also adopts the item-level attention (in the attention-based pooling layer), as NAIS does. Due to the high-order item relation modeling, DeepICF achieves better performance than NAIS.
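For illustration, the sketch below wires together DeepICF's pooling and deep interaction layers (Eq. 8, 10, 11). To keep it short, the item-level attention weights are passed in as an argument (e.g., computed by the sketch above); layer sizes and names are illustrative, not the authors' implementation.

import torch
import torch.nn as nn

class DeepICFBlock(nn.Module):
    """Sketch of DeepICF's attention-based pooling and deep interaction layers."""
    def __init__(self, d=16, hidden=(32, 16)):
        super().__init__()
        layers, in_dim = [], d
        for h in hidden:                              # stacked ReLU layers (Eq. 10)
            layers += [nn.Linear(in_dim, h), nn.ReLU()]
            in_dim = h
        self.deep = nn.Sequential(*layers)
        self.v = nn.Linear(in_dim, 1, bias=False)     # regression weights (Eq. 11)

    def forward(self, p_i, Q_hist, a, b_u=0.0, b_i=0.0):
        e_ui = (a.unsqueeze(-1) * (p_i * Q_hist)).sum(dim=0)   # Eq. 8 pooling
        e_L = self.deep(e_ui)                                   # Eq. 10
        return self.v(e_L).squeeze() + b_u + b_i                # Eq. 11

d = 16
block = DeepICFBlock(d)
p_i, Q_hist = torch.randn(d), torch.randn(5, d)
a = torch.full((5,), 0.2)       # e.g., uniform item-level attention, for the sketch only
r_hat = block(p_i, Q_hist, a)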
2.6 Motivation of Our Work

Despite the success of NAIS and DeepICF in considering the different contributions of historical items to the target item, we argue that the item-level attention cannot well capture the fine-grained preferences of users on items. The overall preference of a user towards an item depends on the user's attention to and satisfaction with different features (or factors) of the item [8]. For example, when a user cares more about the "taste" and "price" of a dinner, she will choose a restaurant mainly based on the price and taste of the food; in contrast, if the user pays more attention to the "service" and "ambience" of the restaurant when having dinner with friends, the service and ambience become the dominant factors. As we can see, for different items, a user may focus on different factors. The item-level attention cannot distinguish the importance of different factors, and thus cannot capture the fine-grained preference of a user for the different factors of an item. In the next section, we present our method to model the factor-level attention, which can be easily applied to existing ICF models for fine-grained preference modeling.
The underlying intuition of NAIS is that the more relevant a historical item is to the target item, the more important role it plays in the preference prediction. Therefore, NAIS introduces an attention mechanism to estimate the contribution of each historical item to the target item. The item-level attention only computes an attentive weight based on the overall relevance between the two items while ignoring the attention on different factors, and thus fails to capture the fine-grained user preference. It is well-known that an item is depicted by different factors [7] and that the preference of a user for an item often depends on a few factors, such as the director or actors of a movie. Therefore, a user u's preference for an item i depends on which factors of the item u attends to and whether those factors of the item fit the user's tastes. Based on this consideration, when considering the contribution of a historical item to the target item in preference prediction in ICF, it is better to measure the importance of the different factors of the historical item. In this section, we introduce a factor-level attention method, which considers the contribution of historical items at the factor level for ICF recommendation. We first introduce the general method of computing the factor-level attention between two items, and then introduce the method to consider both item-level and factor-level attention in ICF models.

In the embedding-based ICF models (such as FISM, NAIS, and DeepICF), items are mapped into a latent feature space and each item is represented by a vector in this space. Let p_i ∈ R^d and q_j ∈ R^d denote the feature vectors of the target item i and a historical item j, respectively, where d is the dimensionality of the latent space and each dimension can be regarded as a factor describing the items. Our intuition is that different factors of a historical item j contribute differently to a target item i. Take a toy example: suppose we only use two factors, "leading actors" and "director", to describe a movie. Given a historical movie m_0 of a user u, it has the same leading actors as but a different director from a movie m_1; and for another movie m_2, it has different leading actors but the same director. When predicting u's preference for m_1, the factor "leading actors" should play a more important role than "director"; but for predicting u's preference for m_2, the factor "director" will be more important. Therefore, for each historical item j ∈ R+_u \ {i}, our goal is to compute an attentive vector a_ij, in which each element {a_ijk | k = 1, ···, d} indicates the importance of the k-th factor of item j with respect to the target item i.

Notice that the value of a_ijk indicates the relative importance of the k-th factor; it depends on the similarity of the other factors between the two items i and j. For example, if all the factors of the two items are the same (e.g., the leading actors and directors are all the same for two movies), the factors are equally important. Inspired by the effectiveness of NAIS and DeepICF, we first apply the element-wise product to the vectors of the two items to obtain an interacted vector, i.e., p_i ⊙ q_j, where ⊙ indicates the element-wise product. We then follow the standard attention mechanism by applying a non-linear transformation to obtain the attentive weights:

\hat{\mathbf{a}}_{ij} = \mathbf{H}^{\top} ReLU(\mathbf{W} (\mathbf{p}_i \odot \mathbf{q}_j) + \mathbf{b}),   (12)

where W ∈ R^{d'×d} and b ∈ R^{d'} denote the weight matrix and bias vector of the attention network, respectively, H ∈ R^{d'×d} denotes the weight matrix of the output layer of the attention network, and d' denotes the size of the hidden layer. Because our goal is to compute the relative importance of the different factors of an item, the softmax function is then used to normalize the attentive weights:

a_{ijk} = \frac{\exp(\hat{a}_{ijk})}{\sum_{k'=1}^{d} \exp(\hat{a}_{ijk'})}.   (13)

In this design, the attentive weights of factors are computed based on the interaction between the vectors of the two items. Theoretically, other functions can also be applied to encode the interaction, for example, addition, subtraction, etc. We use the element-wise product because it is a generalization of the inner product to the vector space. Notice that each element of the interacted vector p_i ⊙ q_j is a product of the corresponding factors of the two vectors (i.e., p_ik · q_jk), and it can be regarded as the similarity of the corresponding factor between the two items. This is analogous to the inner product used to compute the similarity between two vectors.

The proposed factor-level attention looks similar to that of NAIS, because both methods compute the attention of a historical item to a target item. The difference is that NAIS computes a single attentive weight for the historical item, whereas we compute an attentive weight vector over all the factors of the historical item. Our method thus goes one step further than NAIS, which considers the contributions of different items, by considering the different contributions of the factors inside items. As aforementioned, our model computes the relative importance of each factor among all the factors of an item. Considering that historical items have different relevance levels to a target item, we should also consider the item-level attention simultaneously, because for a target item, factors with high attentive weights in an irrelevant item could contribute less than factors with relatively low attentive weights in a relevant item. In the next section, we introduce our designs to consider both item-level attention and factor-level attention in ICF models.
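The factor-level attention of Eq. 12-13 can be sketched in a few lines of PyTorch; the module below is an illustration under assumed sizes (d = 16, d' = 8) and is not the authors' released code.

import torch
import torch.nn as nn

class FactorLevelAttention(nn.Module):
    """Sketch of Eq. 12-13: one attentive weight per embedding dimension (factor)
    of each historical item, normalized over the factors of that item."""
    def __init__(self, d=16, d_hidden=8):
        super().__init__()
        self.W = nn.Linear(d, d_hidden)               # W, b in Eq. 12
        self.H = nn.Linear(d_hidden, d, bias=False)   # H^T in Eq. 12 (maps back to d factors)

    def forward(self, p_i, Q_hist):
        # p_i: (d,), Q_hist: (n, d) -> attention matrix A: (n, d), rows sum to 1 (Eq. 13)
        a_hat = self.H(torch.relu(self.W(p_i.unsqueeze(0) * Q_hist)))
        return torch.softmax(a_hat, dim=-1)

d = 16
fla = FactorLevelAttention(d)
p_i, Q_hist = torch.randn(d), torch.randn(7, d)
A = fla(p_i, Q_hist)        # A[j, k] = importance of factor k of history item j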
Design 1. To integrate both the item-level and factor-level attention in an ICF model, a straightforward method is to first use two attention networks to compute the two types of attention separately, and then combine them together. Fig. 1 shows the network of this design. Specifically, one network models the different importance of previous items, and the other network captures the different contributions of the factors inside items. For the item-level attention, the attentive weight b_ij of a historical item j to a target item i is computed according to the method described in NAIS, and we use the element-wise product variant in our implementation (note that in NAIS, the output layer of the attention network uses a weight vector h ∈ R^{d'}, see Eq. 6; the concatenation method in Eq. 6 could also be applied here, but we use the element-wise product for the ease of computing both the item-level and factor-level attention). Formally, b_ij is computed as follows:

v_{ij} = \mathbf{h}^T ReLU(\mathbf{W} (\mathbf{p}_i \odot \mathbf{q}_j) + \mathbf{b}),   (14)

b_{ij} = \frac{\exp(v_{ij})}{\big[\sum_{j' \in \mathcal{R}_u^+ \setminus \{i\}} \exp(v_{ij'})\big]^{\beta}}.   (15)

Here the parameters W, b, h and the variant of the softmax function (Eq. 15) are defined as they are in NAIS (i.e., Eq. 6 & 7).

Fig. 1. The structure of the item- and factor-level attention network for the first design.

For the factor-level attention, the attentive weight vector a'_ij is computed based on Eq. 12 and 13. The final attentive weight of a factor of item j is then weighted by the item-level attentive weight:

\mathbf{a}_{ij} = b_{ij} \cdot \mathbf{a}'_{ij}.   (16)

The intuition of this equation is easy to understand: if the historical item itself is irrelevant to the target item, the impact of its factors on predicting the user preference for the target item should also be small. This method is easy to understand, but the network structure is relatively complicated.
Design 2. Since our goal is still to assign an attentive weight vector to each historical item, we attempt to simplify the network structure of the first design and propose a fusion method, whose structure is shown in Fig. 2. In this method, the attentive weights of a historical item j's factors for a target item i are computed as:

\hat{\mathbf{a}}_{ij} = \mathbf{H}^{\top} ReLU(\mathbf{W}(\mathbf{p}_i \odot \mathbf{q}_j) + \mathbf{b}), \qquad a_{ijk} = softmax'(\hat{a}_{ijk}), \quad k = 1, 2, \cdots, d.   (17)

Fig. 2. The structure of the item- and factor-level attention network for the second design.

Similar to the calculation of the factor-level attention, we first model the interaction between the historical item and the target item using a nonlinear transformation upon the element-wise product of their embedding vectors. The computation of \hat{a}_{ij} is the same as in Eq. 12 and the notations are defined in the same way. The difference comes from the normalization part. For the factor-level attention inside an item in Design 1, the attentive weight of a factor is normalized over the weights of all the factors of this item. In this design, the weight of a factor is normalized over the weights of all the historical items on this particular factor. In particular, the final attentive weight of the k-th factor of an item j is obtained via a normalization based on the variant of the softmax function [27]:

softmax'(\hat{a}_{ijk}) = \frac{\exp(\hat{a}_{ijk})}{\big[\sum_{j' \in \mathcal{R}_u^+ \setminus \{i\}} \exp(\hat{a}_{ij'k})\big]^{\beta}}.   (18)

Comparing with Eq. 13, we can see that the importance of a factor of a historical item is evaluated against the same factor of all the historical items in the normalization. In this way, the computation of the factor-level attention takes the item-level effects into consideration. Notice that it is possible that the attentive weights of all the factors of an item are small because this item is not relevant to the target item. As in NAIS, the hyper-parameter β smooths the value of the denominator in the softmax; it helps regulate the weights of the item factors for users with different numbers of interacted items.

The mechanisms of the above two designs for considering both the item- and factor-level attention are different, and it is theoretically difficult to analyze which one works better in practice. The advantage of the second method is that the network structure is simple and computationally efficient. Besides, it has fewer parameters and is thus relatively more resistant to overfitting than the first method. We compare the recommendation performance of the two designs in the experiments.

With the attentive weight vector for each historical item j ∈ R+_u of a user u, the preference for a target item i based on the factor-level attentive method is predicted by:

\hat{r}_{ui} = \sum_{j \in \mathcal{R}_u^+ \setminus \{i\}} \mathbf{p}_i^T (\mathbf{a}_{ij} \odot \mathbf{q}_j).   (19)

From this equation, we can see that our model considers the influence of the different factors of all the historical items for the preference prediction.
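Design 2 and the prediction rule of Eq. 19 can be summarized in one small module; the PyTorch sketch below is an illustration with assumed sizes and a placeholder smoothing exponent, not the released implementation.

import torch
import torch.nn as nn

class FLADesign2(nn.Module):
    """Sketch of Design 2 (Eq. 17-19): per-factor weights normalized across the
    history items with a smoothed softmax."""
    def __init__(self, d=16, d_hidden=8, beta=0.5):
        super().__init__()
        self.W = nn.Linear(d, d_hidden)
        self.H = nn.Linear(d_hidden, d, bias=False)
        self.beta = beta

    def forward(self, p_i, Q_hist):
        # p_i: (d,), Q_hist: (n, d)
        a_hat = self.H(torch.relu(self.W(p_i.unsqueeze(0) * Q_hist)))   # (n, d), Eq. 17
        e = torch.exp(a_hat)
        A = e / (e.sum(dim=0, keepdim=True) ** self.beta)   # Eq. 18: per factor, over items
        return (A * Q_hist) @ p_i                            # per-item terms p_i^T(a_ij ⊙ q_j)

d = 16
model = FLADesign2(d)
p_i, Q_hist = torch.randn(d), torch.randn(7, d)
r_hat = model(p_i, Q_hist).sum()      # Eq. 19 prediction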
The proposed factor-level attention model can be easily applied to existing embedding-based ICF models. In this section, we show the application of our factor-level attentive (FLA) method to two recently proposed ICF models: NAIS [27] and DeepICF [72]. For ease of presentation, we name the two models equipped with our factor-level attention method FLA_NAIS and FLA_DICF, respectively.

FLA_NAIS. NAIS [27] considers the different contributions of historical items. The application of the factor-level attention to NAIS is the integration of the item- and factor-level attention. Therefore, the methods described in Section 3.2 are applied to compute the attentive weight vectors for the historical items (Eq. 16 or Eq. 18), and then the preference for the target item is predicted by Eq. 19.

FLA_DICF. DeepICF also considers the item-level attention. Similar to NAIS, the methods in Section 3.2 are used to compute the attentive weight vectors for the historical items. The attentive weight (a_ij) in Eq. 8 is replaced by the obtained weight vector (a_ij), and Eq. 8 becomes:

\mathbf{e}_{ui} = \sum_{j \in \mathcal{R}_u^+ \setminus \{i\}} \mathbf{p}_i \odot (\mathbf{a}_{ij} \odot \mathbf{q}_j).   (20)

We keep the other parts the same as DeepICF, and thus the preference is still predicted by Eq. 11.

In this work, we target top-n recommendation, which is a more practical task than rating prediction in real commercial systems [50]. It aims to recommend a set of n top-ranked items which match the target user's preferences. Similar to other rank-oriented recommendation work [25, 27, 62], we adopt the pairwise-based learning method for optimization. As we would like to validate the effectiveness of the proposed factor-level attention, we strictly follow the implicit feedback setting in the work of NAIS [27] and DeepICF [72], where each observed user-item interaction has a value of 1 and the non-observed user-item pairs have a value of 0. The recommendation task is also treated as a binary classification problem, and the objective function is as follows:

L = -\frac{1}{|\mathcal{R}^+| + |\mathcal{R}^-|} \Big( \sum_{(u,i) \in \mathcal{R}^+} \log \sigma(\hat{r}_{ui}) + \sum_{(u,i) \in \mathcal{R}^-} \log (1 - \sigma(\hat{r}_{ui})) \Big) + \lambda \|\Theta\|^2,   (21)

where R+ denotes the set of positive instances and R- denotes the negative one, in which each user-item instance is sampled from the non-interacted pairs; σ is the sigmoid function, which converts the predicted score \hat{r}_{ui} of user u and item i into a probability, constraining the result to (0, 1); λ is the parameter to control the effect of the ℓ2 regularization, which is used to prevent overfitting; and Θ represents all the trainable parameters, including p_i, q_j, H, W, and b. In addition, FLA_DICF has a multi-layer perceptron behind the attention network to model the high-order interactions, and Θ also contains its weight parameters.

Model training. We adopt Adagrad [19] to optimize the prediction model and update the model parameters. Because the objective function is non-convex, the optimization might be trapped in a local minimum, resulting in sub-optimal performance. Previous work has demonstrated that pre-training is particularly useful in practice for accelerating the training process and achieving better performance [26, 28]. We report the results with and without pre-training in the experiments (see Section 5.5).
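For concreteness, the objective of Eq. 21 with sampled negatives can be written as the small NumPy function below; the function name and the toy scores are illustrative, and the regularizer is shown over an arbitrary list of parameter arrays.

import numpy as np

def bce_loss(pos_scores, neg_scores, params, lam=1e-5):
    """Eq. 21: averaged log loss over positive and sampled negative instances,
    plus an L2 penalty on the trainable parameters. Inputs are raw scores r_hat."""
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    ll = np.sum(np.log(sig(pos_scores))) + np.sum(np.log(1.0 - sig(neg_scores)))
    n = len(pos_scores) + len(neg_scores)
    reg = lam * sum(np.sum(p ** 2) for p in params)
    return -ll / n + reg

# Example: 4 negatives sampled per positive, as in the experimental setting.
pos = np.array([2.1, 1.3])
neg = np.array([-0.5, 0.2, -1.1, 0.4, -0.9, -0.3, 0.1, -1.5])
loss = bce_loss(pos, neg, params=[np.random.randn(16), np.random.randn(16)])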
Table 2. Basic statistics of the experimental datasets.

Dataset    #Users    #Items    #Interactions    Sparsity
Patio       1,686       962        13,272        99.18%
Music       5,541     3,568        64,706        99.67%
Grocery    14,681     8,713       151,254        99.88%
Beauty     22,363    12,101       198,502        99.93%
Clothing   39,387    23,033       278,677        99.97%
Home       66,519    28,237       551,682        99.97%
We conducted extensive experiments on six publicly accessible datasets to evaluate the effectiveness of the proposed method. In particular, we mainly answer the following research questions.
• RQ1: Which design is better for integrating the item-level and factor-level attention, Design 1 or Design 2?
• RQ2: Are our proposed factor-level attention methods useful for providing more accurate recommendations? How do our factor-level attention enhanced methods perform in comparison with the state-of-the-art item-based CF methods?
• RQ3: How do the hyper-parameters, i.e., the embedding size d and the smoothing exponent β, affect the performance of the factor-level attention enhanced methods?
• RQ4: Is the pre-training strategy useful for our factor-level attention enhanced methods?
In what follows, we first present the experimental settings, and then answer the above questions sequentially based on the experimental results.
Datasets.
We adopt the widely used Amazon review dataset [46] as the benchmark for recommendation evaluation in our experiments. This dataset contains user interactions on items as well as item metadata from Amazon; in our experiments, we only use the interaction information. Each observed user-item interaction is treated as a positive instance; otherwise, it is negative. Six product categories from this dataset are used in the evaluation, as shown in Table 2. We use the 5-core version of the dataset, which means that every user and item in the dataset has at least 5 interactions. The basic statistics of the six categories are also shown in Table 2. As we can see, the selected datasets are of different sizes and sparsity levels, which allows us to evaluate the performance of the proposed method for item recommendation under different scenarios.
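For readers who want to rebuild such a subset from raw review data, the following sketch shows a common way to obtain a k-core subset (here k = 5 would match the setting above). The iterative filtering logic is standard practice and is not a description of the authors' exact preprocessing script.

from collections import Counter

def k_core(interactions, k=5):
    """Iteratively drop users and items with fewer than k interactions.
    `interactions` is a list of (user, item) pairs."""
    pairs = list(interactions)
    while True:
        u_cnt, i_cnt = Counter(u for u, _ in pairs), Counter(i for _, i in pairs)
        kept = [(u, i) for u, i in pairs if u_cnt[u] >= k and i_cnt[i] >= k]
        if len(kept) == len(pairs):      # fixed point: every remaining user/item has >= k
            return kept
        pairs = kept

toy = [("u1", "i1"), ("u1", "i2"), ("u2", "i1"), ("u2", "i2"), ("u3", "i1")]
print(k_core(toy, k=2))                  # u3 and its single interaction are dropped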
Evaluation Protocols.
As our main focus in this work is to study whether the factor-level attention can enhance the performance of item-based CF models, we strictly follow the same evaluation protocol as the one used in NAIS [27] and DeepICF [72] to study the performance of item recommendation. Specifically, the latest interaction of each user is held out as the testing data, the second latest interaction is reserved as the validation data, and the remaining interactions are used for training. For each user, we randomly sampled 99 items (negative instances) which have not been interacted with by this user to pair with the testing item (positive instance). In the testing stage, each studied recommendation model predicts the preference scores for the 100 items (1 positive and 99 negative instances). The performance is evaluated by two widely used metrics: Hit Ratio (HR) [16] and Normalized Discounted Cumulative Gain (NDCG) [16]. For each metric, the performance of the recommendation methods is judged by the top n results. In particular, HR@n is a recall-based metric, measuring whether the test item is in the top-n positions of the recommendation list; NDCG@n emphasizes the quality of the ranking, assigning higher scores to top-ranked items by taking the positions of correctly recommended items into consideration. The reported results are the average values across all the testing users based on the top 10 results (i.e., n = 10).
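The sampled-negative evaluation just described boils down to a few lines; the sketch below, with an illustrative scoring function, computes HR@10 and NDCG@10 for one test case under this protocol.

import numpy as np

def hr_ndcg_at_n(rank, n=10):
    """HR@n and NDCG@n for a single test case, where `rank` is the 0-based position
    of the held-out positive item among the 100 scored candidates."""
    hr = 1.0 if rank < n else 0.0
    ndcg = 1.0 / np.log2(rank + 2) if rank < n else 0.0
    return hr, ndcg

def evaluate_user(score_fn, pos_item, neg_items, n=10):
    """Score the positive test item against the 99 sampled negatives and rank it."""
    candidates = [pos_item] + list(neg_items)           # 1 positive + 99 negatives
    scores = np.array([score_fn(i) for i in candidates])
    rank = int((scores > scores[0]).sum())              # items scored higher than the positive
    return hr_ndcg_at_n(rank, n)

# Toy usage with a random scorer; real usage would call the trained model instead.
rng = np.random.default_rng(0)
hr, ndcg = evaluate_user(lambda i: rng.random(), pos_item=7, neg_items=range(100, 199))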
As the main contribution of this work is to advocate the importance of considering the factor-level attention in recommendation, especially for item-based CF methods, we mainly compare our factor-level attention enhanced methods with the state-of-the-art ICF models in the empirical studies. Specifically, we compare our methods with the following baselines.
• BPR [50] is a popular pair-wise learning method, which employs a Bayesian Personalized Ranking loss to optimize the matrix factorization model. It is a basic baseline with competitive performance on the top-n recommendation task and has been widely used in empirical studies to evaluate newly developed methods.
• MLP [28] uses a multi-layer perceptron above the user and item embeddings to replace the inner product for recommendation. Following the setting in [27], we also use a three-layer MLP and optimize the point-wise log loss in the experiments.
• SLIM [47] is the earliest learning-based item-based CF model. It learns an item-item similarity matrix to reconstruct the user-item interaction matrix, as shown in Eq. 3.
• FISM [33] is a pioneering learning-based ICF model which directly learns item embeddings, as formulated in Eq. 4. In the experiments, we carefully tuned α from 0 to 1 with a step size of 0.1 and report the best result for each experimental dataset.
• NAIS [27] is a state-of-the-art item-based CF model. It considers the different effects of historical items on the target item and applies the attention mechanism to model the item-level attention.
• DeepICF [72] is a recently proposed deep ICF method which can capture the high-order interactions between items. It stacks perceptron layers above the interactions between items and adopts a linear regression for the final prediction.
BPR is a traditional yet competitive CF model based on matrix factorization for the top-n recommendation task. MLP is a state-of-the-art neural-network-based CF method proposed in recent years. Both methods are widely used as baselines in many studies, and they serve as basic references to show the performance of other methods. SLIM is the representative of the earliest learning-based ICF methods. NAIS and DeepICF are the main baselines compared with FLA_NAIS and FLA_DICF to study the effects of the factor-level attention in recommendation.
Parameter Settings.
All the considered methods in the experiments use the pair-wise learning strategy. For fair comparison, we paired each positive instance in the training set with four randomly sampled negative instances to train all methods. Four embedding sizes (d ∈ {8, 16, 32, 64}) are tested in the experiments. The learning rate is searched in [0.01, 0.001, 0.0001, 0.00001]. The smoothing parameter β is tuned in the range of [0.1, 0.9] with a step size of 0.2 for NAIS, DeepICF, and our methods. The best results of each method on the test datasets are reported below; unless the value of β is particularly specified, the reported results are obtained with the same fixed β. Pre-training is adopted for NAIS, FLA_NAIS, DeepICF, and FLA_DICF. The benefits of pre-training for NAIS and DeepICF have been demonstrated in [27] and [72], respectively; we also study its effects on FLA_NAIS and FLA_DICF in Section 5.5.

Fig. 3. Performance of different designs for combining item- and factor-level attentions in two ICF models: NAIS and DeepICF.
In this section, we report the performance of the two designs for combining item-level and factor-level attention in ICF models, namely Design 1 and Design 2 as described in Section 3. Figure 3 shows the performance (in terms of HR and NDCG) of applying the two designs to NAIS and DeepICF (i.e., FLA_NAIS and FLA_DICF) on the six evaluation datasets.

From the results, we can see that for the NAIS model, Design 2 obtains consistently and slightly better results than Design 1; for the DeepICF model, Design 2 yields much better performance than Design 1 on five of the six datasets, the exception being the "Music" dataset. The better performance of Design 2 is largely attributed to its simple structure with fewer parameters, making the model easier to train and more resistant to overfitting. Because DeepICF has many more parameters than NAIS, the improvement of Design 2 over Design 1 in DeepICF is larger than that in NAIS. The marginally better performance of Design 1 over Design 2 in DeepICF on the Music dataset might be due to the relatively denser interactions between users and items on this dataset, which needs further validation. Because of the better performance of Design 2 in our experiments, in the following sections, all the reported results of FLA_NAIS and FLA_DICF are based on Design 2.
In this section, we compare the performance of our factor-level attention (FLA) enhanced ICF models with all the adopted competitors. The results of all compared methods over all the test datasets are reported in Table 3 in terms of HR@10 and NDCG@10. For a fair comparison, the reported results are based on the same embedding size d = 16 for all the methods.

Table 3. Performance of HR@10 and NDCG@10 of the compared approaches at embedding size 16. Notice that the values are reported as percentages with '%' omitted.

Methods     Patio          Music          Grocery        Beauty         Clothing       Home
            HR     NDCG    HR     NDCG    HR     NDCG    HR     NDCG    HR     NDCG    HR     NDCG
BPR         36.71  19.91   66.40  40.69   50.00  29.80   48.65  30.60   39.50  22.99   43.02  25.18
MLP         33.39  16.99   59.00  35.14   47.75  28.18   42.42  24.23   34.40  19.49   45.34  26.89
SLIM        36.06  21.76   56.99  40.76   42.14  28.63   39.34  27.11   27.23  18.50   28.89  18.59
FISM        31.32  15.37   56.40  33.70   49.38  29.65   49.48  30.21   40.55  24.00   48.93  29.44
NAIS        40.21  21.68   68.76  43.96   54.58  33.77   54.02  34.75   45.03  27.95   50.02  30.71
DeepICF
FLA_NAIS
FLA_DICF

First, we would like to validate the effect of the factor-level attention on enhancing the performance of ICF models by comparing FLA_NAIS and FLA_DICF to NAIS and DeepICF, respectively. From the table, we can observe that with the consideration of factor-level attention, NAIS and DeepICF achieve better performance in most cases across the six datasets, which are of different scales and sparsity. Note that both NAIS and DeepICF already consider the different importance of items to the target item (i.e., item-level attention); the better performance of FLA_NAIS and FLA_DICF demonstrates that differentiating the contributions of different factors (i.e., factor-level attention) can further improve the performance. The results validate our main assumption that users attend to different factors of different items, and that the incorporation of factor-level attention into ICF models is beneficial. Another interesting observation is that NAIS consistently outperforms DeepICF with a large margin, especially on the smaller datasets (note that the "Patio" dataset is the densest one among all the datasets, but it is too small for training a deep model, which is also demonstrated by the results of MLP and DeepICF in Table 3). This is because DeepICF adopts deep networks, which often require large-scale data in training for good performance. As we can see, with the increase of the data scale, the performance of DeepICF becomes closer to that of NAIS. Besides the scale of the dataset, the sparse interactions of most users in the training data also negatively affect the performance of deep-learning-based models (see the performance of MLP), because we adopted the 5-core version of those Amazon datasets in the experiments. NAIS is a direct extension of FISM that considers the item-level attention, and we can see a large improvement of NAIS over FISM; with the additional factor-level attention, FLA_NAIS only slightly outperforms NAIS. This is because NAIS itself is a very competitive ICF model (achieving a large improvement over FISM). More importantly, FLA_NAIS attempts to capture the factor-level preference of users on items, which requires more interactions or side information to model such fine-grained preference. In this work, only the interaction information is exploited and the interactions are fairly sparse for most users (most users have fewer than 10 interactions in the training data), so it is difficult to model the fine-grained preference well, resulting in a marginal improvement of FLA_NAIS over NAIS. Despite the limited information in the training data, we can still observe a consistent performance improvement from a small modification of the ICF models, i.e., replacing the item-level attention (a scalar weight) with our proposed attention network (a weight vector), which is encouraging.

Second, we compare the performance of all the adopted methods. There are some interesting findings. 1) BPR is very competitive when the datasets are small, such as "Patio", "Music", and "Grocery"; it performs the best besides NAIS and FLA_NAIS on "Patio" and "Music". SLIM learns a complete item-item similarity matrix to reconstruct the interaction matrix, and it also performs well when the dataset is small. Remember that the main drawbacks of SLIM are its scalability and its generalization capability to un-interacted items; for larger datasets, FISM yields better performance than SLIM. 2) Because deep models often need large-scale training data for good performance, MLP and DeepICF do not perform well on small datasets. MLP surpasses BPR on the largest dataset "Home", and DeepICF cannot compete with BPR, MLP, SLIM, and FISM on the two smallest datasets (e.g., "Patio" and "Music"), even when it is equipped with both item- and factor-level attention. (To validate this viewpoint, we conducted another experiment on a much larger dataset from which users/items with fewer than 20 interactions were removed; on that dataset, we observed better performance of DeepICF over NAIS. Because this is not the focus of this study, we omit the result here.) 3) NAIS consistently performs best among the baselines, which indicates the importance of differentiating the contributions of different items. Note that the performance of our factor-level attention enhanced models depends on the performance of the backbone models. Although FLA_DICF obtains much better results than DeepICF, it is still inferior to NAIS in this experiment. Overall, the best performance is obtained by FLA_NAIS, which further enhances the performance of NAIS with the consideration of factor-level attention.

Fig. 4. Performance of HR@10 and NDCG@10 w.r.t. the embedding size on three datasets (Grocery, Clothing, and Home).
In this section, we analyze the influence of two hyper-parameters, i.e., the embedding size d and the smoothing exponent β, on the performance of our factor-level attention enhanced ICF models.

Effect of embedding size. To analyze the effect of the embedding size on the performance improvement brought by the factor-level attention module, we test FLA_NAIS and FLA_DICF against their counterparts with respect to different embedding sizes. The results on the three relatively larger datasets are reported in Figure 4. Firstly, we have a clear observation which is consistent with many previous studies: the accuracy of all models increases with a larger embedding size, which is attributed to the increased representation capability. Note that when the embedding size keeps increasing, there is a risk of overfitting, which has not been observed in this study because the largest embedding size in our experiments is 64. A more interesting observation is that our factor-level attention enhanced models obtain a larger performance gain with a smaller embedding size. This observation is more consistent for the improvement of FLA_NAIS over NAIS. The underlying reason is that when the embedding size, i.e., the number of item factors, is smaller, it is relatively easier for the attention network to learn good factor-level attention weights (because there are fewer factors). Therefore, our models benefit more from the factor-level effects for user preference modeling, leading to better performance. Compared to NAIS, DeepICF is more difficult to train; as a result, the performance gain of FLA_DICF over DeepICF is not very stable, although the largest gain on the three datasets is also achieved when the embedding size is 8.

Fig. 5. Performance of NAIS and FLA_NAIS with different smoothing exponent β.

Effect of the smoothing exponent.
Because of the different numbers of historical items across users, using the standard softmax normalization can excessively penalize the attention weights of active users with a long history. We use the performance of NAIS and FLA_NAIS to demonstrate the effects of the smoothing exponent 𝛽. We omit the results of DeepICF and FLA_DICF, as they adopt the same smoothing strategy and similar results are observed. Figure 5 shows the performance of NAIS and FLA_NAIS with different 𝛽. We can see that FLA_NAIS consistently outperforms NAIS, and the general trends of the performance change for both methods are similar, indicating that the smoothing effects are similar for the two methods. The optimal value of 𝛽 depends on the target dataset; 0.7 appears to be a good choice across all the datasets. Note that 𝛽 = 1 means that the standard softmax is used to normalize the attention weights. As pointed out in [27], the standard setting does not work well because of the large variance of the lengths of user histories. We can observe a dramatic performance degradation when 𝛽 = 0.9, indicating that the penalty on the attention weights of active users is already insufficiently reduced. This also demonstrates the importance of smoothing the denominator of the softmax function when computing attention weights on user behavior data.
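To make the smoothing operation concrete, the following minimal NumPy sketch shows how such an exponent 𝛽 can be applied to the softmax denominator, following the formulation used in NAIS [27]; it is our own illustration rather than the authors' code, and the example scores are hypothetical.

import numpy as np

def smoothed_softmax(scores, beta=0.7):
    # scores: unnormalized attention scores over one user's historical items
    exp_scores = np.exp(scores)
    # beta = 1 recovers the standard softmax; beta < 1 damps the denominator,
    # so the attention weights of users with long histories are penalized less.
    return exp_scores / np.power(exp_scores.sum(), beta)

# toy example: attention scores for a user with five historical items
weights = smoothed_softmax(np.array([0.8, 0.2, 1.5, -0.3, 0.6]), beta=0.7)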
As pre-training has been widely used in model training and has demonstrated good performance, we also employ this technique in our experiments. To demonstrate the effects of pre-training, we compare the factor-level attention enhanced models with pre-training (denoted by FLA_NAIS/w and FLA_DICF/w) and without pre-training (denoted by FLA_NAIS/o and FLA_DICF/o). In our implementation, we used the user/item embeddings learned by FISM as model initialization for both FLA_NAIS and FLA_DICF.

Table 4. Performance of FLA_NAIS and FLA_DICF with (/w) and without (/o) pre-training at embedding size 16.
Methods        Patio          Music          Grocery        Beauty         Clothing       Home
               HR     NDCG    HR     NDCG    HR     NDCG    HR     NDCG    HR     NDCG    HR     NDCG
FLA_NAIS/o     37.60  19.88   64.25  40.53   52.26  31.69   48.35  29.67   37.37  22.18   46.64  27.85
FLA_NAIS/w
FLA_DICF/o     23.96  11.64   34.31  19.46   33.04  18.74   33.06  19.79   18.05  9.07    40.30  24.07
FLA_DICF/w
Fig. 6. Performance of FLA_NAIS and FLA_DICF with (/w) and without (/o) pre-training at embedding size 16, at each epoch.
For FLA_NAIS/o and FLA_DICF/o, the hyper-parameters have been tuned separately. Note that we could also use the embeddings learned by NAIS and DeepICF as model initialization for FLA_NAIS and FLA_DICF. However, because NAIS and DeepICF themselves also need pre-training for faster convergence and better performance [27, 72], it is cumbersome to use their learned embeddings in practice. Therefore, we used the embeddings learned by FISM as the pre-training results for simplicity and consistency. The comparison results with and without pre-training are shown in Table 4. It can be seen that with pre-training, the performance of both methods is significantly improved. When the model is initialized randomly, it is more easily trapped in local minima, which hurts its performance. Beyond the performance improvements, pre-training can also accelerate convergence. Figure 6 shows the convergence rate of FLA_NAIS and FLA_DICF with (/w) and without (/o) pre-training. We find that on the three datasets Grocery, Beauty, and Home, convergence is faster with pre-training than without it. The models with pre-training basically converge by the tenth epoch, while those without pre-training take longer. For the Home dataset, the performance without pre-training even drops after 20 epochs, because with random initialization it is more difficult to find a good solution.
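As an illustration of this pre-training strategy, the sketch below initializes the item embedding tables of a factor-level attention model from embeddings learned by FISM before fine-tuning. It is our own simplified PyTorch code, not the released implementation; the class structure, embedding sizes, and the fism_p/fism_q matrices are hypothetical placeholders.

import torch
import torch.nn as nn

class FactorLevelAttentionICF(nn.Module):
    # Minimal skeleton: only the embedding tables relevant to pre-training are shown.
    def __init__(self, num_items, embed_dim=16):
        super().__init__()
        self.history_embed = nn.Embedding(num_items, embed_dim)  # embeddings of historical items
        self.target_embed = nn.Embedding(num_items, embed_dim)   # embeddings of target items
        # ... item-level and factor-level attention layers omitted ...

def init_from_fism(model, fism_p, fism_q):
    # fism_p, fism_q: (num_items, embed_dim) item embedding matrices learned by FISM
    with torch.no_grad():
        model.history_embed.weight.copy_(torch.as_tensor(fism_p, dtype=torch.float32))
        model.target_embed.weight.copy_(torch.as_tensor(fism_q, dtype=torch.float32))

model = FactorLevelAttentionICF(num_items=10000, embed_dim=16)  # random initialization: the "/o" variant
# init_from_fism(model, fism_p, fism_q)                         # FISM initialization: the "/w" variant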
Collaborative filtering (CF) [31, 37, 48] has long been recognized as an effective approach to recommendation over the past decades. Depending on which side of the interactions is modeled, CF methods can be classified into two categories: user-based CF (UCF) and item-based CF (ICF). The former recommends to a user the items favored by her similar users, and the latter recommends to a user the items that are similar to the items she liked in the past. UCF has been extensively studied in both academia and industry. A typical UCF method is matrix factorization (MF) [38], which represents users and items as feature vectors in the same embedding space based on the user-item interactions, and then predicts the preference of a user for an item by an interaction function (i.e., the inner product) between their embedding vectors. This simple idea achieved great success in the Netflix contest, and many variants have been developed later on, such as WRMF [31], SVD++ [36], BPR [50], and NeuMF [28]. Although UCF has achieved significant progress, a big limitation is that UCF models need to be re-trained when new interactions come in, which is unacceptable for real-time recommender systems [27, 72]. In contrast, ICF predicts the user preference for a target item by estimating the similarity scores between the previously interacted items of this user and the target one, which enables ICF to easily incorporate new interactions into the preference modeling. Owing to this nice property of easy online updating, ICF models are favored by industry and have been widely adopted in real recommender systems [17, 24, 63].

Early ICF models leverage heuristic metrics, such as the cosine similarity [51] or the Pearson correlation coefficient [35], to calculate the item similarity, which requires quite a lot of manual tuning when adapting to a brand-new dataset. To tackle this limitation, several data-driven methods have been proposed [12, 70]. For example, SLIM [47] learns a complete item-item similarity matrix by minimizing the errors between the reconstructed rating matrix and the ground truth. However, the transductive relations are omitted since only co-interacted items are considered. FISM [33] adopts the inner product between the embeddings of the historical items and the target item for prediction. To model the preference of like-minded user subsets, Christakopoulou et al. [11] proposed a global and local SLIM (GLSLIM) method, which applies different SLIM models to capture the preferences of different user subsets. An early neural network based ICF model is CDAE [70], which learns the item similarity using a nonlinear auto-encoder architecture. He et al. [27] observed that the historically interacted items of a user contribute differently to the prediction of the user's preference for the target item; they therefore developed an attention-based method, NAIS, which assigns different weights to the historical items to better capture user preferences. Christakopoulou et al. [10] pointed out that high-order item relations also provide valuable information for user preference modeling.
They proposed a higher-order sparse linear method (HOSLIM), which extends the SLIM model to learn the item-itemset similarity for capturing higher-order relations. More recently, Xue et al. [72] proposed the DeepICF model, which captures higher-order item relations by stacking multiple nonlinear layers over the second-order item relations. Despite the great progress achieved by these ICF models, they have not explicitly considered users' diverse intents towards different items. In this paper, we make an effort to model user diverse intents at the factor level (i.e., each factor is considered as an intent dimension) in ICF models and propose a factor-level attention method to enhance the performance of ICF models.

.2 Attention-based Recommendation

The attention mechanism has been widely used in deep learning methods and has achieved great success in many tasks in computer vision and natural language processing. With the widespread application of deep learning in recommendation, this technique has also been used in various ways in recommender systems to model user preference more accurately, and many attention-based recommender systems have been developed. A comprehensive survey of attention-based recommender systems is out of the scope of this paper. In this section, we briefly review three paradigms of using the attention mechanism in recommender systems.
Item-level attention.
As discussed, historically interacted items contribute differently to modeling a user's preference. Therefore, it is important to assign different weights to the items for more accurate recommendation [20, 75, 76]. NAIS [27] and DeepICF [72] are typical examples of this paradigm. Besides the ICF models, item-level attention has also been used in graph convolutional network (GCN) based recommender systems. The core of GCN-based recommendation models is that the embeddings of users/items are iteratively updated by aggregating information from their local neighbors (i.e., interacted items/users) [25, 32, 73]. The attention mechanism is introduced to differentiate the contributions of neighboring nodes in the user/item embedding learning process [42, 65, 67]. Another task where item-level attention is widely applied is session-based recommendation. Because the interacted items in a session are typically sparse, it is crucial to identify the important items for user intent inference [60]. A general framework is to use a recurrent neural network to learn the hidden states of the items inside a session, followed by an attention model on the items' hidden representations to capture the main purpose of the user [39, 57]. Recently, self-attention blocks, such as the Transformer [58] and BERT [18], have also been applied to session-based recommendation [1, 34, 56].
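As a concrete illustration of this item-level paradigm in ICF, the following PyTorch sketch, written in the spirit of NAIS [27] but as our own simplification rather than its released code, scores a target item by an attention-weighted sum of its similarities to the user's historical items, using the smoothed softmax discussed earlier; the tensor shapes and names are assumptions.

import torch

def item_level_icf_score(hist_emb, target_emb, att_scores, beta=0.5):
    # hist_emb: (n, d) embeddings of the user's n historical items
    # target_emb: (d,) embedding of the target item
    # att_scores: (n,) unnormalized attention scores of the historical items w.r.t. the target
    weights = torch.exp(att_scores) / torch.exp(att_scores).sum() ** beta  # smoothed softmax
    similarities = hist_emb @ target_emb     # similarity of each historical item to the target
    return (weights * similarities).sum()    # item-level attentive aggregation

# toy usage with random embeddings for a user with five historical items
score = item_level_icf_score(torch.randn(5, 16), torch.randn(16), torch.randn(5))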
Feature-level attention in side information.
The attention mechanism has become a standard component in side-information-enriched recommender systems, where it is used to extract effective features from the side information to represent item characteristics or user preferences. The most widely used side information is reviews and user/item attributes. Early methods only use the attention mechanism to assign different weights at the review level for learning user and item embeddings [3, 43, 69]. Later on, review-aware recommender systems exploit the reviews at a finer-grained level by applying the attention mechanism in a hierarchical manner [13, 44]: 1) first attending to the important words of a review (i.e., word level) to learn better review representations, and then 2) assigning different weights to the review representations for user and item embedding learning. Beyond this two-layer hierarchical attention design, Wu et al. [68] proposed to additionally encode sentence-level attention in the review and developed a three-tier attention network for recommendation. Besides, there are also aspect-aware attention-based recommendation models [9, 22], which extract aspects from the reviews and then assign weights to different aspects in the user preference modeling.

Attribute information is often used in factorization machines (FM) [49] and graph-based models [42], especially knowledge graph (KG) based recommendation models [23, 30, 53, 59, 61]. A representative attention-based FM model is AFM [71], which learns the importance of each feature interaction from data via a neural attention network. In KG-based recommender systems, the attributes of items/users are taken as node entities in the graph. There are two typical ways of applying KGs: embedding-based and meta-path based. In the embedding-based models, such as KGAT [61], RippleNet [59], AKGE [52], and A2-GCN [42], the attention mechanism is often used to learn the importance of neighbor nodes during the embedding propagation. For meta-path based approaches, the attention mechanism can be applied inside a meta-path to learn the representations of meta-paths, or applied directly to attend to different meta-paths. A typical meta-path based recommendation approach is MCRec [30], which first uses the attention mechanism to learn the representations of meta-paths and then applies it to assign weights to different meta-paths for the final user representation learning.

The attention mechanism is also used in visual-aware and multimedia recommendation. For example, Chen et al. [4] proposed the ACF model for multimedia recommendation, in which a component-level attention model is used to capture the user's different preferences on different components, e.g., certain actions in a video, and an item-level attention model is leveraged to treat historically interacted items differently. In [5], a visually explainable recommendation model is presented to capture user attention on different regions of images based on attentive neural networks.

Factor-level attention.
Different from the above methods, we use the attention mechanism to attend to each factor of an item embedding, aiming to capture users' diverse intents towards various items. In other words, the attention weights are assigned to different factors of the target item embedding to capture the user's specific preference on this item. From this perspective, the most similar method is A3NCF [7]. For each user-item pair, A3NCF learns attentive weights for each factor by feeding the user's and the item's embeddings, as well as their text-based representations learned from reviews, into an attentive neural network. Note that there is a big difference between A3NCF and the method presented in this work: A3NCF is a user-based CF model which learns the attentive weights based on the target user and item embeddings, whereas our method is designed for item-based CF methods, which do not model user embeddings.
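To make this distinction concrete, the following PyTorch sketch illustrates the general idea of a factor-level attention module for ICF. It is our own simplified illustration rather than the exact architecture proposed in this paper; the network sizes, the softmax normalization over factors, and the tensor names are assumptions.

import torch
import torch.nn as nn

class FactorLevelAttention(nn.Module):
    # Learns one weight per embedding dimension (factor) for each (historical item, target item) pair.
    def __init__(self, embed_dim, hidden_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, hist_emb, target_emb):
        # hist_emb: (n, d) embeddings of the user's historical items
        # target_emb: (d,) embedding of the target item
        target = target_emb.expand_as(hist_emb)  # broadcast the target embedding to (n, d)
        factor_w = torch.softmax(self.net(torch.cat([hist_emb, target], dim=-1)), dim=-1)
        # Re-weight each factor of the element-wise interaction before item-level aggregation.
        return factor_w * hist_emb * target

weighted_interactions = FactorLevelAttention(embed_dim=16)(torch.randn(5, 16), torch.randn(16))

Here hist_emb * target is the element-wise interaction used by NAIS-style models, and the learned factor weights rescale each of its dimensions; an item-level attention module (not shown) would still decide how much each historical item contributes to the final prediction.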
The underlying rationale of factor-level attention is that a user's intents towards different items can be diverse. Traditional recommender systems often represent a user's preference with a fixed embedding vector, which is then used to match the vectors of different items for preference prediction. This process does not differentiate the user's intents on different items. In recent years, researchers have started to pay attention to modeling the diverse preferences of users towards different items and have proposed several methods. Cheng et al. [6–8] proposed to model user intents on different aspects of items. They first applied topic models on side information (e.g., reviews and images) to analyze user interests on different aspects of items. These aspects are then linked to the factors of user/item embeddings (learned by matrix factorization [6, 8] or neural networks [7]). For a target user-item pair, a unique weight vector is learned to represent this user's attention on different factors of the target item. This unique weight vector is expected to capture the user's intent (e.g., on which aspects) towards the target item. Following this idea, Chin et al. [9] presented an end-to-end neural recommendation model called ANR, which exploits the review information to model users' diverse preferences on different aspects of items. Later on, Liu et al. [41] presented a metric-learning based recommendation model, which uses an attentive neural network to estimate the user's attention on different aspects of the target item by exploiting the item's multimodal features (e.g., reviews and images).

To take users' diverse preferences on items into consideration, another line of work dynamically adapts the target user's or item's embedding to accurately predict the user's preference for the target item. For example, CMN [20] adapts the target user embedding based on the selected most influential neighbor users, whose influence scores are computed according to the target item. MARS [75] adopts a different strategy, which adapts the user embedding based on the most influential item vectors with respect to the target item. In contrast, DIN [76] adapts the target item embedding based on the user's previously purchased items. More recently, the disentangled representation learning approach has been applied in recommendation for disentangled embedding learning. A representative method is the disentangled graph collaborative filtering (DGCF) method proposed by Wang et al. [64]. In this method, different intents are represented as different chunks of the embedding vector, and a distance correlation regularization is applied to make those chunked representations independent. Different from this method, DisenHAN [66] learns disentangled representations by aggregating aspect features from different meta relations in a heterogeneous information network (HIN).

Apparently, the method presented in this work is fundamentally different from the above methods. Our method predicts the user's preference for the target item by attending to each factor in the item embedding vectors of the historical items, whereas all the above methods fall into the user-based CF approach and use the learned user embedding to analyze the user's intent towards the target item.

In this work, we advocate the importance of modeling users' diverse intents towards items in recommendation and present a factor-level attention method for ICF models. The proposed model distinguishes the contributions of different factors of a historical item to the target item for prediction.
In this way, our model captures user intents at the factor level of item embeddings. In addition, we design a light attention neural network to combine the item- and factor-level attention for neural ICF models. It is model-agnostic and easy to implement in ICF models. To show its effectiveness, we apply it to the recently proposed NAIS and DeepICF models and evaluate it on six Amazon datasets. The superior performance over several competitive baselines demonstrates the benefit of modeling the impact of different factors (in item embeddings) for recommendation. We hope this work can shed light on modeling user preference at a finer-grained level (such as the factor level) to capture users' diverse intents in adopting items, and can motivate more research in this direction in the future. Because modeling user preference at such a fine-grained level typically needs more data, an interesting future study is to exploit rich side information, such as reviews and knowledge graphs, in the modeling. In addition, how to leverage the fine-grained preference modeling to provide better interpretations for recommendations is also worth studying.
REFERENCES
[1] Pham Hoang Anh, Ngo Xuan Bach, and Tu Minh Phuong. 2019. Session-Based Recommendation with Self-Attention. In
Proceedings of the Tenth International Symposium on Information and Communication Technology . ACM, 1–8.[2] Immanuel Bayer, Xiangnan He, Bhargav Kanagal, and Steffen Rendle. 2017. A generic coordinate descent frameworkfor learning from implicit feedback. In
Proceedings of the 26th International Conference on World Wide Web . 1341–1350.[3] Chong Chen, Min Zhang, Yiqun Liu, and Shaoping Ma. 2018. Neural Attentional Rating Regression with Review-levelExplanations. In
Proceedings of the 2018 World Wide Web Conference on World Wide Web . ACM, 1583–1592.[4] Jingyuan Chen, Hanwang Zhang, Xiangnan He, Liqiang Nie, Wei Liu, and Tat-Seng Chua. 2017. Attentive collaborativefiltering: Multimedia recommendation with item-and component-level attention. In
Proceedings of the 40th internationalACM SIGIR conference on Research and development in Information Retrieval . ACM, 335–344.[5] Xu Chen, Hanxiong Chen, Hongteng Xu, Yongfeng Zhang, Yixin Cao, Zheng Qin, and Hongyuan Zha. 2019. Per-sonalized Fashion Recommendation with Visual Explanations based on Multimodal Attention Network: TowardsVisually Explainable Recommendation. In
Proceedings of the 42nd International ACM SIGIR Conference on Research andDevelopment in Information Retrieval . ACM, 765–774.[6] Zhiyong Cheng, Xiaojun Chang, Lei Zhu, Rose Catherine Kanjirathinkal, and Mohan S. Kankanhalli. 2019. MMALFM:Explainable Recommendation by Leveraging Reviews and Images.
ACM Trans. Inf. Syst.
37, 2 (2019), 16:1–16:28.[7] Zhiyong Cheng, Ying Ding, Xiangnan He, Lei Zhu, Xuemeng Song, and Mohan S Kankanhalli. 2018. A3NCF: AnAdaptive Aspect Attention Model for Rating Prediction.. In
Proceedings of the Twenty-Seventh International JointConference on Artificial Intelligence . Morgan Kaufmann, 3748–3754.[8] Zhiyong Cheng, Ying Ding, Lei Zhu, and Kankanhalli Mohan. 2018. Aspect-aware latent factor model: Rating predictionwith ratings and reviews. In
Proceedings of the 27th International Conference on World Wide Web Companion. IW3C2, 639–648.
[9] Jin Yao Chin, Kaiqi Zhao, Shafiq Joty, and Gao Cong. 2018. ANR: Aspect-based neural recommender. In
Proceedings ofthe 2018 ACM on Conference on Information and Knowledge Management . ACM, 147–156.[10] Evangelia Christakopoulou and George Karypis. 2014. Hoslim: Higher-order sparse linear method for top-n recom-mender systems. In
Pacific-Asia Conference on Knowledge Discovery and Data Mining . Springer, 38–49.[11] Evangelia Christakopoulou and George Karypis. 2016. Local item-item models for top-n recommendation. In
Proceedingsof the 10th ACM Conference on Recommender Systems . 67–74.[12] Evangelia Christakopoulou and George Karypis. 2018. Local latent space models for top-n recommendation. In
Proceedings of the 24th ACM SIGKDD international conference on Knowledge discovery and data mining . ACM, 1235–1243.[13] Dawei Cong, Yanyan Zhao, Bing Qin, Yu Han, Murray Zhang, Alden Liu, and Nat Chen. 2019. Hierarchical attentionbased neural network for explainable recommendation. In
Proceedings of the 2019 on International Conference onMultimedia Retrieval . ACM, 373–381.[14] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for youtube recommendations. In
Proceedingsof the 10th ACM conference on recommender systems . ACM, 191–198.[15] James Davidson, Benjamin Liebald, Junning Liu, Palash Nandy, Taylor Van Vleet, Ullas Gargi, Sujoy Gupta, Yu He,Mike Lambert, Blake Livingston, et al. 2010. The YouTube video recommendation system. In
Proceedings of the fourthACM conference on Recommender systems . 293–296.[16] Mukund Deshpande and George Karypis. 2004. Item-based top-N Recommendation Algorithms. In
ACM Transactionson Information Systems , Vol. 22. 143–177.[17] Mukund Deshpande and George Karypis. 2004. Item-based top-n recommendation algorithms.
ACM Transactions onInformation Systems
22 (2004), 143–177.[18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep BidirectionalTransformers for Language Understanding. In
Proceedings of the 2019 Conference of the North American Chapter of theAssociation for Computational Linguistics: Human Language Technologies . NAACL, 4171–4186.[19] John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochasticoptimization.
Journal of machine learning research
12 (2011), 2121–2159.[20] Travis Ebesu, Bin Shen, and Yi Fang. 2018. Collaborative memory network for recommendation systems. In
Proceedingsof the 41st international ACM SIGIR conference on Research and development in Information Retrieval . ACM, 515–524.[21] Chantat Eksombatchai, Pranav Jindal, Jerry Zitao Liu, Yuchen Liu, Rahul Sharma, Charles Sugnet, Mark Ulrich, and JureLeskovec. 2018. Pixie: A system for recommending 3+ billion items to 200+ million users in real-time. In
Proceedings ofthe 27th International Conference on World Wide Web Companion . ACM, 1775–1784.[22] Xinyu Guan, Zhiyong Cheng, Xiangnan He, Yongfeng Zhang, Zhibo Zhu, Qinke Peng, and Tat-Seng Chua. 2019.Attentive Aspect Modeling for Review-Aware Recommendation.
ACM Trans. Inf. Syst.
37, 3 (2019).[23] Qingyu Guo, Fuzhen Zhuang, Chuan Qin, Hengshu Zhu, Xing Xie, Hui Xiong, and Qing He. 2020. A Survey onKnowledge Graph-Based Recommender Systems.
CoRR abs/2003.00911 (2020).
[24] Taolin Guo, Junzhou Luo, Kai Dong, and Ming Yang. 2019. Locally differentially private item-based collaborative filtering.
Information Sciences
502 (2019), 229–246.[25] Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yong-Dong Zhang, and Meng Wang. 2020. LightGCN: Simplifyingand Powering Graph Convolution Network for Recommendation. In
Proceedings of the 43rd international ACM SIGIRconference on Research and development in Information Retrieval . 639–648.[26] Xiangnan He, Zhankui He, Xiaoyu Du, and Tat-Seng Chua. 2018. Adversarial personalized ranking for recommendation.In
Proceedings of the 41st international ACM SIGIR conference on Research and development in Information Retrieval .ACM, 355–364.[27] Xiangnan He, Zhankui He, Jingkuan Song, Zhenguang Liu, Yu-Gang Jiang, and Tat-Seng Chua. 2018. NAIS: Neuralattentive item similarity model for recommendation.
IEEE Transactions on Knowledge and Data Engineering
30 (2018),2354–2366.[28] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering.In
Proceedings of the 26th international conference on world wide web . ACM, 173–182.[29] Jonathan L Herlocker, Joseph A Konstan, Al Borchers, and John Riedl. 1999. An Algorithmic Framework for PerformingCollaborative Filtering. In
Proceedings of the 22nd international ACM SIGIR conference on Research and development inInformation Retrieval . ACM, 230–237.[30] Binbin Hu, Chuan Shi, Wayne Xin Zhao, and Philip S. Yu. 2018. Leveraging Meta-path based Context for Top- NRecommendation with A Neural Co-Attention Model. In
Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1531–1540.
[31] Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative filtering for implicit feedback datasets. In Proceedings of the 8th IEEE International Conference on Data Mining. IEEE, 263–272.
[32] Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques.
ACM Transactions onInformation Systems
20, 4 (2002), 422–446.[33] Santosh Kabbur, Xia Ning, and George Karypis. 2013. Fism: factored item similarity models for top-n recommendersystems. In
Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 659–667.
[34] Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In Proceedings of the 2018 IEEE International Conference on Data Mining. IEEE, 197–206.
[35] Yehuda Koren. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In
Proceedingsof the 14th ACM SIGKDD international conference on Knowledge discovery and data mining . ACM, 426–434.[36] Yehuda Koren. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In
Proceedingsof the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining . ACM, 426–434.[37] Yehuda Koren. 2010. Factor in the neighbors: Scalable and accurate collaborative filtering.
ACM Transactions on Knowledge Discovery from Data.
[38] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42 (2009), 30–37.
[39] Jing Li, Pengjie Ren, Zhumin Chen, Zhaochun Ren, Tao Lian, and Jun Ma. 2017. Neural attentive session-based recommendation. In
Proceedings of the 2017 ACM on Conference on Information and Knowledge Management . ACM,1419–1428.[40] David C Liu, Stephanie Rogers, Raymond Shiau, Dmitry Kislyuk, Kevin C Ma, Zhigang Zhong, Jenny Liu, and YushiJing. 2017. Related pins at pinterest: The evolution of a real-world recommender system. In
Proceedings of the 26thInternational Conference on World Wide Web Companion . 583–592.[41] Fan Liu, Zhiyong Cheng, Changchang Sun, Yinglong Wang, Liqiang Nie, and Mohan S. Kankanhalli. 2019. User DiversePreference Modeling by Multimodal Attentive Metric Learning. In
Proceedings of the 27th ACM International Conferenceon Multimedia . ACM, 1526–1534.[42] Fan Liu, Zhiyong Cheng, Lei Zhu, Chenghao Liu, and Liqiang Nie. 2021. Aˆ 2-GCN: An Attribute-aware AttentiveGCN Model for Recommendation.
IEEE Trans. Knowl. Data Eng. (2021), to appear.[43] Hongtao Liu, Wenjun Wang, Hongyan Xu, Qiyao Peng, and Pengfei Jiao. 2020. Neural Unified Review Recommendationwith Cross Attention. In
Proceedings of the 43rd International ACM SIGIR conference on research and development inInformation Retrieval . ACM, 1789–1792.[44] Hongtao Liu, Fangzhao Wu, Wenjun Wang, Xianchen Wang, Pengfei Jiao, Chuhan Wu, and Xing Xie. 2019. NRPA:Neural Recommendation with Personalized Attention. In
Proceedings of the 42nd International ACM SIGIR Conferenceon Research and Development in Information Retrieval . ACM, 1233–1236.[45] Andrew L. Maas, Awni Y. Hannum, and Andrew Y. Ng. 2013. Rectifier nonlinearities improve neural network acousticmodels. In
Proceedings of the 30th International Conference on Machine Learning .[46] Julian McAuley and Jure Leskovec. 2013. Hidden factors and hidden topics: understanding rating dimensions withreview text. In
Proceedings of the 7th ACM Conference on Recommender Systems. ACM, 165–172.
[47] Xia Ning and George Karypis. 2011. SLIM: Sparse linear methods for top-n recommender systems. In Proceedings of the 11th IEEE International Conference on Data Mining. IEEE, 497–506.
[48] Rong Pan, Yunhong Zhou, Bin Cao, Nathan N Liu, Rajan Lukose, Martin Scholz, and Qiang Yang. 2008. One-class collaborative filtering. In Proceedings of the 8th IEEE International Conference on Data Mining. IEEE, 502–511.
[49] Steffen Rendle. 2010. Factorization machines. In Proceedings of the 10th IEEE International Conference on Data Mining. IEEE, 995–1000.
[50] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In
Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence .AUAI, 452–461.[51] Badrul Munir Sarwar, George Karypis, Joseph A Konstan, John Riedl, et al. 2001. Item-based collaborative filteringrecommendation algorithms.
Proceedings of the 10th International Conference on World Wide Web. ACM, 285–295.
[52] Xiao Sha, Zhu Sun, and Jie Zhang. 2019. Attentive Knowledge Graph Embedding for Personalized Recommendation. CoRR abs/1910.08288 (2019). http://arxiv.org/abs/1910.08288
[53] Chuan Shi, Binbin Hu, Wayne Xin Zhao, and Philip S. Yu. 2019. Heterogeneous Information Network Embedding for Recommendation.
IEEE Trans. Knowl. Data Eng.
31, 2 (2019), 357–370.[54] Brent Smith and Greg Linden. 2017. Two decades of recommender systems at Amazon. com.
IEEE Internet Computing 21 (2017), 12–18.
[55] Xiaoyuan Su and Taghi M Khoshgoftaar. 2009. A survey of collaborative filtering techniques. Advances in Artificial Intelligence.
[56] IEEE Access.
[57] In Proceedings of the 10th ACM conference on recommender systems. ACM, 17–22.
[58] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In
Advances in neural information processing systems . MIT Press, 5998–6008.[59] Hongwei Wang, Fuzheng Zhang, Jialin Wang, Miao Zhao, Wenjie Li, Xing Xie, and Minyi Guo. 2019. ExploringHigh-Order User Preference on the Knowledge Graph for Recommender Systems.
ACM Trans. Inf. Syst.
37, 3 (2019),32:1–32:26.[60] Shoujin Wang, Longbing Cao, Yan Wang, Quan Z. Sheng, Menmet Orgun, and Defu Lian. 2019. A Survey on Session-based Recommender Systems.
CoRR abs/1902.04864 (2019). arXiv:1902.04864 http://arxiv.org/abs/1902.04864[61] Xiang Wang, Xiangnan He, Yixin Cao, Meng Liu, and Tat-Seng Chua. 2019. KGAT: Knowledge Graph AttentionNetwork for Recommendation. In
Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery& Data Mining . ACM, 950–958.[62] Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. 2019. Neural graph collaborative filtering.In
Proceedings of the 42nd international ACM SIGIR conference on Research and development in Information Retrieval .165–174.[63] Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. 2019. Neural graph collaborative filtering. In
Proceedings of the 42nd international ACM SIGIR conference on Research and development in Information Retrieval . ACM,165–174.[64] Xiang Wang, Hongye Jin, An Zhang, Xiangnan He, Tong Xu, and Tat-Seng Chua. 2020. Disentangled Graph Collabora-tive Filtering. In
Proceedings of the 43rd International ACM SIGIR conference on research and development in InformationRetrieval . ACM, 1001–1010.[65] Xiao Wang, Ruijia Wang, Chuan Shi, Guojie Song, and Qingyong Li. 2020. Multi-Component Graph ConvolutionalCollaborative Filtering. In
The Thirty-Fourth AAAI Conference on Artificial Intelligence . AAAI Press, 6267–6274.[66] Yifan Wang, Suyao Tang, Yuntong Lei, Weiping Song, Sheng Wang, and Ming Zhang. 2020. DisenHAN: Disentan-gled Heterogeneous Graph Attention Network for Recommendation. In
The 29th ACM International Conference onInformation and Knowledge Management . ACM, 1605–1614.[67] Yinwei Wei, Zhiyong Cheng, Xuzheng Yu, Zhou Zhao, Lei Zhu, and Liqiang Nie. 2019. Personalized Hashtag Recom-mendation for Micro-videos. In
Proceedings of the 27th ACM International Conference on Multimedia . ACM, 1446–1454.[68] Chuhan Wu, Fangzhao Wu, Junxin Liu, and Yongfeng Huang. 2019. Hierarchical User and Item Representationwith Three-Tier Attention for Recommendation. In
Proceedings of the 2019 Conference of the North American Chapterof the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019 . Association forComputational Linguistics, 1818–1826.[69] Libing Wu, Cong Quan, Chenliang Li, Qian Wang, Bolong Zheng, and Xiangyang Luo. 2019. A Context-AwareUser-Item Representation Learning for Item Recommendation.
ACM Trans. Inf. Syst.
37, 2 (2019), 22:1–22:29.[70] Yao Wu, Christopher DuBois, Alice X Zheng, and Martin Ester. 2016. Collaborative denoising auto-encoders for top-nrecommender systems. In
Proceedings of the Ninth ACM International Conference on Web Search and Data Mining . ACM,153–162.[71] Jun Xiao, Hao Ye, Xiangnan He, Hanwang Zhang, Fei Wu, and Tat-Seng Chua. 2017. Attentional Factorization Machines:Learning the Weight of Feature Interactions via Attention Networks. In
Proceedings of the Twenty-Sixth InternationalJoint Conference on Artificial Intelligence . ijcai.org, 3119–3125.[72] Feng Xue, Xiangnan He, Xiang Wang, Jiandong Xu, Kai Liu, and Richang Hong. 2019. Deep Item-based CollaborativeFiltering for Top-N Recommendation.
ACM Transactions on Information Systems
37 (2019), 33.[73] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. 2018. GraphConvolutional Neural Networks for Web-Scale Recommender Systems. In
Proceedings of the 24th ACM SIGKDDInternational Conference on Knowledge Discovery & Data Mining . ACM, 974–983.[74] Hanwang Zhang, Fumin Shen, Wei Liu, Xiangnan He, Huanbo Luan, and Tat-Seng Chua. 2016. Discrete collaborativefiltering. In
Proceedings of the 39th international ACM SIGIR conference on Research and development in Information Retrieval. ACM, 325–334.
[75] Lei Zheng, Chun-Ta Lu, Lifang He, Sihong Xie, Huang He, Chaozhuo Li, Vahid Noroozi, Bowen Dong, and S Yu Philip. 2019. MARS: Memory attention-aware recommender system. In . IEEE, 11–20.
[76] Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1059–1068.