UPRec: User-Aware Pre-training for Recommender Systems
Chaojun Xiao, Ruobing Xie, Yuan Yao, Zhiyuan Liu, Maosong Sun, Xu Zhang, Leyu Lin
Abstract—Existing sequential recommendation methods rely on large amounts of training data and usually suffer from the data sparsity problem. To tackle this, the pre-training mechanism has been widely adopted, which attempts to leverage large-scale data to perform self-supervised learning and transfer the pre-trained parameters to downstream tasks. However, previous pre-trained models for recommendation focus on leveraging universal sequence patterns from user behaviour sequences and item information, while ignoring the heterogeneous user information that has been shown effective for personalized recommendation. In this paper, we propose a method to enhance pre-trained models with heterogeneous user information, called User-aware Pre-training for Recommendation (UPRec). Specifically, UPRec leverages user attributes and structured social graphs to construct self-supervised objectives in the pre-training stage, introducing two user-aware pre-training tasks. Comprehensive experimental results on several real-world large-scale recommendation datasets demonstrate that UPRec can effectively integrate user information into pre-trained models and thus provide more appropriate recommendations for users.
Index Terms—Recommender System, Pre-training, User Information, Sequential Recommendation

• Chaojun Xiao, Yuan Yao, Zhiyuan Liu (corresponding author) and Maosong Sun are with the Department of Computer Science and Technology, Institute for Artificial Intelligence, Tsinghua University, Beijing, China, 100084. E-mail: [email protected], [email protected], {lzy,sms}@tsinghua.edu.cn
• Ruobing Xie, Xu Zhang and Leyu Lin are with the WeChat Search Application Department, Tencent, China, 100089. E-mail: [email protected], {xuonezhang,goshawklin}@tencent.com
• This work was finished during Chaojun Xiao's internship at the WeChat Search Application Department, Tencent.
• This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
1 INTRODUCTION

With the rapid development of various online platforms, large amounts of online items expose users to information overload. Recommender systems aim to accurately characterize users' interests and provide recommendations according to their profiles and historical behaviors. The application of recommender systems makes it possible for users to obtain useful information efficiently, and has thus received great attention in recent years [1].

In many real-world scenarios, users' preferences are intrinsically dynamic and evolve over time, which makes it challenging to recommend appropriate items for users.
Sequential recommendation focuses on capturing users' long- and short-term preferences and recommending the next items based on their chronological behaviours [2], [3], [4]. A main line of work attempts to obtain expressive user representations with sequential models, such as recurrent neural networks [2], [5], convolutional neural networks [6], and self-attention [3]. Some researchers further seek to enhance these neural sequential models with rich contextual information, such as item attributes and knowledge graphs [7], [8], [9].

These works achieve promising results in generating personalized recommendations. However, they rely on sufficient user behavior data for training and usually suffer from the data sparsity problem [10], [11]. A similar problem also exists in the field of natural language processing (NLP). To tackle this, many efforts have been devoted to conducting self-supervised pre-training on large-scale unlabelled corpora [12], [13]. It has been proven that pre-trained models can effectively capture complicated sequence patterns from large-scale raw data and transfer the knowledge to various downstream NLP tasks, especially in the few-shot setting [14], [15].

Inspired by the success of pre-trained language models in NLP, many researchers propose to utilize pre-trained models, especially BERT (Bidirectional Encoder Representations from Transformers) [12], to derive user representations from behaviour sequences in recommendation tasks [16], [17], [18]. Similar to the masked language model, these works pre-train the model with a cloze-style task, which randomly masks some items in the behaviour sequences and requires the model to reconstruct the masked ones [19]. Furthermore, researchers seek to conduct more effective pre-training with various learning mechanisms [17], [18], [20], and some works attempt to leverage item side information in the pre-training stage, including item attributes [18] and knowledge graphs [21]. These works have shown that pre-trained models can capture complex sequence patterns and generate expressive user representations even with sparse data.

However, these works mainly focus on adopting the cloze-style task on behaviour sequences and ignore the abundant heterogeneous user information. Different from language understanding in NLP, which focuses on learning general language knowledge, recommender systems should not only leverage universal sequence patterns but also capture the personalized interests of each user. Therefore, it is necessary and indispensable to exploit user information for pre-trained recommender systems. Previous works have shown that user information contains rich clues that can help models capture users' interests and further help recommender systems alleviate the data sparsity problem [22], [23]. For instance, users in the same age group tend to like similar songs in music recommendation [24], and users who are friends tend to have similar behaviours in social recommendation [25].

Therefore, in this work, we propose to enhance pre-trained recommendation models with various user information. To this end, we need to tackle the problem of heterogeneous information integration. Figure 1 shows an example from the YELP online platform. The user information is complicated and consists of various types of data, including sequential behaviour data, structured social graphs, and tabular user attributes. The formats of the three types of data are quite different from each other, leading to three diverse semantic spaces. How to design special pre-training objectives to fuse these spaces together is an important problem.

[Figure 1]
Fig. 1. An example of heterogeneous user information from YELP. Each user can post reviews for different items, and users are socially connected with each other to form a social graph. Besides, there are various attributes, including profiles and behaviour properties, for all users. The rich user information can help recommender systems better capture user preferences.

To overcome this challenge, we propose User-aware Pre-training for Recommendation
(UPRec), which uses the same encoder to align the symbolic spaces of social graphs and user attributes with the semantic space of behaviour sequences under a pre-training framework. In particular, for the sequential behaviour data, we adopt the cloze-style Mask Item Prediction task to learn item and user representations from bidirectional context, following previous works [17], [19]. Based on these representations, we propose two simple and effective user-aware pre-training tasks to leverage the two types of symbolic user information: (1) User Attribute Prediction: we argue that users' behaviour sequences can reflect their attributes to some extent, and in this task we require the model to predict users' attributes given the user representations, which injects attribute knowledge into the model. (2) Social Relation Detection: this task is specially designed to incorporate social graphs into pre-training and aims to make the representations of socially connected users similar. Given the representations of different users, we require the model to detect the social relations between them. Via these pre-training tasks, we can effectively fuse various kinds of user information and train a user-aware pre-trained model.

Moreover, to verify the effectiveness of UPRec, we conduct comprehensive experiments on two real-world sequential recommendation datasets from different domains. We also evaluate the performance of UPRec on a downstream task, user profile prediction. Experimental results show that both user-aware pre-training tasks help UPRec capture user interests more accurately and achieve performance improvements.

To summarize, we make the following noteworthy contributions in this paper:

• To the best of our knowledge, we are the first to systematically integrate heterogeneous user information, including user attributes, sequential behaviors and social graphs, under the pre-training recommendation framework.
• We propose two effective user-aware pre-training tasks, user attribute prediction and social relation detection, which are plug-and-play and can be easily adopted in various recommendation scenarios.
• We perform comprehensive experiments on two real-world recommendation datasets, and the experimental results demonstrate the effectiveness of our proposed model. The source code of this paper will be released to promote improvements in recommender systems.
2 RELATED WORK
In this section, we introduce previous works related to ours from three aspects: general recommendation, sequential recommendation, and pre-training for recommendation.
2.1 General Recommendation

Recommender systems aim to estimate user interests and recommend items that users may like [1], [26]. Existing recommendation models can be divided into two categories: collaborative filtering and content-based models.

Collaborative filtering (CF) focuses on capturing user preferences based on historical feedback, such as clicks and likes. One typical class of CF is matrix factorization, which decomposes the user-item interaction matrix to obtain user and item vectors; the preference scores are then estimated as the inner product between the user and item vectors [27], [28], [29]. Some works estimate the similarity between different items and recommend items that are similar to those the user has interacted with before [30], [31], [32]. With the development of deep learning, various model architectures have been introduced to learn these representations, such as multi-layer perceptrons [33] and auto-encoders [34].
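To make the inner-product formulation concrete, the following minimal PyTorch sketch scores user-item pairs as the dot product of learned embeddings. It is purely illustrative (the sizes and names are our own assumptions, not code from any cited work):

```python
import torch

# Users and items are embedded in a shared d-dimensional space; the
# preference score is the inner product of the two embeddings.
n_users, n_items, d = 1000, 5000, 64
user_emb = torch.nn.Embedding(n_users, d)
item_emb = torch.nn.Embedding(n_items, d)

def score(user_ids: torch.Tensor, item_ids: torch.Tensor) -> torch.Tensor:
    """Predicted preference of each user for each paired item."""
    return (user_emb(user_ids) * item_emb(item_ids)).sum(dim=-1)

# Example: scores of user 0 for items 1 and 2.
print(score(torch.tensor([0, 0]), torch.tensor([1, 2])))
```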
Content-based models aim to integrate items' and users' auxiliary information into recommender models. These works mainly enrich item or user representations by utilizing neural models to encode side information, such as text [35], [36], images [37], [38], and social graphs [23].
2.2 Sequential Recommendation

Sequential recommendation aims to capture users' dynamic preferences from their chronological user-item sequences [39]. Early works mainly rely on the Markov chain method, which predicts the next item by estimating the item-item transition probability matrix [40], [41]. Further, some researchers employ high-order Markov chains to consider more items in the sequences [42], [43].

Recently, inspired by the powerful representation ability of various neural models, sequential neural models have been widely adopted in recommendation. For instance, some works propose to encode user behaviour sequences with various recurrent neural networks, including Gated Recurrent Units (GRU) [7], Long Short-Term Memory networks (LSTM) [5] and other effective variants [44], [45], [46]. Besides, other powerful neural models have also been introduced for recommendation. Tang et al. [6] utilize convolutional neural networks to capture sequential patterns with both horizontal and vertical convolutional filters. Kang et al. [3] and Sun et al. [19] introduce the multi-head self-attention mechanism to model behaviour sequences. Though these approaches achieve remarkable results in sequential recommendation, they neglect the rich information about users. To tackle this issue, some works [23], [47] incorporate social relations into sequential recommendation. Despite the success of these models, the abundant heterogeneous user information has not been fully utilized for user-item sequence modelling.
2.3 Pre-training for Recommendation

Pre-training aims to learn useful representations from large-scale data that benefit specific downstream tasks. Recently, the pre-training mechanism has achieved great success in many computer vision tasks [48], [49], [50] and natural language processing tasks [12], [13], [51]. In the field of recommender systems, pre-training has also received great attention. Early works attempt to apply pre-trained models to leverage side information to enrich representations for users or items directly. According to the type of side information, various pre-trained models are required. For instance, some researchers utilize pre-trained word embeddings for textual data [52], [53], pre-trained knowledge graph embeddings for knowledge graphs [45], [54], [55], and pre-trained network embeddings for social graphs [56], [57]. By leveraging side information, these approaches can construct expressive representations for users and items, and thus achieve performance gains for recommender systems.

Recently, inspired by the rapid progress of pre-trained language models in natural language processing [12], [13], many efforts have been devoted to designing self-supervised pre-trained models that capture information from user behaviour sequences [21]. Sun et al. [19] and Chen et al. [16] propose to train a deep bidirectional encoder by predicting randomly masked items in sequences for sequential recommendation. Xie et al. [20] further propose a contrastive pre-training framework for sequential recommendation. Besides, some works attempt to utilize item side information, e.g., item attributes, in pre-training with mutual information maximization [18] or graph neural networks [58]. And Yuan et al. [17] propose to fine-tune large-scale pre-trained networks with parameter-efficient grafting neural networks. These works achieve significant improvements in user modeling and recommendation tasks. As they mainly focus on utilizing item information or target other recommendation tasks, they cannot be directly applied in this paper.

Different from previous works, we focus on constructing pre-training signals from user information, including user profiles and social relations. To the best of our knowledge, we are the first to exploit diverse user information in pre-training for recommender systems.
3 METHODOLOGY
In this section, we introduce our proposed user-aware pre-training framework for recommendation (UPRec), which incorporates various user information into the pre-trained model. The overview of UPRec is shown in Figure 3. We employ BERT [12] as our sequence encoder and utilize three objectives to pre-train the encoder. Following previous works [12], [17], [18], we adopt mask item prediction as our basic pre-training task to capture complex sequence patterns. Besides, in order to make full use of the abundant user information, we propose two user-aware pre-training tasks, user attribute prediction and social relation detection, which leverage user attributes and social relations, respectively.

In the following subsections, we first introduce notations and our sequence encoder, BERT [12]. Then we describe in detail how we utilize the three tasks to train UPRec.
3.1 Notations

Let \mathcal{U} denote the user set and \mathcal{I} denote the item set. For each user u \in \mathcal{U}, we use s_u = \{i_1^u, \ldots, i_n^u\} to represent his/her chronologically-ordered interaction sequence, where i_j^u \in \mathcal{I}, 1 \le j \le n, and n is the sequence length. Let R_u denote the set of users who are socially connected with u. Besides, each user is associated with several attributes A_u = \{a_1^u, \ldots, a_m^u\}. The attributes can be very diverse. For instance, for users from the YELP platform, we can adopt the numerical average rating of all their posted reviews, as well as their gender and region, as attributes.

3.2 Sequence Encoder

Sequential recommendation aims to exploit the user's chronological interaction sequence for next-item recommendation. Inspired by the great success of pre-trained deep bidirectional transformers (BERT) in NLP [12], [13], many researchers have begun to leverage BERT-based models to capture information from user behaviour sequences [16], [19]. Following previous works, we adopt BERT as our basic module to encode the behaviour sequences. BERT is a stack of an embedding layer and L bidirectional transformer layers.
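For concreteness, the notation above can be mirrored by a simple data structure. The following sketch is purely illustrative; the class and field names are our own, not part of the released code:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Union

# Hypothetical container mirroring the paper's notation: for each user u we
# keep the chronologically ordered interaction sequence s_u, the attribute
# set A_u (numerical or discrete values), and the socially connected users R_u.
@dataclass
class UserRecord:
    user_id: int
    sequence: List[int]                              # s_u = {i_1^u, ..., i_n^u}, item ids
    attributes: Dict[str, Union[float, str]]         # A_u, e.g. {"avg_star": 3.57, "gender": "F"}
    friends: Set[int] = field(default_factory=set)   # R_u, ids of connected users

# Example record in the spirit of Figure 1 (all values are illustrative).
u = UserRecord(user_id=42,
               sequence=[5, 17, 3, 99],
               attributes={"avg_star": 3.57, "gender": "F"},
               friends={7, 13})
```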
[Figure 2]
Fig. 2. The architecture of the transformer layer, which consists of a multi-head self-attention layer and a point-wise feed-forward layer.
Figure 2 presents the framework of the transformer layer. Each transformer layer consists of two sub-layers: a multi-head self-attention layer and a point-wise feed-forward layer. We now introduce the encoder in detail.
Embedding Layer. In the embedding layer, the high-dimensional one-hot representations of items are projected to low-dimensional distributed representations with an item embedding matrix M. Moreover, to make use of position information in the sequence, learnable position embeddings are added to the item representations. Formally, given the item sequence \{i_1, \ldots, i_n\}, we first map it into the embedding sequence \{v_1, \ldots, v_n\}, where v_i is a d-dimensional vector. The input representation is constructed by summing the item embeddings and position embeddings:

h_i = v_i + p_i,  (1)

where p_i \in \mathbb{R}^d is the position embedding for position index i.

Multi-Head Self-Attention. Compared with conventional neural models, the self-attention mechanism is able to capture long-distance dependencies in sequences. Thus, it has achieved promising results and is widely adopted in sequence modelling in both the NLP and recommendation areas. Moreover, the multi-head mechanism allows the model to attend to information from multiple representation sub-spaces. Specifically, given the hidden representation H^l from the l-th layer, multi-head self-attention first projects the input sequence into several vector sub-spaces and then computes the output vectors with multiple attention heads:

MultiHead(H^l) = Concat(head_1, \ldots, head_h) W^O,  (2)
head_i = Attention(H^l W_i^Q, H^l W_i^K, H^l W_i^V).  (3)

Here h is the number of heads, W_i^Q, W_i^K, and W_i^V are trainable projection matrices for the i-th head, Concat(\cdot) refers to the concatenation operation, and W^O contains learnable output parameters. The attention function is implemented as scaled dot-product attention:

Attention(Q, K, V) = softmax(QK^T / \sqrt{d/h}) V,  (4)

where the query Q, key K, and value V are linear transformations of the same input hidden representation, and \sqrt{d/h} is the scaling factor.

Point-Wise Feed-Forward Layer. In addition to the multi-head self-attention layer, each transformer layer contains a fully connected feed-forward layer, which endows the model with non-linearity. In this layer, a feed-forward network is applied at each position separately and identically:

FFN(h_i^l) = GELU(h_i^l W_1^F + b_1) W_2^F + b_2,  (5)

where W_1^F, W_2^F, b_1 and b_2 are trainable parameters, and GELU(\cdot) is the Gaussian error linear unit activation function. The parameters are shared across positions within the same layer but differ across layers.

To avoid overfitting, a dropout operation is performed after each multi-head self-attention layer and point-wise feed-forward layer. Then a residual connection [48] and a layer normalization operation are employed to stabilize and accelerate network training.
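The following PyTorch sketch summarizes the encoder described above (Equations 1-5). It is a simplified illustration, not the released implementation: the hyper-parameter values are placeholders, and we reuse torch.nn.MultiheadAttention rather than writing attention from scratch:

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """One layer as in Eqs. 2-5: multi-head self-attention and a position-wise
    feed-forward network, each followed by dropout, a residual connection,
    and layer normalization. Hyper-parameters are illustrative."""
    def __init__(self, d: int = 64, heads: int = 2, dropout: float = 0.5):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.drop = nn.Dropout(dropout)
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, h: torch.Tensor) -> torch.Tensor:  # h: (batch, seq, d)
        a, _ = self.attn(h, h, h)          # Q, K, V come from the same input
        h = self.ln1(h + self.drop(a))     # dropout, then Add & LayerNorm
        f = self.ffn(h)                    # position-wise FFN with GELU
        return self.ln2(h + self.drop(f))  # dropout, then Add & LayerNorm

class Encoder(nn.Module):
    """Embedding layer (item + position, Eq. 1) stacked with L transformer layers."""
    def __init__(self, n_items: int, max_len: int = 50, d: int = 64, L: int = 2):
        super().__init__()
        self.item_emb = nn.Embedding(n_items, d)
        self.pos_emb = nn.Embedding(max_len, d)
        self.layers = nn.ModuleList([TransformerLayer(d) for _ in range(L)])

    def forward(self, items: torch.Tensor) -> torch.Tensor:  # items: (batch, seq)
        pos = torch.arange(items.size(1), device=items.device)
        h = self.item_emb(items) + self.pos_emb(pos)          # h_i = v_i + p_i
        for layer in self.layers:
            h = layer(h)
        return h
```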
3.3 Pre-training Tasks

Based on the above encoder, we incorporate three pre-training tasks to enable the model to generate expressive sequence representations: Mask Item Prediction, User Attribute Prediction, and Social Relation Detection. The three objectives are optimized jointly.

3.3.1 Mask Item Prediction

Traditional sequential recommendation models usually follow the left-to-right paradigm, training models to predict the next item. However, such unidirectional models restrict the power of the item and sequence representations [19], [59]. Therefore, following previous works [18], [19], we adopt the mask item prediction (MIP) task. MIP enables the model to generate item representations based on context from both directions of the sequence and to capture complex sequence patterns. Specifically, given a user-item interaction sequence s = \{i_1, \ldots, i_n\}, we first randomly replace part of the items with a special token [MASK], and then require the model to predict the masked items based on their context. Formally, we mask a proportion p of the items to obtain the input \{i_1, \ldots, [MASK], i_{t+1}, \ldots, i_n\}, which is then fed into the BERT encoder to generate the hidden representations:

H^L = BERT(\{[CLS], i_1, \ldots, [MASK], i_{t+1}, \ldots, i_n, [SEP]\}).  (6)

Here H^L contains the hidden vectors of the sequence from the final layer, and [CLS] and [SEP] are special tokens that mark the beginning and end of the sequence, respectively. The final hidden vectors corresponding to the [MASK] tokens are fed into an output softmax over the whole item set.
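A minimal sketch of the masking step might look as follows; the reserved mask id and the rounding rule are our assumptions for illustration:

```python
import random
from typing import List, Tuple

MASK_ID = 0  # assumed id reserved for the special [MASK] token

def mask_items(sequence: List[int], p: float = 0.2) -> Tuple[List[int], List[int]]:
    """Randomly replace a proportion p of items with [MASK], as in the MIP
    task described above. Returns the corrupted sequence and the masked
    positions; this simplified version masks at least one item and assumes
    a non-empty sequence."""
    seq = list(sequence)
    n_mask = max(1, int(round(p * len(seq))))
    positions = random.sample(range(len(seq)), n_mask)
    for t in positions:
        seq[t] = MASK_ID
    return seq, positions

corrupted, masked_at = mask_items([5, 17, 3, 99, 23], p=0.2)
```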
[Figure 3]
Fig. 3. The architecture of UPRec in the pre-training stage. We adopt BERT as our encoder and utilize three tasks to pre-train UPRec: (1) Mask Item Prediction (MIP); (2) User Attribute Prediction (UAP); (3) Social Relation Detection (SRD). MIP allows the model to capture complex sequence patterns, and the two user-aware tasks inject user information into the representations.
The loss is defined as the mean cross-entropy over the masked items:

L_{MIP} = -\frac{1}{|S_M|} \sum_{j \in S_M} \log P(i_j^p = i_j),  (7)

where S_M is the set of positions of the masked items, and i_j^p and i_j are the predicted item and the original item at position j, respectively. Notably, in the fine-tuning stage, we adopt this task for sequential recommendation evaluation. In particular, we append [MASK] to the end of the sequence and require the model to recommend items based on the representation of the [MASK] token.

3.3.2 User Attribute Prediction

User attributes can provide abundant fine-grained information about users' preferences, and it is crucial to take full advantage of them for recommendation. For instance, music tastes change over time, and different generations prefer different music [60]. Therefore, we aim to inject useful attribute information into the user representations. We argue that users' behaviours can reflect information about their attributes. Specifically, we propose the user attribute prediction (UAP) task, which requires the model to predict user attributes based on the interaction sequences.

Formally, given the final hidden representations H^L as in Equation 6, we first employ a max-pooling operation to obtain the user representation:

u = MaxPooling(H^L),  (8)

where u is the user representation. For different types of attributes, we employ different loss functions. For numerical attributes, such as age and the average rating of all reviews, we formalize the task as a regression problem. We project the user representation to the predicted value with a linear layer and minimize the Huber loss [61]:

L_r = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} z_u,  (9)

where z_u is given by:

z_u = 0.5 (a_p - a_u)^2   if |a_p - a_u| < 1,
z_u = |a_p - a_u| - 0.5   otherwise,  (10)

where a_p and a_u are the predicted value and the ground-truth value of the attribute. For discrete attributes, such as gender and region, we formalize the task as a classification problem. Similar to the MIP task, we employ an output softmax over the value set of the attribute and define the loss as the mean cross-entropy:

L_c = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} -\log P(a_p = a_u).  (11)

The overall UAP loss is the sum of the losses of all attributes:

L_{UAP} = \sum_{a \in \mathcal{A}_n} L_r + \sum_{a \in \mathcal{A}_d} L_c,  (12)

where \mathcal{A}_n is the set of numerical attributes and \mathcal{A}_d is the set of discrete attributes.
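The UAP objective can be sketched as below. Note that PyTorch's SmoothL1Loss with its default threshold coincides with the Huber loss of Equation 10; the head names and attribute choices are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Sketch of the UAP heads (names and dimensions are assumptions). h_L is the
# final hidden states from the encoder, shape (batch, seq, d).
d, n_genders = 64, 2
reg_head = nn.Linear(d, 1)          # numerical attribute, e.g. average star
cls_head = nn.Linear(d, n_genders)  # discrete attribute, e.g. gender

huber = nn.SmoothL1Loss()           # Huber loss with threshold 1 (Eq. 10)
xent = nn.CrossEntropyLoss()        # mean cross-entropy (Eq. 11)

def uap_loss(h_L: torch.Tensor, star: torch.Tensor, gender: torch.Tensor) -> torch.Tensor:
    u = h_L.max(dim=1).values                      # max-pooling over positions (Eq. 8)
    loss_r = huber(reg_head(u).squeeze(-1), star)  # regression loss L_r
    loss_c = xent(cls_head(u), gender)             # classification loss L_c
    return loss_r + loss_c                         # summed over attributes (Eq. 12)

# Toy usage with random tensors.
h = torch.randn(4, 10, d)
print(uap_loss(h, torch.rand(4) * 5, torch.randint(0, n_genders, (4,))))
```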
3.3.3 Social Relation Detection

Previous works demonstrate that users who are socially connected are more likely to share similar preferences [25], and incorporating social relations into recommender systems can improve the performance of personalized recommendation [62], [63]. Therefore, we formalize this task as a metric learning problem, whose goal is to create a vector space in which the distance between the representations of friends is smaller than that between irrelevant users. Formally, given the training data \{u_q, u_c^+, u_{c,1}^-, \ldots, u_{c,m}^-\}, where u_q is the query user, u_c^+ \in R_{u_q} is a friend of the query user, and the u_{c,i}^- are negative samples, we define the similarity between the query u_q and a candidate u_c as:

sim(u_q, u_c) = -[w_s^T (u_q - u_c)^2 + b_s],  (13)

where the square is applied to each dimension of the vector. The similarity function can be regarded as a weighted L2 similarity with trainable w_s and b_s. We optimize the cross-entropy loss:

L_{SRD} = -\log \frac{e^{sim(u_q, u_c^+)}}{e^{sim(u_q, u_c^+)} + \sum_{j=1}^{m} e^{sim(u_q, u_{c,j}^-)}}.  (14)

For this task, the positive candidate users are provided explicitly, while the negative candidates need to be sampled from the whole user set, and how the negatives are selected is important for training a high-quality sequence encoder. Inspired by previous works, we employ the in-batch negative strategy: we reuse the positive candidates from the same batch as negatives, which makes computation efficient and achieves great performance. Formally, we have B user pairs \{(u_{1,q}, u_{1,c}^+), \ldots, (u_{B,q}, u_{B,c}^+)\} in a mini-batch. For each query user u_{j,q}, u_{j,c}^+ is his/her positive candidate and the u_{i,c}^+ (i \neq j) are his/her negative candidates. Moreover, we argue that if two users are two-hop friends or have similar profiles, they are likely to become friends in the future. Therefore, to avoid introducing noise, we mask those negative candidates that are two-hop friends of, or have similar profiles to, the query user.
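A compact sketch of the SRD objective with in-batch negatives is given below. Apart from w_s and b_s, the names are our own, and the masking of two-hop friends and similar-profile users is omitted for brevity:

```python
import torch
import torch.nn as nn

# Sketch of the SRD objective with in-batch negatives (Eqs. 13-14).
d = 64
w_s = nn.Parameter(torch.ones(d))
b_s = nn.Parameter(torch.zeros(1))

def similarity(q: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
    """Weighted L2 similarity: sim(u_q, u_c) = -(w_s^T (u_q - u_c)^2 + b_s)."""
    return -(((q - c) ** 2) @ w_s + b_s)

def srd_loss(u_q: torch.Tensor, u_c: torch.Tensor) -> torch.Tensor:
    """u_q, u_c: (B, d) representations of B (query, positive-friend) pairs.
    For query j, the positives of the other B-1 pairs serve as negatives;
    two-hop/similar-profile negatives would additionally be masked out."""
    B = u_q.size(0)
    sim = similarity(u_q.unsqueeze(1), u_c.unsqueeze(0))  # (B, B) all-pairs scores
    # Cross-entropy with the diagonal as the positive class implements Eq. 14.
    return nn.functional.cross_entropy(sim, torch.arange(B))
```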
3.4 Training Procedure

The training of UPRec consists of two steps. We first pre-train the encoder by optimizing the weighted sum of the three tasks:

L = \lambda_1 L_{MIP} + \lambda_2 L_{UAP} + \lambda_3 L_{SRD},  (15)

where \lambda_1, \lambda_2 and \lambda_3 are hyper-parameters. In the fine-tuning stage, we use the pre-trained parameters to initialize the encoder for downstream tasks. For the sequential recommendation task, we fine-tune the model by masking the last item of each sequence and optimize the negative log-likelihood of the masked targets. For the user profile prediction task, we use the hidden vector of the beginning token [CLS] to represent users, and then adopt a regression objective for numerical attributes and a classification objective for discrete attributes.

4 EXPERIMENT

To verify the effectiveness of UPRec, we conduct experiments on two large-scale real-world datasets. Besides, an ablation study and a hyper-parameter sensitivity analysis are provided to study in detail whether UPRec works well. To evaluate the generalization ability of UPRec, we also perform experiments on user profile prediction tasks. The comprehensive analysis shows that UPRec can capture useful information from behaviour sequences and improve recommendation performance.
4.1 Datasets

To evaluate our proposed model, we conduct experiments on two datasets collected from real-world platforms.
(1) YELP: a large-scale dataset for business recommendation, collected from an online social network where users make friends with each other and post reviews and ratings for items. Following previous works [18], we only use the interaction records after January 1st, 2019. We treat the user metadata as attributes, including the number of compliments received by each user and the average rating of their posted reviews.

(2) WeChat: we build a new large-scale dataset from the largest Chinese social app, WeChat. We randomly select some users and collect their click behaviours over two weeks. Besides, we collect their gender, age, and region as attributes for the user attribute prediction task. The WeChat dataset contains tens of millions of interaction records and hundreds of thousands of social relations.

For these datasets, we group the interaction records by user and sort them by timestamp to construct behaviour sequences. Following previous works [9], [18], [64], we only keep the 5-core data, filtering out users and items with fewer than five interaction records in the preprocessing stage. Part of the data is kept for pre-training and sequential recommendation evaluation; the rest is used for user profile prediction and social relation detection evaluation. The statistics of these datasets are shown in Table 1.

TABLE 1
The statistics of the datasets.
4.2 Evaluation Metrics

We adopt the widely used top-k Hit Ratio (HR@k), Normalized Discounted Cumulative Gain (NDCG@k), and Mean Reciprocal Rank (MRR) as evaluation metrics. As HR@1 is equal to NDCG@1, we only report results for HR@{1, 5, 10}, NDCG@{5, 10} and MRR. Following previous works [18], [19], we adopt the leave-one-out strategy for evaluation. In particular, we use the last item of each sequence as test data and the item before the last one as validation data; the remaining items are used for training. As the item set is quite large, it is very time-consuming to use all items as candidates for evaluation. Therefore, we follow a common strategy [3], [33] and randomly sample negative items according to their popularity for each ground-truth item to speed up the experiments.
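For a single test instance under this protocol, the metrics reduce to simple functions of the rank of the ground-truth item among the candidates, as the following illustrative sketch shows:

```python
import numpy as np

def hr_ndcg_mrr(rank: int, k: int = 10):
    """Metrics for one test instance, where `rank` is the 1-based position of
    the ground-truth item among the (ground truth + sampled negatives) candidates."""
    hr = float(rank <= k)                                  # HR@k
    ndcg = 1.0 / np.log2(rank + 1) if rank <= k else 0.0   # NDCG@k, one relevant item
    mrr = 1.0 / rank                                       # reciprocal rank
    return hr, ndcg, mrr

# Leave-one-out style usage: average the per-instance metrics over all users.
ranks = [1, 3, 12]
print(np.mean([hr_ndcg_mrr(r) for r in ranks], axis=0))
```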
TABLE 2
Performance of different methods on the sequential recommendation task on two real-world datasets. The best performance is denoted in bold. Note that "–" means the model does not converge.

              |                  YELP                  |                 WeChat
Models        | HR@1  HR@5  HR@10 NDCG@5 NDCG@10 MRR   | HR@1  HR@5  HR@10 NDCG@5 NDCG@10 MRR
GRU4Rec       |  9.86 37.66 57.22 23.81  30.13   24.02 | –     –     –     –      –       –
Caser         | 14.57 40.54 56.97 27.81  33.12   27.75 | 11.64 37.64 57.10 24.67  30.95   25.11
SASRec        | 14.17 38.98 53.71 26.85  31.62   26.73 | 17.32 48.89 66.05 33.50  39.05   32.35
BERT4Rec      | 14.60 45.98 64.81 30.58  36.66   29.79 | 28.61 58.02 72.46 43.85  48.53   42.45
UPRec w/o All |
UPRec         |
4.3 Baselines

To evaluate the effectiveness of our proposed model, we compare UPRec with the following representative models.

(1) GRU4Rec [2] is a GRU-based model. It utilizes GRUs to model user behaviour sequences and adopts a ranking-based loss for session-based recommendation.
(2) Caser [6] employs a convolutional neural network with both horizontal and vertical filters to capture sequential patterns at multiple levels, which allows it to model high-order Markov chains.
(3) SASRec [3] utilizes the self-attention mechanism for sequence modelling, which allows the model to capture long-distance dependencies. It employs a left-to-right objective and achieves promising results in next-item recommendation.
(4) BERT4Rec [19] also adopts BERT to encode behaviour sequences. It uses a cloze-style objective to generate representations with bidirectional context for sequential recommendation.
(5) UPRec w/o All is the model pre-trained with only the MIP task. Its architecture is the same as UPRec, but it is pre-trained without the user-aware tasks.

Notably, as previous pre-trained models mainly focus on utilizing item information or target other recommendation tasks, they cannot be applied in this paper; UPRec w/o All serves as a strong pre-trained baseline.
4.4 Implementation Details

We implement UPRec with PyTorch and the Transformers package [65]. The hyper-parameters, including the number of transformer layers and attention heads, the dropout rate, the embedding dimension, the maximum sequence length for each dataset, the learning rates, the batch sizes, and the loss weights \lambda_1, \lambda_2 and \lambda_3 for MIP, UAP and SRD, are selected by grid search on the validation set. Following [18], the mask proportion of items is set to 0.2. We employ Adam [66] as the optimizer. We pre-train our model for 70 epochs and save a checkpoint every 10 epochs. Each checkpoint is further fine-tuned, and the checkpoint with the highest HR@1 score on the validation set is used for evaluation on the test set.

For the baseline models, we use the source code provided by the authors. For a fair comparison, we set the dimension of the hidden vectors to be the same as UPRec for all baselines, and for the self-attention based SASRec and BERT4Rec we use the same number of layers and attention heads as UPRec. The remaining hyper-parameters follow the suggestions in the corresponding papers.

All models for the YELP dataset are trained on NVIDIA GeForce GTX 2080Ti GPUs, and models for the WeChat dataset are trained on NVIDIA Tesla P40 GPUs.

4.5 Overall Performance

The results of the baseline models and UPRec are shown in Table 2. From the results we observe the following.

Compared with the baseline models, UPRec significantly outperforms them by a large margin on both datasets. The results show that our method can effectively incorporate various user information into pre-trained models and generate expressive user representations based on behaviour sequences. Moreover, both UPRec and BERT4Rec adopt BERT as the basic encoder and optimize the model with a cloze-style objective, yet UPRec achieves better performance for sequential recommendation, which further proves that constructing self-supervised signals from social networks and user attributes helps the model obtain general user preferences and capture intricate sequence patterns.

As for the sequential recommendation baselines, SASRec and BERT4Rec achieve better performance than Caser and GRU4Rec on the two datasets. Both employ the self-attention mechanism to capture information from behaviour sequences, which indicates that self-attention is more suitable for sequence modelling than convolutional or recurrent neural networks.

Moreover, the two attention-based models, SASRec and BERT4Rec, share the same model architecture but differ in training objectives: SASRec adopts an autoregressive objective that predicts items unidirectionally, while BERT4Rec adopts a cloze-style objective that utilizes bidirectional sequence information. BERT4Rec consistently outperforms SASRec, which indicates that generating representations bidirectionally is important for sequential recommendation.
4.6 Performance w.r.t. Sequence Length

To further evaluate how the two user-aware pre-training tasks improve recommendation, we report results on behaviour sequences of different lengths. As shown in Table 3, we divide the data into three groups according to sequence length: sequences shorter than a fixed threshold form the small group, sequences longer than a second threshold form the large group, and the others form the medium group. Due to space limitations we report NDCG@10 and MRR. To investigate how the user-aware pre-training tasks benefit performance, we compare the results of UPRec and UPRec w/o All.

TABLE 3
The performance on sequences of different lengths. The data is divided into three groups according to the length of the behaviour sequences. Relative improvements of UPRec over UPRec w/o All are marked with ↑.

              |    small     |    medium    |    large     |     all
Metrics       | NDCG@10 MRR  | NDCG@10 MRR  | NDCG@10 MRR  | NDCG@10 MRR
UPRec w/o All |
UPRec (↑)     |

From the results, we find that UPRec achieves larger improvements on the small group. This demonstrates that for users with only a few interactions, user attributes and social graphs provide useful information and help the model capture their preferences more accurately. Besides, for sequences in the large group, even though the behaviour sequences already contain sufficient information, the extra user information further benefits recommendation performance, which verifies the effectiveness of UPRec in integrating user information into the pre-trained model.

4.7 User Profile Prediction

As our model utilizes rich user information in pre-training, we argue that UPRec can achieve accurate user modelling. Therefore, we evaluate UPRec on the user profile prediction task with three sub-tasks: (1) Compliment Prediction, which requires the model to predict the number of compliments received by a user; (2) Average Star Regression, which requires the model to predict the average rating of the reviews posted by a user; and (3) Gender Prediction, which requires the model to predict a user's gender. Notably, the first two tasks are also used in the pre-training stage, while gender prediction is a new, challenging task not adopted in pre-training. In these experiments, we employ BERT as a baseline, which encodes user behaviour sequences with BERT and utilizes cross-entropy loss or Huber loss as objectives.
TABLE 4
The performance on the user profile prediction task on the YELP dataset, covering compliment prediction, average star regression, and gender prediction. We adopt accuracy as the metric for the compliment prediction and gender prediction tasks, and mean squared error for the average star regression task.

Task  | Compliment | Star   | Gender
BERT  | 0.6752     | 0.0375 | 0.6097
UPRec |
The results are shown in Table 4. We adopt accuracy as the metric for compliment prediction and gender prediction, and mean squared error for average star regression. From the results, we observe that UPRec achieves performance improvements on all three tasks, especially on average star regression. Besides, though the gender prediction task is not used to pre-train our model, UPRec also outperforms the baseline on this task. The improvements on the first two tasks prove that our model can learn useful information from user attributes and social graphs and thus benefit recommender systems, while the improvement on gender prediction demonstrates that our user-aware pre-training tasks help the model capture user attributes from behaviours. The results further verify that utilizing various user information in pre-training helps the model capture user preferences effectively and accurately.
4.8 Social Relation Detection

UPRec adopts the social relation detection task to generate similar representations for socially connected users. To verify whether UPRec can indeed generate social-aware user representations, we evaluate our proposed model on the social relation detection task. Specifically, as in the pre-training stage, for each user in a social relation we sample negative candidate users and require the model to select the true friend according to the behaviour sequences. We compare our model with two baselines: (1) Similarity (Sim) assumes that friends tend to behave similarly and interact with the same items; it always predicts as the friend the candidate user who shares the most items with the query user. (2) BERT encodes the behaviour sequence with BERT and is trained with the loss stated in Equation 14.

TABLE 5
The performance on the social relation detection task on the YELP dataset. We adopt accuracy as the evaluation metric.

Model | Sim   | BERT  | UPRec
Acc   | 12.87 | 69.43 |
The results are shown in Table 5. From the results, we find that UPRec significantly outperforms the baseline models and is able to recommend friends for each user accurately, even when the users are new to the model. This demonstrates that our model can effectively generate similar representations for friends, and thus benefit user preference modelling. Besides, the similarity strategy also performs better than random prediction, which supports the hypothesis that friends
tend to behave similarly and are supposed to have similar representations.
4.9 Ablation Study

To explore the contribution of the two user-aware pre-training tasks, we conduct an ablation study; the results are shown in Table 6. Specifically, we report scores with different pre-training tasks turned off. Here, w/o All, w/o Rel, and w/o Pro refer to pre-training the model without any user-aware task, without social relation detection, and without user attribute prediction, respectively. The results of the strongest overall baseline, BERT4Rec, are also provided for comparison.

From the results, we observe that both user-aware pre-training tasks contribute to the main model, as the performance decreases when either task is removed. Note that the model without any user-aware pre-training task still outperforms BERT4Rec, which indicates that the two-stage pre-training and fine-tuning mechanism itself improves performance. Besides, compared with the model pre-trained without user-aware tasks (w/o All), the models pre-trained with social relation detection (w/o Pro) or user attribute prediction (w/o Rel) achieve significantly better results. This further proves that both user-aware tasks help the model capture high-order features and inject user information into the pre-trained model.

Notably, the results on the WeChat dataset show that the models with only one user-aware pre-training task (w/o Rel, w/o Pro) achieve comparable performance on hit ratio, which suggests that the two tasks play a similar role in injecting user information into the representations to some extent. However, pre-training with both tasks achieves more robust performance across the various evaluation metrics.

The two datasets contain social graphs of different densities and different types of user attributes, and our proposed user-aware pre-training tasks improve performance significantly on both, which verifies the effectiveness and robustness of our method.
[Figure 4]
Fig. 4. The performance (NDCG@10 and MRR) with respect to the number of pre-training epochs (10 to 70) on the YELP dataset.
4.10 Hyper-Parameter Analysis

Our model consists of a pre-training stage and a fine-tuning stage. In the pre-training stage, UPRec is trained to inject user information into the user and item representations, and the number of pre-training epochs can greatly influence model performance. To study this issue, we pre-train the model for different numbers of epochs on the YELP dataset and fine-tune the resulting checkpoints on the sequential recommendation task every 10 epochs.

Figure 4 presents the performance with respect to the number of pre-training epochs on the YELP dataset. We can see that the performance improves substantially during the early epochs and only slightly afterwards. This shows that UPRec converges quickly and can effectively capture features from the heterogeneous user information within the first few epochs; the enriched representations then improve sequential recommendation performance.
[Figure 5]
Fig. 5. The performance (NDCG@10 and MRR) with respect to the pre-training batch size on the YELP dataset.
It has been shown in NLP that hyper-parameter choices have a significant impact on pre-trained models, and that pre-training with a bigger batch size can lead to better performance [13]. We therefore investigate how the pre-training batch size affects performance by pre-training our model with batch sizes in {128, 256, 512, 768} and fine-tuning on the sequential recommendation task.

The results are shown in Figure 5. We observe that the performance increases significantly with the batch size. A large mini-batch helps the model optimize stably and efficiently. Besides, as we adopt the in-batch negative strategy for the social relation detection task, a larger batch size implies a larger number of negative users, which helps the model generate expressive user representations and further improves performance on downstream tasks. As we have not yet reached the upper bound of the model's capacity, pre-training with an even bigger batch size may achieve better performance, which we leave to future work.

Further, the number of parameters has a significant impact on the performance of pre-trained models. In the field of pre-trained language models, it has been proven that a larger number of parameters can lead to better performance [12], [13]. Therefore, we study how the hidden size affects recommendation performance and whether a larger model leads to better performance. Figure 6 presents NDCG@10 and MRR with the hidden size varying from 32 to 160 while keeping the other hyper-parameters unchanged.
TABLE 6
The results of the ablation study.

         |                  YELP                  |                 WeChat
Metrics  | HR@1  HR@5  HR@10 NDCG@5 NDCG@10 MRR   | HR@1  HR@5  HR@10 NDCG@5 NDCG@10 MRR
BERT4Rec | 14.60 45.98 64.81 30.58  36.66   29.79 | 28.61 58.02 72.46 43.85  48.53   42.45
w/o All  | 15.04 46.31 65.50 30.93  37.14   30.25 | 29.57 64.62 79.36 47.91  52.70   45.47
w/o Rel  | 16.29 48.57 67.94 32.71  38.98   31.79 | 29.82 64.84 79.45 48.15  52.90   45.69
w/o Pro  | 16.46 48.29 67.67 32.62  38.89   31.77 | 29.70
UPRec    |
[Figure 6]
Fig. 6. The performance (NDCG@10 and MRR) with respect to the hidden size on the YELP dataset.

From the results, we observe that the performance benefits considerably as the hidden size increases up to a point; when we continue to increase the hidden size beyond it, the NDCG@10 score decreases, probably because of overfitting. Therefore, the hidden size should be chosen carefully for datasets of different sparsity and scale.

To summarize, the hyper-parameter sensitivity analysis shows that the pre-training mechanism enables the model to recommend items accurately even with limited behaviour data. Besides, the results suggest that training large-scale pre-trained models with a large batch size and a large number of parameters can further improve performance.

5 CONCLUSION
In this paper, we propose to incorporate user information into pre-trained models for recommender systems. We design two novel user-aware pre-training tasks, user attribute prediction and social relation detection, to utilize user attributes and social graphs. We evaluate the proposed UPRec on the sequential recommendation task and on user profile prediction tasks. The experimental results demonstrate that our model can generate expressive user representations from behaviour sequences and outperform competitive baseline models. Besides, we conduct an ablation study and a hyper-parameter sensitivity analysis, which show that pre-training with user-aware tasks improves performance and that training larger models with larger batch sizes can promote further progress.

In the future, we will explore how to design powerful pre-training tasks that utilize even more user information, including users' posted reviews and other behaviours. It is also worth exploring how our model performs in other complex recommendation tasks, such as next-basket recommendation and click-through rate prediction.

ACKNOWLEDGMENTS
This work is supported by the National Key Research and Development Program of China (No. 2020AAA0106501) and the National Natural Science Foundation of China (NSFC No. 61772302). Yao is also supported by the 2020 Tencent Rhino-Bird Elite Training Program.

REFERENCES

[1] S. Zhang, L. Yao, A. Sun, and Y. Tay, "Deep learning based recommender system: A survey and new perspectives," CSUR, vol. 52, no. 1, pp. 1–38, 2019.
[2] B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk, "Session-based recommendations with recurrent neural networks," in Proceedings of ICLR, 2016.
[3] W.-C. Kang and J. McAuley, "Self-attentive sequential recommendation," in Proceedings of ICDM. IEEE, 2018, pp. 197–206.
[4] J. Li, P. Ren, Z. Chen, Z. Ren, T. Lian, and J. Ma, "Neural attentive session-based recommendation," in Proceedings of CIKM, 2017, pp. 1419–1428.
[5] C.-Y. Wu, A. Ahmed, A. Beutel, A. J. Smola, and H. Jing, "Recurrent recommender networks," in Proceedings of WSDM, 2017, pp. 495–503.
[6] J. Tang and K. Wang, "Personalized top-n sequential recommendation via convolutional sequence embedding," in Proceedings of WSDM, 2018, pp. 565–573.
[7] B. Hidasi, M. Quadrana, A. Karatzoglou, and D. Tikk, "Parallel recurrent neural network architectures for feature-rich session-based recommendations," in Proceedings of RecSys, 2016, pp. 241–248.
[8] J. Huang, Z. Ren, W. X. Zhao, G. He, J.-R. Wen, and D. Dong, "Taxonomy-aware multi-hop reasoning networks for sequential recommendation," in Proceedings of WSDM, 2019, pp. 573–581.
[9] T. Zhang, P. Zhao, Y. Liu, V. S. Sheng, J. Xu, D. Wang, G. Liu, and X. Zhou, "Feature-level deeper self-attention network for sequential recommendation," in Proceedings of IJCAI, 2019, pp. 4320–4326.
[10] W. Song, C. Shi, Z. Xiao, Z. Duan, Y. Xu, M. Zhang, and J. Tang, "AutoInt: Automatic feature interaction learning via self-attentive neural networks," in Proceedings of CIKM, 2019, pp. 1161–1170.
[11] T. Yao, X. Yi, D. Z. Cheng, F. Yu, A. Menon, L. Hong, E. H. Chi, S. Tjoa, E. Ettinger et al., "Self-supervised learning for deep models in recommendations," arXiv preprint arXiv:2007.12865, 2020.
[12] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of NAACL-HLT, 2019, pp. 4171–4186.
[13] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
[14] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," in Proceedings of NeurIPS, 2020.
[15] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, "Exploring the limits of transfer learning with a unified text-to-text transformer," JMLR, vol. 21, pp. 1–67, 2020.
[16] X. Chen, D. Liu, C. Lei, R. Li, Z.-J. Zha, and Z. Xiong, "BERT4SessRec: Content-based video relevance prediction with bidirectional encoder representations from transformer," in Proceedings of SIGMM, 2019, pp. 2597–2601.
[17] F. Yuan, X. He, A. Karatzoglou, and L. Zhang, "Parameter-efficient transfer from sequential behaviors for user modeling and recommendation," in Proceedings of SIGIR, 2020, pp. 1469–1478.
[18] K. Zhou, H. Wang, W. X. Zhao, Y. Zhu, S. Wang, F. Zhang, Z. Wang, and J.-R. Wen, "S3-Rec: Self-supervised learning for sequential recommendation with mutual information maximization," in Proceedings of CIKM, 2020, pp. 1893–1902.
[19] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, and P. Jiang, "BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer," in Proceedings of CIKM, 2019, pp. 1441–1450.
[20] X. Xie, F. Sun, Z. Liu, J. Gao, B. Ding, and B. Cui, "Contrastive pre-training for sequential recommendation," arXiv preprint arXiv:2010.14395, 2020.
[21] Z. Zeng, C. Xiao, Y. Yao, R. Xie, Z. Liu, F. Lin, L. Lin, and M. Sun, "Knowledge transfer via pre-training for recommendation: A review and prospect," arXiv preprint arXiv:2009.09226, 2020.
[22] J. Tang, X. Hu, and H. Liu, "Social recommendation: a review," Social Network Analysis and Mining, vol. 3, no. 4, pp. 1113–1133, 2013.
[23] C. H. Liu, J. Xu, J. Tang, and J. Crowcroft, "Social-aware sequential modeling of user interests: a deep learning approach," TKDE, vol. 31, no. 11, pp. 2200–2212, 2018.
[24] A. LeBlanc, W. L. Sims, C. Siivola, and M. Obert, "Music style preferences of different age listeners," Journal of Research in Music Education, vol. 44, no. 1, pp. 49–59, 1996.
[25] M. McPherson, L. Smith-Lovin, and J. M. Cook, "Birds of a feather: Homophily in social networks," Annual Review of Sociology, vol. 27, no. 1, pp. 415–444, 2001.
[26] F. Ricci, L. Rokach, and B. Shapira, "Recommender systems: introduction and challenges," in Recommender Systems Handbook. Springer, 2015, pp. 1–34.
[27] Y. Koren and R. Bell, "Advances in collaborative filtering," Recommender Systems Handbook, pp. 77–118, 2015.
[28] Y. Koren, R. Bell, and C. Volinsky, "Matrix factorization techniques for recommender systems," Computer, vol. 42, no. 8, pp. 30–37, 2009.
[29] H. Guo, R. Tang, Y. Ye, Z. Li, and X. He, "DeepFM: a factorization-machine based neural network for CTR prediction," in Proceedings of IJCAI, 2017, pp. 1725–1731.
[30] S. Kabbur, X. Ning, and G. Karypis, "FISM: factored item similarity models for top-n recommender systems," in Proceedings of SIGKDD, 2013, pp. 659–667.
[31] G. Linden, B. Smith, and J. York, "Amazon.com recommendations: Item-to-item collaborative filtering," IEEE Internet Computing, vol. 7, no. 1, pp. 76–80, 2003.
[32] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, "Item-based collaborative filtering recommendation algorithms," in Proceedings of WWW, 2001, pp. 285–295.
[33] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua, "Neural collaborative filtering," in Proceedings of WWW, 2017, pp. 173–182.
[34] S. Sedhain, A. K. Menon, S. Sanner, and L. Xie, "AutoRec: Autoencoders meet collaborative filtering," in Proceedings of WWW, 2015, pp. 111–112.
[35] D. Kim, C. Park, J. Oh, S. Lee, and H. Yu, "Convolutional matrix factorization for document context-aware recommendation," in Proceedings of RecSys, 2016, pp. 233–240.
[36] H. Wang, N. Wang, and D.-Y. Yeung, "Collaborative deep learning for recommender systems," in Proceedings of SIGKDD, 2015, pp. 1235–1244.
[37] W.-C. Kang, C. Fang, Z. Wang, and J. McAuley, "Visually-aware fashion recommendation and design with generative image models," in Proceedings of ICDM. IEEE, 2017, pp. 207–216.
[38] S. Wang, Y. Wang, J. Tang, K. Shu, S. Ranganath, and H. Liu, "What your images reveal: Exploiting visual contents for point-of-interest recommendation," in Proceedings of WWW, 2017, pp. 391–400.
[39] H. Fang, D. Zhang, Y. Shu, and G. Guo, "Deep learning for sequential recommendation: Algorithms, influential factors, and evaluations," TOIS, vol. 39, no. 1, pp. 1–42, 2020.
[40] S. Rendle, "Factorization machines," in Proceedings of ICDM. IEEE, 2010, pp. 995–1000.
[41] M. Quadrana, P. Cremonesi, and D. Jannach, "Sequence-aware recommender systems," CSUR, vol. 51, no. 4, pp. 1–36, 2018.
[42] R. He and J. McAuley, "Fusing similarity models with markov chains for sparse sequential recommendation," in Proceedings of ICDM. IEEE, 2016, pp. 191–200.
[43] R. He, W.-C. Kang, and J. McAuley, "Translation-based recommendation," in Proceedings of RecSys, 2017, pp. 161–169.
[44] M. Quadrana, A. Karatzoglou, B. Hidasi, and P. Cremonesi, "Personalizing session-based recommendations with hierarchical recurrent neural networks," in Proceedings of RecSys, 2017, pp. 130–137.
[45] J. Huang, W. X. Zhao, H. Dou, J.-R. Wen, and E. Y. Chang, "Improving sequential recommendation with knowledge-enhanced memory networks," in Proceedings of SIGIR, 2018, pp. 505–514.
[46] P. Ren, Z. Chen, J. Li, Z. Ren, J. Ma, and M. De Rijke, "RepeatNet: A repeat aware neural recommendation machine for session-based recommendation," in Proceedings of AAAI, vol. 33, no. 01, 2019, pp. 4806–4813.
[47] V. Rakesh, N. Jadhav, A. Kotov, and C. K. Reddy, "Probabilistic social sequential model for tour recommendation," in Proceedings of WSDM, 2017, pp. 631–640.
[48] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of CVPR, 2016, pp. 770–778.
[49] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proceedings of CVPR, 2017, pp. 4700–4708.
[50] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proceedings of ICLR, 2015.
[51] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, "XLNet: Generalized autoregressive pretraining for language understanding," in Proceedings of NeurIPS, 2019, pp. 5753–5763.
[52] L. Zheng, V. Noroozi, and P. S. Yu, "Joint deep modeling of users and items using reviews for recommendation," in Proceedings of WSDM, 2017, pp. 425–434.
[53] Y. Gong and Q. Zhang, "Hashtag recommendation using attention-based convolutional neural network," in Proceedings of IJCAI, 2016, pp. 2782–2788.
[54] F. Zhang, N. J. Yuan, D. Lian, X. Xie, and W.-Y. Ma, "Collaborative knowledge base embedding for recommender systems," in Proceedings of SIGKDD, 2016, pp. 353–362.
[55] H. Wang, F. Zhang, X. Xie, and M. Guo, "DKN: Deep knowledge-aware network for news recommendation," in Proceedings of WWW, 2018, pp. 1835–1844.
[56] J. Chen, Y. Wu, L. Fan, X. Lin, H. Zheng, S. Yu, and Q. Xuan, "N2VSCDNNR: A local recommender system based on node2vec and rich information network," IEEE Transactions on Computational Social Systems, vol. 6, no. 3, pp. 456–466, 2019.
[57] L. Guo, Y.-F. Wen, and X.-H. Wang, "Exploiting pre-trained network embeddings for recommendations in social networks," Journal of Computer Science and Technology, vol. 33, no. 4, pp. 682–696, 2018.
[58] S. Yang, Y. Liu, C. Lei, G. Wang, H. Tang, J. Zhang, and C. Miao, "A pre-training strategy for recommendation," arXiv preprint arXiv:2010.12284, 2020.
[59] F. Yuan, X. He, H. Jiang, G. Guo, J. Xiong, Z. Xu, and Y. Xiong, "Future data helps training: Modeling future contexts for session-based recommendation," in Proceedings of The Web Conference, 2020, pp. 303–313.
[60] T. W. Smith, "Generational differences in musical preferences," Popular Music & Society, vol. 18, no. 2, pp. 43–59, 1994.
[61] P. J. Huber, "Robust estimation of a location parameter," in Breakthroughs in Statistics. Springer, 1992, pp. 492–518.
[62] H. Ma, H. Yang, M. R. Lyu, and I. King, "SoRec: social recommendation using probabilistic matrix factorization," in Proceedings of CIKM, 2008, pp. 931–940.
[63] W. Fan, Y. Ma, Q. Li, Y. He, E. Zhao, J. Tang, and D. Yin, "Graph neural networks for social recommendation," in Proceedings of WWW, 2019, pp. 417–426.
[64] S. Rendle, C. Freudenthaler, and L. Schmidt-Thieme, "Factorizing personalized markov chains for next-basket recommendation," in Proceedings of WWW, 2010, pp. 811–820.
[65] T. Wolf, J. Chaumond, L. Debut, V. Sanh, C. Delangue, A. Moi, P. Cistac, M. Funtowicz, J. Davison, S. Shleifer et al., "Transformers: State-of-the-art natural language processing," in Proceedings of EMNLP (Demo), 2020, pp. 38–45.
[66] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proceedings of ICLR, 2015.