Sparse-Interest Network for Sequential Recommendation
Qiaoyu Tan, Jianwei Zhang, Jiangchao Yao, Ninghao Liu, Jingren Zhou, Hongxia Yang, Xia Hu
Department of Computer Science and Engineering, Texas A&M University, TX, USA
Alibaba Group
{qytan,nhliu43,xiahu}@tamu.edu
{zhangjianwei.zjw,jiangchao.yjc,jingren.zhou,yang.yhx}@alibaba-inc.com
ABSTRACT
Recent methods in sequential recommendation focus on learning an overall embedding vector from a user's behavior sequence for the next-item recommendation. However, from empirical analysis, we discovered that a user's behavior sequence often contains multiple conceptually distinct items, while a unified embedding vector is primarily affected by one's most recent frequent actions. Thus, it may fail to infer the next preferred item if conceptually similar items are not dominant in recent interactions. To this end, an alternative solution is to represent each user with multiple embedding vectors encoding different aspects of the user's intentions. Nevertheless, recent work on multi-interest embedding usually considers a small number of concepts discovered via clustering, which may not be comparable to the large pool of item categories in real systems. It is a non-trivial task to effectively model a large number of diverse conceptual prototypes, as items are often not conceptually well clustered in fine granularity. Besides, an individual usually interacts with only a sparse set of concepts. In light of this, we propose a novel Sparse Interest NEtwork (SINE) for sequential recommendation. Our sparse-interest module can adaptively infer a sparse set of concepts for each user from the large concept pool and output multiple embeddings accordingly. Given multiple interest embeddings, we develop an interest aggregation module to actively predict the user's current intention and then use it to explicitly model multiple interests for next-item prediction. Empirical results on several public benchmark datasets and one large-scale industrial dataset demonstrate that SINE can achieve substantial improvement over state-of-the-art methods.

CCS CONCEPTS
• Computer systems organization → Embedded systems; Redundancy; Robotics; • Networks → Network reliability.
KEYWORDS
Recommender system, Sequential recommendation, Sparse-interest network, Multi-interest extraction
Figure 1: Hit and Miss analysis in top@100 of the single-embedding based SASRec [21] for next-item prediction on Taobao [51]. The left side shows the prediction results under the "In" and "Out" settings: "In" means that items from the same category as the next predicted item appear among the most recent fifty behaviors, otherwise "Out". The right side shows the frequency of similar items in the most recent five behaviors. SASRec tends to predict the next item correctly only when similar items are dominant in past interactions.
ACM Reference Format:
Qiaoyu Tan, Jianwei Zhang, Jiangchao Yao, Ninghao Liu, Jingren Zhou, Hongxia Yang, Xia Hu. 2021. Sparse-Interest Network for Sequential Recommendation. In Proceedings of the Fourteenth ACM International Conference on Web Search and Data Mining (WSDM '21), March 8–12, 2021, Virtual Event, Israel. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3437963.3441811
1 INTRODUCTION
Recommender systems have been widely applied to many online services such as E-commerce, advertising, and social media to perform personalized information filtering [14, 17, 31, 46]. At their core is estimating how likely a user will interact with an item based on past actions, e.g., purchases and clicks. Traditional recommendation methods adopt collaborative filtering approaches [35] to address the problem, assuming that behaviorally similar users exhibit similar preferences on items. Recently, neural-based deep recommendation models have shown revolutionary performance in many recommendation scenarios, due to the powerful expressive ability of deep learning. For example, NCF [14] extends matrix factorization based models [35] by replacing the inner-product interaction function with nonlinear neural networks. PinSage [46] is built on GraphSage [10] and learns user and item embeddings by conducting convolutional operations on the user-item interaction graph. However, these methods ignore the sequential structure in user behaviors and thus fail to capture the correlations between adjacent behaviors.

Some recent works formalize recommendation as a sequential problem. The principal idea is to represent each user with an ordered sequence and assume its order matters. Given a user's behavior history, a sequential recommendation approach first sorts the past behaviors to obtain the ordered sequence. The sequence is then fed into a neural sequential module (e.g., a recurrent neural network [17], a convolutional network [42], or a Transformer [21]) to generate an overall user embedding vector, which is used to predict the next item of interest. Since the sequential recommendation approach reflects the real-world recommendation situation, it has attracted much attention in modern recommendation systems.

Despite the recent advances, we argue that existing sequential recommendation models may be sub-optimal for next-item prediction due to the bottleneck of learning a single embedding from the user's behavior sequence. Each user on an E-commerce platform usually interacts over time with several types of items that are conceptually different. For example, we find that the number of distinct item categories in a user's most recent fifty behaviors is around 10 on the Taobao dataset [51]. Given such multiple user intentions, we also observe in Figure 1 that an overall user embedding vector learned from a behavior sequence is primarily affected by the most recent frequent actions. Thus, it may fail to extract the information needed to predict the next item if conceptually similar items are not dominant in recent interactions. A promising alternative is therefore to learn multiple embedding vectors from a user's behavior sequence, where each embedding vector encodes one aspect of the user's interests.

However, there are several challenges in effectively extracting multiple embedding vectors from the user's behavior sequence on industry-scale data. First, items are often not conceptually well clustered in real systems. Although category information of items can be used as concepts, in many cases such auxiliary information may be unavailable or unreliable due to annotation noise in practice.
The second challenge is to adaptively infer a sparse set of interested concepts for a user from the large concept pool. The inference procedure includes a selection operation, which is a discrete optimization problem and hard to train end-to-end. Third, given multiple interest embedding vectors, we need to determine which interest is likely to be activated for next-item prediction. During training, the next predicted item can be used as a label to activate the preferred intention, but the inference stage has no such label; the model has to predict a user's next intention adaptively.

In this paper, we propose a novel Sparse-Interest NEtwork (SINE) for sequential recommendation to address these issues. SINE can learn a large pool of interest groups and capture multiple intentions of users in an end-to-end fashion. Figure 2 shows the overall structure of SINE. Our sparse-interest extraction module adaptively infers the interacted interests of a user from a large pool of interest groups and outputs multiple interest embeddings. The aggregation module dynamically predicts the user's next intention, which helps to explicitly capture multiple interests for top-N item recommendation. (Throughout the paper, we use intention and interest interchangeably to denote an item cluster consisting of conceptually similar items.) We conduct experiments on several public benchmarks and an industrial dataset. Empirical results show that our framework outperforms state-of-the-art models and produces reasonable item clusters. To summarize, the main contributions of this paper are:
• We propose a comprehensive framework that integrates large-scale item clustering and sparse-interest extraction jointly in a recommender system.
• We investigate an adaptive interest aggregation module to explicitly model users' multiple interests for top-N recommendation in the sequential recommendation scenario.
• Our model not only achieves state-of-the-art performance on several challenging real-world datasets, but also produces reasonable interest groups to assist multi-interest extraction.
2 RELATED WORK
In conventional recommendation systems, researchers focus on extracting users' general tastes from their historical behaviors. Typical examples include collaborative filtering [35, 36], matrix factorization techniques [23], and factorization machines [32]. Their critical challenge lies in representing users and items with embedding vectors to compute their similarity. Matrix factorization (MF) methods map users and items into a joint latent space and estimate user-item interactions through the inner product of their embedding vectors. Factorization machines [32] model all interactions between variables using factorized parameters and can estimate interactions even under sparsity. Recently, inspired by the success of deep learning in computer vision and natural language processing [49], much effort has been put into developing deep-learning-based recommender algorithms [9, 14, 40]. One line of work uses neural networks to extract additional features for content-aware recommendation [22]. Another line aims to replace traditional MF: NCF [14] uses multi-layer perceptrons to replace the inner-product operation in MF for interaction estimation, while AutoRec [37] adopts autoencoders to predict ratings. Moreover, several attempts apply graph neural networks [7, 19, 39, 48] to recommendation [13, 46].
Sequential recommendation has become a crucial problem in modern recommender systems, owing to its ability to capture the sequential patterns among successive items. One line of work models the item-to-item transition matrix based on Markov chains (MCs). For instance, some works model the sequence with a first-order Markov chain [4, 33], which assumes that the next action relies only on the last behavior. To relax this limitation, other methods adopt high-order MCs that consider more previous items [11, 12, 45]. A representative work is Caser [42], which treats a user's behavior sequence as an "image" and adopts a convolutional neural network to extract the user representation.

Another line of work uses a sequential neural module to process the user behavior sequence [16, 21, 38, 41]. For example, GRU4Rec [17] first applies Gated Recurrent Units (GRU) to model the whole session for more accurate recommendation. SASRec [21] uses a self-attention based sequential model [43] to capture long-term semantics and an attention mechanism to make predictions based on relatively few actions. Besides, some other works [16, 25, 47] introduce specific neural modules for particular recommendation scenarios. For instance, DIN [50] develops a local activation unit to adaptively learn the user's representation from past behaviors for a specific ad. RUM [3] introduces a memory-augmented neural network with the insights of collaborative filtering for recommendation. SDM [28] integrates a multi-head self-attention module with a gated fusion module to capture both short- and long-term user preferences for next-item prediction.

Figure 2: The architecture of SINE (better viewed in color). Given a user's behavior sequence as input, the sparse-interest module adaptively activates his/her interests from the large interest group pool and outputs multi-interest embeddings. Then, the interest aggregation module selects the most preferred interest for next-item recommendation by actively predicting the user's next intention. SINE offers the ability to cluster items and infer a user's sparse set of interests in an end-to-end fashion.
The attention mechanism was initially proposed in computer vision [2] and has become popular only in recent years. It was first applied to the machine translation problem by [1] and later became a groundbreaking building block in the Transformer [43]. Recently, BERT leveraged the Transformer to achieve enormous success in natural language processing for pre-training. Attention has also been successfully applied in many recommendation applications [38] and is rather useful and efficient in real-world tasks.
3 METHODOLOGY
In this section, we first introduce the problem formulation and then discuss the proposed framework in detail. Finally, we discuss the difference between our framework and existing methods.
Problem Formulation. Let $\{\mathbf{x}^{(u)}\}_{u=1}^{N}$ be the behavior dataset consisting of the interactions between $N$ users and $M$ items. $\mathbf{x}^{(u)} = [x^{(u)}_1, x^{(u)}_2, \cdots, x^{(u)}_n]$ is the ordered sequence of items clicked by user $u$, where $n$ is the number of clicks made by user $u$. Each element $x^{(u)}_t \in \{1, 2, \cdots, M\}$ in the sequence is the index of the item being clicked. Note that, due to strict latency and performance requirements, industrial recommender systems consist of two stages: the matching stage and the ranking stage [6]. The matching stage retrieves top-$N$ candidate items from a large item pool, while the ranking stage sorts the candidate items by more precise scores. We focus on improving the effectiveness of the matching stage, where the task is to retrieve high-quality candidate items that the user might click based on the observed sequence $\mathbf{x}^{(u)}$. As the item pools of real-world recommender systems often contain millions or even billions of items, the matching stage is crucial in modern recommender systems.

Specifically, a deep sequential model in the matching stage typically has a sequence encoder $\phi_\theta(\cdot)$ and an item embedding table $\mathbf{H} \in \mathbb{R}^{M \times D}$, where $\theta$ is the set of all trainable parameters including $\mathbf{H}$. The encoder takes the user's historical behavior sequence $\mathbf{x}^{(u)}$ as input and outputs the representation of the sequence, $\phi_\theta(\mathbf{x}^{(u)})$, which can be viewed as the representation of the user's intention. The user's intention embedding is then used as a query to generate his/her candidate items from the item pool via a fast K-nearest-neighbor algorithm (e.g., faiss [20]). Most encoders $\phi_\theta(\cdot)$ in the literature output a single $D$-dimensional embedding vector, while some models output $K$ $D$-dimensional embedding vectors to preserve the user's intentions under $K$ latent categories. We focus on the latter direction and aim to capture a user's diverse intentions accurately.

The state-of-the-art sequence encoders for capturing a user's multiple intentions fall into two categories. The first type resorts to powerful sequential encoders to implicitly extract the user's multiple intentions, such as models based on multi-head self-attention (aka the Transformer [43]). The other type relies on latent prototypes to explicitly capture a user's multiple intentions. In general, the former approach is limited in capturing multiple intentions due to the mixed nature of intention detection and embedding in practice; for example, empirical results show that the multiple vector representations learned by the Transformer do not have a clear advantage over the single-head implementation [21] for recommendation. In contrast, the latter can effectively extract a user's diverse interests with the help of concepts identified via clustering, as empirically shown in [27, 29]. However, these methods scale poorly because they require each user to have an intention embedding under every concept, and the number of concepts easily scales up to thousands in industrial applications. For instance, millions or even billions of items belong to more than 10 thousand expert-labeled leaf categories [24] on the e-commerce platform of Tmall in China. With a large pool of interest concepts in real systems, a scalable multi-interest extraction module is needed. Therefore, we propose a sparse-interest network, which offers the ability to adaptively activate a subset of concepts from the large concept pool for each user.
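To make the matching-stage retrieval described above concrete, here is a minimal sketch using faiss [20]. The array names, sizes, and random embeddings are illustrative assumptions rather than the paper's actual pipeline.

```python
import numpy as np
import faiss  # fast nearest-neighbor search library cited as [20]

M, D = 100000, 128                                  # item pool size, embedding dim
item_emb = np.random.rand(M, D).astype('float32')   # stand-in for the item table H
user_emb = np.random.rand(64, D).astype('float32')  # stand-in for encoder outputs

index = faiss.IndexFlatIP(D)    # exact maximum-inner-product search
index.add(item_emb)             # index the whole item pool
scores, item_ids = index.search(user_emb, 100)      # top-100 candidates per user
```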
Sparse-Interest Module. The input of our model is the user's behavior sequence $\mathbf{x}^{(u)}$, which is fed into an embedding layer and transformed into the item embedding matrix $\mathbf{X}^u \in \mathbb{R}^{n \times D}$. Let $\mathbf{C} \in \mathbb{R}^{L \times D}$ denote the overall conceptual prototype matrix, and $\mathbf{C}^u \in \mathbb{R}^{K \times D}$ the activated prototypical embedding matrix on $K$ latent concepts for user $u$, where $L$ is the total number of concepts.

Concept activation. Our sparse-interest layer starts by inferring the interested conceptual prototypes $\mathbf{C}^u$ for each user $u$. Given $\mathbf{X}^u \in \mathbb{R}^{n \times D}$, the self-attentive method [26] is first applied to aggregate the input sequence selectively:

$$\mathbf{a} = \mathrm{softmax}(\tanh(\mathbf{X}^u \mathbf{W}_1) \mathbf{W}_2), \qquad (1)$$

where $\mathbf{W}_1 \in \mathbb{R}^{D \times D}$ and $\mathbf{W}_2 \in \mathbb{R}^{D}$ are trainable parameters. The vector $\mathbf{a} \in \mathbb{R}^n$ is the attention weight vector over user behaviors. Summing the embeddings of the input sequence according to the attention weights yields a virtual concept vector $\mathbf{z}_u = (\mathbf{a}^\top \mathbf{X}^u)^\top$ for the user. $\mathbf{z}_u \in \mathbb{R}^D$ reflects the user's general intentions and is used to activate the interested conceptual prototypes:

$$\mathbf{s}^u = \langle \mathbf{C}, \mathbf{z}_u \rangle, \quad \mathrm{idx} = \mathrm{rank}(\mathbf{s}^u, K), \quad \mathbf{C}^u = \mathbf{C}(\mathrm{idx}, :) \odot \big(\mathrm{Sigmoid}(\mathbf{s}^u(\mathrm{idx}, :))\, \mathbf{1}^\top\big), \qquad (2)$$

where $\mathrm{rank}(\mathbf{s}^u, K)$ is the top-$K$ ranking operator, which returns the indices of the $K$ largest values in $\mathbf{s}^u$, i.e., the indices of the prototypes selected for user $u$. $\mathbf{C}(\mathrm{idx}, :)$ performs row extraction to form the sub-prototype matrix, while $\mathbf{s}^u(\mathrm{idx}, :)$ extracts the values in $\mathbf{s}^u$ with indices $\mathrm{idx}$. $\mathbf{1}$ is an all-ones vector that broadcasts the $K$ gate values across the $D$ embedding dimensions, $\odot$ denotes the Hadamard product, and $\langle \cdot, \cdot \rangle$ the inner product. $\mathbf{C}^u \in \mathbb{R}^{K \times D}$ is the final activated latent concept embedding matrix for user $u$. Equation 2 is a top-$K$ selection trick that makes the discrete selection operation differentiable; prior work [8] has found it very effective in approximating the top-$K$ selection problem.

Intention assignment. After inferring the current conceptual prototypes $\mathbf{C}^u$, we estimate the user intention related to each item in his/her behavior sequence according to its distance to the prototypes:

$$P_{k|t} = \frac{\exp(\mathrm{LayerNorm}_1(\mathbf{X}^u_t \mathbf{W}_3) \cdot \mathrm{LayerNorm}_2(\mathbf{C}^u_k))}{\sum_{k'=1}^{K} \exp(\mathrm{LayerNorm}_1(\mathbf{X}^u_t \mathbf{W}_3) \cdot \mathrm{LayerNorm}_2(\mathbf{C}^u_{k'}))}, \qquad (3)$$

where $P_{k|t}$ measures how likely the primary intention at position $t$ is related to the $k$-th latent concept. $\mathbf{C}^u_k \in \mathbb{R}^D$ is the embedding of the $k$-th activated conceptual prototype of user $u$, $\mathbf{W}_3 \in \mathbb{R}^{D \times D}$ is a trainable weight matrix, and $\mathrm{LayerNorm}(\cdot)$ denotes a layer normalization layer. Note that we are effectively using cosine similarity instead of the inner product here, due to the normalization. This choice is motivated by the fact that cosine is much less vulnerable than the dot product to model collapse [29], e.g., the degenerate situation where the model ignores most prototypes.

Attention weighting. In addition to the attention weight $P_{k|t}$ calculated from the conceptual perspective, we also consider another attention weight $P_{t|k}$ that estimates how important the item at position $t$ is for predicting the user's next intentions:

$$P_{t|k} = \mathbf{a}^k_t, \qquad \mathbf{a}^k = \mathrm{softmax}(\tanh(\mathbf{X}^u \mathbf{W}^k_1) \mathbf{W}^k_2)^\top, \qquad (4)$$

where $\mathbf{a}^k \in \mathbb{R}^n$ is the attention vector over all positions and the superscript $k$ indicates the attention layer for the $k$-th activated intention. Similar to Equation 1, this is another self-attentive layer. The primary difference is that we make use of the order of the user sequence here by adding extra trainable positional embeddings [43] to the input embeddings. The dimensionality of the positional embeddings is the same as that of the item embeddings so that they can be directly summed.
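The concept-activation step of Equations 1 and 2 can be written in a few lines. The PyTorch sketch below is our own single-user illustration, with randomly initialized tensors standing in for learned parameters (the paper's implementation is in TensorFlow).

```python
import torch
import torch.nn.functional as F

n, D, L, K = 20, 128, 500, 4               # sequence length, dim, pool size, active concepts
X = torch.randn(n, D)                       # X^u: embedded behavior sequence
C = torch.randn(L, D, requires_grad=True)   # concept pool C
W1, W2 = torch.randn(D, D), torch.randn(D)  # attention parameters of Eq. (1)

# Eq. (1): self-attentive aggregation into a virtual concept vector z_u
a = F.softmax(torch.tanh(X @ W1) @ W2, dim=0)    # attention over the n positions
z = a @ X                                        # z_u in R^D

# Eq. (2): keep the K prototypes with the largest affinity to z_u and gate
# them with sigmoid scores; gradients flow through the retained scores.
s = C @ z                                        # affinity s^u in R^L
vals, idx = torch.topk(s, K)                     # indices of the K largest scores
C_u = C[idx] * torch.sigmoid(vals).unsqueeze(-1) # activated prototypes, K x D
```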
We can now generate multiple interest embedding vectors from a user's behavior sequence $\mathbf{X}^u$ according to $P_{k|t}$ and $P_{t|k}$. Specifically, the $k$-th output of our sparse-interest encoder, $\phi^k_\theta(\mathbf{x}^{(u)}) \in \mathbb{R}^D$, is computed as:

$$\phi^k_\theta(\mathbf{x}^{(u)}) = \mathrm{LayerNorm}_3\Big(\sum_{t=1}^{n} P_{k|t} \cdot P_{t|k} \cdot \mathbf{X}^u_t\Big). \qquad (5)$$

This completes the sparse-interest network: given a user's behavior sequence, we first activate his/her preferred conceptual prototypes from the concept pool; intention assignment then estimates the user intention related to each item in the input sequence; a self-attentive layer computes all items' attention weights for next-item prediction; finally, the user's multiple interest embeddings are generated through the weighted sum in Equation 5.

Interest Aggregation Module. After the sparse-interest extraction module, we obtain multiple interest embeddings for each user. A natural follow-up question is how to leverage these interests for practical inference. An intuitive solution is to use the next predicted item as a target label to select among interest embeddings during training, as in MIND [24]. Despite its simplicity, the main drawback is that no target labels exist during inference, which leads to a gap between training and testing and may cause performance degradation.

To address this issue, we propose an adaptive interest aggregation module based on active prediction. The motivation is that it is easier to predict a user's next intention from his/her temporal preferences than to find ideal labels. Specifically, based on the intention assignment score $P_{k|t}$ computed in Equation 3, we obtain an intention distribution matrix $\mathbf{P}^u \in \mathbb{R}^{n \times K}$ for all items in the behavior sequence. The input behavior sequence $\mathbf{x}^u$ can then be reformulated from the intention perspective as $\widehat{\mathbf{X}}^u = \mathbf{P}^u \mathbf{C}^u$, where $\widehat{\mathbf{X}}^u \in \mathbb{R}^{n \times D}$ is viewed as the intention sequence of user $u$. With $\widehat{\mathbf{X}}^u$, the user's next intention $\mathbf{C}^u_{apt}$ is adaptively computed as

$$\mathbf{C}^u_{apt} = \mathrm{LayerNorm}_4\Big(\big(\mathrm{softmax}(\tanh(\widehat{\mathbf{X}}^u \mathbf{W}_4) \mathbf{W}_5)\big)^\top \widehat{\mathbf{X}}^u\Big)^\top, \qquad (6)$$

where $\mathbf{C}^u_{apt} \in \mathbb{R}^D$ is the predicted intention of user $u$ for the next item, and $\mathbf{W}_4 \in \mathbb{R}^{D \times D}$ and $\mathbf{W}_5 \in \mathbb{R}^D$ are trainable parameters. Given $\mathbf{C}^u_{apt}$ and the multiple interest embeddings $\{\phi^k_\theta(\mathbf{x}^{(u)})\}_{k=1}^{K}$, the aggregation weights of the different interests are calculated as

$$e^u_k = \frac{\exp((\mathbf{C}^u_{apt})^\top \phi^k_\theta(\mathbf{x}^{(u)}) / \tau)}{\sum_{k'=1}^{K} \exp((\mathbf{C}^u_{apt})^\top \phi^{k'}_\theta(\mathbf{x}^{(u)}) / \tau)}, \qquad (7)$$

where $\mathbf{e}^u = [e^u_1, e^u_2, \cdots, e^u_K]^\top \in \mathbb{R}^K$ is the attention vector over the diverse interests and $\tau$ is a temperature parameter to tune. When $\tau$ is large ($\tau \to \infty$), $\mathbf{e}^u$ approximates a uniform distribution; when $\tau$ is small ($\tau \to 0^+$), $\mathbf{e}^u$ approximates a one-hot vector. In experiments, we use a small $\tau$. The final user representation $\mathbf{v}^u \in \mathbb{R}^D$ is computed as

$$\mathbf{v}^u = \sum_{k=1}^{K} e^u_k \cdot \phi^k_\theta(\mathbf{x}^{(u)}). \qquad (8)$$
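Continuing the sketch, Equations 3–8 map the activated prototypes to the final user vector. Again this is a hedged single-user illustration: a shared LayerNorm module and an assumed small temperature of 0.1 stand in for the paper's separately parameterized layers and tuned value.

```python
import torch
import torch.nn.functional as F

n, D, K, tau = 20, 128, 4, 0.1       # tau: assumed small temperature
X = torch.randn(n, D)                 # behavior embeddings (positions already added)
C_u = torch.randn(K, D)               # activated prototypes from the sparse-interest layer
W3 = torch.randn(D, D)
ln = torch.nn.LayerNorm(D)            # one module reused where the paper uses several

# Eq. (3): P_{k|t}, soft assignment of every position to the K active concepts
P_kt = F.softmax(ln(X @ W3) @ ln(C_u).t(), dim=1)           # n x K

# Eq. (4): P_{t|k}, one self-attentive layer per activated intention
W1k, W2k = torch.randn(K, D, D), torch.randn(K, D)
P_tk = torch.stack([F.softmax(torch.tanh(X @ W1k[k]) @ W2k[k], dim=0)
                    for k in range(K)], dim=1)              # n x K

# Eq. (5): K interest embeddings as position-weighted sums
phi = ln((P_kt * P_tk).t() @ X)                             # K x D

# Eq. (6): predict the next intention from the intention sequence X_hat = P^u C^u
X_hat = P_kt @ C_u                                          # n x D
W4, W5 = torch.randn(D, D), torch.randn(D)
attn = F.softmax(torch.tanh(X_hat @ W4) @ W5, dim=0)        # attention over positions
C_apt = ln(attn @ X_hat)                                    # predicted intention, R^D

# Eqs. (7)-(8): temperature softmax over interests, then the final user vector
e = F.softmax(phi @ C_apt / tau, dim=0)                     # K aggregation weights
v = e @ phi                                                 # v^u in R^D
```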
Model Optimization. We follow the common practice [21, 24] to train our model by recovering the next click $x^{(u)}_t$ based on the truncated sequence prior to the click, i.e., $[x^{(u)}_1, x^{(u)}_2, \cdots, x^{(u)}_{t-1}]$. Given a training sample $(u, t)$ with the user embedding vector $\mathbf{v}^u$ and item embedding $\mathbf{H}_t$, we aim to minimize the following negative log-likelihood:

$$\mathcal{L}_{like} = -\sum_u \sum_t \log P(x^{(u)}_t \mid x^{(u)}_1, x^{(u)}_2, \cdots, x^{(u)}_{t-1}) = -\sum_u \sum_t \log \frac{\exp(\mathbf{H}_t^\top \mathbf{v}^u)}{\sum_{j \in \{1, 2, \cdots, M\}} \exp(\mathbf{H}_j^\top \mathbf{v}^u)}. \qquad (9)$$

Equation (9) is usually intractable in practice because the sum in the denominator is computationally prohibitive. We therefore leverage a sampled softmax technique [6, 18] to train our model. Besides, we also introduce a covariance regularizer following [5] to encourage the learned conceptual prototypes to be orthogonal. Specifically, let $\mathbf{M} = \frac{1}{D}(\mathbf{C} - \bar{\mathbf{C}})(\mathbf{C} - \bar{\mathbf{C}})^\top$ denote the covariance matrix of the prototype embeddings, where $\bar{\mathbf{C}}$ is the mean matrix of $\mathbf{C}$. The regularization loss $\mathcal{L}_c$ on the covariance is

$$\mathcal{L}_c = \frac{1}{2}\big(\|\mathbf{M}\|_F^2 - \|\mathrm{diag}(\mathbf{M})\|_F^2\big), \qquad (10)$$

where $\|\cdot\|_F$ is the Frobenius norm. Combining the two losses above, the final loss function of our model is

$$\mathcal{L} = \mathcal{L}_{like} + \lambda \mathcal{L}_c, \qquad (11)$$

where $\lambda$ is a trade-off parameter balancing the two losses.
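A sketch of the training objective follows: the covariance regularizer of Equation 10 and a simplified sampled-softmax surrogate for Equation 9. Uniform negative sampling and the helper names are our assumptions; production systems typically sample negatives more carefully over the catalog [6, 18].

```python
import torch
import torch.nn.functional as F

def covariance_reg(C):
    """Eq. (10): penalize off-diagonal covariance so prototypes stay near-orthogonal."""
    Cc = C - C.mean(dim=0, keepdim=True)         # C - C_bar
    M = (Cc @ Cc.t()) / C.size(1)                # covariance matrix, L x L
    off_diag = M - torch.diag(torch.diag(M))
    return 0.5 * (off_diag ** 2).sum()           # (||M||_F^2 - ||diag(M)||_F^2) / 2

def sampled_softmax_loss(v, H, pos, num_neg=10):
    """Eq. (9) approximated with uniformly sampled negatives (simplified sketch)."""
    B = v.size(0)
    neg = torch.randint(0, H.size(0), (B, num_neg))      # random negative item ids
    pos_logit = (H[pos] * v).sum(-1, keepdim=True)       # B x 1
    neg_logit = torch.einsum('bd,bkd->bk', v, H[neg])    # B x num_neg
    logits = torch.cat([pos_logit, neg_logit], dim=1)
    target = torch.zeros(B, dtype=torch.long)            # positive sits at index 0
    return F.cross_entropy(logits, target)

# Eq. (11): total loss with trade-off lambda
# loss = sampled_softmax_loss(v, H, pos) + lam * covariance_reg(C)
```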
Comparison with Existing Methods. We compare our model with existing methods that extract users' multiple interest embeddings in the matching stage of recommendation. We roughly divide them into two categories and analyze the differences below.

Implicit approach. This type of method relies on powerful neural networks to implicitly cluster historical behaviors and extract diverse interests. For example, MIND [24] utilizes a capsule network [34] to adaptively aggregate a user's behaviors into interest embedding vectors, while SASRec [21] adopts the multi-head self-attention mechanism [43] to output multiple representations for a user. Compared with these methods, our model is an explicit approach that detects intentions from the user's behavior sequence based on latent conceptual prototypes.
Explicit approach. Methods of this type maintain a set of conceptual prototypes to explicitly determine the intentions of items in the user's behavior sequence. MCPRN [44] is a recent representative work for extracting multiple interests from a session for next-item recommendation. DisenRec [29] utilizes latent prototypes to help learn disentangled representations for recommendation. We also follow the explicit approach, but our model scales to large datasets. Specifically, these methods require the number of interest embeddings to equal the number of conceptual prototypes; however, the number of latent concepts depends on the application and can easily scale up to hundreds or even thousands in industrial recommender systems, which hinders their application in practice. In contrast, our sparse-interest network automatically infers a sparse set of preferred intentions from the large concept pool.
4 EXPERIMENTS
In this section, we conduct experiments on three benchmark datasets and one billion-scale industrial dataset to validate the proposed approach. Specifically, we try to answer the following research questions:
• (Q1) How effective is the proposed method compared to state-of-the-art baselines?
• (Q2) What are the effects of the different modules, i.e., the sparse-interest module and the interest aggregation module, in ablation studies?
• (Q3) How sensitive is the model to hyper-parameter settings, including the number of preferred intentions $K$ and the number of conceptual prototypes $L$?

We first elaborate on the datasets, evaluation metrics, and comparison methods used in our experiments.
Table 1: Recommendation performance on public datasets. The best results are highlighted in bold font. All numbers in the table are percentages with '%' omitted. The table reports HR and NDCG under Metrics@10 and Metrics@50 on MovieLens and under Metrics@50 and Metrics@100 on Amazon and Taobao, for GRU4Rec, Caser, SASRec, MIND, MCPRN, and SINE. (Most entries of the table body were not recoverable; the surviving SASRec values on MovieLens read 17.34, 7.84, 46.01, 13.53.)

Datasets. We conduct experiments on three benchmark datasets and one billion-scale industrial dataset. The statistics of the datasets are shown in Table 2.
• MovieLens (https://grouplens.org/datasets/movielens/1m/) collects users' rating scores for movies. In experiments, we follow [15] to preprocess the dataset.
• Amazon (http://jmcauley.ucsd.edu/data/amazon/) consists of product reviews from Amazon. In experiments, we use the rating-only version of the Book category. Note that this version is more challenging than the 5-core version used in [24], due to its large volume and sparsity.
• Taobao (https://tianchi.aliyun.com/dataset/dataDetail?dataId=649) collects user behaviors from Taobao's recommender system. In experiments, we only use the click behaviors.
• ULarge consists of click behaviors collected from the daily logs of an Alibaba company from March 29 to April 4, 2020.

For all datasets, we follow [21] to split them into training/validation/testing sets. Specifically, we split the historical sequence of each user into three parts: (1) the most recent action for testing, (2) the second most recent action for validation, and (3) all remaining actions for training, as sketched below. Note that during testing, the input sequences contain both the training actions and the validation actions.
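For concreteness, a minimal sketch of this leave-one-out protocol; the dictionary format and function name are our own assumptions.

```python
def split_sequences(user_seqs):
    """Leave-one-out split following [21]: last click for testing, second-to-last
    for validation, the rest for training. user_seqs maps user id -> item list."""
    train, valid, test = {}, {}, {}
    for u, seq in user_seqs.items():
        if len(seq) < 3:             # too short to hold out two items
            train[u] = seq
            continue
        train[u] = seq[:-2]
        valid[u] = (seq[:-2], seq[-2])   # (input prefix, held-out item)
        test[u] = (seq[:-1], seq[-1])    # test input includes the validation click
    return train, valid, test
```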
Competitors. We compare our proposed model SINE with the following state-of-the-art sequential recommendation baselines.
• Single-embedding models: GRU4Rec [17] is a pioneering work that employs GRUs to model user behavior sequences. Caser [42] is a recent CNN-based sequential recommendation benchmark.
• Multi-embedding models: MIND [24] and SASRec [21] are recently proposed multi-interest methods based on the capsule network [34] and multi-head self-attention [43], respectively. MCPRN [44] is another state-of-the-art multi-interest framework based on latent conceptual prototypes.
Parameter Configuration. For a fair comparison, all methods are implemented in TensorFlow and optimized with the Adam optimizer with a mini-batch size of 128. The learning rate is fixed at 0.001. We tuned the parameters of the comparison methods according to the values suggested in the original papers, set the embedding size $D$ to 128, and used 5 negative samples for MovieLens and 10 for the other datasets. Our method has three crucial hyper-parameters: the trade-off parameter $\lambda$, the number of intentions $K$, and the number of latent prototypes $L$. We search $K$ and $L$ over grids of candidate values, and $\lambda$ from 0 to 1 with step size 0.1. We found that our model performs relatively stably when $\lambda$ is around 0.5 and thus set $\lambda = 0.5$. The configuration of the other two parameters for the four datasets is reported in Table 3.
Table 3: The optimal settings of the hyper-parameters of our model. Other parameters, i.e., dimension $D$, sequence length $n$, and $\lambda$, are set to 128, 20, and 0.5, respectively.

Dataset     K    L
MovieLens   4    50
Amazon      4    500
Taobao      8    1000
ULarge      8    5000
Evaluation Metrics. For each user in the test set, we treat all items that the user has not interacted with as negative items. We use two common evaluation criteria [14], hit rate (HR) and normalized discounted cumulative gain (NDCG), to evaluate the performance of our model. Besides, we also leverage the widely used Normalized Mutual Information (NMI) [30] to quantitatively analyze the effectiveness of the learned conceptual prototypes in clustering items.
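The two criteria can be computed per user as below; with a single held-out item the ideal DCG is 1, a common simplification we assume here.

```python
import numpy as np

def hr_ndcg_at_n(ranked_item_ids, target_item, N):
    """HR@N: 1 if the held-out item is in the top-N list; NDCG@N: rank-discounted."""
    topn = list(ranked_item_ids[:N])
    if target_item not in topn:
        return 0.0, 0.0
    rank = topn.index(target_item)        # 0-based position in the list
    return 1.0, 1.0 / np.log2(rank + 2)   # DCG of a single hit at this rank
```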
Performance Comparison (Q1). Table 1 summarizes the performance of SINE and the baselines on three benchmark datasets. SINE achieves the best performance on all evaluation criteria in general. Caser obtains the best performance among the models (e.g., GRU4Rec) that output only a single embedding for each user. Employing multiple embedding vectors per user (SASRec, MIND, MCPRN, SINE) generally performs better than single-embedding methods (Caser and GRU4Rec); exploring multiple user embedding vectors has thus proved to be an effective way of modeling users' diverse interests and boosting sequential recommendation accuracy. Moreover, the improvement from capturing users' various intentions is more significant on Taobao and Amazon: users of Taobao and Amazon tend to exhibit more diverse interests in online shopping than in rating movies. The improvement of MIND over SASRec shows that dynamic routing serves as a better multi-interest extractor than multi-head self-attention. An interesting observation is that MIND beats MCPRN on Amazon and Taobao while losing on MovieLens. This is mainly because MCPRN only supports clustering all items into a small set of prototypes, which makes it difficult to cluster the millions of items on Amazon and Taobao well. Comparing MIND and SINE, SINE consistently outperforms MIND on all three datasets over all evaluation metrics. This can be attributed to two points: 1) the sparse-interest extraction layer explicitly utilizes a large set of conceptual prototypes to cluster items and automatically infers a subset of preferred intentions for interest embedding generation, which yields a more precise representation of a user; 2) the interest aggregation module actively predicts the user's current intention to attend directly over multiple user embedding vectors, enabling explicit multi-interest modeling for top-N recommendation.

Figure 3: Concept visualization. We draw four concepts, "dolls", "jackets", "cosmetics", and "cups", with the top-8 closest items.

Figure 4: Sensitivity of SINE towards $K$ and $L$ on Taobao (HR versus $K$ and HR versus $L$).

Parameter Sensitivity (Q3). We also investigate the sensitivity of the number of intentions $K$ and the number of conceptual prototypes $L$. Figure 4 reports the performance of our model in terms of HR. In particular, we randomly select 1 million users for inference, and the average result over 10 runs is reported. Results are consistent on the other datasets, and we omit those figures to save space. From the figure, we observe that SINE obtains the best performance when $K = 8$ and $L = 1000$.
Table 4: Recommendation performance on the industrial dataset ULarge. The Improv. row gives the improvement of our model over the second-best baseline. Columns report HR@50, HR@100, and HR@500; rows are Caser, GRU4Rec, SASRec, MCPRN, MIND, SINE, and Improv. (Only the SINE row survived extraction: 12.24, 21.12, 40.81.)
We further conduct an offline experiment to investigate the effectiveness of our model in extracting users' diverse interests on the industrial dataset. We implemented our model and the baselines on the Alibaba company's distributed cloud platform, where every two workers share an NVIDIA Tesla P100 GPU with 16 GB memory. Table 4 summarizes the performance in terms of hit rate. SINE significantly outperforms the other baselines by a wide margin. Another interesting observation is that the gap between SINE and the second-best benchmark (MIND) decreases as the number of recalled items increases. This indicates that our sparse-interest network helps capture users' diverse interests and ranks the most preferred items at the top of the recommendation list.

Table 5: Prototype clustering evaluation compared with the first, second, and leaf-level category information on ULarge.

        Level-1   Level-2   Level-leaf
NMI     0.09      0.37      0.29

Table 6: Ablation study of SINE.

Dataset   Method       HR@50   HR@100
Taobao    SINE-cate    —       —
          SINE-label   —       —
          SINE         17.69   20.64
ULarge    SINE-cate    —       —
          SINE-label   —       —
          SINE         12.24   21.12
(The values for the SINE-cate and SINE-label rows were not recoverable.)
Case Study
We also visualize the learned conceptual prototypes of our model. Concretely, for each concept, we use its prototypical embedding vector to retrieve the top-8 closest items under cosine similarity. Figure 3 illustrates four exemplar concepts and their clustering performance. As can be seen, our model successfully groups semantically similar items into a latent concept. More importantly, the items in one concept come from different but semantically close leaf categories. For example, the "cosmetics" concept contains different kinds of skin-care products. This indicates that, compared to the conventional leaf-category partition, our conceptual prototypes relate to a user's high-level intentions. To confirm this point, we compare the learned concepts with the expert-labeled category hierarchy at Alibaba, where the numbers of categories in the first, second, and leaf levels are 178, 7,945, and 14,874, respectively. Table 5 reports the results in terms of NMI. We observe that the learned concepts are closest to the second-level categories, neither at the extremely fine granularity (leaf) nor at the very coarse granularity (first level). This demonstrates that our model captures relatively high-level semantics for modeling users' intentions.
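The retrieval behind Figure 3 and the NMI comparison of Table 5 can be sketched as follows; the embedding arrays and the nearest-prototype assignment rule are assumptions on our part.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def top_items_per_concept(C, H, k=8):
    """Top-k closest items to each prototype under cosine similarity (Figure 3)."""
    Cn = C / np.linalg.norm(C, axis=1, keepdims=True)
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    sim = Cn @ Hn.T                          # L x M cosine similarities
    return np.argsort(-sim, axis=1)[:, :k]   # item ids per concept

def concept_category_nmi(C, H, category_ids):
    """Assign each item to its closest prototype, then compare with labels (Table 5)."""
    Cn = C / np.linalg.norm(C, axis=1, keepdims=True)
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    concept_ids = (Hn @ Cn.T).argmax(axis=1)
    return normalized_mutual_info_score(category_ids, concept_ids)
```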
Ablation Study (Q2). We introduce two variants (SINE-cate and SINE-label) to validate the effectiveness of the learned prototypes and the interest aggregation module. Specifically, SINE-cate is obtained by using the category attributes as prototypes, while SINE-label is obtained by adopting the label-aware attention of [24] for training. We only conduct experiments on Taobao and ULarge, since the other datasets do not have category attributes; Taobao and ULarge have 9,439 and 14,874 distinct categories, respectively. Note that, similar to MIND [24], SINE-label first independently retrieves $K \cdot N$ candidate items based on the $K$ embedding vectors and then outputs the final top-N recommendation list by sorting the $K \cdot N$ items. Table 6 reports the results in terms of HR. SINE significantly outperforms the other two variants on both datasets. The substantial difference between SINE-cate and SINE shows that the learned concepts cluster items better than the items' original categories, which verifies our motivation to jointly cluster items within our model. The improvement of SINE over SINE-label validates that our interest aggregation module is useful for modeling multiple interests for next-item recommendation.
5 CONCLUSION
In this paper, we propose a novel sparse-interest embedding framework for sequential recommendation. Our model can adaptively activate multiple intentions from a large pool of conceptual prototypes to generate multiple interest embeddings for a user. It also develops an interest aggregation module that actively predicts the user's current intention to capture multiple interests for overall top-N recommendation. Empirical results demonstrate that our model performs better than state-of-the-art baselines on challenging datasets. Results on a billion-scale industrial dataset further confirm our model's effectiveness in terms of recommendation accuracy and producing reasonable item clusters. In the future, we plan to leverage lifelong learning to capture users' long-term interests for more accurate recommendation.
REFERENCES
[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
[2] Peter J Burt. 1988. Attention mechanisms for vision in a dynamic world. In ICPR. IEEE Computer Society, 977–978.
[3] Xu Chen, Hongteng Xu, Yongfeng Zhang, Jiaxi Tang, Yixin Cao, Zheng Qin, and Hongyuan Zha. 2018. Sequential recommendation with user memory networks. In WSDM. 108–116.
[4] Chen Cheng, Haiqin Yang, Michael R Lyu, and Irwin King. 2013. Where you like to go next: Successive point-of-interest recommendation. In IJCAI.
[5] Michael Cogswell, Faruk Ahmed, Ross Girshick, Larry Zitnick, and Dhruv Batra. 2015. Reducing overfitting in deep networks by decorrelating representations. arXiv preprint arXiv:1511.06068 (2015).
[6] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for youtube recommendations. In RecSys. 191–198.
[7] Wenqi Fan, Yao Ma, Qing Li, Yuan He, Eric Zhao, Jiliang Tang, and Dawei Yin. 2019. Graph neural networks for social recommendation. In WWW. 417–426.
[8] Hongyang Gao and Shuiwang Ji. 2019. Graph U-Nets. arXiv preprint arXiv:1905.05178 (2019).
[9] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247 (2017).
[10] Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In NIPS. 1024–1034.
[11] Ruining He, Chen Fang, Zhaowen Wang, and Julian McAuley. 2016. Vista: a visually, socially, and temporally-aware model for artistic recommendation. In RecSys. 309–316.
[12] Ruining He and Julian McAuley. 2016. Fusing similarity models with markov chains for sparse sequential recommendation. In ICDM. IEEE, 191–200.
[13] Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. 2020. LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation. arXiv preprint arXiv:2002.02126 (2020).
[14] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In WWW. 173–182.
[15] Xiangnan He, Hanwang Zhang, Min-Yen Kan, and Tat-Seng Chua. 2016. Fast matrix factorization for online recommendation with implicit feedback. In SIGIR. ACM, 549–558.
[16] Balázs Hidasi and Alexandros Karatzoglou. 2018. Recurrent neural networks with top-k gains for session-based recommendations. In CIKM. 843–852.
[17] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939 (2015).
[18] Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2014. On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007 (2014).
[19] Bowen Jin, Chen Gao, Xiangnan He, Depeng Jin, and Yong Li. 2020. Multi-behavior recommendation with graph convolutional networks. In SIGIR. 659–668.
[20] Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data (2019).
[21] Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In ICDM. IEEE, 197–206.
[22] Donghyun Kim, Chanyoung Park, Jinoh Oh, Sungyoung Lee, and Hwanjo Yu. 2016. Convolutional matrix factorization for document context-aware recommendation. In RecSys. 233–240.
[23] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009), 30–37.
[24] Chao Li, Zhiyuan Liu, Mengmeng Wu, Yuchi Xu, Huan Zhao, Pipei Huang, Guoliang Kang, Qiwei Chen, Wei Li, and Dik Lun Lee. 2019. Multi-interest network with dynamic routing for recommendation at Tmall. In CIKM. 2615–2623.
[25] Jiacheng Li, Yujie Wang, and Julian McAuley. 2020. Time Interval Aware Self-Attention for Sequential Recommendation. In WSDM. 322–330.
[26] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130 (2017).
[27] Ninghao Liu, Qiaoyu Tan, Yuening Li, Hongxia Yang, Jingren Zhou, and Xia Hu. 2019. Is a single vector enough? exploring node polysemy for network embedding. In KDD. 932–940.
[28] Fuyu Lv, Taiwei Jin, Changlong Yu, Fei Sun, Quan Lin, Keping Yang, and Wilfred Ng. 2019. SDM: Sequential deep matching model for online large-scale recommender system. In CIKM. 2635–2643.
[29] Jianxin Ma, Chang Zhou, Peng Cui, Hongxia Yang, and Wenwu Zhu. 2019. Learning disentangled representations for recommendation. In NIPS. 5711–5722.
[30] Aaron F McDaid, Derek Greene, and Neil Hurley. 2011. Normalized mutual information to evaluate overlapping community finding algorithms. arXiv preprint arXiv:1110.2515 (2011).
[31] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for YouTube recommendation. In RecSys. 191–198.
[32] Steffen Rendle. 2010. Factorization machines. In ICDM. IEEE, 995–1000.
[33] Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. 2010. Factorizing personalized markov chains for next-basket recommendation. In WWW. 811–820.
[34] Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. 2017. Dynamic routing between capsules. In NIPS. 3856–3866.
[35] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In WWW. 285–295.
[36] J Ben Schafer, Dan Frankowski, Jon Herlocker, and Shilad Sen. 2007. Collaborative filtering recommender systems. In The adaptive web. Springer, 291–324.
[37] Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. 2015. Autorec: Autoencoders meet collaborative filtering. In WWW. 111–112.
[38] Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In CIKM. 1441–1450.
[39] Qiaoyu Tan, Ninghao Liu, and Xia Hu. 2019. Deep Representation Learning for Social Network Analysis. Frontiers in Big Data (2019).
[40] Qiaoyu Tan, Ninghao Liu, Xing Zhao, Hongxia Yang, Jingren Zhou, and Xia Hu. 2020. Learning to hash with graph neural networks for recommender systems. In WWW. 1988–1998.
[41] Qiaoyu Tan, Jianwei Zhang, Ninghao Liu, Xiao Huang, Hongxia Yang, Jingren Zhou, and Xia Hu. 2021. Dynamic memory based attention network for sequential recommendation. In AAAI.
[42] Jiaxi Tang and Ke Wang. 2018. Personalized top-n sequential recommendation via convolutional sequence embedding. In WSDM. 565–573.
[43] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS. 5998–6008.
[44] Shoujin Wang, Liang Hu, Yan Wang, Quan Z Sheng, Mehmet A Orgun, and Longbing Cao. 2019. Modeling Multi-Purpose Sessions for Next-Item Recommendations via Mixture-Channel Purpose Routing Networks. In IJCAI. 3771–3777.
[45] An Yan, Shuo Cheng, Wang-Cheng Kang, Mengting Wan, and Julian McAuley. 2019. CosRec: 2D Convolutional Neural Networks for Sequential Recommendation. In CIKM. 2173–2176.
[46] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. 2018. Graph convolutional neural networks for web-scale recommender systems. In KDD. 974–983.
[47] Feng Yu, Qiang Liu, Shu Wu, Liang Wang, and Tieniu Tan. 2016. A dynamic recurrent model for next basket recommendation. In SIGIR. 729–732.
[48] Wenhui Yu and Zheng Qin. 2020. Graph Convolutional Network for Recommendation with Low-pass Collaborative Filters. In ICML. PMLR, 10936–10945.
[49] Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. 2019. Deep learning based recommender system: A survey and new perspectives. ACM Computing Surveys (CSUR) 52, 1 (2019), 1–38.
[50] Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In KDD. 1059–1068.
[51] Han Zhu, Xiang Li, Pengye Zhang, Guozheng Li, Jie He, Han Li, and Kun Gai. 2018. Learning tree-based deep model for recommender systems. In KDD.