Dynamic Memory based Attention Network for Sequential Recommendation

Qiaoyu Tan, Jianwei Zhang, Ninghao Liu, Xiao Huang, Hongxia Yang, Jingren Zhou, Xia Hu
Texas A&M University, Alibaba Group, The Hong Kong Polytechnic University
{qytan, nhliu43, xiahu}@tamu.edu, [email protected]
{zhangjianwei.zjw, yang.yhx, jingren.zhou}@alibaba-inc.com

Abstract
Sequential recommendation has become increasingly essential in various online services. It aims to model the dynamic preferences of users from their historical interactions and predict their next items. The accumulated user behavior records on real systems can be very long. This rich data brings opportunities to track the actual interests of users. Prior efforts mainly focus on making recommendations based on relatively recent behaviors. However, the overall sequential data may not be effectively utilized, as early interactions might affect users' current choices. Also, it has become intolerable to scan the entire behavior sequence when performing inference for each user, since real-world systems require short response times. To bridge the gap, we propose a novel long sequential recommendation model, called Dynamic Memory-based Attention Network (DMAN). It segments the overall long behavior sequence into a series of sub-sequences, then trains the model and maintains a set of memory blocks to preserve long-term interests of users. To improve memory fidelity, DMAN dynamically abstracts each user's long-term interest into its own memory blocks by minimizing an auxiliary reconstruction loss. Based on the dynamic memory, the user's short-term and long-term interests can be explicitly extracted and combined for efficient joint recommendation. Empirical results over four benchmark datasets demonstrate the superiority of our model in capturing long-term dependencies over various state-of-the-art sequential models.
Introduction
Recommender systems have become an important tool in various online systems such as e-commerce, social media, and advertising to provide personalized services (Hidasi et al. 2015; Ying et al. 2018b). One core stage of live industrial systems is candidate selection and ranking (Covington, Adams, and Sargin 2016), which is responsible for retrieving a few hundred relevant items from a million- or even billion-scale corpus. Previously, researchers resorted to collaborative filtering approaches (Sarwar et al. 2001), assuming that like-minded users tend to exhibit similar preferences on items. Typical examples include models based on matrix factorization (Sarwar et al. 2001), factorization machines (Rendle 2010), and graph neural networks (Ying et al. 2018b; Wang et al. 2019b; Tan et al.). However, several challenges remain when the behavior sequence is long. 1) Some methods split the whole behavior sequence into short-term and long-term behavior sequences and then explicitly extract a user's temporal and long-term preferences (Ying et al. 2018a; Lv et al. 2019). Despite their simplicity, they still suffer from high computational complexity, since they need to scan the whole behavior sequence during inference. 2) It is crucial to model the whole behavior sequence for a more accurate recommendation. A few attempts have been made to focus only on short-term actions (Li, Wang, and McAuley 2020; Hidasi and Karatzoglou 2018) and abandon previous user behaviors. Nevertheless, studies (Ren et al. 2019; Belletti, Chen, and Chi 2019) have demonstrated that user preferences may be influenced by her/his early interactions beyond the short-term behavior sequence. 3) It is hard to explicitly control the contributions of long-term or short-term interests for user modeling. Some studies resort to memory neural networks (Graves, Wayne, and Danihelka 2014) to implicitly preserve long-term intentions for efficient sequential modeling (Chen et al. 2018; Ren et al. 2019).
But they may suffer from long-term knowledge forgetting (Sodhani, Chandar, and Bengio 2018), because the memory is optimized by predicting the next item. Therefore, an advanced sequential model is needed that explicitly models both long-term and short-term preferences while supporting efficient inference.

To address the limitations above, we propose a novel dynamic memory-based self-attention network, dubbed DMAN, to model long behavior sequence data. It builds on standard self-attention networks to effectively capture long-term dependencies for user modeling. To improve model efficiency, DMAN truncates the whole user behavior sequence into several successive sub-sequences and optimizes the model sequence by sequence. Specifically, a recurrent attention network is derived to utilize the correlation between adjacent sequences for short-term interest modeling. Meanwhile, another attention network is introduced to measure dependencies beyond consecutive sequences for long-term interest modeling, based on a dynamic memory that preserves user behaviors before the adjacent sequences. Finally, the two aspects of interest are adaptively integrated via a neural gating network for joint recommendation. To enhance memory fidelity, we further develop a dynamic memory network to effectively update the memory blocks sequence by sequence using an auxiliary reconstruction loss. To summarize, the main contributions of this paper are as follows:

• We propose a dynamic memory-based attention network, DMAN, for modeling long behavior sequences, which conducts explicit and adaptive user modeling and supports efficient inference.
• We derive a dynamic memory network to dynamically abstract a user's long-term interests into an external memory, sequence by sequence.
• Extensive experiments on several challenging benchmarks demonstrate our method's effectiveness in modeling long user behavior data.

The Proposed DMAN Model
In this section, we first introduce the problem formulation and then discuss the proposed framework in detail.

Table 1: Notations summary.

Notation      Description
u             a user
t             an item
x             an interaction record
U             the set of users
V             the set of items
S_n           the n-th behavior sequence
K             the number of candidate items
N             the number of sliced sequences
L             the number of self-attention layers
m             the number of memory slots
D             the number of embedding dimensions
\tilde{H}     the short-term interest embedding
\hat{H}       the long-term interest embedding
M             the memory embedding matrix
V             the output user embedding

Notations and Problem Formulation
Assume U and V denote the sets of users and items, respectively. S = \{x_1, x_2, \ldots, x_{|S|}\} represents the behavior sequence of a user in chronological order, where x_t \in V records the t-th item interacted with by the user. Given an observed behavior sequence \{x_1, x_2, \ldots, x_t\}, the sequential recommendation task is to predict the next items that the user might interact with. Notations are summarized in Table 1.

In our setting, because the accumulated behavior sequence S is very long, we truncate it into a series of successive sub-sequences with fixed window size T, i.e., S = \{S_n\}_{n=1}^{N}, for the model to process efficiently. S_n = \{x_{n,1}, x_{n,2}, \ldots, x_{n,T}\} denotes the n-th sequence. Traditional sequential recommendation methods mainly rely on a few recent behaviors S_N for user modeling. Our paper focuses on leveraging the whole behavior sequence for a comprehensive recommendation. We first illustrate how to explicitly extract short-term and long-term user interests from historical behaviors, then describe an adaptive way to combine them for joint recommendation. Finally, we introduce a novel dynamic memory network to effectively preserve a user's long-term interests for efficient inference.

Recurrent Attention Network
This subsection introduces the proposed recurrent attention network for short-term interest modeling. Given an arbitrary behavior sequence S_n as input, an intuitive way to estimate a user's short-term preferences is to consider only her/his behaviors within the sequence. However, the first few items in each sequence may lack the necessary context for effective modeling, because previous sequences are not considered. To address this limitation, we introduce the notion of recurrence in RNNs into the self-attention network and build a sequence-level recurrent attention network, enabling information flow between adjacent sequences. In particular, we use the hidden state computed for the last sequence as additional context for modeling the next sequence. Formally, let S_{n-1} and S_n be two successive sequences, and let \tilde{H}^l_{n-1} \in R^{T \times D} denote the l-th layer hidden state produced for sequence S_{n-1}.

Figure 1: Illustration of DMAN for one layer. It takes a series of sequences as input and trains the model sequence by sequence. When processing the n-th sequence S_n, the recurrent attention network is applied to extract short-term user interest by using the previous hidden state \tilde{H}_{n-1} as context. Meanwhile, the long-term attention network is utilized to extract long-term interest based on the memory blocks M. Next, the short-term and long-term interests are combined via a neural gating network for joint user modeling. Finally, the dynamic memory network updates the memory blocks by fusing the information in \tilde{H}_{n-1}, and the model continues to process the next sequence. The overall model is optimized by maximizing the likelihood of the observed sequence, while the dynamic memory network is trained based on a local reconstruction loss L_{ae}.

We calculate the hidden state of sequence S_n as follows:

    \tilde{H}^l_n = \mathrm{Atten}^l_{rec}(\tilde{Q}^l_n, \tilde{K}^l_n, \tilde{V}^l_n) = \mathrm{softmax}(\tilde{Q}^l_n (\tilde{K}^l_n)^\top) \tilde{V}^l_n,
    \tilde{Q}^l_n = \tilde{H}^{l-1}_n \tilde{W}^\top_Q, \quad \tilde{K}^l_n = H^{l-1}_n \tilde{W}^\top_K, \quad \tilde{V}^l_n = H^{l-1}_n \tilde{W}^\top_V,
    H^{l-1}_n = \tilde{H}^{l-1}_n \,\|\, \mathrm{SG}(\tilde{H}^{l-1}_{n-1}),                                  (1)

where \mathrm{Atten}^l_{rec}(\cdot, \cdot, \cdot) is the l-th layer self-attention network, in which the query, key, and value matrices are denoted by Q, K, and V, respectively. The input of the first layer is the sequence embedding matrix X_n = [x_{n,1}, \ldots, x_{n,T}] \in R^{T \times D}. Intuitively, the attention layer calculates a weighted sum of embeddings, where the attention weight is computed between query i in S_n and value j obtained from previous sequences.
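As a concrete sketch of the recurrent attention layer in Eq. (1), the following forward-only NumPy snippet computes one layer for a single user. It is a simplification of the paper's model: the causal mask, positional embeddings, and multi-head structure are omitted, and the function and variable names are ours. Since no backward pass is computed, the stop-gradient SG(·) is implicit — the cached state is simply treated as a constant.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def recurrent_attention(H_cur, H_cached, W_q, W_k, W_v):
    """One recurrent attention layer (Eq. 1), forward pass only.

    H_cur:    (T, D) layer l-1 hidden state of the current sequence S_n.
    H_cached: (T, D) cached hidden state of the previous sequence S_{n-1},
              treated as a constant (the stop-gradient SG(.) in Eq. 1).
    """
    context = np.concatenate([H_cur, H_cached], axis=0)  # (2T, D) extended context
    Q = H_cur @ W_q.T      # queries come from the current sequence only
    K = context @ W_k.T    # keys and values see the extended context
    V = context @ W_v.T
    return softmax(Q @ K.T) @ V  # (T, D)

T, D = 4, 8
rng = np.random.default_rng(0)
H_cur, H_prev = rng.normal(size=(T, D)), rng.normal(size=(T, D))
W_q, W_k, W_v = rng.normal(size=(3, D, D))
H_new = recurrent_attention(H_cur, H_prev, W_q, W_k, W_v)
print(H_new.shape)  # (4, 8)
```

Note that only the current sequence contributes queries, so the output keeps shape (T, D) while the keys and values range over the concatenated 2T positions.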
The function \mathrm{SG}(\cdot) stands for stop-gradient on the previous hidden state \tilde{H}^{l-1}_{n-1}, and \| denotes concatenation. In our case, we use the extended context as key and value and adopt three linear transformations to improve model flexibility, where \{\tilde{W}_Q, \tilde{W}_K, \tilde{W}_V\} \in R^{D \times D} denote the parameters. The extended context not only provides precious information for recovering the first few items, but also allows our model to capture dependencies across sequences. In practice, instead of computing the hidden states from scratch at each time point, we cache the hidden state of the last sequence for reuse. Besides, a masking strategy and positional embeddings are also included to avoid the future information leakage problem (Yuan et al. 2019) and capture sequential dynamics (Kang and McAuley 2018; Wang et al. 2019a). The final short-term interest embedding is defined as \tilde{H}_n = \tilde{H}^L_n.

Long-term Attention Network
In this subsection, we present another attention network for long-term interest modeling. With the recurrent connection mechanism defined in Eq. (1), our model can capture correlations between adjacent sequences for interest modeling. However, longer-range dependencies beyond successive sequences may still be ignored, since the recurrent connection mechanism is limited in capturing longer-range correlations (Sodhani, Chandar, and Bengio 2018; Sukhbaatar et al. 2015). Hence, an additional architecture is needed to effectively capture long-term user preferences.

To this end, we maintain an external memory matrix M \in R^{m \times D} to explicitly memorize a user's long-term preferences, where m is the number of memory slots. Each user is associated with a memory. Ideally, the memory complements the short-term interest modeling, with the aim of capturing dependencies beyond adjacent sequences. We discuss how to effectively update the memory in a later section and focus here on how to extract long-term interests from the memory. Specifically, let M^l \in R^{m \times D} denote the l-th layer memory matrix. We estimate the long-term hidden state of sequence S_n using another self-attention network as

    \hat{H}^l_n = \mathrm{Atten}^l(\hat{Q}^l_n, \hat{K}^l_n, \hat{V}^l_n),
    \hat{Q}^l_n, \hat{K}^l_n, \hat{V}^l_n = \hat{H}^{l-1}_n \hat{W}^\top_Q, \; M^{l-1} \hat{W}^\top_K, \; M^{l-1} \hat{W}^\top_V.                                  (2)

Similarly, \mathrm{Atten}^l(\cdot, \cdot, \cdot) is a self-attention network. It takes the last layer hidden state \hat{H}^{l-1}_n as query and uses the layer-wise memory matrix M^{l-1} as key (value). The input of the first-layer query is X_n. By doing so, the output hidden state \hat{H}^l_n \in R^{T \times D} is a selective aggregation of the m memory blocks, where the selection weight is query-based and varies across different queries.
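The long-term attention of Eq. (2) can be sketched in the same forward-only NumPy style; again the names and shapes here are illustrative assumptions rather than the paper's implementation. The key difference from the recurrent layer is that keys and values come from the m memory slots, so the attention cost per position is O(m) instead of growing with the history length.

```python
import numpy as np

def long_term_attention(H_query, M, W_q, W_k, W_v):
    """Attend from the current hidden state to the m memory slots (Eq. 2)."""
    Q = H_query @ W_q.T          # (T, D) queries from the sequence
    K, V = M @ W_k.T, M @ W_v.T  # (m, D) keys/values from the memory
    logits = Q @ K.T             # (T, m)
    A = np.exp(logits - logits.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)  # softmax over the m memory slots
    return A @ V  # (T, D): each position is a weighted mix of memory slots

T, m, D = 4, 3, 8
rng = np.random.default_rng(1)
H, M = rng.normal(size=(T, D)), rng.normal(size=(m, D))
W_q, W_k, W_v = rng.normal(size=(3, D, D))
H_long = long_term_attention(H, M, W_q, W_k, W_v)
print(H_long.shape)  # (4, 8)
```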
\{\hat{W}_Q, \hat{W}_K, \hat{W}_V\} are trainable transformation matrices to improve model capacity. Since the memory is maintained to cache long-term user interests beyond adjacent sequences, we refer to the above attention network as long-term interest modeling. The final long-term interest embedding for sequence S_n is denoted as \hat{H}_n = \hat{H}^L_n.

Neural Gating Network
After obtaining the short-term and long-term interest embeddings, the next aim is to combine them for comprehensive modeling. Considering that a user's future intention can be influenced by early behaviors, while short-term and long-term interests may contribute differently to next-item prediction over time (Ma et al. 2019), we apply a neural gating network to adaptively control the importance of the two interest embeddings:

    V_n = G_n \odot \tilde{H}_n + (1 - G_n) \odot \hat{H}_n,
    G_n = \sigma(\tilde{H}_n W_{short} + \hat{H}_n W_{long}),                                  (3)

where G_n \in R^{T \times D} is the gate matrix learned by a non-linear transformation based on the short-term and long-term embeddings. \sigma(\cdot) indicates the sigmoid activation function, \odot denotes element-wise multiplication, and W_{short}, W_{long} \in R^{D \times D} are model parameters. The final user embedding V_n \in R^{T \times D} is obtained as a feature-level weighted sum of the two types of interest embeddings, controlled by the gate.

Dynamic Memory Network
In this subsection, we describe how to effectively update the memory M to preserve long-term user preferences beyond adjacent sequences. One feasible solution is to maintain a fixed-size FIFO memory to cache long-term messages. This strategy is sub-optimal for user modeling for two reasons. First, the oldest memories will be discarded once the memory is full, whether they are important or not. This setting is reasonable in NLP tasks (Rae et al. 2019), as two words that are too far apart in a sentence are often not correlated, but it does not hold in recommendation, because behavior sequences are not strictly ordered (Yuan et al. 2019) and users often exhibit monthly or seasonal periodic behaviors. Second, the memory is redundant and not effectively utilized, since the number of user interests in practice is often bounded in the tens (Li et al. 2019).

To avoid these limitations, we propose to actively abstract a user's long-term interests from the past. Assume the model has processed sequence S_n; then the memory is updated as

    M^l \leftarrow f^l_{abs}(M^l, \tilde{H}^l_{n-1}),                                  (4)

where f^l_{abs}: R^{(m+T) \times D} \rightarrow R^{m \times D} is the l-th layer abstraction function. It takes the old memory and the context state \tilde{H}^l_{n-1} as input and updates the memory M^l to represent user interests. In theory, f_{abs} is required to effectively preserve the primary interests in old memories while merging contextual information. Basically, f_{abs} can be trained end-to-end with the next-item prediction task. Nevertheless, memories that differ from the target item may then be discarded. Therefore, we consider training the abstraction function with an auxiliary attention-based reconstruction loss as follows.
    L_{ae} = \min \sum_{l=1}^{L} \| \mathrm{Atten}^l_{rec}(\tilde{Q}^l, \tilde{K}^l, \tilde{V}^l) - \mathrm{Atten}^l_{rec}(\tilde{Q}^l, \hat{K}^l, \hat{V}^l) \|_F,
    \tilde{Q}^l = \tilde{H}^l_n, \quad \tilde{K}^l = \tilde{V}^l = M^l \,\|\, \tilde{H}^l_{n-1}, \quad \hat{K}^l = \hat{V}^l = M^l,                                  (5)

where \mathrm{Atten}^l_{rec}(\cdot, \cdot, \cdot) is the self-attention network defined in Eq. (1). We reuse the recurrent attention network but keep its parameters fixed and not trainable here. We employ the hidden state \tilde{H}^l_n of S_n as the query for both attention networks. The first attention outputs a new representation for the query via a weighted sum over the old and new memories, while the second does so over the abstracted memories. By minimizing the reconstruction loss, we expect the primary interests to be extracted by f_{abs} as much as possible. Note that we consider a lossy objective here, because information that is no longer attended to in S_n can be discarded in order to capture the shifting of user interests to some extent.

Implementation of abstraction function f_{abs}. We parameterize f_{abs} with the dynamic routing method in CapsNet (Sabour, Frosst, and Hinton 2017) for its promising results in capturing a user's diverse interests in recommendation (Li et al. 2019). Suppose we have two layers of capsules; we refer to capsules from the first layer and the second layer as primary capsules and interest capsules, respectively. The goal of dynamic routing is to calculate the values of the interest capsules given the primary capsules in an iterative fashion. In each iteration, given primary capsule vectors x_i (input vectors), i \in \{1, \ldots, T+m\}, and interest capsule vectors \bar{x}_j (output vectors), j \in \{1, \ldots, m\}, the routing logit b_{ij} between primary capsule i and interest capsule j is computed by

    b_{ij} = \bar{x}^\top_j W_{ij} x_i,                                  (6)

where W_{ij} is a transformation matrix.
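The iterative routing built from Eq. (6), together with the aggregation and squashing steps detailed next, can be sketched as follows. This is a minimal NumPy rendering of the abstraction function, under our own assumptions about shapes: per-pair transformation matrices W_ij are drawn randomly (implementations often share them), and the routing softmax is taken over the m interest capsules, as in CapsNet.

```python
import numpy as np

def squash(s):
    """Eq. (8): shrink s to length < 1 while preserving its direction."""
    n2 = np.sum(s * s, axis=-1, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + 1e-9)

def dynamic_routing(X, W, m, iters=3):
    """Abstract (m + T) primary capsules X into m interest capsules.

    X: (m + T, D) old memory slots concatenated with the cached hidden state.
    W: (m + T, m, D, D) per-pair transformation matrices (illustrative choice).
    """
    n = X.shape[0]
    u_hat = np.einsum('imde,ie->imd', W, X)   # W_ij x_i, precomputed: (n, m, D)
    b = np.zeros((n, m))                      # routing logits b_ij
    for _ in range(iters):
        alpha = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # softmax over j
        s = np.einsum('im,imd->md', alpha, u_hat)                 # Eq. (7)
        x_bar = squash(s)                                         # Eq. (8)
        b = b + np.einsum('imd,md->im', u_hat, x_bar)             # update Eq. (6)
    return x_bar  # new memory M = [x_bar_1, ..., x_bar_m]

T, m, D = 4, 3, 6
rng = np.random.default_rng(3)
X = rng.normal(size=(m + T, D))
W = rng.normal(size=(m + T, m, D, D)) * 0.1
M_new = dynamic_routing(X, W, m)
print(M_new.shape)  # (3, 6)
```

The squashing step guarantees every interest capsule has norm below one, which keeps repeated routing iterations numerically stable.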
Given the routing logits, s_j is computed as a weighted sum of all primary capsules:

    s_j = \sum_{i=1}^{m+T} \alpha_{ij} W_{ij} x_i,                                  (7)

where \alpha_{ij} = \exp(b_{ij}) / \sum_{j'} \exp(b_{ij'}) is the connection weight between primary capsule i and interest capsule j. Finally, a non-linear "squashing" function (Sabour, Frosst, and Hinton 2017) is applied to obtain the corresponding vectors of the interest capsules:

    \bar{x}_j = \mathrm{squash}(s_j) = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \frac{s_j}{\|s_j\|}.                                  (8)

The routing process between Eq. (6) and Eq. (8) is usually repeated three times to converge. When routing finishes, the output interest capsules of user u are used as the memory, i.e., M = [\bar{x}_1, \ldots, \bar{x}_m].

Model Optimization
As the data is derived from users' implicit feedback, we formulate the learning problem as a binary classification task.

Table 2: The dataset statistics.

Dataset     #Users      #Items      T    K
MovieLens   6,040       3,952       20   10
Taobao      987,994     4,162,024   20   10
JD.com      1,608,707   378,457     20   10
XLong       20,000      3,269,017   50   20

Given a training sample (u, t) in a sequence S_n, with user embedding vector V_{n,t} and target item embedding x_t, we aim to minimize the following negative log-likelihood:

    L_{like} = -\sum_{u \in U} \sum_{t \in S_n} \log P(x_{n,t} \mid x_{n,1}, x_{n,2}, \cdots, x_{n,t-1})
             = -\sum_{u \in U} \sum_{t \in S_n} \log \frac{\exp(x^\top_t V_{n,t})}{\sum_{j \in V} \exp(x^\top_j V_{n,t})}.                                  (9)

The loss above is usually intractable in practice, because the sum in the denominator is computationally prohibitive. Therefore, we adopt a negative sampling strategy to approximate the softmax function in experiments. When the data volume is large, we leverage the sampled softmax technique (Covington, Adams, and Sargin 2016; Jean et al. 2014) to further accelerate training. Note that Eq. (9) and Eq. (5) are updated separately in order to better preserve long-term interests. Specifically, we first update Eq. (9) by feeding a new sequence and then update the abstraction function's parameters by minimizing Eq. (5).

Experiments and Analysis
Datasets
We conduct experiments on four public benchmarks; their statistics are summarized in Table 2. MovieLens (https://grouplens.org/datasets/movielens/1m/) collects users' rating scores for movies. JD.com (Lv et al. 2019) is a collection of user browsing logs over e-commerce products collected from JD.com. Taobao (Zhu et al. 2018) and XLong (Ren et al. 2019) are datasets of user behaviors from the commercial platform of Taobao. The behavior sequences in XLong are significantly longer than those in the other three datasets, which makes them difficult to model.

Baselines
To evaluate the performance of DMAN, we include three groups of baseline methods. First, traditional sequential methods. To evaluate the effectiveness of our model in dealing with long behavior sequences, three state-of-the-art recommendation algorithms for sequences of normal length are employed: GRU4Rec (Hidasi et al. 2015), Caser (Tang and Wang 2018), and SASRec (Kang and McAuley 2018). Second, long sequential methods. To evaluate the effectiveness of our model in extracting long-term user interests with dynamic memory, we include SDM (Lv et al. 2019) and SHAN (Ying et al. 2018a), which are tailored for modeling long behavior sequences. To evaluate the effectiveness of our model in explicitly capturing a user's short-term and long-term interests, we also set HPMN (Ren et al. 2019), which is based on the memory network, as a baseline. Third, DMAN variants. To analyze the contribution of each component of DMAN, we consider three variants. DMAN-XL discards the long-term attention network, to verify the effectiveness of capturing long-term interests. DMAN-FIFO adopts a FIFO strategy, to validate the usefulness of the abstraction function in extracting primary interests. DMAN-NRAN replaces the recurrent attention network with a vanilla attention network, to demonstrate the effectiveness of extending the context for effective user modeling.
Experimental Settings
We obtain the behavior sequence by sorting behaviors in chronological order. Following the traditional protocol (Kang and McAuley 2018), we employ the last and second-to-last interactions for testing and validation, respectively, and the remaining ones for training. We follow the widely adopted practice (Li et al. 2017; Lv et al. 2019) and split the ordered training sequence into N consecutive sequences, where the maximum length of a sequence is T. The statistics of the four datasets are listed in Table 2. We run the model five times and report the average results.

Evaluation metrics
For each user in the test set, we treat all items that the user has not interacted with as negative items. To estimate the performance of top-K recommendation, we use the Hit Rate (HR@K) and Normalized Discounted Cumulative Gain (NDCG@K) metrics, which are widely used in the literature (He and Chua 2017).

Parameter settings
For the baselines, we use the source code released by the authors, and their hyper-parameters are tuned to be optimal on the validation set. To enable a fair comparison, all methods are optimized with the same number of samples and the same number of embedding dimensions D. We implement DMAN with TensorFlow and use the Adam optimizer to train the model. The number of memory slots m and the number of attention layers L are searched over {2, 4, 6, 8, 10, 20, 30} and over five candidate values, respectively.

Comparisons with SOTA
In this section, we compare our model with the different baselines. Tables 3 and 4 report the results. In general, we have three main observations.
Influence of modeling long behavior sequences for traditional sequential methods
From Table 3, we observe that GRU4Rec, Caser, and SASRec improve their performance when considering a longer behavior sequence. Therefore, modeling a longer behavior sequence has proved to be effective for user modeling. Besides, different sequential modules have varied abilities in handling long behavior sequences. Specifically, SASRec, GRU4Rec, and Caser improve by 24.74%, 29.36%, and 13.42% on Taobao in terms of HR@50, while SASRec consistently performs the best. This indicates the ability of the self-attention network to extract sequential patterns, and also validates our motivation to extend the self-attention network to long behavior sequence modeling.

Table 3: Sequential recommendation performance over three benchmarks. * indicates the model only uses the latest behavior sequence for training; otherwise, the whole behavior sequence is used. The second-best results are underlined.

Models      MovieLens                      Taobao                        JD.com
            HR@10   HR@50   NDCG@100       HR@50   HR@100  NDCG@100      HR@10   HR@50   NDCG@100
GRU4Rec*
Caser*
SASRec*
GRU4Rec
Caser
SASRec
SHAN
HPMN
SDM
DMAN        25.18   53.24   22.03          24.92   29.37   11.13         44.58   58.82   36.93
Improv.

Table 4: Performance on the long user behavior data XLong.

Method      Recall@200   Recall@500
GRU4Rec*
Caser*
SASRec*
GRU4Rec
Caser
SASRec
SHAN
HPMN
SDM
DMAN        0.132        0.163
Comparison with baselines on general datasets
As shown in Table 3, our model DMAN achieves better results than the baselines across the three datasets. In general, long sequential models perform better than traditional sequential methods, except for SASRec, which performs better than SHAN and comparably to HPMN in most cases. This further implies the effectiveness of the self-attention network in capturing long-range dependencies. The improvement of SDM over SASRec shows that explicitly extracting long-term and short-term interests from a long sequence is beneficial. Comparing DMAN and SDM, DMAN consistently outperforms SDM over all evaluation metrics. This can be attributed to the fact that DMAN utilizes a dynamic memory network to actively extract long-term interests into a small set of memory blocks, from which the attention network can attend to relevant information more effectively than from a long behavior sequence.
Comparison with baselines on long behavior dataset
Table 4 summarizes the results of all methods on XLong, where the behavior sequences are substantially longer on average. Obviously, DMAN significantly outperforms the other baselines. Compared with the findings in Table 3, one interesting observation is that the traditional sequential methods, i.e., GRU4Rec, Caser, and SASRec, perform poorly when directly modeling such long behavior sequences.

Table 5: Ablation study of DMAN.

Dataset   Method       Recall@100   NDCG@100
Taobao    DMAN-XL
          DMAN-FIFO
          DMAN-NRAN
          DMAN
XLong     DMAN-XL
          DMAN-FIFO
          DMAN-NRAN
          DMAN
Ablation Study
We also conduct experiments to investigate the effectiveness of several core components of the proposed DMAN. Table 5 reports the results on two representative datasets. Obviously, DMAN significantly outperforms the other three variants. The substantial difference between DMAN and DMAN-XL shows that the recurrent connection alone is not enough to capture a user's long-term interests. The improvement of DMAN over DMAN-FIFO validates that the proposed abstraction function is effective in extracting a user's primary long-term interests. Besides, DMAN outperforms DMAN-NRAN in general, which verifies the usefulness of extending the current context with the previous hidden sequence state for short-term interest extraction.
Hyper-parameter Analysis
We further study the impact of the number of memory slots m and attention layers L on MovieLens.

Figure 2: Analysis of the proposed DMAN: (a) memory slots m; (b) layer size L; (c) learning curve.

As shown in Figure 2(a), DMAN achieves satisfactory results when m = 20, and the gain slows down, with less than a 2% improvement, as m further increases. In experiments, we found that 20 slots are enough for MovieLens, Taobao, JD.com, and XLong. From Figure 2(b), we observe that the number of attention layers has a positive impact on our model. To trade off memory cost against performance, we set L = 2 for all datasets, since this already achieves satisfactory results. Besides, we also plot the learning curve of DMAN on the Taobao dataset in Figure 2(c); we can observe that DMAN converges quickly, after about 2 epochs. Similar observations hold on the other datasets. Specifically, DMAN tends to converge after 2 epochs on the Taobao, JD.com, and XLong datasets, and after about 50 epochs on the MovieLens data. These results demonstrate the training efficiency of our model.

Related Work
General Recommendation
Early recommendation works largely focused on explicit feedback (Koren 2008). The recent research focus is shifting towards implicit data (Li and She 2017; Hu, Koren, and Volinsky 2008). Typical examples include collaborative filtering (Sarwar et al. 2001; Schafer et al. 2007), matrix factorization techniques (Koren, Bell, and Volinsky 2009), and factorization machines (Rendle 2010). The main challenge lies in representing users or items with latent embedding vectors to estimate their similarity. Due to their ability to learn salient representations, neural network-based models (Guo et al. 2017; Su and Khoshgoftaar 2009; Tan, Liu, and Hu 2019) have also attracted much attention recently. Some efforts adopt neural networks to extract side attributes for content-aware recommendation (Kim et al. 2016), while others aim to equip matrix factorization with non-linear interaction functions (He and Chua 2017) or graph convolutional aggregation (Wang et al. 2019b; Liu et al. 2019). In general, deep learning-based methods perform better than their traditional counterparts (Sedhain et al. 2015; Xue et al. 2017).
Sequential Recommendation
Sequential recommendation takes the chronological behavior sequence as input for user modeling. Typical approaches fall into three categories. The first relies on temporal matrix factorization (Koren 2009) to model a user's drifting preferences. The second uses either first-order (Rendle, Freudenthaler, and Schmidt-Thieme 2010; Cheng et al. 2013) or higher-order (He and McAuley 2016; He et al. 2016; Yan et al. 2019) Markov chains to capture user state dynamics. The third applies deep neural networks to enhance the capacity of feature extraction (Yuan et al. 2019; Sun et al. 2019; Hidasi and Karatzoglou 2018). For example, Caser (Tang and Wang 2018) applies CNNs to process the item embedding sequence, while GRU4Rec (Hidasi et al. 2015) uses the gated recurrent unit (GRU) for session-based recommendation. Moreover, SASRec (Kang and McAuley 2018) employs self-attention networks (Vaswani et al. 2017) to selectively aggregate relevant items for user modeling. However, these methods mainly focus on making recommendations based on relatively recent behaviors. Recently, a few efforts have attempted to model long behavior sequence data. For instance, SDM (Lv et al. 2019) and SHAN (Ying et al. 2018a) split the whole behavior sequence into short-term and long-term sequences and then explicitly extract long-term and short-term interest embeddings from them, but they have difficulty capturing shifts in long-term interests and suffer from high computational complexity. HPMN (Ren et al. 2019) uses the memory network (Graves, Wayne, and Danihelka 2014; Chen et al. 2018) to memorize important historical behaviors for next-item prediction. Nevertheless, memory networks may suffer from the long-term dependency forgetting dilemma, as the memory is optimized by recovering the next item. Our model focuses on combining external memory and attention networks for effective modeling of long user behavior sequences, which yields an explicit and adaptive modeling process.
Conclusions
In this paper, we propose a novel dynamic memory-based attention network, DMAN, for sequential recommendation with long behavior sequences. We truncate a user's overall behavior sequence into a series of sub-sequences and train our model in a dynamic manner. DMAN can explicitly extract a user's short-term and long-term interests based on the recurrent connection mechanism and a set of external memory blocks. To improve memory fidelity, we derive a dynamic memory network that actively abstracts a user's long-term interests into the memory by minimizing a local reconstruction loss. Empirical results on real-world datasets demonstrate the effectiveness of DMAN in modeling long user behavior sequences.

References
Belletti, F.; Chen, M.; and Chi, E. H. 2019. Quantifying Long Range Dependence in Language and User Behavior to improve RNNs. In KDD, 1317–1327.
Chen, X.; Xu, H.; Zhang, Y.; Tang, J.; Cao, Y.; Qin, Z.; and Zha, H. 2018. Sequential recommendation with user memory networks. In WSDM, 108–116.
Cheng, C.; Yang, H.; Lyu, M. R.; and King, I. 2013. Where you like to go next: Successive point-of-interest recommendation. In IJCAI.
Covington, P.; Adams, J.; and Sargin, E. 2016. Deep neural networks for youtube recommendations. In RecSys, 191–198.
Dai, Z.; Yang, Z.; Yang, Y.; Carbonell, J.; Le, Q. V.; and Salakhutdinov, R. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
Graves, A.; Wayne, G.; and Danihelka, I. 2014. Neural turing machines. arXiv preprint arXiv:1410.5401.
Guo, H.; Tang, R.; Ye, Y.; Li, Z.; and He, X. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247.
He, R.; Fang, C.; Wang, Z.; and McAuley, J. 2016. Vista: a visually, socially, and temporally-aware model for artistic recommendation. In RecSys, 309–316.
He, R.; and McAuley, J. 2016. Fusing similarity models with markov chains for sparse sequential recommendation. In ICDM, 191–200. IEEE.
He, X.; and Chua, T.-S. 2017. Neural factorization machines for sparse predictive analytics. In SIGIR, 355–364.
Hidasi, B.; and Karatzoglou, A. 2018. Recurrent neural networks with top-k gains for session-based recommendations. In CIKM, 843–852.
Hidasi, B.; Karatzoglou, A.; Baltrunas, L.; and Tikk, D. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939.
Hu, Y.; Koren, Y.; and Volinsky, C. 2008. Collaborative filtering for implicit feedback datasets. In ICDM, 263–272. IEEE.
Jean, S.; Cho, K.; Memisevic, R.; and Bengio, Y. 2014. On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007.
Kang, W.-C.; and McAuley, J. 2018. Self-attentive sequential recommendation. In ICDM, 197–206. IEEE.
Kim, D.; Park, C.; Oh, J.; Lee, S.; and Yu, H. 2016. Convolutional matrix factorization for document context-aware recommendation. In RecSys, 233–240.
Koren, Y. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In KDD, 426–434.
Koren, Y. 2009. Collaborative filtering with temporal dynamics. In KDD, 447–456.
Koren, Y.; Bell, R.; and Volinsky, C. 2009. Matrix factorization techniques for recommender systems. Computer (8): 30–37.
Li, C.; Liu, Z.; Wu, M.; Xu, Y.; Zhao, H.; Huang, P.; Kang, G.; Chen, Q.; Li, W.; and Lee, D. L. 2019. Multi-interest network with dynamic routing for recommendation at Tmall. In CIKM, 2615–2623.
Li, J.; Ren, P.; Chen, Z.; Ren, Z.; Lian, T.; and Ma, J. 2017. Neural attentive session-based recommendation. In CIKM, 1419–1428.
Li, J.; Wang, Y.; and McAuley, J. 2020. Time Interval Aware Self-Attention for Sequential Recommendation. In WSDM, 322–330.
Li, X.; and She, J. 2017. Collaborative variational autoencoder for recommender systems. In KDD, 305–314.
Liu, N.; Tan, Q.; Li, Y.; Yang, H.; Zhou, J.; and Hu, X. 2019. Is a single vector enough? exploring node polysemy for network embedding. In KDD, 932–940.
Lv, F.; Jin, T.; Yu, C.; Sun, F.; Lin, Q.; Yang, K.; and Ng, W. 2019. SDM: Sequential deep matching model for online large-scale recommender system. In CIKM, 2635–2643.
Ma, C.; Ma, L.; Zhang, Y.; Sun, J.; Liu, X.; and Coates, M. 2019. Memory Augmented Graph Neural Networks for Sequential Recommendation. arXiv preprint arXiv:1912.11730.
Rae, J. W.; Potapenko, A.; Jayakumar, S. M.; and Lillicrap, T. P. 2019. Compressive Transformers for Long-Range Sequence Modelling. arXiv preprint arXiv:1911.05507.
Ren, K.; Qin, J.; Fang, Y.; Zhang, W.; Zheng, L.; Bian, W.; Zhou, G.; Xu, J.; Yu, Y.; Zhu, X.; et al. 2019. Lifelong Sequential Modeling with Personalized Memorization for User Response Prediction. In SIGIR, 565–574.
Rendle, S. 2010. Factorization machines. In ICDM, 995–1000. IEEE.
Rendle, S.; Freudenthaler, C.; and Schmidt-Thieme, L. 2010. Factorizing personalized markov chains for next-basket recommendation. In WWW, 811–820.
Sabour, S.; Frosst, N.; and Hinton, G. E. 2017. Dynamic routing between capsules. In NIPS, 3856–3866.
Sarwar, B.; Karypis, G.; Konstan, J.; and Riedl, J. 2001. Item-based collaborative filtering recommendation algorithms. In WWW, 285–295. ACM.
Schafer, J. B.; Frankowski, D.; Herlocker, J.; and Sen, S. 2007. Collaborative filtering recommender systems. In The adaptive web, 291–324. Springer.
Sedhain, S.; Menon, A. K.; Sanner, S.; and Xie, L. 2015. Autorec: Autoencoders meet collaborative filtering. In WWW, 111–112.
Sodhani, S.; Chandar, S.; and Bengio, Y. 2018. On training recurrent neural networks for lifelong learning. CoRR, abs/1811.07017.
Su, X.; and Khoshgoftaar, T. M. 2009. A survey of collaborative filtering techniques. Advances in artificial intelligence.
Sun, F.; Liu, J.; Wu, J.; Pei, C.; Lin, X.; Ou, W.; and Jiang, P. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In CIKM, 1441–1450.
Tan, Q.; Liu, N.; and Hu, X. 2019. Deep Representation Learning for Social Network Analysis. Frontiers in Big Data 2: 2.
Tan, Q.; Liu, N.; Zhao, X.; Yang, H.; Zhou, J.; and Hu, X. 2020. Learning to Hash with Graph Neural Networks for Recommender Systems. In WWW, 1988–1998.
Tan, Q.; Zhang, J.; Yao, J.; Liu, N.; Zhou, J.; Yang, H.; and Hu, X. 2021. Sparse-interest network for sequential recommendation. In WSDM.
Tang, J.; and Wang, K. 2018. Personalized top-n sequential recommendation via convolutional sequence embedding. In WSDM, 565–573.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In NIPS, 5998–6008.
Wang, B.; Zhao, D.; Lioma, C.; Li, Q.; Zhang, P.; and Simonsen, J. G. 2019a. Encoding word order in complex embeddings. arXiv preprint arXiv:1912.12333.
Wang, X.; He, X.; Wang, M.; Feng, F.; and Chua, T.-S. 2019b. Neural graph collaborative filtering. In SIGIR, 165–174.
Xue, H.-J.; Dai, X.; Zhang, J.; Huang, S.; and Chen, J. 2017. Deep Matrix Factorization Models for Recommender Systems. In IJCAI, volume 17, 3203–3209. Melbourne, Australia.
Yan, A.; Cheng, S.; Kang, W.-C.; Wan, M.; and McAuley, J. 2019. CosRec: 2D Convolutional Neural Networks for Sequential Recommendation. In CIKM, 2173–2176.
Ying, H.; Zhuang, F.; Zhang, F.; Liu, Y.; Xu, G.; Xie, X.; Xiong, H.; and Wu, J. 2018a. Sequential recommender system based on hierarchical attention network. In IJCAI.
Ying, R.; He, R.; Chen, K.; Eksombatchai, P.; Hamilton, W. L.; and Leskovec, J. 2018b. Graph Convolutional Neural Networks for Web-Scale Recommender Systems. In KDD, 974–983. ACM.
Yu, F.; and Koltun, V. 2015. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122.
Yuan, F.; Karatzoglou, A.; Arapakis, I.; Jose, J. M.; and He, X. 2019. A simple convolutional generative network for next item recommendation. In WSDM, 582–590.
Zhang, S.; Tay, Y.; Yao, L.; and Sun, A. 2018. Next item recommendation with self-attention. arXiv preprint arXiv:1808.06414.
Zhu, H.; Li, X.; Zhang, P.; Li, G.; He, J.; Li, H.; and Gai, K. 2018. Learning tree-based deep model for recommender systems. In