Multi-Interest-Aware User Modeling for Large-Scale Sequential Recommendations
Jianxun Lian, Iyad Batal, Zheng Liu, Akshay Soni, Eun Yong Kang, Yajun Wang, Xing Xie
Jianxun Lian [email protected] Research AsiaBeijing, China
Iyad Batal [email protected] Bing AdsSunnyvale, California, United States
Zheng Liu [email protected] Research AsiaBeijing, China
Akshay Soni [email protected] Bing AdsSunnyvale, California, United States
Eun Yong KangYajun Wang {eun.kang,yajunw}@microsoft.comMicrosoft Bing AdsSunnyvale, California, United States
Xing Xie [email protected] Research AsiaBeijing, China
ABSTRACT
Precise user modeling is critical for online personalized recommendation services. Generally, users' interests are diverse and not limited to a single aspect, which is particularly evident when their behaviors are observed over a longer time. For example, a user may demonstrate interests in cats/dogs, dancing, and food & delights when browsing short videos on TikTok; the same user may show interests in real estate and women's wear in her web browsing behaviors. Traditional models tend to encode a user's behaviors into a single embedding vector, which does not have enough capacity to effectively capture her diverse interests.

This paper proposes a Sequential User Matrix (SUM) to accurately and efficiently capture users' diverse interests. SUM models user behavior with a multi-channel network, with each channel representing a different aspect of the user's interests. User states in different channels are updated by an erase-and-add paradigm with interest- and instance-level attention. We further propose a local proximity debuff component and a highway connection component to make the model more robust and accurate. SUM can be maintained and updated incrementally, making it feasible to deploy for large-scale online serving. We conduct extensive experiments on two datasets. Results demonstrate that SUM consistently outperforms state-of-the-art baselines.
CCS CONCEPTS
• Information systems → Computational advertising; Collaborative filtering; Personalization; Recommender systems.

KEYWORDS
sequential recommendation, multiple interests, memory networks
1 INTRODUCTION
Sequential recommender systems have attracted a lot of attention in both academia [2, 7, 8, 21, 31, 32] and industry [20, 22, 23, 35] in recent years. Different from general recommender systems [11, 12] that aim to learn users' long-term preferences, sequential recommender systems take the sequence of user behaviors as context and predict her short-term interests, such as which items she will interact with in the near future, or, in the extreme case, which item she will interact with next. Sequential recommendation models can capture users' dynamic and evolving interests over time. In the past few years, many models have been proposed for sequential user modeling, including Recurrent Neural Network (RNN) based models [8, 9, 14, 22, 32, 35], Convolutional Neural Network (CNN) based models [28, 33], and self-attention based models [10, 27, 34]. Although state-of-the-art results have been reported by these approaches, most of them cannot be applied in large-scale online systems due to the tight latency requirements of real-time online serving. A typical real-time recommendation service requires the model to respond with results in less than ten milliseconds. Given such a constraint, how to effectively model users' sequential behaviors, especially when the behavior sequence is long, becomes a crucial and challenging task.

The Gated Recurrent Unit (GRU) [3], despite the simplicity of its structure, turns out to be effective for sequential modeling and is widely deployed in industry [22, 35]. There are two main merits of GRU. First, it contains a gated mechanism to control how information is passed through or forgotten; the gradient vanishing and exploding problems are alleviated, so a GRU cell can handle relatively longer sequences better than a vanilla RNN cell. Second, GRU-based models support inference in an incremental-update manner, which makes them practical for online serving. Incremental update means there is no need to store and re-process the whole user behavior sequence each time the recommender system receives a new user event. Instead, we just need to maintain a user state vector for every user; when a new event comes in, we update the user state based on this single event only, so that the model can respond quickly. After we deployed a GRU-based recommender system to replace our previous attention-based neural model (which is non-sequential), the online click-through rate (CTR) was significantly improved by 0.82% to 4.54% across different scenarios in the Bing Native Advertising business.

However, GRU only generates a single embedding vector to represent a user state; when a user's behavior sequence becomes longer, she may reveal multiple interests that belong to different topics and are not suitable to be clustered into one representation. For example, in short video recommendation, a user may continuously browse a dozen short videos on TikTok covering topics as different as Cute Pets, Food & Delights, and Health & Fitness; in online advertising scenarios, a user may browse web pages related to used cars, men's shoes, and kids' furniture. Thus, a single user vector heavily restricts the representation ability of the model, and this defect cannot be remedied simply by increasing the embedding size of the GRU.
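To make the incremental-update property concrete, below is a minimal numpy sketch (our illustration, not the production encoder): a single GRU step consumes one new event vector together with the stored user state, so the full behavior history never needs to be replayed.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """One GRU step: new_state = step(event, old_state).
    Weights are randomly initialized here for illustration; in practice
    they come from the trained (and then frozen) user encoder."""
    def __init__(self, input_dim, state_dim, seed=0):
        rng = np.random.default_rng(seed)
        d = input_dim + state_dim
        self.W_z = rng.normal(0, 0.1, (d, state_dim))  # update gate weights
        self.W_r = rng.normal(0, 0.1, (d, state_dim))  # reset gate weights
        self.W_h = rng.normal(0, 0.1, (d, state_dim))  # candidate state weights

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(xh @ self.W_z)          # how much of the state to update
        r = sigmoid(xh @ self.W_r)          # how much history to reuse
        h_cand = np.tanh(np.concatenate([x, r * h]) @ self.W_h)
        return (1 - z) * h + z * h_cand

# Incremental serving: keep one state vector per user, update it per event.
cell = GRUCell(input_dim=128, state_dim=128)
user_state = np.zeros(128)
for event in np.random.default_rng(1).normal(size=(3, 128)):  # event stream
    user_state = cell.step(event, user_state)  # one event at a time, no replay
```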
In this paper, we propose SequentialUser Matrix (SUM), which is built on but substantially improves aclassical Recommender system with User Memory network (RUM)[2] for better modeling of users’ multiple interests in a sequence.There are mainly three novel mechanisms proposed in SUM. Interest-level and instance-level attention : The writing operationsin RUM generates a channel-wise attention score for each inputevent to indicate the event’s relatedness to different channels (akainterest level). However, within each channel, the erase-and-add update strategy only depends on the input event itself, withoutconsidering the instance-level context. We argue that since differ-ent user events have different importance scores, distinguishingevents inside the same channel is critical for improving the modelexpressiveness. Thus, we propose to manipulate memory networksat both interest and instance level (Refer to Section 4.1).
Local proximity debuff: In many applications, user behaviors tend to have a local proximity property. For example, a user's in-session web-page browsing behaviors are usually very similar; in an online shopping scenario, a user's consecutive behaviors may belong to the same purchase intent. Motivated by this, we propose a debuff mechanism for adjacent similar behaviors, which further improves the model's capacity for handling long sequences compared with GRU (refer to Section 4.2).
Highway channel: To make a memory network capable of recognizing the exact order of the behavior sequence and automatically balancing interest disentanglement against interest mixture, we reserve a channel in SUM's memory network to be a highway connection channel, so that every user behavior interacts with this highway channel without being influenced by the interest-level attention score. We further update the reading operation to make it attend to the states better (refer to Sections 4.3 and 4.4).

We conduct extensive offline experiments on two real-world datasets and online A/B tests in the Bing Native Advertising scenario. Experimental results demonstrate that SUM outperforms competitive baselines significantly and consistently. In addition, we provide case studies to verify that SUM indeed captures a user's diverse interests in its different memory channels.
2 THE NEAR REAL-TIME (NRT) SERVING SYSTEM
We first describe the Bing Native Advertising scenario (see https://en.wikipedia.org/wiki/Native_advertising) and our near real-time (NRT) system for large-scale online serving. We display personalized ads to users in a native manner when users connect to our services, such as browsing Microsoft News. Because ad-clicking behaviors are extremely sparse, we choose to use massive web behaviors, including users' visited pages, Bing search queries, and search clicks, for more comprehensive user modeling. Through offline experiments, we find that sequential models are much better than non-sequential models because the former can automatically learn the sequential signal and capture the temporal evolution of users' interests. However, deploying a sequential model for large-scale online serving is challenging, because processing users' long behavior sequences in real time is expensive. To address this issue, a popular design for large-scale recommendation systems is to decouple the architecture into two components [23]: the sequential user modeling component (the user encoder) and the user-item preference scorer (the ranker). These two components run as two separate applications. The user encoder maintains the latest embeddings for users. Once a new user behavior happens, it updates the user's embedding incrementally rather than re-computing the whole behavior sequence from the beginning; this module is usually built as the near real-time module. The user-item preference scorer, on the other hand, reads the latest user embeddings directly from memory so that the latency of the system response is minimized.

Following this decoupled paradigm, we build the NRT serving system depicted in Figure 1. The NRT system aims at serving any sequential model that can be updated incrementally, such as GRU. There are three pipelines in Figure 1, shown with different background colors. Model training happens periodically on an offline platform called Deep Learning Training Service (DLTS). After model training finishes, we detach the two most important parts, i.e., the sequential user encoder and the user-item ranker, from the model and freeze their parameters for serving. The blue pipeline is for sequential user state updating. Once a user performs an activity (e.g., she issues a Bing search), the event is sent to the real-time feedback data pipeline for batch updating. For each update unit, the system first fetches the current user state from a distributed in-memory key-value store which we call object storage, then conducts a one-step inference based on the current user event and user state, and finally writes the new user state back to the object storage. Under this incremental-updating paradigm, the NRT system can efficiently model users' interests from an extremely long sequence. As for online ranking, the system fetches a user state from object storage directly, concatenates it with item vectors, and runs the ranker module. In this paper, we focus on the user modeling module; other parts, such as recall and pre-ranking (we use multiple techniques such as ANN [17] and DeepXML [4]), are out of the scope of this discussion. So far, our NRT system serves hundreds of millions of unique user ids per day, with a peak QPS (queries per second) reaching 120k while the corresponding updating latency remains below 1 minute. Thus, users' latest states can be updated in a near real-time manner for downstream applications such as personalized ranking.
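Operationally, the blue pipeline in Figure 1 reduces to a read-modify-write loop against object storage. The sketch below is a simplified illustration, with a plain dict standing in for the distributed key-value store and a hypothetical `encoder.step` standing in for the frozen one-step user encoder.

```python
import numpy as np

object_storage = {}   # stand-in for the distributed in-memory key-value store
STATE_DIM = 128

def on_user_event(user_id, event_vec, encoder):
    """One update unit of the near-real-time pipeline:
    fetch current user state -> one-step inference -> write state back."""
    state = object_storage.get(user_id, np.zeros(STATE_DIM))
    new_state = encoder.step(event_vec, state)  # single-event update, no replay
    object_storage[user_id] = new_state

def rank_candidates(user_id, item_vecs, ranker):
    """Online ranking path: read the latest user state and score candidates."""
    state = object_storage.get(user_id, np.zeros(STATE_DIM))
    return [ranker(state, v) for v in item_vecs]
```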
In the next section, we formally describe how we implement the Sequential User Matrix (SUM), a multi-interest-aware user model, on the NRT system.
Figure 1: An overview of our near real-time (NRT) recommendation serving system. Numbers are simply for better understanding, not necessarily indicating the exact order of execution.
Figure 2: A sequential user modeling framework with multiple vectors in intermediate states.
3 PROBLEM FORMULATION
Let $\mathcal{U} = \{u_1, u_2, \ldots, u_n\}$ denote the set of users and $\mathcal{V} = \{v_1, v_2, \ldots, v_m\}$ denote the set of target items, where $n$ and $m$ indicate the number of users and items, respectively. Each user is associated with a sequence of behaviors $B(u) = \{(x^u_1, t^u_1), (x^u_2, t^u_2), \ldots, (x^u_{|B(u)|}, t^u_{|B(u)|})\}$, where $u \in \mathcal{U}$ indicates a user, $x \in \mathcal{X}$ indicates an activity, $(x^u_i, t^u_i)$ indicates that user $u$ performed activity $x^u_i$ at time $t^u_i$, and $|B(u)|$ denotes the number of elements in the set. Behaviors are sorted by timestamp, i.e., $t_i \le t_j$ for any $i < j$. Items in $\mathcal{V}$ are not necessarily the same as items in $\mathcal{X}$. For example, in our ads display scenario, $\mathcal{X}$ consists of web pages that users visited previously and $\mathcal{V}$ consists of advertisements to display. Each item $v$ and user behavior $x$ is encoded into a $D$-dimensional vector: $\mathbf{v}, \mathbf{x} \in \mathbb{R}^D$.

The recommendation task can be formulated as predicting a user $u$'s preference for item $v$ at time $t$, $\hat{y} = f(y|u, v, t)$, given the user's behavior history before time $t$. The key component is user modeling, i.e., how to generate a user representation $\mathbf{u}_t$ based on the behavior sequence. As stated in Section 2, to make the model scalable for online serving, we only consider architectures that can be updated incrementally, such as GRU [9] and RUM [2]; we call this component the User Encoder. At any time $t$, we can read out a user vector $\mathbf{u}_t$ from the User Encoder. $\mathbf{u}_t$ is concatenated with the candidate item vector $\mathbf{v}$ (and other context features if available), then goes through a two-layer fully connected neural network (FCN) to get the prediction score: $f(y|u, v, t) = FCN(\mathbf{u}_t, \mathbf{v})$. We call this part the User-Item Ranker. The model architecture is illustrated in Figure 2, and the two components correspond to Step 3 in Figure 1.
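As a concrete (and deliberately minimal) reading of $f(y|u, v, t) = FCN(\mathbf{u}_t, \mathbf{v})$, the sketch below scores one user-item pair with a two-layer ReLU network and a sigmoid output; the hidden sizes here are illustrative, not the production configuration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fcn_ranker(u_t, v, params):
    """Two-layer fully connected ranker over [user vector, item vector]."""
    z = np.concatenate([u_t, v])
    h1 = np.maximum(z @ params["W1"] + params["b1"], 0.0)   # ReLU layer 1
    h2 = np.maximum(h1 @ params["W2"] + params["b2"], 0.0)  # ReLU layer 2
    return sigmoid(h2 @ params["w_out"] + params["b_out"])  # preference score

# Toy usage with D = 128 and illustrative hidden sizes.
rng = np.random.default_rng(0)
D = 128
params = {
    "W1": rng.normal(0, 0.1, (2 * D, 256)), "b1": np.zeros(256),
    "W2": rng.normal(0, 0.1, (256, 128)),   "b2": np.zeros(128),
    "w_out": rng.normal(0, 0.1, 128),       "b_out": 0.0,
}
score = fcn_ranker(rng.normal(size=D), rng.normal(size=D), params)
```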
4 SEQUENTIAL USER MATRIX (SUM)
To effectively capture users' multiple interests in a way that is scalable for online serving, we propose the Sequential User Matrix (SUM), which is built on memory networks [5], especially inspired by the Recommender model with User Memory network (RUM) [2]. We use memory states with $K$ channels to represent a user: $H^u = \{\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_K\} \in \mathbb{R}^{D \times K}$. (In this paper, "channels" and "slots" are used interchangeably.) Figure 3 illustrates how SUM processes a user's behavior sequence. There are two groups of operations: the writing operation and the reading operation. The writing operation takes one user behavior at a time as input and updates the memory states accordingly. The reading operation merges the memory states into one user vector for user-item preference prediction. We propose three key mechanisms in SUM's writing operation: the interest-level and instance-level attention, the local proximity debuff, and the highway channel.

4.1 Interest-Level and Instance-Level Attention
We assume that users' historical behavior items may be different from the target items. For example, in the Display Ads dataset, users' historical behaviors are their visited webpages, while the target items are ads. Thus, we have two sets of global feature maps, $F^w = \{\mathbf{f}^w_1, \mathbf{f}^w_2, \ldots, \mathbf{f}^w_K\}$ and $F^r = \{\mathbf{f}^r_1, \mathbf{f}^r_2, \ldots, \mathbf{f}^r_K\}$, for the writing and reading operations respectively (also known as writing heads and reading heads). When a new user behavior $\mathbf{x}_t$ comes in, we first compute its attention weight for each channel:

$$w^w_{tk} = \mathbf{x}_t \cdot \mathbf{f}^w_k, \qquad z^w_{tk} = \frac{\exp(\beta w^w_{tk})}{\sum_j \exp(\beta w^w_{tj})}, \qquad \forall k = 1, 2, \ldots, K \tag{1}$$

where $\beta$ is a scaling factor and $z^w_{tk}$ represents the attention score of event $\mathbf{x}_t$ towards channel $k$. We call this interest-level attention; this step is the same as in RUM [2]. However, we observe that in RUM, the updating vectors, such as $\mathbf{add}_t$ and $\mathbf{erase}_t$, only depend on the input event $\mathbf{x}_t$. We argue that the updating vectors for memory channels should consider both the current input $\mathbf{x}_t$ and the current state $H$ (we call this instance-level attention). Thus, we first merge the memory states by the writing attention scores $\mathbf{z}^w_t$:

$$\hat{\mathbf{h}} = \sum_k z^w_{tk} \cdot \mathbf{h}_k \tag{2}$$

The new value to be added to the memory states depends on both the input $\mathbf{x}_t$ and the current states $H$:

$$\mathbf{add}_t = \phi(\mathbf{W}_a[\mathbf{x}_t, reset_t \cdot \hat{\mathbf{h}}] + \mathbf{b}_a) \tag{3}$$

Meanwhile, we have the erase gate and the reset gate:

$$\mathbf{erase}_t = \sigma(\mathbf{W}_e[\mathbf{x}_t, \hat{\mathbf{h}}] + \mathbf{b}_e) \tag{4}$$
$$reset_t = \sigma(\mathbf{w}_r[\mathbf{x}_t, \hat{\mathbf{h}}] + b_r) \tag{5}$$

Note that $reset_t$ is a scalar that controls how much the current states are involved in generating the new add-on value, while $\mathbf{erase}_t, \mathbf{add}_t \in \mathbb{R}^D$. The memory states are updated by:

$$\mathbf{h}_k \leftarrow \mathbf{add}_t \cdot \mathbf{erase}_t \cdot z^w_{tk} + \mathbf{h}_k \cdot (1 - \mathbf{erase}_t \cdot z^w_{tk}) \tag{6}$$

which means we first erase a portion of the information from the current states, then add new values to them. $\mathbf{erase}_t$ is a weighting vector that controls the erasing level bit-wise, and $z^w_{tk}$ is the interest-level attention score that controls the degree of information change on channel $k$ at timestamp $t$. Essentially, Equation (6) is a linear interpolation between the previous state $\mathbf{h}_k$ and the new add-on vector $\mathbf{add}_t$: the coefficients satisfy $\mathbf{erase}_t \cdot z^w_{tk} + (1 - \mathbf{erase}_t \cdot z^w_{tk}) = 1$ with $0 \le \mathbf{erase}_t \cdot z^w_{tk} \le 1$.
Since the current state $H$ is reduced from previous user behaviors and impacted by the writing heads, the state-updating mechanism is aware of both the instance level (Eqs. (3)-(5)) and the interest level (Eq. (1)).

4.2 Local Proximity Debuff
Users' consecutive behaviors tend to be similar. For example, when a user searches for some information (e.g., Nike shoes), the few pages/items she clicks on in a session usually belong to the same topic (Nike shoes). We call this phenomenon local proximity. The phenomenon is especially obvious when we handle the original user behavior data without down-sampling or de-duplication; Figure 6 in Section 5.8 also verifies this. In an online serving scenario, since the model is expected to be updated incrementally in a streaming manner, it is not practical to store the user sequences and perform de-duplication. To alleviate the local proximity problem and make the model capable of remembering long-term user interests, we propose a simple but effective debuff mechanism based on the comparison between the current event $\mathbf{x}_t$ and the last event $\mathbf{x}_{t-1}$:

$$sim(t) = cosine(\mathbf{x}_t, \mathbf{x}_{t-1}) \tag{7}$$
$$z^w_{tk} \leftarrow z^w_{tk} \cdot \alpha^{sim(t)} \tag{8}$$

where $\alpha$ is a trainable parameter; we initialize it with a value slightly smaller than 1.0, such as 0.98. When $\alpha$ is less than 1, the more similar a pair of consecutive events $(\mathbf{x}_t, \mathbf{x}_{t-1})$ are, the smaller the attention score $z^w_{tk}$ becomes. Although there are other choices for the debuff mechanism, such as a feature-based approach like $sim(t) = FCN(\mathbf{x}_t, \mathbf{x}_{t-1}, \mathbf{x}_t \odot \mathbf{x}_{t-1}, \mathbf{x}_t - \mathbf{x}_{t-1})$, our method brings in no additional parameters except the single scalar $\alpha$, which ensures a fairer comparison between methods with and without the local proximity debuff. Through experiments (Section 5.5) we find that this method works very well. (The terminology "debuff" originates from gaming, where it means an effect that makes a game character weaker: https://en.wiktionary.org/wiki/debuff. We borrow this word to describe weakening the local proximity effect.)

Figure 3: An illustration of the architecture of SUM with 5 memory channels.

4.3 Highway Channel
Although memory networks in general have stronger expressive capacity than recurrent neural networks (RNNs) such as GRU, there is one respect in which an RNN is theoretically better: an RNN can record the order of events in a sequence explicitly. In contrast, in memory networks, the original order of events is not strictly maintained, because each incoming event is routed to different channels with different attention weights. The order of events is critical for sequential recommendation, since the most recent event usually plays the most important role in next-item prediction. Motivated by this, we reserve one memory channel to be a highway channel, which always has an attention weight of 1 for every incoming event. Since all events interact with this channel evenly, the memory network is empowered to capture the exact order of events. Including the highway channel also makes the SUM model more robust across datasets. The degree of user interest diversity varies over datasets: the highway channel represents a mixture of interests, while the other memory channels represent a disentanglement of interests. The union of the two types of channels gives the SUM model the flexibility
of switching between interest mixture and interest disentanglement adaptively across different datasets.
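Putting Eqs. (1)-(8) and the highway channel together, here is a minimal numpy sketch of one SUM writing step. It encodes our reading of the equations: $\phi$ is taken to be tanh, the highway channel is the last row of $H$ with its attention fixed to 1 and kept outside the softmax and the debuff, and all shapes and the parameter dict `P` are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sum_write_step(x_t, x_prev, H, P, beta=2.0):
    """One SUM writing step over memory H of shape (K, D); the last row of H
    is treated as the highway channel."""
    # Eq. (1): interest-level attention over the K-1 interest channels
    # (assumption: the highway channel stays outside the softmax).
    w = P["f_w"] @ x_t
    z = np.exp(beta * w) / np.exp(beta * w).sum()
    # Eqs. (7)-(8): local proximity debuff for consecutive similar events.
    sim = x_t @ x_prev / (np.linalg.norm(x_t) * np.linalg.norm(x_prev) + 1e-8)
    z = z * P["alpha"] ** sim
    z = np.append(z, 1.0)             # highway channel: attention fixed to 1
    # Eq. (2): instance-level context, a state summary weighted by z.
    h_hat = z @ H
    xh = np.concatenate([x_t, h_hat])
    # Eqs. (4)-(5): bit-wise erase vector and scalar reset gate.
    erase = sigmoid(xh @ P["W_e"] + P["b_e"])
    reset = sigmoid(xh @ P["w_r"] + P["b_r"])
    # Eq. (3): new add-on value from the input and the reset-scaled context
    # (assumption: phi = tanh).
    add = np.tanh(np.concatenate([x_t, reset * h_hat]) @ P["W_a"] + P["b_a"])
    # Eq. (6): erase-then-add, a per-channel linear interpolation.
    coeff = z[:, None] * erase[None, :]
    return coeff * add[None, :] + (1.0 - coeff) * H

# Toy usage with K = 5 channels (4 interest + 1 highway) and D = 8 features.
rng = np.random.default_rng(0)
K, D = 5, 8
P = {"f_w": rng.normal(0, 0.1, (K - 1, D)), "alpha": 0.98,
     "W_e": rng.normal(0, 0.1, (2 * D, D)), "b_e": np.zeros(D),
     "w_r": rng.normal(0, 0.1, 2 * D),      "b_r": 0.0,
     "W_a": rng.normal(0, 0.1, (2 * D, D)), "b_a": np.zeros(D)}
H = np.zeros((K, D))
x_prev = rng.normal(size=D)
for x_t in rng.normal(size=(3, D)):           # stream three behaviors
    H = sum_write_step(x_t, x_prev, H, P)
    x_prev = x_t
```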
4.4 Reading Operation
To read the user memory matrix $H$ at time $t$, RUM [2] first uses a global reading latent feature table $F^r = \{\mathbf{f}^r_1, \mathbf{f}^r_2, \ldots, \mathbf{f}^r_K\}$ to get the candidate item $j$'s attention weight for each channel, then generates a user vector $\mathbf{u}_t$ by weighted average:

$$w^r_{tk} = \mathbf{v}_j \cdot \mathbf{f}^r_k \tag{9}$$
$$z^r_{tk} = \frac{\exp(\beta w^r_{tk})}{\sum_{d=1}^{K} \exp(\beta w^r_{td})}, \qquad \forall k = 1, 2, \ldots, K \tag{10}$$
$$\mathbf{u}_t = \sum_k z^r_{tk} \cdot \mathbf{h}_k \tag{11}$$

However, through experiments we found that this reading operation is not very effective. We argue that the attention weights $w^r_{tk}$ should depend on the content of the user memory states: if a memory channel was rarely activated in the past, it should receive a low attention score in the reading operation. We still use the attentive merging of Eq. (11) to read the user states, but change Eq. (9) to:

$$w^r_{tk} = \mathbf{v}_j F^r \mathbf{h}_k \tag{12}$$

where $F^r \in \mathbb{R}^{D \times D}$ is a global reading transformation matrix. We report the comparison results in Section 5.5.

After we obtain the representation $\mathbf{u}_{t,i}$ for user $i$, we concatenate it with the target item $j$'s representation $\mathbf{v}_j$, feed it into a 2-layer fully connected neural network with ReLU activation, and connect it to an output preference-score unit with Sigmoid activation. Since the focus of this paper is sequential user modeling, we do not include more features (such as context features) or a more complicated scorer (such as [6]) in this prediction step, for the sake of simplicity; richer features and modeling architectures can easily be included here. We treat user preference prediction as a binary classification task, so we use a point-wise loss with a negative log-likelihood function:

$$\mathcal{L} = -\frac{1}{N} \sum_{i,j} \left[ y_{i,j} \log \hat{y}_{i,j} + (1 - y_{i,j}) \log(1 - \hat{y}_{i,j}) \right] + \lambda \|\Theta\|^2 \tag{13}$$

where $N$ is the total number of training instances and $\Theta$ denotes the set of trainable parameters.

As for the inference computational cost, our SUM model is on the same level as GRU. The most expensive step of GRU is calculating the new state in the form $\sigma(\mathbf{W}[\mathbf{x}_t, \mathbf{h}_{t-1}] + \mathbf{b}_z)$, whose time complexity is $O(D^2)$. Although SUM has $K$ channels, Eqs. (3)-(5) show that the updating vectors are shared among the $K$ channels. Only Eqs. (1), (2) and (6) are repeated for the $K$ channels, and their time complexity is $O(D)$, which is much less than $O(D^2)$.
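For completeness, here is a matching sketch of the reading side: Eq. (12)'s state-aware attention followed by the weighted merge of Eq. (11), plus the point-wise loss of Eq. (13). As above, this is our paraphrase of the equations with illustrative shapes, not the authors' code.

```python
import numpy as np

def sum_read(v_j, H, F_r, beta=2.0):
    """State-aware reading: w_k = v_j^T F_r h_k (Eq. 12), scaled softmax
    (Eq. 10), then a weighted merge into one user vector (Eq. 11)."""
    w = H @ F_r.T @ v_j                  # (K,) attention logits, one per channel
    z = np.exp(beta * w) / np.exp(beta * w).sum()
    return z @ H                         # (D,) user vector u_t

def nll_loss(y_true, y_pred, params=(), lam=0.0):
    """Point-wise negative log-likelihood with optional L2 term (Eq. 13)."""
    eps = 1e-8
    ce = -(y_true * np.log(y_pred + eps)
           + (1 - y_true) * np.log(1 - y_pred + eps)).mean()
    return ce + lam * sum(np.sum(p ** 2) for p in params)
```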
5 EXPERIMENTS
5.1 Datasets
We use two datasets for our experiments: a display ads dataset and an e-commerce item recommendation dataset. Basic data statistics are reported in Table 1.

Table 1: Statistics of the datasets. "k" indicates a thousand.
Display Ads Dataset. We collect two weeks of ad-clicking logs from the Bing Native Advertising service as data samples and collect users' web behavior history before the corresponding ad click for user modeling. User behavior sequences are truncated to 100. The data samples are split 70%/15%/15% into training/validation/test sets by user, to avoid information leakage caused by repeated user behaviors. Both items and user behaviors are described by textual content. We use the CDSSM [26] model as a text encoder to turn the raw text into a 128-dimension embedding vector, which is then used as the static feature for a user page-view behavior.
Taobao Dataset. This is a public e-commerce dataset collected from Taobao's recommender system (https://tianchi.aliyun.com/dataset/dataDetail?dataId=649&userId=1). To make the two experimental datasets coherent, we take the purchase behaviors as target activities (corresponding to ad clicks in the Display Ads dataset) and use page-view behaviors as the user modeling data (corresponding to web browsing in the Display Ads dataset). User behavior sequences are truncated to 100. Since we do not have non-click impression logs for this dataset, all negative instances are randomly sampled according to item popularity with a positive:negative ratio of 1:4. We use item id and category id as one-hot features to represent items. For more detailed dataset descriptions, please refer to Appendix A.1.

5.2 Baselines
We compare SUM with three groups of methods:
• AttMerge. It represents users by attentively merging the historical items. It is non-sequential.
• GRU [8, 9, 22], SGRU, and HRNN [24], which represent GRU-based baselines. SGRU stands for Stacked GRU, which has multiple layers of GRU, with the number of layers equal to the number of SUM's slots; since GRU is a single-vector method, SGRU is a fairer baseline to compare with SUM. HRNN is a hierarchical GRU with sequences organized by sessions.
• NTM [5], RUM [2], MCPRN [29], MIMN [23], and HPMN [25], which are various kinds of multi-channel sequential user models and represent a set of strong baselines. For all these models, as well as SUM, we fix the channel number to 5 for fair comparison.

All models share the same User-Item Ranker module (the green part in Figure 2), while each model uses its own User Encoder module (the blue part). User state sizes are 128 for the Ads dataset and 64 for the Taobao dataset.

Table 2: Overall performance comparison in terms of Group AUC, LogLoss and NDCG@3. A bold font indicates a statistically significant improvement over the second-best model.

Model      Display Ads                Taobao
           gAUC    LogLoss  NDCG      gAUC    LogLoss  NDCG
AttMerge   0.7788  0.4144   0.6784    0.8978  0.2752   0.8655
GRU        0.8262  0.3768   0.7306    0.9279  0.2133   0.9052
SGRU       0.8250  0.3781   0.7296    0.9360  0.2054   0.9152
HRNN       0.8284  0.3769   0.7334    0.9267  0.2168   0.9036
MCPRN      0.8258  0.3789   0.7311    0.9348  0.2066   0.9144
NTM        0.8256  0.3786   0.7302    0.9251  0.2190   0.9014
RUM        0.8303  0.3742   0.7366    0.9300  0.2197   0.9011
MIMN       0.8280  0.3755   0.7327    0.9286  0.2137   0.9060
HPMN       0.8262  0.3769   0.7313    0.9361  0.2033   0.9161
SUM        -       -        -         -       -        -
In this paper, we only study models that can be updated incrementally. Thus, some other popular methods, such as DIEN [35], SASRec [10], MIND [13] and ComiRec [1], are not listed as baselines. Hyper-parameter settings are listed in Appendix A.2.
5.3 Evaluation Metrics
We adopt three widely used metrics for evaluation: Group AUC (Area Under the ROC Curve), LogLoss (binary cross entropy), and NDCG (Normalized Discounted Cumulative Gain). From the user recommendation perspective, we only need to compare among the candidates for a given user, so we adopt Group AUC, which first calculates an AUC score per user and then takes the average over users. LogLoss measures the distance between the predicted score and the true label for each instance, and is also frequently used in recommendation tasks [6, 16, 25]. NDCG measures the ranking quality among the top-k predicted candidates. We observe the same trend for different k among the compared models, so for conciseness we only report NDCG@3 in the experiment section.
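For reference, a small sketch of the Group AUC computation as described above: per-user AUC, then a simple average. (Whether users are weighted equally or by impression count is a design choice; the unweighted variant is shown.)

```python
import numpy as np
from collections import defaultdict
from sklearn.metrics import roc_auc_score

def group_auc(user_ids, labels, scores):
    """Group AUC: compute AUC per user, then average over users that
    have both positive and negative candidates."""
    by_user = defaultdict(lambda: ([], []))
    for u, y, s in zip(user_ids, labels, scores):
        by_user[u][0].append(y)
        by_user[u][1].append(s)
    aucs = [roc_auc_score(ys, ss) for ys, ss in by_user.values()
            if len(set(ys)) == 2]      # skip users with only one label class
    return float(np.mean(aucs))
```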
5.4 Overall Performance
We first compare the overall performance of SUM with the aforementioned competitive methods. The results are reported in Table 2, from which we make the following observations.

Among all the benchmark methods, AttMerge is the only one that is non-sequential. As we focus on the sequential recommendation scenario, the fact that AttMerge falls far behind the other methods is expected.

GRU is a strong baseline and is so far the most widely used incrementally updatable model in industry. SGRU and HRNN are two upgraded versions of GRU-based models with more complicated structures than GRU. However, both SGRU and HRNN fail to consistently outperform GRU across the two datasets, which indicates that simply stacking GRU layers or breaking behavior sequences into sessions is not powerful and general enough to handle various kinds of datasets. Moreover, if we directly apply the vanilla memory network architecture, i.e., NTM, for user modeling, the performance is worse than the GRU models.
Table 3: Disabling every component in SUM leads to a performance drop.

Model                     Display Ads        Taobao
                          gAUC    NDCG       gAUC    NDCG
SUM                       -       -          -       -
w/o instance-level att    0.8321  0.7389     0.9343  0.9138
w/o proximity debuff      0.8320  0.7385     0.9410  0.9222
w/o highway channel       0.8316  0.7378     0.939   0.9196
reading operation (-)     0.8320  0.7374     0.9348  0.9140
writing like NTM          0.8306  0.7354     0.9309  0.9090
MCPRN, RUM, MIMN and HPMN all leverage and improve memory networks for user modeling. Although the best of them beats the GRU-based models on each dataset considered separately, none of them beats all GRU-based models on both datasets. For example, HPMN performs very well on the Taobao dataset, but it performs almost the same as GRU on the Display Ads dataset.

SUM outperforms all the baseline methods significantly on the different evaluation metrics, which verifies the effectiveness of the proposed model. More importantly, SUM performs best consistently on both datasets, which demonstrates the robustness of our proposed method.
Table 4: LPD benefit comparison for GRU, RUM and SUM.

Model          Display Ads        Taobao
               gAUC    NDCG       gAUC    NDCG
GRU            0.8262  0.7306     0.9279  0.9052
GRU w/ LPD     0.8267  0.7312     0.9273  0.9051
RUM            0.8303  0.7366     0.9300  0.9011
RUM w/ LPD     -       -          -       -
SUM w/o LPD    0.8320  0.7385     0.9410  0.9222
SUM            -       -          -       -
5.5 Ablation Study
Key components in SUM include the instance-level attention, the local proximity debuff (LPD), the highway channel, and the reading/writing attention mechanism. To verify each component's impact, we disable one component at a time while keeping the other settings unchanged, then test how the performance is affected. We use reading operation (-) to denote the alternative setting of the reading operation stated in Section 4.4, and we further replace SUM's writing mechanism with NTM's, denoted writing like NTM. From Table 3 we can see that removing any one component from SUM causes a consistent performance drop on both datasets.

Among the key components of SUM, perhaps the least intuitive one is the LPD, so we conduct additional experiments to better understand its impact. LPD is a flexible unit that can be plugged into different models. We choose three models (GRU, RUM and SUM) and report the results in Table 4. Interestingly, LPD benefits RUM and SUM, but it does not make a significant difference to GRU. The reason is that GRU has a gated mechanism
to control how much information to absorb and how much memory to forget, and the gated attention can be well determined by comparing the hidden state with the input event. However, when it comes to multi-channel hidden states, it is hard to sense the local proximity information, because behaviors are dispersed to different channels by user interest. Therefore, LPD is beneficial to multi-channel-aware models like RUM and SUM, but it is meaningless to a single-channel model like GRU.

Figure 4: Performance with different numbers of slots ((a) Display Ads, (b) Taobao).

Figure 5: The proportion of cumulative readout attention over all users for each channel, for 5 models with the channel number set in {3, 4, 5, 8, 10}. For each model, the last channel index corresponds to the highway channel. The dataset is Display Ads.
5.6 Impact of the Number of Channels
Figure 4 shows how the number of memory channels impacts SUM's performance; for comparison, we also plot RUM in Figure 4. We observe that the performance patterns on the two datasets are slightly different. For the Display Ads dataset, a good setting for the channel number is 10, and further increasing the channel number does not improve accuracy significantly. For the Taobao dataset, however, the ideal channel number is around 5, after which SUM's performance saturates while RUM's starts to decline. From this experiment we learn that model capacity cannot be enhanced indefinitely by adding more channels: allocating excessive channels makes the model hard to converge and may even damage performance.
Table 5: Writing utilization on the Display Ads dataset.
5.7 Channel Utilization
Next we analyze the utilization of channels within one model. Note that we have two types of attentive operations in SUM: the writing attention and the reading attention. For the writing attention pattern study, we report the average number of activated channels per user in the writing stage. Here, an "activated channel" is defined as the channel with the largest writing attention score for a user behavior. Take a SUM model with 5 channels as an example: after going through the behavior sequence of one user, if the number of activated channels is 2, then the utilization for her is 2/(5-1) = 0.5, where the always-on highway channel is excluded from the count.

An interesting finding from Figure 5 is that, overall, the channels' reading utilization is even, and as the channel number increases, the highway channel's utilization proportion becomes more prominent. This phenomenon is reasonable: as the channel number increases, each channel stores more fine-grained interests for users, the distinction between the highway channel and the other channels becomes clearer, and the highway channel plays a more important role in modeling users' integrated interests.
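Under our reading of this metric (the always-on highway channel excluded from both the count and the denominator), the writing utilization for one user can be computed as follows; `z_seq` is a hypothetical (T, K) array of per-event writing attention scores.

```python
import numpy as np

def writing_utilization(z_seq, highway_idx=-1):
    """Fraction of interest channels a user ever 'activates', where an event
    activates the interest channel receiving its largest writing score."""
    K = z_seq.shape[1]
    interest = np.delete(z_seq, highway_idx, axis=1)  # drop the highway channel
    activated = set(np.argmax(interest, axis=1))      # channels that won an event
    return len(activated) / (K - 1)

# e.g., 5 channels, a user whose events activate 2 interest channels -> 2/4 = 0.5
```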
5.8 Case Studies
We provide some case studies to demonstrate how user behaviors are distributed to different interest channels. We randomly sample two users from the Taobao dataset and report the results in Figure 6, where a darker color indicates a higher attention score on a channel. We report the attention scores on 4 channels, excluding the highway channel because it always has an attention score of 1. We observe that (1) users behave differently across the four interest channels: user A's preference seems to evolve from interest 0 to interest 2, and her interests are mostly concentrated on channels 0 and 2, whereas user B's interests look more evenly distributed; and (2) user behaviors indeed have the local proximity property, as events with darker colors tend to emerge in series.

To get a concrete sense of what the different channels look like, we plot Figure 7 using our Display Ads dataset (note that the Taobao dataset is anonymized, so we cannot recover the real meaning of each category), on which we know the real meaning of each category and thus can clearly differentiate the latent interests of each channel. For example, the top items in channel 0 relate to Home & Garden, those in channel 1 to Health, those in channel 2 to Vehicles, and those in channel 3 to Apparel.

Figure 6: Heat maps of channel coefficients over the behavior sequences (historical item index vs. interest channel) of two randomly sampled users, (a) user A and (b) user B, from the Taobao dataset.
Figure 7: Percentages of top item categories for each channel on the Display Ads dataset; for each channel, the most frequent category is exploded for better illustration. (a) Channel 0: Campers & RVs 25.9%, Industrial Manufacturing 20.7%, Plumbing 18.9%, Gardening 18.9%, Bedroom Furniture 15.6%. (b) Channel 1: Blood Sugar & Diabetes 50.0%, Dating & Marriage Matching 16.7%, Weight Loss 11.1%, Asian Cuisine 11.0%, World Music 11.0%. (c) Channel 2: Cars & Trucks 31.3%, Hardware Tools & Accessories 20.1%, Car Parts & Accessories 19.2%, Industrial Manufacturing 15.3%, Computer Hardware 14.1%. (d) Channel 3: Shirts, Tops & Blouses 32.2%, Women's Clothing 20.7%, Golf 17.4%, Underwear 17.0%, Pants, Jeans & Trousers 12.7%.
Interestingly, every channel is a mixture of various item categories, instead of being dedicated to one category.
5.9 Online A/B Test
We have deployed SUM in our NRT system for native ads serving. We conducted an online A/B test on one of our main native ads traffic segments, with the treatment group being SUM and the control group being GRU, since GRU was our best prior production model. Over a period of 23 days, the treatment group achieved a 1.46% gain in click yield (clicks over page views) and a 1.32% gain in revenue. We are still in the process of generalizing SUM to all the different traffic slices before it can fully replace our existing production model.
6 RELATED WORK
The sequential recommender system is an important branch of recommender systems that has attracted extensive attention in recent years. [8, 9, 30] are good early works discussing the application of RNNs to recommender systems, motivated by the successful application of RNNs in other domains such as natural language understanding. [28] proposes to use convolutional neural networks to model user behavior sequences, and [33] further improves on this by leveraging dilated convolutional networks to increase the receptive fields. Motivated by the recent success of self-attention based models and pretrained models, some researchers have proposed to leverage the Transformer for sequential recommender systems [10, 15, 27, 34]. As for industrial cases, [22] uses an RNN with GRU cells to generate user representations from user browsing histories, which has been successfully deployed in a news recommendation service. [35] proposes a deep interest evolution network (DIEN) to model users' interest-evolving process from behavior sequences; its key component is a new GRU structure enhanced with attention updating gates.

However, single-vector recommender systems do not have enough expressive power to model a user, especially when the behavior sequence is long or multiple interests exist in the sequence. [19] proposes Hi-Fi Ark, a user representation framework that comprehensively summarizes user behavior history into multiple vectors. Similarly, [13, 18] design multi-interest extractor layers with various dynamic routing mechanisms to extract users' diverse interests. But neither of these models is designed for sequential recommender systems. [29] proposes mixture-channel purpose routing networks (MCPRN) to detect the possible purposes of a user within a shopping session, so that it can recommend correspondingly diverse items to satisfy a user's different purposes. [25] argues that existing RNN-based models are only capable of dealing with relatively recent user behaviors, and proposes a hierarchical RNN with multiple update periods to better model a user's lifelong sequential behaviors. Recently, researchers have found that the Neural Turing Machine (NTM) [5] is a promising architecture for modeling a user's behavior sequence at a fine-grained level, and have studied how to leverage this architecture for user modeling [2, 23]. Our work is most closely related to RUM [2]; we point out the differences between our work and RUM in Section 4 and report experimental comparisons in Section 5.
7 CONCLUSION
A user's long behavior sequence usually reflects dynamic and diverse interests. Traditional sequential user modeling methods represent a user with a single vector, which is insufficient to describe such complicated and varied interests. We propose a novel Sequential User Matrix (SUM) model, which leverages multiple channels to capture a user's multiple interests during user modeling. Compared with existing memory-network-based sequential user models, SUM has three new components: an interest-level and instance-level attention mechanism, a local proximity debuff mechanism, and a highway channel. We conduct comprehensive experiments on two real-world datasets. The results demonstrate that our proposed model outperforms state-of-the-art methods consistently.
REFERENCES
[1] Yukuo Cen, Jianwei Zhang, Xu Zou, Chang Zhou, Hongxia Yang, and Jie Tang. 2020. Controllable Multi-Interest Framework for Recommendation. arXiv preprint arXiv:2005.09347 (2020).
[2] Xu Chen, Hongteng Xu, Yongfeng Zhang, Jiaxi Tang, Yixin Cao, Zheng Qin, and Hongyuan Zha. 2018. Sequential recommendation with user memory networks. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. 108–116.
[3] Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. 103–111.
[4] Kunal Dahiya, Deepak Saini, Anshul Mittal, Ankush Shaw, Kushal Dave, Akshay Soni, Himanshu Jain, Sumeet Agarwal, and Manik Varma. 2021. DeepXML: A Deep Extreme Multi-Label Learning Framework Applied to Short Text Documents. In WSDM '21. Association for Computing Machinery.
[5] Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing Machines. CoRR abs/1410.5401 (2014). http://arxiv.org/abs/1410.5401
[6] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: A Factorization-Machine based Neural Network for CTR Prediction. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017.
[7] Xueliang Guo, Chongyang Shi, and Chuanming Liu. 2020. Intention Modeling from Ordered and Unordered Facets for Sequential Recommendation. In Proceedings of The Web Conference 2020 (WWW '20). 1127–1137.
[8] Balázs Hidasi and Alexandros Karatzoglou. 2018. Recurrent neural networks with top-k gains for session-based recommendations. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 843–852.
[9] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939 (2015).
[10] Wang-Cheng Kang and Julian J. McAuley. 2018. Self-Attentive Sequential Recommendation. In IEEE International Conference on Data Mining, ICDM 2018, Singapore, November 17-20, 2018. 197–206. https://doi.org/10.1109/ICDM.2018.00035
[11] Yehuda Koren. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 426–434.
[12] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009), 30–37.
[13] Chao Li et al. 2019. Multi-Interest Network with Dynamic Routing for Recommendation at Tmall. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 2615–2623.
[14] Jing Li, Pengjie Ren, Zhumin Chen, Zhaochun Ren, Tao Lian, and Jun Ma. 2017. Neural attentive session-based recommendation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM, 1419–1428.
[15] Jiacheng Li, Yujie Wang, and Julian McAuley. 2020. Time Interval Aware Self-Attention for Sequential Recommendation. In Proceedings of the 13th International Conference on Web Search and Data Mining (WSDM '20). 322–330.
[16] Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. 2018. xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18). 1754–1763.
[17] Ting Liu, Andrew W Moore, Ke Yang, and Alexander G Gray. 2005. An investigation of practical approximate nearest neighbor algorithms. In Advances in Neural Information Processing Systems. 825–832.
[18] Zheng Liu, Jianxun Lian, Junhan Yang, Defu Lian, and Xing Xie. 2020. Octopus: Comprehensive and Elastic User Representation for the Generation of Recommendation Candidates. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '20). Association for Computing Machinery, New York, NY, USA, 289–298. https://doi.org/10.1145/3397271.3401088
[19] Zheng Liu, Yu Xing, Fangzhao Wu, Mingxiao An, and Xing Xie. 2019. Hi-Fi Ark: deep user representation via high-fidelity archive network. In Proceedings of the 28th International Joint Conference on Artificial Intelligence. AAAI Press, 3059–3065.
[20] Fuyu Lv, Taiwei Jin, Changlong Yu, Fei Sun, Quan Lin, Keping Yang, and Wilfred Ng. 2019. SDM: Sequential Deep Matching Model for Online Large-Scale Recommender System. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM '19). 2635–2643.
[21] Chen Ma, Peng Kang, and Xue Liu. 2019. Hierarchical Gating Networks for Sequential Recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '19). 825–833.
[22] Shumpei Okura, Yukihiro Tagami, Shingo Ono, and Akira Tajima. 2017. Embedding-Based News Recommendation for Millions of Users. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17). 1933–1942.
[23] Qi Pi, Weijie Bian, Guorui Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Practice on long sequential user behavior modeling for click-through rate prediction. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2671–2679.
[24] Massimo Quadrana, Alexandros Karatzoglou, Balázs Hidasi, and Paolo Cremonesi. 2017. Personalizing Session-Based Recommendations with Hierarchical Recurrent Neural Networks. In Proceedings of the Eleventh ACM Conference on Recommender Systems (RecSys '17). 130–137.
[25] Kan Ren, Jiarui Qin, Yuchen Fang, Weinan Zhang, Lei Zheng, Weijie Bian, Guorui Zhou, Jian Xu, Yong Yu, Xiaoqiang Zhu, et al. 2019. Lifelong Sequential Modeling with Personalized Memorization for User Response Prediction. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 565–574.
[26] Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management. 101–110.
[27] Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM '19). 1441–1450.
[28] Jiaxi Tang and Ke Wang. 2018. Personalized top-n sequential recommendation via convolutional sequence embedding. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. 565–573.
[29] Shoujin Wang, Liang Hu, Yang Wang, Quan Z Sheng, Mehmet Orgun, and Longbing Cao. 2019. Modeling multi-purpose sessions for next-item recommendations via mixture-channel purpose routing networks. In Proceedings of the 28th International Joint Conference on Artificial Intelligence. AAAI Press, 1–7.
[30] Chao-Yuan Wu, Amr Ahmed, Alex Beutel, Alexander J Smola, and How Jing. 2017. Recurrent recommender networks. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. 495–503.
[31] Chengfeng Xu, Pengpeng Zhao, Yanchi Liu, Jiajie Xu, Victor S. Sheng, Zhiming Cui, Xiaofang Zhou, and Hui Xiong. 2019. Recurrent Convolutional Neural Network for Sequential Recommendation. In The World Wide Web Conference (WWW '19). 3398–3404.
[32] Zeping Yu, Jianxun Lian, Ahmad Mahmoody, Gongshen Liu, and Xing Xie. 2019. Adaptive user modeling with long and short-term preferences for personalized recommendation. In Proceedings of the 28th International Joint Conference on Artificial Intelligence. AAAI Press, 4213–4219.
[33] Fajie Yuan, Alexandros Karatzoglou, Ioannis Arapakis, Joemon M. Jose, and Xiangnan He. 2019. A Simple Convolutional Generative Network for Next Item Recommendation. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (WSDM '19). 582–590.
[34] Shuai Zhang, Yi Tay, Lina Yao, and Aixin Sun. 2018. Next Item Recommendation with Self-Attention. CoRR abs/1808.06414 (2018). http://arxiv.org/abs/1808.06414
[35] Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Deep Interest Evolution Network for Click-Through Rate Prediction. In Thirty-Third AAAI Conference on Artificial Intelligence.

A APPENDIX
A.1 Dataset Details
Here we describe the complete settings of the datasets.
Display Ads Dataset. The SUM model was originally designed to support our online display advertising business in Bing Native Ads. We collect two weeks of ad-clicking logs as data samples, and collect users' web behavior history prior to the corresponding ad click for user modeling. The data samples are split 70%/15%/15% into training/validation/test sets by user to avoid information leakage caused by repeated user behaviors. For more efficient offline model training, user behavior sequences are truncated to 100. Basic data statistics are reported in Table 1. For each positive instance (an ad click from one user), we sample 1 negative instance from non-click impressions and randomly sample 3 negative instances by item popularity. Each web browsing behavior is represented by its web page title, e.g., "Why Are People Rushing To Get This Stylish New SmartWatch? The Health Benefits Are Incredible". We use the CDSSM [26] model as a text encoder to turn the raw text into a 128-dimension embedding vector, which is then used as the static feature for a user page-view behavior.
Taobao Dataset. This is a public e-commerce dataset collected from Taobao's recommender system (https://tianchi.aliyun.com/dataset/dataDetail?dataId=649&userId=1). The original dataset contains several types of user behaviors, such as page views and purchases. To make our two experimental datasets coherent, for the Taobao dataset we take the purchase behaviors as target activities (corresponding to ad clicks in the Display Ads dataset) and use page-view behaviors as the user modeling data (corresponding to web browsing in the Display Ads dataset). To stay focused on users with long behavior sequences, we only include users with more than 20 page-view behaviors. We sort the page-view behaviors by timestamp so that we can get the last K user behaviors prior to the user's purchase activity. User behavior sequences are truncated to 100, which aligns with the Display Ads dataset's setting. Basic data statistics are reported in Table 1. Since we do not have non-click impression logs for this dataset, all negative instances are randomly sampled according to item popularity; for each positive instance, we sample 4 negative instances. Unlike the Display Ads dataset, we do not have text descriptions for items, so we use item id and category id as one-hot features to represent items. The data samples are split 70%/15%/15% into training/validation/test sets by user to avoid information leakage caused by repeated user behaviors. To make the training process more efficient, we pretrain item embeddings with the word2vec algorithm (https://radimrehurek.com/gensim/models/word2vec.html) on the training set. All models load the pretrained item embeddings for a better warm start; this setting also avoids the need for the auxiliary loss in [23, 35].
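The popularity-proportional negative sampling used for the Taobao dataset can be sketched as follows (our illustration; `item_counts` is a hypothetical array of per-item interaction counts).

```python
import numpy as np

def sample_negatives(pos_idx, item_counts, n_neg=4, rng=None):
    """Sample n_neg negative item indices with probability proportional to
    item popularity, never returning the positive item itself."""
    rng = rng or np.random.default_rng()
    probs = np.asarray(item_counts, dtype=float).copy()
    probs[pos_idx] = 0.0                  # exclude the positive item
    probs /= probs.sum()
    return rng.choice(len(probs), size=n_neg, replace=False, p=probs)

# 1 positive : 4 negatives, as used for the Taobao dataset.
negs = sample_negatives(pos_idx=7, item_counts=np.arange(1, 101))
```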
A.2 Hyper-Parameter Settings
We use grid search to find the best hyper-parameters for each model on the validation set, then report the corresponding metrics on the test set. Experiments are repeated 3 times for each method and we take the best result, in order to avoid getting stuck in bad local optima. The exploration ranges are: learning rate in {0.0001, 0.0005, 0.001, 0.005, 0.01}; L2 regularization weight for model parameters and embedding parameters in {0.0, 0.0001, 0.001, 0.01}; number of slots for NTM / RUM / MCPRN / MIMN / HPMN / SUM, and number of layers for SGRU, in {2, 3, 4, 5, 8, 10, 20}, although, unless explicitly mentioned otherwise, all models' performance is reported with a slot number of 5. For HRNN, we try session-break time periods in {30, 60, 1440} minutes. The update period t is set to 2 according to [25], and the number of slots indicates the number of hierarchical layers. Batch size is fixed at 256 for all models. The embedding size of items is fixed at 128 on the Display Ads dataset; for the Taobao dataset, the item ID embedding size is 64 and the category ID embedding size is 16, and we concatenate the item ID and category ID embeddings to represent an item. User state sizes are 128 for the Ads dataset and 64 for the Taobao dataset. The optimizer is Adam. A suggested configuration that works for most cases is: learning rate 0.0005, λ
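For convenience, the search grid above can be written out as a config; `train_and_eval` is a hypothetical training/evaluation entry point.

```python
from itertools import product

grid = {
    "learning_rate": [0.0001, 0.0005, 0.001, 0.005, 0.01],
    "l2_weight":     [0.0, 0.0001, 0.001, 0.01],
    "num_slots":     [2, 3, 4, 5, 8, 10, 20],  # slots, or layers for SGRU
}
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
# for cfg in configs:
#     train_and_eval(cfg)   # hypothetical training/evaluation entry point
```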