Hybrid Interest Modeling for Long-tailed Users
Lifang Deng, Jin Niu, Angulia Yang, Qidi Xu, Xiang Fu, Jiandong Zhang, Anxiang Zeng
Alibaba Group, Beijing, China.
ABSTRACT
User behavior modeling is a key technique for recommender systems. However, most methods focus on head users with large-scale interactions and hence suffer from data sparsity issues. Several solutions integrate side information such as demographic features and product reviews; others transfer knowledge from richer data sources. We argue that current methods are limited by strict privacy policies, have low scalability in real-world applications, and that few works consider the behavioral characteristics behind long-tailed users. In this work, we propose the Hybrid Interest Modeling (HIM) network, which hybridizes personalized interest and semi-personalized interest when learning long-tailed users' preferences for recommendation. To achieve this, we first design the User Behavior Pyramid (UBP) module to capture fine-grained personalized interest of high confidence from sparse, even noisy, positive feedback. Moreover, since individual interactions are too sparse to model user interest adequately, we design the User Behavior Clustering (UBC) module to learn latent user interest groups with a novel self-supervised learning mechanism, which captures coarse-grained semi-personalized interest from group-item interaction data. Extensive experiments on both public and industrial datasets verify the superiority of HIM over state-of-the-art baselines.
KEYWORDS
Recommender Systems; Long-tailed User Modeling
Benefiting from the widespread usage of the Internet and mobile technologies, personalized recommender systems live at the heart of the industry; they aim to provide customized items for each user. User behavior modeling is a key technique for recommender systems, in which historical interactions play a crucial role [27, 34, 37]. Most methods design complex models to capture users' latent interest, since a wealth of information rests in abundant user interactions. Rich user-item interactions prop up accurate recommendation, and good recommendation greatly decreases the user churn rate; such a running mode creates a positive feedback loop. However, most users do not express their interests explicitly (e.g., click, buy), and the number of a user's interactions inherently follows a long-tailed distribution in real-world recommendation applications. Head users take advantage of their data richness and are well served, while tail users are just the opposite. Such long-tailed users are the silent majority in recommender systems, especially for rising companies. Take the two recommender systems on the homepages of Taobao and Lazada as examples (illustrated in Figure 1). Lazada is a rising e-commerce company in Southeast Asia; its user scale and interaction numbers are much smaller than Taobao's. The visualized data difference is presented in Figure 2 (for privacy concerns, we only visualize the tendency and eliminate details): the average length of user behavior sequences in Taobao increases much faster over time than in Lazada.
Figure 1: Homepage recommendation in Taobao and Lazada.

Long-tailed users dominate such rising e-commerce companies but are ill served by current user behavior modeling methods. However, implementing high-performance recommendation for long-tailed users is full of challenges. Several solutions integrate side information such as demographic features and product reviews, or transfer knowledge from other rich data sources. We argue that current methods may have low scalability in real-world applications and raise privacy concerns. Another promising direction is group recommendation: such methods capture coarse-grained semi-personalized interest from group-item interactions, where a user's preference on an unobserved item can be related to the group they belong to [3, 24]. But these methods either do not make use of a single user's individual interactions to capture fine-grained personalized interest, or need extra group information for initialization, making them hard to deploy online. In this paper, we propose the Hybrid Interest Modeling (HIM) network to improve the recommendation performance of long-tailed users in an end-to-end manner without requiring auxiliary data sources. To achieve this, we first take full consideration of the behavior patterns behind long-tailed users. These users form a massive, quick-changing group with few interactions and large intervals between them. We divide the long-interval interactions into multiple sessions and capture users' personalized interest and semi-personalized interest within each period. We first design the User Behavior Pyramid (UBP) module to capture fine-grained personalized interest. In this module, we consider that a user may click an item by mistake with no actual interest; this brings a worse experience for long-tailed users than for head users because of their fewer interactions. To conquer this, UBP takes both positive (e.g., click) and negative (e.g., not click) interactions as inputs and models the differences between them to selectively extract intrinsic interest of high confidence from sparse, even noisy, positive interactions. The key idea is that interactions with higher confidence contribute more to accurate recommendation. Moreover, individual interactions are too sparse to model user interest adequately. We further design the User Behavior Clustering (UBC) module to learn latent user interest groups with a novel self-supervised learning mechanism in an end-to-end manner, which captures coarse-grained semi-personalized interest from group-item interaction data. HIM models user-item interactions as well as group-item interactions; the two learning processes reinforce each other and boost recommendation performance for long-tailed users. To sum up, HIM contributes in the following aspects:
(1) We find that existing recommendation models ignore the behavior characteristics of long-tailed users.
We explore long-tailed users' behavior patterns and reorganize interactions in an effective way.
(2) We propose HIM to improve user behavior modeling for long-tailed users; its two specially designed modules, UBP and UBC, effectively reduce data sparsity and boost recommendation performance.
(3) We validate HIM on both public and industrial datasets and verify its superiority over state-of-the-art baselines. Notably, HIM has been deployed in Lazada's online recommender systems and has obtained more than 7% item page view (IPV) improvements across multiple Southeast Asian countries.
Figure 2: Statistics of user interactions.
Deep learning has led to many successes in building industry-scale recommender systems [7, 9, 13]. Compared with conventional approaches [2, 6, 14, 22], deep learning models capture the underlying relationships between users and items more effectively. One line of research puts its effort into discovering higher-order interactions among features and additionally starts to replace manual feature crafting with deep feature extraction [12, 13, 23, 28, 30]. These models focus on mining static relationships between users and items, ignoring the dynamics of users' preferences in real-world recommendation scenarios. Therefore, several deep recommendation methods concentrate on capturing users' preferences from rich historical interactions [4, 5, 27, 34, 38]. However, these methods heavily depend on the scale and quality of historical interactions. They ignore the long-tailed distribution problem, which may cause performance degradation for tail users with limited interactions. To address this issue, existing methods generally integrate auxiliary information of different modalities, such as user profiles, product reviews, titles, or other item side information, to boost recommendation performance with various deep model architectures [1, 11, 13, 17, 18, 36]. However, the accessibility of such auxiliary information limits scalability when these methods are deployed in different recommendation scenarios and raises privacy concerns.
Group recommendation has received a lot of attention in recent years and has been widely applied in various domains. Most previous group recommendation methods rely on pre-defined group information such as social relations [3, 19, 29, 35] to give recommendations to a group of people. AGREE [3] exploits attention mechanisms with group embeddings. DBRec [24], a dual-bridging framework initialized with pre-trained embeddings, jointly integrates collaborative filtering, latent group discovery, and hierarchical modeling tasks into a unified network. PreHash [32] defines anchor users and groups users into buckets with a hashing network. However, these methods need to be well initialized: the group information or the anchor users are explicitly required, so they cannot scale to scenarios where such initialization information is not available.
Based on the sparse-interaction dilemma of long-tailed users and the deficiencies of previous methods, we propose the Hybrid Interest Modeling (HIM) network. The model structure is shown in Figure 3. In the following subsections, we illustrate each module of HIM in detail.
In an e-commerce recommender system, users can click items, add items to the cart, or buy items. We treat these behaviors actively performed by users as positive feedback. If a user only browses a recommended item with no further action, we mark it as negative feedback, which is usually ignored in previous user behavior modeling. Here we argue that negative feedback can potentially indicate items users are not interested in and can be used as supplementary information to alleviate the sparsity of positive feedback. Thus we collect both positive and negative feedback sequences as user behavior input to enrich user interest learning. As for interaction reorganization, we observe that interactions of long-tailed users have larger time intervals and rarely carry informative temporal dependency; the interactions often show session-based traits. Thus, following the idea of session-based recommendation [16], we propose to divide the entire long behavior sequence into T sessions. The specific session split T is a hyperparameter that we adjust according to the user behavior distribution to obtain optimal performance. Then, within each session, we collect one user's interactions and re-rank them by interaction frequency instead of temporal order. We argue that more frequent interactions have greater confidence in indicating a user's authentic interest. For example, if a user clicks a mobile phone 8 times, the user is likely to prefer the mobile phone over items clicked only once. Both the positive and negative feedback are re-ranked in descending order of frequency. Besides, HIM adopts the target item components and other related features as input as well, which is a commonly used pre-processing operation; we omit the details here. A minimal sketch of this reorganization step is given below.
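To make the reorganization concrete, the following Python sketch splits one user's interactions into sessions by age and re-ranks each session by frequency. The field names, session boundaries, and log format are illustrative assumptions rather than the production implementation.

```python
from collections import Counter
from typing import List, Tuple

# One interaction: (item_id, age_in_days, is_positive). Illustrative format.
Interaction = Tuple[str, float, bool]

def reorganize(interactions: List[Interaction],
               session_bounds=(3, 7, 14, 30)):
    """Split interactions into sessions by age and re-rank each session
    by interaction frequency (most frequent first), separately for
    positive and negative feedback. Boundaries are illustrative."""
    sessions = [[] for _ in session_bounds]
    for item, age, pos in interactions:
        for i, bound in enumerate(session_bounds):
            if age <= bound:
                sessions[i].append((item, pos))
                break

    reorganized = []
    for sess in sessions:
        pos_freq = Counter(i for i, p in sess if p)
        neg_freq = Counter(i for i, p in sess if not p)
        reorganized.append({
            "pos": pos_freq.most_common(),   # [(item, frequency), ...]
            "neg": neg_freq.most_common(),
        })
    return reorganized

# Example: a tailed user with a handful of clicks over a month.
user_log = [("phone", 1.0, True), ("phone", 2.5, True), ("case", 2.0, True),
            ("tv", 1.5, False), ("tv", 20.0, False), ("phone", 25.0, True)]
print(reorganize(user_log))
```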
Figure 3: The framework of HIM and its inner module designs. (a) The framework of HIM: at the reorganization layer, users' behavior sequences are split into different sessions and rearranged based on interaction frequency. HIM contains two modules, UBP and UBC, which model personalized interest and semi-personalized interest, respectively. The attention layer models the interest hybrid process relative to the target item and obtains compact attentive features. The weighted user interest representations and the remaining features are concatenated and fed into a shared MLP for the final prediction. (b) UBP and UBC design: in UBP, we take both positive and negative feedback as inputs and model the relationship between them from the confidence perspective, using GRU and self-attention units to obtain enhanced feature representations. In UBC, we apply an auto-encoder to learn latent user interest groups by constructing a self-supervised mechanism based on the UBP personalized representations.
The above-mentioned reorganization layer is the basis for modeling long-tailed users' behavior characteristics. On top of it, the User Behavior Pyramid (UBP) is built to further capture users' interest within a session by modeling the correlation hidden between positive and negative feedback.

Within each session, $E^i = \{e^i_1, e^i_2, \ldots, e^i_n\} \in \mathbb{R}^{n \times d}$ represents the embeddings of the items the user has interacted with in the $i$-th session. We select the top $n$ most frequently interacted items as input; the value of $n$ depends on the data distribution, and $d$ is the embedding dimension. $F^i = \{f^i_1, f^i_2, \ldots, f^i_n\} \in \mathbb{R}^n$ is the corresponding frequency information of $E^i$. Here we enhance the confidence of interactions by multiplying each behavior embedding by its frequency value:

$$\hat{e}^i_j = f^i_j * e^i_j, \quad j \in [1, n],\ i \in [1, T] \tag{1}$$

where $*$ is the scalar-vector product. Meanwhile, inspired by the local activation unit [38], which adaptively calculates the activation state of each behavior embedding, we develop a distance-based attention mechanism to model the relevance of each positive feedback towards the compact negative feedback. We first apply a frequency-weighted sum pooling on the negative feedback to generate a fixed-length compact negative representation $\hat{e}^i_{neg}$, which reduces the noise within individual negative feedback, since negative feedback may be caused by various reasons besides dislike [15]:

$$\hat{e}^i_{j,pos} = f^i_{j,pos} * e^i_{j,pos}, \quad j \in [1, n],\ i \in [1, T],$$
$$\hat{e}^i_{neg} = pooling\left(f^i_{1,neg} * e^i_{1,neg}, \ldots, f^i_{n,neg} * e^i_{n,neg}\right),$$
$$d^i_j = sim\left(\hat{e}^i_{j,pos}, \hat{e}^i_{neg}\right) \tag{2}$$

where $sim(\hat{e}^i_{j,pos}, \hat{e}^i_{neg})$ is the Euclidean distance between the $j$-th positive feedback and the compact negative feedback in the $i$-th session. Different from inner-product based attention mechanisms, we assign more weight to the positive feedback that is less similar to the negative feedback. Then we apply the softmax function to get a normalized attentive weight $\alpha^i_j$ and multiply it with the frequency-weighted positive embedding:

$$\alpha^i_j = \frac{\exp\left(d^i_j\right)}{\sum_{k=1}^{n} \exp\left(d^i_k\right)}, \quad \tilde{e}^i_{j,pos} = \alpha^i_j * \hat{e}^i_{j,pos} \tag{3}$$

To get an enhanced representation, we utilize a GRU [8] to obtain the sequence embeddings. We feed the positive feedback to the GRU in descending order of frequency instead of temporal order, so the gating mechanism within the GRU captures the relationship between high-frequency and low-frequency feedback in frequency order. Besides, compared with other recurrent models such as vanilla RNNs or LSTMs, the GRU is computationally more efficient. Next, we flatten each hidden state of the positive feedback to get a fixed-length embedding vector and concatenate it with the negative embedding to obtain the personalized representation $p^i_x$ in each session:

$$p^i_x = concat\left(GRU\left(\tilde{e}^i_{j,pos} \mid j = 1, 2, \ldots, n\right);\ \hat{e}^i_{neg}\right) \tag{4}$$

For each session, we follow the same scheme and compute concurrently. A sketch of the per-session UBP computation follows.
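The sketch below is a minimal NumPy rendering of Eqs. (1)-(4) for one session, with a simple mean pooling and a hand-rolled GRU cell standing in for the paper's components; the dimensions, random weights, and the final flattening are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                                   # top-n items per session, embedding dim

def ubp_session(e_pos, f_pos, e_neg, f_neg):
    """Per-session UBP sketch: frequency weighting (Eqs. 1-2), distance-based
    attention (Eqs. 2-3), GRU over positives, concat with pooled negative (Eq. 4)."""
    e_pos_hat = f_pos[:, None] * e_pos                    # frequency-weighted positives
    e_neg_hat = (f_neg[:, None] * e_neg).mean(axis=0)     # pooled compact negative

    dist = np.linalg.norm(e_pos_hat - e_neg_hat, axis=1)  # Euclidean distance per positive
    alpha = np.exp(dist) / np.exp(dist).sum()             # larger distance -> larger weight
    e_pos_tilde = alpha[:, None] * e_pos_hat

    # Minimal GRU cell (illustrative weights), fed in frequency order.
    Wz, Wr, Wh = (rng.normal(size=(2 * d, d)) for _ in range(3))
    h = np.zeros(d)
    states = []
    for x in e_pos_tilde:                                 # descending-frequency order assumed
        xh = np.concatenate([x, h])
        z = 1 / (1 + np.exp(-xh @ Wz))                    # update gate
        r = 1 / (1 + np.exp(-xh @ Wr))                    # reset gate
        h_tilde = np.tanh(np.concatenate([x, r * h]) @ Wh)
        h = (1 - z) * h + z * h_tilde
        states.append(h)

    # Eq. (4): flatten the GRU states and concatenate the negative summary.
    return np.concatenate([np.concatenate(states), e_neg_hat])

p_x = ubp_session(rng.normal(size=(n, d)), np.array([8., 3., 2., 1., 1.]),
                  rng.normal(size=(n, d)), np.array([4., 2., 1., 1., 1.]))
print(p_x.shape)    # (n*d + d,)
```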
Then we split the computation into two branches. On one branch, $p^i_x$ is fed into a self-attention unit [31] to get an aggregated personalized representation $p^i_z$ across all sessions; $p^i_z$ is computed as the weighted sum of linearly transformed input elements, and the weight coefficient $\alpha^i_t$ is computed with a softmax function:

$$p^i_z = \sum_{t=1}^{T} \alpha^i_t \left(W^i p^i_x\right), \quad \alpha^i_t = \frac{\exp\left(w^i_t\right)}{\sum_{t=1}^{T} \exp\left(w^i_t\right)} \tag{5}$$

where $w^i_t$ is computed by a compatibility function that compares the two corresponding input elements $p^i_x$ and $p^i_z$. On the other branch, we feed the two UBP embedding vectors $p^i_x$ and $p^i_z$, which hold a progressive relationship, into the UBC module to help construct latent user groups in each session.

User Behavior Clustering (UBC) is another module we design to further relieve sparsity: the individual preference for an unseen item can be inferred from users within the same latent group who have interacted with the item. Since UBC models the interest of a group of people, it can be seen as a coarse-grained semi-personalized interest modeling module. The auto-encoder (AE) is a category of deep learning methods that compresses data into a dense code and then maps the code back into a reconstruction of the original input. The appeal of AEs lies in the fact that they can learn representations in a fully unsupervised way. However, empirical experience tells us that learning group embeddings solely from un-preprocessed input features with an AE may not promise robust performance [24]. We deal with this dilemma by taking the two progressive representations learned from UBP as input and constructing a heuristic self-supervised AE to enhance the clustering.

Referring to Figure 3(b), we learn user group embeddings $G^i_u \in \mathbb{R}^{k \times d_g}$ within each session, where the group number $k$ is much smaller than the number of users and $d_g$ denotes the dimension of the group embedding. Recall the intermediate representation $p^i_x$ and the self-attention output $p^i_z$ in UBP: the two representations share the same feature dimension, and $p^i_z$ is a higher-order representation than $p^i_x$, so a progressive relationship holds between them.

In UBC, we initialize the AE with random weights and let it learn effective group embeddings from scratch. In the intermediate layer, we set the representation dimension to $k$, exactly the number of user groups, so each value of the vector denotes the probability of the user belonging to the corresponding group. The reconstructed representation $\hat{p}^i_x$ after the AE is computed as follows:

$$\beta^i = softmax\left(W^i_c p^i_x + b^i_c\right), \quad \mu^i = \sum_{s=1}^{k} \beta^i_s g^i_s, \quad \hat{p}^i_x = \sigma\left(W^i_r \mu^i + b^i_r\right) \tag{6}$$

where $\sigma$ is the sigmoid activation function, $W^i_c \in \mathbb{R}^{k \times d}$, $W^i_r \in \mathbb{R}^{d \times d_g}$, $b^i_c \in \mathbb{R}^k$, $b^i_r \in \mathbb{R}^d$; $\beta^i_s$ is the $s$-th dimension of vector $\beta^i$, and $g^i_s$ is the $s$-th row of the group embedding matrix $G^i_u$. The reconstructed user representation $\hat{p}^i_x$ will be used for the reconstruction loss of UBC described below. The learned semi-personalized representation $c^i_z$ is:

$$c^i_z = g^i_j, \quad j = \arg\max_s \left(\beta^i_s\right),\ s \in [1, k] \tag{7}$$

where $j$ is the label of the user group on which the user has the maximum activation, and $g^i_j$ is the corresponding group embedding in $G^i_u$. A minimal sketch of this clustering auto-encoder follows.
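As a rough NumPy sketch of Eqs. (6)-(7), assuming randomly initialized weights and a single session: the soft group assignment, the reconstruction, and the hard group lookup are shown, while training is driven by the losses described next.

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_g, k = 48, 16, 5            # UBP dim, group-embedding dim, number of groups

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Randomly initialized UBC parameters for one session (learned in practice).
W_c, b_c = rng.normal(size=(k, d)), np.zeros(k)       # encoder: d -> k group logits
G_u = rng.normal(size=(k, d_g))                       # group embedding matrix
W_r, b_r = rng.normal(size=(d, d_g)), np.zeros(d)     # decoder: d_g -> d

def ubc_session(p_x):
    """Eq. (6): soft assignment beta, mixed group code mu, reconstruction p_x_hat.
    Eq. (7): the semi-personalized representation is the most activated group."""
    beta = softmax(W_c @ p_x + b_c)                   # membership probabilities over k groups
    mu = beta @ G_u                                   # weighted mixture of group embeddings
    p_x_hat = 1 / (1 + np.exp(-(W_r @ mu + b_r)))     # sigmoid reconstruction of p_x
    c_z = G_u[np.argmax(beta)]                        # hard group embedding lookup
    return beta, p_x_hat, c_z

beta, p_x_hat, c_z = ubc_session(rng.normal(size=d))
print(beta.round(3), p_x_hat.shape, c_z.shape)
```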
The intuition behind the self-supervision is that users with similar personal preferences are more likely to belong to the same group. So, in the learning process, the reconstructed representation $\hat{p}^i_x$ obtained from $p^i_x$ after the group embedding projection should stay close to the high-level personalized representation $p^i_z$, keeping individual interest learning and group interest learning consistent. In the loss construction stage, we adopt the contrastive max-margin objective commonly used in previous work [20, 33] to minimize the distance between $\hat{p}^i_x$ and $p^i_z$:

$$\mathcal{L}_g = \sum_{i=1}^{T} \sum_{j=1}^{p} \max\left(0,\ 1 - \hat{p}^i_x p^i_z + \hat{p}^i_x \hat{p}^{i,j}_x\right) \tag{8}$$

where $\mathcal{L}_g$ is a hinge loss that maximizes the similarity between $\hat{p}^i_x$ and $p^i_z$ and simultaneously minimizes that between $\hat{p}^i_x$ and negative samples; $\{\hat{p}^{i,j}_x \mid j = 1, 2, \ldots, p\}$ are negative embeddings obtained by randomly selecting $p$ users as negative users. A small sketch of this loss is given below.
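A compact sketch of the max-margin objective in Eq. (8), assuming dot-product scores, a margin of 1, and randomly chosen negative users; written in NumPy for illustration only.

```python
import numpy as np

def group_loss(p_x_hat, p_z, neg_p_x_hat, margin=1.0):
    """Hinge loss of Eq. (8) for the sessions of one user.

    p_x_hat:      (T, d) reconstructed representations per session
    p_z:          (T, d) self-attention personalized representations
    neg_p_x_hat:  (T, p, d) reconstructions of p randomly sampled negative users
    """
    pos_scores = np.sum(p_x_hat * p_z, axis=1, keepdims=True)        # (T, 1)
    neg_scores = np.einsum("td,tpd->tp", p_x_hat, neg_p_x_hat)        # (T, p)
    return np.maximum(0.0, margin - pos_scores + neg_scores).sum()

rng = np.random.default_rng(2)
T, p, d = 4, 3, 48
print(group_loss(rng.normal(size=(T, d)),
                 rng.normal(size=(T, d)),
                 rng.normal(size=(T, p, d))))
```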
Table 1: Statistics of datasets

Dataset | Users | Items | Samples | tailed user ratio | body user ratio | head user ratio
Musical Instruments | 51,253 | 14,194 | 129,867 | 47% | 34% | 19%
Electronics | 622,308 | 70,323 | 1,589,018 | 43% | 35% | 22%
Industrial (sampled) | 3 million | 4 million | 0.1 billion | 55% | 19% | 26%

Up to now, we have obtained user behavior feature embeddings, including the individual-based $p_z$ and the group-based $c_z$, which concatenate the outputs of all sessions, respectively:

$$p_z = concat\left(p^1_z; p^2_z; \ldots; p^T_z\right), \quad c_z = concat\left(c^1_z; c^2_z; \ldots; c^T_z\right) \tag{9}$$

To model the mutual information between the target item and the user behavior, we apply a target-item attention mechanism to automatically learn the weights of $p_z$ and $c_z$. The target item's embedding $e_t$ contains the information of its ID, price, brand, shop, and category; all these vectors are concatenated to obtain the overall representation of the target item. Here we apply the commonly used dot-product attention mechanism:

$$e_p = \frac{\exp\left(p_z W_p e_t\right)}{\exp\left(p_z W_p e_t\right) + \exp\left(c_z W_c e_t\right)} * p_z, \quad e_c = \frac{\exp\left(c_z W_c e_t\right)}{\exp\left(p_z W_p e_t\right) + \exp\left(c_z W_c e_t\right)} * c_z \tag{10}$$

where $W_p \in \mathbb{R}^{d_p \times d_t}$, $W_c \in \mathbb{R}^{d_c \times d_t}$, $d_p$ is the dimension of $p_z$, $d_c$ is the dimension of $c_z$, and $d_t$ is the dimension of $e_t$. The attention-weighted user behavior features, the target item component features, and the other context and cross features are concatenated and fed into a multilayer perceptron (MLP) to get two probabilistic logits, which indicate the probability of the sample belonging to the negative or the positive class, respectively.

The learning module is updated by the cross-entropy loss, where $\mathcal{D}$ is the training set, $y$ is the label (e.g., for the CTR prediction task, $y \in \{0, 1\}$ represents whether the user clicks the target item or not), $\hat{y}_{uv}$ is the predicted score from our model, $x$ is the input feature, and $N$ is the sample count:

$$\mathcal{L}_c = -\frac{1}{N} \sum_{(x, y) \in \mathcal{D}} \left[y \log \hat{y}_{uv} + (1 - y) \log\left(1 - \hat{y}_{uv}\right)\right] \tag{11}$$

In the proposed HIM model, we define two loss functions: the group loss $\mathcal{L}_g$ and the cross-entropy loss $\mathcal{L}_c$. The joint loss is:

$$\mathcal{L} = \alpha \mathcal{L}_g + \mathcal{L}_c \tag{12}$$

where $\alpha$ controls the trade-off between the losses of personalized interest modeling and semi-personalized interest modeling. The training of HIM can be decomposed into two parts: since user embeddings are shared for learning group-item and user-item interactions, user embeddings can be learned well from group-item interactions even when the user has few user-item interactions. A minimal sketch of the target-item attention fusion of Eq. (10) is given below.
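The following sketch shows the dot-product target-item attention of Eq. (10) that weights the personalized and semi-personalized representations; the dimensions and random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
d_p, d_c, d_t = 64, 32, 24

W_p = rng.normal(size=(d_p, d_t)) * 0.05   # learned projections in practice
W_c = rng.normal(size=(d_c, d_t)) * 0.05

def hybrid_attention(p_z, c_z, e_t):
    """Eq. (10): softmax over the two target-aware scores, then weight
    the personalized (p_z) and semi-personalized (c_z) representations."""
    score_p = p_z @ W_p @ e_t
    score_c = c_z @ W_c @ e_t
    denom = np.exp(score_p) + np.exp(score_c)
    e_p = (np.exp(score_p) / denom) * p_z
    e_c = (np.exp(score_c) / denom) * c_z
    # e_p and e_c are concatenated with the other features and fed to the MLP.
    return e_p, e_c

e_p, e_c = hybrid_attention(rng.normal(size=d_p), rng.normal(size=d_c),
                            rng.normal(size=d_t))
print(e_p.shape, e_c.shape)
```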
To thoroughly evaluate the performance of HIM, we conduct comprehensive experiments on two public datasets and one real-world industrial dataset.

Public Dataset
In this paper, we select a widely used public benchmark, the Amazon dataset [26], conducting contrast experiments on two subsets: Musical Instruments and Electronics, which contain the necessary product reviews and metadata. To keep consistent with our feedback setting, we label the samples whose ratings are higher than 3 as positive and ratings of 1, 2, and 3 as negative (ratings range from 1 to 5). For each positive feedback, we randomly sample another 5 items as negative feedback. To focus on recommendation for long-tailed users, we only filter out users with fewer than 1 positive item and items with fewer than 5 users, leaving more long-tailed users in our dataset compared with the processed datasets used in previous studies [24, 37]. We use 70% of the data for training, 10% for validation, and 20% for testing. A sketch of this labeling and negative-sampling step is shown below.
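As a small illustrative sketch of the labeling and negative sampling described above (the toy records, column order, and catalog are assumptions, not the released preprocessing script):

```python
import random

random.seed(0)

# Toy review records: (user, item, rating on a 1-5 scale).
reviews = [("u1", "i1", 5), ("u1", "i2", 2), ("u2", "i3", 4), ("u2", "i4", 3)]
catalog = [f"i{k}" for k in range(1, 101)]

samples = []
for user, item, rating in reviews:
    if rating > 3:                       # ratings 4-5 -> positive feedback
        samples.append((user, item, 1))
        seen = {i for u, i, r in reviews if u == user}
        # Randomly sample 5 unseen items as negative feedback for this positive.
        for neg in random.sample([i for i in catalog if i not in seen], 5):
            samples.append((user, neg, 0))
    else:                                # ratings 1-3 -> negative feedback
        samples.append((user, item, 0))

print(samples[:8])
```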
Industrial Dataset
To the best of our knowledge, no public dataset with complete long-tailed user interactions has been released, and public datasets such as Amazon usually lack real negative feedback. So we also experiment on an industrial dataset, which is a sampled version of our whole data. The data is constructed from impression and click logs of the Lazada online recommender system; Lazada, as a growing company, runs an e-commerce business across multiple Southeast Asian countries. In experimental practice, the train, validation, and test sets are split along the time axis, which is the traditional industrial setting. Table 1 lists the statistics.
Experimental Setup
In HIM, we subdivide interactions into different sessions. Figure 4 shows the distribution of users' interaction time intervals in the public and industrial datasets, respectively. We choose the time intervals corresponding to approximately the 10%, 30%, and 50% points of the distribution. For Musical Instruments, the session list is {14d, 6m, 12m, all}, which contains users' interactions during the last 14 days, from the last 6 months to the last 14 days, from the last 12 months to the last 6 months, and all earlier interactions, respectively. For the Electronics dataset, the session list is {3m, 9m, 18m, all}. For the industrial dataset, we select the user's interactions in the last month, and the session list is {3d, 7d, 14d, 30d}.

Figure 4: The distribution of users' interaction time intervals in the public and industrial datasets.

As previously stated, user behavior patterns vary greatly, so it is inaccurate to report all users' results uniformly when we highlight effectiveness on long-tailed users who have fewer interactions. Hence we divide users into three categories based on their interaction frequencies. For the public datasets, a tailed user has a behavior sequence shorter than 3, a body user between 3 and 5, and a head user longer than 5. For the industrial dataset, a tailed user appears in the app on fewer than 7 days in one month, a body user between 7 and 15 days, and a head user on more than 15 days. As for the performance metric, the area under the ROC curve (AUC) is adopted; all experiments are repeated 10 times and the averaged result is reported. A sketch of this user bucketing is given below.
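For illustration, a hypothetical helper that assigns users to the tail/body/head buckets used for reporting, under the thresholds stated above (the thresholds come from the paper; the function itself is an assumption):

```python
def user_bucket(value: int, industrial: bool = False) -> str:
    """Assign a user to a reporting bucket.

    Public datasets: value = behavior sequence length (tail < 3, body 3-5, head > 5).
    Industrial dataset: value = active days in one month (tail < 7, body 7-15, head > 15).
    """
    lo, hi = (7, 15) if industrial else (3, 5)
    if value < lo:
        return "tail"
    if value <= hi:
        return "body"
    return "head"

print(user_bucket(2), user_bucket(4), user_bucket(20, industrial=True))
```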
Implementation
HIM is implemented in TensorFlow and we use Adam [21] as the optimizer. All related experiments are conducted on one Tesla P100 GPU with 16 GB memory. On the public datasets, we set the batch size to 256, the learning rate to 1e-3, the preliminary embedding dimension to 8, and the group embedding dimension to 16. For the industrial dataset, the batch size is 1024, the learning rate is 1e-4, the embedding and group embedding dimensions are 32 and 128, respectively, and the MLP layer dimensions are [256, 128, 32, 2].
In the experiments, we compare HIM with the following methods:

• Popularity [10] is an intuitive and very simple baseline that recommends items by their popularity, measured by the number of interactions.
• LR (Logistic Regression) [25] is a widely used shallow model adopted for recommendation before deep learning networks.
• BaseModel is the baseline of HIM. It shares the same Embedding&MLP setting with HIM but simply uses sum pooling to aggregate behaviors, neither considering negative feedback nor distinguishing individual and group characteristics.
• GRU4Rec [16] is the first work using the recurrent GRU cell to model temporal sequential user behaviors.
• DIEN [37] achieves state-of-the-art performance on sequential recommendation. Its key points are two-fold: extracting latent temporal interests from explicit user behaviors, and modeling the interest evolving process.
• DBRec [24] is a current state-of-the-art solution for long-tailed user recommendation; it models user-user group and item-item group hierarchies to learn compact user/item representations.

In the reported results, all hyperparameter settings of these methods are kept the same as in the original works.
Model Comparison
Table 2 shows the results on the public and industrial datasets, respectively. For a fair comparison, the LR model is computed in a straightforward way without extensive feature engineering. According to the experimental results, we have the following observations:

• Compared with the naive Popularity baseline, LR gains overall performance improvements across all user categories, which indicates that a basic fitting model is still capable of capturing some interactions. However, the simple deep BaseModel surpasses both Popularity and LR, suggesting that a deep model is necessary for modeling complex interactions.
• Among the deep-learning based models, GRU4Rec simply models chronological behavior. Its improvement over the base model is small, since the temporal dependence between behavior sequences is not obvious for long-tailed users. DIEN performs better than GRU4Rec, especially for head users, since DIEN captures interest evolution from concrete behavior sequences, while GRU4Rec lacks an explicit modeling process for the latent interest behind the interactions. DBRec outperforms DIEN on tailed users on the Amazon Musical and Industrial datasets, because the long-tailed issue is more severe on these two datasets and group information can boost recommendation performance when interactions are insufficient.
• Compared with the results on the public datasets, the AUC on the industrial dataset is relatively lower. This is because the data sparsity problem is more severe in our system and we use complete, real-world long-tailed user interaction data, which makes predicting users' interest even more challenging.
• Our proposed HIM significantly outperforms the other methods by a large margin, which shows that HIM learns long-tailed users' behavior characteristics better and captures user interest more accurately with limited behaviors.
Ablation Study
In this section, we conduct an ablation study to further discuss the contribution of each module in HIM. Since there is no authentic negative feedback in the public datasets, the UBP module learned on the public datasets is a variant that excludes the modeling of negative feedback, referred to as positive UBP: it has no pooling operation on the negative feedback sequence and no distance-based attention between positive and negative feedback. Despite this variation in feature organization, positive UBP still brings a large gain on these datasets, as shown in Table 3, demonstrating that the behavior modeling based on frequency rearrangement is useful for long-tailed users. It is worth mentioning that, compared with DIEN, which also only uses positive feedback for personalized interest modeling, the positive UBP of HIM performs better.

On the industrial dataset, the positive and negative feedback modeling results of UBP are shown in Table 3. Adding the negative feedback pooling operation to positive UBP brings a 0.015 absolute AUC gain, which indicates that negative feedback is a highly informative supplement to the basic behavior. Meanwhile, adding the distance-based attention between positive and negative feedback, which further models users' behavior and generates a "reliable" user representation, achieves another 0.003 absolute AUC gain. In addition to the personalized modeling module UBP, the semi-personalized modeling module UBC also brings an extra absolute AUC lift on the public and industrial datasets, indicating that semi-personalized recommendation is a good supplement to personalized recommendation, especially for long-tailed users.
Table 2: Results of Different Models (AUC)
Datasets | Model | all | tailed user | body user | head user
Musical | Popularity | 0.798 | 0.796 | 0.800 | 0.807
Musical | LR | 0.813 | 0.810 | 0.814 | 0.823
Musical | BaseModel | 0.817 | 0.809 | 0.820 | 0.835
Musical | GRU4Rec | 0.817 | 0.812 | 0.818 | 0.830
Musical | DIEN | 0.820 | 0.813 | 0.826 | 0.836
Musical | DBRec | 0.827 | 0.823 | 0.825 | 0.840
Musical | HIM | 0.831 | 0.825 | 0.833 | 0.848
Electronics | Popularity | 0.811 | 0.811 | 0.812 | 0.813
Electronics | LR | 0.867 | 0.866 | 0.869 | 0.869
Electronics | BaseModel | 0.874 | 0.870 | 0.878 | 0.888
Electronics | GRU4Rec | 0.876 | 0.873 | 0.879 | 0.886
Electronics | DIEN | 0.878 | 0.875 | 0.882 | 0.890
Electronics | DBRec | 0.874 | 0.871 | 0.877 | 0.884
Electronics | HIM | 0.883 | 0.880 | 0.887 | 0.896
Industrial | Popularity | 0.555 | 0.566 | 0.556 | 0.549
Industrial | LR | 0.555 | 0.553 | 0.555 | 0.557
Industrial | BaseModel | 0.573 | 0.557 | 0.576 | 0.580
Industrial | GRU4Rec | 0.576 | 0.556 | 0.579 | 0.583
Industrial | DIEN | 0.579 | 0.550 | 0.585 | 0.585
Industrial | DBRec | 0.574 | 0.560 | 0.574 | 0.578
Industrial | HIM | 0.607 | 0.610 | 0.612 | 0.608
Table 3: Results on Different Components of HIM
Datasets | Components | all | tailed user | body user | head user
Musical | BaseModel | 0.817 | 0.809 | 0.820 | 0.835
Musical | + UBP | 0.826 | 0.820 | 0.827 | 0.843
Musical | HIM (+ UBP&UBC)
Electronics | BaseModel | 0.874 | 0.870 | 0.878 | 0.888
Electronics | + UBP | 0.879 | 0.876 | 0.882 | 0.891
Electronics | HIM (+ UBP&UBC)
Industrial | BaseModel | 0.573 | 0.557 | 0.576 | 0.580
Industrial | + UBP | 0.604 | 0.606 | 0.609 | 0.605
Industrial | HIM (+ UBP&UBC)
Effect of Confidence Modeling in UBP
We argue that random clicks exist in the positive feedback. Thus we design the Euclidean-distance-based attention to model the similarity between the aggregated negative embedding and each positive item embedding. Referring to Figure 5(a), we visualize the distance for users of different sparsity and make the following observations. First, the distance for tailed users is relatively small; a smaller Euclidean distance reflects a higher uncertainty of the positive behaviors. Second, for head users, the distance varies largely among different sessions. This indicates that users' behavior shifts across time, so sessions help understand users' preferences better than modeling the whole behavior sequence directly.
Effect of UBC
When we hybridize the embeddings from UBP and UBC, we apply the target item embedding to automatically learn their weights. To study the effect of this attention and gain a better understanding of UBC, we analyze the learned weights and drill down to the different types of users. As shown in Figure 5(b), UBP's weight is consistently higher (above 0.7) than UBC's weight (around 0.26), indicating that the personalized embedding is more important when inferring users' preferences. Meanwhile, UBC's weight is relatively higher for tailed users, which indicates that such semi-personalized group embedding is a promising method when users' interactions are extremely sparse.
Figure 5: Performance under different interaction sparsity: (a) the Euclidean distance between positive and negative feedback, where a smaller distance reflects higher uncertainty of the positive behaviors; (b) the weight allocation between UBP and UBC, where users with higher sparsity have a higher UBC weight.
Figure 6: Performance of tuning hyperparameters.
Effect of hyperparameters
In this section, we further discuss the performance of HIM under different pre-specified user group numbers k and loss weights α for hybrid modeling within a limited set. Empirically, the group number is varied among [5, 10, 15, 20, 25, 30], and α among [0.0001, 0.001, 0.01, 0.1, 1, 10]. As shown in Figure 6, we present the performance across the datasets: Figure 6(a) and Figure 6(b) show that performance degrades rapidly with group number k = 30 and loss weight α = 1 on the Amazon Musical Instruments dataset. Overall, the performance is relatively stable under a wide range of choices and is best when clustering into 5 groups. On the public datasets, the best performance is achieved with α = 0.0001, while for the industrial dataset it is 0.1; this result proves to be related to the size of the dataset.

We have deployed the proposed HIM in the Lazada online recommendation scenario across different Southeast Asian countries, including Indonesia (ID), Malaysia (MY), Vietnam (VN), and Thailand (TH). A standard A/B test is conducted online: we run the online experiment for one month, and the average item page view (IPV) gain of different user groups is reported. As shown in Table 4, all users' IPV is improved; a higher IPV indicates that users are more willing to browse and click items on our platform. The improvement is especially large for tailed users, because HIM learns long-tailed users' preferences well, which leads to more positive feedback. As this e-commerce company is still growing, optimization based on the characteristics of long-tailed users would result in a significant boost in revenue in the long term.
In this paper, we study user interest modeling under limited interactions in newborn e-commerce and provide a new perspective for considering the interaction sparsity issue caused by dominant long-tailed users in an actual production environment. We propose a new structure for long-tailed users, namely the Hybrid Interest Modeling (HIM) network, to balance individual expression and group expression and achieve better recommendation performance with limited behavior data. In short, HIM is a successful practice for our online campaign and can be an instructive recommendation solution for other similar newborn e-commerce businesses.
Table 4: Online A/B testing results in different Southeast Asian countries
Country | all IPV Gain | tailed user | body user | head user
ID | +7.2% | +6.7% | +7.5% | +7.2%
MY | +8.6% | +11.1% | +9.6% | +8.1%
TH | +7.5% | +7.3% | +8.1% | +7.5%
VN | +10.5% | +12.5% | +12.7% | +9.8%
REFERENCES
[1] Trapit Bansal, David Belanger, and Andrew McCallum. 2016. Ask the GRU: Multi-task learning for deep text recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 107–114.
[2] Alex Beutel, Ed H. Chi, Zhiyuan Cheng, Hubert Pham, and John Anderson. 2017. Beyond globally optimal: Focused learning for improved recommendations. In Proceedings of the 26th International Conference on World Wide Web. 203–212.
[3] Da Cao, Xiangnan He, Lianhai Miao, Yahui An, Chao Yang, and Richang Hong. 2018. Attentive group recommendation. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 645–654.
[4] Jia Chen, Jiaxin Mao, Yiqun Liu, Min Zhang, and Shaoping Ma. 2020. A Context-Aware Click Model for Web Search. In WSDM '20: The Thirteenth ACM International Conference on Web Search and Data Mining, Houston, TX, USA, February 3-7, 2020, James Caverlee, Xia (Ben) Hu, Mounia Lalmas, and Wei Wang (Eds.). ACM, 88–96.
[5] Jingyuan Chen, Hanwang Zhang, Xiangnan He, Liqiang Nie, Wei Liu, and Tat-Seng Chua. 2017. Attentive collaborative filtering: Multimedia recommendation with item- and component-level attention. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval.
[6] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 785–794.
[7] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 7–10.
[8] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014).
[9] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 191–198.
[10] Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the Fourth ACM Conference on Recommender Systems. 39–46.
[11] Wenjing Fu, Zhaohui Peng, Senzhang Wang, Yang Xu, and Jin Li. 2019. Deeply Fusing Reviews and Contents for Cold Start Users in Cross-Domain Recommendation Systems. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 94–101.
[12] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: A Factorization-Machine based Neural Network for CTR Prediction. In IJCAI. ijcai.org, 1725–1731.
[13] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 173–182.
[14] Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers, et al. 2014. Practical lessons from predicting clicks on ads at Facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising. 1–9.
[15] Xiangnan He, Hanwang Zhang, Min-Yen Kan, and Tat-Seng Chua. 2016. Fast matrix factorization for online recommendation with implicit feedback. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. 549–558.
[16] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2016. Session-based Recommendations with Recurrent Neural Networks. In ICLR (Poster).
[17] Guangneng Hu, Yu Zhang, and Qiang Yang. 2019. Transfer Meets Hybrid: A Synthetic Approach for Cross-Domain Collaborative Filtering with Text. In The World Wide Web Conference. ACM, 2822–2829.
[18] Guang-Neng Hu and Xin-Yu Dai. 2017. Integrating reviews into personalized ranking for cold start recommendation. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 708–720.
[19] Liang Hu, Jian Cao, Guandong Xu, Longbing Cao, Zhiping Gu, and Wei Cao. 2014. Deep modeling of group preferences for group-based recommendation. In Twenty-Eighth AAAI Conference on Artificial Intelligence.
[20] Mohit Iyyer, Anupam Guha, Snigdha Chaturvedi, Jordan Boyd-Graber, and Hal Daumé III. 2016. Feuding families and former friends: Unsupervised learning for dynamic fictional relationships. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 1534–1544.
[21] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR (Poster).
[22] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009), 30–37.
[23] Bin Liu, Ruiming Tang, Yingzhi Chen, Jinkai Yu, Huifeng Guo, and Yuzhou Zhang. 2019. Feature Generation by Convolutional Neural Network for Click-Through Rate Prediction. In The World Wide Web Conference. ACM, 1119–1129.
[24] Jingwei Ma, Jiahui Wen, Mingyang Zhong, Liangchen Liu, Chaojie Li, Weitong Chen, Yin Yang, Hongkui Tu, and Xue Li. 2019. DBRec: Dual-Bridging Recommendation via Discovering Latent Groups. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. ACM, 1513–1522.
[25] H. Brendan McMahan, Gary Holt, David Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, et al. 2013. Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1222–1230.
[26] Jianmo Ni, Jiacheng Li, and Julian McAuley. 2019. Justifying Recommendations using Distantly-Labeled Reviews and Fine-Grained Aspects. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 188–197.
[27] Qi Pi, Weijie Bian, Guorui Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Practice on Long Sequential User Behavior Modeling for Click-Through Rate Prediction. In KDD. ACM, 2671–2679.
[28] Yanru Qu, Han Cai, Kan Ren, Weinan Zhang, Yong Yu, Ying Wen, and Jun Wang. 2016. Product-based neural networks for user response prediction. In ICDM. IEEE, 1149–1154.
[29] Shunichi Seko, Takashi Yagi, Manabu Motegi, and Shinyo Muto. 2011. Group recommendation using feature space representing behavioral tendency and power balance among members. In Proceedings of the Fifth ACM Conference on Recommender Systems. ACM, 101–108.
[30] Ying Shan, T. Ryan Hoens, Jian Jiao, Haijing Wang, Dong Yu, and JC Mao. 2016. Deep crossing: Web-scale modeling without manually crafted combinatorial features. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 255–262.
[31] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-Attention with Relative Position Representations. In NAACL-HLT (2). Association for Computational Linguistics, 464–468.
[32] Shaoyun Shi, Weizhi Ma, Min Zhang, Yongfeng Zhang, Xinxing Yu, Houzhi Shan, Yiqun Liu, and Shaoping Ma. 2020. Beyond User Embedding Matrix: Learning to Hash for Modeling Large-Scale Users in Recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 319–328.
[33] Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, and Andrew Y. Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics.
[34] CIKM. ACM, 1441–1450.
[35] Lucas Vinh Tran, Tuan-Anh Nguyen Pham, Yi Tay, Yiding Liu, Gao Cong, and Xiaoli Li. 2019. Interact and decide: Medley of sub-attention networks for effective group recommendation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 255–264.
[36] Lei Zheng, Vahid Noroozi, and Philip S. Yu. 2017. Joint deep modeling of users and items using reviews for recommendation. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 425–434.
[37] Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Deep interest evolution network for click-through rate prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 5941–5948.
[38] Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.