On Variational Inference for User Modeling in Attribute-Driven Collaborative Filtering
Venugopal Mani, Ramasubramanian Balasubramanian, Sushant Kumar, Abhinav Mathur, Kannan Achan
Venugopal Mani*
Walmart Labs, Sunnyvale, CA, USA [email protected]
Ramasubramanian Balasubramanian*
Walmart Labs, Sunnyvale, CA, USA [email protected]
Sushant Kumar
Walmart Labs, Sunnyvale, CA, USA [email protected]
Abhinav Mathur
Walmart Labs, Sunnyvale, CA, USA [email protected]
Kannan Achan
Walmart Labs, Sunnyvale, CA, USA [email protected]
Abstract—Recommender Systems have become an integral part of online e-Commerce platforms, driving customer engagement and revenue. Most popular recommender systems attempt to learn from users' past engagement data to understand behavioral traits of users and use that to predict future behavior. In this work, we present an approach to use causal inference to learn user-attribute affinities through temporal contexts. We formulate this objective as a Probabilistic Machine Learning problem and apply a variational inference based method to estimate the model parameters. We demonstrate the performance of the proposed method on the next attribute prediction task on two real-world datasets and show that it outperforms standard baseline methods.
Index Terms—Recommender Systems, Variational Methods, Collaborative Filtering, Bayesian Statistics
I. INTRODUCTION
Recommender Systems have traditionally been studied from the lens of attempting to increase customer engagement by user modeling from past interactions. These interactions are often collected in terms of explicit user signals like ratings and item reviews. Recently, there has been a shift in the literature towards building recommenders by using implicit user signals like item views, item purchases, etc. Implicit signals, while useful in increasing the coverage of user signals over items, can suffer from a lack of definition of what constitutes a negative signal. This has led to a class of problems known as One Class Collaborative Filtering (OCCF) [1], where techniques like low rank approximation and negative sampling are used to improve user understanding by eliminating the ambiguity over the negative training samples.

A common assumption in implicit OCCF is that all positive signals are equal. However, this assumption can fail to capture the wide-ranging spectrum of user interactions in some domains. Normalization techniques do exist to scale these signals, but there is scope for more nuanced values for the positive samples. With modern data collection capabilities, domain-specific fine tuning of user interactions can be achieved to further our understanding of abstract concepts about users (like loyalty, satisfaction with the product, etc.). One such idea was introduced in [2], where the concept of long term customer satisfaction was defined through a function to track the continuous implicit signal of the user.

While implicit OCCF systems are quite effective, an often cited drawback of these systems for user modeling is their lack of sensitivity to temporally changing user behavior. The traits of a user from a few months prior need not necessarily model their present behavior. It follows logically to try to encode the temporal aspect of implicit signals into a user understanding objective and optimize over it.

* Both authors contributed equally to this work.
This is particularly relevant in the subset of attribute-driven collaborative filtering, as users tend to develop repeat patterns on certain attributes of an item. For example, in the domain of music recommendation, affinity to artists has been studied, and the recommendation of artists similar to the ones the user is loyal to has also seen improvements in results [3]. In this work, we try to further our understanding of users through the notion of temporal loyalty and integrate it into the attribute-driven collaborative filtering framework. We optimize the objective from two sources: the transaction matrix, which is a binary matrix that indicates past interaction (or lack thereof) of the involved user with the item's attribute, as well as a temporal loyalty matrix, which attempts to capture drifting user loyalty over time.

The contributions of this work are as follows: first, we model the temporal loyalty of the users to augment the transaction matrix. Then, we demonstrate that optimization using variational inference over these matrices outperforms plain collaborative filtering based methods on the next attribute prediction task, thus leading to a better understanding of user preferences. The rest of the work is organized as follows: Section II delves into the literature of related work, Section III describes our proposed system model, Section IV describes our experiments on two real-world datasets, Section V analyzes the results, and Section VI concludes the work and describes possible future directions.
II. RELATED WORK
The idea of optimizing over two matrices for modeling user preferences is relatively new. There is active research around the kind of domain-specific objectives to be optimized for and the corresponding data that could be augmented. The authors of [2] consider measures of satisfaction with the purchased items, such as the amount of time spent playing a game or the number of times a particular artist was heard. Other works focus on tasks like using dwell time in session-based recommendations [4], [5] or to enrich the user-item matrix [6], leveraging implicit signals such as internet browsing logs [7], etc. On the other hand, several works leverage the binary transaction matrix to tackle the well-known top-k recommendation problem in large-scale datasets, such as those dealing with memory-based collaborative filtering for explicit feedback [8], item-based collaborative filtering to address scalability concerns [9], using stratified SGD to deal with large-scale matrix factorization [10], etc. However, these rely solely on explicit user signals and fail to incorporate any temporal signals.

In our work, we use the temporal loyalty to an item attribute as the second matrix, thus leveraging both explicit and temporal signals to set up an optimization over the two matrices. The use of loyalty is motivated by works such as [11], where the authors model consumers' repeat purchase behavior, as well as our experience in the domain of e-Commerce and grocery.
Attribute-based collaborative filtering has been explored before in works such as [12], where the authors use categorical attributes to improve recommendation through multi-task learning or hierarchical classification, and [13], which deals with attribute-aware collaborative filtering. Our work captures the changing affinity of the users to these attributes, and thus could be used as a first stage in hierarchical classification algorithms: to predict which brands the users will buy next, before recommending particular items of that brand.

In terms of the application of variational inference and Bayesian statistics to solve collaborative filtering problems, most works focus on the use of Variational Auto-Encoders (VAEs). For example, [14] introduces a generative model with a multinomial likelihood and uses Bayesian inference for parameter estimation, and [15] uses a VAE to alleviate the problems of poor robustness and over-fitting caused by large-scale data. Other works using Bayesian inference, such as [16], which presents a scalable inference method for Variational Bayesian matrix factorization with side information, or [17], which proposes a distributed memo-free variational inference method for large-scale matrix factorization problems, address some of the well-known shortcomings of the same in recommender systems. The experimental framework that we have adopted is called Box's Loop [18], which is used to uncover patterns from the conditional distribution of a latent variable model and use them to model the data and make predictions.
This was well suited for our problem, since we assume a particular structure of the latent variables and use that to explain the data, and then use those variables for future recommendations. Related works in the sub-field of latent variable modeling for recommender systems include [19], which uses blind regression to complete the partial user-item interaction matrix and uses the features of the users and items as the latent variables, and [20], which presents a Bayesian latent variable model for rating prediction that models ratings over each user's latent interests and each item's latent topics. Embedding based approaches to model the users and items have also been tried, in works such as [21], which replaces the inner product between user and item latent features used in classic matrix factorization by a neural architecture, and [22], which couples deep feature learning and deep interaction modeling with a rating matrix to improve recommendation performance. Our work fits the optimization task into this framework and generates user and item attribute embeddings that explain the data well by applying variational inference, and the user representations thus obtained help us develop a better understanding of user preferences.

III. SYSTEM MODEL
A. The Top-k Attribute Recommendation Problem
The classic top-k recommendation problem can be defined as follows: given a catalog of items C containing items i_1, i_2, ..., i_n and an item a ∈ C, henceforth referred to as the anchor item, find a ranked list of distinct items i_1, i_2, ..., i_k ∈ C to be offered alongside the anchor item, such that the user of an e-Commerce platform is most likely to engage with them. This engagement of users can be defined by a variety of metrics (in our case, the future purchase of the item).

A sub-problem of the top-k recommendation problem is the attribute recommendation problem. Rather than recommending the items most likely to cause the next engagement, the task is to predict on the attribute. For instance, in the space of e-Commerce, the attribute of interest could be the brand of the item. Given a user u, the task would be to predict the k brands that they are most likely to engage with. Attribute recommendation is a slightly more well-defined space as, particularly with certain attributes like brands, users are very likely to develop behaviors like loyalty towards certain brands. While in the item space, repurchasing of an item is a commonly seen behavior, especially in domains like grocery, re-engagement with attributes is a much more commonly exhibited behavior in a wide variety of recommender system domains (songs by the same artist, books by the same author, movies starring the same actor), and solving it can hence involve harvesting this richer behavioral signal.

B. The Temporal Loyalty Matrix
Most recommender systems dealing with grocery data focus on the binary transaction matrix. Given a set of users u_1, u_2, ..., u_m and a set of items i_1, i_2, ..., i_n, the (p, q)-th element of the transaction matrix is 1 if u_p has purchased i_q within the training window, and 0 otherwise. Oftentimes, user-based collaborative filtering algorithms that use such matrices fail to pick up important signals such as a user's loyalty to a brand, how their behavior changes dynamically with time, etc. In this work, we try to capture those signals by optimizing over a temporal loyalty matrix, L, in addition to the binary transaction matrix, T. Since we are dealing with an attribute recommendation problem, consider a set of users u_1, u_2, ..., u_m and an associated set of attribute values v_1, v_2, ..., v_n for a particular attribute (say, brand). The (p, q)-th element of the transaction matrix is

    T_pq = 1,  if u_p has bought an item with attribute value v_q at least once in the training window,
           0,  otherwise.                                                                            (1)

The (p, q)-th element of the temporal loyalty matrix is the time-decayed sum of all the purchases of attribute value v_q made by user u_p, that is,

    L_pq = Σ_{t = t_1}^{t_k} (t − t_start) / (t_end − t_start),  if u_p has bought an item with attribute value v_q k (≥ 1) times in the training window,
           0,  otherwise.                                                                            (2)

The variables t_start and t_end in Equation 2 represent the start and end times of the training window, and t_1, t_2, ..., t_k are the time instances when the user purchased items with that attribute value. For instance, a particular brand of beer purchased over a year ago should not get the same weight as one purchased a week ago, as user preferences might have changed. We further show that this framework works well for recommender systems that have some notion of loyalty/preference, such as readers' predilection for certain authors. This optimization [2] balances data from both the transaction matrix and the temporal loyalty matrix.

C. The Bayesian Framework
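As a concrete illustration, both matrices of Equations 1 and 2 can be assembled in one pass over a purchase log; the function name and the (user, attribute, timestamp) triple layout below are assumptions made for this sketch, not the paper's actual pipeline.

```python
import numpy as np

def build_matrices(purchases, n_users, n_attrs, t_start, t_end):
    """Build the binary transaction matrix T (Equation 1) and the
    time-decayed temporal loyalty matrix L (Equation 2).

    `purchases` is an iterable of (user_idx, attr_idx, t) triples,
    with t_start <= t <= t_end.
    """
    T = np.zeros((n_users, n_attrs))
    L = np.zeros((n_users, n_attrs))
    for p, q, t in purchases:
        T[p, q] = 1.0                                 # bought at least once
        L[p, q] += (t - t_start) / (t_end - t_start)  # time-decayed weight
    return T, L
```

A purchase at the very start of the window contributes 0 to L, one at the very end contributes 1, so recent engagement dominates the loyalty score.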
Probabilistic Machine Learning (PML) is a sub-field of Machine Learning where domain knowledge and assumptions about the hidden structure of data are leveraged to explain the observed data. PML models large, interesting, and interconnected datasets at scale.

The iterative probabilistic pipeline, coined Box's Loop by [18], lists the steps of modeling a PML pipeline as positing a model with assumptions about the hidden structure of data, inferring the hidden variables, and criticizing the model (the evaluation step). If the evaluation does not meet the required standard, the values of the hidden variables are revised to better explain the data at hand.

The structure of collaborative filtering by matrix factorization lends itself effectively to Box's Loop. The entries of the transaction matrix and the temporal loyalty matrix constitute the observed data. The classic matrix factorization problem involves the decomposition of a given matrix M into latent factor matrices U and V along with their respective biases B_u and B_v. The matrices U and V are known as embedding matrices in the setting of collaborative filtering. The learning task then becomes to learn the embeddings and biases such that their probability, given the observed transaction and temporal loyalty matrices, is maximized. This is also known as the posterior distribution.

D. The Posterior Distribution
We chose to model both the matrices, T and L, as well as the priors, with normal distributions. This is logical because the values in the L matrix are continuous, and distributions from the exponential family have shown good results in the literature [2]. Also, the normal family of distributions is conjugate to itself (or self-conjugate) with respect to a normal likelihood function, and conjugacy has desirable properties, such as yielding a closed-form expression for the posterior.

Fig. 1 is a probabilistic graphical representation of our latent variable model. It shows how the random variables depend on each other in our generative process. Thus, it helps us form the posterior by connecting the assumptions that we made about the data to the model. The components of this graphical model are the ones used in standard graphical models in the field of machine learning, such as [18]: the nodes represent random variables, the edges represent a dependence between the nodes that they connect, and the plates denote replication. Each entry in the transaction matrix, T_pq, and the temporal loyalty matrix, L_pq, depends only on its local variables u_p, bu_p, v_q, and bv_q, which are the embedding and bias vectors of the p-th user and the embedding and bias vectors of the q-th attribute respectively, and the corresponding global variables (κ and ψ), as is the case with conditionally conjugate models. κ and ψ are the scale and location parameters, which allow the distributions of T and L to have different dynamic ranges and be centered around different means, despite sharing some parameters which model the positive correlation between the transactions and the temporal loyalty scores, as seen in works like [2].

As mentioned earlier, all these variables have normal priors with mean 0 (except the scale parameters, which have a mean of 1) and a variance that depends on a hyperparameter, denoted by α with a subscript corresponding to the variable, as seen in Equation 3.
Utilising a modeling strategy similar to [2], we write out the posterior in Equation 3 as being proportional to the product of the likelihoods and the priors. H is the set of all hyperparameters: those represented by solid black circles in Fig. 1, γ, and β. γ allows us to control how much importance we give to the two likelihoods relative to each other, and β is used to model the variance. The matrices T and L constitute the observed data, represented by grey circles in Fig. 1. θ is the set of latent variables: the user and item embeddings and biases, and the location and scale parameters, represented by white circles in Fig. 1. Each user and item vector is of dimension d.

    P(θ | T, L, H) ∝ P(T, L | θ, H) P(θ | H)
    = Π_{(p,q,T_pq) ∈ T} P(T_pq | u_p, v_q, bu_p, bv_q, κ_t, ψ_t, H) · Π_{(p,q,L_pq) ∈ L} P(L_pq | u_p, v_q, bu_p, bv_q, κ_l, ψ_l, H)
      · Π_{p=1}^{m} [P(u_p | α_u) P(bu_p | α_bu)] · Π_{q=1}^{n} [P(v_q | α_v) P(bv_q | α_bv)]
      · P(κ_t | α_κt) P(ψ_t | α_ψt) P(κ_l | α_κl) P(ψ_l | α_ψl)
    = Π_{(p,q,T_pq) ∈ T} N(κ_t (u_p^T v_q + bu_p + bv_q) + ψ_t, (γβ)^{-1})
      · Π_{(p,q,L_pq) ∈ L} N(κ_l (u_p^T v_q + bu_p + bv_q) + ψ_l, ((1 − γ)β)^{-1})
      · Π_{p=1}^{m} [N(0, α_u^{-1} I_d) N(0, α_bu^{-1})] · Π_{q=1}^{n} [N(0, α_v^{-1} I_d) N(0, α_bv^{-1})]
      · N(1, α_κt^{-1}) N(0, α_ψt^{-1}) N(1, α_κl^{-1}) N(0, α_ψl^{-1})                              (3)

E. Variational Inference
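To make the likelihood terms of Equation 3 concrete, the following sketch samples one (T_pq, L_pq) pair from the generative process; all α hyperparameters are set to 1 and the constants d, γ, β are chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, gamma, beta = 8, 0.5, 4.0

# Latent variables drawn from their normal priors (Equation 3); the scale
# parameters kappa have prior mean 1, everything else has prior mean 0.
u, bu = rng.normal(0, 1, d), rng.normal()        # user embedding and bias
v, bv = rng.normal(0, 1, d), rng.normal()        # attribute embedding and bias
kappa_t, psi_t = rng.normal(1, 1), rng.normal()  # scale/location for T
kappa_l, psi_l = rng.normal(1, 1), rng.normal()  # scale/location for L

# One (p, q) entry of each matrix: mean kappa * (u.v + bu + bv) + psi,
# variance (gamma * beta)^-1 for T and ((1 - gamma) * beta)^-1 for L.
mean = u @ v + bu + bv
T_pq = rng.normal(kappa_t * mean + psi_t, (gamma * beta) ** -0.5)
L_pq = rng.normal(kappa_l * mean + psi_l, ((1 - gamma) * beta) ** -0.5)
```

Note that γ trades off the two likelihood variances: a larger γ makes the transaction likelihood tighter and the loyalty likelihood looser.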
The posterior P(θ | T, L, H) from Equation 3 solves for the family of high dimensional latent variables θ, given the initial prior distributions over the latent variables and a likelihood function P(T, L | θ, H) that we posit about the model. The direct application of Bayes' theorem has the problem of an intractable high dimensional integration in the denominator, and hence an approximate Bayesian inference of the posterior is carried out. One of the most popular methods for approximate Bayesian inference is Markov Chain Monte Carlo (MCMC). However, despite the high accuracy of MCMC, the scale of our problem means that it is virtually impossible to use due to the computational time involved.

Instead, a scalable approach is to treat the posterior approximation as an optimization problem through variational inference [23], [24]. The objective of variational inference is to find the variational distribution, a proxy-posterior q parametrized by ν, such that the variational distribution is least divergent from the true posterior p. We adopted the widely used Kullback-Leibler divergence (KL divergence) as the divergence metric between the two distributions. The KL divergence term is intractable, and the equivalent of minimizing the KL divergence is the maximization of the Evidence Lower Bound (ELBO) [25]. The ELBO, L(ν), is described in Equation 4.

    L(ν) = E_q[log p(T, L | θ)] − KL(q(θ; ν) || p(θ))                                               (4)

The terms here provide the classical Bayesian trade-off between the log likelihood of the data and the prior over the parameters of the model. That is, the first term tries to maximize the likelihood of the observed transactions and the temporal loyalty scores, given the embedding vectors. The second term is the KL divergence between the variational distribution and the prior over the embedding vectors. The second term effectively acts as a regularizer, as it tries to minimize the divergence from the prior and hence prevents the optimizer from converging to the maximum likelihood estimate.

Stochastic gradient descent is the commonly used approach to optimize the ELBO objective. Some works [2] also use coordinate descent and other variants of gradient descent to compute the gradients and update the parameters. The gradients can be obtained by rewriting Equation 4 in terms of the complete log likelihood and then computing the gradient, as shown in Equation 5.

    ∇_ν L(ν) = ∇_ν E_q[log p(T, L, θ) − log q(θ; ν)]                                                (5)

In this work, we use score function gradient estimators [26], [27], by rewriting Equation 5 as

    ∇_ν L(ν) = E_q[∇_ν log q(θ; ν) (log p(T, L, θ) − log q(θ; ν))]                                   (6)
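The score function estimator of Equation 6 only needs samples from q and the gradient of log q; a one-dimensional toy version (the variational family N(μ, 1) and the target N(2, 1) are assumptions for this sketch) shows the mechanics:

```python
import numpy as np

def score_gradient(log_p, mu, n_samples=20000, rng=None):
    """Monte Carlo score-function estimate of d/dmu E_q[log p(x) - log q(x)]
    for a toy variational family q(x) = N(mu, 1), mirroring Equation 6."""
    rng = rng or np.random.default_rng(0)
    x = rng.normal(mu, 1.0, n_samples)                     # samples from q
    score = x - mu                                         # d/dmu log q(x)
    log_q = -0.5 * (x - mu) ** 2 - 0.5 * np.log(2 * np.pi)
    return np.mean(score * (log_p(x) - log_q))

# Toy target p(x) = N(2, 1): the ELBO is maximized at mu = 2, so the
# estimated gradient at mu = 0 should point toward 2 (roughly +2 here).
log_p = lambda x: -0.5 * (x - 2.0) ** 2 - 0.5 * np.log(2 * np.pi)
g = score_gradient(log_p, mu=0.0)
```

The estimator is unbiased but high-variance, which is why many samples (or variance-reduction tricks) are needed in practice.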
Once the variational distribution is approximated, predictions are made from the posterior predictive function. The posterior predictive function uses the likelihood function from Equation 3, P(T, L | θ, H), to generate the predictions. After the estimation of the latent variables θ, the values of the transaction entry T_pq and temporal loyalty entry L_pq for user p and attribute q are estimated from the distributions N(κ_t (u_p^T v_q + bu_p + bv_q) + ψ_t, (γβ)^{-1}) and N(κ_l (u_p^T v_q + bu_p + bv_q) + ψ_l, ((1 − γ)β)^{-1}) respectively. Once the two values are determined, a simple addition of the two values gives the overall score for that particular user-attribute pair.

IV. EXPERIMENTS
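Since the means of the two predictive normals are available in closed form, a minimal scoring sketch might take each distribution's mean rather than drawing samples; the helper name and argument layout are illustrative, not the paper's code.

```python
import numpy as np

def predict_score(u, v, bu, bv, kappa_t, psi_t, kappa_l, psi_l):
    """Overall score for one user-attribute pair: the sum of the means of
    the transaction and temporal loyalty predictive distributions."""
    base = u @ v + bu + bv
    t_hat = kappa_t * base + psi_t   # mean of the transaction likelihood
    l_hat = kappa_l * base + psi_l   # mean of the loyalty likelihood
    return t_hat + l_hat
```

Ranking a user's candidate attributes by this score then yields the top-k attribute recommendations.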
A. Datasets and preprocessing
We demonstrate our results on two datasets from different domains. The first is a private dataset from a large-scale e-Commerce company. We collected six months' worth of grocery transaction data. From the transaction metadata, the customer id, the brand, the transaction date, and the event epoch (the exact epoch at which the transaction happened) were chosen. Since we are trying to understand and model loyalty, we decided to filter out customers that didn't have more than a threshold number of items in their basket, thus keeping only engaged customers in the dataset. From experience, we have seen that there are a few large merchants and clients, such as grocery stores, that place bulk orders. These would not be representative of a single customer, and hence we decided to filter out those customers that had more than an upper threshold of transactions as well. We used the first five months of data as our train set and the final month as the test set. We also filtered out customers that weren't present in both the train and the test set, since we would not have embeddings for customers we have not seen. The other problem that both the baseline models and our model suffered from was the introduction of new brands during the test period, which happens due to changes in consumer demand, seasonality effects, etc. We removed the brands not present in both the train and the test sets as well. This resulted in a dataset containing 180,000 customers and 11,200 brands.

[Fig. 1: A graphical representation of the proposed latent variable model]

The second dataset is the publicly available Goodreads Book Reviews dataset. This contains ratings, reviews, and a lot of other attributes of the items and the users, such as user-book interactions, metadata of the books, etc., collected in 2017 by scraping users' public shelves on Goodreads [28], [29].
The group that collected the dataset recommends using a subset (by genre) of the dataset, as the entire dataset is very large. In keeping with our theme of loyalty, we decided to go ahead with the 'fantasy and paranormal' genre. Here, we are trying to assess readers' loyalty to authors, and this genre had a high density of such interactions, as was expected due to the presence of sequels, authors that write multiple books with similar themes, etc. The relevant columns in this case were the author, the user id, and the time when the book was marked as read. Even though we had information about the time the book was shelved, we felt that that would be a weaker (albeit denser) signal, similar to adding an item to the cart in the grocery world, and hence decided to go ahead with the time the book was marked 'read', which is analogous to a transaction. We again filtered the data in a fashion similar to the one described for the first dataset, and ended up with 150,000 users and 11,400 authors. One thing to note is that the interactions in this dataset weren't as dense as in the first dataset.
B. Comparison methods
To compare our model, we chose the standard baselines in the literature [28]: the Popularity Model (Pop) and the classic Matrix Factorization model (MF). The popularity model captures the popularity of attributes across each customer and recommends the most popular attributes for them. The second model is a standard implicit OCCF Matrix Factorization. The observed transaction data is used to learn latent factors for the users and the attributes and to predict user-attribute interactions. This was done using the standard Alternating Least Squares (ALS) optimization.

We studied these under two settings: the first setting is a more realistic setting with the notion of explore-exploit (EE) built into the recommender system. Most real-life recommenders employ a strategy to diversify their recommendations in the hope of increasing the exposure of items which do not have much user interaction. The second setting removes the explore-exploit strategy from the two baselines to give a sterner test to our model. We also included a weighting to favor attributes that the user has prior interactions with. The baseline models and our model are compared across the metrics that are described in subsection IV-C.
C. Evaluation metrics
The ground truth dataset was the list of brands bought by the users in the grocery dataset, or the list of authors whose books were read by the readers in the Goodreads dataset, in the test window. The predictions from the model were a list of brands/authors, ordered by the probability that the given user would buy/read the given brand/author in the test window. We compare our model with the baseline models on five different evaluation metrics, most of them well-known in the collaborative filtering literature [30]–[34].
1) NDCG@k:
As is known, DCG works on the idea that highly relevant entries appearing lower in the predictions list returned by the models should be penalized. In our case, the relevance for a brand/author is 1 if it appears in the top k predictions for a user and is present in the ground truth, and 0 otherwise. Ideal DCG (IDCG) is used to normalize this score to account for the varying lengths of the recommendation lists returned for different users. Finally, we take a mean of the NDCG values over all the queries, which are the users in the test set, to get a measure of the performance. In the following formulae, rel_i denotes the relevance of the entry at the i-th position in the predictions list returned by the models.

    DCG_k = Σ_{i=1}^{k} rel_i / log(i + 1),    NDCG_k = DCG_k / IDCG_k

2) MAP@k: The area under the precision-recall curve, which is obtained by plotting the precision and recall at every position in a ranked list of predictions, is called the average precision. The mean of the average precision scores over a set of queries, i.e., users, gives the MAP. In the following formulae, AP is the average precision, MAP is the mean average precision, P(i) is the precision at position i, Δr(i) is the change in recall from position i−1 to i, and |U| is the number of users in the test set.

    AP = Σ_{i=1}^{k} P(i) Δr(i),    MAP = (Σ_{j=1}^{|U|} AP_j) / |U|
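A minimal sketch of the NDCG@k computation for one user (binary relevance, base-2 logarithms, and a hypothetical `ndcg_at_k` helper, all assumptions of this sketch) could be:

```python
import numpy as np

def ndcg_at_k(predictions, ground_truth, k):
    """NDCG@k with binary relevance: rel_i = 1 when the i-th predicted
    attribute appears in the user's ground-truth set."""
    rel = [1.0 if p in ground_truth else 0.0 for p in predictions[:k]]
    dcg = sum(r / np.log2(i + 2) for i, r in enumerate(rel))  # i is 0-based
    ideal = min(len(ground_truth), k)       # best case: all hits at the top
    idcg = sum(1.0 / np.log2(i + 2) for i in range(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

Averaging this quantity over all test users gives the reported NDCG@k.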
3) Hit Rate@k:
Essentially the true positive rate, where a true positive is a brand/author predicted in the top k that is present in the ground truth for that query (user). We take a mean of the hit rate values over all the queries (users) and report that in Section V.

    Hit rate = (Number of True Positives) / (Number of Positives)
4) MRR@k:
The reciprocal rank of a query response, i.e., the predictions for a user, is the inverse of the position of the first item in the predictions list that is present in the ground truth for that query. Here, we consider only the first k predictions, and average the reciprocal ranks over all the users. In the following formula, pos_i represents the position of the first prediction for the i-th user that is present in the ground truth list for that user.

    MRR = (1/|U|) Σ_{i=1}^{|U|} 1 / pos_i
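A minimal sketch (a hypothetical `mrr_at_k` helper; a user with no hit in the top k contributes zero, one common convention assumed here) could be:

```python
def mrr_at_k(all_predictions, all_truths, k):
    """Mean reciprocal rank over users: 1/pos of the first top-k prediction
    found in that user's ground-truth set."""
    total = 0.0
    for preds, truth in zip(all_predictions, all_truths):
        for pos, p in enumerate(preds[:k], start=1):
            if p in truth:          # first hit determines the rank
                total += 1.0 / pos
                break
    return total / len(all_predictions)
```

Because only the first hit matters, MRR@k is constant in k once the first hit falls inside the list, which is consistent with the identical MRR rows across k values in the results tables.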
5) Limited AUC@k:
The ROC curve is a plot of the true positive rate against the false positive rate at various threshold values, and the general objective in recommender systems is to maximize the area under the ROC curve. But in most such settings, the entries at the top of a list are more impactful than those at the bottom, whereas AUC is equally affected by swaps at different places in the returned list. To address this, we use limited AUC [35], which is the area under the part of the curve formed by the top k recommendations. This assumes that all the other relevant recommendations (apart from the top k) are distributed uniformly over the rest of the ranking list until all entries are retrieved. Thus, a straight line is drawn between the end point of the curve formed by these k recommendations and (1, 1), the upper-right point of any ROC curve, and the area thus obtained is measured. This addresses some of the issues mentioned before, since swaps below the top k don't affect the AUC. It also has a few other good properties: a top-k list that contains more relevant entries will yield a higher AUC score, with the order mattering if the length of the list is close to the total number of brands/authors. We take a mean over all the queries (users) to get a mean LAUC.
D. Implementation details
To generate the baselines, we used Turicreate [36], an open source toolkit for generating core machine learning models, including recommenders. For the popularity recommender, we used the popularity recommender class, and for the Matrix Factorization based model, we utilized the factorization recommender class. K-fold cross validation was performed on both classes of models using the in-built capability to tune the models, and the best model of each was finally selected for comparison.

For our model, we wrote a custom training loop and used Edward2 [37], [38] to do black-box variational inference [26]. Edward2 is a low-level language for specifying probabilistic models as programs and performing computations. We fed the models/distributions as functions whose inputs were the random variables that we were conditioning on and whose outputs were the random variables that the probabilistic program was over. In the training loop, we first computed the log likelihood using samples from the variational distribution. We used Edward's and TensorFlow's tracing functionalities (in steps 8 and 12, Algorithm 1) to record the model's computations for automatic differentiation. We then computed the KL divergence between the variational distribution and the prior distribution using the attributes of TensorFlow's distributions, and combined that with the log likelihood obtained from the posterior predictive function to get the ELBO. We tried different optimizers, learning schedules, and hyperparameter settings. A pseudocode of Edward's custom training loop adapted to our problem setting is presented in Algorithm 1. This loop is called a certain number of times (to ensure convergence) for each batch in each epoch, and the values of the variational parameters used to build the variational distribution (step 2, Algorithm 1) are the updated values (step 13, Algorithm 1) from the previous run.
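The structure of this training loop (build q, sample, score the batch likelihood, add the analytic KL, ascend the ELBO) can be sketched as a runnable miniature. This sketch is an illustration only: it uses toy data, a single Gaussian likelihood instead of two, variational variances held fixed at q_std², and reparameterized rather than score-function gradients for numerical stability; all names and constants are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy observed matrix (4 users x 3 attributes) standing in for a batch of T.
T = np.outer([1., 2., -1., .5], [1., -1., 2.]) + rng.normal(0, .05, (4, 3))
m, n, d, beta, lr, q_std = 4, 3, 2, 10.0, 5e-3, 0.1

# Variational means for the user/attribute embeddings; only means are trained.
qU, qV = rng.normal(0, .1, (m, d)), rng.normal(0, .1, (n, d))

def training_step(qU, qV, n_samples=8):
    """One pass of the loop: sample from q, estimate the log likelihood and
    its gradient, add the analytic KL to the N(0, 1) prior, form the ELBO."""
    gU, gV, ll = np.zeros_like(qU), np.zeros_like(qV), 0.0
    for _ in range(n_samples):
        U = qU + q_std * rng.normal(size=qU.shape)   # reparameterized sample
        V = qV + q_std * rng.normal(size=qV.shape)
        R = T - U @ V.T                              # batch residual
        ll += -0.5 * beta * np.sum(R ** 2)           # Gaussian log likelihood
        gU += beta * R @ V                           # d(log-lik)/dU
        gV += beta * R.T @ U                         # d(log-lik)/dV
    k = (m + n) * d                                  # latent coordinate count
    kl = 0.5 * (np.sum(qU**2) + np.sum(qV**2)
                + k * (q_std**2 - 1 - 2 * np.log(q_std)))
    elbo = ll / n_samples - kl
    return elbo, gU / n_samples - qU, gV / n_samples - qV  # d(ELBO)/d(means)

elbo_start = training_step(qU, qV)[0]
for _ in range(300):
    _, gU, gV = training_step(qU, qV)
    qU += lr * gU                                    # gradient ascent on ELBO
    qV += lr * gV
elbo_end = training_step(qU, qV)[0]                  # should have increased
```

In the paper's actual setup, the sampling, likelihood, KL, and gradient steps are handled by Edward2 and TensorFlow's tracing and automatic differentiation rather than written out by hand as above.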
Algorithm 1 Variational Inference Training Loop

INPUT: Batch from Transaction matrix T_b, batch from Temporal Loyalty matrix L_b, Transaction matrix T, Temporal Loyalty matrix L, set of hyperparameters H, set of latent variables θ, set of prior variables {u, v, bu, bv, κ_t, ψ_t, κ_l, ψ_l}

1:  procedure CUSTOMTRAININGLOOP(T_b, L_b)
2:    variational family, trainable parameters ← Build variational distribution
3:    qu, qv, qbu, qbv, qκ_t, qψ_t, qκ_l, qψ_l ← Sample posterior variables from the variational family
4:    PP_T, PP_L ← Obtain posterior predictive functions P(T | θ, H) and P(L | θ, H) from Equation 3 by setting prior variables to the sampled posterior values
5:    LL_Tb, LL_Lb ← Compute the log likelihood of T_b and L_b from PP_T and PP_L respectively
6:    Initialize KL ← 0
7:    for (prior variable, variational variable) in [(u, qu), (v, qv), (bu, qbu), (bv, qbv), (κ_t, qκ_t), (ψ_t, qψ_t), (κ_l, qκ_l), (ψ_l, qψ_l)] do
8:      KL ← KL + KL divergence between the distributions of the variational variable and the prior variable
9:    end for
10:   ELBO ← Compute ELBO using KL, LL_Tb, and LL_Lb from Equation 4
11:   Loss ← −ELBO
12:   Get the gradients using the loss and the trainable parameters obtained
13:   Update the parameter values
14: end procedure

Metric   k    Pop+EE  MF+EE  Pop    MF     VI-MF
NDCG     @5   0.047   0.054  0.144  0.210  0.212
         @10  0.031   0.036  0.096  0.140  0.141
         @15  0.026   0.031  0.081  0.118  0.120
         @20  0.025   0.028  0.077  0.112  0.114
MAP      @5   0.016   0.021  0.049  0.098  0.099
         @10  0.008   0.011  0.026  0.051  0.053
         @15  0.006   0.008  0.020  0.040  0.040
         @20  0.006   0.007  0.018  0.037  0.038
HR       @5   0.064   0.064  0.197  0.196  0.198
         @10  0.033   0.033  0.101  0.101  0.102
         @15  0.024   0.025  0.075  0.075  0.076
         @20  0.022   0.022  0.067  0.066  0.068
MRR      @5   0.080   0.108  0.246  0.491  0.492
         @10  0.080   0.108  0.246  0.491  0.492
         @15  0.080   0.108  0.246  0.491  0.492
         @20  0.080   0.108  0.246  0.491  0.492
LAUC     @5   0.532   0.532  0.598  0.598  0.599
         @10  0.516   0.516  0.551  0.551  0.552
         @15  0.512   0.512  0.539  0.540  0.540
         @20  0.511   0.511  0.536  0.537  0.537
Table I: Comparison of evaluation metrics across models one-Commerce grocery data
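As a reference for how the reported ranking metrics are computed, here is a small self-contained sketch of NDCG@k (DCG normalized by the ideal DCG) and MRR; the relevance lists are hypothetical and not drawn from either dataset.

```python
import math

def dcg_at_k(relevances, k):
    # DCG@k = sum over positions i = 1..k of rel_i / log2(i + 1)
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    # Normalize by the ideal DCG (IDCG): the same list sorted by relevance.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def mrr(ranked_hits):
    # Reciprocal rank of the first relevant item, averaged over users.
    total = 0.0
    for hits in ranked_hits:
        for i, h in enumerate(hits, start=1):
            if h:
                total += 1.0 / i
                break
    return total / len(ranked_hits)

rels = [1, 0, 1, 0, 0]  # binary relevance of a hypothetical top-5 list
print(round(ndcg_at_k(rels, 5), 3))
print(mrr([[0, 1, 0], [1, 0, 0]]))
```

Note that MRR only depends on the position of the first hit, which is why the MRR rows in the tables are nearly constant across @5 through @20.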
Metric  Rank  Pop+EE  MF+EE  Pop    MF     VI-MF
NDCG    @5    0.020   0.027  0.047  0.068  0.069
        @10   0.014   0.019  0.034  0.049  0.051
        @15   0.012   0.016  0.030  0.043  0.044
        @20   0.010   0.015  0.028  0.040  0.041
MAP     @5    0.007   0.013  0.017  0.033  0.034
        @10   0.004   0.008  0.011  0.021  0.022
        @15   0.003   0.006  0.009  0.018  0.018
        @20   0.002   0.005  0.008  0.016  0.017
HR      @5    0.031   0.028  0.070  0.069  0.071
        @10   0.018   0.017  0.041  0.041  0.042
        @15   0.013   0.014  0.031  0.031  0.033
        @20   0.011   0.012  0.026  0.027  0.028
MRR     @5    0.033   0.061  0.075  0.148  0.150
        @10   0.034   0.061  0.075  0.148  0.151
        @15   0.034   0.061  0.075  0.149  0.151
        @20   0.034   0.061  0.076  0.149  0.152
LAUC    @5    0.514   0.512  0.533  0.533  0.534
        @10   0.508   0.507  0.521  0.521  0.522
        @15   0.506   0.505  0.517  0.517  0.518
        @20   0.505   0.505  0.516  0.516  0.516

Table II: Comparison of evaluation metrics across models on Goodreads data

V. RESULTS AND ANALYSIS
The results on the e-Commerce data and the open-source Goodreads data are presented in Table I and Table II respectively. The metrics for our model are shown in the final column, titled VI-MF (Variational Inference Matrix Factorization).

The first two baselines, which use the explore-exploit strategy, trade off accuracy for diversity and hence do not perform as well as the other models. In both settings, with and without explore-exploit, the MF-based models outperform the popularity models: the popularity models simply recommend attributes of the items that the user has bought most in the past, whereas the latent factors capture user affinities well because they learn better representations from the interactions.

Our model shows a clear 1 to 3 percent increase in all metrics across the various ranks compared to the best performing baseline (the classic Matrix Factorization) on both the e-Commerce grocery dataset and the Goodreads dataset. Given the size and scale of the datasets, these gains are significant. Quantitatively, in the e-Commerce setting, for a business with tens of billions of dollars in revenue, a 1 to 3 percent increase translates to hundreds of millions of dollars. This indicates that incorporating temporal loyalty leads to a better understanding of user preferences, which in turn affects the prediction of user behavior and, subsequently, revenue.

Overall, the metrics on the e-Commerce grocery data are higher than those on the Goodreads data. This can be explained by the higher density of the grocery data, which leads to stronger user affinities to attributes. Interestingly, the trends across the models hold across both the grocery domain and the 'Fantasy and Paranormal' genre. In other words, the notion of brand loyalty in grocery appears similar to the notion of author loyalty in the 'Fantasy and Paranormal' genre of books.

VI. CONCLUSION AND FUTURE WORK
In this work, we leverage a customer's temporal loyalty to an item attribute, in addition to their engagement behavior, to model their preferences and subsequently tackle the top-k attribute recommendation problem. We model this as an optimization problem over two matrices and use the Box's Loop framework and variational inference to estimate the parameter values and train the user embeddings that best explain the observed explicit and temporal signals. We demonstrate the effectiveness of the learnt user embeddings by showing that the proposed approach outperforms standard baselines for this task on a private e-Commerce grocery dataset as well as the publicly available Goodreads dataset, which also supports the hypothesis that capturing a customer's temporally changing interests can lead to better recommendations.

In terms of future directions, one could explore other ways to derive the loyalty scores in the Temporal Loyalty matrix, L. Some works, such as [39], also focus on enriching the transaction matrix, T, to address issues that arise due to sparsity; we plan to investigate coupling those with our current approach. Another direction would be to model the priors and the likelihoods with other distributions, informed by domain knowledge and the type of data at hand.

II. ABBREVIATIONS AND ACRONYMS
ALS: Alternating Least Squares
AP: Average Precision
DCG: Discounted Cumulative Gain
EE: Explore-Exploit
ELBO: Evidence Lower Bound
HR: Hit-Rate
IDCG: Ideal Discounted Cumulative Gain
KL: Kullback-Leibler Divergence
LAUC: Limited Area Under the Curve
MAP: Mean Average Precision
MF: Matrix Factorization
MCMC: Markov Chain Monte Carlo
MRR: Mean Reciprocal Rank
NDCG: Normalized Discounted Cumulative Gain
OCCF: One Class Collaborative Filtering
PML: Probabilistic Machine Learning
ROC: Receiver Operating Characteristic
SGD: Stochastic Gradient Descent
VAE: Variational Auto Encoder
VI: Variational Inference

REFERENCES

[1] R. Pan, Y. Zhou, B. Cao, N. N. Liu, R. Lukose, M. Scholz, and Q. Yang, "One-class collaborative filtering," in Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, ICDM '08, (USA), pp. 502-511, IEEE Computer Society, 2008.
[2] G. Lavee, N. Koenigstein, and O. Barkan, "When actions speak louder than clicks: A combined model of purchase probability and long-term customer satisfaction," in Proceedings of the 13th ACM Conference on Recommender Systems, RecSys '19, (New York, NY, USA), pp. 287-295, Association for Computing Machinery, 2019.
[3] N. Lin, P. Tsai, Y. Chen, and H. H. Chen, "Music recommendation based on artist novelty and similarity," pp. 1-6, 2014.
[4] V. Bogina and T. Kuflik, "Incorporating dwell time in session-based recommendations with recurrent neural networks," 2017.
[5] X. Yi, L. Hong, E. Zhong, N. N. Liu, and S. Rajan, "Beyond clicks: Dwell time for personalization," in Proceedings of the 8th ACM Conference on Recommender Systems, RecSys '14, (New York, NY, USA), pp. 113-120, Association for Computing Machinery, 2014.
[6] P. Yin, P. Luo, W.-C. Lee, and M. Wang, "Silence is also evidence: Interpreting dwell time for recommendation from psychological perspective," in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '13, (New York, NY, USA), pp. 989-997, Association for Computing Machinery, 2013.
[7] R. Ronen, E. Yom-Tov, and G. Lavee, "Recommendations meet web browsing: enhancing collaborative filtering using internet browsing logs," pp. 1230-1238, May 2016.
[8] F. Aiolli, "Efficient top-n recommendation for very large scale binary rated datasets," Oct. 2013.
[9] M. Deshpande and G. Karypis, "Item-based top-n recommendation algorithms," ACM Trans. Inf. Syst., vol. 22, pp. 143-177, Jan. 2004.
[10] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis, "Large-scale matrix factorization with distributed stochastic gradient descent," in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, (New York, NY, USA), pp. 69-77, Association for Computing Machinery, 2011.
[11] R. Bhagat, S. Muralidharan, A. Lobzhanidze, and S. Vishwanath, "Buy it again: Modeling repeat purchase recommendations," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '18, (New York, NY, USA), pp. 62-70, Association for Computing Machinery, 2018.
[12] Q. Zhao, J. Chen, M. Chen, S. Jain, A. Beutel, F. Belletti, and E. H. Chi, "Categorical-attributes-based item classification for recommender systems," in Proceedings of the 12th ACM Conference on Recommender Systems, RecSys '18, (New York, NY, USA), pp. 320-328, Association for Computing Machinery, 2018.
[13] W.-H. Chen, C.-C. Hsu, Y.-A. Lai, V. Liu, M.-Y. Yeh, and S.-D. Lin, "Attribute-aware recommender system based on collaborative filtering: Survey and classification," Frontiers in Big Data, vol. 2, p. 49, 2020.
[14] D. Liang, R. Krishnan, M. Hoffman, and T. Jebara, "Variational autoencoders for collaborative filtering," Feb. 2018.
[15] K. Zheng, X. Yang, Y. Wang, Y. Wu, and X. Zheng, "Collaborative filtering recommendation algorithm based on variational inference," International Journal of Crowd Science, vol. ahead-of-print, Jan. 2020.
[16] Y.-D. Kim and S. Choi, "Scalable variational Bayesian matrix factorization with side information," vol. 33 of Proceedings of Machine Learning Research, (Reykjavik, Iceland), pp. 493-502, PMLR, 22-25 Apr. 2014.
[17] G. Chen, F. Zhu, and P. A. Heng, "Large-scale Bayesian probabilistic matrix factorization with memo-free distributed variational inference," ACM Trans. Knowl. Discov. Data, vol. 12, Jan. 2018.
[18] D. M. Blei, "Build, compute, critique, repeat: Data analysis with latent variable models," Annual Review of Statistics and Its Application, vol. 1, no. 1, pp. 203-232, 2014.
[19] D. Song, C. E. Lee, Y. Li, and D. Shah, "Blind regression: Nonparametric regression for latent variable models via collaborative filtering," in Advances in Neural Information Processing Systems 29 (D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, eds.), pp. 2155-2163, Curran Associates, Inc., 2016.
[20] M. Harvey, M. J. Carman, I. Ruthven, and F. Crestani, "Bayesian latent variable models for collaborative item rating prediction," in Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, (New York, NY, USA), pp. 699-708, Association for Computing Machinery, 2011.
[21] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua, "Neural collaborative filtering," in Proceedings of the 26th International Conference on World Wide Web, WWW '17, (Republic and Canton of Geneva, CHE), pp. 173-182, International World Wide Web Conferences Steering Committee, 2017.
[22] W. Chen, F. Cai, H. Chen, and M. D. Rijke, "Joint neural collaborative filtering for recommender systems," ACM Trans. Inf. Syst., vol. 37, Aug. 2019.
[23] D. M. Blei, "Probabilistic topic models," Commun. ACM, vol. 55, pp. 77-84, Apr. 2012.
[24] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, "Variational inference: A review for statisticians," Journal of the American Statistical Association, vol. 112, no. 518, pp. 859-877, 2017.
[25] M. Braun and J. McAuliffe, "Variational inference for large-scale models of discrete choice," Journal of the American Statistical Association, vol. 105, pp. 324-335, Mar. 2010.
[26] R. Ranganath, S. Gerrish, and D. Blei, "Black box variational inference," in AISTATS, 2014.
[27] J. Paisley, D. Blei, and M. Jordan, "Variational Bayesian inference with stochastic search," 2012.
[28] M. Wan and J. J. McAuley, "Item recommendation on monotonic behavior chains," in Proceedings of the 12th ACM Conference on Recommender Systems, RecSys 2018, Vancouver, BC, Canada, October 2-7, 2018 (S. Pera, M. D. Ekstrand, X. Amatriain, and J. O'Donovan, eds.), pp. 86-94, ACM, 2018.
[29] M. Wan, R. Misra, N. Nakashole, and J. J. McAuley, "Fine-grained spoiler detection from large-scale review corpora," in Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers (A. Korhonen, D. R. Traum, and L. Màrquez, eds.), pp. 2605-2610, Association for Computational Linguistics, 2019.
[30] M. Zhu, "Recall, precision and average precision," Sep. 2004.
[31] A. Tharwat, "Classification assessment methods," Applied Computing and Informatics, 2018.
[32] O. Vechtomova, review of Introduction to Information Retrieval by C. D. Manning, P. Raghavan, and H. Schütze (Cambridge: Cambridge University Press, 2008, xxi + 482 pp., ISBN 978-0-521-86571-5), 2009.
[33] K. Järvelin and J. Kekäläinen, "IR evaluation methods for retrieving highly relevant documents," in ACM SIGIR Forum, vol. 51, pp. 243-250, ACM, New York, NY, USA, 2017.
[34] C. Lioma, J. G. Simonsen, and B. Larsen, "Evaluation measures for relevance and credibility in ranked lists," in Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval, ICTIR '17, (New York, NY, USA), pp. 91-98, Association for Computing Machinery, 2017.
[35] G. Schröder, M. Thiele, and W. Lehner, "Setting goals and choosing metrics for recommender system evaluations," vol. 811, Jan. 2011.
[36] Apple, Turicreate (https://github.com/apple/turicreate), 2014 (accessed May 11, 2020).
[37] D. Tran, A. Kucukelbir, A. B. Dieng, M. Rudolph, D. Liang, and D. M. Blei, "Edward: A library for probabilistic modeling, inference, and criticism," arXiv preprint arXiv:1610.09787, 2016.
[38] D. Tran, M. D. Hoffman, R. A. Saurous, E. Brevdo, K. Murphy, and D. M. Blei, "Deep probabilistic programming," in International Conference on Learning Representations, 2017.
[39] Y. He, H. Chen, Z. Zhu, and J. Caverlee, "Pseudo-implicit feedback for alleviating data sparsity in top-k recommendation," in 2018 IEEE International Conference on Data Mining (ICDM).