Variation Control and Evaluation for Generative Slate Recommendations
Shuchang Liu∗, Fei Sun†, Yingqiang Ge, Changhua Pei, Yongfeng Zhang
Rutgers University, New Brunswick, NJ, USA
Alibaba Group, Beijing, China
{shuchang.syt.liu, yingqiang.ge, yongfeng.zhang}@rutgers.edu
{ofey.sf, changhua.pch}@alibaba-inc.com

ABSTRACT
Slate recommendation generates a list of items as a whole instead of ranking each item individually, so as to better model the intra-list positional biases and item relations. In order to deal with the enormous combinatorial space of slates, recent work considers a generative solution so that a slate distribution can be directly modeled. However, we observe that such approaches, despite their proven effectiveness in computer vision, suffer from a trade-off dilemma in recommender systems: when focusing on reconstruction, they easily over-fit the data and hardly generate satisfactory recommendations; on the other hand, when focusing on satisfying the user interests, they get trapped in a few items and fail to cover the item variation in slates. In this paper, we propose to enhance the accuracy-based evaluation with slate variation metrics to estimate the stochastic behavior of generative models. We illustrate that instead of falling into one of the two undesirable extreme cases of the dilemma, a valid generative solution resides in a narrow "elbow" region in between. We also show that item perturbation can enforce slate variation and mitigate the over-concentration of generated slates, which expands the "elbow" performance to an easy-to-find region. We further propose to separate a pivot selection phase from the generation process so that the model can apply perturbation before generation. Empirical results show that this simple modification can provide even better variance at the same level of accuracy compared to post-generation perturbation methods.
CCS CONCEPTS
• Information systems → Recommender systems.

KEYWORDS
Generative Recommendation; Slate Recommendation; Conditional Variational Auto-Encoder
ACM Reference Format:
Shuchang Liu, Fei Sun, Yingqiang Ge, Changhua Pei, and Yongfeng Zhang. 2021. Variation Control and Evaluation for Generative Slate Recommendations. In Proceedings of the Web Conference 2021 (WWW '21), April 19–23, 2021, Ljubljana, Slovenia. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3442381.3449864

∗This work was done when Shuchang Liu was an intern at Alibaba.
†Corresponding author.

This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution.
WWW '21, April 19–23, 2021, Ljubljana, Slovenia
© 2021 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License.
ACM ISBN 978-1-4503-8312-7/21/04.
https://doi.org/10.1145/3442381.3449864
1 INTRODUCTION

In most recommender systems, items are naturally exposed to users as a slate, which usually contains a fixed number of items, e.g., a 1-by-5 list of recommended items, or a 2-by-2 block that can fit a mobile phone screen. This leads to the idea of slate recommendation, also known as exact-$k$ recommendation [18, 42]. The problem is usually formalized as generating a slate of items such that certain expected user behavior (e.g., the number of clicks) is maximized. The challenge of this problem is that the number of possible slates is combinatorially large [44]. For example, for a system with $n$ items, to generate a slate of $k$ items, the possible number of slates will be $O(n^k)$, which is huge given that many recommender systems work on millions or even billions of items.

Traditional ranking-based recommendation models such as learning-to-rank (LTR) [7, 8, 17, 33, 37] first predict the probability of user engagement on each candidate item, and then select the top-ranked ones as the recommendation list. Despite its well-recognized effectiveness and scalability, this ranking and selection process is greedy in essence and neglects the fact that the user behavior on an item may be influenced by other (e.g., complementary or competitive) items exposed in the same list [29, 48], thus resulting in sub-optimality. Furthermore, evidence has shown that one can improve the recommendation performance by taking into account the intra-list item relations in ranking [2, 8, 13, 18, 36, 48].

Recently, researchers have explored the possibility of solving this problem by directly generating the slate as a whole to break the limitation of ranking-based approaches. Many of the approaches are based on generative models such as Variational Auto-Encoders (VAE) [23, 28]. However, these generative models are stochastic in nature and their variational behavior may not produce satisfactory slate recommendations.
For example, in the case of VAE-based models, the performance depends on a trade-off coefficient $\beta$ [23]: the larger the $\beta$-value during training, the more the model focuses on encoding variation control against the data reconstruction accuracy. In terms of slate recommendation, this phenomenon diverges the generative results into one of three cases:
• Over-reconstruction: when $\beta$ is smaller than some lower threshold $\beta^-$, the model tends to overfit the slate reconstruction on the training set. Though the resulting generated slates have extremely high variance, the model usually fails to generate satisfactory recommendations.
• Over-concentration: when $\beta$ is larger than some upper threshold $\beta^+$, the model tends to choose from only a few prototypical slates that achieve satisfactory performance but fails to explore the variety of slates.
• Elbow case: when $\beta$ is selected in an appropriate region (i.e., $\beta \in [\beta^-, \beta^+]$), it gives intermediate item variety and is able to fulfill a certain degree of user interests. We show that this transitional region is the most suitable for the slate recommendation task. Unfortunately, this very setting usually lies in a narrow region (i.e., $\beta^+ - \beta^-$ is small) while the search space of $\beta$ can be arbitrarily large.
We denote this as the Reconstruction-Concentration Dilemma (RCD), and in this paper we investigate possible solutions that can increase the variety of items under the over-concentration case. To achieve this, one can simply apply post-generation perturbation to enforce item variety, yet this solution ignores the intra-slate features and significantly downgrades the recommendation accuracy. With this in mind, we further derive a modification of the original generation process, so that it can perturb before the final generation while reducing the negative effect of the perturbation.
Specifically, when generating a slate, the model follows a two-phase procedure: first, a pivot selection model chooses an item for a fixed slate position; then a slate completion model generates the remainder of the slate based on the pivot item along with other constraints. With this framework, we summarize our contributions as below:
• We propose to consider both the slate accuracy metric and the slate variation metric when evaluating models that generate stochastic slates.
• We identify the RCD with these metrics and show that the most desirable recommendation performance appears in a narrow "elbow" region.
• We conduct experiments on real-world datasets and simulation environments to show that enforcing variation can mitigate over-concentration and extend the elbow's performance to a wide range of the search space.
• We show that the proposed pivot selection phase can provide better control over the slate variation under the over-concentration case of the dilemma.
In the following sections, we first list related studies in section 2, then describe how generative slate recommendation is achieved in section 3.1. Further, we explain how to employ variance metrics as complements of accuracy metrics in section 3.2, and then introduce our slate recommendation framework in section 3.3. We present our experimental results on both real-world datasets and simulation environments in sections 4 and 5 as evidence to support our claims. Finally, we discuss some other possible solutions that may also improve the item variety to bridge the gap between generative methods and recommender systems.
2 RELATED WORK

There exist several types of generative modeling approaches to recommender systems. The most studied area leverages recurrent neural networks (RNN) [14]. It models the probability of each item conditioned on all previously recommended items $P(d_i | d_{i-1}, \dots, d_1)$ and consecutively makes recommendations from $d_1$ to $d_K$. Modeling in this way means that the recommendation of item $d_i$ does not depend on the items $d_{i+1}, \dots, d_K$ that appear later, which weakens the intra-list relation of the recommendations. This sub-optimality has already been shown in [28]. Another track of research uses auto-encoders for recommendation [32, 38], but they model the user history profiles instead of the distribution of slates. A recent line of research adopts reasoning-based recommendation models [11, 40, 47], which model recommendation as a cognition rather than perception task and adopt neural reasoning rather than neural matching models for better recommendation.

In addition to the generative approach represented by [28], there are other efforts that aim to deal with slate recommendation using reinforcement learning (RL) [16, 26, 27, 42]. Like the early attempts [39, 43], this type of method mostly targets exploring how to make use of the long-term effects of several consecutive recommendations by transforming the slate and its user reaction into "states" in RL. Though they are suitable for solving the problem of slate recommendation, the essences of RL and generative methods are mostly complementary, since a generative model can be pretrained and transplanted as the actor in RL frameworks.

We can also consider slate recommendation as a type of list recommendation, but with a fixed list size. Besides accuracy measures, there are many list-wise metrics that have proved beneficial to both recommender systems and their customers, including but not limited to coverage [19] and intra-list diversity [49, 50].
Typically, the solution has to balance between accuracy and diversity, such as Max-Marginal Relevance (MMR) [9], relative benefits [6], $\alpha$-NDCG [12], and Determinantal Point Process (DPP) [15]. But as pointed out by Jiang et al. [28], it would be unfair to compare these essentially discriminative methods in generative settings, and conversely, it would be unfair for generative methods to compete in traditional LTR settings. In order to show this deviation, we investigate how much discriminative ranking methods differ from generative methods when compared in the same setting in section 5.

A relatively unrelated track that considers slate-wise patterns is to re-rank the items based on the expected user interaction on the candidate slate [1, 3, 46]. However, the items available for re-ranking are often restricted to the candidates given by some base ranking model. Our problem is about directly generating slate recommendations with no restriction on candidate items, which is essentially a different task. One should also distinguish slate recommendation from session-based recommendation [22], which usually consists of user interaction history of arbitrary length, typically with a sequence of sessions, and whose major research focus is on the modeling of the user sequential behaviors [14, 41].

3.1 Generative Slate Recommendation

The corpus of items is denoted as $\mathcal{D}$, and a slate of size $K$ is defined as an ordered list of items $\boldsymbol{s} = (d_1, d_2, \dots, d_K)$, where $d_k \in \mathcal{D}$ and the positional index $k \in \{1, \dots, K\}$ represents that the item appears in the $k$-th slot of the slate. A user's response to a slate $\boldsymbol{s}$ is denoted as $\boldsymbol{r} = (r_1, r_2, \dots, r_K)$, where $r_k$ is the response on item $d_k$, e.g., $r_k \in \{0, 1\}$ represents whether $d_k$ is clicked or not. Assume that each slate $\boldsymbol{s}$ has corresponding latent unknown features $\boldsymbol{z}$ and some known characteristics/constraints $\boldsymbol{c}$. Typically, let $\boldsymbol{c} = \mathrm{onehot}(\sum_{k=1}^{K} r_k)$ so that the user responses are contained in the constraints.
For example, for a slate with 0 clicks, the corresponding constraint would be $[1, 0, 0, 0, 0, 0]$, while for a slate with 3 clicks, the constraint would be $[0, 0, 0, 1, 0, 0]$. Unlike discriminative ranking methods that model $R(\boldsymbol{r}|\boldsymbol{s})$, the user response for a given slate, the goal of generative slate recommendation models is to learn the distribution of slates with the given constraints: $P_\theta(\boldsymbol{s}|\boldsymbol{z}, \boldsymbol{c})$, where $\boldsymbol{z}$ is the latent slate encoding. An optimal slate $\boldsymbol{s}^*$ should maximize the expected number of clicks $\mathbb{E}[\sum_{k=1}^{K} r_k]$, so during recommendation, one should provide the inference model with the maximum number of clicks as constraint $\boldsymbol{c}^* = [0, 0, 0, 0, 0, 1]$ (corresponding to the ideal all-clicked response $\boldsymbol{r}^* = [1, 1, 1, 1, 1]$). Different from the setting in [28], we also allow user features, so the constraint vector $\boldsymbol{c}$ in this case will be the concatenation of the extracted user embedding and the aforementioned transformed response. As we will discuss in section 5.4, a more fine-grained constraint vector that involves the user is more likely to induce a smooth distribution instead of a disjoint manifold in the encoding space $\boldsymbol{z}$.

To find a good generative model $P_\theta(\boldsymbol{s}|\boldsymbol{z}, \boldsymbol{c})$, a Conditional Variational Auto-Encoder (CVAE) framework learns a set of latent factors $\boldsymbol{z} \in \mathbb{R}^m$ such that $\boldsymbol{z}$ can encode sufficient high-level information to reproduce the observed slates with maximum likelihood. As formulated in [30], a variational posterior $Q_\phi(\boldsymbol{z}|\boldsymbol{s}, \boldsymbol{c})$ is used as an approximation to solve the intractable marginal likelihood (which involves an integral over the latent $\boldsymbol{z}$).
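The constraint construction described above can be sketched as follows (a minimal sketch, assuming slate size $K = 5$ so the click-count one-hot has $K + 1 = 6$ slots; the function names are illustrative, not from the paper's code):

```python
def click_constraint(responses, K=5):
    """One-hot encode the total number of clicks of a slate response vector,
    i.e., c = onehot(sum_k r_k), a (K+1)-dimensional vector."""
    assert len(responses) == K
    c = [0.0] * (K + 1)
    c[sum(responses)] = 1.0
    return c

def conditioned_input(responses, user_emb, K=5):
    """Personalized constraint: concatenation of the user embedding
    and the transformed response, as described in the text."""
    return list(user_emb) + click_constraint(responses, K)

# A slate with responses (1, 0, 1, 0, 1) has 3 clicks, so its constraint
# places all mass at index 3; the ideal constraint c* places all mass at
# index K (all K items clicked).
ideal_constraint = click_constraint([1, 1, 1, 1, 1])
```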
The resulting model structure contains an encoder $Q_\phi$ that learns to encode the input slate $\boldsymbol{s}$ and constraint $\boldsymbol{c}$ into a set of variational information (e.g., the mean and variance when a Gaussian prior is assumed) for each factor of $\boldsymbol{z}$, and a decoder $P_\theta$, which corresponds to the generative model. When training the model, one can maximize the variational Evidence Lower Bound (ELBO) of the data likelihood [30], which is equivalent to minimizing:

$$\mathcal{L}_{\boldsymbol{s}} = -\mathbb{E}_{Q_\phi(\boldsymbol{z}|\boldsymbol{s},\boldsymbol{c})}\left[\log P_\theta(\boldsymbol{s}|\boldsymbol{z},\boldsymbol{c})\right] + \beta\, \mathrm{KL}\left[Q_\phi(\boldsymbol{z}|\boldsymbol{s},\boldsymbol{c}) \,\|\, P_\theta(\boldsymbol{z}|\boldsymbol{c})\right] \quad (1)$$

where $P_\theta(\boldsymbol{z}|\boldsymbol{c})$ is the conditional prior distribution of $\boldsymbol{z}$, KL represents the Kullback-Leibler Divergence (KLD), which measures the distance between two distributions, and $\beta$ is the trade-off coefficient described in section 1. The encoder, decoder, and conditional prior are all modeled by neural networks to capture complex feature interactions. With the decoder, items of each slate are selected based on the dot-product similarity between output embeddings and the embeddings of all items in $\mathcal{D}$. During training, in order to avoid overfitting, the reconstruction loss is calculated by the cross entropy over down-sampled items instead of the entire $\mathcal{D}$. At inference time, the slate is generated by passing the ideal condition $\boldsymbol{c}^*$ into the decoder along with a randomly sampled encoding $\boldsymbol{z}$ (e.g., from a random Gaussian) based on the variational information provided by the conditional prior.

In the loss Eq. (1), we can interpret the KL divergence as how well the learned encoding distribution of $\boldsymbol{z}$ is regulated by the guiding prior $P_\theta(\boldsymbol{z}|\boldsymbol{c})$, and the other term reveals how well existing slates are reconstructed. According to [23], manipulating the trade-off parameter $\beta$ will push the model to favor one of the terms over the other.
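To make Eq. (1) concrete, the sketch below evaluates the minimized objective when both the posterior and the conditional prior are diagonal Gaussians, so the KL term has a closed form; the reconstruction NLL is left as an input since it comes from the decoder's softmax over items. This is a hypothetical illustration, not the paper's implementation:

```python
import math

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL[ N(mu_q, var_q) || N(mu_p, var_p) ] for diagonal Gaussians,
    summed over the m factors of z (closed form)."""
    kl = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        vq, vp = math.exp(lq), math.exp(lp)
        kl += 0.5 * (lp - lq + (vq + (mq - mp) ** 2) / vp - 1.0)
    return kl

def beta_cvae_loss(recon_nll, mu_q, logvar_q, mu_p, logvar_p, beta):
    """Eq. (1): negative ELBO = reconstruction NLL
    + beta * KL(posterior Q_phi(z|s,c) || conditional prior P_theta(z|c))."""
    return recon_nll + beta * kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p)
```

A larger `beta` weights the KL term more heavily, which is exactly the knob behind the reconstruction-concentration behavior discussed in section 1.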
For example, if we assume an isotropic Gaussian as the prior distribution and set a larger $\beta$, the factors in the learned $\boldsymbol{z}$ space will become more disentangled, and thus allow more meaningful control over the generation, but with a possible downgrade of reconstruction performance resulting in unrealistic generation. Despite its feasibility in many other tasks, as we will discuss in section 5.1, this $\beta$ leads to a reconstruction-concentration trade-off that barely provides satisfactory recommendation results.

3.2 Slate Variation Metrics

Many generative methods (e.g., VAEs and GANs [20]) are stochastic in terms of the output, but it is possible that the slate encoding $\boldsymbol{z}$ is not obtained through an encoder model, so one cannot simply estimate the slate variance based on $\boldsymbol{z}$. Thus, we are interested in evaluation metrics that can estimate the variance of slates for a wide range of stochastic models.

An evident choice is directly using the item variance across all possible generated slates. Since items are typically represented by embedding vectors, let $\boldsymbol{x}_1, \dots, \boldsymbol{x}_K$ be the vector representations of the generated items. For simplicity, assume conditional independence among the factors of $\boldsymbol{x}$; then the item variance can be calculated as the variance of each factor and be approximated by sampling:

$$\mathrm{Cov}(\boldsymbol{x}) = \mathbb{E}_{\boldsymbol{s} \sim P_\theta}\left[\frac{1}{K}\sum_{i=1}^{K} \left\|\boldsymbol{x}_i^{(\boldsymbol{s})} - \boldsymbol{\mu}\right\|^2\right] = \lim_{N\to\infty} \frac{1}{NK}\sum_{j=1}^{N}\sum_{i=1}^{K} \left\|\boldsymbol{x}_i^{(\boldsymbol{s}_j)} - \boldsymbol{\mu}\right\|^2 \quad (2)$$

where $N$ is the number of generated slate samples, and each slate $\boldsymbol{s}_j$ is sampled from $P_\theta(\boldsymbol{s}|\cdot)$. Note that $\boldsymbol{\mu}$ is the average of all $NK$ generated items, and it depends on the input constraint.
If the generative model is personalized, then the user is included in the input of $P_\theta$, and the generation process will first run $N$ times for each user to give a personalized variance estimation; the estimations are then averaged over all users.

Let $\boldsymbol{\mu}(\boldsymbol{s})$ be the average item of slate $\boldsymbol{s}$:

$$\boldsymbol{\mu}(\boldsymbol{s}) = \frac{1}{K}\sum_{i=1}^{K} \boldsymbol{x}_i^{(\boldsymbol{s})} \quad (3)$$

Then each slate's variance in Eq. (2) can be decomposed into:

$$\sum_{i=1}^{K} \|\boldsymbol{x}_i^{(\boldsymbol{s}_j)} - \boldsymbol{\mu}\|^2 = \sum_{i=1}^{K} \|\boldsymbol{x}_i^{(\boldsymbol{s}_j)} - \boldsymbol{\mu}(\boldsymbol{s}_j) + \boldsymbol{\mu}(\boldsymbol{s}_j) - \boldsymbol{\mu}\|^2 = \sum_{i=1}^{K} (\boldsymbol{x}_i^{(\boldsymbol{s}_j)} - \boldsymbol{\mu}(\boldsymbol{s}_j))^\top (\boldsymbol{x}_i^{(\boldsymbol{s}_j)} - \boldsymbol{\mu}(\boldsymbol{s}_j)) + \sum_{i=1}^{K} (\boldsymbol{\mu}(\boldsymbol{s}_j) - \boldsymbol{\mu})^\top (\boldsymbol{\mu}(\boldsymbol{s}_j) - \boldsymbol{\mu}) + 2(\boldsymbol{\mu}(\boldsymbol{s}_j) - \boldsymbol{\mu})^\top \sum_{i=1}^{K} (\boldsymbol{x}_i^{(\boldsymbol{s}_j)} - \boldsymbol{\mu}(\boldsymbol{s}_j)) \quad (4)$$

Since the last term has $\sum_{i=1}^{K} (\boldsymbol{x}_i^{(\boldsymbol{s}_j)} - \boldsymbol{\mu}(\boldsymbol{s}_j)) = \boldsymbol{0}$, we obtain:

$$\mathrm{Cov}(\boldsymbol{x}) = \lim_{N\to\infty} \frac{1}{N}\sum_{j=1}^{N} \left\|\boldsymbol{\mu}(\boldsymbol{s}_j) - \boldsymbol{\mu}\right\|^2 + \frac{1}{NK}\sum_{j=1}^{N}\sum_{i=1}^{K} \left\|\boldsymbol{x}_i^{(\boldsymbol{s}_j)} - \boldsymbol{\mu}(\boldsymbol{s}_j)\right\|^2 \quad (5)$$

where the first term describes the slate-mean variance and the second term describes the intra-slate variance. Each of the two terms provides a lower bound for the total item variance, and conversely, the total item variance Eq. (2) gives an upper bound for either term. A useful conclusion we can derive from this is that a model good at one of the two terms in Eq. (5) may not be the one that achieves the best total item variance. On one hand, models with good intra-slate variance may still provide repeating slates with the same $\boldsymbol{\mu}(\boldsymbol{s}_j) = \boldsymbol{\mu}$, which results in extremely low slate-mean variance. On the other hand, models with good coverage of items across slates may still have repeated items inside each slate (in the most extreme case, $\boldsymbol{x}_i^{(\boldsymbol{s}_j)} = \boldsymbol{\mu}(\boldsymbol{s}_j)$ when all items are equal), inducing reduced intra-slate variance. Intuitively, we would like to make both the slate-mean variance and the intra-slate variance sufficiently large in order to support good total variance.
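The identity in Eq. (5) can be checked numerically. A minimal sketch, with random embeddings standing in for the output of a real generator:

```python
import numpy as np

def variance_terms(slates):
    """slates: (N, K, m) array of N sampled slates, K item embeddings each.
    Returns (total item variance, slate-mean variance, intra-slate variance),
    i.e., the finite-sample versions of Eq. (2) and the two terms of Eq. (5)."""
    N, K, m = slates.shape
    mu = slates.reshape(N * K, m).mean(axis=0)    # grand mean item
    mu_s = slates.mean(axis=1)                    # per-slate mean item, Eq. (3)
    total = np.sum((slates - mu) ** 2) / (N * K)
    slate_mean = np.sum((mu_s - mu) ** 2) / N
    intra = np.sum((slates - mu_s[:, None, :]) ** 2) / (N * K)
    return total, slate_mean, intra

rng = np.random.default_rng(0)
total, slate_mean, intra = variance_terms(rng.normal(size=(100, 5, 8)))
# By Eq. (5), the total item variance splits exactly into the two parts.
assert np.isclose(total, slate_mean + intra)
```

The assertion holds for any sample, which is exactly why each of the two terms lower-bounds the total item variance.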
Thus, the evaluation protocol should at least include two of the metrics among total item variance, slate-mean variance, and intra-slate variance.

3.3 Pivot-CVAE Framework

We seek to enforce slate variation when the CVAE model provides over-concentrated recommendations (i.e., the large-$\beta$ case of RCD). A straightforward solution is to perturb the generated slate by considering each position as a separate ranking model. However, this post-generation perturbation is very hard to control and always risks a significant downgrade of recommendation accuracy (details in Appendix A), due to the large perturbation space and the ignorance of the positional bias and item relations. With this in consideration, we turn to pre-generation perturbation and propose a simple and effective CVAE framework to mitigate the problem. In general, we separate the original generative process into two steps:

$$P_\theta(\boldsymbol{s}|\boldsymbol{z},\boldsymbol{c}) = P_\theta(d_1, \dots, d_K|\boldsymbol{z},\boldsymbol{c}) = P_\theta(d_2, \dots, d_K|d_1, \boldsymbol{z}, \boldsymbol{c})\, P_\theta(d_1|\boldsymbol{z},\boldsymbol{c}) \quad (6)$$

That is, the framework first uses a pivot selection model $P_\theta(d_1|\boldsymbol{z},\boldsymbol{c})$ to select an adequate pivot item for a fixed slate position (here $d_1$ means we always generate the first appearing item in the slate). Then, with this pivot item as an additional condition, a slate completion model $P_\theta(d_2, \dots, d_K|d_1, \boldsymbol{z}, \boldsymbol{c})$ generates the rest of the items for the slate. With this separation, we can avoid RCD by enforcing variation of the resulting slates through perturbation in the first stage, and use the second phase to clean up the mess if a bad choice of pivot has been made. As illustrated in Figure 1, the pivot controller is only applied to the generative decoder. Compared to the standard VAE model, little has to be changed in the encoder $Q(\boldsymbol{z}|\boldsymbol{s},\boldsymbol{c})$ since it already has the potential to encode any intra-slate pattern.

Picking Pivot Item for the Slate: $P_\theta(d_1|\boldsymbol{z},\boldsymbol{c})$ predicts an item as the pivot; based on this, the slate completion model will fill in the rest of the slate according to the pivot.
In other words, the goal of this part is to find the best item for a certain position in the slate, based on the encoding $\boldsymbol{z}$ and constraint $\boldsymbol{c}$. It first generates an "ideal" latent item embedding $\widehat{\boldsymbol{x}}_1$, and then applies a dot product with all item embeddings in $\Psi$ to find the closest item as $d_1$. The minimization of the reconstruction term can be achieved by optimizing the cross entropy with softmax. In practice, we also use down-sampling [35] to reduce the computational cost and alleviate over-fitting on the training set. Readers may notice that this part can be treated as a typical ranking model, and thus any learning-to-rank framework is suitable for its training, except that one item instead of many is selected at a time.

Similar to sequential modeling, the training of $P_\theta(d_1|\boldsymbol{z},\boldsymbol{c})$ is made independent of the later slate completion model, and in both training and inference, this pivot selection phase allows perturbation, which improves the item variation. Yet, perturbation inevitably causes information loss and downgrades the recommendation accuracy. Theoretically, under the simplest assumption that item interactions are directional binary relations, there are at most $K(K-1)$ such interactions between items for a slate of size $K$. This separation and the introduction of perturbation mean that our model neglects $K-1$ of them. More generally, when choosing $k'$ pivots with $1 < k' < K$ (in the binary relation case), the number of missing relations will be $(K-k')k' \geq K-1$, which indicates more loss of information and recommendation accuracy.
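The two-phase decoding of Eq. (6), with optional pre-generation perturbation of the pivot, can be sketched as follows. The decoder networks are stubbed out by given latent embeddings, and the sigmoid dot-product multinomial sampling follows the perturbation scheme used in the experiments; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_item(latent, item_embs):
    """Decode a latent embedding to its closest catalog item by dot product."""
    return int(np.argmax(item_embs @ latent))

def generate_slate(pivot_latent, completion_latents, item_embs, perturb=False):
    """Two-phase generation of Eq. (6): pick pivot d_1, then complete d_2..d_K.
    pivot_latent / completion_latents stand in for the decoder outputs."""
    if perturb:
        # Pre-generation perturbation: sample the pivot from a multinomial
        # over sigmoid dot-product similarities instead of taking the argmax.
        p = 1.0 / (1.0 + np.exp(-(item_embs @ pivot_latent)))
        p /= p.sum()
        pivot = int(rng.choice(len(item_embs), p=p))
    else:
        pivot = nearest_item(pivot_latent, item_embs)
    rest = [nearest_item(x, item_embs) for x in completion_latents]
    return [pivot] + rest
```

In the full model the completion latents would themselves be conditioned on the drawn pivot; here they are fixed inputs to keep the sketch self-contained.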
Slate Completion with a Given Pivot Item:
After the selection of the pivot, the goal of the slate completion model

$$P_\theta(d_2, \dots, d_K | d_1, \boldsymbol{z}, \boldsymbol{c}) \quad (7)$$

is to learn to fill up the remaining items so as to satisfy the desired constraint $\boldsymbol{c}$. A forward pass takes as input the selected pivot $\widehat{d}_1$, the encoding $\boldsymbol{z}$ (which is the output of $Q$ during training and the output of the conditional prior $P_\theta(\boldsymbol{z}|\boldsymbol{c}^*)$ during inference, as in the VAE of Eq. (1)), and the constraint $\boldsymbol{c}$, and then outputs a set of "best" latent item embeddings $\widehat{\boldsymbol{x}}_2, \dots, \widehat{\boldsymbol{x}}_K$ for each of the remaining slots in the slate. After generating these latent embeddings, it finds for each $\widehat{\boldsymbol{x}}_i$ the nearest neighbor in the candidate set $\mathcal{D}$ through dot-product similarity. As in the pivot selection model, we can again apply the cross-entropy loss with softmax and negative sampling during training. Note that this is the final generation stage and it does not employ perturbation.

Compared to inference time, when the model can only use the inferred $\widehat{d}_1 \sim P_\theta(d_1|\boldsymbol{z},\boldsymbol{c})$ from the pivot selection model, during training there is another valid choice of the pivot: the ground-truth item in the data. We find that the latter choice achieves the same performance but usually exhibits faster convergence. Thus, we adopt the ground-truth item $d_1$ as the input of the slate completion model during training in our experiments, and when perturbing, we calculate item similarities based on the ground truth instead of the inferred item embedding. Additionally, when the pivot is perturbed during training, the slate completion model tends to learn "denoised" intra-slate patterns, which may result in slates that are more accurate but with less variation compared to training without perturbation, as we will discuss in section 5.3.

4 EXPERIMENTS

We conducted experiments on two real-world datasets. The first is YOOCHOOSE from the RecSys 2015 Challenge, and we follow the same preprocessing procedure as [28]. The resulting dataset contains around 274K user slate-response pairs.
Note that there is no user identifier involved in this dataset, so our second dataset is constructed from the MovieLens 100K dataset. We split user rating sessions into slates of size 5 and consider ratings of 4-5 as positive feedback (with label 1) and 1-3 as negative feedback

Code link: https://github.com/CharlieFaceButt/PivotCVAE
https://2015.recsyschallenge.com/challenge.html
https://grouplens.org/datasets/movielens/100k/
Figure 1: Structure of the generative framework during training. $\boldsymbol{s}$ is the input slate of size $K$. $\boldsymbol{r}$ is the user response vector of the input slate. $\widehat{\boldsymbol{s}}$ represents the output slate inferred by the decoder. $\Psi$ and $\Psi(u)$ extract pretrained embeddings for items and users, respectively.

(with label 0). The resulting distribution of slate responses (Figure 8 in Appendix C) is similar to that of the Yoochoose dataset. We consider two versions of this dataset: ML (User) and
ML (No User), to investigate how the presence of the user affects the generative results. Compared to ML (User), the ML (No User) dataset ignores user IDs, like the Yoochoose data. Since both datasets are skewed towards slates with 0 and 3 clicks, we augment the records of 1, 2, 4, and 5 clicks by random repetition until each group has at least half the size of the largest response type. Note that these offline log data have limited feasibility for evaluation since they cannot provide accurate estimations for unseen records. Thus, an additional user response model $R: \mathcal{D}^K \to \{0,1\}^K$ is trained (with binary cross-entropy loss) to fulfill the role of "ground truth" user feedback.

To observe how generative models behave for unseen slates under different environment settings, and to investigate the difference between slate generation metrics and traditional ranking metrics, we employ simulations with plugins of positional biases and item interactions, similar to existing works [26, 28].

The primary goal of the simulated environment is to model $R(\boldsymbol{r}|\boldsymbol{s}, \boldsymbol{u})$, which predicts the user's true responses given slate $\boldsymbol{s}$. For each of the simulators described in this section, the final response for each item $d_k$ is sampled from a Bernoulli distribution with click probability $\mathrm{I}(d_k, j)$, which represents user $j$'s interest in $d_k$:

$$r_k = R(r_{kj}|d_k, j) \sim \mathrm{Bernoulli}(\mathrm{I}(d_k, j)) \quad (8)$$

Thus, the click behavior follows a Poisson binomial distribution, and the expectation of the number of clicks is:

$$\mathbb{E}\left[\sum_{k=1}^{K} r_k\right] = \sum_{d_k \in \boldsymbol{s}} \mathrm{I}(d_k, j) \quad (9)$$

We tune the resulting distribution with proper settings (details in Appendix D) so that it coincides with that of the real-world datasets. Specifically, each simulation is built on a basic User Response Model (URM), which only considers point-wise user-item responses like a matrix factorization model. By adding awareness
Table 1: Pivot-CVAE variations
Models                  perturb d_1 (training)   perturb d_1 (inference)
Pivot-CVAE (GT-PI)
Pivot-CVAE (SGT-PI)     ✓
Pivot-CVAE (GT-SPI)                              ✓
Pivot-CVAE (SGT-SPI)    ✓                        ✓

of positional bias and multi-item relations, we obtain
URM_P (P stands for positional bias) and URM_P_MR (MR stands for multi-item relations), respectively. URM_P_MR contains a coefficient $\rho$ for the weight of the multi-item relations. As a special case, setting $\rho = 0$ recovers URM_P. The details of each simulation environment are given in Appendix D.
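The response sampling of Eq. (8) and the expectation of Eq. (9) can be sketched as follows; the `interest` values play the role of $\mathrm{I}(d_k, j)$, and a Monte-Carlo check confirms that the average click count approaches the Poisson-binomial mean (this is also how the expected-click metric is estimated later):

```python
import random

def simulate_response(interest, rng):
    """Eq. (8): each item's click r_k ~ Bernoulli(I(d_k, j))."""
    return [1 if rng.random() < p else 0 for p in interest]

def expected_clicks(interest):
    """Eq. (9): the mean of the Poisson-binomial total click count
    is simply the sum of the per-item click probabilities."""
    return sum(interest)

rng = random.Random(42)
interest = [0.9, 0.1, 0.5, 0.3, 0.7]   # one user's interest in a 5-item slate
n = 20000
avg = sum(sum(simulate_response(interest, rng)) for _ in range(n)) / n
# The Monte-Carlo average converges to expected_clicks(interest) = 2.5.
assert abs(avg - expected_clicks(interest)) < 0.05
```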
Simulation Data:
We set up three URM_P_MR environments that share the same $|\mathcal{D}|$ and $|\mathcal{U}|$ but differ in $\rho$. Note that there is no need to train a response model from the generated dataset as for the real-world datasets. Conversely, we generate a training set of 100,000 slates from each environment. The numbers of slates for all types of user responses are also balanced, similar to the real-world datasets. The user and item embeddings are assumed explicit and free to use in the training of the recommendation model. Here we expect readers to distinguish these simulations from those used in Reinforcement Learning (RL)-based recommendation models, because the generative model does not interact with the simulated environment for rewards during training. In other words, the generative model is trained offline and the simulators are only used for evaluation purposes.

We denote our two-step generative process as Pivot-CVAE. For the Pivot-CVAE model, perturbation of $d_1$ can be applied either in the training phase or the inference phase, inducing four possible variants (Table 1), where "GT" represents that the model uses the Ground Truth item during training, "PI" represents that the model uses the Pivot Item during inference, and "S" means the item applies perturbation. For all perturbations, we adopt the sigmoid dot-product between item embeddings as similarity and sample according to a multinomial distribution so that it can capture user interests.

Baselines:
We include the
List-CVAE model [28] as an example of VAE and build its non-greedy version (denoted as
Non-Greedy List-CVAE) that conducts post-generation perturbation. That is, after the generation of the slate, the item $d_1$ (in the same position as the pivot of Pivot-CVAE) is perturbed by sampling from $\mathcal{D}$. Again, we apply sampling based on the multinomial distribution of sigmoid dot-product similarity. We also include biased MF [31] and NeuMF [21] as representatives of discriminative ranking models. In order to engage generative recommendations that can explore items other than the top items, we extend these discriminative methods into
Non-greedy MF/NeuMF by applying the same perturbation method on $d_1$ as in Non-greedy List-CVAE and Pivot-CVAE. To compare the item variance with the intra-slate variance, we include the widely adopted MF-MMR [10] as a representative diversity-aware method. It re-ranks the items proposed by the pre-trained biased
MF model based on the following modified MMR score:

score(𝑑) = 𝜆 · sim(𝑑, 𝑗) − (1 − 𝜆) · max_{𝑑ᵢ ∈ 𝒔} sim(𝑑ᵢ, 𝑑)

where the slate 𝒔 starts from an empty set, and the item with the best MMR score is chosen in each step until the slate size reaches 𝐾. sim(𝑑, 𝑗) represents the item’s original ranking score given by the base MF model, and sim(𝑑ᵢ, 𝑑) is the item’s similarity to the 𝑖-th item that has already been added to the list 𝒔.
In our experiment, we adopt a two-layered network with 256-dimensional hidden size for each encoder, decoder, the pivot model 𝑃_𝜃(𝑑₁ | 𝒛, 𝒄), the slate completion model 𝑃_𝜃(𝑑₂, …, 𝑑_𝐾 | 𝑑₁, 𝒛, 𝒄), and the MLP component in NeuMF. In terms of the performance of CVAE-based models, we found that changing the width or depth of the encoder and decoder networks is relatively insignificant as long as they are large enough. The user and item embedding size for all datasets and simulations is fixed to 8, and the size of 𝒛 is set to 𝑚 = 16. The slate size is 𝐾 = 5, which means the size of the condition input 𝒄 of CVAE-based models is 𝐾 + 1, and the MMR coefficient is 𝜆 = 0.5. During training of generative models, the softmax function on each slot in a slate is associated with 1000 negative samples for Yoochoose, and 100 negative samples for MovieLens and the simulation environments.
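As a concrete illustration of the MMR re-ranking used by MF-MMR above, the following is a minimal sketch. The function and argument names are ours, and we use the standard MMR sign convention in which similarity to already-selected items is penalized:

```python
import numpy as np

def mmr_rerank(candidates, base_score, item_emb, K, lam=0.5):
    """Greedily build a slate: at each step pick the item maximizing
    lam * base_score[d] - (1 - lam) * max similarity to items already chosen."""
    slate, remaining = [], list(candidates)
    while len(slate) < K and remaining:
        def mmr(d):
            # Redundancy: largest dot-product similarity to any selected item.
            redundancy = max((item_emb[i] @ item_emb[d] for i in slate), default=0.0)
            return lam * base_score[d] - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        slate.append(best)
        remaining.remove(best)
    return slate
```

With `lam = 1.0` this reduces to plain ranking by the base score; smaller values of `lam` trade accuracy for diversity.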
For all datasets, we randomly split them into train, validation, and test sets following the 80-10-10 holdout rule, and we run each experiment five times to obtain the average performance. We consider two major evaluation metrics based on the interactive environment 𝑅(𝒓 | 𝒔): slate accuracy and slate variation. To illustrate why ranking metrics on the test set are invalid for the evaluation of generative models, we further include discriminative ranking metrics. Slate Accuracy Metric:
The primary metric, following the evaluation setting of [28], is the Expected Number of Clicks (ENC), which is calculated as:

E[∑_{𝑘=1}^{𝐾} 𝑟_𝑘] = ∑_{𝒔 ∈ D^𝐾} 𝑃(𝒔) · E[∑_{𝑘=1}^{𝐾} 𝑟_𝑘 | 𝒔]

where 𝒓_𝑘 | 𝒔 is a random variable modeled by 𝑅(𝒓 | 𝒔), and 𝑃(𝒔) is the probability of generating 𝒔. Similar to the variation evaluation described in section 3.2, we can approximate this metric by sampling techniques. This metric is exactly the ultimate goal of the optimization and, unlike traditional ranking metrics, does not involve any test set. For simulation, combining Eq. (9), it becomes:

E[∑_{𝑘=1}^{𝐾} 𝑟_𝑘] = ∑_{𝒔 ∈ D^𝐾} 𝑃(𝒔) ∑_{𝑑_𝑘 ∈ 𝒔} I(𝑑_𝑘, 𝑗)

And for real-world datasets, we train 𝑅(𝒓 | 𝒔) (𝑅(𝒓 | 𝒔, 𝑢) if user IDs are present) with point-wise binary cross-entropy minimization. Slate Variation:
This metric reveals the severity of the “over-concentration” in RCD and the generation pitfall of limited slate prototypes. As described in section 3.2, we use the total item variance and intra-slate variance metrics in our evaluation. Notably, the variance of 𝒛 directly models the slate variance, but it is unique to VAE-based generative models. In order to form a comparison with non-VAE models, we use item Coverage [19] as the item variance metric and Intra-List Diversity (ILD) [49, 50] as an approximation of the intra-slate variance. Item coverage estimates the proportion of unique items in D that can appear after several rounds of generation. Obviously, LTR models are deterministic, so they will always cover only 5/|D| of the items without perturbation. Intra-list diversity is based on Intra-List Similarity (ILS) [50]:

ILD = 1 − ILS(𝒔) = 1 − ∑_{𝑑_𝑖 ∈ 𝒔} ∑_{𝑑_𝑙 ∈ 𝒔, 𝑑_𝑙 ≠ 𝑑_𝑖} 𝑔(𝒗_𝑖^⊤ 𝒗_𝑙)

where the similarity measure 𝑔 between 𝑑_𝑖 and 𝑑_𝑙 in the slate is based on the dot product of their item embeddings. Ranking Metrics:
We agree with [28] that it is inadequate to use traditional offline ranking metrics to evaluate generative models; as we will discuss in section D.1, these metrics behave differently on a test set compared to an interactive user response environment. Even so, it is still reasonable to compare these metrics among generative models. Specifically, we calculate slate Hit Rate and slate Recall, considering each slate as a ranking list. It is considered a “hit” if an item in the ground-truth slate with positive feedback is recommended. And the slate recall considers each slate as a user history instead of the combined user history across slates. Note that in Yoochoose and ML, user identifiers are absent, so we assume a universal user for all slates during training.
In summary, we conduct two types of evaluation: 1) recommendation performance (slate accuracy and variance metrics) on 𝑅(𝑟 | 𝑠) as the main evaluation, and 2) ranking metrics on the test set. Due to the stochastic nature of generative models (List-CVAE, Pivot-CVAE, and all Non-greedy models), the evaluation of each metric is calculated based on 𝑁 sampled outputs (corresponding to section 3.2). Note that 𝑁 cannot be too small, or else it will not provide an accurate and stable estimation of the true value. In the meantime, it can neither be too large, otherwise the model would exhibit indistinguishably high item coverage (i.e., it may simply generate all items in D given a sufficient number of samples). We consider a search space of 𝛽 (chosen uniformly in log 𝛽 space), and for each setting we train List-CVAE and all Pivot-CVAE models until convergence of ENC on 𝑅(𝒓 | 𝒔). At evaluation, we generate 𝑁 = 500 slates from each trained model and calculate the average as described in section 4.4. In Figure 3, we plot the RCD pattern of List-CVAE on the Yoochoose dataset, and we have observed the same pattern in MovieLens 100K and all simulation environments.
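The sampling-based evaluation of this section (ENC, item coverage, and ILD over 𝑁 generated slates) can be sketched as follows. This is a simplified sketch: the function and variable names are ours, and the normalization of ILD by the number of ordered pairs is our assumption, since the paper’s exact constant is not recoverable from the text:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def evaluate_sampled(sample_slate, click_prob, item_emb, n_items, N=500, rng=None):
    """Estimate ENC, item coverage, and ILD from N slates drawn from the model."""
    rng = rng or np.random.default_rng(0)
    slates = [sample_slate(rng) for _ in range(N)]
    # ENC: average expected number of clicks per generated slate.
    enc = float(np.mean([sum(click_prob(d) for d in s) for s in slates]))
    # Item coverage: fraction of the catalog appearing among the samples.
    coverage = len({d for s in slates for d in s}) / n_items
    # ILD = 1 - average pairwise similarity g(v_i . v_l) over ordered pairs.
    def ild(s):
        K = len(s)
        sims = [sigmoid(item_emb[i] @ item_emb[l]) for i in s for l in s if i != l]
        return 1.0 - sum(sims) / (K * (K - 1))
    return enc, coverage, float(np.mean([ild(s) for s in slates]))
```

A deterministic ranker would always return the same slate here, yielding a fixed low coverage, which is exactly the behavior the variation metrics are meant to expose.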
Figure 2: The slate encoding TSNE plots of List-CVAE on Yoochoose. The first plot corresponds to the over-reconstruction case, the last corresponds to the over-concentration case, and the middle plots correspond to the “elbow” case.
Figure 3: Training loss behavior (left) and recommendation performance (right) of RCD on the Yoochoose data. Each point in the left panel represents the average result of slates in one training epoch of a model. Each point in the right panel represents a certain generated slate. Here, we use ILD as the slate variation metric and ENC as the accuracy metric.
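The trade-off shown in Figure 3 comes from the 𝛽-weighted ELBO of Eq. (1), where the reconstruction term and the KL term pull in opposite directions. A minimal numpy sketch of the two terms (the function name and tensor shapes are our own illustration, not the paper’s code):

```python
import numpy as np

def beta_cvae_terms(recon_nll, mu, logvar, beta):
    """beta-weighted (C)VAE objective: reconstruction negative log-likelihood
    plus beta times the KL divergence from the diagonal-Gaussian posterior
    q(z | s, c) = N(mu, exp(logvar)) to the standard-Gaussian prior N(0, I)."""
    kl = float((-0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=-1)).mean())
    return recon_nll + beta * kl, kl
```

With small `beta` the optimizer drives `recon_nll` down while the KL grows (over-reconstruction); with large `beta` the KL dominates and 𝒛 collapses toward the prior (over-concentration).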
In cases where 𝛽 is small, CVAE becomes biased towards learning the reconstruction term of Eq. (1), as illustrated by the yellow dot-dashed circle in the left subplot of Figure 3. Because of the subdued regularization from the KL term, the encoding distribution of 𝒛 becomes less aligned with the predefined prior. When setting the prior 𝑃_𝜃(𝒛 | 𝒄) as an isotropic standard Gaussian, we observe that the means of the inferred 𝒛 often deviate significantly from 0 and the variances var(𝒛) are far from 1. Though the model successfully learns and generates the slates in the dataset during training, there is no guarantee on the effectiveness of the sampled 𝒛 during inference. In other words, the distribution of generated slates is close to a random selection over the observed dataset. As shown in the yellow dot-dashed circle in the right subplot of Figure 3, we observe low ENC and high variance during inference.
On the contrary, in the over-concentration case where 𝛽 is rather large, the KL term plays a more important role in the learning. The slate encoding 𝒛 is indeed more aligned with the prior, ensuring the sampling effectiveness, and is consequently able to generate satisfactory slates during inference. Yet, it is less capable of encoding the information that is necessary to reconstruct the slates. When the model learns that 𝒛 is reluctant to encode the corresponding slates, the generator tends to ignore 𝒛 and focuses on the condition 𝒄. Since 𝒄 alone does not contain any variational information about slates, the model will only be able to output several biased “slate prototypes” (as illustrated in Appendix B, second row of Figure 6). An alternative analysis of the slate encoding 𝒛 of List-CVAE is given in Figure 2. It shows that with large 𝛽, the slate encoding becomes disjoint according to the ground-truth number of clicks, which means that the slate encoding tends to gather around its corresponding prototype given by the prior.
This is undesirable since the model cannot infer slates outside the cluster, which results in a lack of variety in recommendations.
Besides, we notice that in the training data, many repeated clicks appear in the click and/or purchase sessions of the Yoochoose data. This makes the RCD problem even worse, since the same item is repeatedly recommended even within the same slate, inducing low intra-slate variance. We observed that RCD exists even with 𝛽-annealing [5], a disabled condition (reducing CVAE to VAE), and constrained variation (fixing only the variance of 𝒛, but not the mean). These observations indicate that the RCD problem may exist for a broad range of generative models.
Though neither of the extremes appears to be a good choice for recommendation, we find that there exists a very narrow region in between, where models can provide feasible outputs. In Figure 4, we showcase the results of all metrics on the ML (No User) data for generative models across different values of 𝛽. The x-axis represents the setting of 𝛽; note that results for different 𝛽s correspond to different models that are separately trained and evaluated. For the ENC and ILD metrics, we use box plots to better demonstrate the distribution of generated slates.
We summarize three trends of model behavior when increasing the value of 𝛽 as follows:
• For model training, the converged reconstruction loss gets worse while the KLD loss gets better;
• At inference, the accuracy measure ENC starts to boost but the variation metric of the generated slates drops;
• 𝒛 starts to show clustering behavior under the regulating prior, and the clusters become crisper along with the transition, as shown in Figure 5.
This transitional behavior indicates that models in this intermediate region can to some extent cover the variety of slates in the data while providing moderate accuracy performance. To better show the detailed transitional behavior of the feasible region, we include a more fine-grained search space for 𝛽 and highlight it with shaded green in Figure 4. However, in the experiments on both real-world datasets and all simulation environments, we found that this transition happens within a very small region (at most 30% of the log 𝛽 search space, or equivalently 2% of the 𝛽 search space).
Figure 4: Performance on ML (No User) data. “listcvaewithprior” represents the List-CVAE and “ng_listcvaewithprior” corresponds to the Non-Greedy List-CVAE.
Additionally, we observe that this transitional region consistently gives good test-set ranking performance (both hit rate and recall) compared to other choices of 𝛽. The two extreme cases outside the “elbow” region do not always reveal a decreasing hit rate and an increasing recall on the test set as in Figure 4, but the best ranking performance usually appears on one of the two sides. Intuitively, the generative model should be able to maximize the likelihood of the test set in addition to the training set. Following the same derivation of Eq. (1), this would require both the ability to reconstruct the slate information and the ability to satisfy the constraint. This can only be observed in this transitional region if the slate variation is not enforced, since the two extremes each possess only one of the two characteristics. Note that this ranking performance can only serve as an indicator to compare generative models, and it is incomparable between deterministic ranking models and stochastic generative models. As we will discuss in section D.1, the stochastic generation process explores and proposes various good slates in the view of the user 𝑅(𝒔 | 𝒄), and may not necessarily pinpoint the data in the test set; thus it is typically not favored by this kind of metric.
We present the results of ENC and variance in Table 2. Generative models are evaluated in the large-𝛽 case, since we want to observe the improvement of slate variance when models are over-concentrated. Generative models with small 𝛽 (described in section 5.1) and post-perturbation methods that change more than one item cannot provide satisfactory user response, so they are not included in the comparison. We only present results of datasets with user IDs (ML (User) and all simulation environments) so that collaborative filtering models like MF and NeuMF can be compared.
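The slate-level Hit Rate and Recall described above can be sketched as follows (a simplified sketch; the function names are ours):

```python
def slate_hit(recommended, positives):
    """1 if any ground-truth item with positive feedback appears in the slate."""
    return int(any(d in positives for d in recommended))

def slate_recall(recommended, positives):
    """Recall treating each slate as its own user history (not the combined
    history across slates)."""
    if not positives:
        return 0.0
    return len(set(recommended) & set(positives)) / len(positives)
```

Per the discussion above, averaging these over test-set slates compares generative models with one another, but does not fairly compare them against deterministic rankers.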
The result of each stochastic model (Non-Greedy models, List-CVAE, and Pivot-CVAE models) is calculated by the average of all users’ evaluation. Note that when calculating item coverage and diversity, we consider the user-wise instead of the system-wise metric for these datasets.

Table 2: Model Performance on User Feedback 𝑅(𝑟 | 𝑠) of datasets with user IDs. All results are statistically significant; the overall best are the bold scores, while the best among generative models are underlined. Columns: ML(User), URM_P, URM_P_MR (two settings of 𝜌).

A: Expected Number of Clicks (ENC)
MF: 3.246
Non-Greedy List-CVAE: 3.285 / 3.262 / 3.883 / 4.777
Pivot-CVAE (SGT-PI): 3.376 / 3.274

B: Item Coverage
Pivot-CVAE (SGT-SPI): 0.144 / 0.097 / 0.090 / 0.083

C: Intra-List Diversity (ILD)
MF: 0.206 / 0.031 / 0.035 / 0.036
NeuMF: 0.694 / 0.300 / 0.534 / 0.779
MF-MMR: 0.287 / 0.230 / 0.193 / 0.227
Non-Greedy MF: 0.545 / 0.515 / 0.231 / 0.126
List-CVAE: 0.178 / 0.836 / 0.407 / 0.524
Non-Greedy List-CVAE: 0.428 / 0.864 / 0.572 / 0.664
Pivot-CVAE (SGT-PI): 0.486 / 0.869 / 0.451 / 0.632
Pivot-CVAE (GT-SPI): 0.725

The List-CVAE baseline achieves the best ENC on the ML(User) and URM_P_MR environments because it is over-concentrated on the optimal slate prototype, and the CF models achieve the best ENC on URM_P because of the point-wise environment. All models with item perturbation (Non-Greedy List-CVAE, Pivot-CVAE (SGT-PI), Pivot-CVAE (GT-SPI), and Pivot-CVAE (SGT-SPI)) exhibit degraded ENC compared with the original List-CVAE, but significantly improve slate variation (Item Coverage and ILD). Among models using perturbation, the Pivot-CVAE (GT-SPI) model always achieves satisfactory accuracy with the best slate variety. We observe this outstanding performance across all datasets, meaning that sampling the pivot during inference (SPI) induces more variance and explores more choices of item combinations than sampling during training (SGT).

Figure 5: The slate encoding TSNE plots of List-CVAE on the MovieLens datasets. When the user identifier is present, the encoding forms more fine-grained clusters that are no longer disjoint from one another.

Pivot-CVAE (SGT-PI) applies perturbation during training but not inference; this allows the model to give more accurate generation with better ENC, but the improvement of item coverage and ILD becomes limited. Note that it can achieve a similar ILD to Non-Greedy List-CVAE even though there is a huge gap between their item coverages, indicating that SGT-PI seeks to find good slates with sufficient intra-slate variance but tends to be concentrated slate-wise in exchange for good accuracy.
When applying perturbation in both training and inference, as in Pivot-CVAE (SGT-SPI), it has similar performance to Non-Greedy List-CVAE.
As shown in Table 2, generative methods consistently outperform MF and NeuMF on variance metrics, and achieve better ENC on all datasets except for URM_P, where the environment is point-wise. This indicates that the user responses of real-world datasets like ML(User) are closer to URM_P_MR, which contains intra-slate features such as item relations, rather than URM_P. Additionally, Non-Greedy MF/NeuMF can improve the item coverage of these LTR models to the level of the Non-Greedy List-CVAE baseline (still worse than Pivot-CVAE (GT-SPI)), and Non-Greedy NeuMF even occasionally achieves better ILD performance than Pivot-CVAE (GT-SPI). However, they achieve this with a greater sacrifice on ENC. On the other hand, MF-MMR is able to increase ILD, but its performance is worse than the generative models on all metrics. Moreover, it also shows that a model that improves intra-slate variance does not necessarily improve the total item variance.
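A minimal sketch of the pivot-then-complete inference that this comparison examines. All callables here are placeholders standing in for the trained model components, not the paper’s actual interfaces:

```python
def pivot_cvae_generate(sample_pivot, perturb, complete, z, c, rng):
    """Pivot-CVAE inference: select a pivot item first, optionally perturb it,
    then let the slate-completion model fill the remaining K-1 slots
    conditioned on (pivot, z, c). Perturbing BEFORE generation lets the
    completion model adapt the rest of the slate to the perturbed pivot,
    unlike post-generation perturbation of an already-complete slate."""
    d1 = sample_pivot(z, c, rng)        # pivot from P(d1 | z, c)
    d1 = perturb(d1, rng)               # perturbation step (e.g., SPI)
    rest = complete(d1, z, c, rng)      # remaining items from P(d2..dK | d1, z, c)
    return [d1] + rest
```

This ordering is the key design choice: the completion model sees the perturbed pivot, so the remaining items stay coherent with it, which is why pre-generation perturbation can keep accuracy while adding variation.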
Different from the Yoochoose and MovieLens (No User) data, the MovieLens (User) and our simulation environments include the user ID in the constraints in addition to the ideal response, allowing the model to learn personalized preferences of slates. We plot the distribution of 𝒛 (of List-CVAE) in Figure 5 to show their difference in the over-concentration case. For generative models trained with large 𝛽, instead of having disjoint slate encoding clusters for each type of user response, the presence of the user ID in the constraint will guide the model to learn a set of more fine-grained clusters, each of which corresponds to a user. Note that the same user may have different types of user responses, and a typical user that usually gives a certain type of response also has a higher chance of giving responses of similar types (e.g., a user who frequently clicks everything may also frequently give responses with many clicks).

CONCLUSION
In this paper, we show that generative models for slate recommendation tasks may fall into the Reconstruction-Concentration Dilemma (RCD), where only a narrow middle region can produce effective recommendations. We point out that personalization or applying perturbation can enforce variation on the over-concentration case of the dilemma, but with limitations. By separating a pivot selection phase from the generation process, we propose the Pivot-CVAE model that offers better control of the slate variation by perturbation before the generation. Our pivot-based approach and the variation evaluation framework can be extended to a wider scope of stochastic generation models such as Generative Adversarial Networks (GANs) [45], which we will explore in the future.
Besides, we also find it useful to construct a flexible and comprehensive user-response simulation framework, not only for the purpose of recovering a realistic recommendation environment but also for the need of generating unseen user-item interactions for model training and evaluation, which is essential for generative models as well as causal [4, 24, 34] and RL-based models. We will extend our framework for training and evaluating these models in the future.
ACKNOWLEDGMENTS
We thank Ji Zhang, Qi Dong, and all the reviewers for the constructive discussion.
REFERENCES
[1] Qingyao Ai, Keping Bi, Jiafeng Guo, and W. Bruce Croft. 2018. Learning a deep listwise context model for ranking refinement. In SIGIR. ACM, 135–144.
[2] Qingyao Ai, Xuanhui Wang, Nadav Golbandi, Mike Bendersky, and Marc Najork. 2019. Learning Groupwise Scoring Functions Using Deep Neural Networks. In Proceedings of the First International Workshop On Deep Matching In Practical Applications.
[3] Irwan Bello, Sayali Kulkarni, Sagar Jain, Craig Boutilier, Ed Chi, Elad Eban, Xiyang Luo, Alan Mackey, and Ofer Meshi. 2018. Seq2slate: Re-ranking and slate optimization with RNNs. arXiv preprint arXiv:1810.02019 (2018).
[4] Stephen Bonner and Flavian Vasile. 2018. Causal embeddings for recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems. ACM.
[5] Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. 2015. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349 (2015).
[6] Keith Bradley and Barry Smyth. 2001. Improving recommendation diversity. In Proceedings of the Twelfth Irish Conference on Artificial Intelligence and Cognitive Science, Maynooth, Ireland. Citeseer, 85–94.
[7] Christopher Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Gregory N. Hullender. 2005. Learning to rank using gradient descent. In Proceedings of the 22nd ICML. 89–96.
[8] Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th ICML. ACM, 129–136.
[9] Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval. 335–336.
[10] Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval. 335–336.
[11] Hanxiong Chen, Shaoyun Shi, Yunqi Li, and Yongfeng Zhang. 2021. Neural Collaborative Reasoning. In Proceedings of the 30th Web Conference (WWW).
[12] Charles L. A. Clarke, Maheedhar Kolla, Gordon V. Cormack, Olga Vechtomova, Azin Ashkan, Stefan Büttcher, and Ian MacKinnon. 2008. Novelty and diversity in information retrieval evaluation. In SIGIR. 659–666.
[13] Puneet Kumar Dokania, Aseem Behl, C. V. Jawahar, and M. Pawan Kumar. 2014. Learning to rank using high-order information. In European Conference on Computer Vision. Springer, 609–623.
[14] Tim Donkers, Benedikt Loepp, and Jürgen Ziegler. 2017. Sequential user-based recurrent neural network recommendations. In Proceedings of the Eleventh ACM Conference on Recommender Systems. ACM, 152–160.
[15] Mike Gartrell, Ulrich Paquet, and Noam Koenigstein. 2016. Bayesian low-rank determinantal point processes. In Proceedings of the 10th ACM Conference on Recommender Systems. 349–356.
[16] Yingqiang Ge, Shuchang Liu, Ruoyuan Gao, Yikun Xian, Yunqi Li, Xiangyu Zhao, Changhua Pei, Fei Sun, Junfeng Ge, Wenwu Ou, et al. 2021. Towards Long-term Fairness in Recommendation. arXiv preprint arXiv:2101.03584 (2021).
[17] Yingqiang Ge, Shuyuan Xu, Shuchang Liu, Zuohui Fu, Fei Sun, and Yongfeng Zhang. 2020. Learning Personalized Risk Preferences for Recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 409–418.
[18] Yu Gong, Yu Zhu, Lu Duan, Qingwen Liu, Ziyu Guan, Fei Sun, Wenwu Ou, and Kenny Q. Zhu. 2019. Exact-K Recommendation via Maximal Clique Optimization. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’19).
[19] Nathaniel Good, J. Ben Schafer, Joseph A. Konstan, Al Borchers, Badrul Sarwar, Jon Herlocker, and John Riedl. 1999. Combining Collaborative Filtering with Personal Agents for Better Recommendations. In Proceedings of the Sixteenth National Conference on Artificial Intelligence and the Eleventh Innovative Applications of Artificial Intelligence Conference (Orlando, Florida, USA). American Association for Artificial Intelligence, Menlo Park, CA, USA, 439–446.
[20] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems. 2672–2680.
[21] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th WWW.
[22] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2016. Session-based Recommendations with Recurrent Neural Networks. In ICLR.
[23] Irina Higgins, Loïc Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. 2017. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In Proceedings of 5th International Conference on Learning Representations.
[24] Paul W. Holland. 1986. Statistics and causal inference. Journal of the American Statistical Association 81, 396 (1986), 945–960.
[25] Eugene Ie, Chih-wei Hsu, Martin Mladenov, Vihan Jain, Sanmit Narvekar, Jing Wang, Rui Wu, and Craig Boutilier. 2019. RecSim: A Configurable Simulation Platform for Recommender Systems. arXiv preprint arXiv:1909.04847 (2019).
[26] Eugene Ie, Vihan Jain, Jing Wang, Sanmit Narvekar, Ritesh Agarwal, Rui Wu, Heng-Tze Cheng, Tushar Chandra, and Craig Boutilier. 2019. SlateQ: A Tractable Decomposition for Reinforcement Learning with Recommendation Sets. In Proceedings of the Twenty-eighth IJCAI. Macau, China, 2592–2599.
[27] Eugene Ie, Vihan Jain, Jing Wang, Sanmit Navrekar, Ritesh Agarwal, Rui Wu, Heng-Tze Cheng, Morgane Lustman, Vince Gatto, Paul Covington, et al. 2019. Reinforcement learning for slate-based recommender systems: A tractable decomposition and practical methodology. arXiv preprint arXiv:1905.12767 (2019).
[28] Ray Jiang, Sven Gowal, Yuqiu Qian, Timothy Mann, and Danilo J. Rezende. 2019. Beyond Greedy Ranking: Slate Optimization via List-CVAE. In ICLR.
[29] Thorsten Joachims, Laura A. Granka, Bing Pan, Helene Hembrooke, and Geri Gay. 2005. Accurately interpreting clickthrough data as implicit feedback. In SIGIR, Vol. 5. 154–161.
[30] Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013).
[31] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009), 30–37.
[32] Dawen Liang, Rahul G. Krishnan, Matthew D. Hoffman, and Tony Jebara. 2018. Variational Autoencoders for Collaborative Filtering. In Proceedings of the 2018 World Wide Web Conference. 689–698.
[33] Tie-Yan Liu et al. 2009. Learning to rank for information retrieval. Foundations and Trends® in Information Retrieval 3, 3 (2009), 225–331.
[34] James McInerney, Brian Brost, Praveen Chandar, Rishabh Mehrotra, and Benjamin Carterette. 2020. Counterfactual Evaluation of Slate Recommendations with Sequential Reward Interactions. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1779–1788.
[35] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. [n.d.]. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26.
[36] Tao Qin, Tie-Yan Liu, Xu-Dong Zhang, De-Sheng Wang, and Hang Li. 2009. Global ranking using continuous conditional random fields. In Advances in Neural Information Processing Systems. 1281–1288.
[37] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2012. BPR: Bayesian personalized ranking from implicit feedback. arXiv preprint arXiv:1205.2618 (2012).
[38] Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. 2015. AutoRec: Autoencoders meet collaborative filtering. In Proceedings of the 24th International Conference on World Wide Web. 111–112.
[39] Guy Shani, David Heckerman, and Ronen I. Brafman. 2005. An MDP-based recommender system. Journal of Machine Learning Research 6, Sep (2005).
[40] Shaoyun Shi, Hanxiong Chen, Weizhi Ma, Jiaxin Mao, Min Zhang, and Yongfeng Zhang. 2020. Neural Logic Reasoning. In CIKM. 1365–1374.
[41] Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer. arXiv preprint arXiv:1904.06690 (2019).
[42] Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miro Dudik, John Langford, Damien Jose, and Imed Zitouni. 2017. Off-policy evaluation for slate recommendation. In Advances in NIPS. 3632–3642.
[43] Nima Taghipour, Ahmad Kardan, and Saeed Shiry Ghidary. 2007. Usage-based web recommendations: a reinforcement learning approach. In Proceedings of the 2007 ACM RecSys. ACM, 113–120.
[44] Paolo Viappiani and Craig Boutilier. 2010. Optimal Bayesian recommendation sets and myopically optimal choice query sets. In Advances in Neural Information Processing Systems. 2352–2360.
[45] Jun Wang, Lantao Yu, Weinan Zhang, Yu Gong, Yinghui Xu, Benyou Wang, Peng Zhang, and Dell Zhang. 2017. IRGAN: A minimax game for unifying generative and discriminative information retrieval models. In SIGIR. 515–524.
[46] Jianxiong Wei, Anxiang Zeng, Yueqiu Wu, Peng Guo, Qingsong Hua, and Qingpeng Cai. 2020. Generator and Critic: A Deep Reinforcement Learning Approach for Slate Re-ranking in E-commerce. arXiv preprint arXiv:2005.12206 (2020).
[47] Yikun Xian, Zuohui Fu, Handong Zhao, Yingqiang Ge, Xu Chen, Qiaoying Huang, Shijie Geng, Zhou Qin, Gerard De Melo, S. Muthukrishnan, et al. 2020. CAFE: Coarse-to-fine neural symbolic reasoning for explainable recommendation. In CIKM. 1645–1654.
[48] Yisong Yue, Rajan Patel, and Hein Roehrig. 2010. Beyond position bias: Examining result attractiveness as a source of presentation bias in clickthrough data. In WWW. 1011–1018.
[49] Tao Zhou, Zoltán Kuscsik, Jian-Guo Liu, Matúš Medo, Joseph Rushton Wakeling, and Yi-Cheng Zhang. 2010. Solving the apparent diversity-accuracy dilemma of recommender systems. Proceedings of the National Academy of Sciences.
[50] Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, and Georg Lausen. 2005. Improving recommendation lists through topic diversification. In Proceedings of the 14th International Conference on WWW. ACM, 22–32.
APPENDIX
A POST-GENERATION PERTURBATION
Given a slate (of size 5) and its observed labels, we select a number of its items at random and apply perturbation to observe how much the ground-truth user response distribution 𝑅(𝑟 | 𝒔) is affected. We present the result on the entire Yoochoose data (including val and test) in Figure 7. Each row corresponds to slates with a certain observed label. In each subplot, the 𝑥-axis corresponds to the ground-truth expected user response 𝑅(𝑟 | 𝒔) (from the pre-trained model for the Yoochoose/MovieLens datasets, and from simulation for the URM-based environments), and the 𝑦-axis corresponds to the frequency/density of slates. As shown in the first column, where no perturbation is involved, 𝑅(𝑟 | 𝒔) is usually very close to the observed label. However, as given by the second column from the left, changing merely a single item already causes a significant deviation of the distribution from that of the original slates, especially for slates with the ideal condition 𝒓* (the bottom row). And perturbation of 3 items already induces a distribution close to that of random slates (the rightmost column). We also observed this substantial reduction on the MovieLens and simulation data. As shown in Table 2, simply applying post-generation perturbation on a given slate without considering the context of the slate achieves neither the best ENC nor the best variation.
B MORE ON RCD
The detailed view of the reconstruction behavior of List-CVAE on the entire Yoochoose data is given in Figure 6. The same pattern also appears on MovieLens 100K. Each subplot gives the result of a List-CVAE model with a certain 𝛽; the 𝑦-axis represents the predicted user response 𝑅(𝒓 | 𝑠) of the original slate, and the 𝑥-axis represents 𝑅(𝒓 | 𝑠̂) of the reconstructed slates. The reconstruction behavior on the dataset reveals the model’s performance of inferring the observed dataset (including the test set), which also helps identify the RCD. Slates are poorly reconstructed for 𝛽 ≤ 0.003 (first three subplots), while a more distinguishable diagonal line appears in the plots with larger 𝛽.
C SLATE RESPONSE DISTRIBUTION
The resulting slate response distributions of the Yoochoose data and the MovieLens 100K data are similar, and we show the latter in Figure 8. The 𝑦-axis represents the frequency, and the 𝑥-axis corresponds to the ground-truth response 𝒓 of slates. The label of the 𝑥-axis is obtained by reading each 𝒓 as a binary number and expressing it as an integer. For example, the user response [1, 0, 0, 0, 0] is read as 10000₂ = 16₁₀, where the subscripts 2 and 10 denote the binary and decimal representations of numbers.
D DESIGN OF SIMULATION
Basic User Response Model (URM):
We assume that the basic interaction between users and items follows a user interest model, which is a biased matrix factorization model [31]. Each item 𝑑_𝑖 ∈ D is associated with a vector 𝒗_𝑖 ∈ R^𝑚, where 𝑚 is the embedding dimension. Each user 𝑗 is assigned a vector of interest 𝒖_𝑗 ∈ R^𝑚. To find realistic settings, we first observe the distribution of user embeddings, item embeddings, user biases, and item biases in the pre-trained 𝑅(𝒓 | 𝒔) of the MovieLens dataset, then use the same mean and variance to randomly sample each 𝒗_𝑖, 𝒖_𝑗, user bias 𝑏^𝑢_𝑗, item bias 𝑏^𝑣_𝑖, and global bias 𝑏₀. The user’s initial interest in 𝑑_𝑖 is given by:

I_URM(𝑑_𝑖, 𝑗) = 𝑔(𝒖_𝑗^⊤ 𝒗_𝑖 + 𝑏^𝑢_𝑗 + 𝑏^𝑣_𝑖 + 𝑏₀)

where 𝑔 is a sigmoid function. This basic model assumes independent point-wise interaction for each user-item pair and no other effect from the slate context. Adding Positional Bias (URM_P):
Items appearing at earlier positions of the slate are assumed to have a higher chance of receiving positive feedback than those at later positions, since users may gradually lose patience as they browse further down the list [26]. In our setting, we first employ an average positional offset 𝒃_𝑝 whose entries are positive for the first three positions and negative for the last two. Then, for each user 𝑗, we draw a personalized positional bias B(𝑗) ∼ N(𝒃_𝑝, 𝝈_𝑢) with variance 𝝈_𝑢 = 0.2. The final probability of click at position 𝑘 is:

$$\mathrm{I}_{\mathrm{URM\_P}}(d_i, j) = \mathrm{clip}\big(\mathrm{I}_{\mathrm{URM}}(d_i, j) + \lambda \mathrm{B}(j)_k,\ 0,\ 1\big)$$

where 𝜆 (set to 1.0 in our experiments) controls how significant the impact of positional bias is on the user responses, and the clip function ensures that the user's interest stays within [0, 1].

Adding Item Interactions (URM_P_MR):
In [28], the authors assumed that item interactions are combinations of binary relations. Here we use a simple and easy-to-control multi-item relation model. First, assume that a user's attention is altered when she sees the overall features of the slate:

$$\mathrm{Atn}(\boldsymbol{s}, j) = g\Big(\big(\textstyle\sum_{d_i \in \boldsymbol{s}} \boldsymbol{v}_i\big) \odot \boldsymbol{u}_j\Big)$$

where ⊙ denotes element-wise multiplication. The resulting attention is then applied to each item to obtain the excursion:

$$\mathrm{M}(\boldsymbol{s}, j)_i = \mathrm{Atn}(\boldsymbol{s}, j)^{\top} \boldsymbol{v}_i$$

Adding up everything so far gives the final probability of click:

$$\mathrm{I}_{\mathrm{URM\_P\_MR}}(d_i, j) = \mathrm{clip}\big(\mathrm{I}_{\mathrm{URM}}(d_i, j) + \lambda \mathrm{B}(j)_k + \rho \mathrm{M}(\boldsymbol{s}, j)_i,\ 0,\ 1\big)$$

where the coefficient 𝜌 is introduced to control the significance of the item relation term. Though we found these simulation settings sufficient for our study, one may opt for more realistic and advanced simulations like [25] as complementary approaches.

WWW '21, April 19–23, 2021, Ljubljana, Slovenia. Shuchang Liu, Fei Sun, Yingqiang Ge, Changhua Pei, and Yongfeng Zhang
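Putting the three layers of the simulator together, a minimal NumPy sketch follows (function names and the default values of 𝜆 and 𝜌 are ours for illustration; the formulas are those of I_URM, B(𝑗), and M(𝒔, 𝑗) above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def urm_interest(u_j, v_i, b_u, b_v, b):
    """I_URM: biased matrix factorization passed through a sigmoid,
    giving a point-wise click probability for one user-item pair."""
    return sigmoid(u_j @ v_i + b_u + b_v + b)

def urm_p_mr_interest(u_j, v_i, b_u, b_v, b, pos_bias, k, slate_vecs,
                      lam=1.0, rho=0.1):
    """I_URM_P_MR: point-wise interest, plus the user's personalized
    positional bias at slate position k, plus the slate-level
    item-relation (attention) term, clipped back into [0, 1]."""
    attention = sigmoid(np.sum(slate_vecs, axis=0) * u_j)  # Atn(s, j)
    excursion = attention @ v_i                            # M(s, j)_i
    raw = (urm_interest(u_j, v_i, b_u, b_v, b)
           + lam * pos_bias[k] + rho * excursion)
    return float(np.clip(raw, 0.0, 1.0))
```

With all embeddings and biases at zero, the base interest is exactly 0.5, so the positional offset at position 𝑘 shifts the click probability directly until the clip saturates it at 0 or 1.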
Figure 6: The reconstruction behavior of RCD on the entire Yoochoose dataset (including train, val, and test).

Figure 8: Slate response distribution of the preprocessed MovieLens 100K dataset with slate size 𝑘.

Table 3: Infeasible Evaluation on Test Set (see Section D.1).

                          ML(User)   URM_P    URM_P_MR (𝜌 = .)   URM_P_MR (𝜌 = .)
D: Slate Hit Rate
MF
Non-Greedy List-CVAE      0.0056     0.0068   0.0078             0.0090
Pivot-CVAE (SGT-PI)       0.0043     0.0071   0.0069             0.0078
Pivot-CVAE (GT-SPI)       0.0131     0.0062   0.0072             0.0080
Pivot-CVAE (SGT-SPI)      0.0043     0.0070   0.0072             0.0074
E: Slate Recall
MF
Non-Greedy List-CVAE      0.0021     0.0022   0.0027             0.0035
Pivot-CVAE (SGT-PI)       0.0014     0.0023   0.0024             0.0026
Pivot-CVAE (GT-SPI)       0.0038     0.0020   0.0024             0.0028
Pivot-CVAE (SGT-SPI)      0.0013     0.0023   0.0025             0.0028
Figure 7: User response distribution under perturbation.
D.1 Stochastic vs. Deterministic
Compared to generative models, discriminative ranking models are deterministic and cannot explore the variety of slates; they are therefore favored by ranking metrics on an offline test set. We present the ranking performance on the test set in Table 3-D and 3-E. For most datasets, different from the “ground truth” user responses evaluated by the environment 𝑅(𝒓 | 𝒔), generative models (the CVAE-based models) are stochastic and tend to explore more choices of good but unseen slates beyond the limited observations of the test set. On the other hand, ranking models like MF and NeuMF tend to focus on the single best point that satisfies users the most, and perturbing just one item does not severely harm the ranking metrics of the whole slate, since the remaining items are still individually accurate. Interestingly, the gap between deterministic and stochastic models becomes less observable when we increase the proportion of item relations in the slates (URM_P → URM_P_MR), and the generative models even start to outperform MF and NeuMF under the larger 𝜌 setting.
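To make the whole-slate evaluation concrete, here is one plausible way such metrics could be computed. These definitions are our illustrative assumptions, not necessarily the exact metrics behind Table 3: we take a slate "hit" to be an exact, order-sensitive whole-slate match, and slate recall to be the fraction of a user's relevant items that appear anywhere in the generated slate.

```python
def slate_hit_rate(generated, ground_truth):
    """Fraction of users whose generated slate exactly matches the
    ground-truth slate (order-sensitive, whole-slate match)."""
    hits = sum(1 for g, t in zip(generated, ground_truth)
               if list(g) == list(t))
    return hits / len(ground_truth)

def slate_recall(generated, relevant):
    """Average fraction of each user's relevant items that appear
    anywhere in the generated slate, regardless of position."""
    scores = [len(set(g) & set(r)) / len(r)
              for g, r in zip(generated, relevant)]
    return sum(scores) / len(scores)
```

Under these definitions, a single-item perturbation destroys the whole-slate hit but only slightly lowers recall, which matches the behavior of the deterministic rankers discussed above.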