Leave No User Behind: Towards Improving the Utility of Recommender Systems for Non-mainstream Users
Roger Zhe Li
Delft University of Technology, Delft, The Netherlands
[email protected]
Julián Urbano
Delft University of Technology, Delft, The Netherlands
[email protected]
Alan Hanjalic
Delft University of Technology, Delft, The Netherlands
[email protected]
ABSTRACT
In a collaborative-filtering recommendation scenario, biases in the data will likely propagate in the learned recommendations. In this paper we focus on the so-called mainstream bias: the tendency of a recommender system to provide better recommendations to users who have a mainstream taste, as opposed to non-mainstream users. We propose NAECF, a conceptually simple but effective idea to address this bias. The idea consists of adding an autoencoder (AE) layer when learning user and item representations with text-based Convolutional Neural Networks. The AEs, one for the users and one for the items, serve as adversaries to the process of minimizing the rating prediction error when learning how to recommend. They enforce that the specific unique properties of all users and items are sufficiently well incorporated and preserved in the learned representations. These representations, extracted as the bottlenecks of the corresponding AEs, are expected to be less biased towards mainstream users, and to provide more balanced recommendation utility across all users. Our experimental results confirm these expectations, significantly improving the recommendations for non-mainstream users while maintaining the recommendation quality for mainstream users. Our results emphasize the importance of deploying extensive content-based features, such as online reviews, in order to better represent users and items and to maximize the de-biasing effect.
KEYWORDS
Recommender Systems; Mainstream Bias; User Fairness
ACM Reference Format:
Roger Zhe Li, Julián Urbano, and Alan Hanjalic. 2021. Leave No User Behind: Towards Improving the Utility of Recommender Systems for Non-mainstream Users. In Proceedings of the Fourteenth ACM International Conference on Web Search and Data Mining (WSDM ’21), March 8–12, 2021, Virtual Event, Israel. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3437963.3441769
Collaborative Filtering (CF) models are the most investigated and deployed models in the domain of recommender systems [4]. These models assume that users who had similar preferences on items
of a specific kind (e.g. books, movies) in the past may continue having similar preferences on other items of the same kind. The preferences of the users are expressed through their explicit (e.g. ratings) or implicit (e.g. clicks, downloads) interactions with items. Among the CF models,
Matrix Factorization (MF) [15], which tries to find a representation of both users and items in the same latent factor space, has long been the most successful and most widely deployed CF model. More recently, generalized factorization models, such as factorization machines [30], have been proposed, exploiting input beyond user-item interactions to learn the latent space. Exploiting more input, such as contextual features and other types of useful side information about users and items, was shown to further improve the recommendation quality. The potential for further improvement, for instance by relying on more abundant input data including audio, visual and textual item descriptions or social network dynamics, came only when deep neural networks (DNN) [10] entered the recommendation domain and enabled more sophisticated user/item representation space learning. In particular, textual data acquired from websites that allow users to leave review comments for items along with ratings have been extensively exploited for this purpose. For this type of data, DNN-based user/item modeling utilizing NLP techniques has been shown to achieve significantly higher recommendation performance [8, 37, 44, 45] as well as provide convincing explanations [6, 7, 26, 40].

While these developments have greatly contributed to the improvement of the overall recommendation accuracy, one problem has remained largely unsolved, namely the presence of various types of biases in the learned recommendation models. In this paper we focus on the bias towards the so-called mainstream users. A mainstream user often prefers items liked by many people and also reacts negatively to items widely disliked by others [33]. Contrary to this, non-mainstream users typically show interest in rarely-visited items or have an opposite attitude towards widely accepted or rejected items. Such a “grey sheep” property [46] makes these users different from others, making it difficult for a CF algorithm to identify similar peers.
This leads to recommendations of a generally lower quality, because recommendations for these users are based on neighbors with insufficiently similar preferences. Furthermore, non-mainstream users are typically a minority, and the numerous consistent user-item interactions within the cluster of mainstream users are likely to be dominant in steering the process of learning the user/item representation space. Because of this, the non-mainstream users and their preferred (“outlier”) items become underrepresented in such a space, leading to inequality of the recommendation performance across the user population. This is the mainstream bias, the tendency to provide better recommendations to the mainstream users. Such bias could make non-mainstream users draw insufficient utility from a recommender system and could discourage them from using it at all, which could lead online businesses to lose customers. For information and news recommender systems, however, we foresee even more serious consequences: recommender systems may become less inclusive with respect to non-mainstream opinions and (e.g. political) views and in this way contribute to undesired long-term effects, like intellectual segregation and societal polarization.

In this paper we build on the success of DNNs for developing recommendation models and propose a simple but effective solution towards neutralizing the mainstream bias. With our new recommendation model, referred to as Neural AutoEncoder Collaborative Filtering (NAECF), we introduce adversarial conditions to the process of learning the recommendation algorithm, which in this specific case is realized as the minimization of the rating prediction error. The adversarial conditions are imposed by autoencoders [5], a deep learning architecture widely used for recommendation [18, 25, 35], added to a state-of-the-art DNN-based recommendation framework.
They enforce that the user and item representations are learned in a way such that they preserve their specific and unique properties before being fed to the rating predictor. Since this preservation is achieved for all users, mainstream or not, the autoencoders prevent the learned representations from being biased towards the users with a mainstream taste. The results of experiments conducted on real-world datasets from Amazon [21], covering different domains and scales, show that the representations learned in this way indeed help to de-bias the produced recommendations (predicted ratings in this case). Compared to the case without deploying the adversarial conditions, our proposed method produces significantly better recommendations for non-mainstream users while largely maintaining the recommendation quality for mainstream users. We clearly show that this performance improvement largely stems from adding adversarial conditions to the process of user and item representation learning. In addition, our experiments demonstrate the benefit the non-mainstream users draw from the application of content-based features such as online reviews, further highlighting their value for achieving high recommendation quality across the user community.

The proposed NAECF approach is, to the best of our knowledge, the first to enforce preserving the unique user and item properties as an adversary to the process of learning how to recommend. This allows us to effectively address the mainstream bias in recommender systems, which has not been extensively studied thus far.

A recommender system can be designed to either predict ratings or rank items. Although the latter is gaining momentum in the field, we choose in this paper to follow the rating prediction paradigm. The main reason for this lies in the core of our contribution, which is to investigate how a state-of-the-art recommendation framework may be extended in order to de-bias the process of generating recommendations.
Since the framework we build upon was evaluated in terms of rating prediction, we follow this same paradigm in this paper. Nonetheless, the user-item representation space generated by the autoencoders can serve to predict both rankings and ratings, so we do not consider our choice to limit the broad application of our proposal to the recommendation practice. The results of the paper can be fully reproduced with data and code available online (https://github.com/roger-zhe-li/wsdm21-mainstream).

Our work relates mainly to two topics: biases in recommender systems and review-based user/item modeling.
Potential biases in the training data have already been recognized in early work on matrix factorization for recommendation. Koren et al. [15] introduced a correction in the dot-product rating prediction formula to incorporate rating biases across users, that is, how the rating scale is interpreted by different users. Another bias related to ratings is the anchoring bias; it emerges from the influence of previous recommendations to a user on that user’s future ratings. Adomavicius et al. [3] explored two approaches to neutralize this bias. The first approach involves computational post-hoc adjustments of the ratings that are known to be biased. The second approach involves a user interface by which the system tries to prevent this bias during rating. A different sort of bias is the popularity bias, due to which popular items may be recommended more frequently than other, less popular (e.g., long-tail) items. Abdollahpouri et al. [1] proposed an add-on to a general collaborative filtering algorithm by which a trade-off between accuracy and long-tail coverage can be tuned. More recently, the discussion about biases has increasingly been conducted in the context of resolving ethical and societal issues when deploying recommender systems in practice, such as polarization [29], fairness [22] and discrimination [19], giving a further boost to the research on this topic.

In this paper we focus on the aforementioned mainstream bias. While conceptually close to the popularity bias [2, 17], there is an important difference between the two. Popularity bias could lead to a separation between more and less popular items, similar to the separation of items into those being interesting to mainstream and non-mainstream users. However, popularity bias is not informative regarding the way a recommender system serves different groups of users. According to Steck [36], users may tend to provide feedback on popular items simply by following (being influenced by) other users.
In this way, their preferences are likely to be unconsciously driven away from their real interest. By focusing on mainstreamness, we explicitly look at the bias in the user population.

Kowald et al. [17] demonstrated that non-mainstream music listeners are likely to receive the worst recommendations. Schedl and Bauer [34] investigated music preferences across age groups. They observed that, although only making up a small proportion of users, kids and adolescents have significantly different preferences from other age groups in terms of music genres, and the recommendation performance on these two groups is also distinctive among all users. To repair the unfairness caused by the mainstream bias, several recent works aim at identifying non-mainstream music listeners and using the power of cultural aspects [24, 33] and human memories [16] to better profile these underrepresented users in recommender systems. Despite the reported progress, existing methods to alleviate the mainstream bias usually rely on their specific definitions of mainstreamness, which may limit the findings. Furthermore, these methods tend to split users into different mainstreamness groups for individual training. This setting may cause a loss of recommendation accuracy due to not exploiting the cross-group collaborative information. The approach proposed in this paper aims at neutralizing the mainstream bias in a more generic fashion and without divisions within the user population.

Supported by the rapid development of natural language processing (NLP) techniques, online reviews have increasingly been identified as an important source of useful information for addressing data sparsity issues in recommendation. Exploiting these reviews has led to several advanced recommendation concepts, pioneered by Collaborative Deep Learning (CDL) [39].
This concept introduces a hierarchical Bayesian model using Stacked Denoising Autoencoders (SDAE) to reconstruct the rating matrix from encoded textual reviews. Another method, DeepCoNN [45], unifies the processes of learning the user/item representation and rating prediction in an end-to-end model. The unification is achieved through a combination of Convolutional Neural Networks (CNN) and factorization machines. Due to the sequential nature of reviews, Recurrent Neural Networks (RNN) and attention models are also widely used for user and item feature learning. Wu et al. [42] trained the review representations and ratings jointly within a Long Short-Term Memory (LSTM) framework for movie recommendation. Chen et al. [6] extended the DeepCoNN concept by incorporating attention factors into NARRE, a DeepCoNN-based framework, to provide convincing explanations. MPCN [37] is another attention-based model, which uses two hierarchical attention layers to infer the review importance. Although the use of text reviews has partially resolved the data sparsity issues, a more direct way to achieve this is to increase the scale of the training data. As an example, AugCF [41] was proposed on top of DeepCoNN to augment review and rating data using Generative Adversarial Networks [11]. All models mentioned above represent users and items following the same principle, and the representations are derived from the same data source. Contrary to this, NPA [43] and NeuHash-CF [12] represent users and items in different ways: while they model the items using content-based information, the users are represented by one-hot coded user IDs. Although remarks in the literature state that reviews serve recommendation better as regularizers than as features [32], the models mentioned above have been reported to achieve remarkable overall recommendation accuracy, showing the benefit of using textual review data as input.
In this paper, we look at online reviews from a different angle and go beyond accuracy alone. We analyze their value in achieving better user representations that allow us to balance the recommendation quality across users. We show that, with our proposed recommendation model, reviews can be instrumental in neutralizing the mainstream bias.
The architecture of the proposed NAECF model is illustrated in Fig. 1. The scheme shows that with NAECF we pursue two learning goals simultaneously: maximizing the recommendation accuracy and reconstructing the users’ and items’ original feature vectors
Figure 1: Overall architecture of NAECF.

in the autoencoders. These feature vectors consist of the texts of user reviews, so we refer to the process taking place in the two AEs as “text reconstruction”. Recommendation accuracy may be achieved by optimizing for rating prediction or ranking prediction. Since DeepCoNN [45], the strongest baseline for comparison, is designed for rating prediction, we also take rating prediction as the criterion for recommendation optimization. This allows us to assess specifically the effect of enforcing user and item reconstruction as an adversarial condition to recommendation optimization on the mainstream bias. If the effect is there, it can also be expected if a ranking prediction scheme is expanded in the same way.
The data we use consist of tuples (u, i, r_ui, c_ui), representing a user u providing a rating r_ui to item i and leaving a review text c_ui for said item. Based on Fig. 1, we see the realization of the overall goal of NAECF by minimizing the following loss:

L = L_R + w (L_U + L_I),   (1)

where L_R, L_U and L_I are, respectively, the mean rating prediction loss, and the mean text reconstruction losses for users and items. The constant w is a weight determining the relative influence of the user and item AEs compared to the rating prediction module. The three losses are defined by the following expressions:

L_R = (1 / N_R) Σ_{u,i} loss_R(u, i)   (2)
L_U = (1 / N_U) Σ_u loss_U(u)   (3)
L_I = (1 / N_I) Σ_i loss_I(i),   (4)

where N_R, N_U and N_I represent the number of interactions, users and items in the training set, respectively. Normalizing by these terms makes the effect of the weight w invariant to the statistics of the dataset.

Figure 2: Architecture of the convolutional autoencoder for text feature transformation and extraction.

In NAECF, the rating prediction loss for an individual user-item interaction is computed as a traditional squared loss:

loss_R(u, i) = ((r_ui − r̂_ui) / (r_max − r_min))²,   (5)

where r̂_ui is the predicted rating given by user u to item i. The loss is normalized by the limits of the rating scale used in the dataset, so that L_R is bounded between 0 and 1. The prediction is computed for the interaction ẑ = (x_u, y_i) between vectors x_u and y_i, encoding the user and item text feature representations, respectively. x_u and y_i are the latent factors, low-rank representations of users and items, extracted as the bottlenecks of the corresponding AEs, as indicated by the green and blue blocks in Fig. 1.
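As an illustration, the overall loss of Eqs. (1)-(5) takes only a few lines. The following NumPy sketch is ours (the function name and argument layout are assumptions, not the authors' code):

```python
import numpy as np

def naecf_loss(r, r_hat, w, user_losses, item_losses, r_min=1.0, r_max=5.0):
    """Overall NAECF loss L = L_R + w * (L_U + L_I), Eqs. (1)-(5).

    r, r_hat    : arrays with the N_R observed and predicted ratings
    user_losses : array with the N_U per-user text reconstruction losses
    item_losses : array with the N_I per-item text reconstruction losses
    """
    r, r_hat = np.asarray(r, float), np.asarray(r_hat, float)
    # Eq. (2) with Eq. (5): mean squared error, scaled to [0, 1] by the rating range
    L_R = np.mean(((r - r_hat) / (r_max - r_min)) ** 2)
    # Eqs. (3)-(4): means over users and items make w invariant to dataset size
    L_U = np.mean(user_losses)
    L_I = np.mean(item_losses)
    return L_R + w * (L_U + L_I)
```

With w = 0 the reconstruction terms vanish and only the rating prediction loss is optimized, which corresponds to the DeepCoNN setting.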
We follow the settings of DeepCoNN [45] with a Factorization Machine layer [31], and compute the rating prediction as

r̂_ui = â + Σ_{m=1}^{|ẑ|} â_m ẑ_m + Σ_{m=1}^{|ẑ|} Σ_{n=m+1}^{|ẑ|} ⟨v̂_m, v̂_n⟩ ẑ_m ẑ_n,   (6)

where â denotes the global bias and â_m denotes the strength of first-order interactions in ẑ. Second-order interactions are modeled by ⟨v̂_m, v̂_n⟩ = Σ_f v̂_{m,f} v̂_{n,f}.

The latent factors x_u and y_i are used not only for rating prediction, as indicated in the previous section, but also to reconstruct the original user and item representations in the computation of the text reconstruction losses. We use an encoder to generate the latent factors, which takes an initial user representation V_u or item representation V_i. We deploy the strategy proposed by DeepCoNN [45], which applies TextCNNs [13] for feature transformation. For an arbitrary user u, we extract all review texts they authored and concatenate them into a single long document. Similar to top NLP models like BERT [9] and GPT-2 [28], here we adopt a cutoff length T_U to truncate words exceeding the limit. For users with fewer than T_U words, we pad with empty words.
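The factorization-machine prediction of Eq. (6) admits the standard efficient rewriting of the pairwise term. A NumPy sketch (names and shapes are our illustrative choices):

```python
import numpy as np

def fm_predict(z, a0, a, V):
    """Factorization-machine prediction of Eq. (6), using the identity
    sum_{m<n} <v_m, v_n> z_m z_n
      = 0.5 * sum_f [(sum_m V[m,f] z[m])^2 - sum_m V[m,f]^2 z[m]^2].

    z : (d,) concatenation of the user and item latent factors x_u, y_i
    a0: global bias; a: (d,) first-order weights; V: (d, k) FM factor matrix
    """
    z, a, V = np.asarray(z, float), np.asarray(a, float), np.asarray(V, float)
    first = a @ z
    s = V.T @ z                                        # (k,) column sums
    second = 0.5 * (np.sum(s ** 2) - np.sum((V ** 2).T @ (z ** 2)))
    return a0 + first + second
```

The rewriting avoids the explicit O(|ẑ|²) loop over pairs, which is the usual way FM layers are implemented.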
Then we introduce a look-up layer to get the initial individual word embeddings from a pre-trained model. By concatenating them, we obtain the user embedding V_u^{T_U}. Similarly, we obtain the item embedding V_i^{T_I} for item i. Encoding these initial V_u^{T_U} and V_i^{T_I} embeddings results in the latent factors, which then serve as input to a decoder that we introduce to create the reconstructed embeddings V̂_u^{T_U} and V̂_i^{T_I}. The architecture of the decoder is symmetric to the encoder, with deconvolution and unpooling layers, as shown in Fig. 2. All hyper-parameters used in the decoding stage are the same as in the encoding stage.

The success of reconstructing the initial user and item embeddings is modeled by the text reconstruction losses, which are computed for each user and item. In order to have scores on a bounded scale, we rely on the cosine similarity to measure the text reconstruction loss. Unlike most cases in text analysis where embedding values are positive, the original pre-trained embeddings we use in this paper do have negative values, making the cosine similarity range from -1 to 1. Therefore, we also normalize the cosine similarities so that the scales of L_U and L_I are comparable to that of L_R. This leads to the following formulation of the individual text reconstruction losses:

loss_U(u) = (1 − cos(V_u^{T_U}, V̂_u^{T_U})) / 2,   (7)
loss_I(i) = (1 − cos(V_i^{T_I}, V̂_i^{T_I})) / 2,   (8)

where cos stands for the cosine similarity between the original vectors and the reconstructed ones.

We use Adaptive Moment Estimation (Adam) [14] to minimize the overall loss function in Eq. (1). This way, training converges fast and the learning rate is adapted during the process.
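The reconstruction losses of Eqs. (7)-(8) can be sketched as follows, assuming the (1 − cos)/2 normalization implied by the requirement that L_U and L_I lie on the same [0, 1] scale as L_R:

```python
import numpy as np

def recon_loss(V, V_hat):
    """Eqs. (7)-(8): text reconstruction loss (1 - cos(V, V_hat)) / 2, which
    maps the cosine similarity from [-1, 1] onto [0, 1]."""
    v, w = np.ravel(V).astype(float), np.ravel(V_hat).astype(float)
    cos = np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
    return (1.0 - cos) / 2.0
```

A perfect reconstruction (same direction) yields loss 0; an embedding reconstructed with opposite sign yields the maximum loss of 1.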
Here we present a series of experiments designed to evaluate the proposed NAECF model through the following research questions:

• RQ1: Does NAECF improve the recommendation for non-mainstream users, creating a better balance across users?
• RQ2: What is the effect of using reviews and textual feature transformations on mainstream and non-mainstream users?
• RQ3: What is the correlation between recommendation accuracy and the difficulty of user feature reconstruction?
In this paper we focus on improving the recommendation for non-mainstream users, and investigate the power of text reviews for this purpose. Therefore, the selected datasets are all review-based (see Table 1). We use two Amazon real-world datasets covering different recommendation domains, namely instant videos and digital music, and another dataset from BeerAdvocate [20]. The ratings all range from 1 to 5. However, in the Amazon datasets ratings are integers, while in the BeerAdvocate dataset they are multiples of 0.5. Users in the Amazon datasets have at least 5 interactions. To align with this setting, we filter the BeerAdvocate dataset using the same threshold. Due to unavailability of computational resources, we randomly sampled 25% of users to form a BeerAdvocate subset. Following the original setting of DeepCoNN [45] and its latest related research [32], we use the Google News pre-trained word vectors [23] to generate pre-trained word embeddings. Each word in a review is thus represented as a 300-dimension vector.

We evaluate the rating prediction accuracy by computing the conventional Root-Mean-Square Error on the test set:

rRMSE = sqrt( Σ_{u,i} (r_ui − r̂_ui)² / N ),   (9)

where N is the number of ratings. To evaluate the recommendation performance for individual users, we also report the per-user RMSE (uRMSE, as opposed to rRMSE) for further investigation. We cap the predicted ratings to [1, 5], so there are no out-of-bounds values.

We compare the performance of our proposed NAECF model with two related recommendation models:

• Matrix Factorization [15].
We use MF as a classical, pure similarity-based CF baseline. All non-textual hyper-parameters in NAECF are reused.
• DeepCoNN [45].
This is the pioneering work and state-of-the-art method that introduces deep learning techniques to build a text-based recommender system. User and item features are extracted in parallel, and their interaction is realized by means of factorization machines (FM). Although there are other text-based models following a similar architecture, such as NARRE [6] and AugCF [41], that may outperform DeepCoNN, the components they added for better recommendation performance are mainly attention layers or data augmentation modules, introducing no significant change in the model architecture. Therefore, to focus on the effect of autoencoders in NAECF, we adopt DeepCoNN as the strongest and most relevant baseline.
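The evaluation metrics of Eq. (9) can be sketched as follows; the per-user variant simply groups test interactions by user (function and argument names are our own illustration):

```python
import numpy as np

def rmse(ratings, preds, lo=1.0, hi=5.0):
    """Eq. (9): RMSE with predictions capped to the rating scale [lo, hi]."""
    r = np.asarray(ratings, float)
    p = np.clip(np.asarray(preds, float), lo, hi)
    return float(np.sqrt(np.mean((r - p) ** 2)))

def per_user_rmse(triples, lo=1.0, hi=5.0):
    """uRMSE: one RMSE per user, from (user, rating, prediction) triples."""
    by_user = {}
    for u, r, p in triples:
        rs, ps = by_user.setdefault(u, ([], []))
        rs.append(r)
        ps.append(p)
    return {u: rmse(rs, ps, lo, hi) for u, (rs, ps) in by_user.items()}
```

The clipping step reflects the capping described above: out-of-bounds predictions are snapped to the rating scale before the error is computed.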
We randomly split the datasets into training, validation and test sets with proportions 80%, 10% and 10%, respectively. To address the influence of the data splitting strategy, we set 10 different random seeds and thus use 10 different splits. While all users have at least 5 interactions in total, a random split may distribute these interactions unevenly across sets, such that there may be users with only one rating in the training set. To address this potentially unreliable situation, we only account for users with at least 3 interactions in the training set for evaluation.

Datasets: http://jmcauley.ucsd.edu/data/amazon/ and http://snap.stanford.edu/data/web-BeerAdvocate.html

We first do a grid search on the two Amazon datasets separately to fix the hyper-parameters of DeepCoNN. Then we reuse them for the investigation of NAECF. The hyper-parameters tuned are listed below:

• Number of latent factors for DeepCoNN and NAECF: {…}. All latent factors are initialized with a Uniform distribution between −0.01 and 0.01.
• Learning rate: {…}.
• Dropout rate to avoid overfitting: {…}.
• Batch size: {…}.
• Number of words: {…}.
• Length of CNN kernels: {…}.

Using the DeepCoNN architecture as reference, we investigate the impact of the text reconstruction loss with different weights. Since our main concern in this paper is the effect of adding adversarial conditions via AEs to the original DeepCoNN setting, we weight the user and item AEs with the same weight w, as shown in Eq. (1). Specifically, we consider weight values in the set {0, 0.1, 0.2, 0.5, 1, 2, 5, 10}. Note that NAECF reduces to DeepCoNN when w = 0. Similar to the fine-tuning of the hyper-parameters, the optimal weight is selected on the validation set.

Autoencoders act as adversaries to the rating prediction process, so their activation may harm the overall validation rRMSE. Therefore, we deploy a two-stage training strategy: we first train with w = 0, and then set w to the value we are tuning for the next 50 epochs.

Since NAECF does not chase the best overall performance, but rather a better balance across users, we follow a different validation strategy for w. First, we separate users into bins based on their uRMSE score with DeepCoNN; to put more stress on the performance for non-mainstream users, we use the 4 uneven bins defined by percentiles 10, 50 and 90 of the uRMSE distribution. The performance gain with respect to DeepCoNN is then computed using these bins as strata, assigning smaller importance to the first and last bins, that is, users with a good recommendation and users who are extremely difficult to model. This way, the assessment of model capability is better aligned with our purposes. The gain is thus defined as follows:

Δ = 0.1 Δ1 + 0.4 Δ2 + 0.4 Δ3 + 0.1 Δ4,   (10)

where Δb indicates the mean uRMSE difference between DeepCoNN and NAECF in user bin b, and bin weights reflect the fraction of users they contain out of the total sample. A positive Δ value means NAECF improves upon DeepCoNN.

All models are implemented in PyTorch [27], with CUDA and CuDNN for acceleration on an NVIDIA GeForce GTX 1080Ti GPU.

In this section, we present and analyze the experimental results. As a summary, Table 2 presents the mean performance of all models over the 10 splits per dataset. It can be seen that DeepCoNN and NAECF show significantly better recommendation accuracy than MF (paired t-test, p < .05 [38]), and that NAECF and DeepCoNN perform similarly overall, provided that the weight of the text reconstruction loss is not too high.

Table 2: rRMSE over 10 data splits for all recommendation models in all three datasets (mean ± std. dev.). Bold for best results per dataset. * for results statistically different from the best (t-test, p < .05).

Dataset       | MF              | DeepCoNN       | NAECF, w = 0.1 | 0.2            | 0.5            | 1.0            | 2.0            | 5.0             | 10.0
Instant Video | 1.1600 ± .0264* | 0.9744 ± .0145 | … ± .0149      | … ± .0122      | 0.9754 ± .0159 | 0.9757 ± .0169 | 0.9798 ± .0212 | 0.9896 ± .0221* | 0.9967 ± .0221*
Digital Music | 1.0466 ± .0097* | … ± .0138      | … ± .0115      | 0.9106 ± .0128 | 0.9097 ± .0114 | 0.9104 ± .0128 | 0.9118 ± .0134 | 0.9167 ± .0108  | 0.9219 ± .0146*
BeerAdvocate  | 1.0442 ± .0048* | 0.6722 ± .0090 | 0.6707 ± .0064 | … ± .0035      | … ± .0059*     | 0.6756 ± .0082* | 0.6785 ± .0098* | 0.6899 ± .0137* | 0.7068 ± .0278*

Table 3: Weights w yielding the best performance gain per split on the validation set.

Dataset       | Split 1 | 2   | 3   | 4   | 5 | 6   | 7   | 8   | 9   | 10
Instant Video | 2       | 5   | 0.1 | 10  | 1 | 5   | 0.1 | 0.1 | 0   | 0.5
Digital Music | 0       | 0   | 5   | 5   | 2 | 0.1 | 2   | 0.1 | 0.1 | 0.5
BeerAdvocate  | 0.1     | 0.2 | 0   | 0.5 | 0 | 0.2 | 0.5 | 0.1 | 0.1 | 0.2

Furthermore, in Section 5.1 we show that NAECF, while maintaining a similar overall recommendation quality as DeepCoNN, manages to create a significantly better balance across users thanks to the introduction of the user and item reconstruction losses as adversaries to the rating prediction optimization. Finally, in Section 5.2 we dive deeper into the ability of the autoencoders to reconstruct users from the learned representations, and how this correlates with the recommendation performance per user. This analysis sheds more light on the mechanics underlying NAECF and the reported results.
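The stratified gain of Eq. (10) can be sketched as follows. The percentile bins follow the description above; the 0.1/0.4/0.4/0.1 weights are our reading of "bin weights reflect the fraction of users they contain":

```python
import numpy as np

def stratified_gain(urmse_base, urmse_new):
    """Eq. (10) sketch: bin users at percentiles 10/50/90 of the baseline
    (DeepCoNN) uRMSE and weight the per-bin mean improvement by bin size."""
    base = np.asarray(urmse_base, float)
    diff = base - np.asarray(urmse_new, float)      # positive = NAECF improves
    p10, p50, p90 = np.percentile(base, [10, 50, 90])
    bins = [base <= p10,
            (base > p10) & (base <= p50),
            (base > p50) & (base <= p90),
            base > p90]
    weights = [0.1, 0.4, 0.4, 0.1]
    return sum(w * diff[m].mean() for w, m in zip(weights, bins))
```

Since the weights sum to 1, a uniform improvement of c across all users yields a gain of exactly c; the stratification only matters when the improvement is unevenly distributed across the uRMSE range.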
In order to answer research questions RQ1 and RQ2, we investigate the effect of autoencoders and text reviews on the recommendations for non-mainstream and mainstream users.
As an adversarial learning model, NAECF has two conflicting goals: minimizing the text reconstruction losses 𝐿_𝑈 and 𝐿_𝐼 versus minimizing the rating prediction loss 𝐿_𝑅. If the weight of the text reconstruction loss is too small, the autoencoders cannot exert sufficient influence on the training process, making them ineffective against the mainstream bias. Conversely, if the text reconstruction loss dominates the training process, we expect a significant drop in overall rating prediction accuracy. Following the validation process in Section 4.3, we chose the weight 𝑤 with the best gain Δ on the validation set as the optimal one.

Table 3 reports the optimal validation-set weight per split. As the table shows, in 5 of the 30 splits the optimal weight was 𝑤 = 0. This confirms that 𝑤 is a hyperparameter to tune on a case-by-case basis, and that autoencoders are expected to help when the characteristics of the data allow for it; sometimes they do not lead to a substantial gain over DeepCoNN. Furthermore, based on detailed 𝑟𝑅𝑀𝑆𝐸 results not reported in the paper, we see that the weights with the best performance gains often lead to lower overall performance (7, 6, and 6 out of 10 seeds in the three datasets). This contrast shows that a high score on an overall accuracy metric like 𝑟𝑅𝑀𝑆𝐸 does not necessarily reflect a good balance across individual users.

Table 4: Test-set performance gains averaged over splits (higher is better): per-bin gain Δ_𝑏 and overall gain Δ. * marks gains statistically different from 0 (𝑡-test, 𝑝 < .05).

Dataset         Δ_1       Δ_2       Δ_3       Δ_4       Δ
Instant Video   -0.0035   0.0256*   0.0267*   -0.0308*  0.0175*
Digital Music    0.0036   0.0184*   0.0106*   -0.0167*  0.0103*
BeerAdvocate     0.0119   0.0117*   0.0063*   -0.0115*  0.0073*

After the optimal weights are chosen on the validation set, we turn our attention to the corresponding test-set results. In Table 4 we report the average performance gains of NAECF over DeepCoNN, both per bin and overall. The table shows that users in the central bins (i.e., the central 80% of users) receive a statistically significant performance gain on all datasets; these are exactly the users we specifically target with NAECF. For the two Amazon datasets, these are also the bins receiving the largest gains; for the BeerAdvocate dataset the first bin has the highest gain, though the difference from the second bin is not significant. In fact, the gains and losses observed for the 10% of users in the first bin are not statistically different from zero, which means that users who already receive good performance are neither helped nor hurt by NAECF. Therefore, applying autoencoders as adversaries to the rating prediction problem does not sacrifice performance for the mainstream users. Finally, we observe that the 10% of users in the last bin do receive a statistically significant performance loss. While unfortunate, this loss is collateral damage on a minority of users who are hard to satisfy anyway, in benefit of the bulk of users who now receive better recommendations. Averaging gains across bins, as indicated in Eq. (10), we see that NAECF yields statistically better results than DeepCoNN on all datasets. This indicates the overall success of NAECF in creating a better balance across users. In general, we help most of the non-mainstream users without hurting mainstream users.

Figure 3: Test-set performance gain Δ on each of the 10 data splits, sorted within dataset.

Figure 3 shows the test-set performance gain for the different data splits. We can first notice that the optimal weight, selected based on the gain on the validation set, turned into a slight loss in the test set (a small negative Δ) for only one split in the Instant Video dataset, driven by the users that are hard to optimize for in any case. In 5 cases the optimal weight was 𝑤 = 0, which yields a gain Δ = 0 (due to the stochastic nature of the training process, one retraining of DeepCoNN may yield a slight gain with respect to another, but it should be zero on expectation; we therefore set Δ = 0 when 𝑤 = 0). For the majority of cases though (23 out of 30 splits), the optimal weight selection achieves a higher gain Δ and therefore helps achieve a better overall balance in recommendation performance across mainstream and non-mainstream users. If we consider only the top 90% of users to select the best weight, NAECF shows superiority over DeepCoNN in a total of 28 out of 30 splits (all except 2 in the Digital Music dataset). However, this does not mean that a higher weight is always better for reaching a balance across users. For the BeerAdvocate dataset, 83% of the top-3 weights selected via the validation process are not larger than 0.5. For the two Amazon datasets, although the optimal weights distribute over all weight candidates, unreported results still show that weights no larger than 2.0 account for 85% of the top-3 best results. This observation matches our expectation that a mild weight value is more likely to bring a better trade-off between the overall recommendation accuracy and the balance across users. Based on these observations, we provide a positive answer to RQ1.
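The binning and gain computation used throughout this section can be sketched in a few lines. This is an illustrative reconstruction in Python, assuming 𝑢𝑅𝑀𝑆𝐸 is the per-user RMSE of predicted ratings, that users are split 10/40/40/10 percent by baseline 𝑢𝑅𝑀𝑆𝐸, and that the overall gain Δ is the average of the per-bin gains Δ_𝑏; the exact form of Eq. (10) is not reproduced in this section, so details may differ from the paper's implementation.

```python
import numpy as np

def per_user_rmse(pred, true, user_ids):
    """uRMSE: RMSE of rating predictions, computed separately per user."""
    rmse = {}
    for u in np.unique(user_ids):
        mask = user_ids == u
        rmse[u] = np.sqrt(np.mean((pred[mask] - true[mask]) ** 2))
    return rmse

def binned_gain(urmse_base, urmse_model, edges=(0.1, 0.5, 0.9)):
    """Per-bin gains Delta_b and overall gain Delta.

    Users are sorted by baseline uRMSE and split into four bins
    (10/40/40/10 percent by default). Each bin's gain is the mean
    uRMSE improvement of the model over the baseline; the overall
    gain averages the per-bin gains, which is one plausible reading
    of Eq. (10).
    """
    users = sorted(urmse_base, key=urmse_base.get)
    n = len(users)
    cuts = [0] + [int(e * n) for e in edges] + [n]
    gains = []
    for lo, hi in zip(cuts[:-1], cuts[1:]):
        g = [urmse_base[u] - urmse_model[u] for u in users[lo:hi]]
        gains.append(float(np.mean(g)))  # positive = model beats baseline
    return gains, float(np.mean(gains))
```

Because every bin contributes equally to the overall Δ regardless of its size, the 80% of users in the central bins cannot mask a loss inflicted on the small first and last bins, which is the balancing behavior the validation procedure selects for.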
We hypothesize that exploiting elaborate user- and item-related information, in our case in the form of online reviews, not only contributes to overall recommendation performance [6, 45], but also to neutralizing the mainstream bias. While non-mainstream users are relatively underrepresented in the user space, it should at least help if their individual representations are as elaborate as possible, so as to model their preferences better.

In order to verify this hypothesis, we investigate the effect of this additional information compared to the case where it is not used, as in classical collaborative-filtering models like MF. We deliberately do not compare MF with NAECF, because that would confound the use of text reviews for boosting recommendation accuracy with their use for balancing across users. Instead, we choose to compare with DeepCoNN, which may be regarded as a special case of NAECF and which, architecture-wise, is the closest to collaborative filtering in the NAECF family. As such, a superiority of DeepCoNN over MF will indirectly imply a superiority of NAECF as well.

Figure 4: 𝑢𝑅𝑀𝑆𝐸 gain over MF (positive is better) of DeepCoNN and a retrained MF model, on all 10 data splits. Curves represent a spline-smoothed fit.

Fig. 4 shows the 𝑢𝑅𝑀𝑆𝐸 improvements on the test set made by DeepCoNN, compared to MF, on all 3 datasets. As the figure shows, our expectations are met on all three datasets. The improvement in 𝑢𝑅𝑀𝑆𝐸 scores has a clearly positive correlation with the baseline 𝑢𝑅𝑀𝑆𝐸 achieved by MF, meaning that the users who received worse recommendations from MF are the ones who benefit the most from the inclusion of textual features in DeepCoNN. We note that, close to the origin of the plots, DeepCoNN leads to a slight performance loss for the users for which MF achieved the best performance. This is, however, an artifact of the evaluation process. Users with a 𝑢𝑅𝑀𝑆𝐸 close to zero in MF have almost no room for improvement, so any other model we compare with will probably perform worse. Similarly, other models will likely perform better for the users with very high 𝑢𝑅𝑀𝑆𝐸 in MF, because it is just not possible to perform worse. To illustrate and account for this effect, Fig. 4 also compares with a retrained MF model, which displays both the overall correlation and the loss close to the origin. These serve as a baseline to assess the improvement of DeepCoNN (i.e., rather than comparing the red curve with the 𝑦 = 0 line); against this baseline, DeepCoNN's 𝑢𝑅𝑀𝑆𝐸 gains are consistently higher than those of a retrained MF model.

Finally, and similar to the comparison between DeepCoNN and NAECF stated in Eq. (10), here we also compare DeepCoNN and MF in terms of the gain Δ. The values on the three datasets are 0.1175, 0.1017, and 0.3282, respectively. Such a significant improvement shows the effectiveness of review-based features in creating balance across different users, by which we provide an answer to RQ2.

In summary, we confirmed that NAECF creates a better balance across users by significantly improving the recommendation accuracy for non-mainstream users, subject to a good selection of the weight hyperparameter. We also compared the review-based DeepCoNN and the CF-based MF, and found that the improvement stems mainly from a better optimization for non-mainstream users, who are harder to handle in bare collaborative filtering. NAECF's superiority thus lies in the use of review text, not only to boost rating prediction, but also as an adversary to ensure better user representations. Ultimately, these findings point to an open question about the correlation between the text reconstruction loss and the recommendation accuracy, which we study next.

Mainstream users are generally active and display common behavioral patterns.
This makes it easier for them to be matched with proper neighbors in collaborative filtering, ultimately giving them more accurate recommendations. At the same time, good performance in similarity-based user modeling should make them easier to reconstruct in the NAECF autoencoders, which should therefore show a lower text reconstruction loss for them after training. This should be reflected in a positive correlation between 𝑢𝑅𝑀𝑆𝐸 scores and user reconstruction losses 𝑙𝑜𝑠𝑠_𝑈. Because DeepCoNN does not contain any text reconstruction module, its loss should be randomly distributed and uncorrelated with 𝑢𝑅𝑀𝑆𝐸; we verified this in the data but do not report it here. However, intuition tells us that mainstream users should have low reconstruction losses. The failure of DeepCoNN to reflect this expectation means there is room for improvement to create a better balance across users, which is confirmed by our findings in the previous section.

Figure 5: NAECF 𝑙𝑜𝑠𝑠_𝑈 by test-set 𝑢𝑅𝑀𝑆𝐸, for each weight 𝑤. Lines represent a spline-smoothed fit.

Therefore, we now look into the correlation between the user reconstruction loss 𝑙𝑜𝑠𝑠_𝑈 and the 𝑢𝑅𝑀𝑆𝐸 recommendation accuracy. Fig. 5 shows this relationship for each of the evaluated weights 𝑤. We can see clear differences across datasets, but there are several qualitative commonalities. First, the majority of users are not mainstream, even if they have a rather low 𝑢𝑅𝑀𝑆𝐸 score. Second, higher weights generally lead to lower reconstruction losses and therefore to better user representations. This is expected, because a high weight makes the text reconstruction losses dominate the overall loss in Eq. (1), but the figure further shows that the relative relationship between 𝑙𝑜𝑠𝑠_𝑈 and 𝑢𝑅𝑀𝑆𝐸 is quite consistent across weights. Interestingly, the BeerAdvocate dataset shows some fluctuations with high weights.
This evidences that the optimal weight needs proper tuning, because excessively high weights lead to substantial performance loss, and 𝑢𝑅𝑀𝑆𝐸 scores become less stable as a consequence. As reported in Table 3, the optimal weights for this dataset are indeed rather small in comparison with the other two datasets.

Figure 6: NAECF 𝑙𝑜𝑠𝑠_𝑈 by test-set 𝑢𝑅𝑀𝑆𝐸, with all evaluated weights 𝑤. Error bars show standard deviations per user bin.

We also followed the earlier approach of dividing users into four bins according to the 𝑢𝑅𝑀𝑆𝐸 distribution. Fig. 6 similarly shows the relationship between 𝑙𝑜𝑠𝑠_𝑈 and 𝑢𝑅𝑀𝑆𝐸 with all the evaluated weights, but differentiating among user bins. We can clearly observe that, as expected, the relationship is monotonically positive, except for the last bin in the Digital Music dataset. This confirms again that users who are better represented receive more accurate recommendations. As reported in Table 4, mainstream users in bin 1 do not always benefit from NAECF, because they already receive good recommendations and there is little room for improvement, regardless of how well they are reconstructed. Fig. 6 confirms this especially in the two Amazon datasets, where bin 1 users receive nearly perfect recommendations. But NAECF improves performance especially for the 80% of non-mainstream users in bins 2 and 3, because those are harder to represent to begin with. Fig. 6 confirms that these users generally have the highest reconstruction losses indeed. Together with the correlations in Fig. 5, we see the relationship between the mainstreamness of users and the difficulty of representing them. Notwithstanding, the bottom 10% of users in bin 4 are too extreme to find proper representations, so the autoencoders hardly work for them. We even observed in Fig. 6 a negative correlation on the Instant Video and Digital Music datasets when 𝑢𝑅𝑀𝑆𝐸 is large. This confirms that NAECF sacrifices performance for these extreme users in favor of the others.
Although unfortunate, we find this behavior acceptable, because these users often display such particular tastes and patterns that it is hard for them to benefit from virtually any CF-based recommendation model.
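The diagnostic studied in this section can be probed with a short script. Below is a minimal sketch in Python, assuming the per-user reconstruction loss 𝑙𝑜𝑠𝑠_𝑈 is a mean squared error between each user's text embedding and its autoencoder reconstruction (the actual loss is defined by Eq. (1), not reproduced here), and summarizing the association with a Pearson correlation rather than the spline fits of Fig. 5.

```python
import numpy as np

def user_reconstruction_loss(text_emb, reconstructed):
    """Per-user reconstruction loss (one row per user).
    MSE is an assumption for illustration; the paper's Eq. (1)
    defines the actual reconstruction loss."""
    return np.mean((text_emb - reconstructed) ** 2, axis=1)

def loss_accuracy_correlation(loss_u, urmse):
    """Pearson correlation between per-user reconstruction loss and
    per-user rating error. A positive value matches the trend that
    harder-to-reconstruct users receive worse recommendations."""
    return float(np.corrcoef(loss_u, urmse)[0, 1])
```

A strongly positive correlation mirrors the pattern observed for NAECF, whereas a value near zero is what we would expect from DeepCoNN, which lacks the reconstruction module.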
Rating accuracy has long been an important criterion to evaluate recommender systems, if not the most important one. Previous research has therefore focused mainly on maximizing the overall performance averaged over users. However, traditional collaborative filtering methods focus more strongly on recommending items that received positive interactions from similar users. As a result, it is hard for CF models to work well with non-mainstream users who have special tastes or habits. Because non-mainstream users are a minority, this problem may not have a strong effect on the overall accuracy, yet it may create an unfair imbalance across users. To address this problem, we proposed a conceptually simple but effective model named NAECF, which minimizes the rating prediction loss while keeping the user and item properties preserved in the learned user and item representations. Preservation of user and item properties is imposed as an adversarial condition, by minimizing reconstruction losses in addition to the rating prediction error. This prevents the representations from being biased towards mainstream users.

We conducted experiments on three real-world datasets, and found that NAECF achieves an overall rating accuracy on par with the state of the art. However, its strength is the better balance it achieves across users, thanks to a significant improvement of the recommendation accuracy for non-mainstream users without significantly harming the mainstream ones. This improvement is achieved through an optimal trade-off between rating prediction and text reconstruction. Our results confirm a clear correlation between how well users are represented and the quality of their recommendations, evidencing that side information may be instrumental not only for boosting overall accuracy, but also for minimizing possible biases in the learned models.

Future work will be conducted in several directions.
First, we will investigate whether the conclusions drawn here for rating prediction generalize to the ranking paradigm, which is gaining popularity in the recommendation field. Second, in this paper we treated users and items as equally important through a single text reconstruction weight. One may argue that improving the representation of users alone is not enough, because the model also needs a good item representation to know what to recommend. However, users and items may have different impacts, and we would like to explore this question by implementing two separate weights in the NAECF loss. Third, we introduced side information from text reviews in order to achieve a better balance across users. However, text reviews are just one example of additional content-based resources, such as images and demographic information, that could serve a similar function. We would like to further investigate the effect of other side information in the future and, perhaps more importantly, how to effectively incorporate such information into NAECF to eliminate the mainstream bias. Finally, we are also interested in combining NAECF with explainable recommendation, so that we can provide convincing explanations to non-mainstream users.
REFERENCES

[1] Himan Abdollahpouri, Robin Burke, and Bamshad Mobasher. 2018. Popularity-Aware Item Weighting for Long-Tail Recommendation. arXiv preprint arXiv:1802.05382 (2018).
[2] Himan Abdollahpouri, Masoud Mansoury, Robin Burke, and Bamshad Mobasher. 2019. The Unfairness of Popularity Bias in Recommendation. In RMSE@RecSys (CEUR Workshop Proceedings, Vol. 2440). CEUR-WS.org.
[3] Gediminas Adomavicius, Jesse Bockstedt, Shawn Curley, and Jingjing Zhang. 2014. De-biasing user preference ratings in recommender systems. In Joint Workshop on Interfaces and Human Decision Making in Recommender Systems. 2.
[4] Gediminas Adomavicius and Alexander Tuzhilin. 2005. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge & Data Engineering.
[5] In AAAI. 279–284.
[6] Chong Chen, Min Zhang, Yiqun Liu, and Shaoping Ma. 2018. Neural attentional rating regression with review-level explanations. In World Wide Web Conference. International World Wide Web Conferences Steering Committee, 1583–1592.
[7] Zhiyong Cheng, Xiaojun Chang, Lei Zhu, Rose C Kanjirathinkal, and Mohan Kankanhalli. 2019. MMALFM: Explainable recommendation by leveraging reviews and images. ACM Transactions on Information Systems 37, 2 (2019).
[8] Zhiyong Cheng, Ying Ding, Xiangnan He, Lei Zhu, Xuemeng Song, and Mohan S Kankanhalli. 2018. A^3NCF: An Adaptive Aspect Attention Model for Rating Prediction. In IJCAI. 3748–3754.
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT (1). Association for Computational Linguistics, 4171–4186.
[10] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
[11] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In NIPS. 2672–2680.
[12] Casper Hansen, Christian Hansen, Jakob Grue Simonsen, Stephen Alstrup, and Christina Lioma. 2020. Content-aware Neural Hashing for Cold-start Recommendation. In SIGIR. ACM, 971–980.
[13] Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. In EMNLP. ACL, 1746–1751.
[14] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR.
[15] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer.
[16] CoRR abs/2003.10699 (2020).
[17] Dominik Kowald, Markus Schedl, and Elisabeth Lex. 2020. The Unfairness of Popularity Bias in Music Recommendation: A Reproducibility Study. In ECIR (2) (Lecture Notes in Computer Science, Vol. 12036). Springer, 35–42.
[18] Chen Ma, Peng Kang, Bin Wu, Qinglong Wang, and Xue Liu. 2019. Gated Attentive-Autoencoder for Content-Aware Recommendation. In ACM International Conference on Web Search and Data Mining. 519–527.
[19] Masoud Mansoury, Himan Abdollahpouri, Jessie Smith, Arman Dehpanah, Mykola Pechenizkiy, and Bamshad Mobasher. 2020. Investigating Potential Factors Associated with Gender Discrimination in Collaborative Recommender Systems. In FLAIRS Conference. AAAI Press, 193–196.
[20] Julian McAuley, Jure Leskovec, and Dan Jurafsky. 2012. Learning attitudes and attributes from multi-aspect reviews. In IEEE International Conference on Data Mining. IEEE, 1020–1025.
[21] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. In International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 43–52.
[22] Rishabh Mehrotra, James McInerney, Hugues Bouchard, Mounia Lalmas, and Fernando Diaz. 2018. Towards a Fair Marketplace: Counterfactual Evaluation of the trade-off between Relevance, Fairness & Satisfaction in Recommendation Systems. In CIKM. ACM, 2243–2251.
[23] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
[24] Peter Müllner. 2019. Studying Non-Mainstream Music Listening Behavior For Fair Music Recommendations. Ph.D. Dissertation. Graz University of Technology.
[25] Shumpei Okura, Yukihiro Tagami, Shingo Ono, and Akira Tajima. 2017. Embedding-based news recommendation for millions of users. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1933–1942.
[26] Deng Pan, Xiangrui Li, Xin Li, and Dongxiao Zhu. 2020. Explainable Recommendation via Interpretable Feature Mapping and Evaluation of Explainability. In IJCAI. ijcai.org, 2690–2696.
[27] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W.
[28] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019), 9.
[29] Bashir Rastegarpanah, Krishna P. Gummadi, and Mark Crovella. 2019. Fighting Fire with Fire: Using Antidote Data to Improve Polarization and Fairness of Recommender Systems. In WSDM. ACM, 231–239.
[30] Steffen Rendle. 2010. Factorization machines. In 2010 IEEE 10th International Conference on Data Mining (ICDM). IEEE, 995–1000.
[31] Steffen Rendle. 2012. Factorization Machines with libFM. ACM Trans. Intell. Syst. Technol. 3, 3 (2012), 57:1–57:22.
[32] Noveen Sachdeva and Julian McAuley. 2020. How Useful are Reviews for Recommendation? A Critical Review and Potential Improvements. In SIGIR. ACM, 1845–1848.
[33] Markus Schedl and Christine Bauer. 2018. An analysis of global and regional mainstreaminess for personalized music recommender systems. Journal of Mobile Multimedia 14, 1 (2018), 95–112.
[34] Markus Schedl and Christine Bauer. 2019. Online Music Listening Culture of Kids and Adolescents: Listening Analysis and Music Recommendation Tailored to the Young. CoRR abs/1912.11564 (2019).
[35] Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. 2015. AutoRec: Autoencoders meet collaborative filtering. In International Conference on World Wide Web. ACM, 111–112.
[36] Harald Steck. 2011. Item popularity and recommendation accuracy. In ACM Conference on Recommender Systems. ACM, 125–132.
[37] Yi Tay, Anh Tuan Luu, and Siu Cheung Hui. 2018. Multi-Pointer Co-Attention Networks for Recommendation. In KDD. ACM, 2309–2318.
[38] Julián Urbano, Harlley Lima, and Alan Hanjalic. 2019. Statistical Significance Testing in Information Retrieval: An Empirical Analysis of Type I, Type II and Type III Errors. In ACM SIGIR. 505–514.
[39] Hao Wang, Naiyan Wang, and Dit-Yan Yeung. 2015. Collaborative deep learning for recommender systems. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1235–1244.
[40] Nan Wang, Hongning Wang, Yiling Jia, and Yue Yin. 2018. Explainable recommendation via multi-task learning in opinionated text data. In International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 165–174.
[41] Qinyong Wang, Hongzhi Yin, Hao Wang, Quoc Viet Hung Nguyen, Zi Huang, and Lizhen Cui. 2019. Enhancing Collaborative Filtering with Generative Augmentation. In KDD. ACM, 548–556.
[42] Chao-Yuan Wu, Amr Ahmed, Alex Beutel, and Alexander J. Smola. 2017. Joint Training of Ratings and Reviews with Recurrent Recommender Networks. In ICLR (Workshop). OpenReview.net.
[43] Chuhan Wu, Fangzhao Wu, Mingxiao An, Jianqiang Huang, Yongfeng Huang, and Xing Xie. 2019. NPA: Neural News Recommendation with Personalized Attention. In KDD. ACM, 2576–2584.
[44] Susen Yang, Yong Liu, Yinan Zhang, Chunyan Miao, Zaiqing Nie, and Juyong Zhang. 2020. Learning Hierarchical Review Graph Representation for Recommendation. arXiv preprint arXiv:2004.11588 (2020).
[45] Lei Zheng, Vahid Noroozi, and Philip S Yu. 2017. Joint deep modeling of users and items using reviews for recommendation. In ACM International Conference on Web Search and Data Mining. ACM, 425–434.
[46] Yong Zheng, Mayur Agnani, and Mili Singh. 2017. Identification of Grey Sheep Users by Histogram Intersection in Recommender Systems. In