Learning to Explain Recommendations
Lei Li
Hong Kong Baptist University, Hong Kong
[email protected]

Yongfeng Zhang
Rutgers University, New Brunswick
[email protected]

Li Chen
Hong Kong Baptist University, Hong Kong
[email protected]
ABSTRACT
Explaining to users why some items are recommended is critical, as it can help users make better decisions, increase their satisfaction, and gain their trust in recommender systems (RS). However, existing explainable RS usually treat explanations as side outputs of the recommendation model, which causes two problems: (1) it is difficult to evaluate the produced explanations, because they are usually model-dependent, and (2) as a result, the possible impacts of those explanations are rarely investigated. To address the evaluation problem, we propose learning to explain for explainable recommendation. The basic idea is to train a model that selects explanations from a collection, formulated as a ranking-oriented task. A great challenge, however, is that the sparsity issue in the user-item-explanation data is more severe than that in traditional user-item interaction data, since not every user-item pair is associated with multiple explanations. To mitigate this issue, we propose to perform two sets of matrix factorization by treating the ternary relationship as two groups of binary relationships. To further investigate the impacts of explanations, we extend the traditional item ranking of recommendation to an item-explanation joint-ranking formalization. We study whether purposely selecting explanations can achieve certain learning goals, e.g., in this paper, improving the recommendation performance. Experiments on three large datasets verify our solution's effectiveness on both item recommendation and explanation ranking. In addition, our user-item-explanation datasets open up new ways of modeling and evaluating recommendation explanations. To facilitate the development of explainable RS, we will make our datasets and code publicly available.
CCS CONCEPTS
• Information systems → Recommender systems; Learning to rank; • Computing methodologies → Multi-task learning.

KEYWORDS
Recommender Systems; Explainable Recommendation; Learning to Rank; Learning to Explain; Multi-task Learning
1 INTRODUCTION
Recommendation algorithms, such as collaborative filtering [35, 36] and matrix factorization [19, 30], have been widely deployed on online platforms, e.g., Amazon and YouTube, to help users find items of interest. Meanwhile, there is a growing interest in explainable recommendation [3, 7, 9, 10, 13, 14, 22, 23, 42, 46, 47], which aims at producing user-comprehensible explanations, as they can help users make informed decisions and gain users' trust in the system [39, 46]. However, in current explainable recommendation approaches, explanation is often a side output of the model, which incurs two problems: first, the evaluation of explanations is difficult; second, owing to the first problem, the potential impacts of explanations are rarely studied.

Evaluation of explanations in existing works can generally be classified into four categories: case study, user study, online evaluation and offline evaluation [46]. In most works, a case study is adopted to show how the example explanations are correlated with recommendations. These examples may look intuitive, but they are less representative as they may be cherry-picked. Results of user studies are more plausible, but they are usually simulated and cannot reflect users' own perception, while this is not a problem for online evaluation. However, the latter is difficult to implement as it relies on collaboration with industrial firms, which may explain why only a few works [29, 44, 47] conducted such evaluation. Consequently, one may wonder whether it is possible to evaluate explainability using offline metrics.

We propose learning to explain, which aims to provide a standard way of evaluating explanations, and to improve the recommendation performance by explanation. Its basic idea is to learn a model that can select appropriate explanations from an explanation pool for an item. This makes offline evaluation of the explanations possible, as we formulate the problem as a ranking-oriented task. Learning to explain has been explored in some other application domains, such as relation prediction in knowledge graphs [15, 41], interpreting a model using instance-wise features [5], and explaining patterns in the data via logic inductive learning [45]. However, it has been less explored how learning to explain would benefit recommender systems.

[Figure 1: Three user reviews for different restaurants from Yelp. Sentences that can be seen as explanations are highlighted in colors. Co-occurring explanations across different reviews are emphasized in rectangles.]

Our formulation can be adapted to various explanation styles, such as sentences, images, and even new styles yet to be found, as long as the corresponding datasets are available. As an instantiation, from user reviews that reflect users' authentic evaluation of items, we extract representative sentences as explanations, because not all sentences can serve an explanation purpose [46]. Specifically, we keep a sentence if it contains both noun(s) and adjective(s), to ensure that it talks about some item features with certain opinions. We remove sentences that contain personal pronouns, e.g., "we", because they are usually related to one's own experience, which tends to be less informative (see in Fig. 1 the last review's first paragraph).
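A minimal sketch of this filtering rule, assuming NLTK's default part-of-speech tagger and Penn Treebank tags; the tagger choice and function name are illustrative, not details of our implementation:

```python
import nltk  # assumes the punkt and averaged_perceptron_tagger data are installed


def keep_as_explanation(sentence: str) -> bool:
    """Keep a sentence only if it mentions an item feature with an opinion
    (noun + adjective) and contains no personal pronouns such as "we"."""
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(sentence))]
    has_noun = any(t.startswith("NN") for t in tags)       # NN, NNS, NNP, ...
    has_adjective = any(t.startswith("JJ") for t in tags)  # JJ, JJR, JJS
    has_pronoun = any(t in ("PRP", "PRP$") for t in tags)  # personal pronouns
    return has_noun and has_adjective and not has_pronoun


print(keep_as_explanation("Great place for breakfast"))  # True
print(keep_as_explanation("We waited for a long time"))  # False
```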
We further retain the co-occurring explanations across different reviews (in rectangles in Fig. 1) to form the user-item-explanation interactions. As a result, the traditional user-item pairs are extended to user-item-explanation triplets. However, whereas in traditional pair-wise data each user may be associated with several items, in the user-item-explanation data each user-item pair may be associated with only one or even no explanation. In consequence, the data sparsity problem is more severe for explanation ranking, and designing an effective model for such a one-shot learning scenario becomes a great challenge. Our solution is to separate user-item-explanation triplets into user-explanation and item-explanation pairs, which significantly alleviates the data sparsity problem. On top of it, we design two types of models. The first is a general model that only makes use of IDs and can thus accommodate various explanation styles (e.g., visual explanations). The second is a domain-specific model based on BERT [12] that further leverages the textual features of the explanations to enhance the ranking performance.

The aforementioned evaluation problem also leads to the issue that the potential impacts of explanations, such as a higher chance of item clicks and fairness [38], which commercial systems particularly care about, have been less explored. Without an appropriate measure for explanation evaluation, explanations have usually been modeled as an auxiliary function of the recommendation task in most explainable models [3, 9, 27, 37, 47]. Recent works [10, 13] that jointly model the explanation task and the recommendation task find that the two tasks can influence each other. In particular, [8] shows that fine-tuning a parallel task of feature ranking can boost the recommendation performance. Along this line of research, we design an item-explanation joint-ranking framework to study whether showing particular explanations would lead to an increased item acceptance rate (i.e., improved recommendations). Furthermore, we are motivated to identify how the recommendation task and the explanation task interact with each other, whether there is a trade-off between them, and how to achieve the most ideal solution for both.

In summary, our key contributions are as follows:
• We introduce the concept of learning to explain for explainable recommendation research, and formulate the explanation problem as a ranking-oriented task, which allows us to evaluate explainability via standard evaluation metrics such as NDCG, precision and recall.
• We design an effective solution, applied to two types of models (with and without semantic features of the explanations), to address the data sparsity issue in the explanation ranking task. Extensive experiments on real-world datasets show its effectiveness against strong baselines.
• We propose an item-explanation joint-ranking framework that can achieve the designed goals, i.e., improving the performance of both recommendation and explanation, as evidenced by our experimental results.
• We construct three large datasets, which extend traditional pair-wise user-item interactions to triple-wise user-item-explanation records to facilitate research on explainable recommendation. Datasets and code will be released soon.

In the following, we first summarize related work in Section 2, and then formulate the problems in Section 3. Our proposed models and the joint-ranking framework are presented in Section 4. Section 5 introduces the experimental setup, and the discussion of results is provided in Section 6. We conclude this work with outlooks in Section 7.
2 RELATED WORK
Recent years have witnessed a growing interest in explainable recommendation [2, 3, 7, 9, 10, 13, 21–23, 27, 37, 42, 47]. In these works, there is a variety of explanation styles, including visual highlights [7], textual highlights [27, 37], item neighbors [24], word clouds [47], item features [14], pre-defined templates [13, 21, 47], automatically generated text [10, 22, 23], retrieved text [2, 3, 9, 42], etc. The last style is related to this paper, but explanations in these works are merely side outputs of their models. As a result, none of these works measured explanation quality with benchmark metrics. In comparison, we propose learning to explain for explainable recommendation and formulate the explanation task as a learning to rank [26] problem, which enables offline evaluation via ranking-oriented metrics.

In more detail, we model the user-item-explanation relations for both item and explanation ranking. A previous work [14] similarly considers user-item-aspect relations as a tripartite graph, where aspects are extracted from user reviews. The data sparsity issue in our problem is more severe than that in [14], since identical sentences appear far less frequently across different reviews than aspects do. Another branch of related work is tag recommendation for folksonomies [16, 17, 34, 43], where tags are ranked for each given user-item pair. In terms of problem setting, our work is different from the preceding two, because they solely rank either items/aspects [14] or tags [16, 17, 34, 43], while we additionally rank item-explanation pairs as a whole in our joint-ranking framework. Another difference is that we extract semantic features from explanations to enhance the performance of explanation ranking, while none of them did so.

Applications of learning to explain can be found in other domains as well. For instance, in [5] a function is learned from the perspective of mutual information to select features for each sample so as to interpret classification models. Another example is learning to explain with complemental examples [18], which can not only perform classification but also produce textual explanations along with a set of visual examples. Compared with these two examples, the works on explaining entity relationships in knowledge graphs [15, 41] are more relevant to ours. The major difference is that those works heavily rely on the semantic features of explanations, either constructed manually [41] or extracted automatically [15], while one of our models leverages only the relations of explanations to users and items, without considering such information.

In our work, there is a pool of candidate explanations to select from for each user-item pair. A recent online experiment [44] conducted on Microsoft Office 365 shows that this type of globally shared explanation is indeed helpful to users. The main focus of that work is to study how users perceive explanations, which is different from ours that aims to design effective models to rank explanations. Despite that, their findings motivate us to provide better explanations that could lead to improved recommendations. However, the explanations in that study are manually designed and thus the number of possible explanations is small. In comparison, our datasets contain a variety of explanations that are commonly used sentences in user reviews, e.g., "Great place for breakfast" in Fig. 1, as inspired by the wisdom of the crowd.

User reviews are widely adopted for research on explainable recommendation [2, 3, 7, 9, 10, 22, 23, 42] because they justify why users like/dislike the consumed products, which could provide suggestions for other users. However, directly taking reviews [2, 3, 28] or their first sentences [10] as explanations is less appropriate, as they may contain noisy content (e.g., personal experience) that is not informative enough for explanation purposes. As a comparison, our strategy is to adopt representative sentences that assess items on particular aspects.

As discussed earlier, our framework can be applied to a broad spectrum of explanation styles, e.g., visual explanations. In this work, we instantiate it on review sentences. There are some methods [10, 23] that generate reviews/tips for recommendations and regard text similarity metrics (i.e., BLEU [31] and ROUGE [25]) as measures of explainability. This evaluation practice is still under debate, as [4, 22] point out that text similarity does not equal explainability. For example, when the ground-truth is "sushi is good", the two generated explanations "ramen is good" and "sushi is delicious" gain the same score on the two metrics. However, from the perspective of explainability, the latter is obviously more related to the ground-truth, as both refer to the same feature "sushi", but the metrics fail to reflect this.
3 PROBLEM FORMULATION
The key notations and concepts for the problems are presented in Table 1. We use U to denote the set of all users, I the set of all items and E the set of all explanations. The historical interaction set is then given by T ⊆ U × I × E. In the following, we briefly introduce item ranking and explanation ranking, followed by item-explanation joint-ranking.

Table 1: Key notations and concepts.

Symbol       Description
T            training set
U            set of users
I            set of items
I_u          set of items that user u preferred
E            set of explanations
E_u          set of user u's explanations
E_i          set of item i's explanations
E_{u,i}      set of explanations that user u preferred on item i
P            latent factor matrix for users
Q            latent factor matrix for items
O            latent factor matrix for explanations
p_u          latent factors of user u
q_i          latent factors of item i
o_e          latent factors of explanation e
b_i          bias term of item i
b_e          bias term of explanation e
d            dimension of latent factors
α, λ         regularization coefficients
γ            learning rate
T            iteration number
M            number of recommendations for each user
N            number of explanations for each recommendation
r̂_{u,i}      score predicted for user u on item i
r̂_{u,i,e}    score predicted for user u on explanation e of item i
Item Ranking. Personalized recommendation aims at providing a user with a ranked list of items with which he/she has never interacted before. For each user u ∈ U, the list of M items can be generated as follows:

    Top(u, M) := argmax^{M}_{i ∈ I\I_u} r̂_{u,i}    (1)

where r̂_{u,i} is the predicted score for a known user u on item i, and I\I_u denotes the set of items with which user u has no interactions. The argmax in Eq. (1) is taken over i, which means that we aim to rank the items.

Explanation Ranking. Explanation ranking is the task of finding a list of appropriate explanations for a user-item pair to justify a recommendation. Formally, given a user u ∈ U and an item i ∈ I, the goal of this task is to rank the entire collection of explanations E, and select the top N to reason about why item i is recommended. Specifically, we define this list of top-N explanations as:

    Top(u, i, N) := argmax^{N}_{e ∈ E} r̂_{u,i,e}    (2)

where r̂_{u,i,e} is the estimated score of explanation e for a given user-item pair (u, i). The pair could be given by any recommendation model or by the user's true behavior.

Item-Explanation Joint-Ranking. The preceding tasks solely rank either items or explanations. In this task, we further investigate whether it is possible to find an ideal item-explanation pair for a user, to whom the explanation best justifies the item that he/she likes the most. To this end, we treat each item-explanation pair as a joint unit, and then rank these units. Specifically, for each user u ∈ U, a ranked list of M item-explanation pairs can be produced as follows:

    Top(u, M) := argmax^{M}_{i ∈ I\I_u, e ∈ E} r̂_{u,i,e}    (3)

where r̂_{u,i,e} is the predicted score for a given user u on the item-explanation pair (i, e). Both the item ranking and explanation ranking tasks are special cases of this item-explanation joint-ranking task. Concretely, Eq. (3) degenerates to Eq. (1) when explanation e is fixed, while it reduces to Eq. (2) if item i is known a priori.

Suppose we have an ideal model that can perform the aforementioned joint-ranking task. During the prediction stage as in Eq. (3), there would be |I| × |E| candidate item-explanation pairs to rank for each user u ∈ U. The runtime complexity then is O(|U| · |I| · |E|), which makes this task impractical compared with the traditional recommendation task's O(|U| · |I|) complexity.

To reduce the complexity, we reformulate the joint-ranking task by performing ranking for items and explanations simultaneously but separately. In this way, we are also able to investigate whether item ranking and explanation ranking can influence each other, e.g., improve each other's performance. Specifically, during the testing stage, we first follow Eq. (1) to rank items for each user u ∈ U, which has a runtime complexity of O(|U| · |I|). After that, for the M recommendations for each user, we rank and select explanations to justify each of them according to Eq. (2). The second step's complexity is O(|U| · M · |E|), but since M is a constant and |E| ≪ |I| (see Table 2), the overall complexity of the two steps remains O(|U| · |I|).
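To make the two-step procedure concrete, the following is a minimal sketch, assuming NumPy arrays of precomputed scores; all names and shapes are illustrative:

```python
import numpy as np


def top_items(r_u: np.ndarray, interacted: set, M: int) -> list:
    """Eq. (1): rank the items a user has not interacted with."""
    order = np.argsort(-r_u)  # descending by predicted score
    return [i for i in order if i not in interacted][:M]


def top_explanations(r_uie: np.ndarray, N: int) -> list:
    """Eq. (2): rank the whole explanation pool for one (u, i) pair,
    given a vector of scores over all e in E."""
    return list(np.argsort(-r_uie)[:N])


# Two-step joint ranking for one user: score all items first, then score
# the explanation pool only for the M recommended items.
```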
4 METHODOLOGY
In the following, we first analyze the drawback of the conventional Tensor Factorization (TF) model for the explanation ranking problem, and then introduce our solution. Second, we show how to enhance it by utilizing the semantic features of explanations. Third, we discuss the advantages of our two methods, and illustrate their relation to two typical TF methods. Finally, we integrate explanation ranking and item ranking into a multi-task learning framework for the joint-ranking task.

To perform explanation ranking, the score r̂_{u,i,e} of each explanation e ∈ E for a given user-item pair (u, i) must be estimated. As the user-item-explanation ternary relations T = {(u, i, e) | u ∈ U, i ∈ I, e ∈ E} form an interaction cube, we are inspired to employ factorization models to predict the scores. There are a number of tensor factorization techniques, such as Tucker Decomposition (TD) [40], Canonical Decomposition (CD) [1] and High Order Singular Value Decomposition (HOSVD) [11]. Intuitively, one would adopt CD, because of its linear runtime complexity for both training and prediction [34] and its close relation to Matrix Factorization (MF) [30], which has been extensively studied over the years for item recommendation. Formally, the score r̂_{u,i,e} of user u on item i's explanation e can be estimated by the sum over the element-wise multiplication of the user's latent factors p_u, the item's q_i and the explanation's o_e:

    r̂_{u,i,e} = (p_u ⊙ q_i)^⊤ o_e = Σ_{k=1}^{d} p_{u,k} · q_{i,k} · o_{e,k}    (4)

where ⊙ denotes the element-wise multiplication of two vectors.

However, this method may not be effective enough, due to the inherent sparsity of the ternary data. Since each user-item pair (u, i) in the training set T is unlikely to interact with many explanations in E, the data sparsity problem for explanation ranking is destined to be more severe than that for item recommendation. Simply multiplying the three vectors would hurt the performance of explanation ranking, as evidenced by our experimental results in Section 6.

To mitigate this issue, and thereby improve the effectiveness of explanation ranking, we propose to separately estimate user u's preference score r̂_{u,e} for explanation e and item i's suitableness score r̂_{i,e} for explanation e. To this end, we perform two sets of matrix factorization rather than employing one single TF model. In this way, the sparsity problem is considerably alleviated, since the data are reduced to two collections of binary relations, both of which are similar to the case of item recommendation discussed above. At last, the two scores r̂_{u,e} and r̂_{i,e} are combined linearly through a hyper-parameter μ. Specifically, the score of user u for item i on explanation e is predicted as follows:

    r̂_{u,e} = p_u^⊤ o^U_e + b^U_e = Σ_{k=1}^{d} p_{u,k} · o^U_{e,k} + b^U_e
    r̂_{i,e} = q_i^⊤ o^I_e + b^I_e = Σ_{k=1}^{d} q_{i,k} · o^I_{e,k} + b^I_e
    r̂_{u,i,e} = μ · r̂_{u,e} + (1 − μ) · r̂_{i,e}    (5)

where {o^U_e, b^U_e} and {o^I_e, b^I_e} are two different sets of latent factors and biases for explanations, corresponding to users and items respectively.
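The following toy snippet contrasts the CD score of Eq. (4) with the BPER score of Eq. (5); the arrays stand in for learned latent factors and all values are arbitrary:

```python
import numpy as np

d = 8
rng = np.random.default_rng(0)
p_u, q_i = rng.random(d), rng.random(d)            # user/item latent factors
o_e = rng.random(d)                                # CD's explanation factors
o_u_e, o_i_e = rng.random(d), rng.random(d)        # BPER's two factor sets
b_u_e, b_i_e = 0.1, 0.2                            # explanation bias terms
mu = 0.7                                           # user/item balance

r_cd = np.sum(p_u * q_i * o_e)                     # Eq. (4): triple product
r_ue = p_u @ o_u_e + b_u_e                         # user-explanation score
r_ie = q_i @ o_i_e + b_i_e                         # item-explanation score
r_bper = mu * r_ue + (1 - mu) * r_ie               # Eq. (5): linear blend
```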
Since selecting explanations that are likely to be perceived as helpful by users is inherently a ranking-oriented task, directly modeling the relative order of explanations is more effective than simply predicting their absolute scores. The Bayesian Personalized Ranking (BPR) criterion [33] meets this optimization requirement. Intuitively, a user would be more likely to appreciate explanations that cater to his/her own preferences, while those that do not fit one's interests would be less attractive. Similarly, some explanations might be more suitable for describing certain items, while others might not. To build such pair-wise preferences, we compute the difference between two explanations for both user u and item i as follows:

    r̂_{u,ee′} = r̂_{u,e} − r̂_{u,e′}  and  r̂_{i,ee″} = r̂_{i,e} − r̂_{i,e″}    (6)

which respectively reflect user u's preference for explanation e over e′, and item i's suitableness for explanation e over e″.

[Figure 2: Tensor Factorization models: (a) our Bayesian Personalized Explanation Ranking (BPER); (b) our BERT-enhanced BPER (BPER+); (c) Canonical Decomposition (CD); (d) Pairwise Interaction Tensor Factorization (PITF). The three matrices (i.e., P, Q, O) are model parameters. Our BPER and BPER+ can be regarded as special cases of CD, while PITF can be seen as a special case of our BPER and BPER+.]

With the scores r̂_{u,ee′} and r̂_{i,ee″}, we can then adopt the BPR criterion [33] to minimize the following objective function:

    min_Θ Σ_{u∈U} Σ_{i∈I_u} Σ_{e∈E_{u,i}} [ Σ_{e′∈E\E_u} −ln σ(r̂_{u,ee′}) + Σ_{e″∈E\E_i} −ln σ(r̂_{i,ee″}) ] + λ‖Θ‖²_F    (7)

where σ(·) denotes the sigmoid function, I_u represents the set of items that user u interacted with, E_{u,i} is the set of explanations in the training set for the user-item pair (u, i), E\E_u and E\E_i respectively correspond to user u's and item i's uninteracted explanations, Θ is the set of model parameters, and λ is the regularization coefficient.

From Eq. (7), we can see that there are two explanation tasks to be learned, corresponding to users and items. We let them be equally important during the training stage, because the hyper-parameter μ in Eq. (5) balances their importance during the prediction stage; the effect of this parameter is studied in Section 6.1. After the model parameters are estimated, we can rank explanations according to Eq. (2) for each user-item pair in the testing set. As we model the explanation ranking task under the BPR criterion, we accordingly name our method Bayesian Personalized Explanation Ranking (BPER). To learn the model parameters Θ, we draw on the widely used stochastic gradient descent algorithm to optimize the objective function in Eq. (7). The complete learning steps are shown in Algorithm 1 in the Appendix.

Note that BPER might be further enriched by considering more complex relations between explanations (e.g., ranking a user's positive explanations above other users' explanations, and those above the user's negative explanations). Since we are the first to conduct explanation ranking, we want the model to be as simple as possible, and therefore leave this for future work.

The BPER model only makes use of the IDs of users, items and explanations for explanation ranking. We further investigate whether it can be enhanced by the semantic features of the explanations. To this end, we opt for BERT [12], a well-known pre-trained language model whose effectiveness has been demonstrated on a wide range of natural language understanding tasks. Specifically, after passing an explanation e's textual content, e.g., "Great place for breakfast", through BERT, we obtain a vector o^BERT_e that encodes the explanation's semantic meaning. Then, we enhance the two explanation-related vectors o^U_e and o^I_e in Eq. (5) by multiplying them with o^BERT_e, resulting in o^{U+}_e and o^{I+}_e:

    o^{U+}_e = o^U_e ⊙ o^BERT_e
    o^{I+}_e = o^I_e ⊙ o^BERT_e    (8)

After replacing o^U_e and o^I_e in Eq. (5) with o^{U+}_e and o^{I+}_e, the remaining steps are the same as BPER's, and the model's parameters (including BERT's) can be updated via back-propagation. Notice that in Eq. (8) we adopt the multiplication operation simply to verify the feasibility of incorporating semantic features. The model may be further improved by more sophisticated operations, but we leave that for future work. We refer to this BERT-enhanced variant as BPER+.
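The sketch below illustrates one plausible way to obtain o^BERT_e with the HuggingFace transformers package, taking the final [CLS] hidden state as the sentence vector; this pooling choice and all variable names are assumptions for illustration:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")


def bert_vector(explanation: str) -> torch.Tensor:
    """Encode an explanation sentence into a semantic vector o^BERT_e
    (here: the final [CLS] hidden state, one plausible choice)."""
    inputs = tokenizer(explanation, return_tensors="pt")
    return bert(**inputs).last_hidden_state[:, 0, :].squeeze(0)


o_bert = bert_vector("Great place for breakfast")  # shape: (768,)
o_u = torch.randn(768, requires_grad=True)         # ID embedding o^U_e
o_i = torch.randn(768, requires_grad=True)         # ID embedding o^I_e
o_u_plus = o_u * o_bert                            # Eq. (8), user side
o_i_plus = o_i * o_bert                            # Eq. (8), item side
```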
BPER is a general method that only requires the IDs of users, items and explanations, which makes it very flexible to adapt to other explanation styles (e.g., images). BPER+ is a domain-specific method enhanced by the semantic features extracted from textual explanations, and so can be more effective. As the first work on ranking explanations for recommendations, we opt to keep both methods relatively simple for reproducibility. In this way, it is also easy to interpret the experimental results (e.g., the impact of the explanation task on the recommendation task) without the interference of other factors.

Next, we analyze the relation between our BPER and two closely related Tensor Factorization methods: Canonical Decomposition (CD) [1] and Pairwise Interaction Tensor Factorization (PITF) [34]. In a similar way, BPER+ can also be rewritten as CD or PITF, but we omit that analysis due to space limitations. A graphical illustration of the four models is shown in Fig. 2. Concretely, BPER is a special case of the CD model: with a CD model of dimensionality 2·d + 2, BPER can be reformulated as CD as follows:

    p^CD_{u,k} = μ · p_{u,k} if k ≤ d;  μ if k = 2·d + 1;  1 otherwise
    q^CD_{i,k} = (1 − μ) · q_{i,k} if d < k ≤ 2·d;  (1 − μ) if k = 2·d + 2;  1 otherwise
    o^CD_{e,k} = o^U_{e,k} if k ≤ d;  o^I_{e,k} if d < k ≤ 2·d;  b^U_e if k = 2·d + 1;  b^I_e otherwise    (9)

where the hyper-parameter μ becomes a constant once the learning of our model is complete.

In the meantime, PITF can be seen as a special case of our BPER. Formally, its predicted score r̂_{u,i,e} for the user-item-explanation triplet (u, i, e) is calculated by:

    r̂_{u,i,e} = p_u^⊤ o^U_e + q_i^⊤ o^I_e = Σ_{k=1}^{d} p_{u,k} · o^U_{e,k} + Σ_{k=1}^{d} q_{i,k} · o^I_{e,k}    (10)

Our BPER degenerates to PITF if, in Eq. (5), we remove the bias terms b^U_e and b^I_e and set the hyper-parameter μ to 0.5, which means that the two types of scores, for users and items, are equally important for the explanation ranking task.

Although CD is more general than our BPER, its performance may be affected by the data sparsity issue discussed before. Our BPER mitigates this problem owing to its explicitly designed structure, which may be difficult for CD to learn from scratch. Compared with PITF, the hyper-parameter μ in BPER can balance the importance of the two types of scores, corresponding to users and items, which makes BPER more expressive than PITF and thus able to reach better ranking quality.
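As a sanity check on Eq. (9), the toy snippet below builds the (2·d + 2)-dimensional CD factors and verifies that they reproduce the BPER score of Eq. (5); all numbers are arbitrary:

```python
import numpy as np

d, mu = 4, 0.7
rng = np.random.default_rng(1)
p_u, q_i = rng.random(d), rng.random(d)
o_u_e, o_i_e = rng.random(d), rng.random(d)
b_u_e, b_i_e = 0.3, -0.2

# BPER score, Eq. (5)
r_bper = mu * (p_u @ o_u_e + b_u_e) + (1 - mu) * (q_i @ o_i_e + b_i_e)

# CD reformulation, Eq. (9): factors of dimension 2*d + 2
p_cd = np.concatenate([mu * p_u, np.ones(d), [mu, 1.0]])
q_cd = np.concatenate([np.ones(d), (1 - mu) * q_i, [1.0, 1 - mu]])
o_cd = np.concatenate([o_u_e, o_i_e, [b_u_e, b_i_e]])

r_cd = np.sum(p_cd * q_cd * o_cd)  # Eq. (4) on the expanded factors
assert np.isclose(r_bper, r_cd)
```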
Owing to BPER's flexibility to accommodate different explanation styles, we build the joint-ranking on it. Specifically, we incorporate the two tasks of explanation ranking and item recommendation into a unified multi-task learning framework, so as to find a good solution that benefits both of them. For recommendation, we adopt the Singular Value Decomposition (SVD) model [19] to predict the score r̂_{u,i} of user u on item i:

    r̂_{u,i} = p_u^⊤ q_i + b_i = Σ_{k=1}^{d} p_{u,k} · q_{i,k} + b_i    (11)

where b_i is the bias term of item i. Notice that the latent factors p_u and q_i are shared with those for explanation ranking in Eq. (5). In essence, item recommendation is also a ranking task that can be optimized with the BPR criterion [33], so we first compute the preference difference r̂_{u,ii′} between a pair of items i and i′ for a user u as follows:

    r̂_{u,ii′} = r̂_{u,i} − r̂_{u,i′}    (12)

which can then be combined with the explanation ranking task in Eq. (7) to form the following objective function for joint-ranking:

    min_Θ Σ_{u∈U} Σ_{i∈I_u} [ Σ_{i′∈I\I_u} −ln σ(r̂_{u,ii′}) + α Σ_{e∈E_{u,i}} ( Σ_{e′∈E\E_u} −ln σ(r̂_{u,ee′}) + Σ_{e″∈E\E_i} −ln σ(r̂_{i,ee″}) ) ] + λ‖Θ‖²_F    (13)

where the parameter α can be fine-tuned to balance the learning of the two tasks. We name this method BPER-J, where J stands for joint-ranking. Similar to BPER, each parameter of BPER-J can be updated via stochastic gradient descent (see Algorithm 2 in the Appendix).
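For intuition, Eq. (13) can be written as a differentiable loss over a batch of sampled tuples (u, i, i′, e, e′, e″), as in the following PyTorch sketch; tensor names are illustrative, and the regularization term is assumed to be handled by the optimizer's weight decay:

```python
import torch
import torch.nn.functional as F


def bper_j_loss(r_ui, r_ui_neg, r_ue, r_ue_neg, r_ie, r_ie_neg, alpha):
    """Joint-ranking BPR loss of Eq. (13) for a batch of sampled tuples.

    r_ui / r_ui_neg:  scores for (u, i) and (u, i') from Eq. (11)
    r_ue / r_ue_neg:  scores for (u, e) and (u, e') from Eq. (5)
    r_ie / r_ie_neg:  scores for (i, e) and (i, e'') from Eq. (5)
    """
    rec = -F.logsigmoid(r_ui - r_ui_neg)      # item-ranking term
    exp = -F.logsigmoid(r_ue - r_ue_neg) \
          - F.logsigmoid(r_ie - r_ie_neg)     # two explanation terms
    return (rec + alpha * exp).mean()
```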
5 EXPERIMENTAL SETUP
For evaluation, the datasets are expected to contain user-item-explanation interaction triplets. We are therefore motivated to construct such datasets to facilitate research on explanation ranking. Concretely, we narrow the explanations down to textual sentences that can be extracted from user reviews. The key problem then is how to detect nearly identical sentences that co-occur in different reviews. A naive approach is to compare each pair of sentences in a dataset, but this takes quadratic time. To reduce the processing time, we resort to Locality Sensitive Hashing (LSH) [32], which can find near duplicates in linear time. The detailed procedure can be found in our code, after this paper is published.

To compare the performance of different methods, we adopt three large-scale datasets from different domains. The first, regarding Movies & TV, is from Amazon 5-core (http://jmcauley.ucsd.edu/data/amazon). The second, for the hotel domain, is constructed by crawling review records from the travel website TripAdvisor. The last one is from the Yelp Challenge 2019, which is about restaurants.

Each record in the three datasets consists of a user ID, an item ID, and a textual review. We split each review into sentences, and apply LSH over all sentences to obtain a number of sentence groups. Each group is treated as a candidate explanation, containing nearly identical sentences. We remove a group if it contains no more than 5 sentences, so as to retain the more reliable explanations. Each group is assigned an ID, which we refer to as the explanation ID in the following. For each user-item pair (u, i) in the data, if the corresponding review does not contain any sentence from any group, we remove the pair. Otherwise, if the review contains a sentence from group e, we construct a triplet (u, i, e). Notice that a review may contain more than one explanation, and thus may result in more than one triplet.
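A minimal sketch of this near-duplicate grouping step, assuming the datasketch package; the similarity threshold, tokenization, and permutation count are illustrative choices rather than our actual settings:

```python
from datasketch import MinHash, MinHashLSH


def minhash(sentence: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature over the sentence's lowercased tokens."""
    m = MinHash(num_perm=num_perm)
    for token in sentence.lower().split():
        m.update(token.encode("utf8"))
    return m


sentences = ["great place for breakfast", "a great place for breakfast!",
             "the room was clean"]
lsh = MinHashLSH(threshold=0.6, num_perm=128)
signatures = [minhash(s) for s in sentences]
for idx, sig in enumerate(signatures):
    lsh.insert(str(idx), sig)

# Sentences whose estimated Jaccard similarity exceeds the threshold fall
# into the same candidate group (here: the two "breakfast" sentences).
print(lsh.query(signatures[0]))
```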
The statistics of the three datasets after processing are presented in Table 2. As can be seen, the data sparsity issue on all three datasets is very severe.

Table 2: Statistics of the datasets. Density is the ratio of observed triplets to |U| × |I| × |E|.

                     Amazon     TripAdvisor   Yelp
(u, i, e) triplets   793,481    2,618,340     3,875,118
Density              45.71      13.88         2.07

We first evaluate the performance of the explanation ranking task, where the user-item pairs are given. The following baselines are adopted. Notice that we omit the comparison with Tucker Decomposition (TD) [40], because it takes cubic time to run and we also found that it does not outperform CD in our trial experiments.

• RAND: a weak baseline that randomly picks explanations from the collection. It is devised to examine whether personalization is needed for explanation ranking.
• RUCF: Revised User-based Collaborative Filtering. Because traditional CF methods [35, 36] cannot be directly applied to ternary data, we modify their formula, following [17]. The similarity between two users is measured over their associated explanation sets via the Jaccard index. When predicting the final score for the triplet (u, i, e), we first find the users associated with both item i and explanation e, i.e., U_i ∩ U_e, and then keep the ones that appear in user u's neighbor set N_u (a schematic sketch is given at the end of this subsection):

    r̂_{u,i,e} = Σ_{u′ ∈ N_u ∩ (U_i ∩ U_e)} s_{u,u′},  where s_{u,u′} = |E_u ∩ E_{u′}| / |E_u ∪ E_{u′}|    (14)

• RICF: Revised Item-based Collaborative Filtering. Accordingly, this method predicts a score for a triplet from the perspective of items, with a formula similar to Eq. (14).
• CD: Canonical Decomposition [1], as shown in Eq. (4). This method predicts one score instead of two for the triplet (u, i, e), so its objective function, shown below, is slightly different from ours in Eq. (7):

    min_Θ Σ_{u∈U} Σ_{i∈I_u} Σ_{e∈E_{u,i}} Σ_{e′∈E\E_{u,i}} −ln σ(r̂_{u,i,ee′}) + λ‖Θ‖²_F    (15)

  where r̂_{u,i,ee′} = r̂_{u,i,e} − r̂_{u,i,e′} is the score difference between a pair of interactions.
• PITF: Pairwise Interaction Tensor Factorization [34]. It makes predictions for a triplet based on Eq. (10), and its objective function is identical to CD's in Eq. (15).

To verify the effectiveness of the joint-ranking framework, in addition to our method BPER-J, we also present the results of two baselines derived from CD [1] and PITF [34], named CD-J and PITF-J, respectively. Since CD and PITF are not originally designed to perform the two tasks of item recommendation and explanation ranking together, we first let them predict a score for a user-item pair (u, i) via the inner product of the latent factors, i.e., r̂_{u,i} = p_u^⊤ q_i, and then combine this task with explanation ranking in a multi-task learning framework whose objective function is:

    min_Θ Σ_{u∈U} Σ_{i∈I_u} [ Σ_{i′∈I\I_u} −ln σ(r̂_{u,ii′}) + α Σ_{e∈E_{u,i}} Σ_{e′∈E\E_{u,i}} −ln σ(r̂_{u,i,ee′}) ] + λ‖Θ‖²_F    (16)

where r̂_{u,ii′} = r̂_{u,i} − r̂_{u,i′} is the score difference between a pair of records.
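As referenced in the RUCF description above, a schematic sketch of the scoring rule in Eq. (14), assuming plain dict-of-set data structures; all names are illustrative:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index between two explanation sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0


def rucf_score(u, i, e, E_user, U_item, U_expl, neighbors) -> float:
    """Eq. (14): sum the similarities of u's neighbors that are associated
    with both item i and explanation e.

    E_user[v]:    explanation set of user v
    U_item[i]:    users associated with item i
    U_expl[e]:    users associated with explanation e
    neighbors[u]: user u's neighbor set N_u
    """
    candidates = neighbors[u] & U_item[i] & U_expl[e]
    return sum(jaccard(E_user[u], E_user[v]) for v in candidates)
```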
To evaluate the performance of both recommendation and explanation, we adopt four ranking-oriented metrics commonly used in recommender systems: Normalized Discounted Cumulative Gain (NDCG), Precision (Pre), Recall (Rec) and F1. We evaluate the top-10 results for both the recommendation and explanation tasks. We randomly divide each dataset into training (70%) and testing (30%) sets, and guarantee that each user/item/explanation has at least one record in the training set. The splitting process is repeated 5 times, and the average performance is reported.

We implemented all the methods in Python. For the CF-based methods, i.e., RUCF and RICF, we allow the neighbor sets N_u and N_i to be as large as possible, so as to make full use of all available neighbors. For the TF-based methods, including CD, PITF, CD-J, PITF-J, and our BPER and BPER-J, we search the dimension of latent factors d in [10, 20, 30, 40, 50], the regularization coefficient λ in [0.001, 0.01, 0.1], the learning rate γ in [0.001, 0.01, 0.1], and the maximum iteration number T in [100, 500, 1000]. For the joint-ranking of CD-J, PITF-J and our BPER-J, the regularization coefficient α on the explanation task is searched in [0, 0.1, ..., 0.9, 1]. For the evaluation of joint-ranking, we first evaluate the performance of item recommendation for users, followed by the evaluation of explanation ranking on the correctly predicted user-item pairs. For our methods BPER and BPER-J, the parameter μ that balances user and item scores for explanation ranking is searched in [0, 0.1, ..., 0.9, 1]. After fine-tuning, we use γ = 0.01 and T = 500 for our methods, with d and λ fixed to their best values from the above ranges, while the other parameters α and μ are dependent on the dataset.

The configuration of BPER+ is slightly different, because of the textual content of the explanations. We adopt the pre-trained BERT from HuggingFace (https://huggingface.co/bert-base-uncased), and implement the model in Python with PyTorch. We set the batch size to 128 and T = 5, with d and λ again tuned from the above ranges. After parameter tuning, we set the learning rate γ to 0.0001 on Amazon, and to 0.00001 on both TripAdvisor and Yelp.

6 RESULTS AND ANALYSIS
In this section, we first present the comparison of our methods BPER and BPER+ against the baselines on explanation ranking, followed by an analysis of the three TF methods' joint-ranking results.
Table 3: Performance comparison of all methods on top-10 explanation ranking in terms of NDCG, Precision (Pre), Recall (Rec) and F1 (%). Improvements are made by BPER+ over the best baseline PITF (* indicates statistical significance over PITF; – marks an unavailable value).

        Amazon                              TripAdvisor                         Yelp
        NDCG@10 Pre@10  Rec@10  F1@10       NDCG@10 Pre@10  Rec@10  F1@10       NDCG@10 Pre@10  Rec@10  F1@10
CD      0.001   0.001   0.007   0.002       0.001   0.001   0.003   0.001       0.000   0.000   0.003   0.001
RAND    0.004   0.004   0.027   0.006       0.002   0.002   0.011   0.004       0.001   0.001   0.007   0.002
RUCF    0.341   0.170   1.455   0.301       0.260   0.151   0.779   0.242       0.040   0.020   0.125   0.033
RICF    0.417   0.259   1.797   0.433       0.031   0.020   0.087   0.030       0.037   0.026   0.137   0.042
PITF    2.352   1.824   14.125  3.149       1.239   1.111   5.851   1.788       0.712   0.635   4.172   1.068
BPER    –*      –*      –*      –           1.389*  1.236*  6.549*  1.992*      0.814*  0.723*  –*      1.218*
BPER+   2.877*  1.919*  14.936* 3.317*      –*      –*      –*      –*          –*      –*      4.544*  –*

Experimental results for explanation ranking on the three datasets are shown in Table 3. Each method's performance on the four metrics (NDCG, Precision, Recall, F1) is fairly consistent across the three datasets. RAND is among the weakest baselines, because it randomly selects explanations without considering user and item information, which implies that the explanation ranking task is non-trivial. CD performs even worse than RAND, because of the sparsity of the ternary data (see Table 2), which CD may not be able to mitigate, as discussed in Section 4.2. The CF-based methods, i.e., RUCF and RICF, largely improve over RAND, as they take into account the information of either users or items, which confirms the important role of personalization in explanation ranking. However, their performance is still limited by data sparsity. PITF and our BPER/BPER+ outperform the CF-based methods by a large margin, as they not only address the data sparsity issue via their MF-like model structure, but also take each user's/item's information into account using latent factors. Most importantly, our method BPER significantly outperforms the strongest baseline PITF, owing to its ability to produce two sets of scores, corresponding to users and items, and its hyper-parameter μ that balances their importance for explanation ranking. Lastly, BPER+ further improves over BPER on most of the metrics across the three datasets, especially NDCG, which cares about the ranking order, thanks to the semantic features of the explanations and BERT's strong language modeling capability to extract them.

[Figure 3: The effect of μ in BPER on explanation ranking on the three datasets. NDCG@10, Pre@10 and F1@10 are linearly transformed into a certain range for better visualization.]

Next, we analyze the parameter μ of BPER, which controls the contribution of the user scores and item scores in Eq. (5). As can be seen in Fig. 3, the curves of NDCG, Precision, Recall and F1 are all bell-shaped: performance improves significantly as μ increases until it reaches an optimal point, and then drops sharply. Due to the characteristics of different application domains, the optimal points vary from dataset to dataset, i.e., 0.7 for both Amazon and Yelp and 0.5 for TripAdvisor. We omit the figures for BPER+, because the pattern is similar.
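For reference, a small sketch of the NDCG@10 computation used in this kind of offline evaluation, assuming binary relevance against the held-out explanation set E_{u,i}; this illustrates the standard metric rather than our exact evaluation code:

```python
import numpy as np


def ndcg_at_k(ranked: list, relevant: set, k: int = 10) -> float:
    """NDCG@k with binary gains: 1 if a ranked explanation is in the
    held-out set, 0 otherwise."""
    gains = [1.0 if e in relevant else 0.0 for e in ranked[:k]]
    dcg = sum(g / np.log2(r + 2) for r, g in enumerate(gains))
    ideal = sum(1.0 / np.log2(r + 2) for r in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0
```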
We perform joint-ranking with the three TF models, i.e., BPER-J, CD-J and PITF-J. Because the experimental results on the different datasets are consistent, we only show results on Amazon and TripAdvisor.

[Figure 4: The effect of α in the three TF methods with joint-ranking on two datasets. Exp and Rec respectively denote the explanation and recommendation tasks. F1@10 for Rec is linearly transformed into a certain range for better visualization.]

In Fig. 4, we study the effect of the parameter α on both explanation ranking and item ranking in terms of F1 (results on the other three metrics are consistent). In each sub-figure, the green dotted line represents the performance of the explanation ranking task without joint-ranking, whose value is taken from Table 3. All the points on the explanation curve (in red) are above this line when α is greater than 0, suggesting that the explanation task benefits from the recommendation task under the joint-ranking framework. In particular, the explanation performance of CD-J improves dramatically under the joint-ranking framework, since the recommendation task suffers less from the data sparsity issue than the explanation task, as discussed in Section 4.2; it in turn helps to better rank the explanations. Meanwhile, for the recommendation task, all three models degenerate to BPR when α is set to 0. Therefore, on the recommendation curves (in blue), any point whose value is greater than that of the starting point gains from the explanation task as well. All these observations show the effectiveness of our joint-ranking framework in helping the two tasks benefit from each other.

Table 4: Self-comparison of the three TF methods on two datasets with or without joint-ranking in terms of NDCG (N for short) and F1 (denoted as F). Top-10 results are evaluated for both the explanation (Exp) and recommendation (Rec) tasks. The improvements are made by the best performance of each task via joint-ranking over that of the solely learned counterpart (– marks an unavailable value).

           BPER-J                                CD-J                                  PITF-J
           Amazon            TripAdvisor         Amazon            TripAdvisor         Amazon            TripAdvisor
           Exp (%)  Rec (‰)  Exp (%)  Rec (‰)    Exp (%)  Rec (‰)  Exp (%)  Rec (‰)    Exp (%)  Rec (‰)  Exp (%)  Rec (‰)
           N    F   N    F   N    F   N    F     N    F   N    F   N    F   N    F     N    F   N    F   N    F   N    F
Solely     2.6  3.4 6.6  8.1 1.4  2.0 5.3  7.1   0.0  0.0 6.5  7.9 0.0  0.0 4.5  4.8   2.4  3.2 6.5  7.7 1.2  1.8 4.3  4.7
Best Exp   3.3  –   –    –   –    –   –    –     –    –   –    –   –    –   –    –     –    –   –    –   –    –   –    –
Best Rec   2.6  –   –    –   –    –   –    –     –    –   –    –   –    –   –    –     –    –   –    –   –    –   –    –

In Table 4, we make a self-comparison of the three methods in terms of NDCG and F1 (the results on the other two metrics are similar). In this table, "Solely" corresponds to each model's performance on explanation and recommendation when the two tasks are individually learned. In other words, the explanation performance is taken from Table 3 correspondingly, and the recommendation performance is evaluated with α = 0. Moreover, "Best Exp" and "Best Rec" denote the best performance of each method on the explanation task and the recommendation task, respectively, under the joint-ranking framework. When the recommendation performance is the best, the explanation performance is always improved for all models. Although minor recommendation accuracy is sacrificed when the explanation task reaches its best performance, we can always find points where both tasks are improved, e.g., on the top left of Fig. 4, when α is in the range of 0.1 to 0.6 for BPER-J on Amazon. This again demonstrates our joint-ranking framework's capability to find good solutions for both tasks.
7 CONCLUSION
In this paper, we propose learning to explain for modeling explainable recommendation and formulate the problem as a ranking task, which enables offline evaluation of explainability. To facilitate the development of explainable recommendation, we construct three large datasets, based on which we develop two effective models to address the data sparsity issue. We further design an item-explanation joint-ranking framework that can improve the performance of both the recommendation and explanation tasks.

As future work, we are interested in considering the relationships between suggested explanations (e.g., coherence [20] and diversity) to further improve explainability. In addition, we plan to conduct experiments in real-world systems to validate whether recommendations and their associated explanations produced by the joint-ranking framework can influence users' behavior, e.g., clicking and purchasing. Besides, the joint-ranking framework in this paper aims to improve the recommendation performance by providing explanations, while in the future we will also consider improving other objectives with explanations, such as recommendation serendipity [6, 24] and fairness [38].
APPENDIX
For the learning of both BPER (Algorithm 1) and BPER-J (Algorithm 2), we first randomly initialize the parameters, and then repeatedly update them by uniformly drawing samples from the training set and computing the gradients with respect to the parameters, until the algorithms converge.

Algorithm 1: Bayesian Personalized Explanation Ranking (BPER)

Input: training set T, dimension of latent factors d, learning rate γ, regularization coefficient λ, iteration number T
Output: model parameters Θ = {P, Q, O^U, O^I, b^U, b^I}
Initialize Θ: P ∈ R^{|U|×d}, Q ∈ R^{|I|×d}, O^U ∈ R^{|E|×d}, O^I ∈ R^{|E|×d}, b^U ∈ R^{|E|}, b^I ∈ R^{|E|}
for t = 1, ..., T do
    for step = 1, ..., |T| do
        Uniformly draw (u, i, e) from T, e′ from E\E_u, and e″ from E\E_i
        r̂_{u,ee′} ← r̂_{u,e} − r̂_{u,e′};  r̂_{i,ee″} ← r̂_{i,e} − r̂_{i,e″}
        x ← −σ(−r̂_{u,ee′});  y ← −σ(−r̂_{i,ee″})
        p_u ← p_u − γ · (x · (o^U_e − o^U_{e′}) + λ · p_u)
        q_i ← q_i − γ · (y · (o^I_e − o^I_{e″}) + λ · q_i)
        o^U_e ← o^U_e − γ · (x · p_u + λ · o^U_e)
        o^I_e ← o^I_e − γ · (y · q_i + λ · o^I_e)
        o^U_{e′} ← o^U_{e′} − γ · (−x · p_u + λ · o^U_{e′})
        o^I_{e″} ← o^I_{e″} − γ · (−y · q_i + λ · o^I_{e″})
        b^U_e ← b^U_e − γ · (x + λ · b^U_e);  b^I_e ← b^I_e − γ · (y + λ · b^I_e)
        b^U_{e′} ← b^U_{e′} − γ · (−x + λ · b^U_{e′});  b^I_{e″} ← b^I_{e″} − γ · (−y + λ · b^I_{e″})
    end for
end for

Algorithm 2: Joint-Ranking on BPER (BPER-J)

Input: training set T, dimension of latent factors d, learning rate γ, regularization coefficients α and λ, iteration number T
Output: model parameters Θ = {P, Q, O^U, O^I, b, b^U, b^I}
Initialize Θ: P ∈ R^{|U|×d}, Q ∈ R^{|I|×d}, O^U ∈ R^{|E|×d}, O^I ∈ R^{|E|×d}, b ∈ R^{|I|}, b^U ∈ R^{|E|}, b^I ∈ R^{|E|}
for t = 1, ..., T do
    for step = 1, ..., |T| do
        Uniformly draw (u, i, e) from T, e′ from E\E_u, e″ from E\E_i, and i′ from I\I_u
        r̂_{u,ee′} ← r̂_{u,e} − r̂_{u,e′};  r̂_{i,ee″} ← r̂_{i,e} − r̂_{i,e″};  r̂_{u,ii′} ← r̂_{u,i} − r̂_{u,i′}
        x ← −α · σ(−r̂_{u,ee′});  y ← −α · σ(−r̂_{i,ee″});  z ← −σ(−r̂_{u,ii′})
        p_u ← p_u − γ · (x · (o^U_e − o^U_{e′}) + z · (q_i − q_{i′}) + λ · p_u)
        q_i ← q_i − γ · (y · (o^I_e − o^I_{e″}) + z · p_u + λ · q_i)
        q_{i′} ← q_{i′} − γ · (−z · p_u + λ · q_{i′})
        o^U_e ← o^U_e − γ · (x · p_u + λ · o^U_e)
        o^I_e ← o^I_e − γ · (y · q_i + λ · o^I_e)
        o^U_{e′} ← o^U_{e′} − γ · (−x · p_u + λ · o^U_{e′})
        o^I_{e″} ← o^I_{e″} − γ · (−y · q_i + λ · o^I_{e″})
        b_i ← b_i − γ · (z + λ · b_i);  b_{i′} ← b_{i′} − γ · (−z + λ · b_{i′})
        b^U_e ← b^U_e − γ · (x + λ · b^U_e);  b^I_e ← b^I_e − γ · (y + λ · b^I_e)
        b^U_{e′} ← b^U_{e′} − γ · (−x + λ · b^U_{e′});  b^I_{e″} ← b^I_{e″} − γ · (−y + λ · b^I_{e″})
    end for
end for
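For readers who prefer code, the following compact NumPy sketch mirrors one SGD sweep of Algorithm 1; the sizes and hyper-parameter values are illustrative, and updates follow the listing above (parameters are updated in place, one sampled tuple at a time):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_expl, d = 100, 200, 300, 16   # toy sizes
gamma, lam = 0.01, 0.01                           # learning rate, regularization

P = rng.normal(scale=0.1, size=(n_users, d))
Q = rng.normal(scale=0.1, size=(n_items, d))
OU = rng.normal(scale=0.1, size=(n_expl, d))      # user-side explanation factors
OI = rng.normal(scale=0.1, size=(n_expl, d))      # item-side explanation factors
bU = np.zeros(n_expl)
bI = np.zeros(n_expl)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sgd_step(u, i, e, e1, e2):
    """One BPER update for sampled (u, i, e) with negatives e1 (user side)
    and e2 (item side), mirroring the inner loop of Algorithm 1."""
    r_uee1 = (P[u] @ OU[e] + bU[e]) - (P[u] @ OU[e1] + bU[e1])
    r_iee2 = (Q[i] @ OI[e] + bI[e]) - (Q[i] @ OI[e2] + bI[e2])
    x, y = -sigmoid(-r_uee1), -sigmoid(-r_iee2)   # BPR gradient coefficients
    P[u] -= gamma * (x * (OU[e] - OU[e1]) + lam * P[u])
    Q[i] -= gamma * (y * (OI[e] - OI[e2]) + lam * Q[i])
    OU[e] -= gamma * (x * P[u] + lam * OU[e])
    OI[e] -= gamma * (y * Q[i] + lam * OI[e])
    OU[e1] -= gamma * (-x * P[u] + lam * OU[e1])
    OI[e2] -= gamma * (-y * Q[i] + lam * OI[e2])
    bU[e] -= gamma * (x + lam * bU[e]); bI[e] -= gamma * (y + lam * bI[e])
    bU[e1] -= gamma * (-x + lam * bU[e1]); bI[e2] -= gamma * (-y + lam * bI[e2])
```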
REFERENCES
[1] J. Douglas Carroll and Jih-Jie Chang. 1970. Analysis of individual differences in multidimensional scaling via an N-way generalization of "Eckart-Young" decomposition. Psychometrika 35, 3 (1970), 283–319.
[2] Rose Catherine and William Cohen. 2017. TransNets: Learning to Transform for Recommendation. In RecSys. ACM, 288–296.
[3] Chong Chen, Min Zhang, Yiqun Liu, and Shaoping Ma. 2018. Neural Attentional Rating Regression with Review-level Explanations. In WWW. ACM, 1583–1592.
[4] Hanxiong Chen, Xu Chen, Shaoyun Shi, and Yongfeng Zhang. 2019. Generate Natural Language Explanations for Recommendation. In SIGIR Workshop EARS.
[5] Jianbo Chen, Le Song, Martin J. Wainwright, and Michael I. Jordan. 2018. Learning to explain: An information-theoretic perspective on model interpretation. In ICML.
[6] Li Chen, Yonghua Yang, Ningxia Wang, Keping Yang, and Quan Yuan. 2019. How serendipity improves user satisfaction with recommendations? A large-scale user evaluation. In WWW. 240–250.
[7] Xu Chen, Han Chen, Hongteng Xu, Yongfeng Zhang, Yixin Cao, Zheng Qin, and Hongyuan Zha. 2019. Personalized Fashion Recommendation with Visual Explanations based on Multimodal Attention Network: Towards Visually Explainable Recommendation. In SIGIR. ACM, 765–774.
[8] Xu Chen, Zheng Qin, Yongfeng Zhang, and Tao Xu. 2016. Learning to rank features for recommendation over multiple categories. In SIGIR. 305–314.
[9] Xu Chen, Yongfeng Zhang, and Zheng Qin. 2019. Dynamic Explainable Recommendation based on Neural Attentive Models. In AAAI.
[10] Zhongxia Chen, Xiting Wang, Xing Xie, Tong Wu, Guoqing Bu, Yining Wang, and Enhong Chen. 2019. Co-Attentive Multi-Task Learning for Explainable Recommendation. In IJCAI.
[11] Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle. 2000. A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications 21, 4 (2000), 1253–1278.
[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
[13] Jingyue Gao, Xiting Wang, Yasha Wang, and Xing Xie. 2019. Explainable Recommendation Through Attentive Multi-View Learning. In AAAI.
[14] Xiangnan He, Tao Chen, Min-Yen Kan, and Xiao Chen. 2015. TriRank: Review-aware Explainable Recommendation by Modeling Aspects. In CIKM. ACM.
[15] Jizhou Huang, Wei Zhang, Shiqi Zhao, Shiqiang Ding, and Haifeng Wang. 2017. Learning to Explain Entity Relationships by Pairwise Ranking with Convolutional Neural Networks. In IJCAI. 4018–4025.
[16] Noor Ifada and Richi Nayak. 2014. Tensor-based item recommendation using probabilistic ranking in social tagging systems. In WWW Workshop. 805–810.
[17] Robert Jäschke, Leandro Marinho, Andreas Hotho, Lars Schmidt-Thieme, and Gerd Stumme. 2007. Tag recommendations in folksonomies. In PKDD. Springer.
[18] Atsushi Kanehira and Tatsuya Harada. 2019. Learning to explain with complemental examples. In CVPR. 8603–8611.
[19] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix Factorization Techniques for Recommender Systems. Computer 42, 8 (2009), 30–37.
[20] In IJCAI. 2427–2434.
[21] Lei Li, Li Chen, and Ruihai Dong. 2020. CAESAR: Context-aware explanation based on supervised attention for service recommendations. Journal of Intelligent Information Systems (2020), 1–24.
[22] Lei Li, Yongfeng Zhang, and Li Chen. 2020. Generate Neural Template Explanations for Recommendation. In CIKM. 755–764.
[23] Piji Li, Zihao Wang, Zhaochun Ren, Lidong Bing, and Wai Lam. 2017. Neural Rating Regression with Abstractive Tips Generation for Recommendation. In SIGIR. ACM, 345–354.
[24] Xueqi Li, Wenjun Jiang, Weiguang Chen, Jie Wu, Guojun Wang, and Kenli Li. 2020. Directional and Explainable Serendipity Recommendation. In WWW. 122–132.
[25] Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out. 74–81.
[26] Tie-Yan Liu. 2011. Learning to Rank for Information Retrieval. Springer Science & Business Media.
[27] Yichao Lu, Ruihai Dong, and Barry Smyth. 2018. Coevolutionary Recommendation Model: Mutual Learning between Ratings and Reviews. In WWW. 773–782.
[28] Yichao Lu, Ruihai Dong, and Barry Smyth. 2018. Why I like it: Multi-task Learning for Recommendation and Explanation. In RecSys. ACM.
[29] James McInerney, Benjamin Lacker, Samantha Hansen, Karl Higley, Hugues Bouchard, Alois Gruson, and Rishabh Mehrotra. 2018. Explore, exploit, and explain: personalizing explainable recommendations with bandits. In RecSys.
[30] Andriy Mnih and Ruslan R. Salakhutdinov. 2008. Probabilistic matrix factorization. In NIPS. 1257–1264.
[31] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL. 311–318.
[32] Anand Rajaraman and Jeffrey David Ullman. 2011. Finding Similar Items. In Mining of Massive Datasets (3rd ed.). Cambridge University Press, Chapter 3, 73–134.
[33] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In UAI.
[34] Steffen Rendle and Lars Schmidt-Thieme. 2010. Pairwise interaction tensor factorization for personalized tag recommendation. In WSDM. 81–90.
[35] Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, and John Riedl. 1994. GroupLens: An open architecture for collaborative filtering of netnews. In CSCW. ACM, 175–186.
[36] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In WWW. ACM, 285–295.
[37] Sungyong Seo, Jing Huang, Hao Yang, and Yan Liu. 2017. Interpretable Convolutional Neural Networks with Dual Local and Global Attention for Review Rating Prediction. In RecSys. ACM, 297–305.
[38] Ashudeep Singh and Thorsten Joachims. 2018. Fairness of exposure in rankings. In KDD. 2219–2228.
[39] Nava Tintarev and Judith Masthoff. 2015. Explaining Recommendations: Design and Evaluation. In Recommender Systems Handbook (2nd ed.), Bracha Shapira (Ed.). Springer, Chapter 10, 353–382.
[40] Ledyard R. Tucker. 1966. Some mathematical notes on three-mode factor analysis. Psychometrika 31, 3 (1966), 279–311.
[41] Nikos Voskarides, Edgar Meij, Manos Tsagkias, Maarten de Rijke, and Wouter Weerkamp. 2015. Learning to explain entity relationships in knowledge graphs. In ACL. 564–574.
[42] Xiting Wang, Yiru Chen, Jie Yang, Le Wu, Zhengtao Wu, and Xing Xie. 2018. A Reinforcement Learning Framework for Explainable Recommendation. In ICDM.
[43] Robert Wetzker, Carsten Zimmermann, Christian Bauckhage, and Sahin Albayrak. 2010. I tag, you tag: translating tags for advanced user models. In WSDM. 71–80.
[44] Xuhai Xu, Ahmed Hassan Awadallah, Susan T. Dumais, Farheen Omar, Bogdan Popp, Robert Rounthwaite, and Farnaz Jahanbakhsh. 2020. Understanding User Behavior For Document Recommendation. In WWW. 3012–3018.
[45] Yuan Yang and Le Song. 2020. Learn to Explain Efficiently via Neural Logic Inductive Learning. In ICLR.
[46] Yongfeng Zhang and Xu Chen. 2020. Explainable Recommendation: A Survey and New Perspectives. Foundations and Trends in Information Retrieval 14, 1 (2020), 1–101.
[47] Yongfeng Zhang, Guokun Lai, Min Zhang, Yi Zhang, Yiqun Liu, and Shaoping Ma. 2014. Explicit Factor Models for Explainable Recommendation based on Phrase-level Sentiment Analysis. In SIGIR.