EXTRA: Explanation Ranking Datasets for Explainable Recommendation
Lei Li
Hong Kong Baptist University, Hong Kong, [email protected]
Yongfeng Zhang
Rutgers University, New Brunswick, [email protected]
Li Chen
Hong Kong Baptist University, Hong Kong, [email protected]
ABSTRACT
Recently, research on explainable recommender systems (RS) has drawn much attention from both academia and industry, resulting in a variety of explainable models. As a consequence, their evaluation approaches vary from model to model, which makes it quite difficult to compare the explainability of different models. To achieve a standard way of evaluating recommendation explanations, we provide three benchmark datasets for EXplanaTion RAnking (denoted as EXTRA), on which explainability can be measured by ranking-oriented metrics. Constructing such datasets, however, presents great challenges. First, user-item-explanation interactions are rare in existing RS, so how to find alternatives becomes a challenge. Our solution is to identify nearly duplicate or even identical sentences from user reviews. This idea then leads to the second challenge, i.e., how to efficiently categorize the sentences in a dataset into different groups, since estimating the similarity between any two sentences has quadratic runtime complexity. To mitigate this issue, we provide a more efficient method based on Locality Sensitive Hashing (LSH) that can detect near-duplicates in sub-linear time for a given query. Moreover, we plan to make our code publicly available, to allow other researchers to create their own datasets.
CCS CONCEPTS
• Information systems → Recommender systems; Learning to rank.
KEYWORDS
Recommender Systems; Explainable Recommendation; Learning to Rank
ACM Reference Format:
Lei Li, Yongfeng Zhang, and Li Chen. 2021. EXTRA: Explanation Ranking Datasets for Explainable Recommendation. In Proceedings of ACM Conference (Conference '21), Month dd–dd, 2021, Virtual Event. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/xxxxxxx.xxxxxxx
Explainable recommender systems (RS) [18, 23] that not only provide users with personalized recommendations but also justify why
they are recommended, have become a popular research topic in recent years. Compared with traditional RS algorithms, e.g., collaborative filtering [16, 17], which aim to tackle the information overload problem for users, explainable RS can further improve users' satisfaction and overall experience [18] by helping them better understand recommended items. However, as explanations can take various forms, such as pre-defined templates [9, 24], generated text [3, 10] and paths on a knowledge graph [6, 21], it is difficult to evaluate the explanations produced by different methods.

Figure 1: User-item-review interactions can be converted into user-item-explanation interactions, so as to build a connection between explanations and users/items.

We present three benchmark datasets on which recommendation explanations can be evaluated quantitatively via standard ranking metrics, such as NDCG, Precision and Recall. The idea of explanation ranking is inspired by information retrieval, which does not create information but rather ranks all the available content (e.g., documents or images) for a given query. In addition, this idea is also supported by our observation on the problems of natural language generation techniques. In our previous work on explanation generation [10], we found that a large proportion of generated sentences are the commonly seen ones in the training data, e.g., "the food is good". This means that the generation models are fitting the given samples rather than creating something new. Furthermore, even a strong language model such as Transformer [19], trained on a large text corpus, may generate content that deviates from facts, e.g., "four-horned unicorn" [12].

Thus, we create three EXplanaTion RAnking datasets (denoted as EXTRA) for explainable recommendation research. Specifically, they are built upon user generated reviews, which are the collection of users' true evaluations of items. This ensures the quality of explanations, such as readability and factuality. Moreover, the datasets could be further enriched with new explanations when newly added reviews contain new product features or up-to-date expressions.

Figure 2: Three user reviews for different movies from Amazon (Movies & TV category). Sentences that can be regarded as explanations are highlighted in colors. Co-occurring explanations across different reviews are highlighted in rectangles.

However, simply adopting reviews [2, 5] or their sentences [4, 20] as explanations is less appropriate, because in this case each review/sentence only appears once, so their relation with users and items cannot be well reflected (see r1 to r6 in Fig. 1), which makes it difficult to perform explanation ranking. Our solution is to find the co-occurring sentences across all the reviews, in order to connect different user-item pairs with one particular explanation (e.g., u2-i1 and u3-i3 with e2 in Fig. 1) and thus build the user-item-explanation interactions. This type of textual explanation could be very effective in helping users make better and faster decisions. A recent online experiment conducted on Microsoft Office 365 [22] finds that their manually designed textual explanations, e.g.,
"Jack share this file with you", can help users access documents faster. This motivates us to automatically create this type of explanation for other application domains, e.g., movies.

Then, a follow-up problem is how to detect similar or even identical sentences across the reviews in a dataset. Data clustering is infeasible in this case, because its number of centroids is pre-defined and fixed. Computing the similarity between any two sentences in a dataset is practical but less efficient, since it has quadratic time complexity. To make this process more efficient, we develop a method that can categorize sentences into different groups, based on Locality Sensitive Hashing (LSH) [13], which is devised for near-duplicate detection. Furthermore, because some sentences are less suitable for explanation purposes (see the first review's first sentence in Fig. 2), we only keep those containing both noun(s) and adjective(s), but not personal pronouns, e.g., "I". In this way, we can obtain high-quality explanations that talk about item features with certain opinions but do not go through personal experiences. After the whole process, the explanation sentences remain personalized, since they resemble the case of traditional recommendation, where users of similar preferences write nearly identical review sentences, while similar items can be explained by the same explanations (see sentences in rectangles in Fig. 2).

Notice that our datasets are different from both user-item-aspect data [7] and user-item-tag data [8, 15], since an aspect/tag, when used as an explanation, may not be able to clearly explain an item's specialty. For example, a single word "food" cannot describe how good a restaurant's food tastes.

To sum up, our contributions are listed below:
• We construct three large datasets consisting of user-item-explanation interactions, on which explainability can be evaluated via standard ranking metrics, e.g., NDCG. Datasets and code will be made available after this paper is published.
• We address two key problems when creating such datasets, including the interactions between explanations and users/items, as well as the efficiency of grouping similar sentences.

In the following, we first introduce our data processing approach and the resulting datasets in Section 2. Then, we present two explanation ranking formulations in Section 3. We experiment with existing methods on the datasets in Section 4. Section 5 concludes this work.
Algorithm 1: Sentence Grouping via LSH
Input: shingle size 𝑛, similarity threshold 𝑡, min group size 𝑔
Output: explanation set E, groups of sentences M
  Pre-process textual data to obtain the sentence collection S
  𝑙𝑠ℎ ← MinHashLSH(𝑡), C ← ∅
  for sentence 𝑠 in S do
    𝑚 ← MinHash()                      // create MinHash for 𝑠
    for 𝑛-shingle ℎ in 𝑠 do
      𝑚.update(ℎ)                      // convert 𝑠 into 𝑚 by encoding its 𝑛-shingles
    end for
    𝑙𝑠ℎ.insert(𝑚), C.add(𝑚)            // C: set of all sentences' MinHash
  end for
  M ← ∅, Q ← ∅                         // Q: set of queried sentences
  for 𝑚 in C do
    if 𝑚 not in Q then
      G ← 𝑙𝑠ℎ.query(𝑚)                 // G: ID set of duplicate sentences
      if G.size > 𝑔 then
        M.add(G)                       // only keep groups with enough sentences
        E.add(G.get())                 // keep one explanation in each group
      end if
      for 𝑚′ in G do
        𝑙𝑠ℎ.remove(𝑚′), Q.add(𝑚′)      // for efficiency
      end for
    end if
  end for

For explanation ranking, the datasets are expected to contain user-item-explanation interactions. In this paper, we narrow down the explanations to textual sentences from user reviews. The key problem is how to efficiently detect near-duplicates across different reviews, since it takes quadratic time to compute the similarity between any two sentences in a dataset.
Figure 3: White cells denote similarity computation, while black cells omit the computation. (a) shows a naive way to compute the similarity between any two sentences, which would take quadratic time. (b)-(e) show four example steps in our more efficient sentence grouping algorithm, where orange rectangles denote query steps in LSH, and M denotes the matched duplicates.

In the following, we first present our approach to finding duplicate sentences, which we call sentence grouping, then introduce the data construction details, and at last analyze the datasets.
The advantage of sentence grouping is three-fold. First, it ensures the readability and factuality of explanations, as they are extracted from user generated reviews based on the wisdom of the crowd. Second, it allows the explanations to have connections with both users and items, so that we could design models to learn and predict such connections. Third, it makes the idea of explanation ranking and the automatic benchmark evaluation possible, since there is only a limited set of candidate explanations.

Computing the similarity between any two sentences in a dataset is computationally expensive, but at each step of sentence grouping it is in fact unnecessary to compute the similarity for the already grouped sentences. Therefore, we can reduce the computation cost by removing those sentences (see Fig. 3 (b)-(e) for illustration). To find similar sentences more efficiently, we resort to Locality Sensitive Hashing (LSH) [13], which is able to conduct near-duplicate detection in sub-linear time. LSH consists of three major steps. First, a document (i.e., a sentence in our case) is converted to a set of 𝑛-shingles (a.k.a., 𝑛-grams). Second, the sets w.r.t. all documents are converted to short signatures via hashing, so as to reduce computation cost while preserving document similarity. Third, the documents whose similarity to a query document is greater than a pre-defined threshold are returned. Our detailed procedure of sentence grouping is presented in Algorithm 1.

Next, we discuss the implementation details. To make better use of all the available text in a dataset, for each record we concatenate the review text and the heading/tip. Then each piece of text is tokenized into sentences. In particular, a sentence is removed if it contains personal pronouns, e.g., "I" and "me", because explanations are expected to be objective rather than subjective. We also calculate the frequency of nouns and adjectives in each sentence via NLTK, and only keep the sentences that contain both noun(s) and adjective(s), so as to obtain more informative explanations that evaluate certain item features. After the data pre-processing, we conduct sentence grouping via an open-source LSH [13] package, Datasketch. When creating the MinHash for each sentence, we set the shingle size 𝑛 to 2, for the purpose of relatively preserving the word order and distinguishing positive sentiment from negative sentiment (e.g., "is good" vs. "not good"). We test the similarity threshold 𝑡 of querying sentences from [0.5, 0.6, ..., 0.9], and find that the results with 0.9 are the best.
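To make the procedure concrete, the following is a minimal Python sketch of Algorithm 1 on top of the Datasketch package mentioned above. The shingle size, similarity threshold and minimum group size follow the values reported in this section, whereas the word-level shingling helper, the `num_perm` setting and the function names are illustrative assumptions rather than the exact released implementation.

```python
from datasketch import MinHash, MinHashLSH

def shingles(sentence, n=2):
    """Word-level n-shingles (n-grams) of a sentence."""
    words = sentence.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def group_sentences(sentences, n=2, threshold=0.9, min_group_size=5, num_perm=128):
    """Near-duplicate sentence grouping with MinHash LSH (cf. Algorithm 1)."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    minhashes = {}
    for idx, s in enumerate(sentences):
        m = MinHash(num_perm=num_perm)
        for sh in shingles(s, n):
            m.update(sh.encode("utf8"))        # encode the sentence's n-shingles
        lsh.insert(idx, m)                     # index the sentence by its MinHash
        minhashes[idx] = m

    groups, queried = [], set()
    for idx, m in minhashes.items():
        if idx in queried:
            continue                           # already assigned to a group
        group = lsh.query(m)                   # IDs of near-duplicate sentences
        if len(group) > min_group_size:
            groups.append(group)
        for dup in group:
            lsh.remove(dup)                    # drop grouped sentences from the index
            queried.add(dup)
    # one representative sentence per kept group serves as the explanation text
    explanations = [sentences[g[0]] for g in groups]
    return explanations, groups
```

Removing grouped sentences from the index mirrors Fig. 3 (b)-(e): each later query is only compared against sentences that have not yet been assigned to any group.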
We construct our datasets on top of three large datasets: Amazon Movies & TV (movie), TripAdvisor (hotel) and Yelp (restaurant). In each of the datasets, a record is comprised of user ID, item ID, overall rating on the scale of 1 to 5, and textual review. After splitting reviews into sentences, we apply sentence grouping (Algorithm 1) over them to obtain a large number of sentence groups. A group is removed if its number of sentences is no more than 5, in order to retain commonly seen explanations. We then assign each of the remaining groups an ID that we call the explanation ID. A record may be assigned none, one or multiple explanation IDs. We remove the records that do not have any explanation ID.

To make our datasets more friendly to the community, we largely follow the data format of the well-known MovieLens dataset. Specifically, we store each processed dataset in two separate plain text files: IDs.txt and id2exp.txt. The former contains the meta-data information, such as user ID, item ID and explanation ID, while the latter stores the textual content of an explanation that can be retrieved via the explanation ID. The entries of each line in both files are separated by a double colon, i.e., "::". If a line in IDs.txt contains multiple explanation IDs, they are separated by a single colon, i.e., ":". Detailed examples are shown in Table 1. With this type of data format, loading the data is quite easy, but we also provide a script in our code for data loading.
Table 2 shows the statistics of the processed datasets. Notice that multiple explanations may be detected in a single review, which yields several user-item-explanation triplets. As we can see, all three datasets are very sparse.

Datasketch: http://ekzhu.com/datasketch/lsh.html; Amazon data: http://jmcauley.ucsd.edu/data/amazon; MovieLens: https://grouplens.org/datasets/movielens/

Table 1: Data format of our datasets. Each dataset contains two plain text files: IDs.txt and id2exp.txt. Entries in each line of the files are separated by a double colon. In IDs.txt, expID denotes the explanation ID after sentence grouping, while the corresponding oexpID is the original explanation ID. When a record has multiple explanation IDs, they are separated by a single colon. In id2exp.txt, expID applies to both expID and oexpID in IDs.txt.
File         Format
IDs.txt      userID::itemID::rating::timeStamp::expID:expID::oexpID:oexpID
             A20YXFTS3GUGON::B00ICWO0ZY::5::1405958400::13459471:5898244::32215058:32215057
             APBZTFB6Y3TUX::B000K7VHPU::5::1394294400::13459471::21311508
id2exp.txt   expID::expSentence
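The released loading script is not reproduced here; the following is a minimal sketch of a loader for the format in Table 1, assuming the field order shown above. The function name and the returned structures are illustrative.

```python
def load_extra(ids_path="IDs.txt", id2exp_path="id2exp.txt"):
    """Minimal loader for the IDs.txt / id2exp.txt format in Table 1."""
    # explanation ID -> explanation sentence
    id2exp = {}
    with open(id2exp_path, "r", encoding="utf-8") as f:
        for line in f:
            exp_id, sentence = line.rstrip("\n").split("::", 1)
            id2exp[exp_id] = sentence

    records = []
    with open(ids_path, "r", encoding="utf-8") as f:
        for line in f:
            user, item, rating, timestamp, exp_ids, oexp_ids = line.rstrip("\n").split("::")
            records.append({
                "user": user,
                "item": item,
                "rating": int(rating),
                "timestamp": int(timestamp),
                "exp_ids": exp_ids.split(":"),      # grouped explanation IDs
                "oexp_ids": oexp_ids.split(":"),    # original (pre-grouping) IDs
            })
    return records, id2exp
```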
Table 2: Statistics of the datasets. Density is the number of (𝑢, 𝑖, 𝑒) triplets divided by #users × #items × #explanations.

                        Amazon     TripAdvisor    Yelp
(𝑢, 𝑖, 𝑒) triplets      793,481    2,618,340      3,875,118
Density (×10⁻ )         45.71      13.88          2.07

Next, we show 5 example explanations for each dataset in Table 3. We can see that the explanations vary from dataset to dataset, but they all reflect the characteristics of the corresponding datasets, e.g., "a wonderful movie for all ages" on the Amazon Movies & TV dataset. The occurrence of short explanations is high, not only because LSH favors short text, but also because people tend to express their opinions using common and concise phrases. Moreover, we can observe some negative expressions, which could be used to explain disrecommendations.

Because constructing the datasets does not involve manual effort, we do observe one minor issue. Since a noun is not necessarily an item feature, the datasets contain a few less meaningful explanations that are less relevant to items, e.g., "the first time". This issue could be effectively addressed if we pre-defined a set of item features or filtered out item-irrelevant nouns for each dataset. However, this would require considerable human labor, so we leave it for future work.

Table 3: Example explanations after sentence grouping on three datasets. Occurrence means the number of near-duplicate explanations.
Explanation                                   Occurrence
Amazon Movies & TV
Excellent movie                                     3628
This is a great movie                               2941
Don't waste your money                               834
The sound is okay                                     11
A wonderful movie for all ages                         6
TripAdvisor
Great location                                     61993
The room was clean                                  6622
The staff were friendly and helpful                 2184
Bad service                                          670
Comfortable hotel with good facilities                 8
Yelp
Great service                                      46413
Everything was delicious                            5237
Prices are reasonable                               2914
This place is awful                                  970
The place was clean and the food was good              6

The task of explanation ranking aims at finding a list of explanations to explain a recommendation for a user. Similar to item ranking, these explanations should preferably be personalized to the user's interests as well as the target item's characteristics. To produce such a personalized explanation list, a recommender system could leverage the user's historical data, e.g., her past interactions and comments on other items. In the following, we introduce two types of explanation ranking formulation: global-level and item-level.

In the setting of global-level explanation ranking, there is a collection of explanations $\mathcal{E}$ that is globally shared by all items. The recommender system can estimate a score $\hat{r}_{u,i,e}$ for each explanation $e \in \mathcal{E}$, given a pair of user $u \in \mathcal{U}$ and item $i \in \mathcal{I}$, which results from either the user's behavior or a model's prediction. According to the scores, the top $N$ explanations can be selected to justify why recommendation $i$ is made for user $u$. Formally, this explanation list can be defined as:

\[ \mathrm{Top}(u, i, N) := \mathop{\arg\max}^{N}_{e \in \mathcal{E}} \hat{r}_{u,i,e} \tag{1} \]

Meanwhile, we can perform item-level explanation ranking to select explanations from the target item's collection, which can be formulated as:

\[ \mathrm{Top}(u, i, N) := \mathop{\arg\max}^{N}_{e \in \mathcal{E}_i} \hat{r}_{u,i,e} \tag{2} \]

where $\mathcal{E}_i$ is item $i$'s explanation collection.
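As a small illustration of Eqs. (1) and (2), the sketch below ranks a vector of predicted scores and returns the top-N explanation indices; restricting the candidates to the item's own set E_i turns global-level ranking into item-level ranking. The function name and arguments are assumptions for illustration only.

```python
import numpy as np

def top_n_explanations(scores, n=10, candidate_ids=None):
    """Return the indices of the n highest-scoring explanations.

    scores: predicted scores r_hat[u, i, e] over the global explanation set E (Eq. 1).
    candidate_ids: optional indices restricting candidates to the target
                   item's own explanation set E_i (Eq. 2).
    """
    scores = np.asarray(scores, dtype=float)
    if candidate_ids is not None:
        candidate_ids = list(candidate_ids)
        scores = scores[candidate_ids]
    order = np.argsort(-scores)[:n]                 # positions of the n largest scores
    if candidate_ids is not None:
        return [candidate_ids[k] for k in order]    # map back to global explanation IDs
    return order.tolist()
```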
The two formulations have their own respective advantages. Global-level ranking could make better use of all the user-item-explanation interactions, e.g., "great story and acting" for different items (see Fig. 2), so as to better capture the relation between users, items and explanations. In comparison, item-level ranking could avoid presenting item-dependent explanations that may not be applicable to some recommendations, e.g., "Moneyball is a great movie based on a true story", which only applies to the movie Moneyball. Depending on the application scenario, we may adopt different formulations.
In this section, we first introduce five methods for explanation ranking. Then, we discuss the experimental details. At last, we analyze the results of different methods.
On the global-level explanation ranking task, we test five methods. The first one is denoted as RAND, which randomly selects explanations from the explanation set $\mathcal{E}$ for any given user-item pair. It simply shows the bottom-line performance of explanation ranking. The other four methods can be grouped into two categories: collaborative filtering and tensor factorization. For the ranking purpose, each of the four methods must estimate a score $\hat{r}_{u,i,e}$ for a triplet $(u, i, e)$.

Collaborative Filtering (CF) [16, 17] is a typical type of recommendation algorithm that recommends items for a user, based on either the user's neighbors who have similar preferences, or each item's neighbors. It naturally fits the explanation ranking task, as some users may care about certain item features, and some items' specialties could be similar. We extend user-based CF (UCF) and item-based CF (ICF) to our ternary data, following [8], and denote them as RUCF and RICF, where "R" means "Revised". Taking RUCF as an example, we first compute the similarity between users $u$ and $u'$ via the Jaccard index:

\[ s_{u,u'} = \frac{|\mathcal{E}_u \cap \mathcal{E}_{u'}|}{|\mathcal{E}_u \cup \mathcal{E}_{u'}|} \tag{3} \]

where $\mathcal{E}_u$ and $\mathcal{E}_{u'}$ denote the explanations associated with $u$ and $u'$, respectively. Then we estimate a score for the triplet $(u, i, e)$, for which we only retain user $u$'s neighbors who interacted with both item $i$ and explanation $e$:

\[ \hat{r}_{u,i,e} = \sum_{u' \in \mathcal{N}_u \cap (\mathcal{U}_i \cap \mathcal{U}_e)} s_{u,u'} \tag{4} \]

Similarly, RICF predicts a score for the same triplet via the neighbors of items.

The triplets formed by users, items and explanations correspond to entries in an interaction cube, whose missing values could be recovered by Tensor Factorization (TF) methods. Thus, we test two typical TF methods: Canonical Decomposition (CD) [1] and Pairwise Interaction Tensor Factorization (PITF) [15]. To predict a score $\hat{r}_{u,i,e}$, CD performs element-wise multiplication on the latent factors of user $u$, item $i$ and explanation $e$, and then sums over the resulting vector. Formally, it can be written as:

\[ \hat{r}_{u,i,e} = (\mathbf{p}_u \odot \mathbf{q}_i)^\top \mathbf{o}_e = \sum_{k=1}^{d} p_{u,k} \cdot q_{i,k} \cdot o_{e,k} \tag{5} \]

where $\odot$ represents element-wise multiplication of two vectors, and $d$ is the number of latent factors. PITF makes the prediction via two inner products:

\[ \hat{r}_{u,i,e} = \mathbf{p}_u^\top \mathbf{o}_e^U + \mathbf{q}_i^\top \mathbf{o}_e^I = \sum_{k=1}^{d} p_{u,k} \cdot o^U_{e,k} + \sum_{k=1}^{d} q_{i,k} \cdot o^I_{e,k} \tag{6} \]

where $\mathbf{o}_e^U$ and $\mathbf{o}_e^I$ are two different latent factors for the same explanation.

We opt for the Bayesian Personalized Ranking (BPR) criterion [14] to learn the parameters of the two TF methods, because it can model the relative order of explanations, e.g., the rank of a user's interacted explanations > that of her uninteracted explanations. The objective function of both CD and PITF is:

\[ \min_{\Theta} \sum_{u \in \mathcal{U}} \sum_{i \in \mathcal{I}_u} \sum_{e \in \mathcal{E}_{u,i}} \sum_{e' \in \mathcal{E} \setminus \mathcal{E}_{u,i}} -\ln \sigma(\hat{r}_{u,i,ee'}) + \lambda \|\Theta\|_F^2 \tag{7} \]

where $\hat{r}_{u,i,ee'} = \hat{r}_{u,i,e} - \hat{r}_{u,i,e'}$ denotes the difference between two interactions, $\sigma(\cdot)$ is the sigmoid function, $\mathcal{I}_u$ represents user $u$'s interacted items, $\mathcal{E}_{u,i}$ is the explanation set of the $(u, i)$ pair for training, $\Theta$ denotes the model parameters, and $\lambda$ is a coefficient that prevents the model from over-fitting. To learn the model parameters $\Theta$, we optimize Eq. (7) for both CD and PITF via stochastic gradient descent. At the testing stage, we can estimate scores of the explanations in $\mathcal{E}$ for a user-item pair, and then rank them according to Eq. (1).
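For concreteness, the following is a minimal sketch of the CD scoring function in Eq. (5) and of a single stochastic-gradient BPR update for Eq. (7). The placeholder sizes, initialization scale and the constant absorbed into the regularization term are illustrative assumptions, not the tuned settings used in the experiments.

```python
import numpy as np

# Illustrative sizes; in practice they come from the dataset at hand.
n_users, n_items, n_exps, d = 1000, 500, 2000, 20

rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, d))   # user factors p_u
Q = rng.normal(scale=0.1, size=(n_items, d))   # item factors q_i
O = rng.normal(scale=0.1, size=(n_exps, d))    # explanation factors o_e (CD)

def score_cd(u, i, e):
    """Eq. (5): sum_k p_{u,k} * q_{i,k} * o_{e,k}."""
    return float(np.dot(P[u] * Q[i], O[e]))

# PITF (Eq. 6) would instead keep two explanation embeddings O_U and O_I and
# score a triplet as np.dot(P[u], O_U[e]) + np.dot(Q[i], O_I[e]).

def bpr_step(u, i, e_pos, e_neg, lr=0.01, reg=0.001):
    """One SGD step on the BPR objective of Eq. (7) for the CD model."""
    x = score_cd(u, i, e_pos) - score_cd(u, i, e_neg)   # r_hat_{u,i,ee'}
    g = 1.0 / (1.0 + np.exp(x))                         # -(d/dx) ln sigma(x)
    pu, qi = P[u].copy(), Q[i].copy()
    diff = O[e_pos] - O[e_neg]
    P[u]     += lr * (g * qi * diff - reg * pu)
    Q[i]     += lr * (g * pu * diff - reg * qi)
    O[e_pos] += lr * (g * pu * qi - reg * O[e_pos])
    O[e_neg] += lr * (-g * pu * qi - reg * O[e_neg])
```

Here `e_pos` is an explanation the user interacted with for item `i` and `e_neg` is a sampled explanation outside that set, matching the pairwise order modeled by BPR.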
Notice that CD and PITF may be further enriched by considering more complex relations between explanations (e.g., the rank of a user's positive explanations > the other users' explanations > the user's negative explanations). We leave this exploration for future work.

To compare the performance of different methods on the explanation ranking task, we adopt four metrics: Normalized Discounted Cumulative Gain (NDCG), Precision (Pre), Recall (Rec) and F1. Top-10 explanations are returned for each testing user-item pair. We randomly select 70% of the triplets in each dataset for training, and the rest for testing. Also, we make sure that the training set holds at least one triplet for each user, item and explanation. We do this 5 times, and thus obtain 5 data splits, on which we report the average performance of each method.

All the methods are implemented in Python. To allow the CF-based methods (i.e., RUCF and RICF) to better utilize user/item neighbors, we do not restrict the upper limit on the size of $\mathcal{N}_u$ and $\mathcal{N}_i$. For the TF-based methods, i.e., CD and PITF, we search the number of latent factors $d$ from [10, 20, 30, 40, 50], the regularization coefficient $\lambda$ from [0.001, 0.01, 0.1], the learning rate $\gamma$ from [0.001, 0.01, 0.1], and the maximum iteration number $T$ from [100, 500, 1000]. After parameter tuning, we use $\gamma$ = 0.01 and $T$ = 500 for both CD and PITF; the tuned $d$ and $\lambda$ lie within the search ranges above.
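The per-pair metric computation can be sketched as follows under the common definitions of these four metrics; the exact formulas used in the experiments are not spelled out in the text, so treat this as an assumption. The ground-truth set for a user-item pair is its held-out explanations.

```python
import numpy as np

def ndcg_at_k(ranked, relevant, k=10):
    """NDCG@k for one (user, item) pair; `relevant` is its held-out explanation set."""
    relevant = set(relevant)
    gains = [1.0 if e in relevant else 0.0 for e in ranked[:k]]
    dcg = sum(g / np.log2(pos + 2) for pos, g in enumerate(gains))
    idcg = sum(1.0 / np.log2(pos + 2) for pos in range(min(len(relevant), k)))
    return dcg / idcg if idcg > 0 else 0.0

def prec_rec_f1_at_k(ranked, relevant, k=10):
    """Precision@k, Recall@k and F1@k for one (user, item) pair."""
    relevant = set(relevant)
    hits = len(set(ranked[:k]) & relevant)
    prec = hits / k
    rec = hits / len(relevant) if relevant else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0
    return prec, rec, f1
```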
Table 4 presents the performance comparison of different methods on the three datasets.
Table 4: Performance comparison of all methods on top-10 explanation ranking in terms of NDCG, Precision (Pre), Recall (Rec) and F1 (%). The best performing values are boldfaced.
        Amazon                                  TripAdvisor                             Yelp
        NDCG@10  Pre@10  Rec@10  F1@10          NDCG@10  Pre@10  Rec@10  F1@10          NDCG@10  Pre@10  Rec@10  F1@10
CD      0.001    0.001   0.007   0.002          0.001    0.001   0.003   0.001          0.000    0.000   0.003   0.001
RAND    0.004    0.004   0.027   0.006          0.002    0.002   0.011   0.004          0.001    0.001   0.007   0.002
RUCF    0.341    0.170   1.455   0.301          0.260    0.151   0.779   0.242          0.040    0.020   0.125   0.033
RICF    0.417    0.259   1.797   0.433          0.031    0.020   0.087   0.030          0.037    0.026   0.137   0.042
PITF

We have the following observations. First, each method performs consistently on the three datasets regarding the four metrics. Second, the performance of both RAND and CD is the worst, because RAND is non-personalized, while the data sparsity problem (see Table 2) may be difficult to mitigate for CD, which simply multiplies three latent factors. Third, both RUCF and RICF, which can make use of user/item neighbors, are better than RAND, but they are still limited because of the data sparsity issue. Lastly, PITF improves over CD and also outperforms RUCF and RICF, with its specially designed model structure that may tackle data sparsity (see [15] for discussion).
In this paper, we construct three explanation ranking datasets for explainable recommendation research, in an attempt to achieve a standard way of evaluating explainability. To this end, we address two problems during data construction: the lack of user-item-explanation interactions and the efficiency of detecting similar sentences.

In the future, we plan to apply our sentence grouping approach to product images, so as to construct datasets with visual explanations. We will also test other ranking methods, such as those developed for tag/aspect ranking [7]. Since the focus of this work is on data construction, we present our two tensor factorization methods (with and without the textual content of explanations) for explanation ranking in [11]. Moreover, we intend to seek industrial cooperation to conduct online experiments that test the impact of the ranked explanations on users, e.g., click-through rate.
REFERENCES
[1] J. Douglas Carroll and Jih-Jie Chang. 1970. Analysis of individual differences in multidimensional scaling via an N-way generalization of "Eckart-Young" decomposition. Psychometrika 35, 3 (1970), 283–319.
[2] Chong Chen, Min Zhang, Yiqun Liu, and Shaoping Ma. 2018. Neural Attentional Rating Regression with Review-level Explanations. In WWW. ACM, 1583–1592.
[3] Hanxiong Chen, Xu Chen, Shaoyun Shi, and Yongfeng Zhang. 2019. Generate Natural Language Explanations for Recommendation. In SIGIR Workshop EARS.
[4] Xu Chen, Yongfeng Zhang, and Zheng Qin. 2019. Dynamic Explainable Recommendation based on Neural Attentive Models. In AAAI.
[5] Miao Fan, Chao Feng, Mingming Sun, and Ping Li. 2019. Reinforced product metadata selection for helpfulness assessment of customer reviews. In EMNLP.
[6] Zuohui Fu, Yikun Xian, Ruoyuan Gao, Jieyu Zhao, Qiaoying Huang, Yingqiang Ge, Shuyuan Xu, Shijie Geng, Chirag Shah, Yongfeng Zhang, et al. 2020. Fairness-Aware Explainable Recommendation over Knowledge Graphs. In SIGIR.
[7] Xiangnan He, Tao Chen, Min-Yen Kan, and Xiao Chen. 2015. TriRank: Review-aware Explainable Recommendation by Modeling Aspects. In CIKM. ACM.
[8] Robert Jäschke, Leandro Marinho, Andreas Hotho, Lars Schmidt-Thieme, and Gerd Stumme. 2007. Tag recommendations in folksonomies. In PKDD. Springer.
[9] Lei Li, Li Chen, and Ruihai Dong. 2020. CAESAR: context-aware explanation based on supervised attention for service recommendations. Journal of Intelligent Information Systems (2020), 1–24.
[10] Lei Li, Yongfeng Zhang, and Li Chen. 2020. Generate Neural Template Explanations for Recommendation. In CIKM. 755–764.
[11] Lei Li, Yongfeng Zhang, and Li Chen. 2021. Learning to Explain Recommendations.
[12] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
[13] Anand Rajaraman and Jeffrey David Ullman. 2011. Finding Similar Items. In Mining of Massive Datasets (3 ed.). Cambridge University Press, Chapter 3, 73–134.
[14] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In UAI.
[15] Steffen Rendle and Lars Schmidt-Thieme. 2010. Pairwise interaction tensor factorization for personalized tag recommendation. In WSDM. 81–90.
[16] Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, and John Riedl. 1994. GroupLens: an open architecture for collaborative filtering of netnews. In CSCW. ACM, 175–186.
[17] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In WWW. ACM, 285–295.
[18] Nava Tintarev and Judith Masthoff. 2015. Explaining Recommendations: Design and Evaluation. In Recommender Systems Handbook (2 ed.), Bracha Shapira (Ed.). Springer, Chapter 10, 353–382.
[19] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS. 5998–6008.
[20] Xiting Wang, Yiru Chen, Jie Yang, Le Wu, Zhengtao Wu, and Xing Xie. 2018. A Reinforcement Learning Framework for Explainable Recommendation. In ICDM.
[21] Yikun Xian, Zuohui Fu, S. Muthukrishnan, Gerard De Melo, and Yongfeng Zhang. 2019. Reinforcement knowledge graph reasoning for explainable recommendation. In SIGIR. 285–294.
[22] Xuhai Xu, Ahmed Hassan Awadallah, Susan T. Dumais, Farheen Omar, Bogdan Popp, Robert Rounthwaite, and Farnaz Jahanbakhsh. 2020. Understanding User Behavior For Document Recommendation. In WWW. 3012–3018.
[23] Yongfeng Zhang and Xu Chen. 2020. Explainable Recommendation: A Survey and New Perspectives. Foundations and Trends® in Information Retrieval 14, 1 (2020), 1–101.
[24] Yongfeng Zhang, Guokun Lai, Min Zhang, Yi Zhang, Yiqun Liu, and Shaoping Ma. 2014. Explicit Factor Models for Explainable Recommendation based on Phrase-level Sentiment Analysis. In