Generate Natural Language Explanations for Recommendation
Hanxiong Chen
Rutgers University, [email protected]
Xu Chen
Tsinghua University, [email protected]
Shaoyun Shi
Tsinghua University, [email protected]
Yongfeng Zhang
Rutgers University, [email protected]
ABSTRACT
Providing personalized explanations for recommendations can help users to understand the underlying insight of the recommendation results, which is helpful to the effectiveness, transparency, persuasiveness and trustworthiness of recommender systems. Current explainable recommendation models mostly generate textual explanations based on pre-defined sentence templates. However, the expressive power of template-based explanation sentences is limited to the pre-defined expressions, and manually defining the expressions requires significant human effort. Motivated by this problem, we propose to generate free-text natural language explanations for personalized recommendation. In particular, we propose a hierarchical sequence-to-sequence model (HSS) for personalized explanation generation. Different from conventional sentence generation in NLP research, a great challenge of explanation generation in e-commerce recommendation is that not all sentences in user reviews serve an explanation purpose. To solve the problem, we further propose an auto-denoising mechanism based on topical item feature words for sentence generation. Experiments on various e-commerce product domains show that our approach can not only improve the recommendation accuracy, but also the explanation quality in terms of the offline measures and feature word coverage. This research is one of the initial steps to grant intelligent agents the ability to explain themselves based on natural language sentences.
CCS CONCEPTS
• Information systems → Recommender systems; • Computing methodologies → Natural language processing;
KEYWORDS
Explainable Recommendation; Explainable AI; Collaborative Filtering; Natural Language Generation
ACM Reference format:
Hanxiong Chen, Xu Chen, Shaoyun Shi, and Yongfeng Zhang. 2019. Generate Natural Language Explanations for Recommendation. In Proceedings of SIGIR 2019 Workshop on ExplainAble Recommendation and Search, Paris, France, July 25, 2019 (EARS'19), 10 pages. DOI: 10.475/123 4
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
EARS'19, Paris, France
© 2019 Copyright held by the owner/author(s). 123-4567-24-567/08/06...$15.00
DOI: 10.475/123 4
1 INTRODUCTION
Recommender systems are playing an important role in many online applications. They provide personalized suggestions to help users select the most relevant items based on their preferences. Collaborative Filtering (CF) has been one of the most successful approaches to generate recommendations based on historical user behaviors [31]. However, the recently popular latent representation approaches to CF – including both shallow and deep models – can hardly explain their rating prediction and recommendation results to users [46]. Researchers have noticed that appropriate explanations are important to recommendation systems [46], as they can help to improve the system effectiveness, transparency, persuasiveness and trustworthiness. As a result, researchers have looked into explainable recommendation [46] and search [2] in recent years [1, 4–7, 17, 18, 29, 39–41, 46, 47], which can provide users not only with the recommendation lists, but also with intuitive explanations about why these items are recommended.

Recommendation explanations can be provided in many different forms, and among them a frequently used one is the textual sentence explanation. Current textual explanation generation models can be broadly classified into two categories – template-based methods and retrieval-based methods. Template-based models, such as [39, 47], define one or more explanation sentence templates, and then fill different words into the templates according to the corresponding recommendation so as to generate different explanations. Such words could be, for example, item feature words that the target user is interested in. However, template-based methods require extensive human effort to define different templates for different scenarios, and they limit the expressive power of explanation sentences to the pre-defined templates.
Retrieval-based methods such as [4], on the other hand, attempt to retrieve particular sentences from user reviews as the explanations of a recommendation, which improves the expression diversity of explanation sentences. However, the explanations are limited to existing sentences, and the model cannot produce new sentences for explanation.

Considering these problems, we propose to conduct explainable recommendation by generating free-text natural language explanations, while keeping a high prediction accuracy. There exist three key challenges to build and evaluate a personalized natural language explanation system. 1) Data bias – the most commonly used text resources for training explainable recommender systems are user-generated reviews. Although the reviews are plentiful, informative and contain valuable information about user opinions and product features [21, 47], they can be very noisy, and not all the sentences in a review serve an explanation purpose. Take Figure 1 as an example: only the underlined sentence is really commenting about the product. To train a good explanation generator, our model should have the ability of auto-denoising so as to focus the training on explanation sentences. 2) Personalization – since different users may pay attention to different product features, a good explainable recommendation system should have the ability to provide tailored explanations for different users according to the features that each user cares about. 3) Evaluation – although explainable recommendation has been widely researched in recent years, our understanding is still limited regarding which metric(s) are appropriate to evaluate the explainability of explanations. Recent research adopts readability measures from NLP (such as the ROUGE score) for evaluation, but since explainability is not equivalent to readability, only generating readable sentences is not sufficient, and we need to take the effectiveness of recommendation into consideration.
This problem involves deep understanding of natural language, and it also contributes technical merit to natural language processing research. Motivated by these challenges, we propose a hierarchical sequence-to-sequence model (HSS) with auto-denoising for personalized recommendation and natural language explanation generation. In particular, this paper makes the following contributions:
• We design a hierarchical generation model, which is able to collaboratively learn over multiple sentences from different users for explanation sentence generation.
• Based on item feature words extracted from reviews, we design a feature-aware attention model to implicitly select explanation sentences from reviews for model learning, and we further introduce a feature attention model to enhance the feature-level personality of the explanations.
• We adopt three offline metrics – BLEU score, ROUGE score and feature coverage – to evaluate the quality of the generated explanations. The first two metrics are classical measures for neural machine translation and text summarization. The BLEU score is precision-based while the ROUGE score is relatively recall-based. They complement each other, so it is reasonable to report both scores to reflect the quality of machine-generated text. We also use feature word coverage to show how well a model can capture the real personalized preferences of users. Meanwhile, the feature word coverage is a possible measure of the explainability of the generated explanation sentences.
In the following, we first review some related work in Section 2, and then explain the details of our framework in Section 3. We describe the three offline experiments that verify the performance of the proposed approach in terms of rating prediction and explanation in Section 4. In Section 5, we analyze the results and discuss what we learned from the experiments. Finally, we conclude this work and provide our visions of the research in Section 6.
2 RELATED WORK
Collaborative filtering (CF) [33] has been an important approach to modern personalized recommendation systems. Early collaborative filtering methods adopted intuitive and explainable methods, such as user-based [30] or item-based [32] collaborative filtering, which make recommendations based on similar users or similar items. Later approaches to CF advanced more and more toward accurate but less transparent latent factor approaches, beginning with various matrix factorization algorithms [15, 26, 34, 36], up to more recent deep learning and neural modeling approaches [44, 45, 49]. Though effective in ranking and rating prediction, the latent nature of these approaches makes it difficult to explain the recommendation results to users, which motivated the research on explainable recommendation [46].

Researchers have explored various approaches towards model-based explainable recommendation. Since user textual reviews are informative and better reflect user preferences, a lot of research has explored the possibility of incorporating user reviews to improve recommendation quality [3, 20, 22, 43, 49] and recommendation explainability [4, 6, 18, 29, 39, 47], which helps to enhance the effectiveness, transparency, trustworthiness and persuasiveness of recommendation systems [12, 47].

Early approaches to explainable recommendation generate explanations based on pre-defined explanation templates. For example, Zhang et al. [47] proposed an explicit factor model (EFM), which generates explanations by filling a sentence template with the item features that a user is interested in. However, generating explanation sentences in this way needs extensive human effort to define various templates for different scenarios. Moreover, the predefined templates limit the expressive power of explanation sentences.
Li et al. [18] leveraged neural rating regression and text generation to predict user ratings and user-generated tips for recommendation, which helps to improve the prediction accuracy and the effectiveness of recommendation results. However, not all of the tips serve as explanations for the recommendations, because they do not always explicitly comment on the product features. To alleviate the problem, Costa et al. [9] attempted to train generation models on user reviews and automatically generate fake reviews as explanations. One problem here is that not all of the sentences in the user reviews are appropriate for explanation purposes, because users may write sentences that are irrelevant to the corresponding item, which makes it difficult to generate explanations when the user reviews are too long and too noisy. Considering these deficiencies, we propose an auto-denoising mechanism for text generation to produce personalized natural language explanations for personalized recommendations.

Recently, deep neural network models have been used in various natural language processing tasks, such as question answering [50] and text summarization [27]. A well-trained neural network can learn lower-dimensional dense representations that capture grammatical and semantic generalizations [11]. This property of neural networks is useful for natural language generation tasks. The recurrent neural network (RNN) [24] has shown notable success in sequential modeling tasks. The long short-term memory unit (LSTM) [13] and the gated recurrent unit (GRU) [8] are among the most commonly used neural networks for natural language modeling, since they avoid the gradient vanishing problem when dealing with long sequences. A demonstration of the potential utility of recurrent networks for natural language generation was provided by [35], which used a character-level LSTM model for the generation of grammatical English sentences.
Character-level models can obviate the out-of-vocabulary (OOV) problem and reduce the vector representation spaces for language modeling. However, they are generally outperformed by word-level models [25]. Considering the performance of these two modeling strategies, our proposed approach works on the word level with GRUs to generate natural language explanations.

Figure 1: An example of user reviews in e-commerce. The sentences with red underlines are good for explanations.
3 THE FRAMEWORK
An explainable recommender system should not only give an accurate rating prediction for a given user and item, but also generate explanations to interpret the recommendation results. Our framework has two major modules: a rating prediction module and a natural language explanation generation module. Both modules take shared user and item latent factors as input. Since the input space is shared, each module can utilize extra information from the other module during the training process to improve the general performance of the framework. During the testing stage, only the user and item latent factors, as well as the extracted feature word information, are provided.

At the training stage, the training data consists of users, items, user-generated reviews and ratings. We use X to represent the training dataset; U and I are the user set and the item set, respectively; Review is the set of sentences in the user-generated reviews; R represents the set of user ratings; K is the feature word set, which is a subset of the vocabulary V. We have X = {U, I, Review, R, K}. The key notations in this paper are listed in Table 1.

In the rating regression module, only the user latent factors U and item latent factors V are given as the input. A multi-layer perceptron then projects these latent factors into a single value as the rating prediction. After that, we calculate the mean squared error loss and optimize the loss function.

In the personalized natural language explanation generation module, we design a hierarchical GRU to map the user and item latent factors into a sequence of words. The overview of our framework is shown in Figure 2. The hierarchical GRU contains a context GRU and a sentence GRU.
The context GRU generates the initial hidden state from which the sentence GRU generates the sequence of words.

Table 1: A summary of key notations in this work.

Notation    Explanation
X           training dataset
U           user set
I           item set
V           vocabulary
K           feature word set
S           the set of generated sequences
Review      the set of sentences in the reviews
R           the set of user ratings
U           the set of user latent factors
V           the set of item latent factors
u           user latent factor
v           item latent factor
k           feature word embedding
o           attentive feature-aware vector
Θ           the set of neural network parameters
β_i         the supervised factor of the i-th sentence
d           latent factor dimension
r_{u,i}     rating of user u to item i
tanh        hyperbolic tangent activation function
σ           sigmoid activation function
φ           rectified linear unit activation function
ς           softmax function

The attention model is employed to improve the personalization of the generated sentences. It can be interpreted as deciding which feature word or words the model should pay more attention to when generating the current sentence. Since not all the words in the vocabulary are good for explanations, and not all the feature words are suitable for each specific user-item pair, we expect the model to learn to generate more related and personalized explanation sentences by applying the attention model. Moreover, we design an auto-denoising mechanism by applying a supervised factor to the loss function of the corresponding generated sentence. The key point here is that we believe a sentence with a higher proportion of feature words is more important for training the model. The effect of sentences with a low or zero proportion of feature words is automatically weakened during the training process by applying a zero or very small supervised factor to their loss function. Finally, all the neural network parameters, the user and item latent factors, and the word embeddings in both modules are learned by a multi-task learning approach. The model can be trained through back-propagation.
3.2 Neural Rating Regression
The goal of neural rating regression is to make rating predictions given user and item latent factors. We borrow the idea from [18], which is to learn a function f_r(·) that projects the user latent factors U and the item latent factors V to rating scores r̂. Here f_r(·) is represented as a multi-layer perceptron (MLP):

  r̂ = MLP(U, V)    (1)

where U ∈ R^{d×m} and V ∈ R^{d×n} are in different latent vector spaces; m is the number of users and n is the number of items; d is the latent factor dimension for both user and item representations. We first map the user and item latent factors into a hidden state:

  h^r = tanh(W^r_{uh} u + W^r_{vh} v + b^r_h)    (2)

where W^r_{uh} ∈ R^{d×d} and W^r_{vh} ∈ R^{d×d}; b^r_h ∈ R^{d×1} is the bias term. We add more layers with the tanh activation function for non-linear transformation to improve the performance of rating prediction:

  h^r_l = tanh(W^r_{hh_l} h^r_{l−1} + b^r_{h_l})    (3)

where W^r_{hh_l} ∈ R^{d×d} is a mapping matrix and l is the index of the hidden layer. We denote the last hidden layer as h^r_L. The output layer maps the last hidden state into a predicted rating score r̂:

  r̂ = W^r_{hh_L} h^r_L + b^r_{h_L}    (4)

where W^r_{hh_L} ∈ R^{1×d}. The objective function of this rating regression problem is defined as:

  L^r = (1 / |X|) Σ_{u∈U, i∈I} (r̂_{u,i} − r_{u,i})^2    (5)

where r̂_{u,i} is the predicted rating score of user u on item i and r_{u,i} is the corresponding ground truth. We can optimize this objective function to learn the neural network parameters Θ as well as the user and item latent representations U and V.

3.3 Personalized Natural Language Explanation Generation
The key point of this work is to generate personalized natural language explanations. Although some research works have implemented deep neural models to generate reviews [10] or tips [18], few researchers work on explanation generation. In this section, we introduce: 1) the auto-denoising strategy; 2) feature-aware attention for personalized explanation generation; 3) the hierarchical GRU model for sentence generation.

3.3.1 Auto-Denoising. A user review usually contains multiple sentences.
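As a concrete illustration of Eqs. (1)–(5), the forward pass can be sketched in NumPy as below. The single hidden layer, the dimension d = 8 and the random initialization are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

d = 8                                   # latent factor dimension (illustrative)
rng = np.random.default_rng(0)

# Hypothetical parameters: one tanh hidden layer (Eq. 2) and a linear output (Eq. 4).
W_uh = rng.normal(scale=0.1, size=(d, d))
W_vh = rng.normal(scale=0.1, size=(d, d))
b_h = np.zeros(d)
W_out = rng.normal(scale=0.1, size=d)
b_out = 0.0

def predict_rating(u, v):
    """Project user/item latent factors to a scalar rating (Eqs. 2-4)."""
    h = np.tanh(W_uh @ u + W_vh @ v + b_h)
    return float(W_out @ h + b_out)

def rating_loss(triples):
    """Mean squared error over (u, v, rating) triples (Eq. 5)."""
    errs = [(predict_rating(u, v) - r) ** 2 for u, v, r in triples]
    return sum(errs) / len(errs)
```

In practice the gradients of this loss flow back into both the MLP parameters and the latent factors themselves.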
However, not all of them are good representations of the user's purchase intention. Our goal is to promote the quality of the generated explanation text by introducing a supervised factor to control the training process, so that our model can learn from the more important sentences while ignoring the useless ones. To implement this idea, we first extract all the feature words from the dataset using the Sentires toolkit [47, 48] (https://github.com/evison/Sentires), represented as K with K ⊆ V. Then the supervised factor of the i-th sentence in the review is calculated as:

  β_i = N^i_k / N^i_w    (6)

where N^i_k is the number of feature words in the i-th sentence and N^i_w is the total number of words in the i-th sentence. We multiply this supervised factor into the loss function of the sentence to control the training process. We believe that a sentence with a higher proportion of feature words is more important. The effect of sentences with a low or zero proportion of feature words is automatically weakened by multiplying a zero or very small factor into their loss function.

3.3.2 Feature-Aware Attention. Feature words are words that describe the features of a product. For example, "memory", "screen" and "sensitivity" can be feature words in an electronics dataset, while "use", "good" and "day" are not feature words, since they do not describe a feature of an item.
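For concreteness, the supervised factor of Eq. (6) amounts to a simple token count; the token lists and the feature set below are toy examples.

```python
def supervised_factor(sentence_tokens, feature_words):
    """beta_i = N_k / N_w: fraction of tokens that are feature words (Eq. 6)."""
    if not sentence_tokens:
        return 0.0
    n_k = sum(1 for w in sentence_tokens if w in feature_words)
    return n_k / len(sentence_tokens)

features = {"battery", "screen", "memory"}                 # toy feature word set
print(supervised_factor("the battery life is great".split(), features))  # → 0.2
print(supervised_factor("i bought this last week".split(), features))    # → 0.0
```

A sentence with no feature words gets β = 0, so its loss term vanishes from the objective and it is effectively ignored during training.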
Since users may pay different attention to these feature words and each product may only relate to some of them, inspired by [42] we implement a feature-aware attention mechanism to improve personalization. Mathematically, given a hidden state h_t and the i-th feature word embedding k_i, the attention score of feature word k_i at time t is computed as:

  x_i = [h_t; k_i]
  a(i, h_t) = w^T φ(W_a x_i + b_a) + b_a    (7)

where x_i ∈ R^{2d×1} is the concatenation of the hidden vector at time t and the i-th feature word vector; W_a ∈ R^{d×2d} is the mapping matrix of the first layer network; b_a ∈ R^{d×1} is the first layer bias; w ∈ R^{d×1} and b_a ∈ R are the neural parameters of the second layer; φ(·) is the ReLU activation function, which is defined as:

  φ(x) = max(0, x)    (8)

The final attention weights are obtained by normalizing the above attentive scores with softmax, which can be interpreted as how much attention we pay to each feature word given the corresponding hidden state during the training process:

  α(i, h_t) = exp(a(i, h_t)) / Σ_{j=1}^{|K|} exp(a(j, h_t))    (9)

Finally, the attentive feature-aware vector at time t is calculated as:

  o_t = Σ_{i=1}^{|K|} α(i, h_t) k_i    (10)

This attentive feature-aware vector will be used to compute the initial hidden state for generating the n-th sentence in GRU_wrd, which will be introduced in the following subsection.
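A NumPy sketch of Eqs. (7)–(10): score every feature word against the hidden state, normalize with softmax, then pool the embeddings. The dimension d = 4, the six feature-word embeddings and the random weights are illustrative assumptions.

```python
import numpy as np

d = 4
rng = np.random.default_rng(1)
K = rng.normal(size=(6, d))            # six hypothetical feature-word embeddings
W_a = rng.normal(size=(d, 2 * d))      # first layer over x_i = [h_t; k_i]
b_a = np.zeros(d)
w = rng.normal(size=d)                 # second-layer weight vector
b_a2 = 0.0                             # second-layer scalar bias

def feature_attention(h_t):
    """Score each feature word (Eq. 7), softmax (Eq. 9), then pool (Eq. 10)."""
    scores = np.array([
        w @ np.maximum(0.0, W_a @ np.concatenate([h_t, k]) + b_a) + b_a2
        for k in K
    ])
    alpha = np.exp(scores - scores.max())    # numerically stable softmax
    alpha /= alpha.sum()
    return alpha @ K                         # attentive feature-aware vector o_t
```

The returned vector is a convex combination of the feature-word embeddings, weighted by their relevance to the current hidden state.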
3.3.3 Context GRU (GRU_ctx). As shown in Figure 2, the review sentences are generated by GRU_wrd, which will be introduced in the next subsection, while the initial hidden states are given by GRU_ctx. By leveraging this hierarchical recurrent neural network, we can generate multiple sentences given one pair of user and item latent factors. Since each generated sentence has its own loss function, we are able to apply the auto-denoising strategy mentioned above to reduce the effect of unrelated sentences in the user-generated reviews during the training process.

Suppose that for each user-item pair there are n sentences in the review. Then we have n context representations:

  C = {C_1, C_2, ..., C_n}

We use C to denote the collection of all the context representations and C_i to denote a specific context representation. When a sentence is generated, the context representation is updated by the following equation:

  C_n = GRU_ctx(C_{n−1}, h^w_{n−1,L})    (11)
Figure 2: The overview of our framework (explanation text generation with topic-aware attention).
The GRU_ctx state is updated by the following operations:

  r^c_n = σ(W^c_{hr} h^w_{n−1,L} + W^c_{cr} C_{n−1} + b^c_r)
  z^c_n = σ(W^c_{hz} h^w_{n−1,L} + W^c_{cz} C_{n−1} + b^c_z)
  g^c_n = tanh(W^c_{hg} h^w_{n−1,L} + W^c_{cg} (r^c_n ⊙ C_{n−1}) + b^c_g)
  C_n = z^c_n ⊙ C_{n−1} + (1 − z^c_n) ⊙ g^c_n    (12)

To start the whole process, we utilize the user latent factor u and the item latent factor v to initialize the first hidden state C_1:

  C_1 = φ(W^c_u u + W^c_v v + b^c)    (13)

3.3.4 Sentence GRU (GRU_wrd). This part generates the words of the explanation sentences. The main idea can be described as follows:

  p(w_{n,t} | w_{n,1}, w_{n,2}, ..., w_{n,t−1}) = ς(h^w_{n,t})    (14)

where w_{n,t} is the t-th word of the n-th review sentence and ς(·) is the softmax function, defined as:

  ς(x_i) = e^{x_i} / Σ_j e^{x_j}    (15)

h^w_{n,t} is the sequence hidden state of the n-th sentence at time t. It depends on the previous hidden state h^w_{n,t−1} and the current input w_{n,t}:

  h^w_{n,t} = f(h^w_{n,t−1}, w_{n,t})    (16)

Here f(·) can be an LSTM, a GRU or a vanilla RNN; we utilize a GRU for efficiency. The states are updated by the following operations:

  r^w_{n,t} = σ(W^w_{wr} w_{n,t} + W^w_{hr} h^w_{n,t−1} + b^w_r)
  z^w_{n,t} = σ(W^w_{wz} w_{n,t} + W^w_{hz} h^w_{n,t−1} + b^w_z)
  g^w_{n,t} = tanh(W^w_{wg} w_{n,t} + W^w_{hg} (r^w_{n,t} ⊙ h^w_{n,t−1}) + b^w_g)
  h^w_{n,t} = z^w_{n,t} ⊙ h^w_{n,t−1} + (1 − z^w_{n,t}) ⊙ g^w_{n,t}    (17)

where r^w_{n,t} is the reset gate; z^w_{n,t} is the update gate; ⊙ represents element-wise multiplication; tanh denotes the hyperbolic tangent activation function; and w_{n,t} could simply be the vector representation of the word w_{n,t}, i.e., the t-th word of the n-th sentence in the review. However, we expect to bring more personalized information into the text generation model. Inspired by [37], we concatenate the word embedding of word w at time t with the user embedding u and the item embedding v to get an enhanced input embedding s_{n,t}.
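The context GRU (Eq. 12) and the sentence GRU (Eq. 17) share the same gating structure. A minimal NumPy sketch of one such update step, with illustrative dimensions and random weights:

```python
import numpy as np

d = 4
rng = np.random.default_rng(2)
# Hypothetical GRU parameters; each W maps the input, each U maps the state.
W_r, U_r, b_r = rng.normal(size=(d, d)), rng.normal(size=(d, d)), np.zeros(d)
W_z, U_z, b_z = rng.normal(size=(d, d)), rng.normal(size=(d, d)), np.zeros(d)
W_g, U_g, b_g = rng.normal(size=(d, d)), rng.normal(size=(d, d)), np.zeros(d)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev):
    """One GRU update: reset gate, update gate, candidate state, blend."""
    r = sigmoid(W_r @ x + U_r @ h_prev + b_r)        # reset gate
    z = sigmoid(W_z @ x + U_z @ h_prev + b_z)        # update gate
    g = np.tanh(W_g @ x + U_g @ (r * h_prev) + b_g)  # candidate state
    return z * h_prev + (1.0 - z) * g                # new hidden state
```

In GRU_ctx the input x is the last word-level hidden state h^w_{n−1,L}; in GRU_wrd it is the (enhanced) word embedding w_{n,t}.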
Then we feed this embedding into a multi-layer perceptron to produce the input vector w_{n,t}:

  s_{n,t} = [e_{n,t}; u; v]
  h^s_1 = φ(W^s_1 s_{n,t} + b^s_1)    (18)

where e_{n,t} is the vector representation of word w in the n-th sentence at time t, and h^s_1 is the hidden state after the non-linear transformation of the enhanced embedding. We can add more layers and finally feed the output of the last hidden layer h^s_L into an output layer to get the input vector w_{n,t}:

  w_{n,t} = W^s_L h^s_L + b^s_L    (19)

where W^s_1 ∈ R^{d×3d}, W^s_L ∈ R^{d×d}; b^s_1 and b^s_L are in R^d.

To start the explanation sentence generation process, we need an initial hidden state. We use the output of GRU_ctx, C_n, the user latent factor u, the item latent factor v and the n-th sentence's feature-aware attentive context vector o_n together to compute the initial hidden state h^w_{n,0}:

  h^w_{n,0} = W^T_{in,2} φ(W_{in,1} [C_n; u; v; o_n] + b_{in,1}) + b_{in,2}    (20)

where W_{in,1} ∈ R^{d×4d}, b_{in,1} ∈ R^{d×1}, W_{in,2} ∈ R^{d×d}, b_{in,2} ∈ R^{d×1}. The feature-aware attentive context vector o_n is calculated as described in subsection 3.3.2, where the hidden state h_t is replaced with C_n and the feature-aware attentive context vector is represented as o_n instead of o_t. This can be interpreted as how much attention the model pays to the feature words when generating the n-th explanation sentence. Equation (20) uses a two-layer neural network to calculate the initial hidden state for GRU_wrd; more layers could be added here.

With h^w_{n,0}, the GRU can conduct the sequence decoding process. After obtaining all the hidden states of a sequence, we feed them into a final output layer to predict the word sequence of the review:

  ŷ_{t+1} = ς(W^w_h h^w_t + b^w)    (21)

where ς(·) is the softmax function defined in Equation (15); h^w_t ∈ R^{d×l} is the hidden state matrix, where l is the length of the sequence; W^w_h ∈ R^{|V|×d}; and ŷ_{t+1} can be considered as a multinomial distribution over the vocabulary V of the review text.
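The softmax output of Eq. (21), together with the greedy word choice and the per-word negative log-likelihood used in training, can be sketched over a toy vocabulary as below; the vocabulary and the random output weights are hypothetical.

```python
import numpy as np

vocab = ["<eos>", "the", "battery", "is", "great"]   # toy vocabulary
d = 4
rng = np.random.default_rng(3)
W_out = rng.normal(size=(len(vocab), d))             # plays the role of W^w_h
b_out = np.zeros(len(vocab))

def word_distribution(h_t):
    """Softmax over the vocabulary from a hidden state (Eq. 21)."""
    logits = W_out @ h_t + b_out
    p = np.exp(logits - logits.max())
    return p / p.sum()

def greedy_next_word(h_t):
    """Pick the highest-probability word (the argmax selection)."""
    return vocab[int(np.argmax(word_distribution(h_t)))]

def word_nll(h_t, target_index):
    """Negative log-likelihood of one observed word (one loss term)."""
    return -np.log(word_distribution(h_t)[target_index])
```

Summing `word_nll` over the words of a sentence gives its loss, which is then scaled by the supervised factor β_i for auto-denoising.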
Then the model can generate the next word w*_{t+1} from ŷ_{t+1} by selecting the one with the largest probability. Here we use w_i to indicate the i-th word in the vocabulary. Then we have:

  w*_{t+1} = argmax_{w_i ∈ V} ŷ^{(w_i)}_{t+1}    (22)

To train the model, we use the negative log-likelihood as the loss function. Our goal is to make the words in the review have higher probabilities than others. Here I_w is the index of word w in the vocabulary V. The loss function of the i-th sentence is represented as:

  L^s_i = − Σ_{w ∈ Review} log ŷ^{(I_w)}    (23)

In the testing stage, we introduce beam search to find the best sequence s* with maximum log-likelihood:

  s* = argmax_{s ∈ S} Σ_{w ∈ s} log ŷ^{(I_w)}    (24)

where S is the set of generated sequences and |S| is the beam size.

The framework contains two major modules, which we integrate into one multi-task learning process. The final objective function is defined as:

  J = min_{U, V, E, Θ} ( L^r + Σ_{i=1}^{|Review|} β_i L^s_i + λ(||U||^2 + ||V||^2 + ||Θ||^2) )    (25)

where E ∈ R^{d×|V|} is the word embedding matrix; Θ is the set of neural parameters; λ is the penalty weight; L^r is the rating regression loss; β_i is the supervised factor of the i-th sentence; and β_i L^s_i is the weighted loss of the i-th generated sentence for auto-denoising.

4 EXPERIMENTS
Our datasets are built upon the Amazon 5-core dataset [23] (http://jmcauley.ucsd.edu/data/amazon), which includes user-generated reviews and metadata spanning from May 1996 onwards.

Table 2: Statistics of the datasets in our experiments.
We use the Electronics and Beauty datasets to cover different domains and different scales in our experiments. Instead of using the original 5-core version, we filter the dataset by selecting the users who have at least 10 shopping records. The reason for this filtering operation is that the model would not be well trained to learn the personalized preferences of users with very few reviews. After the original 5-core data is filtered, we move the records of low-frequency items into the training set to avoid the cold-start issue at the testing stage. For review text pre-processing, we keep all the punctuation and numbers in the raw text, and we do not remove long sentences by setting a length threshold. In other words, our dataset is noisy, which is challenging for text generation models. The Electronics dataset contains 45,224 users, 61,687 items, 744,453 reviews and 434 extracted feature words; the Beauty dataset is smaller and contains 5,122 users, 11,616 items, 90,247 reviews and 518 extracted feature words. The statistical details of our datasets are in Table 2.

We filter out the words with frequency lower than ten to build the vocabulary V. Then the whole dataset is split into three subsets: training, validation and testing (80%/10%/10%).

Baselines. To evaluate the performance of rating prediction, we compare our HSS model with three methods, namely BiasedMF, SVD++ and DeepCoNN. The first two methods only utilize the rating information, while the third involves user-generated reviews for rating prediction.
• BiasedMF [15]: Biased Matrix Factorization. It only uses the rating matrix to learn two low-rank user and item matrices for rating prediction. By adding biases into the plain matrix factorization model, it is able to depict the independent interaction of a user or an item on a rating value.
• SVD++ [14]: It extends Singular Value Decomposition by integrating implicit feedback into latent factor modeling.
• DeepCoNN [49]: Deep Cooperative Neural Networks.
This is a state-of-the-art deep learning method that exploits user review information to jointly model users and items. The authors have shown that their model significantly outperforms some strong topic-modeling-based methods such as HFT [22] and CTR [38]. We use the implementation by [4] in our experiments.
Evaluation Metric. To evaluate the performance of rating prediction, we employ the well-known Root Mean Square Error (RMSE) as our evaluation metric.

4.3 Explanation Sentence Generation Evaluation
Baseline. To evaluate the performance of the text generation module, we compare our work with Att2SeqA [16]. This work automatically generates product reviews given the user, the item and the corresponding rating information. Their model treats user, item and rating as attributes and encodes the three attributes into latent factors through a multi-layer perceptron. The decoder then takes the encoded latent factor as the initial hidden state of an LSTM for review generation. In our implementation, we also use the two-level stacked LSTM for text generation as the paper proposed. There are three reasons for choosing this model as our baseline:
• Similar input: both our HSS and their Att2SeqA learn user and item latent factors as the input for text generation. The difference is that their model takes the rating information as a direct input, while our model learns to predict the rating score.
• Use of an attention mechanism: their model introduces an attention mechanism to enhance the text generation quality, while our model also uses an attention model to improve the personalization of the generated explanations.
• Use of review data: both methods use user-generated reviews for training the model. The difference is that their model learns from user-written reviews to automatically generate fake reviews, while our model generates explanation sentences.
Considering these three reasons, we believe that this model is the most suitable and competitive model for comparison.

There is another related model called
NRT, proposed in [18]. In that paper, the authors also perform rating regression and text generation simultaneously. However, their goal is to generate tips. The data source used by their model is the summary field in the Amazon dataset. The summary can be treated as the title of a user review. It only contains one short sentence expressing the general feeling of a user about a product, such as "So good", "Excellent" or "I don't like it". Since the summaries or tips are too general to depict the features of an item that a user prefers, we cannot use summaries for training an explanation generation model. The NRT model is very useful for simulating user feelings on a specific item. However, considering the differences in data source and design purpose, we do not use this model as our baseline.
Evaluation Metrics. We use three evaluation metrics to evaluate the quality of the generated explanation sentences: BLEU [28], ROUGE [19] and feature words coverage.

• BLEU: this is a precision-based measure used for automatically evaluating machine-generated text quality. It measures how well a machine-generated text (candidate) matches a set of human reference texts by counting the percentage of n-grams in the machine-generated text that overlap with the human references. The precision score for n-grams is calculated as:

$$p_n = \frac{\sum_{C \in \{Candidates\}} \sum_{ngram \in C} Count_{clip}(ngram)}{\sum_{C' \in \{Candidates\}} \sum_{ngram' \in C'} Count(ngram')}$$

where $Count_{clip}$ means that the count of each n-gram in the machine-generated text is truncated so as not to exceed the largest count observed in any single reference for that n-gram. For more details, please refer to the paper [28].

• ROUGE: this is another classical evaluation metric for machine-generated text quality. It is a recall-oriented measure which shows how many of the words in the human reference texts appear in the machine-generated text. ROUGE-N is computed as:

$$\text{ROUGE-N} = \frac{\sum_{S \in \{References\}} \sum_{ngram \in S} Count_{match}(ngram)}{\sum_{S \in \{References\}} \sum_{ngram \in S} Count(ngram)}$$

where $Count_{match}(ngram)$ is the maximum number of n-grams co-occurring in the machine-generated text and the set of human reference texts. In our experiments, we use the recall, precision and F-measure of ROUGE-1 (uni-gram), ROUGE-2 (bi-gram), ROUGE-L (longest common subsequence) and ROUGE-SU4 (skip-bigram) to evaluate the quality of the generated explanation sentences. We use the standard options for evaluation.
• Feature words coverage: this measure reflects how well our model captures users' personalized preferences. Assuming that the number of feature words in the human reference texts is $N_r$ and the number of those feature words covered by the machine-generated sentences is $N_c$, the feature words coverage is calculated as:

$$Coverage_{feature} = \frac{N_c}{N_r}$$

This is also the measure we use to evaluate the explainability of the generated explanation sentences.
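Under simplifying assumptions (pre-tokenized text, a single candidate, no BLEU brevity penalty and no ROUGE stemming), the three metrics above can be sketched as:

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu_precision(candidate, references, n):
    """Clipped n-gram precision p_n for one candidate against references."""
    cand_counts = Counter(ngrams(candidate, n))
    # Each n-gram count is clipped to its max count in any single reference
    max_ref = Counter()
    for ref in references:
        for g, c in Counter(ngrams(ref, n)).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def rouge_n_recall(candidate, references, n):
    """ROUGE-N: recall of reference n-grams by the candidate."""
    cand_counts = Counter(ngrams(candidate, n))
    matched, total = 0, 0
    for ref in references:
        for g, c in Counter(ngrams(ref, n)).items():
            matched += min(c, cand_counts[g])
            total += c
    return matched / total if total else 0.0

def feature_coverage(candidate, reference, feature_words):
    """Fraction of the reference's feature words that appear in the candidate."""
    ref_feats = set(reference) & set(feature_words)
    covered = ref_feats & set(candidate)
    return len(covered) / len(ref_feats) if ref_feats else 0.0
```

The official BLEU additionally applies a brevity penalty and geometrically averages over n, and ROUGE-1.5.5 supports many options (stemming, skip-bigrams, confidence intervals); this sketch only mirrors the core counting formulas.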
In our HSS model, we set the dimension of the user and item latent factors to 300. The hidden size and the word vector dimension are also set to 300. The number of layers is 4 for the rating regression model and 3 for explanation generation. The training batch size is 100. We apply gradient clipping on $GRU_{ctx}$ and $GRU_{sen}$ by setting the gradient clip norm to 1.0. The L2 regularization weight parameter λ = . The standard ROUGE options we use are: ROUGE-1.5.5.pl -n 4 -w 1.2 -m -2 4 -u -c 95 -r 1000 -f A -p 0.5

Table 3: RMSE values for rating regression

            Electronics   Beauty
BiasedMF    1.096         1.030
SVD++       1.104         1.034
DeepCoNN
HSS

…specific item quickly. However, if the explanation is too short, for example a single sentence, it may not cover enough information to improve the recommendation quality. We consider two sentences a reasonably good length for explanations.
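The gradient clipping used in training (clip norm 1.0 on $GRU_{ctx}$ and $GRU_{sen}$) amounts to rescaling a parameter group's gradients whenever their joint L2 norm exceeds the threshold; a framework-agnostic sketch:

```python
import numpy as np

def clip_grad_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their global L2 norm is <= max_norm.
    Returns the (possibly rescaled) gradients and the pre-clip norm."""
    total = float(np.sqrt(sum((g ** 2).sum() for g in grads)))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads, total

# Toy gradients for two parameter groups (e.g. the two GRUs)
g1 = np.full((3,), 2.0)
g2 = np.full((2,), 2.0)
clipped, norm = clip_grad_norm([g1, g2], max_norm=1.0)
```

Deep learning frameworks provide this directly (e.g. PyTorch's `torch.nn.utils.clip_grad_norm_`); the sketch only illustrates the rescaling rule.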
Our HSS model not only generates natural language explanation sentences, but also provides predicted rating scores. The rating prediction results of our model and the baseline models are given in Table 3, which shows that our model outperforms all baselines on the Beauty dataset. On the Electronics dataset, the RMSE of HSS is better than that of BiasedMF and SVD++. Although the performance does not surpass the state-of-the-art DeepCoNN model, the result is still comparable. In general, the review-based deep neural network model DeepCoNN and HSS are better than the traditional collaborative filtering methods. This is because DeepCoNN and HSS use user reviews to improve the representation ability of the user and item latent factors, while the traditional methods only use rating information.

The difference between our HSS and DeepCoNN is the way they use review data. In HSS, we use a GRU to learn to generate a sequence of words, and the review data is used for maximizing the log likelihood of the generated words. DeepCoNN maps the user review content into a set of word embeddings, then passes the word embeddings through convolution layers, a max-pooling layer and fully connected layers to map them into a rating score. Although the two models use review data differently, the experimental results for both show that exploiting user review information helps improve recommendation performance.
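The DeepCoNN-style pipeline just described (review word embeddings fed through a convolution layer, max pooling and a fully connected layer to produce a rating) can be sketched as follows; the dimensions, random weights and ReLU choice are illustrative assumptions, not the original implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def review_to_rating(word_embs, conv_w, conv_b, fc_w, fc_b, width=3):
    """word_embs: (seq_len, d) embeddings of the review words.
    conv_w: (n_filters, width * d) 1-D convolution filters over word windows."""
    seq_len, d = word_embs.shape
    # 1-D convolution over sliding word windows, ReLU activation
    windows = np.stack([word_embs[i:i + width].ravel()
                        for i in range(seq_len - width + 1)])
    feats = np.maximum(windows @ conv_w.T + conv_b, 0.0)
    # Max pooling over time collapses a variable-length review to a fixed vector
    pooled = feats.max(axis=0)
    # Fully connected layer maps the review representation to a rating score
    return float(pooled @ fc_w + fc_b)

d, n_filters = 8, 16
emb = rng.normal(size=(12, d))            # a 12-word review
conv_w = rng.normal(size=(n_filters, 3 * d))
conv_b = np.zeros(n_filters)
fc_w = rng.normal(size=n_filters)
rating = review_to_rating(emb, conv_w, conv_b, fc_w, 3.5)
```

Max pooling is what lets reviews of different lengths share one fixed-size representation before the rating head.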
Table 4: BLEU-1 (B-1), BLEU-4 (B-4) and Feature words coverage (FC) on the Electronics and Beauty datasets (in percentage)

              Electronics            Beauty
          B-1     B-4     FC     B-1     B-4     FC
Att2SeqA  7.32    2.17    2.16   8.54    1.61    1.69
HSS       12.36   4.17    6.74   9.55    3.49    6.05

In order to evaluate the quality of the generated sentences, we report the recall, precision and F-measure of ROUGE-1, ROUGE-2, ROUGE-L and ROUGE-SU4. The results are shown in Tables 5 and 6. According to the results, our model outperforms the baseline model on almost all measures; only the recall on ROUGE-SU4 is slightly lower than the baseline. From the results we can see that both models achieve good recall scores on all measures except ROUGE-2. One possible reason is that both models employ an attention mechanism during the sequence generation process. The results indicate that adding an attention context vector to the word generation process helps generate sentences that are more related to the user and the product.

One difference between our model and Att2SeqA is that we implement the attention model by leveraging feature words. We believe that not all feature words are related to a specific item, and each user has their own preferred features. Considering this property of the e-commerce scenario, we calculate attention weights between the context-level hidden state at the current time step and each feature word embedding. We then apply the attention weights to the feature word embeddings and integrate the weighted embeddings into an attentive context vector. This context vector represents how much attention the model pays to each feature word when generating the current sentence. In contrast, Att2SeqA obtains its attentive context vector from the user, item and rating latent factors, i.e., the attributes mentioned in the paper [16], and combines this context vector with the decoder output at each time step to predict the next word. Since their attention mechanism is not designed to improve feature word coverage, our model achieves a much higher feature words coverage score, as shown in Table 4. In other words, attending over feature words does help the model cover more feature words in the generated sentences.

Another observation is that our model gives a much higher precision score than the baseline model. It means that the sequences generated by our model hit many more words in the human reference texts than those generated by Att2SeqA. As shown in Table 4, the BLEU score, which is a precision-based metric for text generation evaluation, is also higher for HSS than for Att2SeqA.
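The feature-word attention described above can be sketched as follows: scores between the context-level hidden state and each feature word embedding are normalized by a softmax, and the weighted sum of the embeddings forms the attentive context vector. The dot-product scoring function and the dimensions here are simplifying assumptions:

```python
import numpy as np

def feature_attention(h_ctx, feature_embs):
    """h_ctx: (d,) context-level hidden state at the current time step.
    feature_embs: (n_features, d) embeddings of candidate feature words.
    Returns the attentive context vector and the attention weights."""
    scores = feature_embs @ h_ctx                     # dot-product scores
    scores -= scores.max()                            # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax
    context = weights @ feature_embs                  # weighted sum of embeddings
    return context, weights

rng = np.random.default_rng(1)
h = rng.normal(size=16)
feats = rng.normal(size=(5, 16))   # e.g. embeddings of "price", "sound", ...
ctx, w = feature_attention(h, feats)
```

The weights `w` are exactly the "how much attention does the model pay to each feature word" quantity described in the text; `ctx` is then fed into word prediction alongside the decoder state.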
Our model is able to generate multiple sentences. To evaluate multi-sentence generation quality, we run experiments on the Beauty dataset, varying the number of sentences from 1 to 3 during the training and testing stages. For example, when the number of sentences is set to 1, we only use the first sentence of each review to train the model; during testing, we generate only one sentence and calculate the ROUGE and BLEU scores against the first sentence of the human reference text. We report how the recall, precision and F-measure of the ROUGE scores change with the number of sentences in Figures 4(a), 4(b) and 4(c). From the results, we can see that our model achieves better recall on all measures when generating more than one sentence. The ROUGE precision on multi-sentence generation is slightly lower than in the single-sentence case. A possible reason is that the more sentences are involved in training and testing, the more challenging it is for the generation model to cover the information in the human reference texts.
One thing we need to point out is that we do not perform length alignment on the review data. That means some reviews contain only one sentence while others contain two or more, and the length of each sentence also varies. This is a big challenge for an RNN-based sentence generation model: training on very long sequences suffers from the vanishing gradient problem, which makes it hard for a deep neural network to learn its parameters. Our hierarchical GRU model helps with this problem, because the context-level GRU captures the long-range dependency, so the sequence length in each generation step is reduced. The experimental results verify that our model is able to generate multiple sentences.

In Table 7 we list some generated explanations, covering good sentences with explainability, a sentence that contains feature words but is not fluent, and a bad sentence with a wrong description of the item. For the last example, the wrong description means that the item is a wireless router but the sentence does not describe it correctly. This is a common issue we encountered during the experiments. A possible reason is that the dataset is very sparse, so the corresponding item vector is not well trained, which results in the wrong description issue.

Table 5: ROUGE score on the Electronics dataset (in percentage)

           ROUGE-1                 ROUGE-2               ROUGE-L                ROUGE-SU4
           recall  prec.   F1      recall  prec.  F1     recall  prec.   F1     recall  prec.  F1
Att2SeqA   22.80   7.79    10.19   0.45    0.14   0.18   19.93   6.77    8.85   9.26    1.07   1.38
HSS        26.76   15.72   18.36   3.01    1.77   2.05   22.51   13.31   15.47  9.69    3.51   4.10

Table 6: ROUGE score on the Beauty dataset (in percentage)

           ROUGE-1                 ROUGE-2               ROUGE-L
           recall  prec.   F1      recall  prec.  F1     recall  prec.   F1
Att2SeqA   26.55   8.67    12.03   0.70    0.19   0.27   22.96   7.57    10.46

Table 7: Examples of generated explanation sentences

Description                           Explanation sentences
good explanation on Beauty            The bottle is very light and the smell is very strong.
good explanation on Electronics       The price is great. The sound quality is great
cover feature words but not fluent    The scent is a good product. I have to use this product. I have used to use the hair.
fluent but wrong description          the price is a great. the sound is great

Figure 3: ROUGE scores change with the number of generated sentences (panels show ROUGE recall, precision and F1 on the Beauty dataset for 1–3 sentences; y-axis: scores in percentage)
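A minimal sketch of the hierarchical idea discussed above: the sentence-level GRU only unrolls over the words of one sentence, while the context-level GRU steps once per sentence, so neither recurrence spans the full review length. The cell equations follow the standard GRU; the weights and dimensions are toy values, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8  # toy hidden/embedding dimension

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, p):
    """Minimal GRU cell; p holds the weight matrices (biases omitted)."""
    z = sigmoid(p['Wz'] @ x + p['Uz'] @ h)              # update gate
    r = sigmoid(p['Wr'] @ x + p['Ur'] @ h)              # reset gate
    h_tilde = np.tanh(p['Wh'] @ x + p['Uh'] @ (r * h))  # candidate state
    return (1 - z) * h + z * h_tilde

def make_params():
    return {k: rng.normal(scale=0.1, size=(d, d))
            for k in ('Wz', 'Uz', 'Wr', 'Ur', 'Wh', 'Uh')}

ctx_p, sen_p = make_params(), make_params()

def encode_review(review_sentences):
    """Context GRU steps once per sentence; the sentence GRU runs over the
    words of each sentence starting from the context state, so each unrolled
    sequence stays short even for multi-sentence reviews."""
    h_ctx = np.zeros(d)
    for sentence in review_sentences:          # one context step per sentence
        h_sen = h_ctx                          # init word-level state from context
        for word_emb in sentence:              # word-level steps within a sentence
            h_sen = gru_cell(word_emb, h_sen, sen_p)
        h_ctx = gru_cell(h_sen, h_ctx, ctx_p)  # sentence summary updates context
    return h_ctx

review = [rng.normal(size=(5, d)), rng.normal(size=(7, d))]  # 2 toy sentences
h = encode_review(review)
```

Because backpropagation through time only spans one sentence at the word level and one step per sentence at the context level, gradients travel far fewer recurrent steps than in a flat GRU over the concatenated review.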
In this work, we proposed a deep learning framework called the hierarchical sequence-to-sequence (HSS) model, which can not only make accurate rating predictions but also generate explanation sentences to improve the effectiveness and trustworthiness of recommender systems. For rating prediction, our model outperforms the CF-based BiasedMF and SVD++ algorithms and achieves a result comparable to the state-of-the-art DeepCoNN model. For the explanation generation module, we designed a hierarchical GRU with a feature-aware attention mechanism to generate personalized explanation sentences. We also introduced an auto-denoising method to reduce the effect of unrelated sentences during training. In the future, we expect to address the wrong description issue mentioned in the previous section. We will also apply this framework to other datasets to test its robustness, and consider knowledge-enhanced explanation generation for explainable AI.
ACKNOWLEDGEMENT
We thank the reviewers for the careful reviews and constructive suggestions. This work was partly supported by the National Science Foundation under IIS-1910154. Any opinions, findings, conclusions or recommendations expressed in this paper are the authors' and do not necessarily reflect those of the sponsors.
REFERENCES
[1] Qingyao Ai, Vahid Azizi, Xu Chen, and Yongfeng Zhang. 2018. Learning heterogeneous knowledge base embeddings for explainable recommendation. Algorithms 11, 9 (2018), 137.
[2] Qingyao Ai, Yongfeng Zhang, Keping Bi, and W. Bruce Croft. 2019. Explainable product search with a dynamic relation embedding model. ACM Transactions on Information Systems (TOIS) 38, 1 (2019), 1–29.
[3] Amjad Almahairi, Kyle Kastner, Kyunghyun Cho, and Aaron Courville. 2015. Learning distributed representations from reviews for collaborative filtering. In Proceedings of the 9th ACM Conference on Recommender Systems. ACM, 147–154.
[4] Chong Chen, Min Zhang, Yiqun Liu, and Shaoping Ma. 2018. Neural attentional rating regression with review-level explanations. In Proceedings of the 2018 World Wide Web Conference. International World Wide Web Conferences Steering Committee, 1583–1592.
[5] Xu Chen, Hanxiong Chen, Hongteng Xu, Yongfeng Zhang, Yixin Cao, Zheng Qin, and Hongyuan Zha. 2019. Personalized fashion recommendation with visual explanations based on multimodal attention network: Towards visually explainable recommendation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 765–774.
[6] Xu Chen, Zheng Qin, Yongfeng Zhang, and Tao Xu. 2016. Learning to rank features for recommendation over multiple categories. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 305–314.
[7] Xu Chen, Yongfeng Zhang, and Zheng Qin. 2019. Dynamic explainable recommendation based on neural attentive models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 53–60.
[8] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259 (2014).
[9] Felipe Costa, Sixun Ouyang, Peter Dolog, and Aonghus Lawlor. 2018. Automatic Generation of Natural Language Explanations. In Proceedings of the 23rd International Conference on Intelligent User Interfaces Companion. ACM, 57.
[10] Li Dong, Shaohan Huang, Furu Wei, Mirella Lapata, Ming Zhou, and Ke Xu. 2017. Learning to generate product reviews from attributes. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Vol. 1. 623–632.
[11] Albert Gatt and Emiel Krahmer. 2018. Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation. Journal of Artificial Intelligence Research 61 (2018), 65–170.
[12] Jonathan L. Herlocker, Joseph A. Konstan, and John Riedl. 2000. Explaining collaborative filtering recommendations. In Proceedings of the 2000 ACM Conference on Computer Supported Cooperative Work. ACM, 241–250.
[13] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[14] Yehuda Koren. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 426–434.
[15] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer
[16] ICLR (2017).
[17] Lei Li, Yongfeng Zhang, and Li Chen. 2020. Generate neural template explanations for recommendation. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 755–764.
[18] Piji Li, Zihao Wang, Zhaochun Ren, Lidong Bing, and Wai Lam. 2017. Neural rating regression with abstractive tips generation for recommendation. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 345–354.
[19] Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out (2004).
[20] Guang Ling, Michael R. Lyu, and Irwin King. 2014. Ratings meet reviews, a combined approach to recommend. In Proceedings of the 8th ACM Conference on Recommender Systems. ACM, 105–112.
[21] Yichao Lu, Ruihai Dong, and Barry Smyth. 2018. Coevolutionary Recommendation Model: Mutual Learning between Ratings and Reviews. In Proceedings of the 2018 World Wide Web Conference. International World Wide Web Conferences Steering Committee, 773–782.
[22] Julian McAuley and Jure Leskovec. 2013. Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems. ACM, 165–172.
[23] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 43–52.
[24] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.
[25] Tomáš Mikolov, Ilya Sutskever, Anoop Deoras, Hai-Son Le, Stefan Kombrink, and Jan Černocký. 2012. Subword language modeling with neural networks.
Advances in Neural Information Processing Systems. 1257–1264.
[27] Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, and others. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. arXiv preprint arXiv:1602.06023 (2016).
[28] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 311–318.
[29] Zhaochun Ren, Shangsong Liang, Piji Li, Shuaiqiang Wang, and Maarten de Rijke. 2017. Social collaborative viewpoint regression with explainable recommendations. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 485–494.
[30] Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, and John Riedl. 1994. GroupLens: an open architecture for collaborative filtering of netnews. In Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work. ACM, 175–186.
[31] Francesco Ricci, Lior Rokach, and Bracha Shapira. 2015. Recommender systems: introduction and challenges. In Recommender Systems Handbook. Springer, 1–34.
[32] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web. ACM, 285–295.
[33] J. Ben Schafer, Dan Frankowski, Jon Herlocker, and Shilad Sen. 2007. Collaborative filtering recommender systems. In The Adaptive Web. Springer, 291–324.
[34] Nathan Srebro, Jason Rennie, and Tommi S. Jaakkola. 2005. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems. 1329–1336.
[35] Ilya Sutskever, James Martens, and Geoffrey E. Hinton. 2011. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11). 1017–1024.
[36] Gábor Takács, István Pilászy, Bottyán Németh, and Domonkos Tikk. 2008. Investigation of various matrix factorization methods for large recommender systems. In Data Mining Workshops, 2008. ICDMW'08. IEEE International Conference on. IEEE, 553–562.
[37] Jian Tang, Yifan Yang, Sam Carton, Ming Zhang, and Qiaozhu Mei. 2016. Context-aware natural language generation with recurrent neural networks. arXiv preprint arXiv:1611.09900 (2016).
[38] Chong Wang and David M. Blei. 2011. Collaborative topic modeling for recommending scientific articles. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 448–456.
[39] Nan Wang, Hongning Wang, Yiling Jia, and Yue Yin. 2018. Explainable Recommendation via Multi-Task Learning in Opinionated Text Data. arXiv preprint arXiv:1806.03568 (2018).
[40] Yikun Xian, Zuohui Fu, S. Muthukrishnan, Gerard De Melo, and Yongfeng Zhang. 2019. Reinforcement knowledge graph reasoning for explainable recommendation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 285–294.
[41] Yikun Xian, Zuohui Fu, Handong Zhao, Yingqiang Ge, Xu Chen, Qiaoying Huang, Shijie Geng, Zhou Qin, Gerard De Melo, Shan Muthukrishnan, and others. 2020. CAFE: Coarse-to-fine neural symbolic reasoning for explainable recommendation. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 1645–1654.
[42] Chen Xing, Wei Wu, Yu Wu, Jie Liu, Yalou Huang, Ming Zhou, and Wei-Ying Ma. 2017. Topic Aware Neural Response Generation. In AAAI, Vol. 17. 3351–3357.
[43] Yinqing Xu, Wai Lam, and Tianyi Lin. 2014. Collaborative filtering incorporating review text and co-clusters of hidden user communities and item groups. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management. ACM, 251–260.
[44] Shuai Zhang, Lina Yao, and Aixin Sun. 2018. Deep learning based recommender system: A survey and new perspectives. Comput. Surveys (2018).
[45] Yongfeng Zhang, Qingyao Ai, Xu Chen, and W. Bruce Croft. 2017. Joint representation learning for top-n recommendation with heterogeneous information sources. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM, 1449–1458.
[46] Yongfeng Zhang and Xu Chen. 2020. Explainable Recommendation: A Survey and New Perspectives. Foundations and Trends in Information Retrieval (2020).
[47] Yongfeng Zhang, Guokun Lai, Min Zhang, Yi Zhang, Yiqun Liu, and Shaoping Ma. 2014. Explicit factor models for explainable recommendation based on phrase-level sentiment analysis. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 83–92.
[48] Yongfeng Zhang, Haochen Zhang, Min Zhang, Yiqun Liu, and Shaoping Ma. 2014. Do users rate or review? Boost phrase-level sentiment labeling with review-level sentiment classification. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval. 1027–1030.
[49] Lei Zheng, Vahid Noroozi, and Philip S. Yu. 2017. Joint deep modeling of users and items using reviews for recommendation. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 425–434.
[50] Xiaoqiang Zhou, Baotian Hu, Qingcai Chen, Buzhou Tang, and Xiaolong Wang. 2015. Answer sequence learning with neural networks for answer selection in community question answering. arXiv preprint arXiv:1506.06490 (2015).