JST-RR Model: Joint Modeling of Ratings and Reviews in Sentiment-Topic Prediction
Qiao Liang†, Shyam Ranganathan‡, Kaibo Wang†, and Xinwei Deng‡
†Department of Industrial Engineering, Tsinghua University
‡Department of Statistics, Virginia Tech
Abstract
Analysis of online reviews has attracted great attention with broad applications. Often, the textual reviews are coupled with numerical ratings in the data. In this work, we propose a probabilistic model to accommodate both textual reviews and overall ratings with consideration of their intrinsic connection for a joint sentiment-topic prediction. The key of the proposed method is to develop a unified generative model in which the topic modeling is constructed based on review texts and the sentiment prediction is obtained by combining review texts and overall ratings. The inference of model parameters is obtained by an efficient Gibbs sampling procedure. The proposed method can enhance the prediction accuracy of review data and achieve an effective detection of interpretable topics and sentiments. The merits of the proposed method are elaborated by a case study on Amazon datasets and simulation studies.
Keywords:
Generative approach; Joint modeling; Latent Dirichlet allocation; Service analytics; Text mining

Address for correspondence: Xinwei Deng, Associate Professor, Department of Statistics, Virginia Tech, Blacksburg, VA 24061 (E-mail: [email protected]).

Introduction
In modern service applications, increasing amounts of online reviews are generated by users. The online reviews often contain both review texts and overall ratings. For example, reviews on Amazon.com contain a review text on customer opinions of products or services, as well as an overall rating score for the general evaluation. Clearly, these user-generated contents can provide valuable information for both customers and online merchants (Liu, 2012). Among various research works on analyzing such review data, topic identification (Titov and McDonald, 2008b; Blei, 2012; Airoldi and Bischof, 2016) and sentiment classification (Bai, 2011; Taddy, 2013; Calheiros et al., 2017) are two major directions. The former aims to extract representative features or aspects of interest from discrete review words, and the latter is to predict the semantic orientation of a review text. With consideration of the inherent dependency between sentiment polarities and topics, a simultaneous detection of correlated topics and sentiments serves as a critical function in the information retrieval of online customer reviews (Titov and McDonald, 2008a; Mei et al., 2007; Lin and He, 2009).

Note that the existing works mainly focused on topic discovery and sentiment prediction using the review texts only, while the information from the overall ratings has not been fully integrated. The rating scores provide intuitive orientations of user opinions, which can allow the latent sentiments to be extracted more appropriately (Li et al., 2015). Moreover, most collected review texts in practice are vague in the sense of a low "signal-to-noise ratio", with large amounts of spam content, unhelpful opinions, as well as highly subjective and misleading information (Lu et al., 2010).
In such situations, it is of great importance to consider ratings and review texts in a mutually complementary manner for accurate quantification of review sentiments and topics.

The scope of this work is to predict both sentiments and topics from the joint learning of review texts and overall ratings. Typically, the association between textual reviews and overall ratings is driven by the general orientation of review sentiments. For instance, a review stimulated by positive sentiment would present both a higher rating and a positive review text. The sentiment polarities indicated by the overall ratings and textual reviews are closely related, while their relationship varies among different customers. In practice, customers may have different preferences and emphases on different aspects of the same product, and they may give overall ratings based on partial or whole product aspects discussed in review texts (Li et al., 2015). For example, even a full 5-star rating could be accompanied by negative review content. The dynamic relationship between the overall ratings and the review texts makes it challenging to digest the information in reviews and ratings jointly. As ratings serve as one of the most important pieces of metadata of review documents, this problem can be viewed from the perspective of incorporating document metadata with the content of the text (Roberts et al., 2016).

To address the aforementioned challenges, we propose a joint sentiment-topic model to accommodate both ratings and review texts. We denote the proposed model as the JST-RR model. The proposed method extends the conventional joint sentiment-topic modeling by incorporating the generative process of ratings with textual reviews in a unified framework. Under this framework, the connection between review texts and ratings is characterized by the latent joint sentiment-topic distribution.
We have also developed a weighting mechanism between the number of review words and the number of ratings for a more appropriate quantification of review sentiments. The proposed JST-RR model enables an effective identification of topics and sentiments in reviews and a more accurate prediction for review data. It is worth pointing out that the proposed model is weakly supervised, with the only supervision from a domain-independent sentiment lexicon. Hence it can be easily adapted to review mining in various domains or applications, such as political discussion in social media and detection of fake news.

The remainder of this paper is organized as follows. Section 2 reviews the state-of-the-art methods on joint sentiment-topic prediction in review modeling. Section 3 presents the details of the proposed JST-RR model. Section 4 reports the model implementation and performance on the Amazon datasets. Section 5 conducts a simulation study to extensively evaluate the performance of the proposed model. Finally, we conclude this work with some discussion in Section 6.
Literature Review
This section mainly reviews modeling methods for online review data in sentiment-topic prediction. In the literature, many existing works (Lu et al., 2009; Brody and Elhadad, 2010; Lu et al., 2011) performed topic detection and sentiment classification in a two-stage process. They first detected topics from review texts using traditional topic models such as latent Dirichlet allocation (LDA) (Blei et al., 2003) and probabilistic latent semantic indexing (PLSI) (Hofmann, 1999). Then sentiment labels are assigned to specific topics by applying sentiment classification techniques to the corresponding review texts. There are several works on detecting topics and sentiments simultaneously from user-generated content (Mei et al., 2007; Titov and McDonald, 2008a; Lin and He, 2009). For example, Mei et al. (2007) proposed the topic-sentiment mixture (TSM) model for weblog collections based on the model setting of PLSI. However, the topic-sentiment correlation in the TSM model was not directly constructed but captured through a post-processing of model parameters. With the focus of finding correlated sentiments and topics from texts, the joint sentiment-topic (JST) model (Lin and He, 2009) and the Reverse-JST model (Lin et al., 2012) extended the LDA model by constructing an additional sentiment layer conditioning, and being conditioned on, the topic layer of LDA, respectively. Many follow-up works (Moghaddam and Ester, 2011; Li et al., 2013; Dermouche et al., 2015), regarded as variants of the JST and Reverse-JST models, use the same assumption of conditional inter-dependency between topics and sentiments.
However, these works mainly focused on topic discovery and sentiment prediction from review texts only, where the information from the overall ratings has been overlooked to some extent.

For review sentiment prediction, existing methods often employed a supervised learning framework using sentiment labels directly indicated by overall ratings (Pang et al., 2002; Blitzer et al., 2007; Ye et al., 2009). That is, the ratings were used to supervise the sentiment prediction of corresponding review texts. However, there is still a discrepancy between the sentiment orientations indicated by review texts and ratings, since customers may give overall ratings based on partial or whole product aspects discussed in review texts. Considering the complex and dynamic relationship between the overall ratings and the review texts, it is beneficial to construct a joint model of textual reviews and numerical ratings for sentiment-topic prediction. For instance, the models by Wang et al. (2010, 2011) assumed that the overall ratings were based on ratings of specific aspects or topics extracted from review texts. The aspect identification and rating (AIR) model by Li et al. (2015) followed a reverse assumption that aspect ratings were produced with the prior information of overall ratings. However, these models mainly focused on the detection of aspect ratings and conditioned the joint modeling of textual reviews and overall ratings on the results of aspect ratings. Motivated by the lack of a general model to characterize the intrinsic connection between review texts and overall ratings, we propose a joint sentiment-topic model to accommodate both overall ratings and review texts in a unified probabilistic framework for accurate prediction of review sentiments and topics.

The Proposed JST-RR Model

In this section, we briefly describe the notation and joint sentiment-topic (JST) representation of reviews in Section 3.1. We then detail the proposed JST-RR model for integrating the overall ratings with review words in Section 3.2. The procedure of model inference is constructed in Section 3.3.
Consider the data consisting of a collection of product review documents $\{d_i, i = 1, \ldots, D\}$. For each review document $d_i$, suppose that it contains $N_i$ words denoted as $w_i = (w_{i1}, \ldots, w_{iN_i})$, and $M_i$ rating scores denoted as $r_i = (r_{i1}, \ldots, r_{iM_i})$. A review document can be composed of a single review (i.e., $M_i = 1$) or a collection of reviews for learning user opinions at various granularity levels. For example, multiple reviews of the same product or the same user can be integrated into a document for extracting product-specific or user-specific features (Ling et al., 2014). Here, each word in the observed document is assumed to be from the vocabulary indexed by $\{1, \ldots, V\}$. Without loss of generality, we assume that the rating $r_{ij} \in \{1, 2, 3, 4, 5\}$, with 5 being the highest rating and 1 the lowest rating.

In a typical joint sentiment-topic modeling framework (Lin and He, 2009; Lin et al., 2012; Li et al., 2013), each review document $d_i$ is assumed to be represented by mixtures of sentiments and topics that are interdependent. By following the assumption in the general class of mixed membership models (Airoldi et al., 2010, 2014; Manrique-Vallier and Reiter, 2012), each observational unit in the document belongs to a single cluster that is represented by a specific sentiment-topic pair. Let us denote the sentiment label by $l \in \{1, \ldots, S\}$ and the topic label by $z \in \{1, \ldots, K\}$. The sentiment of the document $d_i$ follows a multinomial distribution $\mathrm{Multinomial}(\pi_i)$, where the $S$-dimension prior distribution $\pi_i \sim \mathrm{Dirichlet}(\gamma)$. Conditional on each sentiment label $l \in \{1, \ldots, S\}$, the topic follows a multinomial distribution $\mathrm{Multinomial}(\theta_{i,l})$, where the $K$-dimension prior of the topic distribution $\theta_{i,l} \sim \mathrm{Dirichlet}(\alpha_l)$. Typically, the document-level sentiment and topic distributions indicate how likely the current document fits a specific sentiment and topic, providing a quantification of the latent sentiments and topics for unstructured reviews.

In this section, we detail the proposed JST-RR model with the consideration of both ratings and review texts. Based on the JST representation, the proposed JST-RR model integrates the overall ratings with textual words in review documents under a unified probabilistic framework. Figure 1 illustrates a graphical representation of the JST-RR model structure. The notation used in the proposed model is summarized in Table 1.

We consider that each review document $d_i$ is represented by its document-level sentiment distribution $\mathrm{Multinomial}(\pi_i)$ and topic distribution $\mathrm{Multinomial}(\theta_i)$. The key of the JST-RR model is to provide a unified probabilistic generative process for both observed words and ratings in the review documents. That is, each word is assumed to be drawn from the $V$-dimension multinomial word distribution $\mathrm{Multinomial}(\phi_{l^w,z})$ conditioned on the word sentiment label $l^w$ and topic label $z$, where the prior distribution $\phi_{l^w,z} \sim \mathrm{Dirichlet}(\beta_{l^w,z})$. For the generating process of overall ratings, we consider that the overall ratings provide only a general orientation of sentiments. Each rating is then assumed to be drawn from the
Each rating is then assumed to be drawn from the5-dimension multinomial rating distribution Multinomial( µ l r ) only conditioned on its rating6able 1: A summary of notation.Term Definition d Document w Word r Rating z Topic label l w Word sentiment label l r Rating sentiment label D Number of documents V Vocabulary size K Number of topics S Number of sentiment classifications π i Coefficient vector of the multinomial sentiment distribution for the i thdocument θ i,l Coefficient vector of the multinomial topic distribution under the senti-ment label l for the i th document ϕ l,z Coefficient vector of the multinomial word distribution under the senti-ment label l and topic label z µ l Coefficient vector of the multinomial rating distribution under the senti-ment label lN i Number of words in the i th document N i,l Number of words that are assigned to the sentiment label l in the i thdocument N i,l,z Number of words that are assigned to the sentiment label l and topic label z in the i th document N l,z Number of words that are assigned to the sentiment label l and topic label z in the dataset N l,z,w Number of times that the word w is assigned to the sentiment label l andtopic label z in the dataset M i Number of ratings in the i th document M i,l Number of ratings that are assigned to the sentiment label l in the i thdocument M l Number of ratings that are assigned to the sentiment label l in the dataset M l,r Number of times that the rating r is assigned to the sentiment label l inthe dataset 7 z (cid:302) (cid:537) i (cid:307) (cid:533) N i l w SK*S (cid:652) i rl r (cid:534) D (cid:541) (cid:303) SM i Figure 1: Illustration of the proposed JST-RR model.sentiment label l r , where the prior distribution µ l r ∼ Dirichlet( δ l r ).A formal generative process of the review document collection { d i , i = 1 , . . . , D } is pre-sented in Procedure 1. In this framework, words and ratings are jointly generated and usedas observations for the estimation of reviews. 
The hyperparameters $\beta$, $\delta$, $\gamma$, and $\alpha$ encode the prior information before the actual words and ratings, i.e., the actual data, are observed. The settings of the hyperparameters are detailed in Section 4.1 based on a real-world case.

The proposed JST-RR model not only provides a probabilistic and unified framework, but also provides a meaningful description of how ratings and review texts arise in realistic settings. For example, a reviewer on Amazon has an overall sentiment regarding the purchased product, which informs the reviewer's sentiment on the various aspects of the product that are typically represented as "topics" in the model. It is likely that the reviewer has a negative sentiment about one topic while having a positive sentiment on other topics, and this can be reflected by the word sentiment and overall sentiment from the proposed model.

Procedure 1: Generative procedure of words and ratings of the JST-RR model
• For the entire document collection, first characterize the "topic" and the "sentiment" by the word probability distribution and the rating probability distribution:
  – For each combination of word sentiment label $l^w \in \{1, \ldots, S\}$ and topic label $z \in \{1, \ldots, K\}$:
    ∗ Draw a sample from the word probability distribution $\phi_{l^w,z} \sim \mathrm{Dirichlet}(\beta_{l^w,z})$.
  – For each rating sentiment label $l^r \in \{1, \ldots, S\}$:
    ∗ Draw a sample from the rating probability distribution $\mu_{l^r} \sim \mathrm{Dirichlet}(\delta_{l^r})$.
• For each document $d_i$, $i = 1, \ldots, D$:
  – Draw a sample from the sentiment probability distribution $\pi_i \sim \mathrm{Dirichlet}(\gamma)$.
  – Draw a sample from the topic probability distribution $\theta_{i,l} \sim \mathrm{Dirichlet}(\alpha_l)$ for each sentiment label $l \in \{1, \ldots, S\}$.
  – For each word $w_{ij}$, $j = 1, \ldots, N_i$ in document $d_i$:
    ∗ Draw the sentiment assignment $l^w_{ij} \sim \mathrm{Multinomial}(\pi_i)$.
    ∗ Draw the topic assignment $z_{ij} \sim \mathrm{Multinomial}(\theta_{i,l^w_{ij}})$ conditioned on $l^w_{ij}$.
    ∗ Draw a specific word $w_{ij} \sim \mathrm{Multinomial}(\phi_{l^w_{ij},z_{ij}})$ conditioned on $l^w_{ij}$ and $z_{ij}$.
  – For each rating $r_{ij}$, $j = 1, \ldots, M_i$ in document $d_i$:
    ∗ Draw the sentiment assignment $l^r_{ij} \sim \mathrm{Multinomial}(\pi_i)$.
    ∗ Draw a specific rating score $r_{ij} \sim \mathrm{Multinomial}(\mu_{l^r_{ij}})$ conditioned on $l^r_{ij}$.

For the inference of the proposed JST-RR model, there are four sets of latent distribution parameters: the document-level sentiment distribution parameter $\pi$, the sentiment-specific topic distribution parameter $\theta$, the joint sentiment/topic-word distribution parameter $\phi$, and the sentiment-rating distribution parameter $\mu$. Given these latent distributions, we can explicitly express the joint probability of the observed words, ratings, and their sentiment/topic labels in the document collection $\{d_i, i = 1, \ldots, D\}$ as
$$
\begin{aligned}
P(\mathbf{w}, \mathbf{r}, \mathbf{l}^w, \mathbf{l}^r, \mathbf{z} \mid \pi, \theta, \phi, \mu)
&= \prod_{i=1}^{D} \prod_{j=1}^{N_i} P(l^w_{ij}, z_{ij}, w_{ij} \mid \pi_i, \theta_{i,l^w_{ij}}, \phi_{l^w_{ij},z_{ij}}) \prod_{j=1}^{M_i} P(l^r_{ij}, r_{ij} \mid \pi_i, \mu_{l^r_{ij}}) \\
&= \prod_{i=1}^{D} \prod_{j=1}^{N_i} P(l^w_{ij} \mid \pi_i) P(z_{ij} \mid \theta_{i,l^w_{ij}}) P(w_{ij} \mid \phi_{l^w_{ij},z_{ij}}) \prod_{j=1}^{M_i} P(l^r_{ij} \mid \pi_i) P(r_{ij} \mid \mu_{l^r_{ij}}), \quad (1)
\end{aligned}
$$
where the words and ratings are conditionally independent given the document-level sentiments and topics. It is seen that the observed words are dependent on their latent sentiment and topic assignments, while the ratings are only dependent on their latent sentiment assignments.

Note that several methods have been developed in the literature for the inference of probabilistic topic models, including Gibbs sampling (Griffiths and Steyvers, 2004), variational Bayesian inference (Blei et al., 2003), and maximum a posteriori (MAP) estimation (Chien and Wu, 2008). In this work, we adopt Gibbs sampling for the model inference because of its promising convergence to the underlying distribution. It is also noted that some advanced algorithms (Hoffman et al., 2013; Srivastava and Sutton, 2017) could be adapted to our problem for handling large and complex data. The state transition of
The state transition ofthe Markov chain formed by the Gibbs sampler is determined by the sampling of the latentvariables (i.e., the topic label z and the sentiment label l ) given the current values of allother variables and the observed data. The conditional probability of sampling sentimentlabel l wij and topic label z ij for the observed word w ij = w in the document d i can be writtenas P ( l wij = l, z ij = z | w , l w − ij , l r , z − ij ) ∝ P ( l wij = l, z ij = z, w ij = w | w − ij , l w − ij , l r , z − ij )= P ( l wij = l | l w − ij , l r ) × P ( z ij = z | l wij = l, l w − ij , z − ij ) × P ( w ij = w | l wij = l, z ij = z, l w − ij , z − ij , w − ij )= (cid:90) π i P ( l wij = l | π i ) P ( π i | l w − ij , l r ) d π i × (cid:90) θ i,l P ( z ij = z | θ i,l ) P ( θ i,l | l w − ij , z − ij ) d θ i,l × (cid:90) ϕ l,z P ( w ij = w | ϕ l,z ) P ( ϕ l,z | l w − ij , z − ij , w − ij ) d ϕ l,z . (2)10he superscript or subscript − ij hereafter denotes the data quantity excluding the j thposition in the document d i . By integrating out π i (see detailed derivation in Appendix),the first term in Eq. (2) can be derived as P ( l wij = l | l w − ij , l r ) = N − iji,l + M i,l + γ l N − iji + M i + (cid:80) l (cid:48) γ l (cid:48) . (3)It represents the probability of sampling l wij = l given all other sentiment assignments l w − ij of words and l r of ratings in the same review document d i . Here N i and M i are the totalnumber of words and ratings in the document d i , N i,l and M i,l are the number of wordsand ratings associated with sentiment l in the document d i . The hyperparameter γ l can beinterpreted as the prior observation counts of the sentiment l assigned with d i .From Eq. 
(3) and its derivation in the Appendix, one can see that the number of ratingsand the number of words are treated with the equal weight for the estimation of sentiments.However, in a typical user review, the number of words is often much larger than the numberof ratings, even when multiple ratings are allowed in a particular application. Moreover, thenumber of ratings and the number of words associated with a specific sentiment (i.e., N i,l and M i,l ) in Eq. (3) are not generated from actual i.i.d. observations but from the resultsof a Gibbs sampler. The value of N i,l and M i,l reflects different confidence levels, i.e., thenumber of words used to express a particular opinion in a review is large relative to thesentiments expressed as ratings. To address these challenges, we consider to incorporate aweighting mechanism between the number of ratings and the number of words in sentimentestimation. From the perspective of a weighted likelihood for the sentiment assignments ofwords and ratings (see detailed derivation in Appendix), Eq. (3) can be re-expressed in amore general form: P ( l wij = l | l w − ij , l r ) = N − iji,l + σM i,l + γ l N − iji + σM i + (cid:80) l (cid:48) γ l (cid:48) , (4)where σ is a weighting parameter to indicate the weight of a rating relative to a word inthe estimation of review sentiments. When σ = 0, the document-level sentiment predictiondepends only on the review words, which is simplified as the JST model in Lin and He (2009).11imilarly, the second term in Eq. (2) can be estimated by integrating out θ i,l , which gives P ( z ij = z | l wij = l, l w − ij , z − ij ) = N − iji,l,z + α l,z N − iji,l + (cid:80) z (cid:48) α l,z (cid:48) , (5)where N i,l,z is the number of words associated with the sentiment l and topic z in thedocument d i , and the hyperparameter α l,z can be interpreted as the prior observation countsof words assigned with the sentiment l and topic z in d i . For the third term in Eq. 
(2), we can obtain its posterior prediction by integrating out $\phi_{l,z}$ in the same manner as
$$
P(w_{ij} = w \mid l^w_{ij} = l, z_{ij} = z, \mathbf{l}^w_{-ij}, \mathbf{z}_{-ij}, \mathbf{w}_{-ij}) = \frac{N^{-ij}_{l,z,w} + \beta_{l,z,w}}{N^{-ij}_{l,z} + \sum_{w'} \beta_{l,z,w'}}, \quad (6)
$$
where $N_{l,z}$ is the number of words assigned with the sentiment label $l$ and topic label $z$ in the entire dataset, $N_{l,z,w}$ is the number of times that the word $w$ is associated with the sentiment label $l$ and topic label $z$ in the dataset, and the hyperparameter $\beta_{l,z,w}$ can be interpreted as the prior count of word $w$ associated with sentiment label $l$ and topic label $z$ in the dataset.

By combining the results in Eq. (4), Eq. (5), and Eq. (6), the full conditional probability in Eq. (2) can be written as
$$
P(l^w_{ij} = l, z_{ij} = z \mid \mathbf{w}, \mathbf{l}^w_{-ij}, \mathbf{l}^r, \mathbf{z}_{-ij}) \propto \frac{N^{-ij}_{i,l} + \sigma M_{i,l} + \gamma_l}{N^{-ij}_{i} + \sigma M_i + \sum_{l'} \gamma_{l'}} \cdot \frac{N^{-ij}_{i,l,z} + \alpha_{l,z}}{N^{-ij}_{i,l} + \sum_{z'} \alpha_{l,z'}} \cdot \frac{N^{-ij}_{l,z,w} + \beta_{l,z,w}}{N^{-ij}_{l,z} + \sum_{w'} \beta_{l,z,w'}}. \quad (7)
$$
In a similar manner, we can specify the conditional probability of sampling the sentiment label $l^r_{ij}$ for the observed rating $r_{ij} = r$ in the document $d_i$ as (see detailed derivation in the Appendix)
$$
\begin{aligned}
P(l^r_{ij} = l \mid \mathbf{r}, \mathbf{l}^r_{-ij}, \mathbf{l}^w)
&\propto P(l^r_{ij} = l, r_{ij} = r \mid \mathbf{r}_{-ij}, \mathbf{l}^r_{-ij}, \mathbf{l}^w) \\
&= P(l^r_{ij} = l \mid \mathbf{l}^r_{-ij}, \mathbf{l}^w) \times P(r_{ij} = r \mid l^r_{ij} = l, \mathbf{l}^r_{-ij}, \mathbf{r}_{-ij}) \\
&= \int_{\pi_i} P(l^r_{ij} = l \mid \pi_i) P(\pi_i \mid \mathbf{l}^r_{-ij}, \mathbf{l}^w) \, d\pi_i
\times \int_{\mu_l} P(r_{ij} = r \mid \mu_l) P(\mu_l \mid \mathbf{l}^r_{-ij}, \mathbf{r}_{-ij}) \, d\mu_l \\
&= \frac{N_{i,l} + \sigma M^{-ij}_{i,l} + \gamma_l}{N_i + \sigma M^{-ij}_{i} + \sum_{l'} \gamma_{l'}} \times \frac{M^{-ij}_{l,r} + \delta_{l,r}}{M^{-ij}_{l} + \sum_{r'} \delta_{l,r'}}, \quad (8)
\end{aligned}
$$
where $M_l$ is the number of ratings associated with the sentiment $l$ in the dataset, $M_{l,r}$ is the number of times that the rating $r$ is associated with sentiment label $l$ in the dataset, and the hyperparameter $\delta_{l,r}$ can be interpreted as the prior count of rating $r$ associated with sentiment label $l$ in the dataset.

A sample obtained from the Markov chain in its stable state is used to obtain the posterior estimates of the parameters $\pi$, $\theta$, $\phi$, and $\mu$ as follows:
$$
\hat{\pi}_{i,l} = \frac{N_{i,l} + \sigma M_{i,l} + \gamma_l}{N_i + \sigma M_i + \sum_{l'} \gamma_{l'}}, \quad
\hat{\theta}_{i,l,z} = \frac{N_{i,l,z} + \alpha_{l,z}}{N_{i,l} + \sum_{z'} \alpha_{l,z'}}, \quad
\hat{\phi}_{l,z,w} = \frac{N_{l,z,w} + \beta_{l,z,w}}{N_{l,z} + \sum_{w'} \beta_{l,z,w'}}, \quad
\hat{\mu}_{l,r} = \frac{M_{l,r} + \delta_{l,r}}{M_l + \sum_{r'} \delta_{l,r'}}. \quad (9)
$$
For each document $d_i$, its document-level sentiment distribution parameter $\pi_i$ is approximated based on both the $N_i$ words and the $M_i$ ratings with a weighting parameter $\sigma$, while the topic distribution parameter $\theta_i$ is estimated using only the words in the document, since ratings are not assigned topic labels. The Gibbs sampling procedure for making inference of the proposed JST-RR model is summarized in Procedure 2.
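As a concrete illustration, the full conditional in Eq. (7) for one word can be computed from the count variables with a few array operations. The function and toy counts below are a sketch of ours (not the paper's implementation), assuming the word's current assignment has already been removed from all counts.

```python
import numpy as np

def word_full_conditional(N_il, N_ilz, N_lz, N_lzw_w, N_i, M_il, M_i,
                          alpha, beta_w, beta_sum, gamma, sigma):
    """Normalized P(l, z | rest) of Eq. (7) for one word occurrence.

    Shapes: N_il, M_il, gamma: (S,); N_ilz, alpha: (S, K);
    N_lz, N_lzw_w (counts for this word w), beta_w, beta_sum: (S, K)."""
    sent = (N_il + sigma * M_il + gamma) / (N_i + sigma * M_i + gamma.sum())
    topic = (N_ilz + alpha) / (N_il + alpha.sum(axis=1))[:, None]
    word = (N_lzw_w + beta_w) / (N_lz + beta_sum)
    p = sent[:, None] * topic * word
    return p / p.sum()

# Toy counts for S = 2 sentiments and K = 3 topics (illustrative only).
rng = np.random.default_rng(1)
N_ilz = rng.integers(0, 5, size=(2, 3)).astype(float)
N_il = N_ilz.sum(axis=1)
N_i = N_il.sum()
N_lz = rng.integers(10, 20, size=(2, 3)).astype(float)
N_lzw_w = rng.integers(0, 3, size=(2, 3)).astype(float)
alpha = np.full((2, 3), 0.5)
beta_w = np.full((2, 3), 0.01)
beta_sum = np.full((2, 3), 0.01 * 100)   # beta summed over a 100-word vocabulary
gamma = np.ones(2)
M_il = np.array([1.0, 0.0])

p = word_full_conditional(N_il, N_ilz, N_lz, N_lzw_w, N_i, M_il, 1.0,
                          alpha, beta_w, beta_sum, gamma, sigma=1.0)
```

Sampling the new sentiment-topic pair then amounts to drawing one of the $S \times K$ cells, e.g. via `rng.choice(p.size, p=p.ravel())`.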
In this section, we evaluate the performance of the proposed model using three real datasets. The real data are obtained from the publicly available Amazon datasets (McAuley et al., 2015). Specifically, the three datasets are the online reviews of HP laptops, Lenovo laptops, and Dell laptops, which are denoted as HP, Lenovo, and Dell, respectively. For each single review, there is an overall rating that ranges from 1 star to 5 stars.

By defining the review documents at various granularity levels (i.e., from a single review to a collection of reviews of the same product or the same user), the proposed JST-RR model can be applied for modeling customer opinions at different levels of interest. In this section, we mainly focus on examining the performance of the proposed method for individual review documents. That is, each document here is based on a single review, including a review text and an overall rating.

Procedure 2:
Gibbs sampling procedure for inference of the JST-RR model
Input: Document collection $\{d_i, i = 1, \ldots, D\}$, hyperparameters $\beta$, $\delta$, $\gamma$, $\alpha$, and weight parameter $\sigma$.
Output: Word distribution parameter $\phi$, rating distribution parameter $\mu$, document-level sentiment distribution parameter $\pi$, and topic distribution parameter $\theta$.

Assign initial topic/sentiment labels to all words/ratings at random
for each Gibbs sampling iteration do
  for each document $d_i$, $i = 1, \ldots, D$ do
    for each word $w_{ij}$, $j = 1, \ldots, N_i$ in the document $d_i$ do
      Exclude $w_{ij}$, with its sentiment label $l^w_{ij}$ and topic label $z_{ij}$, from the count variables $N_i$, $N_{i,l}$, $N_{i,l,z}$, $N_{l,z}$, $N_{l,z,w}$
      Sample a new sentiment-topic combination for $w_{ij}$ based on Eq. (7)
      Update the count variables $N_i$, $N_{i,l}$, $N_{i,l,z}$, $N_{l,z}$, $N_{l,z,w}$ by incorporating the new sentiment/topic label of $w_{ij}$
    end
    for each rating $r_{ij}$, $j = 1, \ldots, M_i$ in the document $d_i$ do
      Exclude $r_{ij}$, with its sentiment label $l^r_{ij}$, from the count variables $M_l$, $M_{l,r}$, $M_i$, $M_{i,l}$
      Sample a new sentiment assignment for $r_{ij}$ based on Eq. (8)
      Update the count variables $M_l$, $M_{l,r}$, $M_i$, $M_{i,l}$ by incorporating the new sentiment label of $r_{ij}$
    end
  end
end
Estimate $\phi$, $\mu$, $\pi$, and $\theta$ based on Eq. (9)

For each dataset, we perform data pre-processing in the following steps. First, we convert words into lower case and remove punctuation, stop words (e.g., "a", "and", "be"), and infrequent words. Second, we stem each word to its root with the Porter Stemmer (http://tartarus.org/martin/PorterStemmer/). Third, we perform
negation handling by adding a prefix "not_" to words in a negative dependency. For example, in the sentence "I do not like this product", "not like" is recognized as a whole to express negative sentiment. Finally, to obtain unbiased training results on sentiment prediction, we balance the number of positive and negative review documents in the dataset. After data pre-processing, the summary statistics of the three experimental datasets are listed in Table 2. We partition each dataset into a 90% training set for model training and a 10% test set for model evaluation.

In the implementation of the proposed method, we set the number of sentiment polarities $S = 2$ (i.e., positive and negative) and a varying number of topics $K$.

Table 2: A description of the three Amazon datasets.
Dataset | Number of Reviews | Average number of words (review length)
HP | |
Lenovo | |
Dell | |

For the hyperparameters $\gamma$ and $\alpha$, we set $\gamma_l = 3./S$, $l \in \{1, \ldots, S\}$, and $\alpha_{l,z} = 3./(S \times K)$, $l \in \{1, \ldots, S\}$, $z \in \{1, \ldots, K\}$. Based on the prior knowledge that a positive polarity is linked to a higher rating score and vice versa, we set $\delta_{l,r} = 10. \times r$, $r \in \{1, 2, 3, 4, 5\}$, for the positive sentiment $l$, and set $\delta_{l,r} = 10. \times (6 - r)$, $r \in \{1, 2, 3, 4, 5\}$, for the negative sentiment $l$.

For the setting of the hyperparameter $\beta$, we use an asymmetric prior based on a sentiment lexicon: for the positive sentiment $l$, we set the elements of $\beta_l$ to 0 for the words in the negative word list and to 0.01 for other words; for the negative sentiment $l$, we set the elements of $\beta_l$ to 0 for the words in the positive word list and to 0.01 for other words. Such a setting of $\beta$ ensures that the words in the sentiment lexicons can only be drawn from the word distributions conditioned on their corresponding sentiment labels.

The proposed JST-RR model is compared with four alternative methods: JST, RJST, AIR-JST, and AIR-RJST. The JST model in Lin and He (2009) can be treated as a baseline method for modeling topics and sentiments jointly via review texts alone. The RJST (or Reverse-JST) method in Lin et al. (2012) is a variant of the JST model where the topic and sentiment layers are inverted. The last two methods in comparison are denoted as AIR-JST and AIR-RJST, based on the related AIR method in Li et al. (2015). The AIR method models observed textual reviews and overall ratings in a generative way by sampling the latent sentiments of review texts with the overall ratings as prior parameters. For example, the review sentiment probability $\pi$ is generated in accordance with its normalized rating $r$ by
$$
\pi \sim \mathrm{Beta}(\lambda r, \lambda (1 - r)).
$$
The AIR model is adapted to our experimental settings with two variants, AIR-JST and AIR-RJST, where the sentiment and topic layers in the two models are inverted.

To quantitatively evaluate the performance of the proposed method, we consider a perplexity-based performance measure on the test set. The perplexity is a conventional metric for evaluating the performance of probabilistic topic models. Specifically, for a test set of documents $\{d_i, i = 1, \ldots, D\}$, the perplexity of the observed words $\{w_i, i = 1, \ldots, D\}$ in the test set is defined as
$$
\mathrm{perplexity}(\{w_i, i = 1, \ldots, D\} \mid \hat{\phi}) = \exp\left\{ -\frac{\sum_{i=1}^{D} \log P(w_i \mid \hat{\phi})}{\sum_{i=1}^{D} N_i} \right\}, \quad (10)
$$
where the trained model is described by the word distribution parameter $\hat{\phi}$ estimated from the training set. We employed the importance sampling methods in Wallach et al. (2009) to approximate the probability of the observed words $P(w_i \mid \hat{\phi})$ in Eq. (10).
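Given per-document word log-likelihoods, Eq. (10) is straightforward to compute. The helper below is a sketch with assumed inputs; the approximation of $\log P(w_i \mid \hat{\phi})$ (by importance sampling in the paper) is abstracted behind a callback.

```python
import numpy as np

def word_perplexity(docs, log_word_prob):
    """Eq. (10): exp of the negative per-word log-likelihood on the test set.

    `docs` is a list of word-index lists; `log_word_prob(doc)` returns
    log P(w_i | phi-hat) for one document (e.g., via importance sampling)."""
    total_ll = sum(log_word_prob(doc) for doc in docs)
    total_words = sum(len(doc) for doc in docs)
    return float(np.exp(-total_ll / total_words))

# Sanity check: under a uniform unigram model over a vocabulary of size V,
# every word has probability 1/V, so the perplexity equals V.
V = 50
docs = [[0, 1, 2], [3, 4]]
uniform_ll = lambda doc: len(doc) * np.log(1.0 / V)
print(word_perplexity(docs, uniform_ll))  # → 50.0 (up to float rounding)
```

This sanity check also illustrates the entropy upper bound mentioned below: a model that assigns words at random cannot do better than the vocabulary-level entropy.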
Since the perplexity values monotonically decrease with the log-likelihood of the test data, a lower perplexity indicates better prediction performance of the proposed model (Blei et al., 2003). It is noted that the upper bound of the perplexity in Eq. (10), attained in the worst case of a random prediction, is given by
$$
\mathrm{perplexity}(\{w_i, i = 1, \ldots, D\}) = \exp\left\{ -\sum_{w \in V} P(w) \log P(w) \right\},
$$
which is determined by the information entropy of the words in the test data. Similarly, the perplexity of the observed ratings $\{r_i, i = 1, \ldots, D\}$ in the test set can be defined accordingly. As the other four models in comparison only consider the generative process of the observed words, we conduct the evaluation mainly based on the word perplexity values.

Figure 2: The word perplexity in 10-fold cross validation for the HP dataset with topic number $K = 5$.

For the selection of the tuning parameter $\sigma$ in the JST-RR model and the prior weight $\lambda$ in the two AIR models, we adopt 10-fold cross validation on the training set, such that the selected parameters give the best average goodness of fit (indicated by the lowest perplexity values in this study). For example, Figure 2 shows the perplexity values of the observed words versus the weight parameter $\sigma$ obtained by implementing the JST-RR model in 10-fold cross validation for the HP dataset with topic number $K = 5$. Similar trends of perplexity are also observed in the other cases and are thus omitted here. Generally, a lower perplexity value indicates better model performance in explaining the observed data. When $\sigma = 0$, the proposed JST-RR model converges to the baseline JST model that only focuses on review words.
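To see the role of $\sigma$ in Eq. (4), consider a toy document with counts of our own invention: its words lean positive, but its single rating is currently assigned to the negative sentiment. With $\sigma = 0$ the sampler ignores the rating, as in JST; with a larger $\sigma$ the rating can outweigh the words.

```python
import numpy as np

def sentiment_prob(N_il, M_il, N_i, M_i, gamma, sigma):
    """Document-level sentiment probabilities of Eq. (4):
    rating counts enter with weight sigma relative to word counts."""
    return (N_il + sigma * M_il + gamma) / (N_i + sigma * M_i + gamma.sum())

N_il = np.array([12.0, 6.0])   # word counts assigned to (positive, negative)
M_il = np.array([0.0, 1.0])    # the single rating is currently negative
gamma = np.ones(2)

p_words_only = sentiment_prob(N_il, M_il, 18.0, 1.0, gamma, sigma=0.0)
p_weighted = sentiment_prob(N_il, M_il, 18.0, 1.0, gamma, sigma=10.0)
print(p_words_only)  # positive favored: [0.65 0.35]
print(p_weighted)    # the up-weighted rating flips the balance
```

The cross-validated choice of $\sigma$ described above decides how strongly this rating-driven correction is allowed to act.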
Based on the results in Figure 2, the model explanation of the observed words benefits from the incorporation of ratings with a proper setting of the rating weight σ.

Figure 3 shows the word perplexity results of the five models, as well as their percentages against the baseline RJST model, with varying topic numbers. It is seen that the JST-RR model achieves the best overall performance, with the lowest perplexity among all models under a variety of scenarios. In most cases, the models that combine both textual reviews and overall ratings (i.e., AIR-JST, AIR-RJST, JST-RR) are superior to the models that rely only on textual reviews (i.e., JST, RJST). It implies that the incorporation of overall ratings can effectively enhance the model prediction accuracy. Compared to the AIR-JST and AIR-RJST models, which simply use the overall ratings as the prior parameters for the latent document-level sentiment distributions, the proposed JST-RR model achieves better performance in capturing the intrinsic connection between review words and ratings, leading to a significant improvement in model prediction.

It is also important to examine the effectiveness of the proposed model in the extraction of topics and sentiments from the data. As the estimated word distribution is conditioned on both sentiment and topic assignments, one can refer to the most frequent words (or top words) under each combination of sentiment-topic assignments for understanding the extracted topics with their sentiment orientations. Table 3 shows the top positive and negative words under five example topics extracted from the
Dell dataset. Each topic covers a specific quality aspect of Dell products and related services, such as battery (topic 1), memory & speed (topic 2), shipping & return (topic 3), network connections (topic 4), and peripherals (topic 5). In terms of sentiment, it can be seen that most of the positive and negative words under each topic carry the corresponding sentiments well. Some of the words (e.g., "good", "not work") show a general tendency of customer opinions that is independent of topics, and these words tend to appear frequently under multiple topics. Some other words bear topic-specific sentiments. For example, words such as "crash" and "burn" are frequently used to convey negative sentiment with respect to the topic of memory and speed (topic 2).

Moreover, a more general sentiment detection can be examined via the estimated rating distribution. For example, Figures 4(a), 4(b), and 4(c) show the estimated rating distribution parameter μ̂ of the three experimental datasets under different sentiment labels with topic number K = 5. It is seen that the positive and negative sentiments are clearly distinguished by their distributions over the five rating scores. Such an observation is validated by the result that a positive sentiment tends to produce higher ratings than the negative sentiment.
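Reading off the top words for a sentiment-topic pair, as done for Table 3, amounts to sorting each conditional word distribution. The following minimal sketch assumes φ̂ is stored as an S × K × V array; the toy vocabulary and probability values are hypothetical.

```python
import numpy as np

def top_words(phi_hat, vocab, sentiment, topic, k=5):
    """Return the k most probable words under the word distribution
    conditioned on the given (sentiment, topic) pair."""
    probs = phi_hat[sentiment, topic]          # length-V probability vector
    top_idx = np.argsort(probs)[::-1][:k]      # indices sorted by probability
    return [vocab[i] for i in top_idx]

# Toy estimate with S = 2 sentiments, K = 2 topics, V = 4 words.
vocab = ["good", "crash", "battery", "slow"]
phi_hat = np.array([
    [[0.60, 0.05, 0.30, 0.05],    # positive sentiment, topic 0
     [0.70, 0.05, 0.10, 0.15]],   # positive sentiment, topic 1
    [[0.10, 0.50, 0.10, 0.30],    # negative sentiment, topic 0
     [0.10, 0.60, 0.05, 0.25]],   # negative sentiment, topic 1
])
print(top_words(phi_hat, vocab, sentiment=1, topic=0, k=2))  # ['crash', 'slow']
```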
Figure 3: The results of word perplexity (smaller value indicating better performance) for the five methods in comparison (JST, RJST, AIR-JST, AIR-RJST, JST-RR) on three Amazon datasets, with the number of topics (K) on the x-axis: (a) perplexity on Lenovo; (b) percentage of perplexity against RJST on Lenovo; (c) perplexity on Dell; (d) percentage of perplexity against RJST on Dell; (e) perplexity on HP; (f) percentage of perplexity against RJST on HP.

Table 3: Example of topics under different sentiment labels in the Dell dataset extracted by the JST-RR model (top positive and negative words for Topics 1-5).

Simulation Studies

This section conducts several simulation studies to examine the model performance in predicting the document-level sentiment distributions under various scenarios.
We simulate review documents that are composed of words and ratings with known parameters based on the generative process in Procedure 1. Specifically, each simulated document is represented by a random joint sentiment-topic distribution P(l, z) = π_l θ_{l,z} that quantifies how likely the current document is linked to each sentiment and topic label. We let the number of topics K = 5 and the number of sentiments S = 2. For each review document, we test with several values of the number of ratings M (ranging from 1 to 10), and the number of words N ∈ {10M, 20M, 30M} for each value of M. Given the sentiment-topic mixtures sampled from P(l, z), a simulated document is generated by sampling words and ratings from the empirical word distribution Multinomial(φ) and rating distribution Multinomial(μ), respectively. Without loss of generality, we use the empirical word distribution estimated from the real-world Dell dataset in Section 4 for generating the words in the simulated documents. In addition, all the ratings are sampled from the empirical rating distribution with parameter μ_Dell in Figure 4(a), conditioned on their sentiment assignments.

Accordingly, the rating distribution provides occurrence rules among the observed ratings. For example, based on the rating distribution with parameter μ_Dell in Figure 4(a), a positive sentiment is more likely to stimulate a higher rating, while a negative sentiment leads to a lower one. Note that the rating distribution varies with the studied dataset, and simulation data generated with various rating distributions would lead to different results. For a general comparison, our simulation additionally explores two distant cases of rating distributions with the parameters shown in Figures 4(d) and 4(e).

Figure 4: Distributions over ratings under positive and negative sentiments: (a) μ_Dell; (b) μ_Lenovo; (c) μ_HP; (d) μ_diff; (e) μ_unif.

Figure 4(d) represents an extreme case (μ_diff) in which the ratings under the two sentiment classes are totally differentiated. In contrast, Figure 4(e) represents the opposite case (μ_unif) in which the ratings under the two sentiment classes are totally mixed. In practice, the distributions over ratings would range between μ_diff and μ_unif.

Based on Shannon's concept of information theory, the information gain (IG) in the prediction of sentiments l ∈ {1, ..., S} given specific ratings r ∈ {1, 2, 3, 4, 5} is defined as

\mathrm{IG}(l, r) = \mathrm{H}(l) - \mathrm{H}(l \mid r) = \sum_{r=1}^{5} P(r) \sum_{l=1}^{S} P(l \mid r) \log P(l \mid r) - \sum_{l=1}^{S} P(l) \log P(l),   (11)

which can be regarded as the amount of reduced randomness in predicting a sentiment given a rating. It is easy to show that the information gain in Eq. (11) is maximized, namely H(l | r) = 0 and IG(l, r) = H(l), in the case of μ_diff (Figure 4(d)), where the sentiment prediction is 100% confirmed under each possible rating score. In contrast, it is minimized, namely IG(l, r) = 0, under the uniform distribution μ_unif (Figure 4(e)).

Note that the incorporation of overall ratings mainly makes a difference in the estimation of document-level sentiments. Thus we focus on the accuracy of estimating the sentiment distribution parameter π with the proposed Gibbs sampling procedure under different model implementations. Specifically, the Kullback-Leibler (KL) divergence (Kullback, 1997) is used as the performance measure of sentiment prediction:

D_{\mathrm{KL}}(\hat{\pi}, \pi) = \sum_{l} \hat{\pi}_l \log \frac{\hat{\pi}_l}{\pi_l}.   (12)

It measures the distance between the predicted sentiment distribution π̂ from different models and the target sentiment distribution π (ground truth).
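The generative process for a single simulated document can be sketched as follows; the Dirichlet hyperparameters, the toy word distribution φ, and the rating distribution μ below are illustrative stand-ins for the empirical estimates (such as μ_Dell) used in the actual study.

```python
import numpy as np

rng = np.random.default_rng(0)

S, K, V = 2, 5, 100   # numbers of sentiments, topics, and vocabulary words
N, M = 30, 3          # numbers of words and ratings in the document

# Document-level sentiment distribution pi and per-sentiment topic
# distributions theta, drawn from symmetric Dirichlet priors.
pi = rng.dirichlet(np.ones(S))
theta = rng.dirichlet(np.ones(K), size=S)

# Toy word distributions phi[l, z] and rating distributions mu[l]
# (each row of mu sums to one over the five rating scores).
phi = rng.dirichlet(np.ones(V), size=(S, K))
mu = np.array([[0.02, 0.03, 0.10, 0.35, 0.50],   # positive -> high ratings
               [0.45, 0.30, 0.15, 0.07, 0.03]])  # negative -> low ratings

words, ratings = [], []
for _ in range(N):
    l = rng.choice(S, p=pi)            # sentiment label for this word
    z = rng.choice(K, p=theta[l])      # topic label given the sentiment
    words.append(rng.choice(V, p=phi[l, z]))
for _ in range(M):
    l = rng.choice(S, p=pi)            # sentiment label for this rating
    ratings.append(rng.choice(5, p=mu[l]) + 1)   # rating score in 1..5

print(len(words), len(ratings))  # 30 3
```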
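The information gain in Eq. (11) and the KL divergence in Eq. (12) can be computed directly from the distributions. The sketch below also checks the two limiting cases μ_diff and μ_unif under equal prior sentiment probabilities; the specific numbers are illustrative.

```python
import math
import numpy as np

def entropy(p):
    """Shannon entropy of a probability vector (0 log 0 treated as 0)."""
    p = p[p > 0]
    return -float(np.sum(p * np.log(p)))

def information_gain(mu, p_l):
    """IG(l, r) = H(l) - H(l | r) as in Eq. (11), for rating
    distributions mu[l, r] and prior sentiment probabilities p_l."""
    p_lr = p_l[:, None] * mu          # joint P(l, r)
    p_r = p_lr.sum(axis=0)            # marginal P(r)
    h_cond = sum(p_r[r] * entropy(p_lr[:, r] / p_r[r])
                 for r in range(mu.shape[1]) if p_r[r] > 0)
    return entropy(p_l) - h_cond

def kl_divergence(pi_hat, pi):
    """D_KL(pi_hat, pi) as in Eq. (12), for strictly positive pi."""
    return float(np.sum(pi_hat * np.log(pi_hat / pi)))

p_l = np.array([0.5, 0.5])
mu_diff = np.array([[0.0, 0.0, 0.0, 0.5, 0.5],   # positive: only ratings 4-5
                    [0.5, 0.5, 0.0, 0.0, 0.0]])  # negative: only ratings 1-2
mu_unif = np.full((2, 5), 0.2)                   # ratings carry no information

print(information_gain(mu_diff, p_l))   # maximized: equals H(l) = log 2
print(information_gain(mu_unif, p_l))   # approximately 0
print(kl_divergence(np.array([0.6, 0.4]), p_l))
```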
For a general comparison, we consider the following four models:

• JST-RR(μ_diff): the JST-RR model applied to simulated documents generated with rating distribution parameter μ_diff.

• JST-RR(μ_unif): the JST-RR model applied to simulated documents generated with rating distribution parameter μ_unif.

• JST-RR(μ_Dell): the JST-RR model applied to simulated documents generated with rating distribution parameter μ_Dell.

• JST: the JST model applied only to the textual (word) part of the simulated documents (baseline).

All the models above are implemented under the same conditions. Note that we have not included the other alternative models (e.g., RJST, AIR-JST, and AIR-RJST) since the simulated data here are based on the generative process of the JST-RR/JST model. The tuning parameter σ is chosen by 10-fold cross validation on a validation set of simulated documents.

Table 4 reports the average KL divergence under the different models, where each average value of the KL divergence is computed based on D = 1,000 samples of documents, and the standard deviations are shown in brackets. In general, when N and M (the numbers of words and ratings) in a document increase, the document-level sentiment parameters are estimated with higher accuracy. Based on the results, the proposed JST-RR model with μ_diff appears to achieve the best performance among all scenarios. It indicates that the incorporation of overall ratings, in the case of a differentiated sentiment-rating distribution, is helpful for sentiment prediction. Note that the JST-RR model with μ_unif is equivalent to the baseline JST model (i.e., σ = 0), since the ratings in this case do not contribute to the sentiment prediction. Generally, the improvements of the JST-RR model over the JST baseline can be explained by the incorporation of ratings. When the ratings bring a larger information gain for the sentiment prediction, as defined in Eq. (11) and as in the case of μ_diff, the improvements are more significant. In contrast, when the ratings are non-informative, as in the case of μ_unif, the improvements are marginal.

We also plot the results of KL divergence in Figure 5 for a graphical visualization. It is clearly seen that the improvements in sentiment prediction become smaller with an increasing word-rating ratio N/M in the review documents. For example, the JST-RR model under the word-rating ratio N/M = 10 has a significant advantage over the JST model (Figure 5(a)).

Table 4: Results of KL divergence between the predicted and the target sentiment distributions. M and N represent the number of ratings and words in each document (standard deviations in brackets).

N/M = 10
M    N     JST/JST-RR(μ_unif)    JST-RR(μ_Dell)    JST-RR(μ_diff)
1    10    0.1491 (0.0059)       0.1324 (0.0052)   -
10   100   0.0125 (0.0005)       0.0096 (0.0004)   -

N/M = 20
M    N     JST/JST-RR(μ_unif)    JST-RR(μ_Dell)    JST-RR(μ_diff)
1    20    0.0688 (0.0028)       0.0627 (0.0026)   -
10   200   0.0054 (0.0002)       0.0046 (0.0002)   -

N/M = 30
M    N     JST/JST-RR(μ_unif)    JST-RR(μ_Dell)    JST-RR(μ_diff)
1    30    0.0515 (0.0023)       0.0461 (0.0021)   -
10   300   0.0036 (0.0002)       0.0031 (0.0001)   -
The advantage diminishes under N/M = 30 (Figure 5(c)). It shows that the improvement from the complementary ratings in the JST-RR model can be marginal when there is a sufficient amount of words for the document sentiment prediction. In short, the proposed JST-RR model is advantageous for short reviews with insufficient words (or a low word-rating ratio).
Conclusion

In this work, we propose a joint sentiment-topic model to properly accommodate ratings and review texts. The proposed model characterizes the intrinsic connection between review texts and ratings, leading to accurate prediction of review sentiments and topics. An efficient Gibbs sampling procedure is developed for inference of the model parameters. Through the case study on the Amazon datasets, it appears that the proposed JST-RR model enables an effective identification of latent topics and sentiments in reviews. It is noted that the proposed JST-RR model brings larger improvements in sentiment prediction with a more informative rating distribution and a decreasing word-rating ratio in the review documents.

Note that the proposed model is weakly supervised, with the only supervision coming from a domain-independent sentiment lexicon. It can thus be easily adapted to other applications, such as process monitoring of online services (Liang and Wang, 2020) and detection of fake news in social media. Moreover, one can consider ratings on some pre-specified topics, namely aspect ratings. In such situations, it is interesting to extend the proposed method to the case where aspect ratings are available, where the topic-sentiment correlation needs to be constructed appropriately by incorporating aspect ratings with review texts. The current proposed method is mainly based on data from one platform, i.e., the reviews and ratings from Amazon. Another direction for future research is to incorporate the platform information of reviews into the proposed method so that it can integrate the reviews and ratings of the same or similar products from multiple platforms.

Figure 5: Average KL divergence between the predicted sentiment distribution and the ground truth under different word-rating ratios (
N/M = 10, 20, 30); the x-axis of each panel is the number of ratings (M), comparing JST-RR(μ_unif)/JST, JST-RR(μ_Dell), and JST-RR(μ_diff).

Appendix

The first term in Eq. (2) can be derived by integrating out the document-level sentiment distribution parameter π_i as

P(l^w_{ij} = l \mid l^w_{-ij}, l^r) = \int_{\pi_i} P(l^w_{ij} = l \mid \pi_i) P(\pi_i \mid l^w_{-ij}, l^r) \, d\pi_i,   (A.1)

where the second term is derived as

P(\pi_i \mid l^w_{-ij}, l^r) \propto P(l^w_{-ij}, l^r \mid \pi_i) P(\pi_i).   (A.2)

Since P(\pi_i) = \mathrm{Dirichlet}(\pi_i \mid \gamma) is conjugate to the multinomial sentiment probability

P(l^w_{-ij}, l^r \mid \pi_i) = \prod_{k=1, k \neq j}^{N_i} \mathrm{Multinomial}(l^w_{ik} \mid \pi_i) \prod_{k=1}^{M_i} \mathrm{Multinomial}(l^r_{ik} \mid \pi_i) = \prod_{l=1}^{S} (\pi_{i,l})^{N^{-ij}_{i,l} + M_{i,l}},   (A.3)

the posterior is also a Dirichlet distribution: P(\pi_i \mid l^w_{-ij}, l^r) = \mathrm{Dirichlet}(\pi_i \mid n^{-ij}_i + m_i + \gamma), where n_i and m_i are the counts of words and ratings in the document d_i that are associated with each sentiment label:

n_i = (N_{i,1}, N_{i,2}, \ldots, N_{i,S}), \quad m_i = (M_{i,1}, M_{i,2}, \ldots, M_{i,S}).   (A.4)

Remark 1
When one would like to incorporate a weighting mechanism between the numbers of words and ratings in the sentiment estimation, we can consider a weighted likelihood as

P(l^w_{-ij}, l^r \mid \pi_i) = \prod_{k=1, k \neq j}^{N_i} \mathrm{Multinomial}(l^w_{ik} \mid \pi_i) \left[ \prod_{k=1}^{M_i} \mathrm{Multinomial}(l^r_{ik} \mid \pi_i) \right]^{\sigma} = \prod_{l=1}^{S} (\pi_{i,l})^{N^{-ij}_{i,l} + \sigma M_{i,l}}.   (A.5)

Then the posterior is still a Dirichlet distribution: P(\pi_i \mid l^w_{-ij}, l^r) = \mathrm{Dirichlet}(\pi_i \mid n^{-ij}_i + \sigma m_i + \gamma).

The posterior predictive distribution of Eq. (A.1) can be derived as

P(l^w_{ij} = l \mid l^w_{-ij}, l^r) = \int_{\pi_i} P(l^w_{ij} = l \mid \pi_i) P(\pi_i \mid l^w_{-ij}, l^r) \, d\pi_i = \int_{\pi_i} \pi_{i,l} \cdot \mathrm{Dirichlet}(\pi_i \mid n^{-ij}_i + m_i + \gamma) \, d\pi_i = E(\pi_{i,l} \mid \mathrm{Dirichlet}(\pi_i \mid n^{-ij}_i + m_i + \gamma)),   (A.6)

which is the expected value of Dirichlet(π_i | n^{-ij}_i + m_i + γ) on the sentiment dimension l. According to the expected value of the Dirichlet distribution in the Dirichlet-multinomial conjugate framework, we can obtain the final derivation of Eq. (A.1) as

P(l^w_{ij} = l \mid l^w_{-ij}, l^r) = E(\pi_{i,l} \mid \mathrm{Dirichlet}(\pi_i \mid n^{-ij}_i + m_i + \gamma)) = \frac{N^{-ij}_{i,l} + M_{i,l} + \gamma_l}{N^{-ij}_i + M_i + \sum_{l'} \gamma_{l'}},   (A.7)

where N_i and M_i are the total numbers of words and ratings in the document d_i, N_{i,l} and M_{i,l} are the numbers of words and ratings associated with sentiment l in the document d_i, and the hyperparameter γ_l can be interpreted as the prior observation count of the sentiment l assigned to d_i.

Similarly, the second term in Eq. (2) can be derived by integrating out the random variable θ_{i,l} as

P(z_{ij} = z \mid l^w_{ij} = l, l^w_{-ij}, z_{-ij}) = \int_{\theta_{i,l}} P(z_{ij} = z \mid \theta_{i,l}) P(\theta_{i,l} \mid l^w_{-ij}, z_{-ij}) \, d\theta_{i,l},   (A.8)

where the second term is derived as

P(\theta_{i,l} \mid l^w_{-ij}, z_{-ij}) \propto P(z_{-ij} \mid \theta_{i,l}, l^w_{-ij}) P(\theta_{i,l}).
(A.9)

Since P(\theta_{i,l}) = \mathrm{Dirichlet}(\theta_{i,l} \mid \alpha_l) is conjugate to the multinomial probability P(z_{-ij} \mid \theta_{i,l}, l^w_{-ij}), the posterior is also a Dirichlet distribution: P(\theta_{i,l} \mid l^w_{-ij}, z_{-ij}) = \mathrm{Dirichlet}(\theta_{i,l} \mid n^{-ij}_{i,l} + \alpha_l), where n_{i,l} = (N_{i,l,1}, N_{i,l,2}, \ldots, N_{i,l,K}). By following the same derivation, the posterior predictive distribution of Eq. (A.8) is

P(z_{ij} = z \mid l^w_{ij} = l, l^w_{-ij}, z_{-ij}) = \int_{\theta_{i,l}} \theta_{i,l,z} \cdot \mathrm{Dirichlet}(\theta_{i,l} \mid n^{-ij}_{i,l} + \alpha_l) \, d\theta_{i,l} = E(\theta_{i,l,z} \mid \mathrm{Dirichlet}(\theta_{i,l} \mid n^{-ij}_{i,l} + \alpha_l)) = \frac{N^{-ij}_{i,l,z} + \alpha_{l,z}}{N^{-ij}_{i,l} + \sum_{z'} \alpha_{l,z'}},   (A.10)

where N_{i,l,z} is the number of words associated with the sentiment l and topic z in the document d_i, and the hyperparameter α_{l,z} can be interpreted as the prior observation count of words assigned with the sentiment l and topic z in d_i.

Similarly, for the third term in Eq. (2), we can obtain its derivation by integrating out the variable φ_{l,z} as

P(w_{ij} = w \mid l^w_{ij} = l, z_{ij} = z, l^w_{-ij}, z_{-ij}, w_{-ij}) = \int_{\phi_{l,z}} \phi_{l,z,w} \cdot \mathrm{Dirichlet}(\phi_{l,z} \mid n^{-ij}_{l,z} + \beta_{l,z}) \, d\phi_{l,z} = E(\phi_{l,z,w} \mid \mathrm{Dirichlet}(\phi_{l,z} \mid n^{-ij}_{l,z} + \beta_{l,z})) = \frac{N^{-ij}_{l,z,w} + \beta_{l,z,w}}{N^{-ij}_{l,z} + \sum_{w'} \beta_{l,z,w'}},   (A.11)

where N_{l,z} is the number of words assigned with the sentiment label l and topic label z in the entire dataset, n_{l,z} = (N_{l,z,1}, \ldots, N_{l,z,V}) contains the numbers of times that each word w ∈ {1, ..., V} is associated with the sentiment label l and topic label z in the dataset, and the hyperparameter β_{l,z,w} can be interpreted as the prior count of word w associated with sentiment label l and topic label z in the dataset.

Similarly, the second term in Eq.
(8) can be derived by integrating out μ_l as

P(r_{ij} = r \mid l^r_{ij} = l, l^r_{-ij}, r_{-ij}) = \int_{\mu_l} P(r_{ij} = r \mid \mu_l) P(\mu_l \mid l^r_{-ij}, r_{-ij}) \, d\mu_l = \int_{\mu_l} \mu_{l,r} \cdot \mathrm{Dirichlet}(\mu_l \mid m^{-ij}_l + \delta_l) \, d\mu_l = E(\mu_{l,r} \mid \mathrm{Dirichlet}(\mu_l \mid m^{-ij}_l + \delta_l)) = \frac{M^{-ij}_{l,r} + \delta_{l,r}}{M^{-ij}_l + \sum_{r'} \delta_{l,r'}},   (A.12)

where M_l is the number of ratings associated with the sentiment l in the dataset, m_l = (M_{l,1}, \ldots, M_{l,5}) contains the numbers of times that each rating r ∈ {1, 2, 3, 4, 5} is associated with the sentiment label l in the dataset, and the hyperparameter δ_{l,r} can be interpreted as the prior count of rating r associated with sentiment label l in the dataset.

References
Airoldi, E. M., and Bischof, J. M. (2016), "Improving and evaluating topic models and other models of text," Journal of the American Statistical Association, 111(516), 1381-1403.

Airoldi, E. M., Blei, D. M., Erosheva, E. A., and Fienberg, S. E. (2014), "Introduction to Mixed Membership Models and Methods," Handbook of Mixed Membership Models and Their Applications, 100, 3-14.

Airoldi, E. M., Erosheva, E. A., Fienberg, S. E., Joutard, C., Love, T., and Shringarpure, S. (2010), "Reconceptualizing the classification of PNAS articles," Proceedings of the National Academy of Sciences, 107(49), 20899-20904.

Bai, X. (2011), "Predicting Consumer Sentiments from Online Text," Decision Support Systems, 50(4), 732-742.

Blei, D. M. (2012), "Probabilistic Topic Models," Communications of the ACM, 55(4), 77-84.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003), "Latent Dirichlet Allocation," Journal of Machine Learning Research, 3(Jan), 993-1022.

Blitzer, J., Dredze, M., and Pereira, F. (2007), Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification, in Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pp. 440-447.

Brody, S., and Elhadad, N. (2010), An Unsupervised Aspect-Sentiment Model for Online Reviews, in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 804-812.

Calheiros, A. C., Moro, S., and Rita, P. (2017), "Sentiment Classification of Consumer-Generated Online Reviews Using Topic Modeling," Journal of Hospitality Marketing & Management, 26(7), 675-693.

Chien, J., and Wu, M. (2008), "Adaptive Bayesian Latent Semantic Analysis," IEEE Transactions on Audio, Speech, and Language Processing, 16(1), 198-207.

Dermouche, M., Kouas, L., Velcin, J., and Loudcher, S. (2015), A Joint Model for Topic-Sentiment Modeling from Text, in Proceedings of the 30th Annual ACM Symposium on Applied Computing, Association for Computing Machinery, pp. 819-824.

Griffiths, T. L., and Steyvers, M. (2004), "Finding Scientific Topics," Proceedings of the National Academy of Sciences, 101(suppl 1), 5228-5235.

Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013), "Stochastic Variational Inference," The Journal of Machine Learning Research, 14(1), 1303-1347.

Hofmann, T. (1999), Probabilistic Latent Semantic Indexing, in Proceedings of the Twenty-Second Annual International SIGIR Conference on Research and Development in Information Retrieval, ACM, pp. 50-57.

Kullback, S. (1997), Information Theory and Statistics, Mineola, NY: Courier Corporation.

Li, C., Zhang, J., Sun, J.-T., and Chen, Z. (2013), Sentiment Topic Model with Decomposed Prior, in Proceedings of the 2013 SIAM International Conference on Data Mining, pp. 767-775.

Li, H., Lin, R., Hong, R., and Ge, Y. (2015), Generative Models for Mining Latent Aspects and Their Ratings from Short Reviews, in , IEEE, pp. 241-250.

Liang, Q., and Wang, K. (2020), "Ratings meet reviews in the monitoring of online products and services," Journal of Quality Technology, in press.

Lin, C., and He, Y. (2009), Joint Sentiment/Topic Model for Sentiment Analysis, in Proceedings of the 18th ACM Conference on Information and Knowledge Management, ACM, pp. 375-384.

Lin, C., He, Y., Everson, R., and Ruger, S. (2012), "Weakly Supervised Joint Sentiment-Topic Detection from Text," IEEE Transactions on Knowledge and Data Engineering, 24(6), 1134-1145.

Ling, G., Lyu, M. R., and King, I. (2014), Ratings Meet Reviews, A Combined Approach to Recommend, in Proceedings of the 8th ACM Conference on Recommender Systems, pp. 105-112.

Liu, B. (2012), "Sentiment Analysis and Opinion Mining," Synthesis Lectures on Human Language Technologies, 5(1), 1-167.

Lu, B., Ott, M., Cardie, C., and Tsou, B. K. (2011), Multi-aspect Sentiment Analysis with Topic Models, in , IEEE, pp. 81-88.

Lu, Y., Tsaparas, P., Ntoulas, A., and Polanyi, L. (2010), Exploiting Social Context for Review Quality Prediction, in Proceedings of the 19th International Conference on World Wide Web, ACM, pp. 691-700.

Lu, Y., Zhai, C., and Sundaresan, N. (2009), Rated Aspect Summarization of Short Comments, in Proceedings of the 18th International Conference on World Wide Web, pp. 131-140.

Manrique-Vallier, D., and Reiter, J. P. (2012), "Estimating identification disclosure risk using mixed membership models," Journal of the American Statistical Association, 107(500), 1385-1394.

McAuley, J., Targett, C., Shi, Q., and Van Den Hengel, A. (2015), Image-based Recommendations on Styles and Substitutes, in Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, pp. 43-52.

Mei, Q., Ling, X., Wondra, M., Su, H., and Zhai, C. (2007), Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs, in Proceedings of the 16th International Conference on World Wide Web, ACM, pp. 171-180.

Moghaddam, S., and Ester, M. (2011), ILDA: Interdependent LDA Model for Learning Latent Aspects and Their Ratings from Online Product Reviews, in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 665-674.

Pang, B., Lee, L., and Vaithyanathan, S. (2002), Thumbs up? Sentiment Classification using Machine Learning Techniques, in Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, pp. 79-86.

Roberts, M. E., Stewart, B. M., and Airoldi, E. M. (2016), "A model of text for experimentation in the social sciences," Journal of the American Statistical Association, 111(515), 988-1003.

Srivastava, A., and Sutton, C. (2017), "Autoencoding Variational Inference for Topic Models," arXiv no. 1703.01488.

Taddy, M. (2013), "Measuring political sentiment on Twitter: Factor optimal design for multinomial inverse regression," Technometrics, 55(4), 415-425.

Titov, I., and McDonald, R. (2008a), A Joint Model of Text and Aspect Ratings for Sentiment Summarization, in Proceedings of ACL-08: HLT, pp. 308-316.

Titov, I., and McDonald, R. (2008b), Modeling Online Reviews with Multi-grain Topic Models, in Proceedings of the 17th International Conference on World Wide Web, pp. 111-120.

Wallach, H. M., Murray, I., Salakhutdinov, R., and Mimno, D. (2009), Evaluation Methods for Topic Models, in Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1105-1112.

Wang, H., Lu, Y., and Zhai, C. (2010), Latent Aspect Rating Analysis on Review Text Data: A Rating Regression Approach, in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 783-792.

Wang, H., Lu, Y., and Zhai, C. (2011), Latent Aspect Rating Analysis without Aspect Keyword Supervision, in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 618-626.

Ye, Q., Zhang, Z., and Law, R. (2009), "Sentiment classification of online reviews to travel destinations by supervised machine learning approaches,"