Rethinking Attribute Representation and Injection for Sentiment Classification
Reinald Kim Amplayo
Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh
[email protected]
Abstract
Text attributes, such as user and product information in product reviews, have been used to improve the performance of sentiment classification models. The de facto standard method is to incorporate them as additional biases in the attention mechanism, and more performance gains are achieved by extending the model architecture. In this paper, we show that the above method is the least effective way to represent and inject attributes. To demonstrate this hypothesis, unlike previous models with complicated architectures, we limit our base model to a simple BiLSTM with attention classifier, and instead focus on how and where the attributes should be incorporated in the model. We propose to represent attributes as chunk-wise importance weight matrices and consider four locations in the model (i.e., embedding, encoding, attention, classifier) in which to inject attributes. Experiments show that our proposed method achieves significant improvements over the standard approach and that the attention mechanism is the worst location to inject attributes, contradicting prior work. We also outperform the state of the art despite our use of a simple base model. Finally, we show that these representations transfer well to other tasks.

Model implementation and datasets are released here: https://github.com/rktamplayo/CHIM

Introduction

The use of categorical attributes (e.g., user, topic, aspects) in the sentiment analysis community (Kim and Hovy, 2004; Pang and Lee, 2007; Liu, 2012) is widespread. Prior to the deep learning era, this information was used as effective categorical features (Li et al., 2011; Tan et al., 2011; Gao et al., 2013; Park et al., 2015) for the machine learning model. Recent work has used them to improve the overall performance of the model (Chen et al., 2016), typically by incorporating them as additional biases in the attention mechanism, computed as:

a = softmax(v^T e)    (1)
e = tanh(W h + W_u u + W_p p + b)    (2)
  = tanh(W h + b_u + b_p + b)    (3)
  = tanh(W h + b')    (4)

where u and p are the user and product embeddings, and h is a word encoding from the BiLSTM. Since then, most of the subsequent work attempted to improve the model by extending the model architecture to be able to utilize external features (Zhu and Yang, 2017), handle cold-start entities (Amplayo et al., 2018a), and represent user and product separately (Ma et al., 2017).

Intuitively, however, this is not the ideal way to represent and inject attributes, for two reasons. First, representing attributes as additional biases cannot model the relationship between the text and the attributes. Rather, it only adds user- and product-specific biases that are independent from the text when calculating the attention weights. Second, injecting the attributes in the attention mechanism means that user and product information are only used to customize how the model chooses which words to focus on, as also shown empirically in previous work (Chen et al., 2016; Ma et al., 2017). However, we argue that there are more intuitive locations to inject the attributes, such as when contextualizing words to modify their sentiment intensity.

We propose to represent user and product information as weight matrices (i.e., W in the equations above).
Directly incorporating these attributes into W leads to a large increase in parameters and subsequently makes the model difficult to optimize. To mitigate these problems, we introduce chunk-wise importance weight matrices, which (1) use a weight matrix smaller than W by a chunk size factor, and (2) transform this matrix into gates that correspond to the relative importance of each neuron in W. We investigate the use of this method when injected into several locations in the base model: the word embeddings, the BiLSTM encoder, the attention mechanism, and the logistic classifier.

The results of our experiments can be summarized in three statements. First, our preliminary experiments show that bias-based attribute representation with attention-based injection is not an effective method to incorporate user and product information in sentiment classification models. Second, despite using only a simple BiLSTM with attention classifier, we significantly outperform previous state-of-the-art models that use more complicated architectures (e.g., hierarchical models, external memory networks, etc.). Finally, we show that these attribute representations transfer well to other tasks such as product category classification and review headline generation.

In this section, we explore different ways to represent attributes and the locations in the model where we can inject them.
The majority of this paper uses a base model that accepts a review x = x_1, ..., x_n as input and returns a sentiment y as output, which we extend to also accept the corresponding user u and product p attributes as additional inputs. Different from previous work where models use complex architectures such as hierarchical LSTMs (Chen et al., 2016; Zhu and Yang, 2017) and external memory networks (Dou, 2017; Long et al., 2018), we aim to achieve improvements by only modifying how we represent and inject attributes. Thus, we use a simple classifier as our base model, which consists of four parts explained briefly as follows.

First, we embed x using a word embedding matrix that returns word embeddings x'_1, ..., x'_n. We subsequently apply a non-linear function to each word:

w_t = tanh(W_emb x'_t + b_emb)    (5)

Second, we run a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) encoder to contextualize the words into h_t = [→h_t; ←h_t] based on their forward and backward neighbors. The forward and backward LSTMs look similar, thus for brevity we only show the forward LSTM below:

[g_t; i_t; f_t; o_t] = [tanh; σ; σ; σ](W_enc [w_t; →h_{t-1}] + b_enc)    (6)
c_t = f_t * c_{t-1} + i_t * g_t    (7)
→h_t = o_t * c_t    (8)

Third, we pool the encodings h_t into one document encoding d using an attention mechanism (Bahdanau et al., 2015), where v is a latent representation of informativeness (Yang et al., 2016):

e_t = tanh(W_att h_t + b_att)    (9)
a_t = softmax_t(v^T e_t)    (10)
d = Σ_t (a_t * h_t)    (11)

Finally, we classify the document using a logistic classifier to get a predicted y':

y' = argmax(W_cls d + b_cls)    (12)

Training is done normally by minimizing the cross-entropy loss.

Figure 1: Illustrative examples of issues when representing attributes as biases and injecting them in the attention mechanism: (a) logits when representing attributes as biases in the logistic classifier; (b) attention weights when injecting attributes in the attention mechanism. The gray process icon indicates the model without incorporating attributes, while the same icon in green indicates the model customized for the green user.

Note that at each part of the model, we see similar non-linear functions, all using the same form, i.e., g(f(x)) = g(Wx + b), where f(x) is an affine transformation of x, g is a non-linear activation, and W and b are the weight matrix and bias parameters, respectively. Without extending the base model architecture, we can represent the attributes either as the weight matrix W or as the bias b of one of these functions by modifying them to accept u and p as inputs, i.e., f(x, u, p).
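To make the base model concrete, below is a minimal PyTorch sketch of this BiLSTM-with-attention classifier (Equations 5-12). The class and variable names, the default sizes, and the use of nn.LSTM in place of a hand-written cell are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class BaseClassifier(nn.Module):
    """BiLSTM-with-attention sentiment classifier (Eq. 5-12), without attributes."""
    def __init__(self, vocab_size, emb_dim=300, hid_dim=300, num_classes=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.W_emb = nn.Linear(emb_dim, emb_dim)                 # Eq. 5
        self.encoder = nn.LSTM(emb_dim, hid_dim // 2,            # Eq. 6-8, both directions
                               bidirectional=True, batch_first=True)
        self.W_att = nn.Linear(hid_dim, hid_dim)                 # Eq. 9
        self.v = nn.Linear(hid_dim, 1, bias=False)               # Eq. 10
        self.W_cls = nn.Linear(hid_dim, num_classes)             # Eq. 12

    def forward(self, x):                                        # x: (batch, seq_len) word ids
        w = torch.tanh(self.W_emb(self.embed(x)))                # Eq. 5
        h, _ = self.encoder(w)                                   # Eq. 6-8: (batch, seq_len, hid_dim)
        e = torch.tanh(self.W_att(h))                            # Eq. 9
        a = torch.softmax(self.v(e), dim=1)                      # Eq. 10: weights over positions
        d = (a * h).sum(dim=1)                                   # Eq. 11: document encoding
        return self.W_cls(d)                                     # Eq. 12: class logits

model = BaseClassifier(vocab_size=10000)
logits = model(torch.randint(0, 10000, (8, 20)))                 # toy batch of 8 reviews
```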
Bias-based
The currently accepted standard approach to represent the attributes is through the bias parameter b. Most of the previous work (Chen et al., 2016; Zhu and Yang, 2017; Amplayo et al., 2018a; Wu et al., 2018) uses Equation 2 in the attention mechanism, which basically updates the original bias b to b' = W_u u + W_p p + b. However, we argue that this is not the ideal way to incorporate attributes since it only adds a user- and product-specific bias towards the goal of the function, without looking at the text. Figure 1a shows an intuitive example: when we represent user u as a bias in the logistic classifier, u has a biased logits vector b_u for classifying the text as a certain sentiment (e.g., u tends to classify texts as three-star positive), which shifts the final probability distribution regardless of what the text content may have been.
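As an illustration, the bias-based method of Equation 2 reduces to the following additive change in the attention scorer. This is a minimal sketch with assumed module and variable names, not the reference implementation.

```python
import torch
import torch.nn as nn

class BiasAttention(nn.Module):
    """Attention with user/product biases, i.e. e = tanh(Wh + W_u u + W_p p + b) (Eq. 2)."""
    def __init__(self, hid_dim=300, attr_dim=300):
        super().__init__()
        self.W = nn.Linear(hid_dim, hid_dim)
        self.W_u = nn.Linear(attr_dim, hid_dim, bias=False)
        self.W_p = nn.Linear(attr_dim, hid_dim, bias=False)
        self.v = nn.Linear(hid_dim, 1, bias=False)

    def forward(self, h, u, p):               # h: (batch, seq, hid), u/p: (batch, attr_dim)
        bias = (self.W_u(u) + self.W_p(p)).unsqueeze(1)   # text-independent shift (b_u + b_p)
        e = torch.tanh(self.W(h) + bias)                  # Eq. 2-4
        a = torch.softmax(self.v(e), dim=1)               # Eq. 1
        return (a * h).sum(dim=1)                         # pooled document encoding

pool = BiasAttention()
d = pool(torch.randn(8, 20, 300), torch.randn(8, 300), torch.randn(8, 300))
```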
Matrix-based

A more intuitive way of representing attributes is through the weight matrix W. Specifically, given the attribute embeddings u and p, we linearly transform their concatenation into a vector w' of size D_1 * D_2, where D_1 and D_2 are the dimensions of W. We then reshape w' into W' to get the same shape as W and replace W with W':

w' = W_c [u; p] + b_c    (13)
W' = reshape(w', (D_1 × D_2))    (14)
f(x, u, p) = W' x + b    (15)

Theoretically, this should perform better than bias-based representations since a direct relationship between text and attributes is modeled. For example, following the example above, W'x is a user-biased logits vector based on the document encoding d (e.g., u tends to classify texts as two-star positive when the text mentions that the dessert was sweet).

However, the model is burdened by a large number of parameters; matrix-based attribute representation increases the number of parameters by |U| * |P| * D_1 * D_2, where |U| and |P| correspond to the number of users and products, respectively. This subsequently makes the weights difficult to optimize during training. Thus, directly incorporating attributes into the weight matrix may harm the performance of the model.
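A minimal sketch of this matrix-based injection (Equations 13-15), written as a drop-in replacement for a linear layer, is shown below; the module name, shapes, and the einsum-based batched product are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MatrixBasedLinear(nn.Module):
    """Replaces W of f(x) = Wx + b with W' generated from [u; p] (Eq. 13-15)."""
    def __init__(self, in_dim, out_dim, attr_dim=300):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        self.W_c = nn.Linear(2 * attr_dim, out_dim * in_dim)      # Eq. 13
        self.b = nn.Parameter(torch.zeros(out_dim))

    def forward(self, x, u, p):               # x: (batch, in_dim), u/p: (batch, attr_dim)
        w = self.W_c(torch.cat([u, p], dim=-1))                   # Eq. 13
        W_prime = w.view(-1, self.out_dim, self.in_dim)           # Eq. 14: one W' per example
        return torch.einsum('boi,bi->bo', W_prime, x) + self.b    # Eq. 15: f(x,u,p) = W'x + b

layer = MatrixBasedLinear(in_dim=64, out_dim=5, attr_dim=64)
y = layer(torch.randn(8, 64), torch.randn(8, 64), torch.randn(8, 64))
```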
CHIM-based

We introduce Chunk-wise Importance Matrix (CHIM) based representation, which improves over the matrix-based approach by mitigating the optimization problems mentioned above, using the following two tricks. First, instead of using a big weight matrix W' of shape (D_1, D_2), we use a chunked weight matrix C of shape (D_1/C_1, D_2/C_2), where C_1 and C_2 are chunk size factors. Second, we use the chunked weight matrix as importance gates that shrink the weights close to zero when they are deemed unimportant. We show the CHIM-based representation method in Figure 2.

We start by linearly transforming the concatenated attributes into c. Then we reshape c into C with shape (D_1/C_1, D_2/C_2). These operations are similar to Equations 13 and 14. We then repeat this matrix C_1 * C_2 times and concatenate the copies such that we create a matrix W' of shape (D_1, D_2). Finally, we use the sigmoid function σ to transform the matrix into gates that represent importance:

W' = σ([C, ..., C; ...; C, ..., C]) ∈ [0, 1]^(D_1 × D_2)    (16)

Figure 2: CHIM-based attribute representation and injection to a non-linear function in the model.
Finally, we broadcast-multiply W' with the original weight matrix W to shrink the weights. The result is a sparse version of W, which can be seen either as a regularization step (Ng, 2004), where most weights are set close to zero, or as a correction step (Amplayo et al., 2018b), where the importance gates are used to correct the weights. The use of multiple chunks casts CHIM as coarse-grained access control (Shen et al., 2019), where using a different importance gate for every neuron is unnecessary and expensive. The final function is shown below:

f(x, u, p) = (W' * W) x + b    (17)

To summarize, chunking helps reduce the number of parameters while retaining model performance, and the importance matrix makes optimization easier during training, resulting in a performance improvement. We also tried alternative methods for the importance matrix, such as residual addition (i.e., tanh(W') + W) introduced in He et al. (2016), and low-rank adaptation methods (Jaech and Ostendorf, 2018; Kim et al., 2019), but these did not improve the model performance.
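The following is a minimal PyTorch sketch of CHIM applied to a linear transformation (Equations 16-17): a small chunked matrix is generated from [u; p], tiled back to the full weight shape, turned into importance gates with a sigmoid, and multiplied element-wise with the original weight. The module and variable names, and the single chunk factor shared by both dimensions, are assumptions for the sketch.

```python
import torch
import torch.nn as nn

class CHIMLinear(nn.Module):
    """Chunk-wise importance matrix applied to f(x) = Wx + b (Eq. 16-17)."""
    def __init__(self, in_dim, out_dim, attr_dim=300, chunk=15):
        super().__init__()
        assert in_dim % chunk == 0 and out_dim % chunk == 0
        self.chunk_out, self.chunk_in = out_dim // chunk, in_dim // chunk
        self.chunk = chunk
        self.W_c = nn.Linear(2 * attr_dim, self.chunk_out * self.chunk_in)  # c = W_c[u;p] + b_c
        self.linear = nn.Linear(in_dim, out_dim)                            # original W and b

    def forward(self, x, u, p):               # x: (batch, in_dim), u/p: (batch, attr_dim)
        c = self.W_c(torch.cat([u, p], dim=-1))
        C = c.view(-1, self.chunk_out, self.chunk_in)                # chunked matrix
        W_gate = torch.sigmoid(C.repeat(1, self.chunk, self.chunk))  # tile C1*C2 times, Eq. 16
        W = self.linear.weight.unsqueeze(0) * W_gate                 # shrink unimportant weights, Eq. 17
        return torch.einsum('boi,bi->bo', W, x) + self.linear.bias

layer = CHIMLinear(in_dim=300, out_dim=300, attr_dim=300, chunk=15)
y = layer(torch.randn(8, 300), torch.randn(8, 300), torch.randn(8, 300))
```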
Using the approaches described above, we can inject the attribute representation into four different parts of the model. This section describes what it means to inject attributes at a certain location and why previous work has been injecting them in the worst location (i.e., the attention mechanism).

In the attention mechanism
Injecting attributes into the attention mechanism means that we bias the selection of more informative words during pooling. For example, in Figure 1b, a user may find delicious drinks to be the most important aspect of a restaurant. Injection into the attention mechanism would bias the selection of words such as wine, smooth, and sweet to create the document encoding. This is the standard location in the model to inject the attributes, and several works (Chen et al., 2016; Amplayo et al., 2018a) have shown how the injected attention mechanism selects different words when the given user or product is different.

We argue, however, that the attention mechanism is not the best location to inject the attributes. This is because we cannot obtain user- or product-biased sentiment information from the representation. In the example above, although we may be able to select, with user bias, the words wine and sweet in the text, we do not know whether the user has a positive or negative sentiment towards these words (e.g., Does the user like wine? How about sweet wines?). In contrast, the three other locations we discuss below use the attributes to modify how the model looks at sentiment at different levels of textual granularity.
In the word embedding

Injecting attributes into the word embedding means that we bias the sentiment intensity of a word independently of its neighboring context. For example, if a user normally uses the words tasty and delicious with a less and a more positive intensity, respectively, the corresponding attribute-injected word embeddings would come out less similar, despite both words being synonymous.
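As a hypothetical usage example, injection at this location simply wraps the embedding transform of Equation 5 with the CHIMLinear module sketched earlier (this snippet assumes that class is in scope); shapes and names are illustrative only.

```python
import torch

# Hypothetical use of the CHIMLinear sketch at the embedding location (Eq. 5):
# w_t = tanh(f(x'_t, u, p)) with f's weight matrix gated by CHIM.
emb_dim, attr_dim = 300, 300
chim_emb = CHIMLinear(emb_dim, emb_dim, attr_dim)       # from the earlier sketch

x_prime = torch.randn(4, 20, emb_dim)                   # (batch, seq_len, emb_dim) word vectors
u = torch.randn(4, attr_dim)                            # user embeddings
p = torch.randn(4, attr_dim)                            # product embeddings

# apply the attribute-gated transform position by position
w = torch.stack([torch.tanh(chim_emb(x_prime[:, t], u, p))
                 for t in range(x_prime.size(1))], dim=1)   # (batch, seq_len, emb_dim)
```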
In the BiLSTM encoder
Injecting attributes into the encoder means that we bias the contextualization of words based on their neighbors in the text. For example, if a user likes their cake sweet but their drink with no sugar, the attribute-injected encoder would give a positive signal to the encoding of sweet in the text "the cake was sweet" and a negative signal in the text "the drink was sweet".
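A sketch of what this could look like for one forward step of the LSTM (Equations 6-8), again reusing the hypothetical CHIMLinear module from above so that W_enc is gated by the attributes; the cell structure and names are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class CHIMLSTMCell(nn.Module):
    """One forward LSTM step (Eq. 6-8) whose weight matrix W_enc is gated by CHIM."""
    def __init__(self, in_dim, hid_dim, attr_dim=300, chunk=15):
        super().__init__()
        # W_enc maps [w_t; h_{t-1}] to the four pre-activations g, i, f, o
        self.chim_enc = CHIMLinear(in_dim + hid_dim, 4 * hid_dim, attr_dim, chunk)

    def forward(self, w_t, h_prev, c_prev, u, p):
        z = self.chim_enc(torch.cat([w_t, h_prev], dim=-1), u, p)
        g, i, f, o = z.chunk(4, dim=-1)
        g, i, f, o = torch.tanh(g), torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c_t = f * c_prev + i * g              # Eq. 7
        h_t = o * c_t                         # Eq. 8
        return h_t, c_t

cell = CHIMLSTMCell(in_dim=300, hid_dim=150)
h, c = cell(torch.randn(2, 300), torch.zeros(2, 150), torch.zeros(2, 150),
            torch.randn(2, 300), torch.randn(2, 300))
```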
In the logistic classifier

Injecting attributes into the classifier means that we bias the probability distribution of sentiment based on the final document encoding. If a user tends to classify the sentiment of reviews about sweet cakes as highly positive, then the model would give a high probability to highly positive sentiment classes for texts such as "the cake was sweet".

Table 1: Statistics of the datasets used for the Sentiment Classification task.

We perform experiments on two tasks. The first task is Sentiment Classification, where we are tasked to classify the sentiment of a review text, given additionally the user and product information as attributes. The second task is Attribute Transfer, where we attempt to transfer the attribute encodings learned from the sentiment classification model to solve two other tasks: (a) Product Category Classification, where we are tasked to classify the category of the product, and (b) Review Headline Generation, where we are tasked to generate the title of the review, given only the user and product attribute encodings. Datasets, evaluation metrics, and competing models are different for each task and are described in their corresponding sections.

Unless otherwise stated, our models are implemented with the following settings. We set the dimensions of the word, user, and product vectors to 300. We use pre-trained GloVe embeddings (Pennington et al., 2014; https://nlp.stanford.edu/projects/glove/) to initialize the word vectors. We also set the dimensions of the hidden state of the BiLSTM to 300 (i.e., 150 dimensions for each of the forward/backward hidden states). The chunk size factors C_1 and C_2 are both set to 15. We use dropout (Srivastava et al., 2014) on all non-linear connections with a dropout rate of 0.1. We set the batch size to 32. Training is done via stochastic gradient descent over shuffled mini-batches with the Adadelta update rule (Zeiler, 2012) and with an l2 constraint (Hinton et al., 2012) of 3. We perform early stopping using the development set. Training and experiments are done using an NVIDIA GeForce GTX 1080 Ti graphics card.
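As a rough illustration of these settings, the snippet below shows a hypothetical training step with Adadelta updates and a max-norm (l2) constraint of 3 applied to weight rows after each update, in the spirit of Hinton et al. (2012). The stand-in model and helper names are assumptions, not the authors' training script.

```python
import torch
import torch.nn as nn

model = nn.Linear(300, 5)                       # stand-in for the full classifier
optimizer = torch.optim.Adadelta(model.parameters())
loss_fn = nn.CrossEntropyLoss()

def train_step(x, y, max_norm=3.0):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    with torch.no_grad():                       # renormalize rows whose l2 norm exceeds max_norm
        w = model.weight
        norms = w.norm(dim=1, keepdim=True).clamp(min=1e-12)
        w.mul_(norms.clamp(max=max_norm) / norms)
    return loss.item()

loss = train_step(torch.randn(32, 300), torch.randint(0, 5, (32,)))
```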
We use three widely used sentiment classification datasets with user and product information available: IMDB, Yelp 2013, and Yelp 2014 (available at https://drive.google.com/open?id=1PxAkmPLFMnfom46FMMXkHeqIxDbA16oy). These datasets were curated by Tang et al. (2015), who ensured that both users and products are twenty-core (i.e., users have reviewed at least twenty products and vice versa), split them into train, dev, and test sets with an 8:1:1 ratio, and tokenized and sentence-split the text using Stanford CoreNLP (Manning et al., 2014). Dataset statistics are shown in Table 1.

Evaluation is done using two metrics: accuracy, which measures the overall sentiment classification performance, and RMSE, which measures the divergence between predicted and ground-truth classes.
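For clarity, the two metrics can be computed as follows; this is a plain NumPy sketch with assumed function names, not tied to any particular evaluation script.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Overall sentiment classification accuracy."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

def rmse(y_true, y_pred):
    """Root mean squared error between predicted and ground-truth classes."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(((y_true - y_pred) ** 2).mean()))

print(accuracy([1, 2, 3], [1, 2, 2]), rmse([1, 2, 3], [1, 2, 2]))
```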
Comparisons of different attribute representation and injection methods

To conduct a fair comparison among the different methods described in Section 2, we compare these methods when applied to our base model using the development set of the datasets. Specifically, we use a smaller version of our base model (with dimensions set to 64) and incorporate the user and product attributes using nine different approaches: (1) bias-attention: the bias-based method injected into the attention mechanism, (2-5) the matrix-based method injected into four different locations (matrix-embedding, matrix-encoder, matrix-attention, matrix-classifier), and (6-9) the CHIM-based method injected into four different locations (CHIM-embedding, CHIM-encoder, CHIM-attention, CHIM-classifier). We then calculate the accuracy of each approach for all datasets.

Figure 3: Accuracies (y-axis) of different attribute representation (bias, matrix, CHIM) and injection (emb: embed, enc: encode, att: attend, cls: classify) approaches on the development set of the datasets.

Results are shown in Figure 3. The figure shows that bias-attention consistently performs poorly compared to the other approaches. As expected, matrix-based representations perform the worst when injected into the embeddings and the encoder; however, we already see improvements over bias-attention when these representations are injected into the attention mechanism and the classifier. This is because the weight matrices of the attention mechanism and the classifier contain relatively few parameters compared to those of the embeddings and the encoder, and are thus easier to optimize. The CHIM-based representations perform the best among all approaches, with CHIM-embedding garnering the highest accuracy across datasets. Finally, even when using a better representation method, CHIM-attention consistently performs the worst among the CHIM-based representations. This shows that the attention mechanism is not the optimal location to inject attributes.
Comparisons with models in the literature
We also compare with models from previous work, listed below:

1. UPNN (Tang et al., 2015) uses a CNN classifier as base model and incorporates attributes as user- and product-specific weight parameters in the word embeddings and logistic classifier.

2. UPDMN (Dou, 2017) uses an LSTM classifier as base model and incorporates attributes as a separate deep memory network that uses other related documents as memory.

3. NSC (Chen et al., 2016) uses a hierarchical LSTM classifier as base model and incorporates attributes using the bias-attention method on both word- and sentence-level LSTMs.

4. DUPMN (Long et al., 2018) also uses a hierarchical LSTM as base model and incorporates attributes as two separate deep memory networks, one for each attribute.

5. PMA (Zhu and Yang, 2017) is similar to NSC but uses external features such as the ranking preference method of a specific user.

6. HCSC (Amplayo et al., 2018a) uses a combination of BiLSTM and CNN as base model, incorporates attributes using the bias-attention method, and also considers the existence of cold-start entities.
7. CMA (Ma et al., 2017) uses a combination of LSTM and hierarchical attention classifier as base model, incorporates attributes using the bias-attention method, and does this separately for user and product.

Notice that most of these models, especially the later ones, use the bias-attention method to represent and inject attributes, but also employ a more complex model architecture to enjoy a boost in performance.

Table 2: Sentiment classification results of competing models based on accuracy and RMSE metrics on the three datasets. Underlined values correspond to the best values for each block. Boldfaced values correspond to the best values across the board. Markers indicate models that use additional external features, use a method that considers cold-start entities, or use separate bias-attention for user and product.

Model (Name / Base Model / Injection)        IMDB Acc / RMSE    Yelp 2013 Acc / RMSE    Yelp 2014 Acc / RMSE
UPNN / CNN / embedding, classifier           43.5 / 1.602       59.6 / 0.748            60.8 / 0.764
UPDMN / LSTM / memory networks               46.5 / 1.351       63.9 / 0.662            61.3 / 0.720
NSC / HierLSTM / attention                   53.3 / 1.281       65.0 / 0.692            66.7 / 0.654
DUPMN / HierLSTM / memory networks           53.9 / 1.279       66.2 / 0.667            67.6 / 0.639
PMA / HierLSTM / attention

Results are summarized in Table 2. On all three datasets, our best results outperform all previous models on both accuracy and RMSE. Among our four models, CHIM-embedding performs the best in terms of accuracy, with performance increases of 2.4%, 1.3%, and 1.6% on IMDB, Yelp 2013, and Yelp 2014, respectively. CHIM-classifier performs the best in terms of RMSE, outperforming all other models on both the Yelp 2013 and 2014 datasets. Among our models, CHIM-attention performs the worst, which mirrors the results of our previous experiment (see Figure 3). We emphasize that our models use a simple BiLSTM as base model, and extensions to the base model (e.g., using multiple hierarchical LSTMs as in Wu et al. 2018), as well as to other aspects (e.g., consideration of cold-start entities as in Amplayo et al. 2018a), are orthogonal to our proposed attribute representation and injection method. Thus, we expect a further increase in performance when these extensions are done.

In this section, we investigate whether it is possible to transfer the attribute encodings, learned from the sentiment classification model, to other tasks: product category classification and review headline generation. The experimental setup is as follows. First, we train a sentiment classification model using an attribute representation and injection method of choice to learn the attribute encodings. Then, we use these fixed encodings as input to the task-specific model.
Dataset
We collected a new dataset from Amazon (https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_multilingual_US_v1_00.tsv.gz), which includes the product category and the review headline, aside from the review text, the sentiment score, and the user and product attributes. Following Tang et al. (2015), we ensured that both users and products are twenty-core, split them into train, dev, and test sets with an 8:1:1 ratio, and tokenized and sentence-split the text using Stanford CoreNLP (Manning et al., 2014). The final dataset contains 77,028 data points, with 1,728 users and 1,890 products. This is used as the sentiment classification dataset.

To create the task-specific datasets, we split the dataset again such that no user and no product is seen in more than one split. That is, if user u is found in the train set, then it should not be found in the dev or test sets. We remove the user-product pairs that do not satisfy this condition. We then append the corresponding product category and review headline for each user-product pair. The final split contains 46,151 training, 711 development, and 840 test instances. It also contains two product categories: Music and Video DVD. The review headline is tokenized using SentencePiece (https://github.com/google/sentencepiece) with a 10k vocabulary. The datasets are released here for reproducibility: https://github.com/rktamplayo/CHIM.
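To illustrate the disjointness constraint on the re-split, here is a small hypothetical sketch: user-product pairs are kept only if both the user and the product fall inside a single split, and pairs crossing splits are dropped. The function and variable names are invented for the example.

```python
from collections import defaultdict

def disjoint_split(pairs, split_users, split_prods):
    """Assign (user, product) pairs to splits whose user and product sets are disjoint."""
    splits = defaultdict(list)
    for user, prod in pairs:
        for name in ("train", "dev", "test"):
            if user in split_users[name] and prod in split_prods[name]:
                splits[name].append((user, prod))
                break                       # pairs crossing splits are simply dropped
    return splits

example = disjoint_split(
    [("u1", "p1"), ("u1", "p2"), ("u2", "p2")],
    {"train": {"u1"}, "dev": {"u2"}, "test": set()},
    {"train": {"p1"}, "dev": {"p2"}, "test": set()},
)
print(dict(example))    # {'train': [('u1', 'p1')], 'dev': [('u2', 'p2')]}
```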
Evaluation

In this experiment, we compare five different attribute representation and injection methods: (1) the bias-attention method, and (2-5) the CHIM-based representation method injected into each of the four different locations in the model. We use the attribute encodings, which are learned from pre-training on the sentiment classification dataset, as input to the transfer tasks, where they are fixed and not updated during training. As a baseline, we also show results when using encodings with randomly set weights. Moreover, we additionally report the majority class as a further baseline for product category classification. For the product category classification task, we use a logistic classifier as the classification model and accuracy as the evaluation metric. For the review headline generation task, we use an LSTM decoder as the generation model and perplexity as the evaluation metric.

Table 3: Accuracy (higher is better) and perplexity (lower is better) of competing models on the Amazon dataset for the transfer tasks on product category classification and review headline generation, respectively. Accuracy intervals are calculated by running the model 10 times. Performance worse than the random and majority baselines is colored red.
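A hypothetical sketch of the product category transfer setup follows: the pre-trained user and product encodings are frozen and only a logistic classifier on top is trained. The optimizer choice and all names are assumptions; in practice the embedding tables would be loaded from the sentiment model rather than randomly initialized.

```python
import torch
import torch.nn as nn

attr_dim, num_categories = 300, 2               # two categories: Music and Video DVD

user_emb = nn.Embedding(1728, attr_dim)          # loaded from the sentiment model in practice
product_emb = nn.Embedding(1890, attr_dim)
user_emb.weight.requires_grad_(False)            # encodings are fixed, not updated
product_emb.weight.requires_grad_(False)

classifier = nn.Linear(2 * attr_dim, num_categories)
optimizer = torch.optim.Adadelta(classifier.parameters())

u_ids = torch.randint(0, 1728, (32,))
p_ids = torch.randint(0, 1890, (32,))
labels = torch.randint(0, num_categories, (32,))

logits = classifier(torch.cat([user_emb(u_ids), product_emb(p_ids)], dim=-1))
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
```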
Results
For the product category classification task, the results are reported in Table 3. The table shows that representations learned from CHIM-based methods perform better than the random baseline. The best model, CHIM-encoder, achieves an increase of at least 3 points in accuracy compared to the baseline. This means that, interestingly, CHIM-based attribute representations have also learned information about the category of the product. In contrast, representations learned from the bias-attention method do not transfer well to this task, leading to worse results compared to the random and majority baselines. Moreover, CHIM-attention performs the worst among the CHIM-based models, which further shows the ineffectiveness of injecting attributes into the attention mechanism.

Results for the review headline generation task are also shown in Table 3. The table shows less promising results: the best model, CHIM-encoder, achieves a decrease of only 0.88 points in perplexity relative to the random encodings. Although this still means that some information has been transferred, one may argue that the gain is too small to be considered significant. However, it is widely acknowledged that using only the user and product attributes to generate text is unreasonable, since we expect the model to generate coherent texts from only two vectors. This impossibility is also reported by Dong et al. (2017), who also used sentiment information, and Ni and McAuley (2018), who additionally used learned aspects and a short version of the text in order to generate well-formed texts. Nevertheless, the results of this experiment agree with the results above regarding injecting attributes into the attention mechanism: bias-attention performs worse than the random baseline, and CHIM-attention performs the worst among the CHIM-based models.
All our experiments unanimously show that (a) the bias-based attribute representation method is not the most optimal method, and (b) injecting attributes in the attention mechanism results in the worst performance among all locations in the model, regardless of the representation method used. The question "where is the best location to inject attributes?" remains unanswered, since different tasks and settings produce different best models. That is, CHIM-embedding achieves the best accuracy while CHIM-classifier achieves the best RMSE on sentiment classification. Moreover, CHIM-encoder produces the most transferable attribute encoding for both product category classification and review headline generation. Our suggestion, then, is to conduct experiments on all locations and check which one is best for the task at hand.

Finally, we also investigate whether injecting into more than one location results in better performance. Specifically, we jointly inject in two different locations at once using CHIM, and do this for all possible pairs of locations. We use the smaller version of our base model and calculate the accuracies of the different models using the development set of the Yelp 2013 dataset.
Figure 4: Heatmap of the accuracies of singly andjointly injected CHIM models. Values on each cellrepresents either the accuracy (for singly injected mod-els) or the difference between the singly and doublyinjected models per row. rized into two statements. Firstly, injecting on theembedding and another location (aside from theattention mechanism) leads to a slight decrease inperformance. Secondly and interestingly, inject-ing on the attention mechanism and another loca-tion always leads to the highest increase in perfor-mance, where CHIM-attention+embedding per-forms the best, outperforming CHIM-embedding.This shows that injecting in different locationsmight capture different information, and we leavethis investigation for future work.
Aside from user and product information, other attributes have been used for sentiment classification. Location-based (Yang et al., 2017) and time-based (Fukuhara et al., 2007) attributes help contextualize the sentiment geographically and temporally. Latent attributes learned from another model have also been employed as additional features, such as latent topics from a topic model (Lin and He, 2009), latent aspects from an aspect extraction model (Jo and Oh, 2011), and argumentation features (Wachsmuth et al., 2015), among others. Unfortunately, current benchmark datasets do not include these attributes, thus it is practically impossible to compare and use them in our experiments. Nevertheless, the methods in this paper are not limited to user and product attributes; they extend to these other attributes as well, whenever they are available.
Incorporating user and product attributes into NLP models makes them more personalized and can thus increase user satisfaction (Baruzzo et al., 2009). Examples of other NLP tasks that use these attributes are text classification (Kim et al., 2019), language modeling (Jaech and Ostendorf, 2018), text generation (Dong et al., 2017; Ni and McAuley, 2018), review summarization (Yang et al., 2018b), machine translation (Michel and Neubig, 2018), and dialogue response generation (Zhang et al., 2017). In these tasks, the bias-attention method is frequently used since it is trivially easy to apply, and there have been no attempts to investigate different possible methods for attribute representation and injection. We expect this paper to serve as the first investigatory paper that contradicts the positive results previous work has reported for the bias-attention method.
We showed that the currently accepted standard for attribute representation and injection, i.e., bias-attention, which incorporates attributes as additional biases in the attention mechanism, is the least effective method. We proposed to represent attributes as chunk-wise importance weight matrices (CHIM) and showed that this representation method significantly outperforms the bias-attention method. Despite using a simple BiLSTM classifier as base model, CHIM significantly outperforms the current state-of-the-art models, even when those models use a more complex base model architecture. Furthermore, we conducted several experiments which conclude that injection into the attention mechanism, no matter which representation method is used, yields the worst performance. This result contradicts previously reported conclusions regarding attribute injection into the attention mechanism. Finally, we showed promising results on transferring the attribute representations from sentiment classification to two other tasks, product category classification and review headline generation.
Acknowledgments
We would like to thank the anonymous reviewers for their helpful feedback and suggestions. Reinald Kim Amplayo is grateful to be supported by a Google PhD Fellowship.

References
Reinald Kim Amplayo, Jihyeok Kim, Sua Sung, and Seung-won Hwang. 2018a. Cold-start aware user and product attention for sentiment classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 2535–2544.

Reinald Kim Amplayo, Kyungjae Lee, Jinyoung Yeo, and Seung-won Hwang. 2018b. Translations as additional contexts for sentence classification. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, pages 3955–3961.

Stefanos Angelidis and Mirella Lapata. 2018. Summarizing opinions: Aspect extraction meets sentiment prediction and they are both weakly supervised. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 3675–3686.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015.

Andrea Baruzzo, Antonina Dattolo, Nirmala Pudota, and Carlo Tasso. 2009. A general framework for personalized text classification and annotation. In Proceedings of the Workshop on Adaptation and Personalization for Web 2.0, AP WEB 2.0@UMAP, Trento, Italy, June 22, 2009.

Huimin Chen, Maosong Sun, Cunchao Tu, Yankai Lin, and Zhiyuan Liu. 2016. Neural sentiment classification with user and product attention. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 1650–1659.

Li Dong, Shaohan Huang, Furu Wei, Mirella Lapata, Ming Zhou, and Ke Xu. 2017. Learning to generate product reviews from attributes. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 623–632, Valencia, Spain. Association for Computational Linguistics.

Zi-Yi Dou. 2017. Capturing user and product information for document level sentiment analysis with deep memory network. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 521–526. Association for Computational Linguistics.

Jessica Ficler and Yoav Goldberg. 2017. Controlling linguistic style aspects in neural language generation. CoRR, abs/1707.02633.

Tomohiro Fukuhara, Hiroshi Nakagawa, and Toyoaki Nishida. 2007. Understanding sentiment of people from news articles: Temporal sentiment analysis of social events. In Proceedings of the First International Conference on Weblogs and Social Media, ICWSM 2007, Boulder, Colorado, USA, March 26-28, 2007.

Wenliang Gao, Naoki Yoshinaga, Nobuhiro Kaji, and Masaru Kitsuregawa. 2013. Modeling user leniency and product popularity for sentiment classification. In Sixth International Joint Conference on Natural Language Processing, IJCNLP 2013, Nagoya, Japan, October 14-18, 2013, pages 1107–1111.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, pages 770–778.

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Aaron Jaech and Mari Ostendorf. 2018. Low-rank RNN adaptation for context-aware language modeling. TACL, 6:497–510.

Yohan Jo and Alice H. Oh. 2011. Aspect and sentiment unification model for online review analysis. In Proceedings of the Fourth International Conference on Web Search and Web Data Mining, WSDM 2011, Hong Kong, China, February 9-12, 2011, pages 815–824.

Jihyeok Kim, Reinald Kim Amplayo, Kyungjae Lee, Sua Sung, Minji Seo, and Seung-won Hwang. 2019. Categorical metadata representation for customized text classification. TACL, 7:201–215.

Soo-Min Kim and Eduard H. Hovy. 2004. Determining the sentiment of opinions. In COLING 2004, 20th International Conference on Computational Linguistics, Proceedings of the Conference, 23-27 August 2004, Geneva, Switzerland.

Fangtao Li, Nathan Nan Liu, Hongwei Jin, Kai Zhao, Qiang Yang, and Xiaoyan Zhu. 2011. Incorporating reviewer and product information for review rating prediction. In IJCAI 2011, Proceedings of the 22nd International Joint Conference on Artificial Intelligence, Barcelona, Catalonia, Spain, July 16-22, 2011, pages 1820–1825.

Chenghua Lin and Yulan He. 2009. Joint sentiment/topic model for sentiment analysis. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM 2009, Hong Kong, China, November 2-6, 2009, pages 375–384.

Bing Liu. 2012. Sentiment Analysis and Opinion Mining. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.

Yunfei Long, Mingyu Ma, Qin Lu, Rong Xiang, and Chu-Ren Huang. 2018. Dual memory network model for biased product review classification. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 140–148. Association for Computational Linguistics.

Dehong Ma, Sujian Li, Xiaodong Zhang, Houfeng Wang, and Xu Sun. 2017. Cascading multiway attentions for document-level sentiment classification. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 634–643. Asian Federation of Natural Language Processing.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, System Demonstrations, pages 55–60.

Paul Michel and Graham Neubig. 2018. Extreme adaptation for personalized neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 2: Short Papers, pages 312–318.

Andrew Y. Ng. 2004. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML '04, pages 78–, New York, NY, USA. ACM.

Jianmo Ni and Julian McAuley. 2018. Personalized review generation by expanding phrases and attending on aspect-aware representations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 2: Short Papers, pages 706–711.

Bo Pang and Lillian Lee. 2007. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135.

Dae Hoon Park, Hyun Duk Kim, ChengXiang Zhai, and Lifan Guo. 2015. Retrieval of relevant opinion sentences for new products. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, Santiago, Chile, August 9-13, 2015, pages 393–402.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1532–1543.

Yikang Shen, Shawn Tan, Alessandro Sordoni, and Aaron C. Courville. 2019. Ordered neurons: Integrating tree structures into recurrent neural networks. In 7th International Conference on Learning Representations, ICLR 2019.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.

Chenhao Tan, Lillian Lee, Jie Tang, Long Jiang, Ming Zhou, and Ping Li. 2011. User-level sentiment analysis incorporating social networks. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 21-24, 2011, pages 1397–1405.

Duyu Tang, Bing Qin, and Ting Liu. 2015. Learning semantic representations of users and products for document level sentiment classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pages 1014–1023.

Henning Wachsmuth, Johannes Kiesel, and Benno Stein. 2015. Sentiment flow - A general model of web review argumentation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 601–611.

Zhen Wu, Xin-Yu Dai, Cunyan Yin, Shujian Huang, and Jiajun Chen. 2018. Improving review representations with user attention and product attention for sentiment classification. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 5989–5996.

Min Yang, Jincheng Mei, Heng Ji, Wei Zhao, Zhou Zhao, and Xiaojun Chen. 2017. Identifying and tracking sentiments and topics from social media texts during natural disasters. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 527–533.

Min Yang, Qiang Qu, Jia Zhu, Ying Shen, and Zhou Zhao. 2018a. Cross-domain aspect/sentiment-aware abstractive review summarization. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM 2018, Torino, Italy, October 22-26, 2018, pages 1531–1534.

Min Yang, Wenting Tu, Qiang Qu, Zhou Zhao, Xiaojun Chen, and Jia Zhu. 2018b. Personalized response generation by dual-learning based domain adaptation. Neural Networks, 103:72–82.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alexander J. Smola, and Eduard H. Hovy. 2016. Hierarchical attention networks for document classification. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12-17, 2016, pages 1480–1489.

Matthew D. Zeiler. 2012. ADADELTA: An adaptive learning rate method. CoRR, abs/1212.5701.

Weinan Zhang, Ting Liu, Yifa Wang, and Qingfu Zhu. 2017. Neural personalized response generation as domain adaptation. CoRR, abs/1701.02073.

Pengcheng Zhu and Yujiu Yang. 2017. Parallel multi-feature attention on neural sentiment classification. In