Aspect Level Sentiment Classification with Attention-over-Attention Neural Networks
Binxuan Huang, Yanglan Ou and Kathleen M. Carley
Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, United States {binxuanh,kathleen.carley}@cs.cmu.edu, [email protected]
Abstract.
Aspect-level sentiment classification aims to identify the sentiment expressed towards some aspects given context sentences. In this paper, we introduce an attention-over-attention (AOA) neural network for aspect level sentiment classification. Our approach models aspects and sentences in a joint way and explicitly captures the interaction between aspects and context sentences. With the AOA module, our model jointly learns the representations for aspects and sentences, and automatically focuses on the important parts in sentences. Our experiments on laptop and restaurant datasets demonstrate that our approach outperforms previous LSTM-based architectures.
Introduction

Unlike the document level sentiment classification task [4,15], aspect level sentiment classification is a more fine-grained classification task. It aims at identifying the sentiment polarity (e.g. positive, negative, neutral) of one specific aspect in its context sentence. For example, given the sentence "great food but the service was dreadful", the sentiment polarities for the aspects "food" and "service" are positive and negative respectively.

Aspect level sentiment classification overcomes one limitation of document level sentiment classification when multiple aspects appear in one sentence. In our previous example, there are two aspects and the general sentiment of the whole sentence is mixed with positive and negative polarity. If we ignore the aspect information, it is hard to determine the polarity for a specified target. Such errors commonly exist in general sentiment classification tasks. In one recent work, Jiang et al. manually evaluated a Twitter sentiment classifier and showed that 40% of sentiment classification errors are caused by not considering targets [6].

Many methods have been proposed to deal with aspect level sentiment classification. The typical way is to build a machine learning classifier by supervised training. Among these machine learning-based approaches, there are mainly two different types. One is to build a classifier based on manually created features [6,26]. The other is based on neural networks using end-to-end training without any prior knowledge [11,25,28]. Because of their capacity for learning representations from data without feature engineering, neural networks are becoming popular in this task.

Because of these advantages of neural networks, we approach the aspect level sentiment classification problem with long short-term memory (LSTM) neural networks. Previous LSTM-based methods mainly focus on modeling texts separately [23,28], while our approach models aspects and texts simultaneously using LSTMs. Furthermore, the target representation and text representation generated from LSTMs interact with each other through an attention-over-attention (AOA) module [2]. AOA automatically generates mutual attentions, not only from aspect-to-text but also from text-to-aspect. This is inspired by the observation that only a few words in a sentence contribute to the sentiment towards an aspect, and those sentiment-bearing words are often highly correlated with the aspects. For example, for the two aspects "appetizers" and "service" in the sentence "the appetizers are ok, but the service is slow", our language experience tells us that the negative word "slow" is more likely to describe "service" rather than "appetizers". Similarly, for an aspect phrase, we also need to focus on its most important part. That is why we choose AOA to attend to the most important parts in both the aspect and the sentence. Compared to previous methods, our model performs better on the laptop and restaurant datasets from SemEval 2014 [17].

Related Work

Sentiment Classification
Sentiment classification aims at detecting the sentiment polarity of text. Various approaches have been proposed for this research question [12]. Most existing works use machine learning algorithms to classify texts in a supervised fashion. Algorithms like Naive Bayes and Support Vector Machines (SVM) are widely used for this problem [10,15,27]. The majority of these approaches either rely on n-gram features or manually designed features. Multiple sentiment lexicons have been built for this purpose [14,18,22].

In recent years, sentiment classification has been significantly advanced by neural networks. Neural network based approaches automatically learn feature representations and do not require intensive feature engineering. Researchers have proposed a variety of neural network architectures. Classical methods include Convolutional Neural Networks [7], Recurrent Neural Networks [9,24], and Recursive Neural Networks [19,29]. These approaches have achieved promising results on sentiment analysis.
Aspect Level Sentiment Classification
Aspect level sentiment classification is a branch of sentiment classification, the goal of which is to identify the sentiment polarity of one specific aspect in a sentence. Some early works designed rule based models for aspect level sentiment classification, such as [3,13]. Nasukawa et al. first perform dependency parsing on sentences, then use predefined rules to determine the sentiment about aspects [13]. Jiang et al. improve target-dependent sentiment classification by creating several target-dependent features based on the sentences' grammar structures [6]. These target-dependent features are further fed into an SVM classifier along with other content features.

Later, various neural network based methods were introduced to solve this aspect level sentiment classification problem. Typical methods are based on LSTM neural networks. TD-LSTM approaches this problem by developing two LSTM networks to model the left and right contexts of an aspect target [23]. This method uses the last hidden states of these two LSTMs to predict the sentiment. In order to better capture the important parts of a sentence, Wang et al. use an aspect term embedding to generate an attention vector that concentrates on different parts of a sentence [28]. Along these lines, Ma et al. use two LSTM networks to model sentences and aspects separately [11]. They further use the hidden states generated from sentences to calculate attentions to aspect targets through a pooling operation, and vice versa. Hence their IAN model can attend to the important parts of both sentences and targets. Their method is similar to ours. However, the pooling operation ignores the interaction among word pairs between sentences and targets, and our experiments show that our method is superior to their model.
Problem Definition
In this aspect level sentiment classification problem, we are given a sentence $s = [w_1, w_2, ..., w_i, ..., w_{i+m-1}, ..., w_n]$ and an aspect target $t = [w_i, w_{i+1}, ..., w_{i+m-1}]$. The aspect target could be a single word or a long phrase. The goal is to classify the sentiment polarity of the aspect target in the sentence.
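As a concrete illustration of this input format, the short Python sketch below builds a few (sentence, aspect target, polarity) triples; the example sentences are taken from this paper, while all variable names are our own and not part of any released code.

```python
# Hypothetical (sentence, aspect target, polarity) instances; the target is
# a contiguous subsequence [w_i, ..., w_{i+m-1}] of the sentence tokens.
samples = [
    ("great food but the service was dreadful", "food", "positive"),
    ("great food but the service was dreadful", "service", "negative"),
    ("the appetizers are ok, but the service is slow", "appetizers", "neutral"),
]

for sentence, target, polarity in samples:
    # The classifier receives both the sentence and the target span.
    print(f"aspect={target!r:<14} polarity={polarity:<8} | {sentence}")
```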
Fig. 1. The overall architecture of our aspect level sentiment classification model.
The overall architecture of our neural model is shown in Figure 1. It is mainly composed of four components: word embedding, bidirectional long short-term memory (Bi-LSTM), the attention-over-attention module, and the final prediction.
Word Embedding
Given a sentence $s = [w_1, w_2, ..., w_i, ..., w_{i+m-1}, ..., w_n]$ with length $n$ and a target $t = [w_i, w_{i+1}, ..., w_{i+m-1}]$ with length $m$, we first map each word into a low-dimensional real-valued vector, called a word embedding [1]. For each word $w_i$, we get a vector $v_i \in \mathbb{R}^{d_w}$ from an embedding matrix $M \in \mathbb{R}^{V \times d_w}$, where $V$ is the vocabulary size and $d_w$ is the embedding dimension. After the embedding lookup, we get two sets of word vectors $[v_1; v_2; ...; v_n] \in \mathbb{R}^{n \times d_w}$ and $[v_i; v_{i+1}; ...; v_{i+m-1}] \in \mathbb{R}^{m \times d_w}$ for the sentence and the aspect phrase respectively.
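A minimal PyTorch sketch of this lookup step, with illustrative sizes; this is our own reconstruction, not the authors' code.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: vocabulary V and embedding dimension d_w = 300 (GloVe).
V, d_w = 10000, 300

# Embedding matrix M in R^{V x d_w}; in the paper it is initialized from
# pre-trained GloVe vectors and kept fixed during training.
embedding = nn.Embedding(V, d_w)
embedding.weight.requires_grad = False  # keep pre-trained vectors fixed

# Toy index sequences for a sentence of length n=8 and a target of length m=2.
sentence_ids = torch.randint(0, V, (1, 8))  # [batch, n]
target_ids = torch.randint(0, V, (1, 2))    # [batch, m]

sent_vecs = embedding(sentence_ids)  # [1, n, d_w]
tgt_vecs = embedding(target_ids)     # [1, m, d_w]
print(sent_vecs.shape, tgt_vecs.shape)
```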
Bi-LSTM

After getting the word vectors, we feed these two sets of word vectors into two Bi-LSTM networks respectively. We use these two Bi-LSTM networks to learn the hidden semantics of the words in the sentence and the target. Each Bi-LSTM is obtained by stacking two LSTM networks. The advantage of using LSTMs is that they avoid the gradient vanishing or exploding problem and are good at learning long-term dependencies [5].

With the input $s = [v_1; v_2; ...; v_n]$ and a forward LSTM network, we generate a sequence of hidden states $\overrightarrow{h_s} \in \mathbb{R}^{n \times d_h}$, where $d_h$ is the dimension of the hidden states. We generate another state sequence $\overleftarrow{h_s}$ by feeding $s$ into a backward LSTM. In the Bi-LSTM network, the final output hidden states $h_s \in \mathbb{R}^{n \times 2d_h}$ are generated by concatenating $\overrightarrow{h_s}$ and $\overleftarrow{h_s}$. We compute the hidden semantic states $h_t$ for the aspect target $t$ in the same way.

$$\overrightarrow{h_s} = \overrightarrow{LSTM}([v_1; v_2; ...; v_n]) \quad (1)$$
$$\overleftarrow{h_s} = \overleftarrow{LSTM}([v_1; v_2; ...; v_n]) \quad (2)$$
$$h_s = [\overrightarrow{h_s}, \overleftarrow{h_s}] \quad (3)$$
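The two Bi-LSTM encoders of equations (1)-(3) can be sketched in PyTorch as follows; this is our own illustration, where `bidirectional=True` performs the forward/backward concatenation described above.

```python
import torch
import torch.nn as nn

d_w, d_h = 300, 150  # embedding size and LSTM hidden state size

# One bidirectional LSTM for the sentence and one for the target.
# bidirectional=True runs a forward and a backward LSTM and concatenates
# their hidden states, producing outputs in R^{n x 2*d_h} (eqs. 1-3).
sent_lstm = nn.LSTM(d_w, d_h, batch_first=True, bidirectional=True)
tgt_lstm = nn.LSTM(d_w, d_h, batch_first=True, bidirectional=True)

sent_vecs = torch.randn(1, 8, d_w)  # [batch, n, d_w]
tgt_vecs = torch.randn(1, 2, d_w)   # [batch, m, d_w]

h_s, _ = sent_lstm(sent_vecs)  # [1, n, 2*d_h]
h_t, _ = tgt_lstm(tgt_vecs)    # [1, m, 2*d_h]
print(h_s.shape, h_t.shape)
```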
Attention-over-Attention

Given the hidden semantic representations of the text and the aspect target generated by the Bi-LSTMs, we calculate the attention weights for the text with an AOA module. This is inspired by the use of AOA in question answering [2]. Given the target representation $h_t \in \mathbb{R}^{m \times 2d_h}$ and the sentence representation $h_s \in \mathbb{R}^{n \times 2d_h}$, we first calculate a pair-wise interaction matrix $I = h_s \cdot h_t^T$, where the value of each entry represents the correlation of a word pair between the sentence and the target. With a column-wise softmax and a row-wise softmax, we get the target-to-sentence attention $\alpha$ and the sentence-to-target attention $\beta$:

$$\alpha_{ij} = \frac{\exp(I_{ij})}{\sum_i \exp(I_{ij})} \quad (4)$$
$$\beta_{ij} = \frac{\exp(I_{ij})}{\sum_j \exp(I_{ij})} \quad (5)$$

After averaging $\beta$ column-wise, we get a target-level attention $\bar{\beta} \in \mathbb{R}^m$, which indicates the important parts of the aspect target:

$$\bar{\beta}_j = \frac{1}{n} \sum_i \beta_{ij} \quad (6)$$

The final sentence-level attention $\gamma \in \mathbb{R}^n$ is calculated as a weighted sum of the individual target-to-sentence attentions in $\alpha$:

$$\gamma = \alpha \cdot \bar{\beta}^T \quad (7)$$

By considering the contribution of each aspect word explicitly, we learn the importance weights for each word in the sentence.
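Equations (4)-(7) translate directly into a few tensor operations. Below is a minimal batched sketch of the AOA module (our own reconstruction, with random tensors standing in for the Bi-LSTM outputs):

```python
import torch

def attention_over_attention(h_s, h_t):
    """AOA module (eqs. 4-7). h_s: [batch, n, d], h_t: [batch, m, d],
    where d = 2*d_h is the Bi-LSTM output size."""
    # Pair-wise interaction matrix I = h_s . h_t^T, shape [batch, n, m].
    I = torch.bmm(h_s, h_t.transpose(1, 2))
    # Column-wise softmax over sentence positions: target-to-sentence attention.
    alpha = torch.softmax(I, dim=1)            # each column sums to 1 (eq. 4)
    # Row-wise softmax over target positions: sentence-to-target attention.
    beta = torch.softmax(I, dim=2)             # each row sums to 1 (eq. 5)
    # Average beta over sentence positions -> target-level attention (eq. 6).
    beta_bar = beta.mean(dim=1, keepdim=True)  # [batch, 1, m]
    # Weighted sum of target-to-sentence attentions (eq. 7): gamma in R^n.
    gamma = torch.bmm(alpha, beta_bar.transpose(1, 2))  # [batch, n, 1]
    return gamma

h_s = torch.randn(1, 8, 300)  # stand-in sentence representation
h_t = torch.randn(1, 2, 300)  # stand-in target representation
gamma = attention_over_attention(h_s, h_t)
print(gamma.squeeze(-1))  # sentence-level attention weights
```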
Final Classification

The final sentence representation is a weighted sum of the sentence hidden semantic states, using the sentence attention from the AOA module:

$$r = h_s^T \cdot \gamma \quad (8)$$

We regard this sentence representation as the final classification feature and feed it into a linear layer to project $r$ into the space of the $C$ targeted classes:

$$x = W_l \cdot r + b_l \quad (9)$$

where $W_l$ and $b_l$ are the weight matrix and bias respectively. Following the linear layer, we use a softmax layer to compute the probability that the sentence $s$ has sentiment polarity $c \in C$ towards an aspect:

$$P(y = c) = \frac{\exp(x_c)}{\sum_{i \in C} \exp(x_i)} \quad (10)$$

The final predicted sentiment polarity of an aspect target is simply the label with the highest probability. We train our model to minimize the cross-entropy loss with $L_2$ regularization:

$$loss = -\sum_i \sum_{c \in C} I(y_i = c) \cdot \log(P(y_i = c)) + \lambda \lVert \theta \rVert^2 \quad (11)$$

where $I(\cdot)$ is an indicator function, $\lambda$ is the $L_2$ regularization parameter, and $\theta$ is the set of weight matrices in the LSTM networks and the linear layer. We further apply dropout to avoid overfitting, where we randomly drop part of the inputs of the LSTM cells.

We use mini-batch stochastic gradient descent with the Adam [8] update rule to minimize the loss function with respect to the weight matrices and bias terms in our model.
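As an illustration, equations (8)-(11) map onto a few lines of PyTorch. This is our own sketch with placeholder sizes, a placeholder λ, and a stand-in attention vector; it is not the authors' released code.

```python
import torch
import torch.nn as nn

d, C = 300, 3  # feature size (2*d_h) and number of polarity classes

linear = nn.Linear(d, C)  # W_l and b_l in eq. (9)

h_s = torch.randn(4, 8, d)  # sentence hidden states for a batch of 4
gamma = torch.softmax(torch.randn(4, 8, 1), dim=1)  # stand-in AOA attention

# r = h_s^T . gamma (eq. 8): attention-weighted sentence representation.
r = torch.bmm(h_s.transpose(1, 2), gamma).squeeze(-1)  # [batch, d]
x = linear(r)                                          # logits (eq. 9)

labels = torch.tensor([0, 1, 2, 1])
# Cross-entropy (eqs. 10-11); CrossEntropyLoss applies softmax internally.
ce = nn.CrossEntropyLoss()(x, labels)
# Explicit L2 penalty over the weight matrices, as in eq. (11).
lam = 1e-4  # placeholder value of lambda
l2 = sum((w ** 2).sum() for w in [linear.weight])
loss = ce + lam * l2
loss.backward()
print(loss.item())
```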
Dataset

We experiment on two domain-specific datasets for laptops and restaurants from SemEval 2014 Task 4 [17]. Experienced annotators tagged the aspect terms of the sentences and their polarities. The distribution by sentiment polarity category is given in Table 1.
Dataset           Positive  Neutral  Negative
Laptop-Train           994      464       870
Laptop-Test            341      169       128
Restaurant-Train      2164      637       807
Restaurant-Test        728      196       196
Table 1. Distribution by sentiment polarity category of the datasets from SemEval 2014 Task 4. Numbers in the table represent the number of sentence-aspect pairs.
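The SemEval 2014 Task 4 data is distributed as XML. A sketch of loading it into sentence-aspect pairs might look like the following; the file name and the decision to skip the extra "conflict" label are our assumptions, made to match the three polarities reported in Table 1.

```python
import xml.etree.ElementTree as ET

def load_semeval(path):
    """Parse a SemEval 2014 Task 4 XML file into (sentence, aspect, polarity)
    triples, assuming the <sentence>/<aspectTerms>/<aspectTerm> layout of
    the official distribution."""
    samples = []
    for sent in ET.parse(path).getroot().iter("sentence"):
        text = sent.find("text").text
        terms = sent.find("aspectTerms")
        if terms is None:
            continue  # skip sentences without annotated aspect terms
        for term in terms.iter("aspectTerm"):
            polarity = term.get("polarity")
            if polarity == "conflict":
                continue  # Table 1 lists positive/neutral/negative only
            samples.append((text, term.get("term"), polarity))
    return samples

# Hypothetical file name from the SemEval 2014 release:
# samples = load_semeval("Restaurants_Train.xml")
```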
Hyperparameters Setting
In our experiments, we first randomly select 20% of the training data as a validation set to tune the hyperparameters. All weight matrices are randomly initialized from the uniform distribution $U(-10^{-4}, 10^{-4})$ and all bias terms are set to zero. The $L_2$ regularization coefficient is set to $10^{-4}$ and the dropout keep rate is set to 0.2 [20]. The word embeddings are initialized with 300-dimensional GloVe vectors [16] and are fixed during training. For out-of-vocabulary words, we initialize the embeddings randomly from the uniform distribution $U(-0.01, 0.01)$. The dimension of the LSTM hidden states is set to 150. The initial learning rate is 0.01 for the Adam optimizer. If the training loss does not drop after every three epochs, we decrease the learning rate by half. The batch size is set to 25.
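For reference, these settings can be collected into a small training configuration. The sketch below is our own illustration with placeholder parameters; note that PyTorch's dropout takes a drop probability, so a keep rate of 0.2 corresponds to p = 0.8.

```python
import torch

# Hyperparameters from this section, gathered for convenience.
config = {
    "embedding_dim": 300,   # fixed GloVe vectors
    "hidden_dim": 150,      # LSTM hidden state size
    "dropout_keep": 0.2,    # nn.Dropout would use p = 1 - 0.2 = 0.8
    "l2_lambda": 1e-4,      # L2 regularization coefficient
    "learning_rate": 0.01,
    "batch_size": 25,
}

# Placeholder parameters standing in for the model's weight matrices.
params = [torch.nn.Parameter(torch.empty(10, 10))]
for p in params:
    # Weight matrices drawn from U(-1e-4, 1e-4); biases would be zeroed.
    torch.nn.init.uniform_(p, -1e-4, 1e-4)

optimizer = torch.optim.Adam(params, lr=config["learning_rate"])
# Halve the learning rate when the training loss stops dropping for
# three consecutive epochs (an approximation of the schedule above).
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.5, patience=3)
```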
Model Comparisons

We train and evaluate our model on the two SemEval datasets separately and use accuracy to measure performance. In order to further validate the performance of our model, we compare it with several baseline methods, listed as follows:
Majority is a basic baseline method, which assigns the most frequent sentiment polarity in the training set to each sample in the test set.
LSTM uses one LSTM network to model the sentence, and the last hidden state is used as the sentence representation for the final classification.
TD-LSTM uses two LSTM networks to model the preceding and following contexts surrounding the aspect term. The last hidden states of these two LSTM networks are concatenated for predicting the sentiment polarity [23].
AT-LSTM first models the sentence via an LSTM model. It then combines the hidden states from the LSTM with the aspect term embedding to generate the attention vector. The final sentence representation is the weighted sum of the hidden states [28].
ATAE-LSTM further extends AT-LSTM by appending the aspect embedding into each word vector [28].
IAN uses two LSTM networks to model the sentence and aspect term respectively. It uses the hidden states from the sentence to generate an attention vector for the target, and vice versa. Based on these two attention vectors, it outputs a sentence representation and a target representation for classification [11].
Methods          Restaurant      Laptop
Majority         0.650           0.535
LSTM             0.743           0.665
TD-LSTM [23]     0.756           0.681
AT-LSTM [28]     0.762           0.689
ATAE-LSTM [28]   0.772           0.687
IAN [11]         0.786           0.721
AOA-LSTM         (0.797 ± …)     (0.726 ± …)

Table 2. Comparison results. For our method, we run it 10 times and show "best (mean ± std)". Performance of baselines is cited from their original papers.

In our implementation, we found that the performance fluctuates with different random initializations, which is a well-known issue in training neural networks [21]. Hence, we ran our training algorithm 10 times and report the average accuracy as well as the best one we obtained in Table 2. All the baseline methods reported only a single best number in their papers. On average, our algorithm is better than these baseline methods, and our best trained model outperforms them by a large margin.
Case Study
In Table 3, we list five examples from the test set. To analyze which words contribute the most to the aspect sentiment polarity, we visualize the final sentence attention vectors $\gamma$ in Table 3. The color depth indicates the importance of a word in a sentence: the darker, the more important. In the first two examples, there are two aspects, "appetizers" and "service", in the sentence "the appetizers are ok, but the service is slow." We can observe that when there are two aspects in the sentence, our model can automatically point to the right sentiment-indicating words for each aspect. The same also happens in the third and fourth examples. In the last example, the aspect is a phrase, "boot time." From the sentence content "boot time is super fast, around anywhere from 35 seconds to 1 minute," the model learns that "time" is the most important word in the aspect, which further helps it find the sentiment-indicating part "super fast."

Aspect      Sentence                                                                 Ans./Pred.
appetizers  the appetizers are ok, but the service is slow.                          0/0
service     the appetizers are ok, but the service is slow.                          -1/-1
food        …                                                                        +1/+1
service     …                                                                        -1/-1
boot time   boot time is super fast, around anywhere from 35 seconds to 1 minute.    +1/+1
Table 3. Examples of final attention weights for sentences. The color depth denotes the importance degree of the weight in the attention vector $\gamma$.

Error Analysis
The first type of major error comes from non-compositional sentiment expressions, which also appeared in previous works [25]. For example, in the sentence "it took about 2 1/2 hours to be served our 2 courses," there is no direct sentiment expressed towards the aspect "served." The second type of error is caused by idioms used in the sentences. Examples include "the service was on point - what else you would expect from a ritz?" where "service" is the aspect word. In this case, our model cannot understand the sentiment expressed by the idiom "on point." The third factor is complex sentiment expressions like "i have never had a bad meal (or bad service) @ pigalle." Our model still misunderstands the meaning of such complex expressions, even though it can handle simple negation like "definitely not edible" in the sentence "when the dish arrived it was blazing with green chillis, definitely not edible by a human."
Conclusion

In this paper, we propose a neural network model for aspect level sentiment classification. Our model utilizes an attention-over-attention module to learn the important parts of the aspect and the sentence, which generates the final representation of the sentence. Experiments on the SemEval 2014 datasets show the superior performance of our model compared to the baseline methods. Our case study also shows that our model effectively learns the important parts of the sentence as well as of the target.

In our error analysis, we found cases that our model cannot handle well. One is complex sentiment expressions; one possible solution is to incorporate sentences' grammar structures into the classification model. Another type of error comes from uncommon idioms. In future work, we would like to explore how to combine prior language knowledge into such neural network models.