Joint Aspect and Polarity Classification for Aspect-based Sentiment Analysis with End-to-End Neural Networks
Martin Schmitt, Simon Steinheber, Konrad Schreiber, Benjamin Roth
Center for Information and Language Processing, LMU Munich, Germany
MaibornWolff GmbH, Munich, Germany
[email protected]
Abstract
In this work, we propose a new model for aspect-based sentiment analysis. In contrast to previous approaches, we jointly model the detection of aspects and the classification of their polarity in an end-to-end trainable neural network. We conduct experiments with different neural architectures and word representations on the recent GermEval 2017 dataset. We were able to show considerable performance gains by using the joint modeling approach in all settings compared to pipeline approaches. The combination of a convolutional neural network and fasttext embeddings outperformed the best submission of the shared task in 2017, establishing a new state of the art.

Introduction

Sentiment analysis (Pang and Lee, 2008) is the automatic detection of the sentiment expressed in a piece of text. Typically, this is modeled as a classification task with at least two classes (positive, negative), sometimes extended to three (neutral) or more fine-grained categories. Aspect-based sentiment analysis (ABSA) aims at a finer analysis, i.e. it requires that certain aspects of an entity in question be distinguished and the sentiment be classified with regard to each of them. An example can be seen in Figure 1.

This introduces several new challenges. First, labeled data, which are needed to train statistical models, are more difficult to obtain. Therefore the amount of available training data is limited, and a good model for ABSA has to make the best possible use of the available data. Second, the detection of the subset of aspects that occur in a given piece of text is non-trivial. Errors introduced at this stage severely limit the performance on the overall ABSA task.
Third, the general sentiment and the sentiment of each aspect can each be completely different from each other (cf. Figure 1). This means that a model has to be able to distinguish aspects in the text and make independent decisions for each of them.

German: Alle so "Yeah, Streik beendet" Bahn so "Okay, dafür werden dann natürlich die Tickets teurer" Alle so "Können wir wieder Streik haben?"
Translation: Everybody's like "Yeah, strike's over" Bahn goes "Okay, but therefore we're going to raise the prices" Everybody's like "Can we have the strike back?"
General sentiment: neutral
Aspect sentiment: Ticket purchase: negative; General: positive

Figure 1: Example sentence with contained aspects and their polarity.

We want to address each of these challenges by (1) leveraging unlabeled data by modeling word representations and (2) modeling aspect detection and classification of their polarity jointly in an end-to-end trainable system.

We evaluate our approach on the GermEval 2017 data, i.e. customer reviews about Deutsche Bahn AG on social media. We particularly address subtask C as the typical setting where two pieces of information have to be detected from raw text:
1. Which aspects are mentioned?
2. For each mentioned aspect, what is the polarity of its sentiment?

From the new state-of-the-art results we obtain, we conclude that modeling of word representations and joint modeling of aspects and polarity have not yet received the attention they deserve.

Related Work
Two recent shared tasks address ABSA: SemEval 2016 Task 5 (Pontiki et al., 2016) and GermEval 2017 (Wojatzki et al., 2017). The SemEval dataset is extremely small. The English laptop reviews, e.g., only contain 395 training instances for the prediction of 88 aspect categories and their polarities. Because of this sparsity, top-ranked systems rely on feature engineering and hand-crafted rules. GermEval is a larger dataset (~20K training instances, 20 aspect categories) and thus suited for our goal of evaluating the quality of fully automatic methods for learning aspect and polarity predictions. Furthermore, the top systems at SemEval 2016, XRCE (Brun et al., 2016) and IIT-TUDA (Kumar et al., 2016), not only rely heavily on feature engineering but also separate the tasks of aspect detection and aspect polarity classification into two different parts of their pipeline.

The winners of GermEval 2017 rely on neural methods (Lee et al., 2017). They try to link all aspects to a sequence of tokens and model the task as a sequence labeling problem. This leads to problems because some aspects are not assigned to any token but still have to be detected and classified. Our approach always considers the complete document and produces the set of all detected aspects at once. Although Lee et al. (2017) incorporate some aspects of multi-task learning, the prediction of aspect category and polarity remains separated in each of their approaches. In our work, we show that a joint learning of these two tasks achieves better performance. The approach by Lee et al. (2017) also relies more heavily on external sources than ours, while we only collected a corpus of tweets.

Word2vec skip-gram (Mikolov et al., 2013) is a widely used algorithm to obtain pretrained vector representations for input words. Notably, Lee et al. (2017) use it for their experiments on the GermEval data. FastText (Grave et al., 2017) works in a similar fashion but has the advantage of incorporating subword information in the embedding learning process. So it can not only learn similar embeddings for word forms sharing a common stem but also generate embeddings for unseen words in the test set by combining the learned character n-gram embeddings. This can be crucial when dealing with a morphologically rich language such as German.
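The subword mechanism behind this can be sketched in a few lines. The function name `char_ngrams` and the n-gram range 3 to 6 are illustrative assumptions (they match fastText's documented defaults, but are not taken from the paper):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with fastText-style word boundary markers '<' and '>'."""
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

# An unseen German compound such as "Zugverspätung" still shares n-grams
# (e.g. "<Zug", "ung>") with words observed during training, so an embedding
# for it can be composed from the learned n-gram vectors.
grams = char_ngrams("Zugverspätung")
```

This is why fastText can embed out-of-vocabulary inflections and compounds, whereas word2vec and glove must fall back to an unknown-word vector.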
Glove (Pennington et al., 2014), like word2vec, does not incorporate character-level information, but it uses global rather than local information to learn its word embeddings.

We have trained each of these embedding learning algorithms on a corpus of tweets addressed to @DB_Info and @DB_Bahn, two official accounts of Deutsche Bahn AG offering information and replying to questions. We collected these tweets specifically to build a document collection that is closely related to the domain of GermEval 2017. We also included the GermEval training set for the embedding training.

We compare our proposed approach to the model described in Ruder et al. (2016). They first encode each sentence with glove word embeddings and a bidirectional LSTM (Hochreiter and Schmidhuber, 1997). Then this output is concatenated with an embedding of the aspect addressed in the current sentence and finally fed into a document-level BiLSTM. As we are dealing with social media texts, our documents are already very short, so we do not split them into shorter units (sentences). Therefore the second hierarchy level of Ruder et al. (2016), which combines the output of consecutive sentences in a document, is superfluous and omitted in our experiments. In all other aspects, including hyperparameters, we follow Ruder et al. (2016), i.e. we duplicate a tweet for each aspect detected in it, concatenate an aspect embedding of size 15 to the output of the BiLSTM encoder, and use the same dropout rate.

We modify the pipeline model described in the last section as follows: the aspect detection is integrated into the neural network architecture, permitting an end-to-end optimization of the whole model during training. This is achieved by formatting the classifier output as a vector z with one four-valued entry per aspect a ∈ A, where A is the set of all 20 aspects (e.g., General, Ticket purchase, Design, Safety, ...). This corresponds to predicting one of the four classes N/A, positive, negative and neutral for each aspect.
Specifically, we obtain a hidden representation of an input document X in the following manner:

v = DO(BiLSTM(DO(embed(X))))    (1)

where embed ∈ {word2vec, glove, fasttext} and DO = dropout (Hinton et al., 2012). The design choices for the BiLSTM in this step remain the same as in the baseline model.

Then, we transform the feature vector v extracted from the text X to a score vector ŷ(a) for each aspect a ∈ A and apply softmax normalization:

ŷ(a) = softmax(W(a) v + b(a))    (2)

where

softmax(x)_i = exp(x_i) / Σ_k exp(x_k)    (3)

The predicted class for aspect a is

z(a) = argmax_i ŷ(a)_i    (4)

The loss is simply the cross entropy summed over all aspects:

L(θ) = Σ_{a ∈ A} H(y(a), ŷ(a))    (5)

with

H(y, ŷ) = − Σ_i y_i · log(ŷ_i)    (6)
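As a rough illustration of this per-aspect classification and loss, the following numpy sketch scores one document vector v against per-aspect weight matrices. All names, the aspect subset, the hidden size, and the random weights are hypothetical stand-ins for trained parameters, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

ASPECTS = ["General", "Ticket purchase", "Design", "Safety"]  # subset of the 20
CLASSES = ["N/A", "positive", "negative", "neutral"]
HIDDEN = 8  # stand-in for the encoder's feature size

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

# One weight matrix and bias per aspect a: W[a] of shape (4, HIDDEN), b[a] of shape (4,)
W = {a: rng.normal(size=(len(CLASSES), HIDDEN)) for a in ASPECTS}
b = {a: rng.normal(size=len(CLASSES)) for a in ASPECTS}

v = rng.normal(size=HIDDEN)  # document representation from the encoder

y_hat = {a: softmax(W[a] @ v + b[a]) for a in ASPECTS}  # score vector per aspect
z = {a: int(np.argmax(y_hat[a])) for a in ASPECTS}      # predicted class per aspect

def loss(y, y_hat):
    """Cross entropy summed over all aspects; y[a] is a one-hot gold label."""
    return sum(-np.dot(y[a], np.log(y_hat[a])) for a in ASPECTS)

gold = {a: np.eye(len(CLASSES))[0] for a in ASPECTS}  # e.g. all aspects labeled N/A
total_loss = loss(gold, y_hat)
```

Because every aspect has its own classifier head over the shared document vector, aspect detection (class N/A vs. the rest) and polarity classification are trained jointly through one loss.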
Figure 2: Schematic view of the end-to-end CNN architecture.
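The convolution-and-pooling stage shown in Figure 2 can be sketched with plain numpy. This is a sketch under stated assumptions (random filters, ReLU, the filter widths and counts named in the text), not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

EMB, N_FILTERS = 300, 300   # embedding size and filters per width, as in the text
FILTER_SIZES = (3, 4, 5)    # filter widths as in Kim (2014)

def conv_max_pool(X, filters):
    """X: (seq_len, EMB) token embeddings; filters[h]: (N_FILTERS, h * EMB).
    Applies each filter to every window of h tokens, then takes the maximum
    over time, returning one feature block per filter width, concatenated."""
    feats = []
    for h, F in filters.items():
        # all contiguous windows of h tokens, flattened to vectors
        windows = np.stack([X[i:i + h].ravel() for i in range(len(X) - h + 1)])
        conv = np.maximum(0.0, windows @ F.T)  # ReLU activation
        feats.append(conv.max(axis=0))         # max-over-time pooling
    return np.concatenate(feats)

filters = {h: rng.normal(scale=0.1, size=(N_FILTERS, h * EMB)) for h in FILTER_SIZES}
doc = rng.normal(size=(20, EMB))  # a hypothetical 20-token document
v = conv_max_pool(doc, filters)   # shape (900,) = 3 widths x 300 filters
```

The pooled vector v then feeds the same per-aspect softmax classifiers as in the BiLSTM variant; max-over-time pooling makes the representation length-independent.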
Keeping the formalization as an end-to-end task, we replace the BiLSTM by a convolutional neural network (CNN) as described in Kim (2014). As in their setting CNN-non-static, we use 300-dimensional word embeddings, a max-over-time pooling operation, filter sizes of 3, 4, and 5, and dropout with a rate of 0.5. We use ReLU (f(x) = max(0, x)) as our activation function, and 300 filters of each size, a number also found in related work on sentiment analysis (dos Santos and Gatti, 2014). Following Kim (2014), we do not apply dropout after the embedding layer:

v = DO(CNN(embed(X)))    (7)

With Equation 7 replacing Equation 1, the aspect-wise classification for the end-to-end CNN then follows the same definitions as described in the previous section.

In order to compare the effects of joint end-to-end and pipeline approaches across neural architectures, we also include an experiment where the CNN model from the previous section replaces the BiLSTM in the pipeline setting described above.

Experiments
We conduct our experiments on the GermEval 2017 data (Wojatzki et al., 2017), i.e. customer feedback about Deutsche Bahn AG on social media, microblogs, news, and Q&A sites. The data were collected over the time of one year and manually annotated, resulting in a main dataset of about 26K documents, divided into a training, development, and test set using a random split with 80% of the documents for training. A revised release of the training data corrected annotations with different polarity classes (which affects approximately 4% of the data); the development and test data remain the same.

We choose our hyperparameters based on the development data using the following procedure: we train initial models with a hyperparameter setting based on values we found in the literature, stochastic gradient descent with a learning rate of 0.01 (as in dos Santos and Gatti (2014)) and a mini-batch size of 10 (as in Ruder et al. (2016)). For the best-performing CNN and LSTM architectures (end-to-end + fasttext), we then refine the learning rate and batch size on the development data using random search over five candidate learning rates and three candidate batch sizes. For the CNN setting, this results in a learning rate of 0.03 and a batch size of 5 (which we then use for all CNN architectures in the final experiments). For the LSTM setting, the search likewise selects a refined learning rate and batch size.

Aspect polarity
Table 1 shows the results of our experiments, as well as the results of our strong baselines. Note that the majority class baseline already provides good results. This is due to highly unbalanced data; the aspect category "Allgemein" ("general"), e.g., constitutes 61.5% of the cases. This imbalance makes the task even more challenging.

Over all architectures, we observe a comparable or better performance when using fasttext embeddings instead of word2vec or glove. This backs our hypothesis that subword features are important for processing the morphologically rich German language. Leaving everything else unchanged, we can furthermore see an increase in performance for all settings when switching from the pipeline to an end-to-end approach. The best performance is achieved by a combination of CNN and fasttext embeddings, which outperforms the highly adapted winning system of the shared task.

Aspect category only
Even though our architectures are designed for the task of joint prediction of aspect category and polarity, we can also evaluate them on the detection of aspect categories only. Table 2 shows the results for this task. First of all, we can see that the SVM-based GermEval baseline model has very decent performance, as it is practically on par with the best submission on the synchronic test set and even outperforms the best submission on the diachronic test set. It is therefore well-suited to serve as input to the pipeline LSTM model we compare with in our main task.

Comparing our architectures, we see again that fasttext embeddings always lead to equal or better performance. And even though we do not directly optimize our models for this task only, our best model (CNN + fasttext) outperforms all baselines, as well as the GermEval winning system.

Impact of domain-specific corpus
We compare the domain-specific fasttext embeddings to fasttext embeddings trained on Wikipedia (downloaded from https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md), which is approximately 100 times the size of our domain-specific corpus. We report the results in Table 3. The embeddings trained on Wikipedia show slightly lower performance on the dev set but slightly higher or equal performance on the test sets. We conclude that the main positive impact of fasttext stems from its capability to model subword information and that a large domain-independent corpus or a small domain-specific corpus lead to similar performance gains.

                              development set   synchronic test set   diachronic test set
Pipeline LSTM + word2vec           .350               .297                  .342
End-to-end LSTM + word2vec         .378               .315                  .383
Pipeline CNN + word2vec            .350               .298                  .343
End-to-end CNN + word2vec          .400               .319                  .388
Pipeline LSTM + glove              .350               .297                  .342
End-to-end LSTM + glove            .378               .315                  .384
Pipeline CNN + glove               .350               .298                  .342
End-to-end CNN + glove             .415               .315                  .390
Pipeline LSTM + fasttext           .350               .297                  .342
End-to-end LSTM + fasttext         .378               .315                  .384
Pipeline CNN + fasttext            .342               .295                  .342
End-to-end CNN + fasttext          .511               .423                  .465
Majority class baseline             –                 .315                  .384
GermEval baseline                   –                 .322                  .389
GermEval best submission            –                 .354                  .401

Table 1: Results on the GermEval data, aspect + sentiment task. Micro-averaged F1-scores for both aspect category and aspect polarity classification as computed by the GermEval evaluation script. In the bottom part of the table, we report results from Wojatzki et al. (2017).

                              development set   synchronic test set   diachronic test set
End-to-end LSTM + word2vec         .517               .442                  .455
End-to-end CNN + word2vec          .521               .436                  .470
End-to-end LSTM + glove            .517               .442                  .456
End-to-end CNN + glove             .537               .457                  .480
End-to-end LSTM + fasttext         .517               .442                  .456
End-to-end CNN + fasttext          .623               .523                  .557
Majority class baseline             –                 .442                  .456
GermEval baseline                   –                 .481                  .495
GermEval best submission            –                 .482                  .460

Table 2: Micro-averaged F1-scores for the prediction of aspect categories only (i.e. without taking polarity into account at all) as computed by the GermEval evaluation script. The results in the bottom part of the table are taken from Wojatzki et al. (2017).

                   dev    synchr. test   diachr. test
aspect + sent.    .502        .423           .465
aspect only       .610        .544           .571

Table 3: Results of the end-to-end CNN model with fasttext embeddings trained on the German Wikipedia.

Conclusion

We have presented a new approach to ABSA. By solving the two classification problems (aspect categories + aspect polarity) inherent to ABSA in a joint manner, we observe significant performance gains for both of these tasks on the GermEval 2017 data. Our experiments also showed that word representations leveraging subword information are crucial for a challenging task like ABSA in a morphologically rich language such as German. Furthermore, we observed consistently better performance of CNN architectures in otherwise comparable scenarios, which suggests that CNNs cope better with the irregularities of user-written texts on social media, a research question we leave to future work. By establishing a new state of the art in aspect detection and polarity classification, we provide a new practical baseline for future research in this area.

Acknowledgments
We would like to thank the anonymous reviewers for their valuable input. This work was partially supported by the European Research Council, Advanced Grant.

References
Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual NLP. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 183–192, Sofia, Bulgaria. Association for Computational Linguistics.

Caroline Brun, Julien Perez, and Claude Roux. 2016. XRCE at SemEval-2016 Task 5: Feedbacked ensemble modeling on syntactico-semantic knowledge for aspect based sentiment analysis. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 277–281, San Diego, California.

Edouard Grave, Tomas Mikolov, Armand Joulin, and Piotr Bojanowski. 2017. Bag of tricks for efficient text classification. In EACL (2), pages 427–431. Association for Computational Linguistics.

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar.

Ayush Kumar, Sarah Kohail, Amit Kumar, Asif Ekbal, and Chris Biemann. 2016. IIT-TUDA at SemEval-2016 Task 5: Beyond sentiment lexicon: Combining domain dependency and distributional semantics features for aspect based sentiment analysis. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1129–1135, San Diego, California.

Ji-Ung Lee, Steffen Eger, Johannes Daxenberger, and Iryna Gurevych. 2017. UKP TU-DA at GermEval 2017: Deep learning for aspect based sentiment detection. In Proceedings of the GermEval 2017 Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, pages 22–29. German Society for Computational Linguistics.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Ion Androutsopoulos, Suresh Manandhar, Mohammad AL-Smadi, Mahmoud Al-Ayyoub, Yanyan Zhao, Bing Qin, Orphee De Clercq, Veronique Hoste, Marianna Apidianaki, Xavier Tannier, Natalia Loukachevitch, Evgeniy Kotelnikov, Núria Bel, Salud María Jiménez-Zafra, and Gülşen Eryiğit. 2016. SemEval-2016 Task 5: Aspect based sentiment analysis. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 19–30, San Diego, California. Association for Computational Linguistics.

Sebastian Ruder, Parsa Ghaffari, and John G. Breslin. 2016. A hierarchical model of reviews for aspect-based sentiment analysis. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 999–1005. Association for Computational Linguistics.

Cícero Nogueira dos Santos and Maíra Gatti. 2014. Deep convolutional neural networks for sentiment analysis of short texts. In COLING, pages 69–78. ACL.

Michael Wojatzki, Eugen Ruppert, Sarah Holschneider, Torsten Zesch, and Chris Biemann. 2017. GermEval 2017: Shared task on aspect-based sentiment in social media customer feedback. In Proceedings of the GermEval 2017 Shared Task on Aspect-based Sentiment in Social Media Customer Feedback.