Amobee at SemEval-2018 Task 1: GRU Neural Network with a CNN Attention Mechanism for Sentiment Classification
Alon Rozental∗, Daniel Fleischer∗
Amobee, Tel Aviv, Israel
[email protected]@amobee.com
Abstract
This paper describes the participation of Amobee in the shared sentiment analysis task at SemEval 2018. We participated in all the English sub-tasks and the Spanish valence tasks. Our system consists of three parts: training task-specific word embeddings, training a model consisting of gated-recurrent-units (GRU) with a convolutional neural network (CNN) attention mechanism, and training stacking-based ensembles for each of the sub-tasks. Our algorithm reached 3rd and 1st places in the valence ordinal classification sub-tasks in English and Spanish, respectively.
Introduction

Sentiment analysis is a collection of methods and algorithms used to infer and measure affection expressed by a writer. The main motivation is enabling computers to better understand human language, particularly the sentiment carried by the speaker. Among the popular sources of textual data for NLP is Twitter, a social network service where users communicate by posting short messages, no longer than 280 characters, called tweets. Tweets can carry sentimental information when talking about events, public figures, brands or products. Unique linguistic features, such as the use of slang, emojis, misspelling and sarcasm, make Twitter a challenging source for NLP research, attracting the interest of both academia and industry.

SemEval is a yearly event in which international teams of researchers work on tasks in a competition format, tackling open research questions in the field of semantic analysis. We participated in SemEval 2018 task 1, which focuses on sentiment and emotion evaluation in tweets. There were three main problems: identifying the presence of a given emotion in a tweet (sub-tasks EI-reg, EI-oc), identifying the general sentiment (valence) in a tweet (sub-tasks V-reg, V-oc) and identifying which emotions are expressed in a tweet (sub-task E-c). For a complete description of SemEval 2018 task 1, see the official task description (Mohammad et al., 2018).

We developed an architecture based on gated-recurrent-units (GRU, Cho et al. (2014)). We used a bi-directional GRU layer, together with a convolutional neural network (CNN) attention mechanism whose input is the hidden states of the GRU layer; lastly there were two fully connected layers. We will refer to this architecture as the Amobee sentiment classifier (ASC). We used ASC to train word embeddings to incorporate sentiment information and to classify sentiment using annotated tweets.

∗ These authors contributed equally to this work.
We participated in all the English sub-tasks and in the Spanish valence sub-tasks, achieving competitive results.

The paper is organized as follows: section 2 describes our data sources; section 3 describes the data pre-processing pipeline. A description of the main architecture is in section 4. Section 5 describes the word embeddings generation; section 6 describes the extraction of features. In section 7 we describe the performance of our models; finally, in section 8 we review and summarize the results.
Data

We used four sources of data:

1. Twitter Firehose: we randomly sampled 200 million tweets using the Twitter Firehose service. They were used for training word embeddings and for distant supervision learning.
2. SemEval 2017 task 4 datasets of tweets, annotated according to their general sentiment on 3 and 5 level scales; used to train the ASC model.
3. Annotated tweets from an external source (https://github.com/monkeylearn/sentiment-analysis-benchmark), annotated on a 3-level scale; used to train the ASC model.
4. Official SemEval 2018 task 1 datasets: used to train task-specific models.

The datasets of SemEval 2017 and the external source were combined with label compression, transforming 5 labels to 3: {−2, −1} → {−1}, {1, 2} → {1}, {0} → {0}. The resulting dataset contained 88,623 tweets with the following distribution: positive: 30,097 tweets (34%), neutral: 35,818 (40%), negative: 22,708 (26%). A description of the official SemEval 2018 task 1 datasets can be found in Mohammad et al. (2018); Mohammad and Kiritchenko (2018).

Pre-Processing

We started by defining a cleaning pipeline that produces two cleaned versions of an original text; we refer to them as the “simple” and “complex” versions. Both versions share the same initial cleaning steps:

1. Word tokenization using the CoreNLP library (Manning et al., 2014).
2. Part-of-speech (POS) tagging using the Tweet NLP tagger, trained on Twitter data (Owoputi et al., 2013).
3. Grouping similar emojis and replacing them with representative keywords.
4. Regex: replacing URLs with a special keyword, removing duplications, and replacing dates, numbers, brands, etc. with dedicated keywords.

The complex version applies two additional steps:

5. Synonym replacement, based on a manually-created dictionary.
6. Word replacement using a Wikipedia dictionary, created by crawling and extracting lists of places, brands and names.

As an example, table 1 shows a fictitious tweet and the results after the simple and complex cleaning stages.
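A minimal sketch of the simple cleaning stage described above. Only the tokens "twitter-entity" and "happy-smily" appear in table 1; the other replacement keywords and emoji groups here are illustrative assumptions.

```python
import re

# Hypothetical emoji/emoticon groups; the paper's full mapping is not public.
EMOJI_GROUPS = {":-)": "happy-smily", "😀": "happy-smily", "😢": "sad-smily"}

def simple_clean(text):
    """One plausible 'simple' cleaning pass: URLs, mentions, emojis, duplications."""
    text = re.sub(r"https?://\S+", "url-entity", text)   # URLs -> keyword (assumed name)
    text = re.sub(r"@\w+", "twitter-entity", text)       # @mentions -> keyword
    for emoji, keyword in EMOJI_GROUPS.items():          # group similar emojis
        text = text.replace(emoji, f" {keyword} ")
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)           # "soooo" -> "soo"
    return text.lower().split()
```

Tokenization here is a whitespace split for brevity; the actual system used the CoreNLP tokenizer.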
Amobee Sentiment Classifier

Our main contribution is an RNN network, based on GRU units with a CNN-based attention mechanism; we will refer to it as the Amobee sentiment classifier (ASC). It is comprised of four identical sub-models, which differ by the input data each of them receives. Sub-model inputs are composed of word embeddings and embeddings of the POS tags; see section 5 for a description of our embedding procedure. The words were embedded in 200- or 150-dimensional vector spaces and the POS tags were embedded in an 8-dimensional vector space. We pruned the tweets to 40 words, padding shorter sentences with a zero vector. The embeddings form the input layer.

Next we describe the sub-model architecture. The embeddings were fed to a bi-directional GRU layer of dimension 200. Inspired by the attention mechanism introduced in Bahdanau et al. (2014), we extracted the hidden states of the GRU layer; each state corresponds to a decoded word in the GRU as it reads each tweet word by word. The hidden states were arranged in a matrix of dimension 80 × 200 for each tweet (the bi-directionality of the GRU layer contributes a factor of 2). We fed the hidden states to a CNN layer, instead of a weighted sum as in the original paper. We used 6 filter sizes, with 100 filters for each size. After a max-pooling layer we concatenated all outputs, creating a 600-dimensional vector. Next was a fully connected layer of size 30 with tanh activation, and finally a fully connected layer of size 3 with a softmax activation function.

We defined 4 such sub-models with embedding inputs of the following settings: w2v-200, w2v-150, ft-200, ft-150 (ft = FastText, w2v = Word2Vec; see discussion in the next section). We combined the four sub-models by extracting their hidden d = 30 layers and concatenating them. Next we added a fully connected d = 25 layer with tanh activation and a final fully connected layer of size 3. See figure 1 for an illustration of the entire architecture.
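A minimal NumPy sketch of the CNN attention step: given the matrix of bi-GRU hidden states, convolutions of several widths are applied along the time axis, passed through tanh and max-pooled, then concatenated into one vector. The specific filter widths (1 through 6) and random weights are illustrative assumptions, not the trained model.

```python
import numpy as np

def cnn_attention(H, filter_sizes=(1, 2, 3, 4, 5, 6), n_filters=100, seed=0):
    """Max-pooled multi-width convolutions over GRU hidden states.

    H: (T, d) matrix of hidden states, e.g. 80 x 200 for a 40-word tweet
    read by a bi-directional GRU. Returns a 6 * 100 = 600-dim vector.
    """
    rng = np.random.default_rng(seed)
    T, d = H.shape
    pooled = []
    for k in filter_sizes:
        W = rng.standard_normal((k, d, n_filters)) * 0.01       # random conv kernel
        conv = np.stack([np.tensordot(H[t:t + k], W, axes=([0, 1], [0, 1]))
                         for t in range(T - k + 1)])            # valid 1-D convolution
        pooled.append(np.tanh(conv).max(axis=0))                # max-pool over time
    return np.concatenate(pooled)
```

In the actual system this block is a trained Keras layer; the sketch only shows the shape bookkeeping (six widths × 100 filters → a 600-dimensional vector).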
We used the AdaGrad optimizer (Duchi et al., 2011) and a cross-entropy loss function. We used the Keras library (Chollet et al., 2015) and the TensorFlow framework (Abadi et al., 2016).
Figure 1: Architecture of the ASC network. Each of the four sub-models on the right has the same structure as depicted in the central region.
Original:  @USAIRWAYS is right :-) ! Flying in September
Simple Cleaning:  twitter-entity is right happy-smily ! flying in september nice to fly
Complex Cleaning:  twitter-entity be right happy-smily ! fly in date pleasant to fly

Table 1: An example of tweet processing, producing two cleaned versions.
Word Embeddings

Word embedding is a family of techniques in which words are encoded as real-valued vectors of lower dimensionality. These word representations have been used successfully in sentiment analysis tasks in recent years. Among the popular algorithms are Word2Vec (Mikolov et al., 2013) and FastText (Bojanowski et al., 2016).

Word embeddings are useful representations of words and can uncover hidden relationships. However, one disadvantage they have is the typical lack of sentiment information. For example, the word vector for “good” can be very close to the word vector for “bad” in some trained, off-the-shelf word embeddings. Our goal was to train word embeddings based on Twitter data and then re-learn them so they contain emotion-specific sentiment.

We started with our 200 million tweets dataset; we cleaned the tweets using the pre-processing pipeline (described in section 3) and then trained generic embeddings using the Gensim package (Řehůřek and Sojka, 2010). We created four embeddings for the words and two embeddings for the POS tags: for each sentence we created a list of corresponding POS tags (there are 25 tags offered by the tagger we used); treating the tags as words, we trained d = 8 embeddings using the Word2Vec algorithm on the simple and complex cleaned datasets. The embedding parameters are specified in table 2.

Following Tang et al. (2014); Cliche (2017), who explored training word embeddings for sentiment classification, we employed a similar approach. We created distant supervision datasets: first, we manually compiled 4 lists of representative words, one for each emotion: anger, fear, joy and sadness; then, we built two datasets for each emotion: the first containing tweets with the representative words and the second without them. Each list contained about 40 words and each dataset contained roughly 2 million tweets. We used the ASC sub-model architecture (section 4) to train as follows: training for one epoch with embeddings set to be untrainable (fixed).
Then we trained for 6 epochs during which the embeddings could change. Overall we trained 16 word embeddings: 4 embedding configurations for each emotion. In addition, we decided to use the trained models' final hidden layer (d = 15) as a feature vector in the task-specific architectures; our motivation was using them as emotion and intensity classifiers via transfer learning.

Algorithm  Dimension  Dataset

Words:
Word2Vec  200  Simple
Word2Vec  150  Complex
FastText  200  Simple
FastText  150  Complex

Tags:
Word2Vec  200  Simple
Word2Vec  150  Complex
Table 2 : Parameters for the word and POS tag embeddings.
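The distant supervision split described above can be sketched as follows; the representative word lists here are short illustrative stand-ins for the paper's roughly 40-word lists, which are not given in the text.

```python
# Illustrative fragments of the representative word lists (assumptions).
EMOTION_WORDS = {
    "joy": {"happy", "delighted", "wonderful"},
    "anger": {"furious", "outraged", "annoyed"},
}

def build_distant_datasets(tweets, emotion):
    """Split tweets into (contains a representative word, does not)."""
    words = EMOTION_WORDS[emotion]
    positive = [t for t in tweets if words & set(t.lower().split())]
    negative = [t for t in tweets if not words & set(t.lower().split())]
    return positive, negative
```

Each pair of datasets then serves as weak labels for re-training the embeddings of one emotion.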
Features Description
In addition to our ASC models, we extracted semantic and syntactic features, based on domain knowledge:

• Number of magnifier and diminisher words, e.g. “incredibly”, “hardly”, in each tweet.
• Logarithm of the length of sentences.
• Occurrence of the “@” symbol.
• Fully capitalized words.
• Occurrence of special symbols.
• Predictions of external packages: Vader (part of the NLTK library, Hutto and Gilbert, 2014) and TextBlob (Loria et al., 2014).

Additionally, we compiled a list of 338 emojis and words in 16 categories of emotion, annotated with scores from the set {0.5, 1, 1.5, 2}. For each sentence, we summed up the scores in each category, up to a maximum value of 5, generating 16 features. The categories include: anger, disappointed, fear, hopeful, joy, lonely, love, negative, neutral, positive, sadness and surprise. Finally, we used the NRC Affect Intensity lexicon (Mohammad, 2017), containing 5,814 entries; each entry is a word with a score between 0 and 1 for a given emotion out of the following: anger, fear, joy and sadness. We used the lexicon to produce 4 emotion features from hashtags in the tweets; each feature contained the largest score of all the hashtags in the tweet. For a summary of all features used, see table 6 in the appendix.

Experiments

Our general workflow for the tasks is as follows: for each sub-task, we started by cleaning the datasets, obtaining two cleaned versions. We ran a pipeline that produced all the features we designed: the ASC predictions and the features described in section 6. We removed sparse features (fewer than 8 samples). Next, we defined a shallow neural network with a soft-voting ensemble. We chose the best features and meta-parameters, such as learning rate, batch size and number of epochs, based on the dev dataset. Finally, we generated predictions for the regression tasks. For the classification tasks, we used a grid search method on the regression predictions
Task Metric Score Ranking
V-oc-Spanish  Pearson  0.765  1/14
V-reg-Spanish  Pearson  0.770  2/14
V-oc  Pearson  0.813  3/37
EI-oc  Average Pearson  0.646  4/39
V-reg  Pearson  0.843  5/38
E-c  Jaccard  0.566  6/35
EI-reg  Average Pearson  0.721  13/48
Table 3: Summary of results.

to optimize the loss. Most model trainings were conducted on a local machine equipped with an Nvidia GTX 1080 Ti GPU. Our official results are summarized in table 3.
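The lexicon-based part of the feature-generation step can be sketched as follows; the word lists and affect-intensity entries here are hypothetical placeholders (the real system used manually compiled lists and the NRC Affect Intensity lexicon).

```python
import math

# Hypothetical word lists and lexicon entries, for illustration only.
MAGNIFIERS = {"incredibly", "extremely", "very"}
DIMINISHERS = {"hardly", "barely", "slightly"}
AFFECT_INTENSITY = {"#outrage": {"anger": 0.96}, "#blessed": {"joy": 0.83}}

def extract_features(tweet):
    tokens = tweet.split()
    lower = [t.lower() for t in tokens]
    hashtags = [t.lower() for t in tokens if t.startswith("#")]
    features = {
        "mag": sum(t in MAGNIFIERS for t in lower),             # magnifier count
        "dim": sum(t in DIMINISHERS for t in lower),            # diminisher count
        "log_length": math.log(len(tokens)),                    # log of length
        "caps": sum(t.isupper() and len(t) > 1 for t in tokens),  # ALL-CAPS words
        "at": int(any(t.startswith("@") for t in tokens)),      # "@" occurrence
    }
    # one feature per emotion: the largest lexicon score over all hashtags
    for emotion in ("anger", "fear", "joy", "sadness"):
        features[f"hash_{emotion}"] = max(
            (AFFECT_INTENSITY.get(h, {}).get(emotion, 0.0) for h in hashtags),
            default=0.0)
    return features
```

The resulting dictionary corresponds to one row of the task-specific feature matrix (see table 6 in the appendix for the full feature list).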
Valence Sub-Tasks

In the valence sub-tasks, we identified how intense a general sentiment (valence) is; the score is either on a continuous scale between 0 and 1 or classified into 7 ordinal classes {−3, −2, −1, 0, 1, 2, 3}, and is evaluated using the Pearson correlation coefficient.

We started with the regression task and defined the following model: first, we normalized the features to have zero mean and SD = 1. Then, we inserted 300 instances of fully connected layers of size 3, with a softmax activation and no bias term. For each copy, we applied the function f(x) = (x₃ − x₁)/2 + 0.5, where x₁, x₃ are the 1st and 3rd components of each hidden layer. Our aim was transforming the label predictions of the ASCs (trained on 3-label based sentiment annotation) into a regression score such that high certainty in either label (negative, neutral or positive) would produce scores close to 0, 0.5 or 1, respectively. Finally, we calculated the mean of all 300 predictions to get the final node; this is also known as a soft-voting ensemble. We used the Adam optimizer (Kingma and Ba, 2014) with default values, a mean-square-error loss function, a batch size of 400 and 65 epochs of training. For an illustration of the network, see figure 2. We experimented with the dev dataset, testing different subsets of the features. Finally we produced predictions for the regression sub-task V-reg.

We analyzed the relative contribution of each feature by measuring variable importance using the Pratt (1987) approach. We calculated a score d_i for each feature using the formula d_i = β̂_i ρ̂_i / R², where β̂_i denotes the sample estimate of the standardized regression coefficient of feature i, ρ̂_i the sample correlation between feature i and the response, and R² the coefficient of determination.

Figure 2: Architecture of the valence regression network: a d = 212 input, 300 copies of fully connected d = 3 layers followed by the transformation f, and a final mean.
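A NumPy sketch of the softmax-to-score transformation f and the soft-voting mean described above:

```python
import numpy as np

def to_valence_score(p):
    """Map a softmax triple (negative, neutral, positive) to [0, 1]."""
    p = np.asarray(p, dtype=float)
    return (p[..., 2] - p[..., 0]) / 2 + 0.5

def soft_voting(softmax_outputs):
    """Average the transformed scores over all ensemble copies."""
    return to_valence_score(softmax_outputs).mean()
```

Full certainty in the negative, neutral or positive label maps to 0, 0.5 and 1, respectively, which is the property the paper's f is designed to have.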
ASC anger  25  31.38%
ASC  25  18.92%
ASC fear  25  10.63%
ASC joy  25  8.13%
W2V-200 sadness  15  7.10%
W2V-200 fear  15  3.82%
ASC sadness  25  3.46%
W2V-200 joy  15  1.74%
Blob
Joy

Table 4: Relative contribution of features in the valence regression sub-task (feature, dimension, contribution).
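The Pratt importance scores reported in table 4 can be computed from a standardized linear fit; a minimal sketch with synthetic data (the variable names and data are illustrative):

```python
import numpy as np

def pratt_importance(X, y):
    """Pratt (1987) scores d_i = beta_i * rho_i / R^2 on standardized data."""
    X = (X - X.mean(0)) / X.std(0)
    y = (y - y.mean()) / y.std()
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # standardized coefficients
    rho = np.array([np.corrcoef(X[:, i], y)[0, 1] for i in range(X.shape[1])])
    r2 = beta @ rho                                # R^2 = sum_i beta_i * rho_i
    return beta * rho / r2                         # scores sum to 1

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
y = 2 * X[:, 0] + X[:, 1] + 0.1 * rng.standard_normal(500)
d = pratt_importance(X, y)
```

By construction the scores sum to 1, so each d_i reads directly as a fraction of the explained variance, as in table 4.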
EI-reg Anger Fear Joy Sadness
Features  204  274  150  181
Learning rate  −  −  −  ·−
Epochs  330  700  700  1000

Table 5: Summary of training parameters for the emotion intensity regression tasks.
Emotion Intensity Sub-Tasks

In the emotion intensity sub-tasks, we identified how intense a given emotion is in the given tweets. The four emotions were: anger, fear, joy and sadness; the score is either on a scale between 0 and 1 or classified into 4 ordinal classes {0, 1, 2, 3}. Performance was evaluated using the Pearson correlation coefficient. Our approach was similar to the valence tasks: first we generated features, then we used the same architecture as in the valence sub-tasks, depicted in figure 2. However, in these sub-tasks we used the emotion-specific embeddings for each emotion sub-task. We generated regression predictions and submitted them as the EI-reg sub-tasks; finally we carried out a grid search for the best partition, maximizing the Pearson correlation, and submitted the class predictions as sub-tasks EI-oc. For a summary of the training parameters used in the regression sub-tasks, see table 5.

Our system performed as follows: in the regression tasks, the scores were 0.748, 0.670, 0.748, 0.721 for anger, fear, joy and sadness, respectively, with a macro-average of 0.721. In the classification tasks, the scores were 0.667, 0.536, 0.705, 0.673 for anger, fear, joy and sadness, respectively, with a macro-average of 0.646.

Multi-Label Emotion Classification

In the multi-label classification sub-task, we had to label tweets with respect to 11 emotions: anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise and trust. The score was evaluated using the Jaccard similarity coefficient. We started with the same cleaning and feature-generation pipelines as before, creating an input layer of size 217. We added a fully connected layer of size 100 with tanh activation. Next there were 300 instances of fully connected layers of size 11 with a sigmoid activation function. We calculated the mean of all d = 11 vectors, producing the final d = 11 vector. For an illustration, see figure 4.
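The grid search over ordinal partitions used for the EI-oc and V-oc sub-tasks can be sketched as follows: search threshold tuples from a candidate grid and keep the one whose induced classes best correlate (Pearson) with the gold labels. The candidate grid and class count here are illustrative assumptions.

```python
import itertools
import numpy as np

def best_partition(pred, gold, n_classes=4, grid=None):
    """Grid search for thresholds on regression scores maximizing Pearson r."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)          # candidate cut points
    best_corr, best_cuts = -1.0, None
    for cuts in itertools.combinations(grid, n_classes - 1):
        classes = np.digitize(pred, cuts)           # scores -> classes 0..n-1
        if classes.min() == classes.max():          # constant vector: r undefined
            continue
        corr = np.corrcoef(classes, gold)[0, 1]
        if corr > best_corr:
            best_corr, best_cuts = corr, cuts
    return best_corr, best_cuts

# noisy synthetic regression predictions around 4 gold classes
rng = np.random.default_rng(0)
gold = rng.integers(0, 4, size=200)
pred = gold / 3 + rng.normal(0, 0.05, size=200)
corr, cuts = best_partition(pred, gold)
```

The search is exhaustive over the grid (here C(19, 3) = 969 candidate partitions), which is cheap on a dev set of this size.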
Figure 3: Relative contribution of features in the valence regression sub-task.

Figure 4: Architecture of the multi-label sub-task E-c: a d = 217 input, a fully connected d = 100 layer, 300 copies of d = 11 layers, and a final mean.

We used the following loss function, based on the Tanimoto distance:

L(y, ỹ) = 1 − (y · ỹ) / (‖y‖² + ‖ỹ‖² − y · ỹ + ε),

where ‖·‖ is the L₂ norm and ε is a small constant used for numerical stability. We trained with a batch size of 10, for 40 epochs, with Adam optimization with default parameters. Our final score was 0.566.

Spanish Valence Sub-Tasks

We participated in the Spanish valence tasks to examine the current state of neural machine translation (NMT) algorithms. We used the Google Cloud Translation API to translate the training, development and test datasets for the two valence tasks from Spanish to English. We then treated the tasks the same way as the English valence tasks, using the same cleaning and feature extraction pipelines and the same architecture described in section 7.1 to generate regression and classification predictions. We reached 1st and 2nd places in the classification and regression sub-tasks, with scores of 0.765 and 0.770, respectively.
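A NumPy sketch of the Tanimoto-based loss above; the value of ε here is an arbitrary illustrative choice.

```python
import numpy as np

def tanimoto_loss(y, y_pred, eps=1e-6):
    """1 minus the Tanimoto similarity between a label vector and a prediction."""
    y, y_pred = np.asarray(y, float), np.asarray(y_pred, float)
    dot = y @ y_pred
    return 1.0 - dot / (y @ y + y_pred @ y_pred - dot + eps)
```

The loss is 0 for a perfect multi-label prediction and 1 for fully disjoint label vectors, matching the Jaccard-style evaluation of the E-c sub-task.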
Summary

In this paper we described the system developed to participate in the SemEval 2018 task 1 workshop. We reached 3rd place in the valence ordinal classification sub-task and 5th place in the valence regression sub-task. In the Spanish valence tasks, we reached 1st and 2nd places in the classification and regression sub-tasks, respectively. In the emotion intensity sub-tasks we reached 4th and 13th places in the classification and regression sub-tasks, respectively.

Summarizing the methods used: training of word embeddings based on a Twitter corpus (200M tweets); developing and using the Amobee sentiment classifier (ASC) architecture, a bi-directional GRU layer with a CNN-based attention mechanism and an additional hidden layer, used to adjust the embeddings to include emotional context; and finally a shallow feed-forward NN with a stack-based ensemble of the final hidden layers from all previous classifiers we trained. This form of transfer learning proved to be important, as the hidden-layer features made a significant contribution to minimizing the loss.

Overall, we had better performance in the valence tasks, both in English and Spanish. We posit this is due to the fact that our annotated supervised training dataset (non task-specific) was based on SemEval 2017 task 4, which focused on valence classification. In addition, the annotations in SemEval 2017 were label-based, lending themselves more easily to the ordinal classification tasks. In the Spanish tasks, we used external translation (Google API) and achieved good results without the use of Spanish-specific features.
Acknowledgment
We thank Zohar Kelrich for assisting in translating the Spanish datasets to English.

References
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, abs/1409.1259.

François Chollet et al. 2015. Keras. https://github.com/keras-team/keras.

Mathieu Cliche. 2017. BB_twtr at SemEval-2017 task 4: Twitter sentiment analysis with CNNs and LSTMs. arXiv preprint arXiv:1704.06125.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.

C.J. Hutto and E.E. Gilbert. 2014. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Eighth International Conference on Weblogs and Social Media (ICWSM-14), Ann Arbor, MI.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Steven Loria, P. Keen, M. Honnibal, R. Yankovsky, D. Karesh, E. Dempsey, et al. 2014. TextBlob: Simplified text processing.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Saif M. Mohammad. 2017. Word affect intensities. arXiv preprint arXiv:1704.08798.

Saif M. Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. 2018. SemEval-2018 Task 1: Affect in tweets. In Proceedings of the International Workshop on Semantic Evaluation (SemEval-2018), New Orleans, LA, USA.

Saif M. Mohammad and Svetlana Kiritchenko. 2018. Understanding emotions: A dataset of tweets to study interactions between affect categories. In Proceedings of the 11th Edition of the Language Resources and Evaluation Conference, Miyazaki, Japan.

Olutobi Owoputi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. 2013. Improved part-of-speech tagging for online conversational text with word clusters. Association for Computational Linguistics.

John W. Pratt. 1987. Dividing the indivisible: Using simple symmetry to partition variance explained. In Proceedings of the Second International Tampere Conference in Statistics, 1987, pages 245–260. Department of Mathematical Sciences, University of Tampere.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA. http://is.muni.cz/publication/884893/en.

Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. 2014. Learning sentiment-specific word embedding for Twitter sentiment classification. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1555–1565.

D. Roland Thomas, Edward Hughes, and Bruno D. Zumbo. 1998. On variable importance in linear regression. Social Indicators Research, 45(1-3):253–275.
Features List
Name  Description  Dim.

ASC  ASC model hidden layer.  25
ASC x {anger, fear, joy, sadness}  Emotion-specific ASC hidden layers.  25 × 4
at  “@” symbol in tweet.  1
blob  TextBlob sentiment library.  1
caps  Occurrence of all-capitalized words.  1
dim  Diminisher words.  1
{ft, w2v} x {200, 150} x {anger, fear, joy, sadness}  Hidden layers of models used to re-train the embeddings.  15 × 16
hash {anger, fear, joy, sadness}  Affect lexicon of hashtags.  4
irony  Occurrence of

Table 6: List of features used as inputs for the task-specific models.