Amobee at SemEval-2018 Task 1: GRU Neural Network with a CNN Attention Mechanism for Sentiment Classification
Alon Rozental∗, Daniel Fleischer∗
Amobee, Tel Aviv, Israel
[email protected]@amobee.com
Abstract
This paper describes the participation of Amobee in the shared sentiment analysis task at SemEval 2018. We participated in all the English sub-tasks and the Spanish valence tasks. Our system consists of three parts: training task-specific word embeddings, training a model consisting of gated-recurrent-units (GRU) with a convolutional neural network (CNN) attention mechanism, and training stacking-based ensembles for each of the sub-tasks. Our algorithm reached 3rd and 1st places in the valence ordinal classification sub-tasks in English and Spanish, respectively.
Introduction

Sentiment analysis is a collection of methods and algorithms used to infer and measure affection expressed by a writer. The main motivation is enabling computers to better understand human language, particularly the sentiment carried by the speaker. Among the popular sources of textual data for NLP is Twitter, a social network service where users communicate by posting short messages, no longer than 280 characters, called tweets. Tweets can carry sentimental information when talking about events, public figures, brands or products. Unique linguistic features, such as the use of slang, emojis, misspelling and sarcasm, make Twitter a challenging source for NLP research, attracting the interest of both academia and industry.

SemEval is a yearly event in which international teams of researchers work on tasks in a competition format, tackling open research questions in the field of semantic analysis. We participated in SemEval 2018 task 1, which focuses on sentiment and emotion evaluation in tweets. There were three main problems: identifying the presence of a given emotion in a tweet (sub-tasks EI-reg, EI-oc), identifying the general sentiment (valence) in a tweet (sub-tasks V-reg, V-oc) and identifying which emotions are expressed in a tweet (sub-task E-c). For a complete description of SemEval 2018 task 1, see the official task description (Mohammad et al., 2018).

We developed an architecture based on gated-recurrent-units (GRU, Cho et al. (2014)). We used a bi-directional GRU layer, together with a convolutional neural network (CNN) attention mechanism whose input is the hidden states of the GRU layer; lastly there were two fully connected layers. We will refer to this architecture as the Amobee sentiment classifier (ASC). We used ASC to train word embeddings to incorporate sentiment information and to classify sentiment using annotated tweets.

∗ These authors contributed equally to this work.
We participated in all the English sub-tasks and in the Spanish valence sub-tasks, achieving competitive results.

The paper is organized as follows: section 2 describes our data sources; section 3 describes the data pre-processing pipeline. A description of the main architecture is in section 4. Section 5 describes the word embeddings generation; section 6 describes the extraction of features. In section 7 we describe the performance of our models; finally, in section 8 we review and summarize the results.
Data

We used four sources of data:

1. Twitter Firehose: we randomly sampled 200 million tweets using the Twitter Firehose service. They were used for training word embeddings and for distant supervision learning.
2. SemEval 2017 task 4 datasets of tweets, annotated according to their general sentiment on 3 and 5 level scales; used to train the ASC model.
3. Annotated tweets from an external source (https://github.com/monkeylearn/sentiment-analysis-benchmark), annotated on a 3-level scale; used to train the ASC model.
4. Official SemEval 2018 task 1 datasets: used to train task-specific models.

The datasets of SemEval 2017 and the external source were combined with label compression, transforming 5 labels to 3: {−2, −1} → {−1}, {1, 2} → {1}, {0} → {0}. The resulting dataset contained 88,623 tweets with the following distribution: positive: 30,097 tweets (34%), neutral: 35,818 (40%), negative: 22,708 (26%). A description of the official SemEval 2018 task 1 datasets can be found in Mohammad et al. (2018); Mohammad and Kiritchenko (2018).

Pre-Processing

We started by defining a cleaning pipeline that produces two cleaned versions of an original text; we refer to them as the “simple” and “complex” versions. Both versions share the same initial cleaning steps:

1. Word tokenization using the CoreNLP library (Manning et al., 2014).
2. Part-of-speech (POS) tagging using the Tweet NLP tagger, trained on Twitter data (Owoputi et al., 2013).
3. Grouping similar emojis and replacing them with representative keywords.
4. Regex: replacing URLs with a special keyword, removing duplications, and replacing dates, numbers, brands, etc. with dedicated keywords.

The complex version applies two additional steps:

5. Synonym replacement, based on a manually-created dictionary.
6. Word replacement using a Wikipedia dictionary, created by crawling and extracting lists of places, brands and names.

As an example, table 1 shows a fictitious tweet and the results after the simple and complex cleaning stages.
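A minimal sketch of the simple cleaning stage described above. Only the tokens "twitter-entity" and "happy-smily" appear in table 1; the other replacement keywords and emoji groups here are illustrative assumptions.

```python
import re

# Hypothetical emoji/emoticon groups; the paper's full mapping is not public.
EMOJI_GROUPS = {":-)": "happy-smily", "😀": "happy-smily", "😢": "sad-smily"}

def simple_clean(text):
    """One plausible 'simple' cleaning pass: URLs, mentions, emojis, duplications."""
    text = re.sub(r"https?://\S+", "url-entity", text)   # URLs -> keyword (assumed name)
    text = re.sub(r"@\w+", "twitter-entity", text)       # @mentions -> keyword
    for emoji, keyword in EMOJI_GROUPS.items():          # group similar emojis
        text = text.replace(emoji, f" {keyword} ")
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)           # "soooo" -> "soo"
    return text.lower().split()
```

Tokenization here is a whitespace split for brevity; the actual system used the CoreNLP tokenizer.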
Amobee Sentiment Classifier

Our main contribution is an RNN network, based on GRU units with a CNN-based attention mechanism; we will refer to it as the Amobee sentiment classifier (ASC). It is comprised of four identical sub-models, which differ by the input data each of them receives. Sub-model inputs are composed of word embeddings and embeddings of the POS tags; see section 5 for a description of our embedding procedure. The words were embedded in 200- or 150-dimensional vector spaces and the POS tags were embedded in an 8-dimensional vector space. We pruned the tweets to 40 words, padding shorter sentences with a zero vector. The embeddings form the input layer.

Next we describe the sub-model architecture. The embeddings were fed to a bi-directional GRU layer of dimension 200. Inspired by the attention mechanism introduced in Bahdanau et al. (2014), we extracted the hidden states of the GRU layer; each state corresponds to a decoded word in the GRU as it reads each tweet word by word. The hidden states were arranged in a matrix of dimension 80 × 200 for each tweet (the bi-directionality of the GRU layer contributes a factor of 2). We fed the hidden states to a CNN layer, instead of a weighted sum as in the original paper. We used 6 filter sizes, with 100 filters for each size. After a max-pooling layer we concatenated all outputs, creating a 600-dimensional vector. Next was a fully connected layer of size 30 with tanh activation, and finally a fully connected layer of size 3 with a softmax activation function.

We defined 4 such sub-models with embedding inputs of the following settings: w2v-200, w2v-150, ft-200, ft-150 (ft = FastText, w2v = Word2Vec; see discussion in the next section). We combined the four sub-models by extracting their hidden d = 30 layers and concatenating them. Next we added a fully connected d = 25 layer with tanh activation and a final fully connected layer of size 3. See figure 1 for an illustration of the entire architecture.
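A minimal NumPy sketch of the CNN attention step: given the matrix of bi-GRU hidden states, convolutions of several widths are applied along the time axis, passed through tanh and max-pooled, then concatenated into one vector. The specific filter widths (1 through 6) and random weights are illustrative assumptions, not the trained model.

```python
import numpy as np

def cnn_attention(H, filter_sizes=(1, 2, 3, 4, 5, 6), n_filters=100, seed=0):
    """Max-pooled multi-width convolutions over GRU hidden states.

    H: (T, d) matrix of hidden states, e.g. 80 x 200 for a 40-word tweet
    read by a bi-directional GRU. Returns a 6 * 100 = 600-dim vector.
    """
    rng = np.random.default_rng(seed)
    T, d = H.shape
    pooled = []
    for k in filter_sizes:
        W = rng.standard_normal((k, d, n_filters)) * 0.01       # random conv kernel
        conv = np.stack([np.tensordot(H[t:t + k], W, axes=([0, 1], [0, 1]))
                         for t in range(T - k + 1)])            # valid 1-D convolution
        pooled.append(np.tanh(conv).max(axis=0))                # max-pool over time
    return np.concatenate(pooled)
```

In the actual system this block is a trained Keras layer; the sketch only shows the shape bookkeeping (six widths × 100 filters → a 600-dimensional vector).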
We used the AdaGrad optimizer (Duchi et al., 2011) and a cross-entropy loss function. We used the Keras library (Chollet et al., 2015) and the TensorFlow framework (Abadi et al., 2016).
Figure 1: Architecture of the ASC network. Each of the four sub-models on the right has the same structure as depicted in the central region.
Original:  @USAIRWAYS is right :-) ! Flying in September
Simple Cleaning:  twitter-entity is right happy-smily ! flying in september nice to fly
Complex Cleaning:  twitter-entity be right happy-smily ! fly in date pleasant to fly

Table 1: An example of tweet processing, producing two cleaned versions.
Word Embeddings

Word embedding is a family of techniques in which words are encoded as real-valued vectors of lower dimensionality. These word representations have been used successfully in sentiment analysis tasks in recent years. Among the popular algorithms are Word2Vec (Mikolov et al., 2013) and FastText (Bojanowski et al., 2016).

Word embeddings are useful representations of words and can uncover hidden relationships. However, one disadvantage they have is the typical lack of sentiment information. For example, the word vector for “good” can be very close to the word vector for “bad” in some trained, off-the-shelf word embeddings. Our goal was to train word embeddings based on Twitter data and then re-learn them so they contain emotion-specific sentiment.

We started with our 200 million tweets dataset; we cleaned the tweets using the pre-processing pipeline (described in section 3) and then trained generic embeddings using the Gensim package (Řehůřek and Sojka, 2010). We created four embeddings for the words and two embeddings for the POS tags: for each sentence we created a list of corresponding POS tags (there are 25 tags offered by the tagger we used); treating the tags as words, we trained d = 8 embeddings using the Word2Vec algorithm on the simple and complex cleaned datasets. The embedding parameters are specified in table 2.

Following Tang et al. (2014); Cliche (2017), who explored training word embeddings for sentiment classification, we employed a similar approach. We created distant supervision datasets: first, we manually compiled 4 lists of representative words, one for each emotion: anger, fear, joy and sadness; then, we built two datasets for each emotion: the first containing tweets with the representative words and the second without them. Each list contained about 40 words and each dataset contained roughly 2 million tweets. We used the ASC sub-model architecture (section 4) to train as follows: training for one epoch with embeddings set to be untrainable (fixed).
Then we trained for 6 epochs during which the embeddings could change. Overall we trained 16 word embeddings: 4 embedding configurations for each emotion. In addition, we decided to use the trained models' final hidden layer (d = 15) as a feature vector in the task-specific architectures; our motivation was using them as emotion and intensity classifiers via transfer learning.

Algorithm  Dimension  Dataset

Words:
Word2Vec  200  Simple
Word2Vec  150  Complex
FastText  200  Simple
FastText  150  Complex

Tags:
Word2Vec  200  Simple
Word2Vec  150  Complex
Table 2 : Parameters for the word and POS tag embeddings.
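The distant supervision split described above can be sketched as follows; the representative word lists here are short illustrative stand-ins for the paper's roughly 40-word lists, which are not given in the text.

```python
# Illustrative fragments of the representative word lists (assumptions).
EMOTION_WORDS = {
    "joy": {"happy", "delighted", "wonderful"},
    "anger": {"furious", "outraged", "annoyed"},
}

def build_distant_datasets(tweets, emotion):
    """Split tweets into (contains a representative word, does not)."""
    words = EMOTION_WORDS[emotion]
    positive = [t for t in tweets if words & set(t.lower().split())]
    negative = [t for t in tweets if not words & set(t.lower().split())]
    return positive, negative
```

Each pair of datasets then serves as weak labels for re-training the embeddings of one emotion.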
Features Description
In addition to our ASC models, we extracted semantic and syntactic features, based on domain knowledge:

• Number of magnifier and diminisher words, e.g. “incredibly”, “hardly”, in each tweet.
• Logarithm of the length of sentences.
• Occurrence of the “@” symbol.
• Fully capitalized words.
• Occurrence of special symbols.
• Predictions of external packages: Vader (part of the NLTK library, Hutto and Gilbert, 2014) and TextBlob (Loria et al., 2014).

Additionally, we compiled a list of 338 emojis and words in 16 categories of emotion, annotated with scores from the set {0.5, 1, 1.5, 2}. For each sentence, we summed up the scores in each category, up to a maximum value of 5, generating 16 features. The categories include: anger, disappointed, fear, hopeful, joy, lonely, love, negative, neutral, positive, sadness and surprise. Finally, we used the NRC Affect Intensity lexicon (Mohammad, 2017), containing 5,814 entries; each entry is a word with a score between 0 and 1 for a given emotion out of the following: anger, fear, joy and sadness. We used the lexicon to produce 4 emotion features from hashtags in the tweets; each feature contained the largest score of all the hashtags in the tweet. For a summary of all features used, see table 6 in the appendix.

Experiments

Our general workflow for the tasks is as follows: for each sub-task, we started by cleaning the datasets, obtaining two cleaned versions. We ran a pipeline that produced all the features we designed: the ASC predictions and the features described in section 6. We removed sparse features (fewer than 8 samples). Next, we defined a shallow neural network with a soft-voting ensemble. We chose the best features and meta-parameters, such as learning rate, batch size and number of epochs, based on the dev dataset. Finally, we generated predictions for the regression tasks. For the classification tasks, we used a grid search method on the regression predictions
Task Metric Score Ranking
V-oc-Spanish  Pearson  0.765  1/14
V-reg-Spanish  Pearson  0.770  2/14
V-oc  Pearson  0.813  3/37
EI-oc  Average Pearson  0.646  4/39
V-reg  Pearson  0.843  5/38
E-c  Jaccard  0.566  6/35
EI-reg  Average Pearson  0.721  13/48
Table 3: Summary of results.

to optimize the loss. Most model trainings were conducted on a local machine equipped with an Nvidia GTX 1080 Ti GPU. Our official results are summarized in table 3.
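The lexicon-based part of the feature-generation step can be sketched as follows; the word lists and affect-intensity entries here are hypothetical placeholders (the real system used manually compiled lists and the NRC Affect Intensity lexicon).

```python
import math

# Hypothetical word lists and lexicon entries, for illustration only.
MAGNIFIERS = {"incredibly", "extremely", "very"}
DIMINISHERS = {"hardly", "barely", "slightly"}
AFFECT_INTENSITY = {"#outrage": {"anger": 0.96}, "#blessed": {"joy": 0.83}}

def extract_features(tweet):
    tokens = tweet.split()
    lower = [t.lower() for t in tokens]
    hashtags = [t.lower() for t in tokens if t.startswith("#")]
    features = {
        "mag": sum(t in MAGNIFIERS for t in lower),             # magnifier count
        "dim": sum(t in DIMINISHERS for t in lower),            # diminisher count
        "log_length": math.log(len(tokens)),                    # log of length
        "caps": sum(t.isupper() and len(t) > 1 for t in tokens),  # ALL-CAPS words
        "at": int(any(t.startswith("@") for t in tokens)),      # "@" occurrence
    }
    # one feature per emotion: the largest lexicon score over all hashtags
    for emotion in ("anger", "fear", "joy", "sadness"):
        features[f"hash_{emotion}"] = max(
            (AFFECT_INTENSITY.get(h, {}).get(emotion, 0.0) for h in hashtags),
            default=0.0)
    return features
```

The resulting dictionary corresponds to one row of the task-specific feature matrix (see table 6 in the appendix for the full feature list).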
Valence Sub-Tasks

In the valence sub-tasks, we identified how intense a general sentiment (valence) is; the score is either on a continuous scale between 0 and 1 or classified into 7 ordinal classes {−3, −2, −1, 0, 1, 2, 3}, and is evaluated using the Pearson correlation coefficient.

We started with the regression task and defined the following model: first, we normalized the features to have zero mean and SD = 1. Then, we inserted 300 instances of fully connected layers of size 3, with a softmax activation and no bias term. For each copy, we applied the function f(x) = (x₃ − x₁)/2 + 0.5, where x₁, x₃ are the 1st and 3rd components of each hidden layer. Our aim was transforming the label predictions of the ASCs (trained on 3-label based sentiment annotation) into a regression score such that high certainty in either label (negative, neutral or positive) would produce scores close to 0, 0.5 or 1, respectively. Finally, we calculated the mean of all 300 predictions to get the final node; this is also known as a soft-voting ensemble. We used the Adam optimizer (Kingma and Ba, 2014) with default values, a mean-square-error loss function, a batch size of 400 and 65 epochs of training. For an illustration of the network, see figure 2. We experimented with the dev dataset, testing different subsets of the features. Finally we produced predictions for the regression sub-task V-reg.

We analyzed the relative contribution of each feature by measuring variable importance using the Pratt (1987) approach. We calculated a score d_i for each feature using the formula d_i = β̂_i ρ̂_i / R², where β̂_i denotes the sample estimate of the standardized regression coefficient of feature i, ρ̂_i the sample correlation between feature i and the response, and R² the coefficient of determination.

Figure 2: Architecture of the valence regression network: a d = 212 input, 300 copies of fully connected d = 3 layers followed by the transformation f, and a final mean.
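A NumPy sketch of the softmax-to-score transformation f and the soft-voting mean described above:

```python
import numpy as np

def to_valence_score(p):
    """Map a softmax triple (negative, neutral, positive) to [0, 1]."""
    p = np.asarray(p, dtype=float)
    return (p[..., 2] - p[..., 0]) / 2 + 0.5

def soft_voting(softmax_outputs):
    """Average the transformed scores over all ensemble copies."""
    return to_valence_score(softmax_outputs).mean()
```

Full certainty in the negative, neutral or positive label maps to 0, 0.5 and 1, respectively, which is the property the paper's f is designed to have.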
ASC anger  25  31.38%
ASC  25  18.92%
ASC fear  25  10.63%
ASC joy  25  8.13%
W2V-200 sadness  15  7.10%
W2V-200 fear  15  3.82%
ASC sadness  25  3.46%
W2V-200 joy  15  1.74%
Blob
Joy

Table 4: Relative contribution of features in the valence regression sub-task (feature, dimension, contribution).
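The Pratt importance scores reported in table 4 can be computed from a standardized linear fit; a minimal sketch with synthetic data (the variable names and data are illustrative):

```python
import numpy as np

def pratt_importance(X, y):
    """Pratt (1987) scores d_i = beta_i * rho_i / R^2 on standardized data."""
    X = (X - X.mean(0)) / X.std(0)
    y = (y - y.mean()) / y.std()
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # standardized coefficients
    rho = np.array([np.corrcoef(X[:, i], y)[0, 1] for i in range(X.shape[1])])
    r2 = beta @ rho                                # R^2 = sum_i beta_i * rho_i
    return beta * rho / r2                         # scores sum to 1

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
y = 2 * X[:, 0] + X[:, 1] + 0.1 * rng.standard_normal(500)
d = pratt_importance(X, y)
```

By construction the scores sum to 1, so each d_i reads directly as a fraction of the explained variance, as in table 4.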
EI-reg Anger Fear Joy Sadness
Features  204  274  150  181
Learning rate  −  −  −  ·−
Epochs  330  700  700  1000

Table 5: Summary of training parameters for the emotion intensity regression tasks.
Emotion Intensity Sub-Tasks

In the emotion intensity sub-tasks, we identified how intense a given emotion is in the given tweets. The four emotions were: anger, fear, joy and sadness; the score is either on a scale between 0 and 1 or classified into 4 ordinal classes {0, 1, 2, 3}. Performance was evaluated using the Pearson correlation coefficient. Our approach was similar to the valence tasks: first we generated features, then we used the same architecture as in the valence sub-tasks, depicted in figure 2. However, in these sub-tasks we used the emotion-specific embeddings for each emotion sub-task. We generated regression predictions and submitted them as the EI-reg sub-tasks; finally we carried out a grid search for the best partition, maximizing the Pearson correlation, and submitted the class predictions as sub-tasks EI-oc. For a summary of the training parameters used in the regression sub-tasks, see table 5.

Our system performed as follows: in the regression tasks, the scores were 0.748, 0.670, 0.748, 0.721 for anger, fear, joy and sadness, respectively, with a macro-average of 0.721. In the classification tasks, the scores were 0.667, 0.536, 0.705, 0.673 for anger, fear, joy and sadness, respectively, with a macro-average of 0.646.

Multi-Label Emotion Classification

In the multi-label classification sub-task, we had to label tweets with respect to 11 emotions: anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise and trust. The score was evaluated using the Jaccard similarity coefficient. We started with the same cleaning and feature-generation pipelines as before, creating an input layer of size 217. We added a fully connected layer of size 100 with tanh activation. Next there were 300 instances of fully connected layers of size 11 with a sigmoid activation function. We calculated the mean of all d = 11 vectors, producing the final d = 11 vector. For an illustration, see figure 4.
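The grid search over ordinal partitions used for the EI-oc and V-oc sub-tasks can be sketched as follows: search threshold tuples from a candidate grid and keep the one whose induced classes best correlate (Pearson) with the gold labels. The candidate grid and class count here are illustrative assumptions.

```python
import itertools
import numpy as np

def best_partition(pred, gold, n_classes=4, grid=None):
    """Grid search for thresholds on regression scores maximizing Pearson r."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)          # candidate cut points
    best_corr, best_cuts = -1.0, None
    for cuts in itertools.combinations(grid, n_classes - 1):
        classes = np.digitize(pred, cuts)           # scores -> classes 0..n-1
        if classes.min() == classes.max():          # constant vector: r undefined
            continue
        corr = np.corrcoef(classes, gold)[0, 1]
        if corr > best_corr:
            best_corr, best_cuts = corr, cuts
    return best_corr, best_cuts

# noisy synthetic regression predictions around 4 gold classes
rng = np.random.default_rng(0)
gold = rng.integers(0, 4, size=200)
pred = gold / 3 + rng.normal(0, 0.05, size=200)
corr, cuts = best_partition(pred, gold)
```

The search is exhaustive over the grid (here C(19, 3) = 969 candidate partitions), which is cheap on a dev set of this size.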
Figure 3: Relative contribution of features in the valence regression sub-task.

Figure 4: Architecture of the multi-label sub-task E-c: a d = 217 input, a fully connected d = 100 layer, 300 copies of d = 11 layers, and a final mean.

We used the following loss function, based on the Tanimoto distance:

L(y, ỹ) = 1 − (y · ỹ) / (‖y‖² + ‖ỹ‖² − y · ỹ + ε),

where ‖·‖ is the L₂ norm and ε is a small constant used for numerical stability. We trained with a batch size of 10, for 40 epochs, with Adam optimization with default parameters. Our final score was 0.566.

Spanish Valence Sub-Tasks

We participated in the Spanish valence tasks to examine the current state of neural machine translation (NMT) algorithms. We used the Google Cloud Translation API to translate the training, development and test datasets for the two valence tasks from Spanish to English. We then treated the tasks the same way as the English valence tasks, using the same cleaning and feature extraction pipelines and the same architecture described in section 7.1 to generate regression and classification predictions. We reached 1st and 2nd places in the classification and regression sub-tasks, with scores of 0.765 and 0.770, respectively.
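A NumPy sketch of the Tanimoto-based loss above; the value of ε here is an arbitrary illustrative choice.

```python
import numpy as np

def tanimoto_loss(y, y_pred, eps=1e-6):
    """1 minus the Tanimoto similarity between a label vector and a prediction."""
    y, y_pred = np.asarray(y, float), np.asarray(y_pred, float)
    dot = y @ y_pred
    return 1.0 - dot / (y @ y + y_pred @ y_pred - dot + eps)
```

The loss is 0 for a perfect multi-label prediction and 1 for fully disjoint label vectors, matching the Jaccard-style evaluation of the E-c sub-task.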
Summary

In this paper we described the system developed to participate in the SemEval 2018 task 1 workshop. We reached 3rd place in the valence ordinal classification sub-task and 5th place in the valence regression sub-task. In the Spanish valence tasks, we reached 1st and 2nd places in the classification and regression sub-tasks, respectively. In the emotion intensity sub-tasks we reached 4th and 13th places in the classification and regression sub-tasks, respectively.

Summarizing the methods used: training of word embeddings based on a Twitter corpus (200M tweets); developing and using the Amobee sentiment classifier (ASC) architecture, a bi-directional GRU layer with a CNN-based attention mechanism and an additional hidden layer, used to adjust the embeddings to include emotional context; and finally a shallow feed-forward NN with a stack-based ensemble of the final hidden layers from all previous classifiers we trained. This form of transfer learning proved to be important, as the hidden-layer features made a significant contribution to minimizing the loss.

Overall, we had better performance in the valence tasks, both in English and Spanish. We posit this is due to the fact that our annotated supervised training dataset (non task-specific) was based on SemEval 2017 task 4, which focused on valence classification. In addition, the annotations in SemEval 2017 were label-based, lending themselves more easily to the ordinal classification tasks. In the Spanish tasks, we used external translation (Google API) and achieved good results without the use of Spanish-specific features.
Acknowledgment
We thank Zohar Kelrich for assisting in translating the Spanish datasets to English.

References
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, abs/1409.1259.

François Chollet et al. 2015. Keras. https://github.com/keras-team/keras.

Mathieu Cliche. 2017. BB_twtr at SemEval-2017 task 4: Twitter sentiment analysis with CNNs and LSTMs. arXiv preprint arXiv:1704.06125.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.

C.J. Hutto and E.E. Gilbert. 2014. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Eighth International Conference on Weblogs and Social Media (ICWSM-14), Ann Arbor, MI.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Steven Loria, P. Keen, M. Honnibal, R. Yankovsky, D. Karesh, E. Dempsey, et al. 2014. TextBlob: Simplified text processing.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Saif M. Mohammad. 2017. Word affect intensities. arXiv preprint arXiv:1704.08798.

Saif M. Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. 2018. SemEval-2018 Task 1: Affect in tweets. In Proceedings of the International Workshop on Semantic Evaluation (SemEval-2018), New Orleans, LA, USA.

Saif M. Mohammad and Svetlana Kiritchenko. 2018. Understanding emotions: A dataset of tweets to study interactions between affect categories. In Proceedings of the 11th Edition of the Language Resources and Evaluation Conference, Miyazaki, Japan.

Olutobi Owoputi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. 2013. Improved part-of-speech tagging for online conversational text with word clusters. Association for Computational Linguistics.

John W. Pratt. 1987. Dividing the indivisible: Using simple symmetry to partition variance explained. In Proceedings of the Second International Tampere Conference in Statistics, 1987, pages 245–260. Department of Mathematical Sciences, University of Tampere.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA. http://is.muni.cz/publication/884893/en.

Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. 2014. Learning sentiment-specific word embedding for Twitter sentiment classification. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1555–1565.

D. Roland Thomas, Edward Hughes, and Bruno D. Zumbo. 1998. On variable importance in linear regression. Social Indicators Research, 45(1-3):253–275.
Features List
Name  Description  Dim.

ASC  ASC model hidden layer.  25
ASC x {anger, fear, joy, sadness}  Emotion-specific ASC hidden layers.  25 × 4
at  “@” symbol in tweet.  1
blob  TextBlob sentiment library.  1
caps  Occurrence of all-capitalized words.  1
dim  Diminisher words.  1
{ft, w2v} x {200, 150} x {anger, fear, joy, sadness}  Hidden layers of models used to re-train the embeddings.  15 × 16
hash {anger, fear, joy, sadness}  Affect lexicon of hashtags.  4
irony  Occurrence of

Table 6: List of features used as inputs for the task-specific models.