Emoji-Based Transfer Learning for Sentiment Tasks
Susann Boy, Dana Ruiter, Dietrich Klakow
Saarland University
{sboy,druiter,dietrich.klakow}@lsv.uni-saarland.de
Abstract
Sentiment tasks such as hate speech detection and sentiment analysis, especially when performed on languages other than English, are often low-resource. In this study, we exploit the emotional information encoded in emojis to enhance the performance on a variety of sentiment tasks. This is done using a transfer learning approach, where the parameters learned by an emoji-based source task are transferred to a sentiment target task. We analyse the efficacy of the transfer under three conditions, i.e. i) the emoji content and ii) label distribution of the target task as well as iii) the difference between monolingually and multilingually learned source tasks. We find i.a. that the transfer is most beneficial if the target task is balanced with high emoji content. Monolingually learned source tasks have the benefit of taking into account the culturally specific use of emojis and gain up to F1 +0. over the baseline.

1 Introduction

Many natural language processing (NLP) tasks suffer from a lack of available data. This is especially true for sentiment tasks, such as hate speech (HS) detection, which depend on the availability of manually annotated data. When moving to languages other than English, many sentiment tasks quickly become very low-resourced. On the other hand, noisy social media content is available in abundance and many sentiment tasks are based on user comments on such platforms. Emojis can be a valuable source for the distant supervision of sentiment tasks, as they correlate with the underlying emotion of a comment. In this study, we aim to exploit the emotional information encoded in emojis to improve the performance on various sentiment tasks using a transfer learning approach from an emoji-based source task (ST) to a sentiment target task (TT). Previous work has focused on the transfer from predicting single emojis (Felbo et al., 2017) or strictly pre-defined emoji clusters (Deriu et al., 2016).
However, pre-defined emoji clusters do not take into account the culturally diverse usage of emojis (Park et al., 2012; Kaneko et al., 2019). We therefore introduce data-driven supervised and unsupervised emoji clusters and compare these with single emoji prediction tasks. Specifically, we analyze the efficacy of the transfer from a single emoji or (un)supervised emoji cluster prediction ST to a sentiment TT under three conditions, i.e. i) low vs. high amount of emoji content present in TT, ii) balanced vs. unbalanced label distribution in TT and iii) monolingually or multilingually learned ST. The first two conditions are based on typical qualities of sentiment corpora, which tend to be unbalanced in their label distribution with varying degrees of emoji content depending on the source of the data. The third condition is relevant for languages for which a TT is low-resource and which might benefit from a multilingually learned ST. In Section 2 we give an outline of related work, followed by the introduction of our method (Section 3). The experimental setup in Section 4 details the data and models used as well as the (un)supervised clusters generated. In Section 5 we describe our results and conclude in Section 6.

2 Related Work

Emojis have been used as a type of distant supervision using pre-defined emotion classes based on psychological models (Suttles and Ide, 2013), binary (positive/negative) classes (Deriu et al., 2016) or a set of single emojis (Felbo et al., 2017). However, such pre-defined emoji classes often do not account for the culturally diverse use of emojis (Park et al., 2012; Kaneko et al., 2019). In contrast, our work does not pre-define the emotion classes found in emojis and instead learns these classes, or clusters, from the data itself. While our and the above approaches focus on exploiting emojis as additional labelled data, e.g.
in a transfer setting, emoji embeddings (Eisner et al., 2016) have been used as additional features in downstream tasks such as sarcasm detection (Subramanian et al., 2019). Transfer learning has recently been driven by transformer-based (Vaswani et al., 2017) language models (LM) such as BERT (Devlin et al., 2019) or XLM-R (Conneau et al., 2020). When learning a source task on these models, the representations in the encoder change to become informative to the task at hand. In a parameter transfer setting, a new but related target task then profits from the learned representations in the encoder. Transfer learning has been applied to sentiment analysis (SA) using parameter transfer methods such as pre-trained sentiment embeddings (Dong and de Melo, 2018) or machine translation-based context vectors (McCann et al., 2017). Our approach forms part of the parameter transfer approach, as we use encoder representations learned using emoji-based source tasks and transfer these to sentiment target tasks.
Hate speech classification and sentiment analysis have in recent years been the object of many shared tasks (Rosenthal et al., 2017; Wiegand, 2018; Basile et al., 2019; Mandl et al., 2019; Ogrodniczuk and Kobyliński, 2019). Classification models for these tasks often rely on feature engineering and statistical methods such as naive Bayes (Saleem et al., 2016), logistic regression over subwords (Waseem and Hovy, 2016) or neural approaches including convolutional neural networks (Park and Fung, 2017) or, as in our case, the representations of large LMs (Yang et al., 2019).
3 Method

For our parameter transfer, we rely on a single transformer-based LM which is shared among different tasks. A sequence x ∈ X is featurized by reading it into the encoder of the LM and retrieving its last hidden state. A linear layer is then used as a predictive function f: X → Y to predict labels y ∈ Y. A task T = {Y, f(x)} is then a set of labels Y and the predictive function f over the instances in X. We follow a transfer learning approach, where source task T_S is an emoji-based classification task, i.e. given a sequence, predict the emoji (class) that it originally contained. Target task T_T is a downstream task such as SA or HS (Section 4.1). Each task has its own set of instances X, labels Y and predictive function f, while the feature-generating LM stays the same. The error of predictor f is back-propagated to the LM, which allows us to transfer learned parameters from T_S to T_T.

We focus on 5 different emoji-based STs, which can be divided into two types, emoji prediction (EP) and emoji cluster prediction. To sample emojis for EP or create clusters, we rely on a large collection of user generated comments. EP is a multi-class prediction task over the 64 most common emojis identified in the collection of comments. Concretely, given a tweet with all emojis removed, the classifier has to predict which of the 64 emojis was originally contained within it. The emoji cluster prediction tasks can be supervised (PMI-{Target,Swear}) or unsupervised (KMeans-{2,3}). In this case the task is simplified: given a tweet with all emojis removed, predict the cluster to which the emoji originally contained in the tweet belonged.

Unsupervised Clusters
In order to account for the cultural differences in the use of emojis, we learn emoji clusters directly from the user generated data. We generate 50-dimensional vector representations over the tokens in the collection of user comments using the continuous bag of words (Mikolov et al., 2013) approach. We then perform k-means clustering with 6 target clusters on the representations of emojis that occurred ≥ times. These clusters are manually merged into 2 (positive/negative) and 3 (positive/negative/neutral) clusters to create the binary KMeans-2 and ternary KMeans-3 emoji cluster prediction STs respectively. Below a comment to be classified as positive according to the KMeans-{2,3} tasks, as it originally contained an emoji that belonged to the positive cluster:

So beautiful and great advice → positive

Supervised Clusters
As an alternative to the completely unsupervised clusters, we exploit the mutual information between emojis and swear words as a type of distant supervision for HS tasks. We calculate the pointwise mutual information (PMI) between comments in our collection of user content (not) containing slurs and the emojis that appear. An emoji is placed in the slur cluster if its PMI with comments containing swear words is larger; otherwise it is placed in the neutral cluster.
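The slur/neutral assignment described above can be sketched as follows. The input format, function name and toy counts are illustrative assumptions, not the authors' implementation; PMI is computed as log p(e, c) / (p(e) p(c)) for emoji e and comment class c.

```python
import math
from collections import Counter

def pmi_clusters(comments):
    """Assign each emoji to a 'slur' or 'neutral' cluster via PMI.

    `comments` is a list of (emojis, contains_slur) pairs, where
    `emojis` is the set of emojis in one comment (hypothetical format).
    """
    n = len(comments)
    emoji_counts = Counter()
    class_counts = Counter()
    joint = {True: Counter(), False: Counter()}
    for emojis, has_slur in comments:
        class_counts[has_slur] += 1
        for e in emojis:
            emoji_counts[e] += 1
            joint[has_slur][e] += 1

    def pmi(e, has_slur):
        p_joint = joint[has_slur][e] / n
        if p_joint == 0:
            return float("-inf")  # emoji never co-occurs with this class
        p_e = emoji_counts[e] / n
        p_c = class_counts[has_slur] / n
        return math.log(p_joint / (p_e * p_c))

    return {e: ("slur" if pmi(e, True) > pmi(e, False) else "neutral")
            for e in emoji_counts}
```

On a toy corpus where one emoji co-occurs mostly with slur comments and another mostly with clean ones, the two end up in the slur and neutral clusters respectively.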
PMI-Swear is then a binary classification task based on the resulting slur/neutral emoji clusters. While the unsupervised emoji cluster prediction STs and PMI-Swear are source-oriented, i.e. learned on user generated content, we also explore target-oriented clusters that rely on the shared information between emojis and the labels in each of the TTs. Concretely, we calculate the PMI between the label of an instance in the respective TT training data and the emojis it contains. The emoji is placed into the cluster of the label to which its PMI value is largest. PMI-Target is the ST based on these target-oriented emoji clusters.
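The unsupervised KMeans-{2,3} clusters described earlier can likewise be sketched with a plain k-means. The paper clusters 50-dimensional CBOW emoji embeddings into 6 clusters; the toy 2-dimensional vectors and the function name below are illustrative assumptions.

```python
import random

def kmeans_emoji_clusters(vectors, k=2, iters=20, seed=0):
    """Toy k-means over emoji vectors: a stand-in for clustering the
    50-dimensional CBOW emoji embeddings (k = 6 in the paper)."""
    rng = random.Random(seed)
    emojis = list(vectors)
    centroids = [list(vectors[e]) for e in rng.sample(emojis, k)]

    def nearest(vec):
        # Index of the centroid closest to `vec` (squared Euclidean).
        return min(range(k),
                   key=lambda c: sum((a - b) ** 2
                                     for a, b in zip(vec, centroids[c])))

    for _ in range(iters):
        members = {c: [] for c in range(k)}
        for e in emojis:
            members[nearest(vectors[e])].append(e)
        for c, group in members.items():
            if group:  # move the centroid to the mean of its members
                dims = list(zip(*(vectors[e] for e in group)))
                centroids[c] = [sum(d) / len(group) for d in dims]
    return {e: nearest(vectors[e]) for e in emojis}
```

The resulting cluster indices would then be manually merged into the positive/negative(/neutral) classes as described above.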
Once the classifier has been fully trained on the ST, and thus has adapted the underlying LM's representations to fit the ST at hand, we discard it and train a new classifier on top of the enriched LM to predict the TT. We evaluate this transfer from the various STs on two main categories of TTs, namely Hate Speech Detection and Sentiment Analysis. Given a user generated comment, Hate Speech Detection is the task of classifying the comment as either hate or none. Note, however, that concrete label names (e.g. offense, hate, harmful) may differ across specific HS tasks. While HS in our case is a binary classification task, Sentiment Analysis is a ternary classification task which takes as input a user generated comment and classifies it as either positive, neutral or negative. In the following an example from the Sentiment Analysis in Twitter (Rosenthal et al., 2017) task:

Finally starting the 5th season of → positive

Both HS and SA are sentiment-based tasks, e.g. hate towards a group of people or positive sentiment towards a product etc. We therefore take these two types of tasks to have the potential to benefit from the emotion information encoded in emojis. In the following sections we explore the conditions under which the transfer from an emoji-based ST to a sentiment-based TT is beneficial for the TT.
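Schematically, the ST-to-TT transfer described above amounts to fine-tuning a shared encoder under one classification head, then discarding that head and training a fresh one. The stub classes below only mimic this flow (no real model or gradients) to show which parameters carry over; all names and numbers are illustrative.

```python
class Encoder:
    """Stub for the shared transformer LM; its parameters change whenever
    a task head backpropagates its error into the encoder."""
    def __init__(self):
        self.params = {"layer": 0.0}

    def update(self, delta):
        self.params["layer"] += delta  # placeholder for a gradient step


class LinearHead:
    """Stub for the linear predictive function f: X -> Y on top of the LM."""
    def __init__(self, encoder, labels):
        self.encoder = encoder
        self.labels = labels

    def train_step(self, delta):
        # In real training, the head's loss also updates the shared encoder.
        self.encoder.update(delta)


lm = Encoder()

# 1) Fine-tune on the emoji-based source task (here: PMI-Swear's labels).
st_head = LinearHead(lm, ["slur", "neutral"])
st_head.train_step(0.5)   # encoder representations now reflect the ST

# 2) Discard the ST head and train a fresh head for the sentiment TT
#    on top of the same, now emoji-enriched, encoder.
tt_head = LinearHead(lm, ["hate", "none"])
tt_head.train_step(0.1)

assert tt_head.encoder is st_head.encoder  # parameters transfer via the LM
```

In practice the encoder is a BERT-style LM and both heads are trained with backpropagation (Section 4.2).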
4 Experimental Setup

We describe the data used for the STs and TTs respectively (Section 4.1), followed by the specifications of the encoding LM (Section 4.2) and the emoji cluster creation (Section 4.3).

Corpus   Train                Test              Emojis

Target Tasks (TT)
HS-DE    1158/2439            970/2061          853 (7.2%)
SA-DE    1346/900/3676        83/49/197         166 (2%)
HS-ES    1857/2643            660/940           957 (14.5%)
SA-EN    18481/7551/21542     2375/3972/5937    1211 (1.9%)
SA-AR    653/1022/1336        1514/2222/2364    2126 (22.5%)
HS-PL    812/8726             134/866           1733 (13.7%)

Source Tasks (ST)
TW-DE    16M                  –                 3M (10%)
TW-EN    323M                 –                 82M (17%)
TW-ES    320M                 –                 43M (9%)
TW-PL    7M                   –                 1M (12%)
TW-AR    183M                 –                 56M (20%)

Table 1: Number of train, test (for TT) and collected (for ST) tweets as well as number of (non-unique) emojis contained in each corpus. Percentage of training tweets containing emojis in brackets. TTs with label distribution for HS (hate/none) and SA (positive/negative/neutral) tasks.

4.1 Data

We use a collection of tweets that has been collected from the Twitter stream between 2011 and 2019 as the corpus from which we sample emojis and create emoji clusters for the STs. We perform language identification using the polyglot library over the tweets to create a corpus for German, English, Spanish, Polish and Arabic (TW-{DE,EN,ES,PL,AR}) respectively. To automatically identify swear words for PMI-Swear, we use a German and a multilingual swear word collection, namely WoltLab and Hatebase. In total, we collected 785 slurs for German, and 1531, 140, 306, 79 for English, Spanish, Polish and Arabic respectively.

Target Tasks
We work with 6 target tasks in total, 3 HS and 3 SA tasks, taking into account their emoji content, class (im)balance and language. For German, we use GermEval 2018 (Wiegand, 2018) Task 1 (offense/other) (HS-DE) and SB10k (Cieliebak et al., 2017) (positive/negative/neutral) (SA-DE). For English, we use Sentiment Analysis in Twitter (Rosenthal et al., 2017) (positive/negative/neutral) (SA-EN). Sentiment Analysis in Twitter is also used for Arabic (SA-AR). For Spanish we use HatEval (Basile et al., 2019) (hate/none) (HS-ES) and for Polish, we use PolEval (Ogrodniczuk and Kobyliński, 2019) Task 6 (harmful/none) (HS-PL). For all of the above, we use the original train/test splits. While the HS tasks have different label names, we normalize these to be hate/none across all tasks. For all SA tasks, the labels to be predicted are positive/negative/neutral. In Table 1, we report the label distribution, hate/none for HS and positive/negative/neutral for SA, across all TT training and test sets, as well as ST Twitter corpora sizes. For both ST and TT corpora, we also report the percentage as well as total number of tweets containing emojis.

Preprocessing  All data sets undergo the same preprocessing. Tweets are tokenized using the NLTK (Bird and Loper, 2004) TweetTokenizer, and user mentions, retweets and punctuation are removed. Repeated characters are shortened. We use token frequencies to determine the standard orthography of a word (e.g. coooool → cool instead of col).

4.2 Language Model

For the monolingual (German) experiments, we use the German BERT (BERT-DE) and for multilingual experiments we use Bert-Base-Multilingual-Cased (BERT-M) as the LM to encode the tweets. We base our code on the simpletransformers sequence classification implementations of the above models. Each classification task is trained for a maximum of 10 epochs using early stopping over the validation accuracy with δ = 0. and patience 3. Training was performed on a single Titan-X GPU, which took between 1 and 6 hours depending on the data size. We evaluate the resulting classifiers using the Macro F1 measure.

4.3 Emoji Clusters

We describe the creation of the emoji clusters used for the emoji cluster STs.
Unsupervised  The unsupervised clusters (Section 3) were trained on TW-DE and the concatenation of TW-{DE,EN,ES,PL,AR} for the mono- and multilingual experiments respectively. In both cases, this yielded clusters that can be manually categorized as happy, love, fun, nature, unhappy, other (Figure 1). For KMeans-3, {happy, fun, love} were merged to positive, {other, nature} to neutral and {unhappy} was used as the negative class. For KMeans-2, the neutral class is ignored.

Figure 1: Happy (left) and unhappy (right) emoji clusters obtained by KMeans on TW-DE.

Code available at: https://github.com/uds-lsv/emoji-transfer

Supervised
The PMI-Target clusters are trained on the respective TT training data. The slur lists are used to identify the slurs in the Twitter corpora. PMI-Swear is then trained on TW-DE and the concatenation of TW-{DE,EN,ES,PL,AR} for the mono- and multilingual experiments respectively.

5 Results

We train each model over 10 seeded runs and report the averaged Macro F1 with standard error (Figure 2). For each TT, we train a baseline, which is the same pre-trained BERT-{DE,M} model that is now fine-tuned directly on the TT classification task at hand, without prior training on the ST. We compare these baselines with those models that have undergone a transfer from ST to TT. We use the term equivalent to signify that two models lie within each other's error bounds.

5.1 Emoji Content

We evaluate the effect that STs have on TTs with different amounts of emoji content. We focus on the TTs with the lowest and highest amount of emoji content, namely SA-EN (1.9% emoji content) and SA-AR (22.5%). This is the multilingual case. For the monolingual case, we evaluate the effect on SA-DE (2%) and HS-DE (7.2%). All of these TTs are unbalanced, i.e. the minority class makes up 15.2–32.2% of the training data.

Figure 2: Macro F1 of the HS and SA target tasks transferred from monolingual (left) and multilingual (right) STs.

The monolingual, low emoji content SA-DE task does not profit from the transfer. Rather, training on most STs leads to a slight drop in Macro F1 compared to the baseline (F1 0.600). On the other hand, high emoji content HS-DE greatly benefits from the transfer, with PMI-Swear (F1 0.730) being especially beneficial for the performance on the TT, yielding a gain of F1 +0.280 over the baseline. This shows that the shared information in emojis and slurs is relevant to the HS task at hand. Also beneficial are EP (F1 0.705), and the unsupervised KMeans-3 (F1 0.690) and KMeans-2 (F1 0.629) cluster prediction tasks. Only the supervised PMI-Target (F1 0.405) does not seem to be beneficial for the performance on the TT, leading to a drop in performance, which is due to the unbalanced nature of the TT (Section 5.2).

The multilingual case shows a slightly mixed trend. Low emoji content SA-EN does not benefit from the transfer, but unlike in the monolingual setting, it is not harmed by it either. All STs lead to a TT performance that is equivalent to the baseline (F1 0.578). High emoji content SA-AR only barely profits from the transfer, with EP (F1 0.509) leading to a small gain of F1 (+0.034) over the baseline (F1 0.475), while all other STs lead to a performance equivalent to the baseline. The overall trend is similar to the monolingual case but the positive and negative effects are dimmed down, which may be due to the multilingual aspect (Section 5.3).

The general trend shows that a decent amount of emoji content in the TT training data is crucial for the transfer to be beneficial.
5.2 Label Distribution

To analyze the effect that the STs have on differently (un)balanced TTs, we focus on HS-PL (the minority class makes up 8.5% of training data) and HS-ES (41.3%), as they are the two most (un)balanced TTs, while being comparable in terms of emoji content (13.7% and 14.5% respectively).

For unbalanced HS-PL, EP (F1 0.617) and unsupervised KMeans-2 (F1 0.522) lead to an improvement of F1 +0. and F1 +0. over the baseline, respectively. All other STs are equivalent to the baseline. Balanced HS-ES benefits from all STs, with EP (F1 0.708) leading to a gain of F1 +0.261 over the baseline (F1 0.447), followed by PMI-Swear (F1 0.690) and PMI-Target (F1 0.643). The unsupervised clusters are beneficial but less effective, with F1 0.602 and F1 0.475 for KMeans-3 and KMeans-2 respectively, which likely stems from the multilingual aspect (Section 5.3).

PMI-Target performs poorly on unbalanced HS-PL (and HS-DE etc.) due to its use of mutual information between emojis and the TT labels. This leads to it reproducing the class imbalance, making it less effective on unbalanced TTs. The difference in impact of PMI-Swear on HS-PL (none) and HS-ES (and HS-DE) (gain) can be explained by the composition of the ST dataset. TW-PL is the smallest corpus in the multilingual collection of user comments, and this sparsity is further driven by the morphological complexity of Polish, such that the 306 slurs from the Polish slur list only resulted in 65k Polish training samples in PMI-Swear, as opposed to 1.8M and 3M for German and Spanish respectively.

Overall, if the label distribution in TT is balanced, the TT easily benefits from the transfer. Otherwise other conditions such as the multilinguality or emoji content become more relevant.
5.3 Mono- vs. Multilingual Source Tasks

We analyze the effectiveness of the transfer in a monolingual and multilingual setting. For this, we focus on the effect that the monolingually and multilingually learned STs have on HS-DE and SA-DE. Both TTs are unbalanced, while HS-DE has a high emoji content and SA-DE has a low emoji content. The different effects of the emoji content in HS-DE and SA-DE have been discussed in Section 5.1, showing that in the monolingual setting, high emoji content HS-DE benefits from the transfer, while low emoji content SA-DE does not. In the multilingual case, we see a similar, but dimmed, trend. SA-DE does not benefit from the transfer, with all STs leading to a performance equivalent to the baseline (F1 0.566), except KMeans-2 (F1 0.439) which is below the baseline. The STs have a similar performance on HS-DE, being equivalent or below the baseline (F1 0.663). Only PMI-Swear (F1 0.678) is beneficial for the TT performance.

The effect of the ST-oriented clusters KMeans-{2,3} was beneficial in the monolingual case (HS-DE), but this benefit is lost in the multilingual setting. This underlines our original idea that ST-oriented unsupervised emoji clusters learned on large amounts of user generated text have the advantage of accounting for cultural differences in the usage of emojis. When learned multilingually, this advantage is lost. An example of the culturally diverse use of emojis is ♻, which is rather infrequent in Europe and might be used to point towards the importance of recycling. In TW-AR, this emoji is among the top 5 most frequent emojis, and is used to motivate other users to share their content. The overall trend thus shows that monolingually learned STs are more beneficial than multilingual STs. However, if the training data of a TT is balanced, this effect is less pronounced.

5.4 Comparison to the State of the Art

To put the results into a broader perspective, we compare to state-of-the-art (SOTA) models for each of the shared tasks/datasets that our TTs are based on (Table 2). For two of the Hate Speech benchmarks, the performance of our transfer approach is close to the SOTA, namely with a difference of F1 −0. (HS-DE) and F1 −0. (HS-ES). For HS-PL, we were able to achieve a gain of +0. over the SOTA. Across all three Sentiment Analysis benchmarks, our models are below the SOTA. This indicates that SA, in general, is a more difficult task for our transfer approach than HS, possibly due to its ternary, rather than binary, classification objective. This is another factor causing the transfer to be overall more beneficial for HS rather than SA, next to the unbalanced (SA-{EN,AR}) and low emoji content (SA-DE) nature of the SA tasks.

TT       Method                    F1       SOTA
HS-DE    PMI-Swear (monolingual)   0.730
HS-ES    EP                        0.708
HS-PL    EP
SA-AR    EP                        0.509
SA-EN    KMeans-3                  0.611

Table 2: Macro F1 comparison of top-scoring transfer method (F1) with SOTA results on the different TT test sets. Best scores in bold. See Montani and Schüller (2018) (HS-DE), Basile et al. (2019) (HS-ES), Ogrodniczuk and Kobyliński (2019) (HS-PL), Cieliebak et al. (2017) (SA-DE) and Rosenthal et al. (2017) (SA-{AR,EN}) for SOTA method descriptions.

6 Conclusion

We have evaluated and identified conditions under which the transfer from an emoji-based ST is beneficial for a sentiment TT. In the experiments in Section 5 we observed three major trends, namely i) TTs with high amounts of emoji content benefit more from the transfer, ii) PMI-Target tends to be detrimental to unbalanced TTs and iii) monolingually learned STs tend to perform better than their multilingual counterparts, due to their improved representation of culturally unique emoji usages. The latter underlines the importance of taking into account cultural differences when exploiting the information encoded in emojis. From these results, we can draw conclusions about the conditions under which a given emoji-based ST is beneficial. Due to the shared information between emojis and slurs, PMI-Swear is beneficial to HS tasks when the data that can be generated from the swear word list is decently large.
PMI-Target is beneficial when the TT is balanced, otherwise it replicates the already existing class imbalance. Unsupervised KMeans-{2,3} should be learned monolingually to be beneficial, and EP is a safe choice for TTs with high emoji content.

Acknowledgments

We want to thank the anonymous reviewers as well as Thomas Kleinbauer for their valuable feedback. The project on which this paper is based is funded by the DFG under funding code WI 4204/3-1.

References
Valerio Basile, Cristina Bosco, Elisabetta Fersini, Debora Nozza, Viviana Patti, Francisco Manuel Rangel Pardo, Paolo Rosso, and Manuela Sanguinetti. 2019. SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 54–63, Minneapolis, Minnesota, USA. Association for Computational Linguistics.

Steven Bird and Edward Loper. 2004. NLTK: The natural language toolkit. In Proceedings of the ACL Interactive Poster and Demonstration Sessions, pages 214–217, Barcelona, Spain. Association for Computational Linguistics.

Mark Cieliebak, Jan Milan Deriu, Dominic Egger, and Fatih Uzdilli. 2017. A Twitter corpus and benchmark resources for German sentiment analysis. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 45–51, Valencia, Spain. Association for Computational Linguistics.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.

Jan Deriu, Maurice Gonzenbach, Fatih Uzdilli, Aurelien Lucchi, Valeria De Luca, and Martin Jaggi. 2016. SwissCheese at SemEval-2016 task 4: Sentiment classification using an ensemble of convolutional neural networks with distant supervision. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1124–1128, San Diego, California. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Xin Dong and Gerard de Melo. 2018. A helping hand: Transfer learning for deep sentiment analysis. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2524–2534, Melbourne, Australia. Association for Computational Linguistics.

Ben Eisner, Tim Rocktäschel, Isabelle Augenstein, Matko Bošnjak, and Sebastian Riedel. 2016. emoji2vec: Learning emoji representations from their description. In Proceedings of The Fourth International Workshop on Natural Language Processing for Social Media, pages 48–54, Austin, TX, USA. Association for Computational Linguistics.

Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. 2017. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1615–1625, Copenhagen, Denmark. Association for Computational Linguistics.

Daisuke Kaneko, Alexander Toet, Shota Ushiama, Anne-Marie Brouwer, Victor Kallen, and Jan B.F. van Erp. 2019. EmojiGrid: A 2D pictorial scale for cross-cultural emotion assessment of negatively and positively valenced food. Food Research International, 115:541–551.

Thomas Mandl, Sandip Modha, Prasenjit Majumder, Daksh Patel, Mohana Dave, Chintak Mandlia, and Aditya Patel. 2019. Overview of the HASOC track at FIRE 2019: Hate speech and offensive content identification in Indo-European languages. In Proceedings of the 11th Forum for Information Retrieval Evaluation, FIRE '19, pages 14–17, New York, NY, USA. Association for Computing Machinery.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, volume 30, pages 6294–6305. Curran Associates, Inc.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space.

Joaquín Padilla Montani and Peter Schüller. 2018. TUWienKBS at GermEval 2018: German abusive tweet detection. In Proceedings of GermEval 2018, 14th Conference on Natural Language Processing (KONVENS 2018), pages 45–50.

Maciej Ogrodniczuk and Łukasz Kobyliński, editors. 2019. Proceedings of the PolEval 2019 Workshop. Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland.

Jaram Park, Young Min Baek, and Meeyoung Cha. 2012. Cross-cultural comparison of nonverbal cues in emoticons on Twitter: Evidence from big data analysis. Journal of Communication, 64(2):333–354.

Ji Ho Park and Pascale Fung. 2017. One-step and two-step classification for abusive language detection on Twitter. In Proceedings of the First Workshop on Abusive Language Online, pages 41–45.

Sara Rosenthal, Noura Farra, and Preslav Nakov. 2017. SemEval-2017 task 4: Sentiment analysis in Twitter. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 502–518.

Haji Mohammad Saleem, Kelly P. Dillon, Susan Benesch, and Derek Ruths. 2016. A web of hate: Tackling hateful speech in online social spaces. In First Workshop on Text Analytics for Cybersecurity and Online Safety at LREC 2016.

Jayashree Subramanian, Varun Sridharan, Kai Shu, and Huan Liu. 2019. Exploiting emojis for sarcasm detection. In International Conference on Social Computing, Behavioral-Cultural Modeling and Prediction and Behavior Representation in Modeling and Simulation, pages 70–80. Springer.

Jared Suttles and Nancy Ide. 2013. Distant supervision for emotion classification with discrete binary values. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 121–136. Springer.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998–6008. Curran Associates, Inc.

Zeerak Waseem and Dirk Hovy. 2016. Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop, pages 88–93.

Michael Wiegand. 2018. Overview of the GermEval 2018 shared task on the identification of offensive language, pages 1–10, Wien. Verlag der Österreichischen Akademie der Wissenschaften.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.