SocialNLP EmotionGIF 2020 Challenge Overview: Predicting Reaction GIF Categories on Social Media
Boaz Shmueli∗, Lun-Wei Ku, and Soumya Ray
Social Networks and Human-Centered Computing, Taiwan International Graduate Program
Institute of Information Science, Academia Sinica
Institute of Service Science, National Tsing Hua University
Abstract
We present an overview of the EmotionGIF 2020 Challenge, held at the 8th International Workshop on Natural Language Processing for Social Media (SocialNLP), in conjunction with ACL 2020. The challenge required predicting affective reactions to online texts, and includes the EmotionGIF dataset, with tweets labeled for the reaction categories. The novel dataset included 40K tweets with their reaction GIFs. Due to the special circumstances of year 2020, two rounds of the competition were conducted. A total of 84 teams registered for the task. Of these, 25 teams successfully submitted entries to the evaluation phase in the first round, while 13 teams participated successfully in the second round. Of the top participants, five teams presented a technical report and shared their code. The top score of the winning team using the Recall@K metric was 62.47%.
Introduction

Emotions, moods, and other affective states are an essential part of the human experience. The detection of affective states in texts is an increasingly important area of research in NLP, with important applications in dialogue systems, psychology, marketing, and other fields (Yadollahi et al., 2017). Recent approaches have taken advantage of progress in machine learning, and specifically deep neural networks (LeCun et al., 2015), for building models that classify sentiments and emotions in text. Training these models often requires large amounts of high-quality labeled data.

Two main approaches have been used for collecting and labeling emotion data: manual annotation and distant supervision. With manual annotation, humans are presented with a text and are requested to annotate it.

∗ Corresponding author: [email protected]
When using this approach, several emotional models can be used for labeling. The two most common models are the discrete emotional model (Ekman and Friesen, 1971), where the user needs to select among a few categorical emotions (e.g., disgust, joy, fear), and the dimensional emotion model (Mehrabian, 1996), which uses three numerical dimensions (valence, arousal, dominance) to represent all emotions. With the help of crowd-sourcing platforms such as Amazon Mechanical Turk (Buhrmester et al., 2016), human annotation can be quickly scaled up to produce large datasets. However, to achieve large, high-quality datasets, the cost incurred is usually high. In addition, misinterpretations of text due to cultural differences or contextual gaps are common, resulting in unreliable, low-quality labels. It should be noted that the annotators detect the perceived emotions, i.e., the emotions that are recognized in the text.

Another method for data collection is distant supervision, often using emojis or hashtags (e.g., Go et al. (2009), Mohammad and Kiritchenko (2015)). This method provides high-volume, automatic collection of data, albeit with some limitations such as noisy labels (hashtags might not be related to the emotions conveyed in the text). It should be noted that the data collected in this case corresponds to the emotions intended by the text's author. Distant supervision can also be used to label reactions to text. For example, Pool and Nissim (2016) use the Facebook feature that allows users to respond with one of six emojis (Like, Love, Haha, Wow, Sad, and Angry) to collect posts and their readers' reactions. These reactions are a proxy to the readers' induced emotions, the emotions they felt when reading the text.

Figure 1: A typical user interaction on Twitter
This method is limited by the narrow emotional range of labeling.

To improve research on fine-grained emotional reactions and open up new research possibilities, we conducted the EmotionGIF 2020 shared task. The challenge offered a new dataset of 40K tweets with their fine-grained reaction category (or categories). The task challenge was to predict each tweet's reactions in an unlabeled evaluation dataset. In the following sections, we describe and discuss the dataset, the competition, and the results.

Twitter is a popular micro-blogging site, where users create short text posts known as tweets. In most languages, including English, tweets are limited to 280 characters (the limit is 140 characters in Japanese, Chinese, and Korean). As part of the post, users can also mention other users (@user) and use hashtags (#hashtag). Animated GIFs, known as reaction GIFs, are able to convey emotions in an expressive and accurate way (Bakhshi et al., 2016), and have become very popular in online conversations. Figure 1 shows a typical interaction on Twitter: User 1 posted a tweet ("I just won first place!"), and User 2 replied with a tweet that includes an "applause"-category reaction GIF and some text ("Congratulations dear!").

Figure 2: A sample of GIF categories on Twitter

agree, applause, awww, dance, deal with it, do not want, eww, eye roll, facepalm, fist bump, good luck, happy dance, hearts, high five, hug, idk, kiss, mic drop, no, oh snap, ok, omg, oops, please, popcorn, scared, seriously, shocked, shrug, sigh, slow clap, smh, sorry, thank you, thumbs down, thumbs up, want, win, wink, yawn, yes, yolo, you got this

Table 1: GIF categories (idk: I don't know; yolo: you only live once)
For the challenge, we collected similar 2-turn interactions. Each sample in the dataset contains the text of the original tweet, the text of the reply tweet, and the video file of the reaction GIF. The label for each tweet is the reaction category (or categories) of the GIF. Because some replies only contain a reaction GIF, the reply text is optionally empty. We use a list of 43 reaction categories, pre-defined by the Twitter platform and used when a user needs to insert a GIF into a tweet (see Figure 2). The list is shown in Table 1, and covers a wide range of emotions, including love, empathy, disgust, anger, happiness, disappointment, approval, and regret. There is an overlap between the reaction categories in terms of the GIFs they contain, and thus some GIFs can belong to more than one category. Consequently, the label may contain more than one reaction category. For example, GIFs that are categorized with "shocked" might also be categorized in "omg". Table 2 shows a few samples from the training dataset. Note that replies can be optionally empty. The GIF MP4 files are included in the dataset for completeness, but are not used in this challenge.

We collected the EmotionGIF dataset during April 2020, and it includes 40,000 English-language tweets and their GIF reactions. The dataset is divided 80%/10%/10% into training (32,000 samples), development (4,000 samples), and evaluation (4,000 samples) datasets.

Figure 3: Categories per sample

Categories per sample
Figure 3 shows the distribution of the number of categories per sample. The majority of samples (73.1%) in the training dataset are labeled with a single category. An additional 17.7% of samples have two labels, and 5.1% have three categories. The remaining samples are labeled with four to six labels.
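The distribution above is straightforward to compute from the label sets. A minimal sketch, assuming labels arrive as a list of per-sample category lists (a hypothetical format, not the official loader):

```python
from collections import Counter

def categories_per_sample(labels):
    """Return the fraction of samples having 1, 2, 3, ... categories.

    `labels` is assumed to be a list where each entry is the list of
    reaction categories assigned to one sample.
    """
    counts = Counter(len(cats) for cats in labels)
    total = len(labels)
    return {k: counts[k] / total for k in sorted(counts)}

# Toy example, not the real dataset:
labels = [["hug"], ["hug", "want"], ["smh"], ["omg", "shocked", "no"]]
dist = categories_per_sample(labels)  # e.g. {1: 0.5, 2: 0.25, 3: 0.25}
```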
Category Distribution
Figure 4 shows the category distribution. The categories suffer from an uneven distribution; a few of the categories ("applause", "hug", "agree", "yes", "no") each label between 5% and 10% of the samples, while most of the categories label 2% or less of the samples.
Category Co-occurrence
The categories are semantically overlapping, and thus some category pairs co-occur more often than others. Figure 5 shows the co-occurrence heat map. For example, GIFs that are labeled with "facepalm" tend to also be labeled with "seriously", "sigh", and "smh" (Shake My Head), as these four categories are all expressions related to disappointment. Similarly, "shocked" and "omg" (Oh My God) co-occur frequently, both indicating surprise.

Due to year 2020's extraordinary circumstances, the competition had two rounds: Round One and Round Two. A shared task website was set up (https://sites.google.com/view/emotiongif-2020/), which included general information, dates, file formats, frequently-asked questions, a registration form, etc. In addition, two competition websites (Round One, Round Two) were set up on the Codalab platform, where participants could download the datasets and upload their submissions.

For the shared task, we provided the training dataset (32K samples) with labels, and two datasets (development and evaluation, 4K samples each) without labels. Additionally, the development and evaluation datasets did not contain the video files. The task was to predict six labels for every sample, with the metric being Mean Recall at 6, or MR@6. To compute MR@6, we first define the per-sample recall at 6 for sample i, R@6_i, which is the ratio:

R@6_i = |G_i ∩ P_i| / |G_i|,

where G_i is the set of true ("gold") reaction categories for sample i, and P_i is the set of six predicted reaction categories for sample i. R@6_i is the fraction of reaction categories correctly predicted for sample i. Because each sample is labeled with a maximum of six categories, R@6_i is always a value between 0 and 1.
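In code, the metric can be sketched as follows (a minimal illustration with hypothetical function names; MR@6 is the average of the per-sample ratios over all samples):

```python
def recall_at_6(gold, predicted):
    """Per-sample Recall@6: |G ∩ P| / |G|, where P is the set of
    (up to) six predicted categories and G the gold categories."""
    g, p = set(gold), set(predicted[:6])
    return len(g & p) / len(g)

def mean_recall_at_6(golds, preds):
    """MR@6: the mean of per-sample Recall@6 over all samples."""
    scores = [recall_at_6(g, p) for g, p in zip(golds, preds)]
    return sum(scores) / len(scores)

# Toy example: the first sample is fully recovered (1.0), the second
# only partially (1 of 2 gold categories predicted, i.e. 0.5).
golds = [["hug"], ["omg", "shocked"]]
preds = [["hug", "yes", "no", "agree", "applause", "ok"],
         ["omg", "yes", "no", "agree", "applause", "ok"]]
score = mean_recall_at_6(golds, preds)  # (1.0 + 0.5) / 2 = 0.75
```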
We then average over all samples to arrive at MR@6:

MR@6 = (1/N) Σ_{i=1}^{N} R@6_i

We also calculated Recall at 6 separately for the samples with a non-empty reply (i.e., the reply tweet included both a GIF and text) and for the samples with an empty reply (the reply tweet only included a GIF).

Each round of the competition had two phases: practice and evaluation. During the practice phase, participants uploaded predictions for the development dataset and were able to instantly check their performance. In the evaluation phase, which determined the competition winners, participants uploaded predictions for the evaluation dataset. Results were hidden until the end of the round to prevent overfitting to the data.

A total of 84 people registered for the shared task. 25 teams successfully submitted entries to the evaluation phase in Round One, while 13 teams participated successfully in Round Two. (Competition platform: https://competitions.codalab.org/)

Figure 4: User options for Twitter GIF categories

Figure 5: GIF category co-occurrence heatmap
Original tweet | Reply tweet | Reaction GIF | Reaction Categories
Why don't you interact with me ? | | d74d...a34e.mp4 | oops
someone give me a hug (from 2 metres away) | | 2be1....5fc0.mp4 | want, hug
So disappointed in the DaBaby | You're one to talk | bb2d...cfcf.mp4 | smh
Bonus stream tonight anyone? | | e91e....49af.mp4 | win, ok, thumbs up
Camila Cabello and You | Of course? | a9fc....b139a.mp4 | shrug, oops, idk

Table 2: Dataset samples
Table 3: Recall at 6 scores for EmotionGIF (columns: Rank, Team, Approach, and the three Recall at 6 scores)

Of the top participants, five teams presented a technical report and shared their code, as was required by the competition rules.

We provided a simple majority baseline that predicts the 6 most common labels for all samples (applause, hug, agree, yes, no, seriously). MR@6 for the majority baseline is 40.0%.

The highlights of these submissions are summarized below. More details are available in the relevant reports.

Team Mojitok (Jeong et al., 2020)
This top submission used a combination of methods to attack the challenging aspects of the task. Four models were used: three transformer-based models, RoBERTa (Liu et al., 2019), DialoGPT (Zhang et al., 2019), and XLNet (Yang et al., 2019); the fourth was a RoBERTa model in combination with a label embedding using a GCN (Chen et al., 2019) that captures category dependency. These models were fine-tuned, using the "large" variant of the pretrained models. 5-fold cross-validation was used within each model to produce a total of 20 estimators. Soft-voting ensembles of these estimators were tested in various combinations. In addition, mixout (Lee et al., 2019) was used for regularization in some of the models. Reduction of the multi-label problem to the Pick-All-Labels Normalised (PAL-N) (Menon et al., 2019) multi-class formulation was found to optimize the R@6 metric.

Team Whisky (Phen et al., 2020)
RoBERTa and BERT models were evaluated using different hyperparameters and with different handling of emojis. The superiority of the large RoBERTa model with long sequences was demonstrated.
Team Yankee (Wang et al., 2020)
Elaborate preprocessing was used to increase token coverage. RoBERTa and two BERT models (cased and uncased) were fine-tuned and then ensembled using a power-weighted sum. The pretrained models are of the "base" variant. Binary cross-entropy followed a sigmoid activation layer for prediction.
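One plausible reading of "power-weighted sum" is raising each model's sigmoid probabilities to a power before summing; the sketch below illustrates that interpretation and is an assumption, not Team Yankee's published code:

```python
import numpy as np

def power_weighted_ensemble(prob_list, power=2.0, top_k=6):
    """Combine per-model category probabilities by summing each
    model's probabilities raised to `power`, then return the indices
    of the top_k highest-scoring categories.

    NOTE: the exact weighting scheme is assumed for illustration.
    """
    combined = np.sum([p ** power for p in prob_list], axis=0)
    # argsort on the negated scores gives indices in descending order
    return [int(i) for i in np.argsort(-combined)[:top_k]]

# Toy example: two models, seven hypothetical categories.
probs_a = np.array([0.9, 0.1, 0.4, 0.2, 0.8, 0.3, 0.05])
probs_b = np.array([0.7, 0.2, 0.5, 0.1, 0.9, 0.4, 0.10])
top = power_weighted_ensemble([probs_a, probs_b], power=2.0, top_k=3)
# combined = [1.30, 0.05, 0.41, 0.05, 1.45, 0.25, 0.0125] -> [4, 0, 2]
```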
Team Crius (Bi et al., 2020)
This method obtained features from fine-tuned pairwise and pointwise BERT, along with statistical semantic features and similarity-related features. These were concatenated and fed into a LightGBM classifier (Ke et al., 2017).
Team IITP (Ghosh et al., 2020)
Preprocessing included removal of Twitter mentions and replacing emoticons with words. An ensemble of 5 models was used. The first model used two 2D CNNs with attention networks, one for the original tweet and one for the reply tweet, using pre-trained GloVe embeddings. The outputs of the two CNN networks were concatenated and fed into a fully-connected layer followed by a sigmoid activation layer and binary cross-entropy. Dropout layers were used at various stages for regularization. Four additional models with a similar top architecture were trained using two instances each of 1D CNN+BiLSTM, Stacked BiGRU, BiLSTM, and BiGRU. The outputs from these five models were majority-voted to produce the predictions.
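The final majority-voting step can be sketched as counting how often each category is predicted across models and keeping the most frequent ones (a simplified illustration, not IITP's actual code):

```python
from collections import Counter

def majority_vote(model_predictions, top_k=6):
    """Majority-vote category predictions from several models:
    tally how many models predicted each category, then keep the
    top_k most frequently predicted categories."""
    votes = Counter()
    for preds in model_predictions:
        votes.update(preds)
    return [cat for cat, _ in votes.most_common(top_k)]

# Toy example with three hypothetical models' predicted categories:
preds = [["hug", "want", "yes"],
         ["hug", "no", "yes"],
         ["hug", "want", "omg"]]
top = majority_vote(preds, top_k=2)  # "hug" gets 3 votes
```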
A summary of the submissions and their challenge scores is available in Table 3. Presented are the teams that submitted detailed technical reports and their code for verification. A full leaderboard that includes all the teams is available on the shared task website. This section highlights some observations related to the challenge.
Unbalanced Labels.
Emotion detection in text often suffers from a data imbalance problem. Our dataset (which supervises reactions, not emotions) exhibits a similar phenomenon. This would be emphasized if we used a metric that is sensitive to class imbalances (e.g., Macro-F1); our metric is less sensitive to this kind of problem. None of the teams decided to take any measures in that regard.

Label dependency
Multi-label datasets (Tsoumakas and Katakis, 2007; Zhang and Zhou, 2013) introduce challenging classification tasks. As we can see from Figure 5, in our dataset the categories are highly dependent. Jeong et al. (2020) used a new approach to represent this correlation. Classification of multi-label datasets is a developing area that requires further research.
Models
All of the submissions used deep learning models. Four of the models were transformer-based architectures, with most using pre-trained transformers (BERT or its variants). Some of the submissions enhanced these models in various ways, e.g., using k-fold CV ensembles (Jeong et al., 2020). The use of "large" vs. "base" models explained some of the performance differential.
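The k-fold ensembling mentioned above typically reduces to soft voting: averaging the per-category probabilities of all fitted estimators and ranking categories by the mean. A minimal sketch (illustrative only, not any team's actual code):

```python
import numpy as np

def soft_vote(prob_matrices, top_k=6):
    """Average per-category probabilities across estimators and
    return the indices of the top_k categories for each sample.

    `prob_matrices` is a list of (n_samples, n_categories) arrays,
    one per fitted estimator (e.g., 4 models x 5 folds = 20)."""
    mean_probs = np.mean(prob_matrices, axis=0)
    # argsort descending along categories, keep first top_k per row
    return np.argsort(-mean_probs, axis=1)[:, :top_k]

# Toy example: two estimators, one sample, four categories.
est1 = np.array([[0.9, 0.2, 0.6, 0.1]])
est2 = np.array([[0.6, 0.4, 0.8, 0.2]])
ranked = soft_vote([est1, est2], top_k=2)
# mean = [0.75, 0.3, 0.7, 0.15] -> top-2 indices [0, 2]
```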
Conclusion

We summarized the results of the EmotionGIF 2020 challenge, which was part of the SocialNLP Workshop at ACL 2020. The challenge presented a new task that entailed the prediction of affective reactions to text, using the categories of reaction GIFs as proxies to affective states. The data included tweets and their GIF reactions. We provided brief summaries of each of the eligible participants' entries. Most submissions used transformer-based architectures (BERT, RoBERTa, XLNet, etc.), reflecting their increasing use in NLP classification tasks, due to their superior performance but also the availability of easy-to-use programming libraries. The top system employed various methods, including ensembling, regularization, and a GCN, to achieve the top score.
Acknowledgements
This research was partially supported by the Ministry of Science and Technology of Taiwan under contracts MOST 108-2221-E-001-012-MY3 and MOST 108-2321-B-009-006-MY2.
References
Saeideh Bakhshi, David A Shamma, Lyndon Kennedy, Yale Song, Paloma De Juan, and Joseph 'Jofish' Kaye. 2016. Fast, cheap, and good: Why animated GIFs engage us. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pages 575-586.

Ye Bi, Shuo Wang, and Zhongrui Fan. 2020. EmotionGIF-CRIUS: A hybrid BERT and LightGBM based model for predicting emotion GIF categories on Twitter. Technical report, Ping An Technology (Shenzhen).

Michael Buhrmester, Tracy Kwang, and Samuel D Gosling. 2016. Amazon's Mechanical Turk: A new source of inexpensive, yet high-quality data? Perspectives on Psychological Science.

Zhao-Min Chen, Xiu-Shen Wei, Peng Wang, and Yanwen Guo. 2019. Multi-label image recognition with graph convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5177-5186.

Paul Ekman and Wallace V Friesen. 1971. Constants across cultures in the face and emotion. Journal of Personality and Social Psychology, 17(2):124.

Soumitra Ghosh, Arkaprava Roy, Asif Ekbal, and Pushpak Bhattacharyya. 2020. EmotionGIF-IITP: Ensemble-based automated deep neural system for predicting category(ies) of a GIF response. Technical report, IIT Patna.

Alec Go, Richa Bhayani, and Lei Huang. 2009. Twitter sentiment classification using distant supervision. CS224N project report, Stanford, 1(12):2009.

Woojung Choi, Dongseok Jeong, Hyunho Kim, Hyunwoo Kim, Jihwan Moon, and Hyunwoo Cho. 2020. EmotionGIF-MOJITOK: A multi-label classifier with pick-all-models ensemble and pick-all-labels normalised loss. Technical report, Platfarm Inc.

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pages 3146-3154.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature, 521(7553):436-444.

Cheolhyoung Lee, Kyunghyun Cho, and Wanmo Kang. 2019. Mixout: Effective regularization to finetune large-scale pretrained language models. In International Conference on Learning Representations.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Albert Mehrabian. 1996. Pleasure-arousal-dominance: A general framework for describing and measuring individual differences in temperament. Current Psychology, 14(4):261-292.

Aditya K Menon, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. 2019. Multilabel reductions: what is my loss optimising? In Advances in Neural Information Processing Systems, pages 10600-10611.

Saif M Mohammad and Svetlana Kiritchenko. 2015. Using hashtags to capture fine emotion categories from tweets. Computational Intelligence, 31(2):301-326.

Wilbert Phen, Mu-Hua Yang, and Yu-Wun Tseng. 2020. EmotionGIF-Whisky: BERT and RoBERTa for emotion classification for tweet data. Technical report, NCTU.

Chris Pool and Malvina Nissim. 2016. Distant supervision for emotion detection using Facebook reactions. In Proceedings of the Workshop on Computational Modeling of People's Opinions, Personality, and Emotions in Social Media (PEOPLES), pages 30-39, Osaka, Japan. The COLING 2016 Organizing Committee.

Grigorios Tsoumakas and Ioannis Katakis. 2007. Multi-label classification: An overview. International Journal of Data Warehousing and Mining (IJDWM), 3(3):1-13.

Wei-Yao Wang, Kai-Shiang Chang, and Yu-Chien Tang. 2020. EmotionGIF-Yankee: A sentiment classifier with robust model based ensemble methods. Technical report, NCTU.

Ali Yadollahi, Ameneh Gholipour Shahraki, and Osmar R Zaiane. 2017. Current state of text sentiment analysis from opinion to emotion mining. ACM Computing Surveys (CSUR), 50(2):1-33.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pages 5753-5763.

Min-Ling Zhang and Zhi-Hua Zhou. 2013. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8):1819-1837.

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2019. DialoGPT: Large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536.