On Analyzing Annotation Consistency in Online Abusive Behavior Datasets
Md Rabiul Awal, Rui Cao, Roy Ka-Wei Lee, Sandra Mitrović
University of Saskatchewan, Canada; University of Electronic Science and Technology of China, China; Istituto Dalle Molle di Studi sull'Intelligenza Artificiale
[email protected], [email protected], [email protected], [email protected]
Abstract
Online abusive behavior is an important issue that breaks the cohesiveness of online social communities and even raises public safety concerns in our societies. Motivated by this rising issue, researchers have proposed, collected, and annotated online abusive content datasets. These datasets play a critical role in facilitating research on online hate speech and abusive behaviors. However, the annotation of such datasets is a difficult task; it is often contentious what the true label of a given text should be, as the semantic difference between labels may be blurred (e.g., abusive and hate) and the judgment is often subjective. In this study, we proposed an analytical framework to study the annotation consistency in online hate and abusive content datasets. We applied our proposed framework to evaluate the consistency of the annotation in three popular datasets that are widely used in online hate speech and abusive behavior studies. We found that there is still a substantial amount of annotation inconsistency in the existing datasets, particularly when the labels are semantically similar.
Introduction
Misbehavior in online social media, such as cyberbullying, propagation of hate speech, and abusive content, has become an increasing problem. Such online misbehavior has not only sowed discord among individuals or communities online but also resulted in violent hate crimes (Williams, 2019; Relia et al., 2019; Mathew et al., 2019). Therefore, it is a pressing issue to detect and curb such misbehavior in online social media.

Traditional machine learning and deep learning approaches have been proposed to detect online misbehavior automatically. Recent surveys (Schmidt and Wiegand, 2017; Fortuna and Nunes, 2018) have comprehensively summarized these methods. Most of the automatic online misbehavior detection methods are supervised text classification methods trained and tested on annotated datasets. As such, the quality of the annotation has direct implications on the detection algorithms' performance and on the insights gained from online misbehavior research studies.

Three popular datasets are widely used in online misbehavior studies: WZ (Waseem and Hovy, 2016; Waseem, 2016), DT (Davidson et al., 2017), and the recently published FOUNTA (Founta et al., 2018) dataset. Waseem and Hovy (2016) first collected and annotated the WZ Twitter dataset into four classes: racism, sexism, both, and neither. Waseem (2016) subsequently enhanced the dataset by controlling the bias introduced by annotators. Davidson et al. (2017) argued that hate speech should be differentiated from offensive tweets; some tweets may contain hateful words but should be labeled as offensive because they do not meet the threshold for being classified as hate speech. The researchers collected the DT dataset and manually annotated it into three categories: offensive, hate, and neither. In a recent study, Founta et al. (2018) proposed the FOUNTA dataset. This dataset went through two rounds of annotation. In the first round, annotators were required to classify tweets into three categories: normal, spam, and inappropriate. Subsequently, the annotators were asked to further refine the labels of the tweets in the "inappropriate" category. The final version of the dataset consists of four classes: normal, spam, hate, and abusive.

While these datasets have facilitated many online misbehavior studies, few analyses have been done to evaluate and benchmark the quality of these datasets. The annotation of online misbehavior datasets is a challenging task. Firstly, the difference between certain labels may be subtle (Davidson et al., 2017; Founta et al., 2018). Secondly, the manual annotation process is often subject to the annotators' biases (Waseem, 2016). Therefore, we proposed an analytical framework to examine the annotation consistency in online misbehavior datasets. Included in our proposed framework is a two-step pipeline, which enables us to identify potential mislabeling and contentious annotation in the datasets.

We summarize our main contributions as follows:
• We proposed a novel analytical framework to examine the annotation consistency in online misbehavior datasets.
• We applied our proposed framework to analyze three popular real-world and publicly available datasets. To the best of our knowledge, this is the first study that quantitatively and qualitatively compares existing online misbehavior datasets.
• Our analysis showed that there is a substantial amount of annotation inconsistency in the existing datasets. We also empirically demonstrate case studies where the annotation inconsistency is likely to occur in the datasets.
Figure 1: Overall annotation consistency analysis framework. Step 1 (Classify-to-filter): all tweets are passed through the classifiers, and voting identifies the contentious tweets. Step 2 (Search Inconsistency): a tweet search engine retrieves similar tweets, from which the label inconsistency matrix is constructed.
Annotation Consistency Analysis Framework
Figure 1 shows our proposed annotation consistency analysis framework. Included in the analytical framework is a two-step process. In the first step, we train a set of classifiers to predict the labels of a given dataset of tweets. Voting is then performed to identify contentious tweets, i.e., tweets that are wrongly classified by more than half of the classifiers. The intuition is that it is more challenging to classify tweets that are annotated with contentious labels. For example, in Table 1, the tweet t1 is identified as contentious when more than half of the classifiers mispredicted its label. A potential reason for the wrong classification may be that t1, which is labeled as Hate, shares very similar attributes with other tweets that are labeled as Offensive. Such contentious labeling is likely to confuse the classifiers, resulting in the wrong prediction. In the second step, the set of retrieved contentious tweets is used as input queries to a search engine to find similar tweets in the dataset. Finally, we construct an annotation inconsistency matrix by comparing the labels of the contentious tweets and the retrieved similar tweets. The underlying assumption is that potential inconsistencies arise when the labels of a contentious tweet and its most similar tweet are different. For example, in Table 1, the search engine returns t2 as the most similar tweet to the contentious tweet t1. When we compare the labels of t1 and t2, we notice that the two tweets have different labels, flagging a potential annotation inconsistency for the tweet t1.
Step 1: Classify-to-filter

The classify-to-filter step can be further broken down into two stages: classification and voting.

In the classification stage, we adopt an ensemble approach and train five different text classifiers on a given online misbehavior dataset. Commonly used traditional machine learning and deep learning models are selected for our text classification task. Specifically, we use Logistic Regression (LR), Naive Bayes (NB), a single-layer Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), and a Convolutional Long Short-Term Memory network (C-LSTM) as the classifiers in this step. For LR and NB, we train the classifiers using the tweets' word-level term frequency-inverse document frequency (tf-idf) features. For the deep learning models, we use pre-trained GloVe word embeddings (Pennington, Socher, and Manning, 2014) to represent the words in the tweets, which are subsequently used as input to the classifiers. Each classifier is trained using 5-fold cross-validation, and the predictions on the tweets in the validation set are recorded for voting.

Table 1: Tweet examples
Id   Tweet                       Label      Contentious
t1   You are such a b*tch        Hate       Yes
t2   Don't be such a b*tch       Offensive  No
t3   B*tch please, try hard!     Offensive  No

In the voting stage, we consolidate the predictions made by the five classifiers and identify the contentious tweets. Specifically, given a tweet, if three or more classifiers predicted its label wrongly, we place this tweet into the contentious tweet set. While the incorrect predictions may be attributed to inconsistency in annotation, there could also be other reasons. For example, a tweet may contain rare words, and there may be insufficient data to train the models well enough to classify it. Therefore, we perform another step to further verify whether it is annotation inconsistency that led to the incorrect predictions.
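The sketch below illustrates one way to implement the classify-to-filter step, assuming the dataset is given as parallel lists `texts` and `labels` (the names are ours, not the paper's). For brevity, only two of the five classifiers are instantiated (LR and NB over word-level tf-idf); the CNN, RNN, and C-LSTM models over GloVe embeddings would be added to `make_classifiers` analogously. Out-of-fold predictions are collected with 5-fold cross-validation, and a tweet is flagged as contentious when more than half of the classifiers mispredict it (three or more out of the five used in the paper).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def make_classifiers():
    # Two of the five classifiers used in the paper; the CNN, RNN, and
    # C-LSTM models (over GloVe embeddings) would be added here as well.
    return [
        make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000)),
        make_pipeline(TfidfVectorizer(), MultinomialNB()),
    ]


def find_contentious(texts, labels, n_splits=5, seed=0):
    """Return indices of tweets mispredicted by more than half of the classifiers."""
    texts = np.asarray(texts, dtype=object)
    labels = np.asarray(labels, dtype=object)
    n_clf = len(make_classifiers())
    # wrong[i, c] is True if classifier c mispredicted tweet i (out of fold).
    wrong = np.zeros((len(texts), n_clf), dtype=bool)

    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in skf.split(texts, labels):
        for c, clf in enumerate(make_classifiers()):  # fresh models each fold
            clf.fit(texts[train_idx], labels[train_idx])
            wrong[val_idx, c] = clf.predict(texts[val_idx]) != labels[val_idx]

    # Contentious: mispredicted by more than half of the classifiers
    # (with the paper's five classifiers, this is the three-or-more rule).
    return np.where(wrong.sum(axis=1) > n_clf / 2)[0]
```

With only two classifiers the "more than half" rule degenerates to "both wrong"; with the full ensemble of five it matches the three-or-more criterion described above.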
Step 2: Search Inconsistency
In this step, we use the retrieved set of contentious tweets as input queries to our search engine to retrieve similar tweets. Specifically, given a query contentious tweet t_q, the search engine aims to retrieve its most similar tweet t_s from the dataset. To measure the similarity between tweets, we compute the cosine similarity between the tweets' tf-idf representations. The cosine similarity between two tweets is computed as follows:

\[
\mathrm{cos\_sim}(t_q, t_s) = \frac{\sum_{w \in t_q \cap t_s} t_{wq}\, t_{ws}}{\sqrt{\sum_{w \in t_q} t_{wq}^2}\,\sqrt{\sum_{w \in t_s} t_{ws}^2}} \qquad (1)
\]

where t_{wq} is the tf-idf weight of term w in the query tweet t_q, and t_{ws} is the tf-idf weight of term w in the similar tweet t_s. We compute the cosine similarity between each query tweet t_q and all tweets in the dataset, i.e., t_s ∈ T, and select the tweet with the highest cosine similarity score as the most similar tweet to the query tweet.

Finally, we compare the annotated labels of t_q and t_s: if the two annotated labels disagree, we flag that t_q might have an annotation inconsistency, as (a) the classifiers find it hard to classify this tweet, and (b) its annotated label differs from that of its most similar tweet. The annotation inconsistencies in the contentious tweets are subsequently reported in annotation inconsistency matrices in the next section.
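This step maps directly onto standard library primitives. Below is a minimal sketch, assuming `texts` and `labels` hold the full dataset and `contentious_idx` is the index set produced by Step 1 (these names are ours, not the paper's). It uses scikit-learn's TfidfVectorizer and cosine_similarity to implement Eq. (1), retrieves the most similar tweet for each query, and accumulates the (similar-tweet label, contentious-tweet label) counts that form an annotation inconsistency matrix of the kind reported in the evaluation.

```python
import numpy as np
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def search_inconsistencies(texts, labels, contentious_idx):
    """For every contentious tweet, retrieve its most similar tweet in the
    dataset (cosine similarity over tf-idf vectors, Eq. 1) and flag the pair
    when the two annotated labels disagree."""
    labels = np.asarray(labels, dtype=object)
    contentious_idx = np.asarray(contentious_idx)

    tfidf = TfidfVectorizer().fit_transform(texts)            # (n_tweets, vocab)
    sims = cosine_similarity(tfidf[contentious_idx], tfidf)   # (n_contentious, n_tweets)
    # Keep each query from matching itself (not stated explicitly in the paper,
    # but the query tweet is trivially its own nearest neighbour).
    sims[np.arange(len(contentious_idx)), contentious_idx] = -1.0
    most_similar = sims.argmax(axis=1)

    flagged, matrix = [], Counter()
    for q, s in zip(contentious_idx, most_similar):
        matrix[(labels[s], labels[q])] += 1   # (similar-tweet label, contentious label)
        if labels[q] != labels[s]:
            flagged.append((q, s))            # potential annotation inconsistency
    return flagged, matrix
```

The `matrix` counter corresponds to one annotation inconsistency matrix: rows are similar-tweet labels and columns are contentious-tweet labels, as in Tables 3 to 5 below.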
Evaluation and Discussion

We applied our proposed annotation consistency analysis framework to the three popular datasets that are widely used in online misbehavior studies: WZ (Waseem and Hovy, 2016; Waseem, 2016), DT (Davidson et al., 2017), and FOUNTA (Founta et al., 2018). The summary statistics of the datasets are presented in Table 2. Note that we combined the tweets in (Waseem and Hovy, 2016) and (Waseem, 2016) to form the current WZ dataset.

Table 2: Summary statistics of datasets
Dataset   #Tweets   Class distribution
WZ        13,202    racism (82), sexism (3,332), both (21), neither (9,767)
DT        24,783    hate (1,430), offensive (19,190), neither (4,163)
FOUNTA    99,999    normal (53,851), abusive (27,150), spam (14,029), hate (4,965)

Figure 2: Breakdown distributions of contentious and non-contentious tweets from WZ (left), DT (middle), and FOUNTA (right) retrieved in Step 1 of the annotation consistency analysis framework.

Figure 2 shows the breakdown distributions of contentious and non-contentious tweets retrieved in Step 1 of our proposed analytical framework. We observe that contentious tweets are found under all labels in the three datasets, i.e., the five classifiers made mistakes in predicting the true label of all kinds of tweets. Specifically, in WZ, the classifiers incorrectly predicted most of the sexism tweets. In DT, almost half of the hateful tweets are wrongly predicted. Similar observations are made in FOUNTA, with hateful and spam tweets seeing a higher percentage of misclassification.

As discussed earlier in the section, there could be multiple reasons for the misclassification. For instance, the hate speech detection problem may be hard because tweets within the same label have high variance, or there might be insufficient training data. In this paper, we are interested in understanding how much of the misclassification can be attributed to annotation inconsistency. Tables 3, 4, and 5 show the annotation inconsistency matrices generated in Step 2 of our analytical framework for WZ, DT, and FOUNTA, respectively.

Table 3: Annotation inconsistency matrix for WZ
                             Contentious tweet label
Similar tweet label    Racism   Sexism   Both   Neither
Racism                     16        0      0         1
Sexism                      9      662     10       222
Both                        0        4      0         1
Neither                    26      754      5       218
Table 4: Annotation inconsistency matrix for DT
                             Contentious tweet label
Similar tweet label    Offensive   Hate   Neither
Offensive                    282    760       282
Hate                          84    133        16
Neither                      105     41        74

Table 5: Annotation inconsistency matrix for FOUNTA
                             Contentious tweet label
Similar tweet label    Abusive   Hate   Spam   Normal
Abusive                    491   1547    736     1062
Hate                       347    370     93      192
Spam                       109     62    790     1024
Normal                     758   1133   3170      915

From Table 3, we observe that 662 contentious sexism tweets have their most similar tweets sharing the same label, while 745 of the contentious sexism tweets have their most similar tweets labeled as normal tweets (i.e., neither). This suggests that there could be inconsistencies in the annotation of sexism tweets, as two similar tweets may have different labels, one labeled as sexism and the other as normal. Similar observations are made for the other class labels, although the inconsistency in sexism tweet annotation is observed to be the highest. Similar observations are also made for the DT dataset in Table 4. A majority of the contentious hate tweets have their most similar tweets labeled as offensive. This is unsurprising, as even for human annotators it is often difficult to differentiate hateful tweets from offensive ones (Davidson et al., 2017). Nevertheless, such challenges in annotation also highlight the difficulty of the hate speech detection task.

Comparing the annotation inconsistency matrix of FOUNTA against the other two datasets, we note that there could be significantly more annotation inconsistencies in the FOUNTA dataset. As shown in Table 5, a high amount of annotation inconsistency is observed for all labels. For instance, we observe that 758 contentious abusive tweets have their most similar tweets labeled as normal, and a significant number of contentious hate tweets have their most similar tweets labeled as abusive or normal. We further verified the annotation inconsistencies in the FOUNTA dataset by retrieving some samples of the FOUNTA tweets. Table 6 shows three examples of FOUNTA contentious tweets and their most similar tweets. Surprisingly, we notice that the most similar tweets retrieved for contentious tweets C1 and C2 are retweets, and the retweets are annotated with different class labels. This exposes an issue in FOUNTA's annotation strategy. We postulate that the identical tweets (i.e., the retweets) were annotated by different human annotators, resulting in the inconsistencies. We further investigated and found that more than 10% of the tweets are retweets, and most of them have inconsistencies in their annotation.

Table 6: Examples of tweets from the FOUNTA dataset. Cx denotes a contentious tweet and Sx denotes the corresponding most similar tweet.
Id   Tweet                                                                  Label
C1   RT:[USER 1] How about we f**king hire trans boys to play trans boys   hate
S1   RT:[USER 1] How about we f**king hire trans boys to play trans boys   normal
C2   RT:[USER 2] I wish I wasn't so annoying like I even p*ss myself off   normal
S2   RT:[USER 2] I wish I wasn't so annoying like I even p*ss myself off   abusive
C3   RT:[USER 3] f**king faggot                                            hate
S3   [USER 4] f**king faggot                                               abusive
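A practical implication of this finding is that dataset users can run a simple duplicate-label check before training: group tweets by lightly normalized text and report groups annotated with more than one label. The sketch below is our own illustration of such a check, not the authors' procedure, and again assumes parallel lists `texts` and `labels`.

```python
import re
from collections import defaultdict


def conflicting_duplicates(texts, labels):
    """Group (near-)identical tweets, e.g. retweets, and report groups whose
    members were annotated with different labels."""
    groups = defaultdict(list)
    for text, label in zip(texts, labels):
        key = re.sub(r"^rt\b[:\s@]*", "", text.lower())   # strip a leading "RT"
        key = re.sub(r"\s+", " ", key).strip()            # collapse whitespace
        groups[key].append(label)
    return {text: sorted(set(found))
            for text, found in groups.items() if len(set(found)) > 1}
```

Applied to the C1/S1 pair in Table 6, for example, this check would surface the identical retweet text carrying both the hate and normal labels.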
Conclusion
In this paper, we proposed an analytical framework to examine annotation consistency in online misbehavior datasets. We applied our proposed framework to analyze three popular online misbehavior datasets. Our analysis showed that annotation inconsistencies exist in all three datasets, illustrating the challenges in online misbehavior data collection. Specifically, in the FOUNTA dataset, we found a significant amount of annotation inconsistency where identical tweets are annotated with different class labels. We also provide the updated datasets with annotation inconsistency information so that researchers may perform the necessary data preprocessing in future online misbehavior studies.

References
Davidson, T.; Warmsley, D.; Macy, M.; and Weber, I. 2017. Automated hate speech detection and the problem of offensive language. In Eleventh International AAAI Conference on Web and Social Media.

Fortuna, P., and Nunes, S. 2018. A survey on automatic detection of hate speech in text. ACM Computing Surveys (CSUR).

Founta, A. M.; Djouvas, C.; Chatzakou, D.; Leontiadis, I.; Blackburn, J.; Stringhini, G.; Vakali, A.; Sirivianos, M.; and Kourtellis, N. 2018. Large scale crowdsourcing and characterization of Twitter abusive behavior. In Twelfth International AAAI Conference on Web and Social Media.

Mathew, B.; Dutt, R.; Goyal, P.; and Mukherjee, A. 2019. Spread of hate speech in online social media. In Proceedings of the 10th ACM Conference on Web Science, 173–182.

Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543.

Relia, K.; Li, Z.; Cook, S. H.; and Chunara, R. 2019. Race, ethnicity and national origin-based discrimination in social media and hate crimes across 100 US cities. In Proceedings of the International AAAI Conference on Web and Social Media, volume 13, 417–427.

Schmidt, A., and Wiegand, M. 2017. A survey on hate speech detection using natural language processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, 1–10.

Waseem, Z., and Hovy, D. 2016. Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop, 88–93.

Waseem, Z. 2016. Are you a racist or am I seeing things? Annotator influence on hate speech detection on Twitter. In Proceedings of the First Workshop on NLP and Computational Social Science, 138–142.

Williams, M. 2019. The connection between online hate speech and real-world hate crime.