On Analyzing Annotation Consistency in Online Abusive Behavior Datasets
Md Rabiul Awal, Rui Cao, Roy Ka-Wei Lee, Sandra Mitrović
University of Saskatchewan, Canada; University of Electronic Science and Technology of China, China; Istituto Dalle Molle di Studi sull'Intelligenza Artificiale
[email protected], [email protected], [email protected], [email protected]
Abstract
Online abusive behavior is an important issue that breaks the cohesiveness of online social communities and even raises public safety concerns in our societies. Motivated by this rising issue, researchers have proposed, collected, and annotated online abusive content datasets. These datasets play a critical role in facilitating research on online hate speech and abusive behaviors. However, the annotation of such datasets is a difficult task; it is often contentious what the true label of a given text should be, as the semantic difference between labels may be blurred (e.g., abusive and hate) and the judgment is often subjective. In this study, we proposed an analytical framework to study the annotation consistency in online hate and abusive content datasets. We applied our proposed framework to evaluate the consistency of the annotation in three popular datasets that are widely used in online hate speech and abusive behavior studies. We found that there is still a substantial amount of annotation inconsistency in the existing datasets, particularly when the labels are semantically similar.
Introduction
Misbehavior in online social media, such as cyberbullying, propagation of hate speech, and abusive content, has become an increasing problem. Such online misbehavior has not only sowed discord among individuals or communities online but also resulted in violent hate crimes (Williams, 2019; Relia et al., 2019; Mathew et al., 2019). Therefore, it is a pressing issue to detect and curb such misbehavior in online social media.

Traditional machine learning and deep learning approaches have been proposed to detect online misbehavior automatically. Recent surveys (Schmidt and Wiegand, 2017; Fortuna and Nunes, 2018) have comprehensively summarized these methods. Most of the automatic online misbehavior detection methods are supervised text classification methods trained and tested on annotated datasets. As such, the quality of the annotation has direct implications on the detection algorithms' performance and on the insights gained from online misbehavior research studies.

Three popular datasets are widely used in online misbehavior studies: WZ (Waseem and Hovy, 2016; Waseem, 2016), DT (Davidson et al., 2017), and the recently published FOUNTA (Founta et al., 2018) dataset. Waseem and Hovy (2016) first collected and annotated the WZ Twitter dataset into four classes: racism, sexism, both, and neither. Waseem (2016) subsequently enhanced the dataset by controlling the bias introduced by annotators. Davidson et al. (2017) argued that hate speech should be differentiated from offensive tweets; some tweets may contain hateful words but should be labeled as offensive because they do not meet the threshold for being classified as hate speech. The researchers collected the DT dataset and manually annotated it into three categories: offensive, hate, and neither. In a recent study, Founta et al. (2018) proposed the FOUNTA dataset. This dataset went through two rounds of annotation. In the first round, annotators were required to classify tweets into three categories: normal, spam, and inappropriate. Subsequently, the annotators were asked to further refine the labels of the tweets in the "inappropriate" category. The final version of the dataset consists of four classes: normal, spam, hate, and abusive.

While these datasets have facilitated many online misbehavior studies, few analyses have been done to evaluate and benchmark the quality of these datasets. The annotation of online misbehavior datasets is a challenging task. Firstly, the difference between certain labels may be subtle (Davidson et al., 2017; Founta et al., 2018). Secondly, the manual annotation process is often subject to the annotators' biases (Waseem, 2016). Therefore, we proposed an analytical framework to examine the annotation consistency in online misbehavior datasets. Included in our proposed framework is a two-step pipeline, which enables us to identify potential mislabeling and contentious annotation in the datasets.

We summarize our main contributions as follows:
• We proposed a novel analytical framework to examine the annotation consistency in online misbehavior datasets.
• We applied our proposed framework to analyze three popular real-world and publicly available datasets. To the best of our knowledge, this is the first study that quantitatively and qualitatively compares existing online misbehavior datasets.
• Our analysis showed that there is a substantial amount of annotation inconsistency in the existing datasets. We also empirically demonstrate case studies where the annotation inconsistency is likely to occur in the datasets.
Figure 1: Overall annotation consistency analysis framework. Step 1 (Classify-to-filter): all tweets are passed through the classifiers, and voting identifies the contentious tweets. Step 2 (Search Inconsistency): a tweet search engine retrieves similar tweets, from which the label inconsistency matrix is constructed.
Annotation Consistency Analysis Framework
Figure 1 shows our proposed annotation consistency analysis framework. Included in the analytical framework is a two-step process. In the first step, we train a set of classifiers to predict the labels of a given dataset of tweets. Voting is then performed to identify contentious tweets, i.e., tweets that are wrongly classified by more than half of the classifiers. The intuition is that it is more challenging to classify tweets that are annotated with contentious labels. For example, in Table 1, the tweet t1 is identified as contentious when more than half of the classifiers mispredicted its label. A potential reason for the wrong classification may be that t1, which is labeled as Hate, shares very similar attributes with other tweets that are labeled as Offensive. Such contentious labeling is likely to confuse the classifiers, resulting in the wrong prediction. In the second step, the set of retrieved contentious tweets is used as input queries to a search engine to find similar tweets in the dataset. Finally, we construct an annotation inconsistency matrix by comparing the labels of the contentious tweets and the retrieved similar tweets. The underlying assumption is that potential inconsistencies arise when the labels of a contentious tweet and its most similar tweet are different. For example, in Table 1, the search engine returns t2 as the most similar tweet to the contentious tweet t1. When we compare the labels of t1 and t2, we notice that the two tweets have different labels, flagging a potential annotation inconsistency for the tweet t1.
Step 1: Classify-to-filter

The classify-to-filter step can be further broken down into two stages: classification and voting.

In the classification stage, we adopt an ensemble approach and train five different text classifiers on a given online misbehavior dataset. Commonly used traditional machine learning and deep learning models are selected for our text classification task. Specifically, we use Logistic Regression (LR), Naive Bayes (NB), a single-layer Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), and a Convolutional Long Short-Term Memory network (C-LSTM) as the classifiers in this step. For LR and NB, we train the classifiers using the tweets' word-level term frequency-inverse document frequency (tf-idf) features. For the deep learning models, we use pre-trained GloVe word embeddings (Pennington, Socher, and Manning, 2014) to represent the words in the tweets, which are subsequently used as input to the classifiers. Each classifier is trained using 5-fold cross-validation, and the predictions on the tweets in the validation set are recorded for voting.

Table 1: Tweet examples
Id   Tweet                       Label      Contentious
t1   You are such a b*tch        Hate       Yes
t2   Don't be such a b*tch       Offensive  No
t3   B*tch please, try hard!     Offensive  No

In the voting stage, we consolidate the predictions made by the five classifiers and identify the contentious tweets. Specifically, given a tweet, if three or more classifiers predicted its label wrongly, we place this tweet into the contentious tweet set. While the incorrect predictions may be attributed to inconsistency in annotation, there could also be other reasons. For example, a tweet may contain rare words, and there may be insufficient data to train the models well enough to classify it. Therefore, we perform another step to further verify whether it is annotation inconsistency that led to the incorrect predictions.
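The sketch below illustrates one way to implement the classify-to-filter step, assuming the dataset is given as parallel lists `texts` and `labels` (the names are ours, not the paper's). For brevity, only two of the five classifiers are instantiated (LR and NB over word-level tf-idf); the CNN, RNN, and C-LSTM models over GloVe embeddings would be added to `make_classifiers` analogously. Out-of-fold predictions are collected with 5-fold cross-validation, and a tweet is flagged as contentious when more than half of the classifiers mispredict it (three or more out of the five used in the paper).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def make_classifiers():
    # Two of the five classifiers used in the paper; the CNN, RNN, and
    # C-LSTM models (over GloVe embeddings) would be added here as well.
    return [
        make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000)),
        make_pipeline(TfidfVectorizer(), MultinomialNB()),
    ]


def find_contentious(texts, labels, n_splits=5, seed=0):
    """Return indices of tweets mispredicted by more than half of the classifiers."""
    texts = np.asarray(texts, dtype=object)
    labels = np.asarray(labels, dtype=object)
    n_clf = len(make_classifiers())
    # wrong[i, c] is True if classifier c mispredicted tweet i (out of fold).
    wrong = np.zeros((len(texts), n_clf), dtype=bool)

    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in skf.split(texts, labels):
        for c, clf in enumerate(make_classifiers()):  # fresh models each fold
            clf.fit(texts[train_idx], labels[train_idx])
            wrong[val_idx, c] = clf.predict(texts[val_idx]) != labels[val_idx]

    # Contentious: mispredicted by more than half of the classifiers
    # (with the paper's five classifiers, this is the three-or-more rule).
    return np.where(wrong.sum(axis=1) > n_clf / 2)[0]
```

With only two classifiers the "more than half" rule degenerates to "both wrong"; with the full ensemble of five it matches the three-or-more criterion described above.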
Step 2: Search Inconsistency
In this step, we use the retrieved set of contentious tweets as input queries to our search engine to retrieve similar tweets. Specifically, given a query contentious tweet t_q, the search engine aims to retrieve its most similar tweet t_s from the dataset. To measure the similarity between tweets, we compute the cosine similarity between the tweets' tf-idf representations. The cosine similarity between two tweets is computed as follows:

\[
\mathrm{cos\_sim}(t_q, t_s) = \frac{\sum_{w \in t_q \cap t_s} t_{wq}\, t_{ws}}{\sqrt{\sum_{w \in t_q} t_{wq}^2}\,\sqrt{\sum_{w \in t_s} t_{ws}^2}} \qquad (1)
\]

where t_{wq} is the tf-idf weight of term w in the query tweet t_q, and t_{ws} is the tf-idf weight of term w in the similar tweet t_s. We compute the cosine similarity between each query tweet t_q and all tweets in the dataset, i.e., t_s ∈ T, and select the tweet with the highest cosine similarity score as the most similar tweet to the query tweet.

Finally, we compare the annotated labels of t_q and t_s: if the two annotated labels disagree, we flag that t_q might have an annotation inconsistency, as (a) the classifiers find it hard to classify this tweet, and (b) its annotated label differs from that of its most similar tweet. The annotation inconsistencies in the contentious tweets are subsequently reported in annotation inconsistency matrices in the next section.
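This step maps directly onto standard library primitives. Below is a minimal sketch, assuming `texts` and `labels` hold the full dataset and `contentious_idx` is the index set produced by Step 1 (these names are ours, not the paper's). It uses scikit-learn's TfidfVectorizer and cosine_similarity to implement Eq. (1), retrieves the most similar tweet for each query, and accumulates the (similar-tweet label, contentious-tweet label) counts that form an annotation inconsistency matrix of the kind reported in the evaluation.

```python
import numpy as np
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def search_inconsistencies(texts, labels, contentious_idx):
    """For every contentious tweet, retrieve its most similar tweet in the
    dataset (cosine similarity over tf-idf vectors, Eq. 1) and flag the pair
    when the two annotated labels disagree."""
    labels = np.asarray(labels, dtype=object)
    contentious_idx = np.asarray(contentious_idx)

    tfidf = TfidfVectorizer().fit_transform(texts)            # (n_tweets, vocab)
    sims = cosine_similarity(tfidf[contentious_idx], tfidf)   # (n_contentious, n_tweets)
    # Keep each query from matching itself (not stated explicitly in the paper,
    # but the query tweet is trivially its own nearest neighbour).
    sims[np.arange(len(contentious_idx)), contentious_idx] = -1.0
    most_similar = sims.argmax(axis=1)

    flagged, matrix = [], Counter()
    for q, s in zip(contentious_idx, most_similar):
        matrix[(labels[s], labels[q])] += 1   # (similar-tweet label, contentious label)
        if labels[q] != labels[s]:
            flagged.append((q, s))            # potential annotation inconsistency
    return flagged, matrix
```

The `matrix` counter corresponds to one annotation inconsistency matrix: rows are similar-tweet labels and columns are contentious-tweet labels, as in Tables 3 to 5 below.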
Evaluation and Discussion

We applied our proposed annotation consistency analysis framework to the three popular datasets that are widely used in online misbehavior studies: WZ (Waseem and Hovy, 2016; Waseem, 2016), DT (Davidson et al., 2017), and FOUNTA (Founta et al., 2018). The summary statistics of the datasets are presented in Table 2. Note that we combined the tweets in (Waseem and Hovy, 2016) and (Waseem, 2016) to form the current WZ dataset.

Table 2: Summary statistics of datasets
Dataset   #Tweets   Class distribution
WZ        13,202    racism (82), sexism (3,332), both (21), neither (9,767)
DT        24,783    hate (1,430), offensive (19,190), neither (4,163)
FOUNTA    99,999    normal (53,851), abusive (27,150), spam (14,029), hate (4,965)

Figure 2: Breakdown distributions of contentious and non-contentious tweets from WZ (left), DT (middle), and FOUNTA (right) retrieved in Step 1 of the annotation consistency analysis framework.

Figure 2 shows the breakdown distributions of contentious and non-contentious tweets retrieved in Step 1 of our proposed analytical framework. We observe that contentious tweets are found under all labels in the three datasets, i.e., the five classifiers made mistakes in predicting the true label of all kinds of tweets. Specifically, in WZ, the classifiers incorrectly predicted most of the sexism tweets. In DT, almost half of the hateful tweets are wrongly predicted. Similar observations are made in FOUNTA, with hateful and spam tweets seeing a higher percentage of misclassification.

As discussed earlier in the section, there could be multiple reasons for the misclassification. For instance, the hate speech detection problem may be hard because tweets within the same label have high variance, or there might be insufficient training data. In this paper, we are interested in understanding how much of the misclassification can be attributed to annotation inconsistency. Tables 3, 4, and 5 show the annotation inconsistency matrices generated in Step 2 of our analytical framework for WZ, DT, and FOUNTA, respectively.

Table 3: Annotation inconsistency matrix for WZ
                             Contentious tweet label
Similar tweet label    Racism   Sexism   Both   Neither
Racism                     16        0      0         1
Sexism                      9      662     10       222
Both                        0        4      0         1
Neither                    26      754      5       218
Table 4: Annotation inconsistency matrix for DT
                             Contentious tweet label
Similar tweet label    Offensive   Hate   Neither
Offensive                    282    760       282
Hate                          84    133        16
Neither                      105     41        74

Table 5: Annotation inconsistency matrix for FOUNTA
                             Contentious tweet label
Similar tweet label    Abusive   Hate   Spam   Normal
Abusive                    491   1547    736     1062
Hate                       347    370     93      192
Spam                       109     62    790     1024
Normal                     758   1133   3170      915

From Table 3, we observe that 662 contentious sexism tweets have their most similar tweets sharing the same label, while 745 of the contentious sexism tweets have their most similar tweets labeled as normal tweets (i.e., neither). This suggests that there could be inconsistencies in the annotation of sexism tweets, as two similar tweets may have different labels, one labeled as sexism and the other as normal. Similar observations are made for the other class labels, although the inconsistency in sexism tweet annotation is observed to be the highest. Similar observations are also made for the DT dataset in Table 4. A majority of the contentious hate tweets have their most similar tweets labeled as offensive. This is unsurprising, as even for human annotators it is often difficult to differentiate hateful tweets from offensive ones (Davidson et al., 2017). Nevertheless, such challenges in annotation also highlight the difficulty of the hate speech detection task.

Comparing the annotation inconsistency matrix of FOUNTA against the other two datasets, we note that there could be significantly more annotation inconsistencies in the FOUNTA dataset. As shown in Table 5, a high amount of annotation inconsistency is observed for all labels. For instance, we observe that 758 contentious abusive tweets have their most similar tweets labeled as normal, and a significant number of contentious hate tweets have their most similar tweets labeled as abusive or normal. We further verified the annotation inconsistencies in the FOUNTA dataset by retrieving some samples of the FOUNTA tweets. Table 6 shows three examples of FOUNTA contentious tweets and their most similar tweets. Surprisingly, we notice that the most similar tweets retrieved for contentious tweets C1 and C2 are retweets, and the retweets are annotated with different class labels. This exposes an issue in FOUNTA's annotation strategy. We postulate that the identical tweets (i.e., the retweets) were annotated by different human annotators, resulting in the inconsistencies. We further investigated and found that more than 10% of the tweets are retweets, and most of them have inconsistencies in their annotation.

Table 6: Examples of tweets from the FOUNTA dataset. Cx denotes a contentious tweet and Sx denotes the corresponding most similar tweet.
Id   Tweet                                                                  Label
C1   RT:[USER 1] How about we f**king hire trans boys to play trans boys   hate
S1   RT:[USER 1] How about we f**king hire trans boys to play trans boys   normal
C2   RT:[USER 2] I wish I wasn't so annoying like I even p*ss myself off   normal
S2   RT:[USER 2] I wish I wasn't so annoying like I even p*ss myself off   abusive
C3   RT:[USER 3] f**king faggot                                            hate
S3   [USER 4] f**king faggot                                               abusive
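A practical implication of this finding is that dataset users can run a simple duplicate-label check before training: group tweets by lightly normalized text and report groups annotated with more than one label. The sketch below is our own illustration of such a check, not the authors' procedure, and again assumes parallel lists `texts` and `labels`.

```python
import re
from collections import defaultdict


def conflicting_duplicates(texts, labels):
    """Group (near-)identical tweets, e.g. retweets, and report groups whose
    members were annotated with different labels."""
    groups = defaultdict(list)
    for text, label in zip(texts, labels):
        key = re.sub(r"^rt\b[:\s@]*", "", text.lower())   # strip a leading "RT"
        key = re.sub(r"\s+", " ", key).strip()            # collapse whitespace
        groups[key].append(label)
    return {text: sorted(set(found))
            for text, found in groups.items() if len(set(found)) > 1}
```

Applied to the C1/S1 pair in Table 6, for example, this check would surface the identical retweet text carrying both the hate and normal labels.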
Conclusion
In this paper, we proposed an analytical framework to examine annotation consistency in online misbehavior datasets. We applied our proposed framework to analyze three popular online misbehavior datasets. Our analysis showed that annotation inconsistencies exist in all three datasets, illustrating the challenges in online misbehavior data collection. Specifically, in the FOUNTA dataset, we found a significant amount of annotation inconsistency where identical tweets are annotated with different class labels. We also provide the updated datasets with annotation inconsistency information so that researchers may perform the necessary data preprocessing in future online misbehavior studies.

References
Davidson, T.; Warmsley, D.; Macy, M.; and Weber, I. 2017. Automated hate speech detection and the problem of offensive language. In Eleventh International AAAI Conference on Web and Social Media.

Fortuna, P., and Nunes, S. 2018. A survey on automatic detection of hate speech in text. ACM Computing Surveys (CSUR).

Founta, A. M.; Djouvas, C.; Chatzakou, D.; Leontiadis, I.; Blackburn, J.; Stringhini, G.; Vakali, A.; Sirivianos, M.; and Kourtellis, N. 2018. Large scale crowdsourcing and characterization of Twitter abusive behavior. In Twelfth International AAAI Conference on Web and Social Media.

Mathew, B.; Dutt, R.; Goyal, P.; and Mukherjee, A. 2019. Spread of hate speech in online social media. In Proceedings of the 10th ACM Conference on Web Science, 173–182.

Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543.

Relia, K.; Li, Z.; Cook, S. H.; and Chunara, R. 2019. Race, ethnicity and national origin-based discrimination in social media and hate crimes across 100 US cities. In Proceedings of the International AAAI Conference on Web and Social Media, volume 13, 417–427.

Schmidt, A., and Wiegand, M. 2017. A survey on hate speech detection using natural language processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, 1–10.

Waseem, Z., and Hovy, D. 2016. Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop, 88–93.

Waseem, Z. 2016. Are you a racist or am I seeing things? Annotator influence on hate speech detection on Twitter. In Proceedings of the First Workshop on NLP and Computational Social Science, 138–142.

Williams, M. 2019. The connection between online hate speech and real-world hate crime.