Exploring multi-task multi-lingual learning of transformer models for hate speech and offensive speech identification in social media
Sudhanshu Mishra · Shivangi Prasad · Shubhanshu Mishra∗
Abstract
Hate speech has become a major content moderation issue for online social media platforms. Given the volume and velocity of online content production, it is impossible to manually moderate hate speech related content on any platform. In this paper we utilize a multi-task and multi-lingual approach based on recently proposed Transformer Neural Networks to solve three sub-tasks for hate speech. These sub-tasks were part of the 2019 shared task on hate speech and offensive content (HASOC) identification in Indo-European languages. We expand on our submission to that competition by utilizing multi-task models which are trained using three approaches, a) multi-task learning with separate task heads, b) back-translation, and c) multi-lingual training. Finally, we investigate the performance of various models and identify instances where the Transformer based models perform differently and better. We show that it is possible to utilize different combined approaches to obtain models that can generalize easily on different languages and tasks, while trading off slight accuracy (in some cases) for a much reduced inference time compute cost. We open source an updated version of our HASOC 2019 code with the new improvements at https://github.com/socialmediaie/MTML_HateSpeech.

Keywords
Hate Speech · Offensive content · Transformer Models · BERT · Language Models · Neural Networks · Multi-lingual · Multi-Task Learning · Social Media · Natural Language Processing · Machine Learning · Deep Learning · Open Source

S. Mishra
Indian Institute of Technology Kanpur, India
E-mail: [email protected]

S. Prasad
University of Illinois at Urbana-Champaign, USA
E-mail: [email protected]

S. Mishra∗ [Corresponding Author]
University of Illinois at Urbana-Champaign, USA
E-mail: [email protected]
1 Introduction

With increased access to the internet, the number of people that are connected through social media is higher than ever (Perrin, 2015). Thus, social media platforms are often held responsible for framing the views and opinions of a large number of people (Duggan et al., 2017). However, this freedom to voice our opinion has been challenged by the increase in the use of hate speech (Mondal et al., 2017). The anonymity of the internet grants people the power to completely change the context of a discussion and suppress a person's personal opinion (Sticca and Perren, 2013). These hateful posts and comments not only affect society at a micro scale but also at a global level, by influencing people's views regarding important global events like elections and protests (Duggan et al., 2017). Given the volume of online communication happening on various social media platforms and the need for more fruitful communication, there is a growing need to automate the detection of hate speech. For the scope of this paper we adopt the definition of hate speech and offensive speech as defined in Mandl et al. (2019) as "insulting, hurtful, derogatory, or obscene content directed from one person to another person" (quoted from (Mandl et al., 2019)).

In order to automate hate speech detection, the Natural Language Processing (NLP) community has made significant progress, which has been accelerated by the organization of numerous shared tasks aimed at identifying hate speech (Mandl et al., 2019; Kumar et al., 2020, 2018). Furthermore, there has been a proliferation of new methods for automated hate speech detection in social media text (Salminen et al., 2018; Mishra et al., 2020b; Mishra and Mishra, 2019; Mishra, 2020a; Waseem et al., 2017; Struß et al., 2019; Mandl et al., 2019; Mondal et al., 2017).
However, working with social media text is difficult (Eisenstein, 2013; Mishra and Diesner, 2016; Mishra et al., 2014; Mishra and Diesner, 2019; Mishra, 2019, 2020b,a), as people use combinations of different languages, spellings, and words that one may never find in any dictionary. A common pattern across many hate speech identification tasks (Mandl et al., 2019; Kumar et al., 2020; Waseem et al., 2017; Zampieri et al., 2019; Basile et al., 2019; Struß et al., 2019) is the identification of various aspects of hate speech, e.g., in HASOC 2019 (Mandl et al., 2019), the organizers divided the task into three sub-tasks, which focused on identifying the presence of hate speech; classification of hate speech into offensive, profane, and hateful; and identifying if the hate speech is targeted towards an entity.

Many researchers have tried to address these types of tiered hate speech classification tasks using separate models, one for each sub-task (see reviews of recent shared tasks: Zampieri et al. (2019); Kumar et al. (2018, 2020); Mandl et al. (2019); Struß et al. (2019)). However, we consider this approach limited for application to systems which consume large amounts of data and are computationally constrained for efficiently flagging hate speech. The limitation of the existing approach stems from the requirement to run several models, one for each language and sub-task. In this work, we propose a unified modeling framework which identifies the relationship between all tasks across multiple languages. Our aim is to perform as well as, if not better than, the best model for each task-language combination.
Our approach is inspired by the promising results of multi-task learning in some of our recent works (Mishra, 2019, 2020b,a). Additionally, while building a unified model which can perform well on all tasks is challenging, an important benefit of these models is their computational efficiency, achieved by reduced compute and maintenance costs, which can allow the system to trade off slight accuracy for efficiency.

In this paper, we propose the development of such a universal modelling framework, which can leverage recent advancements in machine learning to achieve competitive, and in a few cases state-of-the-art, performance on a variety of hate speech identification sub-tasks across multiple languages. Our framework encompasses a variety of modelling architectures which can either train on all tasks, all languages, or a combination of both. We extend our prior work in Mishra and Mishra (2019); Mishra et al. (2020b); Mishra (2020b,a, 2019) to develop efficient models for hate speech identification and benchmark them against the HASOC 2019 corpus, which consists of social media posts in three languages, namely English, Hindi, and German. We open source our implementation to allow its usage by the wider research community. Our main contributions are as follows:

1. Investigation of more efficient modeling architectures which use a) multi-task learning with separate task heads, b) back-translation, and c) multi-lingual training. These architectures can generalize easily on different languages and tasks, while trading off slight accuracy (in some cases) for a much reduced inference time compute cost.
2. Investigation of the performance of various models and identification of instances where our new models differ in their performance.
3. Open sourcing of pre-trained models and model outputs at Mishra et al. (2020a), along with the updated code for using these models at: https://github.com/socialmediaie/MTML_HateSpeech
2 Related Work

Prior work in the area of hate speech identification (see Schmidt and Wiegand (2017) for a detailed review of prior methods) focuses on different aspects of hate speech identification, namely analyzing what constitutes hate speech; the high modality and other issues encountered when dealing with social media data; and finally, model architectures and developments in NLP that are being used in identifying hate speech these days. There is also prior literature
focusing on the different aspects of hateful speech and tackling the subjectivity that it imposes. There are many shared tasks (Mandl et al. (2019); Kumar et al. (2018, 2020); Struß et al. (2019); Basile et al. (2019)) that tackle hate speech detection by classifying it into different categories. Each shared task focuses on a different aspect of hate speech. Waseem et al. (2017) proposed a typology on the abusive nature of hate speech, classifying it into generalized, explicit, and implicit abuse. Basile et al. (2019) focused on hateful and aggressive posts targeted towards women and immigrants. Mandl et al. (2019) focused on identifying targeted and un-targeted insults and classifying hate speech into hateful, offensive, and profane content. Kumar et al. (2018, 2020) tackled aggression and misogynistic content identification for trolling and cyberbullying posts. Vidgen et al. (2019) identify that most of these shared tasks broadly fall into three classes: individual directed abuse, identity directed abuse, and concept directed abuse. Their work also puts into context the various challenges encountered in abusive content detection.

Unlike other domains of information retrieval, there is a lack of large data-sets in this field. Moreover, the data-sets available are highly skewed and focus on a particular type of hate speech. For example, Davidson et al. (2017) model the problem as a generic abusive content identification challenge; however, these posts are mostly related to racism and sexism. Furthermore, in the real world, hateful posts do not fall into a single type of hate speech. There is significant overlap between different hateful classes, making hate speech identification a multi-label problem.

A wide variety of system architectures, ranging from classical machine learning to recent deep learning models, have been tried for various aspects of hate speech identification. Facebook, YouTube, and Twitter are the major sources of data for most data-sets.
Burnap and Williams (2015) used SVM and ensemble techniques for identifying hate speech in Twitter data. Razavi et al. (2010) proposed an approach for abuse detection using a dictionary of insulting and abusive words and phrases. Van Hee et al. (2015) used bag of words n-gram features and trained an SVM model on a cyberbullying dataset. Salminen et al. (2018) achieved an F1-score of 0.79 on classification of hateful YouTube and Facebook posts using a linear SVM model employing TF-IDF weighted n-grams.

Recently, models based on deep learning techniques have also been applied to the task of hate speech identification. These models often rely on distributed representations or embeddings, e.g., FastText embeddings (Joulin et al., 2017) and paragraph2vec distributed representations (Le and Mikolov, 2014). Badjatiya et al. (2017) employed an LSTM architecture to tune Glove word embeddings on the DATA-TWITTER-TWH data-set. Risch and Krestel (2018) used a neural network architecture using a GRU layer and ensemble methods for the TRAC 2018 (Kumar et al., 2018) shared task on aggression identification. They also tried back-translation as a data augmentation technique to increase the data-set size. Wang (2018) illustrated the use of sequentially combining CNNs with RNNs for abuse detection, showing that this approach was better than using only the CNN architecture, giving a 1% improvement in the F1-score. One of the most recent developments in NLP is the transformer architecture introduced by Vaswani et al. (2017). Utilizing the transformer architecture, Devlin et al. (2019) provide methods to pre-train models for language understanding (BERT) that have achieved state of the art results in many NLP tasks and are promising for hate speech detection as well. BERT based models achieved competitive performance in the HASOC 2019 shared tasks.
In Mishra and Mishra (2019), we fine-tuned BERT base models for the various HASOC shared tasks, producing the top performing model in some of the sub-tasks. A similar approach was also used by us for the TRAC 2020 shared tasks on aggression identification (Mishra et al., 2020b), still achieving competitive performance with the other models without using any ensemble techniques. An interesting approach there was the use of multi-lingual models via joint training on different languages, which presents us with a unified model for different languages in abusive content detection. Ensemble techniques using BERT models (Risch and Krestel, 2020) produced the top performing models in many of the shared tasks in TRAC 2020. Recently, multi-task learning has been used for improving performance on NLP tasks (Liu et al., 2016; Søgaard and Goldberg, 2016), especially social media information extraction tasks (Mishra, 2019), and simpler variants have been tried for hate speech identification in our recent works (Mishra and Mishra, 2019; Mishra et al., 2020b). Florio et al. (2020) investigated the usage of AlBERTo for monitoring hate speech in Italian on Twitter. Their results show that even though AlBERTo is sensitive to the fine tuning set, its performance increases given enough training time. Mozafari et al. (2020) employ a transfer learning approach using BERT for hate speech detection. Ranasinghe and Zampieri (2020) use cross-lingual embeddings to identify offensive content in a multilingual setting. Our multi-lingual approach is similar in spirit to the method proposed in Plank (2017), which uses the same model architecture and aligned word embeddings to solve the tasks. There has also been some work on developing solutions for multilingual toxic comments, which can be related to hate speech.
Recently, Mishra (2020c) also used a single model across various tasks, which performed very well for event detection tasks for five Indian languages.

There have been numerous competitions dealing with hate speech evaluation. OffensEval Zampieri et al. (2019) is one of the popular shared tasks dealing with offensive language in social media, featuring three sub-tasks for discriminating between offensive and non-offensive posts. Another popular shared task in SemEval is the HateEval Basile et al. (2019) task on the detection of hate against women and immigrants. The 2019 version of HateEval consists of two sub-tasks for the determination of hateful and aggressive posts. GermEval Struß et al. (2019) is another shared task quite similar to HASOC. It focused on the identification of offensive language in German tweets, featuring two sub-tasks following a binary and a multi-class classification of the German tweets.
An important aspect of hate speech is that it is primarily multi-modal in nature. A large portion of the hateful content that is shared on social media is in the form of memes, which feature multiple modalities like audio, text, images, and in some cases videos as well. Yang et al. (2019) present different fusion approaches to tackle multi-modal information for hate speech detection. Gomez et al. (2020) explore multi-modal hate speech consisting of text and image modalities. They propose various multi-modal architectures to jointly analyze both the textual and visual information. Facebook recently released the hateful memes data-set for the Hateful Memes challenge Kiela et al. (2020) to provide a complex data-set where it is difficult for uni-modal models to achieve good performance.
3 Methods

For this paper, we extend some of the techniques that we have used in TRAC 2020 in Mishra et al. (2020b), as well as Mishra (2019, 2020a,b), and apply them to the HASOC data-set Mandl et al. (2019). Furthermore, we extend the work that we did as part of the HASOC 2019 shared task Mishra and Mishra (2019) by experimenting with multi-lingual training, back-translation based data-augmentation, and multi-task learning to tackle the data sparsity issue of the HASOC 2019 data-set.

Fig. 1: Task Description

3.1 Task Definition and Data

All of the experiments reported hereafter have been done on the HASOC 2019 data-set (Mandl et al., 2019), consisting of posts in English (EN), Hindi (HI), and German (DE). The shared tasks of HASOC 2019 had three sub-tasks (A, B, C) for both English and Hindi and two sub-tasks (A, B) for German. The description of each sub-task is as follows (see Figure 1 for details):

– Sub-Task A: Posts have to be classified into hate speech (HOF) and non-offensive content (NOT).
– Sub-Task B: A fine grained classification of the hateful posts in sub-task A. Hate speech posts have to be classified into the type of hate they represent, i.e., containing hate speech content (HATE), containing offensive content (OFFN), and containing profane words (PRFN).
– Sub-Task C: Another fine grained classification of the hateful posts in sub-task A. This sub-task required us to identify whether the hate speech was targeted towards an individual or group (TIN) or whether it was un-targeted (UNT).

Table 1: Distribution of number of tweets in different data-sets and splits.

task  DE (train/dev/test)   EN (train/dev/test)     HI (train/dev/test)
A     (not recovered)       (not recovered)         (not recovered)
B     407 / 794 / 850       2,261 / 302 / 1,153     2,469 / 136 / 1,318
C     (not recovered)       (not recovered)         (not recovered)

The HASOC 2019 data-set consists of posts taken from Twitter and Facebook. The data-set only consists of text and labels and does not include any contextual information or meta-data of the original posts, e.g., time information. The data distribution for each language and sub-task is mentioned in Table 1. We can observe that the sample size for each language is of the order of a few thousand posts, which is an order of magnitude smaller than other datasets like OffensEval (13,200 posts), HateEval (19,000 posts), and the Kaggle Toxic Comments dataset (240,000 posts). This can pose a challenge for training deep learning models, which often consist of a large number of parameters, from scratch. Class-wise data distribution for each language is available in the appendix .1, figures 4, 5, and 6. These figures show that the label distribution is highly skewed for task C, such as the label UNT, which is quite underrepresented. Similarly, for German the task A data is quite unbalanced. For more details on the dataset, along with the details on its creation and motivation, we refer the reader to Mandl et al. (2019). Mandl et al. (2019) report that the inter-annotator agreement is in the range of 60% to 70% for English and Hindi. Furthermore, the inter-annotator agreement is more than 86% for German.

3.2 Fine-tuning transformer based models

Transformer based models, especially BERT (Devlin et al., 2019), have proven to be successful in achieving very good results on a range of NLP tasks. Upon its release, BERT based models became state of the art for 11 NLP tasks (Devlin et al., 2019). This motivated us to try out BERT for hate speech detection. We had used multiple variants of the BERT model during HASOC
2019 shared tasks Mishra and Mishra (2019). We also experimented with other transformer models and BERT during TRAC 2020 Mishra et al. (2020b). However, based on our experiments, we find the original BERT models to be the best performing for most tasks. Hence, for this paper we only implement our models on those. For our experiments we use the open source implementation of BERT provided by Wolf et al. (2019) (https://github.com/huggingface/transformers). A common practice for using BERT based models is to fine-tune an existing pre-trained model on data from a new task. For fine-tuning the pre-trained BERT models, we used the BERT for Sequence Classification paradigm present in the HuggingFace library. We fine-tune BERT using various architectures. A visual description of these architectures is shown in Figure 2. These models are explained in detail in later sections.

Fig. 2: An overview of various model architectures we used. Shaded task boxes represent that we first compute a marginal representation of labels only belonging to that task before computing the loss.

To process the text, we first use a pre-trained BERT tokenizer to convert the input sentences into tokens. These tokens are then passed to the model, which generates BERT specific embeddings for each token. Each sequence of tokens is wrapped with a [CLS] and a [SEP] token. Because BERT's self-attention layers attend over all hidden states of the sequence, the model captures contextual information well even for longer sequences. The pre-trained BERT model generates an output vector for each of the tokens. For sequence classification tasks, the vector corresponding to the [CLS] token is used, as it holds the contextual information about the complete sentence.
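The input preparation just described can be sketched as follows. This is a schematic illustration only: the whitespace tokenizer and the toy vocabulary with its token ids are stand-ins for the pre-trained WordPiece tokenizer from the HuggingFace library, but the [CLS]/[SEP] wrapping, truncation, and padding to a fixed maximum length mirror the setup we use.

```python
# Schematic sketch of BERT-style input preparation: wrap each post with
# [CLS] and [SEP], truncate to a maximum length, and pad to a fixed size.
# The whitespace tokenizer and vocabulary are illustrative stand-ins for
# the pre-trained WordPiece tokenizer.

MAX_LEN = 128  # maximum allowable sequence length used in our experiments

def prepare_sequence(text, vocab, max_len=MAX_LEN):
    """Tokenize, add special tokens, truncate, and pad a single post."""
    tokens = text.lower().split()          # stand-in for WordPiece
    tokens = tokens[: max_len - 2]         # leave room for [CLS] and [SEP]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    attention_mask = [1] * len(ids)
    padding = max_len - len(ids)           # pad up to the fixed length
    ids += [vocab["[PAD]"]] * padding
    attention_mask += [0] * padding
    return ids, attention_mask

# Illustrative vocabulary; real ids come from the pre-trained tokenizer.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "this": 4, "is": 5, "hateful": 6}

ids, mask = prepare_sequence("This is hateful", vocab)
```

For sequence classification, only the model's output vector at position 0 (the [CLS] token) is then fed to the classification head.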
Additional fine-tuning is done on this vector to generate the classification for specific data-sets.

To keep our experiments consistent, the following hyper-parameters were kept constant for all our experiments. For training our models we used the standard hyper-parameters as mentioned in the HuggingFace transformers documentation. We used the Adam optimizer (with ε = 1e−8) for 5 epochs, with a training/eval batch size of 32. The maximum allowable length for each sequence was kept at 128. We use a linearly decreasing learning rate with a starting value of 5e−5. All models were trained using Google Colab's GPU runtimes (https://colab.research.google.com/). This limited us to a model run-time of 12 hours with a GPU, which constrained our batch size as well as the number of training epochs, based on the GPU allocated by Colab.

We refer to models which fine-tune BERT using data from a single language for a single task as
Single models with an indicator (S); this is depicted in Figure 2 (1st row left). All other model types which we discuss later are identified by their model types and names in Figure 2.

3.3 Training a model for all tasks

One of the techniques that we had used for our work in HASOC 2019 Mishra and Mishra (2019) was creating an additional sub-task D by combining the labels of all of the sub-tasks. We refer to models which use this technique as Joint task models with an indicator (D) (see Figure 2 models marked with D). This allowed us to train a single model for all of the sub-tasks. It also helps in overcoming the data sparsity issue for sub-tasks B and C, for which the number of data points is very small. The same technique was also employed in our submission to the TRAC 2020 Mishra et al. (2020b) aggression and misogyny identification tasks. Furthermore, when combining labels, we only consider valid combinations of labels, which allows us to reduce the possible output space. For HASOC, the predicted output labels for the joint training are as follows: NOT-NONE-NONE, HOF-HATE-TIN, HOF-HATE-UNT, HOF-OFFN-TIN, HOF-OFFN-UNT, HOF-PRFN-TIN, HOF-PRFN-UNT. The task specific labels can be easily extracted from the output labels using post-processing of the predicted labels.

3.4 Training a model for all languages

We refer to models which use this technique as All models with an indicator (ALL) (see Figure 2 models marked with ALL). In this method, we combine the data-sets from all the languages and train a single multi-lingual model on this combined data-set. The multi-lingual model is able to learn from data in multiple languages, thus providing us with a single unified model for different languages. A major motivation for taking this approach was that social media data often does not belong to one particular language: it is quite common to find code-mixed posts on Twitter and Facebook. Thus, a multi-lingual model is the best choice in this scenario. During our TRAC 2020 work, we had found that this approach works really well and was one of our top models in almost all of the shared tasks. From a deep learning point of view, this technique seems promising, as it also increases the size of the data-set available for training without adding new data points from other data-sets or from data augmentation techniques.

As a natural extension of the above two approaches, we combine multi-lingual training with the joint training approach to train a single model on all tasks for all languages. We refer to models which use this technique as
All joint task models with an indicator (ALL) (D) (see Figure 2).

3.5 Multi-task learning

While the joint task setting can be considered a multi-task setting, it is not one in the common sense, hence our reservation in calling it multi-task. The joint task training can be considered an instance of multi-class prediction, where the number of classes is based on the combination of tasks. This approach does not impose any sort of task specific structure on the model, nor does it compute and combine task specific losses. The core idea of multi-task learning is to use similar tasks as regularizers for the model. This is done by simply adding the loss functions specific to each task to the final loss function of the model. This way, the model is forced to optimize for all of the different tasks simultaneously, thus producing a model that is able to generalize on multiple tasks on the data-set. However, this may not always prove to be beneficial: it has been reported that when the tasks differ significantly, the model fails to optimize on any of the tasks, leading to significantly worse performance compared to single task approaches. However, sub-tasks in hate speech detection are often similar or overlapping in nature. Thus, this approach seems promising for hate speech detection.

Our multi-task setup is inspired by the marginalized inference technique which was used in Mishra et al. (2020b). In marginalized inference, we post-process the probabilities of each label in the joint model and compute the task specific label probability by marginalizing the probability across all the other tasks. This ensures that the probabilities of labels for each sub-task make a valid probability distribution and sum to one. Note, for example, that p(HOF-HATE-TIN) > p(HOF-PRFN-TIN) does not guarantee that p(HOF-HATE-UNT) > p(HOF-PRFN-UNT).
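The joint-label decomposition and the marginalization just described can be sketched in a few lines. The joint labels are the sub-task D labels listed earlier, while the probability values below are made up purely for illustration.

```python
# Joint (sub-task D) labels: every valid A-B-C combination.
JOINT_LABELS = [
    "NOT-NONE-NONE",
    "HOF-HATE-TIN", "HOF-HATE-UNT",
    "HOF-OFFN-TIN", "HOF-OFFN-UNT",
    "HOF-PRFN-TIN", "HOF-PRFN-UNT",
]

def task_label(joint_label, task):
    """Recover the sub-task A/B/C label from a joint label by splitting it."""
    a, b, c = joint_label.split("-")
    return {"A": a, "B": b, "C": c}[task]

def marginalize(joint_probs, task):
    """p(task label) = sum of p(joint label) over joint labels sharing it."""
    out = {}
    for label, p in joint_probs.items():
        t = task_label(label, task)
        out[t] = out.get(t, 0.0) + p
    return out

# Illustrative (made-up) joint probabilities for one post.
probs = {
    "NOT-NONE-NONE": 0.10,
    "HOF-HATE-TIN": 0.30, "HOF-HATE-UNT": 0.05,
    "HOF-OFFN-TIN": 0.25, "HOF-OFFN-UNT": 0.05,
    "HOF-PRFN-TIN": 0.20, "HOF-PRFN-UNT": 0.05,
}

# e.g. p(HATE) = p(HOF-HATE-TIN) + p(HOF-HATE-UNT)
task_b = marginalize(probs, "B")
```

In the multi-task models described next, the same marginalization is applied to logits rather than probabilities (via logsumexp), so that each sub-task can receive its own cross-entropy loss.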
As described above, we can calculate the task specific probabilities by marginalizing the output probabilities for that task. For example, p(HATE) = p(HOF-HATE-TIN) + p(HOF-HATE-UNT). However, using this technique did not lead to a significant improvement in the predictions and the evaluation performance; in some cases, it was even lower than the original method. A reason we suspect for this low performance is that the model was not trained to directly optimize its loss for this marginal inference. Next, we describe our multi-task setup inspired by this approach.

For our multi-task experiments, we first use our joint training approach (sub-task D) to generate the logits for the different class labels. These logits are then marginalized to generate task specific logits (marginalizing logits is simpler than marginalizing the probability of each label, as we do not need to compute the partition function). For each task, we take a cross-entropy loss using the new task specific logits. Finally, we add the respective losses for each sub-task along with the sub-task D loss. This summed loss is the final multi-task loss function of our model, which we then train the model to minimize. In this loss, each sub-task loss acts as a regularizer for the other task losses. Since we are computing the multi-task loss for each instance, we include a special label NONE for sub-tasks B and C for the cases where the label of sub-task A is NOT. We refer to models which use this technique as
Multi-task models with an indicator (MTL) (D) (see Figure 2).

One important point to note is that we restrict the output space of the multi-task model by using the sub-task D labels. This is an essential constraint that we put on the model, because of which there is no chance of any inconsistency in the prediction. By inconsistency we mean that it is not possible for our multi-task model to predict a data point that belongs to NOT for task A and to any label other than NONE for tasks B and C. If we followed the general procedure for training a multi-task model, we would have 2 ∗ (3+1) ∗ (2+1) = 24 combinations of outputs from our model (with +1 for the additional NONE label), which would produce the inconsistencies mentioned above.

Like the methods mentioned before, we extend multi-task learning to all languages, which results in Multi-task all models, indicated with (MTL) (ALL) (D).

3.6 Training with Back-Translated data

One approach for increasing the size of the training data-set is to generate new instances based on existing instances using data augmentation techniques. These new instances are assigned the same label as the original instance. Training a model with instances generated using data augmentation techniques assumes that the label remains the same if the data augmentation does not change the instance significantly. We utilized a specific data augmentation technique used in NLP models, called back-translation (Koehn, 2005; Sennrich et al., 2016). Back-translation uses two machine translation models: one to translate a text from its original language to a target language, and another to translate the new text in the target language back to the original language. This technique was successfully used in the submissions of Risch and Krestel (2018, 2020) during TRAC 2018 and 2020. Data augmentation via back-translation assumes that current machine translation systems, when used in back-translation settings, give a different text which expresses a similar meaning to the original. This assumption allows us to reuse the label of the original text for the back-translated text.

We used the Google Translate API (https://cloud.google.com/translate/docs) to back-translate all the text in our data-sets. For each language in our data-set, we use the following source → target → source pairs:

– EN: English → French → English
– HI: Hindi → English → Hindi
– DE: German → English → German
To keep track of the back-translated texts, we added a flag to the text id. In many cases, there were minimal changes to the text; in some cases, there were no changes to the back-translated texts at all. However, the number of texts with no change after back-translation was very low. For example, among 4000 instances in the English training set, around 100 instances did not have any changes. So, when using the back-translated texts for our experiments, we simply used all the back-translated texts, whether they underwent a change or not. The data-set size doubled after using the back-translation data augmentation technique. An example of back-translated English text is as follows (changed text is emphasized):

1. Original: @politico No. We should remember very clearly that
   Back-translated: @politico No, we must not forget that very clear
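The augmentation procedure can be sketched as below. The `translate` function is a toy stand-in for the Google Translate API calls (we do not reproduce its client interface here), the tiny substitution table only exists to make the sketch runnable, and the `_bt` id suffix is a hypothetical form of the flag added to the text id.

```python
# Sketch of back-translation data augmentation: translate each post to a
# pivot language and back, flag the new id, and append it to the data-set,
# doubling its size.

PIVOTS = {"EN": "FR", "HI": "EN", "DE": "EN"}  # source -> pivot pairs used

def translate(text, src, tgt):
    """Toy stand-in for a machine translation API call."""
    # Real translations paraphrase the input; we mimic that with a tiny
    # word substitution table so the sketch is runnable.
    subs = {"remember": "recall", "clearly": "clear"}
    return " ".join(subs.get(w, w) for w in text.split())

def back_translate(text, lang):
    pivot = PIVOTS[lang]
    return translate(translate(text, lang, pivot), pivot, lang)

def augment(dataset, lang):
    """Append a back-translated copy of every instance, reusing its label."""
    augmented = list(dataset)
    for post_id, text, label in dataset:
        augmented.append((post_id + "_bt", back_translate(text, lang), label))
    return augmented

train = [("1234", "we should remember very clearly that", "NOT")]
train_aug = augment(train, "EN")
```

Note that the original label is reused unchanged for the back-translated copy, per the assumption stated above.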
4 Results

We present our results for sub-tasks A, B, and C in Tables 2, 3, and 4 respectively. To keep the tables concise, we use the following conventions:

1. (ALL): A bert-base-multi-lingual-uncased model was used with multi-lingual joint training.
2. (BT): The data-set used for this experiment is augmented using back-translation.
3. (D): A joint training approach has been used.
4. (MTL): The experiment is performed using a multi-task learning approach.
5. (S): This is the best model which was submitted to HASOC 2019 in Mishra and Mishra (2019).

The pre-trained BERT models which were fine-tuned for each language in a single language setting are as follows:

1. EN - bert-base-uncased
2. HI - bert-base-multi-lingual-uncased
3. DE - bert-base-multi-lingual-uncased

We report the weighted F1-score and macro F1-score, as were used in Mandl et al. (2019), with the macro F1-score being the score used for overall ranking in HASOC 2019.

The best scores for sub-task A are mentioned in Table 2. The best scores for this task belong to Wang et al. (2019), Bashar and Nayak (2020), and Saha et al. (2019) for English, Hindi, and German respectively. All the models that we experimented with in sub-task A are very closely separated by macro-F1 score; hence, all of them give a similar performance for this task. The difference between the macro F1-scores of these models is < 3%. For both English and Hindi, the multi-task learning model performed the best, while for German the model that was trained on the back-translated data using the multi-lingual joint training approach and task D ((ALL) (BT) (D)) worked best. However, it is interesting to see that the multi-task model gives competitive performance on all of the languages within the same computation budget. One thing to notice is that the train macro-F1 scores of the multi-task model are much lower than those of the other models. This suggests that the (MTL) model, given additional training time, might improve the results even further. We were unable to provide longer training time due to the lack of computational resources available to us. The (ALL) (MTL) model also gives a similar performance compared to the (MTL) model. This suggests that the additional multi-lingual training comes with a trade-off of a slightly lower macro-F1 score. However, the difference between the scores of the two models is ∼

Table 2: Sub-task A results. Models in HASOC 2019 (Mandl et al., 2019) were ranked based on Macro F1.
[Table body: for each of EN, HI and DE, rows list the models (ALL), (ALL) (D), (BT), (BT) (ALL), (BT) (ALL) (D), (MTL), (ALL) (D) (MTL), (D), (S), (S) (D), plus the best HASOC 2019 system per language; columns give the Weighted F1 and Macro F1 on the dev, train and test splits. The numeric scores could not be recovered from the source.]
In some cases, the training time was too large and the models over-fitted the data, which finally resulted in a degradation of their performance. A sweet spot for the training time may be found for the (MTL) models, which may increase performance while avoiding over-fitting; we were not able to conduct more experiments to verify this due to time constraints, and it may be evaluated in future work on these models. We cannot, however, compare the German (MTL) models with the (MTL) models of the other languages, as the German data did not include sub-task C, so the (MTL) approach for German was trained without it. As we will see in the next section, the (MTL) models performed equally well on sub-task B. This might be because tasks A and B both involve identifying hate and are hence correlated, a correlation that the (MTL) models can use to their advantage. It has been found in other multi-task approaches that models learn more effectively when the different tasks are correlated.
However, their performance can degrade if the tasks are unrelated. The lower performance on the German data may be because of the unavailability of sub-task C; the results are nevertheless still competitive with the other models. For German, the (ALL) (MTL) model performed better than our submission at HASOC 2019, and the (MTL) model for Hindi was able to match the best model for this task at HASOC 2019.

The (ALL) and (ALL) (D) training methods show an improvement over our single models submitted at HASOC. These models present an interesting option for abuse detection tasks, as they are able to work on all of the shared tasks at the same time, leveraging the multi-lingual abilities of the model while still having a computation budget equivalent to that of a single model. The results show that these models are competitive with the single models, and they can even outperform them; for example, they outperform the bert-base-uncased single models used in English sub-task A, which had been specifically tuned for English. For German and Hindi, the single models themselves used a bert-base-multilingual-uncased model, so these languages are better suited for analyzing the improvements brought by the multi-lingual joint training approach. On these languages we see that the (ALL) and (ALL) (D) techniques do improve the macro F1-scores for this task.

The back-translation technique does not seem to improve the models much; its performance was mixed. For all the languages, back-translation alone does not improve the model and hints at over-fitting, resulting in a decrease in test results. However, when it is combined with the (ALL) and (D) training methods we see an increase in performance: the (ALL) and (D) training methods are able to leverage the data augmentation applied in the back-translated data. Back-translation used with (ALL) or (ALL) (D) is better than the single models that we submitted at HASOC 2019. The (BT) (ALL) model comes really close to the best model at HASOC, ranking second according to the results in Mandl et al. (2019).
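For clarity, the two evaluation metrics reported in the tables can be computed from per-class F1 as follows. This is a standard, self-contained sketch (equivalent to `sklearn.metrics.f1_score` with `average="macro"` or `average="weighted"`), not code from our repository.

```python
from collections import Counter

def f1_per_class(y_true, y_pred, labels):
    """F1-score for each class label from gold and predicted label lists."""
    scores = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 (used for the HASOC 2019 ranking)."""
    per_class = f1_per_class(y_true, y_pred, labels)
    return sum(per_class.values()) / len(labels)

def weighted_f1(y_true, y_pred, labels):
    """Per-class F1 weighted by each class's support in the gold labels."""
    per_class = f1_per_class(y_true, y_pred, labels)
    support = Counter(y_true)
    total = len(y_true)
    return sum(per_class[c] * support[c] / total for c in labels)
```

Macro F1 treats rare classes (e.g. the minority hate labels) the same as frequent ones, which is why it was chosen for ranking on these imbalanced data-sets.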
The best scores for sub-task B are given in Table 3. The best score for this task belongs to Ruiter et al. (2019) for German, while for English and Hindi sub-task B our submissions had performed the best at HASOC 2019. For sub-task B, many of our models were able to significantly outperform the best HASOC models. For English, the multi-task approach results in a new best macro F1-score of 0.662, which is 8% more than the previous best. For German, our (MTL) model has a macro F1-score of 0.416 on the test set, which is almost 7% more than the previous best. For the English task, even our (MTL) (ALL) and (BT) (ALL) models were able to beat the previous best. However, our results show that, unlike sub-task A where our models had similar performances, in sub-task B there is huge variation in their performance: many models outperform the previous best, but some of them also show poor results. The (ALL) and (ALL) (D) models perform poorly for
Table 3: sub-task B results. Models in HASOC 2019 (Mandl et al., 2019) wereranked based on Macro F1.
[Table body: for each of EN, HI and DE, rows list the models (ALL), (ALL) (D), (BT), (BT) (ALL), (BT) (ALL) (D), (MTL), (ALL) (D) (MTL), (D), (S), (S) (D), plus the best HASOC 2019 system per language; columns give the Weighted F1 and Macro F1 on the dev, train and test splits. The numeric scores could not be recovered from the source.]
the three languages, except (ALL) in German, and show very small macro F1-scores even on the training set; training these models for longer may therefore change the results. The (MTL) models give competitive performance in task A and are able to outperform the previous best here, showing their capability to leverage different correlated tasks and generalize well on all of them.

Here again we see that back-translation alone does not improve the macro F1-scores. However, an interesting observation is that the (ALL) and (BT) models, which perform poorly individually, tend to give good results when used together, outperforming the previous best HASOC models in all three languages. This hints that data sparsity alone is not the major issue of this task. This is also evident from the performance of the (MTL) model, which only utilizes the data-set of a single language, significantly smaller than the back-translated data-set (twice the original) and the multi-lingual joint data-set (the sum of the sizes of the original data-sets). The (BT) (ALL) (D) model, on the other hand, performed poorly in all three languages; thus, using sub-task (D) with these models only degrades performance.

The results from this task confirm that the information required to predict task A is important for task B as well. This information is shared better by the modified loss function of the (MTL) models than by the loss function for sub-task (D). The (MTL) models build on the sub-task (D) approach but do not utilize it explicitly. The sub-task (D) approach resembles a multi-task learning method; however, it is not complete and is not able to learn from the other tasks, and thus does not offer a huge improvement. The (MTL) models do show variation in their performance, but it is always on the higher side of the macro F1-scores.
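The separate-head (MTL) setup can be sketched as a shared encoder with one classification head per sub-task. The sketch below is a minimal stand-in, not our released implementation: a toy `nn.Embedding` encoder replaces the pre-trained BERT encoder so the example is self-contained, and the class name, head names, and the simple summed cross-entropy loss are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiTaskClassifier(nn.Module):
    """Shared encoder with one classification head per sub-task."""

    def __init__(self, vocab_size, hidden_size, task_num_labels):
        super().__init__()
        # Toy stand-in for a pre-trained transformer encoder.
        self.embed = nn.Embedding(vocab_size, hidden_size)
        # One linear head per sub-task, each with its own label space.
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden_size, n) for task, n in task_num_labels.items()}
        )
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, input_ids, task, labels=None):
        pooled = self.embed(input_ids).mean(dim=1)  # mean-pool token states
        logits = self.heads[task](pooled)
        if labels is not None:
            return self.loss_fn(logits, labels), logits
        return logits

# Sub-task A: 2 labels, B: 3 labels, C: 2 labels (as in the HASOC data).
model = MultiTaskClassifier(100, 16, {"A": 2, "B": 3, "C": 2})
batch = torch.randint(0, 100, (4, 8))  # 4 toy sequences of 8 token ids
loss_a, _ = model(batch, "A", labels=torch.randint(0, 2, (4,)))
loss_b, _ = model(batch, "B", labels=torch.randint(0, 3, (4,)))
total_loss = loss_a + loss_b  # heads trained jointly against the shared encoder
```

Because the encoder parameters are shared across heads, gradients from each sub-task update a single model, which is what keeps the inference-time compute budget equivalent to a single-task model.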
The best scores for sub-task C are given in Table 4. The best score for this task belongs to Mujadia et al. (2019) for Hindi, while our submission performed the best for English. The results for sub-task C also show appreciable variation. Except for the (ALL) (D) and (BT) (ALL) (D) models, which also performed poorly in sub-task B, the variation in performance, especially for English, is not as significant as in sub-task B. This may be because the two-way fine-grained classification is a much easier task than the three-way classification in sub-task B. One important point to note is that sub-task C focused on identifying the context of the hate speech, specifically whether it is targeted or un-targeted, while sub-tasks A and B both focused on identifying the type of hate speech.

The (MTL) models do not perform as well as they did in the previous two tasks: they were still able to outperform the best models for English, but perform poorly for Hindi. An important point to notice here is that the train macro F1-scores for the (MTL) models are significantly low, suggesting that the (MTL) model was not able to learn well even on the training instances of this task. This can be attributed to the point mentioned above, that this task is inherently not as correlated with sub-tasks A and B as previously assumed; the task structure itself is not beneficial for a (MTL) approach. The main reason is that this task focuses on identifying targeted and un-targeted hate speech, but a non-hate-speech text can also be targeted or un-targeted. As the (MTL) model receives texts belonging to both the hate (HOF) and non-hate (NOT) classes, the information contained in the texts belonging to this task is counteracted by the targeted and un-targeted texts belonging to the (NOT) class. Thus, a better formulation of this task is not a fine-grained classification of hate speech texts, but one which involves targeted and un-targeted labels for both the (HOF) and (NOT) classes. In that setting, we could fully utilize the advantage of the multi-task learning model and could expect better performance on this task as well.
Table 4: sub-task C results. Models in HASOC 2019 (Mandl et al., 2019) wereranked based on Macro F1.
[Table body: for each of EN and HI (German had no sub-task C), rows list the models (ALL), (ALL) (D), (BT), (BT) (ALL), (BT) (ALL) (D), (MTL), (ALL) (D) (MTL), (D), (S), (S) (D), plus the best HASOC 2019 system per language; columns give the Weighted F1 and Macro F1 on the dev, train and test splits. The numeric scores could not be recovered from the source.]
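For reference, the sub-task (D) joint-training setting merges the per-task labels into a single classification task. The sketch below shows one plausible construction of that joint label space, consistent with the 7-class task described in the discussion (NOT, plus the cross product of the sub-task B and C labels for HOF texts); the function name `joint_label` is ours and the exact construction in the released code may differ.

```python
TASK_B = ("HATE", "OFFN", "PRFN")   # sub-task B: type of hate speech
TASK_C = ("TIN", "UNT")             # sub-task C: targeted / untargeted

def joint_label(task_a, task_b=None, task_c=None):
    """Combined label for joint training (sub-task D): NOT stays a single
    class, while HOF texts are refined by their type and target annotations."""
    if task_a == "NOT":
        return "NOT"
    return f"{task_b}-{task_c}"

# The full joint label space: 1 + 3 * 2 = 7 classes.
joint_space = {"NOT"} | {f"{b}-{c}" for b in TASK_B for c in TASK_C}
```

This flattening replaces three small classification problems with one harder 7-way problem, which is consistent with the degraded (D) results observed for sub-tasks B and C. For German, which lacks sub-task C annotations, the construction would have to be adapted.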
The (ALL) and (BT) models performed really well in sub-task C. The (ALL), (BT) and (ALL) (BT) models outperform the previous best for English. Combining these models with (D) still does not improve them, and they continue to give poor performance. This provides more evidence for our earlier inference that sub-task (D) alone does not improve performance.

Overall, most of our models show an improvement over the single models submitted at HASOC. The sub-par performance of the back-translated models across the sub-tasks suggests that data sparsity is not the central issue of this challenge; to take advantage of the augmented data, additional methods have to be used. Sub-task (D) does not add a significant improvement to the models, and it can be seen that it actually worsens the situation for sub-tasks B and C. This can be attributed to it changing the task to a much harder 7-class classification task. The combined model approaches that we have mentioned above offer a resource-efficient way to do hate speech detection: the (ALL), (ALL) (MTL) and (MTL) models generalize well across the different tasks and languages, and present themselves as good candidates for a unified model for hate speech detection.

Fig. 3: Variation in label F1-scores for all sub-tasks across all models. [Panels: (a) sub-task A, (b) sub-task B, (c) sub-task C; the plotted values could not be recovered from the source.]

4.2 Error analysis

After identifying the best model and the variation in evaluation scores for each model, we investigate the overall performance of these models for each label belonging to each task. In Figure 3, we can observe how the various labels have a high variance in their predictive performance.
For English, the models show decent variation for both labels on the training set. However, this variation is not as significant on the dev and test sets for the (NOT) label. There is appreciable variation for the (HOF) label on the dev set, but it does not transfer to the test set. The scores on the train sets are very high compared to the dev and test sets. For Hindi, the predictions for both labels show minimal variation in F1-score across the three data-sets, with similar scores for each label. For German, the F1-score for the (NOT) class is quite high compared to that of the (HOF) class; the models have a very low F1-score for the (HOF) label on the test set, with appreciable variation across the different models.
For English, the F1-scores for all the labels are quite high on the train set, with decent variation for the (OFFN) label. All of the labels show appreciable variance on the dev and test sets. The (OFFN) label has the lowest F1-score among the labels on both the dev and test sets, with the other two labels having similar scores on the test set. For Hindi, the train F1-scores are similar for all of the labels. The F1-scores are on the lower end for the (HATE) and (OFFN) labels on the dev set, with appreciable variance across the models; this may be because the Hindi dev set contains very few samples from these two labels. For German, the variation among the F1-scores is high across all three sets. The (HATE) label and the (OFFN) label have a large variation in their F1-scores across the models on the dev and test set respectively. The F1-score for the (OFFN) label is much higher than that of the other labels on the test set.
For English, the (UNT) label has exceptionally high variance across the models on the train set. This is due to the exceptionally low scores of the (BT) (ALL) (D) model. This label has an extremely low F1-score on the dev set. Furthermore, there is also large variation in the (TIN) scores across the models on all the sets. For Hindi, the (TIN) label has similar F1-scores with large variations across the models on all three sets. However, the (UNT) label has small variance across the models on the dev and test sets.
We also looked at issues with the back-translated results. In order to assess the back-translated data we looked at the new words added to and removed from a sentence after back-translation. Aggregating these words over all sentences, we find that the top words which are most often removed and introduced are stop words, e.g. the, of, etc. In order to remove these stop words from our analysis and assess the salient words introduced and removed per label, we remove the overall top 50 words from the introduced and removed lists aggregated over each label. This highlights that the words often removed from offensive and hateful labels are indeed offensive words. A detailed list of words for English and German can be found in Appendix .2 (we excluded results for Hindi because of LaTeX encoding issues). Such analysis is important, as it has been found in Davidson et al. (2019) that hate speech and abusive language datasets exhibit racial bias towards African American English usage.
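The word-level analysis described above can be sketched as follows. The function names are ours, and the whitespace tokenization is an illustrative simplification of whatever tokenization the actual analysis used.

```python
from collections import Counter

def word_changes(original, backtranslated):
    """Words removed from / introduced into a sentence by back-translation."""
    orig = set(original.lower().split())
    bt = set(backtranslated.lower().split())
    return orig - bt, bt - orig

def salient_changes(pairs_by_label, top_k_stopwords=50):
    """Aggregate removed/introduced words per label, then drop the overall
    most frequent words (mostly stop words) to surface salient changes."""
    removed_all, introduced_all = Counter(), Counter()
    per_label = {}
    for label, pairs in pairs_by_label.items():
        removed, introduced = Counter(), Counter()
        for orig, bt in pairs:
            r, i = word_changes(orig, bt)
            removed.update(r)
            introduced.update(i)
        per_label[label] = (removed, introduced)
        removed_all.update(removed)
        introduced_all.update(introduced)
    # Overall top-k changed words act as a corpus-derived stop word list.
    stop = {w for w, _ in (removed_all + introduced_all).most_common(top_k_stopwords)}
    return {label: ([(w, c) for w, c in removed.most_common() if w not in stop],
                    [(w, c) for w, c in introduced.most_common() if w not in stop])
            for label, (removed, introduced) in per_label.items()}
```

On the example shown earlier, this recovers "should", "remember" and "clearly" as removed and "must", "not", "forget" and "clear" as introduced.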
We would like to conclude this paper by highlighting the promise shown by multi-lingual and multi-task models for solving hate and abusive speech detection in a computationally efficient way, while maintaining accuracy comparable to single-task models. We do highlight that our pre-trained models need to be further evaluated before being used at large scale; however, the architecture and the training framework can easily scale to large datasets without sacrificing performance, as was shown in Mishra (2020b,a, 2019).
Compliance with Ethical Standards
Conflict of Interest: The authors declare that they have no conflict of interest.
References
Badjatiya, P., Gupta, S., Gupta, M., and Varma, V. (2017). Deep learning for hate speech detection in tweets. In Proceedings of the 26th International Conference on World Wide Web Companion, WWW '17 Companion, pages 759–760, Republic and Canton of Geneva, CHE. International World Wide Web Conferences Steering Committee.
Bashar, M. A. and Nayak, R. (2020). QUTNocturnal@HASOC'19: CNN for hate speech and offensive content identification in Hindi language. arXiv preprint arXiv:2008.12448.
Basile, V., Bosco, C., Fersini, E., Nozza, D., Patti, V., Rangel Pardo, F. M., Rosso, P., and Sanguinetti, M. (2019). SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 54–63, Minneapolis, Minnesota, USA. Association for Computational Linguistics.
Burnap, P. and Williams, M. L. (2015). Cyber hate speech on Twitter: An application of machine classification and statistical modeling for policy and decision making. Policy & Internet, 7(2):223–242.
Davidson, T., Bhattacharya, D., and Weber, I. (2019). Racial bias in hate speech and abusive language detection datasets. In Proceedings of the Third Workshop on Abusive Language Online, pages 25–35, Stroudsburg, PA, USA. Association for Computational Linguistics.
Davidson, T., Warmsley, D., Macy, M. W., and Weber, I. (2017). Automated hate speech detection and the problem of offensive language. In Proceedings of the International AAAI Conference on Web and Social Media 2017.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Duggan, M., Smith, A., and Caiazza, T. (2017). Online Harassment 2017. Technical report, Pew Research Center.
Eisenstein, J. (2013). What to do about bad language on the internet. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 359–369, Atlanta, Georgia. Association for Computational Linguistics.
Florio, K., Basile, V., Polignano, M., Basile, P., and Patti, V. (2020). Time of your hate: The challenge of time in hate speech detection on social media. Applied Sciences, 10(12):4180.
Gomez, R., Gibert, J., Gomez, L., and Karatzas, D. (2020). Exploring hate speech detection in multimodal publications. In , pages 1459–1467.
Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T. (2017). Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431, Valencia, Spain. Association for Computational Linguistics.
Kiela, D., Firooz, H., Mohan, A., Goswami, V., Singh, A., Ringshia, P., and Testuggine, D. (2020). The hateful memes challenge: Detecting hate speech in multimodal memes.
Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. MT Summit.
Kumar, R., Ojha, A. K., Malmasi, S., and Zampieri, M. (2018). Benchmarking aggression identification in social media. In Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018), pages 1–11, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Kumar, R., Ojha, A. K., Malmasi, S., and Zampieri, M. (2020). Evaluating aggression identification in social media. In Kumar, R., Ojha, A. K., Lahiri, B., Zampieri, M., Malmasi, S., Murdock, V., and Kadar, D., editors, Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying (TRAC-2020), Paris, France. European Language Resources Association (ELRA).
Le, Q. V. and Mikolov, T. (2014). Distributed representations of sentences and documents. CoRR, abs/1405.4053.
Liu, P., Qiu, X., and Huang, X. (2016). Deep multi-task learning with shared memory for text classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 118–127, Stroudsburg, PA, USA. Association for Computational Linguistics.
Mandl, T., Modha, S., Majumder, P., Patel, D., Dave, M., Mandlia, C., and Patel, A. (2019). Overview of the HASOC track at FIRE 2019: Hate speech and offensive content identification in Indo-European languages. In Proceedings of the 11th Forum for Information Retrieval Evaluation, FIRE '19, pages 14–17, New York, NY, USA. Association for Computing Machinery.
Mishra, S. (2019). Multi-dataset-multi-task neural sequence tagging for information extraction from tweets. In Proceedings of the 30th ACM Conference on Hypertext and Social Media - HT '19, pages 283–284, New York, New York, USA. ACM Press.
Mishra, S. (2020a). Information extraction from digital social trace data with applications to social media and scholarly communication data. ACM SIGIR Forum, 54(1).
Mishra, S. (2020b). Information Extraction from Digital Social Trace Data with Applications to Social Media and Scholarly Communication Data. PhD thesis, University of Illinois at Urbana-Champaign.
Mishra, S. (2020c). Non-neural structured prediction for event detection from news in Indian languages. In Mehta, P., Mandl, T., Majumder, P., and Mitra, M., editors, Working Notes of FIRE 2020 - Forum for Information Retrieval Evaluation, Hyderabad, India. CEUR Workshop Proceedings, CEUR-WS.org.
Mishra, S., Agarwal, S., Guo, J., Phelps, K., Picco, J., and Diesner, J. (2014). Enthusiasm and support: alternative sentiment classification for social movements on social media. In Proceedings of the 2014 ACM conference on Web science - WebSci '14, pages 261–262, Bloomington, Indiana, USA. ACM Press.
Mishra, S. and Diesner, J. (2016). Semi-supervised named entity recognition in noisy-text. In Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT), pages 203–212, Osaka, Japan. The COLING 2016 Organizing Committee.
Mishra, S. and Diesner, J. (2019). Capturing signals of enthusiasm and support towards social issues from Twitter. In Proceedings of the 5th International Workshop on Social Media World Sensors - SIdEWayS'19, pages 19–24, New York, New York, USA. ACM Press.
Mishra, S. and Mishra, S. (2019). 3Idiots at HASOC 2019: Fine-tuning transformer neural networks for hate speech identification in Indo-European languages. In Proceedings of the 11th annual meeting of the Forum for Information Retrieval Evaluation, pages 208–213, Kolkata, India.
Mishra, S., Prasad, S., and Mishra, S. (2020a). Model and predictions for multi-task multi-lingual learning of transformer models for hate speech and offensive speech identification in social media. Accessible at: https://doi.org/10.13012/B2IDB-3565123_V1.
Mishra, S., Prasad, S., and Mishra, S. (2020b). Multilingual joint fine-tuning of transformer models for identifying trolling, aggression and cyberbullying at TRAC 2020. In Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying (TRAC-2020).
Mondal, M., Silva, L. A., and Benevenuto, F. (2017). A measurement study of hate speech in social media. In Proceedings of the 28th ACM Conference on Hypertext and Social Media - HT '17, pages 85–94, New York, New York, USA. ACM Press.
Mozafari, M., Farahbakhsh, R., and Crespi, N. (2020). A BERT-based transfer learning approach for hate speech detection in online social media. In Cherifi, H., Gaito, S., Mendes, J. F., Moro, E., and Rocha, L. M., editors, Complex Networks and Their Applications VIII, pages 928–940, Cham. Springer International Publishing.
Mujadia, V., Mishra, P., and Sharma, D. M. (2019). IIIT-Hyderabad at HASOC 2019: Hate speech detection.
Perrin, A. (2015). Social Media Usage: 2005-2015. Technical report, Pew Research Center.
Plank, B. (2017). All-in-1 at IJCNLP-2017 task 4: Short text classification with one model for all languages. In Proceedings of the IJCNLP 2017, Shared Tasks, pages 143–148, Taipei, Taiwan. Asian Federation of Natural Language Processing.
Ranasinghe, T. and Zampieri, M. (2020). Multilingual offensive language identification with cross-lingual embeddings. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5838–5844, Online. Association for Computational Linguistics.
Razavi, A. H., Inkpen, D., Uritsky, S., and Matwin, S. (2010). Offensive language detection using multi-level classification. In Proceedings of the 23rd Canadian Conference on Advances in Artificial Intelligence, AI'10, pages 16–27, Berlin, Heidelberg. Springer-Verlag.
Risch, J. and Krestel, R. (2018). Aggression identification using deep learning and data augmentation. In Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (co-located with COLING), pages 150–158.
Risch, J. and Krestel, R. (2020). Bagging BERT models for robust aggression identification. In Proceedings of the Workshop on Trolling, Aggression and Cyberbullying (TRAC@LREC).
Ruiter, D., Rahman, M. A., and Klakow, D. (2019). LSV-UdS at HASOC 2019: The problem of defining hate. In Mehta, P., Rosso, P., Majumder, P., and Mitra, M., editors, Working Notes of FIRE 2019 - Forum for Information Retrieval Evaluation, Kolkata, India, December 12-15, 2019, volume 2517 of CEUR Workshop Proceedings, pages 263–270. CEUR-WS.org.
Saha, P., Mathew, B., Goyal, P., and Mukherjee, A. (2019). HateMonitors: Language agnostic abuse detection in social media.
Salminen, J., Almerekhi, H., Milenković, M., Jung, S.-g., An, J., Kwak, H., and Jansen, B. (2018). Anatomy of online hate: Developing a taxonomy and machine learning models for identifying and classifying hate in online news media.
Schmidt, A. and Wiegand, M. (2017). A survey on hate speech detection using natural language processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 1–10, Stroudsburg, PA, USA. Association for Computational Linguistics.
Sennrich, R., Haddow, B., and Birch, A. (2016). Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Stroudsburg, PA, USA. Association for Computational Linguistics.
Søgaard, A. and Goldberg, Y. (2016). Deep multi-task learning with low level tasks supervised at lower layers. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 231–235. Association for Computational Linguistics.
Sticca, F. and Perren, S. (2013). Is cyberbullying worse than traditional bullying? Examining the differential roles of medium, publicity, and anonymity for the perceived severity of bullying. Journal of Youth and Adolescence, 42(5):739–750.
Struß, J., Siegel, M., Ruppenhofer, J., Wiegand, M., and Klenner, M. (2019). Overview of GermEval task 2, 2019 shared task on the identification of offensive language. In KONVENS.
Van Hee, C., Lefever, E., Verhoeven, B., Mennes, J., Desmet, B., De Pauw, G., Daelemans, W., and Hoste, V. (2015). Detection and fine-grained classification of cyberbullying events. In Angelova, G., Bontcheva, K., and Mitkov, R., editors, Proceedings of Recent Advances in Natural Language Processing, Proceedings, pages 672–680.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
Vidgen, B., Harris, A., Nguyen, D., Tromble, R., Hale, S., and Margetts, H. (2019). Challenges and frontiers in abusive content detection. In Proceedings of the Third Workshop on Abusive Language Online, pages 80–93, Florence, Italy. Association for Computational Linguistics.
Wang, B., Ding, Y., Liu, S., and Zhou, X. (2019). YNU wb at HASOC 2019: Ordered neurons LSTM with attention for identifying hate speech and offensive language.
Wang, C. (2018). Interpreting neural network hate speech classifiers. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pages 86–92, Brussels, Belgium. Association for Computational Linguistics.
Waseem, Z., Davidson, T., Warmsley, D., and Weber, I. (2017). Understanding abuse: A typology of abusive language detection subtasks. In Proceedings of the First Workshop on Abusive Language Online, pages 78–84, Vancouver, BC, Canada. Association for Computational Linguistics.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., and Brew, J. (2019). HuggingFace's Transformers: State-of-the-art natural language processing.
Yang, F., Peng, X., Ghosh, G., Shilon, R., Ma, H., Moore, E., and Predovic, G. (2019). Exploring deep multimodal fusion of text and photo for hate speech classification. In Proceedings of the Third Workshop on Abusive Language Online, pages 11–18, Florence, Italy. Association for Computational Linguistics.
Zampieri, M., Malmasi, S., Nakov, P., Rosenthal, S., Farra, N., and Kumar, R. (2019). SemEval-2019 task 6: Identifying and categorizing offensive language in social media (OffensEval). In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 75–86, Minneapolis, Minnesota, USA. Association for Computational Linguistics.
Appendix

.1 Label Distribution

Fig. 4: English data class-wise distribution.
Fig. 5: German data class-wise distribution.
Fig. 6: Hindi data class-wise distribution.

[The figures show the per-class instance counts for sub-tasks A, B and C on each train/dev/test split; the plotted values could not be recovered from the source.]

.2 Back translation top changed words
Here we list the top 5 words per label for each task obtained after removing the top 50 words which were either introduced or removed via back-translation. We do not list the top words for Hindi because of the encoding issue in LaTeX.
Listing 1: Changed words in English Training Data (the beginning of the listing could not be recovered from the source)

[...('***', 5), ('there', 4), ('these', 4)]
task 3 removed words
NONE [('happy', 49), ('than', 47), ('being', 45), ('every', 45), ('been', 43)]
TIN [('fuck', 48), ("he's", 39), ('what', 35), ("don't", 34), ("he' s", 34)]
UNT [('them', 7), ('f***', 6), ('being', 5), ('such', 5), ('does', 5)]