Exploring multi-task multi-lingual learning of transformer models for hate speech and offensive speech identification in social media
Sudhanshu Mishra · Shivangi Prasad · Shubhanshu Mishra∗
Abstract
Hate speech has become a major content moderation issue for online social media platforms. Given the volume and velocity of online content production, it is impossible to manually moderate hate speech related content on any platform. In this paper we utilize a multi-task and multi-lingual approach based on recently proposed Transformer Neural Networks to solve three sub-tasks for hate speech. These sub-tasks were part of the 2019 shared task on hate speech and offensive content (HASOC) identification in Indo-European languages. We expand on our submission to that competition by utilizing multi-task models which are trained using three approaches, a) multi-task learning with separate task heads, b) back-translation, and c) multi-lingual training. Finally, we investigate the performance of various models and identify instances where the Transformer based models perform differently and better. We show that it is possible to utilize different combined approaches to obtain models that can generalize easily on different languages and tasks, while trading off slight accuracy (in some cases) for a much reduced inference time compute cost. We open source an updated version of our HASOC 2019 code with the new improvements at https://github.com/socialmediaie/MTML_HateSpeech.

Keywords
Hate Speech · Offensive content · Transformer Models · BERT · Language Models · Neural Networks · Multi-lingual · Multi-Task Learning · Social Media · Natural Language Processing · Machine Learning · Deep Learning · Open Source

S. Mishra
Indian Institute of Technology Kanpur, India
E-mail: [email protected]

S. Prasad
University of Illinois at Urbana-Champaign, USA
E-mail: [email protected]

S. Mishra∗ [Corresponding Author]
University of Illinois at Urbana-Champaign, USA
E-mail: [email protected]
1 Introduction

With increased access to the internet, the number of people that are connected through social media is higher than ever (Perrin, 2015). Thus, social media platforms are often held responsible for framing the views and opinions of a large number of people (Duggan et al., 2017). However, this freedom to voice our opinion has been challenged by the increase in the use of hate speech (Mondal et al., 2017). The anonymity of the internet grants people the power to completely change the context of a discussion and suppress a person's personal opinion (Sticca and Perren, 2013). These hateful posts and comments not only affect society at a micro scale but also at a global level, by influencing people's views regarding important global events like elections and protests (Duggan et al., 2017). Given the volume of online communication happening on various social media platforms and the need for more fruitful communication, there is a growing need to automate the detection of hate speech. For the scope of this paper we adopt the definition of hate speech and offensive speech as defined in Mandl et al. (2019) as "insulting, hurtful, derogatory, or obscene content directed from one person to another person" (quoted from (Mandl et al., 2019)).

In order to automate hate speech detection, the Natural Language Processing (NLP) community has made significant progress, which has been accelerated by the organization of numerous shared tasks aimed at identifying hate speech (Mandl et al., 2019; Kumar et al., 2020, 2018). Furthermore, there has been a proliferation of new methods for automated hate speech detection in social media text (Salminen et al., 2018; Mishra et al., 2020b; Mishra and Mishra, 2019; Mishra, 2020a; Waseem et al., 2017; Struß et al., 2019; Mandl et al., 2019; Mondal et al., 2017).
However, working with social media text is difficult (Eisenstein, 2013; Mishra and Diesner, 2016; Mishra et al., 2014; Mishra and Diesner, 2019; Mishra, 2019, 2020b,a), as people use combinations of different languages, spellings, and words that one may never find in any dictionary. A common pattern across many hate speech identification tasks (Mandl et al., 2019; Kumar et al., 2020; Waseem et al., 2017; Zampieri et al., 2019; Basile et al., 2019; Struß et al., 2019) is the identification of various aspects of hate speech, e.g., in HASOC 2019 (Mandl et al., 2019), the organizers divided the task into three sub-tasks, which focused on identifying the presence of hate speech; classification of hate speech into offensive, profane, and hateful; and identifying if the hate speech is targeted towards an entity.

Many researchers have tried to address these types of tiered hate speech classification tasks using separate models, one for each sub-task (see reviews of recent shared tasks: Zampieri et al. (2019); Kumar et al. (2018, 2020); Mandl et al. (2019); Struß et al. (2019)). However, we consider this approach limited for application to systems which consume large amounts of data and are computationally constrained for efficiently flagging hate speech. The limitation of the existing approach stems from the requirement to run several models, one for each language and sub-task. In this work, we propose a unified modeling framework which identifies the relationship between all tasks across multiple languages. Our aim is to perform as well as, if not better than, the best model for each task-language combination.
Our approach is inspired by the promising results of multi-task learning in some of our recent works (Mishra, 2019, 2020b,a). Additionally, while building a unified model which can perform well on all tasks is challenging, an important benefit of these models is their computational efficiency, achieved by reduced compute and maintenance costs, which can allow the system to trade off slight accuracy for efficiency.

In this paper, we propose the development of such a universal modelling framework, which can leverage recent advancements in machine learning to achieve competitive, and in a few cases state-of-the-art, performance on a variety of hate speech identification sub-tasks across multiple languages. Our framework encompasses a variety of modelling architectures which can either train on all tasks, all languages, or a combination of both. We extend our prior work in Mishra and Mishra (2019); Mishra et al. (2020b); Mishra (2020b,a, 2019) to develop efficient models for hate speech identification and benchmark them against the HASOC 2019 corpus, which consists of social media posts in three languages, namely English, Hindi, and German. We open source our implementation to allow its usage by the wider research community. Our main contributions are as follows:

1. Investigation of more efficient modeling architectures which use a) multi-task learning with separate task heads, b) back-translation, and c) multi-lingual training. These architectures can generalize easily on different languages and tasks, while trading off slight accuracy (in some cases) for a much reduced inference time compute cost.
2. Investigation of the performance of various models and identification of instances where our new models differ in their performance.
3. Open sourcing of pre-trained models and model outputs at Mishra et al. (2020a), along with the updated code for using these models at: https://github.com/socialmediaie/MTML_HateSpeech
2 Related Work

Prior work in the area of hate speech identification (see Schmidt and Wiegand (2017) for a detailed review of prior methods) focuses on different aspects of hate speech identification, namely analyzing what constitutes hate speech; the high modality and other issues encountered when dealing with social media data; and finally, model architectures and developments in NLP that are being used in identifying hate speech these days. There is also prior literature
focusing on the different aspects of hateful speech and tackling the subjectivity that it imposes. There are many shared tasks (Mandl et al. (2019); Kumar et al. (2018, 2020); Struß et al. (2019); Basile et al. (2019)) that tackle hate speech detection by classifying it into different categories. Each shared task focuses on a different aspect of hate speech. Waseem et al. (2017) proposed a typology on the abusive nature of hate speech, classifying it into generalized, explicit, and implicit abuse. Basile et al. (2019) focused on hateful and aggressive posts targeted towards women and immigrants. Mandl et al. (2019) focused on identifying targeted and un-targeted insults and classifying hate speech into hateful, offensive, and profane content. Kumar et al. (2018, 2020) tackled aggression and misogynistic content identification for trolling and cyberbullying posts. Vidgen et al. (2019) identify that most of these shared tasks broadly fall into three classes: individual directed abuse, identity directed abuse, and concept directed abuse. Their work also puts into context the various challenges encountered in abusive content detection.

Unlike other domains of information retrieval, there is a lack of large data-sets in this field. Moreover, the data-sets available are highly skewed and focus on a particular type of hate speech. For example, Davidson et al. (2017) model the problem as a generic abusive content identification challenge; however, these posts are mostly related to racism and sexism. Furthermore, in the real world, hateful posts do not fall into a single type of hate speech. There is significant overlap between different hateful classes, making hate speech identification a multi-label problem.

A wide variety of system architectures, ranging from classical machine learning to recent deep learning models, have been tried for various aspects of hate speech identification. Facebook, YouTube, and Twitter are the major sources of data for most data-sets.
Burnap and Williams (2015) used SVM and ensemble techniques for identifying hate speech in Twitter data. Razavi et al. (2010) proposed an approach for abuse detection using a dictionary of insulting and abusive words and phrases. Van Hee et al. (2015) used bag of words n-gram features and trained an SVM model on a cyberbullying dataset. Salminen et al. (2018) achieved an F1-score of 0.79 on classification of hateful YouTube and Facebook posts using a linear SVM model employing TF-IDF weighted n-grams.

Recently, models based on deep learning techniques have also been applied to the task of hate speech identification. These models often rely on distributed representations or embeddings, e.g., FastText embeddings (Joulin et al., 2017) and paragraph2vec distributed representations (Le and Mikolov, 2014). Badjatiya et al. (2017) employed an LSTM architecture to tune Glove word embeddings on the DATA-TWITTER-TWH data-set. Risch and Krestel (2018) used a neural network architecture using a GRU layer and ensemble methods for the TRAC 2018 (Kumar et al., 2018) shared task on aggression identification. They also tried back-translation as a data augmentation technique to increase the data-set size. Wang (2018) illustrated the use of sequentially combining CNNs with RNNs for abuse detection, showing that this approach was better than using only the CNN architecture, giving a 1% improvement in the F1-score. One of the most recent developments in NLP is the transformer architecture introduced by Vaswani et al. (2017). Utilizing the transformer architecture, Devlin et al. (2019) provide methods to pre-train models for language understanding (BERT) that have achieved state of the art results in many NLP tasks and are promising for hate speech detection as well. BERT based models achieved competitive performance in the HASOC 2019 shared tasks.
In Mishra and Mishra (2019), we fine-tuned BERT base models for the various HASOC shared tasks, producing the top performing model in some of the sub-tasks. A similar approach was also used by us for the TRAC 2020 shared tasks on aggression identification (Mishra et al., 2020b), still achieving competitive performance with the other models without using any ensemble techniques. An interesting approach there was the use of multi-lingual models via joint training on different languages, which presents us with a unified model for different languages in abusive content detection. Ensemble techniques using BERT models (Risch and Krestel, 2020) produced the top performing models in many of the shared tasks in TRAC 2020. Recently, multi-task learning has been used for improving performance on NLP tasks (Liu et al., 2016; Søgaard and Goldberg, 2016), especially social media information extraction tasks (Mishra, 2019), and simpler variants have been tried for hate speech identification in our recent works (Mishra and Mishra, 2019; Mishra et al., 2020b). Florio et al. (2020) investigated the usage of AlBERTo for monitoring hate speech in Italian on Twitter. Their results show that even though AlBERTo is sensitive to the fine tuning set, its performance increases given enough training time. Mozafari et al. (2020) employ a transfer learning approach using BERT for hate speech detection. Ranasinghe and Zampieri (2020) use cross-lingual embeddings to identify offensive content in a multilingual setting. Our multi-lingual approach is similar in spirit to the method proposed in Plank (2017), which uses the same model architecture and aligned word embeddings to solve the tasks. There has also been some work on developing solutions for multilingual toxic comments, which can be related to hate speech.
Recently, Mishra (2020c) also used a single model across various tasks, which performed very well for event detection tasks for five Indian languages.

There have been numerous competitions dealing with hate speech evaluation. OffensEval Zampieri et al. (2019) is one of the popular shared tasks dealing with offensive language in social media, featuring three sub-tasks for discriminating between offensive and non-offensive posts. Another popular shared task in SemEval is the HateEval Basile et al. (2019) task on the detection of hate against women and immigrants. The 2019 version of HateEval consists of two sub-tasks for the determination of hateful and aggressive posts. GermEval Struß et al. (2019) is another shared task quite similar to HASOC. It focused on the identification of offensive language in German tweets, featuring two sub-tasks following a binary and a multi-class classification of the German tweets.
An important aspect of hate speech is that it is primarily multi-modal in nature. A large portion of the hateful content that is shared on social media is in the form of memes, which feature multiple modalities like audio, text, images, and in some cases videos as well. Yang et al. (2019) present different fusion approaches to tackle multi-modal information for hate speech detection. Gomez et al. (2020) explore multi-modal hate speech consisting of text and image modalities. They propose various multi-modal architectures to jointly analyze both the textual and visual information. Facebook recently released the hateful memes data-set for the Hateful Memes challenge Kiela et al. (2020) to provide a complex data-set where it is difficult for uni-modal models to achieve good performance.
3 Methods

For this paper, we extend some of the techniques that we have used in TRAC 2020 in Mishra et al. (2020b), as well as Mishra (2019, 2020a,b), and apply them to the HASOC data-set Mandl et al. (2019). Furthermore, we extend the work that we did as part of the HASOC 2019 shared task Mishra and Mishra (2019) by experimenting with multi-lingual training, back-translation based data-augmentation, and multi-task learning to tackle the data sparsity issue of the HASOC 2019 data-set.

Fig. 1: Task Description

3.1 Task Definition and Data

All of the experiments reported hereafter have been done on the HASOC 2019 data-set (Mandl et al., 2019), consisting of posts in English (EN), Hindi (HI), and German (DE). The shared tasks of HASOC 2019 had three sub-tasks (A, B, C) for both English and Hindi and two sub-tasks (A, B) for German. The description of each sub-task is as follows (see Figure 1 for details):

– Sub-Task A: Posts have to be classified into hate speech (HOF) and non-offensive content (NOT).
– Sub-Task B: A fine grained classification of the hateful posts in sub-task A. Hate speech posts have to be classified into the type of hate they represent, i.e., containing hate speech content (HATE), containing offensive content (OFFN), and containing profane words (PRFN).
– Sub-Task C: Another fine grained classification of the hateful posts in sub-task A. This sub-task required us to identify whether the hate speech was targeted towards an individual or group (TIN) or whether it was un-targeted (UNT).

Table 1: Distribution of number of tweets in different data-sets and splits.

task  DE (train/dev/test)   EN (train/dev/test)     HI (train/dev/test)
A     (not recovered)       (not recovered)         (not recovered)
B     407 / 794 / 850       2,261 / 302 / 1,153     2,469 / 136 / 1,318
C     (not recovered)       (not recovered)         (not recovered)

The HASOC 2019 data-set consists of posts taken from Twitter and Facebook. The data-set only consists of text and labels and does not include any contextual information or meta-data of the original posts, e.g., time information. The data distribution for each language and sub-task is mentioned in Table 1. We can observe that the sample size for each language is of the order of a few thousand posts, which is an order of magnitude smaller than other datasets like OffensEval (13,200 posts), HateEval (19,000 posts), and the Kaggle Toxic Comments dataset (240,000 posts). This can pose a challenge for training deep learning models, which often consist of a large number of parameters, from scratch. Class-wise data distribution for each language is available in the appendix .1, figures 4, 5, and 6. These figures show that the label distribution is highly skewed for task C, such as the label UNT, which is quite underrepresented. Similarly, for German the task A data is quite unbalanced. For more details on the dataset, along with the details on its creation and motivation, we refer the reader to Mandl et al. (2019). Mandl et al. (2019) report that the inter-annotator agreement is in the range of 60% to 70% for English and Hindi. Furthermore, the inter-annotator agreement is more than 86% for German.

3.2 Fine-tuning transformer based models

Transformer based models, especially BERT (Devlin et al., 2019), have proven to be successful in achieving very good results on a range of NLP tasks. Upon its release, BERT based models became state of the art for 11 NLP tasks (Devlin et al., 2019). This motivated us to try out BERT for hate speech detection. We had used multiple variants of the BERT model during HASOC
2019 shared tasks Mishra and Mishra (2019). We also experimented with other transformer models and BERT during TRAC 2020 Mishra et al. (2020b). However, based on our experiments, we find the original BERT models to be the best performing for most tasks. Hence, for this paper we only implement our models on those. For our experiments we use the open source implementation of BERT provided by Wolf et al. (2019) (https://github.com/huggingface/transformers). A common practice for using BERT based models is to fine-tune an existing pre-trained model on data from a new task. For fine-tuning the pre-trained BERT models, we used the BERT for Sequence Classification paradigm present in the HuggingFace library. We fine-tune BERT using various architectures. A visual description of these architectures is shown in Figure 2. These models are explained in detail in later sections.

Fig. 2: An overview of various model architectures we used. Shaded task boxes represent that we first compute a marginal representation of labels only belonging to that task before computing the loss.

To process the text, we first use a pre-trained BERT tokenizer to convert the input sentences into tokens. These tokens are then passed to the model, which generates BERT specific embeddings for each token. Each sequence of tokens is wrapped with a [CLS] and a [SEP] token. Because BERT's self-attention layers attend over all hidden states of the sequence, the model captures contextual information well even for longer sequences. The pre-trained BERT model generates an output vector for each of the tokens. For sequence classification tasks, the vector corresponding to the [CLS] token is used, as it holds the contextual information about the complete sentence.
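The input preparation just described can be sketched as follows. This is a schematic illustration only: the whitespace tokenizer and the toy vocabulary with its token ids are stand-ins for the pre-trained WordPiece tokenizer from the HuggingFace library, but the [CLS]/[SEP] wrapping, truncation, and padding to a fixed maximum length mirror the setup we use.

```python
# Schematic sketch of BERT-style input preparation: wrap each post with
# [CLS] and [SEP], truncate to a maximum length, and pad to a fixed size.
# The whitespace tokenizer and vocabulary are illustrative stand-ins for
# the pre-trained WordPiece tokenizer.

MAX_LEN = 128  # maximum allowable sequence length used in our experiments

def prepare_sequence(text, vocab, max_len=MAX_LEN):
    """Tokenize, add special tokens, truncate, and pad a single post."""
    tokens = text.lower().split()          # stand-in for WordPiece
    tokens = tokens[: max_len - 2]         # leave room for [CLS] and [SEP]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    attention_mask = [1] * len(ids)
    padding = max_len - len(ids)           # pad up to the fixed length
    ids += [vocab["[PAD]"]] * padding
    attention_mask += [0] * padding
    return ids, attention_mask

# Illustrative vocabulary; real ids come from the pre-trained tokenizer.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "this": 4, "is": 5, "hateful": 6}

ids, mask = prepare_sequence("This is hateful", vocab)
```

For sequence classification, only the model's output vector at position 0 (the [CLS] token) is then fed to the classification head.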
Additional fine-tuning is done on this vector to generate the classification for specific data-sets.

To keep our experiments consistent, the following hyper-parameters were kept constant for all our experiments. For training our models we used the standard hyper-parameters as mentioned in the HuggingFace transformers documentation. We used the Adam optimizer (with ε = 1e−8) for 5 epochs, with a training/eval batch size of 32. The maximum allowable length for each sequence was kept at 128. We use a linearly decreasing learning rate with a starting value of 5e−5. All models were trained using Google Colab's GPU runtimes (https://colab.research.google.com/). This limited us to a model run-time of 12 hours with a GPU, which constrained our batch size as well as the number of training epochs, based on the GPU allocated by Colab.

We refer to models which fine-tune BERT using data from a single language for a single task as
Single models with an indicator (S); this is depicted in Figure 2 (1st row left). All other model types which we discuss later are identified by their model types and names in Figure 2.

3.3 Training a model for all tasks

One of the techniques that we had used for our work in HASOC 2019 Mishra and Mishra (2019) was creating an additional sub-task D by combining the labels of all of the sub-tasks. We refer to models which use this technique as Joint task models with an indicator (D) (see Figure 2 models marked with D). This allowed us to train a single model for all of the sub-tasks. It also helps in overcoming the data sparsity issue for sub-tasks B and C, for which the number of data points is very small. The same technique was also employed in our submission to the TRAC 2020 Mishra et al. (2020b) aggression and misogyny identification tasks. Furthermore, when combining labels, we only consider valid combinations of labels, which allows us to reduce the possible output space. For HASOC, the predicted output labels for the joint training are as follows: NOT-NONE-NONE, HOF-HATE-TIN, HOF-HATE-UNT, HOF-OFFN-TIN, HOF-OFFN-UNT, HOF-PRFN-TIN, HOF-PRFN-UNT. The task specific labels can be easily extracted from the output labels using post-processing of the predicted labels.

3.4 Training a model for all languages

We refer to models which use this technique as All models with an indicator (ALL) (see Figure 2 models marked with ALL). In this method, we combine the data-sets from all the languages and train a single multi-lingual model on this combined data-set. The multi-lingual model is able to learn from data in multiple languages, thus providing us with a single unified model for different languages. A major motivation for taking this approach was that social media data often does not belong to one particular language: it is quite common to find code-mixed posts on Twitter and Facebook. Thus, a multi-lingual model is the best choice in this scenario. During our TRAC 2020 work, we had found that this approach works really well and was one of our top models in almost all of the shared tasks. From a deep learning point of view, this technique seems promising, as it also increases the size of the data-set available for training without adding new data points from other data-sets or from data augmentation techniques.

As a natural extension of the above two approaches, we combine multi-lingual training with the joint training approach to train a single model on all tasks for all languages. We refer to models which use this technique as
All joint task models with an indicator (ALL) (D) (see Figure 2).

3.5 Multi-task learning

While the joint task setting can be considered a multi-task setting, it is not one in the common sense, hence our reservation in calling it multi-task. The joint task training can be considered an instance of multi-class prediction, where the number of classes is based on the combination of tasks. This approach does not impose any sort of task specific structure on the model, nor does it compute and combine task specific losses. The core idea of multi-task learning is to use similar tasks as regularizers for the model. This is done by simply adding the loss functions specific to each task to the final loss function of the model. This way, the model is forced to optimize for all of the different tasks simultaneously, thus producing a model that is able to generalize on multiple tasks on the data-set. However, this may not always prove to be beneficial: it has been reported that when the tasks differ significantly, the model fails to optimize on any of the tasks, leading to significantly worse performance compared to single task approaches. However, sub-tasks in hate speech detection are often similar or overlapping in nature. Thus, this approach seems promising for hate speech detection.

Our multi-task setup is inspired by the marginalized inference technique which was used in Mishra et al. (2020b). In marginalized inference, we post-process the probabilities of each label in the joint model and compute the task specific label probability by marginalizing the probability across all the other tasks. This ensures that the probabilities of labels for each sub-task make a valid probability distribution and sum to one. Note, for example, that p(HOF-HATE-TIN) > p(HOF-PRFN-TIN) does not guarantee that p(HOF-HATE-UNT) > p(HOF-PRFN-UNT).
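The joint-label decomposition and the marginalization just described can be sketched in a few lines. The joint labels are the sub-task D labels listed earlier, while the probability values below are made up purely for illustration.

```python
# Joint (sub-task D) labels: every valid A-B-C combination.
JOINT_LABELS = [
    "NOT-NONE-NONE",
    "HOF-HATE-TIN", "HOF-HATE-UNT",
    "HOF-OFFN-TIN", "HOF-OFFN-UNT",
    "HOF-PRFN-TIN", "HOF-PRFN-UNT",
]

def task_label(joint_label, task):
    """Recover the sub-task A/B/C label from a joint label by splitting it."""
    a, b, c = joint_label.split("-")
    return {"A": a, "B": b, "C": c}[task]

def marginalize(joint_probs, task):
    """p(task label) = sum of p(joint label) over joint labels sharing it."""
    out = {}
    for label, p in joint_probs.items():
        t = task_label(label, task)
        out[t] = out.get(t, 0.0) + p
    return out

# Illustrative (made-up) joint probabilities for one post.
probs = {
    "NOT-NONE-NONE": 0.10,
    "HOF-HATE-TIN": 0.30, "HOF-HATE-UNT": 0.05,
    "HOF-OFFN-TIN": 0.25, "HOF-OFFN-UNT": 0.05,
    "HOF-PRFN-TIN": 0.20, "HOF-PRFN-UNT": 0.05,
}

# e.g. p(HATE) = p(HOF-HATE-TIN) + p(HOF-HATE-UNT)
task_b = marginalize(probs, "B")
```

In the multi-task models described next, the same marginalization is applied to logits rather than probabilities (via logsumexp), so that each sub-task can receive its own cross-entropy loss.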
As described above, we can calculate the task specific probabilities by marginalizing the output probabilities for that task. For example, p(HATE) = p(HOF-HATE-TIN) + p(HOF-HATE-UNT). However, using this technique did not lead to a significant improvement in the predictions and the evaluation performance; in some cases, it was even lower than the original method. A reason we suspect for this low performance is that the model was not trained to directly optimize its loss for this marginal inference. Next, we describe our multi-task setup inspired by this approach.

For our multi-task experiments, we first use our joint training approach (sub-task D) to generate the logits for the different class labels. These logits are then marginalized to generate task specific logits (marginalizing logits is simpler than marginalizing the probability of each label, as we do not need to compute the partition function). For each task, we take a cross-entropy loss using the new task specific logits. Finally, we add the respective losses for each sub-task along with the sub-task D loss. This summed loss is the final multi-task loss function of our model, which we then train the model to minimize. In this loss, each sub-task loss acts as a regularizer for the other task losses. Since we are computing the multi-task loss for each instance, we include a special label NONE for sub-tasks B and C for the cases where the label of sub-task A is NOT. We refer to models which use this technique as
Multi-task models with an indicator (MTL) (D) (see Figure 2).

One important point to note is that we restrict the output space of the multi-task model by using the sub-task D labels. This is an essential constraint that we put on the model, because of which there is no chance of any inconsistency in the prediction. By inconsistency we mean that it is not possible for our multi-task model to predict a data point that belongs to NOT for task A and to any label other than NONE for tasks B and C. If we followed the general procedure for training a multi-task model, we would have 2 ∗ (3+1) ∗ (2+1) = 24 combinations of outputs from our model (with +1 for the additional NONE label), which would produce the inconsistencies mentioned above.

Like the methods mentioned before, we extend multi-task learning to all languages, which results in Multi-task all models, indicated with (MTL) (ALL) (D).

3.6 Training with Back-Translated data

One approach for increasing the size of the training data-set is to generate new instances based on existing instances using data augmentation techniques. These new instances are assigned the same label as the original instance. Training a model with instances generated using data augmentation techniques assumes that the label remains the same if the data augmentation does not change the instance significantly. We utilized a specific data augmentation technique used in NLP models, called back-translation (Koehn, 2005; Sennrich et al., 2016). Back-translation uses two machine translation models: one to translate a text from its original language to a target language, and another to translate the new text in the target language back to the original language. This technique was successfully used in the submissions of Risch and Krestel (2018, 2020) during TRAC 2018 and 2020. Data augmentation via back-translation assumes that current machine translation systems, when used in back-translation settings, give a different text which expresses a similar meaning to the original. This assumption allows us to reuse the label of the original text for the back-translated text.

We used the Google Translate API (https://cloud.google.com/translate/docs) to back-translate all the text in our data-sets. For each language in our data-set, we use the following source → target → source pairs:

– EN: English → French → English
– HI: Hindi → English → Hindi
– DE: German → English → German
To keep track of the back-translated texts, we added a flag to the text id. In many cases, there were minimal changes to the text; in some cases, there were no changes to the back-translated texts at all. However, the number of texts with no change after back-translation was very low. For example, among 4000 instances in the English training set, around 100 instances did not have any changes. So, when using the back-translated texts for our experiments, we simply used all the back-translated texts, whether they underwent a change or not. The data-set size doubled after using the back-translation data augmentation technique. An example of back-translated English text is as follows (changed text is emphasized):

1. Original: @politico No. We should remember very clearly that
   Back-translated: @politico No, we must not forget that very clear
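The augmentation procedure can be sketched as below. The `translate` function is a toy stand-in for the Google Translate API calls (we do not reproduce its client interface here), the tiny substitution table only exists to make the sketch runnable, and the `_bt` id suffix is a hypothetical form of the flag added to the text id.

```python
# Sketch of back-translation data augmentation: translate each post to a
# pivot language and back, flag the new id, and append it to the data-set,
# doubling its size.

PIVOTS = {"EN": "FR", "HI": "EN", "DE": "EN"}  # source -> pivot pairs used

def translate(text, src, tgt):
    """Toy stand-in for a machine translation API call."""
    # Real translations paraphrase the input; we mimic that with a tiny
    # word substitution table so the sketch is runnable.
    subs = {"remember": "recall", "clearly": "clear"}
    return " ".join(subs.get(w, w) for w in text.split())

def back_translate(text, lang):
    pivot = PIVOTS[lang]
    return translate(translate(text, lang, pivot), pivot, lang)

def augment(dataset, lang):
    """Append a back-translated copy of every instance, reusing its label."""
    augmented = list(dataset)
    for post_id, text, label in dataset:
        augmented.append((post_id + "_bt", back_translate(text, lang), label))
    return augmented

train = [("1234", "we should remember very clearly that", "NOT")]
train_aug = augment(train, "EN")
```

Note that the original label is reused unchanged for the back-translated copy, per the assumption stated above.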
4 Results

We present our results for sub-tasks A, B, and C in Tables 2, 3, and 4 respectively. To keep the tables concise, we use the following conventions:

1. (ALL): A bert-base-multi-lingual-uncased model was used with multi-lingual joint training.
2. (BT): The data-set used for this experiment is augmented using back-translation.
3. (D): A joint training approach has been used.
4. (MTL): The experiment is performed using a multi-task learning approach.
5. (S): This is the best model which was submitted to HASOC 2019 in Mishra and Mishra (2019).

The pre-trained BERT models which were fine-tuned for each language in a single language setting are as follows:

1. EN - bert-base-uncased
2. HI - bert-base-multi-lingual-uncased
3. DE - bert-base-multi-lingual-uncased

We report the weighted F1-score and macro F1-score, as were used in Mandl et al. (2019), with the macro F1-score being the score used for overall ranking in HASOC 2019.

The best scores for sub-task A are mentioned in Table 2. The best scores for this task belong to Wang et al. (2019), Bashar and Nayak (2020), and Saha et al. (2019) for English, Hindi, and German respectively. All the models that we experimented with in sub-task A are very closely separated by macro-F1 score; hence, all of them give a similar performance for this task. The difference between the macro F1-scores of these models is < 3%. For both English and Hindi, the multi-task learning model performed the best, while for German the model that was trained on the back-translated data using the multi-lingual joint training approach and task D ((ALL) (BT) (D)) worked best. However, it is interesting to see that the multi-task model gives competitive performance on all of the languages within the same computation budget. One thing to notice is that the train macro-F1 scores of the multi-task model are much lower than those of the other models. This suggests that the (MTL) model, given additional training time, might improve the results even further. We were unable to provide longer training time due to the lack of computational resources available to us. The (ALL) (MTL) model also gives a similar performance compared to the (MTL) model. This suggests that the additional multi-lingual training comes with a trade-off of a slightly lower macro-F1 score. However, the difference between the scores of the two models is ∼

Table 2: Sub-task A results. Models in HASOC 2019 (Mandl et al., 2019) were ranked based on Macro F1.
[Table body: for each of EN, HI and DE, rows list the models (ALL), (ALL) (D), (BT), (BT) (ALL), (BT) (ALL) (D), (MTL), (ALL) (D) (MTL), (D), (S), (S) (D), plus the best HASOC 2019 system per language; columns give the Weighted F1 and Macro F1 on the dev, train and test splits. The numeric scores could not be recovered from the source.]
In some cases, the training time was too large and the models over-fitted the data, which finally resulted in a degradation of their performance. A sweet spot for the training time may be found for the (MTL) models, which may increase performance while avoiding over-fitting; we were not able to conduct more experiments to verify this due to time constraints, and it may be evaluated in future work on these models. We cannot, however, compare the German (MTL) models with the (MTL) models of the other languages, as the German data did not include sub-task C, so the (MTL) approach for German was trained without it. As we will see in the next section, the (MTL) models performed equally well on sub-task B. This might be because tasks A and B both involve identifying hate and are hence correlated, a correlation that the (MTL) models can use to their advantage. It has been found in other multi-task approaches that models learn more effectively when the different tasks are correlated.
However, their performance can degrade if the tasks are unrelated. The lower performance on the German data may be because of the unavailability of sub-task C; the results are nevertheless still competitive with the other models. For German, the (ALL) (MTL) model performed better than our submission at HASOC 2019, and the (MTL) model for Hindi was able to match the best model for this task at HASOC 2019.

The (ALL) and (ALL) (D) training methods show an improvement over our single models submitted at HASOC. These models present an interesting option for abuse detection tasks, as they are able to work on all of the shared tasks at the same time, leveraging the multi-lingual abilities of the model while still having a computation budget equivalent to that of a single model. The results show that these models are competitive with the single models, and they can even outperform them; for example, they outperform the bert-base-uncased single models used in English sub-task A, which had been specifically tuned for English. For German and Hindi, the single models themselves used a bert-base-multilingual-uncased model, so these languages are better suited for analyzing the improvements brought by the multi-lingual joint training approach. On these languages we see that the (ALL) and (ALL) (D) techniques do improve the macro F1-scores for this task.

The back-translation technique does not seem to improve the models much; its performance was mixed. For all the languages, back-translation alone does not improve the model and hints at over-fitting, resulting in a decrease in test results. However, when it is combined with the (ALL) and (D) training methods we see an increase in performance: the (ALL) and (D) training methods are able to leverage the data augmentation applied in the back-translated data. Back-translation used with (ALL) or (ALL) (D) is better than the single models that we submitted at HASOC 2019. The (BT) (ALL) model comes really close to the best model at HASOC, ranking second according to the results in Mandl et al. (2019).
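For clarity, the two evaluation metrics reported in the tables can be computed from per-class F1 as follows. This is a standard, self-contained sketch (equivalent to `sklearn.metrics.f1_score` with `average="macro"` or `average="weighted"`), not code from our repository.

```python
from collections import Counter

def f1_per_class(y_true, y_pred, labels):
    """F1-score for each class label from gold and predicted label lists."""
    scores = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 (used for the HASOC 2019 ranking)."""
    per_class = f1_per_class(y_true, y_pred, labels)
    return sum(per_class.values()) / len(labels)

def weighted_f1(y_true, y_pred, labels):
    """Per-class F1 weighted by each class's support in the gold labels."""
    per_class = f1_per_class(y_true, y_pred, labels)
    support = Counter(y_true)
    total = len(y_true)
    return sum(per_class[c] * support[c] / total for c in labels)
```

Macro F1 treats rare classes (e.g. the minority hate labels) the same as frequent ones, which is why it was chosen for ranking on these imbalanced data-sets.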
The best scores for sub-task B are given in Table 3. The best score for this task belongs to Ruiter et al. (2019) for German, while for English and Hindi sub-task B our submissions had performed the best at HASOC 2019. For sub-task B, many of our models were able to significantly outperform the best HASOC models. For English, the multi-task approach results in a new best macro F1-score of 0.662, which is 8% more than the previous best. For German, our (MTL) model has a macro F1-score of 0.416 on the test set, which is almost 7% more than the previous best. For the English task, even our (MTL) (ALL) and (BT) (ALL) models were able to beat the previous best. However, our results show that, unlike sub-task A where our models had similar performances, in sub-task B there is huge variation in their performance: many models outperform the previous best, but some of them also show poor results. The (ALL) and (ALL) (D) models perform poorly for
Table 3: sub-task B results. Models in HASOC 2019 (Mandl et al., 2019) wereranked based on Macro F1.
[Table body: for each of EN, HI and DE, rows list the models (ALL), (ALL) (D), (BT), (BT) (ALL), (BT) (ALL) (D), (MTL), (ALL) (D) (MTL), (D), (S), (S) (D), plus the best HASOC 2019 system per language; columns give the Weighted F1 and Macro F1 on the dev, train and test splits. The numeric scores could not be recovered from the source.]
the three languages, except (ALL) in German, and show very small macro F1-scores even on the training set; training these models for longer may therefore change the results. The (MTL) models give competitive performance in task A and are able to outperform the previous best here, showing their capability to leverage different correlated tasks and generalize well on all of them.

Here again we see that back-translation alone does not improve the macro F1-scores. However, an interesting observation is that the (ALL) and (BT) models, which perform poorly individually, tend to give good results when used together, outperforming the previous best HASOC models in all three languages. This hints that data sparsity alone is not the major issue of this task. This is also evident from the performance of the (MTL) model, which only utilizes the data-set of a single language, significantly smaller than the back-translated data-set (twice the original) and the multi-lingual joint data-set (the sum of the sizes of the original data-sets). The (BT) (ALL) (D) model, on the other hand, performed poorly in all three languages; thus, using sub-task (D) with these models only degrades performance.

The results from this task confirm that the information required to predict task A is important for task B as well. This information is shared better by the modified loss function of the (MTL) models than by the loss function for sub-task (D). The (MTL) models build on the sub-task (D) approach but do not utilize it explicitly. The sub-task (D) approach resembles a multi-task learning method; however, it is not complete and is not able to learn from the other tasks, and thus does not offer a huge improvement. The (MTL) models do show variation in their performance, but it is always on the higher side of the macro F1-scores.
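The separate-head (MTL) setup can be sketched as a shared encoder with one classification head per sub-task. The sketch below is a minimal stand-in, not our released implementation: a toy `nn.Embedding` encoder replaces the pre-trained BERT encoder so the example is self-contained, and the class name, head names, and the simple summed cross-entropy loss are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiTaskClassifier(nn.Module):
    """Shared encoder with one classification head per sub-task."""

    def __init__(self, vocab_size, hidden_size, task_num_labels):
        super().__init__()
        # Toy stand-in for a pre-trained transformer encoder.
        self.embed = nn.Embedding(vocab_size, hidden_size)
        # One linear head per sub-task, each with its own label space.
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden_size, n) for task, n in task_num_labels.items()}
        )
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, input_ids, task, labels=None):
        pooled = self.embed(input_ids).mean(dim=1)  # mean-pool token states
        logits = self.heads[task](pooled)
        if labels is not None:
            return self.loss_fn(logits, labels), logits
        return logits

# Sub-task A: 2 labels, B: 3 labels, C: 2 labels (as in the HASOC data).
model = MultiTaskClassifier(100, 16, {"A": 2, "B": 3, "C": 2})
batch = torch.randint(0, 100, (4, 8))  # 4 toy sequences of 8 token ids
loss_a, _ = model(batch, "A", labels=torch.randint(0, 2, (4,)))
loss_b, _ = model(batch, "B", labels=torch.randint(0, 3, (4,)))
total_loss = loss_a + loss_b  # heads trained jointly against the shared encoder
```

Because the encoder parameters are shared across heads, gradients from each sub-task update a single model, which is what keeps the inference-time compute budget equivalent to a single-task model.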
The best scores for sub-task C are given in Table 4. The best score for this task belongs to Mujadia et al. (2019) for Hindi, while our submission performed the best for English. The results for sub-task C also show appreciable variation. Except for the (ALL) (D) and (BT) (ALL) (D) models, which also performed poorly in sub-task B, the variation in performance, especially for English, is not as significant as in sub-task B. This may be because the two-way fine-grained classification is a much easier task than the three-way classification in sub-task B. One important point to note is that sub-task C focused on identifying the context of the hate speech, specifically whether it is targeted or un-targeted, while sub-tasks A and B both focused on identifying the type of hate speech.

The (MTL) models do not perform as well as they did in the previous two tasks: they were still able to outperform the best models for English, but perform poorly for Hindi. An important point to notice here is that the train macro F1-scores for the (MTL) models are significantly low, suggesting that the (MTL) model was not able to learn well even on the training instances of this task. This can be attributed to the point mentioned above, that this task is inherently not as correlated with sub-tasks A and B as previously assumed; the task structure itself is not beneficial for a (MTL) approach. The main reason is that this task focuses on identifying targeted and un-targeted hate speech, but a non-hate-speech text can also be targeted or un-targeted. As the (MTL) model receives texts belonging to both the hate (HOF) and non-hate (NOT) classes, the information contained in the texts belonging to this task is counteracted by the targeted and un-targeted texts belonging to the (NOT) class. Thus, a better formulation of this task is not a fine-grained classification of hate speech texts, but one which involves targeted and un-targeted labels for both the (HOF) and (NOT) classes. In that setting, we could fully utilize the advantage of the multi-task learning model and could expect better performance on this task as well.
Table 4: sub-task C results. Models in HASOC 2019 (Mandl et al., 2019) wereranked based on Macro F1.
[Table body: for each of EN and HI (German had no sub-task C), rows list the models (ALL), (ALL) (D), (BT), (BT) (ALL), (BT) (ALL) (D), (MTL), (ALL) (D) (MTL), (D), (S), (S) (D), plus the best HASOC 2019 system per language; columns give the Weighted F1 and Macro F1 on the dev, train and test splits. The numeric scores could not be recovered from the source.]
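For reference, the sub-task (D) joint-training setting merges the per-task labels into a single classification task. The sketch below shows one plausible construction of that joint label space, consistent with the 7-class task described in the discussion (NOT, plus the cross product of the sub-task B and C labels for HOF texts); the function name `joint_label` is ours and the exact construction in the released code may differ.

```python
TASK_B = ("HATE", "OFFN", "PRFN")   # sub-task B: type of hate speech
TASK_C = ("TIN", "UNT")             # sub-task C: targeted / untargeted

def joint_label(task_a, task_b=None, task_c=None):
    """Combined label for joint training (sub-task D): NOT stays a single
    class, while HOF texts are refined by their type and target annotations."""
    if task_a == "NOT":
        return "NOT"
    return f"{task_b}-{task_c}"

# The full joint label space: 1 + 3 * 2 = 7 classes.
joint_space = {"NOT"} | {f"{b}-{c}" for b in TASK_B for c in TASK_C}
```

This flattening replaces three small classification problems with one harder 7-way problem, which is consistent with the degraded (D) results observed for sub-tasks B and C. For German, which lacks sub-task C annotations, the construction would have to be adapted.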
The (ALL) and (BT) models performed really well in sub-task C. The (ALL), (BT) and (ALL) (BT) models outperform the previous best for English. Combining these models with (D) still does not improve them, and they continue to give poor performance. This provides more evidence for our earlier inference that sub-task (D) alone does not improve performance.

Overall, most of our models show an improvement over the single models submitted at HASOC. The sub-par performance of the back-translated models across the sub-tasks suggests that data sparsity is not the central issue of this challenge; to take advantage of the augmented data, additional methods have to be used. Sub-task (D) does not add a significant improvement to the models, and it can be seen that it actually worsens the situation for sub-tasks B and C. This can be attributed to it changing the task to a much harder 7-class classification task. The combined model approaches that we have mentioned above offer a resource-efficient way to do hate speech detection: the (ALL), (ALL) (MTL) and (MTL) models generalize well across the different tasks and languages, and present themselves as good candidates for a unified model for hate speech detection.

Fig. 3: Variation in label F1-scores for all sub-tasks across all models. [Panels: (a) sub-task A, (b) sub-task B, (c) sub-task C; the plotted values could not be recovered from the source.]

4.2 Error analysis

After identifying the best model and the variation in evaluation scores for each model, we investigate the overall performance of these models for each label belonging to each task. In Figure 3, we can observe how the various labels have a high variance in their predictive performance.
For English, the models show decent variation for both labels on the training set. However, this variation is not as significant on the dev and test sets for the (NOT) label. There is appreciable variation for the (HOF) label on the dev set, but it does not transfer to the test set. The scores on the train sets are very high compared to the dev and test sets. For Hindi, the predictions for both labels show minimal variation in F1-score across the three data-sets, with similar scores for each label. For German, the F1-score for the (NOT) class is quite high compared to that of the (HOF) class; the models have a very low F1-score for the (HOF) label on the test set, with appreciable variation across the different models.
For English, the F1-scores for all the labels are quite high on the train set, with decent variation for the (OFFN) label. All of the labels show appreciable variance on the dev and test sets. The (OFFN) label has the lowest F1-score among the labels on both the dev and test sets, with the other two labels having similar scores on the test set. For Hindi, the train F1-scores are similar for all of the labels. The F1-scores are on the lower end for the (HATE) and (OFFN) labels on the dev set, with appreciable variance across the models; this may be because the Hindi dev set contains very few samples from these two labels. For German, the variation among the F1-scores is high across all three sets. The (HATE) label and the (OFFN) label have a large variation in their F1-scores across the models on the dev and test set respectively. The F1-score for the (OFFN) label is much higher than that of the other labels on the test set.
For English, the (UNT) label has exceptionally high variance across the models on the train set. This is due to the exceptionally low scores of the (BT) (ALL) (D) model. This label has an extremely low F1-score on the dev set. Furthermore, there is also large variation in the (TIN) scores across the models on all the sets. For Hindi, the (TIN) label has similar F1-scores with large variations across the models on all three sets. However, the (UNT) label has small variance across the models on the dev and test sets.
We also looked at issues with the back-translated results. In order to assess the back-translated data we looked at the new words added to and removed from a sentence after back-translation. Aggregating these words over all sentences, we find that the top words which are most often removed and introduced are stop words, e.g. the, of, etc. In order to remove these stop words from our analysis and assess the salient words introduced and removed per label, we remove the overall top 50 words from the introduced and removed lists aggregated over each label. This highlights that the words often removed from offensive and hateful labels are indeed offensive words. A detailed list of words for English and German can be found in Appendix .2 (we excluded results for Hindi because of LaTeX encoding issues). Such analysis is important, as it has been found in Davidson et al. (2019) that hate speech and abusive language datasets exhibit racial bias towards African American English usage.
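The word-level analysis described above can be sketched as follows. The function names are ours, and the whitespace tokenization is an illustrative simplification of whatever tokenization the actual analysis used.

```python
from collections import Counter

def word_changes(original, backtranslated):
    """Words removed from / introduced into a sentence by back-translation."""
    orig = set(original.lower().split())
    bt = set(backtranslated.lower().split())
    return orig - bt, bt - orig

def salient_changes(pairs_by_label, top_k_stopwords=50):
    """Aggregate removed/introduced words per label, then drop the overall
    most frequent words (mostly stop words) to surface salient changes."""
    removed_all, introduced_all = Counter(), Counter()
    per_label = {}
    for label, pairs in pairs_by_label.items():
        removed, introduced = Counter(), Counter()
        for orig, bt in pairs:
            r, i = word_changes(orig, bt)
            removed.update(r)
            introduced.update(i)
        per_label[label] = (removed, introduced)
        removed_all.update(removed)
        introduced_all.update(introduced)
    # Overall top-k changed words act as a corpus-derived stop word list.
    stop = {w for w, _ in (removed_all + introduced_all).most_common(top_k_stopwords)}
    return {label: ([(w, c) for w, c in removed.most_common() if w not in stop],
                    [(w, c) for w, c in introduced.most_common() if w not in stop])
            for label, (removed, introduced) in per_label.items()}
```

On the example shown earlier, this recovers "should", "remember" and "clearly" as removed and "must", "not", "forget" and "clear" as introduced.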
We would like to conclude this paper by highlighting the promise shown by multi-lingual and multi-task models for solving hate and abusive speech detection in a computationally efficient way, while maintaining accuracy comparable to single-task models. We do highlight that our pre-trained models need to be further evaluated before being used at large scale; however, the architecture and the training framework can easily scale to large datasets without sacrificing performance, as was shown in Mishra (2020b,a, 2019).
Compliance with Ethical Standards
Conflict of Interest: The authors declare that they have no conflict of interest.
References
Badjatiya, P., Gupta, S., Gupta, M., and Varma, V. (2017). Deep learning for hate speech detection in tweets. In Proceedings of the 26th International Conference on World Wide Web Companion, WWW '17 Companion, pages 759–760, Republic and Canton of Geneva, CHE. International World Wide Web Conferences Steering Committee.
Bashar, M. A. and Nayak, R. (2020). QUTNocturnal@HASOC'19: CNN for hate speech and offensive content identification in Hindi language. arXiv preprint arXiv:2008.12448.
Basile, V., Bosco, C., Fersini, E., Nozza, D., Patti, V., Rangel Pardo, F. M., Rosso, P., and Sanguinetti, M. (2019). SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 54–63, Minneapolis, Minnesota, USA. Association for Computational Linguistics.
Burnap, P. and Williams, M. L. (2015). Cyber hate speech on Twitter: An application of machine classification and statistical modeling for policy and decision making. Policy & Internet, 7(2):223–242.
Davidson, T., Bhattacharya, D., and Weber, I. (2019). Racial bias in hate speech and abusive language detection datasets. In Proceedings of the Third Workshop on Abusive Language Online, pages 25–35, Stroudsburg, PA, USA. Association for Computational Linguistics.
Davidson, T., Warmsley, D., Macy, M. W., and Weber, I. (2017). Automated hate speech detection and the problem of offensive language. In Proceedings of the International AAAI Conference on Web and Social Media 2017.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Duggan, M., Smith, A., and Caiazza, T. (2017). Online Harassment 2017. Technical report, Pew Research Center.
Eisenstein, J. (2013). What to do about bad language on the internet. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 359–369, Atlanta, Georgia. Association for Computational Linguistics.
Florio, K., Basile, V., Polignano, M., Basile, P., and Patti, V. (2020). Time of your hate: The challenge of time in hate speech detection on social media. Applied Sciences, 10(12):4180.
Gomez, R., Gibert, J., Gomez, L., and Karatzas, D. (2020). Exploring hate speech detection in multimodal publications. In , pages 1459–1467.
Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T. (2017). Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431, Valencia, Spain. Association for Computational Linguistics.
Kiela, D., Firooz, H., Mohan, A., Goswami, V., Singh, A., Ringshia, P., and Testuggine, D. (2020). The hateful memes challenge: Detecting hate speech in multimodal memes.
Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. MT Summit.
Kumar, R., Ojha, A. K., Malmasi, S., and Zampieri, M. (2018). Benchmarking aggression identification in social media. In Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018), pages 1–11, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Kumar, R., Ojha, A. K., Malmasi, S., and Zampieri, M. (2020). Evaluating aggression identification in social media. In Kumar, R., Ojha, A. K., Lahiri, B., Zampieri, M., Malmasi, S., Murdock, V., and Kadar, D., editors, Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying (TRAC-2020), Paris, France. European Language Resources Association (ELRA).
Le, Q. V. and Mikolov, T. (2014). Distributed representations of sentences and documents. CoRR, abs/1405.4053.
Liu, P., Qiu, X., and Huang, X. (2016). Deep multi-task learning with shared memory for text classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 118–127, Stroudsburg, PA, USA. Association for Computational Linguistics.
Mandl, T., Modha, S., Majumder, P., Patel, D., Dave, M., Mandlia, C., and Patel, A. (2019). Overview of the HASOC track at FIRE 2019: Hate speech and offensive content identification in Indo-European languages. In Proceedings of the 11th Forum for Information Retrieval Evaluation, FIRE '19, pages 14–17, New York, NY, USA. Association for Computing Machinery.
Mishra, S. (2019). Multi-dataset-multi-task neural sequence tagging for information extraction from tweets. In Proceedings of the 30th ACM Conference on Hypertext and Social Media - HT '19, pages 283–284, New York, New York, USA. ACM Press.
Mishra, S. (2020a). Information extraction from digital social trace data with applications to social media and scholarly communication data. ACM SIGIR Forum, 54(1).
Mishra, S. (2020b). Information Extraction from Digital Social Trace Data with Applications to Social Media and Scholarly Communication Data. PhD thesis, University of Illinois at Urbana-Champaign.
Mishra, S. (2020c). Non-neural structured prediction for event detection from news in Indian languages. In Mehta, P., Mandl, T., Majumder, P., and Mitra, M., editors, Working Notes of FIRE 2020 - Forum for Information Retrieval Evaluation, Hyderabad, India. CEUR Workshop Proceedings, CEUR-WS.org.
Mishra, S., Agarwal, S., Guo, J., Phelps, K., Picco, J., and Diesner, J. (2014). Enthusiasm and support: alternative sentiment classification for social movements on social media. In Proceedings of the 2014 ACM conference on Web science - WebSci '14, pages 261–262, Bloomington, Indiana, USA. ACM Press.
Mishra, S. and Diesner, J. (2016). Semi-supervised named entity recognition in noisy-text. In Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT), pages 203–212, Osaka, Japan. The COLING 2016 Organizing Committee.
Mishra, S. and Diesner, J. (2019). Capturing signals of enthusiasm and support towards social issues from Twitter. In Proceedings of the 5th International Workshop on Social Media World Sensors - SIdEWayS'19, pages 19–24, New York, New York, USA. ACM Press.
Mishra, S. and Mishra, S. (2019). 3Idiots at HASOC 2019: Fine-tuning transformer neural networks for hate speech identification in Indo-European languages. In Proceedings of the 11th annual meeting of the Forum for Information Retrieval Evaluation, pages 208–213, Kolkata, India.
Mishra, S., Prasad, S., and Mishra, S. (2020a). Model and predictions for multi-task multi-lingual learning of transformer models for hate speech and offensive speech identification in social media. Accessible at: https://doi.org/10.13012/B2IDB-3565123_V1.
Mishra, S., Prasad, S., and Mishra, S. (2020b). Multilingual joint fine-tuning of transformer models for identifying trolling, aggression and cyberbullying at TRAC 2020. In Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying (TRAC-2020).
Mondal, M., Silva, L. A., and Benevenuto, F. (2017). A measurement study of hate speech in social media. In Proceedings of the 28th ACM Conference on Hypertext and Social Media - HT '17, pages 85–94, New York, New York, USA. ACM Press.
Mozafari, M., Farahbakhsh, R., and Crespi, N. (2020). A BERT-based transfer learning approach for hate speech detection in online social media. In Cherifi, H., Gaito, S., Mendes, J. F., Moro, E., and Rocha, L. M., editors, Complex Networks and Their Applications VIII, pages 928–940, Cham. Springer International Publishing.
Mujadia, V., Mishra, P., and Sharma, D. M. (2019). IIIT-Hyderabad at HASOC 2019: Hate speech detection.
Perrin, A. (2015). Social Media Usage: 2005-2015. Technical report, Pew Research Center.
Plank, B. (2017). All-in-1 at IJCNLP-2017 task 4: Short text classification with one model for all languages. In Proceedings of the IJCNLP 2017, Shared Tasks, pages 143–148, Taipei, Taiwan. Asian Federation of Natural Language Processing.
Ranasinghe, T. and Zampieri, M. (2020). Multilingual offensive language identification with cross-lingual embeddings. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5838–5844, Online. Association for Computational Linguistics.
Razavi, A. H., Inkpen, D., Uritsky, S., and Matwin, S. (2010). Offensive language detection using multi-level classification. In Proceedings of the 23rd Canadian Conference on Advances in Artificial Intelligence, AI'10, pages 16–27, Berlin, Heidelberg. Springer-Verlag.
Risch, J. and Krestel, R. (2018). Aggression identification using deep learning and data augmentation. In Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (co-located with COLING), pages 150–158.
Risch, J. and Krestel, R. (2020). Bagging BERT models for robust aggression identification. In Proceedings of the Workshop on Trolling, Aggression and Cyberbullying (TRAC@LREC).
Ruiter, D., Rahman, M. A., and Klakow, D. (2019). LSV-UdS at HASOC 2019: The problem of defining hate. In Mehta, P., Rosso, P., Majumder, P., and Mitra, M., editors, Working Notes of FIRE 2019 - Forum for Information Retrieval Evaluation, Kolkata, India, December 12-15, 2019, volume 2517 of CEUR Workshop Proceedings, pages 263–270. CEUR-WS.org.
Saha, P., Mathew, B., Goyal, P., and Mukherjee, A. (2019). HateMonitors: Language agnostic abuse detection in social media.
Salminen, J., Almerekhi, H., Milenković, M., Jung, S.-g., An, J., Kwak, H., and Jansen, B. (2018). Anatomy of online hate: Developing a taxonomy and machine learning models for identifying and classifying hate in online news media.
Schmidt, A. and Wiegand, M. (2017). A survey on hate speech detection using natural language processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 1–10, Stroudsburg, PA, USA. Association for Computational Linguistics.
Sennrich, R., Haddow, B., and Birch, A. (2016). Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Stroudsburg, PA, USA. Association for Computational Linguistics.
Søgaard, A. and Goldberg, Y. (2016). Deep multi-task learning with low level tasks supervised at lower layers. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 231–235. Association for Computational Linguistics.
Sticca, F. and Perren, S. (2013). Is cyberbullying worse than traditional bullying? Examining the differential roles of medium, publicity, and anonymity for the perceived severity of bullying. Journal of Youth and Adolescence, 42(5):739–750.
Struß, J., Siegel, M., Ruppenhofer, J., Wiegand, M., and Klenner, M. (2019). Overview of GermEval task 2, 2019 shared task on the identification of offensive language. In KONVENS.
Van Hee, C., Lefever, E., Verhoeven, B., Mennes, J., Desmet, B., De Pauw, G., Daelemans, W., and Hoste, V. (2015). Detection and fine-grained classification of cyberbullying events. In Angelova, G., Bontcheva, K., and Mitkov, R., editors, Proceedings of Recent Advances in Natural Language Processing, Proceedings, pages 672–680.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
Vidgen, B., Harris, A., Nguyen, D., Tromble, R., Hale, S., and Margetts, H. (2019). Challenges and frontiers in abusive content detection. In Proceedings of the Third Workshop on Abusive Language Online, pages 80–93, Florence, Italy. Association for Computational Linguistics.
Wang, B., Ding, Y., Liu, S., and Zhou, X. (2019). YNU wb at HASOC 2019: Ordered neurons LSTM with attention for identifying hate speech and offensive language.
Wang, C. (2018). Interpreting neural network hate speech classifiers. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pages 86–92, Brussels, Belgium. Association for Computational Linguistics.
Waseem, Z., Davidson, T., Warmsley, D., and Weber, I. (2017). Understanding abuse: A typology of abusive language detection subtasks. In Proceedings of the First Workshop on Abusive Language Online, pages 78–84, Vancouver, BC, Canada. Association for Computational Linguistics.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., and Brew, J. (2019). HuggingFace's Transformers: State-of-the-art natural language processing.
Yang, F., Peng, X., Ghosh, G., Shilon, R., Ma, H., Moore, E., and Predovic, G. (2019). Exploring deep multimodal fusion of text and photo for hate speech classification. In Proceedings of the Third Workshop on Abusive Language Online, pages 11–18, Florence, Italy. Association for Computational Linguistics.
Zampieri, M., Malmasi, S., Nakov, P., Rosenthal, S., Farra, N., and Kumar, R. (2019). SemEval-2019 task 6: Identifying and categorizing offensive language in social media (OffensEval). In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 75–86, Minneapolis, Minnesota, USA. Association for Computational Linguistics.
Appendix

.1 Label Distribution

Fig. 4: English data class-wise distribution.
Fig. 5: German data class-wise distribution.
Fig. 6: Hindi data class-wise distribution.

[The figures show the per-class instance counts for sub-tasks A, B and C on each train/dev/test split; the plotted values could not be recovered from the source.]

.2 Back translation top changed words
Here we list the top 5 words per label for each task obtained after removing the top 50 words which were either introduced or removed via back-translation. We do not list the top words for Hindi because of the encoding issue in LaTeX.
Listing 1: Changed words in English Training Data (the beginning of the listing could not be recovered from the source)

[...('***', 5), ('there', 4), ('these', 4)]
task 3 removed words
NONE [('happy', 49), ('than', 47), ('being', 45), ('every', 45), ('been', 43)]
TIN [('fuck', 48), ("he's", 39), ('what', 35), ("don't", 34), ("he' s", 34)]
UNT [('them', 7), ('f***', 6), ('being', 5), ('such', 5), ('does', 5)]