Hate-Alert@DravidianLangTech-EACL2021: Ensembling strategies for Transformer-based Offensive language Detection
Debjoy Saha, Naman Paharia, Debajit Chakraborty, Punyajoy Saha, Animesh Mukherjee
Indian Institute of Technology, Kharagpur
[email protected], [email protected], [email protected], [email protected], [email protected]
Abstract
Social media often acts as a breeding ground for different forms of offensive content. For low-resource languages like Tamil, the situation is more complex due to the poor performance of multilingual or language-specific models and the lack of proper benchmark datasets. Based on the shared task "Offensive Language Identification in Dravidian Languages" at EACL 2021, we present an exhaustive exploration of different transformer models. We also provide a genetic algorithm technique for ensembling different models. Our ensembled models, trained separately for each language, secured the first position in the Tamil, the second position in the Kannada, and the first position in the Malayalam sub-tasks. The models and code are provided at https://github.com/Debjoy10/Hate-Alert-DravidianLangTech.

Introduction

Social media platforms have become a prominent means of communication, be it for acquiring information or for the promotion of business. While we cannot deny the positives, there are some ill consequences of social media as well. Bad actors often misuse different social media platforms by posting tweets/comments that insult others by targeting their culture and beliefs. On social media, such posts are collectively known as offensive language (Chen et al., 2012). To reduce offensive content, different social media platforms like YouTube have laid down moderation policies and employ moderators for maintaining civility on their platforms. Recently, moderators have been finding it difficult (Newton, 2019) to continue moderation due to the ever-increasing volume of offensive data. Hence, platforms are looking toward automatic moderation systems; Facebook, for instance, is moving in this direction (Robertson). The situation for countries like India is more complex, as courts often face a dilemma while interpreting harmful content, and social platforms like Facebook are often unable to take the necessary actions. Hence, more effort is required to detect and mitigate offensive language in Indian social media.

Recently, shared tasks like HASOC 2019 (https://hasocfire.github.io/hasoc/2019/index.html) have been launched to understand hateful and offensive language in the Indian context, but they are mostly limited to Hindi and English. A sub-task in HASOC 2020 (https://sites.google.com/view/dravidian-codemix-fire2020/overview) aimed to detect offensive posts in a code-mixed dataset. Extending that task further, the organisers of this shared task have put together large datasets of 43919, 7772 and 20010 posts in three Dravidian languages, Tamil, Kannada and Malayalam respectively, to further advance research on offensive posts in these languages. In this paper, we aim to build algorithmic systems that can detect offensive posts. The contributions of our paper are two-fold. First, we investigate how current state-of-the-art multilingual language models perform on these languages. Second, we demonstrate how ensembling techniques can improve classification performance.

Related Work

Offensive language has been studied in the research community for a long time.
One of the earliest studies (Chen et al., 2012) tried to detect offensive users by using lexical and syntactic features generated from their posts. Although they provided an efficient framework for future research, their dataset was too small for any conclusive evidence. Davidson et al. (2017) curated one of the largest datasets containing both offensive and hate speech. The authors found that one of the issues with their best performing models was the inability to distinguish between hateful and offensive posts. To mitigate this, subsequent research (Pitsilis et al., 2018) used deep learning to identify offensive language in English and found recurrent neural networks (RNNs) to be quite effective for this task. Recently, the research community has begun to focus on offensive language detection in other low-resourced languages like Danish (Sigurbergsson and Derczynski, 2019), Greek (Pitenis et al., 2020) and Turkish (Çöltekin, 2020). In the Indian context, the HASOC 2019 shared task (Mandl et al., 2019) was a significant effort in that direction, where the authors developed a dataset of hateful and offensive posts in Hindi and English. The best model in that competition used an ensemble of multilingual transformers fine-tuned on the given dataset (Mishra and Mishra, 2019). In the Dravidian part of HASOC 2020, Renjit and Idicula (2020) used an ensemble of deep learning models and simple neural networks to identify offensive posts in Manglish (Malayalam in Roman script).

Transformer-based language models have become quite popular in the past few years. Recently, different multilingual models like XLM-RoBERTa (Conneau et al., 2019), multilingual-BERT (Devlin et al., 2018), MuRIL (https://tfhub.dev/google/MuRIL/1) and IndicBERT have been introduced to facilitate NLP research in different languages. In many machine learning pipelines, ensembling different classification outcomes helps in achieving better performance (Alonso et al., 2020; Renjit and Idicula, 2020; Mishra and Mishra, 2019). Rather than selecting the models for an ensemble manually, genetic algorithms (GA) can be used to optimise the weights of the different classifiers so as to improve the ensemble's performance on the development set. GA-based ensembling techniques have previously been used in the hate speech domain for architecture and hyperparameter search (Madukwe et al., 2020).

Task Description

The shared task on Offensive Language Identification in Dravidian Languages at EACL 2021 (Chakravarthi et al., 2021) is a post classification problem with the aim of moderating and minimising offensive content in social media. The objective of the shared task is to develop methodology and language models for code-mixed data in low-resource languages, as models trained on monolingual data fail to comprehend the semantic complexity of a code-mixed dataset.
Dataset
The Dravidian offensive code-mixed language dataset is available for Tamil (Chakravarthi et al., 2020b), Kannada (Hande et al., 2020) and Malayalam (Chakravarthi et al., 2020a). The data is scraped entirely from the YouTube comments of a multilingual community where code-mixing is a prevalent phenomenon. The dataset contains rows of text and the corresponding labels from the list not-offensive, offensive-untargeted, offensive-targeted-individual, offensive-targeted-group, offensive-targeted-other and not-in-intended-language. The final evaluation score was calculated using the weighted F1-score metric on a held-out test dataset. We present the dataset statistics in Table 1. Note that the Malayalam split of the dataset contains no instances of the 'Offensive-targeted-other' label, so classification is done using five labels instead of the original six. In order to estimate the amount of misspelt and code-mixed words, we compare the dataset against an existing pure-language vocabulary available in the Dakshina dataset (Roark et al., 2020). We find the proportion of out-of-vocabulary (OOV) words (including code-mixed, English and misspelt words) to be 85.55%, 84.23% and 83.03% in Tamil, Malayalam and Kannada respectively.

Class                          Tamil                  Kannada              Malayalam
                               Train   Dev    Test    Train   Dev   Test   Train   Dev    Test
Not-offensive                  25425   3193   3190    3544    426   427    14153   1779   1765
Offensive-untargeted           2906    356    368     212     33    33     191     20     29
Offensive-targeted-individual  2343    307    315     487     66    75     239     24     27
Offensive-targeted-group       2557    295    288     329     45    44     140     13     23
Offensive-targeted-other       454     65     71      123     16    14     -       -      -
Not-in-intended-language       1454    172    160     1522    191   185    1287    163    157
Total                          35139   4388   4392    6217    777   778    16010   1999   2001

Table 1: Dataset statistics for the languages Tamil, Kannada and Malayalam for the Train, Dev and Test splits
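For illustration, the OOV proportion can be estimated by tokenising each comment and checking membership in the reference vocabulary. The following is a minimal sketch, assuming a one-word-per-line vocabulary file derived from Dakshina and a one-comment-per-line corpus file; the file names and the whitespace tokenisation are our assumptions, not necessarily the exact procedure used.

```python
# Hypothetical sketch: estimate the OOV proportion of a code-mixed corpus
# against a pure-language vocabulary (e.g. one built from Dakshina lexicons).
# File names and whitespace tokenisation are assumptions for illustration.

def load_vocab(path):
    """One known in-vocabulary word per line."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def oov_proportion(comments, vocab):
    """Fraction of whitespace tokens absent from the vocabulary."""
    total, oov = 0, 0
    for text in comments:
        for token in text.split():
            total += 1
            if token not in vocab:
                oov += 1
    return oov / total if total else 0.0

if __name__ == "__main__":
    vocab = load_vocab("dakshina_tamil_vocab.txt")                   # assumed file
    with open("tamil_offensive_train.txt", encoding="utf-8") as f:   # assumed file
        comments = [line.rstrip("\n") for line in f]
    print(f"OOV proportion: {oov_proportion(comments, vocab):.2%}")
```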
Methodology

In this section, we discuss the different parts of the pipeline that we followed to detect offensive posts in this dataset.
As a part of our initial experiments, we used several machine learning models to establish baseline performance. We employed random forests and logistic regression, trained on TF-IDF vectors. The best results were obtained with the ExtraTrees classifier (Geurts et al., 2006), with weighted F1-scores of 0.70, 0.63 and 0.95 on Tamil, Kannada and Malayalam respectively. As we will see below, these scores were lower than those of a single transformer-based model, so the simple machine learning models were not used in the subsequent analysis.
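A minimal sketch of such a baseline follows; the n-gram range, feature cap and number of trees are our assumptions, not the paper's reported settings.

```python
# Hypothetical baseline sketch: TF-IDF features + ExtraTrees classifier,
# evaluated with weighted F1 as in the shared task. Hyperparameters and
# data loading are assumptions for illustration.
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

def run_baseline(train_texts, train_labels, dev_texts, dev_labels):
    pipeline = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), max_features=50000),
        ExtraTreesClassifier(n_estimators=300, random_state=0),
    )
    pipeline.fit(train_texts, train_labels)
    preds = pipeline.predict(dev_texts)
    return f1_score(dev_labels, preds, average="weighted")

# Usage: weighted_f1 = run_baseline(train_X, train_y, dev_X, dev_y)
```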
One of the issues with simple machine learning models is their inability to learn the context of a word from its neighbourhood. Recent transformer-based architectures are capable of capturing this context, as established by their superior performance on different downstream tasks. For our purpose, we fine-tuned different state-of-the-art multilingual BERT models on the given datasets: XLM-RoBERTa (Conneau et al., 2019; XLM-Roberta-Base, 270M parameters, trained on data from 100 languages), multilingual-BERT (Devlin et al., 2018; Multilingual-BERT-Base, 179M parameters, trained on data from the top 104 languages), IndicBERT, and MuRIL (Multilingual Representations for Indian Languages, a BERT model originally released by Google and pre-trained on code-mixed data from 17 Indian languages; https://huggingface.co/simran-kh/muril-cased-temp). We also pretrain XLM-Roberta-Base on the target dataset for 20 epochs using masked language modeling (MLM), to capture the semantics of the code-mixed corpus. This additional pretrained BERT model was also used for fine-tuning.
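A minimal fine-tuning sketch with the HuggingFace Trainer is given below. The 1e-5 learning rate follows the paper; the batch size, epoch count and the toy dataset are our assumptions for illustration.

```python
# Hypothetical sketch: fine-tuning a multilingual checkpoint for the 6-way
# offensive-post classification task with HuggingFace Transformers.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "xlm-roberta-base"  # or "bert-base-multilingual-cased", etc.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint,
                                                           num_labels=6)

class CommentDataset(torch.utils.data.Dataset):
    """Tokenised comments with integer class labels."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

train_ds = CommentDataset(["comment one", "comment two"], [0, 1])  # toy data
args = TrainingArguments(output_dir="ft-offensive", learning_rate=1e-5,
                         num_train_epochs=3,                # assumed
                         per_device_train_batch_size=8)     # assumed
Trainer(model=model, args=args, train_dataset=train_ds).train()
```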
Classifiers        Tamil         Kannada       Malayalam
                   Dev    Test   Dev    Test   Dev    Test
XLMR-base (A)      0.77   0.76   0.69   0.70   0.97   0.96
XLMR-large
XLMR-C (B)         0.76   0.76
mBERT-base (C)     0.73   0.72   0.69   0.70   0.97   0.96
IndicBERT          0.73   0.71   0.62   0.66   0.96   0.95
MuRIL              0.75   0.74   0.67   0.67   0.96   0.96
DistilBERT         0.74   0.74   0.68   0.69   0.96   0.95
CNN                0.71   0.70   0.60   0.61   0.95   0.95
CNN + A + C        0.78   0.76   0.71   0.70
CNN + A + B
CNN + B + C        0.77   0.76

Table 2: Weighted F1-score comparison for transformer, CNN and fusion models on the Dev and Test splits (XLMR-C refers to the custom-pretrained XLM-Roberta-Base classifier).
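The XLMR-C checkpoint in Table 2 comes from the masked language modeling step described earlier. A minimal sketch of such continued pretraining is shown below; the 20 epochs follow the paper, while the masking probability, block size, batch size and file name are our assumptions.

```python
# Hypothetical sketch: continued MLM pretraining of XLM-Roberta-Base on the
# task corpus (producing the "XLMR-C" checkpoint used in Table 2).
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          LineByLineTextDataset, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# One raw comment per line; the file name is an assumption.
dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="tamil_comments.txt",
                                block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)

args = TrainingArguments(output_dir="xlmr-c",
                         num_train_epochs=20,              # as in the paper
                         per_device_train_batch_size=16)   # assumed
Trainer(model=model, args=args, data_collator=collator,
        train_dataset=dataset).train()
```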
[Figure 1: Our fusion model architecture for two BERT models. Input data is fed through a FastText tokenizer into a 1-D CNN (128 × 1 embedding) and through BERT_1 and BERT_2 (768 × 1 embeddings each); an embedding fusion layer concatenates these into a 1664 × 1 vector that feeds the fusion classifier head. Note that 768 × 1 embedding sizes are used for the BERT-base models and 1024 × 1 for the BERT-large models.]
In addition, all models were fine-tuned separately using unweighted and weighted cross-entropy loss functions (Mannor et al., 2005). For training, we use HuggingFace (Wolf et al., 2019) with PyTorch (Paszke et al., 2019). We use the Adam adaptive optimizer (Loshchilov and Hutter, 2019) with an initial learning rate of 1e-5. Training is stopped by early stopping if the macro-F1 score on the development split of the dataset does not increase for 5 epochs.

Convolutional neural networks are able to capture neighbourhood information more effectively; one of the previous state-of-the-art models for detecting hate speech was CNN-GRU (Zhang et al., 2018). We propose a new BERT-CNN fusion classifier in which we train a single classification head on the concatenated embeddings from different BERT and CNN models. The BERT models were initialised with the fine-tuned weights from the previous section, and these weights were frozen. The number of BERT models in a single fusion model was kept flexible, with the maximum fixed at three due to memory limitations. For the CNN part, we use the 128-dimensional final-layer embeddings from CNN models trained on skip-gram word vectors using FastText (Bojanowski et al., 2017; https://fasttext.cc/docs/en/unsupervised-tutorial.html). FastText vectors worked best among the word embeddings we tried, including LASER (Artetxe and Schwenk, 2019). For the fusion classifier head, we use a four-layer feed-forward neural network with batch normalization (Ioffe and Szegedy, 2015) and dropout (Srivastava et al., 2014) on the final layer. The predictions are generated from a softmax layer of dimension equal to the number of classes. We present the details of the pipeline in Figure 1; a minimal sketch of the architecture is given after Table 3.

Model Sets       Tamil         Kannada       Malayalam
                 Dev    Test   Dev    Test   Dev    Test
Transformers     0.80   0.78   0.74   0.73   0.98   0.97
F-models         0.79   0.77   0.73   0.73   0.98   0.97
R-models         0.79   0.78   0.75   0.74   0.97   0.97
Overall          0.80   0.78   0.75   0.74   0.98   0.97

Table 3: Weighted F1-score comparison for the GA-weighted ensembles of the transformers category, fusion models (F-models) and random seed models (R-models)
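The following is a minimal PyTorch sketch of the fusion classifier for two frozen BERT-base encoders plus a CNN embedding. The 768- and 128-dimensional inputs, the four-layer head, and the batch normalization and dropout on the final layer follow the paper; the hidden sizes and dropout rate are our assumptions.

```python
# Hypothetical sketch of the BERT-CNN fusion head: concatenate two frozen
# 768-d BERT embeddings with a 128-d CNN embedding (1664-d total, as in
# Figure 1) and classify with a 4-layer feed-forward head.
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, num_classes=6, bert_dim=768, cnn_dim=128):
        super().__init__()
        in_dim = 2 * bert_dim + cnn_dim  # 1664 for two BERT-base models
        self.head = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),      # hidden sizes assumed
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.BatchNorm1d(64), nn.Dropout(0.3),    # on the final layer
            nn.Linear(64, num_classes),
        )

    def forward(self, bert_emb_1, bert_emb_2, cnn_emb):
        # Embeddings come from frozen, fine-tuned encoders (computed upstream).
        fused = torch.cat([bert_emb_1, bert_emb_2, cnn_emb], dim=-1)
        return torch.softmax(self.head(fused), dim=-1)

# Usage with dummy tensors:
model = FusionClassifier()
probs = model(torch.randn(4, 768), torch.randn(4, 768), torch.randn(4, 128))
```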
An ensemble of different models often turns out to be a better predictor than a single classifier. Standard prediction-averaging ensembles do not perform well here, since some models in the mix may be weak predictors. One strategy to reduce the influence of weak models is to weight the different models based on their performance. Genetic algorithm (GA) based techniques (Madukwe et al., 2020) are one of the popular ways to set the weights of the different models in an ensemble. Our approach is similar to the one introduced in Zhou et al. (2001), except that instead of selecting the models with the highest weights for the final ensemble, we directly use the weights to compute the weighted-average ensemble; a sketch is given at the end of this section.

Another issue with neural networks is that their performance depends on the initial random seeds. With pretrained models like BERT, most of the weights are fixed, and randomness enters only in the final layer (classification head). Past research (McCoy et al., 2020) has shown that even the initialisation of this final layer can affect the final performance by large margins. Hence, we train the models with 10 different random seeds and then pass all the models to the GA pipeline. We perform this operation for two of the best models in Table 2.

Results

We observe that among the individual transformer models, the best performance is obtained using XLM-RoBERTa-large (XLMR-large) on the Tamil dataset and the custom-pretrained XLM-RoBERTa-base (XLMR-C) on the Kannada dataset. On the Malayalam dataset, both models perform similarly. The higher performance of the XLM-RoBERTa models (Conneau et al., 2019) can be attributed to the fact that they are pretrained using a parallel corpus (the same corpus in different languages). Further pretraining on our dataset yields additional improvement on the Kannada dataset. We did not use the XLMR-large model further due to limited GPU memory. Next, we note the performance of the fusion models, which is almost the same across the different combinations.

When we use different random seeds, the performance of the multilingual BERT models varies by around 2-3% across the different languages; for the XLM-RoBERTa models the variation is larger (around 15-20%). Table 3 shows the ensemble performance of the different categories of models and of all the models combined. GA-optimised weighted ensembling improves the final model scores by 1-2% across the datasets of the different languages, which finally helped us rank higher on the leaderboard.
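As referenced above, the sketch below illustrates GA-weighted ensembling over per-model class-probability predictions. The weighted-average objective follows the paper; the population size, mutation scale, generation count and the mutation-plus-truncation-selection scheme are our assumptions (a library such as DEAP could be used instead).

```python
# Hypothetical sketch: optimise ensemble weights with a simple genetic
# algorithm so that the weighted average of per-model class probabilities
# maximises the weighted F1 on the dev split.
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

def ensemble_f1(weights, probs, labels):
    """probs: (n_models, n_samples, n_classes); weights: (n_models,)."""
    w = np.abs(weights) / (np.abs(weights).sum() + 1e-12)  # normalise
    avg = np.tensordot(w, probs, axes=1)                   # (n_samples, n_classes)
    return f1_score(labels, avg.argmax(-1), average="weighted")

def ga_optimise(probs, labels, pop_size=50, generations=100, sigma=0.1):
    pop = rng.random((pop_size, probs.shape[0]))
    for _ in range(generations):
        fitness = np.array([ensemble_f1(ind, probs, labels) for ind in pop])
        parents = pop[np.argsort(fitness)[-pop_size // 2:]]        # keep fittest
        children = parents + rng.normal(0, sigma, parents.shape)   # mutate
        pop = np.vstack([parents, children])
    fitness = np.array([ensemble_f1(ind, probs, labels) for ind in pop])
    return pop[fitness.argmax()]

# Usage: optimise on dev predictions, then reuse the weights on the test split.
dev_probs = rng.random((4, 200, 6))        # placeholder model predictions
dev_labels = rng.integers(0, 6, 200)       # placeholder labels
best_weights = ga_optimise(dev_probs, dev_labels)
```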
Conclusion

In this shared task, we evaluated different transformer-based architectures and introduced different ensembling strategies. We found that XLM-RoBERTa models usually perform better than other transformer models, although their performance is highly variable across different random seeds. GA-based ensembling helps in further improving the models. Our immediate next step will be to investigate the reason behind the lower performance of IndicBERT and MuRIL, which are specifically trained for the Indian context.
References
Pedro Alonso, Rajkumar Saini, and György Kovács. 2020. Hate speech detection using transformer ensembles on the HASOC dataset. In International Conference on Speech and Computer, pages 13–21. Springer.

Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597–610.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Bharathi Raja Chakravarthi, Navya Jose, Shardul Suryawanshi, Elizabeth Sherly, and John Philip McCrae. 2020a. A sentiment analysis dataset for code-mixed Malayalam-English. In Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), pages 177–184, Marseille, France. European Language Resources Association.

Bharathi Raja Chakravarthi, Vigneshwaran Muralidaran, Ruba Priyadharshini, and John Philip McCrae. 2020b. Corpus creation for sentiment analysis in code-mixed Tamil-English text. In Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), pages 202–210, Marseille, France. European Language Resources Association.

Bharathi Raja Chakravarthi, Ruba Priyadharshini, Navya Jose, Anand Kumar M, Thomas Mandl, Prasanna Kumar Kumaresan, Rahul Ponnusamy, Hariharan V, Elizabeth Sherly, and John Philip McCrae. 2021. Findings of the shared task on Offensive Language Identification in Tamil, Malayalam, and Kannada. In Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages. Association for Computational Linguistics.

Ying Chen, Yilu Zhou, Sencun Zhu, and Heng Xu. 2012. Detecting offensive language in social media to protect adolescent online safety. pages 71–80. IEEE.

Çağrı Çöltekin. 2020. A corpus of Turkish offensive language on social media. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 6174–6184.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.

Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated hate speech detection and the problem of offensive language. In Proceedings of the International AAAI Conference on Web and Social Media, volume 11.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Pierre Geurts, Damien Ernst, and Louis Wehenkel. 2006. Extremely randomized trees. Machine Learning, 63(1):3–42.

Adeep Hande, Ruba Priyadharshini, and Bharathi Raja Chakravarthi. 2020. KanCMD: Kannada CodeMixed dataset for sentiment analysis and offensive language detection. In Proceedings of the Third Workshop on Computational Modeling of People's Opinions, Personality, and Emotion's in Social Media, pages 54–63, Barcelona, Spain (Online). Association for Computational Linguistics.

Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization.

K. J. Madukwe, X. Gao, and B. Xue. 2020. A GA-based approach to fine-tuning BERT for hate speech detection. pages 2821–2828.

Thomas Mandl, Sandip Modha, Prasenjit Majumder, Daksh Patel, Mohana Dave, Chintak Mandlia, and Aditya Patel. 2019. Overview of the HASOC track at FIRE 2019: Hate speech and offensive content identification in Indo-European languages. In Proceedings of the 11th Forum for Information Retrieval Evaluation, pages 14–17.

Shie Mannor, Dori Peleg, and Reuven Rubinstein. 2005. The cross entropy method for classification. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pages 561–568, New York, NY, USA. Association for Computing Machinery.

R. Thomas McCoy, Junghyun Min, and Tal Linzen. 2020. BERTs of a feather do not generalize together: Large variability in generalization across models with similar test set performance. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 217–227.

Shubhanshu Mishra and Sudhanshu Mishra. 2019. 3Idiots at HASOC 2019: Fine-tuning transformer neural networks for hate speech identification in Indo-European languages. In FIRE (Working Notes).

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703.

Zeses Pitenis, Marcos Zampieri, and Tharindu Ranasinghe. 2020. Offensive language identification in Greek. arXiv preprint arXiv:2003.07459.

Georgios K. Pitsilis, Heri Ramampiaro, and Helge Langseth. 2018. Detecting offensive language in tweets using deep learning. arXiv preprint arXiv:1801.04433.

Gudbjartur Ingi Sigurbergsson and Leon Derczynski. 2019. Offensive language and hate speech detection for Danish. arXiv preprint arXiv:1908.04531.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56):1929–1958.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

Ziqi Zhang, D. Robinson, and Jonathan Tepper. 2018. Detecting hate speech on Twitter using a convolution-GRU based deep neural network.

Zhi-Hua Zhou, Jian-Xin Wu, Yuan Jiang, and Shi-Fu Chen. 2001. Genetic algorithm based selective neural network ensemble.