Hopeful_Men@LT-EDI-EACL2021: Hope Speech Detection Using Indic Transliteration and Transformers
Ishan Sanjeev Upadhyay*, Nikhil E*, Anshul Wadhawan, and Radhika Mamidi
International Institute of Information Technology, Hyderabad
Flipkart Private Limited
{ishan.sanjeev, nikhil.e}@research.iiit.ac.in, [email protected], [email protected]

*These authors contributed equally to this work.

Abstract
This paper aims to describe the approach we used to detect hope speech in the HopeEDI dataset. We experimented with two approaches. In the first approach, we used contextual embeddings to train classifiers using logistic regression, random forest, SVM, and LSTM-based models. The second approach involved using a majority-voting ensemble of 11 models which were obtained by fine-tuning pre-trained transformer models (BERT, ALBERT, RoBERTa, IndicBERT) after adding an output layer. We found that the second approach was superior for English, Tamil and Malayalam. Our solution got a weighted F1 score of 0.93, 0.75 and 0.49 for English, Malayalam and Tamil respectively. Our solution ranked first in English, eighth in Malayalam and eleventh in Tamil.
The spread of hate speech on social media is a problem that still exists today. While there have been attempts at hate speech detection (Schmidt and Wiegand, 2017; Lee et al., 2018) to stop the spread of negativity, this form of censorship can also be misused to obstruct rights and freedom of speech. Furthermore, hate speech tends to spread faster than non-hate speech (Mathew et al., 2019). While there has been a growing number of marginalized people looking for support online (Gowen et al., 2012; Wang and Jurgens, 2018), there has been a substantial amount of hate towards them too (Mondal et al., 2017). Therefore, detecting and promoting content that reduces hostility and increases hope is important. Hope speech detection can be seen as a rare positive mining task because hope speech constitutes a low percentage of overall content (Palakodety et al., 2020a).
There has been prior work on hope speech and help speech detection that used logistic regression and active learning techniques (Palakodety et al., 2020a,b). In our paper, we perform the hope speech detection task on the HopeEDI dataset (Chakravarthi, 2020), which consists of user comments from YouTube in English, Tamil and Malayalam. In this paper, we first look at the task definition, followed by the methodology used. We then look at the experiments and results, followed by the conclusion and future work.
The given problem is a comment-level classification task for the identification of "hope speech" within YouTube comments, wherein comments are to be classified as "Hope speech", "Not hope speech" or "Not in intended language". The data provided in the task was annotated on a per-comment basis, wherein a comment could be composed of more than one sentence.
This section describes the methodology we used to solve the task. As shown in Figure 1, the pipeline involves preprocessing, language detection, transliteration (for Indian languages), and hope speech detection. Each of these steps is described below.
The preprocessing module involved the following:
• Removing special characters and excess whitespace
• Removing emojis
• Making text lowercase

[Figure 1: Methodology Pipeline — Preprocessing Module, Language Detection Module, Transliteration Module (Indian Languages), Hope Speech Detection Module]
These steps were taken to make the text more uniform. Special characters such as "@" and emojis were removed as part of this normalization.
The task involves classifying text into hope, not-hope and not-language. The language detection module marks the not-language sentences. We use Google's language detection library (Shuyo, 2010) to do this. The Tamil and Malayalam datasets are code-mixed: inter-sentential, intra-sentential and tag code-mixing, as well as code-mixing between Latin and native scripts, are observed. Since Google's language detection library does not work on such code-mixed data, it cannot be applied to the Tamil and Malayalam sentences directly. We observed that sentences marked as not-Tamil and not-Malayalam were mostly English sentences, with some of them being in Hindi and other languages. Hence, we adopted a heuristic where we marked sentences as not-Tamil or not-Malayalam if they were detected to be in English or Hindi; other sentences were assumed to belong to the respective language.
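A minimal sketch of this heuristic is shown below. It assumes the Python langdetect port of Shuyo's (2010) library; the paper does not name the exact binding, so the function and language codes here are illustrative.

```python
# Sketch of the language-marking heuristic. Assumes the Python "langdetect"
# port of Shuyo's (2010) library; the exact binding used in the paper is not
# specified, so the API and language codes below are illustrative.
from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # make detection deterministic across runs

def is_not_in_language(comment: str, target: str) -> bool:
    """Return True if `comment` should be marked "Not in intended language".

    Tamil/Malayalam comments are marked not-in-language only when they are
    detected as English or Hindi; everything else is assumed to be in the
    target language, since code-mixed text defeats direct detection.
    """
    try:
        detected = detect(comment)
    except LangDetectException:
        return False  # too short or ambiguous: keep it in the target language
    if target == "en":
        return detected != "en"
    return detected in ("en", "hi")
```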
After language detection, sentences that are classified to be in Tamil or Malayalam undergo transliteration. Tamil and Malayalam text have code-mixing between Latin and native scripts, hence transliteration is done to bring the entire text into the native script. This step is also important because it makes the text closer to the kind of text IndicBERT is trained on. Transliteration was done using the indic-transliteration library (https://github.com/sanskrit-coders/indic_transliteration).
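The sketch below illustrates this step with the indic-transliteration library. The paper only names the library; the use of the ITRANS romanization scheme here is an assumption for illustration.

```python
# Sketch of Latin-to-native-script transliteration with the
# indic-transliteration library. The choice of ITRANS as the romanization
# scheme is an assumption; the paper does not specify it.
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

NATIVE_SCHEME = {"tamil": sanscript.TAMIL, "malayalam": sanscript.MALAYALAM}

def to_native_script(romanized_text: str, language: str) -> str:
    """Transliterate romanized (Latin-script) text into the native script."""
    return transliterate(romanized_text, sanscript.ITRANS, NATIVE_SCHEME[language])

# Example: to_native_script("vanakkam", "tamil") returns the Tamil-script form.
```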
After preprocessing and transliteration (for Indian languages), the text is sent to the hope speech detection module, which is responsible for predicting whether a text is hope speech or not hope speech. We have used the following models for our experiment.

Transformers (Vaswani et al., 2017) have performed well in various natural language processing (NLP) tasks. Unlike recurrent neural networks (RNNs), transformers are non-sequential (i.e. sentences are processed as a whole rather than word by word) and use self-attention at each input time step. Hence, they do not suffer from long-dependency issues. Query, key and value are three different ways in which input vectors are used in the self-attention mechanism. The attention score for every input vector is calculated using a compatibility function which takes as input the query vector and all the keys. The final output is a weighted sum of values, where the weights are the attention scores calculated by the compatibility function. We have used the following transformers in our experiment. We chose RoBERTa for our final model in English and ALBERT (IndicBERT) for Tamil and Malayalam.

BERT (Devlin et al., 2019) is based on the transformer architecture. Using its multi-layer encoder module, it is able to jointly utilize both left and right contexts across all layers to pre-train its bidirectional representations. BERT is trained on two unsupervised prediction tasks: next sentence prediction and masked language modelling. We fine-tuned the "bert-base-uncased" model on the dataset for one of our experiments.

RoBERTa (Liu et al., 2019) is a transformer architecture based on optimizations made to the BERT approach. It trains on more data and bigger batches, removes the next sentence prediction objective that BERT used, trains on longer sequences and introduces dynamic masking (i.e. mask tokens change during training epochs). RoBERTa outperforms BERT and XLNet on the GLUE benchmark. For the hope speech classification task in English, we fine-tuned the "roberta-base" model on the provided data. The roberta-base model is trained on 160 GB of English text from five different datasets.

ALBERT (Lan et al., 2020) is a transformer architecture based on BERT but with fewer parameters. There are two key changes in ALBERT. The first is factorized embedding parameterization, which decomposes the large vocabulary embedding matrix into smaller matrices. This makes it easier to grow the hidden size without increasing the parameter count of the vocabulary embeddings, and leads to a reduction in parameters of 80% compared to BERT. The second is cross-layer parameter sharing, which prevents the number of parameters from growing as the depth of the network increases. We fine-tuned the IndicBERT model for the hope speech classification task in Tamil and Malayalam. IndicBERT (Kakwani et al., 2020) is a multilingual ALBERT model pre-trained on 12 major Indian languages. We also fine-tuned the "albert-base-v2" model for our experiment in English.
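A minimal sketch of this fine-tuning setup is given below, assuming the Hugging Face Transformers API; the authors' exact training code and hyperparameters are not published, so the label encoding and example below are illustrative.

```python
# Sketch of fine-tuning a pre-trained transformer with an added output layer,
# assuming the Hugging Face Transformers API; illustrative only.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-base"  # or "bert-base-uncased", "albert-base-v2", IndicBERT

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

batch = tokenizer(["Stay strong, things will get better."],
                  padding=True, truncation=True, max_length=512,
                  return_tensors="pt")
labels = torch.tensor([1])  # 1 = hope speech, 0 = not hope speech (illustrative)

# All layers are trainable, so the loss is back-propagated through the whole
# architecture and the pre-trained weights are updated for the new dataset.
outputs = model(**batch, labels=labels)
outputs.loss.backward()
```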
LSTM
Long Short-Term Memory (Hochreiter and Schmidhuber, 1997) networks seek to solve the short-term memory, or vanishing gradient, problem that RNNs face. They do so by having internal gates that regulate the flow of information. Information flows through a mechanism known as the cell state. The cell can make decisions about what to store, what to forget and what the next hidden state should be. This is done through internal mechanisms called gates, which contain sigmoid activations.
Random Forest Classifier
Random forests (Breiman, 2001) use an ensemble of a large number of decision trees, generally trained with the bagging method. These decision trees are created using random subsamples of the given dataset drawn with replacement (the bootstrap dataset) and a random subset of the features. New samples are classified by choosing the prediction made by the most decision trees (majority voting).
Support Vector Machine
Support vector machine (SVM) (Hearst, 1998) is a supervised learning method that can be used for classification or regression. We have used SVM for classification. The objective of the SVM classification algorithm is to find the hyperplane that most accurately separates two classes plotted in an f-dimensional space, where f is the number of features.
Logistic Regression
Logistic regression (McCullagh and Nelder, 1989) is a statistical model used for binary classification. It does so by using a logistic function to model the binary outcome. It can be extended to multiclass classification problems.
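For the embedding-based first approach, the three scikit-learn classifiers described above could be trained as in the sketch below; the hyperparameters are library defaults and are assumptions, since the paper does not report them (the LSTM-based variant is sketched later with the experimental setup).

```python
# Sketch of approach 1: train scikit-learn classifiers on pre-computed
# contextual sentence embeddings (X: n_sentences x 768, y: 0/1 labels).
# Hyperparameters are library defaults; the paper does not report them.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.svm import SVC

def train_embedding_classifiers(X_train, y_train, X_dev, y_dev):
    classifiers = {
        "E + LR": LogisticRegression(max_iter=1000),
        "E + RF": RandomForestClassifier(n_estimators=100),
        "E + SVM": SVC(),
    }
    scores = {}
    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)                      # train on embeddings
        preds = clf.predict(X_dev)
        scores[name] = f1_score(y_dev, preds, average="weighted")
    return scores
```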
Ensembles can help make better predictions by reducing the spread of predictions, hence lowering variance and improving accuracy. We used a voting-based ensemble method where we trained N models on N different training and validation splits obtained by random shuffling. We then chose majority voting as the merging technique to produce our final prediction y. In majority voting, the final prediction y is the prediction made by the majority of the models. We built two ensembles, of 7 and 11 models respectively, and chose the ensemble that gave the best weighted F1 score.
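A minimal sketch of the majority-voting merge, assuming each of the N fine-tuned models has already produced a label vector for the test set:

```python
# Majority voting over the predictions of N independently trained models.
# `predictions` is a list of N 1-D integer arrays, one per model, each of
# length n_examples.
from collections import Counter
import numpy as np

def majority_vote(predictions):
    stacked = np.stack(predictions)  # shape (N, n_examples)
    # For each example, pick the label predicted by the most models.
    return np.array([Counter(column).most_common(1)[0][0]
                     for column in stacked.T])
```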
Initially, the entire dataset is preprocessed to remove extra tab spaces, punctuation, emojis, mentions and links. In the case of Malayalam and Tamil, we also transliterate the entire dataset. We then split our experimentation procedure into two different approaches. In the first approach, we fine-tune our pre-trained masked language models using the train and validation splits in order to make them more suitable for the subsequent classification task. Thereafter, contextual embeddings for each sentence in the dataset are produced by calculating the average of the second-to-last hidden layer over every token in the sentence. We then trained logistic regression, random forest, SVM and RNN-based classifier models on these embeddings. In the second approach, all the sentences are encoded into tokens using the respective tokenizers, and we add a linear layer on top of the pre-trained model layers after dropout. All the layers of the devised model are then trained such that the error is back-propagated through the entire architecture and the pre-trained weights of the model are updated to reflect the new dataset. For both approaches, we then calculated predictions for the test split and reported performance metrics. For English, we try out three different pre-trained models for both approaches: "roberta-base", "bert-base-uncased" and "albert-base-v2". For Tamil and Malayalam, however, only the IndicBERT model is applicable for either approach.
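The contextual-embedding step of the first approach could look like the following sketch, which averages the second-to-last hidden layer over all tokens of a sentence; it assumes the Hugging Face Transformers API rather than the authors' exact code.

```python
# Sketch of contextual-embedding extraction for approach 1: average the
# second-to-last hidden layer over all tokens of a sentence (768-d output).
# Assumes the Hugging Face Transformers API; the authors' code may differ.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def sentence_embedding(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**inputs).hidden_states  # tuple: embeddings + layers
    second_to_last = hidden_states[-2]                 # (1, seq_len, 768)
    return second_to_last.mean(dim=1).squeeze(0)       # 768-d sentence vector
```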
Language   Dataset  Hope  Not Hope  Other Lang.
English    Train    1962  20778     22
English    Dev      242   2569      2
Tamil      Train    6327  7872      1961
Tamil      Dev      757   998       263
Malayalam  Train    1668  6205      691
Malayalam  Dev      190   784       96

Table 1: Data distribution by class
             English  Tamil  Malayalam
Training     22762    16160  8564
Development  2843     2018   1070
Test         2846     2020   1071
Total        28451    20198  10705

Table 2: Data distribution by language
The HopeEDI dataset consists of YouTube comments marked as "hope", "not hope" and "other language" in three languages: English, Tamil and Malayalam. The distribution of the hope, not hope and other language tags in the training and development datasets is shown in Table 1. The ratio of hope to not hope is around 0.09 in English, 0.26 in Malayalam and 0.79 in Tamil. Table 2 shows the data distribution between the training, development and test datasets. There are a total of 28,451 comments in English, 10,705 comments in Malayalam and 20,198 comments in Tamil. The data in Tamil and Malayalam is code-mixed. In the English dataset, there are instances where English comments are annotated as not English. For example, "Fox News is pure Garbage!" is annotated as not English in the training set. This contributes some noise to the English dataset.
In the first approach, we run the task of masked language modelling on our dataset for 4 epochs for each of the 5 model–dataset combinations. Afterwards, the sentence input token length is limited to 512, and the embeddings extracted by evaluating the model on the input sequences are of length 768. The RNN-based classifier is composed of an LSTM layer and two dense layers. In the second approach, the encoded sentences are wrapped in a data loader containing the respective input IDs and attention masks, with a batch size of 16. These are then passed into a model that applies a dropout of 30%, and the output of the final linear layer is used for classification.
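A sketch of the RNN-based classifier described above (an LSTM layer followed by two dense layers), taking the 768-dimensional contextual embeddings as input; the hidden sizes are assumptions, since the paper does not report them.

```python
# Sketch of the RNN-based classifier: one LSTM layer followed by two dense
# layers, operating on 768-d contextual embeddings. Hidden sizes are
# assumptions; the paper does not report them.
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, embed_dim=768, hidden_dim=128, num_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc1 = nn.Linear(hidden_dim, 64)
        self.fc2 = nn.Linear(64, num_classes)

    def forward(self, x):
        # x: (batch, seq_len, embed_dim); a single sentence embedding can be
        # passed as a length-1 sequence of shape (batch, 1, 768).
        _, (h_n, _) = self.lstm(x)            # h_n: (1, batch, hidden_dim)
        out = torch.relu(self.fc1(h_n.squeeze(0)))
        return self.fc2(out)                  # logits over {hope, not hope}
```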
We used F1 and weighted F1 scores for evaluating our models. The F1 score is the harmonic mean of precision and recall:

F1 = 2 × (precision × recall) / (precision + recall)

Weighted F1 scores are calculated by taking the F1 score for each label and then doing a weighted average by the number of true instances of each label:

weighted F1 = (F_i × y_i + F_j × y_j) / (y_i + y_j)

where y_i and y_j are the number of true instances of class i and class j respectively, and F_i and F_j are the F1 scores of class i and class j respectively.

Our experimentation involved two approaches. In the first approach, we used contextual embeddings (E) to train classifiers using logistic regression (LR), random forest (RF), SVM, and LSTM-based models. In the second approach, we used an ensemble of 11 models which were generated by fine-tuning (FT) pre-trained transformer models after adding an output layer. We used majority voting to get our final prediction. Results for both approaches on our test split, generated from the provided train and dev datasets, are reported in Tables 3, 4 and 5 for English, Tamil and Malayalam respectively. We report the macro-averaged and weighted recall, precision and F1-score for each possible model–method combination. While the weighted F1 scores are more representative of how well a model performs, the disparity between the weighted and macro-averaged scores demonstrates how disproportionate a model's effectiveness is in predicting the different classes. For English, the second approach involving fine-tuning is the best performing one for each of the models tested, closely followed by the logistic regression and LSTM-based methods in the first approach. The roberta-base model seems to have a slight edge over the other two tested models. For Tamil and Malayalam, the second approach is still the best performer, but by a greater margin.
Model     Method    Macro Precision  Weighted Precision  Macro Recall  Weighted Recall  Macro F1-Score  Weighted F1-Score
BERT      E + LR    0.778            0.911               0.656         0.924            0.695           0.913
          E + RF    0.834            0.902               0.526         0.916            0.528           0.881
          E + SVM   0.771            0.86                0.489         0.866            0.488           0.837
          E + LSTM  0.456            0.833               0.500         0.913            0.477           0.871
          FT        0.759            0.915               0.728         0.915            0.742           0.915
ALBERT    E + LR    0.703            0.881               0.538         0.912            0.549           0.883
          E + RF    0.832            0.900               0.506         0.914            0.491           0.874
          E + SVM   0.456            0.833               0.500         0.913            0.477           0.871
          E + LSTM  0.657            0.878               0.571         0.905            0.591           0.887
          FT        0.755            0.916               0.705         0.924            0.725           0.919
RoBERTa   E + LR    0.794            0.914               0.657         0.926            0.700           0.915
          E + RF    0.840            0.905               0.535         0.917            0.544           0.885
          E + SVM   0.821            0.899               0.517         0.915            0.512           0.878
          E + LSTM  0.791            0.918               0.693         0.928            0.729           0.921
          FT        0.753            0.915               0.748         0.922            0.745

Table 3: Metrics for English language
Model      Method    Macro Precision  Weighted Precision  Macro Recall  Weighted Recall  Macro F1-Score  Weighted F1-Score
IndicBERT  E + LR    0.473            0.482               0.484         0.520            0.441           0.464
           E + RF    0.511            0.516               0.506         0.544            0.458           0.482
           E + SVM   0.278            0.309               0.500         0.556            0.357           0.397
           E + LSTM  0.591            0.587               0.501         0.557            0.364           0.403
           FT        0.635            0.637               0.627         0.636            0.623

Table 4: Metrics for Tamil language
Model      Method    Macro Precision  Weighted Precision  Macro Recall  Weighted Recall  Macro F1-Score  Weighted F1-Score
IndicBERT  E + LR    0.645            0.729               0.501         0.790            0.447           0.699
           E + RF    0.395            0.623               0.499         0.788            0.440           0.696
           E + SVM   0.386            0.610               0.492         0.777            0.433           0.683
           E + LSTM  0.367            0.579               0.471         0.745            0.413           0.652
           FT        0.776            0.842               0.743         0.842            0.756

Table 5: Metrics for Malayalam language
In this paper, we presented our approach for hope speech detection in English, Tamil, and Malayalam on the HopeEDI dataset. We used two approaches. The first approach involved using contextual embeddings to train various classifiers. The second approach involved using a majority-voting ensemble of 11 models which were obtained by fine-tuning pre-trained transformer models. The second approach using the roberta-base model was the best performing model for English, giving a weighted F1 score of 0.93. The second approach using the IndicBERT model gave the best performance for Tamil and Malayalam, giving a weighted F1 score of 0.75 for Malayalam and 0.49 for Tamil. In the future, we plan to fine-tune transformers pre-trained on code-mixed data. Data augmentation methods like synonym replacement and random insertion could be used to fine-tune the model on more data.
References
Leo Breiman. 2001. Random forests. Machine Learning, 45(1):5–32.

Bharathi Raja Chakravarthi. 2020. HopeEDI: A multilingual hope speech detection dataset for equality, diversity, and inclusion. In Proceedings of the Third Workshop on Computational Modeling of People's Opinions, Personality, and Emotion's in Social Media, pages 41–53, Barcelona, Spain (Online). Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Kris Gowen, Matthew Deschaine, Darcy Gruttadara, and Dana Markey. 2012. Young adults with mental health conditions and social networking websites: Seeking tools to build community. Psychiatric Rehabilitation Journal, 35(3):245–250.

Marti A. Hearst. 1998. Support vector machines. IEEE Intelligent Systems, 13(4):18–28.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Divyanshu Kakwani, Anoop Kunchukuttan, Satish Golla, Gokul N.C., Avik Bhattacharyya, Mitesh M. Khapra, and Pratyush Kumar. 2020. IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4948–4961, Online. Association for Computational Linguistics.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations.

Younghun Lee, Seunghyun Yoon, and Kyomin Jung. 2018. Comparative studies of detecting abusive language on Twitter. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pages 101–106, Brussels, Belgium. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach.

Binny Mathew, Ritam Dutt, Pawan Goyal, and Animesh Mukherjee. 2019. Spread of hate speech in online social media. In Proceedings of the 10th ACM Conference on Web Science, WebSci '19, pages 173–182, New York, NY, USA. Association for Computing Machinery.

P. McCullagh and J. A. Nelder. 1989. Generalized Linear Models, Second Edition. Chapman and Hall/CRC Monographs on Statistics and Applied Probability Series. Chapman & Hall.

Mainack Mondal, Leandro Araújo Silva, and Fabrício Benevenuto. 2017. A measurement study of hate speech in social media. In Proceedings of the 28th ACM Conference on Hypertext and Social Media, HT '17, pages 85–94, New York, NY, USA. Association for Computing Machinery.

Shriphani Palakodety, Ashiqur R. KhudaBukhsh, and Jaime G. Carbonell. 2020a. Hope speech detection: A computational analysis of the voice of peace.

Shriphani Palakodety, Ashiqur R. KhudaBukhsh, and Jaime G. Carbonell. 2020b. Voice for the voiceless: Active sampling to detect comments supporting the Rohingyas. Proceedings of the AAAI Conference on Artificial Intelligence, 34(01):454–462.

Anna Schmidt and Michael Wiegand. 2017. A survey on hate speech detection using natural language processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 1–10, Valencia, Spain. Association for Computational Linguistics.

Nakatani Shuyo. 2010. Language detection library for Java.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.

Zijian Wang and David Jurgens. 2018. It's going to be okay: Measuring access to support in online communities. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium. Association for Computational Linguistics.