Reducing Unintended Identity Bias in Russian Hate Speech Detection
Nadezhda Zueva, Madina Kabirova, Pavel Kalaidin
VK, VK Lab
{firstname.lastname}@vk.com
Abstract
Toxicity has become a grave problem for many online communities and has been growing across many languages, including Russian. Hate speech creates an environment of intimidation and discrimination, and may even incite real-world violence. Both researchers and social platforms have been focused on developing models to detect toxicity in online communication for a while now. A common problem of these models is the presence of bias towards some words (e.g. woman, black, jew or женщина, черный, еврей) that are not toxic, but serve as triggers for the classifier due to model caveats. In this paper, we describe our efforts towards classifying hate speech in Russian, and propose simple techniques of reducing unintended bias, such as generating training data with language models using terms and words related to protected identities as context, and applying word dropout to such words.
Introduction

With the ever-growing popularity of social media, there is an immense amount of user-generated online content (e.g. as of May 2019, approximately 30,000 hours worth of video were uploaded to YouTube every hour: https://vk.cc/aANMR4). In particular, there has been an exponential increase in user-generated texts such as comments, blog posts, status updates, messages, and forum threads. The low entry threshold and relative anonymity of the Internet have resulted not only in the exchange of information and content but also in the rise of trolling, hate speech, and overall toxicity (https://vk.cc/aANMZn). Harassment is a pervasive issue for most online communities. A Pew survey conducted in 2014 (https://vk.cc/aANN6p) found that 73% of Internet users have witnessed online harassment, and 40% have personally experienced it.

Explicit policies against hate speech can be considered an industry standard across social platforms, including platforms popular among Russian-speaking users (e.g. VK, the largest social network in Russia and the CIS: https://vk.cc/aANNbQ).

The study of hate speech, in online communication in particular, has been gaining traction in Russia for a while now due to it being a prevalent issue long before the Internet (Lokshina, 2003). The number of competitions and workshops on hate speech and toxic language detection (e.g. HASOC at FIRE-2019; TRAC 2020; HatEval and OffensEval at SemEval-2019) reflects the scale of the situation.

Social platforms utilize a wide variety of models to detect or classify hate speech. However, the majority of existing models operate with a bias in their predictions. They tend to classify comments mentioning certain commonly harassed identities (e.g. containing words such as woman, black, jew or женщина, черный, еврей) as toxic, even when the comment itself lacks any actual toxicity. Identity terms of frequently targeted social groups have higher toxicity scores since they are found more often in abusive and toxic comments than terms related to other social groups. If the data used to train a machine learning model is skewed towards these words, the resulting model is likely to adopt this bias (https://vk.cc/ayxecu, https://vk.cc/aANNqT).

Inappropriately high toxicity scores for terms related to specific social groups can potentially negate the benefits of using machine learning models to fight the spread of hate speech. This motivated us to work towards reducing these biases. In this paper, our main goal is to reduce the false toxicity scores of non-toxic comments that include identity terms empirically known to introduce model bias.

Related Work

Little research has been done on the automatic detection of toxicity and hate speech in the Russian language. Potapova and Gordeev (2016) used convolutional neural networks to detect aggression in user messages on anonymous message boards. Andrusyak et al. (2018) proposed an unsupervised technique for extending the vocabulary of abusive and obscene words in Russian and Ukrainian. More recently, Smetanin (2020) utilized pre-trained BERT (Devlin et al., 2019) and Universal Sentence Encoder (Yang et al., 2019) architectures to classify toxic Russian-language content.
Dixon et al. (2018) introduced Pinned AUC to control for unintended bias. In this paper, we adopt the Generalized Mean of Bias AUCs (GMB-AUC) introduced by Borkan et al. (2019b), following a study by Borkan et al. (2019a) showing the limitations of Pinned AUC. Vaidya et al. (2020) proposed a model that learns to predict the toxicity of a comment, as well as the protected identities present, in order to reduce unintended bias, as shown by an increase in Generalized Mean of Bias AUCs. Nozza et al. (2019) focused on misogyny detection, providing a synthetic test for evaluating bias and some mitigation strategies for it.

To our knowledge, there is no published research on reducing text classification bias in Russian.
Data

For our experiments, we manually collected a corpus of comments posted on a major Russian social network. The corpus consists of 100,000 samples that we randomly split into training, validation, and test sets in the ratio 8:1:1. The mean length of a sample is 26 characters; samples over 50 characters (5% of the total number of samples) were shortened. Each comment was assigned a label based on whether or not it contained various forms of hate speech or abuse, including threats, harassment, insults, and mentions of family members, as well as language used to promote lookism, sexism, homophobia, nationalism, etc. The corpus is available on request to the authors upon submitting a license agreement.

As benchmarks, we also used a small corpus of 2,000 samples in mixed Russian and Ukrainian collected by Andrusyak et al. (2018), and a corpus in Russian (around 14,000 samples) used by Smetanin (2020).
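For concreteness, the following is a minimal sketch of the random 8:1:1 split described above; the function name and seed are illustrative assumptions, not the authors' code.

    import random

    def split_8_1_1(samples, seed=42):
        """Randomly split a list of samples into train/validation/test
        sets in the ratio 8:1:1, as described above."""
        rng = random.Random(seed)
        shuffled = samples[:]
        rng.shuffle(shuffled)
        n_train = int(0.8 * len(shuffled))
        n_val = int(0.1 * len(shuffled))
        return (shuffled[:n_train],
                shuffled[n_train:n_train + n_val],
                shuffled[n_train + n_val:])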
Task

We consider the prediction of labels related to hate speech as our task, and validate performance using the Generalized Mean of Bias AUCs (Borkan et al., 2019b) to analyze whether or not the proposed methods help reduce text classification bias.
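For reference, below is a minimal sketch of this metric under the power-mean formulation of Borkan et al. (2019b) with p = -5. It covers only the subgroup-AUC term; the full metric also aggregates the BPSN and BNSP bias AUCs, which are omitted here for brevity. Function names and the mask representation are illustrative assumptions.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def generalized_mean(values, p=-5.0):
        # Power mean of per-subgroup scores; p = -5 strongly
        # penalises the worst-performing identity subgroups.
        v = np.asarray(values, dtype=float)
        return np.mean(v ** p) ** (1.0 / p)

    def gmb_subgroup_auc(y_true, y_score, subgroup_masks, p=-5.0):
        # subgroup_masks: dict mapping an identity class (e.g.
        # "sexism") to a boolean mask over the test set selecting
        # comments that mention that identity. Each slice must
        # contain both toxic and non-toxic examples for the AUC
        # to be defined.
        aucs = [roc_auc_score(y_true[mask], y_score[mask])
                for mask in subgroup_masks.values()]
        return generalized_mean(aucs, p)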
Protected Identities

We manually compiled a list of Russian words related to protected identities. Based on the type of hate speech involved, the words were split into the following classes: lookism, sexism, nationalism, threats, harassment, homophobia, and other. Extracts from the full list are provided in Table 1. The total number of words in the list is 214. The full list of protected identities and related words is available here: https://vk.cc/aAS3TQ

Model

We used a model based on the self-attentive encoder (Lin et al., 2017). We feed the token embedding matrix directly to the attention layer instead of the bi-LSTM encoder, making it a pure self-attention model similar to the one used in the Transformer (Vaswani et al., 2017). An advantage of this architecture is that the individual attention weights for each input token are interpretable (Lin et al., 2017). This makes it possible to visualize what triggers the classifier, giving us an opportunity to explore the data and extend our list of protected identities. To overcome the problem of out-of-vocabulary words, we trained byte pair encoding (Sennrich et al., 2015) on a corpus of Russian subtitles taken from a large dataset collected by Shavrina and Shapovalova (2017), and used it for input tokenization (see the sketches following Table 1).

Class        Word       Transliteration  Gloss
lookism      корова     korova           "cow"
lookism      пышка      pishka           "donut" (meaning "plump")
sexism       женщина    zhenshchina      "woman"
sexism       баба       baba             "woman" (derogatory)
nationalism  чех        chekh            "Chechen" (derogatory), lit. "Czech"
nationalism  еврей      evrei            "Jew"
threats      выезжать   vyezhat          "to come (after somebody)"
threats      айпи       aipi             "IP"
harassment   киска      kiska            "pussy"
harassment   секси      seksi            "sexy"
homophobia   гей        gay              "gay"
homophobia   лгбт       LGBT             "LGBT"
other        мамка      mamka            "mother"
other        админ      admin            "admin"

Table 1: Extracts from the full list of protected identities and related words.
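As a sketch of the tokenization step referenced above: the paper does not name its BPE implementation, so this example uses the sentencepiece library as one possible choice; the file names and vocabulary size are assumptions.

    import sentencepiece as spm

    # Train BPE on a plain-text corpus of Russian subtitles, one
    # sentence per line. "subtitles.txt", the model prefix, and the
    # vocabulary size are illustrative assumptions.
    spm.SentencePieceTrainer.train(
        input="subtitles.txt",
        model_prefix="ru_bpe",
        model_type="bpe",
        vocab_size=30000,
    )

    # Tokenize an input comment into subword ids for the classifier.
    sp = spm.SentencePieceProcessor(model_file="ru_bpe.model")
    token_ids = sp.encode("пример комментария", out_type=int)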
We also evaluated a CNN-based text classifier (as in Potapova and Gordeev (2016)) as a baseline for comparison.
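To make the base architecture concrete, here is a minimal sketch of the pure self-attention classifier described above: the structured self-attention of Lin et al. (2017) applied directly to token embeddings, with the bi-LSTM removed. The embedding size, attention width, and number of attention heads are illustrative assumptions, not the authors' exact hyperparameters.

    import torch
    import torch.nn as nn

    class SelfAttnClassifier(nn.Module):
        """Token embeddings are fed directly to the structured
        self-attention layer, skipping the bi-LSTM encoder."""
        def __init__(self, vocab_size, emb_dim=300, att_dim=128,
                     n_heads=4, n_classes=1):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
            self.w1 = nn.Linear(emb_dim, att_dim, bias=False)
            self.w2 = nn.Linear(att_dim, n_heads, bias=False)
            self.out = nn.Linear(n_heads * emb_dim, n_classes)

        def forward(self, tokens):                 # tokens: (B, T)
            h = self.emb(tokens)                   # (B, T, emb_dim)
            # Per-token attention weights over the sequence axis;
            # these are the interpretable weights that can be
            # visualized to see what triggers the classifier.
            a = torch.softmax(self.w2(torch.tanh(self.w1(h))), dim=1)
            m = torch.einsum("bth,btd->bhd", a, h)  # (B, heads, emb_dim)
            return self.out(m.flatten(1))           # logits: (B, n_classes)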
Reducing Unintended Bias

Data Generation with Language Model

To reduce model bias, we propose to extend the dataset with the output of pre-trained language models. We used a pre-trained Transformer language model (https://github.com/vlarine/ruGPT2) trained on the Taiga dataset (Shavrina and Shapovalova, 2017). As Taiga contains 8 sources of normative Russian text (news, fairy tales, classic literature, etc.), we assumed that the model would be able to generate non-toxic comments even with a single word from the protected identities list given as context. We took a random word from the list of protected identities and related words as a one-word prefix for language generation, and generated samples up to 20 words long or until an end token was generated. An additional 25,000 samples were generated using this approach and added to the existing training set.

Identity Dropout

Random word dropout (Dai and Le, 2015) was shown to improve text classification. We utilized this technique to randomly (with 0.5 probability) replace protected identities in input sequences with the <UNK> token during training.

Multi-Task Learning

Following Vaidya et al. (2020), we evaluated a multi-task learning framework, where we extended the base model by predicting a protected identity class from an input sequence. In our setup, the loss from the extra classifier head is weighted equally with the loss from the toxicity classifier.
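To make these three techniques concrete, below is a minimal sketch of each. The generation call assumes a HuggingFace-style model/tokenizer interface rather than the exact ruGPT2 loading code; the 20-word cap is approximated with a subword budget, and all names not taken from the paper are illustrative assumptions.

    import random
    import torch.nn.functional as F

    UNK = "<UNK>"

    def generate_identity_samples(lm, tokenizer, identity_terms, n_samples):
        """Data generation: a random protected-identity word is used
        as a one-word prefix, and the language model continues it
        (HuggingFace-style generate() interface is an assumption).
        identity_terms: list of words from the compiled list."""
        samples = []
        for _ in range(n_samples):
            prefix = random.choice(identity_terms)
            ids = tokenizer(prefix, return_tensors="pt").input_ids
            out = lm.generate(ids, do_sample=True, max_new_tokens=40,
                              eos_token_id=tokenizer.eos_token_id)
            samples.append(tokenizer.decode(out[0], skip_special_tokens=True))
        return samples

    def identity_dropout(tokens, identity_terms, p=0.5):
        """Identity dropout: replace each protected-identity token
        with <UNK> with probability 0.5 during training."""
        return [UNK if t.lower() in identity_terms and random.random() < p
                else t for t in tokens]

    def multitask_loss(tox_logits, tox_labels, id_logits, id_labels):
        """Multi-task learning: toxicity loss (binary cross-entropy)
        and protected-identity-class loss (cross-entropy), weighted
        equally as described above."""
        l_tox = F.binary_cross_entropy_with_logits(tox_logits,
                                                   tox_labels.float())
        l_id = F.cross_entropy(id_logits, id_labels)
        return 0.5 * (l_tox + l_id)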
Experiments

We trained our models for 100,000 iterations with a batch size of 128, using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 1e-5 and betas (0.9, 0.999), on a single NVIDIA Tesla T4 GPU. Each experiment took approximately 1 hour to run. We used embeddings pre-trained on the corpus of Russian subtitles (Shavrina and Shapovalova, 2017). We experimented with two different architectures (self-ATTN, CNN) in several scenarios, applying Data Generation with Language Model, Identity Dropout, and Multi-Task Learning, as well as combinations of these approaches. We used binary cross-entropy loss as the loss function for the single-task approach. As the loss function for multi-task learning, we used the average of the losses of the two tasks: predicting the toxicity score and predicting the protected identity class. We trained our models on the training set, controlled the training process using the validation set, and evaluated metrics on the test set. We repeated each experiment multiple times and report the mean and standard deviation of the measurements. We applied early stopping with a patience of 50 (a sketch of this setup follows the results discussion below). The code is available on Google Drive: https://vk.cc/aANO1g

Results

The results are provided in Table 2.

Method                                     | Our Dataset             | Andrusyak et al. (2018) | Smetanin (2020)
                                           | GMB-AUC     F1          | GMB-AUC     F1          | GMB-AUC     F1
CNN                                        | .56 ± .005  .66 ± .003  | .51 ± .005  .59 ± .001  | .53 ± .003  .78 ± .002
CNN + multitask                            | .58 ± .001  .68 ± .008  | .52 ± .002  .61 ± .002  | .53 ± .010  .80 ± .002
Attn                                       | .60 ± .002  .71 ± .010  | .54 ± .001  .72 ± .003  | .54 ± .005  .80 ± .010
Attn + multitask                           | .60 ± .004  .74 ± .012  | .54 ± .009  .69 ± .009  | .54 ± .007  .82 ± .004
Attn + LM data                             | .65 ± .003  .74 ± .002  | .58 ± .003  .70 ± .001  | .57 ± .006  .83 ± .009
Attn + LM data + multitask                 | .67 ± .002  .74 ± .016  | .59 ± .003  .70 ± .010  | .58 ± .003  .84 ± .008
Attn + identity d/o                        | .61 ± .001  .65 ± .003  | .53 ± .004  .68 ± .001  | .54 ± .007  .82 ± .011
Attn + identity d/o + multitask            | .61 ± .005  .66 ± .007  | .54 ± .004  .69 ± .008  | .58 ± .009  .83 ± .007
Attn + identity d/o + LM data              | .67 ± .004  .76 ± .005  | .55 ± .003  .71 ± .002  | .59 ± .003  . ± .012
Attn + identity d/o + LM data + multitask  | .68 ± .001  .78 ± .010  | .56 ± .004  .73 ± .003  | .60 ± .008  .86 ± .004

Table 2: Generalized Mean of Bias AUCs (GMB-AUC) and F1 scores across datasets.

We showed that, for our dataset and for the benchmark from Smetanin (2020), adding an extra task of predicting the class of a protected identity can indeed improve the quality of toxicity classification in terms of reducing unintended bias. Moreover, we observed that simple techniques such as regularizing the input and extending the training data with external language models can help reduce unintended model bias on protected identities even further.

For the Andrusyak et al. (2018) benchmark, we did not see much improvement in our metrics. This can be attributed to language differences, as the benchmark contains abusive words in both Russian and Ukrainian.

We also observed that the proposed models achieved competitive results across all three datasets when evaluated with the F1 score. The best performing model (the Attn + identity d/o + LM data + multitask setup) achieved an F1 score of 0.86 on the Smetanin (2020) benchmark, which is 93% of the reported SoTA performance of a much larger model fine-tuned from a BERT-like architecture.
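As referenced in the Experiments paragraph above, here is a minimal sketch of the optimization and early stopping setup (Adam with lr 1e-5 and betas (0.9, 0.999), patience 50); the evaluation interval and the caller-supplied helpers are assumptions.

    import torch

    def train_with_early_stopping(model, train_step, eval_fn,
                                  max_iters=100_000, eval_every=1_000,
                                  patience=50):
        """train_step(model, optimizer) runs one training batch;
        eval_fn(model) returns a validation score to maximize.
        Both are caller-supplied; eval_every is an assumption."""
        optimizer = torch.optim.Adam(model.parameters(),
                                     lr=1e-5, betas=(0.9, 0.999))
        best, bad = float("-inf"), 0
        for step in range(max_iters):
            train_step(model, optimizer)
            if (step + 1) % eval_every == 0:
                score = eval_fn(model)
                if score > best:
                    best, bad = score, 0
                else:
                    bad += 1
                    if bad >= patience:   # stop after 50 bad evals
                        break
        return best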
Future Work

We are interested in automatically extending our compiled list of protected identities and related words. We also expect that fine-tuning a pre-trained BERT-like model would improve our results, and plan to experiment with it.
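As a hypothetical starting point for that planned experiment, a minimal sketch using the HuggingFace transformers API; the checkpoint name is an assumption (any Russian BERT-like model would do), and no fine-tuning results are implied.

    import torch
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer)

    # Checkpoint name is an assumption; the paper has not run this
    # experiment yet.
    name = "DeepPavlov/rubert-base-cased"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(
        name, num_labels=2)

    # Encode a batch of comments and compute logits; in practice one
    # would fine-tune on the hate speech labels with cross-entropy.
    batch = tokenizer(["пример комментария"], padding=True,
                      truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits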
Acknowledgments

The authors are grateful to Daniil Gavrilov and Oktai Tatanov for useful discussions, Daniil Gavrilov for review, Viktoriia Loginova and David Prince for proofreading, and the anonymous reviewers for valuable comments. The authors would also like to thank the VK Moderation Team (led by Katerina Egorushkova) for their help in building the hate speech dataset.
References
Bohdan Andrusyak, Mykhailo Rimel, and Roman Kern. 2018. Detection of abusive speech for mixed sociolects of Russian and Ukrainian languages. In Proceedings of Recent Advances in Slavonic Natural Language Processing, pages 77–84.

Daniel Borkan, Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2019a. Limitations of Pinned AUC for measuring unintended bias. arXiv preprint arXiv:1903.02088.

Daniel Borkan, Lucas Dixon, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2019b. Nuanced metrics for measuring unintended bias with real data for text classification. In Companion Proceedings of The 2019 World Wide Web Conference, pages 491–500.

Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pages 3079–3087.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.

Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. Measuring and mitigating unintended bias in text classification. In Proceedings of the AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.

Tanya Lokshina. 2003. Hate speech in Russia: Overview of the problem and means for counteraction. In Bulletin: Anthropology, Minorities, Multiculturalism, volume 4.

Debora Nozza, Claudia Volpetti, and Elisabetta Fersini. 2019. Unintended bias in misogyny detection. In IEEE/WIC/ACM International Conference on Web Intelligence.

Rodmonga Potapova and Denis Gordeev. 2016. Detecting state of aggression in sentences using CNN. In International Conference on Speech and Computer.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Tatiana Shavrina and Olga Shapovalova. 2017. To the methodology of corpus construction for machine learning: "Taiga" syntax tree corpus. In Corpora-2017, pages 78–84.

Sergey Smetanin. 2020. Toxic comments detection in Russian. In Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference "Dialogue 2020".

Ameya Vaidya, Feng Mai, and Yue Ning. 2020. Empirical analysis of multi-task learning for reducing identity bias in toxic comment detection. In Proceedings of the Fourteenth International AAAI Conference on Web and Social Media (ICWSM 2020).

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2019. Multilingual universal sentence encoder for semantic retrieval. arXiv preprint arXiv:1907.04307.