Multilingual Twitter Corpus and Baselines for Evaluating Demographic Bias in Hate Speech Recognition
Xiaolei Huang∗, Linzi Xing, Franck Dernoncourt, Michael J. Paul
∗ The work was partially done when the first author worked as an intern at Adobe Research.

Abstract
Existing research on fairness evaluation of document classification models mainly uses synthetic monolingual data without ground truth for author demographic attributes. In this work, we assemble and publish a multilingual Twitter corpus for the task of hate speech detection, with four inferred author demographic factors: age, country, gender and race/ethnicity. The corpus covers five languages: English, Italian, Polish, Portuguese and Spanish. We evaluate the inferred demographic labels with a crowdsourcing platform, Figure Eight. To examine factors that can cause biases, we conduct an empirical analysis of demographic predictability on the English corpus. We measure the performance of four popular document classifiers and evaluate the fairness and bias of the baseline classifiers on the author-level demographic attributes.
Keywords: demographic bias, fairness, multilingual, document classification, hate speech
1. Introduction
While document classification models should be objective and independent from human biases in documents, research has shown that the models can learn human biases and therefore be discriminatory towards particular demographic groups (Dixon et al., 2018; Borkan et al., 2019; Sun et al., 2019b). The goal of fairness-aware document classifiers is to train and build non-discriminatory models towards people regardless of their demographic attributes, such as gender and ethnicity. Existing research (Dixon et al., 2018; Kiritchenko and Mohammad, 2018; Park et al., 2018; Garg et al., 2019; Borkan et al., 2019) on evaluating the fairness of document classifiers focuses on group fairness (Chouldechova and Roth, 2018), which requires that every demographic group have an equal probability of being assigned to the positive predicted document category.

However, the lack of original author demographic attributes and multilingual corpora brings challenges to the fairness evaluation of document classifiers. First, the datasets commonly used to build and evaluate the fairness of document classifiers contain derived, synthetic author demographic attributes instead of the original author information. The common data sources either derive from Wikipedia toxic comments (Dixon et al., 2018; Park et al., 2018; Garg et al., 2019) or synthetic document templates (Kiritchenko and Mohammad, 2018; Park et al., 2018). The Wikipedia Talk corpus (Wulczyn et al., 2017; https://figshare.com/articles/Wikipedia_Detox_Data/4054689) provides demographic information of annotators instead of the authors, and the Equity Evaluation Corpus (Kiritchenko and Mohammad, 2018; http://saifmohammad.com/WebPages/Biases-SA.html) was created from sentence templates and combinations of racial names and gender coreferences. While existing work (Davidson et al., 2019; Diaz et al., 2018) infers user demographic information (white/black, young/old) from the text, such inference is still likely to cause confounding errors that impact and break the independence between demographic factors and the fairness evaluation of text classifiers. Second, existing research on fairness evaluation mainly focuses on English resources, such as age biases in blog posts (Diaz et al., 2018), gender biases in Wikipedia comments (Dixon et al., 2018) and racial biases in hate speech detection (Davidson et al., 2019). Different languages show different patterns of linguistic variation across demographic attributes (Johannsen et al., 2015; Huang and Paul, 2019), so methods (Zhao et al., 2017; Park et al., 2018) to reduce and evaluate demographic bias in English corpora may not apply to other languages. For example, Spanish has gender-dependent nouns, which do not exist in English (Sun et al., 2019b), and Portuguese varies between Brazil and Portugal in both word usage and grammar (Maier and Gómez-Rodríguez, 2014). These rich variations have not been explored under fairness evaluation due to the lack of multilingual corpora. Additionally, while hate speech detection datasets exist in multiple languages (Waseem and Hovy, 2016; Sanguinetti et al., 2018; Ptaszynski et al., 2019; Basile et al., 2019; Fortuna et al., 2019), there is still no integrated multilingual corpus that contains author demographic attributes which can be used to measure group fairness.
The lack of author demographic attributes and multilingual datasets limits research on evaluating classifier fairness and developing unbiased classifiers. In this study, we combine previously published corpora labeled for Twitter hate speech recognition in English (Waseem and Hovy, 2016; Waseem, 2016; Founta et al., 2018), Italian (Sanguinetti et al., 2018), Polish (Ptaszynski et al., 2019), Portuguese (Fortuna et al., 2019), and Spanish (Basile et al., 2019), and publish this multilingual data augmented with author-level demographic information for four attributes: race, gender, age and country. The demographic factors are inferred from user profiles, which are independent from the text documents, the tweets. To the best of our knowledge, this is the first multilingual hate speech corpus annotated with author attributes aimed at fairness evaluation. We start by presenting the collection and inference steps of the datasets. Next, we conduct an exploratory study of language variations across demographic groups on the English dataset. We then experiment with four classification models to establish baselines on this corpus. Finally, we evaluate the fairness performance of those document classifiers.
2. Data
We assemble annotated datasets for hate speech classification. To narrow down the data sources, we limit our dataset sources to a single online social media site, Twitter. We requested 16 published Twitter hate speech datasets and finally obtained 7 of them in five languages. Using the Twitter streaming API (https://developer.twitter.com/), we collected the tweets annotated with hate speech labels and their corresponding user profiles in English (Waseem and Hovy, 2016; Waseem, 2016; Founta et al., 2018), Italian (Sanguinetti et al., 2018), Polish (Ptaszynski et al., 2019), Portuguese (Fortuna et al., 2019), and Spanish (Basile et al., 2019). We binarize all tweets' labels (indicating whether a tweet shows indications of hate speech), which allows us to merge the different label sets and reduce data sparsity.

Whether a tweet is considered hate speech heavily depends on who the speaker is; for example, whether a racial slur is intended as hate speech depends in part on the speaker's race (Waseem and Hovy, 2016). Therefore, hate speech classifiers may not generalize well across all groups of people, and disparities in the detection of offensive speech could lead to bias in content moderation (Shen et al., 2018). Our contribution is to further annotate the data with user demographic attributes inferred from their public profiles, thus creating a corpus suitable for evaluating author-level fairness for this hate speech recognition task across multiple languages.

We consider four user factors: age, race, gender and geographic location. For location, we infer two granularities, country and US region, but only experiment with the country attribute. While demographic attributes can be inferred from tweets (Volkova et al., 2015; Davidson et al., 2019), we intentionally exclude the tweet contents from the attribute inference, in order to make the fairness evaluation more reliable and independent: if users were grouped based on attributes inferred from their text, then any differences in text classification across those groups could be related to that same text. Instead, we infer attributes from public user profile information (i.e., description, name and photo).
Age, Race, Gender.
We infer these attributes from each user's profile image using Face++ (https://www.faceplusplus.com/), a computer vision API that provides estimates of demographic characteristics. Empirical comparisons of facial recognition APIs have found that Face++ is the most accurate tool on Twitter data (Jung et al., 2018) and works comparatively better for darker skin tones (Buolamwini and Gebru, 2018). For gender, we choose the binary categories (male/female) by the predicted probabilities. We map the racial outputs into four categories: Asian, Black, Latino and White. We only keep users who appear to be at least 13 years old, and we save the first result from the API if multiple faces are identified. We experiment and evaluate with binarized race and age with roughly balanced distributions (white vs. nonwhite, at or below vs. above the median age) to consider a simplified setting across different languages, since race is harder to infer accurately.
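As a rough sketch of this inference step, the snippet below queries the Face++ detect endpoint as laid out in its public documentation; the endpoint path, response fields and credentials are our assumptions for illustration, not artifacts released with the paper.

```python
# A hedged sketch of profile-photo attribute inference via the Face++
# "detect" endpoint, following its public documentation. Credentials
# and the exact attribute schema are placeholders.
import requests

FACEPP_URL = "https://api-us.faceplusplus.com/facepp/v3/detect"
API_KEY, API_SECRET = "YOUR_KEY", "YOUR_SECRET"  # hypothetical credentials


def infer_profile_attributes(photo_url, min_age=13):
    """Return (age, gender, race) for the first detected face, or None."""
    resp = requests.post(FACEPP_URL, data={
        "api_key": API_KEY,
        "api_secret": API_SECRET,
        "image_url": photo_url,
        "return_attributes": "gender,age,ethnicity",
    })
    faces = resp.json().get("faces", [])
    if not faces:
        return None
    attrs = faces[0]["attributes"]  # keep only the first face, as in Section 2
    age = attrs["age"]["value"]
    if age < min_age:               # drop users who appear younger than 13
        return None
    gender = attrs["gender"]["value"]   # binary categories: Male / Female
    race = attrs["ethnicity"]["value"]  # mapped downstream to Asian/Black/Latino/White
    return age, gender, race
```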
Country. Country-level language variations can bring challenges that are worth exploring. We extract geolocation information from users whose profiles contained either numerical location coordinates or a well-formatted (matching a regular expression) location name. We feed the extracted values to the Google Maps API (https://maps.googleapis.com) to obtain structured location information (city, state, country). We first identify the majority country of each corpus and then binarize the country attribute to indicate whether a user is in that country or not. For example, the majority of users in the English corpus are from the United States (US), so we binarize the country attribute to indicate whether a user is in the US or not.
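The following is a minimal sketch of this location pipeline, assuming the Google Maps Geocoding API's documented JSON response layout; the regular expression, helper names and API key are illustrative.

```python
# A hedged sketch: geocode a free-text profile location with the
# Google Maps Geocoding API, then binarize against the majority country.
import re
import requests

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"
LOCATION_RE = re.compile(r"^[\w\s,.-]+$")  # illustrative "well-formatted" filter


def geocode_country(profile_location, api_key):
    """Return the country code for a profile location string, or None."""
    if not profile_location or not LOCATION_RE.match(profile_location):
        return None
    resp = requests.get(GEOCODE_URL,
                        params={"address": profile_location, "key": api_key})
    results = resp.json().get("results", [])
    if not results:
        return None
    for component in results[0]["address_components"]:
        if "country" in component["types"]:
            return component["short_name"]
    return None


def binarize_country(country_codes, majority="US"):
    """Map each user's country to 1 (majority country) or 0 (elsewhere)."""
    return [1 if c == majority else 0 for c in country_codes]
```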
We show the corpus statistics in Table 1 and summarize the full demographic distributions in Table 2. The binary demographic attributes (age, country, gender, race) bring several benefits. First, we can create comparatively balanced label distributions; we observe differences in race and gender in the Italian and Polish data, while the other attributes across the other languages show comparably balanced demographic distributions. Second, coarse labels reduce the impact of errors from the Face++ inference. Third, it is more convenient to analyze, run experiments on, and evaluate the group fairness of document classifiers.

Language    Users   Docs    Tokens  HS Ratio
English     64,067  83,077  20.066  .370
Italian     3,810   5,671   19.721  .195
Polish      86      10,919  14.285  .089
Portuguese  600     1,852   18.494  .205
Spanish     4,600   4,831   19.199  .397

Table 1: Statistical summary of the multilingual corpora across English, Italian, Polish, Portuguese and Spanish. We present the number of users (Users), documents (Docs), and average tokens per document (Tokens), plus the label distribution (HS Ratio, the proportion of documents labeled positive for hate speech).

Table 1 reveals different patterns across the corpora. The Polish data has the fewest users because it focuses on the people who own the most popular accounts (Ptaszynski et al., 2019), while the other datasets collected tweets randomly. The Polish dataset also shows a much sparser distribution of the hate speech label than the other languages.

Language    Age (Mean / Median)  Country (main / other)         Gender (Female / Male)  Race (White / non-White)
English     32.041 / 29          US .599 / non-US .401          .499 / .501             .505 / .495
Italian     44.518 / 43          Italy .778 / non-Italy .222    .307 / .692             .981 / .018
Polish      39.245 / 38          Poland .795 / non-Poland .205  .324 / .676             .895 / .105
Portuguese  29.635 / 26          Brazil .437 / non-Brazil .563  .569 / .431             .508 / .492
Spanish     31.911 / 27          Spain .339 / non-Spain .661    .463 / .537             .549 / .451

Table 2: Statistical summary of user attributes in age, country, gender and race. For age, we present both mean and median values in case of outliers; for the other attributes, we show binary distributions.

Table 2 reveals different patterns of the user attributes. English, Portuguese and Spanish users are younger than the Italian and Polish users in the collected data. Both Italian and Polish show more skewed demographic distributions in country, gender and race, while the other datasets show more balanced distributions.
Image-based approaches will have inaccuracies, as a person's demographic attributes cannot be conclusively determined merely from their appearance. However, given the difficulty of obtaining ground truth values, we argue that automatically inferred attributes can still be informative for studying classifier fairness. If a classifier performs significantly differently across different groups of users, then this shows that the classifier is biased along certain groupings, even if those groupings are not perfectly aligned with the actual attributes they are named after. This subsection quantifies how reliably these groupings correspond to the demographic variables.

Prior research found that Face++ achieves 93.0% and 92.0% accuracy in gender and ethnicity evaluations (Jung et al., 2018). We further conduct a small evaluation on the hate speech corpus with a sample of annotated user profile photos, providing a rough estimate of accuracy while acknowledging that our annotations are not ground truth. We obtained the annotations from the crowdsourcing website Figure Eight (https://figure-eight.com/). We randomly sampled 50 users whose attributes came from Face++ in each language, anonymized the user profiles and fed the information to the crowdsourcing website. Three annotators annotated each user photo with the binary demographic categories. To select qualified annotators and ensure the quality of the evaluations, we set up 5 gold-standard annotation questions for each language; annotators could join the evaluation task only after passing these questions. We determine demographic attributes by majority vote and present the evaluation results in Table 3.

                     Age   Race  Gender
Annotator agreement
Face++               .80   .80   .98
Accuracy
English              .86   .90   .94
Italian              .82   .96   .98
Polish               .88   .96   .98
Portuguese           .82   .78   .92
Spanish              .76   .82   .90
Overall              .828  .884  .944

Table 3: Annotator agreement (percentage overlap) and evaluation accuracy for Face++.

Our final evaluations show that overall Face++ achieves average accuracy scores of 82.8%, 88.4% and 94.4% for age, race and gender, respectively.
To facilitate the study of classification fairness, we will publicly distribute this anonymized corpus with the inferred demographic attributes, including both the original and binarized versions. To preserve user privacy, we will not publicize the personal profile information, including user ids, photos, geocoordinates and other user profile information, which were used to infer the demographic attributes. We will, however, provide the inferred demographic attributes in their original formats from Face++ and the Google Maps API upon request, to allow a wider range of researchers and communities to replicate the methodology and probe fairness in document classification in more depth.
3. Language Variations across Demographic Groups
Demographic factors can improve the performance of document classifiers (Hovy, 2015), and demographic variation is rooted in language, especially in social media data (Volkova et al., 2013; Hovy, 2015). For example, language styles are highly correlated with authors' demographic attributes, such as age, race, gender and location (Coulmas, 2017; Preoțiuc-Pietro and Ungar, 2018). Research (Bolukbasi et al., 2016; Zhao et al., 2017; Garg et al., 2018) finds that biases and stereotypes exist in word embeddings, which are widely used in document classification tasks. For example, "receptionist" is closer to females while "programmer" is closer to males, and "professor" is closer to Asian Americans while "housekeeper" is closer to Hispanic Americans. This motivates us to explore whether such language variations hold in our particular dataset and how strong the effects are. We conduct an empirical analysis of demographic predictability on the English dataset.
We examine how accurately the documents can predict author demographic attributes at three different levels:

1. Word-level. We extract TF-IDF-weighted 1-, 2-gram features.

2. POS-level. We use the Tweebo parser (Kong et al., 2014) to tag and extract POS features. We count the POS tags and then normalize the counts for each document.

3. Topic-level. We train a Latent Dirichlet Allocation (Blei et al., 2003) model with 20 topics using Gensim (Rehurek and Sojka, 2010) with default parameters. A document can then be represented as a probability distribution over the 20 topics.

We shuffle and split the data into training (70%) and test (30%) sets. Three logistic regression classifiers are trained on the three levels of features separately. We measure the prediction accuracy and show the absolute improvements in Figure 1.

[Figure 1: Predictability of demographic attributes from the English data. We show the absolute percentage improvements in accuracy over majority-class baselines (.500 for the binary predictions) for word, topic and POS features across age, country, gender and race. Darker color indicates higher improvements.]

The improved prediction accuracy scores over the majority baselines suggest that language variations across demographic groups are encoded in the text documents. The results show that documents are most predictive of the age attribute. We also observe that word features are the most predictive of demographic factors, while POS features are the least predictive of the country factor. This suggests a connection between language variations and demographic groups and motivates us to further explore the language variations based on word features. We rank the word features by mutual information classification (Pedregosa et al., 2011) and present the top 10 unigram features in Table 4.

Race, White:    nigga, fucking, ass, bro, damn, niggas, sir, moive, melon, bitches
Race, Other:    abuse, gg, feminism, wadhwa, feminists, uh, freebsd, feminist, ve, blocked
Gender, Female: rent, driving, tho, adorable, met, presented, yoga, stressed, awareness, me
Gender, Male:   idiot, the, players, match, idiots, sir, fucking, nigga, bro, trump

Table 4: Top 10 predictive features of race and gender in the English dataset.

These qualitative results show the word features most predictive of the demographic groups and suggest that such variations may impact extracted feature representations and, in turn, the training of fair document classifiers. Table 4 shows that when classifying hate speech tweets, the n-words and b-words are more significantly correlated with the white group rather than the other racial groups. This is the opposite of what existing work found (Davidson et al., 2019), namely that these two types of words correlate more significantly with the black group. This highlights the value of our approach: to avoid confounding errors, we obtain author demographic information independently from the user-generated documents.
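As a compact sketch of the word-level probe under standard scikit-learn APIs, the snippet below trains a logistic regression on TF-IDF 1-/2-gram features and ranks unigrams by mutual information; the function and variable names are ours, and the POS and topic pipelines are omitted.

```python
# A sketch of the word-level predictability analysis: TF-IDF 1-/2-gram
# features, logistic regression accuracy on a 70/30 split, and mutual
# information to rank the most predictive unigrams.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def predictability(texts, labels, top_k=10):
    # Accuracy of a word-level classifier for one demographic attribute.
    vec = TfidfVectorizer(ngram_range=(1, 2))
    X = vec.fit_transform(texts)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, test_size=0.3, shuffle=True, random_state=0)
    acc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

    # Rank unigram features by mutual information with the attribute.
    uni = TfidfVectorizer(ngram_range=(1, 1))
    X_uni = uni.fit_transform(texts)
    mi = mutual_info_classif(X_uni, labels, discrete_features=True)
    names = uni.get_feature_names_out()
    top = sorted(zip(mi, names), reverse=True)[:top_k]
    return acc, [word for _, word in top]
```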
4. Experiments
Demographic variation is rooted in documents, especially in social media data (Volkova et al., 2013; Hovy, 2015; Johannsen et al., 2015). Such variation could further impact the performance and fairness of document classifiers. In this study, we experiment with four different classification models: logistic regression (LR), a recurrent neural network (RNN) (Chung et al., 2014), a convolutional neural network (CNN) (Kim, 2014) and Google BERT (Devlin et al., 2019). We present baseline results for both performance and fairness evaluations across the multilingual corpus.
To anonymize user information, we hash user and tweet ids and then replace hyperlinks, usernames, and hashtags with generic symbols (URL, USER, HASHTAG). Documents are lowercased and tokenized using NLTK (Bird and Loper, 2004). The corpus is randomly split into training (70%), development (15%), and test (15%) sets. We train the models on the training set and find the optimal hyperparameters on the development set before final evaluations on the test set. We randomly shuffle the training data at the beginning of each training epoch.
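A minimal sketch of this anonymization and preprocessing step is shown below; the choice of hash function and the regular expressions are ours.

```python
# Hash ids, replace hyperlinks/usernames/hashtags with generic symbols,
# lowercase, and tokenize with NLTK (requires the "punkt" tokenizer data).
import hashlib
import re

from nltk.tokenize import word_tokenize


def anonymize(tweet_id, text):
    hashed_id = hashlib.sha256(str(tweet_id).encode()).hexdigest()
    text = re.sub(r"https?://\S+", "URL", text)
    text = re.sub(r"@\w+", "USER", text)
    text = re.sub(r"#\w+", "HASHTAG", text)
    return hashed_id, word_tokenize(text.lower())
```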
We implement and experiment with four baseline classification models. For a fair comparison, we keep the feature size up to 15K for each classifier across all five languages. We calculate the weight for each document category as $\frac{N}{N_l}$ (King and Zeng, 2001), where $N$ is the number of documents in each language and $N_l$ is the number of documents labeled with category $l$. For training the BERT model in particular, we append two additional tokens, "[CLS]" and "[SEP]", at the start and end of each document respectively. For the neural models, we pad or truncate each document to 40 tokens and use "unknown" as a replacement for unknown tokens. We initialize the CNN and RNN classifiers with pre-trained word embeddings (Mikolov et al., 2013; Godin et al., 2015; Bojanowski et al., 2017; Deriu et al., 2017) and train the networks for up to 10 epochs.
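As a small illustration of the weighting scheme above, the following sketch computes $N / N_l$ per category from a list of labels; the function name is ours.

```python
# Category weights computed as N / N_l (King and Zeng, 2001):
# N is the number of documents in a language, N_l the number
# labeled with category l.
from collections import Counter


def category_weights(labels):
    n = len(labels)
    counts = Counter(labels)
    return {label: n / count for label, count in counts.items()}

# e.g. category_weights([0, 0, 0, 1]) -> {0: 1.33..., 1: 4.0}
```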
LR. We first extract TF-IDF-weighted features of uni-, bi-, and tri-grams on the corpora, using the most frequent 15K features with a minimum feature frequency of 2. We then train a LogisticRegression model from scikit-learn (Pedregosa et al., 2011). We use "liblinear" as the solver and leave the other parameters at their defaults.

Language    Method  Acc    F1-w   F1-m   AUC
English     LR      .874   .874   .841   .920
English     CNN     .878   .877   .845   .927
English     RNN     .898*  .896*  .867*  .938*
English     BERT    .705   .635   .579   .581
Italian     LR      .660   .679   .631   .725
Italian     CNN     .687   .702   .651   .745
Italian     RNN     .729*  .731*  .666*  .763*
Italian     BERT    .697   .629   .468   .498
Polish      LR      .864*  .846   .653   .804
Polish      CNN     .855   .851   .688   .813
Polish      RNN     .857   .854*  .696*  .822*
Polish      BERT    .824   .782   .478   .474
Portuguese  LR      .660   .598   .551   .648
Portuguese  CNN     .681*  .674*  .653*  .719*
Portuguese  RNN     .607   .586   .553   .633
Portuguese  BERT    .613   .568   .525   .524
Spanish     LR      .704*  .707*  .698*  .761*
Spanish     CNN     .650   .654   .645   .710
Spanish     RNN     .674   .674   .658   .720
Spanish     BERT    .605   .573   .502   .505

Table 5: Overall performance evaluation of baseline classifiers. We evaluate overall performance by four metrics: accuracy (Acc), weighted F1 score (F1-w), macro F1 score (F1-m) and area under the ROC curve (AUC). Higher scores indicate better performance. The model achieving the best performance in each column for each language is marked with *.
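A minimal sketch of this LR baseline with standard scikit-learn APIs follows; note that class_weight="balanced" only approximates the $N / N_l$ weighting described above rather than reproducing it exactly.

```python
# TF-IDF uni-/bi-/tri-grams capped at 15K features with minimum
# frequency 2, fed to LogisticRegression with the liblinear solver.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

lr_baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3), max_features=15000, min_df=2),
    LogisticRegression(solver="liblinear", class_weight="balanced"),
)
# Usage: lr_baseline.fit(train_texts, train_labels)
#        lr_baseline.predict(test_texts)
```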
CNN.
We implement the convolutional neural network (CNN) classifier described in Kim (2014) and Zimmerman et al. (2018) using Keras (Chollet et al., 2015). We first apply 100 filters with three different kernel sizes: 3, 4 and 5. After the convolution operations, we feed the concatenated features to a fully connected layer and output document representations with 100 dimensions. We apply the "softplus" activation with l2 regularization and dropout in the dense layer, which feeds the document representation to the final prediction. We train the model with batch size 64, use Adam as the optimizer (Kingma and Ba, 2014) and calculate loss values with the cross-entropy function. We keep all other parameter settings as described in the paper (Kim, 2014).
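A Keras sketch of this architecture is shown below; the l2 and dropout coefficients, vocabulary size and embedding dimension are placeholders, and in practice the embedding layer would be initialized from the pre-trained vectors mentioned above.

```python
# A sketch of the CNN baseline (Kim, 2014): 100 filters for kernel
# sizes 3/4/5, concatenated into a 100-d dense layer with softplus,
# l2 regularization, and dropout, then a final prediction layer.
from tensorflow.keras import layers, models, regularizers


def build_cnn(vocab_size=15000, embed_dim=300, max_len=40,
              l2=0.01, dropout=0.3):  # l2/dropout values are placeholders
    inputs = layers.Input(shape=(max_len,))
    x = layers.Embedding(vocab_size, embed_dim)(inputs)
    convs = [layers.GlobalMaxPooling1D()(
                 layers.Conv1D(100, k, activation="relu")(x))
             for k in (3, 4, 5)]
    x = layers.Concatenate()(convs)
    x = layers.Dense(100, activation="softplus",
                     kernel_regularizer=regularizers.l2(l2))(x)
    x = layers.Dropout(dropout)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```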
RNN. We build a recurrent neural network (RNN) classifier using a bi-directional Gated Recurrent Unit (bi-GRU) (Chung et al., 2014; Park et al., 2018). We set the output dimension of the GRU to 200 and apply dropout to the output. We optimize the RNN with RMSprop (Tieleman and Hinton, 2012) and use the same loss function and batch size as the CNN model. We leave the other parameters at their defaults in Keras (Chollet et al., 2015).
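A corresponding Keras sketch follows; the dropout rate, vocabulary size and embedding dimension are placeholders.

```python
# A sketch of the RNN baseline: a bidirectional GRU with 200-d output
# per direction, dropout on the output, and RMSprop optimization.
from tensorflow.keras import layers, models


def build_rnn(vocab_size=15000, embed_dim=300, dropout=0.2):
    model = models.Sequential([
        layers.Embedding(vocab_size, embed_dim),
        layers.Bidirectional(layers.GRU(200)),  # bi-GRU, 200-d per direction
        layers.Dropout(dropout),                # dropout rate is a placeholder
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="rmsprop", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```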
BERT. BERT is a transformer-based language model pre-trained on billions of sentences publicly available on the web (Devlin et al., 2019), which can effectively capture precise text semantics and useful signals. We implement a BERT-based classification model with HuggingFace's Transformers (Wolf et al., 2019). The model encodes each document into a fixed-size (768) representation and feeds it to a linear prediction layer. The model is optimized by AdamW with a warmup schedule. We leave the other parameters at their defaults, conduct fine-tuning for 4 epochs and set the batch size to 32 (Sun et al., 2019a). The classification model loads the "bert-base-uncased" pre-trained BERT model for English and the "bert-base-multilingual-uncased" multilingual BERT model (Gertner et al., 2019) for the other languages. The multilingual BERT model follows the same training method as BERT, using Wikipedia text from the top 104 languages. Due to the label imbalance shown in Table 1, we balance training instances by randomly oversampling the minority class during training.
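A hedged sketch with HuggingFace's Transformers is given below; the learning rate, warmup and training step counts are placeholders, and the tokenizer inserts the [CLS] and [SEP] tokens described earlier.

```python
# A sketch of the BERT baseline: encode each document with the
# pre-trained model, feed the pooled 768-d representation to a
# linear layer, and fine-tune with AdamW plus warmup.
import torch
from torch import nn
from transformers import (BertModel, BertTokenizer,
                          get_linear_schedule_with_warmup)


class BertClassifier(nn.Module):
    def __init__(self, model_name="bert-base-uncased", num_labels=2):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.classifier = nn.Linear(768, num_labels)  # fixed-size representation

    def forward(self, input_ids, attention_mask):
        pooled = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).pooler_output
        return self.classifier(pooled)


tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # adds [CLS]/[SEP]
model = BertClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # placeholder rate
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=1000)  # placeholders
```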
Performance Evaluation. To measure overall performance, we evaluate models by four metrics: accuracy (Acc), weighted F1 score (F1-w), macro F1 score (F1-m) and area under the ROC curve (AUC). The F1 score coherently combines precision and recall as $\frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$. We report F1-m considering that the datasets are imbalanced.
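These four metrics map directly onto standard scikit-learn calls, as in the sketch below; the function name is ours.

```python
# Overall evaluation metrics: accuracy, weighted/macro F1, and AUC.
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score


def overall_metrics(y_true, y_pred, y_score):
    return {
        "Acc": accuracy_score(y_true, y_pred),
        "F1-w": f1_score(y_true, y_pred, average="weighted"),
        "F1-m": f1_score(y_true, y_pred, average="macro"),
        # y_score: predicted probability of the hate speech class
        "AUC": roc_auc_score(y_true, y_score),
    }
```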
Fairness Evaluation. To evaluate group fairness, we measure the equality differences (ED) of true positive/negative and false positive/negative rates for each demographic factor. ED is a standard metric for evaluating the fairness and bias of document classifiers (Dixon et al., 2018; Park et al., 2018; Garg et al., 2019). This metric sums the differences between the rates within specific user groups and the overall rates. Taking the false positive rate (FPR) as an example, we calculate the equality difference by:
$\mathrm{FPED} = \sum_{d \in D} |\mathrm{FPR}_d - \mathrm{FPR}|$, where $D$ is a demographic factor (e.g., race) and $d$ is a demographic group (e.g., white or nonwhite).
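A direct sketch of this metric in NumPy follows; FNED is analogous with the false negative rate, and the function names are ours.

```python
# Equality difference of the false positive rate: sum over the groups d
# of a demographic factor of |FPR_d - FPR_overall|.
import numpy as np


def false_positive_rate(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    negatives = y_true == 0
    if not negatives.any():
        return 0.0
    return float(np.mean(y_pred[negatives] == 1))


def fped(y_true, y_pred, groups):
    """Sum of |FPR_d - FPR| over the demographic groups d in `groups`."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    groups = np.asarray(groups)
    overall = false_positive_rate(y_true, y_pred)
    return sum(
        abs(false_positive_rate(y_true[groups == d],
                                y_pred[groups == d]) - overall)
        for d in np.unique(groups))
```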
5. Results
We present our evaluation results for performance and fairness in Table 5 and Table 6, respectively. Country and race have very skewed distributions in the Italian and Polish corpora; therefore, we omit the fairness evaluation for these two factors on those corpora.
Overall performance evaluation.
Table 5 reports the performance of the baseline classifiers for hate speech classification on the proposed corpus, for each of the five languages. Among the four baseline classifiers, LR, CNN and RNN consistently perform well on all languages. Moreover, the neural models (CNN and RNN) substantially outperform LR on four out of five languages (except Spanish). However, the results obtained by BERT are relatively lower than the other baselines, with a more pronounced gap on the English dataset. One possible explanation is that BERT was pre-trained on Wikipedia documents, which differ significantly from the Twitter corpus in document length, word usage and grammar. For example, a tweet is a short document of roughly 20 tokens, while BERT was trained on documents of up to 512 tokens. Existing research suggests that fine-tuning on the multilingual corpus can further improve the performance of BERT models (Sun et al., 2019a).
Group fairness evaluation.
Table 6 reports the group fairness measurements. Generally, the RNN classifier achieves better and more stable performance across the major fairness evaluation tasks. Comparing the baseline classifiers, we find that LR usually shows stronger biases than the neural classification models on the majority of tasks. While the BERT classifier achieves comparatively lower accuracy and F1 scores, it exhibits less bias on most of the datasets; however, bias increases significantly on the Portuguese dataset, where BERT achieves better performance. We examine this relationship by fitting a linear model between two sets of differences: the performance differences between the RNN and the other classifiers, and the SUM-ED differences between the RNN and the other classifiers. We find that classification performance does not correlate significantly with fairness and bias. The strongest biases vary across tasks and languages: the classifiers trained on Polish and Italian are biased the most by age and gender, the classifiers trained on Spanish and Portuguese are biased the most by country, and the classifiers trained on English tweets are the least biased across all attributes. Classifiers usually have very high bias scores on both gender and age in the Italian and Polish data; we find that age and gender both have very skewed distributions in these two datasets. Overall, our baselines provide a promising starting point for evaluating future methods of reducing demographic biases in document classification under the multilingual setting.
6. Conclusion
In this paper, we propose a new multilingual dataset covering four author demographic annotations (age, gender, race and country) for the hate speech detection task. We show experimental results of several popular classification models in both overall performance and fairness evaluations. Our empirical exploration indicates that language variations across demographic groups can lead to biased classifiers. This dataset can be used for measuring the fairness of document classifiers along author-level attributes and for exploring bias factors across multilingual settings and multiple user factors. The proposed framework for inferring the author demographic attributes can be used to generate larger-scale datasets or even be applied to other social media sites (e.g., Amazon and Yelp). While we encode the demographic attributes into categories in this work, we will provide the inferred probabilities of the demographic attributes from Face++ to allow for broader research exploration. Our code, anonymized data and data statement (Bender and Friedman, 2018) will be publicly available at https://github.com/xiaoleihuang/Multilingual_Fairness_LREC.

While our dataset provides new information on author demographic attributes, and our analysis suggests directions toward reducing bias, a number of limitations must be acknowledged in order to appropriately interpret our findings. First, inferring user demographic attributes from profile information can be risky due to the accuracy of the inference toolkit. In this work, we present multiple strategies to reduce the errors introduced by the inference toolkits, such as human evaluation, manual screening and using external public profile information (Instagram). However, we cannot guarantee perfect accuracy of the demographic attributes, and errors in the attributes may themselves be "unfair" or unevenly distributed due to bias in the inference tools (Buolamwini and Gebru, 2018). Still, obtaining individual-level attributes is an important step toward understanding classifier fairness, and our results found biases across these groupings of users, even if some of the groupings contained errors.

Second, because methods for inferring demographic attributes are not accurate enough to provide fine-grained information, our attribute categories are still coarse-grained (binary age groups and gender, and only four race categories). Using coarse-grained attributes can hide the identities of specific demographic groups, including other racial minorities and people with non-binary gender. Broadening our analyses and evaluations to include more attribute values may require better methods of user attribute inference or different sources of data.

Third, language variations across demographic groups might introduce annotation biases. Existing research (Sap et al., 2019) shows that annotators are more likely to annotate tweets containing African American English words as hate speech. Additionally, nationality and educational level might also impact the quality of annotations (Founta et al., 2018). Similarly, the different annotation sources of our dataset (which merged multiple corpora) might vary in their annotation schemas. To reduce annotation biases due to the differing schemas, we merge the annotations into the two most compatible document categories: normal and hate speech. Annotation biases might still exist; therefore, we will release our original anonymized multilingual dataset for research communities.
Age
Language    Method  FNED  FPED  SUM-ED
English     LR      .059  .104  .163
English     CNN     .052  .083  .135
English     RNN     .041  .118  .159
English     BERT    .004  .012  .016
Italian     LR      .076  .194  .270
Italian     CNN     .003  .211  .214
Italian     RNN     .042  .185  .227
Italian     BERT    .029  .034  .063
Polish      LR      .256  .059  .315
Polish      CNN     .389  .138  .527
Polish      RNN     .335  .089  .424
Polish      BERT    .027  .027  .054
Portuguese  LR      .061  .044  .105
Portuguese  CNN     .033  .096  .129
Portuguese  RNN     .079  .045  .124
Portuguese  BERT    .090  .097  .187
Spanish     LR      .089  .013  .102
Spanish     CNN     .117  .139  .256
Spanish     RNN     .078  .083  .161
Spanish     BERT    .052  .015  .067

Gender
Language    Method  FNED  FPED  SUM-ED
English     LR      .023  .056  .079
English     CNN     .018  .056  .074
English     RNN     .013  .055  .068
English     BERT    .007  .009  .016
Italian     LR      .145  .020  .165
Italian     CNN     .064  .094  .158
Italian     RNN     .088  .075  .163
Italian     BERT    .041  .056  .097
Polish      LR      .266  .045  .309
Polish      CNN     .411  .048  .459
Polish      RNN     .340  .034  .374
Polish      BERT    .042  .013  .055
Portuguese  LR      .052  .007  .059
Portuguese  CNN     .018  .013  .031
Portuguese  RNN     .099  .083  .182
Portuguese  BERT    .055  .125  .180
Spanish     LR      .131  .061  .292
Spanish     CNN     .032  .108  .140
Spanish     RNN     .030  .039  .069
Spanish     BERT    .021  .016  .037

Country
Language    Method  FNED  FPED  SUM-ED
English     LR      .026  .053  .079
English     CNN     .027  .063  .090
English     RNN     .024  .061  .085
English     BERT    .006  .001  .007
Portuguese  LR      .093  .026  .119
Portuguese  CNN     .110  .122  .232
Portuguese  RNN     .022  .004  .026
Portuguese  BERT    .073  .071  .144
Spanish     LR      .152  .154  .306
Spanish     CNN     .089  .089  .178
Spanish     RNN     .071  .113  .184
Spanish     BERT    .017  .017  .034

Race
Language    Method  FNED  FPED  SUM-ED
English     LR      .019  .056  .075
English     CNN     .007  .029  .036
English     RNN     .008  .063  .071
English     BERT    .003  .009  .012
Portuguese  LR      .068  .005  .073
Portuguese  CNN     .056  .033  .089
Portuguese  RNN     .074  .054  .128
Portuguese  BERT    .045  .186  .231
Spanish     LR      .095  .030  .125
Spanish     CNN     .072  .054  .126
Spanish     RNN     .011  .004  .015
Spanish     BERT    .046  .005  .051

Table 6: Fairness evaluation of baseline classifiers across the five languages on the four demographic factors (country and race are omitted for Italian and Polish due to their skewed distributions). We measure fairness and bias of document classifiers by the equality differences of false negative rate (FNED), false positive rate (FPED) and their sum (SUM-ED). Higher scores indicate lower fairness and higher bias, and vice versa.

7. Acknowledgement
The authors thank the anonymous reviewers for their insightful comments and suggestions. This work was supported in part by the National Science Foundation under award number IIS-1657338. This work was also supported in part by a research gift from Adobe.
8. Bibliographical References
Basile, V., Bosco, C., Fersini, E., Nozza, D., Patti, V., Rangel Pardo, F. M., Rosso, P., and Sanguinetti, M. (2019). SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 54–63, Minneapolis, Minnesota, USA, June. ACL.
Bender, E. M. and Friedman, B. (2018). Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587–604.
Bird, S. and Loper, E. (2004). NLTK: The natural language toolkit. In Proceedings of the ACL 2004 on Interactive Poster and Demonstration Sessions, page 31.
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.
Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
Bolukbasi, T., Chang, K.-W., Zou, J. Y., Saligrama, V., and Kalai, A. T. (2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In NIPS, pages 4349–4357.
Borkan, D., Dixon, L., Sorensen, J., Thain, N., and Vasserman, L. (2019). Nuanced metrics for measuring unintended bias with real data for text classification. In WWW.
Buolamwini, J. and Gebru, T. (2018). Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, pages 77–91.
Chollet, F. et al. (2015). Keras. https://keras.io.
Chouldechova, A. and Roth, A. (2018). The frontiers of fairness in machine learning. arXiv preprint arXiv:1810.08810.
Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning.
Coulmas, F. (2017). Sociolinguistics: The Study of Speakers' Choices, second edition. Cambridge University Press.
Davidson, T., Bhattacharya, D., and Weber, I. (2019). Racial bias in hate speech and abusive language detection datasets. In Proceedings of the Third Workshop on Abusive Language Online, pages 25–35. ACL.
Deriu, J., Lucchi, A., De Luca, V., Severyn, A., Müller, S., Cieliebak, M., Hofmann, T., and Jaggi, M. (2017). Leveraging large amounts of weakly supervised data for multi-language sentiment classification. In WWW '17, pages 1045–1052, Republic and Canton of Geneva, Switzerland. International World Wide Web Conferences Steering Committee.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, pages 4171–4186, Minneapolis, Minnesota, June. ACL.
Diaz, M., Johnson, I., Lazar, A., Piper, A. M., and Gergle, D. (2018). Addressing age-related bias in sentiment analysis. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pages 412:1–412:14.
Dixon, L., Li, J., Sorensen, J., Thain, N., and Vasserman, L. (2018). Measuring and mitigating unintended bias in text classification. In AIES, pages 67–73.
Fortuna, P., Rocha da Silva, J., Soler-Company, J., Wanner, L., and Nunes, S. (2019). A hierarchically-labeled Portuguese hate speech dataset. In Proceedings of the Third Workshop on Abusive Language Online, pages 94–104.
Founta, A. M., Djouvas, C., Chatzakou, D., Leontiadis, I., Blackburn, J., Stringhini, G., Vakali, A., Sirivianos, M., and Kourtellis, N. (2018). Large scale crowdsourcing and characterization of Twitter abusive behavior. In ICWSM.
Garg, N., Schiebinger, L., Jurafsky, D., and Zou, J. (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115:E3635–E3644.
Garg, S., Perot, V., Limtiaco, N., Taly, A., Chi, E. H., and Beutel, A. (2019). Counterfactual fairness in text classification through robustness. In AIES.
Gertner, A., Henderson, J., Merkhofer, E., Marsh, A., Wellner, B., and Zarrella, G. (2019). MITRE at SemEval-2019 task 5: Transfer learning for multilingual hate speech detection. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 453–459, Minneapolis, Minnesota, USA, June. ACL.
Godin, F., Vandersmissen, B., De Neve, W., and Van de Walle, R. (2015). Multimedia Lab @ ACL WNUT NER shared task: Named entity recognition for Twitter microposts using distributed word representations. In Proceedings of the Workshop on Noisy User-generated Text, pages 146–153, Beijing, China, July. ACL.
Hovy, D. (2015). Demographic factors improve classification performance. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, pages 752–762.
Huang, X. and Paul, M. J. (2019). Neural user factor adaptation for text classification: Learning to generalize across author demographics. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics, pages 46–56.
Johannsen, A., Hovy, D., and Søgaard, A. (2015). Cross-lingual syntactic variation over age and gender. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning, pages 103–112, Beijing, China, July. ACL.
Jung, S.-G., An, J., Kwak, H., Salminen, J., and Jansen, B. J. (2018). Assessing the accuracy of four popular face recognition tools for inferring gender, age, and race. In ICWSM.
Kim, Y. (2014). Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar, October. ACL.
King, G. and Zeng, L. (2001). Logistic regression in rare events data. Political Analysis, 9(2):137–163.
Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kiritchenko, S. and Mohammad, S. (2018). Examining gender and race bias in two hundred sentiment analysis systems. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 43–53.
Kong, L., Schneider, N., Swayamdipta, S., Bhatia, A., Dyer, C., and Smith, N. A. (2014). A dependency parser for tweets. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1001–1012.
Maier, W. and Gómez-Rodríguez, C. (2014). Language variety identification in Spanish tweets. In Proceedings of the EMNLP'2014 Workshop on Language Technology for Closely Related Languages and Language Variants, pages 25–35, Doha, Qatar, October. ACL.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119.
Park, J. H., Shin, J., and Fung, P. (2018). Reducing gender bias in abusive language detection. In Proceedings of the 2018 Conference on EMNLP, pages 2799–2804.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830.
Preoțiuc-Pietro, D. and Ungar, L. (2018). User-level race and ethnicity predictors from Twitter text. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1534–1545.
Ptaszynski, M., Pieciukiewicz, A., and Dybała, P. (2019). Results of the PolEval 2019 shared task 6: First dataset and open shared task for automatic cyberbullying detection in Polish Twitter. In Proceedings of the PolEval 2019 Workshop, page 89.
Rehurek, R. and Sojka, P. (2010). Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks.
Sanguinetti, M., Poletto, F., Bosco, C., Patti, V., and Stranisci, M. (2018). An Italian Twitter corpus of hate speech against immigrants. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May. European Language Resources Association (ELRA).
Sap, M., Card, D., Gabriel, S., Choi, Y., and Smith, N. A. (2019). The risk of racial bias in hate speech detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1668–1678, July.
Shen, Q., Yoder, M., Jo, Y., and Rose, C. (2018). Perceptions of censorship and moderation bias in political debate forums. In ICWSM.
Sun, C., Qiu, X., Xu, Y., and Huang, X. (2019a). How to fine-tune BERT for text classification? arXiv preprint arXiv:1905.05583.
Sun, T., Gaut, A., Tang, S., Huang, Y., ElSherief, M., Zhao, J., Mirza, D., Belding, E., Chang, K.-W., and Wang, W. Y. (2019b). Mitigating gender bias in natural language processing: Literature review. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1630–1640.
Tieleman, T. and Hinton, G. (2012). Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31.
Volkova, S., Wilson, T., and Yarowsky, D. (2013). Exploring demographic language variations to improve multilingual sentiment analysis in social media. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1815–1827.
Volkova, S., Bachrach, Y., Armstrong, M., and Sharma, V. (2015). Inferring latent user properties from texts published in social media. In AAAI.
Waseem, Z. and Hovy, D. (2016). Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop, pages 88–93.
Waseem, Z. (2016). Are you a racist or am I seeing things? Annotator influence on hate speech detection on Twitter. In Proceedings of the First Workshop on NLP and Computational Social Science, pages 138–142.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., and Brew, J. (2019). HuggingFace's Transformers: State-of-the-art natural language processing.