Challenges in Automated Debiasing for Toxic Language Detection
Xuhui Zhou ♥ Maarten Sap ♣ Swabha Swayamdipta ♦ Noah A. Smith ♣♦ Yejin Choi ♣♦♥
♥ Department of Linguistics, University of Washington
♣ Paul G. Allen School of Computer Science & Engineering, University of Washington
♦ Allen Institute for Artificial Intelligence
[email protected], {msap,yejin,nasmith}@cs.washington.edu, [email protected]

Abstract
Warning: this paper contains content that may be offensive or upsetting.
Biased associations have been a challenge in the development of classifiers for detecting toxic language, hindering both fairness and accuracy. As potential solutions, we investigate recently introduced debiasing methods for text classification datasets and models, as applied to toxic language detection. Our focus is on lexical (e.g., swear words, slurs, identity mentions) and dialectal markers (specifically African American English). Our comprehensive experiments establish that existing methods are limited in their ability to prevent biased behavior in current toxicity detectors. We then propose an automatic, dialect-aware data correction method, as a proof-of-concept study. Despite the use of synthetic labels, this method reduces dialectal associations with toxicity. Overall, our findings show that debiasing a model trained on biased toxic language data is not as effective as simply relabeling the data to remove existing biases.
Current hate speech or toxic language detection systems exhibit problematic and discriminatory behavior that causes them to have a disparate negative impact on minority populations (Yasin, 2018; Guynn, 2020; Kim et al., 2020; Dias Oliva et al., 2020). (We use hate speech and toxic language interchangeably in this work, though their definitions do not perfectly align.) Tweets simply containing a minority identity mention are commonly flagged as toxic by current systems, in contrast to those containing majority identity mentions, as illustrated in Figure 1. At the core of the issue are dataset biases, i.e., spurious correlations between surface patterns and annotated toxicity labels (§2), which stem from the data creation process (Sap et al., 2019).
Figure 1: Lexical items and dialect markers cause problematic behavior for toxic language detection systems such as the widely used PerspectiveAPI. In the top two example pairs, statements with minority identity mentions and swear words used inoffensively are flagged as toxic, but majority identity mentions or offensive statements without overt swearing are missed. The bottom pair shows dialect-based racial bias for two inoffensive greetings, where markers of African American English (AAE) trigger the toxicity detector.

Previous work has outlined two such biases for hate speech datasets (both shown in Figure 1): lexical bias, which associates toxicity with the presence of certain words (e.g., profanities, identity mentions; Dixon et al., 2018; Dinan et al., 2019), and dialectal bias, where toxicity is correlated with surface markers of African American English (AAE; Davidson et al., 2019; Sap et al., 2019). When trained on biased datasets, models acquire and exacerbate these biases (e.g., flagging text by Black authors as more toxic than text by white authors; Sap et al., 2019; Zhang et al., 2018).

Concurrently, there has been elevated interest in developing debiasing methods for standard natural language understanding (NLU) tasks, i.e., methods that aim to decrease over-reliance on spurious correlations in NLU models (Clark et al., 2019; He et al., 2019; Karimi Mahabadi et al., 2020; Bras et al., 2020). This raises a natural question: are current debiasing approaches effective for mitigating biases specific to toxic language detection?

In this work, we address the above question by investigating two classes of debiasing approaches to mitigate lexical and dialectal biases: one that employs additional training objectives for bias removal, and another that filters training instances likely exhibiting spurious biases (§3). Through comprehensive experiments, we show that both approaches face major challenges in mitigating biases from a model trained on a biased dataset (in our case, the dataset from Founta et al., 2018) for toxic language detection. While data filtering results in reduced bias associations in the data, models trained on filtered datasets still pick up on lexical (§4) and dialectal biases (§5). We find that dialectal biases are particularly challenging to address, as has also been shown by Xia et al. (2020). "Debiased" models still disproportionately flag text in certain dialects as toxic. Notably, mitigating dialectal bias through current debiasing methods does not mitigate a model's propensity to label tweets by Black authors as more toxic than those by white authors.

We additionally explore an alternative proof-of-concept study: relabeling supposedly toxic training instances whose automatic translations into a majority dialect are deemed non-toxic by the classifier. To this end, we create a synthetic dataset via a few-shot dialect translation system built with GPT-3 (Brown et al., 2020). While only an illustrative solution, it nevertheless takes into account the dialectal context of the tweet, resulting in a model less prone to dialectal and racial biases (§6). Overall, our findings indicate that debiasing a model already trained on biased toxic language data can be challenging, compared to relabeling the data to remove existing biases. Our code and data are publicly available at https://github.com/XuhuiZhou/Toxic_Debias.

We test the use of debiasing methods for the task of toxic language detection, which aims to flag rude, offensive, hateful, or toxic language on the internet, with the goal of moderating online communities (Roberts, 2019; Vidgen et al., 2019). Our definition of "bias" is specific to the social biases in toxic language detection datasets, grounded as lexical and dialectal biases; see Blodgett et al. (2020) for a detailed investigation of the term "bias".
This task differs in several ways from the natural language understanding (NLU) tasks that debiasing methods have been successful on, such as textual entailment (e.g., SNLI, MNLI; Bowman et al., 2015; Williams et al., 2018) or reading comprehension (e.g., SQuAD; Rajpurkar et al., 2016). First, compared to these NLU tasks, where there is one correct label, the toxicity of language is inherently more nuanced, subjective, and contextual, which causes toxic language datasets to have lower agreement in general (Ross et al., 2017). Second, the dataset biases in NLU are predominantly artifacts introduced during data creation (e.g., negations, exaggerations; Schwartz et al., 2017; Gururangan et al., 2018), whereas those in toxic language detection are grounded in the social dynamics of the world (Spears, 1998; Technau, 2018). For example, viewing
AAE as a more toxic or less proper variety of English is a form of linguistic discrimination that upholds racial hierarchies in the United States (Rosa and Flores, 2017). In this work, we consider two broad categories of toxic language dataset biases: lexical (§2.1) and dialectal (§2.2). Our experiments focus on a single, widely used dataset (§2.3) from Founta et al. (2018).

Lexical Biases (TOXTRIG)

Current toxic language detection systems often rely on the presence or absence of certain words (e.g., swear words, identity mentions) to make their predictions (Dixon et al., 2018; Dinan et al., 2019). While most previous analyses of this bias relied on a simple list of "bad" words (https://tinyurl.com/list-of-bad-words; Davidson et al., 2019; Dinan et al., 2019), we take a more nuanced view of how lexical items can convey toxicity, inspired by work in pragmatics and sociolinguistics of rudeness (Dynel, 2015; Kasper, 1990, inter alia). Specifically, we manually split our full list of words into three distinct categories, depending on the extent to which they carry profane or hateful meanings or are simply associated with hateful contexts; we note, however, that this categorization is in itself subjective. We refer to the full set of words as TOXTRIG, for Toxicity Triggers, which is included in our released repository (https://github.com/XuhuiZhou/Toxic_Debias/blob/master/data/word_based_bias_list.csv).

Non-offensive minority identity mentions (NOI) refers to descriptive mentions of minoritized demographic or social identities (e.g., gay, female, Muslim). While these mentions are not usually inherently offensive by themselves, they are often found in offensive statements that are hateful towards minorities (Dixon et al., 2018). We detect these identity mentions in text using a list of 26 regular expressions.

Possibly offensive minority identity mentions (OI) are mentions of minoritized identities that could denote profanity or hate depending on pragmatic and contextual interpretations. This includes slurs and objectifying outdated terms used to refer to minority groups, which are usually understood as attacks. Additionally, this includes reclaimed slurs (queer, n*gga), which connote less offensive intent when spoken by in-group members compared to out-group members (Croom, 2013).

Possibly offensive non-identity mentions (ONI) contain swear words and other profanities, which are usually offensive but not associated with any social groups (e.g., f*ck, sh*t). Note that the pragmatic interpretation of these words is not necessarily always toxic or offensive (Dynel, 2012), as they are often used to convey closeness between the speaker and listener or to emphasize the emotionality of a statement (e.g., the second example in Figure 1).
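As a rough illustration of how tweets can be matched against these three categories, the sketch below checks a handful of category-keyed patterns. The word lists and helper name are illustrative placeholders only; the actual TOXTRIG lexicon and the 26 NOI regular expressions are the ones released in the repository linked above.

```python
import re

# Illustrative stand-in for the TOXTRIG lexicon; the released word list and the
# 26 NOI regular expressions in the repository are far more complete.
TOXTRIG_PATTERNS = {
    "NOI": [r"gay", r"females?", r"muslims?"],   # non-offensive minority identity mentions
    "OI":  [r"queers?"],                         # possibly offensive minority identity mentions
    "ONI": [r"fuck\w*", r"shit\w*"],             # possibly offensive non-identity mentions
}

def toxtrig_categories(tweet: str) -> set:
    """Return the TOXTRIG categories whose patterns occur in the tweet."""
    hits = set()
    for category, patterns in TOXTRIG_PATTERNS.items():
        if any(re.search(rf"\b{p}\b", tweet, flags=re.IGNORECASE) for p in patterns):
            hits.add(category)
    return hits

print(toxtrig_categories("Fucking love this."))                # {'ONI'}
print(toxtrig_categories("I identify as a black gay woman."))  # {'NOI'}
```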
Dialectal Biases (AAE)

Current toxic language detection systems also associate higher toxicity with dialectal markers of African American English (AAE; Sap et al., 2019; Davidson et al., 2019). Since AAE is a variety of English that is common among African Americans and often signals a cultural identity in the US (Green, 2002), this dialect-based racial bias causes speech by Black authors to be suppressed more often than that by non-Black authors (Sap et al., 2019), thereby exacerbating racial inequality (Rosa, 2019).

In our experiments, we estimate the dialect of a tweet using a topic model from Blodgett et al. (2016). This model was trained on 60M tweets, where the dialect of a tweet was inferred from its geo-coordinates, yielding the probability of a tweet being in one of four dialects (African-American English, white-aligned English, Hispanic, and other). In this study, we only focus on African-American English (AAE) and white-aligned English (WAE) tweets; both definitions are based on US English, as per Blodgett et al. (2016). We avoid using disputed terms such as general American English, standard American English, or mainstream US English, which are frequently used for WAE, since we believe that no dialect should be privileged with the designation "general", "standard", or "mainstream" (Rosa, 2019). Our experiments either use the probability of a tweet being in these dialects, or assign tweets their estimated-most-probable dialect.
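Concretely, the per-tweet usage might look like the following minimal sketch; the probabilities are assumed to come from the Blodgett et al. (2016) topic model, and both the numbers and the helper name are made-up placeholders rather than the paper's code.

```python
def most_probable_dialect(dialect_probs: dict) -> str:
    """Assign a tweet its estimated-most-probable dialect."""
    return max(dialect_probs, key=dialect_probs.get)

# Placeholder output of the dialect topic model for one tweet.
tweet_probs = {"AAE": 0.62, "WAE": 0.21, "Hispanic": 0.09, "other": 0.08}

p_aae = tweet_probs["AAE"]                    # soft usage, e.g., correlating with toxicity
dialect = most_probable_dialect(tweet_probs)  # hard usage, a single dialect label
print(dialect, p_aae)
```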
We focus our analyses on a widely used hate speech dataset of English tweets (Founta et al., 2018). The tweets were collected using a multi-round bootstrapping procedure, and were labeled out of context for toxic language; only the tweet text, with no profile information or conversational context, was shown to annotators. We focus on the 86k tweets that are annotated as hateful, abusive, or neither, and discard those labelled as spam. We aggregate the abusive and hateful labels into a single toxic category, yielding 32k toxic and 54k non-toxic tweets. We also explored another widely used hate speech dataset (Davidson et al., 2017), which collected tweets using a seed list of swear words and slurs; however, in line with findings by Xia et al. (2020), debiasing led to degenerate behavior due to the data collection process, as discussed in Appendix B.

We consider two types of debiasing methods from the current literature. The first type addresses known, pre-defined biases, such as lexical and dialectal biases for hate speech detection, via a model-based approach involving additional training objectives (§3.1). In contrast, the second type is agnostic to prior knowledge about biases, and instead filters out examples that appear "too easy" and might hence contain spurious correlations (§3.2).
We use the LEARNED-MIXIN method of Clark et al. (2019), which achieved high out-of-distribution (OOD) performance on several NLU tasks, for debiased training. This method trains an ensemble containing a bias-only model, which only uses pre-defined features corresponding to known biases, and a full model, which uses all features. Intuitively, the ensemble encourages the full model to rely more on features unrelated to the biases. Once trained, the bias-only model is discarded, and only the "bias-free" full model is used for inference, following Clark et al. (2019).
Bias-only model
Given its effectiveness on bag-of-words (BoW) features, we use an SVM classifier as the lexical-bias-only model. For example, the TOXTRIG-only model counts the frequency of TOXTRIG words in each tweet. Our dialectal-bias-only model uses the probabilities of dialects (AAE, WAE, Hispanic, and other) obtained from a dialect detector (Blodgett et al., 2016) as features in an SVM classifier.
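A minimal sketch of the two bias-only models, assuming scikit-learn and toy inputs (the texts, labels, word list, and dialect probabilities below are made up; the real models use the full TOXTRIG lexicon and the Blodgett et al. (2016) dialect estimates):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

tweets = ["Fucking love this.", "What a lovely day", "that shit was awful", "see you tomorrow"]
labels = [0, 0, 1, 0]  # 1 = toxic, 0 = non-toxic (toy labels)

# Lexical-bias-only model: bag-of-words counts restricted to TOXTRIG entries.
toxtrig_vocab = ["fucking", "shit", "gay", "muslim", "queer"]  # illustrative subset
vectorizer = CountVectorizer(vocabulary=toxtrig_vocab, lowercase=True)
X_lex = vectorizer.transform(tweets)
lexical_bias_only = LinearSVC().fit(X_lex, labels)

# Dialectal-bias-only model: the four dialect probabilities as features.
X_dial = np.array([[0.10, 0.80, 0.05, 0.05],   # [AAE, WAE, Hispanic, other], made up
                   [0.05, 0.85, 0.05, 0.05],
                   [0.30, 0.60, 0.05, 0.05],
                   [0.20, 0.70, 0.05, 0.05]])
dialect_bias_only = LinearSVC().fit(X_dial, labels)
```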
Full model
We fine-tune a RoBERTa-large classifier (Liu et al., 2019), a state-of-the-art classifier for the toxicity detection task. See Appendix A.1 for more modeling details. Note that we only consider the LEARNED-MIXIN-ONI and LEARNED-MIXIN-TOXTRIG models for lexical debiasing, due to poor accuracies of the bias-only models for NOI and OI: the NOI and OI bias-only models reach only 63% and 67% accuracy, respectively, which is empirically hard for the ensemble to use, likely due to the low coverage of those categories in the training set (4.43% NOI and 4.25% OI).

In addition to debiasing methods that handle known biases, we also explore automated approaches which filter out instances exhibiting unspecified, spurious biases. Specifically, we describe below two data selection methods that have shown strong OOD performance.
AFLite (Bras et al., 2020) is an algorithm based on the key intuition that examples predicted correctly by the simplest methods likely exhibit spurious biases. An ensemble of simple linear classifiers is trained and tested on different partitions of the data; test instances which are "predictable", or classified correctly by most classifiers in the ensemble, are discarded. The algorithm is iterative, and is repeated until a target data size is achieved. Models trained on this filtered dataset achieve higher performance on OOD and adversarially constructed test sets, compared to the original model, on several text and image classification datasets. This indicates a reduction of spurious biases in the filtered data.
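The following is a minimal sketch of this filtering loop, not the reference implementation of Bras et al. (2020); the partition sizes, thresholds, and the assumption that X holds precomputed feature representations (e.g., from a fine-tuned encoder) are all illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def aflite(X, y, target_size, n_models=64, train_frac=0.8,
           cut_per_iter=500, predictability_threshold=0.75, seed=0):
    """Minimal AFLite-style filtering: repeatedly drop the instances that an
    ensemble of simple linear classifiers predicts correctly most often."""
    rng = np.random.default_rng(seed)
    keep = np.arange(len(y))
    while len(keep) > target_size:
        correct = np.zeros(len(keep))
        counts = np.zeros(len(keep))
        for _ in range(n_models):
            # Random train/test partition over the currently kept instances.
            perm = rng.permutation(len(keep))
            n_train = int(train_frac * len(keep))
            tr, te = perm[:n_train], perm[n_train:]
            clf = LogisticRegression(max_iter=200).fit(X[keep[tr]], y[keep[tr]])
            preds = clf.predict(X[keep[te]])
            correct[te] += (preds == y[keep[te]])
            counts[te] += 1
        predictability = np.divide(correct, counts,
                                   out=np.zeros_like(correct), where=counts > 0)
        # Drop the most predictable instances above the threshold.
        candidates = np.where(predictability >= predictability_threshold)[0]
        if len(candidates) == 0:
            break
        drop = candidates[np.argsort(-predictability[candidates])][:cut_per_iter]
        drop = drop[: max(len(keep) - target_size, 0)]
        keep = np.delete(keep, drop)
    return keep  # indices of the retained (less "predictable") instances
```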
DataMaps (Swayamdipta et al., 2020) show the presence of distinct regions in a dataset, namely easy, hard, and ambiguous, defined with respect to a given model. These regions are discovered based on the training dynamics of a model, determined by the model's confidence in the true class for each example, as well as the variability of this confidence across training epochs. Swayamdipta et al. (2020) show that training exclusively on the hard and ambiguous regions of the data results in high OOD performance, indicating a lower prevalence of spurious biases. The easy region is the largest in size for RoBERTa; however, experiments showed that training exclusively on these examples hurts OOD generalization on different NLU tasks. Following this work, we create DataMaps-Easy, DataMaps-Ambiguous, and DataMaps-Hard subsets for our dataset.

Following Swayamdipta et al. (2020), we set the target filtered subset size to 33% of the original training set for both filtering methods, but our filtering additionally preserves the original label proportions. We then fine-tune a RoBERTa-large classifier on these filtered subsets; see Appendix A.2 for more details.
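A minimal sketch of the region construction, assuming the probability assigned to the gold label has been logged at the end of each training epoch (the array below is a random placeholder); the label-proportion preservation mentioned above is omitted here.

```python
import numpy as np

def data_map_regions(gold_probs_per_epoch, subset_frac=0.33):
    """gold_probs_per_epoch: array of shape (n_epochs, n_examples) holding the
    model's probability of the gold label for each training example at each
    epoch. Returns index arrays for the easy, ambiguous, and hard regions."""
    confidence = gold_probs_per_epoch.mean(axis=0)   # mean probability of the gold class
    variability = gold_probs_per_epoch.std(axis=0)   # std. dev. of that probability
    n = int(subset_frac * gold_probs_per_epoch.shape[1])
    easy = np.argsort(-confidence)[:n]        # most confidently learned examples
    hard = np.argsort(confidence)[:n]         # least confident examples
    ambiguous = np.argsort(-variability)[:n]  # highest-variability examples
    return easy, ambiguous, hard

# e.g., with probabilities logged over 10 epochs for 1,000 examples:
probs = np.random.rand(10, 1000)  # placeholder for logged training dynamics
easy_idx, ambig_idx, hard_idx = data_map_regions(probs)
```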
We investigate the effect of debiasing approaches (§3) on removing lexical biases in hate speech detection. First, we discuss the evaluation framework for measuring bias reduction (§4.1). We present quantitative (§4.2) and qualitative (§4.3) results on lexical bias removal for all debiasing approaches, and OOD evaluation for debiased training methods (§4.4). See Appendix A.3 for hyperparameters and other experimental settings.
We report the performance of all models as overall accuracy and F1 with respect to the toxic class. Given that current hate speech systems tend to rely heavily on the presence of NOI, OI, and ONI mentions (§2.1) for labeling text as toxic, we use the false positive rate (FPR) over each of these categories to measure the degree of bias in the model, following Hardt et al. (2016) and Xia et al. (2020). Specifically, we report the FPR of a model on tweets containing NOI (FPR_NOI), OI (FPR_OI), and ONI (FPR_ONI), as well as the F1 corresponding to each of these classes. Intuitively, the lower the FPR*, the less the model infers lexical associations for toxicity, and hence the less biased it is.

                         R_NOI ↓   R_OI ↓    R_ONI ↓
    Original             0.0445    0.2641    0.6718
    33% train:
      Random             0.0345    0.2603    0.6683
      AFLite             0.0434    0.2458    0.6016
      DataMaps-Ambig.    0.0126    0.1968      –
      DataMaps-Hard        –         –         –

Table 1: Lexical associations between toxicity and TOXTRIG mentions in the original dataset (Founta et al., 2018) and various filtered counterparts. Random, AFLite, and DataMaps all contain only 33% of the original data after filtering. Lower Pearson R correlation values indicate fewer superficial patterns in the dataset, i.e., less bias. Takeaway: The hard and ambiguous subsets given by DataMaps contain the lowest amount of lexical associations.
Evaluation for Filtered Datasets
We additionally consider metrics based on spurious lexical associations for the data filtering approaches. These measure the prevalence of spurious surface patterns in the filtered datasets, which might propagate to models trained on the data. Specifically, we report the Pearson correlation between the gold-standard toxicity label and whether or not the tweet contains NOI, OI, or ONI mentions. These correlations are denoted as R_NOI, R_OI, and R_ONI, respectively; lower values indicate a reduction in lexical biases.
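A small sketch of these metrics, assuming a pandas DataFrame with binary columns for the gold label, the model prediction, and per-category TOXTRIG indicators (the column names are illustrative, not from the released code):

```python
import pandas as pd
from scipy.stats import pearsonr

def lexical_bias_metrics(df: pd.DataFrame, categories=("NOI", "OI", "ONI")):
    """df has binary columns 'gold' (1 = toxic), 'pred' (1 = toxic), and one
    indicator column per TOXTRIG category (1 iff the tweet contains a mention)."""
    metrics = {}
    for cat in categories:
        subset = df[df[cat] == 1]
        negatives = subset[subset["gold"] == 0]
        # FPR_cat: fraction of non-toxic tweets containing the category that are
        # nevertheless flagged as toxic.
        metrics[f"FPR_{cat}"] = negatives["pred"].mean() if len(negatives) else float("nan")
        # R_cat: Pearson correlation between the gold label and category presence.
        metrics[f"R_{cat}"] = pearsonr(df["gold"], df[cat])[0]
    return metrics
```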
Baselines

We compare against two natural baselines: a vanilla RoBERTa-large classifier trained on the original dataset (Original), and a classifier trained on a random selection of the training data (Random), for comparison with the data filtering methods. Each such random subset contains 33% of the training data.
First, we measure the reduction in lexical biases in the filtered datasets given by AFLite and DataMaps. As shown in Table 1, the subsets given by AFLite and by the ambiguous and hard regions produced by DataMaps reduce the overall associations between TOXTRIG words and toxicity, compared to the original and random baselines; DataMaps-Hard has the largest reduction. On the other hand, as expected, DataMaps-Easy shows an increased association between TOXTRIG mentions and toxicity, showing that these examples display overt lexical biases.

Table 2 shows results for lexical bias reduction using both debiased training approaches, as well as models trained on datasets filtered using AFLite and all three regions from DataMaps. Both debiased training approaches, LMIXIN-ONI and LMIXIN-TOXTRIG, reduce FPR_ONI as well as FPR_OI by a large amount. However, both approaches also hurt in-distribution test performance, indicating that ONI and other TOXTRIG features are essential for good performance. (When we combine the bias-only model and the full model, we obtain competitive performance; see Appendix A.4.) In contrast, the models trained on the hard and ambiguous subsets from DataMaps both preserve in-distribution performance, even though they are trained on only a third of the original data. They also reduce the rate of falsely predicting NOI mentions as toxic (FPR_NOI), while not showing much improvement for ONI and maintaining the FPR_OI of the original baseline.

Surprisingly, the model trained on the easy subset from DataMaps shows good bias reduction on the NOI and ONI categories, while matching the random selection baseline for OI. This is despite DataMaps-Easy showing an increased association between TOXTRIG mentions and toxicity (Table 1). Notably, the F1 for all categories suffers under this model, indicating that it is less competent than the baseline. These results suggest that reduced associations in the data might not necessarily lead to debiased models trained on the same data. Overall, no single approach outperforms all others across different categories for lexical debiasing.

                          Test (12,893)    NOI (602)         OI (553)          ONI (3,236)
                          Acc.↑    F1↑     F1↑   FPR_NOI↓    F1↑   FPR_OI↓     F1↑   FPR_ONI↓
    Vanilla               94.21     –       –       –         –       –         –       –
    LMIXIN-ONI            89.65     –       –       –         –       –         –       –
    LMIXIN-TOXTRIG          –       –       –       –         –       –         –       –
    33% train:
      Random              94.07     –       –       –         –       –         –       –
      AFLite              93.86     –       –       –         –       –         –       –
      DataMaps-Ambig.     94.33     –       –       –         –       –         –       –
      DataMaps-Hard         –       –       –       –         –       –         –       –
      DataMaps-Easy       94.00     –       –       –         –       –         –       –

Table 2: Evaluation of lexical bias removal for all debiasing methods on the Founta et al. (2018) test set. Results show the mean and s.d. of accuracy and F1 across 3 runs, as well as F1 and false positive rate exclusively on test examples containing specific TOXTRIG mentions (NOI, OI, and ONI), along with the number of examples in each category. The lower the FPR*, the less the model infers lexical associations for toxicity. The first block shows debiased training approaches, along with the vanilla classifier, which are trained on the full dataset. The second block shows data filtering approaches, all trained on only 33% of the training data. Takeaway: While data filtering approaches achieve overall higher performance, debiased training approaches perform better on lexical bias reduction, in aggregate.

A qualitative study of the Founta et al. (2018) test set shows the presence of many annotation errors. We show three representative annotation errors in Table 3. The first example contains an atypical example of toxicity, towards white folks, which the annotators might have been unaware of. It also contains a link which annotators had access to, but not models. The second contains the word p*ss, which the annotators may have relied on for their assessment. The third encourages violence/abuse towards an identity which isn't typically the target of violence. Interestingly, the DataMaps-Easy predictions agree with all the gold-standard annotations; perhaps such annotation errors and ambiguity are responsible for the performance discussed in §4.2. These annotation ambiguities might also impair our measurement of models' performance and debiasing effects, and expose a limitation of these datasets.
Tweet                                                                             Gold   DM-Hard   DM-Easy
@user THIS? LMAOOOOO...do yourself a solid and stay out of Black people's
mentions and mind your caucasia...                                                 ¤        A         ¤
RT @user I wish I wasn't so annoying like I even piss myself off                   A        ¤         A
@user If you want to attack people, attack fundamentalists of all faiths.
Attack those who condemn 1.5bn people out of hand.                                 ¤        A         ¤

Table 3: Examples of test set tweets with their gold-standard annotations and predictions from models trained on the DataMaps-Hard (DM-Hard) and DataMaps-Easy (DM-Easy) subsets. A denotes tweets with toxic labels, and ¤ represents non-toxic labels. We anonymize the usernames to protect user privacy.

ONI-Adv
To further study the reliance of debiased models on ONI words, we use the test set from Dinan et al. (2019), which contains 3,000 manually created sentences, 300 of which are toxic. The toxic sentences barely contain any ONI mentions, e.g., Cheese is made by french people who smell. We call this test set ONI-Adv (for adversarial), since it challenges models with a reversal in the association between toxicity and offensive non-identity words (e.g., "f*ck", "sh*t").

We report F1 for all models in Figure 2, which shows how well a model identifies toxicity in offensive tweets that do not contain overtly lexical cues of toxicity. The debiased training approaches improve over the baselines; data filtering methods do not. One reason for this might be that the data filtering methods were trained on much less data than both LMIXIN models. Regardless, none of the models we test are good at predicting subtle, non-overt toxicity.

Figure 2: Challenge set evaluation for lexical biases, comparing all debiasing methods with baselines, using the ONI-Adv test set. Takeaway: F1 (↑) measures show that all models perform poorly at identifying toxic text not containing overtly lexical cues of toxicity. In general, debiased training approaches outperform the original model on this challenge set, while data filtering is not as effective.
We test the efficacy of the bias reduction methods from §3 for dialectal bias (§2.2) reduction.
For our dialectal bias experiments, we first infer the dialect of a tweet as described in §2.2. Then, analogous to the lexical bias evaluation, we quantify the dialectal debiasing using the Pearson correlation between the estimated probability of AAE and toxicity (R_AAE), and the false positive rate of models on AAE tweets (FPR_AAE). See Appendix A.3 for hyperparameters and other experimental settings.

Results in Table 4 show that almost all data filtering and debiasing methods reduce dialectal biases, with DataMaps-Easy as the exception (consistent with Table 1). Notably, DataMaps-Hard performs the best at dialectal debiasing, both in terms of toxicity-AAE correlation (R_AAE) and in terms of false flagging of toxicity (FPR_AAE). Interestingly, most models' decrease in false flagging is small, suggesting room for improvement.

                        R_AAE ↓   F1 ↑    FPR_AAE ↓
    Vanilla             0.4079    92.33       –
    LMIXIN-Dialect        -       92.26       –
    33% train:
      Random            0.4027    92.18       –
      AFLite            0.3577    91.94       –
      DataMaps-Ambig.   0.2965    92.45       –
      DataMaps-Hard       –         –         –
      DataMaps-Easy     0.5347    91.94       –
    AAE-relabeled       0.3453    91.64       –

Table 4: Dialectal bias evaluation for all debiasing methods (§5), as well as the relabeling approach (§6), on the Founta et al. (2018) test set. We report F1 and the false positive rate with respect to tweets in AAE (FPR_AAE), reflecting dialectal bias (lower is less biased), showing mean and s.d. across 3 runs. (Top block) Debiased training approaches, along with the vanilla classifier, are all trained on the full dataset. (Middle block) Random, AFLite, and DataMaps are all trained on only 33% of the training data. Takeaway: Both debiasing approaches improve performance over baselines, with DataMaps-Hard proving the most effective at debiasing. (Bottom block) AAE-relabeling results in a model which, despite following a noisy process, yields even larger improvements for dialectal debiasing.
To quantify the real-world impact of dialect-based racial bias, we measure the rates of toxicity predicted by models on a corpus of tweets for which the race of the authors is available, but not annotations of toxicity. Specifically, we consider the dataset released by Preoţiuc-Pietro and Ungar (2018), which consists of 5.4M tweets collected from 4,132 survey participants (3,184 White, 374 African American) with self-reported race/ethnicity and Twitter user handles. For efficiency, we randomly select 12k tweets from the dataset as the OOD test set. We quantify our models' racial bias by measuring the difference in rates of flagging tweets by African American authors and those by white authors, following Sap et al. (2019); note that we assume that authors from all races have the same likelihood of writing toxic language.

Listed in Table 5, our results show that automatic debiasing methods do not consistently decrease the racial discrepancy in flagging toxicity. Notably, the toxicity rates on tweets by African American authors, and the differences compared to white authors, are similar across all debiasing methods and baselines, except for DataMaps-Easy, which shows the most racial bias in toxicity flagging. Surprisingly, DataMaps-Hard, which mitigated dialectal bias the best out of all debiasing methods, also shows a high discrepancy between author races. Confirming previous results, this suggests that debiasing these systems requires more than automatic debiasing methods.
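A minimal sketch of this disparity computation, assuming a DataFrame of model predictions joined with self-reported author race; the column and group names are illustrative placeholders.

```python
import pandas as pd

def racial_disparity(df: pd.DataFrame):
    """df has a 'race' column ('white' or 'african_american') and a binary
    'pred_toxic' column with the model's prediction (illustrative names)."""
    rates = df.groupby("race")["pred_toxic"].mean() * 100  # % of tweets flagged toxic
    w_tox = rates["white"]
    aa_tox = rates["african_american"]
    return {"W-Tox.": w_tox, "AA-Tox.": aa_tox,
            "Delta": aa_tox - w_tox, "AA/W": aa_tox / w_tox}
```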
Based on our quantitative and qualitative analyses, we believe there is still room for improvement in debiasing hate speech detection. Therefore, we turn our attention to the role of label noise in datasets. Partly inspired by our qualitative analyses of debiased models' predictions, we design a proof-of-concept study in which we automatically correct the labels of tweets using a(n automatic) dialectal translation of the tweet, inspired by previous work showing that highlighting AAE tweets' dialect led them to be labeled as less toxic (Sap et al., 2019). We conclude this study by discussing the limitations and ethical implications of the synthetic data, and cautioning against its real-world application.
                        W-Tox.   AA-Tox.    ∆ ↓    AA/W ↓
    Original             7.24     12.61     5.37    1.74
    LMIXIN-Dialect       7.50     12.55     5.06    1.67
    33% train:
      Random             8.28     13.24     4.96    1.60
      AFLite             7.32     11.64     4.33    1.59
      DataMaps-Ambig.    6.75     12.17     5.42    1.80
      DataMaps-Hard      6.36     11.67     5.31    1.84
      DataMaps-Easy      8.46     16.30     7.83    1.94
    AAE-relabeled        6.93     10.60     3.67    1.53

Table 5: Racial disparity in toxicity prediction reported on Preoţiuc-Pietro and Ungar (2018). W-Tox. indicates the % of white users' tweets flagged as toxic, AA-Tox. indicates the % of African American users' tweets flagged as toxic, ∆ refers to the difference between AA-Tox. and W-Tox., and AA/W refers to the ratio between AA-Tox. and W-Tox. Takeaway: Methods generally fail to debias on this OOD test set, except the relabeling approach, which shows some benefit.
Focusing on dialectal bias, our key assumption is that an AAE tweet and its corresponding WAE version should have the same toxicity label; therefore, toxic AAE tweets whose WAE versions are non-toxic are candidates for label correction. (Note that this assumption does not hold for lexical items, because substituting lexical items, e.g., swapping a minority mention for a majority mention, would drastically change the denotational meaning of the sentence.) However, gold-standard translations of AAE to WAE would require qualified translators, and automatic AAE-to-WAE translation systems do not exist, to the best of our knowledge. Therefore, we create a proof-of-concept study: we set up an AAE-to-WAE "translation" system using the few-shot capabilities of the GPT-3 language model (Brown et al., 2020). Under this mechanism, we prompt GPT-3 with four translation pairs (taken from Spears, 1998) and an AAE tweet from our training data, and generate its WAE "translation". The list of prompts, as well as further details, are provided in Appendix C. Note that we do not recommend this approach for building large-scale parallel data for dialects, as discussed under ethical implications and limitations.

Next, as per our heuristic, we only relabel toxic AAE tweets whose WAE translation is predicted as non-toxic by either our vanilla classifier trained on the original Founta et al. (2018) dataset, or an identical classifier trained on the WAE-translated tweets. The resulting dataset (AAE-relabeled) is the same size as the original dataset, but with 954 (12%) out of 8,260 toxic AAE tweets relabeled as non-toxic (examples in Table 6). To assess the validity of the relabeling, the first three authors manually annotated the toxicity of 50 randomly selected relabeled tweets; on average, the authors agreed with 84% of the relabeling decisions.
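The heuristic can be summarized by the following sketch; the translation function and the two classifiers are placeholders standing in for the GPT-3 translator and the toxicity classifiers described above, not the paper's actual interfaces.

```python
def relabel_aae_tweets(dataset, translate_to_wae, clf_original, clf_translated):
    """dataset: list of dicts with 'text', 'label' ('toxic'/'non-toxic'), and
    'is_aae' fields; translate_to_wae maps AAE text to a WAE "translation";
    clf_original / clf_translated map a string to a predicted label.
    All names here are illustrative placeholders."""
    relabeled = []
    for ex in dataset:
        new_label = ex["label"]
        if ex["is_aae"] and ex["label"] == "toxic":
            wae_text = translate_to_wae(ex["text"])
            # Relabel if either classifier finds the WAE version non-toxic.
            if (clf_original(wae_text) == "non-toxic"
                    or clf_translated(wae_text) == "non-toxic"):
                new_label = "non-toxic"
        relabeled.append({**ex, "label": new_label})
    return relabeled
```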
Then, we evaluate the dialectal bias of AAE-relabeled and quantify the dialect and racial prediction biases of a RoBERTa-large classifier trained on AAE-relabeled, following §5. As shown in the last row of Table 4, this relabeling scheme decreases dialectal bias more than any other debiasing method, specifically as measured by the FPR on AAE tweets, with a one-point drop in F1 score. The F1 scores on the "gold" test data (Table 4) are not fully reliable, as the test data contain label biases and better performance could come from exploiting these biases. As shown in Table 5, the model trained on AAE-relabeled has the lowest racial disparity in toxicity flagging rates compared to all other methods.

These results highlight that debiasing methods are much less effective at mitigating dialectal dataset biases compared to data relabeling. For future investigations, we recommend obtaining human-written AAE-WAE pairs (e.g., as done by Groenwold et al., 2020). Additionally, to ensure less biased toxicity labeling, we recommend recruiting AAE speakers or experts to avoid over-identification of AAE markers as toxic (Spears, 1998; Croom, 2013). Alternatively, we recommend exploring more holistic representations of social biases or toxicity (e.g., Social Bias Frames; Sap et al., 2020).

AAE tweet                                           GPT-3 WAE translation                              Gold   New
RT @user I can't stand a bad texter bruh like       RT @user I can't stand a bad texter bro like        A      ¤
don't be mad if I forget about yo ass               don't be mad if I forget about you
RT @user Retweet if you fuck with this!!!!          RT @user Retweet if you like this!                   A      ¤
RT @user That nigga needs anger management          RT @user That guy needs anger management             A      ¤
RT @user oh fucking hell take a day off man         RT @user oh fuck take a day off man                  A      A

Table 6: Examples of AAE tweets with their GPT-3-based WAE translations, and original gold-standard and new annotations based on AAE-relabeled. For the first three tweets, the (biased) gold labels are changed by models predicting the new labels on their WAE translations. A indicates presence of toxicity, and ¤ represents non-toxic. We anonymize the usernames to protect user privacy.
Ethical Implications & Limitations
The above synthetic setting is meant to illustrate the role of labeling quality on biases in annotations. We strongly caution against using this approach in real-world applications, such as building parallel datasets for dialects. First, due to how its training data was selected, GPT-3 has likely not been exposed to many African American English varieties during training (Jo and Gebru, 2020). Second, pretrained language models are known to generate toxic language at non-trivial rates (Gehman et al., 2020), which could cause differential toxicity in the translations.
Debiasing Toxicity Detection
As the popularity of hate speech and toxic language detection systems has grown, several biases have been found in datasets and models, spurring various debiasing efforts to mitigate these individual biases (e.g., gender bias, racial bias; Park et al., 2018; Sap et al., 2019; Davidson et al., 2019). Some work tackles identity-based biases, e.g., using data re-balancing (Dixon et al., 2018) or adversarial feature learning (Vaidya et al., 2019). Less work has tackled racial or dialectal bias. Notably, Xia et al. (2020) use adversarial training to prevent the model from associating toxicity with AAE, showing only small improvements in fairness. Based on those results, we do not explore adversarial methods, opting instead for ensemble-based methods of predefined bias reduction. In contemporary work, Mozafari et al. (2020) use a re-weighting mechanism, which shows some effect in debiasing racial bias; we leave evaluating this method in our setting for future work. In contrast to all previous work, our experiments also measure the effectiveness of bias-agnostic methods.
Other General Debiasing Methods
Several approaches for debiasing NLU tasks have been proposed lately. Some approaches rely on adversarial training to remove protected attributes (e.g., gender or race) from a model's internal representations (Zhang et al., 2018; Wang et al., 2019; Xia et al., 2020). Other approaches include confidence regularization (Utama et al., 2020), as well as other product-of-experts approaches (He et al., 2019; Karimi Mahabadi et al., 2020) similar to the debiased training approach of Clark et al. (2019), which is the only debiased training method we employ, due to its relatively strong performance.
We investigate whether toxic language detection systems can be debiased using recently introduced methods for debiasing text classification in NLU tasks. Focusing on two types of biases, lexical and dialectal, our experiments show that these methods face significant challenges in reducing the biased behavior of toxicity detectors. This indicates that biases in toxic language detection might be different in nature compared to the spurious associations studied in typical NLU settings. We also studied a synthetic scheme for relabeling examples with potential dialectal biases; our results indicate that correcting noisy labels results in better bias reduction. Our findings suggest that, instead of relying solely on the development of automatic debiasing methods for existing, imperfect datasets, future work should focus primarily on the quality of the underlying data for hate speech detection, for example by accounting for speaker identity and dialect. Indeed, such efforts could act as an important step towards making these systems less discriminatory, and hence safer and more usable.
Acknowledgments
We thank the anonymous reviewers and Laura Vianna for helpful comments on this work. This research was supported in part by NSF grants 1813153 and 1714566.
References
Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (technology) is power: A critical survey of "bias" in NLP. In Proc. of ACL.
Su Lin Blodgett, Lisa Green, and Brendan O'Connor. 2016. Demographic dialectal variation in social media: A case study of African-American English. In Proc. of EMNLP.
Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proc. of EMNLP.
Ronan Le Bras, Swabha Swayamdipta, Chandra Bhagavatula, Rowan Zellers, Matthew Peters, Ashish Sabharwal, and Yejin Choi. 2020. Adversarial filters of dataset biases. In Proc. of ICML.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Proc. of NeurIPS.
Christopher Clark, Mark Yatskar, and Luke Zettlemoyer. 2019. Don't take the easy way out: Ensemble based methods for avoiding known dataset biases. In Proc. of EMNLP.
Adam M. Croom. 2013. How to do things with slurs: Studies in the way of derogatory words. Language & Communication.
Thomas Davidson, Debasmita Bhattacharya, and Ingmar Weber. 2019. Racial bias in hate speech and abusive language detection datasets. In Abusive Language Workshop (at ACL).
Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated hate speech detection and the problem of offensive language. In Proceedings of the International AAAI Conference on Web and Social Media.
Thiago Dias Oliva, Dennys Marcelo Antonialli, and Alessandra Gomes. 2020. Fighting hate speech, silencing drag queens? Artificial intelligence in content moderation and risks to LGBTQ voices online. Sexuality & Culture.
Emily Dinan, Samuel Humeau, Bharath Chintagunta, and Jason Weston. 2019. Build it break it fix it for dialogue safety: Robustness from adversarial human attack. In Proc. of EMNLP.
Lucas Dixon, John Li, Jeffrey Scott Sorensen, Nithum Thain, and L. Vasserman. 2018. Measuring and mitigating unintended bias in text classification. In Proc. of AIES.
Marta Dynel. 2012. Swearing methodologically: The (im)politeness of expletives in anonymous commentaries on YouTube. Journal of English Studies.
Marta Dynel. 2015. The landscape of impoliteness research. Journal of Politeness Research.
Antigoni-Maria Founta, Constantinos Djouvas, Despoina Chatzakou, Ilias Leontiadis, Jeremy Blackburn, Gianluca Stringhini, Athena Vakali, Michael Sirivianos, and Nicolas Kourtellis. 2018. Large scale crowdsourcing and characterization of Twitter abusive behavior. In Proc. of ICWSM.
Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. 2020. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Findings of EMNLP.
Lisa Green. 2002. African American English: A Linguistic Introduction. Cambridge University Press.
Sophie Groenwold, Lily Ou, Aesha Parekh, Samhita Honnavalli, Sharon Levy, Diba Mirza, and William Yang Wang. 2020. Investigating African-American vernacular English in transformer-based text generation. In Proc. of EMNLP.
Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proc. of NAACL.
Jessica Guynn. 2020. What civil rights groups want from Facebook boycott: Stop hate speech and harassment of Black users.
Moritz Hardt, Eric Price, and Nati Srebro. 2016. Equality of opportunity in supervised learning. In Proc. of NeurIPS.
He He, Sheng Zha, and Haohan Wang. 2019. Unlearn dataset bias in natural language inference by fitting the residual. In EMNLP Workshop on Deep Learning Approaches for Low-Resource NLP.
Eun Seo Jo and Timnit Gebru. 2020. Lessons from archives: Strategies for collecting sociocultural data in machine learning. In Proc. of FAT.
Rabeeh Karimi Mahabadi, Yonatan Belinkov, and James Henderson. 2020. End-to-end bias mitigation by modelling biases in corpora. In Proc. of ACL.
Gabriele Kasper. 1990. Linguistic politeness: Current research issues. Journal of Pragmatics. Elsevier.
Jae Yeon Kim, Carlos Ortiz, Sarah Nam, Sarah Santiago, and Vivek Datta. 2020. Intersectional bias in hate speech and abusive language datasets.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Marzieh Mozafari, Reza Farahbakhsh, and Noël Crespi. 2020. Hate speech detection and racial bias mitigation in social media based on BERT model. PLOS ONE. Public Library of Science.
Ji Ho Park, Jamin Shin, and Pascale Fung. 2018. Reducing gender bias in abusive language detection. In Proc. of EMNLP.
Daniel Preoţiuc-Pietro and Lyle Ungar. 2018. User-level race and ethnicity predictors from Twitter text. In Proc. of COLING.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proc. of EMNLP, pages 2383–2392.
Sarah T. Roberts. 2019. Behind the Screen: Content Moderation in the Shadows of Social Media. Yale University Press.
Jonathan Rosa. 2019. Looking like a Language, Sounding like a Race. Oxford University Press.
Jonathan Rosa and Nelson Flores. 2017. Unsettling race and language: Toward a raciolinguistic perspective. Language in Society. Cambridge University Press.
Björn Ross, Michael Rist, Guillermo Carbonell, Benjamin Cabrera, Nils Kurowsky, and Michael Wojatzki. 2017. Measuring the reliability of hate speech annotations: The case of the European refugee crisis. In NLP4CMC Workshop.
Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. 2019. The risk of racial bias in hate speech detection. In Proc. of ACL.
Maarten Sap, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A. Smith, and Yejin Choi. 2020. Social bias frames: Reasoning about social and power implications of language. In Proc. of ACL.
Roy Schwartz, Maarten Sap, Ioannis Konstas, Li Zilles, Yejin Choi, and Noah A. Smith. 2017. The effect of different writing tasks on linguistic style: A case study of the ROC story cloze task. In Proc. of CoNLL.
Arthur K. Spears. 1998. African-American language use: Ideology and so-called obscenity. In African-American English: Structure, History and Use. Routledge, New York.
Swabha Swayamdipta, Roy Schwartz, Nicholas Lourie, Yizhong Wang, Hannaneh Hajishirzi, Noah A. Smith, and Yejin Choi. 2020. Dataset cartography: Mapping and diagnosing datasets with training dynamics. In Proc. of EMNLP.
Björn Technau. 2018. Going beyond hate speech: The pragmatics of ethnic slur terms. Lodz Papers in Pragmatics, 14(1):25–43.
Prasetya Ajie Utama, Nafise Sadat Moosavi, and Iryna Gurevych. 2020. Mind the trade-off: Debiasing NLU models without degrading the in-distribution performance. In Proc. of ACL.
Ameya Vaidya, Feng Mai, and Yue Ning. 2019. Empirical analysis of multi-task learning for reducing model bias in toxic comment detection. In Proc. of ICWSM.
Bertie Vidgen, Helen Margetts, and Alex Harris. 2019. How much online abuse is there? Alan Turing Institute.
Tianlu Wang, Jieyu Zhao, Mark Yatskar, Kai-Wei Chang, and V. Ordonez. 2019. Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations. In Proc. of ICCV.
Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proc. of NAACL.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing.
Mengzhou Xia, Anjalie Field, and Yulia Tsvetkov. 2020. Demoting racial bias in hate speech detection. In Proc. of SocialNLP.
Danyaal Yasin. 2018. Black and banned: Who is free speech for?
Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. 2018. Mitigating unwanted biases with adversarial learning. In Proc. of AIES. Association for Computing Machinery.

Appendix

A Further Details for Models
A.1 Model Debiasing
The LEARNED-MIXIN ensemble allows the model to explicitly determine how much to trust the bias given the input:

    $\hat{p}_i = \mathrm{softmax}\{\log(p_i) + g(x_i) \log b_i\}$

where $x_i$ is the $i$-th input text, $p_i$ and $b_i$ are the toxicity predictions produced by RoBERTa and the bias-only model, respectively, and $g$ is a parametric function defined as $\mathrm{softplus}(w \cdot h_i)$, where $w$ is a learned vector, $h_i$ is the last hidden layer of the model for example $x_i$, and $\mathrm{softplus}(x) = \log(1 + \exp x)$. To prevent the LEARNED-MIXIN ensemble from ignoring $b_i$, Clark et al. (2019) add an entropy penalty ($H$) to the loss:

    $R = \alpha H(\mathrm{softmax}\{g(x_i) \log b_i\})$

where $H(z) = -\sum_j z_j \log z_j$ is the entropy and $\alpha$ is a hyperparameter.
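A minimal PyTorch sketch of these equations (not the authors' implementation; the interfaces and the default α value are placeholders):

```python
import torch
import torch.nn.functional as F
from torch import nn

class LearnedMixin(nn.Module):
    """Sketch of the LEARNED-MIXIN ensemble above (Clark et al., 2019):
    p_hat = softmax(log p + g(x) * log b), with g = softplus(w . h) and an
    entropy penalty R = alpha * H(softmax(g(x) * log b))."""

    def __init__(self, hidden_size: int, alpha: float = 1.0):
        super().__init__()
        self.w = nn.Linear(hidden_size, 1, bias=False)  # implements w . h_i
        self.alpha = alpha  # entropy-penalty weight; a hyperparameter

    def forward(self, main_log_probs, bias_log_probs, hidden):
        # main_log_probs, bias_log_probs: (batch, n_classes) log-probabilities
        # hidden: (batch, hidden_size) last hidden state of the full model
        g = F.softplus(self.w(hidden))                    # (batch, 1), >= 0
        ensemble_logits = main_log_probs + g * bias_log_probs
        scaled_bias = F.softmax(g * bias_log_probs, dim=-1)
        entropy = -(scaled_bias * torch.log(scaled_bias + 1e-12)).sum(-1).mean()
        return ensemble_logits, self.alpha * entropy

# Training loss (sketch): cross-entropy on the ensemble plus the entropy penalty.
# loss = F.cross_entropy(ensemble_logits, labels) + penalty
```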
A.2 Data Filtering

For the data filtering methods, we first filter the data to 50% of the original data as in Swayamdipta et al. (2020). Then we further downsample the dataset to 33% of the original data so that each training set has the same toxic ratio as the original training set. This step avoids confounding our results with different toxic ratios among different training sets.
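A sketch of this ratio-controlled downsampling step, assuming binary label arrays; the exact sampling procedure and seeds used in the paper may differ.

```python
import numpy as np

def downsample_to_original_ratio(subset_labels, original_labels, frac=0.33, seed=0):
    """Downsample a filtered subset to `frac` of the original training-set size
    while matching the original toxic/non-toxic ratio. Returns indices into the
    filtered subset."""
    rng = np.random.default_rng(seed)
    target_total = int(frac * len(original_labels))
    keep = []
    for label in (0, 1):
        n_label = int(round((original_labels == label).mean() * target_total))
        candidates = np.where(subset_labels == label)[0]
        keep.append(rng.choice(candidates, size=min(n_label, len(candidates)),
                               replace=False))
    return np.concatenate(keep)
```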
A.3 Training Settings
For all experiments, we fine-tune RoBERTa-large (Liu et al., 2019) over the corresponding corpus with one GTX 2080 Ti. We use the default hyperparameters as provided in the HuggingFace Transformers library (Wolf et al., 2019), with two major changes: we use a learning rate of − and a batch size of 8 in all experiments.
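For reference, a fine-tuning setup along these lines with the HuggingFace Trainer might look as follows; the learning-rate value is lost in the extracted text, so the number below (and the epoch count) are placeholders rather than the paper's settings.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large", num_labels=2)

args = TrainingArguments(
    output_dir="toxicity-roberta-large",
    per_device_train_batch_size=8,   # batch size 8, as stated above
    learning_rate=1e-5,              # placeholder; the paper's value is elided here
    num_train_epochs=3,              # assumption; not specified in this appendix
)

# train_dataset / eval_dataset are assumed to be tokenized toxicity datasets.
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()
```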
A.4 Prediction Combining with Bias-only Model

To rule out the possibility that our LMIXIN-TOXTRIG/ONI models are simply not well trained, thus resulting in the decrease of the models' in-distribution performance, we use the joint prediction from the main and bias-only models to infer the in-distribution test set; they obtain 94.15% and 94.17% accuracy, respectively. This is competitive performance, as shown in Table 2.
B Alternative Dataset of Toxic Language
Davidson et al. (2017) collected data from Twitter, starting with 1,000 terms from HateBase (an online database of hate speech terms) as seeds, a collection process that relies on lexical biases. We find that performing data filtering over this dataset leads to degenerate behaviour. Specifically, as shown in Table 7, the easy region demonstrates the least spurious correlation due to its heavily skewed class distribution, which further prevents us from downsampling to control the toxic ratio. We also train LMIXIN-TOXTRIG and LMIXIN-Dialect over the dataset. Table 8 shows that the FPRs of the debiased model increase instead, except for the OI category, and Table 9's results behave in line with Table 4.
C Few-shot AAE-to-WAE Translation
Note that we do not recommend the following approach for building large-scale parallel data for dialects, as discussed under ethical implications and limitations (§6).
We use GPT-3 (Brown et al., 2020) to create a few-shot AAE-to-WAE translation system, using the following set of example translation pairs drawn from Spears (1998):

AAE: Get your triflin' ass out of here.
WAE: Get your trifling self out of here.
AAE: I saw his ass yesterday.
WAE: I saw him yesterday.
AAE: His ass is gonna get fried.
WAE: He is gonna get fried.
AAE: Wassup, nigga?
WAE: What's up bro?
AAE: ⟨tweet⟩
WAE:

Note that Spears (1998) refers to WAE as White language varieties, and deals with English prevalent in the United States. We prepend the formatted example pairs to each
AAE tweet in our training data, and generate the translation from GPT-3 using top-0.95 nucleus sampling with a temperature of 0.5. Prompts, formatting, and generation parameters were chosen based on manual inspection of the output.
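A sketch of the prompt construction, using the example pairs listed above; the actual call to GPT-3 (with temperature 0.5 and top-p 0.95) is left as a placeholder since it depends on the API client.

```python
SPEARS_PAIRS = [
    ("Get your triflin' ass out of here.", "Get your trifling self out of here."),
    ("I saw his ass yesterday.", "I saw him yesterday."),
    ("His ass is gonna get fried.", "He is gonna get fried."),
    ("Wassup, nigga?", "What's up bro?"),
]

def build_translation_prompt(tweet: str) -> str:
    """Prepend the formatted AAE/WAE example pairs to the tweet to translate."""
    lines = []
    for aae, wae in SPEARS_PAIRS:
        lines.append(f"AAE: {aae}")
        lines.append(f"WAE: {wae}")
    lines.append(f"AAE: {tweet}")
    lines.append("WAE:")
    return "\n".join(lines)

prompt = build_translation_prompt("<AAE tweet from the training data>")
# translation = gpt3_complete(prompt, temperature=0.5, top_p=0.95, stop="\n")
#   ^ hypothetical client call; the generation settings follow the text above.
```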
                  Toxic Ratio   R_NOI ↓   R_OI ↓   R_ONI ↓   R_AAE ↓
    Original †        –            –        –         –         –

Table 7: Lexical and dialectal associations between toxicity in the original dataset (Davidson et al., 2017) and various filtered counterparts. Random, AFLite, and DataMaps all contain only 50% of the original data after filtering. (We could not perform downsampling on these datasets due to their heavily skewed label distribution.) Lower Pearson R correlation values indicate fewer superficial patterns in the dataset, and thus less bias. The easy subset gives the best results here due to its severely imbalanced label distribution.

                      Test              NOI                 OI                  ONI
                      Acc.↑    F1↑      F1↑   FPR_NOI↓      F1↑   FPR_OI↓       F1↑   FPR_ONI↓
    Original          96.37    97.81    96.42    25.00      99.86    57.14      99.57    63.64
    LMIXIN-TOXTRIG      –        –        –        –          –        –          –        –

Table 8: Lexical bias removal evaluation for debiasing methods. Original refers to the model trained over the full training set. The test set is further categorized into tweets that contain relevant TOXTRIG words. F1 indicates the models' performance, while the false positive rate (FPR*) reflects the models' bias; the lower the FPR*, the less biased the model tends to be.

    Debiasing Method     R_AAE     Acc. ↑   F1 ↑    FPR_AAE ↓
    Original             0.4079    96.37    97.81     24.76
    LMIXIN-Dialect         -       96.48    97.88     22.86

Table 9: Dialectal bias evaluation for all debiasing methods, on both the in-distribution test set as well as out-of-distribution dialect and race priming test sets. In addition to accuracy and F1, we report the false positive rate with respect to tweets in AAE (FPR_AAE).