Debiasing Pre-trained Contextualised Embeddings
Masahiro Kaneko
Tokyo Metropolitan University [email protected]
Danushka Bollegala ∗ University of Liverpool, Amazon [email protected]
Abstract
In comparison to the numerous debiasing methods proposed for static non-contextualised word embeddings, the discriminative biases in contextualised embeddings have received relatively little attention. We propose a fine-tuning method that can be applied at token- or sentence-levels to debias pre-trained contextualised embeddings. Our proposed method can be applied to any pre-trained contextualised embedding model, without requiring to retrain those models. Using gender bias as an illustrative example, we then conduct a systematic study using several state-of-the-art (SoTA) contextualised representations on multiple benchmark datasets to evaluate the level of biases encoded in different contextualised embeddings before and after debiasing using the proposed method. We find that applying token-level debiasing for all tokens and across all layers of a contextualised embedding model produces the best performance. Interestingly, we observe that there is a trade-off between creating an accurate vs. unbiased contextualised embedding model, and different contextualised embedding models respond differently to this trade-off.
∗ Danushka Bollegala holds concurrent appointments as a Professor at the University of Liverpool and as an Amazon Scholar. This paper describes work performed at the University of Liverpool and is not associated with Amazon.

1 Introduction

Contextualised word embeddings have significantly improved performance in numerous natural language processing (NLP) applications (Devlin et al., 2019; Liu et al., 2019; Clark et al., 2020) and have been established as the de facto standard for input text representations. Compared to static word embeddings (Pennington et al., 2014; Mikolov et al., 2013), which represent a word by a single vector in all contexts it occurs in, contextualised embeddings use dynamic, context-dependent vectors for representing a word in a specific context. Unfortunately, however, it has been shown that, similar to their non-contextual counterparts, contextualised text embeddings also encode various types of unfair biases (Zhao et al., 2019; Bordia and Bowman, 2019; May et al., 2019; Tan and Celis, 2019; Bommasani et al., 2020; Kurita et al., 2019). This is a worrying situation because such biases can easily propagate to the downstream NLP applications that use contextualised text embeddings.

Different types of unfair and discriminative biases, such as gender, racial and religious biases, have been observed in static word embeddings (Bolukbasi et al., 2016; Zhao et al., 2018a; Rudinger et al., 2018; Zhao et al., 2018b; Elazar and Goldberg, 2018; Kaneko and Bollegala, 2019). As discussed later in § 2, different methods have been proposed for debiasing static word embeddings, such as projection-based methods (Kaneko and Bollegala, 2019; Zhao et al., 2018b; Bolukbasi et al., 2016; Ravfogel et al., 2020) and adversarial methods (Xie et al., 2017; Gonen and Goldberg, 2019). In contrast, despite multiple studies reporting that contextualised embeddings are unfairly biased, methods for debiasing contextualised embeddings remain relatively under-explored (Dev et al., 2020; Nadeem et al., 2020; Nangia et al., 2020). Compared to static word embeddings, debiasing contextualised embeddings is significantly more challenging for several reasons, as we discuss next.

First, compared to static word embedding models, where the semantic representation of a word is limited to a single vector, contextualised embedding models have a significantly larger number of parameters related in complex ways. For example, the BERT-large model (Devlin et al., 2019) contains 24 layers, 16 attention heads and 340M parameters. Therefore, it is not obvious which parameters are responsible for the unfair biases related to a particular word. For this reason, projection-based methods, popularly used for debiasing pre-trained static word embeddings, cannot be directly applied to debias pre-trained contextualised word embeddings.

Second, in the case of contextualised embeddings, the biases associated with a particular word's representation are a function of both the target word itself and the context in which it occurs. Therefore, the same word can show unfair biases in some contexts and not in others. It is important to consider the words that co-occur with the target word in different contexts when debiasing a contextualised embedding model.

Third, pre-training large-scale contextualised embeddings from scratch is time consuming and requires specialised hardware such as GPU/TPU clusters.
On the other hand, fine-tuning a pre-trained contextualised embedding model for a particular task (possibly using labelled data for the target task) is relatively less expensive. Consequently, the standard practice in the NLP community has been to share pre-trained contextualised embedding models and fine-tune them as needed. Therefore, it is desirable that a debiasing method proposed for contextualised embedding models can be applied as a fine-tuning method. In this view, counterfactual data augmentation methods (Zmigrod et al., 2019; Hall Maudslay et al., 2019; Zhao et al., 2019) that swap gender pronouns in the training corpus for creating a gender-balanced version of the training data are less attractive when debiasing contextualised embeddings, because we must retrain those models on the balanced corpora, which is more expensive compared to fine-tuning.

Using gender bias as a running example, we address the above-mentioned challenges by proposing a debiasing method that fine-tunes pre-trained contextualised word embeddings. Our proposed method retains the semantic information learnt by the contextualised embedding model with respect to gender-related words, while simultaneously removing any stereotypical biases in the pre-trained model. In particular, our proposed method is agnostic to the internal architecture of the contextualised embedding method, and we apply it to debias different pre-trained embeddings such as BERT, RoBERTa (Liu et al., 2019), ALBERT (Lan et al., 2020), DistilBERT (Sanh et al., 2019) and ELECTRA (Clark et al., 2020) (https://huggingface.co/transformers/pretrained_models.html). We evaluate the debiased embeddings using the Sentence Encoder Association Test (SEAT; May et al., 2019) and Multi-genre co-reference-based Natural Language Inference (MNLI; Dev et al., 2020). Experimental results show that the proposed method not only debiases all contextualised word embedding models compared, but also preserves useful semantic information for solving downstream tasks such as sentiment classification (Socher et al., 2013), paraphrase detection (Dolan and Brockett, 2005), semantic textual similarity measurement (Cer et al., 2017), natural language inference (Dagan et al., 2005; Bar-Haim et al., 2006) and solving Winograd schemas (Levesque et al., 2012). We consider gender bias as a running example throughout this paper and evaluate the proposed method with respect to its ability to overcome gender bias in contextualised word embeddings, and defer extensions to other types of biases to future work. Code and debiased embeddings are available at https://github.com/kanekomasahiro/context-debias.
2 Related Work

Prior work on debiasing word embeddings can be broadly categorised into two groups depending on whether they consider static or contextualised word embeddings. Although we focus on contextualised embeddings in this paper, we first briefly describe prior work on debiasing static embeddings for completeness of the discussion.
Bias in Static Word Embeddings:
Bolukbasi et al. (2016) proposed a post-processing approach that projects gender-neutral words into a subspace which is orthogonal to the gender direction defined by a list of gender-definitional words. However, their method ignores gender-definitional words during the subsequent debiasing process, and focuses only on words that are not predicted as gender-definitional by a classifier. Therefore, if the classifier erroneously predicts a stereotypical word as gender-definitional, it would not get debiased. Zhao et al. (2018b) modified the original GloVe (Pennington et al., 2014) objective to learn gender-neutral word embeddings (GN-GloVe) from a given corpus. Unlike the above-mentioned methods, Kaneko and Bollegala (2019) proposed GP-GloVe, a post-processing method that preserves gender-related information with an autoencoder (Kaneko and Bollegala, 2020), while removing discriminatory biases from stereotypical cases.

Adversarial learning methods (Xie et al., 2017; Elazar and Goldberg, 2018; Li et al., 2018) for debiasing first encode the inputs, and then jointly train two classifiers: one predicting the target task (for which we must ensure high prediction accuracy) and the other predicting the protected attributes (which must not be easily predictable). Elazar and Goldberg (2018) showed that although it is possible to obtain chance-level development-set accuracy for the protected attributes during training, a post-hoc classifier trained on the encoded inputs can still manage to reach substantially high accuracies for the protected attributes. They conclude that adversarial learning alone does not guarantee invariant representations for the protected attributes. Ravfogel et al. (2020) found that iteratively projecting word embeddings to the null space of the gender direction further improves the debiasing performance.
Benchmarks for Biases in Static Embeddings:
Word Embedding Association Test (WEAT; Caliskan et al., 2017) quantifies various biases (e.g. gender, race and age) using semantic similarities between word embeddings. Word Association Test (WAT) measures gender bias over a large set of words (Du et al., 2019) by calculating the gender information vector for each word in a word association graph created in the Small World of Words project (SWOWEN; Deyne et al., 2019), propagating masculine and feminine words via a random walk (Zhou et al., 2003). The SemBias dataset (Zhao et al., 2018b) contains three types of word-pairs: (a) Definition, a gender-definition word pair (e.g. hero – heroine), (b) Stereotype, a gender-stereotype word pair (e.g. manager – secretary) and (c) None, two other word-pairs with similar meanings unrelated to gender (e.g. jazz – blues, pencil – pen). It uses the cosine similarity between the gender directional vector, (he − she), and the offset vector (a − b) for each word pair (a, b) in each set to measure gender bias. WinoBias (Zhao et al., 2018a) uses the ability to predict gender pronouns with equal probabilities for gender-neutral nouns such as occupations as a test for the gender bias in embeddings.
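As a concrete illustration, the SemBias measure described above reduces to a single cosine computation per word pair. The following is a minimal sketch, assuming a dict-like `emb` mapping words to static embedding vectors; the function name and interface are our own, not from the SemBias release.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def sembias_score(emb, a, b):
    """Cosine between the gender direction (he - she) and the offset a - b."""
    gender_dir = emb["he"] - emb["she"]
    return cosine(gender_dir, emb[a] - emb[b])
```

Within each SemBias instance, an unbiased embedding should assign the highest score to the Definition pair rather than to the Stereotype or None pairs.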
Bias in Contextualised Word Embeddings:

May et al. (2019) extended WEAT using templates to create a sentence-level benchmark for evaluating bias called SEAT. In addition to the attributes proposed in WEAT, they proposed two additional bias types: angry black woman and double binds (when a woman is doing a role that is typically done by a man, that woman is seen as arrogant). They show that, compared to static embeddings, contextualised embeddings such as BERT, GPT and ELMo are less biased. However, similar to WEAT, SEAT also only has positive predictive ability and cannot detect the absence of a bias. Bommasani et al. (2020) evaluated the bias in contextualised embeddings by first distilling static embeddings from contextualised embeddings and then using WEAT tests for different types of biases such as gender (male, female), race (White, Hispanic, Asian) and religion (Christianity, Islam). They found that aggregating the contextualised embeddings of a particular word in different contexts via averaging is the best method for creating a static embedding from a contextualised embedding.

Zhao et al. (2019) showed that contextualised ELMo embeddings also learn gender biases present in the training corpus. Moreover, these biases propagate to a downstream coreference resolution task. They showed that data augmentation by swapping gender helps more than neutralisation by a projection. They obtain the embeddings of two input sentences with reversed gender from ELMo, and obtain the debiased embedding by averaging them. This approach can only be applied to feature-based embeddings, so it cannot be applied to fine-tuning based embeddings like BERT; in contrast, we directly debias the contextual embeddings. Additionally, data augmentation requires re-training of the embeddings, which is often costly compared to fine-tuning. Kurita et al. (2019) created masked templates such as "___ is a nurse" and used BERT to predict the masked gender pronouns. They used the log-odds between male and female pronoun predictions as an evaluation measure and showed BERT to be biased according to it. Karve et al. (2019) learnt conceptor matrices using class definitions in WEAT and used the negated conceptors to debias ELMo and BERT. Although their method was effective for ELMo, the results on BERT were mixed. This method can only be applied to context-independent vectors, and it requires the creation of static embeddings from BERT and ELMo as a pre-processing step for debiasing the context-dependent vectors. Therefore, we do not compare against this method in the present study, where we evaluate on context-dependent vectors.

Dev et al. (2020) used natural language inference (NLI) as a bias evaluation task, where the goal is to ascertain if one sentence (i.e. premise) entails or contradicts another (i.e. hypothesis), or if neither conclusion holds (i.e. neutral). The premise-hypothesis pairs are constructed to elicit various types of discriminative biases. They showed that orthogonal projection to the gender direction (Dev and Phillips, 2019) can be used to debias contextualised embeddings as well. However, their method can be applied only to the noncontextualised layers (ELMo's Layer 1 and BERT's subtoken layer). In contrast, our proposed method can be applied to all layers in a contextualised embedding and outperforms their method on the same NLI task. Moreover, our debiasing approach does not require task-dependent data.
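Both WEAT and its sentence-level extension SEAT, used in our evaluations, reduce to the same effect-size computation over embedding similarities. Below is a minimal NumPy sketch following the formulation of Caliskan et al. (2017); the function names are ours, and the inputs are assumed to be precomputed word vectors (or, for SEAT, sentence-template vectors).

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def association(w, A, B):
    """s(w, A, B): mean similarity of w to attribute set A minus to set B."""
    return (np.mean([cosine(w, a) for a in A])
            - np.mean([cosine(w, b) for b in B]))

def weat_effect_size(X, Y, A, B):
    """WEAT/SEAT effect size for target sets X, Y and attribute sets A, B
    (e.g. career vs. family terms against male vs. female terms).
    All arguments are lists of embedding vectors."""
    s_X = [association(x, A, B) for x in X]
    s_Y = [association(y, A, B) for y in Y]
    pooled = np.std(s_X + s_Y, ddof=1)  # std over all target associations
    return (np.mean(s_X) - np.mean(s_Y)) / pooled
```

SEAT additionally assesses whether an effect size is significant using a permutation test over the target sets, which is what the significance markers in our result tables refer to.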
3 Proposed Method

We propose a method for debiasing pre-trained contextualised word embeddings in a fine-tuning setting that simultaneously (a) preserves the semantic information in the pre-trained contextualised word embedding model, and (b) removes discriminative gender-related biases via an orthogonal projection in the intermediate (hidden) layers, operating at token- or sentence-levels. Fine-tuning allows debiasing to be carried out without requiring large amounts of training data or computational resources. Our debiasing method is independent of model architectures or their pre-training methods, and can be adapted to a wide range of contextualised embeddings, as shown in § 4.3.

Let us define two types of words: attribute words (V_a) and target words (V_t). For example, in the case of gender bias, attribute words consist of multiple word sets such as feminine (e.g. she, woman, her) and masculine (e.g. he, man, him) words, whereas target words can be occupations (e.g. doctor, nurse, professor), which we expect to be gender neutral. We then extract sentences that contain an attribute or a target word. Sentences containing more than one attribute (or target) word are excluded to avoid ambiguities. Let us denote the set of sentences extracted for an attribute or a target word w by Ω(w). Moreover, let A = \bigcup_{w \in V_a} Ω(w) and T = \bigcup_{w \in V_t} Ω(w) be the sets of sentences containing respectively all of the attribute and target words. We require that the debiased contextualised word embeddings preserve semantic information w.r.t. the sentences in A, and remove any discriminative biases w.r.t. the sentences in T.

Let us consider a contextualised word embedding model E with pre-trained model parameters θ_e. For an input sentence x, let us denote the embedding of a token w in the i-th layer of E by E_i(w, x; θ_e). Moreover, let the total number of layers in E be N. In our experiments, we consider different types of encoder models as E. To formalise the requirement that the debiased word embedding E_i(t, x; θ_e) of a target word t ∈ V_t must not contain any information related to a protected attribute a, we consider the squared inner-product between the noncontextualised embedding v_i(a) of a and E_i(t, x; θ_e) as a loss L_i, given by (1):

L_i = \sum_{t \in V_t} \sum_{x \in \Omega(t)} \sum_{a \in V_a} \left( v_i(a)^\top E_i(t, x; \theta_e) \right)^2    (1)

Here, v_i(a) is computed by averaging the contextualised embeddings of a in the i-th layer of E over all sentences in Ω(a), following Bommasani et al. (2020), and is given by (2):

v_i(a) = \frac{1}{|\Omega(a)|} \sum_{x \in \Omega(a)} E_i(a, x; \theta_e)    (2)

Here, |Ω(a)| denotes the total number of sentences in Ω(a). If a word is split into multiple sub-tokens, we compute the contextualised embedding of the word by averaging the contextualised embeddings of its constituent sub-tokens. Minimising the loss L_i defined by (1) with respect to θ_e forces the hidden states of E to be orthogonal to the protected attributes such as gender.

Although removing discriminative biases in E is our main objective, we must simultaneously ensure that we preserve as much of the useful information encoded in the pre-trained model as possible for the downstream tasks. We model this as a regulariser, where we measure the squared ℓ₂ distance between the contextualised word embeddings of a word w in the i-th layer of the original model, parametrised by θ_pre, and the debiased model, as in (3):

L_{reg} = \sum_{x \in A} \sum_{w \in x} \sum_{i=1}^{N} \left\| E_i(w, x; \theta_e) - E_i(w, x; \theta_{pre}) \right\|_2^2    (3)

The overall training objective is then given by (4) as the linearly-weighted sum of the two losses defined by (1) and (3):

L = \alpha L_i + \beta L_{reg}    (4)

Here, the coefficients α, β ∈ [0, 1] satisfy α + β = 1.

[Figure 1: Types of hidden states in E considered in the proposed method. The blue boxes in the middle correspond to the hidden states of the target token.]

As shown in Figure 1, a contextualised word embedding model typically contains multiple layers. It is not obvious which hidden states of E are best for calculating L_i for the purpose of debiasing. Therefore, we compute L_i for different layers in a particular contextualised word embedding model in our experiments. Specifically, we consider three settings: debiasing only the first layer, only the last layer, or all layers. Moreover, L_i can be computed only for the target words in a sentence x, as in (1), or can be summed over all words w ∈ x (i.e. \sum_{t \in V_t} \sum_{x \in \Omega(t)} \sum_{w \in x} \sum_{a \in V_a} ( v_i(a)^\top E_i(w, x; \theta_e) )^2). We refer to the former as token-level debiasing and the latter as sentence-level debiasing. Collectively, this gives us six different settings for the proposed debiasing method, which we evaluate experimentally in § 4.3.
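To make the objective concrete, the following is a minimal PyTorch sketch of the all-layer, token-level variant of (4). This is our own illustrative code, not the authors' released implementation: the function name and tensor layout are ours, the attribute vectors v_i(a) from (2) are assumed precomputed with a frozen copy of the model, and for brevity both loss terms are computed on the same batch, whereas the paper computes L_i over the target-word sentences and L_reg over the attribute sentences.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)          # theta_e, fine-tuned
frozen = AutoModel.from_pretrained(model_name).eval()  # theta_pre, fixed
for p in frozen.parameters():
    p.requires_grad_(False)

def debias_loss(batch, target_pos, attr_vecs, alpha=0.5, beta=0.5):
    """Token-level, all-layer loss L = alpha * L_i + beta * L_reg (eqs. 1, 3, 4).

    batch      : tokenised sentences (dict of input tensors)
    target_pos : LongTensor [batch] with the sub-token index of the target word
    attr_vecs  : per-layer list of [num_attributes, hidden] tensors, the
                 precomputed v_i(a) from eq. (2)
    """
    # Per-layer hidden states (drop the initial sub-token embedding layer).
    hs = model(**batch, output_hidden_states=True).hidden_states[1:]
    with torch.no_grad():
        hs_pre = frozen(**batch, output_hidden_states=True).hidden_states[1:]

    idx = target_pos.view(-1, 1, 1)
    bias_loss, reg_loss = 0.0, 0.0
    for h, h_pre, v in zip(hs, hs_pre, attr_vecs):
        # Eq. (1): squared inner products of the target hidden state with each v_i(a).
        t_emb = h.gather(1, idx.expand(-1, 1, h.size(-1))).squeeze(1)
        bias_loss = bias_loss + (t_emb @ v.t()).pow(2).sum()
        # Eq. (3): squared l2 drift from the pre-trained hidden states.
        reg_loss = reg_loss + (h - h_pre).pow(2).sum()
    return alpha * bias_loss + beta * reg_loss
```

Keeping a frozen second copy of the model provides the θ_pre targets in (3) without having to store all pre-trained hidden states in advance.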
4 Experiments

4.1 Datasets

We used SEAT (May et al., 2019) tests 6, 7 and 8 to evaluate gender bias. We use NLI as a downstream evaluation task and use the Multi-Genre Natural Language Inference data (MNLI; Williams et al., 2018) for training and development, following Dev et al. (2020). In NLI, the task is to classify a given hypothesis and premise sentence-pair as entailing, contradicting, or neutral. We programmatically generated the evaluation set following Dev et al. (2020) by filling occupation words and gender words into template sentences. The templates take the form "The subject verb a/an object." and the created sentence-pairs are assumed to be neutral.

We used the word lists created by Zhao et al. (2018b) for the attribute lists of feminine and masculine words. As for the stereotype word list used for the target words, we use the list created by Kaneko and Bollegala (2019). From the News-commentary-v15 corpus, we extracted 11,023, 42,489 and 34,148 sentences respectively for feminine, masculine and stereotype words. We excluded sentences with more than 128 tokens from the training data. We randomly sampled 1,000 sentences from each type of extracted sentences as development data.

We used the GLUE benchmark (Wang et al., 2018) to evaluate whether the useful information in the pre-trained embeddings is retained after debiasing. To evaluate the debiased models with minimal effects due to task-specific fine-tuning, we used the following small-scale training data: Stanford Sentiment Treebank (SST-2; Socher et al., 2013), Microsoft Research Paraphrase Corpus (MRPC; Dolan and Brockett, 2005), Semantic Textual Similarity Benchmark (STS-B; Cer et al., 2017), Recognising Textual Entailment (RTE; Dagan et al., 2005; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), and the Winograd Schema Challenge (WNLI; Levesque et al., 2012). We evaluate the performance of the contextualised embeddings on the corresponding development data.

4.2 Settings

We used BERT (bert-base-uncased; Devlin et al., 2019), RoBERTa (roberta-base; Liu et al., 2019), ALBERT (albert-base-v2; Lan et al., 2020), DistilBERT (distilbert-base-uncased; Sanh et al., 2019) and ELECTRA (electra-small-discriminator; Clark et al., 2020) in our experiments. DistilBERT has 6 layers and the others 12. We used the development data in SEAT-6 for hyperparameter tuning. The hyperparameters of the models, except the learning rate and batch size, are set to their default values as in run_glue.py in the huggingface transformers library (https://github.com/huggingface/transformers). Using greedy search, the learning rate was set to 5e-5 and the batch size to 32 during debiasing. Optimal values for α and β (with α + β = 1) were found by a greedy search over [0, 1]. For the GLUE and MNLI experiments, we set the learning rate to 2e-5 and the batch size to 16. Experiments were conducted on a GeForce GTX 1080 Ti GPU.

[Table 1: Gender bias of contextualised embeddings on SEAT (tests 6, 7 and 8) and accuracy on the GLUE tasks (SST-2, MRPC, STS-B, RTE, WNLI, Avg) for each model (BERT, RoBERTa, ALBERT, DistilBERT, ELECTRA) under the original, first/last/all-layer, token/sentence-level and random settings. † denotes significant bias effects. The numeric values are not recoverable from the extracted text.]

4.3 Results

Table 1 shows the results on SEAT and GLUE, where original denotes the pre-trained contextualised models prior to debiasing. We see that the original models other than ELECTRA contain significant levels of gender bias. Overall, the all-token method, which conducts token-level debiasing across all layers, performs best. Prior work has shown that biases are learned at each layer (Bommasani et al., 2020), so it is important to debias all layers. Moreover, we see that debiasing at the token-level is more effective than at the sentence-level. This is because in token-level debiasing the loss is computed only on the target word, providing a more direct debiasing update for the target word than sentence-level debiasing, which sums the losses over all tokens in a sentence.

To test the importance of carefully selecting the target words according to the types of biases that we want to remove from the embeddings, we implement a random baseline, where we randomly select target and attribute words from V_a ∪ V_t and perform all-token debiasing. We see that random debiases BERT to some extent but is not effective on the other models. This result shows that the proposed debiasing method is not merely a regularisation technique that imposes constraints on an arbitrary set of words; it is essential to carefully select the target words used for debiasing.

The results on GLUE show that for BERT, DistilBERT and ELECTRA, the debiased embeddings report performance comparable to the original embeddings in most settings. (Although on WNLI, all-token debiasing improves performance for DistilBERT and ELECTRA compared to the respective original models, this is insignificant as WNLI contains only 146 test instances.) This confirms that the proposed debiasing method preserves sufficient semantic information from the original embeddings to learn accurate prediction models for the downstream NLP tasks. However, the performance of RoBERTa and ALBERT decreases significantly compared to their original versions after debiasing. We suspect that these models are more sensitive to fine-tuning and hence lose their pre-trained information during the debiasing process. We defer the development of techniques to address this issue to future research.
Following Dev et al. (2020), we use the multi-genre co-reference-based natural language inference (MNLI) dataset for evaluating gender bias. This dataset contains sentence triples, where a premise must be neutral in entailment w.r.t. two hypotheses. If the predictions made by a classifier that uses word embeddings as features deviate from neutrality, it is considered to be biased. Given a set containing M test instances, let the entailment predictor's probabilities for the m-th instance for the entail, neutral and contradiction labels be respectively e_m, n_m and c_m. Then, they proposed the following measures to quantify the bias: (1) Net Neutral (NN): NN = \frac{1}{M} \sum_{m=1}^{M} n_m; (2) Fraction Neutral (FN): FN = \frac{1}{M} \sum_{m=1}^{M} \mathbb{I}[\text{neutral} = \arg\max(e_m, n_m, c_m)]; and (3) Threshold τ (T:τ): T:τ = \frac{1}{M} \sum_{m=1}^{M} \mathbb{I}[n_m \geq \tau], where we used τ = 0.7, following Dev et al. (2020). For an ideal (bias-free) embedding, all three measures would be 1.
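These three measures are straightforward to compute from the predicted label probabilities. The following is a minimal sketch; the array layout (one row per instance, columns ordered entail/neutral/contradiction) and the function name are our own choices.

```python
import numpy as np

def nli_bias_measures(probs, tau=0.7):
    """Compute NN, FN and T:tau from an [M, 3] array of
    (entail, neutral, contradiction) probabilities predicted for
    premise-hypothesis pairs that should all be labelled neutral.
    Each measure equals 1.0 for an ideal, bias-free embedding."""
    n = probs[:, 1]                              # neutral probabilities n_m
    nn = n.mean()                                # Net Neutral
    fn = (probs.argmax(axis=1) == 1).mean()      # Fraction Neutral
    t_tau = (n >= tau).mean()                    # Threshold tau
    return nn, fn, t_tau
```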
[Table 3: Averaged scores over all layers in an embedding debiased at token-level, measured on SEAT tests 6, 7 and 8, for the all, first and last settings of each model. Only fragments of the numeric values survive in the extracted text.]
[Table 2: Debiasing results for BERT on MNLI, reporting MNLI-m, MNLI-mm, NN, FN and T:0.7 for the original embeddings, Dev et al. (2020) and the proposed method. The numeric values are not recoverable from the extracted text.]

In Table 2, we compare our proposed method against the noncontextualised debiasing method proposed by Dev et al. (2020), where Layer 1 of the BERT-large model is debiased using an orthogonal projection to the gender direction during training and evaluation. In addition to the above-mentioned measures, we also report the entailment accuracy on the matched (in-domain) and mismatched (cross-domain) test sets, denoted respectively by MNLI-m and MNLI-mm in Table 2, to evaluate the semantic information preserved in the embeddings after debiasing.

We see that the proposed method outperforms noncontextualised debiasing (Dev et al., 2020) on NN and T:0.7, and its performance on the MNLI task is comparable to that of the original embeddings. This result further confirms that the proposed method can not only debias well but can also preserve the pre-trained information. Moreover, it is consistent with the results reported in Table 1 and shows that debiasing all layers is more effective than debiasing only the first layer, as done by Dev et al. (2020).
In Table 1, we investigated the bias measured at the final layer, but it is known that biases are learned at each layer of contextualised embeddings (Bommasani et al., 2020). Therefore, to investigate whether debiasing at particular layers removes the biases of the entire contextualised embedding, we evaluate the debiased embeddings at each layer on the SEAT 6, 7 and 8 datasets and report the averaged metrics for the all-token, first-token and last-token methods in Table 3. We see that, on average, the first-token and last-token methods retain more bias than all-token. Therefore, we conclude that it is not enough to debias only the first and last layers, even in DistilBERT, which has a small number of layers. These results show that biases in the entire contextualised embedding cannot be reliably removed by debiasing only some selected layers; rather, it is important to debias all layers consistently.

To further illustrate the effect of debiasing using the proposed all-token method, we visualise the similarity scores of a stereotypical word with feminine and masculine dimensions as follows. First, for each target word t, its hidden state E_i(t, x) in the i-th layer of the model E in a sentence x is computed. Next, we average those hidden states across all sentences in the dataset that contain t to obtain \hat{E}_i(t) = \frac{1}{|\Omega(t)|} \sum_{x \in \Omega(t)} E_i(t, x). Likewise, we compute \hat{E}_i(f) and \hat{E}_i(m) respectively for each feminine (f) and masculine (m) word. Next, we compute s^f_i, the cosine similarity between each \hat{E}_i(f) and the feminine vector v_i(f), and the cosine similarity s^m_i between each \hat{E}_i(m) and the masculine vector v_i(m). s^f_i and s^m_i are respectively averaged over all layers in a contextualised embedding model to obtain s^f_Avg and s^m_Avg, which represent how much gender information each gender word contains on average.

We then compute the cosine similarity s^{t,f}_i between each stereotype word's averaged embedding \hat{E}_i(t) and the feminine vector v_i(f). Similarly, we compute the cosine similarity s^{t,m}_i between each stereotype word's averaged embedding \hat{E}_i(t) and the masculine vector v_i(m). We then average s^{t,f}_i and s^{t,m}_i over the layers in E respectively, to compute s^{t,f}_Avg and s^{t,m}_Avg, which represent how much gender information each stereotype word contains on average. Finally, we visualise the normalised female and male gender scores given respectively by s^{t,f}_Avg / s^f_Avg and s^{t,m}_Avg / s^m_Avg. For example, a zero s^{t,f}_Avg / s^f_Avg value indicates that t does not contain female gender-related information, whereas a value of one indicates that it contains all information about the female gender.

[Figure 2: Scatter plots of gender information in the hidden states for original and debiased stereotype words: (a) BERT, (b) RoBERTa, (c) ALBERT, (d) DistilBERT, (e) ELECTRA.]

Figure 2 shows each stereotype word with its normalised female and male gender scores on the x and y axes respectively. For each word, a yellow circle denotes its original embedding, and a blue triangle denotes the result of debiasing using the all-token method. We see that with the original embeddings, stereotypical words are distributed close to one, indicating that they are highly gender-specific. On the other hand, we see that the debiased BERT, DistilBERT and ELECTRA have word distributions similar to their respective original embeddings, with an overall movement towards zero. In contrast, for RoBERTa, the debiased embeddings remain mainly distributed between zero and around one, close to the original embeddings. Moreover, for ALBERT, the debiased embeddings are mainly clustered around zero, unlike the original distribution. This shows that RoBERTa and ALBERT do not retain the structure of the original distribution after debiasing: while ALBERT over-debiases the pre-trained embeddings of stereotypical words, RoBERTa under-debiases them.
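A minimal sketch of this normalised score computation for a single stereotype word and a single gender word follows; the paper averages over all feminine and masculine words, and the variable names and array layout here are our own.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def normalised_gender_score(E_t, E_g, v_g):
    """s^{t,g}_Avg / s^{g}_Avg for a stereotype word t and a gender word g.

    E_t : [N, H] per-layer averaged embeddings hat{E}_i(t) of t
    E_g : [N, H] per-layer averaged embeddings hat{E}_i(g) of g
    v_g : [N, H] per-layer gender vectors v_i(g) from eq. (2)
    """
    s_t = np.mean([cosine(t, v) for t, v in zip(E_t, v_g)])  # s^{t,g}_Avg
    s_g = np.mean([cosine(g, v) for g, v in zip(E_g, v_g)])  # s^{g}_Avg
    return s_t / s_g  # ~0: no gender info; ~1: as gendered as g itself
```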
This trend was already confirmed by the downstream evaluation tasks reported in Table 1.

5 Conclusions

We proposed a debiasing method for pre-trained contextualised word embeddings, operating at token- or sentence-levels. Our experimental results showed that the proposed method effectively removes discriminative gender-related biases, while preserving useful semantic information in the pre-trained embeddings. The results also showed that the proposed method debiases more effectively than previous studies, while remaining comparably accurate on the downstream tasks.
References
Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, and Danilo Giampiccolo. 2006. The second PASCAL recognising textual entailment challenge. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment.

Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo Giampiccolo, and Bernardo Magnini. 2009. The fifth PASCAL recognizing textual entailment challenge. In Proc. Text Analysis Conference (TAC'09).

Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In NIPS.

Rishi Bommasani, Kelly Davis, and Claire Cardie. 2020. Interpreting pretrained contextualized representations via reductions to static embeddings. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4758–4781, Online. Association for Computational Linguistics.

Shikha Bordia and Samuel R. Bowman. 2019. Identifying and reducing gender bias in word-level language models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 7–15, Minneapolis, Minnesota. Association for Computational Linguistics.

Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356:183–186.

Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada. Association for Computational Linguistics.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. ArXiv, abs/2003.10555.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Proceedings of the First International Conference on Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification, and Recognizing Textual Entailment, MLCW'05, pages 177–190, Berlin, Heidelberg. Springer-Verlag.

Sunipa Dev, Tao Li, Jeff Phillips, and Vivek Srikumar. 2020. On measuring and mitigating biased inferences of word embeddings. In AAAI.

Sunipa Dev and Jeff M. Phillips. 2019. Attenuating bias in word vectors. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, volume 89 of Proceedings of Machine Learning Research, pages 879–887. PMLR.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Simon De Deyne, Danielle J. Navarro, Amy Perfors, Marc Brysbaert, and Gert Storms. 2019. The "Small World of Words" English word association norms for over 12,000 cue words. Behavior Research Methods, 51:987–1006.

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

Yupei Du, Yuanbin Wu, and Man Lan. 2019. Exploring human gender stereotypes with word association test. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6133–6143, Hong Kong, China. Association for Computational Linguistics.

Yanai Elazar and Yoav Goldberg. 2018. Adversarial removal of demographic attributes from text data. In Proc. of EMNLP.

Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The Third PASCAL Recognizing Textual Entailment Challenge, pages 1–9. Association for Computational Linguistics, USA.

Hila Gonen and Yoav Goldberg. 2019. Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 609–614, Minneapolis, Minnesota. Association for Computational Linguistics.

Rowan Hall Maudslay, Hila Gonen, Ryan Cotterell, and Simone Teufel. 2019. It's all in the name: Mitigating gender bias with name-based counterfactual data substitution. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5266–5274, Hong Kong, China. Association for Computational Linguistics.

Masahiro Kaneko and Danushka Bollegala. 2019. Gender-preserving debiasing for pre-trained word embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1641–1650, Florence, Italy. Association for Computational Linguistics.

Masahiro Kaneko and Danushka Bollegala. 2020. Autoencoding improves pre-trained word embeddings. In Proceedings of the 28th International Conference on Computational Linguistics, pages 1699–1713, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Saket Karve, Lyle Ungar, and João Sedoc. 2019. Conceptor debiasing of word representations evaluated on WEAT. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 40–48, Florence, Italy. Association for Computational Linguistics.

Keita Kurita, Nidhi Vyas, Ayush Pareek, Alan W Black, and Yulia Tsvetkov. 2019. Measuring bias in contextualized word representations. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 166–172, Florence, Italy. Association for Computational Linguistics.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. ArXiv, abs/1909.11942.

Hector J. Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd schema challenge. In Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning, KR'12, pages 552–561. AAAI Press.

Yitong Li, Timothy Baldwin, and Trevor Cohn. 2018. Towards robust and privacy-preserving text representations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 25–30. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Chandler May, Alex Wang, Shikha Bordia, Samuel R. Bowman, and Rachel Rudinger. 2019. On measuring social biases in sentence encoders. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 622–628, Minneapolis, Minnesota. Association for Computational Linguistics.

Tomas Mikolov, Kai Chen, and Jeffrey Dean. 2013. Efficient estimation of word representation in vector space. In ICLR.

Moin Nadeem, Anna Bethke, and Siva Reddy. 2020. StereoSet: Measuring stereotypical bias in pretrained language models.

Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R. Bowman. 2020. CrowS-Pairs: A challenge dataset for measuring social biases in masked language models. In Proc. of EMNLP.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543.

Shauli Ravfogel, Yanai Elazar, Hila Gonen, Michael Twiton, and Yoav Goldberg. 2020. Null it out: Guarding protected attributes by iterative nullspace projection. In Proc. of ACL.

Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. 2018. Gender bias in coreference resolution. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 8–14. Association for Computational Linguistics.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. ArXiv, abs/1910.01108.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.

Yi Chern Tan and L. Elisa Celis. 2019. Assessing social and intersectional biases in contextualized word representations. In Advances in Neural Information Processing Systems 32, pages 13230–13241. Curran Associates, Inc.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.

Qizhe Xie, Zihang Dai, Yulun Du, Eduard Hovy, and Graham Neubig. 2017. Controllable invariance through adversarial feature learning. In Proc. of NIPS.

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Ryan Cotterell, Vicente Ordonez, and Kai-Wei Chang. 2019. Gender bias in contextualized word embeddings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 629–634, Minneapolis, Minnesota. Association for Computational Linguistics.

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2018a. Gender bias in coreference resolution: Evaluation and debiasing methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 15–20. Association for Computational Linguistics.

Jieyu Zhao, Yichao Zhou, Zeyu Li, Wei Wang, and Kai-Wei Chang. 2018b. Learning gender-neutral word embeddings. In Proc. of EMNLP, pages 4847–4853.

Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf. 2003. Learning with local and global consistency. In NIPS.

Ran Zmigrod, Sebastian J. Mielke, Hanna Wallach, and Ryan Cotterell. 2019. Counterfactual data augmentation for mitigating gender stereotypes in languages with rich morphology. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy. Association for Computational Linguistics.