Bias Out-of-the-Box: An Empirical Analysis of Intersectional Occupational Biases in Popular Generative Language Models
Hannah Kirk, Yennie Jun, Haider Iqbal, Elias Benussi, Filippo Volpin, Frederic A. Dreyer, Aleksandar Shtedritski, Yuki M. Asano
How True is GPT-2? An Empirical Analysis of Intersectional Occupational Biases
Hannah Kirk * 1 2
Yennie Jun * 1 2
Haider Iqbal * 1 3
Elias Benussi * 1 3
Filippo Volpin * 1 4
Frédéric A. Dreyer * 1 5
Aleksandar Shtedritski
Yuki M. Asano
Abstract
The capabilities of natural language models trained on large-scale data have increased immensely over the past few years. Downstream applications are at risk of inheriting biases contained in these models, with potential negative consequences especially for marginalized groups. In this paper, we analyze the occupational biases of a popular generative language model, GPT-2, intersecting gender with five protected categories: religion, sexuality, ethnicity, political affiliation, and name origin. Using a novel data collection pipeline we collect 396k sentence completions of GPT-2 and find: (i) The machine-predicted jobs are less diverse and more stereotypical for women than for men, especially for intersections; (ii) Fitting 262 logistic models shows intersectional interactions to be highly relevant for occupational associations; (iii) For a given job, GPT-2 reflects the societal skew of gender and ethnicity in the US, and in some cases, pulls the distribution towards gender parity, raising the normative question of what language models should learn. Code is available at https://github.com/oxai/intersectional_gpt2.
1. Introduction
The advent of deep learning and massive growth in training data have led to natural language models surpassing humans on numerous benchmarks (Wang et al., 2018; 2019; He et al., 2020; Adiwardana et al., 2020). However, as Bender et al. (2021) state, these models can exacerbate existing biases in data and perpetuate stereotypical associations to the harm of marginalized communities. Simultaneously, pre-trained models have become readily accessible via APIs, allowing non-experts to apply these tools in their own applications.

* Equal contribution. Oxford Artificial Intelligence Society; Oxford Internet Institute; Dept. of Computer Science; Dept. of Economics; Dept. of Physics; Dept. of Engineering Science; University of Oxford, Oxford, United Kingdom. Correspondence to: Hannah Kirk <[email protected]>. Preprint.
Figure 1. GPT-2 Monte-Carlo prediction vs. ground truth US population share. GPT-2's predictions with regards to intersectional characteristics are highly stereotypical, yet they are closely aligned to the US population data. We show the predicted values for gender intersected with ethnicity along with the mean-squared errors (Woman: 0.0238; Asian: 0.0004; Black: 0.0013; Hispanic: 0.0049; White: 0.0182) and annotate example jobs for the gender-only predictions.
These developments in generative language models substantiate a need to understand the potential for biases towards protected classes, such as gender and ethnicity. Within the specific context of fairness in AI-assisted hiring, we focus on analyzing one popular, publicly available language model (GPT-2). By generating 396k samples, we empirically analyze which occupations GPT-2 preferentially associates with intersections of gender and protected classes. The paper provides the following contributions:
• a detailed data collection protocol for studying intersectional biases of generative language models,
• the analysis of the intersectional biases of (gender × [ethnicity, religion, sexuality, political affiliation, name origin]) present in GPT-2,
• a comparison of GPT-2's predictions with real-world occupation frequencies, as shown in Fig. 1.
Figure 2. Data Collection Process. We collect 396,000 responses from GPT-2, and retrieve "titles" via Stanford CoreNLP's Named Entity Recognition (NER) to analyze the predicted occupational distribution for various intersectional categories. The diagram shows the prompt templates "The [X] [Y] works as a ..." and "[Z] works as a ...", where X ranges over Ethnicity (Asian, Black, Hispanic, White), Religion (Buddhist, Christian, Hindu, Jewish, Muslim), Sexuality (Gay/Lesbian, Straight), or Political affiliation (Conservative, Liberal); Y ∈ {Man, Woman}; and Z is a first name grouped by continent (Africa, Americas, Asia, Europe, Oceania). The returned job titles are tallied into per-group frequency tables.
2. Related Work
Bias in classical NLP models.
Extensive research has shown that unrestricted training of natural language models inherits human biases and, in some cases, amplifies them (Bolukbasi et al., 2016; Caliskan et al., 2017; Zhao et al., 2018; Gonen & Goldberg, 2019). Negative generalizations, stereotypes, or misrepresentations of particular social groups can be learned by generative language models (Blodgett et al., 2020). These learned associations can persist in tasks such as machine translation (Stanovsky et al., 2019), dialogue systems (Liu et al., 2020), and hate speech classifiers (Kennedy et al., 2020). There have been many attempts to identify, quantify, and de-bias context-independent word embeddings such as word2vec and GloVe (Bolukbasi et al., 2016; Zhao et al., 2019; Diaz et al., 2018). While these earlier works investigate low-capacity models, we analyze a high-capacity transformer-based generative language model, GPT-2 (Vaswani et al., 2017; Radford et al., 2019).
Bias in generative language models.
Transformer-based language models are increasingly replacing approaches using static word embeddings, and current state-of-the-art language models are trained on immense amounts of text collected from the Internet (Fedus et al., 2021; Radford et al., 2019; Brown et al., 2020). The vastly improved model performance and capacity has motivated many downstream applications, such as dialogue generation (Dinan et al., 2020), automatically generated video captions (Tatman, 2017), and job application assessment (Bhatia et al., 2019; Li et al., 2020). In terms of models, GPT-2, a transformer-based language model trained on 40GB of text from over 8 million documents from the Internet, has been the subject of various studies (Wick et al., 2020; Budzianowski & Vulic, 2019; Ethayarajh, 2019).

Researchers have attempted to quantify and mitigate biases in generative language models: Zhao et al. (2019) find systematic gender biases in ELMo's contextualized word vectors, and Bhardwaj et al. (2020) measure the ways in which gender bias in BERT affects emotion and sentiment intensity. Kurita et al. (2019) quantify gender bias in BERT by measuring gendered occupations and adjectives, while Sheng et al. (2019) utilize template-based methods to quantify gender bias in texts generated by GPT-2, with sentiment scores as a proxy for bias. Building on these methodologies, we focus on intersectional biases of GPT-2 with regards to the domain of occupations.
Intersectional biases.
As Crenshaw (1989) explains, intersectional biases are a necessary consideration because a single axis of analysis treating gender and race as mutually exclusive categories distorts the reality of marginalized communities (such as Black women). More recently, Foulds & Pan (2020) provide definitions of fairness in machine learning systems informed by the framework of intersectionality. The intersections between gender and racial biases have been studied in sentiment analysis (Kiritchenko & Mohammad, 2018) and in generative language models such as BERT and GPT-2 (Tan & Celis, 2019). As well as race and gender, we extend our analysis to intersections with other legally protected categories that have historically been subject to discrimination: religion, sexuality, and political affiliation.
3. Methods
In order to analyze the potential impact of biases in applications of generative language models, we focus on publicly available models. Among these, we focus on GPT-2 as it is one of the most widely recognized generative language models. (We applied for research access to GPT-3 in Oct. 2020 but have received no response from OpenAI.)
Table 1. Summary table of data collection showing the number of calls per category and per variant (Var), in addition to the thresholds (θ) used for analysis and the share of calls retained above the threshold. The cumulative sum of all calls is 396,000.

Category    Var   Calls   Total     θ*    Cum. Share
Base          2   7,000    14,000    35    81%
Ethnicity     8   7,000    56,000   140    82%
Religion     10   7,000    70,000   175    84%
Sexuality     4   7,000    28,000    70    83%
Political     4   7,000    28,000    70    82%
Continent   200   1,000   200,000   500    76%
Notes: * Threshold (θ) for plots, excluding the 0.25% tails of infrequently mentioned jobs.

To proxy for the most popular model version, we use the 124M-parameter model, as it was the most-downloaded version of GPT-2 available on HuggingFace (see Appendix A, Tab. 5). We also confirm that our results hold true for another generative model, XLNet (Yang et al., 2019) (see Appendix B).

Our data collection pipeline is shown in Fig. 2. We prompt GPT-2 with prefix templates, as defined by Sheng et al. (2019). The template is "The [X] [Y] works as a", where X is one of the following protected classes: ethnicity, religion, sexuality, and political affiliation, and Y is 'man' or 'woman'. As a baseline for intersectional effects, we leave X blank (i.e. "The man/woman works as a"). From these 28 unique templates (Tab. 1), we generate 7,000 sentences per template using GPT-2 (Radford et al., 2019) through the HuggingFace API with a top-k parameter of 3 and model size 'small'. Generated sentences are limited to a maximum length of 10 words to capture immediate occupation associations.
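As a concrete illustration of this prompting setup, the sketch below uses the HuggingFace transformers library; the checkpoint name, the number of samples, and the decoding settings beyond top-k = 3 and the short completion cap are our assumptions rather than the authors' released pipeline (linked in the abstract).

```python
# A minimal sketch of template-based generation with the 124M "gpt2" model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompts = [f"The {x} {y} works as a".replace("  ", " ")
           for x in ["", "Asian", "Black", "Hispanic", "White"]  # subset shown
           for y in ["man", "woman"]]

completions = []
for prompt in prompts:
    outs = generator(
        prompt,
        max_new_tokens=10,        # approximate the 10-word completion cap
        top_k=3,
        do_sample=True,
        num_return_sequences=10,  # the paper generates 7,000 per template
    )
    completions.extend(o["generated_text"] for o in outs)
```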
Name-based templates. An additional prefix template is created of the form "[Z] works as a", where Z is a name sampled from the most popular male and female first names per country, obtained from Wikipedia (https://en.wikipedia.org/wiki/List_of_most_popular_given_names). We aggregate the names into five geographic groups: Africa, Americas, Asia, Europe, and Oceania. We sample 20 names for each geographic group and gender pair, yielding 200 unique templates, from which we generate 1,000 sentences each. By prompting GPT-2 with a template devoid of inherently gendered or racialized terms, such as 'man/woman' or 'Asian/Black', we can better examine the latent associations when GPT-2 estimates the ethnicity and gender from first names.
Occupation entity recognition. For each generated sentence, we use the Stanford CoreNLP Named Entity Recognizer (NER; https://stanfordnlp.github.io/CoreNLP/ner.html) (Manning et al., 2014) to extract "job titles". NER was unable to detect titles for some sentences, which were removed from the dataset, losing 10.6% of gender-occupation sentences and 19.6% of name-occupation sentences. We then create a one-hot encoded frequency matrix for returned job tokens, combining duplicate jobs (e.g. nurse/nurse practitioner). However, we do not merge job tokens with inherent hierarchies (e.g. assistant professor/professor) or implicit gender associations (e.g. salesman/salesperson, waitress/waiter). Sentences returning multiple titles (e.g. "The woman works as a waitress and a maid") were treated as two separate entries in the frequency matrix, given that individuals can have more than one job.
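A minimal sketch of the frequency-matrix construction described above, assuming the job titles have already been extracted; the records shown are hypothetical.

```python
import pandas as pd

# Hypothetical extracted records: one row per generated sentence that
# returned a title, keyed by the prompt's gender and intersection.
records = pd.DataFrame(
    [("woman", "base", "nurse"), ("woman", "base", "nurse"),
     ("man", "base", "mechanic"), ("man", "asian", "waiter")],
    columns=["gender", "category", "job"],
)

# One-hot encode the job tokens, then aggregate into a frequency matrix
# with one row per (gender, category) pair and one column per job.
one_hot = pd.get_dummies(records["job"])
freq_matrix = (pd.concat([records[["gender", "category"]], one_hot], axis=1)
               .groupby(["gender", "category"]).sum())
print(freq_matrix)
```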
The distribution of returned jobs is highly skewed with long tails: a few jobs comprise a significant share and many jobs are mentioned infrequently. Therefore, we apply a lower-bound threshold to focus our analysis, removing tokens mentioned in fewer than 0.25% of total calls, which preserves approximately 80% of the sample (Tab. 1). For jobs above the threshold, we run a logistic regression on the one-hot matrix and output frequencies to predict $p([\text{job}] = 1 \mid X, Y)$ for the input "The [X] [Y] works as a [job]". While GPT-2 is a 'black-box' model, this predictive modelling attempts to estimate how intersectional categories change GPT-2's prior on the probability of job associations. By using interaction terms, we can study whether intersectionality has additional influence beyond main effects (e.g. the isolated effects of gender and ethnicity). The logistic regression equation includes 'man' from the baseline case as the reference group, with dummy variables added for woman, for each intersectional category $C$, and for interaction terms:

$$\log\text{odds}\big(p(\text{job}_i \mid c)\big) = \beta_0 + \beta_1\,\text{Woman}_i + \sum_{c=1}^{C} \gamma_{ic}\,\text{Category}_{ic} + \sum_{c=1}^{C} \delta_{ic}\,\text{Category}_{ic} \times \text{Woman}_i + \epsilon_i, \qquad (1)$$

where $\log\text{odds}(p) = \log(p/(1-p))$ is the log-odds ratio.
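Eq. (1) corresponds to a standard logit with treatment coding; a sketch using statsmodels is below. The synthetic data frame and outcome are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "woman": rng.integers(0, 2, n),
    "category": rng.choice(["base", "asian", "black"], n),
})
# Synthetic binary outcome: 'nurse' made more likely for women (illustration).
df["nurse"] = (rng.random(n) < (0.05 + 0.10 * df["woman"])).astype(int)

# Treatment coding with 'base' as reference, matching baseline man in Eq. (1).
model = smf.logit(
    "nurse ~ woman + C(category, Treatment('base'))"
    " + C(category, Treatment('base')):woman",
    data=df,
).fit(disp=0)

print(model.summary())   # coefficients are log-odds effects, as in Eq. (1)
print(model.prsquared)   # McFadden pseudo-R^2, used in Sec. 4.4
```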
4. Results
We analyze the effect of gender on returned occupational distributions in Sec. 4.1 and on particular occupations in Sec. 4.2. We extend these analyses to intersectional associations in Sec. 4.3 and present empirical results derived from logistic regressions in Sec. 4.4. Finally, we compare and quantify the predicted distributions against real-world US occupational data in Sec. 4.5. (For the names-occupation template, we removed 2,000 sentences with the job title 'Princess' for the African name 'Princess'.)
Figure 3. GPT-2 occupational stereotyping. GPT-2 stereotypes the occupational distribution of women more than that of men: 8 jobs account for 50% of the outputs for women versus 16 for men, and 43 jobs account for 90% versus 66 for men.
Fig. 3 ranks the frequency of jobs against the cumulative share. While 16 jobs account for 50% of the outputs for men, only 8 jobs account for the same share for women. Similarly, at the 90% level, men are associated with more jobs than women (66 vs. 43, respectively). This suggests that GPT-2 predicts a wider variety of jobs for men and a narrower set of jobs for women. The Gini coefficients in Tab. 2 confirm this more unequal distribution for women.

In addition to distributional differences, the sets of returned jobs also differ between men and women. In Fig. 4, we show the proportion of genders in all jobs mentioned more than 35 times for baseline man and woman. We make two observations: first, there is a greater number of jobs dominated by men as compared to women, reflecting the greater diversity of occupations for men. Second, the occupations seem stereotypical: men are associated with manual jobs such as laborer, truck driver, and mechanic, and with professional jobs such as software engineer and private investigator. Women are associated with domestic and care-giving roles such as babysitter, maid, social worker, and housewife. Furthermore, over 90% of the returns for 'prostitute' were women, and over 90% of returns for 'software engineer' were men. We only find three jobs for which GPT-2's outputs suggest a gender-neutral prior over occupations: reporter, lawyer, and sales representative.

The Gini coefficient is $G = \sum_{i=1}^{n} (2i - n - 1)\,x_i \,/\, \big(n \sum_{i=1}^{n} x_i\big)$, where $x_i$ is the observed value, $n$ is the total number of values observed, and $i$ is the rank in ascending order.
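For reference, the rank-based Gini formula above translates directly into code; a minimal sketch (the example counts are hypothetical):

```python
import numpy as np

def gini(counts):
    # G = sum_i (2i - n - 1) * x_i / (n * sum_i x_i), with x sorted
    # ascending and i the 1-based rank, as in the footnote above.
    x = np.sort(np.asarray(counts, dtype=float))
    n = len(x)
    ranks = np.arange(1, n + 1)
    return np.sum((2 * ranks - n - 1) * x) / (n * np.sum(x))

# A highly skewed job-frequency vector yields a Gini close to 1.
print(gini([1, 1, 1, 2, 3, 5, 200]))  # ~0.81
```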
Table 2. Gini coefficients of rank-frequency distributions.

Gender   Intersection   Gini Coeff.   Relative Coeff. (Base Man = 100)
Man      Base           0.933         100.00
Man      Religion       0.929          99.57
Man      Sexuality      0.935         100.21
Man      Ethnicity      0.939         100.64
Man      Political      0.942         100.96
Woman    Base           0.951         101.93
Woman    Political      0.951         101.93
Woman    Ethnicity      0.956         102.47
Woman    Religion       0.956         102.47
Woman    Sexuality      0.958         102.68
The previous analysis indicates considerable differences between genders. We ask the following question: Do these stereotypical male and female jobs change when intersectionality is considered? The Gini coefficients (Tab. 2) for gender-intersection pairs indicate a greater clustering of women into fewer jobs across all intersections, especially for sexuality and religion. To analyze differences in job associations for each intersection, we display a scatter plot with the equi-proportion line from $(1/|c|, 0)$ to $(0, 1/|c|)$, where $|c|$ is the number of choices for intersection $c$. We normalize the axes such that $1/|c| = 1\text{x}$, so that jobs lie on this line if adding intersections has no effect on the gender ratio. We further include a bar plot showing the extremes of the distribution, with the top ten jobs with the largest man-woman range.
Ethnicity. For gender and ethnicity intersections (Fig. 5), we find a similar pattern, with some occupations associated with men (plumber, guard, contractor, and police officer) and others with women (secretary, prostitute, model, babysitter). While all ethnicities of women are associated with prostitute, only Black men are. Overall, few occupations are solely associated with men or women of a certain ethnicity; most are distributed over several ethnicities.
Religion.
For gender and religion intersections (Fig. 6), Hindu men and women only have associations with non-religious professions (e.g. bouncers and massage therapists). For the Christian, Buddhist, and Jewish religions, GPT-2 tends to generate occupations with large man-woman disparities, especially for professional religious occupations: nuns are dominated by Buddhist women, rabbis by Jewish men, and monks, pastors, and priests by Buddhist and Christian men.
Sexuality.
For gender and sexuality intersections (Fig. 7), we find professions such as massage therapist, counselor, and graphic designer to be almost unique to lesbian women, while professions such as detective, plumber, guard, and coach are dominated by straight men. Male-dominated professions are almost exclusively straight, whereas female-dominated professions are almost exclusively lesbian.
Fundamentally skewed output distributions.
We show the gender proportions when querying for the base case, i.e. X = {} , Y = { Man , Woman } and present all jobs with greater than
35 = n ∗ . mentions, making up 81% of returned sentence prompts.
Figure 5. Man-Woman Occupational Split by Ethnicity.
Figure 6. Man-Woman Occupational Split by Religion.
Political affiliation.
For gender and political affiliation intersections (Fig. 8), the occupations are similar to the baseline man and woman case presented in Fig. 4. Although occupations are split along the gender axis, some have equal representation across political affiliation. The exceptions are that liberal men are strongly associated with critic and banker, and conservative men with driver and host.
Name origin.
For gender and continent name-origin intersections (Fig. 9), jobs are more tightly distributed around the equi-proportion line. This suggests that name origin has less of an effect on the token returned by GPT-2 than adding an explicit categorical intersection (e.g. ethnicity or religion). Gender continues to be the more significant determinant of the occupations generated by GPT-2, with men being associated with jobs such as mechanic and leader, and women with jobs such as nurse and receptionist.
The previous visual analyses demonstrate the considerable differences in the jobs associated with women and men, and further show that these gender splits remain when intersections are added. Next, we address the question: quantitatively, how important are gendered intersections in determining the job returned by GPT-2?
Tab. 3 presents summary results from 262 logistic regressions, which predict the likelihood of a job being associated with a given sentence prompt. We focus on two metrics indicating how often the addition of regressors adds explainability of the outcome: (i) the proportion of regressions where the woman dummy and the interactions were significant ($p < 0.05$), and (ii) the change in pseudo-$R^2$ on the addition of the woman dummy and the interactions. Statistical results, including the coefficients, for all regressions are in Appendix D. The aggregated results in Tab. 3 show that the woman dummy is frequently significant, most commonly so in ethnicity regressions (71%) and least commonly in political regressions (59%). Adding a woman dummy increases the model pseudo-$R^2$ on average by +3.3 percentage points, signifying that gender explains additional variation in job prediction. Interactions are significant in approximately one third of regressions, but the additional increase to pseudo-$R^2$ is on average smaller (+0.4 percentage points).

We use the McFadden $R^2$, which is calculated by comparing the log-likelihood of a model with no predictors, $L_0$, versus the log-likelihood of the estimated model, $L_M$: $R^2_{\text{McF}} = 1 - \ln(L_M)/\ln(L_0)$.
Figure 7. Man-Woman Occupational Split by Sexuality.
Figure 8. Man-Woman Occupational Split by Political Affiliation.
Figure 9. Man-Woman Occupational Split by Name Origin.

Table 3. Aggregated logistic regression results. We fit a total of 262 logistic regressions and report the proportion of regressions in which the independent variables contributed significantly to the logistic model, as well as their average contribution to the pseudo-$R^2$ (in percentage points; the interaction $\Delta R^2$ is reported once per category for the full set of interaction terms).

Category (regressions)   Term                 Prop. significant   $\Delta R^2$
Ethnicity (55)           woman                0.71                3.22
                         woman:asian          0.29                0.40
                         woman:black          0.36
                         woman:hispanic       0.38
                         woman:white          0.16
Religion (64)            woman                0.61                3.31
                         woman:buddhist       0.19                0.39
                         woman:christian      0.27
                         woman:hindu          0.27
                         woman:jewish         0.33
                         woman:muslim         0.25
Sexuality (72)           woman                0.61                3.36
                         woman:lesbian        0.35                0.45
                         woman:straight       0.26
Political (71)           woman                0.59                3.47
                         woman:conservative   0.24                0.46
                         woman:liberal        0.30

There is some variation in the significance of interactions; for example, {woman:hispanic} and {woman:black} are more frequently significant than {woman:white}, and {woman:lesbian} is more significant than {woman:straight}. These results suggest that some intersections are more salient in changing the returned job from a given sentence prompt, and may anchor GPT-2 on a stereotypical occupation set. In general, across a wide range of jobs, gender and intersectionality are significant determinants of the token returned by GPT-2.

A comparison of GPT-2's predictions to the true labor market distribution requires recent data disaggregated by gender and intersection for a granular set of occupations. The 2019 US Labor Force Statistics from the Current Population Survey (US Labor Bureau of Statistics, 2019) report the gender and ethnicity shares of workers in 567 occupational categories. We recognize a number of limitations of this data, which we address in the discussion. We first select the 50 most frequently mentioned jobs by GPT-2. Then from these, we match GPT-2's job tokens to real US occupation titles, finding correspondences for 41/50 titles (see Appendix C). We compute GPT-2's predicted proportional representation for each gender-ethnicity pair, assuming the percentage of women is equal across ethnicities. The 'predicted' labor force has equal representation across groups because we generate the same number of sentence prompts per pair ($n = 7{,}000$). This is not the case in reality, so the predicted proportions are scaled by the true distribution of gender and ethnicity reported in the US Labor Statistics and summarised in Appendix C. The scaling factor is

$$\gamma(c) = \frac{G(c)\,E(c)}{\hat{D}(c)},$$

where $G(c)$, $E(c)$ are the gender- and ethnicity-shares of the US data, respectively, and $\hat{D}(c) = 12.5\%$ is our artificial "population"-share:

$$\text{adj. Pred}(i, c) = \gamma(c) \times \text{Pred}(i, c), \qquad (2)$$

where $\text{Pred}(i, c)$ is the share of job $i$ for characteristics $c$. For jobs reported in the US data, we calculate the difference between the predicted proportions and the true proportions.
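To make the adjustment concrete, a minimal sketch of Eq. (2) in Python, using the shares tabulated in Appendix C (Tab. 8); the predicted share is hypothetical.

```python
def adjusted_prediction(pred, gender_share, ethnicity_share, d_hat=0.125):
    # gamma(c) = G(c) * E(c) / D_hat(c); adj. Pred = gamma(c) * Pred, Eq. (2).
    gamma = gender_share * ethnicity_share / d_hat
    return gamma * pred

# Example: Asian men have G = 0.530, E = 0.065, so gamma ~= 0.276 (Tab. 8).
print(adjusted_prediction(pred=0.10, gender_share=0.530, ethnicity_share=0.065))
```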
For a given job, how well does GPT-2 predict the gender-ethnicity split? There are three possible cases: GPT-2 overestimates the true representation of women in female-dominated jobs (exacerbates societal skew), GPT-2 matches the true proportional representation (directly inherits skew), or GPT-2 underestimates the true proportional representation (corrects for skew). In Fig. 1, we find that most predicted values lie close to the ground truth given by the identity line, indicating a high accuracy in prediction (all mean-squared errors are below 0.024). In particular, for the gender-ethnicity intersections, the low mean-squared errors indicate a considerable degree of similarity between GPT-2's predicted distribution and the ground truth distribution, especially for Asian and Black workers. Furthermore, it appears that GPT-2 pulls the distribution away from the extremes; that is, it under-predicts the extent of occupational segregation. This is demonstrated by the fact that GPT-2 predicts a higher proportion of women than the ground truth in male-dominated jobs with less than 25% women-share (on average +8.7%) and predicts lower proportions of women in jobs with more than 75% women-share (on average -6.5%). The exceptions to this pattern are courier, bus driver, and photographer, for which GPT-2 under-predicts the proportion of women, and social worker and model, for which GPT-2 over-predicts the proportion of women.
For a given gender-ethnicity pair, how well does GPT-2 predict the top jobs? This question aims to assess the extent of stereotyping in GPT-2's predictions. Tab. 4 shows the top five predicted and ground truth jobs for each intersection. GPT-2 predicts a high proportion of baseline women to be waitresses (14%), but only Hispanic women have waitress in their top five occupations according to the US Labor data. While GPT-2 predicts 18% of Hispanic women to be waitresses, in reality only 3% of Hispanic women in America work as waitresses. Some of this strong association may be because waitress is an inherently gendered job. GPT-2 also over-predicts the number of nurses, predicting 11% of women to be nurses when in reality only about 4% of American women are nurses. Security guard is consistently over-predicted for men of all ethnicities, yet it only appears as a top job for Black men, and at a lower frequency (2%) than the predicted frequency (8%). GPT-2 over-predicts the proportion of janitors for all ethnicities, especially for White and Asian men, for whom janitor does not appear as a top job.

The share of the most popular occupation for each gender is significantly higher for women (waitress at 14%) than for men (security guard at 8%). The cumulative share of the top five occupations is 41% for women, which is more than double the ground truth observation (17%). While GPT-2 also over-predicts the cumulative share of the top five occupations for men, the discrepancy to the US data is smaller (24% vs. 10%). While GPT-2's tendency to aggregate women into a small set of stereotypical jobs was identified in prior analysis (Fig. 3 and Tab. 2), the comparison to US data corroborates this result.
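The top-five summaries in Tab. 4 follow from a few lines of pandas; a sketch, reusing the hypothetical freq_matrix from the Sec. 3 example:

```python
def top_jobs(freq_matrix, k=5):
    # Convert counts to within-group shares, then report the k most
    # frequent jobs and their cumulative share per intersection.
    shares = freq_matrix.div(freq_matrix.sum(axis=1), axis=0)
    for group, row in shares.iterrows():
        top = row.nlargest(k)
        print(group, dict(top.round(2)), "cumulative:", round(top.sum(), 2))
```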
5. Discussion
Demographic distribution per occupation.
Overall, we find strong differences in the occupational tokens returned by GPT-2 for gendered sentence prompts. At first glance, it may seem biased that GPT-2 predicts so many women to be maids or secretaries and so few to be plumbers or truck drivers, but in fact, the model predicts less occupational segregation by gender as compared to the US ground truth distribution. It appears that GPT-2 is pulling the skews of the distribution found in reality towards gender parity.

For ethnicity, GPT-2 accurately predicts the distribution of occupations in real-world data with low mean-squared errors, especially for Asian and Black workers. In addition to gender and ethnicity, adding a religious intersection considerably changes the returned jobs, especially for men. For example, GPT-2 predicts 4% of Buddhist men to be monks. There are an estimated 3.75 million Buddhists in the US and approximately 1,000 Buddhist centers and monasteries (Pew Research, 2020; Institute for Genealogical Studies, 2020). A back of the envelope calculation shows each of these centers would need to employ more than 70 monks to reach the 4% threshold. Therefore, it is likely that GPT-2 infers too strong of an association between practising a religion and working in a religious profession. Intersections with continent-based names show that the returned occupations are more similar to those of baseline man and woman. This finding indicates that prompting GPT-2 with explicit intersections like 'Buddhist man' or 'Black woman' changes the probabilities of returned tokens to a greater extent than a name prompt, where GPT-2 must independently ascertain the gender and background of the individual.
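For concreteness, a rough version of the monk calculation above (the even male/female split of US Buddhists is our simplifying assumption):

$$0.04 \times \underbrace{\tfrac{1}{2} \times 3.75\,\text{M}}_{\text{US Buddhist men}} \approx 75{,}000 \text{ monks}, \qquad \frac{75{,}000 \text{ monks}}{\sim 1{,}000 \text{ centers}} \approx 75 \text{ monks per center}.$$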
Occupation distribution per demographic.
Despite reflecting the gender-ethnicity proportions per real-world occupation, GPT-2 notably displays a bias towards predicting greater occupational clustering for both genders, especially for women. The Gini coefficients confirm that the female distribution is more unequal than that of men. For the predictions generated by GPT-2, a larger and more diverse set of occupations is associated with men than with women. Gender-ethnicity predictions do not deviate much from the predictions for baseline man and woman. This signifies that GPT-2 predicts the occupations for women with less variety than for men, regardless of ethnicity.

This is a different kind of bias than that normally discussed in the algorithmic fairness literature. In reality, large proportions of women do work as secretaries, receptionists, and maids, and large proportions of men do work as mechanics, plumbers, and carpenters. Therefore, GPT-2's bias is not in the jobs associated with women, but in the rate at which it associates women with such a small set of jobs, a pattern exacerbated from the ground truth occupation data.
Table 4. Top five jobs per intersectional category, with associated proportions and the cumulative sum of the top five.

WOMAN
base      GPT-2: waitress (0.14), nurse (0.11), maid (0.06), receptionist (0.05), teacher (0.05) | Sum 0.41
          US: teacher (0.04), nurse (0.04), secretary/assistant (0.03), cashier (0.03), manager (0.03) | Sum 0.17
Asian     GPT-2: waitress (0.14), maid (0.11), nurse (0.08), teacher (0.05), receptionist (0.04) | Sum 0.42
          US: nurse (0.05), personal appearance worker (0.04), cashier (0.03), accountant/auditor (0.03), manager (0.03) | Sum 0.18
Black     GPT-2: waitress (0.18), nurse (0.10), maid (0.07), prostitute (0.05), teacher (0.04) | Sum 0.44
          US: nursing/home health aide (0.07), cashier (0.04), nurse (0.04), personal care aide (0.03), teacher (0.03) | Sum 0.21
Hispanic  GPT-2: waitress (0.16), nurse (0.14), receptionist (0.07), maid (0.07), teacher (0.04) | Sum 0.48
          US: maid/housekeeper/cleaner (0.05), cashier (0.04), waiter/waitress (0.03), secretary/assistant (0.03), nursing/home aide (0.03) | Sum 0.18
White     GPT-2: waitress (0.17), nurse (0.11), maid (0.07), teacher (0.05), receptionist (0.04) | Sum 0.44
          US: teacher (0.04), nurse (0.04), secretary/assistant (0.04), manager (0.03), cashier (0.03) | Sum 0.18

MAN
base      GPT-2: security guard (0.08), manager (0.05), waiter (0.04), janitor (0.04), mechanic (0.03) | Sum 0.24
          US: manager (0.04), truck driver (0.04), construction laborer (0.02), retail sales supervisor (0.02), laborer/material mover (0.02) | Sum 0.14
Asian     GPT-2: waiter (0.09), security guard (0.07), manager (0.04), janitor (0.04), chef (0.03) | Sum 0.27
          US: software developer (0.11), manager (0.04), physician/surgeon (0.02), teacher (0.02), engineer (0.02) | Sum 0.21
Black     GPT-2: security guard (0.08), waiter (0.07), bartender (0.05), janitor (0.05), mechanic (0.04) | Sum 0.29
          US: truck driver (0.06), laborer/material mover (0.04), janitor (0.03), manager (0.03), security guard (0.02) | Sum 0.18
Hispanic  GPT-2: security guard (0.09), janitor (0.07), waiter (0.07), bartender (0.05), manager (0.05) | Sum 0.33
          US: construction laborer (0.06), truck driver (0.04), grounds maintenance worker (0.03), carpenter (0.03), janitor (0.03) | Sum 0.19
White     GPT-2: waiter (0.06), security guard (0.06), janitor (0.05), mechanic (0.04), bartender (0.04) | Sum 0.25
          US: manager (0.04), truck driver (0.04), construction laborer (0.03), retail sales supervisor (0.02), laborer/material mover (0.02) | Sum 0.15
Limitations.
This paper is subject to a number of limitations. First, our chosen comparison to labor market data renders the ground truth baseline inherently US-centric. Second, without consistent, granular data on occupational splits by religion, sexuality, and political affiliation, we cannot comment on how accurately GPT-2 reflects the ground truth for these intersections. Third, for jobs in the informal sector, such as 'prostitute', we cannot compare to real-world incidences. Additionally, if terms such as 'prostitute' are commonly used as slurs, GPT-2 may display a bias towards over-estimating their proportion. Finally, by focusing only on two genders, the results do not adequately reflect occupational biases which may be associated with non-binary gender identities. Future research is recommended to investigate ground truth comparisons across a broader range of countries against the set of gender-intersections examined in this paper, and to comment on a broader spectrum of gender identities. Doing so would be valuable in establishing potential areas of bias which risk being inherited by downstream applications of generative language models such as GPT-2.
6. Conclusion
What should be the goal of generative language models? It is certainly appropriate that they should not exacerbate existing societal biases with regards to occupational segregation. It is less clear whether they should reflect or seek to correct skewed societal distributions. Compared to US data, we identify a bias towards returning a small number of stereotypical jobs too many times, especially for women. However, for a given job, we find that GPT-2 reflects societal skew and, in some cases, errs on the side of correcting for it. One proposed reason for this observed pattern is over-representation in the training data of 'exceptional cases'. If society expects women to be teachers and nurses, it is possible that there are more training examples scraped from social media platforms or newspaper articles of men occupying these stereotypes, or vice-versa with plumbers and software developers. It remains to be seen whether a larger training set improves or impairs occupational biases for intersections. Using the methodology developed in this paper, it is possible, and would merit future research, to determine the effect of model and training set size by comparing GPT-3 relative to its younger sibling analyzed in this work, GPT-2. This work presents the first comprehensive analysis of protected class intersections with gender in generative language models, and we hope that it will spark new research further investigating biases on topics relevant to downstream applications and intersectionality in AI more broadly.
Acknowledgements
This work has been supported by the Oxford AI student society, the EPSRC Centre for Doctoral Training in Autonomous Intelligent Machines & Systems [EP/L015897/1] (A.S., Y.M.A.), and the Economic and Social Research Council grant [ES/P000649/1] (H.K.). We also thank R. Maria del Rio-Chanona and Gesa Biermann for their useful comments.
References
Adiwardana, D., Luong, M.-T., So, D., Hall, J., Fiedel, N., Thoppilan, R., Yang, Z., Kulshreshtha, A., Nemade, G., Lu, Y., and Le, Q. V. Towards a human-like open-domain chatbot. arXiv, abs/2001.09977, 2020.

Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Conference on Fairness, Accountability, and Transparency (FAccT '21). ACM, New York, NY, USA, 2021.

Bhardwaj, R., Majumder, N., and Poria, S. Investigating gender bias in BERT. arXiv, abs/2009.05021, 2020.

Bhatia, V., Rawat, P., Kumar, A., and Shah, R. End-to-end resume parsing and finding candidates for a job description using BERT. arXiv, abs/1910.03089, 2019.

Blodgett, S. L., Barocas, S., Daumé, H., and Wallach, H. Language (technology) is power: A critical survey of "bias" in NLP. In ACL, 2020.

Bolukbasi, T., Chang, K.-W., Zou, J. Y., Saligrama, V., and Kalai, A. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In NeurIPS, 2016.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners, 2020.

Budzianowski, P. and Vulić, I. Hello, it's GPT-2 - how can I help you? Towards the use of pretrained language models for task-oriented dialogue systems. CoRR, abs/1907.05774, 2019.

Caliskan, A., Bryson, J., and Narayanan, A. Semantics derived automatically from language corpora contain human-like biases. Science, 356:183-186, 2017.

Crenshaw, K. Demarginalizing the intersection of race and sex: A Black feminist critique of antidiscrimination doctrine, feminist theory and antiracist politics. 1989.

Diaz, M., Johnson, I., Lazar, A., Piper, A., and Gergle, D. Addressing age-related bias in sentiment analysis. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, 2018.

Dinan, E., Fan, A., Williams, A., Urbanek, J., Kiela, D., and Weston, J. Queens are powerful too: Mitigating gender bias in dialogue generation. arXiv, abs/1911.03842, 2020.

Ethayarajh, K. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. CoRR, abs/1909.00512, 2019.

Fedus, W., Zoph, B., and Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity, 2021.

Foulds, J. and Pan, S. An intersectional definition of fairness. pp. 1918-1921, 2020.

Gonen, H. and Goldberg, Y. Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. arXiv, abs/1903.03862, 2019.

He, P., Liu, X., Gao, J., and Chen, W. DeBERTa: Decoding-enhanced BERT with disentangled attention. arXiv, abs/2006.03654, 2020.

Institute for Genealogical Studies. US: Religious Records-Part 2, 2020.

Kennedy, C. J., Bacon, G., Sahn, A., and von Vacano, C. Constructing interval variables via faceted Rasch measurement and multitask deep learning: a hate speech application. arXiv, abs/2009.10277, 2020.

Kiritchenko, S. and Mohammad, S. M. Examining gender and race bias in two hundred sentiment analysis systems. In *SEM@NAACL-HLT, 2018.

Kurita, K., Vyas, N., Pareek, A., Black, A., and Tsvetkov, Y. Measuring bias in contextualized word representations. arXiv, abs/1906.07337, 2019.

Li, C., Fisher, E. M., Thomas, R., Pittard, S., Hertzberg, V., and Choi, J. D. Competence-level prediction and resume and job description matching using context-aware transformer models. arXiv, abs/2011.02998, 2020.

Liu, H., Dacon, J., Fan, W., Liu, H., Liu, Z., and Tang, J. Does gender matter? Towards fairness in dialogue systems. In COLING, 2020.

Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J. R., Bethard, S., and McClosky, D. The Stanford CoreNLP natural language processing toolkit. In ACL (System Demonstrations), pp. 55-60, 2014.

Pew Research. Religious Landscape Study, 2020.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. 2019.

Sheng, E., Chang, K.-W., Natarajan, P., and Peng, N. The woman worked as a babysitter: On biases in language generation. arXiv, abs/1909.01326, 2019.

Stanovsky, G., Smith, N. A., and Zettlemoyer, L. Evaluating gender bias in machine translation. arXiv, abs/1906.00591, 2019.

Tan, Y. and Celis, L. Assessing social and intersectional biases in contextualized word representations. In NeurIPS, 2019.

Tatman, R. Gender and dialect bias in YouTube's automatic captions. In EthNLP@EACL, 2017.

US Labor Bureau of Statistics. Employed persons by detailed occupation, sex, race, and Hispanic or Latino ethnicity, 2019.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In NeurIPS, 2017.

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In BlackboxNLP@EMNLP, 2018.

Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In NeurIPS, 2019.

Wick, M. L., Silverstein, K., Tristan, J., Pocock, A. C., and Johnson, M. Detecting and exorcising statistical demons from language models with anti-models of negative data. arXiv, abs/2010.11855, 2020.

Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., and Le, Q. V. XLNet: Generalized autoregressive pretraining for language understanding. In NeurIPS, 2019.

Zhao, J., Zhou, Y., Li, Z., Wang, W., and Chang, K.-W. Learning gender-neutral word embeddings. In EMNLP, 2018.

Zhao, J., Wang, T., Yatskar, M., Cotterell, R., Ordonez, V., and Chang, K.-W. Gender bias in contextualized word embeddings. arXiv, abs/1904.03310, 2019.

Supplementary Material: How True is GPT-2? An Empirical Analysis of Intersectional Occupational Biases
Note on language used in this paper
In our paper, we focus on the occupational associations with binary gender identities, i.e. "man" and "woman". While we do sometimes refer to jobs dominated by women as 'female-dominated jobs', we do not make an explicit comparison to sex, i.e. prompting GPT-2 with 'The female worker is a...'. We feel strongly about the importance of studying non-binary gender and of ensuring the field of machine learning and AI does not diminish the visibility of non-binary gender identities. In future work, we hope to extend our analysis with the same data collection pipeline. For example, womxn is an umbrella term used in the intersectional feminist community to be inclusive of transgender women and non-binary individuals. The sentences returned when prompting GPT-2 with 'womxn' are primarily of two types: (i) stereotypical job associations, e.g. 'drag queen', 'feminist', 'crossdresser' or 'nurse', and (ii) failures to recognize 'womxn' as a person noun, e.g. 'The womxn works as a kind of a noodle shop', 'The womxn works as a battery', 'The womxn works as a mauve-wool hat' or 'The womxn works as a kind of virtual sex toy'. These preliminary findings suggest it is critical for future work to study occupational biases with non-binary gender identities in generative language models.
A. GPT Model Downloads
We select the most downloaded version of GPT-2 available on HuggingFace as a proxy for popularity in use-cases by experts and non-experts alike. In Tab. 5 we show the original GPT-2 models released by OpenAI (Radford et al., 2019) and available on HuggingFace. The small version has an order of magnitude more downloads than the medium and XL versions. Further, larger versions of GPT-2 have been shown to have an increased capability to memorize training information, introducing privacy concerns (Carlini, 2020). Finally, while the environmental cost of inference is cheap, Bender et al. (2021) discuss how the environmental impact of training scales with model size, and the associated consequences likely disproportionately affect marginalized populations.
Table 5. GPT-2 models available on HuggingFace by total number of downloads (accessed 3rd February 2021).

Model
GPT-2 (small)
GPT-2 Medium
GPT-2 Large
GPT-2 XL
B. Comparison with XLNet
XLNet sample generation.
In addition to the suite of models released by OpenAI, XLNet is a generalized autoregressive pre-training method which outperforms BERT across a number of benchmark tasks (Yang et al., 2019). To assess the generalizability of our findings, we generate 7,000 sentences for the gender-occupation template (X = {}, Y = {Man, Woman}) and analyze the returned occupational tokens from XLNet. Out of the total 14,000 returned sentences, 4,442 had no title recognized by the Stanford NLP Named Entity Recognizer. This sample loss of 31.7% is higher than for GPT-2 (Tab. 6). A plausible reason for this higher sample loss is the way XLNet generates text, which includes extra inverted commas.

Table 6. Sample loss from sentences with no detected job title.

Model   Template            Missing Titles   Sample Loss
GPT-2   Gender-occupation   20,689           10.6%
GPT-2   Names-occupation    39,203           19.6%
XLNet   Gender-occupation    4,442           31.7%
Distributional Analysis.
Fig. 10 shows the rank of jobs against the cumulative share. While 11 jobs account for 50% of the outputs for men, only 5 jobs account for the same share for women. Similarly, considering 90% of the output, women are associated with fewer jobs than men (31 vs. 46, respectively). This disparity is similar to the one that we found in GPT-2, suggesting that XLNet also predicts a wider variety of jobs for men and a narrower set of jobs for women.
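The 50%/90% coverage counts quoted here and in Sec. 4 follow from a simple cumulative-share computation; a sketch with hypothetical frequencies:

```python
import numpy as np

def jobs_covering_share(counts, share):
    # Number of top-ranked jobs needed to cover `share` of all mentions.
    c = np.sort(np.asarray(counts, dtype=float))[::-1]
    cumulative = np.cumsum(c) / c.sum()
    return int(np.searchsorted(cumulative, share) + 1)

counts = [200, 90, 40, 20, 10, 5, 5]  # hypothetical job frequencies
print(jobs_covering_share(counts, 0.5), jobs_covering_share(counts, 0.9))
```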
Table 7. XLNet: Top five jobs for base man and base woman.

        Jobs (Proportions)                                                              Sum
Woman   maid (0.27), waitress (0.14), prostitute (0.05), servant (0.04), nurse (0.04)   0.54
Man     carpenter (0.11), mechanic (0.07), maid (0.05), waiter (0.05), taxi driver (0.04)   0.32
Top occupations.
Tab. 7 shows the top five jobs for men and women as predicted by XLNet. Similar to our observations for gender differences predicted by GPT-2, we see a higher cumulative share in the top jobs for women as compared to men. The top job for woman (maid at 27%) represents a substantially larger proportion than the top job for man (carpenter at 11%). Interestingly, men are predicted to be maids 5% of the time, a pattern that we did not see with GPT-2. Fig. 11 shows the proportion of genders in all jobs mentioned more than 35 times for baseline man and woman. This is the same threshold as the one we used to calculate the analogous gender parity graph for GPT-2 jobs. Men and women are associated with stereotypical jobs, but slightly different ones than those predicted by GPT-2. In this case, we see that men are associated with a variety of jobs, such as courier, barber, teller, magician, and builder. Women are, yet again, associated with domestic and care-giving jobs, such as nanny, housewife, and nurse. Women are also highly associated with jobs such as gardener, bartender, secretary, and prostitute.
Figure 10. XLNet: Occupational distribution for men and women (baseline case). As with GPT-2, the job titles predicted by XLNet are less diverse and more stereotypical for women than for men: 5 jobs account for 50% of the outputs for women versus 11 for men, and 31 jobs account for 90% versus 46 for men.

Figure 11. XLNet: gender proportions when querying for the base case, i.e. X = {}, Y = {Man, Woman}, showing all jobs with more than 35 mentions, making up 65% of returned valid responses.

C. Processing
C.1. Named Entity Recognition
We used Stanford CoreNLP Named Entity Recognition (NER) to extract job titles from the sentences generated by GPT-2. Using this approach resulted in a sample loss of 10.6% for gender-occupation sentences and 19.6% for name-occupation sentences (see Tab. 6). The sample loss was due to Stanford CoreNLP NER not recognizing some job titles, e.g. "Karima works as a consultant-development worker", "The man works as a volunteer", or "The man works as a maintenance man at a local...".
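A sketch of this title-extraction step using stanza's CoreNLP client; this assumes a local Stanford CoreNLP installation, and the TITLE tag comes from CoreNLP's fine-grained English NER.

```python
from stanza.server import CoreNLPClient

sentences = ["The woman works as a waitress and a maid."]

# The client launches and manages a local CoreNLP Java server.
with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "ner"],
                   be_quiet=True) as client:
    for text in sentences:
        ann = client.annotate(text)
        for sent in ann.sentence:
            for mention in sent.mentions:
                if mention.entityType == "TITLE":   # job titles
                    print(mention.entityMentionText)
```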
C.2. Adjustment Factors
When comparing to the US data, some adjustments are made to ensure fair comparison. Firstly, there are no breakdowns by gender and ethnicity in the US Labor Bureau data, so we assume the proportion of women is equal across ethnicities. Secondly, for each gender-ethnicity pair, we generate the same number of sentence prompts per pair ($n = 7{,}000$). This implies the 'predicted' labor force has equal representation across groups, which is not the case in reality. Accordingly, the predicted proportions are scaled by the true distribution of gender and ethnicity reported in the US Labor Statistics. The scaling factor is

$$\gamma(c) = \frac{G(c)\,E(c)}{\hat{D}(c)},$$

where $G(c)$, $E(c)$ are the gender- and ethnicity-shares of the US data, respectively, and $\hat{D}(c) = 12.5\%$ is our artificial "population"-share:

$$\text{adj. Pred}(i, c) = \gamma(c) \times \text{Pred}(i, c), \qquad (3)$$

where $\text{Pred}(i, c)$ is the share of job $i$ for characteristics $c$. Tab. 8 shows the true proportions and the steps made in the adjustment process.
Table 8. Adjustment calculations.

Group            US Eth. (E)   US Gender (G)   G-E Distr. (D = G × E)   GPT Distr. (D̂)   Correction (γ)
Man              NA            0.530           0.530                    0.500             1.060
Woman            NA            0.470           0.470                    0.500             0.940
Asian Man        0.065         0.530           0.034                    0.125             0.276
Asian Woman      0.065         0.470           0.031                    0.125             0.244
Black Man        0.123         0.530           0.065                    0.125             0.522
Black Woman      0.123         0.470           0.058                    0.125             0.462
Hispanic Man     0.176         0.530           0.093                    0.125             0.746
Hispanic Woman   0.176         0.470           0.083                    0.125             0.662
White Man        0.777         0.530           0.412                    0.125             3.294
White Woman      0.777         0.470           0.365                    0.125             2.922

C.3. Matching GPT-2 and US Jobs
The US data has four nested levels of disaggregation, e.g. Management, professional, and related occupations → Professional and related occupations → Computer and mathematical occupations → Computer programmer. For GPT-2's 50 most frequently mentioned jobs, we match the GPT-2 job title to one in the US data at the lowest nested level, apart from 'salesperson' and 'manager', which are too general to match to the lowest disaggregation. For these, we match to 'sales and related occupations' and 'management occupations', respectively. In total, we find correspondences for 41/50 jobs. Jobs were not matched for three reasons: (i) there were too many varied mentions of a job, e.g. 'clerk' was associated with 25 different jobs spanning the finance, law, and hospitality sectors; (ii) there was no match for a job, e.g. 'prostitute' and 'translator'; (iii) the jobs were inherently gendered, e.g. 'waitress' and 'salesman'. There are two further considerations in matching. First, when a GPT-2 job is less general than the US categories. For example, while GPT-2 gave separate predictions for taxi drivers and chauffeurs, the US data only reports 'taxi drivers and chauffeurs'. Similarly, while GPT-2 gives separate predictions for maids, housekeepers, and cleaners, the US category amalgamates these into 'maids and housekeeping cleaners'. For these cases, we average across GPT-2's predictions for the relevant jobs, i.e. combining the predictions for maid, housekeeper, and cleaner. Second, when GPT-2's predictions are more general than the US categories, for example, when GPT-2 returns the token 'teacher' but the US data reports 'postsecondary teachers', 'preschool and kindergarten teachers', etc. For these cases, we sum across the US sub-categories. Tab. 9 gives details on these matches.
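A small sketch of the two matching adjustments, with hypothetical share values:

```python
# GPT-2 finer than the US category: average the GPT-2 predictions.
gpt_share = {"maid": 0.07, "housekeeper": 0.03, "cleaner": 0.02}
gpt_maids_housekeeping = sum(gpt_share.values()) / len(gpt_share)

# GPT-2 coarser than the US categories: sum the US sub-categories.
us_share = {
    "postsecondary teachers": 0.010,
    "preschool and kindergarten teachers": 0.007,
    "elementary and middle school teachers": 0.020,
    "special education teachers": 0.003,
}
us_teachers = sum(us_share.values())
print(gpt_maids_housekeeping, us_teachers)
```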
Table 9. Job matches between GPT-2 predicted jobs and US data.

GPT-2 job → US data category:
babysitter → Childcare workers
secretary / assistant → Secretaries and administrative assistants
receptionist → Receptionists and information clerks
cleaner / housekeeper / maid → Maids and housekeeping cleaners
nurse → Registered nurses
social worker → Social workers
teacher → Postsecondary teachers; Preschool and kindergarten teachers; Elementary and middle school teachers; Special education teachers
model → Models, demonstrators, and product promoters
writer → Writers and authors
barista → Counter attendants, cafeteria, food concession, and coffee shop
bartender → Bartenders
photographer → Photographers
bus driver → Bus drivers
reporter / journalist → News analysts, reporters and correspondents
cook → Cooks
doctor → Physicians and surgeons
manager → Management occupations
janitor → Janitors and building cleaners
lawyer → Lawyers
barber → Barbers
chef → Chefs and head cooks
guard / security guard / bouncer → Security guards and gaming surveillance officers
courier → Couriers and messengers
computer programmer → Computer programmers
police officer → Police and sheriff's patrol officers
taxi driver / chauffeur / driver → Taxi drivers and chauffeurs
truck driver → Driver/sales workers and truck drivers
construction worker / laborer → Construction laborers
carpenter → Carpenters
plumber → Pipelayers, plumbers, pipefitters, and steamfitters
mechanic → Automotive service technicians and mechanics
salesperson → Sales and related occupations

EXCLUDED JOBS:
clerk (too many sub-categories); technician (too many sub-categories); consultant (no entry); contractor (no entry); prostitute (no entry); translator (no entry); salesman (gendered title); waitress (gendered title); waiter (gendered title)
D. Regression Analysis
D.1. Percentage of Significant Coefficients
Tab. 10 shows the percentage of significant coefficients for each intersection. To produce these results, we run regressions for all jobs mentioned more times than the same threshold values used in the paper. Each regression includes all main effects and interaction terms. We then compute the percentage of significant coefficients for each term across all regressions, with baseline man as the reference group. We repeat these steps for each intersection: ethnicity, religion, sexuality, and political affiliation. We did not run regressions for continent name origin because there was no suitable baseline category, given every first name has geographic and gender associations.

Considering religion, the Buddhist term has the highest percentage significance across all regressions (78%), while the Hindu term has the lowest (55%). This supports the findings in the paper that some religions are stronger determinants of jobs than others. Of the interaction terms, woman:buddhist is the least significant (19%). This finding suggests that male jobs are more highly determined by Buddhist membership, but female jobs are less strongly associated with this affiliation. Considering ethnicity, the Hispanic term is most commonly significant (64%), while the Asian term is less commonly significant (42%). The interactions for Hispanic and Black women are more frequently significant than those for White and Asian women. This finding suggests some ethnicity-gender pairs more saliently affect GPT-2's priors on job associations. Considering sexuality, both sexuality categories (gay/straight) are significant in approximately 50% of regressions. A woman's intersectional association with being lesbian is more commonly significant than an association with being straight. Considering political affiliation, the liberal term is more commonly significant than the conservative term, and the same pattern applies to gender-political interaction terms.

Finally, we can compare the average significance of categories, gender, and their intersections across the religion, ethnicity, sexuality, and political regressions. Religion main effects are on average significant in 66% of regressions, ethnicity main effects in 53% of regressions, sexuality main effects in 48% of regressions, and political main effects in 60% of regressions. This suggests that, for men, there is higher across-religion variation in predicted jobs than, say, across-sexuality variation. The woman dummy is significant in 61% of religion regressions, in 71% of ethnicity regressions, in 61% of sexuality regressions, and in 59% of political regressions. This finding demonstrates that the woman-man variation is most influential in distinguishing between job affiliations for ethnicity and least influential for political affiliation. Across all regressions, the woman dummy is highly significant, suggesting gender is an important determinant of job predictions. Finally, the interaction terms are significant in 26% of religion regressions, in 30% of ethnicity regressions, in 31% of sexuality regressions, and in 27% of political regressions. This suggests that, for women, sexuality and ethnicity are stronger determinants of job associations. Interaction terms are significant in approximately one-third of regressions, while the woman dummy is significant in approximately two-thirds of regressions. This finding suggests that, while intersectionality is a relevant determinant of the predicted job, gender more strongly influences GPT-2's priors over occupational associations.
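The percentages in Tab. 10 can be computed by stacking the p-values of the fitted models; a sketch, where results is a hypothetical list of fitted statsmodels logit results, one per job:

```python
import pandas as pd

def significance_shares(results, alpha=0.05):
    # Fraction of job regressions in which each term is significant.
    pvalues = pd.DataFrame([res.pvalues for res in results])
    return (pvalues < alpha).mean().round(2)
```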
Table 10. Percentage of significant coefficients in logistic regressions, by intersection.

RELIGION                 ETHNICITY               SEXUALITY              POLITICAL
Intercept          0.94  Intercept         0.95  Intercept        0.90  Intercept           0.92
buddhist           0.78  asian             0.42  gay              0.51  conservative        0.55
christian          0.69  black             0.55  straight         0.44  liberal             0.66
hindu              0.55  hispanic          0.64  woman            0.61  woman               0.59
jewish             0.66  white             0.49  woman:lesbian    0.35  woman:conservative  0.24
muslim             0.64  woman             0.71  woman:straight   0.26  woman:liberal       0.30
woman              0.61  woman:asian       0.29
woman:buddhist     0.19  woman:black       0.36
woman:christian    0.27  woman:hispanic    0.38
woman:hindu        0.27  woman:white       0.16
woman:jewish       0.33
woman:muslim       0.25
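To make this procedure concrete, the following is a minimal sketch in Python with pandas and statsmodels, not the authors' released code. It assumes a DataFrame df with one row per GPT-2 completion and columns 'job' (the predicted occupation), 'woman' (a 0/1 dummy, so woman = 0 recovers the baseline man) and a categorical demographic column such as 'religion'; these column names are illustrative assumptions.

import pandas as pd
import statsmodels.formula.api as smf

def pct_significant(df: pd.DataFrame, category: str, threshold: int,
                    alpha: float = 0.05) -> dict:
    """For every job mentioned more than `threshold` times, fit a logistic
    regression of the job indicator on gender, the demographic category and
    their interaction, then report the share of fitted regressions in which
    each term is significant at level `alpha`."""
    counts = df["job"].value_counts()
    frequent_jobs = counts[counts > threshold].index
    sig, n_models = {}, 0
    for job in frequent_jobs:
        data = df.assign(y=(df["job"] == job).astype(int))
        try:
            fit = smf.logit(f"y ~ woman * C({category})", data=data).fit(disp=0)
        except Exception:  # no convergence: insufficient demographic variation
            continue
        n_models += 1
        for term, p in fit.pvalues.items():
            sig[term] = sig.get(term, 0) + int(p < alpha)
    return {term: count / max(n_models, 1) for term, count in sig.items()}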
D.2. All Regression Results
Fig. 12 presents the significant p-values in all regressions for main effects and interaction terms. Significant p-values (p < 0.05) are shaded in black, while non-significant terms are left white. Considering for example ethnicity, there are two axes of variation. First, some jobs have significant p-values across all terms, such as supervisor and teacher, indicating that these jobs are highly segmented by gender and by ethnicity, but also by their interaction. Jobs with no significant p-values represent cases where the model did not converge, which occurred when there was insufficient variation across different demographics. In Fig. 13, we present the direction and magnitude of the significant coefficients. Negative coefficients, i.e. those that make the job prediction less likely, are shaded in red; positive coefficients, i.e. those that make the job association more likely, are shaded in blue; insignificant coefficients (p > 0.05) are left white. A darker color indicates a larger coefficient magnitude. We present all the results so an interested reader can select a certain job and find the associated coefficients for gender and intersections, alongside their interaction terms.

Finally, Fig. 14 presents the change in pseudo-R² for all job regressions across intersections when the woman dummy is added and when the interaction terms are added. To produce these results, we first run a regression with all the main effects of categorical membership, e.g. ('Asian', 'Black', 'Hispanic', 'White'), but without the woman dummy. Given that baseline 'man' is the reference group, all gender variation resides in the intercept. Next, we re-add the woman dummy and observe how the model fit improves. Finally, we run a regression with all main effects and all interaction terms and see what additional variation is explained; a minimal sketch of this nested comparison is given below. The general pattern is that the woman dummy has a greater effect on the model fit than the interactions. This suggests that while interaction terms for intersectional associations are significant in approximately one-third of job regressions, they explain a lower proportion of variation than gender. Once again, there is considerable variation by job and by intersection, so for detailed insights we invite readers to examine particular occupation-demographic patterns.
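The nested-model comparison behind Fig. 14 can be sketched as follows, under the same assumed data layout as in Appendix D.1. We use statsmodels' prsquared (McFadden's pseudo-R²) as a stand-in; the exact pseudo-R² variant used is an assumption.

import statsmodels.formula.api as smf

def pseudo_r2_gains(data, category):
    """Fit three nested logistic models for one job indicator `y` and return
    the gain in pseudo-R^2 from (i) adding the woman dummy and (ii) adding
    the gender-category interaction terms."""
    base = smf.logit(f"y ~ C({category})", data=data).fit(disp=0)
    plus_woman = smf.logit(f"y ~ woman + C({category})", data=data).fit(disp=0)
    full = smf.logit(f"y ~ woman * C({category})", data=data).fit(disp=0)
    return {"add_woman_dummy": plus_woman.prsquared - base.prsquared,
            "add_interactions": full.prsquared - plus_woman.prsquared}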
[Figure 12 spans four panels (ETHNICITY, RELIGION, SEXUALITY, POLITICAL); in each panel, rows are the regression terms (intercept, main effects, woman dummy, interactions) and columns are jobs.]
Figure 12. Significant p-values (p < 0.05) for job regressions: significant (black), non-significant (white).
[Figure 13 spans four panels (ETHNICITY, RELIGION, SEXUALITY, POLITICAL) with the same term-by-job layout as Figure 12.]
Figure 13. Significant coefficients for job regressions: negative (red), positive (blue), and insignificant (white).
[Figure 14 spans four panels (ETHNICITY, RELIGION, SEXUALITY, POLITICAL); for each job, bars show the pseudo-R² gain from adding the woman dummy and from adding the interaction terms.]
Figure 14. Change in pseudo-R² from the addition of the woman dummy and the interaction terms for job regressions. The plots show that adding the woman dummy has a greater effect on pseudo-R² than adding the interaction terms.

E. Further Analysis for Intersectional Breakdowns
Distributional Analysis.
Fig. 15 shows the distributional analysis for man and woman by intersection. The distributions for the ethnicity, religion, and sexuality intersections show that the job titles predicted by GPT-2 are less diverse and more stereotypical for women than for men. For political intersections and for continent-based name intersections, the disparity is not as apparent: in these latter two cases, the distributions of jobs predicted for men and women are more similar.

[Figure 15: six panels (base, ethnicity, religion, sexuality, political, continent), each plotting share of total against log(rank) for women (W) and men (M).]
Figure 15. Occupational distribution for men and women by intersection. With the exception of the continent name origin intersection (bottom-right), all other intersections show that the job titles predicted by GPT-2 are less diverse and more stereotypical for women than for men.
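The curves in Fig. 15 can be computed along the following lines, assuming a pandas Series of predicted job strings for one gender-intersection group (a sketch, not the released code):

import numpy as np
import pandas as pd

def rank_share(jobs: pd.Series) -> pd.DataFrame:
    """Each job's share of all mentions against its log-rank, the quantity
    plotted per gender-intersection group in Fig. 15."""
    counts = jobs.value_counts()  # sorted most-frequent first, i.e. by rank
    return pd.DataFrame({
        "log_rank": np.log(np.arange(1, len(counts) + 1)),
        "share_of_total": (counts / counts.sum()).to_numpy(),
    }, index=counts.index)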
Lorenz Curve Analysis.
Fig. 16 shows the Lorenz curve for men and women by intersection. With the exception of intersections with continent-based names, women are concentrated in a smaller number of job titles than men. This can be seen clearly in Fig. 17, which zooms in on the lower portion of the curve. We see that the largest distributional differences are in the religion and sexuality intersections. The difference is smaller for political intersections, agreeing with our finding in the paper that political affiliation has less of an effect by gender on GPT-2's occupational predictions. The curves for continent-based name intersections are nearly identical, suggesting that GPT-2 predicts a distribution with less disparity when it is prompted with first names rather than an explicit intersection, e.g. 'Black woman'/'Buddhist man'.

[Figure 16: six panels (base, ethnicity, religion, sexuality, political, continent), each plotting cumulative share of jobs against cumulative share of total workers for women (W) and men (M).]
Figure 16. Lorenz curve for men and women by intersection. For all intersections except continent-based names, the occupations predicted for women are concentrated in a smaller number of job titles compared to men.
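A minimal sketch of the Lorenz-curve computation, assuming job titles are ordered from most to least frequent (the ordering consistent with the plotted curves):

import numpy as np

def lorenz_curve(job_counts):
    """Cumulative share of job titles (y) against cumulative share of
    workers (x), with jobs ordered from most to least frequent (Fig. 16)."""
    counts = np.sort(np.asarray(job_counts, dtype=float))[::-1]
    cum_workers = np.cumsum(counts) / counts.sum()          # x-axis
    cum_jobs = np.arange(1, len(counts) + 1) / len(counts)  # y-axis
    return cum_workers, cum_jobs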
[Figure 17: the same six Lorenz-curve panels as Figure 16, restricted to the lower region of the y-axis.]
Figure 17. Focused Lorenz curve (lower region of the y-axis) for men and women by intersection. The largest distributional difference is in the religion intersection, whereas the smallest is in the continent-based name origin.

Occupations by intersections.
In each of the stacked bar charts, we show the man-woman share of occupations for each gender-intersection pair. In Fig. 18, the majority of jobs remain split across all four ethnicities; there are no jobs dominated by a single ethnicity. In Fig. 19, the distribution of religion for each job is relatively equal, with the exception of a few jobs: for example, monks are composed mostly of Buddhist men and nuns mostly of Buddhist women, an observation noted in the paper. As expected, religious occupations tend to be dominated by one or two religions, while non-religious occupations are more evenly distributed across religions.
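The within-job compositions plotted in Figs. 18-22 can be computed along these lines (a sketch; the 'woman' and demographic column names are assumptions carried over from the earlier sketches):

import pandas as pd

def job_composition(df: pd.DataFrame, category: str) -> pd.DataFrame:
    """Within-job shares across gender-category pairs (each row sums to 1),
    the quantity shown in the stacked bar charts."""
    table = pd.crosstab(df["job"], [df["woman"], df[category]])
    return table.div(table.sum(axis=1), axis=0)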
[Figure 18: stacked bar chart of within-job shares for Asian, Black, Hispanic and White men and women.]
Figure 18. Man-woman share by ethnicity for all jobs with more than 140 mentions, making up 82% of returned valid responses.

[Figure 19: stacked bar chart of within-job shares for Buddhist, Christian, Hindu, Jewish and Muslim men and women.]
Figure 19. Man-woman share by religion for all jobs with more than 175 mentions, making up 84% of returned valid responses.

In Fig. 20, a number of jobs are dominated by one sexuality. For example, occupations such as detective, plumber, and guard are dominated by straight men, whereas occupations such as massage therapist, counsellor, and graphic designer are dominated by lesbian women. Some more female jobs, such as social worker, prostitute and housewife, are associated with gay men, but the overall share of men remains low. In Fig. 21, fewer jobs are dominated by one political affiliation, especially at the extremes of the distribution, mirroring our observation from the Lorenz curves. However, there are a few exceptions: occupations such as banker and critic are dominated by liberal men, driver and host by conservative men, and barista and translator by liberal women. Among women, drivers skew conservative, but the overall share of women in this job is low.
[Figure 20: stacked bar chart of within-job shares for lesbian/gay and straight men and women.]
Figure 20. Man-woman share by sexuality for all jobs with more than 70 mentions, making up 83% of returned valid responses.

[Figure 21: stacked bar chart of within-job shares for conservative and liberal men and women.]
Figure 21. Man-woman share by political affiliation for all jobs with more than 70 mentions, making up 82% of returned valid responses.

Lastly, in Fig. 22, we see that no jobs are dominated by one continent-based name origin, and there is less gender disparity in the jobs predicted by GPT-2. This agrees with the observations from the Lorenz curves. When GPT-2 is prompted with a first name, gender is a greater predictor of job titles than the geographic origin of the name, but the gender split is still less stark than for explicit 'man'/'woman' prompts.
[Figure 22: stacked bar chart of within-job shares by continent name origin (Africa, Americas, Asia, Europe, Oceania) for men and women.]
Figure 22. Man-woman share by continent name origin for all jobs with more than 500 mentions, making up 76% of returned valid responses.

E.1. Most Frequent Jobs Per Gender-Intersection
Tab. 11 shows the top five jobs per intersectional category, with the associated proportions of the category total. In general, the top five jobs for women of all intersections (except continent-based names) do not deviate far from the top five jobs predicted for the baseline woman case. In fact, the top job predicted for baseline women, waitress, is within the top five predicted jobs for women of all intersections, at similar proportions.

The top five jobs for men across intersections (again excepting continent-based names) show more variety relative to the baseline man case. While security guard (the top job predicted for baseline men) is still one of the most common jobs for men across intersections, it is absent from the top jobs of some (i.e. Buddhist, Christian, Jewish and liberal men). Of the religion intersections, only Hindu and Muslim men are predicted to be security guards, raising the question of whether GPT-2 associates some religions differently with religious and non-religious occupations (i.e. treats Muslim and Hindu men differently from Christian, Buddhist, and Jewish men). For political intersections, the job distributions for liberal and conservative men vary more from the distribution for baseline men, with top jobs not seen before, such as writer, journalist, consultant, and lawyer.

The exception to these patterns is the set of jobs predicted for continent-based name origins. For jobs predicted by name, the top jobs look similar across gender: writer, consultant, journalist, and lawyer. This suggests that if we do not prompt GPT-2 with an explicit gender (man/woman), it predicts a similar set of jobs for men and women.
Table 11. Top five jobs per intersectional category, with associated proportions of the category total.
Base
  Woman: waitress (0.14), nurse (0.11), maid (0.06), receptionist (0.05), teacher (0.05)
  Man:   security guard (0.08), manager (0.05), waiter (0.04), janitor (0.04), mechanic (0.03)

Ethnicity: Asian
  Woman: waitress (0.14), maid (0.11), nurse (0.08), teacher (0.05), receptionist (0.04)
  Man:   waiter (0.09), security guard (0.07), manager (0.04), janitor (0.04), chef (0.03)
Ethnicity: Black
  Woman: waitress (0.18), nurse (0.10), maid (0.07), prostitute (0.05), teacher (0.04)
  Man:   security guard (0.08), waiter (0.07), bartender (0.05), janitor (0.05), mechanic (0.04)
Ethnicity: Hispanic
  Woman: waitress (0.16), nurse (0.14), receptionist (0.07), maid (0.07), teacher (0.04)
  Man:   security guard (0.09), janitor (0.07), waiter (0.07), bartender (0.05), manager (0.05)
Ethnicity: White
  Woman: waitress (0.17), nurse (0.11), maid (0.07), teacher (0.05), receptionist (0.04)
  Man:   waiter (0.06), security guard (0.06), janitor (0.05), mechanic (0.04), bartender (0.04)

Religion: Buddhist
  Woman: nurse (0.12), waitress (0.11), maid (0.09), teacher (0.08), cook (0.04)
  Man:   teacher (0.06), janitor (0.05), waiter (0.05), doctor (0.04), monk (0.04)
Religion: Christian
  Woman: waitress (0.13), nurse (0.12), maid (0.10), teacher (0.07), prostitute (0.06)
  Man:   clerk (0.06), doctor (0.04), waiter (0.04), janitor (0.04), teacher (0.04)
Religion: Hindu
  Woman: maid (0.18), waitress (0.12), nurse (0.06), teacher (0.05), cleaner (0.05)
  Man:   waiter (0.09), janitor (0.06), security guard (0.04), teacher (0.04), cleaner (0.03)
Religion: Jewish
  Woman: waitress (0.15), nurse (0.10), maid (0.09), teacher (0.06), prostitute (0.05)
  Man:   waiter (0.08), doctor (0.05), clerk (0.04), janitor (0.04), teacher (0.04)
Religion: Muslim
  Woman: waitress (0.16), maid (0.14), nurse (0.08), teacher (0.05), cook (0.04)
  Man:   waiter (0.11), security guard (0.06), janitor (0.06), taxi driver (0.05), mechanic (0.04)

Sexuality: Lesbian/Gay
  Woman: waitress (0.15), nurse (0.12), teacher (0.06), maid (0.06), receptionist (0.05)
  Man:   waiter (0.07), bartender (0.06), janitor (0.05), security guard (0.05), waitress (0.04)
Sexuality: Straight
  Woman: waitress (0.19), nurse (0.08), maid (0.07), teacher (0.04), receptionist (0.04)
  Man:   waiter (0.06), bartender (0.05), security guard (0.04), manager (0.04), clerk (0.04)

Political: Liberal
  Woman: waitress (0.12), nurse (0.08), writer (0.07), teacher (0.05), receptionist (0.05)
  Man:   writer (0.10), journalist (0.08), lawyer (0.08), consultant (0.06), waiter (0.05)
Political: Conservative
  Woman: waitress (0.13), nurse (0.08), receptionist (0.06), writer (0.05), consultant (0.05)
  Man:   consultant (0.09), lawyer (0.06), writer (0.05), security guard (0.05), reporter (0.05)

Continent: Africa
  Woman: writer (0.10), consultant (0.08), journalist (0.05), lawyer (0.04), teacher (0.04)
  Man:   writer (0.09), consultant (0.08), journalist (0.07), lawyer (0.05), translator (0.04)
Continent: Americas
  Woman: writer (0.10), consultant (0.08), journalist (0.05), lawyer (0.04), teacher (0.04)
  Man:   writer (0.10), consultant (0.10), journalist (0.06), lawyer (0.05), manager (0.04)
Continent: Asia
  Woman: writer (0.09), consultant (0.06), translator (0.05), journalist (0.05), teacher (0.04)
  Man:   consultant (0.10), writer (0.09), journalist (0.06), lawyer (0.04), translator (0.04)
Continent: Europe
  Woman: writer (0.10), consultant (0.07), journalist (0.05), nurse (0.05), teacher (0.04)
  Man:   writer (0.11), consultant (0.10), journalist (0.06), lawyer (0.04), producer (0.04)
Continent: Oceania
  Woman: writer (0.09), consultant (0.07), teacher (0.05), nurse (0.04), journalist (0.04)
  Man:   writer (0.11), consultant (0.08), journalist (0.05), teacher (0.04), lawyer (0.04)
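Tab. 11 can be reproduced from the completions with a simple group-by, sketched below; the 'group' column (e.g. 'Black woman', 'liberal man') is an assumed encoding of the gender-intersection pairs:

import pandas as pd

def top_jobs(df: pd.DataFrame, k: int = 5) -> dict:
    """Top-k jobs and their share of each group's total, as in Tab. 11."""
    out = {}
    for group, sub in df.groupby("group"):
        shares = sub["job"].value_counts(normalize=True).head(k)
        out[group] = [(job, round(share, 2)) for job, share in shares.items()]
    return out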
F. Further Analysis for US Comparison
F.1. Gender Predictions
Fig. 23 plots the percentage of women for each occupation as predicted by GPT-2 and as observed in the US Labor Bureau data. The bar plot shows the difference between the predicted percentage and the true percentage. We see that GPT-2 pulls the skewed real-life distribution towards gender parity. For example, GPT-2 predicts more women mechanics, carpenters, taxi drivers, and police officers than there are in real life, and fewer women secretaries, maids, nurses, and models than observed in reality. Together, these examples suggest that GPT-2 under-predicts the number of women in heavily women-dominated jobs and over-predicts the number of women in heavily men-dominated jobs. This supports our finding in the paper: although it may seem biased that GPT-2 predicts so many women to be secretaries and maids, the share of women within these occupations is actually higher in the US data.
Figure 23. GPT-2 predictions versus US data by gender share. Difference between the percentage of women predicted by GPT-2 and the percentage of women in the 2019 US Labor Force Statistics data, per occupation.
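The comparison behind Fig. 23 can be sketched as follows, assuming us_share is a Series of percentage-of-women values from the Labor Bureau data, indexed by the matched occupation titles (an assumed layout):

import pandas as pd

def gender_gap(completions: pd.DataFrame, us_share: pd.Series) -> pd.Series:
    """Percentage-point difference between GPT-2's predicted share of women
    per occupation and the US share, the quantity plotted in Fig. 23."""
    gpt_share = completions.groupby("job")["woman"].mean() * 100
    return (gpt_share - us_share).dropna().sort_values()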
F.2. Gender-Ethnicity Predictions
Fig. 24 presents the difference between the US data and GPT-2's predicted proportions of gender-ethnicity pairs for the top 50 most frequently mentioned jobs that aligned with US occupational categories. The jobs on the y-axis are sorted by the true share of women in the US data. In line with the low mean-squared errors presented in the paper, GPT-2 accurately predicts the gender-ethnicity split for a given job, especially for Asian and Black workers. For jobs with a wide gender split, GPT-2 seems to correct for societal skew. For example, it under-predicts the proportion of Hispanic women who are cleaners, housekeepers and maids by 34 percentage points. Similarly, it under-predicts the proportion of Black men who are taxi drivers, chauffeurs or drivers, and the proportion of Hispanic men who are mechanics, plumbers, carpenters and construction workers. The proportion of White workers is less accurately predicted, but the same pattern holds: under-predicting the proportion of women in female-dominated jobs and over-predicting the proportion of women in male-dominated jobs.
Figure 24. GPT-2 predictions versus US data by gender-ethnicity intersection. Red means that GPT-2 over-predicts the share of the occupation-ethnicity intersection pair; blue means that GPT-2 under-predicts it.
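The per-intersection comparison can be sketched as follows, assuming both frames are aligned on the matched job titles with one column per gender-ethnicity pair (column names like 'asian_W' are illustrative):

import pandas as pd

def intersection_errors(gpt: pd.DataFrame, us: pd.DataFrame):
    """Signed errors and per-intersection mean-squared error between GPT-2's
    predicted within-job shares and the US shares, as underlying Fig. 24."""
    diff = gpt - us                 # > 0 where GPT-2 over-predicts the share
    mse = (diff ** 2).mean(axis=0)  # one MSE per gender-ethnicity column
    return diff, mse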
G. Companies Using AI for Hiring
Gartner has identified various use cases where AI can be useful in the hiring process, such as talent acquisition and HR virtual assistants. A number of companies are already using AI in hiring, e.g. Aviro AI and Entelo. These companies have automated parts of the hiring process, reducing human involvement in the assessment of job applications. This can have serious implications for people from marginalized groups if the bias in the underlying AI models is not addressed.