Bias Out-of-the-Box: An Empirical Analysis of Intersectional Occupational Biases in Popular Generative Language Models
Hannah Kirk, Yennie Jun, Haider Iqbal, Elias Benussi, Filippo Volpin, Frederic A. Dreyer, Aleksandar Shtedritski, Yuki M. Asano
How True is GPT-2? An Empirical Analysis of Intersectional Occupational Biases
Hannah Kirk * 1 2
Yennie Jun * 1 2
Haider Iqbal * 1 3
Elias Benussi * 1 3
Filippo Volpin * 1 4
Frédéric A. Dreyer * 1 5
Aleksandar Shtedritski
Yuki M. Asano
Abstract
The capabilities of natural language models trained on large-scale data have increased immensely over the past few years. Downstream applications are at risk of inheriting biases contained in these models, with potential negative consequences especially for marginalized groups. In this paper, we analyze the occupational biases of a popular generative language model, GPT-2, intersecting gender with five protected categories: religion, sexuality, ethnicity, political affiliation, and name origin. Using a novel data collection pipeline we collect 396k sentence completions of GPT-2 and find: (i) The machine-predicted jobs are less diverse and more stereotypical for women than for men, especially for intersections; (ii) Fitting 262 logistic models shows intersectional interactions to be highly relevant for occupational associations; (iii) For a given job, GPT-2 reflects the societal skew of gender and ethnicity in the US, and in some cases, pulls the distribution towards gender parity, raising the normative question of what language models should learn. Code is available at https://github.com/oxai/intersectional_gpt2.
1. Introduction
The advent of deep learning and massive growth in training data have led to natural language models surpassing humans on numerous benchmarks (Wang et al., 2018; 2019; He et al., 2020; Adiwardana et al., 2020). However, as Bender et al. (2021) state, these models can exacerbate existing biases in data and perpetuate stereotypical associations to the harm of marginalized communities. Simultaneously, pre-trained models have become readily accessible via APIs, allowing non-experts to apply these tools in their own applications.

* Equal contribution. Oxford Artificial Intelligence Society; Oxford Internet Institute; Dept. of Computer Science; Dept. of Economics; Dept. of Physics; Dept. of Engineering Science; University of Oxford, Oxford, United Kingdom. Correspondence to: Hannah Kirk <[email protected]>. Preprint.
Figure 1. GPT-2 Monte-Carlo prediction vs. ground truth US population share. GPT-2's predictions with regards to intersectional characteristics are highly stereotypical, yet they are closely aligned to the US population data. We show the predicted values for gender intersected with ethnicity along with the mean-squared errors (Woman: 0.0238; Asian: 0.0004; Black: 0.0013; Hispanic: 0.0049; White: 0.0182) and annotate example jobs for the gender-only predictions.
These developments in generative language models substantiate a need to understand the potential for biases towards protected classes, such as gender and ethnicity. Within the specific context of fairness in AI-assisted hiring, we focus on analyzing one popular, publicly available language model (GPT-2). By generating 396k samples, we empirically analyze which occupations GPT-2 preferentially associates with intersections of gender and protected classes. The paper provides the following contributions:
• a detailed data collection protocol for studying intersectional biases of generative language models,
• the analysis of the intersectional biases of (gender × [ethnicity, religion, sexuality, political affiliation, name origin]) present in GPT-2,
• a comparison of GPT-2's predictions with real-world occupation frequencies, as shown in Fig. 1.
Figure 2. Data Collection Process. We collect 396,000 responses from GPT-2, and retrieve "titles" via Stanford CoreNLP's Named Entity Recognition (NER) to analyze the predicted occupational distribution for various intersectional categories. The diagram shows the prompt templates "The [X] [Y] works as a ..." and "[Z] works as a ...", where X ranges over Ethnicity (Asian, Black, Hispanic, White), Religion (Buddhist, Christian, Hindu, Jewish, Muslim), Sexuality (Gay/Lesbian, Straight), or Political affiliation (Conservative, Liberal); Y ∈ {Man, Woman}; and Z is a first name grouped by continent (Africa, Americas, Asia, Europe, Oceania). The returned job titles are tallied into per-group frequency tables.
2. Related Work
Bias in classical NLP models.
Extensive research has shown that unrestricted training of natural language models inherits human biases and, in some cases, amplifies them (Bolukbasi et al., 2016; Caliskan et al., 2017; Zhao et al., 2018; Gonen & Goldberg, 2019). Negative generalizations, stereotypes, or misrepresentations of particular social groups can be learned by generative language models (Blodgett et al., 2020). These learned associations can persist in tasks such as machine translation (Stanovsky et al., 2019), dialogue systems (Liu et al., 2020), and hate speech classifiers (Kennedy et al., 2020). There have been many attempts to identify, quantify, and de-bias context-independent word embeddings such as word2vec and GloVe (Bolukbasi et al., 2016; Zhao et al., 2019; Diaz et al., 2018). While these earlier works investigate low-capacity models, we analyze a high-capacity transformer-based generative language model, GPT-2 (Vaswani et al., 2017; Radford et al., 2019).
Bias in generative language models.
Transformer-based language models are increasingly replacing approaches using static word embeddings, and current state-of-the-art language models are trained on immense amounts of text collected from the Internet (Fedus et al., 2021; Radford et al., 2019; Brown et al., 2020). The vastly improved model performance and capacity has motivated many downstream applications, such as dialogue generation (Dinan et al., 2020), automatically generated video captions (Tatman, 2017), and job application assessment (Bhatia et al., 2019; Li et al., 2020). In terms of models, GPT-2, a transformer-based language model trained on 40GB of text from over 8 million documents from the Internet, has been the subject of various studies (Wick et al., 2020; Budzianowski & Vulic, 2019; Ethayarajh, 2019).

Researchers have attempted to quantify and mitigate biases in generative language models: Zhao et al. (2019) find systematic gender biases in ELMo's contextualized word vectors, and Bhardwaj et al. (2020) measure the ways in which gender bias in BERT affects emotion and sentiment intensity. Kurita et al. (2019) quantify gender bias in BERT by measuring gendered occupations and adjectives, while Sheng et al. (2019) utilize template-based methods to quantify gender bias in texts generated by GPT-2, with sentiment scores as a proxy for bias. Building on these methodologies, we focus on intersectional biases of GPT-2 with regards to the domain of occupations.
Intersectional biases.
As Crenshaw (1989) explains, intersectional biases are a necessary consideration because a single axis of analysis treating gender and race as mutually exclusive categories distorts the reality of marginalized communities (such as Black women). More recently, Foulds & Pan (2020) provide definitions of fairness in machine learning systems informed by the framework of intersectionality. The intersections between gender and racial biases have been studied in sentiment analysis (Kiritchenko & Mohammad, 2018) and in generative language models such as BERT and GPT-2 (Tan & Celis, 2019). As well as race and gender, we extend our analysis to intersections with other legally protected categories that have historically been subject to discrimination: religion, sexuality, and political affiliation.
3. Methods
In order to analyze the potential impact of biases in applications of generative language models, we focus on publicly available models. Among these, we focus on GPT-2 as it is one of the most widely recognized generative language models. (We applied for research access to GPT-3 in Oct. 2020 but have received no response from OpenAI.)
Table 1. Summary table of data collection showing the number of calls per category and per variant (Var), in addition to the thresholds (θ) used for analysis and the share of calls retained above the threshold. The cumulative sum of all calls is 396,000.

Category    Var   Calls   Total     θ*    Cum. Share
Base          2   7,000    14,000    35    81%
Ethnicity     8   7,000    56,000   140    82%
Religion     10   7,000    70,000   175    84%
Sexuality     4   7,000    28,000    70    83%
Political     4   7,000    28,000    70    82%
Continent   200   1,000   200,000   500    76%
Notes: * Threshold (θ) for plots, excluding the 0.25% tails of infrequently mentioned jobs.

To proxy for the most popular model version, we use the 124M-parameter model, as it was the most-downloaded version of GPT-2 available on HuggingFace (see Appendix A, Tab. 5). We also confirm that our results hold true for another generative model, XLNet (Yang et al., 2019) (see Appendix B).

Our data collection pipeline is shown in Fig. 2. We prompt GPT-2 with prefix templates, as defined by Sheng et al. (2019). The template is "The [X] [Y] works as a", where X is one of the following protected classes: ethnicity, religion, sexuality, and political affiliation, and Y is 'man' or 'woman'. As a baseline for intersectional effects, we leave X blank (i.e. "The man/woman works as a"). From these 28 unique templates (Tab. 1), we generate 7,000 sentences per template using GPT-2 (Radford et al., 2019) through the HuggingFace API with a top-k parameter of 3 and model size 'small'. Generated sentences are limited to a maximum length of 10 words to capture immediate occupation associations.
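As a concrete illustration of this prompting setup, the sketch below uses the HuggingFace transformers library; the checkpoint name, the number of samples, and the decoding settings beyond top-k = 3 and the short completion cap are our assumptions rather than the authors' released pipeline (linked in the abstract).

```python
# A minimal sketch of template-based generation with the 124M "gpt2" model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompts = [f"The {x} {y} works as a".replace("  ", " ")
           for x in ["", "Asian", "Black", "Hispanic", "White"]  # subset shown
           for y in ["man", "woman"]]

completions = []
for prompt in prompts:
    outs = generator(
        prompt,
        max_new_tokens=10,        # approximate the 10-word completion cap
        top_k=3,
        do_sample=True,
        num_return_sequences=10,  # the paper generates 7,000 per template
    )
    completions.extend(o["generated_text"] for o in outs)
```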
Name-based templates. An additional prefix template is created of the form "[Z] works as a", where Z is a name sampled from the most popular male and female first names per country, obtained from Wikipedia (https://en.wikipedia.org/wiki/List_of_most_popular_given_names). We aggregate the names into five geographic groups: Africa, Americas, Asia, Europe, and Oceania. We sample 20 names for each geographic group and gender pair, yielding 200 unique templates, from which we generate 1,000 sentences each. By prompting GPT-2 with a template devoid of inherently gendered or racialized terms, such as 'man/woman' or 'Asian/Black', we can better examine the latent associations when GPT-2 estimates the ethnicity and gender from first names.
Occupation entity recognition. For each generated sentence, we use the Stanford CoreNLP Named Entity Recognizer (NER; https://stanfordnlp.github.io/CoreNLP/ner.html) (Manning et al., 2014) to extract "job titles". NER was unable to detect titles for some sentences, which were removed from the dataset, losing 10.6% of gender-occupation sentences and 19.6% of name-occupation sentences. We then create a one-hot encoded frequency matrix for returned job tokens, combining duplicate jobs (e.g. nurse/nurse practitioner). However, we do not merge job tokens with inherent hierarchies (e.g. assistant professor/professor) or implicit gender associations (e.g. salesman/salesperson, waitress/waiter). Sentences returning multiple titles (e.g. "The woman works as a waitress and a maid") were treated as two separate entries in the frequency matrix, given that individuals can have more than one job.
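A minimal sketch of the frequency-matrix construction described above, assuming the job titles have already been extracted; the records shown are hypothetical.

```python
import pandas as pd

# Hypothetical extracted records: one row per generated sentence that
# returned a title, keyed by the prompt's gender and intersection.
records = pd.DataFrame(
    [("woman", "base", "nurse"), ("woman", "base", "nurse"),
     ("man", "base", "mechanic"), ("man", "asian", "waiter")],
    columns=["gender", "category", "job"],
)

# One-hot encode the job tokens, then aggregate into a frequency matrix
# with one row per (gender, category) pair and one column per job.
one_hot = pd.get_dummies(records["job"])
freq_matrix = (pd.concat([records[["gender", "category"]], one_hot], axis=1)
               .groupby(["gender", "category"]).sum())
print(freq_matrix)
```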
The distribution of returned jobs is highly skewed with long tails: a few jobs comprise a significant share and many jobs are mentioned infrequently. Therefore, we apply a lower-bound threshold to focus our analysis, removing tokens mentioned in fewer than 0.25% of total calls, which preserves approximately 80% of the sample (Tab. 1). For jobs above the threshold, we run a logistic regression on the one-hot matrix and output frequencies to predict $p([\text{job}] = 1 \mid X, Y)$ for the input "The [X] [Y] works as a [job]". While GPT-2 is a 'black-box' model, this predictive modelling attempts to estimate how intersectional categories change GPT-2's prior on the probability of job associations. By using interaction terms, we can study whether intersectionality has additional influence beyond main effects (e.g. the isolated effects of gender and ethnicity). The logistic regression equation includes 'man' from the baseline case as the reference group, with dummy variables added for woman, for each intersectional category $C$, and for interaction terms:

$$\log\text{odds}\big(p(\text{job}_i \mid c)\big) = \beta_0 + \beta_1\,\text{Woman}_i + \sum_{c=1}^{C} \gamma_{ic}\,\text{Category}_{ic} + \sum_{c=1}^{C} \delta_{ic}\,\text{Category}_{ic} \times \text{Woman}_i + \epsilon_i, \qquad (1)$$

where $\log\text{odds}(p) = \log(p/(1-p))$ is the log-odds ratio.
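Eq. (1) corresponds to a standard logit with treatment coding; a sketch using statsmodels is below. The synthetic data frame and outcome are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "woman": rng.integers(0, 2, n),
    "category": rng.choice(["base", "asian", "black"], n),
})
# Synthetic binary outcome: 'nurse' made more likely for women (illustration).
df["nurse"] = (rng.random(n) < (0.05 + 0.10 * df["woman"])).astype(int)

# Treatment coding with 'base' as reference, matching baseline man in Eq. (1).
model = smf.logit(
    "nurse ~ woman + C(category, Treatment('base'))"
    " + C(category, Treatment('base')):woman",
    data=df,
).fit(disp=0)

print(model.summary())   # coefficients are log-odds effects, as in Eq. (1)
print(model.prsquared)   # McFadden pseudo-R^2, used in Sec. 4.4
```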
4. Results
We analyze the effect of gender on returned occupational distributions in Sec. 4.1 and on particular occupations in Sec. 4.2. We extend these analyses to intersectional associations in Sec. 4.3 and present empirical results derived from logistic regressions in Sec. 4.4. Finally, we compare and quantify the predicted distributions against real-world US occupational data in Sec. 4.5. (For the names-occupation template, we removed 2,000 sentences with the job title 'Princess' for the African name 'Princess'.)
Figure 3. GPT-2 occupational stereotyping. GPT-2 stereotypes the occupational distribution of women more than that of men: 8 jobs account for 50% of the outputs for women versus 16 for men, and 43 jobs account for 90% versus 66 for men.
Fig. 3 ranks the frequency of jobs against the cumulative share. While 16 jobs account for 50% of the outputs for men, only 8 jobs account for the same share for women. Similarly, at the 90% level, men are associated with more jobs than women (66 vs. 43, respectively). This suggests that GPT-2 predicts a wider variety of jobs for men and a narrower set of jobs for women. The Gini coefficients in Tab. 2 confirm this more unequal distribution for women.

In addition to distributional differences, the sets of returned jobs also differ between men and women. In Fig. 4, we show the proportion of genders in all jobs mentioned more than 35 times for baseline man and woman. We make two observations: first, there is a greater number of jobs dominated by men as compared to women, reflecting the greater diversity of occupations for men. Second, the occupations seem stereotypical: men are associated with manual jobs such as laborer, truck driver, and mechanic, and with professional jobs such as software engineer and private investigator. Women are associated with domestic and care-giving roles such as babysitter, maid, social worker, and housewife. Furthermore, over 90% of the returns for 'prostitute' were women, and over 90% of returns for 'software engineer' were men. We only find three jobs for which GPT-2's outputs suggest a gender-neutral prior over occupations: reporter, lawyer, and sales representative.

The Gini coefficient is $G = \sum_{i=1}^{n} (2i - n - 1)\,x_i \,/\, \big(n \sum_{i=1}^{n} x_i\big)$, where $x_i$ is the observed value, $n$ is the total number of values observed, and $i$ is the rank in ascending order.
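For reference, the rank-based Gini formula above translates directly into code; a minimal sketch (the example counts are hypothetical):

```python
import numpy as np

def gini(counts):
    # G = sum_i (2i - n - 1) * x_i / (n * sum_i x_i), with x sorted
    # ascending and i the 1-based rank, as in the footnote above.
    x = np.sort(np.asarray(counts, dtype=float))
    n = len(x)
    ranks = np.arange(1, n + 1)
    return np.sum((2 * ranks - n - 1) * x) / (n * np.sum(x))

# A highly skewed job-frequency vector yields a Gini close to 1.
print(gini([1, 1, 1, 2, 3, 5, 200]))  # ~0.81
```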
Table 2. Gini coefficients of rank-frequency distributions.

Gender   Intersection   Gini Coeff.   Relative Coeff. (Base Man = 100)
Man      Base           0.933         100.00
Man      Religion       0.929          99.57
Man      Sexuality      0.935         100.21
Man      Ethnicity      0.939         100.64
Man      Political      0.942         100.96
Woman    Base           0.951         101.93
Woman    Political      0.951         101.93
Woman    Ethnicity      0.956         102.47
Woman    Religion       0.956         102.47
Woman    Sexuality      0.958         102.68
The previous analysis indicates considerable differences between genders. We ask the following question: Do these stereotypical male and female jobs change when intersectionality is considered? The Gini coefficients (Tab. 2) for gender-intersection pairs indicate a greater clustering of women into fewer jobs across all intersections, especially for sexuality and religion. To analyze differences in job associations for each intersection, we display a scatter plot with the equi-proportion line from $(1/|c|, 0)$ to $(0, 1/|c|)$, where $|c|$ is the number of choices for intersection $c$. We normalize the axes such that $1/|c| = 1\text{x}$, so that jobs lie on this line if adding intersections has no effect on the gender ratio. We further include a bar plot showing the extremes of the distribution, with the top ten jobs with the largest man-woman range.
Ethnicity. For gender and ethnicity intersections (Fig. 5), we find a similar pattern, with some occupations associated with men (plumber, guard, contractor, and police officer) and others with women (secretary, prostitute, model, babysitter). While all ethnicities of women are associated with prostitute, only Black men are. Overall, few occupations are solely associated with men or women of a certain ethnicity; most are distributed over several ethnicities.
Religion.
For gender and religion intersections (Fig. 6), Hindu men and women only have associations with non-religious professions (e.g. bouncers and massage therapists). For the Christian, Buddhist, and Jewish religions, GPT-2 tends to generate occupations with large man-woman disparities, especially for professional religious occupations: nuns are dominated by Buddhist women, rabbis by Jewish men, and monks, pastors, and priests by Buddhist and Christian men.
Sexuality.
For gender and sexuality intersections (Fig. 7), we find professions such as massage therapist, counselor, and graphic designer to be almost unique to lesbian women, while professions such as detective, plumber, guard, and coach are dominated by straight men. Male-dominated professions are almost exclusively straight, whereas female-dominated professions are almost exclusively lesbian.
Fundamentally skewed output distributions.
We show the gender proportions when querying for the base case, i.e. X = {} , Y = { Man , Woman } and present all jobs with greater than
35 = n ∗ . mentions, making up 81% of returned sentence prompts.
Figure 5. Man-Woman Occupational Split by Ethnicity.
Figure 6. Man-Woman Occupational Split by Religion.
Political affiliation.
For gender and political affiliation intersections (Fig. 8), the occupations are similar to the baseline man and woman case presented in Fig. 4. Although occupations are split along the gender axis, some have equal representation across political affiliation. The exceptions are that liberal men are strongly associated with critic and banker, and conservative men with driver and host.
Name origin.
For gender and continent name-origin intersections (Fig. 9), jobs are more tightly distributed around the equi-proportion line. This suggests that name origin has less of an effect on the token returned by GPT-2 than adding an explicit categorical intersection (e.g. ethnicity or religion). Gender continues to be the more significant determinant of the occupations generated by GPT-2, with men being associated with jobs such as mechanic and leader, and women with jobs such as nurse and receptionist.
The previous visual analyses demonstrate the considerable differences in the jobs associated with women and men, and further show that these gender splits remain when intersections are added. Next, we address the question: quantitatively, how important are gendered intersections in determining the job returned by GPT-2?
Tab. 3 presents summary results from 262 logistic regressions, which predict the likelihood of a job being associated with a given sentence prompt. We focus on two metrics indicating how often the addition of regressors adds explainability of the outcome: (i) the proportion of regressions where the woman dummy and the interactions were significant ($p < 0.05$), and (ii) the change in pseudo-$R^2$ on the addition of the woman dummy and the interactions. Statistical results, including the coefficients, for all regressions are in Appendix D. The aggregated results in Tab. 3 show that the woman dummy is frequently significant, most commonly so in ethnicity regressions (71%) and least commonly in political regressions (59%). Adding a woman dummy increases the model pseudo-$R^2$ on average by +3.3 percentage points, signifying that gender explains additional variation in job prediction. Interactions are significant in approximately one third of regressions, but the additional increase to pseudo-$R^2$ is on average smaller (+0.4 percentage points).

We use the McFadden $R^2$, which is calculated by comparing the log-likelihood of a model with no predictors, $L_0$, versus the log-likelihood of the estimated model, $L_M$: $R^2_{\text{McF}} = 1 - \ln(L_M)/\ln(L_0)$.
Figure 7. Man-Woman Occupational Split by Sexuality.
Figure 8. Man-Woman Occupational Split by Political Affiliation.
Figure 9. Man-Woman Occupational Split by Name Origin.

Table 3. Aggregated logistic regression results. We fit a total of 262 logistic regressions and report the proportion of regressions in which the independent variables contributed significantly to the logistic model, as well as their average contribution to the pseudo-$R^2$ (in percentage points; the interaction $\Delta R^2$ is reported once per category for the full set of interaction terms).

Category (regressions)   Term                 Prop. significant   $\Delta R^2$
Ethnicity (55)           woman                0.71                3.22
                         woman:asian          0.29                0.40
                         woman:black          0.36
                         woman:hispanic       0.38
                         woman:white          0.16
Religion (64)            woman                0.61                3.31
                         woman:buddhist       0.19                0.39
                         woman:christian      0.27
                         woman:hindu          0.27
                         woman:jewish         0.33
                         woman:muslim         0.25
Sexuality (72)           woman                0.61                3.36
                         woman:lesbian        0.35                0.45
                         woman:straight       0.26
Political (71)           woman                0.59                3.47
                         woman:conservative   0.24                0.46
                         woman:liberal        0.30

There is some variation in the significance of interactions; for example, {woman:hispanic} and {woman:black} are more frequently significant than {woman:white}, and {woman:lesbian} is more significant than {woman:straight}. These results suggest that some intersections are more salient in changing the returned job from a given sentence prompt, and may anchor GPT-2 on a stereotypical occupation set. In general, across a wide range of jobs, gender and intersectionality are significant determinants of the token returned by GPT-2.

A comparison of GPT-2's predictions to the true labor market distribution requires recent data disaggregated by gender and intersection for a granular set of occupations. The 2019 US Labor Force Statistics from the Current Population Survey (US Labor Bureau of Statistics, 2019) report the gender and ethnicity shares of workers in 567 occupational categories. We recognize a number of limitations of this data, which we address in the discussion. We first select the 50 most frequently mentioned jobs by GPT-2. Then from these, we match GPT-2's job tokens to real US occupation titles, finding correspondences for 41/50 titles (see Appendix C). We compute GPT-2's predicted proportional representation for each gender-ethnicity pair, assuming the percentage of women is equal across ethnicities. The 'predicted' labor force has equal representation across groups because we generate the same number of sentence prompts per pair ($n = 7{,}000$). This is not the case in reality, so the predicted proportions are scaled by the true distribution of gender and ethnicity reported in the US Labor Statistics and summarised in Appendix C. The scaling factor is

$$\gamma(c) = \frac{G(c)\,E(c)}{\hat{D}(c)},$$

where $G(c)$, $E(c)$ are the gender- and ethnicity-shares of the US data, respectively, and $\hat{D}(c) = 12.5\%$ is our artificial "population"-share:

$$\text{adj. Pred}(i, c) = \gamma(c) \times \text{Pred}(i, c), \qquad (2)$$

where $\text{Pred}(i, c)$ is the share of job $i$ for characteristics $c$. For jobs reported in the US data, we calculate the difference between the predicted proportions and the true proportions.
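To make the adjustment concrete, a minimal sketch of Eq. (2) in Python, using the shares tabulated in Appendix C (Tab. 8); the predicted share is hypothetical.

```python
def adjusted_prediction(pred, gender_share, ethnicity_share, d_hat=0.125):
    # gamma(c) = G(c) * E(c) / D_hat(c); adj. Pred = gamma(c) * Pred, Eq. (2).
    gamma = gender_share * ethnicity_share / d_hat
    return gamma * pred

# Example: Asian men have G = 0.530, E = 0.065, so gamma ~= 0.276 (Tab. 8).
print(adjusted_prediction(pred=0.10, gender_share=0.530, ethnicity_share=0.065))
```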
For a given job, how well does GPT-2 predict the gender-ethnicity split? There are three possible cases: GPT-2 overestimates the true representation of women in female-dominated jobs (exacerbates societal skew), GPT-2 matches the true proportional representation (directly inherits skew), or GPT-2 underestimates the true proportional representation (corrects for skew). In Fig. 1, we find that most predicted values lie close to the ground truth given by the identity line, indicating a high accuracy in prediction (all mean-squared errors are below 0.024). In particular, for the gender-ethnicity intersections, the low mean-squared errors indicate a considerable degree of similarity between GPT-2's predicted distribution and the ground truth distribution, especially for Asian and Black workers. Furthermore, it appears that GPT-2 pulls the distribution away from the extremes; that is, it under-predicts the extent of occupational segregation. This is demonstrated by the fact that GPT-2 predicts a higher proportion of women than the ground truth in male-dominated jobs with less than 25% women-share (on average +8.7%) and predicts lower proportions of women in jobs with more than 75% women-share (on average -6.5%). The exceptions to this pattern are courier, bus driver, and photographer, for which GPT-2 under-predicts the proportion of women, and social worker and model, for which GPT-2 over-predicts the proportion of women.
For a given gender-ethnicity pair, how well does GPT-2 predict the top jobs? This question aims to assess the extent of stereotyping in GPT-2's predictions. Tab. 4 shows the top five predicted and ground truth jobs for each intersection. GPT-2 predicts a high proportion of baseline women to be waitresses (14%), but only Hispanic women have waitress in their top five occupations according to the US Labor data. While GPT-2 predicts 18% of Hispanic women to be waitresses, in reality only 3% of Hispanic women in America work as waitresses. Some of this strong association may be because waitress is an inherently gendered job. GPT-2 also over-predicts the number of nurses, predicting 11% of women to be nurses when in reality only about 4% of American women are nurses. Security guard is consistently over-predicted for men of all ethnicities, yet it only appears as a top job for Black men, and at a lower frequency (2%) than the predicted frequency (8%). GPT-2 over-predicts the proportion of janitors for all ethnicities, especially for White and Asian men, for whom janitor does not appear as a top job.

The share of the most popular occupation for each gender is significantly higher for women (waitress at 14%) than for men (security guard at 8%). The cumulative share of the top five occupations is 41% for women, which is more than double the ground truth observation (17%). While GPT-2 also over-predicts the cumulative share of the top five occupations for men, the discrepancy to the US data is smaller (24% vs. 10%). While GPT-2's tendency to aggregate women into a small set of stereotypical jobs was identified in prior analysis (Fig. 3 and Tab. 2), the comparison to US data corroborates this result.
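The top-five summaries in Tab. 4 follow from a few lines of pandas; a sketch, reusing the hypothetical freq_matrix from the Sec. 3 example:

```python
def top_jobs(freq_matrix, k=5):
    # Convert counts to within-group shares, then report the k most
    # frequent jobs and their cumulative share per intersection.
    shares = freq_matrix.div(freq_matrix.sum(axis=1), axis=0)
    for group, row in shares.iterrows():
        top = row.nlargest(k)
        print(group, dict(top.round(2)), "cumulative:", round(top.sum(), 2))
```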
5. Discussion
Demographic distribution per occupation.
Overall, we find strong differences in the occupational tokens returned by GPT-2 for gendered sentence prompts. At first glance, it may seem biased that GPT-2 predicts so many women to be maids or secretaries and so few to be plumbers or truck drivers, but in fact, the model predicts less occupational segregation by gender as compared to the US ground truth distribution. It appears that GPT-2 is pulling the skews of the distribution found in reality towards gender parity.

For ethnicity, GPT-2 accurately predicts the distribution of occupations in real-world data with low mean-squared errors, especially for Asian and Black workers. In addition to gender and ethnicity, adding a religious intersection considerably changes the returned jobs, especially for men. For example, GPT-2 predicts 4% of Buddhist men to be monks. There are an estimated 3.75 million Buddhists in the US and approximately 1,000 Buddhist centers and monasteries (Pew Research, 2020; Institute for Genealogical Studies, 2020). A back of the envelope calculation shows each of these centers would need to employ more than 70 monks to reach the 4% threshold. Therefore, it is likely that GPT-2 infers too strong of an association between practising a religion and working in a religious profession. Intersections with continent-based names show that the returned occupations are more similar to those of baseline man and woman. This finding indicates that prompting GPT-2 with explicit intersections like 'Buddhist man' or 'Black woman' changes the probabilities of returned tokens to a greater extent than a name prompt, where GPT-2 must independently ascertain the gender and background of the individual.
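For concreteness, a rough version of the monk calculation above (the even male/female split of US Buddhists is our simplifying assumption):

$$0.04 \times \underbrace{\tfrac{1}{2} \times 3.75\,\text{M}}_{\text{US Buddhist men}} \approx 75{,}000 \text{ monks}, \qquad \frac{75{,}000 \text{ monks}}{\sim 1{,}000 \text{ centers}} \approx 75 \text{ monks per center}.$$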
Occupation distribution per demographic.
Despite reflecting the gender-ethnicity proportions per real-world occupation, GPT-2 notably displays a bias towards predicting greater occupational clustering for both genders, especially for women. The Gini coefficients confirm that the female distribution is more unequal than that of men. For the predictions generated by GPT-2, a larger and more diverse set of occupations is associated with men than with women. Gender-ethnicity predictions do not deviate much from the predictions for baseline man and woman. This signifies that GPT-2 predicts the occupations for women with less variety than for men, regardless of ethnicity.

This is a different kind of bias than that normally discussed in the algorithmic fairness literature. In reality, large proportions of women do work as secretaries, receptionists, and maids, and large proportions of men do work as mechanics, plumbers, and carpenters. Therefore, GPT-2's bias is not in the jobs associated with women, but in the rate at which it associates women with such a small set of jobs, a pattern exacerbated from the ground truth occupation data.
Table 4. Top five jobs per intersectional category, with associated proportions and the cumulative sum of the top five.

WOMAN
base      GPT-2: waitress (0.14), nurse (0.11), maid (0.06), receptionist (0.05), teacher (0.05) | Sum 0.41
          US: teacher (0.04), nurse (0.04), secretary/assistant (0.03), cashier (0.03), manager (0.03) | Sum 0.17
Asian     GPT-2: waitress (0.14), maid (0.11), nurse (0.08), teacher (0.05), receptionist (0.04) | Sum 0.42
          US: nurse (0.05), personal appearance worker (0.04), cashier (0.03), accountant/auditor (0.03), manager (0.03) | Sum 0.18
Black     GPT-2: waitress (0.18), nurse (0.10), maid (0.07), prostitute (0.05), teacher (0.04) | Sum 0.44
          US: nursing/home health aide (0.07), cashier (0.04), nurse (0.04), personal care aide (0.03), teacher (0.03) | Sum 0.21
Hispanic  GPT-2: waitress (0.16), nurse (0.14), receptionist (0.07), maid (0.07), teacher (0.04) | Sum 0.48
          US: maid/housekeeper/cleaner (0.05), cashier (0.04), waiter/waitress (0.03), secretary/assistant (0.03), nursing/home aide (0.03) | Sum 0.18
White     GPT-2: waitress (0.17), nurse (0.11), maid (0.07), teacher (0.05), receptionist (0.04) | Sum 0.44
          US: teacher (0.04), nurse (0.04), secretary/assistant (0.04), manager (0.03), cashier (0.03) | Sum 0.18

MAN
base      GPT-2: security guard (0.08), manager (0.05), waiter (0.04), janitor (0.04), mechanic (0.03) | Sum 0.24
          US: manager (0.04), truck driver (0.04), construction laborer (0.02), retail sales supervisor (0.02), laborer/material mover (0.02) | Sum 0.14
Asian     GPT-2: waiter (0.09), security guard (0.07), manager (0.04), janitor (0.04), chef (0.03) | Sum 0.27
          US: software developer (0.11), manager (0.04), physician/surgeon (0.02), teacher (0.02), engineer (0.02) | Sum 0.21
Black     GPT-2: security guard (0.08), waiter (0.07), bartender (0.05), janitor (0.05), mechanic (0.04) | Sum 0.29
          US: truck driver (0.06), laborer/material mover (0.04), janitor (0.03), manager (0.03), security guard (0.02) | Sum 0.18
Hispanic  GPT-2: security guard (0.09), janitor (0.07), waiter (0.07), bartender (0.05), manager (0.05) | Sum 0.33
          US: construction laborer (0.06), truck driver (0.04), grounds maintenance worker (0.03), carpenter (0.03), janitor (0.03) | Sum 0.19
White     GPT-2: waiter (0.06), security guard (0.06), janitor (0.05), mechanic (0.04), bartender (0.04) | Sum 0.25
          US: manager (0.04), truck driver (0.04), construction laborer (0.03), retail sales supervisor (0.02), laborer/material mover (0.02) | Sum 0.15
Limitations.
This paper is subject to a number of limitations. First, our chosen comparison to labor market data renders the ground truth baseline inherently US-centric. Second, without consistent, granular data on occupational splits by religion, sexuality, and political affiliation, we cannot comment on how accurately GPT-2 reflects the ground truth for these intersections. Third, for jobs in the informal sector, such as 'prostitute', we cannot compare to real-world incidences. Additionally, if terms such as 'prostitute' are commonly used as slurs, GPT-2 may display a bias towards over-estimating their proportion. Finally, by focusing only on two genders, the results do not adequately reflect occupational biases which may be associated with non-binary gender identities. Future research is recommended to investigate ground truth comparisons across a broader range of countries against the set of gender-intersections examined in this paper, and to comment on a broader spectrum of gender identities. Doing so would be valuable in establishing potential areas of bias which risk being inherited by downstream applications of generative language models such as GPT-2.
6. Conclusion
What should be the goal of generative language models? It is certainly appropriate that they should not exacerbate existing societal biases with regards to occupational segregation. It is less clear whether they should reflect or seek to correct skewed societal distributions. Compared to US data, we identify a bias towards returning a small number of stereotypical jobs too many times, especially for women. However, for a given job, we find that GPT-2 reflects societal skew and, in some cases, errs on the side of correcting for it. One proposed reason for this observed pattern is over-representation in the training data of 'exceptional cases'. If society expects women to be teachers and nurses, it is possible that there are more training examples scraped from social media platforms or newspaper articles of men occupying these stereotypes, or vice-versa with plumbers and software developers. It remains to be seen whether a larger training set improves or impairs occupational biases for intersections. Using the methodology developed in this paper, it is possible, and would merit future research, to determine the effect of model and training set size by comparing GPT-3 relative to its younger sibling analyzed in this work, GPT-2. This work presents the first comprehensive analysis of protected class intersections with gender in generative language models, and we hope that it will spark new research further investigating biases on topics relevant to downstream applications and intersectionality in AI more broadly.
Acknowledgements
This work has been supported by the Oxford AI student society, the EPSRC Centre for Doctoral Training in Autonomous Intelligent Machines & Systems [EP/L015897/1] (A.S., Y.M.A.), and the Economic and Social Research Council grant [ES/P000649/1] (H.K.). We also thank R. Maria del Rio-Chanona and Gesa Biermann for their useful comments.
References
Adiwardana, D., Luong, M.-T., So, D., Hall, J., Fiedel, N., Thoppilan, R., Yang, Z., Kulshreshtha, A., Nemade, G., Lu, Y., and Le, Q. V. Towards a human-like open-domain chatbot. arXiv, abs/2001.09977, 2020.

Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Conference on Fairness, Accountability, and Transparency (FAccT '21). ACM, New York, NY, USA, 2021.

Bhardwaj, R., Majumder, N., and Poria, S. Investigating gender bias in BERT. arXiv, abs/2009.05021, 2020.

Bhatia, V., Rawat, P., Kumar, A., and Shah, R. End-to-end resume parsing and finding candidates for a job description using BERT. arXiv, abs/1910.03089, 2019.

Blodgett, S. L., Barocas, S., Daumé, H., and Wallach, H. Language (technology) is power: A critical survey of "bias" in NLP. In ACL, 2020.

Bolukbasi, T., Chang, K.-W., Zou, J. Y., Saligrama, V., and Kalai, A. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In NeurIPS, 2016.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners, 2020.

Budzianowski, P. and Vulić, I. Hello, it's GPT-2 - how can I help you? Towards the use of pretrained language models for task-oriented dialogue systems. CoRR, abs/1907.05774, 2019.

Caliskan, A., Bryson, J., and Narayanan, A. Semantics derived automatically from language corpora contain human-like biases. Science, 356:183-186, 2017.

Crenshaw, K. Demarginalizing the intersection of race and sex: A Black feminist critique of antidiscrimination doctrine, feminist theory and antiracist politics. 1989.

Diaz, M., Johnson, I., Lazar, A., Piper, A., and Gergle, D. Addressing age-related bias in sentiment analysis. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, 2018.

Dinan, E., Fan, A., Williams, A., Urbanek, J., Kiela, D., and Weston, J. Queens are powerful too: Mitigating gender bias in dialogue generation. arXiv, abs/1911.03842, 2020.

Ethayarajh, K. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. CoRR, abs/1909.00512, 2019.

Fedus, W., Zoph, B., and Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity, 2021.

Foulds, J. and Pan, S. An intersectional definition of fairness. pp. 1918-1921, 2020.

Gonen, H. and Goldberg, Y. Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. arXiv, abs/1903.03862, 2019.

He, P., Liu, X., Gao, J., and Chen, W. DeBERTa: Decoding-enhanced BERT with disentangled attention. arXiv, abs/2006.03654, 2020.

Institute for Genealogical Studies. US: Religious Records-Part 2, 2020.

Kennedy, C. J., Bacon, G., Sahn, A., and von Vacano, C. Constructing interval variables via faceted Rasch measurement and multitask deep learning: a hate speech application. arXiv, abs/2009.10277, 2020.

Kiritchenko, S. and Mohammad, S. M. Examining gender and race bias in two hundred sentiment analysis systems. In *SEM@NAACL-HLT, 2018.

Kurita, K., Vyas, N., Pareek, A., Black, A., and Tsvetkov, Y. Measuring bias in contextualized word representations. arXiv, abs/1906.07337, 2019.

Li, C., Fisher, E. M., Thomas, R., Pittard, S., Hertzberg, V., and Choi, J. D. Competence-level prediction and resume and job description matching using context-aware transformer models. arXiv, abs/2011.02998, 2020.

Liu, H., Dacon, J., Fan, W., Liu, H., Liu, Z., and Tang, J. Does gender matter? Towards fairness in dialogue systems. In COLING, 2020.

Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J. R., Bethard, S., and McClosky, D. The Stanford CoreNLP natural language processing toolkit. In ACL (System Demonstrations), pp. 55-60, 2014.

Pew Research. Religious Landscape Study, 2020.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. 2019.

Sheng, E., Chang, K.-W., Natarajan, P., and Peng, N. The woman worked as a babysitter: On biases in language generation. arXiv, abs/1909.01326, 2019.

Stanovsky, G., Smith, N. A., and Zettlemoyer, L. Evaluating gender bias in machine translation. arXiv, abs/1906.00591, 2019.

Tan, Y. and Celis, L. Assessing social and intersectional biases in contextualized word representations. In NeurIPS, 2019.

Tatman, R. Gender and dialect bias in YouTube's automatic captions. In EthNLP@EACL, 2017.

US Labor Bureau of Statistics. Employed persons by detailed occupation, sex, race, and Hispanic or Latino ethnicity, 2019.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In NeurIPS, 2017.

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In BlackboxNLP@EMNLP, 2018.

Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In NeurIPS, 2019.

Wick, M. L., Silverstein, K., Tristan, J., Pocock, A. C., and Johnson, M. Detecting and exorcising statistical demons from language models with anti-models of negative data. arXiv, abs/2010.11855, 2020.

Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., and Le, Q. V. XLNet: Generalized autoregressive pretraining for language understanding. In NeurIPS, 2019.

Zhao, J., Zhou, Y., Li, Z., Wang, W., and Chang, K.-W. Learning gender-neutral word embeddings. In EMNLP, 2018.

Zhao, J., Wang, T., Yatskar, M., Cotterell, R., Ordonez, V., and Chang, K.-W. Gender bias in contextualized word embeddings. arXiv, abs/1904.03310, 2019.

Supplementary Material: How True is GPT-2? An Empirical Analysis of Intersectional Occupational Biases
Note on language used in this paper
In our paper, we focus on the occupational associations with binary gender identities, i.e. "man" and "woman". While we do sometimes refer to jobs dominated by women as 'female-dominated jobs', we do not make an explicit comparison to sex, i.e. prompting GPT-2 with 'The female worker is a...'. We feel strongly about the importance of studying non-binary gender and of ensuring the field of machine learning and AI does not diminish the visibility of non-binary gender identities. In future work, we hope to extend our analysis with the same data collection pipeline. For example, womxn is an umbrella term used in the intersectional feminist community to be inclusive of transgender women and non-binary individuals. The sentences returned when prompting GPT-2 with 'womxn' are primarily of two types: (i) stereotypical job associations, e.g. 'drag queen', 'feminist', 'crossdresser' or 'nurse', and (ii) failures to recognize 'womxn' as a person noun, e.g. 'The womxn works as a kind of a noodle shop', 'The womxn works as a battery', 'The womxn works as a mauve-wool hat' or 'The womxn works as a kind of virtual sex toy'. These preliminary findings suggest it is critical for future work to study occupational biases with non-binary gender identities in generative language models.
A. GPT Model Downloads
We select the most downloaded version of GPT-2 available on HuggingFace as a proxy for popularity in use-cases by experts and non-experts alike. In Tab. 5 we show the original GPT-2 models released by OpenAI (Radford et al., 2019) and available on HuggingFace. The small version has an order of magnitude more downloads than the medium and XL versions. Further, larger versions of GPT-2 have been shown to have an increased capability to memorize training information, introducing privacy concerns (Carlini, 2020). Finally, while the environmental cost of inference is cheap, Bender et al. (2021) discuss how the environmental impact of training scales with model size, and the associated consequences likely disproportionately affect marginalized populations.
Table 5. GPT-2 models available on HuggingFace by total number of downloads (accessed 3rd February 2021).

Model
GPT-2 (small)
GPT-2 Medium
GPT-2 Large
GPT-2 XL
B. Comparison with XLNet
XLNet sample generation.
In addition to the suite of models released by OpenAI, XLNet is a generalized autoregressive pre-training method which outperforms BERT across a number of benchmark tasks (Yang et al., 2019). To assess the generalizability of our findings, we generate 7,000 sentences for the gender-occupation template (X = {}, Y = {Man, Woman}) and analyze the returned occupational tokens from XLNet. Out of the total 14,000 returned sentences, 4,442 had no title recognized by the Stanford NLP Named Entity Recognizer. This sample loss of 31.7% is higher than for GPT-2 (Tab. 6). A plausible reason for this higher sample loss is the way XLNet generates text, which includes extra inverted commas.

Table 6. Sample loss from sentences with no detected job title.

Model   Template            Missing Titles   Sample Loss
GPT-2   Gender-occupation   20,689           10.6%
GPT-2   Names-occupation    39,203           19.6%
XLNet   Gender-occupation    4,442           31.7%
Distributional Analysis.
Fig. 10 shows the rank of jobs against the cumulative share. While 11 jobs account for 50% of the outputs for men, only 5 jobs account for the same share for women. Similarly, considering 90% of the output, women are associated with fewer jobs than men (31 vs. 46, respectively). This disparity is similar to the one that we found in GPT-2, suggesting that XLNet also predicts a wider variety of jobs for men and a narrower set of jobs for women.
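The 50%/90% coverage counts quoted here and in Sec. 4 follow from a simple cumulative-share computation; a sketch with hypothetical frequencies:

```python
import numpy as np

def jobs_covering_share(counts, share):
    # Number of top-ranked jobs needed to cover `share` of all mentions.
    c = np.sort(np.asarray(counts, dtype=float))[::-1]
    cumulative = np.cumsum(c) / c.sum()
    return int(np.searchsorted(cumulative, share) + 1)

counts = [200, 90, 40, 20, 10, 5, 5]  # hypothetical job frequencies
print(jobs_covering_share(counts, 0.5), jobs_covering_share(counts, 0.9))
```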
Table 7. XLNet: Top five jobs for base man and base woman.

        Jobs (Proportions)                                                              Sum
Woman   maid (0.27), waitress (0.14), prostitute (0.05), servant (0.04), nurse (0.04)   0.54
Man     carpenter (0.11), mechanic (0.07), maid (0.05), waiter (0.05), taxi driver (0.04)   0.32
Top occupations.
Tab. 7 shows the top five jobs for men and women as predicted by XLNet. Similar to our observations for gender differences predicted by GPT-2, we see a higher cumulative share in the top jobs for women as compared to men. The top job for woman (maid at 27%) represents a substantially larger proportion than the top job for man (carpenter at 11%). Interestingly, men are predicted to be maids 5% of the time, a pattern that we did not see with GPT-2. Fig. 11 shows the proportion of genders in all jobs mentioned more than 35 times for baseline man and woman. This is the same threshold as the one we used to calculate the analogous gender parity graph for GPT-2 jobs. Men and women are associated with stereotypical jobs, but slightly different ones than those predicted by GPT-2. In this case, we see that men are associated with a variety of jobs, such as courier, barber, teller, magician, and builder. Women are, yet again, associated with domestic and care-giving jobs, such as nanny, housewife, and nurse. Women are also highly associated with jobs such as gardener, bartender, secretary, and prostitute.
Figure 10. XLNet: Occupational distribution for men and women (baseline case). As with GPT-2, the job titles predicted by XLNet are less diverse and more stereotypical for women than for men: 5 jobs account for 50% of the outputs for women versus 11 for men, and 31 jobs account for 90% versus 46 for men.

Figure 11. XLNet: gender proportions when querying for the base case, i.e. X = {}, Y = {Man, Woman}, showing all jobs with more than 35 mentions, making up 65% of returned valid responses.

C. Processing
C.1. Named Entity Recognition
We used Stanford CoreNLP Named Entity Recognition (NER) to extract job titles from the sentences generated by GPT-2. Using this approach resulted in a sample loss of 10.6% for gender-occupation sentences and 19.6% for name-occupation sentences (see Tab. 6). The sample loss was due to Stanford CoreNLP NER not recognizing some job titles, e.g. "Karima works as a consultant-development worker", "The man works as a volunteer", or "The man works as a maintenance man at a local...".
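A sketch of this title-extraction step using stanza's CoreNLP client; this assumes a local Stanford CoreNLP installation, and the TITLE tag comes from CoreNLP's fine-grained English NER.

```python
from stanza.server import CoreNLPClient

sentences = ["The woman works as a waitress and a maid."]

# The client launches and manages a local CoreNLP Java server.
with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "ner"],
                   be_quiet=True) as client:
    for text in sentences:
        ann = client.annotate(text)
        for sent in ann.sentence:
            for mention in sent.mentions:
                if mention.entityType == "TITLE":   # job titles
                    print(mention.entityMentionText)
```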
C.2. Adjustment Factors
When comparing to the US data, some adjustments are made to ensure fair comparison. Firstly, there are no breakdowns by gender and ethnicity in the US Labor Bureau data, so we assume the proportion of women is equal across ethnicities. Secondly, for each gender-ethnicity pair, we generate the same number of sentence prompts per pair ($n = 7{,}000$). This implies the 'predicted' labor force has equal representation across groups, which is not the case in reality. Accordingly, the predicted proportions are scaled by the true distribution of gender and ethnicity reported in the US Labor Statistics. The scaling factor is

$$\gamma(c) = \frac{G(c)\,E(c)}{\hat{D}(c)},$$

where $G(c)$, $E(c)$ are the gender- and ethnicity-shares of the US data, respectively, and $\hat{D}(c) = 12.5\%$ is our artificial "population"-share:

$$\text{adj. Pred}(i, c) = \gamma(c) \times \text{Pred}(i, c), \qquad (3)$$

where $\text{Pred}(i, c)$ is the share of job $i$ for characteristics $c$. Tab. 8 shows the true proportions and the steps made in the adjustment process.
Table 8. Adjustment calculations.

Group            US Eth. (E)   US Gender (G)   G-E Distr. (D = G × E)   GPT Distr. (D̂)   Correction (γ)
Man              NA            0.530           0.530                    0.500             1.060
Woman            NA            0.470           0.470                    0.500             0.940
Asian Man        0.065         0.530           0.034                    0.125             0.276
Asian Woman      0.065         0.470           0.031                    0.125             0.244
Black Man        0.123         0.530           0.065                    0.125             0.522
Black Woman      0.123         0.470           0.058                    0.125             0.462
Hispanic Man     0.176         0.530           0.093                    0.125             0.746
Hispanic Woman   0.176         0.470           0.083                    0.125             0.662
White Man        0.777         0.530           0.412                    0.125             3.294
White Woman      0.777         0.470           0.365                    0.125             2.922

C.3. Matching GPT-2 and US Jobs
The US data has four nested levels of disaggregation, e.g. Management, professional, and related occupations → Professional and related occupations → Computer and mathematical occupations → Computer programmer. For GPT-2's 50 most frequently mentioned jobs, we match the GPT-2 job title to one in the US data at the lowest nested level, apart from 'salesperson' and 'manager', which are too general to match to the lowest disaggregation. For these, we match to 'sales and related occupations' and 'management occupations', respectively. In total, we find correspondences for 41/50 jobs. Jobs were not matched for three reasons: (i) there were too many varied mentions of a job, e.g. 'clerk' was associated with 25 different jobs spanning the finance, law, and hospitality sectors; (ii) there was no match for a job, e.g. 'prostitute' and 'translator'; (iii) the jobs were inherently gendered, e.g. 'waitress' and 'salesman'. There are two further considerations in matching. First, when a GPT-2 job is less general than the US categories. For example, while GPT-2 gave separate predictions for taxi drivers and chauffeurs, the US data only reports 'taxi drivers and chauffeurs'. Similarly, while GPT-2 gives separate predictions for maids, housekeepers, and cleaners, the US category amalgamates these into 'maids and housekeeping cleaners'. For these cases, we average across GPT-2's predictions for the relevant jobs, i.e. combining the predictions for maid, housekeeper, and cleaner. Second, when GPT-2's predictions are more general than the US categories, for example, when GPT-2 returns the token 'teacher' but the US data reports 'postsecondary teachers', 'preschool and kindergarten teachers', etc. For these cases, we sum across the US sub-categories. Tab. 9 gives details on these matches.
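A small sketch of the two matching adjustments, with hypothetical share values:

```python
# GPT-2 finer than the US category: average the GPT-2 predictions.
gpt_share = {"maid": 0.07, "housekeeper": 0.03, "cleaner": 0.02}
gpt_maids_housekeeping = sum(gpt_share.values()) / len(gpt_share)

# GPT-2 coarser than the US categories: sum the US sub-categories.
us_share = {
    "postsecondary teachers": 0.010,
    "preschool and kindergarten teachers": 0.007,
    "elementary and middle school teachers": 0.020,
    "special education teachers": 0.003,
}
us_teachers = sum(us_share.values())
print(gpt_maids_housekeeping, us_teachers)
```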
Table 9. Job matches between GPT-2 predicted jobs and US data.

GPT-2 job → US data category:
babysitter → Childcare workers
secretary / assistant → Secretaries and administrative assistants
receptionist → Receptionists and information clerks
cleaner / housekeeper / maid → Maids and housekeeping cleaners
nurse → Registered nurses
social worker → Social workers
teacher → Postsecondary teachers; Preschool and kindergarten teachers; Elementary and middle school teachers; Special education teachers
model → Models, demonstrators, and product promoters
writer → Writers and authors
barista → Counter attendants, cafeteria, food concession, and coffee shop
bartender → Bartenders
photographer → Photographers
bus driver → Bus drivers
reporter / journalist → News analysts, reporters and correspondents
cook → Cooks
doctor → Physicians and surgeons
manager → Management occupations
janitor → Janitors and building cleaners
lawyer → Lawyers
barber → Barbers
chef → Chefs and head cooks
guard / security guard / bouncer → Security guards and gaming surveillance officers
courier → Couriers and messengers
computer programmer → Computer programmers
police officer → Police and sheriff's patrol officers
taxi driver / chauffeur / driver → Taxi drivers and chauffeurs
truck driver → Driver/sales workers and truck drivers
construction worker / laborer → Construction laborers
carpenter → Carpenters
plumber → Pipelayers, plumbers, pipefitters, and steamfitters
mechanic → Automotive service technicians and mechanics
salesperson → Sales and related occupations

EXCLUDED JOBS:
clerk (too many sub-categories); technician (too many sub-categories); consultant (no entry); contractor (no entry); prostitute (no entry); translator (no entry); salesman (gendered title); waitress (gendered title); waiter (gendered title)
D. Regression Analysis
D.1. Percentage of Significant Coefficients
Tab. 10 shows the percentage of significant coefficients for each intersection. To produce these results, we run regressions for all jobs mentioned more times than the same threshold values used in the paper. Each regression includes all main effects and interaction terms. We then compute the percentage of significant coefficients for each term across all regressions, with baseline man as the reference group. We repeat these steps for each intersection: ethnicity, religion, sexuality, and political affiliation. We did not run regressions for continent name origin because there was no suitable baseline category, given every first name has geographic and gender associations.

Considering religion, the Buddhist term has the highest percentage significance across all regressions (78%), while the Hindu term has the lowest (55%). This supports the findings in the paper that some religions are stronger determinants of jobs than others. Of the interaction terms, woman:buddhist is the least significant (19%). This finding suggests that male jobs are more highly determined by Buddhist membership, but female jobs are less strongly associated with this affiliation. Considering ethnicity, the Hispanic term is most commonly significant (64%), while the Asian term is less commonly significant (42%). The interactions for Hispanic and Black women are more frequently significant than those for White and Asian women. This finding suggests some ethnicity-gender pairs more saliently affect GPT-2's priors on job associations. Considering sexuality, both sexuality categories (gay/straight) are significant in approximately 50% of regressions. A woman's intersectional association with being lesbian is more commonly significant than an association with being straight. Considering political affiliation, the liberal term is more commonly significant than the conservative term, and the same pattern applies to gender-political interaction terms.

Finally, we can compare the average significance of categories, gender, and their intersections across the religion, ethnicity, sexuality, and political regressions. Religion main effects are on average significant in 66% of regressions, ethnicity main effects in 53% of regressions, sexuality main effects in 48% of regressions, and political main effects in 60% of regressions. This suggests that, for men, there is higher across-religion variation in predicted jobs than, say, across-sexuality variation. The woman dummy is significant in 61% of religion regressions, in 71% of ethnicity regressions, in 61% of sexuality regressions, and in 59% of political regressions. This finding demonstrates that the woman-man variation is most influential in distinguishing between job affiliations for ethnicity and least influential for political affiliation. Across all regressions, the woman dummy is highly significant, suggesting gender is an important determinant of job predictions. Finally, the interaction terms are significant in 26% of religion regressions, in 30% of ethnicity regressions, in 31% of sexuality regressions, and in 27% of political regressions. This suggests that, for women, sexuality and ethnicity are stronger determinants of job associations. Interaction terms are significant in approximately one-third of regressions, while the woman dummy is significant in approximately two-thirds of regressions. This finding suggests that, while intersectionality is a relevant determinant of the predicted job, gender more strongly influences GPT-2's priors over occupational associations.
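The percentages in Tab. 10 can be computed by stacking the p-values of the fitted models; a sketch, where results is a hypothetical list of fitted statsmodels logit results, one per job:

```python
import pandas as pd

def significance_shares(results, alpha=0.05):
    # Fraction of job regressions in which each term is significant.
    pvalues = pd.DataFrame([res.pvalues for res in results])
    return (pvalues < alpha).mean().round(2)
```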
Table 10. Percentage of significant coefficients in logistic regressions, by intersection.

RELIGION                 ETHNICITY               SEXUALITY              POLITICAL
Intercept          0.94  Intercept         0.95  Intercept        0.90  Intercept           0.92
buddhist           0.78  asian             0.42  gay              0.51  conservative        0.55
christian          0.69  black             0.55  straight         0.44  liberal             0.66
hindu              0.55  hispanic          0.64  woman            0.61  woman               0.59
jewish             0.66  white             0.49  woman:lesbian    0.35  woman:conservative  0.24
muslim             0.64  woman             0.71  woman:straight   0.26  woman:liberal       0.30
woman              0.61  woman:asian       0.29
woman:buddhist     0.19  woman:black       0.36
woman:christian    0.27  woman:hispanic    0.38
woman:hindu        0.27  woman:white       0.16
woman:jewish       0.33
woman:muslim       0.25
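To make this procedure concrete, the following is a minimal sketch in Python with pandas and statsmodels, not the authors' released code. It assumes a DataFrame df with one row per GPT-2 completion and columns 'job' (the predicted occupation), 'woman' (a 0/1 dummy, so woman = 0 recovers the baseline man) and a categorical demographic column such as 'religion'; these column names are illustrative assumptions.

import pandas as pd
import statsmodels.formula.api as smf

def pct_significant(df: pd.DataFrame, category: str, threshold: int,
                    alpha: float = 0.05) -> dict:
    """For every job mentioned more than `threshold` times, fit a logistic
    regression of the job indicator on gender, the demographic category and
    their interaction, then report the share of fitted regressions in which
    each term is significant at level `alpha`."""
    counts = df["job"].value_counts()
    frequent_jobs = counts[counts > threshold].index
    sig, n_models = {}, 0
    for job in frequent_jobs:
        data = df.assign(y=(df["job"] == job).astype(int))
        try:
            fit = smf.logit(f"y ~ woman * C({category})", data=data).fit(disp=0)
        except Exception:  # no convergence: insufficient demographic variation
            continue
        n_models += 1
        for term, p in fit.pvalues.items():
            sig[term] = sig.get(term, 0) + int(p < alpha)
    return {term: count / max(n_models, 1) for term, count in sig.items()}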
D.2. All Regression Results
Fig. 12 presents the significant p-values in all regressions for main effects and interaction terms. Significant p-values (p < 0.05) are shaded in black, while non-significant terms are left white. Considering for example ethnicity, there are two axes of variation. First, some jobs have significant p-values across all terms, such as supervisor and teacher, indicating that these jobs are highly segmented by gender and by ethnicity, but also by their interaction. Jobs with no significant p-values represent cases where the model did not converge, which occurred when there was insufficient variation across different demographics. In Fig. 13, we present the direction and magnitude of the significant coefficients. Negative coefficients, i.e. those that make the job prediction less likely, are shaded in red; positive coefficients, i.e. those that make the job association more likely, are shaded in blue; insignificant coefficients (p > 0.05) are left white. A darker color indicates a larger coefficient magnitude. We present all the results so an interested reader can select a certain job and find the associated coefficients for gender and intersections, alongside their interaction terms.

Finally, Fig. 14 presents the change in pseudo-R² for all job regressions across intersections when the woman dummy is added and when the interaction terms are added. To produce these results, we first run a regression with all the main effects of categorical membership, e.g. ('Asian', 'Black', 'Hispanic', 'White'), but without the woman dummy. Given that baseline 'man' is the reference group, all gender variation resides in the intercept. Next, we re-add the woman dummy and observe how the model fit improves. Finally, we run a regression with all main effects and all interaction terms and see what additional variation is explained; a minimal sketch of this nested comparison is given below. The general pattern is that the woman dummy has a greater effect on the model fit than the interactions. This suggests that while interaction terms for intersectional associations are significant in approximately one-third of job regressions, they explain a lower proportion of variation than gender. Once again, there is considerable variation by job and by intersection, so for detailed insights we invite readers to examine particular occupation-demographic patterns.
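The nested-model comparison behind Fig. 14 can be sketched as follows, under the same assumed data layout as in Appendix D.1. We use statsmodels' prsquared (McFadden's pseudo-R²) as a stand-in; the exact pseudo-R² variant used is an assumption.

import statsmodels.formula.api as smf

def pseudo_r2_gains(data, category):
    """Fit three nested logistic models for one job indicator `y` and return
    the gain in pseudo-R^2 from (i) adding the woman dummy and (ii) adding
    the gender-category interaction terms."""
    base = smf.logit(f"y ~ C({category})", data=data).fit(disp=0)
    plus_woman = smf.logit(f"y ~ woman + C({category})", data=data).fit(disp=0)
    full = smf.logit(f"y ~ woman * C({category})", data=data).fit(disp=0)
    return {"add_woman_dummy": plus_woman.prsquared - base.prsquared,
            "add_interactions": full.prsquared - plus_woman.prsquared}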
[Figure 12 spans four panels (ETHNICITY, RELIGION, SEXUALITY, POLITICAL); in each panel, rows are the regression terms (intercept, main effects, woman dummy, interactions) and columns are jobs.]
Figure 12. Significant p-values (p < 0.05) for job regressions: significant (black), non-significant (white).
[Figure 13 spans four panels (ETHNICITY, RELIGION, SEXUALITY, POLITICAL) with the same term-by-job layout as Figure 12.]
Figure 13. Significant coefficients for job regressions: negative (red), positive (blue), and insignificant (white).
[Figure 14 spans four panels (ETHNICITY, RELIGION, SEXUALITY, POLITICAL); for each job, bars show the pseudo-R² gain from adding the woman dummy and from adding the interaction terms.]
Figure 14. Change in pseudo-R² from the addition of the woman dummy and the interaction terms for job regressions. The plots show that adding the woman dummy has a greater effect on pseudo-R² than adding the interaction terms.

E. Further Analysis for Intersectional Breakdowns
Distributional Analysis.
Fig. 15 shows the distributional analysis for man and woman by intersection. The distributions for the ethnicity, religion, and sexuality intersections show that the job titles predicted by GPT-2 are less diverse and more stereotypical for women than for men. For political intersections and for continent-based name intersections, the disparity is not as apparent: in these latter two cases, the distributions of jobs predicted for men and women are more similar.

[Figure 15: six panels (base, ethnicity, religion, sexuality, political, continent), each plotting share of total against log(rank) for women (W) and men (M).]
Figure 15. Occupational distribution for men and women by intersection. With the exception of the continent name origin intersection (bottom-right), all other intersections show that the job titles predicted by GPT-2 are less diverse and more stereotypical for women than for men.
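The curves in Fig. 15 can be computed along the following lines, assuming a pandas Series of predicted job strings for one gender-intersection group (a sketch, not the released code):

import numpy as np
import pandas as pd

def rank_share(jobs: pd.Series) -> pd.DataFrame:
    """Each job's share of all mentions against its log-rank, the quantity
    plotted per gender-intersection group in Fig. 15."""
    counts = jobs.value_counts()  # sorted most-frequent first, i.e. by rank
    return pd.DataFrame({
        "log_rank": np.log(np.arange(1, len(counts) + 1)),
        "share_of_total": (counts / counts.sum()).to_numpy(),
    }, index=counts.index)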
Lorenz Curve Analysis.
Fig. 16 shows the Lorenz curve for men and women by intersection. With the exception of intersections with continent-based names, women are concentrated in a smaller number of job titles than men. This can be seen clearly in Fig. 17, which zooms in on the lower portion of the curve. We see that the largest distributional differences are in the religion and sexuality intersections. The difference is smaller for political intersections, agreeing with our finding in the paper that political affiliation has less of an effect by gender on GPT-2's occupational predictions. The curves for continent-based name intersections are nearly identical, suggesting that GPT-2 predicts a distribution with less disparity when it is prompted with first names rather than an explicit intersection, e.g. 'Black woman'/'Buddhist man'.

[Figure 16: six panels (base, ethnicity, religion, sexuality, political, continent), each plotting cumulative share of jobs against cumulative share of total workers for women (W) and men (M).]
Figure 16. Lorenz curve for men and women by intersection. For all intersections except continent-based names, the occupations predicted for women are concentrated in a smaller number of job titles compared to men.
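A minimal sketch of the Lorenz-curve computation, assuming job titles are ordered from most to least frequent (the ordering consistent with the plotted curves):

import numpy as np

def lorenz_curve(job_counts):
    """Cumulative share of job titles (y) against cumulative share of
    workers (x), with jobs ordered from most to least frequent (Fig. 16)."""
    counts = np.sort(np.asarray(job_counts, dtype=float))[::-1]
    cum_workers = np.cumsum(counts) / counts.sum()          # x-axis
    cum_jobs = np.arange(1, len(counts) + 1) / len(counts)  # y-axis
    return cum_workers, cum_jobs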
[Figure 17: the same six Lorenz-curve panels as Figure 16, restricted to the lower region of the y-axis.]
Figure 17. Focused Lorenz curve (lower region of the y-axis) for men and women by intersection. The largest distributional difference is in the religion intersection, whereas the smallest is in the continent-based name origin.

Occupations by intersections.
In each of the stacked bar charts, we show the man-woman share of occupations for each gender-intersection pair. In Fig. 18, the majority of jobs remain split across all four ethnicities; there are no jobs dominated by a single ethnicity. In Fig. 19, the distribution of religion for each job is relatively equal, with the exception of a few jobs: for example, monks are composed mostly of Buddhist men and nuns mostly of Buddhist women, an observation noted in the paper. As expected, religious occupations tend to be dominated by one or two religions, while non-religious occupations are more evenly distributed across religions.
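The within-job compositions plotted in Figs. 18-22 can be computed along these lines (a sketch; the 'woman' and demographic column names are assumptions carried over from the earlier sketches):

import pandas as pd

def job_composition(df: pd.DataFrame, category: str) -> pd.DataFrame:
    """Within-job shares across gender-category pairs (each row sums to 1),
    the quantity shown in the stacked bar charts."""
    table = pd.crosstab(df["job"], [df["woman"], df[category]])
    return table.div(table.sum(axis=1), axis=0)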
[Figure 18: stacked bar chart of within-job shares for Asian, Black, Hispanic and White men and women.]
Figure 18. Man-woman share by ethnicity for all jobs with more than 140 mentions, making up 82% of returned valid responses.

[Figure 19: stacked bar chart of within-job shares for Buddhist, Christian, Hindu, Jewish and Muslim men and women.]
Figure 19. Man-woman share by religion for all jobs with more than 175 mentions, making up 84% of returned valid responses.

In Fig. 20, a number of jobs are dominated by one sexuality. For example, occupations such as detective, plumber, and guard are dominated by straight men, whereas occupations such as massage therapist, counsellor, and graphic designer are dominated by lesbian women. Some more female jobs, such as social worker, prostitute and housewife, are associated with gay men, but the overall share of men remains low. In Fig. 21, fewer jobs are dominated by one political affiliation, especially at the extremes of the distribution, mirroring our observation from the Lorenz curves. However, there are a few exceptions: occupations such as banker and critic are dominated by liberal men, driver and host by conservative men, and barista and translator by liberal women. Among women, drivers skew conservative, but the overall share of women in this job is low.
[Figure 20: stacked bar chart of within-job shares for lesbian/gay and straight men and women.]
Figure 20. Man-woman share by sexuality for all jobs with more than 70 mentions, making up 83% of returned valid responses.

[Figure 21: stacked bar chart of within-job shares for conservative and liberal men and women.]
Figure 21. Man-woman share by political affiliation for all jobs with more than 70 mentions, making up 82% of returned valid responses.

Lastly, in Fig. 22, we see that no jobs are dominated by one continent-based name origin, and there is less gender disparity in the jobs predicted by GPT-2. This agrees with the observations from the Lorenz curves. When GPT-2 is prompted with a first name, gender is a greater predictor of job titles than the geographic origin of the name, but the gender split is still less stark than for explicit 'man'/'woman' prompts.
[Figure 22: stacked bar chart of within-job shares by continent name origin (Africa, Americas, Asia, Europe, Oceania) for men and women.]
Figure 22. Man-woman share by continent name origin for all jobs with more than 500 mentions, making up 76% of returned valid responses.

E.1. Most Frequent Jobs Per Gender-Intersection
Tab. 11 shows the top five jobs per intersectional category, with the associated proportions of the category total. In general, the top five jobs for women of all intersections (except continent-based names) do not deviate far from the top five jobs predicted for the baseline woman case. In fact, the top job predicted for baseline women, waitress, is within the top five predicted jobs for women of all intersections, at similar proportions.

The top five jobs for men across intersections (again excepting continent-based names) show more variety relative to the baseline man case. While security guard (the top job predicted for baseline men) is still one of the most common jobs for men across intersections, it is absent from the top jobs of some (i.e. Buddhist, Christian, Jewish and liberal men). Of the religion intersections, only Hindu and Muslim men are predicted to be security guards, raising the question of whether GPT-2 associates some religions differently with religious and non-religious occupations (i.e. treats Muslim and Hindu men differently from Christian, Buddhist, and Jewish men). For political intersections, the job distributions for liberal and conservative men vary more from the distribution for baseline men, with top jobs not seen before, such as writer, journalist, consultant, and lawyer.

The exception to these patterns is the set of jobs predicted for continent-based name origins. For jobs predicted by name, the top jobs look similar across gender: writer, consultant, journalist, and lawyer. This suggests that if we do not prompt GPT-2 with an explicit gender (man/woman), it predicts a similar set of jobs for men and women.
Table 11. Top five jobs per intersectional category, with associated proportions of the category total.
Base
  Woman: waitress (0.14), nurse (0.11), maid (0.06), receptionist (0.05), teacher (0.05)
  Man:   security guard (0.08), manager (0.05), waiter (0.04), janitor (0.04), mechanic (0.03)

Ethnicity: Asian
  Woman: waitress (0.14), maid (0.11), nurse (0.08), teacher (0.05), receptionist (0.04)
  Man:   waiter (0.09), security guard (0.07), manager (0.04), janitor (0.04), chef (0.03)
Ethnicity: Black
  Woman: waitress (0.18), nurse (0.10), maid (0.07), prostitute (0.05), teacher (0.04)
  Man:   security guard (0.08), waiter (0.07), bartender (0.05), janitor (0.05), mechanic (0.04)
Ethnicity: Hispanic
  Woman: waitress (0.16), nurse (0.14), receptionist (0.07), maid (0.07), teacher (0.04)
  Man:   security guard (0.09), janitor (0.07), waiter (0.07), bartender (0.05), manager (0.05)
Ethnicity: White
  Woman: waitress (0.17), nurse (0.11), maid (0.07), teacher (0.05), receptionist (0.04)
  Man:   waiter (0.06), security guard (0.06), janitor (0.05), mechanic (0.04), bartender (0.04)

Religion: Buddhist
  Woman: nurse (0.12), waitress (0.11), maid (0.09), teacher (0.08), cook (0.04)
  Man:   teacher (0.06), janitor (0.05), waiter (0.05), doctor (0.04), monk (0.04)
Religion: Christian
  Woman: waitress (0.13), nurse (0.12), maid (0.10), teacher (0.07), prostitute (0.06)
  Man:   clerk (0.06), doctor (0.04), waiter (0.04), janitor (0.04), teacher (0.04)
Religion: Hindu
  Woman: maid (0.18), waitress (0.12), nurse (0.06), teacher (0.05), cleaner (0.05)
  Man:   waiter (0.09), janitor (0.06), security guard (0.04), teacher (0.04), cleaner (0.03)
Religion: Jewish
  Woman: waitress (0.15), nurse (0.10), maid (0.09), teacher (0.06), prostitute (0.05)
  Man:   waiter (0.08), doctor (0.05), clerk (0.04), janitor (0.04), teacher (0.04)
Religion: Muslim
  Woman: waitress (0.16), maid (0.14), nurse (0.08), teacher (0.05), cook (0.04)
  Man:   waiter (0.11), security guard (0.06), janitor (0.06), taxi driver (0.05), mechanic (0.04)

Sexuality: Lesbian/Gay
  Woman: waitress (0.15), nurse (0.12), teacher (0.06), maid (0.06), receptionist (0.05)
  Man:   waiter (0.07), bartender (0.06), janitor (0.05), security guard (0.05), waitress (0.04)
Sexuality: Straight
  Woman: waitress (0.19), nurse (0.08), maid (0.07), teacher (0.04), receptionist (0.04)
  Man:   waiter (0.06), bartender (0.05), security guard (0.04), manager (0.04), clerk (0.04)

Political: Liberal
  Woman: waitress (0.12), nurse (0.08), writer (0.07), teacher (0.05), receptionist (0.05)
  Man:   writer (0.10), journalist (0.08), lawyer (0.08), consultant (0.06), waiter (0.05)
Political: Conservative
  Woman: waitress (0.13), nurse (0.08), receptionist (0.06), writer (0.05), consultant (0.05)
  Man:   consultant (0.09), lawyer (0.06), writer (0.05), security guard (0.05), reporter (0.05)

Continent: Africa
  Woman: writer (0.10), consultant (0.08), journalist (0.05), lawyer (0.04), teacher (0.04)
  Man:   writer (0.09), consultant (0.08), journalist (0.07), lawyer (0.05), translator (0.04)
Continent: Americas
  Woman: writer (0.10), consultant (0.08), journalist (0.05), lawyer (0.04), teacher (0.04)
  Man:   writer (0.10), consultant (0.10), journalist (0.06), lawyer (0.05), manager (0.04)
Continent: Asia
  Woman: writer (0.09), consultant (0.06), translator (0.05), journalist (0.05), teacher (0.04)
  Man:   consultant (0.10), writer (0.09), journalist (0.06), lawyer (0.04), translator (0.04)
Continent: Europe
  Woman: writer (0.10), consultant (0.07), journalist (0.05), nurse (0.05), teacher (0.04)
  Man:   writer (0.11), consultant (0.10), journalist (0.06), lawyer (0.04), producer (0.04)
Continent: Oceania
  Woman: writer (0.09), consultant (0.07), teacher (0.05), nurse (0.04), journalist (0.04)
  Man:   writer (0.11), consultant (0.08), journalist (0.05), teacher (0.04), lawyer (0.04)
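Tab. 11 can be reproduced from the completions with a simple group-by, sketched below; the 'group' column (e.g. 'Black woman', 'liberal man') is an assumed encoding of the gender-intersection pairs:

import pandas as pd

def top_jobs(df: pd.DataFrame, k: int = 5) -> dict:
    """Top-k jobs and their share of each group's total, as in Tab. 11."""
    out = {}
    for group, sub in df.groupby("group"):
        shares = sub["job"].value_counts(normalize=True).head(k)
        out[group] = [(job, round(share, 2)) for job, share in shares.items()]
    return out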
F. Further Analysis for US Comparison
F.1. Gender Predictions
Fig. 23 plots the percentage of women for each occupation as predicted by GPT-2 and as observed in the US Labor Bureau data. The bar plot shows the difference between the predicted percentage and the true percentage. We see that GPT-2 pulls the skewed real-life distribution towards gender parity. For example, GPT-2 predicts more women mechanics, carpenters, taxi drivers, and police officers than there are in real life, and fewer women secretaries, maids, nurses, and models than observed in reality. Together, these examples suggest that GPT-2 under-predicts the number of women in heavily women-dominated jobs and over-predicts the number of women in heavily men-dominated jobs. This supports our finding in the paper: although it may seem biased that GPT-2 predicts so many women to be secretaries and maids, the share of women within these occupations is actually higher in the US data.
Figure 23. GPT-2 predictions versus US data by gender share. Difference between the percentage of women predicted by GPT-2 and the percentage of women in the 2019 US Labor Force Statistics data, per occupation.
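The comparison behind Fig. 23 can be sketched as follows, assuming us_share is a Series of percentage-of-women values from the Labor Bureau data, indexed by the matched occupation titles (an assumed layout):

import pandas as pd

def gender_gap(completions: pd.DataFrame, us_share: pd.Series) -> pd.Series:
    """Percentage-point difference between GPT-2's predicted share of women
    per occupation and the US share, the quantity plotted in Fig. 23."""
    gpt_share = completions.groupby("job")["woman"].mean() * 100
    return (gpt_share - us_share).dropna().sort_values()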
F.2. Gender-Ethnicity Predictions
Fig. 24 presents the difference between the US data and GPT-2's predicted proportions of gender-ethnicity pairs for the top 50 most frequently mentioned jobs that aligned with US occupational categories. The jobs on the y-axis are sorted by the true share of women in the US data. In line with the low mean-squared errors presented in the paper, GPT-2 accurately predicts the gender-ethnicity split for a given job, especially for Asian and Black workers. For jobs with a wide gender split, GPT-2 seems to correct for societal skew. For example, it under-predicts the proportion of Hispanic women who are cleaners, housekeepers and maids by 34 percentage points. Similarly, it under-predicts the proportion of Black men who are taxi drivers, chauffeurs or drivers, and the proportion of Hispanic men who are mechanics, plumbers, carpenters and construction workers. The proportion of White workers is less accurately predicted, but the same pattern holds: under-predicting the proportion of women in female-dominated jobs and over-predicting the proportion of women in male-dominated jobs.
Figure 24. GPT-2 predictions versus US data by gender-ethnicity intersection. Red means that GPT-2 over-predicts the share of the occupation-ethnicity intersection pair; blue means that GPT-2 under-predicts it.
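The per-intersection comparison can be sketched as follows, assuming both frames are aligned on the matched job titles with one column per gender-ethnicity pair (column names like 'asian_W' are illustrative):

import pandas as pd

def intersection_errors(gpt: pd.DataFrame, us: pd.DataFrame):
    """Signed errors and per-intersection mean-squared error between GPT-2's
    predicted within-job shares and the US shares, as underlying Fig. 24."""
    diff = gpt - us                 # > 0 where GPT-2 over-predicts the share
    mse = (diff ** 2).mean(axis=0)  # one MSE per gender-ethnicity column
    return diff, mse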
G. Companies Using AI for Hiring
Gartner has identified various use cases where AI can be useful in the hiring process, such as talent acquisition and HR virtual assistants. A number of companies are already using AI in hiring, e.g. Aviro AI and Entelo. These companies have automated parts of the hiring process, reducing human involvement in the assessment of job applications. This can have serious implications for people from marginalized groups if the bias in the underlying AI models is not addressed.