Gender trends in computer science authorship
A PREPRINT
Lucy Lu Wang, Gabriel Stanovsky, Luca Weihs, Oren Etzioni
Allen Institute for Artificial Intelligence
Seattle, Washington, USA

June 20, 2019

ABSTRACT
A comprehensive and up-to-date analysis of Computer Science literature (2.87 million papers through 2018) reveals that, if current trends continue, parity between the number of male and female authors will not be reached in this century. Under our most optimistic projection models, gender parity is forecast to be reached by 2100, and significantly later under more realistic assumptions. In contrast, parity is projected to be reached within two to three decades in the biomedical literature. Finally, our analysis of collaboration trends in Computer Science reveals decreasing rates of collaboration between authors of different genders.
This paper presents a comprehensive and up-to-date analysis of gender trends in the Computer Science literature (ranging from 1970 through 2018). Specifically, we aim to address the following questions regarding gender and authorship in the Computer Science literature:

• How is gender balance among authors changing over time?
• When might gender parity be reached among authors?
• How is gender associated with co-authorship?

We answer these questions by performing an automated study of literature metadata from Computer Science conferences and journals (2.87 million papers), utilizing data from the Semantic Scholar academic search engine. To provide a basis for comparison, we also analyze papers from the top 1,000 Medline journals by citation count (11.63 million papers), and compare the trends observed in Computer Science to those in the biomedical literature.
Corpus             Total papers   Total author-paper   Average authors   Unique first
                   (millions)     units (millions)     per paper         names
Computer Science   2.87           8.24                 2.87              186,116
Medline            11.63          47.66                4.10              439,981

Table 1: Corpus statistics for Computer Science and Medline.

Our analysis was performed over the Computer Science and Medline corpora and their metadata. The corpora contain papers published between 1970 and June 2018, and associated metadata such as title, abstract, authors, publication venue, and year of publication. Summary statistics for both corpora are given in Table 1. The Computer Science corpus consists of 2.87 million papers retrieved from conferences and journals in Computer Science. Publication and author metadata are automatically derived by Semantic Scholar from DBLP (https://dblp.uni-trier.de/). The Medline corpus consists of 11.63 million papers from the top 1,000 Medline-indexed journals as determined using overall citation count.

Figure 1: At current rates of growth, the proportion of female authors is predicted to reach 0.45 around 2137 (95% Confidence Interval: [2109, 2172]). The trend line is given by an ARIMA projection with 95% confidence intervals.

The author list is extracted from all publications and compiled into a list of first names. We use Gender API (https://gender-api.com/) to perform gender lookup for each name. Gender API is a large online database of known name-gender relationships derived by linking publicly available governmental data with social media profiles in various countries. For each name, Gender API outputs the predicted binary gender (female or male), along with the accuracy associated with the prediction and the number of samples used to arrive at that determination. Authors for whom only first initials were available (less than 0.5% of all authors in our corpora) were excluded from analysis.

Note: We acknowledge that gender is not binary, but for the sake of this large-scale study we adopt a simplified view of gender as binary and rely on first names as an approximate proxy for the author's gender.

Because many names are gender-ambiguous, we use the accuracy returned by Gender API to represent each author as a composite of male and female. For example, the first name Matthew is determined to be male with an accuracy score of 100, the maximum. This result is unambiguous. The name Taylor, however, is determined to be female but receives an accuracy score of only 55. The accuracy is used to generate two probabilities for each name, (m, f), where m is the probability of the associated author being male, f is the probability of the associated author being female, and m + f = 1. In this example, each author with the first name Matthew is represented by the probability tuple (1.0, 0.0), and each author with the first name Taylor is represented by (0.45, 0.55).

Figure 2: The total number of male and female authors in the Computer Science corpus over time.

Most papers are authored by more than one individual. For the purposes of our analysis, each author-paper pair is treated as one unit. A single-author paper yields one author-paper pair, a three-author paper yields three author-paper pairs, and so on. In the Computer Science corpus, the average number of authors is 2.87 per paper. The average number of authors per paper increased from approximately 1.4 in 1970 to approximately 3.5 in 2018.

We perform two types of analysis on this data. First, we analyze publication trends, examining the proportion of female authors over time (Section 3.1). To identify when gender parity may be reached, we project the proportion of female authors based on current trends. Here, we define parity as the proportion of female authors falling within 10% of 0.5, that is, within the range 0.45-0.55. Second, we study the interactions between authors in the community through co-authorship as reflected in our data (Section 3.2).
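The mapping from a lookup result to the (m, f) tuple described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function and parameter names are hypothetical, and the accuracy is assumed to be a score on a 0-100 scale attached to the predicted gender, as in the Matthew and Taylor examples.

```python
def gender_probabilities(predicted_gender, accuracy):
    """Turn a (predicted gender, accuracy) lookup result into an (m, f) tuple.

    Assumption: `accuracy` is a 0-100 score for the predicted gender, so the
    other gender receives the remaining probability mass and m + f = 1.
    """
    p = accuracy / 100.0
    if predicted_gender == "male":
        return (p, 1.0 - p)
    if predicted_gender == "female":
        return (1.0 - p, p)
    raise ValueError("predicted_gender must be 'male' or 'female'")


def author_paper_units(first_names, lookup):
    """Expand one paper's author list into author-paper units.

    `lookup` is a hypothetical dict: first name -> (predicted gender, accuracy).
    Each author on the paper contributes exactly one (m, f) unit.
    """
    return [gender_probabilities(*lookup[name]) for name in first_names]


# Illustrative lookup table using the two names discussed in the text.
lookup = {"Matthew": ("male", 100), "Taylor": ("female", 55)}
units = author_paper_units(["Matthew", "Taylor", "Matthew"], lookup)
```

A three-author paper produces three units, matching the author-paper pair definition in the text.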
The proportion of female authors over time is used to determine the trend towards gender parity. The number of female authors in any year is computed as the sum of the probabilities $f$ over the author-paper units of that year, and the number of male authors is correspondingly the sum of the probabilities $m$. The proportion of female authors for each year, $F_t$, is the number of female author-paper units divided by the total number of author-paper units for that year. We compute projections using an autoregressive integrated moving average (ARIMA) analysis, a commonly used method for creating forecasting models [1]. We use the auto ARIMA function in the R 'forecast' package [2], which automates the selection of ARIMA model order, with a preference for simple models with lower order.

The growth in gender proportion is expected to follow logistic behavior, in which gender balance eventually reaches a stable equilibrium. We first apply $\sigma_\alpha^{-1}$, the inverse of the $\alpha$-scaled sigmoid function $\sigma_\alpha(x) = \alpha / (1 + \exp(-x))$ (that is, a scaled logit), to map the gender proportion onto the real line so that the data is more amenable to linear approximation. We call $\alpha$ the expected equilibrium proportion parameter. This transform generates $y_t = \sigma_\alpha^{-1}(F_t)$, where $F_t$ is the proportion of female authors per year. We then fit a non-seasonal ARIMA model with parameters $p$, $d$, and $q$ to the transformed process $y_t$, represented by the following equation:

$$\phi_p(B)\,(1 - B)^d\, y_t = c + \theta_q(B)\,\varepsilon_t \qquad (1)$$

where $B$ is the backshift operator, which shifts by one to the previous time point, and $\varepsilon_t$ is zero-centered, normally distributed noise [2].

Finally, we obtain the forecast in the original domain using a sigmoid transform over the projected values, applying $\sigma_\alpha$ to $y_t$ for each projected year $t$. We let $\alpha = 0.5$ so that $\sigma_{0.5}$ has minimum and maximum values of 0 and 0.5 respectively. This constrains the projected values to lie between 0 and the expected equilibrium proportion of 0.5.
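The yearly proportion and the scaled sigmoid transform described above can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, and the ARIMA fit itself (performed in the source with the auto ARIMA function of the R 'forecast' package) is deliberately not reproduced.

```python
import math


def female_proportion(units):
    """F_t for one year: sum of f divided by the total number of units.

    `units` is an iterable of (m, f) probability tuples, one per
    author-paper unit published in that year.
    """
    units = list(units)
    female = sum(f for _, f in units)
    male = sum(m for m, _ in units)
    return female / (female + male)


def scaled_logit(F, alpha=0.5):
    """Inverse of the alpha-scaled sigmoid: maps F in (0, alpha) to the real line."""
    if not 0.0 < F < alpha:
        raise ValueError("F must lie strictly between 0 and alpha")
    return math.log(F / (alpha - F))


def scaled_sigmoid(y, alpha=0.5):
    """alpha-scaled sigmoid: maps a real value back into (0, alpha)."""
    return alpha / (1.0 + math.exp(-y))


# Transform an observed proportion, then map it back to the original domain.
y = scaled_logit(0.2, alpha=0.5)
F_back = scaled_sigmoid(y, alpha=0.5)
```

Because every forecast is passed back through `scaled_sigmoid`, projected proportions stay below the chosen equilibrium value regardless of how the transformed series is extrapolated.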
The 95% predictive interval is computed and shown for all projections. Note that $\alpha$ represents the proportion of female authors we expect in the long run. An equilibrium proportion of 0.5 indicates that we expect the authorship makeup to eventually stabilize at around 50% men and 50% women. An equilibrium proportion of 0.9 indicates that we expect the authorship makeup to eventually stabilize at around 10% men and 90% women. Trends toward equality suggest that the former is more plausible than the latter. As is further elaborated in Section 4.1, we perform a sensitivity analysis to determine the effect of the selected $\alpha$ parameter on the year in which parity is expected to be reached.

Co-authorship is computed for each unique pair of authors on each paper. If a paper has $n$ authors, $\binom{n}{2}$ co-author pairs are generated. Given a co-author pair $(n_1, n_2)$ and associated gender probabilities

$$n_1 \to (m_1, f_1), \qquad n_2 \to (m_2, f_2) \qquad (2)$$

we compute three probabilities, $p_{mm}$, $p_{mf}$, and $p_{ff}$, corresponding to the possible gender combinations, i.e., two male authors, one male and one female author, and two female authors respectively. These probabilities are:

$$p_{mm} = m_1 m_2, \qquad p_{mf} = m_1 f_2 + f_1 m_2, \qquad p_{ff} = f_1 f_2 \qquad (3)$$

where $p_{mm} + p_{mf} + p_{ff} = 1$. The total number of male-male, male-female, and female-female co-author pairs for each year is computed by summing each of the three probabilities above over all co-authorship pairs of that year.

We then assess the number of same-gender and different-gender collaborations over time. The results are measured as a deviation from the expected, where the expected co-authorships are determined by sampling from the numbers of female and male authors active in a given year, assuming the same number of collaborations per year as observed in our data. The total number of extra or missing collaborations is computed as the difference between the observed counts of each type of collaboration and the expected value. To show rates of change, we also compute the ratio between observed and expected collaborations (O/E) of each type.

Figure 3: The equilibrium female author proportion parameter affects the year that parity is reached. The expected year for reaching exact parity (the first year in which the female author proportion equals or exceeds 0.5) is shown along with 95% confidence intervals.

The 2.87 million papers in the Computer Science corpus yield 8.24 million author-paper units.
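The co-author pair probabilities of the methods section, and a comparison against an expected value, can be sketched as below. The function names are hypothetical. The expected counts here use a closed-form random-mixing assumption (each member of a pair drawn independently according to the yearly gender proportions), which is a simplification swapped in for illustration; the source describes obtaining the expectation by sampling.

```python
from itertools import combinations


def pair_probabilities(a, b):
    """(p_mm, p_mf, p_ff) for one co-author pair of (m, f) tuples."""
    m1, f1 = a
    m2, f2 = b
    return (m1 * m2, m1 * f2 + f1 * m2, f1 * f2)


def paper_pair_totals(authors):
    """Sum the three probabilities over all unique author pairs of one paper."""
    totals = [0.0, 0.0, 0.0]
    for a, b in combinations(authors, 2):
        for i, p in enumerate(pair_probabilities(a, b)):
            totals[i] += p
    return tuple(totals)


def expected_pair_counts(n_male, n_female, n_pairs):
    """Expected (mm, mf, ff) counts under random mixing (an assumption)."""
    total = n_male + n_female
    pm, pf = n_male / total, n_female / total
    return (n_pairs * pm * pm, n_pairs * 2 * pm * pf, n_pairs * pf * pf)


def observed_to_expected(observed, expected):
    """Element-wise O/E ratio for the three collaboration types."""
    return tuple(o / e for o, e in zip(observed, expected))


# One three-author paper: an unambiguous male, "Taylor", an unambiguous female.
observed = paper_pair_totals([(1.0, 0.0), (0.45, 0.55), (0.0, 1.0)])
```

For a paper with n authors the three totals always sum to the number of pairs, n(n-1)/2, since each pair contributes probabilities that sum to one.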
Figure 2 shows the number of female and male authors over time. The total number of authors is increasing over time, along with the proportion of female authors.

Figure 1 shows the projected proportion of female authors in the Computer Science corpus. The projected growth in female author proportion is computed using ARIMA, with model order (p, d, q) = (2, , ). Residuals of the fit line appear normally distributed, and the Shapiro-Wilk normality test does not reject normality (W = 0.98, p-value = 0.68) [3]. Based on these projections, the proportion of female authors in Computer Science is predicted to reach 0.45 around 2137 (95% CI: [2109, 2172]), more than 115 years from now.

Figure 3 shows a sensitivity analysis over the equilibrium female author proportion parameter $\alpha$. This analysis shows the year in which parity is first reached at each equilibrium proportion. Note that when $\alpha = 0.5$, exact 50/50 parity is, by definition, never attained in finite time. We therefore report the time at which the female author proportion surpasses 0.45, within 10% of exact parity. When the equilibrium proportion is expected to favor women over men (above 0.5), parity is reached earlier. Even with the aggressive projection that women will eventually author 90% of all publications, the expected year in which parity will be reached at current rates of growth is still around 2100.

Figure 4: The difference (left) and ratio (right) between observed and expected same- and different-gender co-authorships in Computer Science since 1995.

Figure 5: The total numbers of female and male authors in the Medline corpus.
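The reason exact parity is unreachable when the equilibrium parameter equals 0.5, while the 0.45 threshold is reachable, follows directly from the scaled sigmoid used for the projections. The short derivation below is a worked illustration with rounded numeric values, not a result reported in the source.

```latex
% With \alpha = 0.5 the projection approaches 0.5 but never reaches it:
\sigma_{0.5}(y) = \frac{0.5}{1 + e^{-y}} < 0.5
\quad \text{for every finite } y .

% The 0.45 threshold corresponds to a finite value of the transformed series:
\sigma_{0.5}(y) \ge 0.45
\iff 1 + e^{-y} \le \frac{0.5}{0.45}
\iff e^{-y} \le \frac{1}{9}
\iff y \ge \ln 9 \approx 2.197 .

% With \alpha = 0.9, exact parity is attainable at a finite value:
\sigma_{0.9}(y) \ge 0.5
\iff e^{-y} \le 0.8
\iff y \ge \ln 1.25 \approx 0.223 .
```

A larger equilibrium parameter therefore lowers the value the transformed series must reach, which is consistent with parity arriving earlier as the parameter increases.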
The numbers of same- and different-gender co-authorships in Computer Science were computed for each year. Figure 4 shows the number of extra and missing same- and different-gender collaborations since 1995. There are more same-gender co-authorships than would be expected among both men and women, and fewer different-gender co-authorships than would be expected. In recent years, more than 20,000 different-gender collaborations per year were missing when compared to expected numbers.

The observed-to-expected ratio shows discouraging collaboration trends. Although both men and women are more likely to collaborate with authors of their own gender (positive O/E), the degree of same-gender preference is declining among female authors but increasing among male authors. At the same time, the different-gender collaboration gap (O/E < …
The Medline corpus of 11.63 million papers yields 47.66 million author-paper units. Figure 5 shows the number of female and male authors in the Medline corpus. Figure 6 shows the projected proportion of female authors forecast using ARIMA, with model order (p, d, q) = (0, , ). A discontinuity can be observed in the Medline corpus data in 2002. This is due to the requirement of full author names in Medline-indexed records beginning in publication year 2002 [4]. The drop in proportion in 2002 shows that Medline journals not using full names for authors contributed to the false appearance of a high representation of female authors prior to 2002. Consequently, the ARIMA projection is computed using only the proportion data since 2002. The projection forecasts the proportion of female authors to surpass 0.45 around 2048 (95% CI: [2045, 2051]), a bit over 25 years from now. Because of the large number of authors in the Medline corpus since 2002, the confidence intervals for this projection are quite narrow.

Our analysis of the Computer Science literature reveals persistent patterns of inequality in gender and academic authorship. Although gender balance is improving, progress is slower than we had hoped.
Inferring gender from names is imperfect, and all gender-inference tools are subject to biases. Several studies have described and measured the differences between these services [5, 6]. Based on the results of Santamaría and Mihaljević, Gender API has the lowest overall error rate but was slightly biased toward under-representation of females in their evaluation. In other words, the estimated number of women may be slightly lower than in reality. However, this bias may be offset by our sampling bias, since the population of Computer Science authors is unlikely to be an unbiased sample of the general population, or of the subset of the general population whose names were used to construct the database behind Gender API. We attempted to mitigate some of these biases by treating the output of Gender API as probabilistic. To assess the accuracy of Gender API, we validated the predictions for the 50 names most commonly predicted as male and the 50 names most commonly predicted as female, and found them to match our expectations.

The proportion of authors with high-uncertainty Gender API results has also grown in our corpus over time. As evident in Figure 7, our average confidence in gender prediction decreased from about 90% in 1970 to 85% in 2018. While Gender API's average prediction confidence on our corpus is still high, this trend may pose a challenge for similar analyses in the future. Upon inspection of the data, we attribute this to the growing number of East Asian authors publishing in recent years. East Asian first names, especially when subject to romanization, can be quite gender-ambiguous. We believe that by representing each author as a composite of male and female based on probability, we offset some of the issues associated with the increasing numbers of ambiguous names in our corpus over time. However, the authors of Computer Science literature are unlikely to be an unbiased sub-sample of the broader population, and this assumption may introduce some error into our analysis.

Figure 6: The proportion of female authors in the Medline corpus is projected to surpass 0.45 around 2048 (95% Confidence Interval: [2045, 2051]) based on ARIMA projections. The 95% confidence interval around the projection is plotted, but the error is small.

Figure 7: The average author gender confidence of Gender API on the Computer Science corpus, per publication year.

We also recognize the limitations of using author-paper pairs as our units of measure. We do not distinguish between a person who is the single author of a paper and a person who co-authors with many others. This biases our data by over-weighting authors of papers with more authors. Similarly, in our analysis of collaboration, we take each combination of authors for a paper as a collaborating pair, which again over-weights papers with more authors. In the Computer Science corpus, we observe an increase in the average number of authors per paper over time, growing to approximately 3.5 authors per paper in 2018. However, Computer Science papers are still generally authored by small groups of individuals, in the lower single digits, and we believe the bias introduced by our usage of author-paper pairs or collaborating author pairs to be minimal.

Each author is also weighted equivalently in our analysis. We acknowledge the special recognition extended to first authors, last authors, and single authors, and previous studies have already shown the distinctions between these groups [7].
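The over-weighting of papers with many authors discussed above can be made concrete by counting the units each paper contributes. The helper below is a hypothetical illustration: it shows that author-paper units grow linearly with the number of authors, while collaborating pairs grow quadratically.

```python
from math import comb


def units_per_paper(n_authors):
    """Return (author-paper units, co-author pairs) contributed by one paper."""
    return n_authors, comb(n_authors, 2)


# A single-author paper, a typical small group, and a large collaboration.
for n in (1, 3, 10):
    units, pairs = units_per_paper(n)
    print(f"{n} authors -> {units} author-paper units, {pairs} co-author pairs")
```

With author groups in the lower single digits, as the text reports for Computer Science, the quadratic growth in pairs remains modest.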
Gender bias is a well-documented and well-studied issue in academia. Studies have shown that existent and perceived gender bias may affect many aspects of career and academic success, including but not limited to a woman's choice of college major [8], crediting in scientific publications [9], access to mentorship [10, 11], and opportunities for collaboration [12]. All these factors and more can lead to biased representation of women in certain fields of study.

With the increasing digitization of scholarly communication and availability of publication-related metadata, scholars have been better able to quantify inequality in authorship. A 2012 analysis of 1.8 million papers from JSTOR, a large multi-disciplinary repository of academic literature, revealed that although gender gaps are shrinking in academic publications, women were significantly underrepresented as last and single authors [7]. Elsevier, the largest publisher of academic manuscripts, in an analysis of data from Scopus and ScienceDirect, reported the presence of gender imbalance among authors and inconsistent trends towards equal representation among different fields [12]. A study in early 2018 confirmed continuing gender disparities among Nature Index journals, commonly considered some of the most reputable sources of academic literature, and in particular, limited representation of women among last authors, who are often perceived as more senior [13].

A study of gender bias in authorship conducted by Holman et al. projected the closing of the gender gap in various fields based on current trends [14]. By analyzing 9.1 million articles from PubMed, the authors projected that gender parity would be reached in around 20 years in certain biomedical fields such as Molecular Biology, Medicine, or Biochemistry. Holman et al.'s analysis of a small corpus of Computer Science pre-prints from arXiv shows that gender parity in Computer Science will be reached more than 100 years from the present [14].
Major strides have been made to reduce gender disparities. The presence of an overall structure of sexism in academia continues to be debated [15, 16, 17], but many academic institutions recognize the problem and have sought to equalize admissions and hiring procedures. Evidence of movement toward equal representation in hiring and publication has been observed in some controlled settings [18, 19, 20]. How these observations translate into systemic changes remains to be seen. It is clear, however, that the rate of change in reducing the gender gap may be insufficient in many fields for parity to occur within several generations [14].
We performed a comprehensive analysis of the Computer Science literature (2.87 million papers) to evaluate gender trends among authors. Based on recent trends, the proportion of female authors in Computer Science is forecast not to reach parity in this century, and under more realistic assumptions it may take far longer. We also observed lower than expected numbers of cross-gender collaborations, with the ratio of observed to expected decreasing over time.

Slow rates of growth in the proportion of female scientists in Computer Science continue to challenge women entering the field. Female scientists may face more challenges finding collaborators than their male counterparts due to the existing gender distribution of authors and observed co-authorship behaviors. We hope that these findings will motivate others in the field to evaluate their relationship to these gender biases and consider ways to improve the status quo.
Acknowledgements
We would like to thank Jonathan Borchardt, Matt Gardner, and Candace Ross for conducting the initial analysis that motivated this project. We would also like to thank Maarten Sap, Noah Smith, and Mark Yatskar for helpful comments on earlier drafts of this paper.
References

[1] G. E. P. Box, G. M. Jenkins, and G. C. Reinsel. Time series analysis: Forecasting and control. Prentice Hall, Englewood Cliffs, N.J., 3rd edition, 1994.
[2] Rob J. Hyndman and Yeasmin Khandakar. Automatic time series forecasting: the forecast package for R. Journal of Statistical Software, 26(3):1–22, 2008.
[3] S. S. Shapiro and M. B. Wilk. An analysis of variance test for normality (complete samples). Biometrika, 52:591–611, 1965.
[4] MEDLINE® data changes – 2002. NLM Tech Bull, Nov-Dec(323):e11, 2001.
[5] Fariba Karimi, Claudia Wagner, Florian Lemmerich, Mohsen Jadidi, and Markus Strohmaier. Inferring gender from names on the web: A comparative evaluation of gender detection methods. In WWW, 2016.
[6] Lucía Prieto Santamaría and Helena Mihaljević. Comparison and benchmark of name-to-gender inference services. PeerJ Computer Science, 4:e156, 2018.
[7] Jevin D. West, Jennifer Jacquet, Molly M. King, Shelley J. Correll, and Carl T. Bergstrom. The role of gender in scholarly authorship. PLoS ONE, 8(7):e66212, 2013.
[8] Rachael D. Robnett. Gender bias in STEM fields: variation in prevalence and links to STEM self-concept. Psychology of Women Quarterly, 2015.
[9] David F. Feldon, James L. Peugh, Michelle A. Maher, Josipa Roksa, and Colby Tofel-Grehl. Time-to-credit gender inequities of first-year PhD students in the biological sciences. CBE Life Sciences Education, 2017.
[10] Rochelle Decastro, Kent A. Griffith, Peter Anthony Ubel, Abigail J. Stewart, and Reshma Jagsi. Mentoring and the career satisfaction of male and female academic medical faculty. Academic Medicine: Journal of the Association of American Medical Colleges, 89(2):301–11, 2014.
[11] Natalie Schluter. The glass ceiling in NLP. In EMNLP, 2018.
[12] Gender in the global research landscape. Technical report, Elsevier, 2017.
[13] Michael H. K. Bendels, Ruth Mueller, Doerthe Brueggmann, and David Alexander Groneberg. Gender disparities in high-quality research revealed by Nature Index journals. PLoS ONE, 13(1):e0189136, 2018.
[14] Luke Holman, Devi Stuart-Fox, and Cindy E. Hauser. The gender gap in science: How long until women are equally represented? PLoS Biology, 16(4):e2004956, 2018.
[15] Jamie Lundine, Ivy Lynn Bourgeault, Jocalyn Clark, Shirin Heidari, and Dina Balabanova. The gendered system of academic publishing. The Lancet, 391(10132):1754–6, 2018.
[16] Jason R. Boynton, Kristina Georgiou, Mark Reid, and Andrew Govus. Gender bias in publishing. The Lancet, 392(10157):1514–5, 2018.
[17] Jamie Lundine, Ivy Lynn Bourgeault, Jocalyn Clark, Shirin Heidari, and Dina Balabanova. Gender bias in academia. The Lancet, 393(10173):741–3, 2019.
[18] Wendy M. Williams and Stephen J. Ceci. National hiring experiments reveal 2:1 faculty preference for women on STEM tenure track. Proceedings of the National Academy of Sciences of the United States of America, 112(17):5360–5, 2015.
[19] Erin Hengel. Publishing while female. Are women held to higher standards? Evidence from peer review. Cambridge Working Paper Economics, 1753, 2017.
[20] Stephen J. Ceci and Wendy M. Williams. Understanding current causes of women's underrepresentation in science. Proceedings of the National Academy of Sciences, 108(8):3157–3162, 2011.