Systematics in the ALMA Proposal Review Rankings
Draft version August 27, 2019
John Carpenter, Joint ALMA Observatory, Avenida Alonso de Córdova 3107, Vitacura, Santiago, Chile
ABSTRACT

The results from the ALMA proposal peer review process in Cycles 0–6 are analyzed to identify any systematics in the scientific rankings that may signify bias. Proposal rankings are analyzed with respect to the experience level of a Principal Investigator (PI) in submitting ALMA proposals, regional affiliation (Chile, East Asia, Europe, North America, or Other), and gender. The analysis was conducted for both the Stage 1 rankings, which are based on the preliminary scores from the reviewers, and the Stage 2 rankings, which are based on the final scores from the reviewers after participating in a face-to-face panel discussion. Analysis of the Stage 1 results shows that PIs who submit an ALMA proposal in multiple cycles have systematically better proposal ranks than PIs who have submitted proposals for the first time. In terms of regional affiliation, PIs from Europe and North America have better Stage 1 rankings than PIs from Chile and East Asia. Consistent with Lonsdale et al. (2016), proposals led by men have better Stage 1 rankings than proposals led by women when averaged over all cycles. This trend was most noticeable in Cycle 3, but no discernible differences in the Stage 1 rankings are present in recent cycles. Nonetheless, in each cycle to date, women have had a lower proposal acceptance rate than men even after differences in demographics are considered. Comparison of the Stage 1 and Stage 2 rankings reveals no significant changes in the distribution of proposal ranks by experience level, regional affiliation, or gender as a result of the panel discussions, although the proposal ranks for East Asian PIs show a marginally significant improvement from Stage 1 to Stage 2 when averaged over all cycles. Thus, any systematics in the proposal rankings are introduced primarily in the Stage 1 process and not by the face-to-face discussions.
These results are discussed in the context of potential language and cultural biases, but any conclusions on the origin of the observed systematics remain speculative.

1. INTRODUCTION

The Atacama Large Millimeter/Submillimeter Array (ALMA) is an international astronomical facility operated in a partnership of the European Organisation for Astronomical Research in the Southern Hemisphere (ESO), the U.S. National Science Foundation, and the National Institutes of Natural Sciences of Japan in cooperation with the Republic of Chile. The Joint ALMA Observatory (JAO) solicits observing proposals from the scientific community to use ALMA through an annual Call for Proposals, which is the primary means by which projects are selected for observation. ALMA proposals are peer reviewed by volunteers from the scientific community. Projects are added to the observing queue based primarily on the scientific rank from the review process, but also on operational considerations, including over-subscription in antenna configurations and by right ascension, the required weather conditions for the observations, and the pre-determined share of observing time that is awarded to Chile, East Asia, Europe, and North America.

Given the importance that telescope access can have on developing a scientific career, it is imperative that the community has confidence that scientific merit is the primary determinant of proposal rank. Analyzing the results from the review process and presenting the outcomes in a transparent manner is an important part of building confidence within the community. Along these lines, identifying potential systematics in the proposal review process at astronomical observatories has received prominent attention in recent years.
If the probability of success of a proposal depends on some characteristic (e.g., the gender of the Principal Investigator, or PI) that should not correlate with the underlying scientific merit, it may indicate a bias in the review process.

Reid (2014) drew attention to possible biases in the proposal review process of the Hubble Space Telescope (HST) when he found that proposals led by women had a lower acceptance rate than proposals led by men for HST Cycles 11 through 21. While the difference in acceptance rate by gender is not significant in any given cycle, the persistent trend over time cannot be attributed to random noise. Reid (2014) speculated on possible causes of the gender-based systematic, but establishing the origin with any degree of certainty was difficult. One possibility is that unconscious bias among the reviewers favors men over women in the scientific review. However, the data contained hints that demographic differences may also play a role. In response to these systematics, HST took the step of listing the investigators alphabetically on the proposal cover sheet so that reviewers could no longer identify the PI. Yet women continued to have a lower acceptance rate than men until a double-anonymous review was instituted that hid the identity of all investigators from the reviewers (Strolger & Natarajan 2019).

Following the study by Reid (2014), Patat (2016) analyzed the proposal statistics for ESO and found that women have also had a lower success rate than men when applying for observing time. He found that the difference in the acceptance rate can be largely attributed to demographic differences in the seniority of PIs that correlate with gender. Proposals led by senior astronomers have a higher acceptance rate than proposals from junior astronomers, and the fraction of women is lower among senior applicants than among junior applicants.
After accounting for seniority, Patat (2016) found that a residual systematic remained that could reflect either unaccounted demographic differences between women and men or potentially a true gender bias.

Lonsdale et al. (2016) analyzed the results from the proposal review process for four facilities operated in full or in part by the National Radio Astronomy Observatory (NRAO): the Jansky Very Large Array (JVLA), the Very Long Baseline Array (VLBA), the Green Bank Telescope (GBT), and ALMA. Analogous to the results for HST and ESO, they found that the proposal rankings favored men over women in ALMA Cycles 2–4, with the largest and most significant difference found in Cycle 3. The other NRAO telescopes showed similar trends, although the significance was lower than for ALMA, and in some semesters women had higher overall rankings than men. Hunt et al. (2019) extended the analysis by Lonsdale et al. (2016) to include more recent proposal rounds at the JVLA, VLBA, and GBT. They found that when averaged over all proposal semesters between 2012A and 2019A for the JVLA, VLBA, and GBT combined, men had a statistically significant advantage over women in the proposal scores.

The study presented here extends the analysis of the ALMA proposal rankings conducted by Lonsdale et al. (2016) in several respects. First, the analysis is extended to include all cycles to date (Cycles 0–6). Second, since ALMA has a two-stage review process, the science rankings are evaluated for both the preliminary science assessments (Stage 1) and the final assessment resulting from the face-to-face review (Stage 2) to establish at which stage in the review process any systematics are introduced. Finally, the correlation of the proposal rankings with variables other than gender, including experience level in submitting ALMA proposals and regional affiliation of the PIs, is investigated.

This paper is organized as follows.
Section 2 describes the salient aspects of the two-stage proposal review process adopted by ALMA and the demographic data collected for this study. Section 3 explores any systematics in the proposal rankings introduced in the first stage of the review process. Section 4 compares the rankings between the first and second stages of the review process to investigate any systematics that result from the face-to-face discussion between reviewers. Section 4 also examines the correlation between gender and the acceptance rate of proposals into the observing queue. Section 5 summarizes the results and briefly describes how the ALMA review process will evolve in the near future. An analysis of the acceptance rate of proposals submitted by ALMA reviewers is presented in the Appendix to investigate whether the ALMA review process favors proposals submitted by the reviewers.

2. PROPOSAL REVIEW PROCESS AND DEMOGRAPHIC DATA

This section describes the ALMA proposal review process and how a list of scientific rankings is produced from the reviewer scores. It then describes the demographic data used to evaluate potential systematics in the proposal rankings against 1) the experience level of a PI in submitting ALMA proposals, 2) the regional affiliation of the PI, and 3) the gender of the PI.

2.1. The ALMA proposal review process
Similar to many other observatories, ALMA has adopted a peer-review, panel-based system to evaluate and rank the proposals based on scientific merit. This system has been used for seven proposal calls, starting with Cycle 0 in 2011 and continuing to Cycle 6 in 2018. Since Cycle 1, the ALMA review panels have been split across five scientific categories: 1) Cosmology and the high-redshift universe, 2) Galaxies and galactic nuclei, 3) Interstellar medium, star formation, and astrochemistry, 4) Circumstellar disks and the solar system, and 5) Stars and stellar evolution. Four categories were used in Cycle 0, when categories 4 and 5 were combined. The number of panels in each category has increased over the years in response to the growing number of submitted proposals. Cycles 4–6 each contained 18 panels split across the five categories.

Proposals are assigned to a panel by the JAO based on the science category selected by the PI at the time of proposal submission. The panel assignments can be further refined using the scientific keywords selected by the PI, so that proposals with similar keywords are grouped in a single panel. For example, in Cycle 6, the category for Circumstellar disks and the solar system contained four review panels, but planetary proposals were placed into two of the four panels. Proposals are assigned to and scored by a single review panel. The exception is proposals for Large Programs, which are assigned to all panels in the appropriate scientific category.

The review process proceeds in two stages. In Stage 1, panel members review their assigned proposals and provide preliminary numerical scores on a scale of 1 (best) to 10 (worst). Reviewers do not score proposals for which they have a conflict of interest. Conflicts of interest can be identified either by the JAO Proposal Handling Team (PHT) or self-declared by the reviewers. For this study, the average Stage 1 score is computed for each proposal.
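The Stage 1 averaging can be illustrated with a short sketch. It assumes, hypothetically, that the scores for one panel are held in a reviewers × proposals array with NaN wherever a reviewer did not score a proposal (a conflict of interest or, in Cycles 0–3, a proposal not assigned to that reviewer). The optional standardization step is only an assumed form of the per-reviewer normalization that, as noted in Section 2.2, the JAO applies in the actual review but that this study does not use; the function name is illustrative.

```python
import numpy as np


def average_scores(scores, standardize=False):
    """Mean score per proposal over the reviewers who scored it.

    scores: array of shape (n_reviewers, n_proposals) on the 1 (best) to
    10 (worst) scale, with NaN for proposals a reviewer did not score.
    standardize: if True, first rescale each reviewer's scores to zero mean
    and unit standard deviation (hypothetical stand-in for the JAO
    normalization; any common mean and spread gives the same ordering).
    """
    s = np.array(scores, dtype=float)
    if standardize:
        mean = np.nanmean(s, axis=1, keepdims=True)
        std = np.nanstd(s, axis=1, keepdims=True)
        # Guard against a reviewer who gave every proposal the same score.
        s = (s - mean) / np.where(std > 0, std, 1.0)
    # nanmean skips the NaN entries, i.e. the conflicted/unassigned pairs.
    return np.nanmean(s, axis=0)


# Three reviewers, three proposals; NaN = not scored by that reviewer.
panel = [[3, 5, np.nan],
         [4, np.nan, 2],
         [2, 6, 4]]
print(average_scores(panel))                    # plain averages
print(average_scores(panel, standardize=True))  # per-reviewer standardized
```

In this toy panel the plain averages tie the first and third proposals, while the standardized averages separate them, which illustrates why the choice of normalization can affect individual ranks even when it does not change the overall picture.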
Stage 2 of the review process consists of a face-to-face meeting of all reviewers at a common venue. The proposals are discussed and then re-scored by each non-conflicted reviewer. The individual scores are averaged to produce the final Stage 2 score.

The number of reviewers per panel has varied from six in Cycle 0, to seven in Cycles 1–2, and to eight in Cycles 4–6. In Cycle 0, panels in Category 4 had seven reviewers, since additional expertise was needed to cover the broad range of topics in a category that also included stellar evolution. Similarly, Category 5 had nine reviewers per panel in Cycles 5 and 6. In Cycles 0–3, each proposal was scored by four reviewers in Stage 1. Since Cycle 4, reviewers score all proposals in Stage 1 for which they do not have a conflict.

Starting in Cycle 1, approximately 25–35% of proposals that have poor scores in the Stage 1 reviews are “triaged” by the JAO. Unless a reviewer “resurrects” a triaged proposal, triaged proposals are not discussed at the face-to-face review, which allows the review panels to focus their deliberations on the better ranked proposals. The triage level is adjusted on a regional basis in order to maintain at least a factor of two over-subscription in the requested observing time relative to the available time for each region. In particular, the percentage of Chilean proposals that is triaged is typically lower than for other regions, since Chile has had the lowest over-subscription rate. The non-triaged proposals are reviewed and scored by all non-conflicted reviewers in the panel in Stage 2.

2.2. Proposal ranks
The outcome of the review process is a merged list of proposal ranks for all panels combined. The JAO then determines which proposals are accepted into the observing queue using primarily the scientific rank from the review process, as well as the time available to each region, the over-subscription in array configurations and right ascension, and the weather conditions needed to carry out the observations. Proposals that are accepted into the observing queue are assigned a priority grade (A, B, or C), while the remaining proposals are declined. Because of the operational considerations, the priority grades do not strictly follow the proposal rankings. Therefore, the search for systematics primarily considers the proposal rankings (see, however, the analysis presented in Section 4.3 and the Appendix). Two lists of proposal rankings are created: i) the ranked list after the initial proposal assessments (Stage 1), and ii) the ranked list after the face-to-face review (Stage 2), which excludes triaged proposals. Large Program proposals, Director’s Discretionary Time proposals, and proposals submitted to the Cycle 4 Supplemental Call for the 7-m Array are excluded from the analysis since they are reviewed in a different manner.

In each stage, the average scores from the reviewers are used to rank the proposals within a panel from 1 (best) to N (worst), where N is the number of proposals under consideration in either […]. Ties in the normalized rankings are broken using a random number generator. The final ranked list is then normalized from 0 (best) to 1 (worst) with steps of 1/(N − 1). A merged ranked list is created separately based on the scores in the Stage 1 and Stage 2 process.

Footnote: In the actual proposal review process, the JAO normalizes the composite Stage 1 scores for each reviewer to have the same mean and standard deviation before averaging the scores. The normalization is not done in this study in order to treat the Stage 1 and Stage 2 scores in a consistent manner.
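The sketch below is one plausible reading of the ranking procedure, not the JAO's code. Steps 1, 3, and 4 follow the description above (rank within a panel from 1 to N, break ties in the merged list with random numbers, renormalize from 0 to 1 in steps of 1/(N − 1)); step 2, which rescales each panel's ranks to 0–1 before merging, is an assumption. The function name and the dictionary input format are hypothetical.

```python
import numpy as np


def merged_normalized_ranks(panel_scores, seed=0):
    """Merge per-panel average scores into one normalized ranked list.

    panel_scores: dict mapping a panel name to a 1-D sequence of average
    scores (lower is better). Returns a dict with the same keys, each
    mapped to an array of merged ranks from 0 (best) to 1 (worst).
    """
    rng = np.random.default_rng(seed)
    panels = list(panel_scores)
    fractions = []
    for panel in panels:
        scores = np.asarray(panel_scores[panel], dtype=float)
        n_panel = scores.size
        # Step 1: rank within the panel, 1 = best (lowest average score).
        # A stable sort leaves equal scores in input order (a simplification).
        rank = np.empty(n_panel)
        rank[np.argsort(scores, kind="stable")] = np.arange(1, n_panel + 1)
        # Step 2 (assumed): rescale panel ranks to 0..1.
        fractions.append((rank - 1) / max(n_panel - 1, 1))
    within = np.concatenate(fractions)
    n = within.size
    # Step 3: sort the merged list; np.lexsort uses the LAST key as the
    # primary key, so the random numbers only break ties.
    order = np.lexsort((rng.random(n), within))
    # Step 4: final normalization in steps of 1/(N - 1).
    final = np.empty(n)
    final[order] = np.arange(n) / max(n - 1, 1)
    # Split back into per-panel arrays in the original proposal order.
    sizes = [f.size for f in fractions]
    return dict(zip(panels, np.split(final, np.cumsum(sizes)[:-1])))


ranks = merged_normalized_ranks({"A": [2.0, 5.5, 3.1],
                                 "B": [4.0, 1.5, 7.2, 6.0, 2.2]})
print(ranks)
```

In this sketch, step 2 is what makes ranks from panels of different sizes comparable; the random key matters only for proposals whose rescaled panel ranks are exactly equal.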
Since triaged proposals are not re-scored by the panels in Stage 2, these proposals are excluded from the merged Stage 2 ranked list.

2.3. Experience level in submitting ALMA proposals
Patat (2016) found that the proposal success rate at ESO was higher for “professional astronomers” than for less experienced PIs (classified as “postdocs” and “students”). The expertise of a PI may correlate with the success of a proposal if established PIs are able to write a more compelling science case based on experience or have a better understanding of how to use ALMA optimally. On the other hand, this assumption may lead to some element of “prestige” bias, where proposals led by a well-known PI are given more favorable scores in the review process based on reputation or standing in the community rather than on the scientific merit of the actual proposal.

The experience factor consists of at least two components. One component is the overall experience level of the PI, which can be measured by the number of years since the PhD was obtained or the number of years as a professional astronomer. A second component is the experience of the PI in millimeter/submillimeter interferometry overall and with ALMA in particular, as one may expect such a PI to better understand the capabilities of the instrument and the current state of the field.

While ALMA users are requested to complete a demographic profile that includes the year of their PhD and a self-assessment of their expertise in submillimeter astronomy and other fields, most users do not complete their profiles. Therefore, as a surrogate for experience, the number of cycles in which a user has submitted an ALMA proposal as PI was determined, regardless of whether the proposal was ultimately accepted. The experience level is computed for each user and each cycle. For example, in Cycle 6, a user with an experience level of 1 submitted an ALMA proposal as PI for the first time, while a user with an experience level of 7 has submitted at least one proposal as PI in all seven ALMA cycles and has considerable experience with ALMA.
This metric best measures the experience that a user has in submitting ALMA proposals, but not their career standing.

The main advantage of this metric is that it can be computed in a straightforward and consistent manner for all users and a given cycle. However, it does not reflect the role co-investigators may have in formulating the proposal, especially faculty advisers to students. More subtly, this experience metric may be a biased measure in that success in one proposal cycle may encourage additional proposals in subsequent cycles, either as positive reinforcement or by collecting ALMA data that can be used to justify follow-up proposals. Conversely, having a proposal declined, especially in multiple proposal cycles, may discourage a user from submitting further proposals.

Footnote: In generating the ranked list of proposals used to assign priority grades, ties between proposal ranks are broken using the Stage 2 scores and, if a tie persists, by the proposal number.
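The experience metric defined above can be computed directly from a list of (PI, cycle) submission pairs. The sketch below assumes a hypothetical input in which a PI appears once per submitted proposal; several proposals in the same cycle count once, and acceptance plays no role, as in the definition above. The identifiers are illustrative only.

```python
from collections import defaultdict


def experience_levels(submissions):
    """Experience level of each PI in each cycle in which they were PI.

    submissions: iterable of (pi_id, cycle) pairs, one per proposal.
    Returns {(pi_id, cycle): level}, where level counts the distinct
    cycles up to and including `cycle` in which the PI submitted.
    """
    cycles_by_pi = defaultdict(set)
    for pi, cycle in submissions:
        cycles_by_pi[pi].add(cycle)  # several proposals in a cycle count once
    levels = {}
    for pi, cycles in cycles_by_pi.items():
        for level, cycle in enumerate(sorted(cycles), start=1):
            levels[(pi, cycle)] = level
    return levels


subs = [("pi_a", 0), ("pi_a", 1), ("pi_a", 1), ("pi_a", 3),
        ("pi_b", 6)] + [("pi_c", c) for c in range(7)]
levels = experience_levels(subs)
print(levels[("pi_a", 3)], levels[("pi_b", 6)], levels[("pi_c", 6)])  # prints: 3 1 7
```

Here pi_c submitted in all seven cycles and reaches level 7 in Cycle 6, matching the example in the text, while pi_b is a first-time PI in Cycle 6.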
2.4. Regional affiliation
ALMA proposals can be submitted by anyone without regard to nationality or affiliation. Since ALMA operations are funded by three regions (East Asia, Europe, and North America) with the cooperation of the Chilean government, there is an inherent diversity in the ALMA user base (see Section 2.6). All PIs self-identify their regional affiliation (Chile, East Asia, Europe, North America, or Other) when submitting their proposals. In this context, regional affiliation refers to the region of the host institution as opposed to the nationality of the PI. Chilean proposals are submitted by PIs with an affiliation at a Chilean research institute. Proposals assigned to East Asia consist of PIs with affiliations in Japan, Taiwan, or the Republic of Korea. Proposals assigned to Europe consist of PIs who have affiliations in one of the ESO member states. Proposals assigned to North America consist of PIs from the United States, Canada, or Taiwan. Since Taiwanese agencies contributed funding for ALMA in both East Asia and North America, Taiwanese users are listed as having a joint East Asia and North America affiliation in the proposal process, but for the purpose of this study they are assigned to East Asia. Proposals from any non-ALMA regions are grouped as “Other”.

2.5. Gender
ALMA does not collect the gender of PIs during the proposal submission process, although PIs can optionally enter this information as part of their demographic profile. As mentioned previously, most PIs do not complete their demographic profiles, and therefore this information was gathered manually. Lonsdale et al. (2016) compiled genders for ALMA PIs in Cycles 2–4 and kindly provided their database for this analysis. In collaboration with C. Lonsdale, a small number of gender assignments were corrected, and genders were identified for PIs from Cycles 0, 1, 5, and 6 who were not in the database. Genders were determined using information on the internet or the familiarity of the author or colleagues with the PI. Software tools to identify the gender based on the first name were also utilized, but corroborating information was sought. While recognizing that the subject of gender identity is complex, genders were classified as “male” or “female” for this study.

2.6. Demographic overview
Tables 1 and 2 summarize the regional and gender demographics of the proposal PIs for each proposal cycle. The regional distribution of proposals has been fairly constant throughout the first seven cycles in that Chilean PIs have submitted ∼6% of the proposals, East Asian PIs ∼ […]

Table 1.
Regional Demographics of ALMA Principal Investigators

Cycle   Number of proposals   Chile   East Asia   Europe   North America   Other
0       919                   3.8%    19.9%       43.5%    30.5%           2.3%
1       1131                  5.7%    18.7%       43.0%    29.9%           2.7%
2       1381                  6.9%    19.7%       40.8%    30.1%           2.5%
3       1578                  7.3%    18.8%       41.6%    29.4%           2.9%
4       1571                  6.1%    21.6%       42.3%    27.1%           2.9%
5       1661                  5.3%    20.0%       42.2%    29.6%           2.9%
6       1836                  5.8%    20.0%       42.6%    28.5%           3.1%
Note —Table shows the percentage of proposal PIs from each region.
Table 2.
Gender Demographics of ALMA Principal Investigators

Cycle   Chile   East Asia   Europe   North America   Other   All
0       28.6%   16.9%       30.5%    32.1%           23.8%   28.1%
1       24.6%   14.6%       30.2%    32.2%           16.7%   27.2%
2       25.3%   24.6%       35.7%    33.7%           17.1%   31.7%
3       14.8%   26.3%       36.2%    32.3%           33.3%   31.6%
4       19.8%   24.7%       36.7%    30.5%           33.3%   31.4%
5       20.5%   25.2%       36.9%    35.8%           22.9%   33.0%
6       19.8%   26.6%       36.8%    37.5%           23.2%   33.6%
Note —Table lists the percentage of PIs who are women in each region.

Table 3.
Demographics of the ALMA Proposal Reviewers

Cycle   Number of reviewers   Chile   East Asia   Europe   North America   Other   Female   Male
0       49                    10.2%   20.4%       36.7%    28.6%           4.1%    40.8%    59.2%
1       77                    10.4%   22.1%       32.5%    32.5%           2.6%    39.0%    61.0%
2       77                    10.4%   22.1%       33.8%    32.5%           1.3%    40.3%    59.7%
3       96                    10.4%   21.9%       33.3%    33.3%           1.0%    44.8%    55.2%
4       145                   9.7%    21.4%       33.1%    33.1%           2.8%    47.6%    52.4%
5       146                   9.6%    19.9%       35.6%    31.5%           3.4%    47.9%    52.1%
6       146                   10.3%   22.6%       32.9%    29.5%           4.8%    41.8%    58.2%
Note —Table shows the regional and gender distribution of the ALMA reviewers.

3. ANALYSIS OF THE STAGE 1 RANKINGS

This section analyzes the Stage 1 proposal rankings to identify any systematics based on experience level (Section 3.1), regional affiliation (Section 3.2), and gender (Section 3.3) that are introduced in the preliminary reviewer scores. Potential systematics are examined by analyzing the cumulative distribution of proposal ranks, e.g., by comparing the cumulative distributions of proposal ranks for female and male PIs. This approach has the advantage that differences anywhere along the cumulative profiles can be captured. The number of cumulative distributions being compared ranges from two when comparing by gender, to five for regional comparisons, and up to seven for experience-level comparisons.

The Anderson-Darling k-sample test (Scholz & Stephens 1987) as implemented in scipy was used to measure the difference between cumulative distributions. The Anderson-Darling test statistic was then used to compute the probability (p_AD, 0 ≤ p_AD ≤ 1) that the k samples are drawn from the same (but unspecified) population using the pval function within the kSamples package for R. A low value of p_AD suggests that the k samples are drawn from different distributions, while a high value of p_AD is consistent with the k samples having similar distributions. Any differences in the cumulative ranks are arbitrarily defined as “significant” if the probability that the distributions are drawn from the same population is p_AD < 0.01 and “marginally significant” if 0.01 ≤ p_AD ≤ 0.05.

3.1. Experience level
Figure 1 shows the cumulative distribution of Stage 1 proposal ranks by experience level for Cycles 1–6, where the experience level is evaluated at the time of the indicated cycle. Cycle 0 is not shown since all PIs submitted proposals for the first time. In this figure and all similar figures that follow, the solid line shows the cumulative distribution of ranks and the shaded region shows the 68.3% confidence interval (i.e., “1σ”) computed using the beta function. Since the best-ranked proposals have a normalized rank of 0 and the poorest-ranked proposals have a normalized rank of 1, curves shifted to the upper left have better overall ranks than curves shifted to the lower right. The probability (p_AD) that the curves are drawn from the same population is indicated in the lower right of each panel.

As an example, the upper left panel in Figure 1 shows that in Cycle 1, PIs who submitted proposals in both Cycles 0 and 1 (experience level = 2) had better proposal ranks than first-time PIs in Cycle 1 (experience level = 1). The trend is present in each of the first, second, and third quartiles of the cumulative distributions. The difference in proposal ranks is significant in that the probability that the two distributions are drawn from the same population is p_AD < 10^−[…]. Each subsequent cycle shows the same trend: PIs who have submitted proposals in more cycles tend to have better proposal ranks than PIs who have submitted proposals in fewer cycles. The strongest and most persistent trend is that first-time PIs have the poorest proposal ranks, while PIs who submit proposals every cycle have the best proposal ranks. Proposal ranks for intermediate experience levels are also generally correlated with experience level. While not shown here, these basic trends are typically present within each region separately, although in some individual cycles the trends are not strictly followed within a region.
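The shaded bands can be approximated by treating the cumulative fraction at each rank as a binomial proportion. The sketch below assumes an equal-tailed Clopper-Pearson interval built from beta-distribution quantiles (scipy.stats.beta); the text states only that the beta function was used, so the exact recipe behind the figures may differ. The function name is hypothetical, and the ranks are synthetic rather than ALMA data.

```python
import numpy as np
from scipy.stats import beta


def cumulative_with_band(ranks, grid, conf=0.683):
    """Cumulative fraction of normalized ranks <= each grid value, with an
    equal-tailed Clopper-Pearson interval from beta quantiles (assumed)."""
    ranks = np.sort(np.asarray(ranks, dtype=float))
    n = ranks.size
    grid = np.asarray(grid, dtype=float)
    k = np.searchsorted(ranks, grid, side="right")  # counts with rank <= grid
    frac = k / n
    tail = (1.0 - conf) / 2.0
    # Beta quantiles are undefined at k = 0 or k = n; np.maximum keeps the
    # arguments valid and np.where substitutes the limiting values 0 and 1.
    lower = np.where(k > 0, beta.ppf(tail, np.maximum(k, 1), n - k + 1), 0.0)
    upper = np.where(k < n, beta.ppf(1 - tail, k + 1, np.maximum(n - k, 1)), 1.0)
    return frac, lower, upper


rng = np.random.default_rng(0)
ranks = rng.uniform(0.0, 1.0, 200)  # synthetic normalized ranks
grid = np.linspace(0.0, 1.0, 11)
frac, lower, upper = cumulative_with_band(ranks, grid)
for g, f, lo, hi in zip(grid, frac, lower, upper):
    print(f"{g:.1f}  {f:.3f}  [{lo:.3f}, {hi:.3f}]")
```

The band narrows as the number of proposals in a group grows, which is why small groups (e.g., Chilean PIs or the highest experience levels) show wider shaded regions.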
[Figure 1 panels: Cycles 1–6, each showing the cumulative distribution versus normalized Stage 1 rank for each experience level; p_AD < 10^[…] in every panel.]
Figure 1.
Normalized cumulative distribution of Stage 1 proposal ranks by experience level for each cycle. The normalized ranks vary from 0 (best) to 1 (worst). The shaded region indicates the 68.3% confidence interval computed using the beta function. The probability from the Anderson-Darling k-sample test that the distributions within a cycle are drawn from the same population is indicated in the lower right corner of each panel.

3.2. Regional affiliation
Figure 2 shows the cumulative distribution of Stage 1 proposal ranks by regional affiliation for Cycles 0–6. Each cycle exhibits the same trend: PIs from North America and Europe have better proposal rankings overall than PIs from Chile, East Asia, and other regions. The trend is present and significant in each cycle. The differences appear to moderate somewhat in Cycles 2 and 3 but increase in Cycles 4–6.

In Cycles 0 and 1, North American PIs had better ranked proposals than European PIs. However, in later cycles, the differences diminished. Averaged over all cycles, there is a marginal tendency for North American proposals to have better ranks than European proposals, but the tendency vanishes if Cycles 0 and 1 are excluded. No significant difference in the proposal ranks is observed between Chilean and East Asian PIs within a cycle or when averaged over all cycles (p_AD = 0.56).

Differences in the relative proposal ranks by region persist across the experience levels of the PIs. Figure 3 shows the cumulative proposal ranks by region for the most experienced PIs, defined as users who have submitted proposals in at least five of the seven cycles. Similarly, Figure 4 shows the results for PIs who have submitted proposals in only one or two cycles, selected to represent inexperienced ALMA users. In both subsamples, PIs from Europe and North America have significantly better proposal ranks than PIs from other regions.
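All of the p_AD values quoted in this section come from a k-sample Anderson-Darling comparison. The sketch below applies scipy.stats.anderson_ksamp to synthetic ranks for three hypothetical groups. SciPy's approximate p-value is floored at 0.001 and capped at 0.25, so values outside that range (such as the very small p_AD values in Figures 1–4) need another tool; the analysis in the text uses the R kSamples package. The significance labels follow the thresholds in Section 3, with the upper bound for “marginally significant” exposed as a parameter whose default of 0.05 is an assumption of this sketch.

```python
import warnings

import numpy as np
from scipy.stats import anderson_ksamp


def p_ad(samples):
    """Approximate probability that the k samples share one parent
    distribution (k-sample Anderson-Darling test, SciPy implementation)."""
    with warnings.catch_warnings():
        # SciPy warns when the p-value is floored (0.001) or capped (0.25).
        warnings.simplefilter("ignore")
        res = anderson_ksamp([np.asarray(s, dtype=float) for s in samples])
    p = getattr(res, "pvalue", None)
    if p is None:  # older SciPy releases call this field significance_level
        p = res.significance_level
    return float(p)


def label(p, marginal_upper=0.05):
    """Significance label; the 0.05 upper bound is an assumed default."""
    if p < 0.01:
        return "significant"
    if p <= marginal_upper:
        return "marginally significant"
    return "not significant"


rng = np.random.default_rng(2)
# Synthetic normalized ranks for three hypothetical groups of PIs.
groups = {"group 1": rng.beta(2.0, 2.6, 300),
          "group 2": rng.beta(2.0, 2.6, 250),
          "group 3": rng.beta(2.6, 2.0, 120)}
p = p_ad(list(groups.values()))
print(f"p_AD = {p:.3g} ({label(p)})")
```

Because the test compares whole cumulative distributions, a shift confined to one group (here group 3) is enough to drive p_AD down even when the other groups agree.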
[Figure 2 panels: Cycles 0–6 and all cycles combined, each showing the cumulative distribution versus normalized Stage 1 rank (0.00–1.00) for Chile, East Asia, Europe, North America, and Other; p_AD < 10^[…] in every panel.]
Figure 2.
Normalized cumulative distribution of Stage 1 proposal ranks by regional affiliation for each cycle.
[Figure 3 panels: Cycles 0–6 and all cycles combined, with the same axes and regional curves as Figure 2; p_AD = 6 × 10^[…] in Cycle 2, p_AD = 0.0009 in Cycle 6, and p_AD < 10^[…] in the remaining panels.]
Figure 3.
Normalized cumulative distribution of Stage 1 proposal ranks by regional affiliation for each cycle for PIs who have submitted proposals in 5 or more cycles, who represent the most experienced ALMA users.
[Figure 4 panels: Cycles 0–6 and all cycles combined, with the same axes and regional curves as Figure 2; p_AD = 0.001 (Cycle 0), 4 × 10^[…] (Cycle 1), 0.003 (Cycle 2), 0.008 (Cycle 3), 0.002 (Cycle 4), and p_AD < 10^[…] in Cycles 5 and 6 and for all cycles combined.]
Figure 4.
Normalized cumulative distribution of Stage 1 proposal ranks by regional affiliation for each cycle for PIs who have submitted proposals in only 1 or 2 cycles, who represent the least experienced ALMA users.
3.3. Gender
Figure 5 shows the cumulative distribution of Stage 1 proposal ranks by gender for each cycle. No significant difference between the proposal ranks for women and men exists in any individual cycle. Consistent with Lonsdale et al. (2016), proposals led by men had better ranks than proposals led by women in Cycle 3 with marginal significance (p_AD = 0.01) and, to a lesser extent, in Cycle 2 (p_AD = 0.14). Averaged over all cycles, the probability that the distributions of proposal ranks for women and men are drawn from the same population is p_AD = 0.04, which is marginally significant.

The results in Cycle 3 stand out in that men had better ranks than women when measured at the first, second, and third quartile points in the cumulative rankings. It is unclear why Cycle 3 would be noteworthy in this regard. No fundamental change was introduced in the review process itself, and the percentage of proposals from women and the percentage of women reviewers were in line with other cycles.
[Figure 5 panels: Cycles 0–6 and all cycles combined, each showing the cumulative distribution versus normalized Stage 1 rank for female and male PIs; p_AD = 0.59 (Cycle 0), 0.64 (Cycle 1), 0.14 (Cycle 2), 0.01 (Cycle 3), 0.88 (Cycle 4), 0.36 (Cycle 5), 0.46 (Cycle 6), and 0.04 (all cycles).]
Figure 5.
Normalized cumulative distribution of Stage 1 proposal ranks by gender for each cycle.
While the results of the Lonsdale et al. (2016) paper were posted after the Cycle 4 proposal review, preliminary results had been presented to the community before the Cycle 4 proposal review, and the JAO communicated the results to the Cycle 4 reviewers at the Stage 2 orientation meeting. The Cycle 5 and 6 reviewers received guidance on the results and on the role of unconscious bias in the written Stage 1 review instructions and at the Stage 2 orientation meeting. Once the presence of systematics began to be communicated (in Cycles 4–6), no discernible differences in the Stage 1 proposal rankings between women and men are evident in any individual cycle or when the three cycles are combined (p_AD = 0.73). However, it is unclear whether alerting the community after Cycle 3 actually contributed to reducing the gender-based systematic or whether the results from Cycle 3 were just a statistical outlier.

The Lonsdale et al. (2016) results differ in detail from this paper since they used a ranked list of proposals based on the Stage 2 results merged with the triaged proposals. This paper analyzes both the Stage 1 and Stage 2 results, but does not merge triaged proposals with the Stage 2 rankings.

[Figure 6 panels: cumulative distribution versus normalized Stage 1 rank for female and male PIs, with p_AD = 0.68 (Cycle 0), 0.60 (Cycle 1), 0.47 (Cycle 2), 0.004 (Cycle 3), 0.69 (Cycle 4), 0.73 (Cycle 5), 0.94 (Cycle 6), and 0.02 (all cycles).]

Figure 6. Normalized cumulative distribution of Stage 1 proposal ranks for women and men in Europe and North America for Cycles 0–6 and all cycles combined. The results are shown only for PIs who have submitted an ALMA proposal in at least 5 cycles.

One difficulty in interpreting Figure 5 is that any systematics between genders are much smaller than the systematics by experience level and regional affiliation. Thus, changes in the underlying experience or regional demographics of the PIs can be responsible for the difference in the proposal ranks by gender (see also Patat 2016). Given these considerations, subsets of the data are analyzed to isolate the impact of gender alone.

Figure 6 shows the cumulative distribution of proposal ranks for PIs from Europe and North America who have submitted proposals in at least five cycles. Europe and North America were grouped together since they have similar proposal ranks overall. The most experienced PIs were selected since the fraction of women PIs has been increasing over time and first-time PIs typically have poorer proposal ranks. Figure 6 shows that even among experienced PIs, women have had more poorly ranked proposals than men when averaged over all cycles (p_AD = 0.02), although the difference is only marginally significant. The difference is driven by the significant difference found in Cycle 3 (p_AD = 0.004). If Cycle 3 is excluded, any differences in the proposal rankings between experienced female and male PIs are insignificant even when averaged over the other cycles (p_AD = 0.25).

Figure 7 shows the corresponding comparison for experienced PIs from Chile, East Asia, and non-ALMA regions. No significant difference in the proposal rankings is found even when averaged over all cycles (p_AD = 0.70).
Interestingly, in Cycle 3, women from Chile, East Asia, and non-ALMA regions had better proposal ranks than men, although the difference is not significant. Nonetheless, this is opposite to the trend found among European and North American PIs. These different results suggest that any potential biases are complex and cannot be simply cast in terms of gender alone.

[Figure 7 panels: cumulative distribution versus normalized Stage 1 rank for female and male PIs, with p_AD = 0.75 (Cycle 0), 0.47 (Cycle 1), 0.78 (Cycle 2), 0.14 (Cycle 3), 0.44 (Cycle 4), 0.45 (Cycle 5), 0.78 (Cycle 6), and 0.70 (all cycles).]

Figure 7. Normalized cumulative distribution of Stage 1 proposal ranks for women and men in Chile, East Asia, and non-ALMA regions for Cycles 0–6 and all cycles combined. The results are shown only for PIs who have submitted an ALMA proposal in at least 5 cycles.

[Figure 8 panels: cumulative distribution versus normalized rank for Stage 1 and Stage 2 in Cycle 6, with p_AD = 0.65 (experience level 1), 0.49 (2), 0.99 (3), 0.96 (4), 0.98 (5), 0.96 (6), and 1.00 (7).]

Figure 8. Normalized cumulative distribution of Stage 1 and Stage 2 proposal ranks (solid curves) in Cycle 6 for different experience levels. Only non-triaged proposals are shown.

4. ANALYSIS OF THE STAGE 2 RESULTS

The analysis in Section 3 showed that systematics are introduced in the Stage 1 rankings, especially with respect to experience level and regional affiliation. In this section, any systematics in the Stage 2 results are analyzed. First, the triaged proposals are analyzed to assess whether any systematics are present with respect to gender (Section 4.1). Then the Stage 1 and Stage 2 rankings for the non-triaged proposals are compared to determine whether any of the systematics identified in the Stage 1 rankings are amplified or reduced as a result of the face-to-face discussion (Section 4.2). Finally, the proposals added to the observing queue are analyzed to determine the acceptance rate of proposals with respect to gender (Section 4.3).

4.1. Triaged proposals
As described in Section 2.1, poorly ranked proposals are triaged after the Stage 1 review to reduce the number of proposals discussed at Stage 2. The JAO identifies the proposals that are triaged, but upon request the reviewers may resurrect a triaged proposal and have it discussed in the face-to-face review. Table 4 lists the fraction of triaged proposals that have a female PI. This triage fraction by gender should not be compared directly to the overall fraction of proposals led by women (see Table 2) to identify potential biases, since the demographics of the triaged proposals are not in general the same as those of the overall proposals. This is primarily because the gender balance differs between regions, and the fraction of proposals triaged differs by region.

To account for the demographics of the triaged proposals, the expected number of triaged proposals with female PIs was estimated as the number of triaged proposals in a given demographic group multiplied by the fraction of all proposals with a female PI in that group. Formally, the expected fraction of triaged proposals with female PIs (f_t,expected) is

$$ f_{t,\mathrm{expected}} = \frac{\sum_{\mathrm{Region}} \sum_{\mathrm{Experience}} \sum_{\mathrm{Category}} f(R,E,C)\, N_{\mathrm{triage}}(R,E,C)}{\sum_{\mathrm{Region}} \sum_{\mathrm{Experience}} \sum_{\mathrm{Category}} N_{\mathrm{triage}}(R,E,C)}, \tag{1} $$

where f(R, E, C) is the fraction of PIs with regional affiliation R, experience level E, and science category C that are female, and N_triage(R, E, C) is the total number of triaged proposals in the demographic group, excluding proposals resurrected by reviewers. The uncertainties in the expected fraction were estimated assuming Poisson statistics on the number of female PIs used to compute f(R, E, C).

The expected fraction of triaged proposals for female PIs is listed in Table 4 by region and for all regions combined. In Cycles 1–5, female PIs across all regions had a larger fraction of the triaged proposals than expected based on the demographics and share of the proposals. The difference was largest in Cycle 3, as could have been anticipated from the Stage 1 rankings (see Figure 5). Nonetheless, the differences are not statistically significant in any given cycle. Only in Cycle 6 did female PIs have a lower percentage of the triaged proposals than expected based on the model. These basic trends are seen in East Asia, Europe, and North America, although there are individual cycles where female PIs in East Asia (Cycles 1 and 2) and North America (Cycles 5 and 6) had fewer proposals triaged than expected. Notably, in Europe, female PIs had a greater fraction of proposals triaged than expected in each cycle. The numbers of triaged proposals in Chile and non-ALMA regions are too small to identify any meaningful trends.
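Equation (1) is a demographically weighted average of the female fraction f(R, E, C), with the triaged counts as weights. A minimal Python sketch follows. The Group record and its field names are hypothetical, and propagating the Poisson error on each group's female count in quadrature, treating groups as independent, is our assumption about how the quoted uncertainties combine rather than a documented detail of the analysis.

```python
import math
from dataclasses import dataclass


@dataclass
class Group:
    """One demographic group (region R, experience E, science category C).
    Hypothetical record layout for this sketch."""
    n_female: int     # proposals in the group with a female PI
    n_proposals: int  # all proposals submitted in the group
    n_triaged: int    # triaged proposals, excluding resurrected ones


def expected_triage_fraction(groups):
    """Eq. (1): expected female fraction of triaged proposals and its
    uncertainty, assuming Poisson errors on each group's female count."""
    numerator = 0.0
    variance = 0.0
    n_triaged_total = 0
    for g in groups:
        if g.n_triaged == 0:
            continue  # contributes to neither sum in Eq. (1)
        f = g.n_female / g.n_proposals  # f(R, E, C)
        sigma_f = math.sqrt(g.n_female) / g.n_proposals
        numerator += f * g.n_triaged
        variance += (sigma_f * g.n_triaged) ** 2  # groups treated as independent
        n_triaged_total += g.n_triaged
    return numerator / n_triaged_total, math.sqrt(variance) / n_triaged_total
```

For instance, two groups with female fractions 10/40 and 5/50 and with 8 and 10 triaged proposals give an expected fraction of 3/18, or about 0.167. The observed f_t for comparison is simply the number of triaged proposals with a female PI divided by the total number triaged.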
Table 4. Gender Demographics of Triaged Proposals
[Table 4: for Cycles 1–6, f_t and f_t,expected (with ± uncertainties) for Chile, East Asia, Europe, North America, Other, and All regions.]

Note: The table lists the fraction of triaged proposals with a female PI (f_t) and the expected fraction (f_t,expected) given the demographics of the triaged proposals, as described in the text. The results are given for each region and all regions combined. Cycle 0 is not listed since no proposals were triaged in that cycle.

4.2. Comparison of Stage 1 and Stage 2 rankings
This section compares the Stage 1 rankings with the Stage 2 rankings to determine whether the systematics identified in Stage 1 change significantly as a result of the face-to-face discussion. The impact of the face-to-face discussions was assessed by comparing the cumulative Stage 1 and Stage 2 proposal rankings of the non-triaged proposals. The Stage 1 proposal rankings for the non-triaged proposals were extracted and then renormalized on a scale of 0 to 1 (see Section 2.2). This renormalization was needed to eliminate systematic differences between the Stage 1 and Stage 2 rankings, since triaged proposals have preferentially poorer ranks by design.

Figure 8 shows the cumulative distributions of Cycle 6 proposal rankings in Stage 1 and Stage 2 grouped by experience level. For each experience level, the cumulative distributions of the Stage 1 and Stage 2 normalized ranks are similar, and none of the differences are even marginally significant. While not shown here, previous cycles have similar results. The ranks for individual proposals did in fact change between the Stage 1 and Stage 2 reviews, but Figure 8 indicates that no systematic differences were introduced based on the experience level of the PI.

Figures 9, 10, 11, and 12 compare the Stage 1 and Stage 2 proposal ranks in all seven ALMA cycles for PIs from Chile, East Asia, Europe, and North America, respectively. Each figure also includes a plot that combines the results from all cycles. While none of the differences between the Stage 1 and Stage 2 ranks are significant in any given region or cycle, some tendencies are seen. Proposals from East Asia tend to be rated better as a result of the face-to-face discussions. This was most notable in Cycle 3 and to a lesser extent in Cycles 1, 4, and 6 (Figure 10).
In contrast, European PIs (Figure 11) tended toward poorer ranks after the face-to-face discussions in Cycle 3 (third and fourth quartiles) and Cycle 5 (second and third quartiles).

Combining all cycles, the probability that the Stage 1 and Stage 2 cumulative ranks for non-triaged proposals originate from the same population is 0.13 for Chile, 0.013 for East Asia, 0.07 for Europe, and 0.84 for North America. Thus there is a marginally significant tendency for the face-to-face discussion to improve the rankings of East Asian proposals, while negatively impacting the overall European proposal ranks.

Figure 13 shows the cumulative Stage 1 and Stage 2 rankings for the non-triaged proposals led by women. No significant or marginally significant differences between the two distributions are seen in any cycle. When averaged over all cycles, the probability is 0.91 that the Stage 1 and Stage 2 ranks for non-triaged proposals led by women PIs originate from the same population. Thus the two distributions are indistinguishable.

[Figure 9 panels: cumulative distribution versus normalized rank for Stage 1 and Stage 2, with p_AD = 0.95 (Cycle 0), 0.94 (Cycle 1), 0.86 (Cycle 2), 0.79 (Cycle 3), 0.44 (Cycle 4), 0.14 (Cycle 5), 0.29 (Cycle 6), and 0.13 (all cycles).]

Figure 9. Cumulative distribution of Stage 1 and Stage 2 proposal ranks for non-triaged proposals led by a Chilean PI in Cycles 0–6 and for all cycles combined.

[Figure 10 panels: cumulative distribution versus normalized rank for Stage 1 and Stage 2, with p_AD = 0.65 (Cycle 0), 0.45 (Cycle 1), 0.79 (Cycle 2), 0.10 (Cycle 3), 0.12 (Cycle 4), 0.65 (Cycle 5), 0.90 (Cycle 6), and 0.01 (all cycles).]

Figure 10. Cumulative distribution of Stage 1 and Stage 2 proposal ranks for non-triaged proposals led by an East Asian PI in Cycles 0–6 and for all cycles combined.

[Figure 11 panels: cumulative distribution versus normalized rank for Stage 1 and Stage 2, with p_AD = 0.98 (Cycle 0), 0.98 (Cycle 1), 0.99 (Cycle 2), 0.11 (Cycle 3), 0.68 (Cycle 4), 0.15 (Cycle 5), 0.88 (Cycle 6), and 0.07 (all cycles).]

Figure 11. Cumulative distribution of Stage 1 and Stage 2 proposal ranks for non-triaged proposals led by a European PI in Cycles 0–6 and for all cycles combined.

[Figure 12 panels: cumulative distribution versus normalized rank for Stage 1 and Stage 2, with p_AD = 0.99 (Cycle 0), 0.77 (Cycle 1), 0.96 (Cycle 2), 0.83 (Cycle 3), 0.57 (Cycle 4), 0.66 (Cycle 5), 1.00 (Cycle 6), and 0.84 (all cycles).]

Figure 12. Cumulative distribution of Stage 1 and Stage 2 proposal ranks for non-triaged proposals led by a North American PI in Cycles 0–6 and for all cycles combined.
In summary, the results shown in Figures 8–13 indicate that no significant systematics in the proposal rankings are introduced by the face-to-face review in terms of experience, regional affiliation, or gender in any given cycle. When averaged over all cycles, East Asian proposals tend to improve their rankings in the Stage 2 process relative to Stage 1.

[Figure 13 panels: cumulative distribution versus normalized rank for Stage 1 and Stage 2, with p_AD = 1.00 (Cycle 0), 0.73 (Cycle 1), 0.82 (Cycle 2), 0.65 (Cycle 3), 0.59 (Cycle 4), 0.68 (Cycle 5), 0.76 (Cycle 6), and 0.91 (all cycles).]

Figure 13. Cumulative distribution of Stage 1 and Stage 2 proposal ranks for non-triaged proposals led by women in Cycles 0–6 and all cycles combined.
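The renormalization step described in Section 4.2 can be sketched as follows. The exact mapping is defined in Section 2.2, which is not reproduced here, so this sketch assumes that rank 1 is the best proposal and maps the re-ranked non-triaged proposals linearly onto 0 (best) to 1 (worst) via (r − 1)/(n − 1). The dictionary inputs and both function names are hypothetical.

```python
def renormalize_non_triaged(stage1_rank, triaged):
    """Re-rank non-triaged proposals by Stage 1 rank and map onto [0, 1].

    stage1_rank: dict proposal_id -> Stage 1 rank (1 = best, assumed)
    triaged: set of proposal_ids triaged after Stage 1
    """
    kept = sorted((rank, pid) for pid, rank in stage1_rank.items() if pid not in triaged)
    n = len(kept)
    if n == 1:
        return {kept[0][1]: 0.0}
    # Assumed linear mapping: new rank r (1..n) -> (r - 1) / (n - 1)
    return {pid: i / (n - 1) for i, (_, pid) in enumerate(kept)}


def ranks_for_group(normalized, pi_group, group):
    """Select one group's normalized ranks after normalizing over all proposals."""
    return sorted(value for pid, value in normalized.items() if pi_group[pid] == group)
```

Normalizing over all non-triaged proposals in a cycle before selecting a region or gender keeps the Stage 1 and Stage 2 samples on the same footing; the Stage 2 ranks can be normalized the same way before the two distributions are compared.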
4.3. Proposal priority grades
Priority grades for the observing queue are assigned to the proposals by the JAO based on the Stage 2 rankings and also the share of time per executive. In recent cycles, balancing the proposal pressure in the various array configurations, local sidereal time ranges, and weather conditions has also been considered. Thus the systematic in the proposal rankings by region is partially compensated for when assigning priority grades by ensuring that each region obtains its pre-determined share of time. However, any systematics with gender in the proposal rankings are not considered when assigning grades.

Similar to the analysis of the gender distribution of triaged proposals, demographics need to be considered when comparing the acceptance rates of proposals led by men and women. The expected acceptance rate, where an “accepted” proposal is defined as one receiving a priority grade of A or B, was computed as

$$ f_{AB,\mathrm{expected}} = \frac{\sum_{\mathrm{Region}} \sum_{\mathrm{Experience}} \sum_{\mathrm{Category}} f_g(R,E,C)\, N_{AB}(R,E,C)}{\sum_{\mathrm{Region}} \sum_{\mathrm{Experience}} \sum_{\mathrm{Category}} f_g(R,E,C)\, N_{\mathrm{tot}}(R,E,C)}, \tag{2} $$

where f_g(R, E, C) is the fraction of the proposals in the demographic group (R, E, C) that have a PI of gender g, N_AB(R, E, C) is the number of proposals awarded grade A or B, and N_tot(R, E, C) is the total number of submitted proposals.

Table 5 lists the proposal acceptance rate by cycle for female and male PIs and the expected acceptance rate based on demographics. Because of the demographics, the overall acceptance rate of female PIs is expected to be lower than that of male PIs. This can be attributed primarily to regional differences: Europe and North America have the highest regional fractions of female PIs and high oversubscription rates, while Chile and East Asia have lower fractions of female PIs and often lower oversubscription rates. In addition, an increasing fraction of the PIs in East Asia and Europe are female, and relatively inexperienced PIs have poorer proposal rankings.

Figure 14 plots the difference between the actual and expected proposal acceptance rates by gender for East Asia, Europe, North America, and all regions combined (including Chile and non-ALMA regions). Plots are not shown for Chile and non-ALMA regions individually since these regions have relatively few proposals and large uncertainties. Considering all regions, female PIs have had a smaller fraction of their proposals assigned priority Grade A or B than expected in each cycle, although the differences are not statistically significant in any given cycle. Conversely, male PIs have had a higher acceptance rate than expected in each cycle. The difference was largest in Cycles 3 and 4 and has diminished in Cycles 5 and 6. The Cycle 3 result was anticipated based on the Stage 1 rankings (see Figure 5). The Cycle 4 result is more surprising in that the Stage 1 rankings for female and male PIs in Cycle 4 are indistinguishable. However, in that cycle, female PIs tended to have poorer proposal ranks after the Stage 2 process (see Figure 13). Even though the difference was not statistically significant, it was enough to lower the overall acceptance rate.

The trends are similar for individual regions. In East Asia, female PIs have had a lower acceptance rate than expected in 5 of the 7 cycles. In Europe, female PIs exceeded the expected acceptance rate in only one cycle. While female PIs in North America have exceeded expectations in 4 of the 7 cycles, the cycles where the acceptance rate is below expectations (Cycles 2 and 3 in particular) are more extreme than those where the acceptance rate is higher than expected.

SUMMARY AND CONCLUSIONS

An analysis is presented of the outcomes of the proposal review process in ALMA Cycles 0–6 to identify systematics in the proposal rankings that may signify potential bias with respect to the experience level, regional affiliation, or gender of the PI. The analysis was conducted for both the Stage 1 rankings, which are based on the preliminary scores of the reviewers, and the Stage 2 rankings, which are based on the face-to-face panel discussions. The results show that systematics are introduced primarily in the Stage 1 process and not in the face-to-face discussion.

One significant trend is that PIs who submit proposals every cycle have better proposal ranks than PIs who have submitted proposals for the first time. The trend is also present within intermediate levels of experience. One should not expect the proposal ranks to be completely uncorrelated with experience level, since many expert PIs will have a detailed understanding of how best to use ALMA, and it would not be surprising if they can write compelling proposals. Also, PIs who resubmit previously declined proposals have feedback from the reviewers on how to improve the proposal. The question remains, however, to what degree reviewers give experienced PIs leeway in the proposal review on the basis of perceived prestige, but the data in hand cannot address that question directly.

A second significant trend is that PIs from North America and Europe have better ranked proposals than PIs from East Asia and Chile, although a small improvement in the proposal rankings for East Asia is observed as a result of the Stage 2 discussions. This trend is not unique to ALMA. Reid (2014) reports that PIs from Europe and North America have a higher success rate on proposals submitted to the Hubble Space Telescope (HST) than PIs from the rest of the world.
Table 5. Acceptance Rate of Proposals
[Table 5: for each region group (Chile, East Asia, Europe, North America, Other, All regions) and each cycle, f_AB and f_AB,expected (with ± uncertainties) for female and male PIs.]

Note: The table lists the fraction of proposals assigned priority grade A or B (f_AB) with a female or male PI. Also listed is the expected fraction (f_AB,expected) given the demographics of the accepted proposals, as described in the text.

[Figure 14 panels: acceptance rate, actual minus expected (−10% to +10%), versus cycle (0–6) for male and female PIs in East Asia, Europe, North America, and all regions.]

Figure 14. The difference between the actual and expected acceptance rate of proposals with female and male PIs by cycle for East Asia, Europe, North America, and all regions combined (including Chile and non-ALMA regions). The vertical bars indicate the 1σ uncertainties.

While the origin of the trend in the ALMA proposal rankings with region is unclear, one speculation is that it can be attributed to differences in proficiency in the English language. ALMA proposals must be written in English, which is a second language for a large fraction of ALMA PIs. However, the proficiency level in English likely varies between regions and may cause reviewers to penalize proposals that are not as well written even if the underlying science is strong. Stylistic differences in how a proposal is structured may also exist between regions. Hall (1976) introduced the concept of high- and low-context communication and indicated that communication styles can vary between countries. High-context cultures rely heavily on non-verbal methods to convey information, while low-context cultures communicate information primarily through language. Hall (1976) indicated that Japan has high-context communication while the United States is low context. Since approximately two thirds of the ALMA reviewers are from Europe and North America, the proposal rankings could reflect a preference toward proposal styles from those regions, or low-context styles may be more suitable to proposal peer review. It is tempting to conclude that the improvement in the scientific rankings of East Asian proposals during the Stage 2 face-to-face discussions is a result of overcoming potential language biases or stylistic differences to select the best science, but that remains speculation for now.

As noted by Lonsdale et al. (2016), male PIs tended to have better ALMA proposal ranks than female PIs in Cycles 2–4.
This was most apparent in Cycle 3, but since then, no measurable differences exist in the cumulative Stage 1 proposal rankings between women and men in Cycles 4–6, even when individual cycles are combined. Nonetheless, the proposal acceptance rates, which ultimately reflect what is scheduled on the telescope, show a trend similar to that at HST (Reid 2014) and ESO (Patat 2016), in that proposals with female PIs have had a lower success rate in receiving telescope time than proposals with male PIs, even after accounting for demographic differences by region, experience, and science category. The difference between the actual and expected proposal acceptance rates for women is not significant in any given cycle, but is present in each cycle when considering all regions combined. Whether the systematic differences in the acceptance rate represent a bias in the review process or an unaccounted-for demographic difference between women and men (e.g., seniority; cf. Patat 2016) is unclear.

Identifying the underlying causes of the systematics in the proposal rankings is difficult. Multiple factors could plausibly contribute to the observed trends, and limited ancillary demographic data (e.g., seniority of the PI) are available to test various hypotheses. A deeper and more sophisticated analysis than presented here may clarify some of the causes, including understanding to what extent any systematics depend on the regional affiliation of the proposal reviewers. Establishing an objective measure of any stylistic differences in the proposals by region in particular could yield important insights into the region-based systematics.

In Cycle 7, ALMA took steps to reduce the impact of potential biases in the review process. The investigators were listed in random order on the proposal coversheet such that reviewers know the members of the proposal team but not the identity of the PI. Also, first names were listed only by the first initial so that the gender cannot be readily inferred.
These steps are expected to reduce biases that may be triggered by simply knowing the name of the PI, but other systematics identified here could remain. For example, if the differences in proposal ranks by region are caused by language or style, the systematics in the proposal rankings by region should not change.

These steps to modify the proposal coversheet follow those taken by HST after Reid (2014) identified a small but persistent effect where the acceptance rate of HST proposals was lower for women (19% on average) than for men (23% on average). Similar steps have recently been taken at ESO. Interestingly, the difference in the acceptance rate of HST proposals by gender persisted even after anonymizing the PIs. Only after the list of investigators was completely hidden from the reviewers in a double-anonymous review did the acceptance rate for proposals led by women exceed that of men (Strolger & Natarajan 2019). ALMA is following the HST experience and is considering implementing a double-anonymous review in future cycles.

ACKNOWLEDGMENTS

I am grateful to Andrea Corvillon for providing the proposal data used in this analysis. C. Lonsdale kindly provided her tabulation of genders for ALMA PIs and assisted in updating the information. D. Iono also helped identify many of the gender demographics for East Asian PIs. I also thank S. Dougherty, G. Mathys, M. Fukagawa, J. Greaves, L. Barcos, F. Schwab, L. Ball, D. Balser, and the anonymous referee for comments on the manuscript.
Software: Astropy (Astropy Collaboration et al. 2018), SciPy (Jones et al. 2001), kSamples (Scholz & Zhu 2019)

REFERENCES

Astropy Collaboration, Price-Whelan, A. M., Sipőcz, B. M., et al. 2018, AJ, 156, 123
Greaves, J. 2018, RNAAS, 2, 203
Hall, E. T. 1976, Beyond Culture (New York: Anchor Books/Doubleday)
Hunt, G., Schwab, F. R., & Ball, L. 2019, NRAO Telescope Time Allocation Report

A. PROPOSAL ACCEPTANCE RATE FOR ALMA REVIEWERS

Greaves (2018) analyzed the proposal statistics for an anonymous observatory using published lists of reviewers and accepted proposals. Private communication with J. Greaves confirmed that the observatory is ALMA. The main result of the paper is that the Cycles 2–4 reviewers had a three-fold increase in the number of proposals accepted while serving on the panel compared to when they were not serving on a review panel. The inference was that there is a bias in the review process where reviewers preferentially favor proposals from the other review participants. Since the proposals submitted by a given reviewer are generally reviewed in a different panel, the implication is that reviewers are predisposed toward proposals from reviewers in other panels. A limitation of this analysis, however, is that Greaves (2018) did not have access to the total number of proposals submitted by the reviewers and could not compute the fraction of submitted proposals that were accepted.

This appendix investigates the result from Greaves (2018) by analyzing both the accepted and rejected proposals to examine the proposal acceptance rate of the reviewers and not just the number of accepted proposals. Table 6 presents the number of proposals submitted and accepted by cycle for reviewers while they served on the ALMA review panels in any cycle and when they were not on the review panels. Most reviewers serve for three consecutive cycles. For example, the Cycle 0 reviewers also typically served on the review panels in Cycles 1 and 2. The Cycle 0 reviewers submitted 130 proposals while serving on a panel in any cycle, and 48 were awarded Grade A or B, for an overall acceptance fraction of 36.9%.
By comparison, when the Cycle 0 reviewers were not serving on a panel, they have submitted 187 proposals and 60 have been awarded Grade A or B, for an acceptance fraction of 32.1%. The uncertainty in each of the acceptance fractions is ∼±…

Table 6. Proposal Statistics for ALMA Reviewers

        All PIs        Reviewers serving on a panel             Reviewers not serving on a panel
Cycle   Acceptance     Submitted   Accepted   Acceptance        Submitted   Accepted   Acceptance
        fraction (%)                          fraction (%)                             fraction (%)
0       17.…           130         48         36.…              187         60         32.…
1       …              211         79         37.…              279         93         33.…
2       …              233         92         39.…              292         97         33.…
3       …              234         95         40.…              271         95         35.…
4       …              350         128        36.…              497         165        33.…
5       …              343         118        34.…              565         197        34.…
6       …              348         122        35.…              704         258        36.…
All     23.…           …           …          …                 …           …          …
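The acceptance fractions in Table 6 are ratios of accepted to submitted proposals with asymmetric uncertainties. The method behind those uncertainties is not stated in this section, so the sketch below assumes a Wilson score interval with z = 1 (roughly 1σ). It reproduces the 36.9% and 32.1% quoted for the Cycle 0 reviewers, but its interval need not match the published error bars, and the function name is hypothetical.

```python
import math


def acceptance_fraction(n_accepted, n_submitted, z=1.0):
    """Acceptance fraction with an asymmetric Wilson score interval.

    z = 1.0 gives an interval of roughly 1 sigma (about 68% coverage).
    Returns (fraction, lower bound, upper bound).
    """
    p = n_accepted / n_submitted
    denom = 1.0 + z**2 / n_submitted
    center = (p + z**2 / (2 * n_submitted)) / denom
    half_width = (z / denom) * math.sqrt(
        p * (1.0 - p) / n_submitted + z**2 / (4 * n_submitted**2)
    )
    return p, center - half_width, center + half_width


# Cycle 0 reviewers, serving versus not serving on a panel (counts from Table 6)
for label, accepted, submitted in [("serving", 48, 130), ("not serving", 60, 187)]:
    frac, lo, hi = acceptance_fraction(accepted, submitted)
    print(f"{label}: {100 * frac:.1f}% (+{100 * (hi - frac):.1f}/-{100 * (frac - lo):.1f})")
```

For the Cycle 0 reviewers while serving on a panel (48 of 130), this gives 36.9% with an interval of about +4.3/−4.1 percentage points. The Wilson form is chosen here over the symmetric sqrt(p(1 − p)/n) error because it stays within [0, 1] and becomes asymmetric for small samples or extreme fractions.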