Investigating institutional influence on graduate program admissions by modelling physics GRE cut-off scores
Nils J. Mikkelsen,¹ Nicholas T. Young,²,³ and Marcos D. Caballero¹,²,³,⁴,∗

¹Center for Computing in Science Education & Department of Physics, University of Oslo, N-0316 Oslo, Norway
²Department of Physics and Astronomy, Michigan State University, East Lansing, Michigan 48824
³Department of Computational Mathematics, Science, and Engineering, Michigan State University, East Lansing, Michigan 48824
⁴CREATE for STEM Institute, Michigan State University, East Lansing, Michigan 48824
∗Corresponding Author: [email protected]

(Dated: September 30, 2020)

Despite limiting access to applicants from underrepresented racial and ethnic groups, the practice of using hard or soft GRE cut-off scores in physics graduate program admissions is still a popular method for reducing the pool of applicants. The present study considers whether the undergraduate institutions of applicants have any influence on the admissions process by modelling a physics GRE cut-off score with application data from admissions offices of five universities. Two distinct approaches based on inferential and predictive modelling are conducted. While there is some disagreement regarding the relative importance between features, the two approaches largely agree that including institutional information significantly aids the analysis. Both models identify cases where the institutional effects are comparable to factors of known importance such as gender and undergraduate GPA. As the results are stable across many cut-off scores, we advocate against the practice of employing physics GRE cut-off scores in admissions.
Keywords: Physics graduate admissions, physics GRE, institutional influence, logistic regression modelling, supervised machine learning.
I. INTRODUCTION
While recent studies have called into question the over-reliance on Graduate Record Examination (GRE) scores in physics graduate admissions [1, 2], filtering applicants based on a strict or effective minimum score is still a popular practice today [3]. Given the role of the GRE in admissions, understanding the factors influencing GRE scores may provide insight into how, when compared to other science, technology, engineering and mathematics (STEM) disciplines, the physics graduate admissions process has failed to improve gender, racial, and ethnic diversity by systematically excluding these applicants [4, 5]. A number of studies have investigated correlations between GRE scores and demographics [1, 2], but little attention has been given to the institutional backgrounds of applicants. An applicant's undergraduate background could play a significant role in their graduate application [6]. Institutions that themselves offer a PhD program would likely place more emphasis on both preparing and motivating undergraduate students for further studies. Larger physics departments with more resources are able to offer students more advanced coursework and hands-on experimental work as well as provide a larger variety of staff expertise. Larger undergraduate programs can facilitate network-building, both between students and faculty members, and collaboration via projects and study groups. Although attributes such as motivation and opportunity cannot be appropriately measured, their effects on the GRE can be linked to metrics such as the size and type of institutions, as was done in Halley et al. [7]. In order to estimate these institutional effects, we have analyzed the Physics GRE Subject Test (P-GRE) scores of graduate program applications from four public universities and one private university.

The applications include a variety of information, but the present study will focus on numerical and categorical data, all of which constitutes a mixture of data structures. A number of recent studies working with similar data have approached the problem using machine learning methods [8, 9]. Many machine learning methods lend themselves to problems with mixed data, albeit they do not share the interpretability of more conventional modelling methods. The present study will employ both approaches, comparing and contrasting the results.

The aim of this study is to continue the discussion on the practice of employing formal or informal P-GRE score cut-offs in graduate admissions using a combination of modelling and machine learning methods. The idea is to analyze the P-GRE scores of PhD program applicants with respect to applicants' undergraduate Grade Point Average (U-GPA), demographics and institutional background. Our guiding research questions (RQs) are as follows.

1. To what extent does the applicant's undergraduate institution influence whether they are able to attain a minimum P-GRE score expected by an admissions committee?
2. To what degree do the institutional effects compare to known effects such as U-GPA, gender and race?
3. How do the results depend on the specific cut-off chosen by the admissions office?
4. How well do the conventional and machine learning approaches agree on RQs 1, 2 and 3?
II. BACKGROUND
Following the calls for increasing diversity in STEM disciplines, there has been a steady growth in the representation of women and ethnic/racial minorities over the past couple of decades [10]. Despite the progress, however, physics has seen particularly poor development in comparison. Since the late 1990s, the percentage of bachelor and PhD degrees awarded to women in physics has stagnated at about 20%, mirroring similar numbers in engineering and computer science [4]. The numbers are even more concerning for racial minorities, who during the three-year period 2014-2016 earned 11% of bachelor degrees and only 7% of PhD degrees [10]. The discrepancy in female, racial and ethnic representation likely stems from a variety of factors involving admission and retention issues, many of which are rooted in cultural and structural problems including sexual harassment and systemic racism [11–13].

In her extensive review of the general practices of graduate program admissions, Inside Graduate Admissions (2016) [14], Posselt notes that most admissions committees (in the natural sciences as well as in the humanities and social sciences) measured students' merit primarily on the basis of their undergraduate GPA (U-GPA) and GRE scores alone. Indeed, Young and Caballero were able to predict the admittance of prospective physics PhD students with 75% accuracy using machine learning methods based only on their U-GPA and P-GRE score [8]. The GRE test makers, Educational Testing Service (ETS), recommend against the use of GRE scores as the sole basis for admissions decisions, particularly cautioning against the practice of filtering applicants based on a minimum cut-off score [15]. Despite this, Potvin et al. found that 32% of physics graduate programs state they filter applicants with a minimum P-GRE score [3]. Furthermore, of the programs that say they do not filter applicants, several reported using a "rough cut-off" or wanting a "preferable score", suggesting that more than 32% of programs filter applicants in practice.

As highlighted by Miller and Stassun in 2014 [1], on average, women score 80 pts lower than men on the GRE in the physical sciences, while Black test-takers score 200 pts lower than white test-takers. The authors further note that the practice of filtering prospective students with a minimum score, which is in violation of ETS's own guidelines, thus "adversely affects women and minority applicants". In addition to limiting access for minority applicants during the application process, the GRE also acts as a barrier to apply. In a survey of prospective students from underrepresented racial and ethnic groups who were interested in pursuing a PhD in physics but ultimately chose not to apply, Cochran et al. note that the GRE was the "most common theme" expressed by students as a barrier to apply [16].

In spite of its established popularity in admissions, the GRE's ability to identify promising students has recently been called into question. One study found that while requiring a minimum P-GRE score limits access for physics graduate program applicants from minority groups, GRE scores were incapable of predicting PhD completion [2]. In a 2015 survey of prize-winning postdoctoral fellows in astronomy [17], Levesque et al. found that the P-GRE scores of fellows did not adhere to any minimum percentile score, suggesting that the GRE is also a poor estimator of future research excellence. The authors further point out that a minimum percentile score of 60% would have eliminated 44% of participants, including 60% of female fellows. The inability of the GRE to identify promising students has also been noticed by other groups such as the National Science Foundation, which recently decided to drop the GRE from the application to their Graduate Research Fellowship Program (see FAQ no. 52 [18]).

Prior work has typically focused on admissions committees' over-reliance on the GRE and the consequences of using cut-off scores in graduate admissions [1, 2, 19, 20]. Missing from the conversation is an understanding of what institutional factors, which come into play during applicants' undergraduate study (or even earlier), may influence GRE scores. In a 1991 study, Halley et al. investigated how the topics covered by the P-GRE compared with the physics major curriculum by analyzing the P-GRE scores of students from different institution types [7]. The authors noted that the portion of correct answers was higher for students from "top" institutions, and highest for students from "top" institutions with graduate programs. However, this study is both nearly 30 years old and worked with an imbalanced sample (701 test-takers in total, 21 of which attended a top undergraduate institution). Since then, the GRE has evolved and the number of physics degrees awarded annually has almost doubled [21]. Nowadays, the GRE does not penalize incorrect answers, i.e., guessing, which has likely changed the way students approach the test. To our knowledge, there has not yet been a modern study analyzing how institutional factors may affect GRE scores.
III. METHODS
The target of this investigation is to explain whether a student scores above or below a P-GRE cut-off score selected by an admissions committee. This is encoded using a binary response variable named ABOVE with the interpretation that an applicant with a score above or equal to the cut-off has ABOVE = 1, and an applicant with a score below the cut-off has ABOVE = 0. That is, given a test score x and a cut-off score C, we define

    ABOVE = { 1,  x ≥ C,
            { 0,  x < C.          (1)

The reader should recall that the possible scores on GRE subject tests range from 200 to 990 in 10 pt. intervals. We have focused on P-GRE cut-off scores ranging from 620 to 800 pt., corresponding to the 32nd and 67th national percentiles [22]. Typical P-GRE cut-off scores lie in the region of 700 [2].

The data used in this study consists of 2017/2018 admissions records for physics graduate programs from 4 public universities in the Big Ten Academic Alliance and one private Midwestern university. The records contain unidentified profiles of program applicants with information regarding their GRE performance, undergraduate GPA, ethnicity and race, gender, etc. In addition, the records also include which institution the applicants attended during their bachelor's degrees. Complementary data describing the bachelor institutions has been added from three sources: the 2015 Carnegie Classification of Institutions of Higher Education [23], Barron's selectivity index [24], and 2017-2018 surveys of American universities by the American Institute of Physics (AIP) [25, 26]. The additional data describes several aspects of the institutions such as institution-wide admissions selectivity and the size of physics programs. The main idea is to study the statistical effects from applicants' institutional backgrounds using this complementary data.
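As a concrete illustration, the encoding in Eq. (1) amounts to a single comparison. The sketch below is in R, the language used for our analysis elsewhere; `pgre` is a hypothetical vector of P-GRE scores, not a variable name from our data files.

    # Minimal sketch of Eq. (1): encode ABOVE from P-GRE scores.
    # `pgre` is a hypothetical numeric vector of scores (200-990).
    C <- 700                        # a typical cut-off score [2]
    ABOVE <- as.integer(pgre >= C)  # 1 if at or above the cut-off, 0 otherwise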
A. On the data

The admissions records contain 5738 applications in total, but only 5314 (ca. 93%) of them include the students' P-GRE scores. Applications without P-GRE scores are ignored to avoid influencing the P-GRE distribution. Of the remaining applications, 2575 are domestic (ca. 48%). This study will focus entirely on domestic students for two main reasons. First, the P-GRE distribution for international students is much more saturated with perfect scores than the distribution for domestic students. The saturation problem is visualized in Figure 1: the percentage of international students scoring above the selected cut-off scores both starts off much higher and falls off much more slowly than the percentage of domestic students. Second, because there is no systematic collection of graduation records for non-US schools, it is difficult to reliably collect the necessary information for every international student.

Because the applicants are not identified, several applications may come from the same student. While these applications are unique in the sense that each application addresses a different school, they count as duplicated applications in this analysis by virtue of being from the same student. Duplicate applications could have an effect on the results, most notably in the logistic regression model, which relies on independent observations (see supplementary material). By comparing applications according to demographics and academic performance, a number of possible duplicate applications have been identified. In case all candidates are duplicates, roughly
Figure 1. A comparison of the P-GRE distribution between national data [22] and data used in this study. The analysis is primarily concerned with domestic applicants (green curve).
1. The raw features
In addition to the P-GRE score, thirteen features, or variables, have been selected for analysis. A summary of the features and their sources is given in Table I.

Table I. A summary of the features used in this study.

    Feature                         Type         Source
    Physics GRE score               continuous   Admissions
    Undergraduate GPA               continuous   Admissions
    Gender                          binary       Admissions
    Race                            categorical  Admissions
    Carnegie Classification         categorical  Carnegie
    Undergrad Population Profile    categorical  Carnegie
    Funding category                categorical  Carnegie
    ACT selectivity category        categorical  Carnegie
    Minority Serving Institution    binary       Carnegie
    Barron's selectivity index      categorical  Barron's
    No. bachelor graduates (2017)   continuous   AIP Survey
    No. bachelor graduates (2018)   continuous   AIP Survey
    No. PhD graduates (2017)        continuous   AIP Survey
    No. PhD graduates (2018)        continuous   AIP Survey

The features from the admissions records include the applicants' P-GRE score, U-GPA, gender, and race. Note that the gender feature is encoded as a binary variable; while we acknowledge that gender is not binary, more detailed descriptions were not collected by the admissions offices [27]. Similarly, different practices regarding the collection of data on racial and ethnic backgrounds have limited the scope of the race feature. See Posselt et al. for more details regarding the collection of data on racial and ethnic backgrounds by admissions offices [20]. The features from the admissions records constitute the applicant-specific component of the models, while the remaining features comprise the institutional component.

Of the Carnegie features, the two most prominent are the (2015) Carnegie (basic) classification of institutions and the (2015) undergraduate population profile classification. The basic classification is an overall categorization of the academic degrees offered and awarded by the institutions, e.g. Doctoral university with high research activity and Master's college with large programs. The undergraduate population profile classification characterizes the typical undergraduate population according to three metrics: the portion of full-time undergraduates, the academic achievements of first-year and first-time students, and the portion of entering transfer students. In addition, the Carnegie features also include the institutions' Funding category and ACT selectivity category, and whether the institutions are Minority Serving Institutions (MSI). The ACT category measures the entry selectivity of admissions offices by grouping all institutions according to the ACT scores of first-year bachelor students, and MSI indicates whether an institution satisfies the requirements for a Minority Serving Institution [28].

Lastly, Barron's provides the Profile of American Colleges [24], which is an index for institution-wide admissions selectivity, and the AIP surveys provide the numbers of bachelor and PhD students graduating in physics.

The data will be analyzed using two different data analysis methods based on logistic regression modelling and predictive machine learning analysis (described in Sec. III B). As they stand, the raw features are not well-suited for logistic regression due to computational issues as well as modelling-related difficulties. The remaining part of this section describes our data preprocessing and modelling choices. See Sec. V C for a discussion of potential issues. Because the predictive analysis requires less preprocessing than logistic regression, we provide a summary of all the models used in this study in Sec. III C to avoid confusion.
2. Underrepresented racial and ethnic minorities
The small representation of applicants from racial and ethnic minorities (Black, Latinx, Multi and Native) is of computational concern because logistic regression fares poorly with low-frequency categories [29]. Because initial tests including every racial group produced results with limited statistical power (e.g. infinite p-value confidence intervals), we combined racial and ethnic minorities into an underrepresented minority (URM) category despite Teranishi's warning [30]. This also combines their P-GRE distributions (see Figure 2), leading to a loss of information. This issue is further discussed in Sec. V C.

Figure 2. Estimated P-GRE distributions by racial and ethnic groups (number of applicants indicated in parentheses). Note that the combined distribution normalizes the differences between the combined groups.
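This recoding is a one-line relabelling. The sketch below assumes a hypothetical data frame `apps` whose `race` column uses the group labels quoted above; the admissions offices' actual coding may differ.

    # Collapse low-frequency racial/ethnic groups into a single URM level.
    apps$race <- as.character(apps$race)
    apps$race[apps$race %in% c("Black", "Latinx", "Multi", "Native")] <- "URM"
    apps$race <- factor(apps$race)  # re-level so the regression uses the collapsed categories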
3. The Carnegie classification & undergraduate population profile
While the Carnegie classification and undergraduate population profile support 34 and 16 unique categories, respectively, the limited pool of applications leaves many categories empty or with only a handful of applicants. Most of the categories are difficult to combine into meaningful groups. Thus, to avoid computational issues, the two features are replaced by the binary labels Doctoral university w/ highest research activity and Most selective undergraduate population.
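In code, this reduction is a pair of logical comparisons. The column and label strings below are hypothetical stand-ins; the exact spellings in the Carnegie data files may differ.

    # Replace the multi-category Carnegie features with binary indicators.
    apps$HighestResearch <- apps$CarnegieClass ==
      "Doctoral university w/ highest research activity"
    apps$MostSelectiveUG <- apps$UGPopProfile ==
      "Most selective undergraduate population"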
4. Funding category & ACT selectivity category
Similar to the Carnegie features, both Funding category and ACT selectivity category have categories with too few applicants. To avoid complications, the features are reduced to the binary labels Public Funding and Most ACT-selective, which, respectively, indicate whether the institution is publicly funded and whether the institution is in the most selective ACT category.
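The analogous sketch for these two features, again with hypothetical column and label names:

    # Reduce Funding category and ACT selectivity category to binary labels.
    apps$PublicFunding    <- apps$FundingCategory == "Public"
    apps$MostACTSelective <- apps$ACTCategory == "Most selective"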
5. Barron’s selectivity index
Barron's selectivity index is an admissions selectivity measure that categorizes institutions according to school competitiveness. In decreasing order of competitiveness, the categories include most competitive, highly competitive, very competitive, competitive, less competitive and non-competitive. Additional "plus" categories such as highly competitive plus have been collapsed into their corresponding ordinary levels. In this study, admissions selectivity is used as a metric for an institution's resources and staff experience. Because admissions selectivity is expected to have an effect only for the most selective schools, the selectivity categories less competitive than most competitive and highly competitive are combined into a single not as competitive category.
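A sketch of the collapse in R, assuming a hypothetical `Barrons` column whose levels follow the names listed above:

    # Fold "plus" levels into their ordinary levels, then pool everything
    # below "highly competitive" into a single "not as competitive" level.
    b <- sub(" plus$", "", as.character(apps$Barrons))
    b[!b %in% c("most competitive", "highly competitive")] <- "not as competitive"
    apps$Barrons <- factor(b, levels = c("not as competitive",
                                         "highly competitive",
                                         "most competitive"))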
6. No. bachelor/PhD graduates 2017/2018
In this study, the AIP features (see Table I) provide a measure of the size of undergraduate physics departments. As larger departments typically have more financial resources available and may offer students more opportunities for advanced coursework or research, the P-GRE scores of applicants from larger programs are expected to be higher [7]. However, because of the variety of institutions and physics programs, a systemic effect is expected to emerge only for very large physics programs. Instead of analyzing the raw number of graduates, a physics program is therefore classified as large if the number of graduates is above the 75th national percentile [21].

While the typical size of physics departments is unlikely to change on a yearly basis for most institutions, the exact number of graduates is much more sensitive to variation. Moreover, the applicants spent several years at their undergraduate institutions, thus it is unreasonable to estimate the general size of the physics departments using data from a single year. Because the statistical models cannot include data on both years simultaneously (i.e., as individual features) due to correlation issues, the 2017 and 2018 data must be combined (bachelor and PhD features separated). For most institutions, the difference in the number of bachelor/PhD graduates between 2017 and 2018 is not large enough to have any effect on the analysis. However, because the difference is large for some institutions, naively selecting, say, the average could overestimate or underestimate the size of some departments. In addition, there are some institutions for which data is missing for either 2017 or 2018. To avoid inaccurate single-point estimates of department sizes, the maximum and minimum cases are considered separately. In the maximum graduates models, the maximum number of bachelor and PhD students between the 2017 and 2018 data is included, and vice versa in the minimum graduates models. For institutions with missing data, any available data is used for both models.
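A sketch of the maximum and minimum graduates constructions, with hypothetical column names; `pmax`/`pmin` with `na.rm = TRUE` reproduce the fallback to whichever year is available:

    # Large-program indicator for the maximum and minimum graduates models.
    # `p75` is a placeholder for the 75th national percentile of graduates [21].
    bach_max <- pmax(apps$bach2017, apps$bach2018, na.rm = TRUE)  # maximum models
    bach_min <- pmin(apps$bach2017, apps$bach2018, na.rm = TRUE)  # minimum models
    apps$LargeBachelorMax <- bach_max > p75
    apps$LargeBachelorMin <- bach_min > p75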
B. Methods for data analysis
The following section provides a brief overview of the methods used in this study. Additional details are provided as supplementary material. Because logistic regression is likely familiar to a greater audience, more time is spent on the machine learning methods.
1. Logistic regression modelling
Logistic regression analysis is a technique for modelling a binary response y ∈ {0, 1} with respect to explanatory variables x₁, . . . , x_k, which may consist of a mixture of continuous and discrete data. While binary data is naturally handled by logistic regression, categorical (discrete) data with M > 2 categories must be encoded using M − 1 binary variables according to the one-hot encoding scheme (see supplementary material for details). The response is modelled according to the odds equation,

    odds(p) = exp(β₀ + β₁x₁ + · · · + β_k x_k + ε),          (2)

where p is the probability of the outcome y = 1, β_i is the regression coefficient of x_i and ε is an error term. The regression coefficients are determined numerically using an iterative scheme based on maximum likelihood estimation. In our study this is handled by the glm function in R [31].

A major benefit of logistic regression modelling is the interpretability of its regression coefficients. When x_i increases by 1 unit, the odds change by a factor of exp(β_i) called the odds ratio:

    OR(p; x_i) = odds(p; x_i + 1) / odds(p; x_i) = exp(β_i).  (3)

The interpretation of the odds ratio depends on whether x_i is continuous or categorical. For continuous features, the change is associated with a unit increase in x_i. For binary features, the change is associated with a switch in x_i from category 0 to category 1. Because multi-leveled categorical features are encoded with binary features, each binary represents a change from the reference category to the category associated with the binary. Odds ratios below 1 are inverted so that 1/OR(x_i) is the odds ratio associated with a unit decrease in x_i or a switch in x_i from category 1 to category 0. In order to avoid interpretation issues relating to very large or very small continuous features, it is customary to standardize continuous features by centering the mean about 0 and normalizing the variance to 1. For standardized features, the odds ratio is associated with an increase in the original feature by one standard deviation.

Alongside the regression coefficients, the glm function provides the corresponding p-values. To avoid multiple comparisons problems, the p-values are adjusted according to the Bonferroni correction. For a logistic regression model with N features, the Bonferroni-adjusted p-value is p̃ = Np. We follow common practice and include three levels of significance: α = 0.05, α = 0.01 and α = 0.001.

Because logistic regression is unable to handle missing values, we follow Nissen et al.'s recommendation of imputing the missing data instead of discarding it [32]. Our approach employs the MICE (Multiple Imputation by Chained Equations) algorithm, which is handled by the mice package in R [33]. MICE is an iterative algorithm that applies linear and logistic regression techniques in order to impute the data while conserving the relationships between the features as well as possible. The algorithm constructs N individual data sets to be modelled separately, the results of which are pooled (combined) according to Rubin's rules [34]. In this study, 5 imputation sets were created using 20 iterations (leaving other mice parameters at their defaults). Because the raw features are processed, the transformation must occur either before, after, or during the imputation. To our knowledge, there are no recommended strategies for the kinds of transformations used in this study. We therefore follow the general recommendation of von Hippel to "impute, then transform" [35]. As recommended by Moons et al. [36], the P-GRE scores are included in the imputation before preparing ABOVE.
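The imputation-and-pooling pipeline can be sketched as follows. The data frame `apps` and the three predictor names are hypothetical stand-ins; the actual models include all features of Table I, and the summary column names assume a recent version of mice.

    # Impute (5 sets, 20 iterations), fit one logistic regression per set,
    # and pool with Rubin's rules; Bonferroni-correct the pooled p-values.
    library(mice)

    imp  <- mice(apps, m = 5, maxit = 20, seed = 1)
    # In the study, ABOVE is derived from the imputed P-GRE scores at this
    # point ("impute, then transform" [35, 36]).
    fits <- with(imp, glm(ABOVE ~ GPA + Gender + URM, family = binomial))
    res  <- summary(pool(fits))

    res$OR    <- exp(res$estimate)         # odds ratios, Eq. (3)
    res$p_adj <- pmin(1, 3 * res$p.value)  # Bonferroni p~ = Np with N = 3 features here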
2. Machine learning analysis
Whereas logistic regression favors interpretability (via the odds ratios), machine learning analysis (MLA) focuses on making accurate and reliable predictions. Given inputs x₁, . . . , x_k and an output y, the goal of MLA is to identify a map f such that

    y = f(x₁, . . . , x_k) + ε,          (4)

where ε is a prediction error. When y is categorical (e.g. binary), f is called a classifier because it classifies a set of inputs into discrete outputs. As classifiers are seldom perfect, a major component of MLA consists of finding the optimal f, i.e. minimizing ε. To measure how well a classifier is able to classify inputs we use performance metrics. Different metrics highlight different types of behavior, meaning a classifier can score well according to one metric, but poorly according to another. This study employs two metrics: the prediction accuracy score and the AUC-ROC score.

The prediction accuracy score of a classifier is the portion of correctly classified cases. In terms of our data, a correctly classified case is any application for which the classifier successfully predicts whether the applicant scores above or below the cut-off score. It is typically referred to as simply the accuracy and is often reported as a percentage. Accuracy is a number between 0% and 100%, where 100% signifies a perfect classifier. While easy to interpret, accuracy is very sensitive to unbalanced output classes (see the "Domestic applicants" curve in Figure 1 for the class imbalance faced in this study) because it does not distinguish between the output classes. For instance, if 80% of applicants score above the cut-off, then a naive classifier predicting above regardless of the inputs will have an accuracy score of 80%. For this reason, accuracy should always be considered relative to class imbalance. Furthermore, because the class imbalance changes as the cut-off increases (Fig. 1), the interpretation of the nominal accuracy score changes. Hence, the accuracy scores of two classifiers using different cut-offs should not be compared nominally.

The AUC-ROC score is a more complex metric than accuracy. Here, ROC refers to a Receiver Operating Characteristic curve and AUC means taking the Area Under the ROC Curve. For more details regarding ROC curves, consult the supplementary material. The AUC-ROC score, or simply the AUC, is a measure of a classifier's ability to distinguish between output classes. AUC is a number between 0 and 1, where 1 signifies a perfect classifier, while a score of 0.5 is equivalent to complete guesswork. There is no universal scheme for judging AUC scores, but Hosmer et al. provide a rough guide: 0.7 ≤ AUC < 0.8 is acceptable, 0.8 ≤ AUC < 0.9 is excellent and 0.9 ≤ AUC is outstanding [29]. In contrast with the accuracy score, AUC is more robust towards imbalanced output classes [37], and thus AUC scores can be more reliably compared across different cut-off scores.

MLA typically consists of two phases: training and testing. Here, training refers to the construction of a classifier, and testing refers to its evaluation based on performance metrics. A typical problem in MLA known as overfitting arises when a classifier is trained to recognize "too many details" of a data set. Thus, instead of replicating the general trend of the data set, the classifier replicates the random errors. To avoid this, it is standard practice to use different data sets for the training and testing phases by splitting the (complete) data set at random.
Because random splits can have unforeseen consequences, it is common to conduct several training-testing procedures and average the performance metrics, using the standard errors of the averages as indicators for the confidence intervals. This study employs the K-fold cross-validation algorithm with K = 10 to prepare the random splits [38].

It is important to note that finding a perfect classifier is typically considered impossible, even if ε = 0 for all known data. Thus, there is no single correct algorithm for constructing f, and in fact, there are many unique algorithms to choose from. This study employs the conditional inference forest (CIF) algorithm, which is a variant of the earlier random forest algorithm [39, 40]. A random forest is comprised of an ensemble of decision trees, each of which is an independent classifier. A decision tree is an algorithmic approach to decision-making (predictions) that asks a series of yes-no questions based on the input data (e.g. whether an applicant is male, or whether their U-GPA exceeds a given threshold). The questions are determined during the training phase and are chosen to optimize performance. Each tree is given a random sample of the training set and a random selection of the input features. Predictions of the forest are then based on a majority vote among the predictions of the trees. A CIF is similar to a random forest in principle, but differs in its construction.

This study employs the CIF algorithm via the party package in R [41–43]. The forests were built using 200 trees and 3 features per tree (following the recommended √p [44]), with all other parameters kept at their defaults. One of the selling points of the CIF is that it provides a natural way of measuring the importance of each feature in the model. The process of preparing the importance measures for each feature is also handled by party. The idea is to remove a feature from the forest and measure the resulting change in a performance measure, interpreting a larger change as the feature being more important. As described in Janitza et al., measuring AUC loss is preferred due to its robustness with imbalanced data [45]. The importance measure is a tool for comparing the relative importance of features and should not be interpreted further [46].

Because the importance measures focus on the impact of removing each feature separately, a backwards recursive feature elimination (RFE) procedure is conducted to study the effect of removing several features (see e.g. [38]). To restrict the scope, the procedure is only executed for P-GRE cut-offs in intervals of 30 pt. RFE is an iterative process that involves training a forest, estimating its performance, and removing the least important feature from the set of active features. Starting with all features, the process is repeated until one feature remains. The order of removal is determined by the importance measures of the forest model. The importance measures are computed using the complete model, i.e., not during the procedure, to avoid overfitting [47]. Because the importance measures vary depending on the cut-off, one would ideally prepare a removal order separately for each cut-off and conduct a unique RFE for each cut-off. However, because the importance measures are similar for different cut-offs, an average removal order is used for all cut-offs.
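A sketch of one training-testing fold with the party package as described above; `train` and `test` are hypothetical folds from the 10-fold split, with ABOVE stored as a two-level factor so that cforest treats the problem as classification. varimpAUC assumes a recent version of party.

    # Conditional inference forest: 200 trees, 3 candidate features per split.
    library(party)

    cif <- cforest(ABOVE ~ ., data = train,
                   controls = cforest_unbiased(ntree = 200, mtry = 3))

    # AUC-based permutation importance, following Janitza et al. [45];
    # the resulting ranking is also what seeds the RFE removal order.
    vi <- varimpAUC(cif)
    sort(vi, decreasing = TRUE)

    # Accuracy on the held-out fold (majority vote over the 200 trees).
    pred <- predict(cif, newdata = test)
    mean(pred == test$ABOVE)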
C. A summary of the models

Most of the data preprocessing described in Sec. III A is done for logistic regression. This includes combining racial and ethnic minorities into an underrepresented minority category; reducing the Carnegie features Carnegie Classification, Undergraduate Population Profile, Funding Category and ACT selectivity category to binary labels; combining the Barron's selectivity categories less competitive than most competitive and highly competitive into a not as competitive category; and categorizing physics programs (both undergraduate and graduate) as large if the number of graduates is above the 75th national percentile. Because the computational difficulties of logistic regression related to multicollinearity and low-frequency categories are circumvented by the decision-tree construction of the CIF algorithm, none of these preprocessing procedures are required for the data to be compatible with the CIF models. With the exception of combining the 2017 and 2018 graduates data into minimum and maximum cases, the data is only preprocessed for the logistic regression models.

Avoiding preprocessing for the CIF models is in line with the philosophy of the predictive modelling approach. In contrast with how logistic regression emphasizes interpretability, machine learning is only interested in the relationship between the features and the response. Preprocessing the data dilutes the available information, and thus may negatively affect the predictive analysis (e.g., as in Fig. 2).

Overall, 19 × 2 × 2 logistic regression and CIF models are studied: there are 19 unique P-GRE cut-offs under consideration, and for each cut-off, a model is constructed with and without the potential duplicate applications (Sec. III A), and using the maximum and minimum number of graduates between 2017 and 2018 (Sec. III A 6).
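The full set of model configurations can be enumerated directly; this sketch just makes the 19 × 2 × 2 bookkeeping explicit:

    # 19 cut-offs x {with, without duplicates} x {max, min graduates}.
    configs <- expand.grid(cutoff     = seq(620, 800, by = 10),
                           duplicates = c(TRUE, FALSE),
                           graduates  = c("maximum", "minimum"))
    nrow(configs)  # 76 configurations, each fit with both approaches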
IV. RESULTS

Because of the large volume of similar results, we will primarily present the results (odds ratios of logistic regression and feature importance measures of the CIF) for the models including the potential duplicate applications. We discuss deviations from these results where relevant.
A. Key Findings
The statistical effects from the institutional features become more involved, both in the logistic regression models and the CIF models, as the P-GRE cut-off increases. In particular, applicants from well-funded institutions with large physics programs and high research activity are more likely to score above the cut-off. The logistic regression models and the CIF models identify several examples where the effect of an institutional feature is comparable to U-GPA or gender. While U-GPA and gender are integral components of every model (as expected), the race and ethnicity of applicants did not contribute as much to our models as anticipated based on the differences in scores between racial and ethnic groups found by Miller et al. [2]. Overall, the logistic regression approach and the machine learning approach typically agree on whether a feature has any relevance in the model. Having said that, the odds ratios typically identify a larger set of important features that, in addition, changes as the P-GRE cut-off increases.

When it comes to the maximum and minimum graduates models, the contributions from the AIP features are devalued in the minimum graduates models in favor of Barron's selectivity index and high research activity. Of the maximum models, the logistic regression models favor attending an institution with a large bachelor program over a large PhD program, while the opposite is true in the CIF models. Finally, the analysis as a whole is similar for the models with and the models without possible duplicate applications. Specifically, by removing the possible duplicates, some features become less significant in the logistic regression models and the performance of some CIF models is slightly reduced.
Figure 3. Summary of odds ratio significance levels of 19 independent models: Each P-GRE cut-off score (horizontal coordinate) denotes an independent maximum graduates logistic regression model that includes possible duplicates (see Sec. III C). In each model, the fields of the statistically significant features are marked with a symbol indicating the Bonferroni-corrected significance level: a circle indicates α = 0.05, a triangle indicates α = 0.01 and a diamond indicates α = 0.001. Blank fields indicate statistically insignificant results. The dotted separation lines categorize the features as applicant-related, institution-wide metrics, and concerning physics departments.
1. Significance analysis
Consider first the maximum number of graduates models. Figure 3 shows a diagram indicating how the set of significant features changes between the logistic regression models as the P-GRE cut-off score is increased from 620 to 800. The features are typically significant for every or almost every cut-off, for no or few cut-offs, or only for higher cut-offs. In the following we provide an overview of the applicant-related and institutional features that are statistically significant.

Of the applicant-related features, the odds ratios of U-GPA and gender are always statistically significant. However, contrary to expectations, odds ratios between applicants from different racial groups were only statistically significant in some cases. In particular, when compared to applicants identifying as white, the odds ratios for applicants identifying as Asian are significant for higher cut-offs, while the odds ratios for applicants identifying as Black, Latinx, Multi or Native are only significant for a few cut-offs. This is further discussed in Sec. V C.

When it comes to the institutional features, those statistically significant for a majority of the P-GRE cut-offs include attending a most competitive institution, an institution practicing some of the highest amounts of research activity, and an institution with a large physics bachelor program. Interestingly, attending a highly competitive institution is only significant for cut-offs between 640 and 690, while attending one of the most competitive institutions is significant for all cut-offs up to 760. Additionally, attending private universities or universities with large PhD programs becomes significant only for higher cut-offs. In contrast, attending an MSI, a most ACT-selective institution or graduating in a most selective undergraduate population profile is never significant, regardless of the cut-off.

In order to provide a rough overview of the difference between the maximum and minimum number of graduates models, Table II shows the fraction of P-GRE cut-offs for which each feature is significant. Note that the table also separates models with and without the possible duplicate applications. By removing the possible duplicates, the general significance of the features decreases. The change does not seem to originate in any particular feature as, with the exception of U-GPA, gender and most competitive, the fraction of significant cut-offs is reduced for all features. Compared to the maximum graduates models, the typical significance of attending a large bachelor program is considerably lower in the minimum graduates models.

Table II. Fraction of P-GRE cut-offs for which the odds ratio of each feature is statistically significant in the logistic regression models. There are 19 logistic regression models in each category (see Sec. III C). The first column (the maximum graduates models with possible duplicates) corresponds to the significance diagram (Fig. 3).

    Variable                       Maximum          Minimum
                                   With   Without   With   Without
    (Intercept)                    0.63   0.58      0.58   0.58
    Undergraduate GPA              1.00   1.00      1.00   1.00
    Gender                         1.00   1.00      1.00   1.00
    Combined B, L, M & N           0.26   0.00      0.21   0.00
    Asian                          0.53   0.37      0.47   0.42
    Highly competitive             0.26   0.26      0.63   0.26
    Most competitive               0.79   0.89      1.00   1.00
    Highest research activity      0.89   0.63      1.00   0.63
    Most selective UG p. profile   0.00   0.00      0.00   0.00
    Most ACT selective             0.00   0.00      0.00   0.00
    Privately Funded               0.42   0.21      0.32   0.21
    Minority Serving Institution   0.00   0.00      0.00   0.00
    Large bachelor program         1.00   0.74      0.37   0.21
    Large PhD program              0.32   0.05      0.32   0.11
Notably, the difference corresponds with an improvement in the fraction of significant cut-offs for attending a competitive school or an institution with high research activity, suggesting the variables may suffer from a confounding issue (see Sec. V B for a discussion).

Considerable changes in the set of significant features are only observed for large changes in the cut-off score. We therefore only discuss the odds ratios corresponding to cut-offs 650, 710 and 770, representing the lower, middle and higher regions, respectively.
2. Odds ratios
The odds ratios for the maximum and minimum number of graduates models are shown in Tables III (a) and (b), respectively. First and foremost, improving one's undergraduate GPA by one standard deviation, roughly equivalent to improving a B to a B+, improves the odds of scoring above the cut-off by at minimum a factor of 2.5 (increasing to ≈ 2.9 for higher cut-offs). This substantial increase in odds reflects the importance of U-GPA in admissions expressed by both admission committees and prospective students [3, 48]. Additionally, the odds of scoring above the cut-off is 1/0.17 ≈ 5.9 times greater for male applicants than for female applicants. The odds ratios of U-GPA and gender are consistent for all P-GRE cut-offs in both the maximum and minimum number of graduates models.

While the benefit of attending a competitive institution diminishes as the P-GRE cut-off increases from 650 to 710 and 770, attending one of the most competitive institutions is always preferable to a highly competitive institution. For cut-offs 650 and 710, the odds increase from attending a most competitive school is similar to the applicant increasing their U-GPA from a B to a B+. The model also finds institutional funding and high levels of research activity to be important factors. For high P-GRE cut-offs (e.g. 770), the odds of scoring above the cut-off is about 2 times as large for applicants who attended a private university compared to applicants who attended a public university. Similarly, for applicants attending a university that practices some of the highest levels of research activity, the odds ratio is roughly 1.6-2.0 depending on the cut-off.

Applicants from institutions with large physics programs typically also score higher. In the maximum number of graduates models, having attended a university with one of the largest undergraduate physics programs improves the odds of scoring above the P-GRE cut-off by a factor of about 1.7-2.0 (typically closer to 2.0). When the cut-off is high, a similar effect is seen for students attending a university offering a large graduate program (an odds ratio of about 1.6). In the minimum number of graduates models, the corresponding odds ratios are only statistically significant for the highest cut-offs. They are also typically smaller than the corresponding odds ratios in the maximum graduates models. The only statistically significant example in Table III (b) is the odds ratio for attending an institution with one of the largest PhD programs.

The remaining variables, i.e., most ACT-selective, most selective undergraduate population profile and MSI, contribute little to none.

Table III. Odds ratios for P-GRE cut-off scores 650, 710 and 770 of the logistic regression models with possible duplicates (see Sec. III C). The maximum and minimum graduates models are separated in Tables (a) and (b) respectively. Statistically significant odds ratios are marked with asterisks (see below (b)). Note that p̃ = 14p refers to the Bonferroni-corrected p-values.

(a) Maximum bachelor/PhD graduates models

    Variable                                  cut-off 650              cut-off 710              cut-off 770
                                              OR    95% CI             OR    95% CI             OR    95% CI
    (Intercept)                               1.72  [1.09, 2.72]**     0.81  [0.53, 1.25]       0.29  [0.18, 0.47]***
    Undergraduate GPA                         2.53  [2.12, 3.03]***    2.63  [2.22, 3.11]***    2.87  [2.40, 3.42]***
    Gender                                    0.16  [0.10, 0.24]***    0.18  [0.12, 0.27]***    0.17  [0.11, 0.25]***
    Asian                                     1.37  [0.73, 2.54]       1.61  [0.95, 2.71]       2.20  [1.34, 3.59]***
    Combined B, L, M & N                      0.68  [0.41, 1.11]       0.65  [0.42, 1.00]       0.76  [0.49, 1.19]
    Highly competitive                        1.98  [1.10, 3.58]*      1.55  [0.95, 2.52]       1.22  [0.76, 1.93]
    Most competitive                          3.40  [1.76, 6.56]***    2.79  [1.53, 5.07]***    1.62  [0.97, 2.73]
    Doc. inst. w/ highest research activity   1.68  [1.00, 2.81]*      1.94  [1.23, 3.04]***    1.63  [1.06, 2.51]*
    Most selective UG population profile      1.72  [0.68, 4.37]       1.85  [0.73, 4.68]       1.75  [0.71, 4.32]
    Most ACT selective                        0.66  [0.26, 1.69]       0.49  [0.19, 1.27]       0.66  [0.25, 1.75]
    Privately funded                          0.96  [0.57, 1.62]       1.42  [0.91, 2.21]       2.07  [1.38, 3.11]***
    Minority Serving Institution              1.17  [0.62, 2.21]       1.05  [0.60, 1.82]       1.13  [0.64, 1.99]
    Large bachelor program                    1.96  [1.23, 3.12]***    1.74  [1.15, 2.65]**     1.94  [1.28, 2.92]***
    Large PhD program                         1.28  [0.72, 2.25]       1.31  [0.83, 2.08]       1.62  [1.07, 2.45]**

(b) Minimum bachelor/PhD graduates models

    Variable                                  cut-off 650              cut-off 710              cut-off 770
                                              OR    95% CI             OR    95% CI             OR    95% CI
    (Intercept)                               2.03  [1.31, 3.15]***    0.94  [0.61, 1.45]       0.34  [0.22, 0.53]***
    Undergraduate GPA                         2.50  [2.08, 2.99]***    2.59  [2.19, 3.08]***    2.82  [2.35, 3.39]***
    Gender                                    0.17  [0.11, 0.25]***    0.19  [0.13, 0.28]***    0.17  [0.12, 0.26]***
    Asian                                     1.41  [0.72, 2.75]       1.63  [0.96, 2.77]       2.19  [1.35, 3.56]***
    Combined B, L, M & N                      0.66  [0.41, 1.07]       0.65  [0.42, 0.99]*      0.77  [0.50, 1.17]
    Highly competitive                        2.21  [1.25, 3.93]***    1.68  [1.00, 2.80]*      1.35  [0.83, 2.18]
    Most competitive                          3.69  [2.07, 6.55]***    3.09  [1.89, 5.05]***    1.87  [1.17, 3.00]**
    Doc. inst. w/ highest research activity   1.72  [1.02, 2.92]*      2.10  [1.33, 3.33]***    1.83  [1.20, 2.79]***
    Most selective UG population profile      1.56  [0.60, 4.05]       1.72  [0.62, 4.76]       1.59  [0.62, 4.07]
    Most ACT selective                        0.75  [0.28, 2.03]       0.56  [0.19, 1.67]       0.74  [0.28, 1.99]
    Privately funded                          0.88  [0.53, 1.45]       1.27  [0.81, 1.98]       1.88  [1.27, 2.79]***
    Minority Serving Institution              1.11  [0.60, 2.06]       1.00  [0.58, 1.74]       1.11  [0.64, 1.93]
    Large bachelor program                    1.44  [0.91, 2.29]       1.28  [0.84, 1.94]       1.47  [0.99, 2.20]
    Large PhD program                         1.43  [0.83, 2.45]       1.34  [0.86, 2.08]       1.54  [1.05, 2.26]*

    * : p̃ ≤ 0.05    ** : p̃ ≤ 0.01    *** : p̃ ≤ 0.001

C. Conditional Inference Forest

The general performance of the CIF models is shown in Figure 4. Alongside the accuracy score is the class imbalance, which provides the baseline from which the accuracy score is interpreted. Because the imbalance is considerably high for lower cut-offs, the accuracy score is more representative of the CIF's ability to identify applicants scoring above the cut-off when the cut-off is higher (as the imbalance decreases with increasing cut-off score, the accuracy becomes increasingly more representative). However, because the imbalance level is outside the standard errors of the K-fold estimate, it is reasonable to conclude that the CIF is not simply predicting the majority class. Additionally, the AUC score is mostly outstanding (>0.9) and, more importantly, very stable with respect to changes in the P-GRE cut-off. The stability of the AUC coupled with the high score suggests that the results of the model may be reasonably interpreted, that is, that the feature importances provide a reasonable picture of the relationship between the features and the output for all P-GRE cut-offs. Because of the similarity in performance between the maximum and minimum graduates CIF models, we present only the maximum graduates models going forward.

Figure 4. Overall performance of the conditional inference forest. The standard errors of the K-fold (K = 10) estimates are indicated by the error bars. While the ABOVE class imbalance is very high for lower cut-offs, the accuracy standard errors are always above the imbalance level. The AUC score is mostly above 0.9, which Hosmer et al. categorize as "outstanding" [29].

Figure 5 graphs the change in the importance measure of the features as the cut-off increases. The plot shows evidence of distinct groups of features with similar importances. The first group consists only of undergraduate GPA, whose importance measure is about 2 times higher than that of any other feature. The next group consists of gender and no. PhD graduates, which stand out when compared to the remaining group of the least important features (see Sec. III B 2 for how to interpret the importance measure). With the exception of some minor variation, the importance measure of U-GPA is fairly stable across all models. Notably however, while the importance measure decreases for gender as the cut-off increases, it simultaneously increases for no. PhD graduates. Hence, for higher cut-offs, the model finds a greater statistical difference between applicants scoring above and below the cut-off when given the no. PhD graduates compared to an applicant's gender. Proportional to their own importance measures, several features in the remaining group undergo large changes in importance measures. However, because these variations are small when compared to U-GPA, gender and the no. PhD graduates, they should not be overemphasized.

Figure 5. Conditional inference forest feature importance measures. The standard errors of the K-fold (K = 10) estimates are indicated by the error bars. Features included in the CIF are listed in the legend in decreasing order of average (across all cut-offs) decrease in AUC upon removal.

The results of the feature elimination procedure are shown in Figure 6. The diagram is arranged such that the features are removed left to right, starting from a complete model and ending with a model that only includes U-GPA (i.e., the named feature at a given horizontal coordinate is the least important feature at that stage and the next to be removed). Throughout most of the elimination procedure, the performance remains at ≈ 0.9 on the AUC metric and roughly between 80% and 90% on the accuracy metric.

Figure 6. Conditional inference forest feature elimination procedure. The standard errors of the K-fold (K = 10) estimates are indicated by the error bars. The features are eliminated from left to right, where the named feature is currently the least important feature, and thus the next to be dropped from the model.

Figure 3 shows that the set of statistically significant features in the logistic regression models changes as the P-GRE cut-off score increases (e.g. whether the institution is privately funded is only significant for higher cut-offs). A similar change is not present in the importance measures of the CIF models (Fig. 5), which, in contrast with the odds ratios, preserve the feature groups described above. In particular, the three features U-GPA, gender and number of PhD graduates are the most important features for every cut-off score. Because the importance measures of the remaining features are consistently lower by a considerable margin for all cut-off scores, the set of important features in the CIF models is very robust towards changes in the cut-off score.

As a final check for whether the added performance can be attributed to including the institutional features, the performance of the full CIF is compared to a CIF excluding all institutional features, and a CIF including the number of PhD graduates and the Carnegie classification. The results of the comparison are summarized in Figure 7: The addition of only two institutional features makes a considerable improvement for both metrics, regardless of the cut-off. Hence, the added performance is reasonably attributed to the inclusion of institutional features.

Figure 7. Performance comparison between conditional inference forests with all features, with 2 institutional features, and without institutional features (only U-GPA, gender and race). The standard errors of the K-fold (K = 10) estimates are indicated by the error bars. The significant improvement in performance by the simple addition of 2 institutional features suggests that the contribution from the institutional features is captured by a few features.

V. DISCUSSION

A. Research Questions
This study investigated four research questions (RQs) that we address in order.

1. To what extent does the applicant's undergraduate institution influence whether they are able to attain a minimum P-GRE score expected by an admissions committee?
2. To what degree do the institutional effects compare to known effects such as U-GPA, gender and race?
3. How do the results depend on the specific cut-off chosen by the admissions office?
4. How well do the conventional and machine learning approaches agree on RQs 1, 2 and 3?

Regarding RQ 1, the institutional background helps explain whether a student scores above a given P-GRE cut-off. Consider a cut-off score of 710, which is just above the most common cut-off score of 700. In the logistic regression models (see Table III), applicants from competitive institutions with large physics programs, practicing high levels of research, are statistically more likely to score above the cut-off than other applicants. Similarly, the size of physics programs (number of graduates) and the institution-wide Carnegie classification are integral components of the predictive capacity of the CIF models (see Fig. 6). Hence, the models suggest that employing a cut-off score of 710 not only limits access for racial and ethnic minorities [2], but also for applicants from smaller, less competitive universities with fewer resources that practice lower (not necessarily among the lowest) levels of research. Similar observations are found for every other cut-off in the CIF models. In the case of the logistic regression models, the set of statistically significant institutional features varies depending on the cut-off, but the overall interpretation is similar: Including institutional data in the analysis certainly helps explain whether a student scores above the cut-off, regardless of the chosen cut-off.

Now, is it necessary to include a complete description of an applicant's undergraduate background? Figure 6 suggests that this is probably not the case, as a large portion of the institutional data does not contribute to the models. Moreover, because the performance of the CIF does not decrease as the Carnegie classification is removed, there is also reason to suspect that the institutional features may share information. The independence of the features is discussed in more detail in Sec. V B.

The modelling and machine learning approaches disagree somewhat with respect to RQ 2. In the logistic regression models, the odds ratio for U-GPA is comparable to admission competitiveness (roughly 2-3), while the odds ratio for gender is just shy of 6.0. In contrast, U-GPA is by far the most important feature in all CIF models. Meanwhile, the feature importance measure of gender is similar to the number of PhD graduates, particularly for higher cut-offs. Because neither approach placed as much emphasis on race and ethnicity, it is unreasonable to judge the overall effect of institutional data by comparing it to the effects of race and ethnicity in the models. Despite disagreeing on some of the finer details, both approaches find examples where the effects from institutional data, e.g. admission competitiveness and the size of physics departments, are comparable to U-GPA and gender. The most clear-cut example is shown in Figure 7, which demonstrates that replacing a CIF model without institutional features with a similar CIF model that includes the Carnegie classification and number of PhD graduates provides a blanket improvement in the accuracy and AUC scores for every P-GRE cut-off.

Finally, we address RQs 3 and 4 together. First and foremost, both approaches have identified statistically significant differences in the P-GRE scores of applicants with different institutional backgrounds. Having said that, the specifics regarding the statistical difference and the extent to which it is explained by different institutional backgrounds depend on the model and cut-off in question. For instance, the significance levels of odds ratios vary to such an extent that some features are only relevant for a select few cut-offs (e.g. private/public institution for higher cut-offs). The importance measures of the CIF models are much more stable across cut-offs, but lack the interpretability of the odds ratios. Nevertheless, while the set of useful features changes with the cut-off, institutional features always contribute to the analysis. Here, logistic regression disagrees with the CIF on the set of useful features and their importance to the model, but both recognize useful institutional features for every cut-off score.

B. Limitations
B. Limitations

Central to this study is the question of whether the institutional background of an applicant can be reliably measured, or estimated, with the available data. Here, "institutional background" is used in an extended sense that includes the applicant's experiences in relation to attending a particular institution. Our data certainly do not allow for quantifying the effects of such experiences as studying in an encouraging environment or at an institution with a large array of opportunities. However, data such as the Carnegie features and the numbers of graduating bachelor's and PhD students likely capture some aggregate effect of studying at different types of institutions. In addition, these features were found to be important in our models, suggesting that there is a statistical difference between the applicants that depends on their institutions.

Because the universities considered in this study are typically highly regarded, the data likely suffer from a selection bias, favoring prospective students with higher grades and GRE scores. In a 2018 survey of prospective students from racial and ethnic minorities, Cochran et al. identified concerns regarding GRE scores and undergraduate GPA as commonly expressed barriers to applying to physics graduate programs [16]. Indeed, this is reflected in the P-GRE distribution of the applicants in our data set: Figure 1 shows that the applicants consistently score as high as or higher than the national averages, implying that our data set consists of a biased selection of all prospective students (the data set comprises, at an upper limit, only a fraction of all P-GRE test-takers in 2017-18 [22]). Because of this selection bias, the distributions of the other features in our data set are likely also biased. Most prominently, the selection bias will disproportionately affect women and racial and ethnic minorities [4, 10]. The problem of selection bias and its consequences for physics education research as a whole was recently discussed by Kanim and Cid [49]. Our findings should thus be considered in light of our biased sample and their discussion.

A related, but different, issue is that applicants are more likely to have attended large programs by virtue of there being more prospective students from larger programs than from smaller programs. This can be seen in our data from the median number of bachelor's graduates: whereas the national median was 8 in both 2017 and 2018 [25, 26], the median in our data is 27 (2017) and 30 (2018), i.e., more than three times as high. Consequently, our data contain a larger fraction of applicants from large programs than is typical, and thus the distributions of all the features in our data are likely primarily determined by applicants from larger programs. This also contributes to the selection bias discussed above.

Another methodological concern is whether the different institutional variables attempt to describe the same effect, implying a possible problem of correlation, or even multicollinearity, between the features. The numbers of bachelor's and PhD graduates are particularly sensitive to this issue, as they both represent a measure of the size of physics departments. Indeed, the features share a positive correlation of roughly 0.7. Both approaches present evidence in favor of there being some degree of relationship between the features.
For instance, when comparing the minimum- and maximum-graduates logistic regression models, Table II shows that the difference in the fraction of P-GRE cut-offs for which the size of the bachelor's and PhD programs is significant is similar to the corresponding difference for attending a competitive school or an institution with high research activity. As it is not uncommon for institutions with larger programs to be more competitive or to practice higher levels of research, we suspect that some statistical relationship between these features is likely. A more direct example is seen in Figure 6, where the removal of the Carnegie classification during the feature elimination procedure does not deteriorate the performance by any measurable amount. This indifference suggests that the information contained in the Carnegie classification, which is known to be considerable due to Carnegie's high importance measure (see Figure 5), is also contained within the remaining set of features (U-GPA, gender and number of PhD graduates). As a final example, the performance comparison (Figure 7) shows that most of the overall effect of the institutional influence can be described by a limited selection of institutional features.
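The collinearity concern is also straightforward to check directly. The sketch below computes the pairwise correlation and variance inflation factors (VIFs) for the two department-size features; the column names are hypothetical, and `df` stands for the application data.

```python
# Sketch: quantify the overlap between the two department-size features.
# Column names are hypothetical stand-ins for the study's variables.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def size_collinearity(df: pd.DataFrame) -> None:
    size = df[["n_bachelor_grads", "n_phd_grads"]]
    print(size.corr())                     # off-diagonal is roughly 0.7 here
    X = sm.add_constant(size)              # VIF expects an intercept column
    for i, col in enumerate(X.columns):
        if col != "const":
            print(col, variance_inflation_factor(X.values, i))
```

With only two predictors and a correlation of roughly 0.7, the VIF is 1/(1 - 0.7^2) ≈ 2, i.e., a noticeable but not pathological overlap, consistent with the discussion above.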
C. Data processing and modelling choices

A major difficulty for the logistic regression approach is the need for data processing, especially in the context of losing information through unfortunate modelling choices. The most prominent example in this study is the combination of racial and ethnic groups into a single underrepresented minority group. As suggested by Figure 2, the lack of importance of the race features in the logistic regression model may actually be a case of Simpson's paradox (information loss due to combining data [50]). That is, because the combined P-GRE distribution of URM applicants resembles the P-GRE distribution of white applicants (see Figure 2), and because the race feature was one-hot encoded using "white" as the reference level, the difference between the distributions is not large enough to be statistically significant. In comparison, the distribution is much more skewed for Asian applicants, and thus the difference becomes statistically significant for higher cut-offs. Other examples include the Carnegie classification and the undergraduate population profile, which were essentially reduced from multi-level categorizations to simple binaries. Estimating the amount of meaningful information lost for these features is particularly complicated because of the high number of low-frequency categories.

Compared to the logistic regression approach, the CIF avoids the data processing issues described above. When processing categorical features for inferential modelling, the features must remain interpretable. However, because the CIF does not require the combination of categorical levels to be meaningful, a tree node can find the optimal grouping of categories without regard to interpretation. Indeed, the construction of the CIF algorithm allows it to handle unprocessed data naturally, without suffering the same issues as logistic regression (and other machine learning methods that require preprocessing the data). As a result, the CIF is able to identify statistical properties much more easily than logistic regression. An example of this effect is seen in Barron's selectivity index: whereas the odds ratios decrease and become less significant as the P-GRE cut-off increases (Table III), the feature importance is relatively stable with respect to changes in the cut-off (Figure 5).

Furthermore, compared to the odds ratios of logistic regression, the importance measures of the CIF are more informative and provide a clearer picture. The framework of logistic regression assumes that every feature is a distinct component of the response (Eq. (2)). In contrast, a tree in the CIF will only include a feature if it is found to be important enough (see Sec. III B 2). Hence, if a particular feature is always less important than the other features in every tree (recall that each tree is built on a subset of the features), then its importance measure will be 0. No similar mechanism is present in the logistic regression framework, which will always try to interpret every feature as an integral component of the model. Accordingly, the importance measures more accurately reflect the degree to which the features are associated with the response. Indeed, note that the set of features essential to the model is always larger in the logistic regression models and, in addition, changes as the P-GRE cut-off increases. For example, the odds ratio for attending a privately funded institution is only statistically significant for the higher cut-off scores (Figure 3).
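This zero-importance mechanism can be demonstrated on synthetic data. The sketch below is illustrative only: it uses scikit-learn's random forest and permutation importance as stand-ins for the CIF's conditional importance measure, and has no connection to the study's data.

```python
# Illustration on synthetic data: a feature unrelated to the response gets
# (near-)zero permutation importance in a forest, while logistic regression
# still estimates a coefficient for it.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
informative = rng.normal(size=n)            # drives the response
noise = rng.normal(size=n)                  # unrelated to the response
X = np.column_stack([informative, noise])
y = (informative + 0.5 * rng.normal(size=n) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
imp = permutation_importance(forest, X, y, n_repeats=20, random_state=0)
print(imp.importances_mean)                 # second entry is approximately 0

logit = LogisticRegression().fit(X, y)
print(logit.coef_)                          # both features get coefficients
```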
By relaxing the necessary assumptions of the logistic regression framework, we obtain a more effective tool for identifying the relationship between the features and the response, albeit one that is harder to interpret.

The effects of unfortunate modelling choices in logistic regression models depend, in the end, on the data. In our case, combining racial and ethnic minorities into a single underrepresented minority category has likely influenced how racial and ethnic information is treated by the model. Similarly, the significance of other processed features may also have been diminished. That being said, we have conducted two very different analyses (inferential vs. predictive modelling) and found similar results. It is therefore unlikely that the choices unique to each approach have affected the overall results of the analysis.
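As a final illustration of the encoding issue, the snippet below contrasts a full one-hot encoding (with "white" as the reference level, as in our models) against a collapsed URM indicator. The category labels are illustrative, not the study's exact groupings.

```python
# Illustration of the encoding choice: a full one-hot encoding with "white"
# as the reference level versus a collapsed URM indicator. Labels are
# illustrative, not the study's exact category list.
import pandas as pd

race = pd.Series(["white", "black", "hispanic", "asian", "white"],
                 name="race")

# Full encoding: every non-reference category keeps its own indicator.
full = pd.get_dummies(race, prefix="race").drop(columns="race_white")

# Collapsed encoding: distinct groups are merged into one indicator, which
# can mask differences between the merged distributions (Simpson's paradox).
urm = race.isin(["black", "hispanic"]).astype(int).rename("urm")

print(full.astype(int))
print(urm)
```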
D. Future work

The present study has looked into how the undergraduate institutions of applicants may influence the physics graduate admissions process by studying their statistical relationship with P-GRE cut-off scores. Lacking from this analysis is an understanding of whether institutional influence may exert its primary effect at a different stage in the admissions process. For example, it is known that a number of bachelor's students who are interested in further studies eventually decide not to apply [16]. While some cases arise due to personal or financial concerns, some students may not have received the preparation or encouragement necessary to motivate further studies. If such motivation plays a significant role for students unsure of whether to pursue a career in physics, then one would expect prospective students from institutions with PhD programs to be more likely to apply to graduate programs. Additionally, it is worth considering whether these prospective students are more likely to apply to any graduate program in general, or simply to the program at their undergraduate institution.
VI. CONCLUSION
The present work has studied the effects of institutional influence on graduate program admissions by modelling a hard physics GRE cut-off score with application data from five Midwestern universities. For completeness, all possible cut-off scores between 620 and 800 (32nd and 67th percentile) have been analyzed, although most admissions employ a cut-off of 700. The analysis has been conducted using both inferential and predictive modelling, based on logistic regression and the conditional inference forest algorithm, respectively. Both approaches identify the known effects of undergraduate GPA and gender, but do not emphasize a statistical difference between applicants from different racial and ethnic minorities, as would be expected from earlier work [2]. However, this apparent contradiction with past work can likely be understood as a combination of a Simpson's paradox and selection bias among the applicants. Both approaches identified cases where the impact of institutional features was comparable to the known effects of undergraduate GPA and gender. Overall, the two approaches agree on the analysis as a whole, but disagree on the result of increasing the P-GRE cut-off. In terms of the odds ratios, increasing the cut-off places more significance on institutional features associated with competitive schools, private funding, large physics programs and high research activity. On the other hand, the added performance when including institutional features can be attributed to a small number of features.

In conclusion, when analyzing graduate program applications, we recommend including information regarding the applicants' bachelor's institutions. Moreover, due to the innate flexibility and precision of the conditional inference forest algorithm, combined with the large variety of data structures seen in application data, we also recommend the forest algorithm, as well as the predictive analysis approach in general. Based on these findings, and on the known problems of cut-off scores limiting access for underrepresented racial and ethnic minorities, we advocate against the practice of using GRE cut-off scores in admissions.
ACKNOWLEDGMENTS
This project was supported by the Michigan State University College of Natural Sciences, the Lappan-Phillips Foundation, and the Norwegian Agency for Quality Assurance in Education (NOKUT), which supports the Center for Computing in Science Education. This project has also received support from the INTPART project of the Research Council of Norway (Grant No. 288125) and the Thon Foundation.

[1] C. Miller and K. Stassun, Nature 510, 303 (2014).
[2] C. W. Miller, B. M. Zwickl, J. R. Posselt, R. T. Silvestrini, and T. Hodapp, Science Advances 5, eaat7550 (2019).
[3] G. Potvin, D. Chari, and T. Hodapp, Physical Review Physics Education Research 13, 020142 (2017).
[4] A. M. Porter and R. Ivie, Women in Physics and Astronomy, 2019, Tech. Rep. (American Institute of Physics, 2019).
[5] L. Merner and J. Tyler, African American, Hispanic, and Native American Women among Bachelors in Physical Sciences & Engineering, Tech. Rep. (2017).
[6] N. T. Young and M. D. Caballero, arXiv:2008.10712 [physics] (2020).
[7] J. W. Halley, A. Adjoudani, P. Heller, and J. S. Terwilliger, American Journal of Physics 59, 403 (1991).
[8] N. T. Young and M. D. Caballero, arXiv:1907.01570 [physics] (2019).
[9] C. Zabriskie, J. Yang, S. DeVore, and J. Stewart, Physical Review Physics Education Research 15, 020120 (2019).
[10] R. Ivie, "Beyond Representation: Data to Improve the Situation of Women and Minorities in Physics and Astronomy" (2018).
[11] L. M. Aycock, Z. Hazari, E. Brewe, K. B. Clancy, T. Hodapp, and R. M. Goertzen, Physical Review Physics Education Research 15, 010121 (2019).
[12] K. Rosa and F. M. Mensah, Physical Review Physics Education Research 12, 020113 (2016).
[13] S. Hyater-Adams, C. Fracchiolla, N. Finkelstein, and K. Hinko, Physical Review Physics Education Research 14, 010132 (2018).
[14] J. R. Posselt, Inside Graduate Admissions (Harvard University Press, 2016).
[15] Educational Testing Service, "Guide to the Use of Scores" (2019).
[16] G. L. Cochran, T. Hodapp, and E. E. A. Brown, in Physics Education Research Conference Proceedings, PER Conference (Cincinnati, OH, 2018) pp. 92-95.
[17] E. M. Levesque, R. Bezanson, and G. R. Tremblay, arXiv:1512.03709 [astro-ph, physics:physics] (2015).
[18] National Science Foundation, "Frequently Asked Questions (FAQs) for NSF 20-587, Applicants for FY2021 Graduate Research Fellowship Program (GRFP)" (2020).
[19] G. Attiyeh and R. Attiyeh, The Journal of Human Resources, 524 (1997).
[20] J. R. Posselt, T. E. Hernandez, G. L. Cochran, and C. W. Miller, Journal of Women and Minorities in Science and Engineering, 283 (2019).
[21] P. J. Mulvey and S. Nicholson, Physics Bachelor's Degrees, Tech. Rep. (American Institute of Physics, 2015).
[22] Educational Testing Service, "GRE® Subject Test Interpretative Data" (2019).
[23] Center for Postsecondary Research, The Carnegie Classification of Institutions of Higher Education (Indiana University Bloomington, Bloomington, IN, 2016).
[24] Barron's Educational Series, Inc., College Division, Barron's Profiles of American Colleges (Barron's).
[25] S. Nicholson and P. J. Mulvey, Roster of Physics Departments with Enrollment and Degree Data, 2017, Tech. Rep. (American Institute of Physics, 2018).
[26] S. Nicholson and P. J. Mulvey, Roster of Physics Departments with Enrollment and Degree Data, 2018, Tech. Rep. (American Institute of Physics, 2019).
[27] A. L. Traxler, X. C. Cid, J. Blue, and R. Barthelemy, Physical Review Physics Education Research 12, 020114 (2016).
[28] U.S. Department of Education, "Lists of Postsecondary Minority Institutions."
[29] D. W. Hosmer and S. Lemeshow, Applied Logistic Regression, 2nd ed. (John Wiley & Sons, Inc., 2000).
[30] R. Teranishi, New Directions for Institutional Research, 37 (2007).
[31] J. Friedman, T. Hastie, and R. Tibshirani, Journal of Statistical Software 33 (2010), 10.18637/jss.v033.i01.
[32] J. Nissen, R. Donatello, and B. Van Dusen, Physical Review Physics Education Research 15, 020106 (2019), 10.1103/PhysRevPhysEducRes.15.020106.
[33] S. van Buuren and C. Groothuis-Oudshoorn, Journal of Statistical Software 45 (2011), 10.18637/jss.v045.i03.
[34] D. B. Rubin, Multiple Imputation for Nonresponse in Surveys (John Wiley & Sons, Inc., 1987).
[35] P. T. von Hippel, Sociological Methodology, 265 (2009).
[36] K. G. M. Moons, R. A. R. T. Donders, T. Stijnen, and F. E. Harrell, Journal of Clinical Epidemiology, 1092 (2006).
[37] C. X. Ling, J. Huang, and H. Zhang, in Advances in Artificial Intelligence, Lecture Notes in Computer Science, edited by Y. Xiang and B. Chaib-draa (Springer, Berlin, Heidelberg, 2003) pp. 329-341.
[38] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. (Springer, 2017).
[39] L. Breiman, Machine Learning 45, 5 (2001).
[40] T. Hothorn, K. Hornik, and A. Zeileis, Journal of Computational and Graphical Statistics 15, 651 (2006).
[41] T. Hothorn, P. Bühlmann, S. Dudoit, A. Molinaro, and M. van der Laan, Biostatistics 7, 355 (2006).
[42] C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn, BMC Bioinformatics 8, 25 (2007).
[43] C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis, BMC Bioinformatics 9, 307 (2008).
[44] V. Svetnik, A. Liaw, C. Tong, J. C. Culberson, R. P. Sheridan, and B. P. Feuston, Journal of Chemical Information and Computer Sciences 43, 1947 (2003).
[45] S. Janitza, C. Strobl, and A.-L. Boulesteix, BMC Bioinformatics 14, 119 (2013).
[46] L. Auret and C. Aldrich, Minerals Engineering, 27 (2012).
[47] V. Svetnik, A. Liaw, C. Tong, and T. Wang, in Multiple Classifier Systems, Lecture Notes in Computer Science, edited by F. Roli, J. Kittler, and T. Windeatt (Springer, Berlin, Heidelberg, 2004) pp. 334-343.
[48] D. Chari and G. Potvin, Physical Review Physics Education Research 15, 023101 (2019).
[49] S. Kanim and X. Cid, Physical Review Physics Education Research 16, 020106 (2020).
[50] E. H. Simpson, Journal of the Royal Statistical Society. Series B (Methodological) 13, 238 (1951).