Investigating institutional influence on graduate program admissions by modelling physics GRE cut-off scores
Nils J. Mikkelsen,¹ Nicholas T. Young,²,³ and Marcos D. Caballero¹,²,³,⁴,∗

¹Center for Computing in Science Education & Department of Physics, University of Oslo, N-0316 Oslo, Norway
²Department of Physics and Astronomy, Michigan State University, East Lansing, Michigan 48824
³Department of Computational Mathematics, Science, and Engineering, Michigan State University, East Lansing, Michigan 48824
⁴CREATE for STEM Institute, Michigan State University, East Lansing, Michigan 48824
∗Corresponding Author: [email protected]

(Dated: September 30, 2020)

Despite limiting access to applicants from underrepresented racial and ethnic groups, the practice of using hard or soft GRE cut-off scores in physics graduate program admissions is still a popular method for reducing the pool of applicants. The present study considers whether the undergraduate institutions of applicants have any influence on the admissions process by modelling a physics GRE cut-off score with application data from admissions offices of five universities. Two distinct approaches based on inferential and predictive modelling are conducted. While there is some disagreement regarding the relative importance between features, the two approaches largely agree that including institutional information significantly aids the analysis. Both models identify cases where the institutional effects are comparable to factors of known importance such as gender and undergraduate GPA. As the results are stable across many cut-off scores, we advocate against the practice of employing physics GRE cut-off scores in admissions.
Keywords: Physics graduate admissions, physics GRE, institutional influence, logistic regression modelling, supervised machine learning.
I. INTRODUCTION
While recent studies have called into question the over-reliance on Graduate Record Examination (GRE) scores in physics graduate admissions [1, 2], filtering applicants based on a strict or effective minimum score is still a popular practice today [3]. Given the role of the GRE in admissions, understanding the factors influencing GRE scores may provide insight into how, when compared to other science, technology, engineering and mathematics (STEM) disciplines, the physics graduate admissions process has failed to improve gender, racial, and ethnic diversity by systematically excluding these applicants [4, 5]. A number of studies have investigated correlations between GRE scores and demographics [1, 2], but little attention has been given to the institutional backgrounds of applicants. An applicant's undergraduate background could play a significant role in their graduate application [6]. Institutions that themselves offer a PhD program would likely place more emphasis on both preparing and motivating undergraduate students for further studies. Larger physics departments with more resources are able to offer students more advanced coursework and hands-on experimental work as well as provide a larger variety of staff expertise. Larger undergraduate programs can facilitate network-building, both between students and faculty members, and collaboration via projects and study groups. Although attributes such as motivation and opportunity cannot be appropriately measured, their effects on the GRE can be linked to metrics such as the size and type of institutions, as was done in Halley et al. [7]. In order to estimate these institutional effects, we have analyzed the Physics GRE Subject Test (P-GRE) scores of graduate program applications from four public universities and one private university.

The applications include a variety of information, but the present study will focus on numerical and categorical data, all of which constitutes a mixture of data structures. A number of recent studies working with similar data have approached the problem using machine learning methods [8, 9]. Many machine learning methods lend themselves to problems with mixed data, albeit they do not share the interpretability of more conventional modelling methods. The present study will employ both approaches, comparing and contrasting the results.

The aim of this study is to continue the discussion on the practice of employing formal or informal P-GRE score cut-offs in graduate admissions using a combination of modelling and machine learning methods. The idea is to analyze the P-GRE scores of PhD program applicants with respect to applicants' undergraduate Grade Point Average (U-GPA), demographics and institutional background. Our guiding research questions (RQs) are as follows.

1. To what extent does the applicant's undergraduate institution influence whether they are able to attain a minimum P-GRE score expected by an admissions committee?
2. To what degree do the institutional effects compare to known effects such as U-GPA, gender and race?
3. How do the results depend on the specific cut-off chosen by the admissions office?
4. How well do the conventional and machine learning approaches agree on RQs 1, 2 and 3?
II. BACKGROUND
Following the calls for increasing diversity in STEM disciplines, there has been a steady growth in the representation of women and ethnic/racial minorities over the past couple of decades [10]. Despite the progress, however, physics has seen particularly poor development in comparison. Since the late 1990s, the percentage of bachelor and PhD degrees awarded to women in physics has stagnated at about 20%, mirroring similar numbers in engineering and computer science [4]. The numbers are even more concerning for racial minorities, who during the three-year period 2014-2016 earned 11% of bachelor degrees and only 7% of PhD degrees [10]. The discrepancy in female, racial and ethnic representation likely stems from a variety of factors involving admission and retention issues, many of which are rooted in cultural and structural problems including sexual harassment and systemic racism [11–13].

In her extensive review of the general practices of graduate program admissions, Inside Graduate Admissions (2016) [14], Posselt notes that most admissions committees (in the natural sciences as well as in the humanities and social sciences) measured students' merit primarily on the basis of their undergraduate GPA (U-GPA) and GRE scores alone. Indeed, Young and Caballero were able to predict the admittance of prospective physics PhD students with 75% accuracy using machine learning methods based only on their U-GPA and P-GRE score [8]. The GRE test makers, Educational Testing Service (ETS), recommend against the use of GRE scores as the sole basis for admissions decisions, particularly cautioning against the practice of filtering applicants based on a minimum cut-off score [15]. Despite this, Potvin et al. found that 32% of physics graduate programs state they filter applicants with a minimum P-GRE score [3]. Furthermore, of the programs that say they do not filter applicants, several reported using a "rough cut-off" or wanting a "preferable score", suggesting that more than 32% of programs filter applicants in practice.

As highlighted by Miller and Stassun in 2014 [1], on average, women score 80 pts lower than men on the GRE in the physical sciences, while Black test-takers score 200 pts lower than white test-takers. The authors further note that the practice of filtering prospective students with a minimum score, which is in violation of ETS's own guidelines, thus "adversely affects women and minority applicants". In addition to limiting access for minority applicants during the application process, the GRE also acts as a barrier to apply. In a survey of prospective students from underrepresented racial and ethnic groups who were interested in pursuing a PhD in physics but ultimately chose not to apply, Cochran et al. note that the GRE was the "most common theme" expressed by students as a barrier to apply [16].

In spite of its established popularity in admissions, the GRE's ability to identify promising students has recently been called into question. One study found that while requiring a minimum P-GRE score limits access for physics graduate program applicants from minority groups, GRE scores were incapable of predicting PhD completion [2]. In a 2015 survey of prize-winning postdoctoral fellows in astronomy [17], Levesque et al. found that the P-GRE scores of fellows did not adhere to any minimum percentile score, suggesting that the GRE is also a poor estimator of future research excellence. The authors further point out that a minimum percentile score of 60% would have eliminated 44% of participants, including 60% of female fellows. The inability of the GRE to identify promising students has also been noticed by other groups such as the National Science Foundation, which recently decided to drop the GRE from the application to their Graduate Research Fellowship Program (see FAQ no. 52 [18]).

Prior work has typically focused on admissions committees' over-reliance on the GRE and the consequences of using cut-off scores in graduate admissions [1, 2, 19, 20]. Missing from the conversation is an understanding of what institutional factors, which come into play during applicants' undergraduate study (or even earlier), may influence GRE scores. In a 1991 study, Halley et al. investigated how the topics covered by the P-GRE compared with the physics major curriculum by analyzing the P-GRE scores of students from different institution types [7]. The authors noted that the portion of correct answers was higher for students from "top" institutions, and highest for students from "top" institutions with graduate programs. However, this study is both nearly 30 years old and worked with an imbalanced sample (701 test-takers in total, 21 of which attended a top undergraduate institution). Since then, the GRE has evolved and the number of physics degrees awarded annually has almost doubled [21]. Nowadays, the GRE does not penalize incorrect answers, i.e., guessing, which has likely changed the way students approach the test. To our knowledge, there has not yet been a modern study analyzing how institutional factors may affect GRE scores.
III. METHODS
The target of this investigation is to explain whether a student scores above or below a P-GRE cut-off score selected by an admissions committee. This is encoded using a binary response variable named ABOVE with the interpretation that an applicant with a score above or equal to the cut-off has ABOVE = 1, and an applicant with a score below the cut-off has ABOVE = 0. That is, given a test score x and a cut-off score C, we define

    ABOVE = { 1,  x ≥ C,
            { 0,  x < C.          (1)

The reader should recall that the possible scores on GRE subject tests range from 200 to 990 in 10 pt. intervals. We have focused on P-GRE cut-off scores ranging from 620 to 800 pt., corresponding to the 32nd and 67th national percentiles [22]. Typical P-GRE cut-off scores lie in the region of 700 [2].

The data used in this study consists of 2017/2018 admissions records for physics graduate programs from 4 public universities in the Big Ten Academic Alliance and one private Midwestern university. The records contain unidentified profiles of program applicants with information regarding their GRE performance, undergraduate GPA, ethnicity and race, gender, etc. In addition, the records also include which institution the applicants attended during their bachelor's degrees. Complementary data describing the bachelor institutions has been added from three sources: the 2015 Carnegie Classification of Institutions of Higher Education [23], Barron's selectivity index [24], and 2017-2018 surveys of American universities by the American Institute of Physics (AIP) [25, 26]. The additional data describes several aspects of the institutions such as institution-wide admissions selectivity and the size of physics programs. The main idea is to study the statistical effects from applicants' institutional backgrounds using this complementary data.
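As a concrete illustration, the encoding in Eq. (1) amounts to a single comparison. The sketch below is in R, the language used for our analysis elsewhere; `pgre` is a hypothetical vector of P-GRE scores, not a variable name from our data files.

    # Minimal sketch of Eq. (1): encode ABOVE from P-GRE scores.
    # `pgre` is a hypothetical numeric vector of scores (200-990).
    C <- 700                        # a typical cut-off score [2]
    ABOVE <- as.integer(pgre >= C)  # 1 if at or above the cut-off, 0 otherwise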
A. On the data

The admissions records contain 5738 applications in total, but only 5314 (ca. 93%) of them include the students' P-GRE scores. Applications without P-GRE scores are ignored to avoid influencing the P-GRE distribution. Of the remaining applications, 2575 are domestic (ca. 48%). This study will focus entirely on domestic students for two main reasons. First, the P-GRE distribution for international students is much more saturated with perfect scores than the distribution for domestic students. The saturation problem is visualized in Figure 1: the percentage of international students scoring above the selected cut-off scores both starts off much higher and falls off much more slowly than the percentage of domestic students. Second, because there is no systematic collection of graduation records for non-US schools, it is difficult to reliably collect the necessary information for every international student.

Because the applicants are not identified, several applications may come from the same student. While these applications are unique in the sense that each application addresses a different school, they count as duplicated applications in this analysis by virtue of being from the same student. Duplicate applications could have an effect on the results, most notably in the logistic regression model, which relies on independent observations (see supplementary material). By comparing applications according to demographics and academic performance, a number of possible duplicate applications have been identified. In case all candidates are duplicates, roughly
Figure 1. A comparison of the P-GRE distribution between national data [22] and data used in this study. The analysis is primarily concerned with domestic applicants (green curve).
1. The raw features
In addition to the P-GRE score, thirteen features, or variables, have been selected for analysis. A summary of the features and their sources is given in Table I.

Table I. A summary of the features used in this study.

    Feature                         Type         Source
    Physics GRE score               continuous   Admissions
    Undergraduate GPA               continuous   Admissions
    Gender                          binary       Admissions
    Race                            categorical  Admissions
    Carnegie Classification         categorical  Carnegie
    Undergrad Population Profile    categorical  Carnegie
    Funding category                categorical  Carnegie
    ACT selectivity category        categorical  Carnegie
    Minority Serving Institution    binary       Carnegie
    Barron's selectivity index      categorical  Barron's
    No. bachelor graduates (2017)   continuous   AIP Survey
    No. bachelor graduates (2018)   continuous   AIP Survey
    No. PhD graduates (2017)        continuous   AIP Survey
    No. PhD graduates (2018)        continuous   AIP Survey

The features from the admissions records include the applicants' P-GRE score, U-GPA, gender, and race. Note that the gender feature is encoded as a binary variable; while we acknowledge that gender is not binary, more detailed descriptions were not collected by the admissions offices [27]. Similarly, different practices regarding the collection of data on racial and ethnic backgrounds have limited the scope of the race feature. See Posselt et al. for more details regarding the collection of data on racial and ethnic backgrounds by admissions offices [20]. The features from the admissions records constitute the applicant-specific component of the models, while the remaining features comprise the institutional component.

Of the Carnegie features, the two most prominent are the (2015) Carnegie (basic) classification of institutions and the (2015) undergraduate population profile classification. The basic classification is an overall categorization of the academic degrees offered and awarded by the institutions, e.g. Doctoral university with high research activity and Master's college with large programs. The undergraduate population profile classification characterizes the typical undergraduate population according to three metrics: the portion of full-time undergraduates, the academic achievements of first-year and first-time students, and the portion of entering transfer students. In addition, the Carnegie features also include the institutions' Funding category and ACT selectivity category, and whether the institutions are Minority Serving Institutions (MSI). The ACT category measures the entry selectivity of admissions offices by grouping all institutions according to the ACT scores of first-year bachelor students, and MSI indicates whether an institution satisfies the requirements for a Minority Serving Institution [28].

Lastly, Barron's provides the Profile of American Colleges [24], which is an index for institution-wide admissions selectivity, and the AIP surveys provide the numbers of bachelor and PhD students graduating in physics.

The data will be analyzed using two different data analysis methods based on logistic regression modelling and predictive machine learning analysis (described in Sec. III B). As they stand, the raw features are not well-suited for logistic regression due to computational issues as well as modelling-related difficulties. The remaining part of this section describes our data preprocessing and modelling choices. See Sec. V C for a discussion of potential issues. Because the predictive analysis requires less preprocessing than logistic regression, we provide a summary of all the models used in this study in Sec. III C to avoid confusion.
2. Underrepresented racial and ethnic minorities
The small representation of applicants from racial and ethnic minorities (Black, Latinx, Multi and Native) is of computational concern because logistic regression fares poorly with low-frequency categories [29]. Because initial tests including every racial group produced results with limited statistical power (e.g. infinite p-value confidence intervals), we combined racial and ethnic minorities into an underrepresented minority (URM) category despite Teranishi's warning [30]. This also combines their P-GRE distributions (see Figure 2), leading to a loss of information. This issue is further discussed in Sec. V C.

Figure 2. Estimated P-GRE distributions by racial and ethnic groups (number of applicants indicated in parentheses). Note that the combined distribution normalizes the differences between the combined groups.
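This recoding is a one-line relabelling. The sketch below assumes a hypothetical data frame `apps` whose `race` column uses the group labels quoted above; the admissions offices' actual coding may differ.

    # Collapse low-frequency racial/ethnic groups into a single URM level.
    apps$race <- as.character(apps$race)
    apps$race[apps$race %in% c("Black", "Latinx", "Multi", "Native")] <- "URM"
    apps$race <- factor(apps$race)  # re-level so the regression uses the collapsed categories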
3. The Carnegie classification & undergraduate population profile
While the Carnegie classification and undergraduate population profile support 34 and 16 unique categories, respectively, the limited pool of applications leaves many categories empty or with only a handful of applicants. Most of the categories are difficult to combine into meaningful groups. Thus, to avoid computational issues, the two features are replaced by the binary labels Doctoral university w/ highest research activity and Most selective undergraduate population.
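In code, this reduction is a pair of logical comparisons. The column and label strings below are hypothetical stand-ins; the exact spellings in the Carnegie data files may differ.

    # Replace the multi-category Carnegie features with binary indicators.
    apps$HighestResearch <- apps$CarnegieClass ==
      "Doctoral university w/ highest research activity"
    apps$MostSelectiveUG <- apps$UGPopProfile ==
      "Most selective undergraduate population"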
4. Funding category & ACT selectivity category
Similar to the Carnegie features, both Funding category and ACT selectivity category have categories with too few applicants. To avoid complications, the features are reduced to the binary labels Public Funding and Most ACT-selective, which, respectively, indicate whether the institution is publicly funded and whether the institution is in the most selective ACT category.
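The analogous sketch for these two features, again with hypothetical column and label names:

    # Reduce Funding category and ACT selectivity category to binary labels.
    apps$PublicFunding    <- apps$FundingCategory == "Public"
    apps$MostACTSelective <- apps$ACTCategory == "Most selective"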
5. Barron’s selectivity index
Barron's selectivity index is an admissions selectivity measure that categorizes institutions according to school competitiveness. In decreasing order of competitiveness, the categories include most competitive, highly competitive, very competitive, competitive, less competitive and non-competitive. Additional "plus" categories such as highly competitive plus have been collapsed into their corresponding ordinary levels. In this study, admissions selectivity is used as a metric for an institution's resources and staff experience. Because admissions selectivity is expected to have an effect only for the most selective schools, the selectivity categories less competitive than most competitive and highly competitive are combined into a single not as competitive category.
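A sketch of the collapse in R, assuming a hypothetical `Barrons` column whose levels follow the names listed above:

    # Fold "plus" levels into their ordinary levels, then pool everything
    # below "highly competitive" into a single "not as competitive" level.
    b <- sub(" plus$", "", as.character(apps$Barrons))
    b[!b %in% c("most competitive", "highly competitive")] <- "not as competitive"
    apps$Barrons <- factor(b, levels = c("not as competitive",
                                         "highly competitive",
                                         "most competitive"))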
6. No. bachelor/PhD graduates 2017/2018
In this study, the AIP features (see Table I) provide a measure of the size of undergraduate physics departments. As larger departments typically have more financial resources available and may offer students more opportunities for advanced coursework or research, the P-GRE scores of applicants from larger programs are expected to be higher [7]. However, because of the variety of institutions and physics programs, a systemic effect is expected to emerge only for very large physics programs. Instead of analyzing the raw number of graduates, a physics program is therefore classified as large if the number of graduates is above the 75th national percentile [21].

While the typical size of physics departments is unlikely to change on a yearly basis for most institutions, the exact number of graduates is much more sensitive to variation. Moreover, the applicants spent several years at their undergraduate institutions, thus it is unreasonable to estimate the general size of the physics departments using data from a single year. Because the statistical models cannot include data on both years simultaneously (i.e., as individual features) due to correlation issues, the 2017 and 2018 data must be combined (bachelor and PhD features separated). For most institutions, the difference in the number of bachelor/PhD graduates between 2017 and 2018 is not large enough to have any effect on the analysis. However, because the difference is large for some institutions, naively selecting, say, the average could overestimate or underestimate the size of some departments. In addition, there are some institutions for which data is missing for either 2017 or 2018. To avoid inaccurate single-point estimates of department sizes, the maximum and minimum cases are considered separately. In the maximum graduates models, the maximum number of bachelor and PhD students between the 2017 and 2018 data is included, and vice versa in the minimum graduates models. For institutions with missing data, any available data is used for both models.
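A sketch of the maximum and minimum graduates constructions, with hypothetical column names; `pmax`/`pmin` with `na.rm = TRUE` reproduce the fallback to whichever year is available:

    # Large-program indicator for the maximum and minimum graduates models.
    # `p75` is a placeholder for the 75th national percentile of graduates [21].
    bach_max <- pmax(apps$bach2017, apps$bach2018, na.rm = TRUE)  # maximum models
    bach_min <- pmin(apps$bach2017, apps$bach2018, na.rm = TRUE)  # minimum models
    apps$LargeBachelorMax <- bach_max > p75
    apps$LargeBachelorMin <- bach_min > p75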
B. Methods for data analysis
The following section provides a brief overview of the methods used in this study. Additional details are provided as supplementary material. Because logistic regression is likely familiar to a greater audience, more time is spent on the machine learning methods.
1. Logistic regression modelling
Logistic regression analysis is a technique for modelling a binary response y ∈ {0, 1} with respect to explanatory variables x₁, . . . , x_k, which may consist of a mixture of continuous and discrete data. While binary data is naturally handled by logistic regression, categorical (discrete) data with M > 2 categories must be encoded using M − 1 binary variables according to the one-hot encoding scheme (see supplementary material for details). The response is modelled according to the odds equation,

    odds(p) = exp(β₀ + β₁x₁ + · · · + β_k x_k + ε),          (2)

where p is the probability of the outcome y = 1, β_i is the regression coefficient of x_i and ε is an error term. The regression coefficients are determined numerically using an iterative scheme based on maximum likelihood estimation. In our study this is handled by the glm function in R [31].

A major benefit of logistic regression modelling is the interpretability of its regression coefficients. When x_i increases by 1 unit, the odds change by a factor of exp(β_i) called the odds ratio:

    OR(p; x_i) = odds(p; x_i + 1) / odds(p; x_i) = exp(β_i).  (3)

The interpretation of the odds ratio depends on whether x_i is continuous or categorical. For continuous features, the change is associated with a unit increase in x_i. For binary features, the change is associated with a switch in x_i from category 0 to category 1. Because multi-leveled categorical features are encoded with binary features, each binary represents a change from the reference category to the category associated with the binary. Odds ratios below 1 are inverted so that 1/OR(x_i) is the odds ratio associated with a unit decrease in x_i or a switch in x_i from category 1 to category 0. In order to avoid interpretation issues relating to very large or very small continuous features, it is customary to standardize continuous features by centering the mean about 0 and normalizing the variance to 1. For standardized features, the odds ratio is associated with an increase in the original feature by one standard deviation.

Alongside the regression coefficients, the glm function provides the corresponding p-values. To avoid multiple comparisons problems, the p-values are adjusted according to the Bonferroni correction. For a logistic regression model with N features, the Bonferroni-adjusted p-value is p̃ = Np. We follow common practice and include three levels of significance: α = 0.05, α = 0.01 and α = 0.001.

Because logistic regression is unable to handle missing values, we follow Nissen et al.'s recommendation of imputing the missing data instead of discarding it [32]. Our approach employs the MICE (Multiple Imputation by Chained Equations) algorithm, which is handled by the mice package in R [33]. MICE is an iterative algorithm that applies linear and logistic regression techniques in order to impute the data while conserving the relationships between the features as well as possible. The algorithm constructs N individual data sets to be modelled separately, the results of which are pooled (combined) according to Rubin's rules [34]. In this study, 5 imputation sets were created using 20 iterations (leaving other mice parameters at their defaults). Because the raw features are processed, the transformation must occur either before, after, or during the imputation. To our knowledge, there are no recommended strategies for the kinds of transformations used in this study. We therefore follow the general recommendation of von Hippel to "impute, then transform" [35]. As recommended by Moons et al. [36], the P-GRE scores are included in the imputation before preparing ABOVE.
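The imputation-and-pooling pipeline can be sketched as follows. The data frame `apps` and the three predictor names are hypothetical stand-ins; the actual models include all features of Table I, and the summary column names assume a recent version of mice.

    # Impute (5 sets, 20 iterations), fit one logistic regression per set,
    # and pool with Rubin's rules; Bonferroni-correct the pooled p-values.
    library(mice)

    imp  <- mice(apps, m = 5, maxit = 20, seed = 1)
    # In the study, ABOVE is derived from the imputed P-GRE scores at this
    # point ("impute, then transform" [35, 36]).
    fits <- with(imp, glm(ABOVE ~ GPA + Gender + URM, family = binomial))
    res  <- summary(pool(fits))

    res$OR    <- exp(res$estimate)         # odds ratios, Eq. (3)
    res$p_adj <- pmin(1, 3 * res$p.value)  # Bonferroni p~ = Np with N = 3 features here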
2. Machine learning analysis
Whereas logistic regression favors interpretability (via the odds ratios), machine learning analysis (MLA) focuses on making accurate and reliable predictions. Given inputs x₁, . . . , x_k and an output y, the goal of MLA is to identify a map f such that

    y = f(x₁, . . . , x_k) + ε,          (4)

where ε is a prediction error. When y is categorical (e.g. binary), f is called a classifier because it classifies a set of inputs into discrete outputs. As classifiers are seldom perfect, a major component of MLA consists of finding the optimal f, i.e. minimizing ε. To measure how well a classifier is able to classify inputs we use performance metrics. Different metrics highlight different types of behavior, meaning a classifier can score well according to one metric, but poorly according to another. This study employs two metrics: the prediction accuracy score and the AUC-ROC score.

The prediction accuracy score of a classifier is the portion of correctly classified cases. In terms of our data, a correctly classified case is any application for which the classifier successfully predicts whether the applicant scores above or below the cut-off score. It is typically referred to as simply the accuracy and is often reported as a percentage. Accuracy is a number between 0% and 100%, where 100% signifies a perfect classifier. While easy to interpret, accuracy is very sensitive to unbalanced output classes (see the "Domestic applicants" curve in Figure 1 for the class imbalance faced in this study) because it does not distinguish between the output classes. For instance, if 80% of applicants score above the cut-off, then a naive classifier predicting above regardless of the inputs will have an accuracy score of 80%. For this reason, accuracy should always be considered relative to class imbalance. Furthermore, because the class imbalance changes as the cut-off increases (Fig. 1), the interpretation of the nominal accuracy score changes. Hence, the accuracy scores of two classifiers using different cut-offs should not be compared nominally.

The AUC-ROC score is a more complex metric than accuracy. Here, ROC refers to a Receiver Operating Characteristic curve and AUC means taking the Area Under the ROC Curve. For more details regarding ROC curves, consult the supplementary material. The AUC-ROC score, or simply the AUC, is a measure of a classifier's ability to distinguish between output classes. AUC is a number between 0 and 1, where 1 signifies a perfect classifier, while a score of 0.5 is equivalent to complete guesswork. There is no universal scheme for judging AUC scores, but Hosmer et al. provide a rough guide: 0.7 ≤ AUC < 0.8 is acceptable, 0.8 ≤ AUC < 0.9 is excellent and 0.9 ≤ AUC is outstanding [29]. In contrast with the accuracy score, AUC is more robust towards imbalanced output classes [37], and thus AUC scores can be more reliably compared across different cut-off scores.

MLA typically consists of two phases: training and testing. Here, training refers to the construction of a classifier, and testing refers to its evaluation based on performance metrics. A typical problem in MLA known as overfitting arises when a classifier is trained to recognize "too many details" of a data set. Thus, instead of replicating the general trend of the data set, the classifier replicates the random errors. To avoid this, it is standard practice to use different data sets for the training and testing phases by splitting the (complete) data set at random.
Because random splits can have unforeseen consequences, it is common to conduct several training-testing procedures and average the performance metrics, using the standard errors of the averages as indicators for the confidence intervals. This study employs the K-fold cross-validation algorithm with K = 10 to prepare the random splits [38].

It is important to note that finding a perfect classifier is typically considered impossible, even if ε = 0 for all known data. Thus, there is no single correct algorithm for constructing f, and in fact, there are many unique algorithms to choose from. This study employs the conditional inference forest (CIF) algorithm, which is a variant of the earlier random forest algorithm [39, 40]. A random forest is comprised of an ensemble of decision trees, each of which is an independent classifier. A decision tree is an algorithmic approach to decision-making (predictions) that asks a series of yes-no questions based on the input data (e.g. whether an applicant is male, or whether their U-GPA exceeds a given threshold). The questions are determined during the training phase and are chosen to optimize performance. Each tree is given a random sample of the training set and a random selection of the input features. Predictions of the forest are then based on a majority vote among the predictions of the trees. A CIF is similar to a random forest in principle, but differs in its construction.

This study employs the CIF algorithm via the party package in R [41–43]. The forests were built using 200 trees and 3 features per tree (following the recommended √p [44]), with all other parameters kept at their defaults. One of the selling points of the CIF is that it provides a natural way of measuring the importance of each feature in the model. The process of preparing the importance measures for each feature is also handled by party. The idea is to remove a feature from the forest and measure the resulting change in a performance measure, interpreting a larger change as the feature being more important. As described in Janitza et al., measuring AUC loss is preferred due to its robustness with imbalanced data [45]. The importance measure is a tool for comparing the relative importance of features and should not be interpreted further [46].

Because the importance measures focus on the impact of removing each feature separately, a backwards recursive feature elimination (RFE) procedure is conducted to study the effect of removing several features (see e.g. [38]). To restrict the scope, the procedure is only executed for P-GRE cut-offs in intervals of 30 pt. RFE is an iterative process that involves training a forest, estimating its performance, and removing the least important feature from the set of active features. Starting with all features, the process is repeated until one feature remains. The order of removal is determined by the importance measures of the forest model. The importance measures are computed using the complete model, i.e., not during the procedure, to avoid overfitting [47]. Because the importance measures vary depending on the cut-off, one would ideally prepare a removal order separately for each cut-off and conduct a unique RFE for each cut-off. However, because the importance measures are similar for different cut-offs, an average removal order is used for all cut-offs.
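A sketch of one training-testing fold with the party package as described above; `train` and `test` are hypothetical folds from the 10-fold split, with ABOVE stored as a two-level factor so that cforest treats the problem as classification. varimpAUC assumes a recent version of party.

    # Conditional inference forest: 200 trees, 3 candidate features per split.
    library(party)

    cif <- cforest(ABOVE ~ ., data = train,
                   controls = cforest_unbiased(ntree = 200, mtry = 3))

    # AUC-based permutation importance, following Janitza et al. [45];
    # the resulting ranking is also what seeds the RFE removal order.
    vi <- varimpAUC(cif)
    sort(vi, decreasing = TRUE)

    # Accuracy on the held-out fold (majority vote over the 200 trees).
    pred <- predict(cif, newdata = test)
    mean(pred == test$ABOVE)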
C. A summary of the models

Most of the data preprocessing described in Sec. III A is done for logistic regression. This includes combining racial and ethnic minorities into an underrepresented minority category; reducing the Carnegie features Carnegie Classification, Undergraduate Population Profile, Funding Category and ACT selectivity category to binary labels; combining the Barron's selectivity categories less competitive than most competitive and highly competitive into a not as competitive category; and categorizing physics programs (both undergraduate and graduate) as large if the number of graduates is above the 75th national percentile. Because the computational difficulties of logistic regression related to multicollinearity and low-frequency categories are circumvented by the decision-tree construction of the CIF algorithm, none of these preprocessing procedures are required for the data to be compatible with the CIF models. With the exception of combining the 2017 and 2018 graduates data into minimum and maximum cases, the data is only preprocessed for the logistic regression models.

Avoiding preprocessing for the CIF models is in line with the philosophy of the predictive modelling approach. In contrast with how logistic regression emphasizes interpretability, machine learning is only interested in the relationship between the features and the response. Preprocessing the data dilutes the available information, and thus may negatively affect the predictive analysis (e.g., as in Fig. 2).

Overall, 19 × 2 × 2 logistic regression and CIF models are studied: there are 19 unique P-GRE cut-offs under consideration, and for each cut-off, a model is constructed with and without the potential duplicate applications (Sec. III A), and using the maximum and minimum number of graduates between 2017 and 2018 (Sec. III A 6).
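The full set of model configurations can be enumerated directly; this sketch just makes the 19 × 2 × 2 bookkeeping explicit:

    # 19 cut-offs x {with, without duplicates} x {max, min graduates}.
    configs <- expand.grid(cutoff     = seq(620, 800, by = 10),
                           duplicates = c(TRUE, FALSE),
                           graduates  = c("maximum", "minimum"))
    nrow(configs)  # 76 configurations, each fit with both approaches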
IV. RESULTS

Because of the large volume of similar results, we will primarily present the results (odds ratios of logistic regression and feature importance measures of the CIF) for the models including the potential duplicate applications. We discuss deviations from these results where relevant.
A. Key Findings
The statistical effects from the institutional features become more involved, both in the logistic regression models and the CIF models, as the P-GRE cut-off increases. In particular, applicants from well-funded institutions with large physics programs and high research activity are more likely to score above the cut-off. The logistic regression models and the CIF models identify several examples where the effect of an institutional feature is comparable to U-GPA or gender. While U-GPA and gender are integral components of every model (as expected), the race and ethnicity of applicants did not contribute as much to our models as anticipated based on the differences in scores between racial and ethnic groups found by Miller et al. [2]. Overall, the logistic regression approach and the machine learning approach typically agree on whether a feature has any relevance in the model. Having said that, the odds ratios typically identify a larger set of important features that, in addition, changes as the P-GRE cut-off increases.

When it comes to the maximum and minimum graduates models, the contributions from the AIP features are devalued in the minimum graduates models in favor of Barron's selectivity index and high research activity. Of the maximum models, the logistic regression models favor attending an institution with a large bachelor program over a large PhD program, while the opposite is true in the CIF models. Finally, the analysis as a whole is similar for the models with and the models without possible duplicate applications. Specifically, by removing the possible duplicates, some features become less significant in the logistic regression models and the performance of some CIF models is slightly reduced.
Figure 3. Summary of odds ratio significance levels of 19 independent models: Each P-GRE cut-off score (horizontal coordinate) denotes an independent maximum graduates logistic regression model that includes possible duplicates (see Sec. III C). In each model, the fields of the statistically significant features are marked with a symbol indicating the Bonferroni-corrected significance level: a circle indicates α = 0.05, a triangle indicates α = 0.01 and a diamond indicates α = 0.001. Blank fields indicate statistically insignificant results. The dotted separation lines categorize the features as applicant-related, institution-wide metrics, and concerning physics departments.
1. Significance analysis
Consider first the maximum number of graduates models. Figure 3 shows a diagram indicating how the set of significant features changes between the logistic regression models as the P-GRE cut-off score is increased from 620 to 800. The features are typically significant for every or almost every cut-off, for no or few cut-offs, or only for higher cut-offs. In the following we provide an overview of the applicant-related and institutional features that are statistically significant.

Of the applicant-related features, the odds ratios of U-GPA and gender are always statistically significant. However, contrary to expectations, odds ratios between applicants from different racial groups were only statistically significant in some cases. In particular, when compared to applicants identifying as white, the odds ratios for applicants identifying as Asian are significant for higher cut-offs, while the odds ratios for applicants identifying as Black, Latinx, Multi or Native are only significant for a few cut-offs. This is further discussed in Sec. V C.

When it comes to the institutional features, those statistically significant for a majority of the P-GRE cut-offs include attending a most competitive institution, an institution practicing some of the highest amounts of research activity, and an institution with a large physics bachelor program. Interestingly, attending a highly competitive institution is only significant for cut-offs between 640 and 690, while attending one of the most competitive institutions is significant for all cut-offs up to 760. Additionally, attending private universities or universities with large PhD programs becomes significant only for higher cut-offs. In contrast, attending an MSI, a most ACT-selective institution or graduating in a most selective undergraduate population profile is never significant, regardless of the cut-off.

In order to provide a rough overview of the difference between the maximum and minimum number of graduates models, Table II shows the fraction of P-GRE cut-offs for which each feature is significant. Note that the table also separates models with and without the possible duplicate applications. By removing the possible duplicates, the general significance of the features decreases. The change does not seem to originate in any particular feature as, with the exception of U-GPA, gender and most competitive, the fraction of significant cut-offs is reduced for all features. Compared to the maximum graduates models, the typical significance of attending a large bachelor program is considerably lower in the minimum graduates models.

Table II. Fraction of P-GRE cut-offs for which the odds ratio of each feature is statistically significant in the logistic regression models. There are 19 logistic regression models in each category (see Sec. III C). The first column (the maximum graduates models with possible duplicates) corresponds to the significance diagram (Fig. 3).

    Variable                       Maximum          Minimum
                                   With   Without   With   Without
    (Intercept)                    0.63   0.58      0.58   0.58
    Undergraduate GPA              1.00   1.00      1.00   1.00
    Gender                         1.00   1.00      1.00   1.00
    Combined B, L, M & N           0.26   0.00      0.21   0.00
    Asian                          0.53   0.37      0.47   0.42
    Highly competitive             0.26   0.26      0.63   0.26
    Most competitive               0.79   0.89      1.00   1.00
    Highest research activity      0.89   0.63      1.00   0.63
    Most selective UG p. profile   0.00   0.00      0.00   0.00
    Most ACT selective             0.00   0.00      0.00   0.00
    Privately Funded               0.42   0.21      0.32   0.21
    Minority Serving Institution   0.00   0.00      0.00   0.00
    Large bachelor program         1.00   0.74      0.37   0.21
    Large PhD program              0.32   0.05      0.32   0.11
Notably, the difference corresponds with an improvement in the fraction of significant cut-offs for attending a competitive school or an institution with high research activity, suggesting the variables may suffer from a confounding issue (see Sec. V B for a discussion).

Considerable changes in the set of significant features are only observed for large changes in the cut-off score. We therefore only discuss the odds ratios corresponding to cut-offs 650, 710 and 770, representing the lower, middle and higher regions, respectively.
2. Odds ratios
The odds ratios for the maximum and minimum number of graduates models are shown in Tables III (a) and (b), respectively. First and foremost, improving one's undergraduate GPA by one standard deviation, roughly equivalent to improving a B to a B+, improves the odds of scoring above the cut-off by at minimum a factor of 2.5 (increasing to ≈ 2.9 for higher cut-offs). This substantial increase in odds reflects the importance of U-GPA in admissions expressed by both admission committees and prospective students [3, 48]. Additionally, the odds of scoring above the cut-off is 1/0.17 ≈ 5.9 times greater for male applicants than for female applicants. The odds ratios of U-GPA and gender are consistent for all P-GRE cut-offs in both the maximum and minimum number of graduates models.

While the benefit of attending a competitive institution diminishes as the P-GRE cut-off increases from 650 to 710 and 770, attending one of the most competitive institutions is always preferable to a highly competitive institution. For cut-offs 650 and 710, the odds increase from attending a most competitive school is similar to the applicant increasing their U-GPA from a B to a B+. The model also finds institutional funding and high levels of research activity to be important factors. For high P-GRE cut-offs (e.g. 770), the odds of scoring above the cut-off is about 2 times as large for applicants who attended a private university compared to applicants who attended a public university. Similarly, for applicants attending a university that practices some of the highest levels of research activity, the odds ratio is roughly 1.6-2.0 depending on the cut-off.

Applicants from institutions with large physics programs typically also score higher. In the maximum number of graduates models, having attended a university with one of the largest undergraduate physics programs improves the odds of scoring above the P-GRE cut-off by a factor of about 1.7-2.0 (typically closer to 2.0). When the cut-off is high, a similar effect is seen for students attending a university offering a large graduate program (an odds ratio of about 1.6). In the minimum number of graduates models, the corresponding odds ratios are only statistically significant for the highest cut-offs. They are also typically smaller than the corresponding odds ratios in the maximum graduates models. The only statistically significant example in Table III (b) is the odds ratio for attending an institution with one of the largest PhD programs.

The remaining variables, i.e., most ACT-selective, most selective undergraduate population profile and MSI, contribute little to none.

Table III. Odds ratios for P-GRE cut-off scores 650, 710 and 770 of the logistic regression models with possible duplicates (see Sec. III C). The maximum and minimum graduates models are separated in Tables (a) and (b) respectively. Statistically significant odds ratios are marked with asterisks (see below (b)). Note that p̃ = 14p refers to the Bonferroni-corrected p-values.

(a) Maximum bachelor/PhD graduates models

    Variable                                  cut-off 650              cut-off 710              cut-off 770
                                              OR    95% CI             OR    95% CI             OR    95% CI
    (Intercept)                               1.72  [1.09, 2.72]**     0.81  [0.53, 1.25]       0.29  [0.18, 0.47]***
    Undergraduate GPA                         2.53  [2.12, 3.03]***    2.63  [2.22, 3.11]***    2.87  [2.40, 3.42]***
    Gender                                    0.16  [0.10, 0.24]***    0.18  [0.12, 0.27]***    0.17  [0.11, 0.25]***
    Asian                                     1.37  [0.73, 2.54]       1.61  [0.95, 2.71]       2.20  [1.34, 3.59]***
    Combined B, L, M & N                      0.68  [0.41, 1.11]       0.65  [0.42, 1.00]       0.76  [0.49, 1.19]
    Highly competitive                        1.98  [1.10, 3.58]*      1.55  [0.95, 2.52]       1.22  [0.76, 1.93]
    Most competitive                          3.40  [1.76, 6.56]***    2.79  [1.53, 5.07]***    1.62  [0.97, 2.73]
    Doc. inst. w/ highest research activity   1.68  [1.00, 2.81]*      1.94  [1.23, 3.04]***    1.63  [1.06, 2.51]*
    Most selective UG population profile      1.72  [0.68, 4.37]       1.85  [0.73, 4.68]       1.75  [0.71, 4.32]
    Most ACT selective                        0.66  [0.26, 1.69]       0.49  [0.19, 1.27]       0.66  [0.25, 1.75]
    Privately funded                          0.96  [0.57, 1.62]       1.42  [0.91, 2.21]       2.07  [1.38, 3.11]***
    Minority Serving Institution              1.17  [0.62, 2.21]       1.05  [0.60, 1.82]       1.13  [0.64, 1.99]
    Large bachelor program                    1.96  [1.23, 3.12]***    1.74  [1.15, 2.65]**     1.94  [1.28, 2.92]***
    Large PhD program                         1.28  [0.72, 2.25]       1.31  [0.83, 2.08]       1.62  [1.07, 2.45]**

(b) Minimum bachelor/PhD graduates models

    Variable                                  cut-off 650              cut-off 710              cut-off 770
                                              OR    95% CI             OR    95% CI             OR    95% CI
    (Intercept)                               2.03  [1.31, 3.15]***    0.94  [0.61, 1.45]       0.34  [0.22, 0.53]***
    Undergraduate GPA                         2.50  [2.08, 2.99]***    2.59  [2.19, 3.08]***    2.82  [2.35, 3.39]***
    Gender                                    0.17  [0.11, 0.25]***    0.19  [0.13, 0.28]***    0.17  [0.12, 0.26]***
    Asian                                     1.41  [0.72, 2.75]       1.63  [0.96, 2.77]       2.19  [1.35, 3.56]***
    Combined B, L, M & N                      0.66  [0.41, 1.07]       0.65  [0.42, 0.99]*      0.77  [0.50, 1.17]
    Highly competitive                        2.21  [1.25, 3.93]***    1.68  [1.00, 2.80]*      1.35  [0.83, 2.18]
    Most competitive                          3.69  [2.07, 6.55]***    3.09  [1.89, 5.05]***    1.87  [1.17, 3.00]**
    Doc. inst. w/ highest research activity   1.72  [1.02, 2.92]*      2.10  [1.33, 3.33]***    1.83  [1.20, 2.79]***
    Most selective UG population profile      1.56  [0.60, 4.05]       1.72  [0.62, 4.76]       1.59  [0.62, 4.07]
    Most ACT selective                        0.75  [0.28, 2.03]       0.56  [0.19, 1.67]       0.74  [0.28, 1.99]
    Privately funded                          0.88  [0.53, 1.45]       1.27  [0.81, 1.98]       1.88  [1.27, 2.79]***
    Minority Serving Institution              1.11  [0.60, 2.06]       1.00  [0.58, 1.74]       1.11  [0.64, 1.93]
    Large bachelor program                    1.44  [0.91, 2.29]       1.28  [0.84, 1.94]       1.47  [0.99, 2.20]
    Large PhD program                         1.43  [0.83, 2.45]       1.34  [0.86, 2.08]       1.54  [1.05, 2.26]*

    * : p̃ ≤ 0.05    ** : p̃ ≤ 0.01    *** : p̃ ≤ 0.001

C. Conditional Inference Forest

The general performance of the CIF models is shown in Figure 4. Alongside the accuracy score is the class imbalance, which provides the baseline from which the accuracy score is interpreted. Because the imbalance is considerably high for lower cut-offs, the accuracy score is more representative of the CIF's ability to identify applicants scoring above the cut-off when the cut-off is higher (as the imbalance decreases with increasing cut-off score, the accuracy becomes increasingly more representative). However, because the imbalance level is outside the standard errors of the K-fold estimate, it is reasonable to conclude that the CIF is not simply predicting the majority class. Additionally, the AUC score is mostly outstanding (>0.9) and, more importantly, very stable with respect to changes in the P-GRE cut-off. The stability of the AUC coupled with the high score suggests that the results of the model may be reasonably interpreted, that is, that the feature importances provide a reasonable picture of the relationship between the features and the output for all P-GRE cut-offs. Because of the similarity in performance between the maximum and minimum graduates CIF models, we present only the maximum graduates models going forward.

Figure 4. Overall performance of the conditional inference forest. The standard errors of the K-fold (K = 10) estimates are indicated by the error bars. While the ABOVE class imbalance is very high for lower cut-offs, the accuracy standard errors are always above the imbalance level. The AUC score is mostly above 0.9, which Hosmer et al. categorize as "outstanding" [29].

Figure 5 graphs the change in the importance measure of the features as the cut-off increases. The plot shows evidence of distinct groups of features with similar importances. The first group consists only of undergraduate GPA, whose importance measure is about 2 times higher than that of any other feature. The next group consists of gender and no. PhD graduates, which stand out when compared to the remaining group of the least important features (see Sec. III B 2 for how to interpret the importance measure). With the exception of some minor variation, the importance measure of U-GPA is fairly stable across all models. Notably however, while the importance measure decreases for gender as the cut-off increases, it simultaneously increases for no. PhD graduates. Hence, for higher cut-offs, the model finds a greater statistical difference between applicants scoring above and below the cut-off when given the no. PhD graduates compared to an applicant's gender. Proportional to their own importance measures, several features in the remaining group undergo large changes in importance measures. However, because these variations are small when compared to U-GPA, gender and the no. PhD graduates, they should not be overemphasized.

Figure 5. Conditional inference forest feature importance measures. The standard errors of the K-fold (K = 10) estimates are indicated by the error bars. Features included in the CIF are listed in the legend in decreasing order of average (across all cut-offs) decrease in AUC upon removal.

The results of the feature elimination procedure are shown in Figure 6. The diagram is arranged such that the features are removed left to right, starting from a complete model and ending with a model that only includes U-GPA (i.e., the named feature at a given horizontal coordinate is the least important feature at that stage and the next to be removed). Throughout most of the elimination procedure, the performance remains at ≈ 0.9 on the AUC metric and roughly between 80% and 90% on the accuracy metric.

Figure 6. Conditional inference forest feature elimination procedure. The standard errors of the K-fold (K = 10) estimates are indicated by the error bars. The features are eliminated from left to right, where the named feature is currently the least important feature, and thus the next to be dropped from the model.

Figure 3 shows that the set of statistically significant features in the logistic regression models changes as the P-GRE cut-off score increases (e.g. whether the institution is privately funded is only significant for higher cut-offs). A similar change is not present in the importance measures of the CIF models (Fig. 5), which, in contrast with the odds ratios, preserve the feature groups described above. In particular, the three features U-GPA, gender and number of PhD graduates are the most important features for every cut-off score. Because the importance measures of the remaining features are consistently lower by a considerable margin for all cut-off scores, the set of important features in the CIF models is very robust towards changes in the cut-off score.

As a final check for whether the added performance can be attributed to including the institutional features, the performance of the full CIF is compared to a CIF excluding all institutional features, and a CIF including the number of PhD graduates and the Carnegie classification. The results of the comparison are summarized in Figure 7: The addition of only two institutional features makes a considerable improvement for both metrics, regardless of the cut-off. Hence, the added performance is reasonably attributed to the inclusion of institutional features.

Figure 7. Performance comparison between conditional inference forests with all features, with 2 institutional features, and without institutional features (only U-GPA, gender and race). The standard errors of the K-fold (K = 10) estimates are indicated by the error bars. The significant improvement in performance by the simple addition of 2 institutional features suggests that the contribution from the institutional features is captured by a few features.

V. DISCUSSION

A. Research Questions
This study investigated four research questions (RQs) that we address in order.

1. To what extent does the applicant's undergraduate institution influence whether they are able to attain a minimum P-GRE score expected by an admissions committee?
2. To what degree do the institutional effects compare to known effects such as U-GPA, gender and race?
3. How do the results depend on the specific cut-off chosen by the admissions office?
4. How well do the conventional and machine learning approaches agree on RQs 1, 2 and 3?

Regarding RQ 1, the institutional background helps explain whether a student scores above a given P-GRE cut-off. Consider a cut-off score of 710, which is just above the most common cut-off score of 700. In the logistic regression models (see Table III), applicants from competitive institutions with large physics programs, practicing high levels of research, are statistically more likely to score above the cut-off than other applicants. Similarly, the size of physics programs (number of graduates) and the institution-wide Carnegie classification are integral components of the predictive capacity of the CIF models (see Fig. 6). Hence, the models suggest that employing a cut-off score of 710 not only limits access for racial and ethnic minorities [2], but also for applicants from smaller, less competitive universities with fewer resources that practice lower (not necessarily among the lowest) levels of research. Similar observations are found for every other cut-off in the CIF models. In the case of the logistic regression models, the set of statistically significant institutional features varies depending on the cut-off, but the overall interpretation is similar: Including institutional data in the analysis certainly helps explain whether a student scores above the cut-off, regardless of the chosen cut-off.

Now, is it necessary to include a complete description of an applicant's undergraduate background? Figure 6 suggests that this is probably not the case, as a large portion of the institutional data does not contribute to the models. Moreover, because the performance of the CIF does not decrease as the Carnegie classification is removed, there is also reason to suspect that the institutional features may share information. The independence of the features is discussed in more detail in Sec. V B.

The modelling and machine learning approaches disagree somewhat with respect to RQ 2. In the logistic regression models, the odds ratio for U-GPA is comparable to admission competitiveness (roughly 2-3), while the odds ratio for gender is just shy of 6.0. In contrast, U-GPA is by far the most important feature in all CIF models. Meanwhile, the feature importance measure of gender is similar to the number of PhD graduates, particularly for higher cut-offs. Because neither approach placed as much emphasis on race and ethnicity, it is unreasonable to judge the overall effect of institutional data by comparing it to the effects of race and ethnicity in the models. Despite disagreeing on some of the finer details, both approaches find examples where the effects from institutional data, e.g. admission competitiveness and the size of physics departments, are comparable to U-GPA and gender. The most clear-cut example is shown in Figure 7, which demonstrates that replacing a CIF model without institutional features with a similar CIF model that includes the Carnegie classification and number of PhD graduates provides a blanket improvement in the accuracy and AUC scores for every P-GRE cut-off.

Finally, we address RQs 3 and 4 together. First and foremost, both approaches have identified statistically significant differences in the P-GRE scores of applicants with different institutional backgrounds. Having said that, the specifics regarding the statistical difference and the extent to which it is explained by different institutional backgrounds depend on the model and cut-off in question. For instance, the significance levels of odds ratios vary to such an extent that some features are only relevant for a select few cut-offs (e.g. private/public institution for higher cut-offs). The importance measures of the CIF models are much more stable across cut-offs, but lack the interpretability of the odds ratios. Nevertheless, while the set of useful features changes with the cut-off, institutional features always contribute to the analysis. Here, logistic regression disagrees with the CIF on the set of useful features and their importance to the model, but both recognize useful institutional features for every cut-off score.

B. Limitations
B. Limitations

Central to this study is the question of whether the institutional background of an applicant can be reliably measured, or estimated, with the available data. Here, "institutional background" is used in an extended sense that includes the applicant's experiences in relation to attending a particular institution. Our data certainly do not allow for quantifying the effects of such experiences as studying in an encouraging environment or at an institution with a large array of opportunities. However, data such as the Carnegie features and the numbers of graduating bachelor's and PhD students likely capture some aggregate effect of studying at different types of institutions. In addition, these features were found to be important in our models, suggesting that there is a statistical difference between the applicants that depends on their institutions.

Because the universities considered in this study are typically highly regarded, the data likely suffer from a selection bias, favoring prospective students with higher grades and GRE scores. In a 2018 survey of prospective students from racial and ethnic minorities, Cochran et al. identified concerns regarding GRE scores and undergraduate GPA as commonly expressed barriers to applying to physics graduate programs [16]. Indeed, this is reflected in the P-GRE distribution of the applicants in our data set: Figure 1 shows that the applicants consistently score as high as or higher than the national averages, implying that our data set consists of a biased selection of all prospective students (the data set comprises, at an upper limit, only a fraction of all P-GRE test-takers in 2017-18 [22]). Because of this selection bias, the distributions of the other features in our data set are likely also biased. Most prominently, the selection bias will disproportionately affect women and racial and ethnic minorities [4, 10]. The problem of selection bias and its consequences for physics education research as a whole was recently discussed by Kanim and Cid [49]. Our findings should thus be considered in light of our biased sample and their discussion.

A related, but different, issue is that applicants are more likely to have attended large programs by virtue of there being more prospective students from larger programs than from smaller programs. This can be seen in our data from the median number of bachelor's graduates: whereas the national median was 8 in both 2017 and 2018 [25, 26], the median in our data is 27 (2017) and 30 (2018), i.e., more than three times as high. Consequently, our data contain a larger fraction of applicants from large programs than is typical, and thus the distributions of all the features in our data are likely primarily determined by applicants from larger programs. This also contributes to the selection bias discussed above.

Another methodological concern is whether the different institutional variables attempt to describe the same effect, implying a possible problem of correlation, or even multicollinearity, between the features. The numbers of bachelor's and PhD graduates are particularly sensitive to this issue, as they both represent a measure of the size of physics departments. Indeed, the features share a positive correlation of roughly 0.7. Both approaches present evidence in favor of there being some degree of relationship between the features.
For instance, when comparing the minimum- and maximum-graduates logistic regression models, Table II shows that the difference in the fraction of P-GRE cut-offs for which the size of the bachelor's and PhD programs is significant is similar to the corresponding difference for attending a competitive school or an institution with high research activity. As it is not uncommon for institutions with larger programs to be more competitive or to practice higher levels of research, we suspect that some statistical relationship between these features is likely. A more direct example is seen in Figure 6, where the removal of the Carnegie classification during the feature elimination procedure does not deteriorate the performance by any measurable amount. This indifference suggests that the information contained in the Carnegie classification, which is known to be considerable due to Carnegie's high importance measure (see Figure 5), is also contained within the remaining set of features (U-GPA, gender and number of PhD graduates). As a final example, the performance comparison (Figure 7) shows that most of the overall effect of the institutional influence can be described by a limited selection of institutional features.
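The collinearity concern is also straightforward to check directly. The sketch below computes the pairwise correlation and variance inflation factors (VIFs) for the two department-size features; the column names are hypothetical, and `df` stands for the application data.

```python
# Sketch: quantify the overlap between the two department-size features.
# Column names are hypothetical stand-ins for the study's variables.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def size_collinearity(df: pd.DataFrame) -> None:
    size = df[["n_bachelor_grads", "n_phd_grads"]]
    print(size.corr())                     # off-diagonal is roughly 0.7 here
    X = sm.add_constant(size)              # VIF expects an intercept column
    for i, col in enumerate(X.columns):
        if col != "const":
            print(col, variance_inflation_factor(X.values, i))
```

With only two predictors and a correlation of roughly 0.7, the VIF is 1/(1 - 0.7^2) ≈ 2, i.e., a noticeable but not pathological overlap, consistent with the discussion above.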
C. Data processing and modelling choices

A major difficulty for the logistic regression approach is the need for data processing, especially in the context of losing information through unfortunate modelling choices. The most prominent example in this study is the combination of racial and ethnic groups into a single underrepresented minority group. As suggested by Figure 2, the lack of importance of the race features in the logistic regression model may actually be a case of Simpson's paradox (information loss due to combining data [50]). That is, because the combined P-GRE distribution of URM applicants resembles the P-GRE distribution of white applicants (see Figure 2), and because the race feature was one-hot encoded using "white" as the reference level, the difference between the distributions is not large enough to be statistically significant. In comparison, the distribution is much more skewed for Asian applicants, and thus the difference becomes statistically significant for higher cut-offs. Other examples include the Carnegie classification and the undergraduate population profile, which were essentially reduced from multi-level categorizations to simple binaries. Estimating the amount of meaningful information lost for these features is particularly complicated because of the high number of low-frequency categories.

Compared to the logistic regression approach, the CIF avoids the data processing issues described above. When processing categorical features for inferential modelling, the features must remain interpretable. However, because the CIF does not require the combination of categorical levels to be meaningful, a tree node can find the optimal grouping of categories without regard to interpretation. Indeed, the construction of the CIF algorithm allows it to handle unprocessed data naturally, without suffering the same issues as logistic regression (and other machine learning methods that require preprocessing the data). As a result, the CIF is able to identify statistical properties much more easily than logistic regression. An example of this effect is seen in Barron's selectivity index: whereas the odds ratios decrease and become less significant as the P-GRE cut-off increases (Table III), the feature importance is relatively stable with respect to changes in the cut-off (Figure 5).

Furthermore, compared to the odds ratios of logistic regression, the importance measures of the CIF are more informative and provide a clearer picture. The framework of logistic regression assumes that every feature is a distinct component of the response (Eq. (2)). In contrast, a tree in the CIF will only include a feature if it is found to be important enough (see Sec. III B 2). Hence, if a particular feature is always less important than the other features in every tree (recall that each tree is built on a subset of the features), then its importance measure will be 0. No similar mechanism is present in the logistic regression framework, which will always try to interpret every feature as an integral component of the model. Accordingly, the importance measures more accurately reflect the degree to which the features are associated with the response. Indeed, note that the set of features essential to the model is always larger in the logistic regression models and, in addition, changes as the P-GRE cut-off increases. For example, the odds ratio for attending a privately funded institution is only statistically significant for the higher cut-off scores (Figure 3).
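This zero-importance mechanism can be demonstrated on synthetic data. The sketch below is illustrative only: it uses scikit-learn's random forest and permutation importance as stand-ins for the CIF's conditional importance measure, and has no connection to the study's data.

```python
# Illustration on synthetic data: a feature unrelated to the response gets
# (near-)zero permutation importance in a forest, while logistic regression
# still estimates a coefficient for it.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
informative = rng.normal(size=n)            # drives the response
noise = rng.normal(size=n)                  # unrelated to the response
X = np.column_stack([informative, noise])
y = (informative + 0.5 * rng.normal(size=n) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
imp = permutation_importance(forest, X, y, n_repeats=20, random_state=0)
print(imp.importances_mean)                 # second entry is approximately 0

logit = LogisticRegression().fit(X, y)
print(logit.coef_)                          # both features get coefficients
```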
By relaxing the necessary assumptions of the logistic regression framework, we obtain a more effective tool for identifying the relationship between the features and the response, albeit one that is harder to interpret.

The effects of unfortunate modelling choices in logistic regression models depend, in the end, on the data. In our case, combining racial and ethnic minorities into a single underrepresented minority category has likely influenced how racial and ethnic information is treated by the model. Similarly, the significance of other processed features may also have been diminished. That being said, we have conducted two very different analyses (inferential vs. predictive modelling) and found similar results. It is therefore unlikely that the choices unique to each approach have affected the overall results of the analysis.
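As a final illustration of the encoding issue, the snippet below contrasts a full one-hot encoding (with "white" as the reference level, as in our models) against a collapsed URM indicator. The category labels are illustrative, not the study's exact groupings.

```python
# Illustration of the encoding choice: a full one-hot encoding with "white"
# as the reference level versus a collapsed URM indicator. Labels are
# illustrative, not the study's exact category list.
import pandas as pd

race = pd.Series(["white", "black", "hispanic", "asian", "white"],
                 name="race")

# Full encoding: every non-reference category keeps its own indicator.
full = pd.get_dummies(race, prefix="race").drop(columns="race_white")

# Collapsed encoding: distinct groups are merged into one indicator, which
# can mask differences between the merged distributions (Simpson's paradox).
urm = race.isin(["black", "hispanic"]).astype(int).rename("urm")

print(full.astype(int))
print(urm)
```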
D. Future work

The present study has looked into how the undergraduate institutions of applicants may influence the physics graduate admissions process by studying their statistical relationship with P-GRE cut-off scores. Lacking from this analysis is an understanding of whether institutional influence may exert its primary effect at a different stage in the admissions process. For example, it is known that a number of bachelor's students who are interested in further studies eventually decide not to apply [16]. While some cases arise due to personal or financial concerns, some students may not have received the preparation or encouragement necessary to motivate further studies. If such motivation plays a significant role for students unsure of whether to pursue a career in physics, then one would expect prospective students from institutions with PhD programs to be more likely to apply to graduate programs. Additionally, it is worth considering whether these prospective students are more likely to apply to any graduate program in general, or simply to the program at their undergraduate institution.
VI. CONCLUSION
The present work has studied the effects of institutional influence on graduate program admissions by modelling a hard physics GRE cut-off score with application data from five Midwestern universities. For completeness, all possible cut-off scores between 620 and 800 (32nd and 67th percentile) have been analyzed, although most admissions employ a cut-off of 700. The analysis has been conducted using both inferential and predictive modelling, based on logistic regression and the conditional inference forest algorithm, respectively. Both approaches identify the known effects of undergraduate GPA and gender, but do not emphasize a statistical difference between applicants from different racial and ethnic minorities, as would be expected from earlier work [2]. However, this apparent contradiction with past work can likely be understood as a combination of a Simpson's paradox and selection bias among the applicants. Both approaches identified cases where the impact of institutional features was comparable to the known effects of undergraduate GPA and gender. Overall, the two approaches agree on the analysis as a whole, but disagree on the result of increasing the P-GRE cut-off. In terms of the odds ratios, increasing the cut-off places more significance on institutional features associated with competitive schools, private funding, large physics programs and high research activity. On the other hand, the added performance when including institutional features can be attributed to a small number of features.

In conclusion, when analyzing graduate program applications, we recommend including information regarding the applicants' bachelor's institutions. Moreover, due to the innate flexibility and precision of the conditional inference forest algorithm, combined with the large variety of data structures seen in application data, we also recommend the forest algorithm, as well as the predictive analysis approach in general. Based on these findings, and on the known problems of cut-off scores limiting access for underrepresented racial and ethnic minorities, we advocate against the practice of using GRE cut-off scores in admissions.
ACKNOWLEDGMENTS
This project was supported by the Michigan State University College of Natural Sciences, the Lappan-Phillips Foundation, and the Norwegian Agency for Quality Assurance in Education (NOKUT), which supports the Center for Computing in Science Education. This project has also received support from the INTPART project of the Research Council of Norway (Grant No. 288125) and the Thon Foundation.

[1] C. Miller and K. Stassun, Nature 510, 303 (2014).
[2] C. W. Miller, B. M. Zwickl, J. R. Posselt, R. T. Silvestrini, and T. Hodapp, Science Advances 5, eaat7550 (2019).
[3] G. Potvin, D. Chari, and T. Hodapp, Physical Review Physics Education Research 13, 020142 (2017).
[4] A. M. Porter and R. Ivie, Women in Physics and Astronomy, 2019, Tech. Rep. (American Institute of Physics, 2019).
[5] L. Merner and J. Tyler, African American, Hispanic, and Native American Women among Bachelors in Physical Sciences & Engineering, Tech. Rep. (2017).
[6] N. T. Young and M. D. Caballero, arXiv:2008.10712 [physics] (2020).
[7] J. W. Halley, A. Adjoudani, P. Heller, and J. S. Terwilliger, American Journal of Physics 59, 403 (1991).
[8] N. T. Young and M. D. Caballero, arXiv:1907.01570 [physics] (2019).
[9] C. Zabriskie, J. Yang, S. DeVore, and J. Stewart, Physical Review Physics Education Research 15, 020120 (2019).
[10] R. Ivie, "Beyond Representation: Data to Improve the Situation of Women and Minorities in Physics and Astronomy" (2018).
[11] L. M. Aycock, Z. Hazari, E. Brewe, K. B. Clancy, T. Hodapp, and R. M. Goertzen, Physical Review Physics Education Research 15, 010121 (2019).
[12] K. Rosa and F. M. Mensah, Physical Review Physics Education Research 12, 020113 (2016).
[13] S. Hyater-Adams, C. Fracchiolla, N. Finkelstein, and K. Hinko, Physical Review Physics Education Research 14, 010132 (2018).
[14] J. R. Posselt, Inside Graduate Admissions (Harvard University Press, 2016).
[15] Educational Testing Service, "Guide to the Use of Scores" (2019).
[16] G. L. Cochran, T. Hodapp, and E. E. A. Brown, in Physics Education Research Conference Proceedings, PER Conference (Cincinnati, OH, 2018) pp. 92-95.
[17] E. M. Levesque, R. Bezanson, and G. R. Tremblay, arXiv:1512.03709 [astro-ph, physics:physics] (2015).
[18] National Science Foundation, "Frequently Asked Questions (FAQs) for NSF 20-587, Applicants for FY2021 Graduate Research Fellowship Program (GRFP)" (2020).
[19] G. Attiyeh and R. Attiyeh, The Journal of Human Resources, 524 (1997).
[20] J. R. Posselt, T. E. Hernandez, G. L. Cochran, and C. W. Miller, Journal of Women and Minorities in Science and Engineering, 283 (2019).
[21] P. J. Mulvey and S. Nicholson, Physics Bachelor's Degrees, Tech. Rep. (American Institute of Physics, 2015).
[22] Educational Testing Service, "GRE® Subject Test Interpretative Data" (2019).
[23] Center for Postsecondary Research, The Carnegie Classification of Institutions of Higher Education (Indiana University Bloomington, Bloomington, IN, 2016).
[24] Barron's Educational Series, Inc., College Division, Barron's Profiles of American Colleges (Barron's).
[25] S. Nicholson and P. J. Mulvey, Roster of Physics Departments with Enrollment and Degree Data, 2017, Tech. Rep. (American Institute of Physics, 2018).
[26] S. Nicholson and P. J. Mulvey, Roster of Physics Departments with Enrollment and Degree Data, 2018, Tech. Rep. (American Institute of Physics, 2019).
[27] A. L. Traxler, X. C. Cid, J. Blue, and R. Barthelemy, Physical Review Physics Education Research 12, 020114 (2016).
[28] U.S. Department of Education, "Lists of Postsecondary Minority Institutions."
[29] D. W. Hosmer and S. Lemeshow, Applied Logistic Regression, 2nd ed. (John Wiley & Sons, Inc., 2000).
[30] R. Teranishi, New Directions for Institutional Research, 37 (2007).
[31] J. Friedman, T. Hastie, and R. Tibshirani, Journal of Statistical Software 33 (2010), 10.18637/jss.v033.i01.
[32] J. Nissen, R. Donatello, and B. Van Dusen, Physical Review Physics Education Research 15, 020106 (2019), 10.1103/PhysRevPhysEducRes.15.020106.
[33] S. van Buuren and C. Groothuis-Oudshoorn, Journal of Statistical Software 45 (2011), 10.18637/jss.v045.i03.
[34] D. B. Rubin, Multiple Imputation for Nonresponse in Surveys (John Wiley & Sons, Inc., 1987).
[35] P. T. von Hippel, Sociological Methodology, 265 (2009).
[36] K. G. M. Moons, R. A. R. T. Donders, T. Stijnen, and F. E. Harrell, Journal of Clinical Epidemiology, 1092 (2006).
[37] C. X. Ling, J. Huang, and H. Zhang, in Advances in Artificial Intelligence, Lecture Notes in Computer Science, edited by Y. Xiang and B. Chaib-draa (Springer, Berlin, Heidelberg, 2003) pp. 329-341.
[38] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. (Springer, 2017).
[39] L. Breiman, Machine Learning 45, 5 (2001).
[40] T. Hothorn, K. Hornik, and A. Zeileis, Journal of Computational and Graphical Statistics 15, 651 (2006).
[41] T. Hothorn, P. Bühlmann, S. Dudoit, A. Molinaro, and M. van der Laan, Biostatistics 7, 355 (2006).
[42] C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn, BMC Bioinformatics 8, 25 (2007).
[43] C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis, BMC Bioinformatics 9, 307 (2008).
[44] V. Svetnik, A. Liaw, C. Tong, J. C. Culberson, R. P. Sheridan, and B. P. Feuston, Journal of Chemical Information and Computer Sciences 43, 1947 (2003).
[45] S. Janitza, C. Strobl, and A.-L. Boulesteix, BMC Bioinformatics 14, 119 (2013).
[46] L. Auret and C. Aldrich, Minerals Engineering, 27 (2012).
[47] V. Svetnik, A. Liaw, C. Tong, and T. Wang, in Multiple Classifier Systems, Lecture Notes in Computer Science, edited by F. Roli, J. Kittler, and T. Windeatt (Springer, Berlin, Heidelberg, 2004) pp. 334-343.
[48] D. Chari and G. Potvin, Physical Review Physics Education Research 15, 023101 (2019).
[49] S. Kanim and X. Cid, Physical Review Physics Education Research 16, 020106 (2020).
[50] E. H. Simpson, Journal of the Royal Statistical Society. Series B (Methodological) 13, 238 (1951).