Using Machine Learning to Identify the Most At-Risk Students in Physics Classes

Jie Yang, Seth DeVore, Dona Hewagallage, Paul Miller, Qing X. Ryan, and John Stewart*

Department of Physics and Astronomy, West Virginia University, Morgantown, WV 26506
Department of Physics and Astronomy, California State Polytechnic University, Pomona, CA 91768

(Dated: July 28, 2020)

* [email protected]

Machine learning algorithms have recently been used to predict students' performance in an introductory physics class. The prediction model classified students as those likely to receive an A or B or students likely to receive a grade of C, D, F or withdraw from the class. Early prediction could better allow the direction of educational interventions and the allocation of educational resources. However, the performance metrics used in that study become unreliable when used to classify whether a student would receive an A, B, or C (the ABC outcome) or if they would receive a D, F, or withdraw (W) from the class (the DFW outcome) because the outcome is substantially unbalanced, with between 10% and 20% of the students receiving a D, F, or W. This work presents techniques to adjust the prediction models and alternate model performance metrics more appropriate for unbalanced outcome variables. These techniques were applied to three samples drawn from introductory mechanics classes at two institutions (N = 7184, 1683, and 926). Applying the same methods as the earlier study produced a classifier that was very inaccurate, classifying only 16% of the DFW cases correctly; tuning the model increased the DFW classification accuracy to 43%. Using a combination of institutional and in-class data improved DFW accuracy to 53% by the second week of class. As in the prior study, demographic variables such as gender, underrepresented minority status, first-generation college student status, and low socioeconomic status were not important variables in the final prediction models.

I. INTRODUCTION
Physics courses, along with other core science and mathematics courses, form key hurdles for Science, Technology, Engineering, and Mathematics (STEM) students early in their college careers. Student success in these classes is important to improving STEM retention; the success of students traditionally underrepresented in STEM disciplines in the core classes may be a limiting factor in increasing inclusion in STEM fields. Physics Education Research (PER) has developed a wide range of research-based instructional materials and practices to help students learn physics [1]. Research-based instructional strategies have been demonstrated to increase student success and retention [2]. While some of these strategies are easily implemented for large classes, others have substantial implementation costs. Further, no class could implement all possible research-based strategies, and some may be more appropriate for some subsets of students than for others. One method to better distribute resources to the students who would benefit the most is to identify at-risk students early in physics classes. The effective identification of students at risk in physics classes and the efficacious use of this classification represent a promising new research strand in PER.

The need for STEM graduates continues to increase at a rate that is outstripping STEM graduation rates across American institutions. A 2012 report from the President's Council of Advisors on Science and Technology [3] identified the need to increase graduation of STEM majors to avoid a projected shortfall of one million STEM job candidates over the next decade. Improving STEM retention has long been an important area of investigation for science education researchers [4-11]. Targeting interventions to students at risk in core introductory science and mathematics courses taken early in college offers one potential mechanism to improve STEM graduation rates. In recent years, educational data mining has become a prominent method of analyzing student data to inform course redesign and to predict student performance and persistence [12-16].

The current study investigates the application of machine learning algorithms to identify at-risk students. Machine learning and data science as a whole are growing explosively in many segments of the economy as these new methods are used to make sense of, and exploit, the exponentially growing data collected in an increasingly online world. These methods are also being adapted to understand and improve educational data systems. It seems likely that this process will accelerate in the near future as universities, in a challenging financial climate, attempt to retain as many students as possible. We argue that PER should both help shape the construction of retention models of physics students and explore their most effective and most ethical use. The following summarizes the prior study applying Educational Data Mining (EDM) techniques in physics classes, provides an overview of EDM, and, more specifically, an overview of the use of EDM for grade prediction.

A. Prior Study: Study 1
This study extends the results of Zabriskie et al. [17], which will be referred to as Study 1 in this work. Study 1 applied random forest and logistic regression models to institutional data to predict which students would receive an A or B in introductory physics classes.

B. Research Questions
This study seeks to extend the application of machine learning algorithms to predict whether a student will earn a D or F or withdraw (W) from a physics class. In particular, we explore the following research questions:

RQ1: How can machine learning algorithms be applied to predict an unbalanced outcome in a physics class?

RQ2: Does classification accuracy differ for underrepresented groups in physics? If so, how and why does it differ?

RQ3: How can the results of a machine learning analysis be applied to better understand and improve physics instruction?
C. Educational Data Mining
Educational Data Mining (EDM) can be described as the use of statistical, machine learning, and traditional data mining methods to draw conclusions from large educational datasets while incorporating predictive modeling and psychometric modeling [16]. In a 2014 meta-analysis of 240 EDM articles by Peña-Ayala, 88% of the studies were found to use a statistical and/or machine learning approach to draw conclusions from the data presented. Of these studies, 22% analyzed student behavior, 21% examined student performance, and 20% examined assessments [18]. Peña-Ayala also found that classification was the most common method used in EDM, applied in 42% of all analyses, with clustering used in 27% and regression in 15% of studies.

Educational Data Mining encompasses a large number of statistical and machine learning techniques, with logistic regression, decision trees, random forests, neural networks, naive Bayes, support vector machines, and K-nearest neighbor algorithms commonly applied [19]. Peña-Ayala's [18] analysis found 20% of studies employed Bayes' theorem and 18% decision trees. Decision trees and random forests are among the more commonly used techniques in EDM. We use these techniques to investigate our research questions and explore ways to assess the success of machine learning algorithms. More information on the fundamentals of these and other machine learning techniques is readily available in a number of machine learning texts [20, 21].
D. Grade Prediction and Persistence
While EDM is used for a wide array of purposes, it has often been used to examine student performance and persistence. One survey by Shahiri et al. summarized 30 studies in which student performance was examined using EDM techniques [22]. Neural networks and decision trees were the two most common techniques used in studies examining student performance, with naive Bayes, K-nearest neighbors, and support vector machines used in some studies. A study by Huang and Fang examined student performance on the final exam for a large-enrollment engineering course using measurements of college GPA, performance in three prerequisite math classes as well as Physics 1, and student performance on in-semester examinations [23]. They analyzed the data using a large number of techniques commonly used in EDM and found relatively little difference in the accuracy of the resulting models. Study 1 also found little difference in the performance of machine learning algorithms in predicting physics grades. Another study examining an introductory engineering course by Marbouti et al. used an array of EDM techniques to predict student grade outcomes of C or better [24]. They used in-class measures of student performance including homework, quiz, and exam 1 scores and found that logistic regression provided the highest accuracy at 94%. A study by Macfadyen and Dawson attempted to identify students at risk of failure in an introductory biology course [25]. Using logistic regression, they were able to identify failing students (defined as having a grade of less than 50%) with 81% accuracy. With the goal of improving STEM retention, many universities are taking a rising interest in using EDM techniques for grade and persistence prediction in STEM classes [26].

The use of machine learning techniques in physics classes has begun only recently. In addition to Study 1, random forests were used in a 2018 study by Aiken et al. to predict student persistence as physics majors and to identify the factors that are predictive of students either remaining physics majors or becoming engineering majors [27].
II. METHODS

A. Sample
This study used three samples drawn from the introductory calculus-based physics classes at two institutions. Samples 1 and 2 were collected in the introductory, calculus-based mechanics course (Physics 1) taken by physical science and engineering students at a large eastern land-grant university (Institution 1) serving approximately 21,000 undergraduate students. The general university undergraduate population had ACT scores ranging from 21 to 26 (25th to 75th percentile) [28]. The overall undergraduate demographics were 80% White, 4% Hispanic, 6% international, 4% African American, 4% students reporting two or more races, 2% Asian, and other groups each with 1% or less [28].

Sample 1 was drawn from institutional records and includes all students who completed Physics 1 from 2000 to 2018, for a sample size of 7184. Over the period studied, the instructional environment of the course varied widely, and as such, the results for this sample may be robust to pedagogical variations. Prior to the spring 2011 semester, the course was presented traditionally, with multiple instructors teaching largely traditional lectures and students performing cookbook laboratory exercises. In spring 2011, the department implemented a Learning Assistant (LA) program [29] using the Tutorials in Introductory Physics [30]. In fall 2015, the program was modified because of a funding change, with LAs assigned to only a subset of laboratory sections. The Tutorials were replaced with open source materials [31], which lowered textbook costs to students and allowed full integration of the research-based materials with laboratory activities.

Sample 2 was collected from the fall 2016 to the spring 2019 semester, when the instructional environment was stable, for a sample size of 1683. The same institutional data were collected, and the sample also included a limited number of in-class performance measures: clicker average, homework average, Force and Motion Conceptual Evaluation (FMCE) pretest score, FMCE pretest participation, and the score on in-semester examinations. A more detailed explanation of these variables will be provided in the next section.

Sample 3 was collected at a primarily undergraduate and Hispanic-serving university (Institution 2) with approximately 26,000 students in the western US. Fifty percent of the general undergraduate population had ACT scores in the range 19 to 27. The demographics of the general undergraduate population were 46% Hispanic, 21% Asian, 16% White, 6% international, 4% two or more races, 3% African American, 3% unknown, with other races 1% or less [28]. The sample was collected in the introductory calculus-based mechanics class for all four quarters of the 2017 calendar year, for a sample size of 926. This class also primarily serves physical science and engineering students. The course was taught in multiple sections each quarter by multiple different instructors. The pedagogical style varied greatly, with some instructors giving traditional lectures and some teaching using active-learning methods.
B. Variables
The variables used in this study were drawn from institutional records and from data collected within the classes and are shown in Table I. Two types of variables were used: two-level dichotomous variables and continuous variables. A few variables require additional explanation. The variable CalReady measures the student's math readiness. Calculus 1 is a prerequisite for Physics 1. For the vast majority of students in Physics 1, the student's four-year degree plan assumes the student enrolls in Calculus 1 in their first semester at the university. These students are considered "math ready." A substantial percentage of the students at Institution 1 are not math ready. The variable STEMCls captures the number of STEM classes completed before the start of the course studied. STEM classes include mathematics, biology, chemistry, engineering, and physics classes.

For all samples, demographic information was also collected from institutional records. Students were considered first generation if neither of their parents completed a four-year degree. A student was classified as an underrepresented minority (URM) student if they identified as Hispanic or reported a race other than White or Asian. Gender was also collected from university records; for the period studied, gender was recorded as a binary variable. While not optimal, this reporting is consistent with the use of gender in most studies in PER; for a more nuanced discussion of gender and physics, see Traxler et al. [32].

For Sample 2, in-class data were also available on a weekly basis. These data included clicker scores (given for participation points), homework averages, test scores, and a conceptual pretest score (PreScore) using the FMCE [33]. Students not in attendance on the day the FMCE was given received a zero; whether a student completed the FMCE was captured by the dichotomous variable PreTaken, which is one if the test was taken and zero otherwise.

For Sample 3, socioeconomic status (SES) was measured by whether the student qualified for a federal Pell grant. A student is eligible for a Pell grant if their family income is less than $50,000; however, most Pell grants are awarded to students with family incomes of less than $20,000 [34].
Table I. Full list of variables. An x marks the samples (1, 2, 3) in which each variable was available.

Institutional Variables
Variable    1 2 3   Type          Description
Gender      x x x   Dichotomous   Does the student identify as a man or a woman?
URM         x x x   Dichotomous   Does the student identify as an underrepresented minority?
FirstGen    x x x   Dichotomous   Is the student a first-generation college student?
CalReady    x x     Dichotomous   Is the student ready for calculus?
SES             x   Dichotomous   Does the student qualify for a Pell grant?
CmpPct      x x     Continuous    Percentage of credit hours attempted that were completed.
CGPA        x x x   Continuous    College GPA at the start of the course.
STEMCls     x x     Continuous    Number of STEM classes completed at the start of the course.
HrsCmp      x x     Continuous    Total credit hours earned at the start of the course.
HrsEnroll   x x     Continuous    Current credit hours enrolled at the start of the course.
HSGPA       x x x   Continuous    High school GPA.
ACTM        x x x   Continuous    ACT/SAT mathematics percentile score.
ACTV        x x     Continuous    ACT/SAT verbal percentile score.
APCredit    x x     Continuous    Number of credit hours received from AP courses.
TransCrd    x x     Continuous    Number of credit hours received from transfer courses.

In-Class Variables
Clicker       x     Continuous    Average clicker score, graded for participation.
Homework      x     Continuous    Homework average.
TestAve       x     Continuous    Average for the first or the first and second exam.
PreTaken      x     Dichotomous   Was the pretest taken?
PreScore      x     Continuous    FMCE pretest score.
C. Random Forest Classification Models
This work employs the random forest machine learning algorithm to predict students' final grade outcomes in introductory physics. Random forests are one of many machine learning classification algorithms; Study 1 reported that most machine learning algorithms had similar performance when predicting physics grades. A classification algorithm seeks to divide a dataset into multiple classes. This study will classify students as those who will receive an A, B, or C (ABC students) and students who will receive a D or F or withdraw (W) (DFW students).

To understand the performance of a classification algorithm, the dataset is first divided into test and training datasets. The training dataset is used to develop the classification model, that is, to train the classifier. The test dataset is then used to characterize the model performance. The classification model is used to predict the outcome of each student in the test dataset; this prediction is compared to the actual outcome. Section II D discusses the performance metrics used to characterize the success of the classification algorithm. For this work, 50% of the data were included in the test dataset and 50% in the training dataset. This split was selected to maintain a substantial number of underrepresented students in both the test and training datasets.

The random forest algorithm uses decision trees, another machine learning classification algorithm. Decision trees work by splitting the dataset into two or more subgroups based on one of the model variables. The variable selected for each split is chosen to divide the dataset into the two most homogeneous subsets of outcomes possible, that is, subsets with a high percentage of one of the two classification outcomes. The variable and the threshold for the variable represent the decision for each node in the tree. For example, one node may split the dataset using the criterion (the decision) that a student's college GPA is less than 3.2. The process continues by splitting the subsets, forming the decision tree, until each node contains only one of the two possible outcomes. Decision trees are less susceptible to multicollinearity than many statistical methods common in PER such as linear regression [35].

Random forests extend the decision tree algorithm by growing many trees instead of a single tree. The "forest" of decision trees is used to classify each instance in the data; each tree "votes" on the most probable outcome. The decision threshold determines what fraction of the trees must vote for an outcome for it to be selected as the overall prediction of the random forest. Random forests use bootstrapping to prevent one variable from being obscured by another. Bootstrapping is a statistical method where multiple random subsets of a dataset are created by sampling with replacement. Individual trees are grown on Z subsamples generated by sampling the training dataset with replacement, using a random subset of size m = sqrt(k) of the variables, where k is the number of independent variables in the model [36]. This method ensures the trees are not correlated and that the stronger variables do not overwhelm weaker variables [20]. The "randomForest" package in "R" was used for the analysis. The Supplemental Material contains an example of random forest code in R [37].
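To make the workflow concrete, the following is a minimal sketch of this kind of analysis, assuming a data frame named students whose DFW column is a two-level factor (levels "ABC" and "DFW") and whose remaining columns are the predictors of Table I; the object names are ours, and this is not the code of the Supplemental Material [37].

    library(randomForest)

    set.seed(2020)                            # reproducible split and forest
    n     <- nrow(students)
    train <- sample(n, size = floor(0.5 * n)) # 50% training, 50% test

    # Grow the forest; for classification, mtry defaults to the square root
    # of the number of predictors, matching m = sqrt(k) described above.
    rf <- randomForest(DFW ~ ., data = students[train, ],
                       ntree = 500, importance = TRUE)

    # Classify the held-out test data with the default 50% vote threshold
    # and tabulate the confusion matrix discussed in Sec. II D.
    pred <- predict(rf, newdata = students[-train, ])
    table(Predicted = pred, Actual = students$DFW[-train])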
D. Performance Metrics

The confusion matrix [38], as shown in Table II, summarizes the results of a classification algorithm and is the basis for calculating most model performance metrics. To construct the confusion matrix, the classification model developed from the training dataset is used to classify students in the test dataset. The confusion matrix categorizes the outcomes of this classification.
Table II. Confusion Matrix

                      Actual Negative        Actual Positive
Predicted Negative    True Negative (TN)     False Negative (FN)
Predicted Positive    False Positive (FP)    True Positive (TP)
For classification, one of the dichotomous outcomes is selected as the positive result. In the current study, we use the DFW outcome as the positive result. This choice was made because some of the model performance metrics focus on the positive results, and we feel that most instructors would be more interested in accurately identifying students at risk of failure.

From the confusion matrix, many performance metrics can be calculated. Study 1 reported the overall classification accuracy, the fraction of correct predictions, shown in Eqn. 1,

Overall Accuracy = (TN + TP) / N_test,  (1)

where N_test = TP + TN + FP + FN is the size of the test dataset.

The true positive rate (TPR) and the true negative rate (TNR) characterize the rate of making accurate predictions of either the DFW or the ABC outcome. The DFW accuracy is the fraction of the actual DFW cases in the test dataset that are classified as DFW (Eqn. 2),

DFW Accuracy = TPR = TP / (TP + FN).  (2)

ABC accuracy is the fraction of the actual ABC cases that are classified as ABC (Eqn. 3),

ABC Accuracy = TNR = TN / (TN + FP).  (3)

DFW accuracy is called "sensitivity" or "recall" in machine learning; ABC accuracy is called "specificity."

ABC and DFW accuracy can be adjusted by changing the strictness of the classification criteria. If the model classifies even the cases only slightly likely to be DFW as DFW, it will probably classify most actual DFW cases as DFW, producing a high DFW accuracy. It will also make a lot of mistakes; the DFW precision, or positive predictive value (PPV), captures the rate of making correct positive predictions and is defined as the fraction of the DFW predictions which are correct (Eqn. 4),

DFW Precision = PPV = TP / (TP + FP).  (4)

DFW precision is called "precision" or the "positive predictive value" in machine learning.

This study seeks models that balance DFW accuracy and precision; however, the correct balance for a given application must be selected based on the individual features of the situation. If there is little cost and no risk to an intervention, then optimizing for higher DFW accuracy might be the correct choice to identify as many DFW students as possible. If the intervention is expensive or carries risk, optimizing the DFW precision so that most students who are given the intervention are actually at risk might be more appropriate.

Beyond simply evaluating the overall performance of a classification algorithm, we would like to establish how much better the algorithm performs than pure guessing. For example, Sample 1 is substantially unbalanced between the DFW and ABC outcomes, with 88% of the students receiving an A, B, or C. If a classification method guessed that all students would receive an A, B, or C, then the classifier would have an overall accuracy of 0.88. Cohen's kappa, κ, measures agreement among observers [39], correcting for the effect of pure guessing as shown in Eqn. 5,

κ = (p − p_e) / (1 − p_e),  (5)

where p is the observed agreement and p_e is the agreement expected by chance. Fit criteria have been developed for κ, with κ less than 0.20 indicating poor agreement, 0.21 to 0.40 fair agreement, 0.41 to 0.60 moderate agreement, 0.61 to 0.80 good agreement, and above 0.80 very good agreement [40].
The receiver operating characteristic (ROC) curve plots the true positive rate against the false positive rate as the decision threshold is varied; the area under this curve (AUC) provides a threshold-independent measure of model performance, with pure guessing producing an AUC of 0.5. An AUC of 1.0 represents perfect discrimination [38, 41]. Hosmer et al. [41] suggest an AUC threshold of 0.80 for excellent discrimination.
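These metrics follow directly from the confusion matrix counts. The helper below is a sketch of our own (not code from the paper) that evaluates Eqns. 1-5:

    # Compute the performance metrics of Eqns. 1-5 from confusion counts.
    classifier_metrics <- function(TP, TN, FP, FN) {
      n   <- TP + TN + FP + FN
      acc <- (TP + TN) / n                 # overall accuracy (Eqn. 1)
      tpr <- TP / (TP + FN)                # DFW accuracy / sensitivity (Eqn. 2)
      tnr <- TN / (TN + FP)                # ABC accuracy / specificity (Eqn. 3)
      ppv <- TP / (TP + FP)                # DFW precision (Eqn. 4); NaN if no
                                           # positive predictions were made
      # Cohen's kappa (Eqn. 5): observed agreement corrected for chance.
      p_e <- ((TP + FP) * (TP + FN) + (TN + FN) * (TN + FP)) / n^2
      kap <- (acc - p_e) / (1 - p_e)
      c(accuracy = acc, DFW_accuracy = tpr, ABC_accuracy = tnr,
        DFW_precision = ppv, kappa = kap)
    }

    # Example: pure guessing on an 88% ABC sample (everyone predicted ABC)
    # gives an overall accuracy of 0.88 but kappa = 0.
    classifier_metrics(TP = 0, TN = 880, FP = 0, FN = 120)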
E. Model Tuning and Validation
We will find that the random forest classification models have poor performance predicting whether a student will receive a D, F, or W using the default parameters of the model. To improve performance, the models are tuned by adjusting the decision threshold. The imbalance of both the outcome variable and some of the demographic variables must also be investigated to verify that the models are valid and the conclusions are reliable. This process is described in detail in the Supplemental Material [37].

III. RESULTS
General descriptive statistics are shown in Tables III and IV for Samples 1 and 3, respectively. The descriptive statistics for Sample 2 are similar to those for Sample 1 and are presented in the Supplemental Material [37]. The dichotomous outcome variable divides each sample into two subsets with different academic characteristics. The dichotomous independent variables further divide the subsets defined by the outcome variable. The overall demographic composition of each sample is shown in the Supplemental Material [37].

[Table III. Descriptive statistics for Sample 1 (N = 7184): the mean ± the standard deviation of the physics grade, ACT mathematics percentile, high school GPA, and college GPA, overall and by subgroup.]

[Table IV. Descriptive statistics for Sample 3 (N = 926): the mean ± the standard deviation of the physics grade, SAT mathematics percentile, high school GPA, and college GPA, overall and by subgroup.]
A. Classification Models
To explore the classification of DFW students, multiple classification models were constructed for each sample. To allow comparison, each model was tuned so that the DFW accuracy and DFW precision were approximately equal. Table V shows the overall model fit for all samples. Each sample is discussed separately.
1. Sample 1
Sample 1 was first analyzed using the default decision threshold for the randomForest package in "R", where 50% of the trees must vote for the outcome for it to be selected. This was the threshold used in Study 1. This result is shown as the "Default" model in Table V. The model has very poor DFW accuracy, with only 16% of the DFW students identified. It also has fairly poor κ and AUC. This poor performance results from the unbalanced DFW outcome, where only 12% of the students receive a D, F, or W. This model was tuned to produce the "Overall" model by adjusting the decision threshold as shown in the Supplemental Material [37]. A threshold of 32% of the trees voting for the DFW classification produced the Overall model, which balanced DFW accuracy and precision. This model substantially improved DFW accuracy to 43% at the expense of lower DFW precision and had substantially better κ and AUC; κ = 0.36 represented fair agreement; however, the AUC value of 0.68 was well below Hosmer's threshold of 0.80 for excellent discrimination.

[Table V. Model performance parameters (the mean ± the standard deviation): overall accuracy, DFW accuracy, ABC accuracy, DFW precision, κ, and AUC for the Default, Overall, demographic-subgroup, and Restricted models of Sample 1 (N = 7184); the Institutional, weekly in-class, and combined models of Sample 2 (N = 1683); and the Overall and demographic-subgroup models of Sample 3 (N = 926).]

The classification model constructed on the full training dataset was then used to classify each demographic subgroup in the test dataset to determine if a model trained on a sample composed predominantly of majority students would be accurate for other students. The κ and AUC of the models classifying women, URM students, and first-generation students were very similar. Some, but not extreme, variation was measured for DFW accuracy and precision. The overall classifier had lower DFW accuracy for women and higher accuracy for URM students (with corresponding changes in precision). This may indicate that it would be productive to tune the models separately for different demographic groups.

Finally, the model labeled "Restricted" was constructed using only a subset of variables similar to those available for Sample 3. Sample 3 contained institutional variables that are commonly supplied with a demographic data request to institutional records; Sample 1 also included variables such as STEMCls, which may be of particular interest for the prediction of the outcomes of physics students, and variables such as the percentage of classes completed, which may be of particular importance in DFW classification. As one might expect, the Restricted model using fewer variables performed more weakly than the Overall model, with DFW accuracy reduced by 7%.
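Continuing the sketch from Sec. II C (same assumed object names), the tuning step amounts to thresholding the fraction of trees voting DFW and sweeping that threshold until DFW accuracy and precision balance:

    # Fraction of trees voting DFW for each test-set student.
    votes  <- predict(rf, newdata = students[-train, ], type = "prob")[, "DFW"]
    actual <- students$DFW[-train]

    # Sweep the decision threshold and report DFW accuracy and precision.
    for (cut in seq(0.10, 0.50, by = 0.02)) {
      pred <- votes >= cut
      tpr  <- sum(pred & actual == "DFW") / sum(actual == "DFW")
      ppv  <- sum(pred & actual == "DFW") / sum(pred)
      cat(sprintf("threshold %.2f: DFW accuracy %.2f, DFW precision %.2f\n",
                  cut, tpr, ppv))
    }

    # Equivalently, randomForest() accepts a cutoff argument giving the vote
    # fraction each class needs, e.g. cutoff = c(ABC = 0.68, DFW = 0.32).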
2. Sample 2
Sample 2 contained the same institutional variables as Sample 1, but also included in-class data, such as homework grades and clicker grades, which were available on a weekly basis. While the institutional data would require a data request to institutional research at most institutions, the in-class variables should be available to most physics instructors. Table V shows the progression of DFW accuracy and precision as the class progresses.

A model using only the institutional variables was first constructed to determine how well DFW students could be identified using only variables available before the semester begins. This model (Institutional) had superior performance characteristics to the Overall model of Sample 1, which used the same variables and a larger sample collected over a longer time period. The improved performance quite possibly was the result of Sample 1 averaging over many instructional environments while Sample 2 contained data from a single instructional design. This suggests that limiting the data used for the classifier to the current implementation of a course may produce superior results, even with a lower sample size.

Models using only the in-class data easily available to instructors consistently performed more weakly than those which mixed in-class and institutional data. The in-class-only models improved as the class progressed and became better than the model including only institutional data after the first test was given in week 5. The in-class-only model was substantially better than the institutional model after the second test was given in week 8. As such, if the goal of a classification algorithm is to predict student outcomes well into the class, only in-class data is needed.

The models combining in-class and institutional data added surprisingly little predictive power to the institutional model, particularly early in the class. This further supports the need to access a rich set of institutional data for accurate classification early in a class and suggests predictions made using only institutional data will not be substantially modified using in-class data until the first test is given.

[Figure 1. Variable importance of the optimized model predicting DFW for Sample 2 using institutional data and data available in-class at the end of week 2. Error bars are one standard deviation in length. The three panels rank the variables by the mean decrease in DFW accuracy, overall accuracy, and DFW precision; Homework and CGPA rank highest in all three panels.]
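A sketch of how such nested models could be compared, continuing the earlier example (the column names follow Table I; TestAve is omitted because no test has been given by week 2):

    # Fit models on nested predictor sets: institutional only, in-class only,
    # and combined, as compared in Table V.
    inst_vars  <- c("Gender", "URM", "FirstGen", "CalReady", "CmpPct", "CGPA",
                    "STEMCls", "HrsCmp", "HrsEnroll", "HSGPA", "ACTM", "ACTV",
                    "APCredit", "TransCrd")
    class_vars <- c("Clicker", "Homework", "PreTaken", "PreScore")

    fit_rf <- function(vars)
      randomForest(reformulate(vars, response = "DFW"),
                   data = students[train, ])

    rf_inst  <- fit_rf(inst_vars)                  # institutional only
    rf_class <- fit_rf(class_vars)                 # week-2 in-class only
    rf_both  <- fit_rf(c(inst_vars, class_vars))   # combined model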
3. Sample 3
As shown in Table I, Sample 3 contains many fewer variables than Sample 1. The classification model for Sample 3 had lower DFW accuracy and precision than similar models for Samples 1 and 2. Restricting the variable set of Sample 1 to be approximately that of Sample 3 (the Restricted model) produced a classifier with similar properties to that of Sample 3. The difference in classification accuracy, therefore, seems to be the result of the difference in the variables available and not the difference in sample size or differences between the universities.

The student population of Sample 3 is substantially more diverse than that of Samples 1 or 2. Model performance predicting only the outcomes of minority demographic subgroups was approximately that of the overall model performance, with somewhat lower variation than Sample 1. This suggests that the differences in model performance for demographic subgroups observed in Sample 1 were not a result of the low representation of those groups in the sample. Low SES students were also analyzed separately; the model performance for low SES students was similar to the overall model performance.
B. Variable Importance
Once constructed, classification models can provide physics instructors and departments a much more nuanced picture of student risk and provide tools to better serve their students. This section and the next will introduce some of the additional insights which can be extracted once a classification model is constructed. Institutional data is exceptionally complex; random forest classification models allow the identification of the parts of the institutional data that are important for the prediction of student risk and the thresholds in that data that go into classifying a student as at-risk.

The first measure useful in further understanding which variables are most important in the classification process is "variable importance." The importance of a variable to one of the model characterization metrics, such as DFW accuracy, is computed by fitting the model with the variable and then without the variable to determine the mean decrease in the characterization measure when the variable is removed from the model. Figure 1 shows the mean decrease in DFW accuracy, DFW precision, and overall accuracy as the different variables used in the full model are removed for Sample 2, using data available in the second week of the class. Similar plots for Samples 1 and 3 are presented in the Supplemental Material [37].

The variable importance plots shown in Fig. 1 show that homework average, followed by CGPA, were the most important variables in accurately identifying DFW students. In addition to these variables, only CmpPct (the percentage of credit hours completed) has an error bar that does not include zero. These results are very different from the variable importance results of Study 1, which predicted the AB outcome and used overall accuracy to measure model performance. In Study 1, while homework grade grew in variable importance from week to week, it was less important than CGPA until week 5, when test 1 was given. As in Study 1, a very limited number of institutional variables were needed to predict grades in a physics class.

While many instructors would select CGPA as an important variable and would hope that homework averages were important, quantitatively having a relative measure of importance is valuable. The variable importance plots in Fig. 1 also identify many variables that seem important, such as HSGPA, ACTM, and the demographic variables, which were not important for the prediction of the DFW outcome.
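If importance = TRUE was set at fit time (as in the sketch of Sec. II C), the randomForest package reports a closely related permutation-based importance directly; a short sketch under the same naming assumptions (the "DFW" column exists only if the outcome factor uses that level name):

    # Mean decrease in accuracy when each variable is permuted, reported
    # per class and overall; the "DFW" column parallels the decrease in
    # DFW accuracy plotted in Fig. 1.
    imp <- importance(rf)
    imp[order(imp[, "DFW"], decreasing = TRUE), ]

    varImpPlot(rf)   # standard importance plot for a quick look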
C. Applying Classification Models
The most basic output of a classification model is the assignment of each student in the dataset into one of two classes: those students likely to receive an A, B, or C and those likely to receive a D, F, or W. Classification algorithms, once constructed, can provide a finer-grained picture of student risk that may be more useful in applying machine learning results to manage instructional interventions for at-risk students. A classification model can also provide the probability that a student will receive each outcome. The predicted probability density distribution of receiving an A, B, or C is plotted for each actual grade outcome in Fig. 2. Two plots are provided to improve readability. The distribution of probability estimates of students who actually earn an A or B is very narrow, with most students having a predicted probability above 0.75. This suggests that the students who actually receive an A or B in the class are predicted to receive an A, B, or C with very high probability. The probability curve for students earning a C is much broader but still peaked near one. Examination of the C distribution illustrates two key features of the prediction: (1) the vast majority of students who actually earn a C are predicted to do so with probability p > […] and (2) […] risk decisions.

[Figure 2. Predicted probability of earning an A, B, or C for Sample 1, disaggregated by the actual grade received in the class. The figure plots the probability density of each outcome. The order of the peaks in the lower figure from left to right is W, F, D, C.]

Variable importance plots quantify the relative importance of the many variables used in the classification model, correcting for the collinearity of many of the variables. These plots, however, do not provide information about the levels of these variables important in making the classification. A random forest grows thousands of decision trees on a subset of the variables; examining a single decision tree using all variables can show the thresholds for the important variables. The decision tree for the training dataset of Sample 2 in week 2 of the class is shown in Fig. 3. Each node in the tree is labeled with the majority member of the node, either ABC or DFW. The root node (top node) contains the entire training dataset, indicated by the 100% at the bottom of the node. Every node indicates the fraction of the training dataset contained in the node. The fraction of each outcome is shown in the center of the node; for example, the root node contains 10% DFW students and 90% ABC students. The decision condition is printed below the node. If the condition is true for the student, the left branch of the tree is taken; if false, the right branch is taken. For example, the decision condition for the root node is whether the week 2 homework average is above or below 62%. For the 8% of the students below this average, the left branch is taken to node 2. Only 47% of the students in node 2 receive an A, B, or C. For the 3% of these students with CGPA less than 2.5, only 17% receive an A, B, or C (node 4). The decision tree gives a very clear picture of the relative variable importance (higher variables in the tree are more important) and the threshold of risk of receiving a D, F, or W at each level of the tree.

[Figure 3. Decision tree for predicting the DFW outcome for Sample 2 using institutional data and data available in-class at the end of week 2.]
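A single tree like the one in Fig. 3 can be grown with the rpart package (a standard CART implementation; we do not know that it is the exact tool the authors used), again with our assumed objects:

    library(rpart)
    library(rpart.plot)   # node-labeled plots in the style of Fig. 3

    # Grow one interpretable tree on the training data using all variables.
    tree <- rpart(DFW ~ ., data = students[train, ], method = "class")

    # Plot with each node showing the majority class, the class fractions,
    # and the percentage of the training data contained in the node.
    rpart.plot(tree, type = 2, extra = 104)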
IV. DISCUSSION
This study sought to answer three research questions; they will be addressed in the order proposed.
RQ1: How can machine learning algorithms be applied to predict unbalanced physics class outcomes?
Study 1 used random forests and logistic regression to predict which students would receive an A or B in introductory physics. The default random forest parameters were used to build the models, and the models were characterized by their overall accuracy, κ, and AUC. Because the outcome variable was fairly balanced, with 63% of the students receiving an A or B, overall accuracy provided an acceptable measure of model performance. The pure guessing accuracy was 63%, and therefore, this statistic could vary over the range 63% to 100% as variables were added to the model.

In the current work, the methods introduced in Study 1 were unproductive because the outcome variable, predicting the DFW outcome, was substantially unbalanced, with only 10% (Sample 2) to 20% (Sample 3) of the students receiving this outcome. For this outcome, the pure guessing overall accuracy (simply predicting everyone receives an A, B, or C) is from 80% to 90%, making it an inappropriate statistic to judge model quality. This work introduced the DFW accuracy and precision as more useful statistics to evaluate model performance. In Sample 1, using the default random forest algorithm parameters (Table V, Default model) produced a model with very low DFW accuracy, identifying only 16% of the students who actually received a D, F, or W in the test dataset; however, 57% of its predictions were correct. This does not necessarily make it a bad model, rather a model that is tuned for a specific purpose where it is much more important for the predictions to be correct than it is to identify the most potentially at-risk students. This might be useful for an application that tries to identify students for a high-cost or non-negligible-risk intervention where only the most likely at-risk students could be accommodated.

Multiple methods were explored to improve model performance: oversampling, undersampling, hyperparameter tuning, and grid search. This exploration is described in the Supplemental Material [37]. All methods improved the balance of DFW accuracy and precision. Oversampling led to models that overfit the data and was not used. Grid search showed that, for this dataset, it was always possible to use hyperparameter tuning by adjusting the decision threshold, without having to undersample, to produce a model with a balance of DFW accuracy and precision. The decision threshold for the models in Table V, excluding the Default model and the models applied only to underrepresented groups, was adjusted for each model to balance DFW accuracy and precision. For the Overall model of Sample 1, this produced a model with substantially higher DFW accuracy and κ than the Default model; however, it still identified only 43% of the students who would receive a D, F, or W (DFW accuracy = 0.43) and had κ = 0.36, in the range of fair agreement by Cohen's criteria.

Sample 2 restricted the time frame in which the institutional data were collected to a 3-year period in which the course studied had a consistent instructional environment. Even though the size of the sample was much smaller, model performance was improved, showing that it is important to collect the training sample for a period where the class was presented in the same form as the class in which the model will be used.

The Sample 2 model using only institutional variables was much better than models using only in-class variables early in the semester. If an instructor wants to develop classification models for the prediction of students at risk early in the semester, accessing a set of institutional data can substantially improve the models. The combination of institutional and in-class variables gave the highest model performance, with an improvement of 3% in week 1, 6% in week 2, 9% in week 5 (when test 1 grades were available), and 18% in week 8 (when test 2 grades were available) compared to the model containing only institutional variables. As such, for identification of at-risk students early in the semester, most of the prediction accuracy can be achieved with institutional data alone.

Sample 3 included a more restricted set of institutional variables than Sample 1, but included a variable indicating socioeconomic status and featured a more demographically diverse population. The overall model for this sample had weaker performance metrics than the Overall model for Sample 1 or the Institutional model for Sample 2. When the set of variables used in Sample 1 was restricted to be approximately those used in Sample 3, model performance was commensurate. It is, therefore, important for improving model performance to work with institutional research to provide the machine learning algorithms with as rich a set of data as possible.
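For completeness, a sketch of the undersampling approach mentioned above (our own illustration; the procedure actually used is detailed in the Supplemental Material [37]):

    # Undersample the majority (ABC) class so the training data are balanced
    # before fitting; this trades away training data for class balance.
    train_data <- students[train, ]
    dfw_rows   <- which(train_data$DFW == "DFW")
    abc_rows   <- which(train_data$DFW == "ABC")
    keep       <- c(dfw_rows, sample(abc_rows, length(dfw_rows)))
    rf_under   <- randomForest(DFW ~ ., data = train_data[keep, ])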
RQ2: Does classification accuracy differ for underrepresented groups in physics? If so, how and why does it differ?
For Samples 1 and 3, once the model was constructed for the full training dataset, the overall model was used to classify demographic subgroups in the test dataset separately, as shown in Table V. These models examined women, URM students, first-generation college students, and low SES students. In both samples, the model performance metrics for some minority demographic groups were different (either better or worse) than the overall model; however, these differences were within one standard deviation of the overall model. As such, the classifier built on the full training dataset predicted the outcomes of underrepresented physics students with approximately equal accuracy. While the differences observed in Table V are within the error of the sample, should significant differences be detected, it is possible to re-tune the models for each underrepresented group separately.

Figure 1 and similar figures in the Supplemental Material [37] show the demographic variables gender, URM, FirstGen, and SES are of low importance in the classification models. This is likely because these factors already have a general effect on other variables included in the models, such as CGPA. The Supplemental Material [37] includes an analysis which undersamples the majority demographic class (for example, men) to produce a more balanced dataset (for example, a dataset with the same number of men and women) (Supplemental Figs. 7 to 9). The variable importance of the demographic variables used in this study was fairly consistent with the rate of undersampling, showing that the low importance was not simply a result of the lower number of students from minority demographic groups in the sample.

To further investigate the low variable importance of the demographic variables, we examined a more diverse population (Sample 3). Model performance metrics were consistent with those obtained from Sample 1, suggesting the low variable importance was not the result of the restricted number of underrepresented students in the sample.
RQ3: How can the results of a machine learning analysis be used to better understand and improve physics instruction?
Once a classification model is constructed, the same model can be used to characterize new groups of students. Sections III B and III C presented three different possible analyses that can be performed with classification models that have classroom applications.

The first analysis computed the variable importance of each variable in the classifier, Fig. 1. This is done by finding the mean decrease in some performance metric when the variable is removed from the model. This analysis allows the identification of the variables which are most predictive of a student receiving a D, F, or W. It can show a working instructor where to look in complex institutional datasets and allow departments to shape their data requests.

The second analysis computed a probability of receiving an A, B, or C for each individual student. This was plotted for each actual grade received in Fig. 2. This allows an individual quantitative risk to be assigned to each student. This risk could be updated as the semester progresses based on in-class performance.

The final analysis computed a decision tree, Fig. 3. This tree shows the decision thresholds which indicate the levels of the variables that are important in classifying at-risk students. As long as the instructional setting and assignment policy remain consistent, these trees can be reused semester to semester without having to rerun the analysis. The tree shows that homework average, CGPA, and the percentage of hours completed were important in the decision to classify a student at risk of a DFW outcome.

These analysis results represent examples of the additional tools classification algorithms can provide instructors; many more examples could be given. The following represent some of the applications of these results being considered at Institution 1. These applications are designed around the principle that any additional instructional activity must potentially benefit all students. The models are far from perfect and, as such, all students may actually be at risk, so any intervention must be available to any student.
Informing Resource Allocation:
Students in physics classes at Institution 1 elect laboratory sections where a substantial part of the interactive instruction in the course is presented. Because a success probability can be generated for each student, an average probability of success could be calculated for each laboratory section. More experienced teaching assistants could be assigned to the sections most at risk. The department also has a Learning Assistant (LA) [29] program using a for-credit model. Learning Assistants are not available for all lab sections; allocation of LAs to at-risk lab sections could be prioritized.
Planning Revised Assignment Policy:
The decision tree in Fig. 3 and the variable importance measures in Fig. 1 show that the homework grade in the second week of the class is the most important variable for predicting success and give a homework score threshold of 62% as the highest-level decision for predicting success or failure. To develop the habit of completing homework and investing sufficient effort to do well on homework, a policy allowing the reworking of homework assignments which received a grade of less than 60% for additional (or initial) credit could be implemented early in the class.
Planning Student Communication:
Instructors can use the variable importance results to provide general advice to students with low homework grades and encourage them to seek additional help by attending office hours or to change habits so homework assignments are started earlier and sufficient time is allowed for completion. In general, an instructor of a large service course does not have time to personally communicate with each student; however, the combination of the individual success probability, variable importance, and variable decision thresholds would allow an instructor to monitor and communicate directly with a small subset of students particularly at risk in the class. These communications could let the students know that the instructor noticed that early homework assignments needed additional work and suggest strategies for improvement, opening channels of personal communication with at-risk students.

Many other potential instructional uses of this type of analysis are possible. Naturally, if an intervention is successful, it will modify student outcomes, changing students' risk profiles. The classifier will need to be rebuilt using student outcomes after the implementation of the intervention to reflect this modified risk.

While using the random forest algorithm to make predictions is technically fairly straightforward for instructors trained in physics (the base code is presented in the Supplemental Material [37]), obtaining the institutional dataset may present a substantial barrier for overworked instructors of large service introductory classes. As such, we present some recommendations for managing the process of obtaining institutional data.

Gathering additional data for use by instructors should probably be the responsibility of a departmental committee or staff. The data required for different classes are quite similar. A departmental data committee would also be able to establish ethical standards for the use and handling of the data. Some effort will be needed to understand the data available at the institutional level and to work with institutional research to fine-tune the data request. For example, if one requests a basic set of demographic and descriptive variables about students enrolled in a course over a number of semesters, the GPA variable provided will probably be the student's current GPA, where one actually wants the student's GPA before he or she enrolled in the class of interest. Some interaction would also be required to develop variables such as the student's math readiness or the fraction of classes completed. However, once a set of variables is identified, institutional records can quickly generate the data for the department each semester. Once the institutional data are acquired and understood, applying the machine learning code is fairly straightforward. It is also worth pursuing the possibility that institutional research could handle the entire process and provide a machine learning risk analysis to interested instructors. Student retention is of vital interest to most institutions, with retention in core mathematics and science classes an important part of the puzzle.
V. ETHICAL CONSIDERATIONS
The results of a machine learning classification represent a new tool for physics instructors to shape instruction; as with any tool, it can be used correctly or misused. If an instructor is to use the predictions of a classification algorithm, it is important that these results do not bias his or her treatment of individual students. Figure 2 shows that it is possible for students with a very low predicted probability of earning an A, B, or C to earn a C or higher in the class. Machine learning algorithms will never be 100% accurate, and this should be taken into account in any application of their results. Further, while the classification results may be used to direct resources to the students most at risk, this should be done with the goal of improving instruction for all students. Machine learning results should also not be used to exclude students from additional educational activities designed to support at-risk students. Because the predictions are not 100% accurate, additional tutoring sessions or similar resources should be available to all; however, the results of classification models could be used to deliver encouragement to the students most at risk to avail themselves of these opportunities.
VI. CONCLUSIONS
This work applied the random forest machine learning algorithm to predict whether introductory mechanics students would receive a grade of D or F or withdraw from a physics class. Metrics and methods applied in previous work produced classification models with poor performance; however, selecting metrics appropriate for unbalanced outcomes and tuning the random forest models greatly improved the classification accuracy of the DFW outcome. Classification models performed similarly for students from two institutions with very different demographic characteristics. Models with a richer set of institutional variables were somewhat (7%) more accurate than models with a limited set of variables. The addition of in-semester variables, particularly homework averages and test scores, improved model performance. The institutional model far outperformed a model using only in-semester variables early in the semester; the performance of the in-semester-only models exceeded that of the institutional-only models once the first test was included as a variable. The classifier trained on the full set of students produced somewhat different performance for women, underrepresented minority students, and first-generation college students, with some metrics improved and some weaker for these students. Once a classifier is constructed, multiple new analyses are available, allowing the direction of additional resources to at-risk students.
ACKNOWLEDGMENTS
This work was supported in part by the National Science Foundation under grants ECR-1561517 and HRD-1834569.

[1] D.E. Meltzer and R.K. Thornton, "Resource letter ALIP–1: Active-learning instruction in physics," Am. J. Phys. 80, 478–496 (2012).
[2] S. Freeman, S.L. Eddy, M. McDonough, M.K. Smith, N. Okoroafor, H. Jordt, and M.P. Wenderoth, "Active learning increases student performance in science, engineering, and mathematics," P. Nat. Acad. Sci. 111, 8410–8415 (2014).
[3] President's Council of Advisors on Science and Technology, "Report to the President. Engage to Excel: Producing One Million Additional College Graduates with Degrees in Science, Technology, Engineering, and Mathematics," Executive Office of the President: Washington, DC (2012).
[4] K. Rask, "Attrition in STEM fields at a liberal arts college: The importance of grades and pre-collegiate preferences," Econ. Educ. Rev., 892–900 (2010).
[5] X. Chen, "STEM attrition: College students' paths into and out of STEM fields. NCES 2014-001," National Center for Education Statistics (2013).
[6] E.J. Shaw and S. Barbuti, "Patterns of persistence in intended college major with a focus on STEM majors," NACADA J., 19–34 (2010).
[7] A.V. Maltese and R.H. Tai, "Pipeline persistence: Examining the association of educational experiences with earned degrees in STEM among US students," Sci. Educ., 877–907 (2011).
[8] G. Zhang, T.J. Anderson, M.W. Ohland, and B.R. Thorndyke, "Identifying factors influencing engineering student graduation: A longitudinal and cross-institutional study," J. Eng. Educ., 313–320 (2004).
[9] B.F. French, J.C. Immekus, and W.C. Oakes, "An examination of indicators of engineering students' success and persistence," J. Eng. Educ., 419–425 (2005).
[10] R.M. Marra, K.A. Rodgers, D. Shen, and B. Bogue, "Leaving engineering: A multi-year single institution study," J. Eng. Educ., 6–27 (2012).
[11] C.W. Hall, P.J. Kauffmann, K.L. Wuensch, W.E. Swart, K.A. DeUrquidi, O.H. Griffin, and C.S. Duncan, "Aptitude and personality traits in retention of engineering students," J. Eng. Educ., 167–188 (2015).
[12] P. Baepler and C.J. Murdoch, "Academic analytics and data mining in higher education," Int. J. Scholarsh. Teach. Learn., 17 (2010).
[13] R.S.J.D. Baker and K. Yacef, "The state of educational data mining in 2009: A review and future visions," J. Educ. Data Min., 3–17 (2009).
[14] Z. Papamitsiou and A.A. Economides, "Learning analytics and educational data mining in practice: A systematic literature review of empirical evidence," J. Educ. Tech. Soc. (2014).
[15] A. Dutt, M.A. Ismail, and T. Herawan, "A systematic review on educational data mining," IEEE Access, 15991–16005 (2017).
[16] C. Romero and S. Ventura, "Educational data mining: A review of the state of the art," IEEE T. Syst. Man Cy. C, 601–618 (2010).
[17] C. Zabriskie, J. Yang, S. DeVore, and J. Stewart, "Using machine learning to predict physics course outcomes," Phys. Rev. Phys. Educ. Res. 15, 020120 (2019).
[18] A. Peña-Ayala, "Educational data mining: A survey and a data mining-based analysis of recent works," Expert Syst. Appl., 1432–1462 (2014).
[19] C. Romero, S. Ventura, P.G. Espejo, and C. Hervás, "Data mining algorithms to classify students," in Proceedings of the 1st International Conference on Educational Data Mining, edited by R.S. Joazeiro de Baker, T. Barnes, and J.E. Beck (Montreal, Quebec, Canada, 2008).
[20] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning with Applications in R, Vol. 112 (Springer-Verlag, New York, NY, 2017).
[21] A.C. Müller and S. Guido, Introduction to Machine Learning with Python: A Guide for Data Scientists (O'Reilly Media, Boston, MA, 2016).
[22] A.M. Shahiri, W. Husain, and N.A. Rashid, "A review on predicting student's performance using data mining techniques," Procedia Comput. Sci., 414–422 (2015).
[23] S. Huang and N. Fang, "Predicting student academic performance in an engineering dynamics course: A comparison of four types of predictive mathematical models," Comput. Educ., 133–145 (2013).
[24] F. Marbouti, H.A. Diefes-Dux, and K. Madhavan, "Models for early prediction of at-risk students in a course using standards-based grading," Comput. Educ., 1–15 (2016).
[25] L.P. Macfadyen and S. Dawson, "Mining LMS data to develop an early warning system for educators: A proof of concept," Comput. Educ., 588–599 (2010).
[26] U. bin Mat, N. Buniyamin, P.M. Arsad, and R. Kassim, "An overview of using academic analytics to predict and improve students' achievement: A proposed proactive intelligent intervention," in Engineering Education (ICEED), 2013 IEEE 5th Conference on (IEEE, 2013), pp. 126–130.
[27] J.M. Aiken, R. Henderson, and M.D. Caballero, "Modeling student pathways in a physics bachelor's degree program," Phys. Rev. Phys. Educ. Res. 15, 010128 (2019).
[28] "US News & World Report: Education," US News and World Report, Washington, DC, https://premium.usnews.com/best-colleges. Accessed 2/23/2019.
[29] V. Otero, S. Pollock, and N. Finkelstein, "A physics department's role in preparing physics teachers: The Colorado Learning Assistant model," Am. J. Phys. 78, 1218–1224 (2010).
[30] L.C. McDermott and P.S. Shaffer, Tutorials in Introductory Physics (Prentice Hall, Upper Saddle River, NJ, 1998).
[31] E. Elby, R.E. Scherr, T. McCaskey, R. Hodges, T. Bing, D. Hammer, and E.F. Redish, "Open Source Tutorials in Physics Sensemaking," http://umdperg.pbworks.com/w/page/10511218/OpenSourceTutorials. Accessed 9/17/2018.
[32] A.L. Traxler, X.C. Cid, J. Blue, and R. Barthelemy, "Enriching gender in physics education research: A binary past and a complex future," Phys. Rev. Phys. Educ. Res. 12, 020114 (2016).
[33] R.K. Thornton and D.R. Sokoloff, "Assessing student learning of Newton's laws: The Force and Motion Conceptual Evaluation and the evaluation of active learning laboratory and lecture curricula," Am. J. Phys. 66, 338–352 (1998).
[34] [Entry lost in the source; cited above for Pell grant eligibility.]
[35] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, Classification and Regression Trees (Wadsworth & Brooks/Cole, Monterey, CA, 1984).
[36] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer-Verlag, New York, NY, 2009).
[37] See Supplemental Material at [URL will be inserted by publisher] for model tuning, the investigation of underrepresented groups, and sample random forest code.
[38] T. Fawcett, "An introduction to ROC analysis," Pattern Recogn. Lett., 861–874 (2006).
[39] J. Cohen, Statistical Power Analysis for the Behavioral Sciences (Academic Press, New York, NY, 1977).
[40] D.G. Altman, Practical Statistics for Medical Research (CRC Press, Boca Raton, FL, 1990).
[41] D.W. Hosmer Jr., S. Lemeshow, and R.X. Sturdivant, Applied Logistic Regression, 3rd ed. (Wiley, Hoboken, NJ, 2013).