Extending Machine Learning to Predict Unbalanced Physics Course Outcomes
Seth DeVore, Jie Yang, and John Stewart∗
Department of Physics and Astronomy, West Virginia University, Morgantown WV, 26506
(Dated: February 7, 2020)
∗ [email protected]

Machine learning algorithms have recently been used to classify students as those likely to receive an A or B or students likely to receive a C, D, or F in a physics class. The performance metrics used in that study become unreliable when the outcome variable is substantially unbalanced. This study seeks to further explore the classification of students who will receive a C, D, or F and to extend those methods to predicting whether a student will receive a D or F. The sample used for this work (N = 7184) is substantially unbalanced with only 12% of the students receiving a D or F. Applying the same methods as the previous study produced a classifier that was very inaccurate, classifying only 20% of the D or F cases correctly. This study will focus on the random forest machine learning algorithm. By adjusting the random forest decision threshold, the correct classification rate of the D or F outcome rose to 46%. This study also investigated the previous finding that demographic variables such as gender, underrepresented minority status, and first-generation status had low variable importance for predicting class outcomes. Downsampling revealed that this was not the result of the underrepresentation of these students. An optimized classification model was constructed which predicted the D and F outcome with 46% accuracy and the C, D, and F outcome with 69% accuracy; the accuracy of prediction of these outcomes is called "sensitivity" in the machine learning literature. Substantial variation was detected when this classification model was applied to predict the C, D, or F outcome for underrepresented demographic groups, with 61% sensitivity for women, 67% for underrepresented minority students, and 78% for first-generation students. Similar variation was observed for the D and F outcome.

I. INTRODUCTION
Physics courses, along with other core science and mathematics courses, form key hurdles for Science, Technology, Engineering, and Mathematics (STEM) students early in their college career. Student success in these classes is important to improving STEM retention; the success of students traditionally underrepresented in STEM disciplines in the core classes may be a factor limiting efforts to increase inclusion in STEM fields. Physics Education Research (PER) has developed a wide range of research-based instructional materials and practices to help students learn physics [1]. Research-based instructional strategies have been demonstrated to increase student success and retention [2]. While some of these strategies are easily implemented for large classes, others have substantial implementation costs. Further, no class could implement all possible research-based strategies, and some may be more appropriate for some subsets of students than for others. One method to allow the direction of resources to students who would most benefit would be to identify at-risk students early in physics classes. The effective classification of students at risk in physics classes represents a promising new research strand in PER.

The need for STEM graduates continues to increase at a rate that is outstripping STEM graduation rates across American institutions. A 2012 report from the President's Council of Advisors on Science and Technology [3] identified the need to increase graduation of STEM majors to avoid a projected shortfall of one million STEM job candidates over the next decade. The U.S. Department of Education reported that STEM attrition rates range from 59% for computer/information science majors to 38% for math majors, with an average of 48% [4]. With demand for jobs requiring at least a STEM bachelor's degree growing to 20% of the job market over the last decade [5], but with STEM degree completion rates remaining only 40% among students initially majoring in STEM [3], improvement in retention could relieve some of the growing shortfall. Targeting interventions to students at risk in core introductory science and mathematics courses taken early in college offers one potential mechanism to improve STEM graduation rates.

Improving STEM retention has long been an important area of investigation for science education researchers [4, 6-12]. Many studies have shown that measures of pre-college preparation (i.e., high school GPA and ACT or SAT scores) in concert with college performance measures such as college GPA are strongly predictive of student success. With introductory courses in physics, mathematics, and chemistry being high attrition points for STEM majors, work focused on identifying factors related to student success in these courses is key to understanding the retention problem. In recent years, educational data mining has become a prominent method of analyzing student data to inform course redesign [13-17].

A. Prior Study: Study 1
This study extends and explores the results of Zabriskie et al. [18], which will be referenced as Study 1 in this work. Study 1 used institutional data such as ACT scores and college GPA (CGPA) as well as data collected within a physics class such as homework grades and test scores to predict whether a student would receive an A or B in the first and second semester calculus-based physics classes at a large university. The study used both logistic regression and random forests to classify students. Random forest classification using only institutional variables was 73% accurate for the first semester class. This accuracy increased to 80% by the fifth week of the class when in-class variables were included. The logistic regression and random forest classification algorithms generated very similar results. Study 1 chose to predict A and B outcomes, rather than the more important A, B, and C outcomes, partially because the sample was significantly unbalanced. Sample imbalance makes classification accuracy more difficult to interpret. The study also made a number of decisions about classification parameters such as the relative size of the test and training datasets, the number of trees grown, and the decision threshold (explained in Sect. II) which should be further explored. Study 1 also investigated the effect of a number of demographic variables on grade prediction (gender, underrepresented minority status, and first-generation status) and found they were not important to grade classification. These groups were very underrepresented in the courses studied; it was unclear to what extent sample imbalance caused by underrepresentation was the cause of the low importance of the demographic variables.
B. Research Questions
This study seeks to more fully explore the random forest machine learning algorithm and explore questions left unanswered by Study 1. It also seeks to extend the application of the algorithm to unbalanced dependent and independent variables. This study seeks to answer the following research questions:

RQ1: How can machine learning algorithms be applied to predict unbalanced physics class outcomes?

RQ2: What is a productive set of performance metrics to characterize the classification algorithms?

RQ3: What sample size is required for accurate prediction of physics class outcomes?

RQ4: How does classification accuracy differ for groups underrepresented in physics? How can machine learning models be optimized to predict the outcomes of all groups with equal accuracy?
C. Educational Data Mining
Educational Data Mining (EDM) can be described as the use of statistical, machine learning, and traditional data mining methods to draw conclusions from large educational datasets while incorporating predictive modeling and psychometric modeling [17]. In a 2014 meta-analysis of 240 EDM articles by Peña-Ayala, 88% were found to use a statistical and/or machine learning approach to draw conclusions from the data presented. Of these studies, 22% analyzed student behavior, 21% examined student performance, and 20% examined assessments [19]. Peña-Ayala also found that classification was the most common method used in EDM, applied in 42% of all analyses, with clustering used in 27% and regression in 15% of studies.

Educational Data Mining encompasses a large number of statistical and machine learning techniques, with logistic regression, decision trees, random forests, neural networks, naive Bayes, support vector machines, and K-nearest neighbor algorithms commonly applied [20]. Peña-Ayala's [19] analysis found 20% of studies employed Bayes theorem and 18% decision trees. Decision trees and random forests are among the more commonly used techniques in EDM, making them suitable techniques with which to investigate our research questions and explore techniques to assess the success of machine learning algorithms. More information on the fundamentals of these and other machine learning techniques is readily available through a number of machine learning texts [21, 22].
D. Grade Prediction and Persistence
While EDM is used for a wide array of purposes, it has often been used to examine student performance and persistence. One survey by Shahiri et al. summarized 30 studies in which student performance was examined using EDM techniques [23]. Neural networks and decision trees were the two most common techniques used in studies examining student performance, with naive Bayes, K-nearest neighbors, and support vector machines used in some studies. A study by Huang and Fang examined student performance on the final exam for a large-enrollment engineering course using measurements of college GPA, performance in 3 prerequisite math classes as well as Physics 1, and student performance on in-semester examinations [24]. They analyzed the data using a large number of techniques commonly used in EDM and found relatively little difference in the accuracy of the resulting models. Study 1 also found little difference in the performance of machine learning algorithms in predicting physics grades. Another study examining an introductory engineering course by Marbouti et al. used an array of EDM techniques to predict student grade outcomes of C or better [25]. They used in-class measures of student performance including homework, quiz, and exam 1 scores and found that logistic regression provided the highest accuracy at 94%. A study by Macfadyen and Dawson attempted to identify students at risk of failure in an introductory biology course [26]. Using logistic regression, they were able to identify students failing (defined as having a grade of less than 50%) with 81% accuracy. Interest in grade prediction and persistence in STEM classes has risen to the point where many universities are using EDM techniques to improve retention of STEM students [27].

The use of machine learning techniques in physics classes has only begun recently. Beyond Study 1, random forests were used in a 2018 study by Aiken et al. to predict student persistence as physics majors and to identify the factors that are predictive of students either remaining physics majors or becoming engineering majors [28].

Table I. Full list of variables.

Variable   Type         Description
Gender     Dichotomous  Gender (Men = 1, Women = 0).
URM        Dichotomous  Student does not identify as White non-Hispanic or Asian (True = 1, False = 0).
CalReady   Dichotomous  The first math class taken was calculus or more advanced (True = 1, False = 0).
FirstGen   Dichotomous  Student is a first-generation college student (True = 1, False = 0).
CmpPct     Continuous   Percentage of hours attempted that were completed at the start of the course.
CGPA       Continuous   College GPA at the start of the course.
STEMHrs    Continuous   Number of STEM (Math, Bio, Chem, Eng, Phys) credit hours completed at the start of the course.
HrsCmp     Continuous   Total credit hours earned at the start of the course.
HrsEnroll  Continuous   Total credit hours enrolled at the start of the course.
HSGPA      Continuous   High school GPA.
ACTM       Continuous   ACT/SAT mathematics percentile.
ACTV       Continuous   ACT/SAT verbal percentile.
APCredit   Continuous   Number of credit hours received for AP tests.
TransCrd   Continuous   Number of credit hours received for transfer courses.
II. METHODS

A. Sample
This study was performed using course grades from the introductory, calculus-based mechanics course (Physics 1) taken by physical science and engineering students at a large eastern land-grant university serving approximately 30,000 students. The general university undergraduate population had ACT scores ranging from 21 to 26 (25th to 75th percentile) [29]. The overall undergraduate demographics were 80% White, 4% Hispanic, 6% international, 4% African American, 4% students reporting two or more races, 2% Asian, and other groups each with 1% or less [29]. The sample was primarily male (82%).

The sample for this study was drawn from institutional records and includes all students who completed Physics 1 from 2000 to 2018, a total of 7184 students. Over the period studied, the instructional environment of the course varied widely, and as such, the results of this study may provide a general picture of the performance of machine learning algorithms to predict physics grades. Improved performance might arise within more stable instructional environments. Prior to the spring 2011 semester, the course was presented traditionally, with multiple instructors teaching largely traditional lectures and students performing cookie-cutter laboratory exercises. In spring 2011, the department made a commitment to improved instruction with the implementation of a Learning Assistant (LA) program [30] and the hiring of an expert educator to manage the program. This educator brought the Peer Instruction pedagogy to the lecture component of the course [31]. Learning Assistants were instructed in reformed pedagogy and presented lessons from the University of Washington
Tutorials in Introductory Physics [32] in the laboratory sections; students also continued to perform traditional laboratory experiments in the same sessions. In fall 2015, the program was modified because of a change in funding, with LAs assigned to only a subset of laboratory sections. The course introduced a team-teaching model at this time featuring strong coordination of the lecture and laboratory components; there was little coordination between the lecture and laboratory components prior to fall 2015. The Tutorials were replaced with open source materials [33], which lowered textbook cost to students and allowed full integration of the research-based materials with the laboratory activity.
B. Variables
The variables used in this study were drawn from institutional records and are shown in Table I. Two types of variables are used: two-level dichotomous variables and continuous variables. A few variables require additional explanation. The variable CalReady measures the student's math-readiness. Calculus 1 is a prerequisite for Physics 1. For the vast majority of students in Physics 1, the student's four-year degree plan assumes the student enrolls in Calculus 1 their first semester at the university. These students are considered "math ready." A substantial percentage of the students at the institution studied are not math ready. Study 1 used a 3-level math-readiness variable; this study uses a 2-level variable to allow a more thorough exploration of unbalanced dichotomous independent variables. The variable STEMHrs captures the number of credit hours of STEM classes completed before the start of the course modeled. STEM classes include mathematics, biology, chemistry, engineering, and physics classes.

Demographic information was also collected from institutional records. Students self-report first-generation status; students are considered first generation if neither of their parents completed a four-year degree. Racial and ethnic information was also accessed. A student was classified as an underrepresented minority (URM) student if they reported Hispanic ethnicity or a race other than White or Asian. Gender was also collected from university records; for the period studied, gender was recorded as a binary variable by the institution. While not optimal, this reporting is consistent with the use of gender in most studies in PER; for a more nuanced discussion of gender and physics, see Traxler et al. [34].
C. Random Forest Classification Models
This work employs the random forest machine learning algorithm to predict students' final grade outcomes in introductory physics. Random forests are one of many machine learning classification algorithms. Study 1 reported that most machine learning algorithms had similar performance when predicting physics grades. A classification algorithm seeks to divide a dataset into multiple classes. This study will classify students as those who will receive an A or B (AB) and students who will receive a C, D, or F (CDF) in Physics 1, following Study 1. It will also classify students who will receive an A, B, or C (ABC) and students who will receive a D or F (DF). This classification is fairly unbalanced and will require additional techniques.

To understand the performance of a classification algorithm, the dataset is first divided into test and training datasets. The training dataset is used to develop, or train, the classification model. The test dataset is then used to characterize the model. The classification model is used to predict the outcome of each student in the test dataset; this prediction is compared to the actual outcome. Section II D discusses performance metrics used to characterize the success of the classification algorithm.

The random forest algorithm uses decision trees, another machine learning classification algorithm. Decision trees work by splitting the dataset into two or more subgroups based on one of the model variables. The variable selected for each split is chosen to divide the dataset into the two most homogeneous subsets of outcomes possible, that is, subsets with a high percentage of one of the two classification outcomes. The variable and the threshold for the variable represent the decision for each node in the tree. For example, one node may split the dataset using the criterion (the decision) that a student's college GPA is less than 3.2. The process continues by splitting the subsets, forming the decision tree, until each node contains only one of the two possible outcomes. Decision trees are less susceptible to multicollinearity than many statistical methods common in PER such as linear regression [35].

Random forests extend the decision tree algorithm by growing many trees instead of a single tree. The "forest" of decision trees is used to classify each instance in the data; each tree "votes" on the most probable outcome. The decision threshold determines what fraction of the trees must vote for the outcome for the outcome to be selected as the overall prediction of the random forest. Random forests use bootstrapping to prevent one variable from being obscured by another variable. Individual trees are grown on Z subsamples generated by sampling the training dataset with replacement. Each of these samples is fit using a subset of size m of the variables, m = √k, where k is the number of independent variables in the model [36]. This method ensures the trees are not correlated and that the stronger variables do not overwhelm weaker variables [21]. The "randomForest" package in "R" was used for the analysis. This package provides a measure of variable importance, the mean decrease in accuracy [37]. The mean decrease in accuracy is the average decrease in classification accuracy if the variable is removed [36]. This work uses bootstrapping to produce similar variable importance measures for other performance metrics.
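As a concrete illustration of this procedure (a minimal sketch, not the analysis code used for this study), the R fragment below grows a forest with the "randomForest" package and extracts the mean decrease in accuracy; the data frame students and its two-level outcome factor are hypothetical stand-ins for the Table I variables.

library(randomForest)

# Hypothetical data: one row per student, the Table I variables plus a
# two-level factor "outcome" with levels "ABC" and "DF".
set.seed(42)
train_idx <- sample(nrow(students), floor(nrow(students) / 2))  # 50% test-train split
train <- students[train_idx, ]
test  <- students[-train_idx, ]

# Grow the forest; for classification, mtry defaults to sqrt(k), and
# importance = TRUE requests the mean decrease in accuracy measure.
rf <- randomForest(outcome ~ ., data = train, ntree = 200, importance = TRUE)

pred <- predict(rf, newdata = test)  # majority vote at the default 0.50 threshold
imp  <- importance(rf, type = 1)     # type 1 = mean decrease in accuracy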
D. Performance Metrics

The confusion matrix [38] as shown in Table II summarizes the results of a classification algorithm and is the basis for calculating most model performance metrics. To construct the confusion matrix, the classification model developed from the training dataset is used to classify students in the test dataset. The confusion matrix categorizes the outcomes of this classification.
Table II. Confusion Matrix

                     Actual Negative       Actual Positive
Predicted Negative   True Negative (TN)    False Negative (FN)
Predicted Positive   False Positive (FP)   True Positive (TP)
For classification, one of the dichotomous outcomes is selected as the positive result. In the current study, we use the DF or CDF outcomes as the positive result. This choice was made because some of the model performance metrics focus on the positive results, and we feel that most instructors would be more interested in accurately identifying students at risk of failure.

From the confusion matrix, many performance metrics can be calculated. Study 1 reported the classification accuracy, the fraction of correct predictions, shown in Eqn. 1,

Accuracy = (TN + TP) / N_test    (1)

where N_test = TP + TN + FP + FN is the size of the test dataset.

Sensitivity, the true positive rate (TPR), and specificity, the true negative rate (TNR), characterize the rate of making accurate predictions of either the positive or negative class. Sensitivity is the fraction of the positive cases that are classified as positive (Eqn. 2),

Sensitivity = TPR = TP / (TP + FN) = TP / N_pos    (2)

where N_pos = TP + FN is the number of positive cases in the test dataset. Specificity is the fraction of the negative cases that are classified as negative (Eqn. 3),

Specificity = TNR = TN / (TN + FP) = TN / N_neg    (3)

where N_neg = TN + FP is the number of negative cases in the test dataset.

Sensitivity and specificity can be adjusted by changing the strictness of the classification criteria. If the model classifies even slightly promising cases as positive, it will probably classify most actually positive cases as positive, producing a high sensitivity. It will also make a lot of mistakes; the precision, or the positive predictive value (PPV), captures the rate of making correct predictions and is defined as the fraction of the positive predictions which are correct (Eqn. 4),

Precision = PPV = TP / (TP + FP)    (4)

This study will seek models that balance sensitivity and precision; however, the correct balance for a given application must be selected based on the individual features of the situation. If there is little cost and no risk to an intervention, then optimizing for higher sensitivity might be the correct choice to identify as many students in the positive class as possible. If the intervention is expensive or carries risk, optimizing the precision so that most students who are given the intervention are actually at risk might be more appropriate.

One challenge of applying machine learning methodologies to answer academic questions in PER is that some terms that have well-established meanings in physics, such as precision and accuracy, have been used differently in computer science. In what follows, we will use the traditional meaning of precision as how well a quantity is known, calling the computer science precision PPV. Accuracy will be as defined in Eqn. 1.

Beyond simply evaluating the overall performance of a classification algorithm, we would like to establish how much better the algorithm performs than pure guessing. The sample used in this study is substantially unbalanced between the DF or ABC outcomes, with 88% of the students receiving an A, B, or C. If a classification method guessed that all students would receive an A, B, or C (the negative outcome), then the classifier would have a sensitivity of 0, a specificity of 1, a PPV of 0, and an accuracy of 0.88. If the classifier guessed all students would receive a D or F, the sensitivity would be 1, specificity 0, PPV 0.12, and accuracy 0.12.

Additional performance metrics have been constructed to provide a more complete picture of model performance. Cohen's kappa, κ, measures agreement among observers [39], correcting for the effect of pure guessing as shown in Eqn. 5,

κ = (p − p_e) / (1 − p_e)    (5)

where p is the observed agreement and p_e is the agreement expected by chance. Fit criteria have been developed for κ, with κ from 0.2 to 0.4 representing fair agreement and 0.4 to 0.6 moderate agreement [40]. The Receiver Operating Characteristic (ROC) curve plots the sensitivity against 1 − Specificity. The Area Under the Curve (AUC) is a measure of the model's discrimination between the two outcomes; AUC is the integrated area under the ROC curve. For a classifier that uses pure guessing, the ROC curve is a straight line between (0,0) and (1,1) and the AUC is 0.5. An AUC of 1.0 represents perfect discrimination [38, 41]. Hosmer et al. [41] suggest an AUC threshold of 0.80 for excellent discrimination. Study 1 provided examples of ROC curves in the Supplemental Material.

Two other metrics attempt to balance multiple performance measures. The F metric is the harmonic mean of the precision and sensitivity (the positive predictive value and the true positive rate) shown in Eqn. 6,

1/F = (1/2)(1/PPV + 1/TPR)    (6)

As with the addition of parallel resistors, F gives a stronger weight to the smaller of the sensitivity and PPV. The g_mean metric is the geometric mean of sensitivity and specificity as shown in Eqn. 7,

g_mean = √(Sensitivity · Specificity)    (7)
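To make the definitions concrete, the R function below (an illustrative sketch, not code from the study) computes each of these metrics from the four cells of Table II; the arguments are the confusion matrix counts obtained by comparing predictions to the actual outcomes in the test dataset.

# Performance metrics (Eqns. 1-7) from the confusion matrix cells of Table II.
class_metrics <- function(tn, fn, fp, tp) {
  n_test      <- tn + fn + fp + tp
  accuracy    <- (tn + tp) / n_test          # Eqn. 1
  sensitivity <- tp / (tp + fn)              # Eqn. 2, true positive rate
  specificity <- tn / (tn + fp)              # Eqn. 3, true negative rate
  ppv         <- tp / (tp + fp)              # Eqn. 4, computer science "precision"
  # Cohen's kappa (Eqn. 5): observed agreement corrected for chance agreement.
  p_e   <- ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n_test^2
  kappa <- (accuracy - p_e) / (1 - p_e)
  f      <- 2 / (1 / sensitivity + 1 / ppv)  # Eqn. 6, harmonic mean
  g_mean <- sqrt(sensitivity * specificity)  # Eqn. 7, geometric mean
  c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity,
    ppv = ppv, kappa = kappa, f = f, g_mean = g_mean)
}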
E. Unbalanced Datasets
This study, as well as Study 1, used a number of dichotomous independent variables: Gender, FirstGen, URM, and CalReady. Each variable further divides both outcome classes. The division of the groups defined by these variables over the outcome variables is shown in Table IV.

The outcomes in the dataset used in this study are unbalanced; there are more students in the negative class (AB or ABC) than the positive class (CDF or DF); this imbalance is severe for the DF class. Imbalance in the training data can cause learning algorithms to perform poorly on the minority class [42-46]. To improve classification of the minority class, many different forms of resampling have been introduced. Random undersampling, or downsampling, balances the two classes by randomly eliminating majority class examples. Downsampling, however, reduces the overall training dataset size, which may reduce overall classification performance. Random oversampling, or upsampling, also balances the two classes by randomly duplicating minority class instances. This method is susceptible to overfitting because duplicating records causes the students who were duplicated to have more weight in the classification process than other students. More sophisticated upsampling methods have been constructed. The Synthetic Minority Oversampling Technique (SMOTE) [47] generates new synthetic minority cases rather than copying existing cases. It forms new minority case examples by interpolating existing examples that are near each other in the parameter space. In addition to creating a balanced dataset, cost-sensitive learning methods can also be used to improve performance with unbalanced datasets [42, 48, 49].
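For reference, the two simplest resampling schemes reduce to a few lines of base R; the sketch below assumes the hypothetical training frame train from above, whose factor column outcome has minority level "DF" (SMOTE and cost-sensitive learning require dedicated implementations and are not shown).

maj <- train[train$outcome != "DF", ]  # majority class (ABC)
mnr <- train[train$outcome == "DF", ]  # minority class (DF)

# Random undersampling (downsampling): shrink the majority class by
# sampling it without replacement until the classes are balanced.
train_down <- rbind(maj[sample(nrow(maj), nrow(mnr)), ], mnr)

# Random oversampling (upsampling): replicate minority cases by
# sampling them with replacement until the classes are balanced.
train_up <- rbind(maj, mnr[sample(nrow(mnr), nrow(maj), replace = TRUE), ])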
F. Bootstrapping
Bootstrapping is a computational method designed to eliminate distributional assumptions (and their violation) common in statistical methods. This study applies bootstrapping by creating randomly selected subsets of the full sample. This allows the uncertainty in performance metrics to be calculated. The random forest algorithm internally applies bootstrapping, growing a number of trees selected by the user. For the evaluation of test and training dataset size in Sec. III A, 1000 bootstrap replications were used, with each growing 1000 decision trees for a total of 1,000,000 decision trees per data point. This was computationally very expensive. Examination of the standard errors of this analysis, and consideration of the small number of independent variables, suggested that a less conservative selection of parameters would be appropriate. For the remainder of the analysis, 200 bootstrap replications growing 200 decision trees were used for a total of 40,000 trees per data point.
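A minimal sketch of one such bootstrap loop follows, reusing the hypothetical students frame and the class_metrics() helper sketched in Sec. II D; each replication redraws the subsample, refits the forest, and records the performance metrics so their spread can be summarized.

boot <- replicate(200, {
  idx <- sample(nrow(students), floor(nrow(students) / 2))  # fresh random subset
  rf  <- randomForest(outcome ~ ., data = students[idx, ], ntree = 200)
  cm  <- table(predict(rf, newdata = students[-idx, ]), students$outcome[-idx])
  class_metrics(tn = cm["ABC", "ABC"], fn = cm["ABC", "DF"],
                fp = cm["DF", "ABC"],  tp = cm["DF", "DF"])
})
apply(boot, 1, sd)              # standard deviation of each metric
apply(boot, 1, sd) / sqrt(200)  # standard error of the mean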
G. Standard Deviation and Standard Error
All tables and figures report the standard deviation of the performance metrics calculated for the set of bootstrap replications; error bars in figures are one standard deviation long. These measure the variation between multiple subsamples of the same dataset and provide a measure of the variation that should be expected as the classification model is applied to new data. In practice, the classification model would be constructed from some sample of past students, then applied to predict the outcomes of a new set of students. However, when comparing differences in performance metrics or determining if a variable importance is different than zero, the standard error of the mean should be used. The standard error divides the standard deviation by the square root of the number of observations used to calculate the standard deviation. For the test-train dataset size evaluation in Sec. III A, the standard error is the standard deviation divided by √1000 = 31.6; for other calculations, the standard error is the standard deviation divided by √200 = 14.1.

III. RESULTS
The purpose of this work is to further understand the classification of students who will receive a CDF outcome in Physics 1, as was done in Study 1, and to extend this work to the prediction of the more unbalanced DF outcome. Either dichotomous outcome variable divides the sample into two subsets with different academic characteristics. Table III presents overall academic performance measures for each outcome; the variables are defined in Table I. The dichotomous independent variables further divide the subsets defined by the outcome variables. The overall demographic composition of the sample is shown in Table IV.
A. Training and Test Dataset Size Requirements
For any quantitative analysis, it is important to ensure that a sufficient sample size is available to draw accurate conclusions. In machine learning, a large training dataset provides the learning algorithm with more unique cases from which to learn; model performance generally increases with training sample size [50]. As with most analyses, the precision with which the results are known increases with sample size. To determine how the precision of the model performance metrics changes with training dataset size, the original training dataset was randomly downsampled to produce smaller training datasets while the test dataset was held fixed. The sample was first split into a test and training dataset where each represented 50% of the original sample. This provides a large sample to train the algorithm and a large dataset to precisely characterize the model produced.
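Curves like those in Fig. 1 can be generated by repeatedly shrinking the training dataset while scoring against the fixed test dataset, as in the sketch below (hypothetical names as before); each fraction is bootstrapped, and both the resulting minority outcome size and the sensitivity are recorded.

size_curve <- lapply(c(0.1, 0.25, 0.5, 1.0), function(frac) {
  replicate(200, {
    sub <- train[sample(nrow(train), round(frac * nrow(train))), ]  # smaller training set
    rf  <- randomForest(outcome ~ ., data = sub, ntree = 200)
    cm  <- table(predict(rf, newdata = test), test$outcome)
    c(minority_n  = sum(sub$outcome == "DF"),        # horizontal axis of Fig. 1
      sensitivity = cm["DF", "DF"] / sum(cm[, "DF"]))
  })
})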
Table III. Academic performance measures (N, Physics Grade, ACT Math %, HSGPA, CGPA) for each outcome group and for the subgroups defined by the dichotomous variables. Values represent the mean ± the standard deviation.

Table IV. Demographic composition of sample. Each entry shows the number of students in each subgroup.

Predicting CDF
          CalReady                   FirstGen                     Gender          URM
Outcome   Cal Ready  Not Cal Ready   First Gen  Not First Gen     Men     Women   URM   Not URM
AB        3828       678             494        4012              3650    856     207   4299
CDF       1794       884             321        2357              2264    414     181   2497

Predicting DF
          CalReady                   FirstGen                     Gender          URM
Outcome   Cal Ready  Not Cal Ready   First Gen  Not First Gen     Men     Women   URM   Not URM
ABC       5065       1272            725        5612              5187    1150    326   6011
DF        557        290             90         757               727     120     62    785

Figure 1 plots the minority outcome size in the training dataset against the model accuracy, sensitivity, specificity, and PPV, as well as the standard deviations of these quantities. We expect the smaller subdivision of the outcome variable (the minority outcome) to be most important in determining precision, and therefore, precision is examined in terms of the minority outcome sample size. As expected, the standard deviation decreases as sample size increases. For all performance metrics, there is a weak increase up to a minority outcome sample size of approximately 100, with all performance measures becoming approximately constant above this value.

Study 1 commented on a higher than desirable false negative rate in its Limitations section (Study 1 coded the CDF outcome as negative). One can clearly see this effect in Fig. 1, where the sensitivity predicting the CDF outcome is approximately 60% while the specificity is 80%. The model predicts the CDF outcome substantially less effectively than the AB outcome. This effect becomes severe for the DF outcome, with a sensitivity of approximately 20%. For both the DF and CDF outcomes, the PPV is approximately 65%; therefore, 65% of the students classified as earning a DF or CDF actually do.

The standard deviation curves in Fig. 1 are somewhat different; this may have resulted from a ceiling or floor effect for some performance metrics limiting the standard deviations. The sample sizes required to achieve a desired precision are commensurate; for the CDF outcome, to achieve a precision of 0.025 in the sensitivity, 220 students are required in the minority outcome; for the DF outcome, 140 students are required for the same precision.

A similar analysis was performed for the test dataset; the model performance plots are shown in the Supplemental Material [51]. The relation of the performance metrics to the test dataset size was somewhat different than that of the training dataset. The test dataset size had no effect on the average value of the performance metrics. This was to be expected; increasing the test dataset size does not provide the learning algorithm with additional unique cases to improve the prediction algorithm. Slightly larger test datasets were required to achieve the same level of precision as the training dataset shown in Fig. 1. For predicting the CDF outcome, 260 students were required in the minority outcome test dataset to produce a sensitivity with standard deviation 0.025; for the DF outcome, 160 students were required. The uncertainty of the PPV was much larger than the other quantities for the DF outcome.

Originally, we sought to provide some guidelines on minimum required sample size and optimal test-train split ratio. We abandoned this goal because bootstrapping can provide confidence intervals and standard deviations for any quantity desired. The required sample size then reverts to the traditional decision in research, how much precision is required for the conclusions one wishes to draw from the analysis.
We will find that the test-train split is controlled by the need to retain a maximum number of the minority class of the dichotomous independent variables, as discussed in Sec. III D.

Figure 1. Model performance parameters as a function of the size of the minority outcome (CDF or DF) in the training dataset. Panels show the average and the standard deviation of the accuracy, sensitivity, specificity, and PPV for predicting a grade of C, D, or F and for predicting a grade of D or F.

The plots in Fig. 1 also suggest that, rather than focusing on a single overall performance parameter such as AUC or κ as was done in Study 1, it may be more productive for grade prediction to focus on optimizing multiple measures simultaneously. We report sensitivity, specificity, accuracy, and PPV as the models are optimized and report κ, AUC, g_mean, and F only for the optimized models.

B. Unbalanced Dependent Variables
Both outcome variables, predicting CDF or DF, are unbalanced as shown in Table III; there are more students in one of the classifications than the other. The CDF outcome is somewhat unbalanced, with 37% of the students receiving a C, D, or F. The DF outcome is quite unbalanced, with 12% of the students receiving a D or F. Sample imbalance can produce a classifier that predicts the outcomes of the majority and minority class with differing precision. Multiple methods exist to correct for sample imbalance: downsampling, upsampling, and hyperparameter tuning.
1. Downsampling
Figure 1 shows that random forest models are less effective at predicting CDF and DF outcomes; the sensitivity measures the fraction of CDF or DF outcomes that are correctly predicted. One possible cause of this is the sample imbalance, which provides the random forest learning algorithm more examples of the majority class, thus optimizing the model to correctly identify these cases. One possible method to improve the prediction of the minority results is downsampling: reducing the size of the majority dataset by randomly sampling it without replacement. Because downsampling reduces overall sample size, the reduced sample still needs to meet the sample size requirements explored in the previous section. Figure 2 shows the effect of downsampling on the model performance metrics predicting the DF outcome. The horizontal axis plots the percentage ratio of the majority sample to the minority sample; the two samples are balanced when the ratio is 100%. The figure clearly shows that as the majority class is downsampled, overall model accuracy and the correct prediction of the majority outcome (specificity) decrease; however, the rate of correctly predicting the minority outcome dramatically increases. The PPV, the fraction of correct positive predictions, also decreases with downsampling. This may be a result of less data being provided to train the algorithm, resulting in more incorrect classifications.

If a balance of sensitivity and specificity is desired, rather than overall prediction accuracy, the figure suggests downsampling until the minority sample is of equal size to the majority sample. The cost of achieving this balance is a much higher error rate in predicting the positive class (PPV). If one wishes to balance sensitivity with PPV, Fig. 2 suggests limited downsampling should be performed. Downsampling reduces the training dataset size and, thus, decreases the precision with which model performance metrics are measured. At the minority class sample sizes in Fig. 2, approximately 400 students, model performance metrics are still very precisely estimated; no data point in Fig. 2 has a standard deviation exceeding 0.02 or a standard error of the mean exceeding 0.002.

2. Upsampling
It is also possible to oversample, or upsample, the minority class to produce a more balanced sample. This is done by randomly replicating students in the minority class. For this sample, upsampling was completely ineffective at producing the changes in model performance that downsampling produced. This may be because upsampling does not create additional unique cases of the minority class to train the classifier.

Figure 2. Model performance parameters as the majority training dataset is downsampled to the minority dataset size predicting the DF outcome. The standard deviation did not exceed 0.02 for any data point.
3. Hyperparameter Tuning
Machine learning algorithms are ultimately implemented as computer programs. Like most programs, they contain a number of parameters that can be adjusted by the user to optimize their performance. In Study 1, the default parameters selected by the developers of the algorithms were used. The adjustable parameters for the random forest function in R include the number of trees that are grown and the decision threshold. Random forests work by growing a large number of decision trees and letting each tree vote on the classification. The decision threshold sets the percentage of votes the positive classification (CDF or DF) must receive for the individual to be classified into that class. The default for the "randomForest" package in R used in Study 1 is 50%. Figure 3 shows the effect of the decision threshold on the model performance statistics predicting the DF outcome. A decision threshold around 0.15 provides a balance of sensitivity and specificity; however, at this threshold the PPV is poor. A decision threshold of 0.30 provides a balance of sensitivity and PPV (as well as stronger κ, g_mean, and F statistics). While different course applications of machine learning may require valuing either sensitivity or PPV more highly, for this work, we will seek to balance the two quantities, valuing both identifying the most students at risk and having this identification be correct.

Examination of Fig. 3 allows one to understand how sensitivity, specificity, and PPV work together. As the decision threshold is increased, more trees have to vote for the DF outcome for it to be selected, and therefore, fewer students are classified as DF for higher thresholds. Because fewer students are classified as DF, more actual DF students are misclassified by the algorithm with higher threshold, decreasing sensitivity; however, with the more restrictive threshold, more of the predictions are correct, increasing PPV.

This analysis was repeated for the CDF outcome, and a plot similar to Fig. 3 is presented in the Supplemental Material [51]. For the CDF outcome, a decision threshold of 0.45 was optimal, suggesting the models in Study 1 may have had a good balance of sensitivity and PPV.
Figure 3. Model performance parameters plotted against the decision threshold predicting the DF outcome. The standard deviation did not exceed 0.01 for any data point.
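In the "randomForest" package, the decision threshold corresponds to the cutoff argument (or, equivalently, a threshold applied to the per-class vote fractions returned by predict with type = "vote"). The sketch below, under the same hypothetical setup as before, sweeps the threshold for the DF outcome to produce curves like those in Fig. 3.

votes <- predict(rf, newdata = test, type = "vote")[, "DF"]  # DF vote fraction

sweep <- sapply(seq(0.05, 0.50, by = 0.05), function(thr) {
  pred <- factor(votes >= thr, levels = c(FALSE, TRUE), labels = c("ABC", "DF"))
  cm   <- table(pred, test$outcome)
  c(threshold   = thr,
    sensitivity = cm["DF", "DF"] / sum(cm[, "DF"]),    # rises as thr falls
    specificity = cm["ABC", "ABC"] / sum(cm[, "ABC"]),
    ppv         = cm["DF", "DF"] / sum(cm["DF", ]))    # rises as thr rises
})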
4. Grid Search
Both downsampling and tuning the decision threshold generated models with improved classification of DF students. The degree of downsampling can also be viewed as a hyperparameter. In machine learning, it is not uncommon to have multiple hyperparameters which must be optimized together to create the best classification model. To do this, one performs a "grid search" through the space of hyperparameters, iterating through combinations of hyperparameters to optimize a performance statistic [50]. The Supplemental Material [51] presents contour plots of sensitivity, specificity, g_mean, κ, F, PPV, and AUC varying the decision threshold and downsampling rate.

Figure 2 suggests limited downsampling may be appropriate for optimizing this sample. Sensitivity, specificity, and PPV do not have maxima on the contour plots. This was expected; all continue to either increase or decrease with changes in the decision threshold. AUC, g_mean, and F all have broad maxima which include small downsampling rates. Cohen's κ has two narrow maxima, one of which also suggests low downsampling rates. As such, and because downsampling eliminates unique cases from the training data, no downsampling was performed by the optimized classifier. It is unclear if this failure of downsampling to improve models optimized by the decision threshold is a general feature of grade prediction classifiers or a unique feature of this dataset. Researchers investigating machine learning for student classification should explore downsampling; however, it was not effective for the students in this sample.

Without downsampling, the decision thresholds from the previous section (0.45 for CDF and 0.30 for DF) will be used. The 0.30 threshold is near the maximum region of all performance metrics that have a maximum.
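For this dataset, the grid search amounts to two nested sweeps; the sketch below (hypothetical names as before, with maj and mnr the majority and minority classes of the training data) records g_mean over a grid of decision thresholds and downsampling rates so that contours like those in the Supplemental Material can be drawn.

grid <- expand.grid(threshold = seq(0.10, 0.50, by = 0.05),
                    down_pct  = seq(40, 100, by = 20))  # minority % of majority size

grid$g_mean <- apply(grid, 1, function(p) {
  maj_n <- min(nrow(maj), round(nrow(mnr) * 100 / p["down_pct"]))  # downsampled majority
  sub   <- rbind(maj[sample(nrow(maj), maj_n), ], mnr)
  rf    <- randomForest(outcome ~ ., data = sub, ntree = 200)
  votes <- predict(rf, newdata = test, type = "vote")[, "DF"]
  pred  <- factor(votes >= p["threshold"], levels = c(FALSE, TRUE),
                  labels = c("ABC", "DF"))
  cm <- table(pred, test$outcome)
  sqrt((cm["DF", "DF"] / sum(cm[, "DF"])) * (cm["ABC", "ABC"] / sum(cm[, "ABC"])))
})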
C. Unbalanced Independent Variables

Four dichotomous variables were explored for this study: gender, first-generation status (FirstGen), underrepresented minority status (URM), and calculus-readiness (CalReady). Each of these variables divides the sample into subgroups with different class outcomes and differing levels of academic preparation as shown in Table III. In Study 1, demographic variables including URM, FirstGen, and Gender were shown to have limited importance in the prediction of whether a student would receive an A or B; however, it was unclear to what extent this resulted from the highly unbalanced sample. As can be seen from Table III, women, first-generation students, and underrepresented minority students form small subsets of the overall sample. As with the unbalanced dependent variable, this provides the machine learning algorithm many more examples of majority students and possibly optimizes the prediction algorithm for the majority class. To explore the consequences of using unbalanced independent variables on the prediction performance of the minority class, we first introduce an artificial independent variable that is not collinear with general markers of academic preparation and success. Once this variable is understood, the analysis is repeated for the four dichotomous independent variables available in this dataset.
1. An Artificial Independent Variable
Table III shows that, for the four dichotomous variables available to this study, different levels of the variable select students with often dramatically different measures of high school preparation and college success. Because the groups differ on continuous variables already included in the analysis, it may be that the finding that the dichotomous independent variables are not important results from these differences. To understand the effects of variable imbalance that is not coupled to prior academic performance and preparation, an artificial dichotomous variable was constructed. This new variable was randomly set to one for students with a majority outcome (AB) and to zero otherwise. This was done by checking if a random number was above a threshold value for majority students and setting all minority outcome (CDF) students to zero. This variable should be important to the prediction models because one can perfectly predict the outcome of students when the variable equals one. Conceptually, this kind of distribution might be produced if students were randomly assigned to a perfectly functioning treatment so that all students receiving the treatment scored an A or B. The size of the minority class can be adjusted by changing the threshold of the random variable. Figure 4 shows the variable importance for the artificial variable for different minority sample sizes. The minority percentage of majority dataset size plotted on the horizontal axis is the percentage ratio of students for which the artificial variable is zero (the larger class) to students where the variable is one. Variable importance is measured by the change in some performance metric when the variable is added to the model. This change is calculated using bootstrapping. Figure 4 clearly shows variable importance is related to the balance of the minority and majority classes of the artificial independent variable.

Figure 4. Variable importance predicting the CDF outcome measured by the change in model performance parameters between a model using an artificial dichotomous variable and one where it is removed. Error bars are one standard deviation in length. The error bars for the accuracy were smaller than the circle.
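The artificial variable takes only a couple of lines to construct; in the sketch below (hypothetical names), the threshold p_one sets what fraction of majority outcome students receive a one and, therefore, the balance of the new variable.

# Artificial dichotomous variable: one for a random subset of majority outcome
# (AB) students, zero for all other students, so a value of one perfectly
# predicts the AB outcome.
p_one <- 0.5  # threshold on the random draw; controls the class balance
students$artificial <- as.numeric(students$outcome == "AB" &
                                  runif(nrow(students)) < p_one)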
2. Dichotomous Independent Variables
The analysis of the previous section was repeated with each dichotomous variable available: Gender, URM, FirstGen, and CalReady. The results are presented in the Supplemental Material [51]. For each variable, a plot of the sensitivity for models including the variable and excluding the variable at different levels of downsampling is provided. The difference in model performance between the two classes defined by the variable is presented, as is the difference in performance with and without the variable. An example of these plots is shown in Fig. 5, which presents the results for the gender variable predicting CDF outcomes. For all four variables, the performance of the models was fairly insensitive to the level of downsampling. As such, the conclusions about the low importance of the demographic variables in Study 1 (gender, underrepresented minority status, and first-generation status) in predicting physics grades were not the result of the imbalance of the sample.

Figure 5 shows a number of interesting features of the effect of the inclusion of gender on model performance. The average sensitivity (Fig. 5(a)) differs by about 10% between men and women; the model correctly predicts when men will receive a C, D, or F at a 10% higher rate than for women. The accuracy and specificity of the model, however, are higher for women, Fig. 5(b). These differences were fairly insensitive to the level of downsampling and, therefore, did not result from the underrepresentation of women in the sample causing the classifier to be predominantly trained on men. Some feature of men or women not captured by the institutional variables in the model must be causing the differences in classification accuracy. Other demographic variables also demonstrated differences in some performance metrics as shown in the Supplemental Material [51].

While sensitivity, specificity, accuracy, and PPV are relatively constant at different levels of downsampling, the variable importance of the gender variable, measured as the change in these quantities as gender is added to the model, did change somewhat with the level of downsampling, Fig. 5(c); however, the magnitude of the change was not large and did not exceed 0.01.
Figure 5. Model performance statistics comparing models including the gender variable to those which do not, predicting CDF. Panel (a) shows the sensitivity of the model for men and women; error bars represent one standard deviation. Panel (b) shows the difference in model performance between men and women; a positive difference means the model performs better for men. Panel (c) shows the change in model performance parameters as the gender variable is added to the model.

D. Optimal Classification Model

With the results above, the analysis of Study 1 can be refined, and optimal classification models constructed for the DF and the CDF outcomes. The results shown in Fig. 1 show that the uncertainty in model performance metrics decreases rapidly until the minority sample size reaches 100. We would also like to achieve accurate characterization of the demographic subgroups in the sample; it seems likely this threshold also applies to the subgroups. The test and training datasets play different roles in machine learning. Machine learning predictions generally improve as more data are used to train the classification algorithms; therefore, as much data as possible should be assigned to the training dataset while retaining the minimum data required for accurate characterization of the classifier in the test dataset [50]. This implies the smallest demographic subgroup controls the test-train split so as to ideally retain at least 100 students from each subgroup in both the test and training datasets. Table IV shows there is insufficient diversity in the sample to achieve this goal for all groups, particularly for predicting the DF outcome. Only 62 URM students and 90 first-generation students received a D or F in the class. As such, to evenly divide these students between the test and training datasets, a 50% test-train split was used. For the CDF outcome, there are only 181 URM students in the sample who earn a C, D, or F, again suggesting a 50% test-train split. If one abandoned the goal of precisely predicting the variable importance of this group, more data could be retained for the training dataset.
Table V. Model performance parameters for the optimized classifier using all variables, characterized overall and for each subgroup. Columns report the Accuracy, Sensitivity, Specificity, PPV, κ, AUC, F, and g_mean for the CDF and DF outcomes. Values represent the mean ± the standard deviation.
The analysis showing that downsampling was not productive for the unbalanced dependent variable and the result that the model performance parameters were insensitive to the level of downsampling for the independent dichotomous variables suggest that downsampling is not productive for this dataset. The optimal model is then constructed with a 50% test-train split, no downsampling, and the 0.30 (DF) and 0.45 (CDF) decision thresholds suggested by hyperparameter tuning. Using these parameters and all variables in Table I allowed the construction of a classification model that was characterized on all students; the model performance metrics for this model are shown as the "Overall" entries in Table V. This model has a balance of sensitivity and PPV. For the CDF outcome, the overall accuracy is slightly better than that found for only institutional variables in Study 1, with κ in the range of moderate agreement but AUC less than the threshold of 0.80 for excellent discrimination. While the accuracy is higher for the DF model, sensitivity and PPV are much lower. It is harder to classify students receiving a D or F; other performance metrics are also lower for the DF model, with κ in the range of fair agreement.

Because no downsampling was used, the majority of the cases on which the optimal classifier was trained do not belong to any demographic subgroup underrepresented in physics. Because more majority student cases were used to train the classifier, it may perform differently for some subgroups, as is suggested by Fig. 5 and figures like it for other groups in the Supplemental Material [51]. To investigate this possibility, the optimal classification model was also characterized for each minority subgroup separately as shown in Table V. This was done by using the classification model trained on the full training dataset to classify only the minority cases in the test dataset. Differences are significant if the standard errors of the mean do not overlap; the standard error is the standard deviation divided by 14.1. For the CDF outcome, the models have substantially lower sensitivity and higher specificity for women with approximately equal PPV, consistent with Fig. 5. Conversely, for both URM students and non-CalReady students, the model's sensitivity is substantially higher and its specificity lower than the overall model. For non-CalReady students, κ is substantially lower and F higher than for the overall model.

All DF models had substantially lower sensitivity and PPV than the CDF models; it is more difficult to predict the DF outcome for all subgroups. The sensitivity of the DF model for women was substantially lower than the overall model, consistent with the CDF models. For the other groups, the sensitivity of the DF models was very similar to the overall model. For URM students and non-CalReady students, κ was lower than the overall model; AUC, F, and g_mean were similar.

While few substantial differences in performance metrics were identified in Table V, model performance might be substantially different if only students from the subgroup were used to train the model. This was investigated by training the models on each subgroup alone; the results are shown in Table VI. For the CDF outcome, the sensitivity of the model for women improved from 0.61 to 0.66, balancing a decrease in PPV from 0.69 to 0.65. For URM students, the sensitivity decreased for the CDF outcome. For CDF, both URM and non-CalReady students had higher sensitivity than PPV, as they did in the overall model; this suggests it may be appropriate to tune the decision threshold separately for these groups. For the DF outcome, both URM and FirstGen students had lower sensitivity when fitted separately than when fitted in the overall model.

Overall, the optimal classification models presented in this section achieved a balance of sensitivity and PPV as shown in Table V. Some variation existed for some subgroups, with higher sensitivity for URM students and lower sensitivity for women. Model performance metrics were fairly precisely estimated for all groups, with a maximum standard deviation of 0.06, suggesting the sample size was sufficient for effective characterization of the models; the standard error of the mean was substantially smaller. For the CDF outcome, κ ranges from 0.35 to 0.48, or from fair to moderate agreement; for the DF outcome, κ was smaller, ranging from 0.23 to 0.33, fair agreement. No AUC value met Hosmer's threshold of 0.80 for excellent discrimination.

The importance of each variable used in the classification models was evaluated by bootstrapping. Models were fit with and without each variable and the model performance metrics compared. The variable importance for sensitivity, specificity, and accuracy is shown in Fig. 6 for the overall model predicting CDF. The variable importance for the overall model predicting DF is similar and is presented in the Supplemental Material [51]. The variable importance for each demographic subgroup is also presented in the Supplemental Material.

Figure 6. Variable importance of the optimized model predicting CDF, plotted as the decrease in sensitivity, accuracy, and specificity when each variable is removed. Error bars are one standard deviation in length.

The variable importance plots shown in Fig. 6 show that CGPA is by far the most important variable, in agreement with Study 1. In addition to CGPA, only HrsCmp (the number of credit hours completed) is consistently an important variable. As in Study 1, a very limited number of institutional variables are needed to predict grades in a physics class.
This study sought to answer four research questions;they will be addressed in the order proposed.
RQ1: How can machine learning algorithms be applied to predict unbalanced physics class outcomes?
Figure 1 shows that the random forest algorithm using the default decision threshold and no downsampling produces models with very low sensitivity for a substantially unbalanced outcome variable, the DF outcome. Model accuracy, the primary performance metric reported in Study 1, was not an effective measure of the performance of an unbalanced classifier. Model performance metrics which focused on the minority outcome, the sensitivity and PPV, were more useful in evaluating performance. Sensitivity was substantially improved by downsampling until the minority and majority classes were the same size; however, this somewhat degraded the accuracy and specificity and strongly degraded PPV. Tuning the random forest hyperparameters, specifically the decision threshold, was also productive in increasing sensitivity, once again at the expense of accuracy, specificity, and PPV. To both identify as many of the minority class (the DF or CDF outcome) as possible, measured by sensitivity, and to have as large a proportion of those identifications as possible be correct, measured by PPV, models that balanced sensitivity and PPV were constructed.

A grid search allowed the identification of the combination of downsampling and hyperparameter tuning that was optimal to produce a balance of sensitivity and PPV. For this sample, that balance could be achieved by adjusting the decision threshold alone, without downsampling. Downsampling, while productive in eliminating the effects of sample imbalance, has other negative effects. By removing cases, it lowers the number of unique individuals on which the classifier is trained, reducing performance. It also lowers the overall training dataset size, increasing the imprecision of the performance metric estimates. With no downsampling, the decision threshold was set to 0.30 for the DF outcome and 0.45 for the CDF outcome, both different from the 0.50 default threshold used in Study 1. Table V shows these values produced approximately equal values of sensitivity and PPV for both the DF and CDF outcomes.

At these values, the CDF outcome model was accurate in 76% of its predictions overall, predicting 69% of the CDF outcomes correctly; 67% of students predicted to earn a C, D, or F did so in the test dataset. Global model fit parameters were also fairly strong, with κ in the range of moderate agreement and AUC near the 0.80 threshold for excellent discrimination. The DF models did not perform nearly as well as the CDF models. While the model's predictions were accurate 87% of the time, the model was far more effective at predicting the ABC outcome (correctly predicted 92% of the time) than the DF outcome (correctly predicted 46% of the time). The fraction of the DF predictions that were correct was also smaller: only 45% of the students predicted to earn a D or F actually did. While this work did not exhaust the adjustments that could be made to the random forest algorithm to improve DF prediction, the results presented suggest that simply modifying the algorithm will not be sufficient to raise the prediction accuracy of the DF model to the level of the CDF model. It is likely that new variables measuring different dimensions of student motivation and performance are required to improve prediction accuracy for the students most at risk in the class studied.
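For comparison, downsampling, the alternative explored above, can be sketched as follows; it reuses the hypothetical X_train and y_train from the previous sketch and randomly discards majority-class cases until the two classes are the same size.

import numpy as np

# Indices of the minority (DF or CDF) and majority (ABC) training cases.
minority_idx = np.where(y_train == 1)[0]
majority_idx = np.where(y_train == 0)[0]

# Randomly keep only as many majority cases as there are minority cases.
rng = np.random.default_rng(1)
kept = rng.choice(majority_idx, size=minority_idx.size, replace=False)
balanced = np.concatenate([minority_idx, kept])
X_bal, y_bal = X_train[balanced], y_train[balanced]

The cost is visible in the code itself: every discarded row is a training case lost, which is why threshold tuning alone was preferred for this dataset.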
RQ2: What is a productive set of performance metrics to characterize the prediction algorithms?

This study introduced many potential model performance metrics. Study 1 used accuracy, AUC, and κ. This study showed that these metrics obscured a difference in the rate of correct classification of the minority and majority outcome classes. No single model performance parameter was sufficient to completely understand model performance. This work primarily used sensitivity and PPV and sought to achieve a balance of the two quantities. This approach focused on the minority outcome, earning a DF or CDF, in anticipation that identification of at-risk students may be one of the primary applications of machine learning in PER. Global model fit parameters such as AUC and g-mean obscured differences in sensitivity, specificity, and PPV and were ineffective at distinguishing between models. F1, the harmonic mean of sensitivity and PPV, was the single parameter that most closely aligned with the goal of optimizing both sensitivity and PPV. Of the overall performance metrics, κ was more effective at distinguishing between models than AUC and g-mean; however, it has the drawback of not being intuitively connected to the confusion matrix. As such, it is not always clear how optimizing κ actually influences model predictions. The overall classification accuracy used in Study 1 was quite ineffective at distinguishing between models. This was a result of sample imbalance, particularly for the DF outcome: one can achieve a high accuracy by classifying only the majority class with precision. Researchers applying classification algorithms should not focus on a single performance measure, but should examine a variety of measures. These measures should be chosen to align with how the results of the classification algorithm will be used. We suggest examining sensitivity, specificity, and PPV as a good starting point for understanding machine learning models. One could also compute the negative predictive value (NPV) if predicting the successful outcome was also a focus of the model.
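The metric set recommended here can be computed directly from the 2x2 confusion matrix. The function below sketches those standard definitions; κ and AUC can be taken from scikit-learn's cohen_kappa_score and roc_auc_score, and y_test, y_pred, and p_minority are the hypothetical names from the earlier sketch.

import numpy as np
from sklearn.metrics import cohen_kappa_score, roc_auc_score

def minority_metrics(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    sens = tp / (tp + fn)               # fraction of minority outcomes found
    spec = tn / (tn + fp)               # fraction of majority outcomes found
    ppv = tp / (tp + fp)                # fraction of minority predictions correct
    npv = tn / (tn + fn)                # fraction of majority predictions correct
    f1 = 2 * sens * ppv / (sens + ppv)  # harmonic mean of sensitivity and PPV
    return {"sensitivity": sens, "specificity": spec, "PPV": ppv, "NPV": npv,
            "F1": f1, "g_mean": np.sqrt(sens * spec),
            "accuracy": (tp + tn) / (tp + tn + fp + fn)}

kappa = cohen_kappa_score(y_test, y_pred)  # agreement beyond chance
auc = roc_auc_score(y_test, p_minority)    # threshold-free discrimination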
RQ3: What sample size is required for accurate prediction of physics class outcomes?

There was a weak increase in predictive performance with increasing sample size until the minority sample size reached 100, as shown in Fig. 1. The uncertainty in model performance metrics decreased with increasing minority sample size. There was not a well-defined transition, a "knee," in these plots. For the CDF outcome, the rate of decrease of the uncertainty slowed at around 100 to 150 cases. For the DF outcome, the transition to a slower decline was less well defined for the sensitivity and PPV. The minority sample size required for commensurate uncertainty in these outcomes was also quite different. For an uncertainty of 0.025 in the CDF outcome, 220 minority cases were needed for sensitivity and 60 cases for PPV; for the DF outcome, 140 cases were required for sensitivity and 350 cases for PPV.

The size of the test dataset also influences the uncertainty of the performance metrics; plots similar to Fig. 1 for the test dataset are presented in the Supplemental Materials [51]. In general, larger test datasets were required to achieve the same uncertainty as the training dataset. For an uncertainty of 0.025 in the CDF outcome, 260 minority cases were needed for sensitivity and 225 cases for PPV; for the DF outcome, 175 cases for sensitivity and 300 cases for PPV.

While no strong recommendation for absolute sample size can be made, Fig. 1 should allow researchers wishing to develop a classification model to determine how much uncertainty they should expect for a given minority sample size. It should be stressed that Fig. 1 plots the minority sample size on the horizontal axis; because of differences in sample imbalance, the overall sample size required will be quite different for the CDF and DF outcomes for the same minority sample size.
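One way to estimate the uncertainties discussed above is to repeat the test-train split, refit the model, and take the standard deviation of each metric over the trials. The sketch below assumes 200 trials; the trial count is an assumption, though a standard error of SD/14.1 is consistent with roughly 200. X, y, and minority_metrics are the hypothetical objects from the earlier sketches.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

sens = []
for seed in range(200):  # assumed number of bootstrap trials
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=seed)
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    forest.fit(X_tr, y_tr)
    y_hat = (forest.predict_proba(X_te)[:, 1] >= 0.45).astype(int)
    sens.append(minority_metrics(y_te, y_hat)["sensitivity"])

# The standard deviation is the uncertainty plotted in Fig. 1; the
# standard error of the mean is smaller by the square root of the
# number of trials.
print("sensitivity =", np.mean(sens), "+/-", np.std(sens))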
RQ4: How does prediction accuracy differ for groups underrepresented in physics? How can machine learning models be optimized to predict the outcomes of all groups with equal accuracy?
This study reported multiple model performance metrics. For the CDF outcome, overall accuracy was somewhat, but not dramatically, different for the demographic subgroups in the dataset, varying from 0.70 to 0.78 in Table V. To produce Table V, a model was first constructed for the complete training dataset (Overall); that model was then used to characterize the subgroups in the test dataset. Stronger variation was observed in other performance metrics, with sensitivity varying from 0.61 for women to 0.78 for URM students. For the DF outcome, the variation of accuracy was similar, 0.78 to 0.90, with a smaller variation in sensitivity; this was possibly caused by the generally lower values of this metric. For the CDF outcome, PPV was fairly constant for all students; more variation was observed for the DF outcome.

Table VI shows that some of the differences identified in Table V decrease if models are constructed for each demographic subgroup independently; however, most differences remain. This indicates that the differences in prediction performance identified in Table V were not solely the result of training the classifier on more majority cases. It seems likely that additional information about students in underrepresented groups may need to be collected to produce classification models with consistent performance across subgroups. The additional information needed is not yet known, but may include in-class data such as homework scores or attitudinal data such as self-efficacy or possessing a growth mindset.
A. Other Observations
Only a few variables were important to the classification models, as shown in Fig. 6. There is strong theoretical support that many of the variables identified as unimportant, particularly HSGPA and ACTM, are strong markers of potential academic success. Table IV also shows strong differences in the academic preparation of URM, FirstGen, and non-CalReady students. It seems quite likely that these variables were found unimportant because their effects are already present in the variables that were found to be important: college GPA and the number of credit hours completed. The variable importance would likely change dramatically if the measures of college success, CGPA, HrsCmp, CmpPct, and STEMHrs, were removed from the models, leaving only variables measured before college began.
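The drop-one-variable importance behind Fig. 6 can be sketched as refitting the model without each variable in turn and recording the decrease in a performance metric; the paper's bootstrapped version repeats this over many resamples to obtain the error bars. The helper below and its use of scikit-learn are illustrative assumptions, not the authors' implementation.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fitted_sensitivity(X_tr, y_tr, X_te, y_te, threshold=0.45):
    forest = RandomForestClassifier(n_estimators=200, random_state=0)
    forest.fit(X_tr, y_tr)
    y_hat = (forest.predict_proba(X_te)[:, 1] >= threshold).astype(int)
    tp = np.sum((y_hat == 1) & (y_te == 1))
    fn = np.sum((y_hat == 0) & (y_te == 1))
    return tp / (tp + fn)

def drop_column_importance(X_tr, y_tr, X_te, y_te):
    base = fitted_sensitivity(X_tr, y_tr, X_te, y_te)
    importance = {}
    for j in range(X_tr.shape[1]):
        keep = [k for k in range(X_tr.shape[1]) if k != j]
        # Importance of variable j = decrease in sensitivity without it.
        importance[j] = base - fitted_sensitivity(
            X_tr[:, keep], y_tr, X_te[:, keep], y_te)
    return importance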
B. Recommendations for the Use of Machine Learning
The results of this paper provide some guidance to future researchers interested in applying machine learning algorithms to predict physics course outcomes.
Test-Train Data Requirements:
The results of Fig. 1 suggest training datasets should contain at minimum 100 students in the minority class; a similar plot in the Supplemental Materials suggests similar criteria for the test dataset. However, a fixed dataset-size threshold should not be used; rather, bootstrapping should be employed to establish the precision of the model performance metrics. The decisions to be made with the classification model will determine the precision needed. For models using demographic variables for students underrepresented in physics classes, substantially larger datasets are required to precisely characterize model performance for these students.
Performance Metrics:
The application of the model predictions should be considered before selecting performance metrics. Some applications may value overall accuracy, while others may be more concerned with the correct prediction of the minority or majority outcome. This study settled on optimizing sensitivity and PPV simultaneously; this could be accomplished by maximizing the F1 statistic. This choice focused on the minority outcome, receiving a DF or CDF grade, and placed equal value on predicting as many of the minority outcomes as possible and on having the predictions be correct as often as possible.

Unbalanced Outcomes:
Imbalance between the majority and minority outcomes makes some performance metrics, such as accuracy or AUC, less useful in model evaluation. An unbalanced outcome variable can produce classifiers that are inaccurate for the minority class, as was the DF classifier presented in this work using the default decision threshold. Both downsampling and hyperparameter tuning can eliminate some of the negative effects of unbalanced outcomes. In this work, hyperparameter tuning alone served to produce a classifier that balanced sensitivity and PPV.
Unbalanced Independent Variables:
Demographic subgroups underrepresented in physics may be classified accurately less often because the machine learning algorithm is trained on fewer of their cases. The sample size of each demographic group to be included in the model should be considered when establishing the test-train split. Each subgroup should be examined independently to ensure the model performs equally well for all groups, as in the sketch below. It may be necessary to fit each group separately or to tune the hyperparameters for each group separately. The differences in performance between groups were generally unaffected by downsampling.
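That per-subgroup check can be sketched as follows, in the spirit of Table V: the classifier trained on the full training set is scored on only each subgroup's test cases. The boolean masks female, urm, and firstgen are hypothetical arrays aligned with the test set, and minority_metrics, y_test, and y_pred come from the earlier sketches.

# Score the single trained classifier separately on each subgroup.
subgroups = {"women": female, "URM": urm, "FirstGen": firstgen}
for name, mask in subgroups.items():
    m = minority_metrics(y_test[mask], y_pred[mask])
    print(name, "sensitivity:", round(m["sensitivity"], 2),
          "PPV:", round(m["PPV"], 2))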
V. IMPLICATIONS
A limited number of institutional variables are required to construct classification models for physics classes. If physics departments could arrange for these variables to be provided to instructors, along with tools to use them, at-risk students could be identified early and interventions directed toward those students.
VI. FUTURE
This work focused on the random forest machine learning algorithm; many other algorithms exist and may provide additional insight into student behavior and success in physics classes.

This work also considered only linear relations of the variables; it is possible that non-linear combinations of variables are important to understanding student success. This would be particularly important if interactions between demographic variables, represented by product terms in the random forest model, were important to predicting student success.

This work focused on institutional variables; a future study will examine the addition of in-class variables such as homework average and affective variables such as self-efficacy. It is particularly important to identify the set of variables needed to improve prediction effectiveness for the DF outcome.

This work demonstrated that the difference in model performance for some underrepresented demographic groups could not be explained by the imbalance of the sample alone. More research is required to determine the additional variables needed to produce classifiers which are equally accurate for all underrepresented students.

This work explored many factors which might influence the performance of machine learning classifiers using a large dataset and a fairly small number of variables. As the number of variables increases, additional work is needed to see how the number of variables affects the results presented.
VII. CONCLUSIONS
This work applied the random forest machine learning algorithm to the prediction of student grades in an introductory, calculus-based mechanics class. Both students receiving a D or F and students receiving a C, D, or F were investigated. The default parameters for the random forest algorithm produced classification models which predicted the lower grade outcome correctly substantially less often than the higher grade outcome. Both downsampling and hyperparameter tuning (adjusting the decision threshold) were productive in producing classification models which predicted these outcomes correctly at a higher rate. When the two were compared, hyperparameter tuning alone produced results close to those of the combination of downsampling and hyperparameter tuning, without removing data. By tuning the decision threshold, sensitivity (the fraction of the DF or CDF outcomes classified correctly) was improved from 20% to 46% for the DF outcome and from 62% to 69% for the CDF outcome. Three demographic subgroups were examined in this work: women, underrepresented minority students, and first-generation students. For all subgroups, differences were detected in model performance metrics between the subgroup classifier and the overall classifier. These differences largely persisted when the classification model was trained with only members of the demographic subgroup. Some differences suggested the classification models should be tuned independently for each demographic group.
ACKNOWLEDGMENTS
This work was supported in part by the National Science Foundation under grants ECR-1561517 and HRD-1834569.

[1] D.E. Meltzer and R.K. Thornton, "Resource letter ALIP-1: Active-learning instruction in physics," Am. J. Phys., 478–496 (2012).
[2] S. Freeman, S.L. Eddy, M. McDonough, M.K. Smith, N. Okoroafor, H. Jordt, and M.P. Wenderoth, "Active learning increases student performance in science, engineering, and mathematics," P. Nat. Acad. Sci., 8410–8415 (2014).
[3] President's Council of Advisors on Science and Technology, "Report to the President. Engage to Excel: Producing One Million Additional College Graduates with Degrees in Science, Technology, Engineering, and Mathematics," Executive Office of the President: Washington, DC (2012).
[4] X. Chen, "STEM Attrition: College students' paths into and out of STEM fields. NCES 2014-001," National Center for Education Statistics (2013).
[5] National Science Board, Revisiting the STEM Workforce: A Companion to Science and Engineering Indicators 2014 (National Science Foundation: Arlington, VA).
[6] , 892–900 (2010).
[7] E.J. Shaw and S. Barbuti, "Patterns of persistence in intended college major with a focus on STEM majors," NACADA J., 19–34 (2010).
[8] A.V. Maltese and R.H. Tai, "Pipeline persistence: Examining the association of educational experiences with earned degrees in STEM among US students," Sci. Educ., 877–907 (2011).
[9] G. Zhang, T.J. Anderson, M.W. Ohland, and B.R. Thorndyke, "Identifying factors influencing engineering student graduation: A longitudinal and cross-institutional study," J. Eng. Educ., 313–320 (2004).
[10] B.F. French, J.C. Immekus, and W.C. Oakes, "An examination of indicators of engineering students' success and persistence," J. Eng. Educ., 419–425 (2005).
[11] R.M. Marra, K.A. Rodgers, D. Shen, and B. Bogue, "Leaving engineering: A multi-year single institution study," J. Eng. Educ., 6–27 (2012).
[12] C.W. Hall, P.J. Kauffmann, K.L. Wuensch, W.E. Swart, K.A. DeUrquidi, O.H. Griffin, and C.S. Duncan, "Aptitude and personality traits in retention of engineering students," J. Eng. Educ., 167–188 (2015).
[13] P. Baepler and C.J. Murdoch, "Academic analytics and data mining in higher education," Int. J. Scholarsh. Teach. Learn., 17 (2010).
[14] R.S.J.D. Baker and K. Yacef, "The state of educational data mining in 2009: A review and future visions," J. Educ. Data Mine, 3–17 (2009).
[15] Z. Papamitsiou and A.A. Economides, "Learning analytics and educational data mining in practice: A systematic literature review of empirical evidence," J. Educ. Tech. Soc. (2014).
[16] A. Dutt, M.A. Ismail, and T. Herawan, "A systematic review on educational data mining," IEEE Access, 15991–16005 (2017).
[17] C. Romero and S. Ventura, "Educational data mining: A review of the state of the art," IEEE T. Syst. Man Cy. C, 601–618 (2010).
[18] C. Zabriskie, J. Yang, S. DeVore, and J. Stewart, "Using machine learning to predict physics course outcomes," Phys. Rev. Phys. Educ. Res., 020120 (2019).
[19] A. Peña-Ayala, "Educational data mining: A survey and a data mining-based analysis of recent works," Expert Syst. Appl., 1432–1462 (2014).
[20] C. Romero, S. Ventura, P.G. Espejo, and C. Hervás, "Data mining algorithms to classify students," in Proceedings of the 1st International Conference on Educational Data Mining, edited by R.S. Joazeiro de Baker, T. Barnes, and J.E. Beck (Montreal, Quebec, Canada, 2008).
[21] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning with Applications in R, Vol. 112 (Springer-Verlag, New York, NY, 2017).
[22] A.C. Müller and S. Guido, Introduction to Machine Learning with Python: A Guide for Data Scientists (O'Reilly Media, Boston, MA, 2016).
[23] A.M. Shahiri, W. Husain, and N.A. Rashid, "A review on predicting student's performance using data mining techniques," Procedia Comput. Sci., 414–422 (2015).
[24] S. Huang and N. Fang, "Predicting student academic performance in an engineering dynamics course: A comparison of four types of predictive mathematical models," Comput. Educ., 133–145 (2013).
[25] F. Marbouti, H.A. Diefes-Dux, and K. Madhavan, "Models for early prediction of at-risk students in a course using standards-based grading," Comput. Educ., 1–15 (2016).
[26] L.P. Macfadyen and S. Dawson, "Mining LMS data to develop an early warning system for educators: A proof of concept," Comput. Educ., 588–599 (2010).
[27] U. bin Mat, N. Buniyamin, P.M. Arsad, and R. Kassim, "An overview of using academic analytics to predict and improve students' achievement: A proposed proactive intelligent intervention," in Engineering Education (ICEED), 2013 IEEE 5th Conference on (IEEE, 2013), pp. 126–130.
[28] J.M. Aiken, R. Henderson, and M.D. Caballero, "Modeling student pathways in a physics bachelor's degree program," Phys. Rev. Phys. Educ. Res., 010128 (2019).
[29] "US News & World Report: Education," US News and World Report, Washington, DC, https://premium.usnews.com/best-colleges. Accessed 2/23/2019.
[30] V. Otero, S. Pollock, and N. Finkelstein, "A physics department's role in preparing physics teachers: The Colorado Learning Assistant model," Am. J. Phys., 1218–1224 (2010).
[31] C.H. Crouch and E. Mazur, "Peer instruction: Ten years of experience and results," Am. J. Phys., 970–977 (2001).
[32] L.C. McDermott and P.S. Shaffer, Tutorials in Introductory Physics (Prentice Hall, Upper Saddle River, NJ, 1998).
[33] E. Elby, R.E. Scherr, T. McCaskey, R. Hodges, T. Bing, D. Hammer, and E.F. Redish, "Open Source Tutorials in Physics Sensemaking," http://umdperg.pbworks.com/w/page/10511218/OpenSourceTutorials. Accessed 9/17/2018.
[34] A.L. Traxler, X.C. Cid, J. Blue, and R. Barthelemy, "Enriching gender in physics education research: A binary past and a complex future," Phys. Rev. Phys. Educ. Res., 020114 (2016).
[35] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, Classification and Regression Trees (Wadsworth & Brooks/Cole, Monterey, CA, 1984).
[36] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer-Verlag, New York, NY, 2009).
[37] A. Liaw and M. Wiener, "Classification and regression by randomForest," R News, 18–22 (2002).
[38] T. Fawcett, "An introduction to ROC analysis," Pattern Recogn. Lett., 861–874 (2006).
[39] J. Cohen, Statistical Power Analysis for the Behavioral Sciences (Academic Press, New York, NY, 1977).
[40] D.G. Altman, Practical Statistics for Medical Research (CRC Press, Boca Raton, FL, 1990).
[41] D.W. Hosmer Jr., S. Lemeshow, and R.X. Sturdivant,