Missing data and bias in physics education research: A case for using multiple imputation
Jayson Nissen, Robin Donatello, and Ben Van Dusen
Department of Science Education, California State University Chico, Chico, CA 95929, USA
Department of Mathematics and Statistics, California State University Chico, Chico, CA 95929, USA
Physics education researchers (PER) commonly use complete-case analysis to address missing data. For complete-case analysis, researchers discard all data from any student who is missing any data. Despite its frequent use, no PER article we reviewed that used complete-case analysis provided evidence that the data met the assumption of missing completely at random (MCAR) necessary to ensure accurate results. Not meeting this assumption raises the possibility that prior studies have reported biased results with inflated gains that may obscure differences across courses. To test this possibility, we compared the accuracy of complete-case analysis and multiple imputation (MI) using simulated data. We simulated the data based on prior studies such that students who earned higher grades participated at higher rates, which made the data missing at random (MAR). PER studies seldom use MI, but MI uses all available data, has less stringent assumptions, and is more accurate and more statistically powerful than complete-case analysis. Results indicated that complete-case analysis introduced more bias than MI and that this bias was large enough to obscure differences between student populations or between courses. We recommend that the PER community adopt the use of MI for handling missing data to improve the accuracy of research studies.
I. INTRODUCTION
Physics education research (PER) commonly handles missing data by using complete-case analysis (a.k.a. listwise deletion, casewise deletion, and matched data) [1, 2]. Complete-case analysis removes any individuals who are missing any data from the analysis. This method is common because it is easy to implement. However, discarding data lowers the statistical power of the analysis and may bias the results [3–6].

Complete-case analysis produces reliable results so long as the missing data is missing completely at random (MCAR) [3]. For MCAR, the missingness is completely independent of any observed or missing data [7]. We are not aware of any studies in PER that have explicitly tested the MCAR assumption. Van Ness et al. [8] and Fielding et al. [9] provide examples of these tests in epidemiology and health research. The few studies that have explicitly compared participants and non-participants using course grades [2, 10–12] all indicate that students with higher course grades are more likely to provide complete data. Students with higher course grades also tend to do better on concept inventories and attitude surveys [2]. PER studies that use these instruments likely do not meet the MCAR assumption because the missing data disproportionately comes from students with lower grades, who tend to have lower scores. Therefore, as illustrated by the simulated data in Fig. 1, the distributions of the collected data and the missing data likely differ. This difference may create biased results. For example, on concept inventories the mean scores will be higher if the data mostly comes from students who earned As and Bs than if it comes from all of the students.

As participation rates drop, the skew in representation toward students who receive higher grades typically increases [2]. This increased skew in participation tends to raise the size of the difference between the collected and missing data, leading to a greater likelihood of bias in any subsequent analyses. We are not aware of any studies in PER that have investigated this potential bias, how large this bias may be, nor what impact it could have on understanding student learning in college physics courses.

FIG. 1. Simulated distributions of missing and collected data with means indicated to illustrate data that is not MCAR.

Multiple imputation (MI) [13] handles missing data without discarding any values by imputing the missing values using statistical models based on the available data. MI completes this process m times to create m complete data sets, analyzes each of those complete data sets with traditional methods to produce m results, and combines the m results into a single mean, variance, and standard error for each of the statistics being calculated. MI [14] provides a consistently superior alternative to complete-case analysis. Research shows that MI has greater statistical power and less biased results than complete-case analysis [3, 5, 15, 16]. This superior performance results from MI not relying on the assumption that the data is MCAR and from MI using all of the available data to build accurate and reliable models. A search of the Sage journals for the term 'multiple imputation' during the preparation of this manuscript indicated that education researchers use MI: the search identified 2,876 research articles on education that referenced MI. A similar search of the Physical Review database for the term 'multiple imputation' identified only four studies in PER that referenced the term. Of these four studies, only two used MI [1, 17], and we only know of one other PER article outside of Physical Review that used MI [2].

II. RESEARCH QUESTION
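A short simulation makes this mechanism concrete: when both participation and scores rise with course grade, the mean of the collected data overestimates the mean for all students. All parameters below (group means, spread, participation rates, seed) are illustrative assumptions, not values from the studies cited here.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical grade points (A=4 ... F=0) and posttest scores that
# rise with grade; the numbers are invented for illustration.
grades = rng.integers(0, 5, size=10_000)
scores = np.clip(rng.normal(40 + 10 * grades, 15), 0, 100)

# Participation probability increases with grade: 0.40 for F up to 0.88 for A.
participated = rng.random(10_000) < 0.40 + 0.12 * grades

print(f"mean of all students:   {scores.mean():.1f}")
print(f"mean of collected data: {scores[participated].mean():.1f}")
```

Because the students who opt out come disproportionately from the low-scoring groups, the collected-data mean lands above the true course mean even though no individual score was altered.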
In this article, we compare and contrast the bias introduced by using either complete-case analysis or MI to analyze concept inventory data with participation skewed toward higher-performing students. We designed the study to cover a broad range of variables we identified as pertinent to concept inventory data. The results inform how likely complete-case analysis is to bias results in the PER literature and the possible size of those biases. By comparing complete-case analysis and MI, we hope to raise awareness in the PER and discipline-based education research communities about methods for handling missing data in quantitative studies.

To compare the accuracy of complete-case analysis and MI, we examined the following research question:

• When controlling for the relationships between grade, concept inventory scores, grade distributions in a course, and participation rates, to what extent do complete-case analysis and MI produce biased results for posttest scores?

If the results indicate that complete-case analysis provides inaccurate results compared to MI, these results could motivate researchers to use MI in their studies. The results could also provide reviewers and editors with a resource to push against the use of complete-case analysis and to push for improved reporting and transparency about data collection and analysis in future studies.
III. LITERATURE REVIEW

A. Missing data in PER studies
To inform the common research practices around reporting and handling missing data, we reviewed the published literature in the American Journal of Physics and in Physical Review Physics Education Research. We identified 28 studies that reported pretest and posttest scores for concept inventories in introductory physics courses. We did not include studies that used either pretest or posttest scores but did not report descriptive statistics for student performance. Of these 28 studies, six provided adequate descriptive statistics to calculate the participation rates and one [18] stated the range of participation rates across the courses sampled in the study, as shown in Table I. The participation rates ranged from a low of 30% to a high of 80%.

Twenty-three of the studies we reviewed used complete-case analysis. For studies that did not report how they handled missing data, we inferred from the matched number of pretests and posttests that the researchers used complete-case analysis. Five studies calculated descriptive statistics using all available data. These 28 studies do not include the three studies in PER that used MI, which we discussed earlier. We excluded these three articles from the 28 studies that we reviewed because two of them did not report pretest and posttest scores on concept inventories [1, 17] and we discuss the third article [2] below.

Only three of the seven studies that reported participation rates, shown in Table I, provided average grade data for the participants and non-participants. All three studies disaggregated the data by gender. The participants in these three studies had much higher grades than the students who did not participate in the study, with a B- on average for participants and a C on average for nonparticipants. These differences in grades indicate that the missing data in these studies does not meet the assumption of MCAR required for complete-case analysis. The underrepresentation of low-performing students raises the possibility that the results reported in these studies were positively biased.
B. An Investigation of Participation on Low-Stakes Assessments
Nissen et al. [2] used an experimental design to investigate the differences in performance and participation on paper-and-pencil tests (PPT) administered in class and computer-based tests (CBT) administered online outside of class. In this article, we focus on their participation models. Data for the study came from 1,310 students in 25 sections of 3 different introductory physics courses at one institution. Instructors asked every student to complete four assessments: paper- and computer-based pretests and posttests. Instructors reported using four different practices to motivate students to participate: participation credit on the pretest, participation credit on the posttest, in-class reminders, and email reminders. They modeled the participation rates of the students using hierarchical generalized linear models to produce estimates of the likelihood that students would provide data on the low-stakes assessments. The hierarchical models nested the data in three levels: tests nested in students who nested in course sections. Variables in the final model included paper pretest, computer pretest, paper posttest, and computer posttest at the test level; final course grade and gender at the student level; and, at the course section level, participation practices treated as a continuous variable from 0-4 based on the total number of practices instructors used. The coefficients in generalized linear models are the log of the odds ratio, i.e., logits. Because logits are uncommon, nonintuitive, and beyond the scope of this article, we will focus on the predicted participation rates reported by Nissen and colleagues, which are shown in Fig. 2.

Nissen and colleagues found that participation tended to be higher on pretests than on posttests, participation tended to be higher on paper-and-pencil tests than on computer-based tests, and students who earned higher grades participated at higher rates than those who earned lower grades. The final model predicted that participation on computer-based tests matched that on paper-and-pencil tests when instructors used all four practices to motivate student participation. The differences in participation across student grades existed no matter what practices instructors used to motivate their students to participate. Their final model predicted that female students participated at slightly higher rates than male students, but this difference was not statistically significant. To generate the participation rates represented in Fig. 2, Nissen and colleagues input the mean value for gender into their participation model.

TABLE I. Participation rates and descriptive statistics for students' grades from prior studies published in Physical Review Physics Education Research. Descriptive statistics include mean (µ), sample size (N), and standard deviation (σ). Grades are in GPA units on a 0 to 4 scale.

Study                  Instruction  Gender   Participant grades     Nonparticipant grades   Participation
                                             µ      N     σ         µ      N     σ          Rate
Nissen, 2016 [12]      Active       Male     2.69   90    1.28      2.10   92    1.28       0.49
                                    Female   2.78   27    1.26      2.05   13    1.16       0.68
Kost-Smith, 2010 [11]  Active       Male     2.85   1257  0.8       1.93   500   1.1        0.72
                                    Female   2.80   447   0.8       1.96   114   1.2        0.80
Kost, 2009 [10]        Active       Male     2.82   1563  0.8       2.14   1152  1.2        0.58
                                    Female   2.74   533   0.8       1.89   315   1.1        0.63
Henderson, 2017 [19]   Lecture      Male     -      1084  -         -      342   -          0.76
                                    Female   -      323   -         -      102   -          0.76
Brewe, 2010 [20]       Modeling     All      -      258   -         -      64    -          0.80
                       Lecture      All      -      758   -         -      1743  -          0.30
Cahill, 2014 [21]      Lecture      All      -      366   -         -      314   -          0.54
                       Active       All      -      773   -         -      448   -          0.63
Cahill, 2014 [21]      Lecture      All      -      360   -         -      219   -          0.62
                       Active       All      -      738   -         -      384   -          0.66
Cahill, 2018 [18]      Both         All      -      -     -         -      -     -          0.34-0.59

FIG. 2. Participation rates for computer-based tests (CBT) and paper-and-pencil tests (PPT) from Nissen et al. [2]. Participation on the PPT pretest is not shown because it closely clustered around 100% for all grades. Recommended practices measured the total number of up to four actions instructors could take to motivate students to participate in the CBTs: participation credit on the pretest, participation credit on the posttest, in-class reminders, and email reminders.
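For readers unfamiliar with logits, the mapping from a generalized linear model's coefficients to a predicted participation rate can be sketched as follows. The coefficients here are invented for illustration; they are not the values from Nissen et al.'s model.

```python
import math

# Hypothetical logistic-model sketch of participation. The intercept
# b0 and slopes b_grade (per GPA unit) and b_prac (per motivating
# practice) are made-up placeholders, not fitted coefficients.
def predicted_participation(grade, n_practices, *, b0=-1.0, b_grade=0.5, b_prac=0.3):
    """Convert a linear predictor on the logit scale to a probability."""
    logit = b0 + b_grade * grade + b_prac * n_practices
    return 1 / (1 + math.exp(-logit))

for gpa in (0, 2, 4):
    print(gpa, round(predicted_participation(gpa, n_practices=4), 2))
```

Whatever the coefficients, a positive grade slope on the logit scale translates into monotonically higher predicted participation for higher-grade students, which is the pattern Fig. 2 displays.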
C. Summary of Missing Data in PER Studies
Higher participation rates for higher-achieving students occurred in all of the studies that we reviewed that reported information on participation. We cannot rule out the possibility that only studies with a skew in participation reported on differences in grades between participants and non-participants. However, Kost-Smith et al. [11] reported one of the highest participation rates and reported this skew, while Nissen et al. [2] found that the skew became smaller as the participation rate increased. Furthermore, Nissen et al. [2] tested for the relationship between grade and participation because it was reported in earlier studies [10–12]. Until studies show no relationship between course grades and participation, the literature consistently and reliably indicates that students who earn higher grades are more likely to participate than those who earn lower grades.

The positive relationship between grade and participation indicates that concept inventory data is not MCAR. This consistent failure to meet the assumptions necessary for complete-case analysis to produce accurate results, combined with the almost exclusive use of complete-case analysis, raises the possibility that results in PER studies that use pre-post concept inventories are positively biased to varying extents.
D. Types of missing data
The statistical methods underlying complete-case analysis assume the data is MCAR. MI makes no explicit assumption about the missingness of the data; however, many software packages' implementations of MI assume missing at random (MAR) data. Rubin [7] coined three terms to classify the relationships between the mechanisms of the missingness and the missing and observed values themselves.

• Missing completely at random (MCAR): all of the cases have the same probability of being missing. No relationship exists between the probability of a case being missing and any values in the dataset. This assumption can be partially tested [22].

• Missing at random (MAR): the missingness is independent of the value of the missing data but is conditionally dependent on other observed variables that can explain all of the missingness. For example, a researcher has blood pressure, age, and cardiovascular disease data. They are concerned that the blood pressure data is not missing at random because older people with cardiovascular disease are more likely to report their blood pressure than young, healthy people. Provided the age and cardiovascular disease data can explain the missingness in the data, the data is MAR.

• Missing not at random (MNAR): the missingness depends on both the observed and unobserved data. For example, wealthy and poor people may choose not to report their income for fear of being stigmatized due to their income. Since the unreported variable is related to the likelihood of reporting and no other variable can explain the missingness, the data is MNAR.

In real-world data, the boundary between MAR and MNAR cannot be firmly established because doing so requires observing the unobserved data. Instead, researchers must make reasonable arguments to evaluate the mechanism of missingness.
Simulation studies like the one we present in this manuscript allow researchers to build models with data that is known to be missing based on one of the three missingness classifications.

Bhaskaran and Smeeth [23] provide a brief article explaining MAR. They argue [23, p. 1337], "... the terminology describing missingness mechanisms is undeniably confusing. In particular, 'missing at random' is often conflated with 'missing completely at random', leading researchers to mistakenly conclude that any systematic patterns or mechanisms underlying the missing data contraindicate the use of multiple imputation." We adapted the following scenario from Bhaskaran and Smeeth's article to present MAR in a common context for PER. Their article provides a more thorough discussion of MAR.

We present the following scenario as an example of MAR. A research team collected concept inventory data, but they are concerned that the data is MNAR because the students who participated had much higher grades than the students who did not participate. Fig. 1 illustrates this scenario. The researchers can use the grade data to argue that the data is MAR because the missingness in the concept inventory data can largely be explained by the students' grades, as illustrated by Fig. 3. In the case of MAR data, splitting the data in Fig. 1 by grade results in Fig. 3 and shows similar distributions of collected and missing data for each grade. The distribution of missing data for the A students looks similar to the collected data for the A students, and so on for each group of students. The researchers can argue that within each group of students (A, B, C, D, and F) the primary factors related to their participation were not related to their performance (e.g., traffic, illness, a death in the family) and the groups with lower participation had more of these unrelated events overall. The difference in the aggregated data in Fig. 1 resulted from the difference in the proportion of students who participated for each grade, which is illustrated by the heights of the histograms in Fig. 3.
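The three mechanisms can be demonstrated on one simulated dataset; only the rule that deletes values changes. The grade effects and missingness rates below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
grade = rng.integers(0, 5, n)                       # observed covariate (F=0 ... A=4)
score = np.clip(rng.normal(40 + 10 * grade, 15), 0, 100)

# MCAR: every score has the same 30% chance of being missing.
mcar = rng.random(n) < 0.30

# MAR: missingness depends only on the observed grade, not the score.
mar = rng.random(n) < (0.50 - 0.08 * grade)

# MNAR: missingness depends on the (unobserved) score itself.
mnar = rng.random(n) < np.where(score < 50, 0.50, 0.10)

for name, miss in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(f"{name}: observed mean = {score[~miss].mean():.1f}")
```

Only under MCAR does the observed mean stay close to the full-sample mean; under MAR and MNAR the observed mean drifts upward because low scorers go missing more often. The key difference is that under MAR the grade variable, which is observed, can recover the missingness pattern, while under MNAR it cannot.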
E. The persistence of complete-case analysis
Despite the known and proven bias caused by ignoring missing data when it is not MCAR, many research fields continue to use complete-case analysis. Cheema [5] points out that complete-case analysis and other error-prone methods for handling missing data are common in education research. King et al. [24] found that 94% of political scientists used complete-case analysis, resulting in losing one third of their data on average. In biomedical research, few studies accurately report the amount of missing data or how they handled it, and those that do most commonly report using complete-case analysis [25–28]. These four critiques of complete-case analysis in biomedical research span from 2004 to 2015, indicating that researchers can consistently critique the use of complete-case analysis with little improvement in a field's practices.

FIG. 3. The simulated concept inventory data shown in Fig. 1 disaggregated by students' course grades. The similar distribution for each grade indicates that the data is MAR because course grade accounts for the missingness. We made the course grades follow a flat distribution (N_A = N_B = N_C = N_D = N_F) to focus the differences between the collected and missing data on the similar distributions by grade that indicate the data is MAR and to illustrate how combining the data results in Fig. 1, where the collected and missing distributions differ.

F. Imputation of missing data
Imputation is a principled technique for handling missing data [4]. Imputation fills in the missing data with plausible values, such that a researcher can analyze the now-complete data set without concern for missing data. Imputation methods fall into two broad categories: deterministic and probabilistic. We focus on probabilistic imputation methods in this article, but provide a brief review of deterministic methods for contrast.
Deterministic options for imputation include mean imputation and last observation carried forward. Mean imputation replaces the missing values with the mean value for that variable. Researchers use last observation carried forward with longitudinal data to replace the missing data with the last observed value for all subsequent measurements. Both are problematic because they (1) do not preserve the relationships between variables and (2), as with any single-imputation approach, do not account for the error incurred by the imputation process itself. These deterministic methods treat the missing values as if they were known, which can lead to inappropriately small variances and an erroneously increased chance of statistically significant findings [29].
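A brief sketch shows the variance shrinkage that makes mean imputation problematic; the data below are synthetic and the missingness rate is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(60, 15, 1_000)                   # "true" complete scores
x_miss = x.copy()
x_miss[rng.random(1_000) < 0.4] = np.nan        # 40% missing completely at random

# Mean imputation: every missing value becomes the observed mean.
filled = np.where(np.isnan(x_miss), np.nanmean(x_miss), x_miss)

print(f"true SD:         {x.std():.1f}")
print(f"mean-imputed SD: {filled.std():.1f}")   # artificially shrunk
```

The imputed dataset keeps the observed mean but compresses the spread, because a large block of identical values sits exactly at the center of the distribution. Downstream standard errors computed from such data are too small.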
Probabilistic options for imputation include multiple imputation (MI) and maximum likelihood estimation. In this article, we demonstrate the use of MI [4] because it is a probabilistic approach for addressing missing data across a wide range of applications [3] and because research finds that MI is more statistically powerful and more accurate than other methods for handling missing data [5, 16]. The idea behind MI is graphically presented in Fig. 4. The first step applies an imputation procedure containing a random component (such as predictive mean matching, which is described below) to a dataset with missing data M times to generate different imputed values for each piece of missing data, producing M complete data sets. Step two calculates the desired estimate from the analysis, such as a mean or regression coefficient, on each data set separately using standard analytical methods. The final step pools the estimates using simple combining rules, also known as Rubin's Rules [13], which are described later in Eqs. (1-5). These pooled results then properly reflect the variation in the original estimates and the variation introduced by the imputation process itself.

FIG. 4. The multiple imputation (MI) process. In the first step missing data (shown in white) is imputed (shown in dark blue) to create M complete data sets, with M = 3 shown here. Then each complete imputed dataset is analyzed using standard methods such as linear regression. Finally the results are pooled using Rubin's Rules.

The plausibility of the imputed values generated in the first step relies entirely on the model used for the imputation. Simplistic imputation models that do not use information contained in related variables will impute values that are not an accurate reflection of what the missing data could have been. For example, imputation models need to account for whether the data is longitudinal or if there is reason to suspect the data is MNAR, and the models need to include known correlations and relationships between variables or measures. In short, MI is only as good as the imputation model being used to create the imputed values.

Many software programs have built-in or add-on methods to perform MI, both the imputation and pooling steps. In this paper we used the MICE [30] package in RStudio V. 1.1.456 [31]. The MICE package uses predictive mean matching, an imputation method developed by Little [22] in 1988, as the default model to impute missing data for continuous variables. Predictive mean matching uses the following process [32] to multiply impute the missing data based on the data the researcher collected. We use a hat (ˆ) to differentiate observed y and predicted ŷ values.

1. Using the portion of the data with no missing values, build a linear model (b) by calculating the least squares estimates of the regression coefficients β̂, the model residuals ε̂, and the variance of the residuals σ̂².

2. Create a new linear model (b(m)) by randomly drawing values for the regression coefficients from a probability distribution centered on β̂ with variance derived from σ̂² and ε̂.

3. Use b to generate predictions ŷ_i for all cases with fully observed data, and b(m) to generate predicted values ŷ*_j for all cases with missing data (i ≠ j).

4.
For each case with a missing value, identify a set of k predictions on observed data (ŷ_i) that are close to the predicted value ŷ*_j. The k observed values y_i from these matched records form a donor pool of values, where k should vary between 3 and 10 depending on the size of the complete data set. The MICE package uses k = 5.

5. Randomly choose one observed value y_i from the donor pool to impute the missing value.

6. Repeat steps 2-5 for each of the M imputations.

Following analysis of each complete dataset, researchers, with the aid of statistical software, pool the individual results from across the M imputations using Rubin's Rules to generate valid estimates and intervals of the quantities of interest. To explain Rubin's Rules, let δ be the parameter whose estimate we desire to obtain from an analysis (i.e., a mean, correlation, or regression slope). Given M imputed data sets, M estimates of δ, (δ̂_1, δ̂_2, ..., δ̂_M), are generated and used to calculate the following quantities.

• The overall estimate of the parameter is the average of the individual point estimates:

  Q̂ = (1/M) Σ_{m=1}^{M} δ̂_m. (1)

• The within-imputation variance is the average of the individual variances:

  U = (1/M) Σ_{m=1}^{M} Var(δ̂_m). (2)

• The between-imputation variance is the variance of the estimates:

  B = Var(δ̂_1, δ̂_2, ..., δ̂_M). (3)

• The total variance is a weighted average of the within- and between-imputation variances:

  T = U + (1 + 1/M) B. (4)

• And 95% intervals are calculated using the total variance:

  Q̂ ± 1.96 √T. (5)

The resulting variance of the combined estimate then accounts for both the within and between data set variances. The predictive mean matching process incorporates randomness in steps 2 and 5. The amount of variance introduced in these steps depends on the variability and size of the data set being modeled. If the linear regression in step 1 provides an excellent fit with small standard errors for the coefficients, then little variability will be added by step 2 because each of the M linear models will be very similar and thus will generate similar predictions across the M imputations. Step 5 adds little variability if the data set is large because a large data set will likely have several similar values that will populate the donor pool. By pooling the within- and between-imputation variances, Rubin's Rules provides standard errors for the estimates based on all of the available information that account for the uncertainty introduced by the missing data.

G. Comparisons of methods for handling missing data in education research
Pampaka et al. [15] compared complete-case analysis to MI for handling missing data using a dataset that originally had large portions of missing data that they were able to fill in with subsequent data collection. This design allowed them to compare the results for MI and complete-case analysis of the missing data to the true values for the dataset with no missing data. The total dataset included 1,374 students, but complete-case analysis reduced the data to 495 students. Pampaka and colleagues used a logistic regression to model the probability that students dropped out of the current mathematics course they were enrolled in. The model included predictor variables for the mathematics course students took before this course, students' dispositions towards math, students' math self-efficacy, and students' grades on the General Certificate of Secondary Education (GCSE) for mathematics. Students who received an A on the GCSE were three times more likely to provide data than students who received a C, indicating that the data was not MCAR. Both the complete-case and MI models provided relationships between the variables similar to those in the true models. However, MI produced smaller standard errors than complete-case analysis. They concluded that MI provided a much closer approximation of the true values than complete-case analysis. Pampaka and colleagues do not discuss why the complete-case analysis and MI provided similar results or the implications of those similarities, nor does their study provide sufficient details for us to make meaningful inferences about the lack of differences.

Cheema [5] used a simulation study and two real datasets to provide guidance for researchers in designing studies to account for sample size, proportion of missing data, method of analysis, and method for handling missing data. The analysis compared four methods for handling missing data: multiple imputation, complete-case analysis, mean imputation, and maximum likelihood estimation.
To characterize the quality of the four methods, Cheema used the root mean square error (RMSE). RMSE is the standard deviation of the results from the multiple simulations about the mean of the results, and is a measure of the random error introduced by the four methods. As such, RMSE does not account for any bias (i.e., systematic error) between the mean of the simulations with missing data and the true values where no data is missing. Cheema compared the four analytical methods across three sample sizes and two levels of missingness. The two levels of missing data were 1% to 10% and 11% to 20%; very few studies in the PER literature report such low levels of missing data. This design created a decision tree with 24 possibilities. Multiple imputation was the most effective method in 15 cases and maximum likelihood estimation in 7 cases. Similar to Pampaka et al. [15], Cheema found that imputation methods increased the statistical power of the studies with samples less than 200 by large enough amounts to warrant the use of imputation methods. Cheema warned that missing data can bias data sets and inferences drawn from studies using these biased datasets. In these cases, he urged researchers to use statistical methods that accounted for that bias. However, Cheema did not measure the bias introduced by missing data in his study.

These two studies illustrate how MI tends to have greater statistical power than complete-case analysis. The trend toward greater statistical power for MI follows from MI using all of the available data and not discarding any data. These studies did not identify bias in the results from either complete-case analysis or MI.
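Before turning to our methods, it is worth noting that the pooling step itself, Rubin's Rules in Eqs. (1)-(5), reduces to a few lines of arithmetic. The sketch below pools estimates from M = 3 hypothetical imputed data sets; the estimate and variance values are invented for illustration.

```python
import math

def pool_rubin(estimates, variances):
    """Pool M point estimates and their variances with Rubin's Rules:
    returns the overall estimate (Eq. 1), the total variance (Eq. 4),
    and a 95% interval (Eq. 5)."""
    M = len(estimates)
    q_bar = sum(estimates) / M                                   # Eq. (1)
    u_bar = sum(variances) / M                                   # Eq. (2)
    b = sum((e - q_bar) ** 2 for e in estimates) / (M - 1)       # Eq. (3)
    t = u_bar + (1 + 1 / M) * b                                  # Eq. (4)
    half = 1.96 * math.sqrt(t)                                   # Eq. (5)
    return q_bar, t, (q_bar - half, q_bar + half)

# Illustrative posttest means and variances from M = 3 imputations.
est, tot_var, ci = pool_rubin([54.2, 55.1, 53.8], [0.40, 0.38, 0.41])
print(round(est, 2), round(tot_var, 3), ci)
```

Note that the total variance exceeds the average within-imputation variance whenever the M estimates disagree; that excess is exactly the extra uncertainty contributed by the missing data.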
IV. METHODS
We compared the accuracy of estimates from MI and complete-case analysis using simulated course data for grades, pretest and posttest concept inventory scores, and missing values for posttest concept inventory scores. Our analysis focused on course-level mean posttest scores as the estimate of interest (µ_post). While we focused on posttest means, we also analyzed mean pretest scores (µ_pre) because many effect sizes and analytical methods use both pretest and posttest scores. Data simulation included a random component that allowed us to generate complete data, create missing values, and calculate µ many (20) times to generate a distribution of µ's. Running the analyses twenty times informed how consistently the measures and methods for handling missing data performed.

Figure 5 illustrates our process for generating the complete and missing data. In the first stage, we simulated complete courses by using five performance models of the relationships between course grades and mean concept inventory scores; one model of the relationship between the mean concept inventory score for a group and the standard deviation of the scores for that group; and three models of grade distributions. This first stage produced the true values (µ) for our analysis. In the second stage, we introduced missing posttest data into the simulated courses using five models of the relationship between participation and course grade based on prior research [2]. Because we removed posttest scores based on course grade, the data was MAR. In the third stage, we calculated estimates (µ̂) using complete-case analysis and MI. This design allowed us to assess the effect of the simulation model parameters and the method of handling missing data on the accuracy of the estimates.

Because earlier studies did not find large differences in participation rates between male and female students, we did not include gender as a variable in our simulated data.

FIG. 5. Overview of data simulation and analysis methods. In the first stage, we used models of performance, standard deviation, and grade distributions to simulate courses. These simulated courses provided the true values for our analyses. In the second stage, we used participation models to create missing data by deleting posttest scores from the simulated course data. In the third stage, we analyzed the datasets with missing data using both multiple imputation (MI) and complete-case analysis (CC). This stage provided the MI and CC estimates. We used the three outputs (true values, MI estimates, and CC estimates) to investigate the bias introduced by MI and complete-case analysis.

A. Simulating the complete data to generate true results
We simulated the course data by simulating data for each of the five course grade subsets (A, B, C, D, and F) and then combining the five subsets into a single dataset. To generate the concept inventory scores, we used a truncated normal distribution, which limited the scores to between 0% and 100%. The normal distribution required inputs for mean (µ), standard deviation (σ), and sample size (N). The mean for each grade came from five performance models based on three physics courses investigated by Nissen et al. [2]. The standard deviation came from a model of the relationship between the mean and standard deviation for 197 pretest or posttest administrations of concept inventories. The sample size for each grade subset came from the total course size and three grade distributions we developed based on the grade distributions from 192 STEM courses. We used the five performance models and three grade distributions to cover a range of relationships that could occur in PER studies.
1. Determining means using the relationships between concept inventory scores and course grades
To generate realistic concept inventory scores, we examined the relationship between course grade and concept inventory scores using data from Nissen et al. [2]. We disaggregated the students in each course by their course grade and calculated the mean concept inventory score for each group of students in each course. We transformed the grades to the numeric values, A=4, B=3, C=2, D=1, and F=0, that the institution used to calculate student grade point average (GPA). Figure 6 presents the means for each course grade and linear regression fit lines for the pretests and posttests for the three courses. Table II includes the intercept, slope, and r for each linear regression. Based on the scatter plots in Fig. 6 and the r value exceeding 0.5 for 5 of the 6 models, we concluded that a linear model adequately described the relationship between mean concept inventory scores and course grades.
FIG. 6. Raw data and linear regression fit lines for average pretest and posttest scores for each grade for the three courses described by Nissen et al. [2].
TABLE II. Linear models of the relationship between concept inventory score and course grade for pretests and posttests.

Test  Course  Intercept  Slope  r
Pre   One     24.5       0.99   0.52
Pre   Two     25.7       1.43   0.69
Pre   Three   34.2       0.91   0.13
Post  One     26.0       3.08   0.75
Post  Two     24.9       7.02   0.98
Post  Three   44.8       3.77   0.73

The mean concept inventory scores represented the average value for each grade about which the models simulated the individual scores. To cover a broad range of performance levels, we built models for five different performance levels that were informed by the linear models from the three courses studied by Nissen et al. [2]. The models differ from the results in Table II because our goal was to cover a broad range of possible relationships rather than to replicate the relationships that we found. Table III contains the model parameters for the one pretest model and the five posttest models. Equation (6) shows the generalized equation that we used to calculate the mean score for each grade based on the models in Table III. We started with an average model and modified it to create two high-performance models and two low-performance models by varying either the slope or the intercept in the model. The intercept established the mean concept inventory score for the subgroup that earned an F. The slope established the size of the difference between each grade. These five models covered a range of relationships to inform how varying the slope and intercept related to the bias introduced by using MI or complete-case analysis and to provide more robust and generalizable results.
TABLE III. Model parameters used to simulate pretest and posttest score data.

Model       Intercept  Slope
Pretest     25         2
Average     43         6
Low Int.    25         6
High Int.   58         6
Low Slope   43         3
High Slope  43         10

µ_Grade = Intercept + Slope ∗ Grade.  (6)
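As a sketch of this first simulation stage, Eq. (6) and a truncated normal distribution are enough to simulate the scores for one grade subset. The code below is only an illustration, not the authors' implementation (their analyses were done in R [31]); the rejection-sampling helper is an assumption, the intercept and slope follow the Average model in Table III, and the fixed σ of 20 is an arbitrary stand-in for the quadratic standard-deviation model described in the next subsection.

```python
import random

def truncated_normal(mu, sigma, lo=0.0, hi=100.0, rng=random):
    """Draw one score from a normal(mu, sigma) truncated to [lo, hi]
    using simple rejection sampling."""
    while True:
        x = rng.gauss(mu, sigma)
        if lo <= x <= hi:
            return x

def simulate_grade_subset(grade, n, intercept=43.0, slope=6.0,
                          sigma=20.0, seed=0):
    """Simulate posttest scores for one grade subset (A=4 ... F=0).
    The subset mean follows Eq. (6): mu_grade = intercept + slope * grade."""
    rng = random.Random(seed)
    mu = intercept + slope * grade  # Eq. (6)
    return [truncated_normal(mu, sigma, rng=rng) for _ in range(n)]

# Example: 200 simulated B students (grade = 3) under the Average model,
# so the subset mean is near 43 + 6 * 3 = 61%.
scores = simulate_grade_subset(grade=3, n=200)
```

Combining five such subsets (A through F), with N for each subset taken from a grade distribution, yields one simulated course.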
2. Determining standard deviation using distribution of concept inventory scores
We used 197 means and standard deviations from either pretests or posttests to build a quadratic model for the relationship between mean and standard deviation. This data came from both the literature and concept inventories collected with the LASSO platform [33]. A quadratic model fit the data because the standard deviation should approach 0 at both of the boundaries of the test scores (0% and 100%). Equation (7) describes the fit line. We determined that the quadratic fit line was adequate because the adjusted r for the fit line was 0.34 and all coefficients were statistically significant.

σ = 16.… + …∗µ − …∗µ².  (7)
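The numerical coefficients of Eq. (7) did not survive extraction, so the sketch below uses illustrative placeholder values, chosen only to reproduce the qualitative shape the text describes: a downward-opening quadratic whose standard deviation shrinks as the mean approaches the score boundaries. These are assumptions, not the fitted coefficients.

```python
def sigma_of_mu(mu, b0=16.0, b1=0.4, b2=0.004):
    """Quadratic model of standard deviation as a function of the mean
    score (in percent): sigma = b0 + b1*mu - b2*mu**2.
    The coefficients here are illustrative placeholders, NOT the fitted
    values from Eq. (7)."""
    return b0 + b1 * mu - b2 * mu ** 2

# The quadratic peaks mid-scale and narrows toward the boundaries,
# matching the rationale the text gives for choosing a quadratic fit.
mid, low, high = sigma_of_mu(50), sigma_of_mu(0), sigma_of_mu(100)
```

In a simulation pipeline, σ for each grade subset would be computed from that subset's mean via this function before drawing the truncated normal scores.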
3. Determining sample size based on grade distributions in STEM courses
To determine the number of students that earned each grade in our simulated courses, we analyzed grade distributions from 192 STEM courses at California State University, Chico to build three different grade distributions: low, average, and high. We combined the drop, withdraw, and fail grades into a single F group. To build the low grade distribution, we averaged the grade distributions from 13 courses with less than 10% As and greater than 30% Fs. We built the average grade distribution by averaging all 192 grade distributions. To build the high grade distribution, we averaged the grade distributions from 6 courses with greater than 20% As and greater than 20% Bs. Figure 7 shows the three grade distributions. We reasoned that these three distributions covered the range of grade distributions found in most STEM courses.
FIG. 7. Three grade distributions based on grades from 192 STEM courses.
We simulated courses based on a course size of 1,000 students. While this size is larger than typical courses, it allowed us to use fewer replications (twenty) of the course-level simulations to quantify any bias introduced by MI or complete-case analysis. The actual size of each simulated course was 990 for the low grade distribution and 970 for the medium and high grade distributions. These sizes differed from each other and from 1,000 due to rounding in the course grade data we used to calculate the three grade distributions.
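The rounding effect described above can be sketched in a few lines. The proportions below are hypothetical whole-percent values (the exact distributions behind Fig. 7 are not reproduced here); the point is only that rounded proportions need not sum to exactly 1, so per-grade counts need not sum to the nominal course size.

```python
def grade_counts(proportions, course_size=1000):
    """Convert a grade distribution (proportions per letter grade) into
    per-grade student counts for a simulated course. Because the
    proportions are rounded, the counts need not sum to course_size."""
    return {g: round(p * course_size) for g, p in proportions.items()}

# Hypothetical "low" distribution: few As, many Fs, proportions summing
# to 0.99 rather than 1.00 because of rounding.
low = {"A": 0.08, "B": 0.17, "C": 0.26, "D": 0.17, "F": 0.31}
counts = grade_counts(low)
total = sum(counts.values())  # 990 rather than 1,000
```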
4. Simulated course data
The five performance models and three grade distributions created a total of 15 different simulated courses. For each of these 15 courses, we simulated 20 datasets (replications) with approximately 1,000 students each. This process resulted in 300 different datasets.
Figure 8 provides an example of data generated for one course using the high slope model, with an intercept of 43 and a slope of 10 for the posttest scores, and an average grade distribution. For the high slope model, each grade higher increased the average posttest concept inventory score by 10 percentage points. Students with F grades had a 43% posttest score on the concept inventory on average, and this rose to 53% for Ds, 63% for Cs, 73% for Bs, and 83% for As. The diamonds in Fig. 8 represent the mean test scores for the subgroups and illustrate the linear relationship between grade and both pretest and posttest means. The density plots for the pretests (top of Fig. 8) and posttests (right of Fig. 8) illustrate the variance of the generated scores about the means. The density plots for posttest scores covered a larger range of means and illustrate how the quadratic equation for standard deviation concentrated the scores into a narrower range as the mean score neared 100%. Table IV provides the true average values for the complete data for pretest and posttest means and the absolute gain across the simulated courses.
FIG. 8. Example data for an average grade distribution and high slope performance model. The diamonds are located at the means for each grade and illustrate the linear relationship between grade and mean test score. The density plots display the marginal distributions of the simulated pretest and posttest data for this simulated course.
TABLE IV. Descriptive statistics for the 15 simulated courses: average true pretest and posttest scores and gains.

Performance  Intercept  Slope  Grade Dist.  µ_pre (%)  µ_post (%)  Gain (%)
Average Model
Average      43         6      Low          30.2       51.7        21.5
Average      43         6      Average      31.4       55.9        24.5
Average      43         6      High         32.1       58.5        26.4
Changing Intercept Models
Low Int.     25         6      Low          30.2       47.5        17.3
Low Int.     25         6      Average      31.4       49.4        18.0
Low Int.     25         6      High         32.1       50.8        18.7
High Int.    58         6      Low          30.2       57.5        27.2
High Int.    58         6      Average      31.4       64.3        32.8
High Int.    58         6      High         32.1       68.9        36.8
Changing Slope Models
Low Slope    43         3      Low          30.2       35.3        5.1
Low Slope    43         3      Average      31.4       38.8        7.4
Low Slope    43         3      High         32.1       41.2        9.1
High Slope   43         10     Low          30.2       66.6        36.3
High Slope   43         10     Average      31.4       70.7        39.2
High Slope   43         10     High         32.1       73.5        41.4
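The bookkeeping behind the 300 datasets (5 performance models × 3 grade distributions × 20 replications) can be sketched as follows; the model and distribution labels come from Tables III and IV, while the loop structure itself is an assumption about how such a simulation would be organized.

```python
from itertools import product

performance_models = ["Average", "Low Int.", "High Int.",
                      "Low Slope", "High Slope"]
grade_distributions = ["Low", "Average", "High"]
replications = range(20)

# Each (performance model, grade distribution) pair defines one of the
# 15 simulated courses; 20 replications of each yield 300 datasets.
datasets = list(product(performance_models, grade_distributions,
                        replications))
```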
B. Models for missing data
We used the participation models for computer-based posttests from Nissen et al. [2] to create five levels of MAR data based on course grades in the simulated posttest data for each of the 15 simulated courses described in Table IV. Table V and Fig. 2 show the five models for missing data, with the value for 'recommended practices' distinguishing between the five models. We used the model predictions provided by Nissen et al. [2] that used the average value for gender because we did not include gender as a variable in our simulated data.
TABLE V. Participation rates for each final course grade based on models from Nissen et al. [2]. The model number represents the number of recommended practices to maximize student participation input into the final model.

Grade  Model 0  Model 1  Model 2  Model 3  Model 4
A      0.30     0.75     0.96     0.99     1.00
B      0.13     0.45     0.82     0.96     0.99
C      0.05     0.18     0.49     0.81     0.95
D      0.02     0.05     0.17     0.41     0.71
F      0.01     0.02     0.04     0.10     0.24

To insert missing data into the posttest scores, we first disaggregated the simulated complete data by course grade. Then, we used the participation models to determine the number of posttest scores that should be missing for that grade according to that model. Finally, we randomly deleted the appropriate number of posttest scores. As an example, for participation Model 2 in Table V (i.e., recommended practices = 2), we deleted 96% of posttest scores for Fs, 83% for Ds, 51% for Cs, 18% for Bs, and 4% for As. The randomization for deleting the posttest scores was done independently across all simulated datasets. Removing posttest scores represents a typical situation in which a student withdraws from the course or decreases their participation in the course at the end of the semester. Removing only posttest scores had a limited impact on the complete-case analysis because complete-case analysis removes both pretest and posttest scores when either is missing. These methods for generating missing data provided participation rates, the percentage of students who took both the pretest and posttest, that covered the range of 30% to 80% reported in the literature and presented in Table I.
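The deletion procedure described above (disaggregate by grade, compute how many scores should be missing, delete that many at random) can be sketched directly. The participation rates are the Model 2 values from Table V; the data structure and helper name are assumptions for illustration, with None standing in for a missing posttest.

```python
import random

# Participation rates for Model 2 from Table V.
MODEL_2 = {"A": 0.96, "B": 0.82, "C": 0.49, "D": 0.17, "F": 0.04}

def delete_posttests(posttests_by_grade, rates, seed=0):
    """Make posttest data MAR on course grade: within each grade subset,
    compute the number of missing scores from the participation rate,
    then delete that many scores at random (None = missing)."""
    rng = random.Random(seed)
    out = {}
    for grade, scores in posttests_by_grade.items():
        n_missing = round((1.0 - rates[grade]) * len(scores))
        missing_idx = set(rng.sample(range(len(scores)), n_missing))
        out[grade] = [None if i in missing_idx else s
                      for i, s in enumerate(scores)]
    return out

# Example: 100 students per grade (placeholder scores); Model 2 deletes
# 4 A, 18 B, 51 C, 83 D, and 96 F posttest scores.
complete = {g: [60.0] * 100 for g in MODEL_2}
mar = delete_posttests(complete, MODEL_2)
```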
C. Measuring accuracy using bias
To inform the extent to which complete-case analysis and MI provided biased estimates for posttest scores, we measured the accuracy of the results using bias. We calculated bias as the average difference between the true posttest mean (µ) and the mean from either the complete-case or MI analysis (µ̂). This formula is shown in Eq. (8), where n represents the number of replications, which we set at 20 for each of the simulated courses. A bias greater than zero indicated that the estimates were larger than the true values.

bias = (1/n) Σ_{i=1}^{n} (µ̂_i − µ_i).  (8)

FIG. 9. Bias in the pretest model for the three grade distributions. Only the bias for the complete-case analysis is presented because no data was missing for the pretest and therefore the MI estimates could not be biased.
V. RESULTS
We first present the bias in the pretest model across the three grade distributions. Second, we present the bias in the posttest scores for the 15 simulated courses. Last, we present a comparison of two simulated courses to illustrate the potential impact of the bias introduced by complete-case analysis and MI on research results.
We used the same model of the relationship between grade and test scores to simulate the pretest data for all five of the performance models because we expected the bias for the estimates of pretest scores to be smaller than that for the posttest scores. Figure 9 presents the pretest bias introduced by complete-case analysis. The participation models only inserted missing data in the posttests. The complete-case analysis created missing pretest data by discarding the pretest scores from students that did not participate in the posttest. MI discards no data and there were no missing pretest scores, so it introduced no bias into the analysis for the pretest scores. Complete-case analysis introduced small amounts of bias (below 2 percentage points) into the pretest estimates.
FIG. 10. Bias in the posttest data introduced by complete-case analysis or MI.
The bias in the posttest estimates tended to be positive and to overestimate the true values. Conducting complete-case analysis resulted in more bias than conducting MI. Conducting complete-case analysis always produced positive biases with a minimum value of 0.7 percentage points and a maximum value of 12.8 percentage points. The bias of 12.8 percentage points meant that complete-case analysis estimated the posttest mean to be 70.2% on average for the high slope, low grade distribution simulated course while the true average value was 57.4%. In contrast to complete-case analysis, conducting MI produced negative biases for 19 of the 75 measurements, with a minimum value of -0.3 percentage points and a maximum value of 1.9 percentage points. These results indicate that both methods tend to overestimate the true posttest scores, but that the overestimation was much larger for complete-case analysis.
This overall trend of larger bias resulting from complete-case analysis than from MI was true for all 75 combinations of performance, grade distribution, and participation rates. Even at the lowest level of participation, the MI analysis tended to produce less bias than the highest level of participation for the complete-case analysis, as is illustrated by the boundary between the two graphs in Fig. 10.
The bias introduced by conducting both MI and complete-case analysis tended to decrease as the participation rate increased. This trend occurred for complete-case analysis of all 15 of the simulated courses but was less consistent for MI analysis of the simulated courses. These results illustrate the value of maximizing participation rates for achieving accurate estimates of concept inventory means.
Differences in bias across the five performance models for complete-case analysis indicated that varying slope had a stronger impact on bias than varying intercept. As shown in Fig. 10, the largest bias occurred for the high slope simulated courses (long-dashed line with empty squares) and the lowest bias occurred for the low slope simulated courses (dotted line with filled squares). The maximum bias for the high-slope simulated courses was 12.8 percentage points whereas the maximum bias for the high-intercept simulated courses (dashed lines with empty triangles) was 7.4 percentage points. This difference in bias was not caused by a difference in posttest scores, as the bias was larger in the high-slope simulated courses but the mean posttest score was lower (57.4% for the 12.8 percentage point bias versus 66.6% for the 7.4 percentage point bias). Similarly, comparing the low slope and low intercept high grade distribution simulated courses shows that the bias for the low slope course was lower (0.7 versus 1.2 percentage points maximum bias for each).
In contrast, the posttest mean was higher for the low-slope simulated courses (50.7% for the 0.7 percentage point bias versus 41.9% for the 1.2 percentage point bias). These relations indicated that the absolute value of the posttest mean was not the primary factor in the amount of bias introduced by complete-case analysis. Rather, the relationships within the datasets and the total amount of missing data best explained the bias.
FIG. 11. Bar graph illustrating the effect of bias from complete-case analysis or MI on a comparison of two courses. Performance in both courses was average. The traditional course had a low grade distribution and low participation rates. The transformed course had a high grade distribution and a high participation rate. We did not include error bars to focus on the effects of bias and because they are very small due to the large sample sizes for the simulated data.
Unlike complete-case analysis, the bias for MI did not reveal consistent differences between the performance models or grade distributions and bias. The much lower overall bias for MI may obscure differences in bias across the simulated courses. However, Fig. 10 shows that the clear differences in bias for complete-case analysis across the simulated courses did not exist for MI.
To compare how the bias introduced by complete-case analysis and MI could skew comparisons, in Fig. 11 we compared two simulated courses with similar performance within each grade but different grade distributions and participation rates. Using the average performance model for both courses simplified comparing the results because the performance for students who earned the same grade was the same across the two courses. We varied the participation and grade distributions between the two courses to align with comparisons between traditional and transformed courses that occur in the PER literature (e.g., Brewe et al. [20]). The two comparison courses are listed below.
1. Traditional Course
(a) Average performance within each grade
(b) Low grade distribution
(c) Low participation (37%)
2. Transformed Course
(a) Average performance within each grade
(b) High grade distribution
(c) High participation (81%)
The true values indicated that students in the transformed course learned more conceptual knowledge on average than the students in the traditional course. This difference follows from the higher grade distribution in the transformed course and the same performance model in both courses. The larger gains in the transformed course remained when we analyzed the data with MI. However, complete-case analysis nearly eliminated the difference in gains on the concept inventory. This decrease in the difference between the courses occurred because little data was collected in the traditional course from students with low grades and thus the analysis positively biased the gain.
In contrast to the true results and the results after analysis with MI, the results from the complete-case analysis do not support the claim that students learned substantially more in the transformed course than in the traditional course.
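The qualitative pattern in these results can be reproduced in a small self-contained sketch. Everything below is illustrative rather than the paper's implementation (the authors used mice in R [30, 31]): stochastic regression imputation on grade stands in for chained-equations MI, the grade distribution is uniform, and the participation rates mimic Model 2 of Table V. For a pooled mean, Rubin's rules reduce to averaging the per-imputation means.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def simulate_and_compare(seed=1, m=20):
    rng = random.Random(seed)
    # Average performance model for the posttest mean (43 + 6 * grade)
    # and participation rates similar to Model 2 of Table V.
    keep = {4: 0.96, 3: 0.82, 2: 0.49, 1: 0.17, 0: 0.04}
    grades, posts, observed = [], [], []
    for g in keep:
        for _ in range(200):
            grades.append(g)
            posts.append(43 + 6 * g + rng.gauss(0, 10))
            observed.append(rng.random() < keep[g])
    true_mean = sum(posts) / len(posts)
    # Complete-case estimate: mean of the observed posttests only.
    cc_mean = (sum(y for y, o in zip(posts, observed) if o)
               / sum(observed))
    # Imputation: fit posttest ~ grade on the observed cases, then fill
    # each missing score m times with predicted value plus residual noise
    # and pool the m dataset means.
    obs_g = [g for g, o in zip(grades, observed) if o]
    obs_y = [y for y, o in zip(posts, observed) if o]
    a, b = fit_line(obs_g, obs_y)
    resid_sd = (sum((y - (a + b * g)) ** 2
                    for g, y in zip(obs_g, obs_y))
                / (len(obs_y) - 2)) ** 0.5
    imp_means = []
    for _ in range(m):
        filled = [y if o else a + b * g + rng.gauss(0, resid_sd)
                  for g, y, o in zip(grades, posts, observed)]
        imp_means.append(sum(filled) / len(filled))
    mi_mean = sum(imp_means) / m
    return true_mean, cc_mean, mi_mean

true_mean, cc_mean, mi_mean = simulate_and_compare()
cc_bias = cc_mean - true_mean  # large and positive (MAR data)
mi_bias = mi_mean - true_mean  # much closer to zero
```

Because the missingness depends on grade and grade predicts the posttest, the complete-case mean is pulled toward the high-grade students, while the imputation model uses the observed grade-score relationship to recover the full-course mean.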
VI. DISCUSSION
Complete-case analysis can introduce large amounts of bias into the estimates for concept inventory scores when researchers apply it to data that is not MCAR. The bias introduced by complete-case analysis in the simulated data ranged from 0.7% to 12.8% for the posttest means and fell below 2% for the pretest means. The 28 articles we reviewed, which included 158 courses, reported gains from 5% to 56% with an average of 23%. Twenty-three of these studies used complete-case analysis, none reported using a principled method for handling missing data (e.g., MI), and none indicated that the missing data in the study was MCAR. Consequently, our results indicate that part of the gains reported in those studies likely resulted from the improper use of complete-case analysis. In some of those studies, complete-case analysis may have exaggerated the gains by anywhere from one third to a factor of two. The introduced bias may have also skewed any comparisons made in those studies, particularly comparisons across courses with different participation rates.
We cannot say exactly how much of these reported gains resulted from bias introduced by complete-case analysis. Our results indicate that the amount of bias complete-case analysis introduces depends on both the participation rate and the relationships within the data. To determine the bias in prior studies that used complete-case analysis without meeting the assumptions for its reliable use, researchers will need to analyze the data directly. However, physics education researchers seldom publish the data or analytical code used in their studies. The PER community can improve transparency and accountability by supporting researchers in publishing or publicly sharing the datasets from their research.
Going forward, sharing data would allow the research community to double-check the impact that the methods for handling missing data have on the conclusions that researchers draw from their data.
The bias introduced by complete-case analysis could obscure differences across courses and undermine both research and evaluation work. For example, we compared a simulated traditional course with a simulated transformed course. The simulated transformed course had lower DWF rates, higher grades, and greater conceptual learning. Bias introduced by using complete-case analysis obscured the differences in conceptual learning between the two simulated courses. In a comparison of real courses, a critic of the transformed course with lower DWF rates could use the similar results from the complete-case analysis of the concept inventory scores to claim the transformed course had lower grading standards. Otherwise, the transformed course would have outperformed the other course on the concept inventory. Using MI to account for the missingness in the data introduced less bias into the results and preserved the true result that, overall, students learned more in the transformed course. Researchers and educators need accurate results to inform the design and implementation of research-based teaching materials. If researchers continue to use complete-case analysis without accounting for the impact of missing data, they risk wasting time and resources either discarding useful interventions or pursuing false leads.
VII. CONCLUSION
Researchers, reviewers, and editors can take several steps to improve the handling of missing data in quantitative studies. During the data collection process, researchers should take reasonable actions to minimize the amount of missing data. However, education researchers often cannot avoid some missing data in their studies. Researchers should use MI or another principled method for handling missing data. Researchers using complete-case analysis should present evidence that their data is MCAR. However, principled methods for handling missing data, such as MI, are not a panacea. Rather, principled methods are only one component of the diligence necessary to address missing data. Before analyzing the data and deciding on an appropriate method for handling the missing data, researchers should examine the amount of missing data; patterns in the missing and complete data; and the mechanisms behind those patterns. When implementing MI to address missing data, researchers should check that their data meets the assumptions of the MI algorithm. Many MI software packages include tools to check these assumptions. Studies should state the participation rates in their data collection, describe the methods they used to address missing data, discuss patterns in the missing data, and discuss how the missing data may influence analytical results. These steps will improve the quality, reliability, and replicability of quantitative studies on student outcomes in physics.
VIII. ACKNOWLEDGEMENTS
This work is funded in part by NSF-IUSE Grant No. DUE-1525338 and is Contribution No. LAA-059 of the Learning Assistant Alliance. We are grateful to the Learning Assistant Program at the University of Colorado Boulder for establishing the foundation for LASSO and LASSO studies.

[1] Jayson M. Nissen, Robert M. Talbot, Amreen Nasim Thompson, and Ben Van Dusen, "Comparison of normalized gain and Cohen's d for analyzing gains on concept inventories," Phys. Rev. Phys. Educ. Res., 010115 (2018).
[2] Jayson M. Nissen, Manher Jariwala, Eleanor W. Close, and Ben Van Dusen, "Participation and performance on paper- and computer-based low-stakes assessments," International Journal of STEM Education, 21 (2018).
[3] Joseph L. Schafer, "Multiple imputation: a primer," Statistical Methods in Medical Research, 3–15 (1999).
[4] Roderick J. A. Little and Donald B. Rubin, Statistical Analysis with Missing Data, Vol. 333 (John Wiley & Sons, 2014).
[5] Jehanzeb R. Cheema, "A review of missing data handling methods in education research," Review of Educational Research, 487–508 (2014).
[6] Allan Donner, "The relative effectiveness of procedures commonly used in multiple regression analysis for dealing with missing values," The American Statistician, 378–381 (1982).
[7] Donald B. Rubin, "Inference and missing data," Biometrika, 581–592 (1976).
[8] Peter H. Van Ness, Terrence E. Murphy, Katy L. B. Araujo, Margaret A. Pisani, and Heather G. Allore, "The use of missingness screens in clinical epidemiologic research has implications for regression modeling," Journal of Clinical Epidemiology, 1239–1245 (2007).
[9] Shona Fielding, Peter M. Fayers, and Craig R. Ramsay, "Investigating the missing data mechanism in quality of life outcomes: a comparison of approaches," Health and Quality of Life Outcomes, 57 (2009).
[10] Lauren Kost, Steven Pollock, and Noah Finkelstein, "Characterizing the gender gap in introductory physics," Phys. Rev. ST Phys. Educ. Res., 010101 (2009).
[11] Lauren E. Kost-Smith, Steven J. Pollock, Noah D. Finkelstein, Geoffrey L. Cohen, Tiffany A. Ito, Akira Miyake, Chandralekha Singh, Mel Sabella, and Sanjay Rebello, "Gender differences in physics 1: The impact of a self-affirmation intervention," PERC Proceedings, 197–200 (2010).
[12] Jayson M. Nissen and Jonathan T. Shemwell, "Gender, experience, and self-efficacy in introductory physics," Phys. Rev. Phys. Educ. Res., 020105 (2016).
[13] Donald B. Rubin, Multiple Imputation for Nonresponse in Surveys, Vol. 81 (John Wiley & Sons, 2004).
[14] Donald B. Rubin, "Multiple imputation after 18+ years," Journal of the American Statistical Association, 473–489 (1996).
[15] Maria Pampaka, Graeme Hutcheson, and Julian Williams, "Handling missing data: analysis of a challenging data set using multiple imputation," International Journal of Research & Method in Education, 19–37 (2016).
[16] Yiran Dong and Chao-Ying Joanne Peng, "Principled missing data methods for researchers," SpringerPlus, 222 (2013).
[17] Remy Dou, Eric Brewe, Justyna P. Zwolak, Geoff Potvin, Eric A. Williams, and Laird H. Kramer, "Beyond performance metrics: Examining a decrease in students' physics self-efficacy through a social networks lens," Phys. Rev. Phys. Educ. Res., 020124 (2016).
[18] Michael J. Cahill, Mark A. McDaniel, Regina F. Frey, K. Mairin Hynes, Michelle Repice, Jiuqing Zhao, and Rebecca Trousil, "Understanding the relationship between student attitudes and student learning," Phys. Rev. Phys. Educ. Res., 010107 (2018).
[19] Rachel Henderson, Gay Stewart, John Stewart, Lynnette Michaluk, and Adrienne Traxler, "Exploring the gender gap in the conceptual survey of electricity and magnetism," Phys. Rev. Phys. Educ. Res., 020114 (2017).
[20] Eric Brewe, Vashti Sawtelle, Laird H. Kramer, George E. O'Brien, Idaykis Rodriguez, and Priscilla Pamelá, "Toward equity through participation in Modeling Instruction in introductory university physics," Phys. Rev. ST Phys. Educ. Res., 1–12 (2010).
[21] Michael J. Cahill, K. Mairin Hynes, Rebecca Trousil, Lisa A. Brooks, Mark A. McDaniel, Michelle Repice, Jiuqing Zhao, and Regina F. Frey, "Multiyear, multi-instructor evaluation of a large-class interactive-engagement curriculum," Phys. Rev. ST Phys. Educ. Res., 1–19 (2014).
[22] Roderick J. A. Little, "Missing-data adjustments in large surveys," Journal of Business & Economic Statistics, 287–296 (1988).
[23] Krishnan Bhaskaran and Liam Smeeth, "What is the difference between missing completely at random and missing at random?" International Journal of Epidemiology, 1336–1339 (2014).
[24] Gary King, James Honaker, Anne Joseph, and Kenneth Scheve, "Analyzing incomplete political science data: An alternative algorithm for multiple imputation," American Political Science Review, 49–69 (2001).
[25] Sara Fernandes-Taylor, Jenny K. Hyun, Rachelle N. Reeder, and Alex H. S. Harris, "Common statistical and research design problems in manuscripts submitted to high-impact medical journals," BMC Research Notes, 304 (2011).
[26] A. Burton and D. G. Altman, "Missing covariate data within cancer prognostic studies: a review of current reporting and proposed guidelines," British Journal of Cancer, 4 (2004).
[27] Nicholas J. Horton and Ken P. Kleinman, "Much ado about nothing: A comparison of missing data methods and software to fit incomplete data regression models," The American Statistician, 79–90 (2007).
[28] Katya L. Masconi, Tandi E. Matsha, Justin B. Echouffo-Tcheugui, Rajiv T. Erasmus, and Andre P. Kengne, "Reporting and handling of missing data in predictive research for prevalent undiagnosed type 2 diabetes mellitus: a systematic review," EPMA Journal, 7 (2015).
[29] Naresh K. Malhotra, "Analyzing marketing research data with incomplete information on the dependent variable," Journal of Marketing Research, 74–84 (1987).
[30] Stef van Buuren and Karin Groothuis-Oudshoorn, "mice: Multivariate imputation by chained equations in R," Journal of Statistical Software, 1–67 (2011).
[31] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria (2018).
[32] Gerko Vink, Laurence E. Frank, Jeroen Pannekoek, and Stef van Buuren, "Predictive mean matching imputation of semicontinuous variables," Statistica Neerlandica, 61–90 (2014).
[33] Learning Assistant Alliance, "Learning About STEM Student Outcomes (LASSO) Platform," (2018), https://learningassistantalliance.org/