Exploring the contributions of self-efficacy and test anxiety to gender differences in assessments
Jared B. Stang, Emily Altiere, and Joss Ives
Dept. of Physics & Astronomy, University of British Columbia, 6224 Agricultural Road, Vancouver, BC V6T 1Z1
Patrick J. Dubois
Dept. of Psychology, University of British Columbia, 2136 West Mall, Vancouver, BC V6T 1Z4
The observed performance difference between women and men on assessments in physics—the “gender gap”—is a significant and persistent inequity which has broad implications for the participation of women in physics. Research also shows that gender-based inequities extend to affective measures, such as self-efficacy. In this exploratory study, we report on gender disparities in self-efficacy and test anxiety and their relationship to assessment scores in our active-learning introductory physics course. Overall, gender-based differences in favour of men are observed in all our measures, with women having lower scores on measures associated with success (self-efficacy and assessment scores) and a higher score on a possibly detrimental affective factor (test anxiety). Using a multiple regression model-selection process to explore which measures may explain end-of-course Force Concept Inventory (FCI) and final exam scores, we find that the best fitting models include FCI pretest and self-efficacy as predictors, but do not include test anxiety.

I. INTRODUCTION
There is a significant and persistent disparity in the participation of women and men in science, technology, engineering, and mathematics (STEM) disciplines. In 2016, women made up only 23% of the Canadian STEM workforce ages 25-64 [1] and earned only 36% of Bachelor’s degrees in STEM in the United States; within physics, the share of degrees earned by women that year was only 20% [2]. A key issue related to this participation disparity is the consistently observed performance difference between women and men on assessments: the “gender gap” [3-8]. Despite the extensive work by the PER community on gender-based performance differences, the issue remains unresolved and the implications continue to disadvantage women in physics courses and beyond.

A parallel developing research area has been gender-based disparities in affective measures such as self-efficacy. These have been reviewed for STEM overall [9, 10] and have been studied in the context of physics [11-13]. In this report, we focus on gender disparities in self-efficacy and test anxiety. Self-efficacy influences the choices someone makes and the effort they put forth in a task [14], and it has been shown to be related to performance in physics [13, 15, 16] and to physics identity [17]. Ballen et al. [18] showed that self-efficacy mediated learning gains for underrepresented minority students in a biology course, identifying self-efficacy as a possible avenue for supporting equity-seeking groups. Test anxiety is a psychological mechanism which may cause an individual to underperform on an assessment; if an entire demographic group experiences relatively high test anxiety, this could explain observed performance differences. Ballen, Salehi, and Cotner [19] found that—again in a biology course—women reported higher test anxiety than men and that test anxiety had a negative correlation with exam performance for women only.
If this were true in physics or generally, then interventions or course designs aimed at reducing test anxiety may help to reduce gender-based differences.

Below, we describe a pilot study in our introductory active-learning physics course. Our research questions are: 1. In our context, are there gender disparities in student self-efficacy, test anxiety, Force Concept Inventory (FCI) pre- and post-test, and exam scores? 2. If gender disparities exist in FCI post-test and exam scores, can self-efficacy and/or test anxiety explain some of the difference? Through this work, we aim to develop a better understanding of the factors involved in gender-based inequalities in physics classrooms, with the ultimate goal of improving the culture for all students.
II. METHOD

A. Theoretical framework
We adopt the model presented by Eddy and Brownell in their review of gender disparities across undergraduate STEM disciplines [9]. In their model, persistence in STEM results from both “observables”—performance and engagement—and “unobservables”—psychological factors such as self-efficacy and belonging, and prior preparation. Gender-based disparities in persistence result from inequalities in the “observables” and “unobservables”. In this study, we focus on the relationship of two unobservable factors—self-efficacy and test anxiety—to the inequalities we observe in assessment scores.

Rodriguez et al. [20] provide explicit models of equity that we use to frame our results. In an equity of parity model, equity is achieved when the equity-seeking group matches the dominant group on the desired outcome. This model is implicitly adopted in work focused on closing achievement gaps, and we take this perspective in Section III A to examine the gender-based differences on our measures. In an equity of fairness model, equity is achieved when all groups receive treatment free from bias. In the example of concept inventories, this would mean that all groups experience equal conceptual gains. This model of equity therefore maintains disparities in outcomes. In our analysis of exam and FCI scores, Section III B, this perspective guides our interpretation.

Throughout this manuscript, we disaggregate our data along gender lines, and refer to female students and male students, implicitly adopting the binary gender deficit model [21]. We do not take this approach because men are the standard to which women should be compared. Rather, in seeking to understand how sexism may impact women in our physics classrooms, we consider male students as a pseudo-control group that does not experience sexism.
Further, we acknowledge that gender identity is not binary, and that the categories of “female” and “male” do not accurately represent the diversity with which individuals experience their gender and with which gender influences their experience.

There are several important limitations of our present work. Firstly, we do not consider how other identity factors, such as race/ethnicity or socio-economic status, or intersections thereof, may impact student experiences in our classroom. A more complete description would take these into account. Secondly, we consider only a global measure of self-efficacy, most similar to a general confidence in ability in the course. It has been shown that different types of self-efficacy may matter differently for women and men [13], a nuance which is not taken into account here. Finally, a future analysis should also consider instrument fairness in the FCI, which may explain up to 30% of the gender difference in FCI scores [6].
B. Course description
The data for this study were collected from “Physics 1,” an algebra-based introductory physics course at a large, research-intensive Canadian university. Students enroll in Physics 1 if they did not take physics at the senior level in high school. The course has been transformed to an active learning style as part of the Carl Wieman Science Education Initiative [22]. In the semester analyzed here, 779 students completed the course. The students were spread over three lecture sections which each met for three hours per week. Each section was taught by a white male instructor. Students also attended a weekly two-hour recitation, consisting primarily of problem solving in groups. According to institutional records, first-year students comprised 78% of the class.
C. Data and data collection
We measured physics preparation with the FCI [23] and assessed learning results with the FCI and the course exam. The FCI was given to students by Teaching Assistants during recitation sessions in the second week of the course (“FCI pretest”) and again in week 12 of the thirteen-week course (“FCI post-test”). Students were asked to not skip questions and to avoid guessing, and were told that the inventory would be used to make future improvements to the course. Students were not given any points for completing either the FCI pre- or post-test. The topics of Physics 1 include kinematics, Newton’s laws, energy, and energy and heat transfer mechanisms. The course exam consisted of a 130-minute solo phase followed by a 30-minute group phase [24]; we use only the solo-phase exam scores here, since they represent the performance of the individual students. The solo exam consisted of about half multiple-choice conceptual questions and half open-response problems. The exam took place approximately two weeks after the last day of classes.

Gender information was acquired from institutional records. Up to and including the year the data was collected, students selected between radio-button options for “male” and “female” to indicate their “Gender” when they created a student account as part of their application to this university.

To assess student self-efficacy and test anxiety, we adapted survey items from the Motivated Strategies for Learning Questionnaire (MSLQ) [25] related to self-efficacy and anxiety. The MSLQ has 8 items comprising its self-efficacy factor and 5 items comprising its anxiety factor. In addition to these, we included new items related to anxiety (similar to those on the MSLQ but with a positive framing), confidence on the course midterm, and a comparison of anxiety in physics to other science courses. The anxiety-related items were added in an attempt to create a more robust factor, while the other added items were of local interest.
Students responded to each of the 21 items on a seven-point Likert scale, from “strongly disagree” to “strongly agree”. The survey was administered during recitation sessions in week 10 of the course, two weeks after the midterm exam and after graded midterm papers were returned to students.
D. Factor structure of self-efficacy and test anxiety survey
We used an exploratory factor analysis (EFA) to assess the emergent dimensions that we focus on for analysis. We began by checking the suitability of the data for a factor analysis following the process and thresholds described in Knekta et al. [27]. For the full 21-item survey, we examined possible outliers, checked for univariate and multivariate normality and linearity of relationships, and confirmed that the data were factorable and did not exhibit multicollinearity. Mardia’s tests for multivariate skewness and kurtosis revealed deviations from multivariate normality; however, in the factor analysis, we used the principal axis factor estimator, which is both robust to non-normality and appropriate for ordinal data. We implemented the EFA using an oblique rotation method to allow correlations between the factors. On theoretical grounds we expected 2 factors (self-efficacy and test anxiety); the scree plot for our full survey suggested 2 or 3 factors. We first examined the three-factor EFA, including all survey items. However, in addition to our expectation of just two factors, metrics indicated that the three-factor solution was inadequate: several items had low primary loadings, cross-loaded between two factors, had low communalities, and/or had high complexities [27]. Altogether, this led us to focus on the two-factor solution.

To determine the two-factor solution, we began with the full survey and undertook a stepwise removal of items with low primary loadings, with a strict cutoff on primary-loading magnitude in order to have strong factors. At each step, we removed the item with the smallest primary loading and re-ran the EFA until no more items had a primary loading smaller than our cutoff. At this point, all remaining items satisfied our thresholds for cross-loadings, communalities, and complexities.
We removed one more item (“On the [Physics 1] midterm exam, I felt confident about my performance while taking the solo part of the exam”) because it displayed the lowest primary loading and communality, it is theoretically different from the other self-efficacy items in that it asks about a past test, and removing this item did not significantly change the loadings of the remaining items. This process removed most of the items additional to those directly adapted from the MSLQ, leaving a self-efficacy factor containing 9 items and a test anxiety factor containing 4 items.

The self-efficacy items remaining overlapped those of the MSLQ except for one item—which originally referred to “assignments and tests”—that was split into separate items for “homework” and “tests”. For the self-efficacy factor, the three highest-loading items were “Considering the difficulty of this course, the instructor, and my skills, I think I will do well in this course,” “I’m confident I can master the skills being taught in this course,” and “I’m confident I can do an excellent job on tests in this course.” For the test anxiety factor, two items were retained from the MSLQ, one retained item was a combination of two MSLQ items (“I feel my heart beating fast, an uneasy feeling in my stomach or tightness in my chest when taking an exam”), and one item was new but loaded well onto this factor (“I get so nervous during a test that I cannot recall the material I have learned”). The latter two items were the highest-loading items on this factor.
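The stepwise item-removal loop described above can be sketched in a few lines. The sketch below runs on simulated Likert-style responses and uses scikit-learn's maximum-likelihood factor estimator with an orthogonal (varimax) rotation as a stand-in for the principal-axis, oblique-rotation EFA used in the study; the 0.5 loading cutoff, the item counts, and all data are illustrative assumptions, not the study's values.

```python
# Illustrative stepwise pruning of low-loading survey items.
# Simulated data; sklearn's ML estimator with varimax rotation stands in
# for the principal-axis / oblique-rotation EFA used in the study.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n_students, n_items = 400, 10
# Two latent traits (e.g. self-efficacy, test anxiety) generate the items
latent = rng.normal(size=(n_students, 2))
true_loadings = np.zeros((n_items, 2))
true_loadings[:5, 0] = 0.8   # items 0-4 load on factor 1
true_loadings[5:9, 1] = 0.8  # items 5-8 load on factor 2
# item 9 loads on neither factor: a candidate for removal
responses = latent @ true_loadings.T + rng.normal(scale=0.6, size=(n_students, n_items))

items = list(range(n_items))
cutoff = 0.5  # illustrative primary-loading cutoff
while True:
    fa = FactorAnalysis(n_components=2, rotation="varimax", random_state=0)
    fa.fit(responses[:, items])
    # Each item's primary loading is its largest absolute loading
    primary = np.abs(fa.components_.T).max(axis=1)
    worst = int(primary.argmin())
    if primary[worst] >= cutoff:
        break
    items.pop(worst)  # drop the weakest item and re-run the EFA

print(sorted(items))
```

In this simulation, the loop discards the pure-noise item and retains the nine items that load strongly on one of the two factors.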
For each student, we computed normalized factor scores from the final EFA using the ‘tenBerge’ correlation-preserving regression method. We observed significant correlations between the factor scores for self-efficacy and test anxiety and the assessment scores. Self-efficacy was significantly positively correlated with both the FCI post-test, r(385), and the exam, r(436); test anxiety was significantly negatively correlated with both, though with smaller effect sizes than self-efficacy.

TABLE I. Summary statistics and differences between female students and male students on the various measures. Self-efficacy and test anxiety scores have been standardized; FCI and exam scores are percentages. Standard errors in the group means (SE) and 95% confidence intervals (95% CI) for the difference in means are reported. A negative difference in means implies that the average for female students is lower than the average for male students. The t-test is a two-tailed test assuming unequal variance. d = 0.2 represents a small effect size, d = 0.5 a medium effect size, and d = 0.8 a large effect size [26]. *: p < .05; **: p < .01; ***: p < .001.

                Female students       Male students
                N    Mean   SE        N    Mean   SE      Diff. in means  95% CI           t      Cohen's d
Self-efficacy   319  -0.18  0.052     124   0.46  0.091   -0.64***        [-0.85, -0.43]   -6.10  -0.67
Test anxiety    319   0.10  0.055     124  -0.26  0.088    0.38***        [0.17, 0.58]      3.66   0.38
FCI pretest     495  29.4   0.63      217  43.2   1.19    -13.8***        [-16.5, -11.1]  -10.2   -0.91
FCI post-test   438  51.9   0.82      197  64.0   1.32    -12.2***        [-15.21, -9.09]  -7.81  -0.69
Exam            521  61.9   0.69      243  64.5   1.10     -2.61*         [-5.16, -0.06]   -2.01  -0.16

III. ANALYSIS AND RESULTS

A. Equity of parity in assessments and affective factors
In our study, gender equity of parity would mean that women and men have comparable distributions in scores for the exam, FCI, and affective measures. As shown in Table I, differences can be seen in all our measures, with women having lower scores on measures associated with success (self-efficacy and assessment scores) and a higher score on a possibly detrimental affective factor (test anxiety).
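The comparisons reported in Table I, a two-tailed Welch t-test (unequal variances) plus Cohen's d, can be computed with standard tools. This is a minimal sketch on simulated scores; the group sizes, means, and spreads are illustrative assumptions, not the study's data.

```python
# Welch t-test and Cohen's d for two independent groups, on simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
women = rng.normal(loc=52, scale=15, size=440)  # illustrative post-test percentages
men = rng.normal(loc=64, scale=15, size=200)

# Two-tailed t-test assuming unequal variances (Welch's test)
t, p = stats.ttest_ind(women, men, equal_var=False)

def cohens_d(a, b):
    # Cohen's d using the pooled standard deviation
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

d = cohens_d(women, men)
print(f"t = {t:.2f}, p = {p:.1e}, d = {d:.2f}")
```

With these simulated groups the difference is large and highly significant; a negative d indicates the first group's mean is below the second's, matching the sign convention in Table I.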
B. Equity of fairness in FCI post-test and exam scores
In our study, we interpret gender equity of fairness to mean that a student with the same FCI pretest, self-efficacy, and test anxiety would achieve the same FCI post-test or exam score independent of gender. To examine equity of fairness, we used model-selection criteria to compare different multiple regression models predicting standardized FCI post-test or exam score. For the analysis, we consider gender, FCI pretest, self-efficacy, and test anxiety to be possible independent variables. Since a one-way ANOVA showed possible small effects of lecture section on FCI post-test and exam, we considered the inclusion of section as a random variable in the analysis. However, we found that the intraclass correlation ρ was small for both FCI post-test and exam models; therefore, including section as a random effect would not be expected to significantly improve our models [28] and we do not include it.

Starting from the base model with gender predicting FCI post-test or exam score, we added predictors (FCI pretest, self-efficacy, and test anxiety) to the model, using the Akaike information criterion (AIC) to determine if each successive model was a better fit to the data. Motivated by previous work which identified test anxiety as an issue for female students only [19], we also considered gender-self-efficacy and gender-test anxiety interaction terms. In comparing models, we consider models within a small ΔAIC of each other to be equivalent, and look for the simplest model with the most explanatory power.

For the models predicting exam score, a subset of the models considered are presented in Table II. In the analysis, the categorical variable gender is coded as F = 1, M = 0. In the base model of gender predicting exam score, β_gender = -0.40 implies that a female student would score 0.40 standard deviations lower than a male student on the exam. However, R² = 0.03 for this model, indicating that only 3% of the variance in the exam scores is explained by gender alone; this base model has little practical explanatory power, and we are led to add more predictors. In considering all combinations of the predictors, including gender-self-efficacy and gender-test anxiety interaction terms, no other model had a meaningfully lower AIC than the model with gender, FCI pretest, and self-efficacy as predictors. In general, including test anxiety as a predictor made models more complex but did not improve the models as much as including self-efficacy. Relative to the model with gender and FCI pretest, the inclusion of self-efficacy explained 7% more of the variance in the exam data. Compared to the model with gender as the only predictor, the addition of FCI pretest and self-efficacy caused the regression coefficient for gender to change signs, going from one that predicted a statistically significant disadvantage for women to a statistically significant advantage (of 0.20 standard deviations). Therefore, after taking into account FCI pretest and self-efficacy levels, exam scores may not demonstrate equity of fairness.

TABLE II. Model coefficients and goodness-of-fit measures for multiple regression models predicting the exam score. Each row corresponds to a different model; if a row is missing a coefficient, that variable was not included in the model. The categorical variable gender is coded as F = 1, M = 0; all other measures are standardized. The models presented here are illustrative but not fully representative of the complete model selection process described in the text. *: p < .05; **: p < .01; ***: p < .001.

β_gender   β_FCI pretest   β_self-eff.   β_test anx.   R²        AIC
-0.40***                                               .03***    1153.1
 0.09      0.52***                                     .26***    1042.2
 0.13      0.49***                       -0.15***      .29***    1031.4
 0.20*     0.43***         0.29***                     .33***    1003.6
 0.21*     0.43***         0.26***       -0.08         .34***    1002.2

The outcome of the model-selection procedure for the FCI post-test regression followed similar trends to that for exam score.
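The AIC-based comparison of nested regression models can be sketched as follows. This is a simplified illustration on simulated standardized data: an ordinary least-squares fit via numpy and the Gaussian-likelihood AIC up to an additive constant (AIC = n ln(RSS/n) + 2k), not the study's actual analysis pipeline.

```python
# Compare nested OLS models by AIC on simulated standardized data.
import numpy as np

rng = np.random.default_rng(2)
n = 600
gender = rng.integers(0, 2, size=n).astype(float)  # F = 1, M = 0 (simulated)
pretest = rng.normal(size=n)    # standardized FCI pretest (simulated)
self_eff = rng.normal(size=n)   # standardized self-efficacy (simulated)
# Simulated exam score depends on pretest and self-efficacy but not gender
exam = 0.5 * pretest + 0.3 * self_eff + rng.normal(scale=0.8, size=n)

def aic_ols(y, *predictors):
    # Least-squares fit with an intercept; k counts all fitted coefficients.
    # Gaussian AIC up to a model-independent additive constant.
    X = np.column_stack([np.ones_like(y), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    k = X.shape[1]
    return len(y) * np.log(rss / len(y)) + 2 * k

models = {
    "gender": aic_ols(exam, gender),
    "gender + pretest": aic_ols(exam, gender, pretest),
    "gender + pretest + self-efficacy": aic_ols(exam, gender, pretest, self_eff),
}
best = min(models, key=models.get)
print(best, {name: round(a, 1) for name, a in models.items()})
```

Because the additive constant cancels when comparing models fit to the same data, only differences in these AIC values are meaningful, which is why the selection procedure works with ΔAIC rather than absolute values.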
Here, the simplest and best-fitting model had only FCI pretest and self-efficacy as predictors; including gender or test anxiety as predictors resulted in a more complex model with an essentially equivalent AIC. In this best-fitting model, both FCI pretest and self-efficacy were significant positive predictors. For this model, R² = 0.45, indicating that FCI pretest and self-efficacy explain 45% of the variance in FCI post-test scores. However, the increase in R² relative to the model with gender and FCI pretest is small, indicating that self-efficacy carries less additional explanatory power for the FCI post-test than for the exam. The absence of gender as a predictor in the best-fitting model for FCI post-test scores could be interpreted as equity of fairness in FCI post-test achievement.

IV. DISCUSSION
Consistent with previous reports, we do not observe equity of parity between women and men on our FCI post-test or exam scores. Among the negative effects of this inequity is that these outcomes influence how a person moves forward from this course, whether into further physics courses or along whatever path they choose. For this class, we also do not observe equity of parity in our two affective measures: self-efficacy and test anxiety.

From an equity of fairness perspective, the results suggest a different interpretation. For a given FCI pretest and self-efficacy, female and male students are ending up at the same place for the FCI post-test, while women are achieving higher scores on the exam. Therefore, we observe equity of fairness for the FCI post-test but not for the exam; however, the inequity in exam results is to the benefit of women, the equity-seeking group. If part of our ultimate goal is to make progress toward equity of parity in assessment results, it must be that equity-seeking groups achieve larger gains, all else equal.

The results presented here represent a pilot study of self-efficacy, test anxiety, and assessment scores in our context. In exploring our data, we made many comparisons and considered many possible models. Thus, it is likely that statistical significance as presented is overstated, and we caution against making firm conclusions based on the results. However, the trends shown here are interesting in the context of several recent studies.

In modelling both the FCI post-test and exam, self-efficacy emerged as a significant predictor. The importance of self-efficacy is consistent with several recent studies [15, 16, 18]. In Ref. [18], changes in self-efficacy mediated the improvement in assessment scores for the equity-seeking group of under-represented minorities. Since women on average report a lower self-efficacy, this supports the possibility that attending to self-efficacy could specifically benefit women in the course.
To better understand if a self-efficacy intervention may work in our context, we could study how changes in self-efficacy relate to assessment results. That self-efficacy held predictive power here is in contrast to a recent report [29]. An important difference is that Salehi et al. [29] found that including a general measure of college preparation (ACT or SAT scores) in addition to a subject-specific preparation measure gave the best exam-score model. It is possible that our self-efficacy factor may overlap with general college readiness or ability, and that controlling for that aspect would give a different result here. Additionally, general academic preparation has been shown to explain a fraction of the gender difference on concept inventories [6].

Test anxiety did not show up as a significant predictor in our model-selection process. Women in our study do report higher levels of test anxiety. However, in contrast to Ref. [19], this does not translate into reduced assessment scores in our data. It could be that women do not actually experience more anxiety during tests even though they report it on the survey; this discrepancy has been demonstrated in a math context [30]. Our results echo those of a recent study on a test-anxiety intervention in an introductory biology course, which found that “women are underperforming in STEM courses for reasons other than ... test anxiety” [31].

Though we focus our report on the differences between female and male students, our ultimate goal is to alter our classrooms to be more inclusive. Better understanding how students experience our classes affectively, and the relation of these experiences with assessments, offers a promising avenue for suggesting ways to change the culture in our classrooms such that all students are welcome and able to realize their potential.
ACKNOWLEDGMENTS
E. Altiere acknowledges support from the UBC Centre for the Integration of Research, Teaching and Learning (CIRTL). We acknowledge the students and the instructors of Physics 1 for their participation.
[1] K. Wall, Insights on Canadian Society: Persistence and representation of women in STEM programs (Statistics Canada, 2019).
[3] A. Madsen, S. B. McKagan, and E. C. Sayre, Gender gap on concept inventories in physics: What is consistent, what is inconsistent, and what factors influence the gap?, Physical Review Special Topics - Physics Education Research 9, 020121 (2013).
[4] J. Day, J. B. Stang, N. G. Holmes, D. Kumar, and D. A. Bonn, Gender gaps and gendered action in a first-year physics laboratory, Physical Review Physics Education Research 12, 020104 (2016).
[5] M. Normandeau, S. Iyengar, and B. Newling, The Presence of Gender Disparity on the Force Concept Inventory in a Sample of Canadian Undergraduate Students, The Canadian Journal for the Scholarship of Teaching and Learning, https://doi.org/10.5206/cjsotl-rcacea.2017.1.9 (2017).
[6] R. Henderson, J. Stewart, and A. Traxler, Partitioning the gender gap in physics conceptual inventories: Force Concept Inventory, Force and Motion Conceptual Evaluation, and Conceptual Survey of Electricity and Magnetism, Physical Review Physics Education Research 15, 010131 (2019).
[7] L. E. Kost, S. J. Pollock, and N. D. Finkelstein, Characterizing the gender gap in introductory physics, Physical Review Special Topics - Physics Education Research 5, 010101 (2009).
[8] L. E. Kost-Smith, S. J. Pollock, and N. D. Finkelstein, Gender disparities in second-semester college physics: The incremental effects of a "smog of bias", Physical Review Special Topics - Physics Education Research 6, 020112 (2010).
[9] S. L. Eddy and S. E. Brownell, Beneath the numbers: A review of gender disparities in undergraduate education across science, technology, engineering, and math disciplines, Physical Review Physics Education Research 12, 020106 (2016).
[10] G. Trujillo and K. D. Tanner, Considering the Role of Affect in Learning: Monitoring Students' Self-Efficacy, Sense of Belonging, and Science Identity, CBE Life Sciences Education 13, 6 (2014).
[11] E. M. Marshman, Z. Y. Kalender, T. Nokes-Malach, C. Schunn, and C. Singh, Female students with A's have similar physics self-efficacy as male students with C's in introductory courses: A cause for alarm?, Physical Review Physics Education Research 14, 020123 (2018).
[12] K. L. Lewis, J. G. Stout, N. D. Finkelstein, S. J. Pollock, A. Miyake, G. L. Cohen, and T. A. Ito, Fitting in to Move Forward: Belonging, Gender, and Persistence in the Physical Sciences, Technology, Engineering, and Mathematics (pSTEM), Psychology of Women Quarterly 41, 420 (2017).
[13] V. Sawtelle, E. Brewe, and L. H. Kramer, Exploring the relationship between self-efficacy and retention in introductory physics, Journal of Research in Science Teaching 49, 1096 (2012).
[14] A. Bandura, Self-efficacy: The exercise of control (Worth Publishers, 1997).
[15] L. E. Kost-Smith, Characterizing, modeling, and addressing gender disparities in introductory college physics, Ph.D. thesis, University of Colorado (2011).
[16] Z. Y. Kalender, E. Marshman, C. D. Schunn, T. J. Nokes-Malach, and C. Singh, Investigating the role of prior preparation and self-efficacy on female and male students' introductory physics course achievements, in Physics Education Research Conference Proceedings (American Association of Physics Teachers, Provo, UT, 2019).
[17] Z. Hazari, G. Sonnert, P. M. Sadler, and M.-C. Shanahan, Connecting high school physics experiences, outcome expectations, physics identity, and physics career choice: A gender study, Journal of Research in Science Teaching, 10.1002/tea.20363 (2010).
[18] C. J. Ballen, C. Wieman, S. Salehi, J. B. Searle, and K. R. Zamudio, Enhancing diversity in undergraduate science: self-efficacy drives performance gains with active learning, CBE Life Sciences Education 16, 1 (2017).
[19] C. J. Ballen, S. Salehi, and S. Cotner, Exams disadvantage women in introductory biology, PLOS ONE 12, e0186419 (2017).
[20] I. Rodriguez, E. Brewe, V. Sawtelle, and L. H. Kramer, Impact of equity models and statistical measures on interpretations of educational reform, Physical Review Special Topics - Physics Education Research 8, 1 (2012).
[21] A. L. Traxler, X. C. Cid, J. Blue, and R. Barthelemy, Enriching gender in physics education research: A binary past and a complex future, Physical Review Physics Education Research 12, 1 (2016).
[22] C. Wieman, Improving how universities teach science: Lessons from the Science Education Initiative (Harvard University Press, 2017).
[23] D. Hestenes, M. Wells, and G. Swackhamer, Force concept inventory, The Physics Teacher 30, 141 (1992).
[24] C. E. Wieman, G. W. Rieger, and C. E. Heiner, Physics Exams that Promote Collaborative Learning, The Physics Teacher 52, 51 (2014).
[25] P. Pintrich, D. Smith, T. García, and W. McKeachie, A manual for use of the Motivated Strategies for Learning Questionnaire, Tech. Rep. (The Regents of the University of Michigan, Ann Arbor, Michigan, 1991).
[26] J. M. Maher, J. C. Markey, and D. Ebert-May, The other half of the story: Effect size analysis in quantitative research, CBE Life Sciences Education 12, 345 (2013).
[27] E. Knekta, C. Runyon, and S. Eddy, One Size Doesn't Fit All: Using Factor Analysis to Gather Validity Evidence When Using Surveys in Your Research, CBE Life Sciences Education 18, rm1 (2019).
[28] E. Theobald, Students are rarely independent: When, why, and how to use random effects in discipline-based education research, CBE Life Sciences Education 17, 1 (2018).
[29] S. Salehi, E. Burkholder, G. P. Lepage, S. Pollock, and C. Wieman, Demographic gaps or preparation gaps?: The large impact of incoming preparation on performance of students in introductory physics, Physical Review Physics Education Research 15, 020114 (2019).
[30] T. Goetz, M. Bieg, O. Lüdtke, R. Pekrun, and N. C. Hall, Do Girls Really Experience More Anxiety in Mathematics?, Psychological Science 24, 2079 (2013).
[31] R. B. Harris, D. Z. Grunspan, M. A. Pelch, G. Fernandes, G. Ramirez, and S. Freeman, Can test anxiety interventions alleviate a gender gap in an undergraduate STEM course?, CBE Life Sciences Education 18 (2019).