Gender-grade-gap zeroed out under a specific intro-physics assessment regime
Department of Physics, University of California, Davis
David J. Webb and Wendell H. Potter*

February 23, 2021

*Deceased 8 January 2017
INTRODUCTION
Unhappiness with issues involved in grading, coupled with a desire to offer classes where everyone succeeded, led us to offer several classes of a large-enrollment introductory physics course that took a small step toward mastery learning. These attempts to improve instruction involved three different instructors. In comparing the grades from these trials to those from our usual classes, we noticed that the gender gaps in our usual classes may just be an artifact of the assessment/grading methods in those classes. This is a report on those efforts and results. First, I'll discuss how course grades were usually determined and our unhappiness with this grading. Then I'll discuss the changes in assessments/grading in the trial-run classes. Finally, I'll present some data on the gender gap, comparing classes with these changed assessments/grading to classes where assessments/grading were more standard. These classes were offered about 5 years ago, but I retired almost immediately after those course offerings and my colleague Wendell Potter died less than a year later, so I have only recently returned to these data. This report is written from my (David Webb's) perspective.
BACKGROUND ON CLASP COURSES
The active-learning based CLASP (Collaborative Learning through Active Sense-making in Physics) courses used in this report have been described in detail in previous work [1]. Basically, CLASP features an explicit focus on understanding models (including words, graphs, equations, etc.) and using these models in qualitative and quantitative analysis of real-world physical, chemical, and biological situations. In a usual 10-week term the course includes a single 80-minute lecture meeting per week coupled with two 140-minute discussion/laboratory (DL) sections per week. The active-learning elements of the course are carried out in these studio-type DL sections of 25-35 students. The DL sections include small-group activities where students work together to understand a model and practice applying it before engaging in whole-class discussions. There are three courses, CLASP A, CLASP B, and CLASP C, making up this one-year introductory-physics series. The courses are meant to be taken in sequence and cover essentially all of the introductory physics in a standard series for biological science students.
CATEGORICAL GRADING IN CLASP
Instructors in most CLASP classes grade exams using a numerical grading system [2] that directly links every graded item (be it an exam question or the exam itself) to an absolute grade scale, so that students can always understand how their performance on a given question related to the expectations of the course instructors. Specifically, each possible letter grade (A+, A, A-, B+, etc.) is represented by a range of numbers on the grade scale that an instructor uses. There are two main grade scales used by CLASP instructors, a 4-point-based scale (CLASP4) and a 10-point-based scale (CLASP10); these two scales are discussed in detail in Ref. [2]. In practice, the graders use a scoring method called "grading by response category" (GRC) [3], in which a grader categorizes student responses by their most significant error and the instructor assigns the same score and written feedback to all students who made that error. This type of scoring is not a rubric, because the categories are made after looking at student responses, but it is otherwise similar to holistic rubrics in that scoring is subjective (it requires judgment, since an answer is not simply correct or incorrect) and a single score and feedback is given for each exam problem.
UNHAPPINESS WITH THIS GRADING
Regarding GRC, my colleague Wendell Potter often pointed out that much of the time a grader spent trying to carefully distinguish the physics value of various students' answers was wasted. In his view there is a very basic division into two groups: either a student "got it" or they "didn't get it", with the dividing line between these two groups likely somewhere in the B- to B+ grade range. This basic division into two groups could be done by a grader with minimal cost in time (just reading the student's answer) for the vast majority of students. Parsing the various answers from students who "didn't get it" was, in his view, a waste of the grader's time. Basically, a student making a major physics error, making many small errors, or omitting an extremely important issue from their discussion "didn't get it", and those students giving answers with even less correct physics content also "didn't get it". Wendell considered it a waste of a grader's time, and also a hopeless task, to attempt to divide unacceptable answers into categories and then decide which categories showed understanding that was satisfactory (the C's), which should be labeled poor (the D's), and which should be labeled failing (the F's).

In a recent paper [2] we presented data suggesting that Wendell's vocal worries about parsing the grades given to the answers of students who "didn't get it" were, perhaps, justifiable. We showed [2] that D and F grades (and also C and D grades) seem quite fungible. In comparing the two main grade scales, CLASP4 and CLASP10, we noted that instructors seemed to shift about 15% of the total grade weight from C and D grades, given under CLASP4, down to F grades when the instructor used the CLASP10 grade scale. The fungibility of C, D, and F grades strongly suggests that the answers from students who "didn't get it" were not easily placed onto an absolute grade scale by any of the seven instructors who used both grade scales at various times.

MASTERY TEACHING AND STANDARDS-BASED GRADING
To me, offering a class devoted to mastery means, at a minimum, giving students whatever time they need to reach mastery of each of the various topics in the course and, importantly, assessing their work as necessary to provide them with a gauge of that mastery. In CLASP these assessments most often come during DL, where they are short verbal assessments, but they can also come during an active-learning lecture and, of course, could include the timed exams that students take. Implicit in a mastery class is the instructor's confidence that every student can succeed in mastering every topic.

The usual way of providing the gauge for a student's work is called "standards-based grading". In standards-based grading the goals that the instructor has for a student are written in a set of standards, and the standards that each student should meet are known to the students during their work. The student's work is judged against each standard using a grade scale with only a few levels, like "below basic", "basic", "proficient", and "advanced". The grades "proficient" and "advanced" would include all of the students who "got it" according to Wendell. A description of some attempts at a completely mastery physics class using standards-based grading is given by Beatty [4]. The trial classes described in the present report did not aim at the full implementation that Beatty tried, because that seemed too difficult for us to implement in our large classes. Nevertheless, we made a few important changes toward mastery classes in our trial runs.
EXPERIMENTS IN GRADING
CLASSES DURING SUMMER TERMS
Wendell and I decided to try out a new grading scheme in a summer quarter in 2015. I taught two classes, one CLASP A and one CLASP B, that quarter and both became trial runs of this teaching method. CLASP courses are condensed by a factor of about two during a summer session, so that almost all of the course fits into 5 weeks instead of a normal 10-week quarter. In the summer there are two 75-minute lecture times per week and four 140-minute DL meetings per week. We made two major changes to these courses.

The first change was toward mastery learning. Giving students many chances to demonstrate their mastery didn't seem feasible, but we needed to give students at least a second chance to demonstrate mastery of each topic. We divided the topics into five parts and students took four quizzes (on the first four topics) in their DLs during the quarter. About a week after each of the four quizzes, the students were given a chance to take a new quiz, a retake given during lecture time, covering the same material as the regular quiz they had taken the week before. If the student received a higher score on that retake quiz, then the retake score replaced their first score. If the student didn't get a higher score, then the retake score was averaged with the original score and the average was used as the measure of their mastery of that topic.

The second change was to the lecture time. I ran the lecture time in a flipped-course style by offering online lectures (complete with conceptual questions) for students to view before the lecture time. (Our department had recently finished recording and putting lectures onto YouTube, so we could easily offer these lectures online.) Then I used the live lecture times as a kind of office-hour/practice-session where students worked on old quiz problems with my help. Of course, one of the two lectures each week had to have time saved at the end for a retake quiz for students who chose to try to improve their grade.

As pointed out above, we decided that a standards-based grading regime was too difficult to implement. But, as a step toward standards-based grading, Wendell and I decided that students whose answers were essentially perfect (no physics errors but maybe a small math error) received a grade of 1.0. These students' answers would have shown them to be "advanced" in their work on the subject of the quiz (greater than or maybe equal to A- under our usual GRC grading). Students making a minor physics error, or missing a minor part of their explanation, were given a grade of 0.67. These students' answers would have shown them to be "proficient" in their work on the subject of the quiz (between a B and an A- or A under our usual GRC grading). All other answers received a grade of 0. Our feeling about the answers showing "proficiency" is that they were close enough to correct that the student likely either forgot to deal with some small issue or didn't notice it, and would have been in the "advanced" group with only a short comment from the instructor to guide them.

Students took a final exam in each of these courses, but there was no time during the term to allow a retake on the final. The final exam was graded on the same scale as the quizzes (every answer received either 0, 0.67, or 1) and the scores were combined in a weighted average to give a total score less than or equal to 1.0. Course grades were then given in a fairly standard way by choosing cutoffs for each grade. The A+/A/A- range lower cutoff was 0.70, the B+/B/B- range lower cutoff was 0.49, and the C+/C/C- range lower cutoff was 0.27, with all other grade cutoffs appropriately placed. So students had to average better than "proficient" to receive a grade higher than B+. To choose these category scores and letter-grade cutoffs, we re-scored students from a previous offering of CLASP A to find cutoffs that would produce approximately the same course grade distribution as the original GRC grading had for that class; in this re-scoring, all original grades from a low B+ through a middle A- received the "proficient" score of 0.67 and higher original grades received the "advanced" score of 1.0.
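To make the scheme concrete, here is a minimal sketch, in Python, of the scoring rules just described. It is illustrative only, not the code we actually used; the function names are invented and the final-exam weight is a made-up placeholder, since the report does not specify the relative weighting of quizzes and final.

```python
# Illustrative sketch of the summer-term scoring scheme (hypothetical names;
# not the actual course software).

CATEGORY_SCORES = {"advanced": 1.0, "proficient": 0.67, "unacceptable": 0.0}

def topic_score(original, retake=None):
    """Combine an original quiz score with an optional retake: the retake
    replaces the original only if it is higher; otherwise the two scores
    are averaged."""
    if retake is None:  # student skipped the retake
        return original
    return retake if retake > original else (original + retake) / 2.0

def course_total(quiz_scores, final_score, final_weight=0.4):
    """Weighted average of the topic scores and the final exam, giving a
    total <= 1.0. The 0.4 final weight is a placeholder assumption."""
    quiz_part = sum(quiz_scores) / len(quiz_scores)
    return (1.0 - final_weight) * quiz_part + final_weight * final_score

def letter_range(total):
    """Map the weighted total onto the letter-grade cutoffs given above."""
    if total >= 0.70:
        return "A range"  # A+/A/A-
    if total >= 0.49:
        return "B range"  # B+/B/B-
    if total >= 0.27:
        return "C range"  # C+/C/C-
    return "below C-"     # remaining cutoffs "appropriately placed"

# A student who is exactly "proficient" on every quiz and on the final
# totals 0.67, landing in the B range, just as the text says:
print(letter_range(course_total([0.67] * 4, 0.67)))  # -> "B range"
```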
CLASSES DURING FALL AND WINTER TERMS
After the summer session trials I taught another class (a CLASP C class) that used retake quizzes, and we recruited two more CLASP A instructors to try offering retake quizzes in their classes. (Most instructors did not want to give up their lecture time to online lectures, and/or did not want to offer extra quizzes, and/or did not want to use such a coarse grading method.) These three classes all used online lectures and gave retake quizzes, but there were two distinctly different quiz-offering schemes. Importantly, these two schemes had distinctly different results regarding the gender gap.

In my CLASP C class and one of the CLASP A classes, a quiz was given during lecture every two weeks and the retake quizzes were given in lecture during the weeks when no new quiz was given. The online lectures allowed these instructors to spend the rest of their official weekly lecture time (45-50 minutes) helping students work on old quiz problems as practice (as was done during the summer lectures). So in these classes students ended up with grades on four quizzes and never took more than one quiz during any particular lecture time. For these classes the retake grade was substituted for the original grade if it was higher, was averaged with the original grade if it was not, and retakes were not allowed on the final exams. This regime roughly replicated the summer assessment regime.

The other CLASP A instructor gave an original quiz during lecture time every week (except the first week and the last week) and then, during the same lecture time, a retake of the previous week's quiz. Specifically, during a typical lecture time in this class the instructor i) gave a 30-minute quiz on new material, ii) used 15 minutes to review the quiz the students had just taken, and then iii) gave a 30-minute retake quiz. So in this class students ended up with grades on eight quizzes and almost always took two quizzes during each lecture meeting. For this class the higher of the two quiz grades was always used as the official quiz grade.
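The key difference between the two regimes can be captured in two lines. This is a hedged sketch with hypothetical function names, not code from the actual courses:

```python
def m1q_combine(original, retake):
    """One-quiz-per-lecture classes (and the summer trials): the retake
    substitutes only if higher; otherwise the two scores are averaged."""
    return retake if retake > original else (original + retake) / 2.0

def m2q_combine(original, retake):
    """Two-quizzes-per-lecture class: the higher of the two scores is
    always the official quiz grade."""
    return max(original, retake)

# A student who does badly on a retake is penalized under the first
# rule but not under the second:
print(m1q_combine(0.67, 0.0))  # -> 0.335
print(m2q_combine(0.67, 0.0))  # -> 0.67
```

Note that under the second rule a retake can never lower a student's grade, while under the first rule it can.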
DATA AND ANALYSIS METHODS

In comparing the gender gap in the mastery CLASP classes with that of the more canonical CLASP classes, we will use data from about three years (winter of 2013 through winter of 2016) that include all 67 CLASP classes (A, B, and C parts are all included) offered in that time period. The classes changed very little during those years. All together these 67 classes included 17,205 grades given to students.

The university administration supplied us with the self-identified (binary) gender of each student. We will compare the mastery classes with the usual GRC-graded classes. I should note that one of the instructors offering a mastery class had previously taught courses using our online lectures, so we will be able to see that offering lectures online did not, by itself, diminish the gender gap. Because there were two distinct ways to offer retake exams, we will analyze their effects separately.

We have shown [2] that class grade distributions may differ significantly depending both on the instructor and on the grade scale used. In order to minimize these effects in our examination of gender gaps, we follow Ref. [5] and use the relative grade of a student rather than their absolute grade: we normalize the absolute grades on a class-by-class basis, subtracting the class average and dividing by the class standard deviation, so that gaps are measured in standard deviations of each class's grade distribution.
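As a concrete sketch of this normalization (assuming a pandas DataFrame with hypothetical columns class_id and grade; this is not the authors' analysis code):

```python
import pandas as pd

def add_relative_grade(df):
    """Z-score each grade within its own class, so that gaps are measured
    in standard deviations of that class's grade distribution."""
    by_class = df.groupby("class_id")["grade"]
    out = df.copy()
    out["relative_grade"] = (df["grade"] - by_class.transform("mean")) / by_class.transform("std")
    return out
```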
RESULTS
STUDENT AND TA RESPONSES TO THE MASTERY CLASSES
We surveyed our students after offering these classes. Students responded to statements using a 5-point Likert scale (Strongly agree, Agree, Neutral, Disagree, Strongly disagree). The two most important statements that students responded to were 1) "I would choose a CLASP class that had 'acceptable/unacceptable' grading of exams without being offered retake exams" and 2) "I would choose a CLASP class that had 'acceptable/unacceptable' grading of exams if I were offered retake exams." Only 36% of the students were either neutral toward or supportive of the first statement, but 91% of the students were neutral toward or supportive of the second statement. So students were mostly happy with this grading method, but only if the class was aimed toward mastery of the material by offering retake exams.

The Teaching Assistants (TAs) in the course ran the DLs and did the grading of quizzes, and they had one main complaint about this kind of grading: they uniformly wanted one more category between "proficient" and "unacceptable". Their reasoning was that they were unhappy giving a student's answer 0 points if it showed considerable knowledge of the subject. I understood their issue but did not change the grade scale, because those students who showed considerable understanding were still giving answers that did not, in my view, show "proficiency".
THE GENDER GAP
MASTERY CLASSES GIVING ONLY ONE QUIZ DURING ANY LECTURE TIME
For the classes where students took at most one quiz per week in lecture, we first note that women and men came into the classes with similar GPAs, as shown in Table 8.1, so we will just compute the gender gap without controlling for demonstrated academic successes. We compare the gender gap for our mastery classes with the gender gap for classes with the usual assessment/grading regime, using hierarchical linear modeling (HLM) to model the gender gap directly, at first as:
CourseGrade = b_0 + b_Female (Female)    (8.1)

where Female is a categorical variable equal to 1 if the student self-identified as female and equal to 0 if the student self-identified as male. We fit the mastery courses and the canonical courses separately so that we have a gender gap, given by b_Female, for each. At this point we do not control for any other characteristics of the students.

Table 8.1: Incoming GPAs for both women and men in the groups of courses considered in this report. The groups are 1) the mastery courses that offered one quiz per lecture time (M1Q), 2) the single mastery course that offered both a quiz and a retake quiz in most lecture times (M2Q), and 3) all courses from Winter 2013 through Winter 2016 that did not offer any retake quizzes (Usual). The P-values are for two-tailed t-tests comparing the GPAs and suggest that women's incoming GPAs are statistically indistinguishable from men's in each group.

Group    GPA Women    GPA Men    P-value
Usual    3.09         3.08       0.245
M1Q      3.07         3.01       0.158
M2Q      3.16         3.10       0.330

Table 8.2: Gender gap (a negative gap means women are given lower grades) for the same groups as in Table 8.1. N is the number of grades given to students in each group.

Group    N        Gender Gap    Stand. Error    P-value
Usual    16,246   -0.223        0.016           < 0.001
M1Q      636      0.025         0.083           0.766
M2Q      321      -0.21         0.12            0.072

In Table 8.2 we give the results for these two separate gender gaps. The usual gender gap shows that the grades of women were almost a quarter of a standard deviation below those of men under canonical grading in CLASP courses. On the other hand, in the mastery classes the women had slightly higher grades than the men, though not significantly higher. So our conclusion is that there was no gender gap in the group of CLASP mastery courses that included only one quiz per lecture time, even though there was a significant gender gap in the usual CLASP courses. These are the main points of this report, and neither conclusion would change if we had controlled for the students' incoming GPAs.

TWO QUIZZES A DAY
The course that offered two quizzes in most of the lectures had a quite different result. Table 8.2 shows that the gender gap in this course was similar to that in CLASP courses with the usual kind of grading. Again, Table 8.1 suggests that this gender gap in grades was not due to a gender gap in demonstrated academic skills. (One instructor had previously used the flipped-class format with the usual grading methods, and that class still showed a gender gap, so online lectures by themselves did not remove the gap.)

Table 8.3: URM grade gap (a negative gap means URM students are given lower grades) for the same groups as in Table 8.1. N is the number of grades given to students in each group.

Group    N        URM Gap    Stand. Error    P-value
Usual    16,248   -0.396     0.020           < 0.001
M1Q      636      -0.375     0.092           < 0.001
M2Q      321      -0.49      0.13            0.077
GRADE GAPS OF UNDERREPRESENTED GROUPS
The university has also supplied us with the self-identified ethnicity of these students, so we can check whether changes in assessments toward mastery also affect the known [6][5] grade gaps between students from underrepresented racial/ethnic groups and their peers in introductory physics. Again we use HLM to model this gap directly as:
CourseGrade = b_0 + b_URM (URM)    (8.2)

where URM is a categorical variable equal to 1 if the student self-identified as belonging to a racial/ethnic group underrepresented in physics and equal to 0 if the student did not so identify.

The results of this analysis are shown in Table 8.3. These URM grade gaps are statistically indistinguishable from one another, so these changes in the assessment scheme toward mastery learning had a negligible effect on URM grade gaps.
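For readers who want to reproduce this kind of fit, the sketch below shows one common way to set up the models of Eqs. (8.1) and (8.2) in Python with statsmodels. It is an assumption-laden illustration (hypothetical column names, and a random intercept per class standing in for whatever HLM structure was actually used), not the analysis code behind Tables 8.2 and 8.3:

```python
import statsmodels.formula.api as smf

def fit_gap(df, indicator):
    """Regress the class-normalized grade on a 0/1 demographic indicator
    (e.g. "female" or "urm"), with a random intercept for each class."""
    model = smf.mixedlm(f"relative_grade ~ {indicator}",
                        data=df, groups=df["class_id"])
    result = model.fit()
    return result.params[indicator], result.bse[indicator]

# Example usage: gap, se = fit_gap(mastery_df, "female")
```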
DISCUSSION
We did these original trial runs in an effort to allow all of our students to maximize their learning and to demonstrate their mastery of the various topics of the course. We knew that we were attempting systemic changes in the courses, but at no point in the planning discussions did I have any idea that we would end up discussing a likely case of systemic sexism. Nevertheless, that is my conclusion, and one that Wendell recognized also.

As someone who hasn't studied gender issues very much, I don't think it would be productive for me to spend many words speculating about the origins of the gender-dependent results we see in this report. I will just point to a few references [7], [8], and [9] (and references contained in these) reporting on the relations between mastery orientation, ideas of intelligence being malleable (growth mindset), intrinsic motivation toward learning, self-efficacy, and assessment types. It seems likely that the connections between these topics are involved in explaining how a course explicitly emphasizing mastery might lead to a physics class that is close to gender neutral. They may also help explain the difference between a rather relaxed mastery class whose students complete at most one quiz per lecture time and a rather rugged mastery class whose students take two nearly sequential quizzes in most lectures.

Table 9.1: Groups whose demographic grade gaps are shown in Fig. 9.1, the measurements for which those gaps were defined, and the studies from which the data were acquired.
GroupID                      Group                                                        Measurement       Ref.
HSEC                         Introductory physics classes for engineers and physical      Final exams       Salehi et al. [5]
                             science majors at a highly selective east coast university
HSWC                         Introductory physics classes for engineers and physical      Final exams       Salehi et al. [5]
                             science majors at a highly selective west coast university
PM                           Introductory physics classes for engineers and physical      Final exams       Salehi et al. [5]
                             science majors at a public university in the middle of
                             the country
CLASP Usual                  CLASP classes, usual GRC grading                             Course grades     This report
CLASP M1Q                    CLASP classes, mastery grading, one quiz per lecture         Course grades     This report
CLASP M2Q                    CLASP classes, mastery grading, two quizzes per lecture      Course grades     This report
UCD Engin.                   Three introductory UCD physics classes for engineers and     Final exams       Webb [6]
                             physical science majors (usual organization of topics)       (same exam as
                                                                                          class below)
UCD Engin. Concepts First    One introductory UCD physics class for engineers and         Final exams       Webb [6]
                             physical science majors (concepts taught before any          (same exam as
                             complicated calculations)                                    classes above)
[Figure 9.1 appears here: two panels, "URM Gaps" and "Gender Gaps", plotting the gap (in standard deviations, on a scale from -0.2 to 0.8) for each group listed in Table 9.1.]
Figure 9.1: Demographic gaps for the groups defined in Table 9.1. The grade gaps are all in units of standard deviations of the distribution. The groups HSEC, HSWC, and PM are from Ref. [5] and show grade gaps on final exams. The groups CLASP Usual, CLASP M1Q, and CLASP M2Q are from the present report and show course grade gaps (following Salehi, gaps are plotted as positive if women or students from underrepresented groups had lower grades). The groups UCD Engin. and UCD Engin. Concepts First are an analysis of the data from Ref. [6] and show grade gaps on final exams.

As a broader look at these issues, I want to replot data from this paper and from an earlier study [6] and compare with a recent paper [5] by Salehi et al. Figure 9.1 shows demographic gaps for several groups of classes that are defined in Table 9.1. Note that I am now following Salehi in plotting positive gaps when the affected group (either women or students from underrepresented groups) has lower grades. We have previously noted [2] that grades in CLASP are almost completely determined by exam scores, so I would argue that the CLASP course grade gaps can be reasonably compared to the final exam grade gaps studied by others. (I estimate that differences between the gaps in CLASP course grades and the gaps in CLASP exam grades are less than one-third of the CLASP Usual standard error in Fig. 9.1.) There are three points that I would make about this figure:

1) Gender gaps are present and of similar magnitude in these classes when they are offered in their usual way. URM gaps are also present in the usual classes but seem more variable.

2) One can apparently zero out either gap by making a systemic change that improves the course for all students. For the gender gap we see this in the mastery classes, and for the URM gap we see it in the concepts-first course.

3) A course improvement that zeros out the gender gap may be quite different from a course improvement that zeros out the URM gap.

REFERENCES
[1] Wendell Potter, David Webb, Cassandra Paul, Emily West, Mark Bowen, Brenda Weiss, Lawrence Coleman, and Charles De Leone, Sixteen years of collaborative learning through active sense-making in physics (CLASP) at UC Davis, Am. J. Phys. 82, 153-163 (2014).

[2] David J. Webb, Cassandra A. Paul, and Mary K. Chessey, Relative impacts of different grade-scales on student success in introductory physics, Phys. Rev. Phys. Educ. Res. 16, 020114 (2020).

[3] Cassandra Paul, Wendell Potter, and Brenda Weiss, Grading by Response Category: A simple method for providing students with meaningful feedback on exams in large courses, Phys. Teach. 52, 485-488 (2014).

[4] Ian D. Beatty, Standards-based grading in introductory university physics, J. Scholar. Teach. Learn., 1-22 (2013).

[5] Shima Salehi, Eric Burkholder, G. Peter Lepage, Steven Pollock, and Carl Wieman, Demographic gaps or preparation gaps?: The large impact of incoming preparation on performance of students in introductory physics, Phys. Rev. Phys. Educ. Res. 15, 020114 (2019).

[6] David J. Webb, Concepts first: A course with improved educational outcomes and parity for underrepresented minority groups, Am. J. Phys. 87, 628-632 (2019).

[7] Susan M. Brookhart, A theoretical framework for the role of classroom assessment in motivating student effort and achievement, Appl. Meas. Educ., 161-180 (1997).

[8] Jayson M. Nissen and Jonathan T. Shemwell, Gender, experience, and self-efficacy in introductory physics, Phys. Rev. Phys. Educ. Res. 12, 020105 (2016).

[9] Angela M. Kelly, Social cognitive perspective of gender disparities in undergraduate physics, Phys. Rev. Phys. Educ. Res. 12 (2016).