Impact of a course transformation on students' reasoning about measurement uncertainty
Benjamin Pollard, Alexandra Werth, Robert Hobbs, H. J. Lewandowski
Department of Physics, University of Colorado Boulder, Boulder, CO 80309, USA
JILA, National Institute of Standards and Technology, Boulder, CO 80309, USA
Department of Physics, Bellevue College, Bellevue, WA 98007, USA
(Dated: August 18, 2020)

Physics lab courses are integral parts of an undergraduate physics education, and offer a variety of opportunities for learning. Many of these opportunities center around a common learning goal in introductory physics lab courses: measurement uncertainty. Accordingly, when the stand-alone introductory lab course at the University of Colorado Boulder (CU) was recently transformed, measurement uncertainty was the focus of a learning goal of that transformation. The Physics Measurement Questionnaire (PMQ), a research-based assessment of student understanding around statistical measurement uncertainty, was used to measure the effectiveness of that transformation. Here, we analyze student responses to the PMQ at the beginning and end of the CU course. We also compare such responses from two semesters: one before and one after the transformation. We present evidence that students in both semesters shifted their reasoning in ways aligned with the measurement uncertainty learning goal. Furthermore, we show that more students in the transformed semester shifted in ways aligned with the learning goal, and that those students tended to communicate their reasoning with greater sophistication than students in the original course. These findings provide evidence that even a traditional lab course can support valuable learning, and that transforming such a course to align with well-defined learning goals can result in even more effective learning experiences.
I. INTRODUCTION
Lab courses are an important part of physics undergraduate curricula [1, 2]. These courses offer opportunities for learning that is critical to becoming a physicist in many different career paths. For example, lab courses are natural settings for students to acquire experimental skills [3, 4], practice scientific communication in a wide range of formats [3, 5, 6], and develop sophisticated beliefs and epistemologies around the nature of science [7]; these goals are not often a primary focus in lecture or theory-focused courses [3, 8]. Recent research has shown that there are lab courses that meet some of these learning goals [9–12], but that there is still room for improvement [13, 14]. As such, an increasing number of lab educators are considering the variety of learning goals possible in lab courses, and working to align their courses to better achieve these goals. Hand in hand, education researchers need to better understand the range of learning that occurs in lab courses, and identify teaching strategies that are effective at facilitating such learning.
A. Measurement uncertainty as a learning goal
In this work, we focus on measurement uncertainty as a learning goal of introductory physics lab courses. Uncertainty analysis is a common learning goal in physics lab courses [15], thus the specifics of how it is taught are as varied as physics labs themselves. Here, we highlight a few lab curricula discussed in the literature that describe a focus on measurement uncertainty, as well as some research studies around learning of measurement uncertainty in labs.

The Scientific Community Laboratory (SCL), developed at the University of Maryland, centers around a series of research questions that aim to teach students how to produce, analyze, and evaluate scientific evidence [16]. The SCL elevates measurement concepts to the same level of importance as physics concepts, recognizing them as critical for those broader skills. In particular, the SCL focuses on sources of variation in data and the generalizability of results based on statistical significance, and also includes uncertainty considerations in experimental design [17].

The Student-Centered Activities for Large Enrollment Undergraduate Programs (SCALE-UP) Project at North Carolina State University includes labs with uncertainty considerations as learning goals [18]. These activities focus on including uncertainties when reporting results, and using uncertainty when comparing data. A test developed specifically for SCALE-UP further explored student ideas and approaches around measurement uncertainty [19].

The Investigative Science Learning Environment (ISLE), developed at Rutgers University, includes inquiry activities that focus on sources of experimental uncertainty and ways to minimize them in the context of experimental design and iteration [20, 21]. ISLE integrates measurement uncertainty ideas, particularly around systematic uncertainties, in the design of and reflection on laboratory experiments.
More recently, the Structured Quantitative Inquiry Lab (SQILab) at the University of British Columbia aims to teach measurement uncertainty in the context of critical thinking [22]. SQILab includes explicit instruction around skills and concepts related to distributions, and extends to the formalisms for comparing results using statistical tests. Research on SQILab also includes attitudes and beliefs related to measurement uncertainty [23].

Lastly, the physics labs that are part of the introductory calculus-based physics sequences at Cornell University have been transformed with measurement uncertainty as a learning goal. The transformation frames its goals explicitly in the context of the AAPT lab guidelines [3], and identifies both statistical and systematic uncertainty learning outcomes in the context of modeling and experimental design [24]. While the course included conceptual introductions to measurement uncertainty before it was transformed, research has shown that the more integrated and explicit approach in the transformed course resulted in more students viewing uncertainty as important when deciding if a result is trustworthy [25].
B. Measuring learning of measurement uncertainty
In addition to developing physics lab curricula, physics education researchers have also studied student learning and student ideas around measurement uncertainty, often in conjunction with curriculum development [26–30]. Central to those efforts is the development of several research-based assessment tools related to measurement uncertainty [31, 32]. The Concise Data Processing Assessment (CDPA) was developed around a decade ago to measure student understanding of both measurement uncertainty and mathematical models of measured data [33]. It has since been used to study pedagogical scaffolding [34] and gender differences in physics labs [35]. Around the same time that the course transformation project at CU was initiated, the Laboratory Data Analysis Instrument (LDAI) was developed to measure data analysis skills within the context of a single lab report [36]. While the LDAI does not focus on measurement uncertainty exclusively, it includes many aspects of measurement uncertainty as they relate to data analysis. More recently, the Physics Lab Inventory of Critical Thinking (PLIC) was developed to measure a range of skills under the umbrella of critical thinking [37]. Measurement uncertainty concepts are represented in the PLIC in the context of this broader range of experimental practice.

For this work, we use the Physics Measurement Questionnaire (PMQ) [38] to study the introductory lab course at the University of Colorado Boulder (CU), as both the course and the PMQ focus on statistical measurement uncertainty concepts at the introductory physics level. We first describe the course in Sections II A and II B. We then describe the history and philosophical perspective of the PMQ in Section II C, and the particular items (or probes) of the PMQ on which this work focuses in Section III A.
II. BACKGROUND

A. Transformation of an introductory lab course
In the broader context of improving physics lab education, the introductory lab course at the University of Colorado Boulder (CU) was recently transformed and studied. We describe the course and the transformation process here; more details can be found in refs. [10–12, 32, 39, 40].

The introductory physics lab course at CU is a stand-alone course typically taken by students in their second or third semester of study at CU. For most students, it is the first physics lab course that they take at the college level. The course, both before and after it was transformed, consists of a series of lab activities involving basic concepts from mechanics, electricity and magnetism, and other topics from introductory physics. Students meet weekly in two-hour lab sessions to work through each activity, and occasionally attend additional lecture sessions on background topics. There are short pre-lab videos that students view before each activity, which include embedded questions for students to respond to at particular points in each video [11]. Students keep an electronic lab notebook while they work, which they upload for grading and feedback at the end of each activity. The course has no midterms nor a final exam.

Beginning in 2016, author HJL began teaching the introductory physics lab course at CU. At the same time, she initiated a project to transform this course. First, professors in the Physics Department and various departments in the College of Engineering and Applied Science were surveyed, and engaged in group discussions, in order to identify learning goals for the course. These goals included an alignment of students' beliefs and epistemologies about experimental physics with those of expert physicists, positive attitudes about the course and about experimental physics more generally, the ability to create quality graphs, and an understanding of measurement uncertainty [12].
Based on these learning goals, HJL, BP, RH, and others created a new set of lab activities for the course, with corresponding apparatus, analysis software, lab guides, grading rubrics, pre-lab videos, and lectures. The transformed course was first taught in Fall 2018, and continues to the present. HJL continued to teach the course throughout this process, including all the semesters studied below.

While the transformed course was designed to meet the identified learning goals, it is still distinct from the ideal course that the designers would have wished for. This mismatch is due mostly to logistical constraints, such as those arising from working with 20-30 graduate teaching assistants, and the logistics of scheduling 35-45 separate weekly lab sections in a single instructional space. Thus, the transformed course still operates in many ways as a traditional introductory physics lab course. For the context of this work, we see traditional lab courses as highly guided and prescriptive, focusing on conceptual rather than skills-based learning, and consisting of verification experiments. In particular for our transformed course, the lab activities remained quite prescriptive, guiding students through procedures with significant scaffolding throughout. Nonetheless, most of the activities in the transformed course focused on skills-based learning, and none were verification experiments.
B. Transformation aspects related to measurement uncertainty
In this work, we focus on the course transformation learning goal concerning measurement uncertainty. The transformed introductory lab course at CU includes several aspects that support learning around measurement uncertainty. First, each lab activity in the transformed course involves students measuring a quantity or outcome that they would not know before completing the measurement. These activities are different from verification labs, in which students measure a value that they learned in lecture or could look up in a textbook. In the course before transformation, five out of six activities were verification labs, in our judgment. In addition to there being no verification labs in the transformed course, many of the lab activities ask students to use measurements they made previously to make predictions about their present experiment. Then, after making a measurement, many of the lab activities ask students to discuss their result with their peers in the classroom, comparing data to decide if their different results agree with each other. These discussions provide repeated opportunities for students to consider and communicate both the value and the uncertainty of a result, and to discuss these results in the context of their choices involving data collection and procedure.

Beyond the lab activities themselves, four out of the six lectures in the transformed course focused entirely on measurement uncertainty concepts, and a fifth included additional discussion of measurement uncertainty along with other topics. As these concepts are presented in the transformed course, students first learn about distributions and the act of measurement as sampling from a distribution.
The idea of uncertainty in measurement is presented as a measure of such underlying distributions. While measurement uncertainty was included in lectures before the transformation, it was not as much of a focus, and was presented with less of a conceptual underpinning, focusing more on the mechanisms of error propagation and the proper structures for reporting results.
C. The Physics Measurement Questionnaire
The PMQ originated from studies by researchers in York, UK with primary school students aged 9-16 [41, 42]. This work stemmed from a need to evaluate a new national curriculum that included school laboratory programs [43], and resulted in a model for how students progressed in their ideas about measurement, which categorized students' ideas concerning experimental data as a progression through eight levels [44]. Soon after, researchers in Cape Town, South Africa attempted to use the materials from York in their physics classes for first-year university students at the University of Cape Town. They found, however, that the materials were not suitable in their context, so they created the PMQ by adapting the instruments developed in York [38].

Similarly, as the Cape Town researchers interpreted preliminary responses from their students, they extended and adapted the frameworks from York to develop the point and set paradigms [45, 46]. The point and set paradigms characterize two philosophical perspectives pertaining to the statistical uncertainty of measured quantities. They are described in detail in refs. [38, 47]. The point paradigm represents the idea that it would be possible for a single measurement trial to completely represent the true value of a physical quantity or measurand, where deviations from that true value are due to mistakes in the data taking procedure or unaccounted-for effects in the measurement apparatus. The point paradigm would maintain that the overall goal of a good measurement procedure is to prevent, or identify and eliminate, all mistakes and unaccounted-for effects, allowing for a measurement trial that perfectly captures the measurand.
Thus, in the point paradigm, the results of individual trials can be considered independently of each other as long as all factors leading to deviation are taken into account.

In contrast, the set paradigm represents the idea that each individual measurement trial reveals some information about the measurand, but that no individual measurement can yield its true value. Thus, multiple trials must be considered as a distribution, with each successive trial revealing more information about the measurand. However, perfect knowledge of the measurand with zero uncertainty is impossible under the set paradigm. The set paradigm stems from a probabilistic approach to measurement uncertainty [46], and is often considered to be more aligned with expert-like reasoning than the point paradigm.

It is also worth noting what is not captured by the point and set paradigms. The paradigms, and by extension the PMQ itself, were designed to characterize reasoning around statistical measurement uncertainty. Discussions around systematic errors, that is, any unwanted or unaccounted-for effect that would not "average out" with repeated trials, are outside the scope of the point and set paradigms. While students' responses in the PMQ often involve such reasoning, those elements are irrelevant in the framework of the point and set paradigms. Additionally, skills and concepts concerning the propagation of uncertainty through a calculation are also outside the scope of the paradigms and the PMQ. There are also more subtle distinctions to be made when discussing statistical measurement uncertainty, such as the differences between frequentist and Bayesian perspectives, that are more complex than the distinctions captured by the point and set paradigms. Despite its limited scope, the PMQ is still a valuable tool for studying student learning around measurement uncertainty, especially at an introductory level.
Likewise, the point and set paradigms are nonetheless useful constructs for understanding overarching trends in learning regarding measurement uncertainty at the introductory physics level.
D. This work
In this work, we use the PMQ to measure the effectiveness of the introductory lab course at CU at facilitating student learning around statistical measurement uncertainty. Such learning is directly related to one of our course transformation's learning goals. In full, this goal was stated as, "Students should demonstrate a set-like reasoning when evaluating measurements," where "set-like" is a reference to the set paradigm discussed in Section II C.

We aim to answer the research question, (Q1)
Did students respond to the PMQ in ways more aligned with the set paradigm after taking the introductory lab course, compared to when they began the course?
In answering (Q1), we consider both the original and transformed course, despite the fact that the original course did not have explicitly stated learning goals, to investigate whether an entirely traditional physics lab course can achieve such a learning outcome.

Furthermore, we use the PMQ to evaluate the effectiveness of the transformation at achieving its learning goal around measurement uncertainty. We aim to answer the research question, (Q2)
Did student responses to the PMQ after the transformation show greater change towards the set paradigm than responses before the transformation?
In Section III, we describe the probes of the PMQ that we use in this work, as well as the students who take the introductory physics lab course and our approach to collecting and analyzing responses from them. Section IV presents results from our analysis of these PMQ responses, comparing responses from the start of the course (pre) and after completing the course (post), and from before the transformation and after it. We finish by discussing these results in the broader context of physics lab courses in Sections V and VI.
III. METHODS

A. Probes of the PMQ
The entirety of the PMQ is based on an experiment involving rolling a ball down a slope and then measuring the distance it travels in free fall. Each item, or probe, of the PMQ concerns a decision at one step in the measurement process, from taking data to comparing the analyzed results. Each item asks students to make a choice, usually between two or three multiple-choice options, and then to explain their choice in open-response format.

In this work, we analyze student responses to four particular probes of the PMQ: RD, UR, SMDS, and DMSS. We chose to exclude the other probes of the PMQ from our analysis for a combination of reasons. Some probes (RT, DMOS, and DMSU) only appear on the pre-test or the post-test version of the PMQ, but not on both, so it was not possible to directly compare students' responses to these probes between pre and post tests. Additionally, when a fellow researcher in our group adapted the PMQ from the original pen-and-paper format to an online format using the Qualtrics online survey platform [48], the SLG probe required significant adaptation to the online format, so we decided not to code responses to that probe. Finally, when we consulted with one of the creators of the PMQ, they recommended that we omit one of the probes (RDA) from our analysis, as they gained relatively little insight from that probe in their work.

As an example, the RD probe is shown in Fig. 1. The RD probe concerns data collection. RD stands for "repeated distance," referring to whether to repeat a trial that measures the distance that a ball travels. The probe presents three stances: to repeat the trial several more times, to move on after performing only a single trial, or to repeat the trial exactly one more time. Respondents are asked to choose with which stance they agree, and to explain their choice in a text box.

The other probes of the PMQ have the same general format as RD, and are shown in Appendix A.
The UR probe, which stands for "using repeats," asks how to analyze data to produce a final result. The SMDS probe, which stands for "same mean, different spread," concerns data comparison, asking respondents to decide which of two data sets is better. The two data sets have the same mean, but different spread. Similarly, the DMSS probe, which stands for "different mean, same spread," also concerns data comparison. However, the DMSS probe concerns two data sets that have different means but the same spread.
B. Coding scheme development
FIG. 1. The RD probe of the PMQ. Respondents are asked, "With whom do you most closely agree?" and can choose A, B, or C. They are then prompted to "Explain your choice." Reproduced from ref. [47].

Because every probe in the PMQ involves an open-response component, responses must first be classified qualitatively before additional analysis is performed. To classify PMQ responses, the creators of the PMQ developed a coding scheme based on responses from their students [42]. The coding scheme consists of a different set of codes for each probe, and aims to capture the types of reasoning students draw upon when reasoning about uncertainty in the various stages of measurement in the PMQ. This coding scheme was developed concurrently with the paradigm model described above, and in the current version each code is associated with either the point paradigm, the set paradigm, or an "unknown" designation if the code represents reasoning that does not unambiguously align with one paradigm.

The first set of PMQ responses that we analyzed came from students in the course in Fall 2016, before any of the responses that we present here. We observed a range of student reasoning in the 2016 data set that was not captured by the coding scheme developed by the creators of the PMQ, and thus expanded upon it to describe responses from our different national, institutional, and course context. These expanded codes were subsequently consolidated into common themes, which were then reframed as code definitions to create a new coding scheme for the PMQ. This new coding scheme was then refined based on CU student responses from Fall 2017, with a subset of those responses used to check inter-rater reliability between two independent coders. The process of creating and refining the new coding scheme is described in more detail in ref. [10].

After our coding scheme was developed and verified, we applied it to data from the Spring 2017 and Spring 2018 semesters. We first matched pre and post responses by student, and removed the responses that were not matched from the data set to make direct comparisons of pre and post distributions straightforward.
Then, for each probe, all of the pre- and post-test responses from those two semesters were anonymized, combined, and shuffled into a single data set, which RH coded without knowing from which semester and pre/post designation each response came. After the codes were assigned, the data were separated back into their respective categories for further analysis.
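This preparation step can be sketched as follows. The function name, the dictionary input format, and the fixed seed are our own illustrative choices, not part of the original analysis pipeline:

```python
import random

def prepare_for_blind_coding(pre, post, seed=0):
    """Keep only students with both a pre and a post response, then pool
    and shuffle the responses so the coder cannot tell which semester or
    administration (pre/post) each one came from.

    `pre` and `post` map anonymized student IDs to response text
    (an assumed, illustrative data format).
    """
    matched = pre.keys() & post.keys()          # students with both responses
    pooled = [pre[s] for s in matched] + [post[s] for s in matched]
    random.Random(seed).shuffle(pooled)          # blind the coder to origin
    return pooled
```

Responses from unmatched students are simply dropped, which keeps the pre and post distributions directly comparable.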
C. Coding scheme
The new coding scheme we created consists of 12-16 codes per probe. Each code is denoted by a letter and a number, for example, "U3." When it is necessary to disambiguate between codes pertaining to different probes, we prepend the probe's acronym and a hyphen, for example, "RD-U3."

The letter is either S, P, or U, signifying whether the code falls under the set paradigm, the point paradigm, or unknown reasoning that does not unambiguously align with one paradigm. Across all probes, there are 21 P codes, 22 S codes, and 13 U codes. The codes within each paradigm further differentiate between student reasoning at a finer-grained level. The number designation of each code distinguishes between them, though we do not intend for the numbers to be interpreted as an ordering. The relative merits of each code are not inherent to the reasoning they represent, and in practice will depend on the context of how the results of analysis are interpreted. For example, when we use the coding scheme here to measure the success of a course, we compare codes in terms of the extent to which each code's reasoning aligns with the learning goals of the course.

Sometimes, student responses contain multiple distinct lines of reasoning, and thus we allow multiple codes to be assigned to a single response. In the data sets analyzed below, 87.9% of the responses were assigned a single code, 11.9% were assigned two codes, and 0.3% were assigned three codes. For the purposes of classifying a response into a single paradigm, if a response was assigned multiple codes from different paradigms, S and P codes both took precedence over U codes. For example, if a response was assigned a P code and a U code, the response was considered point-like overall.
If both an S code and a P code were assigned to a single response, which happened in 1.7% of the responses in the data set analyzed below, we classify that response's paradigm as U.

The complete code books of the new coding scheme, one for each of the four probes we analyzed, are reproduced in full in the supplemental information accompanying this work. Here, we present a subset of these codes in Table I. As an illustration of the scope and depth of the coding scheme, we discuss here three codes for the RD probe: S4, P2, and U1. Each code represents a reason to perform more than one trial, aligned with response A or C of the RD probe (Fig. 1).

The S4 code argues that multiple measurements should be performed in order to reduce the uncertainty of a mean value, implying that the mean is the result that matters. This argument aligns with the set paradigm. On the other hand, the P2 code represents the idea that multiple measurements are beneficial because they allow the experimenter to identify outliers or mistakes in data collection. This argument aligns with the point paradigm. Lastly, the U1 code represents responses that merely state that more data is needed. In this case, the respondent did not write a sufficient explanation to classify it into one paradigm or the other. It is possible that, if discussing the probe with the respondent in person, their underlying reasoning would become apparent. It is also possible that the respondent lacked the language to express their reasoning, or that they had not considered their reasoning to any greater depth. It is even possible that the student was merely pressed for time when completing the survey, and otherwise they would have provided an explanation that aligned well with another code.
In any case, the PMQ coder has only the written response to interpret, and as such, is forced to assign a code such as U1 regardless of these hypothetical cases.

We note that, from the perspective of an expert experimental physicist, there is validity behind the reasoning represented by both the S4 and the P2 codes, and even the U1 code cannot be said to be incorrect. Therefore, we do not intend to assign an inherent ranking or hierarchy to the codes in our code book. They aim only to classify common lines of reasoning used when responding to a given PMQ probe. Even at the paradigm level, though the set paradigm has been identified as more expert-like than the point paradigm in previous work, that does not mean that point-like reasoning is not sometimes included in expert approaches more generally. For example, outliers are often the impetus for proposing causes and enacting revisions in the larger context of modeling, and identifying systematic errors is sometimes merely a more nuanced framing for catching mistakes in the measurement procedure [49]. When using our coding scheme to measure the effectiveness of a learning experience, as we do in this work, we intend for our codes to be compared to each other only in terms of how well they align with the goals of that learning experience. In general, interpreting results from our coding scheme by ranking codes relative to each other is best done in the context of a learning goal.
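The code-to-paradigm precedence rules described above (S and P take precedence over U; a response with both S and P codes is classified as U) can be sketched as a small classifier. The function name and the list-of-code-strings input are our own illustrative choices:

```python
def response_paradigm(codes):
    """Collapse the codes assigned to one response (e.g. ["S4"], or
    ["P2", "U1"]) into a single paradigm letter.

    Assumes bare codes like "S4", where the first character is the
    paradigm letter (probe-prefixed forms like "RD-U3" would need to
    be stripped first).
    """
    letters = {code[0] for code in codes}
    if "S" in letters and "P" in letters:
        return "U"  # conflicting paradigms: classify as unknown
    if "S" in letters:
        return "S"  # S takes precedence over any U codes
    if "P" in letters:
        return "P"  # P takes precedence over any U codes
    return "U"      # only U codes assigned
```

For example, a response coded ["P2", "U1"] is point-like overall, while ["S4", "P2"] falls into the unknown designation.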
D. Student-level paradigms
In addition to interpreting the codes assigned to each student's response to any particular probe, we also consolidate a student's responses to each of the four probes we analyzed into a single designation of that student's reasoning overall [32]. Because of the coarse-grained nature of this consolidation, we characterize a student's reasoning only at the paradigm level. These student-level paradigms are determined by the number of probe-level paradigms emerging from each probe, as defined in Table II. A student's overall response is point-like if their responses to the probes are represented only by P and U codes and no S codes. Conversely, an overall set-like response comes from responses to the probes that are represented only by S and U codes and no P codes. A third designation, mixed, refers to an overall response that includes both P and S codes, or to when the response is entirely represented by U codes. However, a response represented entirely by U codes only occurred 0.2% of the time in the data sets analyzed here.
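These consolidation rules can be sketched as follows. The function is hypothetical (our own naming and input format), but it follows the definitions in Table II:

```python
def student_paradigm(probe_paradigms):
    """Consolidate a student's four probe-level paradigms ("P", "S",
    or "U") into a single student-level designation (Table II)."""
    n_p = probe_paradigms.count("P")
    n_s = probe_paradigms.count("S")
    if n_p >= 1 and n_s == 0:
        return "point-like"
    if n_s >= 1 and n_p == 0:
        return "set-like"
    return "mixed"  # both P and S present, or entirely U
```

Note that a student whose probes are all classified U lands in the mixed category, matching the all-U case described above.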
E. Statistical methods
After responses were coded using the coding scheme described above, we analyzed the distributions of the assigned codes. Note that in this work, we compare distributions of students within a semester and PMQ administration (pre or post), rather than comparing matched responses student-by-student. We compare distributions of codes and demographic characteristics using Fisher's Exact test [50], with a significance threshold of p < 0.05, adjusted for m multiple comparisons using the Holm-Bonferroni method. We compute these tests using the base package included in the R programming language, version 3.6.2 [52].

TABLE I. Selected codes from the new PMQ coding scheme. Each definition completes the sentence "Argument is that..."

RD S4 (Reduce uncertainty of mean): ...multiple measurements will be used to reduce the error/uncertainty of the mean/average.
RD P1 (Measure the true value): ...the experimenter could measure the correct value in a single measurement.
RD P2 (Identify the outliers after all measurements): ...repeated measurements are needed in order to know which measurements were mistakes or outliers, after all measurements are taken. This code includes the idea that the experimenter must get the same result at least twice for it to be correct.
RD U1 (Just take more data): ...experimenter needs to take more data. No statistical reasoning apparent.
UR S1 (Simply average): ...I averaged, do the average, average is best, or it is the average, but does not elaborate. Includes statements that simply say what the reported value is.
UR S3 (Why average is appropriate in this case): ...reporting the average is best because all of this data matters, or because the spread of this data is small enough. Includes reporting all data as well as the average.
UR S4 (Report average and spread): ...experimenter should report the average and the uncertainty/range/spread.
UR S5 (How to compute): ...how to compute the average.
UR P1 (Choose single value): ...experimenter should choose a single value to report (for any reason).
SMDS S2 (Smaller spread is better, no mention of external factors): ...a smaller spread/uncertainty/range is better / more accurate / more precise / etc. The response does not mention external factors, outliers, human error, etc.
SMDS S3 (Smaller spread is better, due to external factors): ...a smaller spread/uncertainty/range is better / more accurate / more precise / etc. The response mentions external factors, outliers, human error, etc.
SMDS P1 (The means are the same): ...the groups agree because the means are the same.
SMDS P4 (Differences in carefulness): ...differences in the spread are due to differences in how carefully the measurements were performed.
DMSS S3 (Similar means and spreads, mentions overlap): ...the groups agree because the means and spreads are similar. Argument considers the overlap between the means and/or spreads of the two data sets.
DMSS S4 (Chose A, blank explanation): Respondent chose "A" but left the explanation blank.
DMSS P2 (Means must match): ...the groups do not agree because the means are not the same (no mention of spread).
DMSS P3 (Means close enough, treats average as point): ...the groups agree because the means are close enough.
DMSS U1 (Not about statistics): ...only non-statistical things, such as systematics, are mentioned.
DMSS U3 (Misc.): Argument that doesn't fit into any of the other codes.

TABLE II. Definitions of overall student paradigms. Reproduced from ref. [32].

Student paradigm    Number of P's    Number of S's
point-like          ≥
set-like                             ≥
mixed               ≥                ≥

For an additional visual indication of the uncertainty in counts or percentages of codes or paradigms, we plot the binomial proportion confidence interval at the 95% confidence level. When plotting the difference between the number of post responses and pre responses for each code, we propagate these confidence intervals for the plotted uncertainty bar of the calculated difference.
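The interval and propagation just described can be sketched in a few lines. This is an illustration rather than the authors' code: the paper does not state which binomial interval estimator was used, so a normal-approximation (Wald) interval is assumed here, with the pre and post half-widths combined in quadrature.

```python
import math

def wald_halfwidth(k, n, z=1.96):
    """Half-width of the normal-approximation (Wald) binomial proportion
    confidence interval for k responses with a given code out of n total,
    at the 95% confidence level (z = 1.96)."""
    p = k / n
    return z * math.sqrt(p * (1 - p) / n)

def diff_halfwidth(k_post, n_post, k_pre, n_pre):
    """Propagated uncertainty bar for the difference between a post-test
    proportion and a pre-test proportion, combining the two interval
    half-widths in quadrature."""
    return math.sqrt(wald_halfwidth(k_post, n_post) ** 2
                     + wald_halfwidth(k_pre, n_pre) ** 2)
```

For example, 50 of 100 responses gives a half-width of 1.96 x sqrt(0.25/100), i.e. about +/- 9.8 percentage points; a pre-to-post difference between two such proportions carries an uncertainty larger by a factor of sqrt(2).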
Finally, when two similarly-measured proportions are both statistically significant based on these methods, we estimate the degree to which they differ by calculating an effect size using Cohen's h [53].

F. Course context
In this work, we compare two semesters of the introductory physics lab course at CU, one before the transformation of that course (Spring 2017, the "original course") and one after the course was transformed (Spring 2018, the "transformed course"). We compare two Spring semesters, though the course is also taught in Fall semesters, to avoid a range of factors that influence students of differing backgrounds enrolling in the Fall versus in the Spring. There were 641 students who completed the course at CU in Spring 2017, and 722 in Spring 2018. Of these students, 539 and 499, respectively, completed both the pre-test and post-test, and were included in the data set analyzed here. The self-reported gender, race and/or ethnicity, major, and year of students in these two semesters, collected using another research-based assessment that was administered at the same times as the PMQ, are shown in Table III. We include this information for various reasons [54], including to provide context for our research findings, as well as to enable meta-studies that combat normative whiteness and highlight inequities in research [55]. We compared the proportions of students identifying with each of these categories in Spring 2017 and Spring 2018 as an indication of the similarity of the students entering the course during these two semesters. The resulting p-values from Fisher's Exact are shown by each category heading in Table III. Along each of these dimensions, the populations of students in the two courses were statistically equivalent (p > 0.05).

TABLE III. Self-reported gender, race/ethnicity, major, and year of students enrolled in the course in both Spring 2017 and Spring 2018. "Engineering" excludes the major Engineering Physics, which is included in "Physics." The p-values from Fisher's Exact comparing the two semesters appear next to each dimension heading.
Gender (p = 0.31): Female 23.6%; Male 75.1%; Other gender 1.3%
Race/Ethnicity (p = 0.81): American Indian or Alaska Native 0.9%; Asian 14.4%; Black or African American 2.2%; Hispanic/Latino 8.8%; Native Hawaiian or other Pacific Islander 0.7%; White 69.0%; Other race/ethnicity 4.0%
Major (p = 0.48): Physics 17.2%; Engineering 44.8%; Other STEM 35.1%; Other disciplines 3.0%
Year (p = 0.30): First year 48.9%; Second year 31.5%; Third year 11.4%; Fourth year 6.2%; Fifth year and above 2.0%
The PMQ was administered electronically to students at the beginning (pre) and at the end (post) of the course during both semesters. Students were sent an internet link to complete the PMQ independently, and as an incentive for completing the questionnaire, were offered a small amount of participation-based course credit totaling 1-2% of their final grade in the course. In the original course, students completed the pre-test and post-test as in-class assignments. However, due to unavoidable scheduling circumstances, students in the transformed course completed the surveys outside of class. In that semester, students received the pre-test link five days after the course started, and were required to complete their responses within seven days. Previous research on other online research-based assessments of student learning has shown little difference in matched student responses when completed outside or during class time, showing at most a small positive increase from taking the assessment in class overall [56].

Additionally, because of the change in timeline in the transformed course, the first lecture in that course occurred before 64% of the respondents completed their pre-survey response. The content of that lecture touched on aspects related to measurement uncertainty, specifically the importance of every measurement having an associated uncertainty and how that uncertainty is used for comparing measurements. For the post-tests, students received the link close to the end of the semester, after they had completed all activities for the course, and were required to complete their response before the semester ended.
IV. RESULTS
Here, we present quantitative results from applying the PMQ coding scheme developed at CU to two semesters of the introductory physics lab course at CU: Spring 2017 (referred to as the original course, or before the transformation) and Spring 2018 (referred to as the transformed course, or after the transformation). For each semester, we compare the distribution of responses from the pre-test to equivalent distributions from the post-test, as a "pre-to-post" comparison. We first present these comparisons at the student paradigm level, as the most simplified interpretation of our results. We then break down the results probe by probe, still interpreting responses at the paradigm level. Lastly, we consider each probe at a level beyond the paradigms, comparing distributions of the codes that make up the paradigms. At each level, we note how each finding aligns with or runs counter to the learning goal of the transformed course. We discuss this alignment more broadly in the context of our research questions in Section V.
A. Paradigm-level results
Fig. 2 shows the percentage of students whose responses to the PMQ fell into each student-level paradigm: point-like, mixed, or set-like. The left panel shows the semester before the transformation, while the right shows the semester after the transformation. Light gray bars represent the pre-test, while darker gray bars represent the post-test. The error bars in Fig. 2 suggest that, for both semesters, there were significant differences between pre and post for the number of mixed and the number of set-like responses. Fisher's Exact confirms those differences, all with p ≪ 0.05. However, the proportions of point-like responses were statistically similar (p = 1 for 2017 and p = 0.34 for 2018). Overall at the student level, students shifted predominantly from the mixed to the set-like paradigm, both before and after the transformation.

Moving to the probe level, shifts between pre and post paradigms for each of the four probes we analyzed are shown in Fig. 3. As before, the left panel shows the semester before the transformation, while the right shows the semester after the transformation. Within each panel, the four probes are represented on the vertical axis. The horizontal axis represents the proportion of students whose responses were coded with either S or P codes. Solid markers denote the proportion with an S code on the pre-test, while open markers denote the proportion with a P code on the pre-test. The ends of the corresponding arrows show the corresponding proportions on the post-test. The significance of these shifts is also indicated.

Across all probes, students shifted to more S reasoning and less P reasoning during the semester. However, only some of these shifts were statistically significant. In the RD probe, the transformed course showed larger shifts towards S than the original course (with Cohen's h effect sizes of h = 0.45 and h = 0.21, respectively), while in the SMDS probe, the transformed course showed no significant shifts at all. The UR probe showed little practical significance in either semester due to the large proportion of S responses, even on the pre-test. We have seen this saturation effect before in the UR probe when analyzing responses from CU students at the paradigm level [10, 40].

The DMSS probe shows a larger shift towards S reasoning in the transformed course than in the original course. However, this shift is due to a difference in the proportion of S responses in the pre-tests, rather than the post-tests. Such a difference stands in contrast to the other three probes, which show similar pre-test proportions between the two semesters.
Furthermore, this difference in pre-test responses stands in contrast to all of the information we have available on the distribution of students who enrolled in the class for these two semesters, which would suggest that the two groups of students are similar. Our best guess as to the cause of this difference in the DMSS pre-test proportions concerns the differences in the timelines of the two semesters. We speculate that responses in the transformed semester to the DMSS probe in particular were affected by the first lecture of the course, for the 64% of students who completed the survey after that lecture. That lecture touched on the idea that uncertainty is used for comparing measurements in a generalized way, an idea that relates to the DMSS probe. However, it also relates to the SMDS probe, for which the paradigms of pre responses seem similar before and after the transformation. Given this uncertainty, we proceed with caution when further analyzing DMSS responses from these two semesters, remembering that the full story around this portion of the data set remains unclear.

B. Code-level results
We now analyze results beyond the level of paradigms, considering the individual codes themselves in the context of the course transformation. For each probe and semester, we plot the difference between the number of responses to each code on the post-test and on the pre-test. In addition to the error bars that represent the uncertainty of these differences, we use blue bars to represent codes for which the pre- and post-test distributions are statistically different using Fisher's Exact (p < 0.05), and orange bars for those that are not (p > 0.05). We apply the Holm-Bonferroni correction, with m as the number of codes in the given probe's code book, to the p-values from Fisher's Exact before determining statistical significance.
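To make this procedure concrete, the following standard-library sketch reimplements the three pieces named above: a two-sided Fisher's Exact test on a 2x2 table of code counts, the Holm-Bonferroni step-down adjustment for m comparisons, and Cohen's h. It is an illustration only; the paper's actual analysis used the base package of R 3.6.2, not this code.

```python
import math

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's Exact p-value for the 2x2 table [[a, b], [c, d]]
    (e.g., rows = with/without a code, columns = pre/post): sum the
    hypergeometric probabilities of every table with the same margins
    that is no more likely than the observed one."""
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def prob(x):  # probability of x in the top-left cell
        return (math.comb(col1, x) * math.comb(n - col1, row1 - x)
                / math.comb(n, row1))

    p_obs = prob(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))  # tolerance guards float ties

def holm_bonferroni(pvals, alpha=0.05):
    """Flag which of m = len(pvals) p-values remain significant after
    the Holm-Bonferroni step-down adjustment."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    significant = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] > alpha / (m - rank):
            break  # every larger p-value fails as well
        significant[i] = True
    return significant

def cohens_h(p1, p2):
    """Cohen's h effect size for the difference between two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))
```

For example, holm_bonferroni([0.001, 0.04, 0.03]) flags only the first p-value as significant: 0.001 passes its threshold of 0.05/3, but the next-smallest value, 0.03, already exceeds 0.05/2, so it and everything larger fail.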
1. The RD probe
A comparison for the codes in the RD code book is shown in Fig. 4. In both semesters, the code with the largest change from pre to post was S4. The prominent increase in S4 is encouraging, as it is aligned with the learning goals of the course, in particular the idea that all numbers have an uncertainty, including the mean of a set of data. Before the transformation, the other code that increased was U1, while that code did not significantly change after the transformation. U1 represents a response that does not display more sophisticated reasoning than the other codes. Therefore, the lack of an increase in U1 after the transformation, compared to before, could suggest that students articulated their reasoning with greater sophistication in the transformed course. Considering the codes that decreased from pre to post, there were two codes that showed a significant pre-to-post decrease, and they were the same codes in both the transformed and the original course. As they were both point-like, that change is aligned with the goals of the transformation.
2. The UR probe
Fig. 5 shows a comparison of the codes in the UR code book. In both semesters, the most prominent change was an increase of a single set-like code. This consolidation phenomenon in UR responses is discussed in ref. [10]. In the original course, the code into which students consolidated was S1, while in the transformed course the code was S4. S1 represents reporting the average as the result of a set of measurements, while S4 represents reporting an average as well as a spread. S4 aligns with the transformation's learning goals, as it recognizes the importance of the spread of a distribution and aligns with the idea that all numbers have an uncertainty.
FIG. 2. Pre-post shifts at the student paradigm level. (a) is before transformation, (b) is after transformation. Error bars are the binomial proportion confidence interval at the 95% confidence level.

FIG. 3. Pre-post shifts at the probe paradigm level. (a) is before transformation, (b) is after transformation. The horizontal axis represents the proportion of students responding in either the point or set paradigm, on each of four probes along the vertical axis. The shapes at the start of each arrow represent the pre-test proportion, while the location of the arrowhead represents the post-test proportion. Solid shapes represent the set paradigm proportion, while open shapes represent the point paradigm proportion. Star shapes represent a statistically significant pre-to-post shift using Fisher's Exact test at the 95% confidence level; circle shapes represent a shift that is not significant by that same test. Shaded boxes represent the pre-test binomial proportion confidence interval at the 95% confidence level, as a guide to the eye.
There were no other statistically significant pre-post decreases in UR responses in the original course. However, there were three significant decreases in the transformed course: P1, S3, and S5. P1 represents canonical point-like reasoning, that one should choose the value from a single trial to represent the result of an experiment. A decrease in this code is aligned with the goals of the transformation. The other two codes that decreased in the transformed course were set-like. One of them, S3, discusses the purpose of reporting an average, representing conceptual reasoning around the role of averages in experimentation. The second, S5, states the mathematical process of calculating an average, and represents the basic skill or practice of reporting an average. The transformation's goals included the reasoning represented by both of these codes, suggesting that the decreases in S3 and S5 may represent further room for improvement.
3. The SMDS probe
FIG. 4. Pre-post differences in code counts for the RD probe, for the original course (a) and the transformed course (b). Blue bars are statistically significant differences using Fisher's Exact test at the 95% confidence interval, adjusted using the Holm-Bonferroni method. Orange bars are not significant by that same test. Error bars are the binomial proportion confidence interval at the 95% confidence level.

FIG. 5. Pre-post differences in code counts for the UR probe, for the original course (a) and the transformed course (b). (b) reproduced from ref. [10]. Blue bars are statistically significant differences using Fisher's Exact test at the 95% confidence interval, adjusted using the Holm-Bonferroni method. Orange bars are not significant by that same test. Error bars are the binomial proportion confidence interval at the 95% confidence level.

We compare responses to the SMDS probe in Fig. 6. Overall, the magnitudes of the shifts in any of these codes are relatively small, suggesting that the type of reasoning solicited by the SMDS probe is relatively stable in our population of students. In fact, in the transformed course, no single code showed a statistically significant pre-to-post change.

Before the transformation, the code that significantly increased was S3, which states that a smaller spread is better because of external factors such as "air resistance" or "human error." While the recognition of spread playing a role in data comparison is aligned with set-like reasoning, and thereby the goals of the transformation, the focus on external factors over inherent statistical variation is more point-like than set-like. The code that decreased pre-to-post before the transformation was P4, which talks about differences in carefulness between the two experimenters. This idea aligns with the point paradigm if the lack of carefulness manifests as mistakes in individual trials. However, there is a subtle difference between this line of reasoning and the idea that the spread of a data set overall is affected by differing tendencies of experimenter care.
Taken together, the SMDS trends observed in the original course contain elements that are both closer to and further from set-like reasoning.
FIG. 6. Pre-post differences in code counts for the SMDS probe, for the original course (a) and the transformed course (b). Blue bars are statistically significant differences using Fisher's Exact test at the 95% confidence interval, adjusted using the Holm-Bonferroni method. Orange bars are not significant by that same test. Error bars are the binomial proportion confidence interval at the 95% confidence level.
4. The DMSS probe
Lastly, a comparison of DMSS codes appears in Fig. 7. In both semesters, the most prominent pre-to-post increase was S3, with an especially large increase in the transformed course. S3 represents the most complete way to compare results under the set paradigm, by looking for overlap between the means and spreads of the two data sets.

The S4 code, choosing the best multiple choice answer but leaving the explanation field blank, also increased significantly in the original course. There could be many reasons for that difference between the two semesters, including time limitations stemming from the different settings in which the survey was administered. The difference in S4 responses yields little insight into student learning or the transformation. Similarly, the U3 code, which decreased in both semesters, represents miscellaneous reasoning, and yields little insight without further qualitative interpretation of these responses. Likewise, the U1 code, which decreased in the transformed course, represents non-statistical reasoning and itself offers little insight. That is not to say that these responses, which likely contain sophisticated reasoning about systematic effects and other experimental considerations, are not worthy of study. They are simply outside the scope of the point and set paradigms and the focus of the PMQ.

Two remaining codes decreased in prevalence during the transformed course: P2 and P3. Both consider only the means of the two data sets when comparing them, P2 concluding that the means must match for the results to agree, and P3 concluding that the means are close enough to agree. Both responses lack reasoning around distributions or statistical uncertainty. A decrease in their prevalence after the transformation, given the absence of such a decrease before the transformation, is an indication that the transformation was effective.
V. DISCUSSION AND CONCLUSIONS
In this section, we synthesize the results presented above through the lens of the course transformation's learning goal around measurement uncertainty, which was, "Students should demonstrate a set-like reasoning when evaluating measurements." We apply the lens of that learning goal to both the original and the transformed course. We consider the original course, in addition to the transformed course, through that lens for two reasons: first, as a baseline of comparison for the transformed course, and second, as an example of an entirely traditional physics lab course that nonetheless achieved measurable and desirable learning outcomes.
A. (Q1) Effectiveness of course overall
Regarding (Q1), Did students respond to the PMQ in ways more aligned with the set paradigm after taking the introductory lab course, compared to when they began the course?, on each of the levels of analysis presented here, both the original and the transformed course met the learning goal to some extent. At the most simplified level of student paradigms, our analysis shows increases in set-like reasoning and decreases in mixed reasoning in both semesters, which is aligned with the learning goal. At a finer level of detail, looking at paradigms probe-by-probe, each of the four PMQ probes that we analyzed showed, in both semesters, pre-to-post increases in set-like reasoning and pre-to-post decreases in point-like reasoning.
FIG. 7. Pre-post differences in code counts for the DMSS probe, for the original course (a) and the transformed course (b). Blue bars are statistically significant differences using Fisher's Exact test at the 95% confidence interval, adjusted using the Holm-Bonferroni method. Orange bars are not significant by that same test. Error bars are the binomial proportion confidence interval at the 95% confidence level.
In all but two cases, these increases were statistically significant. Finally, in the most fine-grained interpretation of the results, looking at individual codes beyond their paradigms, there were pre-to-post changes in each of the four probes that aligned with the learning goal to some extent. In particular, significant increases in RD-S4 and DMSS-S3, and significant decreases in RD-P1 and RD-P2, were observed in both semesters, and unambiguously align with the learning goal.
B. (Q2) Effectiveness of transformation
Regarding (Q2), Did student responses to the PMQ after the transformation show greater change towards the set paradigm than responses before the transformation?, there were several indications that the transformed course achieved the learning goal to a greater extent than the original course. There were also some indications that it did not, suggesting directions for future improvement. However, all of these indications lie at finer levels of analysis than that of student paradigms, at which the semesters before and after transformation appear similar in all respects.

At the level of paradigms for each probe, the RD probe showed a striking pre-to-post increase in S codes after the transformation, with an effect size of h = 0.45, as compared to the corresponding increase before the transformation with an effect size of h = 0.21. This difference suggests that the course transformation was especially successful in the context of evaluating choices in data collection.

On the other hand, in the SMDS probe, we observed significant pre-to-post shifts in the original course, but not in the transformed course. This suggests that there is more for students to learn in the transformed course around data comparison, particularly in the context presented by this probe: when the two data sets being compared have identical means but different spreads. In such a situation, deciding whether the results agree is very simple, and one can entirely ignore the spreads of the two distributions. Accordingly, the most common SMDS point-like code across both semesters was SMDS-P1 (18% of all responses in the data set), which represents this simple approach.

However, there is more to consider when deciding not whether the results agree, but which result is better overall. The SMDS probe asks respondents to do this latter task.
With this broader task in mind, the fact that one result has a smaller spread becomes relevant, as represented by the SMDS-S2 code, the most common code of any paradigm assigned to SMDS responses (53% of all responses in the data set). Nonetheless, it is possible that the probe does not prompt set-like reasoning as directly as other probes, as the identical means encourage students to stop there without considering the data set at a deeper level. For the transformed course to improve further, results from the SMDS probe suggest that students could be better supported in using set-like thinking all the time, not just when the situation lends itself to it. Perhaps including more focused or nuanced discussions around what makes a data set better or worse would result in more favorable SMDS responses, and more importantly, further improve physics laboratory instruction.

Regarding the other two probes, less can be drawn from the paradigm-level results. While pre-to-post shifts in the UR probe were statistically significant only in the transformed course, and in directions aligned with the learning goal, the overwhelming prevalence of S responses in all cases gives this result little practical significance. The effect sizes of the DMSS probe are also more favorable in the transformed course, but because this difference is due to differences in pre-test proportions rather than post-test proportions, we hesitate to interpret it further.

Finally, analyzing pre-to-post differences in each individual code yields further insight into the success of the transformation. In the RD probe, the absence of an increase in RD-U1 after the transformation indicates not only that more students communicated in alignment with the learning goal (as established earlier in this section), but also that the transformation allowed them to do so with greater sophistication.
In the UR probe, a consolidation of responses into UR-S4 in the transformed course, rather than UR-S1 as in the original course, is additional evidence of more sophisticated reasoning, this time regarding the idea that every number has an uncertainty. However, decreases in UR-S3 might suggest that the transformed course also de-emphasized a more sophisticated conceptual understanding of the role of means, which could be a focus of further improvement.

In the SMDS probe, there were no clear messages from analyzing pre-post differences code-by-code, underscoring the inherent consistency of responses to this probe. While the changes in the original course seem at first to align with the learning goal, further consideration of the reasoning they represent complicates this picture (as discussed above). More qualitative study is needed to better understand how students are interpreting, reasoning about, and responding to the SMDS probe.

In the DMSS probe, the observed decreases in DMSS-P2 and DMSS-P3 in the transformed course would suggest that the transformation was successful at encouraging set-like reasoning around data comparison. However, the irregularities around the DMSS probe, as discussed above and in the next section, cast doubt on the full implications of this finding.
C. Successes and Limitations
We start by noting a success regarding research methodology: qualitatively different insights emerged as we proceeded to each deeper level of analysis detail. Indications that the transformation was successful at meeting the learning goal around measurement uncertainty emerged only when considering responses probe-by-probe, and evidence about the depth of that learning emerged only when investigating responses code-by-code. More generally, these observations are a reminder that there is far more to learning around measurement uncertainty than is captured by the point and set paradigms. However, this bigger picture also comes with a limitation: here, we can study only the reasoning prompted by the probes of the PMQ, which is a smaller scope than all that is important to student learning around measurement uncertainty.

Furthermore, studying responses to surveys comes with limitations in general, as the administration of surveys precludes the ability to ask follow-up questions. When applying our coding scheme in this work, we had only the written survey responses to interpret, complete with any ambiguities in what the student was actually thinking. These ambiguities often forced us to assign U codes to such responses. For those students, more interactive methods (such as interviews) would allow us to better distinguish their ability to communicate reasoning from the nature of their reasoning itself.

We also note that the responses we analyze came from a single institution, CU, which is a large, research-focused, highly resourced, primarily white institution of a type that is overrepresented in the literature [57]. A broader range of students and institutions is needed to determine whether these findings hold beyond CU.
Additionally, we studied student reasoning at the introductory undergraduate level, and we expect these findings to apply only to students at similar academic levels.

While we did not directly compare pre and post responses student-by-student in this analysis, we still included in our data set only matched responses, that is, only those students who completed both the pre- and post-test. We did this because we focused our analysis on measuring the effectiveness of our course, rather than making any direct comparisons to other courses or institutions. This framing is distinct from the motivations outlined in ref. [58], which calls for using statistical methods to model student responses and predict the missing responses from students with unmatched data. However, our comparisons between the original and the transformed course, given the similarity in student demographics between the two semesters, and given that they come from the same institutional context and instructor, are less affected by the bias identified in that reference. Furthermore, it would require a larger data set than currently available to apply the techniques described in ref. [58] to the nominal data of paradigm or coding designations. Nonetheless, and unavoidably, there could be some bias in the changes we observe based on exogenous factors that affect both a student's reasoning around measurement uncertainty and their likelihood to complete both the pre- and post-tests.

Lastly, the difference in timing of the pre-test between the original and transformed courses casts some doubt on whether the pre-test in the transformed course is a valid baseline against which to compare the post-test of that semester. With 64% of students completing the pre-test after the first lecture in the transformed semester, any learning from that first lecture could bias, though not eliminate, the measured learning outcomes.
However, considering the focus on measurement uncertainty throughout the course, the first lecture is a very small portion of all of the learning opportunities that the students experienced throughout the course. Furthermore, when answering (Q2) by comparing the transformed course to the original one, additional instructional opportunities before the pre-test in the transformed course would cause measured pre-to-post changes to have a smaller effect size than otherwise, assuming the instruction has an overall effect aligned with the learning goal. Results from (Q1) suggest that instruction does indeed shift students toward the set paradigm overall; thus, the difference in timing would result in a decrease in apparent shifts towards set-like reasoning in the transformed course. Given that we observe evidence for the opposite effect, the timing difference is less of a concern. However, irregularities in the transformed course pre-test results, specifically in the DMSS probe, remain a mystery, and require further investigation before results from that probe can be taken at face value.
VI. SUMMARY
Here, we used the PMQ to measure the effectiveness of the introductory lab course at CU, and a recent transformation of that course. We aimed to answer two research questions: (Q1), Did students respond to the PMQ in ways more aligned with the set paradigm after taking the introductory lab course, compared to when they began the course?, and (Q2), Did student responses to the PMQ after the transformation show greater change towards the set paradigm than responses before the transformation?
With regards to (Q1), we see strong evidence of PMQ responses that are more aligned with the set paradigm in the post-tests from both semesters, compared to the corresponding pre-tests, and we see this evidence at all levels of analysis, from the coarsest to the finest grain sizes. With regards to (Q2), we see evidence that PMQ responses in the transformed course shifted pre-to-post towards more prevalent set-like reasoning compared to those from the original course. Furthermore, we also see some evidence that the responses from the transformed course tend to shift towards more sophisticated reasoning than in the original course. We also identified specific aspects of a sophisticated understanding of measurement uncertainty that were less apparent in responses from the transformed course than from the original course, suggesting areas for further improvement. These findings add to the growing body of evidence that physics lab courses, even traditional ones, have value by creating opportunities for students to learn important aspects of conceptual physics and to develop expert physics practices.
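As an illustration of the kind of pre-to-post contingency comparison cited in this work (Fisher's exact test [50] with a Holm correction for multiple comparisons [51]), the following sketch applies those two methods to hypothetical paradigm counts. The probe names come from the PMQ, but the counts, function names, and analysis structure are illustrative assumptions only, not the study's data or analysis code (the study's analysis was performed in R [52]).

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    n = a + b + c + d
    row1, col1 = a + b, a + c
    def pmf(x):
        # Hypergeometric probability of x counts in the top-left cell.
        return comb(row1, x) * comb(n - row1, col1 - x) / comb(n, col1)
    p_obs = pmf(a)
    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    # The small tolerance guards against floating-point ties.
    return sum(pmf(x) for x in range(lo, hi + 1) if pmf(x) <= p_obs * (1 + 1e-9))

def holm_adjust(pvals):
    """Holm's step-down adjustment for multiple comparisons."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max  # enforce monotonicity of adjusted p-values
    return adjusted

# Purely hypothetical (set-like, point-like) response counts per probe,
# as (pre, post) pairs; these are NOT the study's data.
probes = {
    "RD": ((40, 60), (70, 30)),
    "UR": ((55, 45), (75, 25)),
    "SMDS": ((50, 50), (65, 35)),
}
raw = [fisher_exact_two_sided(pre[0], pre[1], post[0], post[1])
       for pre, post in probes.values()]
for name, p_adj in zip(probes, holm_adjust(raw)):
    print(f"{name}: Holm-adjusted p = {p_adj:.4f}")
```

A smaller adjusted p-value for a probe would indicate that the pre-to-post change in the set-like versus point-like proportions is unlikely under the null hypothesis of no change, while the Holm step controls the family-wise error rate across the probes.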
ACKNOWLEDGMENTS
We acknowledge Daniel Bolton, Colin West, Skip Woody, Michael Schefferstein, Adam Ellzey, Michael Dubson, and Jason Bossert for their work in developing the transformed course. We also acknowledge Dimitri Dounas-Frazer and Laura Ríos for their input during the course transformation, and the many student testers for their contributions to improving the labs. We acknowledge Dimitri Dounas-Frazer for creating the online version of the PMQ, and Saalih Allie for his perspective and advice regarding the PMQ. This work is supported by the NSF under Grant Nos. PHYS-1734006 and DMR-1548924. It is further supported by the office of the Associate Dean for Education of the College of Engineering and Applied Science, and the College of Arts and Sciences, at the University of Colorado Boulder.

[1] PCAST STEM Undergraduate Education Working Group,
Engage to Excel: Producing One Million Additional College Graduates with Degrees in STEM (Executive Office of the President, 2012).
[2] National Research Council,
Discipline-Based Education Research: Understanding and Improving Learning in Undergraduate Science and Engineering (National Academies Press, Washington, DC, 2012).
[3] AAPT Committee on Laboratories,
AAPT Recommendations for the Undergraduate Physics Laboratory Curriculum (American Association of Physics Teachers, 2014).
[4] Joint Task Force on Undergraduate Physics Programs,
Phys21: Preparing Physics Students for 21st Century Careers (American Physical Society and American Association of Physics Teachers, College Park, MD, 2016).
[5] C. Moskovitz and D. Kellogg, Science, 919 (2011).
[6] B. M. Zwickl, N. Finkelstein, and H. J. Lewandowski, American Journal of Physics, 63 (2013), arXiv:1207.2177.
[7] B. R. Wilcox and H. J. Lewandowski, American Journal of Physics, 212 (2018).
[8] E. M. Smith, N. Chodkowski, and N. G. Holmes, in (American Association of Physics Teachers, 2020).
[9] E. M. Smith, M. M. Stein, C. Walsh, and N. Holmes, Physical Review X, 011029 (2020).
[10] B. Pollard, R. Hobbs, D. R. Dounas-Frazer, and H. J. Lewandowski, in (American Association of Physics Teachers, 2020).
[11] H. J. Lewandowski, B. Pollard, and C. G. West, in (American Association of Physics Teachers, 2020).
[12] H. J. Lewandowski, D. R. Bolton, and B. Pollard, in (American Association of Physics Teachers, 2019).
[13] N. Holmes, J. Olsen, J. L. Thomas, and C. E. Wieman, Physical Review Physics Education Research, 010129 (2017).
[14] N. G. Holmes and C. E. Wieman, Physics Today, 38 (2018).
[15] M. F. J. Fox, A. Werth, J. R. Hoehn, and H. J. Lewandowski, arXiv (2020), arXiv:2007.01271.
[16] R. F. Lippmann, Students' understanding of measurement and uncertainty in the physics laboratory: social construction, underlying concepts, and quantitative analysis, Ph.D. thesis, University of Maryland, College Park (2003).
[17] R. L. Kung, American Journal of Physics, 771 (2005).
[18] R. Beichner, Research-Based Reform of University Physics (2007).
[19] D. S. Abbot, Assessing student understanding of measurement and uncertainty, Ph.D. thesis, North Carolina State University (2003).
[20] E. Etkina and A. V. Heuvelen, Research-based reform of university physics, 1 (2007).
[21] E. Etkina, A. Karelina, M. Ruibal-Villasenor, D. Rosengrant, R. Jordan, and C. E. Hmelo-Silver, Journal of the Learning Sciences, 54 (2010).
[22] N. G.
Holmes, Structured quantitative inquiry labs: developing critical thinking in the introductory physics laboratory, Ph.D. thesis, The University of British Columbia (2014).
[23] L. E. Strubbe, J. Ives, N. G. Holmes, D. A. Bonn, and N. K. Sumah (American Association of Physics Teachers (AAPT), 2016) pp. 340–343.
[24] N. G. Holmes and E. M. Smith, The Physics Teacher, 296 (2019).
[25] E. M. Smith, M. M. Stein, C. Walsh, and N. G. Holmes, Physical Review X, 011029 (2020).
[26] D. L. Deardorff, Introductory physics students' treatment of measurement uncertainty, Ph.D. thesis, North Carolina State University (2001).
[27] R. Lippmann Kung and C. Linder, NorDiNa, 40 (2006).
[28] N. G. Holmes and D. A. Bonn, in Physics Education Research Conference Proceedings (American Association of Physics Teachers (AAPT), 2014) pp. 185–188.
[29] N. Majiet and S. Allie, in
Physics Education Research Conference Proceedings, Vol. 2018 (American Association of Physics Teachers, 2018).
[30] M. M. Stein, C. White, G. Passante, and N. G. Holmes, in
Physics Education Research Conference Proceedings (American Association of Physics Teachers (AAPT), 2020).
[31] A. Madsen, S. B. McKagan, E. C. Sayre, and C. A. Paul, American Journal of Physics, 350 (2019).
[32] B. Pollard, R. Hobbs, J. T. Stanley, D. R. Dounas-Frazer, and H. J. Lewandowski, in (American Association of Physics Teachers, 2018) pp. 312–315.
[33] J. Day and D. Bonn, Physical Review Physics Education Research, 010114 (2011).
[34] N. G. Holmes, J. Day, A. H. K. Park, D. A. Bonn, and I. Roll, Instructional Science, 523 (2014).
[35] J. Day, J. B. Stang, N. G. Holmes, D. Kumar, and D. A. Bonn, Physical Review Special Topics - Physics Education Research, 020104 (2016).
[36] H. Eshach and I. Kukliansky, Canadian Journal of Physics, 1205 (2016).
[37] C. Walsh, K. N. Quinn, C. Wieman, and N. Holmes, Physical Review Physics Education Research, 010135 (2019).
[38] B. Campbell, F. Lubben, A. Buffler, and S. Allie, AJRMSTE, 1 (2005).
[39] B. Pollard and H. J. Lewandowski, in (American Association of Physics Teachers, 2019).
[40] H. J. Lewandowski, R. Hobbs, J. T. Stanley, D. R. Dounas-Frazer, and B. Pollard, in (American Association of Physics Teachers, 2018) pp. 244–247.
[41] F. Lubben and R. Millar, International Journal of Science Education, 955 (1996).
[42] T. S. Volkwyn, First year students' understanding of measurement in physics laboratory work, Ph.D. thesis (2005).
[43] R. Millar, F. Lubben, R. Gott, and S. Duggan, Research Papers in Education, 207 (1994).
[44] R. Millar, R. Gott, F. Lubben, and S. Duggan, "Children's performance in investigative tasks in science: A framework for considering progression," in Progression in learning, edited by M. Hughes (Multilingual Matters Ltd, 1996) pp. 82–108.
[45] A. Buffler, S. Allie, and F. Lubben, International Journal of Science Education, 37 (2001).
[46] A. Buffler, S. Allie, F. Lubben, and B. Campbell, 4th Conference of the European Science Education Research Association, 19 (2003).
[47] T. S. Volkwyn, S. Allie, A. Buffler, and F. Lubben, Physical Review Special Topics - Physics Education Research, 064005 (2018).
[50] R. A.
Fisher, Journal of the Royal Statistical Society, 87 (1922).
[51] S. Holm, Scandinavian Journal of Statistics, 65 (1979).
[52] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria (2019).
[53] J. Cohen,
Statistical Power Analysis for the Behavioral Sciences, 2nd ed. (Routledge, 2013) p. 567.
[54] A. N. Parks and M. Schmeichel, Journal for Research in Mathematics Education, 238 (2012).
[55] S. Kanim and X. Cid, Physical Review Physics Education Research, 020106 (2020).
[56] B. R. Wilcox and H. Lewandowski, Physical Review Physics Education Research, 010123 (2016).
[57] S. Kanim and X. C. Cid, arXiv (2017), arXiv:1710.02598.
[58] J. Nissen, R. Donatello, and B. Van Dusen, Physical Review Physics Education Research, 020106 (2019), arXiv:1809.00035.

Appendix A: PMQ Probes

FIG. 8. Contextual information for the PMQ. This information precedes the probes themselves. Reproduced from Ref. [47].
FIG. 9. The RD probe of the PMQ. Reproduced from Ref. [47].
FIG. 10. The UR probe of the PMQ. Reproduced from Ref. [47].
FIG. 11. The SMDS probe of the PMQ. Reproduced from Ref. [47].
FIG. 12. The DMSS probe of the PMQ. Reproduced from Ref. [47].

Appendix B: Codebook (to go in Supplemental Material)

TABLE IV. Codes for the RD probe
Probe | Number | Paradigm | Name | Definition: "Argument is that..."
RD | P1 | P | Measure the true value | ...the experimenter could measure the correct value in a single measurement.
RD | P2 | P | Identify the outliers after all measurements | ...repeated measurements are needed in order to know which measurements were mistakes or outliers, after all measurements are taken. This code includes the idea that the experimenter must get the same result at least twice for it to be correct.
RD | P3 | P | Available time or resources | ...a course of action is better due to considerations about how much time or resources it would require.
RD | P4 | P | Need to practice as you go | ...practice is needed to account for errors/outside factors as measurements are being made.
RD | P5 | P | Misc. point | Point-like argument that doesn't fit the other point-like codes.
RD | S1 | S | Measure a spread | ...multiple measurements will allow the experimenter to calculate/estimate a spread/variation/uncertainty.
RD | S2 | S | Measure an average | ...multiple measurements will allow the experimenter to calculate an average/mean.
RD | S3 | S | Use all the data together | ...multiple measurements will all be used together to improve accuracy/precision/goodness. Doesn't talk about average or spread specifically.
RD | S4 | S | Reduce uncertainty of mean | ...multiple measurements will be used to reduce the error/uncertainty of the mean/average.
RD | S5 | S | Misc. set | Set-like argument that doesn't fit the other set-like codes.
RD | U1 | U | Just take more data | ...experimenter needs to take more data. No statistical reasoning apparent.
RD | U2 | U | More data cancels out error | ...experimenter needs to take more data to cancel or outweigh the effect of error.
RD | U3 | U | More data is better | ...more data is better / more accurate / more precise / etc. Includes if reasoning other than statistical reasoning apparent.
RD | U4 | U | Misc. | Argument that doesn't fit into any of the other codes.
RD | U5 | U | Unintelligible | Unintelligible / blank / logically incoherent.

TABLE V. Codes for the UR probe
Probe | Number | Paradigm | Name | Definition: "Argument is that..."
UR | P1 | P | Choose single value | ...experimenter should choose a single value to report (for any reason).
UR | P2 | P | Average as last resort | ...experimenter should report the average because no better option exists.
UR | P3 | P | Misc. point | Point-like argument that doesn't fit the other point-like codes.
UR | S1 | S | Simply average, or names reported value as average | States things like "I averaged," "do the average," "average is best," or "it is the average," but does not elaborate along the lines of the other codes. Includes statements that simply say what the reported value is.
UR | S2 | S | Why average is useful | ...reporting the average is best, because (in general) it accounts for fluctuations or errors, or because it predicts future measurements.
UR | S3 | S | Why average is appropriate in this case | ...reporting the average is best because all of this data matters, or because the spread of this data is small enough. Includes reporting all data as well as the average. Does not include "it is the correct thing to do" (see S7).
UR | S4 | S | Report average and spread | ...experimenter should report the average and the uncertainty/range/spread.
UR | S5 | S | How to compute | Response explains how to compute the average. May be double coded when a separate explanation appears.
UR | S6 | S | Discard outliers, then average | ...experimenter should discard outliers/extreme data points, and then compute an average from the data that remains.
UR | S7 | S | Misc. set | Set-like argument that doesn't fit the other set-like codes. Rule-based reasons are coded here (e.g. "logical thing to do" or "the correct thing to do").
UR | U1 | U | Misc. | Argument that doesn't fit into any of the other codes.
UR | U2 | U | Unintelligible | Unintelligible / blank / logically incoherent.

TABLE VI. Codes for the SMDS probe
Probe | Number | Paradigm | Name | Definition: "Argument is that..."
SMDS | P1 | P | The means are the same | ...the groups agree because the means are the same.
SMDS | P2 | P | Spreads don't matter | ...the fact that the spreads or individual trials are different does not matter, including responses that focus on agreement of the averages while providing a reason for why the sets are different.
SMDS | P3 | P | A has fewer outliers | ...A is better because that group has fewer outliers, or A's individual measurements are more precise. Contains no reasoning about spread.
SMDS | P4 | P | Differences in carefulness | ...differences in the spread are due to differences in how carefully the measurements were performed.
SMDS | P5 | P | Chose B, no explanation | Student chose "B" but left the explanation blank.
SMDS | P6 | P | Misc. point | Point-like argument that doesn't fit the other point-like codes.
SMDS | S1 | S | A is better | ...group A is better / more accurate / more precise / etc. No further explanation.
SMDS | S2 | S | Smaller spread is better, no mention of external factors | ...a smaller spread/uncertainty/range is better / more accurate / more precise / etc. The response does not mention external factors, outliers, human error, etc.
SMDS | S3 | S | Smaller spread is better, due to external factors | ...a smaller spread/uncertainty/range is better / more accurate / more precise / etc. The response mentions external factors, outliers, human error, etc.
SMDS | S4 | S | Chose A, no explanation | Student chose "A" but left the explanation blank.
SMDS | S5 | S | Misc. set | Set-like argument that doesn't fit the other set-like codes.
SMDS | U1 | U | Misc. | Argument that doesn't fit into any of the other codes.
SMDS | U2 | U | Unintelligible | Unintelligible / blank / logically incoherent.

TABLE VII. Codes for the DMSS probe