Assessment of critical thinking in physics labs
Cole Walsh, Katherine N. Quinn, and N. G. Holmes
Laboratory of Atomic and Solid State Physics, Cornell University, Ithaca, NY, 14853
Despite the significant amount of time undergraduate students spend in introductory physics labs, there is little consensus on instructional goals and accepted diagnostic assessments for these labs. In response to these issues, we have developed the Physics Lab Inventory of Critical thinking (PLIC) to assess students' proficiency with critical thinking in a physics lab context. Specifically, the PLIC aims to evaluate students' skills in making sense of data, variability, models, and experimental methods and to assess the effectiveness of lab courses at developing these skills. Here, we discuss the statistical and expert validation of this instrument using 2681 valid student responses collected from 12 institutions during the 2017-2018 academic year. As a part of our validation, we address the impact of lab design on student performance on the instrument.
I. INTRODUCTION
The goals of instruction in undergraduate physics labs typically involve reinforcing students' conceptual knowledge of topics introduced in lecture and, to a lesser extent, teaching students how to work with experimental equipment [1]. Though research has shown that the instructional goals of labs are highly debated and vary across disciplines and institutions [2], it has also shown that traditional labs do not measurably affect student learning of physics content [3]. There are, however, many important skills that a lab setting uniquely allows students to learn [4]. In particular, proficiency in making sense of data, variability, models, and experimental methods are all skills that are best developed in physics labs [5]. These skills make up a set of recommended learning goals for undergraduate physics lab courses recently endorsed by the American Association of Physics Teachers [6].

With laboratory course transformations now under way at multiple institutions in an effort to meet these new instructional goals, there is an increasing need for validated ways to measure student acquisition of these skills, yet few validated assessments exist. In response, we have developed the Physics Lab Inventory of Critical thinking (PLIC). Here, we demonstrate the concurrent validity [7] of the PLIC, that is, how consistent performance is with certain expected results. We expect that, whether from instruction or from selection effects, performance on the PLIC should be higher with greater physics maturity of the respondent. We define physics maturity by the level of the lab course that students were enrolled in when they took the PLIC. We also address the impact on PLIC performance of lab courses that have undergone specific transformations to meet the goals outlined above, which we refer to as Structured Quantitative Inquiry labs, or SQIlabs [4]. This work is part of the ongoing validation and reliability assessment of the PLIC following the steps laid out in Refs. [8] and [9]; other validation studies of the PLIC are presented in Refs. [10] and [11].
II. THE PLIC
The PLIC is a 4-page online survey that uses a combination of Likert-style and 'select all that apply' questions. The Likert-style questions ask respondents to evaluate how well data agree with a model or how well a particular group tested the model. The 'select all that apply' questions ask respondents to elaborate on their reasoning for the Likert-style questions and to suggest what the group should do next. For each of these 'reasoning' and 'what to do next' questions, there are between 5 and 10 options to choose from, and students are limited to selecting no more than three options.

The PLIC presents respondents with a hypothetical scenario in which two groups of physicists are completing a mass-on-a-spring experiment to test a model where the period of oscillation of the bouncing mass is T = 2π√(m/k). Group 1 conducts 10 repeated trials of the period of oscillation for two different masses and uses the given equation to find k in each case. Group 2 conducts two repeated trials of the period of oscillation for 10 different masses, plots T² versus m, and fits a straight line with the intercept fixed at the origin (one-parameter fit). Finally, Group 2 attempts to fit a straight line with a free intercept (two-parameter fit). In addition to the 'select all that apply' questions outlined above, respondents are also asked which fit Group 2 should use and why, as well as which group they think did a better job of testing the model and why.
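To make the two groups' strategies concrete, the model can be rearranged as follows; this is a standard manipulation of the given equation, included here for clarity rather than drawn from the PLIC itself:

    % Group 1: solve the model for k, one value per (m, T) pair
    \begin{gather*}
      T = 2\pi\sqrt{\frac{m}{k}}
      \quad\Longrightarrow\quad
      k = \frac{4\pi^2 m}{T^2}, \\
      % Group 2: squaring linearizes the model; T^2 vs. m is a straight
      % line through the origin with slope 4\pi^2 / k
      T^2 = \frac{4\pi^2}{k}\, m.
    \end{gather*}

In this form, the slope of Group 2's one-parameter fit estimates 4π²/k, and a statistically nonzero intercept in the two-parameter fit would signal a deviation from the model.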
In scoring the PLIC, we focus on the 'select all that apply' questions. That is, we do not score based on which group a student thinks did better or how well they think the group tested the model, only on the ideas they used to make those decisions. Based on responses from 24 physics post-docs, lecturers, and faculty, 1-2 responses per question that were selected by at least a set threshold fraction of these experts were identified as expert (E). Respondents who select at least one of these E responses on a question receive 1 point. Further, 1-2 responses per question that were selected by more than a lower threshold, but less than the expert threshold, of experts were identified as partially expert (P). Respondents who fail to select any E responses, but select at least one P response, are awarded 0.5 points. Finally, 2-3 responses per question that were picked by fewer than the lower threshold of experts were identified as particularly novice (N). Respondents selecting at least one N response have 0.25 points deducted from their score. All other responses to a question have no impact on a respondent's score on that question. All scores are floored at zero, so even if a respondent selects only neutral and novice responses to a question, they still receive a score of 0 on that question.

This scoring scheme allows students to obtain the maximum possible score on a question regardless of how many responses they select; we allow students to select up to three responses, but do not penalize them for picking fewer. The scheme also provides credit for partial displays of critical thinking and differentiates students who answer in correct and partially correct ways from students who still have novice ideas about physics experimentation. On any question, a student can receive a score between 0 and 1 in increments of 0.25. The PLIC's current format of 10 questions then allows for a maximum possible total score of 10 points.

The original 24 experts (upon whom the scoring scheme was based) and 24 additional physics experts who responded subsequently obtained overall scores whose distributions are statistically indistinguishable; since neither of these data are normally distributed and the sample sizes are small, we compared the two sample distributions with a non-parametric Mann-Whitney U-test. We therefore combine these data for later comparisons.

Over the course of the 2017-2018 academic year, we collected PLIC responses before and after course instruction from 25 courses across 12 institutions. A total of 2681 valid responses were collected from students who completed the survey, consented to participate in the study, and indicated they were at least 18 years of age. Of these valid responses, pre- and post-responses were matched for individual students using the student ID or full name they provided at the end of the survey. In 2017-2018, we collected matched pre- and post-surveys from 726 students. The lab level and type (traditional or SQI) for each class were inferred from information provided by instructors about their course via a course information survey (CIS). The CIS is part of an automated system associated with the PLIC, which was adapted from Ref. [12].
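As an illustration, the scoring rule described above is simple to express in code. The following is a minimal Python sketch under our reading of the scheme; the function and option names are hypothetical, not part of the released PLIC materials.

    def score_question(selected, expert, partial, novice):
        """Score one PLIC 'select all that apply' question.

        selected: the (at most three) options a respondent chose
        expert, partial, novice: option IDs classified as E, P, or N
        """
        selected = set(selected)
        if selected & expert:      # any E response earns the full point
            score = 1.0
        elif selected & partial:   # otherwise, any P response earns half credit
            score = 0.5
        else:
            score = 0.0
        if selected & novice:      # any N response costs 0.25 points
            score -= 0.25
        return max(score, 0.0)     # all scores are floored at zero


    # A respondent picking one expert and one novice option scores 0.75
    print(score_question({"R1", "R4"}, expert={"R1"},
                         partial={"R2"}, novice={"R3", "R4"}))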
III. METHODS

To assess concurrent validity, we split our matched dataset by physics maturity and compared performance on the PLIC. This split dataset includes 584 students in first-year (FY) labs, 108 students in beyond-first-year (BFY) labs, and 34 students in graduate level labs. To assess the impact of SQIlab instruction on students' PLIC performance, we grouped FY students according to the type of lab their instructor indicated they were running as part of the CIS. The data include 383 students who participated in traditional FY physics labs and 201 who participated in FY SQIlabs. We also examined students' responses to individual questions in detail to illuminate the differences in overall performance.

Our data follow an approximately normal distribution with roughly equal variances in pre- and post-scores (see Fig. 1). For this reason, we used parametric statistical tests to compare paired (paired t-test) and unpaired (unpaired t-test) sample means, and a one-way Analysis of Covariance (ANCOVA) to evaluate the effect of lab treatment on post-scores with pre-scores as a covariate. We used Cohen's d to calculate effect sizes between two sample means and Cohen's f to calculate effect sizes of the independent variable and covariate on the dependent variable in ANCOVA.
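An analysis of this kind can be reproduced with standard Python tooling. Below is a minimal sketch, assuming a matched pre/post score per student and a lab-type label; the dataset is a synthetic stand-in with hypothetical column names, and Cohen's d and f are computed from their usual definitions.

    import numpy as np
    import pandas as pd
    from scipy import stats
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Hypothetical matched dataset: one row per student (toy numbers, not PLIC data)
    rng = np.random.default_rng(0)
    n = 584
    df = pd.DataFrame({
        "pre": rng.normal(5.0, 1.5, n),
        "post": rng.normal(5.2, 1.5, n),
        "lab": rng.choice(["traditional", "SQI"], n),
    })

    # Paired t-test on matched pre/post means within a group of students
    t_paired, p_paired = stats.ttest_rel(df["pre"], df["post"])

    # Unpaired t-test, e.g. comparing pre-scores between two cohorts
    pre_trad = df.loc[df["lab"] == "traditional", "pre"]
    pre_sqi = df.loc[df["lab"] == "SQI", "pre"]
    t_unpaired, p_unpaired = stats.ttest_ind(pre_trad, pre_sqi)

    # Non-parametric alternative (used above for the small expert samples)
    u_stat, p_u = stats.mannwhitneyu(pre_trad, pre_sqi)

    # Cohen's d between two sample means (pooled-SD convention)
    pooled_sd = np.sqrt((pre_trad.var(ddof=1) + pre_sqi.var(ddof=1)) / 2)
    cohens_d = (pre_trad.mean() - pre_sqi.mean()) / pooled_sd

    # One-way ANCOVA: post-scores by lab treatment with pre-scores as covariate
    model = smf.ols("post ~ pre + C(lab)", data=df).fit()
    anova = sm.stats.anova_lm(model, typ=2)

    # Cohen's f for each effect, from its partial eta-squared
    ss_resid = anova.loc["Residual", "sum_sq"]
    effects = anova.drop(index="Residual")
    eta_sq = effects["sum_sq"] / (effects["sum_sq"] + ss_resid)
    cohens_f = np.sqrt(eta_sq / (1 - eta_sq))
    print(anova, cohens_f, sep="\n")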
TABLE I. Performance on the PLIC across different levels of physics maturity: pre- and post-instruction average scores, with significance levels (p) and effect sizes (d) reported for the differences in pre- and post-means within each group of students. N is the number of matched responses within a dataset, except in the case of the expert surveys, where respondents only filled out the survey once (FY: N = 584; BFY: N = 108; Grad: N = 34; Expert: N = 48).

FIG. 1. Distributions of matched pre- and post-scores on the PLIC. N = 726.

IV. RESULTS AND DISCUSSION

A. Concurrent Validity
We begin by comparing respondents' performance on the PLIC by physics maturity. In Table I, we report the average scores for students enrolled in different levels of physics lab courses, as well as for our 48 experts, who only took the PLIC once. The significance levels and effect sizes between pre- and post-mean scores within each group are also included.

For all three groups of students, the differences between the pre- and post-instruction means are not statistically significant and the effect sizes are small or very small. Conversely, the pre-instruction means are statistically different between all groups (unpaired t-tests), other than between students in BFY and graduate level labs. The effect sizes ranged from small, between students in BFY and graduate labs, to very large, between students in FY labs and experts. The clear differences in means between groups of differing physics maturity, coupled with the lack of measurable increase in mean scores following instruction at any level, may imply that these differences arise from selection effects rather than cumulative instruction. This has been seen in other evaluations of students' lab sophistication as well [13].

We illustrate in Fig. 2 how the differences in physics maturity play out on one question from the pre-instruction PLIC. We use the pre-survey here since it allows us to examine students' thinking before instruction and eliminates concern about students seeing the PLIC for a second time. We chose the question presented here because it clearly demonstrates the inherent differences in item selections by respondents from these different groups. In this question, respondents were asked: 'How similar or different do you think Group 1's spring constant (k) values are?' (Likert-style, not scored) and 'What features were most important in comparing the two k values?' ('select all that apply', scored). The answers to the 'select all that apply' question that affect a respondent's score, along with their classifications, are:

• R1 (Expert) - the difference in the k-values compared to their uncertainties,
• R2 (Partial-Expert) - the size of the uncertainties (or the variability in the data),
• R3 (Novice) - the difference between the oscillation periods of the masses,
• R4 (Novice) - how they accounted for human error.

FIG. 2. Fraction of respondents that select a particular item in response to a question on the pre-instruction PLIC about how well the k-values from data collected using two different masses agree with each other. The responses R1-R4 are explained in the text. N(FY) = 584, N(BFY) = 108, N(Grad) = 34, N(Experts) = 48.

As expected, the expert answer to this question is picked increasingly often with the physics maturity of the respondent. Additionally, the two novice answers to this question are picked less frequently with additional physics maturity of the respondent. The partial-expert response, however, is picked almost equally often by all students. The fact that students at all levels equally value a partially correct answer could indicate that most physics lab courses, from introductory to advanced, have some focus on uncertainty. Though we have described just one question in detail here, the performance differences are present across all of the questions on the PLIC: experts score higher than students in BFY labs who, in turn, score higher than students in FY labs. As seen here, these performance differences are manifest in the expertness of responses.
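The expert response R1 amounts to asking whether the difference between the two k-values is large compared with its combined uncertainty. A minimal sketch of that comparison, using made-up numbers rather than PLIC data:

    import numpy as np

    # Hypothetical spring-constant estimates (N/m) with standard uncertainties
    k1, dk1 = 24.6, 0.4
    k2, dk2 = 25.8, 0.5

    # R1-style comparison: the difference measured against its combined uncertainty
    t_prime = abs(k1 - k2) / np.sqrt(dk1**2 + dk2**2)
    print(t_prime)  # here ~1.9; values much larger than ~2 suggest real disagreement

By contrast, the partial-expert response R2 attends only to the sizes of the uncertainties themselves, without forming this ratio.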
B. Performance by Lab Type
We now examine how students participating in labs designed to meet the instructional goals that the PLIC targets performed in comparison to their counterparts in traditional lab settings. We limit our analysis to FY labs, of which three were best described as following the SQI format, while the rest were of the traditional format. The overall performance of these two groups of students is illustrated in Fig. 3.

FIG. 3. Box plots of overall PLIC scores grouped by the type of lab students participated in.

We performed an ANCOVA comparing PLIC post-scores across lab treatments using pre-scores as a covariate (Table II). We see that, controlling for pre-instruction scores, lab treatment does have a statistically significant impact on post-instruction scores, with a medium effect.

TABLE II. ANCOVA results for post-instruction scores; both effects are statistically significant.

Effect          F      Effect Size
Pre-scores      18.39  0.18
Lab Treatment   18.61  0.18

To see how these differences in overall scores arise, we looked at how students in the two lab types compare in their responses to one question. Fig. 4 shows the fraction of students who selected particular responses to one of the questions on the PLIC post-survey, grouped by lab type. We show only the post-instruction data since there was no statistically significant difference in pre-instruction scores on this question between the two groups. Again, we chose this question because it most clearly illustrates key differences in thinking by respondents from the two lab types.

In this question, students are asked: 'How similar or different do you think Group 2's data are from the new best-fit line?' (Likert-style, not scored) and 'What features were most important in comparing the fit to the data?' ('select all that apply', scored). The responses that affect a respondent's score are:

• R1 (Expert) - the way points are scattered above and below the line,
• R2 (Expert) - how close the points are to the line compared to their uncertainties,
• R3 (Partial-Expert) - number of points with uncertainties crossing the line,
• R4 (Novice) - the number of outliers,
• R5 (Novice) - number of points above the line compared to the number below.

FIG. 4. Fraction of students that select a particular item in response to a question on the PLIC post-survey about how well data fit a two-parameter best-fit line. The responses R1-R5 are explained in the text. N(Traditional) = 383, N(SQI) = 201.

One of the expert responses, R2, is a common response by students in both lab types and is mostly unaffected by instruction. However, following instruction, students in SQIlabs favored the other expert response, R1, much more than students in traditional labs did. Both R1 and R5 are concerned with the overall distribution of points about the best-fit line. Thus, students taught in SQIlabs appear to become more engaged with the importance of the distribution of residuals, though some students may oversimplify this to the sheer number of points above and below the line. Traditionally taught students, meanwhile, maintain their focus on the closeness of the points to the best-fit line and do not acknowledge the importance of the distribution of residuals any more than they did prior to instruction. Interestingly, students in both lab types become less interested in the presence of outliers following instruction. Again, though we have included only one question here for illustration purposes, students in SQIlabs see, on average, increases in performance on all but one question, whereas the average scores for students in traditional labs increase on only 4 out of 10 questions.
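To ground what R1 and R2 ask students to notice, the sketch below reproduces Group 2's two fits on synthetic stand-in data (not PLIC data) and computes both kinds of checks: residuals measured against their uncertainties (R2) and the pattern of residual signs about the line (R1). The numbers and names here are illustrative assumptions.

    import numpy as np
    from scipy.optimize import curve_fit

    # Synthetic stand-in for Group 2's data: T^2 vs. m with noise
    rng = np.random.default_rng(1)
    k_true = 25.0                        # N/m
    m = np.linspace(0.05, 0.5, 10)       # kg
    T2 = (4 * np.pi**2 / k_true) * m + rng.normal(0, 0.005, m.size)
    sigma = np.full(m.size, 0.005)       # measurement uncertainties on T^2

    # One-parameter fit: line with the intercept fixed at the origin
    f1 = lambda m, slope: slope * m
    p1, _ = curve_fit(f1, m, T2, sigma=sigma, absolute_sigma=True)

    # Two-parameter fit: line with a free intercept
    f2 = lambda m, slope, intercept: slope * m + intercept
    p2, _ = curve_fit(f2, m, T2, sigma=sigma, absolute_sigma=True)

    # R2-style check: residuals relative to their uncertainties (chi-squared)
    resid = T2 - f1(m, *p1)
    chi2_per_dof = np.sum((resid / sigma) ** 2) / (m.size - 1)

    # R1-style check: how residuals scatter above and below the line;
    # a run of same-sign residuals hints at a systematic trend
    signs = np.sign(resid)
    print(chi2_per_dof, signs)

For a fit consistent with the model, chi-squared per degree of freedom should be near 1 and the residual signs should alternate irregularly about zero.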
V. CONCLUSIONS
With the need for large-scale reform in physics lab instruction comes an equally important need for ways to evaluate the new instructional goals. Here, we have introduced one such method for measuring these goals, the PLIC. By comparing the performance of physics experts to students in FY and BFY undergraduate labs, and those in graduate-level labs, we have demonstrated the concurrent validity of the assessment: respondents with a greater level of physics maturity perform consistently better on the PLIC, though this is likely due to selection effects rather than instruction.

Further, though we have seen no statistically significant shifts in performance following instruction for large cohorts of students, this is not the case for students enrolled in SQIlabs designed to teach the skills that the PLIC is designed to measure. Using ANCOVA, we have shown that SQIlabs have a statistically significant, medium-sized effect on PLIC performance. Despite there being no measurable difference in PLIC performance before instruction, students in SQIlabs separated from their counterparts in traditional labs, in terms of PLIC performance, following instruction. These studies establish one measure of validity of the PLIC and its importance in measuring the skills that we aim to teach in meeting the new guidelines for physics labs.
ACKNOWLEDGMENTS
This material is based upon work supported by the National Science Foundation under Grant No. 1611482. We would like to acknowledge Carl Wieman and the members of CPERL for their useful feedback, and Heather Lewandowski and Bethany Wilcox for their support in developing the administration system.

[1] R. Millar, The Role of Practical Work in the Teaching and Learning of Science (National Academy of Sciences, Washington, D.C., 2004).
[2] A. Hofstein and V. N. Lunetta, Sci. Educ., 1 (2004).
[3] N. Holmes et al., Phys. Rev. Phys. Educ. Res., 010129 (2017).
[4] N. G. Holmes et al., PNAS, 36 (2015).
[5] E. Etkina et al., Am. J. Phys., 979 (2006).
[6] J. Kozminski et al., AAPT Recommendations for the Undergraduate Physics Laboratory Curriculum (2014).
[7] B. R. Wilcox and H. Lewandowski, Phys. Rev. Phys. Educ. Res., 020132 (2016).
[8] W. K. Adams and C. E. Wieman, Int. J. Sci. Educ., 9 (2011).
[9] A. Madsen et al., Am. J. Phys. (2017).
[10] N. G. Holmes and C. E. Wieman, in PERC Proceedings, Sacramento, CA, 2016.
[11] K. N. Quinn et al., in PERC Proceedings, Cincinnati, OH, 2017.
[12] B. R. Wilcox et al., Phys. Rev. Phys. Educ. Res., 010139 (2016).
[13] B. R. Wilcox and H. Lewandowski, Phys. Rev. Phys. Educ. Res. 13.