Toward a valid instrument for measuring physics quantitative literacy
Trevor I. Smith
Department of Physics & Astronomy and Department of STEAM Education, Rowan University, 201 Mullica Hill Rd., Glassboro, NJ 08028, USA
Philip Eaton
School of Natural Sciences and Mathematics, Stockton University, Galloway, NJ 08205, USA
Suzanne White Brahmia, Alexis Olsho, and Charlotte Zimmerman
Department of Physics, University of Washington, Box 351560, Seattle, WA 98195-1560, USA
Andrew Boudreaux
Department of Physics & Astronomy, Western Washington University, 516 High St., Bellingham, WA 98225, USA
We have developed the Physics Inventory of Quantitative Literacy (PIQL) as a tool to measure students' quantitative literacy in the context of introductory physics topics. We present the results from various quantitative analyses used to establish the validity of both the individual items and the PIQL as a whole. We show how examining the results from classical test theory analyses, factor analysis, and item response curves informed decisions regarding the inclusion, removal, or modification of items. We also discuss how the choice to include multiple-choice/multiple-response items has informed both our choices for analyses and the interpretations of their results. We are confident that the most recent version of the PIQL is a valid and reliable instrument for measuring students' physics quantitative literacy in calculus-based introductory physics courses at our primary research site. More data are needed to establish its validity for use at other institutions and in other courses.

I. INTRODUCTION
Physics Quantitative Literacy (PQL) is defined as the interconnected skills, attitudes, and habits of mind that together support the sophisticated use of elementary mathematics in the context of physics [1–4]. Developing PQL is a desired outcome of physics instruction, but valid measures of reasoning about quantities and their relationships in physics contexts are absent from research-based assessment instruments in introductory physics. We have developed the Physics Inventory of Quantitative Literacy (PIQL) to address this need [4]. The PIQL is a reasoning inventory that probes the quantification typically used in introductory physics; it has the potential for an impact analogous to that of the early concept inventories in physics education research, which catalyzed curriculum development efforts by raising awareness of broad instructional goals that were not being met [5–7].

In introductory physics, PQL involves using simple mathematics in sophisticated ways. Reasoning about ratios and proportions, covariation, and signed quantities/negativity are at the heart of quantification in introductory physics [1, 8, 9]. The PIQL was designed around these three facets of quantification, with many items drawn from previous research in mathematics and physics education [10–16].

Over the past two years we have engaged in an iterative process of data collection and analysis, item development, and test revision to establish the validity of the PIQL for use in calculus-based introductory physics courses. In the following sections we discuss the methods we used to analyze the data, how we interpreted the results, and the decisions we made to improve the PIQL. Our focus is on using a variety of quantitative measures to gain a holistic view of the PIQL and to optimize our ability to measure student reasoning. We present the results from each analysis individually, but our decisions to modify individual items and the PIQL as a whole were based on the collective results from all of them.
II. INVENTORY DEVELOPMENT AND DATA SOURCES
The prototype version of the PIQL focused primarily on measuring students' reasoning about ratios and proportions [12–14] and signed quantities/negativity [15–18]. This 18-item protoPIQL also included two items on covariation taken (with permission) from the Precalculus Concept Assessment (PCA) [10]. Revisions were made to improve the validity and reliability of the PIQL, reduce redundancies, and ensure that all three facets of PQL were represented. Later versions of the PIQL include 20 or 21 items.

Data for this study were collected at the beginning of each term (before instruction) in three calculus-based introductory physics courses at a large public research university in the northwestern US. Previous results have shown that overall score distributions on the PIQL are not significantly different in the three courses [19, 20], and this trend has persisted throughout all versions of the PIQL; therefore, we have combined all data collected in each term for this study. Due to our iterative revisions, the items on the PIQL in each of the six data sets are slightly different; we label the data sets by their version of the PIQL: protoPIQL, v1.0, v1.1, v2.0, v2.1, and v2.2. Data were collected from approximately 1000 students for each version.
III. QUANTITATIVE VALIDATION USING CLASSICAL TEST THEORY
We used various quantitative analyses to measure the validity and reliability of each version of the PIQL. Using Classical Test Theory (CTT) we calculated the difficulty and discrimination parameters for each item. We want a wide range of difficulty values, with most items between 0.2 and 0.8 (difficulty represents the fraction of students who answer each item correctly), and we want most discrimination values to be above 0.3 (discrimination represents the difference in CTT difficulty between the top and bottom 27% of students) [21]. We also calculated Cronbach's α as a measure of reliability; a value of at least 0.7 indicates that the test is reliable for measuring the performance of groups of students on a single-construct test, and a value of at least 0.8 indicates that the test is reliable for measuring the performance of individual students [22].

Figure 1 shows the distributions of the CTT difficulty and discrimination parameters for each version of the PIQL. Five of the items in the protoPIQL were considered too easy (difficulty above 0.8), and three items had discrimination values below 0.3; moreover, there was a gap in the middle of the difficulty distribution, with only one item having a difficulty between 0.3 and 0.55. Due to these results, we chose to use only nine of these items in subsequent versions of the PIQL, with one of them being periodically modified. For PIQL v1.0, 11 items were added based on previous research on all three of our PQL facets [10, 12–18], which resulted in a much broader distribution of CTT difficulty values. One additional proportional reasoning item was added to PIQL v1.1; for PIQL v2.0 two covariation items were replaced by newly developed items based on research in mathematics education [23–25]; two items were slightly modified for v2.1; and one item was removed for PIQL v2.2 due to consistently high difficulty and low discrimination parameters.

FIG. 1. CTT difficulty (a) and discrimination (b) parameter distributions for all versions of the PIQL (protoPIQL through v2.2; axes: parameter value vs. number of items). The desired range of difficulty values is between 0.2 and 0.8 (shown by dashed red lines). The desired range for discrimination is above 0.3.

Taken together, these revisions have resulted in a 20-item instrument with a broad range of difficulty values (only one of which is above the desired upper limit of 0.8) and all items having discrimination values above 0.3. Six of the 20 items have large discrimination values (above 0.6), meaning that high-scoring students are much more likely to answer these questions correctly than low-scoring students. Cronbach's α has also increased: the value for the protoPIQL did not meet the threshold for measuring either groups of students or individuals, while the value for PIQL v2.2 meets both thresholds. The distribution of difficulty values for PIQL v2.2 is a little higher than we think would be ideal (average of 0.54), but we have chosen to keep some of the easier items because we recognize that the students in our data set may have had more prior exposure to mathematics and physics instruction than is typical of the introductory physics student population [26]. We consider the changes in parameter values to indicate that we have created a valid and reliable inventory for measuring PQL for students in calculus-based introductory physics courses.
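To make the CTT quantities above concrete, the following is a minimal sketch (our own illustration, not the analysis code used in the study) of how difficulty, discrimination, and Cronbach's α can be computed from a dichotomously scored response matrix. The 27% grouping convention follows the definition given above; the simulated data at the end are purely illustrative.

```python
import numpy as np

def ctt_stats(scores):
    """Compute CTT item statistics from a dichotomous score matrix.

    scores: (n_students, n_items) array of 0/1 entries.
    Returns (difficulty, discrimination, alpha).
    """
    n_students, n_items = scores.shape

    # Difficulty: fraction of students answering each item correctly.
    difficulty = scores.mean(axis=0)

    # Discrimination: difference in item difficulty between the
    # top 27% and bottom 27% of students, ranked by total score.
    totals = scores.sum(axis=1)
    order = np.argsort(totals)
    k = int(round(0.27 * n_students))
    bottom, top = scores[order[:k]], scores[order[-k:]]
    discrimination = top.mean(axis=0) - bottom.mean(axis=0)

    # Cronbach's alpha: (k/(k-1)) * (1 - sum(item variances) / var(total)).
    item_var = scores.var(axis=0, ddof=1).sum()
    total_var = totals.var(ddof=1)
    alpha = (n_items / (n_items - 1)) * (1 - item_var / total_var)

    return difficulty, discrimination, alpha

# Example with simulated responses (illustrative only).
rng = np.random.default_rng(0)
fake = (rng.random((1000, 20)) < 0.6).astype(int)
diff, disc, alpha = ctt_stats(fake)
```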
IV. ANALYZING DATA FROM MULTIPLE-CHOICE/MULTIPLE-RESPONSE ITEMS

We consider PQL to be a conceptual blend between physics concepts and mathematical reasoning [27, 28]. In order to measure the complexity of ideas that students bring from both of these input spaces, we have chosen to include some multiple-choice/multiple-response (MCMR) items in which students are instructed to "select all statements that must be true" from a given list and to "choose all that apply" (emphasis in the original text). The MCMR item format has the potential to reveal more information about students' thinking than standard single-response items, but it also poses problems for data analysis, as typical analyses of multiple-choice tests (such as CTT) assume single-response items.

For MCMR items, dichotomous scoring methods require a student to choose all correct responses and only correct responses to be considered correct. For example, item 18 on PIQL v2.2 has two correct answer choices: D and G. In a dichotomous scoring scheme, a student who picks only answer D would be scored the same way as a student who chooses answers E and F (incorrect). This ignores the nuance and complexity of students' response patterns within (and between) items. As such, the CTT results for these items are not entirely representative of students' responses.

In an effort to move beyond the constraints of dichotomous scoring for MCMR items, we have developed a four-level scoring scale in which we categorize students' responses as Completely Correct, Some Correct (if at least one but not all correct response choices are chosen), Both Correct and Incorrect (if at least one correct and one incorrect response choice are chosen), and Completely Incorrect [19, 20]. Figure 2 shows the results of using this four-level scoring scale to categorize student responses to the six MCMR items on PIQL v2.2. The dark purple Completely Correct bars are equivalent to CTT difficulty; however, Fig. 2 also shows that at least 60% of students provide at least one correct response to each item (Completely Correct, Some Correct, and Both Correct and Incorrect combined), although this is often coupled with an incorrect response (6%–44% of students categorized as Both Correct and Incorrect). This tells a very different story than the CTT results, which group the Some Correct, Both, and Completely Incorrect categories together into a broad Incorrect category.
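As an illustration of the four-level scheme just described, here is a minimal sketch in Python (our own illustration, not the scoring code used in the study); the example sets for item 18 use the correct answers D and G given above.

```python
def four_level_score(selected, correct):
    """Categorize a (non-empty) MCMR response pattern into the
    four levels described in the text.

    selected: set of answer choices the student marked.
    correct:  set of correct answer choices for the item.
    """
    hits = selected & correct    # correct choices the student marked
    misses = selected - correct  # incorrect choices the student marked
    if hits == correct and not misses:
        return "Completely Correct"
    if hits and misses:
        return "Both Correct and Incorrect"
    if hits:
        return "Some Correct"    # some, but not all, correct choices
    return "Completely Incorrect"

# Item 18 on PIQL v2.2 has correct answers D and G:
correct_18 = {"D", "G"}
print(four_level_score({"D"}, correct_18))       # Some Correct
print(four_level_score({"E", "F"}, correct_18))  # Completely Incorrect
print(four_level_score({"D", "E"}, correct_18))  # Both Correct and Incorrect
print(four_level_score({"D", "G"}, correct_18))  # Completely Correct
```

Note that the first two calls, which dichotomous scoring would lump together as simply "incorrect," land in different categories here.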
FIG. 2. Fraction of student responses in each category of our four-level scoring scheme (Completely Correct, Some Correct, Both Correct and Incorrect, Completely Incorrect) for MCMR items with multiple correct answers. These results are from the final version of the PIQL. [Axes: Item vs. Fraction of Responses.]
FIG. 3. Item Response Curves for three items on PIQL v2.2 (Items 14, 17, and 18; axes: Score vs. Fraction of Students). Each plot shows the fraction of students who chose each response out of the students who earned each score on the total test. Item 14 has correct answer B, item 17 has correct answers A, C, and D, and item 18 has correct answers D and G.
These four-level scoring results also reveal differences hidden by dichotomous scoring. For example, on PIQL v2.2 two items (Q17 and Q18) have more than one correct answer choice. Figure 2 shows that approximately the same number of students answer these items completely correctly, but Q17 has a much higher fraction of students in the Some Correct category. Students are much more likely to include one of the incorrect responses to Q18 than they are for Q17. The items with multiple correct answers also present a new question: is it better for a student to choose Some Correct answers or Both Correct and Incorrect answers? The answer may depend on the specifics of each item and the associated answer choices. Future work will include analyzing data from MCMR items to develop a more sophisticated scoring scheme.

To further examine the responses students give to individual PIQL items we use Item Response Curves (IRCs), which show the fraction of students who choose each answer choice as a function of the students' overall score on the PIQL [29–32]. IRCs have been used with single-response tests to rank incorrect responses and to compare different student populations with regard to both correct and incorrect answer choices [31, 32]. We find IRCs particularly helpful for examining student responses to items with multiple correct answers.

Figure 3 shows three IRCs with different behavior. Item 14 is a single-response item with correct answer B; even fairly high-scoring students persist in choosing a particular incorrect answer, F. Item 17 has three correct responses (A, C, D), with A being the most commonly chosen and C the least commonly chosen. Few students at any score level choose E, and fewer than 20% of students who score above average (10.8) choose either incorrect response (B, E). Item 18 is particularly interesting in that all responses are chosen by 20%–60% of students in the middle score range (8–12). This supports the results from Fig. 2 showing that students are likely to choose both a correct and an incorrect response to Q18.

Both the four-level scoring scheme and the IRCs provide more information than traditional CTT analyses and allow us to see patterns in students' responses that go beyond typical dichotomous scoring methods. We have used these to gain a deeper qualitative picture of student performance on each PIQL item, and they have been very valuable for deciding which items to keep, eliminate, or modify.
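A minimal sketch of how an IRC can be tabulated (again, an illustration under our own assumptions rather than the authors' code): for each total-score group, count the fraction of students whose response set includes each answer choice. Because students may select multiple responses to MCMR items, the fractions at a given score need not sum to one.

```python
import numpy as np
from collections import Counter

def item_response_curves(responses, scores, choices="ABCDEFG"):
    """Fraction of students selecting each answer choice to one item,
    grouped by total test score.

    responses: list of sets, one per student (a student answering an
               MCMR item may have marked several choices).
    scores:    sequence of total test scores, same length as responses.
    Returns {choice: {score: fraction}}.
    """
    scores = np.asarray(scores)
    curves = {c: {} for c in choices}
    for s in np.unique(scores):
        idx = np.where(scores == s)[0]          # students with this score
        counts = Counter(ch for i in idx for ch in responses[i])
        for c in choices:
            curves[c][int(s)] = counts[c] / len(idx)
    return curves
```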
V. EXPLORING THE SUBSTRUCTURE OF THE PIQL
The PIQL was initially developed to probe student reasoning about ratios and proportions, covariation, and signed quantities/negativity. In the language of factor analysis, this would imply that the PIQL was originally intended to have a three-factor structure. Since the intended factor structure of the PIQL was well understood at the beginning of its development, confirmatory factor analysis (CFA) was used at the onset, in conjunction with exploratory factor analysis (EFA). CFA is a model-driven statistical method whose goal is to assess the adequacy of a proposed factor model for response data from the instrument being analyzed [33]. EFA is a data-driven statistical method whose goal is to uncover the underlying dependencies between observed variables [34]. For all versions of the PIQL, CFA determined that the proposed, facet-driven factor model was not an adequate representation of the PIQL's latent trait structure [35]. The target threshold for CFA is to have goodness-of-fit statistics such as the Comparative Fit Index (CFI) and Tucker-Lewis Fit Index (TLI) above 0.9 [36]. For all versions of the PIQL, the CFI and TLI were below 0.8 when using the facet-driven factor model.

Given that the CFA results did not support the proposed model, we moved on to a more in-depth investigation using EFA. The goal of using EFA was to determine whether the PIQL has any substructure, and how closely any substructure aligns with the three facets of PQL. The results from parallel analysis suggested that 3–4 meaningful factors could be extracted for the earlier versions of the PIQL (protoPIQL, v1.0, and v1.1) [37]; however, upon examination, these structures were found to be inconsistent with the originally intended factors based on the three facets of PQL [35]. During this initial development of the PIQL, EFA models of v1.0 and v1.1 each contained a factor comprising only the same two items. These two items had loadings on that factor above 0.8, compared to the next highest loading value of approximately 0.5. The items' loadings remained essentially the same when they appeared sequentially on v1.0 and when they were separated and placed on different pages of the instrument in v1.1. This suggested that the items were redundant, which led to the removal of one of them from subsequent iterations of the PIQL.

Analyses of the most recent versions of the PIQL (v2.0, v2.1, and v2.2) suggest the instrument is now unidimensional, with no strong substructure among the items. Results from EFA parallel analysis suggested that these versions of the PIQL could be adequately described by a single factor. Additional evidence to support this conclusion was obtained by performing CFA on v2.1 and v2.2 of the PIQL using a unidimensional model, with measures of goodness of fit suggesting that the unidimensional model adequately fit the student response data. Specifically, the CFI and TLI were above 0.93 for both versions, the standardized root mean square of the residuals was below 0.04, and the root mean square error of approximation was below 0.04 [36]. This suggests that removing one of the redundant items identified in v1.0 and v1.1 resulted in the collapse of the PIQL's multiple-factor structure into one that is unidimensional. This may also have been affected by replacing two of the covariation items from v1.1.
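For readers unfamiliar with parallel analysis, the sketch below shows the basic idea used to suggest the number of extractable factors: compare the eigenvalues of the observed inter-item correlation matrix with those of random data of the same shape. This is a generic Horn-style implementation under simplifying assumptions (Pearson rather than tetrachoric correlations for the dichotomous items), not the specific procedure of Ref. [37].

```python
import numpy as np

def parallel_analysis(data, n_iter=100, percentile=95, seed=0):
    """Horn-style parallel analysis on a (n_students, n_items) matrix.

    Suggests retaining factors whose observed eigenvalue exceeds the
    chosen percentile of eigenvalues obtained from random data of the
    same shape.
    """
    rng = np.random.default_rng(seed)
    n, k = data.shape
    # Eigenvalues of the observed correlation matrix, descending.
    obs_eigs = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]
    # Eigenvalues of correlation matrices of random data, same shape.
    rand_eigs = np.empty((n_iter, k))
    for i in range(n_iter):
        rand = rng.normal(size=(n, k))
        rand_eigs[i] = np.linalg.eigvalsh(np.corrcoef(rand, rowvar=False))[::-1]
    threshold = np.percentile(rand_eigs, percentile, axis=0)
    return int(np.sum(obs_eigs > threshold))  # suggested number of factors
```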
A major confounding feature of these results is that the factor loadings were determined based on dichotomously scored items. As shown in Fig. 2, up to 65% of students who choose correct responses to MCMR items may be scored as incorrect, either because they did not choose all of the correct responses or because they also chose an incorrect response. As such, the factor loadings may not accurately capture the relationships between students' responses for cases involving MCMR items.

To preserve the nuance and complexity of students' response patterns within (and between) items, we used module analysis for multiple-choice responses to examine the network of student responses to PIQL items [38]. Module analysis uses community detection algorithms to identify modules (also called communities) within networks of responses to multiple-choice items. We chose to analyze a network of only correct responses to PIQL items. The benefit of this method is that we can examine the patterns that arise from students' selections of each individual correct response, which preserves some of the complexity of MCMR items.

Earlier module analyses of v1.0 and v1.1, using various community detection algorithms on full data sets, suggested that there was some substructure in the PIQL. Again, these results did not agree with the three facets that the PIQL was intended to measure, and they also did not align well with the results of EFA [35, 39]. Recent developments in the application of module analysis within PER have enabled a deeper and more refined analysis of the module structure of the PIQL [40]. Using Modified Module Analysis (MMA) on the final two versions of the PIQL, with locally adaptive network sparsification (LANS) in place of a global cutoff sparsification, resulted in no discernible substructure among the items on the instrument [40, 41]. This corroborates the conclusions from EFA and CFA that the PIQL is not measuring multiple constructs and is thus a unidimensional instrument.
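To make the correct-response network concrete, here is a schematic sketch of the pipeline: one node per individual correct answer choice, edges weighted by inter-response correlation, sparsification, then community detection. Note the simplifications: a simple global cutoff stands in for the LANS procedure described above, a stock modularity-based community finder stands in for the algorithms used in MMA, and column labels like 'Q18D' are hypothetical.

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def correct_response_modules(picked, labels, cutoff=0.2):
    """Build a network of correct responses and detect modules.

    picked: (n_students, n_responses) 0/1 matrix with one column per
            individual correct answer choice, so an MCMR item with two
            correct answers contributes two columns.
    labels: column names, e.g. ['Q14B', 'Q17A', 'Q17C', 'Q18D', 'Q18G'].
    """
    corr = np.corrcoef(picked, rowvar=False)  # inter-response correlations
    G = nx.Graph()
    G.add_nodes_from(labels)
    # Keep only edges above the cutoff (global sparsification).
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if corr[i, j] > cutoff:
                G.add_edge(labels[i], labels[j], weight=corr[i, j])
    # Modules (communities) of correct responses.
    return list(greedy_modularity_communities(G, weight="weight"))
```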
VI. CONCLUSIONS
Our goal is to develop a valid and reliable instrument to measure PQL for students in calculus-based introductory physics courses. Results from classical test theory show that, after several revisions, the items on the PIQL have a broad range of difficulty values, and all items have acceptable levels of discrimination. The reliability of the PIQL has been established with a Cronbach's α that meets the typically accepted criteria for measuring both properties of groups and properties of individuals.

Results from exploratory and confirmatory factor analysis and modified module analysis show that the PIQL is a unidimensional instrument that measures a single construct. We interpret this construct as being Physics Quantitative Literacy. These results show that student responses to PIQL items do not separate cleanly along the lines of ratios and proportions, covariation, and signed quantities/negativity, suggesting that these three facets of PQL (which are discernible to experts) may develop simultaneously in students.

We have supplemented rigorous psychometric analyses with four-level scoring methods for MCMR items and with IRCs, which provide additional information about students' choices of both correct and incorrect responses. These analyses played a vital role in informing our decisions when revising the PIQL. Future work will include developing more sophisticated analyses that can incorporate the nuance of MCMR data into CTT-style analyses.

Additional manuscripts will detail the work we have done to qualitatively validate both individual PIQL items and the inventory as a whole using interviews with both students and experts. As a result of all our analyses, we are comfortable asserting that the PIQL is a valid and reliable instrument for use in calculus-based introductory physics courses at our primary research site. Our next steps will include establishing its validity in broader contexts by collecting data from students in calculus-based introductory courses at other institutions, as well as from students in algebra-based and conceptual physics courses, in order to increase the potential of the PIQL to catalyze meaningful curriculum development efforts.

ACKNOWLEDGMENTS
This work is supported by the National Science Foundation under awards DUE-1832836, DUE-1832880, DUE-1833050, and DGE-1762114.

[1] Patrick W. Thompson, "Quantitative reasoning and mathematical modeling," New Perspectives and Directions for Collaborative Research in Mathematics Education, 33 (2010).
[2] Bobby Ojose, "Mathematics Literacy: Are We Able To Put The Mathematics We Learn Into Everyday Use?" Journal of Mathematics Education, 89–100 (2011).
[3] Suzanne White Brahmia, Alexis Olsho, Trevor I. Smith, and Andrew Boudreaux, "Framework for the natures of negativity in introductory physics," Phys. Rev. Phys. Educ. Res., 010120 (2020).
[4] Alexis Olsho, Suzanne White Brahmia, Andrew Boudreaux, and Trevor I. Smith, "The physics inventory of quantitative reasoning: Assessing student reasoning about sign," in Proceedings of the 22nd Annual Conference on Research in Undergraduate Mathematics Education, edited by A. Weinberg, D. Moore-Russo, H. Soto, and M. Wawro (Oklahoma City, OK, 2019) pp. 992–997.
[5] Richard R. Hake, "Interactive-engagement versus traditional methods: A six-thousand-student survey of mechanics test data for introductory physics courses," American Journal of Physics, 64–74 (1998).
[6] Joshua Von Korff, Benjamin Archibeque, K. Alison Gomez, Sarah B. McKagan, Eleanor C. Sayre, Edward W. Schenk, Chase Shepherd, and Lane Sorell, "Secondary analysis of teaching methods in introductory physics: A 50k-student study," American Journal of Physics, 969–974 (2016).
[7] Adrian Madsen, Sarah B. McKagan, and Eleanor C. Sayre, "Resource Letter RBAI-1: Research-Based Assessment Instruments in Physics and Astronomy," American Journal of Physics, 245–264 (2017).
[8] Patrick W. Thompson, Marilyn P. Carlson, Cameron Byerley, and Neil Hatfield, "Schemes for Thinking with Magnitudes: A Hypothesis about Foundational Reasoning Abilities in Algebra," in Epistemic Algebraic Students: Emerging Models of Students' Algebraic Knowing, Papers from an Invitational Conference (2014) p. 1.
[9] Patrick W. Thompson and L. Saldanha, "Fractions and multiplicative reasoning," in Research Companion to Principles and Standards for School Mathematics, edited by Jeremy Kilpatrick, W. Gary Martin, and Deborah Schifter (National Council of Teachers of Mathematics, 2003) Chap. 7, pp. 95–113.
[10] Marilyn Carlson, Michael Oehrtman, and Nicole Engelke, "The precalculus concept assessment: A tool for assessing students' reasoning abilities and understandings," Cognition and Instruction, 113–145 (2010).
[11] Andrew Boudreaux, Stephen Kanim, and Suzanne Brahmia, "Student facility with ratio and proportion: Mapping the reasoning space in introductory physics," arXiv preprint arXiv:1511.08960 (2015).
[12] S. Brahmia, "Developing expert mathematization in the introductory physics course: an impedance mismatch," in Proceedings of the 2nd International Conference on Research, Implementation and Education of Mathematics and Sciences (2nd ICRIEMS) (2015).
[13] S. Brahmia, A. Boudreaux, and S. E. Kanim, "Obstacles to Mathematization in Introductory Physics," arXiv e-prints (2016).
[14] Suzanne Brahmia, Andrew Boudreaux, and Stephen E. Kanim, "Developing Mathematical Creativity with Physics Invention Tasks," arXiv e-prints, arXiv:1602.02033 (2016).
[15] Suzanne S. Brahmia and Andrew Boudreaux, "Exploring student understanding of negative quantity in introductory physics contexts," in Proceedings of the 19th Annual Conference of RUME (2016) p. 79.
[16] Suzanne Brahmia and Andrew Boudreaux, "Signed Quantities: Mathematics Based Majors Struggle to Make Meaning," in Proceedings of the 20th Annual Conference on Research in Undergraduate Mathematics Education, The Special Interest Group of the Mathematical Association of America, edited by Aaron Weinberg, Chris Rasmussen, Jeffrey Rabin, Megan Wawro, and Stacy Brown (San Diego, CA, 2017).
[17] Suzanne White Brahmia, Alexis Olsho, Trevor I. Smith, and Andrew Boudreaux, "NoNIP: Natures of Negativity in Introductory Physics," in Physics Education Research Conference 2018, PER Conference, edited by Adrienne Traxler, Ying Cao, and Steven Wolf (Washington, DC, 2018).
[18] Suzanne White Brahmia, Alexis Olsho, Trevor I. Smith, and Andrew Boudreaux, "A framework for the natures of negativity in introductory physics," in Proceedings of the 22nd Annual Conference on Research in Undergraduate Mathematics Education, edited by A. Weinberg, D. Moore-Russo, H. Soto, and M. Wawro (Oklahoma City, OK, 2019) pp. 68–75.
[19] Trevor I. Smith, Suzanne W. Brahmia, Alexis Olsho, Andrew Boudreaux, Philip Eaton, Paul J. Kelly, Kyle J. Louis, Mitchell A. Nussenbaum, and Louis J. Remy, "Developing a reasoning inventory for measuring physics quantitative literacy," arXiv preprint arXiv:1901.03351 (2019).
[20] Trevor I. Smith, Suzanne White Brahmia, Alexis Olsho, and Andrew Boudreaux, "Developing a reasoning inventory for measuring physics quantitative literacy," in Proceedings of the 22nd Annual Conference on Research in Undergraduate Mathematics Education, edited by A. Weinberg, D. Moore-Russo, H. Soto, and M. Wawro (Oklahoma City, OK, 2019) pp. 1181–1182.
[21] William Wiersma and Stephen G. Jurs, Educational Measurement and Testing, 2nd ed. (Allyn & Bacon, 1990).
[22] Rodney L. Doran, Basic Measurement and Evaluation of Science Instruction (National Science Teachers Association, 1980).
[23] Kevin C. Moore, Teo Paoletti, and Stacy Musgrave, "Covariational reasoning and invariance among coordinate systems," Journal of Mathematical Behavior, 461–473 (2013).
[24] Natalie L. F. Hobson and Kevin C. Moore, "Exploring Experts' Covariational Reasoning," in (Moore & Thompson, 2017) pp. 664–672.
[25] Teo Paoletti and Kevin C. Moore, "The parametric nature of two students' covariational reasoning," Journal of Mathematical Behavior, 137–151 (2017).
[26] Stephen Kanim and Ximena C. Cid, "The demographics of physics education research," arXiv preprint arXiv:1710.02598 (2017).
[27] Gilles Fauconnier and Mark Turner, The Way We Think: Conceptual Blending and the Mind's Hidden Complexities (Basic Books, New York, 2002).
[28] Suzanne White Brahmia, Alexis Olsho, Andrew Boudreaux, Trevor I. Smith, and Charlotte Zimmerman, "A Conceptual Blend Analysis of Physics Quantitative Literacy Reasoning Inventory Items," in Proceedings of the 23rd Annual Conference on Research in Undergraduate Mathematics Education (accepted for publication) (2020).
[29] Gary A. Morris, Lee Branum-Martin, Nathan Harshman, Stephen D. Baker, Eric Mazur, Suvendra Dutta, Taha Mzoughi, and Veronica McCauley, "Testing the test: Item response curves and test quality," American Journal of Physics, 449–453 (2006).
[30] Gary A. Morris, Nathan Harshman, Lee Branum-Martin, Eric Mazur, Taha Mzoughi, and Stephen D. Baker, "An item response curves analysis of the Force Concept Inventory," American Journal of Physics, 825–831 (2012).
[31] Paul J. Walter and Gary Morris, "Assessing Student Learning and Improving Instruction with Transition Matrices," in Physics Education Research Conference 2016, PER Conference, edited by D. L. Jones, L. Ding, and A. Traxler (Sacramento, CA, 2016) pp. 376–379.
[32] Michi Ishimoto, Glen Davenport, and Michael C. Wittmann, "Use of item response curves of the Force and Motion Conceptual Evaluation to compare Japanese and American students' views on force and motion," Phys. Rev. Phys. Educ. Res., 020135 (2017).
[33] Timothy A. Brown, Confirmatory Factor Analysis for Applied Research, 2nd ed. (The Guilford Press, 2015) pp. 72–75.
[34] D. N. Lawley and A. E. Maxwell, Factor Analysis as a Statistical Method (Butterworths, London, 1963).
[35] Trevor I. Smith, Philip Eaton, Suzanne White Brahmia, Alexis Olsho, Andrew Boudreaux, Chris DePalma, Victor LaSasso, Scott Straguzzi, and Christopher Whitener, "Using psychometric tools as a window into students' quantitative reasoning in introductory physics," in Physics Education Research Conference 2019, PER Conference, edited by Ying Cao, Steven Wolf, and Michael Bennett (Provo, UT, 2019).
[36] Philip Eaton and Shannon D. Willoughby, "Confirmatory factor analysis applied to the Force Concept Inventory," Phys. Rev. Phys. Educ. Res., 010124 (2018).
[37] Li-Jen Weng and Chung-Ping Cheng, "Parallel Analysis with Unidimensional Binary Data," Educational and Psychological Measurement, 697–716 (2005).
[38] Eric Brewe, Jesper Bruun, and Ian G. Bearden, "Using module analysis for multiple choice responses: A new method applied to Force Concept Inventory data," Phys. Rev. Phys. Educ. Res., 020131 (2016).
[39] Trevor I. Smith, Suzanne White Brahmia, Alexis Olsho, and Andrew Boudreaux, "Physics Students' Implicit Connections Between Mathematical Ideas," in Proceedings of the 23rd Annual Conference on Research in Undergraduate Mathematics Education (accepted for publication) (2020).
[40] James Wells, Rachel Henderson, John Stewart, Gay Stewart, Jie Yang, and Adrienne Traxler, "Exploring the Structure of Misconceptions in the Force Concept Inventory with Modified Module Analysis," arXiv e-prints, arXiv:1905.06176 (2019).
[41] Nicholas J. Foti, James M. Hughes, and Daniel N. Rockmore, "Nonparametric Sparsification of Complex Multiscale Networks," PLoS ONE 6 (2011).