The Physics Inventory of Quantitative Literacy: A tool for assessing mathematical reasoning in introductory physics
Suzanne White Brahmia, Alexis Olsho, Trevor I. Smith, Andrew Boudreaux, Philip Eaton, Charlotte Zimmerman
Department of Physics, University of Washington, Box 351560, Seattle, WA 98195-1560, USA
Department of Physics & Astronomy and Department of STEAM Education, Rowan University, 201 Mullica Hill Rd., Glassboro, NJ 08028, USA
Department of Physics & Astronomy, Western Washington University, 516 High St., Bellingham, WA 98225, USA
School of Natural Sciences and Mathematics, Stockton University, Galloway, NJ 08205, USA
One desired outcome of introductory physics instruction is that students will be able to reason mathematically about physical phenomena. Little research has been done regarding how students develop the knowledge and skills needed to reason productively about physics quantities, which is different from either conceptual understanding or problem-solving abilities. We introduce the Physics Inventory of Quantitative Literacy (PIQL) as a tool for measuring quantitative literacy (i.e., mathematical reasoning) in the context of introductory physics. We present the development of the PIQL and results showing its validity for use in calculus-based introductory physics courses. As has been the case with many inventories in physics education, we expect large-scale use of the PIQL to catalyze the development of instructional materials and strategies—in this case, designed to meet the course objective that all students become quantitatively literate in introductory physics. Unlike concept inventories, the PIQL is a reasoning inventory, and can be used to assess reasoning over the span of students' instruction in introductory physics.
I. INTRODUCTION
Introductory physics is characterized by using simple mathematics in sophisticated ways, where experts translate fluidly between different representations of phenomena. To an expert, a physics equation "tells the story" of an interaction or process. For example, when reading the equation

(4 kg) a = (4 kg)(9.8 m/s²) − (3 N·s/m)(5 m/s),

an expert may quickly construct a mental story about a 4-kg object in motion that is experiencing both the gravitational force near the surface of the earth and air resistance. The coordinate system is set by the (positive) direction of the gravitational force. Since air resistance opposes motion and is in the negative direction, the object must be moving in the positive direction, or downward in this case. There is an implied positive sign between the force terms in the equation that describes an operation rather than a direction, summing the forces together to generate a net force expression. Inferring downward motion makes sense, since the mathematical effect of air resistance in this equation is to reduce the net force from what it would be in free fall. The ball hasn't reached terminal velocity since the two force terms have differing magnitudes, so the observable physical effect is that the ball accelerates in a direction parallel to its motion—it's moving downward at 5 m/s and speeding up at a rate less than 9.8 m/s².
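The structure an expert "reads" in this equation can also be written in general form. The LaTeX sketch below is our own generic restatement of the reasoning above (with mass m, gravitational acceleration g, drag coefficient b, and speed v our notation, and downward taken as positive); it is not an excerpt from the PIQL:

% Newton's second law for an object falling with linear air resistance,
% downward taken as positive; m = mass, g = gravitational acceleration,
% b = drag coefficient, v = speed.
\begin{align}
  ma &= mg - bv
  \quad\Longrightarrow\quad
  a = g - \frac{b}{m}\,v \;<\; g \quad (\text{for } v > 0),\\
  v_{\mathrm{term}} &= \frac{mg}{b}
  \qquad \text{(terminal speed, where } a = 0\text{)}.
\end{align}

Reading the equation this way makes the two physical claims in the story explicit: the object speeds up at a rate less than g, and the speeding up stops only when the drag term grows to match the weight.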
Quantitative Literacy, the interconnected skills, attitudes, and habits of mind that together support the sophisticated use of elementary mathematics to describe and understand the world [1, 2], is heavily relied upon in college physics courses. Given the ubiquitous and nuanced mathematical nature of introductory physics, Physics Quantitative Literacy (PQL), i.e., quantitative literacy in the context of physics, has the potential to be an important learning outcome for all students taking introductory physics. PQL is characterized by the blending of conceptual and procedural mathematics to generate and apply models relating physics quantities to each other, which is a transferable skill valued in all STEM majors.

While PQL is a desired outcome of physics instruction, valid measures of reasoning about quantities and their relationships in physics contexts are absent from research-based assessment instruments in introductory physics. We have developed and validated the Physics Inventory of Quantitative Literacy (PIQL) at a large research university in the Pacific Northwest to address this need [3]. The PIQL is a reasoning inventory that probes the quantification typically used in introductory physics. The PIQL has a potential impact analogous to the early concept inventories in physics education research (e.g., Force Concept Inventory [4], Force and Motion Conceptual Evaluation [5]). While concept inventories don't improve instruction directly, they have raised awareness that broad instructional goals are not being met and have thus driven curriculum development that focuses on conceptual understanding on a large scale [6–8].

In the next section, we present the theoretical underpinnings of the PIQL. Section III describes the methods used to develop individual items and collect them into a full instrument. The results of both qualitative and quantitative analyses to validate the PIQL for use in a calculus-based introductory physics sequence are presented in Section IV. We conclude with a discussion of plans to validate the PIQL for use in all introductory physics courses and across multiple instructional settings. As has been the case with many concept inventories, we expect large-scale use of the PIQL to catalyze the development of instructional materials and strategies to meet the course objective that all students become quantitatively literate in introductory physics.

II. THEORETICAL FOUNDATIONS
PQL involves both procedural and conceptual mastery of the mathematics involved. Mathematics education researchers Gray and Tall make this distinction, explaining that "the symbol stands for both the process of division and the concept of fraction." They define proceptual understanding, in which procedural mastery and conceptual understanding coexist, as an appropriate goal for instruction [9]. A student with a proceptual understanding of fractions, for example, would move fluidly between the procedure of dividing 3 by 4, and the physical instantiation of the fraction as a precise quantification of portion.

Similarly, a physics student with a proceptual understanding of torque would move fluidly between the procedure, finding the vector cross product of a position relative to an origin and a force, and conceptualizing the vector product r⃗ × F⃗ as a quantity unto itself (i.e., as simply τ⃗), with its own, important emergent properties and consequences.

A. Knowledge Space and Test Construct
We frame the PIQL as a probe of the proceptual algebraic reasoning that is a hallmark of mastery in introductory physics. The PIQL is designed to span a knowledge space [10] based on elements of mathematical reasoning that are ubiquitous in introductory physics. Unlike when solving math problems commonly encountered in a math course, physics students are reasoning about abstract and unfamiliar physical quantities. In addition, as characterized by the example in the introduction to this paper, much of the mathematics holds physical significance beyond its mathematical meaning. Many of the PIQL items involve scenarios common to physics problems, and have been constructed such that students can't separate "doing math" from "doing physics."

Conceptual blending theory (CBT) [11] provides a framework for understanding the integration of mathematical and physical reasoning involved in PQL. In their theory, Fauconnier and Turner describe a cognitive process in which a unique mental space is formed from two (or more) separate mental spaces. The blended space can be thought of as a product of the input spaces, rather than a separable sum. According to CBT, development of expert mathematization in physics would occur not through a simple addition of new elements (physics quantities) to an existing cognitive structure (arithmetic or algebra), but rather through the creation of new and independent cognitive spaces. These spaces, in which creative, quantitative analysis of physical phenomena can occur, involve a continuous interdependence of thinking about the mathematical and physical worlds.

An example MCMR item from the PIQL is shown in Fig. 1. Complete understanding is reflected in selecting both d and g. In one study using an open-ended version of this question, White Brahmia observed that students were much more accurate when they described the system energy (equivalent to g above) if they also mentioned a statement similar to answer choice d [12].
A proceptually correct answer involves reasoning both about the orientation of the vector quantities and about the physical ramifications of these particular vectors opposing each other. We hypothesize that it is in the blended space that the reasoning becomes more accurate and expert-like.

A hand exerts a constant, horizontal force on a block as the block moves along a frictionless, horizontal surface. No other objects do work on the block. For a particular interval of the motion, the hand does W = − J of work on the block. Recall that for a constant force, W = F⃗ · Δs⃗. Consider the following statements about this situation. Select the statement(s) that must be true. Choose all that apply.
a. The work done by the hand is in the negative direction.
b. The force exerted by the hand is in the negative direction.
c. The displacement of the block is in the negative direction.
d. The force exerted by the hand is in the direction opposite to the block's displacement.
e. The force exerted by the hand is in the direction parallel to the block's displacement.
f. Energy was added to the block system.
g. Energy was taken away from the block system.
FIG. 1. PIQL MCMR item that exemplifies a proceptual understanding of the math and physics blend. The correct responses are d and g.

As we discuss more fully below, the framework for inventory development described by Adams and Wieman [13] guided the development of the PIQL. In this framework, phase 1 of development involves delineating what the inventory is intended to measure, i.e., establishing the test construct [13]. We developed the test construct for the PIQL based on Sherin's (2001) theory of symbolic forms, which in turn was developed to explain how successful introductory physics students understand and construct equations [14]. Sherin's symbolic forms provide a framework for characterizing the reasoning targeted in the PIQL.

. . . successful (physics) students learn to understand what equations say in a fundamental sense; they have a feel for expressions, and this understanding guides their work. . . We do students a disservice by treating (physics) conceptual understanding as separate from the use of mathematical notations [14].

The symbolic forms framework hypothesizes that successful students can develop expert-like conceptual schema with which they associate certain symbol patterns in equations.
FIG. 2. Examples of symbolic forms represented in the PIQL: (a) Creating Quantity: Measurement [15] and Quantity [17]; and (b) Covariation: Ratio, Dependence, Scaling, and Opposition [14].
Sherin (2001) developed a list of these symbolic forms, noting that it is not comprehensive, with the intention that subsequent research could help build up a library. Figure 2 shows examples of symbolic forms that are represented in the PIQL, both from Sherin's original work and from more recent work by other researchers [15, 16]. Note that the expert reasoning described in the falling-ball example in the introduction to this paper and in the PIQL item in Fig. 1 relies on both the quantity and the opposition symbolic forms.

Most introductory physics students are less sophisticated in their use of algebraic structures [18], and come from less privileged backgrounds [19], than the students in Sherin's study, who were in the last semester of calculus-based physics at an elite institution. We consider symbolic forms to be a learning objective of an introductory physics course, rather than characteristic of how typical physics students think. The PIQL is designed as a probe to help researchers and instructors meet this objective.
B. Facets of PQL
Introductory physics courses present many new and abstract quantities, most of which are ratios, products, sums, or differences of other quantities. Quantities have associated units, and many can be positive or negative, where the sign carries physical meaning (e.g., negative work, positive charge). Quantities can also be vectors or scalars—which have different algebraic rules. Beyond the additional meanings that are specific to the physical context, each of these aspects of quantity has mathematical reasoning associated with it that is rich, nuanced, and challenging, as evidenced by research on student difficulties in mathematics education [1, 20, 21].

Although the mathematics involved in introductory physics quantification is algebra or arithmetic, a conceptual understanding of this mathematics is fundamental to reasoning in the context of strange new quantities. In introductory physics, PQL involves using simple mathematics in sophisticated ways. Proportional reasoning, reasoning with signed quantity, and covariational reasoning are at the heart of quantification in introductory physics [1, 14, 21, 22]. The PIQL was designed using these facets of quantification as a foundation.

The use of ratios and proportions to describe systems and characterize phenomena is a hallmark of expertise in STEM fields, perhaps especially in physics. Boudreaux, Kanim, and White Brahmia developed a more fine-grained set of proportional reasoning subskills, based on their analysis of college students' specific difficulties on proportional reasoning assessment items. The items are categorized into six subskills, which overlap with the "underpinnings" of success in introductory physics identified in the early work of Arnold Arons [23].

Negative pure numbers represent a more cognitively difficult mathematical object than positive pure numbers do for pre-college mathematics students [24]. Mathematics education researchers have isolated a variety of "natures of negativity" fundamental to algebraic reasoning in the context of high school algebra—the many meanings of the negative sign that must be distinguished and understood for students to develop understanding [25–27]. These various meanings of the negative sign form the foundation for scientific quantification, where the mathematical properties of negative numbers are well suited to represent natural processes and quantities. Physics education researchers report that a majority of students enrolled in a calculus-based physics course struggled to make meaning of positive and negative quantities in spite of completing Calculus I and more advanced courses in mathematics [12, 28]. Developing "flexibility" with negative numbers, i.e., the recognition and correct interpretation of the multiple meanings of the negative sign, is a known challenge in mathematics education. There is mounting evidence that reasoning about negative quantity poses a significant hurdle for physics students at the introductory level and beyond [16, 29–33].

Covariational reasoning, i.e., formal reasoning about how one variable changes due to small changes in another, related quantity, has been shown by mathematics education researchers to be strongly associated with student success in calculus [34–36]. However, physics education research is only beginning to explore how covariational reasoning is used in introductory physics contexts.
Preliminary work suggests that covariational reasoning in physics graduate students ("experts" in introductory physics contexts) differs in some ways from that in mathematics graduate students [37].
III. DEVELOPMENT OF THE PIQL
Our work was guided in a general way by the four-phase framework for developing assessment instruments proposed by Adams and Wieman [13]:
• Phase 1: Delineation of the purpose of the test and the scope of the construct or the extent of the domain to be measured;
• Phase 2: Development and evaluation of the test specifications;
• Phase 3: Development, field testing, evaluation, and selection of the items and scoring guides and procedures; and
• Phase 4: Assembly and evaluation of the test for operational use.
We refined and adapted this framework to arrive at a set of specific, iterative steps that characterized our development of the PIQL.

Development of the PIQL rested on two foundations: 1) existing literature in discipline-based education research in mathematics and physics, and 2) our own research on student reasoning about ratio and about signs and signed quantities. The latter was conducted over an approximately 5-year period immediately preceding our explicit development of the PIQL as a standardized assessment instrument. As we drew on these intellectual foundations, we followed an iterative cycle that led to the current version of the PIQL. This cycle consisted of: assembling a working inventory of multiple-choice items; gathering data from introductory physics students; analyzing the results; identifying areas where the inventory could be improved (in terms of broader content coverage, items that were either too easy or too hard, redundant items, etc.); developing new items; adding and removing items from the working inventory; and repeating data collection and analysis. The cycle is represented in Fig. 3. Below, in part A, we describe the choices we made when assembling versions of the PIQL. In part B, we discuss our process for developing individual items. Finally, in part C, we describe the collection of data used to guide revisions of and to validate the PIQL.

Many of the PIQL items have been pilot-tested over the past decade as interview prompts, free-response written questions, and/or multiple-choice questions [12, 38–40].
A. Inventory Development
As described above in Sec. II B, the PIQL is intended to probe proceptual algebraic reasoning in contexts relevant to introductory physics, i.e., physics quantitative literacy. By collaboratively reflecting on our combined experience as instructors, as well as reviewing work in both mathematics and physics education research, we identified three sub-domains as foundational to PQL: proportional reasoning, reasoning about signs and signed quantities, and covariational reasoning. Consistent with other inventories in physics, we expected the PIQL to have 20–30 items, and to take 30–40 minutes for students to complete. Consistent with best practices [41], the PIQL was originally designed to be administered on paper and in-person (proctored). Administration protocols were
modified, however, in response to the COVID-19 pandemic; details are provided below in Sec. III C.

FIG. 3. Workflow for developing and validating items, and revising the PIQL. (Starting with a baseline item library: assemble inventory; establish face validity through expert interviews; collect data; statistical analyses (CTT, CFA, EFA, MMA, IRCs); modify/develop items; pilot items in interviews. Exploratory analyses: model growth of PQL across courses, characterize students' knowledge spaces, pilot multi-level scoring for MCMR items.)

Because of our interest in using the PIQL to track the development of student reasoning over time, we designed some items as multiple-choice/multiple-response (MCMR). These items may have more than one correct response, and prompt students to select all answer choices that apply. Such items allow us to probe multiple facets of student reasoning about a given context.

The prototype version of the PIQL consisted of 18 items that focused on ratios and proportions [38, 40, 42] and signs and signed quantities [40, 43, 44]. This version (the "protoPIQL") did include two items on covariational reasoning taken with permission from the Precalculus Concept Assessment (PCA) [45]. Iterative revisions were made over several years to improve the validity and reliability of the PIQL, reduce redundancies, and ensure that the three foundational sub-domains of PQL were all represented. Later versions of the PIQL include 20 or 21 items. Due to these iterative revisions, the items on the PIQL in each of the six data sets are slightly different; we label the data sets by their version of the PIQL: protoPIQL, v1.0, v1.1, v2.0, v2.1, and v2.2.

As noted in Sec. II B, earlier work of White Brahmia, Boudreaux, and Kanim resulted in the creation of an organizational framework for specific modes of reasoning about ratio and proportion used in introductory physics contexts, and in the development of a working inventory of items for assessing proportional reasoning [39]. Several of these items are included in the current version of the PIQL.

Initially, items for assessing reasoning about signs and signed quantities were developed based on work done by mathematics education researchers to categorize different meanings of the negative sign [20]. Gradual recognition of differences in the meanings of signs in purely mathematical contexts and in physics contexts led to the development of a physics-specific framework for the uses of signs: the "natures of negativity in introductory physics" [44]. This framework has informed the development of PIQL items that will allow us to track the development of expert-like reasoning about the negative sign.

In a similar fashion, the study of expert covariational reasoning and the development of a framework for covariational reasoning in physics informed the design of PIQL items for assessing covariational reasoning. These new items allow us to measure the behaviors identified in physics expert reasoning that were not present in mathematics expert reasoning.

A goal during inventory development was to ensure that each of these three foundational sub-domains of quantitative reasoning was well represented in the PIQL. As we discuss below in Sec. IV, factor analyses did not reveal an instrument structure well-aligned with these three facets, despite such structure being easy for physics experts to identify. We therefore relied on the reasoning frameworks described above, as well as expert validation interviews, to ensure that all facets of reasoning were represented on the PIQL.

As we approached a steady-state version of the PIQL, with items qualitatively and quantitatively validated, we performed a series of expert interviews. These expert interviews served as a final validation check of the individual items, and of the PIQL as a whole.
Experts agreed that, overall, the PIQL represents reasoning that they expect their students to develop during introductory physics courses and that is important in physics generally. We removed one item experts felt did not represent reasoning central to introductory physics. Experts agreed that administering the PIQL to their students would give important information about the students' quantitative reasoning, and how that reasoning changed over instruction.
B. Item Development
Items on the PIQL were generated in two ways. Approximately two-thirds of the items on the steady-state PIQL came from an existing library of questions. This baseline item library consisted mostly of items about proportional reasoning and reasoning about sign and signed quantities. These items were developed in the course of our own work investigating student quantitative reasoning. Other items in the baseline item library were adapted from questions developed by mathematics education researchers [45] to assess covariational reasoning. The remaining items on the steady-state PIQL were developed specifically for inclusion on the PIQL. In both cases, items focus on reasoning rather than computational skill; most items require neither a calculator nor significant mental computation. Here we describe how items were generated and revised.

The items that emerged from our prior research on student difficulties with quantitative reasoning were developed largely before we had explicitly articulated the goal of developing the PIQL as an assessment instrument. This prior research involved the use of free-response questions designed to elicit student reasoning. Such free-response questions were administered in introductory physics courses at multiple institutions on course exams and on ungraded course pretests to more than 1000 students over a 3-year period. Although ungraded, the pretests occurred under exam conditions and students seemed to take them seriously. During a subsequent 2-year period, we adapted selected free-response questions to a multiple-choice format. The multiple-choice versions were also administered in introductory physics courses, on ungraded diagnostic tests given near the start and end of the term to more than 2000 students. As with the course pretests, these diagnostics were given under exam conditions. A multiple-choice diagnostic typically contained a suite of between 8 and 16 proportional reasoning items, some of which we now consider to be covariational reasoning items. Within a suite, proportional reasoning subskills identified by researchers White Brahmia, Boudreaux, and Kanim were generally assessed by multiple items, which varied in both contextual abstraction and numerical complexity. Some items involved everyday contexts, presumably familiar to most students (e.g., a sports drink mixed from a powder and some water), while others involved more "sciencey" contexts, perhaps less familiar or even intimidating (e.g., the mass and volume of a high-tech material called "traxolene"). In addition, some questions involved small whole numbers (e.g., 4 tablespoons of powder and 7 cups of water), some involved decimal numbers (e.g., a sample of traxolene of mass 7.6 g and volume 2.1 ml), and still others involved quantities represented symbolically as general variables (e.g., a sample of traxolene of mass M grams and volume 6.2 cm³). By varying questions in this manner, it became evident that quantity type (whole number vs decimal number vs general variable) could interfere with productive reasoning for some students [40]. This finding influenced our eventual item choices for the PIQL.

Our prior research also made use of individual student interviews to probe reasoning in more detail and to improve the clarity of the assessment questions and their effectiveness in eliciting student thinking.
The interviews were conducted at two institutions in the Pacific Northwest, with student volunteers from calculus-based introductory physics courses, general education physics courses, and an introductory physics course designed especially for preservice elementary teachers. Over 20 such interviews were conducted. Each interview lasted about one hour, and was either audio- or video-recorded for later transcription and analysis. A semi-structured protocol was used: the interviewer posed specific proportional reasoning questions and asked the interview subject to "think out loud." In the first phase of interviews, the interviewer clarified the questions as needed, prompted the subject to explain his or her thinking after sustained periods of silence, and asked the subject to elaborate on statements that were brief or unclear. The interviewer did not, however, offer hints or guiding questions. At the close of some of the interviews, the subject was asked to reflect on how difficult he or she found the items to be.

In contrast to items developed early on, as part of our prior research, some of the items on the steady-state PIQL were developed later, as we assembled the instrument, to ensure that all three foundational domains of PQL would be well represented. Most of these later items involve reasoning about signed quantities and about the covariation of quantities. As an example of this process of novel item development, we describe the development of the "Charged Spheres Question," an item that assesses student ability to interpret the negative sign in the context of electric charge.

Electric charge involves an idiosyncratic use of sign with quantity in physics—in this context, the sign is an indication of type [46] rather than an indication that the quantity is mathematically negative or "less than zero." An item developed as part of work investigating student understanding of negativity [40] in the context of the transfer of charge from one object to another was administered on the protoPIQL. When expert validation interviews with this item found flaws that could not be eliminated by item modification, we began to develop a new item involving the negative sign in the context of electric charge.

An early version of the new item is shown in Fig. 4. This item was developed to be less ambiguous and more aligned with expert reasoning about sign in the context of electric charge. It was also designed as an MCMR item, with two correct responses, to probe student reasoning about a physical interpretation of the context as well as the mathematical representation of the quantity electric charge. However, this item proved to be very difficult for students and experts alike. While student interviews indicated that students were choosing answer c for the reasons we intended, expert interviews indicated that there was not consensus about the interpretation of the negative sign in this context.

To address this issue, we reworked the item to focus on the meaning of the negative sign, and changed the wording to reduce ambiguity. Expert interviews indicate that the current version of the item, shown below in Fig. 5, is well-aligned with expert understanding of the meaning of the negative sign in the context of electric charge.
With these changes, the Classical Test Theory (CTT) statistics for the item fall within the desired ranges (see Sec. IV for more information on the quantitative validation of assessment items).

A student has two electrically neutral aluminum spheres, A and B. Initially, sphere A has exactly the same number of protons and electrons as sphere B. The student performs an experiment that causes charge to move from one of the spheres to the other. After the experiment, the charge on sphere A is q_A = −5 microcoulombs, and the charge on sphere B is q_B = +5 microcoulombs. Which of the following statements best describe the charges on the spheres after the student performed the experiment? Select the statement(s) that must be true. Choose all that apply.
a. The charge on sphere A is greater than the charge on sphere B.
b. The charge on sphere A is less than the charge on sphere B.
c. The charge on sphere A is neither greater than nor less than the charge on B.
d. The total number of protons and electrons in sphere A is greater than the total number of protons and electrons in sphere B.
e. The total number of protons and electrons in sphere A is less than the total number of protons and electrons in sphere B.
f. The total number of protons and electrons in sphere A is equal to the total number of protons and electrons in sphere B.
FIG. 4. An early MCMR version of the Charged Spheres item, intended to assess student reasoning about the negative sign in the context of electric charge. Answers c and d are correct.

Most items chosen for the steady-state PIQL—both those that grew out of prior work and those developed specifically for the PIQL—have undergone changes over the course of administration. Modifications of the items took place through an iterative cycle of research, item administration, and item validation. Most modifications were rooted in our deepening understanding of aspects of quantification and quantitative modeling in introductory physics. Generally speaking, modifications to question stems were relatively minor; however, we did make significant changes to many answer choices to encompass not only observed patterns in student reasoning but also multiple natures of expert reasoning. For example, on questions probing student reasoning about the negative sign in a given context, we used results from earlier investigations of student reasoning about negativity to develop item distractors. The wording of these distractors was often modified to be consistent with the meanings of the negative sign described by our framework of the natures of negativity of introductory physics [44]. These distractors were then validated via student and expert interviews to ensure that the correct responses were consistent with expert reasoning and that the distractors were incorrect but consistent not only with common student reasoning but also with expert reasoning that might be correct in other contexts. Student validation interviews also led to improvements in the clarity and purpose of items. More information on the validation process and procedures can be found in Sec. IV.
C. Administration of the PIQL during development
In this section, we describe the administration of the PIQL, both on paper (in-person) and online. We discuss the circumstances under which the online version of the PIQL was administered, and our attempts to adhere to best practices in a limited timeframe.
A student has two electrically neutral aluminum spheres, A and B. The student performs an experiment that causes charge to move from one of the spheres to the other. After the experiment, the charge of sphere A is q_A = −5 microcoulombs, and the charge of sphere B is q_B = +5 microcoulombs. What are the meanings of the negative and positive signs in this context?
I. The signs imply that the charge on sphere A is less than that on sphere B.
II. The signs imply that the unbalanced charges on the two spheres are of opposite types.
III. The signs imply that charge was removed from sphere A and added to sphere B.
a. I
b. II
c. III
d. I and II
e. I and III
f. II and III
g. I, II, and III
FIG. 5. Expert- and student-validated version of the Charged Spheres item on the steady-state PIQL, intended to assess student reasoning about the negative sign in the context of electric charge. Answer b is correct.

During the initial development of the PIQL, we administered it to all students enrolled in the 3-quarter, large-enrollment, calculus-based introductory physics sequence at a large public university in the Pacific Northwest. We ran versions of the PIQL over eight academic quarters. It was administered at the beginning of the terms, before significant instruction, thus serving as a "pretest" for each course of the introductory sequence.

Development of a valid and reliable instrument requires regular access to a large number of students, as well as a significant amount of class time that might otherwise be used for physics instruction. For most quarters, we were able to administer the PIQL to students during recitation sessions. These sessions are used for required small-group activities, so we were able to achieve a high participation rate. This also allowed us to proctor the assessment, consistent with best practices [8].

Proctoring the instrument administration was resource-intensive. The assessment was administered in over 50 recitations, each 50 minutes long, during the first week of instruction. Because of the timing (during the first week of the quarter, when TA assignments were still in flux), preparing physics department TAs to proctor the assessment presented a significant challenge.

For most in-person administrations of the assessment, students read items from a 5-page stapled packet and recorded their responses on a paper answer form as well as electronically. Our instrument includes several "multiple-choice/multiple-response" (MCMR) items that ask students to select all answer choices they feel are appropriate. These items could not be handled by the University's multiple-choice scoring machines. Therefore, quarterly preparation for administration of the instrument involved not only printing the items and answer forms but also creating online surveys into which students could input their responses. Because of ongoing changes to the assessment during the development period, the stapled packets and the online surveys could not be reused and had to be generated anew each quarter. Students were asked to enter their responses online using their laptop, smartphone, or tablet if possible. Students who did not have or bring such a device with them to the class session—and so were unable to enter their answers online—were asked to indicate this on their paper answer form.
After the instrument administration was finished for the quarter, a member of the research team entered those responses manually. Responses for roughly 25–50 students were added manually each quarter.

Although we believe the methods described above resulted in careful, sustained effort from a large number of students, they required a significant investment of time and resources. Moreover, some students misunderstood the instructions, leading research team members to spend additional time making sure the data set was complete and that students were receiving credit for their work. We began to consider online administration methods as an alternative, even exploring purchasing electronic tablets through a University-based grant. In this scheme, there would be no paper version of the instrument—students would access the survey on the tablets during class, proctored by TAs or members of the research team. While this method of administration would still require significant time and effort by research team members, we believed it would be more straightforward for students than the previous procedure of entering responses online after completing the assessment on paper.

Though our focus was on in-person, proctored administration of the assessment, we began to consider whether online, unproctored administration would better support broader validation and widespread dissemination. While existing research suggests little or no significant difference in student performance between proctored and unproctored administrations of some research-based assessments (RBAs) [47–49], researchers recommend that online, unproctored administration be validated separately [47]. We wanted to determine whether our instrument could be administered online and unproctored by instructors who were reluctant or unable to allocate class time for administration. Moreover, though we generally have access to students during the first week of classes during scheduled recitation sessions, scheduling was difficult during academic quarters in which instruction started mid-week, leading to confusion and decreased participation rates.

The COVID-19 pandemic of early 2020 forced the issue. With the University moving to all online instruction over a very short time interval, in-person administration of the assessment became impossible. Although online "proctoring" services exist [50], the proctoring requirements do not align well with University policies regarding computer camera use during virtual instruction, and do not take into account possible limitations on students during such an uncertain and difficult period.

We ran the PIQL unproctored and entirely online using the University's existing survey/quiz platform. To mitigate student stress during the rapid shift to online learning, the University suggested that no graded work be required during the first week of instruction. Because we do not grade students' responses to the PIQL for correctness, we decided to run the PIQL, as usual, during the first week of the term in each of the three courses of the calculus-based introductory physics sequence.
Because we were aware of the tendency of some students to place undue importance on such assessments, however, we presented the PIQL as a low-stakes survey.

We adhered to best practices [48, 51] as much as possible: the PIQL had a 50-minute time limit [52], equal to the usual class length in which the instrument was administered (we note that this is longer than it should take for students to complete the instrument); multiple reminder emails were sent to students to increase participation rate; and course credit was offered for participation, but students' responses were not graded for "correctness." In addition, we constructed the online version of the instrument to discourage copying or saving of test items: each item was shown in a browser window on its own; students were not able to backtrack in the PIQL [53] and were not shown a summary of their work or given the correct answers after completion. A video (less than 3 minutes long) embedded at the beginning of the online PIQL explained the purpose of the assessment and reiterated that the PIQL was associated with course credit to be awarded on the basis of participation rather than the number of questions answered correctly. This is in line with best practices to discourage students from searching for answers to the items on the internet, while still motivating students to give their best efforts on the assessment [47].

Many online testing platforms will (automatically or by request) randomly order each test item's responses. We note that this does not adhere to best practices—validation of individual items only holds for the versions used during the validation process [54]. Randomizing answer choices was therefore not used to decrease cheating. Especially at the beginning of the academic quarter and with a majority of students geographically separated due to the pandemic, we believed that students were unlikely to attempt to collaborate with each other when completing the assessment.

Because we recognized that a majority of students completing the survey for the first time would have little-to-no experience with MCMR items, we made some changes to the instrument to increase the likelihood that students would recognize that they could select multiple responses for those items. All of the MCMR items were moved to the end of the survey. After answering the last multiple-choice/single-response (MCSR) item, students saw a page with no instrument item, but rather a statement that the remaining questions on the survey might have more than one correct response, and that students should choose all answers that they feel are correct. At the top of the page for each of the remaining items (all MCMR), students saw a reminder that the question might have more than one correct response. We also prompted students to "choose all that apply" in the question stem.

Overall participation rates were similar for in-person and online administration. For in-person administration, the overall participation rate was 91% (93%–92%–89% for C(I)–C(II)–C(III) students); for online administration, the overall participation rate was 90% (93%–89%–89% for C(I)–C(II)–C(III) students). For the online administration, we counted any attempt at completing the survey as participation.
(This included a small number of students who opened the survey but did not answer any of the items.)

We attribute the high participation rates on the online version to the multiple reminder emails and course web page announcements about the assessment, as well as the assignment of course credit for participating in the assessment. In addition, as in previous quarters, the assessment was associated with the weekly small-group-work recitation sessions; students were told that the survey constituted the week's work associated with the recitation session. Finally, administration of the survey during the first week, before other graded work was due, may have boosted participation, as students were not yet overly burdened with assignments.

Additionally, administering the assessment online allowed us to track the amount of time individual students took to complete it, which we were unable to do during previous in-person administrations. Although we cannot formally compare the time taken on the online version to that on the in-person versions, we do use the time data from the online administration to address student "buy-in"—that is, whether or not students seem to take the assessment seriously. Over all three courses, the average time spent on the survey was 27.3 minutes (31.8–27.0–23.1 minutes for C(I)–C(II)–C(III) students, respectively) [55]. Classroom observations from proctors during in-person administration suggest that students take about 30 to 40 minutes, depending on the course, to complete the PIQL and upload their answers in that setting. We believe the small (presumed) difference may be due to the simpler test-taking process in the online context. When completing the assessment online, students did not record their responses on paper and then enter them electronically after navigating to a website; rather, they read and responded to the items entirely online. Time spent navigating to the website on their computer, smartphone, or tablet is not included in their time. The time-on-task data are consistent with the amount of time that we believe is necessary to read and respond to items with an appropriate amount of effort.

We did notice a small number of students in each of the courses taking ten minutes or less to complete the PIQL: 5% overall (1%–3%–11% for C(I)–C(II)–C(III) students). Ten minutes is likely not enough time to read and consider the answer choices carefully, suggesting that these students may not have been taking the assessment as seriously as we would like. Fortunately, only for the C(III) students was the percentage of students spending less than 10 minutes a sizable fraction of the student population. Because we ran the assessment in each quarter of the 2019–2020 academic year, many of the C(III) students were seeing the assessment for the third time. We would expect these students to spend less time on the PIQL due to familiarity with the material and assessment items; however, the average score for C(III) students spending less than ten minutes on the survey was 5.25, significantly lower than the average score of 11.34 for the entire C(III) data set. We believe that many of the C(III) students who took less than ten minutes to complete the PIQL were not engaging with the items thoughtfully.

We compare student performance on the two administrations of the assessment, denoted "Online" and "In-person". We limit our analyses to the data collected from students enrolled in the first quarter of the calculus-based introductory physics sequence (C(I) students).
We believe this is the best comparison, as these groups contain students seeing the instrument for the first time. We compared student performance on the two versions of the instrument in two ways: using the average score for a subset of 17 items in common between the two versions, and using changes in item difficulty for those 17 items.

Average overall score and standard deviation on the subset of 17 items were computed for the Online administration (N = 397) and the In-person administration (N = 326); the Online average was somewhat lower. While this difference is slightly larger than expected from past quarters' data, the effect is fairly small, as measured by Cohen's d.

In addition to looking at students' scores to compare performance for the two administrations, we calculated the Classical Test Theory statistic item difficulty. The item difficulty is the fraction of students answering each item correctly; therefore, a higher difficulty value indicates an easier question. Comparing item difficulty for the 17 common items, we found that while the average difficulty over all items in the set was not significantly different, the individual difficulty was significantly different for five items (binomial test). A comparison of the item difficulties is shown in Fig. 6. All five of the items had lower difficulty values for the online version of the instrument, indicating the items were more difficult for students when presented online, consistent with the lower overall score described above. Four of the five items (Q15, Q18, Q19, and Q20 in Fig. 6) are MCMR items; we discuss a possible explanation for the difference in Sec. IV C below. We typically see large variations in the item difficulty for two of these items (Q15 and Q19), but the difficulties for those items during online administration are lower than expected from previous administrations.
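The two comparisons described above can be reproduced with standard tools. The following Python sketch is illustrative only: the function names, the significance threshold, and the placeholder score arrays are our own choices, not the PIQL data or analysis code. It computes a pooled-standard-deviation Cohen's d for the overall score difference and a two-sided binomial test for each item's difficulty.

import numpy as np
from scipy.stats import binomtest

def cohens_d(x, y):
    """Effect size for the difference of two group means, using the pooled SD."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

def compare_item_difficulty(online, in_person, alpha=0.05):
    """online, in_person: 0/1 arrays of shape (n_students, n_items).
    CTT difficulty = fraction correct; the binomial test asks whether the online
    fraction correct for an item is consistent with the in-person fraction."""
    results = []
    for item in range(online.shape[1]):
        p_in_person = in_person[:, item].mean()     # reference (in-person) difficulty
        k_online = int(online[:, item].sum())       # number of online students correct
        test = binomtest(k_online, n=online.shape[0], p=p_in_person)
        results.append({
            "item": item,
            "difficulty_online": online[:, item].mean(),
            "difficulty_in_person": p_in_person,
            "p_value": test.pvalue,
            "significant": test.pvalue < alpha,
        })
    return results

# Placeholder score distributions (not PIQL data), matching the sample sizes above.
rng = np.random.default_rng(0)
online_scores = rng.normal(10.5, 3.5, 397)
in_person_scores = rng.normal(11.0, 3.5, 326)
print(f"Cohen's d = {cohens_d(online_scores, in_person_scores):.2f}")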
FIG. 6. A comparison of CTT item difficulty for 17 items from the assessment for C(I) students. Red bars represent item difficulty on the In-person administration of the assessment; blue bars are used for the Online administration. Error bars represent the standard error. Dashed lines show the upper and lower bounds for desired item difficulty.
Six of the 20 items on the PIQL are multiple-choice/multiple-response (MCMR), asking students to choose as many answers as they believe are correct for each item. When the instrument was administered in person, there were multiple opportunities to remind students that they could choose more than one response on these items, both in writing on the instrument itself and verbally by the proctor. Validation interviews suggested that multiple reminders were necessary, as this variety of question is relatively rare on the assessments typically encountered by students. We were concerned that many students would not recognize this type of question when encountering it online, especially students who had not completed the instrument previously. As noted above, we made several changes to the format of the assessment to emphasize to students that they should choose more than one response for the MCMR items if appropriate.

To assess the effectiveness of these measures, we compared the percentage of students choosing more than one response on each MCMR item, finding an increase for all MCMR items when administered online. We conclude that our measures were effective. However, as only two of the MCMR items on the PIQL have more than one correct response, an increase in the number of answers chosen is not necessarily associated with an improvement in performance. Increases in the number of responses selected are generally associated with a decrease in the correct response rate, as MCMR items were scored dichotomously (i.e., an MCMR item was only counted as correct if a student selected the correct answer choice(s) and did not select any of the incorrect choices). For items Q15, Q18, Q19, and Q20—the four MCMR items for which we saw significant decreases in CTT item difficulty—the fraction of students who selected more than one answer choice increased by 9%, 22%, 9%, and 16%, respectively, from the In-person to the Online administration. Item 18 had two correct responses; as with the other items, there was a decrease in the item difficulty statistic and an increase in the percentage of students choosing more than one response.

Initial results tentatively suggest that students take the online version of the assessment seriously, perform at roughly the same level as for in-person administration, and are able to understand that MCMR items allow for multiple responses. To continue toward a valid and reliable online assessment, we must learn more about how students interact with test items when using a computer or other internet-capable device, especially items for which there seems to be a significant difference in performance when administered online compared to on paper. In a preliminary follow-up study, we presented the PIQL online to a class of C(I) students (N = 109). Approximately half the students (N = 59) saw the MCMR items grouped at the end of the PIQL, while the other half saw the MCMR items interspersed with the MCSR items. We found no significant difference between the two groups in the number of answers chosen for MCMR questions, and preliminary analysis suggests that all of these students chose more answers for the items than did students who completed the PIQL in-person.
We plan to develop an online interview protocol that may help us understand how student reasoning may change when the assessment is given in an online format.

Although there were differences in item difficulty between the two versions of the assessment discussed, we note that most items still fall within the desired range for difficulty for first-term students, as seen in Fig. 6. The data discussed in depth here indicate that the bulk of the difference is due to students being more willing to "click" multiple responses for MCMR items. As described above, preliminary follow-up work indicates that it is online administration, rather than question-type grouping (i.e., putting all MCMR items at the end of the instrument), that increases the number of answer choices chosen by students. Further analyses of particular answer choices on the MCMR items, going beyond dichotomous scoring, may also provide insight: for example, we are interested in changes in the percentage of students choosing both correct and incorrect responses for different administration methods.

IV. VALIDATION

A. Qualitatively validating the PIQL and individual questions
Face validity of the PIQL was assessed primarily through expert and student interviews. Interviews were used to validate the PIQL and individual PIQL items in three distinct ways:
1. Expert panel reviews were performed to verify that the correct answer choices are indeed correct, as well as to ensure that distractors are incorrect; experts also identified the mathematical content of each question, which allowed us to confirm the face validity of the individual items.
2. Individual student interviews were performed to determine whether students are interpreting the questions as intended, to ensure that students are choosing the correct answer for the correct reasons, and to ensure that incorrect responses are chosen for consistent reasons.
3. Individual expert validation interviews were performed to verify that individual items and the assembled inventory are testing ideas that experts expect their students to learn.

During the expert panel reviews, physics education research faculty and graduate students worked in groups of 3–5 members, with each group seeing 6–8 questions. Panel members worked through each item individually to determine the correct answer and identify the specific mathematical construct(s) required to answer the question. The panels then discussed the items as a group, to come to a consensus about the correct answer and to confirm that the incorrect answers were indeed incorrect; together, they also rated the quality of the question in terms of clarity, ambiguity, and appropriateness. Researchers observed the conversations, took notes, and collected the materials afterwards.

For the individual student interviews, students were recruited in approximately equal numbers from each of the three courses in the calculus-based introductory physics sequence at UW. During interviews lasting 30–60 minutes, the students were asked to work through the questions to be validated, following a "think-aloud" protocol. That is, students were asked to describe their thinking about each question as they attempted to answer it. Interviewers did not ask questions of the students or interrupt their work except to remind them to think aloud (when necessary). The interviews were recorded and the interviewers took written notes. Scans of students' written work were saved. A small number of interviews following an identical protocol were performed with students enrolled in the calculus-based introductory physics sequence at Western Washington University. The student validation interviews informed changes to the questions to improve their coherence with the target population.

For the expert interviews, the complete PIQL was sent out to instructors with extensive experience teaching in the introductory physics sequence at UW. The experts were asked to review the PIQL before the interviews. During formal, semi-structured interviews, experts were asked to comment on the appropriateness of the items and of the test as a whole, to ensure that the PIQL is testing ideas that experts expect their students to learn. The interviews were recorded and the interviewers took written notes. These interviews resulted in a number of small but substantive wording changes for two of the items to improve their clarity. Feedback about the relevance of the items with respect to course learning objectives also informed the composition of the PIQL as a whole, resulting in one item being removed from the instrument.
B. Quantitative validation using Classical Test Theory
We used various quantitative analyses to measure the validity and reliability of each version of the PIQL. Using Classical Test Theory (CTT) we calculated the difficulty and discrimination parameters for each item; we want to have a wide range of difficulty values, with most items between 0.2 and 0.8 (representing the fraction of students who answer each item correctly), and we want most discrimination values to be above 0.3 (representing the difference in CTT difficulty between the top and bottom 27% of students) [56]. We also calculated Cronbach's α as a measure of reliability; a value of at least 0.7 indicates that the test is reliable for measuring the performance of groups of students on a single-construct test, and a value of at least 0.8 indicates that the test is reliable for measuring the performance of individual students [57].

Figure 7 shows the distributions of the CTT difficulty and discrimination parameters for each version of the PIQL. Five of the items in the protoPIQL were considered too easy (difficulty above 0.8), and three items had discrimination values below 0.3; moreover, there was a gap in the middle of the difficulty distribution, with only one item having a difficulty in the range between 0.3 and 0.55. Due to these results, we chose to use only nine of these items in subsequent versions of the PIQL, with one of them being periodically modified. For PIQL v1.0, 11 items were added based on previous research on all three of our PQL facets [28, 38, 40, 42, 43, 45, 58, 59], which resulted in a much broader distribution of CTT difficulty values. One additional proportional reasoning item was added to PIQL v1.1; for PIQL v2.0, two covariation items were replaced by newly developed items based on research in mathematics education [60–62]; two items were slightly modified for v2.1; one item was removed for PIQL v2.2 due to consistently high difficulty and low discrimination parameters.

Taken together, these revisions have resulted in a 20-item instrument with a broad range of difficulty values (only one of which is above the desired upper limit of 0.8), and all items having discrimination values above 0.3. Six of the 20 items have large discrimination values (above 0.6), meaning that high-scoring students are much more likely to answer these questions correctly than low-scoring students. Additionally, Cronbach's α has increased: on the protoPIQL, α did not meet the 0.7 threshold for measuring either groups of students or individuals; on PIQL v2.2, α meets both thresholds (i.e., α ≥ 0.8). The distribution of difficulty values for PIQL v2.2 is a little higher than we think would be ideal (average of 0.54), but we have chosen to keep some of the easier items because we recognize that the students in our data set may have had more prior exposure to mathematics and physics instruction than is typical of the introductory physics student population [19]. We consider the changes in parameter values to indicate that we have created a valid and reliable inventory for measuring PQL for students in calculus-based introductory physics courses.
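For concreteness, the three CTT quantities named above can be computed as in the Python sketch below. This is an illustration under our own naming conventions, not the analysis code used for the PIQL; the response array at the bottom is a randomly generated placeholder.

import numpy as np

def ctt_statistics(responses):
    """responses: 0/1 array of shape (n_students, n_items), 1 = item answered correctly."""
    n_students, n_items = responses.shape
    total_scores = responses.sum(axis=1)

    # Item difficulty: fraction of students answering each item correctly
    # (higher values correspond to easier items).
    difficulty = responses.mean(axis=0)

    # Item discrimination: difference in fraction correct between the top 27%
    # and bottom 27% of students, ranked by total score.
    n_group = int(round(0.27 * n_students))
    order = np.argsort(total_scores)
    bottom, top = order[:n_group], order[-n_group:]
    discrimination = responses[top].mean(axis=0) - responses[bottom].mean(axis=0)

    # Cronbach's alpha: internal-consistency reliability of the total score.
    item_variances = responses.var(axis=0, ddof=1)
    total_variance = total_scores.var(ddof=1)
    alpha = (n_items / (n_items - 1)) * (1 - item_variances.sum() / total_variance)

    return difficulty, discrimination, alpha

# Placeholder data (not PIQL responses): 500 students, 20 items.
rng = np.random.default_rng(1)
fake = (rng.random((500, 20)) < rng.uniform(0.2, 0.8, 20)).astype(int)
difficulty, discrimination, alpha = ctt_statistics(fake)
print(f"alpha = {alpha:.2f}; items outside [0.2, 0.8]: {np.sum((difficulty < 0.2) | (difficulty > 0.8))}")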
C. Analyzing Data from Multiple-Choice/Multiple-Response Items

We consider PQL to be a conceptual blend between physics concepts and mathematical reasoning [11, 63]. In order to measure the complexity of ideas that students bring from both of these input spaces, we have chosen to include some multiple-choice/multiple-response (MCMR) items in which students are instructed to "select all statements that must be true" from a given list, and to "choose all that apply" (emphasis in the original text). The MCMR item format has the potential to reveal more information about students' thinking than standard single-response items, but it also poses problems with data analysis, as typical analyses of multiple-choice tests (such as CTT) assume single-response items.

For MCMR items, dichotomous scoring methods require a student to choose all correct responses and only correct responses to be considered correct. For example, item 18 on PIQL v2.2 has two correct answer choices: D and G. In a dichotomous scoring scheme a student who picks only answer D would be scored the same way as a student who chooses answers E and F (both incorrect). This ignores the nuance and complexity of students' response patterns within (and between) items. As such, the CTT results for these items are not entirely representative of students' responses.

In an effort to move beyond the constraints of dichotomous scoring for MCMR items, we have developed a four-level scoring scale in which we categorize students' responses as Completely Correct, Some Correct (if at least one but not all correct response choices are chosen), Both Correct and Incorrect (if at least one correct and one incorrect response choice are chosen), and Completely Incorrect [64, 65]. Figure 8 shows the results of using this four-level scoring scale to categorize student responses to the six MCMR items on PIQL v2.2. The dark purple Completely Correct bars are equivalent to CTT difficulty; however, Fig. 8 also shows that at least 60% of students provide at least one correct response to each item (Completely Correct, Some Correct, and Both Correct and Incorrect combined), although this is often coupled with an incorrect response (6%–44% of students categorized as Both Correct and Incorrect). This tells a very different story than the CTT results, which group the Some Correct, Both Correct and Incorrect, and Completely Incorrect categories together into a broad Incorrect category.

These four-level scoring results also reveal differences hidden by dichotomous scoring. For example, on PIQL v2.2 two items (Q17 and Q18) have more than one correct answer choice. Figure 8 shows that approximately the same number of students answer these items completely correctly, but Q17 has a much higher fraction of students in the Some Correct category. Students are much more likely to include one of the incorrect responses to Q18 than they are for Q17. The items with multiple correct answers also present a new question: is it better for a student to choose Some Correct answers or Both Correct and Incorrect answers? The answer may depend on the specifics of each item and the associated answer choices.
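The four-level scale can be expressed compactly as a categorization rule. The short sketch below (hypothetical function name; not our analysis code) shows one way to assign a single MCMR response to one of the four categories, using item 18's correct choices D and G from the discussion above as the worked example.

    def categorize_mcmr(selected, correct):
        """Four-level categorization of one student's MCMR response.

        selected: set of answer choices the student selected, e.g. {"D", "G"}.
        correct:  set of correct answer choices for the item.
        """
        selected, correct = set(selected), set(correct)
        has_correct = bool(selected & correct)
        has_incorrect = bool(selected - correct)

        if selected == correct:
            return "Completely Correct"
        if has_correct and not has_incorrect:
            return "Some Correct"  # some, but not all, of the correct choices
        if has_correct and has_incorrect:
            return "Both Correct and Incorrect"
        return "Completely Incorrect"

    # Item 18 on PIQL v2.2 has correct choices D and G (see text above).
    print(categorize_mcmr({"D"}, {"D", "G"}))       # Some Correct
    print(categorize_mcmr({"E", "F"}, {"D", "G"}))  # Completely Incorrect
    print(categorize_mcmr({"D", "E"}, {"D", "G"}))  # Both Correct and Incorrect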
FIG. 7. CTT difficulty (a) and discrimination (b) parameter distributions for all versions of the PIQL (protoPIQL, v1.0, v1.1, v2.0, v2.1, v2.2). The desired range of difficulty values is between 0.2 and 0.8 (shown by dashed red lines). The desired range for discrimination is above 0.3.

Future work will include analyzing data from MCMR items to develop a more sophisticated scoring scheme.

To further examine the responses students give to individual PIQL items we use Item Response Curves (IRCs), which show the fraction of students who choose each answer choice as a function of the students' overall score on the PIQL [66–69]. IRCs have been used with single-response tests to rank incorrect responses and to compare different student populations with regard to both correct and incorrect answer choices [68, 69]. We find IRCs particularly helpful for examining student responses to items with multiple correct answers.

Figure 9 shows three IRCs with different behavior. Item 14 is a single-response item with correct answer B. Even fairly high-scoring students persist in choosing a particular incorrect answer, F. Item 17 has three correct responses (A, C, D), with A being the most commonly chosen and C being the least commonly chosen. Few students at any score level choose E, and fewer than 20% of students who score above average (10.8) choose either incorrect response (B, E). Item 18 is particularly interesting in that all responses are chosen by 20%–60% of students in the middle score range (8–12).
FIG. 8. Fraction of student responses in each category of our four-level scoring scheme for MCMR items with multiple correct answers. These results are from the final version of the PIQL.
This supports the results from Fig. 8 that students are likely to choose both a correct and an incorrect response to Q18.

Both the four-level scoring scheme and the IRCs provide more information than traditional CTT analyses and allow us to see patterns in students' responses that go beyond typical dichotomous scoring methods. We have used these to gain a deeper qualitative picture of student performance on each PIQL item, and these have been very valuable for deciding which items and answer choices to keep, eliminate, or modify.
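For readers who want to produce IRC-style plots from their own data, the sketch below (hypothetical function and variable names; not our analysis pipeline) computes, for one item, the fraction of students selecting each answer choice at each total-score value. Plotting these fractions against score yields curves like those in Fig. 9.

    import numpy as np

    def item_response_curve(responses, scores, choices):
        """Fraction of students selecting each answer choice at each total score.

        responses: list of sets, one per student, of choices selected on one item.
        scores:    the same students' total test scores.
        choices:   iterable of answer-choice labels for the item, e.g. "ABCDEFG".
        Returns a nested dict: {choice: {score: fraction}}.
        """
        scores = np.asarray(scores)
        curve = {c: {} for c in choices}
        for s in np.unique(scores):
            # All responses from students who earned this total score.
            at_score = [resp for resp, tot in zip(responses, scores) if tot == s]
            for c in choices:
                curve[c][int(s)] = sum(c in resp for resp in at_score) / len(at_score)
        return curve

    # Toy example: three students with total scores 8, 8, and 12 on one item.
    responses = [{"B"}, {"F"}, {"B"}]
    print(item_response_curve(responses, [8, 8, 12], "BF"))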
D. Exploring the substructure of the PIQL
The PIQL was initially developed to probe student reasoning about three facets of PQL that were defined from an expert's perspective: ratios and proportions, covariation, and signed quantities/negativity. In the language of factor analysis, this would imply that the PIQL was originally intended to have a three-factor structure. Because the intended factor structure of the PIQL was well understood at the beginning of its development, confirmatory factor analysis (CFA) was used at the onset, in conjunction with exploratory factor analysis (EFA). CFA is a model-driven statistical method whose goal is to assess the adequacy of a proposed factor model for response data from the instrument being analyzed [70]. In our work, CFA helps reveal whether or not student response patterns align with the facet-driven model predicted by experts. EFA is a data-driven statistical method whose goal is to uncover the underlying dependencies between observed variables [71]. For all versions of the PIQL, CFA determined that the proposed, facet-driven factor model was not an adequate representation of the PIQL's latent trait structure [72]. The target threshold for CFA is to have goodness-of-fit statistics such as the Comparative Fit Index (CFI) and Tucker-Lewis Fit Index (TLI) above a threshold of 0.9 [73]. For all versions of the PIQL, the CFI and TLI were below 0.8 when using the facet-driven factor model; therefore, students' response patterns did not match our expectations of reasoning developing differently in the areas of ratio and proportion, covariation, and signed quantities/negativity.
FIG. 9. Item Response Curves for three items on PIQL v2.2. Each plot shows the fraction of students who chose each response out of the students who earned each score on the total test. Item 14 has correct answer B, item 17 has correct answers A, C, and D, and item 18 has correct answers D and G.

Given that the CFA results did not fit the proposed model, we moved on to a more in-depth investigation using EFA. The goal of using EFA was to determine whether the PIQL has any substructure, and how closely any substructure aligns with the three facets of PQL. The results from parallel analysis suggested that 3–4 meaningful factors could be extracted for the earlier versions of the PIQL (protoPIQL, v1.0, and v1.1) [74]; however, when examining these structures, they were found to be inconsistent with the originally intended factors based on the three facets of PQL [72]. During this initial development of the PIQL, the EFA models of v1.0 and v1.1 each contained a factor made up of the same two items. These two items had loadings on that factor above 0.8, compared to the next highest loading value of approximately 0.5. The items' loadings remained essentially the same when they appeared sequentially on v1.0 and when they were separated and placed on different pages of the instrument in v1.1. This suggested that the items were redundant, which led to the removal of one of them from the PIQL in subsequent iterations.

Analyses of the most recent versions of the PIQL (v2.0, v2.1, and v2.2) suggest the instrument is now unidimensional, with no strong substructure amongst the items. Results from EFA parallel analysis suggested that these versions of the PIQL could be adequately described by a single factor. Additional evidence to support this conclusion was obtained by performing CFA on v2.1 and v2.2 of the PIQL using a unidimensional model, with measures of goodness of fit suggesting that the unidimensional model adequately fit the student response data. Specifically, the CFI and TLI were above 0.93 for both versions under CFA using a unidimensional model. Additionally, the standardized root mean square of the residuals was below 0.04, and the root mean square error of approximation was below 0.04. A model is considered an adequate representation of the data if these fit statistics are below 0.06 and 0.08, respectively; thus, these fit statistics indicate a good fit between the data and the unidimensional model [73]. This suggests that removing one of the redundant items identified in v1.0 and v1.1 resulted in the collapse of the PIQL's multiple-factor structure into one that is unidimensional. This may also have been affected by replacing two of the covariation items from v1.1.
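The parallel analysis referred to above can be illustrated with a short sketch of Horn's method: factors are retained when the eigenvalues of the observed inter-item correlation matrix exceed the corresponding eigenvalues obtained from uncorrelated random data of the same size. The code below is an illustrative simplification with hypothetical names; it uses Pearson correlations on 0/1 data rather than the tetrachoric-based approaches appropriate for binary items [74], and it is not the analysis code used in our study.

    import numpy as np

    def parallel_analysis(data, n_iterations=1000, percentile=95, seed=0):
        """Horn's parallel analysis on a dichotomously scored response matrix.

        data: array of shape (n_students, n_items) with 0/1 entries.
        Returns the suggested number of factors: the count of observed
        eigenvalues that exceed the chosen percentile of eigenvalues
        from uncorrelated random data of the same shape.
        """
        rng = np.random.default_rng(seed)
        n_students, n_items = data.shape

        # Eigenvalues of the observed inter-item correlation matrix (descending).
        observed = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]

        # Eigenvalues from many random data sets of the same size.
        random_eigs = np.empty((n_iterations, n_items))
        for i in range(n_iterations):
            noise = rng.normal(size=(n_students, n_items))
            random_eigs[i] = np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False))[::-1]
        threshold = np.percentile(random_eigs, percentile, axis=0)

        return int(np.sum(observed > threshold))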
A major confounding feature of these results is that the factor loadings were determined based on dichotomously scored items. As shown in Fig. 8, up to 65% of students who choose correct responses to MCMR items may be scored as incorrect, because either they did not choose all of the correct responses or they also chose an incorrect response. As such, the factor loadings may not accurately capture the relationships between students' responses for cases involving MCMR items.

To preserve the nuance and complexity of students' response patterns within (and between) items, we used module analysis for multiple-choice responses to examine the network of student responses to PIQL items [75]. Module analysis uses community detection algorithms to identify modules (a.k.a. communities) within networks of responses to multiple-choice items. We chose to analyze a network of only correct responses to PIQL items. The benefit of this method is that we can examine the patterns that arise from students' selections of each individual correct response, which preserves some of the complexity of MCMR items.

Earlier module analyses of v1.0 and v1.1, using various community detection algorithms on full data sets, suggested that there was some substructure in the PIQL. Again, these results did not agree with the three facets that the PIQL was intended to measure and also did not align well with the results of EFA [72, 76]. Recent developments in the application of module analysis within PER have enabled a deeper and more refined analysis of the module structure of the PIQL [77]. Using Modified Module Analysis (MMA) on the final two versions of the PIQL, with a locally adaptive network sparsification (LANS) in place of a global cutoff sparsification, resulted in no discernible substructure between the items on the instrument [77, 78]. This corroborates the conclusions of EFA and CFA that the PIQL is not measuring multiple constructs and is thus a unidimensional instrument.
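As a rough illustration of the general module-analysis approach (not the MMA/LANS procedure of Refs. [77, 78]), the sketch below builds a network whose nodes are individual correct answer choices, links pairs of choices whose selections are correlated above a simple global cutoff, and then applies a standard community detection algorithm from networkx. All function and variable names are hypothetical, and the global correlation cutoff stands in for the locally adaptive sparsification used in the published analysis.

    import numpy as np
    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    def correct_response_modules(selections, labels, threshold=0.2):
        """Simplified module analysis on a network of correct responses.

        selections: 0/1 array of shape (n_students, n_correct_choices), where
                    entry (i, j) is 1 if student i selected correct choice j.
        labels:     names for the correct choices, e.g. ["Q3.B", "Q17.A", ...].
        threshold:  minimum inter-choice correlation for drawing an edge.
        Returns a list of node sets; each set is one detected module.
        """
        corr = np.corrcoef(selections, rowvar=False)
        graph = nx.Graph()
        graph.add_nodes_from(labels)
        n = len(labels)
        for i in range(n):
            for j in range(i + 1, n):
                if corr[i, j] > threshold:
                    graph.add_edge(labels[i], labels[j], weight=corr[i, j])

        # Community detection: each returned set of labels is one module.
        return list(greedy_modularity_communities(graph, weight="weight"))

Applied to the full matrix of correct-response selections, each returned set of choice labels would constitute one candidate module; finding essentially a single module (or no coherent modules at all) is what we mean above by "no discernible substructure."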
E. Validation Summary
Our goal is to develop a valid and reliable instrument to measure PQL for students in calculus-based introductory physics courses. Results from classical test theory show that, after several revisions, the items on the PIQL have a broad range of difficulty values, and all items have acceptable levels of discrimination. The reliability of the PIQL has been established with a Cronbach's α that meets the typically accepted criteria for measuring both properties of groups and properties of individuals.

Results from exploratory and confirmatory factor analysis and modified module analysis show that the PIQL is a unidimensional instrument that measures a single construct. We interpret this construct as being Physics Quantitative Literacy. These results show that student responses to PIQL items do not separate cleanly along the lines of ratios and proportions, covariation, and signed quantities/negativity, suggesting that these three facets of PQL (which are discernible to experts) may develop simultaneously in students and are deeply interconnected in physics contexts. This also suggests that student reasoning patterns are not well aligned with those of experts. Students' reasoning patterns represent novice PQL in introductory physics; future work will look for resources within these reasoning patterns.

We have supplemented rigorous psychometric analyses with four-level scoring methods for MCMR items and IRCs, which provide additional information about students' choices of both correct and incorrect responses. These analyses played a vital role in informing our decisions when revising the PIQL. Future work will include developing more sophisticated analyses that can incorporate the nuance of MCMR data into CTT-style analyses.

V. CONCLUSION
Physics is characterized by systematic quantitative reasoning that is less common in other introductory STEM courses. Developing this reasoning is an important outcome of introductory physics courses, which are required for most STEM majors. We anticipate that the PIQL can have a strong influence on undergraduate physics—and thereby STEM—education across varied post-secondary learning environments.

In this paper we have presented the development of the PIQL, and the process and outcomes of its validation. The next steps involve establishing the PIQL as a valid metric across diverse student populations. Historically, physics education research studies oversample from large research universities that are similar to the University of Washington, where the PIQL has been validated. The broader population of introductory physics students is a more racially and socioeconomically diverse group, attending a variety of geographically diverse post-secondary institutions [19]; therefore, it is essential to ensure that materials and methods developed with a relatively homogeneous student sample be validated for the breadth of learning environments in which students take introductory physics. A broadly validated PIQL will put a new tool in the hands of researchers that can help facilitate the improvement of quantitative literacy as an educational outcome.

In addition, we continue to validate the PIQL for use as an online reasoning inventory. We intend to develop interview protocols to investigate how students interact with the PIQL, especially its MCMR items, when it is administered online. A more in-depth analysis—beyond a simple comparison of the number of responses chosen—of student performance on MCMR items for different administration methods is also indicated.

Finally, we seek to characterize the development of students' PQL by utilizing additional psychometric analysis methods. We will use item tree analysis to characterize students' knowledge states at various points within the physics curriculum and explore the potential hierarchy of skills measured by the PIQL [79–81]. We will also modify polytomous models of item response theory to be able to handle the MCMR items in the PIQL [82–84]. Using these methods will extend our preliminary analyses of MCMR items using item response curves, and could provide more robust results to inform our efforts to create a multi-level scoring scheme.

As students move beyond the introductory sequence to using newly learned mathematics (calculus, linear algebra, differential equations) in subsequent physics courses, there is mounting evidence that many do not always understand why they do what they do to solve problems [85–87]. Most students learn procedures effectively, and they would like to have the quantitative literacy to conceptualize these procedures [85], but for many students PQL is not a strong outcome of prior instruction. We hope to use the PIQL to map the development of physics quantitative literacy through the entire undergraduate physics course sequence.
[1] Patrick W Thompson, "Quantitative reasoning and mathematical modeling," New perspectives and directions for collaborative research in mathematics education, 33 (2010).
[2] Bobby Ojose, "Mathematics Literacy: Are We Able To Put The Mathematics We Learn Into Everyday Use?" Journal of Mathematics Education, 89–100 (2011).
[3] Alexis Olsho, Suzanne White Brahmia, Andrew Boudreaux, and Trevor I Smith, "The physics inventory of quantitative reasoning: Assessing student reasoning about sign," in Proceedings of the 22nd Annual Conference on Research in Undergraduate Mathematics Education, edited by A Weinberg, D Moore-Russo, H Soto, and M Wawro (Oklahoma City, OK, 2019) pp. 992–997.
[4] David Hestenes, Mark Wells, and Gregg Swackhamer, "Force concept inventory," The Physics Teacher, 141–166 (1992).
[5] Ronald K Thornton and David R Sokoloff, "Assessing student learning of Newton's laws: The force and motion conceptual evaluation and the evaluation of active learning laboratory and lecture curricula," American Journal of Physics, 338–352 (1998).
[6] Richard R Hake, "Interactive-engagement versus traditional methods: A six-thousand-student survey of mechanics test data for introductory physics courses," American Journal of Physics, 64–74 (1998).
[7] Joshua Von Korff, Benjamin Archibeque, K Alison Gomez, Sarah B McKagan, Eleanor C Sayre, Edward W Schenk, Chase Shepherd, and Lane Sorell, "Secondary analysis of teaching methods in introductory physics: A 50k-student study," American Journal of Physics, 969–974 (2016).
[8] Adrian Madsen, Sarah B McKagan, and Eleanor C Sayre, "Resource Letter RBAI-1: Research-Based Assessment Instruments in Physics and Astronomy," American Journal of Physics, 245–264 (2017).
[9] Eddie M Gray and David Tall, Journal for Research in Mathematics Education (1994).
[10] Jean-Claude Falmagne, Mathieu Koppen, Michael Villano, Jean-Paul Doignon, and Leila Johannesen, "Introduction to Knowledge Spaces: How to Build, Test, and Search Them," Psychological Review, 201–224 (1990).
[11] Gilles Fauconnier and Mark Turner, The Way We Think: Conceptual Blending and the Mind's Hidden Complexities (Basic Books, New York, 2002).
[12] S Brahmia, Andrew Boudreaux, and S E Kanim, "Developing mathematization with Physics Invention Tasks," American Journal of Physics (in press) (2017).
[13] Wendy K Adams and Carl E Wieman, "Development and validation of instruments to measure learning of expert-like thinking," International Journal of Science Education, 1289–1312 (2010).
[14] Bruce L Sherin, "How Students Understand Physics Equations," Cognition and Instruction, 479–541 (2001).
[15] Allison Dorko and Natasha Speer, "Calculus Students' Understanding of Area and Volume Units," Investigations in Mathematics Learning, 23–46 (2015).
[16] Suzanne White Brahmia, Alexis Olsho, Trevor I Smith, and Andrew Boudreaux, "A framework for the natures of negativity in introductory physics," (under review) preprint available arXiv:1903.03806 (2019).
[17] Suzanne White Brahmia, "Quantification and its importance to modeling in introductory physics," European Journal of Physics, 044001 (2019).
[18] Jonathan Tuminaro, A Cognitive Framework for Analyzing and Describing Introductory Students' Use and Understanding of Mathematics in Physics, Ph.D. thesis, University of Maryland (2004).
[19] Stephen Kanim and Ximena C Cid, "The demographics of physics education research," arXiv e-print 1710.02598 (2017).
[20] Joëlle Vlassis, "Making sense of the minus sign or becoming flexible in 'negativity'," Learning and Instruction, 469–484 (2004).
[21] Patrick W Thompson and L Saldanha, "Fractions and multiplicative reasoning," in Research Companion to Principles and Standards for School Mathematics, edited by Jeremy Kilpatrick, W. Gary Martin, and Deborah Schifter (National Council of Teachers of Mathematics, 2003) Chap. 7, pp. 95–113.
[22] Patrick W Thompson, Marilyn P Carlson, Cameron Byerley, and Neil Hatfield, "Schemes for Thinking with Magnitudes: A Hypothesis about Foundational Reasoning Abilities in Algebra," in Epistemic Algebraic Students: Emerging Models of Students' Algebraic Knowing, Papers from an Invitational Conference (2014) p. 1.
[23] Arnold B Arons, "Student patterns of thinking and reasoning," The Physics Teacher, 576–581 (1983).
[24] Jessica Pierson Bishop, Lisa L Lamb, Randolph A Philipp, Ian Whitacre, Bonnie P Schappelle, and Melinda L Lewis, "Obstacles and affordances for integer reasoning: An analysis of children's thinking and the history of mathematics," Journal for Research in Mathematics Education, 19–61 (2014).
[25] Aurora Gallardo and Teresa Rojano, "School algebra. Syntactic difficulties in the operativity," Proceedings of the XVI International Group for the Psychology of Mathematics Education, North American Chapter, 265–272 (1994).
[26] Patrick W Thompson and Tommy Dreyfus, "Integers as transformations," Journal for Research in Mathematics Education, 115–133 (1988).
[27] Terezinha Nunes, "Learning mathematics: Perspectives from everyday life," Schools, Mathematics, and the World of Reality, 61–78 (1993).
[28] S Brahmia, Andrew Boudreaux, and Stephen E Kanim, "Obstacles to mathematization in introductory physics," arXiv preprint arXiv:1601.01235 (2016).
[29] Kate Hayes and Michael C Wittmann, "The role of sign in students' modeling of scalar equations," The Physics Teacher, 246–249 (2010).
[30] Rabindra R Bajracharya, Thomas M Wemyss, and John R Thompson, "Student interpretation of the signs of definite integrals using graphical representations," AIP Conference Proceedings, 111–114 (2012).
[31] Tra Huynh and Eleanor C Sayre, "Blending Mathematical and Physical Negative-ness," arXiv preprint arXiv:1803.01447 (2018).
[32] Moa Eriksson, Cedric Linder, and Urban Eriksson, "Towards understanding learning challenges involving sign convention in introductory level kinematics," in Physics Education Research Conference (PERC), Washington DC, 1-2 August 2018 (2018).
[33] Stijn Ceuppens, Laurens Bollen, Johan Deprez, Wim Dehaene, and Mieke De Cock, "9th grade students' understanding and strategies when solving x(t) problems in 1D kinematics and y(x) problems in mathematics," Physical Review Physics Education Research, 10101 (2019).
[34] L. A. Saldanha and Patrick W. Thompson, "Re-thinking covariation from a quantitative perspective: Simultaneous continuous variation," in Annual Meeting of the Psychology of Mathematics Education — North America, Vol. 1, edited by S. B. Berenson and W. N. Coulombe (North Carolina State University, Raleigh, NC, 1998) pp. 298–304.
[35] Marilyn Carlson, Sally Jacobs, Edward Coe, Sean Larsen, and Eric Hsu, "Applying Covariational Reasoning While Modeling Dynamic Events: A Framework and a Study," Journal for Research in Mathematics Education, 352–378 (2002).
[36] Patrick W. Thompson, "Images of rate and operational understanding of the fundamental theorem of calculus," Educational Studies in Mathematics (1994), 10.1007/BF01273664.
[37] Charlotte Zimmerman, Alexis Olsho, Suzanne White Brahmia, Michael Loverude, Andrew Boudreaux, and Trevor I Smith, "Toward understanding and characterizing expert physics covariational reasoning," in Physics Education Research Conference 2019, PER Conference (Provo, UT, 2019).
[38] S Brahmia, "Developing expert mathematization in the introductory physics course: an impedance mismatch," in Proceedings of the 2nd International Conference on Research, Implementation and Education of Mathematics and Sciences (2nd ICRIEMS) (2015).
[39] Andrew Boudreaux, Stephen Kanim, and Suzanne Brahmia, "Student facility with ratio and proportion: Mapping the reasoning space in introductory physics," arXiv e-print 1511.08960 (2015).
[40] Suzanne Brahmia and Andrew Boudreaux, "Exploring student understanding of the negative sign in introductory physics contexts," in Proceedings of the 19th Annual Conference on Research in Undergraduate Mathematics Education (2016).
[41] Adrian Madsen, Sarah B McKagan, and Eleanor C Sayre, "Best practices for administering concept inventories," The Physics Teacher, 530–536 (2017).
[42] Suzanne Brahmia, Andrew Boudreaux, and Stephen E Kanim, "Developing Mathematical Creativity with Physics Invention Tasks," arXiv e-print 1602.02033 (2017).
[43] Suzanne Brahmia and Andrew Boudreaux, "Signed quantities: Mathematics based majors struggle to make meaning," in Proceedings of the 20th Annual Conference on Research in Undergraduate Mathematics Education (2017).
[44] Suzanne White Brahmia, Alexis Olsho, Trevor I Smith, and Andrew Boudreaux, "Framework for the natures of negativity in introductory physics," Phys. Rev. Phys. Educ. Res., 010120 (2020).
[45] Marilyn Carlson, Michael Oehrtman, and Nicole Engelke, "The precalculus concept assessment: A tool for assessing students' reasoning abilities and understandings," Cognition and Instruction, 113–145 (2010).
[46] Alexis Olsho, Suzanne White Brahmia, Andrew Boudreaux, and Trevor I. Smith, "When negative is not 'less than zero': electric charge as a signed quantity," The Physics Teacher (under review) (2019).
[47] Scott Bonham, "Reliability, compliance, and security in web-based course assessments," Phys. Rev. ST Phys. Educ. Res., 010106 (2008).
[48] Jayson M Nissen, Manher Jariwala, Eleanor W Close, and Ben Van Dusen, "Participation and performance on paper- and computer-based low-stakes assessments," International Journal of STEM Education, 21 (2018).
[49] Bethany R Wilcox and Steven J Pollock, "Investigating students' behavior and performance in online conceptual assessment," Phys. Rev. Phys. Educ. Res., 020145 (2019).
[50] Many of these systems can be described more accurately as surveillance—they cannot interact with the students or provide a physical presence.
[51] Bethany R Wilcox and H J Lewandowski, "A summary of research-based assessment of students' beliefs about the nature of experimental physics," American Journal of Physics, 212–219 (2018).
[52] The particular platform that we used to administer the online version of the instrument allows students to leave the PIQL open for as long as the students want; however, after the time has expired, students cannot enter any new responses or view any more questions, though they can submit any responses already entered.
[53] We recognize that not being able to look at questions already completed is a significant change from in-class practices, but deemed it necessary for test security purposes.
[54] Adrian Madsen and Sam McKagan, "Administering research-based assessments online (PhysPort Expert Recommendation)."
[55] This excludes the very large times (sometimes > 1000 minutes) associated with opening the survey and then ignoring it for several hours before hitting the submit button.
[56] William Wiersma and Stephen G Jurs, Educational Measurement and Testing, 2nd ed. (Allyn & Bacon, 1990).
[57] Rodney L Doran, Basic Measurement and Evaluation of Science Instruction (National Science Teachers Association, 1980).
[58] Suzanne White Brahmia, Alexis Olsho, Trevor I Smith, and Andrew Boudreaux, "NoNIP: Natures of Negativity in Introductory Physics," in Physics Education Research Conference 2018, PER Conference, edited by Adrienne Traxler, Ying Cao, and Steven Wolf (Washington, DC, 2018).
[59] Suzanne White Brahmia, Alexis Olsho, Trevor I Smith, and Andrew Boudreaux, "A framework for the natures of negativity in introductory physics," in Proceedings of the 22nd Annual Conference on Research in Undergraduate Mathematics Education, edited by A Weinberg, D Moore-Russo, H Soto, and M Wawro (Oklahoma City, OK, 2019) pp. 68–75.
[60] Kevin C Moore, Teo Paoletti, and Stacy Musgrave, "Covariational reasoning and invariance among coordinate systems," Journal of Mathematical Behavior, 461–473 (2013).
[61] Natalie L F Hobson and Kevin C Moore, "Exploring Experts' Covariational Reasoning," in (Moore & Thompson, 2017) pp. 664–672.
[62] Teo Paoletti and Kevin C Moore, "The parametric nature of two students' covariational reasoning," Journal of Mathematical Behavior, 137–151 (2017).
[63] Suzanne White Brahmia, Alexis Olsho, Andrew Boudreaux, Trevor I. Smith, and Charlotte Zimmerman, "A Conceptual Blend Analysis of Physics Quantitative Literacy Reasoning Inventory Items," in Proceedings of the 23rd Annual Conference on Research in Undergraduate Mathematics Education (accepted for publication) (2020).
[64] Trevor I Smith, Suzanne W Brahmia, Alexis Olsho, Andrew Boudreaux, Philip Eaton, Paul J Kelly, Kyle J Louis, Mitchell A Nussenbaum, and Louis J Remy, "Developing a reasoning inventory for measuring physics quantitative literacy," arXiv e-print 1901.03351 (2018).
[65] Trevor I Smith, Suzanne White Brahmia, Alexis Olsho, and Andrew Boudreaux, "Developing a reasoning inventory for measuring physics quantitative literacy," in Proceedings of the 22nd Annual Conference on Research in Undergraduate Mathematics Education, edited by A Weinberg, D Moore-Russo, H Soto, and M Wawro (Oklahoma City, OK, 2019) pp. 1181–1182.
[66] Gary A Morris, Lee Branum-Martin, Nathan Harshman, Stephen D Baker, Eric Mazur, Suvendra Dutta, Taha Mzoughi, and Veronica McCauley, "Testing the test: Item response curves and test quality," American Journal of Physics, 449–453 (2006).
[67] Gary A Morris, Nathan Harshman, Lee Branum-Martin, Eric Mazur, Taha Mzoughi, and Stephen D Baker, "An item response curves analysis of the Force Concept Inventory," American Journal of Physics, 825–831 (2012).
[68] Paul J Walter and Gary Morris, "Assessing Student Learning and Improving Instruction with Transition Matrices," in Physics Education Research Conference 2016, PER Conference, edited by D L Jones, L Ding, and A Traxler (Sacramento, CA, 2016) pp. 376–379.
[69] Michi Ishimoto, Glen Davenport, and Michael C Wittmann, "Use of item response curves of the Force and Motion Conceptual Evaluation to compare Japanese and American students' views on force and motion," Phys. Rev. Phys. Educ. Res., 020135 (2017).
[70] Timothy A Brown, Confirmatory Factor Analysis for Applied Research, 2nd ed. (The Guilford Press, 2015) pp. 72–75.
[71] D N Lawley and A E Maxwell, Factor Analysis as a Statistical Method (Butterworths, London, 1963) pp. viii, 117.
[72] Trevor I. Smith, Philip Eaton, Suzanne White Brahmia, Alexis Olsho, Andrew Boudreaux, Chris DePalma, Victor LaSasso, Scott Straguzzi, and Christopher Whitener, "Using psychometric tools as a window into students' quantitative reasoning in introductory physics," in Physics Education Research Conference 2019, PER Conference, edited by Ying Cao, Steven Wolf, and Michael Bennett (Provo, UT, 2019).
[73] Philip Eaton and Shannon D Willoughby, "Confirmatory factor analysis applied to the Force Concept Inventory," Phys. Rev. Phys. Educ. Res., 010124 (2018).
[74] Li-Jen Weng and Chung-Ping Cheng, "Parallel Analysis with Unidimensional Binary Data," Educational and Psychological Measurement, 697–716 (2005).
[75] Eric Brewe, Jesper Bruun, and Ian G Bearden, "Using module analysis for multiple choice responses: A new method applied to Force Concept Inventory data," Phys. Rev. Phys. Educ. Res., 020131 (2016).
[76] Trevor I. Smith, Suzanne White Brahmia, Alexis Olsho, and Andrew Boudreaux, "Physics Students' Implicit Connections Between Mathematical Ideas," in Proceedings of the 23rd Annual Conference on Research in Undergraduate Mathematics Education (accepted for publication) (2020).
[77] James Wells, Rachel Henderson, John Stewart, Gay Stewart, Jie Yang, and Adrienne Traxler, "Exploring the structure of misconceptions in the Force Concept Inventory with modified module analysis," Phys. Rev. Phys. Educ. Res., 020122 (2019).
[78] Nicholas J Foti, James M Hughes, and Daniel N Rockmore, "Nonparametric Sparsification of Complex Multiscale Networks," PLOS ONE, 1–10 (2011).
[79] Nicholas T Young and Andrew F Heckler, "Observed hierarchy of student proficiency with period, frequency, and angular frequency," Physical Review Physics Education Research, 010104 (2018).
[80] Martin Schrepp, "On the empirical construction of implications between bi-valued test items," Mathematical Social Sciences, 361–375 (1999).
[81] Anatol Sargin and Ali Ünlü, "Inductive item tree analysis: Corrections, improvements, and comparisons," Mathematical Social Sciences, 376–392 (2009).
[82] R Darrell Bock, "Estimating item parameters and latent ability when responses are scored in two or more nominal categories," Psychometrika, 29–51 (1972).
[83] R Darrell Bock and Irini Moustaki, "Item Response Theory in a General Framework," in Handbook of Statistics, Vol. 26, edited by C R Rao and S Sinharay (Elsevier, 2007) Chap. 15, pp. 469–514.
[84] Youngsuk Suh and Daniel M Bolt, "Nested Logit Models for Multiple-Choice Item Response Data," Psychometrika, 454–473 (2010).
[85] Marcos D Caballero, Bethany R Wilcox, Leanne Doughty, and Steven J Pollock, "Unpacking students' use of mathematics in upper-division physics: where do we go from here?" European Journal of Physics, 65004 (2015).
[86] T.I. Smith, J.R. Thompson, and D.B. Mountcastle, "Student understanding of Taylor series expansions in statistical mechanics," Physical Review Special Topics - Physics Education Research (2013), 10.1103/PhysRevSTPER.9.020110.
[87] Trevor I Smith, Donald B Mountcastle, and John R Thompson, "Student understanding of the Boltzmann factor," Phys. Rev. ST Phys. Educ. Res.