Exploring the Structure of Misconceptions in the Force Concept Inventory with Modified Module Analysis
James Wells, Rachel Henderson, John Stewart, Gay Stewart, Jie Yang, Adrienne Traxler
W. M. Keck Science Department of Claremont McKenna, Pitzer, and Scripps Colleges, Claremont, CA 91711
Michigan State University, Department of Physics and Astronomy, East Lansing, MI 48824
West Virginia University, Department of Physics and Astronomy, Morgantown, WV 26506
Wright State University, Department of Physics, Dayton, OH 45435

(Dated: May 16, 2019)

Module Analysis for Multiple-Choice Responses (MAMCR) was applied to a large sample of Force Concept Inventory (FCI) pretest and post-test responses (N_pre = 4509 and N_post = 4716) to replicate the results of the original MAMCR study and to understand the origins of the gender differences reported in a previous study of this data set. When the results of MAMCR could not be replicated, a modification of the method, Modified Module Analysis (MMA), was introduced. MMA was productive in understanding the structure of the incorrect answers in the FCI, identifying 9 groups of incorrect answers on the pretest and 11 groups on the post-test. These groups, in most cases, could be mapped onto common misconceptions used by the authors of the FCI to create distractors for the instrument. Of these incorrect answer groups, 6 of the pretest groups and 8 of the post-test groups were the same for men and women. Two of the male-only pretest groups disappeared with instruction, while the third male-only pretest group was identified for both men and women post-instruction. Three of the groups identified for both men and women on the post-test were not present for either on the pretest. The rest of the identified incorrect answer groups did not represent misconceptions, but were instead related to the blocked structure of some FCI items, where multiple items are related to a common stem. The groups identified had little relation to the gender-unfair items previously identified for this data set, and therefore, differences in the structure of student misconceptions between men and women cannot explain the gender differences reported for the FCI.

I. INTRODUCTION
The “gender gap,” gender differences between the scores of men and women on the Force Concept Inventory (FCI) [1] and other instruments developed by Physics Education Research (PER), has been extensively studied [2]. For the FCI, a substantial number of studies have suggested that some of the gender differences observed resulted from different response patterns of men and women to a subset of the items in the instrument [3]. The origin of these differential response patterns is, however, unknown. The purpose of this study is to apply Module Analysis for Multiple-Choice Responses, introduced by Brewe et al. [4], to a large sample of FCI responses known to contain a subset of items which produce substantially different response patterns for men and women in order to determine if the structure of the misconceptions of men and women differs on these items.
A. Background Studies
This work will draw heavily from three previous studies, which will be referenced as Study 1, Study 2, and Study 3 in this work.

∗ [email protected]
1. Study 1: Module Analysis
Study 1 introduced Module Analysis for Multiple-Choice Responses (MAMCR) to analyze concept inventory data at the level of individual responses to the items [4]. Unlike many analysis techniques applied to FCI data, which consider only a student’s overall score or only the correct answers to individual items, MAMCR considers each answer choice a student selected in order to provide a fine-grained examination of students’ misconceptions of Newtonian physics and to allow instructors to target specific errors.

MAMCR is based on network analytic techniques [5, 6]. A network is represented by a graph where nodes are connected to one another by edges. Edges can be weighted, where the value of the weight represents some aspect of the interaction. Network analysis is a highly successful and versatile set of methods which have been applied to a variety of problems, including the probability of homicide victimization among people living in a disadvantaged neighborhood [7], the mapping of functional networks in the brain from electrical signals [8], passing patterns of soccer teams in the World Cup [9], and the response of plants to bacterial infection [10].

Study 1 examined the FCI post-test scores of 143 first-year physics majors at a Danish university. The sample was 78% male and scored relatively highly on the exam: pre-test 65 ± 22% and post-test 81%. Brewe et al. emphasized that the results of this study should be generalized with care. The group of students tested was small, unusually high scoring, and had limited diversity. Likewise, there were several choices made during the process of applying MAMCR to the data which could have been made differently. Both the choice of sparsification method (described in Sec. III) and the decision of how to group responses that cluster together only on some of the one thousand applications of Infomap were somewhat arbitrary, as was the interpretation of the meaning of the modules. As will be seen in Sec. III, our data set required different choices to be made.
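The weighted response networks described above can be illustrated with a minimal example (the node labels and weights below are invented for illustration, not data from Study 1 or this study; networkx stands in for the igraph tooling named later in the paper):

```python
import networkx as nx

# A small weighted network: nodes are FCI responses (item number plus
# answer letter); an edge weight counts how many students selected
# both responses.  All values here are invented.
G = nx.Graph()
G.add_edge("1A", "2B", weight=40)   # 40 students chose both 1A and 2B
G.add_edge("1A", "3C", weight=12)
G.add_edge("2B", "3C", weight=25)

# Edge weights are ordinary attributes and can be inspected directly.
print(G["1A"]["2B"]["weight"])           # 40
print(G.degree("1A", weight="weight"))   # weighted degree: 40 + 12 = 52
```

Community detection algorithms such as Infomap operate on exactly this kind of weighted graph, looking for sets of nodes more strongly tied to each other than to the rest of the network.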
2. Study 2: Item Fairness and the FCI
In Study 2, Traxler et al. [3] explored item-level gender fairness of the FCI using Classical Test Theory [13], Item Response Theory [14], and Differential Item Functioning (DIF) [15, 16] analysis. An item is fair to men and women if men and women of equal overall ability score equally on the item. Using three samples, a graphical analysis identified five FCI items that were substantially unfair to women: item 14 (bowling ball falling out of an airplane), items 21 through 23 (sideways-drifting rocket with engine turning on and off), and item 27 (a large box being pushed across a horizontal floor). A further DIF analysis, which controlled for the student’s overall post-test score, identified eight items on the FCI as substantially unfair. Two of these were unfair to men: item 9 (speed of a puck after it receives a kick) and item 15 (a small car pushing a large truck). These eight items included the five items identified in the graphical analysis along with items 9, 12 (the trajectory of a cannonball shot off of a cliff), and 15. Many of the unfair items had been identified as unfair in previous studies. Overall, Study 2 demonstrated that eliminating all unfair items on the FCI to create a fair instrument reduced the gender gap by 50% in the largest sample.

This work, however, could not identify the source of the unfairness. The distribution of student responses was analyzed. Focusing on the five items that were identified with both the graphical analysis and the DIF analysis, incorrect female responses were predominately one of the distractors in each of the FCI items; however, the distractors chosen by the male students were less uniform in all five FCI items. Overall, Study 2 concluded that no physical principle or common misconception could explain the unfairness in the identified FCI items; however, this conclusion was drawn from a qualitative inspection of the items. The current study builds on the work in Study 2 by performing a quantitative analysis of the incorrect responses of men and women.
3. Study 3: Multidimensional Item Response Theory and the FCI
The study in the current work applies network analytic methods to understand the incorrect answer structure of the FCI. This structure might be influenced by features of the FCI which produce correlations between the correct answers. If a consistent misconception is being applied, it would form an alternate incorrect answer to sets of related correct answers. Study 3 examined the correct answer structure of the FCI using both exploratory and confirmatory methods [17]. Exploratory factor analysis (EFA) suggested that the practice of “blocking” items produced correlations between the items within the block. A block of items is a sequence of items which all refer to a common stem or where one item refers to a previous item. The FCI contains item blocks {5, 6}, {8, 9, 10, 11}, {15, 16}, {21, 22, 23, 24}, and {25, 26, 27}. Study 3 reported that the factors identified by EFA often loaded strongly on items in the same block, suggesting that blocking was generating correlations among the items in the block. Study 3 went on to produce a detailed model of the reasoning required to solve the FCI. Multidimensional Item Response Theory (MIRT) was used to test alternate models and allowed the identification of an optimal model. This model allowed the identification of groups of items with very similar solution structure: {5, 18}, {6, 7}, {17, 25}, and {4, 15, 28}. Study 3 only included the first item in a block in the analysis, and it is likely that item 16 should be added to the last group, which represents Newton’s 3rd law items. This mapping of item blocks and groups with similar solutions will be important to understanding the incorrect answer structure presented in this work.

B. Previous Studies of the FCI
The FCI, either in aggregate or disaggregated by gender, is one of the most studied instruments in PER. The present study examines item-level structure disaggregated by gender. The structure of the incorrect answers is examined to identify coherent patterns of incorrect answers.
1. Exploratory Analyses of the FCI
Many studies have examined the structure of the FCI, primarily using EFA. These studies began soon after the publication of the FCI, when Huffman and Heller [18] failed to extract the factor structure suggested by the authors of the instrument [1]. For a sample of 145 high school students, Huffman and Heller found only two post-test factors: “Newton’s 3rd law” and “Kinds of Forces.” The small number of factors may have resulted from a very conservative factor selection criterion. In the same study, for a sample of 750 university students, only one factor was identified: “Kinds of Forces.” A later work by Scott, Schumayer, and Gray [19] applied EFA to the FCI post-test scores of a sample of 2150 students and found an optimal model with 5 factors; however, one of the factors explained much of the variance. The result that a single factor explains the majority of the variance is fairly robust and is further supported by the high Cronbach alpha values reported in Study 2 and by Lasry et al. [20]. Scott and Schumayer [21] replicated their 5-factor analysis using MIRT on the same sample. Semak et al. [22] reported optimal models with 5 factors on the pretest and 6 factors on the post-test when exploring the evolution of student thinking for 427 algebra- and calculus-based introductory physics students. Study 3 also performed EFA using MIRT and reported 9 factors as optimal.
2. Gender and the FCI
In an extensive review of gender differences on physics concept inventories [2], men outperformed women by 13% on pretests and 12% on post-tests of conceptual mechanics: the FCI and the Force and Motion Conceptual Evaluation [23].

Many reasons have been explored to explain these differences. Differences in high school physics class election [24–26] may cause differences in college physics grades [27, 28]. In addition, many studies have identified gender differences in academic course grades, with women generally outperforming men [29]; these differences may influence conceptual inventory performance. Cognitive differences have also been advanced as explanations of academic gender differences [30–33], with women scoring generally higher on verbal reasoning tasks and men scoring generally higher on spatial reasoning tasks; however, cognitive differences between men and women are fine grained, with differences within the subskills of a single discipline [34]. Psychocultural factors have also been advanced as explanations of academic performance differences, including mathematics anxiety [35, 36], science anxiety [37–39], and stereotype threat [40]. For a more detailed discussion of the many sources that may influence the overall gender differences on physics conceptual inventories, see Henderson et al. [41].
3. Item Fairness and the FCI
In addition to student-centered explanations for conceptual inventory gender differences, bias in the individual FCI items has been investigated as a source of these gender differences. McCullough and Meltzer [42] randomly gave students the original FCI or a version where each problem’s context was modified to be more stereotypically familiar to women. In a sample of 222 algebra-based physics students, they found significant differences in performance on items 14, 22, 23, and 29. In 2004, in a sample of non-physics students, McCullough used a similar methodology [43] and found that female performance did not change while male performance decreased on the modified contexts. Multiple studies have reported item unfairness in unmodified items in the FCI [44, 45]. Study 2 provides a thorough summary of research into the item fairness of the FCI [3]. Recent research has suggested that other commonly used conceptual physics instruments do not contain a substantial number of unfair items [46].
C. Misconception Research
Since the early 1980s, student difficulties, most commonly known as “misconceptions” or “alternate conceptions/hypotheses,” have been extensively studied within physics classrooms. The early work done by Clement and colleagues [47–49], qualitatively analyzing the “alternate view of the relationship between force and acceleration” that is grounded in students’ experiences, has influenced much of the research examining conceptual understanding in physics. Halloun and Hestenes [50, 51] further explored this idea by collecting a taxonomy of “common sense concepts” that conflict with a Newtonian understanding of mechanics. Hestenes, Wells, and Swackhamer developed the FCI [1] with the intent of measuring student conceptual understanding of Newtonian theory, specifically analyzing student misconceptions pre- and post-instruction [52].
1. Misconceptions and the FCI
The authors of the FCI provided a detailed description of the misconceptions measured by the instrument [1]. A summary of those misconceptions follows.
Impetus.
Dating back to pre-Galilean times, the impetus model involves the idea that an object has a “motive power” that can explain why an object remains in motion regardless of any external forces [1, 51]. Students with this misconception do not fully understand Newton’s 1st law. For example, FCI items 6 and 7 describe a ball moving in a circle and ask about the path the ball will take after it exits a circular path. Selecting the circular trajectory after exiting the track demonstrates the misconception that the ball has a circular impetus.
Active Force.
The misconception that motion implies force involves the idea that an object in motion must be experiencing a force. This misconception involves a naive understanding of the difference between velocity and acceleration [1, 47] and demonstrates that Newton’s 2nd law is not well understood. For example, items 5 and 18 describe an object moving in a circular path and ask about the forces acting on the object. The motion-implies-force misconception would predict that there is a force in the direction of the motion.
Action/Reaction Pairs.
The misconception that the larger object exerts a greater force on a smaller object stems from the “dominance principle” [1, 51]. This misconception demonstrates that Newton’s 3rd law is not well understood. For example, items 4 and 15 describe a small car pushing a large truck and ask students to describe the forces between the two objects. The “dominance principle” misconception would predict that the truck exerts a larger force on the car than the car exerts on the truck.
Concatenation of Influences.
This misconception involves the idea that forces combine with “one force winning out over the other” [1]. This misconception demonstrates that the superposition principle for Newtonian forces is not well understood. For example, items 8 and 9 describe a hockey puck sliding horizontally at a constant speed on a frictionless surface. These items ask for the path that the hockey puck would take and the speed of the puck after it receives a swift kick. The misconception of “one force winning” would predict that the last force (i.e., the swift kick) determines the motion and speed of the puck.
Gravity.
The misconception that gravity is not a force stems from the Aristotelian physics idea that heavier objects tend to move toward the center of the earth and lighter objects tend to move away from the center of the earth [1, 51]. For example, FCI items 1 and 2 describe two metal balls of different weights that are (1) dropped at the same time and (2) rolled off of a horizontal table at the same speed; the items ask about the amount of time it takes for the two balls to hit the ground and the horizontal distance traveled, respectively. The gravity misconception predicts that the heavier ball falls faster and travels farther.

Recently, quantitative studies have begun to further probe the misconception structure of the FCI. Scott and Schumayer [53] applied EFA to all 150 responses, 5 per item, on the FCI pretest. The two most important factors each contained responses from the majority of the items in the FCI; contained both incorrect and correct responses; and mixed conceptually very different correct reasoning as characterized by the model in Study 3. For example, factor 1 contained correct responses to questions on Newton’s 1st law, Newton’s 3rd law, one- and two-dimensional motion under gravity, and one- and two-dimensional motion ignoring gravity. Three of the six factors showed evidence of students answering in patterns in the data set (always selecting response “A,” “C,” or “E” when unsure of the answer).
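An all-response analysis of the kind just described starts from a dummy-coded response matrix with one indicator column per (item, choice) pair. A minimal sketch follows (invented random data; a principal-component decomposition via SVD stands in here for the rotated EFA used in the published studies):

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented responses: 200 students x 30 items, answer choices coded 0-4
# (standing in for A-E).  Real analyses would use actual FCI records.
responses = rng.integers(0, 5, size=(200, 30))

# Dummy-code every response: one 0/1 indicator column per (item, choice)
# pair, giving 30 * 5 = 150 columns -- one per possible FCI response.
indicators = np.zeros((200, 150))
rows = np.arange(200)[:, None]
indicators[rows, 5 * np.arange(30) + responses] = 1

# Decompose the centered indicator matrix; the leading right singular
# vectors play the role of factor loadings on the 150 responses.
centered = indicators - indicators.mean(axis=0)
_, s, vt = np.linalg.svd(centered, full_matrices=False)
loadings = vt[:6]            # "factor" loadings on the 150 responses
print(loadings.shape)        # (6, 150)
```

With real data, large same-signed loadings within a factor would flag responses that tend to be selected together, which is what the studies above attempted to interpret conceptually.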
Eaton, Vavruska, and Willoughby [54] replicated this work for both pretest and post-test data; no consistent theme could be identified for multiple factors in their study. The failure of these studies to identify an intelligible factor structure containing items requiring related Newtonian reasoning may indicate that factoring the incorrect and correct items together in the same analysis is not productive, or may be seen as further support for Study 3, which concluded that EFA was not a productive method to explore the FCI.

Scott and Schumayer provided additional analysis of two of their factors using network analytic techniques [55]. As in this work, the network was constructed using the correlation matrix; however, only correlations within the factors identified in their earlier factor analysis were considered. This work reported node centrality measures, but did not use the community detection methods of Study 1.
2. Misconception Research
Many researchers have investigated students’ conceptual understanding, exploring the misconceptions outlined above. Early research explored the overall common difficulties and beliefs that students had about Newtonian mechanics [56–62]. More recently, researchers have designed systematic studies to explore student understanding and the epistemological development of Newton’s Laws of Motion [23, 63–66]. For example, Rosenblatt and Heckler developed a new assessment to investigate student understanding of the relationship between force, velocity, and acceleration [64]. This study found that understanding the relationship between velocity and acceleration was necessary for understanding the relationship between velocity and force; however, the reverse was not necessarily true.

In general, a misconception about mechanics can be defined as a non-Newtonian reasoning principle directly related to the physical systems addressed by Newtonian mechanics. Modeling coherent patterns of student wrong answers as misconceptions is only one of many ways to explain patterns of reasoning about mechanics. Other important theories include knowledge in pieces [67–69] and ontological categories [70–72]. The knowledge-in-pieces framework posits that student knowledge is formed of a number of granular facts, p-prims, that are activated either individually or collectively to produce a solution. In general, both p-prims and misconceptions are small segments of reasoning; p-prims are more general, while misconceptions are more specific to the physical context. The ontological categories framework is substantially different from either the misconception view or the knowledge-in-pieces view; it posits that incorrect student answers result from a misclassification of a concept, for example, misclassification of the concept of force as a substance that can be used up. A substantial amount of research has also investigated how students’ conceptual knowledge changes over time [73].
In PER, Hammer proposed an extension and unification of the knowledge-in-pieces and misconception views, modeling both misconceptions and p-prims as “resources” [74–76]. A resource was developed in analogy to a segment of computer code of arbitrary complexity. A p-prim would represent a fundamental subset of the code, while a misconception would represent a consistent misapplication of the code. Unlike the misconception view, the resource view identifies positive intellectual components which an instructor can activate to encourage the knowledge construction process. The quantitative method in the present work identified small segments of incorrect reasoning, and as such, cannot inform the resource view of student knowledge. In addition, the FCI was strongly developed within the misconception view and, therefore, this work will focus on that view.

The framework chosen, whether knowledge-in-pieces, misconceptions, or ontological categories, has different consequences for instruction or curriculum design in how it draws out and makes use of student ideas [75]. However, it is less clear that this difference is measured by conceptual inventories. Incoherence in student answers for the same concept might suggest a knowledge-in-pieces view, where different problem contexts can trigger different p-prims even if a physicist would see the scenarios as isomorphic. However, the FCI was not designed to measure this effect, and as such, a separate instrument designed around the knowledge-in-pieces or ontological categories frameworks is likely required to fully explore either framework.

In the text, we will primarily use the naive theory or misconceptions framework [77, 78], with notes in the Results where alternative frameworks seem relevant. Ultimately, while we call groups of incorrect answers identified by network analytic techniques “misconceptions,” this work is purely quantitative and cannot distinguish between the various theoretical frameworks developed to explain incorrect answering patterns.
D. Research Questions
This study attempted to reproduce the results of Study 1 disaggregated by gender. When the results were not reproduced, the reasons for the failure of MAMCR were explored and a modification of the algorithm, called “Modified Module Analysis” (MMA), was proposed. The modified algorithm was used to explore gendered differences in the patterns of incorrect answers on the FCI.

In general, network analysis uses the terms “community” and “module” interchangeably to represent connected (under some definition) subsets of a network. We adopt the term community instead of module in anticipation of the “igraph” package [79] in the “R” software system [80] becoming the primary network analysis tool in PER.

This study explored the following research questions:
RQ1: Are the results of Module Analysis for Multiple-Choice Responses replicable for large FCI data sets? If not, what changes to the algorithm are required to detect meaningful communities of incorrect answers?

RQ2: How do the communities detected change as network-building parameters are modified? Do these changes support the existence of a coherent non-Newtonian conceptual model?

RQ3: How is the incorrect answer community structure different between the pretest and the post-test?

RQ4: How is the incorrect answer community structure different for men and women? Do the differences explain the gender unfairness identified in the instrument?

This work extends the module analysis technique to a larger data set, explores alternate choices during that analysis, and contrasts structure between pre- and post-test data. Structural clues in the community structure are examined to explain unresolved questions about gender differences in answer choices [3].
II. METHODS

A. Instrument
The FCI is a 30-item instrument designed to measure a student’s facility with Newtonian mechanics [1]. The instrument includes items involving Newton’s three laws as well as items probing an understanding of one- and two-dimensional kinematics. The instrument was also constructed with distractors representing common student misconceptions.
B. Sample
The data for this study was collected at a large southern land-grant university serving approximately 25,000 students. Overall university undergraduate demographics were 79% White, 5% African American, 6% Hispanic, and other groups each with 3% or less [81].

The sample was collected in the introductory calculus-based mechanics class serving primarily physical scientists and engineers. The sample has been analyzed previously by Traxler et al. (Study 2, Sec. I A 2) [3]; it is referenced as Sample 1 in that work. The sample contains 4716 complete FCI post-test records (3628 men and 1088 women) and 4509 complete pretest records (3482 men and 1027 women). Table II in Study 2 reports basic descriptive statistics. On the pretest, men have an average percentage score of 43%, women 32%. On the post-test, men have an average percentage score of 73%, women 65%. The course in which the sample was collected was presented using the same pedagogy and managed by the same lead instructor for the period studied. A more thorough discussion of the sample and the instructional environment may be found in Study 2.
C. Analysis Methods
Initial replication of Study 1 was performed with the Infomap software available from mapequation.org [82]. All other statistical analysis was performed in the “R” statistical software system [80]. This work failed to replicate the Study 1 results and proposes a modified analysis method; as such, the analysis method is a result of the work, and the various network techniques employed are described as they are used.
III. RESULTS

A. Module Analysis
Figure 1 outlines the original and modified analysis steps. The original module analysis method presented in Study 1 first formed a bipartite network, a network that includes two types of nodes where all edges connect nodes of different types. This network included nodes representing students and nodes representing FCI responses. The bipartite network is then projected into a unipartite network containing only nodes representing FCI responses. Edges in this network connect different responses of the same student. Edge weights represent the number of students who selected the pair of responses connected by the edge. For example, if 40 students selected FCI responses 1A and 2B, where the number is the item number and the letter is the response within the item, there would be an edge between node 1A and node 2B with weight 40. While the bipartite network can be used to extract additional properties of the network [83], this was not done in Study 1. As such, we began with the unipartite network. The unipartite network can be represented by a two-dimensional matrix, called the adjacency matrix, adj(X, Y), where X and Y are FCI item responses (for example, X = 1A). The value adj(X, Y) is the number of students who selected response X and response Y. In the above example, adj(1A, 2B) = 40. The network representing the post-test responses of women on the FCI post-test is shown in Fig. 2. Because of the differences between men and women identified in Study 2 for this sample, all results are reported disaggregated by gender. The network in Fig. 2 is fairly representative of the pretest and post-test networks for men and women. Figure 2 uses a node placement algorithm that places more densely connected nodes close to one another. As in Study 1, only incorrect responses were included in the network. The correct responses are highly correlated and are often the most commonly selected responses. If they are included in the network, they form a tightly connected community that prevents exploration of the incorrect answers.

To attempt to replicate the results of Study 1, community detection algorithms were applied to the network shown in Fig. 2. First, a complete replication was attempted which employed the “Infomap” software available at “mapequation.org” [82], as was originally used in Study 1. This software, designed for very large networks, presents such significant installation and use barriers that it seems unlikely that it will ever achieve broad acceptance in PER. A second path to replication using the “infomap” implementation in the “igraph” package [79] in “R” was also attempted.

To extract meaningful structure from a high-density network, the network must generally be simplified without removing important structure. The process of simplifying a network by removing edges is called “sparsification.” The network sparsification method used in Study 1 was Locally Adaptive Network Sparsification (LANS) [84]. The LANS algorithm removes edges based on the distribution of edge weights connected to each node. The probability of selecting an edge with a smaller weight at random is compared to a predetermined significance level, and only edges above that level are retained. This method is locally adaptive because it depends only on the edges incident on a single node. A consequence of sparsifying based on the distribution of weights incident on each node is that no node will have its last edge removed, so no connected node is disconnected from the rest of the network. This ensures that local structures important to the global structure of the network are retained.

After sparsifying with LANS (using code from Traxler et al. [85], Supplemental Material), the Infomap CDA was applied. Infomap is based on information-theoretic methods. Infomap encodes a random walk through the network by assigning codewords to each node, then trying to minimize the length of the description.
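The locally adaptive retention rule just described can be sketched as follows. This is an illustrative implementation written for this summary, not the LANS code used in Study 1 or by Traxler et al.; it keeps an edge whenever, at either endpoint, the edge's weight sits above the chosen significance quantile of that node's incident weights:

```python
import networkx as nx

def lans_sparsify(G, alpha=0.05):
    """Keep an edge if, at either endpoint, the fraction of that node's
    incident edge weights that are <= this edge's weight is >= 1 - alpha.
    The maximum-weight edge at every node always passes, so no node is
    disconnected by the procedure (matching the LANS property noted in
    the text)."""
    keep = set()
    for node in G:
        weights = [d["weight"] for _, _, d in G.edges(node, data=True)]
        n = len(weights)
        for _, nbr, d in G.edges(node, data=True):
            # Empirical CDF of this edge's weight among the node's edges.
            cdf = sum(w <= d["weight"] for w in weights) / n
            if cdf >= 1 - alpha:
                keep.add(frozenset((node, nbr)))
    return nx.Graph((u, v, d) for u, v, d in G.edges(data=True)
                    if frozenset((u, v)) in keep)

# Demo: a weak edge between two well-connected nodes is dropped, but
# every node keeps its strongest edge.  Toy weights, not study data.
G = nx.Graph([("a", "x", {"weight": 10}), ("a", "y", {"weight": 10}),
              ("b", "u", {"weight": 10}), ("b", "v", {"weight": 10}),
              ("a", "b", {"weight": 1})])
H = lans_sparsify(G, alpha=0.05)
print(sorted(tuple(sorted(e)) for e in H.edges()))
```

Because the rule is evaluated per node, a globally small weight can survive at a sparsely connected node while the same weight is pruned at a hub, which is the "locally adaptive" behavior described above.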
Nodes visited more often are given shorter codes, and coherent communities, where the random walker tends to spend more time, are given their own unique codes to further reduce the information needed to represent the network. This way, the codes for individual nodes can be reused within the larger community structure. Nodes in one of the large communities are connected more to each other than to nodes outside the community. Because Infomap is not deterministic, it was run 1000 times and the communities that were most often found were selected as the misconception modules in Study 1.

Applying Infomap with LANS sparsification failed to identify meaningful community structure for the large data set in the current study; Infomap consistently identified only one large community.

To explore the source of the discrepancy with Study 1, an alternate implementation of Infomap was employed; this implementation was part of the “igraph” package in the “R” software system.

[Figure 1 flowchart. Left branch (MAMCR): original data (students and their answers) → bipartite incidence matrix (N_student × N_answer; each entry is 0 or 1 to show answers chosen) → unipartite adjacency matrix (N_answer × N_answer; each [X, Y] shows the frequency that answers X and Y were chosen by the same student) → sparsify network with LANS → community detection algorithm partitions network. Right branch (MMA): correlation matrix (N_answer × N_answer; each [X, Y] gives the correlation r between answers X and Y) → remove correct answers → sparsify correlation matrix (1. zero all entries below an r threshold; 2. remove answers with < 30 respondents; 3. zero entries where r is not statistically significant) → use reduced matrix as adjacency matrix → rerun non-deterministic community detection OR bootstrap a new network and rerun → threshold for how often answers appear in the same community; interpret groups above the threshold.]
Figure 1. Workflow of analysis for the original module analysis method (left branch) and our modified version (right branch).
Figure 2. Unipartite network for the FCI post-test responses of women.

"R" software system. A simpler sparsification algorithm was also employed. The LANS algorithm statistically evaluates each edge, but will not remove the last edge connecting a node. This algorithm is a reasonable choice for a network where every edge is purposeful (such as air travel), but may amplify noise in a network of student responses where some edges are the results of careless mistakes or guessing. As such, the network was also sparsified by imposing a threshold requiring edges to have a minimum weight. Multiple thresholds were tried. The Infomap community detection algorithm used in Study 1 identified only one community at all threshold values. Many other CDAs are available in the "igraph" package; some identified two communities even at very high thresholds. No CDA identified more than 2 communities.

Fig. 3 shows the communities identified by the "fast greedy" CDA at an edge weight threshold of N/10, where N is the number of participants; only 22 nodes remain connected at this threshold. Nodes in different communities are shown in different colors. The fast-greedy CDA [86] is an improved version of a modularity-based CDA algorithm [87]. Modularity is a measure that compares, for a given division of a network into communities, how many more intra-community links exist than expected by chance in an equivalent network [88]. Modularity values range from zero to one, where a modularity of zero means there is no clustering in the network and a modularity of one indicates a strongly clustered network.

There seem to be two likely sources of the differences between the results of this study and Study 1: sample size and the LANS algorithm. To investigate sample size, 100 subsamples of 143 students each were drawn from the sample in this study. Applying Infomap using "R" identified only one community 100% of the time with no sparsification, and one community 92% of the time with the requirement that the edge weight be at least N/10, where N is the number of students.

The igraph package implements many CDA algorithms; for the small network analyzed in this work, most perform similarly. For the rest of this work, the "fast-greedy" CDA described above will be used. Again, the data was subsampled to 143 students to compare with Study 1.

Figure 3. Communities detected for the adjacency matrix of FCI post-test responses of women with an edge weight threshold of N/10.

With no sparsification, the fast-greedy algorithm identified 3 to 6 communities, with 3 to 4 communities identified in 92% of the runs. With the edge weight greater than N/10 sparsification, fast-greedy identified 2 to 4 communities, with 66% of the runs identifying 3 communities. The communities identified made little theoretical sense within the framework of Study 3, with very different items in the same communities. As such, while some of the differences in the studies may be attributed to sample size, the choice of CDA also influenced the communities identified at small sample size. At the large sample size of the current study, the various community detection algorithms implemented in igraph give fairly similar results.
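As a concrete illustration of the left branch of Fig. 1, the sketch below builds the unipartite adjacency matrix from a synthetic bipartite incidence matrix, sparsifies it with an edge-weight threshold of N/10, and partitions it with the Clauset-Newman-Moore greedy modularity algorithm (networkx's `greedy_modularity_communities`, used here as a stand-in for igraph's fast-greedy CDA; the data and all variable names are fabricated for illustration):

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

# Synthetic incidence matrix: rows are students, columns are answer choices;
# a 1 means the student selected that answer.
rng = np.random.default_rng(0)
n_students = 400
group = rng.integers(0, 2, n_students)        # two latent answering patterns
B = np.zeros((n_students, 4), dtype=int)
B[group == 0, 0] = 1                          # pattern 0 selects answers 0 and 1
B[group == 0, 1] = 1
B[group == 1, 2] = 1                          # pattern 1 selects answers 2 and 3
B[group == 1, 3] = 1
B = B | (rng.random((n_students, 4)) < 0.05)  # occasional stray selections

# Unipartite adjacency matrix: entry [X, Y] counts students choosing both X and Y.
adj = B.T @ B
np.fill_diagonal(adj, 0)

# Sparsify with an edge-weight threshold of N/10, as in the text.
threshold = n_students / 10
g = nx.Graph()
for i in range(adj.shape[0]):
    for j in range(i + 1, adj.shape[1]):
        if adj[i, j] >= threshold:
            g.add_edge(i, j, weight=int(adj[i, j]))

communities = greedy_modularity_communities(g, weight="weight")
print([sorted(c) for c in communities])       # the two co-selected answer pairs
print(modularity(g, communities, weight="weight"))
```

On this toy network the two answer pairs separate cleanly; on real response data, as discussed above, the thresholded adjacency network retains many accidental co-selections, and the partition collapses to only one or two large communities.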
B. Correlation Analysis
Part of the cause of the failure of MAMCR to find meaningful community structure for large samples can be understood by comparing the adjacency matrix to the correlation matrix. The correlation matrix also defines a network, most usefully when a threshold value is applied. The adjacency matrix which produced the network in Fig. 2 has no obvious clustered structure. The partial correlation matrices reported in Study 3 clearly show clustering into distinct communities.

The correlation between item X and item Y is defined as:

corr(X, Y) = E[(X − μ_X)(Y − μ_Y)] / (σ_X σ_Y)   (1)

where μ_j is the mean of variable j, σ_j is the standard deviation, and E[Z] is the expectation value of the random variable Z. The expectation value is defined as:

E[X] = (1/N) Σ_i X_i   (2)

where i is a participant and N is the number of participants. Equation 1 can be simplified to produce Eq. 3:

corr(X, Y) = (E[X · Y] − μ_X · μ_Y) / (σ_X σ_Y)   (3)

For dichotomously scored items, the sum Σ_i X_i Y_i is the X, Y entry in the adjacency matrix, adj(X, Y) = Σ_i X_i Y_i. The correlation matrix is then related to the adjacency matrix by the expression:

corr(X, Y) = (adj(X, Y) − N μ_X μ_Y) / (N σ_X σ_Y)   (4)

A pair of items can have a large adj(X, Y) in a number of ways: (a) purposeful association, where students preferentially select the two items together, or (b) accidental association, where many students select both items, so on average the items are selected together often. By subtracting the product of the means, the correlation matrix eliminates the second case and only has large values for purposefully selected pairs. This suggests the adjacency matrix contains many more edges that are the result of random chance than the correlation matrix.
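Equation 4 is easy to check numerically for dichotomous data. The sketch below (synthetic data, numpy only; not the authors' code) compares the adjacency-to-correlation identity against a direct Pearson correlation:

```python
import numpy as np

# Synthetic dichotomous responses for two answers, X and Y, over N students.
rng = np.random.default_rng(0)
N = 500
X = rng.integers(0, 2, N)
Y = np.where(rng.random(N) < 0.7, X, 1 - X)   # Y agrees with X 70% of the time

# Entry of the adjacency matrix: number of students selecting both answers.
adj_XY = np.sum(X * Y)

# Eq. 4: corr(X, Y) = (adj(X, Y) - N mu_X mu_Y) / (N sigma_X sigma_Y)
corr_eq4 = (adj_XY - N * X.mean() * Y.mean()) / (N * X.std() * Y.std())

# Direct Pearson correlation for comparison.
corr_direct = np.corrcoef(X, Y)[0, 1]
print(corr_eq4, corr_direct)   # the two values agree
```

Note that `np.std` here is the population standard deviation, matching the definition used in Eqs. 1-4.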
The correlation matrix also has the substantial advantage of the existence of significance tests for entries, allowing the discarding of non-significant edges.

With this observation, we propose a modification of MAMCR, called Modified Module Analysis (MMA), that investigates the community structure of the correlation matrix. The remainder of this work investigates this proposal. The differences between MAMCR and MMA are presented schematically in Fig. 1.

To explore this proposal, the correlation matrix was calculated for all incorrect answers. Nodes with too few participants to be statistically reliable were eliminated; for this work, nodes with fewer than 30 responses were removed. Edges were removed where the correlation, r, between the two nodes was not statistically significant, and the remaining network was examined at thresholds of r > 0.15, r > 0.20, and r > 0.25. The representation in Fig. 4 was produced by the "qgraph" package in "R" [89]. The width of the line is proportional to the size of the correlation. Node placement is for visual effect only.
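The MMA network construction just described can be sketched as follows (a minimal illustration with numpy and scipy; the function name, the defaults, and the 0.05 significance level are our assumptions, not the authors' code):

```python
import numpy as np
from scipy import stats

def sparsify_correlations(responses, r_threshold=0.20, min_respondents=30,
                          alpha=0.05):
    """Build a thresholded correlation network from dichotomous answer data.

    responses: (N_students, N_answers) 0/1 matrix of answer selections.
    Returns an (N_answers, N_answers) matrix in which correlations that are
    non-significant, below the r threshold, or involve a rarely chosen
    answer are zeroed; the result can be used as a weighted adjacency matrix.
    """
    n_students, n_answers = responses.shape
    keep = responses.sum(axis=0) >= min_respondents   # drop rare answers
    corr = np.zeros((n_answers, n_answers))
    for i in range(n_answers):
        for j in range(i + 1, n_answers):
            if not (keep[i] and keep[j]):
                continue
            r, p = stats.pearsonr(responses[:, i], responses[:, j])
            # Retain only significant, positive correlations above threshold.
            if p < alpha and r > r_threshold:
                corr[i, j] = corr[j, i] = r
    return corr
```

The returned matrix corresponds to the right branch of Fig. 1: it is the reduced matrix that MMA treats as a weighted adjacency matrix for community detection.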
C. Modified Module Analysis
The correlation matrices in Fig. 4 show a clear clustered structure. MMA was applied to understand these structures. The communities detected for the r > 0.2 network are summarized by a community matrix whose entry for each pair of items, C, gives the fraction of the bootstrap subsamples in which the pair of items were found in the same community. The community matrix was filtered to show items that were identified in C > 60% and C > 80% of the communities in the 1000 bootstrap replications in Table I. The majority of the communities extracted from the community matrix were fully connected; each node was connected to every other node in the community. Some, however, were not. The intra-community density, γ, is defined as the ratio of the number of edges in the community to the maximum number of edges possible [6]. For communities with γ < 1, γ is presented as a percentage in parenthesis in Table I. For example, if a community contains four nodes then there are a maximum of six distinct edges between the nodes. If the community only possesses five of those edges, then γ = 5/6.

D. The Structure of Incorrect FCI Responses
Unless otherwise stated, results below are reported for C > 80% and r > 0.2.
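The bootstrap stability measure C and the intra-community density γ described above can be sketched as follows (again a toy illustration: networkx's greedy modularity communities stands in for the CDA used in this work, and the function names are ours):

```python
import itertools
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def community_matrix(responses, r_threshold=0.2, n_boot=200, seed=0):
    """Entry [i, j] is C: the fraction of bootstrap subsamples in which
    answers i and j were placed in the same community."""
    n_students, n_answers = responses.shape
    rng = np.random.default_rng(seed)
    counts = np.zeros((n_answers, n_answers))
    for _ in range(n_boot):
        # Resample students with replacement and rebuild the network.
        resample = responses[rng.integers(0, n_students, n_students)]
        corr = np.nan_to_num(np.corrcoef(resample, rowvar=False))
        g = nx.Graph()
        for i, j in itertools.combinations(range(n_answers), 2):
            if corr[i, j] > r_threshold:
                g.add_edge(i, j, weight=corr[i, j])
        if g.number_of_edges() == 0:
            continue
        for comm in greedy_modularity_communities(g, weight="weight"):
            for i, j in itertools.combinations(sorted(comm), 2):
                counts[i, j] += 1
                counts[j, i] += 1
    return counts / n_boot

def intra_community_density(graph, community):
    """gamma: edges present in the community over the maximum possible."""
    k = len(community)
    return graph.subgraph(community).number_of_edges() / (k * (k - 1) / 2)
```

For the four-node example in the text, a community holding five of its six possible edges gives γ = 5/6, about 83%.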
1. Types of Incorrect Communities
Table II classifies the incorrect reasoning for each community of incorrect answers in Table I. These can be divided into two general classes: communities resulting
Figure 4. Post-test correlation matrices of women at varying levels of r: (a) r > 0.15, (b) r > 0.20, (c) r > 0.25.

Figure 5. Communities detected in the FCI correlation matrix with r > 0.2. Each community is drawn in a different color.

from blocking and communities resulting from consistently applied incorrect reasoning (misconceptions). Communities {8A, 9B}, {21B, 23C}, and {21C, 22A} are answers within blocked problems where the second answer in the pair would be correct if the first answer was correct. The other communities apply either the same incorrect reasoning or related incorrect reasoning.

Hestenes and Jackson produced a detailed taxonomy of the naive conceptions (their terminology) tested by the FCI [92]. Table II shows a mapping of this taxonomy onto the incorrect answer communities identified in the current work. The taxonomy divides the naive conceptions into a general category and a number of sub-categories. The number in parenthesis in Table II is the sub-category label [92]. Items marked with an asterisk in Table II are part of item blocks. Because the relation between the items seems to be largely generated by the interdependencies resulting from blocking rather than consistently applied misconceptions, the blocked items will not be discussed further.

Some issues arise in comparing the proposed FCI taxonomy with the communities identified by MMA and the similar item blocks identified in Study 3. First, for some of the items in the incorrect communities, no misconception was identified (items 1D, 2C, and 2D). Students are answering these items in a correlated manner, which implies the possibility of consistent reasoning patterns; for these items a possible misconception was suggested. The new misconception was labeled "(Add)." Table II shows that the items in the incorrect communities identified by MMA often belong to multiple naive misconception categories and have different sub-categories. This would seem to imply that the naive conception taxonomy is more detailed than the actual application of misconceptions by students as measured by the FCI.
For example, in the Newton's 3rd law community, {4A, 15C, 28D}, different items involve objects of different activity, from one student pushing on another student (item 28, one active object), to a car pushing a truck (item 15, one active object), to a head-on collision (item 4, two active objects). This distinction does not seem important to the

Table I. Communities identified in the pretest and post-test incorrect answers at r > 0.2, filtered at the community threshold C. The number in parenthesis is the intra-community density, γ, for communities where the intra-community density is not one. [Table body not recovered; columns give the communities identified for men and women on the pretest and post-test at C > 60% and C > 80%.]

student's answering pattern. It is unclear if this results from one misconception taking precedence over another or from students applying more general reasoning as proposed by the resource or knowledge-in-pieces models. As such, Table II includes a column which proposes a title for the dominant misconception. In many cases, the dominant misconception was identified as the misconception shared by the majority of the items. In some cases, a dominant misconception was proposed. For the Newton's 3rd law community, {4A, 15C, 28D}, multiple misconceptions

Table II. Misconceptions represented by incorrect answer communities. Communities marked with a ∗ result from blocked problems. Proposed additions are marked (Add). Proposed items to be removed are marked (Remove). If Add or Remove is placed before all items, it applies to all items. If Add or Remove is placed before only one of many items, it applies to that item.

Community | Naive Conceptions (Category; Sub-Category) | Dominant Misconception
1A, 2C | Gravity; 1A (G3): Heavier objects fall faster | Heavier objects fall faster
5E, 18E | Impetus, Active Forces; 18D (AF2): Motion implies active forces; 5E (I1): Impetus supplied by "hit"; (Remove) 5E (I5): Circular impetus; (Add) 18E (I1): Motion implies active forces; 5E, 18E (CF): Centrifugal force | Motion implies active forces, centrifugal force
6A, 7A | Impetus; 6A, 7A (I5): Circular impetus | Circular impetus
8A, 9B∗ | Concatenation of Influences; 8A, 9B (CI3): Last force to act determines motion |
11B, 29A | Other Influences on Motion, Impetus; (Remove) 11B (Ob): Obstacles exert no force; (Remove) 11B (I1): Impetus supplied by "hit"; (Add) 11B (AF2): Motion implies active forces | Motion implies active forces
11C, 13C, 30E | Impetus; 11C, 30E (I1): Impetus supplied by "hit" | Motion implies active forces
17A, 25D | Concatenation of Influences, Resistance; 25D (CI1): Largest force determines motion; 25D (R2): Motion when force overcomes resistance | Largest force determines motion
21B, 23C∗ | Concatenation of Influences; 21B, 23C (CI3): Last force to act determines motion |
21C, 22A∗ | Concatenation of Influences, Active Forces; 21C (CI2): Force compromise determines motion; 22A (AF4): Velocity proportional to applied force |
23D, 24C∗ | Impetus; 23D, 24C (I3): Impetus dissipation; 23D (I2): Loss/recovery of original impetus | Impetus dissipation

The community {11B, 29A} is also curious. In item 11, a hockey puck is struck, activating the impetus supplied by the "hit" misconception, but response 11B explicitly asks about a force in the direction of motion. As such, we propose this item also tests the motion implies active forces misconception. It is also unclear how item 11B probes the obstacles exert no force misconception; we propose it be removed from the item. Item 29 involves a chair sitting on a floor; response 29A identifies only the force of gravity on the object and ignores the normal force. It seems difficult to claim this community probes a common misconception. Item 29 was also demonstrated to have poor psychometric properties in Study 2; the correlation between 11B and 29A may have resulted from 29A not functioning as intended.

The community {11C, 13C, 30E} continues to convolve the motion implies active forces misconception with the impetus supplied by the "hit" misconception. Response 30E explicitly discusses the force of the "hit" while items 11C and 13C discuss a force in the direction of motion. We propose adding this misconception to items 11C and 13C. Further, only item 13C involves the idea of a dissipation of impetus. For this community, while multiple misconceptions are tested, one seems to dominate student responses: motion implies active forces.

Finally, the blocked item responses 23D, 24C differ from the other blocked responses.
Rather than the second response being the correct answer if the first response was correct, both appear to be applications of the dissipation of impetus misconception.
2. Reducing Sparsification
The analysis was repeated at the less restrictive threshold r > 0.15 for both C > 60% and C > 80%. While the reduced threshold allowed the community { } to be detected for both men and women, most other new communities identified did not result from the merger of communities identified at more restrictive thresholds. Particularly on the pretest, the larger communities do not make much sense in terms of the framework of Study 3. This is particularly evident in the mixing of the Newton's 3rd law items {4, 15, 16, and 28} with other items. As such, it appears that student misconceptions exist relatively independently as small groups of consistent answers, not as part of a larger coherent framework.
3. The Strength of Common Misconceptions
One motivation of Study 1 was to provide instructors with a mechanism for identifying common misconceptions so that specific interventions could be targeted to address those misconceptions. The communities of incorrect answers remaining on the post-test, as shown in Table I, could be used to provide a measure of the prevalence of the misconception in the classes studied. Table III presents an overall average for each incorrect community in Table I on the post-test. Only communities that did not result from problem blocking are presented. Averages were calculated by assigning a score of 1 if the response was selected and 0 if it was not, then averaging over each item in the group. Results are disaggregated by gender, and the p-value for a t-test to determine if differences by gender are significant is also presented; Cohen's d provides a measure of effect size. Cohen suggests that d = 0.2 represents a small effect, d = 0.5 a medium effect, and d = 0.8 a large effect.

Table III. Percentage of students selecting each incorrect community for the FCI post-test. A t-test was performed to determine if the differences between men and women were significant; the p-value is presented. Cohen's d for the difference is also presented.

Community | Male Ave. (%) | Female Ave. (%) | p | d | Misconception
4A, 15C, 28D | 32 ± 47 | 33 ± 47 | 0.27 | 0.02 | Newton's 3rd law misconceptions
5D, 11C, 13C, 18D, 30E | 22 ± 42 | 20 ± 40 | < .001 | 0.06 | Motion implies active forces
5E, 18E | 7 ± 25 | 7 ± 25 | 0.69 | 0.01 | Motion implies active forces, centrifugal force
6A, 7A | 14 ± 35 | 5 ± 22 | < .001 | 0.39 | Circular impetus
17A, 25D | 42 ± 49 | 37 ± 48 | < .001 | 0.11 | Largest force determines motion

the percentage of students who answer an item incorrectly; therefore, only differences in Table III greater than 8% represent unexpected differences between men and women. Only items {6A, 7A} exceed this difference, and then only slightly, with a difference of 9%. Items {6A, 7A} are also the only community with differences of at least a small effect size; however, the effect size is likely inflated by the small standard deviation of women because of a floor effect. In general, the rate of selecting one of the communities of common incorrect answers was very similar for men and women.

For the class studied, the results of Table III suggest that additional effort be directed to addressing the largest force determines motion misconception measured by {17A, 25D} and Newton's 3rd law misconceptions measured by {4A, 15C, 28D}.

IV. DISCUSSION

A. Research Questions
This study sought to answer four research questions; they will be addressed in the order proposed.
RQ1: Are the results of Module Analysis for Multiple-Choice Responses replicable for large FCI data sets? If not, what changes to the algorithm are required to detect meaningful communities of incorrect answers?
The MAMCR process described in Study 1 identified only one or two communities in our data, whether using LANS or an edge weight threshold to sparsify the network. This result held for Infomap and for other CDAs. Reducing the data to a comparable size by subsampling generated more communities, but still fewer than identified in Study 1; moreover, the communities identified did not make conceptual sense. We concluded that the community structure identified in Study 1 was the result of the low sample size and the LANS algorithm, and that modifications to MAMCR were needed to productively identify incorrect answer communities.

The failure of MAMCR for large samples led us to propose a variant of the algorithm using the correlation matrix instead of the adjacency matrix to build the network. This matrix was sparsified by removing statistically insignificant correlations and correlations below a threshold (r < 0.2).

RQ2: How do the communities detected change as network-building parameters are modified? Do these changes support the existence of a coherent non-Newtonian conceptual model?
A more permissive threshold for the correlation matrix (r > 0.15) yielded larger communities, as shown in the Supplemental Material [93]. These larger communities were not formed by the joining of smaller communities related to the same misconception; in fact, many of the communities contained items that had little conceptual relation. As such, it appears that the best model of student misconceptions is as isolated pieces of reasoning associated with items with a similar correct solution structure.
RQ3: How is the incorrect answer community structure different between the pretest and the post-test?
For C > 80% and r > 0.2, a total of 14 incorrect answer communities were identified for either men or women pre- and post-instruction; 5 of the communities were consistently identified for both genders pre- and post-instruction. Three of these five represent consistently applied misconceptions: {4A, 15C, 28D}, Newton's 3rd law misconceptions; {5E, 18E}, motion implies active forces and the existence of a centrifugal force; and {6A, 7A}, circular impetus. The items from which the incorrect answers in these communities were drawn were all identified as having very similar correct solution structure in Study 3. The other two communities were drawn from problem blocks: { } and { }. Three incorrect communities disappeared with instruction: for all students, { }, motion implies active forces; for men only, { }, heavier objects fall faster, and { }, lighter objects fall faster. Many incorrect communities were only identified post-instruction, including { }, involving items with similar solution structure as identified in Study 3.

RQ4: How is the incorrect answer community structure different for men and women? Do the differences explain the gender unfairness identified in the instrument?

Post-instruction, using
C > 80% and r > 0.2, 11 communities were identified for either men or women; 8 of these communities were identified for both men and women. One of the other three communities was only identified for women; this community, { }, represents the motion implies active forces misconception. This community was the merger of the two communities only identified for men, { } and { }; the female community was also not completely connected. One community, { }, was the result of blocking and was identified for both men and women post-instruction. The other two communities unique to men, { } and { }, involve the heavier objects fall faster and the lighter objects fall faster misconceptions. The misconception structure of men and women was quite similar pre-instruction, with men holding more consistent misconceptions.

B. Additional Observations
The misconception communities identified by MMA were not completely consistent with the naive conception taxonomy provided by Hestenes and Jackson for the FCI [92]. Often multiple naive conceptions were associated with the same community. This may indicate that student reasoning is better modeled by a more general framework such as knowledge-in-pieces or ontological categories. It may also indicate that the FCI cannot fully resolve the detailed set of misconceptions identified in the taxonomy.

The results of this work were not consistent with recent exploratory analyses of the FCI [53-55], which identified a few large factors; these factors mixed very different correct and incorrect responses. The small communities identified in the current work, which are partially supported by the taxonomy of Hestenes and Jackson, seem to indicate that MMA may be a more productive quantitative method to explore misconceptions.
V. IMPLICATIONS
Not all of the communities identified in Table I represent misconceptions. Some represent combinations of dependent answers. For these combinations, the second answer is correct if the first answer was the correct answer. This suggests that, because of the blocking of items in the FCI, a simple scoring of the instrument with each item as correct or incorrect may understate a student's knowledge of the material. Previous authors have called for reevaluating the scoring of the FCI [95], but not because of problem blocking.

The identification of three communities of incorrect answers that were the result of item blocking further supports the conclusions of Study 3 that item blocking should be discontinued in future PER instruments because it may make the instruments difficult to interpret statistically.

The misconception communities identified in Table II allow instructors to determine the strength of students' misconceptions as they enter a physics class and the remaining strength after instruction, as shown in Table III. This should allow instructors to adjust their classes to address misconceptions remaining after instruction and to direct fewer resources to addressing misconceptions that are not present pre-instruction.
VI. FUTURE WORK
MMA was productive in extending the understanding of the incorrect answer structure of the FCI; it will be extended to other conceptual instruments including the Force and Motion Conceptual Evaluation [23] and the Conceptual Survey of Electricity and Magnetism [96].

Network analysis encompasses a broad collection of powerful analysis techniques. The analysis in this work represents the barest beginnings of the possibilities of these techniques. Future research may consider networks with multiple types of nodes (possibly correct and incorrect answers, or pretest and post-test answers) or multiple types of edges (possibly negative and positive correlations).
VII. CONCLUSION
Previous results reported for Module Analysis for Multiple-Choice Responses (MAMCR) could not be replicated for a large sample. The failure of the algorithm at large sample size likely results from a combination of unpurposeful edges in the adjacency matrix at large sample sizes and properties of the LANS sparsification algorithm. A modification of the algorithm, Modified Module Analysis (MMA), based on the correlation matrix was productive in identifying useful community structure. MMA identified 11 communities on the post-test and 9 on the pretest. Most of these communities were identified both for men and women: 8 on the post-test and 6 on the pretest. In general, the incorrect answer community structure identified for men and women was very similar and could not explain the gender differences previously identified in a subset of items in the instrument. The communities identified at high sparsification failed to merge into larger communities addressing similar misconceptions as sparsification was reduced, suggesting that students do not have an integrated non-Newtonian belief system, but rather isolated incorrect beliefs strongly tied to the type of question asked.
ACKNOWLEDGMENTS
This work was supported in part by the National Sci-ence Foundation as part of the evaluation of improved learning for the Physics Teacher Education Coalition,PHY-0108787. [1] D. Hestenes, M. Wells, and G. Swackhamer, “Force Con-cept Inventory,” Phys. Teach. , 141–158 (1992).[2] A. Madsen, S.B. McKagan, and E. Sayre, “Gender gapon concept inventories in physics: What is consistent,what is inconsistent, and what factors influence the gap?”Phys. Rev. Phys. Educ. Res. , 020121 (2013).[3] A. Traxler, R. Henderson, J. Stewart, G. Stewart, A. Pa-pak, and R. Lindell, “Gender fairness within the ForceConcept Inventory,” Phys. Rev. Phys. Educ. Res. ,010103 (2018).[4] E. Brewe, J. Bruun, and I.G. Bearden, “Using mod-ule analysis for multiple choice responses: A newmethod applied to Force Concept Inventory data,”Phys. Rev. Phys. Educ. Res. , 020131 (2016).[5] M.J. Newman, Networks, 2nd ed. (Oxford UniversityPress, New York, NY, 2018).[6] K.A. Zweig,
Network Analysis Literacy: A Practical Ap-proach to the Analysis of Networks (Springer-Verlag,Wien, Austria, 2016).[7] A.V. Papachristos and C. Wildeman, “Network exposureand homicide victimization in an African American com-munity,” Am. J. Public Health , 143–150 (2014).[8] F. De Vico, J. Richiardi, M. Chavez, and S. Achard,“Graph analysis of functional brain networks: Practicalissues in translational neuroscience,” Philos. T. R. Soc.Lon. B (2014).[9] J. Lop´ez Pe˜na and H. Touchette, “A network theoryanalysis of football strategies,” in
Sports Physics: Proc.2012 Euromech Physics of Sports Conference , edited byC. Clanet (´Editions de l’´Ecole Polytechnique, 2012) pp.517–528.[10] Z. Zheng and Y. Zhao, “Transcriptome comparison andgene coexpression network analysis provide a systemsview of citrus response to “CandidatusLiberibacter asi-aticus” infection,” BMC Genomics , 27 (2013).[11] S. Fortunato and D. Hric, “Community detection in net-works: A user guide,” Physics Reports , 1–44 (2016).[12] M. Rosvall and C.T. Bergstrom, “Maps of randomwalks on complex networks reveal community structure,”P. Natl. Acad. Sci. USA , 1118–1123 (2008).[13] L. Crocker and J. Algina, Introduction to Classical andModern Test Theory (Holt, Rinehart and Winston, NewYork, 1986).[14] R.J. De Ayala,
The theory and practice of item responsetheory (Guilford Publications, 2013). [15] P.W. Holland and D.T. Thayer, “An alternate definitionof the ETS delta scale of item difficulty,” ETS ResearchReport Series
Research Report RR-85-43 (1985).[16] P.W. Holland and D.T. Thayer, “Differential item per-formance and the Mantel-Haenszel procedure,” in
TestValidity , edited by H. Wainer and H. I. Braun (LawrenceErlbaum, Hillsdale, NJ, 1993) pp. 129–145.[17] J. Stewart, C. Zabriskie, S. DeVore, andG. Stewart, “Multidimensional item responsetheory and the Force Concept Inventory,”Phys. Rev. Phys. Educ. Res. , 010137 (2018).[18] D. Huffman and P. Heller, “What does theForce Concept Inventory actually measure?”Phys. Teach. , 138 (1995).[19] T.F. Scott, D. Schumayer, and A.R. Gray, “Exploratoryfactor analysis of a Force Concept Inventory data set,”Phys. Rev. Phys. Educ. Res. , 020105 (2012).[20] N. Lasry, S. Rosenfield, H. Dedic, A. Dahan, andO. Reshef, “The puzzling reliability of the Force Con-cept Inventory,” Am. J. Phys. , 909–912 (2011).[21] T.F. Scott and D. Schumayer, “Students’ proficiencyscores within multitrait item response theory,” Phys.Rev. Phys. Educ. Res. , 020134 (2015).[22] M.R. Semak, R.D. Dietz, R.H. Pearson, andC.W. Willis, “Examining evolving performance onthe Force Concept Inventory using factor analysis,”Phys. Rev. Phys. Educ. Res. , 010103 (2017).[23] R.K. Thornton and D.R. Sokoloff, “Assessing stu-dent learning of Newton’s laws: The Force andMotion Conceptual Evaluation and the evaluationof active learning laboratory and lecture curricula,”Am. J. Phys. , 338–352 (1998).[24] C. Nord, S. Roey, S. Perkins, M. Lyons,N. Lemanski, J. Schuknecht, and J. Brown,“American High School Graduates: Results of the 2009 NAEP High School Transcript Study,” USDepartment of Education, National Center for EducationStatistics, Washington, DC (2011).[25] B.C. Cunningham, K.M. Hoyer, and D. Sparks, Gender Differences in Science, Technology, Engineering, and Mathematics (STEM) Interest, Credits Earned, and NAEP Performance in the 12th Grade (National Center for Education Statistics, Washington,DC, 2015).[26] B.C. Cunningham, K.M. Hoyer, and D. Sparks,“The Condition of STEM 2016,” ACT Inc., Iowa City,IA (2016).[27] P.M. Sadler and R.H. 
Tai, “Success in introductory col-lege physics: The role of high school preparation,” Sci. Educ. , 111–136 (2001).[28] Z. Hazari, R.H. Tai, and P.M. Sadler, “Gender dif-ferences in introductory university physics performance:The influence of high school physics preparation and af-fective factors,” Sci. Educ. , 847–876 (2007).[29] D. Voyer and S.D. Voyer, “Gender differences in scholas-tic achievement: A meta-analysis.” Psychol. Bull. ,1174 (2014).[30] Y. Maeda and S.Y. Yoon, “A meta-analysis on gender dif-ferences in mental rotation ability measured by the Pur-due Spatial Visualization Tests: Visualization of Rota-tions (PSVT: R),” Educ. Psychol. Rev. , 69–94 (2013).[31] D.F. Halpern, Sex Differences in Cognitive Abilities, 4thed. (Psychology Press, Francis & Tayler Group, NewYork, NY, 2012).[32] J.S. Hyde and M.C. Linn, “Gender differences in verbalability: A meta-analysis.” Psychol. Bull. , 53 (1988).[33] J.S. Hyde, E. Fennema, and S.J. Lamon, “Gender dif-ferences in mathematics performance: A meta-analysis.”Psychol. Bull. , 139 (1990).[34] N.S. Cole,
The ETS Gender Study: How Females andMales Perform in Educational Settings (EducationalTesting Service, Princeton, NJ, 1997).[35] N.M. Else-Quest, J.S. Hyde, and M.C. Linn, “Cross-national patterns of gender differences in mathematics:A meta-analysis.” Psychol. Bull. , 103 (2010).[36] X. Ma, “A meta-analysis of the relationship between anx-iety toward mathematics and achievement in mathemat-ics,” Jour. Res. Math. Educ. , 520–540 (1999).[37] J.V. Mallow and S.L. Greenburg, “Science anxiety:Causes and remedies,” J. Coll. Sci. Teach. , 356–358(1982).[38] M.K. Udo, G.P. Ramsey, and J.V. Mallow, “Science anx-iety and gender in students taking general education sci-ence courses,” J. Sci. Educ. Technol. , 435–446 (2004).[39] J. Mallow, H. Kastrup, F.B. Bryant, N. Hislop,R. Shefner, and M. Udo, “Science anxiety, science atti-tudes, and gender: Interviews from a binational study,”J. Sci. Educ. Technol. , 356–369 (2010).[40] J.R. Shapiro and A.M. Williams, “The role ofstereotype threats in undermining girls’ andwomen’s performance and interest in STEM fields,”Sex Roles , 175–183 (2012).[41] R. Henderson, G. Stewart, J. Stewart, L. Michaluk, andA. Traxler, “Exploring the gender gap in the ConceptualSurvey of Electricity and Magnetism,” Phys. Rev. Phys.Educ. Res. , 020114 (2017).[42] L. McCullough and D.E. Meltzer, “Differ-ences in male/female response patterns onalternative-format versions of FCI items,” in ,edited by K. Cummings, S. Franklin, and J. Marx (AIPPublishing, New York, 2001) pp. 103–106.[43] L. McCullough, “Gender, context, and physics assess-ment,” J. Int. Womens St. , 20–30 (2004).[44] S. Osborn Popp, D. Meltzer, and M.C. Megowan-Romanowicz, “Is the Force Concept Inventory bi-ased? Investigating differential item functioningon a test of conceptual learning in physics,” in (American Education Research Association, Washington,DC, 2011).[45] R.D. Dietz, R.H. Pearson, M.R. Semak, and C.W.Willis, “Gender bias in the Force Concept Inventory?” in ,Vol. 
1413, edited by N.S. Rebello, P.V. Engelhardt, and C. Singh (AIP Publishing, New York, 2012) pp. 171–174.
[46] R. Henderson, P. Miller, J. Stewart, A. Traxler, and R. Lindell, “Item-level gender fairness in the Force and Motion Conceptual Evaluation and the Conceptual Survey of Electricity and Magnetism,” Phys. Rev. Phys. Educ. Res., 020103 (2018).
[47] J. Clement, “Students’ preconceptions in introductory mechanics,” Am. J. Phys., 66–71 (1982).
[48] J. Clement, D.E. Brown, and A. Zietsman, “Not all preconceptions are misconceptions: Finding anchoring conceptions for grounding instruction on students’ intuitions,” Int. J. Sci. Educ., 554–565 (1989).
[49] J. Clement, “Using bridging analogies and anchoring intuitions to deal with students’ preconceptions in physics,” J. Res. Sci. Teach., 1241–1257 (1993).
[50] I.A. Halloun and D. Hestenes, “The initial knowledge state of college physics students,” Am. J. Phys., 1043–1055 (1985).
[51] I.A. Halloun and D. Hestenes, “Common sense concepts about motion,” Am. J. Phys., 1056–1065 (1985).
[52] R.R. Hake, “Interactive-engagement versus traditional methods: A six-thousand-student survey of mechanics test data for introductory physics courses,” Am. J. Phys., 64–74 (1998).
[53] T.F. Scott and D. Schumayer, “Conceptual coherence of non-Newtonian worldviews in Force Concept Inventory data,” Phys. Rev. Phys. Educ. Res., 010126 (2017).
[54] P. Eaton, K. Vavruska, and S. Willoughby, “Exploring the preinstruction and postinstruction non-Newtonian world views as measured by the Force Concept Inventory,” Phys. Rev. Phys. Educ. Res., 010123 (2019).
[55] T.F. Scott and D. Schumayer, “Central distractors in Force Concept Inventory data,” Phys. Rev. Phys. Educ. Res., 010106 (2018).
[56] L. Viennot, “Spontaneous reasoning in elementary dynamics,” Eur. J. Sci. Educ., 205–221 (1979).
[57] D.E. Trowbridge and L.C. McDermott, “Investigation of student understanding of the concept of acceleration in one dimension,” Am. J. Phys., 242–253 (1981).
[58] A.
Caramazza, M. McCloskey, and B. Green, “Naive beliefs in ‘sophisticated’ subjects: Misconceptions about trajectories of objects,” Cogn., 117–123 (1981).
[59] P.C. Peters, “Even honors students have conceptual difficulties with physics,” Am. J. Phys., 501–508 (1982).
[60] M. McCloskey, “Intuitive physics,” Sci. Am., 122–131 (1983).
[61] R.F. Gunstone, “Student understanding in mechanics: A large population survey,” Am. J. Phys., 691–696 (1987).
[62] C.W. Camp and J.J. Clement, Preconceptions in Mechanics: Lessons Dealing with Students’ Conceptual Difficulties (Kendall/Hunt, Dubuque, IA, 1994).
[63] L.C. McDermott, “Students’ conceptions and problem solving in mechanics,” in
Connecting Research in Physics Education with Teacher Education, edited by Andrée Tiberghien, E. Leonard Jossem, and Jorge Barojas (International Commission on Physics Education, 1997) pp. 42–47.
[64] R. Rosenblatt and A.F. Heckler, “Systematic study of student understanding of the relationships between the directions of force, velocity, and acceleration in one dimension,” Phys. Rev. Phys. Educ. Res., 020112 (2011).
[65] N. Erceg and I. Aviani, “Students’ understanding of velocity-time graphs and the sources of conceptual difficulties,” Croat. J. Educ., 43–80 (2014).
[66] B. Waldrip, “Impact of a representational approach on students’ reasoning and conceptual understanding in learning mechanics,” Int. J. Sci. Math. Educ., 741–765 (2014).
[67] A.A. diSessa, “Knowledge in pieces,” in Constructivism in the Computer Age, The Jean Piaget Symposium Series, edited by George Forman and Peter B. Pufall (Lawrence Erlbaum, Hillsdale, NJ, 1988) pp. 49–70.
[68] A.A. diSessa, “Toward an epistemology of physics,” Cogn. Instr., 105–225 (1993).
[69] A.A. diSessa and B.L. Sherin, “What changes in conceptual change?” Int. J. Sci. Educ., 1155–1191 (1998).
[70] M.T.H. Chi and J.D. Slotta, “The ontological coherence of intuitive physics,” Cogn. Instr., 249–260 (1993).
[71] M.T.H. Chi, J.D. Slotta, and N. De Leeuw, “From things to processes: A theory of conceptual change for learning science concepts,” Learn. Instr., 27–43 (1994).
[72] J.D. Slotta, M.T.H. Chi, and E. Joram, “Assessing students’ misclassifications of physics concepts: An ontological basis for conceptual change,” Cogn. Instr., 373–400 (1995).
[73] R. Duit and D.F. Treagust, “Conceptual change: A powerful framework for improving science teaching and learning,” Int. J. Sci. Educ., 671–688 (2003).
[74] D. Hammer, “Misconceptions or p-prims: How may alternative perspectives of cognitive structure influence instructional perceptions and intentions,” J. Learn. Sci., 97–127 (1996).
[75] D.
Hammer, “More than misconceptions: Multiple perspectives on student knowledge and reasoning, and an appropriate role for education research,” Am. J. Phys., 1316–1325 (1996).
[76] D. Hammer, “Student resources for learning introductory physics,” Am. J. Phys., S52–S59 (2000).
[77] E. Etkina, J. Mestre, and A. O’Donnell, “The impact of the cognitive revolution on science learning and teaching,” in The Cognitive Revolution in Educational Psychology, edited by James M. Royer (IAP, 2005) pp. 119–164.
[78] J.D. Bransford, A.L. Brown, and R.R. Cocking,
How People Learn: Brain, Mind, Experience, and School (National Academy Press, Washington, DC, 2000).
[79] G. Csardi and T. Nepusz, “The igraph software package for complex network research,” InterJournal, Complex Systems, 1–9 (2006).
[80] R Core Team,
R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria (2017).
[81] “US News & World Report: Education,” https://premium.usnews.com/best-colleges. Accessed 4/30/2017.
[82] D. Edler and M. Rosvall, “The mapequation software package,” Available online at . Accessed 2/1/2019.
[83] S.P. Borgatti and D.S. Halgin, “Analyzing affiliation networks,” in
The Sage Handbook of Social Network Analysis, edited by J. Scott and P.J. Carrington (Sage Publications, Thousand Oaks, CA, 2011) pp. 417–433.
[84] N.J. Foti, J.M. Hughes, and D.N. Rockmore, “Nonparametric sparsification of complex multiscale networks,” PLoS ONE, 1–10 (2011).
[85] A. Traxler, A. Gavrin, and R. Lindell, “Networks identify productive forum discussions,” Phys. Rev. Phys. Educ. Res., 020107 (2018).
[86] A. Clauset, M.E.J. Newman, and C. Moore, “Finding community structure in very large networks,” Phys. Rev. E, 066111 (2004).
[87] M.E.J. Newman, “Fast algorithm for detecting community structure in networks,” Phys. Rev. E, 066133 (2004).
[88] M.E.J. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Phys. Rev. E, 026113 (2004).
[89] S. Epskamp, A.O.J. Cramer, J.L. Waldorp, V.D. Schmittmann, and D. Borsboom, “qgraph: Network visualizations of relationships in psychometric data,” J. Stat. Soft., 1–18 (2012).
[90] A.C. Davison and D.V. Hinkley, Bootstrap Methods and Their Applications (Cambridge University Press, Cambridge, UK, 1997).
[91] A. Canty and B.D. Ripley, boot: Bootstrap R (S-Plus) Functions (2017), R package version 1.3-20.
[92] “Table II for the Force Concept Inventory (revised from 081695r),” http://modeling.asu.edu/R&E/FCI-RevisedTable-II_2010.pdf. Accessed 3/17/2019.
[93] See Supplemental Material at [URL will be inserted by publisher] for the communities detected at the r > .
[94] J. Cohen, Statistical Power Analysis for the Behavioral Sciences (Academic Press, New York, NY, 1977).
[95] R.C. Hudson and F. Munley, “Re-score the Force Concept Inventory!” Phys. Teach., 261–261 (1996).
[96] D.P. Maloney, T.L. O’Kuma, C. Hieggelke, and A. Van Heuvelen, “Surveying students’ conceptual knowledge of electricity and magnetism,” Am. J. Phys. 69