Exploring the Structure of Misconceptions in the Force Concept Inventory with Modified Module Analysis
James Wells, Rachel Henderson, John Stewart, Gay Stewart, Jie Yang, Adrienne Traxler
W. M. Keck Science Department of Claremont McKenna, Pitzer, and Scripps Colleges, Claremont, CA 91711
Michigan State University, Department of Physics and Astronomy, East Lansing, MI 48824
West Virginia University, Department of Physics and Astronomy, Morgantown, WV 26506
Wright State University, Department of Physics, Dayton, OH 45435

(Dated: May 16, 2019)

Module Analysis for Multiple-Choice Responses (MAMCR) was applied to a large sample of Force Concept Inventory (FCI) pretest and post-test responses (N_pre = 4509 and N_post = 4716) to replicate the results of the original MAMCR study and to understand the origins of the gender differences reported in a previous study of this data set. When the results of MAMCR could not be replicated, a modification of the method, Modified Module Analysis (MMA), was introduced. MMA was productive in understanding the structure of the incorrect answers in the FCI, identifying 9 groups of incorrect answers on the pretest and 11 groups on the post-test. These groups, in most cases, could be mapped onto common misconceptions used by the authors of the FCI to create distractors for the instrument. Of these incorrect answer groups, 6 of the pretest groups and 8 of the post-test groups were the same for men and women. Two of the male-only pretest groups disappeared with instruction, while the third male-only pretest group was identified for both men and women post-instruction. Three of the groups identified for both men and women on the post-test were not present for either on the pretest. The rest of the identified incorrect answer groups did not represent misconceptions, but were instead related to the blocked structure of some FCI items, where multiple items are related to a common stem. The groups identified had little relation to the gender-unfair items previously identified for this data set, and therefore, differences in the structure of student misconceptions between men and women cannot explain the gender differences reported for the FCI.

I. INTRODUCTION
The “gender gap,” gender differences between the scores of men and women on the Force Concept Inventory (FCI) [1] and other instruments developed by Physics Education Research (PER), has been extensively studied [2]. For the FCI, a substantial number of studies have suggested that some of the gender differences observed resulted from different response patterns of men and women to a subset of the items in the instrument [3]. The origin of these differential response patterns is, however, unknown. The purpose of this study is to apply Module Analysis for Multiple-Choice Responses, introduced by Brewe et al. [4], to a large sample of FCI responses known to contain a subset of items which produce substantially different response patterns for men and women in order to determine if the structure of the misconceptions of men and women differs on these items.
A. Background Studies
This work will draw heavily from three previous studies, which will be referenced as Study 1, Study 2, and Study 3 in this work.

∗ [email protected]
1. Study 1: Module Analysis
Study 1 introduced Module Analysis for Multiple-Choice Responses (MAMCR) to analyze concept inventory data at the level of individual responses to the items [4]. Unlike many analysis techniques applied to FCI data, which consider only a student’s overall score or only the correct answers to individual items, MAMCR considers each answer choice a student selected in order to provide a fine-grained examination of students’ misconceptions of Newtonian physics and to allow instructors to target specific errors.

MAMCR is based on network analytic techniques [5, 6]. A network is represented by a graph where nodes are connected to one another by edges. Edges can be weighted, where the value of the weight represents some aspect of the interaction. Network analysis is a highly successful and versatile set of methods which have been applied to a variety of problems, including the probability of homicide victimization among people living in a disadvantaged neighborhood [7], the mapping of functional networks in the brain from electrical signals [8], passing patterns of soccer teams in the World Cup [9], and the response of plants to bacterial infection [10].

Study 1 examined the FCI post-test scores of 143 first-year physics majors at a Danish university. The sample was 78% male and scored relatively highly on the exam: pre-test 65 ± 22% and post-test 81%. Brewe et al. emphasized that the results of this study should be generalized with care. The group of students tested was small, unusually high scoring, and had limited diversity. Likewise, there were several choices made during the process of applying MAMCR to the data which could have been made differently. Both the choice of sparsification method (described in Sec. III) and the decision of how to group responses that cluster together only on some of the one thousand applications of Infomap were somewhat arbitrary, as was the interpretation of the meaning of the modules. As will be seen in Sec. III, our data set required different choices to be made.
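The weighted response networks described above can be illustrated with a minimal example (the node labels and weights below are invented for illustration, not data from Study 1 or this study; networkx stands in for the igraph tooling named later in the paper):

```python
import networkx as nx

# A small weighted network: nodes are FCI responses (item number plus
# answer letter); an edge weight counts how many students selected
# both responses.  All values here are invented.
G = nx.Graph()
G.add_edge("1A", "2B", weight=40)   # 40 students chose both 1A and 2B
G.add_edge("1A", "3C", weight=12)
G.add_edge("2B", "3C", weight=25)

# Edge weights are ordinary attributes and can be inspected directly.
print(G["1A"]["2B"]["weight"])           # 40
print(G.degree("1A", weight="weight"))   # weighted degree: 40 + 12 = 52
```

Community detection algorithms such as Infomap operate on exactly this kind of weighted graph, looking for sets of nodes more strongly tied to each other than to the rest of the network.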
2. Study 2: Item Fairness and the FCI
In Study 2, Traxler et al. [3] explored item-level gender fairness of the FCI using Classical Test Theory [13], Item Response Theory [14], and Differential Item Functioning (DIF) [15, 16] analysis. An item is fair to men and women if men and women of equal overall ability score equally on the item. Using three samples, a graphical analysis identified five FCI items that were substantially unfair to women: item 14 (bowling ball falling out of an airplane), items 21 through 23 (sideways-drifting rocket with engine turning on and off), and item 27 (a large box being pushed across a horizontal floor). A further DIF analysis, which controlled for the student’s overall post-test score, identified eight items on the FCI as substantially unfair. Two of these were unfair to men: item 9 (speed of a puck after it receives a kick) and item 15 (a small car pushing a large truck). These eight items included the five items identified in the graphical analysis along with items 9, 12 (the trajectory of a cannonball shot off of a cliff), and 15. Many of the unfair items had been identified as unfair in previous studies. Overall, Study 2 demonstrated that eliminating all unfair items on the FCI to create a fair instrument reduced the gender gap by 50% in the largest sample.

This work, however, could not identify the source of the unfairness. The distribution of student responses was analyzed. Focusing on the five items that were identified with both the graphical analysis and the DIF analysis, incorrect female responses were predominately one of the distractors in each of the FCI items; however, the distractors chosen by the male students were less uniform in all five FCI items. Overall, Study 2 concluded that no physical principle or common misconception could explain the unfairness in the identified FCI items; however, this conclusion was drawn from a qualitative inspection of the items. The current study builds on the work in Study 2 by performing a quantitative analysis of the incorrect responses of men and women.
3. Study 3: Multidimensional Item Response Theory and the FCI
The study in the current work applies network analytic methods to understand the incorrect answer structure of the FCI. This structure might be influenced by features of the FCI which produce correlations between the correct answers. If a consistent misconception is being applied, it would form an alternate incorrect answer to sets of related correct answers. Study 3 examined the correct answer structure of the FCI using both exploratory and confirmatory methods [17]. Exploratory factor analysis (EFA) suggested that the practice of “blocking” items produced correlations between the items within the block. A block of items is a sequence of items which all refer to a common stem or where one item refers to a previous item. The FCI contains item blocks {5, 6}, {8, 9, 10, 11}, {15, 16}, {21, 22, 23, 24}, and {25, 26, 27}. Study 3 reported that the factors identified by EFA often loaded strongly on items in the same block, suggesting that blocking was generating correlations among the items in the block. Study 3 went on to produce a detailed model of the reasoning required to solve the FCI. Multidimensional Item Response Theory (MIRT) was used to test alternate models and allowed the identification of an optimal model. This model allowed the identification of groups of items with very similar solution structure: {5, 18}, {6, 7}, {17, 25}, and {4, 15, 28}. Study 3 only included the first item in a block in the analysis, and it is likely that item 16 should be added to the last group, which represents Newton’s 3rd law items. This mapping of item blocks and groups with similar solutions will be important to understanding the incorrect answer structure presented in this work.

B. Previous Studies of the FCI
The FCI, either in aggregate or disaggregated by gender, is one of the most studied instruments in PER. The present study examines item-level structure disaggregated by gender. The structure of the incorrect answers is examined to identify coherent patterns of incorrect answers.
1. Exploratory Analyses of the FCI
Many studies have examined the structure of the FCI, primarily using EFA. These studies began soon after the publication of the FCI, when Huffman and Heller [18] failed to extract the factor structure suggested by the authors of the instrument [1]. For a sample of 145 high school students, Huffman and Heller found only two post-test factors: “Newton’s 3rd law” and “Kinds of Forces.” The small number of factors may have resulted from a very conservative factor selection criterion. In the same study, for a sample of 750 university students, only one factor was identified: “Kinds of Forces.” A later work by Scott, Schumayer, and Gray [19] applied EFA to the FCI post-test scores of a sample of 2150 students and found an optimal model with 5 factors; however, one of the factors explained much of the variance. The result that a single factor explains the majority of the variance is fairly robust and is further supported by the high Cronbach alpha values reported in Study 2 and by Lasry et al. [20]. Scott and Schumayer [21] replicated their 5-factor analysis using MIRT on the same sample. Semak et al. [22] reported optimal models with 5 factors on the pretest and 6 factors on the post-test when exploring the evolution of student thinking for 427 algebra- and calculus-based introductory physics students. Study 3 also performed EFA using MIRT and reported 9 factors as optimal.
2. Gender and the FCI
In an extensive review of gender differences on physics concept inventories [2], men outperformed women by 13% on pretests and 12% on post-tests of conceptual mechanics: the FCI and the Force and Motion Conceptual Evaluation [23].

Many reasons have been explored to explain these differences. Differences in high school physics class election [24–26] may cause differences in college physics grades [27, 28]. In addition, many studies have identified gender differences in academic course grades, with women generally outperforming men [29]; these differences may influence conceptual inventory performance. Cognitive differences have also been advanced as explanations of academic gender differences [30–33], with women scoring generally higher on verbal reasoning tasks and men scoring generally higher on spatial reasoning tasks; however, cognitive differences between men and women are fine grained, with differences within the subskills of a single discipline [34]. Psychocultural factors have also been advanced as explanations of academic performance differences, including mathematics anxiety [35, 36], science anxiety [37–39], and stereotype threat [40]. For a more detailed discussion of the many sources that may influence the overall gender differences on physics conceptual inventories, see Henderson et al. [41].
3. Item Fairness and the FCI
In addition to student-centered explanations for conceptual inventory gender differences, bias in the individual FCI items has been investigated as a source of these gender differences. McCullough and Meltzer [42] randomly gave students the original FCI or a version where each problem’s context was modified to be more stereotypically familiar to women. In a sample of 222 algebra-based physics students, they found significant differences in performance on items 14, 22, 23, and 29. In 2004, in a sample of non-physics students, McCullough used a similar methodology [43] and found that female performance did not change while male performance decreased on the modified contexts. Multiple studies have reported item unfairness in unmodified items in the FCI [44, 45]. Study 2 provides a thorough summary of research into the item fairness of the FCI [3]. Recent research has suggested that other commonly used conceptual physics instruments do not contain a substantial number of unfair items [46].
C. Misconception Research
Since the early 1980s, student difficulties, most commonly known as “misconceptions” or “alternate conceptions/hypotheses,” have been extensively studied within physics classrooms. The early work done by Clement and colleagues [47–49], qualitatively analyzing the “alternate view of the relationship between force and acceleration” that is grounded in students’ experiences, has influenced much of the research examining conceptual understanding in physics. Halloun and Hestenes [50, 51] further explored this idea by collecting a taxonomy of “common sense concepts” that conflict with a Newtonian understanding of mechanics. Hestenes, Wells, and Swackhamer developed the FCI [1] with the intent of measuring student conceptual understanding of Newtonian theory, specifically analyzing student misconceptions pre- and post-instruction [52].
1. Misconceptions and the FCI
The authors of the FCI provided a detailed description of the misconceptions measured by the instrument [1]. A summary of those misconceptions follows.
Impetus.
Dating back to pre-Galilean times, the impetus model involves the idea that an object has a “motive power” that can explain why an object remains in motion regardless of any external forces [1, 51]. Students with this misconception do not fully understand Newton’s 1st law. For example, FCI items 6 and 7 describe a ball moving in a circle and ask about the path the ball will take after it exits a circular path. Selecting the circular trajectory after exiting the track demonstrates the misconception that the ball has a circular impetus.
Active Force.
The misconception that motion implies force involves the idea that an object in motion must be experiencing a force. This misconception involves a naive understanding of the difference between velocity and acceleration [1, 47] and demonstrates that Newton’s 2nd law is not well understood. For example, items 5 and 18 describe an object moving in a circular path and ask about the forces acting on the object. The motion-implies-force misconception would predict that there is a force in the direction of the motion.
Action/Reaction Pairs.
The misconception that the larger object exerts a greater force on a smaller object stems from the “dominance principle” [1, 51]. This misconception demonstrates that Newton’s 3rd law is not well understood. For example, items 4 and 15 describe a small car pushing a large truck and ask students to describe the forces between the two objects. The “dominance principle” misconception would predict that the truck exerts a larger force on the car than the car exerts on the truck.
Concatenation of Influences.
This misconception involves the idea that forces combine with “one force winning out over the other” [1]. This misconception demonstrates that the superposition principle for Newtonian forces is not well understood. For example, items 8 and 9 describe a hockey puck sliding horizontally at a constant speed on a frictionless surface. These items ask for the path that the hockey puck would take and the speed of the puck after it receives a swift kick. The misconception of “one force winning” would predict that the last force (i.e., the swift kick) determines the motion and speed of the puck.
Gravity.
The misconception that gravity is not a force stems from the Aristotelian physics idea that heavier objects tend to move toward the center of the earth and lighter objects tend to move away from the center of the earth [1, 51]. For example, FCI items 1 and 2 describe two metal balls of different weights that are (1) dropped at the same time and (2) rolled off of a horizontal table at the same speed; the items ask about the amount of time it takes for the two balls to hit the ground and the horizontal distance traveled, respectively. The gravity misconception predicts that the heavier ball falls faster and travels farther.

Recently, quantitative studies have begun to further probe the misconception structure of the FCI. Scott and Schumayer [53] applied EFA to all 150 responses, 5 per item, on the FCI pretest. The two most important factors each contained responses from the majority of the items in the FCI; contained both incorrect and correct responses; and mixed conceptually very different correct reasoning as characterized by the model in Study 3. For example, factor 1 contained correct responses to questions on Newton’s 1st law, Newton’s 3rd law, one- and two-dimensional motion under gravity, and one- and two-dimensional motion ignoring gravity. Three of the six factors showed evidence of students answering in patterns in the data set (always selecting response “A,” “C,” or “E” when unsure of the answer).
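An all-response analysis of the kind just described starts from a dummy-coded response matrix with one indicator column per (item, choice) pair. A minimal sketch follows (invented random data; a principal-component decomposition via SVD stands in here for the rotated EFA used in the published studies):

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented responses: 200 students x 30 items, answer choices coded 0-4
# (standing in for A-E).  Real analyses would use actual FCI records.
responses = rng.integers(0, 5, size=(200, 30))

# Dummy-code every response: one 0/1 indicator column per (item, choice)
# pair, giving 30 * 5 = 150 columns -- one per possible FCI response.
indicators = np.zeros((200, 150))
rows = np.arange(200)[:, None]
indicators[rows, 5 * np.arange(30) + responses] = 1

# Decompose the centered indicator matrix; the leading right singular
# vectors play the role of factor loadings on the 150 responses.
centered = indicators - indicators.mean(axis=0)
_, s, vt = np.linalg.svd(centered, full_matrices=False)
loadings = vt[:6]            # "factor" loadings on the 150 responses
print(loadings.shape)        # (6, 150)
```

With real data, large same-signed loadings within a factor would flag responses that tend to be selected together, which is what the studies above attempted to interpret conceptually.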
Eaton, Vavruska, and Willoughby [54] replicated this work for both pretest and post-test data; no consistent theme could be identified for multiple factors in their study. The failure of these studies to identify an intelligible factor structure containing items requiring related Newtonian reasoning may indicate that factoring the incorrect and correct items together in the same analysis is not productive, or may be seen as further support for Study 3, which concluded that EFA was not a productive method to explore the FCI.

Scott and Schumayer provided additional analysis of two of their factors using network analytic techniques [55]. As in this work, the network was constructed using the correlation matrix; however, only correlations within the factors identified in their earlier factor analysis were considered. This work reported node centrality measures, but did not use the community detection methods of Study 1.
2. Misconception Research
Many researchers have investigated students’ conceptual understanding, exploring the misconceptions outlined above. Early research explored the overall common difficulties and beliefs that students had about Newtonian mechanics [56–62]. More recently, researchers have designed systematic studies to explore student understanding and the epistemological development of Newton’s Laws of Motion [23, 63–66]. For example, Rosenblatt and Heckler developed a new assessment to investigate student understanding of the relationship between force, velocity, and acceleration [64]. This study found that understanding the relationship between velocity and acceleration was necessary for understanding the relationship between velocity and force; however, the reverse was not necessarily true.

In general, a misconception about mechanics can be defined as a non-Newtonian reasoning principle directly related to the physical systems addressed by Newtonian mechanics. Modeling coherent patterns of student wrong answers as misconceptions is only one of many ways to explain patterns of reasoning about mechanics. Other important theories include knowledge in pieces [67–69] and ontological categories [70–72]. The knowledge-in-pieces framework posits that student knowledge is formed of a number of granular facts, p-prims, that are activated either individually or collectively to produce a solution. In general, both p-prims and misconceptions are small segments of reasoning; p-prims are more general, while misconceptions are more specific to the physical context. The ontological categories framework is substantially different from either the misconception view or the knowledge-in-pieces view; it posits that incorrect student answers result from a misclassification of a concept, for example, misclassification of the concept of force as a substance that can be used up. A substantial amount of research has also investigated how students’ conceptual knowledge changes over time [73].
In PER, Hammer proposed an extension and unification of the knowledge-in-pieces and misconception views, modeling both misconceptions and p-prims as “resources” [74–76]. A resource was developed in analogy to a segment of computer code of arbitrary complexity. A p-prim would represent a fundamental subset of the code, while a misconception would represent a consistent misapplication of the code. Unlike the misconception view, the resource view identifies positive intellectual components which an instructor can activate to encourage the knowledge construction process. The quantitative method in the present work identified small segments of incorrect reasoning, and as such, cannot inform the resource view of student knowledge. In addition, the FCI was strongly developed within the misconception view and, therefore, this work will focus on that view.

The framework chosen, whether knowledge-in-pieces, misconceptions, or ontological categories, has different consequences for instruction or curriculum design in how it draws out and makes use of student ideas [75]. However, it is less clear that this difference is measured by conceptual inventories. Incoherence in student answers for the same concept might suggest a knowledge-in-pieces view, where different problem contexts can trigger different p-prims even if a physicist would see the scenarios as isomorphic. However, the FCI was not designed to measure this effect, and as such, a separate instrument designed around the knowledge-in-pieces or ontological categories frameworks is likely required to fully explore either framework.

In the text, we will primarily use the naive theory or misconceptions framework [77, 78], with notes in the Results where alternative frameworks seem relevant. Ultimately, while we call groups of incorrect answers identified by network analytic techniques “misconceptions,” this work is purely quantitative and cannot distinguish between the various theoretical frameworks developed to explain incorrect answering patterns.
D. Research Questions
This study attempted to reproduce the results of Study 1 disaggregated by gender. When the results were not reproduced, the reasons for the failure of MAMCR were explored and a modification of the algorithm, called “Modified Module Analysis” (MMA), was proposed. The modified algorithm was used to explore gendered differences in the patterns of incorrect answers on the FCI.

In general, network analysis uses the terms “community” and “module” interchangeably to represent connected (under some definition) subsets of a network. We adopt the term community instead of module in anticipation of the “igraph” package [79] in the “R” software system [80] becoming the primary network analysis tool in PER.

This study explored the following research questions:
RQ1: Are the results of Module Analysis for Multiple-Choice Responses replicable for large FCI data sets? If not, what changes to the algorithm are required to detect meaningful communities of incorrect answers?

RQ2: How do the communities detected change as network-building parameters are modified? Do these changes support the existence of a coherent non-Newtonian conceptual model?

RQ3: How is the incorrect answer community structure different between the pretest and the post-test?

RQ4: How is the incorrect answer community structure different for men and women? Do the differences explain the gender unfairness identified in the instrument?

This work extends the module analysis technique to a larger data set, explores alternate choices during that analysis, and contrasts structure between pre- and post-test data. Structural clues in the community structure are examined to explain unresolved questions about gender differences in answer choices [3].
II. METHODS

A. Instrument
The FCI is a 30-item instrument designed to measure a student’s facility with Newtonian mechanics [1]. The instrument includes items involving Newton’s three laws as well as items probing an understanding of one- and two-dimensional kinematics. The instrument was also constructed with distractors representing common student misconceptions.
B. Sample
The data for this study was collected at a large southern land-grant university serving approximately 25,000 students. Overall university undergraduate demographics were 79% White, 5% African American, 6% Hispanic, and other groups each with 3% or less [81].

The sample was collected in the introductory calculus-based mechanics class serving primarily physical scientists and engineers. The sample has been analyzed previously by Traxler et al. (Study 2, Sec. I A 2) [3]; it is referenced as Sample 1 in that work. The sample contains 4716 complete FCI post-test records (3628 men and 1088 women) and 4509 complete pretest records (3482 men and 1027 women). Table II in Study 2 reports basic descriptive statistics. On the pretest, men have an average percentage score of 43%, women 32%. On the post-test, men have an average percentage score of 73%, women 65%. The course in which the sample was collected was presented using the same pedagogy and managed by the same lead instructor for the period studied. A more thorough discussion of the sample and the instructional environment may be found in Study 2.
C. Analysis Methods
Initial replication of Study 1 was performed with the Infomap software available from mapequation.org [82]. All other statistical analysis was performed in the “R” statistical software system [80]. This work failed to replicate the Study 1 results and proposes a modified analysis method; as such, the analysis method is a result of the work, and the various network techniques employed are described as they are used.
III. RESULTS

A. Module Analysis
Figure 1 outlines the original and modified analysis steps. The original module analysis method presented in Study 1 first formed a bipartite network, a network that includes two types of nodes where all edges connect nodes of different types. This network included nodes representing students and nodes representing FCI responses. The bipartite network is then projected into a unipartite network containing only nodes representing FCI responses. Edges in this network connect different responses of the same student. Edge weights represent the number of students who selected the pair of responses connected by the edge. For example, if 40 students selected FCI responses 1A and 2B, where the number is the item number and the letter is the response within the item, there would be an edge between node 1A and node 2B with weight 40. While the bipartite network can be used to extract additional properties of the network [83], this was not done in Study 1. As such, we began with the unipartite network. The unipartite network can be represented by a two-dimensional matrix, called the adjacency matrix, adj(X, Y), where X and Y are FCI item responses (for example, X = 1A). The value adj(X, Y) is the number of students who selected response X and response Y. In the above example, adj(1A, 2B) = 40. The network representing the post-test responses of women on the FCI post-test is shown in Fig. 2. Because of the differences between men and women identified in Study 2 for this sample, all results are reported disaggregated by gender. The network in Fig. 2 is fairly representative of the pretest and post-test networks for men and women. Figure 2 uses a node placement algorithm that places more densely connected nodes close to one another. As in Study 1, only incorrect responses were included in the network. The correct responses are highly correlated and are often the most commonly selected responses. If they are included in the network, they form a tightly connected community that prevents exploration of the incorrect answers.

To attempt to replicate the results of Study 1, community detection algorithms were applied to the network shown in Fig. 2. First, a complete replication was attempted which employed the “Infomap” software available at “mapequation.org” [82], as was originally used in Study 1. This software, designed for very large networks, presents such significant installation and use barriers that it seems unlikely that it will ever achieve broad acceptance in PER. A second path to replication using the “infomap” implementation in the “igraph” package [79] in “R” was also attempted.

To extract meaningful structure from a high-density network, the network must generally be simplified without removing important structure. The process of simplifying a network by removing edges is called “sparsification.” The network sparsification method used in Study 1 was Locally Adaptive Network Sparsification (LANS) [84]. The LANS algorithm removes edges based on the distribution of edge weights connected to each node. The probability of selecting an edge with a smaller weight at random is compared to a predetermined significance level, and only edges above that level are retained. This method is locally adaptive because it depends only on the edges incident on a single node. A consequence of sparsifying based on the distribution of weights incident on each node is that no node will have its last edge removed, so no connected node is disconnected from the rest of the network. This ensures that local structures important to the global structure of the network are retained.

After sparsifying with LANS (using code from Traxler et al. [85], Supplemental Material), the Infomap CDA was applied. Infomap is based on information-theoretic methods. Infomap encodes a random walk through the network by assigning codewords to each node, then trying to minimize the length of the description.
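The locally adaptive retention rule just described can be sketched as follows. This is an illustrative implementation written for this summary, not the LANS code used in Study 1 or by Traxler et al.; it keeps an edge whenever, at either endpoint, the edge's weight sits above the chosen significance quantile of that node's incident weights:

```python
import networkx as nx

def lans_sparsify(G, alpha=0.05):
    """Keep an edge if, at either endpoint, the fraction of that node's
    incident edge weights that are <= this edge's weight is >= 1 - alpha.
    The maximum-weight edge at every node always passes, so no node is
    disconnected by the procedure (matching the LANS property noted in
    the text)."""
    keep = set()
    for node in G:
        weights = [d["weight"] for _, _, d in G.edges(node, data=True)]
        n = len(weights)
        for _, nbr, d in G.edges(node, data=True):
            # Empirical CDF of this edge's weight among the node's edges.
            cdf = sum(w <= d["weight"] for w in weights) / n
            if cdf >= 1 - alpha:
                keep.add(frozenset((node, nbr)))
    return nx.Graph((u, v, d) for u, v, d in G.edges(data=True)
                    if frozenset((u, v)) in keep)

# Demo: a weak edge between two well-connected nodes is dropped, but
# every node keeps its strongest edge.  Toy weights, not study data.
G = nx.Graph([("a", "x", {"weight": 10}), ("a", "y", {"weight": 10}),
              ("b", "u", {"weight": 10}), ("b", "v", {"weight": 10}),
              ("a", "b", {"weight": 1})])
H = lans_sparsify(G, alpha=0.05)
print(sorted(tuple(sorted(e)) for e in H.edges()))
```

Because the rule is evaluated per node, a globally small weight can survive at a sparsely connected node while the same weight is pruned at a hub, which is the "locally adaptive" behavior described above.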
Nodes visited more often are given shorter codes, and coherent communities, where the random walker tends to spend more time, are given their own unique codes to further reduce the information needed to represent the network. This way, the codes for individual nodes can be reused within the larger community structure. Nodes in one of the large communities are connected more to each other than to nodes outside the community. Because Infomap is not deterministic, it was run 1000 times and the communities that were most often found were selected as the misconception modules in Study 1.

Applying Infomap with LANS sparsification failed to identify meaningful community structure for the large data set in the current study; Infomap consistently identified only one large community.

To explore the source of the discrepancy with Study 1, an alternate implementation of Infomap was employed; this implementation was part of the “igraph” package in the “R” software system.

[Figure 1 flowchart. Left branch (MAMCR): original data (students and their answers) → bipartite incidence matrix (N_student × N_answer; each entry is 0 or 1 to show answers chosen) → unipartite adjacency matrix (N_answer × N_answer; each [X, Y] shows the frequency that answers X and Y were chosen by the same student) → sparsify network with LANS → community detection algorithm partitions network. Right branch (MMA): correlation matrix (N_answer × N_answer; each [X, Y] gives the correlation r between answers X and Y) → remove correct answers → sparsify correlation matrix (1. zero all entries below an r threshold; 2. remove answers with < 30 respondents; 3. zero entries where r is not statistically significant) → use reduced matrix as adjacency matrix → rerun non-deterministic community detection OR bootstrap a new network and rerun → threshold for how often answers appear in the same community; interpret groups above the threshold.]
Figure 1. Workflow of analysis for the original module analysis method (left branch) and our modified version (right branch).
Figure 2. Unipartite network for the FCI post-test responses of women.

"R" software system. A simpler sparsification algorithm was also employed. The LANS algorithm statistically evaluates each edge, but will not remove the last edge connecting a node. This algorithm is a reasonable choice for a network where every edge is purposeful (such as air travel), but may amplify noise in a network of student responses where some edges are the results of careless mistakes or guessing. As such, the network was also sparsified by imposing a threshold requiring edges to have a minimum weight. Multiple thresholds were tried. The Infomap community detection algorithm used in Study 1 identified only one community at all threshold values. Many other CDAs are available in the "igraph" package; some identified two communities even at very high thresholds. No CDA identified more than 2 communities.

Fig. 3 shows the communities identified by the "fast greedy" CDA at an edge weight threshold of N/10, where N is the number of participants; only 22 nodes remain connected at this threshold. Nodes in different communities are shown in different colors. The fast-greedy CDA [86] is an improved version of a modularity-based CDA algorithm [87]. Modularity is a measure that compares, for a given division of a network into communities, how many more intra-community links exist than expected by chance in an equivalent network [88]. Modularity values range from zero to one, where a modularity of zero means there is no clustering in the network and a modularity of one indicates a strongly clustered network.

There seem to be two likely sources of the differences between the results of this study and Study 1: sample size and the LANS algorithm. To investigate sample size, 100 subsamples of 143 students each were drawn from the sample in this study. Applying Infomap using "R" identified only one community 100% of the time with no sparsification, and one community 92% of the time with the requirement that the edge weight be at least N/10, where N is the number of students.

The igraph package implements many CDA algorithms; for the small network analyzed in this work, most perform similarly. For the rest of this work, the "fast-greedy" CDA described above will be used. Again, the data was subsampled to 143 students to compare with Study 1.

Figure 3. Communities detected for the adjacency matrix of FCI post-test responses of women with an edge weight threshold of N/10.

With no sparsification, the fast-greedy algorithm identified 3 to 6 communities, with 3 to 4 communities identified in 92% of the runs. With the edge weight greater than N/10 sparsification, fast-greedy identified 2 to 4 communities, with 66% of the runs identifying 3 communities. The communities identified made little theoretical sense within the framework of Study 3, with very different items in the same communities. As such, while some of the differences in the studies may be attributed to sample size, the choice of CDA also influenced the communities identified at small sample size. At the large sample size of the current study, the various community detection algorithms implemented in igraph give fairly similar results.
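As a concrete illustration of the left branch of Fig. 1, the sketch below builds the unipartite adjacency matrix from a synthetic bipartite incidence matrix, sparsifies it with an edge-weight threshold of N/10, and partitions it with the Clauset-Newman-Moore greedy modularity algorithm (networkx's `greedy_modularity_communities`, used here as a stand-in for igraph's fast-greedy CDA; the data and all variable names are fabricated for illustration):

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

# Synthetic incidence matrix: rows are students, columns are answer choices;
# a 1 means the student selected that answer.
rng = np.random.default_rng(0)
n_students = 400
group = rng.integers(0, 2, n_students)        # two latent answering patterns
B = np.zeros((n_students, 4), dtype=int)
B[group == 0, 0] = 1                          # pattern 0 selects answers 0 and 1
B[group == 0, 1] = 1
B[group == 1, 2] = 1                          # pattern 1 selects answers 2 and 3
B[group == 1, 3] = 1
B = B | (rng.random((n_students, 4)) < 0.05)  # occasional stray selections

# Unipartite adjacency matrix: entry [X, Y] counts students choosing both X and Y.
adj = B.T @ B
np.fill_diagonal(adj, 0)

# Sparsify with an edge-weight threshold of N/10, as in the text.
threshold = n_students / 10
g = nx.Graph()
for i in range(adj.shape[0]):
    for j in range(i + 1, adj.shape[1]):
        if adj[i, j] >= threshold:
            g.add_edge(i, j, weight=int(adj[i, j]))

communities = greedy_modularity_communities(g, weight="weight")
print([sorted(c) for c in communities])       # the two co-selected answer pairs
print(modularity(g, communities, weight="weight"))
```

On this toy network the two answer pairs separate cleanly; on real response data, as discussed above, the thresholded adjacency network retains many accidental co-selections, and the partition collapses to only one or two large communities.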
B. Correlation Analysis
Part of the cause of the failure of MAMCR to find meaningful community structure for large samples can be understood by comparing the adjacency matrix to the correlation matrix. The correlation matrix also defines a network, most usefully when a threshold value is applied. The adjacency matrix which produced the network in Fig. 2 has no obvious clustered structure. The partial correlation matrices reported in Study 3 clearly show clustering into distinct communities.

The correlation between item X and item Y is defined as:

corr(X, Y) = E[(X − μ_X)(Y − μ_Y)] / (σ_X σ_Y)   (1)

where μ_j is the mean of variable j, σ_j is the standard deviation, and E[Z] is the expectation value of the random variable Z. The expectation value is defined as:

E[X] = (1/N) Σ_i X_i   (2)

where i is a participant and N is the number of participants. Equation 1 can be simplified to produce Eq. 3:

corr(X, Y) = (E[X · Y] − μ_X · μ_Y) / (σ_X σ_Y)   (3)

For dichotomously scored items, the sum Σ_i X_i Y_i is the X, Y entry in the adjacency matrix, adj(X, Y) = Σ_i X_i Y_i. The correlation matrix is then related to the adjacency matrix by the expression:

corr(X, Y) = (adj(X, Y) − N μ_X μ_Y) / (N σ_X σ_Y)   (4)

A pair of items can have a large adj(X, Y) in a number of ways: (a) purposeful association, where students preferentially select the two items together, or (b) accidental association, where many students select both items, so on average the items are selected together often. By subtracting the product of the means, the correlation matrix eliminates the second case and only has large values for purposefully selected pairs. This suggests the adjacency matrix contains many more edges that are the result of random chance than the correlation matrix.
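Equation 4 is easy to check numerically for dichotomous data. The sketch below (synthetic data, numpy only; not the authors' code) compares the adjacency-to-correlation identity against a direct Pearson correlation:

```python
import numpy as np

# Synthetic dichotomous responses for two answers, X and Y, over N students.
rng = np.random.default_rng(0)
N = 500
X = rng.integers(0, 2, N)
Y = np.where(rng.random(N) < 0.7, X, 1 - X)   # Y agrees with X 70% of the time

# Entry of the adjacency matrix: number of students selecting both answers.
adj_XY = np.sum(X * Y)

# Eq. 4: corr(X, Y) = (adj(X, Y) - N mu_X mu_Y) / (N sigma_X sigma_Y)
corr_eq4 = (adj_XY - N * X.mean() * Y.mean()) / (N * X.std() * Y.std())

# Direct Pearson correlation for comparison.
corr_direct = np.corrcoef(X, Y)[0, 1]
print(corr_eq4, corr_direct)   # the two values agree
```

Note that `np.std` here is the population standard deviation, matching the definition used in Eqs. 1-4.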
The correlation matrix also has the substantial advantage of the existence of significance tests for entries, allowing the discarding of non-significant edges.

With this observation, we propose a modification of MAMCR, called Modified Module Analysis (MMA), that investigates the community structure of the correlation matrix. The remainder of this work investigates this proposal. The differences between MAMCR and MMA are presented schematically in Fig. 1.

To explore this proposal, the correlation matrix was calculated for all incorrect answers. Nodes with too few participants to be statistically reliable were eliminated; for this work, nodes with fewer than 30 responses were removed. Edges were removed where the correlation, r, between the two nodes was not statistically significant, and the remaining network was examined at thresholds of r > 0.15, r > 0.20, and r > 0.25. The representation in Fig. 4 was produced by the "qgraph" package in "R" [89]. The width of the line is proportional to the size of the correlation. Node placement is for visual effect only.
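The MMA network construction just described can be sketched as follows (a minimal illustration with numpy and scipy; the function name, the defaults, and the 0.05 significance level are our assumptions, not the authors' code):

```python
import numpy as np
from scipy import stats

def sparsify_correlations(responses, r_threshold=0.20, min_respondents=30,
                          alpha=0.05):
    """Build a thresholded correlation network from dichotomous answer data.

    responses: (N_students, N_answers) 0/1 matrix of answer selections.
    Returns an (N_answers, N_answers) matrix in which correlations that are
    non-significant, below the r threshold, or involve a rarely chosen
    answer are zeroed; the result can be used as a weighted adjacency matrix.
    """
    n_students, n_answers = responses.shape
    keep = responses.sum(axis=0) >= min_respondents   # drop rare answers
    corr = np.zeros((n_answers, n_answers))
    for i in range(n_answers):
        for j in range(i + 1, n_answers):
            if not (keep[i] and keep[j]):
                continue
            r, p = stats.pearsonr(responses[:, i], responses[:, j])
            # Retain only significant, positive correlations above threshold.
            if p < alpha and r > r_threshold:
                corr[i, j] = corr[j, i] = r
    return corr
```

The returned matrix corresponds to the right branch of Fig. 1: it is the reduced matrix that MMA treats as a weighted adjacency matrix for community detection.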
C. Modified Module Analysis
The correlation matrices in Fig. 4 show a clear clustered structure. MMA was applied to understand these structures. The communities detected for the r > 0.2 network are summarized by a community matrix whose entry for each pair of items, C, gives the fraction of the bootstrap subsamples in which the pair of items were found in the same community. The community matrix was filtered to show items that were identified in C > 60% and C > 80% of the communities in the 1000 bootstrap replications in Table I. The majority of the communities extracted from the community matrix were fully connected; each node was connected to every other node in the community. Some, however, were not. The intra-community density, γ, is defined as the ratio of the number of edges in the community to the maximum number of edges possible [6]. For communities with γ < 1, γ is presented as a percentage in parenthesis in Table I. For example, if a community contains four nodes then there are a maximum of six distinct edges between the nodes. If the community only possesses five of those edges, then γ = 5/6.

D. The Structure of Incorrect FCI Responses
Unless otherwise stated, results below are reported for C > 80% and r > 0.2.
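The bootstrap stability measure C and the intra-community density γ described above can be sketched as follows (again a toy illustration: networkx's greedy modularity communities stands in for the CDA used in this work, and the function names are ours):

```python
import itertools
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def community_matrix(responses, r_threshold=0.2, n_boot=200, seed=0):
    """Entry [i, j] is C: the fraction of bootstrap subsamples in which
    answers i and j were placed in the same community."""
    n_students, n_answers = responses.shape
    rng = np.random.default_rng(seed)
    counts = np.zeros((n_answers, n_answers))
    for _ in range(n_boot):
        # Resample students with replacement and rebuild the network.
        resample = responses[rng.integers(0, n_students, n_students)]
        corr = np.nan_to_num(np.corrcoef(resample, rowvar=False))
        g = nx.Graph()
        for i, j in itertools.combinations(range(n_answers), 2):
            if corr[i, j] > r_threshold:
                g.add_edge(i, j, weight=corr[i, j])
        if g.number_of_edges() == 0:
            continue
        for comm in greedy_modularity_communities(g, weight="weight"):
            for i, j in itertools.combinations(sorted(comm), 2):
                counts[i, j] += 1
                counts[j, i] += 1
    return counts / n_boot

def intra_community_density(graph, community):
    """gamma: edges present in the community over the maximum possible."""
    k = len(community)
    return graph.subgraph(community).number_of_edges() / (k * (k - 1) / 2)
```

For the four-node example in the text, a community holding five of its six possible edges gives γ = 5/6, about 83%.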
1. Types of Incorrect Communities
Table II classifies the incorrect reasoning for each community of incorrect answers in Table I. These can be divided into two general classes: communities resulting
Figure 4. Post-test correlation matrices of women at varying levels of r: (a) r > 0.15, (b) r > 0.20, (c) r > 0.25.

Figure 5. Communities detected in the FCI correlation matrix with r > 0.2. Each community is drawn in a different color.

from blocking and communities resulting from consistently applied incorrect reasoning (misconceptions). Communities {8A, 9B}, {21B, 23C}, and {21C, 22A} are answers within blocked problems where the second answer in the pair would be correct if the first answer was correct. The other communities apply either the same incorrect reasoning or related incorrect reasoning.

Hestenes and Jackson produced a detailed taxonomy of the naive conceptions (their terminology) tested by the FCI [92]. Table II shows a mapping of this taxonomy onto the incorrect answer communities identified in the current work. The taxonomy divides the naive conceptions into a general category and a number of sub-categories. The number in parenthesis in Table II is the sub-category label [92]. Items marked with an asterisk in Table II are part of item blocks. Because the relation between the items seems to be largely generated by the interdependencies resulting from blocking rather than consistently applied misconceptions, the blocked items will not be discussed further.

Some issues arise in comparing the proposed FCI taxonomy with the communities identified by MMA and the similar item blocks identified in Study 3. First, for some of the items in the incorrect communities, no misconception was identified (items 1D, 2C, and 2D). Students are answering these items in a correlated manner, which implies the possibility of consistent reasoning patterns; for these items a possible misconception was suggested. The new misconception was labeled "(Add)." Table II shows that the items in the incorrect communities identified by MMA often belong to multiple naive misconception categories and have different sub-categories. This would seem to imply that the naive conception taxonomy is more detailed than the actual application of misconceptions by students as measured by the FCI.
For example, in the Newton's 3rd law community, {4A, 15C, 28D}, different items involve objects of different activity, from one student pushing on another student (item 28, one active object), to a car pushing a truck (item 15, one active object), to a head-on collision (item 4, two active objects). This distinction does not seem important to the

Table I. Communities identified in the pretest and post-test incorrect answers at r > 0.2, filtered at the community threshold C. The number in parenthesis is the intra-community density, γ, for communities where the intra-community density is not one. [Table body not recovered; columns give the communities identified for men and women on the pretest and post-test at C > 60% and C > 80%.]

student's answering pattern. It is unclear if this results from one misconception taking precedence over another or from students applying more general reasoning as proposed by the resource or knowledge-in-pieces models. As such, Table II includes a column which proposes a title for the dominant misconception. In many cases, the dominant misconception was identified as the misconception shared by the majority of the items. In some cases, a dominant misconception was proposed. For the Newton's 3rd law community, {4A, 15C, 28D}, multiple misconceptions

Table II. Misconceptions represented by incorrect answer communities. Communities marked with a ∗ result from blocked problems. Proposed additions are marked (Add). Proposed items to be removed are marked (Remove). If Add or Remove is placed before all items, it applies to all items. If Add or Remove is placed before only one of many items, it applies to that item.

Community | Naive Conceptions (Category; Sub-Category) | Dominant Misconception
1A, 2C | Gravity; 1A (G3): Heavier objects fall faster | Heavier objects fall faster
5E, 18E | Impetus, Active Forces; 18D (AF2): Motion implies active forces; 5E (I1): Impetus supplied by "hit"; (Remove) 5E (I5): Circular impetus; (Add) 18E (I1): Motion implies active forces; 5E, 18E (CF): Centrifugal force | Motion implies active forces, centrifugal force
6A, 7A | Impetus; 6A, 7A (I5): Circular impetus | Circular impetus
8A, 9B∗ | Concatenation of Influences; 8A, 9B (CI3): Last force to act determines motion |
11B, 29A | Other Influences on Motion, Impetus; (Remove) 11B (Ob): Obstacles exert no force; (Remove) 11B (I1): Impetus supplied by "hit"; (Add) 11B (AF2): Motion implies active forces | Motion implies active forces
11C, 13C, 30E | Impetus; 11C, 30E (I1): Impetus supplied by "hit" | Motion implies active forces
17A, 25D | Concatenation of Influences, Resistance; 25D (CI1): Largest force determines motion; 25D (R2): Motion when force overcomes resistance | Largest force determines motion
21B, 23C∗ | Concatenation of Influences; 21B, 23C (CI3): Last force to act determines motion |
21C, 22A∗ | Concatenation of Influences, Active Forces; 21C (CI2): Force compromise determines motion; 22A (AF4): Velocity proportional to applied force |
23D, 24C∗ | Impetus; 23D, 24C (I3): Impetus dissipation; 23D (I2): Loss/recovery of original impetus | Impetus dissipation

The community {11B, 29A} is also curious. In item 11, a hockey puck is struck, activating the impetus supplied by the "hit" misconception, but response 11B explicitly asks about a force in the direction of motion. As such, we propose this item also tests the motion implies active forces misconception. It is also unclear how item 11B probes the obstacles exert no force misconception; we propose it be removed from the item. Item 29 involves a chair sitting on a floor; response 29A identifies only the force of gravity on the object and ignores the normal force. It seems difficult to claim this community probes a common misconception. Item 29 was also demonstrated to have poor psychometric properties in Study 2; the correlation between 11B and 29A may have resulted from 29A not functioning as intended.

The community {11C, 13C, 30E} continues to convolve the motion implies active forces misconception with the impetus supplied by the "hit" misconception. Response 30E explicitly discusses the force of the "hit" while items 11C and 13C discuss a force in the direction of motion. We propose adding this misconception to items 11C and 13C. Further, only item 13C involves the idea of a dissipation of impetus. For this community, while multiple misconceptions are tested, one seems to dominate student responses: motion implies active forces.

Finally, the blocked item responses 23D, 24C differ from the other blocked responses.
Rather than the second response being the correct answer if the first response was correct, both appear to be applications of the dissipation of impetus misconception.
2. Reducing Sparsification
The analysis was repeated at the less restrictive threshold r > 0.15 for both C > 60% and C > 80%. While the reduced threshold allowed the community { } to be detected for both men and women, most other new communities identified did not result from the merger of communities identified at more restrictive thresholds. Particularly on the pretest, the larger communities do not make much sense in terms of the framework of Study 3. This is particularly evident in the mixing of the Newton's 3rd law items {4, 15, 16, and 28} with other items. As such, it appears that student misconceptions exist relatively independently as small groups of consistent answers, not as part of a larger coherent framework.
3. The Strength of Common Misconceptions
One motivation of Study 1 was to provide instructors with a mechanism for identifying common misconceptions so that specific interventions could be targeted to address those misconceptions. The communities of incorrect answers remaining on the post-test, as shown in Table I, could be used to provide a measure of the prevalence of the misconception in the classes studied. Table III presents an overall average for each incorrect community in Table I on the post-test. Only communities that did not result from problem blocking are presented. Averages were calculated by assigning a score of 1 if the response was selected and 0 if it was not, then averaging over each item in the group. Results are disaggregated by gender, and the p-value for a t-test to determine if differences by gender are significant is also presented; Cohen's d provides a measure of effect size. Cohen suggests that d = 0.2 represents a small effect, d = 0.5 a medium effect, and d = 0.8 a large effect.

Table III. Percentage of students selecting each incorrect community for the FCI post-test. A t-test was performed to determine if the differences between men and women were significant; the p-value is presented. Cohen's d for the difference is also presented.

Community | Male Ave. (%) | Female Ave. (%) | p | d | Misconception
4A, 15C, 28D | 32 ± 47 | 33 ± 47 | 0.27 | 0.02 | Newton's 3rd law misconceptions
5D, 11C, 13C, 18D, 30E | 22 ± 42 | 20 ± 40 | < .001 | 0.06 | Motion implies active forces
5E, 18E | 7 ± 25 | 7 ± 25 | 0.69 | 0.01 | Motion implies active forces, centrifugal force
6A, 7A | 14 ± 35 | 5 ± 22 | < .001 | 0.39 | Circular impetus
17A, 25D | 42 ± 49 | 37 ± 48 | < .001 | 0.11 | Largest force determines motion

the percentage of students who answer an item incorrectly; therefore, only differences in Table III greater than 8% represent unexpected differences between men and women. Only items {6A, 7A} exceed this difference, and then only slightly, with a difference of 9%. Items {6A, 7A} are also the only community with differences of at least a small effect size; however, the effect size is likely inflated by the small standard deviation of women because of a floor effect. In general, the rate of selecting one of the communities of common incorrect answers was very similar for men and women.

For the class studied, the results of Table III suggest that additional effort be directed to addressing the largest force determines motion misconception measured by {17A, 25D} and Newton's 3rd law misconceptions measured by {4A, 15C, 28D}.

IV. DISCUSSION

A. Research Questions
This study sought to answer four research questions; they will be addressed in the order proposed.
RQ1: Are the results of Module Analysis for Multiple-Choice Responses replicable for large FCI data sets? If not, what changes to the algorithm are required to detect meaningful communities of incorrect answers?
The MAMCR process described in Study 1 identified only one or two communities in our data, whether using LANS or an edge weight threshold to sparsify the network. This result held for Infomap and for other CDAs. Reducing the data to a comparable size by subsampling generated more communities, but still fewer than identified in Study 1; moreover, the communities identified did not make conceptual sense. We concluded that the community structure identified in Study 1 was the result of the low sample size and the LANS algorithm, and that modifications to MAMCR were needed to productively identify incorrect answer communities.

The failure of MAMCR for large samples led us to propose a variant of the algorithm using the correlation matrix instead of the adjacency matrix to build the network. This matrix was sparsified by removing statistically insignificant correlations and correlations below a threshold (r < 0.2).

RQ2: How do the communities detected change as network-building parameters are modified? Do these changes support the existence of a coherent non-Newtonian conceptual model?
A more permissive threshold for the correlation matrix (r > 0.15) yielded larger communities, as shown in the Supplemental Material [93]. These larger communities were not formed by the joining of smaller communities related to the same misconception; in fact, many of the communities contained items that had little conceptual relation. As such, it appears that the best model of student misconceptions is as isolated pieces of reasoning associated with items with a similar correct solution structure.
RQ3: How is the incorrect answer community structure different between the pretest and the post-test?
For C > 80% and r > 0.2, a total of 14 incorrect answer communities were identified for either men or women pre- and post-instruction; 5 of the communities were consistently identified for both genders pre- and post-instruction. Three of these five represent consistently applied misconceptions: {4A, 15C, 28D}, Newton's 3rd law misconceptions; {5E, 18E}, motion implies active forces and the existence of a centrifugal force; and {6A, 7A}, circular impetus. The items from which the incorrect answers in these communities were drawn were all identified as having very similar correct solution structure in Study 3. The other two communities were drawn from problem blocks: { } and { }. Three incorrect communities disappeared with instruction: for all students, { }, motion implies active forces; for men only, { }, heavier objects fall faster, and { }, lighter objects fall faster. Many incorrect communities were only identified post-instruction, including { }, involving items with similar solution structure as identified in Study 3.

RQ4: How is the incorrect answer community structure different for men and women? Do the differences explain the gender unfairness identified in the instrument?

Post-instruction, using
C > 80% and r > 0.2, 11 communities were identified for either men or women; 8 of these communities were identified for both men and women. One of the other three communities was only identified for women; this community, { }, represents the motion implies active forces misconception. This community was the merger of the two communities only identified for men, { } and { }; the female community was also not completely connected. One community, { }, was the result of blocking and was identified for both men and women post-instruction. The other two communities unique to men, { } and { }, involve the heavier objects fall faster and the lighter objects fall faster misconceptions. The misconception structure of men and women was quite similar pre-instruction, with men holding more consistent misconceptions.

B. Additional Observations
The misconception communities identified by MMA were not completely consistent with the naive conception taxonomy provided by Hestenes and Jackson for the FCI [92]. Often multiple naive conceptions were associated with the same community. This may indicate that student reasoning is better modeled by a more general framework such as knowledge-in-pieces or ontological categories. It may also indicate that the FCI cannot fully resolve the detailed set of misconceptions identified in the taxonomy.

The results of this work were not consistent with recent exploratory analyses of the FCI [53-55], which identified a few large factors; these factors mixed very different correct and incorrect responses. The small communities identified in the current work, which are partially supported by the taxonomy of Hestenes and Jackson, seem to indicate that MMA may be a more productive quantitative method to explore misconceptions.
V. IMPLICATIONS
Not all of the communities identified in Table I represent misconceptions. Some represent combinations of dependent answers. For these combinations, the second answer is correct if the first answer was the correct answer. This suggests that, because of the blocking of items in the FCI, a simple scoring of the instrument with each item as correct or incorrect may understate a student's knowledge of the material. Previous authors have called for reevaluating the scoring of the FCI [95], but not because of problem blocking.

The identification of three communities of incorrect answers that were the result of item blocking further supports the conclusions of Study 3 that item blocking should be discontinued in future PER instruments because it may make the instruments difficult to interpret statistically.

The misconception communities identified in Table II allow instructors to determine the strength of students' misconceptions as they enter a physics class and the remaining strength after instruction, as shown in Table III. This should allow instructors to adjust their classes to address misconceptions remaining after instruction and to direct fewer resources to addressing misconceptions that are not present pre-instruction.
VI. FUTURE WORK
MMA was productive in extending the understanding of the incorrect answer structure of the FCI; it will be extended to other conceptual instruments including the Force and Motion Conceptual Evaluation [23] and the Conceptual Survey of Electricity and Magnetism [96].

Network analysis encompasses a broad collection of powerful analysis techniques. The analysis in this work represents the barest beginnings of the possibilities of these techniques. Future research may consider networks with multiple types of nodes (possibly correct and incorrect answers, or pretest and post-test answers) or multiple types of edges (possibly negative and positive correlations).
VII. CONCLUSION
Previous results reported for Module Analysis for Multiple-Choice Responses (MAMCR) could not be replicated for a large sample. The failure of the algorithm at large sample size likely results from a combination of unpurposeful edges in the adjacency matrix at large sample sizes and properties of the LANS sparsification algorithm. A modification of the algorithm, Modified Module Analysis (MMA), based on the correlation matrix was productive in identifying useful community structure. MMA identified 11 communities on the post-test and 9 on the pretest. Most of these communities were identified both for men and women: 8 on the post-test and 6 on the pretest. In general, the incorrect answer community structure identified for men and women was very similar and could not explain the gender differences previously identified in a subset of items in the instrument. The communities identified at high sparsification failed to merge into larger communities addressing similar misconceptions as sparsification was reduced, suggesting that students do not have an integrated non-Newtonian belief system, but rather isolated incorrect beliefs strongly tied to the type of question asked.
ACKNOWLEDGMENTS
This work was supported in part by the National Sci-ence Foundation as part of the evaluation of improved learning for the Physics Teacher Education Coalition,PHY-0108787. [1] D. Hestenes, M. Wells, and G. Swackhamer, “Force Con-cept Inventory,” Phys. Teach. , 141–158 (1992).[2] A. Madsen, S.B. McKagan, and E. Sayre, “Gender gapon concept inventories in physics: What is consistent,what is inconsistent, and what factors influence the gap?”Phys. Rev. Phys. Educ. Res. , 020121 (2013).[3] A. Traxler, R. Henderson, J. Stewart, G. Stewart, A. Pa-pak, and R. Lindell, “Gender fairness within the ForceConcept Inventory,” Phys. Rev. Phys. Educ. Res. ,010103 (2018).[4] E. Brewe, J. Bruun, and I.G. Bearden, “Using mod-ule analysis for multiple choice responses: A newmethod applied to Force Concept Inventory data,”Phys. Rev. Phys. Educ. Res. , 020131 (2016).[5] M.J. Newman, Networks, 2nd ed. (Oxford UniversityPress, New York, NY, 2018).[6] K.A. Zweig,
Network Analysis Literacy: A Practical Ap-proach to the Analysis of Networks (Springer-Verlag,Wien, Austria, 2016).[7] A.V. Papachristos and C. Wildeman, “Network exposureand homicide victimization in an African American com-munity,” Am. J. Public Health , 143–150 (2014).[8] F. De Vico, J. Richiardi, M. Chavez, and S. Achard,“Graph analysis of functional brain networks: Practicalissues in translational neuroscience,” Philos. T. R. Soc.Lon. B (2014).[9] J. Lop´ez Pe˜na and H. Touchette, “A network theoryanalysis of football strategies,” in
Sports Physics: Proc.2012 Euromech Physics of Sports Conference , edited byC. Clanet (´Editions de l’´Ecole Polytechnique, 2012) pp.517–528.[10] Z. Zheng and Y. Zhao, “Transcriptome comparison andgene coexpression network analysis provide a systemsview of citrus response to “CandidatusLiberibacter asi-aticus” infection,” BMC Genomics , 27 (2013).[11] S. Fortunato and D. Hric, “Community detection in net-works: A user guide,” Physics Reports , 1–44 (2016).[12] M. Rosvall and C.T. Bergstrom, “Maps of randomwalks on complex networks reveal community structure,”P. Natl. Acad. Sci. USA , 1118–1123 (2008).[13] L. Crocker and J. Algina, Introduction to Classical andModern Test Theory (Holt, Rinehart and Winston, NewYork, 1986).[14] R.J. De Ayala,
The theory and practice of item responsetheory (Guilford Publications, 2013). [15] P.W. Holland and D.T. Thayer, “An alternate definitionof the ETS delta scale of item difficulty,” ETS ResearchReport Series
Research Report RR-85-43 (1985).[16] P.W. Holland and D.T. Thayer, “Differential item per-formance and the Mantel-Haenszel procedure,” in
TestValidity , edited by H. Wainer and H. I. Braun (LawrenceErlbaum, Hillsdale, NJ, 1993) pp. 129–145.[17] J. Stewart, C. Zabriskie, S. DeVore, andG. Stewart, “Multidimensional item responsetheory and the Force Concept Inventory,”Phys. Rev. Phys. Educ. Res. , 010137 (2018).[18] D. Huffman and P. Heller, “What does theForce Concept Inventory actually measure?”Phys. Teach. , 138 (1995).[19] T.F. Scott, D. Schumayer, and A.R. Gray, “Exploratoryfactor analysis of a Force Concept Inventory data set,”Phys. Rev. Phys. Educ. Res. , 020105 (2012).[20] N. Lasry, S. Rosenfield, H. Dedic, A. Dahan, andO. Reshef, “The puzzling reliability of the Force Con-cept Inventory,” Am. J. Phys. , 909–912 (2011).[21] T.F. Scott and D. Schumayer, “Students’ proficiencyscores within multitrait item response theory,” Phys.Rev. Phys. Educ. Res. , 020134 (2015).[22] M.R. Semak, R.D. Dietz, R.H. Pearson, andC.W. Willis, “Examining evolving performance onthe Force Concept Inventory using factor analysis,”Phys. Rev. Phys. Educ. Res. , 010103 (2017).[23] R.K. Thornton and D.R. Sokoloff, “Assessing stu-dent learning of Newton’s laws: The Force andMotion Conceptual Evaluation and the evaluationof active learning laboratory and lecture curricula,”Am. J. Phys. , 338–352 (1998).[24] C. Nord, S. Roey, S. Perkins, M. Lyons,N. Lemanski, J. Schuknecht, and J. Brown,“American High School Graduates: Results of the 2009 NAEP High School Transcript Study,” USDepartment of Education, National Center for EducationStatistics, Washington, DC (2011).[25] B.C. Cunningham, K.M. Hoyer, and D. Sparks, Gender Differences in Science, Technology, Engineering, and Mathematics (STEM) Interest, Credits Earned, and NAEP Performance in the 12th Grade (National Center for Education Statistics, Washington,DC, 2015).[26] B.C. Cunningham, K.M. Hoyer, and D. Sparks,“The Condition of STEM 2016,” ACT Inc., Iowa City,IA (2016).[27] P.M. Sadler and R.H. 
Tai, “Success in introductory col-lege physics: The role of high school preparation,” Sci. Educ. , 111–136 (2001).[28] Z. Hazari, R.H. Tai, and P.M. Sadler, “Gender dif-ferences in introductory university physics performance:The influence of high school physics preparation and af-fective factors,” Sci. Educ. , 847–876 (2007).[29] D. Voyer and S.D. Voyer, “Gender differences in scholas-tic achievement: A meta-analysis.” Psychol. Bull. ,1174 (2014).[30] Y. Maeda and S.Y. Yoon, “A meta-analysis on gender dif-ferences in mental rotation ability measured by the Pur-due Spatial Visualization Tests: Visualization of Rota-tions (PSVT: R),” Educ. Psychol. Rev. , 69–94 (2013).[31] D.F. Halpern, Sex Differences in Cognitive Abilities, 4thed. (Psychology Press, Francis & Tayler Group, NewYork, NY, 2012).[32] J.S. Hyde and M.C. Linn, “Gender differences in verbalability: A meta-analysis.” Psychol. Bull. , 53 (1988).[33] J.S. Hyde, E. Fennema, and S.J. Lamon, “Gender dif-ferences in mathematics performance: A meta-analysis.”Psychol. Bull. , 139 (1990).[34] N.S. Cole,
The ETS Gender Study: How Females andMales Perform in Educational Settings (EducationalTesting Service, Princeton, NJ, 1997).[35] N.M. Else-Quest, J.S. Hyde, and M.C. Linn, “Cross-national patterns of gender differences in mathematics:A meta-analysis.” Psychol. Bull. , 103 (2010).[36] X. Ma, “A meta-analysis of the relationship between anx-iety toward mathematics and achievement in mathemat-ics,” Jour. Res. Math. Educ. , 520–540 (1999).[37] J.V. Mallow and S.L. Greenburg, “Science anxiety:Causes and remedies,” J. Coll. Sci. Teach. , 356–358(1982).[38] M.K. Udo, G.P. Ramsey, and J.V. Mallow, “Science anx-iety and gender in students taking general education sci-ence courses,” J. Sci. Educ. Technol. , 435–446 (2004).[39] J. Mallow, H. Kastrup, F.B. Bryant, N. Hislop,R. Shefner, and M. Udo, “Science anxiety, science atti-tudes, and gender: Interviews from a binational study,”J. Sci. Educ. Technol. , 356–369 (2010).[40] J.R. Shapiro and A.M. Williams, “The role ofstereotype threats in undermining girls’ andwomen’s performance and interest in STEM fields,”Sex Roles , 175–183 (2012).[41] R. Henderson, G. Stewart, J. Stewart, L. Michaluk, andA. Traxler, “Exploring the gender gap in the ConceptualSurvey of Electricity and Magnetism,” Phys. Rev. Phys.Educ. Res. , 020114 (2017).[42] L. McCullough and D.E. Meltzer, “Differ-ences in male/female response patterns onalternative-format versions of FCI items,” in ,edited by K. Cummings, S. Franklin, and J. Marx (AIPPublishing, New York, 2001) pp. 103–106.[43] L. McCullough, “Gender, context, and physics assess-ment,” J. Int. Womens St. , 20–30 (2004).[44] S. Osborn Popp, D. Meltzer, and M.C. Megowan-Romanowicz, “Is the Force Concept Inventory bi-ased? Investigating differential item functioningon a test of conceptual learning in physics,” in (American Education Research Association, Washington,DC, 2011).[45] R.D. Dietz, R.H. Pearson, M.R. Semak, and C.W.Willis, “Gender bias in the Force Concept Inventory?” in ,Vol. 
1413, edited by N.S. Rebello, P.V. Engelhardt, and C. Singh (AIP Publishing, New York, 2012) pp. 171–174.
[46] R. Henderson, P. Miller, J. Stewart, A. Traxler, and R. Lindell, “Item-level gender fairness in the Force and Motion Conceptual Evaluation and the Conceptual Survey of Electricity and Magnetism,” Phys. Rev. Phys. Educ. Res., 020103 (2018).
[47] J. Clement, “Students’ preconceptions in introductory mechanics,” Am. J. Phys., 66–71 (1982).
[48] J. Clement, D.E. Brown, and A. Zietsman, “Not all preconceptions are misconceptions: Finding anchoring conceptions for grounding instruction on students’ intuitions,” Int. J. Sci. Educ., 554–565 (1989).
[49] J. Clement, “Using bridging analogies and anchoring intuitions to deal with students’ preconceptions in physics,” J. Res. Sci. Teach., 1241–1257 (1993).
[50] I.A. Halloun and D. Hestenes, “The initial knowledge state of college physics students,” Am. J. Phys., 1043–1055 (1985).
[51] I.A. Halloun and D. Hestenes, “Common sense concepts about motion,” Am. J. Phys., 1056–1065 (1985).
[52] R.R. Hake, “Interactive-engagement versus traditional methods: A six-thousand-student survey of mechanics test data for introductory physics courses,” Am. J. Phys., 64–74 (1998).
[53] T.F. Scott and D. Schumayer, “Conceptual coherence of non-Newtonian worldviews in Force Concept Inventory data,” Phys. Rev. Phys. Educ. Res., 010126 (2017).
[54] P. Eaton, K. Vavruska, and S. Willoughby, “Exploring the preinstruction and postinstruction non-Newtonian world views as measured by the Force Concept Inventory,” Phys. Rev. Phys. Educ. Res., 010123 (2019).
[55] T.F. Scott and D. Schumayer, “Central distractors in Force Concept Inventory data,” Phys. Rev. Phys. Educ. Res., 010106 (2018).
[56] L. Viennot, “Spontaneous reasoning in elementary dynamics,” Eur. J. Sci. Educ., 205–221 (1979).
[57] D.E. Trowbridge and L.C. McDermott, “Investigation of student understanding of the concept of acceleration in one dimension,” Am. J. Phys., 242–253 (1981).
[58] A.
Caramazza, M. McCloskey, and B. Green, “Naive beliefs in ‘sophisticated’ subjects: Misconceptions about trajectories of objects,” Cogn., 117–123 (1981).
[59] P.C. Peters, “Even honors students have conceptual difficulties with physics,” Am. J. Phys., 501–508 (1982).
[60] M. McCloskey, “Intuitive physics,” Sci. Am., 122–131 (1983).
[61] R.F. Gunstone, “Student understanding in mechanics: A large population survey,” Am. J. Phys., 691–696 (1987).
[62] C.W. Camp and J.J. Clement, Preconceptions in Mechanics: Lessons Dealing with Students’ Conceptual Difficulties (Kendall/Hunt, Dubuque, IA, 1994).
[63] L.C. McDermott, “Students’ conceptions and problem solving in mechanics,” in
Connecting Research in Physics Education with Teacher Education, edited by Andrée Tiberghien, E. Leonard Jossem, and Jorge Barojas (International Commission on Physics Education, 1997) pp. 42–47.
[64] R. Rosenblatt and A.F. Heckler, “Systematic study of student understanding of the relationships between the directions of force, velocity, and acceleration in one dimension,” Phys. Rev. Phys. Educ. Res., 020112 (2011).
[65] N. Erceg and I. Aviani, “Students’ understanding of velocity-time graphs and the sources of conceptual difficulties,” Croat. J. Educ., 43–80 (2014).
[66] B. Waldrip, “Impact of a representational approach on students’ reasoning and conceptual understanding in learning mechanics,” Int. J. Sci. Math. Educ., 741–765 (2014).
[67] A.A. diSessa, “Knowledge in pieces,” in Constructivism in the Computer Age, The Jean Piaget Symposium Series, edited by George Forman and Peter B. Pufall (Lawrence Erlbaum, Hillsdale, NJ, 1988) pp. 49–70.
[68] A.A. diSessa, “Toward an epistemology of physics,” Cogn. Instr., 105–225 (1993).
[69] A.A. diSessa and B.L. Sherin, “What changes in conceptual change?” Int. J. Sci. Educ., 1155–1191 (1998).
[70] M.T.H. Chi and J.D. Slotta, “The ontological coherence of intuitive physics,” Cogn. Instr., 249–260 (1993).
[71] M.T.H. Chi, J.D. Slotta, and N. De Leeuw, “From things to processes: A theory of conceptual change for learning science concepts,” Learn. Instr., 27–43 (1994).
[72] J.D. Slotta, M.T.H. Chi, and E. Joram, “Assessing students’ misclassifications of physics concepts: An ontological basis for conceptual change,” Cogn. Instr., 373–400 (1995).
[73] R. Duit and D.F. Treagust, “Conceptual change: A powerful framework for improving science teaching and learning,” Int. J. Sci. Educ., 671–688 (2003).
[74] D. Hammer, “Misconceptions or p-prims: How may alternative perspectives of cognitive structure influence instructional perceptions and intentions,” J. Learn. Sci., 97–127 (1996).
[75] D.
Hammer, “More than misconceptions: Multiple perspectives on student knowledge and reasoning, and an appropriate role for education research,” Am. J. Phys., 1316–1325 (1996).
[76] D. Hammer, “Student resources for learning introductory physics,” Am. J. Phys., S52–S59 (2000).
[77] E. Etkina, J. Mestre, and A. O’Donnell, “The impact of the cognitive revolution on science learning and teaching,” in The Cognitive Revolution in Educational Psychology, edited by James M. Royer (IAP, 2005) pp. 119–164.
[78] J.D. Bransford, A.L. Brown, and R.R. Cocking,
How People Learn: Brain, Mind, Experience, and School (National Academy Press, Washington, DC, 2000).
[79] G. Csardi and T. Nepusz, “The igraph software package for complex network research,” InterJournal, Complex Systems, 1–9 (2006).
[80] R Core Team,
R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria (2017).
[81] “US News & World Report: Education,” https://premium.usnews.com/best-colleges. Accessed 4/30/2017.
[82] D. Edler and M. Rosvall, “The mapequation software package,” Available online at . Accessed 2/1/2019.
[83] S.P. Borgatti and D.S. Halgin, “Analyzing affiliation networks,” in
The Sage Handbook of Social Network Analysis, edited by J. Scott and P.J. Carrington (Sage Publications, Thousand Oaks, CA, 2011) pp. 417–433.
[84] N.J. Foti, J.M. Hughes, and D.N. Rockmore, “Nonparametric sparsification of complex multiscale networks,” PLoS ONE, 1–10 (2011).
[85] A. Traxler, A. Gavrin, and R. Lindell, “Networks identify productive forum discussions,” Phys. Rev. Phys. Educ. Res., 020107 (2018).
[86] A. Clauset, M.E.J. Newman, and C. Moore, “Finding community structure in very large networks,” Phys. Rev. E, 066111 (2004).
[87] M.E.J. Newman, “Fast algorithm for detecting community structure in networks,” Phys. Rev. E, 066133 (2004).
[88] M.E.J. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Phys. Rev. E, 026113 (2004).
[89] S. Epskamp, A.O.J. Cramer, J.L. Waldorp, V.D. Schmittmann, and D. Borsboom, “qgraph: Network visualizations of relationships in psychometric data,” J. Stat. Soft., 1–18 (2012).
[90] A.C. Davison and D.V. Hinkley, Bootstrap Methods and Their Applications (Cambridge University Press, Cambridge, UK, 1997).
[91] A. Canty and B.D. Ripley, boot: Bootstrap R (S-Plus) Functions (2017), R package version 1.3-20.
[92] “Table II for the Force Concept Inventory (revised from 081695r),” http://modeling.asu.edu/R&E/FCI-RevisedTable-II_2010.pdf. Accessed 3/17/2019.
[93] See Supplemental Material at [URL will be inserted by publisher] for the communities detected at the r > .
[94] J. Cohen, Statistical Power Analysis for the Behavioral Sciences (Academic Press, New York, NY, 1977).
[95] R.C. Hudson and F. Munley, “Re-score the Force Concept Inventory!” Phys. Teach., 261–261 (1996).
[96] D.P. Maloney, T.L. O’Kuma, C. Hieggelke, and A. Van Heuvelen, “Surveying students’ conceptual knowledge of electricity and magnetism,” Am. J. Phys. 69