Investigating students' seriousness during selected conceptual inventory surveys
David P. Waters
Department of Basic Sciences, St. Louis College of Pharmacy, St. Louis, Missouri 63110, USA.
Dragos Amarie
Department of Physics and Astronomy, Georgia Southern University, Statesboro, Georgia 30460, USA.
Rebecca A. Booth
Calgary Board of Education, Calgary, Alberta, Canada.
Christopher Conover and Eleanor C. Sayre
Department of Physics and Astronomy, Kansas State University, Manhattan, Kansas 66506, USA.
(Dated: July 16, 2019)

Conceptual inventory surveys are routinely used in education research to identify student learning needs and assess instructional practices. Students might not fully engage with these instruments because of the low stakes attached to them. This paper explores tests that can be used to estimate the percentage of students in a population who might not have taken such surveys seriously. These three seriousness tests are the pattern recognition test, the easy questions test, and the uncommon answers test. These three tests are applied to sets of students who were assessed either by the Force Concept Inventory, the Conceptual Survey of Electricity and Magnetism, or the Brief Electricity and Magnetism Assessment. The results of our investigation are compared to computer-simulated populations of random answers.
I. INTRODUCTION
Conceptual Inventories (CIs) came out of the need to quantify students' understanding of concepts and their progress in class by monitoring learning gains [1]. The physics education research that followed has driven modern teaching with a focus on developing novel methods to stimulate students' understanding, and has also redefined our learning goals [2]. Halloun and Hestenes raised the concern that traditional instruction only marginally affects students' understanding, while their common-sense beliefs usually contradict the laws of physics [3, 4]. Their Force Concept Inventory (FCI) survey arrived as a first tool to measure students' mastery of the force concepts widely taught in the first semester of physics [5]. Since then, CIs have gained widespread use in physics and astronomy education [6-8], as well as in many other STEM disciplines [9-22].

As CIs became more useful to instructors, they began to be used as research-based assessment instruments (RBAIs) in education research [23]. RBAIs are multiple-choice but carefully designed surveys that provide insight into students' attitudes and understanding. Over time, RBAIs have undergone several rounds of scrutiny and validation [24]. When RBAI data are collected regularly, they can be valuable measuring tools, providing standardized comparisons among institutions, instructors, and teaching methods, and across multiple implementations of the same course. They also allow us to track trends and investigate correlations over time [25, 26]. The physics education research that has followed from the use of RBAIs has driven physics instructors toward developing and implementing novel methods for increasing students' understanding, as well as toward redefining student learning goals [2].
PhysPort, an online resource for instructors interested in implementing research-based physics teaching practices in their classrooms, currently provides 92 RBAIs with diverse foci, including content knowledge, problem-solving, scientific reasoning, lab skills, beliefs and attitudes, and interactive teaching [27].

Among the RBAIs available on PhysPort are the Force Concept Inventory (FCI), the Brief Electricity and Magnetism Assessment (BEMA), and the Conceptual Survey of Electricity and Magnetism (CSEM). The FCI is a 30-question RBAI used to measure students' mastery of the mechanics concepts widely taught in a first-semester introductory physics course [5]. The FCI is among the most popular RBAIs, with extensive research on its efficacy and its effects on instruction, as well as many translations into different languages and formats. In particular, the FCI has been investigated by Hestenes et al., who interviewed students and instructors to confirm that surveyed individuals correctly understood the wording and the pictographs [5, 28], whereas Stewart et al. confirm that test scores are not particularly context dependent [29]. Version H of the CSEM, published by Maloney, O'Kuma, Hieggelke, and Van Heuvelen in 2001 [30], is a 32-question RBAI used to measure student conceptual understanding of electricity and magnetism at an introductory undergraduate level. The BEMA is a 31-question RBAI also designed to assess conceptual understanding of electromagnetism.

Main concerns with RBAIs
From the early days of RBAIs [31], researchers and instructors have raised concerns about whether students make a serious attempt at correctly answering the questions on a conceptual-inventory RBAI such as the FCI, CSEM, or BEMA [23, 32-35]. In order for instructors and researchers to appropriately evaluate RBAI data, it is useful to know what proportion of students in a population took that RBAI seriously. We define serious students as those who chose answers with consideration, including educated and/or thoughtful guesses, throughout their entire assessment.

Stewart et al. [36, 37] studied the effect of guessing on both the FCI and CSEM tests. They show that gains are invariant to linear transformation and therefore unaffected by guessing, and as such their linear models can correct test results to account for guessing. Yasuda et al. show that while question 5 scores on the FCI are marginally affected by erroneous reasoning, questions 6, 7, and 16 are more prone to guessing. These questions return a high percentage of false positives, as students seem to reach the right answer while using erroneous conceptual reasoning [38, 39].

Wang et al. implement item response theory [40] to build a 3-parameter item response model and use it to analyze student performance on FCI surveys [41]. They show that a student's proficiency is linearly correlated with the student's raw FCI score. They find that low-proficiency students have less than a 5% chance of guessing the correct answer on questions 23 and 26, and a 34% chance of guessing correctly on question 16. They predict that questions 1 and 6 are the easiest, whereas questions 25 and 26 are the most difficult. As anticipated, each of the 30 questions in the FCI has a different guessing chance and difficulty level, which supports our present hypothesis that when students take the survey seriously, there is a better chance that they will select the correct answer to those questions [41]. Hake et al.
considered that motivational factors can persuade students to take RBAIs seriously. Without much evidence at the time, Hake remarked that surveyed students did take the FCI pre-test seriously [30]. Later, Henderson showed that about 2.8% of surveyed students may not take an RBAI seriously [31]. Henderson was concerned about whether students take the FCI seriously when it is not graded; to identify those students, answer patterns were examined for lack of seriousness from five different angles. By comparison, Pollock et al. ran a longitudinal study of students' conceptual understanding using the BEMA survey and asked students to report how hard they tried. Three levels were identified: took it very seriously, took it seriously, and did not take it seriously. This study shows that over 50% of students took the RBAI very seriously, and only 3% indicated that they did not take it seriously [42].

We have developed a set of seriousness tests and applied them to the FCI, CSEM, and BEMA. It was our goal to develop seriousness tests that could give instructors and researchers an estimate of the proportion of students who did not take an RBAI seriously. Notably, it was not our goal to develop seriousness tests that could identify individual students, and we recommend that the seriousness tests described in this paper not be used in that manner. In subsequent sections we describe how these seriousness tests were developed, as well as the tests' effectiveness in accurately categorizing students as either taking an RBAI seriously or not.

RBAI       Pre (%)  Post (%)  Overall (%)
Mean  FCI   40.5     57.3      47.8
      CSEM  27.8     43.8      38.9
      BEMA  22.8     46.1      38.7
SD    FCI   20.5     22.1      22.8
      CSEM  12.9     18.6      18.6
      BEMA  10.1     18.1      19.4

TABLE I. Means and standard deviations for each RBAI.
II. DATA SOURCES FOR THE FCI, CSEM, AND BEMA
Data for this paper were obtained from PhysPort's collection of student data. After administering an RBAI, instructors can use the PhysPort Data Explorer to analyze the data from their students. Once instructors have uploaded their students' responses, the data are stored in a database on PhysPort. We were able to use the data from this database to run our seriousness tests on both the pre- and post-test data for the FCI, CSEM, and BEMA. The database is larger than any data set that has been tested previously, with 64,076 assessment results for the FCI, 15,032 for the CSEM, and 8,708 for the BEMA. Table I presents the mean and the standard deviation for each RBAI.

Along with the RBAI results from PhysPort, we created 20,000 simulated RBAI results each for the FCI, CSEM, and BEMA. Our simulated students guessed randomly on all questions. We generated these simulated data in order to model the responses we might expect from non-serious students. Because we could be certain that each simulated individual in the random data set was a random guesser, the seriousness tests needed to flag a significant fraction of this population in order to be considered successful. We did not expect our seriousness tests to identify every member of the simulated population as non-serious, however, because a seriousness test that achieves this would likely misidentify serious students as non-serious. It should also be noted that real students are almost never able to behave in a truly random manner on an RBAI, even when they are being non-serious. Their results might show tendencies toward certain answer choices, patterns on the answer sheet, or other trends. This means that there might exist students who did not take an RBAI seriously but who are also not well represented in the simulated population.
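A random-guesser population of the kind described above can be reproduced with a short script. The following is a minimal sketch, not the authors' actual code: the function name and data layout are our own, and the number of answer choices is simplified to five per question, whereas the real instruments vary slightly by item.

```python
import random

def simulate_population(n_students, n_questions, n_choices=5, seed=42):
    """Generate answer sets for simulated students who guess uniformly
    at random on every question (a model of non-serious test takers)."""
    rng = random.Random(seed)
    letters = "abcdefghij"[:n_choices]
    return [[rng.choice(letters) for _ in range(n_questions)]
            for _ in range(n_students)]

# 20,000 simulated results each: FCI (30 Qs), CSEM (32 Qs), BEMA (31 Qs)
fci_sim = simulate_population(20_000, 30)
csem_sim = simulate_population(20_000, 32)
bema_sim = simulate_population(20_000, 31)
```

Seeding the generator makes the simulated population reproducible across runs of the analysis.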
III. THE SERIOUSNESS TESTS
We developed three seriousness tests that can be applied to FCI, CSEM, and BEMA responses in order to estimate the percent of students in a sample who did not take that RBAI seriously: the Pattern Recognition Test (PRT), the Uncommon Answers Test (UAT), and the Easy Questions Test (EQT). These seriousness tests are not designed, however, to identify individual students who did not take an RBAI seriously. In developing these tests, we assumed, based on the previous work by Henderson as well as by Pollock et al., that the majority of students take RBAIs seriously. As such, we expect the portion of the real population that a successful seriousness test identifies to be small.
1. The pattern recognition test
The Pattern Recognition Test (PRT) is based on the premise that students who do not take an RBAI seriously might instead leave certain patterns throughout their answers. Because computers are not good at spontaneously picking up on such patterns, we specified in advance the patterns we thought would be likely to appear in non-serious answer sets. The patterns that we searched for in the RBAIs included:

• more than 50% zeros or blank answers
• more than 50% one letter
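The two patterns recoverable from the text can be checked mechanically. A minimal sketch follows; the function name and blank-answer encoding are our own assumptions, and the paper's remaining patterns are not reproduced here.

```python
def prt_flag(answers):
    """Flag an answer set that is more than 50% blank (encoded here as
    '' or '0') or more than 50% a single letter, two of the patterns
    the PRT searches for."""
    n = len(answers)
    blanks = [a for a in answers if a in ("", "0")]
    if len(blanks) > n / 2:
        return True  # mostly blank/zero answer sheet
    filled = [a for a in answers if a not in ("", "0")]
    if filled:
        top = max(set(filled), key=filled.count)
        if filled.count(top) > n / 2:
            return True  # one letter dominates the sheet
    return False
```

A sheet of twenty "a"s out of thirty answers is flagged; a sheet cycling evenly through all five letters is not.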
2. The uncommon answers test
The Uncommon Answers Test (UAT) is based on the idea that students who do not take an RBAI seriously sometimes choose answers that were uncommonly chosen by the larger student population. There are nine questions on each of the RBAIs where two or three answer choices were preferred by most of the population. The common answers were most often the correct answer plus one or more of the incorrect answers. Evidently, these preferred choices are attractive to people who read carefully through all the questions and were thoughtful in their responses.

If a student chose an unpopular answer on several of these questions, it is likely that they were guessing rather than applying reasoning throughout the assessment. We identified uncommon answer choices based on how few students picked those answers in the existing PhysPort data. Table II summarizes the questions and the less frequently chosen answers; we identified 9 such questions for each RBAI. For the FCI, fewer than 7% of the population chose one of the uncommon answers on each identified question. For the CSEM, fewer than 10% of students chose one of the uncommon answers on each identified question. For the BEMA, fewer than 6% of students chose one of the uncommon answers on each identified question. We counted survey takers who chose at least 4 uncommon answers on the FCI or CSEM, or at least 3 uncommon answers on the BEMA, as possibly non-serious.

FCI
Question  4      6      12     15   16   22   24   27   29
Answers   b,c,d  c,d,e  a,d,e  d,e  d,e  c,e  b,d  d,e  c,e
CSEM
Question  1    3    4    7    8    12   13   18   24
Answers   a,e  a,e  a,e  d,e  a,e  c,e  c,d  a,b  a,e
BEMA
Question  1      2      3      4      5    10   14     21   25
Answers   c,d,g  c,d,g  e,h,i  b,d,j  d,j  h,i  d,e,f  d,j  a,h

TABLE II. Nine questions on each RBAI where a small percentage of the population chose uncommon answers.
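Counting uncommon answers against Table II reduces to a dictionary lookup. The sketch below uses the FCI row of Table II; the function names and data layout are ours, not the authors'.

```python
# FCI row of Table II: question number -> uncommon answer choices
FCI_UNCOMMON = {
    4: {"b", "c", "d"}, 6: {"c", "d", "e"}, 12: {"a", "d", "e"},
    15: {"d", "e"}, 16: {"d", "e"}, 22: {"c", "e"},
    24: {"b", "d"}, 27: {"d", "e"}, 29: {"c", "e"},
}

def uat_count(answers, uncommon=FCI_UNCOMMON):
    """Count uncommon answers in one answer set, where `answers` maps
    question number -> chosen letter."""
    return sum(1 for q, rare in uncommon.items() if answers.get(q) in rare)

def uat_flag(answers, threshold=4):
    """Flag a possibly non-serious answer set; the threshold is 4 for
    the FCI and CSEM and 3 for the BEMA."""
    return uat_count(answers) >= threshold
```

Unanswered questions simply contribute nothing to the count, matching the test's focus on answers actually chosen.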
3. The easy questions test
The Easy Questions Test (EQT) is based on the idea that students who take a concept-inventory RBAI seriously will get most of the easier questions correct. A student making an effort on such an RBAI might still get one or two of even these questions incorrect, but they are unlikely to be incorrect on all of the easy questions. It stands to reason that an answer set in which all the responses to the easy questions are incorrect is more likely to come from a student who did not take that assessment seriously.

We looked at the existing PhysPort data to determine which questions were easiest for students (Figure 1). For each RBAI, we chose the four questions with the highest scores and calculated the percent of students who got a certain number of those questions correct. The students who answered all four easy questions incorrectly were considered as not having taken the assessment seriously. We note, however, that even a random guesser is likely to answer at least one of four five-choice questions correctly. Overall, this means that the EQT will undercount the number of non-serious test takers.

FIG. 1. Score distribution for each question of each RBAI. The pre- and post-test scores are combined for the FCI and CSEM.

FCI            Question  1  6   12  24
CSEM           Question  1  3   6   12
BEMA Pre-test  Question  1  2   18  21
BEMA Post-test Question  1  13  19  21

TABLE III. Four questions on each RBAI with the highest percentage of students choosing the right answer.

Figure 2 shows the percent of students who answered all of the easy test questions incorrectly as the number of easy questions increases. We can see that when we choose four or more easy questions, the percent of students who get all of the questions wrong stays relatively constant. For this reason, we chose the four easiest questions from each RBAI based on the proportion of correct responses to each question. The questions chosen for the EQT for each RBAI are shown in Table III. For the FCI, an easy question has a score greater than 71%. For the CSEM, an easy question is one with a percent of correct responses above 58%. As shown in Figure 1 for the BEMA, there is a large discrepancy between the pre- and post-test as to which questions are easy questions. The easy questions for the pre- and post-tests of the BEMA had scores greater than 43% and 69%, respectively.
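The undercounting noted above can be quantified directly. Assuming five answer choices per easy question, a uniform random guesser misses all four with probability (4/5)^4:

```python
# Chance that a uniform random guesser answers all four five-choice
# easy questions incorrectly (the EQT flagging condition).
p_all_wrong = (4 / 5) ** 4
print(f"{p_all_wrong:.4f}")  # 0.4096
```

That is, the EQT can be expected to flag only about 41% of true random guessers, which is consistent with the simulated-population detection rates reported in the analysis section.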
FIG. 2. Justification for the number of Easy Questions Test questions: percentage of real test takers who answered none of the questions correctly for an increasing number of easy questions, as measured by the easy test. The order of the easy test questions was determined by the data in Figure 1.

       PRT   UAT  EQT
FCI    0.63  3.5  2.3
CSEM   1.3   4.6  2.5
BEMA   1.2   3.1  3.2

TABLE IV. Percentage of each population detected as non-serious on each RBAI by each of the 3 seriousness tests.
IV. ANALYSIS AND DISCUSSION
Results of the PRT, UAT, and EQT are shown in Table IV and in Figure 3. In Figure 3, each segment shows the percent of test takers caught by that test, excluding what is shown in the intersecting segments. As an example, the percent of the actual population from the CSEM caught by the PRT is 1.3%, as shown in Table IV; this comes from combining 0.92% with each of the segments within the entire PRT circle. The percent of the actual population found to be non-serious by each of these tests is very small, ranging from less than 1% up to a few percent.

Applying the PRT to the actual population data identifies between 0.6% and 1.3% of test takers as non-serious for each of the different RBAIs. The PRT identified nearly zero non-serious survey takers in the simulated population, however. This is unsurprising because the patterns sought are non-random. Pattern recognition was thus excluded from Figure 3 for the random simulated results.

For comparison, the UAT and EQT found a high proportion of non-serious responses in the random simulated data: 68% and 52% of the respective simulated populations for the FCI and CSEM had four or more uncommon answers in their responses, and 21% of the BEMA random population had three or more uncommon answers
(blue circle). 41%, 41%, and 51% of the respective simulated populations for the FCI, CSEM, and BEMA were identified by the EQT as non-serious (green circle). In the actual populations, on the other hand, the EQT identified slightly more than 2% of students as non-serious on each RBAI, and the UAT identified between 3% and 4.6% of students as non-serious on each RBAI.

Comparing the uncommon answers chosen for each data set and each RBAI in Figure 4, we see that the actual population chose fewer uncommon answers than the random simulation. Fewer than 50% of the real students selected any uncommon answers, whereas the random population averaged three or four uncommon answers per assessment. Therefore, testing for the number of uncommon answers helped us differentiate serious students from random guessers. For the easy questions test, we see in Figure 5 that the simulated population answered fewer easy questions correctly, while most of the actual population was able to correctly answer at least 2 easy questions.

FIG. 3. Venn diagrams showing the percent of non-serious responses caught by each of the 3 tests for each of the data sets for each RBAI. The percent in each segment is based on the symmetric difference, where each segment includes only what is not in the intersecting segments. Note that the PRT results are excluded from the random simulation because almost none of the simulated RBAI results exhibited the patterns sought by that test. Values with an asterisk round to zero and represent fewer than 32 (FCI), 8 (CSEM), and 5 (BEMA) students.
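The exclusive segment values of a three-circle Venn diagram like Figure 3 follow from ordinary set arithmetic on the per-test flag sets. A sketch with hypothetical student IDs (all names ours):

```python
def venn_segments(prt, uat, eqt):
    """Split three sets of flagged student IDs into the seven exclusive
    regions of a three-circle Venn diagram."""
    return {
        "PRT only": prt - uat - eqt,
        "UAT only": uat - prt - eqt,
        "EQT only": eqt - prt - uat,
        "PRT & UAT only": (prt & uat) - eqt,
        "PRT & EQT only": (prt & eqt) - uat,
        "UAT & EQT only": (uat & eqt) - prt,
        "all three": prt & uat & eqt,
    }

# Toy example: three overlapping sets of flagged IDs
segments = venn_segments({1, 2, 3}, {2, 3, 4}, {3, 4, 5})
```

Dividing each segment's size by the population size gives the percentages plotted in the figure.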
Combining the results to determine the overall percent of non-serious students
The percent of the actual population identified by each of the three seriousness tests as non-serious is small, ranging from less than 1% up to a few percent. The center segments of Figure 3 show that very few real test takers were caught by all three tests on any of the RBAIs, with only 0.016% of the FCI population caught by all 3 tests.

As a final comparison, we looked at the scores of any test taker in the actual population who was identified as non-serious by the PRT, or by both of the other two seriousness tests. Counting the test takers caught by the PRT together with those caught by both the EQT and the UAT, the percent of test takers not taking the assessment seriously was 1.5% to 2.2% of the population. The number of test takers in the actual population caught in this combined manner was 1,120 out of 63,896 assessments (1.8%) for the FCI, 329 out of 14,876 assessments (2.2%) for the CSEM, and 133 out of 8,642 assessments (1.5%) for the BEMA. These values can be determined from Figure 3 by combining the percentages within the entire red (PRT) circle with the overlapping segment between the UAT and EQT. These results were very similar, suggesting that our seriousness tests accurately determine the percent of students who did not take an RBAI seriously. Figure 6 shows graphs of the non-serious scores, identified in the combined manner described, versus all of the scores in the actual population. From this graph, we can see that the non-serious assessment scores are much lower than those from the actual population. This is further evidence that the scores identified by the combination of the PRT with both the UAT and the EQT were most likely truly not serious.

FIG. 4. Distribution of the percentage of assessments that selected a number of uncommon answers, for both the real and simulated populations, for each RBAI. There were 9 questions on each RBAI where uncommon answers are rare. Non-serious test takers were those who chose at least 4 uncommon answers on the FCI and CSEM and at least 3 uncommon answers on the BEMA.
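The combined percentages quoted above follow directly from the reported counts; a quick arithmetic check:

```python
# Assessments caught by the PRT, or by both the UAT and the EQT,
# out of the total assessments analyzed (counts from the text).
combined = {"FCI": (1120, 63896), "CSEM": (329, 14876), "BEMA": (133, 8642)}
for rbai, (caught, total) in combined.items():
    print(f"{rbai}: {100 * caught / total:.1f}% flagged as non-serious")
# FCI: 1.8%, CSEM: 2.2%, BEMA: 1.5%
```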
V. CONCLUSION
Our results contrast with the work mentioned in the introduction by Henderson, who found that about 2.8% of students did not take the FCI seriously [32], and with the results of Pollock et al., who found that 3% of students indicated that they did not take the BEMA seriously [42]. We find that fewer students were caught by our seriousness tests, and we conclude that the overall percentage of students who did not take the CIs seriously is only about 1.5% to 2.2%. Regardless, our results are in line with previous work showing that the incidence of non-seriousness in RBAI results is very low.

In addition, our seriousness tests might undercount incidents of non-seriousness. We made deliberate choices
to avoid misidentifying serious test takers as non-serious, and those choices could have resulted in misidentifying some non-serious test takers as serious. It is still likely, however, that the methods described will sometimes falsely identify a serious student as non-serious. Because of this, we do not recommend using any of these seriousness tests to identify individual students as serious or non-serious.

We suggest that the three seriousness tests developed here could be used together, as described, to give reasonable estimates of the percent of non-seriousness in FCI, CSEM, and BEMA datasets, and that results similar to those just described indicate a low incidence of non-seriousness in a dataset. In addition, these seriousness tests might be applied to other concept-inventory RBAIs, although some details of the UAT and EQT would need to be worked out for each RBAI.

FIG. 5. Distribution of the percentage of assessments that correctly answered a number of easy questions, comparing the actual and simulated populations for each RBAI. There were 4 questions for each RBAI. Non-serious test takers were those who answered all 4 questions incorrectly.

[1] D. Sands, M. Parker, H. Hedgeland, S. Jordan, and R. Galloway, "Using concept inventories to measure," Higher Education Pedagogies 3, 173-182 (2018).