A novel alternative to analyzing multiple choice questions via discrimination index
P. K. Joshi, Y. Jain, R. Khunyakari and S. Basu

HBCSE-TIFR, Mankhurd, Mumbai 400088, India; Department of Computer Science, IIT-Bombay, Powai, Mumbai; TISS-Hyderabad, Hyderabad, India; IIT-Guwahati, Guwahati, India

Abstract

The value of multiple choice questions (MCQs) in large-scale, high-stakes educational assessment is widely established. Students' responses to test items in a multiple-choice format make it possible to assess the extent of students' understanding and also help make valuable decisions about the quality of the questions that make robust assessments possible. The use of the discrimination index (DI) to analyse MCQs is also widely prevalent in the literature [11]. This paper makes a case for a novel approach to analyzing data using the DI. The case for novelty is argued through an empirical, comparative analysis of three sets of data: conjecture data, data from an exam for screening talented students for a competitive event (two examples), and data from an international competitive academic event. The scheme is developed to handle data gathered from different question formats, such as MCQs, long answer questions (LAQs) and a combination of these two formats. A code has been developed for carrying out computational analysis on large data sets. A comparison with the conventional approach to data analysis establishes the worth of the proposed ideas for making meaningful inferences, and simultaneously makes it possible to attend to nuances that are greatly compromised when analyzing huge data sets. The paper brings a critical value addition to the body of analytical knowledge building.
Measurement and assessment have been widely acknowledged as important tools for gaining a certain objective understanding about learning [2, 7]. While acknowledging that tests and assessments come in many different forms, [12] asserts their value in establishing accountability systems intended to improve education. Historically, he argues, tests and assessments have been key elements in educational reforms. Tests, as an assessment tool, have been an integral part of the process of screening academically talented students from a socio-culturally rich and linguistically diverse population, who come from different academic exposures and nurture environments. [2] draws our attention to the significant role of testing in education, but at the same time cautions us to deal with it keeping in mind the purpose of assessment.

The 'purpose' is integral to the design and selection of an assessment tool. In the literature on educational assessments, scholars have long argued for the salience of purpose in making crucial educational decisions. Assessments are variously designed to meet what [20] have broadly categorized as the communicative, managerial and pedagogical purposes of assessments. Interestingly, the use of tests reported in this paper draws from a context where the test, as a method to assess students' performance, gels with all the purposes identified. Newton [15] draws our attention to the prevailing confusion about 'assessment purpose' in the literature and identifies three distinct interpretations that concern the judgement, decision and impact levels. In this paper, the 'assessment purpose' being discussed concerns the decision level in Newton's taxonomy, which is about the use of an assessment to support a selection decision for entry to higher education or for screening talent. Given this understanding, it becomes really important to develop appropriate assessment tools that effectively match the purpose identified. According to [4], meeting four requirements can accomplish good assessment: (i) a knowledge of correct procedures for student assessment; (ii) a commitment of an adequate amount of time; (iii) planning; and (iv) a reflective analysis of the assessment process. Careful consideration of all four requirements contributes to the validity of a test instrument and, to a fair degree, to realizing the intended purpose of assessment. In other words, a sound test instrument is a significant milestone in realizing the purpose of assessment. The development of test instruments is an involved process covering all four requirements outlined by Cunningham. A brief account of the process of development of the test instrument being used is given in section 4.2.

High-stakes assessments often use items that are largely multiple choice questions (MCQs). They also cover a wide range of students with different cultural backgrounds and orientations. Although MCQs have their limitations, they continue to remain a reliable means of operationalising large-scale, high-stakes, summative educational assessments. Careful and systematic use of these tests helps us gain insight into students' thinking and into the quality of questions, allowing scope for widening the educational value of assessment. In attending to this conceptualization, assessment through tests meets the otherwise segregated relation of assessment and learning represented through the classical prepositions 'of' and 'for' learning.
Assessment conceptualized as both of and for learning not only allows gauging the quality of learning but also enables a reflective introspection on test items that could provide opportunities for conceptual processing and developing refined understanding.

It is indeed hard to miss the socio-historic strides made in transitioning from the use of Classical Test Theory (CTT) to Item Response Theory (IRT), particularly in situations seeking the screening of talent from a pool of respondents in reliable and meaningful ways from large data sets. The framing of MCQs is challenging, as the range of choices needs to be designed to capture the likely range of responses, from which the respondent selects the most appropriate one. A test which includes such items is amenable to coding, computation and quantitative analysis. However, much investment is needed in developing authentic test items as well as in mining the enormous data appropriately to meet the intended purpose. The efforts towards developing guidelines that aid multiple choice item writing [3, 7] and evaluating the number of choices that afford sufficient distracters to bring out the cognitive challenge in an MCQ [19] are some areas of contemporary research interest.

While the value of the test as a tool to elicit patterns from large data sets is irrefutable, the quest for analytical alternatives for drawing precise and subtle insights is an ongoing one. In the context of discussing the best-suited approach for analyzing tests with single-best-response MCQs, [5] assert that the choice of approach is guided by the purposes and the data interpretation. Item analysis not only reveals patterns in student responses, but also enables assessing the quality of the items in the test. The literature [7, 1] does demonstrate consensus on how features of the design of test items contribute to performance on test assessments: for instance, the role of language in designing the stem of the item [17], the appropriate choice of distracters [21], and the larger purpose or intent of the item itself. Attending to these features adds rigor to the insights and inferences one draws from test assessments. Another feature that could add value is the understanding of the scheme for analyzing and inferencing from the data-set. It is this feature that is the central point of discussion in this paper.

As we know, large-scale assessments often use tests as instruments, which largely comprise single-option multiple-choice questions (MCQs). Typically, each MCQ item includes a stem, which details the context and the question, followed by four response options, of which one is the most appropriate response. In addition to the form and structure of the test item, the time allocated, the marking scheme with relative weightage, guessing behavior, negative (penalty) marking, etc. are known to influence students' performance on tests. While analysis of students' responses to MCQs reveals response trends among a large pool of respondents, the responses garnered can help make informed judgements about the quality of test items [9], assimilate an understanding of appropriate, plausible distracters that challenge students' thinking [21], and, if meticulously developed, can even enrich the conceptual understanding of respondents [10, 13].
These prospects of test items need systematic and careful engagement with data. Both in terms of the wide number of students involved and the challenge of screening talent, the MCQ format of test items is reliable and convenient in achieving the larger objective. The forms of test items other than MCQs are equally important. These other forms include long answer or conceptual questions (numerical or descriptive in nature) and items designed to assess understanding and performing science experiments. Long answer questions (LAQs), especially of the numerical calculation type, are a tool to understand the ability of the student to extract the science principles hidden in a question in order to solve the problem. Tests, in these cases, can be reliably used to gauge the ability of students to handle multiple concepts, to read a question carefully, to follow an experimental procedure meticulously, and several such student attributes. LAQs can also be experimental tasks which inquire into the practical, hands-on capacities of students. Besides discussing the scheme to analyze MCQs, this paper also opens up the possibility of extending the scheme to LAQs, which has seldom been explored in the literature.
Situating the study: Motivation and context
The motivation for this study stems from two important standpoints: a conceptual tension and an empirical observation drawn while analyzing a data-set. Both standpoints contributed to problematising the work reported in this paper. The process of developing an understanding of the logic in the analytical procedure helped us reach an alternative that could reliably address the issues at hand. Let us develop an understanding of the problem in focus through an elaborate example.

The Item Response Theory (IRT) is a widely used theoretical framework for analyzing MCQs. IRT identifies the Difficulty Index (Diff-I) and the Discrimination Index (DI) as two critical constructs which reveal the quality of test items. This section briefly discusses the algorithms used to arrive at these indices in the conventional analytical approach.

Let us define an examination where there are N students attempting P questions. Let student S_i (i = 1, ..., N) appear for the examination which has questions Q_j (j = 1, ..., P). The students are listed, sorted by total marks in decreasing order. To calculate the discrimination index (DI) for any j-th question Q_j, the top 27% of the students are identified and, similarly, the bottom 27% of the students are identified [11]. For the purposes of this article, let us name these the "top block" and the "bottom block". Let N_t be the number of students in the "top block" who have answered the question correctly, and let N_b be the number of students in the "bottom block" who have answered question Q_j correctly. The students who fall between these two blocks represent the "middle block". Let us define the quantity C = 0.27 x N, and let M_j be the maximum possible marks for the j-th question. Then the discrimination index (DI) is defined as:

DI_j = (N_t - N_b) / C    (1)

This has been the conventional way of defining and arriving at DI. Similarly, the difficulty index (Diff-I) is the fraction of students who have answered a particular question correctly; in terms of the blocks, researchers define it as:

Diff-I_j = (N_t + N_b) / (2C)    (2)

Let us take the example of 290 students attempting 30 questions, with the Diff-I and DI values derived through conventional analysis shown in Table 1. This is the data of sample 1, also discussed in section 5.1.

Table 1: Diff-I and DI for the sample exam described in the text.
Q. No    1     2     3    4    5    6    7    8    9    10
Diff-I  .241  .08   .75  .57  .64  .27  .39  .35  .41  .30
DI      .19  -.03   .21  .35  .38  .32  .27  .22  .49  .35
Q. No    11    12    13   14   15   16   17   18   19   20
Diff-I  .67   .62   .82  .25  .35  .31  1.0  .44  .13  .73
DI      .45   .57   .12  .12  .40  .22  .00  .16  .10  .42
Q. No    21    22    23   24   25   26   27   28   29   30
Diff-I  .31   .35   .73  .33  .15  .55  .83  .40  .44  .49
DI      .29   .27   .27  .23  .01  .22  .18  .21  .23  .19

In Figure 1, one can see the relationship between the number of students who have answered a question correctly and the DI. It can be noted that if a question is answered correctly by all the students, i.e. a difficulty index (Diff-I) of 1.0, then the discrimination index (DI) will have a value of 0.0. At the other end of the spectrum, if no student has answered a question correctly (Diff-I of 0.0), then again DI will be 0.0. If very few students have answered a question correctly, then the DI values will be very close to 0.0 and can even be negative, as in the case of question item 2 in our sample. Only when Diff-I is more than 0.27 and less than 0.73 can the DI reach a value of 1.0. For all other values of Diff-I, the upper limit of DI lies between 0.0 and 1.0.
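As a concrete illustration, the conventional procedure can be sketched in a few lines of Python. This is a minimal sketch of our own, not the authors' code: the function name and the 0/1 score-matrix representation are illustrative assumptions, and ties in the sorted totals are broken arbitrarily here, a point the next section returns to.

```python
import numpy as np

def conventional_indices(scores):
    """scores: (N, P) array of 0/1 marks for MCQs.
    Returns (Diff-I, DI) per question via the conventional
    top/bottom 27% blocks, equations (1) and (2)."""
    n_students = scores.shape[0]
    c = int(round(0.27 * n_students))      # block size C = 0.27 x N
    totals = scores.sum(axis=1)            # conventional "total" (self marks included)
    order = np.argsort(-totals)            # best students first; ties broken arbitrarily
    n_t = scores[order[:c]].sum(axis=0)    # correct answers in the top block
    n_b = scores[order[-c:]].sum(axis=0)   # correct answers in the bottom block
    di = (n_t - n_b) / c                   # equation (1)
    diff_i = (n_t + n_b) / (2 * c)         # equation (2), block-based Diff-I
    return diff_i, di
```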
The theoretical limit is plotted in Figure 1, indicated by a red line.

The calculation is further called into question if we consider questions that involve negative scoring. Negative scoring implies a 0.25 mark deduction for every question item answered incorrectly. The element of "adventure" or "risk" in attempting the maximum number of questions is curtailed by imposing a negative marking scheme; hence, assuredness in understanding and responding to questions is noted to be more rigorous in tests that involve a negative marking scheme. If we were to consider negative marking, then in equation (6) the numerator acquires an additional term (0.25 x the total number of students) and the denominator takes the maximum marks (M_j) as 1.25. This will be dealt with at the appropriate place in a later section.

The theoretical limit, for the j-th question, is calculated on the basis of Diff-I in three phases. For questions whose Diff-I is less than 0.27, the limit is

DI_max = (Diff-I x N) / (C x M_j)    (3)

For questions whose Diff-I is more than 0.27 but less than 0.73, the theoretical limit is 1.0. For questions whose Diff-I is greater than 0.73, it is

DI_max = ((1 - Diff-I) x N) / (C x M_j)    (4)

Figure 1: Relationship between Diff-I and DI.
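A small helper makes the piecewise ceiling of equations (3) and (4) concrete. This is a sketch under our reading of the text, assuming Diff-I is expressed as a fraction in [0, 1] and M_j = 1 for a standard MCQ; since C = 0.27 x N, the factor N cancels out.

```python
def di_ceiling(diff_i, max_marks=1.0):
    """Theoretical upper limit of DI as a function of Diff-I,
    equations (3) and (4); diff_i is a fraction in [0, 1]."""
    if diff_i < 0.27:       # all correct responders can sit in the top block
        return diff_i / (0.27 * max_marks)
    if diff_i > 0.73:       # all incorrect responders can sit in the bottom block
        return (1.0 - diff_i) / (0.27 * max_marks)
    return 1.0              # for 0.27 <= Diff-I <= 0.73, DI can reach 1.0
```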
In the conventional method of calculating DI, the "total" is defined by adding up all the marks of each student, including the marks of the question whose DI is being calculated, and occasionally even the marks of questions which are not in MCQ format. In the example discussed above, the negative marking of the MCQs was included while deciding the total, for sorting the list of students and deciding which students fall in the "top block" and which in the "bottom block".

It is also observed that the students who fall in the "top block" and the "bottom block" are not uniquely defined. Let us explain this with a simple example. Consider 300 students, which implies that 81 students fall in each of the "top block" and the "bottom block". After arranging the students' data in decreasing order of marks, if the students at positions 79 to 84 all have the same total marks, then, depending on how the list is arranged, only 3 (out of these 6) students with the same total enter the "top block". This means that a different way of sorting the list can place 3 entirely different students in the "top block", thus giving a different value of DI for the j-th question under consideration. Depending on the number of students in such a category, DI values have been observed to vary significantly. Such observations from the empirical data-set drew our attention to a string of questions that may have serious implications for assessment practices. For instance, is it possible to drop a few questions from the list of P questions when calculating the "total score"? How will that impact the value of DI? How will the value of DI vary if different combinations of questions are considered to define the "total" used to calculate DI? More importantly, what is the impact if the total is decided on the basis of the marks of the remaining (P - 1) questions? In this paper, this is called "without the self marks", while the conventional method, in which the question's own marks are included in the total, is called "with the self marks".

It has been observed that for several examinations where negative marking is included, the incorrect responses are not included in the analysis. Thus, the students who left the question unanswered and those who answered it wrongly get clubbed together into one single category. This paper argues for the value of making this distinction and examines the impact of this decision on the value of DI.

The conceptual and the empirical motivations converged when we noted a certain pattern in the scores obtained on a test data-set. We found that some questions with a very low, occasionally even negative, DI value also had a correspondingly low Diff-I value. This alerted us to look at the nature of these questions, which revealed a trend: the questions with a negative value were those that probed conceptual understanding. In other words, these items were designed with the deliberate intent of bringing out the conceptual grasp of the respondent. Thus, in a conventional analysis such a test item may have been counted as not so good in quality and would therefore have been eliminated from the test [9, 18]. This made us revisit the technique of item analysis itself.

A set of data was generated for the purpose of simulation. The set consisted of 100 students and 10 questions, designed so that the first 50 students (by roll number) answered the first 4 questions correctly, whereas each of the remaining 6 questions was answered correctly by 50 students selected at random from the set of 100. Here the value of C is 27 and the Diff-I of every question is 0.5. The values of DI calculated can be seen in Figure 2.

Figure 2: Different values of DI as a function of question number. Here Diff-I is 0.5 for all questions.

When the DI values were calculated the conventional way, the DI value for the first 4 questions came out as 1.0, along with a high value for the 10th question. When the subset of these 5 questions was separated out, and the analysis carried out for all possible combinations of these 5 questions, the DI value for question number 10 dropped to a low value, whereas the DI for the first 4 questions remained 1.0. These values can be seen in Table 2. This isolates the subset of the first 4 questions from the remaining 6. This data was generated to test the newly developed computational method and to check whether such "grouping" is possible in principle. The lurking question then becomes: is it possible to identify a subset of questions which all have very high values of DI?
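For the interested reader, the conjecture data set described above can be reproduced along the following lines; the seed and variable names are illustrative assumptions, not the authors' original code.

```python
import numpy as np

rng = np.random.default_rng(seed=42)    # the seed is an arbitrary illustrative choice

n_students, n_questions = 100, 10
scores = np.zeros((n_students, n_questions), dtype=int)

# First 4 questions: answered correctly by the first 50 students (by roll number).
scores[:50, :4] = 1

# Remaining 6 questions: answered correctly by 50 students picked at random
# from the full set of 100, independently for each question.
for q in range(4, n_questions):
    winners = rng.choice(n_students, size=50, replace=False)
    scores[winners, q] = 1

# By construction Diff-I = 0.5 for every question, and C = 0.27 x 100 = 27.
```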
This study aims to revisit the prevalent, conventional approach to analyzing multiple-choice questions using the constructs of Diff-I and DI. Using an empirical, data-driven approach, the study compares the findings obtained from the computed data-sets and brings to discussion some gray areas which can perhaps be strengthened with suitable alterations. The objective has been systematically addressed through research questions that bring focus and direction to the work. The research questions that guided this study are as follows:

1. What kind of questions remain unanswered while applying the conventional approach (prevalent algorithm) using the item response theory framework?
2. How can the existing algorithm be orchestrated to meet the demands of rigor and insight in the analysis of diverse question formats (such as multiple-choice questions and long-answer questions) employing a negative (penalty) scoring scheme?
3. What kind of challenges and prospects does the alternative algorithm present?
4. Do the experiences of applying the alternative algorithm establish its worth with large data-sets and different test item formats?

Table 2: Average DI with standard deviations for questions 1, 2, 3, 4 and 10.
Q. No    Average DI
1        0.9 ± 0.3
2        0.9 ± 0.3
3        0.9 ± 0.3
4        0.9 ± 0.3
10       0.11 ± (< 0.01)

The study is a data-driven, empirical effort to introspect into conventional procedures for analyzing and interpreting response trends on test items using IRT. It makes an effort to suggest a novel alternative for strengthening the inferences gained from analyzing multiple choice questions. The proposed alternative is derived and tested on three large data sets involving student scores on test items. The computational analysis arrived at using the alternative procedure provides a neat and elegant way to use the notions of discrimination and difficulty indices not just to adjudge the quality of performances but also as a means to reflect on the quality of the questions. Besides this, the alternative procedure promises to develop strong grounds for noticing and minimizing several effects that come into play when interpreting results, including meaningfully assessing the impact of no-responses to certain items, guessing behavior patterns in responses, and negative (penalty) marking. Also, the study goes beyond the conventional restriction of the analytical procedure to MCQs: it extends the same analytical scheme to LAQs.
The study methodology is located within the quantitative research paradigm. The thrust of the work reported, in our opinion, falls within the analytical research tradition. The study begins from large-scale data being analyzed to assess students' performance on a test that is used to screen and select talented students. Our engagement with the literature on the process of analyzing data led to identifying and problematising a conceptual area of discomfort which, in principle, could lead to inferences that need to be treated with caution. Through the process of analysis and discussion, a novel scheme has been envisaged and experimented with on three data sets, reported in this paper. DI values are calculated using different techniques deviating from the conventional ones. This exercise was carried out to study some of these variations and observe the changes in the value of DI.
The data constitute what [3] identify as the most significant subjects and objects of diagnostic and summative testing. Given the purposes, these tests include items which assess knowledge comprehension, reasoning abilities, conceptual clarity, and application of knowledge; on many occasions more than one of these purposes is targeted through a single test item.

The three data sets used in the study represent students' performance scores on tests from: (a) preparation training in India for an international, academic competitive event in 2017; (b) preparation training in India for the international event in 2016; and (c) the international event organized and hosted in India in 2013. As (a) and (b) are trainings organized in India towards preparation for the international event, they are referred to as the Indian National Junior Science Olympiad (INJSO); the outcomes of these tests result in the selection of a contingent representing India at the international event. The international event, or IJSO, was organized and hosted by India in 2013; hence data set (c) comprises the IJSO 2013 data. It is important to draw attention to the fact that the three data sets involved three different samples of students and also different sets of test items. The process of setting the test items and of test validation by experts was consistent across all the examples above. A comparison of the algorithm on three distinctive data sets helped ascertain the relevance and worth of the proposed algorithm. More importantly, it allowed scope for testing the worthiness of the proposed algorithm with other question formats: the algorithm helped handle data gathered from different question formats such as MCQs, LAQs and a combination of these two.

Since large data sets were involved, a computational program was designed by one of the authors of this paper for running the analysis on large data sets. The outcomes were cross-verified through random checks. The analysis of the data sets revealed interesting insights into measurement and assessment, which are discussed elaborately in the next section.
This study examines data sets and raises important questions about educational measurements that have implications for the assessment of students' performances. One of the first endeavors in this study developed from applying the established, conventional approach to the analysis of a data set. This led us to raise questions that seem to remain unanswered within the purview of the existing algorithmic, statistical procedure in use for analysis. Some of these questions were: Are the performance outcomes and the values of DI and Diff-I reliable in a case where students have differential patterns of responses to test items, or in cases where students have not attempted a few items? Would the DI and Diff-I values vary for tests with different numbers and kinds of question formats?

The analysis revealed that the present methodology of calculating the DI values has certain lapses, which are explained below.

i. Assume a case with 300 students and 30 questions, resulting in a value of C (0.27 x total number of students) = 81. When the data is sorted by the total marks obtained by the 300 students, suppose that the students listed from position 75 to 90 have the same total. In that case, the last 8 of the N_t students can be included in various different ways out of the total of 16 students who have received the same total marks, resulting in varying values of DI. The alternative proposed in this paper is designed to handle all such discrepancies, since it looks at various different "total" values for selecting the N_t students.

ii. In the analysis of MCQs, if negative marking is included in the assessment of students, then the current methodology clubs the wrong answer and the no-answer together, treating them at par. In the literature, it is often argued that negative (penalty) marking helps discriminate between students who answer incorrectly and those who decide to leave an item unanswered in a multiple choice format. We argue and empirically demonstrate that segregating the two categories gives deeper insights into the quality of the questions.

Drawing on the outcomes of the data simulation and an exposure to the limitations of the existing algorithm, the question arose of a possible correction or revision of the existing algorithm. Thus, the question now became how the algorithm can be orchestrated to meet demands of rigor and analyze diverse question formats (such as multiple-choice questions with and without negative marking, long-answer questions, etc.). To include negative (penalty) marking, as well as tasks which are not multiple choice in nature but long questions which may have maximum marks larger than 1.0, the equation for calculating DI has been modified. Let the maximum marks for the j-th question be M_j, let the sum total of marks received by the N_t students be S_t, and let the sum total of marks received by the N_b students be S_b. Then the new definition of the discrimination index (DI) is

DI_j = (S_t - S_b) / (0.27 x N x M_j) = (S_t - S_b) / (C x M_j)    (5)
It can be clearly seen that multiple choice questions have M_j = 1.0, and placing M_j = 1.0 in equation (5) leads back to equation (1). The beauty of the equation is that it can handle negative marking in MCQs without any modification to the equation or to the data. Thus, an additional feature is introduced in this analysis of MCQ items: the "wrong answer" and the "blank answer" are not treated at par. When negative marks are included, the DI values can vary between -1.25 and 1.25 instead of -1.0 and 1.0, and hence need to be rescaled by dividing the DI values by 1.25.

Similarly, the difficulty index (Diff-I), the fraction of students who have answered a particular question correctly, is redefined in terms of marks as

Diff-I = (total of marks received by all students) / (M_j x N)    (6)

which is modified for MCQs with negative marking as

Diff-I = (total of marks received by all students + 0.25 x N) / (M_j x N)    (7)

where, as noted earlier, M_j is taken as 1.25 for a negatively marked MCQ.
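A sketch of the generalized indices of equations (5), (6) and (7) follows. The function signatures, the ranking by an externally supplied "total", and the heuristic detection of negative marking are our reading of the text rather than the authors' released code.

```python
import numpy as np

def generalized_di(marks, totals, max_marks):
    """marks: (N, P) array of raw marks per question (possibly negative for
    penalty-marked MCQs); totals: length-N totals used for ranking students;
    max_marks: length-P array of M_j. Implements equation (5)."""
    n_students = marks.shape[0]
    c = int(round(0.27 * n_students))
    order = np.argsort(-np.asarray(totals, dtype=float))
    s_t = marks[order[:c]].sum(axis=0)      # sum of marks, top block
    s_b = marks[order[-c:]].sum(axis=0)     # sum of marks, bottom block
    di = (s_t - s_b) / (c * np.asarray(max_marks, dtype=float))
    # Heuristic: with -0.25 penalty marking DI spans [-1.25, 1.25];
    # rescale back to [-1, 1] as the text prescribes.
    return di / 1.25 if (np.asarray(marks) < 0).any() else di

def generalized_diff_i(marks, max_marks, penalty=0.0):
    """Equation (6); with penalty=0.25 and M_j = 1.25 this is equation (7)."""
    n = marks.shape[0]
    return (marks.sum(axis=0) + penalty * n) / (np.asarray(max_marks) * n)
```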
The present definition of the Difficulty Index does not exclude any students, thereby overcoming the problem of exclusion in the traditional definition. The next important question that concerned this study was to identify the kinds of challenges and prospects that the proposed algorithm presented. Hence, it became important to learn whether the experiences of applying the alternative algorithm established its worth with large data-sets.

In this section, we try to determine the impact of different definitions of "total" on the value of DI of the j-th question under consideration. One such definition is the sum of marks on all questions except the marks of the j-th question itself. The result of dropping the marks of the j-th question in deciding the total can be seen in Figure 3. The red marks are the DI values when the marks of the question being analyzed are not included in calculating the total. The black points are those where negative marking is included and even the marks of the question under consideration are included in deciding the total. Figure 3 indicates large differences in the values of DI.

The "total" can be further defined as the sum of marks over different combinations of the set of 29 remaining questions. It was observed that, when considering all combinations of more than 8 questions, the number of combinations was so large that it was difficult for common computing devices to calculate them. If one were to take all the possible combinations of 2, 3, 4, ..., 29 questions, the total number of combinations would be 536,870,882, i.e. about 5.37 x 10^8; repeated for all 30 questions this becomes of the order of 10^10 trials, which is difficult for a normal laptop computer to handle. Hence a random-number operation is applied to truncate this large set to a small but manageable sample size which represents the full dataset. The DI values for each question were calculated using around fifty thousand different combinations of the remaining 29 questions; a sketch of this sampling follows at the end of this passage. The DI values thus obtained followed a Gaussian distribution, leading to an average value with a standard deviation (covered in more detail in the later part of this section). The values of DI thus obtained can be seen in Figure 4. The repeatability of the values of DI calculated via this procedure is discussed later, in section 5.3.

Figure 4: Difference between the DIs on the basis of the total when the marks of the same question are included and when they are not. This plot represents data from sample 1, discussed in section 5.1. The line representing the maximum DI values is just a visual guide for the reader, for comparison with the other values.

From Figure 4, it can be seen that the DI values obtained when the self marks of the question are not included in deciding the "total" fall within the range determined by the average and the respective deviation. However, the DI values obtained when the self marks are included in calculating the "total", with negative marking included, are much higher and lie outside the range. The maximum value of DI for any question, obtained from any of the several combinations, is sometimes higher than the value obtained when the self marks are included in the "total". The average value of the DI for each of the questions is listed in Table 3, where it can be seen that the DI values (with self marks excluded from the calculation of the "total") lie within the range decided by the average and its uncertainty, that is, within "the experimental error". The exceptions are questions 4, 18, 26, 29 and 30, which fall within the 2σ range.
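The random-combination sampling described above might be sketched as follows; it reuses the hypothetical generalized_di() helper from the previous sketch, and the default of fifty thousand samples follows the text.

```python
import numpy as np

def di_over_random_totals(marks, max_marks, q, n_samples=50_000, rng=None):
    """Spread of DI for question q when the 'total' is computed over random
    subsets (sizes 3 .. P-1) of the *other* questions. Returns (mean, sigma),
    so the DI can be reported as an average with a standard deviation."""
    rng = rng or np.random.default_rng()
    others = np.array([j for j in range(marks.shape[1]) if j != q])
    values = np.empty(n_samples)
    for i in range(n_samples):
        k = rng.integers(3, len(others) + 1)       # random subset size, 3..P-1
        subset = rng.choice(others, size=k, replace=False)
        totals = marks[:, subset].sum(axis=1)      # "total" without self marks
        values[i] = generalized_di(marks, totals, max_marks)[q]
    return values.mean(), values.std()             # report DI as avg ± sigma
```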
From Table 3 it can be seen that the maximum DI value resulting from the different combinations can be even higher than the value which comes from a total based on all the questions, including self. Such questions can be grouped together as a sub-group of questions. Question 12 has the highest DI value of all, 0.58, which comes from one particular group of questions: in the present work, the group of questions 1, 2, 3, 4, 5, 8, 9, 10, 13, 15, 16, 19, 20, 27 and 29. Put differently, if the whole exam had consisted of only these fifteen questions, all the DI values would be very high. Whether such a situation is desirable is a point for further research.

In the exam conducted in January 2016, 497 students took a test which had 30 MCQs and 12 LAQs worth 60 marks. The analysis of these test results was carried out. For question number 1, it was decided to obtain the value of DI for every combination of 3 or 4 of the remaining questions. The distribution obtained was not Gaussian but a skewed Gaussian, as can be seen in Figure 5.

Figure 5: Distribution of DI values when only combinations of 3 and 4 of the remaining questions are considered for calculating the total.

When the combinations are extended from any 3 up to any 8 of the remaining questions, the distribution of DI values takes a more Gaussian shape, as seen in Figure 6. It needs to be mentioned here that the DI values obtained are not continuous but fall on multiples of 1/(C x M_j) (with C and M_j as defined in section 2). When the combinations are further extended, from any 3 up to any 29, the distribution is still more Gaussian in nature, as also seen in Figure 6. The average DI (with σ) for question number 1 is listed in Table 4.

Figure 6: Distribution of DI values when only combinations of 3 to 8 of the remaining questions are considered for calculating the total. The second plot represents the 3-to-29 combinations.

The analysis of all the questions can be seen in Table 5. The headings of the different calculations, listed in the first row of the table, are: Diff-I; DI on the basis of all marks (including long-question marks and negative marking); all MCQs with negative marking; all MCQs without negative marking; DI without the self marks (i.e. the marks of the question under consideration are dropped while calculating the total); DI calculated on all possible combinations of 3 to 8 of the remaining questions; DI on all possible 3-to-29 combinations with random selection; the maximum possible DI under any combination; and the minimum possible DI under any combination. The values in this table are listed in increasing order of Diff-I.

Figure 7 depicts the data plotted as a function of Diff-I. It can be noted that when all marks are included, the DI values tend to be lower than those obtained when only the MCQs are considered, with their negative marking included; the latter values are very close to the maximum value of DI obtained in the given data. A similar trend can be observed when negative marking is dropped and DI is calculated using all 30 questions, or when the self marks (the marks of the question whose DI is being calculated) are dropped; this can be seen in Figure 8. However, the maximum value is always higher than these values. Hence, the next logical question is what the spread in the values of DI can be under the different possible combinations of questions. A special code was run to calculate all possible combinations of 3 to 8 of the remaining questions, other than the one whose DI is being calculated.
A similar attempt was made to determine the spread of DI from all possible 3-to-29 combinations. However, this runs into a calculation involving more than 10^8 trials per question. Hence a random-number operator was introduced to truncate the basis space of the calculation by a factor of 10,000. The results from these calculations are depicted in Figure 9.

Figure 7: Distribution of DI values when only combinations of 3 to 29 of the remaining questions are considered for calculating the total.

The two values agree within the range of the standard deviation, and hence the 3-to-29 combination DI values tend to represent the question more appropriately than those calculated using the total of all marks. In all these figures, thin dots represent the maximum and minimum values of DI possible under the various combinations.

Having established the reason to report DI values as a range of numbers, represented by an average along with its error σ, we extend the idea and use the code for long questions, which may be theoretical in nature or arise from a practical exam. The code also handles the negative marks in the multiple-choice part of the exam, thus making it a comprehensive, handle-all code for analyzing the questions. Since there were 30 MCQs and 10 LAQs (5 long questions and 5 experimental questions), leading to a combinatorially enormous number of possible subsets, a large random truncation factor was again used.

Since the full set of choices is truncated by a random number, it must be checked whether the DI values calculated on a truncated set represent the full data. The DI values were therefore calculated in ten different trials for the same question, and the results are plotted in Figure 10; a code sketch of this repeatability check is given at the end of this section. It can be seen that the variation in the average value of DI (± σ) across the 10 trials is much less than the value of σ itself.

The code analysis conducted on the data from the international event held in India in the year 2013 is depicted in Figure 11. The representation clearly shows the difference in DI values when the self marks are used, and when they are not used, to calculate the DI values. For simplicity, only the MCQs are analyzed to display this effect.

Figure 8: Distribution of DI values when the negative marking scheme is ignored and the self marks are also excluded.

Figure 12 depicts the values of Diff-I and the DI values as a function of question number, including the theoretical and experimental questions. In order to study the variation of DI values as a function of the difficulty index, the data is plotted differently in Figure 13, for all the questions. Here it can be seen that only one question, which happens to be an MCQ, has a very low DI value while having a difficulty index of around 0.4: only forty percent of the students answered this question correctly, yet it did not help discriminate between "good" and "not-so-good" students. Similarly, an experimental question has a very low value of DI but a very high value of Diff-I, implying that almost every student answered this question and hence it did not have much discrimination value. Other than these, 21 questions had very high values of DI (ranging between 0.5 and 0.8) and 17 questions had reasonable values of DI (ranging between 0.1 and 0.4).

The analysis clearly shows that the conventional analytical approach does not yield fine-grained outcomes when it comes to (a) LAQs, (b) experimental questions, and (c) cases where the no-response trend is influential.
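The repeatability check behind Figure 10 can be sketched as below. The synthetic score matrix and the reduced per-trial sample size are stand-ins for the actual INJSO data, and di_over_random_totals() is the hypothetical helper sketched earlier.

```python
import numpy as np

# Repeatability check: run the truncated-sampling estimate ten times and
# compare the trial-to-trial scatter of the mean with sigma itself.
rng = np.random.default_rng(1)
marks = rng.integers(0, 2, size=(497, 30)).astype(float)  # stand-in 0/1 data
max_marks = np.ones(30)

means = []
for trial in range(10):
    m, s = di_over_random_totals(marks, max_marks, q=4, n_samples=5_000)
    means.append(m)                       # q=4 is question number 5 (0-indexed)
    print(f"trial {trial + 1}: DI = {m:.3f} +/- {s:.3f}")

print("scatter of the 10 trial means:", np.std(means))    # expected << sigma
```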
From the DI values in Table 3 it is clear that calculating the "total" using the marks of the question under consideration, which we call the "with self" case, gives larger values of DI in many cases than when the self marks are not included in calculating the "total" for obtaining the DI values.

Figure 9: Distribution of DI values calculated using 3-8 combinations and 3-29 combinations.

If several combinations of the remaining questions are used to calculate the value of the "total", then one gets a range of DI values which mostly covers the DI values calculated using the different mechanisms. Reporting the DI values in the form of a range, i.e. an average with a standard deviation, gives a much broader view of the DI of a question than a single value drawn from a large space of possible DI values.

For those questions whose Diff-I values are less than 0.27 or greater than 0.73, the value of DI has a certain ceiling. Those questions which have a Diff-I of around 0.5 have the highest potential for DI values as close to 1 as possible. Thus, all those questions which have Diff-I close to 0.27 and tend to have the maximum value of DI are ideally the best questions: these are the questions with the highest discrimination power and the minimum number of participants answering correctly. Those questions with high Diff-I (especially above 0.73) do not contribute much towards filtering students in a competition, but they can potentially be used to give a morale boost to students appearing for the exam. At the same time, those questions which have Diff-I close to 0.5, i.e. the highest probability of a large DI value, but which instead have low DI values, will be the most undesirable questions.
Figure 10: Ten trial runs of DI calculations for question number 5.

There seems to be a range of question item possibilities that add value and rigour and are helpful in deriving interpretations from a large data set. For instance, items low on Diff-I but high on DI, or low on DI but high on Diff-I, need to be carefully examined in the larger context of the intended design. The findings from this study open this avenue for further investigation. Analyzing question items using the proposed scheme allowed us to assess participants' performances as well as providing important clues about the quality of the questions. The reliability of the devised algorithm has been established through its operation on three distinct data sets. While the data sets used here involve a high-stakes competition, the IJSO, the revised algorithm may be tested with other instruments involving large numbers of participants in various national-level tests. While this prospect is proposed, a more detailed qualitative assessment of each question on the basis of the DI values obtained is within the scope of future work and is in progress.
The authors would like to thank the Junior Science Olympiad team members for their valuable inputs. The authors also acknowledge the Government of India for funding the Junior Science Olympiad through the Tata Institute of Fundamental Research and the Department of Atomic Energy.

Figure 11: Distribution of DI based on 30 questions (with self marks) and 29 questions (without self marks), as a function of the difficulty index Diff-I. This is IJSO 2017 international data.
Figure 12: Values of the difficulty index and DI as a function of question number, where the last 5 questions represent the experimental questions and the 5 questions before those represent the theory (long problems) questions.
Figure 13: Values of DI as a function of the difficulty index Diff-I for all the questions, showing the maximum, average and minimum values together with the theoretical limit.

References

[1] Baker, F. B. (2001). The Basics of Item Response Theory. ERIC Clearinghouse on Assessment and Evaluation, USA.
[2] Bloom, B. (1967). Toward a Theory of Testing which Includes Measurement - Evaluation - Assessment. Centre for the Study of Evaluation of Instructional Programs, Occasional Report No. IX. Paper presented at the Symposium on the Problems in the Evaluation of Instruction, Los Angeles, California, Dec 13-15, 1967. http://files.eric.ed.gov/fulltext/ED036878.pdf
[3] Cohen, L., Manion, L., and Morrison, K. (2013). Research Methods in Education, Seventh edition. Routledge, London.
[4] Cunningham, G. (1998). Assessment in the Classroom: Constructing and Interpreting Texts. The Falmer Press, Taylor & Francis, London, UK.
[5] Ding, L., Beichner, R. (2009). Approaches to data analysis of multiple-choice questions. Physical Review Special Topics - Physics Education Research, 5, 020103.
[6] Espinosa, M., Gardeazabal, J. (2014). Optimal correction for guessing in multiple-choice tests. Journal of Mathematical Psychology.
[7] Haladyna, T. M., Downing, S. M., Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334.
[8] Hambleton, R., Swaminathan, H., Rogers, H. (1991). Fundamentals of Item Response Theory. SAGE Publications.
[9] Hassan, S., Hod, R. (2017). Use of item analysis to improve the quality of single best answer multiple choice questions in summative assessment of undergraduate medical students in Malaysia. Original Article, 9(3), 33-43.
[10] Hutchinson, D., Wells, J. (2013). An inquiry into the effectiveness of student generated MCQs as a method of assessment to improve teaching and learning. Creative Education, 4(7A2), 117-125.
[11] Kelley, T. L. (1939). The selection of upper and lower groups for the validation of test items. Journal of Educational Psychology.
[12] Educational Researcher.
[13] Medical Teacher, 26(8), 709-712.
[14] Mehta, G., Mokhasi, V. (2014). Item analysis of multiple choice questions: an assessment of the assessment tool. International Journal of Health Science Research, 4(7), 197-202.
[15] Newton, P. (2007). Clarifying the purposes of educational assessment. Assessment in Education, 14(2), 149-170. DOI: 10.1080/09695940701478321.
[16] Nicol, D. (2007). E-assessment by design: Using multiple-choice tests to good effect. Journal of Further and Higher Education, 31(1), 53-64.
[17] Paxton, M. (2000). A linguistic perspective on multiple choice questioning. Assessment and Evaluation in Higher Education, 25(2), 109.
[18] Pyrczak, F. (1973). Validity of the discrimination index as a measure of item quality. Journal of Educational Measurement, 10(3), 227-231.
[19] Tarrant, M., Ware, J., Mohammed, A. M. (2009). An assessment of functioning and non-functioning distractors in multiple-choice questions: a descriptive analysis. BMC Medical Education, 9(1), 40-47.
[20] Tanner, H., Jones, S. (2006). Assessment: A Practical Guide for Secondary Teachers. Continuum Press.
[21] Vyas, R., Supe, A. (2008). Multiple choice questions: A literature review on the optimal number of options. The National Medical Journal of India.
Table 3: Average DI (± σ) and maximum DI for each question of sample 1. [Numeric entries not recoverable from the source.]

Table 4: Average DI (± σ) for question number 1.
3-8 combinations: 0.31 ± 0.04
3-29 combinations: 0.33 ± 0.03

Table 5: All possible DI values for INJSO. Columns: Q. No; Diff-I; theory limit of DI; maximum DI; DI with non-MCQ marks; DI with MCQ, including -ve marking; DI with MCQ, no -ve marking; DI without self marks; DI with combinations, average (3-8); DI with combinations, average (3-29); minimum DI. Rows listed in increasing order of Diff-I. [Numeric entries not recoverable from the source.]