
Publications


Featured research published by Stephen Humphry.


Australian Educational Researcher | 2010

Using the method of pairwise comparison to obtain reliable teacher assessments

Sandra Heldsinger; Stephen Humphry

Demands for accountability have seen the implementation of large-scale testing programs in Australia and internationally. There is, however, a growing body of evidence to show that externally imposed testing programs do not have a sustained impact on student achievement. It has been argued that teacher assessment is more effective in raising student achievement levels. However, it is also often argued that teacher assessments are less reliable than the results of testing programs. This paper presents a study in which teachers judged writing scripts using the process of pairwise comparison to generate a scale. The analysis showed high internal consistency of the teacher judgements. The scale locations from pairwise comparisons were highly correlated with scale estimates for the same students from a large-scale testing program. The results demonstrate that it is possible to efficiently obtain highly reliable and valid teacher judgements using the process of pairwise comparison. Reliability indices are also provided for a series of small-scale assessments that used the same methodology in a range of other domains. The results support the findings of the main study. The article discusses the benefits of using the method to supplement and validate results from large-scale testing programs.
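The pairwise-comparison scaling described above can be sketched with a Bradley-Terry-type model, which is formally equivalent to a pairwise Rasch formulation. This is an illustrative sketch with invented win counts, not the study's data or code.

```python
import math

# Hypothetical tallies: wins[(i, j)] = number of judges who placed script i
# ahead of script j. Three writing scripts, each pair judged ten times.
wins = {(0, 1): 8, (1, 0): 2,
        (0, 2): 9, (2, 0): 1,
        (1, 2): 7, (2, 1): 3}

# Pairwise model: P(i beats j) = 1 / (1 + exp(b_j - b_i)), where b is the
# scale location of each script. Estimate the locations by gradient ascent
# on the log-likelihood; with a zero start the mean location stays at zero,
# which fixes the scale's arbitrary origin.
b = [0.0, 0.0, 0.0]
for _ in range(500):
    grad = [0.0, 0.0, 0.0]
    for (i, j), n in wins.items():
        p = 1.0 / (1.0 + math.exp(b[j] - b[i]))  # P(script i beats script j)
        grad[i] += n * (1.0 - p)
        grad[j] -= n * (1.0 - p)
    b = [bi + 0.05 * g for bi, g in zip(b, grad)]

# The estimated locations reproduce the judged ordering: script 0 above
# script 1 above script 2.
```

Locations estimated this way can then be correlated with external test scores, as in the study.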


Applied Psychological Measurement | 2012

Quantifying Local, Response Dependence Between Two Polytomous Items Using the Rasch Model

David Andrich; Stephen Humphry; Ida Marais

Models of modern test theory imply statistical independence among responses, generally referred to as local independence. One violation of local independence occurs when the response to one item governs the response to a subsequent item. Expanding on a formulation of this kind of violation as a process in the dichotomous Rasch model, this article generalizes the dependence process to the case of the unidimensional, polytomous Rasch model. It then shows how the magnitude of this violation can be estimated as a change in the location of thresholds separating adjacent categories in the second item caused by the response dependence on the first. As in the dichotomous model, it is suggested that this index is relatively more tangible in interpretation than other indices of dependence that are either a weight in the interaction term in a model or a correlation coefficient. One function of this method of assessing dependence is likely to be in the development of tests and assessment formats where evidence of the magnitude of dependence of one item on another in a pilot study can be used as part of the evidence in deciding which items will be retained in a final version of a test or which formats might need to be reconstructed. A second function might be to identify the magnitude of response dependence that may then need to be taken into account in some other way, perhaps by applying a model that takes account of the dependence.
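A generic partial credit (polytomous Rasch) function makes the threshold formulation concrete; the item thresholds and the 0.5 shift below are invented for illustration, not values from the article.

```python
import math

def pcm_probs(theta, thresholds):
    """Category probabilities for a polytomous Rasch (partial credit) item.
    thresholds[k] is the location where categories k and k + 1 are equally
    likely for a person at that location."""
    logits, s = [0.0], 0.0
    for tau in thresholds:
        s += theta - tau          # cumulative (theta - tau_k) terms
        logits.append(s)
    z = [math.exp(v) for v in logits]
    total = sum(z)
    return [v / total for v in z]

# Hypothetical three-category item with thresholds at -1 and +1, answered by
# a person at theta = 0.
base = pcm_probs(0.0, [-1.0, 1.0])

# The article's index of dependence is a change in the threshold locations of
# the second item caused by the response to the first. Here both thresholds
# move down by 0.5, making higher categories more likely at the same theta.
shifted = pcm_probs(0.0, [-1.5, 0.5])
```

The size of such a shift, estimated from data, is the tangible index of response dependence the abstract describes.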


Measurement: Interdisciplinary Research & Perspective | 2011

The Role of the Unit in Physics and Psychometrics

Stephen Humphry

The purpose of this article is to examine the role of the unit in physics in order to clarify the role of the unit in psychometrics. Based on this examination, metrological conventions are used to formulate the relationship between discrimination and the unit of a scale in item response theory. Seminal literature in two lines of item response theory is reviewed in light of the standard definition of measurement in physics, and Birnbaum's formulation of the discrimination parameter in item response theory is reexamined. Consequently, the article introduces a scale parameter in a model that specializes to both the two-parameter logistic (2PL) and Rasch models. The model has sufficient statistics for person and item parameters, the feature that defines Rasch models, whilst also parameterizing discrimination. By formulating the relationship between discrimination and the unit, this article reconciles differing perspectives regarding the use of a discrimination parameter in the 2PL and Rasch models. A simulation study is used to demonstrate the results of implementing conditional maximum likelihood estimation of item locations. Implications for the progress of measurement in the social sciences are identified and discussed. It is argued that these implications entail substantial shifts in the way we think about measurement in the social sciences.
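The relationship between discrimination and the unit can be illustrated with a generic 2PL function. This is a sketch of the general idea, not the article's specific scale-parameter model; the parameter values are invented.

```python
import math

def irt_prob(theta, b, a=1.0):
    """2PL success probability; a = 1 recovers the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Treating discrimination as the inverse of the scale's unit: multiplying a
# by a constant while dividing the person and item locations by the same
# constant is a pure change of unit and leaves every probability unchanged.
p_rasch_units = irt_prob(1.2, 0.4, a=1.0)
p_halved_units = irt_prob(0.6, 0.2, a=2.0)
```

On this reading, a common discrimination value characterises the unit of the scale rather than a property of individual items.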


Educational Research | 2013

Using calibrated exemplars in the teacher-assessment of writing: an empirical study

Sandra Heldsinger; Stephen Humphry

Background: Many in education argue for the importance of incorporating teacher judgements in the assessment and reporting of student performance. Advocates of such an approach are cognisant, though, that obtaining a satisfactory level of consistency in teacher judgements poses a challenge. Purpose: This study investigates the extent to which the use of a two-stage method of assessment involving calibrated exemplars provides judgements from teachers that are consistent. Teachers were not given extensive training or moderation. We chose the assessment of early writing as a context to investigate the method as it is fundamental to students’ progress in schooling. Sample: Stage 1: Eleven teachers of four- to seven-year-olds (kindergarten to year 2) were invited to collect their students’ performances. Sixty performances that represented the range of ability were selected from approximately 300 performances. Fifteen teachers from 12 schools made pairwise comparisons of performances. Stage 2: Fourteen teachers representing six schools plus the co-ordinator of the study participated in this stage of the exercise. Convenience sampling of teachers was employed. Design and method: Stage 1: The method of pairwise comparison was used to calibrate the performances of students by developing a performance scale. These performances were then used as exemplars, which are referred to here as calibrated exemplars. Stage 2: Teachers assessed student performances simply by judging which calibrated exemplar each performance most resembled. In a separate exercise, two experienced markers assessed another set of 118 writing performances using both (1) a criterion-based rubric and (2) the calibrated exemplars. Results: The two-stage process showed a level of consistency in teacher judgement-making. In addition, judgements made by experienced markers with the calibrated exemplars correlated well with judgements made using the criterion-based rubric.
Conclusions: The findings suggest that using calibrated exemplars has potential as a method of teacher assessment in contexts where extensive training and moderation is not possible or desirable. Further research is needed to establish whether the findings generalise to the classroom context and whether consistency could be demonstrated on a large scale in this and other curriculum areas. Research is also needed to investigate whether the calibrated exemplars can be supported with qualitative information for use in formative assessment.


Journal of Educational and Behavioral Statistics | 2012

Using a Theorem by Andersen and the Dichotomous Rasch Model to Assess the Presence of Random Guessing in Multiple Choice Items

David Andrich; Ida Marais; Stephen Humphry

Andersen (1995, 2002) proves a theorem relating variances of parameter estimates from samples and subsamples and shows its use as an adjunct to standard statistical analyses. The authors show an application where the theorem is central to the hypothesis tested, namely, whether random guessing on multiple choice items affects their estimates in the Rasch model. Taking random guessing to be a function of the difficulty of an item relative to the proficiency of a person, the authors describe a method for creating a subsample of responses, which is least likely to be affected by guessing. Then using Andersen’s theorem, the authors assess the difference in difficulty estimates between responses from the whole sample and the subsample for each item. To demonstrate the effectiveness of the procedure, data are simulated according to a class of models in which random guessing is a function of the proficiency of a person relative to the difficulty of an item. The procedure is also applied to an empirical data set from Raven’s Advanced Progressive Matrices, with the results indicating that guessing is present in a substantial number of items. It is noted that one especially important application in which estimating the correct relative difficulty of items is required is where the items will form part of an item bank and where on subsequent occasions the items will be administered interactively. In this case, items too difficult for a person are not administered and are therefore unlikely to attract random guessing.
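The logic of the subsampling procedure can be sketched in a small simulation. This illustrates the principle, not the authors' code: the item difficulty, the guessing floor, and the cutoff separating the subsamples are all invented.

```python
import math
import random

random.seed(7)

def rasch_p(theta, b):
    """Dichotomous Rasch success probability."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# A hard item (b = 2) answered by 20,000 simulated persons. Persons for whom
# the item is far too difficult guess among four options, so their success
# probability is floored at 0.25; for everyone else the Rasch model holds.
b = 2.0
cut = b - math.log(3)            # below this theta, the guessing floor binds
low, high = [], []
for _ in range(20000):
    theta = random.gauss(0.0, 1.0)
    p = max(rasch_p(theta, b), 0.25)
    x = 1 if random.random() < p else 0
    (low if theta < cut else high).append((theta, x))

def excess(group):
    """Observed success rate minus the Rasch-model expectation."""
    obs = sum(x for _, x in group) / len(group)
    exp = sum(rasch_p(t, b) for t, _ in group) / len(group)
    return obs - exp

# The guess-prone subsample succeeds well above the model's prediction, so a
# whole-sample difficulty estimate is biased downward (the item looks easier
# than it is); the more proficient subsample is close to the model. Comparing
# estimates from the whole sample and the clean subsample, as the authors do
# via Andersen's theorem, exposes the guessing.
```

The simulation mirrors the abstract's point that only the subsample least likely to guess yields an unbiased difficulty estimate for a hard item.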


Educational Researcher | 2014

Common Structural Design Features of Rubrics May Represent a Threat to Validity

Stephen Humphry; Sandra Heldsinger

Rubrics for assessing student performance are often seen as providing rich information about complex skills. Despite their widespread usage, however, little empirical research has focused on whether it is possible for rubrics to validly meet their intended purposes. The authors examine a rubric used to assess students’ writing in a large-scale testing program. They present empirical evidence for the existence of a potentially widespread threat to the validity of rubric assessments that arose due to design features. In this research, an iterative tryout-redesign-tryout approach was adopted. The research casts doubt on whether rubrics with structurally aligned categories can validly assess complex skills. A solution is proposed that involves rethinking the structural design of the rubric to mitigate the threat to validity. Broader implications are discussed.


Theory & Psychology | 2013

A middle path between abandoning measurement and measurement theory

Stephen Humphry

Michell has argued that item response models carry an inherent paradox because they incorporate error. In counter-arguments, the paradox has been dismissed based on the claim that such models need to incorporate error terms. It has also been claimed that item response models use information contained within intervals. The aim is to describe a formal connection in item response models between error and the measurement unit. It is argued that given the connection, Michell’s criticisms are not as easily dismissed as some claim: His argument does raise questions about whether item response theory can be used to measure psychological attributes. On the other hand, Michell’s position offers little in the way of guidance about how to approach the scientific task of quantification. For this reason, the theoretical framework that he adopts is also critically examined with a particular focus on the way in which measurement in physics is approached.


Synthese | 2012

The ontological distinction between units and entities

Gordon Cooper; Stephen Humphry

The base units of the SI include six units of continuous quantities and the mole, which is defined as proportional to the number of specified elementary entities in a sample. The existence of the mole as a unit has prompted comment in Metrologia that units of all enumerable entities should be defined though not listed as base units. In a similar vein, the BIPM defines numbers of entities as quantities of dimension one, although without admitting these entities as base units. However, there is a basic ontological distinction between continuous quantities and enumerable aggregates. The distinction is the basis of the difference between real and natural numbers. This paper clarifies the nature of the distinction: (i) in terms of a set of measurement axioms stated by Hölder; and (ii) using the formalism known in metrology as quantity calculus. We argue that a clear and unambiguous scientific distinction should be made between measurement and enumeration. We examine confusion in metrological definitions and nomenclature concerning this distinction, and discuss the implications of this distinction for ontology and epistemology in all scientific disciplines.


Frontiers in Psychology | 2013

Understanding measurement in light of its origins

Stephen Humphry

During the course of history, the natural sciences have seen the development of increasingly convenient shorthand symbolic devices for denoting physical quantities. These devices ultimately took the form of physical algebra. However, the convenience of algebra arguably came at a cost: a loss of the clarity of direct insights by Euclid, Galileo, and Newton into natural quantitative relations. Physical algebra is frequently interpreted as ordinary algebra; i.e., it is interpreted as though symbols denote (a) numbers and operations on numbers, as opposed to (b) physical quantities and quantitative relations. The paper revisits the way in which Newton understood and expressed physical definitions and laws. Accordingly, it reviews a compact form of notation that has been used to denote both: (a) ratios of physical quantities; and (b) compound ratios, involving two or more kinds of quantity. The purpose is to show that it is consistent with historical developments to regard physical algebra as a device for denoting relations among ratios. Understood in the historical context, the objective of measurement is to establish that a physical quantity stands in a specific ratio to another quantity of the same kind. To clarify the meaning of measurement in terms of the historical origins of physics carries basic implications for the way in which measurement is understood and approached. Possible implications for the social sciences are considered.
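As a standard illustration of the compound-ratio reading (a textbook example, not one drawn from the article), Galileo's rule for uniform motion relates distances to speeds and times purely through ratios:

```latex
% Galileo's rule for uniform motion, stated as a compound ratio: the
% distances traversed are in the ratio compounded of the ratio of the
% speeds and the ratio of the times.
\[
\frac{s_1}{s_2} \;=\; \frac{v_1}{v_2} \cdot \frac{t_1}{t_2}
\]
% On the ratio reading of measurement, a statement such as $s = 5\,\mathrm{m}$
% abbreviates the ratio assertion $s / (1\,\mathrm{m}) = 5$: the quantity
% stands in the ratio 5 to the unit, a quantity of the same kind.
```

No equation here asserts that a distance *is* a number; each asserts that two quantities of the same kind stand in a ratio, which is the historical sense of measurement the article recovers.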


Educational and Psychological Measurement | 2010

Modeling the Effects of Person Group Factors on Discrimination

Stephen Humphry

Discrimination has traditionally been parameterized for items but not other empirical factors. Consequently, if person factors affect discrimination they cause misfit. However, by explicitly formulating the relationship between discrimination and the unit of a metric, it is possible to parameterize discrimination for person groups. This article applies the Rasch model with a person group discrimination parameter to demonstrate the empirical effect of a person group factor on the degree of discrimination. The model is applied to the responses of students in different grades of schooling to a reading test, resulting in improved equating and fit of the data to the model. A simulation study confirms the efficacy of the model and tests of fit. Applied implications are discussed.
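The idea of attaching discrimination to person groups rather than items can be sketched as follows; the group labels and discrimination values are invented for illustration and are not estimates from the article.

```python
import math

def group_rasch_p(theta, b, a_g=1.0):
    """Success probability with a discrimination parameter a_g attached to a
    person group (e.g. a grade cohort) rather than to the item; a_g = 1 for
    every group recovers the ordinary dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-a_g * (theta - b)))

# Hypothetical grade cohorts with different discrimination values. For the
# same item (b = 0), the cohort with larger a_g separates persons one logit
# above and below the item location more sharply.
grade_a, grade_b = 0.8, 1.3     # assumed group discrimination values
spread_a = group_rasch_p(1.0, 0.0, grade_a) - group_rasch_p(-1.0, 0.0, grade_a)
spread_b = group_rasch_p(1.0, 0.0, grade_b) - group_rasch_p(-1.0, 0.0, grade_b)
# spread_b > spread_a: the second cohort's responses discriminate more sharply
```

In the article's application, modelling such group differences directly, rather than treating them as misfit, improved the equating of grades on a reading test.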

Collaboration


Dive into Stephen Humphry's collaborations.

Top Co-Authors

Sandra Heldsinger, University of Western Australia
David Andrich, University of Western Australia
Ida Marais, University of Western Australia
Joshua A. McGrane, University of Western Australia
Alysha Calleia, University of Wollongong
Emma Barkus, University of Wollongong
Emma Tomkinson, University of Western Australia
Gordon Cooper, University of Western Australia
Hui Ping Chua, University of Western Australia