Michael J. Kolen
University of Iowa
Publication
Featured research published by Michael J. Kolen.
Journal of Educational and Behavioral Statistics | 1984
Michael J. Kolen
An analytic procedure for smoothing in equipercentile equating using cubic smoothing splines is described and illustrated. The effectiveness of the procedure is judged by comparing the results from smoothed equipercentile equating with those from other equating methods using multiple cross-validations for a variety of sample sizes. Data on randomly equivalent groups of approximately 3,000 examinees per form from four forms of each of the four tests of the ACT Assessment Program (AAP) were used in this evaluation. Relative to the other equating procedures studied, smoothed equipercentile equating was found to be most adequate for the AAP, especially for the most dissimilar form pairs.
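The mechanics of the smoothed procedure lend themselves to a brief sketch. The following Python illustration computes raw equipercentile equivalents and postsmooths them with a cubic smoothing spline; the midpoint percentile-rank convention, SciPy's `UnivariateSpline`, and the smoothing parameter `s` are illustrative assumptions, not the paper's exact implementation.

```python
# A minimal sketch of postsmoothed equipercentile equating, assuming
# integer-scored forms X and Y. Ties in the percentile ranks at
# unobserved score points would need extra handling in practice.
import numpy as np
from scipy.interpolate import UnivariateSpline

def percentile_ranks(scores, points):
    """Percentile rank at each score point: percent below plus half
    the percent exactly at the point (midpoint convention)."""
    scores = np.asarray(scores)
    below = np.array([(scores < p).mean() for p in points])
    at = np.array([(scores == p).mean() for p in points])
    return 100.0 * (below + 0.5 * at)

def raw_equipercentile(x_scores, y_scores, x_points, y_points):
    """Map each X score point to the Y score with the same percentile rank."""
    pr_x = percentile_ranks(x_scores, x_points)
    pr_y = percentile_ranks(y_scores, y_points)
    # Invert the Y percentile-rank function by linear interpolation.
    return np.interp(pr_x, pr_y, y_points)

def smoothed_equipercentile(x_scores, y_scores, max_score, s=2.0):
    points = np.arange(max_score + 1)
    raw = raw_equipercentile(x_scores, y_scores, points, points)
    # Postsmoothing: fit a cubic smoothing spline to the raw equivalents.
    spline = UnivariateSpline(points, raw, k=3, s=s)
    return spline(points)
```

Larger values of `s` trade fidelity to the raw equivalents for smoothness, which is the tension the cross-validations in the study were designed to evaluate.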
Applied Psychological Measurement | 1987
Michael J. Kolen; Robert L. Brennan
The Tucker and Levine equally reliable linear methods for test form equating in the common-item nonequivalent-populations design are formulated in a way that promotes understanding of the methods. The formulation emphasizes population notions and is used to draw attention to the practical differences between the methods. It is shown that the Levine method weights group differences more heavily than the Tucker method. A scheme for forming a synthetic population is suggested that is intended to facilitate interpretation of equating results. A procedure for displaying form and group differences is developed that also aids interpretation.
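The Tucker method admits a compact sketch of the synthetic-population idea. In the Python illustration below, the function name, variable names, and equal default weights are illustrative; the Levine equally reliable method differs mainly in how the gamma (slope) terms are defined.

```python
# A minimal sketch of Tucker linear equating under the common-item
# nonequivalent-populations design, with synthetic-population weights
# w1 + w2 = 1 over the two examinee groups.
import numpy as np

def tucker_linear(x, v1, y, v2, w1=0.5):
    """x, v1: total and common-item scores in the group taking Form X;
    y, v2: the same in the group taking Form Y. Returns the slope a and
    intercept b of the equating line l(x) = a*x + b."""
    w2 = 1.0 - w1
    # Regression slopes of total score on common-item score in each group.
    g1 = np.cov(x, v1, ddof=1)[0, 1] / np.var(v1, ddof=1)
    g2 = np.cov(y, v2, ddof=1)[0, 1] / np.var(v2, ddof=1)
    dmu = np.mean(v1) - np.mean(v2)            # group difference on the anchor
    dvar = np.var(v1, ddof=1) - np.var(v2, ddof=1)
    # Synthetic-population moments for X and Y.
    mu_x = np.mean(x) - w2 * g1 * dmu
    mu_y = np.mean(y) + w1 * g2 * dmu
    var_x = np.var(x, ddof=1) - w2 * g1**2 * dvar + w1 * w2 * g1**2 * dmu**2
    var_y = np.var(y, ddof=1) + w1 * g2**2 * dvar + w1 * w2 * g2**2 * dmu**2
    a = np.sqrt(var_y / var_x)
    return a, mu_y - a * mu_x
```

Because the gamma terms multiply the anchor-score group difference `dmu`, the choice between Tucker and Levine matters most when the two groups differ substantially, which is the practical contrast the paper emphasizes.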
Applied Measurement in Education | 2007
Ye Tong; Michael J. Kolen
A number of vertical scaling methodologies were examined in this article. Scaling variations included data collection design, scaling method, item response theory (IRT) scoring procedure, and proficiency estimation method. Vertical scales were developed for Grade 3 through Grade 8 for 4 content areas and 9 simulated datasets. A total of 11 scaling variations were investigated for both real and simulated data. When the assumptions were met with the simulated data, all 11 scaling variations investigated were able to preserve the general characteristics of the scales. With the real data, vertical scales using all the methods showed decelerating growth from lower to higher grades. For within-grade variability, the Thurstone method produced scales with increasing variability over grades, whereas the IRT methods produced scales with fluctuating or decreasing variability over grades. Consequently, the growth patterns of high- and low-achieving students tended to differ across scaling methodologies. The scaling designs produced scales with dissimilar properties, especially for the tests that tended to be less homogeneous in content across grades and for tests that included testlet-based items. Discussion of the findings is provided, followed by a description of limitations of the study and possibilities for future research. Practical implications of the study also are discussed.
Applied Psychological Measurement | 1987
Robert L. Brennan; Michael J. Kolen
The practice of equating frequently involves not only the choice of a statistical equating procedure but also consideration of practical issues that bear upon the use and/or interpretation of equating results. In this paper, major emphasis is given to issues involved in identifying, quantifying, and (to the extent possible) eliminating various sources of error in equating. Other topics considered include content specifications and equating, equating in the context of cutting scores, reequating, and the effects of a security breach on equating. To simplify discussion, some issues are treated from the linear equating perspective in Kolen and Brennan (1987).
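One standard way to quantify the random component of equating error is a bootstrap standard error of equating. The sketch below is a minimal illustration under a random-groups design with linear equating; the resampling scheme, function names, and replication count are illustrative assumptions, not a procedure taken from the paper.

```python
# A minimal sketch of estimating the standard error of equating by
# bootstrap resampling; all names and defaults here are illustrative.
import numpy as np

def linear_equate(x, y):
    """Linear equating for randomly equivalent groups:
    l(s) = mu_y + (sd_y / sd_x) * (s - mu_x)."""
    a = np.std(y, ddof=1) / np.std(x, ddof=1)
    b = np.mean(y) - a * np.mean(x)
    return lambda s: a * s + b

def bootstrap_se(x, y, score_points, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    eq = np.empty((n_boot, len(score_points)))
    for r in range(n_boot):
        xb = rng.choice(x, size=len(x), replace=True)
        yb = rng.choice(y, size=len(y), replace=True)
        eq[r] = linear_equate(xb, yb)(np.asarray(score_points))
    # The spread across replications estimates the SE at each score point.
    return eq.std(axis=0, ddof=1)
```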
Applied Psychological Measurement | 2004
Michael J. Kolen
In this article, the history of linking is summarized, and current linking frameworks that have been proposed are considered. Key publications discussed include Flanagan (1951), Angoff (1971), Linn (1993), Mislevy (1992), and Feuer, Holland, Green, Bertenthal, and Hemphill (1999). The article further focuses on the concordance situation for linking and discusses the concept of concordance and future research that is needed.
Applied Psychological Measurement | 2008
Tianyou Wang; Won-Chan Lee; Robert L. Brennan; Michael J. Kolen
This article uses simulation to compare two test equating methods under the common-item nonequivalent groups design: the frequency estimation method and the chained equipercentile method. An item response theory model is used to define the true equating criterion, simulate group differences, and generate response data. Three linear equating methods are also included for reference. The results show that when there is substantial group difference, the frequency estimation method has larger bias than the chained equipercentile method. The frequency estimation method, however, almost always has a smaller standard error of equating than the chained equipercentile method.
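The chained method composes two single-group equipercentile links, which the following minimal Python sketch makes explicit; the midpoint percentile-rank convention and all names are illustrative assumptions.

```python
# A minimal sketch of chained equipercentile equating: equate X to the
# common-item score V in the group taking Form X, then V to Y in the
# group taking Form Y, and compose the two links.
import numpy as np

def percentile_ranks(scores, points):
    """Midpoint-convention percentile rank at each score point."""
    scores = np.asarray(scores)
    below = np.array([(scores < p).mean() for p in points])
    at = np.array([(scores == p).mean() for p in points])
    return 100.0 * (below + 0.5 * at)

def equip_link(from_scores, to_scores, from_points, to_points):
    """Equipercentile link mapping from_points onto the to-scale."""
    pr_from = percentile_ranks(from_scores, from_points)
    pr_to = percentile_ranks(to_scores, to_points)
    return np.interp(pr_from, pr_to, to_points)

def chained_equipercentile(x, v1, y, v2, x_points, v_points, y_points):
    x_to_v = equip_link(x, v1, x_points, v_points)   # link 1, group 1
    v_to_y = equip_link(v2, y, v_points, y_points)   # link 2, group 2
    # Compose the links: evaluate link 2 at the output of link 1.
    return np.interp(x_to_v, v_points, v_to_y)
```

The frequency estimation method instead builds a synthetic-population score distribution before equating, which is where its greater sensitivity to group differences arises.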
Applied Psychological Measurement | 1986
Deborah J. Harris; Michael J. Kolen
Many educational tests make use of multiple test forms, which are then horizontally equated to establish interchangeability among forms. To have confidence in this interchangeability, the equating relationships should be robust to the particular group of examinees on which the equating is conducted. This study investigated the effects of ability of the examinee group used to establish the equating relationship on linear, equipercentile, and three-parameter logistic IRT estimated true score equating methods. The results show all of the methods to be reasonably independent of examinee group, and suggest that population independence is not a good reason for selecting one method over another.
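The design of such a robustness check can be sketched simply: estimate the same equating function from examinee groups of differing ability and compare the resulting conversions. The illustration below uses linear equating only; the study itself also examined equipercentile and IRT true score methods, and all names here are illustrative.

```python
# A minimal sketch of checking population dependence of an equating
# function by comparing conversions derived from two ability subgroups.
import numpy as np

def linear_fn(x, y):
    a = np.std(y, ddof=1) / np.std(x, ddof=1)
    return a, np.mean(y) - a * np.mean(x)

def max_conversion_difference(x_lo, y_lo, x_hi, y_hi, score_points):
    """Largest absolute difference between equating lines estimated
    from a low-ability and a high-ability examinee group."""
    a1, b1 = linear_fn(x_lo, y_lo)
    a2, b2 = linear_fn(x_hi, y_hi)
    s = np.asarray(score_points)
    return np.max(np.abs((a1 * s + b1) - (a2 * s + b2)))
```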
Applied Measurement in Education | 2006
Seonghoon Kim; Michael J. Kolen
Four item response theory linking methods (2 moment methods and 2 characteristic curve methods) were compared to concurrent (CO) calibration with the focus on the degree of robustness to format effects (FEs) when applying the methods to multidimensional data that reflected the FEs associated with mixed-format tests. Based on the quantification of FEs as the correlation between 2 dominant constructs measured by multiple-choice items and constructed-response items, a hypothetical yet possibly practical situation was assumed where FEs occurred in such a way that a mixed-format test had dimensions aligned with item format. Among linking methods, the characteristic curve methods outperformed the moment methods, regardless of the presence of FEs. In general, CO calibration outperformed the 4 linking methods in linking accuracy and robustness to FEs. However, the performance of CO calibration was only slightly better than that of the characteristic curve methods. Although CO calibration and the characteristic curve methods showed some evidence of being robust to severe FEs (correlation of 0.5), the evidence did not seem to be consistent across test types.
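Of the moment methods, mean/sigma linking is the simplest to sketch: it chooses the scale transformation theta* = A*theta + B from the means and standard deviations of the common items' difficulty estimates. The Python sketch below is illustrative; the characteristic curve methods instead choose A and B by minimizing a loss over test or item characteristic curves.

```python
# A minimal sketch of the mean/sigma moment method for placing item
# parameters from a source calibration onto a target scale; names are
# illustrative assumptions.
import numpy as np

def mean_sigma(b_source, b_target):
    """Linking constants from difficulty estimates of the common items."""
    b_s, b_t = np.asarray(b_source), np.asarray(b_target)
    A = np.std(b_t, ddof=1) / np.std(b_s, ddof=1)
    B = np.mean(b_t) - A * np.mean(b_s)
    return A, B

def rescale(a, b, A, B):
    """Transform source-scale discriminations a and difficulties b:
    b* = A*b + B and a* = a / A under theta* = A*theta + B."""
    return np.asarray(a) / A, A * np.asarray(b) + B
```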
Applied Psychological Measurement | 2001
Guemin Lee; Michael J. Kolen; David A. Frisbie; Robert D. Ankenmann
The performance of two polytomous item response theory models was compared to that of the dichotomous three-parameter logistic model in the context of equating tests composed of testlets. For the polytomous models, testlet scores were used to eliminate the effect of the dependence among within-testlet items. Traditional equating methods were used as criteria for evaluating both approaches. The equating methods based on polytomous models were found to produce results that more closely agreed with the results of traditional methods.
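The testlet-scoring step itself is straightforward, as the minimal sketch below shows: within-testlet item scores are summed into a single polytomous score per testlet, to which a polytomous model such as a graded response model can then be fit. The column grouping and names are illustrative assumptions.

```python
# A minimal sketch of collapsing locally dependent items into testlet
# scores so a polytomous IRT model can be applied at the testlet level.
import numpy as np

def testlet_scores(responses, testlet_cols):
    """responses: (n_examinees, n_items) 0/1 matrix.
    testlet_cols: list of column-index lists, one list per testlet.
    Returns an (n_examinees, n_testlets) matrix of summed scores."""
    r = np.asarray(responses)
    return np.column_stack([r[:, cols].sum(axis=1) for cols in testlet_cols])
```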
Applied Psychological Measurement | 1999
E. Matthew Schulz; Michael J. Kolen; W. Alan Nicewander
A new procedure for defining achievement levels on continuous scales was developed using aspects of Guttman scaling and item response theory. This procedure assigns examinees to levels of achievement when the levels are represented by separate pools of multiple-choice items. Items were assigned to levels on the basis of their content and hierarchically defined level descriptions. The resulting level response functions were well-spaced and noncrossing. This result allowed well-spaced levels of achievement to be defined by a common percent-correct standard of mastery on the level pools. Guttman patterns of mastery could be inferred from level scores. The new scoring procedure was found to have higher reliability, higher classification consistency, and lower classification error, when compared to two Guttman scoring procedures.
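The classification logic can be sketched compactly: score each level's item pool by percent correct, then assign the highest level whose pool, together with all pools below it, meets a common mastery standard. The 0.65 standard and all names in the Python sketch below are illustrative assumptions, not values from the paper.

```python
# A minimal sketch of assigning examinees to hierarchically ordered
# achievement levels from percent-correct mastery of level item pools,
# treating mastery as cumulative in the Guttman sense.
import numpy as np

def assign_level(responses, level_cols, standard=0.65):
    """responses: (n_examinees, n_items) 0/1 matrix.
    level_cols: list of column-index lists, ordered from lowest to
    highest level. Returns each examinee's highest mastered level
    (0 = below the lowest level)."""
    r = np.asarray(responses)
    pct = np.column_stack([r[:, c].mean(axis=1) for c in level_cols])
    mastered = pct >= standard
    levels = np.zeros(r.shape[0], dtype=int)
    for k in range(len(level_cols)):
        # Guttman assumption: level k+1 counts only when every lower
        # level pool is also mastered.
        levels = np.where(mastered[:, :k + 1].all(axis=1), k + 1, levels)
    return levels
```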