Publication


Featured research published by Adam E. Wyse.


Educational Assessment | 2011

How Item Writers Understand Depth of Knowledge

Adam E. Wyse; Steven G. Viger

An important part of test development is ensuring alignment between test forms and content standards. One common way of measuring alignment is the Webb (1997, 2007) alignment procedure. This article investigates (a) how well item writers understand components of the definition of Depth of Knowledge (DOK) from the Webb alignment procedure and (b) how consistent their DOK ratings are with ratings provided by other committees of educators across grade levels, content areas, and alternate assessment levels in a Midwestern state alternate assessment system. Results indicate that many item writers understood key features of DOK; however, some struggled to articulate what DOK means and held some misconceptions. Additional analyses suggested some lack of consistency between the item writer DOK ratings and the committee DOK ratings. Some notable differences were found across alternate assessment levels and content areas. Implications for future item writing training and alignment studies are provided.


Measurement: Interdisciplinary Research & Perspective | 2013

Construct Maps as a Foundation for Standard Setting

Adam E. Wyse

Construct maps are tools that display how the underlying achievement construct upon which one is trying to set cut-scores is related to other information used in the process of standard setting. This article reviews what construct maps are, uses construct maps to provide a conceptual framework to view commonly used standard-setting procedures (the Angoff, Bookmark, Mapmark, Briefing Book, Body of Work, Contrasting Groups, Borderline Groups, and Construct Mapping methods), and describes how construct maps can be applied to set cut-scores and provide feedback, evaluate standard-setting methods, and synthesize data from various standard-setting methods when deciding on cut-scores. Suggestions of how construct maps could help resolve several of the common criticisms of operational standard-setting procedures, including issues related to panelist inconsistency and score gaps, are also provided. An example from a large-scale state-testing program illustrates how construct maps may be applied in practice.


Educational and Psychological Measurement | 2012

Examining Rounding Rules in Angoff-Type Standard-Setting Methods

Adam E. Wyse; Mark D. Reckase

This study investigates how different rounding rules and ways of providing Angoff standard-setting judgments affect cut-scores. A simulation design based on data from the National Assessment of Educational Progress was used to investigate how rounding judgments to the nearest whole number (e.g., 0, 1, 2, etc.), nearest 0.05, or nearest two decimal places for individual items or clusters of items affected cut-scores for individual panelists and a group of panelists across four different pools of items. For the simulated ratings from a group of panelists, the recovery of the cut-scores was examined using the mean and the median. Results showed that rounding to the nearest whole number had the potential to produce fairly large statistical biases in cut-score estimates. Biases were smaller when judgments were simulated across clusters of items. The largest biases were found at the advanced cut-score, but the greatest potential changes in the percentage of students that would be above the cut-score were found for the basic cut-score. Rounding to the nearest 0.05 or nearest two decimal places did not have a large impact on cut-score estimates and had little effect on the percentage of students above the cut-score. Implications for policy and future standard-setting practices are provided.
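
As a rough illustration of the rounding effect described above, the sketch below (with hypothetical ratings, not the paper's NAEP-based simulation design) rounds a panelist's Angoff judgments at different grains before summing them into a cut-score. Rounding expected item scores on a 0-1 scale to the nearest whole number collapses each judgment to 0 or 1, which is where large biases can arise.

```python
# Minimal sketch: how rounding Angoff judgments at different grains shifts a
# panelist's cut-score. Ratings are hypothetical expected item scores (0-1).
import numpy as np

rng = np.random.default_rng(7)
ratings = rng.uniform(0.2, 0.95, size=50)   # hypothetical judgments for 50 items

def cut_score(ratings, grain=None):
    """Sum of a panelist's item ratings, optionally rounding each rating to the given grain first."""
    r = np.asarray(ratings, dtype=float)
    if grain is not None:
        r = np.round(r / grain) * grain
    return r.sum()

print("unrounded    :", round(cut_score(ratings), 2))
print("nearest 0.01 :", round(cut_score(ratings, grain=0.01), 2))
print("nearest 0.05 :", round(cut_score(ratings, grain=0.05), 2))
print("nearest whole:", round(cut_score(ratings, grain=1.0), 2))   # collapses each rating to 0 or 1
```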


Applied Psychological Measurement | 2012

An Evaluation of Item Response Theory Classification Accuracy and Consistency Indices

Adam E. Wyse; Shiqi Hao

This article introduces two new classification consistency indices that can be used when item response theory (IRT) models have been applied. The new indices are shown to be related to Rudner’s classification accuracy index and Guo’s classification accuracy index. The Rudner- and Guo-based classification accuracy and consistency indices are evaluated and compared with estimates from the more commonly applied IRT-recursive procedure using a simulation study and data from two large-scale assessments. Results from the simulation study and practical examples suggested that the Guo- and Rudner-based indices tended to produce estimates that were closer to the simulated values and exceeded those from the IRT-recursive-based procedure. However, the results also suggested that the Rudner- and Guo-based indices can have some undesirable features that are important to keep in mind when applying them in practice. The values of the classification accuracy and consistency indices appeared to be affected by a number of factors, including the distribution of examinees, test length, the placement of the cut-scores, and the proficiency estimators applied to estimate examinee ability. It is suggested that investigations evaluating classification accuracy and consistency indices should include figures that show classification accuracy and consistency for individual examinees across the range of possible scores, as these figures can help reveal subtle but important differences between indices.
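
For readers unfamiliar with the Rudner-based approach, the sketch below computes a Rudner-style expected classification accuracy by treating each ability estimate as normally distributed with its conditional standard error and accumulating the probability mass that falls in the observed category. The theta estimates, standard errors, and cut-scores are hypothetical; this is not the article's simulation code.

```python
# Minimal sketch of a Rudner-style expected classification accuracy estimate.
import numpy as np
from scipy.stats import norm

def rudner_accuracy(theta_hat, se, cuts):
    """Expected classification accuracy for cut-scores `cuts` on the theta scale."""
    cuts = np.sort(np.asarray(cuts, dtype=float))
    edges = np.concatenate(([-np.inf], cuts, [np.inf]))
    observed = np.searchsorted(cuts, theta_hat)          # category each examinee is placed in
    lower, upper = edges[observed], edges[observed + 1]
    # probability that an estimate with this standard error lands in the same category
    p_same = norm.cdf((upper - theta_hat) / se) - norm.cdf((lower - theta_hat) / se)
    return p_same.mean()

theta_hat = np.array([-1.2, -0.3, 0.1, 0.8, 1.5])   # hypothetical ability estimates
se = np.array([0.35, 0.30, 0.28, 0.30, 0.40])       # hypothetical conditional standard errors
print(rudner_accuracy(theta_hat, se, cuts=[-0.5, 0.75]))
```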


Educational and Psychological Measurement | 2014

A Body of Work Standard-Setting Method with Construct Maps

Adam E. Wyse; Michael Bunch; Craig Deville; Steven G. Viger

This article describes a novel variation of the Body of Work method that uses construct maps to overcome problems of transparency, rater inconsistency, and score gaps commonly occurring with the Body of Work method. The Body of Work method with construct maps was implemented to set cut-scores for two separate K-12 assessment programs in a large Midwestern state and was compared with a previous standard setting for one of the K-12 assessment programs that used the traditional Body of Work method. Data from the standard settings were used to investigate the procedural, internal, and external validity of the Body of Work method with construct maps. Results suggested that the method had strong procedural, internal, and external validity evidence to support its application.


Educational and Psychological Measurement | 2011

The Similarity of Bookmark Cut Scores With Different Response Probability Values

Adam E. Wyse

Standard setting is a method used to set cut scores on large-scale assessments. One of the most popular standard setting methods is the Bookmark method. In the Bookmark method, panelists are asked to envision a response probability (RP) criterion and move through a booklet of ordered items based on that criterion. This study investigates whether it is possible to end up with the same cut scores if one were to apply the Bookmark method with two different RP values. Analytical formulas and two hypothetical examples from a large-scale state testing program indicate that it is rarely possible to obtain the same cut score estimates with two different RP values because of item difficulty gaps that arise when applying the procedure in practice. Results indicate that if the same group of panelists applied the Bookmark procedure as it is traditionally explained, then cut scores should be lower with the second chosen RP value than they were with the first RP value. This result holds whether the second RP value is higher or lower than the first. The examples also reveal that differences in cut score estimates with different RP values can lead to changes in the percentage of examinees at or above the cut scores that may have important practical impacts.
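
The sketch below illustrates the RP-to-theta mapping that underlies this result under a Rasch model, where the theta location at which a bookmarked item of difficulty b is answered correctly with probability RP is b + ln(RP / (1 - RP)). The item difficulties are hypothetical, but the gaps between adjacent locations show why two RP values rarely map a bookmark to the same cut score.

```python
# Minimal sketch of Bookmark RP locations under a Rasch model (hypothetical items).
import numpy as np

def rp_location(b, rp):
    """Theta at which P(correct | Rasch difficulty b) equals the RP criterion."""
    return b + np.log(rp / (1 - rp))

difficulties = np.array([-1.4, -0.6, 0.1, 0.9, 1.8])   # hypothetical ordered item booklet
for rp in (0.67, 0.50):
    print(f"RP = {rp:.2f}:", np.round(rp_location(difficulties, rp), 2))
# With RP = .67 every location shifts up by about 0.71 logits relative to RP = .50,
# so a bookmark placed between the same two items maps to a different cut score
# unless an item happens to sit exactly at the shifted location.
```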


Applied Psychological Measurement | 2011

The Potential Impact of Not Being Able to Create Parallel Tests on Expected Classification Accuracy

Adam E. Wyse

In many practical testing situations, alternate test forms from the same testing program are not strictly parallel to each other and instead exhibit small psychometric differences. This article investigates the potential practical impact that these small psychometric differences can have on expected classification accuracy. Ten different sets of tests were assembled by minimizing the differences in test information at five θ locations. The impact of the psychometric differences between the assembled test forms was quantified for two different groups of simulated examinees across a range of possible cut scores. Results indicated that using sequential or simultaneous test assembly is preferred to random test assembly. Analyses also implied that the small differences in the psychometric properties between tests produced differences in overall classification accuracy that were less than 1.5%. The biggest differences in classification accuracy were found when the test information functions were not as well matched in regions where there were more examinees. Although these differences were fairly small, they may still have a practically significant impact on decision making. It is suggested that classification accuracy critically depends on the differences in test information, the location of the cut score, and the groups of examinees considered.
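
As a hypothetical illustration of why test information near the cut score matters, the sketch below uses the standard IRT relation SE(θ) = 1/√I(θ) and a normal approximation to show how a small difference in information at the cut changes the chance of classifying a near-cut examinee correctly. The information values and cut score are illustrative assumptions, not values from the article.

```python
# Minimal sketch: classification accuracy near a cut depends on test information there.
import numpy as np
from scipy.stats import norm

def p_correct_classification(theta, cut, info):
    """P(estimate lands on the correct side of the cut), normal approximation with SE = 1/sqrt(info)."""
    se = 1.0 / np.sqrt(info)
    p_above = 1.0 - norm.cdf((cut - theta) / se)
    return p_above if theta >= cut else 1.0 - p_above

theta, cut = 0.15, 0.0                       # examinee just above a hypothetical cut score
for info in (8.0, 10.0):                     # two forms with slightly different information at the cut
    print(f"I(theta) = {info:4.1f} -> P(correct classification) = {p_correct_classification(theta, cut, info):.3f}")
```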


Measurement: Interdisciplinary Research & Perspective | 2015

Challenges on the Path to Implementation

Joseph A. Martineau; Adam E. Wyse

Our view is that in large-scale assessment, the industry has attended reasonably well to priorities 1 to 3 at the expense of priority 4 and has had excellent success with priority 1, some difficulty with priority 3, greater difficulty with priority 2, and arguably failed in priority 4. The lack of attention to priority 4 and the difficulties with priorities 2 to 4 can be partly attributed to the nearly exclusive focus on estimating and reporting on broad content area knowledge rather than providing smaller-grain-size reports grounded in theory and analysis about how knowledge develops. The authors present an admirable framework for addressing many of the tensions by modeling both student achievement and student learning through the use of learning progressions, reporting on both broad swaths and finer slices of content. As they point out, this introduces some considerable psychometric difficulties. We add that it also introduces considerable policy and score-reporting/interpretation difficulties. To be clear, the existence and/or validity of the issues we raise should not be interpreted as a reason not to move forward with the approach the authors have described. We see this approach as promising enough to deserve near-immediate implementation to at least some degree and full implementation (with ongoing research and refinement) at some time in the near future. A policy issue that deserves its own section is that the authors present their framework as an approach that would need to be implemented essentially in full to allow for meaningful measurement of student growth that addresses many of the major difficulties. While we agree with the


Applied Measurement in Education | 2015

Considering the Use of General and Modified Assessment Items in Computerized Adaptive Testing

Adam E. Wyse; Anthony D. Albano

This article used several data sets from a large-scale state testing program to examine the feasibility of combining general and modified assessment items in computerized adaptive testing (CAT) for different groups of students. Results suggested that several of the assumptions made when employing this type of mixed-item CAT may not be met for students with disabilities who have typically taken alternate assessments based on modified achievement standards (AA-MAS). A simulation study indicated that the abilities of AA-MAS students can be underestimated or overestimated by the mixed-item CAT, depending on students’ location on the underlying ability scale. These findings held across grade levels and test lengths. The mixed-item CAT appeared to function well for non-AA-MAS students.
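
As a rough picture of how a mixed-item CAT chooses its next item, the sketch below runs one step of a maximum-information item selection rule over a small hypothetical pool containing both general and modified items. The item parameters, item labels, and 2PL model are illustrative assumptions, not the article's operational CAT.

```python
# Minimal sketch of one mixed-item CAT step: pick the item (general or modified)
# with maximum Fisher information at the current provisional ability estimate.
import numpy as np

a = np.array([1.2, 0.8, 1.5, 0.9, 1.1])        # hypothetical 2PL discriminations
b = np.array([-1.0, -0.4, 0.2, 0.8, 1.5])      # hypothetical 2PL difficulties
item_type = np.array(["modified", "modified", "general", "general", "general"])

def fisher_info(theta, a, b):
    """2PL item information a^2 * p * (1 - p) at ability theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

theta_hat = -0.6                                # current provisional ability estimate
info = fisher_info(theta_hat, a, b)
pick = int(np.argmax(info))
print(f"next item: #{pick} ({item_type[pick]}), info = {info[pick]:.3f}")
```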


Applied Measurement in Education | 2018

Regression Effects in Angoff Ratings: Examples from Credentialing Exams

Adam E. Wyse

This article discusses regression effects that are commonly observed in Angoff ratings, where panelists tend to judge hard items as easier than they are and easy items as more difficult than they are, relative to estimated item difficulties. Analyses of data from two credentialing exams illustrate these regression effects and their persistence across rounds of standard setting, even after panelists have received feedback information and have been given the opportunity to discuss their ratings. Additional analyses show a relationship between the average item ratings provided by panelists and the standard deviations of those ratings; the relationship followed a quadratic form, with peak variation found toward the middle of the item difficulty scale. The study concludes with a discussion of these findings and what they may imply for future standard settings.
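
As a small illustration of the kind of quadratic mean-versus-spread pattern described above, the sketch below fits the standard deviation of panelists' ratings as a quadratic function of the mean rating and locates where variability peaks. The ratings are hypothetical and do not come from the credentialing exams analyzed in the article.

```python
# Minimal sketch: quadratic fit of rating spread against mean rating (hypothetical data).
import numpy as np

mean_rating = np.array([0.15, 0.30, 0.45, 0.55, 0.70, 0.85])  # hypothetical per-item mean ratings
sd_rating   = np.array([0.05, 0.10, 0.14, 0.15, 0.11, 0.06])  # hypothetical per-item rating SDs

c2, c1, c0 = np.polyfit(mean_rating, sd_rating, deg=2)         # quadratic fit, highest power first
peak = -c1 / (2 * c2)                                          # vertex of the fitted parabola
print(f"fitted quadratic: sd = {c2:.3f}*m^2 + {c1:.3f}*m + {c0:.3f}; peak variability near mean = {peak:.2f}")
```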

Collaboration


Dive into Adam E. Wyse's collaboration.

Top Co-Authors

Anthony D. Albano

University of Nebraska–Lincoln

Mark D. Reckase

Michigan State University