Publications


Featured research published by Stefanie A. Wind.


Musicae Scientiae | 2015

Rater fairness in music performance assessment: Evaluating model-data fit and differential rater functioning

Brian C. Wesolowski; Stefanie A. Wind; George Engelhard

The purpose of this study was to investigate model-data fit and differential rater functioning in the context of large group music performance assessment using the Many-Facet Rasch Partial Credit Measurement Model. In particular, we sought to identify whether expert raters’ (N = 24) severity was invariant across four school levels (middle school, high school, collegiate, professional). Interaction analyses suggested that differential rater functioning existed for both the group of raters and some individual raters based on their expected locations on the logit scale. This indicates that expert raters did not demonstrate invariant levels of severity when rating subgroups of ensembles across the four school levels. Of the 92 potential pairwise interactions examined, 14 (15.2%) were found to be statistically significant, indicating that 10 individual raters demonstrated differential severity across at least one school level. Interpretations of meaningful systematic patterns emerged for some raters after investigating individual pairwise interactions. Implications for improving fairness and equity in large group music performance evaluations are discussed.
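
For reference, the Many-Facet Rasch Partial Credit Model used here is commonly written in the following form (standard Facets-style notation; the facet labels below are illustrative rather than quoted from the article):

\[ \ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \delta_i - \lambda_j - \tau_{ik} \]

where \theta_n is the location (achievement) of ensemble n, \delta_i is the difficulty of rubric item i, \lambda_j is the severity of rater j, and \tau_{ik} is the threshold between categories k-1 and k of item i. Differential rater functioning amounts to testing whether \lambda_j remains invariant across subgroups, in this case the four school levels.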


Educational Assessment | 2016

Exploring the Effects of Rater Linking Designs and Rater Fit on Achievement Estimates Within the Context of Music Performance Assessments

Stefanie A. Wind; George Engelhard; Brian C. Wesolowski

When good model-data fit is observed, the Many-Facet Rasch (MFR) model acts as a linking and equating model that can be used to estimate student achievement, item difficulties, and rater severity on the same linear continuum. Given sufficient connectivity among the facets, the MFR model provides estimates of student achievement that are equated to control for differences in rater severity. Although several different linking designs are used in practice to establish connectivity, the implications of design differences have not been fully explored. Research is also limited related to the impact of model-data fit on the quality of MFR model-based adjustments for rater severity. This study explores the effects of linking designs and model-data fit for raters on the interpretation of student achievement estimates within the context of performance assessments in music. Results indicate that performances cannot be effectively adjusted for rater effects when inadequate linking or model-data fit is present.
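
The connectivity requirement mentioned above can be checked directly by treating the rating design as a bipartite graph of raters and students joined by observed ratings. A minimal sketch of that check (Python; the function name and data layout are assumptions for illustration, not part of the study):

```python
# Minimal sketch: a rating design is "connected" when every rater and student
# can be reached from every other through chains of observed ratings.
from collections import defaultdict

def is_connected(observed_pairs):
    """observed_pairs: iterable of (rater_id, student_id) for each observed rating."""
    graph = defaultdict(set)
    for rater, student in observed_pairs:
        graph[("rater", rater)].add(("student", student))
        graph[("student", student)].add(("rater", rater))
    if not graph:
        return True
    start = next(iter(graph))
    seen, stack = {start}, [start]
    while stack:
        for neighbor in graph[stack.pop()]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return len(seen) == len(graph)

# Two separate panels are disconnected until a shared student (or rater) links them.
print(is_connected([("A", 1), ("A", 2), ("B", 3), ("B", 4)]))  # False
print(is_connected([("A", 1), ("A", 2), ("B", 2), ("B", 3)]))  # True
```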


Language Testing | 2018

A systematic review of methods for evaluating rating quality in language assessment

Stefanie A. Wind; Meghan E. Peterson

The use of assessments that require rater judgment (i.e., rater-mediated assessments) has become increasingly popular in high-stakes language assessments worldwide. Using a systematic literature review, the purpose of this study is to identify and explore the dominant methods for evaluating rating quality within the context of research on large-scale rater-mediated language assessments. Results from the review of 259 methodological and applied studies reveal an emphasis on inter-rater reliability as evidence of rating quality that persists across methodological and applied studies, across studies that are and are not primarily focused on rating quality, and across multiple language constructs. Additional findings suggest discrepancies in rating designs used in empirical research and practical concerns in performance assessment systems. Taken together, the findings from this study highlight the reliance upon aggregate-level information that is not specific to individual raters or specific facets of an assessment context as evidence of rating quality in rater-mediated assessments. In order to inform the interpretation and use of ratings, as well as the improvement of rater-mediated assessment systems, rating quality indices are needed that go beyond group-level indicators of inter-rater reliability, and provide diagnostic evidence of rating quality specific to individual raters, students, and other facets of the assessment system. These indicators are available based on modern measurement techniques, such as Rasch measurement theory and other item response theory approaches. Implications are discussed as they relate to validity, reliability/precision, and fairness for rater-mediated assessments.
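
The contrast drawn in the review, group-level agreement versus rater-specific diagnostics, can be seen in a small numerical sketch (Python; the data and the choice of indices are invented for illustration):

```python
import numpy as np

# ratings[r, s]: the score rater r assigned to student s (complete data for simplicity).
ratings = np.array([
    [3, 2, 4, 3, 1],
    [3, 2, 4, 3, 2],
    [2, 1, 3, 2, 1],   # a consistently severe rater
])

# Group-level evidence: exact-agreement rate pooled over all rater pairs.
pairs = [(0, 1), (0, 2), (1, 2)]
exact = np.mean([np.mean(ratings[a] == ratings[b]) for a, b in pairs])
print(f"pooled exact agreement: {exact:.2f}")

# Rater-specific evidence: each rater's mean deviation from the other raters,
# a crude stand-in for the severity estimates a measurement model would provide.
for r in range(ratings.shape[0]):
    others = np.delete(ratings, r, axis=0).mean(axis=0)
    print(f"rater {r}: mean deviation {np.mean(ratings[r] - others):+.2f}")
```

A single pooled coefficient summarizes the panel, while the rater-level deviations flag the third rater's systematic severity, which is the kind of diagnostic information the review finds underused.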


Educational and Psychological Measurement | 2018

The Stabilizing Influences of Linking Set Size and Model–Data Fit in Sparse Rater-Mediated Assessment Networks

Stefanie A. Wind; Eli Jones

Previous research includes frequent admonitions regarding the importance of establishing connectivity in data collection designs prior to the application of Rasch models. However, details regarding the influence of characteristics of the linking sets used to establish connections among facets, such as locations on the latent variable, model–data fit, and sample size, have not been thoroughly explored. These considerations are particularly important in assessment systems that involve large proportions of missing data (i.e., sparse designs) and are associated with high-stakes decisions, such as teacher evaluations based on teaching observations. The purpose of this study is to explore the influence of characteristics of linking sets in sparsely connected rating designs on examinee, rater, and task estimates. A simulation design whose characteristics were intended to reflect practical large-scale assessment networks with sparse connections was used to consider the influence of locations on the latent variable, model–data fit, and sample size within linking sets on the stability and model–data fit of estimates. Results suggested that parameter estimates for examinee and task facets are quite robust to modifications in the size, model–data fit, and latent-variable location of the link. Parameter estimates for the rater, while still quite robust, are more sensitive to reductions in link size. The implications are discussed as they relate to research, theory, and practice.
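
The sparse-but-linked designs examined in this study can be pictured with a short construction (Python sketch; the block sizes and the size of the linking set are arbitrary choices, not the article's simulation conditions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_examinees, n_raters, link_size = 200, 20, 5
design = np.zeros((n_raters, n_examinees), dtype=bool)   # True = rating observed

# Each rater scores a disjoint block of examinees: sparse and, on its own, disconnected.
block = n_examinees // n_raters
for r in range(n_raters):
    design[r, r * block:(r + 1) * block] = True

# A small linking set of examinees scored by every rater ties the blocks together;
# the study varies the size, fit, and latent-variable location of this kind of set.
link = rng.choice(n_examinees, size=link_size, replace=False)
design[:, link] = True

print(f"observed cells: {design.sum()} of {design.size} ({design.mean():.1%} of the full design)")
```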


Educational and Psychological Measurement | 2016

Exploring Rating Quality in Rater-Mediated Assessments Using Mokken Scale Analysis

Stefanie A. Wind; George Engelhard

Mokken scale analysis is a probabilistic nonparametric approach that offers statistical and graphical tools for evaluating the quality of social science measurement without placing potentially inappropriate restrictions on the structure of a data set. In particular, Mokken scaling provides a useful method for evaluating important measurement properties, such as invariance, in contexts where response processes are not well understood. Because rater-mediated assessments involve complex interactions among many variables, including assessment contexts, student artifacts, rubrics, individual rater characteristics, and others, rater-assigned scores are suitable candidates for Mokken scale analysis. The purposes of this study are to describe a suite of indices that can be used to explore the psychometric quality of data from rater-mediated assessments and to illustrate the substantive interpretation of Mokken-based statistics and displays in this context. Techniques that are commonly used in polytomous applications of Mokken scaling are adapted for use with rater-mediated assessments, with a focus on the substantive interpretation related to individual raters. Overall, the findings suggest that indices of rater monotonicity, rater scalability, and invariant rater ordering based on Mokken scaling provide diagnostic information at the level of individual raters related to the requirements for invariant measurement. These Mokken-based indices serve as an additional suite of diagnostic tools for exploring the quality of data from rater-mediated assessments that can supplement rating quality indices based on parametric models.
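
As one concrete example of the indices described above, rater monotonicity asks whether a rater's average rating is non-decreasing as student achievement (approximated by rest scores from the other raters) increases. A minimal sketch of that check (Python; the grouping scheme is a simplification of the Mokken procedures, not the exact method in the article):

```python
import numpy as np

def rater_monotonicity(ratings, rater, n_groups=4):
    """Mean rating by `rater` within rest-score groups, plus a count of decreases.
    Rest score = a student's total from all raters except `rater`."""
    rest = np.delete(ratings, rater, axis=0).sum(axis=0)
    order = np.argsort(rest)                       # students, low to high rest score
    groups = np.array_split(order, n_groups)
    means = [ratings[rater, g].mean() for g in groups]
    violations = sum(later < earlier for earlier, later in zip(means, means[1:]))
    return means, violations

ratings = np.random.default_rng(1).integers(1, 5, size=(5, 60))   # 5 raters, 60 students
means, violations = rater_monotonicity(ratings, rater=0)
print(np.round(means, 2), "decreases:", violations)
```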


Educational and Psychological Measurement | 2017

Adjacent-Categories Mokken Models for Rater-Mediated Assessments

Stefanie A. Wind

Molenaar extended Mokken’s original probabilistic-nonparametric scaling models for use with polytomous data. These polytomous extensions of Mokken’s original scaling procedure have facilitated the use of Mokken scale analysis as an approach to exploring fundamental measurement properties across a variety of domains in which polytomous ratings are used, including rater-mediated educational assessments. Because their underlying item step response functions (i.e., category response functions) are defined using cumulative probabilities, polytomous Mokken models can be classified as cumulative models based on the classifications of polytomous item response theory models proposed by several scholars. In order to permit a closer conceptual alignment with educational performance assessments, this study presents an adjacent-categories variation on the polytomous monotone homogeneity and double monotonicity models. Data from a large-scale rater-mediated writing assessment are used to illustrate the adjacent-categories approach, and results are compared with the original formulations. Major findings suggest that the adjacent-categories models provide additional diagnostic information related to individual raters’ use of rating scale categories that is not observed under the original formulation. Implications are discussed in terms of methods for evaluating rating quality.
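
The cumulative versus adjacent-categories distinction can be stated compactly in standard notation (the symbols are illustrative rather than quoted from the article). A cumulative item step response function, as in the original polytomous Mokken models, concerns

\[ P(X_{nj} \ge k \mid \theta_n), \qquad k = 1, \dots, m, \]

the probability that rater j assigns student n category k or higher, whereas the adjacent-categories variation concerns

\[ P(X_{nj} = k \mid X_{nj} \in \{k-1, k\}, \, \theta_n), \]

the conditional probability of category k versus the category immediately below it. The adjacent-categories form mirrors the structure of partial-credit-type Rasch models routinely applied to performance assessments, which is the closer conceptual alignment the study seeks.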


Educational and Psychological Measurement | 2018

Exploring Incomplete Rating Designs With Mokken Scale Analysis

Stefanie A. Wind; Yogendra Patil

Recent research has explored the use of models adapted from Mokken scale analysis as a nonparametric approach to evaluating rating quality in educational performance assessments. A potential limiting factor to the widespread use of these techniques is the requirement for complete data, as practical constraints in operational assessment systems often limit the use of complete rating designs. In order to address this challenge, this study explores the use of missing data imputation techniques and their impact on Mokken-based rating quality indicators related to rater monotonicity, rater scalability, and invariant rater ordering. Simulated data and real data from a rater-mediated writing assessment were modified to reflect varying levels of missingness, and four imputation techniques were used to impute missing ratings. Overall, the results indicated that simple imputation techniques based on rater and student means result in generally accurate recovery of rater monotonicity indices and rater scalability coefficients. However, discrepancies between violations of invariant rater ordering in the original and imputed data are somewhat unpredictable across imputation methods. Implications for research and practice are discussed.
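
The simple imputation strategies that recovered the Mokken indices well, replacing a missing rating with the rater's mean or with the student's mean, can be sketched in a few lines (Python; the long-format layout and column names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Long-format ratings; NaN marks cells not scored under the incomplete design.
df = pd.DataFrame({
    "rater":   ["A", "A", "B", "B", "C", "C"],
    "student": [1,    2,   2,   3,   1,   3],
    "score":   [3.0, np.nan, 4.0, 2.0, np.nan, 3.0],
})

# Rater-mean imputation: fill a missing score with that rater's observed mean.
df["rater_mean_imp"] = df.groupby("rater")["score"].transform(lambda s: s.fillna(s.mean()))

# Student-mean imputation: fill it with that student's observed mean instead.
df["student_mean_imp"] = df.groupby("student")["score"].transform(lambda s: s.fillna(s.mean()))

print(df)
```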


School Effectiveness and School Improvement | 2018

Principals’ use of rating scale categories in classroom observations for teacher evaluation

Stefanie A. Wind; Chia-Lin Tsai; Sara Grajeda; Christi Bergin

Teacher evaluation systems commonly rely on observation of teaching practice (OTP) by school principals. However, the value of OTP as evidence of teacher effectiveness depends on its psychometric quality. In this study, we address a key aspect of the psychometric quality of principals’ OTP ratings. Specifically, we investigate the degree to which rating scale categories have a consistent interpretation across teaching episodes and practices. Results suggest that, overall, the 1,324 principals used the rating scale categories as intended. However, we also found that the midpoint category is underutilized and that rating categories do not always reflect similar levels of teaching effectiveness across teaching episodes and practices. When such discrepancies occur, we cannot assume principals’ ratings reflect a consistent level of teacher effectiveness within and across classrooms. This is a critical component of validity evidence that can inform the interpretation of OTP ratings and point to areas for improvement in both the rubrics and principals’ training for classroom observations.
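
Category-use patterns of the kind reported here, such as an underutilized midpoint, can be spotted with a simple frequency check before any model is fit (Python sketch; the ratings below are invented to mimic a sparse middle category):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Hypothetical observation ratings on a 1-5 scale with a deliberately rare midpoint.
obs = rng.choice([1, 2, 3, 4, 5], size=1000, p=[0.10, 0.30, 0.05, 0.35, 0.20])

proportions = pd.Series(obs).value_counts(normalize=True).sort_index()
print(proportions)   # a very small share for category 3 flags possible midpoint avoidance
```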


Musicae Scientiae | 2018

Exploring decision consistency and decision accuracy across rating designs in rater-mediated music performance assessments

Stefanie A. Wind; Pey Shin Ooi; George Engelhard

Music performance assessments frequently include panels of raters who evaluate the quality of musical performances using rating scales. As a result of practical considerations, it is often not possible to obtain ratings from every rater on every performance (i.e., complete rating designs). When there are differences in rater severity, and not all raters rate all performances, ratings of musical performances and their resulting classification (e.g., pass or fail) depend on the “luck of the rater draw.” In this study, we explored the implications of different types of incomplete rating designs for the classification of musical performances in rater-mediated musical performance assessments. We present a procedure that researchers and practitioners can use to adjust student scores for differences in rater severity when incomplete rating designs are used, and we consider the effects of the adjustment procedure across different types of rating designs. Our results suggested that differences in rater severity have large practical consequences for ratings of musical performances that impact individual students and groups of students differently. Furthermore, our findings suggest that it is possible to adjust musical performance ratings for differences in rater severity as long as there are common raters across scoring panels. We consider the implications of our findings as they relate to music assessment research and practice.
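
The adjustment logic, using common raters to place panels on a common footing and then removing estimated severity from observed scores, can be caricatured in a few lines (Python; this mean-deviation adjustment is a simplified stand-in for the model-based procedure presented in the article):

```python
import numpy as np

# ratings[r, s]: rater r's score for performance s; NaN = not rated (incomplete design).
ratings = np.array([
    [4.0, 3.0, np.nan, np.nan],   # panel 1
    [3.0, 2.0, np.nan, np.nan],   # panel 1, noticeably more severe
    [4.0, np.nan, 4.0, 3.0],      # common rater linking the two panels
    [np.nan, np.nan, 3.0, 2.0],   # panel 2, more severe
])

performance_means = np.nanmean(ratings, axis=0)               # rough performance levels
severity = np.nanmean(ratings - performance_means, axis=1)    # each rater's average deviation
adjusted = ratings - severity[:, None]                        # remove severity from raw scores

print("severity estimates:", np.round(severity, 2))
print("adjusted scores:   ", np.round(np.nanmean(adjusted, axis=0), 2))
```

Without the common rater in the third row, the two panels would share no performances, and severity estimates could not be compared across panels.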


International Journal of Testing | 2018

The Influence of Rater Effects in Training Sets on the Psychometric Quality of Automated Scoring for Writing Assessments

Stefanie A. Wind; Edward W. Wolfe; George Engelhard; Peter W. Foltz; Mark Rosenstein

Automated essay scoring engines (AESEs) are becoming increasingly popular as an efficient method for performance assessments in writing, including many language assessments that are used worldwide. Before they can be used operationally, AESEs must be “trained” using machine-learning techniques that incorporate human ratings. However, the quality of the human ratings used to train the AESEs is rarely examined. As a result, the impact of various rater effects (e.g., severity and centrality) on the quality of AESE-assigned scores is not known. In this study, we use data from a large-scale rater-mediated writing assessment to examine the impact of rater effects on the quality of AESE-assigned scores. Overall, the results suggest that if rater effects are present in the ratings used to train an AESE, the AESE scores may replicate these effects. Implications are discussed in terms of research and practice related to automated scoring.
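
The central finding, that an engine trained on effect-laden ratings tends to reproduce those effects, can be illustrated with a toy training loop (Python with scikit-learn; the features, effect sizes, and model are invented for illustration and are not the AESE studied in the article):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 500
features = rng.normal(size=(n, 3))                       # stand-in essay features
true_score = features @ np.array([1.0, 0.5, 0.3])        # "true" writing quality

# Human training ratings contaminated with severity (a -0.5 shift) and centrality
# (scores pulled toward the mean), mimicking the rater effects examined in the study.
train_ratings = 0.7 * true_score + 0.3 * true_score.mean() - 0.5 + rng.normal(0, 0.2, n)

engine = LinearRegression().fit(features, train_ratings)
predicted = engine.predict(features)

print(f"mean shift vs. true scores: {np.mean(predicted - true_score):+.2f}")       # severity carries over
print(f"score spread ratio:         {np.std(predicted) / np.std(true_score):.2f}")  # compression carries over
```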

Collaboration


Stefanie A. Wind's top co-authors and their affiliations:

Jessica Gale, Georgia Institute of Technology
Barbara S. Plake, University of Nebraska–Lincoln
Chia-Lin Tsai, University of Northern Colorado
Eli Jones, University of Missouri