Publication


Featured research published by George Engelhard.


Musicae Scientiae | 2015

Rater fairness in music performance assessment: Evaluating model-data fit and differential rater functioning

Brian C. Wesolowski; Stefanie A. Wind; George Engelhard

The purpose of this study was to investigate model-data fit and differential rater functioning in the context of large group music performance assessment using the Many-Facet Rasch Partial Credit Measurement Model. In particular, we sought to identify whether or not expert raters’ (N = 24) severity was invariant across four school levels (middle school, high school, collegiate, professional). Interaction analyses suggested that differential rater functioning existed for both the group of raters and some individual raters based on their expected locations on the logit scale. This indicates that expert raters did not demonstrate invariant levels of severity when rating subgroups of ensembles across the four school levels. Of the 92 potential pairwise interactions examined, 14 (15.2%) interactions were found to be statistically significant, indicating that 10 individual raters demonstrated differential severity across at least one school level. Interpretations of meaningful systematic patterns emerged for some raters after investigating individual pairwise interactions. Implications for improving the fairness and equity in large group music performance evaluations are discussed.
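
The severity analysis described above rests on a Many-Facet Rasch formulation in which rater severity enters as its own parameter. A common way to write the Many-Facet Rasch Partial Credit Model is sketched below; the notation is illustrative and may differ from the exact parameterization used in the paper.

```latex
% Illustrative statement of the Many-Facet Rasch Partial Credit Model:
% P_{nijk} is the probability that rater j assigns ensemble n a rating in
% category k (rather than k-1) on criterion i; theta_n is ensemble
% achievement, lambda_j is rater severity, delta_i is criterion difficulty,
% and tau_{ik} is the k-th threshold of criterion i.
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right)
  = \theta_n - \lambda_j - \delta_i - \tau_{ik}
```

Differential rater functioning is then a question of whether the severity term \lambda_j remains invariant when it is estimated separately within each school-level subgroup.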


Educational Assessment | 2016

Exploring the Effects of Rater Linking Designs and Rater Fit on Achievement Estimates Within the Context of Music Performance Assessments

Stefanie A. Wind; George Engelhard; Brian C. Wesolowski

When good model-data fit is observed, the Many-Facet Rasch (MFR) model acts as a linking and equating model that can be used to estimate student achievement, item difficulties, and rater severity on the same linear continuum. Given sufficient connectivity among the facets, the MFR model provides estimates of student achievement that are equated to control for differences in rater severity. Although several different linking designs are used in practice to establish connectivity, the implications of design differences have not been fully explored. Research on the impact of model-data fit on the quality of MFR model-based adjustments for rater severity is also limited. This study explores the effects of linking designs and model-data fit for raters on the interpretation of student achievement estimates within the context of performance assessments in music. Results indicate that performances cannot be effectively adjusted for rater effects when inadequate linking or model-data fit is present.
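
The connectivity requirement mentioned above can be checked directly from the rating design before any model is fit: every rater and every performance must fall in a single connected component of the bipartite rater-by-performance network, otherwise severities estimated in disconnected panels cannot be placed on one scale. A minimal sketch in Python (the design data are hypothetical, purely for illustration):

```python
# Sketch: check whether a rater-by-performance rating design is connected
# enough to support Many-Facet Rasch linking. Hypothetical design data.
from collections import defaultdict


def connected_components(ratings):
    """ratings: iterable of (rater, performance) pairs from the design."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Each rating links a rater node to a performance node.
    for rater, perf in ratings:
        union(("rater", rater), ("perf", perf))

    groups = defaultdict(set)
    for node in parent:
        groups[find(node)].add(node)
    return list(groups.values())


# Raters R1/R2 and R3/R4 never rate a common performance, so the design
# splits into two components and severities cannot be equated across them.
design = [("R1", "P1"), ("R2", "P1"), ("R1", "P2"),
          ("R3", "P3"), ("R4", "P3"), ("R4", "P4")]
print(len(connected_components(design)), "connected component(s)")  # -> 2
```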


Measurement: Interdisciplinary Research & Perspective | 2014

Alternative Measurement Paradigms for Measuring Executive Functions: SEM (Formative and Reflective Models) and IRT (Rasch Models)

George Engelhard; Jue Wang

The authors of the Focus article pose important questions regarding whether or not performance-based tasks related to executive functioning are best viewed as reflective or formative indicators. Miyake and Friedman (2012) define executive functioning (EF) as “a set of general-purpose control mechanisms, often linked to the prefrontal cortex of the brain, that regulate the dynamics of human cognition and action. EFs are important to study because they are a core component of self-control or self-regulation ability (or ‘willpower’)” (p. 8). Reflective and formative models are being actively debated within the structural equation modeling (SEM) literature. In essence, reflective models are defined by effect indicators (the construct causes the tasks), while formative models are defined by causal indicators (the tasks cause the construct). Major measurement models grounded in classical test theory and factor analysis are reflective models. Bollen and Lennox (1991) have suggested an alternative perspective based on SEM that offers an option for using formative models to define constructs. The purpose of the Focus article is to critically evaluate the routine use of reflective models for measuring individual differences in performance-based tasks and to challenge researchers to consider formative models of executive functioning. Recently, Engelhard (2013) stressed the importance of research traditions and paradigms (Andrich, 2004) in understanding how measurement problems are posed and addressed in the social, behavioral, and health sciences. As pointed out by Lazarsfeld (1966),
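
In conventional SEM notation the contrast between the two kinds of models can be sketched as follows (a schematic summary, not the Focus article's own equations): under a reflective model the correlations among task scores are attributed to their common cause, whereas a formative construct carries no such implication.

```latex
% Reflective (effect-indicator) model: the latent construct eta causes each
% observed task score x_i, with loading lambda_i and error epsilon_i.
x_i = \lambda_i \eta + \varepsilon_i

% Formative (causal-indicator) model: the observed tasks jointly determine
% the construct, with weights gamma_i and disturbance zeta.
\eta = \sum_i \gamma_i x_i + \zeta
```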


Journal of Psychoeducational Assessment | 2016

Examining the Teachers' Sense of Efficacy Scale at the Item Level with Rasch Measurement Model

Mei-Lin Chang; George Engelhard

The purpose of this study is to examine the psychometric quality of the Teachers’ Sense of Efficacy Scale (TSES) with data collected from 554 teachers in a U.S. Midwestern state. The many-facet Rasch model was used to examine several potential contextual influences (years of teaching experience, school context, and levels of emotional exhaustion) on item functioning within the TSES. Results suggest that although TSES items are rather easy for teachers to endorse, sufficient variance in the item endorsement hierarchy of the scale exists to support the validity of score interpretations. The items are invariant across years of teaching experience and school locations, but not across levels of emotional exhaustion.


Educational and Psychological Measurement | 2016

Exploring Rating Quality in Rater-Mediated Assessments Using Mokken Scale Analysis

Stefanie A. Wind; George Engelhard

Mokken scale analysis is a probabilistic nonparametric approach that offers statistical and graphical tools for evaluating the quality of social science measurement without placing potentially inappropriate restrictions on the structure of a data set. In particular, Mokken scaling provides a useful method for evaluating important measurement properties, such as invariance, in contexts where response processes are not well understood. Because rater-mediated assessments involve complex interactions among many variables, including assessment contexts, student artifacts, rubrics, individual rater characteristics, and others, rater-assigned scores are suitable candidates for Mokken scale analysis. The purposes of this study are to describe a suite of indices that can be used to explore the psychometric quality of data from rater-mediated assessments and to illustrate the substantive interpretation of Mokken-based statistics and displays in this context. Techniques that are commonly used in polytomous applications of Mokken scaling are adapted for use with rater-mediated assessments, with a focus on the substantive interpretation related to individual raters. Overall, the findings suggest that indices of rater monotonicity, rater scalability, and invariant rater ordering based on Mokken scaling provide diagnostic information at the level of individual raters related to the requirements for invariant measurement. These Mokken-based indices serve as an additional suite of diagnostic tools for exploring the quality of data from rater-mediated assessments that can supplement rating quality indices based on parametric models.
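
For readers unfamiliar with the indices named above, the rater scalability index is an adaptation of Mokken's item scalability coefficient, with raters playing the role usually played by items; a common form for a single rater j is shown below (notation illustrative).

```latex
% Scalability of rater j: the sum of observed covariances between rater j's
% scores and the other raters' scores, relative to the maximum those
% covariances could attain given the marginal score distributions.
% Values near 1 indicate that rater j orders the student artifacts
% consistently with the other raters.
H_j = \frac{\sum_{k \neq j} \operatorname{Cov}(X_j, X_k)}
           {\sum_{k \neq j} \operatorname{Cov}^{\max}(X_j, X_k)}
```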


Measurement: Interdisciplinary Research & Perspective | 2014

Clarifying the Conceptualization of Indicators Within Different Models

Jue Wang; George Engelhard; Zhenqiu Lu

The authors of the focus article have emphasized the continuing confusion among some researchers regarding various indicators used in structural equation models (SEMs). Their major claim is that causal indicators are not inherently unstable, and even if they are unstable they are at least not more unstable than other types of indicators. The authors describe effect indicators as conceptually dependent on a latent variable, while causal indicators influence the latent variable. Furthermore, they stress that causal indicators are different from composite indicators and covariates as described recently by Bollen and Bauldry (2011). We agree that “in measurement theory, causal indicators are controversial and little understood” (Bainter & Bollen, this issue, p. 125). For example, Howell, Breivik, and Wilcox (2007a) argued that the latent variables are open to interpretational confounding when causal indicators and other types of indicators are included in SEMs. From an empirical perspective, Bainter and Bollen (this issue) state that structural misspecifications of models (Bollen, 2009) or inappropriate use of causal indicators are the main source of confounding. They suggested that the coefficient instability found in Wilcox et al.’s (2008) and Kim et al.’s (2010) studies may be due to the structural misspecification instead of the causal indicators being inherently unstable. Bainter and Bollen (this issue) conclude that there is “no evidence of unstable causal indicator coefficients across properly specified models” (p. 137). In our commentary, we briefly discuss types of indicators used in SEMs, and suggest additional steps to consider in clarifying both theoretical and empirical conceptualizations of indicators. Two questions guide our comments:


Measurement: Interdisciplinary Research & Perspective | 2015

Involving Diverse Communities of Practice to Minimize Unintended Consequences of Test-Based Accountability Systems

Nadia Behizadeh; George Engelhard

In his focus article, Koretz (this issue) argues that accountability has become the primary function of large-scale testing in the United States. He then points out that tests being used for accountability purposes are flawed and that the high-stakes nature of these tests creates a context that encourages score inflation. Koretz is concerned about what he calls “behavioral responses to testing that might threaten validity, in particular, instructional responses such as inappropriate narrowing of instruction or coaching students to capitalize on incidental attributes of items” (p. 3). Based on this contention, he suggests that current methods of designing, linking, and validating high-stakes assessments must be modified to prevent negative consequences. Specifically, Koretz proposes creating tests that are less predictable and, thus, less likely to encourage narrowing of instruction and coaching. As pointed out in the revised Test Standards (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 2014), assessment systems should be revised in light of unintended consequences. We agree with Koretz (this issue) that assessment and policy contexts are both key factors in the consequential


Educational and Psychological Measurement | 2016

Evaluating Rater Accuracy in Rater-Mediated Assessments Using an Unfolding Model

Jue Wang; George Engelhard; Edward W. Wolfe

The number of performance assessments continues to increase around the world, and it is important to explore new methods for evaluating the quality of ratings obtained from raters. This study describes an unfolding model for examining rater accuracy. Accuracy is defined as the difference between observed and expert ratings. Dichotomous accuracy ratings (0 = inaccurate, 1 = accurate) are unfolded into three latent categories: inaccurate below expert ratings, accurate ratings, and inaccurate above expert ratings. The hyperbolic cosine model (HCM) is used to examine dichotomous accuracy ratings from a statewide writing assessment. This study suggests that HCM is a promising approach for examining rater accuracy, and that the HCM can provide a useful interpretive framework for evaluating the quality of ratings obtained within the context of rater-mediated assessments.
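
For reference, the hyperbolic cosine model for dichotomous responses is typically written in a form like the one below; the symbols are generic HCM notation and may not match the exact parameterization used in the study.

```latex
% General hyperbolic cosine model (Andrich & Luo, 1993) for a dichotomous
% response X_{ni}: beta_n is the location of rater n, delta_i the location
% of rated element i, and rho_i a unit parameter governing the width of the
% "accurate" region. An accurate rating (X = 1) is most probable when
% beta_n is close to delta_i; inaccurate ratings unfold on either side.
P(X_{ni} = 1) = \frac{\exp(\rho_i)}{\exp(\rho_i) + 2\cosh(\beta_n - \delta_i)}
```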


Measurement: Interdisciplinary Research & Perspective | 2014

Game-Based Assessments: A Promising Way to Create Idiographic Perspectives

A. Adrienne Walker; George Engelhard

Assessments embedded within simulations and games represent the cutting edge of next-generation assessments. Almond, Kim, Velasquez, and Shute (Focus article) suggest that a careful consideration of the principles of evidence-centered assessment design can improve game-based assessments. Student information can be gathered, analyzed, and systematically used to substantiate claims and inferences about student academic performance in ways that are fun, engaging, and perhaps even unknown to the students who are being assessed. The purpose of the Focus article is to describe how game “tasks” can be modified so that certain psychometric properties, such as task difficulty and task discrimination, can be identified to support the construct validity of game-based assessments. One of the key ideas of the Focus article is the potential of computerized assessments to collect and, therefore, potentially measure important aspects of student achievement that cannot be easily measured by traditional assessment formats. For example, constructs such as student creativity and persistence are not typically measured in traditional assessments. However, within the format of a game-based assessment, where vast amounts of data on the student are recorded, it is conceivable to measure both cognitive and affective student characteristics. Obstacles regarding the appropriate measurement of cognitive and affective constructs for game-based assessment exist, and they can be addressed to ensure valid measurement of student achievement. The potential to collect data on more than one dimension of student performance has exciting implications for assessment practice. Our commentary argues that game-based assessments using evidence-centered design provide an opportunity to focus on students as multidimensional individuals acting within a particular context. In other words, a more complex and nuanced picture of a student’s academic achievement may be drawn from information collected with computerized game-based assessment. Further, we suggest that the pursuit of construct validation for measuring student performance can be theoretically grounded in both nomothetic and idiographic scientific inquiry.


Measurement: Interdisciplinary Research & Perspective | 2013

Goodness of Model-Data Fit and Invariant Measurement

George Engelhard; Aminah Perkins

Maydeu-Olivares (this issue) presents a framework for evaluating the goodness of model-data fit for item response theory (IRT) models. As he correctly points out, overall goodness-of-fit evaluations of IRT models and data are not generally explored within most applications in educational and psychological measurement. He argues that overall goodness-of-fit (GOF) is not typically evaluated in psychometric applications because accurate p-values are not known for the commonly used GOF statistics, such as the Pearson (χ²) and likelihood ratio (G²) statistics. In the focus article, the author presents new statistics for model-data fit with accurate p-values that are called limited-information statistics. Essentially, Maydeu-Olivares proposes using the information in the lower-level marginals of the contingency tables (univariate and bivariate) to provide more accurate GOF statistics. This strategy is designed to minimize the effects of sparse cells when a large number of test items are being evaluated for overall goodness of model-data fit. In the following commentary, we limit our comments to some historical and philosophical aspects of model-data fit from the perspective of invariant measurement (Engelhard, 2013; Millsap, 2011). Our aim is to selectively consider some of the major points made by Maydeu-Olivares from the perspective of invariant measurement.
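
For concreteness, the full-information statistics referred to above are computed over all C possible response patterns, in their standard forms (not reproduced from the focus article):

```latex
% Full-information fit statistics for N examinees, with observed pattern
% proportions p_c and model-implied probabilities \hat{\pi}_c. With a large
% number of items, C grows exponentially and most patterns go unobserved;
% the limited-information statistics built from univariate and bivariate
% margins are designed to avoid exactly this sparseness.
X^2 = N \sum_{c=1}^{C} \frac{(p_c - \hat{\pi}_c)^2}{\hat{\pi}_c},
\qquad
G^2 = 2N \sum_{c=1}^{C} p_c \ln\!\left(\frac{p_c}{\hat{\pi}_c}\right)
```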

Collaboration


Dive into George Engelhard's collaborations.

Top Co-Authors

Jue Wang, University of Georgia