Network


Latest external research collaborations at the country level.

Hotspot


Research topics in which Amery D. Wu is active.

Publication


Featured research published by Amery D. Wu.


Educational and Psychological Measurement | 2010

The Impact of Outliers on Cronbach’s Coefficient Alpha Estimate of Reliability: Ordinal/Rating Scale Item Responses

Yan Liu; Amery D. Wu; Bruno D. Zumbo

In a recent Monte Carlo simulation study, Liu and Zumbo showed that outliers can severely inflate the estimates of Cronbach’s coefficient alpha for continuous item response data—visual analogue response format. Little, however, is known about the effect of outliers for ordinal item response data—also commonly referred to as Likert, Likert-type, ordered categorical, or ordinal/rating scale item responses. Building on the work of Liu and Zumbo, the authors investigated the effects of outlier contamination for binary and ordinal response scales. Their results showed that coefficient alpha estimates were severely inflated with the presence of outliers, and like the earlier findings, the effects of outliers were reduced with increasing theoretical reliability. The efficiency of coefficient alpha estimates (i.e., sample-to-sample variation) was inflated as well and affected by the number of scale points. It is worth noting that when there were no outliers, the alpha estimates were downward biased because of the ordinal scaling. However, the alpha estimates were, in general, inflated in the presence of outliers leading to positive bias.
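As a rough, self-contained illustration of the mechanism the study describes (a hypothetical sketch, not the authors' simulation design), the following Python snippet generates 5-point ordinal item responses, replaces a small share of respondents with uniformly extreme answer patterns, and compares Cronbach's coefficient alpha before and after contamination; the sample size, contamination rate, and contamination scheme are all assumptions made for illustration.

```python
import numpy as np

def cronbach_alpha(X):
    """Coefficient alpha for an n_persons x n_items matrix of item scores."""
    X = np.asarray(X, dtype=float)
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1)
    total_var = X.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
n, k = 500, 10

# Clean data: one latent trait driving 5-point ordinal (rating scale) items.
theta = rng.normal(size=n)
latent = theta[:, None] + rng.normal(size=(n, k))
items = np.digitize(latent, bins=[-1.5, -0.5, 0.5, 1.5]) + 1  # categories 1..5

# Contaminated data: 5% of respondents answer with the same extreme category
# (all 1s or all 5s) on every item, mimicking outlying response patterns.
contaminated = items.copy()
outlier_idx = rng.choice(n, size=int(0.05 * n), replace=False)
extreme = rng.choice([1, 5], size=outlier_idx.size)
contaminated[outlier_idx] = extreme[:, None]

print("alpha, clean data:   ", round(cronbach_alpha(items), 3))
print("alpha, with outliers:", round(cronbach_alpha(contaminated), 3))
```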


Language Assessment Quarterly | 2015

A Methodology for Zumbo’s Third Generation DIF Analyses and the Ecology of Item Responding

Bruno D. Zumbo; Yan Liu; Amery D. Wu; Benjamin R. Shear; Oscar L. Olvera Astivia; Tavinder K. Ark

Methods for detecting differential item functioning (DIF) and item bias are typically used in the process of item analysis when developing new measures; adapting existing measures for different populations, languages, or cultures; or more generally validating test score inferences. In 2007 in Language Assessment Quarterly, Zumbo introduced the concept of Third Generation DIF. In the current article we introduce a new methodology, latent class logistic regression, for Zumbo’s Third Generation DIF, whose foundation is a novel ecological model of item responding. The ecological model and the new statistical methodology are introduced, and a proof-of-concept is provided, in the context of an example of an international reading test focusing on DIF due to testing language. The new DIF framework is described and contrasted with other methods, Mplus code is provided, and the new method is shown to have potential for application in assessment.
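For readers less familiar with the logistic regression approach to DIF that the new method builds on, the standard binary-item formulation (a textbook form, not the authors' exact Mplus specification) is shown below; in the latent class extension, the observed grouping variable is replaced by estimated latent class membership informed by contextual (ecological) variables.

\[
\operatorname{logit} P(u_{ij} = 1 \mid \theta_j, g_j) \;=\; \beta_0 + \beta_1 \theta_j + \beta_2 g_j + \beta_3 (\theta_j \cdot g_j),
\]

where \(u_{ij}\) is person \(j\)'s response to item \(i\), \(\theta_j\) is the matching variable (e.g., total score), and \(g_j\) is group membership; a nonzero \(\beta_2\) signals uniform DIF and a nonzero \(\beta_3\) signals non-uniform DIF.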


International Journal of Behavioral Development | 2014

A Method to Aid in the Interpretation of EFA Results: An Application of Pratt's Measures.

Amery D. Wu; Bruno D. Zumbo; Sheila K. Marshall

This article describes a method based on Pratt’s measures and demonstrates its use in exploratory factor analyses. The article discusses the interpretational complexities due to factor correlations and how Pratt’s measures resolve these interpretational problems. Two real data examples demonstrate the calculation of what we call the “D matrix,” of which the elements are Pratt’s measures. Focusing on the rows of the D matrix allows one to compare the importance of the factors to the communality of each observed indicator (horizontal interpretation); whereas a focus on the columns of the D matrix allows one to compare the contribution of the indicators to the common variance extracted by each factor (vertical interpretation). The application showed that the method based on Pratt’s measures is a very simple but useful technique for EFA, in particular, for behavioral and developmental constructs, which are often multidimensional and mutually correlated.
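A minimal numeric sketch of the D matrix computation, assuming the usual definition of Pratt's measure for an oblique EFA solution (pattern loading times structure coefficient, divided by the indicator's communality); the loadings and factor correlation below are invented for illustration.

```python
import numpy as np

# Hypothetical oblique EFA solution: pattern loadings P (indicators x factors)
# and factor correlation matrix Phi.
P = np.array([[0.70, 0.10],
              [0.60, 0.20],
              [0.15, 0.65],
              [0.05, 0.75]])
Phi = np.array([[1.00, 0.40],
                [0.40, 1.00]])

S = P @ Phi                 # structure coefficients (indicator-factor correlations)
h2 = np.sum(P * S, axis=1)  # communality of each observed indicator
D = (P * S) / h2[:, None]   # Pratt's measures: the D matrix

print("D matrix:\n", np.round(D, 3))
print("row sums:", np.round(D.sum(axis=1), 3))  # each row sums to 1 (horizontal interpretation)
```

Reading down a column of D compares how much each indicator contributes to the variance captured by that factor (the vertical interpretation described above).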


Educational and Psychological Measurement | 2012

A Demonstration of the Impact of Outliers on the Decisions About the Number of Factors in Exploratory Factor Analysis

Yan Liu; Bruno D. Zumbo; Amery D. Wu

Previous studies have rarely examined the impact of outliers on decisions about the number of factors to extract in an exploratory factor analysis. The few studies that have investigated this issue arrived at contradictory conclusions regarding whether outliers inflated or deflated the number of factors extracted. By systematically inducing outliers and by running computer simulations based on real data, the present study demonstrated how outliers affected decisions about the number of factors to extract under four commonly used and/or recommended decision methods. The studies revealed both inflation and deflation of the number of factors, but the effect depended on (a) the decision method used and (b) the magnitude and amount of outliers, thereby resolving the apparently contradictory conclusions in the previous literature.
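As a toy demonstration of the general phenomenon (not a reproduction of the authors' simulation design), the sketch below induces a few extreme cases into two-factor data and compares how many eigenvalues of the correlation matrix exceed 1 (the Kaiser criterion, one of the commonly used decision rules); all settings are hypothetical, and the direction of the effect will vary with them.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 300, 8

# Two-factor data: items 0-3 load on factor 1, items 4-7 on factor 2.
f = rng.normal(size=(n, 2))
loadings = np.zeros((k, 2))
loadings[:4, 0] = 0.7
loadings[4:, 1] = 0.7
X = f @ loadings.T + rng.normal(scale=0.7, size=(n, k))

def n_factors_kaiser(X):
    """Number of eigenvalues of the correlation matrix greater than 1."""
    eig = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))
    return int(np.sum(eig > 1.0))

# Contaminate a handful of cases with extreme values on a subset of items.
X_out = X.copy()
idx = rng.choice(n, size=10, replace=False)
X_out[idx, :3] += 8.0

print("factors suggested (clean):        ", n_factors_kaiser(X))
print("factors suggested (with outliers):", n_factors_kaiser(X_out))
```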


Journal of Psychoeducational Assessment | 2016

Validation Through Understanding Test-Taking Strategies: An Illustration With the CELPIP-General Reading Pilot Test Using Structural Equation Modeling

Amery D. Wu; Jake E. Stone

This article explores an approach to test score validation that examines test takers' strategies for taking a reading comprehension test. The authors formulated three working hypotheses about score validity, one for each of three types of test-taking strategy (comprehending meaning, test management, and test-wiseness), concerning how use of each strategy relates to performance on specific task types (testlets) and on the test as a whole. The proposed validation method is illustrated with example data from the Canadian English Language Proficiency Index Program-General (CELPIP-General) reading pilot test. The findings were that (a) test takers engaged most in processing the texts for comprehending meaning, less in test-management skills, and least in test-wiseness; (b) at the task level, task characteristics (e.g., difficulty) had implications for test takers' engagement with different types of strategies, which, in turn, led to differences in predicting task performance; and (c) at the test level, higher engagement in comprehending meaning led to higher test performance, engagement in test management showed a small negative association with test performance, and higher engagement in test-wiseness led to poorer performance. The high congruence between the working hypotheses and the empirical results offered plausible evidence supporting the validity of CELPIP-General reading scores. Revisions to both the hypotheses and the research design that might improve the proposed validation method are reviewed in the “Discussion” section.


Archive | 2017

Understanding Test-Taking Strategies for a Reading Comprehension Test via Latent Variable Regression with Pratt’s Importance Measures

Amery D. Wu; Bruno D. Zumbo

This chapter considers how process-based variables, namely test-taking strategies as reported by test takers, can help to explain differences in the outcome of a reading comprehension test and provide process-level evidence of validity. With the process variables as explanatory variables, test takers’ performance was analyzed via a latent variable regression in a structural equation model (SEM), along with Pratt’s importance measures (Pratt, 1987) to assist in understanding the score variation in the latent outcome. We consider how understanding test-taking strategy can help inform test design and validation practices.
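For reference, Pratt's (1987) importance measure for explanatory variable \(j\) in a regression, as commonly defined, partitions the model \(R^2\) using the product of the standardized regression coefficient and the predictor-criterion correlation (the chapter applies this idea to a latent outcome in the SEM, so the exact quantities there differ):

\[
d_j \;=\; \frac{\hat{\beta}_j \, r_{jy}}{R^2}, \qquad \sum_{j} d_j = 1 .
\]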


PLOS ONE | 2016

The Accuracy of Computerized Adaptive Testing in Heterogeneous Populations: A Mixture Item-Response Theory Analysis.

Richard Sawatzky; Pamela A. Ratner; Jacek A. Kopec; Amery D. Wu; Bruno D. Zumbo

Background: Computerized adaptive testing (CAT) utilizes latent variable measurement model parameters that are typically assumed to be equivalently applicable to all people. Biased latent variable scores may be obtained in samples that are heterogeneous with respect to a specified measurement model. We examined the implications of sample heterogeneity with respect to CAT-predicted patient-reported outcome (PRO) scores for the measurement of pain.

Methods: A latent variable mixture modeling (LVMM) analysis was conducted using data collected from a heterogeneous sample of people in British Columbia, Canada, who were administered the 36 pain domain items of the CAT-5D-QOL. The fitted LVMM was then used to produce data for a simulation analysis. We evaluated bias by comparing the referent PRO scores of the LVMM with PRO scores predicted by a “conventional” CAT (ignoring heterogeneity) and an LVMM-based “mixture” CAT (accommodating heterogeneity).

Results: The LVMM analysis indicated support for three latent classes with class proportions of 0.25, 0.30, and 0.45, suggesting that the sample was heterogeneous. The simulation analyses revealed differences between the referent PRO scores and the PRO scores produced by the “conventional” CAT. The “mixture” CAT produced PRO scores that were nearly equivalent to the referent scores.

Conclusion: Bias in PRO scores based on latent variable models may result when population heterogeneity is ignored. Improved accuracy could be obtained by using CATs that are parameterized using LVMM.
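A compact way to write the general form of a latent variable mixture measurement model (a sketch with binary 2PL items for simplicity, not the exact CAT-5D-QOL parameterization) is as a class-weighted item response model:

\[
P(x_{ij} = 1) \;=\; \sum_{c=1}^{C} \pi_c \, \frac{1}{1 + \exp[-a_{ic}(\theta_j - b_{ic})]},
\]

where \(\pi_c\) are the latent class proportions (estimated in the study as 0.25, 0.30, and 0.45) and \(a_{ic}, b_{ic}\) are class-specific item parameters. A “conventional” CAT scores all respondents with a single parameter set, whereas a “mixture” CAT draws on the class-specific parameters.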


Archive | 2017

Putting Flesh on the Psychometric Bone: Making Sense of IRT Parameters in Non-cognitive Measures by Investigating the Social-Cognitive Aspects of the Items

Anita M. Hubley; Amery D. Wu; Yan Liu; Bruno D. Zumbo

This chapter focuses on item response theory (IRT) item parameters as windows into response processes. The purpose of the study was to examine relationships between item parameters and five social-cognitive aspects of items (i.e., wording specificity, availability heuristic, emotional comfort, meaning clarity, and social desirability). IRT parameters were estimated using responses to the Geriatric Depression Scale (GDS) from a sample of 729 men and women. Ratings of the social-cognitive aspects of each GDS item were obtained from a sample of 30 men and women. After testing five 2-, 3-, and 4-parameter logistic (PL) models, a 3-PL model with a-, b-, and d-parameters (i.e., discrimination, difficulty, and upper asymptote) best fit the data. The study findings expand our understanding of the substantive meanings behind IRT parameters, but they also suggest that relationships among IRT parameters and the social-cognitive aspects of items may be more specific to the construct of interest than previously realized.
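For clarity, a 3-PL model with a-, b-, and d-parameters of the kind reported above is typically written with the d-parameter as an upper asymptote (notation assumed here, not quoted from the chapter):

\[
P(x_i = 1 \mid \theta) \;=\; \frac{d_i}{1 + \exp[-a_i(\theta - b_i)]},
\]

so that even respondents with very high standing on the latent trait endorse item \(i\) with probability at most \(d_i \le 1\).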


Archive | 2017

National and International Educational Achievement Testing: A Case of Multi-level Validation Framed by the Ecological Model of Item Responding

Bruno D. Zumbo; Yan Liu; Amery D. Wu; Barry Forer; Benjamin R. Shear

The results of large-scale student assessments are increasingly being used to rank nations, states, and schools and to inform policy decisions. These uses often rely on aggregated student test score data and imply inferences about multilevel constructs. Validating uses and interpretations of these multilevel constructs requires appropriate multilevel validation techniques. This chapter combines multilevel data analysis techniques with an explanatory view of validity to develop explanations of score variation that can be used to evaluate multilevel measurement inferences. We use country-level mathematics scores from the Trends in International Mathematics and Science Study (TIMSS) to illustrate the integration of these techniques. The explanation-focused view of validity, accompanied by the ecological model of item responding, situates conventional response process research in a multilevel construct setting and moves response process studies beyond the traditional focus on individual test-takers’ behaviors.


Frontiers in Education | 2017

Is Difference in Measurement Outcome between Groups Differential Responding, Bias or Disparity? A Methodology for Detecting Bias and Impact from an Attributional Stance

Amery D. Wu; Yan Liu; Jake E. Stone; Danjie Zou; Bruno D. Zumbo

Measurement bias is a crucial concern for test fairness. Impact (a true group difference in the measured scores) is of ultimate interest in many scientific inquiries. This paper revisits and refines the definitions of bias and impact and articulates a conceptual framework that decouples them from differential item functioning. The conditions for showing bias and impact are articulated, and a methodology for empirically detecting them is proposed. The framework and methodology hinge on attributing bias and impact to the studied groups by matching on balance scores (e.g., propensity scores estimated from the confounding covariates). A real-data demonstration comparing two test-language groups on the mathematics items of TIMSS is provided as a proof of concept and a guide for application. In closing, we draw readers’ attention to some caveats and suggestions for adopting this conceptual framework and methodology.
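A minimal sketch of the balance-score matching step at the core of the proposed methodology, using simulated data and off-the-shelf tools; the covariates, the caliper of 0.02, and the greedy 1:1 nearest-neighbor scheme are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 1000

# Hypothetical confounding covariates and a test-language group indicator.
ses = rng.normal(size=n)                   # e.g., socio-economic status
home_lang = rng.binomial(1, 0.5, size=n)   # e.g., language spoken at home
group = rng.binomial(1, 1 / (1 + np.exp(-(0.8 * ses + 0.6 * home_lang))))
score = 50 + 5 * ses + 3 * home_lang + 2 * group + rng.normal(scale=5, size=n)

# 1. Estimate propensity (balance) scores from the confounding covariates.
X = np.column_stack([ses, home_lang])
ps = LogisticRegression().fit(X, group).predict_proba(X)[:, 1]

# 2. Greedy 1:1 nearest-neighbor matching on the propensity score within a caliper.
treated, control = np.where(group == 1)[0], np.where(group == 0)[0]
used, pairs = set(), []
for t in treated:
    dist = np.abs(ps[control] - ps[t])
    j = int(np.argmin(dist))
    if dist[j] < 0.02 and control[j] not in used:
        used.add(control[j])
        pairs.append((t, control[j]))

# 3. The matched mean difference estimates the group effect after removing confounding.
diff = np.mean([score[t] - score[c] for t, c in pairs])
print(f"matched pairs: {len(pairs)}, matched score difference: {diff:.2f}")
```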

Collaboration


An overview of Amery D. Wu's co-author collaborations.

Top Co-Authors

Bruno D. Zumbo, University of British Columbia
Yan Liu, University of British Columbia
Jake E. Stone, University of British Columbia
Anita M. Hubley, University of British Columbia
Anne M. Gadermann, University of British Columbia
Barry Forer, University of British Columbia
Danjie Zou, University of British Columbia
Jacek A. Kopec, University of British Columbia
Linda S. Siegel, University of British Columbia