Amery D. Wu
University of British Columbia
Publications
Featured research published by Amery D. Wu.
Educational and Psychological Measurement | 2010
Yan Liu; Amery D. Wu; Bruno D. Zumbo
In a recent Monte Carlo simulation study, Liu and Zumbo showed that outliers can severely inflate estimates of Cronbach's coefficient alpha for continuous item response data (a visual analogue response format). Little, however, is known about the effect of outliers on ordinal item response data, commonly referred to as Likert, Likert-type, ordered categorical, or ordinal/rating scale item responses. Building on the work of Liu and Zumbo, the authors investigated the effects of outlier contamination for binary and ordinal response scales. Their results showed that coefficient alpha estimates were severely inflated in the presence of outliers and, as in the earlier findings, the effects of outliers were reduced with increasing theoretical reliability. The sample-to-sample variation of the coefficient alpha estimates (i.e., their efficiency) was also inflated and was affected by the number of scale points. It is worth noting that when there were no outliers, the alpha estimates were downward biased because of the ordinal scaling; in the presence of outliers, however, the alpha estimates were generally inflated, leading to positive bias.
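A minimal sketch of the idea behind this study, assuming simulated one-factor Likert-type data rather than the authors' actual simulation design: a small fraction of respondents who answer every item at the same extreme can noticeably inflate Cronbach's alpha. All sample sizes, loadings, and contamination settings below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def cronbach_alpha(X):
    """Cronbach's alpha for an n-by-k matrix of item scores."""
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1)
    total_var = X.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def simulate_ordinal(n, k, loading=0.6, n_points=5):
    """One-factor continuous responses discretized to an ordinal (Likert-type) scale."""
    theta = rng.normal(size=(n, 1))                           # latent trait
    errors = rng.normal(size=(n, k)) * np.sqrt(1 - loading**2)
    continuous = loading * theta + errors
    cuts = np.quantile(continuous, np.linspace(0, 1, n_points + 1)[1:-1])
    return np.digitize(continuous, cuts) + 1                  # scores 1..n_points

X = simulate_ordinal(n=500, k=10)
alpha_clean = cronbach_alpha(X)

# Contaminate 5% of respondents: each answers every item at the same extreme
X_out = X.copy()
idx = rng.choice(len(X_out), size=int(0.05 * len(X_out)), replace=False)
X_out[idx] = rng.choice([1, 5], size=len(idx))[:, None]

print(f"alpha without outliers: {alpha_clean:.3f}")
print(f"alpha with 5% outliers: {cronbach_alpha(X_out):.3f}")
```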
Language Assessment Quarterly | 2015
Bruno D. Zumbo; Yan Liu; Amery D. Wu; Benjamin R. Shear; Oscar L. Olvera Astivia; Tavinder K. Ark
Methods for detecting differential item functioning (DIF) and item bias are typically used in the process of item analysis when developing new measures; adapting existing measures for different populations, languages, or cultures; or, more generally, validating test score inferences. In 2007 in Language Assessment Quarterly, Zumbo introduced the concept of Third Generation DIF. In the current article we introduce latent class logistic regression, a new methodology for Zumbo's Third Generation DIF whose foundation is a novel ecological model of item responding. The ecological model and the new statistical methodology are introduced, and a proof of concept is provided, in the context of an international reading test, focusing on DIF due to testing language. The new DIF framework is described and contrasted with other methods, Mplus code is provided, and the new method is shown to have potential for application in assessment.
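The latent class logistic regression introduced in the article is estimated with mixture models (the authors provide Mplus code). For orientation only, the sketch below shows the conventional logistic-regression DIF step that the new method generalizes, applied to simulated data; the variable names (total, group) and the 2-df likelihood-ratio test are standard practice and are not taken from the article.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1000
total = rng.normal(size=n)                    # matching variable (e.g., rest score)
group = rng.integers(0, 2, size=n)            # 0/1 testing-language group
# Simulate an item with uniform DIF favoring group 1
p = 1 / (1 + np.exp(-(1.2 * total + 0.6 * group - 0.3)))
y = rng.binomial(1, p)

def loglik(design):
    return sm.Logit(y, sm.add_constant(design)).fit(disp=0).llf

ll_base = loglik(total.reshape(-1, 1))                             # matching variable only
ll_dif = loglik(np.column_stack([total, group, total * group]))    # + group and interaction

# 2-df likelihood-ratio test of uniform and non-uniform DIF jointly
lr_chi2 = 2 * (ll_dif - ll_base)
print(f"LR chi-square (2 df) = {lr_chi2:.2f}")
```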
International Journal of Behavioral Development | 2014
Amery D. Wu; Bruno D. Zumbo; Sheila K. Marshall
This article describes a method based on Pratt's measures and demonstrates its use in exploratory factor analysis (EFA). The article discusses the interpretational complexities that arise from factor correlations and how Pratt's measures resolve them. Two real-data examples demonstrate the calculation of what we call the "D matrix," whose elements are Pratt's measures. Focusing on the rows of the D matrix allows one to compare the importance of the factors to the communality of each observed indicator (horizontal interpretation), whereas focusing on the columns allows one to compare the contributions of the indicators to the common variance extracted by each factor (vertical interpretation). The applications showed that the method based on Pratt's measures is a simple but useful technique for EFA, particularly for behavioral and developmental constructs, which are often multidimensional and mutually correlated.
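A minimal sketch of the D-matrix computation described above, assuming a made-up pattern matrix and factor correlation matrix rather than the article's real-data examples: each element is a Pratt's measure, the pattern loading times the structure loading divided by the indicator's communality, so each row sums to 1.

```python
import numpy as np

P = np.array([[0.70, 0.10],     # pattern (regression) loadings: indicators x factors
              [0.65, 0.05],
              [0.15, 0.60],
              [0.05, 0.72]])
Phi = np.array([[1.00, 0.40],   # factor correlation matrix
                [0.40, 1.00]])

S = P @ Phi                     # structure matrix (indicator-factor correlations)
h2 = (P * S).sum(axis=1)        # communalities implied by the oblique solution
D = (P * S) / h2[:, None]       # Pratt's measures; each row sums to 1

print(np.round(D, 3))
# Rows: share of each indicator's communality attributable to each factor
# Columns: relative contributions of the indicators within a factor
```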
Educational and Psychological Measurement | 2012
Yan Liu; Bruno D. Zumbo; Amery D. Wu
Previous studies have rarely examined the impact of outliers on decisions about the number of factors to extract in an exploratory factor analysis. The few studies that have investigated this issue arrived at contradictory conclusions about whether outliers inflate or deflate the number of factors extracted. By systematically inducing outliers and by running computer simulations based on real data, the present study demonstrated how outliers affect decisions about the number of factors to extract under four commonly used and/or recommended decision methods. The studies revealed both inflation and deflation of the number of factors, but the effect depended on (a) the decision method used and (b) the magnitude and amount of the outliers, hence resolving the apparently contradictory conclusions in the previous literature.
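As an illustration of the kind of retention decision the study examines, the sketch below applies Horn's parallel analysis (a commonly used retention method, though not necessarily one of the four evaluated in the article) to simulated two-factor data before and after injecting gross outliers; every setting is made up.

```python
import numpy as np

rng = np.random.default_rng(2)

def parallel_analysis(X, n_sim=200):
    """Number of components whose eigenvalues exceed the mean random eigenvalue."""
    n, p = X.shape
    obs_eig = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]
    rand_eig = np.zeros((n_sim, p))
    for s in range(n_sim):
        R = np.corrcoef(rng.normal(size=(n, p)), rowvar=False)
        rand_eig[s] = np.linalg.eigvalsh(R)[::-1]
    return int((obs_eig > rand_eig.mean(axis=0)).sum())

# Two correlated factors, six indicators each
n, p = 300, 12
F = rng.multivariate_normal([0, 0], [[1, 0.3], [0.3, 1]], size=n)
loadings = np.zeros((p, 2))
loadings[:6, 0] = 0.7
loadings[6:, 1] = 0.7
X = F @ loadings.T + rng.normal(scale=0.7, size=(n, p))

# Inject gross outliers into 5% of the cases
X_out = X.copy()
idx = rng.choice(n, size=int(0.05 * n), replace=False)
X_out[idx] += rng.normal(scale=6.0, size=(len(idx), p))

print("factors retained (clean):   ", parallel_analysis(X))
print("factors retained (outliers):", parallel_analysis(X_out))
```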
Journal of Psychoeducational Assessment | 2016
Amery D. Wu; Jake E. Stone
This article explores an approach to test score validation that examines test takers' strategies for taking a reading comprehension test. The authors formulated three working hypotheses about score validity pertaining to three types of test-taking strategy (comprehending meaning, test management, and test-wiseness). These hypotheses were formulated in terms of the use of the three strategy types and their relationships with performance on specific task types (testlets) and with overall test performance. We illustrated the proposed validation method using example data from the Canadian English Language Proficiency Index Program-General (CELPIP-General) reading pilot test. The findings were that (a) test takers engaged most in processing the texts for comprehending meaning, less in test-management skills, and least in test-wiseness; (b) at the task level, task characteristics (e.g., difficulty) had implications for test takers' engagement with the different strategy types, which, in turn, led to differences in predicting task performance; and (c) at the test level, higher engagement in comprehending meaning led to higher test performance, engagement in test management showed a small negative association with test performance, and higher engagement in test-wiseness led to poorer performance. The high congruence between the working hypotheses and the empirical results offered plausible evidence supporting the validity of CELPIP-General reading scores. Revisions to the hypotheses and research design that might improve the proposed validation method are discussed in the "Discussion" section.
Archive | 2017
Amery D. Wu; Bruno D. Zumbo
This chapter considers how process-based variables, namely test-taking strategies as reported by test-takers, can help explain differences in the outcome of a reading comprehension test and provide process-level evidence of validity. With the process variables as explanatory variables, test-takers' performance was analyzed via a latent variable regression in a structural equation model (SEM), with Pratt's importance measures (Pratt, 1987) used to assist in understanding the score variation in the latent outcome. We consider how understanding test-taking strategies can help inform test design and validation practices.
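A hedged sketch of the Pratt partitioning used in the chapter, applied here to an ordinary least-squares regression on simulated observed scores rather than the chapter's latent variable regression in SEM; the partitioning logic (standardized weight times zero-order correlation, divided by R squared) is the same. Variable names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
# Three (standardized) strategy scores as predictors of a reading outcome
strategies = rng.multivariate_normal(np.zeros(3),
                                     [[1, .4, .2], [.4, 1, .3], [.2, .3, 1]],
                                     size=n)
outcome = strategies @ np.array([0.5, 0.2, -0.1]) + rng.normal(scale=0.8, size=n)

Z = (strategies - strategies.mean(0)) / strategies.std(0)
y = (outcome - outcome.mean()) / outcome.std()

beta, *_ = np.linalg.lstsq(Z, y, rcond=None)   # standardized regression weights
r = Z.T @ y / len(y)                           # zero-order correlations
R2 = beta @ r                                  # explained variance
pratt = beta * r / R2                          # Pratt's measures; they sum to 1

print(np.round(pratt, 3), "sum =", round(pratt.sum(), 3))
```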
PLOS ONE | 2016
Richard Sawatzky; Pamela A. Ratner; Jacek A. Kopec; Amery D. Wu; Bruno D. Zumbo
Background: Computerized adaptive testing (CAT) utilizes latent variable measurement model parameters that are typically assumed to apply equivalently to all people. Biased latent variable scores may be obtained in samples that are heterogeneous with respect to a specified measurement model. We examined the implications of sample heterogeneity with respect to CAT-predicted patient-reported outcome (PRO) scores for the measurement of pain.
Methods: A latent variable mixture modeling (LVMM) analysis was conducted using data collected from a heterogeneous sample of people in British Columbia, Canada, who were administered the 36 pain domain items of the CAT-5D-QOL. The fitted LVMM was then used to produce data for a simulation analysis. We evaluated bias by comparing the referent PRO scores of the LVMM with PRO scores predicted by a "conventional" CAT (ignoring heterogeneity) and an LVMM-based "mixture" CAT (accommodating heterogeneity).
Results: The LVMM analysis indicated support for three latent classes with class proportions of 0.25, 0.30, and 0.45, which suggests that the sample was heterogeneous. The simulation analyses revealed differences between the referent PRO scores and the PRO scores produced by the "conventional" CAT. The "mixture" CAT produced PRO scores that were nearly equivalent to the referent scores.
Conclusion: Bias in PRO scores based on latent variable models may result when population heterogeneity is ignored. Improved accuracy could be obtained by using CATs that are parameterized using LVMM.
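For orientation only, the sketch below shows a bare-bones "conventional" CAT loop for a 2PL item bank with maximum-information item selection and EAP scoring, ignoring population heterogeneity; a "mixture" CAT of the kind evaluated in the article would additionally estimate class membership before scoring. Item parameters and the respondent are simulated, not the CAT-5D-QOL pain items.

```python
import numpy as np

rng = np.random.default_rng(4)
n_items = 36
a = rng.uniform(0.8, 2.0, n_items)       # discrimination
b = rng.normal(0.0, 1.0, n_items)        # difficulty

def p2pl(theta, a, b):
    return 1 / (1 + np.exp(-a * (theta - b)))

grid = np.linspace(-4, 4, 121)           # quadrature grid for EAP scoring
prior = np.exp(-0.5 * grid**2)           # standard normal prior (unnormalized)
true_theta = 0.8                         # simulated respondent

posterior = prior.copy()
administered = []
for _ in range(10):                      # fixed-length CAT, 10 items
    theta_hat = (grid * posterior).sum() / posterior.sum()
    info = a**2 * p2pl(theta_hat, a, b) * (1 - p2pl(theta_hat, a, b))
    info[administered] = -np.inf         # do not readminister items
    j = int(np.argmax(info))             # maximum-information selection
    x = int(rng.random() < p2pl(true_theta, a[j], b[j]))
    administered.append(j)
    like = p2pl(grid, a[j], b[j])**x * (1 - p2pl(grid, a[j], b[j]))**(1 - x)
    posterior *= like                    # update posterior over theta

print("EAP estimate:", round((grid * posterior).sum() / posterior.sum(), 2))
```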
Archive | 2017
Anita M. Hubley; Amery D. Wu; Yan Liu; Bruno D. Zumbo
This chapter focuses on item response theory (IRT) item parameters as windows into response processes. The purpose of the study was to examine relationships between item parameters and five social-cognitive aspects of items (i.e., wording specificity, availability heuristic, emotional comfort, meaning clarity, and social desirability). IRT parameters were estimated using responses to the Geriatric Depression Scale (GDS) from a sample of 729 men and women. Ratings of the social-cognitive aspects of each GDS item were obtained from a sample of 30 men and women. After testing five 2-, 3-, and 4-parameter logistic (PL) models, a 3-PL model with a-, b-, and d-parameters (i.e., discrimination, difficulty, and upper asymptote) best fit the data. The study findings expand our understanding of the substantive meanings behind IRT parameters but also suggest that relationships between IRT parameters and the social-cognitive aspects of items may be more specific to the construct of interest than previously realized.
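A small sketch of the item response function for the retained model, under the parameterization the chapter describes (discrimination a, difficulty b, upper asymptote d): the d-parameter caps the endorsement probability below 1 even at high trait levels. The parameter values below are illustrative, not GDS estimates.

```python
import numpy as np

def irf_upper_asymptote(theta, a, b, d):
    """P(endorse) = d / (1 + exp(-a * (theta - b)))."""
    return d / (1 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)
print(np.round(irf_upper_asymptote(theta, a=1.5, b=0.0, d=0.85), 3))
# As theta grows, the curve approaches d = 0.85 rather than 1.0
```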
Archive | 2017
Bruno D. Zumbo; Yan Liu; Amery D. Wu; Barry Forer; Benjamin R. Shear
The results of large-scale student assessments are increasingly being used to rank nations, states, and schools and to inform policy decisions. These uses often rely on aggregated student test score data and imply inferences about multilevel constructs. Validating uses and interpretations of these multilevel constructs requires appropriate multilevel validation techniques. This chapter combines multilevel data analysis techniques with an explanatory view of validity to develop explanations of score variation that can be used to evaluate multilevel measurement inferences. We use country-level mathematics scores from the Trends in International Mathematics and Science Study (TIMSS) to illustrate the integration of these techniques. The explanation-focused view of validity, accompanied by the ecological model of item responding, situates conventional response-process research in a multilevel construct setting and moves response-process studies beyond the traditional focus on individual test-takers' behaviors.
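As a minimal illustration of the multilevel step, assuming simulated data rather than TIMSS, the sketch below fits a random-intercept model and reports the intraclass correlation, the share of score variance that lies between countries.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n_countries, n_students = 30, 200
country_effects = rng.normal(scale=0.5, size=n_countries)   # between-country variation
rows = []
for c in range(n_countries):
    scores = 500 + 100 * (country_effects[c] + rng.normal(size=n_students))
    rows.append(pd.DataFrame({"country": c, "score": scores}))
data = pd.concat(rows, ignore_index=True)

# Random-intercept (variance components) model: score ~ 1 + (1 | country)
mdf = smf.mixedlm("score ~ 1", data, groups=data["country"]).fit()
var_between = float(mdf.cov_re.iloc[0, 0])
var_within = mdf.scale
icc = var_between / (var_between + var_within)
print(f"ICC (share of score variance between countries): {icc:.2f}")
```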
Frontiers in Education | 2017
Amery D. Wu; Yan Liu; Jake E. Stone; Danjie Zou; Bruno D. Zumbo
Measurement bias is a crucial concern for test fairness, and impact (a true group difference in the measured scores) is of central interest in many scientific inquiries. This paper revisits and refines the definitions of bias and impact and articulates a conceptual framework that decouples them from differential item functioning. The conditions for demonstrating bias and impact are articulated, and a methodology for empirically detecting them is proposed. The framework and methodology hinge on attributing bias and impact to the studied groups by matching on balance scores (e.g., propensity scores estimated from the confounding covariates). A real-data demonstration comparing two test-language groups on the mathematics items of TIMSS is provided as a proof of concept and a guide for application. In closing, we draw readers' attention to some caveats and suggestions for adopting this conceptual framework and methodology.
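A hedged sketch of the balance-score idea described above, using simulated data in place of the TIMSS example: propensity scores are estimated from hypothetical confounding covariates, respondents are stratified on them, and the two groups are compared within strata. In this simulation there is no true impact, so the stratum-adjusted differences should hover near zero while the unadjusted difference does not.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 2000
covariates = rng.normal(size=(n, 3))   # e.g., SES, prior achievement, age (hypothetical)
group = rng.binomial(1, 1 / (1 + np.exp(-(covariates @ [0.8, 0.5, 0.2]))))
# Item score depends on the covariates only: no true impact of group
item_score = rng.binomial(1, 1 / (1 + np.exp(-(covariates @ [0.6, 0.6, 0.1]))))

# Balance score: propensity of group membership given the confounders
ps = sm.Logit(group, sm.add_constant(covariates)).fit(disp=0).predict()
strata = np.digitize(ps, np.quantile(ps, [0.2, 0.4, 0.6, 0.8]))

unadjusted = item_score[group == 1].mean() - item_score[group == 0].mean()
diffs = [item_score[(strata == s) & (group == 1)].mean()
         - item_score[(strata == s) & (group == 0)].mean()
         for s in range(5)]

print(f"unadjusted group difference: {unadjusted:.3f}")
print("within-stratum differences: ", np.round(diffs, 3))
```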