Wen-Chung Wang
University of Hong Kong
Publications
Featured research published by Wen-Chung Wang.
Educational and Psychological Measurement | 2009
Wen-Chung Wang; Ching-Lin Shih; Chih-Chien Yang
This study incorporates a scale purification procedure into the standard MIMIC method for differential item functioning (DIF) detection and assesses its performance through a series of simulations. It is found that the MIMIC method with scale purification (denoted as M-SP) outperforms the standard MIMIC method (denoted as M-ST) in controlling false-positive rates and yielding higher true-positive rates. Only when the DIF pattern is balanced between groups or when there is a small percentage of DIF items in the test does M-ST perform as well as M-SP. Moreover, both methods yield a higher true-positive rate under the two-parameter logistic model than under the three-parameter model. M-SP is preferable to M-ST, because DIF patterns in real tests are unlikely to be perfectly balanced and the percentages of DIF items may not be small.
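The purification idea can be sketched as a fixed-point loop: items flagged as DIF are dropped from the matching (anchor) set, the remaining items are re-tested, and the loop stops when the flagged set no longer changes. The sketch below is a minimal illustration of that loop, not the authors' MIMIC implementation; `true_shift` and `flag_dif` are hypothetical stand-ins for a real MIMIC DIF test.

```python
def purify(items, flag_dif, max_iter=20):
    """Iteratively re-assess DIF, anchoring on only the currently
    clean items, until the flagged set stabilizes."""
    flagged = set()
    for _ in range(max_iter):
        anchor = [i for i in items if i not in flagged]
        new_flagged = {i for i in items if flag_dif(i, anchor)}
        if new_flagged == flagged:
            break
        flagged = new_flagged
    return flagged

# Toy "DIF statistic": each item's true between-group difficulty shift.
true_shift = {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.7, 6: 1.2}

def flag_dif(item, anchor):
    # Flag an item when its shift deviates from the anchor's average
    # shift by more than 0.5 logits.
    baseline = sum(true_shift[a] for a in anchor) / len(anchor)
    return abs(true_shift[item] - baseline) > 0.5
```

In this toy setup a single unpurified pass flags only item 6, because the two DIF items contaminate the baseline; purification removes item 6 from the anchor, which then exposes item 5 as well.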
Educational and Psychological Measurement | 2010
Yeh-Tai Chou; Wen-Chung Wang
Dimensionality is an important assumption in item response theory (IRT). Principal component analysis on standardized residuals has been used to check dimensionality, especially under the family of Rasch models. It has been suggested that a first eigenvalue greater than 1.5 signifies a violation of unidimensionality when there are 500 persons and 30 items. The cut-point of 1.5 is often used beyond this specific condition of sample size and test length. This study argues that a fixed cut-point is not applicable because the distribution of eigenvalues or their ratios depends on sample size and test length, just like other statistics. The authors conducted a series of simulations to verify this argument. They then proposed three chi-square statistics for multivariate independence to test the correlation matrix obtained from the standardized residuals. Through simulations, it was found that Steiger’s statistic behaved fairly like a chi-square distribution when its degrees of freedom were adjusted.
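The core claim — that the null distribution of the first eigenvalue shifts with sample size and test length — can be checked with a small pure-Python simulation. This is a hedged sketch that uses independent normal deviates as a stand-in for standardized residuals under unidimensionality; it is not the authors' simulation design.

```python
import random

def corr_matrix(data):
    """Pearson correlation matrix of the columns of data (rows = persons)."""
    n = len(data)
    cols = list(zip(*data))
    means = [sum(c) / n for c in cols]
    sds = [(sum((x - m) ** 2 for x in c) / n) ** 0.5
           for c, m in zip(cols, means)]
    p = len(cols)
    return [[sum((cols[j][i] - means[j]) * (cols[k][i] - means[k])
                 for i in range(n)) / (n * sds[j] * sds[k])
             for k in range(p)] for j in range(p)]

def largest_eigenvalue(mat, iters=100):
    """Power iteration; applicable because a correlation matrix is
    symmetric positive semi-definite, so the dominant eigenvalue is
    real and non-negative."""
    p = len(mat)
    v = [1.0] * p
    lam = 1.0
    for _ in range(iters):
        w = [sum(mat[j][k] * v[k] for k in range(p)) for j in range(p)]
        lam = max(abs(x) for x in w)
        v = [x / lam for x in w]
    return lam

def mean_first_eigenvalue(n_persons, n_items, reps=3, seed=1):
    """Average first eigenvalue over `reps` pure-noise data sets,
    i.e., the null case of perfectly unidimensional residuals."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        data = [[rng.gauss(0.0, 1.0) for _ in range(n_items)]
                for _ in range(n_persons)]
        total += largest_eigenvalue(corr_matrix(data))
    return total / reps
```

With 30 items, the null first eigenvalue lands near the conventional 1.5 at 500 persons but well above 2 at 100 persons (roughly (1 + sqrt(items/persons))^2 by random-matrix heuristics), which is exactly why a fixed cut-point cannot travel across conditions.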
Educational and Psychological Measurement | 2009
Ying-Yao Cheng; Wen-Chung Wang; Yi-Hui Ho
Educational and psychological tests are often composed of multiple short subtests, each measuring a distinct latent trait. Unfortunately, short subtests suffer from low measurement precision, which makes the bandwidth-fidelity dilemma inevitable. In this study, the authors demonstrate how a multidimensional Rasch analysis can be employed to take into account the information about the correlation between latent traits such that the precision of each subtest measure can be improved and the correlation between latent traits can be accurately estimated. A real data set of the 13-scale Thinking Styles Inventory was analyzed with the traditional unidimensional approach and the multidimensional approach. The results demonstrate that in contrast to the unidimensional approach, the multidimensional approach yields a much higher level of measurement precision and a more appropriate estimate for the correlation between thinking styles. In conclusion, even short subtests can yield highly precise measures such that the bandwidth-fidelity dilemma is resolved.
Educational and Psychological Measurement | 2012
Wen-Chung Wang; Ching-Lin Shih; Guo-Wei Sun
The DIF-free-then-DIF (DFTD) strategy consists of two steps: (a) select a set of items that are the most likely to be free of differential item functioning (DIF) and (b) assess the other items for DIF using the designated items as anchors. The rank-based method together with the computer software IRTLRDIF can select a set of DIF-free polytomous items very accurately, but it loses accuracy when tests contain many DIF items. To resolve this problem, the authors developed a new method by adding a scale purification procedure to the rank-based method and conducted two simulation studies to evaluate its performance in DIF assessment. It was found that the new method outperformed the rank-based method in identifying DIF-free items, especially when the tests contained many DIF items. In addition, the new method, combined with the DFTD strategy, yielded a well-controlled Type I error rate and a high power rate of DIF detection. In contrast, conventional DIF assessment methods yielded an inflated Type I error rate and a deflated power rate when the tests contained many DIF items favoring the same group. In conclusion, the simulation results support the new method and the DFTD strategy in DIF assessment.
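Step (a) of the DFTD strategy can be illustrated with a toy rank-based selection: rank items by the absolute size of a provisional DIF statistic, keep the smallest-ranked items as anchors, then test the rest against the anchor baseline. The statistic values below are hypothetical; the actual method computes likelihood-ratio DIF statistics with IRTLRDIF and, in the new method, adds a purification step.

```python
def dftd(stats, n_anchor=4, cut=0.5):
    """(a) Take the n_anchor items with the smallest provisional DIF
    statistics as the anchor set; (b) flag every other item whose
    statistic deviates from the anchor baseline by more than `cut`."""
    ranked = sorted(stats, key=lambda i: abs(stats[i]))
    anchor = ranked[:n_anchor]
    baseline = sum(stats[a] for a in anchor) / n_anchor
    return {i for i in stats
            if i not in anchor and abs(stats[i] - baseline) > cut}

# Hypothetical provisional DIF statistics for a six-item test.
stats = {"i1": 0.05, "i2": -0.10, "i3": 0.02,
         "i4": 0.08, "i5": 0.90, "i6": 1.10}
```

Because the anchor is restricted to the four items least suspected of DIF, the baseline stays near zero even though a third of the test (i5, i6) favors one group — the contamination problem that inflates Type I error when every item serves as its own anchor.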
Educational and Psychological Measurement | 2015
Wen-Chung Wang; Hui-Fang Chen; Kuan-Yu Jin
Many scales contain both positively and negatively worded items. Reverse recoding of negatively worded items might not be enough for them to function as positively worded items do. In this study, we commented on the drawbacks of existing approaches to the wording effect in mixed-format scales and used bi-factor item response theory (IRT) models to test the assumption of reverse coding and evaluate the magnitude of the wording effect. The parameters of the bi-factor IRT models can be estimated with existing computer programs. Two empirical examples from the Program for International Student Assessment and the Trends in International Mathematics and Science Study were given to demonstrate the advantages of the bi-factor approach over traditional ones. It was found that the wording effect in these two data sets was substantial and that ignoring the wording effect resulted in overestimated test reliability and biased person measures.
Educational and Psychological Measurement | 2013
Hung-Yu Huang; Wen-Chung Wang
Both testlet design and hierarchical latent traits are fairly common in educational and psychological measurements. This study aimed to develop a new class of higher order testlet response models that consider both local item dependence within testlets and a hierarchy of latent traits. Due to high dimensionality, the authors adopted the Bayesian approach implemented in the WinBUGS freeware for parameter estimation. A series of simulations were conducted to evaluate parameter recovery, consequences of model misspecification, and effectiveness of model–data fit statistics. Results show that the parameters of the new models can be recovered well. Ignoring the testlet effect led to a biased estimation of item parameters, underestimation of factor loadings, and overestimation of test reliability for the first-order latent traits. The Bayesian deviance information criterion and the posterior predictive model checking were helpful for model comparison and model–data fit assessment. Two empirical examples of ability tests and nonability tests are given.
Applied Psychological Measurement | 2010
Wen-Chung Wang; Kuan-Yu Jin
In this study, the advantages of slope parameters, random weights, and latent regression are combined when dealing with component and composite items: slope parameters and random weights are added to the standard item response model with internal restrictions on item difficulty, and the new model is formulated within a multilevel framework in which Level 2 predictors account for variation in the latent trait. The resulting model is a nonlinear mixed model (NLMM), so existing parameter estimation procedures and computer packages for NLMMs can be directly adopted to estimate the parameters. Through simulations, it was found that the SAS NLMIXED procedure could recover the parameters in the new model fairly well and produce appropriate standard errors. To illustrate applications of the new model, a real data set pertaining to guilt was analyzed with gender as a Level 2 predictor. Further model generalization is discussed.
Applied Psychological Measurement | 2013
Wen-Chung Wang; Chen-Wei Liu; Shiu-Lien Wu
The random-threshold generalized unfolding model (RTGUM) was developed by treating the thresholds in the generalized unfolding model as random effects rather than fixed effects to account for the subjective nature of the selection of categories in Likert items. The parameters of the new model can be estimated with the JAGS (Just Another Gibbs Sampler) freeware, which adopts a Bayesian approach for estimation. A series of simulations was conducted to evaluate the parameter recovery of the new model and the consequences of ignoring the randomness in thresholds. The results showed that the parameters of RTGUM were recovered fairly well and that ignoring the randomness in thresholds led to biased estimates. Computerized adaptive testing was also implemented on RTGUM, where the Fisher information criterion was used for item selection and the maximum a posteriori method was used for ability estimation. The simulation study showed that the longer the test length, the smaller the randomness in thresholds, and the more categories in an item, the more precise the ability estimates would be.
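The adaptive loop described here — select the unused item with maximum Fisher information at the current estimate, administer it, re-estimate ability by maximum a posteriori (MAP) — can be sketched generically. Because the RTGUM likelihood is involved, the sketch below substitutes a simple two-parameter logistic response function as a stand-in; the item bank and all numeric values are hypothetical, but the CAT machinery is the same in outline.

```python
import math
import random

def prob(theta, a, b):
    """2PL response probability (stand-in for the RTGUM likelihood)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_info(theta, a, b):
    p = prob(theta, a, b)
    return a * a * p * (1.0 - p)

def map_estimate(responses, items):
    """MAP under a N(0,1) prior, by grid search over theta in [-4, 4]."""
    grid = [g / 20.0 for g in range(-80, 81)]
    def log_post(t):
        lp = -0.5 * t * t  # log of the standard normal prior, up to a constant
        for (a, b), u in zip(items, responses):
            p = prob(t, a, b)
            lp += math.log(p if u else 1.0 - p)
        return lp
    return max(grid, key=log_post)

def run_cat(bank, true_theta, test_length, rng):
    theta, used, items, resp = 0.0, set(), [], []
    for _ in range(test_length):
        # Select the unused item with maximum Fisher information
        # at the current ability estimate.
        j = max((i for i in range(len(bank)) if i not in used),
                key=lambda i: fisher_info(theta, *bank[i]))
        used.add(j)
        u = 1 if rng.random() < prob(true_theta, *bank[j]) else 0
        items.append(bank[j])
        resp.append(u)
        theta = map_estimate(resp, items)
    return theta
```

A bank such as `[(1.2, -2.0 + 0.1 * k) for k in range(41)]` spreads difficulties evenly, so the loop quickly homes in on items near the provisional estimate — the mechanism behind the finding that longer tests and better-targeted items sharpen the ability estimates.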
Educational and Psychological Measurement | 2011
Wen-Chung Wang; Sheng-Yun Huang
The one-parameter logistic model with ability-based guessing (1PL-AG) has been recently developed to account for the effect of ability on guessing behavior in multiple-choice items. In this study, the authors developed algorithms for computerized classification testing under the 1PL-AG and conducted a series of simulations to evaluate their performance. Four item selection methods (the Fisher information, the Fisher information with a posterior distribution, the progressive method, and the adjusted progressive method) and two termination criteria (the ability confidence interval [ACI] method and the sequential probability ratio test [SPRT]) were developed. In addition, the Sympson–Hetter online method with freeze (SHOF) was implemented for item exposure control. Major results include the following: (a) when no item exposure control was made, all the four item selection methods yielded very similar correct classification rates, but the Fisher information method had the worst item bank usage and the highest item exposure rate; (b) SHOF can successfully maintain the item exposure rate at a prespecified level, without substantially compromising accuracy and efficiency in classification; (c) once SHOF was implemented, all the four methods performed almost identically; (d) ACI appeared to be slightly more efficient than SPRT; and (e) in general, a higher weight of ability in guessing led to a slightly higher accuracy and efficiency, and a lower forced classification rate.
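The SPRT termination criterion is Wald's sequential test applied to the classification cut score: after each response, the log-likelihood ratio of "just above the indifference region" versus "just below it" is updated and compared with two fixed boundaries. Below is a minimal sketch with a plain Rasch model standing in for the 1PL-AG (the ability-based guessing term is omitted for brevity); `delta` is the half-width of the indifference region, and `alpha`/`beta` are the tolerated error rates.

```python
import math

def p_correct(theta, b):
    """Rasch probability of a correct response to an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def sprt_update(llr, u, b, cut, delta=0.5, alpha=0.05, beta=0.05):
    """Add one response (u = 1 correct, 0 incorrect) to Wald's
    log-likelihood ratio and test it against the two stopping bounds."""
    p_hi = p_correct(cut + delta, b)   # likelihood under "master"
    p_lo = p_correct(cut - delta, b)   # likelihood under "non-master"
    llr += math.log(p_hi / p_lo) if u else math.log((1 - p_hi) / (1 - p_lo))
    if llr >= math.log((1 - beta) / alpha):
        return llr, "master"
    if llr <= math.log(beta / (1 - alpha)):
        return llr, "non-master"
    return llr, "continue"
```

With the defaults here, each correct answer on an item located at the cut score adds log(p_hi/p_lo) = 0.5 to the ratio, and the upper boundary is log(19) ≈ 2.94, so six straight correct answers trigger a "master" decision — an illustration of why SPRT test lengths vary from examinee to examinee.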
Educational and Psychological Measurement | 2011
Wen-Chung Wang; Chen-Wei Liu
The generalized graded unfolding model (GGUM) has been recently developed to describe item responses to Likert items (agree-disagree) in attitude measurement. In this study, the authors (a) developed two item selection methods in computerized classification testing under the GGUM, the current estimate/ability confidence interval method and the cut score/sequential probability ratio test method and (b) evaluated their accuracy and efficiency in classification through simulations. The results indicated that both methods were very accurate and efficient. The more points each item had and the fewer the classification categories, the more accurate and efficient the classification would be. However, the latter method may yield a very low accuracy in dichotomous items with a short maximum test length. Thus, if it is to be used to classify examinees with dichotomous items, the maximum test length should be increased.