Youngsuk Suh
Rutgers University
Publication
Featured research published by Youngsuk Suh.
Measurement: Interdisciplinary Research & Perspective | 2009
Michael T. Kane; Andrew A. Mroch; Youngsuk Suh; Douglas R. Ripkey
This paper analyzes five linear equating models for the nonequivalent groups with anchor test (NEAT) design with internal anchors (i.e., the anchor test is part of the full test). The analysis employs a two-dimensional framework. The first dimension contrasts two general approaches to developing the equating relationship. Under a parameter substitution (PS) approach, estimates of the means and variances for the two tests for some target population are substituted into a generic equating formula; under a chained linear relationship (CLR) approach, expressions for the anchor test scores as functions of total test scores for each of the test forms are simply “equated” to each other. In order to implement either of these approaches, some relationships must be assumed invariant across the groups. The second dimension involves three different choices for the invariant relationships, the regressions of test scores (X or Y) on anchor scores (V), the regression of anchor scores on test scores, or a basic scaling/equating relationship between anchor scores and test scores. If we adopt a scaling/equating relationship of Y with V and X with V as the invariant relationship, the resulting equating relationship is the same for the PS and CLR approaches. So, five distinct regression models yielding five different equating relationships are developed within the two-dimensional framework. The equating relationships for the Tucker, Chained Linear, and Levine Observed-score methods are derived under the PS approach. The equating relationships for the Levine True-score, Chained Linear, and a Tucker-like method (Angoff Design V) are derived under the CLR approach.
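To make the two approaches concrete, the display below gives one standard way of writing the pieces involved, using generic linear equating notation rather than the paper's own symbols. Under a PS approach, estimated moments for a target population T are substituted into the generic linear equating formula

\[
l_Y(x) = \mu_Y(T) + \frac{\sigma_Y(T)}{\sigma_X(T)}\bigl(x - \mu_X(T)\bigr),
\]

whereas a chained linear relationship composes two observed linking functions estimated in the groups that actually took each form,

\[
l_V(x) = \mu_{V1} + \frac{\sigma_{V1}}{\sigma_{X1}}(x - \mu_{X1}), \qquad
l_Y(v) = \mu_{Y2} + \frac{\sigma_{Y2}}{\sigma_{V2}}(v - \mu_{V2}), \qquad
l_Y^{\mathrm{chain}}(x) = l_Y\bigl(l_V(x)\bigr),
\]

where subscripts 1 and 2 index the groups taking forms X and Y, respectively.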
Applied Psychological Measurement | 2014
Youngsuk Suh; Sun-Joo Cho
The performance of χ2 difference tests based on limited information estimation methods has not been extensively examined for differential functioning, particularly in the context of multidimensional item response theory (MIRT) models. Chi-square tests for detecting differential item functioning (DIF) and global differential item functioning (GDIF) in an MIRT model were conducted using two robust weighted least squares estimators, mean- and variance-adjusted weighted least squares (WLSMV) and mean-adjusted weighted least squares (WLSM), and the results were evaluated in terms of Type I error rates and rejection rates. The present study demonstrated systematic test procedures for detecting different types of GDIF and DIF in multidimensional tests. For the χ2 tests for detecting GDIF, WLSM tended to produce inflated Type I error rates under small sample size conditions, whereas WLSMV appeared to yield error rates lower than the expected value on average. In addition, WLSM produced higher rejection rates than WLSMV. For the χ2 tests for detecting DIF, WLSMV appeared to yield somewhat higher rejection rates than WLSM for all DIF tests except the omnibus test, and the error rates for both estimators were close to the expected value on average.
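As a point of reference, the χ2 difference test underlying both the GDIF and DIF procedures can be sketched in generic notation as

\[
\Delta\chi^2 = \chi^2_{\text{constrained}} - \chi^2_{\text{unconstrained}}, \qquad
\Delta df = df_{\text{constrained}} - df_{\text{unconstrained}},
\]

with differential functioning flagged when \(\Delta\chi^2\) exceeds the critical value of a χ2 distribution with \(\Delta df\) degrees of freedom. Note that for scaled estimators such as WLSM and WLSMV, the naive difference of the two adjusted statistics is not itself χ2 distributed and must be replaced by an appropriately corrected difference (e.g., the DIFFTEST procedure in Mplus); this remark is general background rather than a description of the article's exact computations.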
Measurement: Interdisciplinary Research & Perspective | 2009
Andrew A. Mroch; Youngsuk Suh; Michael T. Kane; Douglas R. Ripkey
This study uses the results of two previous papers (Kane, Mroch, Suh, & Ripkey, this issue; Suh, Mroch, Kane, & Ripkey, this issue) and the literature on linear equating to evaluate five linear equating methods along several dimensions, including the plausibility of their assumptions and their levels of bias and root mean squared difference (RMSD). The methods all employ the nonequivalent groups with anchor test (NEAT) design but make different assumptions about the empirical relationship to be generalized across groups. The analyses indicate that the assumptions employed in the Levine Observed-score and Levine True-score methods are more plausible than those for the Tucker, Tucker-like, and Chained Linear methods, and that the Levine methods generally have lower levels of bias and RMSD than the other three methods. Furthermore, methods that employ a chained linear relationship (CLR) approach, in which observed relationships between total test scores and anchor test scores are generalized across the groups taking the various forms, are found to be more consistent with programs in which a series of test forms administered over a period of years are equated to each other than is a parameter substitution (PS) approach, which estimates results for specific synthetic populations. It is argued that the Levine Observed-score and Levine True-score methods have strong advantages over the other methods studied unless the groups taking the tests to be equated are known to be very similar.
Measurement: Interdisciplinary Research & Perspective | 2009
Youngsuk Suh; Andrew A. Mroch; Michael T. Kane; Douglas R. Ripkey
In this study, a database containing the responses of 40,000 candidates to 90 multiple-choice questions was used to mimic data sets for 50-item tests under the nonequivalent groups with anchor test (NEAT) design. Using these smaller data sets, we evaluated the performance of five linear equating methods for the NEAT design with five levels of group differences and five levels of form differences. The completeness of the database, with all 40,000 candidates answering all 90 items, allowed us to use a linear equating relationship based on a single-group design for the full database as the criterion for evaluating the equating methods with respect to bias, root mean squared difference (RMSD), and differences associated with slopes. All five methods worked well when groups were similar, regardless of form differences, and the Levine methods showed smaller levels of bias and RMSD than the other methods across a range of group differences. Overall, the Levine methods were most effective in controlling for group differences and, in particular, were most accurate when group differences were moderate to large.
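As a rough illustration of the kind of evaluation described above, the sketch below is hypothetical code (not taken from the study; all numeric values and weights are placeholders). It computes a linear equating function and summarizes its agreement with a criterion function over the score scale in terms of weighted bias and RMSD.

```python
import numpy as np

def linear_equate(x, mu_x, sigma_x, mu_y, sigma_y):
    """Put form-X scores on the form-Y scale via linear equating."""
    return mu_y + (sigma_y / sigma_x) * (x - mu_x)

# Hypothetical inputs: score points on a 50-item test, a criterion equating
# function, a method's estimated equating function, and score-point weights.
scores = np.arange(0, 51)
criterion = linear_equate(scores, 25.0, 8.0, 26.0, 7.5)   # placeholder criterion
estimate = linear_equate(scores, 25.0, 8.0, 26.5, 7.2)    # placeholder method result
weights = np.full(scores.size, 1.0 / scores.size)          # uniform weights for illustration

bias = np.sum(weights * (estimate - criterion))
rmsd = np.sqrt(np.sum(weights * (estimate - criterion) ** 2))
print(f"bias = {bias:.3f}, RMSD = {rmsd:.3f}")
```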
Multivariate Behavioral Research | 2013
Su-Young Kim; Youngsuk Suh; Jee-Seon Kim; Mark A. Albanese; Michelle M. Langer
Latent variable models with many categorical items and multiple latent constructs result in many dimensions of numerical integration, and the traditional frequentist estimation approach, such as maximum likelihood (ML), tends to fail due to model complexity. In such cases, Bayesian estimation with diffuse priors can be used as a viable alternative to ML estimation. This study compares the performance of Bayesian estimation with ML estimation in estimating single or multiple ability factors across 2 types of measurement models in the structural equation modeling framework: a multidimensional item response theory (MIRT) model and a multiple-indicator multiple-cause (MIMIC) model. A Monte Carlo simulation study demonstrates that Bayesian estimation with diffuse priors, under various conditions, produces results quite comparable with ML estimation in the single- and multilevel MIRT and MIMIC models. Additionally, an empirical example utilizing the Multistate Bar Examination is provided to compare the practical utility of the MIRT and MIMIC models. Structural relationships among the ability factors, covariates, and a binary outcome variable are investigated through the single- and multilevel measurement models. The article concludes with a summary of the relative advantages of Bayesian estimation over ML estimation in MIRT and MIMIC models and suggests strategies for implementing these methods.
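For orientation, a compensatory MIRT measurement model of the kind compared here can be written in generic notation as

\[
P(y_{ij} = 1 \mid \boldsymbol{\theta}_i) =
\frac{\exp(\mathbf{a}_j^{\top}\boldsymbol{\theta}_i + d_j)}
{1 + \exp(\mathbf{a}_j^{\top}\boldsymbol{\theta}_i + d_j)},
\qquad \boldsymbol{\theta}_i \sim N(\mathbf{0}, \boldsymbol{\Sigma}),
\]

where Bayesian estimation with diffuse priors places weakly informative distributions such as \(a_{jk} \sim N(0, 10^2)\) and \(d_j \sim N(0, 10^2)\) on the item parameters. The notation and the specific priors shown are illustrative and not necessarily those used in the study.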
Educational and Psychological Measurement | 2017
Soo Lee; Okan Bulut; Youngsuk Suh
A number of studies have found multiple indicators multiple causes (MIMIC) models to be an effective tool in detecting uniform differential item functioning (DIF) for individual items and item bundles. A recently developed MIMIC-interaction model is capable of detecting both uniform and nonuniform DIF in the unidimensional item response theory (IRT) framework. The goal of the current study is to extend the MIMIC-interaction model for detecting DIF in the context of multidimensional IRT modeling and examine the performance of the multidimensional MIMIC-interaction model under various simulation conditions with respect to Type I error and power rates. Simulation conditions include DIF pattern and magnitude, test length, correlation between latent traits, sample size, and latent mean differences between focal and reference groups. The results of this study indicate that power rates of the multidimensional MIMIC-interaction model under uniform DIF conditions were higher than those under nonuniform DIF conditions. When anchor item length and sample size increased, power for detecting DIF increased. Also, the equal latent mean condition tended to produce higher power rates than the different mean condition. Although the multidimensional MIMIC-interaction model was found to be a reasonably useful tool for identifying uniform DIF, the performance of the model in detecting nonuniform DIF appeared to be questionable.
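In its familiar unidimensional form, the MIMIC-interaction model referenced above is commonly written as

\[
y_{ij}^{*} = \lambda_j \theta_i + \beta_j z_i + \omega_j (\theta_i z_i) + \varepsilon_{ij},
\]

where \(z_i\) is the grouping covariate, a nonzero \(\beta_j\) signals uniform DIF, and a nonzero \(\omega_j\) signals nonuniform DIF for item j; the multidimensional extension studied here replaces \(\theta_i\) with a vector of latent traits. The notation is generic rather than the article's own.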
Applied Psychological Measurement | 2012
Sun-Joo Cho; Youngsuk Suh
The WinBUGS 1.4.3 language (Spiegelhalter, Thomas, & Best, 2003) is a syntax for generating random samples from the posterior distribution of the parameters of a Bayesian model. WinBUGS users only have to specify the structure of the model, initial values (optional), and the data. WinBUGS 1.4.3 code is provided for both simple and complex item response models, including a Rasch model, a two-parameter item response model, a partial credit model, a multilevel Rasch model, a compensatory multidimensional Rasch model, and a mixture Rasch model. The specification of the multilevel, multidimensional, and mixture structures is illustrated within the framework of the Rasch model. Readers can extend these models to more complex models that include item discrimination and item guessing parameters by referring to the code for the two-parameter and three-parameter item response models. Using empirical data, results from WinBUGS 1.4.3 are compared with those from other existing software packages that use different estimation algorithms (e.g., marginal maximum likelihood estimation and Laplace approximation). For a simulation study, data can be generated by fixing the generating parameters in the WinBUGS code, running the program for a single iteration, and then saving the current state of the sampler for the item responses (Spiegelhalter, personal communication, September 30, 2005). Example code for data generation and a batch mode for running replications in a simulation study is also provided.
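The data-generation strategy described above can also be sketched outside WinBUGS. The standalone Python snippet below is only an illustration of simulating Rasch-model item responses from fixed generating parameters (all values are arbitrary placeholders); it is not a translation of the article's WinBUGS code.

```python
import numpy as np

rng = np.random.default_rng(2024)
n_persons, n_items = 500, 20
theta = rng.normal(0.0, 1.0, n_persons)      # person abilities
beta = np.linspace(-2.0, 2.0, n_items)       # item difficulties

# Rasch model: P(y_ij = 1) = exp(theta_i - beta_j) / (1 + exp(theta_i - beta_j))
logits = theta[:, None] - beta[None, :]
prob = 1.0 / (1.0 + np.exp(-logits))
responses = rng.binomial(1, prob)            # (n_persons, n_items) matrix of 0/1 responses
```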
Structural Equation Modeling | 2015
Youngsuk Suh
The relative performance of the maximum likelihood (ML) and mean- and variance-adjusted weighted least squares (WLSMV) estimators was investigated by studying differential item functioning (DIF) with ordinal data when the latent variable was not normally distributed. As the ML estimator, ML with robust standard errors (labeled MLR in Mplus) was chosen and implemented with two link functions (logit vs. probit). The Type I error and power of χ2 tests were evaluated under various simulation conditions, including the shape of the latent variable distributions for the reference and focal groups. Type I error was better controlled with the MLR estimators than with WLSMV, and the Type I error rate for WLSMV was inflated when there was a large difference in the shape of the distribution between the two groups. In general, power remained quite stable across the different distribution conditions regardless of the estimator. WLSMV and MLR-probit showed comparable power, whereas MLR-logit performed the worst.
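In generic notation, the ordinal measurement model estimated under the two link functions can be written in cumulative-link form as

\[
P(y_{ij} \ge c \mid \eta_i) = F(\lambda_j \eta_i - \tau_{jc}), \qquad c = 1, \dots, C-1,
\]

where F is the standard normal CDF under the probit link and the logistic CDF under the logit link. This formulation is a common way of writing such models and is shown only for orientation, not as the article's exact parameterization.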
Applied Measurement in Education | 2015
Youngsuk Suh; Anna E. Talley
This study compared and illustrated four differential distractor functioning (DDF) detection methods for analyzing multiple-choice items. The log-linear approach, two item response theory-model-based approaches with likelihood ratio tests, and the odds ratio approach were compared to examine the congruence among the four DDF detection methods. Data from a college-level mathematics placement test were analyzed to understand the causes of differential functioning. Results indicated some agreement among the four detection methods. To facilitate practical interpretation of the DDF results, several possible effect size measures were also obtained and compared.
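For orientation, the odds ratio approach to DDF can be illustrated in a generic form: restricting attention to examinees who answered the item incorrectly, the odds of selecting a particular distractor k are compared across the reference (R) and focal (F) groups,

\[
\mathrm{OR}_k = \frac{p_{Fk}/(1 - p_{Fk})}{p_{Rk}/(1 - p_{Rk})},
\]

where \(p_{Gk}\) is the proportion of group G's incorrect responses that selected distractor k; in practice such ratios are typically computed within ability strata and pooled in Mantel-Haenszel fashion. This is a generic illustration rather than the exact estimator used in the article.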
Frontiers in Education | 2017
Okan Bulut; Youngsuk Suh
Differential item functioning (DIF) is typically evaluated in educational assessments with a simple structure in which items are associated with a single latent trait. This study aims to extend the investigation of DIF for multidimensional assessments with a nonsimple structure in which items can be associated with two or more latent traits. A simulation study was conducted with the multidimensional extensions of the item response theory likelihood ratio (IRT-LR) test, the multiple indicators multiple causes (MIMIC) model, and logistic regression for detecting uniform and nonuniform DIF in multidimensional assessments. The results indicated that the IRT-LR test outperformed the MIMIC and logistic regression approaches in detecting nonuniform DIF. When detecting uniform DIF, the MIMIC and logistic regression approaches appeared to perform better than the IRT-LR test in short tests, while the performances of all three approaches were very similar in longer tests. Type I error rates for logistic regression were severely inflated compared to the other two approaches. The IRT-LR test appears to be a more balanced and powerful method than the MIMIC and logistic regression approaches in detecting DIF in multidimensional assessments with a nonsimple structure.
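As a point of reference, the logistic regression DIF model in its familiar unidimensional form is

\[
\operatorname{logit} P(y_{ij} = 1) = \beta_0 + \beta_1 s_i + \beta_2 g_i + \beta_3 (s_i g_i),
\]

where \(s_i\) is the matching variable (e.g., a total or subscale score), \(g_i\) indicates group membership, a nonzero \(\beta_2\) indicates uniform DIF, and a nonzero \(\beta_3\) indicates nonuniform DIF; a multidimensional version would include a matching score for each latent trait. The notation is generic rather than the article's own.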