Hang J. Kim
University of Cincinnati
Publications
Featured research published by Hang J. Kim.
Journal of Business & Economic Statistics | 2014
Hang J. Kim; Quanli Wang; Lawrence H. Cox; Alan F. Karr
Many statistical agencies, survey organizations, and research centers collect data that suffer from item nonresponse and erroneous or inconsistent values. These data may be required to satisfy linear constraints, for example, bounds on individual variables and inequalities for ratios or sums of variables. Often these constraints are designed to identify faulty values, which then are blanked and imputed. The data also may exhibit complex distributional features, including nonlinear relationships and highly nonnormal distributions. We present a fully Bayesian, joint model for modeling or imputing data with missing/blanked values under linear constraints that (i) automatically incorporates the constraints in inferences and imputations, and (ii) uses a flexible Dirichlet process mixture of multivariate normal distributions to reflect complex distributional features. Our strategy for estimation is to augment the observed data with draws from a hypothetical population in which the constraints are not present, thereby taking advantage of computationally expedient methods for fitting mixture models. Missing/blanked items are sampled from their posterior distribution using the Hit-and-Run sampler, which guarantees that all imputations satisfy the constraints. We illustrate the approach using manufacturing data from Colombia, examining the potential to preserve joint distributions and a regression from the plant productivity literature. Supplementary materials for this article are available online.
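As a rough illustration of the sampling step described above, the sketch below shows a generic Hit-and-Run move for a point constrained to a polytope {x : Ax <= b}. It is not the authors' implementation: the uniform draw along the feasible segment, the toy box constraints, and all names are illustrative assumptions.

```python
import numpy as np

def hit_and_run_step(x, A, b, rng):
    """One Hit-and-Run move inside the polytope {x : A @ x <= b}.

    Illustrative sketch: the paper draws missing items from their posterior
    along the chosen direction; here we sample uniformly for simplicity.
    """
    d = rng.normal(size=x.shape)
    d /= np.linalg.norm(d)                      # random direction on the unit sphere

    # Feasible step sizes t must satisfy A @ (x + t * d) <= b.
    slack = b - A @ x                           # nonnegative if x is feasible
    rates = A @ d
    t_upper = np.min(slack[rates > 0] / rates[rates > 0], initial=np.inf)
    t_lower = np.max(slack[rates < 0] / rates[rates < 0], initial=-np.inf)

    t = rng.uniform(t_lower, t_upper)           # uniform draw on the feasible segment
    return x + t * d

# Toy usage: the unit box in 2D written as A x <= b.
rng = np.random.default_rng(0)
A = np.vstack([np.eye(2), -np.eye(2)])
b = np.array([1.0, 1.0, 0.0, 0.0])
x = np.array([0.5, 0.5])
for _ in range(1000):
    x = hit_and_run_step(x, A, b, rng)
```

Every accepted point stays inside the constraint set by construction, which is what guarantees that all imputations satisfy the edits.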
Journal of the American Statistical Association | 2015
Hang J. Kim; Lawrence H. Cox; Alan F. Karr; Quanli Wang
Many statistical organizations collect data that are expected to satisfy linear constraints; as examples, component variables should sum to total variables, and ratios of pairs of variables should be bounded by expert-specified constants. When reported data violate constraints, organizations identify and replace values potentially in error in a process known as edit-imputation. To date, most approaches separate the error localization and imputation steps, typically using optimization methods to identify the variables to change followed by hot deck imputation. We present an approach that fully integrates editing and imputation for continuous microdata under linear constraints. Our approach relies on a Bayesian hierarchical model that includes (i) a flexible joint probability model for the underlying true values of the data with support only on the set of values that satisfy all editing constraints, (ii) a model for latent indicators of the variables that are in error, and (iii) a model for the reported responses for variables in error. We illustrate the potential advantages of the Bayesian editing approach over existing approaches using simulation studies. We apply the model to edit faulty data from the 2007 U.S. Census of Manufactures. Supplementary materials for this article are available online.
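To make the three-part hierarchical structure concrete, the following toy sketch generates data in that spirit: constrained "true" values, latent error indicators, and reported responses that violate edits where errors occur. All distributions and the edit rule are illustrative assumptions, not the paper's model for the Census of Manufactures.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# (i) "True" values with support only on the constraint set:
#     here, two nonnegative components and their total (illustrative).
x1 = rng.gamma(shape=5.0, scale=10.0, size=n)
x2 = rng.gamma(shape=3.0, scale=10.0, size=n)
true = np.column_stack([x1, x2, x1 + x2])            # components sum to the total

# (ii) Latent indicators of which cells are in error.
error_prob = 0.05
in_error = rng.random(true.shape) < error_prob

# (iii) Reported responses: correct cells pass through, faulty cells are
#       overwritten by a gross-error mechanism (unit / decimal-point mistakes).
reported = np.where(in_error, true * rng.choice([0.1, 10.0], size=true.shape), true)

# Edit rule check: a record fails edits when its components no longer add to the total.
fails_edit = ~np.isclose(reported[:, 0] + reported[:, 1], reported[:, 2])
print(f"{fails_edit.mean():.1%} of reported records violate the edit rule")
```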
Journal of Computational and Graphical Statistics | 2015
Hang J. Kim; Steven N. MacEachern
The multiset sampler, an MCMC algorithm recently proposed by Leman and coauthors, is an easy-to-implement algorithm which is especially well-suited to drawing samples from a multimodal distribution. We generalize the algorithm by redefining the multiset sampler with an explicit link between target distribution and sampling distribution. The generalized formulation replaces the multiset with a K-tuple, which allows us to use the algorithm on unbounded parameter spaces, improves estimation, and sets up further extensions to adaptive MCMC techniques. Theoretical properties of the algorithm are provided and guidance is given on its implementation. Examples, both simulated and real, confirm that the generalized multiset sampler provides a simple, general and effective approach to sampling from multimodal distributions. Supplementary materials for this article are available online.
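A minimal one-dimensional sketch of the K-tuple idea follows: the chain lives on a K-tuple whose stationary distribution mixes the target over the tuple positions through an instrumental density, and estimation reweights each position by its share of the target. The bimodal target, instrumental density, step size, and K below are illustrative choices, not the authors' code.

```python
import numpy as np

def pi(x):
    """Bimodal target (unnormalized): equal mixture of N(-4,1) and N(4,1)."""
    return np.exp(-0.5 * (x + 4.0) ** 2) + np.exp(-0.5 * (x - 4.0) ** 2)

def f(x):
    """Instrumental density attached to the non-active tuple positions (N(0, 10^2))."""
    return np.exp(-0.5 * (x / 10.0) ** 2) / (10.0 * np.sqrt(2.0 * np.pi))

def tuple_terms(x):
    """Per-position terms pi(x_k) * prod_{j != k} f(x_j); their sum is the
    (unnormalized) K-tuple target density."""
    px, fx = pi(x), f(x)
    return px * np.prod(fx) / fx

rng = np.random.default_rng(0)
K, iters, step = 3, 20000, 2.5
x = rng.normal(size=K)
weighted_sum = 0.0
for _ in range(iters):
    k = rng.integers(K)
    prop = x.copy()
    prop[k] = x[k] + step * rng.normal()                  # random-walk move on one position
    if rng.random() < tuple_terms(prop).sum() / tuple_terms(x).sum():
        x = prop
    w = tuple_terms(x)
    w = w / w.sum()                                       # per-position weights
    weighted_sum += np.sum(w * x)                         # weighted draw for E_pi[X]
print("estimated E[X] =", weighted_sum / iters)           # near 0 for this symmetric target
```

Because each Metropolis move only needs the ratio of the summed terms, the sampler hops between modes far more easily than a single-chain random walk on the same target.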
PLOS Computational Biology | 2017
Dongjun Chung; Hang J. Kim; Hongyu Zhao
Genome-wide association studies (GWAS) have identified tens of thousands of genetic variants associated with hundreds of phenotypes and diseases, yielding novel biomarkers and therapeutic targets that provide clinical and medical benefits to patients. However, identifying risk variants for complex diseases remains challenging because these diseases are often affected by many genetic variants with small or moderate effects. There is accumulating evidence that different complex traits share a common risk basis, a phenomenon known as pleiotropy. Recently, several statistical methods have been developed to improve statistical power to identify risk variants for complex traits through a joint analysis of multiple GWAS datasets that leverages pleiotropy. While these methods have been shown to improve statistical power for association mapping compared with separate analyses, they are still limited in the number of phenotypes that can be integrated. To address this challenge, in this paper we propose a novel statistical framework, graph-GPA, to integrate a large number of GWAS datasets for multiple phenotypes using a hidden Markov random field approach. Application of graph-GPA to a joint analysis of GWAS datasets for 12 phenotypes shows that it improves statistical power to identify risk variants compared with statistical methods based on a smaller number of GWAS datasets. In addition, graph-GPA promotes better understanding of genetic mechanisms shared among phenotypes, which can potentially be useful for the development of improved diagnostics and therapeutics. The R implementation of graph-GPA is currently available at https://dongjunchung.github.io/GGPA/.
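The modeling idea can be sketched as follows: each SNP-phenotype pair carries a latent association indicator, the indicators are coupled across phenotypes by an Ising-type Markov random field on a phenotype graph, and p-values are modeled as Uniform(0,1) under the null and Beta(a, 1) under the alternative. The toy Gibbs sweep below uses fixed, illustrative parameter values and placeholder p-values; in graph-GPA these quantities are estimated from data, and the GGPA package linked above should be used for real analyses.

```python
import numpy as np

rng = np.random.default_rng(0)
n_snps, n_pheno = 1000, 4
edges = [(0, 1), (1, 2), (2, 3)]             # illustrative phenotype graph

# Illustrative parameters (estimated, not fixed, in graph-GPA).
alpha = -2.0                                 # baseline log-odds of association
beta = 1.5                                   # strength of pleiotropic coupling on edges
a_beta = 0.3                                 # Beta(a, 1) emission for associated SNPs

pvals = rng.random((n_snps, n_pheno))        # placeholder GWAS p-values
e = np.zeros((n_snps, n_pheno), dtype=int)   # latent association indicators

neighbors = {t: [] for t in range(n_pheno)}
for t, u in edges:
    neighbors[t].append(u)
    neighbors[u].append(t)

def gibbs_sweep(e):
    """One Gibbs sweep over the latent indicators e[snp, phenotype]."""
    for t in range(n_pheno):
        coupling = beta * sum(e[:, u] for u in neighbors[t])   # MRF neighbor term
        log_odds_prior = alpha + coupling
        # Emission: p ~ Uniform(0,1) if e = 0, p ~ Beta(a_beta, 1) if e = 1.
        log_lik_ratio = np.log(a_beta) + (a_beta - 1.0) * np.log(pvals[:, t])
        prob1 = 1.0 / (1.0 + np.exp(-(log_odds_prior + log_lik_ratio)))
        e[:, t] = rng.random(n_snps) < prob1
    return e

for _ in range(100):
    e = gibbs_sweep(e)
```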
Journal of Official Statistics | 2015
Hang J. Kim; Alan F. Karr
Abstract We compare two general strategies for performing statistical disclosure limitation (SDL) for continuous microdata subject to edit rules. In the first, existing SDL methods are applied, and any constraint-violating values they produce are replaced using a constraint-preserving imputation procedure. In the second, the SDL methods are modified to prevent them from generating violations. We present a simulation study, based on data from the Colombian Annual Manufacturing Survey, that evaluates the performance of the two strategies as applied to several SDL methods. The results suggest that differences in risk-utility profiles across SDL methods dwarf differences between the two general strategies. Among the SDL strategies, variants of microaggregation and partially synthetic data offer the most attractive risk-utility profiles.
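As one concrete example of an SDL method appearing in the comparison, the sketch below implements fixed-group-size univariate microaggregation and then checks a ratio edit on the masked data. The variables, bounds, and group size are illustrative assumptions, not the Colombian survey setup.

```python
import numpy as np

def microaggregate(values, k=3):
    """Univariate microaggregation: sort, form groups of k, and replace each
    value by its group mean (any remainder forms a smaller final group in
    this simplified sketch)."""
    order = np.argsort(values)
    out = np.empty_like(values, dtype=float)
    for start in range(0, len(values), k):
        idx = order[start:start + k]
        out[idx] = values[idx].mean()
    return out

rng = np.random.default_rng(0)
wages = rng.lognormal(mean=10.0, sigma=1.0, size=200)
workers = np.clip(rng.poisson(lam=50, size=200), 1, None)

masked_wages = microaggregate(wages, k=5)

# Edit rule (illustrative): wage per worker must stay within expert-specified bounds.
ratio = masked_wages / workers
violations = (ratio < 50) | (ratio > 1e6)
print(f"{violations.mean():.1%} of masked records violate the ratio edit")
```

Under the first strategy in the paper, any such violating values would then be replaced by a constraint-preserving imputation; under the second, the masking step itself would be modified to avoid producing them.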
PLOS ONE | 2018
Emma Kortemeier; Paula S. Ramos; Kelly J. Hunt; Hang J. Kim; Gary Hardiman; Dongjun Chung
In spite of accumulating evidence suggesting that different complex traits share a common risk basis, namely pleiotropy, effective investigation of pleiotropic architecture remains challenging. To address this challenge, we developed ShinyGPA, an interactive and dynamic visualization toolkit for investigating pleiotropic structure. ShinyGPA requires only summary statistics from genome-wide association studies (GWAS), which reduces the burden on researchers using the tool. ShinyGPA allows users to investigate genetic relationships among phenotypes effectively through a flexible low-dimensional visualization and an intuitive user interface. In addition, ShinyGPA provides joint association mapping functionality that can facilitate biological understanding of the pleiotropic architecture. We analyzed GWAS summary statistics for 12 phenotypes using ShinyGPA and obtained visualization and joint association mapping results that are well supported by the literature. The visualization produced by ShinyGPA can also serve as a hypothesis-generating tool for relationships between phenotypes and can help improve the design of future genetic studies. ShinyGPA is currently available at https://dongjunchung.github.io/GPA/.
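As a rough stand-in for the kind of low-dimensional phenotype map ShinyGPA produces, the sketch below builds a crude pairwise distance between phenotypes from GWAS summary statistics and embeds it with classical multidimensional scaling. This is not the GPA-model-based layout ShinyGPA actually computes; the phenotype labels and placeholder p-values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_snps = 5000
phenos = ["T1D", "RA", "CD", "HDL"]                 # hypothetical phenotype labels

# Placeholder GWAS summary statistics: one p-value per SNP per phenotype.
pvals = rng.random((n_snps, len(phenos)))

# Crude pairwise "relatedness": correlation of -log10 p-values across SNPs.
scores = -np.log10(pvals)
corr = np.corrcoef(scores, rowvar=False)
dist = np.sqrt(2.0 * (1.0 - corr))                  # correlation distance

# Classical multidimensional scaling into 2D for plotting.
n = dist.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (dist ** 2) @ J                      # double-centered Gram matrix
eigval, eigvec = np.linalg.eigh(B)
coords = eigvec[:, -2:] * np.sqrt(np.maximum(eigval[-2:], 0.0))

for name, (x, y) in zip(phenos, coords):
    print(f"{name}: ({x:.2f}, {y:.2f})")
```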
Korean Journal of Applied Statistics | 2016
Min-Jeong Park; Hang J. Kim
The increasing demand from researchers and policy makers for microdata has also increased related privacy and security concerns. During the past two decades, a large volume of literature on statistical disclosure control (SDC) has been published in international journals. This review paper introduces relatively recent SDC approaches to the communities of Korean statisticians and statistical agencies. In addition to the traditional masking techniques (such as microaggregation and noise addition), we introduce online analytic systems, differential privacy, and synthetic data. For each approach, we highlight an application example along with the methodology and its pros and cons, so that the paper can assist statistical agencies seeking a practical SDC approach.
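As a concrete illustration of one approach covered in the review, the sketch below applies the Laplace mechanism, the standard building block of differential privacy, to a counting query. The data and parameter values are hypothetical.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release a statistic with epsilon-differential privacy by adding
    Laplace noise with scale sensitivity / epsilon."""
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

rng = np.random.default_rng(0)
incomes = rng.lognormal(mean=10.0, sigma=1.0, size=1000)

# Counting query: how many records exceed a threshold?
# Adding or removing one person changes the count by at most 1, so sensitivity = 1.
true_count = int((incomes > 50000).sum())
private_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5, rng=rng)

print(f"true count: {true_count}, privatized count: {private_count:.1f}")
```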
Journal of Applied Statistics | 2018
Hang J. Kim; Alan F. Karr
Bioinformatics | 2018
Hang J. Kim; Zhenning Yu; Andrew B. Lawson; Hongyu Zhao; Dongjun Chung
arXiv: Methodology | 2016
Hang J. Kim; Steven N. MacEachern; Yoonsuh Jung