Hosik Choi
Kyonggi University
Publication
Featured research published by Hosik Choi.
Journal of the American Statistical Association | 2008
Yongdai Kim; Hosik Choi; Hee-Seok Oh
The smoothly clipped absolute deviation (SCAD) estimator, proposed by Fan and Li, has many desirable properties, including continuity, sparsity, and unbiasedness. The SCAD estimator also has the (asymptotically) oracle property when the dimension of covariates is fixed or diverges more slowly than the sample size. In this article we study the SCAD estimator in high-dimensional settings where the dimension of covariates can be much larger than the sample size. First, we develop an efficient optimization algorithm that is fast and always converges to a local minimum. Second, we prove that the SCAD estimator still has the oracle property on high-dimensional problems. We perform numerical studies to compare the SCAD estimator with the LASSO and SIS–SCAD estimators in terms of prediction accuracy and variable selectivity when the true model is sparse. Through the simulation, we show that the variance estimator of Fan and Li still works well for some limited high-dimensional cases where the true nonzero coefficients are not too small and the sample size is moderately large. We apply the proposed algorithm to analyze a high-dimensional microarray data set.
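For readers unfamiliar with the penalty the abstract refers to, the following is a minimal sketch of the SCAD penalty function of Fan and Li for a single coefficient. It illustrates only the penalty, not the optimization algorithm developed in the article. The function name `scad_penalty` is hypothetical, and the default `a=3.7` is the value commonly suggested by Fan and Li.

```python
def scad_penalty(beta, lam, a=3.7):
    """Evaluate the SCAD penalty at one coefficient (illustrative sketch).

    The penalty matches the LASSO penalty lam*|beta| near zero, bends
    quadratically between lam and a*lam, and is constant beyond a*lam,
    so large coefficients are not shrunk further.
    """
    if lam <= 0 or a <= 2:
        raise ValueError("SCAD requires lam > 0 and a > 2")
    b = abs(beta)
    if b <= lam:
        # Linear (LASSO-like) region.
        return lam * b
    if b <= a * lam:
        # Quadratic transition region.
        return (2 * a * lam * b - b * b - lam * lam) / (2 * (a - 1))
    # Flat region: the penalty no longer grows with |beta|.
    return lam * lam * (a + 1) / 2
```

The flat region is what removes the shrinkage bias on large coefficients, in contrast to the LASSO penalty, which keeps growing linearly.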
Journal of Computational and Graphical Statistics | 2010
Kyupil Yeon; Moon Sup Song; Yongdai Kim; Hosik Choi; Cheolwoo Park
A supervised learning algorithm aims to build a prediction model using training examples. This paradigm typically has the assumptions that the underlying distribution and the true input–output dependency do not change. However, these assumptions often fail to hold, especially in data streams. This phenomenon is known as concept drift. We propose a new model combining algorithm for tracking concept drift in data streams. The final predictive ensemble model has a form of a weighted average and ridge regression combiner. The coefficients of the combiner are determined by ridge regression with the constraints such that the coefficients are nonnegative and sum to 1. The proposed algorithm is devised via a new measure of concept drift, the angle between the estimated weights from data and the optimal weight vector obtained under no concept drift. It is shown that the ridge tuning parameter plays a crucial role of forcing the proposed algorithm to adapt to concept drift. Our main findings include (i) the proposed algorithm can achieve the optimal weights in the case of no concept drift if the tuning parameter is sufficiently large, and (ii) the angle is monotonically increasing as the tuning parameter decreases. These imply that if the tuning parameter is well-controlled, the algorithm can produce weights which reflect the degree of concept drift measured by the angle. Using various numerical examples, it is shown that the proposed algorithm can track concept drift better than other existing ensemble methods. Supplemental materials, computer code and R-package, are available online.
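The constrained combiner described above can be sketched as follows. This is a simplified illustration, not the authors' algorithm or their R package: it assumes the combiner minimizes mean squared error plus a ridge term `lam * sum(w_j^2)` over weights that are nonnegative and sum to 1, and it solves that problem by projected gradient descent. The names `project_simplex` and `simplex_ridge_weights` are hypothetical.

```python
def project_simplex(v):
    """Euclidean projection of v onto {w : w_j >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:
            theta = t  # keep the threshold from the last index that passes
    return [max(x - theta, 0.0) for x in v]


def simplex_ridge_weights(preds, y, lam, n_iter=2000):
    """Combine base-model predictions with simplex-constrained ridge weights.

    preds[i][j] is the prediction of base model j on example i.
    Minimizes (1/n) * sum_i (y_i - preds[i] . w)^2 + lam * sum_j w_j^2.
    """
    n, m = len(preds), len(preds[0])
    # Upper bound on the gradient's Lipschitz constant gives a safe step size.
    lipschitz = 2.0 * (sum(p * p for row in preds for p in row) / n + lam)
    step = 1.0 / lipschitz if lipschitz > 0 else 1.0
    w = [1.0 / m] * m  # start from the plain average
    for _ in range(n_iter):
        resid = [y_i - sum(p * w_j for p, w_j in zip(row, w))
                 for row, y_i in zip(preds, y)]
        grad = [-2.0 / n * sum(row[j] * r for row, r in zip(preds, resid))
                + 2.0 * lam * w[j] for j in range(m)]
        w = project_simplex([w_j - step * g for w_j, g in zip(w, grad)])
    return w
```

In this simplified formulation, a large `lam` pushes the weights toward the uniform average, while a small `lam` lets the data concentrate weight on the base models that currently predict best.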
BioMed Research International | 2015
Sungho Won; Hosik Choi; Su Yeon Park; Juyoung Lee; Changyi Park; Sunghoon Kwon
Owing to recent improvements in genotyping technology, large-scale genetic data can be utilized to identify disease susceptibility loci, and these findings have substantially improved our understanding of complex diseases. However, in spite of these successes, most of the genetic effects for many complex diseases were found to be very small, which has been a major hurdle in building disease prediction models. Recently, many statistical methods based on penalized regressions have been proposed to tackle the so-called “large P and small N” problem. Penalized regressions, including the least absolute shrinkage and selection operator (LASSO) and ridge regression, limit the space of parameters, and this constraint enables the estimation of effects for a very large number of SNPs. Various extensions have been suggested, and, in this report, we compare their accuracy by applying them to several complex diseases. Our results show that penalized regressions are usually robust and provide better accuracy than the existing methods, at least for the diseases under consideration.
Computational Statistics & Data Analysis | 2011
Sunghoon Kwon; Hosik Choi; Yongdai Kim
In this paper, we propose a method of quadratic approximation that unifies various types of smoothly clipped absolute deviation (SCAD) penalized estimations. For convenience, we call it the quadratically approximated SCAD penalized estimation (Q-SCAD). We prove that the proposed Q-SCAD estimator achieves the oracle property and requires only the least angle regression (LARS) algorithm for computation. Numerical studies including simulations and real data analysis confirm that the Q-SCAD estimator performs as efficiently as the original SCAD estimator.
Computational Statistics & Data Analysis | 2017
Li-Yu Wang; Cheolwoo Park; Kyupil Yeon; Hosik Choi
The objective of this work is to develop a predictive model when data batches are collected in a sequential manner. With streaming data, information is constantly being updated and a major statistical challenge for these types of data is that the underlying distribution and the true input–output dependency might change over time, a phenomenon known as concept drift. The concept drift phenomenon makes the learning process complicated because a predictive model constructed on the past data is no longer consistent with new examples. In order to effectively track concept drift, we propose model-combining methods using constrained and penalized regression that possesses a grouping property. The new learning methods enable us to select data batches as a group that are relevant to the current one, reduce the effects of irrelevant batches, and adaptively reflect the degree of concept drift emerging in data streams. We demonstrate the finite sample performance of the proposed method using simulated and real examples. The analytical and empirical results indicate that the proposed methods can effectively adapt to various types of concept drift.
Journal of Statistical Computation and Simulation | 2015
Hosik Choi; Ja-Yong Koo; Changyi Park
Credit scoring can be defined as the set of statistical models and techniques that help financial institutions in their credit decision making. In this paper, we consider a coarse classification method based on fused least absolute shrinkage and selection operator (LASSO) penalization. By adopting fused LASSO, one can deal with continuous as well as discrete variables in a unified framework. For computational efficiency, we develop a penalization path algorithm. Through numerical examples, we compare the performances of fused LASSO and LASSO with dummy variable coding.
Human Brain Mapping | 2018
Li-Yu Wang; Jongik Chung; Cheolwoo Park; Hosik Choi; Amanda L. Rodrigue; Jordan E. Pierce; Brett A. Clementz; Jennifer E. McDowell
Combining statistical parametric maps (SPM) from individual subjects is the goal in some types of group‐level analyses of functional magnetic resonance imaging data. Brain maps are usually combined using a simple average across subjects, making them susceptible to subjects with outlying values. Furthermore, t tests are prone to false positives and false negatives when outlying values are observed. We propose a regularized unsupervised aggregation method for SPMs to find an optimal weight for aggregation, which aids in detecting and mitigating the effect of outlying subjects. We also present a bootstrap‐based weighted t test using the optimal weights to construct an activation map robust to outlying subjects. We validate the performance of the proposed aggregation method and test using simulated and real data examples. Results show that the regularized aggregation approach can effectively detect outlying subjects, lower their weights, and produce robust SPMs.
Applied Intelligence | 2018
Jong-June Jeon; Hosik Choi
The Luce model is one of the most popular ranking models used to estimate the ranks of items. In this study, we focus on grouping items with similar abilities and consider a new supervised clustering method by fusing specific parameters used in the Luce model. By modifying the penalty function conventionally used in grouping parameters, we obtain a new method of grouping items in the Luce model without pairwise comparison modeling and develop an efficient algorithm to estimate the parameters. Moreover, we give an application of the proposed algorithm to the Bradley-Terry model with ties. In the real data analysis, we confirm that the proposed estimator provides an easier interpretation of ranks and an improvement in the quality of prediction.
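As background for the ranking model named above, here is a minimal sketch of the log-likelihood of one observed ranking under the (Plackett-)Luce model, in which the item ranked next is chosen with probability proportional to its ability among the items not yet ranked. It does not implement the fusion penalty or the estimation algorithm proposed in the study. The function name `luce_log_likelihood` and the log-ability parameterization are assumptions made for illustration.

```python
import math


def luce_log_likelihood(ranking, log_ability):
    """Log-likelihood of one ranking under the Luce model (sketch).

    ranking: list of distinct item ids, best item first.
    log_ability: dict mapping item id -> log of its ability parameter.
    """
    remaining = list(ranking)
    total = 0.0
    for item in ranking:
        # Stable log-sum-exp over the items still available to be chosen.
        top = max(log_ability[j] for j in remaining)
        log_denom = top + math.log(
            sum(math.exp(log_ability[j] - top) for j in remaining)
        )
        total += log_ability[item] - log_denom
        remaining.remove(item)  # the chosen item leaves the choice set
    return total
```

Items whose log-abilities are equal are interchangeable in this likelihood, which is the sense in which fusing parameters groups items of similar ability.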
Genetics | 2017
Jungsoo Gim; Wonji Kim; Soo Heon Kwak; Hosik Choi; Changyi Park; Kyong Soo Park; Sunghoon Kwon; Taesung Park; Sungho Won
Despite the many successes of genome-wide association studies (GWAS), the known susceptibility variants identified by GWAS have modest effect sizes, leading to notable skepticism about the effectiveness of building a risk prediction model from large-scale genetic data. However, in contrast to genetic variants, the family history of diseases has been largely accepted as an important risk factor in clinical diagnosis and risk prediction. Nevertheless, the complicated structures of the family history of diseases have limited their application in clinical practice. Here, we developed a new method that enables incorporation of the general family history of diseases with a liability threshold model, and propose a new analysis strategy for risk prediction with penalized regression analysis that incorporates both large numbers of genetic variants and clinical risk factors. Application of our model to type 2 diabetes in the Korean population (1846 cases and 1846 controls) demonstrated that single-nucleotide polymorphisms accounted for 32.5% of the variation explained by the predicted risk scores in the test data set, and incorporation of family history led to an additional 6.3% improvement in prediction. Our results illustrate that family medical history provides valuable information on the variation of complex diseases and improves prediction performance.
Computational Statistics & Data Analysis | 2017
Jong-June Jeon; Sunghoon Kwon; Hosik Choi
We propose to use a penalized estimator for detecting homogeneity of the high-dimensional generalized linear model. Here, homogeneity is a specific model structure in which regression coefficients are grouped, with exactly the same value within each group. The proposed estimator achieves the weak oracle property under mild regularity conditions and is invariant to the choice of reference levels when there are categorical covariates in the model. An efficient algorithm is also provided. Various numerical studies confirm that the proposed penalized estimator gives better performance than other conventional variable selection estimators when the model has homogeneity.
