
Publication


Featured research published by Jangsun Baek.


IEEE Transactions on Pattern Analysis and Machine Intelligence | 2010

Mixtures of Factor Analyzers with Common Factor Loadings: Applications to the Clustering and Visualization of High-Dimensional Data

Jangsun Baek; Geoffrey J. McLachlan; Lloyd K. Flack

Mixtures of factor analyzers enable model-based density estimation to be undertaken for high-dimensional data, where the number of observations n is not very large relative to their dimension p. In practice, there is often the need to further reduce the number of parameters in the specification of the component-covariance matrices. To this end, we propose the use of common component-factor loadings, which considerably reduces the number of parameters further. Moreover, it allows the data to be displayed in low-dimensional plots.
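As a sketch of the model form described in the abstract (the symbols here follow standard factor-analytic notation and are ours, not a transcription of the paper's equations): with a single p × q loading matrix A shared across all g components, the mixture density becomes

```latex
% Mixture of factor analyzers with common factor loadings (sketch):
% y is p-dimensional, the factors are q-dimensional with q << p,
% A (p x q) is shared by all g components, D is diagonal.
f(\mathbf{y})
  = \sum_{i=1}^{g} \pi_i \,
    \phi_p\!\left( \mathbf{y};\; A \boldsymbol{\xi}_i,\;
                   A\, \Omega_i\, A^{\top} + D \right)
```

Only one loading matrix A is estimated instead of one per component, and the estimated q-dimensional factor scores provide the low-dimensional plots the abstract refers to.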


Bioinformatics | 2011

Mixtures of common t-factor analyzers for clustering high-dimensional microarray data

Jangsun Baek; Geoffrey J. McLachlan

MOTIVATION: Mixtures of factor analyzers enable model-based clustering to be undertaken for high-dimensional microarray data, where the number of observations n is small relative to the number of genes p. Moreover, when the number of clusters is not small, for example, where there are several different types of cancer, there may be a need to reduce further the number of parameters in the specification of the component-covariance matrices. A further reduction can be achieved by using mixtures of factor analyzers with common component-factor loadings (MCFA), which is a more parsimonious model. However, this approach is sensitive to both non-normality and outliers, which are commonly observed in microarray experiments. This sensitivity of the MCFA approach is due to its being based on a mixture model in which the multivariate normal family of distributions is assumed for the component-error and factor distributions.

RESULTS: An extension to mixtures of t-factor analyzers with common component-factor loadings is considered, whereby the multivariate t family is adopted for the component-error and factor distributions. An EM algorithm is developed for the fitting of mixtures of common t-factor analyzers. The model can handle data with tails longer than those of the normal distribution, is robust against outliers, and allows the data to be displayed in low-dimensional plots. It is applied here to both synthetic data and some microarray gene expression data for clustering, and it shows better performance than several existing methods.

AVAILABILITY: The algorithms were implemented in Matlab. The Matlab code is available at http://blog.naver.com/aggie100.


Pattern Recognition | 2004

Face recognition using partial least squares components

Jangsun Baek; Min-Soo Kim

The paper considers partial least squares (PLS) as a dimension reduction technique for the feature vector to overcome the small sample size problem in face recognition. Principal component analysis (PCA), a conventional dimension reduction method, selects the components with maximum variability, irrespective of the class information, so PCA does not necessarily extract features that are important for the discrimination of classes. PLS, on the other hand, constructs its components so that their correlation with the class variable is maximized. Therefore PLS components are more predictive than PCA components in classification. The experimental results on the Manchester and ORL databases show that PLS is to be preferred over PCA when classification is the goal and dimension reduction is needed.


Pattern Recognition Letters | 2008

A modified correlation coefficient based similarity measure for clustering time-course gene expression data

Young Sook Son; Jangsun Baek

Gene expression levels are often measured consecutively in time through microarray experiments to detect the cellular processes underlying observed regulatory effects and to assign functionality to genes whose function is yet unknown. Clustering methods allow us to group genes that show similar time-course expression profiles and that are thus likely to be co-regulated. The correlation coefficient, the most widely used similarity measure in the context of gene expression data, is not very reliable in representing the association of two temporal profile patterns. Moreover, clustering methods based on the correlation coefficient generate the same clustering result even when the time points are permuted arbitrarily. We propose a new similarity measure for clustering time-course gene expression data. The proposed measure is based on the correlation coefficient and two indices representing, respectively, the concordance of the temporal profile patterns and the concordance of the time points at which the maximum and minimum expression levels are attained in the two profiles. We applied the hierarchical clustering method with the proposed similarity measure to both synthetic and breast cancer cell line data and observed favorable results compared to the correlation coefficient based method. The proposed similarity measure is simple to implement, and according to the cross-validation criterion it is much more consistent for clustering than the correlation coefficient based method.
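The flaw motivating the paper is easy to demonstrate: the Pearson correlation between two time-course profiles is unchanged when the time points are reordered (identically in both profiles), so it carries no information about temporal ordering. A small self-contained check (synthetic profiles, not microarray data):

```python
# Pearson correlation ignores the ordering of time points:
# permuting both profiles identically leaves it unchanged.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(10)
profile_a = np.sin(t / 2.0) + rng.normal(0, 0.1, 10)
profile_b = np.sin(t / 2.0 + 0.3) + rng.normal(0, 0.1, 10)

r_original = np.corrcoef(profile_a, profile_b)[0, 1]

perm = rng.permutation(10)               # arbitrary reordering of time points
r_permuted = np.corrcoef(profile_a[perm], profile_b[perm])[0, 1]

print(r_original, r_permuted)            # identical to machine precision
assert np.isclose(r_original, r_permuted)
```

This is why the proposed measure supplements the correlation with order-aware concordance indices; their exact form is given in the paper.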


Computers & Geosciences | 2001

Efficient computation of maximum likelihood estimators in a spatial linear model with power exponential covariogram

Jeong-Soo Park; Jangsun Baek

A computational implementation, the Fortran-77 program MAXPEC, for computing maximum likelihood estimates (MLEs) in a Gaussian spatial linear model with a power exponential covariance family is presented. Algorithms for efficiently evaluating the likelihood function and its gradient are given. The concentrated log likelihood and its gradient are computed on the order of (2/3)n³ operations. The computer program is designed to work with various linear models and combinations of covariance parameters.
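A minimal sketch of the model class described above (not MAXPEC itself): a Gaussian process with power exponential covariogram C(h) = σ² exp(−(h/r)^p), 0 < p ≤ 2, whose log-likelihood is evaluated through a Cholesky factorization, the O(n³) step that dominates the cost the abstract quotes.

```python
# Power exponential covariogram and Gaussian log-likelihood via Cholesky.
# Illustrative sketch; parameter names (sigma2, range_, power) are ours.
import numpy as np

def power_exp_cov(dists, sigma2, range_, power):
    """Power exponential covariance as a function of distance."""
    return sigma2 * np.exp(-(dists / range_) ** power)

def gaussian_loglik(y, dists, sigma2, range_, power, nugget=1e-8):
    n = len(y)
    K = power_exp_cov(dists, sigma2, range_, power) + nugget * np.eye(n)
    L = np.linalg.cholesky(K)             # the O(n^3) factorization step
    alpha = np.linalg.solve(L, y)         # triangular solve, O(n^2)
    # -0.5 y'K^{-1}y - 0.5 log|K| - (n/2) log(2*pi)
    return (-0.5 * alpha @ alpha
            - np.log(np.diag(L)).sum()
            - 0.5 * n * np.log(2 * np.pi))

# Toy 1-D spatial data on a transect
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 30))
dists = np.abs(x[:, None] - x[None, :])
K0 = power_exp_cov(dists, 1.0, 2.0, 1.5) + 1e-8 * np.eye(30)
y = rng.multivariate_normal(np.zeros(30), K0)
print(gaussian_loglik(y, dists, 1.0, 2.0, 1.5))
```

An MLE search, as in the paper, would maximize this function over the covariance parameters; the gradient evaluations the paper describes reuse the same factorization.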


Bioinformatics | 2007

Segmentation and intensity estimation of microarray images using a gamma-t mixture model

Jangsun Baek; Young Sook Son; Geoffrey J. McLachlan

MOTIVATION: We present a new approach to the analysis of images for complementary DNA microarray experiments. The image segmentation and intensity estimation are performed simultaneously by adopting a two-component mixture model. One component of this mixture corresponds to the distribution of the background intensity, while the other corresponds to the distribution of the foreground intensity. The intensity measurement is a bivariate vector consisting of red and green intensities. The background intensity component is modeled by the bivariate gamma distribution, whose marginal densities for the red and green intensities are independent three-parameter gamma distributions with different parameters. The foreground intensity component is taken to be the bivariate t distribution, with the constraint that the mean of the foreground is greater than that of the background for each of the two colors. The degrees of freedom of this t distribution are inferred from the data, but they could be specified in advance to reduce the computation time. Also, the covariance matrix is not restricted to being diagonal, so it allows for nonzero correlation between the red and green foreground intensities. This gamma-t mixture model is fitted by maximum likelihood via the EM algorithm. A final step is executed whereby nonparametric (kernel) smoothing is undertaken of the posterior probabilities of component membership.

The main advantages of this approach are: (1) it enjoys the well-known strengths of a mixture model, namely flexibility and adaptability to the data; (2) it considers the segmentation and intensity estimation simultaneously, not separately as in commonly used existing software, and it works with the red and green intensities in a bivariate framework as opposed to their separate estimation via univariate methods; (3) the use of the three-parameter gamma distribution for the background red and green intensities provides a much better fit than the normal (log normal) or t distributions; (4) the use of the bivariate t distribution for the foreground intensity provides a model that is less sensitive to extreme observations; (5) as a consequence of the aforementioned properties, it allows segmentation to be undertaken for a wide range of spot shapes, including doughnut, sickle shape, and artifacts.

RESULTS: We apply our method for gridding, segmentation, and estimation to real cDNA microarray images and artificial data. Our method provides better segmentation results for various spot shapes, as well as better intensity estimation, than the Spot and spotSegmentation R packages. It detected blank spots as well as bright artifacts in the real data, and estimated spot intensities with high accuracy for the synthetic data.

AVAILABILITY: The algorithms were implemented in Matlab. The Matlab codes implementing both the gridding and the segmentation/estimation are available upon request.

SUPPLEMENTARY INFORMATION: Supplementary material is available at Bioinformatics online.


Stochastic Processes and their Applications | 1993

Kernel estimation for additive models under dependence

Jangsun Baek; Thomas E. Wehrly

Nonparametric estimation of the conditional mean function for additive models is investigated in cases where the observed data are dependent. We use an additive kernel estimator which is a sum of Nadaraya-Watson estimators. Under a strong mixing condition, the kernel estimator is shown to be asymptotically normal and to achieve the univariate optimal rate of convergence in mean squared error.
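A sketch of the estimator described above: for an additive model m(x₁, x₂) = m₁(x₁) + m₂(x₂), sum univariate Nadaraya-Watson estimators and subtract the overall mean once, since each marginal estimator absorbs the contribution of the other component. Illustrative i.i.d. data here, not the dependent (strong mixing) setting the paper analyzes.

```python
# Additive kernel estimator as a sum of Nadaraya-Watson estimators.
import numpy as np

def nadaraya_watson(x0, x, y, h):
    """Univariate Nadaraya-Watson estimator with a Gaussian kernel."""
    w = np.exp(-0.5 * ((x0 - x) / h) ** 2)
    return (w * y).sum() / w.sum()

rng = np.random.default_rng(2)
n = 400
x1, x2 = rng.uniform(-2, 2, n), rng.uniform(-2, 2, n)
y = np.sin(x1) + 0.5 * x2**2 + rng.normal(0, 0.2, n)

def additive_estimate(u1, u2, h=0.3):
    # Each marginal NW estimate targets m_j(u_j) + E[other component],
    # so their sum overcounts E[Y] once; subtract the sample mean.
    m1 = nadaraya_watson(u1, x1, y, h)
    m2 = nadaraya_watson(u2, x2, y, h)
    return m1 + m2 - y.mean()

print(additive_estimate(1.0, 1.0))   # should land near sin(1) + 0.5
```

The point of the additive structure, as the abstract notes, is that each component is estimated by a univariate smoother, so the univariate rate of convergence is attainable despite the multivariate regression function.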


Computational Statistics & Data Analysis | 1996

A bootstrap generalized likelihood ratio test in discriminant analysis

Henry L. Gray; Jangsun Baek; Wayne A. Woodward; Jeffrey Miller; Mark D. Fisk

A generalized likelihood ratio test is developed for classification into two populations when one needs to control one of the probabilities of misclassification. The proposed classification procedure is constructed by applying the parametric bootstrap to the generalized likelihood ratio. There are known methods for controlling this misclassification probability in the case where normal distributions with the same covariance matrix are assumed. Our approach, however, can be applied not only to this case but also to the case of normal distributions with different covariance matrices and the case of a mixture of discrete and continuous variables. The results given here do not depend on normality but can, in fact, be applied to any distribution for which the maximum likelihood estimates exist. We do, however, restrict our simulations to the normal distribution if the variates are all continuous. Three cases are simulated: normal distributions with an equal covariance matrix, normal distributions with unequal covariance matrices, and a mixture of categorical and normal variables. An application to classifying seismic events is presented.
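A simplified sketch of the calibration idea in the simplest setting, two univariate normals with unequal variances: evaluate the likelihood ratio at plug-in MLEs, then use a parametric bootstrap from the fitted population-1 model to choose the cutoff controlling that population's misclassification probability. The paper's procedure is more elaborate (it bootstraps the generalized likelihood ratio with re-estimation); this is only the plug-in version.

```python
# Parametric-bootstrap calibration of a likelihood ratio cutoff so that
# P(misclassify a population-1 observation) is approximately alpha.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x1 = rng.normal(0.0, 1.0, 100)           # training sample, population 1
x2 = rng.normal(2.0, 1.5, 100)           # training sample, population 2

# MLEs (np.std with ddof=0 is the maximum likelihood variance estimate)
m1, s1 = x1.mean(), x1.std()
m2, s2 = x2.mean(), x2.std()

def log_lr(z):
    """Log likelihood ratio of population 2 vs population 1 at z."""
    return stats.norm.logpdf(z, m2, s2) - stats.norm.logpdf(z, m1, s1)

# Parametric bootstrap: simulate from the fitted population-1 model and
# take the (1 - alpha) quantile of the statistic as the cutoff.
alpha = 0.05
boot = log_lr(rng.normal(m1, s1, 20000))
cutoff = np.quantile(boot, 1 - alpha)

def classify(z):
    return 2 if log_lr(z) > cutoff else 1

# By construction, roughly alpha of fresh population-1 draws are misclassified
test1 = rng.normal(0.0, 1.0, 5000)
err1 = np.mean([classify(z) == 2 for z in test1])
print(err1)
```

Because the cutoff comes from simulating the fitted model rather than from a normal-theory formula, the same recipe extends to unequal covariances and mixed discrete/continuous variables, which is the point the abstract makes.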


Studies in Classification, Data Analysis, and Knowledge Organization | 2009

Clustering of high-dimensional data via finite mixture models

Geoff McLachlan; Jangsun Baek

Finite mixture models are commonly used in a wide range of applications concerning density estimation and clustering. An attractive feature of this approach to clustering is that it provides a sound statistical framework in which to assess the important questions of how many clusters there are in the data and of their validity. We review the application of normal mixture models to high-dimensional data of a continuous nature. One way to handle the fitting of normal mixture models is to adopt mixtures of factor analyzers. They enable model-based density estimation and clustering to be undertaken for high-dimensional data, where the number of observations n is not very large relative to their dimension p. In practice, there is often the need to reduce further the number of parameters in the specification of the component-covariance matrices. We focus here on a modified approach that uses common component-factor loadings, which considerably reduces the number of parameters further. Moreover, it allows the data to be displayed in low-dimensional plots.


Advanced Data Mining and Applications | 2006

Local linear logistic discriminant analysis with partial least square components

Jangsun Baek; Young Sook Son

We propose a nonparametric local linear logistic approach based on local likelihood for multi-class discrimination. The combination of local linear logistic discriminant analysis and partial least squares components yields better prediction results than conventional statistical classifiers in cases where the class boundaries are curved. We applied our method to both synthetic and real data sets.

Collaboration


Top co-authors of Jangsun Baek:

Min-Soo Kim (Daegu Gyeongbuk Institute of Science and Technology)
Young Sook Son (Chonnam National University)
Mi-Ra Oh (Chonnam National University)
Gueesang Lee (Chonnam National University)
Henry L. Gray (Southern Methodist University)
Wayne A. Woodward (Southern Methodist University)
Jeong-Soo Park (Chonnam National University)