
Publication


Featured research published by Chen-An Tsai.


Bioinformatics | 2009

Multivariate analysis of variance test for gene set analysis

Chen-An Tsai; James J. Chen

MOTIVATION Gene class testing (GCT) or gene set analysis (GSA) is a statistical approach to determine whether some functionally predefined sets of genes express differently under different experimental conditions. Shortcomings of Fisher's exact test for over-representation analysis are illustrated by an example. Most alternative GSA methods are developed for data collected from two experimental conditions, and most are based on a univariate gene-by-gene test statistic or assume independence among genes in the gene set. A multivariate analysis of variance (MANOVA) approach is proposed for studies with two or more experimental conditions. RESULTS When the number of genes in the gene set is greater than the number of samples, the sample covariance matrix is singular and ill-conditioned. The use of standard multivariate methods can result in biases in the analysis. The proposed MANOVA test uses a shrinkage estimator for the sample covariance matrix. The MANOVA test and six other published GSA methods (principal component analysis, SAM-GS, analysis of covariance, Global, GSEA and MaxMean) are evaluated using simulation. The MANOVA test appears to perform best in terms of control of type I error and power under the models considered in the simulation. Several publicly available microarray datasets under two and three experimental conditions are analyzed to illustrate GSA. Most methods, except for GSEA and MaxMean, are generally comparable in terms of power to identify significant gene sets. AVAILABILITY Free R code to perform the MANOVA test is available at http://mail.cmu.edu.tw/~catsai/research.htm. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
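The shrinkage idea in the abstract can be sketched in a few lines: when genes outnumber samples the sample covariance matrix is singular, and blending it with its diagonal restores invertibility. A minimal Python illustration, not the paper's exact estimator; the fixed weight `lam` is a placeholder for the data-driven shrinkage intensity the paper estimates:

```python
def sample_cov(rows):
    """Sample covariance of variables (gene rows) across samples (columns)."""
    n = len(rows[0])
    means = [sum(r) / n for r in rows]
    p = len(rows)
    return [[sum((rows[i][k] - means[i]) * (rows[j][k] - means[j])
                 for k in range(n)) / (n - 1)
             for j in range(p)] for i in range(p)]

def shrink_cov(S, lam=0.5):
    """Convex combination of S and its diagonal target: diagonal entries
    are kept, off-diagonal entries are shrunk toward zero by factor 1-lam."""
    p = len(S)
    return [[S[i][j] if i == j else (1 - lam) * S[i][j]
             for j in range(p)] for i in range(p)]
```

With more genes than samples (here three genes, two samples), `sample_cov` is rank-deficient, but the shrunk version is diagonally dominant enough to invert.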


Bioinformatics | 2004

Analysis of variance components in gene expression data

James J. Chen; Robert R. Delongchamp; Chen-An Tsai; Huey-miin Hsueh; Frank D. Sistare; Karol L. Thompson; Varsha G. Desai; James C. Fuscoe

MOTIVATION A microarray experiment is a multi-step process, and each step is a potential source of variation. There are two major sources of variation: biological variation and technical variation. This study presents a variance-components approach to investigating animal-to-animal, between-array, within-array and day-to-day variations for two data sets. The first data set involved estimation of technical variances for pooled control and pooled treated RNA samples. The variance components included between-array, and two nested within-array variances: between-section (the upper- and lower-sections of the array are replicates) and within-section (two adjacent spots of the same gene are printed within each section). The second experiment was conducted on four different weeks. Each week there were reference and test samples with a dye-flip replicate in two hybridization days. The variance components included week-to-week, animal-to-animal and between-array and within-array variances. RESULTS We applied the linear mixed-effects model to quantify different sources of variation. In the first data set, we found that the between-array variance is greater than the between-section variance, which, in turn, is greater than the within-section variance. In the second data set, for the reference samples, the week-to-week variance is larger than the between-array variance, which, in turn, is slightly larger than the within-array variance. For the test samples, the week-to-week variance has the largest variation. The animal-to-animal variance is slightly larger than the between-array and within-array variances. However, in a gene-by-gene analysis, the animal-to-animal variance is smaller than the between-array variance in four out of five housekeeping genes. In summary, the largest variation observed is the week-to-week effect. Another important source of variability is the animal-to-animal variation. 
Finally, we describe the use of variance-component estimates to determine optimal numbers of animals, arrays per animal and sections per array in planning microarray experiments.
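The variance-component decomposition described above can be illustrated, for a single random grouping factor (e.g. arrays, with replicate spots within each array), by a method-of-moments ANOVA sketch; the paper itself fits richer nested designs with a linear mixed-effects model, and all names here are illustrative:

```python
def variance_components(groups):
    """groups: equal-size lists of replicate measurements, one list per group.
    Returns (between_group_var, within_group_var) via ANOVA mean squares:
    sigma_b^2 = (MSB - MSW) / r, sigma_w^2 = MSW."""
    k = len(groups)            # number of groups (e.g. arrays)
    r = len(groups[0])         # replicates per group
    grand = sum(sum(g) for g in groups) / (k * r)
    means = [sum(g) / r for g in groups]
    msb = r * sum((m - grand) ** 2 for m in means) / (k - 1)
    msw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g) / (k * (r - 1))
    return max((msb - msw) / r, 0.0), msw
```

Estimates like these feed directly into the design question at the end of the abstract: given the component sizes, one can trade off numbers of animals, arrays, and sections.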


Bioinformatics | 2005

Sample size for gene expression microarray experiments

†The views presented in this paper are those of the authors and do not necessarily represent those of the US Food and Drug Administration.

Chen-An Tsai; Sue-Jane Wang; Dung-Tsa Chen; James J. Chen

MOTIVATION Microarray experiments often involve hundreds or thousands of genes. In a typical experiment, only a fraction of genes are expected to be differentially expressed; in addition, the measured intensities among different genes may be correlated. Depending on the experimental objectives, sample size calculations can be based on one of three specified measures: sensitivity, true discovery rate and accuracy rate. The sample size problem is formulated as the number of arrays needed to achieve the desired fraction of the specified measure at the desired family-wise power, given the type I error and (standardized) effect size. RESULTS We present a general approach for estimating sample size under independent and equally correlated models using binomial and beta-binomial models, respectively. The sample sizes needed for a two-sample z-test are computed; the computed theoretical numbers agree well with Monte Carlo simulation results. However, under more general correlation structures, the beta-binomial model can underestimate the required number of arrays by about 1-5. CONTACT [email protected].
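The binomial formulation can be sketched as follows, assuming independent genes, a two-sided two-sample z-test, and a Bonferroni per-gene level; `arrays_needed` and its defaults are hypothetical helpers for illustration, not the authors' code:

```python
from math import ceil, comb, sqrt
from statistics import NormalDist

_N = NormalDist()

def gene_power(n, delta, alpha):
    """Power of a two-sided two-sample z-test with n arrays per group
    and standardized effect size delta (upper tail only; the lower
    tail is negligible for the effect sizes considered)."""
    z = _N.inv_cdf(1 - alpha / 2)
    return 1 - _N.cdf(z - delta * sqrt(n / 2))

def arrays_needed(m, m1, delta, alpha_fw=0.05, frac=0.8, power_fw=0.8):
    """Smallest n per group such that, under the independence (binomial)
    model, at least frac of the m1 truly changed genes (out of m genes)
    are detected with probability power_fw, at Bonferroni level alpha_fw/m."""
    k = ceil(frac * m1)
    for n in range(2, 501):
        p = gene_power(n, delta, alpha_fw / m)
        tail = sum(comb(m1, j) * p ** j * (1 - p) ** (m1 - j)
                   for j in range(k, m1 + 1))
        if tail >= power_fw:
            return n
    raise ValueError("no n <= 500 achieves the requested power")
```

The beta-binomial variant of the abstract would replace the binomial tail with a beta-binomial one to allow equal correlation among detection events.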


Sar and Qsar in Environmental Research | 2005

Classification ensembles for unbalanced class sizes in predictive toxicology

James J. Chen; Chen-An Tsai; John F. Young; Ralph L. Kodell

This paper investigates the effects of the ratio of positive-to-negative samples on the sensitivity, specificity, and concordance. When the class sizes in the training samples are not equal, the classification rule derived will favor the majority class and result in a low sensitivity on the minority class prediction. We propose an ensemble classification approach to adjust for differential class sizes in a binary classifier system. An ensemble classifier consists of a set of base classifiers; its prediction rule is based on a summary measure of individual classifications by the base classifiers. Two re-sampling methods, augmentation and abatement, are proposed to generate different bootstrap samples of equal class size to build the base classifiers. The augmentation method balances the two class sizes by bootstrapping additional samples from the minority class, whereas the abatement method balances the two class sizes by sampling only a subset of samples from the majority class. The proposed procedure is applied to a data set to predict estrogen receptor binding activity and to a data set to predict animal liver carcinogenicity using SAR (structure-activity relationship) models as base classifiers. The abatement method appears to perform well in balancing sensitivity and specificity.
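The two re-sampling schemes are simple to sketch; each balanced resample would then train one base classifier, with predictions combined by a summary measure such as majority vote. A hedged Python illustration (function names are mine, not the paper's):

```python
import random

def augment(majority, minority, rng):
    """Augmentation: bootstrap extra minority-class samples (with
    replacement) until the two class sizes match."""
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority, minority + extra

def abate(majority, minority, rng):
    """Abatement: subsample the majority class (without replacement)
    down to the minority-class size."""
    return rng.sample(majority, len(minority)), minority
```

Repeating either scheme B times yields B balanced training sets and hence B base classifiers for the ensemble.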


BMC Bioinformatics | 2007

Gene selection with multiple ordering criteria

James J. Chen; Chen-An Tsai; ShengLi Tzeng; Chun-Houh Chen

Background: A microarray study may select different differentially expressed gene sets because of different selection criteria. For example, the fold change and p-value are two commonly used criteria to select differentially expressed genes under two experimental conditions. These two selection criteria often result in incompatible selected gene sets. Also, in a two-factor experiment, say treatment by time, the investigator may be interested in one gene list that responds to both treatment and time effects. Results: We propose three layer-ranking algorithms, point-admissible, line-admissible (convex), and Pareto, to provide a preference gene list from multiple gene lists generated by different ranking criteria. Using the public colon data as an example, the layer-ranking algorithms are applied to three univariate ranking criteria: fold change, p-value, and frequency of selection by the SVM-RFE classifier. A simulation experiment shows that for experiments with small or moderate sample sizes (less than 20 per group) and a 4-fold change or less to detect, the two-dimensional (p-value and fold-change) convex layer ranking selects differentially expressed genes with generally lower FDR and higher power than the standard p-value ranking. Three applications are presented. The first illustrates a use of the layer rankings to potentially improve predictive accuracy. The second illustrates an application to a two-factor experiment involving two dose levels and two time points, where the layer rankings are applied to select differentially expressed genes relating to the dose and time effects. In the third, the layer rankings are applied to a benchmark data set consisting of three dilution concentrations to provide a ranking system for a long list of differentially expressed genes generated from the three dilution concentrations. Conclusion: The layer-ranking algorithms are useful for helping investigators select the most promising genes from multiple gene lists generated by different filter, normalization, or analysis methods for various objectives.
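Of the three algorithms, the Pareto layer ranking is the easiest to sketch: repeatedly peel off the set of genes not dominated in every criterion, and rank genes by the layer in which they fall. This illustration assumes all criteria are oriented so that smaller is better (e.g. p-value, and a decreasing transform of fold change); it is not the paper's implementation:

```python
def pareto_layers(points):
    """Peel successive Pareto-optimal fronts from a list of criterion
    tuples (smaller is better in every coordinate). Returns a list of
    layers, each a sorted list of indices into `points`."""
    remaining = set(range(len(points)))
    layers = []
    while remaining:
        # A point is in the current front if no other remaining point
        # is at least as good in every coordinate and different overall.
        front = {i for i in remaining
                 if not any(j != i and points[j] != points[i]
                            and all(pj <= pi for pj, pi in
                                    zip(points[j], points[i]))
                            for j in remaining)}
        layers.append(sorted(front))
        remaining -= front
    return layers
```

For two criteria, layer 1 contains the genes where improving one criterion necessarily worsens the other, which is exactly the trade-off the abstract describes between p-value and fold-change rankings.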


Bioinformatics | 2007

Significance analysis of groups of genes in expression profiling studies

James J. Chen; Taewon Lee; Robert R. Delongchamp; Tao Chen; Chen-An Tsai

MOTIVATION Gene class testing (GCT) is a statistical approach to determine whether some functionally predefined classes of genes express differently under two experimental conditions. GCT computes the P-value of each gene class based on the null distribution, and the gene classes are ranked for importance in accordance with their P-values. Currently, two null hypotheses have been considered: the Q1 hypothesis tests the relative strength of association with the phenotypes among the gene classes, and the Q2 hypothesis assesses statistical significance. These two hypotheses are related but not equivalent. METHOD We investigate three one-sided and two two-sided test statistics under Q1 and Q2. The null distributions of gene classes under Q1 are generated by permuting gene labels, and the null distributions under Q2 are generated by permuting samples. RESULTS We applied the five statistics to a diabetes dataset with 143 gene classes and to a breast cancer dataset with 508 GO (Gene Ontology) terms. For each statistic, the null distributions of the gene classes under Q1 differ from those under Q2 in both datasets, and their rankings can differ as well. We clarify the one-sided and two-sided hypotheses and discuss some issues regarding the Q1 and Q2 hypotheses for gene class ranking in GCT. Because Q1 does not account for correlations among genes, we prefer tests based on Q2. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
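The distinction between the two nulls can be sketched for a simple mean-score class statistic: Q1 permutes gene labels, i.e. compares the class against random gene sets of the same size, whereas Q2 would permute sample labels and recompute the per-gene scores each round, preserving inter-gene correlation. A hypothetical illustration of the Q1 side:

```python
import random
from statistics import mean

def class_score(gene_stats, members):
    """Gene-class statistic: the mean per-gene score over the class."""
    return mean(gene_stats[g] for g in members)

def q1_pvalue(gene_stats, members, n_perm, rng):
    """Q1-style permutation p-value: how often does a random gene set of
    the same size score at least as high as the observed class?"""
    obs = class_score(gene_stats, members)
    genes = range(len(gene_stats))
    hits = sum(class_score(gene_stats, rng.sample(genes, len(members))) >= obs
               for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)
```

A Q2 version would take the raw expression matrix and phenotype labels, shuffle the labels, and recompute `gene_stats` inside the loop; only the outer permutation changes.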


Sar and Qsar in Environmental Research | 2006

Decision threshold adjustment in class prediction

James J. Chen; Chen-An Tsai; Hojin Moon; Hongshik Ahn; J. J. Young; Chun-Houh Chen

Standard classification algorithms are generally designed to maximize the number of correct predictions (concordance). The criterion of maximizing concordance may not be appropriate in certain applications. In practice, some applications may emphasize high sensitivity (e.g., clinical diagnostic tests) and others high specificity (e.g., epidemiology screening studies). This paper considers the effects of the decision threshold on sensitivity, specificity, and concordance for four classification methods: logistic regression, classification tree, Fisher's linear discriminant analysis, and weighted k-nearest neighbor. We investigated the use of decision threshold adjustment to improve the sensitivity or specificity of a classifier under specific conditions. We conducted a Monte Carlo simulation showing that as the decision threshold increases, sensitivity decreases and specificity increases, while the concordance values in an interval around the maximum concordance are similar. For specified sensitivity and specificity levels, an optimal decision threshold can be determined in an interval around the maximum concordance that meets the specified requirement. Three example data sets are analyzed for illustration.
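Threshold adjustment itself is easy to sketch: sweep candidate thresholds over the classifier's scores and take the largest one that still meets a required sensitivity, which maximizes specificity subject to that floor. An illustrative Python sketch, not tied to any of the four classifiers studied:

```python
def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity when calling positive for score >= threshold.
    labels are truthy for positives, falsy for negatives."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and not y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    return tp / (tp + fn), tn / (tn + fp)

def pick_threshold(scores, labels, min_sens):
    """Largest threshold whose sensitivity still meets min_sens, i.e. the
    most specific rule subject to the sensitivity floor. Returns
    (threshold, sensitivity, specificity), or None if unattainable."""
    for t in sorted(set(scores), reverse=True):
        sens, spec = sens_spec(scores, labels, t)
        if sens >= min_sens:
            return t, sens, spec
    return None
```

Swapping the roles of sensitivity and specificity (sweep upward, floor on specificity) gives the screening-study variant.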


Journal of Biopharmaceutical Statistics | 2004

Significance Analysis of ROC Indices for Comparing Diagnostic Markers: Applications to Gene Microarray Data

Chen-An Tsai; James J. Chen

Abstract A common objective in microarray experiments is to select genes that are differentially expressed between two classes (two treatment groups). Selection of differentially expressed genes involves two steps. The first step is to calculate a discriminatory score that will rank the genes in order of evidence of differential expressions. The second step is to determine a cutoff for the ranked scores. Summary indices of the receiver operating characteristic (ROC) curve provide relative measures for a ranking of differential expressions. This article proposes using the hypothesis-testing approach to compute the raw p-values and/or adjusted p-values for three ROC discrimination measures. A cutoff p-value can be determined from the (ranked) p-values or the adjusted p-values to select differentially expressed genes. To quantify the degree of confidence in the selected top-ranked genes, the conditional false discovery rate (FDR) over the selected gene set and the “Type I” (false positive) error probability for each selected gene are estimated. The proposed approach is applied to a public colon tumor data set for illustration. The selected gene sets from three ROC summary indices and the commonly used two-sample t-statistic are applied to the sample classification to evaluate the predictability of the four discrimination measures.
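The ROC area, one natural summary index, equals the Mann-Whitney probability that a positive-class score exceeds a negative-class one, and a permutation p-value for it can be sketched directly; this is a generic illustration of the hypothesis-testing approach, not the article's specific three indices:

```python
import random

def auc(pos, neg):
    """ROC area via the Mann-Whitney statistic: the probability that a
    randomly chosen positive score exceeds a randomly chosen negative
    one, with ties counting one half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_pvalue(pos, neg, n_perm, rng):
    """Two-sided permutation p-value for the AUC differing from 0.5,
    obtained by reshuffling class membership of the pooled scores."""
    obs = abs(auc(pos, neg) - 0.5)
    pooled = pos + neg
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(auc(pooled[:len(pos)], pooled[len(pos):]) - 0.5) >= obs:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Ranking genes by such raw p-values, then applying a multiplicity adjustment and an FDR estimate over the selected set, follows the two-step selection described in the abstract.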


Computational Statistics & Data Analysis | 2007

Kernel estimation for adjusted p-values in multiple testing

Chen-An Tsai; James J. Chen

Multiple testing procedures are frequently applied in biomedical and genomic research, for instance to identify differentially expressed genes in microarray experiments. Resampling methods are commonly used to compute adjusted p-values in multiple hypothesis testing problems. Importantly, resampling-based multiple testing procedures are sensitive to the number of permutations, especially the MinP adjustment procedure. The single-step MinP adjusted p-values are derived from the distribution of the minimum of the p-values. Because of computational complexity, adjusted p-values are often computed instead from the distribution of the maximum of the test statistics (MaxT). This paper proposes an approach based on kernel density estimation (KDE) to reduce the number of permutations needed to implement the single-step MinP adjustment. Simulation studies demonstrate that the KDE method is more powerful than the MinP adjustment method under independent and correlated models. The three resampling-based single-step adjustment procedures, MaxT, MinP, and KDE, are applied to two published microarray data sets: the colon tumor data set, consisting of 40 tumor and 22 normal colon tissue samples on 2000 human genes (endpoints), and the leukemia data set, consisting of 27 acute lymphoblastic leukemia and 11 acute myeloid leukemia samples on 3051 genes. The MaxT adjusted p-values are very robust to the number of permutations and are stable with 10,000 permutations, while the MinP adjusted p-values are step functions; as the number of permutations increases, the number of ties decreases, and the MinP adjusted p-values become stable with 500,000 permutations. For the KDE method, the adjusted p-values are stable at 50,000 permutations. At 1,000,000 permutations, the three procedures have similar adjusted p-values.
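The KDE smoothing idea can be sketched with a Gaussian-kernel estimate of the CDF of the permutation null distribution of the minimum p-value; the bandwidth value here is an arbitrary placeholder, not the paper's choice, and the function names are illustrative:

```python
from statistics import NormalDist

def kde_cdf(samples, x, bandwidth):
    """Smoothed estimate of P(X <= x): average of Gaussian-kernel CDFs
    centered at each sample, which is the CDF of the KDE density."""
    nd = NormalDist()
    return sum(nd.cdf((x - s) / bandwidth) for s in samples) / len(samples)

def minp_adjust(p_obs, null_min_pvalues, bandwidth=0.01):
    """Single-step MinP adjusted p-value, P(min of the null p-values <=
    p_obs), smoothed by KDE so that fewer permutations already yield a
    non-step estimate."""
    return min(1.0, kde_cdf(null_min_pvalues, p_obs, bandwidth))
```

The empirical-CDF version of the same quantity is a step function in `p_obs`; the kernel average interpolates between the permutation samples, which is why fewer permutations suffice for stability.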


Journal of Biopharmaceutical Statistics | 2004

A Generalized Additive Model For Microarray Gene Expression Data Analysis

Chen-An Tsai; Huey-miin Hsueh; James J. Chen

Abstract Microarray technology allows the measurement of expression levels of a large number of genes simultaneously. There are inherent biases in microarray data generated from an experiment. Various statistical methods have been proposed for data normalization and data analysis. This paper proposes a generalized additive model for the analysis of gene expression data. This model consists of two sub-models: a non-linear model and a linear model. We propose a two-step normalization algorithm to fit the two sub-models sequentially. The first step involves a non-parametric regression using lowess fits to adjust for non-linear systematic biases. The second step uses a linear ANOVA model to estimate the remaining effects, including the gene-by-treatment interaction, the effect of interest in a study. The proposed model is a generalization of the ANOVA model for microarray data analysis. We show correspondences between the lowess fit and the ANOVA model methods. The normalization procedure does not assume that the majority of genes are unchanged in expression, nor does it assume that the two channel intensities from the same spot are independent. The procedure can be applied to one-channel or two-channel data from experiments with multiple treatments or multiple nuisance factors. Two toxicogenomic experiment data sets and a simulated data set are used to contrast the proposed method with the commonly used lowess fit and ANOVA methods.
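The two-step structure can be caricatured with a crude intensity-binning smoother standing in for the lowess fit (step 1); the residuals it returns would then enter the linear ANOVA model (step 2). This is a sketch under that simplification, all names illustrative:

```python
def normalize_intensity(A, M, n_bins=4):
    """Remove intensity-dependent bias: subtract the per-bin mean of M
    (log-ratio) across the range of A (mean log-intensity), a crude
    stand-in for a lowess fit of M on A. Returns the residuals."""
    lo, hi = min(A), max(A)
    width = (hi - lo) / n_bins or 1.0
    idx = [min(int((a - lo) / width), n_bins - 1) for a in A]
    bins = [[] for _ in range(n_bins)]
    for i, m in zip(idx, M):
        bins[i].append(m)
    fit = [sum(b) / len(b) if b else 0.0 for b in bins]
    return [m - fit[i] for i, m in zip(idx, M)]
```

A real lowess replaces the bin means with locally weighted regressions, giving a smooth fitted curve instead of a piecewise-constant one, but the subtraction step is the same.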

Collaboration


Dive into Chen-An Tsai's collaborations.

Top Co-Authors

James J. Chen, National Center for Toxicological Research
Huey-miin Hsueh, National Chengchi University
Robert R. Delongchamp, University of Arkansas for Medical Sciences
Chien-Ju Lin, National Center for Toxicological Research
James C. Fuscoe, National Center for Toxicological Research
John F. Young, National Center for Toxicological Research
Ralph L. Kodell, University of Arkansas for Medical Sciences
Suzanne M. Morris, National Center for Toxicological Research