Chao Sima
Translational Genomics Research Institute
Publications
Featured research published by Chao Sima.
Bioinformatics | 2010
Blaise Hanczar; Jianping Hua; Chao Sima; John N. Weinstein; Michael L. Bittner; Edward R. Dougherty
MOTIVATION Receiver operating characteristic (ROC) curves are commonly used in biomedical applications to judge the performance of a discriminant across varying decision thresholds. The estimated ROC curve depends on the true positive rate (TPR) and false positive rate (FPR), with the key metric being the area under the curve (AUC). With small samples these rates need to be estimated from the training data, so a natural question arises: How well do the estimates of the AUC, TPR and FPR compare with the true metrics? RESULTS Through a simulation study using data models and analysis of real microarray data, we show that (i) for small samples the root mean square differences of the estimated and true metrics are considerable; (ii) even for large samples, there is only weak correlation between the true and estimated metrics; and (iii) generally, there is weak regression of the true metric on the estimated metric. For classification rules, we consider linear discriminant analysis, linear support vector machine (SVM) and radial basis function SVM. For error estimation, we consider resubstitution, three kinds of cross-validation and bootstrap. Using resampling, we show the unreliability of some published ROC results. AVAILABILITY Companion web site at http://compbio.tgen.org/paper_supp/ROC/roc.html CONTACT [email protected].
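The AUC discussed in this abstract can be computed directly from classifier scores as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one (the Mann-Whitney form, with ties counting half). A minimal Python sketch, not taken from the paper itself:

```python
def empirical_auc(pos_scores, neg_scores):
    """Empirical AUC: fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0      # positive correctly ranked above negative
            elif p == n:
                wins += 0.5      # ties count half
    return wins / (len(pos_scores) * len(neg_scores))
```

With small samples, this estimate is itself a random quantity, which is exactly the variability the paper quantifies.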
Pattern Recognition | 2007
Marcel Brun; Chao Sima; Jianping Hua; James Lowey; Brent Carroll; Edward Suh; Edward R. Dougherty
A cluster operator takes a set of data points and partitions the points into clusters (subsets). As with any scientific model, the scientific content of a cluster operator lies in its ability to predict results. This ability is measured by its error rate relative to cluster formation. To estimate the error of a cluster operator, a sample of point sets is generated, the algorithm is applied to each point set and the clusters evaluated relative to the known partition according to the distributions, and then the errors are averaged over the point sets composing the sample. Many validity measures have been proposed for evaluating clustering results based on a single realization of the random-point-set process. In this paper we consider a number of proposed validity measures and we examine how well they correlate with error rates across a number of clustering algorithms and random-point-set models. Validity measures fall broadly into three classes: internal validation is based on calculating properties of the resulting clusters; relative validation is based on comparisons of partitions generated by the same algorithm with different parameters or different subsets of the data; and external validation compares the partition generated by the clustering algorithm and a given partition of the data. To quantify the degree of similarity between the validation indices and the clustering errors, we use Kendall's rank correlation between their values. Our results indicate that, overall, the performance of validity indices is highly variable. For complex models or when a clustering algorithm yields complex clusters, both the internal and relative indices fail to predict the error of the algorithm. Some external indices appear to perform well, whereas others do not.
We conclude that one should not put much faith in a validity score unless there is evidence, either in terms of sufficient data for model estimation or prior model knowledge, that a validity measure is well-correlated to the error rate of the clustering algorithm.
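Kendall's rank correlation, used above to quantify agreement between validity indices and clustering errors, can be sketched in a few lines (the tau-a variant, with no tie correction; an illustration only, not the paper's implementation):

```python
def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1   # the pair is ordered the same way in both lists
            elif s < 0:
                discordant += 1   # the pair is ordered oppositely
    return (concordant - discordant) / (n * (n - 1) / 2)
```

A value near 1 means a validity index ranks clustering results almost exactly as the true error rate does; the paper's point is that this agreement is often weak.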
Bioinformatics | 2006
Chao Sima; Edward R. Dougherty
MOTIVATION High-throughput technologies for rapid measurement of vast numbers of biological variables offer the potential for highly discriminatory diagnosis and prognosis; however, high dimensionality together with small samples creates the need for feature selection, while at the same time making feature-selection algorithms less reliable. Feature selection must typically be carried out from among thousands of gene-expression features and in the context of a small sample (small number of microarrays). Two basic questions arise: (1) Can one expect feature selection to yield a feature set whose error is close to that of an optimal feature set? (2) If a good feature set is not found, should it be expected that good feature sets do not exist? RESULTS The two questions translate quantitatively into questions concerning conditional expectation. (1) Given the error of an optimal feature set, what is the conditionally expected error of the selected feature set? (2) Given the error of the selected feature set, what is the conditionally expected error of the optimal feature set? We address these questions using three classification rules (linear discriminant analysis, linear support vector machine and k-nearest-neighbor classification) and feature selection via sequential floating forward search and the t-test. We consider three feature-label models and patient data from a study concerning survival prognosis for breast cancer. With regard to the two focus questions, there is similarity across all experiments: (1) One cannot expect to find a feature set whose error is close to optimal, and (2) the inability to find a good feature set should not lead to the conclusion that good feature sets do not exist. 
In practice, the latter conclusion may be more immediately relevant, since when faced with the common occurrence that a feature set discovered from the data does not give satisfactory results, the experimenter can draw no conclusions regarding the existence or nonexistence of suitable feature sets. AVAILABILITY http://ee.tamu.edu/~edward/feature_regression/
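The t-test filter used as one of the feature-selection methods above ranks each feature by a two-sample t-statistic between the classes and keeps the top k. A hedged sketch of that idea (function names and data invented for illustration; this is not the authors' code):

```python
def t_statistic(a, b):
    """Welch two-sample t-statistic for one feature across two classes."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)   # unbiased variances
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / ((va / len(a) + vb / len(b)) ** 0.5)

def select_top_k(X0, X1, k):
    """X0, X1: per-class sample lists, each sample a list of feature values.
    Returns the indices of the k features with the largest |t|."""
    n_features = len(X0[0])
    scores = []
    for j in range(n_features):
        a = [s[j] for s in X0]
        b = [s[j] for s in X1]
        scores.append((abs(t_statistic(a, b)), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]
```

The paper's caution applies directly: with thousands of features and few samples, the feature set this returns can be far from optimal, and a poor result says little about whether a good feature set exists.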
Pattern Recognition | 2005
Chao Sima; Sanju N. Attoor; Ulisses Braga-Neto; James Lowey; Edward Suh; Edward R. Dougherty
Given a large set of potential features, it is usually necessary to find a small subset with which to classify. The task of finding an optimal feature set is inherently combinatoric and therefore suboptimal algorithms are typically used to find feature sets. If feature selection is based directly on classification error, then a feature-selection algorithm must base its decision on error estimates. This paper addresses the impact of error estimation on feature selection using two performance measures: comparison of the true error of the optimal feature set with the true error of the feature set found by a feature-selection algorithm, and the number of features among the truly optimal feature set that appear in the feature set found by the algorithm. The study considers seven error estimators applied to three standard suboptimal feature-selection algorithms and exhaustive search, and it considers three different feature-label model distributions. It draws two conclusions for the cases considered: (1) depending on the sample size and the classification rule, feature-selection algorithms can produce feature sets whose corresponding classifiers possess errors far in excess of the classifier corresponding to the optimal feature set; and (2) for small samples, differences in performances among the feature-selection algorithms are less significant than performance differences among the error estimators used to implement the algorithms. Moreover, keeping in mind that results depend on the particular classifier-distribution pair, for the error estimators considered in this study, bootstrap and bolstered resubstitution usually outperform cross-validation, and bolstered resubstitution usually performs as well as or better than bootstrap.
Pattern Recognition Letters | 2008
Chao Sima; Edward R. Dougherty
For a fixed sample size, a common phenomenon is that the error of a designed classifier decreases and then increases as the number of features grows. This peaking phenomenon has been recognized for forty years and depends on the classification rule and feature-label distribution. Historically, the peaking phenomenon has been treated by assuming a fixed ordering of the features, usually beginning with the strongest individual feature and proceeding with features of decreasing individual classification capability. This does not take into account feature-selection, which is commonplace in high-dimensional and small sample settings. This paper revisits the peaking phenomenon in the presence of feature-selection. Using massive simulation in a high-performance computing environment, the paper considers various combinations of feature-label models, feature-selection algorithms, and classifier models to produce a large library of error versus feature size curves. Owing to the prevalence of feature-selection in genomic classification, we also consider gene-expression-based classification of breast-cancer patient prognosis. Results vary widely and are strongly dependent on the combination. The error curves tend to fall into three categories: peaking, settling into a plateau, or falling very slowly over a long range of feature set sizes. It can be concluded that one should be wary of applying peaking results found in the absence of feature-selection to settings in which feature-selection is employed.
Current Bioinformatics | 2010
Edward R. Dougherty; Chao Sima; Jianping Hua; Blaise Hanczar; Ulisses Braga-Neto
Classification in bioinformatics often suffers from small samples in conjunction with large numbers of features, which makes error estimation problematic. When a sample is small, there is insufficient data to split the sample and the same data are used for both classifier design and error estimation. Error estimation can suffer from high variance, bias, or both. The problem of choosing a suitable error estimator is exacerbated by the fact that estimation performance depends on the rule used to design the classifier, the feature-label distribution to which the classifier is to be applied, and the sample size. This paper reviews the performance of training-sample error estimators with respect to several criteria: estimation accuracy, variance, bias, correlation with the true error, regression on the true error, and accuracy in ranking feature sets. A number of error estimators are considered: resubstitution, leave-one-out cross-validation, 10-fold cross-validation, bolstered resubstitution, semi-bolstered resubstitution, .632 bootstrap, .632+ bootstrap, and optimal bootstrap. It illustrates these performance criteria for certain models and for two real data sets, referring to the literature for more extensive applications of these criteria. The results given in the present paper are consistent with those in the literature and lead to two conclusions: (1) much greater effort needs to be focused on error estimation, and (2) owing to the generally poor performance of error estimators on small samples, for a conclusion based on a small-sample error estimator to be considered valid, it should be supported by evidence that the estimator in question can be expected to perform sufficiently well under the circumstances to justify the conclusion.
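The optimism of resubstitution discussed in this review is easiest to see with a nearest-neighbor rule: every training point is its own nearest neighbor, so resubstitution error is identically zero, whereas leave-one-out cross-validation removes the test point from the training set first. A self-contained sketch (1-NN on one-dimensional data, purely illustrative):

```python
def one_nn_predict(train_x, train_y, x):
    """Label of the closest training point to x."""
    best = min(range(len(train_x)), key=lambda i: abs(x - train_x[i]))
    return train_y[best]

def resubstitution_error(xs, ys):
    # Each test point is also in the training set, so 1-NN always
    # matches it to itself: resubstitution error is zero by construction.
    return sum(one_nn_predict(xs, ys, x) != y for x, y in zip(xs, ys)) / len(xs)

def loo_error(xs, ys):
    # Leave-one-out: hold out each point, train on the rest, test on it.
    errs = sum(
        one_nn_predict(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], xs[i]) != ys[i]
        for i in range(len(xs))
    )
    return errs / len(xs)
```

On overlapping classes the two estimates diverge sharply, which is the kind of estimator disagreement the review quantifies across models and data sets.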
Bioinformatics | 2010
Mohammadmahdi R. Yousefi; Jianping Hua; Chao Sima; Edward R. Dougherty
MOTIVATION It is commonplace for authors to propose a new classification rule, either the operator construction part or feature selection, and demonstrate its performance on real data sets, which often come from high-dimensional studies, such as from gene-expression microarrays, with small samples. Owing to the variability in feature selection and error estimation, individual reported performances are highly imprecise. Hence, if only the best test results are reported, then these will be biased relative to the overall performance of the proposed procedure. RESULTS This article characterizes reporting bias with several statistics and computes these statistics in a large simulation study using both modeled and real data. The results appear as curves giving the different reporting biases as functions of the number of samples tested when reporting only the best or second best performance. It does this for two classification rules, linear discriminant analysis (LDA) and 3-nearest-neighbor (3NN), and for filter and wrapper feature selection, t-test and sequential forward search. These were chosen on account of their well-studied properties and because they were amenable to the extremely large amount of processing required for the simulations. The results across all the experiments are consistent: there is generally large bias overriding what would be considered a significant performance differential, when reporting the best or second best performing data set. We conclude that there needs to be a database of data sets and that, for those studies depending on real data, results should be reported for all data sets in the database. AVAILABILITY Companion web site at http://gsp.tamu.edu/Publications/supplementary/yousefi09a/
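The reporting bias analyzed above can be illustrated with a toy simulation: even if a procedure has the same true error on every data set, each estimate comes from a finite test sample, so reporting only the best-performing data set is systematically optimistic. A sketch under invented numbers (true error, sample size, and data-set count are all hypothetical):

```python
import random

random.seed(0)
TRUE_ERROR = 0.30     # hypothetical true error of the procedure on every data set
N_DATASETS = 20
SAMPLE_SIZE = 50

def noisy_estimate():
    """Estimate the error from a finite test sample: Binomial(n, p) / n."""
    misclassified = sum(random.random() < TRUE_ERROR for _ in range(SAMPLE_SIZE))
    return misclassified / SAMPLE_SIZE

estimates = [noisy_estimate() for _ in range(N_DATASETS)]
best_reported = min(estimates)           # reporting only the best data set
average = sum(estimates) / N_DATASETS    # reporting across all data sets
```

The minimum of the noisy estimates sits below their average by construction, which is why the paper argues for reporting results on all data sets in a common database rather than the best few.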
Journal of Biomedical Optics | 2012
Jianping Hua; Chao Sima; Milana Cypert; Gerald C. Gooden; Sonsoles Shack; Lalitamba Alla; Edward A. Smith; Jeffrey M. Trent; Edward R. Dougherty; Michael L. Bittner
High-content cell imaging based on fluorescent protein reporters has recently been used to track the transcriptional activities of multiple genes under different external stimuli for extended periods. This technology enhances our ability to discover treatment-induced regulatory mechanisms, temporally order their onsets and recognize their relationships. To fully realize these possibilities and explore their potential in biological and pharmaceutical applications, we introduce a new data processing procedure to extract information about the dynamics of cell processes based on this technology. The proposed procedure contains two parts: (1) image processing, where the fluorescent images are processed to identify individual cells and allow their transcriptional activity levels to be quantified; and (2) data representation, where the extracted time course data are summarized and represented in a way that facilitates efficient evaluation. Experiments show that the proposed procedure achieves fast and robust image segmentation with sufficient accuracy. The extracted cellular dynamics are highly reproducible and sensitive enough to detect subtle activity differences and identify mechanisms responding to selected perturbations. This method should be able to help biologists identify the alterations of cellular mechanisms that allow drug candidates to change cell behavior and thereby improve the efficiency of drug discovery and treatment design.
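The image-processing step described above must identify individual cells in fluorescent images; after thresholding to a binary mask, the core of such a step is connected-component labeling. A minimal sketch (4-connectivity on a binary grid, pure Python; a stand-in for illustration, not the authors' pipeline):

```python
def count_cells(mask):
    """Count connected foreground regions (4-connectivity) in a binary mask."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                count += 1                      # new component found
                stack = [(r, c)]
                seen[r][c] = True
                while stack:                    # flood-fill the component
                    y, x = stack.pop()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
    return count
```

Per-cell fluorescence levels would then be summed over each labeled region to give the transcriptional activity time courses the paper describes.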
Current Genomics | 2009
Edward R. Dougherty; Jianping Hua; Chao Sima
High-throughput biological technologies offer the promise of finding feature sets to serve as biomarkers for medical applications; however, the sheer number of potential features (genes, proteins, etc.) means that there needs to be massive feature selection, far greater than that envisioned in the classical literature. This paper considers performance analysis for feature-selection algorithms from two fundamental perspectives: How does the classification accuracy achieved with a selected feature set compare to the accuracy when the best feature set is used and what is the optimal number of features that should be used? The criteria manifest themselves in several issues that need to be considered when examining the efficacy of a feature-selection algorithm: (1) the correlation between the classifier errors for the selected feature set and the theoretically best feature set; (2) the regressions of the aforementioned errors upon one another; (3) the peaking phenomenon, that is, the effect of sample size on feature selection; and (4) the analysis of feature selection in the framework of high-dimensional models corresponding to high-throughput data.
Molecular Cancer Research | 2014
Timothy G. Whitsett; Ian T. Mathews; Michael H. Cardone; Ryan Lena; William E. Pierceall; Michael L. Bittner; Chao Sima; Janine LoBello; Glen J. Weiss; Nhan L. Tran
Insensitivity to standard clinical interventions, including chemotherapy, radiotherapy, and tyrosine kinase inhibitor (TKI) treatment, remains a substantial hindrance towards improving the prognosis of patients with non–small cell lung cancer (NSCLC). The molecular mechanism of therapeutic resistance remains poorly understood. The TNF-like weak inducer of apoptosis (TWEAK)–FGF-inducible 14 (TNFRSF12A/Fn14) signaling axis is known to promote cancer cell survival via NF-κB activation and the upregulation of prosurvival Bcl-2 family members. Here, a role was determined for TWEAK–Fn14 prosurvival signaling in NSCLC through the upregulation of myeloid cell leukemia sequence 1 (MCL1/Mcl-1). Mcl-1 expression significantly correlated with Fn14 expression, advanced NSCLC tumor stage, and poor patient prognosis in human primary NSCLC tumors. TWEAK stimulation of NSCLC cells induced NF-κB–dependent Mcl-1 protein expression and conferred Mcl-1–dependent chemo- and radioresistance. Depletion of Mcl-1 via siRNA or pharmacologic inhibition of Mcl-1, using EU-5148, sensitized TWEAK-treated NSCLC cells to cisplatin- or radiation-mediated inhibition of cell survival. Moreover, EU-5148 inhibited cell survival across a panel of NSCLC cell lines. In contrast, inhibition of Bcl-2/Bcl-xL function had minimal effect on suppressing TWEAK-induced cell survival. Collectively, these results position TWEAK–Fn14 signaling through Mcl-1 as a significant mechanism for NSCLC tumor cell survival and open new therapeutic avenues to abrogate the high mortality rate seen in NSCLC. Implications: The TWEAK–Fn14 signaling axis enhances lung cancer cell survival and therapeutic resistance through Mcl-1, positioning both TWEAK–Fn14 and Mcl-1 as therapeutic opportunities in lung cancer. Mol Cancer Res; 12(4); 550–9. ©2014 AACR.