Network


Latest external collaboration on country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where Jianping Hua is active.

Publication


Featured researches published by Jianping Hua.


Pattern Recognition | 2009

Performance of feature-selection methods in the classification of high-dimension data

Jianping Hua; Waibhav Tembe; Edward R. Dougherty

Contemporary biological technologies produce extremely high-dimensional data sets from which to design classifiers, with 20,000 or more potential features being common place. In addition, sample sizes tend to be small. In such settings, feature selection is an inevitable part of classifier design. Heretofore, there have been a number of comparative studies for feature selection, but they have either considered settings with much smaller dimensionality than those occurring in current bioinformatics applications or constrained their study to a few real data sets. This study compares some basic feature-selection methods in settings involving thousands of features, using both model-based synthetic data and real data. It defines distribution models involving different numbers of markers (useful features) versus non-markers (useless features) and different kinds of relations among the features. Under this framework, it evaluates the performances of feature-selection algorithms for different distribution models and classifiers. Both classification error and the number of discovered markers are computed. Although the results clearly show that none of the considered feature-selection methods performs best across all scenarios, there are some general trends relative to sample size and relations among the features. For instance, the classifier-independent univariate filter methods have similar trends. Filter methods such as the t-test have better or similar performance with wrapper methods for harder problems. This improved performance is usually accompanied with significant peaking. Wrapper methods have better performance when the sample size is sufficiently large. ReliefF, the classifier-independent multivariate filter method, has worse performance than univariate filter methods in most cases; however, ReliefF-based wrapper methods show performance similar to their t-test-based counterparts.


Bioinformatics | 2010

Small-sample precision of ROC-related estimates

Blaise Hanczar; Jianping Hua; Chao Sima; John N. Weinstein; Michael L. Bittner; Edward R. Dougherty

MOTIVATION The receiver operator characteristic (ROC) curves are commonly used in biomedical applications to judge the performance of a discriminant across varying decision thresholds. The estimated ROC curve depends on the true positive rate (TPR) and false positive rate (FPR), with the key metric being the area under the curve (AUC). With small samples these rates need to be estimated from the training data, so a natural question arises: How well do the estimates of the AUC, TPR and FPR compare with the true metrics? RESULTS Through a simulation study using data models and analysis of real microarray data, we show that (i) for small samples the root mean square differences of the estimated and true metrics are considerable; (ii) even for large samples, there is only weak correlation between the true and estimated metrics; and (iii) generally, there is weak regression of the true metric on the estimated metric. For classification rules, we consider linear discriminant analysis, linear support vector machine (SVM) and radial basis function SVM. For error estimation, we consider resubstitution, three kinds of cross-validation and bootstrap. Using resampling, we show the unreliability of some published ROC results. AVAILABILITY Companion web site at http://compbio.tgen.org/paper_supp/ROC/roc.html CONTACT [email protected].


Bioinformatics | 2005

Optimal number of features as a function of sample size for various classification rules

Jianping Hua; Zixiang Xiong; James Lowey; Edward Suh; Edward R. Dougherty

MOTIVATION Given the joint feature-label distribution, increasing the number of features always results in decreased classification error; however, this is not the case when a classifier is designed via a classification rule from sample data. Typically (but not always), for fixed sample size, the error of a designed classifier decreases and then increases as the number of features grows. The potential downside of using too many features is most critical for small samples, which are commonplace for gene-expression-based classifiers for phenotype discrimination. For fixed sample size and feature-label distribution, the issue is to find an optimal number of features. RESULTS Since only in rare cases is there a known distribution of the error as a function of the number of features and sample size, this study employs simulation for various feature-label distributions and classification rules, and across a wide range of sample and feature-set sizes. To achieve the desired end, finding the optimal number of features as a function of sample size, it employs massively parallel computation. Seven classifiers are treated: 3-nearest-neighbor, Gaussian kernel, linear support vector machine, polynomial support vector machine, perceptron, regular histogram and linear discriminant analysis. Three Gaussian-based models are considered: linear, nonlinear and bimodal. In addition, real patient data from a large breast-cancer study is considered. To mitigate the combinatorial search for finding optimal feature sets, and to model the situation in which subsets of genes are co-regulated and correlation is internal to these subsets, we assume that the covariance matrix of the features is blocked, with each block corresponding to a group of correlated features. Altogether there are a large number of error surfaces for the many cases. These are provided in full on a companion website, which is meant to serve as resource for those working with small-sample classification. AVAILABILITY For the companion website, please visit http://public.tgen.org/tamu/ofs/ CONTACT [email protected].


Pattern Recognition | 2007

Model-based evaluation of clustering validation measures

Marcel Brun; Chao Sima; Jianping Hua; James Lowey; Brent Carroll; Edward Suh; Edward R. Dougherty

A cluster operator takes a set of data points and partitions the points into clusters (subsets). As with any scientific model, the scientific content of a cluster operator lies in its ability to predict results. This ability is measured by its error rate relative to cluster formation. To estimate the error of a cluster operator, a sample of point sets is generated, the algorithm is applied to each point set and the clusters evaluated relative to the known partition according to the distributions, and then the errors are averaged over the point sets composing the sample. Many validity measures have been proposed for evaluating clustering results based on a single realization of the random-point-set process. In this paper we consider a number of proposed validity measures and we examine how well they correlate with error rates across a number of clustering algorithms and random-point-set models. Validity measures fall broadly into three classes: internal validation is based on calculating properties of the resulting clusters; relative validation is based on comparisons of partitions generated by the same algorithm with different parameters or different subsets of the data; and external validation compares the partition generated by the clustering algorithm and a given partition of the data. To quantify the degree of similarity between the validation indices and the clustering errors, we use Kendalls rank correlation between their values. Our results indicate that, overall, the performance of validity indices is highly variable. For complex models or when a clustering algorithm yields complex clusters, both the internal and relative indices fail to predict the error of the algorithm. Some external indices appear to perform well, whereas others do not. We conclude that one should not put much faith in a validity score unless there is evidence, either in terms of sufficient data for model estimation or prior model knowledge, that a validity measure is well-correlated to the error rate of the clustering algorithm.


IEEE Transactions on Medical Imaging | 2003

Lossy-to-lossless compression of medical volumetric data using three-dimensional integer wavelet transforms

Zixiang Xiong; Xiaolin Wu; Samuel Cheng; Jianping Hua

We study lossy-to-lossless compression of medical volumetric data using three-dimensional (3-D) integer wavelet transforms. To achieve good lossy coding performance, it is important to have transforms that are unitary. In addition to the lifting approach, we first introduce a general 3-D integer wavelet packet transform structure that allows implicit bit shifting of wavelet coefficients to approximate a 3-D unitary transformation. We then focus on context modeling for efficient arithmetic coding of wavelet coefficients. Two state-of-the-art 3-D wavelet video coding techniques, namely, 3-D set partitioning in hierarchical trees (Kim et al., 2000) and 3-D embedded subband coding with optimal truncation (Xu et al., 2001), are modified and applied to compression of medical volumetric data, achieving the best performance published so far in the literature-both in terms of lossy and lossless compression.


Eurasip Journal on Bioinformatics and Systems Biology | 2007

Decorrelation of the true and estimated classifier errors in high-dimensional settings

Blaise Hanczar; Jianping Hua; Edward R. Dougherty

The aim of many microarray experiments is to build discriminatory diagnosis and prognosis models. Given the huge number of features and the small number of examples, model validity which refers to the precision of error estimation is a critical issue. Previous studies have addressed this issue via the deviation distribution (estimated error minus true error), in particular, the deterioration of cross-validation precision in high-dimensional settings where feature selection is used to mitigate the peaking phenomenon (overfitting). Because classifier design is based upon random samples, both the true and estimated errors are sample-dependent random variables, and one would expect a loss of precision if the estimated and true errors are not well correlated, so that natural questions arise as to the degree of correlation and the manner in which lack of correlation impacts error estimation. We demonstrate the effect of correlation on error precision via a decomposition of the variance of the deviation distribution, observe that the correlation is often severely decreased in high-dimensional settings, and show that the effect of high dimensionality on error estimation tends to result more from its decorrelating effects than from its impact on the variance of the estimated error. We consider the correlation between the true and estimated errors under different experimental conditions using both synthetic and real data, several feature-selection methods, different classification rules, and three error estimators commonly used (leave-one-out cross-validation, -fold cross-validation, and .632 bootstrap). Moreover, three scenarios are considered: (1) feature selection, (2) known-feature set, and (3) all features. Only the first is of practical interest; however, the other two are needed for comparison purposes. We will observe that the true and estimated errors tend to be much more correlated in the case of a known feature set than with either feature selection or using all features, with the better correlation between the latter two showing no general trend, but differing for different models.


Bioinformatics | 2010

Reporting bias when using real data sets to analyze classification performance

Mohammadmahdi R. Yousefi; Jianping Hua; Chao Sima; Edward R. Dougherty

MOTIVATION It is commonplace for authors to propose a new classification rule, either the operator construction part or feature selection, and demonstrate its performance on real data sets, which often come from high-dimensional studies, such as from gene-expression microarrays, with small samples. Owing to the variability in feature selection and error estimation, individual reported performances are highly imprecise. Hence, if only the best test results are reported, then these will be biased relative to the overall performance of the proposed procedure. RESULTS This article characterizes reporting bias with several statistics and computes these statistics in a large simulation study using both modeled and real data. The results appear as curves giving the different reporting biases as functions of the number of samples tested when reporting only the best or second best performance. It does this for two classification rules, linear discriminant analysis (LDA) and 3-nearest-neighbor (3NN), and for filter and wrapper feature selection, t-test and sequential forward search. These were chosen on account of their well-studied properties and because they were amenable to the extremely large amount of processing required for the simulations. The results across all the experiments are consistent: there is generally large bias overriding what would be considered a significant performance differential, when reporting the best or second best performing data set. We conclude that there needs to be a database of data sets and that, for those studies depending on real data, results should be reported for all data sets in the database. AVAILABILITY Companion web site at http://gsp.tamu.edu/Publications/supplementary/yousefi09a/


Pattern Recognition | 2005

Determination of the optimal number of features for quadratic discriminant analysis via the normal approximation to the discriminant distribution

Jianping Hua; Zixiang Xiong; Edward R. Dougherty

Given the joint feature-label distribution, increasing the number of features always results in decreased classification error; however, this is not the case when a classifier is designed via a classification rule from sample data. Typically, for fixed sample size, the error of a designed classifier decreases and then increases as the number of features grows. The problem is especially acute when sample sizes are very small and the potential number of features is very large. To obtain a general understanding of the kinds of feature-set sizes that provide good performance for a particular classification rule, performance must be evaluated based on accurate error estimation, and hence a model-based setting for optimizing the number of features is needed. This paper treats quadratic discriminant analysis (QDA) in the case of unequal covariance matrices. For two normal class-conditional distributions, the QDA classifier is determined according to a discriminant. The standard plug-in rule estimates the discriminant from a feature-label sample to obtain an estimate of the discriminant by replacing the means and covariance matrices by their respective sample means and sample covariance matrices. The unbiasedness of these estimators assures good estimation for large samples, but not for small samples. Our goal is to find an essentially analytic method to produce an error curve as a function of the number of features so that the curve can be minimized to determine an optimal number of features. We use a normal approximation to the distribution of the estimated discriminant. Since the mean and variance of the estimated discriminant will be exact, these provide insight into how the covariance matrices affect the optimal number of features. We derive the mean and variance of the estimated discriminant and compare feature-size optimization using the normal approximation to the estimated discriminant with optimization obtained by simulating the true distribution of the estimated discriminant. Optimization via the normal approximation to the estimated discriminant provides huge computational savings in comparison to optimization via simulation of the true distribution. Feature-size optimization via the normal approximation is very accurate when the covariance matrices differ modestly. The optimal number of features based on the normal approximation will exceed the actual optimal number when there is large disagreement between the covariance matrices; however, this difference is not important because the true misclassification error using the number of features obtained from the normal approximation and the number obtained from the true distribution differ only slightly, even for significantly different covariance matrices.


international symposium on biomedical imaging | 2002

Lossy-to-lossless ROI coding of chromosome images using modified SPIHT and EBCOT

Zhongmin Liu; Jianping Hua; Zixiang Xiong; Qiang Wu; Kenneth R. Castleman

This paper proposes a lossy-to-lossless region of interest (ROI) compression scheme based on set partitioning in hierarchical trees (SPIHT) and embedded block coding with optimized truncation (EBCOT) (EBCOT is the base of the JPEG-2000 standard). However, SPIHT does not support ROI coding and JPEG-2000 does not allow lossy-to-lossless ROI compression. For our application, we modify the original SPIHT and EBCOT algorithms for lossy-to-lossless ROI coding of both the foreground and the background of chromosome images. Experiments show that our implementations offer better compression performance with exact ROI support than other image coders.


multimedia signal processing | 2001

High-performance 3-D embedded wavelet video (EWV) coding

Jianping Hua; Zixiang Xiong; Xiaolin Wu

This paper presents a rate-distortion (R-D) optimized 3-D embedded wavelet video (EWV) coder by extending the concept of EBCOT from 2-D to 3-D. After a lifting based 3-D wavelet transform, different subbands are coded independently using bit plane coding with different context models to provide flexible scalability in both spatial and temporal domain. A global R-D optimization procedure is used to generate an embedded bitstream for a target bit rate. Experiments show that, even without motion estimation, the EWV coder outperforms both MPEG-4 and 3-D ESCOT for most low motion video sequences.

Collaboration


Dive into the Jianping Hua's collaboration.

Top Co-Authors

Avatar

Michael L. Bittner

Translational Genomics Research Institute

View shared research outputs
Top Co-Authors

Avatar

Chao Sima

Translational Genomics Research Institute

View shared research outputs
Top Co-Authors

Avatar

Edward Suh

Translational Genomics Research Institute

View shared research outputs
Top Co-Authors

Avatar

James Lowey

Translational Genomics Research Institute

View shared research outputs
Top Co-Authors

Avatar

Marcel Brun

Translational Genomics Research Institute

View shared research outputs
Top Co-Authors

Avatar

Milana Cypert

Translational Genomics Research Institute

View shared research outputs
Researchain Logo
Decentralizing Knowledge