Publication


Featured research published by Edward Suh.


Bioinformatics | 2005

Optimal number of features as a function of sample size for various classification rules

Jianping Hua; Zixiang Xiong; James Lowey; Edward Suh; Edward R. Dougherty

MOTIVATION Given the joint feature-label distribution, increasing the number of features always results in decreased classification error; however, this is not the case when a classifier is designed via a classification rule from sample data. Typically (but not always), for fixed sample size, the error of a designed classifier decreases and then increases as the number of features grows. The potential downside of using too many features is most critical for small samples, which are commonplace for gene-expression-based classifiers for phenotype discrimination. For fixed sample size and feature-label distribution, the issue is to find an optimal number of features.
RESULTS Since only in rare cases is there a known distribution of the error as a function of the number of features and sample size, this study employs simulation for various feature-label distributions and classification rules, and across a wide range of sample and feature-set sizes. To achieve the desired end, finding the optimal number of features as a function of sample size, it employs massively parallel computation. Seven classifiers are treated: 3-nearest-neighbor, Gaussian kernel, linear support vector machine, polynomial support vector machine, perceptron, regular histogram and linear discriminant analysis. Three Gaussian-based models are considered: linear, nonlinear and bimodal. In addition, real patient data from a large breast-cancer study is considered. To mitigate the combinatorial search for finding optimal feature sets, and to model the situation in which subsets of genes are co-regulated and correlation is internal to these subsets, we assume that the covariance matrix of the features is blocked, with each block corresponding to a group of correlated features. Altogether there are a large number of error surfaces for the many cases. These are provided in full on a companion website, which is meant to serve as a resource for those working with small-sample classification.
AVAILABILITY For the companion website, please visit http://public.tgen.org/tamu/ofs/
CONTACT [email protected].


Genome Research | 2011

Copy number and targeted mutational analysis reveals novel somatic events in metastatic prostate tumors

Christiane M. Robbins; Waibov A. Tembe; Angela Baker; Shripad Sinari; Tracy Moses; Stephen M. Beckstrom-Sternberg; James S. Beckstrom-Sternberg; Michael T. Barrett; James Long; Arul M. Chinnaiyan; James Lowey; Edward Suh; John V. Pearson; David Craig; David B. Agus; Kenneth J. Pienta; John D. Carpten

Advanced prostate cancer can progress to systemic metastatic tumors, which are generally androgen insensitive and ultimately lethal. Here, we report a comprehensive genomic survey for somatic events in systemic metastatic prostate tumors using both high-resolution copy number analysis and targeted mutational survey of 3508 exons from 577 cancer-related genes using next generation sequencing. Focal homozygous deletions were detected at 8p22, 10q23.31, 13q13.1, 13q14.11, and 13q14.12. Key genes mapping within these deleted regions include PTEN, BRCA2, C13ORF15, and SIAH3. Focal high-level amplifications were detected at 5p13.2-p12, 14q21.1, 7q22.1, and Xq12. Key amplified genes mapping within these regions include SKP2, FOXA1, and AR. Furthermore, targeted mutational analysis of normal-tumor pairs has identified somatic mutations in genes known to be associated with prostate cancer including AR and TP53, but has also revealed novel somatic point mutations in genes including MTOR, BRCA2, ARHGEF12, and CHD5. Finally, in one patient where multiple independent metastatic tumors were available, we show common and divergent somatic alterations that occur at both the copy number and point mutation level, supporting a model for a common clonal progenitor with metastatic tumor-specific divergence. Our study represents a deep genomic analysis of advanced metastatic prostate tumors and has revealed candidate somatic alterations, possibly contributing to lethal prostate cancer.


Pattern Recognition | 2007

Model-based evaluation of clustering validation measures

Marcel Brun; Chao Sima; Jianping Hua; James Lowey; Brent Carroll; Edward Suh; Edward R. Dougherty

A cluster operator takes a set of data points and partitions the points into clusters (subsets). As with any scientific model, the scientific content of a cluster operator lies in its ability to predict results. This ability is measured by its error rate relative to cluster formation. To estimate the error of a cluster operator, a sample of point sets is generated, the algorithm is applied to each point set and the clusters evaluated relative to the known partition according to the distributions, and then the errors are averaged over the point sets composing the sample. Many validity measures have been proposed for evaluating clustering results based on a single realization of the random-point-set process. In this paper we consider a number of proposed validity measures and we examine how well they correlate with error rates across a number of clustering algorithms and random-point-set models. Validity measures fall broadly into three classes: internal validation is based on calculating properties of the resulting clusters; relative validation is based on comparisons of partitions generated by the same algorithm with different parameters or different subsets of the data; and external validation compares the partition generated by the clustering algorithm and a given partition of the data. To quantify the degree of similarity between the validation indices and the clustering errors, we use Kendall's rank correlation between their values. Our results indicate that, overall, the performance of validity indices is highly variable. For complex models or when a clustering algorithm yields complex clusters, both the internal and relative indices fail to predict the error of the algorithm. Some external indices appear to perform well, whereas others do not. We conclude that one should not put much faith in a validity score unless there is evidence, either in terms of sufficient data for model estimation or prior model knowledge, that a validity measure is well-correlated to the error rate of the clustering algorithm.
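
As a rough illustration of the comparison described in this abstract, the sketch below computes Kendall's rank correlation (the tau-a variant, which ignores ties) between a validity index and a clustering error. The function name `kendall_tau` and the sample numbers are hypothetical and are not taken from the paper; the paper's exact tie handling is not stated here.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical values: one validity score and one measured error per experiment.
validity_index = [0.91, 0.75, 0.60, 0.42, 0.30]
clustering_error = [0.05, 0.10, 0.22, 0.18, 0.40]

tau = kendall_tau(validity_index, clustering_error)
print(f"Kendall tau between validity index and error: {tau:.2f}")
```

A strongly negative value would suggest that higher validity scores tend to accompany lower errors in this toy data; a value near zero would suggest the index says little about the error.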


Journal of Biological Systems | 2002

Can Markov chain models mimic biological regulation?

Seungchan Kim; Huai Li; Edward R. Dougherty; Nanwei Cao; Yidong Chen; Michael L. Bittner; Edward Suh

A fundamental question in biology is whether the network of interactions that regulate gene expression can be modeled by existing mathematical techniques. Studies of the ability to predict a gene's...


Cancer Epidemiology, Biomarkers & Prevention | 2005

The Interaction of Four Genes in the Inflammation Pathway Significantly Predicts Prostate Cancer Risk

Jianfeng Xu; James Lowey; Fredrik Wiklund; Jielin Sun; Fredrik Lindmark; Fang-Chi Hsu; Latchezar Dimitrov; Bao-Li Chang; Aubrey R. Turner; Wennan Liu; Hans-Olov Adami; Edward Suh; Jason H. Moore; S. Lilly Zheng; William B. Isaacs; Jeffrey M. Trent; Henrik Grönberg

It is widely hypothesized that the interactions of multiple genes influence individual risk to prostate cancer. However, current efforts at identifying prostate cancer risk genes primarily rely on single-gene approaches. In an attempt to fill this gap, we carried out a study to explore the joint effect of multiple genes in the inflammation pathway on prostate cancer risk. We studied 20 genes in the Toll-like receptor signaling pathway as well as several cytokines. For each of these genes, we selected and genotyped haplotype-tagging single nucleotide polymorphisms (SNP) among 1,383 cases and 780 controls from the CAPS (CAncer Prostate in Sweden) study population. A total of 57 SNPs were included in the final analysis. A data mining method, multifactor dimensionality reduction, was used to explore the interaction effects of SNPs on prostate cancer risk. Interaction effects were assessed for all possible n SNP combinations, where n = 2, 3, or 4. For each n SNP combination, the model providing lowest prediction error among 100 cross-validations was chosen. The statistical significance levels of the best models in each n SNP combination were determined using permutation tests. A four-SNP interaction (one SNP each from IL-10, IL-1RN, TIRAP, and TLR5) had the lowest prediction error (43.28%, P = 0.019). Our ability to analyze a large number of SNPs in a large sample size is one of the first efforts in exploring the effect of high-order gene-gene interactions on prostate cancer risk, and this is an important contribution to this new and quickly evolving field.


Bioinformatics | 2010

G-SQZ

Waibhav Tembe; James Lowey; Edward Suh

SUMMARY Large volumes of data generated by high-throughput sequencing instruments present non-trivial challenges in data storage, content access and transfer. We present G-SQZ, a Huffman coding-based sequencing-reads-specific representation scheme that compresses data without altering the relative order. G-SQZ has achieved from 65% to 81% compression on benchmark datasets, and it allows selective access without scanning and decoding from start. This article focuses on describing the underlying encoding scheme and its software implementation, and a more theoretical problem of optimal compression is out of scope. The immediate practical benefits include reduced infrastructure and informatics costs in managing and analyzing large sequencing data. AVAILABILITY http://public.tgen.org/sqz. Academic/non-profit: SOURCE available at no cost under a non-open-source license by requesting from the web-site; Binary: available for direct download at no cost. For-Profit: Submit request for for-profit license from the web-site.
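
The abstract names Huffman coding as the basis of the scheme without giving details here. The following is a minimal sketch of generic Huffman coding applied to sequencing reads, under the assumption (made only for illustration) that each (base, quality) pair is treated as one symbol. It is not the G-SQZ file format, and the names `huffman_code`, `encode`, and `decode` are hypothetical.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Build a prefix-free code (symbol -> bit string) from symbol frequencies."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tiebreaker, partial code table).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


def encode(symbols, code):
    """Concatenate code words in input order, so read order is preserved."""
    return "".join(code[s] for s in symbols)


def decode(bits, code):
    """Invert the prefix-free code by scanning the bit string left to right."""
    inverse = {c: s for s, c in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out


# Hypothetical read: bases with per-base quality characters.
bases = "ACGTACGAAC"
quals = "IIIHIIIHII"
pairs = list(zip(bases, quals))

code = huffman_code(pairs)
bits = encode(pairs, code)
print(len(bits), "bits versus", 16 * len(pairs), "bits as two 8-bit characters per pair")
```

Frequent pairs receive short code words, which is where the saving comes from. Selective access without decoding from the start, as mentioned in the abstract, would need an additional index and is not shown.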


Journal of Computational Biology | 2004

Gene Clustering Based on Clusterwide Mutual Information

Xiaobo Zhou; Xiaodong Wang; Edward R. Dougherty; Daniel E. Russ; Edward Suh

Cluster analysis of gene-wide expression data from DNA microarray hybridization studies has proved to be a useful tool for identifying biologically relevant groupings of genes and constructing gene regulatory networks. The motivation for considering mutual information is its capacity to measure a general dependence among gene random variables. We propose a novel clustering strategy based on minimizing mutual information among gene clusters. Simulated annealing is employed to solve the optimization problem. Bootstrap techniques are employed to get more accurate estimates of mutual information when the data sample size is small. Moreover, we propose to combine the mutual information criterion and traditional distance criteria such as the Euclidean distance and the fuzzy membership metric in designing the clustering algorithm. The performances of the new clustering methods are compared with those of some existing methods, using both synthesized data and experimental data. It is seen that the clustering algorithm based on a combined metric of mutual information and fuzzy membership achieves the best performance. The supplemental material is available at www.gspsnap.tamu.edu/gspweb/zxb/glioma_zxb.
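
To make the dependence measure concrete, here is a minimal sketch of a plug-in (empirical frequency) estimate of mutual information between two discretized expression profiles. The function name `mutual_information` and the toy data are hypothetical; the paper's clusterwide criterion, bootstrap correction, and simulated annealing search are not reproduced.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts.
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Hypothetical binarized expression levels (0 = low, 1 = high) across samples.
gene_a = [0, 0, 1, 1]
gene_b = [0, 0, 1, 1]   # identical pattern to gene_a
gene_c = [0, 1, 0, 1]   # empirically independent of gene_a

print(mutual_information(gene_a, gene_b))
print(mutual_information(gene_a, gene_c))
```

With few samples this plug-in estimate is biased, which is consistent with the abstract's remark that bootstrap techniques are used when the sample size is small.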


Glycobiology | 2010

A motif-based analysis of glycan array data to determine the specificities of glycan-binding proteins

Andrew C. Porter; Tingting Yue; Lee Heeringa; Steven Day; Edward Suh; Brian B. Haab

Glycan arrays have enabled detailed studies of the specificities of glycan-binding proteins. A challenge in the interpretation of glycan array data is to determine the specific features of glycan structures that are critical for binding. To address this challenge, we have developed a systematic method to interpret glycan array data using a motif-based analysis. Each glycan on a glycan array is classified according to its component sub-structures, or motifs. We analyze the binding of a given lectin to each glycan in terms of the motifs in order to identify the motifs that are selectively present in the glycans that are bound by the lectin. We compared two different methods to calculate the identification, termed intensity segregation and motif segregation, for the analysis of three well-characterized lectins with highly divergent behaviors. Both methods accurately identified the primary specificities as well as the weaker, secondary specificities of all three lectins. The complex binding behavior of wheat germ agglutinin was reduced to its simplified, independent specificities. We compiled the motif specificities of a wide variety of plant lectins, human lectins, and glycan-binding antibodies to uncover the relationships among the glycan-binding proteins and to provide a means to search for lectins with particular binding specificities. This approach should be valuable for rapidly analyzing and using glycan array data, for better describing and understanding glycan-binding specificities, and as a means to systematize and compare data from glycan arrays.


Pattern Recognition | 2005

Impact of error estimation on feature selection

Chao Sima; Sanju N. Attoor; Ulisses Braga-Neto; James Lowey; Edward Suh; Edward R. Dougherty

Given a large set of potential features, it is usually necessary to find a small subset with which to classify. The task of finding an optimal feature set is inherently combinatoric and therefore suboptimal algorithms are typically used to find feature sets. If feature selection is based directly on classification error, then a feature-selection algorithm must base its decision on error estimates. This paper addresses the impact of error estimation on feature selection using two performance measures: comparison of the true error of the optimal feature set with the true error of the feature set found by a feature-selection algorithm, and the number of features among the truly optimal feature set that appear in the feature set found by the algorithm. The study considers seven error estimators applied to three standard suboptimal feature-selection algorithms and exhaustive search, and it considers three different feature-label model distributions. It draws two conclusions for the cases considered: (1) depending on the sample size and the classification rule, feature-selection algorithms can produce feature sets whose corresponding classifiers possess errors far in excess of the classifier corresponding to the optimal feature set; and (2) for small samples, differences in performances among the feature-selection algorithms are less significant than performance differences among the error estimators used to implement the algorithms. Moreover, keeping in mind that results depend on the particular classifier-distribution pair, for the error estimators considered in this study, bootstrap and bolstered resubstitution usually outperform cross-validation, and bolstered resubstitution usually performs as well as or better than bootstrap.


Pattern Recognition Letters | 2005

Feature selection algorithms to find strong genes

Paulo J. S. Silva; Ronaldo Fumio Hashimoto; Seungchan Kim; Junior Barrera; Leônidas de Oliveira Brandão; Edward Suh; Edward R. Dougherty

The cDNA microarray technology allows us to estimate the expression of thousands of genes of a given tissue. It is natural then to use such information to classify different cell states, like healthy or diseased, or one particular type of cancer or another. However, the number of microarray samples is usually very small, which leads to a classification problem with only tens of samples and thousands of features. Recently, Kim et al. proposed using a parameterized distribution based on the original sample set as a way to attenuate this difficulty. Genes that contribute to good classifiers in such a setting are called strong. In this paper, we investigate how to use feature selection techniques to speed up the search for strong genes. The idea is to use a feature selection algorithm to filter the gene set considered before applying the original strong feature technique, which is based on a combinatorial search. The filtering helps us to find very good strong gene sets without resorting to supercomputers. We have tested several filter options and compared the strong genes obtained with those obtained by the original full combinatorial search.

Collaboration


Dive into Edward Suh's collaborations.

Top Co-Authors

James Lowey

Translational Genomics Research Institute

Robert L. Martino

National Institutes of Health

Seungchan Kim

Arizona State University

Jianping Hua

Translational Genomics Research Institute

Michael L. Bittner

Translational Genomics Research Institute

Tieng K. Yap

National Institutes of Health

Yidong Chen

University of Texas Health Science Center at San Antonio

Benes L. Trus

National Institutes of Health
