James Lowey
Translational Genomics Research Institute
Publications
Featured research published by James Lowey.
Bioinformatics | 2005
Jianping Hua; Zixiang Xiong; James Lowey; Edward Suh; Edward R. Dougherty
MOTIVATION Given the joint feature-label distribution, increasing the number of features always results in decreased classification error; however, this is not the case when a classifier is designed via a classification rule from sample data. Typically (but not always), for fixed sample size, the error of a designed classifier decreases and then increases as the number of features grows. The potential downside of using too many features is most critical for small samples, which are commonplace for gene-expression-based classifiers for phenotype discrimination. For fixed sample size and feature-label distribution, the issue is to find an optimal number of features. RESULTS Since only in rare cases is there a known distribution of the error as a function of the number of features and sample size, this study employs simulation for various feature-label distributions and classification rules, and across a wide range of sample and feature-set sizes. To achieve the desired end, finding the optimal number of features as a function of sample size, it employs massively parallel computation. Seven classifiers are treated: 3-nearest-neighbor, Gaussian kernel, linear support vector machine, polynomial support vector machine, perceptron, regular histogram and linear discriminant analysis. Three Gaussian-based models are considered: linear, nonlinear and bimodal. In addition, real patient data from a large breast-cancer study is considered. To mitigate the combinatorial search for finding optimal feature sets, and to model the situation in which subsets of genes are co-regulated and correlation is internal to these subsets, we assume that the covariance matrix of the features is blocked, with each block corresponding to a group of correlated features. Altogether there are a large number of error surfaces for the many cases. These are provided in full on a companion website, which is meant to serve as a resource for those working with small-sample classification.
AVAILABILITY For the companion website, please visit http://public.tgen.org/tamu/ofs/ CONTACT [email protected].
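The "peaking" behavior the abstract describes, error falling and then rising as features are added for a fixed sample size, can be sketched in a few lines. The data model (equal-covariance Gaussians with five informative features), the 3-nearest-neighbor rule, and all sizes below are toy choices for illustration, not the study's setup:

```python
# Illustrative sketch of the peaking phenomenon: for a fixed training-sample
# size, the test error of a designed 3-nearest-neighbor classifier tends to
# fall and then rise as the feature count grows. Synthetic data only.
import random

random.seed(0)

def sample(n, d, shift):
    """n points per class in d dimensions; only the first 5 features carry signal."""
    data = []
    for label in (0, 1):
        for _ in range(n):
            x = [random.gauss(shift * label if j < 5 else 0.0, 1.0) for j in range(d)]
            data.append((x, label))
    return data

def knn3_error(train, test):
    """Empirical error of the 3-NN rule designed from `train`, evaluated on `test`."""
    errs = 0
    for x, y in test:
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(x, xt)), yt) for xt, yt in train
        )
        vote = sum(yt for _, yt in dists[:3])  # majority vote among 3 neighbors
        errs += (1 if vote >= 2 else 0) != y
    return errs / len(test)

def error_vs_features(n_train=10, n_test=200, dims=(2, 5, 10, 30, 60)):
    out = {}
    for d in dims:
        train = sample(n_train, d, shift=1.0)
        test = sample(n_test, d, shift=1.0)
        out[d] = knn3_error(train, test)
    return out

curve = error_vs_features()
```

Averaging many such repetitions per dimension, as a large-scale simulation would, yields the error surfaces the abstract refers to.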
Nature Genetics | 2008
Jielin Sun; Siqun Lilly Zheng; Fredrik Wiklund; Sarah D. Isaacs; Lina Purcell; Zhengrong Gao; Fang-Chi Hsu; Seong Tae Kim; Wennuan Liu; Yi Zhu; Pär Stattin; Hans-Olov Adami; Kathleen E. Wiley; Latchezar Dimitrov; Jishan Sun; Tao Li; Aubrey R. Turner; Tamara S. Adams; Jan Adolfsson; Jan-Erik Johansson; James Lowey; Bruce J. Trock; Alan W. Partin; Patrick C. Walsh; Jeffrey M. Trent; David Duggan; John D. Carpten; Bao Li Chang; Henrik Grönberg; William B. Isaacs
We carried out a fine-mapping study in the HNF1B gene at 17q12 in two study populations and identified a second locus associated with prostate cancer risk, ∼26 kb centromeric to the first known locus (rs4430796); these loci are separated by a recombination hot spot. We confirmed the association with a SNP in the second locus (rs11649743) in five additional populations, with P = 1.7 × 10^−9 for an allelic test of the seven studies combined. The association at each SNP remained significant after adjustment for the other SNP.
Genome Research | 2011
Christiane M. Robbins; Waibhov A. Tembe; Angela Baker; Shripad Sinari; Tracy Moses; Stephen M. Beckstrom-Sternberg; James S. Beckstrom-Sternberg; Michael T. Barrett; James Long; Arul M. Chinnaiyan; James Lowey; Edward Suh; John V. Pearson; David Craig; David B. Agus; Kenneth J. Pienta; John D. Carpten
Advanced prostate cancer can progress to systemic metastatic tumors, which are generally androgen insensitive and ultimately lethal. Here, we report a comprehensive genomic survey for somatic events in systemic metastatic prostate tumors using both high-resolution copy number analysis and targeted mutational survey of 3508 exons from 577 cancer-related genes using next generation sequencing. Focal homozygous deletions were detected at 8p22, 10q23.31, 13q13.1, 13q14.11, and 13q14.12. Key genes mapping within these deleted regions include PTEN, BRCA2, C13ORF15, and SIAH3. Focal high-level amplifications were detected at 5p13.2-p12, 14q21.1, 7q22.1, and Xq12. Key amplified genes mapping within these regions include SKP2, FOXA1, and AR. Furthermore, targeted mutational analysis of normal-tumor pairs has identified somatic mutations in genes known to be associated with prostate cancer including AR and TP53, but has also revealed novel somatic point mutations in genes including MTOR, BRCA2, ARHGEF12, and CHD5. Finally, in one patient where multiple independent metastatic tumors were available, we show common and divergent somatic alterations that occur at both the copy number and point mutation level, supporting a model for a common clonal progenitor with metastatic tumor-specific divergence. Our study represents a deep genomic analysis of advanced metastatic prostate tumors and has revealed candidate somatic alterations, possibly contributing to lethal prostate cancer.
Pattern Recognition | 2007
Marcel Brun; Chao Sima; Jianping Hua; James Lowey; Brent Carroll; Edward Suh; Edward R. Dougherty
A cluster operator takes a set of data points and partitions the points into clusters (subsets). As with any scientific model, the scientific content of a cluster operator lies in its ability to predict results. This ability is measured by its error rate relative to cluster formation. To estimate the error of a cluster operator, a sample of point sets is generated, the algorithm is applied to each point set and the clusters evaluated relative to the known partition according to the distributions, and then the errors are averaged over the point sets composing the sample. Many validity measures have been proposed for evaluating clustering results based on a single realization of the random-point-set process. In this paper we consider a number of proposed validity measures and we examine how well they correlate with error rates across a number of clustering algorithms and random-point-set models. Validity measures fall broadly into three classes: internal validation is based on calculating properties of the resulting clusters; relative validation is based on comparisons of partitions generated by the same algorithm with different parameters or different subsets of the data; and external validation compares the partition generated by the clustering algorithm and a given partition of the data. To quantify the degree of similarity between the validation indices and the clustering errors, we use Kendall's rank correlation between their values. Our results indicate that, overall, the performance of validity indices is highly variable. For complex models or when a clustering algorithm yields complex clusters, both the internal and relative indices fail to predict the error of the algorithm. Some external indices appear to perform well, whereas others do not.
We conclude that one should not put much faith in a validity score unless there is evidence, either in terms of sufficient data for model estimation or prior model knowledge, that a validity measure is well-correlated to the error rate of the clustering algorithm.
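The rank statistic used to compare validity indices against clustering errors can be sketched directly. This is a minimal Kendall rank correlation (tau-a), assuming no tied values for simplicity; it is an illustration of the statistic, not the authors' implementation:

```python
# Minimal Kendall rank correlation (tau-a): +1 when two score lists rank
# items identically, -1 when they rank them in reverse. Assumes no ties.
def kendall_tau(xs, ys):
    assert len(xs) == len(ys) and len(xs) > 1
    concordant = discordant = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])  # same sign -> same ordering
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

A validity index whose scores across many clustering runs give a tau near +1 (or -1, depending on the index's orientation) against the measured clustering errors is the kind of well-correlated measure the conclusion asks for.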
Cancer Epidemiology, Biomarkers & Prevention | 2005
Jianfeng Xu; James Lowey; Fredrik Wiklund; Jielin Sun; Fredrik Lindmark; Fang-Chi Hsu; Latchezar Dimitrov; Bao-Li Chang; Aubrey R. Turner; Wennuan Liu; Hans-Olov Adami; Edward Suh; Jason H. Moore; S. Lilly Zheng; William B. Isaacs; Jeffrey M. Trent; Henrik Grönberg
It is widely hypothesized that the interactions of multiple genes influence individual risk to prostate cancer. However, current efforts at identifying prostate cancer risk genes primarily rely on single-gene approaches. In an attempt to fill this gap, we carried out a study to explore the joint effect of multiple genes in the inflammation pathway on prostate cancer risk. We studied 20 genes in the Toll-like receptor signaling pathway as well as several cytokines. For each of these genes, we selected and genotyped haplotype-tagging single nucleotide polymorphisms (SNP) among 1,383 cases and 780 controls from the CAPS (CAncer Prostate in Sweden) study population. A total of 57 SNPs were included in the final analysis. A data mining method, multifactor dimensionality reduction, was used to explore the interaction effects of SNPs on prostate cancer risk. Interaction effects were assessed for all possible n SNP combinations, where n = 2, 3, or 4. For each n SNP combination, the model providing lowest prediction error among 100 cross-validations was chosen. The statistical significance levels of the best models in each n SNP combination were determined using permutation tests. A four-SNP interaction (one SNP each from IL-10, IL-1RN, TIRAP, and TLR5) had the lowest prediction error (43.28%, P = 0.019). Our ability to analyze a large number of SNPs in a large sample size is one of the first efforts in exploring the effect of high-order gene-gene interactions on prostate cancer risk, and this is an important contribution to this new and quickly evolving field.
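The core step of multifactor dimensionality reduction can be sketched compactly: each multilocus genotype cell is labeled high-risk when its case/control ratio exceeds the overall ratio, and the resulting binary rule is scored by misclassification. The toy data and helper names below are illustrative, not the study's pipeline (which adds cross-validation and permutation testing):

```python
# Schematic of the MDR idea for 2-SNP combinations: pool genotype cells into
# high/low risk by case:control ratio, score the pooled rule by its error,
# and keep the pair with the lowest error. Toy sketch only.
from itertools import combinations
from collections import defaultdict

def mdr_error(genotypes, labels, snp_pair):
    i, j = snp_pair
    cells = defaultdict(lambda: [0, 0])  # (g_i, g_j) -> [controls, cases]
    for g, y in zip(genotypes, labels):
        cells[(g[i], g[j])][y] += 1
    overall = sum(labels) / max(1, len(labels) - sum(labels))
    high_risk = {c for c, (ctrl, case) in cells.items()
                 if case / max(ctrl, 1) > overall}
    errors = sum((((g[i], g[j]) in high_risk) != bool(y))
                 for g, y in zip(genotypes, labels))
    return errors / len(labels)

def best_pair(genotypes, labels):
    n_snps = len(genotypes[0])
    return min(combinations(range(n_snps), 2),
               key=lambda p: mdr_error(genotypes, labels, p))
```

On data where the phenotype depends on an interaction between two loci with no marginal effect, such as an XOR pattern, this search recovers the interacting pair even though each SNP alone is uninformative, which is the point of the method.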
Bioinformatics | 2010
Waibhav Tembe; James Lowey; Edward Suh
SUMMARY Large volumes of data generated by high-throughput sequencing instruments present non-trivial challenges in data storage, content access and transfer. We present G-SQZ, a Huffman coding-based, sequencing-reads-specific representation scheme that compresses data without altering the relative order of reads. G-SQZ achieved 65% to 81% compression on benchmark datasets, and it allows selective access without scanning and decoding from the start. This article focuses on the underlying encoding scheme and its software implementation; the more theoretical problem of optimal compression is out of scope. The immediate practical benefits include reduced infrastructure and informatics costs in managing and analyzing large sequencing data. AVAILABILITY http://public.tgen.org/sqz. Academic/non-profit: source code available at no cost under a non-open-source license on request from the website; binary available for direct download at no cost. For-profit: submit a request for a for-profit license from the website.
Pattern Recognition | 2005
Chao Sima; Sanju N. Attoor; Ulisses Braga-Neto; James Lowey; Edward Suh; Edward R. Dougherty
Given a large set of potential features, it is usually necessary to find a small subset with which to classify. The task of finding an optimal feature set is inherently combinatoric and therefore suboptimal algorithms are typically used to find feature sets. If feature selection is based directly on classification error, then a feature-selection algorithm must base its decision on error estimates. This paper addresses the impact of error estimation on feature selection using two performance measures: comparison of the true error of the optimal feature set with the true error of the feature set found by a feature-selection algorithm, and the number of features among the truly optimal feature set that appear in the feature set found by the algorithm. The study considers seven error estimators applied to three standard suboptimal feature-selection algorithms and exhaustive search, and it considers three different feature-label model distributions. It draws two conclusions for the cases considered: (1) depending on the sample size and the classification rule, feature-selection algorithms can produce feature sets whose corresponding classifiers possess errors far in excess of the classifier corresponding to the optimal feature set; and (2) for small samples, differences in performances among the feature-selection algorithms are less significant than performance differences among the error estimators used to implement the algorithms. Moreover, keeping in mind that results depend on the particular classifier-distribution pair, for the error estimators considered in this study, bootstrap and bolstered resubstitution usually outperform cross-validation, and bolstered resubstitution usually performs as well as or better than bootstrap.
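One of the suboptimal algorithms such studies pair with different error estimators is sequential forward selection: greedily add whichever feature most reduces an estimated error. The sketch below uses leave-one-out error of a nearest-mean classifier as the estimator; the classifier, estimator, and data layout are illustrative choices, not the paper's implementation:

```python
# Sequential forward selection driven by an error estimate: at each step,
# add the feature whose inclusion minimizes leave-one-out error of a
# nearest-mean classifier. Toy sketch.
def nearest_mean_predict(train, x, feats):
    means = {}
    for label in (0, 1):
        pts = [xt for xt, yt in train if yt == label]
        means[label] = [sum(p[f] for p in pts) / len(pts) for f in feats]
    def dist(m):
        return sum((x[f] - mv) ** 2 for f, mv in zip(feats, m))
    return min(means, key=lambda lab: dist(means[lab]))

def loo_error(data, feats):
    """Leave-one-out error estimate for the given feature subset."""
    errs = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]
        errs += nearest_mean_predict(train, x, feats) != y
    return errs / len(data)

def forward_select(data, n_feats):
    chosen, remaining = [], list(range(len(data[0][0])))
    while len(chosen) < n_feats:
        f = min(remaining, key=lambda f: loo_error(data, chosen + [f]))
        chosen.append(f)
        remaining.remove(f)
    return chosen
```

Swapping `loo_error` for a bootstrap or bolstered-resubstitution estimate changes which features the greedy search picks, which is precisely the sensitivity the paper quantifies.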
EURASIP Journal on Bioinformatics and Systems Biology | 2006
Jianping Hua; Yoganand Balagurunathan; Yidong Chen; James Lowey; Michael L. Bittner; Zixiang Xiong; Edward Suh; Edward R. Dougherty
When using cDNA microarrays, normalization to correct labeling bias is a common preliminary step before further data analysis is applied, its objective being to reduce the variation between arrays. To date, assessment of the effectiveness of normalization has mainly been confined to the ability to detect differentially expressed genes. Since a major use of microarrays is the expression-based phenotype classification, it is important to evaluate microarray normalization procedures relative to classification. Using a model-based approach, we model the systemic-error process to generate synthetic gene-expression values with known ground truth. These synthetic expression values are subjected to typical normalization methods and passed through a set of classification rules, the objective being to carry out a systematic study of the effect of normalization on classification. Three normalization methods are considered: offset, linear regression, and Lowess regression. Seven classification rules are considered: 3-nearest neighbor, linear support vector machine, linear discriminant analysis, regular histogram, Gaussian kernel, perceptron, and multiple perceptron with majority voting. The results of the first three are presented in the paper, with the full results being given on a complementary website. The conclusion from the different experiment models considered in the study is that normalization can have a significant benefit for classification under difficult experimental conditions, with linear and Lowess regression slightly outperforming the offset method.
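Two of the three normalization methods named above can be sketched for a single two-channel array: the offset method centers the log-ratios at zero, and linear regression removes an intensity-dependent linear trend. (Lowess, the third, fits a local regression and is omitted here.) This is an illustrative sketch, not the study's code:

```python
# Offset normalization: subtract the mean log-ratio, forcing it to zero.
def offset_normalize(log_ratios):
    m = sum(log_ratios) / len(log_ratios)
    return [r - m for r in log_ratios]

# Linear-regression normalization: fit log-ratio ~ a + b * intensity by
# least squares, then subtract the fitted trend from each spot.
def linear_normalize(intensities, log_ratios):
    n = len(intensities)
    mi = sum(intensities) / n
    mr = sum(log_ratios) / n
    cov = sum((i - mi) * (r - mr) for i, r in zip(intensities, log_ratios))
    var = sum((i - mi) ** 2 for i in intensities)
    b = cov / var
    a = mr - b * mi
    return [r - (a + b * i) for i, r in zip(intensities, log_ratios)]
```

The offset method corrects only a global labeling bias, while the regression variant also removes bias that scales with spot intensity, which is one reason the regression-based methods edge out the offset method in the study's classification results.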
The Prostate | 2012
Yizhen Lu; Jielin Sun; Andrew Karim Kader; Seong Tae Kim; Jin Woo Kim; Wennuan Liu; Jishan Sun; Daru Lu; Junjie Feng; Yi Zhu; Tao Jin; Zheng Zhang; Latchezar Dimitrov; James Lowey; Kevin Campbell; Edward Suh; David Duggan; John D. Carpten; Jeffrey M. Trent; Henrik Grönberg; S. Lilly Zheng; William B. Isaacs; Jianfeng Xu
Genome‐wide association studies (GWAS) have identified approximately three dozen single nucleotide polymorphisms (SNPs) consistently associated with prostate cancer (PCa) risk. Despite the reproducibility of these associations, the molecular mechanism for most of these SNPs has not been well elaborated as most lie within non‐coding regions of the genome. Androgens play a key role in prostate carcinogenesis. Recently, using ChIP‐on‐chip technology, 22,447 androgen receptor (AR) binding sites have been mapped throughout the genome, greatly expanding the genomic regions potentially involved in androgen‐mediated activity.
BMC Bioinformatics | 2006
Jianping Hua; James Lowey; Zixiang Xiong; Edward R. Dougherty
BACKGROUND Overfitting the data is a salient issue for classifier design in small-sample settings. This is why selecting a classifier from a constrained family of classifiers, ones that do not possess the potential to too finely partition the feature space, is typically preferable. But overfitting is not merely a consequence of the classifier family; it is highly dependent on the classification rule used to design a classifier from the sample data. Thus, it is possible to consider families that are rather complex but for which there are classification rules that perform well for small samples. Such classification rules can be advantageous because they facilitate satisfactory classification when the class-conditional distributions are not easily separated and the sample is not large. Here we consider neural networks, both from the perspective of classical design based solely on the sample data and from that of noise-injection-based design. RESULTS This paper provides an extensive simulation-based comparative study of noise-injected neural-network design. It considers a number of different feature-label models across various small sample sizes using varying amounts of noise injection. Besides comparing noise-injected neural-network design to classical neural-network design, the paper compares it to a number of other classification rules. Our particular interest is in the use of microarray data for expression-based classification for diagnosis and prognosis. To that end, we consider noise-injected neural-network design as it relates to a study of survivability of breast cancer patients. CONCLUSION In many instances noise-injected neural-network design is superior to the other tested methods, and in almost all cases it does not perform substantially worse than the best of them. Since the amount of noise injected is consequential, the effect of differing amounts of injected noise must be considered.
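The noise-injection idea itself is simple to sketch: during training, each sample point is replaced by several copies perturbed with Gaussian noise, which acts as a regularizer when samples are few. A single perceptron stands in below for the neural networks of the study; all parameters are illustrative choices, not the paper's settings:

```python
# Noise-injected training of a perceptron: every epoch, each training point
# spawns several Gaussian-perturbed copies, and updates are made on the
# perturbed copies rather than the raw points. Illustrative sketch only.
import random

random.seed(1)

def train_perceptron(data, epochs=50, lr=0.1, noise_sd=0.3, copies=5):
    d = len(data[0][0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        for x, y in data:
            for _ in range(copies):
                xn = [v + random.gauss(0.0, noise_sd) for v in x]  # noise injection
                pred = 1 if sum(wi * vi for wi, vi in zip(w, xn)) + b > 0 else 0
                err = y - pred
                w = [wi + lr * err * vi for wi, vi in zip(w, xn)]
                b += lr * err
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * vi for wi, vi in zip(w, x)) + b > 0 else 0
```

The noise standard deviation plays the role of the "amount of noise injected" that the conclusion flags as consequential: too little gives no regularization, too much blurs the classes together.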