Guoqiang Yu
Virginia Tech
Publication
Featured research published by Guoqiang Yu.
Nature Medicine | 2009
Wennuan Liu; Sari Laitinen; Sofia Khan; Mauno Vihinen; Jeanne Kowalski; Guoqiang Yu; Li Chen; Charles M. Ewing; Mario A. Eisenberger; Michael A. Carducci; William G. Nelson; Srinivasan Yegnasubramanian; Jun Luo; Yue Wang; Jianfeng Xu; William B. Isaacs; Tapio Visakorpi; G. Steven Bova
Many studies have shown that primary prostate cancers are multifocal and are composed of multiple genetically distinct cancer cell clones. Whether or not multiclonal primary prostate cancers typically give rise to multiclonal or monoclonal prostate cancer metastases is largely unknown, although studies at single chromosomal loci are consistent with the latter case. Here we show through a high-resolution genome-wide single nucleotide polymorphism and copy number survey that most, if not all, metastatic prostate cancers have monoclonal origins and maintain a unique signature copy number pattern of the parent cancer cell while also accumulating a variable number of separate subclonally sustained changes. We find no relationship between anatomic site of metastasis and genomic copy number change pattern. Taken together with past animal and cytogenetic studies of metastasis and recent single-locus genetic data in prostate and other metastatic cancers, these data indicate that despite common genomic heterogeneity in primary cancers, most metastatic cancers arise from a single precursor cancer cell. This study establishes that genomic archeology of multiple anatomically separate metastatic cancers in individuals can be used to define the salient genomic features of a parent cancer clone of proven lethal metastatic phenotype.
Bioinformatics | 2009
David J. Miller; Yanxin Zhang; Guoqiang Yu; Yongmei Liu; Li Chen; Carl D. Langefeld; David M. Herrington; Yue Joseph Wang
MOTIVATION: In both genome-wide association studies (GWAS) and pathway analysis, the modest sample size relative to the number of genetic markers presents formidable computational, statistical and methodological challenges for accurately identifying markers/interactions and for building phenotype-predictive models. RESULTS: We address these objectives via maximum entropy conditional probability modeling (MECPM), coupled with a novel model structure search. Unlike neural networks and support vector machines (SVMs), MECPM makes explicit and is determined by the interactions that confer phenotype-predictive power. Our method identifies both a marker subset and the multiple k-way interactions between these markers. Additional key aspects are: (i) evaluation of a select subset of up to five-way interactions while retaining relatively low complexity; (ii) flexible single nucleotide polymorphism (SNP) coding (dominant, recessive) within each interaction; (iii) no mathematical interaction form assumed; (iv) model structure and order selection based on the Bayesian Information Criterion, which fairly compares interactions at different orders and automatically sets the experiment-wide significance level; (v) MECPM directly yields a phenotype-predictive model. MECPM was compared with a panel of methods on datasets with up to 1000 SNPs and up to eight embedded penetrance function (i.e. ground-truth) interactions, including a five-way, involving fewer than 20 SNPs. MECPM achieved improved sensitivity and specificity for detecting both ground-truth markers and interactions, compared with previous methods. AVAILABILITY: http://www.cbil.ece.vt.edu/ResearchOngoingSNP.htm
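The abstract above names the Bayesian Information Criterion as the basis for choosing model structure and order. As a minimal sketch of how BIC trades fit against complexity (the textbook formula only, not MECPM's own structure search), the snippet below scores two hypothetical models; the model names, log-likelihoods and parameter counts are made-up illustration values.

```python
import math

def bic(log_likelihood, n_params, n_samples):
    """Bayesian Information Criterion: k * ln(n) - 2 * ln(L); lower is better."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood

# Hypothetical fitted models on the same 1000 samples (all numbers are invented).
candidates = {
    "main effects only": {"log_likelihood": -520.0, "n_params": 3},
    "plus a 3-way interaction": {"log_likelihood": -515.0, "n_params": 7},
}

scores = {
    name: bic(m["log_likelihood"], m["n_params"], 1000)
    for name, m in candidates.items()
}
best = min(scores, key=scores.get)  # the model with the lowest BIC is preferred
print(best, scores)
```

In this toy case the extra interaction terms improve the likelihood too little to pay for their added parameters, so the smaller model is selected.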
BMC Genomics | 2011
Li Chen; Guoqiang Yu; Carl D. Langefeld; David J. Miller; Richard T. Guy; Jayaram Raghuram; Xiguo Yuan; David M. Herrington; Yue Wang
Background: Interactions among genetic loci are believed to play an important role in disease risk. While many methods have been proposed for detecting such interactions, their relative performance remains largely unclear, mainly because different data sources, detection performance criteria, and experimental protocols were used in the papers introducing these methods and in subsequent studies. Moreover, there have been very few studies strictly focused on comparison of existing methods. Given the importance of detecting gene-gene and gene-environment interactions, a rigorous, comprehensive comparison of performance and limitations of available interaction detection methods is warranted. Results: We report a comparison of eight representative methods, of which seven were specifically designed to detect interactions among single nucleotide polymorphisms (SNPs), with the last a popular main-effect testing method used as a baseline for performance evaluation. The selected methods, multifactor dimensionality reduction (MDR), full interaction model (FIM), information gain (IG), Bayesian epistasis association mapping (BEAM), SNP harvester (SH), maximum entropy conditional probability modeling (MECPM), logistic regression with an interaction term (LRIT), and logistic regression (LR) were compared on a large number of simulated data sets, each, consistent with complex disease models, embedding multiple sets of interacting SNPs, under different interaction models. The assessment criteria included several relevant detection power measures, family-wise type I error rate, and computational complexity. There are several important results from this study. First, while some SNPs in interactions with strong effects are successfully detected, most of the methods miss many interacting SNPs at an acceptable rate of false positives. In this study, the best-performing method was MECPM.
Second, the statistical significance assessment criteria, used by some of the methods to control the type I error rate, are quite conservative, thereby limiting their power and making it difficult to fairly compare them. Third, as expected, power varies for different models and as a function of penetrance, minor allele frequency, linkage disequilibrium and marginal effects. Fourth, the analytical relationships between power and these factors are derived, aiding in the interpretation of the study results. Fifth, for these methods the magnitude of the main effect influences the power of the tests. Sixth, most methods can detect some ground-truth SNPs but have modest power to detect the whole set of interacting SNPs. Conclusion: This comparison study provides new insights into the strengths and limitations of current methods for detecting interacting loci. This study, along with freely available simulation tools we provide, should help support development of improved methods. The simulation tools are available at: http://code.google.com/p/simulation-tool-bmc-ms9169818735220977/downloads/list.
Methods | 2014
Yiming Zuo; Guoqiang Yu; Mahlet G. Tadesse; Habtom W. Ressom
Biological network inference is a major challenge in systems biology. Traditional correlation-based network analysis results in too many spurious edges since correlation cannot distinguish between direct and indirect associations. To address this issue, Gaussian graphical models (GGM) were proposed and have been widely used. Though they can significantly reduce the number of spurious edges, GGM are insufficient to uncover a network structure faithfully because they only consider the full order partial correlation. Moreover, when the number of samples is smaller than the number of variables, a further technique based on sparse regularization needs to be incorporated into GGM to solve the singular covariance inversion problem. In this paper, we propose an efficient and mathematically solid algorithm that infers biological networks by computing low order partial correlation (LOPC) up to the second order. The bias introduced by the low order constraint is minimal compared to the more reliable approximation of the network structure achieved. In addition, the algorithm is suitable for a dataset with small sample size but large number of variables. Simulation results show that LOPC yields far fewer spurious edges and works well under various conditions commonly seen in practice. The application to a real metabolomics dataset further validates the performance of LOPC and suggests its potential power in detecting novel biomarkers for complex diseases.
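To make the idea of low order partial correlation concrete, here is a minimal sketch of the standard recursive formula that derives first- and second-order partial correlations from a correlation matrix. It is not the authors' LOPC implementation (edge testing and thresholds are omitted), and the function name `partial_corr` and the toy matrices are assumptions made for illustration.

```python
import math

def partial_corr(r, i, j, given=()):
    """Partial correlation of variables i and j given the variables in `given`,
    computed recursively from a correlation matrix r (list of lists)."""
    if not given:
        return r[i][j]  # zero order: the plain correlation
    k, rest = given[0], tuple(given[1:])
    r_ij = partial_corr(r, i, j, rest)
    r_ik = partial_corr(r, i, k, rest)
    r_jk = partial_corr(r, j, k, rest)
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))

# Toy case: variable 2 drives both 0 and 1, so 0 and 1 are only indirectly associated.
R = [[1.00, 0.64, 0.80],
     [0.64, 1.00, 0.80],
     [0.80, 0.80, 1.00]]
zero_order = partial_corr(R, 0, 1)         # the raw correlation suggests an edge
first_order = partial_corr(R, 0, 1, (2,))  # vanishes once variable 2 is controlled for

# Second order on a toy 4-variable matrix with all pairwise correlations equal to 0.5.
R4 = [[1.0 if a == b else 0.5 for b in range(4)] for a in range(4)]
second_order = partial_corr(R4, 0, 1, (2, 3))
print(zero_order, first_order, second_order)
```

Because only conditioning sets of size one or two are used, no full covariance matrix has to be inverted, which is why this style of calculation remains usable when there are fewer samples than variables.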
BMC Genomics | 2012
Xiguo Yuan; Guoqiang Yu; Xuchu Hou; Ie Ming Shih; Robert Clarke; Junying Zhang; Eric P. Hoffman; Roger R. Wang; Zhen Zhang; Yue Wang
Background: Somatic Copy Number Alterations (CNAs) in human genomes are present in almost all human cancers. Systematic efforts to characterize such structural variants must effectively distinguish significant consensus events from random background aberrations. Here we introduce Significant Aberration in Cancer (SAIC), a new method for characterizing and assessing the statistical significance of recurrent CNA units. Three main features of SAIC include: (1) exploiting the intrinsic correlation among consecutive probes to assign a score to each CNA unit instead of single probes; (2) performing permutations on CNA units that preserve correlations inherent in the copy number data; and (3) iteratively detecting Significant Copy Number Aberrations (SCAs) and estimating an unbiased null distribution by applying an SCA-exclusive permutation scheme. Results: We test and compare the performance of SAIC against four peer methods (GISTIC, STAC, KC-SMART, CMDS) on a large number of simulation datasets. Experimental results show that SAIC outperforms peer methods in terms of larger area under the Receiver Operating Characteristics curve and increased detection power. We then apply SAIC to analyze structural genomic aberrations acquired in four real cancer genome-wide copy number data sets (ovarian cancer, metastatic prostate cancer, lung adenocarcinoma, glioblastoma). When compared with previously reported results, SAIC successfully identifies most SCAs known to be of biological significance and associated with oncogenes (e.g., KRAS, CCNE1, and MYC) or tumor suppressor genes (e.g., CDKN2A/B). Furthermore, SAIC identifies a number of novel SCAs in these copy number data that encompass tumor related genes and may warrant further studies. Conclusions: Supported by a well-grounded theoretical framework, SAIC has been developed and used to identify SCAs in various cancer copy number data sets, providing useful information to study the landscape of cancer genomes.
Open-source and platform-independent SAIC software is implemented using C++, together with R scripts for data formatting and Perl scripts for user interfacing, and it is easy to install and efficient to use. The source code and documentation are freely available at http://www.cbil.ece.vt.edu/software.htm.
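The permutation idea behind assessing recurrent aberrations can be illustrated with a generic sketch. This is not the SAIC software: it assumes the data are already summarized as one amplitude per CNA unit per sample, it shuffles units independently within each sample, and it omits SAIC's iterative SCA-exclusive scheme. All function names and the toy data are hypothetical.

```python
import random

def recurrence_scores(data):
    """Mean amplitude of each unit (column) across samples (rows)."""
    n_samples = len(data)
    return [sum(row[j] for row in data) / n_samples for j in range(len(data[0]))]

def permutation_p_values(data, n_perm=200, seed=0):
    """P-value per unit: how often the maximum score of a shuffled data set
    reaches the observed score (with the usual +1 correction)."""
    rng = random.Random(seed)
    observed = recurrence_scores(data)
    exceed = [0] * len(observed)
    for _ in range(n_perm):
        # Shuffle unit positions within each sample to break recurrence across samples.
        shuffled = [rng.sample(row, len(row)) for row in data]
        null_max = max(recurrence_scores(shuffled))
        for j, score in enumerate(observed):
            if null_max >= score:
                exceed[j] += 1
    return [(count + 1) / (n_perm + 1) for count in exceed]

# Toy data: 10 samples x 20 units; unit 5 is gained in every sample, nothing else is.
data = [[1.0 if j == 5 else 0.0 for j in range(20)] for _ in range(10)]
p_values = permutation_p_values(data)
print(p_values[5], p_values[0])
```

Comparing each observed score against the maximum null score is one common way to account for testing many units at once; the recurrent unit receives a small p-value while the unaltered units do not.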
Bioinformatics and Biomedicine | 2009
Li Chen; Guoqiang Yu; David J. Miller; Lei Song; Carl D. Langefeld; David M. Herrington; Yongmei Liu; Yue Wang
Genome-wide association studies (GWAS) have been widely applied to identify informative SNPs associated with common and complex diseases. Besides single-SNP analysis, the interaction between SNPs is believed to play an important role in disease risk due to the complex networking of genetic regulations. While many approaches have been proposed for detecting SNP interactions, the relative performance and merits of these methods in practice are largely unclear. In this paper, a ground-truth based comparative study is reported involving 9 popular SNP detection methods using realistic simulation datasets. The results provide general characteristics of, and guidelines on, these methods that may be informative to biological investigators.
Bioinformatics | 2011
Guoqiang Yu; Huai Li; Sook S. Ha; Ie Ming Shih; Robert Clarke; Eric P. Hoffman; Subha Madhavan; Jianhua Xuan; Yue Joseph Wang
Phenotypic Up-regulated Gene Support Vector Machine (PUGSVM) is a cancer Biomedical Informatics Grid (caBIG™) analytical tool for multiclass gene selection and classification. PUGSVM addresses the problem of imbalanced class separability, small sample size and high gene space dimensionality, where multiclass gene markers are defined by the union of one-versus-everyone phenotypic upregulated genes, and used by a well-matched one-versus-rest support vector machine. PUGSVM provides a simple yet more accurate strategy to identify statistically reproducible mechanistic marker genes for characterization of heterogeneous diseases. AVAILABILITY: http://www.cbil.ece.vt.edu/caBIG-PUGSVM.htm.
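One plausible reading of "one-versus-everyone phenotypic upregulated genes" is sketched below: a gene is assigned to a class when its mean expression there exceeds its mean in every other class by a fold-change threshold. The threshold `min_fold`, the function name and the toy data are assumptions for illustration, expression values are assumed non-negative, and the actual PUGSVM selection criterion may differ.

```python
def ove_upregulated_genes(expr, labels, min_fold=2.0):
    """Return {class: [gene, ...]}: a gene joins a class when its mean expression
    there is at least `min_fold` times its mean in every other class."""
    classes = sorted(set(labels))
    markers = {c: [] for c in classes}
    for gene, values in expr.items():
        means = {}
        for c in classes:
            members = [v for v, lab in zip(values, labels) if lab == c]
            means[c] = sum(members) / len(members)
        top = max(classes, key=lambda c: means[c])
        # Beating the best of the other classes means beating every other class.
        runner_up = max(means[c] for c in classes if c != top)
        if means[top] > 0 and means[top] >= min_fold * runner_up:
            markers[top].append(gene)
    return markers

# Hypothetical expression values for four genes across three phenotypes.
labels = ["A", "A", "B", "B", "C", "C"]
expr = {
    "g1": [8, 10, 1, 2, 2, 1],  # up in A only
    "g2": [1, 1, 9, 7, 2, 2],   # up in B only
    "g3": [5, 5, 4, 6, 5, 5],   # flat, so not a marker
    "g4": [1, 2, 1, 1, 6, 8],   # up in C only
}
markers = ove_upregulated_genes(expr, labels)
selected = sorted(g for genes in markers.values() for g in genes)  # union over classes
print(markers, selected)
```

The union in `selected` would then serve as the feature set for a one-versus-rest classifier, which is not shown here.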
PLOS ONE | 2012
Xiguo Yuan; Junying Zhang; Shengli Zhang; Guoqiang Yu; Yue Wang
Recurrent copy number alterations (CNAs) play an important role in cancer genesis. While a number of computational methods have been proposed for identifying such CNAs, their relative merits remain largely unknown in practice since very few efforts have been focused on comparative analysis of the methods. To facilitate studies of recurrent CNA identification in cancer genomes, it is imperative to conduct a comprehensive comparison of performance and limitations among existing methods. In this paper, six representative methods proposed in the last six years are compared. These include one-stage and two-stage approaches, working with raw intensity ratio data and discretized data respectively. They are based on various techniques such as kernel regression, correlation matrix diagonal segmentation, semi-parametric permutation and cyclic permutation schemes. We explore multiple criteria including type I error rate, detection power, Receiver Operating Characteristics (ROC) curve and the area under curve (AUC), and computational complexity, to evaluate performance of the methods under multiple simulation scenarios. We also characterize their abilities on applications to two real datasets obtained from lung adenocarcinoma and glioblastoma. This comparison study reveals general characteristics of the existing methods for identifying recurrent CNAs, and further provides new insights into their strengths and weaknesses. We believe it will help accelerate the development of novel and improved methods.
Scientific Reports | 2016
Niya Wang; Eric P. Hoffman; Lulu Chen; Li Chen; Zhen Zhang; Chunyu Liu; Guoqiang Yu; David M. Herrington; Robert Clarke; Yue Wang
Tissue heterogeneity is both a major confounding factor and an underexploited information source. While a handful of reports have demonstrated the potential of supervised computational methods to deconvolute tissue heterogeneity, these approaches require a priori information on the marker genes or composition of known subpopulations. To address the critical problem of the absence of validated marker genes for many (including novel) subpopulations, we describe convex analysis of mixtures (CAM), a fully unsupervised in silico method, for identifying subpopulation marker genes directly from the original mixed gene expressions in scatter space that can improve molecular analyses in many biological contexts. Validated with predesigned mixtures, CAM on the gene expression data from peripheral leukocytes, brain tissue, and yeast cell cycle, revealed novel marker genes that were otherwise undetectable using existing methods. Importantly, CAM requires no a priori information on the number, identity, or composition of the subpopulations present in mixed samples, and does not require the presence of pure subpopulations in sample space. This advantage is significant in that CAM can achieve all of its goals using only a small number of heterogeneous samples, and is more powerful at distinguishing between phenotypically similar subpopulations.
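The geometric intuition behind looking for marker genes in scatter space can be shown with a small sketch, assuming a linear mixing model in which each sample's expression is a proportion-weighted sum of subpopulation expressions. This illustrates the geometry only, not the CAM algorithm itself; the mixing matrix `A`, the expression values and the function names are hypothetical.

```python
def mix(A, s):
    """Observed expression of one gene in each sample: x_j = sum_k A[j][k] * s[k]."""
    return [sum(a_jk * s_k for a_jk, s_k in zip(row, s)) for row in A]

def normalize(x):
    """Scale a gene's expression vector so its entries sum to 1 (a scatter-space point)."""
    total = sum(x)
    return [v / total for v in x]

# Hypothetical mixing proportions: rows are samples, columns are subpopulations.
A = [[0.7, 0.3],
     [0.2, 0.8]]

marker_of_pop0 = normalize(mix(A, [5.0, 0.0]))  # expressed only in subpopulation 0
marker_of_pop1 = normalize(mix(A, [0.0, 3.0]))  # expressed only in subpopulation 1
shared_gene = normalize(mix(A, [2.0, 2.0]))     # expressed in both subpopulations

# Corners of the scatter simplex are the normalized columns of the mixing matrix.
corner0 = normalize([row[0] for row in A])
corner1 = normalize([row[1] for row in A])
print(marker_of_pop0, corner0, shared_gene)
```

Under this assumed model, marker genes land on the corners of the point cloud and genes shared between subpopulations fall in between, so locating the corners reveals candidate markers without prior knowledge of them.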
BMC Bioinformatics | 2017
Yiming Zuo; Yi Cui; Guoqiang Yu; Ruijiang Li; Habtom W. Ressom
Background: Conventional differential gene expression analysis by methods such as Student’s t-test, SAM, and Empirical Bayes often searches for statistically significant genes without considering the interactions among them. Network-based approaches provide a natural way to study these interactions and to investigate the rewiring interactions in disease versus control groups. In this paper, we apply weighted graphical LASSO (wgLASSO) algorithm to integrate a data-driven network model with prior biological knowledge (i.e., protein-protein interactions) for biological network inference. We propose a novel differentially weighted graphical LASSO (dwgLASSO) algorithm that builds group-specific networks and performs network-based differential gene expression analysis to select biomarker candidates by considering their topological differences between the groups. Results: Through simulation, we showed that wgLASSO can achieve better performance in building biologically relevant networks than purely data-driven models (e.g., neighbor selection, graphical LASSO), even when only a moderate level of information is available as prior biological knowledge. We evaluated the performance of dwgLASSO for survival time prediction using two microarray breast cancer datasets previously reported by Bild et al. and van de Vijver et al. Compared with the top 10 significant genes selected by a conventional differential gene expression analysis method, the top 10 significant genes selected by dwgLASSO in the dataset from Bild et al. led to a significantly improved survival time prediction in the independent dataset from van de Vijver et al. Among the 10 genes selected by dwgLASSO, UBE2S, SALL2, XBP1 and KIAA0922 have been confirmed by literature survey to be highly relevant in breast cancer biomarker discovery study. Additionally, we tested dwgLASSO on TCGA RNA-seq data acquired from patients with hepatocellular carcinoma (HCC) on tumor samples and their corresponding non-tumorous liver tissues.
Improved sensitivity, specificity and area under curve (AUC) were observed when comparing dwgLASSO with a conventional differential gene expression analysis method. Conclusions: The proposed network-based differential gene expression analysis algorithm dwgLASSO can achieve better performance than conventional differential gene expression analysis methods by integrating information at both gene expression and network topology levels. The incorporation of prior biological knowledge can lead to the identification of biologically meaningful genes in cancer biomarker studies.