Xuequn Shang
Northwestern Polytechnical University
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Xuequn Shang.
International Journal of Data Mining and Bioinformatics | 2017
Jiajie Peng; Hansheng Xue; Yukai Shao; Xuequn Shang; Yadong Wang; Jin Chen
It is critical yet remains to be challenging to make precise disease diagnosis from complex clinical features and highly heterogeneous genetic background. Recently, phenotype similarity has been effectively applied to model patient phenotype data. However, the existing measurements are revised based on the Gene Ontology-based term similarity models, which are not optimised for human phenotype ontologies. We propose a new similarity measure called PhenoSim. Our model includes a noise reduction component to model the noisy patient phenotype data, and a path-constrained Information Content-based method for phenotype semantics similarity measurement. Evaluation tests compared PhenoSim with four existing approaches. It showed that PhenoSim, could effectively improve the performance of HPO-based phenotype similarity measurement, thus increasing the accuracy of phenotype-based causative gene prediction and disease prediction.
BMC Genomics | 2017
Jiajie Peng; Kun Bai; Xuequn Shang; Guohua Wang; Hansheng Xue; Shuilin Jin; Liang Cheng; Yadong Wang; Jin Chen
BackgroundIdentifying the genes associated to human diseases is crucial for disease diagnosis and drug design. Computational approaches, esp. the network-based approaches, have been recently developed to identify disease-related genes effectively from the existing biomedical networks. Meanwhile, the advance in biotechnology enables researchers to produce multi-omics data, enriching our understanding on human diseases, and revealing the complex relationships between genes and diseases. However, none of the existing computational approaches is able to integrate the huge amount of omics data into a weighted integrated network and utilize it to enhance disease related gene discovery.ResultsWe propose a new network-based disease gene prediction method called SLN-SRW (Simplified Laplacian Normalization-Supervised Random Walk) to generate and model the edge weights of a new biomedical network that integrates biomedical data from heterogeneous sources, thus far enhancing the disease related gene discovery.ConclusionsThe experiment results show that SLN-SRW significantly improves the performance of disease gene prediction on both the real and the synthetic data sets.
BMC Medical Genomics | 2015
Bolin Chen; Min Li; Jianxin Wang; Xuequn Shang; Fang-Xiang Wu
BackgroundIntegrating multiple data sources is indispensable in improving disease gene identification. It is not only due to the fact that disease genes associated with similar genetic diseases tend to lie close with each other in various biological networks, but also due to the fact that gene-disease associations are complex. Although various algorithms have been proposed to identify disease genes, their prediction performances and the computational time still should be further improved.ResultsIn this study, we propose a fast and high performance multiple data integration algorithm for identifying human disease genes. A posterior probability of each candidate gene associated with individual diseases is calculated by using a Bayesian analysis method and a binary logistic regression model. Two prior probability estimation strategies and two feature vector construction methods are developed to test the performance of the proposed algorithm.ConclusionsThe proposed algorithm is not only generated predictions with high AUC scores, but also runs very fast. When only a single PPI network is employed, the AUC score is 0.769 by using F2 as feature vectors. The average running time for each leave-one-out experiment is only around 1.5 seconds. When three biological networks are integrated, the AUC score using F3 as feature vectors increases to 0.830, and the average running time for each leave-one-out experiment takes only about 12.54 seconds. It is better than many existing algorithms.
BMC Bioinformatics | 2017
Jiajie Peng; Honggang Wang; Junya Lu; Weiwei Hui; Yadong Wang; Xuequn Shang
BackgroundThe Gene Ontology (GO) is a community-based bioinformatics resource that employs ontologies to represent biological knowledge and describes information about gene and gene product function. GO includes three independent categories: molecular function, biological process and cellular component. For better biological reasoning, identifying the biological relationships between terms in different categories are important.However, the existing measurements to calculate similarity between terms in different categories are either developed by using the GO data only or only take part of combined gene co-function network information.ResultsWe propose an iterative ranking-based method called CroGO2 to measure the cross-categories GO term similarities by incorporating level information of GO terms with both direct and indirect interactions in the gene co-function network.ConclusionsThe evaluation test shows that CroGO2 performs better than the existing methods. A genome-specific term association network for yeast is also generated by connecting terms with the high confidence score. The linkages in the term association network could be supported by the literature. Given a gene set, the related terms identified by using the association network have overlap with the related terms identified by GO enrichment analysis.
Methods | 2017
Jiajie Peng; Junya Lu; Xuequn Shang; Jin Chen
It is critical to identify disease-specific subnetworks from the vastly available genome-wide gene expression data for elucidating how genes perform high-level biological functions together. Various algorithms have been developed for disease gene identification. However, the topological structure of the disease networks (or even the fraction of the networks) has been left largely unexplored. In this article, we present DNet, a method for the identification of significant disease subnetworks by integrating both the network structure and gene expression information. Our work will lead to the identification of missing key disease genes, which are be highly expressed in a disease-specific gene expression dataset. The experimental evaluation of our method on both the Leukemia and the Duchenne Muscular Dystrophy gene expression datasets show that DNet performs better than the existing state-of-the-art methods. In addition, literature supports were found for the discovered disease subnetworks in a case study.
IEEE Transactions on Nanobioscience | 2016
Bolin Chen; Xuequn Shang; Min Li; Jianxin Wang; Fang-Xiang Wu
The identification of individual-cancer-related genes typically is an imbalanced classification issue. The number of known cancer-related genes is far less than the number of all unknown genes, which makes it very hard to detect novel predictions from such imbalanced training samples. A regular machine learning method can either only detect genes related to all cancers or add clinical knowledge to circumvent this issue. In this study, we introduce a training sample rebalancing strategy to overcome this issue by using a two-step logistic regression and a random resampling method. The two-step logistic regression is to select a set of genes that related to all cancers. While the random resampling method is performed to further classify those genes associated with individual cancers. The issue of imbalanced classification is circumvented by randomly adding positive instances related to other cancers at first, and then excluding those unrelated predictions according to the overall performance at the following step. Numerical experiments show that the proposed resampling method is able to identify cancer-related genes even when the number of known genes related to it is small. The final predictions for all individual cancers achieve AUC values around 0.93 by using the leave-one-out cross validation method, which is very promising, compared with existing methods.
BMC Systems Biology | 2018
Jiajie Peng; Xuanshuo Zhang; Weiwei Hui; Junya Lu; Qianqian Li; Shuhui Liu; Xuequn Shang
BackgroundGene Ontology (GO) is one of the most popular bioinformatics resources. In the past decade, Gene Ontology-based gene semantic similarity has been effectively used to model gene-to-gene interactions in multiple research areas. However, most existing semantic similarity approaches rely only on GO annotations and structure, or incorporate only local interactions in the co-functional network. This may lead to inaccurate GO-based similarity resulting from the incomplete GO topology structure and gene annotations.ResultsWe present NETSIM2, a new network-based method that allows researchers to measure GO-based gene functional similarities by considering the global structure of the co-functional network with a random walk with restart (RWR)-based method, and by selecting the significant term pairs to decrease the noise information. Based on the EC number (Enzyme Commission)-based groups of yeast and Arabidopsis, evaluation test shows that NETSIM2 can enhance the accuracy of Gene Ontology-based gene functional similarity.ConclusionsUsing NETSIM2 as an example, we found that the accuracy of semantic similarities can be significantly improved after effectively incorporating the global gene-to-gene interactions in the co-functional network, especially on the species that gene annotations in GO are far from complete.
bioinformatics and biomedicine | 2015
Bolin Chen; Xuequn Shang; Min Li; Jianxin Wang; Fang-Xiang Wu
The identification of cancer-related genes is important towards the understanding of complex genetic diseases. Although many machine learning algorithms are proposed to identify disease-related genes, they often either have poor performance to identify locus heterogeneity cancer-related genes or are not applicable to predict individual-disease-related genes due to the lack of positive instances (imbalanced classification). To overcome these two issues, a two-step logistic regression (LR) based algorithm is proposed in this study for identifying individual-cancer-related genes. A set of high potential cancer-class-related genes is first generated in step 1, followed by a second round of LR-based algorithm conducted on this smaller dataset for identifying individual-cancer-related genes. Numerical experiments show that the proposed two-step LR-based algorithm not only works well for locus heterogeneity data, but also has good performance to handle the imbalanced classification problem. The individual-cancer-related gene identification experiments achieve AUC values of around 0.85 when the threshold of posterior probability is chosen between 0.3 and 0.6. All evaluations are conducted by using the leave-one-out cross validation method.
BMC Bioinformatics | 2018
Yang Guo; Shuhui Liu; Zhanhuai Li; Xuequn Shang
BackgroundThe classification of cancer subtypes is of great importance to cancer disease diagnosis and therapy. Many supervised learning approaches have been applied to cancer subtype classification in the past few years, especially of deep learning based approaches. Recently, the deep forest model has been proposed as an alternative of deep neural networks to learn hyper-representations by using cascade ensemble decision trees. It has been proved that the deep forest model has competitive or even better performance than deep neural networks in some extent. However, the standard deep forest model may face overfitting and ensemble diversity challenges when dealing with small sample size and high-dimensional biology data.ResultsIn this paper, we propose a deep learning model, so-called BCDForest, to address cancer subtype classification on small-scale biology datasets, which can be viewed as a modification of the standard deep forest model. The BCDForest distinguishes from the standard deep forest model with the following two main contributions: First, a named multi-class-grained scanning method is proposed to train multiple binary classifiers to encourage diversity of ensemble. Meanwhile, the fitting quality of each classifier is considered in representation learning. Second, we propose a boosting strategy to emphasize more important features in cascade forests, thus to propagate the benefits of discriminative features among cascade layers to improve the classification performance. Systematic comparison experiments on both microarray and RNA-Seq gene expression datasets demonstrate that our method consistently outperforms the state-of-the-art methods in application of cancer subtype classification.ConclusionsThe multi-class-grained scanning and boosting strategy in our model provide an effective solution to ease the overfitting challenge and improve the robustness of deep forest model working on small-scale data. Our model provides a useful approach to the classification of cancer subtypes by using deep learning on high-dimensional and small-scale biology data.
bioinformatics and biomedicine | 2016
Jiajie Peng; Hansheng Xue; Yukai Shao; Xuequn Shang; Yadong Wang; Jin Chen
It is critical yet remains to be challenging to make right disease diagnosis based on complex clinical characteristic and heterogeneous genetic background. Recently, Human Phenotype Ontology (HPO)-based phenotype similarity has been widely used to aid disease diagnosis. However, the existing measurements are revised based on the Gene Ontology-based term similarity models, which are not optimized for human phenotype ontologies. We propose a new similarity measure called PhenoSim. Our model includes a noise reduction component to model the noisy patient phenotype data, and a path-constrained Information Content-based method for measuring phenotype semantics similarity. Evaluation tests showed that PhenoSim could improve the performance of HPO-based phenotype similarity measurement.