Publication


Featured research published by Shibiao Wan.


Journal of Theoretical Biology | 2013

GOASVM: A subcellular location predictor by incorporating term-frequency gene ontology into the general form of Chou's pseudo-amino acid composition

Shibiao Wan; Man-Wai Mak; Sun-Yuan Kung

Prediction of protein subcellular localization is an important yet challenging problem. Recently, several computational methods based on Gene Ontology (GO) have been proposed to tackle this problem and have demonstrated superiority over methods based on other features. Existing GO-based methods, however, do not fully use the GO information. This paper proposes an efficient GO method called GOASVM that exploits the information from the GO term frequencies and distant homologs to represent a protein in the general form of Chou's pseudo-amino acid composition. The method first selects a subset of relevant GO terms to form a GO vector space. Then, for each protein, the method uses the accession number (AC) of the protein or the ACs of its homologs to find the number of occurrences of the selected GO terms in the Gene Ontology annotation (GOA) database as a means to construct GO vectors for support vector machine (SVM) classification. With the advantages of GO term frequencies and a new strategy to incorporate useful homologous information, GOASVM can achieve a prediction accuracy of 72.2% on a new independent test set comprising novel proteins that were added to Swiss-Prot six years later than the creation date of the training set. GOASVM and Supplementary materials are available online at http://bioinfo.eie.polyu.edu.hk/mGoaSvmServer/GOASVM.html.
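The term-frequency construction described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_go_vector`, the toy accession numbers, and the dictionary standing in for the GOA database are all assumptions made for the example.

```python
from collections import Counter

def build_go_vector(accessions, goa_db, go_terms):
    """Count how often each selected GO term is annotated to the given accessions.

    accessions: the query protein's AC and/or the ACs of its homologs.
    goa_db:     hypothetical mapping from AC to a list of annotated GO terms.
    go_terms:   the selected GO terms that span the GO vector space.
    """
    counts = Counter()
    for ac in accessions:
        # Accessions missing from the database contribute nothing.
        counts.update(goa_db.get(ac, []))
    # One dimension per selected GO term; unseen terms get frequency 0.
    return [counts[term] for term in go_terms]

# Toy stand-in for the GOA database (illustrative values only).
goa_db = {
    "P00001": ["GO:0005634", "GO:0005737", "GO:0005634"],
    "P00002": ["GO:0005739"],
}
go_terms = ["GO:0005634", "GO:0005737", "GO:0005739"]

vector = build_go_vector(["P00001", "P00002"], goa_db, go_terms)
print(vector)  # [2, 1, 1]
```

The resulting vector would then be passed to an SVM classifier; that step is omitted here.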


BMC Bioinformatics | 2012

mGOASVM: Multi-label protein subcellular localization based on gene ontology and support vector machines

Shibiao Wan; Man-Wai Mak; Sun-Yuan Kung

Background: Although many computational methods have been developed to predict protein subcellular localization, most of the methods are limited to the prediction of single-location proteins. Multi-location proteins are either not considered or assumed not to exist. However, proteins with multiple locations are particularly interesting because they may have special biological functions, which are essential to both basic research and drug discovery. Results: This paper proposes an efficient multi-label predictor, namely mGOASVM, for predicting the subcellular localization of multi-location proteins. Given a protein, the accession numbers of its homologs are obtained via BLAST search. Then, the original accession number and the homologous accession numbers of the protein are used as keys to search against the Gene Ontology (GO) annotation database to obtain a set of GO terms. Given a set of training proteins, a set of T relevant GO terms is obtained by finding all of the GO terms in the GO annotation database that are relevant to the training proteins. These relevant GO terms then form the basis of a T-dimensional Euclidean space on which the GO vectors lie. A support vector machine (SVM) classifier with a new decision scheme is proposed to classify the multi-label GO vectors. The mGOASVM predictor has the following advantages: (1) it uses the frequency of occurrences of GO terms for feature representation; (2) it selects the relevant GO subspace, which can substantially speed up the prediction without compromising performance; and (3) it adopts an efficient multi-label SVM classifier which significantly outperforms other predictors. Briefly, on two recently published virus and plant datasets, mGOASVM achieves actual accuracies of 88.9% and 87.4%, respectively, which are significantly higher than those achieved by state-of-the-art predictors such as iLoc-Virus (74.8%) and iLoc-Plant (68.1%). Conclusions: mGOASVM can efficiently predict the subcellular locations of multi-label proteins. The mGOASVM predictor is available online at http://bioinfo.eie.polyu.edu.hk/mGoaSvmServer/mGOASVM.html.


Journal of Theoretical Biology | 2014

R3P-Loc: A compact multi-label predictor using ridge regression and random projection for protein subcellular localization

Shibiao Wan; Man-Wai Mak; Sun-Yuan Kung

Locating proteins within cellular contexts is of paramount significance in elucidating their biological functions. Computational methods based on knowledge databases (such as the gene ontology annotation (GOA) database) are known to be more efficient than sequence-based methods. However, knowledge-based methods typically face three issues: (1) knowledge databases are enormous and growing exponentially, (2) knowledge databases contain redundant information, and (3) the number of features extracted from knowledge databases is much larger than the number of data samples with ground-truth labels. These properties make the extracted features liable to contain redundant or irrelevant information, causing the prediction systems to suffer from overfitting. To address these problems, this paper proposes an efficient multi-label predictor, namely R3P-Loc, which uses two compact databases for feature extraction and applies random projection (RP) to reduce the feature dimensions of an ensemble ridge regression (RR) classifier. Two new compact databases are created from the Swiss-Prot and GOA databases. These databases possess almost the same amount of information as their full-size counterparts but are much smaller. Experimental results on two recent datasets (eukaryote and plant) suggest that R3P-Loc can reduce the dimensions sevenfold and significantly outperforms state-of-the-art predictors. This paper also demonstrates that the compact databases reduce the memory consumption by a factor of 39 without causing degradation in prediction accuracy. For readers' convenience, the R3P-Loc server is available online at http://bioinfo.eie.polyu.edu.hk/R3PLocServer/.


International Workshop on Machine Learning for Signal Processing | 2011

Protein subcellular localization prediction based on profile alignment and Gene Ontology

Shibiao Wan; Man-Wai Mak; Sun-Yuan Kung

The functions of proteins are closely related to their subcellular locations. Computational methods are required to replace the laborious and time-consuming experimental processes for proteomics research. This paper proposes combining homology-based profile alignment methods and functional-domain based Gene Ontology (GO) methods to predict the subcellular locations of proteins. The feature vectors constructed by these two methods are recognized by support vector machine (SVM) classifiers, and their scores are fused to enhance classification performance. The paper also investigates different approaches to constructing the GO vectors based on the GO terms returned from InterProScan. The results demonstrate that the GO methods are comparable to profile-alignment methods and overshadow those based on amino-acid compositions. Also, the fusion of these two methods can outperform the individual methods.
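Score fusion of the kind mentioned in this abstract can be sketched as a weighted sum of the per-class scores from the two classifiers. The abstract does not state the fusion rule, so the linear combination, the `weight` parameter, and the toy scores below are assumptions for illustration only.

```python
def fuse_scores(go_scores, profile_scores, weight=0.5):
    """Linearly combine per-class scores from two classifiers.

    weight is the (assumed) share given to the GO-based scores; the
    profile-alignment scores receive 1 - weight.
    """
    if len(go_scores) != len(profile_scores):
        raise ValueError("score lists must have the same length")
    return [weight * g + (1.0 - weight) * p
            for g, p in zip(go_scores, profile_scores)]

def predict_class(fused):
    """Return the index of the class with the highest fused score."""
    return max(range(len(fused)), key=fused.__getitem__)

# Toy SVM scores for three subcellular locations (illustrative values).
go_scores = [0.2, 1.0, -0.4]
profile_scores = [0.8, 0.5, -1.0]

fused = fuse_scores(go_scores, profile_scores, weight=0.5)
print(predict_class(fused))  # 1
```

With `weight=0.0` the decision falls back to the profile-alignment scores alone, which in this toy case selects class 0 instead.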


International Conference on Acoustics, Speech, and Signal Processing | 2013

Adaptive thresholding for multi-label SVM classification with application to protein subcellular localization prediction

Shibiao Wan; Man-Wai Mak; Sun-Yuan Kung

Multi-label classification has received increasing attention in computational proteomics, especially in protein subcellular localization. Many existing multi-label protein predictors suffer from over-prediction because they use a fixed decision threshold to determine the number of labels to which a query protein should be assigned. To address this problem, this paper proposes an adaptive thresholding scheme for multi-label support vector machine (SVM) classifiers. Specifically, each one-vs-rest SVM has an adaptive threshold that is a fraction of the maximum score of the one-vs-rest SVMs in the classifier. Therefore, the number of class labels of the query protein depends on the confidence of the SVMs in the classification. This scheme is integrated into our recently proposed subcellular localization predictor that uses the frequency of occurrences of gene-ontology terms as feature vectors and one-vs-rest SVMs as classifiers. Experimental results on two recent datasets suggest that the scheme can effectively avoid both over-prediction and under-prediction, resulting in performance significantly better than other gene-ontology based subcellular localization predictors.
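The adaptive threshold described above, a fraction of the maximum one-vs-rest score, can be sketched as follows. The function name, the default fraction `theta`, and in particular the fallback used when no score is positive are assumptions; the abstract does not spell out those details.

```python
def adaptive_labels(scores, theta=0.5):
    """Select labels whose one-vs-rest score reaches theta times the maximum score.

    scores: one SVM score per class for a single query protein.
    theta:  fraction in [0, 1]; larger values yield fewer labels.
    """
    best = max(range(len(scores)), key=scores.__getitem__)
    s_max = scores[best]
    if s_max <= 0:
        # Assumed fallback: no SVM is confident, so keep only the top-scoring label.
        return [best]
    threshold = theta * s_max
    return [m for m, s in enumerate(scores) if s >= threshold]

print(adaptive_labels([2.0, 1.2, 0.3, -0.5], theta=0.5))  # [0, 1]
print(adaptive_labels([-0.2, -1.0, -0.7]))                # [0]
```

Because the threshold scales with the most confident SVM, a weakly positive score is kept only when the strongest score is also modest, which is how the scheme limits over-prediction relative to a fixed threshold at zero.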


Journal of Theoretical Biology | 2016

Mem-ADSVM: a two-layer multi-label predictor for identifying multi-functional types of membrane proteins

Shibiao Wan; Man-Wai Mak; Sun-Yuan Kung

Identifying membrane proteins and their multi-functional types is an indispensable yet challenging topic in proteomics and bioinformatics. However, most of the existing membrane-protein predictors have the following problems: (1) they do not predict whether a given protein is a membrane protein or not; (2) they are limited to predicting membrane proteins with single-label functional types but ignore those with multi-functional types; and (3) there is still much room for improvement in their performance. To address these problems, this paper proposes a two-layer multi-label predictor, namely Mem-ADSVM, which can identify membrane proteins (Layer I) and their multi-functional types (Layer II). Specifically, given a query protein, its associated gene ontology (GO) information is retrieved by searching a compact GO-term database with its homologous accession number. Subsequently, the GO information is classified by a binary support vector machine (SVM) classifier to determine whether the query is a membrane protein or not. If it is, it will be further classified by a multi-label multi-class SVM classifier equipped with an adaptive-decision (AD) scheme to determine to which functional type(s) it belongs. Experimental results show that Mem-ADSVM significantly outperforms state-of-the-art predictors in terms of identifying both membrane proteins and their multi-functional types. This paper also suggests that the two-layer prediction architecture yields better prediction performance than a one-layer architecture. For readers' convenience, the Mem-ADSVM server is available online at http://bioinfo.eie.polyu.edu.hk/MemADSVMServer/.
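The two-layer flow in this abstract (a binary membrane test followed by multi-label type assignment) can be sketched as a simple cascade. Everything concrete here is hypothetical: the scorer functions are toy stand-ins for trained SVMs, the type names are placeholders, and the adaptive decision is approximated by a fraction-of-maximum rule that the abstract does not confirm.

```python
def two_layer_predict(go_vector, membrane_score, type_scores, theta=0.5):
    """Return None for non-membrane proteins, else a list of functional types."""
    # Layer I: binary decision on whether the query is a membrane protein.
    if membrane_score(go_vector) <= 0:
        return None
    # Layer II: multi-label decision over functional types.
    scores = type_scores(go_vector)
    s_max = max(scores.values())
    if s_max <= 0:
        # Assumed fallback: keep only the best-scoring type.
        return [max(scores, key=scores.get)]
    return sorted(t for t, s in scores.items() if s >= theta * s_max)

# Toy stand-ins for trained classifiers (not real models).
def toy_membrane_score(v):
    return sum(v) - 1.0

def toy_type_scores(v):
    return {"type_A": 1.0 * v[0], "type_B": 0.8 * v[1], "type_C": -0.5}

print(two_layer_predict([2, 2], toy_membrane_score, toy_type_scores))
# ['type_A', 'type_B']
print(two_layer_predict([0, 0], toy_membrane_score, toy_type_scores))
# None
```

Returning `None` from Layer I keeps "not a membrane protein" distinct from any Layer II outcome, since Layer II always assigns at least one type in this sketch.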


International Conference on Acoustics, Speech, and Signal Processing | 2012

GOASVM: Protein subcellular localization prediction based on Gene ontology annotation and SVM

Shibiao Wan; Man-Wai Mak; Sun-Yuan Kung

Predicting protein subcellular localization is an essential step in annotating proteins and designing drugs. This paper proposes a functional-domain based method, GOASVM, that makes full use of the Gene Ontology Annotation (GOA) database to predict the subcellular locations of proteins. GOASVM uses the accession number (AC) of a query protein and the accession numbers (ACs) of homologous proteins returned from PSI-BLAST as the query strings to search against the GOA database. The occurrences of a set of predefined GO terms are used to construct the GO vectors for classification by support vector machines (SVMs). The paper investigates two different approaches to constructing the GO vectors. Experimental results suggest that using the ACs of homologous proteins as the query strings can achieve an accuracy of 94.68%, which is significantly higher than all published results based on the same dataset. As a user-friendly web server, GOASVM is freely accessible to the public at http://bioinfo.eie.polyu.edu.hk/mGoaSvmServer/GOASVM.html.


IEEE/ACM Transactions on Computational Biology and Bioinformatics | 2016

Mem-mEN: Predicting Multi-Functional Types of Membrane Proteins by Interpretable Elastic Nets

Shibiao Wan; Man-Wai Mak; Sun-Yuan Kung

Membrane proteins play important roles in various biological processes within organisms. Predicting the functional types of membrane proteins is indispensable to the characterization of membrane proteins. Recent studies have extended to predicting single- and multi-type membrane proteins. However, existing predictors perform poorly and, more importantly, they often lack interpretability. To address these problems, this paper proposes an efficient predictor, namely Mem-mEN, which can produce sparse and interpretable solutions for predicting membrane proteins with single- and multi-label functional types. Given a query membrane protein, its associated gene ontology (GO) information is retrieved by searching a compact GO-term database with its homologous accession number, and this information is subsequently classified by a multi-label elastic net (EN) classifier. Experimental results show that Mem-mEN significantly outperforms existing state-of-the-art membrane-protein predictors. Moreover, by using Mem-mEN, 338 out of more than 7,900 GO terms are found to play more essential roles in determining the functional types. Based on these 338 essential GO terms, Mem-mEN can not only predict the functional type of a membrane protein, but also explain why it belongs to that type. For readers' convenience, the Mem-mEN server is available online at http://bioinfo.eie.polyu.edu.hk/MemmENServer/.


BMC Bioinformatics | 2016

Sparse regressions for predicting and interpreting subcellular localization of multi-label proteins

Shibiao Wan; Man-Wai Mak; Sun-Yuan Kung

Background: Predicting protein subcellular localization is indispensable for inferring protein functions. Recent studies have been focusing on predicting not only single-location proteins, but also multi-location proteins. Almost all of the high-performing predictors proposed recently use gene ontology (GO) terms to construct feature vectors for classification. Despite their high performance, their prediction decisions are difficult to interpret because of the large number of GO terms involved. Results: This paper proposes using sparse regressions to exploit GO information for both predicting and interpreting subcellular localization of single- and multi-location proteins. Specifically, we compared two multi-label sparse regression algorithms, namely multi-label LASSO (mLASSO) and multi-label elastic net (mEN), for large-scale predictions of protein subcellular localization. Both algorithms can yield sparse and interpretable solutions. By using the one-vs-rest strategy, mLASSO and mEN identified 87 and 429 out of more than 8,000 GO terms, respectively, which play essential roles in determining subcellular localization. More interestingly, many of the GO terms selected by mEN are from the biological process and molecular function categories, suggesting that the GO terms of these categories also play vital roles in the prediction. With these essential GO terms, one can decide not only where a protein is located but also why it resides there. Conclusions: Experimental results show that the outputs of both mEN and mLASSO are interpretable and that both perform significantly better than existing state-of-the-art predictors. Moreover, mEN selects more features and performs better than mLASSO on a stringent human benchmark dataset. For readers' convenience, an online server called SpaPredictor for both mLASSO and mEN is available at http://bioinfo.eie.polyu.edu.hk/SpaPredictorServer/.


International Conference on Acoustics, Speech, and Signal Processing | 2014

Ensemble random projection for multi-label classification with application to protein subcellular localization

Shibiao Wan; Man-Wai Mak; Bai Zhang; Yue Wang; Sun-Yuan Kung

The curse of dimensionality severely restricts the predictive power of multi-label classification systems. High-dimensional feature vectors may contain redundant or irrelevant information, causing the classification systems to suffer from overfitting. To address this problem, this paper proposes a dimensionality-reduction method that applies random projection (RP) to construct an ensemble of multi-label classifiers. The merits of the proposed method are demonstrated through a multi-label protein classification task. Specifically, high-dimensional feature vectors are extracted from protein sequences using the gene ontology (GO) and Swiss-Prot databases. The feature vectors are then projected onto lower-dimensional spaces by random projection matrices whose elements conform to a distribution with zero mean and unit variance. The transformed low-dimensional vectors are classified by an ensemble of one-vs-rest multi-label support vector machine (SVM) classifiers, each corresponding to one of the RP matrices. The scores obtained from the ensemble are then fused for predicting the subcellular localization of proteins. Experimental results suggest that the proposed method can reduce the dimensions sevenfold and substantially improve the classification performance.
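The projection-and-fusion pipeline in this abstract can be sketched in a few lines. Several choices below are assumptions rather than details from the paper: the entries are drawn from a standard Gaussian (the abstract only requires zero mean and unit variance), the projection is scaled by one over the square root of the reduced dimension (a conventional normalisation), the ensemble scores are fused by averaging, and the "classifiers" are toy functions standing in for trained one-vs-rest SVMs.

```python
import random

def random_projection_matrix(low_dim, high_dim, rng):
    """Draw a low_dim x high_dim matrix with zero-mean, unit-variance entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(high_dim)]
            for _ in range(low_dim)]

def project(matrix, vector):
    """Map a high-dimensional vector to the lower-dimensional space."""
    scale = 1.0 / (len(matrix) ** 0.5)  # assumed normalisation
    return [scale * sum(r * x for r, x in zip(row, vector)) for row in matrix]

def ensemble_scores(vector, matrices, classifiers):
    """Average the per-class scores of one classifier per projection matrix."""
    all_scores = [clf(project(m, vector))
                  for m, clf in zip(matrices, classifiers)]
    n = len(all_scores)
    return [sum(col) / n for col in zip(*all_scores)]

rng = random.Random(0)  # seeded so the sketch is reproducible
high_dim, low_dim, n_members = 20, 5, 3
matrices = [random_projection_matrix(low_dim, high_dim, rng)
            for _ in range(n_members)]
# Stand-in "classifiers": each maps a projected vector to two class scores.
classifiers = [lambda z: [sum(z), -sum(z)] for _ in range(n_members)]

x = [1.0] * high_dim
fused = ensemble_scores(x, matrices, classifiers)
print(len(project(matrices[0], x)), len(fused))  # 5 2
```

Each ensemble member sees a different random view of the same feature vector, so the averaged scores are less sensitive to any single projection.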

Collaboration


Dive into Shibiao Wan's collaborations.

Top Co-Authors

Man-Wai Mak

Hong Kong Polytechnic University

Bai Zhang

Johns Hopkins University
